Let's make onert run a llama-based model.
Previously, we ran the prefill phase with cpu + npu.
Now we would like to run llama end to end: both prefill and decode.
Decode will run on the cpu.
Prefill may run on either the npu or the cpu.
We may use meta-llama/Llama-3.2-1B-Instruct or a tiny one like Maykeye/TinyLLama-v0.
(I prefer the latter for the first phase.)
(ADD) I will start with tinyllama.
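The prefill/decode split above can be sketched as a toy generation loop: prefill processes the whole prompt in one pass (the large fixed-shape workload that suits an npu), and decode then produces one token per step on the cpu, reusing the KV cache. The model here is a deterministic stub; all names are illustrative and none of this is onert's actual API.

```python
# Toy sketch of the two-phase LLM inference loop discussed above.
# "prefill" runs the whole prompt in one forward pass; "decode" then
# generates one token at a time, feeding the last token back in and
# reusing the KV cache built during prefill.
# toy_forward is a stub standing in for a real transformer step.

from typing import List, Tuple

def toy_forward(tokens: List[int], kv_cache: List[int]) -> Tuple[int, List[int]]:
    """Stand-in for a transformer step: returns next token and updated cache."""
    new_cache = kv_cache + tokens          # real models append per-layer K/V tensors
    next_token = sum(new_cache) % 100      # deterministic placeholder for argmax(logits)
    return next_token, new_cache

def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    # Prefill phase: process the full prompt at once (npu or cpu).
    next_token, kv_cache = toy_forward(prompt, [])
    out = [next_token]
    # Decode phase: one token per step (cpu), reusing the cache.
    for _ in range(max_new_tokens - 1):
        next_token, kv_cache = toy_forward([out[-1]], kv_cache)
        out.append(next_token)
    return out

print(generate([1, 2, 3], 4))  # -> [6, 12, 24, 48]
```

The point of the split is that prefill is compute-bound over many tokens while decode is a latency-bound single-token loop, which is why they can be assigned to different backends.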
To Do

FrontEnd
- reuse existing tvn(npu) 🤔

Runtime
- Let's finish the core with the simplest choices; optimize later.

Optional
- [tool/model2nnpkg] Support sequence of circle and tvn #15638
- [tool] Introduce a ggml-weight-quantizer #15639