Let's make onert run a llama-based model.
Previously, we ran the prefill phase with cpu + npu.
Now we would like to run llama end to end: both prefill and decode.
Decode will run on the cpu.
Prefill may run on either the npu or the cpu.
We may use meta-llama/Llama-3.2-1B-Instruct or a tiny one like Maykeye/TinyLLama-v0.
(I prefer the latter for the first phase.)
(ADD) I will start with tinyllama.
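The prefill/decode split above can be sketched as a toy generation loop: prefill processes the whole prompt in one pass (the large fixed-shape workload that suits an npu), and decode then produces one token per step on the cpu, reusing the KV cache. The model here is a deterministic stub; all names are illustrative and none of this is onert's actual API.

```python
# Toy sketch of the two-phase LLM inference loop discussed above.
# "prefill" runs the whole prompt in one forward pass; "decode" then
# generates one token at a time, feeding the last token back in and
# reusing the KV cache built during prefill.
# toy_forward is a stub standing in for a real transformer step.

from typing import List, Tuple

def toy_forward(tokens: List[int], kv_cache: List[int]) -> Tuple[int, List[int]]:
    """Stand-in for a transformer step: returns next token and updated cache."""
    new_cache = kv_cache + tokens          # real models append per-layer K/V tensors
    next_token = sum(new_cache) % 100      # deterministic placeholder for argmax(logits)
    return next_token, new_cache

def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    # Prefill phase: process the full prompt at once (npu or cpu).
    next_token, kv_cache = toy_forward(prompt, [])
    out = [next_token]
    # Decode phase: one token per step (cpu), reusing the cache.
    for _ in range(max_new_tokens - 1):
        next_token, kv_cache = toy_forward([out[-1]], kv_cache)
        out.append(next_token)
    return out

print(generate([1, 2, 3], 4))  # -> [6, 12, 24, 48]
```

The point of the split is that prefill is compute-bound over many tokens while decode is a latency-bound single-token loop, which is why they can be assigned to different backends.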
To Do

FrontEnd
- reuse existing tvn(npu) 🤔

Runtime
- Let's finish the core with the simplest choices; optimize later.

Optional
- [tool/model2nnpkg] Support sequence of circle and tvn #15638
- [tool] Introduce a ggml-weight-quantizer #15639