runtime/ggma/examples/generate_text/DEVELOPER.md
# TinyLlama Text Generation Developer Guide

This document provides a detailed technical guide for generating, processing, and optimizing the TinyLlama text-generation model. For basic usage, see [USER.md](USER.md).

## Summary

1. Set up the environment and install dependencies.
2. Generate the initial `prefill` and `decode` Circle model files.
3. Run the pipeline to optimize, reshape, merge, and prune the models, producing a final `model.circle` ready for inference.
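
The prefill/decode split in step 2 can be pictured with a toy sketch (illustrative only — the "model" below is a dummy increment rule, not TinyLlama): prefill consumes the whole prompt once and builds the cached state, after which decode emits one token per step from that cache.

```python
# Toy illustration of the prefill/decode split (not the real model):
# prefill processes the full prompt once; decode then generates one
# token per step, extending the cached state.
def prefill(prompt_ids):
    return {"cache": list(prompt_ids)}  # stand-in for the KV cache

def decode(state):
    next_id = (state["cache"][-1] + 1) % 10  # dummy "model" rule
    state["cache"].append(next_id)
    return next_id

state = prefill([3, 1, 4])
generated = [decode(state) for _ in range(4)]
print(generated)  # [5, 6, 7, 8]
```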

## Prerequisites

### 1. Python virtual environment
```bash
$ cd runtime/ggma/examples/generate_text/
$ python3 -m venv _
$ source _/bin/activate
```

### 2. Prepare [gyu](tools/gyu/README.md) and o2o tools
Install dependencies and set up the `o2o` tools (similar to what `tools/gyu/init.py` does).

> **Note**: We install the CPU version of `torch` first because `gyu` depends on `TICO`, which by default pulls in the large NVIDIA version of `torch`. Installing the CPU version beforehand prevents this.

```bash
# 1. Install torch (CPU) and gyu requirements
$ pip install torch --index-url https://download.pytorch.org/whl/cpu
$ pip install -r tools/gyu/requirements.txt

# 2. Fetch o2o tools from PR #16233
$ git fetch origin pull/16233/head:pr-16233
$ git checkout pr-16233 -- tools/o2o
$ chmod +x tools/o2o/*.py

# 3. Add tools to PATH
$ export PATH=$PWD/tools/o2o:$PWD/tools/gyu:$PATH
```



## Generating Model Files

### 1. Install model dependencies
```bash
$ pip install -r tinyllama/tinyllama.requirements
```

### 2. Create the prefill and decode Circle model files
```bash
$ python tinyllama/tinyllama.py --mode prefill # Generates prefill.circle
$ python tinyllama/tinyllama.py --mode decode # Generates decode_.circle
```

Verify the generated files:
```bash
$ ls -lh *.circle
-rw-rw-r-- 1 gyu gyu 18M Nov 14 14:09 decode_.circle
-rw-rw-r-- 1 gyu gyu 18M Nov 14 14:09 prefill.circle
```

### 3. Create `decode.circle`
Fuse attention and normalize KV-cache inputs for the decode model.

```bash
$ fuse.attention.py < decode_.circle \
| reshape.io.py input --by_shape [1,16,30,4] [1,16,32,4] \
| transpose.io.kvcache.py > decode.circle
```
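
One hedged reading of the `--by_shape` arguments above (assuming the cache layout is `[batch, heads, seq_len, head_dim]`) is that the reshape widens the KV-cache window from 30 to 32 steps:

```python
from math import prod

# Assumed KV-cache layout: [batch, heads, seq_len, head_dim].
# reshape.io.py rewrites the cache inputs from a 30-step window
# to a 32-step window; element counts change accordingly.
old_shape = [1, 16, 30, 4]
new_shape = [1, 16, 32, 4]
print(prod(old_shape), prod(new_shape))  # 1920 2048
```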

### 4. Merge prefill and decode circles
Merge the models, retype input IDs, and clean up.

```bash
$ merge.circles.py prefill.circle decode.circle \
| fuse.bmm_lhs_const.py \
| downcast.input_ids.py \
| gc.py > model.circle
```
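
As a hedged sketch of the weight sharing the merge performs (the `dedup_buffers` helper is hypothetical, not one of the tools): identical weight buffers from the two graphs are collapsed to a single buffer index.

```python
# Hypothetical sketch of merge-time weight sharing: buffers with the
# same content map to the same index in the merged buffer table.
def dedup_buffers(buffers):
    index_of = {}       # content -> index in the merged table
    merged, remap = [], []
    for buf in buffers:
        if buf not in index_of:
            index_of[buf] = len(merged)
            merged.append(buf)
        remap.append(index_of[buf])
    return merged, remap

merged, remap = dedup_buffers([b"w0", b"w1", b"w0", b"w2"])
print(len(merged), remap)  # 3 [0, 1, 0, 2]
```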

Verify final model files:
```bash
$ ls -l {decode,prefill,model}.circle
-rw-rw-r-- 1 gyu gyu 18594868 Nov 22 17:26 decode.circle
-rw-rw-r-- 1 gyu gyu 18642052 Nov 22 07:53 prefill.circle
-rw-rw-r-- 1 gyu gyu 18629520 Nov 22 17:28 model.circle
```
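
The `gc.py` stage can be pictured as a reachability pass; a minimal sketch (hypothetical, not the actual tool) that keeps only tensors reachable from the graph outputs:

```python
# Hypothetical garbage-collection sketch: walk back from the graph
# outputs and keep only the tensors that are reachable; everything
# else (dead tensors, buffers, ...) can be dropped.
def reachable(outputs, inputs_of):
    """inputs_of maps a tensor to the input tensors of its producer op."""
    seen, stack = set(), list(outputs)
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(inputs_of.get(t, []))
    return seen

inputs_of = {"logits": ["hidden"], "hidden": ["ids"], "dead": ["ids"]}
print(sorted(reachable(["logits"], inputs_of)))  # ['hidden', 'ids', 'logits']
```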

## Create a GGMA package

1. Create the package root directory and move `model.circle` there:
```bash
$ cd runtime/ggma/examples/generate_text
$ mkdir tinyllama
$ mv model.circle tinyllama/
```

2. Copy the tokenizer files (replace `{your_snapshot}` with the actual snapshot hash):
```bash
$ cp -L ~/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/{your_snapshot}/tokenizer.* tinyllama/
$ cp -L ~/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/{your_snapshot}/config.json tinyllama/
```
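
To avoid filling in `{your_snapshot}` by hand, a small helper like the following could locate the cached snapshot directory (hypothetical convenience code, assuming a single cached snapshot):

```python
import glob
import os

# Hypothetical helper: find the first cached HuggingFace snapshot
# directory for a model, instead of copying the hash manually.
def first_snapshot(hub_dir, model="models--Maykeye--TinyLLama-v0"):
    pattern = os.path.join(hub_dir, model, "snapshots", "*")
    snaps = sorted(glob.glob(pattern))
    return snaps[0] if snaps else None

hub = os.path.expanduser("~/.cache/huggingface/hub")
print(first_snapshot(hub))  # snapshot path to copy tokenizer.*/config.json from, or None
```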

```bash
$ tree tinyllama/
tinyllama/
├── config.json
├── model.circle
├── tokenizer.json
└── tokenizer.model
```

## Build and run `ggma_run`

```bash
$ make -j$(nproc)
$ make install
```

Check version:
```bash
$ Product/out/bin/ggma_run --version
ggma_run v0.1.0 (nnfw runtime: v1.31.0)
```

Run the model:
```bash
$ Product/out/bin/ggma_run tinyllama
prompt: Lily picked up a flower.
generated: { 1100, 7899, 289, 826, 351, 600, 2439, 288, 266, 3653, 31843, 1100, 7899, 289, 1261, 291, 5869, 291, 1261, 31843, 1100, 7899 }
detokenized: She liked to play with her friends in the park. She liked to run and jump and run. She liked
```
runtime/ggma/examples/generate_text/USER.md
# Text Generation User Guide

This guide shows how to create a GGMA package for text generation models using the `opm` (one packaging manager) tool.

We use TinyLlama as an example throughout this guide.

## Creating a GGMA package

NOTE: Start from the ONE repository root directory.

### 1. Initialize environment (one-time setup)

Add [opm](../../../../tools/opm/README.md) to PATH:
```bash
$ export PATH=$PWD/tools/opm:$PATH
```

Then change to the tinyllama example directory and run `opm init`:
```bash
$ cd runtime/ggma/examples/generate_text/tinyllama
$ opm init
```

The Python environment and `o2o` tools are now prepared:
```bash
$ ls -ld o2o venv
drwxrwxr-x 2 opm opm 4096 Nov 24 09:44 o2o
drwxrwxr-x 6 opm opm 4096 Nov 24 09:42 venv
```

> **Note**: The `o2o` directory will be removed once [#13689](https://github.com/Samsung/ONE/pull/13689) is merged.

### 2. Import model from HuggingFace

```bash
$ opm import Maykeye/TinyLLama-v0
```

The HuggingFace model is downloaded to `build/tinyllama-v0/`:
```
$ tree build
build
└── tinyllama-v0
├── backup
├── config.json
├── demo.py
├── generation_config.json
├── model.onnx
├── model.safetensors
├── pytorch_model.bin
├── README.md
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── tokenizer.model
├── train.ipynb
└── valid.py
```

### 3. Export to GGMA package

```bash
$ opm export -s tinyllama.py
```

The GGMA package is generated in `build/out/`:
```
$ tree build/out
build/out/
├── config.json
├── model.circle
├── tokenizer.json
└── tokenizer.model
```
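
A quick sanity check on the exported package can be sketched like this (hypothetical helper; the required file list is taken from the tree above):

```python
import os

# Files a GGMA text-generation package is expected to contain,
# per the `tree build/out` listing above.
REQUIRED = {"config.json", "model.circle", "tokenizer.json", "tokenizer.model"}

def missing_files(pkg_dir):
    present = set(os.listdir(pkg_dir)) if os.path.isdir(pkg_dir) else set()
    return sorted(REQUIRED - present)

print(missing_files("build/out"))  # [] when the package is complete
```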

## Building GGMA and Running a GGMA package

NOTE: Start from the ONE repository root directory.

### Build

```bash
$ make -j$(nproc)
$ make install
```

For detailed build instructions, see the [ONE Runtime Build Guide](https://github.com/Samsung/ONE/blob/master/docs/runtime/README.md).

Confirm that `ggma_run` is built and show its version:
```bash
$ Product/out/bin/ggma_run --version
ggma_run v0.1.0 (nnfw runtime: v1.31.0)
```

### Run

Execute the GGMA package (default prompt) to see a sample output:
```bash
$ Product/out/bin/ggma_run build/out
prompt: Lily picked up a flower.
generated: { 1100, 7899, 289, 826, 351, 600, 2439, 288, 266, 3653, 31843, 1100, 7899, 289, 1261, 291, 5869, 291, 1261, 31843, 1100, 7899 }
detokenized: She liked to play with her friends in the park. She liked to run and jump and run. She liked
```

For detailed run instructions, see the [ggma_run guide](https://github.com/Samsung/ONE/blob/master/runtime/tests/tools/ggma_run/README.md).


For developers who want to understand what happens under the hood, see [DEVELOPER.md](DEVELOPER.md).
runtime/ggma/examples/generate_text/tinyllama/pipeline.yaml
decode: |
  reshape.io.py input --by_shape [1,16,30,4] [1,16,32,4] < decode.circle
  | transpose.io.kvcache.py > _.circle && mv _.circle decode.circle

merge: |
  merge.circles.py prefill.circle decode.circle
  | fuse.bmm_lhs_const.py
  | downcast.input_ids.py
  | gc.py > model.circle

Contributor notes on the `merge` steps:

- `merge.circles.py` merges the two circles into one; in this phase, weight sharing is handled by pointing identical weight contents at the same buffer index.
- `fuse.bmm_lhs_const.py` fuses constant LHS operands, since onert does not allow a constant LHS for BatchMatMul.
- `downcast.input_ids.py` uses int32 instead of int64 (the default type TICO generates) for `input_ids`, which is consumed by a gather.
- `gc.py` removes unreachable inputs/outputs, tensors, buffers, and so on.

runtime/ggma/examples/generate_text/tinyllama/tinyllama.requirements
transformers==4.50.3
runtime/ggma/examples/generate_text/tinyllama/tinyllama.py
import argparse
import torch
from dataclasses import dataclass
from typing import Callable, List, Optional
from transformers import AutoTokenizer, AutoModelForCausalLM
from tico.utils.record_input import RecordingInput
import tico

# Constants
MODEL_ID = "Maykeye/TinyLLama-v0"
PROMPT = "Lily picked up a flower."


@dataclass
class ModeArg:
    max_length: int
    input_to_remove: List[str]
    condition: Optional[Callable]


MODE_ARGS = {
    "prefill":
    ModeArg(max_length=32,
            input_to_remove=["past_key_values", "attention_mask", "cache_position"],
            condition=None),
    "decode":
    ModeArg(
        max_length=30,
        input_to_remove=["attention_mask"],
        condition=lambda args_dict: args_dict["past_key_values"].get_seq_length() != 0)
}


def main():
    parser = argparse.ArgumentParser(
        description="Export TinyLlama model to Circle format.")
    parser.add_argument("--mode",
                        choices=["prefill", "decode"],
                        required=True,
                        help="Export mode: prefill or decode")
    args = parser.parse_args()

    # Get configuration for the selected mode
    config = MODE_ARGS[args.mode]

    # Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    inputs = tokenizer(
        PROMPT,
        return_tensors="pt",
        padding="max_length",
        max_length=config.max_length,
        truncation=True,
    )

    # Model
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    rec_context = RecordingInput(model,
                                 config.condition,
                                 input_to_remove=config.input_to_remove)

    with torch.no_grad(), rec_context as rec:
        outputs = model.generate(
            **inputs,
            max_new_tokens=32,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
        captured_input = rec.captured_input

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated text: {generated_text}")

    # TICO conversion
    # Re-instantiate the model so the conversion starts from a clean state,
    # following the pattern of the original prefill/decode export scripts.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    if args.mode == "decode":
        # Monkey patch for decode mode
        from tico.serialize.operators.adapters.onert.llama_attention import (
            llama_attention_forward_adapter, )
        from transformers.models.llama.modeling_llama import LlamaAttention
        LlamaAttention.forward = llama_attention_forward_adapter

    circle_model = tico.convert(model, captured_input)
    output_file = f"{args.mode}.circle"
    circle_model.save(output_file)
    print(f"Model saved to {output_file}")


if __name__ == "__main__":
    main()