runtime/ggma/examples/generate_text/DEVELOPER.md
# TinyLlama Text Generation Developer Guide

This document provides a detailed technical guide for generating, processing, and optimizing the TinyLlama text-generation model. For basic usage, see [USER.md](USER.md).

## Summary

1. Set up the environment and install dependencies.
2. Generate the initial `prefill` and `decode` Circle model files.
3. Run the pipeline to optimize, reshape, merge, and prune the models, producing a final `model.circle` ready for inference.
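
The prefill/decode split in step 2 can be pictured with a toy sketch (illustrative only — the "model" below is a dummy increment rule, not TinyLlama): prefill consumes the whole prompt once and builds the cached state, after which decode emits one token per step from that cache.

```python
# Toy illustration of the prefill/decode split (not the real model):
# prefill processes the full prompt once; decode then generates one
# token per step, extending the cached state.
def prefill(prompt_ids):
    return {"cache": list(prompt_ids)}  # stand-in for the KV cache

def decode(state):
    next_id = (state["cache"][-1] + 1) % 10  # dummy "model" rule
    state["cache"].append(next_id)
    return next_id

state = prefill([3, 1, 4])
generated = [decode(state) for _ in range(4)]
print(generated)  # [5, 6, 7, 8]
```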

## Prerequisites

### 1. Python virtual environment
```bash
$ cd runtime/ggma/examples/generate_text/
$ python3 -m venv _
$ source _/bin/activate
```

### 2. Prepare [gyu](tools/gyu/README.md) and o2o tools
Install dependencies and set up the `o2o` tools (similar to what `tools/gyu/init.py` does).

> **Note**: We install the CPU version of `torch` first because `gyu` depends on `TICO`, which by default pulls in the large NVIDIA version of `torch`. Installing the CPU version beforehand prevents this.

```bash
# 1. Install torch (CPU) and gyu requirements
$ pip install torch --index-url https://download.pytorch.org/whl/cpu
$ pip install -r tools/gyu/requirements.txt

# 2. Fetch o2o tools from PR #16233
$ git fetch origin pull/16233/head:pr-16233
$ git checkout pr-16233 -- tools/o2o
$ chmod +x tools/o2o/*.py

# 3. Add tools to PATH
$ export PATH=$PWD/tools/o2o:$PWD/tools/gyu:$PATH
```



## Generating Model Files

### 1. Install model dependencies
```bash
$ pip install -r tinyllama/tinyllama.requirements
```

### 2. Create the prefill and decode Circle model files
```bash
$ python tinyllama/tinyllama.py --mode prefill # Generates prefill.circle
$ python tinyllama/tinyllama.py --mode decode # Generates decode_.circle
```

Verify the generated files:
```bash
$ ls -lh *.circle
-rw-rw-r-- 1 gyu gyu 18M Nov 14 14:09 decode_.circle
-rw-rw-r-- 1 gyu gyu 18M Nov 14 14:09 prefill.circle
```

### 3. Create `decode.circle`
Fuse attention and normalize KV-cache inputs for the decode model.

```bash
$ fuse.attention.py < decode_.circle \
| reshape.io.py input --by_shape [1,16,30,4] [1,16,32,4] \
| transpose.io.kvcache.py > decode.circle
```
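
One hedged reading of the `--by_shape` arguments above (assuming the cache layout is `[batch, heads, seq_len, head_dim]`) is that the reshape widens the KV-cache window from 30 to 32 steps:

```python
from math import prod

# Assumed KV-cache layout: [batch, heads, seq_len, head_dim].
# reshape.io.py rewrites the cache inputs from a 30-step window
# to a 32-step window; element counts change accordingly.
old_shape = [1, 16, 30, 4]
new_shape = [1, 16, 32, 4]
print(prod(old_shape), prod(new_shape))  # 1920 2048
```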

### 4. Merge prefill and decode circles
Merge the models, retype input IDs, and clean up.

```bash
$ merge.circles.py prefill.circle decode.circle \
| fuse.bmm_lhs_const.py \
| downcast.input_ids.py \
| gc.py > model.circle
```
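
As a hedged sketch of the weight sharing the merge performs (the `dedup_buffers` helper is hypothetical, not one of the tools): identical weight buffers from the two graphs are collapsed to a single buffer index.

```python
# Hypothetical sketch of merge-time weight sharing: buffers with the
# same content map to the same index in the merged buffer table.
def dedup_buffers(buffers):
    index_of = {}       # content -> index in the merged table
    merged, remap = [], []
    for buf in buffers:
        if buf not in index_of:
            index_of[buf] = len(merged)
            merged.append(buf)
        remap.append(index_of[buf])
    return merged, remap

merged, remap = dedup_buffers([b"w0", b"w1", b"w0", b"w2"])
print(len(merged), remap)  # 3 [0, 1, 0, 2]
```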

Verify final model files:
```bash
$ ls -l {decode,prefill,model}.circle
-rw-rw-r-- 1 gyu gyu 18594868 Nov 22 17:26 decode.circle
-rw-rw-r-- 1 gyu gyu 18642052 Nov 22 07:53 prefill.circle
-rw-rw-r-- 1 gyu gyu 18629520 Nov 22 17:28 model.circle
```
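
The `gc.py` stage can be pictured as a reachability pass; a minimal sketch (hypothetical, not the actual tool) that keeps only tensors reachable from the graph outputs:

```python
# Hypothetical garbage-collection sketch: walk back from the graph
# outputs and keep only the tensors that are reachable; everything
# else (dead tensors, buffers, ...) can be dropped.
def reachable(outputs, inputs_of):
    """inputs_of maps a tensor to the input tensors of its producer op."""
    seen, stack = set(), list(outputs)
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(inputs_of.get(t, []))
    return seen

inputs_of = {"logits": ["hidden"], "hidden": ["ids"], "dead": ["ids"]}
print(sorted(reachable(["logits"], inputs_of)))  # ['hidden', 'ids', 'logits']
```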

## Create a GGMA package

1. Create the package root directory and move `model.circle` there:
```bash
$ cd runtime/ggma/examples/generate_text
$ mkdir tinyllama
$ mv model.circle tinyllama/
```

2. Copy the tokenizer files (replace `{your_snapshot}` with the actual snapshot hash):
```bash
$ cp -L ~/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/{your_snapshot}/tokenizer.* tinyllama/
$ cp -L ~/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/{your_snapshot}/config.json tinyllama/
```
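
To avoid filling in `{your_snapshot}` by hand, a small helper like the following could locate the cached snapshot directory (hypothetical convenience code, assuming a single cached snapshot):

```python
import glob
import os

# Hypothetical helper: find the first cached HuggingFace snapshot
# directory for a model, instead of copying the hash manually.
def first_snapshot(hub_dir, model="models--Maykeye--TinyLLama-v0"):
    pattern = os.path.join(hub_dir, model, "snapshots", "*")
    snaps = sorted(glob.glob(pattern))
    return snaps[0] if snaps else None

hub = os.path.expanduser("~/.cache/huggingface/hub")
print(first_snapshot(hub))  # snapshot path to copy tokenizer.*/config.json from, or None
```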

```bash
$ tree tinyllama/
tinyllama/
├── config.json
├── model.circle
├── tokenizer.json
└── tokenizer.model
```

## Build and run `ggma_run`

```bash
$ make -j$(nproc)
$ make install
```

Check version:
```bash
$ Product/out/bin/ggma_run --version
ggma_run v0.1.0 (nnfw runtime: v1.31.0)
```

Run the model:
```bash
$ Product/out/bin/ggma_run tinyllama
prompt: Lily picked up a flower.
generated: { 1100, 7899, 289, 826, 351, 600, 2439, 288, 266, 3653, 31843, 1100, 7899, 289, 1261, 291, 5869, 291, 1261, 31843, 1100, 7899 }
detokenized: She liked to play with her friends in the park. She liked to run and jump and run. She liked
```
runtime/ggma/examples/generate_text/USER.md
# Text Generation User Guide

This guide shows how to create a GGMA package for text generation models using the `opm` (one packaging manager) tool.

We use TinyLlama as an example throughout this guide.

## Creating a GGMA package

NOTE: Start from the ONE repository root directory.

### 1. Initialize environment (one-time setup)

Add [opm](../../../../tools/opm/README.md) to PATH:
```bash
$ export PATH=$PWD/tools/opm:$PATH
```

Then change to the tinyllama example directory and run `opm init`:
```bash
$ cd runtime/ggma/examples/generate_text/tinyllama
$ opm init
```

The Python environment and `o2o` tools are now prepared:
```bash
$ ls -ld o2o venv
drwxrwxr-x 2 opm opm 4096 Nov 24 09:44 o2o
drwxrwxr-x 6 opm opm 4096 Nov 24 09:42 venv
```

> **Note**: The `o2o` directory will be removed once [#13689](https://github.com/Samsung/ONE/pull/13689) is merged.

### 2. Import model from HuggingFace

```bash
$ opm import Maykeye/TinyLLama-v0
```

The HuggingFace model is downloaded to `build/tinyllama-v0/`:
```
$ tree build
build
└── tinyllama-v0
├── backup
├── config.json
├── demo.py
├── generation_config.json
├── model.onnx
├── model.safetensors
├── pytorch_model.bin
├── README.md
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── tokenizer.model
├── train.ipynb
└── valid.py
```

### 3. Export to GGMA package

```bash
$ opm export -s tinyllama.py
```

The GGMA package is generated in `build/out/`:
```
$ tree build/out
build/out/
├── config.json
├── model.circle
├── tokenizer.json
└── tokenizer.model
```
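
A quick sanity check on the exported package can be sketched like this (hypothetical helper; the required file list is taken from the tree above):

```python
import os

# Files a GGMA text-generation package is expected to contain,
# per the `tree build/out` listing above.
REQUIRED = {"config.json", "model.circle", "tokenizer.json", "tokenizer.model"}

def missing_files(pkg_dir):
    present = set(os.listdir(pkg_dir)) if os.path.isdir(pkg_dir) else set()
    return sorted(REQUIRED - present)

print(missing_files("build/out"))  # [] when the package is complete
```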

## Building GGMA and Running a GGMA package

NOTE: Start from the ONE repository root directory.

### Build

```bash
$ make -j$(nproc)
$ make install
```

For detailed build instructions, see the [ONE Runtime Build Guide](https://github.com/Samsung/ONE/blob/master/docs/runtime/README.md).

Confirm that `ggma_run` is built and show its version:
```bash
$ Product/out/bin/ggma_run --version
ggma_run v0.1.0 (nnfw runtime: v1.31.0)
```

### Run

Execute the GGMA package (default prompt) to see a sample output:
```bash
$ Product/out/bin/ggma_run build/out
prompt: Lily picked up a flower.
generated: { 1100, 7899, 289, 826, 351, 600, 2439, 288, 266, 3653, 31843, 1100, 7899, 289, 1261, 291, 5869, 291, 1261, 31843, 1100, 7899 }
detokenized: She liked to play with her friends in the park. She liked to run and jump and run. She liked
```

For detailed run instructions, see the [ggma_run guide](https://github.com/Samsung/ONE/blob/master/runtime/tests/tools/ggma_run/README.md).


For developers who want to understand what happens under the hood, see [DEVELOPER.md](DEVELOPER.md).
runtime/ggma/examples/generate_text/tinyllama/pipeline.yaml
decode: |
  reshape.io.py input --by_shape [1,16,30,4] [1,16,32,4] < decode.circle
  | transpose.io.kvcache.py > _.circle && mv _.circle decode.circle

merge: |
  merge.circles.py prefill.circle decode.circle
  | fuse.bmm_lhs_const.py
  | downcast.input_ids.py
  | gc.py > model.circle

Contributor notes on the `merge` steps:

- `merge.circles.py` merges the two circles into one; in this phase, weight sharing is handled by pointing identical weight contents at the same buffer index.
- `fuse.bmm_lhs_const.py` fuses constant LHS operands, since onert does not allow a constant LHS for BatchMatMul.
- `downcast.input_ids.py` uses int32 instead of int64 (the default type TICO generates) for `input_ids`, which is consumed by a gather.
- `gc.py` removes unreachable inputs/outputs, tensors, buffers, and so on.

runtime/ggma/examples/generate_text/tinyllama/tinyllama.requirements
transformers==4.50.3
runtime/ggma/examples/generate_text/tinyllama/tinyllama.py
import argparse
import torch
from dataclasses import dataclass
from typing import Callable, List, Optional
from transformers import AutoTokenizer, AutoModelForCausalLM
from tico.utils.record_input import RecordingInput
import tico

# Constants
MODEL_ID = "Maykeye/TinyLLama-v0"
PROMPT = "Lily picked up a flower."


@dataclass
class ModeArg:
    max_length: int
    input_to_remove: List[str]
    condition: Optional[Callable]


MODE_ARGS = {
    "prefill":
    ModeArg(max_length=32,
            input_to_remove=["past_key_values", "attention_mask", "cache_position"],
            condition=None),
    "decode":
    ModeArg(
        max_length=30,
        input_to_remove=["attention_mask"],
        condition=lambda args_dict: args_dict["past_key_values"].get_seq_length() != 0)
}


def main():
    parser = argparse.ArgumentParser(
        description="Export TinyLlama model to Circle format.")
    parser.add_argument("--mode",
                        choices=["prefill", "decode"],
                        required=True,
                        help="Export mode: prefill or decode")
    args = parser.parse_args()

    # Get configuration for the selected mode
    config = MODE_ARGS[args.mode]

    # Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    inputs = tokenizer(
        PROMPT,
        return_tensors="pt",
        padding="max_length",
        max_length=config.max_length,
        truncation=True,
    )

    # Model
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    rec_context = RecordingInput(model,
                                 config.condition,
                                 input_to_remove=config.input_to_remove)

    with torch.no_grad(), rec_context as rec:
        outputs = model.generate(
            **inputs,
            max_new_tokens=32,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
        captured_input = rec.captured_input

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated text: {generated_text}")

    # TICO conversion
    # Re-instantiate the model so the conversion starts from a clean state,
    # following the pattern of the original prefill/decode export scripts.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    if args.mode == "decode":
        # Monkey patch for decode mode
        from tico.serialize.operators.adapters.onert.llama_attention import (
            llama_attention_forward_adapter, )
        from transformers.models.llama.modeling_llama import LlamaAttention
        LlamaAttention.forward = llama_attention_forward_adapter

    circle_model = tico.convert(model, captured_input)
    output_file = f"{args.mode}.circle"
    circle_model.save(output_file)
    print(f"Model saved to {output_file}")


if __name__ == "__main__":
    main()