[ggma] Add documentation for TinyLlama example #16283
@@ -0,0 +1,130 @@
# TinyLlama Text Generation Developer Guide

This document provides a detailed technical guide for generating, processing, and optimizing the TinyLlama text-generation model. For basic usage, see [USER.md](USER.md).

## Summary

1. Set up the environment and install dependencies.
2. Generate the initial `prefill` and `decode` Circle model files.
3. Run the post-processing pipeline to optimize, reshape, and prune the models, merging them into a final `model.circle` ready for inference.
## Prerequisites

### 1. Python virtual environment

```bash
$ cd runtime/ggma/examples/generate_text/
$ python3 -m venv _
$ source _/bin/activate
```
### 2. Prepare [gyu](tools/gyu/README.md) and o2o tools

Install dependencies and set up the `o2o` tools (similar to what `tools/gyu/init.py` does).

> **Note**: We install the CPU version of `torch` first because `gyu` depends on `TICO`, which by default pulls in the large NVIDIA build of `torch`. Installing the CPU version beforehand prevents this.

```bash
# 1. Install torch (CPU) and gyu requirements
$ pip install torch --index-url https://download.pytorch.org/whl/cpu
$ pip install -r tools/gyu/requirements.txt

# 2. Fetch o2o tools from PR #16233
$ git fetch origin pull/16233/head:pr-16233
$ git checkout pr-16233 -- tools/o2o
$ chmod +x tools/o2o/*.py

# 3. Add tools to PATH
$ export PATH=$PWD/tools/o2o:$PWD/tools/gyu:$PATH
```
## Generating Model Files

### 1. Install model dependencies

```bash
$ pip install -r tinyllama/tinyllama.requirements
```
### 2. Create the prefill and decode Circle model files

```bash
$ python tinyllama/tinyllama.py --mode prefill   # generates prefill.circle
$ python tinyllama/tinyllama.py --mode decode    # generates decode_.circle
```

Verify the generated files:
```bash
$ ls -lh *.circle
-rw-rw-r-- 1 gyu gyu 18M Nov 14 14:09 decode_.circle
-rw-rw-r-- 1 gyu gyu 18M Nov 14 14:09 prefill.circle
```
### 3. Update `decode_.circle`

Fuse attention and normalize KV-cache inputs for the decode model, producing `decode.circle`:

```bash
$ fuse.attention.py < decode_.circle \
  | reshape.io.py input --by_shape [1,16,30,4] [1,16,32,4] \
  | transpose.io.kvcache.py > decode.circle
```
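Each tool in this pipeline reads a Circle model on stdin and writes the transformed model to stdout, which is what makes them composable with shell pipes. A minimal sketch of that filter pattern (the transforms below are placeholders, not the real tools):

```python
def fuse_attention(model: bytes) -> bytes:
    # Stand-in transform; the real tool rewrites attention subgraphs.
    return model + b"|fused"

def reshape_io(model: bytes) -> bytes:
    # Stand-in transform; the real tool rewrites KV-cache input shapes.
    return model + b"|reshaped"

def run_pipeline(data: bytes, stages) -> bytes:
    # Shell equivalent: `stage1 < in | stage2 > out`.
    for stage in stages:
        data = stage(data)
    return data

print(run_pipeline(b"decode_", (fuse_attention, reshape_io)))  # b'decode_|fused|reshaped'
```

Because every stage is bytes-in/bytes-out, new passes can be inserted anywhere in the chain without touching the other tools.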
### 4. Merge prefill and decode circles

Merge the models, retype the input IDs, and clean up:

```bash
$ merge.circles.py prefill.circle decode.circle \
  | fuse.bmm_lhs_const.py \
  | downcast.input_ids.py \
  | gc.py > model.circle
```
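According to the review comments on this PR, `merge.circles.py` shares identical weights between the two models by pointing tensors at the same buffer index, and `gc.py` then removes unreachable tensors and buffers. A simplified sketch of content-based buffer sharing (toy data model, not the actual Circle schema):

```python
def share_buffers(tensors):
    """Assign each tensor a buffer index, reusing an index when two
    tensors carry byte-identical weight data (simplified sketch)."""
    buffers = []           # unique weight blobs
    index_by_content = {}  # blob -> buffer index
    assignment = {}        # tensor name -> buffer index
    for name, data in tensors.items():
        if data not in index_by_content:
            index_by_content[data] = len(buffers)
            buffers.append(data)
        assignment[name] = index_by_content[data]
    return buffers, assignment

# Prefill and decode share the embedding weights, so only two
# unique buffers remain after merging (hypothetical tensor names).
tensors = {
    "prefill/embed": b"\x01\x02",
    "decode/embed": b"\x01\x02",
    "decode/kv_proj": b"\x03\x04",
}
buffers, assignment = share_buffers(tensors)
print(len(buffers))  # 2
print(assignment["prefill/embed"] == assignment["decode/embed"])  # True
```

This is why `model.circle` (about 18.6 MB below) is barely larger than either input model on its own.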
Verify the final model files:
```bash
$ ls -l {decode,prefill,model}.circle
-rw-rw-r-- 1 gyu gyu 18594868 Nov 22 17:26 decode.circle
-rw-rw-r-- 1 gyu gyu 18642052 Nov 22 07:53 prefill.circle
-rw-rw-r-- 1 gyu gyu 18629520 Nov 22 17:28 model.circle
```
## Create a GGMA package

1. Create the package root directory and move `model.circle` there:
```bash
$ cd runtime/ggma/examples/generate_text
$ mkdir -p tinyllama
$ mv model.circle tinyllama/
```
2. Copy the tokenizer and config files (replace `{your_snapshot}` with the actual snapshot hash):
```bash
$ cp -L ~/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/{your_snapshot}/tokenizer.* tinyllama/
$ cp -L ~/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/{your_snapshot}/config.json tinyllama/
```
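If you would rather not look up `{your_snapshot}` by hand, the snapshot directory can be resolved programmatically. A stdlib-only sketch, assuming the standard HuggingFace hub cache layout (`<cache>/<repo>/snapshots/<hash>/`), which you should verify on your machine:

```python
from pathlib import Path

def latest_snapshot(cache_root: Path, repo: str) -> Path:
    """Return the most recently modified snapshot directory for a
    cached repo, e.g. repo='models--Maykeye--TinyLLama-v0'."""
    snapshots = cache_root / repo / "snapshots"
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime)

# Demo on a fake cache standing in for ~/.cache/huggingface/hub,
# with 'abc123' as a made-up snapshot hash.
import tempfile
root = Path(tempfile.mkdtemp())
(root / "models--Maykeye--TinyLLama-v0" / "snapshots" / "abc123").mkdir(parents=True)
print(latest_snapshot(root, "models--Maykeye--TinyLLama-v0").name)  # abc123
```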
The package now contains:
```bash
$ tree tinyllama/
tinyllama/
├── config.json
├── model.circle
├── tokenizer.json
└── tokenizer.model
```
## Build and run `ggma_run`

Build from the ONE repository root directory:

```bash
$ make -j$(nproc)
$ make install
```

Check the version:
```bash
$ Product/out/bin/ggma_run --version
ggma_run v0.1.0 (nnfw runtime: v1.31.0)
```
Run the model:
```bash
$ Product/out/bin/ggma_run tinyllama
prompt: Lily picked up a flower.
generated: { 1100, 7899, 289, 826, 351, 600, 2439, 288, 266, 3653, 31843, 1100, 7899, 289, 1261, 291, 5869, 291, 1261, 31843, 1100, 7899 }
detokenized: She liked to play with her friends in the park. She liked to run and jump and run. She liked
```
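The `generated` line shows raw token ids, and `detokenized` is the tokenizer's mapping of those ids back to text. As a toy illustration of that last step (the vocabulary below is invented for the sketch, not TinyLlama's SentencePiece vocabulary from `tokenizer.model`):

```python
# Invented id-to-token table, purely illustrative.
toy_vocab = {0: "She", 1: "liked", 2: "to", 3: "run", 4: "."}

def detokenize(ids):
    # Join tokens with spaces; a real tokenizer also handles
    # subword merging and spacing rules.
    return " ".join(toy_vocab[i] for i in ids)

print(detokenize([0, 1, 2, 3, 4]))  # She liked to run .
```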
@@ -0,0 +1,108 @@
# Text Generation User Guide

This guide shows how to create a GGMA package for text-generation models using the `opm` (one packaging manager) tool. We use TinyLlama as the example throughout.

## Creating a GGMA package

> **Note**: Start from the ONE repository root directory.
### 1. Initialize the environment (one-time setup)

Add [opm](../../../../tools/opm/README.md) to `PATH`:
```bash
$ export PATH=$PWD/tools/opm:$PATH
```

Then change to the tinyllama example directory and run `opm init`:
```bash
$ cd runtime/ggma/examples/generate_text/tinyllama
$ opm init
```
This prepares the Python environment and the o2o tools:
```bash
$ ls -ld o2o venv
drwxrwxr-x 2 opm opm 4096 Nov 24 09:44 o2o
drwxrwxr-x 6 opm opm 4096 Nov 24 09:42 venv
```

> **Note**: The `o2o` directory will be removed once [#13689](https://github.com/Samsung/ONE/pull/13689) is merged.
### 2. Import the model from HuggingFace

```bash
$ opm import Maykeye/TinyLLama-v0
```

The HuggingFace model is downloaded to `build/tinyllama-v0/`:
```
$ tree build
build
└── tinyllama-v0
    ├── backup
    ├── config.json
    ├── demo.py
    ├── generation_config.json
    ├── model.onnx
    ├── model.safetensors
    ├── pytorch_model.bin
    ├── README.md
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    ├── tokenizer.model
    ├── train.ipynb
    └── valid.py
```
### 3. Export to a GGMA package

```bash
$ opm export -s tinyllama.py
```

The GGMA package is generated in `build/out/`:
```
$ tree build/out
build/out/
├── config.json
├── model.circle
├── tokenizer.json
└── tokenizer.model
```
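A GGMA package is just a directory with a fixed set of files, so the export result can be sanity-checked with a few lines of stdlib Python (the required file list is taken from the tree above):

```python
from pathlib import Path

REQUIRED = {"config.json", "model.circle", "tokenizer.json", "tokenizer.model"}

def missing_files(package_dir) -> set:
    """Return the required package files absent from package_dir."""
    present = {p.name for p in Path(package_dir).iterdir()}
    return REQUIRED - present

# Demo on a temporary directory that only contains model.circle.
import tempfile
d = Path(tempfile.mkdtemp())
(d / "model.circle").touch()
print(sorted(missing_files(d)))  # ['config.json', 'tokenizer.json', 'tokenizer.model']
```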
## Building GGMA and Running a GGMA package

> **Note**: Start from the ONE repository root directory.

### Build

```bash
$ make -j$(nproc)
$ make install
```

For detailed build instructions, see the [ONE Runtime Build Guide](https://github.com/Samsung/ONE/blob/master/docs/runtime/README.md).

Confirm that `ggma_run` is built and check its version:
```bash
$ Product/out/bin/ggma_run --version
ggma_run v0.1.0 (nnfw runtime: v1.31.0)
```
### Run

Execute the GGMA package with the default prompt to see a sample output:
```bash
$ Product/out/bin/ggma_run build/out
prompt: Lily picked up a flower.
generated: { 1100, 7899, 289, 826, 351, 600, 2439, 288, 266, 3653, 31843, 1100, 7899, 289, 1261, 291, 5869, 291, 1261, 31843, 1100, 7899 }
detokenized: She liked to play with her friends in the park. She liked to run and jump and run. She liked
```

For detailed run instructions, see the [ggma_run guide](https://github.com/Samsung/ONE/blob/master/runtime/tests/tools/ggma_run/README.md).

For developers who want to understand what happens under the hood, see [DEVELOPER.md](DEVELOPER.md).
@@ -0,0 +1,9 @@
```make
decode:
	reshape.io.py input --by_shape [1,16,30,4] [1,16,32,4] < decode.circle \
	| transpose.io.kvcache.py > _.circle && mv _.circle decode.circle

merge:
	merge.circles.py prefill.circle decode.circle \
	| fuse.bmm_lhs_const.py \
	| downcast.input_ids.py \
	| gc.py > model.circle
```

Author comments on this file:

> On `fuse.bmm_lhs_const.py`: onert does not allow const lhs for batchmatmul.
>
> On `gc.py`: It removes unreachable {input/output,tensor,buffer,...}.
@@ -0,0 +1 @@
```
transformers==4.50.3
```
@@ -0,0 +1,98 @@
```python
import argparse
from dataclasses import dataclass
from typing import Callable, List, Optional

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from tico.utils.record_input import RecordingInput
import tico

# Constants
MODEL_ID = "Maykeye/TinyLLama-v0"
PROMPT = "Lily picked up a flower."


@dataclass
class ModeArg:
    max_length: int
    input_to_remove: List[str]
    condition: Optional[Callable]


MODE_ARGS = {
    "prefill":
        ModeArg(max_length=32,
                input_to_remove=["past_key_values", "attention_mask", "cache_position"],
                condition=None),
    "decode":
        ModeArg(
            max_length=30,
            input_to_remove=["attention_mask"],
            condition=lambda args_dict: args_dict["past_key_values"].get_seq_length() != 0)
}


def main():
    parser = argparse.ArgumentParser(
        description="Export TinyLlama model to Circle format.")
    parser.add_argument("--mode",
                        choices=["prefill", "decode"],
                        required=True,
                        help="Export mode: prefill or decode")
    args = parser.parse_args()

    # Get the configuration for the selected mode
    config = MODE_ARGS[args.mode]

    # Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    inputs = tokenizer(
        PROMPT,
        return_tensors="pt",
        padding="max_length",
        max_length=config.max_length,
        truncation=True,
    )

    # Model
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    rec_context = RecordingInput(model,
                                 config.condition,
                                 input_to_remove=config.input_to_remove)

    with torch.no_grad(), rec_context as rec:
        outputs = model.generate(
            **inputs,
            max_new_tokens=32,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
        captured_input = rec.captured_input

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated text: {generated_text}")

    # Tico conversion: re-instantiate the model so conversion starts from a
    # clean state (prefill.py and decode.py follow the same pattern).
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    if args.mode == "decode":
        # Monkey-patch LlamaAttention for decode mode
        from tico.serialize.operators.adapters.onert.llama_attention import (
            llama_attention_forward_adapter, )
        from transformers.models.llama.modeling_llama import LlamaAttention
        LlamaAttention.forward = llama_attention_forward_adapter

    circle_model = tico.convert(model, captured_input)
    output_file = f"{args.mode}.circle"
    circle_model.save(output_file)
    print(f"Model saved to {output_file}")


if __name__ == "__main__":
    main()
```
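The `condition` lambda in `MODE_ARGS` tells `RecordingInput` to skip the first (prefill) call and capture a later decode step, where the KV cache is already populated. A minimal stand-in for that conditional-capture pattern (not the TICO implementation):

```python
class ConditionalRecorder:
    """Capture the kwargs of the first call satisfying `condition` (sketch)."""
    def __init__(self, fn, condition=None):
        self.fn = fn
        self.condition = condition or (lambda kwargs: True)
        self.captured_input = None

    def __call__(self, **kwargs):
        # Record only once, and only when the predicate holds.
        if self.captured_input is None and self.condition(kwargs):
            self.captured_input = dict(kwargs)
        return self.fn(**kwargs)

# Record only calls where the cache is non-empty, as decode mode does
# (`cache_len` is a made-up stand-in for past_key_values.get_seq_length()).
rec = ConditionalRecorder(lambda **kw: None,
                          condition=lambda kw: kw["cache_len"] != 0)
rec(cache_len=0)   # prefill-like call: skipped
rec(cache_len=8)   # decode-like call: captured
print(rec.captured_input)  # {'cache_len': 8}
```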
Author comment on this file:

> It will merge two circles into one circle. In this phase, weight sharing is handled by pointing to the same buffer index for identical weight contents.