- Introduction
- Roadmap
- Inspiration
- Example Usage
- Infrastructure
- Models
- Adding a New Model
- KV Cache Backends
Before LLMs gained popularity, work on neural models was mostly focused on model development and optimization. LLMs, being autoregressive, also require substantial infrastructure optimization. We therefore break the process of running inference on an LLM into two parts: the infrastructure and the model.
| Features | Status |
|---|---|
| Random Sampling | ✅ |
| Greedy Sampling | ✅ |
| Streamer | ✅ |
| Continuous Batching | ✅ |
| Speculative Decoding | ✅ |
| Graph Compilation | ❌ |
| Models | FP32/BF16 | Dynamic Quantization | Static Quantization |
|---|---|---|---|
| OPT | ✅ | ❌ | ❌ |
| LLAMA (<=3.3) | ✅ | ❌ | ❌ |
| GPT-J | ✅ | ❌ | ❌ |
| Phi3/4 | ✅ | ❌ | ❌ |
| QWEN2/2.5 | ✅ | ❌ | ❌ |
| Gemma 3 | ✅ | ❌ | ❌ |
| GptOss | ✅ | ❌ | ❌ |
The inspiration behind creating a new infrastructure for Large Language Models is that optimizations can be done at both the infrastructure and the model level. The infrastructure should take in an HF model as-is and be able to run inference on it. It should also support running multiple models (model independent) with multiple data types (type independent).
Some methods have been taken or adapted from HF Transformers and vLLM.
Here is how you can load a model and run inference on it:

```python
import torch
import pace
from transformers import AutoTokenizer
from pace.llm import LLMModel, SamplingConfig

model_name = "model-name"
torch_dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"
inputs_encoded = tokenizer.batch_encode_plus(["Input.."], return_tensors="pt", padding="longest")

pace_model = LLMModel(model_name, dtype=torch_dtype)
sampling_config = SamplingConfig(
    max_new_tokens=35,
    do_sample=False)

pace_output = pace_model.generate(inputs_encoded, sampling_config)
print(tokenizer.decode(pace_output.output_token_ids[0], skip_special_tokens=True))
```

For a more detailed example, please refer to the example given in `examples/pace_llm_basic.py`.
These are the components of the infrastructure; they all live under `pace/llm`:
The LLMModel class is what is exposed to the user and is what should be used to run inference on any model. It accepts a model id (from HF) or a path to a local model. LLMModel acts as a frontend to the Generator class: calling its generate method internally invokes the Generator to run inference on the model.
LLMModel methods:
- Constructor: Accepts the model path (mandatory), a tokenizer path, and the data type for the model.
- `generate`: Accepts the input and the sampling criteria, and runs inference on the model.
The Generator class is responsible for generating the output from the model. The Generator class is model independent and can be used to run inference on any model.
The Generator class is responsible for:
- Loading the model, and the correct weights and configurations for it (through `model_utils`).
- Loading and managing the tokenizer.
- Preprocessing the input, and managing the sampling and the stopping criteria.
- Running inference on the model in an auto-regressive manner.
- Managing the KV cache for the model with the help of `KVCacheManager` (from `pace.llm.attention`).
Generator methods:
- Constructor: Accepts the model path, tokenizer path, and the data type for the model.
- `prepare_for_generate`: Accepts the input and the sampling criteria; prepares the input, the mask, the sampler, and the stopping criteria for the model.
- `generate`: Accepts the input and runs inference on the model in a while loop: each iteration passes the input through the model and then the sampler, and the loop breaks when the stopping criteria is met.
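The while loop that `generate` runs can be sketched in a few lines. Here `model`, `sampler`, and `stopping` are hypothetical stand-ins for illustration, not PACE's actual interfaces:

```python
def generate_loop(model, sampler, stopping, input_ids):
    """Minimal autoregressive decode loop: forward -> sample -> check stop."""
    tokens = list(input_ids)
    while True:
        logits = model(tokens)        # forward pass over the tokens seen so far
        next_token = sampler(logits)  # pick the next token from the logits
        tokens.append(next_token)
        if stopping(tokens):          # stopping criteria: max length, EOS, ...
            break
    return tokens

# Toy stand-ins: the "model" emits flat logits, the "sampler" always picks
# token 7, and we stop once the sequence reaches 5 tokens.
toy_model = lambda toks: [0.0] * 10
toy_sampler = lambda logits: 7
toy_stop = lambda toks: len(toks) >= 5
```

Each iteration produces exactly one new token, which is why the per-step cost of the model and the KV cache management dominate LLM inference.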
Sampler class is responsible for sampling the next token from the model. The Sampler class is model independent and can be used to run inference on any model. The Sampler takes in the logits from the model and samples the next token based on the sampling criteria. The Sampling criteria is provided by SamplingConfig.
Sampler divides the sampling into three parts:
- Penalties: `repetition_penalty`, `frequency_penalty` — applied to raw logits before any filtering.
- Preprocessors: `temperature`, `top_k`, `top_p`, `min_p` — shape the logit distribution.
- Sampling: `greedy` (argmax) or `random` (multinomial from the resulting probabilities).
Sampler methods:
- Constructor: Accepts the sampling criteria.
- `sample`: Accepts the logits from the model and samples the next token according to the sampling criteria. The sampling criteria can be greedy or random sampling.
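A minimal pure-Python sketch of that flow — temperature and top-k preprocessing followed by greedy or multinomial selection. This is illustrative only, not PACE's Sampler, and it omits the penalties, `top_p`, and `min_p` steps:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, do_sample=True, rng=None):
    """Preprocess -> sample: shape the distribution, then pick a token."""
    if not do_sample:  # greedy: argmax over the raw logits
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    if top_k > 0:      # mask everything outside the top-k logits
        kth = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= kth else float("-inf") for l in scaled]
    # softmax over the (possibly masked) logits; exp(-inf) == 0.0
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # multinomial draw via inverse CDF
    r = (rng or random).random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With `top_k=1` the multinomial draw degenerates to argmax, which is a handy way to sanity-check the masking step.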
The StoppingCriteria class is responsible for ending the generation process once a stopping criterion is met. It is model independent and can be used with any model: it takes in the generated tokens and decides whether generation should stop.
StoppingCriteria methods:
- Constructor: Accepts the sampling config.
- `stop_now`: Accepts the generated tokens and checks whether a stopping criterion is met. The criteria can be based on the number of tokens generated, the EOS token, or a stop string (more to be added later).
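The three criteria above can be sketched as follows; the class and parameter names here are illustrative, not PACE's actual API:

```python
class SimpleStoppingCriteria:
    """Sketch of stop_now-style logic: max new tokens, EOS token, stop string."""

    def __init__(self, max_new_tokens=None, eos_token_id=None, stop_string=None):
        self.max_new_tokens = max_new_tokens
        self.eos_token_id = eos_token_id
        self.stop_string = stop_string

    def stop_now(self, new_token_ids, decoded_text=""):
        # Stop when the generation budget is exhausted
        if self.max_new_tokens is not None and len(new_token_ids) >= self.max_new_tokens:
            return True
        # Stop when the model emits the end-of-sequence token
        if self.eos_token_id is not None and new_token_ids and new_token_ids[-1] == self.eos_token_id:
            return True
        # Stop when a user-supplied stop string appears in the decoded text
        if self.stop_string and self.stop_string in decoded_text:
            return True
        return False
```

The generator calls this check once per decode step, after the sampler appends the new token.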
These configuration classes help configure the generation process.
SamplingConfig contains multiple strategies like `top_k`, `top_p`, `temperature`, etc. It is adapted from the HF implementation. Please check `pace/llm/configs.py` for more details.
model_utils is a utility class that is responsible for loading the model, the tokenizer, and the configurations for the model. The model_utils class is model independent and can be used to load any model.
It is responsible for:
- Loading the config from the model path, identifying the model type, and loading the correct model class.
- Taking care of casting data types (FP32/BF16 supported for now).
- Checking that the model weights are present in the path, loading the weights into RAM, and calling the `model.load_weights` method to load the weights into the model according to the dictionary. Supports both `.bin` and `.safetensors` formats for weight files.
- Loading the tokenizer from the tokenizer path if provided, else from the model path.
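The weight-file discovery step can be sketched as below. Preferring `.safetensors` over `.bin` when both are present is an assumption for illustration; the source only states that both formats are supported:

```python
def select_weight_files(files):
    """Sketch: pick the weight shards to load from a model directory listing.
    Assumption (illustrative): .safetensors shards win over .bin when both exist."""
    safetensors = sorted(f for f in files if f.endswith(".safetensors"))
    if safetensors:
        return safetensors
    return sorted(f for f in files if f.endswith(".bin"))
```

The selected shards would then be read into a single state dict and handed to `model.load_weights`.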
The hf_utils module resolves the model path: if a model name is provided, it downloads the model weights or loads them from the cache. It does the same for the tokenizer.
All models are adapted from the HF implementations with inference-only ops. One forward pass generates one token. The models are added in the models directory.
BaseModelForCausalLM is an abstract base class for all generator based models. All the models implemented in PACE will inherit from this class. It contains an initializer, a forward pass, and a load weights method, all of which are abstract and need to be implemented by the child classes.
Streamers are used to stream the output to stdout as soon as it is generated. They are model independent and can be used with any model. HuggingFace provides the TextStreamer class for this purpose.
For an example of how to use the streamer, please refer to the example given in examples/pace_llm_streamer.py.
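The core contract of a streamer can be sketched as follows. `SimpleTextStreamer` is a hypothetical stand-in, assuming a `put`-style per-token callback like HF's streamer interface:

```python
class SimpleTextStreamer:
    """Sketch of a streamer: decode and emit each token as soon as it is sampled,
    instead of waiting for the whole sequence to finish."""

    def __init__(self, decode_fn, write_fn=print):
        self.decode = decode_fn  # token id -> text piece
        self.write = write_fn    # where to emit (stdout by default)

    def put(self, token_id):
        # Called once per decode step by the generation loop
        self.write(self.decode(token_id))
```

The generation loop would call `streamer.put(next_token)` right after sampling, which is what makes token-by-token output possible.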
All models live in pace/llm/models/ and inherit from BaseModelForCausalLM. The Llama implementation (pace/llm/models/llama.py) is a good reference — it also serves Phi3/4 since they share the same architecture.
1. Create the model file: `pace/llm/models/<arch>.py`. Define the following components:
   - Attention module — uses `FusedQKVLinear`, `RotaryEmbedding`, `Attention`, `Linear` from `pace.llm.ops`
   - Decoder layer — attention + MLP + norms, with residual connections
   - Model backbone — embedding, decoder layers, final norm
   - Top-level `<Arch>ForCausalLM` — inherits `BaseModelForCausalLM`, adds `lm_head`

2. Implement `forward`: The signature is `forward(input_ids, positions, kv_cache) -> ModelOutput`. Only new (unprocessed) tokens are passed — the caller manages `num_computed_tokens`.

3. Implement `load_weights`: Maps HuggingFace checkpoint weight names to PACE module parameters. Use `rename_layers` (class attribute) for simple renames and `target_map` for splitting fused projections:

   ```python
   class MyModelForCausalLM(BaseModelForCausalLM):
       # Splits a fused "gate_up_proj" checkpoint weight into separate gate/up
       target_map = {
           "gate_up_proj": ["gate_proj", "up_proj"],
       }
       # Renames to match PACE's MergedMLP sub-module structure
       rename_layers = {
           "up_proj": "up_proj.linear",
           "gate_proj": "gate_proj.linear",
       }
   ```

4. Register in the model list: Add to `_MODELS` in `pace/llm/models/model_list.py`:

   ```python
   _MODELS = {
       ...
       "<Arch>ForCausalLM": ("<module_name>", "<Arch>ForCausalLM"),
   }
   ```

5. Accept `OperatorConfig`: All layers should use the backend from `OperatorConfig` (e.g., `opconfig.qkv_projection`, `opconfig.mlp`, `opconfig.norm`, `opconfig.lm_head`). This allows users to select different backends (NATIVE, JIT, TPP, etc.) per operator type.
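To illustrate how `rename_layers` and `target_map` interact, here is a pure-Python sketch of the remapping a `load_weights` implementation might perform. `remap_checkpoint` is a hypothetical helper, and splitting a fused tensor into equal chunks along dim 0 is an assumption for illustration; the real split depends on the layer shapes:

```python
def remap_checkpoint(state_dict, rename_layers, target_map):
    """Sketch: apply target_map splits, then rename_layers renames, to a
    checkpoint dict. Weights are plain lists standing in for tensors."""
    # Pass 1: split fused checkpoint tensors into their targets
    # (assumption: equal chunks along the first dimension)
    split = {}
    for name, weight in state_dict.items():
        for fused, targets in target_map.items():
            if fused in name:
                chunk = len(weight) // len(targets)
                for i, tgt in enumerate(targets):
                    split[name.replace(fused, tgt)] = weight[i * chunk:(i + 1) * chunk]
                break
        else:
            split[name] = weight
    # Pass 2: rename to match the PACE module layout
    out = {}
    for name, weight in split.items():
        for src, dst in rename_layers.items():
            if src in name:
                name = name.replace(src, dst)
                break
        out[name] = weight
    return out
```

With the `MyModelForCausalLM` maps above, a fused `gate_up_proj` weight ends up as two entries targeting the `.linear` sub-modules of `MergedMLP`.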
These ops are used in model implementations:
| Op | Import | Purpose |
|---|---|---|
| `Linear` | `pace.llm.ops` | General linear projection |
| `FusedQKVLinear` | `pace.llm.ops` | Fused Q/K/V projection with support for MHA and GQA |
| `RMSNorm` | `pace.llm.ops` | RMS normalization |
| `FusedRMSNormResidual` | `pace.llm.ops` | Fused RMSNorm + residual add |
| `RotaryEmbedding` | `pace.llm.ops` | Rotary position embeddings (RoPE) |
| `MergedMLP` | `pace.llm.ops` | Fused gate/up + down projection MLP |
| `Attention` | `pace.llm.attention` | Attention with pluggable backends (JIT, NATIVE, SLAB, PAGED) |
- Config comes from HuggingFace via `AutoConfig.from_pretrained(model_path)`.
- Subclass when the architecture is very similar to an existing one (e.g., Phi3 reuses the Llama implementation).
- Weight loading must handle fused projections — use `target_map` for splitting and `rename_layers` for renaming. For Q/K/V fusion, collect the individual projection weights and call `fused_layer.load_from_unfused(tensors)`.
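The collect-then-fuse flow for Q/K/V can be sketched like this. `fuse_qkv` is a hypothetical helper operating on list-of-rows "matrices", not the real `load_from_unfused`, which also handles GQA shapes:

```python
def fuse_qkv(checkpoint):
    """Sketch: gather each layer's q/k/v projection weights from a checkpoint
    dict, then concatenate them along the output (row) dimension so one fused
    matmul produces q, k, and v together."""
    pending, fused = {}, {}
    for name, weight in checkpoint.items():
        for proj in ("q_proj", "k_proj", "v_proj"):
            if proj in name:
                layer = name.replace(proj, "{}")  # key shared by the q/k/v triple
                pending.setdefault(layer, {})[proj] = weight
                break
        else:
            fused[name] = weight  # non-attention weight, passed through
    for layer, parts in pending.items():
        # Once q, k, and v are all collected, emit one fused qkv weight
        fused[layer.format("qkv_proj")] = parts["q_proj"] + parts["k_proj"] + parts["v_proj"]
    return fused
```

Fusing the three projections trades three small matmuls for one larger, better-utilized one, which is why `FusedQKVLinear` exists.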
PACE supports multiple KV cache types, defined in KVCacheType (pace/llm/attention/base.py). The cache type determines how key/value tensors are stored and which attention backend is used.
| Cache Type | Description | Compatible Attention Backends |
|---|---|---|
| `DYNAMIC` | Simple contiguous buffer, dynamically sized per sequence. Good for offline inference with small batches. | JIT, NATIVE |
| `BMC` | Block-Major Contiguous cache. Splits the cache into blocks controlled by `PACE_BMC_NUM_SPLITS`. Better memory utilization for longer sequences. | JIT, NATIVE |
| `SLAB_POOL` | Pool-based slab allocator for production serving. Pre-allocates a fixed memory pool with configurable block sizes. See SlabAttention.md for details. | SLAB |
| `PAGED` | Paged attention with block-level memory management. Pool size controlled by `PACE_MAX_CACHE_TOKENS` (default 262144). | PAGED |
OperatorConfig.finalize(cache_type=...) enforces compatibility. If the user-specified attention backend is incompatible with the cache type, it is overridden with a warning:
| Cache Type | Default Attention | Allowed Attention Backends |
|---|---|---|
| `DYNAMIC` | JIT | JIT, NATIVE |
| `BMC` | JIT | JIT, NATIVE |
| `SLAB_POOL` | SLAB | SLAB |
| `PAGED` | PAGED | PAGED |
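The override-with-warning behavior can be sketched as follows, mirroring the table above. The function and table names are illustrative, not PACE's actual `finalize` code:

```python
import warnings

# (default backend, allowed backends) per cache type, per the table above
_COMPAT = {
    "DYNAMIC":   ("JIT",   {"JIT", "NATIVE"}),
    "BMC":       ("JIT",   {"JIT", "NATIVE"}),
    "SLAB_POOL": ("SLAB",  {"SLAB"}),
    "PAGED":     ("PAGED", {"PAGED"}),
}

def finalize_attention_backend(cache_type, requested=None):
    """Sketch of finalize-style checking: fall back to the cache type's default
    backend, with a warning, when the requested backend is incompatible."""
    default, allowed = _COMPAT[cache_type]
    if requested is None:
        return default
    if requested not in allowed:
        warnings.warn(f"{requested} is incompatible with {cache_type}; using {default}")
        return default
    return requested
```

Centralizing the check in one finalize step means model code never has to reason about cache/backend compatibility itself.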