Distributed Llama API

This is an early version of the server that is compatible with the OpenAi API. It supports only the /v1/chat/completions endpoint. Currently it's adjusted to the Llama 3 8B Instruct only.

How to run?

Download the model and the tokenizer from here.
Run the server with the following command:

./dllama-api --model converter/dllama_original_q40.bin --tokenizer converter/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4

Check the chat-api-client.js file to see how to use the API from NodeJS application.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distributed Llama API

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Distributed Llama API