llama-cpp-python: Python Bindings for llama.cpp

Summary
llama-cpp-python provides Python bindings for the popular llama.cpp library, enabling efficient local inference with large language models. It offers a high-level API compatible with OpenAI's, which makes integration into existing applications straightforward. The project also includes an OpenAI-compatible web server for local deployment and supports several hardware acceleration backends.
Introduction
llama-cpp-python brings the power of llama.cpp to the Python ecosystem. It offers simple yet comprehensive Python bindings that let developers run large language models (LLMs) locally. The package provides both low-level access to the C API via ctypes and a high-level Python API for common tasks such as text completion, chat completion, and embeddings. With an OpenAI-like API and compatibility with LangChain and LlamaIndex, llama-cpp-python makes local LLM deployment and experimentation accessible to a broad audience.
Installation
Getting started with llama-cpp-python is straightforward. The primary method is installing via pip, which builds llama.cpp from source so the build is optimized for your system.
pip install llama-cpp-python
For basic CPU support, pre-built wheels are also available:
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
To leverage hardware acceleration such as CUDA, Metal (MPS), or OpenBLAS, set the CMAKE_ARGS environment variable during installation. For example, with CUDA:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
Detailed instructions for various backends and pre-built CUDA/Metal wheels can be found in the official documentation.
Examples
llama-cpp-python offers a high-level API designed for ease of use, mimicking the OpenAI API for familiar workflows.
Text Completion:
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    # n_gpu_layers=-1,  # Uncomment to use GPU acceleration
)
output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,      # cap the length of the generated answer
    stop=["Q:", "\n"],  # stop before the model begins a new question
    echo=True,          # include the prompt in the returned text
)
print(output)
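The call returns an OpenAI-style completion dictionary; the generated text itself lives under the choices key:
# Just the generated text
print(output["choices"][0]["text"])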
Chat Completion:
The API supports various chat formats, making it easy to interact with models designed for conversational AI.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/llama-2/llama-model.gguf",
    chat_format="llama-2",  # use the prompt template the model was trained with
)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": "Describe this image in detail please.",
        },
    ]
)
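The return value mirrors OpenAI's response schema, so the assistant's reply can be read from the choices field; passing stream=True instead yields incremental chunks. A minimal sketch of both (the joke prompt is just an illustration):
# Read the assistant's reply from the OpenAI-style response
print(response["choices"][0]["message"]["content"])

# Or stream tokens as they are generated
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)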
Multi-modal Models (e.g., LLaVA):
The library also supports multi-modal models, allowing for image and text input.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # increase the context window to accommodate the image embedding
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        },
    ]
)
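Local images work too: the documentation shows passing them as base64 data URIs in place of a remote URL. A small helper along those lines (the image/png media type here assumes a PNG file):
import base64

def image_to_base64_data_uri(file_path):
    # Encode a local image file as a data URI usable in an image_url entry
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode("utf-8")
        return f"data:image/png;base64,{base64_data}"

# e.g. {"type": "image_url", "image_url": {"url": image_to_base64_data_uri("image.png")}}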
Why Use It
llama-cpp-python stands out for several reasons:
- Local Inference: Run powerful LLMs directly on your machine, ensuring data privacy and reducing reliance on cloud APIs.
- OpenAI API Compatibility: Seamlessly integrate with existing applications built for the OpenAI API, minimizing code changes.
- Hardware Acceleration: Supports various backends like CUDA, Metal, OpenBLAS, and ROCm, optimizing performance on different hardware.
- Rich Feature Set: Beyond basic completion, it offers chat completion, function calling, multi-modal support, JSON mode, speculative decoding, and embeddings (see the embedding sketch after this list).
- Web Server: Includes an OpenAI-compatible web server for easy local deployment and access from any client (launch command shown after this list).
- Active Development: The project is actively maintained and welcomes contributions, ensuring continuous improvement and new features.
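To make the last two points concrete: embeddings only require constructing the model with embedding=True, as in this minimal sketch (model path reused from the earlier example):
from llama_cpp import Llama

# Load the model in embedding mode and embed a string
llm = Llama(model_path="./models/7B/llama-model.gguf", embedding=True)
embeddings = llm.create_embedding("Hello, world!")
And the web server ships as an installable extra; once installed, it can be launched from the command line:
pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf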
Links
- GitHub Repository: https://github.com/abetlen/llama-cpp-python
- Official Documentation: https://llama-cpp-python.readthedocs.io/en/latest