llama-cpp-python: Python Bindings for llama.cpp

Summary
llama-cpp-python provides Python bindings for the popular llama.cpp library, enabling efficient local inference with large language models. It offers a high-level API compatible with OpenAI's, which makes integration into existing applications straightforward. The project also includes an OpenAI-compatible web server for local deployment and supports several hardware acceleration backends.
Introduction
llama-cpp-python brings the power of llama.cpp to the Python ecosystem. It offers simple yet comprehensive Python bindings that let developers run large language models (LLMs) locally. The package provides both low-level access to the C API via ctypes and a high-level Python API for common tasks such as text completion, chat completion, and embeddings. With an OpenAI-like API and compatibility with LangChain and LlamaIndex, llama-cpp-python makes local LLM deployment and experimentation accessible to a broad audience.
Installation
Getting started with llama-cpp-python is straightforward. The primary method is installing via pip, which builds llama.cpp from source so the build is optimized for your system.
pip install llama-cpp-python
For basic CPU support, pre-built wheels are also available:
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
To leverage hardware acceleration such as CUDA, Metal (MPS), or OpenBLAS, set the CMAKE_ARGS environment variable during installation. For example, with CUDA:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
Detailed instructions for various backends and pre-built CUDA/Metal wheels can be found in the official documentation.
Examples
llama-cpp-python offers a high-level API designed for ease of use, mimicking the OpenAI API for familiar workflows.
Text Completion:
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    # n_gpu_layers=-1,  # Uncomment to use GPU acceleration
)
output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,      # cap the length of the generated answer
    stop=["Q:", "\n"],  # stop before the model begins a new question
    echo=True,          # include the prompt in the returned text
)
print(output)
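The call returns an OpenAI-style completion dictionary; the generated text itself lives under the choices key:
# Just the generated text
print(output["choices"][0]["text"])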
Chat Completion:
The API supports various chat formats, making it easy to interact with models designed for conversational AI.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/llama-2/llama-model.gguf",
    chat_format="llama-2",  # use the prompt template the model was trained with
)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": "Describe this image in detail please.",
        },
    ]
)
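The return value mirrors OpenAI's response schema, so the assistant's reply can be read from the choices field; passing stream=True instead yields incremental chunks. A minimal sketch of both (the joke prompt is just an illustration):
# Read the assistant's reply from the OpenAI-style response
print(response["choices"][0]["message"]["content"])

# Or stream tokens as they are generated
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)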
Multi-modal Models (e.g., LLaVA):
The library also supports multi-modal models, allowing for image and text input.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # increase the context window to accommodate the image embedding
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        },
    ]
)
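Local images work too: the documentation shows passing them as base64 data URIs in place of a remote URL. A small helper along those lines (the image/png media type here assumes a PNG file):
import base64

def image_to_base64_data_uri(file_path):
    # Encode a local image file as a data URI usable in an image_url entry
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode("utf-8")
        return f"data:image/png;base64,{base64_data}"

# e.g. {"type": "image_url", "image_url": {"url": image_to_base64_data_uri("image.png")}}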
Why Use It
llama-cpp-python stands out for several reasons:
- Local Inference: Run powerful LLMs directly on your machine, ensuring data privacy and reducing reliance on cloud APIs.
- OpenAI API Compatibility: Seamlessly integrate with existing applications built for the OpenAI API, minimizing code changes.
- Hardware Acceleration: Supports various backends like CUDA, Metal, OpenBLAS, and ROCm, optimizing performance on different hardware.
- Rich Feature Set: Beyond basic completion, it offers chat completion, function calling, multi-modal support, JSON mode, speculative decoding, and embeddings (see the embedding sketch after this list).
- Web Server: Includes an OpenAI-compatible web server for easy local deployment and access from any client (launch command shown after this list).
- Active Development: The project is actively maintained and welcomes contributions, ensuring continuous improvement and new features.
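To make the last two points concrete: embeddings only require constructing the model with embedding=True, as in this minimal sketch (model path reused from the earlier example):
from llama_cpp import Llama

# Load the model in embedding mode and embed a string
llm = Llama(model_path="./models/7B/llama-model.gguf", embedding=True)
embeddings = llm.create_embedding("Hello, world!")
And the web server ships as an installable extra; once installed, it can be launched from the command line:
pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf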
Links
- GitHub Repository: https://github.com/abetlen/llama-cpp-python
- Official Documentation: https://llama-cpp-python.readthedocs.io/en/latest