spacy-llm: Integrating LLMs into Structured NLP Pipelines with spaCy

Introduction

`spacy-llm` is a Python package that integrates Large Language Models (LLMs) into spaCy, a leading library for advanced Natural Language Processing. It provides a modular system for fast prototyping and prompting, and it turns unstructured LLM responses into robust, structured outputs for a variety of NLP tasks, often without requiring training data.

The package features a serializable `llm` component that plugs into your spaCy pipeline, along with modular functions to define specific tasks and models. It interfaces with major LLM APIs such as OpenAI, Cohere, Anthropic, Google PaLM, and Microsoft Azure AI. `spacy-llm` also supports a range of open-source LLMs hosted on Hugging Face, including Falcon, Dolly, Llama 2, OpenLLaMA, StableLM, and Mistral. It integrates with LangChain as well, so all LangChain models and features can be used within `spacy-llm`.

Out of the box, `spacy-llm` provides tasks for Named Entity Recognition, Text Classification, Lemmatization, Relation Extraction, Sentiment Analysis, Span Categorization, Summarization, Entity Linking, Translation, and raw prompt execution for maximum flexibility. Users can also implement their own custom functions for prompting, parsing, and model integrations via spaCy's registry. For prompts that exceed an LLM's context window, a map-reduce approach is available: the prompt is split into parts that fit, each part is processed separately, and the results are fused.
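
The map-reduce idea can be illustrated without calling any LLM. The following is a minimal sketch, not spacy-llm's actual implementation: `split_into_shards`, `fake_llm_ner`, and `map_reduce` are hypothetical names, a character budget stands in for a real token-based context limit, and a keyword lookup stands in for the model call.

```python
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (entity text, label)


def split_into_shards(text: str, max_chars: int) -> List[str]:
    """Greedily pack whitespace-separated words into shards of at most max_chars.

    A single word longer than max_chars still gets a shard of its own.
    """
    shards: List[str] = []
    current: List[str] = []
    length = 0
    for word in text.split():
        # +1 accounts for the joining space when the shard is non-empty
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            shards.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        shards.append(" ".join(current))
    return shards


def fake_llm_ner(shard: str) -> List[Entity]:
    """Hypothetical stand-in for an LLM call: looks words up in a tiny gazetteer."""
    known = {"Berlin": "LOC", "Paris": "LOC", "Ada": "PERSON"}
    words = [w.strip(".,") for w in shard.split()]
    return [(w, known[w]) for w in words if w in known]


def map_reduce(text: str, max_chars: int,
               task: Callable[[str], List[Entity]]) -> List[Entity]:
    """Map: run the task on every shard. Reduce: fuse results, dropping duplicates."""
    results: List[Entity] = []
    for shard in split_into_shards(text, max_chars):
        results.extend(task(shard))
    return list(dict.fromkeys(results))  # keeps first-seen order


text = "Ada moved from Paris to Berlin. Ada likes Berlin."
print(split_into_shards(text, 20))
print(map_reduce(text, 20, fake_llm_ner))
```

With a budget of 20 characters the text is split into three shards, and the entities found in each shard are merged into one de-duplicated list. A real setup would measure prompt length in tokens and would have to decide how to handle entities that straddle a shard boundary.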

Installation

To install `spacy-llm`, ensure you have `spacy` installed in your virtual environment, then run the following command:

```bash
python -m pip install spacy-llm
```

Examples

Here are a couple of quick examples to get started with `spacy-llm`.

In Python code

For quick experiments, you can use the following Python code to perform text classification with a GPT model from OpenAI. It requires a valid OpenAI API key, set in the `OPENAI_API_KEY` environment variable:

```python
import spacy

nlp = spacy.blank("en")
llm = nlp.add_pipe("llm_textcat")
llm.add_label("INSULT")
llm.add_label("COMPLIMENT")
doc = nlp("You look gorgeous!")
print(doc.cats)
# {"COMPLIMENT": 1.0, "INSULT": 0.0}
```

This example uses the `llm_textcat` factory, which leverages the latest version of the built-in text classification task and the default GPT-3.5 model from OpenAI.
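
`doc.cats` is a plain dictionary that maps each label to a score, so downstream code can treat it like any other dict. Below is a minimal sketch of picking a single label from it. `top_label` and its `threshold` parameter are hypothetical helpers, not part of `spacy-llm`, and the scores are copied from the example output above rather than produced by a live model call.

```python
from typing import Dict, Optional


def top_label(cats: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label, or None if no score reaches the threshold."""
    if not cats:
        return None
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score >= threshold else None


# Scores copied from the example output above (no model is called here)
cats = {"COMPLIMENT": 1.0, "INSULT": 0.0}
print(top_label(cats))
```

Returning `None` below the threshold gives the caller an explicit "no confident label" case instead of silently picking the least bad option.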

Using a config file

For more control over the parameters of the `llm` pipeline, use spaCy's config system. Create a `config.cfg` file like the one below:

```ini
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.TextCat.v3"
labels = ["COMPLIMENT", "INSULT"]

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
```

Then, run the following Python code to load and use your configured pipeline:

```python
from spacy_llm.util import assemble

nlp = assemble("config.cfg")
doc = nlp("You look gorgeous!")
print(doc.cats)
# {"COMPLIMENT": 1.0, "INSULT": 0.0}
```

Because the task and the model are declared in the config file, you can swap either one by editing `config.cfg` without changing the Python code.
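
For comparison, the same task and model settings can be passed as a nested dictionary to `nlp.add_pipe`, mirroring the `[components.llm.task]` and `[components.llm.model]` blocks of the config file. This is a sketch under assumptions: `spacy` and `spacy-llm` are installed, `OPENAI_API_KEY` is set in the environment, and `LLM_CONFIG` and `build_pipeline` are hypothetical names chosen for illustration. The import sits inside the function so that the dictionary can be inspected even where spaCy is not installed.

```python
# Mirrors [components.llm.task] and [components.llm.model] from config.cfg
LLM_CONFIG = {
    "task": {
        "@llm_tasks": "spacy.TextCat.v3",
        "labels": ["COMPLIMENT", "INSULT"],
    },
    "model": {
        "@llm_models": "spacy.GPT-4.v2",
    },
}


def build_pipeline():
    """Assemble a blank English pipeline with a single `llm` component.

    Assumes spacy and spacy-llm are installed and OPENAI_API_KEY is set.
    """
    import spacy  # imported lazily; spacy-llm registers the "llm" factory

    nlp = spacy.blank("en")
    nlp.add_pipe("llm", config=LLM_CONFIG)
    nlp.initialize()
    return nlp
```

Calling `build_pipeline()` and then reading `nlp("You look gorgeous!").cats` should behave like the config-file version. The config file remains the better fit once settings need to be versioned or shared.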

Why Use spacy-llm?

Large Language Models offer powerful natural language understanding, making them excellent for quickly prototyping custom NLP tasks with few or no examples. However, for production systems, supervised learning models often provide better efficiency, reliability, control, and accuracy for well-defined tasks.

`spacy-llm` lets you have both. You can initialize a pipeline with LLM-powered components for quick experimentation, then combine them with, or replace them by, spaCy's supervised or rule-based components as your project matures. This keeps the prototyping speed of LLMs while retaining the efficiency and control of a conventional spaCy pipeline. Even when an LLM is justified for a complex task, `spacy-llm` lets you combine it with other spaCy components, such as a cheaper text classification model that filters inputs or a rule-based system that validates outputs.
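
The layered setup described here (a cheap filter in front, the LLM in the middle, rules checking the output) can be sketched in plain Python. Everything in this block is illustrative: `cheap_filter`, `fake_llm_classify`, `validate`, and `classify` are hypothetical stand-ins, and the crude keyword checks replace what would be a trained classifier and a real LLM call in an actual pipeline.

```python
from typing import Dict, Optional

ALLOWED_LABELS = {"COMPLIMENT", "INSULT"}


def cheap_filter(text: str) -> bool:
    """Stand-in for an inexpensive classifier: only second-person texts pass."""
    return "you" in text.lower().split()


def fake_llm_classify(text: str) -> Dict[str, float]:
    """Stand-in for the LLM component; a real pipeline would call a model here."""
    positive = any(word in text.lower() for word in ("gorgeous", "great", "brilliant"))
    if positive:
        return {"COMPLIMENT": 1.0, "INSULT": 0.0}
    return {"COMPLIMENT": 0.0, "INSULT": 1.0}


def validate(cats: Dict[str, float]) -> bool:
    """Rule-based check: only known labels, every score within [0, 1]."""
    return set(cats) <= ALLOWED_LABELS and all(0.0 <= s <= 1.0 for s in cats.values())


def classify(text: str) -> Optional[Dict[str, float]]:
    """Run filter, (fake) LLM, and validation in order; None means 'skipped or rejected'."""
    if not cheap_filter(text):
        return None  # the expensive step is never reached
    cats = fake_llm_classify(text)
    return cats if validate(cats) else None


print(classify("You look gorgeous!"))
print(classify("The weather is nice."))
```

The second call returns `None` without ever reaching the expensive step, which is the point of putting a cheap filter first.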
