Lance: Modern Columnar Data Format for ML and LLMs

Introduction

Lance is a modern columnar data format, implemented in Rust and optimized for machine learning (ML) and large language model (LLM) workflows. For workloads that depend on random access, it offers significant performance advantages over traditional formats like Parquet: up to 100x faster random access, built-in vector indexing, and automatic data versioning. It is compatible with popular data science tools, including Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations being added.

Key features of Lance include:

High-performance random access: Up to 100x faster than Parquet without compromising scan performance.
Vector search: Perform nearest neighbor searches in milliseconds, combining OLAP queries with vector search.
Zero-copy, automatic versioning: Effortlessly manage data versions without additional infrastructure.
Ecosystem integrations: Seamlessly works with Apache Arrow, Pandas, Polars, DuckDB, Ray, Spark, and more.

Installation

To get started with Lance, you can install the Python bindings using pip:

pip install pylance

For access to the latest features and bug fixes, you can install a preview release:

pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance

Examples

Here are some quick examples to demonstrate how to use Lance.

Converting to Lance

You can easily convert existing data, for example, from Parquet, into the Lance format:

import lance
import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

Once converted, reading data from a Lance dataset is straightforward:

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Using with Pandas

You can easily convert a Lance dataset to a Pandas DataFrame:

df = dataset.to_table().to_pandas()
print(df)

Using with DuckDB

Lance integrates well with DuckDB for SQL-based queries:

import duckdb

# DuckDB looks up `dataset` in the surrounding Python scope and scans it as an Arrow dataset.
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector Search

Lance provides powerful capabilities for vector search. After building an index, you can query for nearest neighbors:

# Assumes `sift1m` is a Lance dataset with a vector index built on the "vector" column
# and `query_vectors` is a list of query vectors of the same dimension.

# Nearest neighbors for a single query vector `q`:
q = query_vectors[0]
rs = sift1m.to_table(nearest={"column": "vector", "k": 10, "q": q})
print(rs)

# Nearest neighbors for several query vectors:
rs = [sift1m.to_table(nearest={"column": "vector", "k": 10, "q": q})
      for q in query_vectors]

Why Use Lance?

Lance addresses critical challenges in the ML development cycle by offering a unified data format that excels across various stages, from data collection and exploration to feature engineering and training. Unlike traditional approaches that often require multiple data transformations and syncing copies, Lance aims to reduce data silos and streamline workflows.

Key advantages include:

Optimized for ML Workloads: Designed from the ground up for the unique demands of machine learning datasets, including deeply nested data, images, and point clouds.
Performance: Achieves superior performance for random access and vector search compared to formats like Parquet, crucial for large-scale ML training and real-time inference.
Data Management: Features like zero-copy versioning and rich secondary indices simplify data governance and experimentation.
Ecosystem Compatibility: Its strong integration with the Apache Arrow ecosystem ensures broad compatibility with existing data tools.

Lance is already used in production by LanceDB and LanceDB Enterprise, as well as by leading multimodal Gen AI companies, self-driving car companies, and e-commerce platforms, for workloads such as petabyte-scale multimodal training and billion-scale personalized vector search.