# Lance: Modern Columnar Data Format for ML and LLMs

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Source: osrepos.com
Repository profile: https://osrepos.com/repo/lancedb-lance
Generated for open source discovery and AI-assisted research.

Lance is a modern columnar data format, implemented in Rust, designed for machine learning and large language model workflows. It offers significant performance improvements over Parquet for random access, includes vector indexing, and supports data versioning. Compatible with popular tools like Pandas, DuckDB, and PyTorch, Lance streamlines data management for ML applications.

GitHub: https://github.com/lancedb/lance

## Topics

- Rust
- Data Format
- Machine Learning
- LLMs
- Vector Search
- Data Science
- Apache Arrow
- Data Analytics

## Repository Information

Last analyzed by OSRepos: Sat Nov 01 2025 16:01:46 GMT+0000 (Western European Standard Time)

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Content

## Introduction

Lance is a modern columnar data format, implemented in Rust and optimized for machine learning (ML) and large language model (LLM) workflows. It stores and manages data efficiently and outperforms traditional formats such as Parquet for specific use cases, notably random access. Lance offers up to 100x faster random access than Parquet, built-in vector indexing, and data versioning. It is compatible with popular data science tools, including Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations being added.

Key features of Lance include:
*   **High-performance random access:** Up to 100x faster than Parquet without compromising scan performance.
*   **Vector search:** Perform nearest neighbor searches in milliseconds, combining OLAP queries with vector search.
*   **Zero-copy, automatic versioning:** Effortlessly manage data versions without additional infrastructure.
*   **Ecosystem integrations:** Seamlessly works with Apache Arrow, Pandas, Polars, DuckDB, Ray, Spark, and more.

## Installation

To get started with Lance, you can install the Python bindings using pip:

```shell
pip install pylance
```


For access to the latest features and bug fixes, you can install a preview release:

```shell
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
```


## Examples

Here are some quick examples to demonstrate how to use Lance.

### Converting to Lance

You can easily convert existing data, for example, from Parquet, into the Lance format:

```python
import lance
import pandas as pd
import pyarrow as pa
import pyarrow.dataset

# Write a small Parquet dataset to convert
df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

# Open the Parquet dataset and rewrite it in Lance format
parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```


### Reading Lance data

Once converted, reading data from a Lance dataset is straightforward:

```python
dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)
```


### Using with Pandas

You can easily convert a Lance dataset to a Pandas DataFrame:

```python
df = dataset.to_table().to_pandas()
print(df)
```


### Using with DuckDB

Lance integrates well with DuckDB for SQL-based queries:

```python
import duckdb

# DuckDB resolves `dataset` from the enclosing Python scope
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()
```


### Vector Search

Lance provides powerful capabilities for vector search. After building an index, you can query for nearest neighbors:

```python
# Assumes `dataset` is a Lance dataset with a vector index built on its 'vector' column,
# `q` is a single query vector, and `query_vectors` is a list of vectors to search for.

# Get the 10 nearest neighbors for a single query vector
rs = dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
print(rs)

# Get nearest neighbors for multiple query vectors, as in the original README
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
      for q in query_vectors]
```


## Why Use Lance?

Lance addresses critical challenges in the ML development cycle by offering a unified data format that excels across various stages, from data collection and exploration to feature engineering and training. Unlike traditional approaches that often require multiple data transformations and syncing copies, Lance aims to reduce data silos and streamline workflows.

Key advantages include:
*   **Optimized for ML Workloads:** Designed from the ground up for the unique demands of machine learning datasets, including deeply nested data, images, and point clouds.
*   **Performance:** Achieves superior performance for random access and vector search compared to formats like Parquet, crucial for large-scale ML training and real-time inference.
*   **Data Management:** Features like zero-copy versioning and rich secondary indices simplify data governance and experimentation.
*   **Ecosystem Compatibility:** Its strong integration with the Apache Arrow ecosystem ensures broad compatibility with existing data tools.

Lance is already used in production by LanceDB and LanceDB Enterprise, as well as by leading multimodal generative AI companies, self-driving car companies, and e-commerce platforms, for petabyte-scale multimodal training data and billion-scale personalized vector search.

## Links

*   **GitHub Repository:** [https://github.com/lancedb/lance](https://github.com/lancedb/lance)
*   **Documentation:** [https://lancedb.github.io/lance/](https://lancedb.github.io/lance/)
*   **Blog:** [https://blog.lancedb.com/](https://blog.lancedb.com/)
*   **Discord:** [https://discord.gg/zMM32dvNtd](https://discord.gg/zMM32dvNtd)
*   **X (Twitter):** [https://x.com/lancedb](https://x.com/lancedb)