Lance: Modern Columnar Data Format for ML and LLMs

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

Lance is a modern columnar data format, implemented in Rust, designed for machine learning and large language model workflows. It offers significant performance improvements over Parquet for random access, includes vector indexing, and supports data versioning. Compatible with popular tools like Pandas, DuckDB, and PyTorch, Lance streamlines data management for ML applications.

Repository Information

Analyzed by OSRepos on November 1, 2025

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

Lance is a modern columnar data format, implemented in Rust and optimized for machine learning (ML) and large language model (LLM) workflows. It stores and manages data efficiently and outperforms traditional formats such as Parquet for specific use cases, most notably random access. Lance offers up to 100x faster random access than Parquet, built-in vector indexing, and data versioning. It is compatible with popular data science tools, including Pandas, DuckDB, Polars, PyArrow, and PyTorch, and more integrations are being added.

Key features of Lance include:

  • High-performance random access: Up to 100x faster than Parquet without compromising scan performance.
  • Vector search: Perform nearest neighbor searches in milliseconds, combining OLAP queries with vector search.
  • Zero-copy, automatic versioning: Effortlessly manage data versions without additional infrastructure.
  • Ecosystem integrations: Seamlessly works with Apache Arrow, Pandas, Polars, DuckDB, Ray, Spark, and more.

Installation

To get started with Lance, you can install the Python bindings using pip:

pip install pylance

For access to the latest features and bug fixes, you can install a preview release:

pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
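
Note that the PyPI package is named pylance, while the module imported in the examples on this page is lance. As a quick sanity check after installing, a small standard-library sketch like the following reports which of the modules used on this page can be found in the current environment. The helper name is_importable is hypothetical and only for illustration.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# The PyPI package is `pylance`, but the module you import is `lance`.
# The other names are the libraries used in the examples on this page.
for name in ("lance", "pyarrow", "pandas", "duckdb"):
    status = "available" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If lance is reported as missing, the pip command above was probably run against a different Python environment than the one executing the script.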

Examples

Here are some quick examples to demonstrate how to use Lance.

Converting to Lance

You can convert existing data, for example a Parquet dataset, into the Lance format:

import lance
import pandas as pd
import pyarrow as pa
import pyarrow.dataset

# Write a small DataFrame out as a Parquet dataset
df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

# Read the Parquet dataset back and write it out in the Lance format
parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

Once converted, reading data from a Lance dataset is straightforward:

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Using with Pandas

You can easily convert a Lance dataset to a Pandas DataFrame:

df = dataset.to_table().to_pandas()
print(df)

Using with DuckDB

Lance integrates well with DuckDB for SQL-based queries:

import duckdb

# DuckDB resolves `dataset` to the Python variable of the same name defined above
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector Search

Lance provides powerful capabilities for vector search. After building an index, you can query for nearest neighbors:

# Assumes `dataset` is a Lance dataset with a vector index built on the "vector" column,
# `q` is a single query vector, and `query_vectors` is a sequence of query vectors.

# Get the 10 nearest neighbors for a single query vector
rs = dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
print(rs)

# Get the 10 nearest neighbors for each of several query vectors
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
      for q in query_vectors]
Why Use Lance?

Lance addresses critical challenges in the ML development cycle by offering a unified data format that excels across various stages, from data collection and exploration to feature engineering and training. Unlike traditional approaches that often require multiple data transformations and syncing copies, Lance aims to reduce data silos and streamline workflows.

Key advantages include:

  • Optimized for ML Workloads: Designed from the ground up for the unique demands of machine learning datasets, including deeply nested data, images, and point clouds.
  • Performance: Achieves superior performance for random access and vector search compared to formats like Parquet, crucial for large-scale ML training and real-time inference.
  • Data Management: Features like zero-copy versioning and rich secondary indices simplify data governance and experimentation.
  • Ecosystem Compatibility: Its strong integration with the Apache Arrow ecosystem ensures broad compatibility with existing data tools.
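
The idea behind zero-copy versioning can be illustrated with a toy in-memory model: each version is a list of references to immutable chunks of rows, so creating a new version adds references instead of copying existing data. This is a conceptual sketch under that assumption, not Lance's on-disk format or API. The Fragment and VersionedDataset classes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    """An immutable chunk of rows; never rewritten once created."""
    rows: tuple


@dataclass
class VersionedDataset:
    # Each version is a tuple of references to fragments, not a copy of the rows.
    versions: list = field(default_factory=list)

    def append(self, rows):
        """Create a new version that reuses all earlier fragments."""
        previous = self.versions[-1] if self.versions else ()
        self.versions.append(previous + (Fragment(tuple(rows)),))
        return len(self.versions)  # 1-based version number

    def read(self, version=None):
        """Return the rows visible at a version (latest when omitted)."""
        fragments = self.versions[(version or len(self.versions)) - 1]
        return [row for frag in fragments for row in frag.rows]


ds = VersionedDataset()
ds.append([{"a": 5, "b": 10}])   # version 1
ds.append([{"a": 6, "b": 12}])   # version 2
print(ds.read(version=1))
print(ds.read())
# Both versions point at the very same first fragment object.
print(ds.versions[0][0] is ds.versions[1][0])
```

Because old versions stay readable and share storage with new ones, experiments can pin a specific version without keeping a separate copy of the dataset.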

Lance is already used in production in LanceDB and LanceDB Enterprise, and by leading multimodal generative AI companies, self-driving car companies, and e-commerce platforms, for workloads such as training on petabyte-scale multimodal data and billion-scale personalized vector search.
