{"name":"Lance: Modern Columnar Data Format for ML and LLMs","description":"Lance is a modern columnar data format, implemented in Rust, designed for machine learning and large language model workflows. It offers significant performance improvements over Parquet for random access, includes vector indexing, and supports data versioning. Compatible with popular tools like Pandas, DuckDB, and PyTorch, Lance streamlines data management for ML applications.","github":"https://github.com/lancedb/lance","url":"https://osrepos.com/repo/lancedb-lance","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/lancedb-lance","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/lancedb-lance.md","json":"https://osrepos.com/repo/lancedb-lance.json","topics":["Rust","Data Format","Machine Learning","LLMs","Vector Search","Data Science","Apache Arrow","Data Analytics"],"keywords":["Rust","Data Format","Machine Learning","LLMs","Vector Search","Data Science","Apache Arrow","Data Analytics"],"stars":null,"summary":"Lance is a modern columnar data format, implemented in Rust, designed for machine learning and large language model workflows. It offers significant performance improvements over Parquet for random access, includes vector indexing, and supports data versioning. Compatible with popular tools like Pandas, DuckDB, and PyTorch, Lance streamlines data management for ML applications.","content":"## Introduction\n\nLance is a modern columnar data format, implemented in Rust, specifically optimized for machine learning (ML) and large language model (LLM) workflows. It provides a highly efficient way to store and manage data, offering significant performance advantages over traditional formats like Parquet for specific use cases. 
With Lance, you can achieve 100x faster random access, integrate vector indexing, and leverage robust data versioning capabilities. It is designed to be compatible with a wide range of popular data science tools, including Pandas, DuckDB, Polars, Pyarrow, and PyTorch, with more integrations continuously being added.\n\nKey features of Lance include:\n*   **High-performance random access:** Up to 100x faster than Parquet without compromising scan performance.\n*   **Vector search:** Perform nearest neighbor searches in milliseconds, combining OLAP queries with vector search.\n*   **Zero-copy, automatic versioning:** Effortlessly manage data versions without additional infrastructure.\n*   **Ecosystem integrations:** Seamlessly works with Apache Arrow, Pandas, Polars, DuckDB, Ray, Spark, and more.\n\n## Installation\n\nTo get started with Lance, you can install the Python bindings using pip:\n\n```shell\npip install pylance\n```\n\nFor access to the latest features and bug fixes, you can install a preview release:\n\n```shell\npip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance\n```\n\n## Examples\n\nHere are some quick examples to demonstrate how to use Lance.\n\n### Converting to Lance\n\nYou can easily convert existing data, for example, from Parquet, into the Lance format:\n\n```python\nimport lance\nimport pandas as pd\nimport pyarrow as pa\nimport pyarrow.dataset\n\ndf = pd.DataFrame({\"a\": [5], \"b\": [10]})\nuri = \"/tmp/test.parquet\"\ntbl = pa.Table.from_pandas(df)\npa.dataset.write_dataset(tbl, uri, format='parquet')\n\nparquet = pa.dataset.dataset(uri, format='parquet')\nlance.write_dataset(parquet, \"/tmp/test.lance\")\n```\n\n### Reading Lance data\n\nOnce converted, reading data from a Lance dataset is straightforward:\n\n```python\ndataset = lance.dataset(\"/tmp/test.lance\")\nassert isinstance(dataset, pa.dataset.Dataset)\n```\n\n### Using with Pandas\n\nYou can easily convert a Lance dataset to a Pandas DataFrame:\n\n```python\ndf = 
dataset.to_table().to_pandas()\nprint(df)\n```\n\n### Using with DuckDB\n\nLance integrates well with DuckDB for SQL-based queries:\n\n```python\nimport duckdb\n\nduckdb.query(\"SELECT * FROM dataset LIMIT 10\").to_df()\n```\n\n### Vector Search\n\nLance provides powerful capabilities for vector search. After building an index, you can query for nearest neighbors:\n\n```python\n# Assuming 'sift1m' is a Lance dataset with a vector index built on the 'vector' column\n# and 'query_vectors' is a list of vectors to search for.\n\n# Get nearest neighbors for a query vector\n# For example, for a single query vector 'q':\n# rs = dataset.to_table(nearest={\"column\": \"vector\", \"k\": 10, \"q\": q})\n# print(rs)\n\n# To get nearest neighbors for multiple query vectors, as shown in the original README:\n# rs = [dataset.to_table(nearest={\"column\": \"vector\", \"k\": 10, \"q\": q})\n#       for q in query_vectors]\n```\n\n## Why Use Lance?\n\nLance addresses critical challenges in the ML development cycle by offering a unified data format that excels across various stages, from data collection and exploration to feature engineering and training. 
Unlike traditional approaches that often require multiple data transformations and syncing copies, Lance aims to reduce data silos and streamline workflows.\n\nKey advantages include:\n*   **Optimized for ML Workloads:** Designed from the ground up for the unique demands of machine learning datasets, including deeply nested data, images, and point clouds.\n*   **Performance:** Achieves superior performance for random access and vector search compared to formats like Parquet, crucial for large-scale ML training and real-time inference.\n*   **Data Management:** Features like zero-copy versioning and rich secondary indices simplify data governance and experimentation.\n*   **Ecosystem Compatibility:** Its strong integration with the Apache Arrow ecosystem ensures broad compatibility with existing data tools.\n\nLance is already used in production by various organizations, including LanceDB, LanceDB Enterprise, leading multimodal Gen AI companies, self-driving car companies, and e-commerce platforms for petabyte-scale multimodal data training and billion-scale vector personalized search.\n\n## Links\n\n*   **GitHub Repository:** [https://github.com/lancedb/lance](https://github.com/lancedb/lance){:target=\"_blank\"}\n*   **Documentation:** [https://lancedb.github.io/lance/](https://lancedb.github.io/lance/){:target=\"_blank\"}\n*   **Blog:** [https://blog.lancedb.com/](https://blog.lancedb.com/){:target=\"_blank\"}\n*   **Discord:** [https://discord.gg/zMM32dvNtd](https://discord.gg/zMM32dvNtd){:target=\"_blank\"}\n*   **X (Twitter):** [https://x.com/lancedb](https://x.com/lancedb){:target=\"_blank\"}","metrics":{"detailViews":6,"githubClicks":6},"dates":{"published":null,"modified":"2025-11-01T16:01:46.000Z"}}