TextDistance: A Comprehensive Python Library for Sequence Distance Calculation

Summary
TextDistance is a powerful Python library designed to compute the distance and similarity between sequences using over 30 different algorithms. It offers a pure Python implementation with a common, easy-to-use interface, and can optionally leverage external libraries for maximum performance. This tool is ideal for tasks requiring robust string comparison, such as fuzzy matching and data cleaning.
Introduction
TextDistance computes the distance and similarity between sequences, whether they are strings, lists, or other sequence types. It implements over 30 algorithms, all accessible through one common interface.
Key features include:
- Extensive Algorithm Collection: Over 30 algorithms covering edit-based, token-based, sequence-based, compression-based, phonetic, and simple comparisons.
- Pure Python Implementation: Ensures compatibility and ease of use across various environments.
- Common Interface: All algorithms share a consistent API for calculating distance, similarity, normalized distance, and normalized similarity.
- Optional External Libraries: For maximum speed, TextDistance can use optimized external C-based libraries such as jellyfish and Levenshtein, if they are installed.
- Multi-sequence Comparison: Supports comparing more than two sequences at once.
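To make the multi-sequence feature concrete, here is a minimal pure-Python sketch of a Hamming-style distance over any number of sequences. It is not TextDistance's own source: the function name `hamming_multi` and the rule "a position counts if the sequences do not all agree there" are assumptions chosen for illustration, so check the library's documentation for its exact behaviour.

```python
from itertools import zip_longest


def hamming_multi(*sequences):
    """Count positions at which the given sequences do not all agree.

    Hypothetical helper for illustration only, not part of TextDistance.
    Elements must be hashable (characters of a string are).
    """
    mismatches = 0
    # zip_longest pads shorter sequences with None, so a length
    # difference also counts as a mismatch at the padded positions.
    for column in zip_longest(*sequences):
        if len(set(column)) > 1:
            mismatches += 1
    return mismatches


# Only the third position differs across the three words.
print(hamming_multi("test", "text", "tent"))  # 1
```

With two arguments the sketch reduces to the usual pairwise Hamming distance, which is why a single function can serve both cases.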
Installation
Installing TextDistance is straightforward. You can choose between a pure Python version or include optional extra libraries for enhanced performance.
Stable Version (Pure Python):
pip install textdistance
Stable Version (With Extra Libraries for Maximum Speed):
pip install "textdistance[extras]"
Development Version:
You can install the development version directly from GitHub:
pip install -e git+https://github.com/life4/textdistance.git#egg=textdistance
Alternatively, clone the repository and install with benchmark extras:
git clone https://github.com/life4/textdistance.git
cd textdistance
pip install -e ".[benchmark]"
Examples
TextDistance provides a simple and consistent interface across all its algorithms. Here's an example using the Hamming distance:
import textdistance
# Calculate Hamming distance
print(textdistance.hamming('test', 'text'))
# Output: 1
# Using the explicit distance method
print(textdistance.hamming.distance('test', 'text'))
# Output: 1
# Calculate similarity
print(textdistance.hamming.similarity('test', 'text'))
# Output: 3
# Calculate normalized distance (0 to 1, 0 means equal)
print(textdistance.hamming.normalized_distance('test', 'text'))
# Output: 0.25
# Calculate normalized similarity (0 to 1, 1 means equal)
print(textdistance.hamming.normalized_similarity('test', 'text'))
# Output: 0.75
# Using q-grams for comparison (e.g., qval=2 for bigrams)
print(textdistance.Hamming(qval=2).distance('test', 'text'))
# Output: 2
This consistent API makes it easy to experiment with different algorithms and find the best fit for your specific use case.
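The four methods in the example are related in a simple way, and the following pure-Python sketch spells that relationship out for Hamming distance. It is an illustration under stated assumptions, not the library's implementation: the class `HammingSketch` and the helper `qgrams` are hypothetical names, and normalizing by the longer q-gram list is a choice made here that may differ from what TextDistance does internally.

```python
def qgrams(seq, qval=1):
    """Split a sequence into overlapping q-grams (qval=1 gives single elements)."""
    if qval == 1:
        return list(seq)
    return [seq[i:i + qval] for i in range(len(seq) - qval + 1)]


class HammingSketch:
    """Illustrative stand-in, not the TextDistance class."""

    def __init__(self, qval=1):
        self.qval = qval

    def maximum(self, a, b):
        # Largest possible distance: every position mismatches.
        return max(len(qgrams(a, self.qval)), len(qgrams(b, self.qval)))

    def distance(self, a, b):
        ga, gb = qgrams(a, self.qval), qgrams(b, self.qval)
        mismatched = sum(x != y for x, y in zip(ga, gb))
        # Positions present in only one sequence count as mismatches.
        return mismatched + abs(len(ga) - len(gb))

    def similarity(self, a, b):
        return self.maximum(a, b) - self.distance(a, b)

    def normalized_distance(self, a, b):
        maximum = self.maximum(a, b)
        return self.distance(a, b) / maximum if maximum else 0.0

    def normalized_similarity(self, a, b):
        return 1 - self.normalized_distance(a, b)


h = HammingSketch()
print(h.distance("test", "text"))               # 1
print(h.similarity("test", "text"))             # 3
print(h.normalized_distance("test", "text"))    # 0.25
print(h.normalized_similarity("test", "text"))  # 0.75
# Bigrams: te/es/st versus te/ex/xt differ in two positions.
print(HammingSketch(qval=2).distance("test", "text"))  # 2
```

The sketch reproduces the outputs shown in the example: similarity is the maximum possible distance minus the distance, and the normalized values divide by that maximum.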
Why Use TextDistance?
TextDistance offers compelling reasons for developers and data scientists:
- Broad Coverage: With over 30 algorithms, it covers a wide range of sequence comparison needs, from simple character differences to phonetic matching.
- Optimized Performance: TextDistance falls back to pure Python but uses faster, C-optimized external libraries when they are installed. The project's benchmarks report speed improvements from these libraries, which matters for performance-critical applications.
- Simplified Development: The unified API across all algorithms drastically reduces the learning curve and development time. You can switch between algorithms with minimal code changes.
- Robustness for Data Tasks: It's an invaluable asset for data cleaning, deduplication, fuzzy matching, natural language processing, and bioinformatics, where accurate and efficient sequence comparison is crucial.
- Active Development and Community: The project is actively maintained, welcoming contributions and ensuring its continued evolution.
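As a rough sketch of the fuzzy matching and deduplication use case, the helper below picks the best candidate for a query using any scorer that returns a value between 0 and 1. The names `best_match` and `ratio` and the 0.6 threshold are hypothetical. The default scorer uses the standard library's `difflib` so the snippet runs without extra packages; a TextDistance normalized similarity method could be passed in its place.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer from the standard library, returning a value in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(query, candidates, scorer=ratio, threshold=0.6):
    """Return (candidate, score) for the highest-scoring candidate, or None.

    Hypothetical helper for illustration. `scorer` is any callable taking
    two sequences and returning a similarity in [0, 1], for example
    textdistance.levenshtein.normalized_similarity if the library is installed.
    """
    best = None
    for candidate in candidates:
        score = scorer(query, candidate)
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best


names = ["Jonathan Smith", "John Smyth", "Joan Smithers"]
print(best_match("Jon Smith", names))
```

Keeping the scorer as a parameter mirrors the unified API described above: swapping algorithms changes one argument rather than the matching logic.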
Links
- GitHub Repository: life4/textdistance
- PyPI Project Page: textdistance on PyPI
- Guide to Fuzzy Matching with Python: Read the article
- String similarity, the basic know your algorithms guide!: Read the article
- Normalized compression distance: Read the article