TextDistance: A Comprehensive Python Library for Sequence Distance Calculation

Introduction

TextDistance is a Python library for computing the distance and similarity between sequences. It implements more than 30 algorithms behind a single common interface, and it works on strings, lists, and other sequences alike.

Key features include:

Extensive Algorithm Collection: Over 30 algorithms covering edit-based, token-based, sequence-based, compression-based, phonetic, and simple comparisons.
Pure Python Implementation: Ensures compatibility and ease of use across various environments.
Common Interface: All algorithms share a consistent API for calculating distance, similarity, normalized distance, and normalized similarity.
Optional External Libraries: For maximum speed, TextDistance can leverage highly optimized external C-based libraries like jellyfish and Levenshtein if installed.
Multi-sequence Comparison: Supports comparing more than two sequences simultaneously.
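To make the multi-sequence idea concrete, here is a minimal standard-library sketch of a Hamming-style count generalized to any number of sequences. It is an illustration under stated assumptions, not TextDistance's own code: the helper name hamming_many is hypothetical, and it assumes that a position counts as a mismatch whenever the sequences do not all agree there.

```python
from itertools import zip_longest


def hamming_many(*sequences):
    """Count positions at which the given sequences do not all agree.

    Hypothetical helper for illustration only. Shorter sequences are
    padded with None, so a length difference counts as a mismatch.
    Elements must be hashable because each column is turned into a set.
    """
    return sum(
        len(set(column)) > 1  # more than one distinct value in this column
        for column in zip_longest(*sequences)
    )


print(hamming_many('test', 'text'))                    # 1
print(hamming_many('test', 'text', 'tent'))            # 1
print(hamming_many([1, 2, 3], [1, 2, 4], [1, 5, 4]))   # 2
```

The same function handles strings and lists because it only relies on iteration and equality of elements.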

Installation

Installing TextDistance is straightforward. You can choose between a pure Python version or include optional extra libraries for enhanced performance.

Stable Version (Pure Python):

pip install textdistance

Stable Version (With Extra Libraries for Maximum Speed):

pip install "textdistance[extras]"

Development Version:

You can install the development version directly from GitHub:

pip install -e git+https://github.com/life4/textdistance.git#egg=textdistance

Alternatively, clone the repository and install with benchmark extras:

git clone https://github.com/life4/textdistance.git
cd textdistance
pip install -e ".[benchmark]"

Examples

TextDistance provides a simple and consistent interface across all its algorithms. Here's an example using the Hamming distance:

import textdistance

# Calculate Hamming distance
print(textdistance.hamming('test', 'text'))
# Output: 1

# Using the explicit distance method
print(textdistance.hamming.distance('test', 'text'))
# Output: 1

# Calculate similarity
print(textdistance.hamming.similarity('test', 'text'))
# Output: 3

# Calculate normalized distance (0 to 1, 0 means equal)
print(textdistance.hamming.normalized_distance('test', 'text'))
# Output: 0.25

# Calculate normalized similarity (0 to 1, 1 means equal)
print(textdistance.hamming.normalized_similarity('test', 'text'))
# Output: 0.75

# Using q-grams for comparison (e.g., qval=2 for bigrams)
print(textdistance.Hamming(qval=2).distance('test', 'text'))
# Output: 2

This consistent API makes it easy to experiment with different algorithms and find the best fit for your specific use case.
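The numbers in the example above can be reproduced by hand. The following standard-library sketch shows one way the four quantities and the q-gram option could relate for a Hamming-style comparison. All function names here are hypothetical, and the sketch assumes that the maximum possible distance is the length of the longer sequence; it is not the library's implementation.

```python
from itertools import zip_longest


def qgrams(sequence, qval=1):
    """Split a sequence into overlapping chunks of length qval."""
    return [sequence[i:i + qval] for i in range(len(sequence) - qval + 1)]


def distance(a, b, qval=1):
    """Number of positions where the q-grams of a and b differ."""
    pairs = zip_longest(qgrams(a, qval), qgrams(b, qval))
    return sum(x != y for x, y in pairs)


def maximum(a, b, qval=1):
    """Largest possible distance: the longer q-gram list (an assumption)."""
    return max(len(qgrams(a, qval)), len(qgrams(b, qval)))


def similarity(a, b, qval=1):
    """Positions that match, i.e. maximum minus distance."""
    return maximum(a, b, qval) - distance(a, b, qval)


def normalized_distance(a, b, qval=1):
    """Distance scaled to the range 0..1 (0 means equal)."""
    longest = maximum(a, b, qval)
    return distance(a, b, qval) / longest if longest else 0.0


def normalized_similarity(a, b, qval=1):
    """Similarity scaled to the range 0..1 (1 means equal)."""
    return 1 - normalized_distance(a, b, qval)


print(distance('test', 'text'))               # 1
print(similarity('test', 'text'))             # 3
print(normalized_distance('test', 'text'))    # 0.25
print(normalized_similarity('test', 'text'))  # 0.75
print(distance('test', 'text', qval=2))       # 2: te/es/st vs te/ex/xt
```

With qval=2 the words become the bigram lists te, es, st and te, ex, xt, which differ in two positions; that is why the bigram distance is 2 while the character distance is 1.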

Why Use TextDistance?

TextDistance offers compelling reasons for developers and data scientists:

Broad Algorithm Coverage: With over 30 algorithms, it covers most sequence comparison scenarios, from simple character differences to phonetic matching.
Optimized Performance: TextDistance falls back to pure Python, but it uses faster C-based external libraries when they are installed. The project's benchmarks show the external implementations running substantially faster, which matters for performance-critical applications.
Simplified Development: Because every algorithm exposes the same API, there is only one interface to learn, and you can switch between algorithms with minimal code changes.
Usefulness for Data Tasks: It is well suited to data cleaning, deduplication, fuzzy matching, natural language processing, and bioinformatics, where accurate and efficient sequence comparison is crucial.
Active Development and Community: The project is actively maintained, welcoming contributions and ensuring its continued evolution.
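As an illustration of switching algorithms with minimal code changes in a fuzzy-matching task, the sketch below takes the distance function as a parameter. The names levenshtein and closest are hypothetical helpers written for this example with the standard library only; they are not part of TextDistance.

```python
def levenshtein(a, b):
    """Pure-Python edit distance (insert, delete, substitute).

    Uses the classic dynamic-programming recurrence, keeping only
    two rows of the table in memory.
    """
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def closest(query, candidates, distance=levenshtein):
    """Return the candidate with the smallest distance to query.

    Any callable taking two sequences and returning a number can be
    passed as distance, so the algorithm is swapped in one place.
    """
    return min(candidates, key=lambda candidate: distance(query, candidate))


print(levenshtein('kitten', 'sitting'))                     # 3
print(closest('bananna', ['banana', 'bandana', 'cabana']))  # banana
```

With TextDistance installed, a bound method such as textdistance.levenshtein.distance or textdistance.jaro_winkler.distance could presumably be passed as the distance argument instead of the helper above, since all algorithms share the same method names.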
