Jieba: The Leading Python Library for Chinese Text Segmentation

Summary

Jieba is a highly popular and efficient Python library designed for Chinese text segmentation. It offers various cutting modes, including accurate, full, and search engine modes, making it versatile for different NLP tasks. With features like custom dictionaries and part-of-speech tagging, Jieba provides a comprehensive solution for processing Chinese text.

Repository Info

Updated on March 31, 2026

Introduction

Jieba, meaning 'to stutter' in Chinese, is a widely acclaimed Python library built to be the best Chinese word segmentation module. With over 34,000 stars on GitHub, it's a cornerstone for natural language processing (NLP) tasks involving Chinese text. Jieba offers robust functionality, supporting multiple segmentation modes, traditional Chinese, and custom dictionaries, all under an MIT license.

Its core strength lies in its sophisticated algorithm, which combines a prefix dictionary for efficient word graph scanning, dynamic programming for maximum probability path finding, and an HMM-based model with the Viterbi algorithm for identifying unknown words.

Installation

Getting started with Jieba is straightforward. The library is compatible with both Python 2 and 3.

To install Jieba, you can use pip or easy_install:

pip install jieba
# or
easy_install jieba

If you plan to use the advanced PaddlePaddle-based segmentation and part-of-speech tagging features, you will also need to install paddlepaddle-tiny:

pip install paddlepaddle-tiny==1.6.1

Examples

Jieba provides several segmentation modes to suit different analytical needs. Here are some basic examples demonstrating its usage:

import jieba

text = "我来到北京清华大学"

# Full Mode: Gets all possible words from the sentence
seg_list_full = jieba.cut(text, cut_all=True)
print("Full Mode: " + "/ ".join(seg_list_full))

# Accurate Mode: Attempts to cut the sentence into the most precise segments
seg_list_accurate = jieba.cut(text, cut_all=False)
print("Accurate Mode: " + "/ ".join(seg_list_accurate))

# Search Engine Mode: Based on Accurate Mode, cuts long words into shorter ones
text_search = "小明硕士毕业于中国科学院计算所，后在日本京都大学深造"
seg_list_search = jieba.cut_for_search(text_search)
print("Search Engine Mode: " + ", ".join(seg_list_search))

# Part-of-Speech Tagging (example)
import jieba.posseg as pseg
words_pos = pseg.cut("我爱北京天安门")
print("Part-of-Speech Tagging:")
for word, flag in words_pos:
    print(f"  {word} {flag}")

Why Use Jieba?

Jieba stands out as a top choice for Chinese text segmentation for several compelling reasons:

  • Versatile Segmentation Modes: It offers accurate, full, search engine, and PaddlePaddle-based modes, providing flexibility for various NLP applications, from text analysis to search engine indexing.
  • High Accuracy and Robustness: By combining a prefix dictionary, dynamic programming, and an HMM model, Jieba achieves high accuracy, including effective recognition of unknown words.
  • Customization: Developers can easily load custom dictionaries and dynamically adjust word frequencies, ensuring better performance for domain-specific texts or handling ambiguities.
  • Rich Feature Set: Beyond basic segmentation, Jieba includes powerful tools for keyword extraction (TF-IDF and TextRank), part-of-speech tagging, and tokenization with position information.
  • Performance: It boasts impressive segmentation speeds, with options for parallel processing to further enhance performance on multi-core systems.
  • Active Community and Ecosystem: Its widespread adoption has led to implementations in various other programming languages, fostering a broad ecosystem and continuous improvement.

Links