Repository History

Explore all analyzed open source repositories

Topic: NLP
Jieba: The Leading Python Library for Chinese Text Segmentation

Jieba is a widely used Python library for Chinese text segmentation. It offers several cutting modes, including accurate, full, and search engine modes, which suit different NLP tasks. It also supports custom dictionaries and part-of-speech tagging for processing Chinese text.

Mar 31, 2026
LLMBox: A Comprehensive Python Library for LLM Training and Evaluation

LLMBox is a Python library for implementing Large Language Models, offering a unified training pipeline and extensive model evaluation capabilities. It aims to be a one-stop solution for both training and using LLMs, with an emphasis on flexibility and efficiency. Developers can draw on its range of training strategies and fast inference for their LLM projects.

Mar 16, 2026
NUDGE: Lightweight Non-Parametric Embedding Fine-Tuning for Retrieval

NUDGE is a lightweight, non-parametric tool designed to fine-tune pre-trained embeddings, significantly enhancing retrieval and RAG pipelines. It operates by adjusting data embeddings directly, rather than modifying model parameters, to maximize accuracy. This approach often leads to over 10% improvement in retrieval accuracy and runs in minutes.
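
To make the idea of adjusting data embeddings directly more concrete, here is a minimal sketch using only the standard library. It is not the NUDGE API or its actual optimization procedure; it is a simplified illustration that assumes each training query has one known relevant document, moves that document's embedding a fixed step toward the query, and renormalizes. The names `nudge_embeddings`, `top1`, and the `step` parameter are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nudge_embeddings(doc_embs, train_pairs, step=1.0):
    """Move each relevant document embedding toward its training query.

    train_pairs is a list of (query_embedding, relevant_doc_index).
    No model parameters are involved; only the stored vectors change.
    """
    updated = [list(d) for d in doc_embs]
    for query, idx in train_pairs:
        updated[idx] = [d + step * q for d, q in zip(updated[idx], query)]
    return [normalize(d) for d in updated]


def top1(query, doc_embs):
    """Index of the document with the highest dot-product similarity."""
    return max(range(len(doc_embs)), key=lambda i: dot(query, doc_embs[i]))


docs = [[1.0, 0.0], [0.0, 1.0]]
query = [0.6, 0.8]            # assume document 0 is the relevant one
before = top1(query, docs)
after = top1(query, nudge_embeddings(docs, [(query, 0)]))
print(before, after)
```

In this toy setup the relevant document is initially outranked and moves to the top after the adjustment. A real method would also bound how far embeddings may move so that unseen queries are not hurt.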

Mar 4, 2026
LLMSanitize: An Open-Source Library for Contamination Detection in NLP and LLM Datasets

LLMSanitize is an open-source Python library designed for detecting contamination in NLP datasets and Large Language Models (LLMs). It offers a comprehensive suite of methods, ranging from string matching to model likelihood and embedding similarity, to ensure data integrity. This tool is crucial for researchers and developers working with LLMs to maintain the reliability of their models and evaluations.
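
As an illustration of the simplest family mentioned above, string matching, the following sketch measures how many word n-grams of an evaluation sample also appear in a training corpus. It is a generic standard-library example, not LLMSanitize's own API; the function names and the choice of whitespace tokenization and `n=3` are assumptions made for the example.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(eval_text, train_texts, n=3):
    """Fraction of the evaluation sample's n-grams found in the training texts."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return len(eval_grams & train_grams) / len(eval_grams)


train = ["the quick brown fox jumps over the lazy dog"]
sample = "quick brown fox jumps over a fence"
ratio = overlap_ratio(sample, train)
print(ratio)
```

A high ratio suggests the evaluation sample may have leaked into the training data; the threshold for flagging a sample is a judgment call that depends on the dataset.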

Feb 9, 2026
LLM Reasoners: Advanced Library for Large Language Model Reasoning

LLM Reasoners is a Python library for improving the complex reasoning capabilities of Large Language Models. It offers a suite of search algorithms, visualization tools, and optimized performance for efficient LLM inference. The library prioritizes rigorous implementation and reproducibility, which makes it suitable for researchers and developers in the AI field.

Feb 2, 2026
TextMachina: A Python Framework for MGT Dataset Generation

TextMachina is a modular and extensible Python framework designed for creating high-quality, unbiased datasets for Machine-Generated Text (MGT) tasks. It supports detection, attribution, and boundary detection, offering a user-friendly pipeline with LLM integrations, prompt templating, and bias mitigation. This tool streamlines the process of building robust models for understanding and identifying AI-generated content.

Dec 21, 2025
sumy: Automatic Text Summarization for Documents and HTML Pages

sumy is a Python module for automatic summarization of text documents and HTML pages. It provides several summarization methods, supports multiple natural languages, and offers both a command-line utility and a Python API. Users can apply it to extract concise summaries from lengthy content.

Dec 14, 2025
Toolkit-for-Prompt-Compression: A Unified Toolkit for LLM Prompt Compression

PCToolkit (Toolkit-for-Prompt-Compression) is a unified, plug-and-play toolkit for prompt compression in Large Language Models (LLMs). It provides state-of-the-art compression methods, diverse datasets, and metrics for evaluating performance. Its modular design simplifies condensing input prompts while preserving crucial information.

Dec 13, 2025
Judgy: Correcting LLM Judge Bias for Reliable AI Model Evaluation

Judgy is a Python package designed to improve the reliability of evaluations performed by LLM-as-Judges. It provides tools to estimate the true success rate of a system by correcting for LLM judge bias and generating confidence intervals through bootstrapping. This helps ensure more accurate and trustworthy assessments of AI model performance.
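
The correction described above can be sketched with a standard prevalence-correction formula. If the judge's true positive rate (TPR) and true negative rate (TNR) are measured on a small human-labeled set, the observed pass rate p_obs relates to the true rate θ by p_obs = θ·TPR + (1 − θ)·(1 − TNR), which can be solved for θ. The code below is a standard-library sketch of that idea with a percentile bootstrap; it is not Judgy's actual API, and all function names and parameters are hypothetical.

```python
import random


def corrected_rate(p_obs, tpr, tnr):
    """Solve p_obs = theta*tpr + (1-theta)*(1-tnr) for theta, clipped to [0, 1]."""
    denom = tpr + tnr - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    return min(1.0, max(0.0, (p_obs + tnr - 1.0) / denom))


def judge_rates(human, judge):
    """Estimate TPR and TNR of the judge from paired 0/1 labels."""
    pos = [j for h, j in zip(human, judge) if h == 1]
    neg = [j for h, j in zip(human, judge) if h == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative human labels")
    return sum(pos) / len(pos), sum(1 - j for j in neg) / len(neg)


def bootstrap_ci(human, judge, unlabeled, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the corrected success rate."""
    rng = random.Random(seed)
    pairs = list(zip(human, judge))
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(pairs, k=len(pairs))
        preds = rng.choices(unlabeled, k=len(unlabeled))
        try:
            tpr, tnr = judge_rates(*zip(*sample))
            estimates.append(corrected_rate(sum(preds) / len(preds), tpr, tnr))
        except ValueError:
            continue  # skip degenerate resamples
    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    lo = estimates[int((1 - level) / 2 * len(estimates))]
    hi = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lo, hi


# Judge with TPR 0.9 and TNR 0.8 reports a 62% pass rate.
point = corrected_rate(0.62, 0.9, 0.8)
print(round(point, 3))
```

Here the raw 62% pass rate corresponds to a corrected estimate of about 60%, because the judge's false positives inflate the observed figure slightly more than its false negatives deflate it.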

Dec 7, 2025
txtinstruct: Building Instruction-Tuned Models with Custom Data

txtinstruct is a Python framework designed for training instruction-tuned models. It focuses on supporting open data and models, enabling users to build their own instruction-following datasets and train models without licensing ambiguity. This project simplifies the process of creating custom instruction-tuned solutions.

Nov 23, 2025
llama-cpp-python: Python Bindings for llama.cpp

llama-cpp-python provides Python bindings for the llama.cpp library, enabling efficient local inference with large language models. It offers a high-level API compatible with OpenAI's API, which eases integration into existing applications. The project also includes a web server for local deployment and supports various hardware acceleration backends.

Nov 11, 2025
python-ftfy: Effortlessly Fixing Mojibake and Unicode Glitches

ftfy is a powerful Python library designed to automatically correct "mojibake" and other common glitches in Unicode text. It intelligently detects and fixes encoding mix-ups, transforming unreadable characters into their intended form. This tool is essential for developers and data scientists working with messy text data, ensuring readability and data integrity.
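
The kind of encoding mix-up described above can be reproduced with the standard library alone. The sketch below does not use ftfy; it shows one common case by hand (UTF-8 bytes wrongly decoded as Latin-1) and its manual reversal. The helper names are hypothetical, and unlike this sketch, ftfy works out which mix-up occurred on its own.

```python
def make_mojibake(text):
    """Simulate the bug: encode as UTF-8, then wrongly decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_latin1_mojibake(text):
    """Reverse that specific mix-up; return the input unchanged if it does not fit."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = make_mojibake("café")
repaired = undo_latin1_mojibake(broken)
print(broken)
print(repaired)
```

The round trip only works when the garbling was a single, lossless misdecoding; text that is already correct, such as a clean "café", fails the reverse decoding step and is left alone.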

Oct 21, 2025