TextMachina: A Python Framework for MGT Dataset Generation

Summary

TextMachina is a modular and extensible Python framework designed for creating high-quality, unbiased datasets for Machine-Generated Text (MGT) tasks such as detection, attribution, and boundary detection. It offers a user-friendly pipeline with LLM integrations, prompt templating, and bias mitigation, streamlining the construction of the datasets needed to train robust models for identifying AI-generated content.

Introduction

TextMachina is a modular and extensible Python framework designed to aid in the creation of high-quality, unbiased datasets. Such datasets are crucial for building robust models for Machine-Generated Text (MGT) tasks such as detection, attribution, boundary detection, and mixcase. The framework provides a unified approach to generating diverse datasets, abstracting away the complexities of working with different LLM providers while safeguarding data quality.

Installation

You can easily install TextMachina and its dependencies using pip.

To install all dependencies:

pip install text-machina[all]

To install only specific LLM providers or the development dependencies, list the corresponding extras:

pip install text-machina[anthropic,dev]

Alternatively, you can install from source after cloning the repository:

pip install .[all]

If you plan to modify the code for custom use cases, install in development mode:

pip install -e .[dev]
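
Note that some shells, notably zsh, treat square brackets as glob patterns, so the commands above may fail with a "no matches found" error. In that case, quote the extras specifier:

pip install "text-machina[all]"
pip install -e ".[dev]"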

Examples

TextMachina offers both a Command Line Interface (CLI) and a programmatic API for generating MGT datasets.

Using the CLI

The CLI provides explore and generate endpoints. The explore endpoint allows you to inspect a small generated dataset interactively and compute metrics. For instance, to check an MGT detection dataset generated using XSum news articles and gpt-3.5-turbo-instruct:

text-machina explore --config-path etc/examples/xsum_gpt-3-5-turbo-instruct_openai.yaml \
--task-type detection \
--metrics-path etc/metrics.yaml \
--max-generations 10

This command will display an interactive interface showing generated and human text for detection, allowing you to verify dataset quality.

Once satisfied, use the generate endpoint to create a full dataset:

text-machina generate --config-path etc/examples/xsum_gpt-3-5-turbo-instruct_openai.yaml \
--task-type detection

TextMachina caches generations, so an interrupted run can be resumed by passing its name with the --run-name flag, as shown below.
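
For instance, re-running the generate command with the name of the interrupted run continues from the cached generations (the run name below is illustrative):

text-machina generate --config-path etc/examples/xsum_gpt-3-5-turbo-instruct_openai.yaml \
--task-type detection \
--run-name my-interrupted-run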

Programmatically

For more control, you can use TextMachina programmatically. Instantiate a dataset generator with a Config object, which defines input, model, and generation parameters, then call its generate method.

Here's how to replicate the previous example in Python:

from text_machina import get_generator
from text_machina import Config, InputConfig, ModelConfig

input_config = InputConfig(
    domain="news",
    language="en",
    quantity=10,
    random_sample_human=True,
    dataset="xsum",
    dataset_text_column="document",
    dataset_params={"split": "test"},
    template=(
        "Write a news article whose summary is '{summary}'"
        "using the entities: {entities}\n\nArticle:"
    ),
    extractor="combined",
    extractors_list=["auxiliary.Auxiliary", "entity_list.EntityList"],
    max_input_tokens=256,
)

model_config = ModelConfig(
    provider="openai",
    model_name="gpt-3.5-turbo-instruct",
    api_type="COMPLETION",
    threads=8,
    max_retries=5,
    timeout=20,
)

generation_config = {"temperature": 0.7, "presence_penalty": 1.0}

config = Config(
    input=input_config,
    model=model_config,
    generation=generation_config,
    task_type="detection",
)
generator = get_generator(config)
dataset = generator.generate()
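
From here, you can inspect or persist the result like any other dataset. The snippet below is a minimal sketch that assumes generate() returns a HuggingFace datasets.Dataset (the output path is arbitrary):

# Assumption: `dataset` behaves like a HuggingFace `datasets.Dataset`.
print(dataset)                               # column names and number of rows
print(dataset[0])                            # first labeled example
dataset.save_to_disk("xsum_detection_mgt")   # persist the dataset for later reuse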

Why Use TextMachina

TextMachina stands out as a powerful tool for MGT dataset generation due to several key features:

  • Comprehensive MGT Dataset Generation: It supports a range of MGT tasks, including detection, attribution, boundary detection, and mixcase, providing a versatile solution for various research and application needs.
  • Extensive LLM Integrations: The framework seamlessly integrates with numerous LLM providers, such as Anthropic, Cohere, OpenAI, Google Vertex AI, Amazon Bedrock, AI21, Azure OpenAI, vLLM, TRT inference servers, and HuggingFace models, offering flexibility in model choice.
  • Advanced Dataset Quality Features: TextMachina incorporates prompt templating, constrained decoding to infer LLM hyperparameters, and post-processing functions to enhance dataset quality and prevent common biases.
  • Bias Mitigation: Built with bias prevention in mind, it helps users avoid introducing spurious correlations in their datasets throughout the entire pipeline.
  • User-Friendly Workflow: With both a robust CLI and a programmatic API, TextMachina caters to different user preferences, making dataset generation accessible and efficient.
  • Dataset Exploration: It provides tools to explore generated datasets and quantify their quality with a set of metrics, ensuring transparency and reliability.
