TabSTAR: A Tabular Foundation Model for Data with Text Fields

Summary
TabSTAR is a tabular foundation model for tabular data that includes text fields. It ships as a package for applying a pretrained model to your own datasets, and as a research mode for pretraining, finetuning, and benchmarking.
Introduction
TabSTAR is a Tabular Foundation Model built for tabular data that contains text fields. It addresses the problem of combining unstructured text with structured columns in a single dataset. It can be used in two ways: a package mode for fitting a pretrained model to your own data with little setup, and a research mode for model development, pretraining, and benchmark experiments.
Installation
TabSTAR offers two primary modes of operation, each with its own installation method:
Package Mode
For users who want to quickly integrate a pretrained TabSTAR model into their projects, install it via pip:
pip install tabstar
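After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper name introduced here for illustration and is not part of TabSTAR.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report whether the tabstar distribution is installed in this environment.
found = installed_version("tabstar")
print(f"tabstar {found}" if found else "tabstar is not installed")
```

If the check reports that the package is missing, the usual cause is that pip installed it into a different environment than the one running the script.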
Research Mode
If you plan to engage in model development, pretraining, or benchmark evaluations, clone the repository and set up the environment:
source init.sh
This script will install all necessary dependencies and prepare your environment.
Examples
The examples below cover inference in package mode and the scripts available in research mode.
Package Mode Inference
Using TabSTAR for inference on your own data is straightforward. Here's a quick example for classification:
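from importlib.resources import files
import pandas as pd
from sklearn.model_selection import train_test_split
from tabstar.tabstar_model import TabSTARClassifier
# Load the IMDB sample CSV bundled with the tabstar package.
csv_path = files("tabstar").joinpath("resources", "imdb.csv")
x = pd.read_csv(csv_path)
# Separate the target column from the features.
y = x.pop('Genre_is_Drama')
# Hold out 10% of the rows for evaluation.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
# Fit the classifier on the training split.
tabstar = TabSTARClassifier()
tabstar.fit(x_train, y_train)
# Score on the held-out rows; the example reports this value as AUC.
metric = tabstar.score(X=x_test, y=y_test)
print(f"AUC: {metric:.4f}")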
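The example above labels its score as AUC. As a reminder of what that metric measures, here is a minimal sketch in plain Python that is independent of TabSTAR: the AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function `roc_auc` is a hypothetical helper written for illustration, and whether `TabSTARClassifier.score` computes exactly this quantity is an assumption based only on the label in the print statement.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This is an O(n_pos * n_neg) illustration, not an optimized implementation.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC is a common choice for binary targets such as `Genre_is_Drama`.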
Research Mode Operations
For researchers, TabSTAR provides scripts for advanced tasks:
- Benchmark Evaluation: Evaluate TabSTAR on public datasets using
python tabstar_paper/do_benchmark.py --model=tabstar --dataset_id=<DATASET_ID>
- Pretraining: Pretrain the model on a specified number of datasets with
python tabstar_paper/do_pretrain.py --n_datasets=256
- Finetuning: Finetune a pretrained model on a downstream task using
python tabstar_paper/do_finetune.py --pretrain_exp=<PRETRAINED_EXP> --dataset_id=46655
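When running several experiments, the research-mode commands listed above can be assembled programmatically. The sketch below builds the argument lists using only the script paths and flags that appear in this document; the function names (`benchmark_cmd`, `pretrain_cmd`, `finetune_cmd`) are hypothetical helpers, not part of the TabSTAR repository, and the sketch assumes the commands are run from the repository root.

```python
import shlex
from typing import List, Union

PYTHON = "python"  # interpreter name as written in the commands above


def benchmark_cmd(dataset_id: Union[int, str], model: str = "tabstar") -> List[str]:
    """Argument list for evaluating a model on one public dataset."""
    return [PYTHON, "tabstar_paper/do_benchmark.py",
            f"--model={model}", f"--dataset_id={dataset_id}"]


def pretrain_cmd(n_datasets: int = 256) -> List[str]:
    """Argument list for pretraining on a given number of datasets."""
    return [PYTHON, "tabstar_paper/do_pretrain.py", f"--n_datasets={n_datasets}"]


def finetune_cmd(pretrain_exp: str, dataset_id: Union[int, str]) -> List[str]:
    """Argument list for finetuning a pretrained experiment on one dataset."""
    return [PYTHON, "tabstar_paper/do_finetune.py",
            f"--pretrain_exp={pretrain_exp}", f"--dataset_id={dataset_id}"]


# Show the pretraining command as a shell-ready string.
print(shlex.join(pretrain_cmd()))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the script; passing arguments as a list avoids shell quoting issues.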
Why Use TabSTAR?
Reasons to consider TabSTAR:
- Handles Complex Data: It targets tabular data containing text fields, a common situation in real-world datasets where traditional tabular models often struggle.
- Foundation Model Power: Pretraining on diverse tabular datasets lets TabSTAR learn representations that can transfer to downstream tasks.
- Versatility: The package mode serves practitioners who need a quick way to apply a pretrained model, while the research mode serves researchers working on pretraining, finetuning, and benchmarking.
- Ease of Use: The package mode exposes a simple API for fitting and predicting with minimal setup.
- Cutting-Edge Research: TabSTAR is backed by a scientific paper and is under ongoing development.