TabSTAR: A Tabular Foundation Model for Data with Text Fields

Introduction

TabSTAR is a Tabular Foundation Model designed for tabular data that contains text fields. It tackles the problem of combining unstructured text with structured columns in a single model, a setting that arises in many machine learning tasks. It offers two modes of use: a package mode for applying a pretrained model with minimal setup, and a research mode for pretraining, finetuning, and benchmark experiments.

Installation

TabSTAR offers two primary modes of operation, each with its own installation method:

Package Mode

For users who want to quickly integrate a pretrained TabSTAR model into their projects, install it via pip:

pip install tabstar
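
After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of TabSTAR.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if a top-level package can be found by the current interpreter."""
    # find_spec returns None when a top-level module cannot be located.
    return find_spec(package_name) is not None


if is_installed("tabstar"):
    print("tabstar is importable")
else:
    print("tabstar not found; run: pip install tabstar")
```

If the check fails right after a successful install, pip and the interpreter are probably bound to different environments.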

Research Mode

If you plan to engage in model development, pretraining, or benchmark evaluations, clone the repository and set up the environment:

source init.sh

This script will install all necessary dependencies and prepare your environment.

Examples

TabSTAR is designed for both practical application and research.

Package Mode Inference

Using TabSTAR for inference on your own data is straightforward. Here's a quick example for classification:

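from importlib.resources import files
import pandas as pd
from sklearn.model_selection import train_test_split

from tabstar.tabstar_model import TabSTARClassifier

# Load the example IMDB dataset that ships with the package.
csv_path = files("tabstar").joinpath("resources", "imdb.csv")
x = pd.read_csv(csv_path)
# Separate the binary target column from the features.
y = x.pop('Genre_is_Drama')
# Hold out 10% of the rows for evaluation.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
# Fit the classifier on the training split.
tabstar = TabSTARClassifier()
tabstar.fit(x_train, y_train)
# Score the held-out split and report the result.
metric = tabstar.score(X=x_test, y=y_test)
print(f"AUC: {metric:.4f}")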
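
To help interpret the number printed above, here is a small pure-Python sketch of ROC AUC. It assumes the reported score is ROC AUC, as the print label suggests. The function roc_auc and the toy labels and scores are illustrative and are not taken from TabSTAR.

```python
def roc_auc(labels, scores):
    """Probability that a random positive example is scored above a random negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: two positives and two negatives, with one pair ranked the wrong way.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
print(f"AUC: {roc_auc(labels, scores):.4f}")  # prints "AUC: 0.7500"
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is quadratic and is meant only for illustration.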

Research Mode Operations

For researchers, TabSTAR provides scripts for advanced tasks:

Benchmark Evaluation: Evaluate TabSTAR on a public dataset:

python tabstar_paper/do_benchmark.py --model=tabstar --dataset_id=<DATASET_ID>

Pretraining: Pretrain the model on a specified number of datasets:

python tabstar_paper/do_pretrain.py --n_datasets=256

Finetuning: Finetune a pretrained model on a downstream task:

python tabstar_paper/do_finetune.py --pretrain_exp=<PRETRAINED_EXP> --dataset_id=46655
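
When benchmarking several datasets, the benchmark command can be generated from a script. The sketch below only builds and prints the command lines, using the script path and flags shown above. The helper benchmark_command is hypothetical, and applying dataset ID 46655 (taken from the finetuning example) to the benchmark script is an assumption.

```python
import shlex


def benchmark_command(dataset_id, model="tabstar"):
    """Build the argument list for one benchmark run (hypothetical helper)."""
    return [
        "python",
        "tabstar_paper/do_benchmark.py",
        f"--model={model}",
        f"--dataset_id={dataset_id}",
    ]


# Assumption: replace with the dataset IDs you actually want to evaluate.
dataset_ids = [46655]

for dataset_id in dataset_ids:
    command = benchmark_command(dataset_id)
    # Print the command rather than running it; to execute it from the
    # repository root, pass the list to subprocess.run(command, check=True).
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to subprocess.run.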

Why Use TabSTAR?

TabSTAR is worth considering for several reasons:

Handles Text in Tables: It targets tabular data containing text fields, a common situation in real-world datasets and one where traditional tabular models often struggle.
Foundation Model Approach: Pretraining across many tabular datasets lets TabSTAR learn representations that can then be finetuned for downstream tasks.
Versatility: Package mode serves practitioners who need a working model quickly, while research mode serves researchers who want to pretrain, finetune, and benchmark tabular deep learning models.
Ease of Use: The package mode provides a simple API for fitting and predicting with minimal setup, making it accessible to data scientists.
Research Backing: TabSTAR is backed by a scientific paper and is under ongoing development.
