# TabSTAR: A Tabular Foundation Model for Data with Text Fields

This repository profile is provided by osrepos.com, an open source repository discovery platform.

TabSTAR is a tabular foundation model for tabular data that includes text fields. It offers a package for applying pretrained models to your own datasets, alongside a research mode for model development and benchmarking.

GitHub: https://github.com/alanarazi7/TabSTAR
OSRepos URL: https://osrepos.com/repo/alanarazi7-tabstar

## Topics

- deep-learning
- foundation-models
- language-models
- machine-learning
- tabular-data
- Python
- AI/ML
- Data Science

## Repository Information

Last analyzed by OSRepos: 2026-01-02 16:01:08 UTC

## Introduction

TabSTAR is a tabular foundation model built for tabular data that includes text fields. It addresses the problem of using unstructured text alongside structured columns in a single model. It supports two modes: a package mode for applying a pretrained model to your own data, and a research mode for pretraining, finetuning, and benchmarking.

## Installation

TabSTAR offers two primary modes of operation, each with its own installation method:

### Package Mode

For users who want to quickly integrate a pretrained TabSTAR model into their projects, install it via pip:

```bash
pip install tabstar
```

### Research Mode

For model development, pretraining, or benchmark evaluation, clone the repository and run the setup script:

```bash
source init.sh
```

This script will install all necessary dependencies and prepare your environment.

## Examples

TabSTAR is designed for both practical application and research.

### Package Mode Inference

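Using TabSTAR for inference on your own data is straightforward. The example below trains a classifier on the IMDB sample CSV bundled with the package, predicting the `Genre_is_Drama` column and scoring on a 10% held-out split: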

```python
from importlib.resources import files

import pandas as pd
from sklearn.model_selection import train_test_split

from tabstar.tabstar_model import TabSTARClassifier

# Load the IMDB sample dataset bundled with the package.
csv_path = files("tabstar").joinpath("resources", "imdb.csv")
x = pd.read_csv(csv_path)
# Split off the target column; the remaining columns are the features.
y = x.pop('Genre_is_Drama')
# Hold out 10% of the rows for evaluation.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
tabstar = TabSTARClassifier()
tabstar.fit(x_train, y_train)
# Score the held-out rows; the result is reported as AUC.
metric = tabstar.score(X=x_test, y=y_test)
print(f"AUC: {metric:.4f}")
```
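
The example above reports its score as AUC. As a reminder of what that number measures, here is a dependency-free sketch, assuming the reported value is the usual area under the ROC curve: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. This is an illustration only, not TabSTAR's or scikit-learn's implementation; the function name `roc_auc` and the toy inputs are made up for this sketch.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC for binary labels (1 = positive, 0 = negative).

    Hypothetical helper for illustration; quadratic in the number of rows.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is a useful baseline when reading the printed score.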


### Research Mode Operations

For researchers, TabSTAR provides scripts for advanced tasks:

*   **Benchmark Evaluation:** Evaluate TabSTAR on public datasets using `python tabstar_paper/do_benchmark.py --model=tabstar --dataset_id=<DATASET_ID>`.
*   **Pretraining:** Pretrain the model on a specified number of datasets with `python tabstar_paper/do_pretrain.py --n_datasets=256`.
*   **Finetuning:** Finetune a pretrained model on a downstream task using `python tabstar_paper/do_finetune.py --pretrain_exp=<PRETRAINED_EXP> --dataset_id=46655`.
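
When the same scripts need to be run repeatedly, the commands listed above can be assembled in Python. The sketch below only builds and prints the command lines, using the script paths and flags quoted in this section; the helper names (`benchmark_cmd`, `pretrain_cmd`, `finetune_cmd`) are hypothetical and not part of TabSTAR, and `<PRETRAINED_EXP>` is left as a placeholder.

```python
import shlex


def benchmark_cmd(dataset_id, model="tabstar"):
    """Argument list for the benchmark script (hypothetical helper)."""
    return [
        "python", "tabstar_paper/do_benchmark.py",
        f"--model={model}", f"--dataset_id={dataset_id}",
    ]


def pretrain_cmd(n_datasets=256):
    """Argument list for the pretraining script (hypothetical helper)."""
    return ["python", "tabstar_paper/do_pretrain.py", f"--n_datasets={n_datasets}"]


def finetune_cmd(pretrain_exp, dataset_id):
    """Argument list for the finetuning script (hypothetical helper)."""
    return [
        "python", "tabstar_paper/do_finetune.py",
        f"--pretrain_exp={pretrain_exp}", f"--dataset_id={dataset_id}",
    ]


if __name__ == "__main__":
    commands = [
        pretrain_cmd(),
        # Placeholder experiment name; substitute the name of your pretraining run.
        finetune_cmd("<PRETRAINED_EXP>", 46655),
    ]
    for cmd in commands:
        # shlex.join quotes arguments so the printed line is safe to paste into a shell.
        print(shlex.join(cmd))
```

To execute a command rather than print it, each list can be passed to `subprocess.run(cmd, check=True)` from the root of the cloned repository, assuming the research environment has already been set up.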

## Why Use TabSTAR?

TabSTAR is worth considering for several reasons:

*   **Handles Text Fields:** It targets tabular data that contains text fields, a common scenario in real-world datasets where traditional tabular models often struggle.
*   **Foundation Model Approach:** Pretraining on diverse tabular datasets lets TabSTAR learn representations that carry over to downstream tasks.
*   **Versatility:** It serves practitioners who need a quick solution through its package mode and researchers working on tabular deep learning through its research mode.
*   **Ease of Use:** The package mode provides a simple API for fitting and predicting with minimal setup.
*   **Research Backing:** TabSTAR is backed by a scientific paper and is under ongoing development.

## Links

*   [TabSTAR GitHub Repository](https://github.com/alanarazi7/TabSTAR)
*   [TabSTAR Paper](https://arxiv.org/abs/2505.18125)
*   [TabSTAR Project Website](https://eilamshapira.com/TabSTAR/)