{"name":"TabSTAR: A Tabular Foundation Model for Data with Text Fields","description":"TabSTAR is an innovative tabular foundation model designed to effectively process tabular data that includes text fields. It offers a user-friendly package for integrating pretrained models into your own datasets, alongside a comprehensive research mode for advanced development and benchmarking. This powerful tool simplifies the application of deep learning to complex tabular structures.","github":"https://github.com/alanarazi7/TabSTAR","url":"https://osrepos.com/repo/alanarazi7-tabstar","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/alanarazi7-tabstar","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/alanarazi7-tabstar.md","json":"https://osrepos.com/repo/alanarazi7-tabstar.json","topics":["deep-learning","foundation-models","language-models","machine-learning","tabular-data","Python","AI/ML","Data Science"],"keywords":["deep-learning","foundation-models","language-models","machine-learning","tabular-data","Python","AI/ML","Data Science"],"stars":null,"summary":"TabSTAR is an innovative tabular foundation model designed to effectively process tabular data that includes text fields. It offers a user-friendly package for integrating pretrained models into your own datasets, alongside a comprehensive research mode for advanced development and benchmarking. This powerful tool simplifies the application of deep learning to complex tabular structures.","content":"## Introduction\n\nTabSTAR is a groundbreaking Tabular Foundation Model specifically engineered to handle tabular data enriched with text fields. It addresses the challenge of integrating unstructured text information within structured tabular datasets, offering a powerful solution for various machine learning tasks. 
It supports two modes: a package mode for applying a pretrained model to your own dataset, and a research mode for model development, pretraining, and benchmarking.\n\n## Installation\n\nEach mode has its own installation method.\n\n### Package Mode\n\nTo use a pretrained TabSTAR model in your own project, install the package with pip:\n\n```bash\npip install tabstar\n```\n\n### Research Mode\n\nFor model development, pretraining, or benchmark evaluation, clone the repository and run the setup script:\n\n```bash\nsource init.sh\n```\n\nThe script installs the required dependencies and prepares the environment.\n\n## Examples\n\nThe examples below cover both modes.\n\n### Package Mode Inference\n\nRunning TabSTAR on your own data takes only a few lines. 
Here's a quick example for classification:\n\n```python\nfrom importlib.resources import files\nimport pandas as pd\nfrom sklearn.model_selection import train_test_split\n\nfrom tabstar.tabstar_model import TabSTARClassifier\n\n# Load the IMDB sample CSV bundled with the package\ncsv_path = files(\"tabstar\").joinpath(\"resources\", \"imdb.csv\")\nx = pd.read_csv(csv_path)\n# Separate the binary target column from the features\ny = x.pop('Genre_is_Drama')\n# Hold out 10% of the rows for evaluation\nx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)\n# Fit on the training split, then score on the held-out split\ntabstar = TabSTARClassifier()\ntabstar.fit(x_train, y_train)\nmetric = tabstar.score(X=x_test, y=y_test)\nprint(f\"AUC: {metric:.4f}\")\n```\n\n### Research Mode Operations\n\nResearch mode provides scripts for the following tasks:\n\n*   **Benchmark Evaluation:** Evaluate TabSTAR on a public dataset with `python tabstar_paper/do_benchmark.py --model=tabstar --dataset_id=<DATASET_ID>`.\n*   **Pretraining:** Pretrain the model on a specified number of datasets with `python tabstar_paper/do_pretrain.py --n_datasets=256`.\n*   **Finetuning:** Finetune a pretrained model on a downstream task with `python tabstar_paper/do_finetune.py --pretrain_exp=<PRETRAINED_EXP> --dataset_id=46655`.\n\n## Why Use TabSTAR?\n\n*   **Handles Complex Data:** It targets tabular data containing text fields, a common scenario in real-world datasets where traditional tabular models often struggle.\n*   **Foundation Model Power:** Because it is pretrained as a foundation model, TabSTAR learns representations from diverse tabular data that it can then apply to downstream tasks.\n*   **Versatility:** Package mode serves practitioners who need a working model quickly; research mode serves researchers who want to pretrain, finetune, and benchmark.\n*   **Ease of Use:** Package mode provides a simple API for fitting and predicting with minimal setup.\n*   **Cutting-Edge Research:** Backed by a scientific 
paper and ongoing development, TabSTAR represents a cutting-edge solution in the field of tabular machine learning.\n\n## Links\n\n*   [TabSTAR GitHub Repository](https://github.com/alanarazi7/TabSTAR)\n*   [TabSTAR Paper](https://arxiv.org/abs/2505.18125)\n*   [TabSTAR Project Website](https://eilamshapira.com/TabSTAR/)","metrics":{"detailViews":2,"githubClicks":2},"dates":{"published":null,"modified":"2026-01-02T16:01:08.000Z"}}