maestro: Streamlining Fine-Tuning for Multimodal Models like PaliGemma 2 and Florence-2

Summary

maestro is a tool from Roboflow that accelerates the fine-tuning of multimodal models, with ready-to-use recipes for popular vision-language models including Florence-2, PaliGemma 2, and Qwen2.5-VL.

Updated on March 2, 2026

Introduction

maestro is a streamlined tool developed by Roboflow to accelerate the fine-tuning of multimodal models. By encapsulating best practices, maestro simplifies complex tasks such as configuration, data loading, reproducibility, and training loop setup. It currently provides ready-to-use recipes for popular vision-language models, including Florence-2, PaliGemma 2, and Qwen2.5-VL, making advanced model customization more accessible.

Installation

To get started with maestro, install the dependencies for the model you plan to fine-tune. Creating a dedicated Python environment per model is recommended, because the model-specific dependency pins can conflict with one another.

pip install "maestro[paligemma_2]"

Replace paligemma_2 with the specific model you intend to use, for example, florence_2 or qwen2_5_vl.
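The per-model environment setup can be sketched as follows. The environment name `.venv-paligemma_2` is just an illustrative convention, not something maestro requires:

```shell
# One virtual environment per model: maestro's extras (paligemma_2,
# florence_2, qwen2_5_vl) may pin conflicting dependency versions.
python3 -m venv .venv-paligemma_2
. .venv-paligemma_2/bin/activate
python -c 'import sys; print(sys.prefix)'   # confirms the isolated interpreter is active
# then install the model-specific extras inside it:
#   pip install "maestro[paligemma_2]"
# repeat with a fresh environment (e.g. .venv-florence_2) for other models
```

Activating a fresh environment before each install keeps one model's upgrades from breaking another's recipe.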

Examples

maestro offers both a command-line interface (CLI) and a Python API for fine-tuning your models. Additionally, the repository provides convenient Colab notebooks for hands-on experimentation.

Command-Line Interface (CLI)

Kick off fine-tuning directly from your terminal by specifying key parameters like dataset location, epochs, batch size, optimization strategy, and metrics.

maestro paligemma_2 train \
  --dataset "dataset/location" \
  --epochs 10 \
  --batch-size 4 \
  --optimization_strategy "qlora" \
  --metrics "edit_distance"

Python API

For greater control and integration into existing workflows, use the Python API. Import the train function from the corresponding module and define your configuration in a dictionary.

from maestro.trainer.models.paligemma_2.core import train

config = {
    "dataset": "dataset/location",
    "epochs": 10,
    "batch_size": 4,
    "optimization_strategy": "qlora",
    "metrics": ["edit_distance"]
}

train(config)

Colab Notebooks

Explore practical examples and fine-tune models directly in Google Colab; the maestro repository includes several ready-to-run cookbook notebooks covering its supported models.

Why Use maestro?

maestro stands out by simplifying the often-complex process of fine-tuning multimodal models. Its key advantages include:

  • Streamlined Workflow: Accelerates the entire fine-tuning process, from setup to training.
  • Best Practices Encapsulated: Handles configuration, data loading, reproducibility, and training loop setup, allowing users to focus on their data and models.
  • Ready-to-Use Recipes: Provides pre-configured setups for popular models like Florence-2, PaliGemma 2, and Qwen2.5-VL.
  • Hardware Efficiency: Supports optimization strategies like LoRA, QLoRA, and graph freezing to keep hardware requirements in check.
  • Consistent Data Handling: Utilizes a consistent JSONL format to streamline data preparation.
  • Unified Interface: Offers a single CLI/SDK to reduce code complexity across different models and tasks.
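The JSONL format mentioned above pairs one JSON object per line with an image reference. A minimal sketch of reading and writing such a file is below; the field names ("image", "prefix", "suffix") are an assumption for illustration, so check the maestro repository for the exact layout your model and task expect:

```python
import json
import os
import tempfile

# Hypothetical annotation entries: one JSON object per line, each pairing
# an image file with a prompt ("prefix") and expected answer ("suffix").
entries = [
    {"image": "0001.jpeg", "prefix": "describe the image", "suffix": "a red truck"},
    {"image": "0002.jpeg", "prefix": "describe the image", "suffix": "two dogs on grass"},
]

path = os.path.join(tempfile.mkdtemp(), "annotations.jsonl")

# Writing JSONL: serialize each record on its own line.
with open(path, "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# Reading JSONL: parse line by line, one record per line.
with open(path) as f:
    records = [json.loads(line) for line in f]

print(len(records), records[0]["image"])  # → 2 0001.jpeg
```

Because every line is an independent record, JSONL datasets can be streamed, split, and appended to without re-parsing the whole file.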
