whisper.cpp: High-Performance Speech Recognition with OpenAI's Whisper Model

Introduction

whisper.cpp is a high-performance C/C++ port of OpenAI's Whisper automatic speech recognition (ASR) model. Designed for efficiency and portability, it runs fast, local inference without external dependencies, making speech-to-text available on a wide range of devices.

Key features include optimizations for Apple Silicon (ARM NEON, Accelerate, Metal, Core ML), AVX intrinsics for x86, VSX intrinsics for POWER, and hardware acceleration through NVIDIA GPUs (cuBLAS), Vulkan, OpenVINO, Ascend NPUs, and Moore Threads GPUs. It also supports mixed F16/F32 precision and integer quantization, and performs zero memory allocations at runtime. whisper.cpp runs on macOS, iOS, Android, Linux, Windows, WebAssembly, and Raspberry Pi, and can also be run in Docker containers.

Installation

Getting started with whisper.cpp is straightforward. Follow these steps for a quick setup:

Clone the repository:

```
git clone https://github.com/ggml-org/whisper.cpp.git
```

Navigate into the directory:
```
cd whisper.cpp
```

Download a Whisper model (e.g., base.en):

```
sh ./models/download-ggml-model.sh base.en
```

Build the project and transcribe an audio file:
```
cmake -B build
cmake --build build -j --config Release
./build/bin/whisper-cli -f samples/jfk.wav
```
For a quick demo, you can also simply run make base.en to download the model and transcribe sample audio files.
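
The steps above can be collected into a small script. The following is a minimal sketch and not part of the whisper.cpp repository: the helper names `model_path` and `transcribe` are hypothetical, and the `./models/ggml-<name>.bin` layout is an assumption inferred from the `base.en` example used in this article.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of whisper.cpp itself.

# Map a model name (e.g. base.en) to the file the download script is
# assumed to produce: ./models/ggml-<name>.bin
model_path() {
    printf './models/ggml-%s.bin\n' "$1"
}

# Transcribe one WAV file, failing early with a clear message if the
# model file or the built binary is missing.
transcribe() {
    model=$(model_path "$1")
    audio=$2
    if [ ! -f "$model" ]; then
        echo "model not found: $model (run: sh ./models/download-ggml-model.sh $1)" >&2
        return 1
    fi
    if [ ! -x ./build/bin/whisper-cli ]; then
        echo "whisper-cli not built (run the cmake steps first)" >&2
        return 1
    fi
    ./build/bin/whisper-cli -m "$model" -f "$audio"
}
```

Run from the repository root, a call such as `transcribe base.en samples/jfk.wav` would then either transcribe the sample or report which step is still missing.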

Examples

whisper.cpp offers a variety of examples showcasing its versatility:

Real-time Audio Input: The stream example enables continuous transcription from your microphone, ideal for live applications. Requires SDL2.

```
cmake -B build -DWHISPER_SDL2=ON
cmake --build build -j --config Release
./build/bin/whisper-stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```

Karaoke-style Movie Generation: Generate a video in which the currently spoken word is highlighted, which can be useful for educational content. This requires ffmpeg.
```
./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
source ./samples/jfk.wav.wts
ffplay ./samples/jfk.wav.mp4
```
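
Because the karaoke steps call external tools, it can help to check for them before starting. The helper below is a hypothetical sketch (the name `missing_cmds` is not part of whisper.cpp); it only relies on the standard `command -v` lookup.

```shell
# Hypothetical helper: print each of the given commands that is not on PATH.
# Returns 0 if all are present, 1 otherwise.
missing_cmds() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}

# Example guard before running the karaoke steps:
# missing_cmds ffmpeg ffplay || echo "install the tools listed above first" >&2
```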

Voice Activity Detection (VAD): Use a VAD model such as Silero-VAD so that only the speech segments are passed to Whisper, which can noticeably shorten transcription time for audio with long silences.

```
./models/download-vad-model.sh silero-v5.1.2
./build/bin/whisper-cli -vm ./models/ggml-silero-v5.1.2.bin --vad -f samples/jfk.wav -m models/ggml-base.en.bin
```
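
In a script, the VAD flags can be made conditional on the VAD model having been downloaded. This is a minimal sketch under that assumption: `vad_flags` is a hypothetical helper, and it only emits the `--vad` and `-vm` flags already shown in the command above.

```shell
# Hypothetical helper: print the VAD flags for whisper-cli if the VAD model
# file exists, or print nothing if it does not.
vad_flags() {
    vad_model=$1
    if [ -f "$vad_model" ]; then
        printf '%s %s %s\n' --vad -vm "$vad_model"
    fi
}

# Usage (the unquoted substitution is intentional so the flags are split
# into separate arguments; the model path must not contain spaces):
# ./build/bin/whisper-cli $(vad_flags ./models/ggml-silero-v5.1.2.bin) \
#     -m models/ggml-base.en.bin -f samples/jfk.wav
```

If the VAD model is absent, the command falls back to plain transcription without VAD rather than failing.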

Mobile Applications: Examples for iOS and Android demonstrate on-device, offline transcription.
WebAssembly: Run Whisper directly in your browser with whisper.wasm.

Why Use It

whisper.cpp stands out for several compelling reasons:

High Performance: Inference is often faster than real time, thanks to optimizations for several CPU architectures and support for GPUs and other accelerators (NVIDIA, Vulkan, OpenVINO, Apple Neural Engine via Core ML, Ascend NPU, Moore Threads).
Portability: Written in plain C/C++ with no external dependencies, it is easy to integrate into other projects and deploy on platforms ranging from embedded systems to servers.
Resource Efficiency: Mixed precision, integer quantization, and zero runtime memory allocations keep resource consumption low, making it suitable for constrained environments.
Active Development & Community: Benefits from continuous improvements, a clear roadmap, and a vibrant community contributing to its development and offering numerous language bindings (Rust, JavaScript, Go, Java, Ruby, .NET, Python, R, Unity).
