Chapter 17: Whisper Speech Recognition

Overview

This chapter introduces OpenAI Whisper for speech recognition in robotics applications. You'll learn how to implement voice-to-action systems that enable robots to understand and respond to natural language commands.

Learning Objectives

By the end of this chapter, you will be able to:

Understand Whisper's architecture and capabilities
Install and configure Whisper for real-time speech recognition
Process audio input and convert speech to text
Implement voice activity detection for efficient processing

OpenAI Whisper for Speech Recognition

OpenAI Whisper is an open-source speech recognition model. It transcribes speech in many languages, copes with a range of accents and speaking styles, and is comparatively robust to background noise. The model works on 30-second windows of audio, so it is a natural fit for batch transcription; real-time use is possible by buffering microphone input and transcribing it in short chunks.

Whisper Model Variants

tiny: Fastest, least accurate (76 MB checkpoint)
base: Good balance of speed and accuracy (145 MB checkpoint)
small: Better accuracy, moderate speed (484 MB checkpoint)
medium: High accuracy, slower (1.5 GB checkpoint)
large: Highest accuracy, slowest (3.0 GB checkpoint)
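
As a rough illustration of the tradeoff, the sketch below picks the largest variant that fits a given size budget. The `pick_model` helper and the `MODEL_SIZES_MB` table are hypothetical names introduced here, and checkpoint size is only a loose proxy for the memory a model needs at run time.

```python
# Approximate checkpoint sizes in MB, copied from the list above.
MODEL_SIZES_MB = {
    "tiny": 76,
    "base": 145,
    "small": 484,
    "medium": 1500,
    "large": 3000,
}


def pick_model(budget_mb: float) -> str:
    """Return the largest model whose checkpoint fits within budget_mb.

    Hypothetical helper: checkpoint size only approximates run-time memory use.
    """
    fitting = [name for name, size in MODEL_SIZES_MB.items() if size <= budget_mb]
    if not fitting:
        raise ValueError(f"no model fits in {budget_mb} MB")
    # The dict is ordered smallest to largest, so the last fit is the largest.
    return fitting[-1]


print(pick_model(500))  # small
```

The returned name can be passed straight to `whisper.load_model`.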

Installation and Setup

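Whisper reads audio files through the ffmpeg command-line tool, which must be installed separately and available on the PATH.

# Install the CUDA 11.8 build of PyTorch first; otherwise pip keeps the default build pulled in by openai-whisper
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper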

Basic Whisper Usage

import whisper

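# Load a checkpoint by name; it is downloaded on first use
model = whisper.load_model("small")

# transcribe() takes a path to an audio file and returns a dict
result = model.transcribe("command.wav")
print(result["text"])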
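
Transcripts typically come back with capitalization, punctuation, and often a leading space, which gets in the way when matching against a fixed set of robot commands. The sketch below shows one way to normalize the text first. `normalize_command` is a hypothetical helper, and the input string is hand-written rather than real model output.

```python
import string


def normalize_command(text: str) -> str:
    """Lowercase a transcript, strip ASCII punctuation, and collapse whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


# Hand-written example transcript (not real model output).
print(normalize_command(" Move forward, two meters."))  # move forward two meters
```

In practice this would be applied to `result["text"]` before any command lookup.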

Real-Time Voice Recognition

For real-time voice recognition, we need to capture audio from the microphone, process it in chunks, handle the streaming input efficiently, and filter out background noise.
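
One simple way to skip silent chunks is an energy threshold: compute the root-mean-square (RMS) amplitude of each chunk and forward it only when the value is high enough. The sketch below assumes 16-bit mono PCM in native byte order. The `rms` and `is_speech` helpers and the default threshold of 500 are assumptions to be tuned per microphone and room; this is a basic stand-in for voice activity detection, not something Whisper provides.

```python
import array
import math


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of a chunk of 16-bit mono PCM."""
    samples = array.array("h")  # signed 16-bit integers, native byte order
    samples.frombytes(chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(chunk: bytes, threshold: float = 500.0) -> bool:
    """Treat a chunk as speech when its RMS exceeds the threshold (assumed value)."""
    return rms(chunk) > threshold


silence = array.array("h", [0] * 1024).tobytes()
loud = array.array("h", [4000, -4000] * 512).tobytes()
print(is_speech(silence), is_speech(loud))  # False True
```

An energy threshold cannot tell speech from other loud sounds, so it mainly saves work during quiet periods.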

Code Examples

Audio Stream Processing

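import queue
import threading

import pyaudio
import whisper

class VoiceToAction:
    def __init__(self, model_size="small"):
        self.model = whisper.load_model(model_size)
        self.audio_queue = queue.Queue()  # raw PCM chunks from the recorder thread

        # Whisper expects 16 kHz mono audio, so capture in that format directly
        self.format = pyaudio.paInt16
        self.channels = 1
        self.rate = 16000
        self.chunk = 1024  # frames per read (64 ms at 16 kHz)

        self.audio = pyaudio.PyAudio()
        self.running = threading.Event()

    def start_listening(self):
        stream = self.audio.open(
            format=self.format,
            channels=self.channels,
            rate=self.rate,
            input=True,
            frames_per_buffer=self.chunk
        )

        # Record on a daemon thread so capture does not block the caller
        self.running.set()
        threading.Thread(target=self.record_audio, args=(stream,), daemon=True).start()

    def stop_listening(self):
        # Ask the recorder thread to exit after its current read
        self.running.clear()

    def record_audio(self, stream):
        while self.running.is_set():
            # Do not raise if input frames were dropped because the reader fell behind
            data = stream.read(self.chunk, exception_on_overflow=False)
            self.audio_queue.put(data)
        stream.stop_stream()
        stream.close()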
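
The recorder only fills the queue, so something still has to turn the queued bytes into input for the model. `transcribe` accepts a float32 NumPy array of 16 kHz mono samples as well as a file path. The sketch below shows one way to drain the queue and convert 16-bit PCM to that form. `drain_to_float32` is a hypothetical helper, and the byte order is assumed to be the machine's native one.

```python
import queue

import numpy as np


def drain_to_float32(audio_queue: queue.Queue) -> np.ndarray:
    """Collect all queued 16-bit PCM chunks into one float32 array in [-1.0, 1.0)."""
    chunks = []
    while True:
        try:
            chunks.append(audio_queue.get_nowait())
        except queue.Empty:
            break
    pcm = np.frombuffer(b"".join(chunks), dtype=np.int16)
    # Scale by the int16 range so samples land in [-1.0, 1.0)
    return pcm.astype(np.float32) / 32768.0


# Usage sketch, assuming `vta` is a VoiceToAction instance that has been listening:
#   audio = drain_to_float32(vta.audio_queue)
#   result = vta.model.transcribe(audio, fp16=False)
```

Draining everything at once keeps the example short. A longer-running system would more likely transcribe fixed-length windows of a few seconds each.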

Summary

OpenAI Whisper provides robust speech recognition for robotics applications. Its multi-language support, noise tolerance, and open-source availability make it well suited to voice-to-action systems that enable natural human-robot interaction.

Key Takeaways

Whisper offers multiple model sizes for different accuracy-speed tradeoffs
Real-time audio processing requires efficient streaming and buffering
Voice activity detection reduces unnecessary processing
Proper audio preprocessing improves recognition accuracy

What's Next

In the next chapter, we'll explore integrating Whisper with ROS 2, learning how to map recognized voice commands to robot actions for complete voice-to-action systems.
