Chapter 17: Whisper Speech Recognition

Overview

This chapter introduces OpenAI Whisper for speech recognition in robotics applications. You'll learn how to implement voice-to-action systems that enable robots to understand and respond to natural language commands.

Learning Objectives

By the end of this chapter, you will be able to:

  • Understand Whisper's architecture and capabilities
  • Install and configure Whisper for real-time speech recognition
  • Process audio input and convert speech to text
  • Implement voice activity detection for efficient processing

OpenAI Whisper for Speech Recognition

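OpenAI Whisper is an open-source speech recognition model, released under the MIT license and trained on roughly 680,000 hours of multilingual audio. That breadth of training data is what makes it accurate across many languages, accents, and speaking styles, and comparatively robust to background noise. Whisper itself transcribes audio in 30-second windows, so it is a batch model at heart; real-time use is achieved by buffering short chunks of microphone audio and transcribing them as they fill.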

Whisper Model Variants

Sizes in parentheses are approximate checkpoint download sizes.

  • tiny: Fastest, least accurate (about 76 MB)
  • base: Good balance of speed and accuracy (about 145 MB)
  • small: Better accuracy, moderate speed (about 484 MB)
  • medium: High accuracy, slower (about 1.5 GB)
  • large: Highest accuracy, slowest (about 3.0 GB)
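One way to act on the tradeoff above is to pick the largest variant that fits a storage budget on the robot's computer. The following is a minimal sketch under assumptions: `MODEL_SIZES_MB` and `pick_model` are hypothetical names introduced here, the sizes are copied from the list above, and checkpoint size is only a rough proxy for runtime memory use.

```python
# Approximate checkpoint sizes in MB, copied from the list above (assumed values).
MODEL_SIZES_MB = {
    "tiny": 76,
    "base": 145,
    "small": 484,
    "medium": 1500,
    "large": 3000,
}


def pick_model(budget_mb: float) -> str:
    """Return the largest (most accurate) variant whose checkpoint fits the budget."""
    fitting = [name for name, size in MODEL_SIZES_MB.items() if size <= budget_mb]
    if not fitting:
        raise ValueError(f"No Whisper variant fits in {budget_mb} MB")
    # The dict is ordered smallest to largest, so the last fit is the largest.
    return fitting[-1]


print(pick_model(500))  # small
```

The returned name can be passed straight to `whisper.load_model`. In practice latency on the target hardware matters at least as much as size, so it is worth timing a few variants before settling on one.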

Installation and Setup

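# Install PyTorch first so the CUDA 11.8 build is used (drop --index-url for CPU-only)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper
# Whisper decodes audio files through ffmpeg, which must be installed and on the PATH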

Basic Whisper Usage

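import whisper

# Load the "small" checkpoint; it is downloaded on first use and cached locally
model = whisper.load_model("small")

# Transcribe a recorded command; the file is decoded via ffmpeg and resampled to 16 kHz
result = model.transcribe("command.wav")
print(result["text"])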
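Besides `"text"`, the dictionary returned by `transcribe` also carries a detected `"language"` and a list of `"segments"`, each with `"start"`, `"end"`, and `"text"` fields. Segment timestamps are useful when a recording contains several commands. The sketch below formats them; `format_segments` is a hypothetical helper, and `sample` is a hand-written stand-in for a real result so the example runs without a model or audio file.

```python
def format_segments(result: dict) -> list[str]:
    """Turn a Whisper result dict into '[start -> end] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(
            f"[{seg['start']:6.2f}s -> {seg['end']:6.2f}s] {seg['text'].strip()}"
        )
    return lines


# Hand-written stand-in for the dict returned by model.transcribe(...)
sample = {
    "text": " Move forward. Then stop.",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.4, "text": " Move forward."},
        {"start": 1.4, "end": 2.3, "text": " Then stop."},
    ],
}

for line in format_segments(sample):
    print(line)
```

With a real model, `format_segments(model.transcribe("command.wav"))` would produce one line per recognized segment.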

Real-Time Voice Recognition

For real-time voice recognition, the system needs to capture audio from the microphone, process it in chunks, handle the streaming input efficiently, and filter out background noise.
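A simple way to avoid transcribing silence or low-level background noise is to gate each chunk on its energy. The sketch below is a basic energy-based voice activity check, not a method prescribed by Whisper: `rms` and `is_speech` are hypothetical helpers, the input is assumed to be 16-bit mono PCM in native byte order, and the threshold of 500 is an arbitrary starting point that would need calibrating against the actual microphone and environment. Dedicated VAD models are usually more robust than this.

```python
import array
import math


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit PCM chunk (native byte order)."""
    samples = array.array("h", chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(chunk: bytes, threshold: float = 500.0) -> bool:
    """Treat a chunk as speech when its energy reaches the (assumed) threshold."""
    return rms(chunk) >= threshold


# Synthetic test chunks: pure silence and a loud square wave.
silence = array.array("h", [0] * 1024).tobytes()
loud = array.array("h", [3000, -3000] * 512).tobytes()

print(is_speech(silence), is_speech(loud))
```

Only chunks that pass the check need to be buffered for transcription, which keeps the model idle while nobody is speaking.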

Code Examples

Audio Stream Processing

import queue
import threading

import pyaudio
import whisper


class VoiceToAction:
    def __init__(self, model_size="small"):
        self.model = whisper.load_model(model_size)
        self.audio_queue = queue.Queue()  # raw PCM chunks from the microphone

        # Whisper expects 16 kHz mono audio; capture it as 16-bit samples
        self.format = pyaudio.paInt16
        self.channels = 1
        self.rate = 16000
        self.chunk = 1024  # frames per read (64 ms at 16 kHz)

        self.audio = pyaudio.PyAudio()

    def start_listening(self):
        stream = self.audio.open(
            format=self.format,
            channels=self.channels,
            rate=self.rate,
            input=True,
            frames_per_buffer=self.chunk
        )

        # Record on a daemon thread so capture never blocks the caller
        threading.Thread(target=self.record_audio, args=(stream,), daemon=True).start()

    def record_audio(self, stream):
        while True:
            # Do not raise if the input buffer overflowed between reads
            data = stream.read(self.chunk, exception_on_overflow=False)
            self.audio_queue.put(data)
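The class above only fills the queue; something still has to drain it into buffers long enough to transcribe. The following is a minimal sketch of that consumer side, assuming 16-bit mono audio at 16 kHz as configured above. `collect_audio` is a hypothetical helper, and the demo feeds it silent chunks so it runs without a microphone.

```python
import queue


def collect_audio(audio_queue: queue.Queue, seconds: float,
                  rate: int = 16000, sample_width: int = 2,
                  timeout: float = 1.0) -> bytes:
    """Pull chunks off the queue until about `seconds` of audio is buffered."""
    target_bytes = int(seconds * rate * sample_width)
    buffer = bytearray()
    while len(buffer) < target_bytes:
        try:
            buffer.extend(audio_queue.get(timeout=timeout))
        except queue.Empty:
            break  # producer stalled; return whatever has arrived so far
    return bytes(buffer)


# Demo with 20 silent chunks of 1024 frames (2048 bytes each).
q = queue.Queue()
for _ in range(20):
    q.put(b"\x00\x00" * 1024)

pcm = collect_audio(q, seconds=1.0, timeout=0.01)
print(len(pcm))  # 32768
```

The returned bytes can then be converted to the float array Whisper accepts, for example with `np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0`, and passed to `model.transcribe`. The buffer length is a tradeoff: longer buffers give the model more context, shorter ones reduce the delay before the robot reacts.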

Summary

OpenAI Whisper provides robust speech recognition capabilities for robotics applications. Its multi-language support, noise tolerance, and open-source availability make it ideal for implementing voice-to-action systems that enable natural human-robot interaction.

Key Takeaways

  • Whisper offers multiple model sizes for different accuracy-speed tradeoffs
  • Real-time audio processing requires efficient streaming and buffering
  • Voice activity detection reduces unnecessary processing
  • Proper audio preprocessing improves recognition accuracy

What's Next

In the next chapter, we'll explore integrating Whisper with ROS 2, learning how to map recognized voice commands to robot actions for complete voice-to-action systems.
