Week 9: Voice-to-Action with OpenAI Whisper

Introduction

Welcome to Week 9 of the Vision-Language-Action (VLA) module! This week we'll explore how to implement voice-to-action systems using OpenAI Whisper for speech recognition. We'll learn how to convert spoken commands into actionable robot behaviors, creating intuitive human-robot interaction interfaces. This technology enables robots to understand natural language commands and execute corresponding actions in real-world environments.

Learning Objectives

By the end of this week, you will be able to:

  • Understand the fundamentals of speech recognition and voice-to-action systems
  • Install and configure OpenAI Whisper for real-time speech recognition
  • Process audio input and convert speech to text commands
  • Map recognized commands to robot actions
  • Integrate voice commands with ROS 2 control systems

Prerequisites

Before starting this week's content, ensure you have:

  • Understanding of ROS 2 fundamentals (Weeks 1-3)
  • Basic knowledge of audio processing concepts
  • Experience with Python programming
  • Familiarity with natural language processing concepts

1. Introduction to Voice-to-Action Systems

1.1 What are Voice-to-Action Systems?

Voice-to-action systems enable robots to:

  • Recognize spoken commands using speech recognition
  • Interpret natural language instructions
  • Convert voice commands into executable robot actions
  • Provide intuitive human-robot interaction

1.2 Applications in Robotics

  • Assistive robotics for elderly care
  • Industrial automation with voice commands
  • Educational robotics
  • Service robotics in homes and offices
  • Search and rescue operations

1.3 System Architecture

A typical voice-to-action system includes the following stages (a minimal skeleton follows the list):

  • Audio Input: Microphones for capturing speech
  • Speech Recognition: Converting audio to text
  • Natural Language Processing: Understanding command intent
  • Action Mapping: Converting commands to robot actions
  • Execution: Robot control and feedback
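
To make the flow concrete, here is a minimal skeleton of the pipeline in Python. The function names are illustrative placeholders, not part of any library; each stage is filled in by the sections that follow.

# Minimal, illustrative pipeline skeleton; each stage is a stub to be
# replaced with the techniques covered later this week.

def capture_audio() -> bytes:
    """Audio Input: grab a chunk of microphone audio (see Section 3)."""
    raise NotImplementedError

def speech_to_text(audio: bytes) -> str:
    """Speech Recognition: transcribe audio with Whisper (see Section 2)."""
    raise NotImplementedError

def parse_intent(text: str) -> dict:
    """Natural Language Processing: extract action, object, target (see Section 4)."""
    raise NotImplementedError

def dispatch_action(command: dict) -> None:
    """Action Mapping + Execution: send the command to the robot (see Section 5)."""
    raise NotImplementedError

def voice_to_action_step():
    audio = capture_audio()
    text = speech_to_text(audio)
    command = parse_intent(text)
    dispatch_action(command)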

2. OpenAI Whisper for Speech Recognition

2.1 What is OpenAI Whisper?

OpenAI Whisper is a state-of-the-art speech recognition model that:

  • Provides high accuracy across multiple languages
  • Handles various accents and speaking styles
  • Works well in noisy environments
  • Supports real-time and batch processing
  • Is available as an open-source model

2.2 Whisper Model Variants

  • tiny: Fastest, least accurate (76MB)
  • base: Good balance of speed and accuracy (145MB)
  • small: Better accuracy, moderate speed (484MB)
  • medium: High accuracy, slower (1.5GB)
  • large: Highest accuracy, slowest (3.0GB)

2.3 Installation and Setup

# Whisper also requires the ffmpeg command-line tool to decode audio files
pip install openai-whisper

# For GPU acceleration
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

2.4 Basic Whisper Usage

import whisper

# Load model (downloads if not present)
model = whisper.load_model("small")

# Transcribe audio file
result = model.transcribe("command.wav")
print(result["text"])

3. Real-Time Voice Recognition

3.1 Audio Input Processing

For real-time voice recognition, we need to:

  • Capture audio from microphone
  • Process audio in chunks
  • Handle streaming input efficiently
  • Filter out background noise

3.2 Audio Stream Processing

import pyaudio
import numpy as np
import queue
import threading
import whisper

class VoiceToAction:
    def __init__(self, model_size="small"):
        # Initialize Whisper model
        self.model = whisper.load_model(model_size)
        self.audio_queue = queue.Queue()

        # Audio parameters (16 kHz mono, 16-bit samples, 1024-frame chunks)
        self.format = pyaudio.paInt16
        self.channels = 1
        self.rate = 16000
        self.chunk = 1024

        # Initialize PyAudio
        self.audio = pyaudio.PyAudio()

    def start_listening(self):
        # Open audio stream
        stream = self.audio.open(
            format=self.format,
            channels=self.channels,
            rate=self.rate,
            input=True,
            frames_per_buffer=self.chunk
        )

        # Start audio recording thread
        threading.Thread(target=self.record_audio, args=(stream,), daemon=True).start()

    def record_audio(self, stream):
        # Continuously push raw audio chunks into the queue; a consumer
        # (see Section 3.3) decides when enough speech has accumulated
        while True:
            data = stream.read(self.chunk)
            self.audio_queue.put(data)

3.3 Voice Activity Detection

Implement voice activity detection (see the sketch after this list) to:

  • Detect when speech starts and ends
  • Reduce unnecessary processing
  • Improve real-time performance
  • Handle background noise
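
A minimal sketch of energy-based voice activity detection, assuming 16-bit PCM chunks like those captured in Section 3.2. The RMS threshold and silence count are illustrative starting values that need tuning; production systems often use a dedicated detector such as webrtcvad instead.

import numpy as np

def is_speech(chunk: bytes, threshold: float = 500.0) -> bool:
    """Simple energy-based VAD: True if the chunk's RMS energy exceeds a threshold."""
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32)
    rms = np.sqrt(np.mean(samples ** 2))
    return rms > threshold

def collect_utterance(audio_queue, silence_chunks=15):
    """Accumulate chunks from speech onset until about a second of silence,
    then return the utterance bytes (convert to float32 before transcribing)."""
    utterance, silent = [], 0
    while silent < silence_chunks:
        chunk = audio_queue.get()
        if is_speech(chunk):
            utterance.append(chunk)
            silent = 0
        elif utterance:
            utterance.append(chunk)
            silent += 1
    return b''.join(utterance)

This pairs naturally with the audio_queue filled by the VoiceToAction class above: transcription only runs on complete utterances instead of every raw chunk.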

4. Natural Language Understanding

4.1 Command Parsing

Convert recognized text into structured commands:

  • Extract action verbs (move, pick, place, etc.)
  • Identify objects and locations
  • Parse numerical parameters
  • Handle complex multi-step commands

4.2 Intent Recognition

import re

class CommandParser:
    def __init__(self):
        # Define command patterns; the optional "the " keeps the captured
        # target consistent ("go to the kitchen" -> "kitchen")
        self.move_patterns = [
            r'move to (?:the )?(.+)',
            r'go to (?:the )?(.+)',
            r'navigate to (?:the )?(.+)'
        ]

        self.pick_patterns = [
            r'pick up the (.+)',
            r'pick the (.+)',
            r'grab the (.+)',
            r'take the (.+)'
        ]

        self.place_patterns = [
            r'place it on the (.+)',
            r'put it on the (.+)',
            r'place the (.+) on the (.+)',
            r'put the (.+) on the (.+)'
        ]

    def parse_command(self, text):
        text = text.lower().strip()

        # Check move patterns
        for pattern in self.move_patterns:
            match = re.search(pattern, text)
            if match:
                return {'action': 'move', 'target': match.group(1)}

        # Check pick patterns
        for pattern in self.pick_patterns:
            match = re.search(pattern, text)
            if match:
                return {'action': 'pick', 'object': match.group(1)}

        # Check place patterns
        for pattern in self.place_patterns:
            match = re.search(pattern, text)
            if match:
                if len(match.groups()) == 1:
                    return {'action': 'place', 'target': match.group(1)}
                else:
                    return {'action': 'place', 'object': match.group(1), 'target': match.group(2)}

        return {'action': 'unknown', 'raw': text}

4.3 Context Awareness

Implement context awareness (a minimal state-tracking sketch follows this list) for:

  • Understanding relative positions
  • Handling ambiguous commands
  • Maintaining conversation state
  • Learning user preferences
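
As a starting point, the first three points can be approximated with a small state object that remembers the last mentioned object and location, so follow-up commands like "pick it up" can be resolved. The class and field names below are illustrative assumptions, not a fixed API.

class DialogContext:
    """Tracks the most recently mentioned object and location so that
    pronouns ("it", "there") in later commands can be resolved."""

    def __init__(self):
        self.last_object = None
        self.last_location = None

    def update(self, parsed_command: dict) -> None:
        if 'object' in parsed_command:
            self.last_object = parsed_command['object']
        if 'target' in parsed_command:
            self.last_location = parsed_command['target']

    def resolve(self, parsed_command: dict) -> dict:
        resolved = dict(parsed_command)
        if resolved.get('object') in (None, 'it'):
            resolved['object'] = self.last_object
        if resolved.get('target') in (None, 'there'):
            resolved['target'] = self.last_location
        return resolved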

5. Integration with ROS 2

5.1 ROS 2 Node Structure

import numpy as np
import whisper

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Pose
# AudioData is provided by the audio_common package (audio_common_msgs),
# not by sensor_msgs
from audio_common_msgs.msg import AudioData

class VoiceCommandNode(Node):
    def __init__(self):
        super().__init__('voice_command_node')

        # Publishers for robot commands
        self.move_pub = self.create_publisher(Pose, 'move_command', 10)
        self.action_pub = self.create_publisher(String, 'action_command', 10)

        # Subscriber for audio input
        self.audio_sub = self.create_subscription(
            AudioData, 'audio_input', self.audio_callback, 10)

        # Initialize Whisper model
        self.whisper_model = whisper.load_model("small")

    def audio_callback(self, msg):
        # Convert raw 16-bit samples to the normalized float32 array Whisper expects
        audio_array = np.frombuffer(bytes(msg.data), dtype=np.int16)
        audio_array = audio_array.astype(np.float32) / 32768.0
        result = self.whisper_model.transcribe(audio_array)
        command_text = result["text"]

        # Parse and execute command
        self.execute_command(command_text)

    def execute_command(self, command_text):
        parser = CommandParser()
        parsed_command = parser.parse_command(command_text)

        if parsed_command['action'] == 'move':
            self.send_move_command(parsed_command['target'])
        elif parsed_command['action'] == 'pick':
            self.send_pick_command(parsed_command['object'])
        elif parsed_command['action'] == 'place':
            self.send_place_command(parsed_command['target'])

    def send_move_command(self, target):
        # Resolving a named location to a Pose is application-specific
        # (see Section 5.2); here the request is forwarded as a text command
        msg = String()
        msg.data = f'move {target}'
        self.action_pub.publish(msg)

    def send_pick_command(self, obj):
        msg = String()
        msg.data = f'pick {obj}'
        self.action_pub.publish(msg)

    def send_place_command(self, target):
        msg = String()
        msg.data = f'place {target}'
        self.action_pub.publish(msg)

5.2 Action Mapping

Map recognized commands to ROS 2 actions (a navigation example follows the list):

  • Navigation commands → Navigation2 stack
  • Manipulation commands → MoveIt! or custom controllers
  • System commands → Service calls
  • Query commands → Parameter requests
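
As an example of the first mapping, a parsed 'move' command can be forwarded to the Navigation2 stack through its NavigateToPose action. The sketch below assumes a hypothetical KNOWN_LOCATIONS table that resolves spoken place names to map coordinates; in a real system this would come from a semantic map or a parameter file.

import rclpy
from rclpy.node import Node
from rclpy.action import ActionClient
from geometry_msgs.msg import PoseStamped
from nav2_msgs.action import NavigateToPose

# Hypothetical mapping from spoken location names to map-frame coordinates
KNOWN_LOCATIONS = {
    'kitchen': (3.2, 1.5),
    'charging station': (0.0, 0.0),
}

class MoveCommandBridge(Node):
    def __init__(self):
        super().__init__('move_command_bridge')
        self.nav_client = ActionClient(self, NavigateToPose, 'navigate_to_pose')

    def send_move_command(self, target_name: str):
        if target_name not in KNOWN_LOCATIONS:
            self.get_logger().warn(f'Unknown location: {target_name}')
            return
        x, y = KNOWN_LOCATIONS[target_name]

        # Build a map-frame goal pose for Nav2
        goal = NavigateToPose.Goal()
        goal.pose = PoseStamped()
        goal.pose.header.frame_id = 'map'
        goal.pose.pose.position.x = x
        goal.pose.pose.position.y = y
        goal.pose.pose.orientation.w = 1.0

        self.nav_client.wait_for_server()
        self.nav_client.send_goal_async(goal)

In practice you would also attach result and feedback callbacks to the goal handle so the voice interface can report navigation progress back to the user.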

6. Voice Command Vocabulary

6.1 Basic Navigation Commands

  • "Go to the kitchen"
  • "Move to the table"
  • "Navigate to the charging station"
  • "Return to base"

6.2 Manipulation Commands

  • "Pick up the red cup"
  • "Place the book on the shelf"
  • "Open the door"
  • "Close the drawer"

6.3 System Commands

  • "Stop" or "Halt"
  • "Pause"
  • "Resume"
  • "Shutdown"
  • "Status"

7. Performance Optimization

7.1 Model Optimization

  • Use appropriate model size for your hardware
  • Implement model quantization
  • Use GPU acceleration when available
  • Cache frequently used models
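
For example, device selection and half-precision can be set explicitly: device is a parameter of whisper.load_model and fp16 is a standard transcribe option, while the choice of the "small" model here is only a suggestion.

import torch
import whisper

# Pick the largest model your hardware can run in real time
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("small", device=device)

# fp16 inference is only useful on GPU; on CPU it falls back to fp32 with a
# warning, so set it explicitly based on the device
result = model.transcribe("command.wav", fp16=(device == "cuda"))
print(result["text"])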

7.2 Real-time Processing

  • Optimize audio buffer sizes
  • Use efficient threading
  • Implement command queuing
  • Handle processing delays gracefully
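
One way to implement command queuing and handle processing delays gracefully is a bounded queue that discards the oldest audio when transcription falls behind, so the robot never acts on stale speech. A short sketch, using the same 16 kHz / 1024-sample parameters as earlier and assuming a single capture thread:

import queue

# Bounded queue: at 16 kHz and 1024-sample chunks, 50 entries is about 3.2 s of audio
audio_queue = queue.Queue(maxsize=50)

def enqueue_chunk(chunk: bytes):
    """Called from the capture thread; drop the oldest chunk if we are behind."""
    try:
        audio_queue.put_nowait(chunk)
    except queue.Full:
        try:
            audio_queue.get_nowait()   # discard stale audio
        except queue.Empty:
            pass
        audio_queue.put_nowait(chunk)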

7.3 Accuracy Improvements

  • Train custom language models
  • Implement command confirmation
  • Use context-aware recognition
  • Add error correction mechanisms
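
Command confirmation can be driven by the confidence information Whisper already returns: each entry in result["segments"] includes avg_logprob and no_speech_prob fields. The sketch below uses illustrative thresholds (the same default values Whisper applies internally) to decide whether the robot should ask the user to confirm before acting.

def should_confirm(result, logprob_threshold=-1.0, no_speech_threshold=0.6):
    """Return True if the transcription looks unreliable and the robot
    should ask the user to confirm or repeat the command."""
    segments = result.get("segments", [])
    if not segments:
        return True
    for seg in segments:
        if seg["avg_logprob"] < logprob_threshold:
            return True
        if seg["no_speech_prob"] > no_speech_threshold:
            return True
    return False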

8. Advanced Features

8.1 Multi-language Support

Whisper supports transcription in many languages (see the example after this list):

  • English, German, French, Spanish, Italian
  • Portuguese, Polish, Chinese, Japanese, Korean
  • And many more languages
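
The language can be pinned explicitly to skip automatic detection, and task="translate" returns English text for non-English speech; both language and task are standard transcribe options. The file name below is only a placeholder.

import whisper

model = whisper.load_model("small")

# Force German decoding instead of automatic language detection
result = model.transcribe("befehl.wav", language="de")
print(result["text"])

# Transcribe non-English speech and translate the output into English
result = model.transcribe("befehl.wav", task="translate")
print(result["text"])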

8.2 Custom Training

Fine-tune Whisper for:

  • Domain-specific vocabulary
  • Specific accents or dialects
  • Noisy environments
  • Specialized command sets

8.3 Voice Authentication

Implement voice biometrics for:

  • User identification
  • Security verification
  • Personalized responses
  • Access control

9. Practical Implementation

9.1 Complete Voice-to-Action System

import rclpy
import whisper
import pyaudio
import numpy as np
import threading
import queue
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Pose

class CompleteVoiceToActionNode(Node):
    def __init__(self):
        super().__init__('complete_voice_to_action')

        # Initialize Whisper model
        self.model = whisper.load_model("small")

        # Audio processing setup
        self.audio_queue = queue.Queue()
        self.setup_audio()

        # ROS 2 publishers
        self.command_pub = self.create_publisher(String, 'robot_commands', 10)

        # Start audio processing thread
        self.processing_thread = threading.Thread(
            target=self.process_audio_stream, daemon=True)
        self.processing_thread.start()

    def setup_audio(self):
        self.audio = pyaudio.PyAudio()
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=16000,
            input=True,
            frames_per_buffer=1024
        )

    def process_audio_stream(self):
        buffer = []
        while rclpy.ok():
            # Read an audio chunk; ignore buffer overflows that occur while
            # Whisper is busy transcribing
            data = self.stream.read(1024, exception_on_overflow=False)
            buffer.append(np.frombuffer(data, dtype=np.int16))

            # Accumulate roughly 3 seconds of audio before transcribing;
            # 64 ms chunks are far too short to recognize on their own
            if len(buffer) * 1024 < 3 * 16000:
                continue

            # Convert int16 samples to the normalized float32 array Whisper expects
            audio = np.concatenate(buffer).astype(np.float32) / 32768.0
            buffer = []

            # Process with Whisper
            result = self.model.transcribe(audio)
            text = result["text"]

            if text.strip():  # If we have recognized text
                self.process_command(text)

    def process_command(self, text):
        # Publish command to robot
        msg = String()
        msg.data = text
        self.command_pub.publish(msg)

        self.get_logger().info(f'Recognized: {text}')

def main():
    rclpy.init()
    node = CompleteVoiceToActionNode()
    try:
        rclpy.spin(node)
    finally:
        node.destroy_node()
        rclpy.shutdown()

10. Testing and Validation

10.1 Unit Testing

Test individual components (an example test for the command parser follows the list):

  • Audio input processing
  • Speech recognition accuracy
  • Command parsing
  • ROS 2 integration
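
For example, the CommandParser from Section 4.2 can be covered with a few plain unittest cases; the import path is a placeholder for your own package layout.

import unittest
# from your_package.command_parser import CommandParser  # adjust to your package layout

class TestCommandParser(unittest.TestCase):
    def setUp(self):
        self.parser = CommandParser()

    def test_move_command(self):
        result = self.parser.parse_command("Go to the kitchen")
        self.assertEqual(result['action'], 'move')
        self.assertEqual(result['target'], 'kitchen')

    def test_pick_command(self):
        result = self.parser.parse_command("pick up the red cup")
        self.assertEqual(result, {'action': 'pick', 'object': 'red cup'})

    def test_unknown_command(self):
        result = self.parser.parse_command("sing a song")
        self.assertEqual(result['action'], 'unknown')

if __name__ == '__main__':
    unittest.main()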

10.2 Integration Testing

Test the complete system:

  • End-to-end voice command processing
  • Robot response accuracy
  • System robustness
  • Error handling

Exercises

  1. Basic Setup: Install Whisper and test speech recognition with audio files
  2. Real-time Processing: Implement real-time audio capture and recognition
  3. Command Mapping: Create a command parser for navigation tasks
  4. ROS Integration: Integrate voice commands with a simple ROS 2 navigation system

Summary

This week we explored voice-to-action systems using OpenAI Whisper for speech recognition. We learned how to process audio input, recognize speech commands, parse natural language, and integrate with ROS 2 systems. Voice-to-action technology provides an intuitive interface for human-robot interaction, enabling robots to understand and respond to natural language commands.
