Real-time Speech-to-Text powers live captions, voice agents, transcription services, and multilingual meeting platforms. Palabra delivers production-ready streaming STT with sub-500ms latency across 60+ languages through a single WebSocket connection.
Real-Time STT Overview
What is Real-Time STT?
Real-time STT processes live audio streams as they're captured, delivering text output within 300-800ms of speech. This enables applications like live captions, voice-to-text chat, real-time transcription, and interactive voice agents. Unlike batch processing, which requires complete audio files, streaming STT works with continuous microphone input or WebRTC streams.
Key technical requirements:
• Low latency: <800ms end-to-end
• Continuous streaming: No start/stop cycles
• Partial results: Display words as they're recognized; commit them once confidence reaches ~95%
• Bidirectional: Handle interruptions and speaker changes
Streaming vs Batch
| Aspect | Streaming STT | Batch STT |
| --- | --- | --- |
| Input | Live microphone/WebRTC | Complete audio files |
| Latency | 300-800ms | 5-30s |
| Accuracy | 85-95% | 95-99% |
| Use Cases | Live captions, voice agents | Podcasts, interviews |
| Cost | Higher (per minute) | Lower (per hour) |
Streaming: Microphone → 100ms chunks → STT → partial text every 300ms → UI
Batch: Upload MP3 → queue job → full transcript in 2min → download
Latency Requirements
Excellent: <300ms (Deepgram, Palabra edge)
Good: 300-800ms (Google STT, AssemblyAI)
Acceptable: 800ms-2s (Whisper Live)
Unusable: >3s
Palabra target: 450ms p95 globally, 250ms p95 from nearest edge location.
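To check these numbers against your own deployment, you can timestamp outgoing audio and compare it with incoming results. A rough client-side sketch, assuming the timestamp written into each request frame (see the payload format under Palabra Implementation below) is echoed back in the response's timestamp field; that echo is an assumption, not documented behavior:

```javascript
// Correlate sent audio frames with returned transcripts to estimate latency.
const inFlight = new Map();                   // frame timestamp → send time (ms)

function markSent(frameTimestamp) {
  inFlight.set(frameTimestamp, performance.now());
}

function markReceived(result) {
  const sentAt = inFlight.get(result.timestamp);
  if (sentAt === undefined) return;           // no matching frame recorded
  inFlight.delete(result.timestamp);
  console.log(`end-to-end latency: ${(performance.now() - sentAt).toFixed(0)}ms`);
}
```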
Core Technical Components
Audio Pipeline
1. Microphone Capture
```javascript
navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 16000,
    channelCount: 1,
    echoCancellation: true,
    noiseSuppression: true
  }
})
```
• Format: PCM16 (signed 16-bit), 16kHz, mono, little-endian
• Chrome WebRTC: 20ms packets (320 samples) optimal for STT
2. Preprocessing
• VAD (Voice Activity Detection): Drop silent stretches between utterances
• AGC (Automatic Gain Control): Normalize volume across speakers
• Noise Reduction: WebRTC built-in or RNNoise
• Resampling: Convert incoming audio (typically 44.1/48kHz) to the 16kHz that most streaming STT APIs expect (see the sketch below)
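If you capture at the hardware rate instead of letting the AudioContext resample for you, a minimal decimating resampler looks like this (a sketch: it averages samples rather than applying a proper low-pass filter):

```javascript
// Naive 48kHz → 16kHz downsampler: averages every 3 input samples into 1.
// A production pipeline would low-pass filter first (or let the 16kHz
// AudioContext in the browser example below do the resampling).
function downsampleTo16k(input48k) {
  const ratio = 3;                                   // 48000 / 16000
  const output = new Float32Array(Math.floor(input48k.length / ratio));
  for (let i = 0; i < output.length; i++) {
    const start = i * ratio;
    output[i] = (input48k[start] + input48k[start + 1] + input48k[start + 2]) / ratio;
  }
  return output;
}
```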
3. WebSocket Streaming
AudioContext → ScriptProcessorNode → re-chunk to 320-sample (20ms) frames
→ PCM16 buffer → WebSocket.send() → STT API
STT Processing
• Acoustic Model: Converts audio spectrograms → phoneme probabilities
• Language Model: Phonemes → words using n-gram + neural LM
• Decoder: Viterbi beam search, 10-20 active paths
• Post-processing: Punctuation, capitalization, speaker labels
Speaker Diarization:
• Voice embeddings (d-vectors) per speaker
• Spectral clustering every 5-10s window
• Real-time: Track active speaker, label changes
Palabra Implementation
Palabra WebSocket API
Single endpoint architecture:
wss://api.palabra.ai/v1/stream?key=YOUR_KEY&lang=en&targets=es,fr,de
Request payload (binary):
Header (8 bytes): [chunk_size:2][timestamp:4][seq:2]
Audio (640 bytes): PCM16, 320 samples (20ms @ 16kHz)
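Packing one frame in the browser might look like this (a sketch following the layout above; little-endian byte order and chunk_size-in-bytes are assumptions):

```javascript
// Build one binary frame: 8-byte header [chunk_size:2][timestamp:4][seq:2]
// followed by the PCM16 payload.
function packFrame(pcm16Buffer, timestamp, seq) {
  const frame = new ArrayBuffer(8 + pcm16Buffer.byteLength);
  const view = new DataView(frame);
  view.setUint16(0, pcm16Buffer.byteLength, true);   // chunk size in bytes (assumed)
  view.setUint32(2, timestamp, true);                // capture timestamp
  view.setUint16(6, seq, true);                      // sequence number
  new Uint8Array(frame, 8).set(new Uint8Array(pcm16Buffer));
  return frame;                                      // pass to WebSocket.send()
}
```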
Response JSON (every 300ms):
{"timestamp": 1234567,"source": {"text": "Hello world", "confidence": 0.97},"targets": {"es": {"text": "Hola mundo", "confidence": 0.95},"fr": {"text": "Bonjour le monde", "confidence": 0.93}},"speaker": 1,"final": true}
7-day free trial available at app.palabra.ai. Paid plans start at Pro (150 credits/month). Credits are charged per minute of usage; rates vary by product type and plan tier. All 60+ languages included on every plan.
3rd Party Integration
AssemblyAI + Palabra (best accuracy):
Audio → AssemblyAI WebSocket (/v2/realtime)
→ JSON transcript → Palabra POST /v1/translate
→ Multilingual captions
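A sketch of the hand-off, assuming an AssemblyAI realtime session is already open: the AssemblyAI message fields (message_type, text) follow their realtime API, while the Palabra request body and auth header shown here are illustrative assumptions, not a documented schema.

```javascript
// Forward each final AssemblyAI transcript to Palabra for translation.
function bridgeToPalabra(assemblyWs, palabraKey, onCaptions) {
  assemblyWs.onmessage = async (event) => {
    const msg = JSON.parse(event.data);
    if (msg.message_type !== 'FinalTranscript' || !msg.text) return;
    const res = await fetch('https://api.palabra.ai/v1/translate', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${palabraKey}`,     // auth scheme assumed
      },
      body: JSON.stringify({ text: msg.text, source: 'en', targets: ['es', 'fr'] }),
    });
    onCaptions(await res.json());                    // multilingual captions for the UI
  };
}
```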
Deepgram Pipeline (lowest latency):
Deepgram (/v1/listen?model=nova-2) → English text
→ Palabra streaming translation → 60+ languages
Code Examples
Browser (JavaScript)
Complete live caption implementation:
```javascript
class PalabraSTT {
  constructor(apiKey) {
    this.ws = new WebSocket(`wss://api.palabra.ai/v1/stream?key=${apiKey}`);
    this.pending = [];                    // samples queued until a full 20ms chunk
    this.setupAudioPipeline();
  }

  async setupAudioPipeline() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const audioContext = new AudioContext({ sampleRate: 16000 });
    const source = audioContext.createMediaStreamSource(stream);
    // ScriptProcessorNode buffer sizes must be powers of two, so capture 512
    // samples and re-chunk to 320 samples (20ms @ 16kHz) before sending.
    // (ScriptProcessorNode is deprecated in favor of AudioWorklet; kept here for brevity.)
    const processor = audioContext.createScriptProcessor(512, 1, 1);
    processor.onaudioprocess = (e) => {
      const input = e.inputBuffer.getChannelData(0);
      this.pending.push(...input);
      while (this.pending.length >= 320) {
        const chunk = this.pending.splice(0, 320);
        if (this.ws.readyState === WebSocket.OPEN) {
          this.ws.send(this.floatTo16BitPCM(chunk));
        }
      }
    };
    source.connect(processor);
    processor.connect(audioContext.destination);
  }

  // Convert Float32 samples (-1..1) to signed 16-bit PCM
  floatTo16BitPCM(input) {
    const output = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      const s = Math.max(-1, Math.min(1, input[i]));
      output[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    return output.buffer;
  }

  onResult(callback) {
    this.ws.onmessage = (event) => callback(JSON.parse(event.data));
  }
}

// Usage
const stt = new PalabraSTT('your_key');
stt.onResult((result) => {
  document.getElementById('captions-en').textContent = result.source.text;
  document.getElementById('captions-es').textContent = result.targets.es.text;
});
```
Node.js Server
Audio proxy for multiple clients:
```javascript
const WebSocket = require('ws');
const { createAudioProcessor } = require('@palabra/sdk');

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (clientWs) => {
  // One upstream Palabra connection per client
  const palabraWs = new WebSocket('wss://api.palabra.ai/v1/stream?key=YOUR_KEY');
  const processor = createAudioProcessor();

  // Client audio in → normalized PCM16 chunks → forward upstream
  clientWs.on('message', (audioChunk) => {
    processor.process(audioChunk);
    if (palabraWs.readyState === WebSocket.OPEN) {
      palabraWs.send(processor.getPCMChunk());
    }
  });

  // Transcripts/translations back to the client
  palabraWs.on('message', (result) => {
    clientWs.send(result);
  });

  // If the upstream drops, give in-flight results a second, then disconnect clients
  palabraWs.on('close', () => {
    setTimeout(() => wss.clients.forEach((ws) => ws.close()), 1000);
  });
});
```
Performance Optimization
Latency Techniques
1. Chunk Size Optimization
10ms (160 samples): 15% CPU, 20% packet loss
20ms (320 samples): OPTIMAL – 98% accuracy
50ms (800 samples): +200ms latency penalty
2. Endpoint Selection
curl -H "X-Region: auto" https://api.palabra.ai/closest
→ Returns: eu-west-1, us-east-1, ap-southeast-1
3. Predictive Rendering
```javascript
if (result.final && result.source.confidence > 0.95) {
  display(result.source.text);    // Commit to the caption UI
} else {
  predict(result.source.text);    // Show tentatively; may be revised
}
```
Error Handling
Reconnection with Exponential Backoff:
```javascript
let reconnectAttempts = 0;

ws.onclose = () => {
  const delay = Math.min(1000 * Math.pow(2, reconnectAttempts), 30000);
  setTimeout(connect, delay);   // connect() recreates the WebSocket
  reconnectAttempts++;          // reset to 0 in ws.onopen once reconnected
};
```
Fallback Chain: Palabra → Deepgram → Browser SpeechRecognition → Offline mode
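If every networked provider in the chain is unavailable, the browser's built-in Web Speech API can keep English-only captions flowing (no translation). A minimal sketch of that last hop before offline mode:

```javascript
// Fall back to the browser's SpeechRecognition when streaming providers fail.
function startBrowserFallback(onText) {
  const SR = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (!SR) return false;                // unsupported → drop to offline mode
  const rec = new SR();
  rec.continuous = true;
  rec.interimResults = true;
  rec.onresult = (e) => {
    const last = e.results[e.results.length - 1];
    onText(last[0].transcript, last.isFinal);
  };
  rec.start();
  return true;
}
```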
Production Checklist
Infrastructure
WebSocket Scaling:
• Kubernetes: 1 pod per 500 concurrent streams
• Redis Pub/Sub: Cross-region replication
• Load Balancer: Sticky sessions + health checks
• Monitoring: p95 latency <600ms, 99.9% uptime
Global CDN:
• Cloudflare Workers (20+ edge locations)
• Regional STT models (US/EU/Asia)
• <100ms WebSocket handshake globally
Security
• Audio Encryption: WebRTC SRTP + DTLS-SRTP end-to-end
• Data Controls: No audio storage, 30-day transcript TTL
• Compliance: GDPR, CCPA, HIPAA-ready (Enterprise)
• PII Redaction: Auto-detect names, emails, and phone numbers (see the sketch below)
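Name detection needs an NLP model, but emails and phone numbers can be scrubbed client-side before transcripts leave the page. A deliberately simple sketch:

```javascript
// Mask obvious emails and phone numbers in a transcript string.
function redactPII(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email]')
    .replace(/\+?\d[\d\s().-]{7,}\d/g, '[phone]');
}

// redactPII("Call me at +1 (555) 123-4567 or jane@example.com")
// → "Call me at [phone] or [email]"
```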