Real-Time STT: Technical Implementation


Real-time Speech-to-Text powers live captions, voice agents, transcription services, and multilingual meeting platforms. Palabra delivers production-ready streaming STT with sub-500ms latency across 60+ languages through a single WebSocket connection.

Real-Time STT Overview

What is Real-Time STT?

Real-time STT processes live audio streams as they’re captured, delivering text output within 300-800ms of speech. This enables applications like live captions, voice-to-text chat, real-time transcription, and interactive voice agents. Unlike batch processing which requires complete audio files, streaming STT works with continuous microphone input or WebRTC streams.

Key technical requirements:

• Low latency: <800ms end-to-end

• Continuous streaming: no start/stop cycles

• Partial results: display words as they're recognized, committing them once confidence is high (e.g. ≥95%)

• Bidirectional: handle interruptions and speaker changes
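The partial-results requirement comes down to simple state management: finalized segments accumulate, while the latest partial is shown provisionally and replaced on each update. A minimal sketch (the `text`/`final` fields mirror typical streaming STT messages; the class name is illustrative):

```javascript
// Hypothetical caption buffer: finalized segments accumulate, while the
// latest partial is shown provisionally and replaced on each update.
class CaptionBuffer {
  constructor() {
    this.finalized = [];
    this.partial = '';
  }

  // `result` mirrors a typical streaming STT message: { text, final }
  update(result) {
    if (result.final) {
      this.finalized.push(result.text);
      this.partial = '';
    } else {
      this.partial = result.text;
    }
    return this.render();
  }

  render() {
    return [...this.finalized, this.partial].join(' ').trim();
  }
}
```

Each non-final update overwrites the previous partial, so the caption line never shows stale low-confidence words.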

Streaming vs Batch

Aspect    | Streaming STT                | Batch STT
----------|------------------------------|----------------------
Input     | Live microphone/WebRTC       | Complete audio files
Latency   | 300-800ms                    | 5-30s
Accuracy  | 85-95%                       | 95-99%
Use Cases | Live captions, voice agents  | Podcasts, interviews
Cost      | Higher (per minute)          | Lower (per hour)

Streaming: Microphone → 20ms chunks → STT → partial text every 300ms → UI

Batch: Upload MP3 → queue job → full transcript in 2min → download

Latency Requirements

Excellent: <300ms (Deepgram, Palabra edge)
Good: 300-800ms (Google STT, AssemblyAI)
Acceptable: 800ms-2s (Whisper Live)
Unusable: >3s

Palabra target: 450ms p95 globally, 250ms p95 from nearest edge location.

Core Technical Components

Audio Pipeline

1. Microphone Capture

navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 16000,
    channelCount: 1,
    echoCancellation: true,
    noiseSuppression: true
  }
})

• Format: PCM16 (signed 16-bit), 16kHz, mono, little-endian

• Chrome WebRTC: 20ms packets (320 samples) are optimal for STT

2. Preprocessing

• VAD (Voice Activity Detection): Remove silence between words

• AGC (Automatic Gain Control): Normalize volume across speakers

• Noise Reduction: WebRTC built-in or RNNoise

• Resampling: 16kHz for all STT APIs
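As an illustration of the VAD step, a minimal energy-threshold gate over one PCM16 frame (real VADs such as WebRTC's or RNNoise use spectral features; the threshold value here is an arbitrary assumption):

```javascript
// Minimal energy-based voice activity gate over one PCM16 frame.
// Real VADs (WebRTC VAD, RNNoise) use spectral features; this
// illustrative version just compares RMS energy to a fixed threshold.
function isSpeech(samples, threshold = 500) {
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  const rms = Math.sqrt(sumSquares / samples.length);
  return rms > threshold;
}
```

Frames failing the gate can simply be skipped before `WebSocket.send()`, cutting bandwidth during silence.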

3. WebSocket Streaming

AudioContext → ScriptProcessorNode (512-sample buffers; buffer sizes must be powers of two)
  → PCM16 buffer → WebSocket.send() → STT API

STT Processing

• Acoustic Model: Converts audio spectrograms → phoneme probabilities

• Language Model: Phonemes → words using n-gram + neural LM

• Decoder: Viterbi beam search, 10-20 active paths

• Post-processing: Punctuation, capitalization, speaker labels

Speaker Diarization:

• Voice embeddings (d-vectors) per speaker

• Spectral clustering every 5-10s window

• Real-time: Track active speaker, label changes
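The embedding-based speaker tracking above can be sketched as a cosine-similarity check between consecutive d-vectors: when similarity drops below a threshold, flag a speaker change. The threshold and vector sizes here are illustrative assumptions:

```javascript
// Cosine similarity between two voice embeddings (d-vectors).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Flag a speaker change when the new embedding diverges from the
// previous one; 0.7 is an illustrative threshold, not a tuned value.
function speakerChanged(prevEmbedding, nextEmbedding, threshold = 0.7) {
  return cosineSimilarity(prevEmbedding, nextEmbedding) < threshold;
}
```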

Palabra Implementation

Palabra WebSocket API

Single endpoint architecture:

wss://api.palabra.ai/v1/stream?key=YOUR_KEY&lang=en&targets=es,fr,de

Request payload (binary):

Header (8 bytes): [chunk_size:2][timestamp:4][seq:2]
Audio (640 bytes): PCM16, 20ms @ 16kHz (320 samples × 2 bytes)
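Packing that frame layout with a `DataView` might look like the sketch below. The field widths follow the header spec above; the header's byte order is an assumption (big-endian here), while the PCM16 payload stays little-endian:

```javascript
// Pack one frame: 8-byte header [chunk_size:2][timestamp:4][seq:2]
// followed by the PCM16 payload (320 samples = 640 bytes at 20ms/16kHz).
// Header byte order is an assumption (big-endian); PCM16 is little-endian.
function packFrame(samples, timestampMs, seq) {
  const audioBytes = samples.length * 2;
  const buffer = new ArrayBuffer(8 + audioBytes);
  const view = new DataView(buffer);
  view.setUint16(0, audioBytes);   // chunk_size
  view.setUint32(2, timestampMs);  // timestamp
  view.setUint16(6, seq);          // sequence number
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(8 + i * 2, samples[i], true); // little-endian PCM16
  }
  return buffer;
}
```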

Response JSON (every 300ms):

{
  "timestamp": 1234567,
  "source": {"text": "Hello world", "confidence": 0.97},
  "targets": {
    "es": {"text": "Hola mundo", "confidence": 0.95},
    "fr": {"text": "Bonjour le monde", "confidence": 0.93}
  },
  "speaker": 1,
  "final": true
}

7-day free trial available at app.palabra.ai. Paid plans start at Pro (150 credits/month). Credits are charged per minute of usage; rates vary by product type and plan tier. All 60+ languages included on every plan.

3rd Party Integration

AssemblyAI + Palabra (best accuracy):

Audio → AssemblyAI WebSocket (/v2/realtime)
→ JSON transcript → Palabra POST /v1/translate
→ Multilingual captions

Deepgram Pipeline (lowest latency):

Deepgram (/v1/listen?model=nova-2) → English text
→ Palabra streaming translation → 60+ languages
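Chaining either pipeline means forwarding each finalized third-party transcript to Palabra's `/v1/translate` endpoint. A sketch under stated assumptions: the endpoint path comes from the pipelines above, but the payload field names (`text`, `source_lang`, `target_langs`) are illustrative guesses, not confirmed API parameters:

```javascript
// Build the translation request for a finalized third-party transcript.
// Payload field names are illustrative assumptions, not confirmed API fields.
function buildTranslateRequest(transcript, sourceLang, targetLangs) {
  return {
    url: 'https://api.palabra.ai/v1/translate',
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text: transcript,
        source_lang: sourceLang,
        target_langs: targetLangs
      })
    }
  };
}

// Wiring it to a transcript event (e.g. from the AssemblyAI socket):
async function translateFinal(transcript) {
  const { url, options } = buildTranslateRequest(transcript, 'en', ['es', 'fr']);
  const res = await fetch(url, options);
  return res.json();
}
```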

Code Examples

Browser (JavaScript)

Complete live caption implementation:

class PalabraSTT {
  constructor(apiKey) {
    this.ws = new WebSocket(`wss://api.palabra.ai/v1/stream?key=${apiKey}`);
    this.ws.binaryType = 'arraybuffer';
    this.setupAudioPipeline();
  }

  async setupAudioPipeline() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const audioContext = new AudioContext({ sampleRate: 16000 });
    const source = audioContext.createMediaStreamSource(stream);
    // ScriptProcessorNode buffer sizes must be powers of two (256-16384),
    // so capture 512-sample (32ms) buffers rather than 320.
    const processor = audioContext.createScriptProcessor(512, 1, 1);
    processor.onaudioprocess = (e) => {
      const input = e.inputBuffer.getChannelData(0);
      const pcm16 = this.floatTo16BitPCM(input);
      if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(pcm16);
      }
    };
    source.connect(processor);
    processor.connect(audioContext.destination); // keeps the node processing
  }

  // Convert Float32 samples in [-1, 1] to signed 16-bit PCM
  floatTo16BitPCM(input) {
    const output = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      const s = Math.max(-1, Math.min(1, input[i]));
      output[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    return output.buffer;
  }

  onResult(callback) {
    this.ws.onmessage = (event) => callback(JSON.parse(event.data));
  }
}

// Usage
const stt = new PalabraSTT('your_key');
stt.onResult((result) => {
  document.getElementById('captions-en').textContent = result.source.text;
  document.getElementById('captions-es').textContent = result.targets.es.text;
});

Node.js Server

Audio proxy for multiple clients:

const WebSocket = require('ws');
const { createAudioProcessor } = require('@palabra/sdk');

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (clientWs) => {
  // One upstream Palabra connection per client
  const palabraWs = new WebSocket('wss://api.palabra.ai/v1/stream?key=YOUR_KEY');
  const processor = createAudioProcessor();

  clientWs.on('message', (audioChunk) => {
    processor.process(audioChunk);
    if (palabraWs.readyState === WebSocket.OPEN) {
      palabraWs.send(processor.getPCMChunk());
    }
  });

  palabraWs.on('message', (result) => {
    clientWs.send(result);
  });

  // Tear down only this client's pair, not every connection on the server
  palabraWs.on('close', () => clientWs.close());
  clientWs.on('close', () => palabraWs.close());
});

Performance Optimization

Latency Techniques

1. Chunk Size Optimization

10ms (160 samples): 15% CPU, 20% packet loss
20ms (320 samples): OPTIMAL – 98% accuracy
50ms (800 samples): +200ms latency penalty
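The arithmetic behind these chunk sizes: samples = duration × sample rate / 1000, and PCM16 doubles that in bytes. A small helper makes the trade-off explicit:

```javascript
// Chunk-size arithmetic for mono PCM16 at a given sample rate:
// samples = ms * rate / 1000; bytes = samples * 2 (16-bit).
function chunkSize(ms, sampleRate = 16000) {
  const samples = (ms * sampleRate) / 1000;
  return { samples, bytes: samples * 2 };
}
```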

2. Endpoint Selection

curl -H "X-Region: auto" https://api.palabra.ai/closest
→ Returns: eu-west-1, us-east-1, ap-southeast-1

3. Predictive Rendering

if (result.confidence > 0.95 && result.final) {
  display(result.text);    // Commit to UI
} else {
  predict(result.partial); // Show provisionally
}

Error Handling

Reconnection with Exponential Backoff:

let reconnectAttempts = 0;
ws.onclose = () => {
  const delay = Math.min(1000 * Math.pow(2, reconnectAttempts), 30000);
  setTimeout(connect, delay);
  reconnectAttempts++;
};
ws.onopen = () => { reconnectAttempts = 0; }; // reset after a successful reconnect

Fallback Chain: Palabra → Deepgram → Browser SpeechRecognition → Offline mode
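The fallback chain amounts to trying each provider in order until one connects. A sketch, where each provider entry is a hypothetical async factory that resolves to a connected client or throws:

```javascript
// Try each STT provider in order; return the first that initializes.
// Each entry is an async factory resolving to a connected client or
// throwing on failure -- the entries themselves are placeholders.
async function connectWithFallback(providers) {
  for (const { name, connect } of providers) {
    try {
      return { name, client: await connect() };
    } catch (err) {
      // Fall through to the next provider in the chain
    }
  }
  throw new Error('All STT providers failed');
}
```

The final "Offline mode" step would be the last entry, returning a stub client that buffers audio for later batch upload.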

Production Checklist

Infrastructure

WebSocket Scaling:

• Kubernetes: 1 pod per 500 concurrent streams

• Redis Pub/Sub: Cross-region replication

• Load Balancer: Sticky sessions + health checks

• Monitoring: p95 latency <600ms, 99.9% uptime

Global CDN:

• Cloudflare Workers (20+ edge locations)

• Regional STT models (US/EU/Asia)

• <100ms WebSocket handshake globally

Security

• Audio Encryption: WebRTC SRTP + DTLS-SRTP end-to-end

• Data Controls: No audio storage, 30-day transcript TTL

• Compliance: GDPR, CCPA, HIPAA-ready (Enterprise)

• PII Redaction: Auto-detect names, emails, phone numbers
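A regex-based sketch of the PII redaction step, covering only emails and phone numbers (the patterns are deliberately simplified; name detection needs an NER model, and production redaction combines both approaches):

```javascript
// Simplified PII redaction over transcript text. Real pipelines combine
// NER models with patterns like these; the regexes below are illustrative
// and only handle emails and phone numbers, not names.
function redactPII(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[EMAIL]')
    .replace(/\+?\d[\d\s().-]{7,}\d/g, '[PHONE]');
}
```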