TalkTome Technical Details

Real-Time Speech Broadcast Platform with Live Translation: Technical Implementation Deep Dive

Overview

This post explores the technical implementation of our Speech Broadcast Platform with Live Translation, walking through real-world scenarios and the underlying code that powers them. We'll focus on three main aspects: broadcasting, real-time translation, and user interactions.

Technology Stack

Core Framework: NestJS

Our platform is built on NestJS, a progressive Node.js framework that gives us dependency injection, a modular architecture, and first-class WebSocket support out of the box. The root module wires the feature modules together:

import { Module } from '@nestjs/common';

@Module({
  imports: [
    PrismaModule,
    BroadcastRoomModule,
    AuthModule,
    SpeechModule,
  ],
  controllers: [AppController],
  providers: [AppService],
})
export class AppModule {}

Database Layer: Prisma ORM

Prisma serves as our data access layer, providing a fully typed client generated from the schema. The BroadcastRoom model captures each room's lifecycle and its listeners:

model BroadcastRoom {
  id            String    @id @default(uuid())
  broadcasterId String
  title         String
  isLive        Boolean   @default(false)
  createdAt     DateTime  @default(now())
  updatedAt     DateTime  @updatedAt
  listeners     User[]    @relation("RoomListeners")
}
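
To make the model concrete, here is an illustrative query (not taken from the codebase) using the generated Prisma Client to load a room together with its listeners:

import { PrismaClient } from '@prisma/client';

// Illustrative: fetch one room and follow the "RoomListeners" relation
async function getRoomWithListeners(prisma: PrismaClient, roomId: string) {
  return prisma.broadcastRoom.findUnique({
    where: { id: roomId },
    include: { listeners: true }, // pulls the related User rows in one query
  });
}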

API Architecture: REST & WebSockets

The platform combines REST endpoints for room lookups and metadata with WebSockets for real-time audio and transcription streaming. A typical REST controller:

import { Controller, Get, Param } from '@nestjs/common';
import { BroadcastService } from './broadcast.service';

@Controller('broadcast')
export class BroadcastController {
  constructor(private readonly broadcastService: BroadcastService) {}

  @Get(':roomId')
  async getRoom(@Param('roomId') roomId: string) {
    return this.broadcastService.getRoom(roomId);
  }
}
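
With the server running on the port exposed by the Dockerfile below, fetching a room is a single request (the room id is a placeholder):

curl http://localhost:3000/broadcast/<roomId>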

Deployment Infrastructure

Docker Containerization

The application is containerized using Docker for consistent deployment:

FROM node:18-alpine
 
WORKDIR /app
 
COPY package*.json ./
RUN npm install
 
COPY . .
RUN npm run build
 
EXPOSE 3000
CMD ["npm", "run", "start:prod"]
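
Building and running the image locally looks like this; the image name and env file are illustrative:

docker build -t talktome .
docker run -p 3000:3000 --env-file .env talktome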

Key benefits:

  • Identical runtime environment across development, staging, and production
  • Reproducible builds pinned to a known Node.js version (node:18-alpine)
  • A single artifact that can be scaled horizontally behind a load balancer

Environment Management

Configuration is handled through environment variables:

import { Injectable } from '@nestjs/common';

@Injectable()
export class ConfigService {
  private readonly config = {
    database: {
      url: process.env.DATABASE_URL,
    },
    azure: {
      speechKey: process.env.AZURE_SPEECH_KEY,
      region: process.env.AZURE_REGION,
    },
  };
}
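
A matching .env file might look like the following. The variable names come from the service above; the values, including the Postgres URL, are placeholders:

DATABASE_URL=postgresql://user:password@localhost:5432/talktome
AZURE_SPEECH_KEY=<your-azure-speech-key>
AZURE_REGION=westeurope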

Database Migration Strategy

Prisma handles our database migrations:

# Generate migration
prisma migrate dev --name add_broadcast_features
 
# Apply migration in production
prisma migrate deploy

Broadcasting Flow: A Real-World Scenario

1. Creating a Broadcast Room

When a broadcaster decides to start a new session, here's what happens behind the scenes:

@SubscribeMessage('createRoom')
handleCreateRoom(client: Socket, roomId: string) {
  try {
    const room = this.broadcastRoomService.createRoom(roomId, client.id);
    client.join(roomId);
    this.clientRooms.set(client.id, roomId);
    return { success: true, roomId };
  } catch (error) {
    return { success: false, error: error.message };
  }
}

The system:

  • Creates the room and records the connecting client as its broadcaster
  • Joins the client to the Socket.IO room so it receives room-scoped events
  • Tracks the client-to-room mapping for cleanup on disconnect
  • Returns a success or failure payload to the client
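
The gateway delegates to BroadcastRoomService, which isn't shown above. Here is a minimal sketch of what createRoom might look like, assuming rooms are tracked in an in-memory Map mirroring the gateway's clientRooms map (the Room shape and duplicate check are our assumptions):

// Hypothetical room bookkeeping inside BroadcastRoomService
interface Room {
  broadcasterId: string;
  listeners: Set<string>;
}

export class BroadcastRoomService {
  private rooms = new Map<string, Room>();

  createRoom(roomId: string, broadcasterId: string): Room {
    // Reject duplicates so two broadcasters can't claim the same room id
    if (this.rooms.has(roomId)) {
      throw new Error(`Room ${roomId} already exists`);
    }
    const room: Room = { broadcasterId, listeners: new Set() };
    this.rooms.set(roomId, room);
    return room;
  }
}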

2. Starting a Broadcast

When the broadcaster starts speaking:

@SubscribeMessage('startBroadcast')
async handleStartBroadcast(client: Socket) {
  // ... validation checks ...

  // Look up the room this client created earlier
  const roomId = this.clientRooms.get(client.id);
  if (!roomId) {
    return { success: false, error: 'No active room for this client' };
  }

  // Create push audio stream
  const pushStream = this.broadcastRoomService.setupAudioStream(roomId);

  // Setup speech recognition
  const audioConfig = speechSDK.AudioConfig.fromStreamInput(pushStream);
  const recognizer = this.broadcastRoomService.setupRecognizer(roomId, audioConfig);

  // Handle real-time transcription: stream partial results to the room
  recognizer.recognizing = (s, e) => {
    this.server.to(roomId).emit('transcribing', {
      original: e.result.text,
      translation: e.result.translations.get('fr')
    });
  };

  // Start continuous recognition so transcription actually begins
  recognizer.startContinuousRecognitionAsync();
}

The platform:

  1. Sets up an audio stream for the broadcaster
  2. Initializes speech recognition
  3. Begins real-time transcription
  4. Broadcasts both audio and transcriptions to listeners
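
setupAudioStream is referenced above but not shown. A minimal sketch using the Speech SDK's push-stream API could look like this; the 16 kHz / 16-bit / mono format is an assumption about the client's capture settings:

import * as speechSDK from 'microsoft-cognitiveservices-speech-sdk';

// Hypothetical implementation: create a push stream that the gateway writes
// raw PCM chunks into as they arrive over the WebSocket. The stream would
// also be stored on the room object (room.audioStream) for later writes.
function setupAudioStream(): speechSDK.PushAudioInputStream {
  const format = speechSDK.AudioStreamFormat.getWaveFormatPCM(16000, 16, 1);
  return speechSDK.AudioInputStream.createPushStream(format);
}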

Real-Time Translation System

Speech-to-Text Configuration

The speech service is configured with Microsoft's Cognitive Services:

createSpeechTranslationConfig() {
  const config = speechSDK.SpeechTranslationConfig.fromSubscription(
    speechConfig.azure.subscriptionKey,
    speechConfig.azure.region,
  );
  
  config.speechSynthesisVoiceName = speechConfig.defaults.targetVoice;
  config.speechRecognitionLanguage = speechConfig.defaults.speechLang;
  config.addTargetLanguage(speechConfig.defaults.targetLang);
  
  return config;
}
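
The speechConfig object referenced here is our own configuration module. Its exact shape isn't shown, but a plausible sketch looks like this; the target language matches the 'fr' used in the gateway, while the other values are illustrative defaults:

// Assumed shape of the speechConfig module; values are examples only
export const speechConfig = {
  azure: {
    subscriptionKey: process.env.AZURE_SPEECH_KEY,
    region: process.env.AZURE_REGION,
  },
  defaults: {
    speechLang: 'en-US',               // language the broadcaster speaks
    targetLang: 'fr',                  // translation target (see gateway above)
    targetVoice: 'fr-FR-DeniseNeural', // Azure neural voice for synthesis
  },
};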

Translation Flow

When a broadcaster speaks:

  1. Audio is captured and streamed in real-time
  2. Speech is recognized and translated simultaneously
  3. Translated text is synthesized into audio
  4. Both original and translated content are broadcast to listeners

The recognized handler ties steps 2-4 together:

recognizer.recognized = async (s, e) => {
  if (e.result.reason === speechSDK.ResultReason.TranslatedSpeech) {
    const translations = e.result.translations;
    const translatedText = translations.get(targetLanguage);
    
    // Synthesize speech from translation
    const audioResult = await synthesizeTranslation(translatedText);
    
    // Broadcast to room
    this.server.to(roomId).emit('transcribed', {
      original: e.result.text,
      translation: translatedText,
      audioData: audioResult.audioData
    });
  }
};
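
synthesizeTranslation is elided above. A hedged sketch using the SDK's SpeechSynthesizer, passing null as the audio config so the synthesized audio stays in memory instead of playing on the server:

import * as speechSDK from 'microsoft-cognitiveservices-speech-sdk';

// Hypothetical helper: turn translated text into audio bytes for broadcast
function synthesizeTranslation(
  config: speechSDK.SpeechConfig,
  text: string,
): Promise<{ audioData: ArrayBuffer }> {
  // A null audio config keeps the output in result.audioData
  const synthesizer = new speechSDK.SpeechSynthesizer(config, null);
  return new Promise((resolve, reject) => {
    synthesizer.speakTextAsync(
      text,
      (result) => {
        synthesizer.close();
        resolve({ audioData: result.audioData });
      },
      (error) => {
        synthesizer.close();
        reject(error);
      },
    );
  });
}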

Listener Experience

When a listener joins a broadcast:

@SubscribeMessage('joinRoom')
handleJoinRoom(client: Socket, roomId: string) {
  try {
    const room = this.broadcastRoomService.getRoom(roomId);
    if (!room) {
      throw new Error('Room not found');
    }
    
    this.broadcastRoomService.addListener(roomId, client.id);
    client.join(roomId);
    this.clientRooms.set(client.id, roomId);
    return { success: true };
  } catch (error) {
    return { success: false, error: error.message };
  }
}

The listener immediately:

  • Is joined to the Socket.IO room and registered in the room's listener list
  • Starts receiving partial transcriptions via the transcribing event
  • Starts receiving finalized text, its translation, and synthesized audio via the transcribed event
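
On the client side, a listener built with socket.io-client might look roughly like this; the server URL and room id are placeholders, and audio playback is left out:

import { io } from 'socket.io-client';

// Hypothetical listener client for the broadcast-room namespace
const socket = io('http://localhost:3000/broadcast-room');
const roomId = 'some-room-id'; // placeholder

socket.emit('joinRoom', roomId, (ack: { success: boolean; error?: string }) => {
  if (!ack.success) console.error('Join failed:', ack.error);
});

// Partial transcriptions while the broadcaster is still speaking
socket.on('transcribing', ({ original, translation }) => {
  console.log(`live: ${original} -> ${translation}`);
});

// Finalized text plus synthesized translated audio
socket.on('transcribed', ({ original, translation, audioData }) => {
  console.log(`final: ${original} -> ${translation}`);
  // audioData (an ArrayBuffer) could be queued into the Web Audio API here
});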

Data Flow Architecture

Broadcaster → WebSocket → Speech Service → Translation Service → WebSocket → Listeners
     ↓                                                                          ↑
   Audio                                                                     Audio
     ↓                                                                          ↑
Real-time                                                                 Translated
  Speech → Recognition → Translation → Text-to-Speech → Translated Audio → Audio

Error Handling and Recovery

The system implements robust error handling:

recognizer.canceled = (s, e) => {
  this.logger.error('Recognition canceled:', e.errorDetails);
  this.server.to(roomId).emit('error', {
    message: `Recognition canceled: ${e.errorDetails}`
  });
  recognizer.stopContinuousRecognitionAsync();
};

Performance Optimizations

  1. Audio Processing

    • Efficient binary audio data transmission
    • Proper byte alignment for audio buffers
    • Chunked buffer copies (shown in full under Technical Challenges below)
  2. Real-time Translation

    • Continuous recognition mode
    • Parallel processing of speech and translation
    • Efficient WebSocket broadcasting

Technical Challenges and Solutions

1. Binary Audio Processing

One of our biggest challenges was handling real-time audio streams efficiently. Each incoming chunk is copied into a properly aligned buffer before being written to the recognizer's push stream:

// Efficient binary audio processing
const audioData = new Int16Array(payload.data);
const buffer = new ArrayBuffer(audioData.length * 2);
const view = new Int16Array(buffer);
view.set(audioData);
 
// Process audio in chunks for optimal performance
room.audioStream.write(buffer);
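
In context, this copy lives inside a WebSocket message handler. A hedged sketch, assuming an audioData message that carries the raw samples and a room object exposing the push stream:

// Hypothetical gateway handler receiving audio chunks from the broadcaster
@SubscribeMessage('audioData')
handleAudioData(client: Socket, payload: { data: ArrayBuffer }) {
  const roomId = this.clientRooms.get(client.id);
  const room = roomId && this.broadcastRoomService.getRoom(roomId);
  if (!room) return;

  // Copy into a fresh, 2-byte-aligned buffer before writing to the stream
  const audioData = new Int16Array(payload.data);
  const buffer = new ArrayBuffer(audioData.length * 2);
  const view = new Int16Array(buffer);
  view.set(audioData);

  room.audioStream.write(buffer);
}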

Key optimizations:

  • Audio travels over the socket as binary payloads rather than JSON-encoded arrays
  • Buffers are 2-byte aligned to match the 16-bit PCM sample size
  • Chunks are written to the push stream as they arrive, so recognition never waits for a full utterance

2. Concurrent Translation System

The platform handles multiple language streams simultaneously while maintaining audio synchronization:

import { Injectable } from '@nestjs/common';

@Injectable()
export class TranslationService {
  async handleMultiLanguageStream(audioInput: Buffer) {
    // Process multiple language streams in parallel
    const translations = await Promise.all([
      this.translateStream(audioInput, 'en-US'),
      this.translateStream(audioInput, 'fr-FR'),
      this.translateStream(audioInput, 'es-ES')
    ]);
    
    // Synchronize audio outputs
    return this.synchronizeStreams(translations);
  }
}

3. Scalable WebSocket Architecture

Our room-based broadcasting system is designed for scalability:

import { WebSocketGateway } from '@nestjs/websockets';
import { Socket } from 'socket.io';

@WebSocketGateway({
  cors: true,
  namespace: 'broadcast-room'
})
export class BroadcastRoomGateway {
  private clientRooms = new Map<string, string>();
  private roomMetrics = new Map<string, RoomMetrics>();

  // Injected connection-metrics service (assumed to exist elsewhere)
  constructor(private readonly metrics: MetricsService) {}

  async handleConnection(client: Socket) {
    // Track client connections
    this.metrics.trackConnection(client.id);
  }
 
  async handleDisconnect(client: Socket) {
    // Clean up resources
    const roomId = this.clientRooms.get(client.id);
    if (roomId) {
      await this.cleanupRoom(roomId, client.id);
    }
  }
}
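
cleanupRoom is referenced in handleDisconnect but not shown. A hedged sketch of what it might do; stopBroadcast and removeListener are assumed service methods:

// Hypothetical cleanup: detach the client, and tear the room down
// if the departing client was the broadcaster
private async cleanupRoom(roomId: string, clientId: string) {
  const room = this.broadcastRoomService.getRoom(roomId);
  if (!room) return;

  if (room.broadcasterId === clientId) {
    // Broadcaster left: stop recognition and notify listeners
    await this.broadcastRoomService.stopBroadcast(roomId);
    this.server.to(roomId).emit('broadcastEnded', { roomId });
  } else {
    this.broadcastRoomService.removeListener(roomId, clientId);
  }

  this.clientRooms.delete(clientId);
}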

Key Technical Achievements

1. Performance Metrics

The platform is built for low latency: partial transcriptions stream to listeners while the broadcaster is still speaking, and translated audio follows as each utterance is finalized.

2. System Resilience

When a client drops, the platform preserves its session state and attempts to reconnect it:

import { Injectable } from '@nestjs/common';

@Injectable()
export class ConnectionManager {
  async handleDisconnect(client: Socket) {
    // Store session state
    await this.sessionStore.preserve(client.id);
    
    // Attempt reconnection
    return this.reconnectionStrategy.execute(client);
  }
}

3. Technical Optimizations

Binary audio transmission, continuous recognition mode, parallel multi-language translation, and room-scoped WebSocket broadcasting keep per-message overhead low as rooms grow.

Conclusion

This implementation showcases how modern web technologies can be combined to create a sophisticated broadcasting platform with real-time translation capabilities. The system maintains low latency while providing high-quality audio streaming and translation services.