Speech Recognition

Models for transcribing, translating, and understanding spoken audio.

Speech Synthesis · Independent Researcher

A Broad Benchmark for Long-Form Speech Generation

This benchmark turns long-form speech generation into a checkable test, with concrete failure signals, stated benchmark limits, and takeaways for builders.

Brain Decoding · Meta AI

Brain2Qwerty: Non-Invasive Brain-to-Text Decoding

Brain2Qwerty decodes typed sentences from non-invasive brain recordings: MEG reaches a 32% character error rate (CER) on average, EEG trails at 67%, and the best participants reach 19%.

Multimodal Models · Skywork AI

Audio Interaction Model: A Streaming Audio LLM That Decides When to Speak

The Audio Interaction Model runs a perceive-decide-respond loop so an audio LLM listens, decides if and when to reply, and answers on the fly; trained on StreamAudio-2M and competitive across 8 benchmarks.

Speech Recognition · Shanghai AI Laboratory

Mega-ASR: Scaling Acoustic Simulation for In-the-Wild Speech Recognition

Mega-ASR targets the noise-robustness gap in ASR by synthesizing 2.4M clips across 54 compound acoustic scenarios, then training Qwen3-ASR-1.7B in two stages, lowering word error rate (WER) on VOiCES R4-B-F to 45.69% (versus 54.01%).

Speech Recognition · OpenAI

Whisper: 680,000 Hours of Weak Supervision for Robust ASR

OpenAI's Whisper trains a single sequence-to-sequence model on 680,000 hours of web audio. Used zero-shot, with no fine-tuning, it is competitive with fully supervised systems, and it also handles translation and language identification.