Whisper

OpenAI Whisper is a speech-to-text tool.

Whisper: An automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It can transcribe speech in multiple languages, translate speech from those languages into English, and identify the spoken language.
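Whisper architecture: An encoder-decoder Transformer. Input audio is split into 30-second chunks, each chunk is converted to a log-Mel spectrogram, and the spectrogram is passed to the encoder. The decoder predicts the corresponding text caption, intermixed with special tokens that direct the model to perform a particular task, such as language identification, timestamp prediction, transcription, or translation to English.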
Whisper performance: A robust and versatile system. Because it was trained on a large, diverse dataset rather than fine-tuned to one benchmark, it does not beat models specialized for LibriSpeech, but in zero-shot evaluation across many diverse datasets it makes about 50% fewer errors than those models. It is especially strong at speech-to-English translation.
Whisper applications: A foundation that developers and researchers can use to add voice interfaces and other speech processing to their applications. The models and inference code are open source, released together with the paper and a model card.
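
The 30-second chunking and task tokens described under "Whisper architecture" can be sketched in a few lines of plain Python. This is only an illustration, not Whisper's actual code: the function names split_into_chunks and build_task_prompt are hypothetical, the 16 kHz sample rate is an assumption not stated above, and the real system works on log-Mel spectrograms and token IDs rather than raw sample lists and strings. The token spellings follow the ones used in the released Whisper tokenizer.

```python
from typing import List

SAMPLE_RATE = 16_000                         # assumed sample rate (Hz), not from the notes above
CHUNK_SECONDS = 30                           # chunk length stated in the notes
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # samples per 30-second chunk


def split_into_chunks(samples: List[float]) -> List[List[float]]:
    """Split audio into fixed 30-second chunks, zero-padding the last one.

    Hypothetical helper: a simplified, non-overlapping split.
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        # Pad a short final chunk with silence so every chunk has equal length.
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


def build_task_prompt(language: str, task: str = "transcribe",
                      timestamps: bool = False) -> List[str]:
    """Build the special-token prefix that tells the decoder what to do.

    Hypothetical helper: returns token strings, not tokenizer IDs.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Without this token the decoder is expected to emit timestamp tokens.
        prompt.append("<|notimestamps|>")
    return prompt


if __name__ == "__main__":
    audio = [0.0] * (SAMPLE_RATE * 45)       # 45 seconds of silence
    print(len(split_into_chunks(audio)))     # number of 30-second chunks
    print(build_task_prompt("en"))
```

The same audio chunk can be sent through the model with different prefixes, which is how one model covers transcription, translation, and timestamp prediction.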