MAI-Transcribe-1
Version: 2026-01-23
MAI Audio Models
MAI‑Transcribe‑1 is a best-in-class speech‑to‑text model, designed for real‑world audio. It provides consistently strong transcription accuracy across accents, speaking styles, and noisy environments, giving developers a strong foundation for building high‑quality voice understanding into their applications.
Key capabilities
About this model
MAI‑Transcribe‑1 is a speech‑to‑text model built in‑house by the Microsoft AI Superintelligence team, designed to deliver reliable transcription across 25 languages. It powers a wide range of use cases, including video captions, meeting transcription, accessibility tools, call analysis, content creation workflows, and voice agents. The model is optimized to be robust across diverse accents, dialects, and real‑world acoustic conditions, giving developers a transcription system they can rely on. MAI‑Transcribe‑1 is actively under development, with new capabilities coming soon, including real‑time transcription, diarization, and context biasing.
Key model capabilities
- Best-in-class accuracy across 25 languages: English, French, German, Italian, Spanish, Hindi, Portuguese, Czech, Danish, Finnish, Hungarian, Dutch, Polish, Romanian, Swedish, Japanese, Korean, Chinese, Arabic, Indonesian, Russian, Thai, Turkish, and Vietnamese.
- Robust to real-world noisy situations.
- Automatic language identification.
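A client may want to validate a request before submitting audio. The sketch below is a hypothetical helper, not part of any Azure SDK: it checks an expected language name against the list above and a file extension against the formats given under Input formats (WAV, MP3, FLAC). The names `preflight_issues`, `SUPPORTED_LANGUAGES`, and `SUPPORTED_EXTENSIONS` are illustrative assumptions, and checking the extension alone is an assumption too, since an extension does not guarantee the actual encoding.

```python
from pathlib import Path
from typing import List, Optional

# Language names copied from the capability list above.
SUPPORTED_LANGUAGES = frozenset({
    "english", "french", "german", "italian", "spanish", "hindi",
    "portuguese", "czech", "danish", "finnish", "hungarian", "dutch",
    "polish", "romanian", "swedish", "japanese", "korean", "chinese",
    "arabic", "indonesian", "russian", "thai", "turkish", "vietnamese",
})

# Assumed extension mapping for the documented formats (WAV, MP3, FLAC).
SUPPORTED_EXTENSIONS = frozenset({".wav", ".mp3", ".flac"})


def preflight_issues(audio_path: str,
                     expected_language: Optional[str] = None) -> List[str]:
    """Return human-readable problems found; an empty list means none."""
    issues = []
    suffix = Path(audio_path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        issues.append(f"unsupported file extension: {suffix or '(none)'}")
    # The language is optional because the model identifies it automatically.
    if expected_language is not None:
        if expected_language.strip().lower() not in SUPPORTED_LANGUAGES:
            issues.append(
                f"language not in the documented list: {expected_language}"
            )
    return issues
```

Calling `preflight_issues("call.wav", "German")` returns an empty list, while an unlisted language or extension produces one message per problem. Returning a list rather than raising lets the caller decide whether to block the upload or only log a warning.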
Use cases
See Responsible AI for additional considerations for responsible use.
Key use cases
| Use case | Scenario | Solution |
|---|---|---|
| Live captions | A virtual event platform provides real-time captions for webinars. | Chunk audio and transcribe spoken content into captions displayed live during the event. |
| Call center transcription | A call center wants accurate, fast transcriptions of customer calls to empower their customer service agents. | Transcribe calls in real time, enabling agents to better understand and respond to customer queries. |
| Video subtitling | A video-hosting platform needs to generate subtitles for uploaded videos. | Transcribe the full video audio to produce a complete subtitle track. |
| Accessibility | An organization needs to make audio content accessible to deaf or hard-of-hearing users. | Transcribe audio from meetings, announcements, or media to provide text alternatives that support compliance and inclusive access. |
| E-learning | An e-learning platform provides transcriptions for video lectures. | Process prerecorded lecture videos, generating text transcripts for students. |
| Media archiving | A media company needs subtitles for a large archive of videos. | Transcribe video files in bulk, generating accurate subtitles for each video. |
| Market research | A research firm analyzes customer feedback from audio recordings. | Convert audio feedback into text, enabling easier analysis and insights extraction. |
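Several of the solutions in the table rely on splitting audio into chunks before transcription. The following is a minimal sketch of that step only, using Python's standard `wave` module to cut a WAV recording into fixed-length pieces that are each a valid WAV file. The function name `iter_wav_chunks` and the 10-second default are assumptions for illustration. The transcription call itself is omitted because this document does not specify the request format, and real-time transcription is listed as not yet supported, so the chunks here are intended for submission one file at a time.

```python
import io
import wave
from typing import Iterator


def iter_wav_chunks(wav_bytes: bytes,
                    chunk_seconds: float = 10.0) -> Iterator[bytes]:
    """Yield standalone WAV byte strings of up to chunk_seconds each."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        # Number of audio frames that make up one chunk.
        frames_per_chunk = max(1, int(src.getframerate() * chunk_seconds))
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out = io.BytesIO()
            with wave.open(out, "wb") as dst:
                # Reuse the source channel count, sample width, and rate;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            yield out.getvalue()
```

Each yielded chunk carries its own header, so it can be saved or uploaded independently. Fixed-length cuts can split a word across two chunks; cutting at silences or overlapping chunks slightly are common ways to reduce that, at the cost of extra logic.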
Out of scope use cases
Real‑time transcription, diarization, and context biasing aren't supported yet; these capabilities are planned for an upcoming release.
Pricing
Pay-As-You-Go and Commitment Tiers. See pricing details here.
Technical specs
This information is not available.
Training cut-off date
This information is not available.
Input formats
LLM Speech: WAV, MP3, FLAC
Supported languages
English, French, German, Italian, Spanish, Hindi, Portuguese, Czech, Danish, Finnish, Hungarian, Dutch, Polish, Romanian, Swedish, Japanese, Korean, Chinese, Arabic, Indonesian, Russian, Thai, Turkish, and Vietnamese.
Supported Azure regions
Global access is enabled, but for now resources need to point to East US or West US. We’ll be scaling out to additional regions soon.
Sample JSON response
Refer to the sample JSON for LLM Speech that matches your usage.
Model architecture
Autoregressive model with text prediction
Optimizing model performance
Coming soon.
Additional assets
This information is not available.
Distribution
You can deploy MAI-Transcribe-1 via Azure Speech in the cloud or on-premises. If you can't use the Speech SDK, you can access the Speech service through REST APIs instead. For example, use the REST APIs for LLM Speech.
More information
Learn more in the full Azure Speech Service documentation.
Responsible AI considerations
Safety techniques
Refer to the guidance for integration and responsible use with speech to text.
Safety evaluations
This information is not available.
Known limitations
MAI-Transcribe-1 recognizes what's spoken in an audio input and then generates transcription output. Accurate results depend on settings that match the languages and speaking styles expected in the audio input; non-optimal settings might lead to lower accuracy. Refer to Technical limitations, operational factors, and ranges for more details.
Acceptable use
Acceptable use policy
The Speech to Text API powered by MAI-Transcribe-1 offers convenient options for developing voice-enabled applications, but it is important to consider the context in which you will integrate the API. You must ensure that you comply with all laws and regulations that apply to your application. This includes understanding your obligations under privacy and communication laws, including national and regional privacy, eavesdropping, and wiretap laws that apply in your jurisdiction. Collect and process only audio that is within the reasonable expectations of your users. This includes ensuring that you have all necessary and appropriate consents from users to collect, process, and store their audio data. Refer to Technical limitations, operational factors, and ranges for more details.
Terms of Service
Terms of Service Link
MAI-Transcribe-1 is provided under Microsoft’s proprietary licensing terms. Access to the model is subscription-based and governed by Microsoft’s product licensing policies.
- License Type: Proprietary
- Access Model: Subscription-based via Azure services
- Terms of Service: https://microsoft.com/licensing/terms/
Model Specifications
- License: Custom
- Last Updated: April 2026
- Input Type: Audio
- Output Type: Text
- Provider: Microsoft