Tutorial
From zero to production — learn FunASR in 10 minutes.
30-Second Demo
Copy-paste this and run. No config needed — FunASR downloads everything automatically. No local setup? Open the Colab quickstart first.
pip install funasr
python -c "
from funasr import AutoModel
model = AutoModel(model='paraformer-zh', vad_model='fsmn-vad', punc_model='ct-punc')
res = model.generate(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
print(res[0]['text'])
"
Output:
That's it. One model object handles everything: loading audio, VAD segmentation, ASR, and punctuation.
Install
# Stable release
pip install funasr
# Latest (recommended: has the newest models and bug fixes)
pip install git+https://github.com/modelscope/FunASR.git
Set hub="hf" to download from HuggingFace.
Which Model Should I Use?
| Model | Best For | Languages | Speed | Punctuation |
|---|---|---|---|---|
| Paraformer | Chinese production ASR | Chinese, English | Fast | Needs punc_model |
| Fun-ASR-Nano | Multi-language, dialects, lyrics | 31 languages | Medium | Built-in ✓ |
| SenseVoice | Emotion + events + ASR | 5 languages | Ultra-fast (70ms/10s) | Built-in ✓ |
| Qwen3-ASR | Highest accuracy, context-aware | 52 languages | Slow (LLM) | Built-in ✓ |
| Paraformer-Streaming | Real-time transcription | Chinese | Real-time | Needs punc_model |
• Chinese meeting/call → paraformer-zh + fsmn-vad + ct-punc + cam++
• Multi-language → Fun-ASR-Nano
• Need emotion/sound events → SenseVoice
• Real-time subtitles → paraformer-zh-streaming
• Highest quality, no latency concern → Qwen3-ASR
Usage Scenarios
🎤 "I have a meeting recording and want text + who said what"
→ Paraformer + VAD + Punctuation + Speaker Diarization → Jump to Speaker Diarization
📺 "I want real-time subtitles for a live stream"
→ Paraformer-Streaming, feed audio chunks every 600ms → Jump to Streaming ASR
😊 "I want to detect user emotions from voice"
→ SenseVoice (outputs emotion tags: happy, sad, angry, neutral) → Jump to Emotion Detection
🌍 "I have audio in Japanese/Korean/Arabic/etc."
→ Fun-ASR-Nano (31 languages) or Qwen3-ASR (52 languages) → Jump to Offline ASR
✂️ "I want to clip a video by spoken content"
→ Use FunClip which integrates FunASR for smart video editing
Offline ASR
Paraformer (Chinese)
from funasr import AutoModel
model = AutoModel(
    model="paraformer-zh",  # Chinese ASR
    vad_model="fsmn-vad",  # handles any audio length
    vad_kwargs={"max_single_segment_time": 60000},
    punc_model="ct-punc",  # adds punctuation
)
res = model.generate(input="meeting.wav", batch_size_s=300, hotword='达摩院 语音识别')
print(res[0]["text"]) # "欢迎大家来体验达摩院推出的语音识别模型。"
print(res[0]["timestamp"]) # [[880,1120],[1120,1360],...] (ms per character)
Fun-ASR-Nano (31 Languages)
from funasr import AutoModel
model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
    hub="hf",
)
res = model.generate(input=["audio.wav"], cache={}, batch_size=1,
                     hotwords=["keyword"], language="中文")
print(res[0]["text"]) # recognized text with punctuation
print(res[0]["timestamps"]) # [{"token":"开","start_time":0.42,"end_time":0.48}, ...]
Fun-ASR-Nano has built-in punctuation, so this example does not need a punc_model.
SenseVoice (ASR + Emotion + Events)
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)
res = model.generate(input="audio.wav", cache={}, language="auto",
                     use_itn=True, batch_size_s=60, merge_vad=True, merge_length_s=15)
# Raw output contains emotion/event tags: <|zh|><|HAPPY|><|Speech|>你好
text = rich_transcription_postprocess(res[0]["text"])
print(text) # "你好" (clean text)
Qwen3-ASR (52 Languages, Highest Accuracy)
# pip install qwen-asr
from funasr import AutoModel
model = AutoModel(model="Qwen/Qwen3-ASR-1.7B", hub="hf", device="cuda:0")
res = model.generate(input="audio.wav")
print(res[0]["text"]) # recognized text
print(res[0].get("language")) # auto-detected language
Streaming ASR (Real-time)
Process audio chunk-by-chunk for real-time transcription. Each chunk produces partial text immediately.
from funasr import AutoModel
import soundfile
model = AutoModel(model="paraformer-zh-streaming")
speech, sr = soundfile.read("audio.wav")
chunk_size = [0, 10, 5] # 600ms display, 300ms lookahead
chunk_stride = chunk_size[1] * 960 # 9600 samples per chunk
cache = {}
total_chunks = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunks):
    chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = (i == total_chunks - 1)
    res = model.generate(input=chunk, cache=cache, is_final=is_final,
                         chunk_size=chunk_size,
                         encoder_chunk_look_back=4,
                         decoder_chunk_look_back=1)
    if res[0]["text"]:
        print(res[0]["text"], end="", flush=True)  # incremental output
Output (printed incrementally):
• cache={} must persist across all chunks (don't recreate it)
• is_final=True on the last chunk flushes the remaining buffered text
• chunk_size=[0,10,5]: first number unused, second = display granularity (×60ms), third = lookahead (×60ms)
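The chunk arithmetic in these notes can be checked in plain Python without loading a model. This is a minimal sketch, assuming 16 kHz audio (which is what the 960-sample multiplier implies for a 60 ms unit); `chunk_plan` is a hypothetical helper, not part of FunASR.

```python
import math

SAMPLE_RATE = 16000  # assumption: 960 samples per 60 ms unit implies 16 kHz
UNIT_MS = 60         # one chunk_size unit, per the notes above


def chunk_plan(chunk_size, num_samples, sample_rate=SAMPLE_RATE):
    """Hypothetical helper: derive stride, latency and chunk count."""
    samples_per_unit = sample_rate * UNIT_MS // 1000  # 960 at 16 kHz
    stride = chunk_size[1] * samples_per_unit
    return {
        "stride_samples": stride,
        "chunk_ms": chunk_size[1] * UNIT_MS,
        "lookahead_ms": chunk_size[2] * UNIT_MS,
        # Same count as int((num_samples - 1) / stride + 1) in the loop above
        "total_chunks": math.ceil(num_samples / stride),
    }


plan = chunk_plan([0, 10, 5], num_samples=SAMPLE_RATE * 5)  # 5 s of audio
print(plan)
```

Smaller values in the second slot reduce display latency at the cost of more `generate` calls per second of audio.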
Speaker Diarization ("Who Said What")
Works with all three major models: Paraformer, Fun-ASR-Nano, and SenseVoice. Add spk_model="cam++" to get speaker labels per sentence.
Paraformer + Speaker
from funasr import AutoModel
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",  # Paraformer needs this for sentence segmentation
    spk_model="cam++",
)
res = model.generate(input="meeting.wav", batch_size_s=300)
for sent in res[0]["sentence_info"]:
    print(f"[Speaker {sent['spk']}] [{sent['start']}-{sent['end']}ms] {sent['text']}")
Output:
Fun-ASR-Nano + Speaker (no punc needed)
from funasr import AutoModel
model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True, remote_code="./model.py",
    vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",  # no punc_model needed!
    device="cuda:0", hub="hf",
)
res = model.generate(input=["meeting.wav"], cache={}, batch_size=1, language="中文")
for sent in res[0]["sentence_info"]:
    print(f"Speaker {sent['spk']}: {sent.get('text', sent.get('sentence', ''))}")
SenseVoice + Speaker (no punc needed)
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",  # no punc_model needed!
    device="cuda:0",
)
res = model.generate(input="meeting.wav", cache={}, language="auto",
                     use_itn=True, batch_size_s=60, merge_vad=True, merge_length_s=15)
for sent in res[0]["sentence_info"]:
    print(f"Speaker {sent['spk']}: {rich_transcription_postprocess(sent['text'])}")
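For meeting notes it is often handier to merge consecutive sentences from the same speaker into one turn. The sketch below works on plain dictionaries shaped like the `sentence_info` entries used above (`spk`, `start`, `end`, `text`). The sample data is made up, `merge_speaker_turns` is a hypothetical helper, and it assumes every entry carries a `text` key.

```python
def merge_speaker_turns(sentence_info):
    """Merge consecutive sentences from the same speaker into one turn."""
    turns = []
    for sent in sentence_info:
        if turns and turns[-1]["spk"] == sent["spk"]:
            # Same speaker keeps talking: extend the current turn
            turns[-1]["end"] = sent["end"]
            turns[-1]["text"] += sent["text"]
        else:
            turns.append({"spk": sent["spk"], "start": sent["start"],
                          "end": sent["end"], "text": sent["text"]})
    return turns


# Made-up sample shaped like res[0]["sentence_info"]
sample = [
    {"spk": 0, "start": 0, "end": 1800, "text": "Good morning. "},
    {"spk": 0, "start": 1900, "end": 4200, "text": "Let's get started. "},
    {"spk": 1, "start": 4500, "end": 6100, "text": "Sounds good. "},
]
turns = merge_speaker_turns(sample)
for turn in turns:
    print(f"[Speaker {turn['spk']}] [{turn['start']}-{turn['end']}ms] {turn['text'].strip()}")
```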
Emotion Detection
Dedicated Emotion Model (emotion2vec)
from funasr import AutoModel
model = AutoModel(model="iic/emotion2vec_plus_large", device="cuda:0")
res = model.generate(input="audio.wav", granularity="utterance")
print(res[0]["labels"])  # ['angry', 'happy', 'neutral', 'sad', ...]
print(res[0]["scores"])  # [0.01, 0.03, 0.89, 0.05, ...]
# → This audio is "neutral" with 89% confidence
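Turning the parallel `labels` and `scores` lists into a single prediction is an argmax over the scores. A minimal sketch using the illustrative values from the comments above, truncated to four classes; `top_emotion` is a hypothetical helper name.

```python
def top_emotion(labels, scores):
    """Return the (label, score) pair with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


# Illustrative values from the example comments, not real model output
labels = ["angry", "happy", "neutral", "sad"]
scores = [0.01, 0.03, 0.89, 0.05]
label, score = top_emotion(labels, scores)
print(f"{label} ({score:.0%})")
```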
SenseVoice (ASR + Emotion in one shot)
SenseVoice embeds emotion tags directly in the transcription output:
# Raw output: "<|zh|><|HAPPY|><|Speech|><|withitn|>今天真是太开心了。"
# The <|HAPPY|> tag tells you the emotion
# Use rich_transcription_postprocess() to get clean text
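If you want the tags themselves rather than only the cleaned text, a small regular expression can split them off. This sketch relies only on the `<|...|>` pattern visible in the raw output above. `split_tags` is a hypothetical helper, and treating the second tag as the emotion is an assumption based on that single sample.

```python
import re

TAG_PATTERN = re.compile(r"<\|([^|]+)\|>")


def split_tags(raw):
    """Return (tags, text) for a raw string containing <|...|> tags."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw)
    return tags, text


raw = "<|zh|><|HAPPY|><|Speech|><|withitn|>今天真是太开心了。"
tags, text = split_tags(raw)
emotion = tags[1]  # assumed position: language, emotion, event, ITN flag
print(tags, emotion, text)
```

With merged VAD segments the raw string may contain more than one group of tags, in which case `tags` holds all of them in order.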
Voice Activity Detection
Offline (full audio)
from funasr import AutoModel
model = AutoModel(model="fsmn-vad")
res = model.generate(input="audio.wav")
print(res[0]["value"]) # [[610, 5530], [7200, 12400], ...]
# Each pair: [start_ms, end_ms] of speech
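The `[start_ms, end_ms]` pairs are easy to post-process. The sketch below computes how much of a recording is speech and converts the segments to sample indices for slicing. It assumes 16 kHz audio and a made-up total duration; both helpers are hypothetical and not part of FunASR.

```python
def speech_stats(segments, audio_ms):
    """Summarize speech versus silence for a list of [start_ms, end_ms] pairs."""
    speech_ms = sum(end - start for start, end in segments)
    return {
        "speech_ms": speech_ms,
        "silence_ms": audio_ms - speech_ms,
        "speech_ratio": speech_ms / audio_ms,
    }


def to_sample_ranges(segments, sample_rate=16000):
    """Convert millisecond pairs to (start, end) sample indices."""
    return [(start * sample_rate // 1000, end * sample_rate // 1000)
            for start, end in segments]


segments = [[610, 5530], [7200, 12400]]          # values from the comment above
stats = speech_stats(segments, audio_ms=13000)   # 13 s is a made-up duration
ranges = to_sample_ranges(segments)
print(stats)
print(ranges)
```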
Streaming (chunk-by-chunk)
import soundfile
from funasr import AutoModel
model = AutoModel(model="fsmn-vad")
speech, sr = soundfile.read("audio.wav")
chunk_stride = int(200 * sr / 1000) # 200ms chunks
cache = {}
for i in range(int((len(speech)-1)/chunk_stride+1)):
    chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == int((len(speech)-1)/chunk_stride)
    res = model.generate(input=chunk, cache=cache, is_final=is_final, chunk_size=200)
    if res[0]["value"]:
        print(res[0]["value"])
        # [[610, -1]] → speech started at 610ms
        # [[-1, 5530]] → speech ended at 5530ms
        # [[610, 5530]] → complete segment
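Because a segment's start and end can arrive in different chunks, a consumer usually keeps a small amount of state to stitch them together. A minimal sketch, assuming `-1` always marks the missing side as in the comments above; `assemble_segments` is a hypothetical helper and the event list is made up.

```python
def assemble_segments(events):
    """Combine per-chunk VAD values into complete [start_ms, end_ms] pairs."""
    segments = []
    open_start = None                      # start of a segment still in progress
    for value in events:                   # one res[0]["value"] list per chunk
        for start, end in value:
            if start >= 0 and end >= 0:    # complete segment in one chunk
                segments.append([start, end])
            elif start >= 0:               # speech started, end not known yet
                open_start = start
            elif end >= 0 and open_start is not None:  # speech ended
                segments.append([open_start, end])
                open_start = None
    return segments


# Made-up stream: empty lists stand for chunks with no VAD event
events = [[], [[610, -1]], [], [[-1, 5530]], [[7200, 12400]]]
segments = assemble_segments(events)
print(segments)
```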
Punctuation Restoration
from funasr import AutoModel
model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res[0]["text"])  # "那今天的会就到这里吧,happy new year,明年见。"
Deploy & Agent Integration
Use funasr-server when you need a local OpenAI-compatible endpoint for applications and agents.
pip install funasr fastapi uvicorn python-multipart
funasr-server --device cuda --port 8000
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
with open("meeting.wav", "rb") as audio_file:  # closes the file after the upload
    result = client.audio.transcriptions.create(
        model="sensevoice",
        file=audio_file,
        response_format="verbose_json",
    )
For Claude Code, Cursor, and other MCP clients, configure examples/mcp_server/funasr_mcp.py. See the Agent integration guide.
Subtitle Generation
Generate SRT or VTT subtitles from audio/video files, with optional speaker labels.
cd examples/subtitle
python generate_subtitle.py video.mp4
python generate_subtitle.py meeting.wav --spk
python generate_subtitle.py podcast.mp3 --format vtt
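To see what goes into an SRT file, here is a minimal sketch that formats sentence entries (millisecond `start`/`end`, `text`, optional `spk`) as SRT blocks. It is not the implementation of generate_subtitle.py; `srt_time` and `to_srt` are hypothetical helpers and the sample data is made up.

```python
def srt_time(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(sentences, with_speaker=False):
    """Build SRT text: numbered blocks separated by a blank line."""
    blocks = []
    for index, sent in enumerate(sentences, start=1):
        text = sent["text"].strip()
        if with_speaker:
            text = f"[Speaker {sent['spk']}] {text}"
        blocks.append(
            f"{index}\n{srt_time(sent['start'])} --> {srt_time(sent['end'])}\n{text}\n"
        )
    return "\n".join(blocks)


# Made-up sample shaped like sentence_info entries
sample = [
    {"spk": 0, "start": 880, "end": 2500, "text": "Welcome everyone."},
    {"spk": 1, "start": 2600, "end": 4100, "text": "Thanks for having me."},
]
srt = to_srt(sample, with_speaker=True)
print(srt)
```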
ONNX Export
# Export model to ONNX format
from funasr import AutoModel
model = AutoModel(model="paraformer", device="cpu")
model.export(quantize=False) # saves to model cache directory
# Use ONNX model (faster, no PyTorch needed)
# pip install funasr-onnx
from funasr_onnx import Paraformer
model = Paraformer("damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
                   batch_size=1, quantize=True)
result = model(["audio.wav"])
print(result)
FAQ & Troubleshooting
Model download is slow
Set hub="hf" for HuggingFace (faster outside China), or download once and use local path:
model = AutoModel(model="/path/to/local/model", disable_update=True)
Out of Memory (OOM)
Three knobs to reduce memory:
- Reduce batch_size_s (e.g., from 300 to 60)
- Reduce max_single_segment_time in vad_kwargs (e.g., 30000 → 15000)
- Add batch_size_threshold_s=30 to force batch=1 for long segments
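Put together, the three knobs might look like the sketch below. Where each setting goes is an assumption: `max_single_segment_time` is passed through `vad_kwargs` as in the examples earlier on this page, while the two batch settings are assumed to be `generate()` arguments alongside `batch_size_s`. `transcribe_low_memory` is a hypothetical wrapper.

```python
# Assumed placement: the VAD limit goes to AutoModel, batch limits go to generate()
LOW_MEMORY_VAD_KWARGS = {"max_single_segment_time": 15000}   # was 30000
LOW_MEMORY_GENERATE_KWARGS = {
    "batch_size_s": 60,             # was 300
    "batch_size_threshold_s": 30,   # segments longer than this run with batch=1
}


def transcribe_low_memory(path):
    """Hypothetical wrapper applying all three memory-saving settings."""
    from funasr import AutoModel  # imported lazily; only needed when called

    model = AutoModel(
        model="paraformer-zh",
        vad_model="fsmn-vad",
        vad_kwargs=LOW_MEMORY_VAD_KWARGS,
        punc_model="ct-punc",
    )
    return model.generate(input=path, **LOW_MEMORY_GENERATE_KWARGS)
```

Change one value at a time so you can tell which setting resolved the OOM.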
Every startup shows "Downloading Model..."
Not actually re-downloading — just verifying cache. To skip entirely:
model = AutoModel(model="/local/path/to/model", disable_update=True)
"ModelName is not registered"
This usually means the PyPI release is outdated. Install from source:
pip install git+https://github.com/modelscope/FunASR.git
Want to suppress all progress bars and logs?
model = AutoModel(model="...", disable_update=True, disable_pbar=True, log_level="ERROR")
How to pass pre-loaded numpy audio?
import soundfile as sf
audio, sr = sf.read("audio.wav") # numpy array, 16kHz
res = model.generate(input=audio) # pass directly — no file needed
GPU not being used?
# Check:
import torch
print(torch.cuda.is_available())  # must be True
# Specify GPU explicitly:
model = AutoModel(model="...", device="cuda:0")