Tutorial

From zero to production — learn FunASR in 10 minutes.


30-Second Demo

Copy-paste this and run. No config needed — FunASR downloads everything automatically. No local setup? Open the Colab quickstart first.

pip install funasr

python -c "
from funasr import AutoModel
model = AutoModel(model='paraformer-zh', vad_model='fsmn-vad', punc_model='ct-punc')
res = model.generate(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
print(res[0]['text'])
"

Output:

欢迎大家来体验达摩院推出的语音识别模型。

That's it. One model object handles everything: loading audio, VAD segmentation, ASR, and punctuation.

Install

# Stable release
pip install funasr

# Latest (recommended — has newest models and bug fixes)
pip install git+https://github.com/modelscope/FunASR.git

China users: Models download from ModelScope by default (fast in China). International users: add hub="hf" to download from HuggingFace.

Which Model Should I Use?

Model	Best For	Languages	Speed	Punctuation
Paraformer	Chinese production ASR	Chinese, English	Fast	Needs punc_model
Fun-ASR-Nano	Multi-language, dialects, lyrics	31 languages	Medium	Built-in ✓
SenseVoice	Emotion + events + ASR	5 languages	Ultra-fast (70ms/10s)	Built-in ✓
Qwen3-ASR	Highest accuracy, context-aware	52 languages	Slow (LLM)	Built-in ✓
Paraformer-Streaming	Real-time transcription	Chinese	Real-time	Needs punc_model

Quick recommendation:
• Chinese meeting/call → paraformer-zh + fsmn-vad + ct-punc + cam++
• Multi-language → Fun-ASR-Nano
• Need emotion/sound events → SenseVoice
• Real-time subtitles → paraformer-zh-streaming
• Highest quality, no latency concern → Qwen3-ASR

Usage Scenarios

🎤 "I have a meeting recording and want text + who said what"

→ Paraformer + VAD + Punctuation + Speaker Diarization → Jump to Speaker Diarization

📺 "I want real-time subtitles for a live stream"

→ Paraformer-Streaming, feed audio chunks every 600ms → Jump to Streaming ASR

😊 "I want to detect user emotions from voice"

→ SenseVoice (outputs emotion tags: happy, sad, angry, neutral) → Jump to Emotion Detection

🌍 "I have audio in Japanese/Korean/Arabic/etc."

→ Fun-ASR-Nano (31 languages) or Qwen3-ASR (52 languages) → Jump to Offline ASR

✂️ "I want to clip a video by spoken content"

→ Use FunClip which integrates FunASR for smart video editing

Offline ASR

Paraformer (Chinese)

from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",          # Chinese ASR
    vad_model="fsmn-vad",           # handles any audio length
    vad_kwargs={"max_single_segment_time": 60000},
    punc_model="ct-punc",           # adds punctuation
)
res = model.generate(input="meeting.wav", batch_size_s=300, hotword='达摩院 语音识别')
print(res[0]["text"])       # "欢迎大家来体验达摩院推出的语音识别模型。"
print(res[0]["timestamp"])  # [[880,1120],[1120,1360],...] (ms per character)
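The timestamp list can be paired back with the characters of the text. Below is a minimal sketch that assumes one [start_ms, end_ms] pair per spoken character and no pair for punctuation or spaces; that holds for the Chinese example above but should be verified for mixed Chinese/English audio. `align_chars` is a hypothetical helper, not part of FunASR.

```python
import unicodedata

def align_chars(text, timestamps):
    """Pair each spoken character with its [start_ms, end_ms] span.

    Assumption: one timestamp per character that is neither
    punctuation nor whitespace.
    """
    spoken = [
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    if len(spoken) != len(timestamps):
        raise ValueError(
            f"{len(spoken)} characters but {len(timestamps)} timestamps"
        )
    return [(ch, start, end) for ch, (start, end) in zip(spoken, timestamps)]

# Made-up sample data in the same shape as res[0]["text"] / res[0]["timestamp"]
pairs = align_chars("你好，世界。", [[0, 200], [200, 400], [400, 650], [650, 900]])
for ch, start, end in pairs:
    print(f"{ch}: {start}-{end} ms")
```

The length check fails loudly when the one-pair-per-character assumption does not hold, which is safer than silently misaligning the text.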

Fun-ASR-Nano (31 Languages)

from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
    hub="hf",
)
res = model.generate(input=["audio.wav"], cache={}, batch_size=1,
                     hotwords=["keyword"], language="中文")
print(res[0]["text"])        # recognized text with punctuation
print(res[0]["timestamps"])  # [{"token":"开","start_time":0.42,"end_time":0.48}, ...]

Fun-ASR-Nano outputs punctuation natively — no need for punc_model.

SenseVoice (ASR + Emotion + Events)

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)
res = model.generate(input="audio.wav", cache={}, language="auto",
                     use_itn=True, batch_size_s=60, merge_vad=True, merge_length_s=15)

# Raw output contains emotion/event tags: <|zh|><|HAPPY|><|Speech|>你好
text = rich_transcription_postprocess(res[0]["text"])
print(text)  # "你好" (clean text)

Qwen3-ASR (52 Languages, Highest Accuracy)

# pip install qwen-asr
from funasr import AutoModel

model = AutoModel(model="Qwen/Qwen3-ASR-1.7B", hub="hf", device="cuda:0")
res = model.generate(input="audio.wav")
print(res[0]["text"])             # recognized text
print(res[0].get("language"))    # auto-detected language

Streaming ASR (Real-time)

Process audio chunk-by-chunk for real-time transcription. Each chunk produces partial text immediately.

from funasr import AutoModel
import soundfile

model = AutoModel(model="paraformer-zh-streaming")

speech, sr = soundfile.read("audio.wav")
chunk_size = [0, 10, 5]          # 600ms display, 300ms lookahead
chunk_stride = chunk_size[1] * 960  # 9600 samples per chunk

cache = {}
total_chunks = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunks):
    chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = (i == total_chunks - 1)

    res = model.generate(input=chunk, cache=cache, is_final=is_final,
                         chunk_size=chunk_size,
                         encoder_chunk_look_back=4,
                         decoder_chunk_look_back=1)
    if res[0]["text"]:
        print(res[0]["text"], end="", flush=True)  # incremental output

Output (printed incrementally; the | separators below only mark chunk boundaries and are not printed by the code):

欢迎大 | 家来 | 体验达 | 摩院推 | 出的语 | 音识 | 别模型

Key points:
• cache={} must persist across all chunks (don't recreate it)
• is_final=True on last chunk flushes remaining buffered text
• chunk_size=[0,10,5]: first number unused, second=display granularity (×60ms), third=lookahead (×60ms)
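The chunk arithmetic in the key points can be worked through in a few lines. This sketch assumes 16 kHz audio, which is where the 960 samples per 60 ms unit in the example comes from; `streaming_chunk_params` and `count_chunks` are hypothetical helpers for illustration only.

```python
def streaming_chunk_params(chunk_size, sample_rate=16000, frame_ms=60):
    """Translate chunk_size=[unused, display, lookahead] into ms and samples."""
    _, display_units, lookahead_units = chunk_size
    samples_per_unit = sample_rate * frame_ms // 1000  # 960 at 16 kHz
    return {
        "display_ms": display_units * frame_ms,
        "lookahead_ms": lookahead_units * frame_ms,
        "chunk_stride": display_units * samples_per_unit,
    }

def count_chunks(num_samples, chunk_stride):
    """Ceiling division: same result as int((num_samples - 1) / chunk_stride + 1)."""
    return (num_samples + chunk_stride - 1) // chunk_stride

params = streaming_chunk_params([0, 10, 5])
print(params)  # {'display_ms': 600, 'lookahead_ms': 300, 'chunk_stride': 9600}

# A 5-second clip at 16 kHz has 80000 samples, so it needs 9 chunks
print(count_chunks(16000 * 5, params["chunk_stride"]))  # 9
```

A smaller second value lowers latency but gives the model less audio per step. For example, [0, 5, 5] would mean 300 ms chunks of 4800 samples.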

Speaker Diarization ("Who Said What")

Works with all three major models: Paraformer, Fun-ASR-Nano, and SenseVoice. Add spk_model="cam++" to get speaker labels per sentence.

Paraformer + Speaker

from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",      # Paraformer needs this for sentence segmentation
    spk_model="cam++",
)
res = model.generate(input="meeting.wav", batch_size_s=300)

for sent in res[0]["sentence_info"]:
    print(f"[Speaker {sent['spk']}] [{sent['start']}-{sent['end']}ms] {sent['text']}")

Output:

[Speaker 0] [880-5195ms] 欢迎大家来体验达摩院推出的语音识别模型。
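For meeting transcripts it is often easier to read speaker turns than individual sentences. The sketch below merges consecutive sentences from the same speaker, assuming each entry carries the spk, start, end, and text keys used above. `merge_speaker_turns` and the `max_gap_ms` threshold are hypothetical, and the sample data is made up.

```python
def merge_speaker_turns(sentence_info, max_gap_ms=1000):
    """Merge consecutive sentences by the same speaker into one turn."""
    turns = []
    for sent in sentence_info:
        last = turns[-1] if turns else None
        same_speaker = last is not None and last["spk"] == sent["spk"]
        if same_speaker and sent["start"] - last["end"] <= max_gap_ms:
            # Extend the current turn instead of starting a new one
            last["end"] = sent["end"]
            last["text"] += sent["text"]
        else:
            turns.append({
                "spk": sent["spk"],
                "start": sent["start"],
                "end": sent["end"],
                "text": sent["text"],
            })
    return turns

# Made-up data in the shape of res[0]["sentence_info"]
sample = [
    {"spk": 0, "start": 880, "end": 5195, "text": "欢迎大家来体验达摩院推出的语音识别模型。"},
    {"spk": 0, "start": 5400, "end": 7000, "text": "我们开始吧。"},
    {"spk": 1, "start": 7600, "end": 9000, "text": "好的。"},
]
turns = merge_speaker_turns(sample)
for turn in turns:
    print(f"[Speaker {turn['spk']}] [{turn['start']}-{turn['end']}ms] {turn['text']}")
```

New dicts are built for each turn, so the original sentence_info list is left unmodified.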

Fun-ASR-Nano + Speaker (no punc needed)

from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True, remote_code="./model.py",
    vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",     # no punc_model needed!
    device="cuda:0", hub="hf",
)
res = model.generate(input=["meeting.wav"], cache={}, batch_size=1, language="中文")
for sent in res[0]["sentence_info"]:
    print(f"Speaker {sent['spk']}: {sent.get('text', sent.get('sentence', ''))}")

SenseVoice + Speaker (no punc needed)

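from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(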
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",    # no punc_model needed!
    device="cuda:0",
)
res = model.generate(input="meeting.wav", cache={}, language="auto",
                     use_itn=True, batch_size_s=60, merge_vad=True, merge_length_s=15)
for sent in res[0]["sentence_info"]:
    print(f"Speaker {sent['spk']}: {rich_transcription_postprocess(sent['text'])}")

Emotion Detection

Dedicated Emotion Model (emotion2vec)

from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_large", device="cuda:0")
res = model.generate(input="audio.wav", granularity="utterance")

print(res[0]["labels"])  # ['angry', 'happy', 'neutral', 'sad', ...]
print(res[0]["scores"])  # [0.01,   0.03,   0.89,      0.05, ...]
# → This audio is "neutral" with 89% confidence
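To turn the two parallel lists into a single prediction, take the label with the highest score. This is a minimal sketch assuming labels and scores are equal-length lists as shown above; `top_emotion` is a hypothetical helper.

```python
def top_emotion(labels, scores):
    """Return the (label, score) pair with the highest score."""
    if not labels or len(labels) != len(scores):
        raise ValueError("labels and scores must be non-empty and equal length")
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]

label, score = top_emotion(
    ["angry", "happy", "neutral", "sad"],
    [0.01, 0.03, 0.89, 0.05],
)
print(f"{label} ({score:.0%})")  # neutral (89%)
```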

SenseVoice (ASR + Emotion in one shot)

SenseVoice embeds emotion tags directly in the transcription output:

# Raw output: "<|zh|><|HAPPY|><|Speech|><|withitn|>今天真是太开心了。"
# The <|HAPPY|> tag tells you the emotion
# Use rich_transcription_postprocess() to get clean text
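If you need the emotion as a value rather than clean text, the tags can be pulled out of the raw string with a regular expression before post-processing. The sketch below assumes the `<|...|>` tag format shown above and treats HAPPY, SAD, ANGRY, and NEUTRAL as the emotion tags (taken from the list in Usage Scenarios; the full tag set may be larger). `split_tags` and `find_emotion` are hypothetical helpers.

```python
import re

TAG_RE = re.compile(r"<\|([^|]*)\|>")

# Assumption: emotion tag names, based on the emotions listed in this tutorial
EMOTION_TAGS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}

def split_tags(raw):
    """Return (list of tag names, text with all tags removed)."""
    return TAG_RE.findall(raw), TAG_RE.sub("", raw)

def find_emotion(tags):
    """Return the first known emotion tag, or None if there is none."""
    return next((tag for tag in tags if tag in EMOTION_TAGS), None)

tags, text = split_tags("<|zh|><|HAPPY|><|Speech|><|withitn|>今天真是太开心了。")
print(tags)                # ['zh', 'HAPPY', 'Speech', 'withitn']
print(find_emotion(tags))  # HAPPY
print(text)                # 今天真是太开心了。
```

Run this on the raw res[0]["text"], since rich_transcription_postprocess() rewrites the tags.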

Voice Activity Detection

Offline (full audio)

from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
res = model.generate(input="audio.wav")
print(res[0]["value"])  # [[610, 5530], [7200, 12400], ...]
                         # Each pair: [start_ms, end_ms] of speech
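The segment list is enough to compute simple statistics such as total speech time and the silence between segments. A minimal sketch, assuming non-overlapping [start_ms, end_ms] pairs in chronological order; `speech_stats` is a hypothetical helper.

```python
def speech_stats(segments):
    """Return (total speech in ms, list of silence gaps in ms between segments)."""
    speech_ms = sum(end - start for start, end in segments)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]
    return speech_ms, gaps

speech_ms, gaps = speech_stats([[610, 5530], [7200, 12400]])
print(speech_ms)  # 10120
print(gaps)       # [1670]
```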

Streaming (chunk-by-chunk)

import soundfile
from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
speech, sr = soundfile.read("audio.wav")
chunk_stride = int(200 * sr / 1000)  # 200ms chunks

cache = {}
for i in range(int((len(speech)-1)/chunk_stride+1)):
    chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == int((len(speech)-1)/chunk_stride)
    res = model.generate(input=chunk, cache=cache, is_final=is_final, chunk_size=200)
    if res[0]["value"]:
        print(res[0]["value"])
        # [[610, -1]]   → speech started at 610ms
        # [[-1, 5530]]  → speech ended at 5530ms
        # [[610, 5530]] → complete segment
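Because a start and its matching end can arrive in different chunks, the caller has to stitch them together. The sketch below does that for the three event shapes listed above, with -1 meaning "not known yet". `stitch_vad_events` is a hypothetical helper, and the input is made-up data standing in for the per-chunk res[0]["value"] lists.

```python
def stitch_vad_events(events_per_chunk):
    """Combine streaming VAD events into complete [start_ms, end_ms] segments.

    Returns (segments, open_start), where open_start is the start time of a
    segment that has begun but not ended yet, or None.
    """
    segments, open_start = [], None
    for events in events_per_chunk:
        for start, end in events:
            if start != -1 and end != -1:
                segments.append([start, end])        # complete segment
            elif start != -1:
                open_start = start                   # speech started
            elif end != -1 and open_start is not None:
                segments.append([open_start, end])   # speech ended
                open_start = None
    return segments, open_start

stream = [[], [[610, -1]], [], [[-1, 5530]], [[7200, 12400]], [[13000, -1]]]
segments, open_start = stitch_vad_events(stream)
print(segments)    # [[610, 5530], [7200, 12400]]
print(open_start)  # 13000
```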

Punctuation Restoration

from funasr import AutoModel

model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res[0]["text"])  # "那今天的会就到这里吧，happy new year，明年见。"

When do you need this? Only with Paraformer (which outputs raw text without punctuation). Fun-ASR-Nano, SenseVoice, and Qwen3-ASR output punctuation natively.

Deploy & Agent Integration

Use funasr-server when you need a local OpenAI-compatible endpoint for applications and agents.

pip install funasr fastapi uvicorn python-multipart
funasr-server --device cuda --port 8000

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
with open("meeting.wav", "rb") as audio_file:  # close the file handle when done
    result = client.audio.transcriptions.create(
        model="sensevoice",
        file=audio_file,
        response_format="verbose_json",
    )
print(result.text)

For Claude Code, Cursor, and other MCP clients, configure examples/mcp_server/funasr_mcp.py. See the Agent integration guide.

Subtitle Generation

Generate SRT or VTT subtitles from audio/video files, with optional speaker labels.

cd examples/subtitle
python generate_subtitle.py video.mp4
python generate_subtitle.py meeting.wav --spk
python generate_subtitle.py podcast.mp3 --format vtt
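To build subtitles yourself from sentence_info, the main work is formatting millisecond offsets as SRT timestamps. The following is a sketch under that assumption and is not necessarily how generate_subtitle.py works internally; `srt_timestamp` and `to_srt` are hypothetical helpers, and the entries are assumed to carry the start, end, text, and spk keys shown in Speaker Diarization.

```python
def srt_timestamp(ms):
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, rem = divmod(int(ms), 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def to_srt(sentence_info, with_speaker=False):
    """Render sentence_info entries as numbered SRT cues."""
    cues = []
    for idx, sent in enumerate(sentence_info, start=1):
        text = sent["text"]
        if with_speaker:
            text = f"[Speaker {sent['spk']}] {text}"
        span = f"{srt_timestamp(sent['start'])} --> {srt_timestamp(sent['end'])}"
        cues.append(f"{idx}\n{span}\n{text}\n")
    return "\n".join(cues)

sample = [{"spk": 0, "start": 880, "end": 5195,
           "text": "欢迎大家来体验达摩院推出的语音识别模型。"}]
print(to_srt(sample, with_speaker=True))
```

VTT uses a period instead of a comma before the milliseconds and starts with a WEBVTT header line, so the same structure adapts with small changes.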

ONNX Export

# Export model to ONNX format
from funasr import AutoModel
model = AutoModel(model="paraformer", device="cpu")
model.export(quantize=False)   # saves to model cache directory

# Use ONNX model (faster, no PyTorch needed)
# pip install funasr-onnx
from funasr_onnx import Paraformer
model = Paraformer("damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
                   batch_size=1, quantize=True)
result = model(["audio.wav"])
print(result)

FAQ & Troubleshooting

Model download is slow

Set hub="hf" to download from HuggingFace (faster outside China), or download the model once and load it from a local path:

model = AutoModel(model="/path/to/local/model", disable_update=True)

Out of Memory (OOM)

Three knobs to reduce memory:

• Reduce batch_size_s (e.g., from 300 to 60)
• Reduce max_single_segment_time in vad_kwargs (e.g., 30000 → 15000)
• Add batch_size_threshold_s=30 to force batch=1 for long segments

Every startup shows "Downloading Model..."

The model is not actually being re-downloaded; FunASR is only verifying the cache. To skip the check entirely:

model = AutoModel(model="/local/path/to/model", disable_update=True)

"ModelName is not registered"

This usually means the installed PyPI release is outdated. Install from source:

pip install git+https://github.com/modelscope/FunASR.git

Want to suppress all progress bars and logs?

model = AutoModel(model="...", disable_update=True, disable_pbar=True, log_level="ERROR")

How to pass pre-loaded numpy audio?

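import soundfile as sf

# `model` is an AutoModel instance created as in the sections above
audio, sr = sf.read("audio.wav")   # numpy array; this example assumes a 16 kHz file
res = model.generate(input=audio)  # pass the array directly; no file path needed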

GPU not being used?

# Check:
import torch
print(torch.cuda.is_available())  # must be True

# Specify GPU explicitly:
model = AutoModel(model="...", device="cuda:0")