Built this because tones are killing my spoken Mandarin and I can't reliably hear my own mistakes.
It's a 9M-parameter Conformer-CTC model trained on ~300h of audio (AISHELL + Primewords), quantized to INT8 (11 MB), and it runs 100% in-browser via ONNX Runtime Web.
It grades per-syllable pronunciation and tones using Viterbi forced alignment.
Try it here: https://simedw.com/projects/ear/
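For the curious, the alignment step is essentially standard CTC Viterbi forced alignment. Here's a minimal NumPy sketch (simplified and illustrative, not the actual code; the function name, shapes, and span bookkeeping are my own): given the model's per-frame log-probs and the known target syllable sequence, a Viterbi pass over the blank-interleaved CTC label graph assigns each syllable a frame span, which is then used for per-syllable grading.

```python
import numpy as np

def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi forced alignment of a known label sequence to CTC frame posteriors.

    log_probs: (T, V) array of per-frame log-probabilities from the model
    targets:   list of target label ids (e.g. syllable+tone classes), length L
    returns:   list of (label, start_frame, end_frame) spans, one per target
    """
    T = log_probs.shape[0]
    # CTC topology: interleave blanks -> [blank, y1, blank, y2, ..., yL, blank]
    ext = [blank]
    for y in targets:
        ext += [y, blank]
    S = len(ext)

    dp = np.full((T, S), -np.inf)
    bp = np.zeros((T, S), dtype=np.int64)
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay in s, advance from s-1, or skip
            # from s-2 (skip only between two different non-blank labels).
            cands, idxs = [dp[t - 1, s]], [s]
            if s >= 1:
                cands.append(dp[t - 1, s - 1]); idxs.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2]); idxs.append(s - 2)
            k = int(np.argmax(cands))
            dp[t, s] = cands[k] + log_probs[t, ext[s]]
            bp[t, s] = idxs[k]

    # The best path must end in the final label or the final blank.
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t, s]
        path.append(s)
    path.reverse()

    # Collapse the per-frame state path into one (start, end) span per target.
    spans = {}
    for t, s in enumerate(path):
        if ext[s] != blank:
            i = (s - 1) // 2          # position in the target sequence
            spans.setdefault(i, [t, t])[1] = t
    return [(targets[i], st, en) for i, (st, en) in sorted(spans.items())]
```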
There's one thing that gave me pause: in the phrase 我想学中文 it identified "wén" as "guó". While my pronunciation isn't perfect, there's no way that what I said was closer to "guó" than to "wén".
This suggests to me that the model learned word-level patterns rather than tones here: "zhōng guó" probably appears a lot in the training data, so the model is biased towards recognizing it.
- Edit -
From the blog post:
> If my tone is wrong, I don’t want the model to guess what I meant. I want it to tell me what I actually said.
Your architecture doesn't tell you what you actually said either. It just maps what you said to the likeliest of the 1254 syllables you allow. For example, it couldn't tell you that you said "wi" or "wr" instead of "wo", because those syllables don't exist in your setup.
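To make that concrete, here's a toy sketch (the inventory and scores below are made up, and this is not the actual decoder): a closed-vocabulary decoder can only ever emit an entry from its fixed inventory, so an out-of-inventory sound like "wi" gets silently snapped to the nearest legal syllable.

```python
import numpy as np

# Made-up miniature inventory; in the real system there are ~1254 syllable classes.
inventory = ["wo", "wen", "guo", "zhong"]

# Hypothetical per-class scores for a frame where the speaker actually said
# something like "wi", which is not a legal Mandarin syllable and therefore
# has no class of its own.
scores = np.array([0.31, 0.28, 0.26, 0.15])

# Greedy decoding can only pick one of the allowed classes, so the
# mispronunciation is reported as "wo" rather than as what was said.
print(inventory[int(np.argmax(scores))])   # -> "wo"
```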