This demo is *really* impressive: https://huggingface.co/spaces/mistralai/Voxtra...

tekacs · 2026-02-04T16:41:58 1770223318

Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.

And open weight too! So grateful for this.

drakenot · 2026-02-05T01:19:24 1770254364

This past month Parakeet v3 dropped with a streaming ASR model that is 0.6B params, can run on a CPU and is super good.

tekacs · 2026-02-05T13:37:55 1770298675

I did say all the model. :)

Yes I've tried Parakeet v3 too. For its own purpose - running locally - it's amazing.

The thing that's particularly amazing about this Voxtral model is how incredibly rock solid the accuracy is.

For the longest time previous models have been 'mostly correct' or as people have commented elsewhere on this HN thread, have dropped sentences or lost or added utterances.

I have no affiliation with these folks, but I tried and struggled to get this model to break even speaking as adversariately as I could.

That's a totally different class of model.

meatmanek · 2026-02-05T05:16:36 1770268596

Do you mean https://huggingface.co/nvidia/nemotron-speech-streaming-en-0... ?

drakenot · 2026-02-05T13:27:48 1770298068

Yes. That is it

puttycat · 2026-02-05T08:39:51 1770280791

What's the business plan here?

Oras · 2026-02-04T16:25:22 1770222322

Thank you for the link! Their playground in Mistral does not have a microphone. it just uploads files, which does not demonstrate the speed and accuracy, but the link you shared does.

I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.

druskacik · 2026-02-04T18:11:16 1770228676

According to the announcement blog Le Chat is powered by the new model as well: https://chat.mistral.ai/chat

TacticalCoder · 2026-02-04T23:09:14 1770246554

> Truly impressive for real-time.

Impressive indeed. Works way better than the speech recognition I first got demo'ed in... 1998? I remember you had to "click" on the mic everytime you wanted to speak and, well, not only the transcription was bad, it was so bad that it'd try to interpret the sound of the click as a word.

It was so bad I told several people not to invest in what was back then a national tech darling:

https://en.wikipedia.org/wiki/Lernout_%26_Hauspie

That turned out to be a massive fraud.

But ...

> I tried speaking in 2 languages at once, and it picked it up correctly.

I'm a native french speaker and I tried with a very simple sentence mixing french and english:

"Pour un pistolet je prefere un red dot mais pour une carabine je prefere un ACOG" (aka "For a pistol I prefer a red dot but for a carbine I prefer an ACOG")

And instead I got this:

"Je prépare un redote, mais pour une carabine, je préfère un ACOG."

"Je prépare un redote ..." doesn't mean anything and it's not at all what I said.

I like it, it's impressive, but literally the first sentence I tried it got the first half entirely wrong.

jnaina · 2026-02-05T01:33:10 1770255190

I used sell the Mac Voice Navigator (from Articulate Systems) in the 90s, which was a SCSI based hardware box that you plug into a Mac, Mac SE or Mac II. It used to use the same L&H speech recognition tech (if I recall correctly) and was called the "User Interface" of the future.

Horrible speech recognition rate and very glitchy. Customers hated it, and lots of returns/complaints.

A few years later, L&H went bankrupt. And so did Articulate Systems.

https://applerescueofdenver.com/products-page/macintosh-to-p...

daemonologist · 2026-02-04T16:48:11 1770223691

404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).

echion · 2026-02-05T03:16:15 1770261375

Same here

skykooler · 2026-02-04T19:24:04 1770233044

Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".

winrid · 2026-02-05T06:49:45 1770274185

For me it shows the waveform and then "error"

starkgoose · 2026-02-04T20:05:36 1770235536

Try disabling CSP for the page

codethief · 2026-02-04T19:34:33 1770233673

Same here. In Chromium I don't even see the waveform.

fragmede · 2026-02-04T20:01:15 1770235275

I had to turn off ad-block to get it to work.

whimblepop · 2026-02-05T03:43:21 1770263001

I can see the waveform but it still doesn't work for me. Switched to Edge, disabled all adblocking and privacy extensions, built-in tracking prevention, and "enhanced site security" (whatever that is), and still no dice. I'd love to try it and be impressed, but it seems impossible. :(

atoav · 2026-02-05T09:00:27 1770282027

Did you check if your mic even works in principle? E.g. using https://www.onlinemictest.com/

If you don't get sound there it won't work anywhere. A surprising number of problems like these can be solved by selecting the correct audio input source (provided your computer shows more than one).

whimblepop · 2026-02-05T13:26:22 1770297982

Yep. Mic works fine. My mic even works on the test page! What doesn't work is any of the transcription functionality. :(

atoav · 2026-02-06T08:11:22 1770365482

I just bit the bullet and did it via python and the api.

niek_pas · 2026-02-05T06:46:57 1770274017

Same here on iPhone with Arc Search.

jaggederest · 2026-02-04T17:03:00 1770224580

It can transcribe Eminem's Rap God fast sequence, really, really impressive.

rafram · 2026-02-04T17:32:04 1770226324

That's almost certainly in the training data, to be fair.

keeganpoppen · 2026-02-04T18:49:07 1770230947

what a great test hahah

carbocation · 2026-02-04T18:10:06 1770228606

This model was able to transcribe Bad Bunny lyrics over the sound of the background music, played casually from my speakers. Impressive, to me.

elboru · 2026-02-05T04:21:31 1770265291

Wow, so it has surpassed humans.

pyprism · 2026-02-04T17:18:52 1770225532

Wow, that’s weird. I tried Bengali, but the text transcribed into Hindi!I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.

derefr · 2026-02-04T17:51:28 1770227488

Well, on the linked page, it mentions "strong transcription performance in 13 languages, including [...] Hindi" but with no mention of Bengali. It probably doesn't know a lick of Bengali, and is just trying to snap your words into the closest language it does know.

keeganpoppen · 2026-02-04T18:49:59 1770230999

it must have some exposure to bengali— just not enough for them to advertise it. otherwise it would have a damn hard time.

espadrine · 2026-02-05T17:50:47 1770313847

It is quite impressive.

I have seen the same impressive performance about 7 months ago here: https://kyutai.org/stt

If I look at the architecture of Voxtral 2, it seems to take a page from Kyutai’s delayed stream modeling.

The reason the delay is configurable is that you can delay the stream by a variable number of audio tokens. Each audio token is 80 ms of audio, converted to a spectrogram, fed to a convnet, passed through a transformer audio encoder, and the encoded audio embedding is passed, with a history of 1 audio embedding per 80 ms, into a text transformer, which outputs text embedding, then converted to a text token (which is thus also worth 80ms, but there is a special [STREAMING_PAD] token to skip producing a word).

There is no cross-attention in either Kyutai's STT nor in Voxtral 2, unlike Whisper's encoder-decoder design!

sheepscreek · 2026-02-04T18:04:33 1770228273

I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything, jargon, capitalization, everything. Now I’m looking forward to doing that with 100% local inference!

k9294 · 2026-02-08T17:39:49 1770572389

Hey, I would really appreciate if you would try https://ottex.ai

I'm working on a Wispr/Spokenly competitor. It's free without any paywalled features, supports local models and a bunch of API providers including Mistral.

For local models ottex has - parakeet V3, Whisper, GLM-ASR nano, Qwen3-ASR (don't have voxtral yet though, looking into it).

btw, you can try new voxtral model via API (the model name to pick is `voxtral-mini-latest:transcribe`), I personally switched to it as my main default fast model - it's really good.

GolDDranks · 2026-02-04T23:59:41 1770249581

I can't get that demo to work. Tried with both Firefox and Chrome.

CamperBob2 · 2026-02-05T04:08:59 1770264539

Same here; the voice waveform animates as expected but the model doesn't do anything when I click on the microphone. It just says "Error" in the upper-right corner.

Also tried downloading and running locally, no luck. Same behavior.

Barbing · 2026-02-04T20:46:26 1770237986

Doesn’t seem to work in Safari on iOS 26.2, iPhone 17 Pro, just about anything extra disabled.

whimblepop · 2026-02-05T03:48:32 1770263312

No long with Firefox or Edge or Chrome on either macOS or Android for me, either. Same issue on all.

darkwater · 2026-02-04T21:02:45 1770238965

It's really nice although I've got a sentence in French when I was speaking Italian but I corrected myself in the middle of a word.

But I'm definitely going to keep an eye on this for local-only TTS for Home Assistant.

rafram · 2026-02-04T17:35:38 1770226538

Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.

timhh · 2026-02-04T22:08:27 1770242907

Yeah it messed up a bit for me too when I didn't enunciate well. If I speak clearly it seems to work very well even with background noise. Remember Dragon Naturally Speaking? Imagine having this back then!

mentalgear · 2026-02-04T21:30:07 1770240607

Here European Multilingual-Intelligence truly shines!

colordrops · 2026-02-04T22:02:04 1770242524

is this demo running fully in the browser?

simonw · 2026-02-04T22:19:13 1770243553

No, it's server-side.

Model is around 7.5 GB - once they get above 4 GB running them in a browser gets quite difficult I believe.

dcl · 2026-02-05T03:34:59 1770262499

Because it's a 4gb download?

subset · 2026-02-05T03:50:32 1770263432

I think that web browsers only allow up to 4GB of memory per tab.