Open Bug 1474084 Opened 6 years ago Updated 7 months ago

WebSpeech API with offline STT via DeepSpeech

Categories

(Core :: Web Speech, enhancement)

People

(Reporter: tim.langhorst, Unassigned)

References

(Blocks 2 open bugs)

Details

There is ongoing work to implement the WebSpeech API using an online service, but this is about implementing it using on-device recognition.
This has multiple advantages:
* Works when offline or on a slow internet connection
* Faster inference
* Everything stays securely on the device
* Lower cost for inference servers at Mozilla
But it might be too demanding for embedded devices and may be less accurate.

I think the platforms that should be supported at a minimum are:
arm64 (for newer phones with enough performance) and amd64/x86_64 (32-bit is dead),
and OS-wise:
Windows, macOS, Linux, Android, iOS

It shouldn't be a big problem to exclude some old or rarely used platforms, because this is just an enhancement and online inference is available as a fallback.
Regarding the memory and storage requirements: I think the model should be downloaded on the fly when a user first uses the API (or the user gets asked, as with DRM; it should also be downloadable from the settings), and it should be sized to fit the device, e.g. around 200MB of memory and 50MB of storage on a phone/tablet, and around 1GB of memory and 200MB of storage on a desktop.
There should be something like five or more model variants for different hardware capabilities.
Component: General → Web Speech
Product: Firefox → Core
Target Milestone: Future → ---
Version: unspecified → Trunk
See Also: → 1244460
Depends on: 1248897
Depends on: 1392065
I should also mention that this is only what I think is reasonable from an ordinary user's perspective; I'm not familiar with Mozilla's source code.
Severity: normal → enhancement
Assignee: nobody → lissyx+mozillians
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

I have followed the procedure laid out in this section: https://wiki.mozilla.org/Web_Speech_API_-_Speech_Recognition#How_can_I_test_with_Deep_Speech.3F , which says to set media.webspeech.service.endpoint to https://dev.speaktome.nonprod.cloudops.mozgcp.net/ .

The results are very inaccurate and would be unacceptable by today's standards. I am not trying to lambaste a product that I know the DeepSpeech people have invested a great deal of time and energy in, and that I genuinely appreciate. I am just wondering whether the article is out of date and whether there is a different endpoint that can be set which will give more accurate results.

I am comparing this to running with the media.webspeech.service.endpoint pref removed (which I believe then falls back to the OS-native service), where accuracy is very high and more in line with what I would expect. The point being, I do not think this is an issue with the audio hardware involved.
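
For anyone who wants to reproduce the comparison, below is a minimal, hypothetical sketch (TypeScript) of a page script that drives SpeechRecognition and logs the top transcript and its confidence, so runs against the DeepSpeech endpoint and the OS-native service can be compared side by side. It assumes the recognition prefs are enabled as described on the wiki page linked above; the SpeechRecognition/webkitSpeechRecognition names and event shapes follow the Web Speech API spec, and a given Nightly build may behave differently.

// Hypothetical test snippet: run one recognition and log the best hypothesis.
// Assumes the Web Speech recognition prefs are enabled as on the wiki page.
const Recognition: any =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new Recognition();
recognition.lang = "en-US";
recognition.continuous = false;      // stop after one utterance
recognition.interimResults = false;  // only report final results

recognition.onresult = (event: any) => {
  const best = event.results[0][0];
  // Transcript plus confidence is enough to eyeball accuracy differences
  // between the DeepSpeech endpoint and the OS-native recognizer.
  console.log(`transcript: "${best.transcript}" (confidence: ${best.confidence})`);
};

recognition.onerror = (event: any) => {
  console.error("recognition error:", event.error);
};

recognition.start();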

Status: ASSIGNED → NEW
Assignee: lissyx+mozillians → nobody
Severity: normal → S3
Blocks: 1856507