Expose SpeechRecognition to the web

Status: REOPENED
Type: enhancement
Priority: P2
Severity: normal
Opened: 3 years ago
Last updated: 16 days ago

People

(Reporter: sebo, Assigned: anatal)

Tracking

(Blocks 4 bugs, {feature})

Version: Trunk
Target Milestone: Future
Points: ---
Bug Flags: webcompat+

Firefox Tracking Flags

(Webcompat Priority: P1)

Details

(Whiteboard: [webcompat:p1])

Attachments

(1 attachment, 6 obsolete attachments)

Reporter

Description

3 years ago
The SpeechRecognition API is currently only available in chrome context (at least on desktop Firefox).

It should also be made available within the website context.

This will require some kind of UI to control the permissions to access the microphone.
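
For reference, basic page-context usage would look roughly like this (a sketch following the MDN Web Speech API examples; Chrome currently exposes the interfaces with a webkit prefix):

// Sketch of page-context usage, based on the MDN Web Speech API examples.
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

var recognition = new SpeechRecognition();
recognition.lang = "en-US";
recognition.interimResults = false;
recognition.maxAlternatives = 1;

recognition.onresult = function (event) {
  // results[i][0] is the top alternative for the i-th result.
  console.log("Heard: " + event.results[0][0].transcript +
              " (confidence: " + event.results[0][0].confidence + ")");
};

recognition.onerror = function (event) {
  console.log("Recognition error: " + event.error);
};

// Calling start() is what should trigger the microphone permission prompt.
recognition.start();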

Sebastian
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1244460
Comment hidden (obsolete)
Comment hidden (obsolete)
Is this on anyone's radar? Just ran into a website saying that it "requires Google Chrome Browser". :( I emailed the developers, they say they need the webspeech API.
Flags: needinfo?(sebastianzartner)
Flags: needinfo?(overholt)
Flags: needinfo?(dietrich)
Reporter

Comment 5

3 years ago
I'm just the reporter, not an implementor.

Sebastian
Flags: needinfo?(sebastianzartner)
Flags: needinfo?(anatal)
Flags: needinfo?(kdavis)

Comment 6

3 years ago
Both Andre and I implemented this originally; however, both of us are now in Connected Devices and consumed with preparation for London. So, at the earliest we could take a look after London. But even then it will be hard for us to dedicate lots of time to desktop.
Flags: needinfo?(kdavis)
Flags: needinfo?(anatal)
Let me see what I can do about prioritizing this on the platform side (I was going to suggest Andre/Kelly, too).
Flags: needinfo?(overholt)
Flags: needinfo?(dietrich)
The demo https://mdn.github.io/web-speech-api/ is useless without the API available...

Comment 9

3 years ago
The documentation https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API and its example implementations are not working in Firefox yet, which is a little poorly documented in the section https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API#Browser_compatibility IMHO; Firefox Desktop is probably the relevant browser, right?

The demos work in Google Chrome (un)fortunately.
Actually, there is no backend implementation for the Recognition API except for Gonk.
Reporter

Comment 11

3 years ago
(In reply to Makoto Kato [:m_kato] from comment #10)
> Actually, there is no backend implementation for the Recognition API except
> for Gonk.

Bug 1244237 comment 0 claimed something else. That's why I created this bug.
But if there's really no backend yet, you should create a bug for it blocking this one, so this feature can finally be tackled.

Sebastian

Comment 12

2 years ago
I am willing to work on this. Can somebody guide me on this one? What exactly is impeding the implementation of the recognition back end? I was thinking we could use the speech recognition API provided by Windows (not a platform-independent solution, I know). It would be great to get it working at least somewhere, and I am working on the Windows version.

Comment 13

2 years ago
@abhishek The way I understand it, you can test this using Firefox desktop in the chrome context, so use JS or create a plugin. We 'only' need to expose this to the website context. See the 'User story' above.

From https://developer.mozilla.org/en-US/Add-ons/Setting_up_extension_development_environment

> devtools.chrome.enabled = true. This enables to run JavaScript code snippets in the chrome context of the Scratchpad from the Tools menu. Don't forget to switch from content to browser as context.

Does this help you?
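
(For reference, a minimal sketch of that setup as a Scratchpad snippet run in the browser context; the pref names are assumptions based on the shorthand used in these comments and the MDN page above:)

// Run from Scratchpad with the environment switched to "Browser" (chrome context).
Components.utils.import("resource://gre/modules/Services.jsm");

// Allow chrome-context snippets in Scratchpad, as described on the MDN page above.
Services.prefs.setBoolPref("devtools.chrome.enabled", true);

// Expose the SpeechRecognition interface to content (assumed pref name).
Services.prefs.setBoolPref("media.webspeech.recognition.enable", true);

Note that flipping the pref only exposes the interface; on desktop builds without a recognition backend, start() can still fail, which matches the InvalidStateError reported in the next comment.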

Comment 14

2 years ago
(In reply to Clemens Tolboom from comment #13)
> @abhishek the way I understand you can test this using Firefox desktop in
> Chrome context so use js or create a plugin. We 'only' need to expose this
> into website context. See 'User story' above.
> 
> From
> https://developer.mozilla.org/en-US/Add-ons/
> Setting_up_extension_development_environment
> 
> > devtools.chrome.enabled = true. This enables to run JavaScript code snippets in the chrome context of the Scratchpad from the Tools menu. Don't forget to switch from content to browser as context.
> 
> Does this help you?

I still get the error "Exception: InvalidStateError: An attempt was made to use an object that is not, or is no longer, usable". I enabled devtools.chrome and webspeech.recognition, but still can't get it to work. Can you verify that you've got speech input to work on desktop Firefox somehow?

Comment 15

2 years ago
@abhishek what script are you running in Web Developer > Scratchpad?

Checking with https://github.com/mdn/web-speech-api/blob/master/speech-color-changer/script.js: running lines 1-9 plus 'recognition;' as line 10 from Scratchpad on the page about:config shows me the object SpeechRecognition __proto__: SpeechRecognitionPrototype when inspecting 'recognition;'.

Remark #2 on https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API#Browser_compatibility is about this issue :)

Please place your code on https://gist.github.com/ for others to help

Comment 16

2 years ago
(In reply to Clemens Tolboom from comment #15)
> @abhishek what script are you running in Web developer > Scratchpad ?
> 
> Checking with from
> https://github.com/mdn/web-speech-api/blob/master/speech-color-changer/
> script.js running lines 1-9 + 'recognition;' as line 10 from Scratchpad on
> page about:config shows me the object SpeechRecognition __proto__:
> SpeechRecognitionPrototype when inspecting 'recognition;'
> 
> Remark #2 on
> https://developer.mozilla.org/en-US/docs/Web/API/
> Web_Speech_API#Browser_compatibility is about this issue :)
> 
> Please place your code on https://gist.github.com/ for other to help

I am trying that exact same script. Here's the gist: https://gist.github.com/Abhishek8394/dbf9338a9baf6ca05639929e0b72404d
Also, regarding remark #2 as mentioned: it says that recognition hasn't been implemented yet but asks to enable an option?

Updated

2 years ago
Whiteboard: [webcompat]
Assignee

Comment 17

2 years ago
One needs to build Firefox with the flags turned on to include the only implementation currently available (pocketsphinx + English). If you guys are willing to move forward with it, I can help you set it up and get it running.
Flags: webcompat?
Assignee

Updated

2 years ago
Assignee: nobody → anatal
Depends on: 1392065
FWIW, Duolingo's support team is advising inquiring users (necessarily) to use Chrome:

---
Speaking exercises on the new website only work on browsers that support the Web Speech API, the most popular being Chrome. Unfortunately, the old speech system was outdated and a burden on our system. If you use the Chrome browser, you will get speaking exercises again.

If you are using Google Chrome and experiencing issues, users have found that updating or reinstalling the browser has resolved their issues. If you are still having issues, please reply to this email. :)

If you are using a browser without Web Speech API supported, you should not see any speaking exercises. If you do see speaking exercises, please know that your microphone will unfortunately not work. We are aware that some users may still be seeing these exercises and we're working hard to ensure that this is resolved quickly.
---

Duolingo has over 200M users.

I imagine that English-only speech reco wouldn't cover Duolingo's main customer bases.

Comment 19

2 years ago
:-( Anyone know the list of languages Duolingo supports?
Spanish, French, German, Italian, Portuguese, Dutch, Irish, Danish, Swedish, High Valyrian, Russian, Swahili, Polish, Romanian, Greek, Esperanto, Turkish, Vietnamese, Hebrew, Norwegian, Ukrainian, Hungarian, Welsh, Czech

Comment 21

2 years ago
Time to internationalize + localize Common Voice!
I am not using that feature on Duolingo because the recognition before was using Flash (and wasn't perfect).
In any case, this is another example that reminds us it is important to have this API available.

Comment 23

2 years ago
We, the machine learning group at Mozilla, have just recently gotten the quality of our speech recognition engine[1] to be on-par with commercial systems.

Currently we have enough American English training data to create an American English model. We don't have enough data to train other accents or languages, thus the need to internationalize + localize Common Voice.

We plan, by the end of Q2 in 2018, to implement the WebSpeech API backed by our speech recognition engine and integrate, at a minimum, the American English model by then. Other languages should follow in the second half of 2018. (Which languages depends on C-Level decisions, the internationalization + localization + success of Common Voice, and Duolingo-like issues, which would guide our language choice.)

[1] https://github.com/mozilla/deepspeech
I hope that this API will be available before Q2 of course :-)

Comment 25

2 years ago
PR's welcome! :-)

Updated

2 years ago
Blocks: 1409526
Let's try to do something :)
Assignee: anatal → lissyx+mozillians
Assignee: lissyx+mozillians → anatal
See Also: → 1423867
No longer blocks: 973754
Duplicate of this bug: 973754
Flags: webcompat? → webcompat+
Whiteboard: [webcompat] → [webcompat:p2]
Comment on attachment 8984315 [details] [diff] [review]
Introducing an online speech recognition service to enable Web Speech API

Please use Mozilla coding style. 2 spaces for indentation, mFoo for member variable naming, 
aBar for argument names etc.

>+class DecodeResultTask final : public Runnable
>+{
>+public:
>+  DecodeResultTask(bool succeded,
>+                   const nsString& hypstring,
>+                   float confidence,
>+                   WeakPtr<dom::SpeechRecognition> recognition,
>+                   const nsAutoString& errormessage)
>+      : mozilla::Runnable("DecodeResultTask"),
>+        mSucceeded(succeded),
>+        mResult(hypstring),
>+        mConfidence(confidence),
>+        mRecognition(recognition),
>+        mErrorMessage(errormessage)
>+  {
>+    MOZ_ASSERT(
>+      NS_IsMainThread()); // This should be running on the main thread
>+  }
>+
>+  NS_IMETHOD
>+  Run() override
>+  {
>+    MOZ_ASSERT(NS_IsMainThread()); // This method is supposed to run on the main
>+                                   // thread!
>+
>+    if (!mSucceeded) {
>+      mRecognition->DispatchError(SpeechRecognition::EVENT_RECOGNITIONSERVICE_ERROR,
>+                                  SpeechRecognitionErrorCode::Network, // TODO different codes?
>+                                  mErrorMessage);
>+
>+    } else {
>+      // Declare javascript result events
>+      RefPtr<SpeechEvent> event = new SpeechEvent(
>+        mRecognition, SpeechRecognition::EVENT_RECOGNITIONSERVICE_FINAL_RESULT);
>+      SpeechRecognitionResultList* resultList =
>+        new SpeechRecognitionResultList(mRecognition);
>+      SpeechRecognitionResult* result = new SpeechRecognitionResult(mRecognition);
>+
>+      if (0 < mRecognition->MaxAlternatives()) {
>+        SpeechRecognitionAlternative* alternative =
>+          new SpeechRecognitionAlternative(mRecognition);
>+
>+        alternative->mTranscript = mResult;
>+        alternative->mConfidence = mConfidence;
>+
>+        result->mItems.AppendElement(alternative);
>+      }
>+      resultList->mItems.AppendElement(result);
>+
>+      event->mRecognitionResultList = resultList;
>+      NS_DispatchToMainThread(event);
>+    }
>+    return NS_OK;
>+ }
>+
>+private:
>+  bool mSucceeded;
>+  nsString mResult;
>+  float mConfidence;
>+  WeakPtr<dom::SpeechRecognition> mRecognition;
>+  nsCOMPtr<nsIThread> mWorkerThread;
>+  nsAutoString mErrorMessage;
>+};
>+
>+NS_IMPL_ISUPPORTS(OnlineSpeechRecognitionService,
>+                  nsISpeechRecognitionService, nsIObserver, nsIStreamListener)
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStartRequest(nsIRequest* aRequest,
>+                            nsISupports* aContext)
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
>+  return NS_OK;
>+}
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnDataAvailable(nsIRequest* aRequest,
>+                             nsISupports* aContext,
>+                             nsIInputStream* aInputStream,
>+                             uint64_t aOffset,
>+                             uint32_t aCount)
>+{
>+  nsresult rv;
>+  this->mBuf = new char[aCount];
>+  uint32_t _retval;
>+  rv = aInputStream->ReadSegments(NS_CopySegmentToBuffer, this->mBuf, aCount, &_retval);
>+  NS_ENSURE_SUCCESS(rv, rv);
>+  return NS_OK;
>+}
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStopRequest(nsIRequest* aRequest,
>+                           nsISupports* aContext,
>+                           nsresult aStatusCode)
>+{
>+  bool success;
>+  nsresult rv;
>+  float confidence;
>+  confidence = 0;
>+  Json::Value root;
>+  Json::Reader reader;
>+  bool parsingSuccessful;
>+  nsAutoCString result;
>+  nsAutoCString hypoValue;
>+  nsAutoString errorMsg;
>+
>+  SR_LOG("STT Result: %s", this->mBuf);
>+
>+  if (NS_FAILED(aStatusCode)) {
>+    success = false;
>+    errorMsg.Assign(NS_LITERAL_STRING("CONNECTION_ERROR"));
>+  } else {
>+    success = true;
>+    parsingSuccessful = reader.parse(this->mBuf, root);
>+    if (!parsingSuccessful) {
>+      errorMsg.Assign(NS_LITERAL_STRING("RECOGNITIONSERVICE_ERROR"));
>+      success = false;
>+    } else {
>+      result.Assign(root.get("status","error").asString().c_str());
>+      if (result.EqualsLiteral("ok")) {
>+        // ok, we have a result
>+        hypoValue.Assign(root["data"][0].get("text","").asString().c_str());
>+        confidence = root["data"][0].get("confidence","0").asFloat();
>+      } else {
>+        // there's an internal server error
>+        errorMsg.Assign(NS_LITERAL_STRING("NO_HYPOTHESIS"));
>+        success = false;
>+      }
>+    }
>+  }
>+
>+  RefPtr<Runnable> resultrunnable =
>+    new DecodeResultTask(success, NS_ConvertUTF8toUTF16(hypoValue), confidence,
>+                         mRecognition, errorMsg);
>+  rv = NS_DispatchToMainThread(resultrunnable);
>+
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
>+
>+  return NS_OK;
>+}
>+
>+OnlineSpeechRecognitionService::OnlineSpeechRecognitionService()
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
What is this? mBuf is uninitialized in the constructor

>+  audioEncoder = nullptr;
>+  ISDecoderCreated = true;
>+  ISGrammarCompiled = true;
Please use normal C++ member variable initialization,
OnlineSpeechRecognitionService::OnlineSpeechRecognitionService()
  : mBuf(nullptr)
...

looks like ISDecoderCreated isn't ever used, nor ISGrammarCompiled

>+OnlineSpeechRecognitionService::~OnlineSpeechRecognitionService()
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
again, rather mysterious nullcheck. And you leak mBuf here.

>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::ProcessAudioSegment(
>+  AudioSegment* aAudioSegment, int32_t aSampleRate)
>+{
On which thread does this method run?
Encoding may take time so it should not run on the main thread.
So, 
MOZ_ASSERT(!NS_IsMainThread());

>+OnlineSpeechRecognitionService::Observe(nsISupports* aSubject,
>+                                              const char* aTopic,
>+                                              const char16_t* aData)
align params

>+class OnlineSpeechRecognitionService : public nsISpeechRecognitionService,
>+                                             public nsIObserver,
>+                                             public nsIStreamListener
align inherited classes
Attachment #8984315 - Flags: review?(bugs) → review-
Assignee

Comment 30

11 months ago
Hi Olli,

Here's a version with your comments already addressed. If you want to test it, apply the patch, flip the prefs, and head here: https://andrenatal.github.io/webspeechapi/index_auto.html. 

Currently the tests we have cover the fake recognition service and the state machine, so I'm working on making them work with this online service too.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b9f223217bde85215e8f99416cae66098ff13fc

Thanks

Andre
Attachment #8984315 - Attachment is obsolete: true
Attachment #8986675 - Flags: review?(bugs)
Comment on attachment 8986675 [details] [diff] [review]
Bug-1248897-Introducing an online speech recognition service for web speech api

>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStartRequest(nsIRequest *aRequest,
>+                                               nsISupports *aContext)
>+{
>+  this->mBuf = nullptr;
Why this?
If mBuf points to some value, you leak it here.
No need for this-> here nor elsewhere


>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnDataAvailable(nsIRequest *aRequest,
>+                                                nsISupports *aContext,
>+                                                nsIInputStream *aInputStream,
Per Mozilla coding style * goes with the type.
nsIRequest* aRequest etc. Here and elsewhere.


>+                                                uint64_t aOffset,
>+                                                uint32_t aCount)
>+{
>+  nsresult rv;
>+  this->mBuf = new char[aCount];
>+  uint32_t _retval;
retVal
But it isn't retVal, but read count, so perhaps readCount ?


>+OnlineSpeechRecognitionService::OnStopRequest(nsIRequest *aRequest,
>+                                              nsISupports *aContext,
>+                                              nsresult aStatusCode)
>+{
>+  bool success;
>+  float confidence = 0;
>+  Json::Value root;
>+  Json::Reader reader;
>+  bool parsingSuccessful;
>+  nsAutoCString result;
>+  nsAutoCString hypoValue;
>+  nsAutoString errorMsg;
>+  SR_LOG("STT Result: %s", this->mBuf);
>+
>+  if (NS_FAILED(aStatusCode)) {
>+    success = false;
>+    errorMsg.Assign(NS_LITERAL_STRING("CONNECTION_ERROR"));
the spec doesn't define "CONNECTION_ERROR"

>+  } else {
>+    success = true;
>+    parsingSuccessful = reader.parse(this->mBuf, root);
>+    if (!parsingSuccessful) {
>+      errorMsg.Assign(NS_LITERAL_STRING("RECOGNITIONSERVICE_ERROR"));
The spec doesn't define "RECOGNITIONSERVICE_ERROR"

>+      success = false;
>+    } else {
>+      result.Assign(root.get("status","error").asString().c_str());
missing space after ,

>+      if (result.EqualsLiteral("ok")) {
>+        // ok, we have a result
>+        hypoValue.Assign(root["data"][0].get("text","").asString().c_str());
>+        confidence = root["data"][0].get("confidence","0").asFloat();
>+      } else {
>+        // there's an internal server error
>+        errorMsg.Assign(NS_LITERAL_STRING("NO_HYPOTHESIS"));
I don't see "NO_HYPOTHESIS" in the spec
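
(For reference, the error codes the spec does define for SpeechRecognitionErrorCode; a sketch, assuming the Web Speech API draft of this era:)

// SpeechRecognitionErrorCode values from the Web Speech API draft (assumed list):
const specErrorCodes = [
  "no-speech", "aborted", "audio-capture", "network",
  "not-allowed", "service-not-allowed", "bad-grammar", "language-not-supported",
];

The custom strings above would need to map onto one of these, e.g. "network" for connection failures.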


>+OnlineSpeechRecognitionService::~OnlineSpeechRecognitionService()
>+{
>+  this->mBuf = nullptr;
So you leak mBuf

>+  this->mAudioEncoder = nullptr;
>+  this->mWriter = nullptr;
No need to set these to null in destructor.



>   nsresult rv;
>   rv = mRecognitionService->Initialize(this);
>   if (NS_WARN_IF(NS_FAILED(rv))) {
>     return;
>   }
> 
>+  SpeechRecognition::SetIdle(false);
>+
>+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
>+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"), 0,
>+                                           getter_AddRefs(this->mEncodeThread));
fix indentation


(this will need couple of iterations.)
Attachment #8986675 - Flags: review?(bugs) → review-

Comment 32

11 months ago
(In reply to kdavis from comment #23)
> We, the machine learning group at Mozilla, have just recently gotten the
> quality of our speech recognition engine[1] to be on-par with commercial
> systems.
> 
> Currently we have enough American English training data to create an
> American English model. We don't have enough data to train other accents or
> languages, thus the need to internationalize + localize Common Voice.
> 
> We plan, by the end of Q2 in 2018, to implement the WebSpeech API backed by
> our speech recognition engine and integrate, at a minimum, the American
> English model by then. Other languages should follow in the second half of
> 2018. (Which languages depends on C-Level decisions, the
> internationalization + localization + success of Common Voice, and
> Duolingo-like issues, which would guide our language choice.)
> 
> [1] https://github.com/mozilla/deepspeech

So it's 2018 Q2, any updates on deepspeech implementation?
Reporter

Comment 33

11 months ago
(In reply to Tim Langhorst from comment #32)
> So it's 2018 Q2, any updates on deepspeech implementation?

You may have missed it but this bug is assigned and actively being worked on. Though I'm just the reporter, not the implementer, so I have no idea how long it will take until it's finished. Regarding Olli Pettay's last comment saying "this will need couple of iterations", I assume it will still take a while until the patch is ready to land in Firefox.

Sebastian

Comment 34

11 months ago
(In reply to Sebastian Zartner [:sebo] from comment #33)
> You may have missed it but this bug is assigned and actively being worked
> on. Though I'm just the reporter, not the implementer, so I have no idea how
> long it will take until it's finished. Regarding Olli Pettay's last comment
> saying "this will need couple of iterations", I assume it will still take a
> while until the patch is ready to land in Firefox.

Yeah but this is ONLINE speech recognition and not the OFFLINE one via deepspeech that I'm interested in. That's why I've replied to kdavis
Reporter

Comment 35

11 months ago
(In reply to Tim Langhorst from comment #34)
> (In reply to Sebastian Zartner [:sebo] from comment #33)
> > You may have missed it but this bug is assigned and actively being worked
> > on. Though I'm just the reporter, not the implementer, so I have no idea how
> > long it will take until it's finished. Regarding Olli Pettay's last comment
> > saying "this will need couple of iterations", I assume it will still take a
> > while until the patch is ready to land in Firefox.
> 
> Yeah but this is ONLINE speech recognition and not the OFFLINE one via
> deepspeech that I'm interested in. That's why I've replied to kdavis

Good point. Asking him for info about that. Also, if offline speech recognition is still pursued, should a separate bug be created for it?

Sebastian
Flags: needinfo?(kdavis)
Keywords: feature

Comment 36

11 months ago
You can open a separate bug for backing the WebSpeech API with offline STT.

If you do so, can you list the required platforms, runtime memory requirements, maximal size the library can be, and maximal size the model can be in the bug. Thanks.
Flags: needinfo?(kdavis)
Reporter

Updated

11 months ago
Blocks: 1474124
Reporter

Comment 37

11 months ago
(In reply to kdavis from comment #36)
> You can open a separate bug for backing the WebSpeech API with offline STT.
> 
> If you do so, can you list the required platforms, runtime memory
> requirements, maximal size the library can be, and maximal size the model
> can be in the bug. Thanks.

As I wrote before, I am just the reporter, not an implementer, so I can't decide on those questions. Nonetheless, I have created bug 1474124 for the offline API. Anyone able to answer those questions should do that there.

Sebastian

Comment 38

11 months ago
I have already opened bug 1474084

Updated

11 months ago
Blocks: 1474084
Assignee

Comment 39

11 months ago
Hi Olli, please see a new version with your comments addressed.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=60c87fdc8f392c5178161474c067dd2cb7897ee5
Attachment #8986675 - Attachment is obsolete: true
Attachment #8990848 - Flags: review?(bugs)
Assignee

Updated

11 months ago
Attachment #8990848 - Attachment is obsolete: true
Attachment #8990848 - Flags: review?(bugs)
Assignee

Comment 40

11 months ago
Please consider this version.
Attachment #8991057 - Flags: review?(bugs)
Marking this as affecting 63 just to indicate that this is being actively worked on in Nightly in my list of features.
Whiteboard: [webcompat:p2] → [webcompat:p1]
Comment on attachment 8991057 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

>+
>+  /** Audio data */
>+  nsTArray<uint8_t> mAudioVector;
>+
>+  RefPtr<AudioTrackEncoder> mAudioEncoder;
>+  UniquePtr<ContainerWriter> mWriter;
>+  char* mBuf;
Why you use char* and not for example nsCString?
>+SpeechRecognition::~SpeechRecognition()
>+{
>+  if (this->mEncodeThread) {
>+    this->mEncodeThread->Shutdown();
>+  }
>+  this->mDocument = nullptr;
Setting nsCOMPtr or RefPtr member variables to null in destructor is useless.
nsCOMPtr's and RefPtr's destructors will do that automatically



>+bool
>+SpeechRecognition::IsIdle()
>+{
>+  return kIDLE;
>+}
>+
>+void
>+SpeechRecognition::SetIdle(bool aIdle)
>+{
>+  SR_LOG("Setting idle");
>+  kIDLE = aIdle;
>+}
It is totally unclear what kIDLE means. And the static variable is wrongly named. k-prefix is for constants, and kIDLE clearly isn't a constant.




>+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
>+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"),
>+                          0,
>+                          getter_AddRefs(this->mEncodeThread));
>+
Given all the memshrink efforts because of Fission (see dev.platform), need to ensure we kill the thread rather soon once it isn't needed anymore.
So, probably way before destructor of SpeechRecognition.


> 
>+  static bool IsIdle();
>+  static void SetIdle(bool aIdle);
These need some documentation


>+
>+  nsCOMPtr<nsIDocument> mDocument;
mDocument isn't cycle collected, so this will leak.
Do you need mDocument? SpeechRecognition is an DOMEventTargetHelper object, so one can always get the Window object and from that get the extant document.
 


>     CXXFLAGS += ['-Wno-error=shadow']
>diff --git a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
>index 5f8e6181fb32..69fcfdf44690 100644
>--- a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
Why the changes to FakeSpeechRecognitionService?

SpeechRecognitionService.h

>+++ b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.h
>@@ -15,17 +15,17 @@
>   {0x48c345e7, 0x9929, 0x4f9a, {0xa5, 0x63, 0xf4, 0x78, 0x22, 0x2d, 0xab, 0xcd}};
> 
> namespace mozilla {
> 
> class FakeSpeechRecognitionService : public nsISpeechRecognitionService,
>                                      public nsIObserver
> {
> public:
>-  NS_DECL_ISUPPORTS
>+  NS_DECL_THREADSAFE_ISUPPORTS
Hmm, why this change?
Attachment #8991057 - Flags: review?(bugs) → review-
Assignee

Comment 43

10 months ago
(In reply to Olli Pettay [:smaug] (vacation Jul 15->) from comment #42)
> Comment on attachment 8991057 [details] [diff] [review]
> 0001-Bug-1248897-Introducing-an-online-speech-recognition.patch
> 
> >+
> >+  /** Audio data */
> >+  nsTArray<uint8_t> mAudioVector;
> >+
> >+  RefPtr<AudioTrackEncoder> mAudioEncoder;
> >+  UniquePtr<ContainerWriter> mWriter;
> >+  char* mBuf;
> Why you use char* and not for example nsCString?


I changed to use an nsCString but needed to add a new function [1] to consume the data from the input stream, since NS_CopySegmentToBuffer expects a char*.


[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9563eabe4915e08b28c1dcd70b998ee5R59


> >+SpeechRecognition::~SpeechRecognition()
> >+{
> >+  if (this->mEncodeThread) {
> >+    this->mEncodeThread->Shutdown();
> >+  }
> >+  this->mDocument = nullptr;
> Setting nsCOMPtr or RefPtr member variables to null in destructor is useless.
> nsCOMPtr's and RefPtr's destructors will do that automatically
> 
> 

Fixed.

> 
> >+bool
> >+SpeechRecognition::IsIdle()
> >+{
> >+  return kIDLE;
> >+}
> >+
> >+void
> >+SpeechRecognition::SetIdle(bool aIdle)
> >+{
> >+  SR_LOG("Setting idle");
> >+  kIDLE = aIdle;
> >+}
> It is totally unclear what kIDLE means. And the static variable is wrongly
> named. k-prefix is for constants, and kIDLE clearly isn't a constant.
> 
> 

Ok, fixed the name to sIdle and added some comments explaining its purpose

> 
> 
> >+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
> >+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"),
> >+                          0,
> >+                          getter_AddRefs(this->mEncodeThread));
> >+
> Given all the memshrink efforts because of Fission (see dev.platform), need
> to ensure we kill the thread rather soon once it isn't needed anymore.
> So, probably way before destructor of SpeechRecognition.
> 
> 

Yes, sure. So we are now shutting down the thread on StopRecording[1] and AbortSilently[2]

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beR567
[2] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beR563


> > 
> >+  static bool IsIdle();
> >+  static void SetIdle(bool aIdle);
> These need some documentation
> 
> 

Done [1]

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-2187e63217f1dd33be599ab9dfd73879R102

> >+
> >+  nsCOMPtr<nsIDocument> mDocument;
> mDocument isn't cycle collected, so this will leak.
> Do you need mDocument? SpeechRecognition is an DOMEventTargetHelper object,
> so one can always get the Window object and from that get the extant
> document.
>  
> 

I was using the document to pass it as the principal when creating the channel to be used in the HTTP request[1]. But I figured that I could just use `nsContentUtils::GetSystemPrincipal()`.
So I reverted all the changes I was doing with the document and am not exposing it as a member anymore. Thanks for pointing that out.

https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9563eabe4915e08b28c1dcd70b998ee5R277

> 
> >     CXXFLAGS += ['-Wno-error=shadow']
> >diff --git a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
> >index 5f8e6181fb32..69fcfdf44690 100644
> >--- a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
> Why the changes to FakeSpeechRecognitionService?
> 
> SpeechRecognitionService.h
> 

If we don't add SpeechRecognition::SetIdle(true) here[1], some existing tests will break.

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9a8ef72f8c41148099a7df7aafd4aed9R52


> >+++ b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.h
> >@@ -15,17 +15,17 @@
> >   {0x48c345e7, 0x9929, 0x4f9a, {0xa5, 0x63, 0xf4, 0x78, 0x22, 0x2d, 0xab, 0xcd}};
> > 
> > namespace mozilla {
> > 
> > class FakeSpeechRecognitionService : public nsISpeechRecognitionService,
> >                                      public nsIObserver
> > {
> > public:
> >-  NS_DECL_ISUPPORTS
> >+  NS_DECL_THREADSAFE_ISUPPORTS
> Hmm, why this change?

As we are now calling ProcessAudioSegment on a thread[1], if we don't change the interface to NS_DECL_THREADSAFE_ISUPPORTS, the tests break as well. Here's a gist of the test results with only NS_DECL_ISUPPORTS [2]


[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beL403

[2] https://gist.github.com/andrenatal/5e9b12aff43a8369d7664d9a4d6314a6
Assignee

Comment 44

10 months ago
Please see the updated version with all the issues addressed.
Attachment #8991057 - Attachment is obsolete: true
Assignee

Comment 45

10 months ago
Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

Hi Andreas, do you mind reviewing the current version of the patch for us, please?
Attachment #8993800 - Flags: review?(apehrson)
Assignee

Comment 47

10 months ago
@pehrson, another interesting issue that I found is that when I close the browser (or first the tab and then the browser) while the microphone is open and capturing (or when the MediaStream is being fed with audio [1]), I'm getting this crash [2] 


[1] https://github.com/andrenatal/webspeechapi/blob/gh-pages/index_auto.js#L85

[2] https://gist.github.com/andrenatal/2c5220f4a65ec785297d9ac811808ced 

Do you have any idea why that's happening?

Thanks,

Andre
Assignee

Updated

10 months ago
Flags: needinfo?(apehrson)
Assignee

Updated

10 months ago
Attachment #8993800 - Flags: review?(apehrson) → review?(bugs)
Assignee

Comment 48

10 months ago
Asking smaug for review again since it seems he's back from PTO.
Sorry for the late reply. I was on PTO last week but am back for this week.

(In reply to André Natal from comment #47)
> @pehrson, another interesting issue that I found is that when I close the
> browser (or first the tab and then the browser) while the microphone is open
> and capturing (or when the MediaStream is being fed with audio [1]), I'm
> getting this crash [2] 
> 
> 
> [1]
> https://github.com/andrenatal/webspeechapi/blob/gh-pages/index_auto.js#L85
> 
> [2] https://gist.github.com/andrenatal/2c5220f4a65ec785297d9ac811808ced 
> 
> Do you have any idea why that's happening?
> 
> Thanks,
> 
> Andre

The "WARNING: YOU ARE LEAKING THE WORLD" indicates that you have added a non-cycle-collected reference cycle somewhere in your code. This is preventing shutdown. If you have one that is legit you probably want to listen for xpcom-shutdown and break it then.

First you need to find it however, which is easier said than done. I often find that auditing my own code helps, or sometimes (if there are lots of new RefPtr members) disabling chunk after chunk until it doesn't happen anymore, to try and narrow down the piece of code containing the cycle.
Flags: needinfo?(apehrson)
Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

Review of attachment 8993800 [details] [diff] [review]:
-----------------------------------------------------------------

I have mostly looked at the usage of AudioTrackEncoder, though other comments might have slipped in too.

Some basic things for the AudioTrackEncoder are causing this r-, most notably mAudioVector allocations and mAudioTrackEncoder->AdvanceCurrentTime calls.

::: dom/media/webspeech/recognition/OnlineSpeechRecognitionService.cpp
@@ +195,5 @@
> +  }
> +
> +  nsresult rv;
> +
> +  if (!isHeaderWritten) {

Is this a member? If so it should be mIsHeaderWritten.

@@ +199,5 @@
> +  if (!isHeaderWritten) {
> +    mWriter = MakeUnique<OggWriter>();
> +    mAudioEncoder = MakeAndAddRef<OpusTrackEncoder>(aSampleRate);
> +    mAudioEncoder->TryInit(*aAudioSegment, duration);
> +    mAudioEncoder->SetStartOffset(0);

SetStartOffset needs to get the number of samples that come before the first segment. This is so that the bookkeeping of the actual samples works. An example to understand how this works:

mAudioEncoder->SetStartOffset(128);
// The internal buffer now contains 128 samples of silence and no sound.
mAudioEncoder->AppendAudioSegment(aAudioSegment);
// For aAudioSegment of 256 samples the buffer now contains 128 samples of silence and 256 of sound
mAudioEncoder->AdvanceCurrentTime(256);
// The internal buffer now contains 256 samples of silence and 128 of sound. 128 samples were passed on.
mAudioEncoder->AppendAudioSegment(aAudioSegment2);
// For aAudioSegment2 of 128 samples the buffer now contains 256 samples of silence and 256 of sound
mAudioEncoder->AdvanceCurrentTime(384);
// The internal buffer now contains 384 samples of silence and 128 of sound. 128 samples were passed on.

@@ +214,5 @@
> +    rv = mWriter->GetContainerData(&aOutputBufs, ContainerWriter::GET_HEADER);
> +    NS_ENSURE_SUCCESS(rv, rv);
> +
> +    for (auto& buffer : aOutputBufs) {
> +        mAudioVector.AppendElements(buffer);

Indentation is wrong in so many places. Consider fixing it in all your files with clang-format.

This AppendElements() is going to have to expand the array very often. Consider using something else for storage, like an array of EncodedFrameContainer, which would be simpler to append to as it grows large. Since you only use the array after finishing the recording, you can concat the EncodedFrameContainers then, as the total duration is known and you only need one allocation, or use them one-by-one if the streaming API allows it.

Is there a fixed upper bound on the length of the recording? If not it'll be possible to trigger an OOM crash through this (though it is encoded opus so it'll take a while). Then we should probably store it in a temp file like MediaRecorder does.

It's probably a good idea too to use a fallible allocator for such big arrays, then we can abort the speech recognition with an error if there's a memory problem.

@@ +220,5 @@
> +    isHeaderWritten = true;
> +  }
> +
> +  mAudioEncoder->AppendAudioSegment(std::move(*aAudioSegment));
> +  mAudioEncoder->AdvanceCurrentTime(duration);

This must be the accumulated duration of all played segments until now.

Since you are passing the duration of only this segment, you are basically creating an unbounded buffer of audio data that only goes away when the AudioTrackEncoder is destructed.

This alone is a reason for r-.

@@ +231,5 @@
> +                                        mAudioEncoder->IsEncodingComplete() ?
> +                                        ContainerWriter::END_OF_STREAM : 0);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  nsTArray<nsTArray<uint8_t>> aOutputBufs;

s/aOutputBufs/outputBufs/

This is not an argument to this method, since the declaration is here.

@@ +232,5 @@
> +                                        ContainerWriter::END_OF_STREAM : 0);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  nsTArray<nsTArray<uint8_t>> aOutputBufs;
> +  rv = mWriter->GetContainerData(&aOutputBufs, ContainerWriter::FLUSH_NEEDED);

Don't flush every time. Instead do it only when you really need the data (like after EOS).

You should only flush once, or you'll confuse the OggWriter apparently: https://searchfox.org/mozilla-central/rev/033d45ca70ff32acf04286244644d19308c359d5/dom/media/ogg/OggWriter.cpp#180

@@ +382,5 @@
> +  return NS_OK;
> +}
> +
> +SpeechRecognitionResultList*
> +OnlineSpeechRecognitionService::BuildMockResultList()

This seems to belong in a unit test rather than release.

::: dom/media/webspeech/recognition/SpeechRecognition.cpp
@@ +57,5 @@
> +// sIdle holds the current state of the API, i.e, if a recognition
> +// is running or not. We don't want more than one recognition to be running
> +// at same time, so we set sIdle to true when the API is not active and to
> +// false when is active (recording or recognizing)
> +static bool sIdle = true;

The comment needs to state which thread can access sIdle, since it's not threadsafe.

Note that this static doesn't prevent two different child processes from running speech recognition at the same time. Is that intended?

::: dom/media/webspeech/recognition/SpeechRecognition.h
@@ +140,5 @@
>    void FeedAudioData(already_AddRefed<SharedBuffer> aSamples, uint32_t aDuration, MediaStreamListener* aProvider, TrackRate aTrackRate);
>  
>    friend class SpeechEvent;
> +
> +  nsCOMPtr<nsIThread> mEncodeThread;

Please document members and methods so I as a reviewer have a reference to verify their implementation and usage against. Such comments should at least say what the intention is and what threads are allowed to call/read/write the thing you're documenting.
Attachment #8993800 - Flags: review-
Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

(I'll review after pehrsons' comments are addressed)
Attachment #8993800 - Flags: review?(bugs)
Curious, what is the status with this?
Flags: needinfo?(anatal)
Assignee

Comment 53

8 months ago
Due to lack of resources, we needed to pause the work on this while I was working to bring voice search to Firefox Reality. Some of the issues that Andreas pointed out were already fixed (we worked together on those), so I expect to have a new patch in the next couple of weeks.
Flags: needinfo?(anatal)

Comment 54

5 months ago
Is this still on the back-burner? Currently I'm having to use Chrome for a project that requires this API but I would much rather use FF (especially with the offline recognition).

Comment 55

5 months ago
(In reply to davy.wybiral from comment #54)
> Is this still on the back-burner? Currently I'm having to use Chrome for a
> project that requires this API but I would much rather use FF (especially
> with the offline recognition).

Speaking on behalf of the machine learning team, it's on our radar.

We've just recently got our STT engine running on "small platform" devices using TFLite[1].

So for English the remaining steps are some polish of our "small platform" STT engine and integration of the engine into the patch Andre is working on.

For other languages we need data, which is a longer term problem we need to solve that's partially addressed by Common Voice.

[1] https://www.tensorflow.org/lite/

Comment 56

3 months ago

Andre, Olli, and Andreas, what's the current status of this?

It looks like Andreas' comments in his last review have not been addressed.

Alex was thinking about finishing this patch and integrating Deep Speech too.

Should he just jump in, fix Andreas' request for changes, and do it?

Flags: needinfo?(bugs)
Flags: needinfo?(apehrson)
Flags: needinfo?(anatal)

Andre is driving this so check with him. Last I heard he was working on setting up a mock server with the mochitests, but then he has gotten disrupted by other work a number of times too.

I'm happy to review any media bits, whoever writes them.

Flags: needinfo?(apehrson)

Comment 58

3 months ago

Many of Andreas' comments were already addressed after the last All Hands, including a major refactor of the media part. As he said, I was working on the mochitests and also finding the root of a memory leak that occurs when closing the tabs while the microphone capture is in progress.

If the ultimate goal is to integrate Deep Speech, I believe a better use of Alex's time would be to work on the backend instead of the frontend being discussed here, since they should be totally decoupled. That is: finish the Docker container with DeepSpeech and deploy it to Mozilla's services cloud infrastructure for online decoding, and/or create another bug and patch just to integrate DeepSpeech's inference stack and models into Gecko, plus the HTTP service which will receive the requests from the frontend here, in the offline case.

Both are still missing and currently need more attention than the patch being worked on here, which is already functional if manually applied to Gecko.

Flags: needinfo?(anatal)

Comment 59

3 months ago

> create another bug and patch just to integrate deepspeech's inference stack and models into gecko

For offline there is already #1474084

(In reply to Andre Natal from comment #58)

> If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

There's not that much to finish, it's working, and is iso-feature to the other implem served by speech proxy. I guess it would be more a question of production deployment etc. :)

Comment 61

3 months ago

(In reply to Andre Natal from comment #58)

> If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.
> 
> Both are still missing and currently needs more attention than the patch being worked here, which is already functional if manually applied to Gecko.

The Deep Speech backend exists already here [https://gitlab.com/deepspeech/ds-srv].

The associated Docker file is there too [https://gitlab.com/deepspeech/ds-srv/blob/master/Dockerfile.gpu]

So there is no blocker in that regard.

However, Alex and I are interested in bringing STT on device, bug 1474084, so no servers are required, and Alex wants to resolve this bug and bug 1474084.

Comment 62

3 months ago

(In reply to Alexandre LISSY :gerard-majax from comment #60)

> (In reply to Andre Natal from comment #58)
> 
> > If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.
> 
> There's not that much to finish, it's working, and is iso-feature to the other implem served by speech proxy. I guess it would be more a question of production deployment etc. :)

Okay, I'll create a thread including you and Mozilla Services to roll that out to production

Comment 63

3 months ago

(In reply to kdavis from comment #61)

> (In reply to Andre Natal from comment #58)
> 
> > If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.
> 
> > Both are still missing and currently needs more attention than the patch being worked here, which is already functional if manually applied to Gecko.
> 
> The Deep Speech backend exists already here [https://gitlab.com/deepspeech/ds-srv].
> 
> The associated Docker file is there too [https://gitlab.com/deepspeech/ds-srv/blob/master/Dockerfile.gpu]
> 
> So there is no blocker in that regard.
> 
> However, Alex and I are interested in bringing STT on device, bug 1474084, so no servers are required, and Alex wants to resolve this bug and bug 1474084.

Cool, so it's better to work on the tidbits to integrate DeepSpeech into Gecko in bug 1474084 instead of this one.

The goal of the patch here is to create an agnostic frontend which can communicate with whichever decoder through an HTTP REST API, regardless of online or offline.

If the goal is to create a local DeepSpeech speech server exposed via HTTP, you can use this as the frontend, but if the goal is to do something different, like for example injecting the frames directly into the inference stack, then it's better to create a completely new SpeechRecognitionService instead of injecting decoder-specific code into this patch.

Comment 64

3 months ago

(In reply to Andre Natal from comment #63)

> The goal of the patch here is to create an agnostic frontend which can communicate to whichever decoder through an HTTP Rest API, regardless of online or offline.

We want to use REST for offline?

Comment 65

3 months ago

I meant that if you point DEFAULT_RECOGNITION_ENDPOINT at a local HTTP service, it should work.

If that's not your goal, you should work on a whole new SpeechRecognitionService containing DeepSpeech-specific code in bug 1474084.

Hope that helps.
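
(For illustration, a minimal sketch of such a local HTTP service in Node.js; the port and the exact JSON contract are assumptions based on the response fields the patch in this bug parses, i.e. "status", "data[0].text" and "data[0].confidence":)

// Minimal fake STT endpoint. It ignores the posted audio and returns a canned
// hypothesis in the JSON shape that the online recognition service patch expects.
const http = require("http");

http.createServer((req, res) => {
  const chunks = [];
  req.on("data", (chunk) => chunks.push(chunk)); // encoded audio body from the browser
  req.on("end", () => {
    // A real service would decode Buffer.concat(chunks) here and run STT on it.
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({
      status: "ok",
      data: [{ text: "hello world", confidence: 0.95 }],
    }));
  });
}).listen(8080, () => console.log("Fake STT endpoint on http://localhost:8080/"));

Pointing DEFAULT_RECOGNITION_ENDPOINT at something like http://localhost:8080/ should then exercise the frontend end-to-end without any cloud dependency.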

(not sure I have anything to say here. I'm just reviewing whatever is dumped to my review queue ;)
And happy to review DOM/Gecko side of this)

Flags: needinfo?(bugs)

Comment 67

3 months ago

(In reply to Andre Natal from comment #62)

> Okay, I'll create a thread including you and Mozilla Services to roll that out to production

Could you put me on CC? Thanks.

Assignee

Updated

2 months ago
Depends on: 1541290
Assignee

Updated

2 months ago
Duplicate of this bug: 1392065
Assignee

Updated

2 months ago
Depends on: 1541298
Assignee

Comment 69

2 months ago
Attachment #8993800 - Attachment is obsolete: true
Assignee

Comment 70

2 months ago

This patch introduces a speech recognition service that interfaces with Mozilla's remote STT endpoint, which is currently used by multiple services.

Assignee

Updated

2 months ago
Attachment #9055677 - Attachment is obsolete: true

See bug 1547409. Migrating webcompat priority whiteboard tags to project flags.

Webcompat Priority: --- → P1
Assignee

Updated

16 days ago
Priority: -- → P2
Target Milestone: --- → Future