Bugzilla

Reporter

Comment 5

•

8 years ago

I'm just the reporter, not an implementor.

Sebastian

Flags: needinfo?(sebastianzartner)

Updated

•

8 years ago

Flags: needinfo?(anatal)

Updated

•

8 years ago

Flags: needinfo?(kdavis)

Andrew Overholt [:overholt]

Comment 6

•

8 years ago

Both Andre and myself implemented this originally; however, both of us are now in Connected Devices and consumed with preparation for London. So, at the earliest we could take a look after London. But even then it will be hard for us to dedicate lots of time to desktop.

Flags: needinfo?(kdavis)

Flags: needinfo?(anatal)

Comment 7

•

8 years ago

Let me see what I can do about prioritizing this on the platform side (I was going to suggest Andre/Kelly, too).

Flags: needinfo?(overholt)

Dão Gottwald [:dao]

Updated

•

8 years ago

Flags: needinfo?(dietrich)

Daniele "Mte90" Scasciafratte

Comment 8

•

8 years ago

The demo https://mdn.github.io/web-speech-api/ is useless without the api available...

Clemens Tolboom

Comment 9

•

8 years ago

The documentation https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API and it's examples implementations are not working in Firefox yet which is a little poor documented in section https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API#Browser_compatibility IHMO; Firefox Desktop is probably the browser right?

The demos work in Google Chrome (un)fortunately.

Makoto Kato [:m_kato]

Comment 10

•

8 years ago

Actually, there is no backend implementation for Recognition API except to Gonk.

Reporter

Comment 11

•

8 years ago

(In reply to Makoto Kato [:m_kato] from comment #10)
> Actually, there is no backend implementation for Recognition API except to
> Gonk.

Bug 1244237 comment 0 claimed something else. That's why I created this bug.
But if there's really no backend yet, you should create a bug for it blocking this one, so this feature can finally be tackled.

Sebastian

Abhishek

Comment 12

•

7 years ago

I am willing to work on this. Can somebody guide me on this one, what exactly is impeding the implementation of the recognition back end. I was thinking if we can use the speech recognition API provided by windows (not a platform independent solution, I know). It would be great to get it working at least somewhere and I am working on Windows version.

Clemens Tolboom

Comment 13

•

7 years ago

@abhishek the way I understand you can test this using Firefox desktop in Chrome context so use js or create a plugin. We 'only' need to expose this into website context. See 'User story' above.

From https://developer.mozilla.org/en-US/Add-ons/Setting_up_extension_development_environment

> devtools.chrome.enabled = true. This enables to run JavaScript code snippets in the chrome context of the Scratchpad from the Tools menu. Don't forget to switch from content to browser as context.

Does this help you?

Abhishek

Comment 14

•

7 years ago

(In reply to Clemens Tolboom from comment #13)
> @abhishek the way I understand you can test this using Firefox desktop in
> Chrome context so use js or create a plugin. We 'only' need to expose this
> into website context. See 'User story' above.
> 
> From
> https://developer.mozilla.org/en-US/Add-ons/
> Setting_up_extension_development_environment
> 
> > devtools.chrome.enabled = true. This enables to run JavaScript code snippets in the chrome context of the Scratchpad from the Tools menu. Don't forget to switch from content to browser as context.
> 
> Does this help you?

I still get the error "Exception: InvalidStateError: An attempt was made to use an object that is not, or is no longer, usable". I enabled devtools.chrome and webspeech.recognition, still can't get it to work. Can you verify that you've got speech input to work on desktop firefox somehow?

Clemens Tolboom

Comment 15

•

7 years ago

@abhishek what script are you running in Web developer > Scratchpad ?

Checking with from https://github.com/mdn/web-speech-api/blob/master/speech-color-changer/script.js running lines 1-9 + 'recognition;' as line 10 from Scratchpad on page about:config shows me the object SpeechRecognition __proto__: SpeechRecognitionPrototype when inspecting 'recognition;'

Remark #2 on https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API#Browser_compatibility is about this issue :)

Please place your code on https://gist.github.com/ for other to help

Abhishek

Comment 16

•

7 years ago

(In reply to Clemens Tolboom from comment #15)
> @abhishek what script are you running in Web developer > Scratchpad ?
> 
> Checking with from
> https://github.com/mdn/web-speech-api/blob/master/speech-color-changer/
> script.js running lines 1-9 + 'recognition;' as line 10 from Scratchpad on
> page about:config shows me the object SpeechRecognition __proto__:
> SpeechRecognitionPrototype when inspecting 'recognition;'
> 
> Remark #2 on
> https://developer.mozilla.org/en-US/docs/Web/API/
> Web_Speech_API#Browser_compatibility is about this issue :)
> 
> Please place your code on https://gist.github.com/ for other to help

I am trying that exact same script. Here's the gist: https://gist.github.com/Abhishek8394/dbf9338a9baf6ca05639929e0b72404d
Also the remark #2 as mentioned, it says that recognition hasn't been implemented yet asks to enable an option?

Thomas Wisniewski

Updated

•

7 years ago

See Also: → https://github.com/webcompat/web-bugs/issues/4372

Whiteboard: [webcompat]

Assignee

Comment 17

•

7 years ago

One needs to build Firefox with the flags turned on to include the only current implementation available (pocketsphinx + english). If you guys are willing to move forward with it, I can help you to set and have it running.

Updated

•

7 years ago

Flags: webcompat?

See Also: → https://webcompat.com/issues/9013

Richard Newman [:rnewman]

Assignee

Updated

•

7 years ago

Assignee: nobody → anatal

Depends on: 1392065

Comment 18

•

7 years ago

FWIW, Duolingo's support team is advising inquiring users (necessarily) to use Chrome:

---
Speaking exercises on the new website only work on browsers that support the Web Speech API, the most popular being Chrome. Unfortunately, the old speech system was outdated and a burden on our system. If you use the Chrome browser, you will get speaking exercises again.

If you are using Google Chrome and experiencing issues, users have found that updating or reinstalling the browser has resolved their issues. If you are still having issues, please reply to this email. :)

If you are using a browser without Web Speech API supported, you should not see any speaking exercises. If you do see speaking exercises, please know that your microphone will unfortunately not work. We are aware that some users may still be seeing these exercises and we're working hard to ensure that this is resolved quickly.
---

Duolingo has over 200M users.

I imagine that English-only speech reco wouldn't cover Duolingo's main customer bases.

Dietrich Ayala (:dietrich)

Comment 19

•

7 years ago

:-( Anyone know the list of languages Duolingo supports?

Comment 20

•

7 years ago

Spanish, French, German, Italian, Portuguese, Dutch, Irish, Danish, Swedish, High Valyrian, Russian, Swahili, Polish, Romanian, Greek, Esperanto, Turkish, Vietnamese, Hebrew, Norwegian, Ukrainian, Hungarian, Welsh, Czech

Daniele "Mte90" Scasciafratte

Comment 21

•

7 years ago

Time to internationalize + localize Common Voice!

Comment 22

•

7 years ago

I am not using that feature on Duolingo because the recognition before was using flash (and wasn't so perfect). 
In any case this another example that remind that is important to have this API available.

Daniele "Mte90" Scasciafratte

Comment 23

•

7 years ago

We, the machine learning group at Mozilla, have just recently gotten the quality of our speech recognition engine[1] to be on-par with commercial systems.

Currently we have enough American English training data to create an American English model. We don't have enough data to train other accents or languages, thus the need to internationalize + localize Common Voice.

We plan, by the end of Q2 in 2018, to implement the WebSpeech API backed by our speech recognition engine and integrate, at a minimum, the American English model by then. Other languages should follow in the second half of 2018. (Which languages depends on C-Level decisions, the internationalization + localization + success of Common Voice, and Duolingo-like issues, which would guide our language choice.)

[1] https://github.com/mozilla/deepspeech

Comment 24

•

7 years ago

I hope that this API will be available before Q2 of course :-)

Comment 25

•

7 years ago

PR's welcome! :-)

Thomas Wisniewski

Updated

•

7 years ago

See Also: → https://webcompat.com/issues/10715

Thomas Wisniewski

Updated

•

7 years ago

Blocks: 1409526

Comment 26

•

7 years ago

Let's try to do something :)

Assignee: anatal → lissyx+mozillians

Updated

•

7 years ago

Assignee: lissyx+mozillians → anatal

Jean-Yves Avenard [:jya]

Updated

•

7 years ago

Updated

•

6 years ago

See Also: → https://webcompat.com/issues/4372

Mike Taylor [:miketaylr]

Updated

•

6 years ago

No longer blocks: 973754

Updated

•

6 years ago

Blocks: 1456885

Mike Taylor [:miketaylr]

Updated

•

6 years ago

Flags: webcompat? → webcompat+

Whiteboard: [webcompat] → [webcompat:p2]

Assignee

Comment 28

•

6 years ago

Attached patch Introducing an online speech recognition service to enable Web Speech API (obsolete) — Details — Splinter Review

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8299b451b61dcfe1569938fb09102e96b3ef79f

Attachment #8984315 - Flags: review?(bugs)

Comment 29

•

6 years ago

Comment on attachment 8984315 [details] [diff] [review]
Introducing an online speech recognition service to enable Web Speech API

Please use Mozilla coding style. 2 spaces for indentation, mFoo for member variable naming, 
aBar for argument names etc.

>+class DecodeResultTask final : public Runnable
>+{
>+public:
>+  DecodeResultTask(bool succeded,
>+                   const nsString& hypstring,
>+                   float confidence,
>+                   WeakPtr<dom::SpeechRecognition> recognition,
>+                   const nsAutoString& errormessage)
>+      : mozilla::Runnable("DecodeResultTask"),
>+        mSucceeded(succeded),
>+        mResult(hypstring),
>+        mConfidence(confidence),
>+        mRecognition(recognition),
>+        mErrorMessage(errormessage)
>+  {
>+    MOZ_ASSERT(
>+      NS_IsMainThread()); // This should be running on the main thread
>+  }
>+
>+  NS_IMETHOD
>+  Run() override
>+  {
>+    MOZ_ASSERT(NS_IsMainThread()); // This method is supposed to run on the main
>+                                   // thread!
>+
>+    if (!mSucceeded) {
>+      mRecognition->DispatchError(SpeechRecognition::EVENT_RECOGNITIONSERVICE_ERROR,
>+                                  SpeechRecognitionErrorCode::Network, // TODO different codes?
>+                                  mErrorMessage);
>+
>+    } else {
>+      // Declare javascript result events
>+      RefPtr<SpeechEvent> event = new SpeechEvent(
>+        mRecognition, SpeechRecognition::EVENT_RECOGNITIONSERVICE_FINAL_RESULT);
>+      SpeechRecognitionResultList* resultList =
>+        new SpeechRecognitionResultList(mRecognition);
>+      SpeechRecognitionResult* result = new SpeechRecognitionResult(mRecognition);
>+
>+      if (0 < mRecognition->MaxAlternatives()) {
>+        SpeechRecognitionAlternative* alternative =
>+          new SpeechRecognitionAlternative(mRecognition);
>+
>+        alternative->mTranscript = mResult;
>+        alternative->mConfidence = mConfidence;
>+
>+        result->mItems.AppendElement(alternative);
>+      }
>+      resultList->mItems.AppendElement(result);
>+
>+      event->mRecognitionResultList = resultList;
>+      NS_DispatchToMainThread(event);
>+    }
>+    return NS_OK;
>+ }
>+
>+private:
>+  bool mSucceeded;
>+  nsString mResult;
>+  float mConfidence;
>+  WeakPtr<dom::SpeechRecognition> mRecognition;
>+  nsCOMPtr<nsIThread> mWorkerThread;
>+  nsAutoString mErrorMessage;
>+};
>+
>+NS_IMPL_ISUPPORTS(OnlineSpeechRecognitionService,
>+                  nsISpeechRecognitionService, nsIObserver, nsIStreamListener)
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStartRequest(nsIRequest* aRequest,
>+                            nsISupports* aContext)
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
>+  return NS_OK;
>+}
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnDataAvailable(nsIRequest* aRequest,
>+                             nsISupports* aContext,
>+                             nsIInputStream* aInputStream,
>+                             uint64_t aOffset,
>+                             uint32_t aCount)
>+{
>+  nsresult rv;
>+  this->mBuf = new char[aCount];
>+  uint32_t _retval;
>+  rv = aInputStream->ReadSegments(NS_CopySegmentToBuffer, this->mBuf, aCount, &_retval);
>+  NS_ENSURE_SUCCESS(rv, rv);
>+  return NS_OK;
>+}
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStopRequest(nsIRequest* aRequest,
>+                           nsISupports* aContext,
>+                           nsresult aStatusCode)
>+{
>+  bool success;
>+  nsresult rv;
>+  float confidence;
>+  confidence = 0;
>+  Json::Value root;
>+  Json::Reader reader;
>+  bool parsingSuccessful;
>+  nsAutoCString result;
>+  nsAutoCString hypoValue;
>+  nsAutoString errorMsg;
>+
>+  SR_LOG("STT Result: %s", this->mBuf);
>+
>+  if (NS_FAILED(aStatusCode)) {
>+    success = false;
>+    errorMsg.Assign(NS_LITERAL_STRING("CONNECTION_ERROR"));
>+  } else {
>+    success = true;
>+    parsingSuccessful = reader.parse(this->mBuf, root);
>+    if (!parsingSuccessful) {
>+      errorMsg.Assign(NS_LITERAL_STRING("RECOGNITIONSERVICE_ERROR"));
>+      success = false;
>+    } else {
>+      result.Assign(root.get("status","error").asString().c_str());
>+      if (result.EqualsLiteral("ok")) {
>+        // ok, we have a result
>+        hypoValue.Assign(root["data"][0].get("text","").asString().c_str());
>+        confidence = root["data"][0].get("confidence","0").asFloat();
>+      } else {
>+        // there's an internal server error
>+        errorMsg.Assign(NS_LITERAL_STRING("NO_HYPOTHESIS"));
>+        success = false;
>+      }
>+    }
>+  }
>+
>+  RefPtr<Runnable> resultrunnable =
>+    new DecodeResultTask(success, NS_ConvertUTF8toUTF16(hypoValue), confidence,
>+                         mRecognition, errorMsg);
>+  rv = NS_DispatchToMainThread(resultrunnable);
>+
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
>+
>+  return NS_OK;
>+}
>+
>+OnlineSpeechRecognitionService::OnlineSpeechRecognitionService()
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
What is this? mBuf is uninitialized in the constructor

>+  audioEncoder = nullptr;
>+  ISDecoderCreated = true;
>+  ISGrammarCompiled = true;
Please use normal C++ member variable initialization,
OnlineSpeechRecognitionService::OnlineSpeechRecognitionService()
  : mBuf(nullptr)
...

looks like ISDecoderCreated isn't ever used, nor ISGrammarCompiled

>+OnlineSpeechRecognitionService::~OnlineSpeechRecognitionService()
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
again, rather mysterious nullcheck. And you leak mBuf here.

>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::ProcessAudioSegment(
>+  AudioSegment* aAudioSegment, int32_t aSampleRate)
>+{
On which thread does this method run?
Encoding may take time so it should not run on the main thread.
So, 
MOZ_ASSERT(!NS_IsMainThread());

>+OnlineSpeechRecognitionService::Observe(nsISupports* aSubject,
>+                                              const char* aTopic,
>+                                              const char16_t* aData)
align params

>+class OnlineSpeechRecognitionService : public nsISpeechRecognitionService,
>+                                             public nsIObserver,
>+                                             public nsIStreamListener
align inherited classes

Attachment #8984315 - Flags: review?(bugs) → review-

Assignee

Comment 30

•

6 years ago

Attached patch Bug-1248897-Introducing an online speech recognition service for web speech api (obsolete) — Details — Splinter Review

Hi Olli,

Here's a version with your comments already addressed. If you want to test it, apply the patch, flip the prefs, and head here: https://andrenatal.github.io/webspeechapi/index_auto.html. 

Currently the tests we have convert the fake recognition service and the state machine, so I'm working to make them working with this online service too.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b9f223217bde85215e8f99416cae66098ff13fc

Thanks

Andre

Attachment #8984315 - Attachment is obsolete: true

Attachment #8986675 - Flags: review?(bugs)

Comment 31

•

6 years ago

Comment on attachment 8986675 [details] [diff] [review]
Bug-1248897-Introducing an online speech recognition service for web speech api

>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStartRequest(nsIRequest *aRequest,
>+                                               nsISupports *aContext)
>+{
>+  this->mBuf = nullptr;
Why this?
If mBuf points to some value, you leak it here.
No need for this-> here nor elsewhere


>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnDataAvailable(nsIRequest *aRequest,
>+                                                nsISupports *aContext,
>+                                                nsIInputStream *aInputStream,
Per Mozilla coding style * goes with the type.
nsIRequest* aRequest etc. Here and elsewhere.


>+                                                uint64_t aOffset,
>+                                                uint32_t aCount)
>+{
>+  nsresult rv;
>+  this->mBuf = new char[aCount];
>+  uint32_t _retval;
retVal
But it isn't retVal, but read count, so perhaps readCount ?


>+OnlineSpeechRecognitionService::OnStopRequest(nsIRequest *aRequest,
>+                                              nsISupports *aContext,
>+                                              nsresult aStatusCode)
>+{
>+  bool success;
>+  float confidence = 0;
>+  Json::Value root;
>+  Json::Reader reader;
>+  bool parsingSuccessful;
>+  nsAutoCString result;
>+  nsAutoCString hypoValue;
>+  nsAutoString errorMsg;
>+  SR_LOG("STT Result: %s", this->mBuf);
>+
>+  if (NS_FAILED(aStatusCode)) {
>+    success = false;
>+    errorMsg.Assign(NS_LITERAL_STRING("CONNECTION_ERROR"));
the spec doesn't define "CONNECTION_ERROR"

>+  } else {
>+    success = true;
>+    parsingSuccessful = reader.parse(this->mBuf, root);
>+    if (!parsingSuccessful) {
>+      errorMsg.Assign(NS_LITERAL_STRING("RECOGNITIONSERVICE_ERROR"));
The spec doesn't define "RECOGNITIONSERVICE_ERROR"

>+      success = false;
>+    } else {
>+      result.Assign(root.get("status","error").asString().c_str());
missing space after ,

>+      if (result.EqualsLiteral("ok")) {
>+        // ok, we have a result
>+        hypoValue.Assign(root["data"][0].get("text","").asString().c_str());
>+        confidence = root["data"][0].get("confidence","0").asFloat();
>+      } else {
>+        // there's an internal server error
>+        errorMsg.Assign(NS_LITERAL_STRING("NO_HYPOTHESIS"));
I don't see "NO_HYPOTHESIS" in the spec


>+OnlineSpeechRecognitionService::~OnlineSpeechRecognitionService()
>+{
>+  this->mBuf = nullptr;
So you leak mBuf

>+  this->mAudioEncoder = nullptr;
>+  this->mWriter = nullptr;
No need to set these to null in destructor.



>   nsresult rv;
>   rv = mRecognitionService->Initialize(this);
>   if (NS_WARN_IF(NS_FAILED(rv))) {
>     return;
>   }
> 
>+  SpeechRecognition::SetIdle(false);
>+
>+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
>+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"), 0,
>+                                           getter_AddRefs(this->mEncodeThread));
fix indentation


(this will need couple of iterations.)

Attachment #8986675 - Flags: review?(bugs) → review-

Comment 32

•

6 years ago

(In reply to kdavis from comment #23)
> We, the machine learning group at Mozilla, have just recently gotten the
> quality of our speech recognition engine[1] to be on-par with commercial
> systems.
> 
> Currently we have enough American English training data to create an
> American English model. We don't have enough data to train other accents or
> languages, thus the need to internationalize + localize Common Voice.
> 
> We plan, by the end of Q2 in 2018, to implement the WebSpeech API backed by
> our speech recognition engine and integrate, at a minimum, the American
> English model by then. Other languages should follow in the second half of
> 2018. (Which languages depends on C-Level decisions, the
> internationalization + localization + success of Common Voice, and
> Duolingo-like issues, which would guide our language choice.)
> 
> [1] https://github.com/mozilla/deepspeech

So it's 2018 Q2, any updates on deepspeech implementation?

Reporter

Comment 33

•

6 years ago

(In reply to Tim Langhorst from comment #32)
> So it's 2018 Q2, any updates on deepspeech implementation?

You may have missed it but this bug is assigned and actively being worked on. Though I'm just the reporter, not the implementer, so I have no idea how long it will take until it's finished. Regarding Olli Pettay's last comment saying "this will need couple of iterations", I assume it will still take a while until the patch is ready to land in Firefox.

Sebastian

Comment 34

•

6 years ago

(In reply to Sebastian Zartner [:sebo] from comment #33)
> You may have missed it but this bug is assigned and actively being worked
> on. Though I'm just the reporter, not the implementer, so I have no idea how
> long it will take until it's finished. Regarding Olli Pettay's last comment
> saying "this will need couple of iterations", I assume it will still take a
> while until the patch is ready to land in Firefox.

Yeah but this is ONLINE speech recognition and not the OFFLINE one via deepspeech that I'm interested in. That's why I've replied to kdavis

Marco Castelluccio [:marco]

Reporter

Comment 35

•

6 years ago

(In reply to Tim Langhorst from comment #34)
> (In reply to Sebastian Zartner [:sebo] from comment #33)
> > You may have missed it but this bug is assigned and actively being worked
> > on. Though I'm just the reporter, not the implementer, so I have no idea how
> > long it will take until it's finished. Regarding Olli Pettay's last comment
> > saying "this will need couple of iterations", I assume it will still take a
> > while until the patch is ready to land in Firefox.
> 
> Yeah but this is ONLINE speech recognition and not the OFFLINE one via
> deepspeech that I'm interested in. That's why I've replied to kdavis

Good point. Asking him for info about that. Also, if offline speech recognition is still pursued, should a separate bug be created for it?

Sebastian

Flags: needinfo?(kdavis)

Updated

•

6 years ago

Keywords: feature

Comment 36

•

6 years ago

You can open a separate bug for backing the WebSpeech API with offline STT.

If you do so, can you list the required platforms, runtime memory requirements, maximal size the library can be, and maximal size the model can be in the bug. Thanks.

Flags: needinfo?(kdavis)

Reporter

Updated

•

6 years ago

Blocks: 1474124

Reporter

Comment 37

•

6 years ago

(In reply to kdavis from comment #36)
> You can open a separate bug for backing the WebSpeech API with offline STT.
> 
> If you do so, can you list the required platforms, runtime memory
> requirements, maximal size the library can be, and maximal size the model
> can be in the bug. Thanks.

As I wrote before, I am just the reporter, not an implementer, therefore I can't decide on those questions. Nonetheless I have created bug 1474124 for the offline API. Anyone able to answer those questions should do that there.

Sebastian

Comment 38

•

6 years ago

I have already opened bug 1474084

Updated

•

6 years ago

Blocks: 1474084

Assignee

Comment 39

•

6 years ago

Attached patch 0001-Bug-1248897-Introducing-an-online-speech-recognition.patch (obsolete) — Details — Splinter Review

Hi Olli, please see a new version with your comments addressed.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=60c87fdc8f392c5178161474c067dd2cb7897ee5

Attachment #8986675 - Attachment is obsolete: true

Attachment #8990848 - Flags: review?(bugs)

Assignee

Updated

•

6 years ago

Attachment #8990848 - Attachment is obsolete: true

Attachment #8990848 - Flags: review?(bugs)

Liz Henry (:lizzard) (relman/hg->git project)

Assignee

Comment 40

•

6 years ago

Attached patch 0001-Bug-1248897-Introducing-an-online-speech-recognition.patch (obsolete) — Details — Splinter Review

Please consider this version.

Attachment #8991057 - Flags: review?(bugs)

Comment 41

•

6 years ago

Marking this as affecting 63 just to indicate that this is being actively worked on in Nightly in my list of features.

status-firefox47: affected → wontfix

status-firefox63: --- → affected

Panos Astithas (he/him) [:past] (please ni?)

Updated

•

6 years ago

Whiteboard: [webcompat:p2] → [webcompat:p1]

Comment 42

•

6 years ago

Comment on attachment 8991057 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

>+
>+  /** Audio data */
>+  nsTArray<uint8_t> mAudioVector;
>+
>+  RefPtr<AudioTrackEncoder> mAudioEncoder;
>+  UniquePtr<ContainerWriter> mWriter;
>+  char* mBuf;
Why you use char* and not for example nsCString?
>+SpeechRecognition::~SpeechRecognition()
>+{
>+  if (this->mEncodeThread) {
>+    this->mEncodeThread->Shutdown();
>+  }
>+  this->mDocument = nullptr;
Setting nsCOMPtr or RefPtr member variables to null in destructor is useless.
nsCOMPtr's and RefPtr's destructors will do that automatically



>+bool
>+SpeechRecognition::IsIdle()
>+{
>+  return kIDLE;
>+}
>+
>+void
>+SpeechRecognition::SetIdle(bool aIdle)
>+{
>+  SR_LOG("Setting idle");
>+  kIDLE = aIdle;
>+}
It is totally unclear what kIDLE means. And the static variable is wrongly named. k-prefix is for constants, and kIDLE clearly isn't a constant.




>+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
>+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"),
>+                          0,
>+                          getter_AddRefs(this->mEncodeThread));
>+
Given all the memshrink efforts because of Fission (see dev.platform), need to ensure we kill the thread rather soon once it isn't needed anymore.
So, probably way before destructor of SpeechRecognition.


> 
>+  static bool IsIdle();
>+  static void SetIdle(bool aIdle);
These need some documentation


>+
>+  nsCOMPtr<nsIDocument> mDocument;
mDocument isn't cycle collected, so this will leak.
Do you need mDocument? SpeechRecognition is an DOMEventTargetHelper object, so one can always get the Window object and from that get the extant document.
 


>     CXXFLAGS += ['-Wno-error=shadow']
>diff --git a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
>index 5f8e6181fb32..69fcfdf44690 100644
>--- a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
Why the changes to FakeSpeechRecognitionService?

SpeechRecognitionService.h

>+++ b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.h
>@@ -15,17 +15,17 @@
>   {0x48c345e7, 0x9929, 0x4f9a, {0xa5, 0x63, 0xf4, 0x78, 0x22, 0x2d, 0xab, 0xcd}};
> 
> namespace mozilla {
> 
> class FakeSpeechRecognitionService : public nsISpeechRecognitionService,
>                                      public nsIObserver
> {
> public:
>-  NS_DECL_ISUPPORTS
>+  NS_DECL_THREADSAFE_ISUPPORTS
Hmm, why this change?

Attachment #8991057 - Flags: review?(bugs) → review-

Assignee

Comment 43

•

6 years ago

(In reply to Olli Pettay [:smaug] (vacation Jul 15->) from comment #42)
> Comment on attachment 8991057 [details] [diff] [review]
> 0001-Bug-1248897-Introducing-an-online-speech-recognition.patch
> 
> >+
> >+  /** Audio data */
> >+  nsTArray<uint8_t> mAudioVector;
> >+
> >+  RefPtr<AudioTrackEncoder> mAudioEncoder;
> >+  UniquePtr<ContainerWriter> mWriter;
> >+  char* mBuf;
> Why you use char* and not for example nsCString?


I changed to use an nsCString but needed to add a new function [1] to consume the data from the inpustream since NS_CopySegmentToBuffer expects a char*.   


[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9563eabe4915e08b28c1dcd70b998ee5R59


> >+SpeechRecognition::~SpeechRecognition()
> >+{
> >+  if (this->mEncodeThread) {
> >+    this->mEncodeThread->Shutdown();
> >+  }
> >+  this->mDocument = nullptr;
> Setting nsCOMPtr or RefPtr member variables to null in destructor is useless.
> nsCOMPtr's and RefPtr's destructors will do that automatically
> 
> 

Fixed.

> 
> >+bool
> >+SpeechRecognition::IsIdle()
> >+{
> >+  return kIDLE;
> >+}
> >+
> >+void
> >+SpeechRecognition::SetIdle(bool aIdle)
> >+{
> >+  SR_LOG("Setting idle");
> >+  kIDLE = aIdle;
> >+}
> It is totally unclear what kIDLE means. And the static variable is wrongly
> named. k-prefix is for constants, and kIDLE clearly isn't a constant.
> 
> 

Ok, fixed the name to sIdle and added some comments explaining its purpose

> 
> 
> >+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
> >+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"),
> >+                          0,
> >+                          getter_AddRefs(this->mEncodeThread));
> >+
> Given all the memshrink efforts because of Fission (see dev.platform), need
> to ensure we kill the thread rather soon once it isn't needed anymore.
> So, probably way before destructor of SpeechRecognition.
> 
> 

Yes, sure. So we are now shutting down the thread on StopRecording[1] and AbortSilently[2]

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beR567
[2] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beR563


> > 
> >+  static bool IsIdle();
> >+  static void SetIdle(bool aIdle);
> These need some documentation
> 
> 

Done [1]

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-2187e63217f1dd33be599ab9dfd73879R102

> >+
> >+  nsCOMPtr<nsIDocument> mDocument;
> mDocument isn't cycle collected, so this will leak.
> Do you need mDocument? SpeechRecognition is an DOMEventTargetHelper object,
> so one can always get the Window object and from that get the extant
> document.
>  
> 

I was using the document to pass is as the principal when creating the channel to be used in the http request[1]. But I figured that I could just use `nsContentUtils::GetSystemPrincipal()`. 
So I reverted all the changes I was doing with the document and am not exposing it as a member anymore. Thanks for pointing that.

https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9563eabe4915e08b28c1dcd70b998ee5R277

> 
> >     CXXFLAGS += ['-Wno-error=shadow']
> >diff --git a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
> >index 5f8e6181fb32..69fcfdf44690 100644
> >--- a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
> Why the changes to FakeSpeechRecognitionService?
> 
> SpeechRecognitionService.h
> 

If we don't add SpeechRecognition::SetIdle(true) here[1], some existing tests will break.

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9a8ef72f8c41148099a7df7aafd4aed9R52


> >+++ b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.h
> >@@ -15,17 +15,17 @@
> >   {0x48c345e7, 0x9929, 0x4f9a, {0xa5, 0x63, 0xf4, 0x78, 0x22, 0x2d, 0xab, 0xcd}};
> > 
> > namespace mozilla {
> > 
> > class FakeSpeechRecognitionService : public nsISpeechRecognitionService,
> >                                      public nsIObserver
> > {
> > public:
> >-  NS_DECL_ISUPPORTS
> >+  NS_DECL_THREADSAFE_ISUPPORTS
> Hmm, why this change?

As we are now calling ProcessAudioSegment in a thread[1], if we don't change the interface to NS_DECL_THREADSAFE_ISUPPORTS, the tests are breaking as well. Here's a gist of the test results with only NS_DECL_ISUPPORTS [2]


[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beL403

[2] https://gist.github.com/andrenatal/5e9b12aff43a8369d7664d9a4d6314a6

Assignee

Comment 44

•

6 years ago

Attached patch 0001-Bug-1248897-Introducing-an-online-speech-recognition.patch (obsolete) — Details — Splinter Review

Please see the updated version with all the issues addressed.

Attachment #8991057 - Attachment is obsolete: true

Assignee

Comment 45

•

6 years ago

Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

Hi Andreas, do you mind reviewing the current version of the patch for us, please?

Attachment #8993800 - Flags: review?(apehrson)

Assignee

Comment 46

•

6 years ago

Here's the try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=855bbee26ceee70a3b2e325ee7e6a5f332507cd3

Assignee

Comment 47

•

6 years ago

@pehrson, another interesting issue that I found is that when I close the browser (or first the tab and then the browser) while the microphone is open and capturing (or when the MediaStream is being fed with audio [1]), I'm getting this crash [2] 


[1] https://github.com/andrenatal/webspeechapi/blob/gh-pages/index_auto.js#L85

[2] https://gist.github.com/andrenatal/2c5220f4a65ec785297d9ac811808ced 

Do you have any idea why that's happening?

Thanks,

Andre

Assignee

Updated

•

6 years ago

Flags: needinfo?(apehrson)

Assignee

Updated

•

6 years ago

Attachment #8993800 - Flags: review?(apehrson) → review?(bugs)

Assignee

Comment 48

•

6 years ago

Asking smaug for review again since seems he's back from PTO.

Comment 49

•

6 years ago

Sorry for the late reply. I was on PTO last week but am back for this week.

(In reply to André Natal from comment #47)
> @pehrson, another interesting issue that I found is that when I close the
> browser (or first the tab and then the browser) while the microphone is open
> and capturing (or when the MediaStream is being fed with audio [1]), I'm
> getting this crash [2] 
> 
> 
> [1]
> https://github.com/andrenatal/webspeechapi/blob/gh-pages/index_auto.js#L85
> 
> [2] https://gist.github.com/andrenatal/2c5220f4a65ec785297d9ac811808ced 
> 
> Do you have any idea why that's happening?
> 
> Thanks,
> 
> Andre

The "WARNING: YOU ARE LEAKING THE WORLD" indicates that you have added a non-cycle-collected reference cycle somewhere in your code. This is preventing shutdown. If you have one that is legit you probably want to listen for xpcom-shutdown and break it then.

First you need to find it however, which is easier said than done. I often find that auditing my own code helps, or sometimes (if there are lots of new RefPtr members) disabling chunk after chunk until it doesn't happen anymore, to try and narrow down the piece of code containing the cycle.

Flags: needinfo?(apehrson)

Comment 50

•

6 years ago

Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

Review of attachment 8993800 [details] [diff] [review]:
-----------------------------------------------------------------

I have mostly looked at the usage of AudioTrackEncoder, though other comments might have slipped in too.

Some basic things for the AudioTrackEncoder are causing this r-, most notably mAudioVector allocations and mAudioTrackEncoder->AdvanceCurrentTime calls.

::: dom/media/webspeech/recognition/OnlineSpeechRecognitionService.cpp
@@ +195,5 @@
> +  }
> +
> +  nsresult rv;
> +
> +  if (!isHeaderWritten) {

Is this a member? If so it should be mIsHeaderWritten.

@@ +199,5 @@
> +  if (!isHeaderWritten) {
> +    mWriter = MakeUnique<OggWriter>();
> +    mAudioEncoder = MakeAndAddRef<OpusTrackEncoder>(aSampleRate);
> +    mAudioEncoder->TryInit(*aAudioSegment, duration);
> +    mAudioEncoder->SetStartOffset(0);

SetStartOffset needs to get the number of samples that comes before the first segment. This so that bookkeeping of the actual samples work. An example to understand how this works:

mAudioEncoder->SetStartOffset(128);
// The internal buffer now contains 128 samples of silence and no sound.
mAudioEncoder->AppendAudioSegment(aAudioSegment);
// For aAudioSegment of 256 samples the buffer now contains 128 samples of silence and 256 of sound
mAudioEncoder->AdvanceCurrentTime(256);
// The internal buffer now contains 256 samples of silence and 128 of sound. 128 samples were passed on.
mAudioEncoder->AppendAudioSegment(aAudioSegment2);
// For aAudioSegment2 of 128 samples the buffer now contains 256 samples of silence and 256 of sound
mAudioEncoder->AdvanceCurrentTime(384);
// The internal buffer now contains 384 samples of silence and 128 of sound. 128 samples were passed on.

@@ +214,5 @@
> +    rv = mWriter->GetContainerData(&aOutputBufs, ContainerWriter::GET_HEADER);
> +    NS_ENSURE_SUCCESS(rv, rv);
> +
> +    for (auto& buffer : aOutputBufs) {
> +        mAudioVector.AppendElements(buffer);

Indentation is wrong in so many places. Consider fixing it in all your files with clang-format.

This AppendElements() is going to have to expand the array very often. Consider using something else for storage, like an array of EncodedFrameContainer that would be simpler to append to as it grows large. Since you only use the array after finishing the recording you can concat the EncodedFrameContainers then as the total duration is known and you only need one allocation, or use them one-by-one if the streaming api allows it.

Is there a fixed upper bound on the length of the recording? If not it'll be possible to trigger an OOM crash through this (though it is encoded opus so it'll take a while). Then we should probably store it in a temp file like MediaRecorder does.

It's probably a good idea too to use a fallible allocator for such big arrays, then we can abort the speech recognition with an error if there's a memory problem.

@@ +220,5 @@
> +    isHeaderWritten = true;
> +  }
> +
> +  mAudioEncoder->AppendAudioSegment(std::move(*aAudioSegment));
> +  mAudioEncoder->AdvanceCurrentTime(duration);

This must be the accumulated duration of all played segments until now.

Since you are passing the duration of only this segment, you are basically creating an unbounded buffer of audio data that only goes away when the AudioTrackEncoder is destructed.

This alone is a reason for r-.

@@ +231,5 @@
> +                                        mAudioEncoder->IsEncodingComplete() ?
> +                                        ContainerWriter::END_OF_STREAM : 0);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  nsTArray<nsTArray<uint8_t>> aOutputBufs;

s/aOutputBufs/outputBufs/

This is not an argument to this method, since the declaration is here.

@@ +232,5 @@
> +                                        ContainerWriter::END_OF_STREAM : 0);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  nsTArray<nsTArray<uint8_t>> aOutputBufs;
> +  rv = mWriter->GetContainerData(&aOutputBufs, ContainerWriter::FLUSH_NEEDED);

Don't flush every time. Instead do it only when you really need the data (like after EOS).

You should only flush once, or you'll confuse the OggWriter apparently: https://searchfox.org/mozilla-central/rev/033d45ca70ff32acf04286244644d19308c359d5/dom/media/ogg/OggWriter.cpp#180

@@ +382,5 @@
> +  return NS_OK;
> +}
> +
> +SpeechRecognitionResultList*
> +OnlineSpeechRecognitionService::BuildMockResultList()

This seems to belong in a unit test rather than release.

::: dom/media/webspeech/recognition/SpeechRecognition.cpp
@@ +57,5 @@
> +// sIdle holds the current state of the API, i.e, if a recognition
> +// is running or not. We don't want more than one recognition to be running
> +// at same time, so we set sIdle to true when the API is not active and to
> +// false when is active (recording or recognizing)
> +static bool sIdle = true;

The comment needs to state which thread can access sIdle, since it's not threadsafe.

Note that this static doesn't prevent two different child processes from running speech recognition at the same time. Is that intended?

::: dom/media/webspeech/recognition/SpeechRecognition.h
@@ +140,5 @@
>    void FeedAudioData(already_AddRefed<SharedBuffer> aSamples, uint32_t aDuration, MediaStreamListener* aProvider, TrackRate aTrackRate);
>  
>    friend class SpeechEvent;
> +
> +  nsCOMPtr<nsIThread> mEncodeThread;

Please document members and methods so I as a reviewer have a reference to verify their implementation and usage against. Such comments should at least say what the intention is and what threads are allowed to call/read/write the thing you're documenting.

Attachment #8993800 - Flags: review-

Comment 51

•

6 years ago

Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

(I'll review after pehrsons' comments are addressed)

Attachment #8993800 - Flags: review?(bugs)

Updated

•

6 years ago

See Also: → https://webcompat.com/issues/18394

Comment 52

•

6 years ago

Curious, what is the status with this?

Flags: needinfo?(anatal)

Sylvestre Ledru [:Sylvestre]

Assignee

Comment 53

•

6 years ago

Due lack of resources, we needed to pause the work on this while I was working to bring voice search for Firefox Reality. Some of the issues that Andreas pointed was already fixed (we worked together on those), so I expect to have a new patch in the next couple of weeks.

Flags: needinfo?(anatal)

davy.wybiral

Comment 54

•

5 years ago

Is this still on the back-burner? Currently I'm having to use Chrome for a project that requires this API but I would much rather use FF (especially with the offline recognition).

Updated

•

5 years ago

status-firefox47: wontfix → ---

status-firefox63: affected → ---

Thomas Wisniewski [:twisniewski]

Comment 55

•

5 years ago

(In reply to davy.wybiral from comment #54)
> Is this still on the back-burner? Currently I'm having to use Chrome for a
> project that requires this API but I would much rather use FF (especially
> with the offline recognition).

Speaking on behalf of the machine learning team, it's on our radar.

We've just recently got our STT engine running on "small platform" devices using TFLite[1].

So for English the remaining steps are some polish of our "small platform" STT engine and integration of the engine in to the patch Andre is working on.

For other languages we need data, which is a longer term problem we need to solve that's partially addressed by Common Voice.

[1] https://www.tensorflow.org/lite/

Updated

•

5 years ago

See Also: → https://webcompat.com/issues/25180

Comment 56

•

5 years ago

Andre, Oli, and Andreas what's the current status of this?

It looks like Andreas' comments in his last review have not been addressed.

Alex was thinking about finishing this patch and integrating Deep Speech too.

Should he just jump in, fix Andreas' request for changes, and do it?

Flags: needinfo?(bugs)

Flags: needinfo?(apehrson)

Flags: needinfo?(anatal)

Comment 57

•

5 years ago

Andre is driving this so check with him. Last I heard he was working on setting up a mock server with the mochitests, but then he has gotten disrupted by other work a number of times too.

I'm happy to review any media bits, whoever writes them.

Flags: needinfo?(apehrson)

Comment 58

•

5 years ago

Many of the Andreas' comments were already addressed after the last All Hands, including a major refactor of the media part. As he said, I was working on the mochitests and also finding the root of a memory leak that occurs when closing the tabs while the microphone capture is in process.

If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

Both are still missing and currently needs more attention than the patch being worked here, which is already functional if manually applied to Gecko.

Flags: needinfo?(anatal)

Comment 59

•

5 years ago

create another bug and patch just to integrate deepspeech's inference stack and models into gecko

For offline there is already #1474084

Comment 60

•

5 years ago

(In reply to Andre Natal from comment #58)

If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

There's not that much to finish, it's working, and is iso-feature to the other implem served by speech proxy. I guess it would be more a question of production deployment etc. :)

Comment 61

•

5 years ago

(In reply to Andre Natal from comment #58)

If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

Both are still missing and currently needs more attention than the patch being worked here, which is already functional if manually applied to Gecko.

The Deep Speech backend exists already here [https://gitlab.com/deepspeech/ds-srv].

The associated Docker file is there too [https://gitlab.com/deepspeech/ds-srv/blob/master/Dockerfile.gpu]

So there is no blocker in that regard.

However, Alex and I are interested in bringing STT on device, bug 1474084, so no servers are required, and Alex wants to resolve this bug and bug 1474084.

Comment 62

•

5 years ago

(In reply to Alexandre LISSY :gerard-majax from comment #60)

(In reply to Andre Natal from comment #58)

If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

There's not that much to finish, it's working, and is iso-feature to the other implem served by speech proxy. I guess it would be more a question of production deployment etc. :)

Okay, I'll create a thread including you and Mozilla Services to roll that out to production

Comment 63

•

5 years ago

(In reply to kdavis from comment #61)

(In reply to Andre Natal from comment #58)

If the ultimate goal is to integrate Deep Speech, I believe a better use for Alex' time would be to work in the backend instead the frontend being discussed here, since they should be totally decoupled, i.e, finish the docker containing deepspeech and deploy it to Mozilla's services cloud infrastructure, for online decoding, and/or, create another bug and patch just to integrate deepspeech's inference stack and models into gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

Both are still missing and currently needs more attention than the patch being worked here, which is already functional if manually applied to Gecko.

The Deep Speech backend exists already here [https://gitlab.com/deepspeech/ds-srv].

The associated Docker file is there too [https://gitlab.com/deepspeech/ds-srv/blob/master/Dockerfile.gpu]

So there is no blocker in that regard.

However, Alex and I are interested in bringing STT on device, bug 1474084, so no servers are required, and Alex wants to resolve this bug and bug 1474084.

Cool, so is better to work on the tidbits to integrate deepspeech into gecko on bug 1474084 instead this one.

The goal of the patch here is to create an agnostic frontend which can communicate to whichever decoder through an HTTP Rest API, regardless of online or offline.

If the goal is to create a local deepspeech speech server exposed via http, you can use this as a frontend, but if the goal is to do something different, like for example injecting the frames directly into the inference stack, then is better to create a completely new SpeechRecognitionService instead injecting decoder specific code in this patch.

Comment 64

•

5 years ago

(In reply to Andre Natal from comment #63)

The goal of the patch here is to create an agnostic frontend which can communicate to whichever decoder through an HTTP Rest API, regardless of online or offline.

We want to use REST for offline?

Comment 65

•

5 years ago

•

Edited

I meant if you add a local http service to DEFAULT_RECOGNITION_ENDPOINT, it should work.

If that's not your goal, you should work on a whole new SpeechRecognitionService containing deep speech specific code on bug 1474084.

Hope that helps.

Comment 66

•

5 years ago

(not sure I have anything to say here. I'm just reviewing whatever is dumped to my review queue ;)
And happy to review DOM/Gecko side of this)

Flags: needinfo?(bugs)

Comment 67

•

5 years ago

(In reply to Andre Natal from comment #62)

Okay, I'll create a thread including you and Mozilla Services to roll that out to production

Could you put me on CC. Thanks

Assignee

Updated

•

5 years ago

Depends on: 1541290

Assignee

Updated

•

5 years ago

Depends on: 1541298

Assignee

Comment 69

•

5 years ago

Attached patch 0001-Bug-1248897-Introducing-an-online-speech-recognition.patch (obsolete) — Details — Splinter Review

Attachment #8993800 - Attachment is obsolete: true

Assignee

Comment 70

•

5 years ago

Attached file Introducing an online speech recognition service for Web Speech API — Details

This patch introduces a Speech Recognition Service which interfaces with Mozilla's remote STT endpoint which is currently being used by multiple services

Assignee

Updated

•

5 years ago

Attachment #9055677 - Attachment is obsolete: true

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c95e0fea71e&selectedJob=238007445

Assignee

Comment 71

•

5 years ago

•

Edited

try:

Please note that try is now displaying the leaks described on the blocking bugs I opened yesterday

https://bugzilla.mozilla.org/show_bug.cgi?id=1541298
https://bugzilla.mozilla.org/show_bug.cgi?id=1541290

Firefox Bug Husbandry Bot

Comment 72

•

5 years ago

See bug 1547409. Migrating webcompat priority whiteboard tags to project flags.

Webcompat Priority: --- → P1

Assignee

Updated

•

5 years ago

Priority: -- → P2

Target Milestone: --- → Future

Thomas Wisniewski [:twisniewski]

Updated

•

5 years ago

See Also: → https://webcompat.com/issues/31597

Comment 73

•

5 years ago

For testing reference, there are web platform tests for SpeechRecognition, though many aren't all appearing on wpt.fyi and it's unclear to me how good the coverage is. It appears that Blink has a few extra tests.

Assignee

Updated

•

5 years ago

Blocks: 1565102

Assignee

Updated

•

5 years ago

Blocks: 1565103

Assignee

Updated

•

5 years ago

No longer blocks: 1565103

Liz Henry (:lizzard) (relman/hg->git project)

Assignee

Updated

•

5 years ago

Blocks: 1565103

Updated

•

5 years ago

status-firefox70: --- → affected

tracking-firefox70: --- → +

Liz Henry (:lizzard) (relman/hg->git project)

Comment 74

•

5 years ago

Andre, is this still on track for 70?

And, can you suggest a release note (either for 70 or for whatever future release this ends up in)? Thanks!

Release Note Request (optional, but appreciated)
[Why is this notable]:
[Affects Firefox for Android]:
[Suggested wording]:
[Links (documentation, blog post, etc)]:

relnote-firefox: --- → ?

Flags: needinfo?(anatal)

Assignee

Updated

•

5 years ago

status-firefox70: affected → ---

Flags: needinfo?(anatal)

Priority: P2 → P1

Assignee

Updated

•

5 years ago

tracking-firefox70: + → ---

Pascal Chevrel:pascalc (PTO until April 26)

Assignee

Comment 75

•

5 years ago

(In reply to Liz Henry (:lizzard) from comment #74)

Andre, is this still on track for 70?

And, can you suggest a release note (either for 70 or for whatever future release this ends up in)? Thanks!

Release Note Request (optional, but appreciated)
[Why is this notable]:
[Affects Firefox for Android]:
[Suggested wording]:
[Links (documentation, blog post, etc)]:

Hi Liz, we moved it to 71. It's possible to track this there?

Thanks!

Comment 76

•

5 years ago

Updated to track our 71 release. André, will this need a mention in our release notes, a blog post or a mention on our Nightly twitter account to ask our core community to test it? Thanks

status-firefox70: --- → wontfix

status-firefox71: --- → affected

tracking-firefox71: --- → +

Flags: needinfo?(anatal)

Mike Taylor [:miketaylr]

Updated

•

5 years ago

Webcompat Priority: P1 → P2

Adam Stevenson [:adamopenweb]

Updated

•

5 years ago

Webcompat Priority: P2 → P1

Pascal Chevrel:pascalc (PTO until April 26)

Assignee

Updated

•

5 years ago

Blocks: 1588067

Nils Ohlmeier [:drno]

Comment 77

•

5 years ago

(In reply to Pascal Chevrel:pascalc from comment #76)

Updated to track our 71 release. André, will this need a mention in our release notes, a blog post or a mention on our Nightly twitter account to ask our core community to test it? Thanks

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

Comment 78

•

5 years ago

(In reply to Nils Ohlmeier [:drno] from comment #77)

(In reply to Pascal Chevrel:pascalc from comment #76)

Updated to track our 71 release. André, will this need a mention in our release notes, a blog post or a mention on our Nightly twitter account to ask our core community to test it? Thanks

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

We have release notes for the Nightly channel https://www.mozilla.org/en-US/firefox/71.0a1/releasenotes/

jason.oneil

Comment 79

•

5 years ago

(In reply to Nils Ohlmeier [:drno] from comment #77)

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

Are there any instructions for testing this in nightly? I'm a web developer who runs Firefox Nightly and has given several talks on the Web Speech API. Very keen to help with testing.

Assignee

Comment 80

•

5 years ago

Hi Pascal, yes, let's do it, but let's wait until we have the code fully merged to start this discussion. Currently the date of landing is still uncertain since the code wasn't fully reviewed yet.

Flags: needinfo?(anatal)

Assignee

Comment 81

•

5 years ago

(In reply to jason.oneil from comment #79)

(In reply to Nils Ohlmeier [:drno] from comment #77)

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

Are there any instructions for testing this in nightly? I'm a web developer who runs Firefox Nightly and has given several talks on the Web Speech API. Very keen to help with testing.

Hi Jason,

the API hasn't landed and isn't available in Nightly yet , but as soon there is, to enable it will be just matter of switching a couple of flags on (if at all)

Assignee

Updated

•

5 years ago

status-firefox70: wontfix → ---

status-firefox71: affected → ---

tracking-firefox71: + → ---

Assignee

Comment 82

•

5 years ago

We will land this in Nightly 72.

Natalia Csoregi [:nataliaCs]

Assignee

Updated

•

5 years ago

Keywords: checkin-needed

Pulsebot

Comment 83

•

5 years ago

Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e322e2112b1f
Introducing an online speech recognition service for Web Speech API r=smaug,pehrsons,padenot

Keywords: checkin-needed

Comment 84

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e322e2112b1f

Status: REOPENED → RESOLVED

Closed: 8 years ago → 5 years ago

status-firefox72: --- → fixed

Resolution: --- → FIXED

Natalia Csoregi [:nataliaCs]

Updated

•

5 years ago

Regressions: 1590368

Updated

•

5 years ago

Depends on: 1590652

Dietrich Ayala (:dietrich)

Updated

•

5 years ago

No longer depends on: 1590652

Comment 85

•

5 years ago

WOOHOOO!! Congratulations everyone! So excited to see this ship after filing this bug four years ago (whoa, time flies!).

Dietrich Ayala (:dietrich)

Comment 86

•

5 years ago

(er, duped, but filed a bunch of these originally, so seeing the notifications coming in!!)

Chris Mills [:cmills]

Updated

•

5 years ago

Keywords: dev-doc-needed

Updated

•

5 years ago

Depends on: 1596773

Updated

•

5 years ago

Depends on: 1596788

Updated

•

5 years ago

Depends on: 1596804

Updated

•

5 years ago

Depends on: 1596819

Updated

•

5 years ago

Depends on: 1597204

Raluca Popovici, Desktop QA

Updated

•

5 years ago

Depends on: 1597220

Updated

•

5 years ago

Depends on: 1597287

Updated

•

5 years ago

Depends on: 1597637

Raluca Popovici, Desktop QA

Updated

•

5 years ago

Depends on: 1597760

Updated

•

5 years ago

Depends on: 1597960

Julien Cristau [:jcristau]

Updated

•

5 years ago

Depends on: 1597978

Comment 87

•

5 years ago

This is default-off, so not ready for release notes.

relnote-firefox: ? → -

Ksenia Berezina [:ksenia]

Updated

•

4 years ago

See Also: → https://github.com/webcompat/web-bugs/issues/46579

Comment 88

•

4 years ago

Is the speech recognition code free open source software?

Why is the speech reccognition code not shipped with the browser?

brandmairstefan

Comment 89

•

4 years ago

(In reply to guest271314 from comment #88)

Is the speech recognition code free open source software?

Why is the speech reccognition code not shipped with the browser?

There is a related issue regarding shipping DeepSpeech with Firefox https://bugzilla.mozilla.org/show_bug.cgi?id=1474084 . The code for deep speech is, of course, open source. It can be found here https://github.com/mozilla/DeepSpeech

Comment 90

•

4 years ago

(In reply to brandmairstefan from comment #89)

(In reply to guest271314 from comment #88)

Is the speech recognition code free open source software?

Why is the speech reccognition code not shipped with the browser?

There is a related issue regarding shipping DeepSpeech with Firefox https://bugzilla.mozilla.org/show_bug.cgi?id=1474084 . The code for deep speech is, of course, open source. It can be found here https://github.com/mozilla/DeepSpeech

Tried to use SpeechRecognition at Nightly 73 - with necessary flags set - and the browser crashed.

Am trying to compare the results of various STT services. So far tested https://speech-to-text-demo.ng.bluemix.net/ and https://cloud.google.com/speech-to-text/, which provide different results or a 33 second audio file (WAV).

Is there a means to send an audio file to the Mozilla end point without using SpeechRecognition?

Ksenia Berezina [:ksenia]

Comment 91

•

4 years ago

SpeechRecognition appears to stop at a brief pause between sound given 33 seconds of audio.

audiostart 266094
start 266095
speechstart 266419

speechend 267451
audioend 267453 
end 268580

Updated

•

4 years ago

See Also: → https://github.com/webcompat/web-bugs/issues/49149

Viney Dhiman

Comment 92

•

4 years ago

Not working for me. I have updated to Firefox nightly v79 18/06/2020.
I have set both value to true, relaunch browser. After visiting the site https://speechnotes.co, unable to click on the MIcrophone. While I tried with another site called https://dictation.io/speech, here tool is not recognizing my voice.

Claudio Klemp

Comment 93

•

3 years ago

Greetings to everyone, I'm new here, but I've filed speechrecognition issues with Chromium / Android, they will solve this year, as they did answer.

I would like to give you an idear of a different approach, I experienced ...

If we take a whatever word let it be " house " and build a container around this object, something similair to a cell which has a DNA and organelles, we could achieve the following :

Steps:

1.) "house" + a collection of soundapi equalizer registrations which gets aquired and updates this dna chain
2.) pass the collection through a sound fingerprinting to compute common patterns (phillips technology Netherlands, but they do cost)
3.) place the fingerprinting patterns into the house dna chain / house object.

redo 1 to 3 on regular basis, let it be every 10 / 20 new equalizer registrations you do obtain by a user requesting to recognize his words.

Serverwise place the fingerprints into it's memory

Once voicerecognition is requested, preprocess the input voice as fingerprint
Get the fingerprint or fingerprints from the server memory
and return the word " house ", contained in the object.

Nobody will forbit you to combine or connect or link different objects into one, such as :

house, Haus (german), casa (italian), casa (portughese), maison (french)

and obtain this way also immediate translation possibilities, or same soundalike meaning words.

The way to build this thing as single objects and not as a database becomes necessary for performance issues, as a database becomes slower and slower the more data it contains. I know this from musik recognition engines used to monitor radio stations and televisions and report the author rights to be settled. (A little like Shazan)

The other reason is, that you do not need to search for a pattern if the pattern is already named as it's result, so it becomes nothing else than a lookup of a file, which is much faster than a database lookup.

An example of a data container :

describing the content (which could be the soundapi equalizer recordings and the fingerprintings)

data-styled="{"x":161,"y":403,"width":"79vw","height":"49vh","top":"47vh","right":"91vw","bottom":"96vh","left":"12vw","font-family":"Times","font-weight":"400","font-size":"6.884480746791131vmin","line-height":"normal","text-align":"start","objHref":"","dna-mouse-up":"exchangeForm,","dna-enter":"secondary,","objCode":"1603813965238","objTitle":"Museo Gregoriano Egizio","objUrl":"","objData":"","objMime":"image/jpeg","objText":"","objScreenX":"","objScreenY":"","objScreenZ":"","objName":"","objAddr":"","objEmail":"","objWhatsapp":"","objSkype":"","objPhone":"","objWWW":"","objBuild":"Tue, 27 Oct 2020 15:52:45 GMT","objExpire":"","objGroup":"","objOwner":"","objCmd":"pasted image","objUpdate":"Fri, 01 Jan 2021 15:28:20 GMT","objParent":"1603813859288","name":"","org-height":"734","org-width":"981","org-size":"","compression":"0.6","res-width":"981","res-height":"734","res-size":"154787","scale-factor":"1","play-dna":"","objTime":16856080079020.5,"objStart":1609514900047,"background-color":"Tomato"}"

and allowing the data with :

"dna-mouse-up":"exchangeForm,",
"dna-enter":"secondary,",

to execute commands / programs based here based upon mouseup or screenarea entering.
(I took it from a website I've build)

or with such a set of istructions

data-styled="{"clickdog29021":"50,50,50,50,2000,5","clickdog41510":"89.47368421052632,39.39393939393939,7.236842105263158,39.928698752228165,412,3","clickdog51838":"7.894736842105263,40.28520499108734,75.6578947368421,40.106951871657756,352,-3","clickdog59010":"12.5,45.27629233511586,73.02631578947368,7.8431372549019605,468,9","clickdog65855":"73.35526315789474,16.22103386809269,17.105263157894736,47.41532976827095,354,-9","clickdog77738":"50,50,50,50,2000,6"}"

the numbers are unique processing id, starting x,y ending x,y,(vh/vw) , time used (ms) and acting the way
to run a hand to great (5), then show the visitor how to swipe to change page forwards and backwards (3, -3), last and first page(9,-9) then finally greet for a good bye (6).

clicking would be 1, long click 2, and very long click 4 (used to issue different commands available on the screen)

The advantage of such a thing is, that each single datacontainer will continue to learn and process, and upon the real world usage of those datacontainers creating a major " understanding " (datacollection) of the word " house ", those containers less used or unused will timeout ("objExpire":"") or /and ("objUpdate":"Fri, 01 Jan 2021 15:28:20 GMT") and you have something as a natural selection or evolution of containers adapting themself to the millions of speakers using the system to dictate their hopefully intelligent words.

I hope I could explain the idea or technique the way that it became understandable.

At your disposal
(I do not place a link to above shows, I would consider as publicity, but the thing is also on the web and works)
Claudio Klemp

Reporter

Comment 94

•

3 years ago

Hi Claudio and Happy New Year! Please note that Bugzilla is meant to track the implementation of specific feature requests or bug fixes. In this case it is the SpeechRecognition Interface of the Web Speech API and this bug is already closed.

More general discussions should happen at https://discourse.mozilla.org/, which might then end up as specific requests here.

Sebastian

Comment 95

•

3 years ago

(In reply to André Natal from comment #81)

(In reply to jason.oneil from comment #79)

(In reply to Nils Ohlmeier [:drno] from comment #77)

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

Are there any instructions for testing this in nightly? I'm a web developer who runs Firefox Nightly and has given several talks on the Web Speech API. Very keen to help with testing.

Hi Jason,

the API hasn't landed and isn't available in Nightly yet , but as soon there is, to enable it will be just matter of switching a couple of flags on (if at all)

Which "couple of flags"? I observed the devtools.chrome.enabled = true and media.webspeech.recognition.enable = true from this discussion and set those accordingly. I am testing on https://mdn.github.io/web-speech-api/speech-color-changer/index.html. But page does not load properly due to breakage on webkitSpeechRecognition is not defined error. Testing on Nightly 86.0a1 (2021-01-03).

Flags: needinfo?(jason.oneil)

Comment 96

•

3 years ago

Additional info: Using Macos version 10.13.6

Comment 97

•

3 years ago

#95 The last time I tested this was the page I used for information re speech recognition https://wiki.mozilla.org/Web_Speech_API_-_Speech_Recognition#How_can_I_test_with_Deep_Speech.3F, see https://bugzilla.mozilla.org/show_bug.cgi?id=1604994.

Comment 98

•

3 years ago

(In reply to guest271314 from comment #97)

#95 The last time I tested this was the page I used for information re speech recognition https://wiki.mozilla.org/Web_Speech_API_-_Speech_Recognition#How_can_I_test_with_Deep_Speech.3F, see https://bugzilla.mozilla.org/show_bug.cgi?id=1604994.

I get "Voice input isn't supported on this browser".

When running a simple script, window.SpeechRecognition is undefined.

Comment 99

•

3 years ago

You'll need to enable both media.webspeech.recognition.enable and media.webspeech.recognition.force_enable to test out the API in its current state.

Comment 100

•

3 years ago

(In reply to Andreas Pehrson [:pehrsons] from comment #99)

You'll need to enable both media.webspeech.recognition.enable and media.webspeech.recognition.force_enable to test out the API in its current state.

Thank you, that works. media.webspeech.recognition.force_enable wasn't mentioned in this discussion before.

Am I to understand that setting media.webspeech.service.endpoint to https://dev.speaktome.nonprod.cloudops.mozgcp.net/ will cause SpeechRecognition to use Deepspeech instead of the OS native service?

Comment 101

•

3 years ago

to use Deepspeech instead of the OS native service?

What is meant by "OS native service"?

Which OS are you running?

AFAIK On Linux there is no "native OS service" for speech recognition.

Comment 102

•

3 years ago

(In reply to JulianHofstadter from comment #100)

(In reply to Andreas Pehrson [:pehrsons] from comment #99)

You'll need to enable both media.webspeech.recognition.enable and media.webspeech.recognition.force_enable to test out the API in its current state.

Thank you, that works. media.webspeech.recognition.force_enable wasn't mentioned in this discussion before.

Am I to understand that setting media.webspeech.service.endpoint to https://dev.speaktome.nonprod.cloudops.mozgcp.net/ will cause SpeechRecognition to use Deepspeech instead of the OS native service?

See this wiki page as mentioned in comment 97. It lists the default destination, and further down how to switch to deep speech.

[1]https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition/interimResults

Comment 103

•

3 years ago

In trying to understand the docs, I understand the interimResults[1] property to mean that results will be returned at intervals while speaking, thus triggering a result event more than one time before speech recognition ends. However in practice I am observing that setting interimResults to true has no effect, and only a single result is returned at the end of recognition.

Am I interpreting this incorrectly, and if so, can someone tell me what practical difference I should see between setting this to true and false?

Comment 104

•

3 years ago

This implementation is far from the spec, but at least basic recognition should work. I don't think it allows much of configuring at all, but if you want an authoritative answer, read the source.

That said, this bug is not the place to keep discussions, so this will be my last comment. Feel free to continue on Matrix.