Closed Bug 1248897 Opened 8 years ago Closed 5 years ago

Expose SpeechRecognition to the web

Categories

(Core :: Web Speech, enhancement, P1)

enhancement

Tracking


RESOLVED FIXED
Future
Webcompat Priority P1
Tracking Status
relnote-firefox --- -
firefox72 --- fixed

People

(Reporter: sebo, Assigned: anatal, NeedInfo)

References

(Depends on 9 open bugs, Blocks 4 open bugs)

Details

(Keywords: dev-doc-needed, feature, Whiteboard: [webcompat:p1])

Attachments

(1 file, 6 obsolete files)

The SpeechRecognition API is currently only available in chrome context (at least on desktop Firefox).

It should also be made available within the website context.

This will require some kind of UI to control the permissions to access the microphone.
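
For illustration, here is a minimal sketch of what a page could do once the interface is exposed to content. It assumes only the standard Web Speech API surface (SpeechRecognition, onresult, onerror, start()), not any Firefox-specific behavior:

// Minimal sketch of content-side usage, assuming the standard
// Web Speech API surface; nothing here is specific to Gecko internals.
const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.interimResults = false;
recognition.maxAlternatives = 1;

recognition.onresult = (event) => {
  // The first alternative of the first result holds the transcript.
  const alternative = event.results[0][0];
  console.log('Heard:', alternative.transcript, 'confidence:', alternative.confidence);
};

recognition.onerror = (event) => {
  console.error('Recognition error:', event.error);
};

// Starting recognition is the point where the microphone permission UI
// mentioned above would need to kick in.
recognition.start();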

Sebastian
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Is this on anyone's radar? Just ran into a website saying that it "requires Google Chrome Browser". :( I emailed the developers, they say they need the webspeech API.
Flags: needinfo?(sebastianzartner)
Flags: needinfo?(overholt)
Flags: needinfo?(dietrich)
I'm just the reporter, not an implementor.

Sebastian
Flags: needinfo?(sebastianzartner)
Flags: needinfo?(anatal)
Flags: needinfo?(kdavis)
Both Andre and myself implemented this originally; however, both of us are now in Connected Devices and consumed with preparation for London. So, at the earliest we could take a look after London. But even then it will be hard for us to dedicate lots of time to desktop.
Flags: needinfo?(kdavis)
Flags: needinfo?(anatal)
Let me see what I can do about prioritizing this on the platform side (I was going to suggest Andre/Kelly, too).
Flags: needinfo?(overholt)
Flags: needinfo?(dietrich)
The demo https://mdn.github.io/web-speech-api/ is useless without the api available...
The documentation at https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API and its example implementations do not work in Firefox yet, which is poorly documented in the section https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API#Browser_compatibility, IMHO; "Firefox Desktop" there refers to this browser, right?

The demos work in Google Chrome (un)fortunately.
Actually, there is no backend implementation for the Recognition API except for Gonk.
(In reply to Makoto Kato [:m_kato] from comment #10)
> Actually, there is no backend implementation for the Recognition API except
> for Gonk.

Bug 1244237 comment 0 claimed something else. That's why I created this bug.
But if there's really no backend yet, you should create a bug for it blocking this one, so this feature can finally be tackled.

Sebastian
I am willing to work on this. Can somebody guide me on what exactly is impeding the implementation of the recognition back end? I was thinking we could use the speech recognition API provided by Windows (not a platform-independent solution, I know). It would be great to get it working at least somewhere, and I am working on the Windows version.
@abhishek: the way I understand it, you can test this using Firefox desktop in chrome context, so use JS or create a plugin. We 'only' need to expose this to the website context. See 'User story' above.

From https://developer.mozilla.org/en-US/Add-ons/Setting_up_extension_development_environment

> devtools.chrome.enabled = true. This enables to run JavaScript code snippets in the chrome context of the Scratchpad from the Tools menu. Don't forget to switch from content to browser as context.

Does this help you?
(In reply to Clemens Tolboom from comment #13)
> @abhishek: the way I understand it, you can test this using Firefox desktop
> in chrome context, so use JS or create a plugin. We 'only' need to expose
> this to the website context. See 'User story' above.
> 
> From
> https://developer.mozilla.org/en-US/Add-ons/
> Setting_up_extension_development_environment
> 
> > devtools.chrome.enabled = true. This enables to run JavaScript code snippets in the chrome context of the Scratchpad from the Tools menu. Don't forget to switch from content to browser as context.
> 
> Does this help you?

I still get the error "Exception: InvalidStateError: An attempt was made to use an object that is not, or is no longer, usable". I enabled devtools.chrome and webspeech.recognition, still can't get it to work. Can you verify that you've got speech input to work on desktop firefox somehow?
@abhishek what script are you running in Web developer > Scratchpad ?

Checking with https://github.com/mdn/web-speech-api/blob/master/speech-color-changer/script.js, running lines 1-9 plus 'recognition;' as line 10 from Scratchpad on the about:config page shows me the object SpeechRecognition __proto__: SpeechRecognitionPrototype when inspecting 'recognition;'.

Remark #2 on https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API#Browser_compatibility is about this issue :)

Please place your code on https://gist.github.com/ for others to help
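
For reference, a sketch approximating lines 1-10 of that script (the exact upstream source may differ slightly; the color list here is shortened):

// Approximation of the opening lines of the MDN speech-color-changer script.
var SpeechRecognition = SpeechRecognition || webkitSpeechRecognition;
var SpeechGrammarList = SpeechGrammarList || webkitSpeechGrammarList;

var colors = ['aqua', 'azure', 'beige', 'black', 'blue', 'brown'];
var grammar = '#JSGF V1.0; grammar colors; public <color> = ' + colors.join(' | ') + ' ;';

var recognition = new SpeechRecognition();
var speechRecognitionList = new SpeechGrammarList();
speechRecognitionList.addFromString(grammar, 1);
recognition.grammars = speechRecognitionList;

recognition; // "line 10": inspect the resulting object in Scratchpad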
(In reply to Clemens Tolboom from comment #15)
> @abhishek what script are you running in Web developer > Scratchpad ?
> 
> Checking with
> https://github.com/mdn/web-speech-api/blob/master/speech-color-changer/
> script.js, running lines 1-9 plus 'recognition;' as line 10 from Scratchpad
> on the about:config page shows me the object SpeechRecognition __proto__:
> SpeechRecognitionPrototype when inspecting 'recognition;'.
> 
> Remark #2 on
> https://developer.mozilla.org/en-US/docs/Web/API/
> Web_Speech_API#Browser_compatibility is about this issue :)
> 
> Please place your code on https://gist.github.com/ for others to help

I am trying that exact same script. Here's the gist: https://gist.github.com/Abhishek8394/dbf9338a9baf6ca05639929e0b72404d
Also, regarding remark #2 as mentioned: it says that recognition hasn't been implemented yet and asks to enable an option?
Whiteboard: [webcompat]
One needs to build Firefox with the flags turned on to include the only current implementation available (pocketsphinx + English). If you guys are willing to move forward with it, I can help you set it up and have it running.
Flags: webcompat?
Assignee: nobody → anatal
Depends on: 1392065
FWIW, Duolingo's support team is advising inquiring users (necessarily) to use Chrome:

---
Speaking exercises on the new website only work on browsers that support the Web Speech API, the most popular being Chrome. Unfortunately, the old speech system was outdated and a burden on our system. If you use the Chrome browser, you will get speaking exercises again.

If you are using Google Chrome and experiencing issues, users have found that updating or reinstalling the browser has resolved their issues. If you are still having issues, please reply to this email. :)

If you are using a browser without Web Speech API supported, you should not see any speaking exercises. If you do see speaking exercises, please know that your microphone will unfortunately not work. We are aware that some users may still be seeing these exercises and we're working hard to ensure that this is resolved quickly.
---

Duolingo has over 200M users.

I imagine that English-only speech reco wouldn't cover Duolingo's main customer bases.
:-( Anyone know the list of languages Duolingo supports?
Spanish, French, German, Italian, Portuguese, Dutch, Irish, Danish, Swedish, High Valyrian, Russian, Swahili, Polish, Romanian, Greek, Esperanto, Turkish, Vietnamese, Hebrew, Norwegian, Ukrainian, Hungarian, Welsh, Czech
Time to internationalize + localize Common Voice!
I am not using that feature on Duolingo because the recognition was previously using Flash (and wasn't that good).
In any case, this is another example that reminds us it is important to have this API available.
We, the machine learning group at Mozilla, have just recently gotten the quality of our speech recognition engine[1] to be on-par with commercial systems.

Currently we have enough American English training data to create an American English model. We don't have enough data to train other accents or languages, thus the need to internationalize + localize Common Voice.

We plan, by the end of Q2 in 2018, to implement the WebSpeech API backed by our speech recognition engine and integrate, at a minimum, the American English model by then. Other languages should follow in the second half of 2018. (Which languages depends on C-Level decisions, the internationalization + localization + success of Common Voice, and Duolingo-like issues, which would guide our language choice.)

[1] https://github.com/mozilla/deepspeech
I hope that this API will be available before Q2 of course :-)
PR's welcome! :-)
Blocks: 1409526
Let's try to do something :)
Assignee: anatal → lissyx+mozillians
Assignee: lissyx+mozillians → anatal
See Also: → 1423867
No longer blocks: 973754
Blocks: 1456885
Flags: webcompat? → webcompat+
Whiteboard: [webcompat] → [webcompat:p2]
Comment on attachment 8984315 [details] [diff] [review]
Introducing an online speech recognition service to enable Web Speech API

Please use Mozilla coding style. 2 spaces for indentation, mFoo for member variable naming, 
aBar for argument names etc.

>+class DecodeResultTask final : public Runnable
>+{
>+public:
>+  DecodeResultTask(bool succeded,
>+                   const nsString& hypstring,
>+                   float confidence,
>+                   WeakPtr<dom::SpeechRecognition> recognition,
>+                   const nsAutoString& errormessage)
>+      : mozilla::Runnable("DecodeResultTask"),
>+        mSucceeded(succeded),
>+        mResult(hypstring),
>+        mConfidence(confidence),
>+        mRecognition(recognition),
>+        mErrorMessage(errormessage)
>+  {
>+    MOZ_ASSERT(
>+      NS_IsMainThread()); // This should be running on the main thread
>+  }
>+
>+  NS_IMETHOD
>+  Run() override
>+  {
>+    MOZ_ASSERT(NS_IsMainThread()); // This method is supposed to run on the main
>+                                   // thread!
>+
>+    if (!mSucceeded) {
>+      mRecognition->DispatchError(SpeechRecognition::EVENT_RECOGNITIONSERVICE_ERROR,
>+                                  SpeechRecognitionErrorCode::Network, // TODO different codes?
>+                                  mErrorMessage);
>+
>+    } else {
>+      // Declare javascript result events
>+      RefPtr<SpeechEvent> event = new SpeechEvent(
>+        mRecognition, SpeechRecognition::EVENT_RECOGNITIONSERVICE_FINAL_RESULT);
>+      SpeechRecognitionResultList* resultList =
>+        new SpeechRecognitionResultList(mRecognition);
>+      SpeechRecognitionResult* result = new SpeechRecognitionResult(mRecognition);
>+
>+      if (0 < mRecognition->MaxAlternatives()) {
>+        SpeechRecognitionAlternative* alternative =
>+          new SpeechRecognitionAlternative(mRecognition);
>+
>+        alternative->mTranscript = mResult;
>+        alternative->mConfidence = mConfidence;
>+
>+        result->mItems.AppendElement(alternative);
>+      }
>+      resultList->mItems.AppendElement(result);
>+
>+      event->mRecognitionResultList = resultList;
>+      NS_DispatchToMainThread(event);
>+    }
>+    return NS_OK;
>+ }
>+
>+private:
>+  bool mSucceeded;
>+  nsString mResult;
>+  float mConfidence;
>+  WeakPtr<dom::SpeechRecognition> mRecognition;
>+  nsCOMPtr<nsIThread> mWorkerThread;
>+  nsAutoString mErrorMessage;
>+};
>+
>+NS_IMPL_ISUPPORTS(OnlineSpeechRecognitionService,
>+                  nsISpeechRecognitionService, nsIObserver, nsIStreamListener)
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStartRequest(nsIRequest* aRequest,
>+                            nsISupports* aContext)
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
>+  return NS_OK;
>+}
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnDataAvailable(nsIRequest* aRequest,
>+                             nsISupports* aContext,
>+                             nsIInputStream* aInputStream,
>+                             uint64_t aOffset,
>+                             uint32_t aCount)
>+{
>+  nsresult rv;
>+  this->mBuf = new char[aCount];
>+  uint32_t _retval;
>+  rv = aInputStream->ReadSegments(NS_CopySegmentToBuffer, this->mBuf, aCount, &_retval);
>+  NS_ENSURE_SUCCESS(rv, rv);
>+  return NS_OK;
>+}
>+
>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStopRequest(nsIRequest* aRequest,
>+                           nsISupports* aContext,
>+                           nsresult aStatusCode)
>+{
>+  bool success;
>+  nsresult rv;
>+  float confidence;
>+  confidence = 0;
>+  Json::Value root;
>+  Json::Reader reader;
>+  bool parsingSuccessful;
>+  nsAutoCString result;
>+  nsAutoCString hypoValue;
>+  nsAutoString errorMsg;
>+
>+  SR_LOG("STT Result: %s", this->mBuf);
>+
>+  if (NS_FAILED(aStatusCode)) {
>+    success = false;
>+    errorMsg.Assign(NS_LITERAL_STRING("CONNECTION_ERROR"));
>+  } else {
>+    success = true;
>+    parsingSuccessful = reader.parse(this->mBuf, root);
>+    if (!parsingSuccessful) {
>+      errorMsg.Assign(NS_LITERAL_STRING("RECOGNITIONSERVICE_ERROR"));
>+      success = false;
>+    } else {
>+      result.Assign(root.get("status","error").asString().c_str());
>+      if (result.EqualsLiteral("ok")) {
>+        // ok, we have a result
>+        hypoValue.Assign(root["data"][0].get("text","").asString().c_str());
>+        confidence = root["data"][0].get("confidence","0").asFloat();
>+      } else {
>+        // there's an internal server error
>+        errorMsg.Assign(NS_LITERAL_STRING("NO_HYPOTHESIS"));
>+        success = false;
>+      }
>+    }
>+  }
>+
>+  RefPtr<Runnable> resultrunnable =
>+    new DecodeResultTask(success, NS_ConvertUTF8toUTF16(hypoValue), confidence,
>+                         mRecognition, errorMsg);
>+  rv = NS_DispatchToMainThread(resultrunnable);
>+
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
>+
>+  return NS_OK;
>+}
>+
>+OnlineSpeechRecognitionService::OnlineSpeechRecognitionService()
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
What is this? mBuf is uninitialized in the constructor

>+  audioEncoder = nullptr;
>+  ISDecoderCreated = true;
>+  ISGrammarCompiled = true;
Please use normal C++ member variable initialization,
OnlineSpeechRecognitionService::OnlineSpeechRecognitionService()
  : mBuf(nullptr)
...

looks like ISDecoderCreated isn't ever used, nor ISGrammarCompiled

>+OnlineSpeechRecognitionService::~OnlineSpeechRecognitionService()
>+{
>+  if (this->mBuf)
>+    this->mBuf = nullptr;
again, rather mysterious nullcheck. And you leak mBuf here.

>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::ProcessAudioSegment(
>+  AudioSegment* aAudioSegment, int32_t aSampleRate)
>+{
On which thread does this method run?
Encoding may take time so it should not run on the main thread.
So, 
MOZ_ASSERT(!NS_IsMainThread());

>+OnlineSpeechRecognitionService::Observe(nsISupports* aSubject,
>+                                              const char* aTopic,
>+                                              const char16_t* aData)
align params

>+class OnlineSpeechRecognitionService : public nsISpeechRecognitionService,
>+                                             public nsIObserver,
>+                                             public nsIStreamListener
align inherited classes
Attachment #8984315 - Flags: review?(bugs) → review-
Hi Olli,

Here's a version with your comments already addressed. If you want to test it, apply the patch, flip the prefs, and head here: https://andrenatal.github.io/webspeechapi/index_auto.html. 

Currently the tests we have cover the fake recognition service and the state machine, so I'm working to make them work with this online service too.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b9f223217bde85215e8f99416cae66098ff13fc

Thanks

Andre
Attachment #8984315 - Attachment is obsolete: true
Attachment #8986675 - Flags: review?(bugs)
Comment on attachment 8986675 [details] [diff] [review]
Bug-1248897-Introducing an online speech recognition service for web speech api

>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnStartRequest(nsIRequest *aRequest,
>+                                               nsISupports *aContext)
>+{
>+  this->mBuf = nullptr;
Why this?
If mBuf points to some value, you leak it here.
No need for this-> here nor elsewhere


>+NS_IMETHODIMP
>+OnlineSpeechRecognitionService::OnDataAvailable(nsIRequest *aRequest,
>+                                                nsISupports *aContext,
>+                                                nsIInputStream *aInputStream,
Per Mozilla coding style * goes with the type.
nsIRequest* aRequest etc. Here and elsewhere.


>+                                                uint64_t aOffset,
>+                                                uint32_t aCount)
>+{
>+  nsresult rv;
>+  this->mBuf = new char[aCount];
>+  uint32_t _retval;
retVal
But it isn't retVal, but read count, so perhaps readCount ?


>+OnlineSpeechRecognitionService::OnStopRequest(nsIRequest *aRequest,
>+                                              nsISupports *aContext,
>+                                              nsresult aStatusCode)
>+{
>+  bool success;
>+  float confidence = 0;
>+  Json::Value root;
>+  Json::Reader reader;
>+  bool parsingSuccessful;
>+  nsAutoCString result;
>+  nsAutoCString hypoValue;
>+  nsAutoString errorMsg;
>+  SR_LOG("STT Result: %s", this->mBuf);
>+
>+  if (NS_FAILED(aStatusCode)) {
>+    success = false;
>+    errorMsg.Assign(NS_LITERAL_STRING("CONNECTION_ERROR"));
the spec doesn't define "CONNECTION_ERROR"

>+  } else {
>+    success = true;
>+    parsingSuccessful = reader.parse(this->mBuf, root);
>+    if (!parsingSuccessful) {
>+      errorMsg.Assign(NS_LITERAL_STRING("RECOGNITIONSERVICE_ERROR"));
The spec doesn't define "RECOGNITIONSERVICE_ERROR"

>+      success = false;
>+    } else {
>+      result.Assign(root.get("status","error").asString().c_str());
missing space after ,

>+      if (result.EqualsLiteral("ok")) {
>+        // ok, we have a result
>+        hypoValue.Assign(root["data"][0].get("text","").asString().c_str());
>+        confidence = root["data"][0].get("confidence","0").asFloat();
>+      } else {
>+        // there's an internal server error
>+        errorMsg.Assign(NS_LITERAL_STRING("NO_HYPOTHESIS"));
I don't see "NO_HYPOTHESIS" in the spec


>+OnlineSpeechRecognitionService::~OnlineSpeechRecognitionService()
>+{
>+  this->mBuf = nullptr;
So you leak mBuf

>+  this->mAudioEncoder = nullptr;
>+  this->mWriter = nullptr;
No need to set these to null in destructor.



>   nsresult rv;
>   rv = mRecognitionService->Initialize(this);
>   if (NS_WARN_IF(NS_FAILED(rv))) {
>     return;
>   }
> 
>+  SpeechRecognition::SetIdle(false);
>+
>+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
>+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"), 0,
>+                                           getter_AddRefs(this->mEncodeThread));
fix indentation


(this will need couple of iterations.)
Attachment #8986675 - Flags: review?(bugs) → review-
(In reply to kdavis from comment #23)
> We, the machine learning group at Mozilla, have just recently gotten the
> quality of our speech recognition engine[1] to be on-par with commercial
> systems.
> 
> Currently we have enough American English training data to create an
> American English model. We don't have enough data to train other accents or
> languages, thus the need to internationalize + localize Common Voice.
> 
> We plan, by the end of Q2 in 2018, to implement the WebSpeech API backed by
> our speech recognition engine and integrate, at a minimum, the American
> English model by then. Other languages should follow in the second half of
> 2018. (Which languages depends on C-Level decisions, the
> internationalization + localization + success of Common Voice, and
> Duolingo-like issues, which would guide our language choice.)
> 
> [1] https://github.com/mozilla/deepspeech

So it's 2018 Q2, any updates on deepspeech implementation?
(In reply to Tim Langhorst from comment #32)
> So it's 2018 Q2, any updates on deepspeech implementation?

You may have missed it but this bug is assigned and actively being worked on. Though I'm just the reporter, not the implementer, so I have no idea how long it will take until it's finished. Regarding Olli Pettay's last comment saying "this will need couple of iterations", I assume it will still take a while until the patch is ready to land in Firefox.

Sebastian
(In reply to Sebastian Zartner [:sebo] from comment #33)
> You may have missed it but this bug is assigned and actively being worked
> on. Though I'm just the reporter, not the implementer, so I have no idea how
> long it will take until it's finished. Regarding Olli Pettay's last comment
> saying "this will need couple of iterations", I assume it will still take a
> while until the patch is ready to land in Firefox.

Yeah but this is ONLINE speech recognition and not the OFFLINE one via deepspeech that I'm interested in. That's why I've replied to kdavis
(In reply to Tim Langhorst from comment #34)
> (In reply to Sebastian Zartner [:sebo] from comment #33)
> > You may have missed it but this bug is assigned and actively being worked
> > on. Though I'm just the reporter, not the implementer, so I have no idea how
> > long it will take until it's finished. Regarding Olli Pettay's last comment
> > saying "this will need couple of iterations", I assume it will still take a
> > while until the patch is ready to land in Firefox.
> 
> Yeah but this is ONLINE speech recognition and not the OFFLINE one via
> deepspeech that I'm interested in. That's why I've replied to kdavis

Good point. Asking him for info about that. Also, if offline speech recognition is still pursued, should a separate bug be created for it?

Sebastian
Flags: needinfo?(kdavis)
Keywords: feature
You can open a separate bug for backing the WebSpeech API with offline STT.

If you do so, can you list the required platforms, runtime memory requirements, maximal size the library can be, and maximal size the model can be in the bug. Thanks.
Flags: needinfo?(kdavis)
Blocks: 1474124
(In reply to kdavis from comment #36)
> You can open a separate bug for backing the WebSpeech API with offline STT.
> 
> If you do so, can you list the required platforms, runtime memory
> requirements, maximal size the library can be, and maximal size the model
> can be in the bug. Thanks.

As I wrote before, I am just the reporter, not an implementer, therefore I can't decide on those questions. Nonetheless I have created bug 1474124 for the offline API. Anyone able to answer those questions should do that there.

Sebastian
I have already opened bug 1474084
Blocks: 1474084
Hi Olli, please see a new version with your comments addressed.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=60c87fdc8f392c5178161474c067dd2cb7897ee5
Attachment #8986675 - Attachment is obsolete: true
Attachment #8990848 - Flags: review?(bugs)
Attachment #8990848 - Attachment is obsolete: true
Attachment #8990848 - Flags: review?(bugs)
Please consider this version.
Attachment #8991057 - Flags: review?(bugs)
Marking this as affecting 63 just to indicate that this is being actively worked on in Nightly in my list of features.
Whiteboard: [webcompat:p2] → [webcompat:p1]
Comment on attachment 8991057 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

>+
>+  /** Audio data */
>+  nsTArray<uint8_t> mAudioVector;
>+
>+  RefPtr<AudioTrackEncoder> mAudioEncoder;
>+  UniquePtr<ContainerWriter> mWriter;
>+  char* mBuf;
Why do you use char* and not, for example, nsCString?
>+SpeechRecognition::~SpeechRecognition()
>+{
>+  if (this->mEncodeThread) {
>+    this->mEncodeThread->Shutdown();
>+  }
>+  this->mDocument = nullptr;
Setting nsCOMPtr or RefPtr member variables to null in destructor is useless.
nsCOMPtr's and RefPtr's destructors will do that automatically



>+bool
>+SpeechRecognition::IsIdle()
>+{
>+  return kIDLE;
>+}
>+
>+void
>+SpeechRecognition::SetIdle(bool aIdle)
>+{
>+  SR_LOG("Setting idle");
>+  kIDLE = aIdle;
>+}
It is totally unclear what kIDLE means. And the static variable is wrongly named. k-prefix is for constants, and kIDLE clearly isn't a constant.




>+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
>+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"),
>+                          0,
>+                          getter_AddRefs(this->mEncodeThread));
>+
Given all the memshrink efforts because of Fission (see dev.platform), need to ensure we kill the thread rather soon once it isn't needed anymore.
So, probably way before destructor of SpeechRecognition.


> 
>+  static bool IsIdle();
>+  static void SetIdle(bool aIdle);
These need some documentation


>+
>+  nsCOMPtr<nsIDocument> mDocument;
mDocument isn't cycle collected, so this will leak.
Do you need mDocument? SpeechRecognition is an DOMEventTargetHelper object, so one can always get the Window object and from that get the extant document.
 


>     CXXFLAGS += ['-Wno-error=shadow']
>diff --git a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
>index 5f8e6181fb32..69fcfdf44690 100644
>--- a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
Why the changes to FakeSpeechRecognitionService?

SpeechRecognitionService.h

>+++ b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.h
>@@ -15,17 +15,17 @@
>   {0x48c345e7, 0x9929, 0x4f9a, {0xa5, 0x63, 0xf4, 0x78, 0x22, 0x2d, 0xab, 0xcd}};
> 
> namespace mozilla {
> 
> class FakeSpeechRecognitionService : public nsISpeechRecognitionService,
>                                      public nsIObserver
> {
> public:
>-  NS_DECL_ISUPPORTS
>+  NS_DECL_THREADSAFE_ISUPPORTS
Hmm, why this change?
Attachment #8991057 - Flags: review?(bugs) → review-
(In reply to Olli Pettay [:smaug] (vacation Jul 15->) from comment #42)
> Comment on attachment 8991057 [details] [diff] [review]
> 0001-Bug-1248897-Introducing-an-online-speech-recognition.patch
> 
> >+
> >+  /** Audio data */
> >+  nsTArray<uint8_t> mAudioVector;
> >+
> >+  RefPtr<AudioTrackEncoder> mAudioEncoder;
> >+  UniquePtr<ContainerWriter> mWriter;
> >+  char* mBuf;
> Why do you use char* and not, for example, nsCString?


I changed to use an nsCString but needed to add a new function [1] to consume the data from the input stream, since NS_CopySegmentToBuffer expects a char*.


[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9563eabe4915e08b28c1dcd70b998ee5R59


> >+SpeechRecognition::~SpeechRecognition()
> >+{
> >+  if (this->mEncodeThread) {
> >+    this->mEncodeThread->Shutdown();
> >+  }
> >+  this->mDocument = nullptr;
> Setting nsCOMPtr or RefPtr member variables to null in destructor is useless.
> nsCOMPtr's and RefPtr's destructors will do that automatically
> 
> 

Fixed.

> 
> >+bool
> >+SpeechRecognition::IsIdle()
> >+{
> >+  return kIDLE;
> >+}
> >+
> >+void
> >+SpeechRecognition::SetIdle(bool aIdle)
> >+{
> >+  SR_LOG("Setting idle");
> >+  kIDLE = aIdle;
> >+}
> It is totally unclear what kIDLE means. And the static variable is wrongly
> named. k-prefix is for constants, and kIDLE clearly isn't a constant.
> 
> 

Ok, fixed the name to sIdle and added some comments explaining its purpose

> 
> 
> >+  nsCOMPtr<nsIThreadManager> tm = do_GetService(NS_THREADMANAGER_CONTRACTID);
> >+  rv = tm->NewNamedThread(NS_LITERAL_CSTRING("WebSpeechEncoderThread"),
> >+                          0,
> >+                          getter_AddRefs(this->mEncodeThread));
> >+
> Given all the memshrink efforts because of Fission (see dev.platform), need
> to ensure we kill the thread rather soon once it isn't needed anymore.
> So, probably way before destructor of SpeechRecognition.
> 
> 

Yes, sure. So we are now shutting down the thread on StopRecording[1] and AbortSilently[2]

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beR567
[2] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beR563


> > 
> >+  static bool IsIdle();
> >+  static void SetIdle(bool aIdle);
> These need some documentation
> 
> 

Done [1]

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-2187e63217f1dd33be599ab9dfd73879R102

> >+
> >+  nsCOMPtr<nsIDocument> mDocument;
> mDocument isn't cycle collected, so this will leak.
> Do you need mDocument? SpeechRecognition is an DOMEventTargetHelper object,
> so one can always get the Window object and from that get the extant
> document.
>  
> 

I was using the document to pass it as the principal when creating the channel to be used in the HTTP request[1]. But I figured that I could just use `nsContentUtils::GetSystemPrincipal()`.
So I reverted all the changes I was doing with the document and am not exposing it as a member anymore. Thanks for pointing that out.

https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9563eabe4915e08b28c1dcd70b998ee5R277

> 
> >     CXXFLAGS += ['-Wno-error=shadow']
> >diff --git a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
> >index 5f8e6181fb32..69fcfdf44690 100644
> >--- a/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.cpp
> Why the changes to FakeSpeechRecognitionService?
> 
> SpeechRecognitionService.h
> 

If we don't add SpeechRecognition::SetIdle(true) here[1], some existing tests will break.

[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-9a8ef72f8c41148099a7df7aafd4aed9R52


> >+++ b/dom/media/webspeech/recognition/test/FakeSpeechRecognitionService.h
> >@@ -15,17 +15,17 @@
> >   {0x48c345e7, 0x9929, 0x4f9a, {0xa5, 0x63, 0xf4, 0x78, 0x22, 0x2d, 0xab, 0xcd}};
> > 
> > namespace mozilla {
> > 
> > class FakeSpeechRecognitionService : public nsISpeechRecognitionService,
> >                                      public nsIObserver
> > {
> > public:
> >-  NS_DECL_ISUPPORTS
> >+  NS_DECL_THREADSAFE_ISUPPORTS
> Hmm, why this change?

As we are now calling ProcessAudioSegment on a thread[1], if we don't change the interface to NS_DECL_THREADSAFE_ISUPPORTS, the tests break as well. Here's a gist of the test results with only NS_DECL_ISUPPORTS [2]


[1] https://github.com/andrenatal/gecko-dev-speech/commit/6de461f6d6b69cf118d3fba3c1f907ca5775e1f9#diff-f47e4fa6c9b339d424ff52ce22e431beL403

[2] https://gist.github.com/andrenatal/5e9b12aff43a8369d7664d9a4d6314a6
Please see the updated version with all the issues addressed.
Attachment #8991057 - Attachment is obsolete: true
Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

Hi Andreas, do you mind reviewing the current version of the patch for us, please?
Attachment #8993800 - Flags: review?(apehrson)
@pehrson, another interesting issue that I found is that when I close the browser (or first the tab and then the browser) while the microphone is open and capturing (or when the MediaStream is being fed with audio [1]), I'm getting this crash [2] 


[1] https://github.com/andrenatal/webspeechapi/blob/gh-pages/index_auto.js#L85

[2] https://gist.github.com/andrenatal/2c5220f4a65ec785297d9ac811808ced 

Do you have any idea why that's happening?

Thanks,

Andre
Flags: needinfo?(apehrson)
Attachment #8993800 - Flags: review?(apehrson) → review?(bugs)
Asking smaug for review again since it seems he's back from PTO.
Sorry for the late reply. I was on PTO last week but am back for this week.

(In reply to André Natal from comment #47)
> @pehrson, another interesting issue that I found is that when I close the
> browser (or first the tab and then the browser) while the microphone is open
> and capturing (or when the MediaStream is being fed with audio [1]), I'm
> getting this crash [2] 
> 
> 
> [1]
> https://github.com/andrenatal/webspeechapi/blob/gh-pages/index_auto.js#L85
> 
> [2] https://gist.github.com/andrenatal/2c5220f4a65ec785297d9ac811808ced 
> 
> Do you have any idea why that's happening?
> 
> Thanks,
> 
> Andre

The "WARNING: YOU ARE LEAKING THE WORLD" indicates that you have added a non-cycle-collected reference cycle somewhere in your code. This is preventing shutdown. If you have one that is legit you probably want to listen for xpcom-shutdown and break it then.

First you need to find it however, which is easier said than done. I often find that auditing my own code helps, or sometimes (if there are lots of new RefPtr members) disabling chunk after chunk until it doesn't happen anymore, to try and narrow down the piece of code containing the cycle.
Flags: needinfo?(apehrson)
Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

Review of attachment 8993800 [details] [diff] [review]:
-----------------------------------------------------------------

I have mostly looked at the usage of AudioTrackEncoder, though other comments might have slipped in too.

Some basic things for the AudioTrackEncoder are causing this r-, most notably mAudioVector allocations and mAudioTrackEncoder->AdvanceCurrentTime calls.

::: dom/media/webspeech/recognition/OnlineSpeechRecognitionService.cpp
@@ +195,5 @@
> +  }
> +
> +  nsresult rv;
> +
> +  if (!isHeaderWritten) {

Is this a member? If so it should be mIsHeaderWritten.

@@ +199,5 @@
> +  if (!isHeaderWritten) {
> +    mWriter = MakeUnique<OggWriter>();
> +    mAudioEncoder = MakeAndAddRef<OpusTrackEncoder>(aSampleRate);
> +    mAudioEncoder->TryInit(*aAudioSegment, duration);
> +    mAudioEncoder->SetStartOffset(0);

SetStartOffset needs to get the number of samples that comes before the first segment. This so that bookkeeping of the actual samples work. An example to understand how this works:

mAudioEncoder->SetStartOffset(128);
// The internal buffer now contains 128 samples of silence and no sound.
mAudioEncoder->AppendAudioSegment(aAudioSegment);
// For aAudioSegment of 256 samples the buffer now contains 128 samples of silence and 256 of sound
mAudioEncoder->AdvanceCurrentTime(256);
// The internal buffer now contains 256 samples of silence and 128 of sound. 128 samples were passed on.
mAudioEncoder->AppendAudioSegment(aAudioSegment2);
// For aAudioSegment2 of 128 samples the buffer now contains 256 samples of silence and 256 of sound
mAudioEncoder->AdvanceCurrentTime(384);
// The internal buffer now contains 384 samples of silence and 128 of sound. 128 samples were passed on.

@@ +214,5 @@
> +    rv = mWriter->GetContainerData(&aOutputBufs, ContainerWriter::GET_HEADER);
> +    NS_ENSURE_SUCCESS(rv, rv);
> +
> +    for (auto& buffer : aOutputBufs) {
> +        mAudioVector.AppendElements(buffer);

Indentation is wrong in so many places. Consider fixing it in all your files with clang-format.

This AppendElements() is going to have to expand the array very often. Consider using something else for storage, like an array of EncodedFrameContainer that would be simpler to append to as it grows large. Since you only use the array after finishing the recording you can concat the EncodedFrameContainers then as the total duration is known and you only need one allocation, or use them one-by-one if the streaming api allows it.

Is there a fixed upper bound on the length of the recording? If not it'll be possible to trigger an OOM crash through this (though it is encoded opus so it'll take a while). Then we should probably store it in a temp file like MediaRecorder does.

It's probably a good idea too to use a fallible allocator for such big arrays, then we can abort the speech recognition with an error if there's a memory problem.

@@ +220,5 @@
> +    isHeaderWritten = true;
> +  }
> +
> +  mAudioEncoder->AppendAudioSegment(std::move(*aAudioSegment));
> +  mAudioEncoder->AdvanceCurrentTime(duration);

This must be the accumulated duration of all played segments until now.

Since you are passing the duration of only this segment, you are basically creating an unbounded buffer of audio data that only goes away when the AudioTrackEncoder is destructed.

This alone is a reason for r-.

@@ +231,5 @@
> +                                        mAudioEncoder->IsEncodingComplete() ?
> +                                        ContainerWriter::END_OF_STREAM : 0);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  nsTArray<nsTArray<uint8_t>> aOutputBufs;

s/aOutputBufs/outputBufs/

This is not an argument to this method, since the declaration is here.

@@ +232,5 @@
> +                                        ContainerWriter::END_OF_STREAM : 0);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  nsTArray<nsTArray<uint8_t>> aOutputBufs;
> +  rv = mWriter->GetContainerData(&aOutputBufs, ContainerWriter::FLUSH_NEEDED);

Don't flush every time. Instead do it only when you really need the data (like after EOS).

You should only flush once, or you'll confuse the OggWriter apparently: https://searchfox.org/mozilla-central/rev/033d45ca70ff32acf04286244644d19308c359d5/dom/media/ogg/OggWriter.cpp#180

@@ +382,5 @@
> +  return NS_OK;
> +}
> +
> +SpeechRecognitionResultList*
> +OnlineSpeechRecognitionService::BuildMockResultList()

This seems to belong in a unit test rather than release.

::: dom/media/webspeech/recognition/SpeechRecognition.cpp
@@ +57,5 @@
> +// sIdle holds the current state of the API, i.e, if a recognition
> +// is running or not. We don't want more than one recognition to be running
> +// at same time, so we set sIdle to true when the API is not active and to
> +// false when is active (recording or recognizing)
> +static bool sIdle = true;

The comment needs to state which thread can access sIdle, since it's not threadsafe.

Note that this static doesn't prevent two different child processes from running speech recognition at the same time. Is that intended?

::: dom/media/webspeech/recognition/SpeechRecognition.h
@@ +140,5 @@
>    void FeedAudioData(already_AddRefed<SharedBuffer> aSamples, uint32_t aDuration, MediaStreamListener* aProvider, TrackRate aTrackRate);
>  
>    friend class SpeechEvent;
> +
> +  nsCOMPtr<nsIThread> mEncodeThread;

Please document members and methods so I as a reviewer have a reference to verify their implementation and usage against. Such comments should at least say what the intention is and what threads are allowed to call/read/write the thing you're documenting.
Attachment #8993800 - Flags: review-
Comment on attachment 8993800 [details] [diff] [review]
0001-Bug-1248897-Introducing-an-online-speech-recognition.patch

(I'll review after pehrsons' comments are addressed)
Attachment #8993800 - Flags: review?(bugs)
Curious, what is the status with this?
Flags: needinfo?(anatal)
Due to a lack of resources, we needed to pause the work on this while I was working to bring voice search to Firefox Reality. Some of the issues that Andreas pointed out were already fixed (we worked together on those), so I expect to have a new patch in the next couple of weeks.
Flags: needinfo?(anatal)
Is this still on the back-burner? Currently I'm having to use Chrome for a project that requires this API but I would much rather use FF (especially with the offline recognition).
(In reply to davy.wybiral from comment #54)
> Is this still on the back-burner? Currently I'm having to use Chrome for a
> project that requires this API but I would much rather use FF (especially
> with the offline recognition).

Speaking on behalf of the machine learning team, it's on our radar.

We've just recently got our STT engine running on "small platform" devices using TFLite[1].

So for English the remaining steps are some polish of our "small platform" STT engine and integration of the engine into the patch Andre is working on.

For other languages we need data, which is a longer term problem we need to solve that's partially addressed by Common Voice.

[1] https://www.tensorflow.org/lite/

Andre, Olli, and Andreas, what's the current status of this?

It looks like Andreas' comments in his last review have not been addressed.

Alex was thinking about finishing this patch and integrating Deep Speech too.

Should he just jump in, fix Andreas' request for changes, and do it?

Flags: needinfo?(bugs)
Flags: needinfo?(apehrson)
Flags: needinfo?(anatal)

Andre is driving this so check with him. Last I heard he was working on setting up a mock server with the mochitests, but then he has gotten disrupted by other work a number of times too.

I'm happy to review any media bits, whoever writes them.

Flags: needinfo?(apehrson)

Many of Andreas' comments were already addressed after the last All Hands, including a major refactor of the media part. As he said, I was working on the mochitests and also on finding the root of a memory leak that occurs when closing the tabs while microphone capture is in progress.

If the ultimate goal is to integrate Deep Speech, I believe a better use of Alex's time would be to work on the backend instead of the frontend being discussed here, since they should be totally decoupled, i.e., finish the Docker container with DeepSpeech and deploy it to Mozilla's services cloud infrastructure for online decoding, and/or create another bug and patch just to integrate DeepSpeech's inference stack and models into Gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

Both are still missing and currently need more attention than the patch being worked on here, which is already functional if manually applied to Gecko.

Flags: needinfo?(anatal)

create another bug and patch just to integrate deepspeech's inference stack and models into gecko

For offline there is already #1474084

(In reply to Andre Natal from comment #58)

If the ultimate goal is to integrate Deep Speech, I believe a better use of Alex's time would be to work on the backend instead of the frontend being discussed here, since they should be totally decoupled, i.e., finish the Docker container with DeepSpeech and deploy it to Mozilla's services cloud infrastructure for online decoding, and/or create another bug and patch just to integrate DeepSpeech's inference stack and models into Gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

There's not that much to finish; it's working, and is feature-equivalent to the other implementation served by the speech proxy. I guess it would be more a question of production deployment etc. :)

(In reply to Andre Natal from comment #58)

If the ultimate goal is to integrate Deep Speech, I believe a better use of Alex's time would be to work on the backend instead of the frontend being discussed here, since they should be totally decoupled, i.e., finish the Docker container with DeepSpeech and deploy it to Mozilla's services cloud infrastructure for online decoding, and/or create another bug and patch just to integrate DeepSpeech's inference stack and models into Gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

Both are still missing and currently need more attention than the patch being worked on here, which is already functional if manually applied to Gecko.

The Deep Speech backend exists already here [https://gitlab.com/deepspeech/ds-srv].

The associated Docker file is there too [https://gitlab.com/deepspeech/ds-srv/blob/master/Dockerfile.gpu]

So there is no blocker in that regard.

However, Alex and I are interested in bringing STT on device, bug 1474084, so no servers are required, and Alex wants to resolve this bug and bug 1474084.

(In reply to Alexandre LISSY :gerard-majax from comment #60)

(In reply to Andre Natal from comment #58)

If the ultimate goal is to integrate Deep Speech, I believe a better use of Alex's time would be to work on the backend instead of the frontend being discussed here, since they should be totally decoupled, i.e., finish the Docker container with DeepSpeech and deploy it to Mozilla's services cloud infrastructure for online decoding, and/or create another bug and patch just to integrate DeepSpeech's inference stack and models into Gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

There's not that much to finish; it's working, and is feature-equivalent to the other implementation served by the speech proxy. I guess it would be more a question of production deployment etc. :)

Okay, I'll create a thread including you and Mozilla Services to roll that out to production

(In reply to kdavis from comment #61)

(In reply to Andre Natal from comment #58)

If the ultimate goal is to integrate Deep Speech, I believe a better use of Alex's time would be to work on the backend instead of the frontend being discussed here, since they should be totally decoupled, i.e., finish the Docker container with DeepSpeech and deploy it to Mozilla's services cloud infrastructure for online decoding, and/or create another bug and patch just to integrate DeepSpeech's inference stack and models into Gecko, plus the HTTP service which will receive the requests from the frontend here, in the case of offline.

Both are still missing and currently need more attention than the patch being worked on here, which is already functional if manually applied to Gecko.

The Deep Speech backend exists already here [https://gitlab.com/deepspeech/ds-srv].

The associated Docker file is there too [https://gitlab.com/deepspeech/ds-srv/blob/master/Dockerfile.gpu]

So there is no blocker in that regard.

However, Alex and I are interested in bringing STT on device, bug 1474084, so no servers are required, and Alex wants to resolve this bug and bug 1474084.

Cool, so it's better to work on the tidbits to integrate DeepSpeech into Gecko in bug 1474084 instead of this one.

The goal of the patch here is to create an agnostic frontend which can communicate with whichever decoder through an HTTP REST API, regardless of online or offline.

If the goal is to create a local DeepSpeech speech server exposed via HTTP, you can use this as the frontend, but if the goal is to do something different, for example injecting the frames directly into the inference stack, then it is better to create a completely new SpeechRecognitionService instead of injecting decoder-specific code into this patch.

(In reply to Andre Natal from comment #63)

The goal of the patch here is to create an agnostic frontend which can communicate with whichever decoder through an HTTP REST API, regardless of online or offline.

We want to use REST for offline?

I meant that if you point DEFAULT_RECOGNITION_ENDPOINT at a local HTTP service, it should work.

If that's not your goal, you should work on a whole new SpeechRecognitionService containing deep speech specific code on bug 1474084.

Hope that helps.
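
For illustration, a sketch of how a client of such an HTTP frontend/backend split could look. The endpoint URL, method, and request details below are placeholders, not taken from the patch; only the response shape ("status", "data"[0]."text", "data"[0]."confidence") is inferred from the OnlineSpeechRecognitionService::OnStopRequest parsing quoted earlier in this bug:

// Hypothetical client of a local STT HTTP service (placeholder URL).
async function recognize(audioBlob) {
  const response = await fetch('http://localhost:8080/stt', {
    method: 'POST',
    body: audioBlob, // e.g. Ogg/Opus audio, as produced by the patch's encoder
  });
  const json = await response.json();
  if (json.status !== 'ok') {
    throw new Error('Recognition service error');
  }
  // Shape inferred from the patch: data[0].text and data[0].confidence.
  return { transcript: json.data[0].text, confidence: json.data[0].confidence };
}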

(not sure I have anything to say here. I'm just reviewing whatever is dumped to my review queue ;)
And happy to review DOM/Gecko side of this)

Flags: needinfo?(bugs)

(In reply to Andre Natal from comment #62)

Okay, I'll create a thread including you and Mozilla Services to roll that out to production

Could you put me on CC. Thanks

Depends on: 1541290
Depends on: 1541298
Attachment #8993800 - Attachment is obsolete: true

This patch introduces a Speech Recognition Service which interfaces with Mozilla's remote STT endpoint, which is currently being used by multiple services.

Attachment #9055677 - Attachment is obsolete: true

See bug 1547409. Migrating webcompat priority whiteboard tags to project flags.

Webcompat Priority: --- → P1
Priority: -- → P2
Target Milestone: --- → Future

For testing reference, there are web platform tests for SpeechRecognition, though they aren't all appearing on wpt.fyi and it's unclear to me how good the coverage is. It appears that Blink has a few extra tests.

Blocks: 1565102
Blocks: 1565103
No longer blocks: 1565103
Blocks: 1565103

Andre, is this still on track for 70?

And, can you suggest a release note (either for 70 or for whatever future release this ends up in)? Thanks!

Release Note Request (optional, but appreciated)
[Why is this notable]:
[Affects Firefox for Android]:
[Suggested wording]:
[Links (documentation, blog post, etc)]:

relnote-firefox: --- → ?
Flags: needinfo?(anatal)
Flags: needinfo?(anatal)
Priority: P2 → P1

(In reply to Liz Henry (:lizzard) from comment #74)

Andre, is this still on track for 70?

And, can you suggest a release note (either for 70 or for whatever future release this ends up in)? Thanks!

Release Note Request (optional, but appreciated)
[Why is this notable]:
[Affects Firefox for Android]:
[Suggested wording]:
[Links (documentation, blog post, etc)]:

Hi Liz, we moved it to 71. Is it possible to track this there?

Thanks!

Updated to track our 71 release. André, will this need a mention in our release notes, a blog post or a mention on our Nightly twitter account to ask our core community to test it? Thanks

Flags: needinfo?(anatal)
Webcompat Priority: P1 → P2
Webcompat Priority: P2 → P1
Blocks: 1588067

(In reply to Pascal Chevrel:pascalc from comment #76)

Updated to track our 71 release. André, will this need a mention in our release notes, a blog post or a mention on our Nightly twitter account to ask our core community to test it? Thanks

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

(In reply to Nils Ohlmeier [:drno] from comment #77)

(In reply to Pascal Chevrel:pascalc from comment #76)

Updated to track our 71 release. André, will this need a mention in our release notes, a blog post or a mention on our Nightly twitter account to ask our core community to test it? Thanks

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

We have release notes for the Nightly channel https://www.mozilla.org/en-US/firefox/71.0a1/releasenotes/

(In reply to Nils Ohlmeier [:drno] from comment #77)

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

Are there any instructions for testing this in nightly? I'm a web developer who runs Firefox Nightly and has given several talks on the Web Speech API. Very keen to help with testing.

Hi Pascal, yes, let's do it, but let's wait until we have the code fully merged to start this discussion. Currently the landing date is still uncertain since the code hasn't been fully reviewed yet.

Flags: needinfo?(anatal)

(In reply to jason.oneil from comment #79)

(In reply to Nils Ohlmeier [:drno] from comment #77)

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

Are there any instructions for testing this in nightly? I'm a web developer who runs Firefox Nightly and has given several talks on the Web Speech API. Very keen to help with testing.

Hi Jason,

the API hasn't landed and isn't available in Nightly yet, but as soon as it is, enabling it will just be a matter of switching a couple of flags on (if at all)
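
For reference, a sketch of what flipping those flags might look like in a user.js file once the feature lands. The first pref name appears earlier in this discussion; the second is an assumption based on the Web Speech wiki page and may differ:

// Assumed prefs only; verify against about:config / the wiki once the feature lands.
user_pref("media.webspeech.recognition.enable", true);
user_pref("media.webspeech.recognition.force_enable", true);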

We will land this in Nightly 72.

Keywords: checkin-needed

Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e322e2112b1f
Introducing an online speech recognition service for Web Speech API r=smaug,pehrsons,padenot

Keywords: checkin-needed
Status: REOPENED → RESOLVED
Closed: 8 years ago → 5 years ago
Resolution: --- → FIXED
Regressions: 1590368
Depends on: 1590652
No longer depends on: 1590652

WOOHOOO!! Congratulations everyone! So excited to see this ship after filing this bug four years ago (whoa, time flies!).

(er, duped, but filed a bunch of these originally, so seeing the notifications coming in!!)

Depends on: 1596773
Depends on: 1596788
Depends on: 1596804
Depends on: 1596819
Depends on: 1597204
Depends on: 1597220
Depends on: 1597287
Depends on: 1597637
Depends on: 1597760
Depends on: 1597960
Depends on: 1597978

This is default-off, so not ready for release notes.

Is the speech recognition code free open source software?

Why is the speech reccognition code not shipped with the browser?

(In reply to guest271314 from comment #88)

Is the speech recognition code free open source software?

Why is the speech reccognition code not shipped with the browser?

There is a related issue regarding shipping DeepSpeech with Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=1474084. The code for DeepSpeech is, of course, open source. It can be found here: https://github.com/mozilla/DeepSpeech

(In reply to brandmairstefan from comment #89)

(In reply to guest271314 from comment #88)

Is the speech recognition code free open source software?

Why is the speech recognition code not shipped with the browser?

There is a related issue regarding shipping DeepSpeech with Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=1474084. The code for DeepSpeech is, of course, open source. It can be found here: https://github.com/mozilla/DeepSpeech

Tried to use SpeechRecognition in Nightly 73, with the necessary flags set, and the browser crashed.

Am trying to compare the results of various STT services. So far I have tested https://speech-to-text-demo.ng.bluemix.net/ and https://cloud.google.com/speech-to-text/, which provide different results for a 33-second audio file (WAV).

Is there a means to send an audio file to the Mozilla end point without using SpeechRecognition?

SpeechRecognition appears to stop at a brief pause in the sound, given 33 seconds of audio.

audiostart 266094
start 266095
speechstart 266419

speechend 267451
audioend 267453 
end 268580

Not working for me. I have updated to Firefox Nightly v79 (18/06/2020).
I have set both values to true and relaunched the browser. After visiting the site https://speechnotes.co, I am unable to click on the microphone. When I tried another site, https://dictation.io/speech, the tool did not recognize my voice.

Greetings to everyone, I'm new here, but I've filed speech recognition issues with Chromium / Android; they answered that they will solve them this year.

I would like to give you an idea of a different approach I experienced ...

If we take any word, let it be "house", and build a container around this object, something similar to a cell which has DNA and organelles, we could achieve the following:

Steps:

1.) "house" + a collection of sound API equalizer registrations which get acquired and update this DNA chain
2.) pass the collection through sound fingerprinting to compute common patterns (Philips technology, Netherlands, but they do cost)
3.) place the fingerprinting patterns into the house DNA chain / house object.

Redo 1 to 3 on a regular basis, say every 10 / 20 new equalizer registrations you obtain from users requesting to recognize their words.

Server-wise, place the fingerprints into its memory.

Once voice recognition is requested, preprocess the input voice as a fingerprint,
get the fingerprint or fingerprints from the server memory,
and return the word "house" contained in the object.

Nobody will forbid you to combine or connect or link different objects into one, such as:

house, Haus (German), casa (Italian), casa (Portuguese), maison (French)

and this way also obtain immediate translation possibilities, or words with the same sound-alike meaning.

Building this thing as single objects and not as a database becomes necessary for performance reasons, as a database becomes slower and slower the more data it contains. I know this from music recognition engines used to monitor radio stations and television and report the author rights to be settled. (A little like Shazam)

The other reason is that you do not need to search for a pattern if the pattern is already named as its result, so it becomes nothing more than a lookup of a file, which is much faster than a database lookup.

An example of a data container:

describing the content (which could be the sound API equalizer recordings and the fingerprintings)

data-styled="{"x":161,"y":403,"width":"79vw","height":"49vh","top":"47vh","right":"91vw","bottom":"96vh","left":"12vw","font-family":"Times","font-weight":"400","font-size":"6.884480746791131vmin","line-height":"normal","text-align":"start","objHref":"","dna-mouse-up":"exchangeForm,","dna-enter":"secondary,","objCode":"1603813965238","objTitle":"Museo Gregoriano Egizio","objUrl":"","objData":"","objMime":"image/jpeg","objText":"","objScreenX":"","objScreenY":"","objScreenZ":"","objName":"","objAddr":"","objEmail":"","objWhatsapp":"","objSkype":"","objPhone":"","objWWW":"","objBuild":"Tue, 27 Oct 2020 15:52:45 GMT","objExpire":"","objGroup":"","objOwner":"","objCmd":"pasted image","objUpdate":"Fri, 01 Jan 2021 15:28:20 GMT","objParent":"1603813859288","name":"","org-height":"734","org-width":"981","org-size":"","compression":"0.6","res-width":"981","res-height":"734","res-size":"154787","scale-factor":"1","play-dna":"","objTime":16856080079020.5,"objStart":1609514900047,"background-color":"Tomato"}"

and allowing the data, with:

"dna-mouse-up":"exchangeForm,",
"dna-enter":"secondary,",

to execute commands / programs based upon mouse-up or upon entering a screen area.
(I took it from a website I've built.)

or with such a set of instructions:

data-styled="{"clickdog29021":"50,50,50,50,2000,5","clickdog41510":"89.47368421052632,39.39393939393939,7.236842105263158,39.928698752228165,412,3","clickdog51838":"7.894736842105263,40.28520499108734,75.6578947368421,40.106951871657756,352,-3","clickdog59010":"12.5,45.27629233511586,73.02631578947368,7.8431372549019605,468,9","clickdog65855":"73.35526315789474,16.22103386809269,17.105263157894736,47.41532976827095,354,-9","clickdog77738":"50,50,50,50,2000,6"}"

The numbers are a unique processing id, the starting x,y, the ending x,y (vh/vw), the time used (ms), and the action to perform: move a hand to greet (5), then show the visitor how to swipe to change page forwards and backwards (3, -3), jump to the last and first page (9, -9), then finally greet for a goodbye (6).

Clicking would be 1, a long click 2, and a very long click 4 (used to issue the different commands available on the screen).

The advantage of such a thing is that each single data container will continue to learn and process. Based on the real-world usage of those data containers, a major "understanding" (data collection) of the word "house" is created; the containers that are less used or unused will time out ("objExpire":"") and/or ("objUpdate":"Fri, 01 Jan 2021 15:28:20 GMT"), and you get something like a natural selection or evolution of containers adapting themselves to the millions of speakers using the system to dictate their hopefully intelligent words.

I hope I could explain the idea or technique in a way that makes it understandable.

At your disposal
(I do not place a link to the above, which I would consider publicity, but the thing is also on the web and works.)
Claudio Klemp

Hi Claudio and Happy New Year! Please note that Bugzilla is meant to track the implementation of specific feature requests or bug fixes. In this case it is the SpeechRecognition Interface of the Web Speech API and this bug is already closed.

More general discussions should happen at https://discourse.mozilla.org/, which might then end up as specific requests here.

Sebastian

(In reply to André Natal from comment #81)

(In reply to jason.oneil from comment #79)

(In reply to Nils Ohlmeier [:drno] from comment #77)

It's not going to ride the trains. The plan is to hold the feature in Nightly. I guess that means it doesn't go into the release notes. But asking for folks to flip the pref and test it in Nightly would be good.

Are there any instructions for testing this in nightly? I'm a web developer who runs Firefox Nightly and has given several talks on the Web Speech API. Very keen to help with testing.

Hi Jason,

the API hasn't landed and isn't available in Nightly yet, but as soon as it is, enabling it will just be a matter of switching a couple of flags on (if at all)

Which "couple of flags"? I observed the devtools.chrome.enabled = true and media.webspeech.recognition.enable = true from this discussion and set those accordingly. I am testing on https://mdn.github.io/web-speech-api/speech-color-changer/index.html. But page does not load properly due to breakage on webkitSpeechRecognition is not defined error. Testing on Nightly 86.0a1 (2021-01-03).

Flags: needinfo?(jason.oneil)

Additional info: using macOS version 10.13.6.
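
The error suggests that demo references webkitSpeechRecognition as a bare identifier, hence the ReferenceError in Firefox even with the prefs flipped. A small feature-detection shim along these lines (a sketch, not the demo's actual code) avoids that:

// Use whichever constructor the browser exposes, if any.
const SpeechRecognitionImpl = window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognitionImpl) {
  console.warn("SpeechRecognition is not exposed; check the media.webspeech.* prefs.");
} else {
  const recognition = new SpeechRecognitionImpl();
  recognition.lang = "en-US";
  recognition.onresult = (event) => console.log(event.results[0][0].transcript);
  recognition.onerror = (event) => console.error("recognition error:", event.error);
  recognition.start(); // prompts for microphone access
}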

#95 The last time I tested this, the page I used for information on speech recognition was https://wiki.mozilla.org/Web_Speech_API_-_Speech_Recognition#How_can_I_test_with_Deep_Speech.3F; see https://bugzilla.mozilla.org/show_bug.cgi?id=1604994.

(In reply to guest271314 from comment #97)

#95 The last time I tested this, the page I used for information on speech recognition was https://wiki.mozilla.org/Web_Speech_API_-_Speech_Recognition#How_can_I_test_with_Deep_Speech.3F; see https://bugzilla.mozilla.org/show_bug.cgi?id=1604994.

I get "Voice input isn't supported on this browser".

When running a simple script, window.SpeechRecognition is undefined.

You'll need to enable both media.webspeech.recognition.enable and media.webspeech.recognition.force_enable to test out the API in its current state.
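
For anyone who prefers a profile's user.js over flipping these in about:config, something like the following should be equivalent (a sketch; the pref names are the ones mentioned in this thread, and a restart is still required):

// user.js in the Firefox profile directory
user_pref("media.webspeech.recognition.enable", true);
user_pref("media.webspeech.recognition.force_enable", true);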

(In reply to Andreas Pehrson [:pehrsons] from comment #99)

You'll need to enable both media.webspeech.recognition.enable and media.webspeech.recognition.force_enable to test out the API in its current state.

Thank you, that works. media.webspeech.recognition.force_enable wasn't mentioned in this discussion before.

Am I to understand that setting media.webspeech.service.endpoint to https://dev.speaktome.nonprod.cloudops.mozgcp.net/ will cause SpeechRecognition to use Deepspeech instead of the OS native service?

to use Deepspeech instead of the OS native service?

What is meant by "OS native service"?

Which OS are you running?

AFAIK, on Linux there is no "native OS service" for speech recognition.

(In reply to JulianHofstadter from comment #100)

(In reply to Andreas Pehrson [:pehrsons] from comment #99)

You'll need to enable both media.webspeech.recognition.enable and media.webspeech.recognition.force_enable to test out the API in its current state.

Thank you, that works. media.webspeech.recognition.force_enable wasn't mentioned in this discussion before.

Am I to understand that setting media.webspeech.service.endpoint to https://dev.speaktome.nonprod.cloudops.mozgcp.net/ will cause SpeechRecognition to use Deepspeech instead of the OS native service?

See this wiki page as mentioned in comment 97. It lists the default destination, and further down how to switch to deep speech.

From the docs, I understand the interimResults[1] property to mean that results will be returned at intervals while speaking, thus triggering a result event more than once before speech recognition ends. However, in practice I am observing that setting interimResults to true has no effect, and only a single result is returned at the end of recognition.

Am I interpreting this incorrectly, and if so, can someone tell me what practical difference I should see between setting this to true and false?

[1]https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition/interimResults
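
For context, this is roughly how interimResults is expected to behave per the spec in browsers that support it (a sketch only; as the reply below notes, the current Firefox implementation may not honour the setting):

const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new Recognition();
recognition.interimResults = true; // expect several result events while speaking

recognition.onresult = (event) => {
  // Each event can carry interim (isFinal === false) and final results.
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    console.log(result.isFinal ? "final:" : "interim:", result[0].transcript);
  }
};

recognition.start();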

This implementation is far from the spec, but at least basic recognition should work. I don't think it allows much configuring at all, but if you want an authoritative answer, read the source.

That said, this bug is not the place to keep discussions, so this will be my last comment. Feel free to continue on Matrix.

(In reply to Andreas Pehrson [:pehrsons] from comments)

See this wiki page as mentioned in comment 97. It lists the default destination, and further down how to switch to deep speech.

Very informative, that helps a lot, thank you.

This implementation is far from the spec, but at least basic recognition should work. I don't think it allows much of configuring at all

Ah, that's helpful.

That said, this bug is not the place to keep discussions, so this will be my last comment. Feel free to continue on Matrix.

Thanks, I'll take your advice.
