Closed Bug 650295 Opened 13 years ago Closed 11 years ago

Add support for speech input

Categories

(Core :: General, enhancement)


Tracking


RESOLVED FIXED
mozilla22

People

(Reporter: icecold, Assigned: ggp)

References

(Blocks 1 open bug)

Details

(Keywords: dev-doc-needed)

Attachments

(2 files, 12 obsolete files)

124.39 KB, patch
ggp
: review+
johns
: checkin+
Details | Diff | Splinter Review
193.16 KB, patch
ggp
: review+
johns
: checkin+
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (Windows NT 6.1; rv:6.0a1) Gecko/20110414 Firefox/6.0a1
Build Identifier: 

"With this API, developers can give web apps the ability to transcribe your voice to text. When a web page uses this feature, you simply click on an icon and then speak into your computer’s microphone. The recorded audio is sent to speech servers for transcription, after which the text is typed out for you."

Google implemented support in Chrome and made it available on Google Translate. (You need the Chrome Canary build.)

Reproducible: Always
Severity: normal → enhancement
I know this is something people have been talking about, but I couldn't find an existing bug, so confirming.
Status: UNCONFIRMED → NEW
Ever confirmed: true
This is probably better addressed to the people at Google who somehow think that this would be a good addition to the HTML specs...

How exactly would this interact (or interfere) with the speech recognition that is already present on a user's machine?

On Windows, all versions since XP have sophisticated built-in speech recognition capabilities (capable of adapting and training to each person's individual voice and diction) that already allow users to dictate text into any location where keyboard input is accepted (including form inputs in any browser).  And although other operating systems, to my knowledge, do not have built-in dictation like Windows does, there is third-party software that does the same thing.

Also, if the lack of popularity of the recognition feature in Windows is any indication, I don't think this is a feature that will have legs in the desktop world (mobile is probably a very different story, of course).
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Feb/att-0020/api-draft.html is just one *proposed* API, and it has severe problems, so it should not be implemented as such.
I have proposed a different API, and Microsoft has proposed a few APIs.
Summary: Add support for HTML speech input API → Add support for speech input and tts output API
Assignee: nobody → roshanvid
Status: NEW → ASSIGNED
OS: Windows 7 → All
Hardware: x86 → All
Version: unspecified → Trunk
Be on the lookout for duplicates of this from users unable to "Search Out Loud" in Firefox now that this is live =S
Reference: http://www.google.com/insidesearch/voicesearch-chrome.html
Summary: Add support for speech input and tts output API → Add support for speech input and tts output API (used for voice search)
Summary: Add support for speech input and tts output API (used for voice search) → Add support for speech input and tts output API
Blocks: webrtc
Updating summary. Note TTS already filed as bug 525444.
Summary: Add support for speech input and tts output API → Add support for speech input
What is the status on this bug?

I have been using speech input in Chrome to great effect for voice commands, searching data without typing, and game input.

However, I'd like to see an improved (and more open) implementation for Gecko.  We should be able to specify the server the speech input gets sent to for processing.  Being locked into Google's speech-to-text servers won't cut it, especially since their Translate API previously went down and came back as a paid service.
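For illustration, the draft Speech API has a hook for exactly this: a serviceURI attribute on SpeechRecognition. A minimal sketch, assuming the attribute name from the draft (the endpoint URL is made up, and whether a given engine honors the attribute is implementation-defined):

// Point recognition at a custom speech-to-text server.
var recognition = new SpeechRecognition();
recognition.serviceURI = "https://speech.example.org/recognize"; // hypothetical endpoint
recognition.start();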

Google's server-side code also doesn't seem to handle background noise very well, and there's no way for us to fix this due to its closed-source nature.

Further, the way it's implemented in Chromium is especially cumbersome for game input.  Games are keyboard-driven applications, and forcing the user to click a microphone icon with the mouse every time is a hassle and poor design.  I therefore propose that pages be able to "optionally" request permission to process speech input automatically.  Here's the bug I filed at Chromium:

http://code.google.com/p/chromium/issues/detail?id=95523
Depends on: 746536
I have also been using speech input in Chrome with great effectiveness. I use forums all the time, and it would be wonderful if text input boxes could accept speech input in Firefox.

Here's an example website that shows how to allow speech input using Chrome.

http://www.mybloggertricks.com/2012/05/add-free-speech-recognition-to-your.html

Please add this HTML5 feature to Firefox.
http://www.mybloggertricks.com/2012/05/add-free-speech-recognition-to-your.html describes
a Chrome-only, non-standard (and to-be-removed) way to add speech input.
It has nothing to do with the HTML(5) specification.
Thank you for the clarification, Olli Pettay. I was under the impression it was an HTML5 feature.

At any rate it would be wonderful to see a similar implementation with Firefox as it's a great feature.
Apparently the Chrome implementation of speech input uses HTML5 features.

http://webdesign.about.com/od/html5tutorials/qt/Collect-Speech-Input-With-Html5-On-Google-Chrome.htm
As I said, it has nothing to do with HTML5, or with any specification in fact.
The API described in http://webdesign.about.com/od/html5tutorials/qt/Collect-Speech-Input-With-Html5-On-Google-Chrome.htm is Google's proprietary API, and will hopefully be changed to
something closer to http://www.w3.org/2005/Incubator/htmlspeech/XGR-htmlspeech-20111206/
Thank you once again Olli. 

I just hope that Firefox will have speech-to-text available sometime soon. Being able to fill out text input forms like the one I'm typing in now using speech would be very helpful.
I think this bug should cover this spec:
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

In addition to the above spec, I think it should be well integrated with MediaStream objects. See: http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0072.html

As for speech engines and implementations, I think it should be engine-agnostic, meaning many potential engines should be supported; some examples:
1. Platform services (SAPI on windows, and similar on Linux, Mac, and Android).
2. Remote services, like Nuance or Google (translate).
3. Local JS and native implementations (how and if they would be bundled with Firefox is debatable).

The implementation "glue" should be available in chrome-access JS, so a JS object would just need to implement an interface, and register the service object with a privileged call.
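For context, here's a minimal sketch of how a page would drive the API in that spec. The constructor, attributes, and event names are taken from the draft; the result shape shown is my reading of it, and an implementation may gate the whole thing behind a pref:

// Minimal use of the draft SpeechRecognition API.
var recognition = new SpeechRecognition();
recognition.lang = "en-US";
recognition.maxAlternatives = 3;

recognition.onresult = function(event) {
  // Each result holds one or more alternatives, ranked by confidence.
  var best = event.results[0][0];
  console.log("Heard: " + best.transcript +
              " (confidence " + best.confidence + ")");
};

recognition.onerror = function(event) {
  console.log("Recognition error: " + event.error);
};

// Prompts for microphone access, then streams audio to the
// configured recognition service.
recognition.start();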
I think Nikhil did implement something along these lines.
(In reply to Eitan Isaacson [:eeejay] from comment #14)
> I think this bug should cover this spec:
> http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
> 
> In addition to the above spec, I think it should be well integrated with
> MediaStream objects. See:
> http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0072.html
> 
> As for speech engines and implementations, I think it should be agnostic.
> Meaning, many potential engines should be supported, some examples:
> 1. Platform services (SAPI on windows, and similar on Linux, Mac, and
> Android).
> 2. Remote services, like Nuance or Google (translate).
> 3. Local JS and native implementations (how and if they would be bundled
> with Firefox is debatable).

Work on this is already underway.

> 
> The implementation "glue" should be available in chrome-access JS, so a JS
> object would just need to implement an interface, and register the service
> object with a privileged call.

No immediate plan for exposing to JS since much of the underlying API is not
directly exposed to JS and we don't expect there to be a ton of speech-to-text providers.
(In reply to Fabrice Desré [:fabrice] from comment #15)
> I think Nikhil did implement something along these lines.

Oops! wrong bug, I meant to post this in bug #525444. Nikhil, is your work public somewhere?
(In reply to Nikhil Marathe from comment #16)
> (In reply to Eitan Isaacson [:eeejay] from comment #14)
> > 
> > The implementation "glue" should be available in chrome-access JS, so a JS
> > object would just need to implement an interface, and register the service
> > object with a privileged call.
> 
> No immediate plan for exposing to JS since much of the underlying API is
> not directly exposed to JS and we don't expect there to be a ton of
> speech-to-text providers.

This makes sense, for now. On the TTS front, I think supporting more than one engine is important, since each engine has different strengths and varying language support, and the user can have different preferences. This may be true one day for speech recognition too, but for now there is not much out there.
> Oops! wrong bug, I meant to post this in bug #525444. Nikhil, is your work public somewhere?

Parts can be shared immediately.  Other parts, not so much -- we started using (for prototyping) Google's service, which we may or may not be able to share right now (I am not sure if Chromium uses it or just Chrome).  nsm, can you investigate this?
Hm, Chromium uses[1] the HTTP API, yes, and there are lots of examples of how to use it on the Web, but I couldn't find anything official from Google saying whether the API is supposed to be used by the public. The closest thing I found was this message in the Google Product Forums:

http://productforums.google.com/forum/#!topic/chrome/U02jzA32IGc
>The documentation and terms of service for this HTTP service are not made publicly
>available because this service is not intended for public use.
>The Google Speech HTTP API is intended for use from Chrome and should not be used
>by third parties without explicit permission from Google.
>
>Third parties who want to access this API can request access by emailing a summary
>of their intended application, together with a business case for why Google should
>support their use of this service, to "google-speech-api" at our company's domain
>(google.com).

[1] http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/speech/google_one_shot_remote_engine.cc?view=markup
Assignee: roshanvid → ggoncalves
This patch implements the state machine that handles recording audio, creating the backend recognition service (through a factory mechanism), detecting the end of speech, and issuing the events and errors associated with the results of the recognition.

The DOM objects associated with the recognition (result list, alternatives, errors, and events) are also implemented.
Attachment #712684 - Flags: review?(bugs)
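As a rough illustration of what this state machine produces from a page's point of view, a test or demo could log the session lifecycle like this (event names per the draft spec; the exact ordering is an assumption):

// Log the DOM events fired over one recognition session.
var sr = new SpeechRecognition();
["start", "audiostart", "speechstart", "speechend",
 "audioend", "result", "error", "end"].forEach(function(name) {
  sr.addEventListener(name, function() {
    console.log("SpeechRecognition event: " + name);
  });
});
sr.start();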
This patch implements tests for the finite state machine introduced in the previous patch. It uses the nsIObserver interface implemented by the main SpeechRecognition object to request artificial state machine events (simulating, for example, a response from the recognition service), and then checks whether the expected DOM events are raised.
Attachment #712689 - Flags: review?(bugs)
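A hypothetical sketch of that pattern from a mochitest. The observer topic the implementation listens on (SPEECH_RECOGNITION_TEST_EVENT_REQUEST_TOPIC) isn't spelled out in the excerpts here, so the string below is a placeholder; the fake event name is taken from the patch:

// Drive the state machine with a faked event, then assert on the DOM
// event it should eventually produce.
const FAKE_EVENT_TOPIC = "speech-recognition-test:request-event"; // placeholder

var sr = new SpeechRecognition();
sr.onstart = function() {
  ok(true, "state machine reacted to the faked EVENT_START");
  SimpleTest.finish();
};

var obs = SpecialPowers.Cc["@mozilla.org/observer-service;1"]
                       .getService(SpecialPowers.Ci.nsIObserverService);
obs.notifyObservers(null, FAKE_EVENT_TOPIC, "EVENT_START");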
Comment on attachment 712684 [details] [diff] [review]
Bug 650295 - Implement the main state machine for Speech Recognition


>+// The atoms below are used by webspeech
>+// but already defined by other APIs
>+// GK_ATOM(onerror, "onerror")
>+// GK_ATOM(onend, "onend")
No need to add this kind of comment
>+// This is a helper class which enables Web Speech to be enabled or disabled
>+// as whole.  Individual Web Speech object classes should inherit from this.
>+class EnableWebSpeechCheck {
Nit, { should go to the next line

>+SpeechGrammar::SpeechGrammar(nsISupports *parent) : mParent(parent)
nsISupport* aParent

>+SpeechGrammar::GetSrc(nsString& retval) const
aRetVal

>+{
>+  NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+void
>+SpeechGrammar::SetSrc(const nsAString& arg)
>+{
>+  NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+float
>+SpeechGrammar::Weight() const
>+{
>+  NS_ABORT_IF_FALSE(0, "not implemented");
>+  return 0;
>+}
>+
>+void
>+SpeechGrammar::SetWeight(float arg)
>+{
>+  NS_ABORT_IF_FALSE(0, "not implemented");
>+}
Better to throw for now. NS_ABORT is quite strong.

>+public:
>+  SpeechGrammar(nsISupports *parent);
nsISupports* aParent


>+  static SpeechGrammar* Constructor(const GlobalObject& aGlobal, ErrorResult& aRv);
>+
>+  void GetSrc(nsString& retval) const;
aRetVal


>+
>+  void SetSrc(const nsAString& arg);
aArg


>+  void SetWeight(float arg);
aArg


>+private:
>+  nsCOMPtr<nsISupports> mParent;
You need to add mParent to cycle collection. So in .cpp file
NS_IMPL_CYCLE_COLLECTION_WRAPPERCACHE_0(SpeechGrammar) ->
NS_IMPL_CYCLE_COLLECTION_WRAPPERCACHE_1(SpeechGrammar, mParent)

>+
>+NS_IMPL_CYCLE_COLLECTION_WRAPPERCACHE_0(SpeechGrammarList)
Also this should traverse object's mParent


>+already_AddRefed<SpeechGrammar>
>+SpeechGrammarList::Item(uint32_t index)
aIndex


>+SpeechGrammarList::AddFromURI(const nsAString &src, const Optional< float >& weight)
const nsAString& aSrc, const Optional<float>& aWeight


>+SpeechGrammarList::AddFromString(const nsAString& string, const Optional< float >& weight)
const nsAString& aString, const Optional<float>& aWeight


>+{
>+  JS_ASSERT(0);
No JS_* stuff unless absolutely needed. We have MOZ_ASSERT.
Also, assert is perhaps too strong for now. Better to throw, or just return something dummy for now?

>+already_AddRefed<SpeechGrammar>
>+SpeechGrammarList::IndexedGetter(uint32_t index, bool& present)
aIndex, aPresent

>+public:
>+  SpeechGrammarList(nsISupports *parent);
nsISupports* aParent

>+  // Mark this as resultNotAddRefed to return raw pointers
Leftover from the binding example generator ?

>+  already_AddRefed<SpeechGrammar> Item(uint32_t index);
aIndex

>+
>+  void AddFromURI(const nsAString& src, const Optional< float >& weight);
>+
>+  void AddFromString(const nsAString& string, const Optional< float >& weight);
>+
>+  already_AddRefed<SpeechGrammar> IndexedGetter(uint32_t index, bool& present);
Check the coding style of parameters.
If the example binding codegen uses wrong coding style, file a bug to get that fixed.


>+GetSpeechRecognitionLog()
>+{
>+  static PRLogModuleInfo *sLog;
>+  if (!sLog)
>+    sLog = PR_NewLogModule("SpeechRecognition");
if (expr) {
  stmt;
}



>+SpeechRecognition::SpeechRecognition()
>+  : mProcessingEvent(false)
>+  , mEndpointer(new Endpointer(kSAMPLE_RATE))
>+  , mSpeechDetectionTimer(do_CreateInstance(NS_TIMER_CONTRACTID))
>+{
>+  SR_LOG("created SpeechRecognition");
>+  SetIsDOMBinding();
>+  nsCOMPtr<nsIObserverService> obs = services::GetObserverService();
>+  obs->AddObserver(this, "xpcom-shutdown", false);
>+  obs->AddObserver(this, "getUserMedia:response:allow", false);
>+  obs->AddObserver(this, "getUserMedia:response:deny", false);
So these AddObserver calls mean that all the SpeechRecognition objects are kept alive
until xpcom-shutdown.



>+already_AddRefed<SpeechRecognition>
>+SpeechRecognition::Constructor(const GlobalObject& aGlobal, ErrorResult& aRv)
>+{
>+  nsCOMPtr<nsPIDOMWindow> win = do_QueryInterface(aGlobal.Get());
>+  if (!win) {
>+    aRv.Throw(NS_ERROR_FAILURE);
>+  }
>+  MOZ_ASSERT(win->IsInnerWindow());
>+  SpeechRecognition* object = new SpeechRecognition();
>+  NS_ADDREF(object);
>+  object->BindToOwner(win);
>+  return object;
>+}
nsRefPtr<SpeechRecognition> object = new SpeechRecognition();
object->BindToOwner(win);
return object.forget();



>+SpeechRecognition::ProcessEvent(SpeechEvent *event)
>+{
>+  SR_LOG("Processing event %d", event->mType);
>+  NS_ASSERTION(NS_IsMainThread(), "Should be running on main thread");
Not sure how useful this assertion is. If a cycle collectable object is used on a non-main thread,
we'll crash soon and hard.

>+MediaOperationRunnable *
>+SpeechRecognition::MakeMediaOperationRunnable(MediaOperation aType)
>+{
>+  nsIThread *mt;
>+
>+  NS_GetMainThread(&mt);
>+  nsCOMPtr<nsPIDOMWindow> window = do_QueryInterface(GetOwner());
Why do you need do_QueryInterface? GetOwner() returns nsPIDOMWindow.

>+
>+  nsRefPtr<GetUserMediaCallbackMediaStreamListener> listener(
>+      new GetUserMediaCallbackMediaStreamListener(mt, window->WindowID()));
2 space indentation please.


>+uint64_t SpeechRecognition::ProcessAudioSegment(AudioSegment *segment) {
AudioSegment* aSegment
and { goes to the next line

>+SpeechRecognition::StartedAudioCapture(SpeechEvent *event)
SpeechEvent* aEvent


>+{
>+  nsCOMPtr<nsIObserverService> obs = services::GetObserverService();
>+  obs->NotifyObservers(nullptr, "webspeech:audio-capture-started", nullptr);
>+  DispatchTrustedEvent(NS_LITERAL_STRING("start"));
>+  DispatchTrustedEvent(NS_LITERAL_STRING("audiostart"));
Is this a safe point to dispatch events? Remember that event listeners can call stop() or
navigate the page to a new location and what not.
Could the events be dispatched asynchronously?
(need to check what the spec says)

>+SpeechRecognition::DetectSpeech(SpeechEvent *event)
SpeechEvent* aEvent


>+SpeechRecognition::WaitForSpeechEnd(SpeechEvent *event)
SpeechEvent* aEvent


>+SpeechRecognition::NotifyFinalResult(SpeechEvent *event)
ditto and same also elsewhere

>+SpeechRecognition::AbortSilently(SpeechEvent *event)
>+{
>+  if (mRecognitionService)
>+    mRecognitionService->Abort();
if (expr) {
  stmt;
}



>+SpeechRecognition::SelectDevice()
>+{
>+  SR_LOG("Selecting device");
>+  MediaEngine *engine = MediaManager::Get()->GetBackend();
>+
>+  nsTArray<nsRefPtr<MediaEngineAudioSource> > audioSources;
>+  engine->EnumerateAudioDevices(&audioSources);
>+
>+  if (audioSources.Length() <= 0)
>+    return NS_ERROR_FAILURE;
Maybe
NS_ENSURE_STATE(audioSources.Length() > 0);
And .Length() sure can't return < 0, since it is unsigned


>+
>+  for (uint32_t i = 0; i < audioSources.Length(); i++) {
++i


>+SpeechRecognition::StartRecording()
>+{
>+  if (!mAudioSource)
>+    return NS_ERROR_FAILURE;
NS_ENSURE_STATE(mAudioSource);


>+SpeechRecognition::Observe(nsISupports *aSubject, const char *aTopic, const PRUnichar * aData)
>+{
>+  NS_ASSERTION(NS_IsMainThread(), "Observer invoked off the main thread");
>+  nsCOMPtr<nsIObserverService> obs = services::GetObserverService();
>+
>+  if (!strcmp(aTopic, "xpcom-shutdown")) {
>+    // TODO: cleanup?
>+    return NS_OK;
>+  } else if (!strcmp(aTopic, "getUserMedia:response:allow")) {
>+    nsCOMPtr<nsIMediaDevice> device;
>+    if (aSubject) {
>+      device = do_QueryInterface(aSubject);
>+      if (device) {
>+        nsString type;
>+        device->GetType(type);
>+        if (type.EqualsLiteral("audio")) {
>+          nsString name;
>+          device->GetName(name);
Do you really need nsString in these two cases?
nsAutoString would be better since it does stack allocation by default


>+          SR_LOG("User provided device %s", NS_ConvertUTF16toUTF8(name).get());
>+          MediaDevice *dev = static_cast<MediaDevice*>(device.get());
MediaDevice* dev = ...

>+// TODO Mark this as resultNotAddRefed to return raw pointers
>+already_AddRefed<SpeechGrammarList>
>+SpeechRecognition::Grammars() const
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+	return 0;
>+}
>+
>+void
>+SpeechRecognition::SetGrammars(mozilla::dom::SpeechGrammarList& arg)
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+void
>+SpeechRecognition::GetLang(nsString& retval) const
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+void
>+SpeechRecognition::SetLang(const nsAString& arg)
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+bool
>+SpeechRecognition::Continuous() const
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+	return false;
>+}
>+
>+void
>+SpeechRecognition::SetContinuous(bool arg)
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+bool
>+SpeechRecognition::InterimResults() const
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+	return false;
>+}
>+
>+void
>+SpeechRecognition::SetInterimResults(bool arg)
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+uint32_t
>+SpeechRecognition::MaxAlternatives() const
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+	return 0;
>+}
>+
>+void
>+SpeechRecognition::SetMaxAlternatives(uint32_t arg)
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+void
>+SpeechRecognition::GetServiceURI(nsString& retval) const
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+}
>+
>+void
>+SpeechRecognition::SetServiceURI(const nsAString& arg)
>+{
>+	NS_ABORT_IF_FALSE(0, "not implemented");
>+}
Some \t characters in the code above


>+SpeechRecognition::getGUMData()
GetGUMData();

>+{
>+  nsresult rv;
>+  nsCOMPtr<nsIUUIDGenerator> uuidgen =
>+    do_GetService("@mozilla.org/uuid-generator;1", &rv);
>+  if (NS_FAILED(rv))
>+    return EmptyString();
if (expr) {
  stmt;
}

>+class SpeechEvent : public nsRunnable
>+{
>+public:
>+  SpeechEvent(SpeechRecognition *recognition, SpeechRecognition::EventType type)
>+  : mAudioSegment(0)
>+  , mRecognitionResultList(0)
>+  , mError(0)
>+  , mRecognition(recognition)
>+  , mType(type)
>+  {
>+  }
>+
>+  ~SpeechEvent();
>+
>+  NS_IMETHOD Run();
>+  AudioSegment *mAudioSegment;
>+  nsRefPtr<SpeechRecognitionResultList> mRecognitionResultList; // TODO: make this a session being passed which also has index and stuff
>+  nsCOMPtr<nsIDOMSpeechRecognitionError> mError;
>+
>+  friend class SpeechRecognition;
>+private:
>+  SpeechRecognition *mRecognition;
This is super-unsafe, once we stop leaking all the SpeechRecognition objects.
nsRefPtr<SpeechRecognition> ?


>+NS_IMPL_CYCLE_COLLECTION_WRAPPERCACHE_0(SpeechRecognitionResult)
_1 and mParent


>+SpeechRecognitionResultList::SpeechRecognitionResultList(SpeechRecognition *parent)
* aParent

>+SpeechStreamListener::NotifyQueuedTrackChanges(MediaStreamGraph *aGraph, TrackID aID,
>+                                TrackRate aTrackRate,
>+                                TrackTicks aTrackOffset,
>+                                uint32_t aTrackEvents,
>+                                const MediaSegment& aQueuedMedia)
Odd indentation


>+++ b/content/media/webspeech/recognition/energy_endpointer.h
>@@ -0,0 +1,156 @@
>+// Copyright (c) 2012 The Chromium Authors. All rights reserved.
>+// Use of this source code is governed by a BSD-style license that can be
>+// found in the LICENSE file.
We have other Chromium code in the tree too, but I wonder if we should keep such code in
certain places. You could perhaps ask cjones or bsmedberg or gerv.
Attachment #712684 - Flags: review?(bugs) → review-
Comment on attachment 712689 [details] [diff] [review]
Bug 650295 - Tests for Speech Recognition

Shouldn't we have fake events also in non-debug builds?
Might be better to enable that stuff based on some pref.

Coding style for if-else is

if (expr) {
  stmt1;
} else {
  stmt2;
}


I need to go through this more properly once I've re-reviewed the updated version of the first patch.
Attachment #712689 - Flags: review?(bugs)
Fixed coding style violations.

Known issue: SpeechEvents carry an nsRefPtr to SpeechRecognition, which is initialized off the main thread. Since SpeechRecognition is not thread-safe, this causes a crash.
Attachment #712684 - Attachment is obsolete: true
Due to the lack of a recognition service, the tests will fail to run. The next iteration of this patch should contain a FakeSpeechRecognitionService that allows the tests to be used.
Attachment #712689 - Attachment is obsolete: true
Attachment #713705 - Attachment is obsolete: true
Attachment #717310 - Flags: review?(bugs)
This new version contains only small (mostly cosmetic) changes compared to the previous one. Once again, the test browser_success_with_recognition_service.js is not expected to pass.
Attachment #713706 - Attachment is obsolete: true
Attachment #717312 - Flags: review?(bugs)
Comment on attachment 717310 [details] [diff] [review]
Bug 650295 - Implement the main state machine for Speech Recognition


>+class SpeechGrammar MOZ_FINAL : public nsISupports,
>+                                public nsWrapperCache,
>+                                public EnableWebSpeechCheck
>+{
>+public:
>+  SpeechGrammar(nsISupports* aParent);
>+  ~SpeechGrammar();
>+
>+  NS_DECL_CYCLE_COLLECTING_ISUPPORTS
>+  NS_DECL_CYCLE_COLLECTION_SCRIPT_HOLDER_CLASS(SpeechGrammar)
>+
>+  // TODO: return something sensible here, and change the return type
>+  nsISupports* GetParentObject() const;
You can remove the TODO comment, right?
>+class SpeechGrammarList MOZ_FINAL : public nsISupports,
>+                                    public nsWrapperCache,
>+                                    public EnableWebSpeechCheck
>+{
>+public:
>+  SpeechGrammarList(nsISupports* aParent);
>+  ~SpeechGrammarList();
>+
>+  NS_DECL_CYCLE_COLLECTING_ISUPPORTS
>+  NS_DECL_CYCLE_COLLECTION_SCRIPT_HOLDER_CLASS(SpeechGrammarList)
>+
>+  SpeechGrammarList* Constructor(const GlobalObject& aGlobal, ErrorResult& aRv);
>+
>+  // TODO: return something sensible here, and change the return type
>+  nsISupports* GetParentObject() const;
ditto

>+NS_INTERFACE_MAP_BEGIN(SpeechRecognition)
>+  NS_INTERFACE_MAP_ENTRY(nsIObserver)
>+  NS_INTERFACE_MAP_ENTRY_AMBIGUOUS(nsISupports, nsIObserver)
>+NS_INTERFACE_MAP_END_INHERITING(nsDOMEventTargetHelper)
nsDOMEventTargetHelper gives the right nsISupports, so remove
NS_INTERFACE_MAP_ENTRY_AMBIGUOUS(nsISupports, nsIObserver)
(I believe it would even be a security bug to return the wrong nsISupports)


>+SpeechRecognition::~SpeechRecognition()
>+{
>+  mSpeechDetectionTimer = nullptr;
There shouldn't be need for this. nsCOMPtr/nsRefPtr release automatically when the object owning them is deleted.


>+SpeechRecognition::StartedAudioCapture(SpeechEvent* aEvent)
>+{
>+  DispatchTrustedEvent(NS_LITERAL_STRING("start"));
>+  DispatchTrustedEvent(NS_LITERAL_STRING("audiostart"));
>+
>+  mEndpointer->SetEnvironmentEstimationMode();
>+  mEstimationSamples += ProcessAudioSegment(aEvent->mAudioSegment);
Should we dispatch both events after mEndpointer->SetEnvironmentEstimationMode() and mEstimationSamples += ProcessAudioSegment(aEvent->mAudioSegment); ?



>+SpeechRecognition::AbortSilently(SpeechEvent* aEvent)
>+{
>+  if (mRecognitionService) {
>+    mRecognitionService->Abort();
>+  }
>+
>+  if (STATE_BETWEEN(STATE_ESTIMATING, STATE_RECOGNIZING)) {
>+    StopRecording();
>+  }
>+
>+  DispatchTrustedEvent(NS_LITERAL_STRING("end"));
>+  return Reset();
Should we call Reset before dispatching end? The event listener for "end" might
try to re-start recognition or something like that.



>+SpeechRecognition::FSMState
>+SpeechRecognition::AbortError(SpeechEvent* aEvent)
>+{
>+  AbortSilently(aEvent);
>+  NotifyError(aEvent);
>+  return Reset();
>+}
Same here


>+SpeechRecognition::NotifyError(SpeechEvent* aEvent)
>+{
>+  nsCOMPtr<nsIDOMEvent> domEvent = do_QueryInterface(aEvent->mError);
>+  NS_ASSERTION(domEvent, "SpeechEvent::mError not a valid error event");
Not a very useful assertion. We crash anyway on the next line if mError isn't a DOM event.


>+SpeechRecognition::Observe(nsISupports* aSubject, const char* aTopic,
>+  const PRUnichar* aData)
align the last parameter under the other parameters

>+SpeechRecognition::SetGrammars(mozilla::dom::SpeechGrammarList& aArg,
>+  ErrorResult& aRv)
ditto

>+SpeechRecognition::Start(ErrorResult& aRv)
>+{
>+  if (!STATE_EQUALS(STATE_IDLE)) {
>+    aRv.Throw(NS_ERROR_DOM_INVALID_STATE_ERR);
>+    return;
>+  }
>+
>+  const char* speechRecognitionServiceCID =
>+    NS_SPEECH_RECOGNITION_SERVICE_CONTRACTID_PREFIX "google";
So we need to make this more generic. Add a pref for the service name?
"media.webspeech.service", "google"
(Btw, do we need a separate contractid for the simple Google service and for the continuous recognition service?)





>+
>+  nsresult rv;
>+  mRecognitionService = do_GetService(speechRecognitionServiceCID, &rv);
>+  NS_ASSERTION(mRecognitionService.get(),
>+               "failed to instantiate recognition service");
And if you read the pref, you then need to null-check here instead of doing the assertion.

>+SpeechRecognition::DispatchError(EventType aErrorType, int aErrorCode,
>+  const nsAString& aMessage)
Align the last parameter

>+{
>+  NS_ASSERTION(aErrorType == EVENT_RECOGNITIONSERVICE_ERROR ||
>+               aErrorType == EVENT_AUDIO_ERROR, "Invalid error type!");
Could you add MOZ_ASSERT to check that this is called only on main thread.



>+  SpeechEvent* event = new SpeechEvent(this, aErrorType);
nsRefPtr<SpeechEvent> so that you don't leak if NS_DispatchToMainThread fails
(it may fail during shutdown).



>+void
>+SpeechRecognition::FeedAudioData(already_AddRefed<SharedBuffer> aSamples,
>+  unsigned int aDuration,
>+  MediaStreamListener* aProvider)
Align params

>+{
>+  NS_ASSERTION(!NS_IsMainThread(),
>+               "FeedAudioData should not be called in the main thread");
>+
>+  AudioSegment* segment = new AudioSegment();
>+
>+  nsAutoTArray<const int16_t*, 1> channels;
>+  channels.AppendElement(static_cast<const int16_t*>(aSamples.get()->Data()));
>+  segment->AppendFrames(aSamples, channels, aDuration);
>+
>+  SpeechEvent* event = new SpeechEvent(this, EVENT_AUDIO_DATA);
nsRefPtr<SpeechEvent>

>+
>+// TODO: ownership model for this
So the ownership is now clear?


>+
>+  // Mark this as resultNotAddRefed to return raw pointers
>+  already_AddRefed<SpeechRecognitionAlternative> Item(uint32_t aIndex);
Remove the //.... comment


>+class SpeechRecognitionResultList MOZ_FINAL : public nsISupports,
>+                                              public nsWrapperCache,
>+                                              public EnableWebSpeechCheck
>+{
>+public:
>+  SpeechRecognitionResultList(SpeechRecognition* aParent);
>+  ~SpeechRecognitionResultList();
>+
>+  NS_DECL_CYCLE_COLLECTING_ISUPPORTS
>+  NS_DECL_CYCLE_COLLECTION_SCRIPT_HOLDER_CLASS(SpeechRecognitionResultList)
>+
>+  // TODO: return something sensible here, and change the return type
>+  nsISupports* GetParentObject() const;
Remove TODO comment

>+
>+  virtual JSObject* WrapObject(JSContext* aCx, JSObject* aScope,
>+                               bool* aTriedToWrap);
>+
>+  uint32_t Length() const;
>+
>+  // Mark this as resultNotAddRefed to return raw pointers
>+  already_AddRefed<SpeechRecognitionResult> Item(uint32_t aIndex);
>+
>+  already_AddRefed<SpeechRecognitionResult> IndexedGetter(uint32_t aIndex, bool& aPresent);
>+
>+  nsTArray<nsRefPtr<SpeechRecognitionResult>> mItems;
You need to cycle collect mItems
>+
>+// TODO should this be in the DOM namespace?
dom namespace is ok.


In general looking really good so far :)
Attachment #717310 - Flags: review?(bugs) → review-
Comment on attachment 717312 [details] [diff] [review]
Bug 650295 - Tests for Speech Recognition

I think for testing purposes we want mTestConfig to be available even in opt builds, but perhaps use
a pref for it.


> SpeechRecognition::StopRecording()
> {
>+#ifdef DEBUG
>+  if (mTestConfig.fakeFSMEvents) {
>+    NS_ENSURE_STATE(mDOMStream.get() == nullptr);
>+  }
>+#endif
Could you explain this?
Also, it could be NS_ENSURE_STATE(!mDOMStream);



>@@ -433,20 +451,72 @@
> 
>   if (!strcmp(aTopic, NS_TIMER_CALLBACK_TOPIC) &&
>       STATE_BETWEEN(STATE_IDLE, STATE_WAITING_FOR_SPEECH)) {
> 
>     DispatchError(SpeechRecognition::EVENT_AUDIO_ERROR,
>                   nsIDOMSpeechRecognitionError::NO_SPEECH,
>                   NS_LITERAL_STRING("No speech detected (timeout)"));
>   }
>+#ifdef DEBUG
>+  else if (mTestConfig.fakeFSMEvents &&
>+          !strcmp(aTopic, SPEECH_RECOGNITION_TEST_EVENT_REQUEST_TOPIC)) {
>+    ProcessTestEventRequest(nsDependentString(aData));
>+  }
>+#endif




>+SpeechRecognition::ProcessTestEventRequest(const nsAString& aEventName)
>+{
>+  if (aEventName.EqualsLiteral("EVENT_START")) {
>+    ErrorResult err;
>+    Start(err);
>+    StartRecording(nullptr);
>+  } else if (aEventName.EqualsLiteral("EVENT_STOP")) {
>+    Stop();
>+  } else if (aEventName.EqualsLiteral("EVENT_ABORT")) {
>+    Abort();
>+  } else if (aEventName.EqualsLiteral("EVENT_AUDIO_ERROR")) {
>+    DispatchError(SpeechRecognition::EVENT_AUDIO_ERROR,
>+                  nsIDOMSpeechRecognitionError::AUDIO_CAPTURE, // TODO different codes?
>+                  NS_LITERAL_STRING("AUDIO_ERROR test event"));
>+  } else if (aEventName.EqualsLiteral("EVENT_AUDIO_DATA")) {
>+    const char* mockAudioDataFilename = "browser/content/media/webspeech/recognition/test/hello.wav";
Hmm, could we read the file name from the fake service? And one could set the file name on the fake service.
It would make it easier to write certain automatic tests. And perhaps the fake service could have a method
which takes a url and gives back the data. It might need to read the data synchronously, which is rather bad, but
then, the service is for testing only.

>+    SR_LOG("Using mock audio file '%s'\n", mockAudioDataFilename);
>+
>+    FILE* mockAudioDataFile = fopen(mockAudioDataFilename, "rb");
>+    NS_ASSERTION(mockAudioDataFile, "failed to open mock audio data file");
>+
>+    while (true) {
>+      nsRefPtr<SharedBuffer> samples(SharedBuffer::Create(160 * sizeof(int16_t)));
>+      int16_t* to = static_cast<int16_t*>(samples->Data());
>+      unsigned int read = fread(to, sizeof(int16_t), 160, mockAudioDataFile);
>+      if (read == 0) break;
>+
>+      /* FIXME pass FakeSpeechStreamListener when we have it */
>+      FeedAudioData(samples.forget(), read, nullptr);
>+    }
If a url were taken as a param, then the data reading should use necko APIs.
Or, hmm, perhaps sync XHR would work.


>+FakeSpeechRecognitionService::Initialize(dom::SpeechRecognition* aSpeechRecognition)
>+{
>+    mRecognition = aSpeechRecognition;
>+    nsCOMPtr<nsIObserverService> obs = services::GetObserverService();
>+    obs->AddObserver(this, SPEECH_RECOGNITION_TEST_EVENT_REQUEST_TOPIC, false);
>+    return NS_OK;
>+}
2 space indentation here and elsewhere

>+  } else if (eventName.EqualsLiteral("EVENT_RECOGNITIONSERVICE_FINAL_RESULT")) {
>+    SpeechEvent* event = new SpeechEvent(mRecognition, SpeechRecognition::EVENT_RECOGNITIONSERVICE_FINAL_RESULT);
nsRefPtr<SpeechEvent>


>+FakeSpeechRecognitionService::BuildMockResultList()
>+{
>+  SpeechRecognitionResultList *resultList = new SpeechRecognitionResultList(mRecognition);
* goes with the type. Same elsewhere
Attachment #717312 - Flags: review?(bugs) → review-
Attachment #717310 - Attachment is obsolete: true
Attachment #721943 - Flags: review?(bugs)
In this iteration, the tests were converted to plain mochitest files. We can now specify the input audio file for tests in JS, using an <audio> element's mozCaptureStreamUntilEnded() method to get a stream that mocks microphone input.
Attachment #717312 - Attachment is obsolete: true
Attachment #721945 - Flags: review?(bugs)
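A sketch of that technique follows; mozCaptureStreamUntilEnded() is an existing Gecko API, but how the captured stream is handed to the recognizer is an assumption for illustration:

// Capture a prerecorded file as a MediaStream to stand in for the
// microphone during tests.
var audio = new Audio("hello.wav");
var stream = audio.mozCaptureStreamUntilEnded();
audio.play();

// Hypothetical: point the recognition session at the captured stream
// instead of live getUserMedia input.
var sr = new SpeechRecognition();
sr.start(stream);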
Refreshing the patch with some small fixes and moz.build conversion.
Attachment #721943 - Attachment is obsolete: true
Attachment #721943 - Flags: review?(bugs)
Attachment #722596 - Flags: review?(bugs)
Attachment #721945 - Attachment is obsolete: true
Attachment #721945 - Flags: review?(bugs)
Attachment #722597 - Flags: review?(bugs)
Comment on attachment 722596 [details] [diff] [review]
Bug 650295 - Implement the main state machine for Speech Recognition

>+SpeechGrammarList::AddFromURI(const nsAString& aSrc,
>+  const Optional<float>& aWeight, ErrorResult& aRv)
Could you align the params below const nsAString& aSrc

>+SpeechGrammarList::AddFromString(const nsAString& aString,
>+  const Optional<float>& aWeight, ErrorResult& aRv)
Similar thing here


>+SpeechGrammarList::IndexedGetter(uint32_t aIndex, bool& aPresent,
>+  ErrorResult& aRv)
And here

>+  virtual JSObject* WrapObject(JSContext* aCx, JSObject* aScope,
>+                               bool* aTriedToWrap);
Note, the last param has been removed from WrapObject


Could you make sure that PrefControlled works ok when the pref is false.
None of the interfaces should be visible in the global scope in that case.


Need to still review SpeechRecognition and SpeechStreamListener
Comment on attachment 722596 [details] [diff] [review]
Bug 650295 - Implement the main state machine for Speech Recognition


>+  SR_LOG("created SpeechRecognition");
>+  SetIsDOMBinding();
>+  mEndpointer->set_speech_input_complete_silence_length(500000);
>+  mEndpointer->set_long_speech_input_complete_silence_length(1000000);
>+  mEndpointer->set_long_speech_length(3*1000000);
space before and after *
Might be good to have prefs for all these values.
So, something like 
mEndpointer->set_speech_input_complete_silence_length(Preferences::GetInt("media.webspeech.silence_length", 500000));
etc.

>+
>+  nsresult rv = mSpeechDetectionTimer->Init(this, kSPEECH_DETECTION_TIMEOUT_MS,
>+                                            nsITimer::TYPE_ONE_SHOT);
>+  NS_ENSURE_SUCCESS(rv, rv);
>+
>+  return NS_OK;
>+}
Just
return mSpeechDetectionTimer->Init(this, kSPEECH_DETECTION_TIMEOUT_MS, nsITimer::TYPE_ONE_SHOT);


>+  nsCOMPtr<nsIDOMEvent> domEvent;
>+  NS_NewDOMSpeechRecognitionError(getter_AddRefs(domEvent), nullptr, nullptr);
>+
>+  nsCOMPtr<nsIDOMSpeechRecognitionError> srError = do_QueryInterface(domEvent);
Oh, the interface should be nsIDOMSpeechRecognitionErrorEvent 
I filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=21272

No need to change code before that bug is fixed though



>+  nsAutoPtr<Endpointer> mEndpointer;
Hmm, couldn't this be just Endpointer mEndpointer.


>+  nsRefPtr<SharedBuffer> samples(SharedBuffer::Create(aChunk.mDuration
>+                                                      * 1 // channel
>+                                                      * sizeof(int16_t)));
* should be in the previous line in both cases

Someone familiar with our media APIs should take a look too.
Attachment #722596 - Flags: review?(bugs) → review+
Comment on attachment 722597 [details] [diff] [review]
Bug 650295 - Tests for Speech Recognition

> SpeechRecognition::SpeechRecognition()
>   : mProcessingEvent(false)
>   , mEndpointer(new Endpointer(kSAMPLE_RATE))
>+  , mAudioSamplesPerChunk(mEndpointer->FrameSize())
>   , mSpeechDetectionTimer(do_CreateInstance(NS_TIMER_CONTRACTID))
> {
>   SR_LOG("created SpeechRecognition");
>   SetIsDOMBinding();
>+
>+  Preferences::AddBoolVarCache(&mTestConfig.enableTests, TEST_PREFERENCE_ENABLE);
>+  if (mTestConfig.enableTests) {
>+    nsCOMPtr<nsIObserverService> obs = services::GetObserverService();
>+    obs->AddObserver(this, SPEECH_RECOGNITION_TEST_EVENT_REQUEST_TOPIC, false);
>+    obs->AddObserver(this, SPEECH_RECOGNITION_TEST_END_TOPIC, false);
>+    Preferences::AddBoolVarCache(&mTestConfig.fakeFSMEvents, TEST_PREFERENCE_FAKE_FSM_EVENTS);
>+    Preferences::AddBoolVarCache(&mTestConfig.fakeRecognitionService, TEST_PREFERENCE_FAKE_RECOGNITION_SERVICE);
We don't want to add these bool var caches each time a SpeechRecognition object is created, only once.


>-uint64_t
>+unsigned int
> SpeechRecognition::ProcessAudioSegment(AudioSegment* aSegment)
> {
>   AudioSegment::ChunkIterator iterator(*aSegment);
>-  uint64_t samples = 0;
>+  unsigned int samples = 0;
Why?
I'd really prefer an explicit int type.


>+ * Buffer audio samples into mAudioSamplesBuffer until aBufferSize.
>+ * Updates mBufferedSamples and returns the number of samples that were buffered.
>+ */
>+unsigned int
>+SpeechRecognition::FillSamplesBuffer(const int16_t* aSamples,
>+                                     unsigned int aSampleCount)
>+{
unsigned int?

(need to get these patches landed, disabled by default. Easier to then hack this stuff some more)
Attachment #722597 - Flags: review?(bugs) → review+
Try run for 39466d8f1c53 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=39466d8f1c53
Results (out of 18 total builds):
    failure: 18
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nsm.nikhil@gmail.com-39466d8f1c53
Try run for 727cb5c33e69 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=727cb5c33e69
Results (out of 233 total builds):
    exception: 8
    success: 78
    warnings: 60
    failure: 87
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nsm.nikhil@gmail.com-727cb5c33e69
Try run for b0594e7aa914 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=b0594e7aa914
Results (out of 23 total builds):
    exception: 22
    failure: 1
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nsm.nikhil@gmail.com-b0594e7aa914
Depends on: 852196
Updating the patches. Several small changes since the last version.
Attachment #722596 - Attachment is obsolete: true
Attachment #722597 - Attachment is obsolete: true
Attachment #729278 - Flags: review?(bugs)
Attachment #729280 - Flags: review?(bugs)
Comment on attachment 729280 [details] [diff] [review]
Bug 650295 - Tests for Speech Recognition

Nit, void Init() {
'{' goes to its own line.
Attachment #729280 - Flags: review?(bugs) → review+
Attachment #729278 - Flags: review?(bugs) → review+
Refreshing with licensing resolved and smaug's last comments. Keeping the r+ from smaug.
Attachment #729278 - Attachment is obsolete: true
Attachment #730307 - Flags: review+
Attachment #729280 - Attachment is obsolete: true
Attachment #730308 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/dfd7768f8f48
https://hg.mozilla.org/mozilla-central/rev/c5e0abff4496
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Depends on: 855704
Blocks: 855728
This seems to be missing review from a build peer.
Depends on: 855735
No longer blocks: 855728
Depends on: 855728
Why would this need a review from a build peer?
If I missed something, my bad, but this passed on try.
relnote-firefox: --- → ?
No relnote yet. This won't be enabled before we have a backend.
QA Contact: simona.marcu
Depends on: 861538
Depends on: 861590
(In reply to Oskar Ivanić (:icecold) from comment #0)
> User-Agent:       Mozilla/5.0 (Windows NT 6.1; rv:6.0a1) Gecko/20110414
> Firefox/6.0a1
> Build Identifier: 
> 
> "With this API, developers can give web apps the ability to transcribe your
> voice to text. When a web page uses this feature, you simply click on an
> icon and then speak into your computer’s microphone. The recorded audio is
> sent to speech servers for transcription, after which the text is typed out
> for you."
> 
> Google implemented support in Chrome and made it available on Google
> Translate. (You need the Chrome Canary build.)
> 
> Reproducible: Always

So when will the Google speech API be available in Firefox?
Hi All,

This bug's status says it is resolved.
In which release version has it been resolved? Please update.

Thanks,
Mahesh
Hello Guilherme!

So is this media.webspeech.recognition.enable;false and media.webspeech.synth.enabled;false on Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0 ID:20151019030227 CSet: 1a157155a4fe0074b3d03b54fe9e466472c2cd56 ?

How do I go about verifying this bug?

All demos out there on the web seem to depend on x-webkit-speech

Thanks!
Alex
Flags: needinfo?(guilherme.p.gonc+bmo)
Flags: needinfo?(guilherme.p.gonc+bmo)
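One way to sanity-check the landing, assuming the media.webspeech.recognition.enable pref mentioned above gates the DOM interface (flip it in about:config first):

// With the pref true, the constructor should be exposed on window;
// with it false, it should be absent.
if (typeof window.SpeechRecognition === "function") {
  console.log("SpeechRecognition is exposed:", new SpeechRecognition());
} else {
  console.log("SpeechRecognition not available (pref off or no backend).");
}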
Blocks: 1685416