Bug 525444 - Expose text to speech (TTS) capability to content
Status: RESOLVED FIXED
Keywords: access, dev-doc-needed
Product: Core
Classification: Components
Component: DOM
Version: unspecified
Platform: All All
Importance: -- normal with 2 votes
Target Milestone: mozilla23
Assigned To: Eitan Isaacson [:eeejay]
QA Contact:
Mentors:
Duplicates: 687879
Depends on: b2g-a11y 856370 857673 857994 858012 858014 858136 858529 858973 859246 864858 868703 TTS_for_firefox
Blocks: webapi b2g-v-next 906867
Reported: 2009-10-30 07:11 PDT by David Bolter [:davidb]
Modified: 2014-05-22 15:54 PDT
CC List: 42 users
ryanvm: in‑testsuite+
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
(Part 1/3) Basic SpeechSynthesis setup. (43.25 KB, patch)
2013-02-19 20:18 PST, Eitan Isaacson [:eeejay]
bugs: review-
(Part 2/3) Added speech service API. (57.04 KB, patch)
2013-02-19 20:19 PST, Eitan Isaacson [:eeejay]
bugs: review-
(Part 3/3) Support OOP speech synth (54.57 KB, patch)
2013-02-19 20:21 PST, Eitan Isaacson [:eeejay]
bugs: review-
Bug 525444 - (Part 1/3) Basic SpeechSynthesis setup and voice registration. (71.86 KB, patch)
2013-03-07 16:49 PST, Eitan Isaacson [:eeejay]
bugs: review+
Bug 525444 - (Part 2/3) Added speech service API. (46.71 KB, patch)
2013-03-07 16:50 PST, Eitan Isaacson [:eeejay]
bugs: review-
Bug 525444 - (Part 3/3) Support OOP speech synth (59.37 KB, patch)
2013-03-07 16:50 PST, Eitan Isaacson [:eeejay]
bugs: review+
Bug 525444 - (Part 1/3) Basic SpeechSynthesis setup and voice registration. (73.05 KB, patch)
2013-03-19 10:28 PDT, Eitan Isaacson [:eeejay]
no flags
Bug 525444 - (Part 2/3) Added speech service API. (51.06 KB, patch)
2013-03-19 10:29 PDT, Eitan Isaacson [:eeejay]
bugs: review+
Bug 525444 - (Part 3/3) Support OOP speech synth (59.13 KB, patch)
2013-03-19 10:30 PDT, Eitan Isaacson [:eeejay]
no flags

Description David Bolter [:davidb] 2009-10-30 07:11:04 PDT
Why is there no standard document.say("foo"); ?

Windows has MSAPI.
Mac has VoiceOver.
Linux has GnomeSpeech.

Has this come up before in ecma/js or dom discussions?
Comment 1 Olli Pettay [:smaug] 2009-10-30 07:19:14 PDT
W3C has multimodal WG.

There are specifications like XHTML+Voice (mainly from IBM) and SALT (mainly from
Microsoft) which can do this.
SALT is much simpler, but I don't know if it is being actively developed anymore.
Comment 2 David Bolter [:davidb] 2009-10-30 07:45:47 PDT
Interesting.

Something to keep in mind is whether self-voicing web applications should be
encouraged or not. There isn't a good success story on the desktop. Still, it does open another door to innovation on the web, and who am I to say great things won't happen.
Comment 3 David Bolter [:davidb] 2010-02-05 08:19:14 PST
As html canvas is raw visuals, I think we also need raw aural (including TTS), raw haptics etc. I think giving web devs only some of the palette hurts accessibility/usability.
Comment 4 Olli Pettay [:smaug] 2010-02-06 08:41:02 PST
Yeah, adding TTS support would be great. The main problem, IMO, is
that many systems don't have good TTS engines.
Also, if we support TTS, I think ASR should be supported too.
(though, not sure what kind of ASR; probably something with grammars and semantic interpretation.)
Comment 5 David Bolter [:davidb] 2010-02-07 11:04:09 PST
True, but we could support it on systems that have TTS.

For ASR, now you're talking input :) For this scenario a lot can happen without our involvement, since the ASR engine is usually turning speech into synthetic events, and using TSF (bug 478029) where available. I think the world might finally be ready for this stuff.

Maybe multimodality is something Labs could explore.
Comment 6 Olli Pettay [:smaug] 2010-02-07 11:21:03 PST
(In reply to comment #5)
> True, but we could support it on systems that have TTS.
Right. I wonder which TTS parameters should be exposed to the
web. A list of supported languages (some TTS engines aren't very good with Finnish, but may handle English very well)?
And how to expose TTS to web. Something similar to SALT?
 
> For ASR, now you're talking input :)
Yes, it is a different thing, but perhaps, or probably, when designing the
API for TTS, we need to think about how to handle ASR.

> For this scenario a lot can happen without
> our involvement,
Well, ASR may need context sensitive information, like grammars and semantic
interpretation. W3C has some recommendations for all this, and those should
be probably used.

> since the ASR engine is usually turning speech into synthetic
> events,
hmm, synthetic events? I'd guess using ASR for message (for example SMS)
dictation on mobile platforms might be a pretty nice feature. In that case ASR would turn speech into text, not only into some events.
Though, dictation can't easily be grammar based.
 
> Maybe multimodality is something Labs could explore.
Indeed.
Comment 7 Olli Pettay [:smaug] 2010-02-07 11:26:36 PST
Would be perhaps useful to write down a list of OSes on which we can
support TTS (and ASR).
Also writing down some kind of requirements document for the 1st version of
possible API could be useful. Though, that might need to happen in W3C.
Comment 8 David Bolter [:davidb] 2010-02-09 11:00:35 PST
(In reply to comment #7)
> Would be perhaps useful to write down a list of OSes on which we can
> support TTS (and ASR).

Agreed. I'll try to fit that in somewhere; but others are welcome to dive in!

> Also writing down some kind of requirements document for the 1st version of
> possible API could be useful. Though, that might need to happen in W3C.

Probably. The Multimodal group seems to have bitten off too much for this to happen quickly there... not sure.
Comment 9 Reece H. Dunn 2010-04-08 12:44:58 PDT
Sorry for the long post...

Is there any reason why the Speech Synthesis Markup Language (SSML), Pronunciation Lexicon Specification (PLS) and CSS Speech (aural style sheets) module cannot be supported? These are web/W3C standards that are aimed at informing a browser how to speak the text to a listener.

On Windows, SAPI4/5 can be used to interface with text-to-speech engines. On Linux, doesn't Gnome's GAIL API provide a similar mechanism? I'm not sure of the API that KDE/Qt3 and Qt4 use, nor Mac.

In terms of TTS engines, Windows comes with a default voice (Sam, Mary, etc.). On Linux, a lot of the distributions come with eSpeak which is very good (it has a direct API, but is licensed as GPLv3 so there may be issues using that API directly). I don't know what voice/TTS engine Macs come with by default.

There are several issues that need to be resolved:

  1/  SSML support is provided by the different engines to varying levels of support, but is not easy to tie into the Firefox DOM;

  2/  most (all?) engines don't support using IPA to specify pronunciation as required by SSML and PLS;

  3/  not all engines (e.g. eSpeak) support adding custom pronunciations as specified by PLS;

  4/  the speech needs to be run on a separate thread to prevent it blocking the UI, so this needs to be coordinated with events -- crashes in the thread?;

  5/  should the UI highlight the words being spoken so that a user can follow the text as it is being spoken?;

  6/  need to control the TTS engine in text chunks (e.g. to support playing audio cues on hyperlinks) which could complicate audio production (synchronisation, etc.);

  7/  audio will need to be managed by the browser, not through the TTS engine (so that they are not competing for sound resources and the audio can be synchronised when playing external sounds, or mixing with something like SMIL);

  8/  how does this interact with accessibility APIs (IAccessible, GAIL, UIAutomation, etc.) and assistive technologies (screen readers like JAWS)?;

  9/  other things I haven't thought of/listed here.

To be done properly, this should support SSML/PLS, HTML/CSS, SMIL and potentially all other web standards that expose text (speaking MathML content?).

In terms of providing specific JavaScript APIs, this should be supported as well through the appropriate standards and should integrate into the overall audio/aural browser support.

Then there is the management of this. This bug should probably be a meta-bug (or another bug created as the meta-bug), and a bug created for:
  *  TTS core;
  *  TTS on Windows, on Linux/Gnome/Gail, on Linux/Qt, on Mac;
  *  SSML support;
  *  PLS support;
  *  Core CSS Speech module support;
  *  HTML speech support;
  *  MathML support for spoken mathematical expressions;
  *  SVG support for speaking the text components;
  *  ASR support core;
  *  ASR events -- "links", navigating a hyperlink, interacting with the browser, etc. (overlap with assistive technology tools/accessibility APIs?);
  *  anything else I've missed.
Comment 10 Olli Pettay [:smaug] 2010-04-08 12:58:03 PDT
(In reply to comment #9)
> Is there any reason why the Speech Synthesis Markup Language (SSML),
> Pronunciation Lexicon Specification (PLS) and CSS Speech (aural style sheets)
> module cannot be supported?
Someone would need to implement them ;)
And I'm not quite sure if SSML or PLS are good for a web browser or for the web.
Maybe, maybe not.

> These are web/W3C standards that are aimed at
> informing a browser how to speak the text to a listener.
Yup. Although the target is, afaik, more things like voice browsers.

>   1/  SSML support is provided by the different engines to varying levels of
> support, but is not easy to tie into the Firefox DOM;
Before this someone should evaluate if SSML is good for the web.
Same thing with PLS.

>   7/  audio will need to be managed by the browser, not through the TTS engine
> (so that they are not competing for sound resources and the audio can be
> synchronised when playing external sounds, or mixing with something like SMIL);
If something like SALT was implemented, its PromptQueue could be extended
a bit to handle all this.
Comment 11 timeless 2010-04-08 13:07:10 PDT
GPL code can't run in our process. It would have to run in someone else's process.

there's also http://www.ibm.com/developerworks/xml/standards/x-voicexmlspec.html fwiw, iirc that's the one that tellme uses.
Comment 12 Reece H. Dunn 2010-04-08 14:41:56 PDT
(In reply to comment #10)
> (In reply to comment #9)
> > Is there any reason why the Speech Synthesis Markup Language (SSML),
> > Pronunciation Lexicon Specification (PLS) and CSS Speech (aural style sheets)
> > module cannot be supported?
> Someone would need to implement them ;)

I'd be willing to work on this if there is interest.

> And I'm not quite sure if SSML or PLS are good for a web browser or for the
> web.
> Maybe, maybe not.

They are a W3C standard :), so are theoretically more relevant than an API from a specific vendor.

> > These are web/W3C standards that are aimed at
> > informing a browser how to speak the text to a listener.
> Yup. Although the target is, afaik, more things like voice browsers.

If the aim is to support text-to-speech in Firefox, it would seem a bit silly to have it working for HTML, MathML and others, but not SSML.

If it was something like speaking RTF, ODF, PDF or any other document formats then that would not be in the remit of Firefox, as those are not web.

SSML is targeted to "voice browsers" (it is from the "voice browser" working group in W3C), but is aimed at the *synthesis* of speech to aid in controlling/directing a text-to-speech engine, so is relevant here (a page wanting to generate spoken data could generate an SSML file in JavaScript and get the browser to render it -- a bit convoluted and more complex than something like document.say, but more flexible and uses W3C standards).

> >   1/  SSML support is provided by the different engines to varying levels of
> > support, but is not easy to tie into the Firefox DOM;
> Before this someone should evaluate if SSML is good for the web.
> Same thing with PLS.

SSML makes use of PLS, so it makes sense to support both.

SMIL can reference SSML files via the ref element (e.g. <ref src="hello.ssml">), so SMIL support is lacking this functionality if SSML is not supported.

Also, as I have mentioned above, if you are supporting aural CSS then it makes sense to also support SSML.

NOTE: Opera has limited support for SSML and supports aural CSS/CSS3 Speech module [http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/#ssml].

I don't know how prevalent SSML is on the web (especially outside controlling text-to-speech pronunciation/prosody), nor if that is due to the lack of support by browsers.

> >   7/  audio will need to be managed by the browser, not through the TTS engine
> > (so that they are not competing for sound resources and the audio can be
> > synchronised when playing external sounds, or mixing with something like SMIL);
> If something like SALT was implemented, its PromptQueue could be extended
> a bit to handle all this.

From what I can see, SALT is more akin to VoiceXML (developed by Microsoft to compete with VoiceXML) and is aimed at speech recognition/controlling interfaces through speech input, as opposed to generating speech via a text-to-speech engine. They are solving a different (but related) problem.

Also, PromptQueue is not really relevant for synchronising the audio generated. You could have "Go" spoken by the TTS engine, then bell.wav played on entry to a hyperlink element (via aural CSS/CSS Speech module), then "here" spoken, then bell2.wav played when leaving the hyperlink element, then "to find out more." spoken. All of this audio needs to be synchronised by the browser in order to produce a smooth audio experience.

In addition to this, you may have mixed content in a SMIL file -- audio may be playing in the background; SSML files may need to be rendered as speech; video files may need to play at certain points and so on.

Hooking this up with ASR/VoiceXML/SALT/... is another complex area!
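
Editorial sketch of the hyperlink example in comment 12: one way to keep speech and audio cues in order is to flatten the content into a single playback queue that the browser schedules. This is only an illustration under assumptions. The node shape (`text`, `children`, `cueBefore`, `cueAfter`) and the function name `buildPlaybackQueue` are hypothetical and do not come from any spec or from Firefox.

```javascript
// Hypothetical sketch: flatten a content tree into an ordered queue of
// speech and audio-cue items, so one component can schedule both.
// Assumed node shapes for this example:
//   { text: "..." }                                      -> spoken text
//   { children: [...], cueBefore: url, cueAfter: url }   -> element with cues
function buildPlaybackQueue(node, queue = []) {
  if (typeof node.text === "string") {
    const text = node.text.trim();
    if (text) queue.push({ type: "speech", text });
    return queue;
  }
  if (node.cueBefore) queue.push({ type: "audio", src: node.cueBefore });
  for (const child of node.children || []) {
    buildPlaybackQueue(child, queue);
  }
  if (node.cueAfter) queue.push({ type: "audio", src: node.cueAfter });
  return queue;
}

// "Go <a>here</a> to find out more." with bells around the link.
const sentence = {
  children: [
    { text: "Go " },
    { cueBefore: "bell.wav", cueAfter: "bell2.wav", children: [{ text: "here" }] },
    { text: " to find out more." },
  ],
};
const playbackQueue = buildPlaybackQueue(sentence);
```

The resulting queue alternates speech and audio items in document order, which is the sequence a scheduler would need in order to hand each item to the TTS engine or the audio backend in turn.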
Comment 13 David Bolter [:davidb] 2010-04-08 18:07:52 PDT
(In reply to comment #12)
> (In reply to comment #10)
> > (In reply to comment #9)
> > > Is there any reason why the Speech Synthesis Markup Language (SSML),
> > > Pronunciation Lexicon Specification (PLS) and CSS Speech (aural style sheets)
> > > module cannot be supported?
> > Someone would need to implement them ;)
> 
> I'd be willing to work on this if there is interest.

That's amazing.

> 
> > And I'm not quite sure if SSML or PLS are good for a web browser or for the
> > web.
> > Maybe, maybe not.
> 
> They are a W3C standard :), so are theoretically more relevant than an API from
> a specific vendor.

We don't want that; but I think we might want a baby-step API. Something that all browsers can implement without too much trouble might help interoperability.

> SSML is targeted to "voice browsers" (it is from the "voice browser" working
> group in W3C), but is aimed at the *synthesis* of speech to aid in
> controlling/directing a text-to-speech engine, so is relevant here (a page
> wanting to generate spoken data could generate an SSML file in JavaScript and
> get the browser to render it -- a bit convoluted and more complex than
> something like document.say, but more flexible and uses W3C standards).

This is what worries me. How convoluted and complex is it?
Comment 14 David Bolter [:davidb] 2010-04-08 18:08:55 PDT
(In reply to comment #13)
> > They are a W3C standard :), so are theoretically more relevant than an API from
> > a specific vendor.
> 
> We don't want that;

Clarification: I mean we don't want vendor specific API.
Comment 15 Olli Pettay [:smaug] 2010-04-09 02:00:43 PDT
(In reply to comment #12)
> NOTE: Opera has limited support for SSML and supports aural CSS/CSS3 Speech
> module
> [http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/#ssml].

Opera has also for a long time supported XHTML+Voice, which is in many ways
a pretty bad spec and something I wouldn't want to see used on the web.

SALT is much closer to XHTML+Voice than VoiceXML. It seems that SALT isn't
being developed actively anymore, but in any case how it works and what kind
of API it has fit the web way better than XHTML+Voice.
(And SALT is also *a lot* easier to implement, trust me ;))

I think implementing something quite close to SALT would be a good first step
to bring ASR/TTS support to the web.
It doesn't need to be SALT, but it could be something new, which could
be hopefully standardized eventually.
Comment 16 Olli Pettay [:smaug] 2010-04-09 02:03:22 PDT
Just implementing SSML or PLS doesn't give us much.
There must be something to control TTS and ASR.
Like you have in your list: TTS core and ASR core. Those are the most important
parts, at least as a first step.
Comment 17 Reece H. Dunn 2010-04-09 05:33:23 PDT
(In reply to comment #16)
> Just implementing SSML or PLS doesn't give us much.
> There must be something to control TTS and ASR.
> Like you have in your list: TTS core and ASR core. Those are the most important
> parts, at least as a first step.

Right. It would be better to get something like Aural CSS or the CSS3 Speech Module working first and getting HTML talking to you! This should lay the foundation for building on the other work -- MathML reading, SSML, etc.
Comment 18 Reece H. Dunn 2010-04-09 06:14:19 PDT
(In reply to comment #13)
> (In reply to comment #12)
> > SSML is targeted to "voice browsers" (it is from the "voice browser" working
> > group in W3C), but is aimed at the *synthesis* of speech to aid in
> > controlling/directing a text-to-speech engine, so is relevant here (a page
> > wanting to generate spoken data could generate an SSML file in JavaScript and
> > get the browser to render it -- a bit convoluted and more complex than
> > something like document.say, but more flexible and uses W3C standards).
> 
> This is what worries me. How convoluted and complex is it?

The SSML standard does not specify anything about a DOM. The conclusion I make about this is that with the current standard, the only way to get SSML to 'play' would be to navigate to it, or specify a reference to it in SMIL and use the SMIL DOM to control the playback.

Also, because of the lack of a DOM, you need to create the SSML string directly.

So, AFAICS, something like:

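 <iframe id="frame"></iframe>
 <script>
  // getCurrentTime() is assumed to be defined elsewhere and to return a time string.
  var frame = document.getElementById("frame");
  var ssml = "<?xml version='1.0'?><speak xml:lang='en' version='1.0'  xmlns='http://www.w3.org/2001/10/synthesis'><p>It is now <say-as type='time:hm'>" + getCurrentTime() + "</say-as>.</p></speak>";
  frame.innerHTML = ssml;
 </script>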

I haven't looked into how exactly this would work in practice, or what the recommended approach is here due to a lack of DOM specification and processing details for agents other than "voice browsers" -- the understanding that I get on the SSML processing details is that an agent will start speaking the SSML text directly.

Indeed, a lot (if not all) of the text-to-speech engines support SSML as an input. However, these often miss IPA phoneme support (for phonemes="..."), often accept malformed inputs and vary in implementation quality. They also lack an interface for interacting with a UI client like Firefox, and don't support the ability to reuse the internal DOM that Firefox would use to store the file (i.e. an XML DOM), making it difficult to do things like render the SSML file to the screen and highlight words as they are being spoken.

SSML support may not therefore end up being supported. However, aural CSS and the CSS3 Speech module also have a similar issue: there is no DOM/JavaScript API I know of to say "speak this page/fragment". This means that if enabled, all pages will be spoken on loading, which complicates the usage.

So... we need to understand the scope of what should be implemented, and understand the requirements/usage for Firefox as a whole w.r.t. ASR/TTS support.
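
Editorial sketch for comment 18: since SSML has no DOM, a page would have to assemble the markup as a string, and any interpolated value then needs escaping. The helper names below (`escapeXml`, `formatTime`, `buildTimeSsml`) are hypothetical, and the element and attribute values are copied unchanged from the comment. How a browser would then render the string is left open, as it is in the comment.

```javascript
// Hypothetical helpers: build the SSML document from comment 18 as a
// string, escaping the interpolated value so it cannot break the markup.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Formats a Date as zero-padded local "HH:MM".
function formatTime(date) {
  const pad = (n) => String(n).padStart(2, "0");
  return pad(date.getHours()) + ":" + pad(date.getMinutes());
}

function buildTimeSsml(date) {
  return (
    "<?xml version='1.0'?>" +
    "<speak xml:lang='en' version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'>" +
    "<p>It is now <say-as type='time:hm'>" +
    escapeXml(formatTime(date)) +
    "</say-as>.</p>" +
    "</speak>"
  );
}

// Month index 3 is April; the time is interpreted in the local time zone.
const ssml = buildTimeSsml(new Date(2010, 3, 9, 6, 14));
```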
Comment 19 David Bolter [:davidb] 2010-04-30 12:04:53 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > > They are a W3C standard :), so are theoretically more relevant than an API from
> > > a specific vendor.
> > 
> > We don't want that;
> 
> Clarification: I mean we don't want vendor specific API.

Further clarification: I don't mind experimentation! And baby steps!
Comment 20 David Bolter [:davidb] 2011-07-13 12:52:45 PDT
Olli, what is the next step here?
Comment 21 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-07-27 13:59:11 PDT
It would be interesting to explore a pure JS+audio-API implementation of TTS, perhaps by transliterating or emscripten'ing an existing library.
Comment 22 Alon Zakai (:azakai) 2011-07-27 14:13:56 PDT
(In reply to comment #21)
> It would be interesting to explore a pure JS+audio-API implementation of
> TTS, perhaps by transliterating or emscripten'ing an existing library.

Yes!

I will investigate an emscripten solution for this.
Comment 23 Alon Zakai (:azakai) 2011-07-30 21:13:53 PDT
Here is a pure JS implementation by compiling eSpeak from C++ using Emscripten:

http://syntensity.com/static/espeak.html

It uses a WAV data URI to play the generated sound.
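
Editorial sketch of the "WAV data URI" approach mentioned above: wrap raw PCM samples in a minimal WAV header, base64-encode the bytes, and hand the result to an audio element. This is not the speak.js code. The function name `pcmToWavDataUri` and the test tone are assumptions made for illustration, and the input is assumed to be 16-bit mono PCM.

```javascript
// Use btoa where it exists (browsers, recent Node); otherwise fall back
// to Node's Buffer.
const toBase64 =
  typeof btoa === "function"
    ? (s) => btoa(s)
    : (s) => Buffer.from(s, "binary").toString("base64");

// Sketch: wrap 16-bit mono PCM samples in a 44-byte WAV header and
// return the result as a base64 data URI.
function pcmToWavDataUri(samples, sampleRate) {
  const dataSize = samples.length * 2; // 2 bytes per 16-bit sample
  const bytes = new Uint8Array(44 + dataSize);
  const view = new DataView(bytes.buffer);
  const writeTag = (offset, tag) => {
    for (let i = 0; i < tag.length; i++) view.setUint8(offset + i, tag.charCodeAt(i));
  };

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);   // remaining file size
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // format 1 = PCM
  view.setUint16(22, 1, true);              // channel count
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeTag(36, "data");
  view.setUint32(40, dataSize, true);
  samples.forEach((s, i) => view.setInt16(44 + i * 2, s, true));

  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return "data:audio/wav;base64," + toBase64(binary);
}

// One second of a 440 Hz tone at 8 kHz as stand-in "speech" output.
const tone = Int16Array.from({ length: 8000 }, (_, i) =>
  Math.round(12000 * Math.sin((2 * Math.PI * 440 * i) / 8000))
);
const wavUri = pcmToWavDataUri(tone, 8000);
// In a page: new Audio(wavUri).play();
```

The base64 step inflates the payload by about a third and the whole utterance must be generated before playback starts, which is why this route is simple but inefficient.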
Comment 24 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-01 10:21:02 PDT
Cool :D.
Comment 25 David Bolter [:davidb] 2011-08-02 05:47:47 PDT
I agree with comment 24.

Alon, are there parameters you can expose on your demo page, like rate, pitch etc?
Comment 26 Marco Zehe (:MarcoZ) on PTO until August 15 2011-08-02 06:07:07 PDT
I also tried this and would be curious to see if it is possible to use different languages. So this thing would not only speak English, but German, French etc.

Also, I'd like to see this running on an Android phone with Fennec. Has anyone tried that yet?
Comment 27 Trevor Saunders (:tbsaunde) 2011-08-02 23:00:53 PDT
So, this looks pretty good, but from my very quick look I'm a bit worried about performance. It looked like, when running, Firefox was using about 80-90 percent CPU on a C2D that's about 3 years old. (This was just top with espeak speaking over the screen reader, so I'd love to hear I heard or saw wrong :)). How much easy optimization can we do here?
Comment 28 Alon Zakai (:azakai) 2011-08-03 09:57:35 PDT
I have set up a repo on github for the project in comment 23,

https://github.com/kripken/speak.js

calling it speak.js for now. Should hopefully be a blog post on planet soon.

(In reply to comment #25)
> I agree with comment 24.
> 
> Alon, are there parameters you can expose on your demo page, like rate,
> pitch etc?

Yes, this is now live in the demo, same link as before,

http://syntensity.com/static/espeak.html

(In reply to comment #26)
> I also tried this and would be curious to see if it is possible to use
> different languages. So this thing would not only speak English, but German,
> French etc.

eSpeak supports other languages so this is possible. I'll investigate.

(In reply to comment #27)
> So, this looks pretty good, but from my very quick look I'm a bit worried
> about performance. It looked like, when running, Firefox was using about
> 80-90 percent CPU on a C2D that's about 3 years old. (This was just top with
> espeak speaking over the screen reader, so I'd love to hear I heard or saw
> wrong :)). How much easy optimization can we do here?

I would expect TTS in JavaScript to max out a CPU for as long as it takes to generate the audio. So the percentage of CPU should be close to 100, but the question is how long that lasts. Not sure how speak.js compares to the local application screen readers, but it looks like it generates audio about as fast as it is spoken on my 1.5-year old laptop.

Note that speak.js is not really optimized yet, it can be made much faster if that is important. Hard to estimate how much in advance though. It can also be put in a web worker so as not to stall the main page. Another factor to consider is that speak.js will be much faster once type inference lands, since TI makes code compiled from static languages very fast (often twice as fast).
Comment 29 Alon Zakai (:azakai) 2011-08-03 11:30:55 PDT
(In reply to comment #26)
> I also tried this and would be curious to see if it is possible to use
> different languages. So this thing would not only speak English, but German,
> French etc.

Turns out this is very easy to do. Here is a version (unoptimized, so slower, but ignore that) with French bundled in,

http://syntensity.com/static/espeak_fr.html

> 
> Also, I'd like to see this running on an Android phone with Fennec. Has
> anyone tried that yet?

Works on an Android phone for me, however it is quite slow. Probably because typed arrays are not fast on ARM, bug 649202.
Comment 30 Trevor Saunders (:tbsaunde) 2011-08-03 12:58:52 PDT
> (In reply to comment #27)
> > So, this looks pretty good, but from my very quick look I'm a bit worried
> > about performance. It looked like, when running, Firefox was using about
> > 80-90 percent CPU on a C2D that's about 3 years old. (This was just top with
> > espeak speaking over the screen reader, so I'd love to hear I heard or saw
> > wrong :)). How much easy optimization can we do here?
> 
> I would expect TTS in JavaScript to max out a CPU for as long as it takes to
> generate the audio. So the percentage of CPU should be close to 100, but the
> question is how long that lasts. Not sure how speak.js compares to the local
> application screen readers, but it looks like it generates audio about as
> fast as it is spoken on my 1.5-year old laptop.

espeak on the same machine natively uses under about 10%, and that is in a program whose use of concurrency is truly brain damaged (I suspect it hurts more than it helps); espeak itself actually uses less than 4%.

btw I was more concerned about battery consumption here than speech rate, especially when you're using speech the whole time you use the phone, as in the case of a screen reader.

> 
> Note that speak.js is not really optimized yet, it can be made much faster
> if that is important. Hard to estimate how much in advance though. It can

oh? interesting, I didn't know that was possible with Emscripten stuff other than optimizing the compiled JS, which sounds unpleasant.

> to consider is that speak.js will be much faster once type inference lands,
> since TI makes code compiled from static languages very fast (often twice as
> fast).

yeah, good point
Comment 31 Alon Zakai (:azakai) 2011-08-03 13:45:40 PDT
(In reply to comment #30)
> > (In reply to comment #27)
> > > So, this looks pretty good, but from my very quick look I'm a bit worried
> > > about performance. It looked like, when running, Firefox was using about
> > > 80-90 percent CPU on a C2D that's about 3 years old. (This was just top with
> > > espeak speaking over the screen reader, so I'd love to hear I heard or saw
> > > wrong :)). How much easy optimization can we do here?
> > 
> > I would expect TTS in JavaScript to max out a CPU for as long as it takes to
> > generate the audio. So the percentage of CPU should be close to 100, but the
> > question is how long that lasts. Not sure how speak.js compares to the local
> > application screen readers, but it looks like it generates audio about as
> > fast as it is spoken on my 1.5-year old laptop.
> 
> > espeak on the same machine natively uses under about 10%, and that is in a
> > program whose use of concurrency is truly brain damaged (I suspect it hurts
> > more than it helps); espeak itself actually uses less than 4%.

Well, in general the most optimized C++ to JS conversions I have done ended up being 3-5X slower than C++ (almost as fast as languages like Java or C#). So if native eSpeak takes 10%, I would not be surprised to see the JS version take 50%, once it is fully optimized.

> 
> btw I was more concerned about battery consumption here than speech rate,
> especially when you're using speech the whole time you use the phone, as in
> the case of a screen reader.

For a mobile device, yeah, this would take significantly more power than a native implementation. I wonder though if the energy to power the speaker isn't pretty big anyhow.

> 
> > 
> > Note that speak.js is not really optimized yet, it can be made much faster
> > if that is important. Hard to estimate how much in advance though. It can
> 
> oh? interesting, I didn't know that was possible with Emscripten stuff other
> than optimizing the compiled JS, which sounds unpleasant.

There are many possible optimizations here.

For one thing, the current code writes out WAV data, encodes that in base64, and loads that in a data URL ;) That's the simplest way to do things, but incredibly inefficient.

For another, you can run LLVM optimizations on the code before compiling it with Emscripten. I ran only a small amount of those optimizations so far.

Also, Emscripten itself has various speculative optimizations, that often require some minor changes to the original source code. I haven't looked into that yet.
Comment 32 Alon Zakai (:azakai) 2011-08-17 15:23:56 PDT
Blogpost about speak.js, with some technical details and updates: http://hacks.mozilla.org/2011/08/speak-js-text-to-speech-on-the-web/
Comment 33 Gerardo Capiel 2011-09-21 10:54:04 PDT
Hi, I'm the VP of Engineering at Benetech, the nonprofit behind Bookshare (.org) - the world's largest library of accessible ebooks for people with print disabilities (e.g. blind, dyslexic).  As David Bolter and Marco know, this topic is of a lot of interest to me.

While I think these js "hacks" are pretty cool, the speech quality is poor relative to the alternatives that people with print disabilities are used to.  Furthermore, this implementation lacks callbacks that enable you to do things like highlighting the words that are being spoken.  This capability is critical for dyslexic users who are a large portion of the people with print disabilities (for Bookshare nearly 80%).

I think the work that Google has done with Chrome TTS APIs is good and something to model off.  Check out:

http://code.google.com/chrome/extensions/tts.html

I built a prototype of word-level highlighting at:
https://github.com/gcapiel/ChromeWebAppBookshareReader

From the downloads section on my GitHub project, you can download the web app install, since the TTS APIs are only available to packaged web apps or extensions at this time.
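
Editorial sketch of the word-highlighting callback described in comment 33, written against the Web Speech API that this bug later tracked (see comment 36). `SpeechSynthesisUtterance` and its `boundary` event with `charIndex` are part of that API, but not every engine fires word boundaries, so treat this as a sketch. `wordBoundsAt`, `speakWithHighlight`, and the `highlight` callback are hypothetical names.

```javascript
// Sketch: map a character offset reported by a speech engine to the
// bounds of the word being spoken, so a reader UI can highlight it.
function wordBoundsAt(text, charIndex) {
  let start = charIndex;
  while (start > 0 && !/\s/.test(text[start - 1])) start--;
  let end = charIndex;
  while (end < text.length && !/\s/.test(text[end])) end++;
  return { start, end, word: text.slice(start, end) };
}

// Browser wiring (assumes the Web Speech API is available; `highlight`
// is a hypothetical callback supplied by the page).
function speakWithHighlight(text, highlight) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.onboundary = (event) => {
    if (event.name === "word") highlight(wordBoundsAt(text, event.charIndex));
  };
  speechSynthesis.speak(utterance);
  return utterance;
}
```

In a page, `highlight` would typically wrap the `[start, end)` range in a styled element and clear the previous one.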
Comment 34 Eitan Isaacson [:eeejay] 2011-09-21 11:36:14 PDT
(In reply to Gerardo Capiel from comment #33)
> I think the work that Google has done with Chrome TTS APIs is good and
> something to model off.  Check out:
> 
> http://code.google.com/chrome/extensions/tts.html
> 
> I built a prototype of word-level highlighting at:
> https://github.com/gcapiel/ChromeWebAppBookshareReader
> 
> From the downloads section on my GitHub project, you can download the web
> app install, since the TTS APIs are only available to packaged web apps or
> extensions at this time.

This is similar to bug #687879 I just opened about chrome-level access to tts (as opposed to content level).
Comment 35 Alon Zakai (:azakai) 2011-09-21 11:41:08 PDT
(In reply to Gerardo Capiel from comment #33)
> Hi, I'm the VP of Engineering at Benetech, the nonprofit behind Bookshare
> (.org) - the world's largest library of accessible ebooks for people with
> print disabilities (e.g. blind, dyslexic).  As David Bolter and Marco know,
> this topic is of a lot of interest to me.
> 
> While I think these js "hacks" are pretty cool, the speech quality is poor
> relative to the alternatives that people with print disabilities are used
> to.  Furthermore, this implementation lacks callbacks that enable you to do
> things like highlighting the words that are being spoken.  This capability
> is critical for dyslexic users who are a large portion of the people with
> print disabilities (for Bookshare nearly 80%).
> 

I completely understand your position - the JS hack here is of very low quality.

Note, though, that the hack was just meant as a proof of concept. If we decide to actually do it, speech quality could be greatly improved (to exactly the level of the best possible open source TTS implementation that exists), and things like highlighting support could be added. The hack was literally thrown together in a few days just to show it is even possible to do this in JS.
Comment 36 Eitan Isaacson [:eeejay] 2012-12-21 14:43:42 PST
*posted this earlier in the wrong bug*

I think this bug should cover this spec:
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

In addition to the above spec, I think it should be well integrated with MediaStream objects. See: http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0072.html

As for speech engines and implementations, I think it should be agnostic. That is, many potential engines should be supported. Some examples:
1. Platform services (SAPI on Windows, and similar on Linux, Mac, and Android).
2. Remote services, like Nuance or Google (translate).
3. Local JS and native implementations (whether and how they would be bundled with Firefox is debatable).

The implementation "glue" should be available in chrome-access JS, so a JS object would just need to implement an interface, and register the service object with a privileged call.
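
A rough sketch of the registration idea, in standalone C++ rather than chrome JS or Gecko code. Everything here is an assumption for illustration: `SpeechService`, `VoiceRegistry`, `addVoice`, `findService` and `EchoService` are hypothetical names, and the real interface may look quite different.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical engine interface. A platform, remote, or bundled engine
// would each provide its own implementation.
struct SpeechService {
  virtual ~SpeechService() = default;
  // Returns a description of what would be spoken. A real engine would
  // produce audio instead.
  virtual std::string speak(const std::string& text,
                            const std::string& voiceUri) = 0;
};

// Hypothetical registry: maps a voice URI to the service that provides it.
class VoiceRegistry {
public:
  // Returns false if a voice with this URI is already registered.
  bool addVoice(const std::string& uri, std::shared_ptr<SpeechService> service)
  {
    assert(service);  // registering a null service is a caller bug
    return mVoices.emplace(uri, std::move(service)).second;
  }

  // Returns nullptr when no service has registered the voice.
  std::shared_ptr<SpeechService> findService(const std::string& uri) const
  {
    auto it = mVoices.find(uri);
    if (it == mVoices.end()) {
      return nullptr;
    }
    return it->second;
  }

private:
  std::map<std::string, std::shared_ptr<SpeechService>> mVoices;
};

// Toy engine used only to exercise the registry.
struct EchoService : SpeechService {
  std::string speak(const std::string& text,
                    const std::string& voiceUri) override
  {
    return voiceUri + ": " + text;
  }
};
```

The point of the indirection is that the content-facing API only ever talks to the registry, so it does not need to know which kind of engine is behind a given voice.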
Comment 37 Eitan Isaacson [:eeejay] 2013-01-04 13:12:46 PST
I have discussed the merits of all the overlapping audio APIs with folks, and it looks like the Web Audio API would probably be the foundation for this. Individual engines would provide source AudioNodes.
Comment 38 Olli Pettay [:smaug] 2013-01-04 15:32:02 PST
Erm, http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html is being implemented for
other browsers.
The plan has been to be able to bind both the recognizer and TTS to mediastreams or audionodes,
but the recognizer and TTS should work even without any audio API usage.
Comment 39 Eitan Isaacson [:eeejay] 2013-02-19 20:18:50 PST
Created attachment 715862 [details] [diff] [review]
(Part 1/3) Basic SpeechSynthesis setup.

The boilerplate and empty-ish classes.
Comment 40 Eitan Isaacson [:eeejay] 2013-02-19 20:19:44 PST
Created attachment 715863 [details] [diff] [review]
(Part 2/3) Added speech service API.

Added the user-agent end of the deal where we could implement speech services.
Comment 41 Eitan Isaacson [:eeejay] 2013-02-19 20:21:06 PST
Created attachment 715864 [details] [diff] [review]
(Part 3/3) Support OOP speech synth

Added all the ipdl bits to make this work cross-process. Added a meta-test stolen from other modules.
Comment 42 Olli Pettay [:smaug] 2013-02-26 04:47:35 PST
Comment on attachment 715862 [details] [diff] [review]
(Part 1/3) Basic SpeechSynthesis setup.

This overlaps with recognition part, so whichever gets pushed later needs to merge

>+  static PRLogModuleInfo* sLog;
I'd prefer initializing explicitly to nullptr.

>+
>+  if (!sLog)
>+    sLog = PR_NewLogModule("SpeechSynthesis");
{} around if, please. Same also elsewhere.


>+SpeechSynthesis::HandleEvent(nsIDOMEvent* aEvent)
>+{
This is odd stuff. I wouldn't use event listeners for this but just manually call the relevant
SpeechSynthesis object when end or error happens.
If event listeners are used, some script might add a listener and call .stopImmediatePropagation().
Though, when not using event listeners, we need to be careful to handle the case where an
SSUtterance is in the queue of several SpeechSynthesis objects.

> 
>+NS_IMETHODIMP
>+nsGlobalWindow::GetSpeechSynthesis(nsISupports** aSpeechSynthesis)
>+{
>+#ifdef MOZ_WEBSPEECH
>+  FORWARD_TO_INNER(GetSpeechSynthesis, (aSpeechSynthesis), NS_ERROR_NOT_INITIALIZED);
>+
>+  NS_IF_ADDREF(*aSpeechSynthesis = nsPIDOMWindow::GetSpeechSynthesisInternal());
>+  return NS_OK;
>+#else
>+  return NS_ERROR_NOT_IMPLEMENTED;
>+#endif
>+}
We shouldn't even expose speechSynthesis on window if it is not implemented.
You may need to add something similar to http://mxr.mozilla.org/mozilla-central/source/dom/interfaces/base/nsIDOMWindowB2G.idl
which adds b2g-only stuff.


Looking good, but r- still
Comment 43 Olli Pettay [:smaug] 2013-02-26 05:10:24 PST
Comment on attachment 715863 [details] [diff] [review]
(Part 2/3) Added speech service API.


> SpeechSynthesis::SpeechSynthesis(nsIDOMWindow* aParent)
>   : mParent(aParent)
>+  , mCurrentTask(nullptr)
no need to initialize nsRefPtr to null.


> {
>   nsCOMPtr<nsPIDOMWindow> win = do_QueryInterface(aParent);
Hmm, what is this win used for? and why QI?

Use {} with if, everywhere.

> nsISupports*
> SpeechSynthesisVoice::GetParentObject() const
> {
>-  return mParent;
>+  return mSpeechService;
> }
This doesn't look right.
GetParentObject() needs to return something which eventually leads to the current global (window)

+  /**
+   * Dispatch boundary event.
+   *
+   * @param aName        name of boundary, 'word' or 'sentence'
+   * @param aElapsedTime time in seconds since speech has started.
+   * @param aCharIndex   offset of spoken characters.
+   */
+  void dispatchBoundary(in DOMString aName, in float aElapsedTime,
+                  in unsigned long aCharIndex);
align params

>+/**
>+ * The main interface of a speech synthesis service.
>+ *
>+ * A service's speak method could be implemented in two ways:
>+ *  1. Indirect audio - the service is responsible for outputting audio.
>+ *    The service calls the nsISpeechTask.dispatch* methods directly. Starting
>+ *    with dispatchStart() and ending with dispatchEnd or dispatchError().
>+ *
>+ *  2. Direct audio - the service provides us with PCM-16 data, and we output it.
>+ *    The service does not call the dispatch task methods directly. Instead,
>+ *    audio information is provided at setup(), and audio data is sent with
>+ *    sendAudio(). The utterance is terminated with an empty sendAudio().
>+ */
>+[scriptable, uuid(3952d388-050c-47ba-a70f-5fc1cadf1db0)]
>+interface nsISpeechService : nsISupports
Make this builtinclass ? and nsISpeechTask too?

>+class SpeechStreamListener : public MediaStreamListener
>+{
>+public:
>+  SpeechStreamListener(nsSpeechTask* aSpeechTask) :
>+    mSpeechTask(aSpeechTask) {
>+  }
>+
>+  void DoNotifyFinished() {
{ should be in the next line


>+  virtual void NotifyFinished(MediaStreamGraph* aGraph) {
ditto

>+    nsCOMPtr<nsIRunnable> event =
>+      NS_NewRunnableMethod(this, &SpeechStreamListener::DoNotifyFinished);
>+    aGraph->DispatchToMainThreadAfterStreamStateUpdate(event.forget());
>+  }
>+
>+private:
>+  nsSpeechTask* mSpeechTask;
Could you explain how mSpeechTask is kept alive? 
(raw pointers in callback-like objects tend to lead to sg-crit crashes if not handled explicitly)


>+NS_IMETHODIMP
>+nsSpeechTask::SendAudio(const JS::Value& aData, const JS::Value& aLandmarks,
>+                        JSContext* aCx)
>+{
You should enter the compartment of aData before doing anything with it.
Also add a JSAutoRequest to be super safe in case binary addons end up using this.

And before accessing aLandmarks, enter its compartment.

>+class nsSpeechTask : public nsISpeechTask
...
>+  MediaStreamListener* mListener;
Could you explain the ownership model here.

I'll review the tests in a later review round.
Comment 44 Olli Pettay [:smaug] 2013-02-26 05:30:44 PST
Comment on attachment 715864 [details] [diff] [review]
(Part 3/3) Support OOP speech synth

if (expr) {
  stmt;
} else {
  stmt;
}

Be consistent with ++ in for loops.
++foo


I don't understand the test. Why do we have stuff like
+        ppmm.addMessageListener("test:SpeechSynthesis:ipcSynthAddVoice", onSynthAddVoice);
+        ppmm.addMessageListener("test:SpeechSynthesis:ipcSynthSetDefault", onSynthSetDefault);
+        ppmm.addMessageListener("test:SpeechSynthesis:ipcSynthCleanup", onSynthCleanup);

I don't see anything sending such messages.
Do we have some message manager testing thing I'm not aware of?
Comment 45 Eitan Isaacson [:eeejay] 2013-03-07 14:53:36 PST
(In reply to Olli Pettay [:smaug] from comment #42)
> Comment on attachment 715862 [details] [diff] [review]
> (Part 1/3) Basic SpeechSynthesis setup.
> 
> This overlaps with recognition part, so whichever gets pushed later needs to
> merge
> 

Right, we made sure to collaborate on that. So the tree layout should work well together.

> >+  static PRLogModuleInfo* sLog;
> I'd prefer initializing explicitly to nullptr.
> 

Done.

> >+
> >+  if (!sLog)
> >+    sLog = PR_NewLogModule("SpeechSynthesis");
> {} around if, please. Same also elsewhere.
> 

Done.

> 
> >+SpeechSynthesis::HandleEvent(nsIDOMEvent* aEvent)
> >+{
> This is odd stuff. I wouldn't use event listeners for this but just manually
> call the relevant
> SpeechSynthesis object when end or error happens.
> If event listeners are used, some script might add a listener and call
> .stopImmediatePropagation().

Good point. I'll work on that.

> Though, when not using event listeners need to be careful to handle the case
> when
> SSUtterance is in the queue of several SpeechSynthesis objects.
> 

That is probably illegal anyway. If it were not, the utterance would be the target of multiple events from different synths, and it would be confusing. I opened a bug for that:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=21195
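
A minimal sketch of the "one utterance, one queue" rule, in standalone C++. It assumes the rule is enforced by a state flag on the utterance itself; `Utterance`, `Synth` and their members are hypothetical names for illustration, not the classes in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <utility>

// Hypothetical utterance that remembers whether it was ever handed to a synth.
struct Utterance {
  enum class State { Idle, Pending, Ended };
  explicit Utterance(std::string aText) : text(std::move(aText)) {}
  std::string text;
  State state = State::Idle;
};

// Hypothetical synth with a simple FIFO speech queue.
class Synth {
public:
  // Queues the utterance. Returns false if it is already queued somewhere
  // (on this or any other synth) or has already been spoken.
  bool speak(const std::shared_ptr<Utterance>& aUtterance)
  {
    assert(aUtterance);
    if (aUtterance->state != Utterance::State::Idle) {
      return false;
    }
    aUtterance->state = Utterance::State::Pending;
    mQueue.push_back(aUtterance);
    return true;
  }

  // Marks the utterance at the head of the queue as finished and drops it.
  void finishCurrent()
  {
    if (mQueue.empty()) {
      return;
    }
    mQueue.front()->state = Utterance::State::Ended;
    mQueue.pop_front();
  }

  std::size_t pending() const { return mQueue.size(); }

private:
  std::deque<std::shared_ptr<Utterance>> mQueue;
};
```

With this shape an utterance can only ever be the event target for one synth, which sidesteps the multiple-queue case entirely.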

> > 
> >+NS_IMETHODIMP
> >+nsGlobalWindow::GetSpeechSynthesis(nsISupports** aSpeechSynthesis)
> >+{
> >+#ifdef MOZ_WEBSPEECH
> >+  FORWARD_TO_INNER(GetSpeechSynthesis, (aSpeechSynthesis), NS_ERROR_NOT_INITIALIZED);
> >+
> >+  NS_IF_ADDREF(*aSpeechSynthesis = nsPIDOMWindow::GetSpeechSynthesisInternal());
> >+  return NS_OK;
> >+#else
> >+  return NS_ERROR_NOT_IMPLEMENTED;
> >+#endif
> >+}
> We shouldn't even expose speechSynthesis on window if it is not implemented.
> You may need to add something similar to
> http://mxr.mozilla.org/mozilla-central/source/dom/interfaces/base/
> nsIDOMWindowB2G.idl
> which add b2g only stuff.
> 

Thanks, I was looking for an example of how that is done. But how do I hide it when the pref is not enabled?

> 
> Looking good, but r- still
Comment 46 Eitan Isaacson [:eeejay] 2013-03-07 15:02:37 PST
(In reply to Olli Pettay [:smaug] from comment #43)
> Comment on attachment 715863 [details] [diff] [review]
> (Part 2/3) Added speech service API.
> 
> 
> > SpeechSynthesis::SpeechSynthesis(nsIDOMWindow* aParent)
> >   : mParent(aParent)
> >+  , mCurrentTask(nullptr)
> no need to initialize nsRefPtr to null.
> 

Done.

> 
> > {
> >   nsCOMPtr<nsPIDOMWindow> win = do_QueryInterface(aParent);
> Hmm, what is this win used for? and why QI?
> 
> Use {} with if, everywhere.
> 

Right.

> > nsISupports*
> > SpeechSynthesisVoice::GetParentObject() const
> > {
> >-  return mParent;
> >+  return mSpeechService;
> > }
> This doesn't look right.
> GetParentObject() needs to return something which eventually leads to the
> current global (window)
> 

Changed the ownership model here. Each SpeechSynthesis has its own SpeechSynthesisVoice children that proxy the global service.

> +  /**
> +   * Dispatch boundary event.
> +   *
> +   * @param aName        name of boundary, 'word' or 'sentence'
> +   * @param aElapsedTime time in seconds since speech has started.
> +   * @param aCharIndex   offset of spoken characters.
> +   */
> +  void dispatchBoundary(in DOMString aName, in float aElapsedTime,
> +                  in unsigned long aCharIndex);
> align params
> 

Done.

> >+/**
> >+ * The main interface of a speech synthesis service.
> >+ *
> >+ * A service's speak method could be implemented in two ways:
> >+ *  1. Indirect audio - the service is responsible for outputting audio.
> >+ *    The service calls the nsISpeechTask.dispatch* methods directly. Starting
> >+ *    with dispatchStart() and ending with dispatchEnd or dispatchError().
> >+ *
> >+ *  2. Direct audio - the service provides us with PCM-16 data, and we output it.
> >+ *    The service does not call the dispatch task methods directly. Instead,
> >+ *    audio information is provided at setup(), and audio data is sent with
> >+ *    sendAudio(). The utterance is terminated with an empty sendAudio().
> >+ */
> >+[scriptable, uuid(3952d388-050c-47ba-a70f-5fc1cadf1db0)]
> >+interface nsISpeechService : nsISupports
> Make this builtinclass ? and nsISpeechTask too?
> 

This one is an interface that could be implemented in JS, so not here. But nsISpeechTask for sure, and nsISynthVoiceRegistry too.

> >+class SpeechStreamListener : public MediaStreamListener
> >+{
> >+public:
> >+  SpeechStreamListener(nsSpeechTask* aSpeechTask) :
> >+    mSpeechTask(aSpeechTask) {
> >+  }
> >+
> >+  void DoNotifyFinished() {
> { should be in the next line
> 

Done.

> 
> >+  virtual void NotifyFinished(MediaStreamGraph* aGraph) {
> ditto
> 

Done.

> >+    nsCOMPtr<nsIRunnable> event =
> >+      NS_NewRunnableMethod(this, &SpeechStreamListener::DoNotifyFinished);
> >+    aGraph->DispatchToMainThreadAfterStreamStateUpdate(event.forget());
> >+  }
> >+
> >+private:
> >+  nsSpeechTask* mSpeechTask;
> Could you explain how mSpeechTask is kept alive? 
> (raw pointers in callback-like objects tend to lead to sg-crit crashes if
> not handled explicitly)
> 

I added a comment there. The speech task holds an exclusive reference to the stream, which in turn holds an exclusive reference to the listener. So the listener should not outlive the task. Of course, we null-check just in case. I was inspired by DOMCameraPreview.
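
The ownership chain described above can be sketched in standalone C++ roughly as follows. This is an illustration under assumptions, not the patch: it uses `std::unique_ptr` in place of Gecko's reference counting, it ignores the hop back to the main thread, and `Task`, `Stream`, `Listener` and their methods are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <utility>

class Task;

// The listener keeps a raw, non-owning back-pointer to the task.
class Listener {
public:
  explicit Listener(Task* aTask) : mTask(aTask) {}
  void notifyFinished();  // defined below, once Task is complete
private:
  Task* mTask;
};

// The stream exclusively owns its listener.
class Stream {
public:
  explicit Stream(std::unique_ptr<Listener> aListener)
    : mListener(std::move(aListener))
  {
    assert(mListener);
  }
  void finish() { mListener->notifyFinished(); }
private:
  std::unique_ptr<Listener> mListener;
};

// The task exclusively owns the stream, so destroying the task destroys
// the stream and then the listener. The back-pointer cannot dangle.
class Task {
public:
  Task() : mStream(new Stream(std::unique_ptr<Listener>(new Listener(this)))) {}
  // The listener stores `this`, so the task must stay at a fixed address.
  Task(const Task&) = delete;
  Task& operator=(const Task&) = delete;

  void play() { mStream->finish(); }
  void onEnd() { ++mEndCount; }
  int endCount() const { return mEndCount; }

private:
  std::unique_ptr<Stream> mStream;
  int mEndCount = 0;
};

inline void Listener::notifyFinished()
{
  // Null-check just in case, mirroring the defensive check mentioned above.
  if (mTask) {
    mTask->onEnd();
  }
}
```

The safety argument rests entirely on both references being exclusive. If anything else could hold the stream or the listener alive, the raw pointer would need to be cleared explicitly when the task goes away.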

> 
> >+NS_IMETHODIMP
> >+nsSpeechTask::SendAudio(const JS::Value& aData, const JS::Value& aLandmarks,
> >+                        JSContext* aCx)
> >+{
> You should enter to the compartment of aData before doing anything with it.
> And add also JSAutoRequest to be super safe in case binary addons end up
> using this.
> 

Done. I don't really understand it, so you should probably check to see if it is right :)

> And before accessing aLandmarks, enter its compartment
> 

I'll do that once aLandmarks is actually used.

> >+class nsSpeechTask : public nsISpeechTask
> ...
> >+  MediaStreamListener* mListener;
> Could you explain the ownership model here.
> 

Oops, that should not be a member. Removed.

> I'll review the tests in a later review round.
Comment 47 Olli Pettay [:smaug] 2013-03-07 15:24:03 PST
For the pref handling when exposing stuff in window, you could check how touch events are handled.
(ontouchstart/move/...)
Comment 48 Eitan Isaacson [:eeejay] 2013-03-07 16:44:27 PST
(In reply to Olli Pettay [:smaug] from comment #47)
> For the pref handling when exposing stuff in window, you could check how
> touch events are handled.
> (ontouchstart/move/...)

You need to restart for that to take effect, so it makes it harder to set the pref in mochitests.
Comment 49 Eitan Isaacson [:eeejay] 2013-03-07 16:47:00 PST
(In reply to Olli Pettay [:smaug] from comment #44)
> Comment on attachment 715864 [details] [diff] [review]
> (Part 3/3) Support OOP speech synth
> 
> I don't understand the test. Why do we have stuff like
> +        ppmm.addMessageListener("test:SpeechSynthesis:ipcSynthAddVoice",
> onSynthAddVoice);
> +        ppmm.addMessageListener("test:SpeechSynthesis:ipcSynthSetDefault",
> onSynthSetDefault);
> +        ppmm.addMessageListener("test:SpeechSynthesis:ipcSynthCleanup",
> onSynthCleanup);
> 
> I don't see anything sending such messages.
> Do we have some message manager testing thing I'm not aware of?

The previous patch had the message sending. Fixed in the next set of patches for review...
Comment 50 Eitan Isaacson [:eeejay] 2013-03-07 16:49:45 PST
Created attachment 722567 [details] [diff] [review]
Bug 525444 - (Part 1/3) Basic SpeechSynthesis setup and voice registration.

- Rebased with moz.build introduction
- Added xpcom modules to manifests
- Added braces to one line ifs
- Initialize logger object to null
- Only allow an utterance to be spoken once
- Don't use an event handler for the speech queue
- Update SpeechUtterance to hold direct voice references instead of voiceURI
- Move speechSynthesis getter to another idl and make it easier to ifdef out
- Make nsISynthVoiceRegistry builtinclass
- Make voice instances non-global
- Add voice query functions to registry, use those

The main change from the previous patch is that the voice registry service is introduced here, and not in patch #2.
Comment 51 Eitan Isaacson [:eeejay] 2013-03-07 16:50:05 PST
Created attachment 722568 [details] [diff] [review]
Bug 525444 - (Part 2/3) Added speech service API.

- Fix dispatchStart()
- Make nsSpeechTask a tad safer and more correct.
- Added braces to one line ifs
- Use SpeechSynthesis.OnEnd when speech is done.
- Nits
- Clarify ownership model for stream listener
- Did JS compartment thing
- Make nsISpeechTask builtinclass
Comment 52 Eitan Isaacson [:eeejay] 2013-03-07 16:50:24 PST
Created attachment 722569 [details] [diff] [review]
Bug 525444 - (Part 3/3) Support OOP speech synth

- moz.build ipc
- Added braces to one line ifs
- Add conditional steps for common test functions in OOP
- Remove unused warnings
- Some nits
Comment 53 Olli Pettay [:smaug] 2013-03-12 18:25:54 PDT
Comment on attachment 722569 [details] [diff] [review]
Bug 525444 - (Part 3/3) Support OOP speech synth


> #include "nsString.h"
> #include "mozilla/StaticPtr.h"
>+#include "mozilla/dom/ContentChild.h"
>+#include "mozilla/dom/ContentParent.h"
>+#include "mozilla/unused.h"
>+
>+#include "mozilla/dom/SpeechSynthesisChild.h"
>+#include "mozilla/dom/SpeechSynthesisParent.h"
> 
> #undef LOG
> #ifdef PR_LOGGING
> extern PRLogModuleInfo* GetSpeechSynthLog();
> #define LOG(type, msg) PR_LOG(GetSpeechSynthLog(), type, msg)
> #else
> #define LOG(type, msg)
> #endif
> 
>+namespace {
>+
>+void
>+GetAllSpeechSynthActors(InfallibleTArray<mozilla::dom::SpeechSynthesisParent*>& aActors)
>+{
>+  MOZ_ASSERT(NS_IsMainThread());
>+  MOZ_ASSERT(aActors.IsEmpty());
>+
>+  nsAutoTArray<mozilla::dom::ContentParent*, 20> contentActors;
>+  mozilla::dom::ContentParent::GetAll(contentActors);
>+
>+  for (uint32_t contentIndex = 0;
>+       contentIndex < contentActors.Length();
>+       contentIndex++) {
Nit, ++contentIndex

>+    for (uint32_t speechsynthIndex = 0;
>+         speechsynthIndex < speechsynthActors.Length();
>+         speechsynthIndex++) {
++foo, not foo++

>+    for (uint32_t i = 0; i < voices.Length(); i++) {
ditto

>+      RemoteVoice voice = voices[i];
>+      AddVoiceImpl(nullptr, voice.voiceURI(),
>+                   voice.name(), voice.lang(),
>+                   voice.localService());
>+    }
>+
>+    for (uint32_t i = 0; i < defaults.Length(); i++) {
ditto, and also elsewhere.
Comment 54 Olli Pettay [:smaug] 2013-03-13 03:19:22 PDT
Comment on attachment 722567 [details] [diff] [review]
Bug 525444 - (Part 1/3) Basic SpeechSynthesis setup and voice registration.



>+class SpeechSynthesis MOZ_FINAL : public nsISupports,
...
>+  nsCOMPtr<nsIDOMWindow> mParent;
You probably want this to be nsCOMPtr<nsPIDOMWindow>.

>+
>+  nsTArray<nsRefPtr<SpeechSynthesisUtterance>> mSpeechQueue;
>+
>+  nsRefPtrHashtable<nsStringHashKey, SpeechSynthesisVoice> mVoiceCache;
>+};


>+public:
>+  SpeechSynthesisUtterance(const nsAString& aText);
>+  virtual ~SpeechSynthesisUtterance();
>+
>+  NS_DECL_ISUPPORTS_INHERITED
>+  NS_DECL_CYCLE_COLLECTION_CLASS_INHERITED(SpeechSynthesisUtterance,
>+                                           nsDOMEventTargetHelper)
You don't add anything to cycle collection so this shouldn't be needed.

>+SpeechSynthesisVoice::WrapObject(JSContext* aCx, JSObject* aScope, bool* aTriedToWrap)
Note, the last param has been removed from WrapObject()


>+void
>+SpeechSynthesisVoice::GetName(nsString& aRetval) const
>+{
>+  nsresult rv =
>+    nsSynthVoiceRegistry::GetInstance()->GetVoiceName(mUri, aRetval);
>+  NS_ENSURE_SUCCESS_VOID(rv);
>+}
>+
>+void
>+SpeechSynthesisVoice::GetLang(nsString& aRetval) const
>+{
>+  nsresult rv =
>+    nsSynthVoiceRegistry::GetInstance()->GetVoiceLang(mUri, aRetval);
>+  NS_ENSURE_SUCCESS_VOID(rv);
>+}
Not very useful NS_ENSURE_*
If you want to show a warning in the terminal, use NS_WARN_IF_FALSE(NS_SUCCEEDED(rv), ...)


>+#ifdef MOZ_WEBSPEECH
>+    DOM_CLASSINFO_MAP_ENTRY(nsIDOMSpeechSynthesisGetter)
Could you use 
DOM_CLASSINFO_MAP_CONDITIONAL_ENTRY(nsIDOMSpeechSynthesisGetter, SpeechSynthesis::PrefEnabled())


>   DOM_CLASSINFO_MAP_BEGIN_NO_CLASS_IF(ChromeWindow, nsIDOMWindow)
>     DOM_CLASSINFO_WINDOW_MAP_ENTRIES(true)
>     DOM_CLASSINFO_MAP_ENTRY(nsIDOMChromeWindow)
>+#ifdef MOZ_WEBSPEECH
>+    DOM_CLASSINFO_MAP_ENTRY(nsIDOMSpeechSynthesisGetter)
Could you use 
DOM_CLASSINFO_MAP_CONDITIONAL_ENTRY(nsIDOMSpeechSynthesisGetter, SpeechSynthesis::PrefEnabled())

>   DOM_CLASSINFO_MAP_BEGIN_NO_CLASS_IF(ModalContentWindow, nsIDOMWindow)
>     DOM_CLASSINFO_WINDOW_MAP_ENTRIES(nsGlobalWindow::HasIndexedDBSupport())
>     DOM_CLASSINFO_MAP_ENTRY(nsIDOMModalContentWindow)
>+#ifdef MOZ_WEBSPEECH
>+    DOM_CLASSINFO_MAP_ENTRY(nsIDOMSpeechSynthesisGetter)
Could you use 
DOM_CLASSINFO_MAP_CONDITIONAL_ENTRY(nsIDOMSpeechSynthesisGetter, SpeechSynthesis::PrefEnabled())
Or actually, add that to DOM_CLASSINFO_WINDOW_MAP_ENTRIES


>+SpeechSynthesis*
>+nsPIDOMWindow::GetSpeechSynthesisInternal()
>+{
>+  MOZ_ASSERT(IsInnerWindow());
>+
>+  if (!SpeechSynthesis::PrefEnabled())
>+    return nullptr;
if (expr) {
  stmt;
}
same also elsewhere.


>+#ifdef MOZ_WEBSPEECH
>+  // mSpeechSythesis is only used on outer windows.
>+  nsRefPtr<mozilla::dom::SpeechSynthesis>     mSpeechSythesis;
>+#endif
It should be used in the inner window, not outer. And in fact the getter method forwards to inner.
Comment 55 Olli Pettay [:smaug] 2013-03-13 03:34:01 PDT
Comment on attachment 722568 [details] [diff] [review]
Bug 525444 - (Part 2/3) Added speech service API.


>+/**
>+ * The main interface of a speech synthesis service.
>+ *
>+ * A service's speak method could be implemented in two ways:
>+ *  1. Indirect audio - the service is responsible for outputting audio.
>+ *    The service calls the nsISpeechTask.dispatch* methods directly. Starting
>+ *    with dispatchStart() and ending with dispatchEnd or dispatchError().
>+ *
>+ *  2. Direct audio - the service provides us with PCM-16 data, and we output it.
>+ *    The service does not call the dispatch task methods directly. Instead,
>+ *    audio information is provided at setup(), and audio data is sent with
>+ *    sendAudio(). The utterance is terminated with an empty sendAudio().
How do we know which way will be used? A bit odd setup if either methods A or methods B will be called, but
you don't know which ones.
Comment 56 Eitan Isaacson [:eeejay] 2013-03-13 09:53:32 PDT
(In reply to Olli Pettay [:smaug] from comment #55)
> Comment on attachment 722568 [details] [diff] [review]
> Bug 525444 - (Part 2/3) Added speech service API.
> 
> 
> >+/**
> >+ * The main interface of a speech synthesis service.
> >+ *
> >+ * A service's speak method could be implemented in two ways:
> >+ *  1. Indirect audio - the service is responsible for outputting audio.
> >+ *    The service calls the nsISpeechTask.dispatch* methods directly. Starting
> >+ *    with dispatchStart() and ending with dispatchEnd or dispatchError().
> >+ *
> >+ *  2. Direct audio - the service provides us with PCM-16 data, and we output it.
> >+ *    The service does not call the dispatch task methods directly. Instead,
> >+ *    audio information is provided at setup(), and audio data is sent with
> >+ *    sendAudio(). The utterance is terminated with an empty sendAudio().
> How do we know which way will be used? A bit odd setup if either methods A
> or methods B will be called, but
> you don't know which ones.

How do you think this should be made clearer? Maybe have two distinct service interfaces for direct and indirect audio? Each one with a different speak() signature?
Comment 57 Eitan Isaacson [:eeejay] 2013-03-15 12:38:24 PDT
*** Bug 687879 has been marked as a duplicate of this bug. ***
Comment 58 Eitan Isaacson [:eeejay] 2013-03-19 10:28:57 PDT
Created attachment 726749 [details] [diff] [review]
Bug 525444 - (Part 1/3) Basic SpeechSynthesis setup and voice registration.

- use nsPIDOMWindow as SpeechSynthesis parent.
- removed cycle collection macro from SpeechSynthesisUtterance
- explicitly warn on getter failures in SpeechSynthesisVoice
- further nits
- put pref checker in synth directory

r=smaug
Comment 59 Eitan Isaacson [:eeejay] 2013-03-19 10:29:35 PDT
Created attachment 726750 [details] [diff] [review]
Bug 525444 - (Part 2/3) Added speech service API.

- separate implementation and iface methods for dispatch*()
 - add a serviceType attribute to speech services, and enforce it.
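
One way a service type could be declared up front and enforced, sketched in standalone C++. This is an assumption about the general shape, not the code in the attachment: `ServiceType`, `TypedTask` and the method signatures are hypothetical, and the `setup()` step for direct audio is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: each service declares which of the two modes it uses.
enum class ServiceType { DirectAudio, IndirectAudio };

// Hypothetical task that rejects calls belonging to the other mode.
class TypedTask {
public:
  explicit TypedTask(ServiceType aType) : mType(aType) {}

  // Indirect audio only: the service plays audio and reports progress itself.
  bool dispatchStart()
  {
    if (mType != ServiceType::IndirectAudio || mStarted) {
      return false;
    }
    mStarted = true;
    return true;
  }

  bool dispatchEnd()
  {
    if (mType != ServiceType::IndirectAudio || !mStarted || mEnded) {
      return false;
    }
    mEnded = true;
    return true;
  }

  // Direct audio only: the service hands over PCM-16 samples, and an
  // empty chunk terminates the utterance.
  bool sendAudio(const std::vector<std::int16_t>& aSamples)
  {
    if (mType != ServiceType::DirectAudio || mEnded) {
      return false;
    }
    mStarted = true;
    if (aSamples.empty()) {
      mEnded = true;
    } else {
      mSampleCount += aSamples.size();
    }
    assert(!mEnded || mStarted);  // an utterance cannot end before it starts
    return true;
  }

  bool ended() const { return mEnded; }
  std::size_t sampleCount() const { return mSampleCount; }

private:
  ServiceType mType;
  bool mStarted = false;
  bool mEnded = false;
  std::size_t mSampleCount = 0;
};
```

Compared with two separate service interfaces, a single interface plus a type tag keeps registration uniform, at the cost of checking the mode at call time rather than at compile time.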
Comment 60 Eitan Isaacson [:eeejay] 2013-03-19 10:30:08 PDT
Created attachment 726752 [details] [diff] [review]
Bug 525444 - (Part 3/3) Support OOP speech synth

- fixed some nits

r=smaug
Comment 61 Eitan Isaacson [:eeejay] 2013-03-26 14:39:30 PDT
Didn't forget this. I want to put this in try first. I'll get to it once I am back out of Android land.
Comment 62 Alex Vincent [:WeirdAl] 2013-04-03 12:33:06 PDT
(In reply to Eitan Isaacson [:eeejay] from comment #58)
> Created attachment 726749 [details] [diff] [review]
> Bug 525444 - (Part 1/3) Basic SpeechSynthesis setup and voice registration.

One small typo:  you refer to mSpeechSythesis instead of mSpeechSynthesis.
Comment 64 Boris Zbarsky [:bz] 2013-04-03 19:08:04 PDT
Hmm.  Why is nsIDOMSpeechSynthesisGetter named with the "nsIDOM" bit?  Wouldn't nsISpeechSynthesisGetter make more sense?
Comment 65 Eitan Isaacson [:eeejay] 2013-04-03 19:15:24 PDT
(In reply to Boris Zbarsky (:bz) from comment #64)
> Hmm.  Why is nsIDOMSpeechSynthesisGetter named with the "nsIDOM" bit? 
> Wouldn't nsISpeechSynthesisGetter make more sense?

I was inspired by nsIDOMWindowB2G and SpeechSynthesisGetter from the W3C spec IDL. Obviously that didn't work well :)
Comment 66 Boris Zbarsky [:bz] 2013-04-03 19:18:30 PDT
nsIDOMWindowB2G is a flat-out bug.  ;)

The SpeechSynthesisGetter in the spec is also a bug (should just be a partial interface), but note that it's [NoInterfaceObject].  For XPConnect interfaces that means "no DOM after the nsI bit"....
Comment 68 Olli Pettay [:smaug] 2013-04-04 02:22:14 PDT
(In reply to Boris Zbarsky (:bz) from comment #64)
> Hmm.  Why is nsIDOMSpeechSynthesisGetter named with the "nsIDOM" bit? 
> Wouldn't nsISpeechSynthesisGetter make more sense?
Ugh, I missed this when reviewing.
Comment 69 :aceman 2013-04-05 13:56:48 PDT
Also, the --disable-webspeech option does not seem to work; the build fails:

No IDL file found for interface nsIDOMSpeechSynthesisGetter in include path ['../../../dist/idl']
Comment 70 Olli Pettay [:smaug] 2013-04-06 04:00:32 PDT
Why wasn't the comment about classinfo https://bugzilla.mozilla.org/show_bug.cgi?id=525444#c54
addressed?
Comment 71 Eitan Isaacson [:eeejay] 2013-04-15 01:25:31 PDT
(In reply to Olli Pettay [:smaug] from comment #70)
> Why wasn't the comment about classinfo
> https://bugzilla.mozilla.org/show_bug.cgi?id=525444#c54
> addressed?

Oops, looks like I missed that. Sorry.
Comment 72 Lukas Blakk [:lsblakk] use ?needinfo 2013-05-08 13:11:30 PDT
Looks like this might be best under the 'developer' tag so people can make document.say page content - is that all the user-facing functionality here?  Anything else?
Comment 73 Eitan Isaacson [:eeejay] 2013-05-08 13:13:32 PDT
(In reply to lsblakk@mozilla.com from comment #72)
> Looks like this might be best under the 'developer' tag so people can make
> document.say page content - is that all the user-facing functionality here? 
> Anything else?

I would hold off on release notes for this version. There are no speech adapters shipped with Firefox yet, so it doesn't really do anything yet. Unless you have an extension with voices, but there aren't any of those out yet either.
Comment 74 Mike Gifford 2014-05-04 07:34:24 PDT
Really doesn't look like this is supported here https://en.wikipedia.org/wiki/Comparison_of_layout_engines_%28Cascading_Style_Sheets%29

or http://css3test.com/

Why is this marked as resolved?
Comment 75 Boris Zbarsky [:bz] 2014-05-04 14:31:57 PDT
Mike, why do you think this bug has anything to do with CSS?  As far as I can tell this bug was about a script API.
Comment 76 Mike Gifford 2014-05-04 14:37:13 PDT
Woops...  Sorry..  Must have just had too many tabs open, sorry.