Closed Bug 1566415 Opened 5 years ago Closed 5 years ago

Fingerprinting surface increased on macOS via Web Speech API

Categories

(Core :: Web Speech, defect)

defect
Not set
normal

Tracking

RESOLVED DUPLICATE of bug 1233846

People

(Reporter: marcosc, Unassigned)

Details

The Web Speech API exposes the voices installed on an end user's device, increasing the fingerprinting surface of Firefox.

STR, in the browser console:

  • Call speechSynthesis.getVoices(); // Returns an empty array.
  • Call speechSynthesis.getVoices() a second time; // Returns all the voices on the system.

The voices can be unique to an individual, based on the ones they have downloaded.
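
As a rough illustration of why this matters, the sketch below reduces a voice list to a single comparable string. `voiceListKey` is a hypothetical helper, and the two sample users are made-up data; only the `name`, `lang`, and `localService` properties come from the `SpeechSynthesisVoice` interface.

```javascript
// Hypothetical helper: turn a list of voice-like objects into one stable string.
// Sorting makes the key independent of the order the voices are reported in.
function voiceListKey(voices) {
  return voices
    .map((v) => `${v.name}|${v.lang}|${v.localService ? "local" : "remote"}`)
    .sort()
    .join(";");
}

// Made-up sample data: user B has downloaded one extra voice.
const userA = [
  { name: "Alex", lang: "en-US", localService: true },
  { name: "Samantha", lang: "en-US", localService: true },
];
const userB = [...userA, { name: "Kyoko", lang: "ja-JP", localService: true }];

// In a browser, the input would be speechSynthesis.getVoices().
console.log(voiceListKey(userA) === voiceListKey(userB)); // false
```

A single extra download is enough to separate the two users, which is the concern raised here.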

Why do you need to call it twice I wonder? I guess it needs to populate the list after the first call?

See also Bug 1333641 where we disabled this for Resist Fingerprinting Mode and Bug 1455899 where on Windows you can observe other tabs using the API.

Anyway, the spec says the User Agent has the discretion to do what it wants: "This method returns the available voices. It is user agent dependent which voices are available." So if we wanted to, we could return only default voices (if we knew what those were), gate this behind user interaction (a low bar), or restrict it in third-party contexts...

(In reply to Tom Ritter [:tjr] from comment #1)

> Why do you need to call it twice I wonder? I guess it needs to populate the list after the first call?

Correct - it’s a design quirk in the API. The API predates promises, so this was their solution: you call .getVoices() once, a “voiceschanged” event fires, and then you call .getVoices() again to get the list.
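
The call-twice pattern can be wrapped in a promise. This is a minimal sketch, not how any browser or library does it: `loadVoices` and its `timeoutMs` parameter are hypothetical, and `synth` is assumed to behave like `window.speechSynthesis` (an event target with a `getVoices()` method).

```javascript
// Hypothetical helper: resolve with the voice list once it has been populated.
function loadVoices(synth, timeoutMs = 2000) {
  return new Promise((resolve) => {
    // First call: may already return the list, or an empty array.
    const voices = synth.getVoices();
    if (voices.length > 0) {
      resolve(voices);
      return;
    }

    // Fallback in case the event never fires; the result may still be empty.
    const timer = setTimeout(() => {
      synth.removeEventListener("voiceschanged", onChange);
      resolve(synth.getVoices());
    }, timeoutMs);

    // Second call, made once the user agent signals that the list changed.
    function onChange() {
      clearTimeout(timer);
      synth.removeEventListener("voiceschanged", onChange);
      resolve(synth.getVoices());
    }
    synth.addEventListener("voiceschanged", onChange);
  });
}

// Browser usage (assumed):
// loadVoices(window.speechSynthesis).then((voices) => console.log(voices.length));
```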

> See also Bug 1333641 where we disabled this for Resist Fingerprinting Mode and Bug 1455899 where on Windows you can observe other tabs using the API.

Thanks! Will take a look. I’m coordinating with Apple and Google to address these kinds of issues, so will check if the other bugs apply to them.

> Anyway, the spec says the User Agent has the discretion to do what it wants: "This method returns the available voices. It is user agent dependent which voices are available." So if we wanted to, we could return only default voices (if we knew what those were), gate this behind user interaction (a low bar), or restrict it in third-party contexts...

Would be fantastic. We aim to update the spec and provide better guidance once we figure out a solution.

I'll flag Martin since he seems to take point on a lot of these and maybe he can add input or redirect...

Flags: needinfo?(mt)

So this is very much like fonts in some ways. If the list is fixed (or it can be made to be fixed) based on browser build or OS, then the fingerprinting exposure is small or nonexistent. However, if - like fonts - new voices can be added, defaults changed, and ordering varies based on system, then we have a serious problem. If we are exposing a significant amount of entropy, then I would say a redesign is needed. As it stands, Tom's suggestion of providing a fixed list seems about right. That reduces to "this is Firefox", which is good enough.
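
To make "a significant amount of entropy" concrete, one rough way to quantify it is Shannon entropy over how often each distinct voice list occurs in a population. The sketch below is illustrative only: `shannonEntropyBits` is a hypothetical helper and the counts are made-up numbers, not measurements.

```javascript
// Hypothetical helper: Shannon entropy, in bits, of a distribution given as
// counts (how many users report each distinct voice list).
function shannonEntropyBits(counts) {
  const total = counts.reduce((sum, c) => sum + c, 0);
  return counts.reduce((bits, c) => {
    if (c === 0) return bits; // an unobserved list contributes nothing
    const p = c / total;
    return bits - p * Math.log2(p);
  }, 0);
}

// Everyone reports the same fixed list: nothing to distinguish users by.
console.log(shannonEntropyBits([1000])); // 0

// Four equally common lists (made-up numbers): two bits of identifying information.
console.log(shannonEntropyBits([250, 250, 250, 250])); // 2
```

A fixed list corresponds to the first case, which is why it adds nothing beyond "this is Firefox".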

This probably needs to be per-platform if we are using system-provided voices. That is marginally worse, but still not terrible enough to warrant going further. I don't see any inherent reason to gate behind gestures or limit access from third-party contexts.
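
A per-platform fixed list could look something like the sketch below. Everything in it is an assumption for illustration: `DEFAULT_VOICES`, `exposedVoices`, the platform keys, and the voice names are placeholders, not actual defaults or Firefox code.

```javascript
// Placeholder allowlists; the names are NOT real default voices.
const DEFAULT_VOICES = {
  macos: new Set(["voice-a", "voice-b"]),
  windows: new Set(["voice-c"]),
};

// Hypothetical filter: report only allowlisted voices, in a fixed order,
// so that user-installed voices and system-specific ordering are hidden.
function exposedVoices(installed, platform) {
  const allowed = DEFAULT_VOICES[platform] ?? new Set();
  return installed
    .filter((v) => allowed.has(v.name))
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}

const installed = [{ name: "voice-b" }, { name: "custom-download" }, { name: "voice-a" }];
console.log(exposedVoices(installed, "macos").map((v) => v.name)); // [ 'voice-a', 'voice-b' ]
```

A filter like this only hides added voices and ordering; it does not by itself hide a default voice that the user has removed.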

Fundamentally, however, I think that there are some opportunities to do better here. Fonts have a lot of exposure to the web that complicates them, but synthesis does not need that. Sites that need this capability generally only need it in the context of providing local playback of synthesized voice, so they don't need access to the rendered audio.

This means that we can do things like make the output of the API inaccessible to the site unless they provide their own voice synthesis engine (not a current feature, but one that might come in time). If we did that, we would be able to allow for customization at the user end without exposing those details to sites. If someone wants to choose the gender presentation of the voice, that can remain hidden from sites. About the only thing that might be problematic there is the duration of the playback. I know that some sight-impaired people have an amazing capacity to comprehend speech at high speeds, and we would want to support that. However, the length of a stream is information that would be hard to hide.

Flags: needinfo?(mt)

After some archeology, it turns out this has been reported multiple times:
https://bugzilla.mozilla.org/show_bug.cgi?id=1485280
https://bugzilla.mozilla.org/show_bug.cgi?id=1233846

It was also reported as a spec bug back in 2015:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=29350

Group: core-security → core-security-release
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE
Group: core-security-release