Closed Bug 1112032 (MediaSession) Opened 9 years ago Closed 3 years ago

[meta] Implement MediaSession API

Categories

(Core :: Audio/Video: Playback, enhancement, P3)

Tracking

RESOLVED FIXED
Webcompat Priority P3

People

(Reporter: baku, Assigned: chunmin)

References

(Depends on 5 open bugs, Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [webcompat:p3])

Attachments

(5 files, 6 obsolete files)

This first patch defines the WebIDL interfaces and provides the skeleton of the MediaController IPDL protocol.
Assignee: nobody → amarchesini
Attached patch patch (obsolete) — Splinter Review
I don't know who can review this patch. I would say ehsan, but he is on PTO.
There is a lot of IPC PContent stuff. Maybe bent?
Attachment #8556159 - Flags: review?(bent.mozilla)
Elsewhere [1], I proposed the following. I would strongly urge people to consider the alternative HTML proposal, as it seems less complicated than the current API.  

So, I've been reading over all the proposals and I'm also landing at a similar place as @richt and Philip Jägenstedt.

IMHO, the most sensible thing seems to be amending <audio> and <video> to be remote-controlled (through an attribute). This gets rid of a whole bunch of unnecessary API surface, while allowing reuse of existing HTML MediaController machinery. It also allows much easier targeting of (mostly HTML-predefined) events directly on the elements being remote controlled - rather than through an intermediate MediaController instance.

The invariants here are:

    1. Not all media is created equal: only certain media elements need to be designated as capable of receiving media-key-related events (one per page). In Chrome on iOS, this already happens automatically without the need for any special attribute - what is not yet clear to me is how next/prev will work, especially as it assumes a fully active web document running in the background that is capable of receiving these events.
    2. Media should be able to receive the events, and the UA can send their streams to be "remote controlled" by the underlying OS (as per iOS - both Safari and Chrome already do this, btw, without anything special). Registering a media controller without actual media seems like a bad disconnect - which would be possible with .requestMediaController(): events could be routed nowhere (yes, this is also true of a video without a src, or with a bad src, but it's easier to deal with that at the UA level).
    3. Stacking order of the focused media player can be handled by the UA (also as per Boris Smus' proposal).
    Web Audio's busted API needs to be fixed to work with Audio elements - we've been saying that for years now. Let's deal with that later - it's not the 90% use case.
    4. Reusing poster and title as metadata seems sensible - though we might need to make some additional enhancements in the future.
    5. As shown already in iOS (Chrome and Safari), it's not necessary to have a door-hanger for "Allow foo.com to receive media key events?".

Given that we have a pretty clear understanding of the use cases, I would suggest we consider starting with the existing HTML elements first. If that doesn't work, we can look at adding a new API.
cc'ing Roc since I'd really like to get feedback from the media team here.

(In reply to Marcos Caceres [:marcosc] from comment #4)
> Elsewhere [1],

I think you forgot to include the [1] reference?

The idea of being able to tie this to an <audio> or <video> element is an interesting one. Not sure if that was discussed before? I agree that it would automatically provide a bunch of the values that need to be communicated to the platform.

But there are a few problems we'd need to solve:

1. We need to also enable WebAudio's AudioContext to act as a receiver for media key events.

This should be relatively easy I suspect, by adding to AudioContext whatever API we add to HTMLMediaElement.

2. I suspect that some websites that do complex audio playback don't just use a single <audio> or <video> element. For example, to reduce the gap between songs they might prepare the next song in one <audio> element while the current song is playing in another <audio> element.

So I think we need to make sure that it's possible to add mediakey handling to two elements at once. Which means that we'll have to define what happens when you attempt to overlap them. I.e. which songname/etc is considered the active one. A simple solution might be that when the page starts playing the second one, the first one is paused.

3. We likely need to support "complex" playback patterns. Such as if you use multiple <audio> elements or AudioContexts to play sound, all of which should be paused when the user presses the 'pause' button. Or if you point an <audio> element at an audio file which contains multiple songs. Or if you use a single <video> together with MSE to play multiple music videos in a gapless fashion.

That's where the current API proposal is advantageous since it allows the page to easily override and do everything itself.

Possibly this could be addressed using the solution to 1 above. I.e. we could ask the page to create a single AudioContext which simply "plays silence" by not hooking any audio sources to it. That AudioContext could then act as an event dispatcher and let the page know when it should take various actions. I.e. the AudioContext would turn into what the MediaController object is in the current proposal.

But I'm not sure if there are downsides with asking pages to play such a "silent" AudioContext.
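
A minimal sketch of that "silent AudioContext" idea (the "mediakeypress" event name and the handler body are invented for illustration, not proposed API):

```
// No source nodes are connected, so nothing audible plays, but the context
// counts as active and can act as the page's dispatcher for media keys.
const ctx = new AudioContext();

ctx.addEventListener("mediakeypress", (event) => {  // event name invented
  // page-defined logic: skip to the next song, pause Flash, etc.
});
```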

4. Should we automatically grab things like songname and album art from an <audio> if it's pointed at an mp3 file which contains songname and album art? Is that information well enough defined in the various supported container formats that it'll work cross-browser? Either way we'll likely need to enable the page to override that information.


These problems seem solvable. But someone needs to sit down and create a proposal. And we need to make sure that it doesn't end up getting so complex that the current low-level primitive turns out to be the better place to start.

Marcos, would you have time to do this? If so, how quickly?
1. Yes, especially now that AudioContext is in the process of being able to be suspended and resumed
2. Mimic the native API, right? Also, this would break legacy video games and rich audio apps. They are already broken on iOS, though.
3. AudioContext are supposed to be expensive objects (not saying we can't optimize for this). This sounds like a hack, no?
4. Dublin Core aims to define a set of metadata that is "universal enough". We have some code that does this and tried to bring it to the WHATWG, but it got no traction
(In reply to Paul Adenot (:padenot) from comment #6)
> 1. Yes, especially now that AudioContext is in the process of being able to
> be suspended and resumed

Cool!

> 2. Mimick native API, right? Also, this would break legacy video games and
> rich audio apps. They are already broken on iOS, though.

I'm not sure what you mean here?

> 3. AudioContext are supposed to be expensive objects (not saying we can't
> optimize for this). This sounds like a hack, no?

I don't know enough about WebAudio to know if this qualifies as a hack or not. I'll defer to others.

> 4. Dublin Core aims to define a set of metadata that are "universal enough".
> We have some code that do this and tried to bring it to the whatwg, but it
> got no traction

Was the DC proposal a new metadata format? Or did it formalize the fields that current mp3 players are using?

Either way I'm happy to go the simplest route here. We can always add automatic "read from the file" support later.
> Marcos, would you have time to do this? If so, how quickly?

I have time - I can dedicate the next year to it. About how quickly: I'll aim for an FPWD by end of March.
Hmm... any chance we could have a proposal before the end of March? I'd rather not completely block this project for two months. I don't think we'd need a spec document to unblock, just some draft IDL.

The proposal at http://discourse.specifiction.org/t/media-controls-api/718 seems like a decent start, but doesn't address some of the issues in comment 5.

I also don't think that we should have the page expose a *playlist* to the UA. Which it seems like http://discourse.specifiction.org/t/media-controls-api/718 is trying to do. Getting the next/previous song seems like it's very often going to require running a bunch of custom code, so I think it's better that we simply fire events for next/previous song and let the page handle it.
(In reply to Jonas Sicking (:sicking) from comment #10)
> Hmm... any chance we could have a proposal before end of march?

Will certainly do my best. The aim is to have an FPWD ready to publish by then, so it means iterating over stuff through the next 7 weeks. We might get lucky and hit on something that works long before then, and then it's just normal spec bla bla around the IDL.

Of course, everyone here is free to participate in the process and review PRs as they come in (just subscribe to https://github.com/whatwg/media-keys).

> I'd rather not completely block this project for two months. I don't think we'd need a
> spec document to unblock, but just some draft IDL.

The IDL doesn't block the project at all. The core of the work is putting in all the cross-platform infrastructure behind the IDL facade. Understanding the limitations of each platform and how we make this work in an interoperable manner is the bulk of the work, IMO. 

So I can only imagine that just capturing the media-key events with Gecko, and getting the browser to become the focused media player on Mac, Linux, Windows, and B2G, should keep Andrea plenty busy while standards nerds work out what the IDL looks like :)

> The proposal at http://discourse.specifiction.org/t/media-controls-api/718
> seems like a decent start, but doesn't address some of the issues in comment
> 5.

Yes, that is true. We will work on those incrementally. In particular, I need to actually look at, and talk to, various potential users of this (SoundCloud, Pandora, Spotify, YouTube, etc. - probably all using Flash right now). The idea that people would be using multiple <audio> elements strikes me as particularly sad (instead of swapping out the src of a single element or using some clever Web Audio stuff). But who knows... it's part of what I need to investigate, but I prefer to actually see what devs are doing rather than speculate. I have a few friends at the various music companies, so I will reach out to them.

> I also don't think that we should have the page expose a *playlist* to the
> UA. Which it seems like
> http://discourse.specifiction.org/t/media-controls-api/718 is trying to do.

No, I don't think that is the case. At least, that's not what I had in mind. Right now, a generic mechanism for receiving and routing the media-key events would be ideal. If that then underpins, say, HTMLMediaController, that would be really cool. Anyway, that's in the future (as is, say, something awesome like declaring a playlists with, say, `<audio srcset="...">` :)). 

In a few weeks, we may end up at exactly the same API that Baku and Ehsan proposed. Just need to make sure the use cases are covered - and that we don't start with a solution looking for a problem + that other potential implementers are interested (they seem to be, which is already good). Don't want a repeat of SysApps, if you know what I mean.

> Getting the next/previous song seems like it's very often going to require
> running a bunch of custom code, so I think it's better that we simply fire
> events for next/previous song and let the page handle it.

Agree. A generic event capturing/routing mechanism would be great for media-key events, because it would allow transitioning between or within <audio>, <video>, <canvas>, AudioContext, <x-presentation>, or whatever else comes down the line.
> > I'd rather not completely block this project for two months. I don't think we'd need a
> > spec document to unblock, but just some draft IDL.
> 
> The IDL doesn't block the project at all. The core of the work is putting in
> all the cross-platform infrastructure behind the IDL facade. Understanding
> the limitations of each platform and how we make this work in an
> interoperable manner is the bulk of the work, IMO. 

We should check this with Baku who's actually doing the implementation. But I'll note that there are already patches in this bug.

Adding needinfo for that.

> > The proposal at http://discourse.specifiction.org/t/media-controls-api/718
> > seems like a decent start, but doesn't address some of the issues in comment
> > 5.
> 
> Yes, that is true. We will work on those incrementally. In particular, need
> to actually look at, and talk to, various potential users of this
> (soundcloud, pendora, spotify, youtube etc.... probably all using Flash tho
> right now).

Ah, yes, flash is a very good point. We definitely need the API to work for websites that use flash. That's a very good example of issue 3 from comment 5.

I think that makes a good case for why the current proposal is the right approach. At least as the low-level primitive on which we can build sugar like HTMLMediaElement integration.

Unless we can go with a silent AudioContext providing the low-level primitive.

> The idea that people would be using multiple <audio> elements
> strikes me as particularly sad (instead of swapping out src of a single
> element or using some clever Web Audio stuff).

If you want to start loading the next song before the current song has finished playing, wouldn't you have to do that?
Flags: needinfo?(amarchesini)
> We should check this with Baku who's actually doing the implementation. But
> I'll note that there are already patches in this bug.

What we have here is a patch that implements what we discussed on the WebAPI mailing list.
The code is written and it's ready to be reviewed. If in the meantime we have some spec, maybe we can do that as follow-ups... I hope so :)

What we have here is a WebIDL MediaController object, exposed via navigator.requestMediaController() as a Promise. The MediaController has a mediaActive attribute and is an EventTarget for MediaKeyEvents.

There is also a MediaControllerDispatcher that is meant to be used for debugging and for B2G. This object is able to dispatch media key events and to know whether we have any active MediaControllers (a MediaController with mediaActive set to true).
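
A rough sketch of how the shape described above would be used from a page (only requestMediaController() and mediaActive come from this comment; the event type name below is a placeholder):

```
navigator.requestMediaController().then((controller) => {
  console.log(controller.mediaActive);  // boolean, as described above

  // MediaController is an EventTarget for MediaKeyEvents; the exact event
  // type names are not settled, so "mediakeyevent" here is illustrative only.
  controller.addEventListener("mediakeyevent", (event) => {
    // react to play/pause/next/prev hardware keys
  });
});
```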

I didn't implement any additional attributes on purpose (such as 'title', 'duration', etc.) because I know that these can change a lot once the spec is discussed with other vendors.

So, to me, the patch can still go through review.
Jonas, there is a lot of IPDL code. Do you want to review it? :)
Flags: needinfo?(amarchesini) → needinfo?(jonas)
OS: Linux → All
Hardware: x86_64 → All
(In reply to Andrea Marchesini (:baku) from comment #13)
> > We should check this with Baku who's actually doing the implementation. But
> > I'll note that there are already patches in this bug.
> 
> What we have here is a patch that implements what we discussed on webapi ml.
> The code is written and it's ready to be reviewed. If in the meantime we
> have some spec, maybe we can do that as follow-ups... I hope so :)

The problem is that the API might be incompatible with what various media sites are already relying on, particularly on iOS where this capability has shipped for a while (in both Safari and Chrome). See for discussion: 

https://github.com/whatwg/media-keys/issues/1 

Ideally, we would want to get parity with Safari and Chrome (which right now don't require an API at all for the basic cases of playing/pausing). Basically, if you start playing <audio/video> in a tab, you get the media focus for free - and then the media keys just change the state of playback (causing the appropriate state-change events to be fired after the fact, as per HTML).

With skip forward/back, it might be as simple as just firing those events on the focused HTMLMediaElement itself. 
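
For example, something along these lines (the "skipforward"/"skipback" event names are placeholders for illustration, not shipped API):

```
const player = document.querySelector("audio");
player.play();  // starting playback is what grants media focus in this model

// play/pause from the lock screen or headset simply toggles playback and
// fires the normal "play"/"pause" events; skip keys would be new events
// fired directly on the focused element.
player.addEventListener("skipforward", () => { player.currentTime += 30; });
player.addEventListener("skipback", () => { player.currentTime -= 30; });
```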

So, given feedback and research so far... 

> 1. We need to also enable WebAudio's AudioContext to act as a receiver for media key events.

I don't know if I agree with this (at least initially) - I feel more strongly that we should make it possible to use Web Audio with <audio>. 

> 2. I suspect that some websites that do complex audio playback doesn't just use a single <audio> or <video> element. For example to reduce the gap between songs they might prepare the next song in one <audio> element while the current song is playing in another <audio> element.

This doesn't appear to be the case from the players I've looked at so far. Single element is fine - but more sophisticated queuing could be added. 

> 3. We likely need to support "complex" playback patterns.

I think we should "v2" the complex stuff - in particular, enabling more Flash is not great - we should actively work towards leapfrogging Flash, as it currently lacks this capability. All of that could be handled by a single Web Audio node being routed to an <audio> element.

> 4. Should we automatically grab things like songname and album art from a <audio> if it's pointed at a mp3 file which contains songname and album art? 

Yes - where we can. If the files are MP3s, we can probably grab a little bit from the ID3 tags. We can grab the image from the poster attribute on the media element.
Requesting info from Roc too, as this is more squarely in his domain.
Flags: needinfo?(roc)
It's confusing (and potentially a big problem in the future) to call this new API MediaController since HTML5 already specs a MediaController which is a rather different thing.

Web Audio can play through an <audio> element today by getting a MediaStream via MediaStreamAudioDestinationNode and feeding that into an <audio> element. However, you can't pause/resume the AudioContext via the audio element, primarily because Web Audio doesn't define pause/resume operations at all. So we'd need to spec that out in the Audio WG.

However, I don't know what I'm being asked here.
Flags: needinfo?(roc)
(In reply to Robert O'Callahan (:roc) (out of office, slow reviews) (Mozilla Corporation) from comment #16)
> It's confusing (and potentially a big problem in the future) to call this
> new API MediaController since HTML5 already specs a MediaController which is
> a rather different thing.

I've renamed this bug to try to put an end to the confusion (as this keeps getting pointed out as a problem). 

> Web Audio can play through an <audio> element today by getting a MediaStream
> via MediaStreamAudioDestinationNode and feeding that into an <audio>
> element. However, you can't pause/resume the AudioContext via the audio
> element, primarily because Web Audio doesn't define pause/resume operations
> at all. So we'd need to spec that out in the Audio WG.

My understanding is that Paul Adenot, or someone from the Web Audio WG, is going to add play/pause to Web Audio - so that should be covered soon.  

> However, I don't know what I'm being asked here.

Sorry for not being clear. The question is: should we follow iOS's existing convention of automatically integrating <audio/video> into the system media player, or specify a completely new API? 

I'm arguing that we get 95% of what we need for free if we just rely on HTML's media elements and do what iOS does (when you play media through an element, it gets automatically routed to the system's media player - hence you get play/pause control for free, plus the poster, the title, etc., which you can get directly from the metadata of the media, e.g. by reading ID3 tags).

Then all we need to add are fast-forward and rewind events on HTMLMediaElement. Eventually, we can add something like "srcset" or just "srclist" to HTML media elements to support playlists (at least of one type of media, though it would be trivial to self-manage a playlist - JS libraries like jPlayer already manage playlists today).
Summary: Implement MediaController → Implement remote control of media
For the purposes of simpler discussion, let's rename the existing proposal to MediaKeyReceiver.

(In reply to Marcos Caceres [:marcosc] from comment #14)
> (In reply to Andrea Marchesini (:baku) from comment #13)
> > > We should check this with Baku who's actually doing the implementation. But
> > > I'll note that there are already patches in this bug.
> > 
> > What we have here is a patch that implements what we discussed on webapi ml.
> > The code is written and it's ready to be reviewed. If in the meantime we
> > have some spec, maybe we can do that as follow-ups... I hope so :)
> 
> The problem is that the API might be incompatible with what various media
> sites are already relying on, particularly on iOS where this capability has
> shipped for a while (in both Safari and Chrome).

Could we simply define that if the MediaKeyReceiver interface is instantiated by the page, then hardware keys are sent to the MediaKeyReceiver instance rather than automatically playing/pausing <audio>?

But pages that do not instantiate MediaKeyReceiver can keep doing what they are currently doing.

Would that resolve the incompatibilities?
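
In other words, something like this sketch (MediaKeyReceiver and the handler names are placeholders from this discussion, not an implemented interface):

```
// A page with a "complex" audio setup opts in explicitly:
const receiver = new MediaKeyReceiver();  // hypothetical constructor
receiver.onpause = () => {
  // page-defined: pause every <audio>, Flash player, etc. the page owns
};
receiver.onplay = () => {
  // page-defined: resume whatever was playing
};

// Pages that never instantiate MediaKeyReceiver keep today's behavior:
// hardware keys just play/pause the active <audio>/<video> automatically.
```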


> > 1. We need to also enable WebAudio's AudioContext to act as a receiver for media key events.
> 
> I don't know if I agree with this (at least initially) - I feel more
> strongly that we should make it possible to use Web Audio with <audio>. 
> 
> > 3. We likely need to support "complex" playback patterns.
> 
> I think we should "v2" the complex stuff - particularly, enabling more flash
> is not great - we should actively work towards leapfrogging Flash, as it
> currently lacks this capability. All that could be handled by a single Web
> Audio node being routed to an <audio> element.   

This I disagree with. I don't think v1 should be a limited-capability API and v2 a "complex" API which solves more use cases.

If we're doing v1 and v2 then I'd rather do v1 as a low-level API which lacks a lot of sugar, but which enables webpages to add hardware key support, with v2 adding API which makes it more convenient to hook hardware key control up to an <audio> and have it automatically controlled.

Or, to put it another way: I think it's more important that websites can add hardware key support, than that they can do so by simply adding an attribute in the markup.

An enabling API is more urgent than one that saves a few lines of code.
Flags: needinfo?(jonas)
Another important point is that, like you say, many websites, like Pandora, currently use flash. I'd rather not wait for all of them to rewrite to use <audio> before they can get key support.
Can I proceed with renaming the interface? I wonder how much of this API I can ask to have reviewed.
Apart from the name, it seems that the API is in sync with what Jonas is saying.
Flags: needinfo?(mcaceres)
(In reply to Jonas Sicking (:sicking) from comment #18)
> > The problem is that the API might be incompatible with what various media
> > sites are already relying on, particularly on iOS where this capability has
> > shipped for a while (in both Safari and Chrome).
> 
> Could we simply define that if the MediaKeyReceiver interface is
> instantiated by the page, then hardware keys are sent to the
> MediaKeyReceiver interface rather than automatically playing/pausing <audio>.
>
> But for pages that does not instantiate MediaKeyReceiver that they can keep
> doing what they are currently doing.
> 
> Would that resolve the incompatibilities?

No, because if the MediaKeyReceiver doesn't actually stop the sound - then that's completely unacceptable and a massive problem. Think about this: you go to "awesomemusic.com" and pop in your headset, then continue opening other tabs in the browser. Then someone taps you on the shoulder; you try to pause the music through the headset, but it doesn't work! "Oh CRAP! My headset is broken!!!" you think. "I'll try to pause through the soft keys", but that also doesn't work! OMG! You lower the volume all the way, but then any time you want to use your phone, the music continues to blare. Eventually you find the tab with "awesomemusic.com" and shut it down. You never use "awesomemusic.com" again, and are left pissed off at your crappy phone that can't pause. Imagine that happening when you are driving or in some other stressful situation.

If I press "pause" on the lock screen of a mobile device, the sound MUST cease - no exceptions, no excuses. Hence, this can't be in the control of the developer - it has to be controlled by the system, and must guarantee pausing of playback.

Consider: if pressing pause on the lock screen, or through a headset, or a keyboard, is guaranteed to stop all playback AND send an event to MediaKeyReceiver, then that makes MediaKeyReceiver redundant because <audio> will emit a pause event regardless.  

If we did have a MediaKeyReceiver, it could only be used in concert with an HTML media element - because an HTML media element would need to be active, or at least have been active at some point, in order to receive events (and to be registered as the media currently under the control of the system). A MediaKeyReceiver has no business capturing events unless the user has initiated some kind of media playback through it - the only thing it can do is "hint" that it should have events routed to it *if no other media currently has the focus* (example below).

There are two ways that an application can signal to the OS that it is going to play media: 

1. the user explicitly starts playback by pressing a "play" control provided by an HTML element (as per today).
2. the web application *HINTS* that some element will play media - so the OS must route key events there if 1 hasn't happened first. This allows, for instance, a tab to be opened and backgrounded, and for events to reach the correct element without first needing the user to start playback of any media (not supported today).

Example of 1: any news site today. I can go to BBC News, press play on a story (especially the radio ones), press the power button on my phone, and put the phone back in my pocket while the news story plays through my headset. I can pause the playback through my headset or through the lock screen. This requires no work for developers. It just works (tm).

Complex examples of 2: 

## Focus ONLY when needed
1. I open Spotify.
2. It shows me news but no media options (so I don't pick any!).
3. I go to the home screen of the device. 
4. I bring up system player controls - it provides me with no play options. 

## Focus ONLY when needed 2
1. I open Spotify.
2. I pick a playlist (but I don't pick any song! so nothing is playing). However, Spotify hints to the OS that it can play music.
3. I go to the home screen of the device. 
4. Because Spotify hinted it can play audio, when I bring up the system player controls, they show the first song in the playlist. Similarly, I could have just hit "play" on my headset and Spotify would have started. Use case: I queue up my "run playlist", but only start playing when I start jogging (which requires going to another application, one which doesn't play audio, hence doesn't interfere with the media focus). I press "play" on the headset, and I start running.

## Cross application focus
1. I open Audible, and start listening to an audio book (explicit action, per 1). 
2. I pause Audible's playback through my headset. 
3. I open Spotify.
4. I pick a playlist (but I don't pick any song!). 
5. I bring up system player controls - it still shows Audible. This is because an explicit action started the playback from Audible. Hence, 1 trumps 2. 
6. I pick a song in my playlist. Spotify now handles the media events.  

> > > 3. We likely need to support "complex" playback patterns.
> > 
> > I think we should "v2" the complex stuff - particularly, enabling more flash
> > is not great - we should actively work towards leapfrogging Flash, as it
> > currently lacks this capability. All that could be handled by a single Web
> > Audio node being routed to an <audio> element.   
> 
> This I disagree with. I don't think v1 would be a limited capabilitiy API
> and v2 would be a "complex" API which solves more usecases.
> 
> If we're doing v1 and v2 then I'd rather do v1 as a low-level API which
> lacks a lot of sugar, but which enables webpages webpages to add hardware
> key support, with v2 adding API which make it more convenient to hook up
> hardware key control to an <audio> and have it automatically controlled.

For the reason I mentioned above (the sound MUST cease), and to solve the immediate use cases, I don't see why we would want to over-complicate something that already works, and that sites already rely on. 

Ok yes, Flash is not covered - but I don't think propping up Flash should be a goal. Understanding why some sites use Flash, and providing that functionality to the web would be better (I can only imagine this is, sadly, due to the lack of audio DRM).

> Or, to put it another way: I think it's more important that websites can add
> hardware key support, than that they can do so by simply adding an attribute
> in the markup.
>
> An enabling API is more urgent than one that saves a few lines of code.

This has nothing to do with markup or ease of use for developers. It's about: 

1. preexisting solutions and sites already depending on legacy behavior - particularly on mobile. 
2. covering the most common use cases, which are just being able to put a tab in the background and have media-key events routed to it (as one does with Spotify and iTunes).
Flags: needinfo?(mcaceres)
(In reply to Jonas Sicking (:sicking) from comment #7)
> (In reply to Paul Adenot (:padenot) from comment #6)
> > 2. Mimick native API, right? Also, this would break legacy video games and
> > rich audio apps. They are already broken on iOS, though.
> 
> I'm not sure what you mean here?

Native mobile ecosystems have mechanisms that pause/play audio in other *apps* when one starts playing. Some games, etc. use multiple <audio> elements at the same time, as you say. Those games are already broken on iOS (where you can only have a single <audio> playing at a time, afaik). We could simply give the media focus to a document and not an HTMLMediaElement, though.

> > 4. Dublin Core aims to define a set of metadata that are "universal enough".
> > We have some code that do this and tried to bring it to the whatwg, but it
> > got no traction
> 
> Was the DC proposal a new metadata format? Or did it formalize the fields
> that current mp3 players are using?

It's just a formalization of the fields to use for audio (e.g., artist, album, title, etc.).
 
> Either way I'm happy to go the simplest route here. We can always add
> automatic "read from the file" support later.

We already have that (a proprietary `mozGetMetadata` method that returns an object containing the full set of metadata).
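
For reference, a minimal example of that existing Gecko-only method (the tag names it returns depend on the container format):

```
const audio = document.querySelector("audio");
audio.addEventListener("loadedmetadata", () => {
  // Non-standard, Gecko-only: returns a plain object of metadata key/value
  // pairs read from the media resource.
  const tags = audio.mozGetMetadata();
  Object.keys(tags).forEach((name) => console.log(name, tags[name]));
});
```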
(In reply to Robert O'Callahan (:roc) (out of office, slow reviews) (Mozilla Corporation) from comment #16)
> Web Audio can play through an <audio> element today by getting a MediaStream
> via MediaStreamAudioDestinationNode and feeding that into an <audio>
> element. However, you can't pause/resume the AudioContext via the audio
> element, primarily because Web Audio doesn't define pause/resume operations
> at all. So we'd need to spec that out in the Audio WG.

This is bug 1094764 (and the spec has been merged into the main document), for the record. I'm almost done with it (and I've listed all the things missing from the spec; I'll prepare a spec patch as well).

It was initially put in the spec for performance reasons (not letting an AudioContext run all the time when it's not needed, saving battery/CPU), but it can be useful here.
I like the idea of enabling the UA to forcefully mute a page. However that needs to be a UA choice since it's not something we'll want to do on desktop. I.e. you wouldn't want desktop to forcefully mute *all* pages when the 'pause' button is pushed.


Either way, to me the requirements are still:

* Must work with flash websites. I don't want to block adoption from websites on having them rewrite
  their audio systems to not use flash.

* Given that we have patches that are blocked on having an API proposal, we can't wait for an API
  proposal until end of March.


One path forward here seems to be to create essentially two separate APIs:

1. A set of events which are fired against an <audio> or AudioContext whenever the UA forcefully mutes
   the page. As well as a description of how that forceful mute works, e.g. does it pause the playback or
   just silence it.

2. An API like MediaKeyReceiver which enables pages to handle complex audio systems like ones based on
   flash.

and maybe, as sugar,

3. An attribute which can be set on an <audio> which essentially makes the <audio> work as a
   MediaKeyReceiver.
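
Very roughly, the three pieces above might look like this from a page's point of view (every name here - the "interruptbegin" event, the MediaKeyReceiver constructor, the "mediakeys" attribute - is a placeholder, not agreed API):

```
const audioElement = document.querySelector("audio");

// 1. Event fired when the UA forcefully pauses/mutes the page's audio:
audioElement.addEventListener("interruptbegin", () => {
  // update the page's own UI to reflect that playback was stopped by the UA
});

// 2. Low-level receiver for pages doing "complicated things" (Flash, sprited
//    audio, multiple elements):
const receiver = new MediaKeyReceiver();
receiver.onplaypause = () => {
  // page-specific logic: pause/resume whatever is actually producing sound
};

// 3. Sugar: opt a specific element in declaratively, e.g.
//    <audio src="song.mp3" mediakeys></audio>
//    which would behave as if it were its own MediaKeyReceiver.
```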
Comment #24 sounds good to me.

Basically we've got some events that are not cancellable, can't be hijacked for other functions (e.g. controlling presentations), and are guaranteed to mute playback ... and we've got other events that are cancellable and can be repurposed for other things ... and these need to be distinguished in the API. The latter should still do something useful by default if a Web page just plays a media element.
(In reply to Jonas Sicking (:sicking) from comment #24)
> I like the idea of enabling the UA to forcefully mute a page. However that
> needs to be a UA choice since it's not something we'll want to do on
> desktop. I.e. you wouldn't want desktop to forcefully mute *all* pages when
> the 'pause' button is pushed.

Agree. Just want to mute the one that has the "media focus". The underlying system can then impose its own restriction on how many sources can be simultaneously playing. 
 
> Either way, to me the requirements are still:
> 
> * Must work with flash websites. I don't want to block adoption from
> websites on having them rewrite
>   their audio systems to not use flash.

Ok, but this is still going to be a huge hack for those sites. Hopefully it won't motivate Adobe to just add media key events directly to the Flash plugin. 
 
> * Given that we have patches that are blocked on having an API proposal, we
> can't wait for an API proposal until end of March.

Apart from Baku's patches here, what is being blocked? There are no dependencies listed on this bug. If there are dependent bugs, it would be good to have those listed - as it would help to know who the potential consumers are and to make sure we are addressing their requirements. 

> One path forward here seems to be to create essentially two separate APIs:
> 
> 1. A set of events which are fired against a <audio> or AudioContext
> whenever the UA forcefully mutes
>    the page. As well as a description of how that forceful mute works. E.g.
> does it pause the playback or
>    just silence it.

It would pause it. It would be quite unexpected if I was listening to an audiobook, or a podcast, or even a song, and pressing pause silenced it rather than pausing it.

> 2. An API like MediaKeyReceiver which enables pages to handle complex audio
> systems like ones based on flash.

Just thinking out loud, but something like having both `RemoteControl` and `HTMLMediaElement` implement a [NoInterfaceObject] `MediaKeyReceiver`.

This would allow MediaKeyReceiver to remain an independent API, but HTMLMediaElement could still be explained (as far as media keys are concerned) in terms of MediaKeyReceiver. A constructable, or requested, `RemoteControl` would hopefully then make it possible to address use cases like controlling RevealJS presentations, the Flash use cases, things built with custom elements, etc.

`RemoteControl` would have to be requested (`requestRemoteControl()`), granting the user the explicit ability to not allow arbitrary web apps to claim the media focus if they are not actually controlling media (or if they are playing media through a third-party plugin, as per Flash). Hopefully requiring an explicit grant should also prevent media-key hijacking.

> and maybe, as sugar,
> 
> 3. An attribute which can be set on an <audio> which essentially makes the
> <audio> work as a
>    MediaKeyReceiver.

Would be nice. Ok, this is starting to look good now :)

I'll take this back to the WHATWG and see what we can come up with.
I was invited to comment on this via email, so here goes.

(In reply to Marcos Caceres [:marcosc] from comment #21)

> If I press "pause" on the lock screen of a mobile device, the sound MUST
> cease - no exceptions, no excuses. Hence, this can't be in control of the
> developer - it has to be controlled by the system, and must guarantee
> pausing on playback. 

Much of the design hinges on this. On Android it's possible to produce audio that isn't silenced by the headphone button. Not sure about iOS. It certainly is possible on desktop platforms.

To enforce that the lock screen and headphone buttons can always be used requires that the problem be solved for Web Audio as well, and that any audio, including notifications, causes lock screen UI to appear and the headphone button to be responsive. I think muting an entire page will be the only recourse in some cases, and that sounds sad for pages that want to gently fade out when paused.

I have a hard time making up my mind about this, but I don't think we should accept the burden of enforcing silence lightly.

> ## Cross application focus
> 1. I open Audible, and start listening to an audio book (explicit action,
> per 1). 
> 2. I pause Audible's playback through my headset. 
> 3. I open Spotify.
> 4. I pick a playlist (but I don't pick any song!). 
> 5. I bring up system player controls - it still shows Audible. This is
> because an explicit action started the playback from Audible. Hence, 1
> trumps 2. 
> 6. I pick a song in my playlist. Spotify now handles the media events.  

I think this is essentially the "possible future use case" "I have registered a media player app and I want it to start up when I press the play key" from https://etherpad.mozilla.org/audiocontrols

It seems to be at odds with the iOS model, but I hear that platform conventions on Windows are different. Needs investigation?

> 1. preexisting solutions and sites already depending on legacy behavior -
> particularly on mobile. 

This has been mentioned several times, but I don't understand. The automatic handling of audio focus will look to a site just like any other play/pause that could originate from <video controls> or a context/long-press menu. How can one depend on it in a way that restricts the design of this new API?
I'm not sure how to combine these two things in an API:
1. Only grant audio focus (and thus show lock screen UI) when playback begins.
2. Work with Flash.

Flash is completely opaque, at best the browser could see that it's producing audio exceeding some threshold level. A playing HTMLMediaElement or an active AudioContext could very well be silent, however. It seems harder to implement and test something defined in terms of the waveforms. It's also going to be silly when you start playing something that's initially silent, like some CD bonus tracks or movies tend to be.
(In reply to Philip Jägenstedt from comment #28)
> I'm not sure how to combine these two things in an API:
> 1. Only grant audio focus (and thus show lock screen UI) when playback
> begins.
> 2. Work with Flash.

You would need two APIs that share a common base (through either inheritance or a mix-in).
 
> Flash is completely opaque, at best the browser could see that it's
> producing audio exceeding some threshold level.

No, the idea is much simpler and hackier: just request to get the media-key events, then:

```
foo.requestRemoteControl().then(
  //This OS supports this and it's been granted. 
  (rc) => {
    //Route events to flash from the web page, using .newMessage as IPC! YAY!
    rc.onrewind = rc.onfastforward = (e) => $("#flashObject").newMessage("JS_EVENT:" + e.name);
  }, 
  (err) =>{
    //User probably rejected this (check with permission API!)
    //Or the OS doesn't support this... show enhanced controls in UI!
    $("#flashObject").newMessage("NO MEDIA KEYS!");
  }
);

```

> A playing HTMLMediaElement
> or an active AudioContext could very well be silent, however. It seems
> harder to implement and test something defined in terms of the waveforms.

Yes, I doubt we want to do anything like this. 

> It's also going to be silly when you start playing something that's
> initially silent, like some CD bonus tracks or movies tend to be.

It's important to decouple what <audio> does today from the use cases we are trying to solve. We want to keep the current behavior, but try to provide an additional API that enables these extended use cases.
(In reply to Marcos Caceres [:marcosc] from comment #29)
> (In reply to Philip Jägenstedt from comment #28)
> > I'm not sure how to combine these two things in an API:
> > 1. Only grant audio focus (and thus show lock screen UI) when playback
> > begins.
> > 2. Work with Flash.
> 
> You would need two APIs that share a common base (though either inheritance
> or through a mix-in). 
>  
> > Flash is completely opaque, at best the browser could see that it's
> > producing audio exceeding some threshold level.
> 
> No, the idea is much simpler and hackier: just request to get the media-key
> events, then:
> 
> ```
> foo.requestRemoteControl().then(
>   //This OS supports this and it's been granted. 
>   (rc) => {
>     //Route events to flash from the web page, using .newMessage as IPC! YAY!
>     rc.onrewind = rc.onfastforward = (e) =>
> $("#flashObject").newMessage("JS_EVENT:" + e.name);
>   }, 
>   (err) =>{
>     //User probably rejected this (check with permission API!)
>     //Or the OS doesn't support this... show enhanced controls in UI!
>     $("#flashObject").newMessage("NO MEDIA KEYS!");
>   }
> );
> 
> ```

AFAICT you've now sacrificed "Only grant audio focus (and thus show lock screen UI) when playback begins" in favor of "Work with Flash", or are you saying that it should be undefined what causes audio focus to be granted and that some platforms may require media playback to begin and some not? That wouldn't be a great outcome I think.

> > A playing HTMLMediaElement
> > or an active AudioContext could very well be silent, however. It seems
> > harder to implement and test something defined in terms of the waveforms.
> 
> Yes, I doubt we want to do anything like this. 

Good, that would be terrible!

> > It's also going to be silly when you start playing something that's
> > initially silent, like some CD bonus tracks or movies tend to be.
> 
> It's important to decouple what <audio> does today from the use cases we are
> trying to solve. We want to keep the current behavior, but try to provide
> and additional API that enables these extended use cases.

Not sure what you're saying, my point was that trying to define this in terms of waveforms will produce terrible results for initially silent media playback. But that option's off the table.
(In reply to Philip Jägenstedt from comment #30
> AFAICT you've now sacrificed "Only grant audio focus (and thus show lock
> screen UI) when playback begins" in favor of "Work with Flash", or are you
> saying that it should be undefined what causes audio focus to be granted and
> that some platforms may require media playback to begin and some not? That
> wouldn't be a great outcome I think.

No, I'm saying both can be supported through the same mechanism (nothing need be thrown away): You get lock-screen integration with HTMLMediaElement for free, but a developer can explicitly request for the media keys to be routed to a particular object through `requestRemoteControl()`. 

 
> > > It's also going to be silly when you start playing something that's
> > > initially silent, like some CD bonus tracks or movies tend to be.
> > 
> > It's important to decouple what <audio> does today from the use cases we are
> > trying to solve. We want to keep the current behavior, but try to provide
> > and additional API that enables these extended use cases.
> 
> Not sure what you're saying, my point was that trying to define this in
> terms of waveforms will produce terrible results for initially silent media
> playback. But that option's off the table.

Ok cool - I was trying to be polite and not say it was a terrible suggestion :)
(In reply to Marcos Caceres [:marcosc] from comment #31)
> (In reply to Philip Jägenstedt from comment #30
> > AFAICT you've now sacrificed "Only grant audio focus (and thus show lock
> > screen UI) when playback begins" in favor of "Work with Flash", or are you
> > saying that it should be undefined what causes audio focus to be granted and
> > that some platforms may require media playback to begin and some not? That
> > wouldn't be a great outcome I think.
> 
> No, I'm saying both can be supported through the same mechanism (nothing
> need be thrown away): You get lock-screen integration with HTMLMediaElement
> for free, but a developer can explicitly request for the media keys to be
> routed to a particular object through `requestRemoteControl()`. 

OK, your object `foo` can either get access by the act of starting media playback, or via `foo.requestRemoteControl()`. If this is very important to Mozilla I guess that would work, but I'd probably prefer not to see the explicit request bit in Blink. I don't have a veto, though.
(In reply to Philip Jägenstedt from comment #32)
> OK, your object `foo` can either get access by the act of starting media
> playback, or via `foo.requestRemoveControls()`. If this is very important to
> Mozilla I guess that would work, but I'd probably prefer to not see the
> explicit request bit in Blink. I don't have a veto, though.

Jer Noble from Apple is also not a fan: 
https://github.com/richtr/html-media-focus/issues/5#issuecomment-73909005

From a standardization perspective, my recommendation is that we (Moz) go for interop with Blink and WebKit over supporting the Flash use case (by possibly enhancing both AudioContext and HTMLMediaElement). That at least gives us the interop, the JS primitive for use in web components, and addresses the most important and forward-looking use cases (i.e., a future where Flash is not part of the Web Platform, which is already the reality on mobile).

If we want Flash to support this, maybe we should just ask Adobe to add support in the Flash plugin instead. We don't support Flash in FxOS anyway, right? And Adobe already said it's discontinuing Flash support on mobile (hence, it doesn't affect Firefox on Android, and absolutely won't matter on Firefox for iOS). As stated on Wikipedia:

"In November 2011, however, Adobe announced the withdrawal of support for Flash on mobile devices. Adobe is reaffirming its commitment to "aggressively contribute" to HTML5.[37] In November 2011 there were also a number of announcements that demonstrated a possible decline in demand for rich Internet application architectures, and Flash in particular.[38] Adobe announced the end of Flash for mobile platforms or TV, instead focusing on HTML5 for browser content and Adobe AIR for the various mobile application stores.[39][40][41] Pundits questioned its continued relevance even on the desktop[42] and described it as "the beginning of the end".[43]"

IMHO, it seems unlikely that Flash will continue to be around in the medium term on the desktop. Additionally, Jim Ley from the BBC told me that they will drop Flash once MSE is in place (i.e., it's not a matter of DRM - see [jim]). I imagine it will be the same for other content producers (and I'm happy to reach out to more to reaffirm this), especially once EME is also in place. Adding something like `requestRemoteControl()` would then mostly just be a stopgap solution to support a dying technology.

[jim] https://twitter.com/marcosc/status/564805399673438209
It sounds like we mostly agree. Let's try to make a nice API that makes sense for HTMLMediaElement and AudioContext, and possibly bolt on Flash if and when it seems necessary to do so.
I think it's critical that we provide a low-level primitive here. I.e. one where the website is implementing its own audio system and we just provide the integration for hardware keys. This is what the extensible web is about. If we bundle too much functionality into HTMLAudioElement, that inevitably means that we will make some developers' lives harder because they are not using HTMLAudioElement exactly the way that we've proposed.

For example they might "sprite" multiple songs into a single source file, or use multiple HTMLAudioElements to do preloading of the next song etc.

But yes, we also need to support flash for the sake of supporting flash. We can't ignore the fact that the majority of websites out there use flash for playing audio.

That said, I don't think we can, should, or need to, require that flash audio is automatically paused/muted the same way that we can with WebAudio/<audio>. Nor do we need to automatically detect when the flash is playing audio and have that affect audio focus.

I think it's perfectly fine that a website that uses flash, or does "complicated things", has to tell the platform when it is playing audio. Any website can fake playing audio anyway by simply looping a silent mp3 file.
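
The "fake it" workaround really is that simple - a sketch, with "silence.mp3" standing in for any silent audio resource:

```
// Loop a silent file so the platform treats the page as actively playing
// audio (and therefore eligible for audio focus / media keys), while the
// real playback happens in Flash or elsewhere.
const keepAlive = new Audio("silence.mp3");
keepAlive.loop = true;
keepAlive.play();
```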

Likewise I think it's fine that we don't forcefully mute flash audio. We won't be able to do that anyway. I.e. no matter what we do in this spec, any website will on desktop be able to play flash audio even when <audio>/WebAudio has been forcefully muted by the platform.
I think that the main points of disagreement are not low-level API vs high-level declarative solution, but rather these two points:

1. Should it be possible to get access to media keys and lock screen UI without playing any audio?

2. Should each audio-producing object (HTMLMediaElement, AudioContext, maybe plugins) get access individually, or should it be possible to group objects or hand off access to another object?

We don't have total freedom in how to answer these questions, they will be determined by existing platform APIs to some extent. It seems that in order to work on iOS, the answer to the first question must be no. For the second question, I hope we can support "complicated things", but we should make sure that no platform ties audio focus to a single media player and thus a single HTMLMediaElement.

See also https://github.com/whatwg/media-keys/blob/gh-pages/README.md which has changed a lot today.
(In reply to Philip Jägenstedt from comment #36)
> I think that the main points of disagreement are not low-level API vs
> high-level declarative solution, but rather these two points:
> 
> 1. Should it be possible to get access to media keys and lock screen UI
> without playing any audio?

It should not be possible to get access to media keys without having audio focus.

However it will always be possible to get audio focus without playing any audio. Just loop a silent mp3.

> 2. Should each audio-producing object (HTMLMediaElement, AudioContext, maybe
> plugins) get access individually, or should it be possible to group objects
> or hand off access to another object?

I don't really have a strong opinion on this. What is important to me, though, is that websites that use multiple HTMLMediaElements to do things like load the next song and switch over to it once the current one is finished can do this without creating too much madness.
For what it's worth, it appears to me that the original proposal in this bug does seem to fulfill the various requirements mentioned here (though I might be missing some).

* It only exposes events to the page which has audio focus.
* It enables the platform to forcefully mute <audio>/WebAudio using whatever policy it wants (though
  the proposal treats the events which should go along with those policies as a separate API, which may
  be unnecessary).
* It enables websites which use flash to still build a UI which uses media keys.
* It enables websites which do "complicated things" to still build a UI which uses media keys.

At least as far as I can tell. Please let me know if I'm missing something?


But I by no means claim that the proposal is perfect.


One problem that it actually has is that it doesn't let you read the song information from a cross-origin media file. I.e. <audio> and <video> allow you to play media from other origins by allowing embedding but preventing any data to be extracted by the page.

If we enabled using an HTMLMediaElement as the information source for song information, then we could still display that to the user without the page ever getting access to the information. This could still be relatively easily fixed through small additions to that proposal.


Anyhow, I'm very open to other proposals. But I don't yet see a reason why we should give up on any of the requirements that have been mentioned so far.
(In reply to Jonas Sicking (:sicking) from comment #37)
> (In reply to Philip Jägenstedt from comment #36)
> > I think that the main points of disagreement are not low-level API vs
> > high-level declarative solution, but rather these two points:
> > 
> > 1. Should it be possible to get access to media keys and lock screen UI
> > without playing any audio?
> 
> It should not be possible to get access to media keys without having audio
> focus.
> 
> However it will always be possible to get audio focus without playing any
> audio. Just loop a silent mp3.

I'm not sure if we're in agreement or not. I think that in order to be more easily implementable on systems that tie media key access to audio playback, media key access and audio focus should be a single concept, and that should require audio playback to begin. Pages can play silent audio to get around it, sure. The alternative would be for the implementation to play silent audio internally on those systems.

> > 2. Should each audio-producing object (HTMLMediaElement, AudioContext, maybe
> > plugins) get access individually, or should it be possible to group objects
> > or hand off access to another object?
> 
> I don't really have a strong opinion on this. What is important to me though
> is that for websites that do use multiple HTMLMediaElements to do things
> like load the next song and switch over to it once the current one is
> finished, that they can do this without creating too much madness.

I also haven't reached any conclusion yet, but agree.
(In reply to Jonas Sicking (:sicking) from comment #38)
> For what it's worth, it appears to me that the original proposal in this bug
> does seem to fulfill the various requirements mentioned here (though I might
> be missing some).
> 
> * It only exposes events to the page which has audio focus.

That's good. From https://etherpad.mozilla.org/audiocontrols I can't tell if the intention is for a page to only be able to request audio focus once. I think allowing components within a page to compete for audio focus seems useful, consider watching an inline video in your inbox when a new mail arrives and triggers a notification sound.

> * It enables the platform to forcefully mute <audio>/WebAudio using whatever
> policy it wants (though
>   the proposal treats the events which should go along those policies as a
> separate API, which may be
>   unnecessary).

How? I see no connection between the new MediaController and the objects which are playing audio.

> * It enables websites which use flash to still build a UI which uses media
> keys.
> * It enables websites which do "complicated things" to still build a UI
> which uses media keys.

Both true.

> At least as far as I can tell. Please let me know if I'm missing something?

Some of the finesse of Android's Audio Focus and iOS's Audio Session systems is missing, but that was not included as use cases in the document. I think at least ducking for notifications, and the distinction between being interrupted by another media player (user action) and being interrupted by a phone call (external event), is important.

> But I by no means claim that the proposal is perfect.
> 
> 
> One problem that it actually has is that it doesn't let you read the song
> information from a cross-origin media file. I.e. <audio> and <video> allow
> you to play media from other origins by allowing embedding but preventing
> any data to be extracted by the page.
> 
> If we enabled using a HTMLMediaElement as the information source for song
> information, then we could still display that to the user, without the page
> ever getting access to the information. This could still be relatively
> easily fixed through small additions to the that proposal.

I think the API should simply allow these things to be provided by scripts. Passing it automatically from a CORS-same-origin media element to this API seems like optional finesse that would be fine to postpone or ignore.

> Anyhow, I'm very open to other proposals. But I don't yet see a reason why
> we should give up on any of the requirements that have been mentioned so far.

https://github.com/whatwg/media-keys#proposals

Marcos has promised to include a cleaned up version of Mozilla's proposal here.

We're really going to have to resolve the issue of coupling audio focus to audio playback to get anywhere, and Apple needs to be involved in that discussion. If you file a pull request to remove https://github.com/whatwg/media-keys#limitations (which I just added), perhaps that would be a venue to discuss it.
(In reply to Philip Jägenstedt from comment #40)
> (In reply to Jonas Sicking (:sicking) from comment #38)
> I think allowing components within a page to compete for audio focus seems
> useful, consider watching an inline video in your inbox when a new mail
> arrives and triggers a notification sound.

I agree with that.

> > * It enables the platform to forcefully mute <audio>/WebAudio using whatever
> > policy it wants (though
> >   the proposal treats the events which should go along those policies as a
> > separate API, which may be
> >   unnecessary).
> 
> How? I see no connection between the new MediaController and the objects
> which are playing audio.

What I meant is that MediaController/MediaKeyReceiver in no way gets in the way of platforms applying forcing policies. I.e. MediaController/MediaKeyReceiver treats those as orthogonal features.

So whatever forcing policies that the platform had, it can keep having.

That said, we definitely need to define how these forcing policies work, and what events they fire. And the order of those events and the MediaController/MediaKeyReceiver events.

Or, to put it another way, I'm definitely supportive of the idea of defining forcing policies as part of this spec. And I could definitely believe that the MediaController/MediaKeyReceiver API could use changing to make interaction with forcing policies better.

> Some of the finesse of Android's Audio Focus and iOS's Audio Session systems
> are missing,

Please provide more details.

> but that was not included as use cases in the document. I think
> at least ducking for notifications and the distinction between being
> interrupted by another media player (user action) and being interrupted by a
> phone call (external event) is important.

Keep in mind that audio systems on different platforms are very different. And that there are lots of different policies involved. If you want to expose reasons why audio should be stopped/resumed, it's going to get very complicated very fast.

Additionally there might be privacy issues involved. I wouldn't want to tell pages when the user receives a phone call.

> > One problem that it actually has is that it doesn't let you read the song
> > information from a cross-origin media file. I.e. <audio> and <video> allow
> > you to play media from other origins by allowing embedding but preventing
> > any data to be extracted by the page.
> > 
> > If we enabled using a HTMLMediaElement as the information source for song
> > information, then we could still display that to the user, without the page
> > ever getting access to the information. This could still be relatively
> > easily fixed through small additions to that proposal.
> 
> I think the API should simply allow these things to be provided by scripts.
> Passing it automatically from a CORS-same-origin media element to this API
> seems like optional finesse that would be fine to postpone or ignore.

My point is that a lot of media is loaded cross origin *without* CORS. So script can't load things like song or album title from the media file.

I'm also *really* concerned about all this talk about ignoring use cases and postponing them to the next version. First off, this stuff is really not hard to solve, second, my goal here is to quickly get adoption from websites. My goal is very explicitly not to try to kill flash or to get servers to serve more content using CORS.

> We're really going to have to resolve the issue of coupling audio focus to
> audio playback to get anywhere, and Apple needs to be involved in that
> discussion. If you file a pull request to remove
> https://github.com/whatwg/media-keys#limitations (which I just added)
> perhaps that would be a venue to discuss that.

I'm happy to have a conversation about this. But I really don't have time to create pull requests. This is what I had hoped that you guys could help with.

As far as I can tell the original proposal in this bug supported audio focus just fine.
Since it was asked why this was urgent (I can't find the comment right now):

The reason we need this for FirefoxOS is that we have a much more restrictive audio policy than desktop browsers. On desktop any background tab just keeps playing audio. We don't want to do this in FirefoxOS for a couple of reasons:

* The smaller screen and lack of hardware keyboard makes switching between open apps/tabs more cumbersome,
  so finding the page which is creating audio is more work.
* The smaller screen means that the UI for closing an app/tab can be hard to find.

So by default all audio is forcefully muted when a page moves into the background (we don't support flash and so we can forcefully mute all sound).

But we want some pages to be able to play audio in the background, for websites like Spotify, YouTube, and various local variants around the world, so that users can use those websites as music players.

Currently we enable that through the FirefoxOS-specific attribute mozaudiochannel="content". This attribute makes FirefoxOS apply a different audio policy to the audio. The exact details of the policy are somewhat complicated, but the most important aspect is that the audio doesn't get muted when the user moves the page to the background. It does, however, get muted if there's an incoming phone call.
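
For illustration only, a sketch of how a page opts into that policy today (the element creation, source URL, and timing are assumptions; only the mozaudiochannel="content" attribute itself comes from the description above):

// Non-standard, FirefoxOS-only attribute; assumed to be set before playback starts.
const player = document.createElement("audio");
player.setAttribute("mozaudiochannel", "content");
player.src = "song.ogg"; // placeholder URL
document.body.appendChild(player);
player.play();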


However, it's non-standard, which is why we filed this bug: so that we can get a standard solution.


The MediaController/MediaKeyReceiver proposal was designed specifically so that FirefoxOS could keep implementing audio policies. Not because we want a bunch of FirefoxOS-specific stuff, but because other platforms presumably have audio-muting policies as well that the API should support.

So in FirefoxOS we'd treat the presence of a MediaController/MediaKeyReceiver as an indication that the audio should keep playing even if the page goes into the background.
(hit enter too soon)

So for FirefoxOS it's important that we get audio websites to use MediaController/MediaKeyReceiver since otherwise the user won't be able to use them as background audio players.

In order for that to happen, we don't just need a spec and an implementation. We need those to have existed for a while so that websites can test and adopt the API. Hence this stuff is urgent; we're already late with this.

This is also why it's important that it's easy for websites to adopt this API. If the steps to adopt the API involve "rewrite to not use flash", "rewrite to always use a single <audio>", or "rewrite to use CORS", then we won't get adoption soon enough.

Though really ease of adoption is something that we should always keep in mind when designing APIs.
(In reply to Jonas Sicking (:sicking) from comment #41)
> (In reply to Philip Jägenstedt from comment #40)
> > Some of the finesse of Android's Audio Focus and iOS's Audio Session systems
> > are missing,
> 
> Please provide more details.

https://github.com/whatwg/media-keys#audio-focus--audio-session

Android:
http://developer.android.com/training/managing-audio/audio-focus.html
http://developer.android.com/reference/android/media/AudioManager.html

The distinction is made between indefinite focus (AUDIOFOCUS_GAIN and AUDIOFOCUS_LOSS) and different kinds of transient focus, in particular where ducking is appropriate (AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK and AUDIOFOCUS_LOSS_TRANSIENT_CAN_DUCK) and where no other audio at all is acceptable (AUDIOFOCUS_GAIN_TRANSIENT_EXCLUSIVE).

There are also different stream types, but I don't know if these imply some defaults with regard to focus management or if it's just about using the right volume level. (Android has separate volume levels for ringtones, notifications, etc.)

iOS:
https://developer.apple.com/library/ios/documentation/Audio/Conceptual/AudioSessionProgrammingGuide/Introduction/Introduction.html

I've never used this API, but from this page we can surmise that at least ducking is part of the model, although I can't tell if it's up to an app to duck or if that's done automatically and is not detectable to the app. (On Android it's up to the app.)
 
> > but that was not included as use cases in the document. I think
> > at least ducking for notifications and the distinction between being
> > interrupted by another media player (user action) and being interrupted by a
> > phone call (external event) is important.
> 
> Keep in mind that audio systems on different platforms are very different.
> And that there are lots of different policies involved. If you want to
> expose reasons why audio should be stopped/resumed, it's going to get very
> complicated very fast.
> 
> Additionally there might be privacy issues involved. I wouldn't want to tell
> pages when the user receives a phone call.

This was poorly phrased. It's not the reason that needs to be exposed, but the different behaviors: whether the focus loss is indefinite or transient, and whether or not ducking is appropriate.

> > > One problem that it actually has is that it doesn't let you read the song
> > > information from a cross-origin media file. I.e. <audio> and <video> allow
> > > you to play media from other origins by allowing embedding but preventing
> > > any data to be extracted by the page.
> > > 
> > > If we enabled using a HTMLMediaElement as the information source for song
> > > information, then we could still display that to the user, without the page
> > > ever getting access to the information. This could still be relatively
> > > easily fixed through small additions to that proposal.
> > 
> > I think the API should simply allow these things to be provided by scripts.
> > Passing it automatically from a CORS-same-origin media element to this API
> > seems like optional finesse that would be fine to postpone or ignore.
> 
> My point is that a lot of media is loaded cross origin *without* CORS. So
> script can't load things like song or album title from the media file.
> 
> I'm also *really* concerned about all this talk about ignoring use cases and
> postponing them to the next version. First off, this stuff is really not
> hard to solve, second, my goal here is to quickly get adoption from
> websites. My goal is very explicitly not to try to kill flash or to get
> servers to serve more content using CORS.

This isn't about killing Flash or encouraging CORS; I honestly think the use case seems like optional finesse. Since this metadata currently isn't exposed to the Web regardless of origin, sites have to use out-of-band information to display anything useful to their users. (They could XHR the CORS-same-origin ones and parse the metadata with scripts, but I doubt that's common.) Are there any somewhat popular sites where you can play audio where the relevant information isn't already available out-of-band?

> > We're really going to have to resolve the issue of coupling audio focus to
> > audio playback to get anywhere, and Apple needs to be involved in that
> > discussion. If you file a pull request to remove
> > https://github.com/whatwg/media-keys#limitations (which I just added)
> > perhaps that would be a venue to discuss that.
> 
> I'm happy to have a conversation about this. But I really don't have time to
> create pull requests. This is what I had hoped that you guys could help with.

OK, I'll ping @sicking on some GitHub issue to summon you.
I've filed https://github.com/whatwg/media-keys/issues/9 for discussing the coupling issues.
(In reply to Jonas Sicking (:sicking) from comment #41)
> (In reply to Philip Jägenstedt from comment #40)
> > (In reply to Jonas Sicking (:sicking) from comment #38)
> > > * It enables the platform to forcefully mute <audio>/WebAudio using whatever
> > > policy it wants (though
> > >   the proposal treats the events which should go along those policies as a
> > > separate API, which may be
> > >   unnecessary).
> > 
> > How? I see no connection between the new MediaController and the objects
> > which are playing audio.
> 
> What I meant is that MediaController/MediaKeyReceiver in no way gets in the
> way of platforms applying forcing policies. I.e.
> MediaController/MediaKeyReceiver treats those as orthogonal features.
> 
> So whatever forcing policies that the platform had, it can keep having.
> 
> That said, we definitely need to define how these forcing policies work, and
> what events they fire. And the order of those events and the
> MediaController/MediaKeyReceiver events.
> 
> Or, to put it another way, I'm definitely supportive of the idea of defining
> forcing policies as part of this spec. And I could definitely believe that
> the MediaController/MediaKeyReceiver API could use changing to make
> interaction with forcing policies better.

Yes, any per-tab policy would be easy to keep. We should figure out whether audio focus/session at a more fine-grained level is worthwhile, and whether automatic policy at that level would be useful. It would be a matter of convenience, such as in the notification-and-movie-in-same-tab case.
Attachment #8556159 - Flags: review?(bent.mozilla)
Summary: Implement remote control of media → Implement MediaSession API
Attachment #8553870 - Attachment is obsolete: true
Attachment #8556159 - Attachment is obsolete: true
Attached patch patch 1 - WebIDL — — Splinter Review
This first patch implements the WebIDL interfaces.
This second patch introduces the 'kind' and 'session' attributes on HTMLMediaElement.
Attached patch patch 3 - Release algorithm — — Splinter Review
We need IPDL to enable communication between the different MediaSessions running in different processes.
I have to submit two more patches, but they are not yet tested. Once everything is fully tested and implemented, I'll ask to have the code reviewed.
We would like to ensure that you're fully aware of all the spec issues we are currently dealing with:
https://github.com/whatwg/mediasession/issues/

Also note that we're approaching the implementation of this in Blink by first implementing the minimal bits required to get UI on a lock screen and customize it. Issues of particular relevance:
https://github.com/whatwg/mediasession/issues/45
https://github.com/whatwg/mediasession/issues/46
https://github.com/whatwg/mediasession/issues/48
https://github.com/whatwg/mediasession/issues/50
https://github.com/whatwg/mediasession/issues/71

In other words, there isn't much stability here, and we hope you'll leave feedback on the spec. (Anne has been talking to me on IRC, which is appreciated.)
Is there any dependent FxOS work, e.g. removing system message workarounds? If so, please speak now so the FxOS group can plan accordingly.
Flags: needinfo?(amarchesini)
Tim, you are right. Once this API is implemented, the next step is to add custom code for each platform in order to support Media Focus and integration with the underlying OSes. Firefox OS will be the first integration I want to do.
Flags: needinfo?(amarchesini)
Flags: platform-rel?
platform-rel: --- → ?
platform-rel: ? → ---
I'm not working on this.
Assignee: amarchesini → nobody
Blake, how about moving this one to Audio/Video: Playback?
Flags: needinfo?(bwu)
(In reply to John Lin [:jolin][:jhlin] from comment #59)
> Blake, how about moving this one to Audio/Video: Playback?
That's a good idea.
Component: DOM → Audio/Video: Playback
Flags: needinfo?(bwu)
Priority: -- → P3
See Also: → 1461611
Blocks: 1461611
See Also: 1461611
Flags: webcompat?
Probably worth implementing just for YouTube on mobile.
Flags: webcompat? → webcompat+
Whiteboard: [webcompat:p3]

See bug 1547409. Migrating webcompat priority whiteboard tags to project flags.

Webcompat Priority: --- → P3
Type: defect → enhancement

Planning to implement a minimal version of the current media session API[0]. Fenix is the first target.

[0] https://w3c.github.io/mediasession/
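
For reference, a minimal usage sketch of the API at [0]; the element lookup and metadata values are placeholders, not taken from any patch here:

// Feature-detect, expose metadata, and hook up play/pause handlers.
const audio = document.querySelector("audio");
if ("mediaSession" in navigator) {
  navigator.mediaSession.metadata = new MediaMetadata({
    title: "Track title",
    artist: "Artist name",
    album: "Album name",
    artwork: [{ src: "cover.png", sizes: "512x512", type: "image/png" }],
  });
  navigator.mediaSession.setActionHandler("play", () => audio.play());
  navigator.mediaSession.setActionHandler("pause", () => audio.pause());
}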

Assignee: nobody → cchang
Attached file Bug 1580602 - P2: Implement MediaMetadata API. (obsolete) —

Depends on D45456

Depends on D45457

Depends on D45458

(In reply to C.M.Chang[:chunmin] from comment #67)

Created attachment 9091931 [details]
Bug 1112032 - P4: Correct MediaSessionActionDetails for seek operations.

In the current implementation, I have trouble meeting the following requirement:

Run handler with the details parameter set to:

  • MediaSessionSeekActionDetails, if action is seekbackward or seekforward.
  • MediaSessionSeekToActionDetails, if action is seekto.
  • Otherwise, with MediaSessionActionDetails.

In the JavaScript test file, I cannot read MediaSessionSeekToActionDetails.seekTime from the MediaSessionActionHandler's parameter when the callback/handler is registered for seekto. The following code demonstrates the problem:

navigator.mediaSession.setActionHandler("seekto", function(details) {
  console.log(details.action); // ok, it's "seekto".
  console.log(details.seekTime); // undefined, but it's a `required` member of MediaSessionSeekToActionDetails
});

I guess the reason is that the type of details is MediaSessionActionDetails instead of MediaSessionSeekToActionDetails.

I created a test-only function notifySeekToHandler to trigger the handler for seekto. notifySeekToHandler gets a MediaSessionSeekToActionDetails object from the caller and passes it directly to the handler. The handler type is MediaSessionActionHandler, which is defined as void(MediaSessionActionDetails details). However, the details parameter in the handler has no seekTime member. I guess the reason is that details ends up as a MediaSessionActionDetails instead of a MediaSessionSeekToActionDetails. I'm not sure if there is a way to make JS realize that details is actually a MediaSessionSeekToActionDetails.

[Exposed=Window]
partial interface Navigator {
  [SameObject] readonly attribute MediaSession mediaSession;
};

enum MediaSessionAction {
  "play",
  ...
  "seekto"
};

callback MediaSessionActionHandler = void(MediaSessionActionDetails details);

[Exposed=Window]
interface MediaSession {
  ...
  void setActionHandler(MediaSessionAction action, MediaSessionActionHandler? handler);
  ...

  // Test-only function to notify the `seekto` handler
  [ChromeOnly]
  void notifySeekToHandler(MediaSessionSeekToActionDetails details);
};

dictionary MediaSessionActionDetails {
  required MediaSessionAction action;
};

dictionary MediaSessionSeekToActionDetails : MediaSessionActionDetails {
  required double seekTime;
  ...
};

In general, I need to figure out how to make the following javascript code work:

let baseDict = { type: "base" };
let derivedDict = { type: "derived",  num: 3.14 };

let foo = new Foo();

foo.setHandler("base",, function(dict) {
  console.log(dict.type); // "base"
  console.log(dict.num); // undefined
});

foo.setHandler("derived", function(dict) {
  console.log(dict.type); // "derived"
  console.log(dict.num); // 3.14
});

foo.runHandlerForBase(baseDict);
foo.runHandlerForDerived(derivedDict);

with the following WebIDL interface

enum Type {
  "base",
  "derived"
};

dictionary Base {
  required Type type;
};

dictionary Derived : Base {
  required double num;
};

// For "base", the `dict` is `Base`
// For "derived", the `dict` is `Derived`
callback Handler = void(Base dict);

interface Foo {
  void setHandler(Type type, Handler handler);
  void runHandlerForBase(Base dict);
  void runHandlerForDerived(Derived dict);
};
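
A rough JS-only illustration of why dict.num can come out undefined even in the handler registered for "derived": since Handler is declared to take a Base, the binding conceptually rebuilds the callback argument as a Base, keeping only the members Base declares. This just mimics that behavior; it is not the actual binding code.

// Conceptual model of what the WebIDL binding does to the callback argument
// when the callback parameter is declared as `Base`.
function convertToBase(jsValue) {
  return { type: jsValue.type }; // `num` is not declared on Base, so it is dropped
}
const sample = { type: "derived", num: 3.14 };
const dict = convertToBase(sample);
console.log(dict.type); // "derived"
console.log(dict.num);  // undefined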

(In reply to C.M.Chang[:chunmin] from comment #68)

It seems there is no way to do that safely. One way is to define the Handler as callback Handler = void(Object dict), but accessing the Object directly will cause security issues, because JS objects are highly configurable.

(In reply to C.M.Chang[:chunmin] from comment #69)

It seems there is no way to do that safely. One way is to define the Handler by callback Handler = void(Object dict) but accessing Object directly will cause security issues because JS objects are hightly configurable.

It might be fine, since the dict is created by the browser itself. In the real case, details, whose type is MediaSessionActionDetails, is created by the browser itself, so it won't be a random value.

This looks like a spec issue to me. I am going to file a spec issue and move the patches to another bug that only implements part of the operations, without the seek stuff. This bug will be used as a meta bug so subscribers can get status updates.

Depends on: 1580602
Attachment #9091928 - Attachment description: Bug 1112032 - P1: Implement a dummy MediaSession interface. → Bug 1580602 - P1: Implement a dummy MediaSession interface.
Attachment #9091929 - Attachment description: Bug 1112032 - P2: Implement MediaMetadata API. → Bug 1580602 - P2: Implement MediaMetadata API.
Attachment #9091930 - Attachment description: Bug 1112032 - P3: Implement setActionHandler. → Bug 1580602 - P3: Implement setActionHandler API.

Comment on attachment 9091928 [details]
Bug 1580602 - P1: Implement a dummy MediaSession interface.

Revision D45456 was moved to bug 1580602. Setting attachment 9091928 [details] to obsolete.

Attachment #9091928 - Attachment is obsolete: true

Comment on attachment 9091929 [details]
Bug 1580602 - P2: Implement MediaMetadata API.

Revision D45457 was moved to bug 1580602. Setting attachment 9091929 [details] to obsolete.

Attachment #9091929 - Attachment is obsolete: true

Comment on attachment 9091930 [details]
Bug 1580602 - P3: Implement setActionHandler API.

Revision D45458 was moved to bug 1580602. Setting attachment 9091930 [details] to obsolete.

Attachment #9091930 - Attachment is obsolete: true
Depends on: 1580623

Comment on attachment 9091931 [details]
Bug 1112032 - P4: Correct MediaSessionActionDetails for seek operations.

Revision D45459 was moved to bug 1580623. Setting attachment 9091931 [details] to obsolete.

Attachment #9091931 - Attachment is obsolete: true
Alias: MediaSession
Keywords: meta
Summary: Implement MediaSession API → [meta] Implement MediaSession API
Depends on: 1582508
Depends on: 1582509
Depends on: 1582569
Blocks: 1588090
Depends on: 1592151
Depends on: 1592454
Depends on: 1599591
Depends on: 1599938
Depends on: 1599942
Depends on: 1611272
Depends on: 1611328
Depends on: 1611332
Depends on: 1620077
Depends on: 1621166
Depends on: 1621403
Depends on: 1624711
Depends on: 1637466
Depends on: 1663631
Depends on: 1665496
See Also: → media-control
Depends on: 1669434
Whiteboard: [webcompat:p3] → [webcompat:p3], [feature-testing-meta]
Whiteboard: [webcompat:p3], [feature-testing-meta] → [webcompat:p3]

We've shipped the Media Session API in Fx82; closing this bug.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Depends on: 1673613
Depends on: 1681412
Depends on: 1686895
Depends on: 1717997
Depends on: 1716974