1649819 - Perma raptor task timeouts for youtube widevine related tests

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

•

5 years ago

See https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedTaskRun=VKkEPGi4RDeUnpqvM3Swpw.0&searchStr=raptor&tochange=e950bdd669d44eeb8e6af84a987c2a8ba5d9d1a2

Log: https://treeherder.mozilla.org/logviewer.html#?job_id=307978924&repo=mozilla-central

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Updated

•

5 years ago

Flags: needinfo?(gmierz2)

Summary: raptor task timeouts for youtube widevine related tests → Perma raptor task timeouts for youtube widevine related tests

Greg Mierzwinski [:sparky]

Comment 1

•

5 years ago

Thanks for letting me know about this, I'm testing to see if increasing the timeout will fix it (the test is working locally, but the videos are longer than what we had with the 2019 branch): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b58895719dd5a28d37466aad8fcc60972f56e940

(leaving the ni? open for myself)

Greg Mierzwinski [:sparky]

Updated

•

5 years ago

Severity: -- → S2

Flags: needinfo?(gmierz2)

Priority: -- → P2

Greg Mierzwinski [:sparky]

Updated

•

5 years ago

Flags: needinfo?(gmierz2)

Greg Mierzwinski [:sparky]

Comment 2

•

5 years ago

This is very odd. I noticed that the failures started in this push so it seems like we may be able to find a commit that caused it: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&tier=1%2C2%2C3&searchStr=widevine&revision=79d69f36a2202fadd572b0243b618e100d29b525

I found one patch which made changes to something involving DRM content: https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=youtube&revision=e319075a4c0e7a4569518311a1b4a4688b2637b4

I ran the same tests on it but they didn't fail. Then I ran the tests on the child commits in that push and they still weren't failing. I tested locally as well but I couldn't reproduce it, I'm testing it once more today though in case I missed something.

:bryce, :gijs would either of you have an idea of what might be happening here? The widevine youtube playback test failures started with the push that contained your patches and it seems like the most likely culprit in all the commits.

Flags: needinfo?(gmierz2)

Flags: needinfo?(gijskruitbosch+bugs)

Flags: needinfo?(bvandyk)

:Gijs (he/him)

Assignee

Comment 3

•

5 years ago

(In reply to Greg Mierzwinski [:sparky] from comment #2)

This is very odd. I noticed that the failures started in this push so it seems like we may be able to find a commit that caused it: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&tier=1%2C2%2C3&searchStr=widevine&revision=79d69f36a2202fadd572b0243b618e100d29b525

I found one patch which made changes to something involving DRM content: https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=youtube&revision=e319075a4c0e7a4569518311a1b4a4688b2637b4

I ran the same tests on it but they didn't fail. Then I ran the tests on the child commits in that push and they still weren't failing. I tested locally as well but I couldn't reproduce it, I'm testing it once more today though in case I missed something.

:bryce, :gijs would either of you have an idea of what might be happening here? The widevine youtube playback test failures started with the push that contained your patches and it seems like the most likely culprit in all the commits.

No. The actor code that got changed should only be used if the DRM/EME stuff is disabled or unavailable, I think.

What does the test actually test? I tried looking for documentation (e.g. https://firefox-source-docs.mozilla.org/testing/perfdocs/raptor.html ) but found none. I assume it's not against an actual real, web-based youtube instance, but against something more static?

It seems super suspicious that autoland is green (as was your trypush with a longer timeout) and mozilla-central is not. Have we figured out what the difference is yet? IIRC the perf numbers can be substantially different because of different branding and thus different behaviour in some components.

Of course, the profiler jobs are also green, so that's no help.

As an aside, this is a practical case of "the perf test says bad things but we have no idea why" as discussed in our recent meeting...

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(gmierz2)

Greg Mierzwinski [:sparky]

Comment 4

•

5 years ago

(In reply to :Gijs (he/him) from comment #3)

No. The actor code that got changed should only be used if the DRM/EME stuff is disabled or unavailable, I think.

What does the test actually test? I tried looking for documentation (e.g. https://firefox-source-docs.mozilla.org/testing/perfdocs/raptor.html ) but found none. I assume it's not against an actual real, web-based youtube instance, but against something more static?

When I run the tests locally, I do get a banner about it being installed but I don't think that falls into your classifications. That's correct, these tests run a video with various settings to monitor dropped frames. You can see this here, we'll have these in the docs in the future (it's a work in progress atm): https://yttest.dev.mozaws.net/testing_2020/perf-youtube-playback/2019/main.html?test_type=playbackperf-widevine-sfr-h264-test&raptor=true&exclude=1,2&muted=true&command=run

It seems super suspicious that autoland is green (as was your trypush with a longer timeout) and mozilla-central is not. Have we figured out what the difference is yet? IIRC the perf numbers can be substantially different because of different branding and thus different behaviour in some components.

Of course, the profiler jobs are also green, so that's no help.

As an aside, this is a practical case of "the perf test says bad things but we have no idea why" as discussed in our recent meeting...

Right now I'm thinking it's related to what you mentioned different branding and thus different behaviour in some components. in mozilla-central. I tried it out again today just in case and it's certainly passing on try without any issues (it didn't get anywhere near the new task timeout value): https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=ce2ddb9b847d6454cf6d9613d2065280fe435187

I checked here to see if conditioned profiles could be causing this and they aren't: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=833d0287d08f5e2edf137f61a7543df00fcb6992

I don't see any changes related to CI either at or near that day either: https://changelog.dev.mozaws.net/

:gijs, :aryx do either of you see any patches in that mozilla-central commit that could have changed things related to branding or know who I could ask about that?

CC tomprince in case you see anything running in that mozilla-central commit which may be causing this m-c only failure.

Flags: needinfo?(gmierz2)

Flags: needinfo?(gijskruitbosch+bugs)

Flags: needinfo?(aryx.bugmail)

:Gijs (he/him)

Assignee

Comment 5

•

5 years ago

(In reply to Greg Mierzwinski [:sparky] from comment #4)

(In reply to :Gijs (he/him) from comment #3)

No. The actor code that got changed should only be used if the DRM/EME stuff is disabled or unavailable, I think.

What does the test actually test? I tried looking for documentation (e.g. https://firefox-source-docs.mozilla.org/testing/perfdocs/raptor.html ) but found none. I assume it's not against an actual real, web-based youtube instance, but against something more static?

When I run the tests locally, I do get a banner about it being installed but I don't think that falls into your classifications.

Huh. No, that banner does have to do with the code that got modified, so I guess that's plausible? But it still doesn't make much sense that this is working on the integration branches. If the patch broke something, I'd expect the breakage to be consistent... Basically, we changed the way the content process communicates with the parent about DRM stuff.

Do we intend for this test to measure the cost of setting up the DRM stuff? That is, we normally have to download CDMs from the different providers, which happens async on first run. I would expect for the perf test, we measure that separately, not as part of a dropped frames test...

Of course, if the changes somehow broke the CDM download then that would be bad... but I can't think what would have done that (and why it would only do it some of the time, but consistently on m-c!)... Bryce might have a better idea.

Flags: needinfo?(gijskruitbosch+bugs)

Greg Mierzwinski [:sparky]

Comment 6

•

5 years ago

(In reply to :Gijs (he/him) from comment #5)

Do we intend for this test to measure the cost of setting up the DRM stuff? That is, we normally have to download CDMs from the different providers, which happens async on first run. I would expect for the perf test, we measure that separately, not as part of a dropped frames test...

No we don't measure that cost, only the dropped frames per video that played. In this case the DRM stuff for widevine is set up once we navigate to the test page and a video starts playing - we use some preferences to get the CDMs downloaded and installed. In other words, it's not preinstalled.

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Comment 7

•

5 years ago

Following those patches the hope is that things should behave the same as before, but also in a Fission friendly way. So I've overlooked something if behaviour has changed (which is entirely possible). I don't have a clear idea of what's going on, but let me add some ideas to try and help us get there.

Those changes touch messages that go between the UI and lower level code concerned with the CDM being enabled and installed.
- They shouldn't, if working as anticipated, add much run time to the tests. They certainly shouldn't make these test runs go from ~22 mins to 45 mins and time out.
- If I've made a mistake somewhere I could envision problems such as the CDM (DRM thingy) not being installed, or us not resolving DRM promises correctly when the CDM becomes installed. Depending on what happens these could manifest in differently, but if the yellow DRM installing notification never goes away that could indicate something is wrong, similarly if the promise requesting media key system access never resolves in the test. I've just manually tested this works in my local build and don't see anything immediately broken.
Different build configs interact with DRM being enabled or not. I wonder if there's anything the is difference between the builds that could be tickling a bug that was introduced in the above changes.

Do we have mix of passing and failing runs of these tests on MC? I see Linux ones reliably failing in the link from comment 0, but other operating systems look intermittent. E.g. Windows 7 shippable appears to be broken sometimes and other times not. I don't know what this suggest, if anything.

Flags: needinfo?(bvandyk)

Greg Mierzwinski [:sparky]

Comment 8

•

5 years ago

•

Edited

(In reply to Bryce Seager van Dyk (:bryce) from comment #7)

Do we have mix of passing and failing runs of these tests on MC? I see Linux ones reliably failing in the link from comment 0, but other operating systems look intermittent. E.g. Windows 7 shippable appears to be broken sometimes and other times not. I don't know what this suggest, if anything.

Yes it is an intermittent failure, but it's a near-perma failure on linux since it's rarely passed. I haven't been able to find a difference between the passing and failing runs yet.

So running locally, I managed to reproduce an error that would cause the task to timeout:

TestExecutor:  Test 12.2.2.1:PlaybackPerfH264240p30fps@1X threw an error: NotSupportedError: Timed out while waiting for a CDM update

Here's what I did, maybe one of you can reproduce it too:

hg update -r 31fb4a2a69124b6a21e5d8f44a0407d489749407
./mach build
./mach raptor -t raptor-youtube-playback-widevine-h264-sfr --post-startup-delay 1 --project mozilla-central        # TEST PASSING
hg update -r 79d69f36a2202fadd572b0243b618e100d29b525
./mach build
./mach raptor -t raptor-youtube-playback-widevine-h264-sfr --post-startup-delay 1 --project mozilla-central
1. When the browser starts, and the test page is loading click on the chrome to get focus on the window.
2. Wait for the yellow banner to show, then immediately move the browser window - make sure you hold it for a good bit of time 5-15s (I moved mine to the top-left corner, but it doesn't seem to matter).

I haven't seen it pass yet after following these steps and on the previous mozilla-central commit (31fb4a2a69124b6a21e5d8f44a0407d489749407), I can freely move the browser and the CDM is still installed.

Flags: needinfo?(gijskruitbosch+bugs)

Flags: needinfo?(bvandyk)

Flags: needinfo?(aryx.bugmail)

:Gijs (he/him)

Assignee

Comment 9

•

5 years ago

I don't have a linux build/machine to hand, I will try and get back to this on Monday.

Geoff Brown [:gbrown]

Updated

•

5 years ago

Blocks: 1589796

Comment hidden (Intermittent Failures Robot)

Greg Mierzwinski [:sparky]

Updated

•

5 years ago

Blocks: 1609295

Greg Mierzwinski [:sparky]

Comment 11

•

5 years ago

Attached file Bug 1649819 - Temporarily disable youtube widevine tests on linux and windows7. r?#perftest — Details

Phabricator Automation

Updated

•

5 years ago

Assignee: nobody → gmierz2

Status: NEW → ASSIGNED

Phabricator Automation

Updated

•

5 years ago

Attachment #9161603 - Attachment description: Bug 1649819 - Temporarily disable youtube widevine tests. r?#perftest → Bug 1649819 - Temporarily disable youtube widevine tests on linux and windows7. r?#perftest

Greg Mierzwinski [:sparky]

Comment 12

•

5 years ago

We're going to disable the linux/windows7 tests temporarily while this is being investigated since they are wasting CI time.

Greg Mierzwinski [:sparky]

Updated

•

5 years ago

Keywords: leave-open

Comment hidden (Intermittent Failures Robot)

Greg Mierzwinski [:sparky]

Comment 14

•

5 years ago

I'm not sure why, or how, but I was able to get a failure on try yesterday: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3b3a785cfe92d93528ef76299d7fc3e37ac5b40c

Pulsebot

Comment 15

•

5 years ago

Pushed by gmierz2@outlook.com: https://hg.mozilla.org/integration/autoland/rev/99e22250e61d Temporarily disable youtube widevine tests on linux and windows7. r=tarek

:Gijs (he/him)

Assignee

Comment 16

•

5 years ago

Thanks for the steps in comment #8. I'm actually unsure what the window moving has to do with anything (and perhaps that deserves further investigation in a separate bug? Feels like it's interfering with idle callback running...), but I did finally figure out what the issue is.

We stopped communicating over the message manager, and instead communicate over the actors' IPC channel. But I missed that there was non-browser-media.js code that depended on this communication, too, and it didn't get converted, so now it's just sitting there and never gets any messages. This is https://searchfox.org/mozilla-central/rev/91d82d7cbf05a71954dfa49d0e43824c7c973e62/toolkit/mozapps/extensions/internal/GMPProvider.jsm#561,572 , which explicitly/proactively forces a check for updates to the CDM immediately when a request comes in that tries to use the CDM. Now that we no longer do that, the CDM gets installed only when some timer kicks in sometime after startup and goes "gee, maybe we should install some CDMs, let's have a look". That explains why installs aren't happening quickly anymore and breaking the test.

This is just an implementation bug so I'm gonna move the bug.

Assignee: gmierz2 → gijskruitbosch+bugs

status-firefox77: --- → unaffected

status-firefox78: --- → unaffected

status-firefox79: --- → unaffected

status-firefox80: --- → affected

status-firefox-esr78: --- → unaffected

Component: Raptor → Audio/Video: Playback

Flags: needinfo?(gijskruitbosch+bugs)

Flags: needinfo?(bvandyk)

Keywords: regression

Priority: P2 → P1

Product: Testing → Core

:Gijs (he/him)

Assignee

Updated

•

5 years ago

Regressed by: 1642465

BMO Automation

Updated

•

5 years ago

Has Regression Range: --- → yes

:Gijs (he/him)

Assignee

Comment 17

•

5 years ago

Attached file Bug 1649819 - fix triggering gmp/widevine/openh264 eme CDM installs by attempting to use them, r?bryce — Details

Atila Butkovits

Comment 18

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/99e22250e61d

:Gijs (he/him)

Assignee

Comment 19

•

5 years ago

I pushed to try to confirm we can re-enable: https://treeherder.mozilla.org/#/jobs?repo=try&revision=22bc9268c811eee47aeb27c986c857c8afa098fe

FWIW, I investigated the window movement stuff. It turns out the post-first-startup download of the CDM lives behind a check for user interaction activity (!), so if you keep moving the mouse / typing, then no download will happen. Still unclear why that'd impact the linux and win7 builds, particularly, but I'm not super bothered about that. I filed bug 1651161 and co to evaluate if/how we can be more clear about how _scheduleArbitrarilyLateStartupTasks works and what guarantees it (doesn't) provide(s), and to check if all the things that live there are really best placed there or need to move somewhere else.

For this particular test, though it found a real bug, I wonder if we should get a separate bug on file to try to somehow ensure that the GMP add-ons are already present when it runs, given that checking that those bits work isn't really what the perf test is designed for! :sparky, what do you think?

Flags: needinfo?(gmierz2)

:Gijs (he/him)

Assignee

Updated

•

5 years ago

Comment 20

•

5 years ago

Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/732b3cb93098 fix triggering gmp/widevine/openh264 eme CDM installs by attempting to use them, r=bryce

Comment hidden (Intermittent Failures Robot)

Bogdan Tara[:bogdan_tara | bogdant]

Comment 22

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/732b3cb93098

:Gijs (he/him)

Assignee

Updated

•

5 years ago

Keywords: leave-open

Greg Mierzwinski [:sparky]

Updated

•

5 years ago

Blocks: 1651353

Greg Mierzwinski [:sparky]

Updated

•

5 years ago

Blocks: 1651361

Greg Mierzwinski [:sparky]

Comment 23

•

5 years ago

:gijs, thanks a lot for fixing the issue! I've filed a bug to get the tests re-enabled.

I agree, we should have the installation done separately. I'm thinking we could make use of conditioned profiles for this but I'm not sure if the installation would be stored in the profiles or if we have to do it on the machine we are running the tests on. Another question with this solution would be, are there any legal implications of having a publicly accessible profile with the CDM installed on it? (It sounds like it could fall under redistribution).

That said, could we block that change on bug 1651230 so that we have some coverage for these components until some functional tests are added?

Flags: needinfo?(gmierz2) → needinfo?(gijskruitbosch+bugs)

:Gijs (he/him)

Assignee

Comment 24

•

5 years ago

(In reply to Greg Mierzwinski [:sparky] from comment #23)

:gijs, thanks a lot for fixing the issue! I've filed a bug to get the tests re-enabled.

I agree, we should have the installation done separately. I'm thinking we could make use of conditioned profiles for this but I'm not sure if the installation would be stored in the profiles or if we have to do it on the machine we are running the tests on.

They get stored in the profile, yes. Try running await AddonManager.getAddonsByTypes(["plugin"]) in the browser console in your usual user profile, and you'll see the gmpPath is a profile folder path.

Another question with this solution would be, are there any legal implications of having a publicly accessible profile with the CDM installed on it? (It sounds like it could fall under redistribution).

I'm not a lawyer, so I don't know about this. If this is a problem, you could obviously also always download them from the official location using a python script or whatever, prior to then having the Firefox profile use them, but you'd need to check with the add-ons team whether that's likely to be a viable option in terms of manually/outside-of-Firefox "prepping" the profile to load these. Lastly, I guess you could also consider making the step of installing the CDMs an explicit part of the perf test, checking if they are there and failing the test before doing any attempts at playback if not.

That said, could we block that change on bug 1651230 so that we have some coverage for these components until some functional tests are added?

That sounds good to me.

Flags: needinfo?(gijskruitbosch+bugs)

:Gijs (he/him)

Assignee

Comment 25

•

5 years ago

I guess given bug 1651353 we can close this? Though if it helps anything we can reopen until the tests are re-enabled...

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox80: affected → fixed

Resolution: --- → FIXED

Version: unspecified → 80 Branch

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

status-firefox-esr68: --- → unaffected

Comment hidden (Intermittent Failures Robot)

Greg Mierzwinski [:sparky]

Comment 27

•

5 years ago

This has started again today: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedTaskRun=ObpAMA1uSKi9WTz6kCnnZg.0&tier=1%2C2%2C3&searchStr=raptor&revision=2c8bc998c107bd6a6182b5fc4740fe77fc07cfca

:Gijs (he/him)

Assignee

Comment 28

•

5 years ago

(In reply to Greg Mierzwinski [:sparky] from comment #27)

This has started again today: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedTaskRun=ObpAMA1uSKi9WTz6kCnnZg.0&tier=1%2C2%2C3&searchStr=raptor&revision=2c8bc998c107bd6a6182b5fc4740fe77fc07cfca

So is there a new regressor? bug 1651353 suggests the affected tests were re-enabled 5 days ago, and the failures do look like they started with this morning's merge which you highlighted... Probably worth filing a new bug and bisecting that down again? :-(

Flags: needinfo?(gmierz2)

Greg Mierzwinski [:sparky]

Comment 29

•

5 years ago

Attached image Screenshot from 2020-07-13 08-30-50.png — Details

The failures aren't related to any patch, the server needs its certificate updated apparently. :bebe, would you or alex be able to update it?

Flags: needinfo?(gmierz2) → needinfo?(fstrugariu)

Florin Strugariu [:Bebe]

Comment 30

•

5 years ago

Alex Can you take care of this?

Flags: needinfo?(fstrugariu) → needinfo?(aionescu)

Greg Mierzwinski [:sparky]

Updated

•

5 years ago

Comment 31

•

5 years ago

(In reply to :Gijs (he/him) from comment #28)

(In reply to Greg Mierzwinski [:sparky] from comment #27)

This has started again today: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedTaskRun=ObpAMA1uSKi9WTz6kCnnZg.0&tier=1%2C2%2C3&searchStr=raptor&revision=2c8bc998c107bd6a6182b5fc4740fe77fc07cfca

So is there a new regressor? bug 1651353 suggests the affected tests were re-enabled 5 days ago, and the failures do look like they started with this morning's merge which you highlighted... Probably worth filing a new bug and bisecting that down again? :-(

Oh I missed your message! No new regressor, it's a server issue - thanks for your quick reply though :)
Filed a new bug for this here: https://bugzilla.mozilla.org/show_bug.cgi?id=1652451

Alexandru Ionescu (needinfo me) [:alexandrui]

Updated

•

5 years ago

Flags: needinfo?(aionescu)

Bug 1649819 - Temporarily disable youtube widevine tests on linux and windows7. r?#perftest 5 years ago Greg Mierzwinski [:sparky] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1649819 - fix triggering gmp/widevine/openh264 eme CDM installs by attempting to use them, r?bryce 5 years ago :Gijs (he/him) 47 bytes, text/x-phabricator-request		Details \| Review
Screenshot from 2020-07-13 08-30-50.png 5 years ago Greg Mierzwinski [:sparky] 230.56 KB, image/png		Details