Closed Bug 1736785 Opened 3 years ago Closed 10 months ago

83.1% imgur ContentfulSpeedIndex (Windows) regression on Wed October 6 2021

Categories

(Core :: Widget: Win32, defect, P3)

Firefox 95
defect

Tracking


RESOLVED WONTFIX
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 --- wontfix
firefox95 --- wontfix
firefox96 --- wontfix

People

(Reporter: alexandrui, Assigned: handyman)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Attachments

(4 files, 1 obsolete file)

Perfherder has detected a browsertime performance regression from push 231dfa18e7e86deffb3f4b3c686524219e71ae48. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio | Test | Platform | Options | Absolute values (old vs new)
83% | imgur ContentfulSpeedIndex | windows10-64-shippable-qr | warm webrender | 840.32 -> 1,538.67

Improvements:

Ratio | Test | Platform | Options | Absolute values (old vs new)
11% | imgur ContentfulSpeedIndex | windows10-64-shippable-qr | warm webrender | 910.38 -> 807.67

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(cmartin)

Set release status flags based on info from the regressing bug 1730309

Looks like it got noisier after bug 1730309 landed. Confirmed that the graph on Beta shows a similar change after uplift.
https://treeherder.mozilla.org/perfherder/graphs?timerange=2592000&series=autoland,3401262,1,13

Flags: needinfo?(cmartin) → needinfo?(davidp99)
Component: Theme → Widget: Win32
Product: Firefox → Core

TL;DR: Nothing from my perf analysis suggests why this is happening, but I've got a guess that should be very safe to land, so I'm winging it.

--

Focusing on these two tasks:

Before patch (Oct 4)

After patch (Oct 24)

I've generated a bunch of profiles; here are a few:

Oct 4: 1, 2, 3

Oct 24: 1, 2, 3

Focusing on the warm profiles, I see no samples in any code from the flagged patch (part 2 of bug 1730309) or in the other patch (part 1) -- the only code change in part 2 was in nsXULTooltipListener.

On my fast desktop, I get the opposite of the bad behavior between two builds -- "latest" gives me an imgur ContentfulSpeedIndex in the range of 2.5-3 seconds but an Oct 4 build gives me a ContentfulSpeedIndex around 10 seconds. This is reliable across a half-dozen test runs. I assume this is due to other changes -- I should have focused more narrowly, but the gist is that this does not match the results from automation at all, so it's probably not going to result in an explanation. I also note that the ads on the page tend to be different between builds, which could be the reason (are they basing ad loading on some internal profiling?).

Since PerfHerder technically blamed part 2 of the patch series, and that patch is just a HasAttr test, I want to see if moving that inside an else for an IsXULElement conditional changes this -- the test only applies to window chrome and this should reduce consideration for page load stuff. If this works as hoped then the regression should essentially vanish.

Flags: needinfo?(davidp99)

Reduce the number of times the HasAttr(titlebar_button) test runs by ignoring non-XUL elements. The test is only true for window chrome and could be expensive if the element hasn't mapped its attributes.
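For illustration, a minimal standalone sketch of the guard pattern described above. The Element type, method names, and attribute string below are simplified stand-ins for the real Gecko APIs, not the actual patch:

```cpp
#include <string>
#include <unordered_map>

// Simplified stand-in for a DOM element; the real Gecko Element API differs.
struct Element {
  bool mIsXUL = false;
  std::unordered_map<std::string, std::string> mAttrs;  // pretend lookup can be costly

  bool IsXULElement() const { return mIsXUL; }
  bool HasAttr(const std::string& aName) const { return mAttrs.count(aName) != 0; }
};

// Before: every element, including ordinary page content, paid for the
// attribute lookup.
bool IsTitlebarButtonOld(const Element& aElement) {
  return aElement.HasAttr("titlebar-button");
}

// After: bail out early for non-XUL elements, so only window chrome (XUL)
// ever reaches the attribute lookup.
bool IsTitlebarButtonNew(const Element& aElement) {
  if (!aElement.IsXULElement()) {
    return false;  // page content can never be a titlebar button
  }
  return aElement.HasAttr("titlebar-button");
}
```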

Assignee: nobody → davidp99
Status: NEW → ASSIGNED
Attachment #9247888 - Attachment is obsolete: true

TL;DR: The failure does not seem to be real. I believe the browser recordings previously tended to be truncated after a banner pops up, causing us to skip more than half of the length, which led to a lower score. The recordings are now usually not truncated much. AFAICT this is due to either variable video playback in the web page (where the video is covered by a translucent panel) or variable image recording times. Either way, this difference causes us to go from just over a culling threshold to just under it. I don't know which way is "correct", but obviously switching between them is not. This is why manual inspection shows no distinguishable difference in the recordings despite their wildly different scores. There is a difference in performance, but it is meager, and we don't even know whether it is positive or negative. The test is not measuring anything useful here.

--

This failure looks huge, and I think we have seen it at least once before in bug 1704463. I'm going to explain what I think is happening in detail with an example, but this can be easily confirmed (or not) with a local run that looks at the temp files it produces:

From this archive for this run made on Oct 4 (before the results changed), I'm looking at the cold start video at imgur/pages/imgur_com/gallery/m5tYJL6/query-6bc054f4/data/video/1.mp4 from the adjacent imgur archive -- attached as oct1.mp4. The adjacent live.log file is attached as oct1.log.

From this archive for this run made on Oct 13 (after the results changed), I'm looking at the cold start video at imgur/pages/imgur_com/gallery/m5tYJL6/query-6bc054f4/data/video/2.mp4 -- attached as oct13.mp4. The adjacent live.log file is attached as oct13.log.

From oct1.log, the important info is:

[task 2021-10-04T22:21:11.447Z] INFO - [JOB-26] Duration: 00:00:07.00, start: 0.000000, bitrate: 382 kb/s
[...]
"ContentfulSpeedIndex": 349, "videoRecordingStart": 933, "ContentfulSpeedIndexProgress": "0=0, 167=47, 200=54, 267=68, 300=74, 400=79, 433=81, 467=83, 500=85, 533=87, 567=88, 600=90, 633=92, 667=92, 700=92, 733=92, 767=93, 800=93, 833=93, 867=93, 900=93, 933=94, 967=94, 1000=94, 1033=94, 1067=95, 1100=95, 1133=95, 1167=95, 1200=95, 1233=100"

The adjacent info from oct13.log is:

[task 2021-10-13T15:27:58.320Z] INFO - [JOB-27] Duration: 00:00:07.11, start: 0.000000, bitrate: 348 kb/s
[...]
"ContentfulSpeedIndex": 2505, "videoRecordingStart": 938, "ContentfulSpeedIndexProgress": "0=0, 302=12, 402=14, 436=14, 469=15, 503=15, 536=15, 570=15, 603=16, 637=16, 670=16, 704=16, 737=16, 771=16, 805=16, 838=16, 872=16, 905=16, 939=16, 972=16, 1006=16, 1039=16, 1073=16, 1106=16, 1140=16, 1173=17, 1207=18, 1240=18, 1274=19, 1307=20, 1341=20, 1374=21, 1408=22, 1441=23, 1475=23, 1508=24, 1542=25, 1575=25, 1609=26, 1643=27, 1676=27, 1710=28, 1743=29, 1777=29, 1810=30, 1844=31, 1877=31, 1911=32, 1944=33, 1978=34, 2011=34, 2045=35, 2078=36, 2112=36, 2145=37, 2179=38, 2212=38, 2246=39, 2279=40, 2313=40, 2346=41, 2380=42, 2413=42, 2447=43, 2480=44, 2514=45, 2548=45, 2581=46, 2615=47, 2648=47, 2682=48, 2715=49, 2749=49, 2782=50, 2816=51, 2849=51, 2883=52, 2916=53, 2950=54, 2983=54, 3017=55, 3050=56, 3084=56, 3117=57, 3151=58, 3184=59, 3218=59, 3251=60, 3285=61, 3318=61, 3352=62, 3386=63, 3419=64, 3453=64, 3486=65, 3520=66, 3553=66, 3587=67, 3620=68, 3654=69, 3687=69, 3721=70, 3754=71, 3788=72, 3821=72, 3855=73, 3888=74, 3922=74, 3955=75, 3989=76, 4022=77, 4056=77, 4089=78, 4123=79, 4156=79, 4190=80, 4224=81, 4257=82, 4291=82, 4324=83, 4358=84, 4391=84, 4425=85, 4458=86, 4492=87, 4525=87, 4559=88, 4592=89, 4626=89, 4659=90, 4693=91, 4726=92, 4760=92, 4793=93, 4827=94, 4860=95, 4894=95, 4927=96, 4961=97, 4994=97, 5028=98, 5062=99, 5095=100"

From the browsertime source, these numbers are actually frame times (in milliseconds) and percentages (measuring how much content is believed to be rendered). The first thing to note, then, is that the analysis of the better-scoring video treats the recording as about 4 seconds shorter than the other one does. The banner ad is shown for about 5 seconds. We also see that, while both page loads visually complete almost immediately, the first analysis is at 100% at 1.2 seconds, while the second is said to be only about 18% complete at that point.

Going back to the browsertime source, there are only a handful of places where the video could be shortened. Most of the things that eliminate redundant frames don't remove frames from the beginning or end (so the duration wouldn't change), and we don't pass any relevant parameters (renderignore and viewport do nothing for us). The only spot I see that could be shrinking the video is eliminate_duplicate_frames, which hardcodes a 15% "fuzz" in matching frames and allows removing frames all the way to the end. (This could be easily confirmed by running the script and looking at the output files, but I haven't set that up.) The fuzz parameter for ImageMagick uses simple Euclidean distance to evaluate the color distance between two pixels (sqrt(dR^2 + dG^2 + dB^2), where dR is red_for_pixel2 - red_for_pixel1, etc). Say we are using 8-bit color; then 15% would be a difference of e.g. 38.4 in one color channel, or about 20 in all 3. Since the movie is the only moving part of the recording, we can compare its pixels. I've spit out the frames in the right format using the same parameters we give to ffmpeg and ImageMagick. The sky seems to be pretty consistent, with the brightest part at a color of (66,73,81). Something like (33,41,48) seems to be the darkest in the video, for a distance of 56.6. That's over the threshold, but it requires the analysis to test a pixel that takes both values, at the time the pixel took both values. But the script only ever compares a frame against the final frame, and there may not be pixels that cover both extremes anywhere. Still, it could explain why some of the data says the video is < 2 seconds while other data says it is about 7 seconds, and everything in between.
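To make the threshold arithmetic concrete, here is a small standalone sketch of the Euclidean color-distance check described above. It follows this comment's interpretation of the fuzz (15% of an 8-bit channel range as the cutoff) and uses the sampled sky colors; it is not ImageMagick's actual fuzz implementation:

```cpp
#include <cmath>
#include <cstdio>

struct Rgb {
  double r, g, b;
};

// Euclidean distance between two RGB pixels: sqrt(dR^2 + dG^2 + dB^2).
double ColorDistance(const Rgb& aA, const Rgb& aB) {
  const double dR = aA.r - aB.r;
  const double dG = aA.g - aB.g;
  const double dB = aA.b - aB.b;
  return std::sqrt(dR * dR + dG * dG + dB * dB);
}

int main() {
  // 15% "fuzz" interpreted against an 8-bit channel range (0-255) => ~38.25.
  const double kFuzzThreshold = 0.15 * 255.0;

  // Brightest and darkest sky colors sampled from the recording, per the
  // comment above.
  const Rgb kBrightSky{66.0, 73.0, 81.0};
  const Rgb kDarkSky{33.0, 41.0, 48.0};

  const double dist = ColorDistance(kBrightSky, kDarkSky);  // ~56.6
  std::printf("distance %.1f vs threshold %.2f -> frames %s\n", dist,
              kFuzzThreshold,
              dist > kFuzzThreshold ? "differ (kept)" : "match (culled)");
  return 0;
}
```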

From there, it is easy to see why we get larger values for longer videos. That is sort of the point; the actual calculation makes sense. And I have confirmed that ImageMagick returns almost exactly the same measures (pixel counts on the edge-detected image) for frames from similar times in both videos.
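For reference, a speed-index style score is essentially the area above the visual-progress curve, so an analysis that treats the video as longer accumulates more "incomplete" time and therefore a higher (worse) score. Below is a minimal sketch using a few abbreviated millisecond=percent samples in the spirit of the two logs above; the real ContentfulSpeedIndex computation in browsertime may differ in details, and these abbreviated samples only reproduce the trend, not the exact 349 and 2505 scores:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// A speed-index style score: integrate (1 - visual completeness) over time.
// aProgress holds (time in ms, percent complete) samples.
double SpeedIndexLike(const std::vector<std::pair<double, double>>& aProgress) {
  double score = 0.0;
  for (size_t i = 1; i < aProgress.size(); ++i) {
    const double dt = aProgress[i].first - aProgress[i - 1].first;
    const double incomplete = 1.0 - aProgress[i - 1].second / 100.0;
    score += incomplete * dt;  // milliseconds spent visually incomplete
  }
  return score;
}

int main() {
  // Abbreviated samples in the spirit of the Oct 4 (truncated) analysis.
  const std::vector<std::pair<double, double>> shortRun = {
      {0, 0}, {167, 47}, {500, 85}, {900, 93}, {1233, 100}};
  // Abbreviated samples in the spirit of the Oct 13 (untrimmed) analysis.
  const std::vector<std::pair<double, double>> longRun = {
      {0, 0}, {302, 12}, {1207, 18}, {3050, 56}, {5095, 100}};

  std::printf("short analysis score ~= %.0f ms\n", SpeedIndexLike(shortRun));
  std::printf("long analysis score  ~= %.0f ms\n", SpeedIndexLike(longRun));
  return 0;
}
```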

I chose a low-scoring video for "before" and a high-scoring video for "after", but they each have plenty of both. The same analysis, but reversed, could be done using a slow "before" and a fast "after". The point is that the scores consistently track the measured video length.

I'm not sure what an ideal fix would be. Making the threshold programmable? Not testing with this page? Fixing the length of the video as a parameter? That last one sounds like it would work (that's the --end argument, but I'm not sure it would override this duplicate-frame culling).

Attached file oct1.log
Attached file oct13.log
Attached video oct1.mp4
Attached video oct13.mp4

Are you the right person to alert to the notes in comment 5?

Flags: needinfo?(aionescu)

(In reply to David Parks [:handyman] from comment #5)

I'm not sure what an ideal fix would be. Making the threshold programmable? Not testing with this page? Fixing the length of the video as a parameter? That last one sounds like it would work (that's the --end argument, but I'm not sure it would override this duplicate-frame culling).

Which of these wouldn't take much of your time and sounds reasonable to you?

Flags: needinfo?(aionescu) → needinfo?(davidp99)

TL;DR: It appears that the test is predicated on the length of the "active" portion of the video being fairly precisely identified. Clearly, in both videos, the page is loaded by the time the banner is shown, so any calculations after that are just (incorrectly) measuring the video, and that adds a lot to the score. OTOH, we don't know when the banner will pop up or whether that means we should stop analyzing -- the point of the test is to detect when the page load stops being that fast. I see these constraints as irreconcilable, which would mean we should cut the test.

--

We could change the pixel difference fuzz factor. There are multiple problems with this:

  • The fuzz factor is hard coded. It could be made into a parameter but the change would have to happen upstream.
  • We would want to change the fuzz factor only for this particular page, which would require script apparatus that I don't believe we have. Adds complexity.
  • How do we change the fuzz factor? We could reduce it slightly but that would always "count" the banner ad part, which is not measuring the right thing. We could increase it slightly but it's unclear how far up we'd have to go to get semi-consistent length estimates, and raising the threshold to ignore the movie is probably enough to also miss any legitimate "page render" changes after the banner.

I don't think any of this is worth the hassle.

Fixing the video's length is an easier way to get close to what we are looking for but also has problems:

  • Again, the --end parameter would be set only for this video, which means adding new script apparatus to allow that.
  • As mentioned above, since we don't know the proper length to analyze (we don't know when the page will be done being displayed), any fixed length might stop the test early. OTOH, given the way the test is written, any end time that comes after the finished page load still has the original issue in this bug (albeit less so) -- sometimes the test will cut the stuff that comes later and sometimes it won't, and that will incorrectly affect the score.

This is still a lot of hassle for only a partial improvement in test accuracy.

So my vote is to eliminate this imgur page contentful speed test altogether. That, I especially don't know how to do -- it's a question for the team handling automation. I'm not sure they have a way to keep the other speed tests (like the perceptual measure) while cutting this one.

Flags: needinfo?(davidp99) → needinfo?(aionescu)

Wouldn't it be enough to just close this as WONTFIX?

Flags: needinfo?(aionescu) → needinfo?(davidp99)

That is essentially what happened with bug 1704463, and it resurfaced. That could be a reasonable solution if there were a way to communicate that unstable results are to be ignored, but that would also negate the value of running the test at all. Still, if the infrastructure makes culling this one test difficult, then (somehow) communicating that warning might be the best solution available.

Flags: needinfo?(davidp99)
Has Regression Range: --- → yes

Dave, I found your name in our docs. Could you advise how to proceed here? See comment 12 and the suggestion to remove this test.

Flags: needinfo?(dave.hunt)
Priority: -- → P3

Sorry for the delayed responses. :handyman/:spohl, can you verify if this is still an issue? I believe this ended up getting fixed by another patch that fixed an issue in visualmetrics.py: https://github.com/sitespeedio/browsertime/pull/1680

This landed in early December and you can see it landing in this graph (note where it goes from incredibly noisy to stable): https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=0&highlightChangelogData=1&highlightCommonAlerts=0&highlightedRevisions=231dfa18e7e86deffb3f4b3c686524219e71ae48&highlightedRevisions=c32a2cf16642959c809342bb1d34cdceb757b917&series=autoland,3401262,1,13&timerange=7776000

This is the data we are seeing from this test right now (graph above stops because of a platform change): https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightCommonAlerts=0&series=mozilla-central,3424100,1,13&timerange=5184000

EDIT: If this all looks good to you, I'll close this as fixed.

Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(davidp99)
Flags: needinfo?(dave.hunt)

Thanks for chiming in! I will let David respond here.

Flags: needinfo?(spohl.mozilla.bugs)

Hey :sparky, thanks for looking into this! I got my hopes up for a second but, unfortunately, it doesn't look like the visualmetrics change is responsible for the improved results -- in fact, it should just be cropping the bottom 6% of the video, or about 77 pixels, which are completely static (modulo whatever Windows decorations might do with the window border, which doesn't look like anything). I can see why the change helped Chrome's testing by removing network status text, but it actually just makes ours less precise since we don't have any of that in our test videos (though not in a very important way -- see examples of our videos above).

The primary reason for the change in results seems to be that the elephant video stopped auto-playing around the time the results stabilize (~Dec 1). I haven't pinpointed the exact date that changed but this is a run with terrible results from Nov 18 -- the score was 2304 and some of the test video lengths were determined to be over 5 seconds. The elephant video autoplayed.

This is a run with good results from Dec 3 -- the score was 439 and the elephant video does not autoplay. All runs were determined to be about 3 seconds or less.

In a sense, not autoplaying the elephant video makes the results more accurate, but it is really more accurate at evaluating something we aren't interested in. The end of the test video will never be determined to be before the banner, so, since the test is measuring the distance from the last frame image, which is an image of the banner, we are just measuring the visual difference between whatever is on screen (the elephant video and icons if we are fast; a black background if not) and the banner. That isn't useful. I'd say that the presence of the banner invalidates the concept behind this test -- tweaks won't fix it.

Another change made its way into our test videos around this time -- they went from 1024x768 to 1280x1024. I don't think this affected much, but it would change things a bit, and I believe the original issue that spawned this bug was due to a small visual threshold being exceeded. Still, I don't think it's important here.

Flags: needinfo?(davidp99) → needinfo?(gmierz2)

Ah yes, thanks for finding that one! It's definitely the autoplay changes that did this.

So right now, the big banner which causes the translucent background is gone after we've made some improvements to the recording process. Have a look at the videos in this archive: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JkK1-TyTRkSwais1RAyx4w/runs/0/artifacts/public/test_info/browsertime-results.tgz

(They came from here: https://treeherder.mozilla.org/jobs?repo=mozilla-central&group_state=expanded&selectedTaskRun=JkK1-TyTRkSwais1RAyx4w.0&searchStr=imgur&revision=32a2d1ce4bab2979c3be01244f100f000ca77d8e)

Is this the banner you were referring to?

The issue that one of the patches I mentioned above fixed was which frames get picked for deletion, so that may have improved some of this.

It's hard for me to say whether this is an issue or a limitation of the metric, of the visualmetrics.py script, or of the test itself. It feels more like a limitation of the metric that we need to handle through changes to the test. Changing the fuzz factor has a high possibility of breaking our other metrics as well (we want to rewrite the vismet script in the future to handle these things better).

I've looped in :bas since he created the CSI metric and he may have something to add.

Flags: needinfo?(gmierz2)

(In reply to Greg Mierzwinski [:sparky] from comment #19)

So right now, the big banner which causes the translucent background is gone after we've made some improvements to the recording process. Have a look at the videos in this archive: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/JkK1-TyTRkSwais1RAyx4w/runs/0/artifacts/public/test_info/browsertime-results.tgz

Is this the banner you were referring to?

The videos in this archive look the same as the others where the video doesn't play. That is, they look like the videos attached to this bug, except that the video doesn't play and the size is 1280x1024. But the giant blue "We value your privacy" banner (and the translucent black pane that covers the stuff not covered by the banner) is still there. Maybe you meant to paste a different link?

The patch to visualmetrics.py you referenced seems to have been meant to remove the little status updates that sometimes appear at the bottom-left of a page, e.g. "Connecting to www.imgur.com..." and so on. We don't have those, or anything like that, in our runs. The bottom 40 or so rows of pixels in our images, which are what the changes in the new patch cut, are all identical once the banner pops up. Removing them shouldn't affect the image comparison that we use to decide whether an image is similar enough to the last image to be deleted (or, more precisely, to be considered the new ending of the video -- the test deletes the images after it).

Hi Alex! It looks like the last update on this ticket was 2 years ago, and we cannot see the original linked alert anymore. How do you suggest we proceed?

Flags: needinfo?(aionescu)
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Flags: needinfo?(aionescu)
Resolution: --- → WONTFIX