Bug 1519074 (Closed) · Opened 5 years ago · Closed 5 years ago

7.24 - 19.34% raptor-tp6-yandex-firefox (linux64, linux64-qr, windows10-64, windows10-64-qr, windows7-32) regression on push 8686beefb44f25a78bb91bc6d5a9e5582f9ffa9a (Wed Jan 9 2019)

Categories

Core :: DOM: Security
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla66

Tracking Status
firefox-esr60 --- unaffected
firefox64 --- unaffected
firefox65 --- unaffected
firefox66 --- fixed

People

(Reporter: igoldan, Assigned: Gijs)

References

Details

(Keywords: perf, regression)

Attachments

(1 file)

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=2c7c37b0ce9ca5f6745140b6551e9faa79e6c099&tochange=8686beefb44f25a78bb91bc6d5a9e5582f9ffa9a

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

19% raptor-tp6-yandex-firefox linux64-qr opt 182.25 -> 217.50
15% raptor-tp6-yandex-firefox windows10-64-qr opt 169.19 -> 195.35
13% raptor-tp6-yandex-firefox linux64 pgo 171.94 -> 194.14
8% raptor-tp6-yandex-firefox windows7-32 opt 193.90 -> 209.49
7% raptor-tp6-yandex-firefox windows10-64 opt 194.61 -> 208.95
7% raptor-tp6-yandex-firefox windows7-32 pgo 182.50 -> 195.70

You can find links to graphs and comparison views for each of the above tests at: http://localhost:5000/perf.html#/alerts?id=18653

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Flags: needinfo?(gijskruitbosch+bugs)

Where can I see the markup the test uses?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(igoldan)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)

> You can find links to graphs and comparison views for each of the above tests at: http://localhost:5000/perf.html#/alerts?id=18653
>
> On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

Please can you point to the actual link?

(In reply to :Gijs (he/him) from comment #3)

> (In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)
>
> > You can find links to graphs and comparison views for each of the above tests at: http://localhost:5000/perf.html#/alerts?id=18653
> >
> > On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.
>
> Please can you point to the actual link?

Oh, I'm sorry about that (forgot to close the local dev tab). Correct link: https://treeherder.mozilla.org/perf.html#/alerts?id=18653

Flags: needinfo?(igoldan)

(In reply to :Gijs (he/him) from comment #2)

> Where can I see the markup the test uses?

:rwood, :bebe could you provide a link to the markup these tests use?

Flags: needinfo?(rwood)
Flags: needinfo?(fstrugariu)

The profile shows us spending about 40-50ms in the parent process (after this change), in the middle of trying to load the page, re-parsing the CSP as it comes in over IPC ( https://perfht.ml/2H7kMfS ). Over the course of the profile we spend about 70-80ms in the CSP parser in the parent process overall. I think we can avoid doing the re-parsing for the http channel case, because the CSP checks will happen in the content process... presumably we could add an optional, default-to-false param to the PrincipalInfoToPrincipal method, maybe? I don't know if that's really the right fix... If we can't / don't want to do that, IMO we should come up with a more efficient way of transferring the CSP across processes, which might be worth thinking about anyway.
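
For illustration only, a minimal sketch of that opt-out idea (invented types and names; the real PrincipalInfoToPrincipal in ipc/glue/BackgroundUtils.cpp has a different signature): a default-to-false flag lets callers that never consult the CSP, like the parent-side HTTP channel path today, skip the expensive re-parse.

```cpp
// Hypothetical sketch only, not the actual Gecko code: a default-to-false
// opt-out flag on the deserialization helper, so callers that never look at
// the CSP can avoid re-running the CSP parser.
#include <memory>
#include <string>

struct ParsedCsp {                    // stand-in for nsCSPContext
  std::string policies;
};

struct PrincipalInfo {                // stand-in for the IPC-side struct
  std::string origin;
  std::string serializedCsp;          // the raw (possibly ~3.5k) header text
};

struct Principal {                    // stand-in for nsIPrincipal
  std::string origin;
  std::unique_ptr<ParsedCsp> csp;     // stays null when parsing is skipped
};

Principal PrincipalInfoToPrincipal(const PrincipalInfo& aInfo,
                                   bool aSkipCspDeserialization = false) {
  Principal principal;
  principal.origin = aInfo.origin;
  if (!aSkipCspDeserialization && !aInfo.serializedCsp.empty()) {
    // This is the expensive step the profile shows; in Gecko it is the full
    // CSP parser running on the string received over IPC.
    principal.csp = std::make_unique<ParsedCsp>(ParsedCsp{aInfo.serializedCsp});
  }
  return principal;
}

// A parent-side caller such as the HTTP channel setup could then pass
// /* aSkipCspDeserialization = */ true, since the CSP checks currently
// happen in the content process.
```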

Confusingly, I don't see us spending any time at all parsing CSPs in the content process. I don't know why that is; I'd have expected us to have to parse it in the content process, too.

Christoph, thoughts on what we can do in this space?

Flags: needinfo?(ckerschb)

(In reply to :Gijs (he/him) from comment #6)

> The profile shows us spending about 40-50ms in the parent process (after this change), in the middle of trying to load the page, re-parsing the CSP as it comes in over IPC ( https://perfht.ml/2H7kMfS ). Over the course of the profile we spend about 70-80ms in the CSP parser in the parent process overall. I think we can avoid doing the re-parsing for the http channel case, because the CSP checks will happen in the content process... presumably we could add an optional, default-to-false param to the PrincipalInfoToPrincipal method, maybe? I don't know if that's really the right fix... If we can't / don't want to do that, IMO we should come up with a more efficient way of transferring the CSP across processes, which might be worth thinking about anyway.
>
> Confusingly, I don't see us spending any time at all parsing CSPs in the content process. I don't know why that is; I'd have expected us to have to parse it in the content process, too.
>
> Christoph, thoughts on what we can do in this space?

I was already worried about that performance regression. This will become an even bigger issue once we land Bug 965637, where we also serialize the CSP within the client, or maybe even sooner when we land Bug 1518454, where we serialize the CSP explicitly when passing it from the frontend to docshell loads. I don't have any suggestion on how to fix this without spending more time on it.

For now, I guess it makes sense to let Boris know. Boris, any suggestions on how we could approach that problem?

Flags: needinfo?(ckerschb) → needinfo?(bzbarsky)

(In reply to Christoph Kerschbaumer [:ckerschb] from comment #7)

> I was already worried about that performance regression. This will become an even bigger issue once we land Bug 965637, where we also serialize the CSP within the client, or maybe even sooner when we land Bug 1518454, where we serialize the CSP explicitly when passing it from the frontend to docshell loads. I don't have any suggestion on how to fix this without spending more time on it.
>
> For now, I guess it makes sense to let Boris know. Boris, any suggestions on how we could approach that problem?

I'm a bit confused. Why do we need to serialize it when going from frontend to docshell? When crossing processes, I would expect this, but not when just passing from JS to docshell and vice versa.

I think there are a number of avenues for investigation here that could help, but I'm not convinced I have time to do said digging myself soon...

  1. why are we not spending time parsing in the content process? That is, is parsing just slow everywhere, and if not, why is it so slow in the parent?
  2. can we avoid sending the CSP in the HTTP channel principal case if all CSP decisions are made in the content process?
  3. can we come up with a more efficient way of serializing/deserializing the CSP that doesn't involve reparsing on the other side?
  4. where is the CSP even coming from (i.e. the needinfos that are out for the contents here)? I looked at the raptor content github pages and in m-c and I don't see any CSP in the yandex case, so I'm confused. I assume I'm not looking in the right places...

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #5)

> (In reply to :Gijs (he/him) from comment #2)
>
> > Where can I see the markup the test uses?
>
> :rwood, :bebe could you provide a link to the markup these tests use?

https://github.com/mozilla/perf-automation/tree/master/pagesets/mitmproxy/raptor-tp6

Flags: needinfo?(rwood)
Flags: needinfo?(fstrugariu)

(In reply to Robert Wood [:rwood] from comment #9)

> (In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #5)
>
> > (In reply to :Gijs (he/him) from comment #2)
> >
> > > Where can I see the markup the test uses?
> >
> > :rwood, :bebe could you provide a link to the markup these tests use?
>
> https://github.com/mozilla/perf-automation/tree/master/pagesets/mitmproxy/raptor-tp6

Where are the headers used for these pages by MITMProxy?

Flags: needinfo?(rwood)

To answer at least some of the questions in comment 8, we do in fact need CSP in the parent process: CSP can affect redirects, and those checks will need to happen in the parent process with Fission. That said, we could possibly serialize just the redirect-related parts of the CSP or something to handle this, to avoid parsing the whole thing.
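
As a rough, hypothetical illustration of "serialize just the redirect-related parts" (invented types; nothing like this exists in the tree as far as this bug shows), the parent would only receive the small slice of the policy that can constrain a redirect target, rather than the full header text:

```cpp
// Hypothetical sketch: extract only the redirect-relevant slice of a CSP for
// shipping to the parent process, instead of the whole multi-kilobyte policy.
// Types and the directive filter are invented for illustration.
#include <string>
#include <vector>

struct FullCsp {                          // what the content process has
  std::vector<std::string> directives;    // e.g. "script-src 'self' https://yastatic.net"
  bool upgradeInsecureRequests = false;
};

struct RedirectCspArgs {                  // what a parent-side redirect check might need
  std::vector<std::string> relevantDirectives;
  bool upgradeInsecureRequests = false;
};

RedirectCspArgs ExtractRedirectArgs(const FullCsp& aCsp) {
  RedirectCspArgs args;
  args.upgradeInsecureRequests = aCsp.upgradeInsecureRequests;
  for (const std::string& d : aCsp.directives) {
    // Toy filter: keep only source-list directives, which are what can
    // constrain where a redirect is allowed to go.
    if (d.rfind("default-src", 0) == 0 || d.rfind("script-src", 0) == 0 ||
        d.rfind("connect-src", 0) == 0 || d.rfind("img-src", 0) == 0) {
      args.relevantDirectives.push_back(d);
    }
  }
  return args;
}
```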

Now some generalities:

  1. We should definitely make sure that our serialization format is cheap to deserialize. I'm not sure it is right now.

  2. We should also carefully look at what parts of CSP we need where and not ship around things we don't need.

  3. We shouldn't deserialize in the parent process until we need it. Again, I suspect that in practice that's only in the face of redirects, right?

Flags: needinfo?(bzbarsky)

One other thing: because CSP is mutable, there are definitely times when we need to snapshot a CSP and hang the snapshot off a channel. This seems like something that should be doable more cheaply than a serialize+deserialize, but I wouldn't be surprised if right now we do a serialize+deserialize to accomplish this.
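
A tiny sketch of why the snapshot should be cheaper than a round trip (invented types, not Gecko's nsCSPContext): once the policy is parsed, a snapshot can be a plain structure copy, with no stringification and no re-parse.

```cpp
// Hypothetical sketch: snapshot a parsed CSP by copying its parsed
// representation rather than serializing to a header string and re-parsing.
// Types are invented for illustration.
#include <string>
#include <vector>

struct CspPolicy {
  std::string directiveName;            // e.g. "script-src"
  std::vector<std::string> sources;     // e.g. {"'self'", "https://example.com"}
};

struct CspSnapshot {
  std::vector<CspPolicy> policies;      // plain data, so copying is enough
};

// The channel keeps an immutable copy taken when the load starts; later
// mutations of the document's CSP don't affect it.
CspSnapshot SnapshotForChannel(const CspSnapshot& aDocumentCsp) {
  return aDocumentCsp;                  // cost proportional to the parsed data,
}                                       // no serialize + re-parse round trip
```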

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #11)

> 3. We shouldn't deserialize in the parent process until we need it.

Are there examples elsewhere in the codebase of doing as-needed IPC serialization/deserialization for this type of thing in C++? I'm guessing it'd need a significant amount of tweaking if some other class (principals, atm) has a member that it just expects to be there...

> Again, I suspect that in practice that's only in the face of redirects, right?

Not even that right now. Because of bug 1509738, we can only check redirects in the child today. This is why I suggested (at least for now) not sharing the CSP through the http channel parent. And yes, that wouldn't really be a permanent solution.

Anyway, I'm still confused about why this parsing isn't showing up in the child process and/or where the CSP stuff is even coming from... I'll see if I can find time to dig into that.

Flags: needinfo?(bzbarsky)

> Are there examples elsewhere in the codebase of doing as-needed IPC serialization/deserialization

I meant deserialization to a full-blown usable CSP, not the IPC end of things....

Flags: needinfo?(bzbarsky)

So yandex in the raptor test, when running locally with a debugger attached to the child processes, sends a 3.5k-length (!?) CSP string via HTTP header in the recordings, presumably stored in yandex.mp, which is some kind of weird compressed file format (well, file doesn't know what it is, it just says 'data'). Which explains where this comes from, at least.

Right now, I think the reason this is ballooning is that we pass a principal as part of the load info of every http request, and for http channels we pass this from child to parent, so we end up with a stack like:

#0	0x000000010f9ffe11 in mozilla::ipc::LoadInfoArgsToLoadInfo(mozilla::net::OptionalLoadInfoArgs const&, nsILoadInfo**) at mozilla-unified/ipc/glue/BackgroundUtils.cpp:521
#1	0x000000010f899415 in mozilla::net::HttpChannelParent::DoAsyncOpen(mozilla::ipc::URIParams const&, mozilla::ipc::OptionalURIParams const&, mozilla::ipc::OptionalURIParams const&, mozilla::ipc::OptionalURIParams const&, unsigned int const&, mozilla::ipc::OptionalURIParams const&, mozilla::ipc::OptionalURIParams const&, unsigned int const&, nsTArray<mozilla::net::RequestHeaderTuple> const&, nsTString<char> const&, mozilla::ipc::OptionalIPCStream const&, bool const&, short const&, unsigned int const&, unsigned char const&, bool const&, unsigned int const&, bool const&, unsigned long long const&, nsTString<char> const&, bool const&, nsTString<char> const&, bool const&, bool const&, bool const&, unsigned int const&, mozilla::net::OptionalLoadInfoArgs const&, mozilla::net::OptionalHttpResponseHead const&, nsTString<char> const&, unsigned int const&, unsigned long long const&, mozilla::net::OptionalCorsPreflightArgs const&, unsigned int const&, bool const&, bool const&, bool const&, nsTString<char> const&, unsigned int const&, unsigned int const&, unsigned long long const&, nsTString<char16_t> const&, unsigned long long const&, nsTArray<mozilla::Tuple<nsTString<char>, nsTString<char> > > const&, unsigned long long const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, bool const&, mozilla::TimeStamp const&) at mozilla-unified/netwerk/protocol/http/HttpChannelParent.cpp:443
#2	0x000000010f899249 in mozilla::net::HttpChannelParent::Init(mozilla::net::HttpChannelCreationArgs const&) at mozilla-unified/netwerk/protocol/http/HttpChannelParent.cpp:134
#3	0x000000010f94cc5c in mozilla::net::NeckoParent::RecvPHttpChannelConstructor(mozilla::net::PHttpChannelParent*, mozilla::dom::PBrowserOrId const&, IPC::SerializedLoadContext const&, mozilla::net::HttpChannelCreationArgs const&) at mozilla-unified/netwerk/ipc/NeckoParent.cpp:310
#4	0x000000010fb7bfe4 in mozilla::net::PNeckoParent::OnMessageReceived(IPC::Message const&) at builds/opt/ipc/ipdl/PNeckoParent.cpp:1043
#5	0x000000010fa11db2 in mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) at mozilla-unified/ipc/glue/MessageChannel.cpp:2160
#6	0x000000010fa10e97 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) at mozilla-unified/ipc/glue/MessageChannel.cpp:2087
#7	0x000000010fa1180e in mozilla::ipc::MessageChannel::MessageTask::Run() at mozilla-unified/ipc/glue/MessageChannel.cpp:1967
#8	0x000000010f4f08f9 in nsThread::ProcessNextEvent(bool, bool*) at mozilla-unified/xpcom/threads/nsThread.cpp:1167
#9	0x000000010f4eee13 in NS_ProcessPendingEvents(nsIThread*, unsigned int) at mozilla-unified/xpcom/threads/nsThreadUtils.cpp:416
#10	0x0000000111acacef in nsBaseAppShell::NativeEventCallback() at mozilla-unified/widget/nsBaseAppShell.cpp:87
#11	0x0000000111b33678 in nsAppShell::ProcessGeckoEvents(void*) at mozilla-unified/widget/cocoa/nsAppShell.mm:464

So every network request in this pageload gets the loading/triggering principal from the main page, with the giant CSP, which then gets reparsed for every single request. That is why the time balloons so much (parsing it once just doesn't take enough time to show up in the profiles, plus it might be cached after the first pageload cycle because we just keep loading the same URI without clearing cache or restarting)... tbh, that also makes me wonder if we could be smarter about caching the principal in the parent in the first place...
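
One possible shape for that "smarter caching" idea, sketched here with invented names (this is not what the bug landed): key a parent-side cache on the serialized bytes, so the many requests from the same page that ship an identical principal + CSP blob share one deserialized object instead of re-parsing it every time.

```cpp
// Hypothetical sketch of a parent-side cache keyed on the serialized bytes:
// identical PrincipalInfo blobs, as sent with every subresource request for
// the same page, map to a single already-deserialized object.
// Names and types are invented; this is not the Gecko implementation.
#include <memory>
#include <string>
#include <unordered_map>

struct DeserializedPrincipal {
  std::string origin;
  std::string parsedCsp;   // imagine the CSP already parsed into policy objects
};

class PrincipalCache {
 public:
  std::shared_ptr<DeserializedPrincipal> GetOrDeserialize(const std::string& aSerialized) {
    auto it = mCache.find(aSerialized);
    if (it != mCache.end()) {
      return it->second;   // cache hit: the 3.5k CSP is not parsed again
    }
    auto result = std::make_shared<DeserializedPrincipal>(ExpensiveDeserialize(aSerialized));
    mCache.emplace(aSerialized, result);
    return result;
  }

 private:
  static DeserializedPrincipal ExpensiveDeserialize(const std::string& aSerialized) {
    // Stand-in for the real work (principal reconstruction + CSP parsing).
    return DeserializedPrincipal{aSerialized.substr(0, aSerialized.find('|')), aSerialized};
  }

  std::unordered_map<std::string, std::shared_ptr<DeserializedPrincipal>> mCache;
};
```

A real cache would of course need an eviction/invalidation story, which is part of why this is only a sketch.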

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #14)

> > Are there examples elsewhere in the codebase of doing as-needed IPC serialization/deserialization
>
> I meant deserialization to a full-blown usable CSP, not the IPC end of things....

I have a WIP patch ( https://hg.mozilla.org/try/rev/a0f144cbf403df9f7be62a758c6e6097c9d72863 ) that does this and seems to bring things back down to close to the original levels. Raptor seems to be bimodal, looking at the graphs, so the noise makes it hard to say whether this really takes care of all of the issue, but it seems to improve things, at least:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a2914ba991199e173f5e2b84bcaa5794f3fcca31&newProject=try&newRevision=dd4be19992a29e85f55a055c6fa348ade09f994e&framework=10&showOnlyComparable=1

I've just realized that it would probably make more sense to push the complexity/lazy-parsing into the CSP object itself, esp. given we're trying to move it off the principal. I'll see whether I can get that to work.
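
For reference, a minimal sketch of what "push the lazy parsing into the CSP object itself" could look like (invented types and a toy parser; the actual change is the patch that landed on this bug): the object keeps the raw policy strings it received over IPC and only runs the parser the first time a consumer asks a question.

```cpp
// Hypothetical sketch of a lazily-parsed CSP holder: raw header strings are
// stored as received over IPC, and parsing happens on first use, so channels
// that never consult the CSP never pay for the parse.
// Invented types and a toy parser, not the actual Gecko patch.
#include <optional>
#include <string>
#include <vector>

struct ParsedPolicies {
  std::vector<std::string> directives;
};

class LazyCsp {
 public:
  explicit LazyCsp(std::vector<std::string> aRawPolicies)
      : mRawPolicies(std::move(aRawPolicies)) {}

  // Any real query (should-load checks, redirect checks, ...) goes through
  // Parsed(), which triggers the one-time parse.
  bool HasDirective(const std::string& aName) {
    for (const std::string& d : Parsed().directives) {
      if (d.rfind(aName, 0) == 0) {
        return true;
      }
    }
    return false;
  }

 private:
  const ParsedPolicies& Parsed() {
    if (!mParsed) {
      ParsedPolicies parsed;
      for (const std::string& raw : mRawPolicies) {
        size_t start = 0;
        while (start < raw.size()) {
          size_t end = raw.find(';', start);
          if (end == std::string::npos) {
            end = raw.size();
          }
          std::string directive = raw.substr(start, end - start);
          size_t first = directive.find_first_not_of(" \t");
          if (first != std::string::npos) {
            parsed.directives.push_back(directive.substr(first));
          }
          start = end + 1;
        }
      }
      mParsed = std::move(parsed);
    }
    return *mParsed;
  }

  std::vector<std::string> mRawPolicies;   // as received over IPC
  std::optional<ParsedPolicies> mParsed;   // built on first query only
};
```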

(In reply to :Gijs (he/him) from comment #15)

> Right now, I think the reason this is ballooning is that we pass a principal as part of the load info of every http request, and for http channels we pass this from child to parent
>
> So every network request in this pageload gets the loading/triggering principal from the main page, with the giant CSP, which then gets reparsed for every single request. (...) tbh, that also makes me wonder if we could be smarter about caching the principal in the parent in the first place...

Still curious whether we could avoid even more overhead if we kept track of principals more cleverly on the parent side.

Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Component: General → DOM: Security
Flags: needinfo?(rwood)
Product: Testing → Core
See Also: → 1521120
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/a69c1a1dc6fe
delay deserializing some CSP info until necessary, r=bzbarsky,ckerschb
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/c31d2c5695b1
delay deserializing some CSP info until necessary, r=bzbarsky,ckerschb
Flags: needinfo?(gijskruitbosch+bugs)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66

== Change summary for alert #18838 (as of Tue, 22 Jan 2019 06:49:12 GMT) ==

Improvements:

9% raptor-tp6-yandex-firefox linux64 pgo 159.24 -> 145.66
4% raptor-tp6-yahoo-news-firefox linux64 pgo 431.96 -> 415.81

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=18838

(In reply to Florin Strugariu [:Bebe] from comment #23)

> 9% raptor-tp6-yandex-firefox linux64 pgo 159.24 -> 145.66

I gotta admit, this is pretty confusing:

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)

> 13% raptor-tp6-yandex-firefox linux64 pgo 171.94 -> 194.14

These numbers aren't doing anything even close to matching up (neither before nor after, so it's not just that the patch is an improvement over the previous state or doesn't address things completely or something like that - it just doesn't match at all). I assume this is because the test is bimodal. Ionuț, can you take a closer look and let me know if you think there is more that needs to be done here? Also, can you check if someone is investigating why this test is bimodal? That seems like something that smells of either the test framework doing something odd or our implementation meaning we vary quite wildly in terms of how quickly we load this page, or something...

Flags: needinfo?(igoldan)

(In reply to :Gijs (he/him) from comment #24)

> (In reply to Florin Strugariu [:Bebe] from comment #23)
>
> > 9% raptor-tp6-yandex-firefox linux64 pgo 159.24 -> 145.66
>
> I gotta admit, this is pretty confusing:
>
> (In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)
>
> > 13% raptor-tp6-yandex-firefox linux64 pgo 171.94 -> 194.14
>
> These numbers aren't doing anything even close to matching up (neither before nor after, so it's not just that the patch is an improvement over the previous state or doesn't address things completely or something like that - it just doesn't match at all). I assume this is because the test is bimodal.

I can confirm these tests are noisy, but not bimodal: they don't tend to alternate between min and max values. If the concern is that these values look too different, note that we updated the Raptor test runner in bug 1518479, which falsely improved most if not all of the Raptor baselines.

> Ionuț, can you take a closer look and let me know if you think there is more that needs to be done here?

I compared the old regression with the changes pushed in comment 21. I'd say it got fixed by 50%.

> Also, can you check if someone is investigating why this test is bimodal? That seems like something that smells of either the test framework doing something odd or our implementation meaning we vary quite wildly in terms of how quickly we load this page, or something...

I'm not sure if someone is actively investigating this issue. :rwood are you aware of any research on this matter?

Flags: needinfo?(igoldan) → needinfo?(rwood)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #25)

> (In reply to :Gijs (he/him) from comment #24)
>
> > (In reply to Florin Strugariu [:Bebe] from comment #23)
> >
> > > 9% raptor-tp6-yandex-firefox linux64 pgo 159.24 -> 145.66
> >
> > I gotta admit, this is pretty confusing:
> >
> > (In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)
> >
> > > 13% raptor-tp6-yandex-firefox linux64 pgo 171.94 -> 194.14
> >
> > These numbers aren't doing anything even close to matching up (neither before nor after, so it's not just that the patch is an improvement over the previous state or doesn't address things completely or something like that - it just doesn't match at all). I assume this is because the test is bimodal.
>
> I can confirm these tests are noisy, but not bimodal: they don't tend to alternate between min and max values.

Hm, I must have been confused with either another subtest or the values for the 2017 ("ux") reference device. I'll try and find what I was looking at later...

> If the concern is that these values look too different, note that we updated the Raptor test runner in bug 1518479, which falsely improved most if not all of the Raptor baselines.

Ah! OK, but we can still see the subtest values which let us compare apples to apples (instead of an average that got lowered because we added another test).

> > Ionuț, can you take a closer look and let me know if you think there is more that needs to be done here?
>
> I compared the old regression with the changes pushed in comment 21. I'd say it got fixed by 50%.

Thanks for looking. Can you elaborate? Looking at the subtests, i.e.:

Thanks for looking. Can you elaborate? Looking at the subtests, i.e.:

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=a43422e9e4da8252bf1b48eb05206d0520b06f39&newProject=mozilla-inbound&newRevision=8686beefb44f25a78bb91bc6d5a9e5582f9ffa9a&originalSignature=1825430&newSignature=1825430&framework=10

(regression for https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8686beefb44f25a78bb91bc6d5a9e5582f9ffa9a, which was the original regressing push)

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=57dc8bbbc38f79ce532feb8947b40311e96ea577&newProject=autoland&newRevision=c31d2c5695b185f5dab74513f911388ddcb23a46&originalSignature=1825130&newSignature=1825130&framework=10

(improvement for https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=c31d2c5695b185f5dab74513f911388ddcb23a46, i.e. this bug)

The regressions are:

dcf: 169.00 < 198.00
fcp: no results
fnbpaint: 72.00 > 70.50
loadtime: 304.00 < 357.50
ttfi: 250.00 < 273.00

and the improvements:

dcf: 202.00 > 170.00
fcp: 72.00 = 72.00
fnbpaint: 72.00 = 72.00
loadtime: 348.50 > 300.50
ttfi: 274.00 > 250.00

This actually seems like it's basically fixed? The 169 vs. 170 (1ms regression) and 300.5 vs 304.00 (3.5ms improvement) seem like noise... We don't really know what, if anything, happened to fcp, but given fcp and fnbpaint seem to be returning the same numbers (so presumably are just providing time to the same paint event), I think that that number staying the same is fine.

Is the result worse on other platforms? Where should I be looking for what still needs doing here (probably in a follow-up, I guess)?

Flags: needinfo?(igoldan)

I looked over all the regressions and visually noticed the fix didn't have the same magnitude as the regression. So I compared the (before regression) - (after regression) difference with the (before fix) - (after fix) difference, to be able to give an approximate fix percentage.

I ran a big batch of retriggers/backfills for the before/after regression and before/after fix changesets, to be able to give a more precise answer.

This fix created a regression and a lot of noise on raptor:

== Change summary for alert #18882 (as of Fri, 18 Jan 2019 22:35:13 GMT) ==

Regressions:

17% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp linux64-qr opt 71.38 -> 83.66
16% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp linux64 opt 72.75 -> 84.33
11% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp windows10-64 opt 72.19 -> 79.83
9% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp windows7-32 opt 71.77 -> 78.29
8% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp windows10-64-qr opt 71.73 -> 77.25

Improvements:

17% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime linux64-qr opt 390.91 -> 324.78
13% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime linux64 opt 375.25 -> 326.79
10% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime windows7-32 opt 376.21 -> 337.33
10% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime windows7-32 pgo 348.08 -> 314.62
9% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime windows10-64 opt 376.25 -> 342.08
7% raptor-tp6-yandex-firefox linux64-qr opt 171.04 -> 158.36

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=18882

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #25)

> > Also, can you check if someone is investigating why this test is bimodal? That seems like something that smells of either the test framework doing something odd or our implementation meaning we vary quite wildly in terms of how quickly we load this page, or something...
>
> I'm not sure if someone is actively investigating this issue. :rwood are you aware of any research on this matter?

No, but perhaps acreskey has come across this?

Flags: needinfo?(rwood) → needinfo?(acreskey)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #27)

> I looked over all the regressions and visually noticed the fix didn't have the same magnitude as the regression. So I compared the (before regression) - (after regression) difference with the (before fix) - (after fix) difference, to be able to give an approximate fix percentage.

Did you do this for the subtests or for the combined Raptor value? Because of the change in bug 1518479, a difference on the overall test value is expected, but that isn't a sign this didn't work: the added subtest means any change to the other values shows up as a reduced magnitude of change on the overall value. This is (one of) the problem(s) with representing 4 or 5 different perf numbers from a test by just taking the average, especially when we then change which set of numbers we take the average of...
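
To make the dilution concrete, here is a toy calculation with made-up subtest values and a plain arithmetic mean (Raptor's real summarization may differ): adding a fifth, low-valued subtest such as fcp changes the overall number even when nothing real changed, which is why overall values from before and after bug 1518479 can't be compared directly.

```cpp
// Toy numbers only: shows how adding a low-valued fcp subtest to the summary
// shifts the overall score even when the other subtests are unchanged.
// Plain arithmetic mean; Raptor's actual summarization may differ.
#include <cstdio>
#include <numeric>
#include <vector>

static double Mean(const std::vector<double>& aValues) {
  return std::accumulate(aValues.begin(), aValues.end(), 0.0) / aValues.size();
}

int main() {
  std::vector<double> withoutFcp{200, 70, 350, 270};      // dcf, fnbpaint, loadtime, ttfi
  std::vector<double> withFcp{200, 70, 70, 350, 270};     // same values, fcp (~70) added

  std::printf("overall without fcp: %.1f\n", Mean(withoutFcp));  // 222.5
  std::printf("overall with fcp:    %.1f\n", Mean(withFcp));     // 192.0
  std::printf("apparent change with no real change: %.1f%%\n",
              100.0 * (Mean(withFcp) - Mean(withoutFcp)) / Mean(withoutFcp));  // about -13.7%
}
```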

(In reply to Florin Strugariu [:Bebe] from comment #28)

> This fix created a regression and a lot of noise on raptor:

So the noise and the regression are in fcp. I expect (though we can't know, unless we can run the test retroactively on old pushes?) that this fix reintroduced noise, so it's not really a real regression.

See: https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=autoland,1837147,1,10&series=mozilla-inbound,1837450,1,10&series=mozilla-central,1822566,1,10

The first contentful paint value looks to be tracking mostly the same pattern as the first non-blank paint value (which might be expected? Who knows about this?) since it was introduced. I already pointed this out in comment 26: both give frame timestamps, and presumably both are usually being hit on the same frame (at least for this test).

This regression bug was filed on Jan 10 as a result of bug 1515863 landing. You can clearly see the same noise and higher values in the graph prior to Jan 10. In other words, this noise was there before, and then it became more stable as a result of the change on Jan 10 (which also regressed a bunch of other raptor metrics, which were addressed in this bug).

In any case, this fix has landed and backing it out at this point doesn't seem a pleasant option, as the original regression this was meant to address looks to be fixed, and the original security fix + the change in this bug is about to move to beta (merge day). A trade-off of <=12ms on fcp compared to 20-65ms on load time also seems net positive to me. In other words, I don't think reopening and backing this change out is a good option.

I'd still like to at least understand what is going on with the (imo comparatively minor) fcp changes. Please can you file a separate bug for this, and link to profiles like comment #0 in this bug, to make it easier to investigate what's going on here?

I also wish raptor put markers into the profile for the metrics it collected, but that's fodder for another bug, I guess...

Flags: needinfo?(fstrugariu)

(In reply to :Gijs (limited availability until Tue 29; he/him) from comment #30)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #27)

I looked over all regressions and visually noticed the fix didn't have the same magnitude as the regression. So I compared before regression - after regression difference with before fix - after fix, to be able to give an approximate fix percentage.

Did you do this for subtests or for the combined value for raptor? Because of the change in bug 1518479, a difference on the overall test is expected, but isn't a sign this didn't work, just that the added value means any change to the other values shows up as reduced magnitude of change on the overall test value. This is (one of) the problem(s) with representing 4 or 5 different perf numbers from a test by just taking the average, especially when we then change what set of numbers we take the average of...

I did this check for combined values for raptor. I guess we can close this bug then.

Flags: needinfo?(igoldan)

I'm not familiar with the yandex pageload test yet, but I'll profile it and see how it behaves.
I think in general we're going to see a lot of the big shifts drop out now that bug 1270059 is merged.

Depends on: 1523314
Flags: needinfo?(acreskey)

(In reply to :Gijs (he/him) from comment #30)

> I'd still like to at least understand what is going on with the (imo comparatively minor) fcp changes. Please can you file a separate bug for this, and link to profiles like comment #0 in this bug, to make it easier to investigate what's going on here?

Created bug 1523314

Flags: needinfo?(fstrugariu)
Status: RESOLVED → VERIFIED