1519074 - 7.24 - 19.34% raptor-tp6-yandex-firefox (linux64, linux64-qr, windows10-64, windows10-64-qr, windows7-32) regression on push 8686beefb44f25a78bb91bc6d5a9e5582f9ffa9a (Wed Jan 9 2019)

Reporter

Description

•

6 years ago

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=2c7c37b0ce9ca5f6745140b6551e9faa79e6c099&tochange=8686beefb44f25a78bb91bc6d5a9e5582f9ffa9a

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

19% raptor-tp6-yandex-firefox linux64-qr opt 182.25 -> 217.50
15% raptor-tp6-yandex-firefox windows10-64-qr opt 169.19 -> 195.35
13% raptor-tp6-yandex-firefox linux64 pgo 171.94 -> 194.14
8% raptor-tp6-yandex-firefox windows7-32 opt 193.90 -> 209.49
7% raptor-tp6-yandex-firefox windows10-64 opt 194.61 -> 208.95
7% raptor-tp6-yandex-firefox windows7-32 pgo 182.50 -> 195.70

You can find links to graphs and comparison views for each of the above tests at: http://localhost:5000/perf.html#/alerts?id=18653

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Ionuț Goldan [:igoldan]

Reporter

Updated

•

6 years ago

Flags: needinfo?(gijskruitbosch+bugs)

Ionuț Goldan [:igoldan]

Reporter

Comment 1

•

6 years ago

Here are the Gecko profiles for raptor-tp5-yandex-firefox Windows 10:

before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FA8z0PLfYQzSe9hMNnMTPgg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-yandex-firefox.zip

after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FXe2pzk-6QcS_2DfSZ4VwSQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-yandex-firefox.zip

and on Linux 64bit Quantum Render:

before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FPeFv_VtCTYW3l4-FgGfpgg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-yandex-firefox.zip

after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FVpdZenU3TvK3ACwNTyyQzg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-yandex-firefox.zip

:Gijs (he/him)

Assignee

Comment 2

•

6 years ago

Where can I see the markup the test uses?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(igoldan)

:Gijs (he/him)

Assignee

Comment 3

•

6 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)

You can find links to graphs and comparison views for each of the above tests at: http://localhost:5000/perf.html#/alerts?id=18653

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

Please can you point to the actual link?

Ionuț Goldan [:igoldan]

Reporter

Comment 4

•

6 years ago

(In reply to :Gijs (he/him) from comment #3)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)

You can find links to graphs and comparison views for each of the above tests at: http://localhost:5000/perf.html#/alerts?id=18653

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

Please can you point to the actual link?

Oh, I'm sorry about that (forgot to close the local dev tab). Correct link: https://treeherder.mozilla.org/perf.html#/alerts?id=18653

Flags: needinfo?(igoldan)

Ionuț Goldan [:igoldan]

Reporter

Comment 5

•

6 years ago

(In reply to :Gijs (he/him) from comment #2)

Where can I see the markup the test uses?

:rwood, :bebe could you provide a link to the markup these tests use?

Flags: needinfo?(rwood)

Flags: needinfo?(fstrugariu)

:Gijs (he/him)

Assignee

Comment 6

•

6 years ago

The profile shows us spending about 40-50ms in the parent process (after this change), in the middle of trying to load the page, re-parsing the CSP as it comes in over IPC ( https://perfht.ml/2H7kMfS ). We spend about 70-80ms in the CSP parser in general, over the course of the profile, in the parent process. I think we can avoid doing the re-parsing for the http channel case, because the CSP checks will happen in the content process... presumably we could make an optional, default-to-false, param to the PrincipalInfoToPrincipal method, maybe?. I don't know if that's really the right fix... If we can't / don't want to do that, IMO we should come up with a more efficient way of transferring the CSP across processes, which might be worth thinking about anyway.

Confusingly, I don't see us spending any time at all parsing CSPs in the content process. I don't know why that is, I'd expect us to have had to parse it in the content process, too.

Christoph, thoughts on what we can do in this space?

Flags: needinfo?(ckerschb)

Christoph Kerschbaumer [:ckerschb]

Comment 7

•

6 years ago

(In reply to :Gijs (he/him) from comment #6)

The profile shows us spending about 40-50ms in the parent process (after this change), in the middle of trying to load the page, re-parsing the CSP as it comes in over IPC ( https://perfht.ml/2H7kMfS ). We spend about 70-80ms in the CSP parser in general, over the course of the profile, in the parent process. I think we can avoid doing the re-parsing for the http channel case, because the CSP checks will happen in the content process... presumably we could make an optional, default-to-false, param to the PrincipalInfoToPrincipal method, maybe?. I don't know if that's really the right fix... If we can't / don't want to do that, IMO we should come up with a more efficient way of transferring the CSP across processes, which might be worth thinking about anyway.

Confusingly, I don't see us spending any time at all parsing CSPs in the content process. I don't know why that is, I'd expect us to have had to parse it in the content process, too.

Christoph, thoughts on what we can do in this space?

I was already worried about that performance regression. This will become even a bigger issue once we are going to land Bug 965637 where we also serialize the CSP within the client. Or maybe already when we land Bug 1518454 where we serialize the CSP explicitly when passing it from the frontend to docshell loads. I don't have any suggestion on how to fix without spending more time on it.

For now, I guess it makes sense to let Boris know. Boris any suggestion on how we could approach that problem?

Flags: needinfo?(ckerschb) → needinfo?(bzbarsky)

:Gijs (he/him)

Assignee

Comment 8

•

6 years ago

(In reply to Christoph Kerschbaumer [:ckerschb] from comment #7)

I was already worried about that performance regression. This will become even a bigger issue once we are going to land Bug 965637 where we also serialize the CSP within the client. Or maybe already when we land Bug 1518454 where we serialize the CSP explicitly when passing it from the frontend to docshell loads. I don't have any suggestion on how to fix without spending more time on it.

For now, I guess it makes sense to let Boris know. Boris any suggestion on how we could approach that problem?

I'm a bit confused. Why do we need to serialize it when going from frontend to docshell? When crossing processes, I would expect this, but not when just passing from JS to docshell and vice versa.

I think there's a number of avenues for investigation here that could help, but I'm not convinced I have time to do said digging myself soon...

why are we not spending time parsing in the content process? That is, is parsing just slow everywhere, and if not, why is it so slow in the parent?
can we avoid sending the CSP in the HTTP channel principal case if all CSP decisions are made in the content process?
can we come up with a more efficient way of serializing/deserializing the CSP that doesn't involve reparsing on the other side?
where is the CSP even coming from (ie the needinfos that are out for the contents here) ? I looked at the raptor content github pages and in m-c and I don't see any CSP in the yandex case, so I'm confused. I assume I'm not looking in the right places...

Robert Wood [:rwood]

Comment 9

•

6 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #5)

(In reply to :Gijs (he/him) from comment #2)

Where can I see the markup the test uses?

:rwood, :bebe could you provide a link to the markup these tests use?

https://github.com/mozilla/perf-automation/tree/master/pagesets/mitmproxy/raptor-tp6

Flags: needinfo?(rwood)

Flags: needinfo?(fstrugariu)

:Gijs (he/him)

Assignee

Comment 10

•

6 years ago

(In reply to Robert Wood [:rwood] from comment #9)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #5)

(In reply to :Gijs (he/him) from comment #2)

Where can I see the markup the test uses?

:rwood, :bebe could you provide a link to the markup these tests use?

https://github.com/mozilla/perf-automation/tree/master/pagesets/mitmproxy/raptor-tp6

Where are the headers used for these pages by MITMProxy?

Flags: needinfo?(rwood)

Boris Zbarsky [:bzbarsky]

Comment 11

•

6 years ago

To answer at least some of the questions in comment 8, we do in fact need CSP in the parent process: CSP can affect redirects, and those checks will need to happen in the parent process with Fission. That said, we could possibly serialize just the redirect-related parts of the CSP or something to handle this, to avoid parsing the whole thing.

Now some generalities:

We should definitely make sure that our serialization format is cheap to deserialize. I'm not sure it is right now.
We should also carefully look at what parts of CSP we need where and not ship around things we don't need.
We shouldn't deserialize in the parent process until we need it. Again, I suspect that in practice that's only in the face of redirects, right?

Flags: needinfo?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 12

•

6 years ago

One other thing: because CSP is mutable, there are definitely times when we need to snapshot a CSP and hang the snapshot off a channel. This seems like something that should be doable more cheaply than a serialize+deserialize, but I wouldn't be surprised if right now we do a serialize+deserialize to accomplish this.

:Gijs (he/him)

Assignee

Comment 13

•

6 years ago

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #11)

We shouldn't deserialize in the parent process until we need it.

Are there examples elsewhere in the codebase of doing as-needed IPC serialization/deserialization for this type of thing in C++ ? I'm guessing it'd need a significant amount of tweaking if some other class (principals, atm) has a member that it just expects to be there...

Again, I suspect that in practice that's only in the face of redirects, right?

Not even that right now. Because of bug 1509738, we can only check redirects in the child today. This is why I suggested (at least for now) not sharing the CSP through the http channel parent. And yes, that wouldn't really be a permanent solution.

Anyway, I'm still confused about why this parsing isn't showing up in the child process and/or where the CSP stuff is even coming from... I'll see if I can find time to dig into that.

Flags: needinfo?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 14

•

6 years ago

Are there examples elsewhere in the codebase of doing as-needed IPC serialization/deserialization

I meant deserialization to a full-blown usable CSP, not the IPC end of things....

Flags: needinfo?(bzbarsky)

:Gijs (he/him)

Assignee

Comment 15

•

6 years ago

So yandex in the raptor test, when running locally and using a debugger in the child processes, sends a 3.5k length (!?) CSP string in the recordings via http header, presumably stored in yandex.mp, which are some kind of weird compressed file format (well, file doesn't know what it is, it just says 'data'). Which explains where this comes from, at least.

Right now, I think the reason this is ballooning is that we pass a principal as part of the load info of every http request, and for http channels we pass this from child to parent, so we end up with a stack like:

#0	0x000000010f9ffe11 in mozilla::ipc::LoadInfoArgsToLoadInfo(mozilla::net::OptionalLoadInfoArgs const&, nsILoadInfo**) at mozilla-unified/ipc/glue/BackgroundUtils.cpp:521
#1	0x000000010f899415 in mozilla::net::HttpChannelParent::DoAsyncOpen(mozilla::ipc::URIParams const&, mozilla::ipc::OptionalURIParams const&, mozilla::ipc::OptionalURIParams const&, mozilla::ipc::OptionalURIParams const&, unsigned int const&, mozilla::ipc::OptionalURIParams const&, mozilla::ipc::OptionalURIParams const&, unsigned int const&, nsTArray<mozilla::net::RequestHeaderTuple> const&, nsTString<char> const&, mozilla::ipc::OptionalIPCStream const&, bool const&, short const&, unsigned int const&, unsigned char const&, bool const&, unsigned int const&, bool const&, unsigned long long const&, nsTString<char> const&, bool const&, nsTString<char> const&, bool const&, bool const&, bool const&, unsigned int const&, mozilla::net::OptionalLoadInfoArgs const&, mozilla::net::OptionalHttpResponseHead const&, nsTString<char> const&, unsigned int const&, unsigned long long const&, mozilla::net::OptionalCorsPreflightArgs const&, unsigned int const&, bool const&, bool const&, bool const&, nsTString<char> const&, unsigned int const&, unsigned int const&, unsigned long long const&, nsTString<char16_t> const&, unsigned long long const&, nsTArray<mozilla::Tuple<nsTString<char>, nsTString<char> > > const&, unsigned long long const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&, bool const&, mozilla::TimeStamp const&) at mozilla-unified/netwerk/protocol/http/HttpChannelParent.cpp:443
#2	0x000000010f899249 in mozilla::net::HttpChannelParent::Init(mozilla::net::HttpChannelCreationArgs const&) at mozilla-unified/netwerk/protocol/http/HttpChannelParent.cpp:134
#3	0x000000010f94cc5c in mozilla::net::NeckoParent::RecvPHttpChannelConstructor(mozilla::net::PHttpChannelParent*, mozilla::dom::PBrowserOrId const&, IPC::SerializedLoadContext const&, mozilla::net::HttpChannelCreationArgs const&) at mozilla-unified/netwerk/ipc/NeckoParent.cpp:310
#4	0x000000010fb7bfe4 in mozilla::net::PNeckoParent::OnMessageReceived(IPC::Message const&) at builds/opt/ipc/ipdl/PNeckoParent.cpp:1043
#5	0x000000010fa11db2 in mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) at mozilla-unified/ipc/glue/MessageChannel.cpp:2160
#6	0x000000010fa10e97 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) at mozilla-unified/ipc/glue/MessageChannel.cpp:2087
#7	0x000000010fa1180e in mozilla::ipc::MessageChannel::MessageTask::Run() at mozilla-unified/ipc/glue/MessageChannel.cpp:1967
#8	0x000000010f4f08f9 in nsThread::ProcessNextEvent(bool, bool*) at mozilla-unified/xpcom/threads/nsThread.cpp:1167
#9	0x000000010f4eee13 in NS_ProcessPendingEvents(nsIThread*, unsigned int) at mozilla-unified/xpcom/threads/nsThreadUtils.cpp:416
#10	0x0000000111acacef in nsBaseAppShell::NativeEventCallback() at mozilla-unified/widget/nsBaseAppShell.cpp:87
#11	0x0000000111b33678 in nsAppShell::ProcessGeckoEvents(void*) at mozilla-unified/widget/cocoa/nsAppShell.mm:464

So every network request in this pageload gets the loading/triggering principal from the main page, with the giant CSP, which then gets reparsed for every single request, which is why the time balloons so much (and parsing it once just doesn't take enough time to show up in the profiles, plus it might be cached after the first pageload cycle because we just keep loading the same URI without clearing cache or restarting)... tbh, that also makes me wonder if we could be smarter about caching the principal in the parent in the first place...

:Gijs (he/him)

Assignee

Comment 16

•

6 years ago

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #14)

Are there examples elsewhere in the codebase of doing as-needed IPC serialization/deserialization

I meant deserialization to a full-blown usable CSP, not the IPC end of things....

I have a WIP patch ( https://hg.mozilla.org/try/rev/a0f144cbf403df9f7be62a758c6e6097c9d72863 ) that does this that seems to bring this down to close to the original levels. Raptor seems to be bimodal, looking at the graphs, so the noise makes it hard to say if this really takes care of all of the issue, but it seems to improve things, at least:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a2914ba991199e173f5e2b84bcaa5794f3fcca31&newProject=try&newRevision=dd4be19992a29e85f55a055c6fa348ade09f994e&framework=10&showOnlyComparable=1

I've just realized that it would probably make more sense to push the complexity/lazy-parsing into the CSP object itself, esp. given we're trying to move it off the principal. I'll see whether I can get that to work.

(In reply to :Gijs (he/him) from comment #15)

Right now, I think the reason this is ballooning is that we pass a principal as part of the load info of every http request, and for http channels we pass this from child to parent
So every network request in this pageload gets the loading/triggering principal from the main page, with the giant CSP, which then gets reparsed for every single request, (...) tbh, that also makes me wonder if we could be smarter about caching the principal in the parent in the first place...

Still curious if we could avoid still more overhead if we kept track of principals more cleverly on the parent side.

:Gijs (he/him)

Assignee

Comment 17

•

6 years ago

OK, I moved the lazy ipc-to-real-csp work into the CSP object itself, and the perf wins look pretty good: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a2914ba991199e173f5e2b84bcaa5794f3fcca31&newProject=try&newRevision=3d65fbe1014b48b537b82af38fad6e17ae5ebb1f&framework=10&showOnlyComparable=1

trypush to see if I'm not accidentally breaking something:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e386df1fc022a241c45f8da174ad8594123cfb1f

:Gijs (he/him)

Assignee

Updated

•

6 years ago

Assignee: nobody → gijskruitbosch+bugs

Status: NEW → ASSIGNED

Component: General → DOM: Security

Flags: needinfo?(rwood)

Product: Testing → Core

:Gijs (he/him)

Assignee

Comment 18

•

6 years ago

Attached file Bug 1519074 - delay deserializing some CSP info until necessary, r?bzbarsky!,ckerschb! — Details

:Gijs (he/him)

Assignee

Updated

•

6 years ago

Comment 19

•

6 years ago

Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/a69c1a1dc6fe delay deserializing some CSP info until necessary, r=bzbarsky,ckerschb

Andreea Pavel [:apavel]

Comment 20

•

6 years ago

Backed out for build bustages

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&selectedJob=222759657&revision=a69c1a1dc6fe4dec9598bcdd3038d12bf681a045

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=222759657&repo=autoland&lineNumber=24205

Backout: https://hg.mozilla.org/integration/autoland/rev/db283d60bcf1b16b3812cdd57a88e1eada225f81

Flags: needinfo?(gijskruitbosch+bugs)

Pulsebot

Comment 21

•

6 years ago

Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/c31d2c5695b1 delay deserializing some CSP info until necessary, r=bzbarsky,ckerschb

:Gijs (he/him)

Assignee

Updated

•

6 years ago

Flags: needinfo?(gijskruitbosch+bugs)

Cosmin Sabou [:CosminS]

Comment 22

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c31d2c5695b1

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox66: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla66

Florin Strugariu [:Bebe]

Comment 23

•

6 years ago

== Change summary for alert #18838 (as of Tue, 22 Jan 2019 06:49:12 GMT) ==

Improvements:

9% raptor-tp6-yandex-firefox linux64 pgo 159.24 -> 145.66
4% raptor-tp6-yahoo-news-firefox linux64 pgo 431.96 -> 415.81

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=18838

:Gijs (he/him)

Assignee

Comment 24

•

6 years ago

(In reply to Florin Strugariu [:Bebe] from comment #23)

9% raptor-tp6-yandex-firefox linux64 pgo 159.24 -> 145.66

I gotta admit, this is pretty confusing:

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)

13% raptor-tp6-yandex-firefox linux64 pgo 171.94 -> 194.14

These numbers aren't doing anything even close to matching up (neither before nor after, so it's not just that the patch is an improvement over the previous state or doesn't address things completely or something like that - it just doesn't match at all). I assume this is because the test is bimodal. Ionuț, can you take a closer look and let me know if you think there is more that needs to be done here? Also, can you check if someone is investigating why this test is bimodal? That seems like something that smells of either the test framework doing something odd or our implementation meaning we vary quite wildly in terms of how quickly we load this page, or something...

Flags: needinfo?(igoldan)

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox64: --- → unaffected

status-firefox65: --- → unaffected

status-firefox-esr60: --- → unaffected

Ionuț Goldan [:igoldan]

Reporter

Comment 25

•

6 years ago

•

Edited

(In reply to :Gijs (he/him) from comment #24)

(In reply to Florin Strugariu [:Bebe] from comment #23)

9% raptor-tp6-yandex-firefox linux64 pgo 159.24 -> 145.66

I gotta admit, this is pretty confusing:

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)

13% raptor-tp6-yandex-firefox linux64 pgo 171.94 -> 194.14

These numbers aren't doing anything even close to matching up (neither before nor after, so it's not just that the patch is an improvement over the previous state or doesn't address things completely or something like that - it just doesn't match at all). I assume this is because the test is bimodal.

I confirm these tests are noisy, but not bimodal. They don't have the tendency of alternating min/max values. If there's a concern with these values being too different, we updated the Raptor test runner in bug 1518479. It falsely improved most if not all of the Raptor baselines.

Ionuț, can you take a closer look and let me know if you think there is more that needs to be done here?

I compared the old regression with the changes pushed in comment 21. I'd say it got fixed by 50%.

Also, can you check if someone is investigating why this test is bimodal? That seems like something that smells of either the test framework doing something odd or our implementation meaning we vary quite wildly in terms of how quickly we load this page, or something...

I'm not sure if someone is actively investigating this issue. :rwood are you aware of any research on this matter?

Flags: needinfo?(igoldan) → needinfo?(rwood)

:Gijs (he/him)

Assignee

Comment 26

•

6 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #25)

(In reply to :Gijs (he/him) from comment #24)

(In reply to Florin Strugariu [:Bebe] from comment #23)

9% raptor-tp6-yandex-firefox linux64 pgo 159.24 -> 145.66

I gotta admit, this is pretty confusing:

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #0)

13% raptor-tp6-yandex-firefox linux64 pgo 171.94 -> 194.14

These numbers aren't doing anything even close to matching up (neither before nor after, so it's not just that the patch is an improvement over the previous state or doesn't address things completely or something like that - it just doesn't match at all). I assume this is because the test is bimodal.

I confirm these tests are noisy, but not bimodal. They don't have the tendency of alternating min/max values.

Hm, I must have been confused with either another subtest or the values for the 2017 ("ux") reference device. I'll try and find what I was looking at later...

If there's a concern with these values being too different, we updated the Raptor test runner in bug 1518479. It falsely improved most if not all of the Raptor baselines.

Ah! OK, but we can still see the subtest values which let us compare apples to apples (instead of an average that got lowered because we added another test).

Ionuț, can you take a closer look and let me know if you think there is more that needs to be done here?

I compared the old regression with the changes pushed in comment 21. I'd say it got fixed by 50%.

Thanks for looking. Can you elaborate? Looking at the subtests, i.e.:

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=a43422e9e4da8252bf1b48eb05206d0520b06f39&newProject=mozilla-inbound&newRevision=8686beefb44f25a78bb91bc6d5a9e5582f9ffa9a&originalSignature=1825430&newSignature=1825430&framework=10

(regression for https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8686beefb44f25a78bb91bc6d5a9e5582f9ffa9a, which was the original regressing bug)

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=57dc8bbbc38f79ce532feb8947b40311e96ea577&newProject=autoland&newRevision=c31d2c5695b185f5dab74513f911388ddcb23a46&originalSignature=1825130&newSignature=1825130&framework=10

( improvement for https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=c31d2c5695b185f5dab74513f911388ddcb23a46 , ie this bug)

The regressions are:

dcf: 169.00 < 198.00
fcp: No results No results
fnbpaint: 72.00 > 70.50
loadtime: 304.00 < 357.50
ttfi: 250.00 < 273.00

and the improvements:

dcf: 202.00 > 170.00
fcp: 72.00 = 72.00
fnbpaint: 72.00 = 72.00
loadtime: 348.50 > 300.50
ttfi: 274.00 > 250.00

This actually seems like it's basically fixed? The 169 vs. 170 (1ms regression) and 300.5 vs 304.00 (3.5ms improvement) seem like noise... We don't really know what, if anything, happened to fcp, but given fcp and fnbpaint seem to be returning the same numbers (so presumably are just providing time to the same paint event), I think that that number staying the same is fine.

Is the result worse on other platforms? Where should I be looking for what still needs doing here (probably in a follow-up, I guess)?

Flags: needinfo?(igoldan)

Ionuț Goldan [:igoldan]

Reporter

Comment 27

•

6 years ago

I looked over all regressions and visually noticed the fix didn't have the same magnitude as the regression. So I compared before regression - after regression difference with before fix - after fix, to be able to give an approximate fix percentage.

I ran a big bunch of retriggers/backfills for the before/after regression & before/after fix changesets, to be able to give a more precise answer.

Florin Strugariu [:Bebe]

Comment 28

•

6 years ago

This fix created a regression and a lot of noise on raptor:

== Change summary for alert #18882 (as of Fri, 18 Jan 2019 22:35:13 GMT) ==

Regressions:

17% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp linux64-qr opt 71.38 -> 83.66
16% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp linux64 opt 72.75 -> 84.33
11% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp windows10-64 opt 72.19 -> 79.83
9% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp windows7-32 opt 71.77 -> 78.29
8% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-fcp windows10-64-qr opt 71.73 -> 77.25

Improvements:

17% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime linux64-qr opt 390.91 -> 324.78
13% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime linux64 opt 375.25 -> 326.79
10% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime windows7-32 opt 376.21 -> 337.33
10% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime windows7-32 pgo 348.08 -> 314.62
9% raptor-tp6-yandex-firefox raptor-tp6-yandex-firefox-loadtime windows10-64 opt 376.25 -> 342.08
7% raptor-tp6-yandex-firefox linux64-qr opt 171.04 -> 158.36

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=18882

Robert Wood [:rwood]

Comment 29

•

6 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #25)

Also, can you check if someone is investigating why this test is bimodal? That seems like something that smells of either the test framework doing something odd or our implementation meaning we vary quite wildly in terms of how quickly we load this page, or something...

I'm not sure if someone is actively investigating this issue. :rwood are you aware of any research on this matter?

No, but perhaps acreskey has come across this?

Flags: needinfo?(rwood) → needinfo?(acreskey)

:Gijs (he/him)

Assignee

Comment 30

•

6 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #27)

I looked over all regressions and visually noticed the fix didn't have the same magnitude as the regression. So I compared before regression - after regression difference with before fix - after fix, to be able to give an approximate fix percentage.

Did you do this for subtests or for the combined value for raptor? Because of the change in bug 1518479, a difference on the overall test is expected, but isn't a sign this didn't work, just that the added value means any change to the other values shows up as reduced magnitude of change on the overall test value. This is (one of) the problem(s) with representing 4 or 5 different perf numbers from a test by just taking the average, especially when we then change what set of numbers we take the average of...

(In reply to Florin Strugariu [:Bebe] from comment #28)

This fix created a regression and a lot of noise on raptor:

So the noise and regression is in fcp. I expect (but we can't know, unless we can run the test retroactively on old pushes?) that this fix reintroduced noise, so it's not really a real regression.

See: https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=autoland,1837147,1,10&series=mozilla-inbound,1837450,1,10&series=mozilla-central,1822566,1,10

The first contentful paint value looks to be tracking mostly the same pattern as the first non blank paint value (which might be expected? Who knows about this?) once introduced. I already pointed this out in comment 26: both give frame timestamps and presumably both are usually being hit on the same frame (at least for this test).

This regression bug was filed on Jan 10 as a result of bug 1515863 landing. You can clearly see the same noise and higher values in the graph prior to jan 10. In other words, this noise was there before, and then it became more stable as a result of the change on jan 10 (which also regressed a bunch of other raptor metrics, which were addressed in this bug).

In any case, this fix has landed and backing it out at this point doesn't seem a pleasant option, as the original regression this was meant to address looks to be fixed, and the original security fix + the change in this bug is about to move to beta (merge day). A trade-off of <=12ms on fcp compared to 20-65ms on load time also seems net positive to me. In other words, I don't think reopening and backing this change out is a good option.

I'd still like to at least understand what is going on with the (imo comparatively minor) fcp changes. Please can you file a separate bug for this, and link to profiles like comment #0 in this bug, to make it easier to investigate what's going on here?

I also wish raptor put markers into the profile for the metrics it collected, but that's fodder for another bug, I guess...

Flags: needinfo?(fstrugariu)

Ionuț Goldan [:igoldan]

Reporter

Comment 31

•

6 years ago

(In reply to :Gijs (limited availability until Tue 29; he/him) from comment #30)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #27)

I looked over all regressions and visually noticed the fix didn't have the same magnitude as the regression. So I compared before regression - after regression difference with before fix - after fix, to be able to give an approximate fix percentage.

Did you do this for subtests or for the combined value for raptor? Because of the change in bug 1518479, a difference on the overall test is expected, but isn't a sign this didn't work, just that the added value means any change to the other values shows up as reduced magnitude of change on the overall test value. This is (one of) the problem(s) with representing 4 or 5 different perf numbers from a test by just taking the average, especially when we then change what set of numbers we take the average of...

I did this check for combined values for raptor. I guess we can close this bug then.

Flags: needinfo?(igoldan)

Andrew Creskey [:acreskey]

Comment 32

•

6 years ago

I'm not familiar with the yandex pageload test yet but I'll profile it and see how it behaves.
I think in general we're going to see a lot of the big shifts drop out now that bug 1270059 is merged.

Ionuț Goldan [:igoldan]

Reporter

Updated

•

6 years ago

Depends on: 1523314

Andrew Creskey [:acreskey]

Updated

•

6 years ago

Flags: needinfo?(acreskey)

Florin Strugariu [:Bebe]

Comment 33

•

6 years ago

(In reply to :Gijs (he/him) from comment #30)

I'd still like to at least understand what is going on with the (imo comparatively minor) fcp changes. Please can you file a separate bug for this, and link to profiles like comment #0 in this bug, to make it easier to investigate what's going on here?

Created bug 1523314

Flags: needinfo?(fstrugariu)

Ionuț Goldan [:igoldan]

Reporter

Updated

•

6 years ago

Status: RESOLVED → VERIFIED