Closed Bug 1434323 Opened 5 years ago Closed 4 years ago

Performance analysis for uninitialized class members checker.

Categories

(Developer Infrastructure :: Source Code Analysis, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: andi, Assigned: andi)

References

Details

(Keywords: sec-audit)

In order to land the uninitialised member checker (https://bugzilla.mozilla.org/show_bug.cgi?id=525063), we first need to patch Gecko wherever member variables need to be initialised.
By doing performance analysis, Julian determined the following:

A potential performance hit of 4.91% was encountered on "tresize opt e10s / linux64-qr".

The complete results can be found here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c9a6c3630d3e&newProject=try&newRevision=ed22bf9677a1bdc84f3fc8118e1a81d838c540ec&framework=1&showOnlyImportant=1

Trying to reproduce locally with the following configurations, the results break down as follows:

1. on an Intel Haswell, Intel integrated graphics: circa 1% slowdown
2. on VirtualBox 5.2.6, with Settings->Display->Screen->Enable*Acceleration
    disabled: circa 10% slowdown

Concentrating on (2), about:support shows:

  WebGL 1 Driver Renderer   VMware, Inc. -- llvmpipe (LLVM 5.0, 256 bits)

In this configuration [VirtualBox 5.2.6], which shows a 10% wallclock slowdown, I cannot measure any comparable increase in instruction count, memory access count, or simulated cache miss count using Callgrind or Cachegrind.  The increases are only around 1%.

Julian and I hypothesised that the use of LLVMpipe (Mesa's software renderer) might be causing the slowdown.

Running tests on a physical machine with LIBGL_ALWAYS_SOFTWARE=true forces the use of LLVMpipe. With this setup, a median slowdown of at most 0.5% was observed.

Julian, if you feel that I've missed something from our discussion, please write here.
As per my last comment.
Flags: needinfo?(jseward)
Group: core-security → dom-core-security
For a bit more detail on the current regression analysis status, see
bug 525063 comment 212 and bug 525063 comment 213.

I guess my question at this point is: should we continue to chase these
regressions?  If so, how?  Or are they unimportant?
Flags: needinfo?(jseward)
Nathan could you please advise on this matter?
Flags: needinfo?(nfroyd)
What sort of machine do our talos tests run on?  Don't they run on some sort of physical hardware?  We should be able to either a) reproduce that setup locally or b) investigate on an identical loaner machine, right?
Flags: needinfo?(nfroyd) → needinfo?(jmaher)
for that try push we use:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

keep in mind that for linux we now run new hardware with intel video cards, whereas the old hardware had nvidia.  Soon windows will be switching to this new hardware as well.
Flags: needinfo?(jmaher)
I wonder if this can be explained by the old CPU in use: Intel X3450.
If Julian is seeing virtually no differences with Cachegrind, it seems unlikely that the CPU is to blame.

Are the patches fine-grained enough that we could try to bisect the patch series to see if a particular patch triggers bad talos behavior?
Flags: needinfo?(jseward)
Nathan: we could quite easily split them using a few lines of bash. I've already done it in the past.
Yeah, let's try that!
So at this point I wanted to test the reliability of the tests and of the appearance of performance issues (Julian initially noticed that the perf hits were not always on the same tests, but I wanted to take a slightly more formal approach).

The results are very confusing.

Here you can see 3 pairs of try jobs, each one comparing the same revision with and without the patches applied. We're talking about applying the exact same patches 3 times and comparing against the exact same revision:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=53ca65fc839d51d0a485c6809a20ffa4ae53568a&newProject=try&newRevision=982d6dc10b69fc1840a4b865cc3b3960a7dfc211&framework=1&showOnlyImportant=1

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=bac85d68d0c142b3519a5bcbb248b8e7ebf1efd8&newProject=try&newRevision=b829db5255b0436d2ba4602294ee73fca1aa48fb&framework=1&showOnlyImportant=1

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9706f41bff61899a15b774bbdd0cf11b9d342b2e&newProject=try&newRevision=d6f4af868dba3d3d66955ee75b083cdb18cc471c&framework=1&showOnlyImportant=1

As you can see, the performance hits are never on the same tests and platforms. I used 3 different baseline try jobs so as to have some temporal locality, in case that had an impact.

If you take the 3 patched try jobs and compare them against a single baseline version, the results are equally random:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=53ca65fc839d&newProject=try&newRevision=b829db5255b0&framework=1&showOnlyImportant=1

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=53ca65fc839d51d0a485c6809a20ffa4ae53568a&newProject=try&newRevision=982d6dc10b69fc1840a4b865cc3b3960a7dfc211&framework=1&showOnlyImportant=1

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=53ca65fc839d51d0a485c6809a20ffa4ae53568a&newProject=try&newRevision=d6f4af868dba&framework=1&showOnlyImportant=1


In any case, we consistently see performance hits, but never on the same tests. This might mean that the hits depend on the current CPU/memory/HDD context; I don't think it's due to the tests themselves being too random (the statistical probability of that is very, very low).

I'm going to keep going on the binary search, maybe I can find a common source for all these problems.
Joel, could you please help us understand what is going on here? Thanks
Flags: needinfo?(jmaher)
for many of the tests you can see that there is a range of historical values (noise), and often one build vs. another yields data in a subset of that range.  This shows up well on the try pushes, where one build shows low values and another shows high values, and we then flag it as important.

I say build, but this could just be a random machine or two, other things going on at that time of day, etc.  For example, patterns on many tests shift on weekends vs. busy weekday times.

I did a bunch of retriggers on the 6 pushes (the 3 sets of try jobs), but the jobs are pending; I suspect they will run later in the day (i.e. not soon).

This is one of the reasons we retrigger and look at historical data to make sure alerts make sense before filing bugs.  I suspect if the code here landed there would be a couple alerts and we would file a bug for reference and recommend marking it invalid based on the data.
Flags: needinfo?(jmaher)
Product: Core → Firefox Build System
David, would you have a clue here? For now, we are clueless :/
Thanks
Flags: needinfo?(dmajor)
There are already some suggestions here, and I don't see any followup on them:

(In reply to Tristan Bourvon from comment #10)
> I'm going to keep going on the binary search, maybe I can find a common
> source for all these problems.

Did this happen?

(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #12)
> This is one of the reasons we retrigger and look at historical data to make
> sure alerts make sense before filing bugs.  I suspect if the code here
> landed there would be a couple alerts and we would file a bug for reference
> and recommend marking it invalid based on the data.

Why not give Joel and team an exercise: produce a try push of your proposed changes and ask "would you file regressions against me for this?" -- maybe you'll find that there's nothing to worry about.
Flags: needinfo?(dmajor)
Thanks David!

Tristan, could you give an update about your results on the binary search?
Flags: needinfo?(tristanbourvon)
Well honestly, it's impossible to accurately perform a binary search on try...
For example, here are two jobs started almost simultaneously with no patches applied, on the same revision, and 10 (!) Talos re-triggers.

Here's what I get:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3367457b7cd5c03e7990bec633fe00e05a9de025&newProject=try&newRevision=3775f7854ae50a3e013391ddd4dec6783cd940a4&framework=1&showOnlyImportant=1

I thought I was making progress on the binary search despite the noise, but after seeing this it seems useless to attempt any precise comparison, even over many data points.

Julian and I were unable to replicate the performance regressions locally on 3 different OSes and 2 different machines.

I think that doing what David suggests would indeed be a good approach: we should just push the full patch to try, and ask the review team whether they can find a statistically solid reason to deny it.

Sylvestre, does that sound good to you? I will update the autofix to the latest m-c revision, apply all the patches, push a try job with many, many Talos retriggers, and hopefully we'll have enough data points for Joel and his team to make a decision.
Flags: needinfo?(tristanbourvon) → needinfo?(sledru)
Works for me. 
David, here is why binary search didn't give the expected results...
Flags: needinfo?(sledru) → needinfo?(dmajor)
Sounds good. FWIW, the results here look so close to noise that, if this bug were not hidden, I would say just land it, and accept the small risk of backout and/or fix any regressions in-place. But given that we generally like for this type of bug to stick the first time, I think it's worthwhile to get a pre-landing reassurance from the perf experts.
Flags: needinfo?(dmajor)
Here is the result of running 30 Talos retriggers on a recent revision with the patch applied and without it.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=7661a3f84c919f45bb36be7a220eecd4b95e6629&newProject=try&newRevision=1d7ce040f6d924f9e3ded4aa0a2a51faf11f7bb7&framework=1&showOnlyImportant=1

The results seem to indicate that we can land this patch, is this correct?

If that's the case, there have been suggestions as to how we should land this. It seems to me that landing one patch per module would give a good level of granularity, allowing easy reviews while still landing the fixes in bulk. How does that sound?
Yes, I think you can land the patches!  Thanks for looking into the performance aspect so closely.

I'll start reviewing the patches in the other bug.  In terms of landing in pieces vs in large chunks, I'll leave that up to you, please do what makes your life easier.
Flags: needinfo?(jseward)
I think we are good, closing!
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Group: core-security-release
Group: dom-core-security
Group: core-security-release
Product: Firefox Build System → Developer Infrastructure