Closed Bug 1125998 Opened 9 years ago Closed 9 years ago

10.10 debug tests which involve content take 1.5 - 3x longer than on 10.8

Categories

(Testing :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Unassigned)

References

Details

Attachments

(2 files)

For example:
10.8 debug crashtests 
Start	Fri Jan 23 15:44:27 2015
End	Fri Jan 23 16:01:19 2015
Elapsed	16 mins, 52 secs

16:01:13     INFO -  REFTEST FINISHED: Slowest test took 54517ms (file:///builds/slave/talos-slave/test/build/tests/reftest/tests/image/test/crashtests/694165-1.xhtml)
16:01:13     INFO -  REFTEST INFO | Result summary:
16:01:13     INFO -  REFTEST INFO | Successful: 2759 (3 pass, 2756 load only)
16:01:13     INFO -  REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
16:01:13     INFO -  REFTEST INFO | Known problems: 55 (0 known fail, 38 known asserts, 0 random, 17 skipped, 0 slow)
16:01:13     INFO -  REFTEST INFO | Total canvas count = 2

10.10 debug crashtests
Start	Mon Jan 26 08:30:41 2015
End	Mon Jan 26 09:21:34 2015
Elapsed	50 mins, 53 secs
09:21:29     INFO -  REFTEST FINISHED: Slowest test took 251221ms (file:///builds/slave/talos-slave/test/build/tests/reftest/tests/layout/base/crashtests/500467-1.html)
09:21:29     INFO -  REFTEST INFO | Result summary:
09:21:29     INFO -  REFTEST INFO | Successful: 2760 (3 pass, 2757 load only)
09:21:29     INFO -  REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
09:21:29     INFO -  REFTEST INFO | Known problems: 56 (0 known fail, 36 known asserts, 0 random, 20 skipped, 0 slow)
09:21:29     INFO -  REFTEST INFO | Total canvas count = 2
09:21:29     INFO -  REFTEST TEST-START | Shutdown


They are running on the same rev of hardware, but different OS obviously.  

Will run these tests on the same try rev for a better comparison.
Here is a try run that is in progress

https://treeherder.mozilla.org/#/jobs?repo=try&revision=718dbb2d1403

Here is an earlier one but doesn't include 10.8 for comparison
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4993c517e85e

:jmaher: do you have any ideas on who would be able to help debugging this issue?
Flags: needinfo?(jmaher)
I took at look at this, I think we are not taking twice as long, I parsed the logs for 10.8 and 10.10 and we are maybe a few seconds slower for each minute of runtime.  The problem is we crash early on in 10.10.  

The setup time is pretty much the same as well as the application launch time.  My take is to:
1) remove the crashing tests (right now we have: dom/media/test/crashtests/691096-1.html)
2) repeat until green

then revisit it all!
Flags: needinfo?(jmaher)
I have a green run now by disabling the following tests

dom/media/test/crashtests/crashtests.list
691096-1.html
863929.html

layout/base/crashtests/crashtests.list
500467-1.html

layout/generic/crashtests/crashtests.list
385526.html
724978.xhtml
730559.html 

https://hg.mozilla.org/try/rev/720efb703527

The tests still take about 46 minutes to complete however.
I looked at this, and there is a ~30 minute difference.

Looking over the raw load times, I find that we have 1754 seconds of time delta between 10.10 and 10.8 when comparing the timestamps for test-start/test-end.  This is over 1295 files, so about half of them are different (keep in mind some 10.8 ones take longer than 10.10, but it is far and few between)

How can we make this actionable.  We could start with the worse offenders, 62 tests account for 855 seconds of difference.

I suspect if we found a few causes for those differences it might speed things up overall.
this is a text file of the format:
testname, diff

where:
diff = 10.10[testname](test-end - test-start) - 10.8[testname](test-end - test-start)

there are a few of these where the value is -xx.0, this indicates that 10.8 is slower for those tests than 10.10.

As a caveat this is a single data point of test files, I didn't analyze >1 log file.
Crashtests are the obvious suite, but they aren't the problem. xpcshell and jsrefest, which don't render any content, take the same time on 10.10 and 10.8 debug. Crashtest, which renders a lot of painful content, took almost three times as long in my last paired-up run. Mochitests, also about content, run from 1.3 to 1.6 times as long, depending on how painful the chunk is to run. devtools and mochitest-chrome and browser-chrome are more like 1.1 or 1.2 (and marionette is a hilarious joke at 1:1).
Summary: 10.10 debug crashtests take 1.5 - 2x longer than on 10.8 → 10.10 debug tests which involve content take 1.5 - 3x longer than on 10.8
:dbaron: jmaher suggested that you might be able to suggest someone who might be able to help with this issue from the developer side of things - any suggestions?
Flags: needinfo?(dbaron)
Blocks: 1123195
I defer to mstange, who has more relevant knowledge on Mac reftests than I do, and also hopefully isn't (like me) going to be in standards meetings for the next 5 days.

It might be worth just trying to profile, although there could be good clues based on which tests are slower (or is everything uniformly slower?).
Flags: needinfo?(dbaron) → needinfo?(mstange)
:dbaron thanks

:mstange: There's a list of the tests that are taking longer in comment 5 that jmaher provided
jmaher: I ran the talos tests like you mentioned

try: -b o -p macosx64 -u none -t all[10.8,10.10] mozharness: --spsProfile

And here are the results

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e9ef7777c3e2

Not sure how to analyze them, if you have guidance it would be appreciated
Flags: needinfo?(jmaher)
oh this is cool!  We have a noticeable different in talos results/numbers between 10.10 and 10.8!

I looked at tsvgx (in the 's') job and loaded up the cleopatra interface:
10.10: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/e9c26f0afb53bfabd92f56d11a6d0d14408f0abb60445f55fb2bfe27da61295d0dea60576ede78f055e81053e99fd037eb7e400e6b4517e4c627d6395c8c7ad8&pathInZip=profile_tsvgx/composite-scale-opacity.svg/cycle_0.sps#uploadProfileFirst!

10.8: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/9010fd8ff40983473c1f777873b29172669fe778d12db7d55259eac6c8901414bd6ede8b6816a4bc33bb19b3c115397b3669514eb7d45a6597c0bdb3afcbc25e&pathInZip=profile_tsvgx/composite-scale-opacity.svg/cycle_0.sps#uploadProfileFirst!

I really don't know how to interpret this stuff, but I do see this:
* in 10.10 the calls we spent the most time in, we have RIPLayerBITImage in the 5th spot
* in 10.8 the calls we spent the most time in, we have ripl_BITImage in the 5th spot

possibly with this data mstange can help determine a starting point for us to look at!
Flags: needinfo?(jmaher)
Thanks Joel!

mstange: In what timeframe do you think you'd be able to look at this?  Just trying to figure out how we can proceed with our 10.10 deployment.
I'll have a closer look tomorrow.

(In reply to Joel Maher (:jmaher) from comment #11)
> oh this is cool!  We have a noticeable different in talos results/numbers
> between 10.10 and 10.8!

The difference is much clearer than I expected, even in the profiles! Interesting.

> http://people.mozilla.org/~bgirard/cleopatra/ .../cycle_0.sps#uploadProfileFirst!

(For anybody who's trying to open these profiles, due to a bug in cleopatra you need to remove the "#uploadProfileFirst!" part from the URL, otherwise the profile won't load.)

> I really don't know how to interpret this stuff, but I do see this:
> * in 10.10 the calls we spent the most time in, we have RIPLayerBITImage in
> the 5th spot
> * in 10.8 the calls we spent the most time in, we have ripl_BITImage in the
> 5th spot

This is in CoreGraphics, which is the system-supplied graphics library we use for rendering. It looks like the ripl_BITImage function has just been renamed between 10.8 and 10.10.

But what's interesting is that we spend almost double the time in argb32_image_mark_argb32. This is a CoreGraphics function that is responsible for drawing scaled or rotated images. The time spent inside it depends on the number of pixels we're trying to fill.
Since we're drawing the same images the same number of times on 10.8 and 10.10 (I hope!), I don't really understand what could cause the difference in runtime here. Either CoreGraphics itself has just become slower, or we're calling it more frequently, or we're trying to fill a larger number of pixels.
I was going to go with the "more pixels" hypothesis, but I checked the logs, and the screen resolution and browser sizes are the same in the 10.8 and the 10.10 run: The screen resolution is 1600x1200 and the inner browser size is 1024x689, except for one TART run, during which the inner browser size is 1024x768
. (Search for "Display 0:" and "Browser inner width/height")
That throws the "more pixels" hypothesis out of the window.
Flags: needinfo?(mstange)
mstange: Did your investigation last week reveal any causes for the slow debug tests?
Flags: needinfo?(mstange)
Not yet, still looking.
Flags: needinfo?(mstange)
mstange: Has your testing revealed why the debug tests are taking longer?
Flags: needinfo?(mstange)
Sorry it has taken me so long to really dive into this, Kim.

Here's a patch I'm going to push to try. With the resulting profiles that this patch will cause us to upload, hopefully I'll get a little further in analyzing the problem.
Flags: needinfo?(mstange)
thanks for looking into this!  Looking forward to the try push results
Thanks mstange! (As an aside, we updated the slaves to 10.10.2 yesterday)
Alas, you won't get the talos you want, since you need a second [10.8,10.10] for the -t. Can probably force it with https://wiki.mozilla.org/ReleaseEngineering/How_To/Trigger_arbitrary_jobs though, if the bustage in that has been fixed.
Oh right, I had forgotten about that. I'll try the trigger script once the opt build has finished, thanks.
Unfortunately the blob upload of the dumped profiles failed, because the extension I chose for the files (.sps) is not in the whitelist at https://github.com/mozilla/build-blobber/blob/master/blobber/config.py#L12-L24 ...
I'll push again with .sps.json .

But at least it appears that dumping the profiles themselves succeeded, on 10.10 anyway.
This is taking a few more tries. I'll update this bug once I've had some success.
I've been successful in getting profiles, but it's unclear what conclusions we can draw from them.

10.8: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://tests.themasta.com/hostedprofiles/profiles-10.8.zip
10.10: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://tests.themasta.com/hostedprofiles/profiles-10.10.zip

The main thing these profiles show is that "bloat logging", which we use for leak detection, is responsible for a lot of time spent during debug crashtests in general. Both on 10.8 and on 10.10, the main thread is blocking on the global bloat table lock most of the time, instead of doing something useful. But notably, the time we spend blocking on the lock on 10.10 is a little larger than on 10.8.

If the lock is really responsible for the slowdown on 10.10, this raises a few questions: Is the overhead of grabbing a lock on 10.10 just higher? (unlikely.) Is there more contention of the lock on 10.10, i.e. are other threads doing more stuff, or taking longer to do their thing while the lock is being held?
We'd like to know what other threads are doing, but profiling them is a little tricky with our built-in profiler.

We've filed bug 1137397 in order to investigate getting rid of that global lock for the bloat logging case. But the prospects of the idea in that bug aren't that great and we shouldn't block deploying 10.10 on the resolution of it.

Here's a try push with bloat logging forcefully turned off: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8eaf3902355b
Crashtests took 8 minutes on 10.8 and 9 minutes on 10.10, versus 19min and 47min with bloat logging on.
Well would you look at that. Different approach, a very simple patch, and it gets us down to 11min on 10.10 with bloat log enabled:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=12085f7958b0

I've filed bug 1137963 with the patch. I'm now going to fire of a full set of debug builds on all platforms with and without the patch so that we can see whether it's just Mac that profits from it, or whether we should take it for all platforms.
Blocks: 1137963
No longer blocks: 1137963
Depends on: 137963
Depends on: 1137963
No longer depends on: 137963
Depends on: 1138616
The work in bug 1138616 addressed this.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: