Closed
Bug 1125998
Opened 10 years ago
Closed 10 years ago
10.10 debug tests which involve content take 1.5 - 3x longer than on 10.8
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kmoir, Unassigned)
References
Details
Attachments
(2 files)
2.63 KB,
text/plain
|
Details | |
7.77 KB,
patch
|
Details | Diff | Splinter Review |
For example:
10.8 debug crashtests
Start Fri Jan 23 15:44:27 2015
End Fri Jan 23 16:01:19 2015
Elapsed 16 mins, 52 secs
16:01:13 INFO - REFTEST FINISHED: Slowest test took 54517ms (file:///builds/slave/talos-slave/test/build/tests/reftest/tests/image/test/crashtests/694165-1.xhtml)
16:01:13 INFO - REFTEST INFO | Result summary:
16:01:13 INFO - REFTEST INFO | Successful: 2759 (3 pass, 2756 load only)
16:01:13 INFO - REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
16:01:13 INFO - REFTEST INFO | Known problems: 55 (0 known fail, 38 known asserts, 0 random, 17 skipped, 0 slow)
16:01:13 INFO - REFTEST INFO | Total canvas count = 2
10.10 debug crashtests
Start Mon Jan 26 08:30:41 2015
End Mon Jan 26 09:21:34 2015
Elapsed 50 mins, 53 secs
09:21:29 INFO - REFTEST FINISHED: Slowest test took 251221ms (file:///builds/slave/talos-slave/test/build/tests/reftest/tests/layout/base/crashtests/500467-1.html)
09:21:29 INFO - REFTEST INFO | Result summary:
09:21:29 INFO - REFTEST INFO | Successful: 2760 (3 pass, 2757 load only)
09:21:29 INFO - REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
09:21:29 INFO - REFTEST INFO | Known problems: 56 (0 known fail, 36 known asserts, 0 random, 20 skipped, 0 slow)
09:21:29 INFO - REFTEST INFO | Total canvas count = 2
09:21:29 INFO - REFTEST TEST-START | Shutdown
They are running on the same rev of hardware, but different OS obviously.
Will run these tests on the same try rev for a better comparison.
Reporter | ||
Comment 1•10 years ago
|
||
Here is a try run that is in progress
https://treeherder.mozilla.org/#/jobs?repo=try&revision=718dbb2d1403
Here is an earlier one but doesn't include 10.8 for comparison
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4993c517e85e
:jmaher: do you have any ideas on who would be able to help debugging this issue?
Flags: needinfo?(jmaher)
Comment 2•10 years ago
|
||
I took at look at this, I think we are not taking twice as long, I parsed the logs for 10.8 and 10.10 and we are maybe a few seconds slower for each minute of runtime. The problem is we crash early on in 10.10.
The setup time is pretty much the same as well as the application launch time. My take is to:
1) remove the crashing tests (right now we have: dom/media/test/crashtests/691096-1.html)
2) repeat until green
then revisit it all!
Flags: needinfo?(jmaher)
Reporter | ||
Comment 3•10 years ago
|
||
I have a green run now by disabling the following tests
dom/media/test/crashtests/crashtests.list
691096-1.html
863929.html
layout/base/crashtests/crashtests.list
500467-1.html
layout/generic/crashtests/crashtests.list
385526.html
724978.xhtml
730559.html
https://hg.mozilla.org/try/rev/720efb703527
The tests still take about 46 minutes to complete however.
Comment 4•10 years ago
|
||
I looked at this, and there is a ~30 minute difference.
Looking over the raw load times, I find that we have 1754 seconds of time delta between 10.10 and 10.8 when comparing the timestamps for test-start/test-end. This is over 1295 files, so about half of them are different (keep in mind some 10.8 ones take longer than 10.10, but it is far and few between)
How can we make this actionable. We could start with the worse offenders, 62 tests account for 855 seconds of difference.
I suspect if we found a few causes for those differences it might speed things up overall.
Comment 5•10 years ago
|
||
this is a text file of the format:
testname, diff
where:
diff = 10.10[testname](test-end - test-start) - 10.8[testname](test-end - test-start)
there are a few of these where the value is -xx.0, this indicates that 10.8 is slower for those tests than 10.10.
As a caveat this is a single data point of test files, I didn't analyze >1 log file.
Comment 6•10 years ago
|
||
Crashtests are the obvious suite, but they aren't the problem. xpcshell and jsrefest, which don't render any content, take the same time on 10.10 and 10.8 debug. Crashtest, which renders a lot of painful content, took almost three times as long in my last paired-up run. Mochitests, also about content, run from 1.3 to 1.6 times as long, depending on how painful the chunk is to run. devtools and mochitest-chrome and browser-chrome are more like 1.1 or 1.2 (and marionette is a hilarious joke at 1:1).
Summary: 10.10 debug crashtests take 1.5 - 2x longer than on 10.8 → 10.10 debug tests which involve content take 1.5 - 3x longer than on 10.8
Reporter | ||
Comment 7•10 years ago
|
||
:dbaron: jmaher suggested that you might be able to suggest someone who might be able to help with this issue from the developer side of things - any suggestions?
Flags: needinfo?(dbaron)
I defer to mstange, who has more relevant knowledge on Mac reftests than I do, and also hopefully isn't (like me) going to be in standards meetings for the next 5 days.
It might be worth just trying to profile, although there could be good clues based on which tests are slower (or is everything uniformly slower?).
Flags: needinfo?(dbaron) → needinfo?(mstange)
Reporter | ||
Comment 9•10 years ago
|
||
:dbaron thanks
:mstange: There's a list of the tests that are taking longer in comment 5 that jmaher provided
Reporter | ||
Comment 10•10 years ago
|
||
jmaher: I ran the talos tests like you mentioned
try: -b o -p macosx64 -u none -t all[10.8,10.10] mozharness: --spsProfile
And here are the results
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e9ef7777c3e2
Not sure how to analyze them, if you have guidance it would be appreciated
Flags: needinfo?(jmaher)
Comment 11•10 years ago
|
||
oh this is cool! We have a noticeable different in talos results/numbers between 10.10 and 10.8!
I looked at tsvgx (in the 's') job and loaded up the cleopatra interface:
10.10: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/e9c26f0afb53bfabd92f56d11a6d0d14408f0abb60445f55fb2bfe27da61295d0dea60576ede78f055e81053e99fd037eb7e400e6b4517e4c627d6395c8c7ad8&pathInZip=profile_tsvgx/composite-scale-opacity.svg/cycle_0.sps#uploadProfileFirst!
10.8: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/9010fd8ff40983473c1f777873b29172669fe778d12db7d55259eac6c8901414bd6ede8b6816a4bc33bb19b3c115397b3669514eb7d45a6597c0bdb3afcbc25e&pathInZip=profile_tsvgx/composite-scale-opacity.svg/cycle_0.sps#uploadProfileFirst!
I really don't know how to interpret this stuff, but I do see this:
* in 10.10 the calls we spent the most time in, we have RIPLayerBITImage in the 5th spot
* in 10.8 the calls we spent the most time in, we have ripl_BITImage in the 5th spot
possibly with this data mstange can help determine a starting point for us to look at!
Flags: needinfo?(jmaher)
Reporter | ||
Comment 12•10 years ago
|
||
Thanks Joel!
mstange: In what timeframe do you think you'd be able to look at this? Just trying to figure out how we can proceed with our 10.10 deployment.
Comment 13•10 years ago
|
||
I'll have a closer look tomorrow.
(In reply to Joel Maher (:jmaher) from comment #11)
> oh this is cool! We have a noticeable different in talos results/numbers
> between 10.10 and 10.8!
The difference is much clearer than I expected, even in the profiles! Interesting.
> http://people.mozilla.org/~bgirard/cleopatra/ .../cycle_0.sps#uploadProfileFirst!
(For anybody who's trying to open these profiles, due to a bug in cleopatra you need to remove the "#uploadProfileFirst!" part from the URL, otherwise the profile won't load.)
> I really don't know how to interpret this stuff, but I do see this:
> * in 10.10 the calls we spent the most time in, we have RIPLayerBITImage in
> the 5th spot
> * in 10.8 the calls we spent the most time in, we have ripl_BITImage in the
> 5th spot
This is in CoreGraphics, which is the system-supplied graphics library we use for rendering. It looks like the ripl_BITImage function has just been renamed between 10.8 and 10.10.
But what's interesting is that we spend almost double the time in argb32_image_mark_argb32. This is a CoreGraphics function that is responsible for drawing scaled or rotated images. The time spent inside it depends on the number of pixels we're trying to fill.
Since we're drawing the same images the same number of times on 10.8 and 10.10 (I hope!), I don't really understand what could cause the difference in runtime here. Either CoreGraphics itself has just become slower, or we're calling it more frequently, or we're trying to fill a larger number of pixels.
I was going to go with the "more pixels" hypothesis, but I checked the logs, and the screen resolution and browser sizes are the same in the 10.8 and the 10.10 run: The screen resolution is 1600x1200 and the inner browser size is 1024x689, except for one TART run, during which the inner browser size is 1024x768
. (Search for "Display 0:" and "Browser inner width/height")
That throws the "more pixels" hypothesis out of the window.
Flags: needinfo?(mstange)
Reporter | ||
Comment 14•10 years ago
|
||
mstange: Did your investigation last week reveal any causes for the slow debug tests?
Flags: needinfo?(mstange)
Reporter | ||
Comment 16•10 years ago
|
||
mstange: Has your testing revealed why the debug tests are taking longer?
Flags: needinfo?(mstange)
Comment 17•10 years ago
|
||
Sorry it has taken me so long to really dive into this, Kim.
Here's a patch I'm going to push to try. With the resulting profiles that this patch will cause us to upload, hopefully I'll get a little further in analyzing the problem.
Flags: needinfo?(mstange)
Comment 18•10 years ago
|
||
Comment 19•10 years ago
|
||
thanks for looking into this! Looking forward to the try push results
Reporter | ||
Comment 20•10 years ago
|
||
Thanks mstange! (As an aside, we updated the slaves to 10.10.2 yesterday)
Comment 21•10 years ago
|
||
Alas, you won't get the talos you want, since you need a second [10.8,10.10] for the -t. Can probably force it with https://wiki.mozilla.org/ReleaseEngineering/How_To/Trigger_arbitrary_jobs though, if the bustage in that has been fixed.
Comment 22•10 years ago
|
||
Oh right, I had forgotten about that. I'll try the trigger script once the opt build has finished, thanks.
Comment 23•10 years ago
|
||
Unfortunately the blob upload of the dumped profiles failed, because the extension I chose for the files (.sps) is not in the whitelist at https://github.com/mozilla/build-blobber/blob/master/blobber/config.py#L12-L24 ...
I'll push again with .sps.json .
But at least it appears that dumping the profiles themselves succeeded, on 10.10 anyway.
Comment 24•10 years ago
|
||
Comment 25•10 years ago
|
||
This is taking a few more tries. I'll update this bug once I've had some success.
Comment 26•10 years ago
|
||
I've been successful in getting profiles, but it's unclear what conclusions we can draw from them.
10.8: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://tests.themasta.com/hostedprofiles/profiles-10.8.zip
10.10: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://tests.themasta.com/hostedprofiles/profiles-10.10.zip
The main thing these profiles show is that "bloat logging", which we use for leak detection, is responsible for a lot of time spent during debug crashtests in general. Both on 10.8 and on 10.10, the main thread is blocking on the global bloat table lock most of the time, instead of doing something useful. But notably, the time we spend blocking on the lock on 10.10 is a little larger than on 10.8.
If the lock is really responsible for the slowdown on 10.10, this raises a few questions: Is the overhead of grabbing a lock on 10.10 just higher? (unlikely.) Is there more contention of the lock on 10.10, i.e. are other threads doing more stuff, or taking longer to do their thing while the lock is being held?
We'd like to know what other threads are doing, but profiling them is a little tricky with our built-in profiler.
We've filed bug 1137397 in order to investigate getting rid of that global lock for the bloat logging case. But the prospects of the idea in that bug aren't that great and we shouldn't block deploying 10.10 on the resolution of it.
Here's a try push with bloat logging forcefully turned off: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8eaf3902355b
Crashtests took 8 minutes on 10.8 and 9 minutes on 10.10, versus 19min and 47min with bloat logging on.
Comment 27•10 years ago
|
||
Well would you look at that. Different approach, a very simple patch, and it gets us down to 11min on 10.10 with bloat log enabled:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=12085f7958b0
I've filed bug 1137963 with the patch. I'm now going to fire of a full set of debug builds on all platforms with and without the patch so that we can see whether it's just Mac that profits from it, or whether we should take it for all platforms.
Reporter | ||
Updated•10 years ago
|
Reporter | ||
Updated•10 years ago
|
Reporter | ||
Comment 28•10 years ago
|
||
The work in bug 1138616 addressed this.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•