For example: 10.8 debug crashtests Start Fri Jan 23 15:44:27 2015 End Fri Jan 23 16:01:19 2015 Elapsed 16 mins, 52 secs 16:01:13 INFO - REFTEST FINISHED: Slowest test took 54517ms (file:///builds/slave/talos-slave/test/build/tests/reftest/tests/image/test/crashtests/694165-1.xhtml) 16:01:13 INFO - REFTEST INFO | Result summary: 16:01:13 INFO - REFTEST INFO | Successful: 2759 (3 pass, 2756 load only) 16:01:13 INFO - REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception) 16:01:13 INFO - REFTEST INFO | Known problems: 55 (0 known fail, 38 known asserts, 0 random, 17 skipped, 0 slow) 16:01:13 INFO - REFTEST INFO | Total canvas count = 2 10.10 debug crashtests Start Mon Jan 26 08:30:41 2015 End Mon Jan 26 09:21:34 2015 Elapsed 50 mins, 53 secs 09:21:29 INFO - REFTEST FINISHED: Slowest test took 251221ms (file:///builds/slave/talos-slave/test/build/tests/reftest/tests/layout/base/crashtests/500467-1.html) 09:21:29 INFO - REFTEST INFO | Result summary: 09:21:29 INFO - REFTEST INFO | Successful: 2760 (3 pass, 2757 load only) 09:21:29 INFO - REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception) 09:21:29 INFO - REFTEST INFO | Known problems: 56 (0 known fail, 36 known asserts, 0 random, 20 skipped, 0 slow) 09:21:29 INFO - REFTEST INFO | Total canvas count = 2 09:21:29 INFO - REFTEST TEST-START | Shutdown They are running on the same rev of hardware, but different OS obviously. Will run these tests on the same try rev for a better comparison.
Here is a try run that is in progress https://treeherder.mozilla.org/#/jobs?repo=try&revision=718dbb2d1403 Here is an earlier one but doesn't include 10.8 for comparison https://treeherder.mozilla.org/#/jobs?repo=try&revision=4993c517e85e :jmaher: do you have any ideas on who would be able to help debugging this issue?
I took at look at this, I think we are not taking twice as long, I parsed the logs for 10.8 and 10.10 and we are maybe a few seconds slower for each minute of runtime. The problem is we crash early on in 10.10. The setup time is pretty much the same as well as the application launch time. My take is to: 1) remove the crashing tests (right now we have: dom/media/test/crashtests/691096-1.html) 2) repeat until green then revisit it all!
I have a green run now by disabling the following tests dom/media/test/crashtests/crashtests.list 691096-1.html 863929.html layout/base/crashtests/crashtests.list 500467-1.html layout/generic/crashtests/crashtests.list 385526.html 724978.xhtml 730559.html https://hg.mozilla.org/try/rev/720efb703527 The tests still take about 46 minutes to complete however.
I looked at this, and there is a ~30 minute difference. Looking over the raw load times, I find that we have 1754 seconds of time delta between 10.10 and 10.8 when comparing the timestamps for test-start/test-end. This is over 1295 files, so about half of them are different (keep in mind some 10.8 ones take longer than 10.10, but it is far and few between) How can we make this actionable. We could start with the worse offenders, 62 tests account for 855 seconds of difference. I suspect if we found a few causes for those differences it might speed things up overall.
Created attachment 8557942 [details] tests with an abs(diff) > 2.0 - 62 in total this is a text file of the format: testname, diff where: diff = 10.10[testname](test-end - test-start) - 10.8[testname](test-end - test-start) there are a few of these where the value is -xx.0, this indicates that 10.8 is slower for those tests than 10.10. As a caveat this is a single data point of test files, I didn't analyze >1 log file.
Crashtests are the obvious suite, but they aren't the problem. xpcshell and jsrefest, which don't render any content, take the same time on 10.10 and 10.8 debug. Crashtest, which renders a lot of painful content, took almost three times as long in my last paired-up run. Mochitests, also about content, run from 1.3 to 1.6 times as long, depending on how painful the chunk is to run. devtools and mochitest-chrome and browser-chrome are more like 1.1 or 1.2 (and marionette is a hilarious joke at 1:1).
:dbaron: jmaher suggested that you might be able to suggest someone who might be able to help with this issue from the developer side of things - any suggestions?
I defer to mstange, who has more relevant knowledge on Mac reftests than I do, and also hopefully isn't (like me) going to be in standards meetings for the next 5 days. It might be worth just trying to profile, although there could be good clues based on which tests are slower (or is everything uniformly slower?).
:dbaron thanks :mstange: There's a list of the tests that are taking longer in comment 5 that jmaher provided
jmaher: I ran the talos tests like you mentioned try: -b o -p macosx64 -u none -t all[10.8,10.10] mozharness: --spsProfile And here are the results https://treeherder.mozilla.org/#/jobs?repo=try&revision=e9ef7777c3e2 Not sure how to analyze them, if you have guidance it would be appreciated
oh this is cool! We have a noticeable different in talos results/numbers between 10.10 and 10.8! I looked at tsvgx (in the 's') job and loaded up the cleopatra interface: 10.10: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/e9c26f0afb53bfabd92f56d11a6d0d14408f0abb60445f55fb2bfe27da61295d0dea60576ede78f055e81053e99fd037eb7e400e6b4517e4c627d6395c8c7ad8&pathInZip=profile_tsvgx/composite-scale-opacity.svg/cycle_0.sps#uploadProfileFirst! 10.8: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/9010fd8ff40983473c1f777873b29172669fe778d12db7d55259eac6c8901414bd6ede8b6816a4bc33bb19b3c115397b3669514eb7d45a6597c0bdb3afcbc25e&pathInZip=profile_tsvgx/composite-scale-opacity.svg/cycle_0.sps#uploadProfileFirst! I really don't know how to interpret this stuff, but I do see this: * in 10.10 the calls we spent the most time in, we have RIPLayerBITImage in the 5th spot * in 10.8 the calls we spent the most time in, we have ripl_BITImage in the 5th spot possibly with this data mstange can help determine a starting point for us to look at!
Thanks Joel! mstange: In what timeframe do you think you'd be able to look at this? Just trying to figure out how we can proceed with our 10.10 deployment.
I'll have a closer look tomorrow. (In reply to Joel Maher (:jmaher) from comment #11) > oh this is cool! We have a noticeable different in talos results/numbers > between 10.10 and 10.8! The difference is much clearer than I expected, even in the profiles! Interesting. > http://people.mozilla.org/~bgirard/cleopatra/ .../cycle_0.sps#uploadProfileFirst! (For anybody who's trying to open these profiles, due to a bug in cleopatra you need to remove the "#uploadProfileFirst!" part from the URL, otherwise the profile won't load.) > I really don't know how to interpret this stuff, but I do see this: > * in 10.10 the calls we spent the most time in, we have RIPLayerBITImage in > the 5th spot > * in 10.8 the calls we spent the most time in, we have ripl_BITImage in the > 5th spot This is in CoreGraphics, which is the system-supplied graphics library we use for rendering. It looks like the ripl_BITImage function has just been renamed between 10.8 and 10.10. But what's interesting is that we spend almost double the time in argb32_image_mark_argb32. This is a CoreGraphics function that is responsible for drawing scaled or rotated images. The time spent inside it depends on the number of pixels we're trying to fill. Since we're drawing the same images the same number of times on 10.8 and 10.10 (I hope!), I don't really understand what could cause the difference in runtime here. Either CoreGraphics itself has just become slower, or we're calling it more frequently, or we're trying to fill a larger number of pixels. I was going to go with the "more pixels" hypothesis, but I checked the logs, and the screen resolution and browser sizes are the same in the 10.8 and the 10.10 run: The screen resolution is 1600x1200 and the inner browser size is 1024x689, except for one TART run, during which the inner browser size is 1024x768 . (Search for "Display 0:" and "Browser inner width/height") That throws the "more pixels" hypothesis out of the window.
mstange: Did your investigation last week reveal any causes for the slow debug tests?
Not yet, still looking.
mstange: Has your testing revealed why the debug tests are taking longer?
Created attachment 8568788 [details] [diff] [review] profiling for crashtests Sorry it has taken me so long to really dive into this, Kim. Here's a patch I'm going to push to try. With the resulting profiles that this patch will cause us to upload, hopefully I'll get a little further in analyzing the problem.
thanks for looking into this! Looking forward to the try push results
Thanks mstange! (As an aside, we updated the slaves to 10.10.2 yesterday)
Alas, you won't get the talos you want, since you need a second [10.8,10.10] for the -t. Can probably force it with https://wiki.mozilla.org/ReleaseEngineering/How_To/Trigger_arbitrary_jobs though, if the bustage in that has been fixed.
Oh right, I had forgotten about that. I'll try the trigger script once the opt build has finished, thanks.
Unfortunately the blob upload of the dumped profiles failed, because the extension I chose for the files (.sps) is not in the whitelist at https://github.com/mozilla/build-blobber/blob/master/blobber/config.py#L12-L24 ... I'll push again with .sps.json . But at least it appears that dumping the profiles themselves succeeded, on 10.10 anyway.
This is taking a few more tries. I'll update this bug once I've had some success.
I've been successful in getting profiles, but it's unclear what conclusions we can draw from them. 10.8: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://tests.themasta.com/hostedprofiles/profiles-10.8.zip 10.10: http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://tests.themasta.com/hostedprofiles/profiles-10.10.zip The main thing these profiles show is that "bloat logging", which we use for leak detection, is responsible for a lot of time spent during debug crashtests in general. Both on 10.8 and on 10.10, the main thread is blocking on the global bloat table lock most of the time, instead of doing something useful. But notably, the time we spend blocking on the lock on 10.10 is a little larger than on 10.8. If the lock is really responsible for the slowdown on 10.10, this raises a few questions: Is the overhead of grabbing a lock on 10.10 just higher? (unlikely.) Is there more contention of the lock on 10.10, i.e. are other threads doing more stuff, or taking longer to do their thing while the lock is being held? We'd like to know what other threads are doing, but profiling them is a little tricky with our built-in profiler. We've filed bug 1137397 in order to investigate getting rid of that global lock for the bloat logging case. But the prospects of the idea in that bug aren't that great and we shouldn't block deploying 10.10 on the resolution of it. Here's a try push with bloat logging forcefully turned off: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8eaf3902355b Crashtests took 8 minutes on 10.8 and 9 minutes on 10.10, versus 19min and 47min with bloat logging on.
Well would you look at that. Different approach, a very simple patch, and it gets us down to 11min on 10.10 with bloat log enabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=12085f7958b0 I've filed bug 1137963 with the patch. I'm now going to fire of a full set of debug builds on all platforms with and without the patch so that we can see whether it's just Mac that profits from it, or whether we should take it for all platforms.
The work in bug 1138616 addressed this.