Closed
Bug 1120511
Opened 10 years ago
Closed 9 years ago
Autophone - Twitter Throbber stop regression 2015-01-15
Categories
(Firefox for Android Graveyard :: General, defect)
Tracking
(fennec40+)
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
fennec | 40+ | --- |
People
(Reporter: bc, Assigned: seth)
References
(Blocks 1 open bug)
Details
(Keywords: regression)
A throbber stop regression happened on local twitter tests only. It is most apparent on Nexus S devices but appears to also affect the others. http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/norejected/2015-01-11/2015-01-31/notcached/noerrorbars/standarderror Regression Range: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=b096ec537e81&tochange=a15929ba55cd Candidates: Bug 1118694, Bug 1119158, or Bug 1079627 These patches have all be backed out from what I can tell so I'm going to immediately mark this as fixed already, but will reopen if this reappears when these bugs land.
Reporter | ||
Comment 1•10 years ago
|
||
Fixed by backouts.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 2•10 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=53d0b45c24c5&tochange=daf8243cd190 2015-01-15 15:09:09 nexus-s-3 (mozilla-inbound/rev/53d0b45c24c5): 6110±83 N=8 2015-01-15 15:11:58 nexus-s-3 (mozilla-inbound/rev/daf8243cd190): 6935±32 N=8 definitely bug Bug 1079627
Blocks: 1079627
Status: RESOLVED → REOPENED
tracking-fennec: --- → ?
Resolution: FIXED → ---
Summary: Autophone - Twitter Throbber stop regression 2015-01-12 → Autophone - Twitter Throbber stop regression 2015-01-15
Assignee | ||
Comment 3•10 years ago
|
||
Thanks Bob, this is a useful testcase. I'll investigate immediately.
Assignee | ||
Comment 4•10 years ago
|
||
Actually, wait, this is not what I thought it was at all. What does this actually measure? Time to the document's load event being fired?
Reporter | ||
Comment 5•10 years ago
|
||
The Throbber stop measurement as displayed in the graph is the difference in time between when the Throbber stop and Throbber start messages are written to logcat by GeckoToolbarDisplayLayout. See https://dxr.mozilla.org/mozilla-central/source/mobile/android/base/toolbar/ToolbarDisplayLayout.java#476 I think Throbber stop is an indication the page has completed loading but I don't know how that related to the load event. snorp, mfinkle, can you clarify ?
Flags: needinfo?(snorp)
Flags: needinfo?(mark.finkle)
Comment 6•10 years ago
|
||
Throbber Start and Throbber Stop should map to the points where Gecko nsIWebProgressListener fires START|NETWORK and STOP|NETWORK notifications, respectively. In the UI we use those to control the visibility of the "page progress" indicator. It used to be a throbber(spinner) but is now a simple progress line. The time between Throbber Start and Throbber Stop is a combination of Gecko networking and page parsing & loading, and rendering. We also have some Java UI affects too. The tests use a Blank page and a Twitter page (both completely local, no internet). I do not see any regression with the Blank page, but we see a regression with the Twitter page. The Twitter page does display a cached Twitter stream, complete with avatar images. I would go out on a limb and propose that the changes in bug 1079627 are doing something with the decoding/display of those images and is causing more time to be taken by Gecko.
Flags: needinfo?(mark.finkle)
Assignee | ||
Comment 7•10 years ago
|
||
Thanks for the details about what's being measured, Mark! Based on that information, I'm pretty sure the root cause is that bug 1079627 causes image decoders to run to completion when possible instead of yielding every 16kb. This is more efficient, but the problem is that currently off-main-thread OnDataAvailable handlers are scheduled to the same thread pool as the decoders. I already have a patch that should fix this: bug 1119158, which adds a new I/O thread for ImageLib and directs the OnDataAvailable handlers there. I'm hoping to push it today, but I'm trying to be cautious because it's been a long time since there's been a merge from inbound to central, and I've got a lot of patches already sitting on inbound waiting to be merged.
Comment 8•10 years ago
|
||
Sounds like you have a suspect in mind, but if not, maybe Bug 1009122? Merged to mc on Jan 15, regressed Windows Tp5, affects network loads by writing more to the cache.
Assignee | ||
Comment 9•10 years ago
|
||
That's also possible. Let's see how it plays out once bug 1119158 lands.
Reporter | ||
Comment 10•10 years ago
|
||
I see an improvement with https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c5e5421fd495&tochange=5bf670f092eb which points to Bug 1119774. I didn't see any improvement from Bug 1119158. We'll know more as the patches progress across the branches.
Assignee | ||
Comment 11•10 years ago
|
||
(In reply to Bob Clary [:bc:] from comment #10) > I see an improvement with > https://hg.mozilla.org/integration/mozilla-inbound/ > pushloghtml?fromchange=c5e5421fd495&tochange=5bf670f092eb which points to > Bug 1119774. I didn't see any improvement from Bug 1119158. We'll know more > as the patches progress across the branches. Hmm, if bug 1119774 improved it, I really have no explanation as to why...
Reporter | ||
Comment 12•10 years ago
|
||
The improvement appears to be temporary. Perhaps it was just noise.
Comment 13•10 years ago
|
||
I just wanted to note that the phones that really seem affected our the Nexus S phones. These phones are single core devices. The Nexus 4 / Nexus & lines don't seem affect much at all. The Nexus 5 lines really seem unfazed.
Assignee | ||
Comment 14•10 years ago
|
||
Hmmm, that's interesting. Bug 1079627, as I mentioned, causes the off-main-thread decoders to run to completion when possible. It may be that this changes the way threads are getting scheduled in a way that hurts these single core phones.
Reporter | ||
Comment 15•10 years ago
|
||
seth, I don't have the full test selection available for try builds on autophone yet, but I ran an example try build against it which you can see at: http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstart/local-twitter/norejected/2015-01-20/2015-01-20/notcached/noerrorbars/standarderror/try Note the controls which allow you to select whether to display try builds or not. You can submit a try build which will run all autophone tests using a try commit message like: try: -b do -p android-api-9,android-api-11 -u autophone-tests -t none You can find the logs for your runs using the staging version of treeherder at: https://treeherder.allizom.org/#/jobs?repo=try When looking for your results on http://phonedash.mozilla.org, be sure to look for the appropriate build time amongst the try results.
Assignee | ||
Comment 16•10 years ago
|
||
(In reply to Bob Clary [:bc:] from comment #15) > seth, I don't have the full test selection available for try builds on > autophone yet, but I ran an example try build against it which you can see > at: > > http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstart/local- > twitter/norejected/2015-01-20/2015-01-20/notcached/noerrorbars/standarderror/ > try Nice! Thanks for setting this up!
Assignee | ||
Comment 17•10 years ago
|
||
Just pushed two variations of a simple priority-based fix that I want to try before anything more complicated: https://tbpl.mozilla.org/?tree=Try&rev=2cb013ff2cbd https://tbpl.mozilla.org/?tree=Try&rev=565901e4f895 If I understand correctly, these results will automatically show up on phonedash. So easy!
Reporter | ||
Comment 18•10 years ago
|
||
Yes, though you will need to make sure to distinguish your builds from glandium's. The UI doesn't display your name or filter on it. He has try several submissions this morning. The workers are behind at the moment due to load. I definitely need to add email notifications for try builds. Note that phonedash will show you the perf measurements, but if you want to see the autophone logs, logcat, tomestones etc you can visit treeherder for your revisions https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=2cb013ff2cbd https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=565901e4f895
Reporter | ||
Updated•10 years ago
|
Flags: needinfo?(snorp)
Comment 19•10 years ago
|
||
Tracking for Fx38. I think that's where the regression landed.
Assignee: nobody → seth
tracking-fennec: ? → 38+
Reporter | ||
Comment 20•10 years ago
|
||
Autophone is now available in http://trychooser.pub.build.mozilla.org/ to help with your try syntax. Please limit your tests to the ones you need and don't use the mochitests unless you really need them. Also remember we don't have many devices and they are usually pretty busy so please don't DOS them. I'll work on the documentation in the next few days but this should work for most of your cases: try: -b o -p android-api-9,android-api-11 -u autophone-s1s2 -t none Until Autophone begins reporting to production Treeherder, you can find logs etc on the staging instance https://treeherder.allizom.org. Let me know if you have issues.
Assignee | ||
Comment 21•10 years ago
|
||
OK, trying to move forward on this again. I've pushed a couple of new tests to try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=562e490eabb0 https://treeherder.mozilla.org/#/jobs?repo=try&revision=963f747b229e
Comment 22•9 years ago
|
||
(In reply to Seth Fowler [:seth] from comment #21) > OK, trying to move forward on this again. I've pushed a couple of new tests > to try: > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=562e490eabb0 > https://treeherder.mozilla.org/#/jobs?repo=try&revision=963f747b229e Hi Seth, any update here?
Flags: needinfo?(seth)
There have been some autophone changes in the way we measure stuff. Bob can we rerun with the affected range?
Oops. Comment #23
Flags: needinfo?(seth) → needinfo?(bob)
Seth we do need to figure out what's going on here. Have you looked at it recently? We're in danger of shipping this in 38.
Flags: needinfo?(seth)
Reporter | ||
Comment 26•9 years ago
|
||
I think I can do a set of try runs for the revision range and they will display in the current phonedash graph.
According to comment #13 this only affects single core devices, so I think that significantly lowers the priority of this. Renoming.
tracking-fennec: 38+ → ?
Reporter | ||
Comment 28•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4cb9a3af7c1&exclusion_profile=false https://treeherder.mozilla.org/#/jobs?repo=try&revision=f75770ab5754&exclusion_profile=false https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b199bf71ae4&exclusion_profile=false https://treeherder.mozilla.org/#/jobs?repo=try&revision=0c35ea5b954d&exclusion_profile=false http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/rejected/2015-04-29/2015-04-29/notcached/errorbars/standarderror/try
Flags: needinfo?(bob)
Comment 29•9 years ago
|
||
OK, I still think we should fix this on trunk, but just not uplift
tracking-fennec: ? → 40+
Assignee | ||
Comment 30•9 years ago
|
||
I'm now actively working on this again. I'm currently expecting this to be fixed by bug 1160423.
Depends on: 1160423
Flags: needinfo?(seth)
Assignee | ||
Comment 31•9 years ago
|
||
So bug 1163856 is going to be the final fix for this, not bug 1160423. Bug 1163856 is just a different approach to reach the same goal as bug 1160423, and it's the approach that it looks like we're going to go with. I've been landing other bugs that improve image decode scheduling over the last month or so. You can track them down starting from bug 1152147 and bug 1160421. The most important one is bug 1174923, which should have made such a large difference that I'd expect it to have resolved this bug all by itself. I still think bug 1163856 is important, though, as it'll reduce the total amount of computation we do on the critical path of loading each image. Bob and I have discussed the current state of the Autophone benchmarks and come to the conclusion that the original regression we're discussing in this bug is already fixed. There have been other regressions/sources of noise in the meantime that obscure it for some specific devices. But the Nexus S2 for example (one of the most seriously regressed devices) is now actually completing the Twitter throbber stop benchmark faster than it was before the 1/15 regressions, and all devices are significantly improved over the past month. Based on these results, I'm going to double check that everything still looks good when bug 1163856 lands, and then I'm going to resolve this bug.
Assignee | ||
Comment 32•9 years ago
|
||
Hmm, ok. So bug 1163856 has landed, and it looks like it has moved us slightly in the wrong direction on the S2 (while leaving all other devices more-or-less unchanged). It's a ~170ms regression, or approximately 3% worse than the previous value. Bug 1163856 makes us wait to consider an image loaded until the image's header has been decoded off-main-thread. Previous to that change, if we hadn't already decoded the image's header off-main-thread (or if the main thread just hadn't been told about it yet, which is often the case since a runnable has to propagate through the event queue to notify the main thread about it), we'd do a synchronous decode of the image's headers on the main thread. That very often resulted in us decoding the image's headers twice, once off-main-thread and once on the main thread. The S2 is a dual-core phone, so it gets two image decoding threads, which contend with the main thread (and whichever other threads may be busy) during page load. It seems likely that we're getting a bad schedule on the S2 which leads to image headers either getting decoded after they would've been on the main thread (at least, often enough to get a 3% regression) or the results just waiting in the event queue too long. The event queue may really be the entire problem - we have no mechanism for letting events run sooner when they would improve page load performance, so non-essential (for page load) events may be running first, delaying page load while they run. We can view the synchronous size decode, in this context, as having been a hack to give priority to image header computations in the schedule, at the cost of duplicating work. So what to do about this? There's a very real tradeoff here. If we don't use the approach of bug 1163856, we do more work, which costs us energy. Even worse, it's more work on the main thread, which increases jank. However, if we do nothing, then we're accepting this 3% regression, which isn't ideal either. The most obvious option that I see for avoiding the regression, if we care strongly about this, would be to always set Image::INIT_FLAG_SYNC_LOAD on devices with < 3 cores. This would have the effect of moving image header decoding to the main thread on such devices. Any issues related to a poor schedule or event queue wait times, as noted above, would be eliminated. However, I honestly have a hard time convincing myself this is the right thing to do. Keeping work off the main thread is always preferable when we can get away with it, and I'm not sure that it's worth moving so much work back onto the main thread for 3%.
Assignee | ||
Comment 33•9 years ago
|
||
(Note that above, I misidentified the regressing device as a dual core Galaxy S2. It's actually a single-core Nexus S. This means we have one image decoding thread contending with the main thread on a single core, which makes poor scheduling even more likely.)
Reporter | ||
Comment 34•9 years ago
|
||
I did two try runs: the first without the patches from bug 1163856 and the second with them https://treeherder.mozilla.org/#/jobs?repo=try&exclusion_profile=false&startdate=2015-07-02&enddate=2015-07-02&author=bclary@mozilla.com and did not see the same regression for the nexus-s-6 device.
Reporter | ||
Comment 35•9 years ago
|
||
Correction, I was looking at the local blank test not the twitter test. http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/norejected/2015-07-02/2015-07-02/notcached/errorbars/standarderror/try Does show the regression for local twitter on nexus-s-6 and nexus-4-jdq39-3 but given the other improvements I think it is livable unless you want to try the Image::INIT_FLAG_SYNC_LOAD on single core devices.
Assignee | ||
Comment 36•9 years ago
|
||
So I pushed a try job which was intended to test the INIT_FLAG_SYNC_LOAD approach: https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb4eb143e1f3 I can't find the results on Autophone, though. It's entirely possible I'm doing something wrong. Bob, did I make a mistake with that try push?
Flags: needinfo?(bob)
Reporter | ||
Comment 37•9 years ago
|
||
Autophone is a tier-3 job which is hidden by default. You need to show the hidden jobs. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb4eb143e1f3&exclusion_profile=false&filter-searchStr=autophone Your results on the graph are the last ones in: http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstart/local-twitter/norejected/2015-06-30/2015-06-30/notcached/noerrorbars/standarderror/try
Flags: needinfo?(bob)
Assignee | ||
Comment 38•9 years ago
|
||
Thanks Bob. Based on those results (6646ms vs 6769ms before), it appears that the INIT_FLAG_SYNC_LOAD approach gives us a ~1.8% improvement. That's 123ms, which is most of the 170ms regression discussed in comment 32. I'd say that we should go ahead and land that change, and then consider this bug resolved.
Assignee | ||
Comment 39•9 years ago
|
||
OK, now that bug 1180931 has landed, we should have more-or-less eliminated this regression. At least one of the nexus s devices is now hovering right at 6 seconds, just as things were before the regression. I'm going to go ahead and call this resolved.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Updated•4 years ago
|
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•