Closed Bug 1120511 Opened 10 years ago Closed 9 years ago

Autophone - Twitter Throbber stop regression 2015-01-15

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(fennec40+)

RESOLVED FIXED
Tracking Status
fennec 40+ ---

People

(Reporter: bc, Assigned: seth)

References

(Blocks 1 open bug)

Details

(Keywords: regression)

A throbber stop regression happened on local twitter tests only. It is most apparent on Nexus S devices but appears to also affect the others.

http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/norejected/2015-01-11/2015-01-31/notcached/noerrorbars/standarderror

Regression Range:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=b096ec537e81&tochange=a15929ba55cd

Candidates: Bug 1118694, Bug 1119158, or Bug 1079627

These patches have all be backed out from what I can tell so I'm going to immediately mark this as fixed already, but will reopen if this reappears when these bugs land.
Fixed by backouts.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=53d0b45c24c5&tochange=daf8243cd190

2015-01-15 15:09:09
nexus-s-3
(mozilla-inbound/rev/53d0b45c24c5):
6110±83
N=8

2015-01-15 15:11:58
nexus-s-3
(mozilla-inbound/rev/daf8243cd190):
6935±32
N=8

definitely bug Bug 1079627
Blocks: 1079627
Status: RESOLVED → REOPENED
tracking-fennec: --- → ?
Resolution: FIXED → ---
Summary: Autophone - Twitter Throbber stop regression 2015-01-12 → Autophone - Twitter Throbber stop regression 2015-01-15
Thanks Bob, this is a useful testcase. I'll investigate immediately.
Actually, wait, this is not what I thought it was at all.

What does this actually measure? Time to the document's load event being fired?
The Throbber stop measurement as displayed in the graph is the difference in time between when the Throbber stop and Throbber start messages are written to logcat by GeckoToolbarDisplayLayout. See https://dxr.mozilla.org/mozilla-central/source/mobile/android/base/toolbar/ToolbarDisplayLayout.java#476

I think Throbber stop is an indication the page has completed loading but I don't know how that related to the load event. snorp, mfinkle, can you clarify ?
Flags: needinfo?(snorp)
Flags: needinfo?(mark.finkle)
Throbber Start and Throbber Stop should map to the points where Gecko nsIWebProgressListener fires START|NETWORK and STOP|NETWORK notifications, respectively. In the UI we use those to control the visibility of the "page progress" indicator. It used to be a throbber(spinner) but is now a simple progress line.

The time between Throbber Start and Throbber Stop is a combination of Gecko networking and page parsing & loading, and rendering. We also have some Java UI affects too.

The tests use a Blank page and a Twitter page (both completely local, no internet). I do not see any regression with the Blank page, but we see a regression with the Twitter page. The Twitter page does display a cached Twitter stream, complete with avatar images. I would go out on a limb and propose that the changes in bug 1079627 are doing something with the decoding/display of those images and is causing more time to be taken by Gecko.
Flags: needinfo?(mark.finkle)
Thanks for the details about what's being measured, Mark!

Based on that information, I'm pretty sure the root cause is that bug 1079627 causes image decoders to run to completion when possible instead of yielding every 16kb. This is more efficient, but the problem is that currently off-main-thread OnDataAvailable handlers are scheduled to the same thread pool as the decoders.

I already have a patch that should fix this: bug 1119158, which adds a new I/O thread for ImageLib and directs the OnDataAvailable handlers there. I'm hoping to push it today, but I'm trying to be cautious because it's been a long time since there's been a merge from inbound to central, and I've got a lot of patches already sitting on inbound waiting to be merged.
Sounds like you have a suspect in mind, but if not, maybe Bug 1009122? Merged to mc on Jan 15, regressed Windows Tp5, affects network loads by writing more to the cache.
That's also possible. Let's see how it plays out once bug 1119158 lands.
I see an improvement with https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c5e5421fd495&tochange=5bf670f092eb which points to Bug 1119774. I didn't see any improvement from Bug 1119158. We'll know more as the patches progress across the branches.
(In reply to Bob Clary [:bc:] from comment #10)
> I see an improvement with
> https://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=c5e5421fd495&tochange=5bf670f092eb which points to
> Bug 1119774. I didn't see any improvement from Bug 1119158. We'll know more
> as the patches progress across the branches.

Hmm, if bug 1119774 improved it, I really have no explanation as to why...
The improvement appears to be temporary. Perhaps it was just noise.
I just wanted to note that the phones that really seem affected our the Nexus S phones. These phones are single core devices. The Nexus 4 / Nexus & lines don't seem affect much at all. The Nexus 5 lines really seem unfazed.
Hmmm, that's interesting. Bug 1079627, as I mentioned, causes the off-main-thread decoders to run to completion when possible. It may be that this changes the way threads are getting scheduled in a way that hurts these single core phones.
seth, I don't have the full test selection available for try builds on autophone yet, but I ran an example try build against it which you can see at:

http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstart/local-twitter/norejected/2015-01-20/2015-01-20/notcached/noerrorbars/standarderror/try

Note the controls which allow you to select whether to display try builds or not.

You can submit a try build which will run all autophone tests using a try commit message like:

try: -b do -p android-api-9,android-api-11 -u autophone-tests -t none

You can find the logs for your runs using the staging version of treeherder at:

https://treeherder.allizom.org/#/jobs?repo=try

When looking for your results on http://phonedash.mozilla.org, be sure to look for the appropriate build time amongst the try results.
(In reply to Bob Clary [:bc:] from comment #15)
> seth, I don't have the full test selection available for try builds on
> autophone yet, but I ran an example try build against it which you can see
> at:
> 
> http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstart/local-
> twitter/norejected/2015-01-20/2015-01-20/notcached/noerrorbars/standarderror/
> try

Nice! Thanks for setting this up!
Just pushed two variations of a simple priority-based fix that I want to try before anything more complicated:

https://tbpl.mozilla.org/?tree=Try&rev=2cb013ff2cbd

https://tbpl.mozilla.org/?tree=Try&rev=565901e4f895

If I understand correctly, these results will automatically show up on phonedash. So easy!
Yes, though you will need to make sure to distinguish your builds from glandium's. The UI doesn't display your name or filter on it. He has try several submissions this morning. The workers are behind at the moment due to load. I definitely need to add email notifications for try builds.

Note that phonedash will show you the perf measurements, but if you want to see the autophone logs, logcat, tomestones etc you can visit treeherder for your revisions 

https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=2cb013ff2cbd
https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=565901e4f895
Flags: needinfo?(snorp)
Tracking for Fx38. I think that's where the regression landed.
Assignee: nobody → seth
tracking-fennec: ? → 38+
Autophone is now available in
http://trychooser.pub.build.mozilla.org/ to help with your try
syntax. Please limit your tests to the ones you need and don't
use the mochitests unless you really need them. Also remember we
don't have many devices and they are usually pretty busy so
please don't DOS them. I'll work on the documentation in the next
few days but this should work for most of your cases:

try: -b o -p android-api-9,android-api-11 -u autophone-s1s2 -t none

Until Autophone begins reporting to production Treeherder, you
can find logs etc on the staging instance
https://treeherder.allizom.org.

Let me know if you have issues.
(In reply to Seth Fowler [:seth] from comment #21)
> OK, trying to move forward on this again. I've pushed a couple of new tests
> to try:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=562e490eabb0
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=963f747b229e

Hi Seth, any update here?
Flags: needinfo?(seth)
There have been some autophone changes in the way we measure stuff. Bob can we rerun with the affected range?
Oops. Comment #23
Flags: needinfo?(seth) → needinfo?(bob)
Seth we do need to figure out what's going on here. Have you looked at it recently? We're in danger of shipping this in 38.
Flags: needinfo?(seth)
I think I can do a set of try runs for the revision range and they will display in the current phonedash graph.
According to comment #13 this only affects single core devices, so I think that significantly lowers the priority of this. Renoming.
tracking-fennec: 38+ → ?
OK, I still think we should fix this on trunk, but just not uplift
tracking-fennec: ? → 40+
I'm now actively working on this again. I'm currently expecting this to be fixed by bug 1160423.
Depends on: 1160423
Flags: needinfo?(seth)
So bug 1163856 is going to be the final fix for this, not bug 1160423. Bug 1163856 is just a different approach to reach the same goal as bug 1160423, and it's the approach that it looks like we're going to go with.

I've been landing other bugs that improve image decode scheduling over the last month or so. You can track them down starting from bug 1152147 and bug 1160421. The most important one is bug 1174923, which should have made such a large difference that I'd expect it to have resolved this bug all by itself. I still think bug 1163856 is important, though, as it'll reduce the total amount of computation we do on the critical path of loading each image.

Bob and I have discussed the current state of the Autophone benchmarks and come to the conclusion that the original regression we're discussing in this bug is already fixed. There have been other regressions/sources of noise in the meantime that obscure it for some specific devices. But the Nexus S2 for example (one of the most seriously regressed devices) is now actually completing the Twitter throbber stop benchmark faster than it was before the 1/15 regressions, and all devices are significantly improved over the past month.

Based on these results, I'm going to double check that everything still looks good when bug 1163856 lands, and then I'm going to resolve this bug.
Depends on: 1163856
No longer depends on: 1160423
Hmm, ok. So bug 1163856 has landed, and it looks like it has moved us slightly in the wrong direction on the S2 (while leaving all other devices more-or-less unchanged). It's a ~170ms regression, or approximately 3% worse than the previous value.

Bug 1163856 makes us wait to consider an image loaded until the image's header has been decoded off-main-thread. Previous to that change, if we hadn't already decoded the image's header off-main-thread (or if the main thread just hadn't been told about it yet, which is often the case since a runnable has to propagate through the event queue to notify the main thread about it), we'd do a synchronous decode of the image's headers on the main thread. That very often resulted in us decoding the image's headers twice, once off-main-thread and once on the main thread.

The S2 is a dual-core phone, so it gets two image decoding threads, which contend with the main thread (and whichever other threads may be busy) during page load. It seems likely that we're getting a bad schedule on the S2 which leads to image headers either getting decoded after they would've been on the main thread (at least, often enough to get a 3% regression) or the results just waiting in the event queue too long. The event queue may really be the entire problem - we have no mechanism for letting events run sooner when they would improve page load performance, so non-essential (for page load) events may be running first, delaying page load while they run. We can view the synchronous size decode, in this context, as having been a hack to give priority to image header computations in the schedule, at the cost of duplicating work.

So what to do about this? There's a very real tradeoff here. If we don't use the approach of bug 1163856, we do more work, which costs us energy. Even worse, it's more work on the main thread, which increases jank. However, if we do nothing, then we're accepting this 3% regression, which isn't ideal either.

The most obvious option that I see for avoiding the regression, if we care strongly about this, would be to always set Image::INIT_FLAG_SYNC_LOAD on devices with < 3 cores. This would have the effect of moving image header decoding to the main thread on such devices. Any issues related to a poor schedule or event queue wait times, as noted above, would be eliminated.

However, I honestly have a hard time convincing myself this is the right thing to do. Keeping work off the main thread is always preferable when we can get away with it, and I'm not sure that it's worth moving so much work back onto the main thread for 3%.
(Note that above, I misidentified the regressing device as a dual core Galaxy S2. It's actually a single-core Nexus S. This means we have one image decoding thread contending with the main thread on a single core, which makes poor scheduling even more likely.)
I did two try runs: the first without the patches from bug 1163856 and the second with them 

https://treeherder.mozilla.org/#/jobs?repo=try&exclusion_profile=false&startdate=2015-07-02&enddate=2015-07-02&author=bclary@mozilla.com

and did not see the same regression for the nexus-s-6 device.
Correction, I was looking at the local blank test not the twitter test. 

http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/norejected/2015-07-02/2015-07-02/notcached/errorbars/standarderror/try

Does show the regression for local twitter on nexus-s-6 and nexus-4-jdq39-3 but given the other improvements I think it is livable unless you want to try the Image::INIT_FLAG_SYNC_LOAD on single core devices.
So I pushed a try job which was intended to test the INIT_FLAG_SYNC_LOAD approach:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb4eb143e1f3

I can't find the results on Autophone, though. It's entirely possible I'm doing something wrong. Bob, did I make a mistake with that try push?
Flags: needinfo?(bob)
Thanks Bob. Based on those results (6646ms vs 6769ms before), it appears that the INIT_FLAG_SYNC_LOAD approach gives us a ~1.8% improvement. That's 123ms, which is most of the 170ms regression discussed in comment 32.

I'd say that we should go ahead and land that change, and then consider this bug resolved.
Depends on: 1180931
OK, now that bug 1180931 has landed, we should have more-or-less eliminated this regression. At least one of the nexus s devices is now hovering right at 6 seconds, just as things were before the regression.

I'm going to go ahead and call this resolved.
Status: REOPENED → RESOLVED
Closed: 10 years ago9 years ago
Resolution: --- → FIXED
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.