1120511 - Autophone - Twitter Throbber stop regression 2015-01-15

Reporter

Description

•

10 years ago

A throbber stop regression happened on local twitter tests only. It is most apparent on Nexus S devices but appears to also affect the others.

http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/norejected/2015-01-11/2015-01-31/notcached/noerrorbars/standarderror

Regression Range:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=b096ec537e81&tochange=a15929ba55cd

Candidates: Bug 1118694, Bug 1119158, or Bug 1079627

These patches have all be backed out from what I can tell so I'm going to immediately mark this as fixed already, but will reopen if this reappears when these bugs land.

Bob Clary [:bc] (inactive)

Reporter

Comment 1

•

10 years ago

Fixed by backouts.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Bob Clary [:bc] (inactive)

Reporter

Comment 2

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=53d0b45c24c5&tochange=daf8243cd190

2015-01-15 15:09:09
nexus-s-3
(mozilla-inbound/rev/53d0b45c24c5):
6110±83
N=8

2015-01-15 15:11:58
nexus-s-3
(mozilla-inbound/rev/daf8243cd190):
6935±32
N=8

definitely bug Bug 1079627

Blocks: 1079627

Status: RESOLVED → REOPENED

tracking-fennec: --- → ?

Resolution: FIXED → ---

Summary: Autophone - Twitter Throbber stop regression 2015-01-12 → Autophone - Twitter Throbber stop regression 2015-01-15

Seth Fowler [:seth] [:s2h]

Assignee

Comment 3

•

10 years ago

Thanks Bob, this is a useful testcase. I'll investigate immediately.

Seth Fowler [:seth] [:s2h]

Assignee

Comment 4

•

10 years ago

Actually, wait, this is not what I thought it was at all.

What does this actually measure? Time to the document's load event being fired?

Bob Clary [:bc] (inactive)

Reporter

Comment 5

•

10 years ago

The Throbber stop measurement as displayed in the graph is the difference in time between when the Throbber stop and Throbber start messages are written to logcat by GeckoToolbarDisplayLayout. See https://dxr.mozilla.org/mozilla-central/source/mobile/android/base/toolbar/ToolbarDisplayLayout.java#476

I think Throbber stop is an indication the page has completed loading but I don't know how that related to the load event. snorp, mfinkle, can you clarify ?

Flags: needinfo?(snorp)

Flags: needinfo?(mark.finkle)

Mark Finkle (:mfinkle) (use needinfo?)

Comment 6

•

10 years ago

Throbber Start and Throbber Stop should map to the points where Gecko nsIWebProgressListener fires START|NETWORK and STOP|NETWORK notifications, respectively. In the UI we use those to control the visibility of the "page progress" indicator. It used to be a throbber(spinner) but is now a simple progress line.

The time between Throbber Start and Throbber Stop is a combination of Gecko networking and page parsing & loading, and rendering. We also have some Java UI affects too.

The tests use a Blank page and a Twitter page (both completely local, no internet). I do not see any regression with the Blank page, but we see a regression with the Twitter page. The Twitter page does display a cached Twitter stream, complete with avatar images. I would go out on a limb and propose that the changes in bug 1079627 are doing something with the decoding/display of those images and is causing more time to be taken by Gecko.

Flags: needinfo?(mark.finkle)

Seth Fowler [:seth] [:s2h]

Assignee

Comment 7

•

10 years ago

Thanks for the details about what's being measured, Mark!

Based on that information, I'm pretty sure the root cause is that bug 1079627 causes image decoders to run to completion when possible instead of yielding every 16kb. This is more efficient, but the problem is that currently off-main-thread OnDataAvailable handlers are scheduled to the same thread pool as the decoders.

I already have a patch that should fix this: bug 1119158, which adds a new I/O thread for ImageLib and directs the OnDataAvailable handlers there. I'm hoping to push it today, but I'm trying to be cautious because it's been a long time since there's been a merge from inbound to central, and I've got a lot of patches already sitting on inbound waiting to be merged.

Richard Newman [:rnewman]

Comment 8

•

10 years ago

Sounds like you have a suspect in mind, but if not, maybe Bug 1009122? Merged to mc on Jan 15, regressed Windows Tp5, affects network loads by writing more to the cache.

Seth Fowler [:seth] [:s2h]

Assignee

Comment 9

•

10 years ago

That's also possible. Let's see how it plays out once bug 1119158 lands.

Bob Clary [:bc] (inactive)

Reporter

Comment 10

•

10 years ago

I see an improvement with https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c5e5421fd495&tochange=5bf670f092eb which points to Bug 1119774. I didn't see any improvement from Bug 1119158. We'll know more as the patches progress across the branches.

Seth Fowler [:seth] [:s2h]

Assignee

Comment 11

•

10 years ago

(In reply to Bob Clary [:bc:] from comment #10)
> I see an improvement with
> https://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=c5e5421fd495&tochange=5bf670f092eb which points to
> Bug 1119774. I didn't see any improvement from Bug 1119158. We'll know more
> as the patches progress across the branches.

Hmm, if bug 1119774 improved it, I really have no explanation as to why...

Bob Clary [:bc] (inactive)

Reporter

Comment 12

•

10 years ago

The improvement appears to be temporary. Perhaps it was just noise.

Mark Finkle (:mfinkle) (use needinfo?)

Comment 13

•

10 years ago

I just wanted to note that the phones that really seem affected our the Nexus S phones. These phones are single core devices. The Nexus 4 / Nexus & lines don't seem affect much at all. The Nexus 5 lines really seem unfazed.

Seth Fowler [:seth] [:s2h]

Assignee

Comment 14

•

10 years ago

Hmmm, that's interesting. Bug 1079627, as I mentioned, causes the off-main-thread decoders to run to completion when possible. It may be that this changes the way threads are getting scheduled in a way that hurts these single core phones.

Bob Clary [:bc] (inactive)

Reporter

Comment 15

•

10 years ago

seth, I don't have the full test selection available for try builds on autophone yet, but I ran an example try build against it which you can see at:

http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstart/local-twitter/norejected/2015-01-20/2015-01-20/notcached/noerrorbars/standarderror/try

Note the controls which allow you to select whether to display try builds or not.

You can submit a try build which will run all autophone tests using a try commit message like:

try: -b do -p android-api-9,android-api-11 -u autophone-tests -t none

You can find the logs for your runs using the staging version of treeherder at:

https://treeherder.allizom.org/#/jobs?repo=try

When looking for your results on http://phonedash.mozilla.org, be sure to look for the appropriate build time amongst the try results.

Seth Fowler [:seth] [:s2h]

Assignee

Comment 16

•

10 years ago

(In reply to Bob Clary [:bc:] from comment #15)
> seth, I don't have the full test selection available for try builds on
> autophone yet, but I ran an example try build against it which you can see
> at:
> 
> http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstart/local-
> twitter/norejected/2015-01-20/2015-01-20/notcached/noerrorbars/standarderror/
> try

Nice! Thanks for setting this up!

Seth Fowler [:seth] [:s2h]

Assignee

Comment 17

•

10 years ago

Just pushed two variations of a simple priority-based fix that I want to try before anything more complicated:

https://tbpl.mozilla.org/?tree=Try&rev=2cb013ff2cbd

https://tbpl.mozilla.org/?tree=Try&rev=565901e4f895

If I understand correctly, these results will automatically show up on phonedash. So easy!

Bob Clary [:bc] (inactive)

Reporter

Comment 18

•

10 years ago

Yes, though you will need to make sure to distinguish your builds from glandium's. The UI doesn't display your name or filter on it. He has try several submissions this morning. The workers are behind at the moment due to load. I definitely need to add email notifications for try builds.

Note that phonedash will show you the perf measurements, but if you want to see the autophone logs, logcat, tomestones etc you can visit treeherder for your revisions 

https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=2cb013ff2cbd
https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=565901e4f895

Bob Clary [:bc] (inactive)

Reporter

Updated

•

10 years ago

Flags: needinfo?(snorp)

Mark Finkle (:mfinkle) (use needinfo?)

Comment 19

•

10 years ago

Tracking for Fx38. I think that's where the regression landed.

Assignee: nobody → seth

tracking-fennec: ? → 38+

Bob Clary [:bc] (inactive)

Reporter

Comment 20

•

10 years ago

Autophone is now available in
http://trychooser.pub.build.mozilla.org/ to help with your try
syntax. Please limit your tests to the ones you need and don't
use the mochitests unless you really need them. Also remember we
don't have many devices and they are usually pretty busy so
please don't DOS them. I'll work on the documentation in the next
few days but this should work for most of your cases:

try: -b o -p android-api-9,android-api-11 -u autophone-s1s2 -t none

Until Autophone begins reporting to production Treeherder, you
can find logs etc on the staging instance
https://treeherder.allizom.org.

Let me know if you have issues.

Seth Fowler [:seth] [:s2h]

Assignee

Comment 21

•

10 years ago

OK, trying to move forward on this again. I've pushed a couple of new tests to try:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=562e490eabb0
https://treeherder.mozilla.org/#/jobs?repo=try&revision=963f747b229e

Brad Lassey [:blassey] (use needinfo?)

Comment 22

•

9 years ago

(In reply to Seth Fowler [:seth] from comment #21)
> OK, trying to move forward on this again. I've pushed a couple of new tests
> to try:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=562e490eabb0
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=963f747b229e

Hi Seth, any update here?

Flags: needinfo?(seth)

James Willcox (:snorp) (jwillcox@mozilla.com) (he/him)

Comment 23

•

9 years ago

There have been some autophone changes in the way we measure stuff. Bob can we rerun with the affected range?

James Willcox (:snorp) (jwillcox@mozilla.com) (he/him)

Comment 24

•

9 years ago

Oops. Comment #23

Flags: needinfo?(seth) → needinfo?(bob)

James Willcox (:snorp) (jwillcox@mozilla.com) (he/him)

Comment 25

•

9 years ago

Seth we do need to figure out what's going on here. Have you looked at it recently? We're in danger of shipping this in 38.

Flags: needinfo?(seth)

Bob Clary [:bc] (inactive)

Reporter

Comment 26

•

9 years ago

I think I can do a set of try runs for the revision range and they will display in the current phonedash graph.

James Willcox (:snorp) (jwillcox@mozilla.com) (he/him)

Comment 27

•

9 years ago

According to comment #13 this only affects single core devices, so I think that significantly lowers the priority of this. Renoming.

tracking-fennec: 38+ → ?

Bob Clary [:bc] (inactive)

Reporter

Comment 28

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4cb9a3af7c1&exclusion_profile=false
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f75770ab5754&exclusion_profile=false
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b199bf71ae4&exclusion_profile=false
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0c35ea5b954d&exclusion_profile=false

http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/rejected/2015-04-29/2015-04-29/notcached/errorbars/standarderror/try

Flags: needinfo?(bob)

Brad Lassey [:blassey] (use needinfo?)

Comment 29

•

9 years ago

OK, I still think we should fix this on trunk, but just not uplift

tracking-fennec: ? → 40+

Seth Fowler [:seth] [:s2h]

Assignee

Comment 30

•

9 years ago

I'm now actively working on this again. I'm currently expecting this to be fixed by bug 1160423.

Depends on: 1160423

Flags: needinfo?(seth)

Seth Fowler [:seth] [:s2h]

Assignee

Comment 31

•

9 years ago

So bug 1163856 is going to be the final fix for this, not bug 1160423. Bug 1163856 is just a different approach to reach the same goal as bug 1160423, and it's the approach that it looks like we're going to go with.

I've been landing other bugs that improve image decode scheduling over the last month or so. You can track them down starting from bug 1152147 and bug 1160421. The most important one is bug 1174923, which should have made such a large difference that I'd expect it to have resolved this bug all by itself. I still think bug 1163856 is important, though, as it'll reduce the total amount of computation we do on the critical path of loading each image.

Bob and I have discussed the current state of the Autophone benchmarks and come to the conclusion that the original regression we're discussing in this bug is already fixed. There have been other regressions/sources of noise in the meantime that obscure it for some specific devices. But the Nexus S2 for example (one of the most seriously regressed devices) is now actually completing the Twitter throbber stop benchmark faster than it was before the 1/15 regressions, and all devices are significantly improved over the past month.

Based on these results, I'm going to double check that everything still looks good when bug 1163856 lands, and then I'm going to resolve this bug.

Depends on: 1163856
No longer depends on: 1160423

Seth Fowler [:seth] [:s2h]

Assignee

Comment 32

•

9 years ago

Hmm, ok. So bug 1163856 has landed, and it looks like it has moved us slightly in the wrong direction on the S2 (while leaving all other devices more-or-less unchanged). It's a ~170ms regression, or approximately 3% worse than the previous value.

Bug 1163856 makes us wait to consider an image loaded until the image's header has been decoded off-main-thread. Previous to that change, if we hadn't already decoded the image's header off-main-thread (or if the main thread just hadn't been told about it yet, which is often the case since a runnable has to propagate through the event queue to notify the main thread about it), we'd do a synchronous decode of the image's headers on the main thread. That very often resulted in us decoding the image's headers twice, once off-main-thread and once on the main thread.

The S2 is a dual-core phone, so it gets two image decoding threads, which contend with the main thread (and whichever other threads may be busy) during page load. It seems likely that we're getting a bad schedule on the S2 which leads to image headers either getting decoded after they would've been on the main thread (at least, often enough to get a 3% regression) or the results just waiting in the event queue too long. The event queue may really be the entire problem - we have no mechanism for letting events run sooner when they would improve page load performance, so non-essential (for page load) events may be running first, delaying page load while they run. We can view the synchronous size decode, in this context, as having been a hack to give priority to image header computations in the schedule, at the cost of duplicating work.

So what to do about this? There's a very real tradeoff here. If we don't use the approach of bug 1163856, we do more work, which costs us energy. Even worse, it's more work on the main thread, which increases jank. However, if we do nothing, then we're accepting this 3% regression, which isn't ideal either.

The most obvious option that I see for avoiding the regression, if we care strongly about this, would be to always set Image::INIT_FLAG_SYNC_LOAD on devices with < 3 cores. This would have the effect of moving image header decoding to the main thread on such devices. Any issues related to a poor schedule or event queue wait times, as noted above, would be eliminated.

However, I honestly have a hard time convincing myself this is the right thing to do. Keeping work off the main thread is always preferable when we can get away with it, and I'm not sure that it's worth moving so much work back onto the main thread for 3%.

Seth Fowler [:seth] [:s2h]

Assignee

Comment 33

•

9 years ago

(Note that above, I misidentified the regressing device as a dual core Galaxy S2. It's actually a single-core Nexus S. This means we have one image decoding thread contending with the main thread on a single core, which makes poor scheduling even more likely.)

Bob Clary [:bc] (inactive)

Reporter

Comment 34

•

9 years ago

I did two try runs: the first without the patches from bug 1163856 and the second with them 

https://treeherder.mozilla.org/#/jobs?repo=try&exclusion_profile=false&startdate=2015-07-02&enddate=2015-07-02&author=bclary@mozilla.com

and did not see the same regression for the nexus-s-6 device.

Bob Clary [:bc] (inactive)

Reporter

Comment 35

•

9 years ago

Correction, I was looking at the local blank test not the twitter test. 

http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/norejected/2015-07-02/2015-07-02/notcached/errorbars/standarderror/try

Does show the regression for local twitter on nexus-s-6 and nexus-4-jdq39-3 but given the other improvements I think it is livable unless you want to try the Image::INIT_FLAG_SYNC_LOAD on single core devices.

Seth Fowler [:seth] [:s2h]

Assignee

Comment 36

•

9 years ago

So I pushed a try job which was intended to test the INIT_FLAG_SYNC_LOAD approach:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb4eb143e1f3

I can't find the results on Autophone, though. It's entirely possible I'm doing something wrong. Bob, did I make a mistake with that try push?

Flags: needinfo?(bob)

Bob Clary [:bc] (inactive)

Reporter

Comment 37

•

9 years ago

Autophone is a tier-3 job which is hidden by default. You need to show the hidden jobs. See

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb4eb143e1f3&exclusion_profile=false&filter-searchStr=autophone

Your results on the graph are the last ones in:

http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstart/local-twitter/norejected/2015-06-30/2015-06-30/notcached/noerrorbars/standarderror/try

Flags: needinfo?(bob)

Seth Fowler [:seth] [:s2h]

Assignee

Comment 38

•

9 years ago

Thanks Bob. Based on those results (6646ms vs 6769ms before), it appears that the INIT_FLAG_SYNC_LOAD approach gives us a ~1.8% improvement. That's 123ms, which is most of the 170ms regression discussed in comment 32.

I'd say that we should go ahead and land that change, and then consider this bug resolved.

Seth Fowler [:seth] [:s2h]

Assignee

Updated

•

9 years ago

Depends on: 1180931

Seth Fowler [:seth] [:s2h]

Assignee

Comment 39

•

9 years ago

OK, now that bug 1180931 has landed, we should have more-or-less eliminated this regression. At least one of the nexus s devices is now hovering right at 6 seconds, just as things were before the regression.

I'm going to go ahead and call this resolved.

Status: REOPENED → RESOLVED

Closed: 10 years ago → 9 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

4 years ago

Product: Firefox for Android → Firefox for Android Graveyard