1182637 - [meta][e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', Telemetry experiments)

Reporter

Description

•

9 years ago

According to Telemetry, the frequency of jank events lasting more than 100ms increased from ~19 events/min to ~21 events/minute with e10s. This was derived from the main-thread's event processing times and session uptime: "gecko_hangs_per_minute" in http://nbviewer.ipython.org/urls/gist.githubusercontent.com/vitillo/cb6f1304316c1c1a2cbc/raw/e10s%20analysis.ipynb ^ Based on Telemetry collected from Nightly 41 on June 15th, from buildIDs in the range [20150601, 20150616] Full e10s performance write-up: https://groups.google.com/d/msg/mozilla.dev.platform/elb0Au7qwG0/Yarsx1xrrnwJ

Jim Mathies [:jimm]

Updated

•

9 years ago

tracking-e10s: --- → ?

Brad Lassey [:blassey] (use needinfo?)

Updated

•

9 years ago

tracking-e10s: ? → +

Brad Lassey [:blassey] (use needinfo?)

Comment 1

•

9 years ago

Vladan, one theory is that the cause is addons over-using CPOWs. Is there anyway you can either prove or disprove that theory with data?

Flags: needinfo?(vdjeric)

Vladan Djeric (:vladan)

Reporter

Comment 2

•

9 years ago

(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #1) > Vladan, one theory is that the cause is addons over-using CPOWs. Is there > anyway you can either prove or disprove that theory with data? Yes, we could re-run the analysis on reports from profiles with no addons. Roberto, would you have time to do this sometime in the next couple of weeks?

Flags: needinfo?(vdjeric) → needinfo?(rvitillo)

Vladan Djeric (:vladan)

Reporter

Comment 3

•

9 years ago

we also have the MISBEHAVING_ADDONS_CPOW_TIME_MS, MISBEHAVING_ADDONS_JANK_LEVEL and ADDON_SHIM_USAGE histograms which we could use as controls

Roberto Agostino Vitillo (:rvitillo)

Comment 4

•

9 years ago

We don't have enough users not using e10s without add-ons to make any strong claims. Soon enough we are probably not going to have enough users to compare e10s and non e10s builds as well. We should make sure that at least one quarter of Nightly users still runs non-e10s builds.

Flags: needinfo?(rvitillo)

Vladan Djeric (:vladan)

Reporter

Comment 5

•

9 years ago

Anthony: Can you re-run Roberto's gecko_hangs_per_minute comparison of e10s vs non-e10s responsiveness, on Aurora 41 using buildIDs from before July 29th, and profiles with no extensions except the default Firefox theme extension? Roberto's analysis on Nightly: http://nbviewer.ipython.org/urls/gist.githubusercontent.com/vitillo/cb6f1304316c1c1a2cbc/raw/e10s%20analysis.ipynb

Flags: needinfo?(azhang)

Anthony Zhang [:azhang] (last day at Mozilla: 2016-04-29)

Comment 6

•

9 years ago

Done: https://gist.github.com/Uberi/40458dbccd88388739f7 This one is based on rvitillo's no addons e10s analysis [1]. It uses data from Aurora 41 users with buildIDs from July 28th, 2015, who have no extensions installed (except the default theme extension). [1]: https://gist.github.com/vitillo/8b6af93c6741e2278807

Flags: needinfo?(azhang)

Roberto Agostino Vitillo (:rvitillo)

Comment 7

•

9 years ago

Anthony, why is the "gecko_hangs_per_minute" comparison missing from your notebook?

Flags: needinfo?(azhang)

Anthony Zhang [:azhang] (last day at Mozilla: 2016-04-29)

Comment 8

•

9 years ago

Missed that while checking over the notebook - the link above is updated now with all the info.

Flags: needinfo?(azhang)

Vladan Djeric (:vladan)

Reporter

Updated

•

9 years ago

Blocks: e10s-measurement

Jim Mathies [:jimm]

Updated

•

9 years ago

Summary: Telemetry: e10s is jankier than non-e10s → [e10s][telemetry] e10s is jankier than non-e10s

Jim Mathies [:jimm]

Comment 9

•

9 years ago

This data comes from here - http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/BackgroundHangMonitor.cpp the current aurora experiment data is here - https://github.com/vitillo/e10s_analyses/blob/master/aurora/e10s_experiment.ipynb search for 'gecko_hangs_per_minute'. This includes a random mixtures of users with addons installed. Results: Median difference in hangs over 100ms per minute is 8.90, (31.18, 22.28). The probability of this effect being purely by chance is 0.00. This is pretty bad. Why the chrome thread would see this increase in jank is surprising considering how well we're doing on other performance stats. For starters we might want to look into the background thread measuring code to be sure it's working right with e10s.

Jim Mathies [:jimm]

Updated

•

9 years ago

Summary: [e10s][telemetry] e10s is jankier than non-e10s → [e10s][telemetry] e10s is jankier than non-e10s (aurora experiment w/addons)

Jim Mathies [:jimm]

Updated

•

9 years ago

URL: http://nbviewer.ipython.org/urls/gist...

Jim Mathies [:jimm]

Comment 10

•

9 years ago

for clients without any extensions: https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_noext_experiment.ipynb for clients with at least one extension: https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_onlyext_experiment.ipynb for clients with just adblock plus installed: https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_top_extension_experiment.ipynb

Summary: [e10s][telemetry] e10s is jankier than non-e10s (aurora experiment w/addons) → [e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', aurora experiment)

Jim Mathies [:jimm]

Updated

•

9 years ago

Blocks: 1224276
No longer blocks: e10s-perf

Vladan Djeric (:vladan)

Reporter

Updated

•

9 years ago

Depends on: 1224374

Gabor Krizsanits (INACTIVE)

Assignee

Updated

•

9 years ago

Assignee: nobody → gkrizsanits

Gabor Krizsanits (INACTIVE)

Assignee

Comment 11

•

9 years ago

I'm trying to grasp what gecko_hangs_per_minute really means for us. Given that the whole characteristic of these data changed with e10s I'm not fully convinced that this is a bad thing. In the latest version this number went up from 20 to 26 (non-e10s vs e10s) per minute so I assume a 100ms jank is not a disaster. So instead of the counts of these janks their characteristic / distribution would be more interesting. If that's 20 200ms janks vs 26 102ms janks this can be even a win... Vladan, could we plot these janks maybe? Or get their average in both cases at least? Without that I don't think this is a fair comparison. Another thing I noticed that in MessagePumpDefault::Run we don't call notifyWait on the hangMonitor when we break. This should only matter when the main thread is over I think but still seem like some junk data... (http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_default.cc#21 ) Or is this intentional?

Flags: needinfo?(vladan.bugzilla)

Roberto Agostino Vitillo (:rvitillo)

Comment 12

•

9 years ago

poiru is preparing a plot of the activity histogram of the Gecko thread faceted by e10s status, which will answer the first part of Gabor's question.

Flags: needinfo?(birunthan)

Vladan Djeric (:vladan)

Reporter

Comment 13

•

9 years ago

Gabor: The BHR statistic refers to the median *rate* of jank events that last 100ms or *longer*. Poiru is plotting them. Note that we may actually be under-reporting the jank rate as a result of a client-side bug (bug 1228437) and the analyses need to be updated too

Depends on: 1228437

Flags: needinfo?(vladan.bugzilla)

Vladan Djeric (:vladan)

Reporter

Comment 14

•

9 years ago

ni'ing myself to look into BHR behaviour in message_pump_default.cc

Flags: needinfo?(vladan.bugzilla)

Gabor Krizsanits (INACTIVE)

Assignee

Comment 15

•

9 years ago

(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #13) > Gabor: The BHR statistic refers to the median *rate* of jank events that > last 100ms or *longer*. Poiru is plotting them. Note that we may actually be > under-reporting the jank rate as a result of a client-side bug (bug 1228437) > and the analyses need to be updated too Thanks, I got that part, it's just that I found that info not sufficient to make a fair conclusion. That's why I asked for the plotting / distribution. About the second part, isn't that because we use MessagePumpForChildProcess in the child instead of MessagePumpDefault that does the BHR?

Biru [:poiru]

Comment 16

•

9 years ago

The thread stack activity histograms can be viewed here: http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_top_hang_stacks.ipynb Note that the histograms are for the parent process only because of bug 1228437.

Flags: needinfo?(birunthan)

Gabor Krizsanits (INACTIVE)

Assignee

Comment 17

•

9 years ago

(In reply to Birunthan Mohanathas [:poiru] from comment #16) > The thread stack activity histograms can be viewed here: > http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/ > e10s_top_hang_stacks.ipynb Million thanks, this is great stuff. So one thing that is new in e10s is - 7.3378% (0.0041%): IPDL::PCookieService::RecvGetCookieString Blake, I think you've worked on this stuff, is there a chance that instead of pull scheme we use push scheme for cookie strings? So we would have the data in every process and when something change we broadcast it. It's probably not easy if possible at all to get it right... but would help this issue probably. The other new big player is a bit more tricky: - 33.5110% (18.4398%): Startup::XRE_Main This pseudo stack is not very descriptive, it can be really anything, and very likely multiple things... I guess we need some new tags for e10s.

Flags: needinfo?(mrbkap)

Blake Kaplan (:mrbkap) (inactive)

Comment 18

•

9 years ago

Yeah, we probably can do that. I also have bug 1185734 where this was mentioned.

Flags: needinfo?(mrbkap)

Gabor Krizsanits (INACTIVE)

Assignee

Updated

•

9 years ago

Depends on: 1232638

Gabor Krizsanits (INACTIVE)

Assignee

Updated

•

9 years ago

Depends on: 1232641

Jim Mathies [:jimm]

Updated

•

9 years ago

No longer blocks: 1224276

Jim Mathies [:jimm]

Updated

•

9 years ago

Blocks: e10s-perf

Priority: -- → P1

Vladan Djeric (:vladan)

Reporter

Comment 19

•

9 years ago

Clearing needinfo, this is bug 1228437 now

Flags: needinfo?(vladan.bugzilla)

Vladan Djeric (:vladan)

Reporter

Updated

•

9 years ago

Summary: [e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', aurora experiment) → [meta][e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', Telemetry experiments)

Vladan Djeric (:vladan)

Reporter

Updated

•

9 years ago

Depends on: 1235910

Gabor Krizsanits (INACTIVE)

Assignee

Comment 20

•

9 years ago

Based on Bill's measurements in Bug 1235910 Comment 3 this bug seems to be invalid. (after re-evaluating his statistics over the pings from the last few days I've got similar results). Should we wait for Bug 1228437 and re-evaluate the whole thing, or should we close this bug for now since with e10s we seem to have way less janks than without?

Flags: needinfo?(birunthan)

Roberto Agostino Vitillo (:rvitillo)

Comment 21

•

9 years ago

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #20) > Should we wait for Bug 1228437 and re-evaluate the whole thing, or should we > close this bug for now since with e10s we seem to have way less janks than > without? Let's wait for Bug 1228437 and re-evaluate the metric in a new experiment.

Flags: needinfo?(birunthan)

Roberto Agostino Vitillo (:rvitillo)

Comment 22

•

9 years ago

Looking at the overall distribution of hangs [1] over 100 ms reveals that the e10s population has more hangs than the non-e10s one, even without Bug 1228437. The log-scaled distribution of the *number of hangs* per user [2] appears to have a bimodal shape for the e10s population, i.e. some users have less hangs per session than non-e10s users and some others have way more hangs than non-e10s ones. [1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7 [2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a

Gabor Krizsanits (INACTIVE)

Assignee

Comment 23

•

9 years ago

Thanks for the measurements! (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #22) > Looking at the overall distribution of hangs [1] over 100 ms reveals that > the e10s population has more hangs than the non-e10s one, even without Bug > 1228437. > I'm sure you're doing the right thing just trying to understand this data. Previously we've always normalized the hangs with the uptime: uptime = ping["payload"]["simpleMeasurements"].get("uptime", -1) ... ping["gecko_hangs_per_minute"] = over100/uptime to compare the hangs per minutes instead of the less interesting hangs per sessions data (longer sessions -> more hangs I guess). Am I missing something or you're not doing that any longer? If so can you explain me why? > The log-scaled distribution of the *number of hangs* per user [2] appears to > have a bimodal shape for the e10s population, i.e. some users have less > hangs per session than non-e10s users and some others have way more hangs > than non-e10s ones. > > [1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7 > [2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a This is interesting... Is it possible to get some more data about the characteristics of each group? Or investigating the environment of some users sampled randomly from each group?

Flags: needinfo?(rvitillo)

Roberto Agostino Vitillo (:rvitillo)

Comment 24

•

9 years ago

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #23) > Thanks for the measurements! > > (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #22) > > Looking at the overall distribution of hangs [1] over 100 ms reveals that > > the e10s population has more hangs than the non-e10s one, even without Bug > > 1228437. > > > > I'm sure you're doing the right thing just trying to understand this data. > Previously we've always normalized the hangs with the uptime: > > uptime = ping["payload"]["simpleMeasurements"].get("uptime", -1) > ... > ping["gecko_hangs_per_minute"] = over100/uptime > > to compare the hangs per minutes instead of the less interesting hangs per > sessions data (longer sessions -> more hangs I guess). Am I missing something > or you're not doing that any longer? If so can you explain me why? The notebook where the normalization is applied was written originally for an analysis on non experimental data, so we couldn't assume that the populatations were biased in some way wrt session duration. The plots I have posted here are based on a A/B experiment, which means that there should be an equal amount of long sessions in both populations (unless e10s impacts the session duration which it doesn't) and as such it's reasonable to compare just the raw counts. Put in another way, if we were to run an A/A experiment on Beta we wouldn't expect those plots to differ significantly (given a big enough sample). > > The log-scaled distribution of the *number of hangs* per user [2] appears to > > have a bimodal shape for the e10s population, i.e. some users have less > > hangs per session than non-e10s users and some others have way more hangs > > than non-e10s ones. > > > > [1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7 > > [2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a > > This is interesting... Is it possible to get some more data about the > characteristics > of each group? Or investigating the environment of some users sampled > randomly from each > group? I am looking into it.

Flags: needinfo?(rvitillo)

Gabor Krizsanits (INACTIVE)

Assignee

Comment 25

•

9 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #24) > an analysis on non experimental data, so we couldn't assume that the > populatations were biased in some way wrt session duration. Oh, right! Thanks for explaining, it didn't occurred to me :)

Jim Mathies [:jimm]

Comment 26

•

9 years ago

distribution of hangs with users who do not have addons installed - https://gist.github.com/vitillo/9b3e30620000dc3779d3

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 27

•

9 years ago

Perhaps I'm misreading, but all these notebooks seem to look at builds starting with buildid 20151228134903. As explained in https://bugzilla.mozilla.org/show_bug.cgi?id=1234618#c14, the fix for bug 1234618 didn't actually land on beta until 12/31. I don't know what the first buildid is that had the fix, but it's clearly after 12/31. So all this analysis invalid. The bimodal behavior probably reflects the fact that some builds you're looking at have the fix and others don't.

Roberto Agostino Vitillo (:rvitillo)

Comment 28

•

9 years ago

I thought the patch landed in b4 given https://bugzilla.mozilla.org/show_bug.cgi?id=1235910#c5 but it actually landed in b6 (build-id 20160104162232). Unfortunately there are not enough users left in the experiment that have a build-id >= 20160104162232 and we can't make any statistical significant claims.

Jim Mathies [:jimm]

Comment 29

•

9 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28) > I thought the patch landed in b4 given > https://bugzilla.mozilla.org/show_bug.cgi?id=1235910#c5 but it actually > landed in b6 (build-id 20160104162232). Unfortunately there are not enough > users left in the experiment that have a build-id >= 20160104162232 and we > can't make any statistical significant claims. https://gist.github.com/vitillo/9b3e30620000dc3779d3 This could be good news if the users who shifted to the left had the fix. I guess we'll find out in 45.

Gabor Krizsanits (INACTIVE)

Assignee

Comment 30

•

9 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28) > can't make any statistical significant claims. With the uptime normalization I would expect a lot faster convergence, have you tried that statistic too by any chance?

Roberto Agostino Vitillo (:rvitillo)

Comment 31

•

9 years ago

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #30) > (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28) > > can't make any statistical significant claims. > > With the uptime normalization I would expect a lot faster convergence, have > you tried that statistic too by any chance? No, but there are only a few dozen users left so we are going to have to wait for the next experiment.

Chris H-C :chutten

Updated

•

9 years ago

Blocks: e10s-responsiveness

:kip (Kearwood Gilbert)

Updated

•

9 years ago

Whiteboard: [webvr]

Chris H-C :chutten

Comment 32

•

9 years ago

Here's the updated analysis running against preliminary Beta 45 numbers: https://gist.github.com/chutten/aa269205a452be7fd89b It's looking good, I think. Roberto, any comment?

Flags: needinfo?(rvitillo)

Roberto Agostino Vitillo (:rvitillo)

Comment 33

•

9 years ago

(In reply to Chris H-C :chutten from comment #32) > Here's the updated analysis running against preliminary Beta 45 numbers: > https://gist.github.com/chutten/aa269205a452be7fd89b > > It's looking good, I think. Roberto, any comment? Please submit a PR for e10s-analyses so we can review it there. The analysis is just comparing the parent activity so you are going to have to add the child activity as well for e10s since we have that data now.

Flags: needinfo?(rvitillo)

Gabor Krizsanits (INACTIVE)

Assignee

Comment 34

•

9 years ago

This bug is only about janks in the parent process, since that's what we consider as a possible blocker for e10s. For the child side janks we should file another bug. I'm closing this for now as we agreed on the last e10s meeting, if parent side janks turn out to be an issue again we can reopen it. Flagging as invalid since as it turned out the bug was in the measurements.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → INVALID