Bug 789975 - Extremely long pause while collecting telemetry information on the main thread
Status: RESOLVED FIXED
Whiteboard: [Snappy:p1][qa?]
Product: Toolkit
Classification: Components
Component: Telemetry
Version: unspecified
Platform: x86 Mac OS X
Importance: -- normal
Target Milestone: mozilla21
Assigned To: Justin Lebar (not reading bugmail)
 
Reported: 2012-09-10 10:01 PDT by :Ehsan Akhgari (busy, don't ask for review please)
Modified: 2013-01-24 06:33 PST
CC: 8 users
Tracking flags: fixed, fixed, fixed


Attachments
gather telemetry on how long memory reporters take during telemetry (2.18 KB, patch)
    2012-12-07 15:46 PST, Nathan Froyd [:froydnj]
    vladan.bugzilla: review+
My TELEMETRY_MEMORY_REPORTER_MS (16.98 KB, image/png)
    2013-01-10 14:55 PST, :Ehsan Akhgari (busy, don't ask for review please)
Patch, v1 (7.21 KB, patch)
    2013-01-14 09:07 PST, Justin Lebar (not reading bugmail)
    n.nethercote: review+
    akeybl: approval-mozilla-aurora+
    akeybl: approval-mozilla-beta+

Description :Ehsan Akhgari (busy, don't ask for review please) 2012-09-10 10:01:52 PDT
See this profile.  This seems to be related to gathering information from the memory reporters.

http://people.mozilla.com/~bgirard/cleopatra/?report=40ea5d43a22eeee73a45e1d763e4c6142e202133
Comment 1 Nicholas Nethercote [:njn] 2012-09-10 16:47:16 PDT
The sample indicates that the majority of the time is spent in gatherMemory(), in TelemetryPing.js, which iterates through the single reporters and collects measurements for a number of them.

One question:  is this a debug or opt build?  There are two ways to measure "explicit" -- a slow way and a fast way.  In debug builds we do both and check they give the same result.  So if this is a debug build it's not surprising, and I'll argue for WONTFIX.  If it's an opt build, that's more interesting...
Comment 2 :Ehsan Akhgari (busy, don't ask for review please) 2012-09-10 20:58:09 PDT
(In reply to comment #1)
> The sample indicates that the majority of the time is spent in gatherMemory(),
> in TelemetryPing.js, which iterates through the single reporters and collects
> measurements for a number of them.
> 
> One question:  is this a debug or opt build?  There are two ways to measure
> "explicit" -- a slow way and a fast way.  In debug builds we do both and check
> they give the same result.  So if this is a debug build it's not surprising,
> and I'll argue for WONTFIX.  If it's an opt build, that's more interesting...

This was on the latest nightly on Mac.  And just to give some context, there was a multi-second pause, only part of which is captured in the profile.
Comment 3 (dormant account) 2012-11-28 16:18:27 PST
Let's add telemetry to see how long memory gathering takes, so we can figure out whether this is a widespread problem.
Comment 4 Nathan Froyd [:froydnj] 2012-12-07 15:46:43 PST
Created attachment 689973 [details] [diff] [review]
gather telemetry on how long memory reporters take during telemetry

Easy patch.  I figure a histogram maximum of 5 seconds is long enough.
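
[Illustrative sketch: the actual patch instruments TelemetryPing.js, but the shape of the probe, timing the memory-reporter traversal and accumulating the elapsed milliseconds into the new histogram, is roughly the following. The C++ function name here is made up; only the histogram name comes from the patch.]

#include "mozilla/Telemetry.h"
#include "mozilla/TimeStamp.h"
#include <cstdint>

// Hypothetical probe: time the traversal of the memory reporters and record
// the elapsed milliseconds in the TELEMETRY_MEMORY_REPORTER_MS histogram.
void GatherMemoryWithTiming()
{
  mozilla::TimeStamp start = mozilla::TimeStamp::Now();

  // ... walk the memory reporters and collect their measurements here ...

  uint32_t ms = static_cast<uint32_t>(
      (mozilla::TimeStamp::Now() - start).ToMilliseconds());
  mozilla::Telemetry::Accumulate(
      mozilla::Telemetry::TELEMETRY_MEMORY_REPORTER_MS, ms);
}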
Comment 5 Nathan Froyd [:froydnj] 2012-12-11 12:44:47 PST
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/a2f64f2e1d88
Comment 6 Ed Morley [:emorley] 2012-12-12 02:08:03 PST
https://hg.mozilla.org/mozilla-central/rev/a2f64f2e1d88
Comment 7 :Ehsan Akhgari (busy, don't ask for review please) 2013-01-10 13:26:52 PST
OK, this happened to me again: <http://people.mozilla.com/~bgirard/cleopatra/?customProfile=http://people.mozilla.org/~eakhgari/profiles/telemetry.shame>
Comment 8 :Ehsan Akhgari (busy, don't ask for review please) 2013-01-10 14:55:34 PST
Created attachment 700704 [details]
My TELEMETRY_MEMORY_REPORTER_MS
Comment 9 Vladan Djeric (:vladan) 2013-01-10 16:02:37 PST
Ehsan's MEMORY_FREE_PURGED_PAGES_MS histogram also has a scary entry. We should call a different version of GetResidentMemory that does not explicitly purge any MADV_FREE'd pages from physical memory:

http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsMemoryReporterManager.cpp#283

Ehsan suggested that we should still keep this potentially super-slow version for about:memory.
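
[Illustrative sketch: the fix that eventually lands adds a GetResidentFast helper (see the build log in comment 27), but its implementation is not shown in this bug. On Mac, a purge-free resident read via Mach's task_info might look roughly like this; everything below is an assumption, not the landed code.]

#include <mach/mach.h>
#include <cstdint>

// Hypothetical purge-free resident read: report the resident set as the
// kernel currently sees it, without first forcing MADV_FREE'd pages to be
// reclaimed.  Because of that, it may over-count on Mac.
static bool GetResidentFast(int64_t* aAmount)
{
  mach_task_basic_info info;
  mach_msg_type_number_t count = MACH_TASK_BASIC_INFO_COUNT;
  if (task_info(mach_task_self(), MACH_TASK_BASIC_INFO,
                reinterpret_cast<task_info_t>(&info), &count) != KERN_SUCCESS) {
    return false;
  }
  *aAmount = static_cast<int64_t>(info.resident_size);
  return true;
}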
Comment 10 Vladan Djeric (:vladan) 2013-01-10 16:10:32 PST
This is a pretty good example of why we need Telemetry dashboard triggers; we caught this a year after first landing the probe in bug 693404.
Comment 11 Justin Lebar (not reading bugmail) 2013-01-11 01:14:25 PST
> We should call a different version of GetResidentMemory that does not explicitly purge 
> any MADV_FREE'd pages from physical memory:

That would make the resident measurement on MacOS effectively useless.

In telemetry, I see 0.00% of users on any platform on any version who have MEMORY_FREE_PURGED_PAGES_MS greater than 71ms.  So I wonder why Ehsan is seeing this.
Comment 12 Justin Lebar (not reading bugmail) 2013-01-11 03:18:32 PST
> This is a pretty good example of why we need Telemetry dashboard triggers, we caught this 
> a year after first landing the probe in bug 693404

You mean triggers (e.g. e-mail warnings) based on data in the telemetry database?

I also think we should have this, so we have no disagreement.  But note that per comment 11, these would not have caught this issue.

In my experience, it takes a long time for the Telemetry dashboard folks to implement complex features like this, and you may not get what you want without a lot of back-and-forth.  And of course the dashboard software isn't open-source, so you can't do it yourself.

ISTM that if we instead pushed to get a RESTful API to the telemetry data, then we could implement these warnings -- and all our other feature requests -- ourselves, in a timely, open-source, collaborative manner.
Comment 13 Vladan Djeric (:vladan) 2013-01-11 04:05:34 PST
(In reply to Justin Lebar [:jlebar] from comment #11)
> That would make effectively useless the resident measurement on MacOS.

Do we have an alternative? Could we somehow implement a timeout?

> In telemetry, I see 0.00% of users on any platform on any version who have
> MEMORY_FREE_PURGED_PAGES_MS greater than 71ms.  So I wonder why Ehsan is
> seeing this.

When I select Firefox 17 + Release + Mac OS + after Nov 1st on the dashboard, I see 4.552% of MEMORY_FREE_PURGED_PAGES_MS calls taking 71ms or longer. Also, Telemetry gathers memory information on startup + before daily send + on every cycle-collector-begin event, so many more than ~4.5% of sessions are noticeably affected. I know Ehsan had hundreds of entries in his TELEMETRY_MEMORY_REPORTER_MS histogram when he ran into that hang.

(In reply to Justin Lebar [:jlebar] from comment #12)
> ISTM that if we instead pushed to get a RESTful API to the telemetry data,
> then we could implement these warnings -- and all our other feature requests
> -- ourselves, in a timely, open-source, collaborative manner.

I agree, this is the way forward
Comment 14 Vladan Djeric (:vladan) 2013-01-11 04:11:05 PST
Small correction: Telemetry gathers memory information on every cycle-collector-begin event but at most once per minute
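
[Illustrative sketch of that throttle: the real bookkeeping lives in TelemetryPing.js; the class and member names below are made up, self-contained C++ for illustration only.]

#include <chrono>

// Gate the expensive memory gathering: cycle-collector-begin fires often,
// but we only actually gather at most once per minute.
class MemoryGatherThrottle
{
public:
  bool ShouldGatherNow()
  {
    auto now = std::chrono::steady_clock::now();
    if (mHasRun && now - mLastGather < std::chrono::minutes(1)) {
      return false;
    }
    mHasRun = true;
    mLastGather = now;
    return true;
  }

private:
  bool mHasRun = false;
  std::chrono::steady_clock::time_point mLastGather;
};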
Comment 15 Justin Lebar (not reading bugmail) 2013-01-11 04:19:00 PST
> Do we have an alternative?

Not that I'm aware of, but this is all pretty dark magic, so who knows.

> Could we somehow implement a timeout?

We make multiple calls to mmap in the purge function.  We could definitely check the time after each one and bail if we've taken too long.  But that only helps if the pause is spread out among many of the mmap calls.  If instead just one mmap is taking a long time, that doesn't help.

I don't know which situation we're in.

> I see 4.552% of MEMORY_FREE_PURGED_PAGES_MS calls taking 71ms or longer.

Hm.  Maybe this munmap/mmap pair is causing us to swap.  That could explain these spikes.
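
[Illustrative sketch of the bail-out idea from comment 15, assuming the purge work can be split into per-region chunks. The Region type and PurgeOneRegion helper are hypothetical stand-ins, not the nsMemoryReporterManager.cpp code.]

#include <sys/mman.h>
#include <chrono>
#include <cstddef>
#include <vector>

struct Region { void* addr; size_t len; };

// Hypothetical stand-in for the real per-chunk work: remap a range of
// already-freed (MADV_FREE'd) pages so the kernel drops them immediately.
static void PurgeOneRegion(const Region& aRegion)
{
  mmap(aRegion.addr, aRegion.len, PROT_READ | PROT_WRITE,
       MAP_FIXED | MAP_PRIVATE | MAP_ANON, -1, 0);
}

// Purge regions until a small time budget is exceeded, then bail.  As noted
// above, this only helps if the pause is spread across many mmap calls; a
// single slow mmap would still blow the budget.
static void PurgeFreedPagesWithBudget(const std::vector<Region>& aRegions,
                                      std::chrono::milliseconds aBudget)
{
  auto start = std::chrono::steady_clock::now();
  for (const Region& r : aRegions) {
    PurgeOneRegion(r);
    if (std::chrono::steady_clock::now() - start > aBudget) {
      // Remaining regions stay unpurged, so a subsequent resident
      // measurement may over-count.
      break;
    }
  }
}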
Comment 16 Justin Lebar (not reading bugmail) 2013-01-11 04:21:18 PST
ehsan, do you have a magnetic hdd in your mac?
Comment 17 :Ehsan Akhgari (busy, don't ask for review please) 2013-01-11 11:33:19 PST
(In reply to comment #16)
> ehsan, do you have a magnetic hdd in your mac?

No.  I also have 16 gigs of ram, and I don't swap very often.
Comment 18 Justin Lebar (not reading bugmail) 2013-01-13 23:14:52 PST
> When I select Firefox 17 + Release + Mac OS + after Nov 1st on the dashboard, I see 
> 4.552% of MEMORY_FREE_PURGED_PAGES_MS calls taking 71ms or longer.

For posterity, here's a link to that: http://goo.gl/EmLaR

It pains me greatly to do it, but I think we should simply pref the purging off.  It's more important not to jank even 1% of users than it is to get good memory metrics.

If we want to try out bailing on the munmaps if they're taking too long (totally doable, but I don't know whether it will help, since I don't know why the munmaps are taking so long in the first place), we can experiment with that on the nightly branch by flipping the pref there.  We'd have to figure out how to annotate the RSS with an indication that it's bogus, or otherwise not send it in the telemetry report.

But first things first.
Comment 19 Justin Lebar (not reading bugmail) 2013-01-14 09:07:46 PST
Created attachment 701845 [details] [diff] [review]
Patch, v1

This creates a new memory reporter, "resident-fast", and uses that one in
telemetry.  We use the vanilla "resident" reporter in about:memory and hide
resident-fast.
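
[Illustrative sketch of the split this patch makes. The interface and helper names below are simplified stand-ins for the real nsIMemoryReporter machinery; treat this as an outline of the idea rather than the landed code.]

#include <cstdint>

// Hypothetical measurement helpers.
bool GetResident(int64_t* aAmount);      // purges MADV_FREE'd pages on Mac first (slow, accurate)
bool GetResidentFast(int64_t* aAmount);  // skips the purge (fast, may over-count on Mac)

// Minimal stand-in for the reporter interface.
class MemoryReporter
{
public:
  virtual ~MemoryReporter() = default;
  virtual const char* Path() const = 0;
  virtual bool GetAmount(int64_t* aAmount) = 0;
};

// Used by about:memory, where accuracy matters more than latency.
class MemoryReporter_Resident final : public MemoryReporter
{
public:
  const char* Path() const override { return "resident"; }
  bool GetAmount(int64_t* aAmount) override { return GetResident(aAmount); }
};

// Used by telemetry, where a long main-thread pause is worse than a slightly
// inflated number; it still feeds the existing MEMORY_RESIDENT histogram so
// Windows/Linux continuity is preserved (see comments 20 and 21).
class MemoryReporter_ResidentFast final : public MemoryReporter
{
public:
  const char* Path() const override { return "resident-fast"; }
  bool GetAmount(int64_t* aAmount) override { return GetResidentFast(aAmount); }
};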
Comment 20 Nicholas Nethercote [:njn] 2013-01-14 14:35:42 PST
Comment on attachment 701845 [details] [diff] [review]
Patch, v1

Review of attachment 701845 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/aboutmemory/content/aboutMemory.js
@@ +723,5 @@
> +  //
> +  // Also ignore the resident-fast reporter; we use the vanilla resident
> +  // reporter because it's more important that we get accurate results than
> +  // that we avoid the (small) possibility of a long pause when loading
> +  // about:memory.

My preference is for aboutMemory.js to treat as few reporters specially as possible.  Is there a problem with just showing both of them?  On Mac it'll be particularly useful to get an idea of how inaccurate resident-fast is.

::: toolkit/components/telemetry/TelemetryPing.js
@@ +55,5 @@
>    "js-gc-heap": "MEMORY_JS_GC_HEAP",
>    "js-compartments/system": "MEMORY_JS_COMPARTMENTS_SYSTEM",
>    "js-compartments/user": "MEMORY_JS_COMPARTMENTS_USER",
>    "explicit": "MEMORY_EXPLICIT",
> +  "resident-fast": "MEMORY_RESIDENT",

Should we rename this MEMORY_RESIDENT_FAST?  I understand we'll lose telemetry continuity.  I'll let you decide.
Comment 21 Justin Lebar (not reading bugmail) 2013-01-14 14:53:27 PST
> Is there a problem with just showing both of them?

My thinking was:

 * I want resident-fast to always be there, so we don't have to write special code in telemetry in order to use it.

 * But resident-fast is the same as resident on Windows and Linux.

 * So I'd rather not add a memory reporter that doesn't add any value.

I'd certainly be OK special-casing about:memory to show resident-fast on MacOS only, because I agree that there's some value to seeing it on that platform...

> Should we rename this MEMORY_RESIDENT_FAST?  I understand we'll lose telemetry 
> continuity.

I'd rather not lose continuity on Windows and Linux just because we're changing the MacOS reporter.  I'd be OK losing continuity on MacOS, but then we'd need special-casing in TelemetryPing.js.
Comment 22 Justin Lebar (not reading bugmail) 2013-01-14 18:24:22 PST
Nick, what do you think we should do here?  I'm not wedded to the particular trade-offs I made in this patch if you have a better idea.
Comment 23 Justin Lebar (not reading bugmail) 2013-01-14 18:25:30 PST
>  * So I'd rather not add a memory reporter that doesn't add any value.

I should say, I'd rather not add a memory reporter /to about:memory/ that doesn't add any value.
Comment 24 Nicholas Nethercote [:njn] 2013-01-14 18:31:36 PST
I'm ok with losing Mac continuity.

I'd prefer not special-casing resident-fast in about:memory, but I can live with it if you feel strongly.
Comment 25 Justin Lebar (not reading bugmail) 2013-01-14 18:51:00 PST
https://hg.mozilla.org/integration/mozilla-inbound/rev/622427e981a8
Comment 26 Justin Lebar (not reading bugmail) 2013-01-14 18:54:37 PST
Comment on attachment 701845 [details] [diff] [review]
Patch, v1

Once this bakes, I suspect we should take this on branches.  At least Aurora.

[Approval Request Comment]
* Bug caused by (feature/regressing bug #): Perhaps bug 693404, although it's odd that it took so long for us to hear of this.
* User impact if declined: Hangs for a single-digit percentage of users when we collect telemetry data.
* Testing completed (on m-c, etc.): Needs to bake for a few days, and we need to ensure that we're still getting telemetry for the resident field.
* Risk to taking this patch (and alternatives if risky): Could mess up our telemetry data, so we need to test.
* String or UUID changes made by this patch: None.
Comment 27 Ryan VanderMeulen [:RyanVM] 2013-01-14 19:16:45 PST
(In reply to Justin Lebar [:jlebar] from comment #25)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/622427e981a8

Backed out for build bustage.
https://hg.mozilla.org/integration/mozilla-inbound/rev/69017949e118

https://tbpl.mozilla.org/php/getParsedLog.php?id=18801402&tree=Mozilla-Inbound

../../../xpcom/base/nsMemoryReporterManager.cpp: In member function 'virtual nsresult MemoryReporter_ResidentFast::GetAmount(int64_t*)':
../../../xpcom/base/nsMemoryReporterManager.cpp:402:1034: error: 'GetResidentFast' was not declared in this scope
../../../xpcom/base/nsMemoryReporterManager.cpp:402:1037: error: control reaches end of non-void function
make[6]: *** [nsMemoryReporterManager.o] Error 1
Comment 28 Nicholas Nethercote [:njn] 2013-01-14 19:25:05 PST
jlebar, when you reland can you tweak the resident-fast description to something like this:

This is the same measurement as 'resident', but it tries to be as fast as
possible at the expense of accuracy.  On most platforms this is identical to
the 'resident' measurement, but on Mac it may over-count.  You should use
'resident-fast' where you care about latency of collection (e.g. in telemetry).
Otherwise you should use 'resident'.

Key changes:
- single quotes around reporter names
- talk about measurements instead of reporters
Comment 29 Justin Lebar (not reading bugmail) 2013-01-15 06:15:12 PST
Relanded with new description and hopefully less bustage.

https://hg.mozilla.org/integration/mozilla-inbound/rev/0348a30c19b4
Comment 30 Nicholas Nethercote [:njn] 2013-01-15 11:52:33 PST
The description looks unchanged...
Comment 31 Ed Morley [:emorley] 2013-01-15 15:05:34 PST
https://hg.mozilla.org/mozilla-central/rev/0348a30c19b4
Comment 32 Justin Lebar (not reading bugmail) 2013-01-16 06:30:11 PST
(In reply to Nicholas Nethercote [:njn] from comment #30)
> The description looks unchanged...

Sigh, I changed the vsize description instead of the resident-fast description.
Comment 33 Justin Lebar (not reading bugmail) 2013-01-16 06:34:54 PST
Fixed the descriptions: https://hg.mozilla.org/integration/mozilla-inbound/rev/cca86ae200b1
Comment 34 Alex Keybl [:akeybl] 2013-01-16 10:20:14 PST
Comment on attachment 701845 [details] [diff] [review]
Patch, v1

Let's land in Aurora and verify our Telemetry data before considering for Beta.
Comment 35 Ed Morley [:emorley] 2013-01-16 12:26:25 PST
https://hg.mozilla.org/mozilla-central/rev/cca86ae200b1
Comment 36 Alex Keybl [:akeybl] 2013-01-18 16:09:20 PST
Justin - can you verify whether Telemetry data looks better now that this has uplifted to Aurora?
Comment 37 Justin Lebar (not reading bugmail) 2013-01-18 18:32:41 PST
(In reply to Alex Keybl [:akeybl] from comment #36)
> Justin - can you verify whether Telemetry data looks better now that this
> has uplifted to Aurora?

I checked earlier today and there wasn't enough data yet.  I'll check again on Monday.
Comment 38 Alex Keybl [:akeybl] 2013-01-22 06:38:35 PST
Comment on attachment 701845 [details] [diff] [review]
Patch, v1

a=akeybl if the telemetry is looking good. We're going to build beta 3 at EOD today, so it's unlikely this will make FF19 if it isn't ready to land until after today.
Comment 39 Justin Lebar (not reading bugmail) 2013-01-22 07:36:55 PST
To my eyes, there's no difference in the TELEMETRY_MEMORY_REPORTER_MS histogram before and after 1/17 on the nightly channel on MacOS.  We still have very little data (~300 samples) from after this change.

(Actually, we have very little data in general.  I thought we turned on telemetry by default for all nightly users.  If this histogram is getting multiple entries per session, why do I only see a few hundred samples per day?)

But perhaps release users hit this edge case, whatever it is, more than nightly users.  Unfortunately Aurora was closed and so I couldn't get any data from that tree.

I think we should just push this and note that there's some problem with telemetry here.
Comment 40 Justin Lebar (not reading bugmail) 2013-01-22 07:52:47 PST
https://hg.mozilla.org/releases/mozilla-aurora/rev/c635335afd9f
Comment 41 Justin Lebar (not reading bugmail) 2013-01-22 11:16:03 PST
https://hg.mozilla.org/releases/mozilla-beta/rev/3ec9a94061f9
Comment 42 Vladan Djeric (:vladan) 2013-01-22 12:35:04 PST
(In reply to Justin Lebar [:jlebar] from comment #39)
> To my eyes, there's no difference in the telemetry_memory_reporter_ms
> histogram before and after 1/17 on the nightly channel on macos.

I see a marked drop starting with the January 16th build in the Evolution dashboard:

http://tinyurl.com/a9lda66

The mean, std dev and 95th percentile all decreased substantially :)
Comment 43 Justin Lebar (not reading bugmail) 2013-01-22 12:45:18 PST
> I see a marked drop starting with the January 16th build in the Evolution dashboard

Okay, that's great!

When I select a date in the telemetry histogram at the top, is that filtering by date received instead of build date?

  http://goo.gl/cSVbb

That's probably my error.

Do you know the trick for getting the telemetry histogram to filter by buildid?  Every time I try, it just refreshes the page, clears the buildids I entered, and doesn't change the histogram.
Comment 44 Justin Lebar (not reading bugmail) 2013-01-22 12:49:03 PST
I see, the buildid thing is an autocomplete field, and if you don't let it autocomplete for you, it will just silently fail.  Cool.

I tried filtering by buildid and didn't get anywhere, but whatever, the telemetry evolution data is good enough for me.  Thanks for checking that.
Comment 45 Paul Silaghi, QA [:pauly] 2013-01-24 06:33:38 PST
(In reply to Alex Keybl [:akeybl] from comment #36)
> Justin - can you verify whether Telemetry data looks better now that this
> has uplifted to Aurora?

Is there anything QA can do here?
