Closed Bug 789975 Opened 12 years ago Closed 11 years ago

Extremely long pause while collecting telemetry information on the main thread

Categories

(Toolkit :: Telemetry, defect)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla21

Tracking Status
firefox19 --- fixed
firefox20 --- fixed
firefox21 --- fixed

People

(Reporter: ehsan.akhgari, Assigned: justin.lebar+bug)

Details

(Whiteboard: [Snappy:p1][qa?])

Attachments

(3 files)

See this profile.  This seems to be related to gathering information from the memory reporters.

http://people.mozilla.com/~bgirard/cleopatra/?report=40ea5d43a22eeee73a45e1d763e4c6142e202133
Whiteboard: [Snappy]
The sample indicates that the majority of the time is spent in gatherMemory(), in TelemetryPing.js, which iterates through the single reporters and collects measurements for a number of them.

One question:  is this a debug or opt build?  There are two ways to measure "explicit" -- a slow way and a fast way.  In debug builds we do both and check they give the same result.  So if this is a debug build it's not surprising, and I'll argue for WONTFIX.  If it's an opt build, that's more interesting...
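To make the slow/fast distinction concrete, here is a minimal C++ sketch of the debug-only cross-check described in this comment. The helper functions and placeholder values are hypothetical stand-ins, not the actual Gecko code.

// Minimal sketch of the debug-build cross-check on "explicit".  The helpers
// below are illustrative stand-ins, not the real memory-reporter code.
#include <cstdint>
#include <cassert>

static int64_t ComputeExplicitFast()   // cheap estimate
{
  return 0;  // placeholder value for the sketch
}

static int64_t ComputeExplicitSlow()   // walks every reporter
{
  return 0;  // placeholder value for the sketch
}

static int64_t GetExplicit()
{
  int64_t fast = ComputeExplicitFast();
#ifdef DEBUG
  // Debug builds also run the slow path and verify the two agree, which is
  // why a long pause here would be unsurprising in a debug build.
  assert(fast == ComputeExplicitSlow());
#endif
  return fast;
}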
(In reply to comment #1)
> The sample indicates that the majority of the time is spent in gatherMemory(),
> in TelemetryPing.js, which iterates through the single reporters and collects
> measurements for a number of them.
> 
> One question:  is this a debug or opt build?  There are two ways to measure
> "explicit" -- a slow way and a fast way.  In debug builds we do both and check
> they give the same result.  So if this is a debug build it's not surprising,
> and I'll argue for WONTFIX.  If it's an opt build, that's more interesting...

This was on the latest nightly on Mac.  And just to put some context here: there was a multi-second pause, only part of which is captured in the profile.
Let's add telemetry to measure how long memory gathering takes, so we can figure out whether this is actually a problem.
Whiteboard: [Snappy] → [Snappy:p1]
Assignee: nobody → nfroyd
Easy patch.  I figure a max of 5 seconds is long enough.
Attachment #689973 - Flags: review?(vdjeric)
Attachment #689973 - Flags: review?(vdjeric) → review+
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/a2f64f2e1d88
Whiteboard: [Snappy:p1] → [Snappy:p1][leave open]
Ehsan's MEMORY_FREE_PURGED_PAGES_MS histogram also has a scary entry. We should call a different version of GetResidentMemory that does not explicitly purge any MADV_FREE'd pages from physical memory:

http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsMemoryReporterManager.cpp#283

Ehsan suggested that we should still keep this potentially super-slow version for about:memory.
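A rough, macOS-only sketch of the two measurement paths being discussed, assuming task_info() for the RSS read. The purge helper is only a stand-in for the slow code linked above, and the function names simply mirror the ones that appear later in this bug.

// Sketch of the two paths (macOS only).  task_info() is the real Mach API
// for resident_size; PurgeMadvFreedPages() is a stand-in for the slow purge
// step in nsMemoryReporterManager.cpp.
#include <mach/mach.h>
#include <cstdint>

static bool GetResidentFast(int64_t* aAmount)
{
  mach_task_basic_info info;
  mach_msg_type_number_t count = MACH_TASK_BASIC_INFO_COUNT;
  if (task_info(mach_task_self(), MACH_TASK_BASIC_INFO,
                reinterpret_cast<task_info_t>(&info), &count) != KERN_SUCCESS) {
    return false;
  }
  // Counts MADV_FREE'd pages the OS hasn't reclaimed yet, so it can
  // over-count on Mac, but it returns quickly.
  *aAmount = static_cast<int64_t>(info.resident_size);
  return true;
}

static void PurgeMadvFreedPages()
{
  // Stand-in for the slow step: the real code (linked above) remaps/touches
  // MADV_FREE'd regions so the kernel drops them from the resident set.
}

static bool GetResident(int64_t* aAmount)
{
  PurgeMadvFreedPages();  // accurate, but this is the potentially slow part
  return GetResidentFast(aAmount);
}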
This is a pretty good example of why we need Telemetry dashboard triggers: we caught this a year after first landing the probe in bug 693404.
> We should call a different version of GetResidentMemory that does not explicitly purge 
> any MADV_FREE'd pages from physical memory:

That would make the resident measurement on MacOS effectively useless.

In telemetry, I see 0.00% of users on any platform on any version who have MEMORY_FREE_PURGED_PAGES_MS greater than 71ms.  So I wonder why Ehsan is seeing this.
> This is a pretty good example of why we need Telemetry dashboard triggers, we caught this 
> a year after first landing the probe in bug 693404

You mean triggers (e.g. e-mail warnings) based on data in the telemetry database?

I also think we should have this, so we have no disagreement.  But note that per comment 11, these would not have caught this issue.

In my experience, it takes a long time for the Telemetry dashboard folks to implement complex features like this, and you may not get what you want without a lot of back-and-forth.  And of course the dashboard software isn't open-source, so you can't do it yourself.

ISTM that if we instead pushed to get a RESTful API to the telemetry data, then we could implement these warnings -- and all our other feature requests -- ourselves, in a timely, open-source, collaborative manner.
(In reply to Justin Lebar [:jlebar] from comment #11)
> That would make effectively useless the resident measurement on MacOS.

Do we have an alternative? Could we somehow implement a timeout?

> In telemetry, I see 0.00% of users on any platform on any version who have
> MEMORY_FREE_PURGED_PAGES_MS greater than 71ms.  So I wonder why Ehsan is
> seeing this.

When I select Firefox 17 + Release + Mac OS + after Nov 1st on the dashboard, I see 4.552% of MEMORY_FREE_PURGED_PAGES_MS calls taking 71ms or longer. Also, Telemetry gathers memory information on startup + before daily send + on every cycle-collector-begin event, so that means many more than ~4.5% of such sessions are noticeably affected. I know Ehsan had hundreds of entries in his TELEMETRY_MEMORY_REPORTER_MS histogram when he ran into that hang.

(In reply to Justin Lebar [:jlebar] from comment #12)
> ISTM that if we instead pushed to get a RESTful API to the telemetry data,
> then we could implement these warnings -- and all our other feature requests
> -- ourselves, in a timely, open-source, collaborative manner.

I agree; this is the way forward.
Small correction: Telemetry gathers memory information on every cycle-collector-begin event, but at most once per minute.
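For illustration, the once-per-minute gate amounts to a simple throttle like the following. The real throttle lives in TelemetryPing.js; this self-contained C++ version, with made-up names, only shows the shape of the check.

// Illustration of the "at most once per minute" gate; names are hypothetical.
#include <chrono>

static std::chrono::steady_clock::time_point sLastGather;
static bool sGatheredOnce = false;

// Returns true if at least a minute has passed since the last gather.
static bool ShouldGatherMemory()
{
  auto now = std::chrono::steady_clock::now();
  if (sGatheredOnce && now - sLastGather < std::chrono::minutes(1)) {
    return false;  // a gather ran less than a minute ago; skip this CC event
  }
  sGatheredOnce = true;
  sLastGather = now;
  return true;
}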
> Do we have an alternative?

Not that I'm aware of, but this is all pretty dark magic, so who knows.

> Could we somehow implement a timeout?

We make multiple calls to mmap in the purge function.  We could definitely check the time after each one and bail if we've taken too long.  But that only helps if the pause is spread out among many of the mmap calls.  If instead just one mmap is taking a long time, that doesn't help.

I don't know which situation we're in.
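A sketch of that bail-out idea, assuming the purge can be treated as a loop over regions with a per-iteration deadline check. PurgeOneRegion() is a stand-in, not the real function; the actual purge loop is in nsMemoryReporterManager.cpp.

// Sketch of "check the time after each one and bail if we've taken too long".
#include <chrono>

static void PurgeOneRegion(int aRegion)
{
  // Stand-in for the per-region mmap/madvise work the real purge does.
  (void)aRegion;
}

static void PurgeWithDeadline(int aRegionCount, int aBudgetMs)
{
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < aRegionCount; i++) {
    PurgeOneRegion(i);
    if (std::chrono::steady_clock::now() - start >
        std::chrono::milliseconds(aBudgetMs)) {
      // Bail: any pages not yet purged still count toward RSS, so a
      // measurement taken now would have to be flagged as approximate.
      return;
    }
  }
}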

> I see 4.552% of MEMORY_FREE_PURGED_PAGES_MS calls taking 71ms or longer.

Hm.  Maybe this munmap/mmap pair is causing us to swap.  That could explain these spikes.
ehsan, do you have a magnetic hdd in your mac?
(In reply to comment #16)
> ehsan, do you have a magnetic hdd in your mac?

No.  I also have 16 gigs of ram, and I don't swap very often.
> When I select Firefox 17 + Release + Mac OS + after Nov 1st on the dashboard, I see 
> 4.552% of MEMORY_FREE_PURGED_PAGES_MS calls taking 71ms or longer.

For posterity, here's a link to that: http://goo.gl/EmLaR

It pains me greatly to do it, but I think we should simply pref the purging off.  It's more important not to jank even 1% of users than it is to get good memory metrics.

If we want to try out bailing on the munmaps if they're taking too long (totally doable, but I don't know whether it will help, since I don't know why the munmaps are taking so long in the first place), we can experiment with that on the nightly branch by flipping the pref there.  We'd have to figure out how to annotate the RSS with an indication that it's bogus, or otherwise not send it in the telemetry report.

But first things first.
Assignee: nfroyd → justin.lebar+bug
Attached patch Patch, v1 (Splinter Review)
This creates a new memory reporter, "resident-fast", and uses that one in
telemetry.  We use the vanilla "resident" reporter in about:memory and hide
resident-fast.
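A simplified, self-contained sketch of what such a reporter amounts to. The real class implements nsIMemoryReporter and is registered in nsMemoryReporterManager.cpp; the class and method names follow the build log later in this bug, but the return types, body, and registration here are illustrative only.

// Illustrative sketch of the resident-fast reporter wrapper.
#include <cstdint>

static bool GetResidentFast(int64_t* aAmount)
{
  *aAmount = 0;  // stand-in for the fast RSS query (no MADV_FREE purge)
  return true;
}

class MemoryReporter_ResidentFast
{
public:
  const char* Path() const { return "resident-fast"; }
  int GetAmount(int64_t* aAmount)
  {
    // 0 stands in for NS_OK and -1 for failure, to keep the sketch standalone.
    return GetResidentFast(aAmount) ? 0 : -1;
  }
};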
Attachment #701845 - Flags: review?(n.nethercote)
Comment on attachment 701845 [details] [diff] [review]
Patch, v1

Review of attachment 701845 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/aboutmemory/content/aboutMemory.js
@@ +723,5 @@
> +  //
> +  // Also ignore the resident-fast reporter; we use the vanilla resident
> +  // reporter because it's more important that we get accurate results than
> +  // that we avoid the (small) possibility of a long pause when loading
> +  // about:memory.

My preference is for aboutMemory.js to treat as few reporters specially as possible.  Is there a problem with just showing both of them?  On Mac it'll be particularly useful to get an idea of how inaccurate resident-fast is.

::: toolkit/components/telemetry/TelemetryPing.js
@@ +55,5 @@
>    "js-gc-heap": "MEMORY_JS_GC_HEAP",
>    "js-compartments/system": "MEMORY_JS_COMPARTMENTS_SYSTEM",
>    "js-compartments/user": "MEMORY_JS_COMPARTMENTS_USER",
>    "explicit": "MEMORY_EXPLICIT",
> +  "resident-fast": "MEMORY_RESIDENT",

Should we rename this MEMORY_RESIDENT_FAST?  I understand we'll lose telemetry continuity.  I'll let you decide.
Attachment #701845 - Flags: review?(n.nethercote) → review+
> Is there a problem with just showing both of them?

My thinking was:

 * I want resident-fast to always be there, so we don't have to write special code in telemetry in order to use it.

 * But resident-fast is the same as resident on Windows and Linux.

 * So I'd rather not add a memory reporter that doesn't add any value.

I'd certainly be OK special-casing about:memory to show resident-fast on MacOS only, because I agree that there's some value to seeing it on that platform...

> Should we rename this MEMORY_RESIDENT_FAST?  I understand we'll lose telemetry 
> continuity.

I'd rather not lose continuity on Windows and Linux just because we're changing the MacOS reporter.  I'd be OK losing continuity on MacOS, but then we'd need special-casing in TelemetryPing.js.
Nick, what do you think we should do here?  I'm not wed to the particular trade-offs I made in this patch if you have a better idea.
>  * So I'd rather not add a memory reporter that doesn't add any value.

I should say, I'd rather not add a memory reporter /to about:memory/ that doesn't add any value.
I'm ok with losing Mac continuity.

I'd prefer not special-casing resident-fast in about:memory, but I can live with it if you feel strongly.
Comment on attachment 701845 [details] [diff] [review]
Patch, v1

Once this bakes, I suspect we should take this on branches.  At least Aurora.

[Approval Request Comment]
* Bug caused by (feature/regressing bug #): Perhaps bug 693404, although it's odd that it took so long for us to hear of this.
* User impact if declined: Hangs for a single-digit percentage of users when we collect telemetry data.
* Testing completed (on m-c, etc.): Needs to bake for a few days, and we need to ensure that we're still getting telemetry for the resident field.
* Risk to taking this patch (and alternatives if risky): Could mess up our telemetry data, so we need to test.
* String or UUID changes made by this patch: None.
Attachment #701845 - Flags: approval-mozilla-beta?
Attachment #701845 - Flags: approval-mozilla-aurora?
(In reply to Justin Lebar [:jlebar] from comment #25)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/622427e981a8

Backed out for build bustage.
https://hg.mozilla.org/integration/mozilla-inbound/rev/69017949e118

https://tbpl.mozilla.org/php/getParsedLog.php?id=18801402&tree=Mozilla-Inbound

../../../xpcom/base/nsMemoryReporterManager.cpp: In member function 'virtual nsresult MemoryReporter_ResidentFast::GetAmount(int64_t*)':
../../../xpcom/base/nsMemoryReporterManager.cpp:402:1034: error: 'GetResidentFast' was not declared in this scope
../../../xpcom/base/nsMemoryReporterManager.cpp:402:1037: error: control reaches end of non-void function
make[6]: *** [nsMemoryReporterManager.o] Error 1
jlebar, when you reland can you tweak the resident-fast description to something like this:

This is the same measurement as 'resident', but it tries to be as fast as
possible at the expense of accuracy.  On most platforms this is identical to
the 'resident' measurement, but on Mac it may over-count.  You should use
'resident-fast' where you care about latency of collection (e.g. in telemetry).
Otherwise you should use 'resident'.

Key changes:
- single quotes around reporter names
- talk about measurements instead of reporters
Relanded with new description and hopefully less bustage.

https://hg.mozilla.org/integration/mozilla-inbound/rev/0348a30c19b4
The description looks unchanged...
(In reply to Nicholas Nethercote [:njn] from comment #30)
> The description looks unchanged...

Sigh, I changed the vsize description instead of the resident-fast description.
Comment on attachment 701845 [details] [diff] [review]
Patch, v1

Let's land in Aurora and verify our Telemetry data before considering for Beta.
Attachment #701845 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [Snappy:p1][leave open] → [Snappy:p1]
Justin - can you verify whether Telemetry data looks better now that this has uplifted to Aurora?
Flags: needinfo?(justin.lebar+bug)
(In reply to Alex Keybl [:akeybl] from comment #36)
> Justin - can you verify whether Telemetry data looks better now that this
> has uplifted to Aurora?

I checked earlier today and there wasn't enough data yet.  I'll check again on Monday.
Flags: needinfo?(justin.lebar+bug)
Flags: needinfo?(justin.lebar+bug)
Comment on attachment 701845 [details] [diff] [review]
Patch, v1

a=akeybl if the telemetry is looking good. We're going to build with beta 3 at EOD today, so it's unlikely this will make FF19 if it isn't ready to land until after today.
Attachment #701845 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
To my eyes, there's no difference in the telemetry_memory_reporter_ms histogram before and after 1/17 on the nightly channel on macos.  We still have very little data (~300 samples) from after this change.

(Actually, we have very little data in general.  I thought we turned on telemetry by default for all nightly users.  If this histogram is getting multiple entries per session, why do I only see a few hundred samples per day?)

But perhaps release users hit this edge case, whatever it is, more than nightly users.  Unfortunately Aurora was closed and so I couldn't get any data from that tree.

I think we should just push this and note that there's some problem with telemetry here.
Flags: needinfo?(justin.lebar+bug)
(In reply to Justin Lebar [:jlebar] from comment #39)
> To my eyes, there's no difference in the telemetry_memory_reporter_ms
> histogram before and after 1/17 on the nightly channel on macos.

I see a marked drop starting with the January 16th build in the Evolution dashboard:

http://tinyurl.com/a9lda66

The mean, std dev and 95th percentile all decreased substantially :)
> I see a marked drop starting with the January 16th build in the Evolution dashboard

Okay, that's great!

When I select a date in the telemetry histogram at the top, is that filtering by date received instead of build date?

  http://goo.gl/cSVbb

That's probably my error.

Do you know the trick for getting the telemetry histogram to filter by buildid?  Every time I try, it just refreshes the page, clears the buildids I entered, and doesn't change the histogram.
I see, the buildid thing is an autocomplete field, and if you don't let it autocomplete for you, it will just silently fail.  Cool.

I tried filtering by buildid and didn't get anywhere, but whatever, the telemetry evolution data is good enough for me.  Thanks for checking that.
(In reply to Alex Keybl [:akeybl] from comment #36)
> Justin - can you verify whether Telemetry data looks better now that this
> has uplifted to Aurora?

Is there anything QA can do here?
Whiteboard: [Snappy:p1] → [Snappy:p1][qa?]