Last Comment Bug 663423 - Memory usage reports via telemetry
: Memory usage reports via telemetry
Status: RESOLVED FIXED
[MemShrink:P1][inbound]
:
Product: Core
Classification: Components
Component: General (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla7
Assigned To: Nicholas Nethercote [:njn] (on vacation until July 11)
:
Mentors:
Depends on: 660731
Blocks: 585196 MemShrinkTools
  Show dependency treegraph
 
Reported: 2011-06-10 09:35 PDT by Nicholas Nethercote [:njn] (on vacation until July 11)
Modified: 2011-07-04 11:42 PDT (History)
13 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch 1 (552 bytes, patch)
2011-06-30 21:11 PDT, Nicholas Nethercote [:njn] (on vacation until July 11)
taras.mozilla: review+
Details | Diff | Review
patch 2 (4.75 KB, patch)
2011-06-30 21:12 PDT, Nicholas Nethercote [:njn] (on vacation until July 11)
taras.mozilla: review+
Details | Diff | Review
enforce mem reporters (1.67 KB, patch)
2011-07-01 15:29 PDT, (dormant account)
n.nethercote: review+
Details | Diff | Review
patch 1(for checkin) (730 bytes, patch)
2011-07-01 15:53 PDT, (dormant account)
justin.lebar+bug: review+
Details | Diff | Review
UNITS_COUNT fix. (2.40 KB, patch)
2011-07-02 09:24 PDT, Justin Lebar (not reading bugmail)
taras.mozilla: review-
n.nethercote: feedback+
Details | Diff | Review
Fix for UNITS_COUNT v2 (2.43 KB, patch)
2011-07-02 15:57 PDT, Justin Lebar (not reading bugmail)
taras.mozilla: review+
Details | Diff | Review

Description Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-10 09:35:37 PDT
Now that telemetry is working, it'd be great to get some kind of data gathering and analysis relating to memory usage.
Comment 1 (dormant account) 2011-06-10 09:37:46 PDT
We have *some* kind of reporting already. http://people.mozilla.com/~tglek/telemetry/log.html lists aggregate info on some of the about:memory stuff
Comment 2 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-10 09:39:15 PDT
Great!  So this bug is a matter of working out what more we can do with that data, if we can automate any kind of tracking, etc.
Comment 3 (dormant account) 2011-06-10 09:41:03 PDT
(In reply to comment #2)
> Great!  So this bug is a matter of working out what more we can do with that
> data, if we can automate any kind of tracking, etc.

Well, I put those probes in as an example. I would be happy if you could suggest refinements to that or a better approach altogether.
Comment 4 Justin Lebar (not reading bugmail) 2011-06-21 13:25:31 PDT
Here's the list of currently-tracked memory reporters, from TelemtryPing.js:

  "explicit/js/gc-heap": [1024, 512 * 1024, 10],
  "resident": [32 * 1024, 1024 * 1024, 10],
  "explicit/layout/all": [1024, 64 * 1025, 10],
Comment 5 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-21 16:30:39 PDT
We should do "explicit" as well, that's arguably the single most interesting number.  That'll require bug 660731.
Comment 6 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-24 00:00:10 PDT
http://blog.mozilla.com/tglek/2011/06/22/developers-how-to-submit-telemetry-data/ is useful info for this bug.
Comment 7 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-26 23:47:45 PDT
explicit/js/gc-heap-chunk-unused (from bug 661474) would also be interesting, for an indication of fragmentation on the JS heap.
Comment 8 (dormant account) 2011-06-29 10:10:32 PDT
(In reply to comment #7)
> explicit/js/gc-heap-chunk-unused (from bug 661474) would also be
> interesting, for an indication of fragmentation on the JS heap.

is there a reason you didn't include a telemetry probe in that bug?
Comment 9 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-29 16:19:11 PDT
(In reply to comment #8)
> 
> is there a reason you didn't include a telemetry probe in that bug?

I am planning to do all the new telemetry additions in a single patch here :)
Comment 10 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-30 21:11:53 PDT
Created attachment 543339 [details] [diff] [review]
patch 1

So the memory reporting in telemetry is completely busted.  This first patch fixes it.  Is this stuff tested anywhere?
Comment 11 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-30 21:12:54 PDT
Created attachment 543340 [details] [diff] [review]
patch 2

This patch: 
- Adds three extra memory measurements to telemetry.
- Removes |memReporters| from |gatherMemory|, because it is dead.

It relies on mgr.explicit which is added by the patch in bug 660731.
Comment 12 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-06-30 21:15:23 PDT
BTW, I didn't do explicit/js/gc-heap-chunk-unused because that's computed via a multi-reporter (bug 666075) and there's currently no way to extract a single number from a multi-reporter.
Comment 13 (dormant account) 2011-07-01 10:00:05 PDT
(In reply to comment #10)
> Created attachment 543339 [details] [diff] [review] [review]
> patch 1
> 
> So the memory reporting in telemetry is completely busted.  This first patch
> fixes it.  Is this stuff tested anywhere?

Not yet. I'm not sure of the best way to test it. I was thinking of adding a NS_WARNING to the for loop if every memory reporter wasn't found (or if some of them are reporting 0, which seems crazy). Turns out there isn't an NS_WARNING equivalent from JS.
Comment 14 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-01 14:24:57 PDT
(In reply to comment #13)
> Turns out there isn't an NS_WARNING equivalent from JS.

JS_ASSERT!  (Yes, I'm serious;  if this is busted I want to know, not have a warning that gets lost in the 100s of other warnings :)
Comment 15 (dormant account) 2011-07-01 14:25:37 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > Turns out there isn't an NS_WARNING equivalent from JS.
> 
> JS_ASSERT!  (Yes, I'm serious;  if this is busted I want to know, not have a
> warning that gets lost in the 100s of other warnings :)

I meant from .js
Comment 16 (dormant account) 2011-07-01 14:26:26 PDT
Also while we do this, can we check if val != 0 before adding it to a histogram, I don't know why MEMORY_RESIDENT would be 0 so often.
Comment 17 (dormant account) 2011-07-01 15:29:37 PDT
Created attachment 543528 [details] [diff] [review]
enforce mem reporters

The above patches use NS_ASSERT, which isn't available in this module(and is apparently dangerous too as it shows up as a modal dialog).

Turns out there is no good way to assert(that I can see) from js.
Comment 18 (dormant account) 2011-07-01 15:53:56 PDT
Created attachment 543529 [details] [diff] [review]
patch 1(for checkin)

I don't want to take the risk of branching for aurora with broken memory reporters. Lets checkin the fix right away.
Comment 19 Justin Lebar (not reading bugmail) 2011-07-01 15:58:00 PDT
Comment on attachment 543529 [details] [diff] [review]
patch 1(for checkin)

>+      if (val)
>+        h.add(val);

Why are we excluding zero values?  It's perfectly valid for the number of hard page faults since the last sample to be 0, for instance.
Comment 20 Justin Lebar (not reading bugmail) 2011-07-01 16:35:51 PDT
(In reply to comment #19)
> Why are we excluding zero values?  It's perfectly valid for the number of
> hard page faults since the last sample to be 0, for instance.

Taras and I had a long argument about this on IRC.

He thinks that 0 values aren't interesting, because they're a non-event.  In the unlikely scenario that you want to compute the proportion of time spent not swapping versus swapping, you can infer the number of 0s that are missing by looking at some other memory reporter.  Also, since presumably the 0s bucket will be much larger than the other buckets, including it will make the histogram ugly.

I think the histogram makes no sense without the 0 values.  You *always* want to compare the number of swap events to the number of non-swap events.  Otherwise you're just learning a number which is strongly correlated with how long the browser has been running.  If the histogram can't display this well, then it's bugged.

Anyway, Taras needs this hack in there because some other reporters aren't working properly -- RSS reports 0 about half the time, for reasons which are unclear to me.  We agreed that the patch gets r=me if this check is called out as a hack and we commit to fix it in the next cycle.
Comment 21 (dormant account) 2011-07-01 16:40:29 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/e5de7146ac19

please don't close this bug on mc merge :)
Comment 22 Justin Lebar (not reading bugmail) 2011-07-01 16:40:51 PDT
> You *always* want to compare the number of swap events to the number of non-swap events.

For instance, I might want to know: "Is FF becoming less swappy for users?"  I'd compute this by comparing the number of events below a threshold (say less than 1000 faults / min) to those above a threshold.  The 0s are crucial here.

(This further shows why it doesn't make sense to exclude 0s: A value of 0 faults / min isn't a lot different from 2 faults / min.  Yet only one makes it into the histogram?)
Comment 23 Justin Lebar (not reading bugmail) 2011-07-01 17:33:59 PDT
Also, WRT the point that you can infer the number of 0s by examining another memory reporter: That only works if the memory reporters aren't randomly returning 0.  :)
Comment 24 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-01 21:35:13 PDT
Comment on attachment 543528 [details] [diff] [review]
enforce mem reporters

Review of attachment 543528 [details] [diff] [review]:
-----------------------------------------------------------------
Comment 25 Marco Bonardo [::mak] 2011-07-02 03:07:24 PDT
merged comment 21
http://hg.mozilla.org/mozilla-central/rev/e5de7146ac19
Comment 26 Justin Lebar (not reading bugmail) 2011-07-02 09:24:52 PDT
Created attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

I tested this and didn't see 0s coming from anywhere except hard-page-faults, where it's expected.

If RSS is still reporting 0 for some unknown reason, I don't think we should hide it, as we currently do.  Since it's possible that the 0s are reported in place of certain values (as we don't know where they were coming from), taking out the 0s entirely kind of invalidates the RSS measure.

If you really feel strongly that 0 hard page faults in an interval shouldn't be reported, then let's exclude 0s for UNITS_COUNT, but not for other units.  Although I still feel like that's entirely the wrong thing to do.  Maybe njn can referee...
Comment 27 Justin Lebar (not reading bugmail) 2011-07-02 09:27:17 PDT
Comment on attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

Taras, do you think we can get this in before we branch?
Comment 28 (dormant account) 2011-07-02 15:10:02 PDT
Comment on attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

>-      // hack to deal with some memory reporters returning 0
>-      if (val)
>-        h.add(val);
>+      h.add(val);
This is still needed for UNITS_BYTES.
Comment 29 Justin Lebar (not reading bugmail) 2011-07-02 15:17:30 PDT
Comment on attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

Maybe njn can arbitrate; would you be OK with that, Taras?
Comment 30 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-02 15:19:18 PDT
I'd like to include zero counts for byte measurements.  We should never get zero for them, AFAIK -- memory reporters return -1 if they fail -- so if we do that is interesting, and probably a sign something is buggy.
Comment 31 Justin Lebar (not reading bugmail) 2011-07-02 15:20:09 PDT
I think we all agree with including the zeros for byte measurements.  What about for counts?
Comment 32 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-02 15:20:15 PDT
Comment on attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

Review of attachment 543613 [details] [diff] [review]:
-----------------------------------------------------------------
Comment 33 (dormant account) 2011-07-02 15:22:27 PDT
(In reply to comment #30)
> I'd like to include zero counts for byte measurements.  We should never get
> zero for them, AFAIK -- memory reporters return -1 if they fail -- so if we
> do that is interesting, and probably a sign something is buggy.

Ok, so r+ with a fix that checks for -1 and does a continue.

Math.round(-1/1024)==0 :)
Comment 34 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-02 15:27:15 PDT
(In reply to comment #31)
> I think we all agree with including the zeros for byte measurements.  What
> about for counts?

I don't think we should throw away any valid measurements, ie. measurements where nothing went wrong when taking the measurement.

Do we need to check for -1 for page fault counts?
Comment 35 Justin Lebar (not reading bugmail) 2011-07-02 15:31:02 PDT
> Do we need to check for -1 for page fault counts?

Yes, they return -1 on failure.
Comment 36 Justin Lebar (not reading bugmail) 2011-07-02 15:57:04 PDT
Created attachment 543627 [details] [diff] [review]
Fix for UNITS_COUNT v2
Comment 37 (dormant account) 2011-07-02 16:26:23 PDT
Comment on attachment 543627 [details] [diff] [review]
Fix for UNITS_COUNT v2

Thanks, happy to see a proper fix for this.
Please triplecheck that this works as expected on your machine.
Comment 38 Justin Lebar (not reading bugmail) 2011-07-02 17:48:05 PDT
Checked one last time that it works on my machine, and pushed to inbound.

http://hg.mozilla.org/integration/mozilla-inbound/rev/016fcb800fc2
Comment 39 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-03 23:19:04 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/1ec0339ac9a1

That is "patch 2" with some minor rebasing.

This should be the last patch for this bug, so when it's merged to m-c this bug can be marked as RESOLVED FIXED.  Thanks.
Comment 40 Marco Bonardo [::mak] 2011-07-04 05:05:27 PDT
http://hg.mozilla.org/mozilla-central/rev/016fcb800fc2
Comment 41 Marco Bonardo [::mak] 2011-07-04 05:06:38 PDT
ehr, still missing part 2 from this merge.
Comment 42 Marco Bonardo [::mak] 2011-07-04 11:42:59 PDT
http://hg.mozilla.org/mozilla-central/rev/1ec0339ac9a1

Note You need to log in before you can comment on or make changes to this bug.