Status

()

defect
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: njn, Assigned: njn)

Tracking

Trunk
mozilla48
Points:
---

Firefox Tracking Flags

(firefox47 affected, firefox48 fixed)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(2 attachments)

DMD currently uses a somewhat dodgy sampling algorithm:

- Heap blocks larger than the threshold, T, (which defaults to 4093) are always recorded accurately.

- Heap blocks smaller than T are "sampled". In the common case such blocks are ignored. We keep a running counter of how many bytes of small blocks have been skipped, and when that counter hits T, we pretend the current heap block has size T, record it, and then reset the counter.

Bad things about this:

- I'm not convinced the sampling is "fair".

- You don't get an accurate record of how many blocks were actually allocated.

I plan to make two major changes:

- Record the size of all blocks accurately, but only record allocation stack traces for some of them. That way we have exact block sizes and counts, but some of them will be in a "no stack trace" bucket.

- Use Bernoulli trials to decide which blocks get a stack trace.

I will preserve the existing "record everything precisely" mode, because that is often still quite useful.
We chatted about this on IRC; here's what I remember of our conclusions:

You want to use the FastBernoulliTrial::trial(size_t n) method, where n is the size of the block being allocated in bytes. This is equivalent to marking or not marking each byte in the allocation stream with probability P, and then sampling the block if any of the bytes are tagged. DMD's existing behavior with respect to "large" blocks is, I think, a less principled attempt to address the same intuition: that larger allocations are more important than small allocation.

When you do record an allocation, the size should be the total of all the allocations you've made since the last record. That way, the sizes in the record will not correspond to the amount of memory allocated at the actual stack captured, but the overall totals will be correct. The latter seems more valuable.

(You should *not* use this running total size as the argument to FastBernoulliTrial::trial(size_t). That would treat later allocations as more important than earlier allocations, where "later" is respect to a randomly chosen event, so is meaningless.)
> When you do record an allocation, the size should be the total of all the
> allocations you've made since the last record. That way, the sizes in the
> record will not correspond to the amount of memory allocated at the actual
> stack captured, but the overall totals will be correct.

I changed my mind about this. Having incorrect sizes for some blocks is awful. That's why I now plan to measure the presence and size of all blocks accurately, but only get the allocation stack trace for some of them, as per comment 0.
+1
Whiteboard: [MemShrink] → [MemShrink:P2]
Assignee: nobody → n.nethercote
erahm: please review normally.

mccr8: you might like to check that I haven't broken scan mode. There is one
scan mode test within test_dmd.js, but block_analyzer.py doesn't seem to have
any testing -- I realized that I needed to change |outputVersion| only thanks
to some lucky grepping -- so the coverage isn't great.

jimb: you might like to look at how I've used FastBernoulliTrial. Doing a
case-insensitive search for "bernoulli" should give you all the points of
interest.
Attachment #8732677 - Flags: review?(erahm)
Attachment #8732677 - Flags: feedback?(jimb)
Attachment #8732677 - Flags: feedback?(continuation)
Due to the change in part 1, DMD now prints an entry for every live block,
which increases the output file size significantly in the default case. However,
a lot of those entries are identical and so can be aggregated via the existing
"num" property.

This patch does that, reducing output size by more than half.
Attachment #8732678 - Flags: review?(erahm)
Comment on attachment 8732677 [details] [diff] [review]
(part 1) - Overhaul DMD's "sampling"

Review of attachment 8732677 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/replace/dmd/DMD.cpp
@@ +393,5 @@
> +  // With partial stacks, not all heap blocks will get a stack trace recorded.
> +  // A Bernoulli trial is performed for each heap block to decide if it gets
> +  // one. Because bigger heap blocks are more likely to get a stack trace, even
> +  // though most heap *blocks* won't get a stack trace, most heap *bytes* will.
> +  enum class Stacks

You may want this enum for other reasons, but just so you know, setting your sampling probability to 1.0 will give you the same effect, and FastBernoulliTrial handles that case specially, so you won't be generating any random numbers (although that's certainly not going to affect performance).

@@ +1559,5 @@
>    gStateLock = InfallibleAllocPolicy::new_<Mutex>();
>  
> +  // We use (0, 0, 0) as the parameters because they don't matter --
> +  // ResetBernoulli() overwrites them.
> +  gBernoulli = InfallibleAllocPolicy::new_<FastBernoulliTrial>(0, 0, 0);

Doesn't this trip the assertion here?

https://hg.mozilla.org/mozilla-central/file/3e04659fdf6a/mfbt/XorShift128PlusRNG.h#l105
Attachment #8732677 - Flags: feedback?(jimb) → feedback+
> > +  gBernoulli = InfallibleAllocPolicy::new_<FastBernoulliTrial>(0, 0, 0);
> 
> Doesn't this trip the assertion here?

Probably! I haven't run a debug build since I made that change :)

Thank you for the fast feedback.
Comment on attachment 8732677 [details] [diff] [review]
(part 1) - Overhaul DMD's "sampling"

Review of attachment 8732677 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/replace/dmd/block_analyzer.py
@@ -222,5 @@
>      if j['version'] != outputVersion:
>          raise Exception("'version' property isn't '{:d}'".format(outputVersion))
>  
>      invocation = j['invocation']
> -    sampleBelowSize = invocation['sampleBelowSize']

Is there no recording of the sampling rate in this file?

Anyways, as long as disabling sampling works, I'm sure there won't be any serious issues with your patch for scanning.
Attachment #8732677 - Flags: feedback?(continuation) → feedback+
> Is there no recording of the sampling rate in this file?

Correct. Sampling doesn't really exist any more, except that some blocks lack an allocation stack trace.
Great. That will degrade more gracefully for scan mode. With sampling and scan mode, you'd get weird things like some address being reported as a super huge block, then you'd try to scan the entire "block" but read off into bogus memory.
(In reply to Andrew McCreight [:mccr8] from comment #10)
> Great. That will degrade more gracefully for scan mode. With sampling and
> scan mode, you'd get weird things like some address being reported as a
> super huge block, then you'd try to scan the entire "block" but read off
> into bogus memory.

The Options constructor currently disables sampling (by setting the sample-below size to 1) if scan mode is enabled...
Comment on attachment 8732677 [details] [diff] [review]
(part 1) - Overhaul DMD's "sampling"

Review of attachment 8732677 [details] [diff] [review]:
-----------------------------------------------------------------

Overall this looks good pending a fix for the bernoulli trial allocation. I made a few other minor comments that would be nice to cleanup before landing. Also a debug try push would be nice.

::: memory/replace/dmd/DMD.cpp
@@ +240,5 @@
> +  template <class T, typename P1, typename P2, typename P3>
> +  static T* new_(P1 aP1, P2 aP2, P3 aP3)
> +  {
> +    void* mem = malloc_(sizeof(T));
> +    ExitOnFailure(mem);

|malloc_| already does |ExitOnfailure| (I guess the same goes for the other newish functions). In general I'm not sure I see the merit of this function, can't we just |malloc_(sizeof(...))| in ResetBernoulli?

@@ +390,5 @@
> +  // With full stacks, every heap block gets a stack trace recorded for it.
> +  // This is complete but slow.
> +  //
> +  // With partial stacks, not all heap blocks will get a stack trace recorded.
> +  // A Bernoulli trial is performed for each heap block to decide if it gets

A wiki (or source code) link defining a Bernoulli trial would be useful here.

@@ +1559,5 @@
>    gStateLock = InfallibleAllocPolicy::new_<Mutex>();
>  
> +  // We use (0, 0, 0) as the parameters because they don't matter --
> +  // ResetBernoulli() overwrites them.
> +  gBernoulli = InfallibleAllocPolicy::new_<FastBernoulliTrial>(0, 0, 0);

I think we can just do a conditional |malloc_| in |ResetBernoulli| and call it good.

::: memory/replace/dmd/DMD.h
@@ -173,4 @@
>  //     "mode": "dark-matter",
> -//
> -//     // The value of the --sample-below-size option. A mandatory integer.
> -//     "sampleBelowSize": 4093

Should we add an entry for the sampling mode (full or partial)?

@@ +191,5 @@
> +//       // One or more stack traces at which this heap block was reported by a
> +//       // memory reporter. An optional array that will only be present in
> +//       // "dark-matter" mode. The elements are strings that index into
> +//       // the "traceTable" object.
> +//       // "reps": ["B"]

nit, remove extra '//'

::: memory/replace/dmd/dmd.py
@@ +275,5 @@
>  
> +    # Insert the necessary entries for unrecorded stack traces. Note that 'ut'
> +    # and 'uf' will not overlap with any keys produced by DMD's
> +    # ToIdStringConverter::Base32() function.
> +    traceTable['ut'] = ['uf']

Can you make 'ut' a constant? It'll make reading this later a bit more clear.

@@ +383,5 @@
>  
> +        if 'alloc' in block:
> +            allocatedAtTraceKey = block['alloc']
> +        else:
> +            allocatedAtTraceKey = 'ut'

|allocatedAtTraceKey = block.get('alloc', 'ut')| would be a bit more concise.

::: memory/replace/dmd/test/SmokeDMD.cpp
@@ +283,3 @@
>    ResetEverything(options);
>  
> +  int k10000 = aSeven + 9993;

Perhaps kTenThousand would be less odd (and match aSeven).
Attachment #8732677 - Flags: review?(erahm) → review+
(In reply to Eric Rahm [:erahm] from comment #12)
> > +  // With partial stacks, not all heap blocks will get a stack trace recorded.
> > +  // A Bernoulli trial is performed for each heap block to decide if it gets
> 
> A wiki (or source code) link defining a Bernoulli trial would be useful here.

The comments in mfbt/FastBernoulliTrial.h explain what Bernoulli trials are, and why they're attractive, so a link there would probably be fine.
Comment on attachment 8732678 [details] [diff] [review]
(part 2) - Aggregate live blocks

Review of attachment 8732678 [details] [diff] [review]:
-----------------------------------------------------------------

This seems fine, do we need to update/add tests for it though?

::: memory/replace/dmd/DMD.cpp
@@ -1885,5 @@
>          const DeadBlock& b = r.front().key();
>          b.AddStackTracesToTable(usedStackTraces);
>  
>          size_t num = r.front().value();
> -        MOZ_ASSERT(num > 0);

What was wrong with this assertion?
Attachment #8732678 - Flags: review?(erahm) → review+
(In reply to Eric Rahm [:erahm] from comment #14)
> (part 2) - Aggregate live blocks
> 
> This seems fine, do we need to update/add tests for it though?

Not really. It already had some testing because we already use it for dead blocks. We now also use it for live blocks, so it gets used more, but the existing tests will catch any problems in those cases.

> >          size_t num = r.front().value();
> > -        MOZ_ASSERT(num > 0);
> 
> What was wrong with this assertion?

Oh, I read it as >=0 which would be pointless for an unsigned type. My mistake, I'll put it back. (And that explains why the compiler didn't complain about an always-succeeding check.)
https://hg.mozilla.org/mozilla-central/rev/ce540d9af1cb
https://hg.mozilla.org/mozilla-central/rev/5f7d9726c2ff
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
I just realized I wrote some comments here but failed to submit them...


> ::: memory/replace/dmd/DMD.cpp
> @@ +240,5 @@
> > +  template <class T, typename P1, typename P2, typename P3>
> > +  static T* new_(P1 aP1, P2 aP2, P3 aP3)
> > +  {
> > +    void* mem = malloc_(sizeof(T));
> > +    ExitOnFailure(mem);
> 
> |malloc_| already does |ExitOnfailure| (I guess the same goes for the other
> newish functions).

True, I'll remove those ExitOnFailure() calls.


> > +  // With partial stacks, not all heap blocks will get a stack trace recorded.
> > +  // A Bernoulli trial is performed for each heap block to decide if it gets
> 
> A wiki (or source code) link defining a Bernoulli trial would be useful here.

I cited mfbt/FastBernoulliTrial.h as jimb suggested.


> @@ +1559,5 @@
> >    gStateLock = InfallibleAllocPolicy::new_<Mutex>();
> >  
> > +  // We use (0, 0, 0) as the parameters because they don't matter --
> > +  // ResetBernoulli() overwrites them.
> > +  gBernoulli = InfallibleAllocPolicy::new_<FastBernoulliTrial>(0, 0, 0);
> 
> I think we can just do a conditional |malloc_| in |ResetBernoulli| and call
> it good.

I think it's best to allocate it in Init() with InfallibleMallocPolicy::malloc_() and then reset it in ResetBernoulli.


> ::: memory/replace/dmd/DMD.h
> @@ -173,4 @@
> >  //     "mode": "dark-matter",
> > -//
> > -//     // The value of the --sample-below-size option. A mandatory integer.
> > -//     "sampleBelowSize": 4093
> 
> Should we add an entry for the sampling mode (full or partial)?

No. sampleBelowSize was present because dmd.py needed it in a couple of places. In contrast, dmd.py doesn't need to reference the sampling mode anywhere.
You need to log in before you can comment on or make changes to this bug.