Closed Bug 859657 Opened 11 years ago Closed 11 years ago

WorkerPrivate::BlockAndCollectRuntimeStats can block the main thread for several seconds

Categories

(Core :: DOM: Workers, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: vladan, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink])

Attachments

(3 files)

Telemetry chrome-hang data shows that this method can tie up the main thread for several seconds on Windows. Could we add Telemetry to get timings for this operation?

PR_WaitCondVar (in nspr4.pdb)
 -> mozilla::CondVar::Wait(unsigned int) (in xul.pdb)
 -> mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats(bool,void *) (in xul.pdb)
 -> mozilla::dom::workers::WorkerPrivate::MemoryReporter::GetExplicitNonHeap(__int64 *) (in xul.pdb)
 -> nsMemoryReporterManager::GetExplicit(__int64 *) (in xul.pdb)
 -> ExplicitReporter::GetAmount(__int64 *) (in xul.pdb)
 -> NS_InvokeByIndex_P (in xul.pdb)
All the memory reporter stuff triggered when viewing about:memory can cause hangs like this.  But this is in response to a user-requested action, so it's not unexpected or even bad.
The telemetry memory check happens like once every few seconds without any user interaction, right?
Hmm.  The telemetry ping does much less work than the full report triggered from about:memory.  I thought this stack was for the full report, but now I see the GetExplicitNonHeap() in it, which means this *is* for the telemetry ping.  Bugger.

I think this is the relevant code:

  // If the worker is not already blocked (e.g. waiting for a worker event or
  // currently in a ctypes call) then we need to trigger the operation
  // callback to trap the worker.
  if (!mBlockedForMemoryReporter) {
    JS_TriggerOperationCallback(rt);

    // Wait until the worker actually blocks.
    while (!mBlockedForMemoryReporter) {
      mMemoryReportCondVar.Wait();
    }
  }
With the current approach for collecting memory info from a worker (trigger op callback, wait for the worker to block and acknowledge, collect memory info, unblock worker) the only thing we can do here is maybe try to increase the number of places that the JS engine checks the operation callback. Other than that we need an async memory report interface.
We should fix it so that telemetry doesn't cause this to run, and then live with the blocking for when a user uses about:memory.
> We should fix it so that telemetry doesn't cause this to run, and then live
> with the blocking for when a user uses about:memory.

Then the "explicit" measurement taken for telemetry pings would exclude all non-heap memory allocated by workers, which includes the JS heap and generated code.  Which would be quite a bummer.
And anything else relying on nsMemoryReporterManager.explicit would also get a bogus value.  This includes the endurance tests (http://mozmill-ci.blargon7.com/#/endurance/charts?branch=22.0&platform=All&from=2013-03-14&to=2013-04-11), but AWSY wouldn't be affected.

I've considered just getting rid of nsMemoryReporterManager.explicit, because keeping it working is an ongoing pain.  The loss of explicit measurements in telemetry would be the main problem, though this bug is already putting them in jeopardy.
bent, khuey: let's imagine that on a telemetry ping, the worker just had to read a single number from JSRuntime.  Would we need to wait on the condvar in that case?
If that value was being accessed/modified concurrently (and if JS is running in another thread then there is no way to avoid that) then we have to have some kind of synchronization or we'd sometimes be recording a bogus value. That seems like a deal-breaker, right?
> If that value was being accessed/modified concurrently (and if JS is running
> in another thread then there is no way to avoid that)

Isn't JSRuntime single-threaded?
(In reply to Nicholas Nethercote [:njn] from comment #10)
> Isn't JSRuntime single-threaded?

Yes, but not really here.

The way memory reporting works currently is that a worker's JSRuntime is primarily only used on the worker thread, but when we report memory the worker blocks and then the main thread uses the worker's JSRuntime to collect memory stats.

We used to post a message to the worker to have it collect memory data on its own thread and then report back to the main thread later (though the main thread remained blocked while waiting for that response). The two approaches boil down to about the same thing though.

Either way the number we want is being modified on another thread. We either have to synchronize or risk reading bad data.
Would wrapping the counter in JSRuntime in a mutex be better?  It's still synchronization, but more localized... (/me waves hands).
I think it would be a perf problem... Every time the worker thread wanted to update the number (like, every alloc and dealloc) it would have to grab the lock first.
> Every time the worker thread wanted to
> update the number (like, every alloc and dealloc) it would have to grab the
> lock first.

It would only be needed for the mmap allocs and deallocs -- i.e. allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on 64-bit only), and the YARR bump allocator.  Such operations are relatively rare, much more so than malloc/free.
Hm. That doesn't sound as bad... I guess we'd need to see numbers to be sure though.
> allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on
> 64-bit only), and the YARR bump allocator

Aw geez, on Unix the "explicit" measurement of the stack uses mincore() to measure only the pages that are resident!  There's no way we can keep a per-JSRuntime counter accurate in the face of that.

I'm really tempted to just ditch |nsMemoryReporterManager.explicit| at this point.
Heading down the "get rid of 'explicit'" path, I see four levels of change,
from smallest to largest.

(1) Remove the MEMORY_EXPLICIT telemetry measure.

(2) Also remove the standalone "explicit" reporter.  (Which just calls
    nsMemoryReporterManager::GetExplicit().)

(3) Also remove nsIMemoryReporter.explicitNonHeap, which means that
    |nsIMemoryReportermanager.explicit| will get slower.

(4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
    things like the endurance tests.

In this patch I've just done (1).  It's a shame we won't have MEMORY_EXPLICIT
any more, but it's always been horribly noisy and I'm not aware of anything
useful that's ever come from it, so perhaps it's not a great loss.

(Relatedly, Benoit is working on adding the telemetry memory measurements to
SPS, but he said the "explicit" is no good because it does too much stuff with
the GC.)

I'm also sorely tempted to do (3) (which would also require (2)) to reduce some
complexity;  after all, it was added purely to make telemetry fast.  (4) seems
like overkill.
Attachment #736644 - Flags: review?(justin.lebar+bug)
Assignee: nobody → n.nethercote
> (1) Remove the MEMORY_EXPLICIT telemetry measure.

I was going to suggest this when I first read the bug, so I guess we're in agreement.  This has been a lot more trouble than it's worth, particularly given how our telemetry dashboards frequently lose data and are perpetually broken.  Maybe one day when all of this is better, we can worry about this again.
This is step (2) from comment 17.
Attachment #736655 - Flags: review?(justin.lebar+bug)
This is step (3) from comment 17.

24 files changed, 19 insertions(+), 283 deletions(-)

BTW, this will also fix 728990.
Attachment #736656 - Flags: review?(justin.lebar+bug)
Blocks: 733374
> BTW, this will also fix 728990.

Sorry, it's actually bug 733374.
Attachment #736644 - Flags: review?(justin.lebar+bug) → review+
Attachment #736655 - Flags: review?(justin.lebar+bug) → review+
Comment on attachment 736656 [details] [diff] [review]
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth.

>diff --git a/xpcom/base/nsMemoryReporterManager.cpp b/xpcom/base/nsMemoryReporterManager.cpp
>--- a/xpcom/base/nsMemoryReporterManager.cpp
>+++ b/xpcom/base/nsMemoryReporterManager.cpp

>-    // For each multi-reporter we could call CollectReports and filter out the
>-    // non-explicit, non-NONHEAP measurements.  But that's lots of wasted work,
>-    // so we instead use GetExplicitNonHeap() which exists purely for this
>-    // purpose.
>-    //
>-    // (Actually, in debug builds we also do it the slow way and compare the
>-    // result to the result obtained from GetExplicitNonHeap().  This
>-    // guarantees the two measurement paths are equivalent.  This is wise
>-    // because it's easy for memory reporters to have bugs.  But there's an
>-    // exception if DMD is enabled, because that makes DMD think that all the
>-    // blocks are double-counted.)
>+    // For each multi-reporter we call CollectReports and filter out the
>+    // non-explicit, non-NONHEAP measurements.  That's lots of wasted work,
>+    // and we used to have a GetExplicitNonHeap() method which did this more
>+    // efficiently, but it ended up being more trouble than it was worth.

Would you mind citing the bug?  This whole "you can use blame" business is
pretty silly IMO.
Attachment #736656 - Flags: review?(justin.lebar+bug) → review+
Whiteboard: [MemShrink]
I merged parts 2 and 3 before landing because I realized that part 2 by itself would have caused test failures, and I was too lazy to move the relevant test changes from part 3 to part 2.

https://hg.mozilla.org/integration/mozilla-inbound/rev/6ee0e740dc62
https://hg.mozilla.org/integration/mozilla-inbound/rev/02629c5c6d68
(In reply to Nicholas Nethercote [:njn] (away until June 3) from comment #17)
> (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
>     things like the endurance tests.
> 
> complexity;  after all, it was added purely to make telemetry fast.  (4)
> seems like overkill.

So does this mean that the property will still exist but contains a null/undefined value now? If that's the case I would say we remove it from our endurance tests.
> > (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
> >     things like the endurance tests.
> > 
> > complexity;  after all, it was added purely to make telemetry fast.  (4)
> > seems like overkill.
> 
> So does this mean that the property will still exist but contains a
> null/undefined value now? If that's the case I would say we remove it from
> our endurance tests.

Step (4) wasn't done.  So |nsIMemoryReporterManager.explicit| still works;  it's just slower than it used to be.  I'd keep it in the endurance tests unless you find it causes problems.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: