Last Comment Bug 859657 - WorkerPrivate::BlockAndCollectRuntimeStats can block the main thread for several seconds
: WorkerPrivate::BlockAndCollectRuntimeStats can block the main thread for seve...
Status: RESOLVED FIXED
[MemShrink]
:
Product: Core
Classification: Components
Component: DOM: Workers (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla23
Assigned To: Nicholas Nethercote [:njn]
:
: Andrew Overholt [:overholt]
Mentors:
Depends on:
Blocks: 733374
  Show dependency treegraph
 
Reported: 2013-04-08 21:05 PDT by Vladan Djeric (:vladan)
Modified: 2013-06-02 21:16 PDT (History)
9 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
(part 1) - Don't measure MEMORY_EXPLICIT in telemetry because it can cause pauses. (2.53 KB, patch)
2013-04-11 20:42 PDT, Nicholas Nethercote [:njn]
justin.lebar+bug: review+
Details | Diff | Splinter Review
(part 2) - Remove the "explicit" single reporter. (2.44 KB, patch)
2013-04-11 22:15 PDT, Nicholas Nethercote [:njn]
justin.lebar+bug: review+
Details | Diff | Splinter Review
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth. (33.29 KB, patch)
2013-04-11 22:15 PDT, Nicholas Nethercote [:njn]
justin.lebar+bug: review+
Details | Diff | Splinter Review

Description Vladan Djeric (:vladan) 2013-04-08 21:05:05 PDT
Telemetry chrome-hang data shows that this method can tie up the main thread for several seconds on Windows. Could we add Telemetry to get timings for this operation?

PR_WaitCondVar (in nspr4.pdb)
 -> mozilla::CondVar::Wait(unsigned int) (in xul.pdb)
 -> mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats(bool,void *) (in xul.pdb)
 -> mozilla::dom::workers::WorkerPrivate::MemoryReporter::GetExplicitNonHeap(__int64 *) (in xul.pdb)
 -> nsMemoryReporterManager::GetExplicit(__int64 *) (in xul.pdb)
 -> ExplicitReporter::GetAmount(__int64 *) (in xul.pdb)
 -> NS_InvokeByIndex_P (in xul.pdb)
Comment 1 Nicholas Nethercote [:njn] 2013-04-09 03:11:47 PDT
All the memory reporter stuff triggered when viewing about:memory can cause hangs like this.  But this is in response to a user-requested action, so it's not unexpected or even bad.
Comment 2 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-04-09 07:39:02 PDT
The telemetry memory check happens like once every few seconds without any user interaction, right?
Comment 3 Nicholas Nethercote [:njn] 2013-04-09 18:50:43 PDT
Hmm.  The telemetry ping does much less work than the full report triggered from about:memory.  I thought this stack was for the full report, but now I see the GetExplicitNonHeap() in it, which means this *is* for the telemetry ping.  Bugger.

I think this is the relevant code:

  // If the worker is not already blocked (e.g. waiting for a worker event or
  // currently in a ctypes call) then we need to trigger the operation
  // callback to trap the worker.
  if (!mBlockedForMemoryReporter) {
    JS_TriggerOperationCallback(rt);

    // Wait until the worker actually blocks.
    while (!mBlockedForMemoryReporter) {
      mMemoryReportCondVar.Wait();
    }
  }
Comment 4 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-04-10 08:54:08 PDT
With the current approach for collecting memory info from a worker (trigger op callback, wait for the worker to block and acknowledge, collect memory info, unblock worker) the only thing we can do here is maybe try to increase the number of places that the JS engine checks the operation callback. Other than that we need an async memory report interface.
Comment 5 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2013-04-10 08:59:21 PDT
We should fix it so that telemetry doesn't cause this to run, and then live with the blocking for when a user uses about:memory.
Comment 6 Nicholas Nethercote [:njn] 2013-04-10 18:00:47 PDT
> We should fix it so that telemetry doesn't cause this to run, and then live
> with the blocking for when a user uses about:memory.

Then the "explicit" measurement taken for telemetry pings would exclude all non-heap memory allocated by workers, which includes the JS heap and generated code.  Which would be quite a bummer.
Comment 7 Nicholas Nethercote [:njn] 2013-04-10 21:45:50 PDT
And anything else relying on nsMemoryReporterManager.explicit would also get a bogus value.  This includes the endurance tests (http://mozmill-ci.blargon7.com/#/endurance/charts?branch=22.0&platform=All&from=2013-03-14&to=2013-04-11), but AWSY wouldn't be affected.

I've considered just getting rid of nsMemoryReporterManager.explicit, because keeping it working is an ongoing pain.  The loss of explicit measurements in telemetry would be the main problem, though this bug is already putting them in jeopardy.
Comment 8 Nicholas Nethercote [:njn] 2013-04-10 22:25:36 PDT
bent, khuey: let's imagine that on a telemetry ping, the worker just had to read a single number from JSRuntime.  Would we need to wait on the condvar in that case?
Comment 9 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-04-10 23:38:24 PDT
If that value was being accessed/modified concurrently (and if JS is running in another thread then there is no way to avoid that) then we have to have some kind of synchronization or we'd sometimes be recording a bogus value. That seems like a deal-breaker, right?
Comment 10 Nicholas Nethercote [:njn] 2013-04-11 00:05:41 PDT
> If that value was being accessed/modified concurrently (and if JS is running
> in another thread then there is no way to avoid that)

Isn't JSRuntime single-threaded?
Comment 11 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-04-11 06:40:15 PDT
(In reply to Nicholas Nethercote [:njn] from comment #10)
> Isn't JSRuntime single-threaded?

Yes, but not really here.

The way memory reporting works currently is that a worker's JSRuntime is primarily only used on the worker thread, but when we report memory the worker blocks and then the main thread uses the worker's JSRuntime to collect memory stats.

We used to post a message to the worker to have it collect memory data on its own thread and then report back to the main thread later (though the main thread remained blocked while waiting for that response). The two approaches boil down to about the same thing though.

Either way the number we want is being modified on another thread. We either have to synchronize or risk reading bad data.
Comment 12 Nicholas Nethercote [:njn] 2013-04-11 15:57:32 PDT
Would wrapping the counter in JSRuntime in a mutex be better?  It's still synchronization, but more localized... (/me waves hands).
Comment 13 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-04-11 16:51:01 PDT
I think it would be a perf problem... Every time the worker thread wanted to update the number (like, every alloc and dealloc) it would have to grab the lock first.
Comment 14 Nicholas Nethercote [:njn] 2013-04-11 17:22:09 PDT
> Every time the worker thread wanted to
> update the number (like, every alloc and dealloc) it would have to grab the
> lock first.

It would only be needed for the mmap allocs and deallocs -- i.e. allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on 64-bit only), and the YARR bump allocator.  Such operations are relatively rare, much more so than malloc/free.
Comment 15 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-04-11 18:44:15 PDT
Hm. That doesn't sound as bad... I guess we'd need to see numbers to be sure though.
Comment 16 Nicholas Nethercote [:njn] 2013-04-11 20:03:15 PDT
> allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on
> 64-bit only), and the YARR bump allocator

Aw geez, on Unix the "explicit" measurement of the stack uses mincore() to measure only the pages that are resident!  There's no way we can keep a per-JSRuntime counter accurate in the face of that.

I'm really tempted to just ditch |nsMemoryReporterManager.explicit| at this point.
Comment 17 Nicholas Nethercote [:njn] 2013-04-11 20:42:43 PDT
Created attachment 736644 [details] [diff] [review]
(part 1) - Don't measure MEMORY_EXPLICIT in telemetry because it can cause pauses.

Heading down the "get rid of 'explicit'" path, I see four levels of change,
from smallest to largest.

(1) Remove the MEMORY_EXPLICIT telemetry measure.

(2) Also remove the standalone "explicit" reporter.  (Which just calls
    nsMemoryReporterManager::GetExplicit().)

(3) Also remove nsIMemoryReporter.explicitNonHeap, which means that
    |nsIMemoryReportermanager.explicit| will get slower.

(4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
    things like the endurance tests.

In this patch I've just done (1).  It's a shame we won't have MEMORY_EXPLICIT
any more, but it's always been horribly noisy and I'm not aware of anything
useful that's ever come from it, so perhaps it's not a great loss.

(Relatedly, Benoit is working on adding the telemetry memory measurements to
SPS, but he said the "explicit" is no good because it does too much stuff with
the GC.)

I'm also sorely tempted to do (3) (which would also require (2)) to reduce some
complexity;  after all, it was added purely to make telemetry fast.  (4) seems
like overkill.
Comment 18 Justin Lebar (not reading bugmail) 2013-04-11 22:08:00 PDT
> (1) Remove the MEMORY_EXPLICIT telemetry measure.

I was going to suggest this when I first read the bug, so I guess we're in agreement.  This has been a lot more trouble than it's worth, particularly given how our telemetry dashboards frequently lose data and are perpetually broken.  Maybe one day when all of this is better, we can worry about this again.
Comment 19 Nicholas Nethercote [:njn] 2013-04-11 22:15:08 PDT
Created attachment 736655 [details] [diff] [review]
(part 2) - Remove the "explicit" single reporter.

This is step (2) from comment 17.
Comment 20 Nicholas Nethercote [:njn] 2013-04-11 22:15:55 PDT
Created attachment 736656 [details] [diff] [review]
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth.

This is step (3) from comment 17.

24 files changed, 19 insertions(+), 283 deletions(-)

BTW, this will also fix 728990.
Comment 21 Nicholas Nethercote [:njn] 2013-04-14 22:20:38 PDT
> BTW, this will also fix 728990.

Sorry, it's actually bug 733374.
Comment 22 Justin Lebar (not reading bugmail) 2013-04-15 02:59:11 PDT
Comment on attachment 736656 [details] [diff] [review]
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth.

>diff --git a/xpcom/base/nsMemoryReporterManager.cpp b/xpcom/base/nsMemoryReporterManager.cpp
>--- a/xpcom/base/nsMemoryReporterManager.cpp
>+++ b/xpcom/base/nsMemoryReporterManager.cpp

>-    // For each multi-reporter we could call CollectReports and filter out the
>-    // non-explicit, non-NONHEAP measurements.  But that's lots of wasted work,
>-    // so we instead use GetExplicitNonHeap() which exists purely for this
>-    // purpose.
>-    //
>-    // (Actually, in debug builds we also do it the slow way and compare the
>-    // result to the result obtained from GetExplicitNonHeap().  This
>-    // guarantees the two measurement paths are equivalent.  This is wise
>-    // because it's easy for memory reporters to have bugs.  But there's an
>-    // exception if DMD is enabled, because that makes DMD think that all the
>-    // blocks are double-counted.)
>+    // For each multi-reporter we call CollectReports and filter out the
>+    // non-explicit, non-NONHEAP measurements.  That's lots of wasted work,
>+    // and we used to have a GetExplicitNonHeap() method which did this more
>+    // efficiently, but it ended up being more trouble than it was worth.

Would you mind citing the bug?  This whole "you can use blame" business is
pretty silly IMO.
Comment 23 Nicholas Nethercote [:njn] 2013-04-15 20:15:03 PDT
I merged parts 2 and 3 before landing because I realized that part 2 by itself would have caused test failures, and I was too lazy to move the relevant test changes from part 3 to part 2.

https://hg.mozilla.org/integration/mozilla-inbound/rev/6ee0e740dc62
https://hg.mozilla.org/integration/mozilla-inbound/rev/02629c5c6d68
Comment 24 Nicholas Nethercote [:njn] 2013-04-15 20:46:58 PDT
Bustage fix:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e30afdf18dd2
Comment 26 Henrik Skupin (:whimboo) 2013-05-20 22:04:49 PDT
(In reply to Nicholas Nethercote [:njn] (away until June 3) from comment #17)
> (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
>     things like the endurance tests.
> 
> complexity;  after all, it was added purely to make telemetry fast.  (4)
> seems like overkill.

So does this mean that the property will still exist but contains a null/undefined value now? If that's the case I would say we remove it from our endurance tests.
Comment 27 Nicholas Nethercote [:njn] 2013-06-02 21:16:18 PDT
> > (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
> >     things like the endurance tests.
> > 
> > complexity;  after all, it was added purely to make telemetry fast.  (4)
> > seems like overkill.
> 
> So does this mean that the property will still exist but contains a
> null/undefined value now? If that's the case I would say we remove it from
> our endurance tests.

Step (4) wasn't done.  So |nsIMemoryReporterManager.explicit| still works;  it's just slower than it used to be.  I'd keep it in the endurance tests unless you find it causes problems.

Note You need to log in before you can comment on or make changes to this bug.