WorkerPrivate::BlockAndCollectRuntimeStats can block the main thread for several seconds

RESOLVED FIXED in mozilla23

Status

RESOLVED FIXED

Product: Core
Component: DOM: Workers
Reported: 4 years ago
Modified: 4 years ago

People

(Reporter: vladan, Assigned: njn)

Tracking

Version: unspecified
Target Milestone: mozilla23
Points: ---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [MemShrink])

Attachments

(3 attachments)

(Reporter)

Description

4 years ago
Telemetry chrome-hang data shows that this method can tie up the main thread for several seconds on Windows. Could we add Telemetry to get timings for this operation?

PR_WaitCondVar (in nspr4.pdb)
 -> mozilla::CondVar::Wait(unsigned int) (in xul.pdb)
 -> mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats(bool,void *) (in xul.pdb)
 -> mozilla::dom::workers::WorkerPrivate::MemoryReporter::GetExplicitNonHeap(__int64 *) (in xul.pdb)
 -> nsMemoryReporterManager::GetExplicit(__int64 *) (in xul.pdb)
 -> ExplicitReporter::GetAmount(__int64 *) (in xul.pdb)
 -> NS_InvokeByIndex_P (in xul.pdb)
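
(A minimal sketch, not part of the original report, of how a timing probe around the blocking call might look, assuming mozilla::TimeStamp and Telemetry::Accumulate; the wrapper function, the bool argument's meaning, and the histogram id WORKER_MEMORY_REPORT_MS are all hypothetical.)

#include <stdint.h>
#include "mozilla/Telemetry.h"
#include "mozilla/TimeStamp.h"

// Hypothetical wrapper: time the blocking stats collection and record it in a
// (made-up) telemetry histogram so it can be cross-checked against chrome-hang data.
void CollectWorkerStatsWithTiming(mozilla::dom::workers::WorkerPrivate* aWorker,
                                  bool aFlag, void* aData)
{
  mozilla::TimeStamp start = mozilla::TimeStamp::Now();

  // The call that can block the main thread for several seconds.
  aWorker->BlockAndCollectRuntimeStats(aFlag, aData);

  uint32_t ms = uint32_t((mozilla::TimeStamp::Now() - start).ToMilliseconds());
  // WORKER_MEMORY_REPORT_MS is a hypothetical histogram id.
  mozilla::Telemetry::Accumulate(mozilla::Telemetry::WORKER_MEMORY_REPORT_MS, ms);
}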
(Assignee)

Comment 1

4 years ago
All the memory reporter stuff triggered when viewing about:memory can cause hangs like this.  But this is in response to a user-requested action, so it's not unexpected or even bad.
The telemetry memory check happens like once every few seconds without any user interaction, right?
(Assignee)

Comment 3

4 years ago
Hmm.  The telemetry ping does much less work than the full report triggered from about:memory.  I thought this stack was for the full report, but now I see the GetExplicitNonHeap() in it, which means this *is* for the telemetry ping.  Bugger.

I think this is the relevant code:

  // If the worker is not already blocked (e.g. waiting for a worker event or
  // currently in a ctypes call) then we need to trigger the operation
  // callback to trap the worker.
  if (!mBlockedForMemoryReporter) {
    JS_TriggerOperationCallback(rt);

    // Wait until the worker actually blocks.
    while (!mBlockedForMemoryReporter) {
      mMemoryReportCondVar.Wait();
    }
  }
With the current approach for collecting memory info from a worker (trigger the operation callback, wait for the worker to block and acknowledge, collect the memory info, unblock the worker), the only thing we can do here is maybe try to increase the number of places where the JS engine checks the operation callback. Other than that we need an async memory-reporting interface.
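
(For illustration only, not code from the tree: a simplified sketch of the worker-side half of that handshake. mMemoryReportCondVar and mBlockedForMemoryReporter appear in the snippet above; the function name, the mutex, and the "done" flag are assumed names, passed in here so the sketch stands alone.)

#include "mozilla/CondVar.h"
#include "mozilla/Mutex.h"

// Hypothetical worker-side handshake: flag the worker as blocked, wake the
// main thread, then stay parked until the main thread has finished walking
// the worker's JSRuntime.
void BlockForMemoryReport(mozilla::Mutex& aMutex,
                          mozilla::CondVar& aMemoryReportCondVar,
                          bool& aBlockedForMemoryReporter,
                          bool& aMemoryReportDone)
{
  mozilla::MutexAutoLock lock(aMutex);

  // Tell the main thread we are parked.
  aBlockedForMemoryReporter = true;
  aMemoryReportCondVar.Notify();

  // Stay parked until the main thread has finished collecting stats.
  while (!aMemoryReportDone) {
    aMemoryReportCondVar.Wait();
  }

  aMemoryReportDone = false;
  aBlockedForMemoryReporter = false;
}
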
We should fix it so that telemetry doesn't cause this to run, and then live with the blocking for when a user uses about:memory.
(Assignee)

Comment 6

4 years ago
> We should fix it so that telemetry doesn't cause this to run, and then live
> with the blocking for when a user uses about:memory.

Then the "explicit" measurement taken for telemetry pings would exclude all non-heap memory allocated by workers, which includes the JS heap and generated code.  Which would be quite a bummer.
(Assignee)

Comment 7

4 years ago
And anything else relying on nsMemoryReporterManager.explicit would also get a bogus value.  This includes the endurance tests (http://mozmill-ci.blargon7.com/#/endurance/charts?branch=22.0&platform=All&from=2013-03-14&to=2013-04-11), but AWSY wouldn't be affected.

I've considered just getting rid of nsMemoryReporterManager.explicit, because keeping it working is an ongoing pain.  The loss of explicit measurements in telemetry would be the main problem, though this bug is already putting them in jeopardy.
(Assignee)

Comment 8

4 years ago
bent, khuey: let's imagine that on a telemetry ping, the worker just had to read a single number from JSRuntime.  Would we need to wait on the condvar in that case?
If that value was being accessed/modified concurrently (and if JS is running in another thread then there is no way to avoid that) then we have to have some kind of synchronization or we'd sometimes be recording a bogus value. That seems like a deal-breaker, right?
(Assignee)

Comment 10

4 years ago
> If that value was being accessed/modified concurrently (and if JS is running
> in another thread then there is no way to avoid that)

Isn't JSRuntime single-threaded?
(In reply to Nicholas Nethercote [:njn] from comment #10)
> Isn't JSRuntime single-threaded?

Yes, but not really here.

The way memory reporting currently works is that a worker's JSRuntime is normally used only on the worker thread, but when we report memory the worker blocks and then the main thread uses the worker's JSRuntime to collect memory stats.

We used to post a message to the worker to have it collect memory data on its own thread and then report back to the main thread later (though the main thread remained blocked while waiting for that response). The two approaches boil down to about the same thing though.

Either way the number we want is being modified on another thread. We either have to synchronize or risk reading bad data.
(Assignee)

Comment 12

4 years ago
Would wrapping the counter in JSRuntime in a mutex be better?  It's still synchronization, but more localized... (/me waves hands).
I think it would be a perf problem... Every time the worker thread wanted to update the number (like, every alloc and dealloc) it would have to grab the lock first.
(Assignee)

Comment 14

4 years ago
> Every time the worker thread wanted to
> update the number (like, every alloc and dealloc) it would have to grab the
> lock first.

It would only be needed for the mmap allocs and deallocs -- i.e. allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on 64-bit only), and the YARR bump allocator.  Such operations are relatively rare -- much rarer than malloc/free.
Hm. That doesn't sound as bad... I guess we'd need to see numbers to be sure though.
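
(A rough sketch of the counter being discussed, not real JSRuntime code; the type and member names are made up. Only the chunk-level mmap paths would touch the lock, so ordinary malloc/free stays uninstrumented. std::mutex is used just to keep the sketch self-contained.)

#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical per-runtime counter of mmap'd (non-heap) bytes, guarded by a
// mutex so the main thread can read it while the worker keeps running.
struct NonHeapCounter
{
  // Called only when chunk-level memory changes: GC chunks, code chunks,
  // stacks, and similar.  These are rare compared with malloc/free, so the
  // lock should be cheap in practice.
  void Add(size_t aBytes)
  {
    std::lock_guard<std::mutex> guard(mLock);
    mBytes += int64_t(aBytes);
  }

  void Remove(size_t aBytes)
  {
    std::lock_guard<std::mutex> guard(mLock);
    mBytes -= int64_t(aBytes);
  }

  // Safe to call from the main thread during a telemetry ping.
  int64_t Get()
  {
    std::lock_guard<std::mutex> guard(mLock);
    return mBytes;
  }

private:
  std::mutex mLock;
  int64_t mBytes = 0;
};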
(Assignee)

Comment 16

4 years ago
> allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on
> 64-bit only), and the YARR bump allocator

Aw geez, on Unix the "explicit" measurement of the stack uses mincore() to measure only the pages that are resident!  There's no way we can keep a per-JSRuntime counter accurate in the face of that.

I'm really tempted to just ditch |nsMemoryReporterManager.explicit| at this point.
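
(For context, a small sketch of the kind of mincore()-based measurement referred to above: ask the kernel which pages of a mapping are resident and count only those. This assumes Linux -- on some BSDs the vector argument is char* rather than unsigned char* -- and the address must be page-aligned. The function name is made up.)

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Count the resident bytes of a page-aligned mapping; roughly how the stack's
// "explicit" number is computed on Unix.
size_t ResidentBytes(void* aAddr, size_t aLen)
{
  const size_t page = size_t(sysconf(_SC_PAGESIZE));
  const size_t numPages = (aLen + page - 1) / page;
  std::vector<unsigned char> vec(numPages);

  if (mincore(aAddr, aLen, vec.data()) != 0) {
    return 0;  // e.g. the range isn't mapped
  }

  size_t resident = 0;
  for (unsigned char v : vec) {
    if (v & 1) {          // low bit set => page is in core
      resident += page;
    }
  }
  return resident;
}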
(Assignee)

Comment 17

4 years ago
Created attachment 736644 [details] [diff] [review]
(part 1) - Don't measure MEMORY_EXPLICIT in telemetry because it can cause pauses.

Heading down the "get rid of 'explicit'" path, I see four levels of change,
from smallest to largest.

(1) Remove the MEMORY_EXPLICIT telemetry measure.

(2) Also remove the standalone "explicit" reporter.  (Which just calls
    nsMemoryReporterManager::GetExplicit().)

(3) Also remove nsIMemoryReporter.explicitNonHeap, which means that
    |nsIMemoryReporterManager.explicit| will get slower.

(4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
    things like the endurance tests.

In this patch I've just done (1).  It's a shame we won't have MEMORY_EXPLICIT
any more, but it's always been horribly noisy and I'm not aware of anything
useful that's ever come from it, so perhaps it's not a great loss.

(Relatedly, Benoit is working on adding the telemetry memory measurements to
SPS, but he said the "explicit" is no good because it does too much stuff with
the GC.)

I'm also sorely tempted to do (3) (which would also require (2)) to reduce some
complexity;  after all, it was added purely to make telemetry fast.  (4) seems
like overkill.
Attachment #736644 - Flags: review?(justin.lebar+bug)
(Assignee)

Updated

4 years ago
Assignee: nobody → n.nethercote
> (1) Remove the MEMORY_EXPLICIT telemetry measure.

I was going to suggest this when I first read the bug, so I guess we're in agreement.  This has been a lot more trouble than it's worth, particularly given how our telemetry dashboards frequently lose data and are perpetually broken.  Maybe one day when all of this is better, we can worry about this again.
(Assignee)

Comment 19

4 years ago
Created attachment 736655 [details] [diff] [review]
(part 2) - Remove the "explicit" single reporter.

This is step (2) from comment 17.
Attachment #736655 - Flags: review?(justin.lebar+bug)
(Assignee)

Comment 20

4 years ago
Created attachment 736656 [details] [diff] [review]
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth.

This is step (3) from comment 17.

24 files changed, 19 insertions(+), 283 deletions(-)

BTW, this will also fix 728990.
Attachment #736656 - Flags: review?(justin.lebar+bug)
(Assignee)

Updated

4 years ago
Blocks: 733374
(Assignee)

Comment 21

4 years ago
> BTW, this will also fix 728990.

Sorry, it's actually bug 733374.
Attachment #736644 - Flags: review?(justin.lebar+bug) → review+
Attachment #736655 - Flags: review?(justin.lebar+bug) → review+
Comment on attachment 736656 [details] [diff] [review]
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth.

>diff --git a/xpcom/base/nsMemoryReporterManager.cpp b/xpcom/base/nsMemoryReporterManager.cpp
>--- a/xpcom/base/nsMemoryReporterManager.cpp
>+++ b/xpcom/base/nsMemoryReporterManager.cpp

>-    // For each multi-reporter we could call CollectReports and filter out the
>-    // non-explicit, non-NONHEAP measurements.  But that's lots of wasted work,
>-    // so we instead use GetExplicitNonHeap() which exists purely for this
>-    // purpose.
>-    //
>-    // (Actually, in debug builds we also do it the slow way and compare the
>-    // result to the result obtained from GetExplicitNonHeap().  This
>-    // guarantees the two measurement paths are equivalent.  This is wise
>-    // because it's easy for memory reporters to have bugs.  But there's an
>-    // exception if DMD is enabled, because that makes DMD think that all the
>-    // blocks are double-counted.)
>+    // For each multi-reporter we call CollectReports and filter out the
>+    // non-explicit, non-NONHEAP measurements.  That's lots of wasted work,
>+    // and we used to have a GetExplicitNonHeap() method which did this more
>+    // efficiently, but it ended up being more trouble than it was worth.

Would you mind citing the bug?  This whole "you can use blame" business is
pretty silly IMO.
Attachment #736656 - Flags: review?(justin.lebar+bug) → review+
(Assignee)

Updated

4 years ago
Whiteboard: [MemShrink]
(Assignee)

Comment 23

4 years ago
I merged parts 2 and 3 before landing because I realized that part 2 by itself would have caused test failures, and I was too lazy to move the relevant test changes from part 3 to part 2.

https://hg.mozilla.org/integration/mozilla-inbound/rev/6ee0e740dc62
https://hg.mozilla.org/integration/mozilla-inbound/rev/02629c5c6d68
(Assignee)

Comment 24

4 years ago
Bustage fix:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e30afdf18dd2

Merged to mozilla-central:
https://hg.mozilla.org/mozilla-central/rev/6ee0e740dc62
https://hg.mozilla.org/mozilla-central/rev/02629c5c6d68
https://hg.mozilla.org/mozilla-central/rev/e30afdf18dd2
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
(In reply to Nicholas Nethercote [:njn] (away until June 3) from comment #17)
> (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
>     things like the endurance tests.
> 
> complexity;  after all, it was added purely to make telemetry fast.  (4)
> seems like overkill.

So does this mean that the property will still exist but contain a null/undefined value now? If that's the case, I would say we remove it from our endurance tests.
(Assignee)

Comment 27

4 years ago
> > (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
> >     things like the endurance tests.
> > 
> > complexity;  after all, it was added purely to make telemetry fast.  (4)
> > seems like overkill.
> 
> So does this mean that the property will still exist but contain a
> null/undefined value now? If that's the case, I would say we remove it from
> our endurance tests.

Step (4) wasn't done.  So |nsIMemoryReporterManager.explicit| still works;  it's just slower than it used to be.  I'd keep it in the endurance tests unless you find it causes problems.