Bug 859657: RESOLVED FIXED (target: mozilla23)
Opened 11 years ago · Closed 11 years ago

WorkerPrivate::BlockAndCollectRuntimeStats can block the main thread for several seconds

Categories: Core :: DOM: Workers, defect
People: Reporter: vladan; Assigned: n.nethercote
Whiteboard: [MemShrink]

Attachments (3 files)
2.53 KB, patch (justin.lebar+bug: review+)
2.44 KB, patch (justin.lebar+bug: review+)
33.29 KB, patch (justin.lebar+bug: review+)
Telemetry chrome-hang data shows that this method can tie up the main thread for several seconds on Windows. Could we add Telemetry to get timings for this operation?

PR_WaitCondVar (in nspr4.pdb)
-> mozilla::CondVar::Wait(unsigned int) (in xul.pdb)
-> mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats(bool,void *) (in xul.pdb)
-> mozilla::dom::workers::WorkerPrivate::MemoryReporter::GetExplicitNonHeap(__int64 *) (in xul.pdb)
-> nsMemoryReporterManager::GetExplicit(__int64 *) (in xul.pdb)
-> ExplicitReporter::GetAmount(__int64 *) (in xul.pdb)
-> NS_InvokeByIndex_P (in xul.pdb)
Comment 1 • 11 years ago (Assignee)
All the memory reporter stuff triggered when viewing about:memory can cause hangs like this. But this is in response to a user-requested action, so it's not unexpected or even bad.
The telemetry memory check happens like once every few seconds without any user interaction, right?
Comment 3 • 11 years ago (Assignee)
Hmm. The telemetry ping does much less work than the full report triggered from about:memory. I thought this stack was for the full report, but now I see the GetExplicitNonHeap() in it, which means this *is* for the telemetry ping. Bugger.

I think this is the relevant code:

    // If the worker is not already blocked (e.g. waiting for a worker event or
    // currently in a ctypes call) then we need to trigger the operation
    // callback to trap the worker.
    if (!mBlockedForMemoryReporter) {
      JS_TriggerOperationCallback(rt);

      // Wait until the worker actually blocks.
      while (!mBlockedForMemoryReporter) {
        mMemoryReportCondVar.Wait();
      }
    }
With the current approach for collecting memory info from a worker (trigger op callback, wait for the worker to block and acknowledge, collect memory info, unblock worker) the only thing we can do here is maybe try to increase the number of places that the JS engine checks the operation callback. Other than that we need an async memory report interface.
We should fix it so that telemetry doesn't cause this to run, and then live with the blocking for when a user uses about:memory.
Comment 6 • 11 years ago (Assignee)
> We should fix it so that telemetry doesn't cause this to run, and then live
> with the blocking for when a user uses about:memory.
Then the "explicit" measurement taken for telemetry pings would exclude all non-heap memory allocated by workers, which includes the JS heap and generated code. Which would be quite a bummer.
Comment 7 • 11 years ago (Assignee)
And anything else relying on nsMemoryReporterManager.explicit would also get a bogus value. This includes the endurance tests (http://mozmill-ci.blargon7.com/#/endurance/charts?branch=22.0&platform=All&from=2013-03-14&to=2013-04-11), but AWSY wouldn't be affected. I've considered just getting rid of nsMemoryReporterManager.explicit, because keeping it working is an ongoing pain. The loss of explicit measurements in telemetry would be the main problem, though this bug is already putting them in jeopardy.
Comment 8 • 11 years ago (Assignee)
bent, khuey: let's imagine that on a telemetry ping, the worker just had to read a single number from JSRuntime. Would we need to wait on the condvar in that case?
If that value was being accessed/modified concurrently (and if JS is running in another thread then there is no way to avoid that) then we have to have some kind of synchronization or we'd sometimes be recording a bogus value. That seems like a deal-breaker, right?
Comment 10 • 11 years ago (Assignee)
> If that value was being accessed/modified concurrently (and if JS is running
> in another thread then there is no way to avoid that)
Isn't JSRuntime single-threaded?
(In reply to Nicholas Nethercote [:njn] from comment #10)
> Isn't JSRuntime single-threaded?

Yes, but not really here. The way memory reporting currently works is that a worker's JSRuntime is primarily used only on the worker thread, but when we report memory the worker blocks and then the main thread uses the worker's JSRuntime to collect memory stats. We used to post a message to the worker so it would collect memory data on its own thread and report back to the main thread later (though the main thread remained blocked while waiting for that response). The two approaches boil down to about the same thing, though: either way, the number we want is being modified on another thread, so we either have to synchronize or risk reading bad data.
Comment 12 • 11 years ago (Assignee)
Would wrapping the counter in JSRuntime in a mutex be better? It's still synchronization, but more localized... (/me waves hands).
I think it would be a perf problem... Every time the worker thread wanted to update the number (like, every alloc and dealloc) it would have to grab the lock first.
Comment 14 • 11 years ago (Assignee)
> Every time the worker thread wanted to
> update the number (like, every alloc and dealloc) it would have to grab the
> lock first.
It would only be needed for the mmap allocs and deallocs -- i.e. allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on 64-bit only), and the YARR bump allocator. Such operations are relatively rare, much more so than malloc/free.
Hm. That doesn't sound as bad... I guess we'd need to see numbers to be sure though.
Comment 16 • 11 years ago (Assignee)
> allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on
> 64-bit only), and the YARR bump allocator
Aw geez, on Unix the "explicit" measurement of the stack uses mincore() to measure only the pages that are resident! There's no way we can keep a per-JSRuntime counter accurate in the face of that.
I'm really tempted to just ditch |nsMemoryReporterManager.explicit| at this point.
Comment 17 • 11 years ago (Assignee)
Heading down the "get rid of 'explicit'" path, I see four levels of change, from smallest to largest:

(1) Remove the MEMORY_EXPLICIT telemetry measure.
(2) Also remove the standalone "explicit" reporter. (Which just calls nsMemoryReporterManager::GetExplicit().)
(3) Also remove nsIMemoryReporter.explicitNonHeap, which means that |nsIMemoryReporterManager.explicit| will get slower.
(4) Also remove |nsIMemoryReporterManager.explicit|, which will break some things like the endurance tests.

In this patch I've just done (1). It's a shame we won't have MEMORY_EXPLICIT any more, but it's always been horribly noisy and I'm not aware of anything useful that's ever come from it, so perhaps it's not a great loss. (Relatedly, Benoit is working on adding the telemetry memory measurements to SPS, but he said that "explicit" is no good because it does too much stuff with the GC.) I'm also sorely tempted to do (3) (which would also require (2)) to reduce some complexity; after all, it was added purely to make telemetry fast. (4) seems like overkill.
Attachment #736644 - Flags: review?(justin.lebar+bug)
Updated • 11 years ago (Assignee)
Assignee: nobody → n.nethercote
Comment 18 • 11 years ago
> (1) Remove the MEMORY_EXPLICIT telemetry measure.
I was going to suggest this when I first read the bug, so I guess we're in agreement. This has been a lot more trouble than it's worth, particularly given how our telemetry dashboards frequently lose data and are perpetually broken. Maybe one day when all of this is better, we can worry about this again.
Comment 19 • 11 years ago (Assignee)
This is step (2) from comment 17.
Attachment #736655 - Flags: review?(justin.lebar+bug)
Comment 20 • 11 years ago (Assignee)
This is step (3) from comment 17.

 24 files changed, 19 insertions(+), 283 deletions(-)

BTW, this will also fix 728990.
Attachment #736656 - Flags: review?(justin.lebar+bug)
Comment 21 • 11 years ago (Assignee)
> BTW, this will also fix 728990.

Sorry, it's actually bug 733374.
Updated • 11 years ago

Attachment #736644 - Flags: review?(justin.lebar+bug) → review+
Updated • 11 years ago

Attachment #736655 - Flags: review?(justin.lebar+bug) → review+
Comment 22 • 11 years ago
Comment on attachment 736656 [details] [diff] [review]
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth.

>diff --git a/xpcom/base/nsMemoryReporterManager.cpp b/xpcom/base/nsMemoryReporterManager.cpp
>--- a/xpcom/base/nsMemoryReporterManager.cpp
>+++ b/xpcom/base/nsMemoryReporterManager.cpp
>-  // For each multi-reporter we could call CollectReports and filter out the
>-  // non-explicit, non-NONHEAP measurements.  But that's lots of wasted work,
>-  // so we instead use GetExplicitNonHeap() which exists purely for this
>-  // purpose.
>-  //
>-  // (Actually, in debug builds we also do it the slow way and compare the
>-  // result to the result obtained from GetExplicitNonHeap().  This
>-  // guarantees the two measurement paths are equivalent.  This is wise
>-  // because it's easy for memory reporters to have bugs.  But there's an
>-  // exception if DMD is enabled, because that makes DMD think that all the
>-  // blocks are double-counted.)
>+  // For each multi-reporter we call CollectReports and filter out the
>+  // non-explicit, non-NONHEAP measurements.  That's lots of wasted work,
>+  // and we used to have a GetExplicitNonHeap() method which did this more
>+  // efficiently, but it ended up being more trouble than it was worth.

Would you mind citing the bug? This whole "you can use blame" business is pretty silly IMO.
Attachment #736656 - Flags: review?(justin.lebar+bug) → review+
Updated • 11 years ago (Assignee)

Whiteboard: [MemShrink]
Comment 23 • 11 years ago (Assignee)
I merged parts 2 and 3 before landing because I realized that part 2 by itself would have caused test failures, and I was too lazy to move the relevant test changes from part 3 to part 2.

https://hg.mozilla.org/integration/mozilla-inbound/rev/6ee0e740dc62
https://hg.mozilla.org/integration/mozilla-inbound/rev/02629c5c6d68
Comment 24 • 11 years ago (Assignee)
Bustage fix: https://hg.mozilla.org/integration/mozilla-inbound/rev/e30afdf18dd2
Comment 25 • 11 years ago
https://hg.mozilla.org/mozilla-central/rev/6ee0e740dc62
https://hg.mozilla.org/mozilla-central/rev/02629c5c6d68
https://hg.mozilla.org/mozilla-central/rev/e30afdf18dd2
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Comment 26 • 11 years ago
(In reply to Nicholas Nethercote [:njn] (away until June 3) from comment #17)
> (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
> things like the endurance tests.
>
> complexity; after all, it was added purely to make telemetry fast. (4)
> seems like overkill.

So does this mean that the property will still exist but contains a null/undefined value now? If that's the case I would say we remove it from our endurance tests.
Comment 27 • 11 years ago (Assignee)
> > (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
> > things like the endurance tests.
> >
> > complexity; after all, it was added purely to make telemetry fast. (4)
> > seems like overkill.
>
> So does this mean that the property will still exist but contains a
> null/undefined value now? If that's the case I would say we remove it from
> our endurance tests.
Step (4) wasn't done. So |nsIMemoryReporterManager.explicit| still works; it's just slower than it used to be. I'd keep it in the endurance tests unless you find it causes problems.