Closed Bug 859657 Opened 11 years ago Closed 11 years ago

WorkerPrivate::BlockAndCollectRuntimeStats can block the main thread for several seconds

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla23

People

(Reporter: vladan, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink])

Attachments

(3 files)

(part 1) - Don't measure MEMORY_EXPLICIT in telemetry because it can cause pauses. 11 years ago Nicholas Nethercote [inactive] 2.53 KB, patch	justin.lebar+bug : review+	Details \| Diff \| Splinter Review
(part 2) - Remove the "explicit" single reporter. 11 years ago Nicholas Nethercote [inactive] 2.44 KB, patch	justin.lebar+bug : review+	Details \| Diff \| Splinter Review
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth. 11 years ago Nicholas Nethercote [inactive] 33.29 KB, patch	justin.lebar+bug : review+	Details \| Diff \| Splinter Review

Vladan Djeric (:vladan)

Reporter

Description

•

11 years ago

Telemetry chrome-hang data shows that this method can tie up the main thread for several seconds on Windows. Could we add Telemetry to get timings for this operation?

PR_WaitCondVar (in nspr4.pdb)
 -> mozilla::CondVar::Wait(unsigned int) (in xul.pdb)
 -> mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats(bool,void *) (in xul.pdb)
 -> mozilla::dom::workers::WorkerPrivate::MemoryReporter::GetExplicitNonHeap(__int64 *) (in xul.pdb)
 -> nsMemoryReporterManager::GetExplicit(__int64 *) (in xul.pdb)
 -> ExplicitReporter::GetAmount(__int64 *) (in xul.pdb)
 -> NS_InvokeByIndex_P (in xul.pdb)

Nicholas Nethercote [inactive]

Assignee

Comment 1

•

11 years ago

All the memory reporter stuff triggered when viewing about:memory can cause hangs like this.  But this is in response to a user-requested action, so it's not unexpected or even bad.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 2

•

11 years ago

The telemetry memory check happens like once every few seconds without any user interaction, right?

Nicholas Nethercote [inactive]

Assignee

Comment 3

•

11 years ago

Hmm.  The telemetry ping does much less work than the full report triggered from about:memory.  I thought this stack was for the full report, but now I see the GetExplicitNonHeap() in it, which means this *is* for the telemetry ping.  Bugger.

I think this is the relevant code:

  // If the worker is not already blocked (e.g. waiting for a worker event or
  // currently in a ctypes call) then we need to trigger the operation
  // callback to trap the worker.
  if (!mBlockedForMemoryReporter) {
    JS_TriggerOperationCallback(rt);

    // Wait until the worker actually blocks.
    while (!mBlockedForMemoryReporter) {
      mMemoryReportCondVar.Wait();
    }
  }

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 4

•

11 years ago

With the current approach for collecting memory info from a worker (trigger op callback, wait for the worker to block and acknowledge, collect memory info, unblock worker) the only thing we can do here is maybe try to increase the number of places that the JS engine checks the operation callback. Other than that we need an async memory report interface.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 5

•

11 years ago

We should fix it so that telemetry doesn't cause this to run, and then live with the blocking for when a user uses about:memory.

Nicholas Nethercote [inactive]

Assignee

Comment 6

•

11 years ago

> We should fix it so that telemetry doesn't cause this to run, and then live
> with the blocking for when a user uses about:memory.

Then the "explicit" measurement taken for telemetry pings would exclude all non-heap memory allocated by workers, which includes the JS heap and generated code.  Which would be quite a bummer.

Nicholas Nethercote [inactive]

Assignee

Comment 7

•

11 years ago

And anything else relying on nsMemoryReporterManager.explicit would also get a bogus value.  This includes the endurance tests (http://mozmill-ci.blargon7.com/#/endurance/charts?branch=22.0&platform=All&from=2013-03-14&to=2013-04-11), but AWSY wouldn't be affected.

I've considered just getting rid of nsMemoryReporterManager.explicit, because keeping it working is an ongoing pain.  The loss of explicit measurements in telemetry would be the main problem, though this bug is already putting them in jeopardy.

Nicholas Nethercote [inactive]

Assignee

Comment 8

•

11 years ago

bent, khuey: let's imagine that on a telemetry ping, the worker just had to read a single number from JSRuntime.  Would we need to wait on the condvar in that case?

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 9

•

11 years ago

If that value was being accessed/modified concurrently (and if JS is running in another thread then there is no way to avoid that) then we have to have some kind of synchronization or we'd sometimes be recording a bogus value. That seems like a deal-breaker, right?

Nicholas Nethercote [inactive]

Assignee

Comment 10

•

11 years ago

> If that value was being accessed/modified concurrently (and if JS is running
> in another thread then there is no way to avoid that)

Isn't JSRuntime single-threaded?

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 11

•

11 years ago

(In reply to Nicholas Nethercote [:njn] from comment #10)
> Isn't JSRuntime single-threaded?

Yes, but not really here.

The way memory reporting works currently is that a worker's JSRuntime is primarily only used on the worker thread, but when we report memory the worker blocks and then the main thread uses the worker's JSRuntime to collect memory stats.

We used to post a message to the worker to have it collect memory data on its own thread and then report back to the main thread later (though the main thread remained blocked while waiting for that response). The two approaches boil down to about the same thing though.

Either way the number we want is being modified on another thread. We either have to synchronize or risk reading bad data.

Nicholas Nethercote [inactive]

Assignee

Comment 12

•

11 years ago

Would wrapping the counter in JSRuntime in a mutex be better?  It's still synchronization, but more localized... (/me waves hands).

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 13

•

11 years ago

I think it would be a perf problem... Every time the worker thread wanted to update the number (like, every alloc and dealloc) it would have to grab the lock first.

Nicholas Nethercote [inactive]

Assignee

Comment 14

•

11 years ago

> Every time the worker thread wanted to
> update the number (like, every alloc and dealloc) it would have to grab the
> lock first.

It would only be needed for the mmap allocs and deallocs -- i.e. allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on 64-bit only), and the YARR bump allocator.  Such operations are relatively rare, much more so than malloc/free.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 15

•

11 years ago

Hm. That doesn't sound as bad... I guess we'd need to see numbers to be sure though.

Nicholas Nethercote [inactive]

Assignee

Comment 16

•

11 years ago

> allocating/freeing GC chunks, code chunks, the stack, asm.js arrays (on
> 64-bit only), and the YARR bump allocator

Aw geez, on Unix the "explicit" measurement of the stack uses mincore() to measure only the pages that are resident!  There's no way we can keep a per-JSRuntime counter accurate in the face of that.

I'm really tempted to just ditch |nsMemoryReporterManager.explicit| at this point.

Nicholas Nethercote [inactive]

Assignee

Comment 17

•

11 years ago

Attached patch (part 1) - Don't measure MEMORY_EXPLICIT in telemetry because it can cause pauses. — Details — Splinter Review

Heading down the "get rid of 'explicit'" path, I see four levels of change,
from smallest to largest.

(1) Remove the MEMORY_EXPLICIT telemetry measure.

(2) Also remove the standalone "explicit" reporter.  (Which just calls
    nsMemoryReporterManager::GetExplicit().)

(3) Also remove nsIMemoryReporter.explicitNonHeap, which means that
    |nsIMemoryReportermanager.explicit| will get slower.

(4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
    things like the endurance tests.

In this patch I've just done (1).  It's a shame we won't have MEMORY_EXPLICIT
any more, but it's always been horribly noisy and I'm not aware of anything
useful that's ever come from it, so perhaps it's not a great loss.

(Relatedly, Benoit is working on adding the telemetry memory measurements to
SPS, but he said the "explicit" is no good because it does too much stuff with
the GC.)

I'm also sorely tempted to do (3) (which would also require (2)) to reduce some
complexity;  after all, it was added purely to make telemetry fast.  (4) seems
like overkill.

Attachment #736644 - Flags: review?(justin.lebar+bug)

Nicholas Nethercote [inactive]

Assignee

Updated

•

11 years ago

Assignee: nobody → n.nethercote

Justin Lebar (not reading bugmail)

Comment 18

•

11 years ago

> (1) Remove the MEMORY_EXPLICIT telemetry measure.

I was going to suggest this when I first read the bug, so I guess we're in agreement.  This has been a lot more trouble than it's worth, particularly given how our telemetry dashboards frequently lose data and are perpetually broken.  Maybe one day when all of this is better, we can worry about this again.

Nicholas Nethercote [inactive]

Assignee

Comment 19

•

11 years ago

Attached patch (part 2) - Remove the "explicit" single reporter. — Details — Splinter Review

This is step (2) from comment 17.

Attachment #736655 - Flags: review?(justin.lebar+bug)

Nicholas Nethercote [inactive]

Assignee

Comment 20

•

11 years ago

Attached patch (part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth. — Details — Splinter Review

This is step (3) from comment 17.

24 files changed, 19 insertions(+), 283 deletions(-)

BTW, this will also fix 728990.

Attachment #736656 - Flags: review?(justin.lebar+bug)

Nicholas Nethercote [inactive]

Assignee

Updated

•

11 years ago

Blocks: 733374

Nicholas Nethercote [inactive]

Assignee

Comment 21

•

11 years ago

> BTW, this will also fix 728990.

Sorry, it's actually bug 733374.

Justin Lebar (not reading bugmail)

Updated

•

11 years ago

Attachment #736644 - Flags: review?(justin.lebar+bug) → review+

Justin Lebar (not reading bugmail)

Updated

•

11 years ago

Attachment #736655 - Flags: review?(justin.lebar+bug) → review+

Justin Lebar (not reading bugmail)

Comment 22

•

11 years ago

Comment on attachment 736656 [details] [diff] [review]
(part 3) - Remove nsIMemoryMultiReporter.explicitNonHeap because it's more trouble than it's worth.

>diff --git a/xpcom/base/nsMemoryReporterManager.cpp b/xpcom/base/nsMemoryReporterManager.cpp
>--- a/xpcom/base/nsMemoryReporterManager.cpp
>+++ b/xpcom/base/nsMemoryReporterManager.cpp

>-    // For each multi-reporter we could call CollectReports and filter out the
>-    // non-explicit, non-NONHEAP measurements.  But that's lots of wasted work,
>-    // so we instead use GetExplicitNonHeap() which exists purely for this
>-    // purpose.
>-    //
>-    // (Actually, in debug builds we also do it the slow way and compare the
>-    // result to the result obtained from GetExplicitNonHeap().  This
>-    // guarantees the two measurement paths are equivalent.  This is wise
>-    // because it's easy for memory reporters to have bugs.  But there's an
>-    // exception if DMD is enabled, because that makes DMD think that all the
>-    // blocks are double-counted.)
>+    // For each multi-reporter we call CollectReports and filter out the
>+    // non-explicit, non-NONHEAP measurements.  That's lots of wasted work,
>+    // and we used to have a GetExplicitNonHeap() method which did this more
>+    // efficiently, but it ended up being more trouble than it was worth.

Would you mind citing the bug?  This whole "you can use blame" business is
pretty silly IMO.

Attachment #736656 - Flags: review?(justin.lebar+bug) → review+

Nicholas Nethercote [inactive]

Assignee

Updated

•

11 years ago

Whiteboard: [MemShrink]

Nicholas Nethercote [inactive]

Assignee

Comment 23

•

11 years ago

I merged parts 2 and 3 before landing because I realized that part 2 by itself would have caused test failures, and I was too lazy to move the relevant test changes from part 3 to part 2.

https://hg.mozilla.org/integration/mozilla-inbound/rev/6ee0e740dc62
https://hg.mozilla.org/integration/mozilla-inbound/rev/02629c5c6d68

Nicholas Nethercote [inactive]

Assignee

Comment 24

•

11 years ago

Bustage fix:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e30afdf18dd2

Ryan VanderMeulen [:RyanVM]

Comment 25

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/6ee0e740dc62
https://hg.mozilla.org/mozilla-central/rev/02629c5c6d68
https://hg.mozilla.org/mozilla-central/rev/e30afdf18dd2

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla23

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 26

•

11 years ago

(In reply to Nicholas Nethercote [:njn] (away until June 3) from comment #17)
> (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
>     things like the endurance tests.
> 
> complexity;  after all, it was added purely to make telemetry fast.  (4)
> seems like overkill.

So does this mean that the property will still exist but contains a null/undefined value now? If that's the case I would say we remove it from our endurance tests.

Nicholas Nethercote [inactive]

Assignee

Comment 27

•

11 years ago

> > (4) Also remove |nsIMemoryReporterManager.explicit|, which will break some
> >     things like the endurance tests.
> > 
> > complexity;  after all, it was added purely to make telemetry fast.  (4)
> > seems like overkill.
> 
> So does this mean that the property will still exist but contains a
> null/undefined value now? If that's the case I would say we remove it from
> our endurance tests.

Step (4) wasn't done.  So |nsIMemoryReporterManager.explicit| still works;  it's just slower than it used to be.  I'd keep it in the endurance tests unless you find it causes problems.

You need to log in before you can comment on or make changes to this bug.