Closed Bug 813867 Opened 12 years ago Closed 11 years ago

Report memory for web workers that use ctypes

Categories

(Core :: XPCOM, defect)

x86_64
Linux
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla20
blocking-basecamp -
Tracking Status
firefox17 --- wontfix
firefox18 --- wontfix
firefox19 --- wontfix
firefox20 --- fixed
firefox21 --- fixed
b2g18 19+ fixed
b2g18-v1.0.0 --- affected

People

(Reporter: n.nethercote, Assigned: bent.mozilla)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(3 files, 2 obsolete files)

If a web worker uses ctypes its memory reporter gives up and does nothing.  This misses a non-trivial amount of memory for standard Firefox workers like resource:///modules/PageThumbsWorker.js and the async-file-io worker (whose name I can't remember now) -- just the JSRuntime for a worker is several hundred KiB.

The reason the memory reporter gives up is that some workers using ctypes do things like sit in blocking IO poll loops (we had this problem specifically with the LastPass extension, bug 679551). Our current implementation of memory reporters on workers blocks the main thread while we wait for the worker to respond to the memory request, and if the worker is blocked then the process will deadlock.

The way to fix this is with an asynchronous memory reporting API.  bent suggested one in bug 673323 comment 4.  Here's a tweaked version that would match the existing multi-reporter interface more closely:

  interface nsIMemoryAsyncMultiReporter : nsIMemoryMultiReporter {
    void cancelCollection();
  };

  interface nsIMemoryReporterManager  {
    ...

    void registerAsyncMultiReporter(in nsIMemoryAsyncMultiReporter reporter);
    void unregisterAsyncMultiReporter(in nsIMemoryAsyncMultiReporter reporter);

    void enumerateAsyncMultiReporters();
  };

You'd call collectReports() as usual on the async reporter.  The difference is that if it hadn't completed within a certain time limit, it would be interrupted and cancelCollection() would be called.  Maybe collectReports() would be modified for async reporters by adding an extra arg that specifies the time limit.

I'm happy to implement this, but my understanding of timers and how this interrupting would work is very hazy.
I just tried experimenting with nsITimer::InitWithFuncCallback, but I don't think it's powerful enough.  I guess it relies on control getting back to the event loop?  It certainly wasn't able to break out of an infinite loop.

So, what next?  Something signal-based, like setitimer()?  It's currently only used in Mozilla code within some 3rd-party code (rtc, skia, chromium) and some profiling code (jprof, tools/profiler/).
> I just tried experimenting with nsITimer::InitWithFuncCallback, but I don't think its 
> powerful enough.  I guess it relies on control getting back to the event loop?

Yes.

> Something signal-based, like setitimer()?

Would this require us to audit all invocations of blocking functions in workers to ensure that they handle EINTR properly?

I don't understand this problem space, but would it be possible for us to spin up a new thread which tries to acquire a lock protecting the worker's JS runtime?  The worker releases this lock when it's blocked on I/O, and when it reaches the event loop.
What's wrong with poking the OperationCallback like existing worker control messages do?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #3)
> What's wrong with poking the OperationCallback like existing worker control
> messages do?

That happens automatically as soon as you post a control runnable to the worker (as the memory reporter stuff does now).
Right, so what's the problem here?

If you want a timer to cancel the runnable you receive the timer callback on the main thread and set a flag on the runnable telling it not to do anything if/when it finally gets to run.
Whiteboard: [MemShrink] → [MemShrink:P2]
jlebar found 4 unreported workers on B2G. It looks like net_worker.js, ril_worker.js, and wifi_worker.js all use ctypes. wifi_worker seems to get spawned twice, so it seems like this bug is probably responsible for the unreported worker-related memory he is seeing.
Blocks: 820906
Depends on: 813876
No longer depends on: 813876
Assignee: nobody → justin.lebar+bug
Attached patch partial, hacky patch (obsolete) — Splinter Review
Here's what I have so far.  (Sorry, I should have assigned this to myself.)

The basic idea is to create a new kind of reporter, nsIAsyncMemoryMultiReporter.  Its interface probably needs to change; I was kind of feeling my way with this stuff.  I modified aboutMemory.js to handle this kind of reporter, and I'm more confident about that part.  I also have two instances, OkAsyncMemoryMultiReporter and LoopyAsyncMemoryMultiReporter, that I was using for testing; the latter has an infinite loop, so it should time out.

IIRC a problem with this code in its current form is that the timer used for the async reporters always runs to completion, even if all the async reporters have finished.  That clearly needs fixing.

I'm not particularly attached to this code, and it uses lots of pieces of Gecko that I'm not familiar with.  If you can see a better way to do it then go right ahead!
I'll post an ugly WIP in a minute, but this is from my phone:

> ├──11.59 MB (30.47%) -- workers/workers()
> │  ├───6.26 MB (16.45%) -- worker(resource://gre/modules/ril_worker.js, 43aec400)
> │  │   ├──2.37 MB (06.24%) -- runtime
> │  │   │  ├──0.79 MB (02.07%) ── script-sources
> │  │   │  ├──0.65 MB (01.71%) ── temporary
> │  │   │  ├──0.52 MB (01.36%) ── jaeger-code
> │  │   │  └──0.42 MB (01.09%) ++ (11 tiny)
> │  │   ├──2.22 MB (05.83%) -- compartment(web-worker)
> │  │   │  ├──1.31 MB (03.43%) -- type-inference
> │  │   │  │  ├──1.16 MB (03.04%) ── analysis-pool [2]
> │  │   │  │  └──0.15 MB (00.39%) ++ (2 tiny)
> │  │   │  ├──0.46 MB (01.20%) ++ (6 tiny)
> │  │   │  └──0.45 MB (01.19%) ++ gc-heap
> │  │   ├──1.41 MB (03.70%) -- gc-heap
> │  │   │  ├──1.38 MB (03.61%) ── unused-arenas
> │  │   │  └──0.03 MB (00.08%) ++ (3 tiny)
> │  │   └──0.26 MB (00.69%) ++ compartment(web-worker-atoms)
> │  ├───2.78 MB (07.30%) -- worker(resource://gre/modules/wifi_worker.js, 42bcb800)
> │  │   ├──1.75 MB (04.61%) -- gc-heap
> │  │   │  ├──1.72 MB (04.53%) ── unused-arenas
> │  │   │  └──0.03 MB (00.08%) ++ (3 tiny)
> │  │   ├──0.47 MB (01.25%) ++ compartment(web-worker)
> │  │   ├──0.45 MB (01.17%) ++ runtime
> │  │   └──0.10 MB (00.27%) ++ compartment(web-worker-atoms)
> │  ├───2.55 MB (06.71%) -- worker(resource://gre/modules/net_worker.js, 44780c00)
> │  │   ├──1.73 MB (04.55%) -- gc-heap
> │  │   │  ├──1.70 MB (04.47%) ── decommitted-arenas
> │  │   │  └──0.03 MB (00.08%) ++ (3 tiny)
> │  │   └──0.82 MB (02.17%) ++ (3 tiny)
Unfortunately with a 1s timeout it looks like we're giving up on multiple workers.  Obviously that's not ideal.
And I should say, it seems to take these two particular workers ~60s before they unblock and run my memory report runnable.
Yeah, these workers spend a lot of time blocking on FDs.
Would it be useful to indicate in about:memory that there were workers that failed to report back? Assuming you can do that.
(In reply to Andrew McCreight [:mccr8] from comment #15)
> Would it be useful to indicate in about:memory that there were workers that
> failed to report back? Assuming you can do that.

We currently report them with 0 MB memory used, which causes them to be hidden.  We could change that, of course.

But really, I want to get memory info out of these blocked workers...
Attached patch Very draft patch, v1 (obsolete) — Splinter Review
This is very rough, but I'm posting it mostly in case anyone has feedback on
the general approach of:

* enqueue a control runnable to the worker, then
* spin a nested event loop waiting for the worker to execute this runnable.

If B2G workers are commonly blocked, getting memory reports only from the
workers that aren't currently blocked on an fd seems pretty lame to me.

I wonder if there's something better we can do here.
> But really, I want to get memory info out of these blocked workers...

You've probably already considered this, but what if

* We have a mutex M.
* When a worker starts running, it acquires a lock on M.
* When the worker calls a ctypes function, we release M.  When the ctypes call completes, it acquires M.
* When we want to get a memory report from the worker, we 

  * try to acquire a lock on M, and
  * enqueue a worker control task to get the memory report, as we do currently

Whichever one completes first gives us the memory report, and we cancel the other one.

I waved my hands a bit about the synchronization (we'd probably have to use a condvar), but does the general idea work?
(In reply to Justin Lebar [:jlebar] from comment #18)
> but does the general idea work?

It sounds basically like what we do already. Dispatching a control runnable short-circuits any running JS and immediately processes the runnable (using JS_TriggerOperationCallback and a condvar), so if the worker isn't blocked we'll report memory immediately.

If the worker is blocked then we can't do anything to unblock it, so the runnable will wait until the next point where the JS engine can bail out for the operation callback.
> > but does the general idea work?
>
> It sounds basically like what we do already.

I see.  I didn't realize that the control runnable runs immediately if the worker isn't in ctypes, so presumably we could wait N ms for the control runnable to complete and that would be equivalent to what I posted.

bent, jorendorff, and I agreed that comment 18 is basically our best bet here.  We just need to have a conversation with sicking about relative priorities.
Assignee: justin.lebar+bug → nobody
One of the motivations for this bug is that Firefox uses workers itself by default, e.g. osfile_async_worker.js and resource:///modules/PageThumbsWorker.js.

I wonder if it would be worth having a whitelist of workers that use ctypes that we know don't cause problems for the memory reporters...
> I wonder if it would be worth having a whitelist of workers that use ctypes that we know 
> don't cause problems for the memory reporters...

Given that none of us is reviewing that worker code, and given that we run worker memory reporters during telemetry, I'm not sure I'd want to take that risk, myself.
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
blocking-basecamp: --- → ?
Nominating for blocking since we suspect there might be very high percentages of our main process being consumed by workers. Comment 11 indicates upwards of 30% of all parent-process memory, if I understand correctly?
(In reply to Jonas Sicking (:sicking) from comment #23)
> Comment 11 indicates up towards 30% of all parent process
> memory if I understand correctly?

Yes.  I'd go further than saying we suspect that a lot of our main-process memory is workers that use ctypes; I'm relatively sure that the report in comment 11 isn't bogus.  (It's also missing one worker, I think.)

But I don't think this is a blocker because

 * we don't have a clear path towards fixing this bug,
 * no user will notice whether we've fixed this bug or not, and
 * we can measure memory allocated in workers which use ctypes using DMD, although it's difficult and imprecise.  (And critically, DMD requires a special build, while about:memory does not.)

Having said that, I'd really like us to fix this bug, and if we can fix it before we release, that's even better.  This bug is important because if one of these workers starts using a lot of memory for whatever reason, we can't tell unless we're running a special DMD build.  High memory usage here is potentially catastrophic -- you may not be able to do anything to the phone until you reboot it, because the memory is tied up in the main process.  So if a tester or user hits a case where one of these workers is using a lot of memory, we have no way to understand what's going on without this bug fixed.

If that's scary enough to block, please go for it!
What Justin said.  As the bug is written up, I don't think we want to block on this.  It is important to understand whether we are leaking or using too much memory before we ship v1, but as for reporting it, we can live without it.
blocking-basecamp: ? → -
Actually, I agree with that too. I actually intended that this be marked as a "softblocker", but it appears that it never got evaluated for that.

So I'll just move it to that state myself :) Feel free to renom, or minus, if you disagree.
tracking-b2g18: --- → +
Attached patch Patch, v1Splinter Review
Ok, this should work.

jorendorff, can you please look at the js/src changes? It's the basic ctypes callback API that we discussed on irc.

khuey, can you take a look at the dom/workers changes? We may need to sit down and go through it together, but the basic idea is to do all memory reporting on the main thread while the worker is blocked. There are now two flags that make this all work: mMemoryReporterRunning is how the main thread tells the worker that a memory report is requested, and mBlockedForMemoryReporter is how the worker thread acknowledges that it is safe to proceed. That blocking can happen when the worker is waiting for events, when the worker is currently calling into ctypes, or when we trap it in the operation callback.

jlebar, I'd love it if you could test this on b2g!
Attachment #692020 - Attachment is obsolete: true
Attachment #692110 - Attachment is obsolete: true
Attachment #696539 - Flags: review?(khuey)
Attachment #696539 - Flags: review?(jorendorff)
Attachment #696539 - Flags: feedback?(justin.lebar+bug)
I had to change the patch as follows in order to get this to compile on my machine.  (I was getting "can't static_cast char[21] to void*".)  I think I found a version of gecko which doesn't BSoD on B2G; I'll post a memory report if it works this time.

diff -u b/dom/workers/WorkerPrivate.cpp b/dom/workers/WorkerPrivate.cpp
--- b/dom/workers/WorkerPrivate.cpp
+++ b/dom/workers/WorkerPrivate.cpp
@@ -1431,7 +1431,8 @@
     aCompartmentStats->extra1 = strdup(cJSPathPrefix.get());
 
     // This should never be used when reporting with workers (hence the "?!").
-    aCompartmentStats->extra2 = static_cast<void*>("explicit/workers/?!/");
+    static const char* bogusMemoryReporterPath = "explicit/workers/?!/";
+    aCompartmentStats->extra2 = const_cast<char*>(bogusMemoryReporterPath);
   }
 };
I took this memory dump right after rebooting the phone.  The first-run screen was showing.

We have 15% heap-unclassified in the main process, which is by far the lowest I've ever seen on B2G.
Comment on attachment 696539 [details] [diff] [review]
Patch, v1

My only regret is that I have but one + to give to this patch.
Attachment #696539 - Flags: feedback?(justin.lebar+bug) → feedback+
(In reply to Justin Lebar [:jlebar] (away 12/21-1/2) from comment #30)

Great! Thanks for checking :)
> My only regret is that I have but one + to give to this patch.

You'll get to aurora+ and b2g18+ it soon enough :)
Comment on attachment 696539 [details] [diff] [review]
Patch, v1

Review of attachment 696539 [details] [diff] [review]:
-----------------------------------------------------------------

It looks pretty good, but I still want to talk about a few things.

::: dom/workers/WorkerPrivate.cpp
@@ +2850,5 @@
> +      // Wait until the worker actually blocks.
> +      while (!mBlockedForMemoryReporter) {
> +        mMemoryReportCondVar.Wait();
> +      }
> +    }

I'm not thrilled about blocking the main thread waiting for the operation callback to run here.  We should talk through the implications of this.

@@ +2868,1 @@
>    }

Are the JS people committed to allowing us to do this from another thread?

@@ +2886,5 @@
> +{
> +  // This may happen on the worker thread or the main thread.
> +  MutexAutoLock lock(mMutex);
> +
> +  NS_ASSERTION(mMemoryReporterAlive, "Must be alive!");

This assertion will fire if NS_RegisterMemoryMultiReporter fails.

@@ +2912,5 @@
> +    NS_WARNING("Failed to register memory reporter!");
> +    // No need to lock here since a failed registration means our memory
> +    // reporter can't start running. Just clean up.
> +    mMemoryReporterAlive = false;
> +    mMemoryReporter = nullptr;

So you should just not set mMemoryReporterAlive here.

@@ +2948,5 @@
> +  if (NS_FAILED(NS_UnregisterMemoryMultiReporter(memoryReporter))) {
> +    // If this fails then the memory reporter will probably never die and we'll
> +    // hang below waiting for it. If this ever happens we need to fix
> +    // NS_UnregisterMemoryMultiReporter.
> +    NS_ERROR("Failed to unregister memory reporter! Worker is going to hang!");

I think this should be a hard abort.  If this ever starts happening we want to see it in crashstats.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #33)
> I'm not thrilled about blocking the main thread waiting for the operation
> callback to run here.  We should talk through the implications of this.

It's what we do today already, though (just not for workers that use ctypes). The memory report runnable that we dispatch currently blocks the main thread until the operation callback processes the control runnable.

> Are the JS people committed to allowing us to do this from another thread?

I sure hope so! It only makes sense to call it from a different thread.

> This assertion will fire if NS_RegisterMemoryMultiReporter fails.

Yeah, can fix.

> I think this should be a hard abort.  If this ever starts happening we want
> to see it in crashstats.

Ok.

Anything else?
(In reply to ben turner [:bent] from comment #34)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #33)
> > I'm not thrilled about blocking the main thread waiting for the operation
> > callback to run here.  We should talk through the implications of this.
> 
> It's what we do today already, though (just not for workers that use
> ctypes). The memory report runnable that we dispatch currently blocks the
> main thread until the operation callback processes the control runnable.

Bleh, ok.

> > Are the JS people committed to allowing us to do this from another thread?
> 
> I sure hope so! It only makes sense to call it from a different thread.

You should verify that with them :-P
Attachment #696539 - Flags: review?(jorendorff) → review+
Comment on attachment 696539 [details] [diff] [review]
Patch, v1

Review of attachment 696539 [details] [diff] [review]:
-----------------------------------------------------------------

Oops, I marked this r+ but suddenly I have a few questions about GC worker threads.
Attachment #696539 - Flags: review+ → review?(jorendorff)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #35)
> > > Are the JS people committed to allowing us to do this from another thread?
> > 
> > I sure hope so! It only makes sense to call it from a different thread.
> 
> You should verify that with them :-P

This is the point I'm not sure about.

This code treats the worker's whole JSRuntime as a monolithic non-thread-safe data structure and adds locking around it. That's a fine plan, and totally plausible... except that the JSRuntime does have some multithreading going on internally: a GC helper thread. I'm worried that causing the worker (i.e. the JSRuntime's home thread) to block and stop paying attention to its GC helper thread could lead to deadlock. But maybe only because I don't know much about that code.

I don't yet see what mechanism or API guarantee makes it safe. Two locking schemes (one in your new code, one inside the JS engine) that have no detailed knowledge of each other are both protecting the same data.
Worker threads don't use the separate gc thread.
Comment on attachment 696539 [details] [diff] [review]
Patch, v1

Transferring review to jonco. The GC helper thread code isn't terribly well-owned.
Attachment #696539 - Flags: review?(jorendorff) → review?(jcoppeard)
Comment on attachment 696539 [details] [diff] [review]
Patch, v1

Switching back to r+ given comment 38. This is fine, go for it.
Attachment #696539 - Flags: review?(jcoppeard) → review+
Perhaps we should add some assertions that all the ancillary threads are turned off if we call this on a different thread though?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #41)
> Perhaps we should add some assertions that all the ancillary threads are
> turned off if we call this on a different thread though?

That's probably a good idea. I could imagine that in the future we might want to turn on various JS helper threads for workers to get additional parallelism.
(In reply to Andrew McCreight [:mccr8] from comment #42)
> That's probably a good idea. I could imagine that in the future we might
> want to turn on various JS helper threads for workers to get additional
> parallelism.

Anyone disagree that this can be done in a followup? I'd like to land this asap.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #41)
> Perhaps we should add some assertions that all the ancillary threads are
> turned off if we call this on a different thread though?

Bug 826818.
Comment on attachment 696539 [details] [diff] [review]
Patch, v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 679551 (kinda... We basically have never been able to gather memory reports from workers that use ctypes reliably)
User impact if declined: All workers that use ctypes will report 0 memory used.
Testing completed: tryserver was totally green
Risk to taking this patch (and alternatives if risky): It's possible that this could cause deadlocks if I goofed somewhere. No testing thus far has revealed anything so I guess we can't know for sure until we have some clean hang data.
String or UUID changes made by this patch: None
Attachment #696539 - Flags: approval-mozilla-b2g18?
Attachment #696539 - Flags: approval-mozilla-aurora?
Comment on attachment 696539 [details] [diff] [review]
Patch, v1

This is pretty critical stuff, IMO.
Attachment #696539 - Flags: approval-mozilla-b2g18?
Attachment #696539 - Flags: approval-mozilla-b2g18+
Attachment #696539 - Flags: approval-mozilla-aurora?
Attachment #696539 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/4b2611eed98a
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Ben, is this (along with bug 827274 presumably) intended to land on b2g18 as well?
(In reply to Ryan VanderMeulen from comment #50)
> Ben, is this (along with bug 827274 presumably) intended to land on b2g18 as
> well?

They don't apply cleanly (may require some other patches to be merged first) so I'll handle the uplift if it all gets approved. Thanks.
Attachment #696539 - Flags: approval-mozilla-b2g18+
Attached patch Patch for b2g-18Splinter Review
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 679551 (kinda... We basically have never been able to gather memory reports from workers that use ctypes reliably)
User impact if declined: All workers that use ctypes will report 0 memory used. This looks to be about 25% of the main process under normal conditions. If we don't take this patch on b2g-18 then we will not be able to diagnose memory problems in the very critical system workers (phone, wifi, netd) on the devices that we actually ship.
Testing completed: m-c now for a while, also m-a & m-b.
Risk to taking this patch (and alternatives if risky): Now that this has baked on three other branches for a while we know it's pretty safe as long as we also land the one additional fix (bug 827274).
String or UUID changes made by this patch: None
Attachment #703563 - Flags: approval-mozilla-b2g18?
Based on bug 827274 comment 12 I'm going to back this out of mozilla-beta.
Given that this is non-blocking and makes a pretty significant amount of change, we'll hold off on approving for branch landing until after v1.0.0 -- see https://wiki.mozilla.org/Release_Management/B2G_Landing, which explains further. The options are either to land on the date branch now so that it gets merged into 1.0.1 after 1/25, or to wait and get b2g18 approval after 1/25 for direct landing to that branch.
See bug 837187 as an example of the sort of bug report that is made difficult without this patch.
Attachment #703563 - Flags: approval-mozilla-b2g18? → approval-mozilla-b2g18+
I backed this out of b2g18 because it caused a perma-orange on linux pgo (only)... https://hg.mozilla.org/releases/mozilla-b2g18/rev/2035414073b7