Closed Bug 966646 Opened 7 years ago Closed 7 years ago

Use JS helper threads for GC background sweeping / allocation

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla32

People

(Reporter: bhackett1024, Assigned: bhackett1024)

References

Details

Attachments

(2 files, 1 obsolete file)

Attached patch patch (obsolete) — Splinter Review
This eliminates the last thread creation done other than for the JS helper thread pool, which is now per process.  Doing this will allow DOM workers to share the use of these helper threads without causing any additional thread creation.
Attachment #8369092 - Flags: review?(wmccloskey)
Comment on attachment 8369092 [details] [diff] [review]
patch

Review of attachment 8369092 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this looks nice. I'd just like to see a few changes.

::: js/src/jsgc.cpp
@@ +1433,5 @@
> +        if (rt->gcChunkPool.wantBackgroundAllocation(rt)) {
> +            maybeLock.unlock();
> +            AutoLockWorkerThreadState lock;
> +            maybeLock.lock(rt);
> +            rt->gcHelperState.startBackgroundAllocationIfIdle();

I think we can improve this since we don't need to start the background allocation right away. Can you please make an auto class that gets pushed onto the stack when we enter this function (before the GC lock is taken)? When we decide we want background allocation, we would set a flag on the class. Then the destructor for the class would check the flag and, if set, take the locks in the correct order and call startBackgroundAllocationIfIdle().

@@ +2481,1 @@
>      for (;;) {

Can you please try simplifying this code? I don't think the loop is needed anymore. Also, we should be able to assert that the state is not IDLE at the beginning. And I think it would be a lot clearer to unconditionally set it to IDLE at the end, while also notifying on |done|.

As far as I can tell, the only way the state can change when it's not set to IDLE is if we set it to CANCEL_ALLOCATING. So perhaps we could also assert that the state is unchanged at the end of each case (except for allowing CANCEL_ALLOCATING in the ALLOCATING case).

::: js/src/vm/Runtime.h
@@ +1866,5 @@
>          runtime = rt;
>          rt->lockGC();
>      }
>  
> +    void unlock() {

Should be able to remove this.
Attachment #8369092 - Flags: review?(wmccloskey)
Attached patch updatedSplinter Review
Make the requested simplifications.  This also makes triggering a background GC thread infallible, which required some other modifications, mainly putting the |state| and |thread| members of GCHelperState under the worker thread state lock.
Assignee: nobody → bhackett1024
Attachment #8369092 - Attachment is obsolete: true
Attachment #8372492 - Flags: review?(wmccloskey)
Comment on attachment 8372492 [details] [diff] [review]
updated

Review of attachment 8372492 [details] [diff] [review]:
-----------------------------------------------------------------

Look great, thanks. Please remember to make sure this compiles with !JS_THREADSAFE.
Attachment #8372492 - Flags: review?(wmccloskey) → review+
Brian: is your JS helper thread patch ready to land? Bill r+'d it two months ago. :)
Flags: needinfo?(bhackett1024)
OS: Linux → All
Hardware: x86_64 → All
I've tried quite a bit actually to get this into landable shape but the only problem is these persistent timeouts that only happen on b2g ICS emulator opt (but not debug) and which I don't really have any way I know of to debug.  Which is pretty demoralizing and I don't have an ETA for landing this.
Flags: needinfo?(bhackett1024)
Gregor, can you recommend anyone on the B2G side that could investigate here?
Flags: needinfo?(anygregor)
(In reply to Brian Hackett (:bhackett) from comment #5)
> I've tried quite a bit actually to get this into landable shape but the only
> problem is these persistent timeouts that only happen on b2g ICS emulator
> opt (but not debug) and which I don't really have any way I know of to
> debug.  Which is pretty demoralizing and I don't have an ETA for landing
> this.

Do you have more info here? Is it always the same test? Is it across all mochitest suites?
Flags: needinfo?(anygregor)
(In reply to Gregor Wagner [:gwagner] from comment #7)
> (In reply to Brian Hackett (:bhackett) from comment #5)
> > I've tried quite a bit actually to get this into landable shape but the only
> > problem is these persistent timeouts that only happen on b2g ICS emulator
> > opt (but not debug) and which I don't really have any way I know of to
> > debug.  Which is pretty demoralizing and I don't have an ETA for landing
> > this.
> 
> Do you have more info here? Is it always the same test? Is it across all
> mochitest suites?

The timeouts/failures happen on all mochitest suites and marionette-webapi tests, everything else on all platforms works fine.  Here's the last try run I did showing the problem:

https://tbpl.mozilla.org/?tree=Try&rev=229eb8495595
Flags: needinfo?(anygregor)
Uh the parent process keeps crashing. That basically indicates that the parent is restarting:
13:05:44     INFO -  03-13 19:54:58.487    33    33 I ServiceManager: service 'media.resource_manager' died

Debug build is fine? Maybe some horrible race condition?
Flags: needinfo?(anygregor)
Do you have a new version of the patch? I see merge conflicts. I want to push it to the pine tree where we have improved crash stack reporting. Maybe we can find a stack for the crash.
Here is a try run for a rebased version of the patch, which still has the same issues:

https://tbpl.mozilla.org/?tree=Try&rev=ae189cf76bb7
Flags: needinfo?(anygregor)
(In reply to Brian Hackett (:bhackett) from comment #11)
> Here is a try run for a rebased version of the patch, which still has the
> same issues:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=ae189cf76bb7

I pushed to pine and it seems like the opt builds have problems with workers.
https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=21233bb8beaa
Flags: needinfo?(anygregor)
Meh seems like the worker bug is a regression from the merge :(
The device doesn't start with this patch as well but I couldn't get a stack yet. Should be the same problem as we see on the emulator.
Brian, I'd try this again after the patch in bug 1006695 lands.  Apparently Nuwa is unaware of JS helper threads, but was aware of the GCHelperThread, and I could imagine could cause some severe issues for this patch.
On the theory that bug 1006695 will fix the mysterious emulator problem, I've rebased the patch and pushed to try:
  https://tbpl.mozilla.org/?tree=Try&rev=229eb8495595

There have been a few GC refactorings I had to rebase over, so who knows how well my patch works, but it at least compiles and runs a little locally.
For good measure, here's an L64 debug run to make sure I didn't break everything:
  https://tbpl.mozilla.org/?tree=Try&rev=5544f49c29d7
Ack, I just did the exact same thing.  Try competition!

https://tbpl.mozilla.org/?tree=Try&rev=81ed8411b2dc
Blocks: 1014183
No longer blocks: 1014183
This introduced a build warning in both gcc and clang:

/js/src/jsapi-tests/../jsgc.h:1027:17: warning: inline function ‘void js::GCHelperState::startBackgroundAllocationIfIdle()’ used but never defined [enabled by default]

The function is used in GCRuntime.h, that includes jsgc.h (which doesn't contain the body). Not sure what the best fix is, so letting Brian decide: delete the inline keyword; move the definition to jsgc.h (but then, JSRuntime* is incompletely declared and used => error); or any other nice solution.
Flags: needinfo?(bhackett1024)
https://hg.mozilla.org/mozilla-central/rev/9c4c4356afce
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This push should fix the interrupt lock assertion (which can be a deadlock in release mode), we just check whether the current thread owns the interrupt lock when starting background allocation (startBackgroundAllocationIfIdle is, not surprisingly, already a best-effort sort of thing).  This should also fix the warning in comment 20 by removing the inline keyword.

https://hg.mozilla.org/integration/mozilla-inbound/rev/c3780924e7b4
Flags: needinfo?(bhackett1024)
Note that we also suspect that this led to a large increase in B2G emulator test timeouts (like 2-3 per push). In case it matters, the emulator is a) slow and b) single core. CCing the other sheriffs to keep an eye on whether they spike again post-comment 23.
Did anything happen to performance numbers in non-emulator builds?
AFAIK, eideticker is the perf monitoring we do on B2G. wlach?
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #26)
> AFAIK, eideticker is the perf monitoring we do on B2G. wlach?

So eideticker is only part of the picture; we run many more tests and more frequently through b2gperf and 'make test-perf' which upload to http://datazilla.org/b2g.

I'm not seeing any difference in the startup numbers reported by either that are attributable to this change, but of course there's more to performance than just that. I'll CC the fxos-perf team as a head's up.
Depends on: 1014458
OK, I give up.

This brings this bug back to square one, aka comment 5.  There really isn't anything in this patch which should cause tests to time out more often.  So an increase in the number of timeouts could be a problem with the patch, it could be a problem with the testing infrastructure, it could be a problem with how JS interacts with b2g, like the Nuwa thing that held this bug up for three months.  But without any information about how or why the tests are failing (how is it this hard to get a stack?) and apparently without anyone on the b2g side able to invest the time to figure out what could be wrong, the prospects of anyone figuring this out seem poor.
Assignee: bhackett1024 → nobody
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → WONTFIX
Depends on: 1015178
We can fix the no stack problem pretty easily. I filed bug 1015178 for it.
Nicolas, Gregor, can we get some help with the b2g timeouts? This is a pretty important bug and it's completely blocked on b2g.
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(anygregor)
From https://bugzilla.mozilla.org/show_bug.cgi?id=1015200#c0 it looks like this was wrongfully backed out. Ed, can this be re-landed as is?
Status: RESOLVED → REOPENED
Flags: needinfo?(emorley)
Resolution: WONTFIX → ---
jsfunfuzz found a hard-to-reduce (but easy to reproduce) testcase with the patch in comment 21, so please do not reland this until Brian has taken a look. Thanks!
Flags: needinfo?(emorley) → needinfo?(bhackett1024)
Gary, that's the same assertion as in bug 1014458.  Can you reproduce that in the most recent landing of this, revision c3780924e7b4, which was intended to fix that assertion?
(In reply to Andrew McCreight [:mccr8] from comment #35)
> Gary, that's the same assertion as in bug 1014458.  Can you reproduce that
> in the most recent landing of this, revision c3780924e7b4, which was
> intended to fix that assertion?

That assertion was fixed by the backout: https://bugzilla.mozilla.org/show_bug.cgi?id=1014458#c28

And yes, it went away when the backout landed, so I figure that the patch is the cause of the issue.
It went away when for the initial backout (revision 2619a4def1b9), but didn't come back when bhackett relanded the patch (revision c3780924e7b4), so that may fix your test case, too.
(In reply to Bill McCloskey (:billm) from comment #37)
> See comment 23, Gary.

Wasn't comment 23 backed out in comment 28?

Anyway, I'm a little confused due to the back and forth landings/backouts here.

I guess we can recheck the testcase when all the required patches land again.
Flags: needinfo?(bhackett1024) → needinfo?(emorley)
(In reply to Gary Kwong [:gkw] [:nth10sd] from comment #39)
> Wasn't comment 23 backed out in comment 28?

Yes, but this assertion should be fixed in the patch landed in c3780924e7b4, which was indeed backed out in comment 28.  If the patch in c3780924e7b4 ever relands then the assertion should not come up.
Are the patches attached to this bug up-to-date (seems like the code from the last push isn't attached here). Are you still working on this or should :mccr8 take over?
In comment 29, bhackett said he gives up and unassigned himself, which I think implies he is not working on it. ;)
Yeah but since then he calmed down and some other stuff happened. Also, there's some confusion now over what the current state is / what happened. Hence the question. 

If you're willing to take over the bug, please assign yourself, thanks! IMHO, this should be fixed properly instead of worked around (as in Bug 1014183) …
jandem: I think comment 32 highlights that this patch might not be the reason of the emulator timeouts.
We should wait for edmorley to confirm if this patch can re-land or not. (thx ahal for noticing)
Flags: needinfo?(nicolas.b.pierron)
I didn't see the failure rate that was responsible for backing this out, so can't tell if the failures seen in bug 1015200 are exactly the same, or only a subset of what was occurring before.

How about two try pushes, one with inbound tip and one with inbound tip + the patches here, using something like trychooser "try: -b do -p emulator -u reftest,crashtest -t none" - and then I'll retrigger a load of jobs and we can directly compare?
Flags: needinfo?(emorley)
The number of timeouts went down! Can we land this?
Flags: needinfo?(emorley)
I retriggered every one of those twice, just now, so we'll see if the picture changes with a bigger sample.
With my retriggers, there is 1 out of 35 timeouts with Brian's patch, and 4 out of 35 without.
I've requested a few more retriggers, but looking promising so far. These latest retriggers should be complete within 60-90 mins.
Flags: needinfo?(emorley)
Looks good to land :-)
(In reply to Ed Morley [:edmorley UTC+0] from comment #51)
> Looks good to land :-)

https://hg.mozilla.org/integration/mozilla-inbound/rev/93dce4b831f3

(Bug 1013663 made this not apply cleanly; pushed before it bitrots more.)
Flags: needinfo?(anygregor)
https://hg.mozilla.org/mozilla-central/rev/93dce4b831f3
Assignee: nobody → bhackett1024
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Depends on: 1018533
Depends on: 1019224
You need to log in before you can comment on or make changes to this bug.