Closed Bug 605458 Opened 14 years ago Closed 14 years ago

Intermittent timeout in storage/test/unit/test_statement_executeAsync.js

Categories

(Toolkit :: Storage, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mak, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [test which aborts the suite][fixed by bug 606052])

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287494556.1287496589.20600.gz

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard_test-xpcshell/build/xpcshell/tests/storage/test/unit/test_statement_executeAsync.js | running test ...
command timed out: 1200 seconds without output, killing pid 369
process killed by signal 9
Summary: random timeout in storage/test/unit/test_statement_executeAsync.js → Intermittent timeout in storage/test/unit/test_statement_executeAsync.js
This might be because of bug 605314 which we will hopefully get approval to land the fix for soon.
We're thinking this was not bug 605314 but a regression related to the new fake idle service from bug 602871 that is OS X specific (all the failures are OS X specific thus far).  This concurs with a number of test failures we are experiencing in comm-central.
the fix in bug 602871 is not OS X specific.
also, the first failure here is 2010/10/19 06:22:36, while bug 602871 has been pushed at Tue Oct 19 11:18:40 2010. I can't see a relation.
(In reply to comment #9)
> also, the first failure here is 2010/10/19 06:22:36, while bug 602871 has been
> pushed at Tue Oct 19 11:18:40 2010.

I lied, the time of the push made me crazy :\ the times could even be coherent, but I'm unsure about the relation.
The rationale for the suspicion is that we have some tests in comm-central that are also of an asynchronous nature that reproducibly hang only on OS X but that succeed when the idle service is backed out.  I agree from looking at the patch and the rather slender OS X idle implementation that it is not clear why such a thing would be the case.  My little mac mini is building its heart out in an attempt to let me reproduce locally and debug (Standard8 had reproduced on his OS X box.)
If you could debug it would be awesome, I only have a 10.6.x box and all of these failures are on 10.5.8.
Actually the idle service implementation "should" not matter, we replace it on start of the test.  The fake implementation is pretty much no-op.
What could happen is that we throw when trying to restore the factory on shutdown, but since xpcshell is just telling us it timed out, without printing any output, it's hard to tell.
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test xpcshell on 2010/10/19 20:35:26
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287545726.1287547713.9497.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test xpcshell on 2010/10/19 20:45:23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287546323.1287548225.11855.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test xpcshell on 2010/10/19 20:50:15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287546615.1287548570.13154.gz
I am able to reproduce this on 10.6 (although targeting 10.5), which is good because the 10.5 box I procured is not having much success building trunk.

It looks like there is some sort of race related to the manual event loop spinning, likely due to the native platform event loop stuff.  As to my previous fake idle service accusation, it's conceivable the idle service was de-wedging things by periodically inserting events?
(In reply to comment #30)
> It looks like there is some sort of race related to the manual event loop
> spinning, likely due to the native platform event loop stuff.  As to my
> previous fake idle service accusation, it's conceivable the idle service was
> de-wedging things by periodically inserting events?
I don't think it ever used events for things.  Although, consumers may have...

Does this look like a legit storage bug then?
idle service is firing periodic checks of idle time and polling from os, but I don't seem to recall it firing events.
When I say event, I mean a runnable posted to the event loop.

This looks like a bug in the OS X event loop logic, at least when used for manually pumped event loops that terminate when a global is set (which is a common testing idiom).  The idle service causing a neverending stream of runnables to be posted would have shielded us from the bug.

I will look into this a bit more and recategorize into Core/Widgets once I can describe the race better or provide a patch.
Asuth, from what you say, we could temporarily make fakeIdleService shield us through runnables. If that's easy to do I'd vote to take that as a orange-stopper, and then look for a proper fix. Also if this ends up being a platform bug (looks like it is since it's 10.5.x only) it could be tricky and long to get a fix.
Per my theory, yes, if the fake idle service decided to schedule a recurring timer (targeted at the main thread) that did absolutely nothing, that should be a reasonable workaround and at the very least have no negative impact besides a minor drag on testing speed.  I am not going to pursue that so if you would like to do so, please do so.
Thanks to dtrace I see what is going on.  The nested event loop triggered by the weird OS X event mechanism is spinning and eating all the gecko events.  So when the control flow returns back up to nsBaseAppShell::OnProcessNextEvent it says "hey! no events and I'm allowed to wait, let's call back into the native code!".

I think I can provide a non-hacky semantic fix.  It seems like calling nsIThread:processNextEvent() should never allow a blocking wait to occur after having processed at least one gecko event.  Allowing multiple gecko events to be processed if the are already in the queue seems reasonable enough that I won't try and change that.

I am going to spin off a new bug in core/widgets - os x if there isn't one that covers this already.
I have a patch up on bug 606052 that solves the problem as I understand it and makes the problem go away for me locally.  I have requested review for the patch from the module owner.  I am not fully up on the facilitating orange bustage fixes thing for mozilla-central so if there is more that should be done, please feel free to facilitate that.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287779420.1287781360.11535.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test xpcshell

TEST-INFO | /Users/cltbld/talos-slave/mozilla-central_leopard_test-xpcshell/build/xpcshell/tests/storage/test/unit/test_statement_executeAsync.js | running test ...

command timed out: 1200 seconds without output, killing pid 302
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test xpcshell on 2010/10/24 17:38:53
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287967133.1287968800.1015.gz&fulltext=1
Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central debug test xpcshell on 2010/10/24 17:31:03
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287966663.1287968939.1563.gz&fulltext=1
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test xpcshell on 2010/10/24 17:39:34
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287967174.1287968939.1558.gz&fulltext=1
Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test xpcshell on 2010/10/25 12:43:19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1288035799.1288037817.9905.gz&fulltext=1
Asuth fixed it in bug 606052 and X on os x has never been so green!

Thanks!
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Whiteboard: [orange][test which aborts the suite] → [orange][test which aborts the suite][fixed by bug 606052]
Whiteboard: [orange][test which aborts the suite][fixed by bug 606052] → [test which aborts the suite][fixed by bug 606052]
You need to log in before you can comment on or make changes to this bug.