Closed Bug 1019071 Opened 10 years ago Closed 10 years ago

Frequent OS X 10.6 mochitest-4 timeouts

Categories

(Core :: General, defect)

x86_64
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox31 --- unaffected
firefox32 --- fixed
firefox33 --- fixed
firefox-esr24 --- unaffected
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: emorley, Assigned: dougt)

Details

Attachments

(3 files)

Not sure what the pattern is here, but over the last week or two, there has been a distinct increase in the number of OS X timeouts, particularly in mochitest-4.

https://tbpl.mozilla.org/php/getParsedLog.php?id=40647707&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=40664803&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=40692307&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=40727732&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=40727039&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=40759316&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=40821611&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=40856317&tree=Mozilla-Central

https://tbpl.mozilla.org/php/getParsedLog.php?id=40768304&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40779603&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40807498&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40822348&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40835000&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40837845&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40839708&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40862348&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40865152&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=40882522&tree=B2g-Inbound

Steven, as our resident OSX guru, does anything stand out to you in these stacks? :)
Flags: needinfo?(smichaud)
These are all on OS X 10.6.8.  Has the 10.6-specific test machine setup changed recently?

Is the increase "distinct" enough to pin it down to a particular day?

Whereever there's a crash stack, you can tell that the app was killed -- it didn't crash on its own.
Flags: needinfo?(smichaud)
I highly doubt anything has changed with our OSX 10.6 test slaves in some time now :). But no, I don't see any recent OSX 10.6 changes in the RelEng notes from the past couple weeks.

OSX 10.6 has always been quite timeout-prone, so it's very difficult to pin down exactly when things took a turn for the worse. Going back through the history on m-c, the first M4 orange I see was on May 21.
I'm going to run retriggers from around May 21 timeframe to see if I can narrow this down a bit.
Looks like this is where things took a dive:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=8e1ec45ff14d&tochange=83849e385943

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Rev4%20MacOSX%20Snow%20Leopard%2010.6%20mozilla-inbound%20opt%20test%20mochitest-4&fromchange=7ba8c658dc58&tochange=83849e385943

I'm strongly leaning towards the WebCrypto patches being at fault given that a MIPS patch is NPOTB for OSX and there's nothing media-related on the stack to implicate bug 921622 (plus that's been uplifted to other branches without any hangs). I think the NSS guys are going to have to look into this next.
Flags: needinfo?(rlb)
Assigning this to myself.  

I suspect that the issue is with the threading model used by WebCrypto -- every crypto operation creates a new, short-lived thread, and the WebCrypto mochitests do this several times.  It seems plausible that that could be causing weirdness.  In fact, I ran a test that did 4,096 iterations of digest() calls, and things did get weird.  I'll try forcing things over to the main thread and see if that makes things better.
Assignee: nobody → rlb
Flags: needinfo?(rlb)
Well, I tried removing the threading issues, and it's still not happy.
https://tbpl.mozilla.org/?tree=Try&rev=aa814cb7a141

Any suggestions on how to try to debug this?  Assuming it's the WebCrypto patches, I can't think of anything besides the threading that would have this kind of global impact.  

I suppose I could try repeating the test with the individual patches added/removed.  Might try that in the morning.

And anyway, who's still using 10.6?  ;)
Flags: needinfo?(ryanvm)
Revert to rev 21d19c772468 (AudioStream rework):
https://tbpl.mozilla.org/?tree=Try&rev=1919cfc9f280

Revert to rev 7ea38414bb31 (Add WebCrypto WebIDL interfaces):
https://tbpl.mozilla.org/?tree=Try&rev=5c2f07af074f

Revert to rev 43114cb99480 (Ensure that NSS is initialzed for CryptoTasks):
https://tbpl.mozilla.org/?tree=Try&rev=bcc473802175

Revert to rev 739a582c405c (Add support for SHA-1 and SHA-2 digests to WebCrypto API):
https://tbpl.mozilla.org/?tree=Try&rev=11f9ee68e91a
Flags: needinfo?(ryanvm)
> And anyway, who's still using 10.6?  ;)

About 25% of our current Mac users.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #24)
> Revert to rev 21d19c772468 (AudioStream rework):
> https://tbpl.mozilla.org/?tree=Try&rev=1919cfc9f280
> 
> Revert to rev 7ea38414bb31 (Add WebCrypto WebIDL interfaces):
> https://tbpl.mozilla.org/?tree=Try&rev=5c2f07af074f
> 
> Revert to rev 43114cb99480 (Ensure that NSS is initialzed for CryptoTasks):
> https://tbpl.mozilla.org/?tree=Try&rev=bcc473802175
> 
> Revert to rev 739a582c405c (Add support for SHA-1 and SHA-2 digests to
> WebCrypto API):
> https://tbpl.mozilla.org/?tree=Try&rev=11f9ee68e91a

These show a pretty convincing spike starting with rev 7ea38414bb31. Not sure what to make of even-bigger spike on rev 43114cb99480 that went away with rev 739a582c405c, though.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #29)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #24)
> > Revert to rev 21d19c772468 (AudioStream rework):
> > https://tbpl.mozilla.org/?tree=Try&rev=1919cfc9f280
> > 
> > Revert to rev 7ea38414bb31 (Add WebCrypto WebIDL interfaces):
> > https://tbpl.mozilla.org/?tree=Try&rev=5c2f07af074f
> > 
> > Revert to rev 43114cb99480 (Ensure that NSS is initialzed for CryptoTasks):
> > https://tbpl.mozilla.org/?tree=Try&rev=bcc473802175
> > 
> > Revert to rev 739a582c405c (Add support for SHA-1 and SHA-2 digests to
> > WebCrypto API):
> > https://tbpl.mozilla.org/?tree=Try&rev=11f9ee68e91a
> 
> These show a pretty convincing spike starting with rev 7ea38414bb31. Not
> sure what to make of even-bigger spike on rev 43114cb99480 that went away
> with rev 739a582c405c, though.

Agreed.  Going through that commit now, trying to isolate the problem.  Stand by.
Assignee: rlb → nobody
OK, I think we're making progress on this. Richard ran a Try push with the dom/crypto tests disabled (but leaving the code itself in place) and M4 is solid green. What makes this interesting is that the dom/crypto tests don't run in M4! This immediately points to changes in test chunking. Given that we chunk-by-dir in mochitest-plain (meaning that only the addition/subtraction of test directories will affect what tests run in a given chunk) and dom/crypto runs in M3, it's a plausible theory.

And in fact, if I compare a before and after for M4, I can see that dom/geolocation was pushed from the end of M3 to the beginning of M4! What I don't know is whether this is an issue with the Geolocation tests themselves or a consequence of just running more stuff overall in M4. And even if it was an issue with the Geolocation tests not cleaning up or something, it stands to reason that we wouldn't have seen problems before with them running at the end of the chunk.

Doug, can you please help with getting someone to look into what might be going on here? This is a very frequent and difficult to star orange, so any attention is greatly appreciated :)
Flags: needinfo?(dougt)
I have seen failures when running mochitest --repeat n with the geo mochitests, where the mochitest http server would stop responding. I am having trouble reproducing this ATM. 

Is this a correct interpretation of the log: after the last geo test [test_worseAccuracyDoesNotBlockCallback.html], one of the geo tests appears to still be running and calling the mochitest http server (i.e. one of the geo mochitest web pages staying open).
comment 43: there is no issue here, the geo provider gets shut down normally after 5 seconds of being idle.
Some of the recent timeouts (was trying to see if any that came up repeatedly that we could file and so not have to manually star):

/tests/content/base/test/test_document_constructor.html | Test timed out.
/tests/dom/imptests/html/dom/nodes/test_getElementsByClassName-02.htm | application timed out after 330 seconds with no output
/tests/dom/tests/mochitest/localstorage/test_embededNulls.html | Test timed out.
/tests/dom/tests/mochitest/localstorage/test_localStorageKeyOrder.html | application timed out after 330 seconds with no output
/tests/dom/tests/mochitest/storageevent/test_storageSessionStorageEventCheckNoPropagation.html | Test timed out.
/tests/dom/tests/mochitest/whatwg/test_postMessage_origin.xhtml | Test timed out.
/tests/dom/workers/test/test_errorwarning.html | Test timed out.
/tests/dom/workers/test/test_eventDispatch.html | Test timed out.
/tests/dom/workers/test/test_instanceof.html | Test timed out.
/tests/dom/workers/test/test_url.html | Test timed out.
/tests/dom/workers/test/test_xhr2.html | Test timed out.
/tests/dom/xbl/test/test_bug591198.html | Test timed out.
/tests/dom/xslt/tests/mochitest/test_bug551654.html | Test timed out.
/tests/editor/libeditor/base/tests/test_bug586662.html | Test timed out.
/tests/editor/libeditor/html/tests/test_bug414526.html | Test timed out.
/tests/editor/libeditor/html/tests/test_bug550434.html | Test timed out.
/tests/editor/libeditor/html/tests/test_bug599322.html | Test timed out.
/tests/embedding/test/test_nsFind.html | application timed out after 330 seconds with no output
/tests/js/jsd/test/test_bug507448.html | application timed out after 330 seconds with no output
/tests/js/xpconnect/tests/mochitest/test_bug772288.html | Test timed out.
/tests/js/xpconnect/tests/mochitest/test_bug772288.html | Test timed out.
/tests/layout/base/tests/test_bug435293-scale.html | Test timed out.
/tests/layout/base/tests/test_bug441782-4e.html | Test timed out.
/tests/layout/base/tests/test_bug467672-1a.html | Test timed out.
/tests/layout/base/tests/test_bug467672-2g.html | Test timed out.
/tests/layout/base/tests/test_bug467672-2g.html | Test timed out.
/tests/layout/base/tests/test_bug467672-3g.html | Test timed out.
/tests/layout/base/tests/test_bug467672-4f.html | Test timed out.
/tests/layout/base/tests/test_bug570378-arabic-5d.html | Test timed out.
/tests/layout/base/tests/test_bug570378-persian-2b.html | Test timed out.
/tests/layout/base/tests/test_bug570378-persian-2e.html | Test timed out.
/tests/layout/base/tests/test_bug570378-persian-5f.html | Test timed out.
/tests/layout/base/tests/test_bug582771.html | Test timed out.
/tests/layout/base/tests/test_bug749186.html | Test timed out.
/tests/layout/generic/test/test_bug288789.html | application timed out after 330 seconds with no output
Have added some to the summary of bug 928309.
636 INFO TEST-UNEXPECTED-FAIL | /tests/dom/workers/test/test_dataURLWorker.html | Test timed out.
https://tbpl.mozilla.org/php/getParsedLog.php?id=41442167&tree=Mozilla-Inbound
TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/bugs/test_bug593174.html | application timed out after 330 seconds with no output
https://tbpl.mozilla.org/php/getParsedLog.php?id=41450592&tree=Mozilla-Central
1691 INFO TEST-UNEXPECTED-FAIL | /tests/gfx/tests/mochitest/test_acceleration.html | Test timed out.
https://tbpl.mozilla.org/php/getParsedLog.php?id=41494179&tree=Mozilla-Central
1836 INFO TEST-UNEXPECTED-FAIL | /tests/intl/uconv/tests/test_bug959058-1.html | Test timed out.
https://tbpl.mozilla.org/php/getParsedLog.php?id=41508411&tree=Mozilla-Inbound

2031 INFO TEST-UNEXPECTED-FAIL | /tests/js/xpconnect/tests/mochitest/test_bug655297-2.html | Test timed out.
https://tbpl.mozilla.org/php/getParsedLog.php?id=41491269&tree=Mozilla-Inbound

1878 INFO TEST-UNEXPECTED-FAIL | /tests/intl/uconv/tests/test_utf8_overconsumption.html | Test timed out.
https://tbpl.mozilla.org/php/getParsedLog.php?id=41514301&tree=Mozilla-Inbound
782 INFO TEST-UNEXPECTED-FAIL | /tests/dom/workers/test/test_sharedWorker.html | Test timed out.
https://tbpl.mozilla.org/php/getParsedLog.php?id=41532009&tree=Mozilla-Inbound
2036 INFO TEST-UNEXPECTED-FAIL | /tests/js/xpconnect/tests/mochitest/test_bug661980.html | Test timed out.
https://tbpl.mozilla.org/php/getParsedLog.php?id=41534086&tree=Mozilla-Inbound
As of today, m-c tip is currently timing out ~16% of the time on mochitest-4. Per the Try link below, skipping Geolocation resulted in 50/50 runs being green.
https://tbpl.mozilla.org/?tree=Try&rev=cb49372a7ec6

Per the job disabling policy, I'm to wait 2 business days before disabling tests after giving notification, so if I don't hear anything by Friday, I will go ahead push this patch.
https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy
Attachment #8438665 - Flags: review?(dougt)
Comment on attachment 8438665 [details] [diff] [review]
Skip Geolocation tests on OSX

i'll back this out when we figure out what the heck is going one.
Attachment #8438665 - Flags: review?(dougt) → review+
Flags: needinfo?(dougt)
Assignee: nobody → dougt
Thanks, Doug.

https://hg.mozilla.org/integration/mozilla-inbound/rev/9ccdc477061e
Keywords: leave-open
Whiteboard: [tests disabled on OSX]
something might be funny with ipc:

09:52:49     INFO -  113 INFO TEST-START | /tests/dom/tests/mochitest/localstorage/test_appIsolation.html
09:52:49     INFO -  ############################### browserElementPanning.js loaded
09:52:49     INFO -  ######################## BrowserElementChildPreload.js loaded
09:52:49     INFO -  JavaScript error: chrome://browser/content/tabbrowser.xml, line 3230: tab is null
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #89)
> As of today, m-c tip is currently timing out ~16% of the time on
> mochitest-4. Per the Try link below, skipping Geolocation resulted in 50/50
> runs being green.
> https://tbpl.mozilla.org/?tree=Try&rev=cb49372a7ec6

Great spot Ryan! :-)
Attached file mochitest_output.txt
Doug, you asked about mochitest repeat, 
I ran mochitest --repeat 2, on my OS X 10.9.3.
Here is the output, failures in  tests/content/canvas and timeouts in
tests/content/html/content.

It took 2 hours, I could run overnight with more repeats, I just have to remember to do so.

Changing the timing on the requests from 200ms to longer (as ggp mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=957580), may magically work around the problem. That particular bug indicates that the mochitest httpd has a bug, and stops responding. I realize that there is no output in the logs that indicate this is the problem (it would be easy to see in the log if the mochitest http stopped responding).
Attachment #8440135 - Flags: review+
we need to backout 9ccdc477061e, and land attachment 8440135 [details] [diff] [review].
https://hg.mozilla.org/mozilla-central/rev/79c6d543f29f
https://hg.mozilla.org/mozilla-central/rev/b6ab581b1547
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
This is test-only, so I'll get it uplifted to Aurora tomorrow if nobody beats me to it.
Whiteboard: [tests disabled on OSX]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: