Closed Bug 921635 Opened 11 years ago Closed 6 years ago

Frequent OSX mochitest timeouts in the dom-level* suites

Categories

(Core :: General, defect)

Platform: x86_64 macOS
Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED WORKSFORME

People

(Reporter: RyanVM, Unassigned)

References

Details

(Whiteboard: [test disabled on OS X] [leave open])

Attachments

(1 file)

Attached image index.png —
This has been occurring persistently for the last week+. We are seeing frequent OSX "application timed out after 330 seconds with no output" failures, often in the dom-level* test suites, as in the log below:

https://tbpl.mozilla.org/php/getParsedLog.php?id=28486318&tree=Mozilla-Inbound

Rev4 MacOSX Lion 10.7 mozilla-inbound opt test mochitest-3 on 2013-09-27 14:11:22 PDT for push 6b9e77c98785
slave: talos-r4-lion-011

14:19:11     INFO -  28705 INFO TEST-START | /tests/dom/tests/mochitest/dom-level1-core/test_nodeappendchildnomodificationallowederrEE.html
14:24:41  WARNING -  TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/dom-level1-core/test_nodeappendchildnomodificationallowederrEE.html | application timed out after 330 seconds with no output
14:24:45     INFO -  SCREENSHOT: <see attached>
14:24:46     INFO -  Traceback (most recent call last):
14:24:46     INFO -    File "/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py", line 1045, in runTests
14:24:46     INFO -      onLaunch=onLaunch
14:24:46     INFO -    File "/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py", line 939, in runApp
14:24:46     INFO -      self.handleTimeout(timeout, proc, utilityPath, debuggerInfo, browserProcessId)
14:24:46     INFO -    File "/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py", line 1073, in handleTimeout
14:24:46     INFO -      self.killAndGetStack(browserProcessId, utilityPath, debuggerInfo, dump_screen=not debuggerInfo)
14:24:46     INFO -    File "/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py", line 724, in killAndGetStack
14:24:46     INFO -      os.kill(processPID, signal.SIGABRT)
14:24:46     INFO -  OSError: [Errno 3] No such process
14:24:46     INFO -  Automation Error: Received unexpected exception while running application
14:24:46     INFO -  WARNING | leakcheck | refcount logging is off, so leaks can't be detected!
14:24:46     INFO -  runtests.py | Running tests: end.
14:24:46    ERROR - Return code: 256

Multiple issues:
1.) Breakpad isn't launching and giving us a stack. Ted, is that a known issue?
2.) I see iCal running in all of the screenshots. From talking to philor, it doesn't sound like this is related, but it's worth pointing out.

These oranges are hitting frequently and we really need some movement on finding the root cause.
Flags: needinfo?(ted)
Component: Buildduty → General
Product: Release Engineering → Core
QA Contact: armenzg
Version: other → unspecified
Dunno what the root cause is, but the proximate cause is one of glandium's "move * to moz.build" patches, where we had a makefile (or multiple makefiles, not sure) with broken loops that caused us to not run most of these tests. Despite the tests having been in the tree forever, these are basically new tests, not a new regression in old tests.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #0)
> 1.) Breakpad isn't launching and giving us a stack. Ted, is that a known
> issue?

We discussed this in some other bug, but judging from the error message the process no longer exists by the time we try to kill it. I suggested that we should probably just wrap that os.kill in a try/except block to fail more gracefully there. I note that the screenshot doesn't show any trace of a Firefox process either, which lends credence to that theory. If we fix that we might be able to get a stack out of the process, since the error we're hitting here is causing us to skip the check_for_crashes call.

As to why the process is dying in slow motion, I don't know. Note that bug 920640 fixed the underlying cause of us not trying to kill the process for a stack (and bug 717758 made us try at all on Mac).
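
A minimal sketch of the suggested try/except wrapper, assuming the killAndGetStack path from the traceback above; the function and variable names here are illustrative, and only the os.kill call and the ESRCH error code come from the log:

    import errno
    import os
    import signal

    def kill_and_get_stack(browser_pid):
        """Send SIGABRT so Breakpad can write a minidump, but tolerate
        the process already being gone instead of aborting the run."""
        try:
            os.kill(browser_pid, signal.SIGABRT)
        except OSError as e:
            if e.errno == errno.ESRCH:
                # The browser already exited ([Errno 3] No such process);
                # log it and keep going so the harness can still reach
                # its crash checks for any minidumps on disk.
                print("WARNING | pid %d no longer exists" % browser_pid)
            else:
                raise

With the OSError handled, the harness could fall through to its check_for_crashes step instead of bailing out with the "Automation Error" seen in the log above.
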
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #5)
> We discussed this in some other bug, but judging from the error message the
> process no longer exists by the time we try to kill it.

(Bug 921509)
Talos has a similar problem which I have documented in bug 923770.  While I don't know the root cause we could probably be smarter in our process management.
Can we please get a bug on file (or mark any existing bugs on file as dependencies here) for whatever work needs to be done to get a stack from these hangs? I'm not going to be able to get any traction with the development team if we can't give them a stack.
Flags: needinfo?(ted)
Flags: needinfo?(jmaher)
At first I thought this was related to loading too many pages (a bug I see on 10.6 and 10.7 in talos tp5 only), but after looking at a few logs in here, I am not sure it is the same. The tp5 regression appeared in the last couple of weeks, as did this bug. Both bugs fail on 10.6 and 10.7, on test suites that load a lot of small, quick-to-load test files (i.e. fast page cycling).

Either our code changed or the test machines changed (possible OS update?). Callek, can you help me figure out whether we have updated any software on our 10.6/10.7 machines in the last 3 weeks?
Flags: needinfo?(jmaher) → needinfo?(bugspam.Callek)
I don't see anything obvious in our puppet history for what could have changed on these test machines.

*) Do retriggers on old jobs exhibit the same talos regressions?
*) Do you have a regression range that is smaller than 3 weeks to look in?
Flags: needinfo?(bugspam.Callek)
What is the frequency of the dom-level* failures? If it's maybe 1:20, we could post to try server and do a bunch of retriggers. A theory I have is that we are loading pages too fast, so slowing down the page loads in mochitest might fix this (see the sketch after this comment). While that isn't a permanent solution, it might help us figure out the root cause. I guess I could test this theory in talos :)

I am not sure whether we can get a stack from the non-responsive browser. If there is a way, I would love to get it into talos; possibly ted has some ideas there, and I could test them out on try server while I shoot for the high score table.
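
To make the fast-page-cycling theory above concrete, here is a hypothetical throttle; the loop, the load_page callable, and the PAGE_LOAD_DELAY knob are all illustrative, not actual mochitest or talos code:

    import time

    # Illustrative knob, not an existing mochitest/talos preference.
    PAGE_LOAD_DELAY = 0.25  # seconds to wait between test pages

    def cycle_pages(load_page, test_pages):
        """Load each small test page with a short pause in between, to
        see whether slowing down page cycling makes the hangs go away."""
        for page in test_pages:
            load_page(page)
            time.sleep(PAGE_LOAD_DELAY)
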
See comment 1. Any attempt at retriggering down to a single regressing push should find the push which took us from running only a fraction of these tests to actually running them all.
Depends on: 926934
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #46)
> Can we please get a bug on file (or mark any existing bugs on file as
> dependencies here) for whatever work needs to be done to get a stack from
> these hangs? I'm not going to be able to get any traction with the
> development team if we can't give them a stack.

I thought I commented on this elsewhere, but my original theory seems to be off-base. I think something is going wrong with our process management, since the process is dead before we get to kill it to get a stack. I would tentatively blame the mozbase refactor, since it seems plausible that something could be going wrong inside of mozrunner or mozprocess.

jhammel: what do you think?
Flags: needinfo?(ted) → needinfo?(jhammel)
It would be worth checking the test archives. The cause might be missing files because of make fuckage.
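
One quick way to do that check against a downloaded tests package; the archive name below is a placeholder, and the dom-level filter just reflects the suites in this bug:

    import zipfile

    # Placeholder name; substitute the actual tests archive from the build.
    TESTS_ZIP = "target.tests.zip"

    with zipfile.ZipFile(TESTS_ZIP) as zf:
        dom_level = [n for n in zf.namelist()
                     if "dom-level" in n and n.endswith(".html")]
        print("%d dom-level* test files packaged" % len(dom_level))
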
> I think something is going wrong with our process management, 

:( Sadly seems likely.

> since the process is dead before we get to kill it to get a stack. I
> would tentatively blame the mozbase refactor, since it seems
> plausible that something could be going wrong inside of mozrunner
> or mozprocess.

I'm going to give a strong blanket "maybe".

I'll try to add a few debug statements as soon as I can, but I can't put this at priority #1 right now.
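
A hypothetical example of the kind of debug statements that could narrow down when the browser process actually dies; the polling helper and its names are illustrative, not real mozprocess/mozrunner API:

    import logging
    import time

    log = logging.getLogger("process-debug")

    def poll_with_logging(proc, interval=5):
        """Poll a subprocess.Popen-style object and log every state
        change, so the harness log records when the process exited."""
        last = object()  # sentinel so the first poll is always logged
        while True:
            status = proc.poll()  # None while running, exit code once dead
            if status != last:
                log.info("pid %s poll() -> %r at %s",
                         proc.pid, status, time.asctime())
                last = status
            if status is not None:
                return status
            time.sleep(interval)
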
Clint, jst - these frequent timeouts are contributing to an already bad situation for OSX test slave backlog. If we can't find an owner for this, I'm afraid we may need to resort to disabling the dom-level* tests on OSX.

AFAICT, we need:
* Fixes to the test harness to give stacks on hangs/timeouts
* Developer action once usable stacks exist

I'll also point out that per comment 1, these tests were actually accidentally disabled for a long time and nobody seemed to notice or miss them.
Flags: needinfo?(jst)
Flags: needinfo?(ctalbert)
I propose:
* disabling the tests as they were
* filing a bug to enable them (treat them as new tests)
* file a bug to get valid crash stacks
Turns out that trusting my guesses about history rather than actually looking at blame in https://hg.mozilla.org/integration/mozilla-inbound/annotate/45097bc3a578/dom/tests/mochitest/dom-level1-core/Makefile.in isn't a great way to make decisions.

We were missing some but not all of them between July 3 2012 and September 5 2013. If we're going to decide we don't need them at all on the platform that I think most of the DOM owners and peers use, then we should do so by deciding that we don't need them at all because we made it over a year without regressing any of them so they just test things that we test with other tests, and stop running them on any platform.

But before that, we should actually check whether I'm even right that 54e4a1ad2b22 is the start (though not on a weekday, we really can't afford the load of an extra few hundred 10.7 mochitests in our 4.5 hour queue).
(In reply to Phil Ringnalda (:philor) from comment #1)
> Dunno what the root cause is, but the proximate cause is one of glandium's
> "move * to moz.build" patches, where we had a makefile (or multiple
> makefiles, not sure) with broken loops that caused us to not run most of
> these tests. Despite the tests having been in the tree forever, these are
> basically new tests, not a new regression in old tests.

It was only 11 of those tests that were disabled. test_HTMLAppletElement01.html to test_HTMLAppletElement11.html in dom/tests/mochitest/dom-level2-html/Makefile.in. The change that enabled them is bug 917622.
wtf, bmo, I didn't touch the needinfo checkbox!
Flags: needinfo?(jhammel)
(In reply to Mike Hommey [:glandium] from comment #65)
> (In reply to Phil Ringnalda (:philor) from comment #1)
> > Dunno what the root cause is, but the proximate cause is one of glandium's
> > "move * to moz.build" patches, where we had a makefile (or multiple
> > makefiles, not sure) with broken loops that caused us to not run most of
> > these tests. Despite the tests having been in the tree forever, these are
> > basically new tests, not a new regression in old tests.
> 
> It was only 11 of those tests that were disabled.
> test_HTMLAppletElement01.html to test_HTMLAppletElement11.html in
> dom/tests/mochitest/dom-level2-html/Makefile.in. The change that enabled
> them is bug 917622.

I've disabled these 11 tests on OS X, in:
https://hg.mozilla.org/mozilla-central/rev/f2b317484d08

Let's see if that helps... :-)
Whiteboard: [some tests disabled on OS X] [leave open]
(In reply to Phil Ringnalda (:philor) from comment #122)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=29814900&tree=Mozilla-Inbound

This was after the test disabling in comment 120, sadly.
I think the next step is just to disable all dom-level* tests on OS X until someone actually cares whether they are reliable or not :-(
See Also: → 672015
See Also: → 910229
(In reply to Carsten Book [:Tomcat] from comment #146)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=29775369&tree=B2g-Inbound

This wasn't a timeout - wrong bug?
Flags: needinfo?(cbook)
We've sadly come to the end of the road for now - dom-level1-core, dom-level2-core & dom-level2-html disabled on OS X for too many intermittent timeouts & until someone is able to investigate the failures:
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/66d14f6329ca
Whiteboard: [some tests disabled on OS X] [leave open] → [test disabled on OS X] [leave open]
I should also add that these tests are also too OOMy for Android, so have been disabled there for some time (see bug 910229). Seems like these tests need some TLC all around.
(In reply to Ed Morley [:edmorley UTC+1] from comment #148)
> We've sadly come to the end of the road for now - dom-level1-core,
> dom-level2-core & dom-level2-html disabled on OS X for too many intermittent
> timeouts & until someone is able to investigate the failures:
> remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/66d14f6329ca

Bizarrely, this exposed issues with layout/generic/test/test_bug448987.html, which has now also been disabled. Bug 932296 filed to get that sorted.
(In reply to Joel Maher (:jmaher) from comment #62)
> I propose:
> * disabling the tests as they were
> * filing a bug to enable them (treat them as new tests)
> * file a bug to get valid crash stacks

So step 1 is done here. Will file the 2 follow-ups and we'll go from there.
Flags: needinfo?(jst)
Flags: needinfo?(jhammel)
Flags: needinfo?(jgriffin)
Flags: needinfo?(ctalbert)
Flags: needinfo?(cbook)
Flags: needinfo?(jgriffin)
Step 2 now needs to be "file a bug to go about fixing and re-enabling dom-level1-core, dom-level2-core & dom-level2-html" rather than just those 11.
Depends on: 932350
(In reply to Joel Maher (:jmaher) from comment #62)
> I propose:
> * disabling the tests as they were
> * filing a bug to enable them (treat them as new tests)

Bug 932350

> * file a bug to get valid crash stacks

Bug 932349
Summary: Frequent OSX mochitest timeouts, often in the dom-level* suites → Frequent OSX mochitest timeouts in the dom-level* suites
Blocks: 941443
Flags: needinfo?(ryanvm)
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(ryanvm)
Resolution: --- → WORKSFORME