Closed Bug 1191877 Opened 9 years ago Closed 6 years ago

Intermittent OSX build Automation Error: mozprocess timed out after 2400 seconds running ['/tools/buildbot/bin/python', 'mach', '--log-no-times', 'build', '-v']

Categories

(Release Engineering :: General, defect)

Platform: x86_64 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: RyanVM, Unassigned)

Details

(Keywords: intermittent-failure)

+++ This bug was initially created as a clone of Bug #1165763 +++

+++ This bug was initially created as a clone of Bug #1154377 +++

+++ This bug was initially created as a clone of Bug #1145507 +++
this looks like a buildsymbols problem
Flags: needinfo?(ted)
From comment #22:

11:09:46     INFO -  44623: Worker processing files: ('dist/universal/test-stage/cppunittest/TestJSONWriter.dSYM', 'dist/universal/test-stage/cppunittest/TestJSONWriter')
11:09:46     INFO -  44622: Worker processing files: ('dist/universal/test-stage/cppunittest/TestLineBreak.dSYM', 'dist/universal/test-stage/cppunittest/TestLineBreak')
11:09:47     INFO -  44623: Worker processing files: ('dist/universal/test-stage/cppunittest/TestLineBreak.dSYM', 'dist/universal/test-stage/cppunittest/TestLineBreak')
11:09:47     INFO -  44620: Worker processing files: ('dist/universal/test-stage/cppunittest/TestMacroArgs.dSYM', 'dist/universal/test-stage/cppunittest/TestMacroArgs')
11:09:47     INFO -  Exception in thread Thread-2:
11:09:47     INFO -  Traceback (most recent call last):
11:09:47     INFO -    File "/tools/python/lib/python2.7/threading.py", line 551, in __bootstrap_inner
11:09:47     INFO -      self.run()
11:09:47     INFO -    File "/tools/python/lib/python2.7/threading.py", line 504, in run
11:09:47     INFO -      self.__target(*self.__args, **self.__kwargs)
11:09:47     INFO -    File "/tools/python/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
11:09:47     INFO -      put(task)
11:09:47     INFO -  RuntimeError: dictionary changed size during iteration
11:21:43     INFO -  44613: Submitting jobs for files: ('dist/universal/firefox/Nightly.app/Contents/MacOS/XUL.dSYM', 'dist/universal/firefox/Nightly.app/Contents/MacOS/XUL')
12:01:43     INFO - Automation Error: mozprocess timed out after 2400 seconds running ['/tools/buildbot/bin/python', 'mach', '--log-no-times', 'build', '-v']
12:01:43    ERROR - timed out after 2400 seconds of no output
12:01:43    ERROR - Return code: -9
12:01:43  WARNING - setting return code to 2
12:01:43    FATAL - 'mach build' did not run successfully. Please check log for errors.

This is a bug in symbolstore.py.
So, there is a bug where multiple threads (or something else) modify a dictionary during iteration. This causes an exception in a thread. And chances are the code that waits on the threads is also buggy, in that it doesn't detect threads that have aborted.
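A minimal Python 3 sketch of the two failure modes described above (illustrative only, not code from symbolstore.py or multiprocessing): adding keys to a dict while looping over it raises "RuntimeError: dictionary changed size during iteration", and an exception that escapes a threading.Thread target is only printed ("Exception in thread ..."), never seen by the thread that calls join(). CheckedThread and join_checked() are hypothetical names; they record the exception so the waiting side can fail fast instead of hanging.

```python
import threading


def mutate_while_iterating(d):
    """Adds keys to d while looping over it; CPython raises RuntimeError."""
    for key in d:
        d[key + "_copy"] = d[key]


class CheckedThread(threading.Thread):
    """Hypothetical Thread subclass that keeps its target's exception."""

    def __init__(self, target, args=()):
        super().__init__(target=target, args=args)
        self.exc = None

    def run(self):
        try:
            super().run()
        except Exception as exc:  # remember it instead of only printing it
            self.exc = exc

    def join_checked(self, timeout=None):
        """Join, then surface a hang or a worker failure as an exception."""
        self.join(timeout)
        if self.is_alive():
            raise TimeoutError("worker thread still running after join timeout")
        if self.exc is not None:
            raise RuntimeError("worker thread failed") from self.exc


t = CheckedThread(target=mutate_while_iterating, args=({"a": 1},))
t.start()
try:
    t.join_checked(timeout=5)
except RuntimeError as err:
    # prints: worker thread failed -> dictionary changed size during iteration
    print(err, "->", err.__cause__)
```

The timeout on join() is the in-process counterpart of the 2400-second no-output limit in the log: a bounded wait turns a silent hang into an explicit error.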
This smells like https://bugs.python.org/issue23051. The bug was fixed in Python 2.7.10.
Actually, I don't think this is Python issue 23051. But the stack does appear to come from inside multiprocessing, so this does look like a bug in multiprocessing.
The error is not the same as the one in issue 23051. However, the fix likely solves this problem as well.

https://hg.python.org/cpython/rev/311d52878a65 is the fix. It added an additional try..except around the iteration of taskseq. This will almost certainly catch the exception we're seeing.

I have no clue why we're now suddenly seeing this though. Weird.
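The CPython change linked above wraps the iteration of taskseq in the pool's task handler with try/except, so a failure while feeding tasks is reported rather than silently killing the handler thread. Below is a simplified, hypothetical Python 3 sketch of that general pattern, not the actual multiprocessing.pool code: handle_tasks() stands in for the task handler, a queue.Queue stands in for the pool's task pipe, and consume() stands in for the code waiting on results. The tasks_from() generator deliberately mutates the dict it iterates to reproduce the same RuntimeError.

```python
import queue
import threading


def handle_tasks(taskseq, out_q):
    """Feed tasks to out_q; report a failure instead of dying silently."""
    try:
        for task in taskseq:
            out_q.put(("task", task))
        out_q.put(("done", None))
    except Exception as exc:
        # Without this guard, an exception raised while iterating taskseq
        # would kill this thread before "done" is sent, and consume() would
        # block on out_q.get() forever.
        out_q.put(("error", exc))


def consume(out_q):
    """Collect tasks until the feeder finishes; re-raise any feeder error."""
    results = []
    while True:
        kind, payload = out_q.get()
        if kind == "task":
            results.append(payload)
        elif kind == "error":
            raise RuntimeError("task feeder failed") from payload
        else:
            return results


def tasks_from(d):
    """Yields keys of d but also adds to d mid-iteration (the bug)."""
    for key in d:
        d[key * 10] = key
        yield key


q = queue.Queue()
feeder = threading.Thread(target=handle_tasks, args=(tasks_from({1: None}), q))
feeder.start()
try:
    consume(q)
except RuntimeError as err:
    print("caught:", err.__cause__)  # caught: dictionary changed size during iteration
feeder.join()
```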
So bug 1164816 covers the "background task in symbolstore.py raises and the script hangs" situation. I think that patch just got hung up on a broken unit test; I suspect I could get it finished up and landed without much effort.

As to the error at hand here, I'm really at a loss as to why it'd start showing up now. The multiprocessing code hasn't changed in a long time, and the recent changes to the script all seem pretty innocuous:
https://hg.mozilla.org/mozilla-central/filelog/d6ea652c579992daa9041cc9718bb7c6abefbc91/toolkit/crashreporter/tools/symbolstore.py
Flags: needinfo?(ted)
Could we hack around this by throwing a multiprocessing.Lock around the code that inserts jobs?
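A minimal Python 3 sketch of the lock idea, under assumptions this bug does not confirm: the code that inserts jobs and the thread that iterates (or, in put(), most likely pickles) the shared dict live in the same process, as the Thread-2 traceback from multiprocessing/pool.py suggests, so a threading.Lock is enough (a multiprocessing.Lock would also work). A lock only protects code that takes it, and the pool's task handler never takes a caller's lock, so the sketch hands readers a copy made under the lock rather than the live dict. JobTable, add() and snapshot() are hypothetical names, not symbolstore.py's API.

```python
import threading


class JobTable:
    """Hypothetical job registry guarded by a lock.

    Writers add entries under the lock; readers get a copy taken under the
    same lock, so nobody iterates the live dict while it is being resized.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}

    def add(self, key, value):
        with self._lock:
            self._jobs[key] = value

    def snapshot(self):
        with self._lock:
            return dict(self._jobs)


table = JobTable()
errors = []


def writer():
    for i in range(10000):
        table.add(i, "file-%d" % i)


def reader():
    try:
        for _ in range(200):
            for key, value in table.snapshot().items():
                pass  # safe: this is a private copy, not the shared dict
    except RuntimeError as exc:
        errors.append(exc)


threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(table.snapshot()), errors)  # 10000 []
```

Copying under the lock keeps the critical section short at the cost of an O(n) copy per snapshot; passing such a copy (rather than the shared dict) into a pool job avoids the race during pickling.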
Possibly, yeah, but the patch in that bug was basically done except for one dumb error that I hadn't fixed, so I just fixed that and pushed it to the Try server to sanity-check that it still works on all the other platforms. That Try push is looking pretty green, so I should be able to get those patches landed today.
Pretty sure bug 1164816 fixed this.
Depends on: 1164816
Component: General Automation → General
I think this bug outlived its usefulness.
Closing this, feel free to reopen if I'm wrong.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX