Closed
Bug 757684
Opened 12 years ago
Closed 12 years ago
stop eating all exceptions in dmg_signpackage
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
Attachments (1 file)
588 bytes, patch | bear: review+ | bhearsum: checked-in+
I've hit this a few times with Mac signing, and it causes awful problems with l10n: it seems to corrupt the objdir in such a way that subsequent builds on the same slave do not work.

2012-05-22 15:50:55,695 - DEBUG - 28218: Exceeded timeout
2012-05-22 15:50:56,695 - DEBUG - 28218: Success!

The first line comes from the main thread, here: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L130, and is printed very shortly before kill() is called. The second line comes from the main thread too, here: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L155.

The first time through the while loop we print out "Exceeded timeout", and then proceed into the kill() code. That code runs, and doesn't raise any exception. We then hit this code: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L148, which polls the process and, I _assume_, receives a return code and breaks out of the loop. Then we get to https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L155, which only prints success if rc == 0.

_SO_. It appears to me that the worker processes are somehow getting killed with kill(), as evidenced by the fact that the tarball is corrupt. Additionally, it appears that despite being kill()'ed they are returning 0, as evidenced by the fact that we get "Success!" in the log.

Still digging into how this is possible, and what exactly we can do about it.
Comment 1 (Assignee) • 12 years ago
21:00 < bhearsum|afk> ok, so there's some sort of race or logic error in the signing server that's causing it to kill workers yet have them think they succeeded - full details of that are in https://bugzilla.mozilla.org/show_bug.cgi?id=757684 - the long and short is that we try to kill the workers with SIGINT, and then check the exit code to see what happened. the worker thread is executing some code that (unfortunately) eats all exceptions and returns False, but we don't check the return code of the function that does that
21:00 < bhearsum|afk> _so_, my theory is that the worker is dying, but the exception that the kill causes gets eaten, and the process exits normally

Here's the test I did to prove that theory:

➜ tmp cat test.py
#!/usr/bin/python

import time

try:
    while True:
        time.sleep(1)
except:
    import traceback
    traceback.print_exc()
➜ tmp python test.py
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    time.sleep(1)
KeyboardInterrupt
➜ tmp echo $?
0
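
For illustration only, here is a minimal sketch of the difference between a bare `except:` and `except Exception:` when the work is interrupted. The function names are hypothetical and are not taken from dmg_signpackage or from the attached patch; the interrupt is raised by hand so the behaviour is deterministic. In Python, KeyboardInterrupt derives from BaseException rather than Exception, so only the bare form swallows it.

```python
def sign_swallowing(work):
    """Hypothetical stand-in: a bare except hides every error, including
    KeyboardInterrupt and SystemExit, and just reports False."""
    try:
        work()
        return True
    except:  # noqa: E722 (deliberately too broad, to show the problem)
        return False


def sign_narrow(work):
    """Hypothetical stand-in: catch ordinary errors only, so an interrupt
    still propagates and can end the process with a non-zero exit code."""
    try:
        work()
        return True
    except Exception:
        return False


def interrupted():
    # Simulates the interrupt arriving while the work is in progress.
    raise KeyboardInterrupt()


def failing():
    # Simulates an ordinary failure inside the work.
    raise ValueError("ordinary failure")
```

With these definitions, `sign_swallowing(interrupted)` quietly returns False, while `sign_narrow(interrupted)` lets the KeyboardInterrupt escape to the caller. Both versions still return False for `failing`.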
Assignee: nobody → bhearsum
Attachment #626276 - Flags: review?(bear)

Updated • 12 years ago
Attachment #626276 - Flags: review?(bear) → review+

Updated (Assignee) • 12 years ago
Attachment #626276 - Flags: checked-in+

Updated (Assignee) • 12 years ago
Summary: possible race condition in signing server can cause workers to both succeed and time out → stop eating all exceptions in dmg_signpackage
Comment 2 • 12 years ago
we should probably update the signing server logic so that it's impossible for a worker that's timed out to be considered successful.
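
One way that suggestion could look, as a rough sketch rather than the actual signing-server code: the supervisor remembers that it hit the timeout and refuses to report success afterwards, whatever exit code the worker returns. `wait_for_worker` and `FakeWorker` are hypothetical names invented for this example, and `kill()` here stands in for whatever signal the real server sends.

```python
import time


def wait_for_worker(proc, timeout, poll_interval=0.01):
    """Poll `proc` until it exits; kill it once `timeout` seconds pass.

    Returns True only if the worker exited with 0 AND was never timed out.
    `proc` is assumed to offer poll() and kill(), as subprocess.Popen does.
    """
    start = time.monotonic()
    timed_out = False
    while True:
        if not timed_out and time.monotonic() - start > timeout:
            timed_out = True
            proc.kill()
        rc = proc.poll()
        if rc is not None:
            break
        time.sleep(poll_interval)
    # A zero exit code after a timeout is not trusted.
    return rc == 0 and not timed_out


class FakeWorker:
    """Test double: mimics a worker that exits 0 even when it is killed."""

    def __init__(self, rc=None):
        self.rc = rc

    def poll(self):
        return self.rc

    def kill(self):
        self.rc = 0
```

In this sketch, `wait_for_worker(FakeWorker(), timeout=0.01)` returns False even though the fake worker reports exit code 0 after being killed, which is the case the log in this bug shows being treated as success.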
Comment 3 (Assignee) • 12 years ago
OK, I think this is fixed now. I'm getting a different error, but I'm pretty sure it's a build system issue. Looking at that in bug 723176.
Comment 4 (Assignee) • 12 years ago
(In reply to Chris AtLee [:catlee] from comment #2)
> we should probably update the signing server logic so that it's impossible
> for a worker that's timed out to be considered successful.

Filed bug 757692 on this.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Release Engineering

Updated • 6 years ago
Component: General Automation → General