Closed Bug 757684 Opened 13 years ago Closed 13 years ago

stop eating all exceptions in dmg_signpackage

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Attachments

(1 file)

I've hit this a few times with Mac signing, and it causes awful problems with l10n - it seems to corrupt the objdir in such a way that subsequent builds on the same slave do not work.

2012-05-22 15:50:55,695 - DEBUG - 28218: Exceeded timeout
2012-05-22 15:50:56,695 - DEBUG - 28218: Success!

The first line comes from the main thread, here: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L130, and is printed very shortly before kill() is called. The second line comes from the main thread too, here: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L155.

The first time through the while loop we print out "Exceeded timeout", and then proceed into the kill() code. That code runs, and doesn't raise any exception. We then hit this code: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L148 which polls the process, and I _assume_ receives a return code and breaks out of the loop. Then we get to: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L155 which only prints "Success!" if rc == 0.

_SO_. It appears to me that the worker processes are somehow getting killed with kill(), as evidenced by the fact that the tarball is corrupt. Additionally, it appears that despite being kill()'ed, they are returning 0, as evidenced by the fact that we get "Success!" in the log.

Still digging into how this is possible, and what exactly we can do about it.
21:00 < bhearsum|afk> ok, so there's some sort of race or logic error in the signing server that's causing it to kill workers yet have them think they succeeded - full details of that are in https://bugzilla.mozilla.org/show_bug.cgi?id=757684 - the long and short is that we try to kill the workers with SIGINT, and then check the exit code to see what happened. the worker thread is executing some code that (unfortunately) eats all exceptions and returns False, but we don't check the return code of the function that does that
21:00 < bhearsum|afk> _so_, my theory is that the worker is dying, but the exception that the kill causes gets eaten, and the process exits normally

Here's the test I did to prove that theory:

➜ tmp cat test.py
#!/usr/bin/python

import time

try:
    while True:
        time.sleep(1)
except:
    import traceback
    traceback.print_exc()
➜ tmp python test.py
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    time.sleep(1)
KeyboardInterrupt
➜ tmp echo $?
0
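The test above comes down to the difference between a bare except: and except Exception:. KeyboardInterrupt derives from BaseException rather than Exception, so only the bare form swallows it. The following is a minimal sketch of that difference, not the actual patch attached to this bug; eats_everything, eats_only_errors and interrupted are hypothetical names used only for illustration.

```python
def eats_everything(work):
    """Mirror the problematic pattern: a bare except also catches
    KeyboardInterrupt, so an interrupted run just looks like False."""
    try:
        work()
        return True
    except:  # deliberately too broad, as in the test script above
        return False


def eats_only_errors(work):
    """Narrower pattern: ordinary errors are still reported as False,
    but KeyboardInterrupt (a BaseException) propagates to the caller."""
    try:
        work()
        return True
    except Exception:
        return False


def interrupted():
    # Hypothetical stand-in for work that receives SIGINT while running.
    raise KeyboardInterrupt
```

With the narrower handler, an interrupted process that does not catch the exception elsewhere terminates abnormally instead of exiting with status 0.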
Assignee: nobody → bhearsum
Attachment #626276 - Flags: review?(bear)
Attachment #626276 - Flags: review?(bear) → review+
Attachment #626276 - Flags: checked-in+
Summary: possible race condition in signing server can cause workers to both succeed and time out → stop eating all exceptions in dmg_signpackage
we should probably update the signing server logic so that it's impossible for a worker that's timed out to be considered successful.
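One way to make that guarantee explicit is to record the timeout separately from the return code, so a worker that was signalled can never be reported as successful even if it exits 0. This is only a sketch under assumptions: it assumes POSIX signals, and run_with_timeout, its parameters and the grace period are hypothetical, not taken from signing-server.py.

```python
import signal
import subprocess
import time


def run_with_timeout(cmd, timeout, grace=5.0, poll_interval=0.1):
    """Return True only if cmd exited 0 and never hit the timeout.

    Hypothetical helper: on timeout the child gets SIGINT, and SIGKILL
    if it is still alive after `grace` more seconds.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout
    timed_out = False
    while proc.poll() is None:
        now = time.monotonic()
        if not timed_out and now > deadline:
            # Remember the timeout; the exit code alone is not trusted.
            timed_out = True
            proc.send_signal(signal.SIGINT)
        elif timed_out and now > deadline + grace:
            proc.kill()
        time.sleep(poll_interval)
    return proc.returncode == 0 and not timed_out
```

A child that swallows the interrupt and exits 0 is still reported as a failure here, because the timed_out flag is checked alongside the return code.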
OK, I think this is fixed now. I'm getting a different error, but I'm pretty sure it's a build system issue. Looking at that in bug 723176.
(In reply to Chris AtLee [:catlee] from comment #2)
> we should probably update the signing server logic so that it's impossible
> for a worker that's timed out to be considered successful.

Filed bug 757692 on this.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General