Closed Bug 757684 Opened 12 years ago Closed 12 years ago

stop eating all exceptions in dmg_signpackage

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Attachments

(1 file)

I've hit this a few times with Mac signing, and it causes awful problems with l10n - it seems to corrupt the objdir in such a way that subsequent builds on the same slave do not work.

2012-05-22 15:50:55,695 - DEBUG - 28218: Exceeded timeout
2012-05-22 15:50:56,695 - DEBUG - 28218: Success!

The first line comes from the main thread, here: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L130, and is printed very shortly before kill() is called.
The second line comes from the main thread too, here: https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L155.

The first time through the while loop we print out Exceeded timeout, and then proceed into the kill() code. That code runs, and doesn't raise any exception. We then hit this code:
https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L148

which polls the process, and I _assume_ receives a return code and breaks out of the loop. Then we get to:
https://github.com/mozilla/build-tools/blob/master/release/signing/signing-server.py#L155

which only prints "Success!" if rc == 0.

_SO_. It appears to me that the worker processes are somehow getting killed with kill(), as evidenced by the fact that the tarball is corrupt. Additionally, it appears that despite being kill()'ed they are returning 0, as evidenced by the fact that we get "Success!" in the log.

Still digging into how this is possible, and what exactly we can do about it.
21:00 < bhearsum|afk> ok, so there's some sort of race or logic error in the signing server that's causing it 
                      to kill workers yet have them think they succeeded - full details of that are in 
                      https://bugzilla.mozilla.org/show_bug.cgi?id=757684 - the long and short is that we try 
                      to kill the workers with SIGINT, and then check the exit code to see what happened. the 
                      worker thread is executing some code that (unfortunately) eats all exceptions and 
                      returns False, but we don't check the return code of the function that does that
21:00 < bhearsum|afk> _so_, my theory is that the worker is dying, but the exception that the kill causes gets 
                      eaten, and the process exits normally
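
To make the theory above concrete, here is a minimal sketch of the pattern being described. The function names are hypothetical and this is not the actual dmg_signpackage code; it only illustrates how a bare except turns an interrupt into an ordinary False return value, while catching Exception lets the interrupt propagate.

```python
def sign_eating_everything(work):
    """Hypothetical stand-in for a signing helper that eats all exceptions."""
    try:
        work()
        return True
    except:  # bare except: also catches KeyboardInterrupt and SystemExit
        return False


def sign_catching_exception(work):
    """Same helper, but only ordinary errors are turned into False."""
    try:
        work()
        return True
    except Exception:  # KeyboardInterrupt is not a subclass of Exception
        return False


def interrupted_work():
    # Simulates a SIGINT arriving while the signing work is running.
    raise KeyboardInterrupt
```

If the caller never checks the return value of the first variant, the interrupt is lost entirely and the process can go on to exit normally.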

Here's the test I did to prove that theory:
➜  tmp  cat test.py
#!/usr/bin/python

import time

try:
    while True:
        time.sleep(1)
except:
    import traceback
    traceback.print_exc()
➜  tmp  python test.py
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    time.sleep(1)
KeyboardInterrupt
➜  tmp  echo $?
0
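
The same experiment can be scripted rather than interrupted by hand. The sketch below is an assumption-laden variant of the test above, not code from build-tools: it starts a child interpreter, sends it SIGINT once the child is inside its try block, and returns the exit code. The names EATS_EVERYTHING, CATCHES_EXCEPTION_ONLY and exit_code_after_sigint are made up for illustration, and it assumes a POSIX system and Python 3.

```python
import signal
import subprocess
import sys

# Child script modelled on test.py above: a bare except swallows everything.
EATS_EVERYTHING = """
import signal, time
# Make sure SIGINT raises KeyboardInterrupt even if the parent ignores SIGINT.
signal.signal(signal.SIGINT, signal.default_int_handler)
try:
    print("ready", flush=True)
    while True:
        time.sleep(1)
except:
    pass
"""

# Identical child, except that only Exception subclasses are caught.
CATCHES_EXCEPTION_ONLY = EATS_EVERYTHING.replace("except:", "except Exception:")


def exit_code_after_sigint(source):
    """Run `source` in a child interpreter, send SIGINT, return the exit code."""
    proc = subprocess.Popen(
        [sys.executable, "-c", source],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # hide the child's traceback
    )
    proc.stdout.readline()  # block until the child is inside its try block
    proc.send_signal(signal.SIGINT)
    code = proc.wait(timeout=30)
    proc.stdout.close()
    return code
```

With the bare except the child exits with 0, matching the transcript above; with `except Exception:` the KeyboardInterrupt propagates and the exit status is non-zero.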
Assignee: nobody → bhearsum
Attachment #626276 - Flags: review?(bear)
Attachment #626276 - Flags: review?(bear) → review+
Attachment #626276 - Flags: checked-in+
Summary: possible race condition in signing server can cause workers to both succeed and time out → stop eating all exceptions in dmg_signpackage
we should probably update the signing server logic so that it's impossible for a worker that's timed out to be considered successful.
OK, I think this is fixed now. I'm getting a different error, but I'm pretty sure it's a build system issue. Looking at that in bug 723176.
(In reply to Chris AtLee [:catlee] from comment #2)
> we should probably update the signing server logic so that it's impossible
> for a worker that's timed out to be considered successful.

Filed bug 757692 on this.
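
For reference, a rough sketch of what that kind of guard could look like. This is not the signing server's code and not the fix that landed in bug 757692; run_worker and its parameters are hypothetical. The idea is simply to remember that the timeout fired, so a worker that was interrupted can never be reported as successful, whatever exit code it returns.

```python
import signal
import subprocess
import time


def run_worker(cmd, timeout, poll_interval=0.1):
    """Return True only if cmd exits 0 and was never interrupted for timing out."""
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout
    timed_out = False
    while proc.poll() is None:
        if not timed_out and time.monotonic() >= deadline:
            # Remember the timeout before signalling, so a worker that
            # swallows the interrupt and exits 0 is still a failure.
            timed_out = True
            proc.send_signal(signal.SIGINT)
        time.sleep(poll_interval)
    return proc.returncode == 0 and not timed_out
```

A worker that eats the KeyboardInterrupt and exits with 0 is still reported as failed here, because the result depends on the timed_out flag as well as the return code.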
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General