Closed
Bug 1311706
Opened 8 years ago
Closed 8 years ago
Autophone - unknown dead lock blocks command and pulse message processing
Categories
(Testing Graveyard :: Autophone, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bc, Assigned: bc)
Attachments
(2 files)
8.44 KB, text/plain
1.06 KB, patch (jmaher: review+)
Twice in the last couple of weeks we have encountered a situation on autophone-1, where we have two devices (nexus-6p-1 and nexus-6p-2) both running Talos, in which both command processing and pulse message processing cease to work. There must be a deadlock somewhere, perhaps triggered by network failures or by failures submitting results to Perfherder.
I don't see this on the other Autophone hosts, where there are more than two devices. That may be a clue, along with the fact that identical devices run identical tests on autophone-1.
Comment 1•8 years ago
Interesting observation about the identical tests and devices. Is it most likely that Perfherder is causing us to resend messages?
Assignee
Comment 2•8 years ago
I don't know. I think I'll try to reproduce it when I have additional Nexus 6Ps available here, maybe over the weekend.
Assignee
Comment 3•8 years ago
Basically, the issue is that when we receive a command that is processed via route_cmd, we obtain a lock. If the command is autophone-shutdown, we call pulse_monitor.stop, which sets the _stopping event on the pulse monitor so that the pulse monitor thread can exit gracefully. We then join the pulse monitor thread to wait until it exits. However, if the pulse monitor calls one of its callbacks in the meantime, the callback blocks on acquiring the lock, which is still held by the thread doing the join, and so the pulse monitor thread never exits.
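The sequence described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not Autophone's code: `cmd_lock`, `message`, `callback_started` and the function bodies are hypothetical stand-ins, and the join uses a timeout so the demonstration terminates instead of hanging.

```python
import threading

cmd_lock = threading.Lock()           # hypothetical stand-in for the lock taken in route_cmd
stopping = threading.Event()          # stand-in for the pulse monitor's _stopping event
message = threading.Event()           # simulates a pulse message arriving
callback_started = threading.Event()  # lets the demo wait until the callback is running

def build_callback():
    # The callback needs the same lock that the command handler holds.
    callback_started.set()
    with cmd_lock:
        pass

def pulse_monitor_loop():
    # Wait for a message, deliver it, then honor the stop request.
    message.wait()
    build_callback()
    stopping.wait()

def route_cmd_shutdown(monitor_thread, join_timeout):
    with cmd_lock:                    # lock held for the whole command
        message.set()                 # a message arrives while we hold the lock
        callback_started.wait()       # the callback is now about to block on cmd_lock
        stopping.set()                # what pulse_monitor.stop would do
        monitor_thread.join(join_timeout)
        return monitor_thread.is_alive()

monitor = threading.Thread(target=pulse_monitor_loop, daemon=True)
monitor.start()
stuck = route_cmd_shutdown(monitor, join_timeout=0.5)
print("monitor still alive after join:", stuck)

# Once the lock is released the callback finishes and the thread exits.
monitor.join(2.0)
```

Without the timeout, the join would wait forever: the joining thread holds the lock the monitor needs, and the monitor cannot exit until it gets that lock.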
A simple solution is for the pulse monitor callbacks, build_callback and jobaction_callback, to return early if the pulse monitor is stopping. I think this will be sufficient.
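A minimal sketch of the early-return idea, assuming the callback can see the monitor's `_stopping` event. The class `PulseMonitorSketch`, its `message` event and the `handled` list are hypothetical and exist only for illustration; the actual change is in the attached patch.

```python
import threading

class PulseMonitorSketch:
    """Hypothetical stand-in for the pulse monitor; not Autophone's class."""

    def __init__(self, lock):
        self.lock = lock                   # lock shared with command processing
        self._stopping = threading.Event()
        self.message = threading.Event()   # simulates one incoming pulse message
        self.handled = []
        self.thread = threading.Thread(target=self._run, daemon=True)

    def build_callback(self, data):
        # Early return: do not wait for the lock once shutdown has begun.
        if self._stopping.is_set():
            return
        with self.lock:
            self.handled.append(data)

    def _run(self):
        self.message.wait()
        self.build_callback("build-1")

    def stop(self):
        self._stopping.set()

lock = threading.Lock()
monitor = PulseMonitorSketch(lock)
monitor.thread.start()

with lock:                        # as route_cmd would while handling autophone-shutdown
    monitor.stop()
    monitor.message.set()         # a message arrives during shutdown
    monitor.thread.join(2.0)
    exited = not monitor.thread.is_alive()

print("monitor exited while lock was held:", exited)
```

In this sketch the message is dropped rather than processed, which is acceptable during shutdown. A callback that has already passed the `_stopping` check before `stop` is called would still block on the lock, so the check narrows the window rather than removing it entirely.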
Assignee
Updated•8 years ago
Assignee: nobody → bob
Status: NEW → ASSIGNED
Assignee
Comment 4•8 years ago
This appears to work for me. I believe this is exacerbated on autophone-1 due to the identical devices running identical tests on identical branches.
Attachment #8804531 - Flags: review?(jmaher)
Comment 5•8 years ago
Comment on attachment 8804531
bug-1311706-v1.patch
Review of attachment 8804531:
-----------------------------------------------------------------
Looks good to me!
Attachment #8804531 - Flags: review?(jmaher) → review+
Assignee
Comment 6•8 years ago
https://github.com/mozilla/autophone/commit/38ab79b13fed459675b4d30715cdd54fb1639491
deployed 2016-10-26 06:00:00 PT
Updated•3 years ago
Product: Testing → Testing Graveyard