Closed Bug 1311706 Opened 4 years ago Closed 4 years ago

Autophone - unknown deadlock blocks command and pulse message processing


(Testing :: Autophone, defect)

Not set


(Not tracked)



(Reporter: bc, Assigned: bc)




(2 files)

Twice in the last couple of weeks we have encountered a situation on autophone-1, where we have two devices (nexus-6p-[12]) both running Talos, in which both command processing and pulse message processing cease to work. There must be a deadlock somewhere, perhaps triggered by network failures or by failures submitting results to perfherder.

I don't see this on the other Autophone hosts, where there are more than two devices. That may be a clue, along with the fact that identical devices run identical tests on autophone-1.
Interesting observation on the identical tests/devices. Is it most likely that perfherder is causing us to resend messages?
I don't know. I think I'll try to reproduce when I have additional nexus 6ps available here. Maybe over the weekend.
Attached file log
Basically, the issue is that when we receive a command that is processed via route_cmd, we obtain a lock. If the command is autophone-shutdown, we call pulse_monitor.stop, which sets the _stopping event on the pulse monitor so that the pulse monitor thread can exit gracefully. We then join the pulse monitor thread to wait until it exits. However, if the pulse monitor calls one of its callbacks in the meantime, the callback blocks trying to acquire the same lock, which prevents the pulse monitor thread from exiting. The result is a deadlock: route_cmd holds the lock while waiting for the thread, and the thread is waiting for the lock.
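To make the sequence concrete, here is a minimal sketch of the blocking pattern using only the standard threading module. It is not Autophone's code: cmd_lock, callback, and pulse_thread are hypothetical stand-ins for the route_cmd lock, a pulse monitor callback, and the pulse monitor thread. A join timeout is used so that the sketch terminates instead of hanging.

```python
import threading

cmd_lock = threading.Lock()            # stand-in for the lock taken in route_cmd
stopping = threading.Event()           # stand-in for the monitor's _stopping event
callback_entered = threading.Event()   # lets the main thread know the callback ran


def callback():
    # Hypothetical callback: it needs the same lock that route_cmd holds.
    callback_entered.set()
    with cmd_lock:
        pass


def pulse_thread():
    # Hypothetical monitor loop: handle one message, then wait to be stopped.
    callback()
    stopping.wait()


thread = threading.Thread(target=pulse_thread, daemon=True)

with cmd_lock:                 # route_cmd obtains the lock
    thread.start()
    callback_entered.wait()    # a message arrives while the lock is held
    stopping.set()             # stop(): ask the monitor thread to exit
    thread.join(timeout=0.5)   # join: the thread cannot exit, it wants cmd_lock
    stuck = thread.is_alive()

# Only after the lock is released can the thread finish.
thread.join(timeout=5)
finished = not thread.is_alive()
print(stuck, finished)
```

Without the timeout, the join inside the locked region would never return, which matches the symptom of command and pulse message processing both stopping.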

A simple solution is for the pulse monitor callbacks, build_callback and jobaction_callback, to return early if the pulse monitor is stopping. I think this will be sufficient.
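The early-return idea can be sketched roughly as follows. This is an illustration under assumptions, not the attached patch: PulseMonitorSketch, deliver, and shutdown are hypothetical names, and only build_callback and _stopping are taken from the description above.

```python
import threading


class PulseMonitorSketch:
    """Hypothetical stand-in for the pulse monitor; not Autophone's class."""

    def __init__(self, cmd_lock):
        self.cmd_lock = cmd_lock             # the lock route_cmd also holds
        self.handled = []                    # messages that were processed
        self._stopping = threading.Event()
        self._incoming = threading.Event()   # simulates a message arriving
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopping.set()

    def deliver(self):
        self._incoming.set()

    def join(self, timeout):
        self._thread.join(timeout)
        return not self._thread.is_alive()

    def build_callback(self, message):
        # The guard: skip the message instead of blocking on the lock
        # once the monitor has been asked to stop.
        if self._stopping.is_set():
            return
        with self.cmd_lock:
            self.handled.append(message)

    def _run(self):
        self._incoming.wait()
        self.build_callback("build-message")
        self._stopping.wait()


def shutdown(monitor, cmd_lock):
    # Mirrors the autophone-shutdown path: hold the lock, stop, then join.
    with cmd_lock:
        monitor.stop()
        monitor.deliver()   # a message arrives at the worst possible moment
        return monitor.join(timeout=5)


lock = threading.Lock()
monitor = PulseMonitorSketch(lock)
monitor.start()
exited = shutdown(monitor, lock)
print(exited, monitor.handled)
```

In this sketch the thread exits even though the lock is still held, because the callback never tries to acquire it. A callback that had already passed the check before stop was called would still block, so the guard narrows the window rather than removing it entirely.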
Assignee: nobody → bob
This appears to work for me. I believe this is exacerbated on autophone-1 due to the identical devices running identical tests on identical branches.
Attachment #8804531 - Flags: review?(jmaher)
Comment on attachment 8804531

Review of attachment 8804531:

looks good to me!
Attachment #8804531 - Flags: review?(jmaher) → review+

deployed 2016-10-26 06:00:00 PT
Closed: 4 years ago
Resolution: --- → FIXED