Bug 915106 (Closed) - Opened 11 years ago, Closed 11 years ago

watch_devices.sh lock files not getting cleaned up on foopies

Categories

(Release Engineering :: General, defect)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

Details

Attachments

(1 file)

In https://bugzilla.mozilla.org/show_bug.cgi?id=869897#c5 I highlighted the following problem with the current implementation of the lockfile mechanism in watch_devices.sh:

.... <excerpt> .... 

A problem with the current "lockfile" implementation
====================================================

The current "lockfile" implementation is not safe if the exit trap of watch_devices.sh does not get called for any reason (e.g. process crash, out-of-memory error, physical power loss, kernel panic, ....). In this situation, the lockfile remains in place indefinitely, blocking future runs watch_devices.sh. The blocking lockfile must first manually be discovered, and then manually removed to fix the problem. This can be fixed by adding -l 3600 as a switch to the lockfile command.

.... </excerpt> ....

The change for this was contained in the following patch:
https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=869897&attachment=750723
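For illustration only (the splinter page above has the actual diff), a minimal sketch of the kind of change proposed, assuming watch_devices.sh takes its per-device lock with procmail's lockfile(1); the path, variable name, and -r 0 retry setting are placeholders rather than lines from the patch:

lock="/builds/${device}/watcher.lock"

# Without -l, a lock left behind by a crash or power loss blocks every future
# run of watch_devices.sh until a human removes it by hand.
# With -l 3600, lockfile(1) treats any existing lock older than 3600 seconds
# as stale and forcibly removes it.
lockfile -r 0 -l 3600 "$lock" || exit 0

# Normal path: drop the lock on exit (this is the trap a crash never reaches).
trap 'rm -f "$lock"' EXIT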

This was rejected with the comment (https://bugzilla.mozilla.org/show_bug.cgi?id=869897#c25):

.... <excerpt> .... 

Can we (for now) not do the -l 3600? I'd rather approach that in a followup at a later time, since if we get a lockfile stuck around for a long time, it's almost always a reason for a human to peek anyway.

Very few stuck-lockfile situations are ones where no human should take a look, at least for now.

.... </excerpt> .... 

I understand the motivation of this comment, but I think there is an assumption at play that a) a human would know there was a problem to investigate, and b) such investigations would be routinely performed. I'm not sure either of these is true. I believe a fully automated solution would be our best strategy to overcome this issue.

For example, I've just checked and found the following pandas with stale lockfiles, and there could be more that weren't captured by my test:

panda-0024
panda-0028
panda-0029
panda-0040
panda-0041
panda-0042
panda-0043
panda-0050
panda-0055
panda-0071
panda-0178
panda-0211
panda-0212
panda-0222
panda-0236
panda-0239
panda-0294
panda-0303
panda-0473
panda-0535
panda-0567
panda-0568
panda-0587
panda-0605
panda-0673
panda-0706
panda-0707
panda-0714
panda-0721
panda-0726
panda-0752
panda-0771
panda-0813
panda-0840

Since this is a very simple, non-invasive change that should not have any untoward side effects, I would like to go ahead and implement it. Stale lock files would then get cleaned up automatically, preventing otherwise-usable devices from sitting idle.
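(For reference, the kind of ad-hoc check used to spot these stale locks might look roughly like the following; the host name, path, and one-hour threshold are illustrative assumptions, not part of any patch:)

# Hypothetical audit: list watcher.lock files older than an hour on one foopy.
ssh foopy77.p6.releng.scl1.mozilla.com \
  'find /builds -maxdepth 2 -name watcher.lock -mmin +60 -ls'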
Component: Release Automation → General Automation
QA Contact: bhearsum → catlee
Comment on attachment 802943 [details] [diff] [review]
Patch to tools repo to update watch_devices.sh to handle stale lock files

Review of attachment 802943 [details] [diff] [review]:
-----------------------------------------------------------------

Not a fan of this approach, as we discussed before.

As one example, panda-0568:

-r--r--r--  1 cltbld cltbld       1 Jun 29 11:10 watcher.lock
[cltbld@foopy77.p6.releng.scl1.mozilla.com ~]$ uptime
 15:53:33 up 80 days,  2:14,  1 user,  load average: 0.06, 0.07, 0.02
[cltbld@foopy77.p6.releng.scl1.mozilla.com ~]$ date
Tue Sep 17 15:53:37 PDT 2013

So the foopy rebooted for some reason around that date...

I would support some sort of magic around checking the watcher.log over the course of, say, 3-4 hours for "failed to aquire lockfile", and removing the lockfile if that is all we see... the idea with that approach is that, for some odd reason, no other action is happening in the log during that period.
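(A rough sketch of what that log-scanning alternative might look like, purely to make the comparison concrete; the paths, 4-hour window, and 200-line tail are assumptions, and "aquire" is spelled as it appears in the quoted log message:)

log="/builds/${device}/watcher.log"
lock="/builds/${device}/watcher.lock"

# Only consider locks that have been sitting around for a few hours.
if [ -n "$(find "$lock" -mmin +240 2>/dev/null)" ]; then
  # If the recent log contains nothing but failed acquisition attempts,
  # assume the original holder is gone and clear the lock.
  if ! tail -n 200 "$log" | grep -qv "failed to aquire lockfile"; then
    rm -f "$lock"
  fi
fi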
Attachment #802943 - Flags: review?(bugspam.Callek) → review-
Hi Callek,

I don't really understand why you gave an r-: the example you give above is exactly the problem that my fix would have resolved. Instead, panda-0568 has had a watcher.lock file, preventing the watcher from running against it, for 80 days (almost three months).

As proposed, this fix is precisely to catch issues with e.g. power failures, and this example you give above was indeed caused by the power outage described in bug 888656. With my fix, we would not have had this problem.

Your proposed solution, to monitor the watcher.log file, would give essentially the same result, but requires considerably more logic to implement and maintain than adding a simple -l option to the lockfile command.

Can you explain again, since I did not understand, how your example above is a reason to *not* implement this change?

Thanks,
Pete
Flags: needinfo?(bugspam.Callek)
Comment on attachment 802943 [details] [diff] [review]
Patch to tools repo to update watch_devices.sh to handle stale lock files

Compromise: change of mind.

Let's make this a 12-*hour* stale-lockfile clear. My concern is around the chance that a process / contact-with-device is just taking longer than we hope, and thus we might keep stepping on each other's toes and then starve PIDs/file handles.

That is pretty unlikely as it is, so 12 hours should be safe. And if the machine goes offline unexpectedly we can do a fabric run to clear lockfiles should it affect a lot of machines, or if it's just one or two this change will solve it.

The reason I was not as worried about power faults on machines is that a power fault is almost always a "we want to look at why" event anyway.
Attachment #802943 - Flags: review- → review+
Flags: needinfo?(bugspam.Callek)
And committed again (changing to a 12-hour timeout, as per comment 4 above): https://hg.mozilla.org/build/tools/rev/0cfa0e5c2446
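(The 12-hour value works out to 12 * 60 * 60 = 43200 seconds, so the committed line is presumably the same lockfile call with -l 43200; the exact form is in the linked revision:)

lockfile -r 0 -l 43200 "$lock" || exit 0   # illustrative only; see the hg revision above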
in production
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: General Automation → General