Bug 915106
watch_devices.sh lock files not getting cleaned up on foopies
Opened 12 years ago
Closed 12 years ago
Categories: Release Engineering :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: pmoore; Assigned: pmoore
Attachments: 1 file (patch, 663 bytes; Callek: review+)
In this comment, https://bugzilla.mozilla.org/show_bug.cgi?id=869897#c5 I highlighted the following problem with the current implementation of the lockfile mechanism in watch_devices.sh:
.... <excerpt> ....
A problem with the current "lockfile" implementation
====================================================
The current "lockfile" implementation is not safe if the exit trap of watch_devices.sh does not get called for any reason (e.g. process crash, out-of-memory error, physical power loss, kernel panic, ....). In this situation, the lockfile remains in place indefinitely, blocking all future runs of watch_devices.sh. The stale lockfile must first be discovered manually, and then removed manually, before the watcher can run again. This can be fixed by adding -l 3600 as a switch to the lockfile command.
.... </excerpt> ....
The change for this was contained in the following patch:
https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=869897&attachment=750723
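For illustration, the fix amounts to a one-switch change to the lockfile(1) invocation. The sketch below is not taken from the actual watch_devices.sh; the lock path and variable names are hypothetical, and it only prints the command line rather than running lockfile:

```shell
# Sketch only (illustrative path and variable names, not from the
# actual watch_devices.sh). lockfile(1) from procmail, given -l N,
# force-removes an existing lock once it is older than N seconds,
# so a crash or power loss can block the watcher for at most that
# long instead of indefinitely.
LOCKFILE="/builds/watcher.lock"   # hypothetical path
STALE_SECS=3600                   # 1 hour, as in the original patch
echo "lockfile -l ${STALE_SECS} ${LOCKFILE}"
```

Without -l, lockfile treats any existing lock as valid forever, which is exactly the failure mode described above.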
This was rejected with the comment (https://bugzilla.mozilla.org/show_bug.cgi?id=869897#c25):
.... <excerpt> ....
can we (for now) not do the -l 3600 I'd rather approach that in a followup at a later time. since if we get a lockfile stuck around for a long time, its almost always a reason for a human to peek anyway.
Very few lockfile-is-stuck is cause for no-human, at least for now.
.... </excerpt> ....
I understand the motivation of this comment, but I think it rests on two assumptions: a) that a human would know there was a problem to investigate, and b) that such investigations would be routinely performed. I'm not sure either of these is true. I believe a fully automated solution is our best strategy for overcoming this issue.
For example, I've just checked and found the following pandas with stale lockfiles, and there could be more that weren't captured by my test:
panda-0024
panda-0028
panda-0029
panda-0040
panda-0041
panda-0042
panda-0043
panda-0050
panda-0055
panda-0071
panda-0178
panda-0211
panda-0212
panda-0222
panda-0236
panda-0239
panda-0294
panda-0303
panda-0473
panda-0535
panda-0567
panda-0568
panda-0587
panda-0605
panda-0673
panda-0706
panda-0707
panda-0714
panda-0721
panda-0726
panda-0752
panda-0771
panda-0813
panda-0840
Since this is a very simple and non-invasive change, and should not have any untoward side effects, I would like to go ahead and implement it. Stale lock files would then get cleaned up automatically, preventing devices from sitting idle when they could be in use.
Comment 1 (Assignee) • 12 years ago
Attachment #802943 - Flags: review?(bugspam.Callek)
Updated • 12 years ago
Component: Release Automation → General Automation
QA Contact: bhearsum → catlee
Comment 2 • 12 years ago
Comment on attachment 802943 [details] [diff] [review]
Patch to tools repo to update watch_devices.sh to handle stale lock files
Review of attachment 802943 [details] [diff] [review]:
-----------------------------------------------------------------
Not a fan of this approach, as we discussed before;
As one example, panda-0568:
-r--r--r-- 1 cltbld cltbld 1 Jun 29 11:10 watcher.lock
[cltbld@foopy77.p6.releng.scl1.mozilla.com ~]$ uptime
15:53:33 up 80 days, 2:14, 1 user, load average: 0.06, 0.07, 0.02
[cltbld@foopy77.p6.releng.scl1.mozilla.com ~]$ date
Tue Sep 17 15:53:37 PDT 2013
So the foopy rebooted for some reason around that date...
I would support some sort of magic around checking the watcher.log over the course of say 3-4 hours for "failed to acquire lockfile" and removing the lockfile if that is the case... the idea behind that approach is that there is no other action happening in the log during that period for some odd reason.
Attachment #802943 -
Flags: review?(bugspam.Callek) → review-
Comment 3 (Assignee) • 12 years ago
Hi Callek,
I don't really understand why you gave an r-: the example you give above is exactly the problem that my fix would have resolved. Instead, panda-0568 has had a watcher.lock file, preventing the watcher from running against it, for 80 days (almost three months).
As proposed, this fix is precisely to catch issues with e.g. power failures, and this example you give above was indeed caused by the power outage described in bug 888656. With my fix, we would not have had this problem.
Your proposed solution, to monitor the watcher.log file, would give essentially the same result, but would require considerably more logic to implement and maintain than adding a simple -l option to the lockfile command.
Can you explain again, since I did not understand, how your example above is a reason to *not* implement this change?
Thanks,
Pete
Flags: needinfo?(bugspam.Callek)
Comment 4 • 12 years ago
Comment on attachment 802943 [details] [diff] [review]
Patch to tools repo to update watch_devices.sh to handle stale lock files
compromise change of mind.
Lets make this 12 *hours* of stale lockfile clear. My concern falls around the chance that a proc/contact-with-device is just taking longer than we hope. And thus we might try to do stuff with constant-step-on-toes and then starve pid's/file handles.
That is pretty unlikely as it is, so having 12 hours should be safe. And if the machine goes offline unexpectedly we can do a fabric run to clear lockfiles should it be a lot of machines, or if its just one or two this will solve it.
The reason I was not as worried about power faults on machines, is that power faults is almost always a "we want to look at why" anyway.
Attachment #802943 -
Flags: review- → review+
Updated • 12 years ago
Flags: needinfo?(bugspam.Callek)
Comment 5 (Assignee) • 12 years ago
Comment 6 (Assignee) • 12 years ago
And committed again (to change to 12 hour timeout, as per comment 4 above): https://hg.mozilla.org/build/tools/rev/0cfa0e5c2446
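For reference, the 12-hour stale window agreed in comment 4, expressed in the seconds that lockfile's -l switch expects (a sketch only; the lock file name is illustrative):

```shell
# 12 hours in seconds for lockfile's -l stale-lock timeout (sketch;
# "watcher.lock" is illustrative, not the committed script contents).
STALE_SECS=$((12 * 60 * 60))      # 43200
echo "lockfile -l ${STALE_SECS} watcher.lock"
```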
Comment 7 • 12 years ago
in production
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 7 years ago
Component: General Automation → General