Closed
Bug 915106
Opened 11 years ago
Closed 11 years ago
watch_devices.sh lock files not getting cleaned up on foopies
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Assigned: pmoore)
Details
Attachments
(1 file)
663 bytes,
patch
|
Callek
:
review+
|
Details | Diff | Splinter Review |
In this comment, https://bugzilla.mozilla.org/show_bug.cgi?id=869897#c5 I highlighted the following problem with the current implementation of the lockfile mechanism in watch_devices.sh: .... <excerpt> .... A problem with the current "lockfile" implementation ==================================================== The current "lockfile" implementation is not safe if the exit trap of watch_devices.sh does not get called for any reason (e.g. process crash, out-of-memory error, physical power loss, kernel panic, ....). In this situation, the lockfile remains in place indefinitely, blocking future runs watch_devices.sh. The blocking lockfile must first manually be discovered, and then manually removed to fix the problem. This can be fixed by adding -l 3600 as a switch to the lockfile command. .... </excerpt> .... The change for this was contained in the following patch: https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=869897&attachment=750723 This was rejected with the comment (https://bugzilla.mozilla.org/show_bug.cgi?id=869897#c25): .... <excerpt> .... can we (for now) not do the -l 3600 I'd rather approach that in a followup at a later time. since if we get a lockfile stuck around for a long time, its almost always a reason for a human to peek anyway. Very few lockfile-is-stuck is cause for no-human, at least for now. .... </excerpt> .... I understand the motivation of this comment, but I think there is an assumption at play that a) a human would know there was a problem to investigate, and b) that such investigations would be routinely performed. I'm not sure either of these are true. I believe a fully automated solution would be our best strategy to overcome this issue. For example, I've just looked and just found the following pandas with stale lockfiles, and there could be more that weren't captured by my test: panda-0024 panda-0028 panda-0029 panda-0040 panda-0041 panda-0042 panda-0043 panda-0050 panda-0055 panda-0071 panda-0178 panda-0211 panda-0212 panda-0222 panda-0236 panda-0239 panda-0294 panda-0303 panda-0473 panda-0535 panda-0567 panda-0568 panda-0587 panda-0605 panda-0673 panda-0706 panda-0707 panda-0714 panda-0721 panda-0726 panda-0752 panda-0771 panda-0813 panda-0840 Since this is a very simple and non-invasive change to perform, and should not have any untoward side effects, I would like to go ahead and implement it. Then stale lock files would get cleaned up automatically and prevent devices being idle that could be used.
Assignee | ||
Comment 1•11 years ago
|
||
Attachment #802943 -
Flags: review?(bugspam.Callek)
Updated•11 years ago
|
Component: Release Automation → General Automation
QA Contact: bhearsum → catlee
Comment 2•11 years ago
|
||
Comment on attachment 802943 [details] [diff] [review] Patch to tools repo to update watch_devices.sh to handle stale lock files Review of attachment 802943 [details] [diff] [review]: ----------------------------------------------------------------- Not a fan of this approach, as we discussed before; As one example, panda-0568: -r--r--r-- 1 cltbld cltbld 1 Jun 29 11:10 watcher.lock [cltbld@foopy77.p6.releng.scl1.mozilla.com ~]$ uptime 15:53:33 up 80 days, 2:14, 1 user, load average: 0.06, 0.07, 0.02 [cltbld@foopy77.p6.releng.scl1.mozilla.com ~]$ date Tue Sep 17 15:53:37 PDT 2013 So the foopy rebooted for some reason around that date... I would support some sort of magic around checking the watcher.log over the course of say 3-4 hours for "failed to aquire lockfile" and removing the lockfile if that is the case... the idea on that approach is that there is no other action happening in the log during that period for some odd reason.
Attachment #802943 -
Flags: review?(bugspam.Callek) → review-
Assignee | ||
Comment 3•11 years ago
|
||
Hi Callek, I don't really understand why you gave an r-: the example you give above is exactly the problem that my fix would have resolved. Instead, panda-0568 has had a watcher.lock file, preventing the watcher from running against it, for 80 days (almost three months). As proposed, this fix is precisely to catch issues with e.g. power failures, and this example you give above was indeed caused by the power outage described in bug 888656. With my fix, we would not have had this problem. Your proposed solution, to monitor the watcher.log file, would give essentially the same result, but is considerably more logic to implement and maintain, than adding a simple -l option to the lockfile command. Can you explain again, since I did not understand, how your example above is a reason to *not* implement this change? Thanks, Pete
Flags: needinfo?(bugspam.Callek)
Comment 4•11 years ago
|
||
Comment on attachment 802943 [details] [diff] [review] Patch to tools repo to update watch_devices.sh to handle stale lock files compromise change of mind. Lets make this 12 *hours* of stale lockfile clear. My concern falls around the chance that a proc/contact-with-device is just taking longer than we hope. And thus we might try to do stuff with constant-step-on-toes and then starve pid's/file handles. That is pretty unlikely as it is, so having 12 hours should be safe. And if the machine goes offline unexpectedly we can do a fabric run to clear lockfiles should it be a lot of machines, or if its just one or two this will solve it. The reason I was not as worried about power faults on machines, is that power faults is almost always a "we want to look at why" anyway.
Attachment #802943 -
Flags: review- → review+
Updated•11 years ago
|
Flags: needinfo?(bugspam.Callek)
Assignee | ||
Comment 5•11 years ago
|
||
Committed: https://hg.mozilla.org/build/tools/rev/112929180442
Assignee | ||
Comment 6•11 years ago
|
||
And committed again (to change to 12 hour timeout, as per comment 4 above): https://hg.mozilla.org/build/tools/rev/0cfa0e5c2446
Comment 7•11 years ago
|
||
in production
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•