Closed
Bug 1024448
Opened 10 years ago
Closed 10 years ago
Stale slaverebooter lockfile
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: sbruno, Assigned: Callek)
Details
Attachments
(2 obsolete files)
Nagios is sending out critical warnings like: Thu 04:14:27 PDT [4760] buildbot-master74.srv.releng.usw2.mozilla.com:File Age - /builds/slaverebooter/slaverebooter.log is CRITICAL: FILE_AGE CRITICAL: /builds/slaverebooter/slaverebooter.log is 37895 seconds old and 30351977 bytes I'd like to raise two points about this: 1) If the crontabbed command which runs slavereboot crashes, the lock file will not be removed - this "point of fragility" is similar to the one described here (pete pointed me to this): https://bugzilla.mozilla.org/show_bug.cgi?id=915106 We should at least add a -l option here as well, or use a different mechanism to make sure one instance is running at a time (see e.g.: https://bugzilla.mozilla.org/show_bug.cgi?id=869897#c2) 2) I tried to manually remove the lock file to allow future scheduled runs, but there's a process which recreates it as soon as it's removed. Callek: is something you setup, maybe temporarily as part of your work on slaveapi update? If so, maybe it could be useful to update this bug with info which could be relevant for troubleshooting this kind of issues when you're not online. Thanks a lot!
Comment 1•10 years ago
|
||
Looks like a screen session is creating this lockfile process, rather than the lockfile mentioned in the crontab: [12/06/2014 14:15:56] Peter Moore: root 7231 1 0 Jun10 ? 00:00:00 sudo su - -> root 7233 7231 0 Jun10 ? 00:00:00 su - -> root 7234 7233 0 Jun10 ? 00:00:00 -bash -> root 7247 7234 0 Jun10 ? 00:00:00 su cltbld -> cltbld 7248 7247 0 Jun10 ? 00:00:00 bash -> cltbld 19539 7248 0 04:44 ? 00:00:00 lockfile /builds/slaverebooter/slaverebooter.lock [12/06/2014 14:18:54] Peter Moore: *but* [12/06/2014 14:18:56] Peter Moore: [root@buildbot-master74.srv.releng.usw2.mozilla.com ~]# cat /proc/7248/environ | xargs -0 -n1 | grep TERM TERM=screen [12/06/2014 14:22:00] Peter Moore: hmmmm - interesting [12/06/2014 14:22:01] Peter Moore: [root@buildbot-master74.srv.releng.usw2.mozilla.com ~]# for proc in 1 7231 7233 7234 7247 7248 19539; do cat /proc/${proc}/environ | xargs -0 -n1 | grep TERM; done TERM=linux TERM=screen TERMCAP=SC|screen|VT 100/ANSI X3.64 virtual terminal:\ TERM=screen TERM=screen TERM=screen TERM=screen TERM=screen [root@buildbot-master74.srv.releng.usw2.mozilla.com ~]# [12/06/2014 14:23:15] Peter Moore: [root@buildbot-master74.srv.releng.usw2.mozilla.com ~]# cat /proc/7233/environ | xargs -0 -n1 | grep SUDO_USER SUDO_USER=jwood [root@buildbot-master74.srv.releng.usw2.mozilla.com ~]#
Comment 2•10 years ago
|
||
As discussed, since bug 860736, AlignmentAtAsmJSPrologue and NativeFrameSize are equivalent and used for the same purpose, so we can get rid of one of them.
Attachment #8439223 -
Flags: review?(luke)
Updated•10 years ago
|
Assignee: nobody → benj
Status: NEW → ASSIGNED
Comment 3•10 years ago
|
||
More tests for the asm.js => ion path: - calls with all arguments on registers - (your test in the previous bug was already testing calls with arguments getting on the stack) - calls going through the throwLabel path - OOL convert for int32 - OOL convert for doubles Any other ideas of missing tests?
Attachment #8439226 -
Flags: review?(luke)
Reporter | ||
Comment 4•10 years ago
|
||
Benjamin, are you sure this is the right Bug for your patches?
Flags: needinfo?(benj)
Comment 5•10 years ago
|
||
Oops, sorry for the noise. Silly me, I've used bzexport without checking the bug number.
Flags: needinfo?(benj)
Comment 6•10 years ago
|
||
Comment on attachment 8439223 [details] [diff] [review] 1 - Kill NativeFrameSize Nothing to do here.
Attachment #8439223 -
Attachment is obsolete: true
Attachment #8439223 -
Flags: review?(luke)
Comment 7•10 years ago
|
||
Comment on attachment 8439226 [details] [diff] [review] More tests for asm.js => ion calls Nothing to do here.
Attachment #8439226 -
Attachment is obsolete: true
Attachment #8439226 -
Flags: review?(luke)
Updated•10 years ago
|
Assignee: benj → nobody
Status: ASSIGNED → NEW
Assignee | ||
Comment 8•10 years ago
|
||
Apparantly I never killed my while loop which was aquiring the lockfile properly... :/ I had started a manual run after the slaveapi server was brought up just fine, but said manual run was hanging off the already aquired lockfile, and did remove the lock, but since it was aquiring in a loop didn't allow the cron to run... This should be fixed (did some `kill` magic) but leaving open until the actual cron kicks off
Assignee: nobody → bugspam.Callek
Reporter | ||
Comment 9•10 years ago
|
||
Callek, what about the -l option for lockfile command (or an alternative mechanism to manage the synchronization)?
Assignee | ||
Comment 10•10 years ago
|
||
in the current slaverebooter iteration we don't want -l because it can take arbitrarily long to complete (when I was doing the restart, it took greater than 11 hours start->finish)... so I don't want to do -l because we absolutely don't want to overlap. I'm not sure what alternative methods are better than lockfile atm. this nagios alert usually helps us notice when things go sideways (like today)
Comment 11•10 years ago
|
||
I think a good safety check is whether the pid is still alive. i.e. we would write a .pid file once we had acquired the lock, and this would contain the process id number (optionally we could also store cmdline/environ or start time). Then when the next process tries to acquire the lockfile, and fails, it can sanity check if the process is still alive, by checking the process table. In the case a different process exists with the same process id (they get reused by the OS), we could also query the process to see if it matches the other meta data (environ / cmdline / start time etc). Then you have a very robust way of checking that the process really is still running. I also prefer mkdir to lockfile, since it is a standard system command available on all OS's, atomically checks for the existence of a directory and creates it - so is safe (no race condition), and also the created directory can then include e.g. pid files, log files, etc. If you don't like it, we can still implement the same using lockfile. BTW might make sense to reuse the same solution in our other cron / only-run-one-of-these-at-once tools (such as watch_devices.sh).
Assignee | ||
Comment 12•10 years ago
|
||
The issue as filed is resolved, for a while now
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•