Closed Bug 1024448 Opened 10 years ago Closed 10 years ago

Stale slaverebooter lockfile

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sbruno, Assigned: Callek)

Details

Attachments

(2 obsolete files)

Nagios is sending out critical warnings like:

Thu 04:14:27 PDT [4760] buildbot-master74.srv.releng.usw2.mozilla.com:File Age - /builds/slaverebooter/slaverebooter.log is CRITICAL: FILE_AGE CRITICAL: /builds/slaverebooter/slaverebooter.log is 37895 seconds old and 30351977 bytes

I'd like to raise two points about this:

1) If the crontabbed command which runs slavereboot crashes, the lock file will not be removed - this "point of fragility" is similar  to the one described here (pete pointed me to this): https://bugzilla.mozilla.org/show_bug.cgi?id=915106
We should at least add a -l option here as well, or use a different mechanism to make sure one instance is running at a time (see e.g.: https://bugzilla.mozilla.org/show_bug.cgi?id=869897#c2)

2) I tried to manually remove the lock file to allow future scheduled runs, but there's a process which recreates it as soon as it's removed. Callek: is something you setup, maybe temporarily as part of your work on slaveapi update?
If so, maybe it could be useful to update this bug with info which could be relevant for troubleshooting this kind of issues when you're not online.
Thanks a lot!
Looks like a screen session is creating this lockfile process, rather than the lockfile mentioned in the crontab:


[12/06/2014 14:15:56] Peter Moore: root      7231     1  0 Jun10 ?        00:00:00 sudo su -
  -> root      7233  7231  0 Jun10 ?        00:00:00 su -
    -> root      7234  7233  0 Jun10 ?        00:00:00 -bash
      -> root      7247  7234  0 Jun10 ?        00:00:00 su cltbld
        -> cltbld    7248  7247  0 Jun10 ?        00:00:00 bash
          -> cltbld   19539  7248  0 04:44 ?        00:00:00 lockfile /builds/slaverebooter/slaverebooter.lock
[12/06/2014 14:18:54] Peter Moore: *but*
[12/06/2014 14:18:56] Peter Moore: [root@buildbot-master74.srv.releng.usw2.mozilla.com ~]# cat /proc/7248/environ | xargs -0 -n1 | grep TERM
TERM=screen
[12/06/2014 14:22:00] Peter Moore: hmmmm - interesting
[12/06/2014 14:22:01] Peter Moore: [root@buildbot-master74.srv.releng.usw2.mozilla.com ~]# for proc in 1 7231 7233 7234 7247 7248 19539; do cat /proc/${proc}/environ | xargs -0 -n1 | grep TERM; done
TERM=linux
TERM=screen
TERMCAP=SC|screen|VT 100/ANSI X3.64 virtual terminal:\
TERM=screen
TERM=screen
TERM=screen
TERM=screen
TERM=screen
[root@buildbot-master74.srv.releng.usw2.mozilla.com ~]#
[12/06/2014 14:23:15] Peter Moore: [root@buildbot-master74.srv.releng.usw2.mozilla.com ~]# cat /proc/7233/environ | xargs -0 -n1 | grep SUDO_USER
SUDO_USER=jwood
[root@buildbot-master74.srv.releng.usw2.mozilla.com ~]#
Attached patch 1 - Kill NativeFrameSize (obsolete) — Splinter Review
As discussed, since bug 860736, AlignmentAtAsmJSPrologue and NativeFrameSize are equivalent and used for the same
purpose, so we can get rid of one of them.
Attachment #8439223 - Flags: review?(luke)
Assignee: nobody → benj
Status: NEW → ASSIGNED
More tests for the asm.js => ion path:
- calls with all arguments on registers
- (your test in the previous bug was already testing calls with arguments getting on the stack)
- calls going through the throwLabel path
- OOL convert for int32
- OOL convert for doubles
Any other ideas of missing tests?
Attachment #8439226 - Flags: review?(luke)
Benjamin, are you sure this is the right Bug for your patches?
Flags: needinfo?(benj)
Oops, sorry for the noise. Silly me, I've used bzexport without checking the bug number.
Flags: needinfo?(benj)
Comment on attachment 8439223 [details] [diff] [review]
1 - Kill NativeFrameSize

Nothing to do here.
Attachment #8439223 - Attachment is obsolete: true
Attachment #8439223 - Flags: review?(luke)
Comment on attachment 8439226 [details] [diff] [review]
More tests for asm.js => ion calls

Nothing to do here.
Attachment #8439226 - Attachment is obsolete: true
Attachment #8439226 - Flags: review?(luke)
Assignee: benj → nobody
Status: ASSIGNED → NEW
Apparantly I never killed my while loop which was aquiring the lockfile properly... :/

I had started a manual run after the slaveapi server was brought up just fine, but said manual run was hanging off the already aquired lockfile, and did remove the lock, but since it was aquiring in a loop didn't allow the cron to run...

This should be fixed (did some `kill` magic) but leaving open until the actual cron kicks off
Assignee: nobody → bugspam.Callek
Callek, what about the -l option for lockfile command (or an alternative mechanism to manage the synchronization)?
in the current slaverebooter iteration we don't want -l because it can take arbitrarily long to complete (when I was doing the restart, it took greater than 11 hours start->finish)... so I don't want to do -l because we absolutely don't want to overlap.

I'm not sure what alternative methods are better than lockfile atm. this nagios alert usually helps us notice when things go sideways (like today)
I think a good safety check is whether the pid is still alive. i.e. we would write a .pid file once we had acquired the lock, and this would contain the process id number (optionally we could also store cmdline/environ or start time).

Then when the next process tries to acquire the lockfile, and fails, it can sanity check if the process is still alive, by checking the process table. In the case a different process exists with the same process id (they get reused by the OS), we could also query the process to see if it matches the other meta data (environ / cmdline / start time etc).

Then you have a very robust way of checking that the process really is still running.

I also prefer mkdir to lockfile, since it is a standard system command available on all OS's, atomically checks for the existence of a directory and creates it - so is safe (no race condition), and also the created directory can then include e.g. pid files, log files, etc. If you don't like it, we can still implement the same using lockfile.

BTW might make sense to reuse the same solution in our other cron / only-run-one-of-these-at-once tools (such as watch_devices.sh).
The issue as filed is resolved, for a while now
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: