Closed Bug 942448 Opened 11 years ago Closed 11 years ago

foopy watch_devices does not clear lockfile when it fails to get data from slavealloc

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: Callek, Assigned: sbruno)

Details

Attachments

(1 file)

So, after this weekend's mtv downtime, we have very few tegras up. Looking through logs on the foopies I see stuff like:

    retry: Calling <function run_with_timeout at 0x7f1b53e22488> with args: (['wget', '-q', '-O/builds/tegra-114/buildbot.tac.new', 'http://slavealloc.pvt.build.mozilla.org/gettac/tegra-114'], 300, None, 'ERROR 404: Not Found', True, True), kwargs: {}, attempt #4
    Executing: ['wget', '-q', '-O/builds/tegra-114/buildbot.tac.new', 'http://slavealloc.pvt.build.mozilla.org/gettac/tegra-114']
    retry: Failed, sleeping 240 seconds before retrying
    2013-11-22 22:50:03 -- *** ERROR *** failed to aquire lockfile
    ....

This means that retry failed out and we never released the lockfile. This is "a bad thing" for two reasons: we should fall back to the existing buildbot.tac _and_ always release the lockfile when stuff fails.

In order to restore sanity, I'm cleaning up across all foopies just now with:

    find /builds/*/ -name watcher.lock -cmin +120 -exec rm -vf {} \;
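For reference, the kind of lock handling I'd expect is roughly the following (a minimal sketch only, not the actual watch_devices.sh code; the device argument and lockfile path are assumptions):

    #!/bin/bash
    # Acquire a per-device lock and guarantee it is released on any exit,
    # including failures inside the retry/wget step, by using a trap.
    device=$1                                  # e.g. tegra-114 (assumed argument)
    lockfile="/builds/${device}/watcher.lock"  # assumed lock location

    if ! ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null; then
        echo "$(date) -- *** ERROR *** failed to acquire lockfile" >&2
        exit 1
    fi
    # Remove the lock on normal exit, errors, and signals alike.
    trap 'rm -f "$lockfile"' EXIT INT TERM

    # ... fetch buildbot.tac, start/stop buildbot, etc. ...
    # Even if these steps fail, the EXIT trap still removes the lock.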
Assignee: nobody → armenzg
Assignee: armenzg → nobody
Assignee: nobody → sbruno
List foopies with:

    curl -s -L 'https://hg.mozilla.org/build/tools/raw-file/tip/buildfarm/mobile/devices.json' | sed -n 's/.*foopy\([0-9][0-9]*\).*/\1/p' | sort -un | while read id; do host foopy${id} | sed -n 's/ has address .*//p'; done

List watcher log files on a foopy with:

    shopt -s nullglob; ls -1 /builds/{tegra,panda}-*/watcher.log* /builds/watcher.log /builds/watcher_rolled_logs/watcher.log*

/builds/tools/buildfarm/mobile/watch_devices.sh creates these log files (and also the associated lock files) and is in our repos here: https://hg.mozilla.org/build/tools/file/tip/buildfarm/mobile/watch_devices.sh

watch_devices.sh is triggered by cron via /etc/cron.d/foopy on the foopy, which is defined in puppet here: https://hg.mozilla.org/build/puppet/file/tip/modules/foopy/templates/foopy.crontab.erb

A hypothetical example of what such a cron entry looks like is given below.
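Illustrative cron entry only; the actual schedule and user come from the foopy.crontab.erb template linked above, not from this sketch:

    # /etc/cron.d/foopy (hypothetical example, not the real file)
    */5 * * * * cltbld /builds/tools/buildfarm/mobile/watch_devices.sh > /dev/null 2>&1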
Regarding this bug, I thought the issue was fixed in bug 915106?
Attached patch tools_01.patch (Splinter Review)
Attachment #8454370 - Flags: review?(pmoore)
Maybe you got to it, Callek, before the 12 hour timeout? See https://bugzilla.mozilla.org/show_bug.cgi?id=915106#c4 and https://bugzilla.mozilla.org/show_bug.cgi?id=915106#c6. For this reason, I would still prefer a one hour timeout.
Patch uploaded anyway; it falls back to the existing buildbot.tac in case the slavealloc call to retrieve a new one fails.
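Roughly, the patch aims at this pattern (a sketch only, not the actual diff; it assumes $device is already set, and the retry wrapping seen in the comment 0 log is omitted here):

    # Download to buildbot.tac.new and only replace buildbot.tac on success.
    tac_new="/builds/${device}/buildbot.tac.new"
    tac="/builds/${device}/buildbot.tac"
    if wget -q -O "$tac_new" "http://slavealloc.pvt.build.mozilla.org/gettac/${device}"; then
        mv "$tac_new" "$tac"
    elif [ -f "$tac" ]; then
        echo "slavealloc fetch failed; falling back to existing buildbot.tac"
    else
        echo "*** ERROR *** slavealloc fetch failed and no existing buildbot.tac" >&2
        exit 1
    fi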
Before I review, I would like to double check the statements in comment 0.

When you wrote the code to get the tac file from slavealloc, I think you already took care of making sure we fall back to using the current version if we can't contact slavealloc: https://hg.mozilla.org/build/tools/file/c385d52dd8fa/buildfarm/mobile/manage_buildslave.sh#l41

Here you can see we download to buildbot.tac.new, and only if that succeeds do we replace buildbot.tac with buildbot.tac.new. If the download fails, the old buildbot.tac prevails. Admittedly the logging could be improved to say "falling back to using existing version" (and to check that there is already an existing version there - it should only be missing for a device that *never* got a buildbot.tac, i.e. a new device that has never run a job, which should not be the case in this example).

The second point: I believe the lockfile is only *not* released when there is a crash (i.e. when the trap that removes the lock does not get called): https://hg.mozilla.org/build/tools/file/c385d52dd8fa/buildfarm/mobile/watch_devices.sh#l51

So I think both of these cases are already handled: we have a trap to remove the lock regardless of whether the script errors or not, and we have a timeout to remove the lock file if it is older than 12 hours. Lastly, we have something in place (written by you!) :) that makes sure we fall back to using the existing buildbot.tac if we can't download a new one.

So I'm not sure the problems after the mtv downtime are really caused by this. And if it was because of sudden power loss on the foopies, I believe waiting 12 hours would have solved it, and as I said in bug 915106 I would prefer a 1 hour wait rather than a 12 hour one (comments 4 and 6 of that bug).
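To spell out the staleness handling mentioned above, it amounts to something like this (a sketch only; the real check lives in watch_devices.sh, and the 12-hour/720-minute value is the one discussed here, not the 2-hour value Callek used for his manual cleanup):

    # Clear a lock left behind by a crash or power loss once it is older than
    # 12 hours, so the watcher can recover on its own.
    find /builds/*/ -name watcher.lock -mmin +720 -exec rm -f {} \;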
Flags: needinfo?(bugspam.Callek)
In light of Pete's comment, the value of my patch is just slightly better logging in the (unlikely) case that both the slavealloc call fails and there is no existing buildbot.tac file to fall back to.
I'd like to remove the review request flag for me until the needinfo above has been clarified, since I think the review may not be needed and it is stuck in my queue. If the review is still required once the needinfo is answered, please feel free to set the review flag for me again. Thanks, Simone!
Attachment #8454370 - Flags: review?(pmoore)
(In reply to Pete Moore [:pete][:pmoore] from comment #6)
> Before I review, I would like to double check the statements in the comment
> 0:

Hrm..... everything you say looks accurate when I stare at the code, but it looks counter to what I see in comment 0 (e.g. how do we get to "sleeping 240 seconds before retrying" from that retry command if all is well, and what happens on network failure, etc.). I'm inclined to trust your evaluation here, Pete, and proceed as you decree without being a further blocker here. I'm sorry it took me so long to evaluate this needinfo.
Flags: needinfo?(bugspam.Callek)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
