Closed Bug 942448 Opened 11 years ago Closed 11 years ago

foopy watch_devices does not clear lockfile when it fails to get data from slavealloc

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: Callek, Assigned: sbruno)

Details

Attachments

(1 file)

So, after this weekend's mtv downtime, we have very few tegras up. Looking through logs on the foopies I see stuff like:

    retry: Calling <function run_with_timeout at 0x7f1b53e22488> with args: (['wget', '-q', '-O/builds/tegra-114/buildbot.tac.new', 'http://slavealloc.pvt.build.mozilla.org/gettac/tegra-114'], 300, None, 'ERROR 404: Not Found', True, True), kwargs: {}, attempt #4
    Executing: ['wget', '-q', '-O/builds/tegra-114/buildbot.tac.new', 'http://slavealloc.pvt.build.mozilla.org/gettac/tegra-114']
    retry: Failed, sleeping 240 seconds before retrying
    2013-11-22 22:50:03 -- *** ERROR *** failed to aquire lockfile
    ....

This means that retry failed out and we never released the lockfile. This is "a bad thing" for two reasons: we should fall back to the existing buildbot.tac _and_ always release the lockfile when stuff fails.

In order to restore sanity, I'm cleaning up across all foopies just now with:

    find /builds/*/ -name watcher.lock -cmin +120 -exec rm -vf {} \;
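For reference, the kind of lock handling I'd expect is roughly the following (a minimal sketch only, not the actual watch_devices.sh code; the device argument and lockfile path are assumptions):

    #!/bin/bash
    # Acquire a per-device lock and guarantee it is released on any exit,
    # including failures inside the retry/wget step, by using a trap.
    device=$1                                  # e.g. tegra-114 (assumed argument)
    lockfile="/builds/${device}/watcher.lock"  # assumed lock location

    if ! ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null; then
        echo "$(date) -- *** ERROR *** failed to acquire lockfile" >&2
        exit 1
    fi
    # Remove the lock on normal exit, errors, and signals alike.
    trap 'rm -f "$lockfile"' EXIT INT TERM

    # ... fetch buildbot.tac, start/stop buildbot, etc. ...
    # Even if these steps fail, the EXIT trap still removes the lock.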
Assignee: nobody → armenzg
Assignee: armenzg → nobody
Assignee: nobody → sbruno
List foopies with:

    curl -s -L 'https://hg.mozilla.org/build/tools/raw-file/tip/buildfarm/mobile/devices.json' | sed -n 's/.*foopy\([0-9][0-9]*\).*/\1/p' | sort -un | while read id; do host foopy${id} | sed -n 's/ has address .*//p'; done

List watcher log files on a foopy with:

    shopt -s nullglob; ls -1 /builds/{tegra,panda}-*/watcher.log* /builds/watcher.log /builds/watcher_rolled_logs/watcher.log*

/builds/tools/buildfarm/mobile/watch_devices.sh creates these log files (and also the associated lock files) and is in our repos here: https://hg.mozilla.org/build/tools/file/tip/buildfarm/mobile/watch_devices.sh

watch_devices.sh is triggered by cron via /etc/cron.d/foopy on the foopy, which is defined in puppet here: https://hg.mozilla.org/build/puppet/file/tip/modules/foopy/templates/foopy.crontab.erb

A hypothetical example of what such a cron entry looks like is given below.
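Illustrative cron entry only; the actual schedule and user come from the foopy.crontab.erb template linked above, not from this sketch:

    # /etc/cron.d/foopy (hypothetical example, not the real file)
    */5 * * * * cltbld /builds/tools/buildfarm/mobile/watch_devices.sh > /dev/null 2>&1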
Regarding this bug, I thought the issue was fixed in bug 915106?
Attached patch tools_01.patch (Splinter Review)
Attachment #8454370 - Flags: review?(pmoore)
Maybe you got to it, Callek, before the 12 hour timeout? See https://bugzilla.mozilla.org/show_bug.cgi?id=915106#c4 and https://bugzilla.mozilla.org/show_bug.cgi?id=915106#c6. For this reason, I would still prefer a one hour timeout.
Patch uploaded anyway; it falls back to the existing buildbot.tac in case the slavealloc call to retrieve a new one fails.
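Roughly, the patch aims at this pattern (a sketch only, not the actual diff; it assumes $device is already set, and the retry wrapping seen in the comment 0 log is omitted here):

    # Download to buildbot.tac.new and only replace buildbot.tac on success.
    tac_new="/builds/${device}/buildbot.tac.new"
    tac="/builds/${device}/buildbot.tac"
    if wget -q -O "$tac_new" "http://slavealloc.pvt.build.mozilla.org/gettac/${device}"; then
        mv "$tac_new" "$tac"
    elif [ -f "$tac" ]; then
        echo "slavealloc fetch failed; falling back to existing buildbot.tac"
    else
        echo "*** ERROR *** slavealloc fetch failed and no existing buildbot.tac" >&2
        exit 1
    fi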
Before I review, I would like to double check the statements in comment 0.

When you wrote the code to get the tac file from slavealloc, I think you already took care of making sure we fall back to using the current version if we can't contact slavealloc: https://hg.mozilla.org/build/tools/file/c385d52dd8fa/buildfarm/mobile/manage_buildslave.sh#l41

Here you can see we download to buildbot.tac.new, and only if that succeeds do we replace buildbot.tac with buildbot.tac.new. If the download fails, the old buildbot.tac prevails. Admittedly the logging could be improved to say "falling back to using existing version" (and to check that there is already an existing version there - it should only be missing for a device that *never* got a buildbot.tac, i.e. a new device that has never run a job, which should not be the case in this example).

The second point: I believe the lockfile is only *not* released when there is a crash (i.e. when the trap that removes the lock does not get called): https://hg.mozilla.org/build/tools/file/c385d52dd8fa/buildfarm/mobile/watch_devices.sh#l51

So I think both of these cases are already handled: we have a trap to remove the lock regardless of whether the script errors or not, and we have a timeout to remove the lock file if it is older than 12 hours. Lastly, we have something in place (written by you!) :) that makes sure we fall back to using the existing buildbot.tac if we can't download a new one.

So I'm not sure the problems after the mtv downtime are really caused by this. And if it was because of sudden power loss on the foopies, I believe waiting 12 hours would have solved it, and as I said in bug 915106 I would prefer a 1 hour wait rather than a 12 hour one (comments 4 and 6 of that bug).
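To spell out the staleness handling mentioned above, it amounts to something like this (a sketch only; the real check lives in watch_devices.sh, and the 12-hour/720-minute value is the one discussed here, not the 2-hour value Callek used for his manual cleanup):

    # Clear a lock left behind by a crash or power loss once it is older than
    # 12 hours, so the watcher can recover on its own.
    find /builds/*/ -name watcher.lock -mmin +720 -exec rm -f {} \;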
Flags: needinfo?(bugspam.Callek)
In light of Pete's comment, the value of my patch is just slightly better logging in the (unlikely) case that both the slavealloc call fails and there is no existing buildbot.tac file to fall back to.
I'd like to remove the review request flag for me until the needinfo above has been clarified, since I think the review may not be needed and it is stuck in my queue. If the review is still required once the needinfo is answered, please feel free to set the review flag for me again. Thanks, Simone!
Attachment #8454370 - Flags: review?(pmoore)
(In reply to Pete Moore [:pete][:pmoore] from comment #6)
> Before I review, I would like to double check the statements in the comment
> 0:

Hrm..... everything you say looks accurate when I stare at the code, but it looks counter to what I see in comment 0 (e.g. how do we get to "sleeping 240 seconds before retrying" from that retry command if all is well, and what happens on network failure, etc.). I'm inclined to trust your evaluation here, Pete, and proceed as you decree without being a further blocker here. I'm sorry it took me so long to evaluate this needinfo.
Flags: needinfo?(bugspam.Callek)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
