Closed Bug 1245823 Opened 9 years ago Closed 9 years ago

Speedup runtime of granular pending builds nagios check

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: aselagea)

References

Details

Attachments

(5 files, 5 obsolete files)

Over in Bug 1204970] we made check pending builds use allthethings.json and parse that for actual pool depth/age. However over the past 24 hours or so its been flapping as CRITICAL due to nagios timeouts in trying to run it. When we first created the new check, amy bumped the timeout to 60s, but bumping it more than that isn't as useful. There are two main options here: * Speed up the overall runtime of the check * Make the script run in a cron and emit to a static file the nagios check queries. I'm partial to trying for speeding up first.
Attached file new_check_pending_builds.py (obsolete) —
I uploaded a new version of the pending count check script. There are two major things that have been changed: - cached the list of builders by counting how many times they repeat: in this way, we will avoid comparing each builder name with its correspondent from the allthethings.json file, thus saving a great amount of time - use the local allthethings.json file (which is also computed on cruncher) instead of its secure.pub.m.o link These combined can greatly reduce the time required to run the script.
Attachment #8717015 - Flags: feedback?(bugspam.Callek)
Comment on attachment 8717015 [details] new_check_pending_builds.py giving feedback would be much easier for me in terms of a diff rather than a flat file. Also I'm interested in the timing info from before and after the change, e.g. stamp_start=$(date --utc +%s); for i in {1..20}; do sleep 1; done; stamp_end=$(date --utc +%s); echo "$(((stamp_end - stamp_start) / 20))" where you replace "sleep 1" with the call to execute this script, and it outputs the average time [in seconds] to execute. And you run it again with the change (you can s/20/10/g in there if you are concerned about waiting too long)
Attachment #8717015 - Flags: feedback?(bugspam.Callek)
Since at the moment we only have a small number of pending jobs (under 20), I opted doing some local tests using a previously saved "builds-pending.json" file, containing 4530 pending jobs and a size of 512KB. Here are the two cases: I. new version of the script: "builds-pending.json" --> local file, "allthethings.json" --> also a local file (the new script uses the latest version of "allthethings.json" file from cruncher, so it's pretty much a similar situation) II. old version of the script: "builds-pending.json" --> local file, "allthethings.json" --> specified by secure.pub.m.o link As it can be noticed, there's a significant difference between the two scripts in terms of running speed. Attached the patch that contains the changes with respect to the version that can be found on braindump and the "builds-json" file used for tests.
Attached file check_pending_builds.py (obsolete) —
Attachment #8717015 - Attachment is obsolete: true
Attached patch bug_1245823.patch (obsolete) — Splinter Review
Attachment #8717383 - Flags: feedback?(bugspam.Callek)
Attached file builds-pending.json
Attached file results_new.txt
Attached file results_old.txt
Assignee: nobody → alin.selagea
Comment on attachment 8717383 [details] [diff] [review] bug_1245823.patch Review of attachment 8717383 [details] [diff] [review]: ----------------------------------------------------------------- ::: nagios-related/check_pending_builds.py @@ +28,5 @@ > + if k in jobpool_cache: > + pending_builds[k] += 1 > + else: > + pending_builds[k] = 1 > + jobpool_cache.append(k) jobpool_cache is unused here, can save some processing time by removing it, and instead doing: if k in pending_builds: @@ +36,5 @@ > # get the builder names and their corresponding slavepools > def get_builders_and_slavepools(): > builders_and_slavepools = [] > + with open("/var/www/html/builds/allthethings.json") as response: > + result = json.loads(response.read()) using `response` here clobbers the global import, and is confusing. `as f:` is a common format. And since you're changing to a file object instead of response's library, you can switch to: `result = json.load(f)` since .load() uses a file-object and avoid the resident memory useage of .read()-->json-object @@ +78,4 @@ > count_by_slavepool = [] > for m in range(0, len(slavepools)): > count_by_slavepool.append([slavepools[m][0], 0]) > + for i,j in pending_builds.items(): use .iteritems() here (its better on mem usage) @@ +84,3 @@ > for m in range(0, len(count_by_slavepool)): > + if builders_and_slavepools[k][1] == slavepools[m][0]: > + count_by_slavepool[m][1] = count_by_slavepool[m][1] + j perhaps `count_by_slavepool[m][1]` += j would read better? (it's semantically the same, and has no perf benefit this new way)
Attachment #8717383 - Flags: feedback?(bugspam.Callek) → feedback+
Attached patch bug_1245823.patch (obsolete) — Splinter Review
Attachment #8717383 - Attachment is obsolete: true
Attachment #8718771 - Flags: review?(bugspam.Callek)
Attached file check_pending_builds.py (obsolete) —
Attachment #8717381 - Attachment is obsolete: true
Thanks for your feedback. I did the suggested changes, tested the script again (both locally and on cruncher) and worked without issues. Here's the new version of the patch and the new version of the script. If everything is OK, should we land the patch and then replace the script on cruncher?
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #12) > Thanks for your feedback. > > I did the suggested changes, tested the script again (both locally and on > cruncher) and worked without issues. Here's the new version of the patch and > the new version of the script. Reviewing now ;-) > If everything is OK, should we land the patch and then replace the script on > cruncher? Yes.
Comment on attachment 8718771 [details] [diff] [review] bug_1245823.patch Review of attachment 8718771 [details] [diff] [review]: ----------------------------------------------------------------- r+ with the following changes. ::: nagios-related/check_pending_builds.py @@ +27,5 @@ > + k = request['buildername'] > + if k in pending_builds.keys(): > + pending_builds[k] += 1 > + else: > + pending_builds[k] = 1 trailing whitespace @@ +28,5 @@ > + if k in pending_builds.keys(): > + pending_builds[k] += 1 > + else: > + pending_builds[k] = 1 > + pending_builds.update() .update() does nothing, drop it ;-) @@ +35,5 @@ > > # get the builder names and their corresponding slavepools > def get_builders_and_slavepools(): > builders_and_slavepools = [] > + with open("/var/www/html/builds/allthethings.json") as f: Hrm, actually lets make this a global variable like the all_things_url was, that will make it easier to discover if this path ever changes.
Attachment #8718771 - Flags: review?(bugspam.Callek) → review+
Okay, updated the patch.
Attachment #8718771 - Attachment is obsolete: true
Attachment #8718772 - Attachment is obsolete: true
Attachment #8718857 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: