Closed Bug 1036468 Opened 11 years ago Closed 8 years ago

reduce overhead in job setup in buildbot as well as test scripts + mozharness

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: jmaher, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2506] [capacity][good next bug])

While looking at a new talos job I was playing with on try:
https://tbpl-dev.allizom.org/php/getParsedLog.php?id=43046379&tree=Try&full=1

tbpl showed 12 minutes, but the test itself took roughly 2 minutes. Ouch, that is not a good return on investment; something must be going on.

List of times and buildbot steps:
00:03 set props basedir
00:09 rm -Rf properties
00:32 rm -Rf scripts
00:44 hg clone hg.mozilla.org/build/mozharness scripts
00:08 hg update -C -r production
00:41 hg id -i
00:45 <download to oauth.txt> ?wtf?
07:51 python -u talos_script.py ...
00:10 rm -f oauth.txt
00:11 slave lost reboot
-------------
03:23 time outside of the mozharness script
07:51 time inside the mozharness script

Let's look at what happens inside the mozharness script:
01:32 - download talos.json (timeout 30 seconds, sleep 60 seconds)
00:01 - download firefox build (win32.zip)
00:15 - rm c:/slave/talos-data\\talos
00:15 - clone talos repo
00:02 - download tp5n.zip
00:36 - unzip tp5n.zip (we could cache this - rarely changes)
00:01 - download + unzip flash.zip (we could cache this - rarely changes)
00:13 - virtualenv setup
00:01 - install pyyaml (never changes)
00:10 - install pywin32 (never changes)
00:40 - talos setup.py installation of dependencies
00:01 - unzip win32.zip firefox bundle
02:54 - talos initialization, test, upload results
00:01 - mozharness cleanup, reporting
total 6:46 (missing 1:05 here, so this isn't 100% accurate)

As you can see, there are some things that we should reconsider.

in buildbot:
* figure out how to improve downloading of oauth.txt (possibly save 20-40 seconds)
* don't remove the scripts directory and reclone, just clean it up (possible savings of 30-60 seconds)
* potential savings here could be between 0:50 and 1:40

in mozharness:
* tp5n.zip + flash.zip - keep them static or do some md5sum check to ensure we have the latest bits (we spend 0:37 here, we could save 15-20 seconds)
* install pyyaml and pywin32 on the system globally (save 10-11 seconds)
* optimize our timeout and retry for talos.json (maybe save 30-60 seconds)
* reuse the talos repo (pull latest, update to <rev>, delete files, save 10-15 seconds)
* potential savings here could be between 1:05 and 1:46

Doing all of these changes could take an 11 minute job and save between 1:55 and 3:26. Multiply that out by hundreds of thousands of jobs and we have a big win.
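To illustrate the md5sum idea for tp5n.zip and flash.zip, here is a minimal sketch of a cache check, not the actual mozharness code. The helper names (file_md5, needs_download) are hypothetical, and it assumes the expected checksum is published somewhere the job can read it (for example next to the download URL), which this bug does not confirm.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(cached_path, expected_md5):
    """True if the cached archive is missing or its md5 differs from the expected one.

    expected_md5 is assumed to come from a trusted source (hypothetical here).
    """
    cached = Path(cached_path)
    if not cached.is_file():
        return True
    return file_md5(cached) != expected_md5.lower()
```

With a check like this, the download and the 00:36 unzip could be skipped whenever the cached copy still matches; the unzipped directory would need to be kept between jobs for that to help.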
Component: General Automation → Mozharness
Slave pre-flight tasks (bug 712206) should make sure the tools/scripts repo is always checked out and as up-to-date as possible before the slave even returns to the pool.
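A rough sketch of what "update instead of reclone" could look like, whether it runs in a pre-flight task or in the job itself. The function name repo_sync_commands is hypothetical and this is not taken from buildbot or mozharness; it only builds the hg command lines (clone, pull, update -C -r) so the choice between them is easy to see.

```python
import os


def repo_sync_commands(dest, repo_url, revision):
    """Return the hg commands needed to bring dest to the given revision.

    If dest already holds a Mercurial checkout, pull into it instead of
    deleting and recloning; otherwise fall back to a fresh clone.
    """
    if os.path.isdir(os.path.join(dest, ".hg")):
        commands = [["hg", "-R", dest, "pull", repo_url]]
    else:
        commands = [["hg", "clone", repo_url, dest]]
    # update -C discards local modifications to tracked files; untracked
    # files are left alone, so a separate cleanup step may still be needed.
    commands.append(["hg", "-R", dest, "update", "-C", "-r", revision])
    return commands
```

Each returned list could be passed to subprocess.check_call. On a slave that already has the scripts directory, this replaces the rm -Rf plus clone steps with a pull and an update.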
Depends on: 712206
Whiteboard: [capacity]
Depends on: 1047207
Whiteboard: [capacity] → [capacity][good next bug]
Whiteboard: [capacity][good next bug] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2506] [capacity][good next bug]
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE