Closed Bug 677004 Opened 11 years ago Closed 9 years ago

usebuildbot=1 has too much lag between when a job finishes and when it is displayed


(Tree Management Graveyard :: TBPL, defect)

(Reporter: philor, Unassigned)



(Keywords: regression, sheriffing-untriaged)

Compared with tinderbox-based tbpl when tinderbox is four hours behind on reading mail, usebuildbot=1 looks good, but head to head, when tinderbox isn't lagged, it looks terrible.

The job I was just watching actually finished at 19:56, tinderbox claims it finished at 19:57, I starred it at 20:00 so it was visible on tinderbox-based tbpl by my 19:59 refresh, but usebuildbot didn't show it until my 20:05 refresh (though it was quick to remove the running grey-letter, making it look like the job just disappeared for 8 minutes).

We have to dump tinderbox, so we have to do what we can do, but zomg, if you proposed doing something which would do what this essentially does, add 5-10 minutes onto the time it takes to run every single build and test job, it better come with Shetland unicorn rides for everyone to make up for it.
Depends on: 681834
There are a few steps in getting data from buildbot to users:

1, insert finished jobs into status db 
cron job on each buildbot master that looks for newly finished builds, and inserts them in the db. Runs every 10 minutes, mostly takes 30s to insert (sometimes 90)

2, recreate builds-4hr.js.gz
cron job on cruncher. Runs every minute, 10-30s to generate

3, copy builds-4hr.js.gz to build.m.o
cron job on cruncher. Runs every minute, very quick to transfer a few hundred KB

4, expiry header
The Apache config on build.m.o sets an Expires header of 'access plus 1 minute' when the gz file is requested. This is a non-issue if the tbpl refresh interval is that or longer.

So that's roughly 15 minutes worst case.
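
As a rough check of that figure, the sketch below adds up each step's polling interval and its slowest reported run time. The stage list and names are illustrative, taken only from the numbers quoted above; the transfer time in step 3 is assumed to be negligible, and step 4 is modelled as a 60 s cache lifetime.

```python
# Worst-case lag through the four steps described above.
# Each stage: (name, polling interval in seconds, worst-case run time in seconds).
STAGES = [
    ("insert finished jobs into status db", 10 * 60, 90),
    ("recreate builds-4hr.js.gz", 60, 30),
    ("copy builds-4hr.js.gz to build.m.o", 60, 0),  # assumption: transfer ~0 s
    ("Expires header on build.m.o", 60, 0),         # cached copy may be 1 min old
]


def worst_case_seconds(stages):
    """Sum the worst wait for each stage: a full interval plus its run time."""
    return sum(interval + run for _name, interval, run in stages)


total = worst_case_seconds(STAGES)
print(f"{total} s = {total / 60:.1f} min")  # prints: 900 s = 15.0 min
```

The 10-minute cron on the masters accounts for two thirds of the total, which is why it is the first candidate for improvement.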

Possible improvements
* running the cron job on the masters more frequently, bug 681834 to look into this
* refactor 1 so that builds are individually added to the db on completion. Bug 662885 will provide a backend to do this
* push the file over after 2, instead of waiting for the pull to happen
Keywords: regression
* move 2 and 3 off cruncher, which maybe isn't a good place for tier 1 jobs to run

Probably with a helping hand from bug 714406, my lag last night was around 50 minutes. I had jobs finishing while builds-2011-12-31.js.gz was being created and the load on cruncher was being alerted as 10 to 12.
Whiteboard: [sheriff-want]
Here are the current steps when a job finishes:
* immediately on job finish we start uploading the log, insert the job into statusdb, and send the pulse message. This takes a few seconds
* we generate builds-running.js, builds-pending.js, and builds-4hr.js every minute, taking a few seconds. There's no rsync to move the file any more

Are you seeing delays longer than a minute or two these days? How often does tbpl import data?
(In reply to Nick Thomas [:nthomas] from comment #3)
> Are you seeing delays longer than a minute or two these days ? How often
> does tbpl import data ?

I believe the tbpl cron job for running is set to 5 mins. TBPL's client side then refreshes every 2 mins.

The situation is a bit better than it was, but there still seems to be a bit of a lag at times (though it may just be when you hit the worst case of 60-90 secs for the steps in comment 3 + 5 min tbpl cron + 2 min tbpl client side; total 8-9 mins).
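
For reference, the worst-case arithmetic in the paragraph above can be written out as below. This is only a sketch using the figures quoted in this comment (the 5 min cron interval is itself a "I believe" estimate), and the constant and function names are made up for illustration.

```python
# Worst-case lag range, using the figures quoted in the comment above.
UPSTREAM_SECONDS = (60, 90)      # buildbot-side steps from comment 3 (low, high)
TBPL_CRON_SECONDS = 5 * 60       # tbpl import cron interval (believed value)
CLIENT_REFRESH_SECONDS = 2 * 60  # tbpl client-side refresh interval


def lag_range_minutes(upstream, cron, refresh):
    """Return (low, high) worst-case lag in minutes.

    Assumes a job just misses both the cron run and the client refresh,
    so each contributes its full interval.
    """
    low, high = upstream
    return ((low + cron + refresh) / 60, (high + cron + refresh) / 60)


print(lag_range_minutes(UPSTREAM_SECONDS, TBPL_CRON_SECONDS, CLIENT_REFRESH_SECONDS))
# prints: (8.0, 8.5)
```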

I'll try to keep an eye out to see what the delays are in reality.
Maybe we should be looking at shortening up the tbpl cron, or thinking about pushing data from buildbot directly into tbpl via API.
Whiteboard: [sheriff-want]
Or just get used to the way things are.
Closed: 9 years ago
Resolution: --- → WORKSFORME
Product: Webtools → Tree Management
Product: Tree Management → Tree Management Graveyard