Closed Bug 669000 Opened 10 years ago Closed 10 years ago

Please deploy and install TBPL

Categories

(mozilla.org Graveyard :: Webdev, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mstange, Assigned: laura)

Details

(Whiteboard: [tbpl][buildduty])

Attachments

(1 file)

The security review of the all-new, buildbot-based TBPL is finished (bug 661365), so we can now go ahead and update tbpl.mozilla.org. I've just pushed all changes to http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/

From now on, we'll need a running MongoDB server and a cronjob that runs /var/www/tinderboxpushlog/dataimport/import-buildbot-data.py every 5 minutes. The python script also requires two packages that haven't been installed yet.

So in addition to the usual "hg pull -u", these things need to be done:
 - MongoDB might need to be updated. It's been 9 months since I installed it.
 - The MongoDB server ("mongod") needs to be started and run permanently in the
   background. The only argument it needs is "--bind_ip 127.0.0.1" (because
   otherwise everyone on the internet would have full access to it). It's
   possible to create a config file and put this option in there, but that's
   probably not worth it.
 - The Python packages pytz and simplejson need to be installed so that
   "python26" can use them.
 - In the tbpl directory (/var/www/tinderboxpushlog/), a folder called "cache"
   needs to be created and be made writable to all users.
 - The file /var/www/tinderboxpushlog/php/sheriff-password.php needs to be
   created with the contents "<?php define('SHERIFF_PASSWORD', 'thepassword');"
   with thepassword replaced by the real sheriff password (bug 322423).
 - The cronjob needs to be set up: "crontab -e" and add this line:
*/5 * * * * python26 /var/www/tinderboxpushlog/dataimport/import-buildbot-data.py
 - The last few days need to be imported by running
   "python26 /var/www/tinderboxpushlog/dataimport/import-buildbot-data.py -d 5"

These instructions are repeated in the tbpl readme file:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/raw-file/tip/README
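
A few of the checklist items above can be checked mechanically before the cutover. The following is only a sketch, not part of TBPL: the function names are made up for illustration, the paths are the ones given in this comment, and the writability test only covers the user running the script (not "all users"). It is written to run under Python 3; whether it runs unchanged under "python26" is untested.

```python
import os

# Paths as given in this comment; adjust if TBPL is installed elsewhere.
TBPL_DIR = "/var/www/tinderboxpushlog"
REQUIRED_MODULES = ["pytz", "simplejson", "pymongo"]


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


def preflight(tbpl_dir=TBPL_DIR, modules=REQUIRED_MODULES):
    """Return a list of problems; an empty list means every check passed."""
    problems = ["missing Python package: " + m for m in missing_modules(modules)]
    cache_dir = os.path.join(tbpl_dir, "cache")
    if not os.path.isdir(cache_dir):
        problems.append("missing directory: " + cache_dir)
    elif not os.access(cache_dir, os.W_OK):
        # Only tests the current user, not every user on the machine.
        problems.append("not writable: " + cache_dir)
    password_file = os.path.join(tbpl_dir, "php", "sheriff-password.php")
    if not os.path.isfile(password_file):
        problems.append("missing file: " + password_file)
    return problems
```

Calling preflight() on the web server and printing the returned list would show what is still left to do; it does not check that mongod is running or that the cronjob exists.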

MongoDB, MongoPHP and Python 2.6 (as python26) with pymongo are already installed. I did that in October 2010 (bug 571551 comment 46), but unfortunately I've forgotten how I did it and where they ended up. So whoever does the installation will have to figure it out again. Sorry :(

Testing & verification:

The first thing that needs to be verified is whether bug 638515 was successful in letting tbpl.mozilla.org/php/starcomment.php access the ES database. This can be done right after the hg pull -u step by checking whether
http://tbpl.mozilla.org/php/starcomment.php?tree=mozilla-central&dates[]=2011-07-02
has the same result as
http://brasstacks.mozilla.com/starcomment.php?tree=mozilla-central&dates[]=2011-07-02 .

Whether MongoDB is running and accessible by PHP can be checked by going to
http://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-central&rev=test - if the result is "[]", it worked.
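
The two HTTP checks above could be scripted roughly like this. It is a sketch under assumptions: both endpoints are assumed to return JSON, the helper names are invented here, and the code targets Python 3's urllib rather than the "python26" on the server.

```python
import json
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download a URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def same_json(body_a, body_b):
    """True if both bodies decode to equal JSON values.

    Comparing decoded values ignores whitespace and object key order,
    which a byte-for-byte comparison would not.
    """
    return json.loads(body_a) == json.loads(body_b)


def is_empty_array(body):
    """True if the body is the JSON empty array, i.e. "[]"."""
    return json.loads(body) == []
```

For the first check, pass the bodies fetched from the tbpl.mozilla.org and brasstacks.mozilla.com URLs to same_json(); for the second, pass the getRevisionBuilds.php response to is_empty_array().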

Whether the python package installation was successful can be tested by running "python26 /var/www/tinderboxpushlog/dataimport/import-buildbot-data.py" and looking for error messages.

If everything works, http://tbpl.mozilla.org/ should look the same as http://tbpl.swatinem.de/ and http://tbpl.mozilla.org/?usebuildbot=1 should look the same as http://tbpl.swatinem.de/?usebuildbot=1 .


Please let me know if there are any questions. This is not super urgent, but it should probably be done during quieter times.
If anything goes wrong, people will still be able to use http://tbpl.swatinem.de/ or http://dev.philringnalda.com/tbpl/ .
*puts on risk mitigation hat*

We should wait until after the aurora and beta merges on Tuesday, because of the rush of landings that happen before then.

(In reply to comment #0)
> So in addition to the usual "hg pull -u", these things need to be done:

Can everything in this section happen beforehand without any chance of breaking the existing tbpl ?

> MongoDB, MongoPHP and Python 2.6 (as python26) with pymongo are already
> installed. I did that in October 2010 (bug 571551 comment 46), but
> unfortunately I've forgotten how I did it and where they ended up. So
> whoever does the installation will have to figure it out again. Sorry :(

yum is configured to pull the 10gen packages of Mongo. The yum repo host changed, and so did the package naming - see http://www.mongodb.org/display/DOCS/CentOS+and+Fedora+Packages. 

Don't know where MongoPHP came from, or version. rpm doesn't know anything about it.

pymongo has an egg at
  /usr/lib/python2.6/site-packages/pymongo-1.9_-py2.6-linux-i686.egg
Upgrade to 1.11 ?

> Testing & verification:
> The first thing that needs to be verified is whether bug 638515 was
> successful in letting tbpl.mozilla.org/php/starcomment.php access the ES
> database. 

Looks like it was:
dm-tbpl01 ~]$ telnet elasticsearch1.metrics.sjc1.mozilla.com 9200
Trying 10.2.72.53...
Connected to elasticsearch1.metrics.sjc1.mozilla.com (10.2.72.53).
Escape character is '^]'.
QUIT
Connection closed by foreign host.

What else is changing in the path without usebuildbot=1 ?

Perhaps it's worth setting up a staging instance (eg using a vhost) before cutting over the main instance. Might help shake out any bugs from production use too. philor reports that starring on http://tbpl.swatinem.de/ fails for the non-buildbot version.
Yeah, I want the shiny, I want it bad, but if tbpl.swatinem.de is the staging server, then staging completely fails QA. If tbpl.swatinem.de is the backup, then it completely fails to be a backup. And if you're counting on my instance, hosted on Dreamhost, to be the dependable 24/7 tree-closes-if-it's-not-available thing, worse yet while I'm on vacation, then... don't.
(In reply to comment #1)
> *puts on risk mitigation hat*
> 
> We should wait until after the aurora and beta merges on Tuesday, because of
> the rush of landings that happen before then.

Good point.

> (In reply to comment #0)
> > So in addition to the usual "hg pull -u", these things need to be done:
> 
> Can everything in this section happen beforehand without any chance of
> breaking the existing tbpl ?

Yes, unless it completely breaks PHP or even Apache.

> > MongoDB, MongoPHP and Python 2.6 (as python26) with pymongo are already
> > installed. I did that in October 2010 (bug 571551 comment 46), but
> > unfortunately I've forgotten how I did it and where they ended up. So
> > whoever does the installation will have to figure it out again. Sorry :(
> 
> yum is configured to pull the 10gen packages of Mongo. The yum repo host
> changed, and so did the package naming - see
> http://www.mongodb.org/display/DOCS/CentOS+and+Fedora+Packages.

What does this mean for us?

> Don't know where MongoPHP came from, or version. rpm doesn't know anything
> about it.

Oh, for this one I've even jotted down notes. I think I did "sudo pecl install mongo".

> pymongo has an egg at
>   /usr/lib/python2.6/site-packages/pymongo-1.9_-py2.6-linux-i686.egg
> Upgrade to 1.11 ?

Can't hurt.

> > Testing & verification:
> > The first thing that needs to be verified is whether bug 638515 was
> > successful in letting tbpl.mozilla.org/php/starcomment.php access the ES
> > database. 
> 
> Looks like it was:

Good!

> What else is changing in the path without usebuildbot=1 ?

Only starring. Stars come from starcomment.php instead of from Tinderbox, and are sent to both starcomment.php and submitBuildStar.php (which inserts it into the MongoDB).

> Perhaps it's worth setting up a staging instance (eg using a vhost) before
> cutting over the main instance. Might help shake out any bugs from
> production use too.

Sure, if that's an option, why not.

> philor reports that starring on http://tbpl.swatinem.de/
> fails for the non-buildbot version.

Uh oh. I'll look into it tomorrow.

(In reply to comment #2)
> Yeah, I want the shiny, I want it bad, but if tbpl.swatinem.de is the
> staging server, then staging completely fails QA.

Anything else aside from starring (which is bad enough, I admit)?

> And if you're counting on
> my instance, hosted on Dreamhost, to be the dependable 24/7
> tree-closes-if-it's-not-available thing, worse yet while I'm on vacation,
> then... don't.

Sorry :(
I can set up another fallback at tests.themasta.com again.
(In reply to comment #3)
> > yum is configured to pull the 10gen packages of Mongo. The yum repo host
> > changed, and so did the package naming - see
> > http://www.mongodb.org/display/DOCS/CentOS+and+Fedora+Packages.
> 
> What does this mean for us?

Ah, just that we need to uninstall Mongo, update the yum repo config and reinstall Mongo as mongo-10gen (or mongo-10gen-server?).
(In reply to comment #3)
> Anything else aside from starring (which is bad enough, I admit)?

Storing summaries fails for non-usebuildbot=1, like there isn't a summaries/ or it isn't writable. Retriggering jobs doesn't work, usebuildbot=1 or not. Probably the fault of the data, but still a regression: http://tbpl.swatinem.de/?usebuildbot=1&noignore=1&rev=0c02168c83a6 doesn't show the hidden Valgrind build that non-usebuildbot=1 does.
Looks like tbpl.swatinem.de's at 121f67e06845, so it would need to pull adding fx-team and putting tp5 in compare-talos to serve as a backup.
I pulled the latest code.

(In reply to comment #5)
> Storing summaries fails for non-usebuildbot=1, like there isn't a summaries/
> or it isn't writable.
Fixed.


> Retriggering jobs doesn't work, usebuildbot=1 or not.
Hm, need help investigating.

> Probably the fault of the data, but still a regression:
> http://tbpl.swatinem.de/?usebuildbot=1&noignore=1&rev=0c02168c83a6 doesn't
> show the hidden Valgrind build that non-usebuildbot=1 does.
Yes, that is probably the data.

Also tbpl.mozilla.org/php/starcomment.php is currently a 404 so it completely breaks the tinderbox version.
(In reply to comment #7)
> > Retriggering jobs doesn't work, usebuildbot=1 or not.
> Hm, need help investigating.

I added http://tbpl.swatinem.de to the list of hosts that can pass auth details to buildapi.
Depends on: 669062
(In reply to comment #3)
> > What else is changing in the path without usebuildbot=1 ?
> 
> Only starring. Stars come from starcomment.php instead of from Tinderbox,
> and are sent to both starcomment.php and submitBuildStar.php (which inserts
> it into the MongoDB).

Oh, there's another change in non-usebuildbot=1 mode: We access getHiddenBuilders.php in order to filter the pending and running builds. If the request to getHiddenBuilders.php fails, for example due to MongoDB not running, we show a persistent error message, but TBPL should still work normally.

> > philor reports that starring on http://tbpl.swatinem.de/
> > fails for the non-buildbot version.
> 
> Uh oh. I'll look into it tomorrow.

Patch is up in bug 669062. What would I do without philor's QA...

(In reply to comment #7)
> Also tbpl.mozilla.org/php/starcomment.php is currently a 404 so it
> completely breaks the tinderbox version.

Right. You'll need to revert bug 668992 on tbpl.swatinem.de until tbpl.mozilla.org has been deployed.
Depends on: 669072
(In reply to comment #9)
> (In reply to comment #7)
> > Also tbpl.mozilla.org/php/starcomment.php is currently a 404 so it
> > completely breaks the tinderbox version.
> 
> Right. You'll need to revert bug 668992 on tbpl.swatinem.de until
> tbpl.mozilla.org has been deployed.

Patched this in bug 669072:
It now also works when starcomment.php returns an error JSON, as it does on my server because it can't access ES. We show the error message but still display the results.
http://tbpl.swatinem.de/?usebuildbot=1&rev=b7f03b37cf0c doesn't have a Linux64 opt build, http://tbpl.swatinem.de/?usebuildbot=1&tree=TraceMonkey&rev=d8e967b8afc8 doesn't have the OS X64 opt build or nightly, only the shark nightly.
And since trying to star on tbpl.swatinem.de now gives me a summary of "Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20 bytes) in /var/www/tbpl/php/inc/GzipUtils.php on line 27", we can't use that as a backup any more than we can use mine. We either need to roll this out as tbpl2.m.o, or move tbpl.m.o to tbpl-classic.m.o. I'd be inclined toward starting by installing a tip tbpl at tbpl-staging.m.o, then once it's actually working, install a pull -r 057622a238f7 at tbpl-classic.m.o, verify that's working, then during a downtime either move tbpl-staging or install a fresh tip (not sure whether moving, so you can have hidden builds set correctly and have the tip of twenty trees already starred, or installing fresh and having to do all that live, is easier).
(In reply to comment #11)
> http://tbpl.swatinem.de/?usebuildbot=1&rev=b7f03b37cf0c doesn't have a
> Linux64 opt build,
> http://tbpl.swatinem.de/?usebuildbot=1&tree=TraceMonkey&rev=d8e967b8afc8
> doesn't have the OS X64 opt build or nightly, only the shark nightly.

Great, those are exactly the kind of bugs that we need to catch before we can use the buildbot backend by default.
(In reply to comment #12)
> And since trying to star on tbpl.swatinem.de now gives me a summary of
> "Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to
> allocate 20 bytes) in /var/www/tbpl/php/inc/GzipUtils.php on line 27"

On which build? Starring works for me (I tested on an old try push of mine).

I've set up http://tests.themasta.com/tbpl-classic/ as a fallback, just in case.

> (not sure whether moving, so you can
> have hidden builds set correctly and have the tip of twenty trees already
> starred, or installing fresh and having to do all that live, is easier).

If we keep all instances on the same server and only use vhosts to serve e.g. /var/www/tbpl-staging/ to tbpl-staging.m.o, we'll already have all the data because it's in the same MongoDB.
(In reply to comment #12)
> And since trying to star on tbpl.swatinem.de now gives me a summary of
> "Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to
> allocate 20 bytes) in /var/www/tbpl/php/inc/GzipUtils.php on line 27"

Hm, I didn’t want to mess with the memory limit, but log parsing really does use a lot so it may fail occasionally on my server.
(In reply to comment #11)
> http://tbpl.swatinem.de/?usebuildbot=1&rev=b7f03b37cf0c doesn't have a
> Linux64 opt build,
> http://tbpl.swatinem.de/?usebuildbot=1&tree=TraceMonkey&rev=d8e967b8afc8
> doesn't have the OS X64 opt build or nightly, only the shark nightly.

Congratulations, you've found bug 669137.
Are we still blocked on the starring issue (or something else) here? It's unclear to me from the above comments.
Priority: -- → P3
Whiteboard: [tbpl][buildduty]
The starring issue has been addressed. This work can go ahead.

We only need to decide whether we want to do any of the mitigation measures philor suggested in comment 12.
I've asked nthomas to coordinate work on this.

IT would like an architectural diagram describing the different components that need to be set up, and any external dependencies.
If we don't want to do any intermediate staging server mitigation, the list in comment 0 describes all that needs to be set up.

I'll draw the diagram later today.
We would like to put tbpl on a firmer production footing now that it has become an important developer tool. I propose we have two tbpl machines/VMs, using one as staging and the other as production, and switching DNS when we want to deploy major changes like this one.

RelOps have agreed to maintain the 'hardware' and software stack on these two machines (excluding tbpl of course), much like they or general IT do for other developer systems. Part of that is setting the machine configuration in puppet, which requires documentation of what the requirements are, how those should be set up, and any dependencies on other systems. I think that's a good thing to do in general. Amy can comment on if 
 hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/default/README
is enough. I'm happy to add any information needed for files generated by buildapi that tbpl consumes.

Once we have a staging instance up we can verify all is working, set the hidden builders appropriately and so on, then cut over.
I've added a production named branch to the http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog repo so that small fixes can be deployed live while the bulk of this work continues. When it's time this large chunk can be merged in from default branch.
Once the diagram is ready, Dustin or I will work up a list of todo items for RelOps and file bugs as needed. We will need to coordinate with NetOps and IT to get the pieces in place for tbpl to be managed.

Please assign to me or Dustin when the diagram is ready.
The diagram is here (sorry about the arrows):
https://wiki.mozilla.org/Tinderboxpushlog/ArchitectureAndDependencies

Is that what you had in mind?
Assignee: nobody → bear
Blocks: 625979
The wiggly arrows add character.  :}


Bear is going to be doing the majority of the work to translate this to operational requirements, but I have some initial questions about the document (keep in mind that I'm coming from the systems side, and do not understand how TBPL functions internally). 

1) Where are the PHP scripts stored on tbpl? Is there a description of what each does, when it's called, and what the possible success/failure end states are?  Full paths to files and descriptions of scripts/cron jobs are great, since they will help people find and debug things quickly in case of an emergency.
2) What user does the cronjob run as, and how frequently?  Is there a description of what the cron job is designed to do?
3) What are the underlying protocols and ports necessary for each component (we need to know what sort of firewall restrictions to accommodate)?
4) Are the versions of software listed minimum requirements or absolute requirements?
5) Is there a discussion of expected load?  Does the system scale linearly?  Logarithmically? Other?
6) Which systems in this diagram are single points of failure?
7) What is the expected lifecycle of this system (this may be documented somewhere else in the project plan for this system)?
(In reply to comment #25)
> 5) Is there a discussion of expected load?  Does the system scale linearly? 
> Logarithmically? Other?

Related to this, could we do a rough calculation of how disk requirements scale over time ? Are we planning to expire data from the MongoDB ?
(In reply to comment #26)
> (In reply to comment #25)
> > 5) Is there a discussion of expected load?  Does the system scale linearly? 
> > Logarithmically? Other?
> 
> Related to this, could we do a rough calculation of how disk requirements
> scale over time ? Are we planning to expire data from the MongoDB ?

Expiring data in MongoDB would be a good idea; Markus already mentioned it to me in a personal discussion.
As for the disk space: MongoDB grows by doubling the chunk size each time a chunk fills up.
On my server the largest chunk is 512M, so the whole DB is in the range of 1G, and that is with ~670K runs. It hasn't been cleared since I initially set it up at the end of May, so that's roughly 2 months' worth of data.
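
As a back-of-the-envelope version of that estimate, assuming growth is linear in the number of runs (the doubling allocation means the on-disk size overstates the actual data, so this is an upper-ish bound, not a measurement):

```python
# Figures quoted above: about 1 GiB on disk for ~670K runs over ~2 months.
DB_BYTES = 1 * 1024 ** 3
RUNS = 670000
MONTHS = 2.0

bytes_per_run = DB_BYTES / float(RUNS)
runs_per_month = RUNS / MONTHS
# Linear extrapolation to one year, expressed in GiB.
gib_per_year = bytes_per_run * runs_per_month * 12 / 1024 ** 3

print("%.0f bytes/run, %.0f runs/month, %.1f GiB/year"
      % (bytes_per_run, runs_per_month, gib_per_year))
```

That works out to roughly 1.6 KB per run and about 6 GiB per year if nothing is ever expired.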
(In reply to comment #25)
> 1) Where are the PHP scripts stored on tbpl?

They're all in the /php/ directory [1], except for /leak-analysis/index.php, but that one will also move to /php/ soon (bug 658543).

> Is there a description of what
> each does, when it's called, and what the possible success/failure end
> states are?

Nope. Where would you prefer such documentation to reside? Just as another wiki page?

> 2) What user does the cronjob run as, and how frequently?

The readme file [2] currently suggests running it every 5 minutes, but in the meantime bug 601740 has changed things so that the source data is now updated every minute. So it would make sense to use that same frequency on our side, too.

The user that the cronjob runs as probably doesn't matter. The script [3] only pulls files from the internet and communicates with the local MongoDB - all users should be allowed to do that.
What users can I choose from?

> Is there a
> description of what the cron job is designed to do?

In the comment at the beginning of the file [3].

> 3) What are the underlying protocols and ports necessary for each component
> (we need to know what sort of firewall restrictions to accommodate)?

I'll check and update the wiki page.

> 4) Are the versions of software listed minimum requirements or absolute
> requirements?

Minimum. I haven't tested with newer versions but I don't foresee any incompatibilities.

> 5) Is there a discussion of expected load?  Does the system scale linearly? 
> Logarithmically? Other?

I haven't really thought about this yet, but I think it's mostly linear in the number of users plus linear in the number of builds. (Every colored letter corresponds to one build.)

Log parsing is definitely the most expensive operation. It happens every time a user selects a build or views a build log, except if the result is already cached.

> 6) Which systems in this diagram are single points of failure?

All of them, I think, though with different grades of severity. For example, if access to api-dev.bugzilla.mozilla.org doesn't work, most of TBPL is still usable but starring builds won't leave a comment on the starred bug.

> 7) What is the expected lifecycle of this system (this may be documented
> somewhere else in the project plan for this system)?

I don't know of a project plan :)
The plan is basically to have it stay online indefinitely and perpetually tweak it (e.g. fix bugs, add functionality). Changes shouldn't accumulate too much before they're deployed.
(Did that answer the question or did you mean something else?)

[1] http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/f3dfc39d8103/php
[2] http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/f3dfc39d8103/README
[3] http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/f3dfc39d8103/dataimport/import-buildbot-data.py
[4] https://bugzilla.mozilla.org/buglist.cgi?quicksearch=OPEN%20component%3ATinderboxpushlog

(In reply to comment #26)
> Related to this, could we do a rough calculation of how disk requirements
> scale over time ? Are we planning to expire data from the MongoDB ?

Expiring cached logs (and their HTML-ified versions) is probably more important than expiring the MongoDB data. I haven't implemented any expiration measures yet because I wanted to see how fast things grow in the wild first.
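
No expiration exists in TBPL at this point, so the following is purely an illustration of what age-based cache expiry could look like. The function name and the idea of keying on file modification time are assumptions, not project code.

```python
import os
import time


def expire_cache(cache_dir, max_age_days, now=None):
    """Delete regular files under cache_dir not modified for max_age_days.

    Returns the list of removed paths. Directories are left in place.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```

Something like expire_cache("/var/www/tinderboxpushlog/cache", 30) could run from a daily cronjob; the 30-day cutoff is an arbitrary placeholder until real growth numbers are available.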
(In reply to comment #28)
> Expiring cached logs (and their HTML-ified versions) is probably more
> important than expiring the MongoDB data. I haven't implemented any
> expiration measures yet because I wanted to see how fast things grow in the
> wild first.

Just for reference:
du -h cache/
856K    cache/annotatedsummary
14M     cache/parsedlog
222M    cache/rawlog
2,8M    cache/excerpt
239M    cache/

Do we need rawlog at all, it’s just a copy of the log on the ftp, right?
But these numbers do not reflect reality at all, since the logs are only fetched/parsed when they are requested. And since I don’t expect anyone to really use my instance, it only contains a really tiny subset of logs.
(In reply to comment #28)
> (In reply to comment #25)
> > 3) What are the underlying protocols and ports necessary for each component
> > (we need to know what sort of firewall restrictions to accommodate)?
> 
> I'll check and update the wiki page.

I've added the ports. Most of it is via HTTP, Bugzilla is via HTTPS, ftp.mozilla.org via FTP and elasticsearch1.metrics.sjc1.mozilla.com via port 9200. The MongoDB port is configurable but defaults to 27017.
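
For whoever sets up the firewall rules, a connectivity check along the lines of the earlier telnet test could look like this. It is a sketch: the hosts are the ones named in this bug, ports 21 and 443 are the FTP and HTTPS defaults (assumed, not stated above), and the list is not claimed to be complete.

```python
import socket

# Hosts named in this bug. Ports 21 (FTP) and 443 (HTTPS) are protocol
# defaults assumed here; 9200 and 27017 are the ports stated above.
ENDPOINTS = [
    ("127.0.0.1", 27017),                               # local MongoDB
    ("elasticsearch1.metrics.sjc1.mozilla.com", 9200),  # ES database
    ("ftp.mozilla.org", 21),                            # build logs
    ("api-dev.bugzilla.mozilla.org", 443),              # Bugzilla API
]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True
```

Looping over ENDPOINTS on the TBPL host and printing can_connect(host, port) for each entry shows which flows are still blocked. It only proves that the TCP port is reachable, not that the service behind it answers correctly.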

(In reply to comment #29)
> Do we need rawlog at all, it’s just a copy of the log on the ftp, right?

Right. If we don't cache it, we'll download it at least twice simultaneously while generating the TinderboxPrint and annotated summary excerpts.
arr - is there anything you need cleared up or enhanced?
One of my main concerns here is making sure that tbpl does not become another spof.  To this end, I want to make sure we use load balanced web services and clustered databases instead of one standalone machine (or two, one for the database and one for the web front end).  This means building redundant infrastructure that operations will support in the long run.

In talking with ops, we do not have any production support infrastructure for nosql at the moment, but the direction things seem to be heading in is elasticsearch (in fact, it's one of the elements in your dependency diagram).  Is there a design reason why mongodb was chosen over using elasticsearch for tbpl itself?  I ask because standardizing on technology is a goal for creating systems that can be supported operationally.
(In reply to comment #32)
> One of my main concerns here is making sure that tbpl does not become
> another spof.

Why "another"?

> In talking with ops, we do not have any production support infrastructure
> for nosql at the moment, but the direction things seem to be heading in is
> elasticsearch (in fact, it's one of the elements in your dependency
> diagram).  Is there a design reason why mongodb was chosen over using
> elasticsearch for tbpl itself?

No, no design reason. I simply didn't know about elasticsearch back then.
From a quick look it looks similar enough to MongoDB that converting TBPL to elasticsearch shouldn't be too complicated.

Should we consider such a conversion?
"another" because typically, and certainly this morning, when releng and relops come to look at this bug, they are coming straight from having looked at nagios alerts about how Tinderbox's mail queue is at 4000 messages and 5 hours old, because it has not just a single point but a single thread of failure for reading mail.

The expedient thing to do would be to deploy tbpl-staging as a P1 blocker and see how it fares with the staging load, because the state of Tinderbox is seriously delaying Gecko development, but multi-hour tree closures because Tinderbox is totally broken don't instill the same sense of urgency in everyone.
Blocks: 659724
Either way, bug 659724 or bug 669000, somebody's a blocker.
Severity: normal → blocker
Depends on: 676772
Attached file Setup notes
I've gone ahead and set up a staging instance because the current situation is untenable to developers. Attached are the notes. You can reach that staging instance at 
   http://tbpl.allizom.org/?usebuildbot=1

We can't flip that live until the hidden builder state has been brought over from tinderbox to all the trees (which I'm going to start doing but will be leaning heavily on philor and others to complete), and we know it will hold up to some load.
On mozilla-central, mozilla-inbound, and try I hid anything that matched these filters: jetpack, bundle

m-c only: l10n, Win x64, WIN x86-64, xulrunner
try only: Rev3 WINNT 6.1 try opt test reftest-no-accel

Notable omissions: m-aurora, m-beta, m-1.9.2, rest of world
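
Written out as data, the filters listed above look like this. The matching rule is an assumption: a case-sensitive substring test against the builder name, which may not be how the hiding was actually applied, and the builder names used with it are up to the caller.

```python
# Filters as listed in this comment. Matching is assumed to be a
# case-sensitive substring test; the real rule may differ.
COMMON_FILTERS = ["jetpack", "bundle"]
COMMON_TREES = ["mozilla-central", "mozilla-inbound", "try"]
TREE_FILTERS = {
    "mozilla-central": ["l10n", "Win x64", "WIN x86-64", "xulrunner"],
    "try": ["Rev3 WINNT 6.1 try opt test reftest-no-accel"],
}


def is_hidden(tree, builder_name):
    """True if builder_name matches any filter that applies to tree."""
    filters = list(TREE_FILTERS.get(tree, []))
    if tree in COMMON_TREES:
        filters += COMMON_FILTERS
    return any(f in builder_name for f in filters)
```

Trees that are not listed (m-aurora, m-beta, m-1.9.2 and the rest) have no filters here, so is_hidden() returns False for every builder on them.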
Awesome!

(In reply to Nick Thomas [:nthomas] from comment #8)
> I added http://tbpl.swatinem.de to the list of hosts that can pass auth
> details to buildapi.

Can you do that for tbpl.allizom.org, too, please?
(In reply to Markus Stange from comment #38)
> Can you do that for tbpl.allizom.org, too, please?
Done, could you verify ?

Is there a way we can avoid restarring jobs ? At the moment the staging instance isn't seeing stars that were created using the prod one (using tinderbox), and vice versa.
If we need to push to production, I have already created
 /var/www/tinderboxpushlog/cache
 /var/www/tinderboxpushlog/php/sheriff-password.php

We'd need to merge default -> production in the tbpl repo, and update /var/www/tinderboxpushlog/ to the new tip of production. Assuming all goes well I imagine we'd want to make usebuildbot=1 the default.
(In reply to Nick Thomas [:nthomas] from comment #39)
> (In reply to Markus Stange from comment #38)
> > Can you do that for tbpl.allizom.org, too, please?
> Done, could you verify ?

Verified. It works in Tinderbox mode. Bug 676806 will fix the usebuildbot=1 mode.

> Is there a way we can avoid restarring jobs ? At the moment the staging
> instance isn't seeing stars that were created using the prod one (using
> tinderbox), and vice versa.

No, not at the moment. The original plan was that people use the staging instance in Tinderbox mode for a while, which will show all the old stars, and will also insert any new stars into MongoDB. Those would then show up in usebuildbot=1 mode.

I could write a script that does the migration, but I don't have time for that today. Maybe on Sunday or Monday.
The problem is that if we don't replace old with new, we may end up with people starring in the old tbpl and people starring in the new one with double reports in the bugs and double work for people actively looking at trees.
I don't think having old data on old stars is really blocking, we may just re-star the first 2 or 3 changesets and would be fine.
To reiterate arr's concerns in comment 32, I am very concerned about building out new Single Points of Failure.

I'd like to know the redundancy story for the new tbpl. The best answer to this question is "here's the design doc", but some specific questions spring immediately to mind:

Is the webhead stateful, or is all the state in the DB?
Is the importer idempotent?
Are there any downstream dependencies on this service besides human developers? (Put another way if this falls over, what portion of the overall RelEng infra grinds to a halt immediately?)
(In reply to Zandr Milewski [:zandr] from comment #43)
> (Put another way if this falls over, what portion of the overall
> RelEng infra grinds to a halt immediately?)

None. RelEng infra doesn't consume any data from this. We publish data to some well defined end-points, and this new TBPL is one consumer of that data.
(In reply to Zandr Milewski [:zandr] from comment #43)
> I'd like to know the redundancy story for the new tbpl. The best answer to
> this question is "here's the design doc",

Somebody else will have to create a redundancy design and design doc, I'm out of my element here.

> Is the webhead stateful, or is all the state in the DB?

All the state is in the DB. We store files for caching but they don't store state.

> Is the importer idempotent?

Yes.

> Are there any downstream dependencies on this service besides human
> developers?

No.
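
To spell out what "idempotent" means for the importer: running it twice over the same source data must leave the database in the same state as running it once. A toy model of that property, using a dict in place of MongoDB and a hypothetical "buildid" key (the real importer's schema is not shown in this bug):

```python
def import_runs(store, runs):
    """Upsert each run into store, keyed by its build id.

    Re-importing the same runs overwrites identical entries instead of
    appending duplicates, so repeating an import leaves store unchanged.
    """
    for run in runs:
        store[run["buildid"]] = dict(run)
    return store


runs = [
    {"buildid": "a1", "result": "success"},
    {"buildid": "b2", "result": "testfailed"},
]
store = import_runs({}, runs)
snapshot = dict(store)
import_runs(store, runs)  # second pass over the same data
print(store == snapshot)
```

With a database the same effect comes from an upsert keyed on a unique identifier rather than a plain insert. This matters operationally because overlapping cron runs or a second importer on another webhead would then be harmless.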
(In reply to Markus Stange from comment #41)
> No, not at the moment. The original plan was that people use the staging
> instance in Tinderbox mode for a while, which will show all the old stars,
> and will also insert any new stars into MongoDB.

But neither of the two modes (with or without usebuildbot) will post stars that the old tbpl can show, right?  I tried starring in the new tbpl in both modes, and nothing shows in the old tbpl.  At least on the allizom.org instance.
https://bugzilla.mozilla.org/show_bug.cgi?id=617328#c44 is a comment from allizom in non-buildbot mode, with the URL "undefined", which might also explain the failure to tinderbox-comment in that mode.

The other sticky point for a nice long staging is that since buildbot mode properly comments on bugs with its own URL rather than a tinderbox URL, and properly uses tbpl.mozilla.org rather than the host where it's installed, every time you star on allizom you put a 404 log URL into the bug.
(In reply to Marco Bonardo [:mak] (Away 8-14 Aug) from comment #46)
> (In reply to Markus Stange from comment #41)
> > No, not at the moment. The original plan was that people use the staging
> > instance in Tinderbox mode for a while, which will show all the old stars,
> > and will also insert any new stars into MongoDB.
> 
> But neither of the two modes (with or without usebuildbot) will post stars
> that the old tbpl can show, right?

Oh, right. That's easy to fix, though, I've attached a patch in bug 676835.
(In reply to Phil Ringnalda (:philor) from comment #47)
> https://bugzilla.mozilla.org/show_bug.cgi?id=617328#c44 is a comment from
> allizom in non-buildbot mode, with the URL "undefined"

Thanks, bug 676837

> The other sticky point for a nice long staging is that since buildbot mode
> properly comments on bugs with its own URL rather than a tinderbox URL, and
> properly uses tbpl.mozilla.org rather than the host where it's installed,
> every time you star on allizom you put a 404 log URL into the bug.

This can be changed in Config.js in absoluteBaseURL. Not sure it's worth it, though.
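
To make the effect of that setting concrete, here is a rough sketch in Python rather than the real Config.js. The constant names, the `log` path, and its query parameters are all hypothetical; only the two hostnames come from this bug. It shows why a hard-coded production base URL yields broken links when starring from staging, and how passing the base URL in avoids that.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical settings standing in for absoluteBaseURL in Config.js.
PRODUCTION_BASE_URL = "http://tbpl.mozilla.org/"
STAGING_BASE_URL = "http://tbpl.allizom.org/"


def log_url(absolute_base_url, tree, run_id):
    """Build the log link that a star comment would carry."""
    # The "log" path and the query parameter names are illustrative only,
    # not TBPL's real endpoint.
    query = urlencode({"tree": tree, "id": run_id})
    return urljoin(absolute_base_url, "log") + "?" + query
```

With the staging base URL passed in, the link points at the host that actually has the log instead of at a production host that does not have it yet.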
(In reply to Markus Stange from comment #45)
> (In reply to Zandr Milewski [:zandr] from comment #43)
> > I'd like to know the redundancy story for the new tbpl. The best answer to
> > this question is "here's the design doc",
> 
> Somebody else will have to create a redundancy design and design doc, I'm
> out of my element here.

I was actually hoping for a design doc for the tbpl application/service.

> All the state is in the DB. We store files for caching but they don't store
> state.

So we can load balance without worrying about session persistence? Excellent.

> > Is the importer idempotent?
> 
> Yes.

OK, so this isn't such a bad state. Assuming the current staging environment doesn't show any obvious functional or massive scaling issues, we should get it deployed with multiple webheads behind a load balancer, and running against an off-box ES cluster built from ops' existing puppet configs.
Got agreement in principle to stand up hardware for this in Phoenix. This means that the buildapi calls would be going over a WAN link, but since I understand that to be an HTTP API, it shouldn't be a big deal.
Depends on: 676879
Can we update tbpl.allizom for bug 676837 and bug 676835?
Updated from ae3714489a04+ to 424557cf9015+ (which also includes bug 676806). The local diff is:

diff -r ae3714489a04 js/Config.js
--- a/js/Config.js	Fri Aug 05 22:41:32 2011 +1200
+++ b/js/Config.js	Thu Aug 11 15:04:58 2011 -0700
@@ -19,7 +19,7 @@
   selfServeAPIBaseURL: "https://build.mozilla.org/buildapi/self-serve",
   alternateTinderboxPushlogURL: "http://build.mozillamessaging.com/tinderboxpushlog/?tree=",
   alternateTinderboxPushlogName: "Mozilla Messaging",
-  wooBugURL: "http://tbpl.mozilla.org/php/starcomment.php", // war-on-orange database
+  wooBugURL: "http://tbpl.allizom.org/php/starcomment.php", // war-on-orange database
   // treeInfo gives details about the trees and repositories. There are various
   // items that can be specified:
   //
Depends on: 678685
Updated from 424557cf9015+ --> 5f08b66fdc3c+ to pick up bug 678688, per philor's request on IRC.
On tbpl.allizom that is, same for comment #53.
per discussions with bear+laura, reassigning to WebDev
Assignee: bear → nobody
Component: Release Engineering → Webdev
QA Contact: release → webdev
So, does this unassigned P3 blocker have a plan?

Currently, the poor slobs who use tbpl.mozilla.org are getting their results 81 minutes delayed (a great day for Tinderbox, didn't go over 90 minutes behind!), while I use the buildbot flavor of tbpl.allizom.org and fill up Bugzilla with literally hundreds of broken links per day to logs which the code thought would be on tbpl.mozilla.org in a few days, rather than a few months.

If it's going to be a month or two to free someone up and assign to them, and then several months for them to learn the code and totally rewrite it in an IT-pleasing way, we'll need to make more sturdy plans for our workarounds.
There is a plan, and I'm working on it, just hadn't taken the bug.  Here's the plan:
1.  rewrite code to work on generic cluster.  This will be faster than getting infrastructure in place for Mongo, according to discussions with IT.
2. stage it, go through infra + qa
3. deploy, ideally in parallel with the existing code for now.

I expect this will take at least a couple of weeks.

In addition, we have assessed the possibility of multithreading tinderbox. I've had a couple of estimates on that which indicate it's not a huge amount of work. (Of course, multithreading, and in Perl to boot, is always more complicated than you think it's going to be.) My intention here is to kick this work off in parallel, assuming I can get time from an appropriate volunteer.
Assignee: nobody → laura
(In reply to Laura Thomson :laura from comment #58)
> 1.  rewrite code to work on generic cluster.  This will be faster than
> getting infrastructure in place for Mongo, according to discussions with IT.

Rewrite what part of the code to use what instead? Mongo -> ES or is there anything else?
Mongo-> something that we already have in prod.  Still assessing what.  The list of prod datastores available on the generic cluster is MySQL, Redis, memcache, and ES in about a month's time. So non-ES would be faster if possible.

Also looking at getting data from the json feeds rather than direct from buildbot, as this saves having to stand up a buildbot DB slave in PHX.

glob filed https://bugzilla.mozilla.org/show_bug.cgi?id=681680 for the improvement to tinderbox, which he's going to work on.  (Thanks glob!)
(In reply to Laura Thomson :laura from comment #60)
> Mongo-> something that we already have in prod.  Still assessing what.  The
> list of prod datastores available on the generic cluster is MySQL, Redis,
> memcache, and ES in about a month's time. So non-ES would be faster if
> possible.

Who is deciding that?

> Also looking at getting data from the json feeds rather than direct from
> buildbot, as this saves having to stand up a buildbot DB slave in PHX.

We currently are getting the data from the json feeds. Direct access to buildbot (or at least something else that can be more realtime than json feeds) is something I have been wanting for a long time but that never happened.


I know this is hard for everybody involved, as in getting an outside-contributor project deployed. But as one of those contributors I must honestly say that I feel kind of “out of the loop” on the recent happenings.
Hey Arpad,

I did not know you were one of the contributors - thank you for the hard work you have already done on this.  This new version should make life a lot easier.  

(In reply to Arpad Borsos (Swatinem) from comment #61)
> (In reply to Laura Thomson :laura from comment #60)
> > Mongo-> something that we already have in prod.  Still assessing what.  The
> > list of prod datastores available on the generic cluster is MySQL, Redis,
> > memcache, and ES in about a month's time. So non-ES would be faster if
> > possible.
> 
> Who is deciding that?

If you have a preference or rationale for one over the other, please let me know.
The data I have from IT is:
- ~2 months or more to order hardware for Mongo and get it into prod, monitored, tuned etc.  (This is a skillset IT doesn't have right now.)
- ~1 month until ES is available in prod
- MySQL, Redis, memcache are all available immediately.

> 
> > Also looking at getting data from the json feeds rather than direct from
> > buildbot, as this saves having to stand up a buildbot DB slave in PHX.
> 
> We currently are getting the data from the json feeds. 

Didn't realize that - literally just got into this in any detail today.  Good news!

> Direct access to
> buildbot (or at least something else that can be more realtime than json
> feeds) is something I have been wanting for a long time but that never
> happened.
> 

We can do this medium term for sure, once we get into production in the PHX datacenter.  We should do it anyway, so we have a hot spare for buildbot db.  

> 
> I know this is hard for everybody involved, as in getting an
> outside-contributor project deployed. But as one of those contributors I
> must honestly say that I feel kind of “out of the loop” on the recent
> happenings.

I'd love to get your help in getting this live - let's get together on IRC and work through it.  The main goal is to make the new tbpl live ASAP, because tinderbox is, as everyone knows, a mess right now.
(In reply to Laura Thomson :laura from comment #62)
> If you have a preference or rationale for one over the other, please let me
> know.
> The data I have from IT is:
> - ~2 months or more to order hardware for Mongo and get it into prod,
> monitored, tuned etc.  (This is a skillset IT doesn't have right now.)
> - ~1 month until ES is available in prod
> - MySQL, Redis, memcache are all available immediately.

Thanks for the heads up.
We already cross-post comments to the ES database at http://elasticsearch1.metrics.sjc1.mozilla.com:9200; I thought that was officially supported.

I was discussing this duplication with Markus some time ago but we didn’t think it was a good short-term solution. We are currently cross-posting comments to three different locations, which already causes us some trouble (bug 678887).
I’m all for cutting down the duplication and minimizing the resulting mess.
Using our own database in the first place was just a way to work around not having direct access to the build-db, and because it was the easiest way to go, for us :-)

So my question here is what the long-term solution should be.

> > We currently are getting the data from the json feeds. 
> 
> Didn't realize that - literally just got into this in any detail today. 
> Good news!
> 
> > Direct access to
> > buildbot (or at least something else that can be more realtime than json
> > feeds) is something I have been wanting for a long time but that never
> > happened.
> > 
> 
> We can do this medium term for sure, once we get into production in the PHX
> datacenter.  We should do it anyway, so we have a hot spare for buildbot db.

The thing I want to achieve is getting the lowest latencies possible. Unfortunately, we are currently lagging behind there (bug 677004), although I think it is still not as bad as tinderbox sometimes.


So I’m also happy to see things finally moving along, but I wish there was more of a coordinated movement.
Periodically polling the periodically generated json and duplicating that data into our own database is not the way forward, and neither is the duplication of the comments into 3 different dbs right now.
So what is the best way to achieve our goals? Discussing all this in IRC is a good idea. What is the best time for this? (Considering Markus and I are on German time.)
I'll email to set up a time where we can work this out, faster than bugzilla comments.
Meeting today between myself, Arpad, mstange, philor and rhelmer yielded the following plan:
- Port the app to use MySQL.  (We'll still need the link to the existing ES cluster in SJC.)  ETA: next Thursday, September 1.  rhelmer will work on the python parts, and Arpad on the PHP parts.
- Stage new version
- Get any needed infrasec/QA on staging - need to gather resources here, I'll report back with a timeline.
- Deploy, sometime in the week starting September 5 at the very earliest, likely the week after that I suspect.

Discussions are taking place in #tbpl if you want to help.

Once this is deployed, we will work on a long term road map for the product.
Depends on: 682059
Depends on: 682914
Depends on: 683241
per bug 683241 this is done:

https://tbpl.mozilla.org

cheers!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
There are a few regressions here and there, but it's up.  Those have their own bugs.

Huge thanks to cshields and rhelmer for putting in the hard yards to get this deployed.
(In reply to Laura Thomson :laura from comment #67)
> Huge thanks to cshields and rhelmer for putting in the hard yards to get
> this deployed.

Yay! Good job, guys.
(In reply to Fred Wenzel [:wenzel] from comment #68)
> (In reply to Laura Thomson :laura from comment #67)
> > Huge thanks to cshields and rhelmer for putting in the hard yards to get
> > this deployed.
> 
> Yay! Good job, guys.

I know laura mentioned this in the newsgroup post, but arpad, mstange, peterbe all helped out a ton at the last minute (and of course arpad and mstange got tbpl where it is now) so I want to make sure they get shout outs too :)
Thanks to everyone involved. Let's get the regressions fixed and work on some long-term goals. As was mentioned in another bug, we still need tinderbox for tree status.
No longer depends on: 676879
Product: mozilla.org → mozilla.org Graveyard