Closed Bug 610915 Opened 14 years ago Closed 13 years ago

Create a trigger to notify QA Mozmill boxes once builds and updates are available

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: whimboo, Unassigned)

References

Details

(Whiteboard: [automation][mozmill][updates])

Last Friday I talked with John about how we can improve our Mozmill test-runs so that we would be ready for RelEng once bug 588398 is fixed. A couple of things will have to be improved on our side for that. But at the same time we can already do our best to test builds and updates on the FTP server once they have been made available.

Right now we trigger our test-runs at 8am each day. The problem is that sometimes builds or updates aren't available at that time and our tests fail. Further, we don't detect possible update failures as soon as builds become available.

The following idea came up:

* Create a trigger notification which informs our qa-horus box once a build or an update is ready. This should be limited to the en-US locale but can cover all 3 supported branches (4.0, 3.6, and 3.5).
* Once we get this notification we can queue it up on our side and run tests against the build/update for the specific platform once the machine has finished its current test-run.
* Reports are sent to Brasstacks. Those can be fetched via a REST API if RelEng wants feedback.
* (Later we can consider a nightly-test update channel to make sure updates are working before pushing them to the nightly channel.)

As John mentioned to me, we can definitely go with this solution until RelEng is able to work on bug 588398. That means we have to find a reasonable trigger mechanism to accomplish this specific task.
What information is needed to kick off this automation? A couple of ways that I can think of off the top of my head are posting information to a website that triggers the automation, and running a remote command over ssh. Do you have any thoughts on the triggering?
Given that qa-horus isn't "official" infrastructure, and there are actually 8 or 9 virtual machines that want to consume this info, I think we probably want a polling solution on our end rather than something that pushes to us. Either a web service (or simple webpage) or some semaphore file dropped somewhere qa-horus and its VMs can access could work. Maybe a polling script that hits a URI formed around the date? Once it returns 200, we could grab the platform-appropriate build under a directory given in the returned content.
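The polling idea above could be sketched roughly as follows. This is only an illustration: the base URL, the date format in the URI, and the convention that the response body holds the build directory are all assumptions, since no such service was defined in this bug.

```python
import datetime
import time
import urllib.error
import urllib.request

# Hypothetical base URL; no such service is defined in this bug.
BASE_URL = "http://example.invalid/mozmill-triggers"


def trigger_url(day, base_url=BASE_URL):
    """Form the URI to poll from a date, e.g. <base_url>/2010-11-15."""
    return "%s/%s" % (base_url, day.isoformat())


def fetch_status(url):
    """Return (status, body) for a GET request; status is 0 on network errors."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # Non-2xx answers such as 404 mean "not ready yet".
        return error.code, ""
    except OSError:
        # Covers URLError and socket timeouts.
        return 0, ""


def wait_for_trigger(day, fetch=fetch_status, interval=300,
                     max_attempts=100, sleep=time.sleep):
    """Poll until the URI returns 200, then return the body.

    The body is assumed to name the directory holding the builds.
    Returns None if the trigger never shows up.
    """
    url = trigger_url(day)
    for _ in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body.strip()
        sleep(interval)
    return None
```

Each VM could run `wait_for_trigger(datetime.date.today())` and then pick its own platform's build from the returned directory. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.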
John, we have two major items for Mozmill automation at the moment: nightly tests and release tests. For both we have to test the builds and the updates on all platforms. Let's see what specific data we need. Then we can find the easiest way to trigger our tests.

Nightly - Updates (all supported branches)
* Trigger: platform specific .mar file (watching latest-* on FTP for all supported branches?)
* Actions: Test partial and complete updates with fallback on nightly channel

Nightly - General Tests (all supported branches)
* Trigger: platform specific installer (watching latest-* on FTP for all supported branches?)
* Actions: execute our general test-run against that build

Release - General Tests (branch to test)
* Trigger: platform specific en-US candidate build which has been uploaded to the FTP (folders vary so no easy way to watch)
* Actions: execute our general test-run against that build

Release - Updates (branch to test)
* Trigger: channel to test
* Trigger: branch to test (needed separately to identify the list of predefined builds)
* Actions: Test partial and complete updates with fallback for predefined list of builds

As Geo mentioned, we have a couple of VMs which currently run those tests. There is no master/slave installation in place. For nightlies we have a cron script, and releases are executed manually. We are thinking about using a CI system like Hudson.

I think the more important part for this bug is how we can improve the handling of release candidate builds and updates. I would like to get some feedback from RelEng on the best triggering mechanism for builds.
For release updates, your best bet is to poll manually constructed update URLs. For example, when 3.6.13 ships to beta, the following URL will change from returning an empty XML snippet to one that contains update information:

https://aus2.mozilla.org/update/1/Firefox/3.6.13/$buildID/WINNT_x86-msvc/en-US/beta/update.xml

($buildID needs to be replaced with whatever it ends up being, of course.) URLs for other update channels/platforms/locales would be very similar.

At this time, it is possible to trigger notifications when updates on betatest become available. We don't have any triggers for releasetest, beta, or release channels, though.

For release builds, we already notify other systems to run unit and perf tests. We could notify QA systems as well. We'll need specs on what the notification needs to look like.

For nightly builds *and* updates we can trigger like the above.
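A minimal sketch of the URL construction and the "empty versus populated snippet" check described above. The URL layout is taken from the example in this comment; the assumption that an empty snippet is an `<updates>` root without any `<update>` child is mine, and the build ID in the usage note is a placeholder.

```python
import xml.etree.ElementTree as ET

# Layout copied from the example URL in the comment above.
AUS_TEMPLATE = (
    "https://aus2.mozilla.org/update/1/{product}/{version}/{build_id}/"
    "{build_target}/{locale}/{channel}/update.xml"
)


def update_url(version, build_id, build_target="WINNT_x86-msvc",
               locale="en-US", channel="beta", product="Firefox"):
    """Fill the update URL template for one platform/locale/channel."""
    return AUS_TEMPLATE.format(
        product=product, version=version, build_id=build_id,
        build_target=build_target, locale=locale, channel=channel,
    )


def has_update(snippet):
    """Return True if the XML snippet offers at least one update.

    Assumption: "empty" means an <updates> root with no <update> children.
    """
    root = ET.fromstring(snippet)
    return root.find("update") is not None
```

A poller would fetch `update_url("3.6.13", "<build id>")` periodically and start the update tests the first time `has_update()` returns True for the response body.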
(In reply to comment #4)
> For release builds, we already notify other systems to run unit and perf tests.
> We could notify QA systems as well. We'll need specs on what the notification
> needs to look like.

What do those notifications look like? I think it would be best if we could use the same data. Is there any code or bug I can look at?
Priority: -- → P3
Whiteboard: [automation][mozmill][updates]
Yeah, at this point I think we're ok aligning with you folks on notification format. We don't have anything in particular in place yet, so can deal with anything sane. As Henrik says, eventually we want to move towards a robust CI solution like Hudson. For the first round, though, it'll probably still be simple triggered scripts. Which is all to say, I don't want the trigger solution to be too coupled to QA's infrastructure because our infrastructure is in flux. Re: the XML snippet solution, pardon if this is a really basic question, but how would we anticipate the build ID to construct the URL? Also, is that method relevant to nightlies? Also, when you say "we can trigger like the above" can you be a little more specific? I see several methods mentioned above.
(In reply to comment #6)
> Yeah, at this point I think we're ok aligning with you folks on notification
> format. We don't have anything in particular in place yet, so can deal with
> anything sane.

We can notify by e-mail, some sort of HTTP service, dropping a file somewhere -- whatever.

> Which is all to say, I don't want the trigger solution to be too coupled to
> QA's infrastructure because our infrastructure is in flux.

OK

> Re: the XML snippet solution, pardon if this is a really basic question, but
> how would we anticipate the build ID to construct the URL?

You might be able to make use of the configuration files we keep to feed our tests. They contain all the necessary data to manually construct an update URL. These are here, divided into files by platform+branch: http://hg.mozilla.org/build/tools/file/tip/release/updates. The related script, verify.sh, shows how to construct the URL: http://hg.mozilla.org/build/tools/file/3379198fb5af/release/updates/verify.sh#l99

> Also, is that method
> relevant to nightlies?

Nightlies use a different format, and we don't keep a record of the buildids anywhere so it may be tougher there....let's focus on releases first.

> Also, when you say "we can trigger like the above" can you be a little more
> specific? I see several methods mentioned above.

Never mind that, actually -- I was trying to say that we can notify when nightly builds are available and also when nightly updates are available -- but those are tied together anyways.
Ben, I realize now that we're actually talking about two different things. We have two separate efforts.

The urgent one at this time is to get triggers on nightly builds. I'm leading this project on the QA side, so that's my primary concern. For now we expect to test them after release to channel, but we'd eventually like to look towards being able to qualify them before they go out to the community. I believe the primary purpose is to prevent situations like the recent one that stranded some number of users on a build that never updates properly to an unbroken build.

We also want to get triggers on release builds, also for update testing, which is more Henrik's concern. However, our current release testing process is adequate (if a little too "semiautomatic" for lack of a trigger) so the urgency is lower there.

I think Henrik and I have both been chiming in on this bug assuming they'd be variations on the same solution, but perhaps not. How would you like to proceed--two bugs, or what? The nightly testing is actually one of our main Q4 goals, so I'd like to proceed steadily if at all possible.
I have a crude script to poll ftp and update snippets and feed the event info into pulse. It polls ftp, sending messages for platform builds it sees, reading the .txt files and using the build id to construct a snippet URL. I don't think it makes much sense to do custom notifications for mozmill when many other systems will be interested in them as well. I know my scripts will want to be notified :-) Are the builds mv'd into the nightly directory or are they uploaded directly to their final location? Is the latest-* symlink changed at the very end of the process or very start? I am worried about sending a "build available" message while the build is still being uploaded. Also, for this bug, do you want notifications for each platform individually or once all platforms are available? I'm still working out a coherent routing key story...
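The scraper itself is not attached to this bug, so the following is only a guess at the core of such a script: read the build ID out of a build's .txt file and decide whether it is a build that has not been announced yet. That the first line of the .txt file holds a 14-digit build ID is an assumption, and the function names are made up for illustration.

```python
import re

# Assumption: build IDs are 14-digit timestamps (YYYYMMDDhhmmss).
BUILD_ID_RE = re.compile(r"^\d{14}$")


def parse_build_info(text):
    """Extract the build ID (first line) and optional source URL (second line)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not BUILD_ID_RE.match(lines[0]):
        raise ValueError("no build ID found on the first line")
    return {
        "build_id": lines[0],
        "source": lines[1] if len(lines) > 1 else None,
    }


def is_new_build(seen, platform, build_id):
    """Return True once per (platform, build ID) and remember it in `seen`."""
    if seen.get(platform) == build_id:
        return False
    seen[platform] = build_id
    return True
```

Tracking the last build ID per platform keeps a repeated poll from sending the same event twice. It does not address the concern about announcing a build that is still being uploaded.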
Christian, agreed re: not really wanting a custom solution for Mozmill. That's why I wanted to feel out what was already available. My assumption is that eventually we'll be subscribing to Pulse for this sort of thing, so I'm trying to figure out the best interim solution that gets us what we want without a ton of infrastructure we'll throw away later. The basic thing I think we need is a simple notification of when yesterday's build can be updated to today's build, and what build ID we should look for after update for verification purposes. We probably also need enough info to find and download today's build directly from ftp. Then our process would essentially set it aside so it could be used as the update "before" for the next day. It's possible all we need is the info to download today's build, then we can unpack it and look for the build ID directly in the package. Still, my gut feeling is that I'd prefer releng to directly tell us the expected build ID so we have an independent verification source. The most convenient thing on our side would be a notification per-platform. Then I can have VMs individually poll for their own notification. I also assume not all the platforms land at the same time, which would also suggest per-platform. All of this is open to discussion, of course. We just want to do the most effective thing and I'm taking some early guesses as to what that would be. If it would make more sense to get on a conf call or in a room next week to speed this along, I'm all for it.
(In reply to comment #10)
> We probably also need enough info to find and download today's build directly
> from ftp. Then our process would essentially set it aside so it could be used
> as the update "before" for the next day.

Just a side note, since you keep referring to today's build: multiple builds can be generated within a day, so we should use the term "latest build" here.
That's a really good point. When there are multiple builds per day, does the earlier build update to the later build within the same day? If so, we should just generalize to previous/latest, rather than yesterday/today. If builds don't automatically update that way, though, we'd have to talk out the proper behavior further.
Compared to the depend builds that happen all day, 'nightly' means a few things:

* it's a clobber build
* we publish the symbols to crash-stats
* it's published as an update to people on the nightly channel

The way nightly updates work is that _all_ old builds update to the latest 'nightly' build.
(In reply to comment #8)
> The urgent one at this time is to get triggers on nightly builds. I'm leading
> this project on QA side, so that's my primary concern.

OK, let's focus solely on this, then. Currently, the only notifications we have are buildbot -> buildbot within the releng network, so we cannot re-use any of those for your purposes.

In the medium to long term I think that we should be running any Mozmill tests on RelEng infrastructure. If there's things we can't run on RelEng infra, you should be subscribing to Pulse events for notification. But we're not there yet, so in the interim, I suggest we start notifying you by e-mail.

Without knowing your systems, I imagine it would look something like this: You'd need a special account to receive the notifications and something that fires a script when it receives mail. The script would be responsible for parsing the e-mail and triggering the tests. The e-mail would include the information the tests need to run: URLs of builds, buildid, version, etc. This should be a fairly quick thing to get up and running.
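The receiving script could look roughly like this. No mail format was agreed on in this bug, so the "key: value" body lines and the field names `build_url`, `buildid` and `version` are purely hypothetical.

```python
from email import message_from_string

# Hypothetical field names; the bug only lists "URLs of builds, buildid, version, etc."
REQUIRED_FIELDS = ("build_url", "buildid", "version")


def parse_notification(raw_mail):
    """Parse a plain-text notification mail into a dict of fields.

    Assumes a single-part mail whose body consists of "key: value" lines.
    """
    message = message_from_string(raw_mail)
    if message.is_multipart():
        raise ValueError("expected a single-part plain-text mail")

    fields = {}
    for line in message.get_payload().splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip().lower()] = value.strip()

    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError("missing fields: %s" % ", ".join(missing))
    return fields
```

Whatever delivers the mail (for example a pipe from the mail server into a script) would hand the raw message to `parse_notification()` and pass the resulting fields on to the test-run.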
Re: the criteria for updates above, how often do we release "clobber builds" w/ a channel update? Once daily or more often? Re: solution, OK, cool. I'm just fine with interim solutions, and I agree that anything long-term should be under the releng umbrella. I'd like to run things on our side initially, if only to feel out the reliability first. Once I've worked out any corner/edge cases re: the Mozmill part of the testing, we should discuss how to do it right within the RelEng infra. Christian has told me that he -might- have a notification system set up with Pulse that would work for us in the immediate term. I plan on exploring this more with him on Wednesday. Failing that, I think an email solution would work fine. There are some details to work out there (per-platform notifications or not being chief among) but it sounds straightforward enough. I'll get back to you by mid-aft on Wednesday to figure out how to go forward, thanks much!
(In reply to comment #15)
> Re: the criteria for updates above, how often do we release "clobber builds" w/
> a channel update? Once daily or more often?

Under normal circumstances, once per day, very early in the PST morning. Depending on the quality of that build or sometimes just big stuff landing during the day we sometimes get requests to spin additional ones.

> Re: solution, OK, cool. I'm just fine with interim solutions, and I agree that
> anything long-term should be under the releng umbrella.

Great, glad we're on the same page here!

> I'd like to run things on our side initially, if only to feel out the
> reliability first. Once I've worked out any corner/edge cases re: the Mozmill
> part of the testing, we should discuss how to do it right within the RelEng
> infra.

Makes sense to me.

> Christian has told me that he -might- have a notification system set up with
> Pulse that would work for us in the immediate term. I plan on exploring this
> more with him on Wednesday.

Awesome.

> Failing that, I think an email solution would work fine. There are some details
> to work out there (per-platform notifications or not being chief among) but it
> sounds straightforward enough.
>
> I'll get back to you by mid-aft on Wednesday to figure out how to go forward,
> thanks much!

Ok, let me know!
Christian's out sick today. I spoke with him briefly this morning re: this issue, but want to have a fuller conversation with him when he's back in the office. So, haven't forgotten about this, just bumped a day or two.
Spoke with Christian today. I think he'll weigh in on the bug too, but I think the upshot is we want to go with Pulse since it will probably be the long term solution, and because there's already prototype listener code that's very easily adapted to our purpose.

However, the one glaring weakness of that approach is that due to how Pulse is currently polling FTP, it may not be able to tell us that a build is a clobber/channel update, which could be a pretty big issue from our side. Because of this, I think Christian is looking to see if there's a better way to get this info from RelEng rather than polling FTP. I think he'll probably weigh in on that.
Here's what I think should happen:

1. Rather than send email, the build system should send a Pulse message. One could develop it on the RelEng side by sending emails and it would be trivial to convert the email sending code to send Pulse messages instead. This means other tools can get the notification too.
2. The RelEng side should send a notification email to an email list if it can't connect to or send the message via Pulse. This makes it so if pulse.mozilla.org does go down, it won't affect anything RelEng related. We can later manually craft messages and send them if need be.
3. The QA tool listens to the messages via Pulse and does its thing.

Pulse will be taken out of prototype mode on Wednesday, so it should be stable enough to write tools against, and the automation team will be maintaining it over time...no worries about it going away.

Needed for this plan is RelEng buy-in to instrument this part and send Pulse messages. Like I said, I have a scraper script (see above), but I'd much rather have this pushed from the source.
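Points 1 and 2 above amount to "try Pulse first, fall back to email". A minimal sketch of that control flow, with both senders passed in as plain callables because the actual RelEng sending code is not part of this bug:

```python
def notify(event, send_pulse, send_email):
    """Publish `event` via Pulse, falling back to email if that fails.

    `send_pulse(event)` and `send_email(event, reason=...)` are hypothetical
    callables supplied by the caller. Returns the channel that was used.
    """
    try:
        send_pulse(event)
        return "pulse"
    except Exception as error:
        # Deliberately broad: a Pulse outage of any kind should never break
        # the build, it should only reroute the notification.
        send_email(event, reason=str(error))
        return "email"
```

Because the mailed event carries the same data, a message can later be crafted from it by hand and pushed into Pulse, as suggested in point 2.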
I'm more than happy to block this on Pulse. The RelEng work on getting status sent to Pulse is being done in bug 614576. Once that lands, QA will be able to subscribe to nightly build completion from there. Catlee says that the routing key will be something like builds.mozilla-central-linux-nightly.#.finished, but that we can confirm that later.

I'm blocking this on that bug, but this can be RESO FIXED when it has landed as far as I can tell.

> 2. The RelEng side should send a notification email to an email list if it
> can't connect to or send the message via Pulse. This makes it so if
> pulse.mozilla.org does go down, it won't affect anything RelEng related. We can
> later manually craft messages and send them if need be.

This doesn't seem like a blocking issue, but I've echoed it into 614576.
Depends on: 614576
(In reply to Ben Hearsum [:bhearsum] from comment #20)
> I'm more than happy to block this on Pulse. The RelEng work on getting
> status sent to Pulse is being done in bug 614576. Once that lands, QA will
> be able to subscribe to nightly build completion from there. Catlee says
> that the routing key will be something like
> builds.mozilla-central-linux-nightly.#.finished, but that we can confirm
> that later.
>
> I'm blocking this on that bug, but this can be RESO FIXED when it has landed
> as far as I can tell.

-> RESO FIXED ?
(In reply to Aki Sasaki [:aki] from comment #21)
> (In reply to Ben Hearsum [:bhearsum] from comment #20)
> > I'm more than happy to block this on Pulse. The RelEng work on getting
> > status sent to Pulse is being done in bug 614576. Once that lands, QA will
> > be able to subscribe to nightly build completion from there. Catlee says
> > that the routing key will be something like
> > builds.mozilla-central-linux-nightly.#.finished, but that we can confirm
> > that later.
> >
> > I'm blocking this on that bug, but this can be RESO FIXED when it has landed
> > as far as I can tell.
>
> -> RESO FIXED ?

Geo, Henrik?
In the year since we've discussed this, our needs for triggering have changed somewhat, and our experience with Pulse has been a bit mixed (uptime issues, some inconveniences with the library, etc.). It sounds like the trigger message from RelEng is present, which IMO means we can close this out. We'll subsequently decide within Automation Services to what degree we're going to use that vs other methodologies. Henrik, can we go ahead and close this?
Ben, can you please give a rundown of the notifications that are available in Pulse now? Also, as Geo pointed out, we are not sure whether to use Pulse or simply watch the latest-builds folders for updates. Both ways would work and we simply have to find out which one we want to use.
(In reply to Henrik Skupin (:whimboo) from comment #24)
> Ben, can you please give a rundown of the notifications that are
> available in Pulse now?

Sorry, can you remind me what the scope is here? Are we talking about just nightlies? just releases? both? en-US only or some locales or all locales?

> Also, as Geo pointed out, we are not sure whether to use Pulse or simply
> watch the latest-builds folders for updates. Both ways would work and we
> simply have to find out which one we want to use.

You can also watch the buildapi JSON for changes - which is the same data source that TBPL feeds off of. It's located here: http://build.mozilla.org/builds/builds-4hr.js.gz
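Watching that file for changes could be sketched like this. The structure of the JSON is not documented in this bug, so the sketch treats the decoded data as opaque and only detects that the content changed; the function names are made up for illustration.

```python
import gzip
import hashlib
import json
import urllib.request

BUILDS_URL = "http://build.mozilla.org/builds/builds-4hr.js.gz"


def fetch(url=BUILDS_URL):
    """Download the gzip-compressed JSON file as raw bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def decode_builds(raw_bytes):
    """Decompress and parse the file; return (data, digest of the content)."""
    payload = gzip.decompress(raw_bytes)
    # Hash the decompressed content: gzip headers can differ between
    # downloads even when the JSON inside is identical.
    digest = hashlib.sha256(payload).hexdigest()
    return json.loads(payload.decode("utf-8")), digest


def poll_once(raw_bytes, last_digest):
    """Return (data, digest); data is None if nothing changed since last_digest."""
    data, digest = decode_builds(raw_bytes)
    if digest == last_digest:
        return None, digest
    return data, digest
```

A watcher would call `poll_once(fetch(), last_digest)` on a timer and inspect the returned data for the builds it cares about whenever the result is not None.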
(In reply to Ben Hearsum [:bhearsum] from comment #25)
> Sorry, can you remind me what the scope is here? Are we talking about just
> nightlies? just releases? both? en-US only or some locales or all locales?

It's for en-US daily builds (nightly, aurora) for now. But I believe that it wouldn't be that different for other builds either.

> You can also watch the buildapi JSON for changes - which is the same data
> source that TBPL feeds off of. It's located here:
> http://build.mozilla.org/builds/builds-4hr.js.gz

Interesting. I didn't know about this. Do you have a script somewhere in a repo which makes use of it? Or is there any basic documentation for the data structure?
Here's a good starting place:

    from mozillapulse import consumers

    def got_message(data, message):
        routing_key = data['_meta']['routing_key']
        _magic, builder, therest = routing_key.split(".", 2)
        # Only look at builders whose name starts with "try".
        if not builder.startswith("try"):
            return
        # Properties arrive as (name, value, source) triples.
        props = dict( (k,v) for (k,v,source) in data['payload']['build']['properties'] )
        print routing_key, props
        print

    pulse = consumers.BuildConsumer(applabel='catlee@mozilla.com|build-watcher')
    pulse.configure(topic='build.*.*.finished', callback=got_message)
    pulse.listen()

You should be able to modify this to get exactly the builds you're interested in.
Geo already has also been playing in @ http://hg.mozilla.org/qa/mozmill-automation/file/default/ondemand (which I think was this bug).
Thanks Chris for this snippet. Once Pulse is working again I will check how it works for us with regard to daily test-runs and l10n tests for localized builds.

Besides that, we haven't covered the second part of the summary yet. Is there also a Pulse message when updates have been made available? Is this part of the BuildConsumer?

(In reply to Christian Legnitto [:LegNeato] from comment #29)
> Geo already has also been playing in @
> http://hg.mozilla.org/qa/mozmill-automation/file/default/ondemand (which I
> think was this bug).

This is a different project and covers our release testing efforts. We trigger those tests manually and only use Pulse to connect our machines for distributing the test execution notification.
Blocks: 709052
No longer blocks: 617816
(In reply to Chris AtLee [:catlee] from comment #28)
> pulse = consumers.BuildConsumer(applabel='catlee@mozilla.com|build-watcher')
> pulse.configure(topic='build.*.*.finished', callback=got_message)

Chris, do you have an example of what such a message looks like? As long as Pulse is not working I would like to run a simulation locally to prove that our code is working. Thanks.
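For a local simulation, a message can be faked from nothing more than the fields the callback in comment 28 reads. The sketch below is written for Python 3 and rests on assumptions: a real Pulse message carries more data than this, and the shape of the routing key (`build.<builder>.<number>.<state>`) is inferred from the topic `build.*.*.finished`, not confirmed. The helper names are made up.

```python
def fake_build_message(builder, properties, number=1, state="finished"):
    """Build a dict with only the fields the comment 28 callback reads."""
    return {
        "_meta": {"routing_key": "build.%s.%d.%s" % (builder, number, state)},
        "payload": {
            "build": {
                # Each property is a (name, value, source) triple.
                "properties": [
                    [name, value, "simulated"]
                    for name, value in sorted(properties.items())
                ],
            }
        },
    }


def make_callback(results, builder_prefix):
    """Return a callback like got_message() that records instead of printing."""
    def got_message(data, message):
        routing_key = data["_meta"]["routing_key"]
        _magic, builder, _rest = routing_key.split(".", 2)
        if not builder.startswith(builder_prefix):
            return
        props = {
            name: value
            for name, value, _source in data["payload"]["build"]["properties"]
        }
        results.append((routing_key, props))
    return got_message
```

Calling the callback directly with `fake_build_message(...)` and `None` as the message object exercises the filtering and property extraction without a Pulse connection.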
Pulse is working again so I got the structure of such a build notification.
After talking to catlee on IRC I was able to put together some more code in addition to the snippet from comment 28, which lets us trigger a Jenkins job. Everything works like a charm and I think we can close out this bug now. Thanks everyone.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Are you sure that you are pushing messages into the system about finished build processes? The only messages I get for topic='*.*.*.*' are notifications from test machines. So far there is not a single entry for a finished build process. Can you please check the BuildConsumer for those types of messages? Thanks.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Chris, I was running the script the whole night to get some notifications for the Firefox 9.0b6 builds. For this release I only got one single notification, for the win32 zip archive on the FTP server. No notifications for the installers or packages on other platforms. The same applies to the localized builds. Does RelEng not support candidate builds yet, so that I really have to wait for today's nightly builds?

Also, I see some new l10n builds in the tinderbox-builds folder for aurora. I never got notifications for those either: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-aurora-l10n/

I can only see some notifications for try server builds. I would appreciate it if we could sort out those issues soon, so we can get our CI system running by the end of this year. Who from RelEng would be the contact person? Thanks.
After talking to Nick on IRC, I have changed the topic to '#' to get all messages passed through Pulse. With it I now successfully see finished jobs for builds, currently only for l10n-dep builds. I will let the script continue to run to check how the nightly builds of Nightly and Aurora will look. Chris, if possible I would appreciate it if we could have a quick chat regarding additional properties in those messages, e.g. the package path to the ftp location for l10n-dep builds.
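The difference between the topics '*.*.*.*' and '#' follows from AMQP topic matching: '*' stands for exactly one dot-separated word, while '#' stands for zero or more words. The matcher below is a small stand-alone illustration of those rules, not code from Pulse or mozillapulse, and the routing keys in the usage note are invented.

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP-style topic pattern matches a routing key."""
    return _match(pattern.split("."), routing_key.split("."))


def _match(pattern, words):
    if not pattern:
        # Pattern exhausted: match only if the key is exhausted too.
        return not words
    head, rest = pattern[0], pattern[1:]
    if head == "#":
        # '#' may absorb zero or more words.
        return any(_match(rest, words[i:]) for i in range(len(words) + 1))
    if not words:
        return False
    # '*' matches any single word; anything else must match literally.
    return (head == "*" or head == words[0]) and _match(rest, words[1:])
```

So '*.*.*.*' only ever sees routing keys with exactly four words, whereas '#' sees everything. If the build messages use keys with a different number of words, that would be one possible explanation for what was observed above, but the bug does not confirm it.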
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Chris, thanks for all the help on this. I'm now getting notifications for all the daily builds across locales, which are already triggering our Mozmill tests on my test machine. For further proposals or issues regarding missing properties I will file new bugs. This bug has been solved. Marking as verified.
Status: RESOLVED → VERIFIED
\o/
Product: mozilla.org → Release Engineering