Closed Bug 813629 Opened 9 years ago Closed 7 years ago

Run QA's update tests for releases on RelEng hardware

Categories

(Release Engineering :: Release Automation: Other, defect, P3)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1148546

People

(Reporter: bhearsum, Unassigned)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1914] )

Filing this as a concrete bug, but it's still just an idea.

QA runs a set of what I believe are automated tests when snippets are pushed to releasetest. If this is true, I believe we would gain a lot by running these tests on the RelEng pool. Specifically because:
1) We have the capacity to parallelize more than QA. This means that the existing tests can run faster, and possibly that we can test more versions/locales.
2) We can trigger tests automatically when things go to releasetest. For betas this means that these tests could run overnight. For final releases it means that we get back whatever time it takes between getting on releasetests and QA kicking off the tests (so, communication time + manual setup time).

I'm interested to hear others' thoughts and opinions on this.
I'm all for anything which allows us to scale up and automate our processes even further. I'm ccing Clint Talbert as his team is working on scaling up our own infrastructure and can likely comment on this proposal.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #1)
> I'm all for anything which allows us to scale up and automate our processes
> even further. I'm ccing Clint Talbert as his team is working on scaling up
> our own infrastructure and can likely comment to this proposal.

Awesome. In order to assess how easy/difficult it would be to integrate, some details would be helpful. No rush on these, as I don't think we can realistically act on this until late December or January because of B2G:
- What manual steps do you take to trigger your automation? Does it require additional manual steps after starting before it finishes?
- Is your automation run on sets of homogenous machines, or individual heterogeneous ones?
- Where do the script(s) that run it live?
- Does it have configuration files? If so, where do they live? How are they updated?
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #1)
> I'm all for anything which allows us to scale up and automate our processes
> even further. I'm ccing Clint Talbert as his team is working on scaling up
> our own infrastructure and can likely comment to this proposal.

Anthony, in the future please cc me on any request related to Mozmill tests. That will help us reach a solution faster. If I hadn't seen a comment from Ben on another bug, I wouldn't have known about this one. Thanks.

(In reply to Ben Hearsum [:bhearsum] from comment #2)
> - What manual steps do you take to trigger your automation? Does it require
> additional manual steps after starting before it finishes?

The test machines would need Mozmill installed. You will also have to clone our mozmill-automation repository (http://hg.mozilla.org/qa/mozmill-automation/) to trigger the testrun_update.py job. No further preparation is necessary. I'm not sure when Mozmill 2.0 will be ready, but it will make it even easier to run Mozmill tests.

> - Is your automation run on sets of homogenous machines, or individual
> heterogeneous ones?

You can do whatever you want by integrating your own queue mechanism. Right now we have 3 machines per platform. 

> - Where do the script(s) that run it live?

We run those through our Mozmill CI system (https://github.com/mozilla/mozmill-ci) on dedicated ESX hardware. For now, those tests are triggered manually by Anthony or Juan across a couple of locales and previous versions.

> - Does it have configuration files? If so, where do they live? How are they
> updated?

For CI you will need configuration files. If you have your own queuing and handling mechanism you can trigger the tests directly. All options can be specified through the CLI tool. For multiple versions it would have to be executed multiple times.
Thanks for all of the info Henrik, I'm hoping we can get to this in Q1 or Q2.
(In reply to Henrik Skupin (:whimboo) [away 12/21 - 01/01] from comment #3)
> (In reply to Ben Hearsum [:bhearsum] from comment #2)
> > - What manual steps do you take to trigger your automation? Does it require
> > additional manual steps after starting before it finishes?
> 
> The test machines would need Mozmill to be installed. Also you will have to
> clone our mozmill-automation repository
> (http://hg.mozilla.org/qa/mozmill-automation/) to trigger the
> testrun_update.py job. More pre-work is not necessary. Not sure when Mozmill
> 2.0 will be ready but it will make it even easier to run Mozmill tests.

Can you give me more details on what exactly you run, and what it depends on? Basically, I'm trying to figure out how to run it by hand, in order to think about how to automate it.
Dependencies have been mentioned so I can't add anything more. For the command to execute just run:

./testrun_update.py --report=http://mozmill-ci.blargon7.com/db/ firefox-installer.ext

The --report part is optional and only specifies the location to send the JSON report to. It could be replaced with e.g. file://report.json to save the report locally. Otherwise, specify only the Firefox installer (exe, dmg, bz2) and that's it. Installation and uninstallation are done automatically.
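
For anyone scripting this, here is a minimal Python sketch of the invocation above, run once per installer. Only the script name, the optional --report option, and the installer argument come from this comment; the helper names `build_update_test_command` and `run_update_tests` are hypothetical and not part of mozmill-automation.

```python
import subprocess


def build_update_test_command(installer_path, report_url=None,
                              script="./testrun_update.py"):
    """Build the argument list for one update test run.

    The --report option is optional; when omitted, only the installer
    path is passed to the script.
    """
    cmd = [script]
    if report_url:
        cmd.append("--report=" + report_url)
    cmd.append(installer_path)
    return cmd


def run_update_tests(installers, report_url=None):
    """Run the script once per installer and collect the exit codes.

    Hypothetical helper: assumes the current working directory is a
    clone of the mozmill-automation repository.
    """
    results = {}
    for installer in installers:
        cmd = build_update_test_command(installer, report_url)
        results[installer] = subprocess.call(cmd)
    return results
```

Testing several versions then amounts to passing several installers to `run_update_tests`, since the script takes one installer per run.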
(In reply to Henrik Skupin (:whimboo) [away 12/21 - 01/01] from comment #6)
> Dependencies have been mentioned so I can't add anything more. For the
> command to execute just run:
> 
> ./testrun_update.py --report=http://mozmill-ci.blargon7.com/db/
> firefox-installer.ext
> 
> The --report part is optional and only specified a location where to send
> the JSON report to. It could be replaced with e.g. file://report.json to
> save it off locally. Otherwise only specify the Firefox installer (exe, dmg,
> bz2) and that's it. Installation and uninstallation is done automatically.

OK, this is a great start. How do you manage the array of platforms/locales/versions that need to be tested? Is that done in Jenkins or ....?
Currently we only test a handful of locales and versions on all platforms. Therefore, we have a config file that we feed into mozmill-ci, which manages the creation of the jobs to run. You can find example configs here:

https://github.com/mozilla/mozmill-ci/tree/master/config/ondemand

For full details about the testing of updates please talk to Anthony and Juan from the QA team. They have the specific requirements.
(In reply to Henrik Skupin (:whimboo) [away 12/21 - 01/01] from comment #8)
> Example configs you can find here:
> https://github.com/mozilla/mozmill-ci/tree/master/config/ondemand
> 
> For full details about the testing of updates please talk to Anthony and
> Juan from the QA team. They have the specific requirements.

Right, this is a sample. The actual config files Juan and I use are broader though. We typically test 4 previous Firefox versions, en_US + 1 random locale, and all supported platforms. I can provide an example if desired.
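
As a rough illustration of the coverage described above (4 previous versions, en-US plus 1 random locale, all supported platforms), the matrix could be generated along these lines. This is a sketch only: the function name, the dictionary keys, and the sample values in the usage note are hypothetical and do not reflect the actual mozmill-ci config format.

```python
import itertools
import random


def build_update_matrix(previous_versions, locales, platforms,
                        base_locale="en-US", rng=None):
    """Return one entry per (version, locale, platform) combination.

    The base locale is always included, plus one randomly chosen
    additional locale when any other locale is available.
    """
    if rng is None:
        rng = random.Random()
    others = [loc for loc in locales if loc != base_locale]
    chosen = [base_locale]
    if others:
        chosen.append(rng.choice(others))
    return [
        {"version": version, "locale": locale, "platform": platform}
        for version, locale, platform in itertools.product(
            previous_versions, chosen, platforms)
    ]
```

With 4 versions, 2 locales, and 3 platforms (all placeholder values), this gives 24 runs, each of which would be a separate invocation of the update test script.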
This looks helpful for when we look at this further - thanks folks!
I'm sad about it, but I'm bumping down the priority on this. It's an enhancement, and we have many other things to do. QA's hardware has been much more stable/quick in the past few months, too, so this doesn't buy us as much as it would've when originally filed. I still think we should do it at some point though.
Priority: -- → P3
No problem at all. As of now we have 3 nodes per platform for testing updates on various channels. We made a lot of improvements to get those tests running faster. I think that we have a good situation now. But I would defer to Anthony here, given that he runs the tests. So he might be able to reply about the duration and whether we need another round of speed improvements.

Without having puppet available for our CI machines, I don't see a way to integrate more than 3 nodes per platform. The manual maintenance would be way too time-consuming. In such a case only a more dedicated mesh of machines would help, which is where Releng comes into play.
I have no issues with this being lowered in priority. That said, I would consider this bug and the puppet bug to be hard-blockers for shipping more than two Betas per week, if we ever decide that's a thing we want to do.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #13)
> I have no issues with this being lowered in priority. That said, I would
> consider this bug and the puppet bug to be hard-blockers for shipping more
> than two Betas per week, if we ever decide that's a thing we want to do.

+1. In fact, I would argue that spending our time getting QA tests running on RelEng hardware is a better use of everyone's time than QA learning about and deploying Puppet. When we get to the point of > 2 betas per week, this sounds like a great reason to push this project along.
Well, IT would do all the necessary Puppet work. It would not be us. :) The bug for that is already on file: it's bug 828557. Not sure when this will actually happen.
Product: mozilla.org → Release Engineering
QE update testing was one bottleneck during the 32.0.3 chemspill. Assuming that this change would remove the bottleneck, I do see value in this bug as it should allow us to turn around releases more quickly. Can this bug be prioritized for Q4 or Q1?
Lawrence, I do not see any actionable information in your last comment. What was the bottleneck specifically? Can you please be more detailed about the issue someone was facing? I haven't heard anything about that yet personally, and that makes me feel bad.

Also I think setting this up on releng hardware doesn't make sense at this time given that we want to get rid of Mozmill in the foreseeable future and replace everything with Marionette.
Flags: needinfo?(lmandel)
During the 32.0.3 chemspill we released updates for 10 products. The bottleneck occurred because we were trying to push too many products through update verification concurrently. This was not a major issue but did add time to the release. My understanding from speaking with some people in releng is that this bottleneck can be mostly eliminated by moving update verification testing to releng hardware.

If the plan is to change the test framework, this may invalidate this request or it may make sense to work with releng on the change.
Flags: needinfo?(lmandel)
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #18)
> During the 32.0.3 chemspill we released updates for 10 products. The
> bottleneck occurred because we were trying to push too many products through
> update verification concurrently. This was not a major issue but did add
> time to the release. My understanding from speaking with some people in
> releng is that this bottleneck can be mostly eliminated by moving update
> verification testing to releng hardware.
> 
> If the plan is to change the test framework, this may invalidate this
> request or it may make sense to work with releng on the change.

It's all related. I don't think the plan invalidates this request.
* We need to change the test framework because Mozmill cannot handle E10S. Marionette is closer to being able to do that, so we are standardizing under the Marionette flag.
* The mozmill tests never got up on buildbot (we tried back in the old, old days, failed, and developed a one-off system to run this automation on because there was no good way to run localized builds in buildbot infra back then).
* Marionette tests *already* run in buildbot. So this will just be a new target for a new set of tests using Marionette on desktop Firefox. This will enable us to run using releng hardware which will solve the bottleneck issue.

What we need:
* These tests (update tests) will need localized builds. I'm still not sure how we will acquire those builds or what that send-change would look like. Currently, we set up a configuration file for the updates we want to test and the mozmill harness handles acquiring all the builds for us. I imagine that activity will have to live in the mozharness script for this test target, but we will likely need help from releng to create it.

* The marionette cut over is due to be completed in 2014 Q4. The update and localization tests are the highest priority tests to be re-written for the new framework, and should be done in Q1.

* The set of mozmill machines that run these tests currently will continue being used for doing some small amount of live and services based testing which hits staging and/or live sites that are outside of the buildbot network because there is still no way to do that in the core buildbot automation (something we would like to fix later in 2015). Those are currently called the functional tests that occur on those machines. Offloading the update and l10n testing will free up the system to work in parallel.
I don't think l10n builds will be a problem, at least as this bug is filed about checking releasetest - I'm taking that as the test channel(s) for beta/release/esr. In this case we can just trigger the marionette job after all the locales are in. In future we may be able to optimise to test after individual locales, or chunks of locales, are done. We should also include the after-release testing on the end-user channel, for consistency of testing infra.
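
The "trigger after all the locales are in" idea, along with the possible later optimisation of testing per chunk of locales, could be sketched as below. The function names and the locale lists are hypothetical; nothing here reflects existing buildbot or release automation code.

```python
def locales_pending(expected_locales, finished_locales):
    """Return the sorted list of expected locales that are not done yet."""
    return sorted(set(expected_locales) - set(finished_locales))


def should_trigger_update_tests(expected_locales, finished_locales):
    """True once every expected locale has been reported as finished."""
    return not locales_pending(expected_locales, finished_locales)


def ready_chunks(chunks, finished_locales):
    """Return the chunks whose locales are all finished.

    Sketch of the per-chunk optimisation: each chunk is a list of
    locales that could be tested as soon as all of them are in.
    """
    finished = set(finished_locales)
    return [chunk for chunk in chunks if set(chunk) <= finished]
```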

It might be more complicated for nightly/aurora. Bug 740142 is redoing how we do repacks there, so let's revisit after that is done, in a separate bug. It might also be a chance to fix issues like bug 588397, i.e. testing updates before enabling them for the audience.
Summary: run QA's releasetest tests on releng hardware → Run QA's update tests for releases on RelEng hardware
(In reply to Clint Talbert ( :ctalbert ) from comment #19)
> * Marionette tests *already* run in buildbot. So this will just be a new
> target for a new set of tests using Marionette on desktop Firefox. This will
> enable us to run using releng hardware which will solve the bottleneck issue.

Just a note on that one... none of the Marionette tests we currently have are for Firefox desktop. So we are hitting a new product here, which means that a lot of things might have to be fixed in Marionette. We also might not be able to backport most of those Marionette changes, so they will ride the train. That means some of them will not end up in ESR releases, and it will take a while until all branches are supported.

> What we need:
> * These tests (update tests) will need localized builds. I'm still not sure
> how we will acquire those builds or what that send-change would look like.

This is worked on in bug 735184. So marking it as a dependency.

> * The marionette cut over is due to be completed in 2014 Q4. The update and
> localization tests are the highest priority tests to be re-written for the
> new framework, should be done in Q1.

The localization tests actually don't have such a high priority, but the functional and remote tests executed for localized builds do. All 3 of those are the pillars of our release testing.

> * The set of mozmill machines that run these tests currently will continue
> being used for doing some small amount of live and services based testing
> which hits staging and/or live sites that are outside of the buildbot
> network because there is still no way to do that in the core buildbot
> automation (something we would like to fix later in 2015). Those are
> currently called the functional tests that occur on those machines.

The functional tests all use local test data, so they could run directly on buildbot. The remote tests are the problematic piece here, since they need remote websites for testing specific areas like SSL.

> Offloading the update and l10n testing will free up the system to work in
> parallel.

The biggest time saver would indeed be the update tests, followed by the functional tests. The latter are run earlier in the day, so sign-off wouldn't have to wait on those, except when too many releases are coming out on a single day.

(In reply to Nick Thomas [:nthomas] from comment #20)
> I don't think l10n builds will be a problem, at least as this bug is filed
> about checking releasetest - I'm taking that as the test channel(s) for
> beta/release/esr. In this case we can just trigger the marionette job after
> all the locales are in. In future we may be able to optimise to test after

I expect that the calling convention will not change, and that we will only replace the inner logic of our testrun scripts. So API-wise nothing should change for you when we transition over to Marionette. Also, the bug referenced above wouldn't play a big role then.
Depends on: 735184
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1914]
Currently we are working on a transition of the update tests from Mozmill to Marionette. Those would become the new base tests for any kind of work which needs to happen here. Please give us about 1-2 months to get them into a stable state. Keep in mind that we would have to get them tested at least with a beta release, but the lowest version of Firefox a compatible Marionette runs in is 39.0. Maybe we can get necessary patches backported to 38 so it will hit the ESR release.
Depends on: 1129843
This is now covered by bug 1148546.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1148546