Closed Bug 1194483 Opened 9 years ago Closed 8 years ago

Funsize broke update tests for Firefox nightly builds

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

42 Branch
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Keywords: meta, regression, Whiteboard: [blocked])

Starting today our update tests are broken for Nightly builds of Firefox. This is because of bug 1173459, which might also land for Aurora soon. Due to the funsize changes there is no longer a previous_buildid in the pulse notification of nightly and l10n-nightly builds. That value was important for us to determine the version of Firefox to download, install, and start the update from.

We started a discussion over on bug 1193508 comment 3:

Rail:
Actually we are almost there. Starting from tomorrow, the nightly builds and
L10N repacks won't be having any partial updates. See bug 1173459.

Funsize will be generating partials (4 partials going backwards). An example
task graph can be found here:
https://tools.taskcluster.net/task-graph-inspector/#Z1i-AjOzRAybatqPtW-fpA/

Complete tasks report to the following exchange:

Exchange: exchange/taskcluster-queue/v1/task-completed
Routing key: #.funsize-balrog.#

The second task in the graph (signing) publishes a manifest with some
information about the update, see
https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/wyLyDnMCSiSyMII5nY0FUw/0/public/env/manifest.json

The balrog task doesn't publish any manifests, but this can be fixed easily.
Also I can add any missing information that you need.

If you want to track other statuses (failures, etc), see the exchanges listed at
http://docs.taskcluster.net/queue/exchanges/

Also, the tasks can be accessed via taskcluster index API
(http://docs.taskcluster.net/services/index/) or UI:

using hg revision:
https://tools.taskcluster.net/index/#funsize.v1.mozilla-central.revision.linux64.d4f3a8a75577.sk.3.balrog/funsize.v1.mozilla-central.revision.linux64.d4f3a8a75577.sk.3.balrog

or latest:
https://tools.taskcluster.net/index/#funsize.v1.mozilla-central.latest.linux64.sk.3.balrog/funsize.v1.mozilla-central.latest.linux64.sk.3.balrog

the same for artifacts:

https://tools.taskcluster.net/index/artifacts/#funsize.v1.mozilla-central.revision.linux64.d4f3a8a75577.sk.3.balrog/funsize.v1.mozilla-central.revision.linux64.d4f3a8a75577.sk.3.balrog

https://tools.taskcluster.net/index/artifacts/#funsize.v1.mozilla-central.latest.linux64.sk.3.balrog/funsize.v1.mozilla-central.latest.linux64.sk.3.balrog


IMHO the easiest way to schedule the tests would be by integrating them into
the funsize graph I mentioned above. It'd be much better to test the updates
before we publish them. ATM we can "easily" do this for Linux and probably
Windows. Mac is not supported by Taskcluster yet.

I hope this helps. Feel free to ping me if you want to chat about possible
solutions for this.


Henrik:
(In reply to Rail Aliiev [:rail] from comment #3)
> Actually we are almost there. Starting from tomorrow, the nightly builds and
> L10N repacks won't be having any partial updates. See bug 1173459.

It's sad to see that this happened without any warning or information to us.
It means that from now on we no longer test any Firefox updates for Nightly
and Dev Edition builds. I hope we can quickly work through that and get it
fixed. I assume that we should get this handled in a new bug.

> Funsize will be generating partials (4 partials going backwards). An example
> task graph can be found here:
> https://tools.taskcluster.net/task-graph-inspector/#Z1i-AjOzRAybatqPtW-fpA/
> 
> Complete tasks report to the following exchange:
[..]
> IMHO the easiest way to schedule the tests would be by integrating them into
> the funsize graph I mentioned above. It'd be much better to test the updates
> before we publish them. ATM we can "easily" do this for Linux and probably
> Windows. Mac is not supported by Taskcluster yet.
> 
> I hope this helps. Feel free to ping me if you want to chat about possible
> solutions for this.

If we cannot cover all platforms via taskcluster we will still have to do the
updates on our side. I will observe the exchange you pointed out, and file a
new bug for any additional work.
So before I start working on code which lets the tests run on our infrastructure, Jonathan and I are wondering if we could at least get started with Windows and Linux on Taskcluster.

Rail, there is the following script for mozharness which will execute our Firefox ui tests:
http://hg.mozilla.org/mozilla-central/file/default/testing/mozharness/scripts/firefox_ui_updates.py

I think you should be perfectly able to call this if mozharness is available on the machine which has to trigger the update. Armen will be able to give you all the details on how to call it, given that he wrote this script. If the tests are run via TaskCluster you might need the buildbot bridge. At least that's what Armen is saying.
Flags: needinfo?(rail)
Depends on: 1194495
If we add FX ui test jobs on buildbot, we can use the BBB to schedule them within a TC graph.

rail: is there a way to back out the change from your bug until we have a working solution?

whimboo: did you test from the previous build to today's? which locales? all 5 archs? (L32/64, W32/64 & Mac)
If we run the update tests on the releng infra, the only blocker is to deploy git to the Windows testers (bug 1192525).
If we ask markco to deploy that for us soon and we add the builders we should be able to have something working before end of next week.

Either that, or add what we're missing in the Pulse exchanges.
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #2)
> whimboo: did you test from the previous build to today's? which locales? all
> 5 archs? (L32/64, W32/64 & Mac)

Yes, plus an additional 32-bit build of Firefox on Windows 64. Here are the tests for aurora from a couple of days ago:

https://treeherder.allizom.org/#/jobs?repo=mozilla-central&filter-job_group_symbol=Fu&exclusion_profile=false&fromchange=0e269a1f1beb

Currently I'm working on getting the TaskCluster consumer working with Mozilla Pulse. I already got some messages from the exchange Rail has mentioned.
(In reply to Henrik Skupin (:whimboo) from comment #0)
> Complete tasks report to the following exchange:
> 
> Exchange: exchange/taskcluster-queue/v1/task-completed
> Routing key: #.funsize-balrog.#

Shouldn't this be #.signing-worker-v1.# as of now? I don't think we should start running the update tests before the signing task has finished. Otherwise the exchange/taskcluster-scheduler/v1/task-graph-finished exchange would be better, since it makes sure that the scheduler has finished the full graph.
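
For reference, the routing keys discussed here are AMQP topic patterns: `#` stands for zero or more dot-separated words and `*` for exactly one. Below is a minimal sketch of that matching rule. The function name `topic_matches` and the sample routing key are made up for illustration, and a real Pulse consumer would let the broker do this filtering.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP topic pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may consume zero or more words of the key.
            return any(match(i + 1, j2) for j2 in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            # '*' consumes exactly one word; a literal must be equal.
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


# Hypothetical routing key, invented for this example only.
SAMPLE_KEY = "primary.TASKID.0.group.worker.provisioner.funsize-balrog.scheduler.graph._"

if __name__ == "__main__":
    for pattern in ("#.funsize-balrog.#", "#.signing-worker-v1.#"):
        print(pattern, topic_matches(pattern, SAMPLE_KEY))
```

With a binding of #.funsize-balrog.# only messages whose key contains that word somewhere are delivered, so a signing-only binding would not see the balrog task at all.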

> The second task in the graph (signing) publishes a manifest with some
> information about the update, see
> https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/wyLyDnMCSiSyMII5nY0FUw/0/public/env/manifest.json

I can now at least see the path I have to go to reach this manifest. Now I only have to figure out how to use the taskcluster Python client to actually accomplish that.
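
As a rough sketch of that path, the manifest location can be derived from a task ID and run ID. This does not use the taskcluster Python client; it relies only on the standard library and assumes the URL layout of the single example quoted above holds for other tasks. The helper names `manifest_url` and `fetch_manifest` are hypothetical.

```python
import json
from urllib.request import urlopen

# Base taken from the example manifest URL in this bug; assumed to be stable.
ARTIFACT_BASE = "https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts"


def manifest_url(task_id: str, run_id: int = 0) -> str:
    """Build the manifest.json URL for a given task and run."""
    return f"{ARTIFACT_BASE}/{task_id}/{run_id}/public/env/manifest.json"


def fetch_manifest(task_id: str, run_id: int = 0, timeout: float = 30.0):
    """Download and decode the manifest (performs a network request)."""
    with urlopen(manifest_url(task_id, run_id), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(manifest_url("wyLyDnMCSiSyMII5nY0FUw"))
```

The task ID and run ID would come from the Pulse message of the finished task.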

> The balrog task doesn't publish any manifests, but this can be fixed easily.
> Also I can add any missing information that you need.

Do you think that this would really work without the signing process?

> If you want to track other statuses (failures, etc), see the exchanges listed
> at
> http://docs.taskcluster.net/queue/exchanges/

What would those errors be in case of your funsize tasks? If something failed? I think that would not be interesting for me given that we don't test it.

> using hg revision:
> https://tools.taskcluster.net/index/#funsize.v1.mozilla-central.revision.linux64.d4f3a8a75577.sk.3.balrog/funsize.v1.mozilla-central.revision.linux64.d4f3a8a75577.sk.3.balrog
 
For that we would still have to know the rev of the former builds. We don't have that.
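
To make the difference between the two index routes concrete, here is a small sketch that builds both namespaces. The meaning of the individual segments (locale `sk`, partial number `3`) is guessed from the example paths in this bug and may be wrong; the function name is hypothetical.

```python
from typing import Optional


def funsize_index_namespace(branch: str, platform: str, locale: str,
                            partial: int, revision: Optional[str] = None) -> str:
    """Build a funsize index namespace as seen in the example links.

    Without a revision the 'latest' route is used, which needs no
    knowledge of the changeset of the build.
    """
    parts = ["funsize", "v1", branch]
    if revision:
        parts += ["revision", platform, revision]
    else:
        parts += ["latest", platform]
    parts += [locale, str(partial), "balrog"]
    return ".".join(parts)


if __name__ == "__main__":
    print(funsize_index_namespace("mozilla-central", "linux64", "sk", 3,
                                  revision="d4f3a8a75577"))
    print(funsize_index_namespace("mozilla-central", "linux64", "sk", 3))
```

Only the revision route depends on the changeset, which is exactly the piece of information we don't have for former builds.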

But that opens up another question. Given that we now produce a couple of partial updates, we might have to run tests on all of them and not only from the previous build. That means every finished task should trigger an update test.
Depends on: 1194762
We had a chat on Friday together with Armen and here is what we decided to do:

1. Get the update tests running on our infrastructure by listening to the Taskcluster notifications via Pulse
2. Let me test that the mozharness driven update tests work for nightly builds, and then use that in the funsize tasks to test the update
3. Run Windows and Linux via taskcluster, but OS X still needs to be run on our hardware given that taskcluster doesn't support it.
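
The platform split from step 3 could be expressed roughly like this. The platform identifiers and the runner labels are placeholders chosen for the example, not names used by mozmill-ci or Taskcluster.

```python
# Platforms assumed to be runnable on Taskcluster (Linux and Windows).
TASKCLUSTER_PLATFORMS = {"linux", "linux64", "win32", "win64"}


def update_test_runner(platform: str) -> str:
    """Decide where the update tests for a platform should run."""
    if platform in TASKCLUSTER_PLATFORMS:
        return "taskcluster"
    if platform == "mac":
        # OS X is not supported by Taskcluster, so it stays on our hardware.
        return "in-house"
    raise ValueError(f"unknown platform: {platform}")


if __name__ == "__main__":
    for name in ("linux64", "win32", "mac"):
        print(name, update_test_runner(name))
```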

As of today the link to the necessary manifest is now present on the balrog task, which means that I can start getting this integrated.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Flags: needinfo?(rail)
For the necessary changes in mozmill-ci I filed https://github.com/mozilla/mozmill-ci/issues/618.
We landed the PR for handling funsize partial updates today and all seems to be working fine:

mozilla-central:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&revision=8a6045d14d6b&filter-job_group_symbol=Ff&filter-job_group_symbol=Fr&filter-job_group_symbol=Fu

mozilla-aurora:
https://treeherder.allizom.org/#/jobs?repo=mozilla-aurora&revision=20a79fcdc75f&filter-job_group_symbol=Ff&filter-job_group_symbol=Fr&filter-job_group_symbol=Fu

What I am still missing are the full updates for daily builds which are older than those 4 days. For those I no longer get a Pulse message. Rail, how can we get this done?
Flags: needinfo?(rail)
I don't know what exactly you need to get the tests scheduled. Is it just the build ID of the latest nightly build/repacks, the URL of the complete MAR, or just an event that would trigger the tests?

BTW, when bug 1195365 is implemented, funsize will be responsible for publishing completes as well.
Flags: needinfo?(rail)
What I need is the build id of the source build to test. In this case that is the build which will get the complete mar served. So we are definitely blocked on bug 1195365.
Depends on: 1195365
Whiteboard: [blocked]
Depends on: 1198011
Depends on: 1198003
Depends on: 1189267
Everything has been implemented for Mozmill CI and I pushed the code to staging. That way we have a bake time over the weekend. If all works well, I will push all to production on Monday.

Rail, I will keep this bug open until the work for the full mar files has been done.
Currently our update tests are again not getting run, due to larger security changes for funsize tasks as covered by bug 1220252. This means balrog is no longer submitting Pulse messages at the moment, but the signing task still is. It is not known how long it will take to get this blocker fixed. So maybe we have to temporarily switch to the other routing key.

A recent task can be found here:
https://tools.taskcluster.net/task-inspector/#EpKfgaNnTUGEb-vea-e3yA/

According to Rail, a routing key which should work is: #.signing-provisioner-v1.signing-worker-v1.#

I will try this out now and have a Mozmill CI PR open later today if it works.
The temporary fix for Mozmill CI is covered by https://github.com/mozilla/mozmill-ci/pull/683.
The funsize part was done some months ago. I don't want to keep this bug open until the complete mar part has been fixed, so I'm closing it as fixed.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Mozilla QA → Mozilla QA Graveyard