Closed Bug 1285340 Opened 8 years ago Closed 6 years ago

TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: u'20160706030233' != '20160707083343' (target buildids different)

Categories

(Testing :: Firefox UI Tests, defect, P5)


Tracking

(firefox50 affected, firefox51 affected)

RESOLVED WONTFIX

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Due to bug 1284516 we have a huge delay for partial updates. If a new nightly is triggered shortly after the last one, we end up with mixed partials. As a result our update tests fail because they check for the wrong target buildid.
This bug has started to happen a lot on Windows for mozilla-central. The right patch file seems to be offered, but the update is not actually applied and we stay on the old version.

https://treeherder.mozilla.org/logviewer.html#?job_id=4701023&repo=mozilla-central

> 'build_pre': {'buildid': u'20160818030226'
>     'update_snippet': '<?xml version="1.0"?>\n<updates>\n <update type="minor" displayVersion="51.0a1" appVersion="51.0a1" platformVersion="51.0a1" buildID="20160820030224">\n <patch type="complete" URL="https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20160820030224/Firefox-mozilla-central-51.0a1-win32-en-US.complete.mar?versionId=MdsWHOSXNi_85EFd.q0r5DavyHjg_QPa" hashFunction="sha512" hashValue="db3625c607f3ba6b9945d00b2ba6e603e62df7efd0c2578b45730ff0b1fff8d28b89342a20b61df7d136beecbc68cb4b11f8d64587eaab708ca10861817d461d" size="50128697"/>\n <patch type="partial" URL="https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20160820030224/Firefox-mozilla-central-51.0a1-win32-en-US-20160818030226-20160820030224.partial.mar?versionId=aM3o45hP41ch4_nzB.rAEegPjBqeYxDJ" hashFunction="sha512" hashValue="6ea477b15c7f182dbb17da0e509b531e4f6a1beca049e9db738f11b32ca661e4f88082fd8b8e444b8772e9d39f5b3f3114a3816fa1324106949ae649484be192" size="6477861"/>\n </update>\n</updates>', 
[..]
> [{'build_post': {'buildid': u'20160818030226',
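
For clarity, a minimal sketch of the two failure modes described above (illustrative only, not the harness code; the buildids are taken from the quoted excerpt):

build_pre = {"buildid": "20160818030226"}     # build the test started with
build_post = {"buildid": "20160818030226"}    # build reported after the "update"
target_buildid = "20160820030224"             # buildID offered in the update snippet

if build_post["buildid"] == build_pre["buildid"]:
    print("update was offered but not applied")      # the Windows case shown above
elif build_post["buildid"] != target_buildid:
    print("updated to an unexpected build")          # the "target buildids different" case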
These failures seem to only occur on Windows 8, for both 32-bit and 64-bit. I will have a look at what could be wrong here.
Depends on: 1298771
Depends on: 1008135
Both branches mozilla-central and mozilla-aurora are actually affected by this test failure.
Matt, I feel this is the same issue as bug 1293404 but for Windows. We seem to restart the application, but our Marionette session is still alive with the non-updated version of Firefox. As a result we report the wrong application info. 

Do you think that it is another fallout from bug 1272614?

FYI: I originally hijacked this older bug because I thought those failures were only rare intermittents. But actually we got a spike around the same time as bug 1293404 on central and aurora.
Flags: needinfo?(mhowell)
See Also: → 1293404
Are these tests expected to pass when run locally? I'm always getting the result that no update is available, but it looks like the production update server is being used, meaning that's to be expected (also meaning the test depends on Balrog being up and offering updates, which doesn't seem ideal). So I don't really know how to debug this, especially on Windows. Any advice?
Flags: needinfo?(mhowell) → needinfo?(hskupin)
The update tests only work for Nightly and Release builds, and when updates are announced via Balrog. So what you should do is the following:

1. Clone mozilla-central
2. Download an older Nightly which definitely gets an update served (at least a day old)
3. Run "mach firefox-ui-update --binary %path/to/firefox% --update-target-buildid=%buildid/of/updated/build%"

I haven't yet tried to run updates for a debug build. In that case you would definitely have to override the update URL via something like "--update-override-url http://people.mozilla.org/~hskupin/update.xml". That way you can serve a custom update, but I'm not sure whether some other knobs have to be turned in Firefox to allow such an update.
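
For illustration, a hedged sketch of driving such a local run from Python (the binary path and target buildid are placeholders, not real values; run it from within the mozilla-central clone):

import subprocess

# Hypothetical invocation of step 3 above; replace the binary path with your
# downloaded older Nightly and the buildid with the build you expect to be
# updated to.
subprocess.check_call([
    "./mach", "firefox-ui-update",
    "--binary", "/path/to/old-nightly/firefox",
    "--update-target-buildid=20160820030224",
])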
Flags: needinfo?(hskupin)
Matt, it should be easier to get started with bug 1293404 first, which I can easily reproduce locally.
Okay, thanks. I don't have a good native Linux machine, and I know Windows a lot better, so that's usually where I prefer to start. But easy local reproduction trumps that, so I'll take your advice and get on bug 1293404 instead.
Just to note here... this failure seems to only happen on Windows 8.1 (32- and 64-bit), but not on Windows 7 or XP.
OS: Unspecified → Windows 8.1
Hardware: Unspecified → All
Btw, today is the first day that everything passed on Windows for mozilla-central:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=91c2b9d5c1354ca79e5b174591dbb03b32b15bbf&filter-searchStr=fxup&filter-tier=1&filter-tier=2&filter-tier=3

Not sure if my landed fix for bug 1274943 changed something. We will see tomorrow for Aurora, which hadn't received this patch when today's nightly builds were created.
It's clearly not fixed yet.
Those failures still come up a lot on days when we have multiple nightly builds for a channel. An example is the second nightly build of mozilla-aurora from last Saturday:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=070cf059bb70aa991d50b4d266987adf26685915&filter-searchStr=update&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=4078966

The assertion failure is: '20161105004017' != '20161106004004'

The pre-build is 20161104004016, which gets offered a complete patch to 20161105004017 while we expect 20161106004004.

Please keep in mind that we start our tests about 5 minutes after we receive the funsize Pulse notifications! From our point of view that means the updates should be available, but as seen here that apparently is not always the case.

https://github.com/mozilla/mozmill-ci/blob/master/jenkins-master/jobs/mozilla-central_update/config.xml#L79

Is there something wrong with Balrog, maybe caching again? Or is Balrog confused by multiple nightlies per day?

Here is a full log of a failing test on Linux, which shows the details in the update summary table at the end of the log:

https://firefox-ui-tests.s3.amazonaws.com/99c5e926-8745-440c-81ad-1e1a3f00d884/log_info.log
Flags: needinfo?(rail)
Flags: needinfo?(bhearsum)
OS: Windows 8.1 → All
There is an interesting pattern for Nov 6th update tests on mozilla-aurora:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=8027a7b723c9db7254bdb0bea3bc8937220daa7e&filter-searchStr=update&filter-tier=1&filter-tier=2&filter-tier=3

This failure is only visible for update numbers 1 and 2 but not for 3 and 4.
See Also: → 1316567
I'm not aware of any caching issues that would be causing a problem here. We have an extremely short TTL on blob caching (60s), so I highly doubt that would be impacting you here.

It would be extremely helpful in the future to provide much more detailed information if you want to try and debug something after the fact. The full update URL, the full XML response, and the full browser console log are _extremely_ helpful in debugging. Trying to recreate these things later is next to impossible.

It looks to me like updates were locked to Firefox-mozilla-aurora-nightly-20161105004017 from about 07:18 GMT on November 6th, until about 09:45 GMT on November 7th, so that's probably what caused your failure. Rail is the one who locked them, so I'll leave it to him for further explanation.
Flags: needinfo?(bhearsum)
I temporarily locked the updates because of bug 1315530
Flags: needinfo?(rail)
I see, thanks Rail. I wonder if there is a way to improve communication whenever update channels are getting closed/opened. I could ask you each time we see those issues, but I doubt that would scale.

Also, sometimes only a couple of update tests fail with this failure while all others pass. Sadly all those instances are currently on Windows, where we do not have any app.update.log output available (bug 1299601). So it's hard to provide full logs, but we do have the update URLs and the full XML responses.
We had quite a lot of these failures on the auroratest channel after the recent merge to mozilla-aurora. The assertions looked like: AssertionError: u'20170307070455' != '20170307004018'

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=577083e852674484f8064f45a9b99cf13e1f9b6f&filter-searchStr=firefox%20ui%20update&filter-tier=1&filter-tier=2&filter-tier=3
I have been seeing this failure a lot again recently. Here is one example:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=Firefox%20UI&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=83344632

It's an update request for build https://archive.mozilla.org/pub/firefox/nightly/2017/03/2017-03-12-08-47-40-mozilla-aurora/firefox-54.0a2.en-US.linux-i686.tar.bz2, with a given target build id of `20170313084713`.

Full log of the test:
https://firefox-ui-tests.s3.amazonaws.com/947e2ec3-5573-4339-8291-d13a46306420/log_info.log

Test result:

TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update | AssertionError: u'20170313084027' != '20170313084713'

Which means instead of build id `20170313084713` a build with id `20170313084027` was offered.

The update URL which we used was:

https://aus5.mozilla.org/update/6/Firefox/54.0a2/20170312084740/Linux_x86-gcc3/en-US/aurora/Linux%203.13.0-106-generic%20(GTK%203.10.8%2Clibpulse%204.0.0)/NA/default/default/update.xml?force=1

Here is the data of the update snippet:

<?xml version="1.0"?>\n<updates>\n    <update type="minor" displayVersion="54.0a2" appVersion="54.0a2" platformVersion="54.0a2" buildID="20170313084027">\n        <patch type="complete" URL="https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20170313084027/Firefox-mozilla-aurora-54.0a2-linux-en-US.complete.mar?versionId=YEjBQ5tesjqu._zbDv83CaCTGNmz_.9u" hashFunction="sha512" hashValue="dfec66c64b2fc8abc2e3eecb26fd0081e9112084060f834acaba1995bc5b051984b4f9adaed214afde8c855c0cfb3c2cbb4d8ae2f6093d474a5362d226b2b196" size="68598785"/>\n        <patch type="partial" URL="https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20170313084027/Firefox-mozilla-aurora-54.0a2-linux-en-US-20170312084740-20170313084027.partial.mar?versionId=mhm_G98qSFCzKr9CAixuH2LaPEnFLUN1" hashFunction="sha512" hashValue="a00601323f3d9bc7aa3e7ba639a1cff58394cd9d2a397a583e23adcf8dc1c62aaaefcf288b61391966edd573479cd69ebc85150cd335b80e04313889e48566b6" size="11736142"/>\n    </update>\n</updates>

Here we go. So I assume the target buildids reported within the funsize messages are wrong. Rail, did something change lately?
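
For reference, a minimal sketch (not the actual harness code) of extracting the served buildID from such a snippet and comparing it against the target buildid we were given; the shortened XML below only mirrors the structure of the real snippet above:

import xml.etree.ElementTree as ET

# Shortened stand-in for the real snippet; only the structure matters here.
snippet_xml = """<?xml version="1.0"?>
<updates>
  <update type="minor" displayVersion="54.0a2" appVersion="54.0a2"
          platformVersion="54.0a2" buildID="20170313084027">
    <patch type="complete" URL="https://example.invalid/complete.mar"
           hashFunction="sha512" hashValue="..." size="68598785"/>
  </update>
</updates>"""

expected_buildid = "20170313084713"   # target buildid from the funsize Pulse message
offered_buildid = ET.fromstring(snippet_xml).find("update").get("buildID")

# Raises here, mirroring the failing assertion: u'20170313084027' != '20170313084713'
assert offered_buildid == expected_buildid, \
    "%s != %s (target buildids different)" % (offered_buildid, expected_buildid)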
Flags: needinfo?(rail)
It looks like this all started by Monday last week on Aurora, which was merge day.
Could it be similar to bug 1347107?
No, we get those target build ids directly from the funsize Pulse messages, so we simply trust them.
Attached file funsize message
(In reply to Henrik Skupin (:whimboo) from comment #27)
> I can see this failure a lot recently again. Here one example:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-
> searchStr=Firefox%20UI&filter-tier=1&filter-tier=2&filter-
> tier=3&selectedJob=83344632
> 
> It's an update request for build
> https://archive.mozilla.org/pub/firefox/nightly/2017/03/2017-03-12-08-47-40-
> mozilla-aurora/firefox-54.0a2.en-US.linux-i686.tar.bz2, with a given target
> build id of `20170313084713`.
> 
> Full log of the test:
> https://firefox-ui-tests.s3.amazonaws.com/947e2ec3-5573-4339-8291-
> d13a46306420/log_info.log
> 
> TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update |
> AssertionError: u'20170313084027' != '20170313084713'
> 
> Which means instead of build id `20170313084713` a build with id
> `20170313084027` was offered.
 
As you can see in this attachment for the above job we have:

> "to_buildid": "20170313084713",

And that differs from what we actually run.
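
A small sketch of where that value comes from (the payload below is trimmed to the single field quoted from the attachment; the rest of the real message is omitted):

import json

# Trimmed-down stand-in for the attached funsize message; only "to_buildid"
# is taken from the attachment, everything else in the real payload is left out.
message = json.loads('{"to_buildid": "20170313084713"}')

# This is the value the update tests pass as --update-target-buildid.
target_buildid = message["to_buildid"]
print("expected target buildid:", target_buildid)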
Ah, I see. Let's see if Simon has any thoughts on this.
Flags: needinfo?(rail) → needinfo?(sfraser)
Things I've found out so far:  There's no mention in the funsize scheduler logs, or treeherder, of a build for 20170313084713.  The only one on that day is 20170313084027. The first mention of 20170313084713 in the funsize scheduler logs is the day afterwards, when it starts generating partials from that build.

The entry in balrog for the 20170312084740-20170313084713 partial was added by funsize at 09:51:18 GMT on 13 March (Data version 8 for Firefox-mozilla-aurora-nightly-20170313084713 in balrog's ui). The error in the logs above is at 02:59:29, presumably Pacific time, so roughly 8 minutes after the data was added to balrog. This means it doesn't look like a race between balrog updates and tests being scheduled. Does balrog make a release available before all the locales and partials are available?
Flags: needinfo?(sfraser)
I've spoken to bhearsum about how this works in balrog, and it seems that it's down to a race after all, just one that happens when builds are retriggered. Assuming the 084027 was the original, and the 084713 the retrigger, then the rule for 'from 20170312084740, what should I get?' is all down to whichever task finishes last. 

My understanding:

1. 084713 finished generating an en-US partial for 20170312084740, updated the rule in balrog, and sent a pulse message
2. Afterwards 084027 finished that same partial from 20170312084740, and overwrote the rule.
3. Sometime after 1, the update test received the pulse message, but the test ran sometime after 2.

It doesn't appear that there's anything we can do about this in the short term, although if we swap to 'nightlies on push' as some of the plans indicate we might, this will need to be revisited so that it's more stable.
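
A small sketch of that last-write-wins behaviour (illustrative model only, not Balrog code):

# The answer to "from 20170312084740, which build am I offered?" ends up being
# whatever the last finished funsize task wrote.
offered_for = {}

def funsize_task_finished(from_buildid, to_buildid):
    # Each finished partial-generation task overwrites the entry for its
    # source build, regardless of which nightly is actually newer.
    offered_for[from_buildid] = to_buildid

funsize_task_finished("20170312084740", "20170313084713")  # retrigger finishes first, Pulse message goes out
funsize_task_finished("20170312084740", "20170313084027")  # original finishes later and overwrites the rule

# The test was started because of the first Pulse message and expects 084713,
# but by the time it asks Balrog it is offered 084027.
print(offered_for["20170312084740"])   # -> 20170313084027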
Does this mean it all comes back to bug 1284516 as the underlying problem? That partial updates are made available too late, and by the time we try to run our tests other partials have already overwritten the Balrog rules?
More that they're available at staggered intervals, was my understanding. 'When they are ready' rather than 'when this release is ready.' I suppose it's related.
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #34)
> 1. 084713 finished generating an en-US partial for 20170312084740, updated
> the rule in balrog, and sent a pulse message
> 2. Afterwards 084027 finished that same partial from 20170312084740, and
> overwrote the rule.

Why do we allow those jobs to overwrite the rule for an older build? Wouldn't that mean a user who has already upgraded to 084713 gets a downgrade offer to 084027? AFAIK this should never happen.

Also, why do we process the generation of partials for newer builds before finishing off queued items that have been waiting longer?
Flags: needinfo?(sfraser)
(In reply to Henrik Skupin (:whimboo) from comment #37)

> Why do we allow those jobs to overwrite the rule for an older build?

As I understand it, Balrog trusts that the requests are doing the right thing, and will make the update requested of it. It doesn't really have a concept of 'older', just 'earlier in the revision history'. Even adding that, we would still have some locales/platforms being updated to different builds in a race like this, depending on which tasks finished first.

> Wouldn't it mean that a user which already has upgraded to 084713 gets an
> downgrade offer to 084027? AFAIK this should never happen.

There's no update path from 084713 to 084027, so it won't be offered. It only affects partials, not completes.

> Also why are we processing the generation of partials for newer builds first
> before finishing off queued items which are longer in the queue?

I would have to guess, but my guess is that there are no dependencies between different task graphs, to specify the order in which they are processed. I don't know taskcluster or buildbot well enough to know how they prioritise and schedule tasks.
Flags: needinfo?(sfraser)
I think the problem we have here is that we somehow build Aurora twice in a row, and that happens every day. Besides that, there is also a build very early in the morning. Here are the entries for the last few days:

https://archive.mozilla.org/pub/firefox/nightly/2017/03/

Dir 	2017-03-17-00-40-22-mozilla-aurora-l10n/ 		
Dir 	2017-03-17-00-40-22-mozilla-aurora/
Dir 	2017-03-17-08-40-58-mozilla-aurora-l10n/ 		
Dir 	2017-03-17-08-40-58-mozilla-aurora/ 		
Dir 	2017-03-17-08-47-20-mozilla-aurora-l10n/ 		
Dir 	2017-03-17-08-47-20-mozilla-aurora/

Dir 	2017-03-18-00-40-03-mozilla-aurora-l10n/ 		
Dir 	2017-03-18-00-40-03-mozilla-aurora/
Dir 	2017-03-18-08-41-35-mozilla-aurora-l10n/ 		
Dir 	2017-03-18-08-41-35-mozilla-aurora/ 		
Dir 	2017-03-18-08-48-10-mozilla-aurora-l10n/ 		
Dir 	2017-03-18-08-48-10-mozilla-aurora/

Dir 	2017-03-19-00-40-06-mozilla-aurora-l10n/ 		
Dir 	2017-03-19-00-40-06-mozilla-aurora/
Dir 	2017-03-19-08-42-08-mozilla-aurora-l10n/ 		
Dir 	2017-03-19-08-42-08-mozilla-aurora/ 		
Dir 	2017-03-19-08-48-16-mozilla-aurora-l10n/ 		
Dir 	2017-03-19-08-48-16-mozilla-aurora/

Why is that happening? When I look at Treeherder I only see a single Nightly job per day.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=nightly&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=84883088
Depends on: 1348758
It looks like this is a problem with the Taskcluster-generated builds, which happen twice. I filed bug 1348758 for that.
With bug 1348758 fixed now, I expect a large drop in those failures. Maybe some will remain for different reasons; I will keep an eye on them.
Sure thing. Happy to look into any more reasons, when they appear.
Summary: TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: u'20160706030233' != '20160707083343' (target builids different) → TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: u'20160706030233' != '20160707083343' (target buildids different)
(In reply to OrangeFactor Robot from comment #45)
> 98 failures in 166 pushes (0.59 failures/push) were associated with this bug
> yesterday.   
> 
> Repository breakdown:
> * mozilla-beta: 98
> 
> Platform breakdown:
> * windows7-64: 20
> * windows8-64: 19
> * linux32: 12
> * linux64: 11
> * windows7-32: 8
> * windows8-32: 7
> * osx-10-9: 7
> * osx-10-11: 7
> * osx-10-10: 7
> 
> For more details, see:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1285340&startday=2017-06-23&endday=2017-06-23&tree=all

These failures were due to incorrect setup of the beta channel (55b3 was available instead of 55b4). They should be ignored.
Note that I've also removed the associations to this bug and have reclassified all corresponding failing jobs as "expected fail" due to the incorrect setup of the channel.
Failures from comment 51 are not associated with this bug - those showed up on ESR 52.4.1, but are due to the fact that the build was offered as an update only on Mac OS X - so the Windows & Linux failures are expected (the OS X failures were due to some misconfiguration by releng).

I've removed classification in treeherder and re-classified as "expected fail".
We are not going to fix the kinds of failures listed here, also because they only happen for esr52.
Status: NEW → RESOLVED
Closed: 6 years ago
Priority: -- → P5
Resolution: --- → WONTFIX