Closed Bug 1373564 Opened 3 years ago Closed 3 years ago

Permafail test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 300.0 seconds with message: Update has been applied (app.update.staging.enabled=False)

Categories

(Testing :: Firefox UI Tests, defect)

Version 3
defect
Not set

Tracking

(firefox-esr52 unaffected, firefox54 unaffected, firefox55 fixed, firefox56 fixed)

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed
firefox56 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

So far we run the update tests across all the channels with e10s disabled. This was due some incompatible changes for esr45 in our mozharness script. So I changed this lately by enabling e10s for all our of our tests run in Mozmill-CI (https://github.com/mozilla/mozmill-ci/issues/601).

Now we have a new kind of test failure, which is related to applying the update, and a timeout after 300s:

05:59:49     INFO -  1497531588724	Marionette	TRACE	2 -> [0,1208,"getElementAttribute",{"id":"dc129dbc-297a-4bd2-bdc9-c5dc2c979efa","name":"id"}]
05:59:49     INFO -  1497531588726	Marionette	TRACE	2 <- [1,1208,null,{"value":"apply"}]
05:59:49     INFO -  1497531588730	Marionette	TRACE	2 -> [0,1209,"executeScript",{"scriptTimeout":null,"newSandbox":true,"args":["state"],"filename":"software_update.py","script":"\n          let ums = Components.classes['@mozilla.org/updates/update-manager;1']\n                    .getService(Components.interfaces.nsIUpdateManager);\n          return ums.activeUpdate[arguments[0]];\n        ","sandbox":"default","line":23}]
05:59:49     INFO -  1497531588733	Marionette	TRACE	2 <- [1,1209,null,{"value":"pending-service"}]

Log lines from the updater log:

05:59:44     INFO -  *** AUS:SVC Checker: checkForUpdates, force: true
05:59:44     INFO -  *** AUS:SVC Checker:getUpdateURL - update URL: https://aus5.mozilla.org/update/6/Firefox/56.0a1/20170614030206/WINNT_x86-msvc-x86/en-US/nightly/Windows_NT%206.1.1.0%20(x86)(noBug1296630v1)(nowebsense)/SSE3,1023/default/default/update.xml?force=1
05:59:44     INFO -  *** AUS:SVC Checker:checkForUpdates - sending request to: https://aus5.mozilla.org/update/6/Firefox/56.0a1/20170614030206/WINNT_x86-msvc-x86/en-US/nightly/Windows_NT%206.1.1.0%20(x86)(noBug1296630v1)(nowebsense)/SSE3,1023/default/default/update.xml?force=1
[..]
05:59:44     INFO -  *** AUS:SVC Checker:onLoad - request completed downloading document
05:59:44     INFO -  *** AUS:SVC Checker:getUpdateURL - update URL: https://aus5.mozilla.org/update/6/Firefox/56.0a1/20170614030206/WINNT_x86-msvc-x86/en-US/nightly/Windows_NT%206.1.1.0%20(x86)(noBug1296630v1)(nowebsense)/SSE3,1023/default/default/update.xml?force=1
05:59:45     INFO -  *** AUS:SVC Checker:onLoad - number of updates available: 1
05:59:45     INFO -  *** AUS:SVC isServiceInstalled - returning true
05:59:45     INFO -  *** AUS:SVC shouldUseService - returning true
05:59:45     INFO -  *** AUS:SVC getCanApplyUpdates - bypass the write checks because the Windows Maintenance Service can be used
05:59:45     INFO -  *** AUS:SVC Creating Downloader
05:59:45     INFO -  *** AUS:SVC UpdateService:_downloadUpdate
05:59:45     INFO -  *** AUS:SVC readStringFromFile - file doesn't exist: C:\Users\mozauto\AppData\Local\Mozilla\updates\FCE21B573705EDF7\updates\0\update.status
05:59:45     INFO -  *** AUS:SVC readStatusFile - status: null, path: C:\Users\mozauto\AppData\Local\Mozilla\updates\FCE21B573705EDF7\updates\0\update.status
05:59:45     INFO -  *** AUS:SVC Downloader:downloadUpdate - url: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20170615030208/Firefox-mozilla-central-56.0a1-win32-en-US-20170614030206-20170615030208.partial.mar?versionId=j3fQ4_CBHCFy2xWPHMq53K6Qs8PiieQq, path: C:\Users\mozauto\AppData\Local\Mozilla\updates\FCE21B573705EDF7\updates\0\update.mar, interval: 0
[..]
05:59:48     INFO -  *** AUS:SVC Downloader:onStartRequest - original URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20170615030208/Firefox-mozilla-central-56.0a1-win32-en-US-20170614030206-20170615030208.partial.mar?versionId=j3fQ4_CBHCFy2xWPHMq53K6Qs8PiieQq, final URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20170615030208/Firefox-mozilla-central-56.0a1-win32-en-US-20170614030206-20170615030208.partial.mar?versionId=j3fQ4_CBHCFy2xWPHMq53K6Qs8PiieQq
05:59:48     INFO -  *** AUS:SVC Downloader:onProgress - progress: 16981/7422943
[..]
05:59:49     INFO -  *** AUS:SVC Downloader:onProgress - progress: 7422943/7422943
05:59:49     INFO -  *** AUS:SVC Downloader:onStopRequest - original URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20170615030208/Firefox-mozilla-central-56.0a1-win32-en-US-20170614030206-20170615030208.partial.mar?versionId=j3fQ4_CBHCFy2xWPHMq53K6Qs8PiieQq, final URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-central/20170615030208/Firefox-mozilla-central-56.0a1-win32-en-US-20170614030206-20170615030208.partial.mar?versionId=j3fQ4_CBHCFy2xWPHMq53K6Qs8PiieQq, status: 0
05:59:49     INFO -  *** AUS:SVC Downloader:onStopRequest - status: 0, current fail: 0, max fail: 10, retryTimeout: 2000
05:59:49     INFO -  *** AUS:SVC Downloader:_verifyDownload called
05:59:49     INFO -  *** AUS:SVC Downloader:_verifyDownload downloaded size == expected size.
05:59:49     INFO -  *** AUS:SVC isServiceInstalled - returning true
05:59:49     INFO -  *** AUS:SVC shouldUseService - returning true
05:59:49     INFO -  *** AUS:SVC Downloader:onStopRequest - setting state to: pending-service
05:59:49     INFO -  *** AUS:SVC getCanStageUpdates - staging updates is disabled by preference app.update.staging.enabled

As it looks like the updater is stuck waiting for the service to apply the update. But that doesn't seem to happen, or does not finish. I do not see this kind of failure on mozilla-beta yet, so maybe some recent changes to the application updater caused this problem?
OS: Unspecified → Windows
Hardware: Unspecified → All
Summary: Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 300.0 seconds with message: Update has been applied. → Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 300.0 seconds with message: Update has been applied (Windows Maintenance Service)
Here a pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5293e5f89358&tochange=b266a8d8fd59

Maybe this is related to bug 1370576?
Flags: needinfo?(robert.strong.bugs)
Oh, I don't think this is related to turning on e10s because updates for numbers 2-4 are still working fine. Only the update number 1 is affected by yesterday.
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Here a pushlog:
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=5293e5f89358&tochange=b266a8d8fd59
> 
> Maybe this is related to bug 1370576?

That disabled update staging

Also
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Oh, I don't think this is related to turning on e10s because updates for
> numbers 2-4 are still working fine. Only the update number 1 is affected by
> yesterday.
So, clearing needinfo
Flags: needinfo?(robert.strong.bugs)
Summary: Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 300.0 seconds with message: Update has been applied (Windows Maintenance Service) → Permafail test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 300.0 seconds with message: Update has been applied (Windows Maintenance Service)
(In reply to Henrik Skupin (:whimboo) from comment #1)
> So far we run the update tests across all the channels with e10s disabled.
> This was due some incompatible changes for esr45 in our mozharness script.
> So I changed this lately by enabling e10s for all our of our tests run in
> Mozmill-CI (https://github.com/mozilla/mozmill-ci/issues/601).
Would be a good thing for you to figure out why these tests are failing for you with e10s enabled before enabling them in CI.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #5)
> > Maybe this is related to bug 1370576?
> 
> That disabled update staging

Correct, and since then the problem exists. All updates from day -1 were failing yesterday:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=035c25bef7b5e4175006e63eff10c61c2eef73f1&filter-searchStr=firefox%20update&filter-tier=1&filter-tier=2&filter-tier=3

With today's Nightly this most likely will also affect update number 2, and so on for the next days.

We do not modify this preference in our update tests.
Flags: needinfo?(robert.strong.bugs)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #6)
> Would be a good thing for you to figure out why these tests are failing for
> you with e10s enabled before enabling them in CI.

Please see my comment 4. Everything was fine until a change for the application updater happened recently, and which is most likely the bug I referred above.
The patch on bug 1370576 got uplifted to beta, so I assume we also will see permafailures for the next update tests on beta. CC'ing Florin for information.
Keywords: regression
I ran the firefox-ui-functional CI tests today and several times over the last few days and they have been passing.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=077b32be5d3dad72be5023da9ba049193a279d57
(In reply to Henrik Skupin (:whimboo) from comment #9)
> The patch on bug 1370576 got uplifted to beta, so I assume we also will see
> permafailures for the next update tests on beta. CC'ing Florin for
> information.

Oh, the uplift happened 2 days ago, so it will first be visible for updates on beta starting early next week, and that only from beta2 onward.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #10)
> I ran the firefox-ui-functional CI tests today and several times over the
> last few days and they have been passing.
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=077b32be5d3dad72be5023da9ba049193a279d57

Functional tests are not update tests. Those are totally different, and update tests we cannot run on a per commit basis.
(In reply to Henrik Skupin (:whimboo) from comment #12)
> (In reply to Robert Strong [:rstrong] (use needinfo to contact me) from
> comment #10)
> > I ran the firefox-ui-functional CI tests today and several times over the
> > last few days and they have been passing.
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=077b32be5d3dad72be5023da9ba049193a279d57
> 
> Functional tests are not update tests. Those are totally different, and
> update tests we cannot run on a per commit basis.
So, they aren't CI tests though they are named Mozmill-CI. :(

As for this being due to bug 1370576 that bug disables staging updates by default. The other CI tests don't assume that staging is enabled and when they test staging they make sure the pref for staging is set to true.  If these tests don't do this then that might be the problem though I'd be surprised since these tests passed on try and elsewhere. The patch in bug 1370576 has the pref change.

The important thing is that the CI tests are passing and this is likely isolated to these tests and this likely doesn't affect clients as has been the case with other bugs with these tests.
Flags: needinfo?(robert.strong.bugs)
Try setting the app.update.staging.enabled pref to true and run your tests again.
Flags: needinfo?(hskupin)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #13)
> So, they aren't CI tests though they are named Mozmill-CI. :(

Mozmill CI is our own managed CI system. Please do not mix it up with Taskcluster. And that system is in place for 9 years now! I would assume that it's clear meanwhile why we needed and still need it.

> As for this being due to bug 1370576 that bug disables staging updates by
> default. The other CI tests don't assume that staging is enabled and when
> they test staging they make sure the pref for staging is set to true.  If
> these tests don't do this then that might be the problem though I'd be
> surprised since these tests passed on try and elsewhere. The patch in bug
> 1370576 has the pref change.
> 
> The important thing is that the CI tests are passing and this is likely
> isolated to these tests and this likely doesn't affect clients as has been
> the case with other bugs with these tests.

If there is still this negative attitude against our update tests, and your CI tests really cover everything well enough, I don't see why I should spend tons of time each week to observe test failures, analyze regressions, and assist in helping to get problems fixed. This is personally disappointing. Anyway, further discussions might/should happen on the original email thread which got stalled again.

(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #14)
> Try setting the app.update.staging.enabled pref to true and run your tests
> again.

I will do this and will report back soon.
(In reply to Henrik Skupin (:whimboo) from comment #15)
> (In reply to Robert Strong [:rstrong] (use needinfo to contact me) from
> comment #13)
> > So, they aren't CI tests though they are named Mozmill-CI. :(
> 
> Mozmill CI is our own managed CI system. Please do not mix it up with
> Taskcluster. And that system is in place for 9 years now! I would assume
> that it's clear meanwhile why we needed and still need it.
> 
> > As for this being due to bug 1370576 that bug disables staging updates by
> > default. The other CI tests don't assume that staging is enabled and when
> > they test staging they make sure the pref for staging is set to true.  If
> > these tests don't do this then that might be the problem though I'd be
> > surprised since these tests passed on try and elsewhere. The patch in bug
> > 1370576 has the pref change.
> > 
> > The important thing is that the CI tests are passing and this is likely
> > isolated to these tests and this likely doesn't affect clients as has been
> > the case with other bugs with these tests.
> 
> If there is still this negative attitude against our update tests, and your
> CI tests really cover everything well enough, I don't see why I should spend
> tons of time each week to observe test failures, analyze regressions, and
> assist in helping to get problems fixed. This is personally disappointing.
> Anyway, further discussions might/should happen on the original email thread
> which got stalled again.
I'm overall fine with these tests. I'm not fine with developers repeatedly getting pulled into fixing them especially when the problem with the test is typically not a problem with the client. There has even been a time where you have told release drivers to ask me to work on these tests failing when they failed due to a patch that you landed (e.g. changing how the channel was set).
So I can see the same not only on Windows but any platform when the preference "app.update.staging.enabled" is set to False. Looking at the code of our update tests, we never prepared for that mode. And since the preference has been flipped on Windows, without checking or updating our tests, it's just broken.

The simplest solution for now would be to just enable the pref for our tests, and forcing staged updates to happen. With that we won't test the real path users would see right now anymore, but given the unclear situation of the tests I don't want to spend time now to get them updated to support the non-staged update path.
Blocks: 1370576
Flags: needinfo?(hskupin)
OS: Windows → All
Summary: Permafail test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 300.0 seconds with message: Update has been applied (Windows Maintenance Service) → Permafail test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 300.0 seconds with message: Update has been applied (app.update.staging.enabled=False)
Comment on attachment 8878405 [details]
Bug 1373564 - Force enable update staging mode to prevent perma failures for update tests.

https://reviewboard.mozilla.org/r/149744/#review154414
Attachment #8878405 - Flags: review?(dburns) → review+
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/21b600e847b9
Force enable update staging mode to prevent perma failures for update tests. r=automatedtester
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Note: the CI tests in tree test both of those cases as well as many others since they both do occur to clients under various circumstances.
https://hg.mozilla.org/mozilla-central/rev/21b600e847b9
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
test-only patch needed on beta to fix a regression caused by bug 1370576.
Whiteboard: [checkin-needed-beta]
So we've run 6 rounds of tests (3 builds) on beta since the fix landed, and we haven't hit this. I think we can call it verified. Great job to jump on it and get it fixed in such short time!
You need to log in before you can comment on or make changes to this bug.