Bug 768417 (Closed) - Opened 12 years ago - Closed 12 years ago

Update tests are failing on Linux with "Downloaded update has been applied"

Categories

(Mozilla QA Graveyard :: Mozmill Tests, defect, P1)

Hardware: All
OS: Linux

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: whimboo, Assigned: AndreeaMatei)


Details

(Whiteboard: [mozmill-test-failure] s=q3 u=failure c=update p=1)

Attachments

(1 file)

We recently added the fallback update tests to Mozmill CI, and since then the fallback updates have been failing. Partial updates work fine.

The message we get is: "Downloaded update has been applied."

The reason is a timeout while applying the update on Linux. Right now we allow 60 s for the apply process. Why does it take so long on Linux64 to get the update applied? Ehsan, do you have an idea? Is this a known bug in Firefox on this platform?
Actually this also happened for Linux x86 on Aurora:
http://mozmill-ci.blargon7.com/#/update/report/a7655636e327552d4750d1013c06975b
Hardware: x86_64 → All
Dupe of bug 764587, which has been fixed on both central and aurora.  Please test on the trunk of both trees.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Ehsan, this is not about the download, but about applying the downloaded patch. Also, this happens with the most recent Nightly and Aurora builds. I don't have time to investigate today but will make sure to look at it tomorrow.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Ehsan, this is not about the download, but about applying the downloaded
> patch. Also, this happens with the most recent Nightly and Aurora builds.
> I don't have time to investigate today but will make sure to look at it
> tomorrow.

OK, in that case I think you need to increase the timeout you use.  Also, that doesn't need to block bug 307181.
No longer blocks: bgupdates
But why does it take longer than 60 seconds to apply an update? That is quite long. If you had to choose a duration, what would you pick?
Well, I'm not sure.  This stuff is mostly I/O, so it really depends on the specs of the machines running the test, etc.  I think the best way to test this is to have a watchdog process for the updater app, and just look for it dying (either gracefully or crashing etc).  Without that, the best I can suggest is to measure how long the update takes on those machines in like 10 runs, and pick the maximum time and multiply it by 1.5...
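As a rough sketch of that sizing rule (the durations below are made up for illustration, not measurements from our machines), the timeout could be derived like this:

// Hypothetical apply durations in ms, collected from ~10 runs on one machine.
var APPLY_DURATIONS_MS = [34000, 41000, 38000, 52000, 47000, 39000, 44000, 36000, 49000, 40000];

// Take the slowest observed run and add the suggested 1.5x safety margin.
var TIMEOUT_APPLY_MS = Math.max.apply(Math, APPLY_DURATIONS_MS) * 1.5; // 78000 ms here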
The hardware our tests run on varies a lot, and we often don't even know its specs, because our tests can be run by anyone fairly easily. So we can't pick a single timeout that is small and still reliable. Instead I would say we should define a hard limit within which the patch definitely should have been applied. IMHO I would use 5 minutes for that.
Summary: Fallback update tests are failing on Linux64 with "Downloaded update has been applied" → Update tests are failing on Linux with "Downloaded update has been applied"
Attached patch Patch v1 (Splinter Review)
Assignee: nobody → hskupin
Status: REOPENED → ASSIGNED
Attachment #637792 - Flags: review?(anthony.s.hughes)
Hmm, if this is the case, I think using a timeout of any value is a bad idea.  You should ideally switch to a watchdog.
Well, we have control over our own boxes, which we need for release testing; those are our primary focus. Results from the crowd will end up on another dashboard, so those don't conflict. Would you mind giving me a hint about the watchdog approach you mean? Thanks.
Comment on attachment 637792 [details] [diff] [review]
Patch v1

Anthony is out today, so I'm asking Jeff for a sanity check. This patch raises the timeout for waiting until the downloaded update has been applied in the background. The former 60 s were too short, but 5 minutes should be enough for nearly any system.
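For reference, here is a minimal sketch of what the change amounts to; the constant and helper names below are assumptions for illustration, not the actual mozmill-tests code (the real change is in attachment 637792):

// Wait up to 5 minutes (formerly 60 s) for the background update to be applied.
const TIMEOUT_UPDATE_APPLYING = 300000;

assert.waitFor(function () {
  // "applied" is assumed here to be the update state once staging has finished.
  return softwareUpdate.activeUpdate.state === "applied";
}, "Downloaded update has been applied.", TIMEOUT_UPDATE_APPLYING);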
Attachment #637792 - Flags: review?(anthony.s.hughes) → review?(jhammel)
Attachment #637792 - Flags: review?(jhammel) → review+
Given the simplicity of the patch, I pushed it directly to both affected branches:

http://hg.mozilla.org/qa/mozmill-tests/rev/433c9c4c36d9 (default)
http://hg.mozilla.org/qa/mozmill-tests/rev/44860ead0e23 (aurora)

For the watchdog proposal we can continue in another bug, once I have the necessary information and see a need to work on it in the near future.

Let's call this one fixed.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to Ehsan Akhgari [:ehsan] from comment #9)
> Hmm, if this is the case, I think using a timeout of any value is a bad
> idea.  You should ideally switch to a watchdog.

Sure, I mean some piece of code that watches the updater process and reports back when it finishes, either gracefully or by crashing for example, and then checks the status of the staged update.
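A rough sketch of that watchdog idea; all of the helper names below are hypothetical, since nothing like this exists in the test library yet:

// Poll the updater process instead of relying on a fixed timeout, then let the
// caller verify the staged update once the process has gone away.
function watchUpdater(getUpdaterProcess, onFinished, intervalMs) {
  var timer = setInterval(function () {
    var proc = getUpdaterProcess();   // hypothetical handle to the updater process
    if (!proc || !proc.isRunning) {   // exited gracefully or crashed
      clearInterval(timer);
      onFinished();                   // e.g. assert that the staged update state is "applied"
    }
  }, intervalMs);
}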
Even 5 minutes does not work on that machine! I will have to investigate this more closely now, probably by running the tests on the affected VM.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The strange thing is that this only happens on the 32-bit machine but not the 64-bit one. I will restart the 32-bit one and check whether that helps.
Status: REOPENED → ASSIGNED
I can't work on it right now. I hope someone can pick this up.
Severity: normal → major
Priority: -- → P2
(In reply to Henrik Skupin (:whimboo) from comment #16)
> I can't work on it right now. I hope someone can pick this up.

What needs to happen here? Is this a coding problem or an infrastructure problem? What is the impact on releases should this continue to go unfixed?
Andreea, can you please make this your priority for today? Once we get the final beta builds and have to run the update tests, it will affect a bunch of builds. It looks like it's mostly a timing issue in our test. Ask me on IRC whenever you have questions. Thanks!
Assignee: hskupin → andreea.matei
Priority: P2 → P1
Working on it.
I tried to reproduce the failure on Ubuntu 12.04, 32-bit, on a heavily loaded system, with both the Nightly and Aurora versions from yesterday. The tests are not failing.
Reports:
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee66e4b6c
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee66e5536

Will continue to investigate on Ubuntu 11.10, possibly also with older builds.
Not reproducible on a heavily loaded Ubuntu 11.10 32-bit machine either.

Reports:
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee672a8d9
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee67299bd
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee6720670

Henrik, maybe something changed since your last check in comment 15. I will test tomorrow on a 64-bit machine.
OK, so the results from Ubuntu 11.10 and 12.04, both 64-bit, under normal and heavy load, all passed:
Nightly:
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee69c06bc
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee69be1ca

Aurora: 
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee69bf85b
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee69be0fc

The only difference between these results and the report in the bug URL is the Mozmill version; there it was Mozmill 1.5.13, so I believe 1.5.17 solved the problem.

Should I check whether we are safe with a smaller timeout for applying the update, or are we comfortable with 300000 ms? It didn't take long in my test runs to apply it.
Whiteboard: [mozmill-test-failure] → s=2012-8-27 u=failure c=update [mozmill-test-failure]
(In reply to Andreea Matei [:AndreeaMatei] from comment #21)
> The only difference between these results and the report in the bug URL is
> the Mozmill version; there it was Mozmill 1.5.13, so I believe 1.5.17
> solved the problem.

Do our reports still show failures with Mozmill 1.5.17 and tests across the different versions of Firefox? If not you probably want to run ondemand tests on our Mozmill CI staging instance.

> Should I check whether we are safe with a smaller timeout for applying the
> update, or are we comfortable with 300000 ms? It didn't take long in my
> test runs to apply it.

I cannot answer this question yet as long as we do not know the root cause of this problem on our CI system.
Whiteboard: s=2012-8-27 u=failure c=update [mozmill-test-failure] → s=2012-8-27 u=failure c=update p=1 [mozmill-test-failure]
 (In reply to Henrik Skupin (:whimboo) from comment #22)
> Do our reports still show failures with Mozmill 1.5.17 and tests across the
> different versions of Firefox? If not you probably want to run ondemand
> tests on our Mozmill CI staging instance.

No, the last failures were on August 21st; since then all reports have passed, with either Mozmill 1.5.17 or 1.5.18.

On August 21st there were several failures, on Ubuntu 11.10 (x86), with the fr and en-US locales, with Mozmill 1.5.17.

Looking back a few months, I found that this error started on June 8th.
In the pushlog, something related landed a week earlier, on June 1st: bug 759065 and bug 760290.

I researched ondemand tests today on the wiki page, but I believe I need Vlad's help to trigger them, so I will come back tomorrow with the outcome.
This failure is still present on our staging instance:
http://mozmill-staging.blargon7.com/#/update/report/671677a5d9d5ca25f3cf5ae1c4144ff3

Could be that we are affected by bug 760290. It should be possible for you to trigger such a test on your own (via build now) and watch the process while the test is running. Vlad has VPN access.
Whiteboard: s=2012-8-27 u=failure c=update p=1 [mozmill-test-failure] → [mozmill-test-failure] s=2012-8-27 u=failure c=update p=1
(In reply to Henrik Skupin (:whimboo) from comment #24)
> This failure is still present on our staging instance:
> http://mozmill-staging.blargon7.com/#/update/report/
> 671677a5d9d5ca25f3cf5ae1c4144ff3
> 
> Could be that we are affected by bug 760290. It should be possible for you
> to trigger such a test on your own (via build now) and watch the process
> while the test is running. Vlad has VPN access.

Yup, we'll work together on this one.
So we triggered some tests, and also looking at the other ones that failed with our error, I saw the following problem in the output console:
AUS: UI General : getPref - failed to get preference: app.update.billboard.test_url

I've looked at some files on MXR (all of which were also involved in bug 760290); this one is related:
http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/content/updates.js
and the pref is used here after setting the state of the Update:
http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/content/updates.js#464

Will dig some more tomorrow and look at the connections with fresh eyes.
(In reply to Andreea Matei [:AndreeaMatei] from comment #26)
> So we triggered some tests, and also looking at the other ones that failed
> with our error, I saw the following problem in the output console:
> AUS: UI General : getPref - failed to get preference:
> app.update.billboard.test_url

That's unrelated. What's important here is whether you can see the 'applying update' message in the software update dialog, and how long it takes, i.e. whether we run into a timeout or not.
It seems this error is hard to catch. It didn't reproduce for us; all updates were applied within a few seconds.
But from older reports where it failed, I saw the test duration was 12 min (total for the 2 tests, DirectUpdate and FallbackUpdate). A passing run takes about 2 min, so we assume each of the two tests hits the 5 min timeout, because both fail to apply.
I can't see any more failing tests in the last couple of days. The last one failing with this stack was on Sep 5th. Let's close this bug for now; we will reopen it if it happens again.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Whiteboard: [mozmill-test-failure] s=2012-8-27 u=failure c=update p=1 → [mozmill-test-failure] s=q3 u=failure c=update p=1
Product: Mozilla QA → Mozilla QA Graveyard