Bug 768417 (Closed) - Opened 12 years ago - Closed 12 years ago

Update tests are failing on Linux with "Downloaded update has been applied"

Categories

(Mozilla QA Graveyard :: Mozmill Tests, defect, P1)

Hardware: All
OS: Linux

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: whimboo, Assigned: AndreeaMatei)


Details

(Whiteboard: [mozmill-test-failure] s=q3 u=failure c=update p=1)

Attachments

(1 file)

We recently added the fallback update tests to Mozmill CI, and since then the fallback updates have been failing. Partial updates work fine.

The message we get is: "Downloaded update has been applied."

The reason is a timeout while applying the update on Linux. Right now we allow 60 s for the apply process. Why does it take so long on Linux64 to get the update applied? Ehsan, do you have an idea? Is this a known bug in Firefox on this platform?
Actually this also happened for Linux x86 on Aurora:
http://mozmill-ci.blargon7.com/#/update/report/a7655636e327552d4750d1013c06975b
Hardware: x86_64 → All
Dupe of bug 764587, which has been fixed on both central and aurora.  Please test on the trunk of both trees.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Ehsan, this is not about the download, but about applying the downloaded patch. Also, this happens with the most recent Nightly and Aurora builds. I don't have time to investigate today but will make sure to look at it tomorrow.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Ehsan, this is not about the download, but about applying the downloaded
> patch. Also, this happens with the most recent Nightly and Aurora builds.
> I don't have time to investigate today but will make sure to look at it
> tomorrow.

OK, in that case I think you need to increase the timeout you use.  Also, that doesn't need to block bug 307181.
No longer blocks: bgupdates
But why does it take longer than 60 seconds to apply an update? That is quite long. If you had to choose a duration, what would you pick?
Well, I'm not sure.  This stuff is mostly I/O, so it really depends on the specs of the machines running the test, etc.  I think the best way to test this is to have a watchdog process for the updater app, and just look for it dying (either gracefully or crashing etc).  Without that, the best I can suggest is to measure how long the update takes on those machines in like 10 runs, and pick the maximum time and multiply it by 1.5...
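As a rough sketch of that sizing rule (the durations below are made up for illustration, not measurements from our machines), the timeout could be derived like this:

// Hypothetical apply durations in ms, collected from ~10 runs on one machine.
var APPLY_DURATIONS_MS = [34000, 41000, 38000, 52000, 47000, 39000, 44000, 36000, 49000, 40000];

// Take the slowest observed run and add the suggested 1.5x safety margin.
var TIMEOUT_APPLY_MS = Math.max.apply(Math, APPLY_DURATIONS_MS) * 1.5; // 78000 ms here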
The hardware our tests run on varies a lot, and we often don't even know its specs, because our tests can be run by anyone fairly easily. So we can't pick a single timeout that is small and still reliable. Instead I would say we should define a hard limit within which the patch definitely should have been applied. IMHO I would use 5 minutes for that.
Summary: Fallback update tests are failing on Linux64 with "Downloaded update has been applied" → Update tests are failing on Linux with "Downloaded update has been applied"
Attached patch Patch v1 (Splinter Review)
Assignee: nobody → hskupin
Status: REOPENED → ASSIGNED
Attachment #637792 - Flags: review?(anthony.s.hughes)
Hmm, if this is the case, I think using a timeout of any value is a bad idea.  You should ideally switch to a watchdog.
Well, we have control over our own boxes, which we need for release testing; those are our primary focus. Results from the crowd will end up on another dashboard, so those don't conflict. Would you mind giving me a hint about the watchdog approach you mean? Thanks.
Comment on attachment 637792 [details] [diff] [review]
Patch v1

Anthony is out today, so I'm asking Jeff for a sanity check. This patch raises the timeout for waiting until the downloaded update has been applied in the background. The former 60 s were too short, but 5 minutes should be enough for nearly any system.
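For reference, here is a minimal sketch of what the change amounts to; the constant and helper names below are assumptions for illustration, not the actual mozmill-tests code (the real change is in attachment 637792):

// Wait up to 5 minutes (formerly 60 s) for the background update to be applied.
const TIMEOUT_UPDATE_APPLYING = 300000;

assert.waitFor(function () {
  // "applied" is assumed here to be the update state once staging has finished.
  return softwareUpdate.activeUpdate.state === "applied";
}, "Downloaded update has been applied.", TIMEOUT_UPDATE_APPLYING);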
Attachment #637792 - Flags: review?(anthony.s.hughes) → review?(jhammel)
Attachment #637792 - Flags: review?(jhammel) → review+
Given the simplicity of the patch, I pushed it directly to both affected branches:

http://hg.mozilla.org/qa/mozmill-tests/rev/433c9c4c36d9 (default)
http://hg.mozilla.org/qa/mozmill-tests/rev/44860ead0e23 (aurora)

For the watchdog proposal we can continue in another bug, once I have the necessary information and see a need to work on it in the near future.

Let's call this one fixed.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to Ehsan Akhgari [:ehsan] from comment #9)
> Hmm, if this is the case, I think using a timeout of any value is a bad
> idea.  You should ideally switch to a watchdog.

Sure, I mean some piece of code that watches the updater process and reports back when it finishes, either gracefully or by crashing for example, and then checks the status of the staged update.
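A rough sketch of that watchdog idea; all of the helper names below are hypothetical, since nothing like this exists in the test library yet:

// Poll the updater process instead of relying on a fixed timeout, then let the
// caller verify the staged update once the process has gone away.
function watchUpdater(getUpdaterProcess, onFinished, intervalMs) {
  var timer = setInterval(function () {
    var proc = getUpdaterProcess();   // hypothetical handle to the updater process
    if (!proc || !proc.isRunning) {   // exited gracefully or crashed
      clearInterval(timer);
      onFinished();                   // e.g. assert that the staged update state is "applied"
    }
  }, intervalMs);
}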
Even 5 minutes does not work on that machine! I will have to investigate this more closely now, probably by running the tests on the affected VM.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The strange thing is that this only happens on the 32-bit machine but not the 64-bit one. I will restart the 32-bit one and check whether that helps.
Status: REOPENED → ASSIGNED
I can't work on it right now. I hope someone can pick this up.
Severity: normal → major
Priority: -- → P2
(In reply to Henrik Skupin (:whimboo) from comment #16)
> I can't work on it right now. I hope someone can pick this up.

What needs to happen here? Is this a coding problem or an infrastructure problem? What is the impact on releases should this continue to go unfixed?
Andreea, can you please make this your priority for today? Once we get the final beta builds and have to run the update tests, it will affect a bunch of builds. It looks like it's mostly a timing issue in our test. Ask me on IRC whenever you have questions. Thanks!
Assignee: hskupin → andreea.matei
Priority: P2 → P1
Working on it.
I tried to reproduce the failure on Ubuntu 12.04, 32-bit, on a heavily loaded system, with both the Nightly and Aurora versions from yesterday. The tests are not failing.
Reports:
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee66e4b6c
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee66e5536

Will continue to investigate on Ubuntu 11.10, possibly also with older builds.
Not reproducible on a heavily loaded Ubuntu 11.10 32-bit machine either.

Reports:
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee672a8d9
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee67299bd
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee6720670

Henrik, maybe something changed since your last check in comment 15. I will test tomorrow on a 64-bit machine.
OK, so the results from Ubuntu 11.10 and 12.04, both 64-bit, under normal and heavy load, all passed:
Nightly:
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee69c06bc
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee69be1ca

Aurora: 
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee69bf85b
* http://mozmill-crowd.blargon7.com/#/update/report/d87d47fd1034f072b9bece6ee69be0fc

The only difference between these results and the report in the bug URL is the Mozmill version; there it was Mozmill 1.5.13, so I believe 1.5.17 solved the problem.

Should I check whether we are safe with a smaller timeout for applying the update, or are we comfortable with 300000 ms? It didn't take long in my test runs to apply it.
Whiteboard: [mozmill-test-failure] → s=2012-8-27 u=failure c=update [mozmill-test-failure]
(In reply to Andreea Matei [:AndreeaMatei] from comment #21)
> The only difference between these results and the report in the bug URL is
> the Mozmill version; there it was Mozmill 1.5.13, so I believe 1.5.17
> solved the problem.

Do our reports still show failures with Mozmill 1.5.17 and tests across the different versions of Firefox? If not you probably want to run ondemand tests on our Mozmill CI staging instance.

> Should I check whether we are safe with a smaller timeout for applying the
> update, or are we comfortable with 300000 ms? It didn't take long in my
> test runs to apply it.

I cannot answer this question yet as long as we do not know the root cause of this problem on our CI system.
Whiteboard: s=2012-8-27 u=failure c=update [mozmill-test-failure] → s=2012-8-27 u=failure c=update p=1 [mozmill-test-failure]
 (In reply to Henrik Skupin (:whimboo) from comment #22)
> Do our reports still show failures with Mozmill 1.5.17 and tests across the
> different versions of Firefox? If not you probably want to run ondemand
> tests on our Mozmill CI staging instance.

No, the last failures were on August 21st; since then all reports have passed, with either Mozmill 1.5.17 or 1.5.18.

On August 21st there were several failures, on Ubuntu 11.10 (x86), with the fr and en-US locales, with Mozmill 1.5.17.

Looking back a few months, I found that this error started on June 8th.
In the pushlog, something related landed a week earlier, on June 1st: bug 759065 and bug 760290.

I researched ondemand tests today on the wiki page, but I believe I need Vlad's help to trigger them, so I will come back tomorrow with the outcome.
This failure is still present on our staging instance:
http://mozmill-staging.blargon7.com/#/update/report/671677a5d9d5ca25f3cf5ae1c4144ff3

Could be that we are affected by bug 760290. It should be possible for you to trigger such a test on your own (via build now) and watch the process while the test is running. Vlad has VPN access.
Whiteboard: s=2012-8-27 u=failure c=update p=1 [mozmill-test-failure] → [mozmill-test-failure] s=2012-8-27 u=failure c=update p=1
(In reply to Henrik Skupin (:whimboo) from comment #24)
> This failure is still present on our staging instance:
> http://mozmill-staging.blargon7.com/#/update/report/
> 671677a5d9d5ca25f3cf5ae1c4144ff3
> 
> Could be that we are affected by bug 760290. It should be possible for you
> to trigger such a test on your own (via build now) and watch the process
> while the test is running. Vlad has VPN access.

Yup, we'll work together on this one.
So we triggered some tests, and also looking at the other ones that failed with our error, I saw the following problem in the output console:
AUS: UI General : getPref - failed to get preference: app.update.billboard.test_url

I've looked at some files on MXR (all of which were also involved in bug 760290); this one is related:
http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/content/updates.js
and the pref is used here after setting the state of the Update:
http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/content/updates.js#464

Will dig some more tomorrow and look at the connections with fresh eyes.
(In reply to Andreea Matei [:AndreeaMatei] from comment #26)
> So we triggered some tests, and also looking at the other ones that failed
> with our error, I saw the following problem in the output console:
> AUS: UI General : getPref - failed to get preference:
> app.update.billboard.test_url

That's unrelated. What's important here is whether you can see the 'applying update' message in the software update dialog, and how long it takes, i.e. whether we run into a timeout or not.
It seems this error is hard to catch. It didn't reproduce for us; all updates were applied within a few seconds.
But from older reports where it failed, I saw the test duration was 12 min (total for the 2 tests, DirectUpdate and FallbackUpdate). A passing run takes about 2 min, so we assume each of the two tests hits the 5 min timeout, because both fail to apply.
I can't see any more failing tests in the last couple of days. The last one failing with this stack was on Sep 5th. Let's close this bug for now; we will reopen it if it happens again.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Whiteboard: [mozmill-test-failure] s=2012-8-27 u=failure c=update p=1 → [mozmill-test-failure] s=q3 u=failure c=update p=1
Product: Mozilla QA → Mozilla QA Graveyard