Closed Bug 803489 Opened 12 years ago Closed 11 years ago

Software update tests on Windows 8 fail sometimes due to updater prompt on startup (jsbridge cannot connect)

Categories

(Mozilla QA Graveyard :: Mozmill Tests, defect, P1)

All
Windows 8
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: mario.garbi)

Details

(Whiteboard: s=130121 u=failure c=update p=1)

Attachments

(7 files)

Attached image screenshot
I saw this last night while running update tests against the 17.0b2 beta candidate builds of Firefox. As the screenshot shows, there still seem to be Firefox processes running. This modal dialog on startup causes an application disconnect for Mozmill on first startup.

I'm not sure, but it could be that we are not correctly shutting down Firefox between the tests.

Affected testrun: http://10.250.73.243:8080/job/ondemand_update/2937/console

*** AUS:SVC Downloader:onProgress - progress: 5402/10386816
*** AUS:SVC Downloader:onProgress - progress: 2055304/10386816
*** AUS:SVC Downloader:onProgress - progress: 9106584/10386816
*** AUS:SVC Downloader:onProgress - progress: 10386816/10386816
*** AUS:SVC Downloader:onStopRequest - original URI spec: http://download.mozilla.org/?product=firefox-17.0b2-partial-16.0b6&os=win&lang=en-US&force=1, final URI spec: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/17.0b2/update/win32/en-US/firefox-16.0b6-17.0b2.partial.mar, status: 0
*** AUS:SVC Downloader:onStopRequest - setting state to: pending-service
*** AUS:UI gDownloadingPage:onStopRequest - patch verification succeeded
*** AUS:SVC gCanStageUpdates - able to stage updates because we'll use the service
*** AUS:SVC UpdateService:applyUpdateInBackground called with the following update: Firefox 17.0 Beta 2
*** AUS:SVC readStatusFile - status: failed: 7, path: C:\Users\mozauto\AppData\Local\Mozilla\Firefox\firefox\updates\0\update.status
*** UTM:SVC TimerManager:notify - notified @mozilla.org/browser/search-service;1
*** UTM:SVC TimerManager:notify - notified @mozilla.org/addons/integration;1
*** UTM:SVC TimerManager:notify - notified @mozilla.org/extensions/blocklist;1
Timeout: bridge.execFunction("5dc455a1-19d6-11e2-a001-7c6d6299cd7e", bridge.registry["{a4969087-3644-4262-9107-1dd5a911bc66}"]["runTestFile"], ["c:\\users\\mozauto\\appdata\\local\\temp\\tmpa8zyad.mozmill-tests\\tests\\update\\testDirectUpdate\\test2.js"])

TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed
INFO Passed: 2
INFO Failed: 1
INFO Skipped: 0
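
For scanning console logs like the one above, the update status line can be pulled out with a short script. This is only a minimal sketch and not part of Mozmill: the function name `find_update_statuses` and the regular expression are assumptions based solely on the `readStatusFile` line format visible in this log.

```python
import re

# Matches lines such as:
#   *** AUS:SVC readStatusFile - status: failed: 7, path: C:\...\update.status
# The pattern is an assumption derived from the log excerpt above.
STATUS_RE = re.compile(r"readStatusFile - status: (?P<status>.+?), path: (?P<path>.+)$")


def find_update_statuses(log_text):
    """Return a (status, path) tuple for every readStatusFile line in the log."""
    results = []
    for line in log_text.splitlines():
        match = STATUS_RE.search(line.strip())
        if match:
            results.append((match.group("status"), match.group("path")))
    return results
```

Run against the log above, this would report the `failed: 7` status together with the path of the `update.status` file.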
Appeared on 11/23 on Windows NT 6.2.8400 (x86) with Release 17.0:
http://mozmill-ondemand.blargon7.com/#/update/report/674977957b923f4905160d1b9ac05dcb
Reproduced on 11/26 on Windows 8 x64:
http://10.250.73.243:8080/job/ondemand_update/4844/
I wish someone could have a look at that. I'm not sure when I will actually be able to work on it.
Assignee: hskupin → nobody
Priority: -- → P1
Status: ASSIGNED → NEW
This issue is enough of a nuisance that I'm going to stop running ondemand-update automation on release builds until this is resolved. The tests fail more often than they pass, creating an unnecessary backlog. We'll manually spotcheck Win8 updates in the meantime.
My suspicion is that this behavior exists because of all the trouble VMware Fusion is causing us. Those two Win8 VMs are running on qa-set, which has the same memory issues as all the other machines on which we have run Fusion so far. So once we have moved to the new ESX cluster this should be fixed.
Looks like even in the new CI the problem persists. It's something we have to figure out soon.
Whiteboard: s=130121 u=failure c=update p=1
Assignee: nobody → mario.garbi
Status: NEW → ASSIGNED
Mario, please try to reproduce this issue first. That would be important. If you can't locally please use our Win8 machines of the old CI system. Those are free and can be utilized. Keep in mind that something Mozmill related could be involved here. I'm happy to help you whenever you are blocked.
(In reply to Henrik Skupin (:whimboo) from comment #9)
> Mario, please try to reproduce this issue first. That would be important. If
> you can't locally please use our Win8 machines of the old CI system. Those
> are free and can be utilized. Keep in mind that something Mozmill related
> could be involved here. I'm happy to help you whenever you are blocked.

I'm on it as we speak. I will come back with info as soon as I get some results.
I have tried to reproduce it locally and have not managed to yet, except in the cases where I manually open a second instance of Firefox. If testrun_update.py is run properly in the correct environment, the tests pass 10/10:
http://mozmill-crowd.blargon7.com/#/update/reports
 
I will try on the Win8 machines of the old CI system and check where they act differently.
So far I wasn't able to reproduce it on the old Win8 machine either. I will continue investigating, but I suspect it's related to another bug that leaves a FF instance open.

Reports:
http://mozmill-crowd.blargon7.com/#/update/reports
I wouldn't be able to explain that, but it might be bug 813170. Since we have disabled this test we no longer have those problems on Win8, right? Would you mind checking that? What was the frequency of failures in the last few days?
I haven't managed to reproduce it under normal conditions (without manually opening a FF instance), either locally (last 3 days) or on the old Win8 machines (yesterday only). I will look over bug 813170 and continue the investigation.
Please check the results in the dashboard and come back with the failure rates from the last 7 or 10 days.
Mozmill CI update reports for:
07.01.2013 - 24.01.2013

2013-01-16
20.0a2 fr - Windows NT 5.1.2600 (x86)
http://mozmill-ci.blargon7.com/#/update/report/f25fe2f500e5e4086802832f52121ada

2013-01-08 
20.0a1 fr - Windows NT 6.2.8400 (x86)
http://mozmill-ci.blargon7.com/#/update/report/23d8fbdd0190d4b0496d6b129fcd6e8e

2013-01-08
20.0a1 en-US - Windows NT 6.2.8400 (x86_64)
http://mozmill-ci.blargon7.com/#/update/report/23d8fbdd0190d4b0496d6b129fc15b31

There were only 3 failures with a Disconnect Error in update testruns in the period 07-24.01.2013. The last one was on 16.01.2013.
(In reply to mario garbi from comment #16)
> 2013-01-16
> 20.0a2 fr - Windows NT 5.1.2600 (x86)
> http://mozmill-ci.blargon7.com/#/update/report/
> f25fe2f500e5e4086802832f52121ada

That's not Windows 8 and not this bug. 

> 2013-01-08 
> 20.0a1 fr - Windows NT 6.2.8400 (x86)
> http://mozmill-ci.blargon7.com/#/update/report/
> 23d8fbdd0190d4b0496d6b129fcd6e8e
> 
> 2013-01-08
> 20.0a1 en-US - Windows NT 6.2.8400 (x86_64)
> http://mozmill-ci.blargon7.com/#/update/report/
> 23d8fbdd0190d4b0496d6b129fc15b31

Was January 8th really the last time we saw it? I thought I noticed it even with the new CI.
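
To keep non-Windows 8 reports out of the count, the reports can be filtered by their platform string, since Windows 8 identifies itself as NT 6.2. A minimal sketch: the entries are copied from comment 16, while `REPORTS` and `is_windows_8` are hypothetical names used only for illustration and are not part of the dashboard.

```python
# Report entries copied from comment 16: (date, build, platform string).
REPORTS = [
    ("2013-01-16", "20.0a2 fr", "Windows NT 5.1.2600 (x86)"),
    ("2013-01-08", "20.0a1 fr", "Windows NT 6.2.8400 (x86)"),
    ("2013-01-08", "20.0a1 en-US", "Windows NT 6.2.8400 (x86_64)"),
]


def is_windows_8(platform):
    """Windows 8 reports itself as NT 6.2; NT 5.1 is Windows XP."""
    return platform.startswith("Windows NT 6.2.")


# Keep only the reports that are relevant to this bug.
win8_reports = [report for report in REPORTS if is_windows_8(report[2])]

for date, build, platform in win8_reports:
    print(date, build, platform)
```

With the three entries above, only the two reports from 2013-01-08 remain.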
Yes, I double checked mozmill-ci reports and I cannot find a Disconnect error failure since 08.01.2013. I posted the win NT 5.1 report to cover all Disconnect errors.
This bug is for Win8 only. Other disconnect failures are based on different issues. Given that we cannot reproduce it right now, and that I think it might have changed with the new CI, I will lower the priority to P3. Let's revisit on Monday, when we can decide whether to close as WFM.
Priority: P1 → P3
Hardware: x86 → All
I still haven't been able to reproduce it, has it showed up in recent runs on CI?
(In reply to mario garbi from comment #20)
> I still haven't been able to reproduce it, has it showed up in recent runs
> on CI?

Not sure. You might want to check that yourself.
As far as I've seen, it hasn't shown up again recently.
Let's wait one more week; then we can close it as WFM if no more failures occur.
This has happened a lot lately during the ondemand tests for recent Firefox releases. Mario, do we have an update here? Given that it's blocking the QA team from quickly running the update tests for release we have to raise the severity.
Priority: P3 → P2
I am still working on trying to reproduce this. I will come back with updates as soon as possible.
So I was able to catch that issue just now. I made a screenshot, given that zh-TW is not that readable for me. From what I have seen, the dialog pops up when Firefox downloads or applies the update.

Tony, are you able to help out?
Flags: needinfo?(tchung)
Can you copy and paste the text in the dialog to Google Translate?
No, it was not copyable. I wouldn't have asked otherwise.
(In reply to Henrik Skupin (:whimboo) from comment #31)
> No, it was not copyable. I wouldn't have asked otherwise.

Hi Henrik, the text in your screenshot from comment 29 translates to exactly the same message as the screenshot in comment 0.
Flags: needinfo?(tchung)
I am working on creating an ondemand file to trigger locally with the release builds that failed recently. So far I have also noticed some strange behavior with win8 and updating l10n versions of FF. This bug has top priority for me and I'll try to figure it out as soon as possible.
It reproduced again yesterday, 01.04 at 2:24 AM, on Firefox 19.0.2 en-US:
http://mozmill-ondemand.blargon7.com/#/update/report/25ad365ca7bcf4905e9b700b4f970746

I am still investigating this, and I'm trying to understand what ondemand testruns do differently from what I do when I try to simulate an ondemand run locally. I was hoping to observe the ondemand run, but due to the late hour (2:24 AM) I was unable to.
*It has reproduced in an ondemand testrun.
It has also reproduced on regular testrun_update scripts for Firefox 22.0a1 fr on Win 6.2.9200 64bit:

http://mozmill-ci.blargon7.com/#/update/report/25ad365ca7bcf4905e9b700b4fceaa7e
Given all the massive failures lately I'm raising this issue to a P1. Mario, if you are not able to reproduce it yourself, please take advantage of other members of your team. We cannot wait longer to get this problem fixed. It has been sitting in the queue for almost 3 months(!) now. Thanks.
Priority: P2 → P1
Andreea and I have configured a local Jenkins and started running the ondemand testruns with the latest configuration file taken from this bug.

In case this issue reproduces locally we can investigate it here. 

In case it doesn't, it might be an issue with the remote machine configuration. Mario, you might want to verify the reports and see if the same Windows 8 machine was used when the failures appeared.
I noticed this while working on a Win8 machine. When this pops up we are unable to interact with the applications running in the background, and we must first close this dialog.
I'm not sure, but I think this could impact our tests.
Thanks Mario. I have already disabled update checks which were still enabled on all Win8 machines. So this dialog should not appear anymore. But in any case Mozmill should be able to kill the Firefox process. So it shouldn't be directly related.
The first run of ondemand updates (using the file added to this bug) passed. We will be looking into the mozmill-ci reports to see whether all the failures were on a single machine or not, and will also run the ondemand tests again through the local Jenkins.
The error did not always appear on the same machine.

We have run the ondemand testruns locally for the last couple of days, but the error has not appeared. We have also created a new ondemand configuration file using the mozmill-ondemand reports for yesterday's testrun, but no error so far.

We thought this might also have something to do with the internet connection speed, so we checked it: our machine has a slower connection.

We think we could start an ondemand testrun tomorrow morning (our time) on http://mm-ci-master.qa.scl3.mozilla.com/ for Windows 8 machines only, reporting to mozmill-crowd so that it won't interfere with real reports. Please tell us what you think.
Attached file screenshots
We have run ondemands on all Windows 8 remote machines twice. These are the results:
1) First time:
- we got disconnect error application closed at error patching (screenshot attached) - happened on the mm-win-8-32-1 machine
- we got disconnect error application closed at download (screenshot attached) - happened on the mm-win-8-32-3 machine

2) Second time:
- we still got disconnect error application closed at error patching on the first machine
- an error about update failed due to still running copies of Firefox. We have looked, but no FF process was running in the task manager except for the error dialog. Jenkins console for this issue is: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/6401/console

We are trying to see if we can reproduce locally now with the same configuration file.

Also, we think we could mark one of the nodes as offline and run the update tests on the machine where the error reproduced most often. This might help us to create a minimized test case.
(In reply to Daniela Petrovici from comment #44)
> - we got disconnect error application closed at error patching (screenshot
> attached) - happened on the mm-win-8-32-1 machine

When you attach screenshots, please attach them separately rather than as a zip archive. A zip costs extra time for everyone who wants to look at them. Thanks.

> We are trying to see if we can reproduce locally now with the same
> configuration file.

Why do you want to try to reproduce on machines where you weren't able to see the issue in the past couple of days? Now that it got reproduced on win8_32-3 why don't you use this one immediately?
We have started investigation on the remote machine and we were able to reproduce the error by running an upgrade from 20.0b6 RO to 21.0b1. We are trying to create a minimized test case that will reproduce this issue consistently. However, since no Firefox process is running in the background and we still get the error message about running copies, we think that it might be a Firefox issue.
(In reply to Daniela Petrovici from comment #46)
> We have started investigation on the remote machine and we were able to
> reproduce the error by running an upgrade from 20.0b6 RO to 21.0b1. We are

It would be good to know how you ran the tests. Also, I'm not sure what you mean by a minimized testcase. I don't think that this is the right thing to do at this time.
Depends on: 858686
We ran it with a normal testrun on the remote machine, giving it the ro build 20.0b6, which upgraded to 21.0b1, and the error reproduced in 2 out of 3 runs.

When running with the config file that has more than 15 locales, once we get the jsbridge error about other copies still running, all following runs pop up that window and won't proceed until we close it.

We looked in the task manager as well and haven't found any Firefox process still running, so we're not sure how to proceed.
We could try running a script before each testrun begins that kills any existing Firefox processes, to see whether that helps or the error still reproduces.
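
Such a pre-run cleanup could look roughly like the following. This is only a sketch under assumptions: the function names are hypothetical, it relies on the Windows `taskkill` tool being on the PATH, and nothing like it exists in the Mozmill automation scripts as far as this bug shows.

```python
import subprocess
import sys


def build_kill_command(image_name="firefox.exe"):
    """Build the taskkill call: /F forces termination, /T includes child
    processes, /IM selects processes by image name."""
    return ["taskkill", "/F", "/T", "/IM", image_name]


def kill_leftover_firefox():
    """Kill leftover Firefox processes before a testrun.

    Returns taskkill's exit code, or None when not running on Windows.
    A non-zero exit code is not treated as an error here, because taskkill
    may report one when no matching process was running.
    """
    if not sys.platform.startswith("win"):
        return None
    return subprocess.call(build_kill_command())
```

Calling `kill_leftover_firefox()` at the start of each testrun would cover the case of a stray Firefox instance. Given that the task manager showed no Firefox process when the dialog appeared, it may well not be enough on its own.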
Please see the recently added dependency for my current work on that. It's bug 858686.
Depends on: 860677
OK, so the patch on bug 860677 has landed now. With it we should no longer see this problem. Mario, please schedule an ondemand testrun in the next few days to prove that, or ask Anthony or Juan whether one of them will run such a job today anyway. Thanks.
I've emailed Juan and Anthony. If they don't trigger an ondemand update testrun today, we will schedule one for Monday morning, because today's testruns have already started.
Reproduced again today with the ondemand testrun on Firefox 19.0.2 es-AR with Windows NT 6.2.9200 (x86_64):

CI reports:
http://mozmill-crowd.blargon7.com/#/update/report/8ec48e7ab0431a61b624e36d3182b977

Jenkins: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/7317/console
(In reply to mario garbi from comment #52)
> Reproduced again today with the ondemand testrun on Firefox 19.0.2 es-AR
> with Windows NT 6.2.9200 (x86_64):
> 
> Jenkins:
> http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/7317/console

So this is at least not the jsbridge error we were facing all the time in the last weeks! I call this a good sign. It means that my latest patches were successful. Mario, when testing please make sure you all watch the tests running. Once a failure like this happens it would be good to know where we are hanging. What I saw last week were real hangs while applying the update, so that might be the case here.
I managed to reproduce this error for each ondemand update job 3/3. It would seem that the last tests from the Jenkins ondemand job are failing on both win-8-64 and win-8-32.

win-8-32-3
View the build in Jenkins:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/7398/

View the results in the Mozmill Dashboard:
http://mozmill-crowd.blargon7.com/#/update/report/8ec48e7ab0431a61b624e36d319426ed
http://mozmill-crowd.blargon7.com/#/update/report/8ec48e7ab0431a61b624e36d319460a8

win-8-64-2
View the build in Jenkins:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/7370/

View the results in the Mozmill Dashboard:
http://mozmill-crowd.blargon7.com/#/update/report/8ec48e7ab0431a61b624e36d3191a2da
http://mozmill-crowd.blargon7.com/#/update/report/8ec48e7ab0431a61b624e36d3191e909
By looking at the send time of the ondemand mail reports we can see that the time it takes from build version to build version increases:

FF 17 @ 02:15
FF 18 @ 02:19 (4 min)
FF 18.0.1 @ 02:22 (3 min)
FF 19 @ 02:26 (4 min)
FF 19.0.2 @ 02:35 (9 min) -- we had a failure report here

It's safe to assume that we had a hang of about 5 minutes in the tests here.
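
The gaps can be recomputed from the send times listed above. A minimal sketch: the times are copied from the list, while `REPORT_TIMES` and `gaps_in_minutes` are hypothetical names for illustration.

```python
from datetime import datetime

# Send times of the ondemand mail reports, copied from the list above.
REPORT_TIMES = [
    ("FF 17", "02:15"),
    ("FF 18", "02:19"),
    ("FF 18.0.1", "02:22"),
    ("FF 19", "02:26"),
    ("FF 19.0.2", "02:35"),
]


def gaps_in_minutes(report_times):
    """Return (build, minutes since the previous report) for each report
    after the first one. Assumes all times fall on the same day."""
    parsed = [(name, datetime.strptime(time, "%H:%M")) for name, time in report_times]
    gaps = []
    for (_, previous), (name, current) in zip(parsed, parsed[1:]):
        gaps.append((name, int((current - previous).total_seconds() // 60)))
    return gaps


for name, minutes in gaps_in_minutes(REPORT_TIMES):
    print(name, minutes)
```

The first three gaps are 3 to 4 minutes and the gap before FF 19.0.2 is 9 minutes, which is where the estimate of roughly 5 extra minutes comes from.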
The hang is evident in the console log as there's a timeout:

Timeout: bridge.execFunction("e002f730-a688-11e2-a13a-005056bb7a86", bridge.registry["{ab5e149e-93fa-45e1-a0fb-49df859c0a70}"]["runTestFile"], ["c:\\users\\mozauto\\appdata\\local\\temp\\tmpilq7u9.mozmill-tests\\tests\\update\\testDirectUpdate\\test2.js"])

Is this only reproducible for Firefox 19.0.2? As mentioned in comment 53, have you been able to watch the tests running? It would be useful to know what is happening on these boxes during those 5 minutes.
We have caught the failure and managed to collect a couple of screenshots, which I will attach, along with the Firefox process dump that should tell us why Firefox was not responding.
Screen Shot of the Task Manager
Firefox Dump download link - 170Mb 
https://dl.dropboxusercontent.com/u/37788888/Dump/firefox.DMP
Thanks for the analysis. That is exactly what I expected, as mentioned before. I saw exactly the same thing last week.

Mario, please file a new bug for mozmill-test failures about the application disconnect. Also file a bug for the application updater of Firefox and add all the information from the last three comments. Make both dependent.

With the information from the latest test we can say that the original issue the bug was filed against is finally fixed. The jsbridge error doesn't occur anymore. That means we can close this bug.

Thanks to everyone involved here.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Component: Mozmill Automation → Mozmill Tests
Resolution: --- → FIXED
Summary: Software update tests on Windows 8 fail sometimes due to still running copies of Firefox → Software update tests on Windows 8 fail sometimes due to updater prompt on startup (jsbridge cannot connect)
Product: Mozilla QA → Mozilla QA Graveyard