Intermittent toolkit\mozapps\update\test_svc\unit\test_0000_bootstrap_svc.js failure in head_update.js | failed: X == succeeded, where x is one of (7, 36, 46, 47, 48)

RESOLVED FIXED in mozilla23

Status

()

defect
RESOLVED FIXED
8 years ago
6 years ago

People

(Reporter: bbondy, Assigned: bbondy)

Tracking

({intermittent-failure})

unspecified
mozilla23
x86_64
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 10 obsolete attachments)

2.28 KB, patch
rstrong
: review+
Details | Diff | Splinter Review
5.42 KB, patch
bbondy
: review+
Details | Diff | Splinter Review
24.75 KB, patch
bbondy
: review+
Details | Diff | Splinter Review
About one out of 100 times I've seen an intermittent orange failure with error state failed: 7 == succeeded.  7 means write error.

I've actually only seen it 2-3 times on elm and they were always with the bootstrap 0000 test.  I previously had different errors for each return point of write errors and it was with the callback open in particular.

So in particular the error is from this part of the updater.cpp code:


> // CreateFileW will fail if the callback executable is already in use. Since
> // it isn't possible to update write the status file and return.
> if (callbackFile == INVALID_HANDLE_VALUE) {
> LOG(("NS_main: file in use - failed to exclusively open executable " \
>      "file: " LOG_S "\n", argv[callbackIndex]));
>  LogFinish();
>  WriteStatusFile(WRITE_ERROR);
>  NS_tremove(gCallbackBackupPath);
>  EXIT_WHEN_ELEVATED(elevatedLockFilePath, updateLockFileHandle, 1);
>  LaunchCallbackApp(argv[4], 

I have not seen this on m-c yet, but I think it will pop-up eventually.
A strange thing is we have no special handling of this code from the service, nor of the callback from the service code.
Duplicate of this bug: 715858
Summary: Intermittent service xpcshell test orange with failed: 7 == succeeded → Intermittent toolkit\mozapps\update\test_svc\unit\test_0000_bootstrap_svc.js failure in head_update.js | failed: 7 == succeeded
Making sure that people will find this if it happens on our Tinderboxes.
Blocks: 438871
Whiteboard: [orange]
I think Felipe Gomes got this same thing, he sent me his update.status and it shows failed: 7. CC'ing him.

His maintenanceservice.log file shows that updater.exe returned an error.

His update.log shows:
SOURCE DIRECTORY C:\Users\Felipe\AppData\Local\Mozilla\Firefox\Nightly\updates\0
DESTINATION DIRECTORY C:\Program Files (x86)\Nightly
NS_main: file in use - failed to exclusively open executable file: C:\Program Files (x86)\Nightly\firefox.exe

He mentioned he did not have any other profiles open that he knows of.
He is running Win7 x64.
Regarding comment 11, it could happen if he started the app, the update process started to happen, and he opened again right at the exact right time.  All within say 1second.

We do lock the file from being re-opened during update, but if you re-open at exactly the right time (with and without the service) then that could happen.  

That doesn't explain the intermittent test failure though, so still pondering.
I also tried manually adding a couple second firefox.exe sleep shutdown after launching updater.exe unelevated (which in turn runs the service to do the update). 
The update works correctly even with this sleep.

I don't see anything wrong with the existing wait for PID code inside the unelevated updater which seems to have worked based on my test:

>  __int64 pid = _wtoi64(argv[3]);
>   if (pid != 0) {
>     HANDLE parent = OpenProcess(SYNCHRONIZE, false, (DWORD) pid);
>     // May return NULL if the parent process has already gone away.
>     // Otherwise, wait for the parent process to exit before starting the
>     // update.
>     if (parent) {
>       DWORD result = WaitForSingleObject(parent, 5000);
>       CloseHandle(parent);
>       if (result != WAIT_OBJECT_0)
>         return 1;
>     }
>   }
So I think the idea of the timed wait in the first place is in case the synchronization wait returns but the process is still in use by Windows.  

For some reason maybe this takes a bit longer when the service is in use. 

I tried to increase this wait and also I added some extra logging that would have been great to have when Felipe got the problem.   I'll try this out on elm with a ton of xpcshell test restarts.

There is no callback locking specific code introduced in the service so I'm hoping a slightly longer wait is all that's needed to fix this.
Comment on attachment 587418 [details] [diff] [review]
Patch v1. Improve logging, wait longer.

This appears to be fixed here:
https://tbpl.mozilla.org/?tree=Elm&rev=6baa33f888c8

I don't really like the code around here in general, but it is just adjusting the parameters of an existing hack presumably made for the same reason.

Also the logging will help if someone ever gets this again.
Attachment #587418 - Flags: review?(robert.bugzilla)
Comment on attachment 587418 [details] [diff] [review]
Patch v1. Improve logging, wait longer.

># HG changeset patch
># Parent dd5bdbc4ff5c9f0bfbde47825a650ca8578acffc
># User Brian R. Bondy <netzen@gmail.com>
>Bug 715746 - Intermittent toolkit\mozapps\update\test_svc\unit\test_0000_bootstrap_svc.js failure in head_update.js | failed: 7 == succeeded
>
>diff --git a/toolkit/mozapps/update/updater/updater.cpp b/toolkit/mozapps/update/updater/updater.cpp
>--- a/toolkit/mozapps/update/updater/updater.cpp
>+++ b/toolkit/mozapps/update/updater/updater.cpp
>@@ -1967,37 +1967,39 @@ int NS_main(int argc, NS_tchar **argv)
>                  sizeof(gCallbackBackupPath)/sizeof(gCallbackBackupPath[0]),
>                  NS_T("%s" CALLBACK_BACKUP_EXT), argv[callbackIndex]);
>     NS_tremove(gCallbackBackupPath);
>     CopyFileW(argv[callbackIndex], gCallbackBackupPath, false);
> 
>     // Since the process may be signaled as exited by WaitForSingleObject before
>     // the release of the executable image try to lock the main executable file
>     // multiple times before giving up.
>-    int retries = 5;
>+    int retries = 10;
>     do {
>       // By opening a file handle wihout FILE_SHARE_READ to the callback
>       // executable, the OS will prevent launching the process while it is
>       // being updated.
>       callbackFile = CreateFileW(argv[callbackIndex],
>                                  DELETE | GENERIC_WRITE,
>                                  // allow delete, rename, and write
>                                  FILE_SHARE_DELETE | FILE_SHARE_WRITE,
>                                  NULL, OPEN_EXISTING, 0, NULL);
>       if (callbackFile != INVALID_HANDLE_VALUE)
>         break;
> 
>-      Sleep(50);
>+      DWORD lastError = GetLastError();
>+      LOG(("NS_main: callback app open attempt failed. " \
>+        "File: " LOG_S ". Last error: %d\n", argv[callbackIndex], lastError));
Since this could be logged repeatedly include an attempt counter like "attempt # failed"

>+
>+      Sleep(100);
>     } while (--retries);
> 
>     // CreateFileW will fail if the callback executable is already in use. Since
>     // it isn't possible to update write the status file and return.
>     if (callbackFile == INVALID_HANDLE_VALUE) {
>-      LOG(("NS_main: file in use - failed to exclusively open executable " \
>-           "file: " LOG_S "\n", argv[callbackIndex]));
Don't remove this. It is helpful when troubleshooting to know exactly why the update was not applied.

>       LogFinish();
>       WriteStatusFile(WRITE_ERROR);
>       NS_tremove(gCallbackBackupPath);
>       EXIT_WHEN_ELEVATED(elevatedLockFilePath, updateLockFileHandle, 1);
>       LaunchCallbackApp(argv[4], 
>                         argc - callbackIndex, 
>                         argv + callbackIndex,
>                         usingService);
Attachment #587418 - Flags: review?(robert.bugzilla) → review-
Attachment #587418 - Attachment is obsolete: true
Attachment #587789 - Flags: review?(robert.bugzilla)
Attachment #587789 - Flags: review?(robert.bugzilla) → review+
Unless I am mistaken (I haven't looked at the test flow), another thing that can be done to help troubleshoot is to compare the contents of the update.log when this fails so it appears in the test output.
Good idea I'll post a new task for that for once things slow down with service related bugs.
Pushed to mozilla-inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/16d8c02193ab
Target Milestone: --- → mozilla12
https://hg.mozilla.org/mozilla-central/rev/16d8c02193ab
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Would you say this is reduced but not solved? Thanks for re-opening, it should not be happening on inbound.
Yeah, massively reduced, but the profiling one shouldn't have hit either, profiling auto-merges from m-c on every push so that had had your fix for a week.
ok thanks, I'll try to find a better fix for this which isn't fully based on waiting longer for the callback.
Depends on: 728731
Blocks: 757858
Depends on: 766567
Just an FYI we may see error code 7 for this bug change to 35, 36, or 37 once bug 766567 lands.
Summary: Intermittent toolkit\mozapps\update\test_svc\unit\test_0000_bootstrap_svc.js failure in head_update.js | failed: 7 == succeeded → Intermittent toolkit\mozapps\update\test_svc\unit\test_0000_bootstrap_svc.js failure in head_update.js | failed: 36 == succeeded (or if older code, failed: 7)
Duplicate of this bug: 767901
Rev3 WINNT 6.1 mozilla-inbound opt test xpcshell on 2012-06-22 11:54:02 PDT for push 372c93939e2b

slave: talos-r3-w7-039

https://tbpl.mozilla.org/php/getParsedLog.php?id=12909558&tree=Mozilla-Inbound

{
TEST-INFO | c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\test_0000_bootstrap_svc.js | running test ...
TEST-UNEXPECTED-FAIL | c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\test_0000_bootstrap_svc.js | test failed (with xpcshell return code: 0), see following log:
>>>>>>>

TEST-INFO | (xpcshell/head.js) | test 1 pending
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [shouldRunServiceTest : 1437] Checking if the service exists on this machine.
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [shouldRunServiceTest : 1443] Service exists, return value: 0

TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [setupUpdaterTest : 1850] testing successful removal of the directory used to apply the mar file

TEST-PASS | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [setupUpdaterTest : 1852] false == false
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice_installer.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice_tmp.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForServiceStop : 1556] Waiting for service to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForServiceStop : 1566] Stopping service...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForServiceStop : 1580] Service stopped.
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice_installer.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice_tmp.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice.exe to stop if necessary...

TEST-PASS | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [runUpdateUsingService : 1649] pending-service == pending-service
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [runUpdateUsingService : 1664] launching C:\Windows\System32\cmd.exe /D /Q /C c:\talos-slave\test\build\firefox\0000_svc_aus_test_app.exe -no-remote -process-updates -dump-args c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\app_args_log 1> c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\0000_svc_app_console_log 2>&1
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [setEnvironment : 2788] setting the XRE_NO_WINDOWS_CRASH_DIALOG environment variable to 1... previously it didn't exist
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [setEnvironment : 2835] removing the XPCOM_MEM_LEAK_LOG environment variable... previous value c:\users\cltbld\appdata\local\temp\tmpmrhovp\runxpcshelltests_leaks.log
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [setEnvironment : 2842] setting the XPCOM_DEBUG_BREAK environment variable to warn... previous value stack-and-abort
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [setEnvironment : 2853] setting the MOZ_UPDATE_ROOT_OVERRIDE environment variable to c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\0000_svc_mar

TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [setEnvironment : 2859] setting the MOZ_UPDATE_APPDIR_OVERRIDE environment variable to c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\0000_svc_applyToDir

TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [setEnvironment : 2868] setting MOZ_NO_SERVICE_FALLBACK environment variable to 1

TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | copying c:\talos-slave\test\build\firefox\updater.exe to: c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\0000_svc_applyToDir

TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | copying c:\talos-slave\test\build\firefox\maintenanceservice.exe to: c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\0000_svc_applyToDir

TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | copying c:\talos-slave\test\build\firefox\maintenanceservice_installer.exe to: c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\0000_svc_applyToDir
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [resetEnvironment : 2888] setting the XPCOM_MEM_LEAK_LOG environment variable back to c:\users\cltbld\appdata\local\temp\tmpmrhovp\runxpcshelltests_leaks.log
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [resetEnvironment : 2894] setting the XPCOM_DEBUG_BREAK environment variable back to stack-and-abort
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [resetEnvironment : 2928] removing the XRE_NO_WINDOWS_CRASH_DIALOG environment variable
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [resetEnvironment : 2934] removing the MOZ_UPDATE_ROOT_OVERRIDE environment variable

TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [resetEnvironment : 2940] removing the MOZ_UPDATE_APPDIR_OVERRIDE environment variable

TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [resetEnvironment : 2950] removing MOZ_NO_SERVICE_FALLBACK environment variable

TEST-INFO | (xpcshell/head.js) | test 2 finished

TEST-INFO | (xpcshell/head.js) | running event loop
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [timerCallback : 1719] Still waiting to see the succeeded status, got applying for now...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice_installer.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice_tmp.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForServiceStop : 1556] Waiting for service to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForServiceStop : 1566] Stopping service...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForServiceStop : 1580] Service stopped.
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice_installer.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice_tmp.exe to stop if necessary...
TEST-INFO | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | [waitForApplicationStop : 1586] Waiting for maintenanceservice.exe to stop if necessary...

TEST-UNEXPECTED-FAIL | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | failed: 36 == succeeded - See following stack:
JS frame :: c:\talos-slave\test\build\xpcshell\head.js :: do_throw :: line 440
JS frame :: c:\talos-slave\test\build\xpcshell\head.js :: _do_check_eq :: line 534
JS frame :: c:\talos-slave\test\build\xpcshell\head.js :: do_check_eq :: line 555
JS frame :: c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js :: timerCallback :: line 1726
}


edmorley
https://tbpl.mozilla.org/php/getParsedLog.php?id=12964181&tree=Mozilla-Inbound
Rev3 WINNT 6.1 mozilla-inbound opt test xpcshell on 2012-06-25 01:48:49
slave: talos-r3-w7-063

TEST-UNEXPECTED-FAIL | c:\talos-slave\test\build\xpcshell\tests\toolkit\mozapps\update\test_svc\unit\test_0000_bootstrap_svc.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | c:/talos-slave/test/build/xpcshell/tests/toolkit/mozapps/update/test_svc/unit/head_update.js | failed: 36 == succeeded - See following stack: