Closed Bug 1293404 Opened 3 years ago Closed 3 years ago

nsUpdateDriver.cpp should handle errors from waitpid

Categories

(Toolkit :: Application Update, defect)

50 Branch
All
Linux
defect
Not set

Tracking


RESOLVED FIXED
mozilla51
Tracking Status
firefox48 --- unaffected
firefox49 --- wontfix
firefox-esr45 --- unaffected
firefox50 - fixed
firefox51 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: mhowell)

References

Details

(Keywords: intermittent-failure, regression)

Attachments

(2 files)

Summary: Intermittent Automation Error: mozprocess timed out after 300 seconds running ['/home/worker/workspace/build/venv/bin/python', '/home/worker/workspace/build/venv/lib/python2.7/site-packages/firefox_ui_harness/cli_update.py' → Automation Error: mozprocess timed out after 300 seconds running ['/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/bin/python', '/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/firefox_ui_har
As the log shows we have problems reconnecting to Marionette after a restart with an update applied:

09:37:22     INFO -  1470821842141      Marionette      ERROR   Error on starting server: [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]"  nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)"  location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85"  data: no]
09:37:22     INFO -  [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]"  nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)"  location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85"  data: no]
09:37:22     INFO -  MarionetteServer.prototype.start@chrome://marionette/content/server.js:85:19
09:37:22     INFO -  MarionetteComponent.prototype.init@resource://gre/components/marionettecomponent.js:218:5
09:37:22     INFO -  MarionetteComponent.prototype.observe@resource://gre/components/marionettecomponent.js:142:7

Right now we do not shut down the application, even with my other patches in place. Also we don't quit the client! That's interesting. It looks like we are still missing a specific code path in Marionette.
Blocks: 1283906
Orange Factor says that this regression started on Aug 6th. Here is the pushlog from mozilla-central:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d320ef56876f52db9bc0eb79554c7332d4793769&tochange=6b65dd49d4f045c0a9753ce60bdb4b7b4aaedcf8

Actually I cannot find any bug which might have caused this.
Interestingly on mozilla-aurora the tests also started failing on Aug 18th:

Pushlog:
https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=fdc2f1d67e39&tochange=d23f68f5cb77

The only overlap I can see here is bug 1258041, which contained the safe-browsing related changes.

Here is a list of update tests broken by this disconnect:

mozilla-central:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=firefox%20ui%20update%20linux&filter-tier=1&filter-tier=2&filter-tier=3&bugfiler&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable

mozilla-aurora:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=firefox%20ui%20update%20linux&filter-tier=1&filter-tier=2&filter-tier=3&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
Summary: Automation Error: mozprocess timed out after 300 seconds running ['/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/bin/python', '/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/firefox_ui_har → Automation Error: mozprocess timed out after 300 seconds running ['/home/mozauto/jenkins/workspace/build/venv/bin/python', '/home/mozauto/jenkins/workspace/build/venv/lib/python2.7/site-packages/firefox_ui_harness/cli_update.py']
Francois, mind having a look at the last comment? I cannot see why such a hang in Firefox could happen, but could you imagine it having something to do with the safebrowsing changes? It's really the only bug that landed in both time frames on the appropriate branches.
Flags: needinfo?(francois)
Bug 1258041 essentially did these two things:

1. changed the default tracking protection list (different name, same exact content)
2. changed the logic of list selection in the UI

I can't see #2 having any impact at all on tests unless that preference is the thing you're testing.

As for #1, I don't understand how that would impact marionette tests since we disable tracking protection there:

http://searchfox.org/mozilla-central/rev/b38dbd1378cea4ae83bbc8a834cdccd02bbc5347/testing/marionette/client/marionette_driver/geckoinstance.py#323-324
Flags: needinfo?(francois)
Correct, and that is what I do not understand. Maybe I should poke around using a one-click loaner, given that I cannot reproduce this locally at all, but it fails nearly constantly on TC or on our mozmill-ci nodes.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
We definitely have to get this fixed before the next merge, so it doesn't cause bustage for our beta update tests.
I watched update tests running on TC and I can see this problem every time for fallback updates. When we have downloaded the full MAR package and want Firefox to restart, Firefox comes back but can no longer start the server socket on port 2828. I'm not sure if that is related to the update process. But given that we also have bug 1257508 for the functional tests, I will check whether normal in-app restarts can also cause such behavior. Interestingly, update tests fail nearly permanently while functional tests fail only once in a while.
With the change of the socket timeout to 60 seconds (bug 1284457) the failure summary is way clearer now. So it looks similar to bug 1202375, which over the last several releases happened only on OS X.

Here is the current stack trace:

14:59:29     INFO - TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)
14:59:29     INFO - Traceback (most recent call last):
14:59:29     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette/marionette_test.py", line 351, in run
14:59:29     INFO -     testMethod()
14:59:29     INFO -   File "/home/worker/workspace/build/tests/firefox-ui/tests/update/fallback/test_fallback_update.py", line 21, in test_update
14:59:29     INFO -     self.download_and_apply_forced_update()
14:59:29     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/firefox_ui_harness/testcases.py", line 359, in download_and_apply_forced_update
14:59:29     INFO -     self.restart()
14:59:29     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/firefox_puppeteer/testcases/base.py", line 87, in restart
14:59:29     INFO -     self.marionette.restart(in_app=True)
14:59:29     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 42, in _
14:59:29     INFO -     return func(*args, **kwargs)
14:59:29     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1101, in restart
14:59:29     INFO -     self.raise_for_port(self.wait_for_port())
14:59:29     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 55, in _
14:59:29     INFO -     m.force_shutdown()
14:59:29     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 42, in _
14:59:29     INFO -     return func(*args, **kwargs)
14:59:29     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 652, in raise_for_port
14:59:29     INFO -     raise socket.timeout("Timed out waiting for port {}!".format(self.port))
14:59:29     INFO - TEST-INFO took 205967ms

It means we run into a socket.timeout when waiting for the restart in self.marionette.restart(in_app=True), and all for the reason already mentioned in comment 1: the Marionette server cannot create a socket because the wanted port is still in use.
Summary: Automation Error: mozprocess timed out after 300 seconds running ['/home/mozauto/jenkins/workspace/build/venv/bin/python', '/home/mozauto/jenkins/workspace/build/venv/lib/python2.7/site-packages/firefox_ui_harness/cli_update.py'] → TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)
I will request a one-click loaner and let our tests run with Marionette trace logging enabled. Let's see if this gives us more information.
With the loaner I ran the following command:

$ firefox-ui-update --binary firefox/firefox tests/firefox-ui/tests/update/manifest.ini --update-fallback-only --gecko-log - -vv

Sadly no further output is shown:

1472808595789   Marionette      TRACE   conn3 -> [0,196,"quitApplication",{"flags":["eForceQuit","eRestart"]}]
1472808595790   Marionette      TRACE   conn3 <- [1,196,null,{}]
1472808595897   Marionette      DEBUG   Closed connection conn3
*** UTM:SVC TimerManager:registerTimer - id: xpi-signature-verification
1472808598145   Marionette      DEBUG   Marionette enabled via build flag and pref
*** AUS:SVC Creating UpdateService
(firefox:1328): IBUS-WARNING **: The owner of /home/worker/.config/ibus/bus is not root!
1472808598572   Marionette      ERROR   Error on starting server: [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]"  nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)"  location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85"  data: no]
[Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]"  nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)"  location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85"  data: no]
So I used lsof to check the status of localhost:2828:

After the first restart:

$ lsof -i TCP | grep 2828
firefox-u 4033 root    5u  IPv4 2179318      0t0  TCP localhost:47013->localhost:2828 (ESTABLISHED)
firefox   4039 root   43u  IPv4 2181625      0t0  TCP localhost:2828 (LISTEN)
firefox   4039 root   48u  IPv4 2179321      0t0  TCP localhost:2828->localhost:47013 (ESTABLISHED)

After the next restart (the fallback update has now been downloaded and will be applied at the end):

$ lsof -i TCP | grep 2828
firefox-u 4033 root    5u  IPv4 2182301      0t0  TCP localhost:47044->localhost:2828 (ESTABLISHED)
firefox   4039 root   53u  IPv4 2179448      0t0  TCP localhost:2828 (LISTEN)
firefox   4039 root   61u  IPv4 2182304      0t0  TCP localhost:2828->localhost:47044 (ESTABLISHED)

After the next update (Firefox has been updated)

$ lsof -i TCP | grep 2828
firefox-u 4033 root    5u  IPv4 2183145      0t0  TCP localhost:47068->localhost:2828 (ESTABLISHED)
updater   4301 root   53u  IPv4 2179448      0t0  TCP localhost:2828 (LISTEN)
updater   4301 root   61u  IPv4 2182304      0t0  TCP localhost:2828->localhost:47044 (CLOSE_WAIT)

So the first two restarts were totally fine and Marionette was able to create a new server socket. But with the last restart, when Firefox got the update applied, the former connection to the client remains in a CLOSE_WAIT state and is not released. As a result, Marionette cannot create a new server socket on the same port.
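For illustration, here is a minimal Python sketch (a stand-in, not Marionette's or the updater's actual code) of how an inherited file descriptor keeps a port occupied the way the lsof output above shows: a child forked from a parent inherits the parent's open sockets, and until the child releases them the port cannot be rebound.

```python
import os
import socket
import time

# A parent listens on an ephemeral port, then forks. The child inherits
# the listening socket's file descriptor, so even after the parent closes
# its own copy, the port stays occupied until the child closes it or exits.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

pid = os.fork()
if pid == 0:
    time.sleep(1)   # child: hold the inherited fd open for a moment
    os._exit(0)

srv.close()  # parent's copy is gone, but the child still holds the port

probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    probe.bind(("127.0.0.1", port))
    rebind_ok = True
except OSError:
    rebind_ok = False  # EADDRINUSE: the inherited fd still owns the port
finally:
    probe.close()

os.waitpid(pid, 0)  # reap the child
print("rebind succeeded:", rebind_ok)
```

This is the same shape as the lsof output: the `updater` process ends up owning the `localhost:2828` descriptors that the `firefox` process opened.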
I checked if this problem might be related to in_app restarts but it isn't. I can see the same failure when letting marionette client restart the browser via mozrunner.
So I tried older Nightly builds from mozilla-central to verify the regression range as pointed out in comment 4. Interestingly, I do not see this problem at all. The reason seems to be that we do not download partial updates but always complete ones, and as such the old software update dialog is not in use because everything is routed through the about dialog.

I have the feeling that a change in Firefox somehow broke the old software update dialog. I have no idea why we actually fail to stop Marionette's socket server during shutdown.

CC'ing Robert in case he may know about changes which could have caused this on Linux only.
OS: Unspecified → Linux
Hardware: Unspecified → All
No idea why these tests had issues only on Linux.

As you noted, there are no changes in the range you posted that should have caused this or for that matter involve app update.

Note that the app update tests test downloading both a partial and a complete using the app update dialog.

If these tests use the balrog perhaps it wasn't serving partial updates when you ran these tests earlier.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #20)
> Note that the app update tests test downloading both a partial and a
> complete using the app update dialog.

This is true when the first update attempt works. But when we invalidate the downloaded partial patch and restart Firefox, the complete patch will be downloaded via the software update dialog and not via the about window. Only for invalid complete patches do we seem to relaunch the about window to re-download the complete patch.

> If these tests use the balrog perhaps it wasn't serving partial updates when
> you ran these tests earlier.

I did, but using older builds to track down the regression range doesn't work because Balrog always offers the complete patch now. So I will have to force the update URL to point to a local update.xml which includes both a partial and a complete update.

Here is the situation in detail:
* partial patch  -> OK
* partial patch -> invalidate -> complete patch -> FAILURE
* complete patch -> OK
* complete patch -> invalidate -> complete patch -> OK
(In reply to Henrik Skupin (:whimboo) from comment #21)
> (In reply to Robert Strong [:rstrong] (use needinfo to contact me) from
> comment #20)
> > Note that the app update tests test downloading both a partial and a
> > complete using the app update dialog.
> 
> This is true when the first update attempt works.
What I am saying is that the app update tests in the following dir do this:
https://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/tests/chrome
Something I haven't actually seen before:

> firefox   4039 root   61u  IPv4 2182304      0t0  TCP localhost:2828->localhost:47044 (ESTABLISHED)
[..]
> updater   4301 root   53u  IPv4 2179448      0t0  TCP localhost:2828 (LISTEN)
> updater   4301 root   61u  IPv4 2182304      0t0  TCP localhost:2828->localhost:47044 (CLOSE_WAIT)
> firefox-u 4033 root    5u  IPv4 2183145      0t0  TCP localhost:47068->localhost:2828 (ESTABLISHED)

The process which still holds the open socket is bound to the updater. So it's not the Firefox process anymore. Are we doing anything fancy with the Firefox process before the update?
Matt, this might have to do with the changes you made to nsUpdateDriver.cpp that spohl reviewed. Could you help whimboo out with this?
Flags: needinfo?(mhowell)
This did start happening right after that patch (bug 1272614) landed. I'll poke around some.
Thanks Matt! Could it be that we do not wait long enough for the old process to quit? It looks like we fork it into the update process, so we may end up with two instances running at the same time. Right now Marionette always uses port 2828, and so will fail. Btw, I proposed a solution for finding another available port in bug 1240830. So if this cannot be fixed in Firefox, we will have to fix the test harness.
Blocks: 1272614
I am sorry, but I have not been able to reproduce this failure to save my life. I'm watching the "partial patch -> invalidate -> complete patch" flow run, and it passes every time for me. I've tried multiple Linux machines, I even tried generating artificial load to slow everything down because I thought that might make this failure more likely, but it did not.

Truth be told, I'm pretty stumped. As pointed out in comment 23, we are playing some process games on Linux, in that we use execv() to invoke the updater directly from the browser process, but we've been doing that forever (or at least further back than the Mercurial history goes). The relevant change in bug 1272614 was to the other case, when we're staging an update. Then, we fork/exec a new process for the updater and wait for it to exit; the change was just to make that wait non-blocking.

I'm not giving up quite yet, but I don't have any ideas right now.
Flags: needinfo?(mhowell)
Maybe whimboo can help you reproduce it.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #31)
> Maybe whimboo can help you reproduce it.
He did give me some advice over in bug 1285340. I have so little idea what I'm looking for, I need time to even figure out what else to ask.
Matt, maybe the best option is to simply request a one-click loaner on TaskCluster. Given that it's a permanent failure, you won't even have to set up anything to reproduce it.

If you haven't done that yet, here are the steps:

1. Select one of the broken jobs in Treeherder like: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=91c2b9d5c1354ca79e5b174591dbb03b32b15bbf&selectedJob=4888235&filter-searchStr=fxup&filter-tier=3

2. Switch to the job details tab, scroll down, and click "One click loaner"

3. Sign in via Okta, create the loaner, and wait until it's ready

4. Connect via the SSH web shell, and select option 1 to run the tests (it should definitely hang at the end); stop it with Ctrl+C.

5. Now run the following commands:

# cd workspace/build/
# . venv/bin/activate
# tar xfv firefox-51.0a1.en-US.linux-x86_64.tar.bz2 
# firefox-ui-update --binary firefox/firefox tests/firefox-ui/tests/update/manifest.ini --update-fallback-only --gecko-log - -vv

6. You can open up a second shell and exit the wizard (option 4) to run various commands like lsof

If you need more info you can always find me on IRC in the #automation channel.
As long as we don't know whether this is a bug in the product or something we should work around in the test harness, I will enhance Marionette to retry creating a server socket on the same port until we reach the timeout. This should help us keep this hang from reaching beta next week.
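The harness-side workaround proposed here amounts to something like the following sketch (an illustration under assumptions, not the actual bug 1301320 patch): keep retrying to create the server socket until the port frees up or a deadline passes.

```python
import errno
import socket
import time

def listen_with_retry(port, timeout=60.0, poll_interval=0.5):
    """Try to create a server socket on `port`, retrying on EADDRINUSE
    until `timeout` seconds have elapsed; any other error is fatal."""
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            sock.listen(1)
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE or time.monotonic() >= deadline:
                raise
            time.sleep(poll_interval)

# Port 0 asks the OS for a free ephemeral port, so this succeeds at once.
server = listen_with_retry(0)
print("listening on port", server.getsockname()[1])
server.close()
```

In the real scenario the port is fixed at 2828, so the retry loop only helps once the process holding the stale descriptor finally exits.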
Depends on: 1301320
Quick update: using whimboo's instructions (thanks again), I was able to get reliable reproduction in a loaner machine. I'm still trying to actually debug the problem there; I may need to enlist the aid of someone more skilled in Linux debugging.

Side note: I found out that the TaskCluster task inspector [1] generates shell scripts that sort of clone the testing environment into a local Docker container, so I tried that thinking maybe I could debug more easily that way... but I couldn't get repro there. This failure has some sort of religious conviction against reproducing itself on any of my local machines in any form whatsoever.

[1] https://tools.taskcluster.net/task-inspector/#fVAMHmU9S6WIUnQJG1LpMw/
Thank you for the update, Matt! You made a good point: running the Docker environment locally doesn't always work. We have noticed that a couple of times in the past too.

I know that remote debugging is not that convenient and that your session could be killed at any time. Something I could propose is the following: we also see those hangs on our Linux machines in SCL3, which are VMs in an ESX cluster. We could take one offline so you could use that one instead. It offers SSH and VNC access, and won't be shut down after a while.
Matt, do you have any update for us?
Flags: needinfo?(mhowell)
Sorry, I haven't had a chance to work on this for the past couple of days.

It finally occurred to me just now to actually double-check the code in my patch, and I found I did something kind of dumb: I didn't handle errors returned from waitpid at all, I just treated them as if the process were still running, which makes no sense. I wonder if some oddity of the test runner environment is causing waitpid to hit ECHILD. It might be the same oddity that causes strace to fail to attach, which is what inspired this theory but also prevents me from checking it. From what I can tell, though, the old code path would also hang forever in that case, so I don't love this theory. I just don't have any other ones.
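The failure mode described here is easy to demonstrate in isolation (a Python stand-in, not the nsUpdateDriver.cpp code): once a child has already been reaped, another waitpid() on it fails with ECHILD rather than reporting that it is still running.

```python
import errno
import os

pid = os.fork()
if pid == 0:
    os._exit(0)

os.waitpid(pid, 0)  # first wait reaps the child normally

# A second wait on the same pid has no child left to report on, so
# waitpid(2) fails with ECHILD; code that treats any non-positive return
# value as "still running" would spin forever here.
try:
    os.waitpid(pid, os.WNOHANG)
    err = None
except ChildProcessError as exc:
    err = exc.errno

print("got ECHILD:", err == errno.ECHILD)
```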

Do we have a way of doing a try run of this test? I don't see it listed in trychooser, but the only remaining way that I know of to verify any ideas is to just try some patches.
Flags: needinfo?(mhowell) → needinfo?(hskupin)
I don't know the update system that well. Maybe we could tweak such a try build to get updates. Is it enough to override the update URL and point it to patches for the latest nightly? The partial patch will definitely fail, so the full one will be applied. Or is the update server not enabled at all for non-nightly and non-release builds? Robert, can you help out here?
Flags: needinfo?(hskupin) → needinfo?(robert.strong.bugs)
Doing this on try would be difficult. I suggest using oak and the updates for oak.
Flags: needinfo?(robert.strong.bugs)
Hi Matt, Whimboo, can you help me understand from a scenario point of view, what this test failure means? That might help me decide whether to track it for 50 or not.
Flags: needinfo?(mhowell)
With the current situation, QA will not be able to run the on-demand update tests for the upcoming beta releases on Linux 32/64. Windows platforms may also be affected, but those failures are less frequent.

A proposal I made on bug 1301320 to work around this issue in Marionette has not been approved yet, and I am not sure that will happen shortly. A last resort for me would be to change the update tests to not do an in-app restart of Firefox to apply the update, but instead quit the application and let it restart again. I assume this might help make it work for beta. We would then no longer be testing the replacement feature while Firefox is still running, but that might be better than nothing.
Attached patch PatchSplinter Review
This patch adds the explicit error checking that I mentioned was missing in comment 42. Following Robert's advice, I plan to push this to oak to see if it helps any.
Flags: needinfo?(mhowell)
https://hg.mozilla.org/projects/oak/rev/cae30a308fb978ec0c4abbcd1ea2c37ee02032b8
Bug 1293404 - Check for errors returned while waiting for the updater to finish
Thanks Matt! I will ask the sheriffs to trigger a nightly build, so we do not have to wait until the end of today for it.
We got oak nightly builds:
https://archive.mozilla.org/pub/firefox/nightly/2016/09/2016-09-15-02-34-18-oak/

I will now try to run our tests with an older oak nightly, which should show the failure, and then use the same update.xml file for the new nightly build. Let's see if the theory works.
So I ran the fallback update tests against the latest oak nightly build with a tweaked update.xml file, and everything is working fine now. I ran it multiple times to ensure it was not passing intermittently; not a single failure in 10 runs. Without the patch it failed every time.

Here is the command I was running:

> firefox-ui-update --binary firefox/firefox tests/firefox-ui/tests/update/manifest.ini --update-fallback-only --update-override-url http://people.mozilla.org/~hskupin/update.xml --gecko-log -

So I think we should finalize the patch and get it reviewed. If all works well we can get it landed on aurora over the weekend and should be safe for the merge on Monday. Thanks Matt for diving into this!

I think we should move this bug to Toolkit given that we actually found a bug in the updater.
Component: Firefox UI Tests → Application Update
Product: Testing → Toolkit
QA Contact: hskupin
Version: Version 3 → 50 Branch
Comment on attachment 8791420 [details] [diff] [review]
Patch

Requesting review based on comment 51.

Also removing the "experimental" tag. ;)
Attachment #8791420 - Attachment description: Experimental Patch → Patch
Attachment #8791420 - Flags: review?(robert.strong.bugs)
Comment on attachment 8791420 [details] [diff] [review]
Patch

># HG changeset patch
># User Matt Howell <mhowell@mozilla.com>
># Date 1473886898 25200
>#      Wed Sep 14 14:01:38 2016 -0700
># Node ID cd14a490a7f7b3b053b11753d30f69c2c55a2e00
># Parent  8a494adbc5cced90a4edf0c98cffde906bf7f3ae
>Bug 1293404 - Check for errors returned while waiting for the updater to finish
>
>MozReview-Commit-ID: 8m1NnlKKIID
>
>diff --git a/toolkit/xre/nsUpdateDriver.cpp b/toolkit/xre/nsUpdateDriver.cpp
>--- a/toolkit/xre/nsUpdateDriver.cpp
>+++ b/toolkit/xre/nsUpdateDriver.cpp
>@@ -993,23 +993,31 @@ ProcessHasTerminated(ProcessType pt)
>   }
>   CloseHandle(pt);
>   return true;
> #elif defined(XP_MACOSX)
>   // We're waiting for the process to terminate in LaunchChildMac.
>   return true;
> #elif defined(XP_UNIX)
>   int exitStatus;
>-  bool exited = waitpid(pt, &exitStatus, WNOHANG) > 0;
>-  if (!exited) {
>-    sleep(1);
>-  } else if (WIFEXITED(exitStatus) && (WEXITSTATUS(exitStatus) != 0)) {
>+  pid_t exited = waitpid(pt, &exitStatus, WNOHANG);
>+  if (exited == 0) {
>+    // Process is still running.
How about putting a LOG statement here as well?

>+    return false;
>+  } else if (exited == -1) {

Since the above returns change this to
}

if (exited == -1) {


>+    LOG(("Error while checking if the updater process is finished"));
>+    // This shouldn't happen, but if it does, the updater process is lost to us,
>+    // so the best we can do is pretend that it's exited.
>+    return true;
>+  }
>+  // If we get here, the process has exited; make sure it exited normally.
>+  if (WIFEXITED(exitStatus) && (WEXITSTATUS(exitStatus) != 0)) {
>     LOG(("Error while running the updater process, check update.log"));
>   }
>-  return exited;
>+  return true;
> #else
>   // No way to have a non-blocking implementation on these platforms,
>   // because we're using NSPR and it only provides a blocking wait.
>   int32_t exitCode;
>   PR_WaitProcess(pt, &exitCode);
>   if (exitCode != 0) {
>     LOG(("Error while running the updater process, check update.log"));
>   }

r=me with that
Attachment #8791420 - Flags: review?(robert.strong.bugs) → review+
[Tracking Requested - why for this release]:
This is a regression caused by bug 1272614 that was uplifted to Firefox 49 and 50 that prevents running these tests.
Assignee: hskupin → mhowell
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #53)
> >+  if (exited == 0) {
> >+    // Process is still running.
> How about putting a LOG statement here as well?
That could end up being quite a few log entries, since this runs once a second. Is that okay?

Also, I just realized I accidentally dropped the sleep() call that should have been in the exited == 0 case, so it wouldn't actually run once a second; it would run constantly. Probably not okay; I'll have to put that back along with the review changes.
(In reply to Matt Howell [:mhowell] from comment #55)
> (In reply to Robert Strong [:rstrong] (use needinfo to contact me) from
> comment #53)
> > >+  if (exited == 0) {
> > >+    // Process is still running.
> > How about putting a LOG statement here as well?
> That could end up being quite a few log entries, since this runs once a
> second. Is that okay?
Just skip it and if it becomes valuable it can be added later.
https://hg.mozilla.org/integration/mozilla-inbound/rev/2840a2d57a59c054bfa9694e7c8c9bfddc71dc28
Bug 1293404 - Check for errors returned while waiting for the updater to finish; r=rstrong
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #54)
> [Tracking Requested - why for this release]:
> This is a regression caused by bug 1272614 that was uplifted to Firefox 49
> and 50 that prevents running these tests.

The interesting fact is that our tests are not affected on Firefox 49, as you can see here:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=77a60bbacb97c2f36815b2c395958a354aa581c9&filter-searchStr=fxup%20linux&filter-tier=1&filter-tier=2&filter-tier=3

So I assume another change that landed for Firefox 50 caused it in combination with this patch.
https://hg.mozilla.org/mozilla-central/rev/2840a2d57a59
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Please request Beta approval on this when you get a chance (assuming it won't be approved before Aurora50 gets uplifted to Beta).
Flags: needinfo?(mhowell)
I still saw this failure for update tests over the weekend. But maybe that is because we haven't had a Nightly build with the fix included yet. I will keep an eye on it today.
We are greening up!! Yay! 

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=f0f15b7c6aa77a0c5750918aa0a1cb3dc82185bc&filter-searchStr=fxup%20linux&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=5032287

Please look only at Linux 32 for now, given that Linux 64 has issues running our tests at all. So en-US-1 is now green, and en-US-2 shows a different failure related to window focus which we also see for other functional tests. So by tomorrow we should also see en-US-3 passing, and en-US-4 by Wednesday.

Matt, please request approval for beta uplift. We may still have a chance to get this fix included in the upcoming beta 1.
This is the patch that actually landed, with the changes from comment 53 - comment 55. r+ carried over. I'm posting this here to request beta approval, per comment 61.

Approval Request Comment
[Feature/regressing bug #]:
Bug 1272614

[User impact if declined]:
Potential shutdown hangs when applying updates.

[Describe test coverage new/current, TreeHerder]:
This patch is a fix for a test failure. See comment 51 and comment 64.

[Risks and why]:
Very low because of existing test coverage and time spent on nightly.

[String/UUID change made/needed]:
None
Flags: needinfo?(mhowell)
Attachment #8792551 - Flags: review+
Attachment #8792551 - Flags: approval-mozilla-beta?
Comment on attachment 8792551 [details] [diff] [review]
Patch as landed to central

Matt, something went wrong here with the patch. We still see this problem:

https://treeherder.allizom.org/#/jobs?repo=mozilla-central&exclusion_profile=false&filter-searchStr=firefox%20ui&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=4633066

Was there any other major change between the version I tested and what landed?
Attachment #8792551 - Flags: approval-mozilla-beta?
The installer as used for the broken build is:
https://archive.mozilla.org/pub/firefox/nightly/2016/09/2016-09-18-03-04-08-mozilla-central/firefox-51.0a1.en-US.linux-i686.tar.bz2

So it's certainly from after your landing on mozilla-central.

I'm going to reopen this bug. I'm not sure whether we should back out the patch or work on a follow-up.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Since the patch that you originally reported as having fixed the problem you are seeing with those tests fixes an actual problem, and has already landed with this bug referenced, please open a new bug.
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Summary: TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!) → nsUpdateDriver.cpp should handle errors from waitpid
Comment on attachment 8792551 [details] [diff] [review]
Patch as landed to central

Even though the test failures look the same, it might be that they now occur at a lower frequency. So I will file a new bug as asked by Rob.

Also setting the approval flag for beta again after I removed it in comment 66.
Attachment #8792551 - Flags: approval-mozilla-beta?
Blocks: 1303834
Comment on attachment 8792551 [details] [diff] [review]
Patch as landed to central

Fixes an intermittent, Beta50+
Attachment #8792551 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Matt, I'm setting the checkin-needed-beta flag just in case a sheriff can land the patch before you are around today. It would be great to have this landed ASAP so we can make the next beta release. If it hasn't landed by the time you are around, can you please do so? Thanks.
Whiteboard: [checkin-needed-beta]
(In reply to Matt Howell [:mhowell] from comment #42)
> I wonder if some oddity of the test
> runner environment is causing waitpid to hit ECHILD.

An oddity in NSPR, bug 227246, can cause that if PR_CreateProcess() has been called at any point previously.
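The effect of a child being reaped behind the caller's back can be imitated in Python by having the kernel auto-reap children (this only mimics the symptom described in comment 42; it is not what NSPR itself does internally):

```python
import errno
import os
import signal
import time

# Setting SIGCHLD to SIG_IGN tells the kernel to auto-reap exiting
# children, so no zombie is left behind for waitpid() to collect.
signal.signal(signal.SIGCHLD, signal.SIG_IGN)

pid = os.fork()
if pid == 0:
    os._exit(0)
time.sleep(0.2)  # give the child time to exit and be auto-reaped

try:
    os.waitpid(pid, 0)
    err = None
except ChildProcessError as exc:
    err = exc.errno

print("got ECHILD:", err == errno.ECHILD)
```

Any code that treats this ECHILD as "the child is still running" would wait forever, which is exactly the hang this bug fixed.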
You need to log in before you can comment on or make changes to this bug.