Closed
Bug 1293404
Opened 8 years ago
Closed 8 years ago
nsUpdateDriver.cpp should handle errors from waitpid
Categories
(Toolkit :: Application Update, defect)
Tracking
Status: RESOLVED
Resolution: FIXED
Target Milestone: mozilla51
Release | Tracking | Status
---|---|---
firefox48 | --- | unaffected
firefox49 | --- | wontfix
firefox-esr45 | --- | unaffected
firefox50 | - | fixed
firefox51 | --- | fixed
People
(Reporter: intermittent-bug-filer, Assigned: molly)
References
Details
(Keywords: intermittent-failure, regression)
Attachments
(2 files)
Patch (1.81 KB, patch)
robert.strong.bugs: review+

Patch as landed to central (1.82 KB, patch)
molly: review+
ritu: approval-mozilla-beta+
Filed by: hskupin [at] gmail.com
https://treeherder.mozilla.org/logviewer.html#?job_id=4607456&repo=mozilla-central
https://queue.taskcluster.net/v1/task/S_D5OQcKSLermZDTR7xGwA/runs/0/artifacts/public%2Flogs%2Flive_backing.log
Updated•8 years ago
Summary: Intermittent Automation Error: mozprocess timed out after 300 seconds running ['/home/worker/workspace/build/venv/bin/python', '/home/worker/workspace/build/venv/lib/python2.7/site-packages/firefox_ui_harness/cli_update.py' → Automation Error: mozprocess timed out after 300 seconds running ['/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/bin/python', '/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/firefox_ui_har
Comment 1•8 years ago
As the log shows, we have problems reconnecting to Marionette after a restart with an update applied:

09:37:22 INFO - 1470821842141 Marionette ERROR Error on starting server: [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]" nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)" location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85" data: no]
09:37:22 INFO - [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]" nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)" location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85" data: no]
09:37:22 INFO - MarionetteServer.prototype.start@chrome://marionette/content/server.js:85:19
09:37:22 INFO - MarionetteComponent.prototype.init@resource://gre/components/marionettecomponent.js:218:5
09:37:22 INFO - MarionetteComponent.prototype.observe@resource://gre/components/marionettecomponent.js:142:7

As it looks like, right now we do not shut down the application, even with my other patches in place. Also we don't quit the client! That's interesting. Looks like we still miss a specific code path in Marionette.
Blocks: 1283906
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 4•8 years ago
Orange Factor says that this regression started on Aug 6th. Here is the pushlog from mozilla-central:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d320ef56876f52db9bc0eb79554c7332d4793769&tochange=6b65dd49d4f045c0a9753ce60bdb4b7b4aaedcf8

Actually I cannot find any bug which might have caused this.
Comment 5•8 years ago
Interestingly, on mozilla-aurora the tests also started failing, on Aug 18th. Pushlog:
https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=fdc2f1d67e39&tochange=d23f68f5cb77

The only overlap I can see here is bug 1258041, which were the safe-browsing related changes.

Here is a list of broken update tests due to this disconnect:

mozilla-central: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=firefox%20ui%20update%20linux&filter-tier=1&filter-tier=2&filter-tier=3&bugfiler&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
mozilla-aurora: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=firefox%20ui%20update%20linux&filter-tier=1&filter-tier=2&filter-tier=3&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
status-firefox50: --- → affected
status-firefox51: --- → affected
Summary: Automation Error: mozprocess timed out after 300 seconds running ['/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/bin/python', '/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/firefox_ui_har → Automation Error: mozprocess timed out after 300 seconds running ['/home/mozauto/jenkins/workspace/build/venv/bin/python', '/home/mozauto/jenkins/workspace/build/venv/lib/python2.7/site-packages/firefox_ui_harness/cli_update.py']
Comment 6•8 years ago
Francois, mind having a look at the last comment? I cannot see why such a hang in Firefox could happen, but could you imagine that it has something to do with the safebrowsing changes? It's really the only bug which landed at those two time frames on the appropriate branches.
Flags: needinfo?(francois)
Comment 7•8 years ago
Bug 1258041 essentially did these two things:

1. changes the default tracking protection list (different name, same exact content)
2. changes the logic of list selection in the UI

I can't see #2 having any impact at all on tests unless that preference is the thing you're testing. As for #1, I don't understand how that would impact marionette tests since we disable tracking protection there:
http://searchfox.org/mozilla-central/rev/b38dbd1378cea4ae83bbc8a834cdccd02bbc5347/testing/marionette/client/marionette_driver/geckoinstance.py#323-324
Flags: needinfo?(francois)
Comment 8•8 years ago
Correct, and that is what I do not understand. Maybe I should poke around by using a one click loaner, given that I cannot reproduce locally at all but it fails nearly constantly on TC or our mozmill-ci nodes.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Comment 9•8 years ago
We definitely have to get this fixed before the next merge, so it doesn't cause bustage for our beta update tests.
Comment 10•8 years ago
I watched update tests running on TC and I can see this problem each time for fallback updates. When we have downloaded the full MAR package and want Firefox to restart, Firefox comes back but can no longer start the server socket on port 2828. Not sure if that is related to the update process. But given that we also have bug 1257508 for the functional tests, I will check if normal in-app restarts can also cause such behavior. Interestingly, the update tests fail nearly permanently while the functional tests fail only once in a while.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 14•8 years ago
With the change of the socket timeout to 60 (bug 1284457) the failure summary is way clearer now. So it looks similar to bug 1202375, which only happened on OS X over the last releases. Here is the current stack trace:

14:59:29 INFO - TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)
14:59:29 INFO - Traceback (most recent call last):
14:59:29 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette/marionette_test.py", line 351, in run
14:59:29 INFO - testMethod()
14:59:29 INFO - File "/home/worker/workspace/build/tests/firefox-ui/tests/update/fallback/test_fallback_update.py", line 21, in test_update
14:59:29 INFO - self.download_and_apply_forced_update()
14:59:29 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/firefox_ui_harness/testcases.py", line 359, in download_and_apply_forced_update
14:59:29 INFO - self.restart()
14:59:29 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/firefox_puppeteer/testcases/base.py", line 87, in restart
14:59:29 INFO - self.marionette.restart(in_app=True)
14:59:29 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 42, in _
14:59:29 INFO - return func(*args, **kwargs)
14:59:29 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1101, in restart
14:59:29 INFO - self.raise_for_port(self.wait_for_port())
14:59:29 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 55, in _
14:59:29 INFO - m.force_shutdown()
14:59:29 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 42, in _
14:59:29 INFO - return func(*args, **kwargs)
14:59:29 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 652, in raise_for_port
14:59:29 INFO - raise socket.timeout("Timed out waiting for port {}!".format(self.port))
14:59:29 INFO - TEST-INFO took 205967ms

It means we run into a socket.timeout when waiting for the restart in self.marionette.restart(in_app=True). And all of this is because of the reason already mentioned in comment 1: the Marionette server cannot create a socket because the wanted port is still in use.
Summary: Automation Error: mozprocess timed out after 300 seconds running ['/home/mozauto/jenkins/workspace/build/venv/bin/python', '/home/mozauto/jenkins/workspace/build/venv/lib/python2.7/site-packages/firefox_ui_harness/cli_update.py'] → TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)
Comment 15•8 years ago
I will request a one-click loaner and let our tests run with Marionette trace logging enabled. Let's see if this gives us more information.
Comment 16•8 years ago
With the loaner I ran the following command:

$ firefox-ui-update --binary firefox/firefox tests/firefox-ui/tests/update/manifest.ini --update-fallback-only --gecko-log - -vv

Sadly no further output is shown:

1472808595789 Marionette TRACE conn3 -> [0,196,"quitApplication",{"flags":["eForceQuit","eRestart"]}]
1472808595790 Marionette TRACE conn3 <- [1,196,null,{}]
1472808595897 Marionette DEBUG Closed connection conn3
*** UTM:SVC TimerManager:registerTimer - id: xpi-signature-verification
1472808598145 Marionette DEBUG Marionette enabled via build flag and pref
*** AUS:SVC Creating UpdateService
(firefox:1328): IBUS-WARNING **: The owner of /home/worker/.config/ibus/bus is not root!
1472808598572 Marionette ERROR Error on starting server: [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]" nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)" location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85" data: no]
[Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]" nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)" location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85" data: no]
Comment 17•8 years ago
So I was using lsof to check the status of localhost:2828.

After the first restart:

$ lsof -i TCP | grep 2828
firefox-u 4033 root 5u IPv4 2179318 0t0 TCP localhost:47013->localhost:2828 (ESTABLISHED)
firefox 4039 root 43u IPv4 2181625 0t0 TCP localhost:2828 (LISTEN)
firefox 4039 root 48u IPv4 2179321 0t0 TCP localhost:2828->localhost:47013 (ESTABLISHED)

After the next restart (fallback update is now downloaded and applied at the end):

$ lsof -i TCP | grep 2828
firefox-u 4033 root 5u IPv4 2182301 0t0 TCP localhost:47044->localhost:2828 (ESTABLISHED)
firefox 4039 root 53u IPv4 2179448 0t0 TCP localhost:2828 (LISTEN)
firefox 4039 root 61u IPv4 2182304 0t0 TCP localhost:2828->localhost:47044 (ESTABLISHED)

After the next update (Firefox has been updated):

$ lsof -i TCP | grep 2828
firefox-u 4033 root 5u IPv4 2183145 0t0 TCP localhost:47068->localhost:2828 (ESTABLISHED)
updater 4301 root 53u IPv4 2179448 0t0 TCP localhost:2828 (LISTEN)
updater 4301 root 61u IPv4 2182304 0t0 TCP localhost:2828->localhost:47044 (CLOSE_WAIT)

So the first two restarts were totally fine and Marionette was able to create a new server socket. But with the last restart, when Firefox got the update applied, the former connection to the client remains in a CLOSE_WAIT state and is not released. As a result, Marionette cannot create a new server socket at the same port.
Comment 18•8 years ago
I checked if this problem might be related to in_app restarts but it isn't. I can see the same failure when letting marionette client restart the browser via mozrunner.
Comment 19•8 years ago
So I tried older Nightly builds from mozilla-central to verify the regression range as pointed out in comment 4. Interestingly, I do not see this problem at all. The reason here seems to be that we do not download partial updates but always complete ones, and as such the old software update dialog is not in use because everything is routed through the about dialog. I have the feeling that a change in Firefox somewhat broke the old software update dialog. I have no idea why we actually fail to stop the socket server of Marionette during shutdown. CC'ing Robert in case he may know about changes which could have caused this on Linux only.
OS: Unspecified → Linux
Hardware: Unspecified → All
Comment 20•8 years ago
No idea why these tests had issues only on Linux. As you noted, there are no changes in the range you posted that should have caused this or for that matter involve app update. Note that the app update tests test downloading both a partial and a complete using the app update dialog. If these tests use the balrog perhaps it wasn't serving partial updates when you ran these tests earlier.
Comment 21•8 years ago
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #20)
> Note that the app update tests test downloading both a partial and a
> complete using the app update dialog.

This is true when the first update attempt works. But when we invalidate the downloaded partial patch and restart Firefox, the complete patch will be downloaded via the software update dialog and not via the about window. Only for invalid complete patches do we seem to relaunch the about window to re-download the complete patch.

> If these tests use the balrog perhaps it wasn't serving partial updates when
> you ran these tests earlier.

I did, but using older builds to track down the regression range doesn't work because balrog always offers the complete patch now. So I will have to force the update URL to point to a local update.xml, which includes a partial and a complete update.

Here is the situation in detail:

* partial patch -> OK
* partial patch -> invalidate -> complete patch -> FAILURE
* complete patch -> OK
* complete patch -> invalidate -> complete patch -> OK
Comment 22•8 years ago
(In reply to Henrik Skupin (:whimboo) from comment #21)
> (In reply to Robert Strong [:rstrong] (use needinfo to contact me) from
> comment #20)
> > Note that the app update tests test downloading both a partial and a
> > complete using the app update dialog.
>
> This is true when the first update attempt works.

What I am saying is the app update tests in the following dir do this:
https://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/tests/chrome
Comment 23•8 years ago
Something I haven't actually seen before:

> firefox 4039 root 61u IPv4 2182304 0t0 TCP localhost:2828->localhost:47044 (ESTABLISHED)
[..]
> updater 4301 root 53u IPv4 2179448 0t0 TCP localhost:2828 (LISTEN)
> updater 4301 root 61u IPv4 2182304 0t0 TCP localhost:2828->localhost:47044 (CLOSE_WAIT)
> firefox-u 4033 root 5u IPv4 2183145 0t0 TCP localhost:47068->localhost:2828 (ESTABLISHED)

The process which still holds the open socket is bound to the updater. So it's not the Firefox process anymore. Are we doing anything fancy here with the process of Firefox before the update?
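One possibility, sketched below with illustrative names (this is not Firefox code): a listening socket created before execv() stays open in the new program image unless it was marked close-on-exec, so lsof attributes it to the exec'd updater even though the browser created it.

// Standalone sketch (illustrative, not Firefox code): a listening socket that
// is not marked close-on-exec survives execv(), so after the exec it shows up
// under the new program's name in lsof even though the browser created it.
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);        // no SOCK_CLOEXEC
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(2828);                     // illustrative port
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(fd, 1);

  // "updater-stub" is a hypothetical program; fd stays open across the exec.
  char arg0[] = "updater-stub";
  char* const args[] = { arg0, nullptr };
  execv("/usr/bin/updater-stub", args);
  return 1;  // only reached if execv() failed
}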
Comment 24•8 years ago
Matt, this might have to do with the changes you made to nsUpdateDriver.cpp that spohl reviewed. Could you help whimboo out with this?
Flags: needinfo?(mhowell)
Assignee
Comment 25•8 years ago
This did start happening right after that patch (bug 1272614) landed. I'll poke around some.
Comment hidden (Intermittent Failures Robot)
Comment 27•8 years ago
Thanks Matt! Could it be that we do not wait long enough until the old process has quit? It looks like we fork it into the update process, so that we maybe end up with two instances at the same time. Right now Marionette always uses port 2828, and will fail. Btw, I proposed a solution for finding another available port on bug 1240830. So if that cannot be fixed in Firefox, we will have to fix the test harness.
Blocks: 1272614
Updated•8 years ago
Keywords: regression
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 30•8 years ago
I am sorry, but I have not been able to reproduce this failure to save my life. I'm watching the "partial patch -> invalidate -> complete patch" flow run, and it passes every time for me. I've tried multiple Linux machines; I even tried generating artificial load to slow everything down because I thought that might make this failure more likely, but it did not. Truth be told, I'm pretty stumped.

As pointed out in comment 23, we are playing some process games on Linux, in that we use execv() to invoke the updater directly from the browser process, but we've been doing that forever (or at least further back than the Mercurial history goes). The relevant change in bug 1272614 was to the other case, when we're staging an update. Then we fork/exec a new process for the updater and wait for it to exit; the change was just to make that wait non-blocking.

I'm not giving up quite yet, but I don't have any ideas right now.
Flags: needinfo?(mhowell)
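A minimal sketch of the staging-path pattern described in comment 30, with illustrative names (the real implementation lives in toolkit/xre/nsUpdateDriver.cpp and differs in detail): fork/exec the updater, then poll its exit status without blocking.

// Sketch of the fork/exec-and-poll pattern described in comment 30.
// Names are illustrative; the real logic is in nsUpdateDriver.cpp.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t LaunchStagedUpdater(const char* updaterPath, char* const argv[]) {
  pid_t pid = fork();
  if (pid == 0) {
    execv(updaterPath, argv);  // child: become the updater
    _exit(127);                // only reached if execv() failed
  }
  return pid;                  // parent: keep the child's pid for polling
}

// Polled periodically by the caller. Besides "exited" and "still running",
// waitpid() has a third outcome: -1, an error such as ECHILD, which is the
// case the patch in this bug adds handling for.
bool UpdaterHasExited(pid_t pid) {
  int status;
  pid_t result = waitpid(pid, &status, WNOHANG);  // non-blocking check
  return result > 0;  // > 0: reaped; 0: still running; -1: error
}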
Assignee
Comment 32•8 years ago
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #31)
> Maybe whimboo can help you reproduce it.

He did give me some advice over in bug 1285340. I have so little idea what I'm looking for, I need time to even figure out what else to ask.
Comment hidden (Intermittent Failures Robot)
Comment 34•8 years ago
Matt, maybe the best is to simply request a one-click loaner on TaskCluster. Given that it's a permanent failure, you won't even have to set up anything to get it reproduced. If you haven't done that yet, here are the steps:

1. Select one of the broken jobs in Treeherder, like:
   https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=91c2b9d5c1354ca79e5b174591dbb03b32b15bbf&selectedJob=4888235&filter-searchStr=fxup&filter-tier=3
2. Switch to the job details tab, scroll down, and click "One click loaner"
3. Sign in via Okta, create the loaner, and wait until it's ready
4. Connect via the SSH web shell, and select option 1 to run the tests (it should definitely hang at the end) - stop with Ctrl+C
5. Now run the following commands:
   # cd workspace/build/
   # . venv/bin/activate
   # tar xfv firefox-51.0a1.en-US.linux-x86_64.tar.bz2
   # firefox-ui-update --binary firefox/firefox tests/firefox-ui/tests/update/manifest.ini --update-fallback-only --gecko-log - -vv
6. You can open up a second shell and exit the wizard (option 4) to run various commands like lsof

If you need more info you can always find me on IRC in the #automation channel.
Comment 35•8 years ago
As long as we don't know whether this is a bug in the product or something we should work around in the test harness, I will enhance Marionette to retry creating a server socket on the same port until we reach the timeout. This should help us here, so that this hang does not reach beta next week.
Depends on: 1301320
Assignee
Comment 36•8 years ago
Quick update: using whimboo's instructions (thanks again), I was able to get reliable reproduction in a loaner machine. I'm still trying to actually debug the problem there; I may need to enlist the aid of someone more skilled in Linux debugging.

Side note: I found out that the TaskCluster task inspector [1] generates shell scripts that sort of clone the testing environment into a local Docker container, so I tried that thinking maybe I could debug more easily that way... but I couldn't get repro there. This failure has some sort of religious conviction against reproducing itself on any of my local machines in any form whatsoever.

[1] https://tools.taskcluster.net/task-inspector/#fVAMHmU9S6WIUnQJG1LpMw/
Comment hidden (Intermittent Failures Robot)
Comment 38•8 years ago
Thank you for the update, Matt! You made a good point: running the Docker environment locally doesn't always work. We noticed that a couple of times in the past too. I know that remote debugging is not that convenient and that your session could be killed at any time. Something I could propose is the following... we also see those hangs on our Linux machines in SCL3, which are VMs in an ESX cluster. We could take one offline so you could use that one instead. It offers SSH and VNC access, and won't be shut down after a while.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 42•8 years ago
Sorry, I haven't had a chance to work on this for the past couple of days. It finally occurred to me just now to actually double-check the code in my patch, and I found I did something kinda dumb; I didn't handle errors returned from waitpid at all, I just treat them as if they mean the process is still running, which makes no sense.

I wonder if some oddity of the test runner environment is causing waitpid to hit ECHILD. It might be the same oddity that causes strace to fail to attach, which is what inspired this theory but also prevents me from checking it. From what I can tell though, the old code path would also hang forever in that case, so I don't love this theory. I just don't have any other ones.

Do we have a way of doing a try run of this test? I don't see it listed in trychooser, but the only remaining way that I know of to verify any ideas is to just try some patches.
Flags: needinfo?(mhowell) → needinfo?(hskupin)
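To make the failure mode concrete, here is a condensed sketch of the pre-patch check (the WIFEXITED logging branch is omitted; the full old code is quoted in the review in comment 53): a waitpid() error is indistinguishable from "still running", so the caller polls forever.

// Condensed sketch of the pre-patch check in nsUpdateDriver.cpp: because the
// result of waitpid() is only compared against > 0, an error return of -1
// (e.g. ECHILD) looks exactly like "still running", so the caller keeps
// polling a child it can never reap.
#include <sys/wait.h>
#include <unistd.h>

bool ProcessHasTerminatedOld(pid_t pt) {
  int exitStatus;
  bool exited = waitpid(pt, &exitStatus, WNOHANG) > 0;  // -1 > 0 is false
  if (!exited) {
    sleep(1);  // wait a second and let the caller try again... indefinitely
  }
  return exited;
}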
Comment 43•8 years ago
I don't know the update system that well. Maybe we could tweak such a try build to get updates. Is it enough to override the update URL and point it to patches for the latest nightly? The partial patch will definitely fail, so the full one will be applied. Or is the update server not enabled at all for builds other than nightly and release builds? Robert, can you help out here?
Flags: needinfo?(hskupin) → needinfo?(robert.strong.bugs)
Comment 44•8 years ago
Doing this on try would be difficult. I suggest using oak and the updates for oak.
Flags: needinfo?(robert.strong.bugs)
Comment 45•8 years ago

Hi Matt, Whimboo, can you help me understand, from a scenario point of view, what this test failure means? That might help me decide whether to track it for 50 or not.
Flags: needinfo?(mhowell)
Comment 46•8 years ago
With the current situation QA will not be able to run the on-demand update tests for the upcoming beta releases on Linux 32/64. Windows platforms may also be affected, but those are less intermittent. A proposal I made on bug 1301320 to work around this issue in Marionette has not been approved yet. Not sure if that will happen shortly. A last resort for me would be to change the update tests to not do an in-app restart of Firefox to apply the update, but quit the application and let it restart again. I assume that this might help to make it work for Beta. Then we would no longer test the replacement feature while Firefox is still running. But this might be better than nothing.
Assignee
Comment 47•8 years ago
This patch adds the explicit error checking that I mentioned was missing in comment 42. Following Robert's advice, I plan to push this to oak to see if it helps any.
Flags: needinfo?(mhowell)
Assignee
Comment 48•8 years ago
https://hg.mozilla.org/projects/oak/rev/cae30a308fb978ec0c4abbcd1ea2c37ee02032b8 Bug 1293404 - Check for errors returned while waiting for the updater to finish
Comment 49•8 years ago
Thanks Matt! I will ask Sheriffs to trigger a nightly build, so we do not have to wait until end of today for it.
Comment 50•8 years ago
We got oak nightly builds:
https://archive.mozilla.org/pub/firefox/nightly/2016/09/2016-09-15-02-34-18-oak/

I would now try to run our tests with an older oak nightly, which should show the failure, and then use the same update.xml file for that new nightly build. Let's see if the theory works.
Comment 51•8 years ago
So I ran the fallback update tests against the latest oak nightly build with a tweaked update.xml file, and all is working fine now. I did it multiple times to ensure it was not passing intermittently, but no, not a single failure in 10 runs. Without the patch it failed each time. Here is the command I was running:

> firefox-ui-update --binary firefox/firefox tests/firefox-ui/tests/update/manifest.ini --update-fallback-only --update-override-url http://people.mozilla.org/~hskupin/update.xml --gecko-log -

So I think we should finalize it and get it reviewed. If all works well we can get it landed on aurora over the weekend and should be safe for the merge on Monday. Thanks Matt for diving into the tests!

I think we should move this bug into Toolkit given that we actually found a bug in the updater.
Component: Firefox UI Tests → Application Update
Product: Testing → Toolkit
QA Contact: hskupin
Version: Version 3 → 50 Branch
Assignee
Comment 52•8 years ago
Comment on attachment 8791420 [details] [diff] [review]
Patch

Requesting review based on comment 51. Also removing the "experimental" tag. ;)
Attachment #8791420 - Attachment description: Experimental Patch → Patch
Attachment #8791420 - Flags: review?(robert.strong.bugs)
Comment 53•8 years ago
Comment on attachment 8791420 [details] [diff] [review]
Patch

># HG changeset patch
># User Matt Howell <mhowell@mozilla.com>
># Date 1473886898 25200
>#      Wed Sep 14 14:01:38 2016 -0700
># Node ID cd14a490a7f7b3b053b11753d30f69c2c55a2e00
># Parent  8a494adbc5cced90a4edf0c98cffde906bf7f3ae
>Bug 1293404 - Check for errors returned while waiting for the updater to finish
>
>MozReview-Commit-ID: 8m1NnlKKIID
>
>diff --git a/toolkit/xre/nsUpdateDriver.cpp b/toolkit/xre/nsUpdateDriver.cpp
>--- a/toolkit/xre/nsUpdateDriver.cpp
>+++ b/toolkit/xre/nsUpdateDriver.cpp
>@@ -993,23 +993,31 @@ ProcessHasTerminated(ProcessType pt)
>   }
>   CloseHandle(pt);
>   return true;
> #elif defined(XP_MACOSX)
>   // We're waiting for the process to terminate in LaunchChildMac.
>   return true;
> #elif defined(XP_UNIX)
>   int exitStatus;
>-  bool exited = waitpid(pt, &exitStatus, WNOHANG) > 0;
>-  if (!exited) {
>-    sleep(1);
>-  } else if (WIFEXITED(exitStatus) && (WEXITSTATUS(exitStatus) != 0)) {
>+  pid_t exited = waitpid(pt, &exitStatus, WNOHANG);
>+  if (exited == 0) {
>+    // Process is still running.

How about putting a LOG statement here as well?

>+    return false;
>+  } else if (exited == -1) {

Since the above returns, change this to
  }
  if (exited == -1) {

>+    LOG(("Error while checking if the updater process is finished"));
>+    // This shouldn't happen, but if it does, the updater process is lost to us,
>+    // so the best we can do is pretend that it's exited.
>+    return true;
>+  }
>+  // If we get here, the process has exited; make sure it exited normally.
>+  if (WIFEXITED(exitStatus) && (WEXITSTATUS(exitStatus) != 0)) {
>     LOG(("Error while running the updater process, check update.log"));
>   }
>-  return exited;
>+  return true;
> #else
>   // No way to have a non-blocking implementation on these platforms,
>   // because we're using NSPR and it only provides a blocking wait.
>   int32_t exitCode;
>   PR_WaitProcess(pt, &exitCode);
>   if (exitCode != 0) {
>     LOG(("Error while running the updater process, check update.log"));
>   }

r=me with that
Attachment #8791420 - Flags: review?(robert.strong.bugs) → review+
Comment 54•8 years ago
[Tracking Requested - why for this release]:
This is a regression caused by bug 1272614, which was uplifted to Firefox 49 and 50, and it prevents running these tests.
Assignee: hskupin → mhowell
status-firefox48: --- → unaffected
status-firefox49: --- → wontfix
status-firefox-esr45: --- → unaffected
tracking-firefox50: --- → ?
Assignee
Comment 55•8 years ago
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #53)
> >+  if (exited == 0) {
> >+    // Process is still running.
> How about putting a LOG statement here as well?

That could end up being quite a few log entries, since this runs once a second. Is that okay?

Also, I just realized I accidentally dropped the sleep() call that should have been in the exited == 0 case, so it wouldn't actually run once a second, it would run constantly. Probably not okay; I'll have to put that back along with the review changes.
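For reference, an illustrative reconstruction of the XP_UNIX branch with the review change from comment 53 applied and the sleep() restored; the revision that actually landed (comment 57) is authoritative.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Stand-in for the real LOG macro used in nsUpdateDriver.cpp.
#define LOG(args) printf args

// Illustrative reconstruction of ProcessHasTerminated()'s XP_UNIX branch with
// the review change from comment 53 and the sleep() restored; the landed
// revision (comment 57) is authoritative.
bool ProcessHasTerminated(pid_t pt) {
  int exitStatus;
  pid_t exited = waitpid(pt, &exitStatus, WNOHANG);
  if (exited == 0) {
    // Process is still running; pause so the caller's poll loop runs roughly
    // once per second instead of spinning.
    sleep(1);
    return false;
  }
  if (exited == -1) {
    LOG(("Error while checking if the updater process is finished\n"));
    // The updater process is lost to us, so the best we can do is pretend
    // that it has exited.
    return true;
  }
  // The process has exited; check whether it exited normally.
  if (WIFEXITED(exitStatus) && (WEXITSTATUS(exitStatus) != 0)) {
    LOG(("Error while running the updater process, check update.log\n"));
  }
  return true;
}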
Comment 56•8 years ago
(In reply to Matt Howell [:mhowell] from comment #55)
> (In reply to Robert Strong [:rstrong] (use needinfo to contact me) from
> comment #53)
> > >+  if (exited == 0) {
> > >+    // Process is still running.
> > How about putting a LOG statement here as well?
> That could end up being quite a few log entries, since this runs once a
> second. Is that okay?

Just skip it and if it becomes valuable it can be added later.
Assignee
Comment 57•8 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/2840a2d57a59c054bfa9694e7c8c9bfddc71dc28 Bug 1293404 - Check for errors returned while waiting for the updater to finish; r=rstrong
Comment 58•8 years ago
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #54)
> [Tracking Requested - why for this release]:
> This is a regression caused by bug 1272614 that was uplifted to Firefox 49
> and 50 that prevents running these tests.

The interesting fact is that our tests are not affected for Firefox 49, as you can see here:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=77a60bbacb97c2f36815b2c395958a354aa581c9&filter-searchStr=fxup%20linux&filter-tier=1&filter-tier=2&filter-tier=3

So I assume another change that landed for Firefox 50 caused it in combination with this patch.
Comment 59•8 years ago

bugherder
https://hg.mozilla.org/mozilla-central/rev/2840a2d57a59
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Comment hidden (Intermittent Failures Robot)
Comment 61•8 years ago
Please request Beta approval on this when you get a chance (assuming it won't be approved before Aurora50 gets uplifted to Beta).
Flags: needinfo?(mhowell)
Comment hidden (Intermittent Failures Robot)
Comment 63•8 years ago
I still saw this failure for the update tests over the weekend. But maybe that is because we haven't had a Nightly build with the fix included yet. I will keep an eye out today.
Comment 64•8 years ago
We are greening up!! Yay!
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=f0f15b7c6aa77a0c5750918aa0a1cb3dc82185bc&filter-searchStr=fxup%20linux&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=5032287

Please only look at Linux 32 for now, given that 64 has issues with running our tests at all. So en-US-1 is now green, and en-US-2 shows a different failure related to window focus which we also have for other functional tests. So by tomorrow we should also see en-US-3 passing, and en-US-4 by Wednesday.

Matt, please request approval for beta uplift. We may still have a chance to get this fix included in the upcoming beta 1.
Assignee
Comment 65•8 years ago
This is the patch that actually landed, with the changes from comment 53 - comment 55. r+ carried over. I'm posting this here to request beta approval, per comment 61.

Approval Request Comment
[Feature/regressing bug #]: Bug 1272614
[User impact if declined]: Potential shutdown hangs when applying updates.
[Describe test coverage new/current, TreeHerder]: This patch is a fix for a test failure. See comment 51 and comment 64.
[Risks and why]: Very low because of existing test coverage and time spent on nightly.
[String/UUID change made/needed]: None
Flags: needinfo?(mhowell)
Attachment #8792551 - Flags: review+
Attachment #8792551 - Flags: approval-mozilla-beta?
Comment 66•8 years ago
Comment on attachment 8792551 [details] [diff] [review]
Patch as landed to central

Matt, something went wrong here with the patch. It means we still see this problem:
https://treeherder.allizom.org/#/jobs?repo=mozilla-central&exclusion_profile=false&filter-searchStr=firefox%20ui&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=4633066

Was there any other major change between the version which I have tested and what landed?
Attachment #8792551 - Flags: approval-mozilla-beta?
Comment 67•8 years ago
The installer used for the broken build is:
https://archive.mozilla.org/pub/firefox/nightly/2016/09/2016-09-18-03-04-08-mozilla-central/firefox-51.0a1.en-US.linux-i686.tar.bz2

So it's certainly from after your landing on mozilla-central. I'm going to reopen this bug. Not sure if we should back out the patch or work on a follow-up.
Comment 68•8 years ago
Since the patch (which you originally reported had fixed the problem you are seeing with those tests) fixes an actual problem and has already landed with this bug referenced, please open a new bug.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Updated•8 years ago
Summary: TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!) → nsUpdateDriver.cpp should handle errors from waitpid
Comment 69•8 years ago
Comment on attachment 8792551 [details] [diff] [review]
Patch as landed to central

Even with the test failures still being the same, it might be that those now have a lower frequency. So I will file a new bug as asked by Rob. Also setting the approval flag for beta again after I removed it in comment 66.
Attachment #8792551 - Flags: approval-mozilla-beta?
Comment 70•8 years ago

Comment on attachment 8792551 [details] [diff] [review]
Patch as landed to central

Fixes an intermittent, Beta50+
Attachment #8792551 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 71•8 years ago
Matt, I'm setting the checkin beta flag just in case a sheriff can already land the patch before you are around today. It would be great to have this landed ASAP so we can make the next beta release. If it hasn't been landed when you are around, can you please do so? Thanks.
Whiteboard: [checkin-needed-beta]
Updated•8 years ago
Comment 72•8 years ago

bugherder uplift
https://hg.mozilla.org/releases/mozilla-beta/rev/4abcabc313ae
Whiteboard: [checkin-needed-beta]
Comment 73•8 years ago
(In reply to Matt Howell [:mhowell] from comment #42)
> I wonder if some oddity of the test
> runner environment is causing waitpid to hit ECHILD.

An oddity in NSPR, bug 227246, can cause that if PR_CreateProcess() has been called at any point previously.
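A minimal standalone illustration (not NSPR or Firefox code) of the kind of interference described above: once something else has already reaped the child, a later waitpid() on the same pid fails with ECHILD, which is exactly the error path the patch in this bug now handles.

// Standalone illustration (not NSPR or Firefox code): after a child has
// already been reaped, waitpid() on the same pid returns -1 with ECHILD.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>

int main() {
  pid_t child = fork();
  if (child == 0) {
    _exit(0);  // child exits immediately
  }

  int status;
  waitpid(child, &status, 0);                      // first reap succeeds
  pid_t again = waitpid(child, &status, WNOHANG);  // pid is gone now
  if (again == -1 && errno == ECHILD) {
    printf("second waitpid() failed with ECHILD, as expected\n");
  }
  return 0;
}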