Closed Bug 933680 Opened 10 years ago Closed 10 years ago
browser_pluginCrashCommentAndURL.js hangs ubuntu ec2 slaves in conjunction with browser_bug410900.js
23.15 KB, text/plain
8.01 KB, patch
631 bytes, patch
575 bytes, patch
The majority of our linux desktop unittests run on amazon ec2 slaves, all except for linux debug browser-chrome. This is because we fail on browser_bug410900.js: https://tbpl.mozilla.org/php/getParsedLog.php?id=29931108&tree=Cedar

11:39:57 INFO - INFO TEST-END | chrome://mochitests/content/browser/browser/components/preferences/in-content/tests/browser_advanced_update.js | finished in 1114ms
11:39:57 INFO - TEST-INFO | checking window state
11:39:57 INFO - TEST-START | chrome://mochitests/content/browser/browser/components/preferences/in-content/tests/browser_bug410900.js
11:39:57 INFO - ++DOCSHELL 0x2da6340 == 80 [id = 1209]
11:39:57 INFO - ++DOMWINDOW == 224 (0x8291e98) [serial = 3360] [outer = (nil)]
11:39:57 INFO - ++DOMWINDOW == 225 (0xb09cd68) [serial = 3361] [outer = 0x8291e98]
11:39:57 INFO - [Parent 2396] WARNING: NS_ENSURE_TRUE(mMutable) failed: file ../../../../netwerk/base/src/nsSimpleURI.cpp, line 264
11:39:57 INFO - [Parent 2396] WARNING: NS_ENSURE_TRUE(mMutable) failed: file ../../../../netwerk/base/src/nsSimpleURI.cpp, line 264
11:39:57 INFO - ++DOMWINDOW == 226 (0xafcb388) [serial = 3362] [outer = 0x8291e98]
11:39:57 INFO - [Parent 2396] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file ../../../layout/style/Loader.cpp, line 2007
11:39:57 INFO - [Parent 2396] WARNING: NS_ENSURE_TRUE(enabled) failed: file ../../../dom/base/Navigator.cpp, line 1707
11:39:58 INFO - [Parent 2396] WARNING: waitpid failed pid:2442 errno:10: file ../../../ipc/chromium/src/base/process_util_posix.cc, line 254
11:45:28 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/preferences/in-content/tests/browser_bug410900.js | application timed out after 330 seconds with no output
Is there an easy way for a developer to get access to these slaves (either directly or via Try)?
The best method here is to file a bug with releng (see bug 840964 as an example), and then all the magic will be worked out so you have access via VPN, etc.
FYI, disabling this test just moves the problem to a different test (browser/components/preferences/in-content/tests/browser_connection.js).
current log on cedar: https://tbpl.mozilla.org/php/getParsedLog.php?id=30572745&tree=Cedar
did we disable this test? I don't see it in current logs. If we disabled it we should close this bug or update the title.
We disabled it on Cedar, and then discovered comment 4, and didn't revert disabling it.
Thanks :philor. Can we keep hg revisions linked in bugs when we change stuff? Honestly, if browser/components/preferences/in-content/tests/browser_connection.js passes only because browser_bug410900.js is running, we should be disabling browser_connection.js and any others. Those are bad tests that depend on the state left over by browser_bug410900.js.
No, browser_connection.js doesn't fail because it doesn't run: 330 seconds without output kills the run. Apparently none of the in-content prefs tests run on the ec2 slaves, and someone who knows something about them needs to file a bug for a loaner and see whether it's because in-content prefs themselves don't work, or something common to all the tests doesn't work.
Component: File Handling → Preferences
Filed bug 946882 to request access to one of the Ubuntu ec2 machines.
browser_bug410900.js fails because browser_bug735471.js (http://dxr.mozilla.org/mozilla-central/source/browser/base/content/test/general/browser_bug735471.js) is run previously. If I remove that single file, the tests complete as expected. :ttaubert, you modified browser_bug735471.js recently; any thoughts on why it would cause the in-content test case http://dxr.mozilla.org/mozilla-central/source/browser/components/preferences/in-content/tests/browser_bug410900.js to hang?
I'm working on this as part of bug 850101
Assignee: nobody → felipc
Status: NEW → ASSIGNED
Here's an update! Apologies for the delay, but this has been very painful to bisect. Here are some more details.

There's already some pre-existing problem in this test, but a bigger problem happens as a combination of *3* tests running together. The whole test suite hangs due to a waitpid() call from IPC code triggered by an oop plugin test, browser_pluginCrashCommentAndURL.js. Interestingly, this hang doesn't happen if either browser_bug735471.js or browser_pluginCrashCommentAndURL.js is not run in the test suite (and, in other news, just these two tests alone + browser/components/preferences is enough to trigger the problem).

Now, with either of these removed, the b-c suite finishes, but browser_bug410900.js still doesn't work; it "cleanly" times out. However, even with the two pre-req tests out of the suite, the warnings from comment 12 still show up. BUT! Manually opening about:preferences after the suite finishes works ok, so I think the warnings might be just a red herring.

In any case, it looks like there are two problems that need to be battled here, but it's much clearer now how to separate them, which should help with actually fixing them.
Felipe thanks for the update!
Hi Felipe, Any updates on this?
[qa-] for this iteration, but I assume Joel will verify, in the end.
Whiteboard: p=13 s=it-30c-29a-28b.2 → p=13 s=it-30c-29a-28b.2 [qa-]
Hey Armen, I'm still working on this. I'm hoping I'll be able to make another stride on it tomorrow. I'll ping you or bhearsum to reactivate the VM then.
Carry over to Iteration it-30c-29a-28b.3
Whiteboard: p=13 s=it-30c-29a-28b.2 [qa-] → p=13 s=it-30c-29a-28b.3 [qa-]
fgomes, any news?
Hello, a new update. I now understand all the conditions that lead to the hang and why those 3 tests are involved. Here's a detailed summary.

Firefox hangs in this test right after printing this warning: "WARNING: waitpid failed pid:14683 errno:10:" http://hg.mozilla.org/mozilla-central/annotate/fc9947c00b51/ipc/chromium/src/base/process_util_posix.cc#l254

The pid is from the test plugin which was instantiated earlier to test crash reporting. The waitpid in question has WNOHANG, so it's not what hangs, but it probably tells us that something is about to go wrong (my guess is http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/child_process_host.cc#137). There are also a lot of IPC MessageChannel errors that might be relevant (log attached).

To cause the problem, the plugin has to be loaded before browser_pluginCrashCommentAndURL.js runs, and then be loaded again afterwards. This combination happens due to the other two tests mentioned in comment 14, because they both exercise about:preferences, which loads all plugins to populate the "Applications" tab. In sequence, what happens is:

- browser_bug735471.js runs first; it loads about:preferences, which iterates through the plugins at http://mxr.mozilla.org/mozilla-central/source/browser/components/preferences/in-content/applications.js#1032, loading them
- browser_pluginCrashCommentAndURL.js loads the test plugin to crash it and submit a crash report
- browser_bug410900.js runs later and is the next test to load about:preferences, which then iterates through all plugins again

An interesting point is that gApplicationsPane._loadData completes successfully (no exceptions, and it runs to completion). The hang happens at some point soon afterwards. The main question is why this only happens on these ec2 slaves. I don't know.
Here's a list of all plugins that are loaded on them, in case that could interfere:

LoadPlugin() /tmp/tmpqZhYDq/plugins/libnpsecondtest.so returned 4aa71a0
LoadPlugin() /tmp/tmpqZhYDq/plugins/libnptest.so returned 2e005d0
LoadPlugin() /usr/lib/mozilla/plugins/librhythmbox-itms-detection-plugin.so returned 2df30b0
LoadPlugin() /usr/lib/mozilla/plugins/libtotem-cone-plugin.so returned 2df8d60
LoadPlugin() /usr/lib/mozilla/plugins/libtotem-mully-plugin.so returned 50a9300
LoadPlugin() /usr/lib/mozilla/plugins/libtotem-gmp-plugin.so returned 50ae2b0
LoadPlugin() /usr/lib/mozilla/plugins/libtotem-narrowspace-plugin.so returned 50b6910

But they all seem to exist on the current tbpl Rev3 Fedora 12 boxes (plus libnptestjava.so). So, two things:

- It's very likely that this bug is somewhere in the ipc/chromium/plugins code, and not in the front-end preferences code, so I'm not the right person to move it forward from here.
- Disabling browser_pluginCrashCommentAndURL.js for ec2 debug seems sufficient to unblock the work of migrating these test jobs to ec2 (instead of having to disable the entire suite of in-content prefs tests).

Needinfo'ing bsmedberg to see what can be done to move this forward.
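[Editorial aside: the WNOHANG point above, that the warning-producing waitpid() cannot itself be what hangs, can be demonstrated with a small POSIX sketch. Python is used here purely for illustration; the real code is the C++ linked above.]

```python
import os
import signal
import time

# Fork a child that stays alive for a while.
pid = os.fork()
if pid == 0:
    time.sleep(30)
    os._exit(0)

# With WNOHANG, waitpid() never blocks: it returns (0, 0) while the
# child is still running, so a call like this cannot hang the suite.
reaped, status = os.waitpid(pid, os.WNOHANG)
assert reaped == 0  # child still alive; the call returned immediately

# Clean up: kill and reap the child for real.
os.kill(pid, signal.SIGKILL)
reaped, status = os.waitpid(pid, 0)
assert reaped == pid
```

So the warning is a symptom (the status went missing somewhere), not the hang itself.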
Flags: needinfo?(bmcbride) → needinfo?(benjamin)
Here's a short log from a browser-chrome run that only had these 3 tests included
Attachment #8388398 - Attachment is patch: false
Component: Preferences → Plug-ins
Product: Firefox → Core
Summary: browser_bug410900.js fails on ubuntu ec2 slaves → browser_pluginCrashCommentAndURL.js hangs ubuntu ec2 slaves in conjunction with browser_bug410900.js
The vital piece of information here is what the stack of the process is when it's hung. The log will usually tell us that, but the logs from 2013 at the top of this bug have already expired. Are there more recent logs which will have the stack dump of the hang?

errno 10 is ECHILD, which means we're calling waitpid on a child process that has already been reaped (or we passed a completely bogus pid to waitpid, but that seems unlikely). This might have something to do with the way PIDs are allocated to processes on EC2 instances, if they create new processes with the same PID as old processes quickly.

Otherwise nothing in that log looks suspicious: when a plugin process crashes, which is what this test is doing intentionally, we'd expect to get output like:

###!!! [Parent][MessageChannel::InterruptCall] Error: Channel error: cannot send/recv

What is the stack of the waitpid() call?
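[Editorial aside: the ECHILD behavior described here, waitpid failing on an already-reaped child, can be reproduced in a few lines. This is an illustrative Python sketch of the POSIX semantics, not Mozilla code.]

```python
import errno
import os

# Fork a child that exits immediately.
pid = os.fork()
if pid == 0:
    os._exit(0)

# The first waitpid() reaps the zombie and succeeds.
reaped, status = os.waitpid(pid, 0)
assert reaped == pid

# A second waitpid() on the same pid fails with ECHILD (errno 10):
# the child has already been reaped. This is exactly the
# "waitpid failed ... errno:10" warning in the logs.
try:
    os.waitpid(pid, 0)
    err = None
except ChildProcessError as e:
    err = e.errno

assert err == errno.ECHILD
```

The failing caller cannot tell whether it reaped the child earlier itself or a competing wait elsewhere in the process won the race; all it sees is errno 10.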
Flags: needinfo?(benjamin) → needinfo?(felipc)
I'm not entirely sure how to get the symbols loaded on these machines since the builds are not done there, so what I have is probably not very helpful, but here is the hung stack:

(gdb) bt
#0 0x00007f8b40d50d84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f8b3fb49e50 in PR_WaitCondVar () from /home/cltbld/m-c/firefox/libnspr4.so
#2 0x00007f8b3fb542a5 in ?? () from /home/cltbld/m-c/firefox/libnspr4.so
#3 0x00007f8b3a11da0a in ?? () from /home/cltbld/m-c/firefox/libxul.so
#4 0x00007f8b3a11dc47 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#5 0x00007f8b3a11ded7 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#6 0x00007f8b3a5db230 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#7 0x00007f8b3a5db6c1 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#8 0x00007f8b3a5dbb7b in ?? () from /home/cltbld/m-c/firefox/libxul.so
#9 0x00007f8b3a5dc632 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#10 0x00007f8b3a5d09d4 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#11 0x00007f8b3a124a52 in NS_InvokeByIndex () from /home/cltbld/m-c/firefox/libxul.so
#12 0x00007f8b3ab62de7 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#13 0x00007f8b3ab63cb5 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#14 0x00007f8b3ab64857 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#15 0x00007f8b1491864c in ?? ()
#16 0x00007f8afc32f240 in ?? ()
#17 0x00007fff337d4a20 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb)

I'll see if jmaher can get (or tell me how to get) a full stacktrace for the waitpid call.
I saw bsmedberg's comment so I wrote a patch to enable these jobs on Elm: https://bugzilla.mozilla.org/attachment.cgi?id=8388595&action=edit I hope to have the job running between today and tomorrow.
Assignee: felipc → nobody
Status: ASSIGNED → NEW
fgomes, could you please figure out whether you can continue working on this with the information below, or find an owner before unassigning? We are pushing hard to move away from these ancient machines before the end of April.

fgomes, our unit tests download symbols on demand (only when we crash). You can look at the logs below to see how we retrieve and use them, e.g.:

06:00:59 INFO - Running command: ['unzip', '-q', '/builds/slave/test/build/symbols/firefox-30.0a1.en-US.linux-i686.crashreporter-symbols.zip'] in /builds/slave/test/build/symbols

I triggered Linux debug builds on Elm: https://tbpl.mozilla.org/?tree=Elm

We got these two logs:
https://tbpl.mozilla.org/php/getParsedLog.php?id=35939153&tree=Elm&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=35939091&tree=Elm&full=1
The stack from the log indicates we're stuck synchronously running an nsIProcess from here: nsOSHelperAppService::GetHandlerAndDescriptionFromMailcapFile. I suspect that we actually killed that process because of bug 943174 and because these instances appear to re-use PIDs much more aggressively than most machines do. Given that, we probably need an assignee for bug 943174.
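[Editorial aside: the PID-reuse hazard mentioned here can be sketched in illustrative Python, not Mozilla code. The `landed is False` outcome assumes the kernel has not yet recycled the pid; fast recycling is precisely the dangerous case bug 943174 concerns.]

```python
import os
import signal

# Fork and fully reap a child, so its pid is free for reuse.
pid = os.fork()
if pid == 0:
    os._exit(0)
os.waitpid(pid, 0)

# Signalling the stale pid normally fails with ESRCH...
try:
    os.kill(pid, signal.SIGKILL)
    landed = True
except ProcessLookupError:
    landed = False

# ...but if the kernel had already handed this pid to a new,
# unrelated process, the SIGKILL would land on that process instead.
assert landed is False
```

On machines that cycle through PIDs quickly, a kill() or waitpid() aimed at a process that already died can thus hit a live, unrelated process.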
I've found that if we move to 3 chunks we can run debug mochitest-browser-chrome on EC2. We just landed a change on m-i to disable the tests. The jobs on Cedar will be green once m-i merges there:
https://tbpl.mozilla.org/?tree=Cedar&jobname=Ubuntu%20VM%2012.04.*browser-chrome-

This bug stands as an issue when we don't chunk.

> jmaher: armenzg_buildduty: felipe's work is not needed if we chunk

If we run 3 chunks across the board we should be able to move away from the minis. I believe this bug could be closed if we're moving in that direction.
I am seeing browser_bug410900.js failing on bc1 on inbound. Odd that it wasn't failing on Cedar much.
this test has been disabled on Cedar and now that we are working on running these tests in chunks and on ec2 on inbound they are failing. Should we disable them on trunk based trees (inbound) or do we think this can be fixed in the next couple of days?
I am not assigned this bug nor do I have time to work on it. Why don't you ask :jld who is actually looking at it?
jld, see comment 30. I want to know how realistic it is to get this fixed, or whether we should temporarily disable it while you work on it.
The right test to disable here is browser_pluginCrashCommentAndURL.js, which causes the problem, not browser_bug410900.js. If we disable browser_bug410900.js, the failure will just move to a different test. I think we should disable it until it's fixed (probably by bug 943174), even if the problem is not seen in a chunked run, as we would just be waiting for another test to be affected by the problem again, which could be equally hard to debug and would lead to the same conclusion.
thanks, I have updated the patch in bug 982810 to reflect this.
I'm still trying to understand what's going on here — the cases I found in bug 943174 should apply only to content processes, not plugins. Also, the logs in this bug show many repeated calls to DidProcessCrash on the same dead process, which doesn't add up. I'm doing some debug-by-printf on try at the moment.
Okay, I know what's going on, at least for the case in my try run that produced the same kind of waitpid error spew seen here. It *is* a content process, and we're calling ContentParent::KillHard. That collects the zombie, and then the GeckoChildProcessHost dtor is called. As a hack to deal with Nuwa, ECHILD is treated as the process still existing, so it creates a ChildReaper. Because this is an NS_BUILD_REFCNT_LOGGING build, it's a ChildLaxReaper, which hangs around forever and reattempts a waitpid on that pid every time the parent gets SIGCHLD. So if there's a non-IPC child with the same pid at any point in the future of the test run, we'll steal its exit status and hang. I don't know how this squares with the previous observations that the complained-about pid was a plugin, though. In my case it seems to be a XUL <browser> from browser/base/content/test/social/browser_social_multiprovider.js, if I'm reading the logs right.
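[Editorial aside: the ECHILD-treated-as-still-existing hack described above can be sketched as follows. This is a hypothetical Python simplification; `did_process_crash_sketch` is an invented name, not the real DidProcessCrash.]

```python
import os

def did_process_crash_sketch(pid):
    """Loose sketch of the hack: ask whether our child is gone."""
    try:
        reaped, status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        # ECHILD: someone else already reaped this pid. The hack
        # treats this as "the process still exists", so the caller
        # spawns a reaper that retries on every SIGCHLD, forever.
        return False
    return reaped == pid

# A child that exits and is reaped elsewhere (the competing reaper
# wins the race for the exit status):
pid = os.fork()
if pid == 0:
    os._exit(0)
os.waitpid(pid, 0)  # the "other" reaper collects the zombie

# IPC now asks whether its child is gone, and gets the wrong answer:
result = did_process_crash_sketch(pid)
assert result is False  # the dead child is believed to still be alive
```

Once the pid is believed alive, a lingering reaper retrying waitpid on that numeric pid can later match an unrelated child that happens to reuse it.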
Can that be caused by either a content process or a plugin process? Social uses remote <browser>s, so it's definitely creating child processes. But it could be just the same problem manifesting itself in a different test. If you run the test suite with just the 3 tests mentioned in comment 21, it should happen.
(In reply to :Felipe Gomes from comment #37)
> Can that be caused by either a content process or a plugin process? Social
> uses remote <browser>s, so it's definitely creating child processes. But it
> could be just the same problem manifesting itself in a different test. If
> you run the test suite with just the 3 tests mentioned in comment 21, it
> should happen.

I can't reproduce that locally, and the only cases of 'waitpid failed' that appeared in my instrumented try run were content processes created for the social tests. Looking at the logs from comment #26:

[https://tbpl.mozilla.org/php/getParsedLog.php?id=35939153&tree=Elm&full=1]
06:11:13 INFO - --DOMWINDOW == 0 (0x1fa59d0) [pid = 2581] [serial = 3] [outer = (nil)] [url = http://example.com/#newtab_background_captures]
06:11:13 INFO - [Parent 2444] WARNING: waitpid failed pid:2581 errno:10: file /builds/slave/m-cen-l64-d-000000000000000000/build/ipc/chromium/src/base/process_util_posix.cc, line 254
06:13:00 INFO - ++DOMWINDOW == 1 (0x10a2000) [pid = 2681] [serial = 1] [outer = (nil)]
06:13:01 INFO - [Parent 2444] WARNING: waitpid failed pid:2681 errno:10: file /builds/slave/m-cen-l64-d-000000000000000000/build/ipc/chromium/src/base/process_util_posix.cc, line 254

And I think those are content processes?

[https://tbpl.mozilla.org/php/getParsedLog.php?id=35939153&tree=Elm&full=1]
06:12:08 INFO - Writing to log: /tmp/tmpPK4Gp8/runtests_leaks_plugin_pid2513.log
06:12:08 INFO - [Parent 2382] WARNING: waitpid failed pid:2513 errno:10: file /builds/slave/m-cen-lx-d-0000000000000000000/build/ipc/chromium/src/base/process_util_posix.cc, line 254
06:27:00 INFO - ==> process 2513 will purposefully crash
06:27:00 INFO - TEST-INFO | leakcheck | plugin process: deliberate crash and thus no leak log

But that definitely looks like a plugin, and I still don't understand how this would happen to a plugin process (or anything else that doesn't have an instance of ContentParent).
Thanks for your work on this bug! It looks like we are getting closer to figuring it out.
The first mention of PID 2513 is for a plugin process:

06:09:47 INFO - For application/x-test found plugin libnptest.so
06:09:48 INFO - WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /builds/slave/m-cen-lx-d-0000000000000000000/build/xpcom/base/nsTraceRefcnt.cpp, line 142

And this plugin intentionally crashes. Have you broken on DidProcessCrash when running a single test where a plugin intentionally crashes, such as dom/plugins/test/mochitest/test_crashing.html, to see if it gets hit and what the stack is?
(In reply to Benjamin Smedberg [:bsmedberg] from comment #40)
> And this plugin intentionally crashes. Have you broken on DidProcessCrash
> when running a single test where a plugin intentionally crashes such as
> dom/plugins/test/mochitest/test_crashing.html to see if it gets hit and what
> the stack is?

I see it being called once per process, in the ~GeckoChildProcessHost → EnsureProcessTerminated path, and there are no other calls to waitpid/waitid/wait4 on the same pid. This also holds for browser_pluginCrashCommentAndURL.js, when I run it.
I'm doing a try run with a patch that should avoid the known issues from bug 943174, and an abort() in the "waitpid failed" path: https://tbpl.mozilla.org/?tree=Try&rev=3ac2bb012d54 And it's crashing there, sometimes with plugins and apparently also with remote browsers (test_aboutmemory5.xul). But this is good, because future try runs with more instrumentation might yield some information.
…also: If that patch can be made acceptable and landed (not the abort(), but the other part), it might avoid the failures that this bug was filed for, because a nonexistent process will be treated as already dead, so code like ChildLaxReaper won't keep repeatedly trying to wait on that pid forever. There would still be a race condition, but it would be a lot smaller and might not drag in unrelated tests.
I've found the waitpid that's been stealing IPC's zombies. It's in NSPR: https://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/unix/uxproces.c#648 The first time a child process is created through NSPR, it installs a SIGCHLD handler that uses a pipe-to-self to wake up a thread that loops on waitpid(-1, ...) to collect all outstanding dead children. After that point, whenever an IPC child exits, there will be a race for the child's exit status. At some point afterwards, IPC loses the race for one of its children and sets off a ChildLaxReaper for that pid, and so on. The right solution is probably to fix NSPR so that it ignores other children. Note that this is in addition to the ContentParent problems from bug 943174.
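[Editorial aside: the zombie-stealing race described above is easy to demonstrate in miniature. Illustrative Python only; NSPR's actual reaper is the C code linked above.]

```python
import os

# "IPC" creates a child process.
pid = os.fork()
if pid == 0:
    os._exit(7)

# An NSPR-style reaper calls waitpid(-1, ...) and collects *any*
# dead child, including ones it did not create.
reaped, status = os.waitpid(-1, 0)
assert reaped == pid  # the global reaper just stole IPC's zombie

# When IPC later waits for its own child, the status is gone:
# it gets ECHILD, the "waitpid failed ... errno:10" from the logs.
try:
    os.waitpid(pid, 0)
    stolen = False
except ChildProcessError:
    stolen = True
assert stolen is True
```

This is why every IPC child exit becomes a race once NSPR's SIGCHLD handler is installed: whichever waiter runs first takes the exit status, and the loser sees ECHILD.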
Hey Jed, if you get another try build without the abort() (and anything else that you think might be useful), I can test it on the ec2 machines to see if it helps with the original problem that is only reproducible there
This is the patch. If this unbreaks those tests, then the change to DidProcessCrash would probably work by itself, but at minimum it needs the SetPriorityAndCheckIsAlive change (or something along those lines) to not break b2g.
Attachment #8391004 - Flags: feedback?(felipc)
Jed, thanks for getting the patch together, and Felipe, thanks for testing it and making sure it works for us! If we get this addressed, we should be ready to transition our linux debug tests. We only have some intermittent issues; this is the last persistent issue!
Comment on attachment 8391004 [details] [diff] [review]
bug933680-wip-hg0.diff

Review of attachment 8391004 [details] [diff] [review]:
-----------------------------------------------------------------

An unchunked browser-chrome run with this patch finished successfully twice on ec2. There were some failures, which I think are existing oranges, as they look unrelated to this problem. I also isolated the 3 tests mentioned in comment 21 and they passed.
Attachment #8391004 - Flags: feedback?(felipc) → feedback+
This is great news. Looking forward to a final patch!
we don't have anybody working on bug 687369 which appears to be the root cause. This will disable the plugin test case.
Attachment #8392794 - Flags: review?(felipc)
Attachment #8392794 - Flags: review?(felipc) → review+
After this landed I continue to see browser_bug410900.js failing on chunk bc1. Should I disable both tests, or enable browser_pluginCrashCommentAndURL.js and only disable browser_bug410900.js?
FYI, we recently gained the ability to push patches for this to try (bug 984480). I haven't tried what the exact syntax might be, though.
Based on previous comments here, my vote is to disable both tests, but I have a strong inclination towards only disabling browser_bug410900.js.
Hmm, I really don't know; I expected removing browser_pluginCrashCommentAndURL.js would fix the problem. Where can I see a log where that test was deactivated and browser_bug410900.js still failed?
Felipe, here is a log from inbound that shows browser_bug410900.js failing while not running browser_pluginCrashCommentAndURL.js: https://tbpl.mozilla.org/php/getParsedLog.php?id=36411871&tree=Mozilla-Inbound&full=1
:felipe, I would like to move on this bug, can you please provide feedback. Otherwise I will just have both tests disabled.
If disabling browser_bug410900.js makes it green, sure, let's do that. But I suspect the next test will then hang. From that log it does seem to be the same root cause as discussed here. If disabling browser_pluginCrashCommentAndURL.js wasn't enough to get rid of the problem, we will really need a real fix for this bug (some combination of bug 943174, bug 933680 comment 44, bug 678369).
Odd; I went to look at this and somehow we turned green on bc1 this morning. This looks to be the magic push:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=Ubuntu%20VM%2012.04%20mozilla-inbound%20debug%20test%20mochitest-browser-chrome-&rev=907d1ff0c7ab

Both linux32 and linux64 debug builds seem to be doing much better. We can let them work over the weekend and verify on Monday that all is green.
Cool. I don't think that's a coincidence, bug 943174 is in that push :) After it rests for a few days we can even attempt to re-enable browser_pluginCrashCommentAndURL.js and see how that goes.
Glad it is working as we think. On Monday, if all is well, I can push to try and see if we can re-enable browser_pluginCrashCommentAndURL.js!
Attachment #8396993 - Flags: review?(felipc) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/82622cb23446 I vote for removing the 'leave-open' tag.
Assignee: nobody → jmaher
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31