Closed Bug 933680 Opened 10 years ago Closed 10 years ago
browser_pluginCrashCommentAndURL.js hangs ubuntu ec2 slaves in conjunction with browser_bug410900.js
23.15 KB, text/plain
8.01 KB, patch
631 bytes, patch
575 bytes, patch
The majority of our linux desktop unittests run on amazon ec2 slaves, all except for linux debug browser-chrome. This is because we fail on browser_bug410900.js: https://tbpl.mozilla.org/php/getParsedLog.php?id=29931108&tree=Cedar

11:39:57 INFO - INFO TEST-END | chrome://mochitests/content/browser/browser/components/preferences/in-content/tests/browser_advanced_update.js | finished in 1114ms
11:39:57 INFO - TEST-INFO | checking window state
11:39:57 INFO - TEST-START | chrome://mochitests/content/browser/browser/components/preferences/in-content/tests/browser_bug410900.js
11:39:57 INFO - ++DOCSHELL 0x2da6340 == 80 [id = 1209]
11:39:57 INFO - ++DOMWINDOW == 224 (0x8291e98) [serial = 3360] [outer = (nil)]
11:39:57 INFO - ++DOMWINDOW == 225 (0xb09cd68) [serial = 3361] [outer = 0x8291e98]
11:39:57 INFO - [Parent 2396] WARNING: NS_ENSURE_TRUE(mMutable) failed: file ../../../../netwerk/base/src/nsSimpleURI.cpp, line 264
11:39:57 INFO - [Parent 2396] WARNING: NS_ENSURE_TRUE(mMutable) failed: file ../../../../netwerk/base/src/nsSimpleURI.cpp, line 264
11:39:57 INFO - ++DOMWINDOW == 226 (0xafcb388) [serial = 3362] [outer = 0x8291e98]
11:39:57 INFO - [Parent 2396] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file ../../../layout/style/Loader.cpp, line 2007
11:39:57 INFO - [Parent 2396] WARNING: NS_ENSURE_TRUE(enabled) failed: file ../../../dom/base/Navigator.cpp, line 1707
11:39:58 INFO - [Parent 2396] WARNING: waitpid failed pid:2442 errno:10: file ../../../ipc/chromium/src/base/process_util_posix.cc, line 254
11:45:28 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/preferences/in-content/tests/browser_bug410900.js | application timed out after 330 seconds with no output
Is there an easy way for a developer to get access to these slaves (either directly or via Try)?
The best method here is to file a bug with releng (see bug 840964 as an example), and then all the magic will be worked out so you have access via VPN, etc.
FYI, disabling this test just moves the problem to a different test (browser/components/preferences/in-content/tests/browser_connection.js).
current log on cedar: https://tbpl.mozilla.org/php/getParsedLog.php?id=30572745&tree=Cedar
did we disable this test? I don't see it in current logs. If we disabled it we should close this bug or update the title.
We disabled it on Cedar, and then discovered comment 4, and didn't revert disabling it.
Thanks :philor. Can we keep hg revisions linked in bugs when we change stuff? Honestly, if browser/components/preferences/in-content/tests/browser_connection.js passes only because browser_bug410900.js is running, we should be disabling browser_connection.js and any others. Those are bad tests that depend on the state left over by browser_bug410900.js.
No, browser_connection.js doesn't fail because it doesn't run: 330 seconds without output kills the run. Apparently none of the in-content prefs tests run on the ec2 slaves, and someone who knows something about them needs to file a bug for a loaner and see whether it's because in-content prefs themselves don't work, or something common to all the tests doesn't work.
Component: File Handling → Preferences
Filed bug 946882 to request access to one of the Ubuntu ec2 machines.
browser_bug410900.js fails because browser_bug735471.js (http://dxr.mozilla.org/mozilla-central/source/browser/base/content/test/general/browser_bug735471.js) is run previously. If I remove that single file, the tests complete as expected. :ttaubert, you modified browser_bug735471.js recently; any thoughts on why it would cause the in-content test case http://dxr.mozilla.org/mozilla-central/source/browser/components/preferences/in-content/tests/browser_bug410900.js to hang?
I'm working on this as part of bug 850101
Assignee: nobody → felipc
Status: NEW → ASSIGNED
Here's an update! Apologies for the delay, but this has been very painful to bisect. Here are some more details.

There's already some pre-existing problem in this test, but a bigger problem happens as a combination of *3* tests running together. The whole test suite hangs due to a waitpid() call from IPC code triggered by an oop plugin test, browser_pluginCrashCommentAndURL.js. Interestingly, this hang doesn't happen if either browser_bug735471.js or browser_pluginCrashCommentAndURL.js is not run in the test suite (and, in other news, just these two tests alone + browser/components/preferences is enough to trigger the problem).

Now, with either of these removed, the b-c suite finishes, but browser_bug410900.js still doesn't work; it "cleanly" times out. However, even with the two pre-req tests out of the suite, the warnings from comment 12 still show up. BUT! Manually opening about:preferences after the suite finishes works ok, so I think the warnings might be just a red herring.

In any case, it looks like there are two problems that need to be battled here, but it's much clearer now how to separate them, which should help with actually fixing them.
Felipe thanks for the update!
Hi Felipe, Any updates on this?
[qa-] for this iteration, but I assume Joel will verify, in the end.
Whiteboard: p=13 s=it-30c-29a-28b.2 → p=13 s=it-30c-29a-28b.2 [qa-]
Hey Armen, I'm still working on this. I'm hoping I'll be able to make another stride on it tomorrow. I'll ping you or bhearsum to reactivate the VM then.
Carry over to Iteration it-30c-29a-28b.3
Whiteboard: p=13 s=it-30c-29a-28b.2 [qa-] → p=13 s=it-30c-29a-28b.3 [qa-]
fgomes, any news?
Hello, a new update. I now understand all the conditions that lead to the hang and why those 3 tests are involved. Here's a detailed summary.

Firefox hangs in this test right after printing this warning: "WARNING: waitpid failed pid:14683 errno:10:" http://hg.mozilla.org/mozilla-central/annotate/fc9947c00b51/ipc/chromium/src/base/process_util_posix.cc#l254

The pid is from the test plugin which was instantiated earlier to test crash reporting. The waitpid in question has WNOHANG, so it's not what hangs, but it probably tells us that something is about to go wrong (my guess is http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/child_process_host.cc#137). There are also a lot of IPC MessageChannel errors that might be relevant (log attached).

To cause the problem, the plugin has to be loaded before browser_pluginCrashCommentAndURL.js runs, and then be loaded again afterwards. This combination happens due to the other two tests mentioned in comment 14, because they both exercise about:preferences, which loads all plugins to populate the "Applications" tab. In sequence, what happens is:

- browser_bug735471.js runs first; it loads about:preferences, which iterates through the plugins at http://mxr.mozilla.org/mozilla-central/source/browser/components/preferences/in-content/applications.js#1032, loading them
- browser_pluginCrashCommentAndURL.js loads the test plugin to crash it and submit a crash report
- browser_bug410900.js runs later and is the next test to load about:preferences, which then iterates through all plugins again

An interesting point is that gApplicationsPane._loadData completes successfully (no exceptions, and it runs to completion). The hang happens at some point soon afterwards. The main question is why this only happens on these ec2 slaves. I don't know.
Here's a list of all plugins that are loaded on them, in case that could interfere:

LoadPlugin() /tmp/tmpqZhYDq/plugins/libnpsecondtest.so returned 4aa71a0
LoadPlugin() /tmp/tmpqZhYDq/plugins/libnptest.so returned 2e005d0
LoadPlugin() /usr/lib/mozilla/plugins/librhythmbox-itms-detection-plugin.so returned 2df30b0
LoadPlugin() /usr/lib/mozilla/plugins/libtotem-cone-plugin.so returned 2df8d60
LoadPlugin() /usr/lib/mozilla/plugins/libtotem-mully-plugin.so returned 50a9300
LoadPlugin() /usr/lib/mozilla/plugins/libtotem-gmp-plugin.so returned 50ae2b0
LoadPlugin() /usr/lib/mozilla/plugins/libtotem-narrowspace-plugin.so returned 50b6910

But they all seem to exist on the current tbpl Rev3 Fedora 12 boxes (plus libnptestjava.so). So, two things:

- It's very likely that this bug is somewhere in the ipc/chromium/plugins code, and not in the front-end preferences code, so I'm not the right person to move it forward from here.
- Disabling browser_pluginCrashCommentAndURL.js for ec2 debug seems sufficient to unblock the work of migrating these test jobs to ec2 (instead of having to disable the entire suite of in-content prefs tests).

Needinfo'ing bsmedberg to see what can be done to move this forward.
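[Editorial aside: the WNOHANG point above, that the warning-producing waitpid() cannot itself be what hangs, can be demonstrated with a small POSIX sketch. Python is used here purely for illustration; the real code is the C++ linked above.]

```python
import os
import signal
import time

# Fork a child that stays alive for a while.
pid = os.fork()
if pid == 0:
    time.sleep(30)
    os._exit(0)

# With WNOHANG, waitpid() never blocks: it returns (0, 0) while the
# child is still running, so a call like this cannot hang the suite.
reaped, status = os.waitpid(pid, os.WNOHANG)
assert reaped == 0  # child still alive; the call returned immediately

# Clean up: kill and reap the child for real.
os.kill(pid, signal.SIGKILL)
reaped, status = os.waitpid(pid, 0)
assert reaped == pid
```

So the warning is a symptom (the status went missing somewhere), not the hang itself.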
Flags: needinfo?(bmcbride) → needinfo?(benjamin)
Here's a short log from a browser-chrome run that only had these 3 tests included
Attachment #8388398 - Attachment is patch: false
Component: Preferences → Plug-ins
Product: Firefox → Core
Summary: browser_bug410900.js fails on ubuntu ec2 slaves → browser_pluginCrashCommentAndURL.js hangs ubuntu ec2 slaves in conjunction with browser_bug410900.js
The vital piece of information here is what the stack of the process is when it's hung. The log will usually tell us that, but the logs from 2013 at the top of this bug have already expired. Are there more recent logs which will have the stack dump of the hang?

errno 10 is ECHILD, which means we're calling waitpid on a child process that has already been reaped (or we passed a completely bogus pid to waitpid, but that seems unlikely). This might have something to do with the way PIDs are allocated to processes on EC2 instances, if they create new processes with the same PID as old processes quickly.

Otherwise nothing in that log looks suspicious: when a plugin process crashes, which is what this test is doing intentionally, we'd expect to get output like:

###!!! [Parent][MessageChannel::InterruptCall] Error: Channel error: cannot send/recv

What is the stack of the waitpid() call?
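[Editorial aside: the ECHILD behavior described here, waitpid failing on an already-reaped child, can be reproduced in a few lines. This is an illustrative Python sketch of the POSIX semantics, not Mozilla code.]

```python
import errno
import os

# Fork a child that exits immediately.
pid = os.fork()
if pid == 0:
    os._exit(0)

# The first waitpid() reaps the zombie and succeeds.
reaped, status = os.waitpid(pid, 0)
assert reaped == pid

# A second waitpid() on the same pid fails with ECHILD (errno 10):
# the child has already been reaped. This is exactly the
# "waitpid failed ... errno:10" warning in the logs.
try:
    os.waitpid(pid, 0)
    err = None
except ChildProcessError as e:
    err = e.errno

assert err == errno.ECHILD
```

The failing caller cannot tell whether it reaped the child earlier itself or a competing wait elsewhere in the process won the race; all it sees is errno 10.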
Flags: needinfo?(benjamin) → needinfo?(felipc)
I'm not entirely sure how to get the symbols loaded on these machines since the builds are not done there, so what I have is probably not very helpful, but here is the hung stack:

(gdb) bt
#0 0x00007f8b40d50d84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f8b3fb49e50 in PR_WaitCondVar () from /home/cltbld/m-c/firefox/libnspr4.so
#2 0x00007f8b3fb542a5 in ?? () from /home/cltbld/m-c/firefox/libnspr4.so
#3 0x00007f8b3a11da0a in ?? () from /home/cltbld/m-c/firefox/libxul.so
#4 0x00007f8b3a11dc47 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#5 0x00007f8b3a11ded7 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#6 0x00007f8b3a5db230 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#7 0x00007f8b3a5db6c1 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#8 0x00007f8b3a5dbb7b in ?? () from /home/cltbld/m-c/firefox/libxul.so
#9 0x00007f8b3a5dc632 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#10 0x00007f8b3a5d09d4 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#11 0x00007f8b3a124a52 in NS_InvokeByIndex () from /home/cltbld/m-c/firefox/libxul.so
#12 0x00007f8b3ab62de7 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#13 0x00007f8b3ab63cb5 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#14 0x00007f8b3ab64857 in ?? () from /home/cltbld/m-c/firefox/libxul.so
#15 0x00007f8b1491864c in ?? ()
#16 0x00007f8afc32f240 in ?? ()
#17 0x00007fff337d4a20 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb)

I'll see if jmaher can get (or tell me how to get) a full stacktrace for the waitpid call.
I saw bsmedberg's comment so I wrote a patch to enable these jobs on Elm: https://bugzilla.mozilla.org/attachment.cgi?id=8388595&action=edit I hope to have the job running between today and tomorrow.
Assignee: felipc → nobody
Status: ASSIGNED → NEW
fgomes, could you please figure out whether you can continue working on this with the information below, or find an owner before unassigning? We are pushing hard to move away from these ancient machines before the end of April.

fgomes, our unit tests download symbols on demand (only when we crash). You can look at the logs below to see how we retrieve and use them, e.g.:

06:00:59 INFO - Running command: ['unzip', '-q', '/builds/slave/test/build/symbols/firefox-30.0a1.en-US.linux-i686.crashreporter-symbols.zip'] in /builds/slave/test/build/symbols

I triggered Linux debug builds on Elm: https://tbpl.mozilla.org/?tree=Elm

We got these two logs:
https://tbpl.mozilla.org/php/getParsedLog.php?id=35939153&tree=Elm&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=35939091&tree=Elm&full=1
The stack from the log indicates we're stuck synchronously running an nsIProcess from here: nsOSHelperAppService::GetHandlerAndDescriptionFromMailcapFile. I suspect that we actually killed that process because of bug 943174 and because these instances appear to re-use PIDs much more aggressively than most machines do. Given that, we probably need an assignee for bug 943174.
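[Editorial aside: the PID-reuse hazard mentioned here can be sketched in illustrative Python, not Mozilla code. The `landed is False` outcome assumes the kernel has not yet recycled the pid; fast recycling is precisely the dangerous case bug 943174 concerns.]

```python
import os
import signal

# Fork and fully reap a child, so its pid is free for reuse.
pid = os.fork()
if pid == 0:
    os._exit(0)
os.waitpid(pid, 0)

# Signalling the stale pid normally fails with ESRCH...
try:
    os.kill(pid, signal.SIGKILL)
    landed = True
except ProcessLookupError:
    landed = False

# ...but if the kernel had already handed this pid to a new,
# unrelated process, the SIGKILL would land on that process instead.
assert landed is False
```

On machines that cycle through PIDs quickly, a kill() or waitpid() aimed at a process that already died can thus hit a live, unrelated process.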
I've found that if we move to 3 chunks we can run debug mochitest-browser-chrome on EC2. We just landed a change on m-i to disable the tests. The jobs on Cedar will be green once m-i merges there:
https://tbpl.mozilla.org/?tree=Cedar&jobname=Ubuntu%20VM%2012.04.*browser-chrome-

This bug stands as an issue when we don't chunk.

> jmaher: armenzg_buildduty: felipe's work is not needed if we chunk

If we run 3 chunks across the board we should be able to move away from the minis. I believe this bug could be closed if we're moving in that direction.
I am seeing browser_bug410900.js failing on bc1 on inbound. Odd that it wasn't failing on Cedar much.
this test has been disabled on Cedar and now that we are working on running these tests in chunks and on ec2 on inbound they are failing. Should we disable them on trunk based trees (inbound) or do we think this can be fixed in the next couple of days?
I am not assigned this bug nor do I have time to work on it. Why don't you ask :jld who is actually looking at it?
jld, see comment 30. I want to know how realistic it is to get this fixed, or whether we should temporarily disable it while you work on it.
The right test to disable here is browser_pluginCrashCommentAndURL.js, which causes the problem, not browser_bug410900.js. If we disable browser_bug410900.js, the failure will just move to a different test. I think we should disable it until it's fixed (probably by bug 943174), even if the problem is not seen in a chunked run, as we would just be waiting for another test to be affected by the problem again, which could be equally hard to debug and would lead to the same conclusion.
thanks, I have updated the patch in bug 982810 to reflect this.
I'm still trying to understand what's going on here — the cases I found in bug 943174 should apply only to content processes, not plugins. Also, the logs in this bug show many repeated calls to DidProcessCrash on the same dead process, which doesn't add up. I'm doing some debug-by-printf on try at the moment.
Okay, I know what's going on, at least for the case in my try run that produced the same kind of waitpid error spew seen here. It *is* a content process, and we're calling ContentParent::KillHard. That collects the zombie, and then the GeckoChildProcessHost dtor is called. As a hack to deal with Nuwa, ECHILD is treated as the process still existing, so it creates a ChildReaper. Because this is an NS_BUILD_REFCNT_LOGGING build, it's a ChildLaxReaper, which hangs around forever and reattempts a waitpid on that pid every time the parent gets SIGCHLD. So if there's a non-IPC child with the same pid at any point in the future of the test run, we'll steal its exit status and hang. I don't know how this squares with the previous observations that the complained-about pid was a plugin, though. In my case it seems to be a XUL <browser> from browser/base/content/test/social/browser_social_multiprovider.js, if I'm reading the logs right.
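[Editorial aside: the ECHILD-treated-as-still-existing hack described above can be sketched as follows. This is a hypothetical Python simplification; `did_process_crash_sketch` is an invented name, not the real DidProcessCrash.]

```python
import os

def did_process_crash_sketch(pid):
    """Loose sketch of the hack: ask whether our child is gone."""
    try:
        reaped, status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        # ECHILD: someone else already reaped this pid. The hack
        # treats this as "the process still exists", so the caller
        # spawns a reaper that retries on every SIGCHLD, forever.
        return False
    return reaped == pid

# A child that exits and is reaped elsewhere (the competing reaper
# wins the race for the exit status):
pid = os.fork()
if pid == 0:
    os._exit(0)
os.waitpid(pid, 0)  # the "other" reaper collects the zombie

# IPC now asks whether its child is gone, and gets the wrong answer:
result = did_process_crash_sketch(pid)
assert result is False  # the dead child is believed to still be alive
```

Once the pid is believed alive, a lingering reaper retrying waitpid on that numeric pid can later match an unrelated child that happens to reuse it.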
Can that be caused by either a content process or a plugin process? Social uses remote <browser>s, so it's definitely creating child processes. But it could be just the same problem manifesting itself in a different test. If you run the test suite with just the 3 tests mentioned in comment 21, it should happen.
(In reply to :Felipe Gomes from comment #37)
> Can that be caused by either a content process or a plugin process? Social
> uses remote <browser>s, so it's definitely creating child processes. But it
> could be just the same problem manifesting itself in a different test. If
> you run the test suite with just the 3 tests mentioned in comment 21, it
> should happen.

I can't reproduce that locally, and the only cases of 'waitpid failed' that appeared in my instrumented try run were content processes created for the social tests. Looking at the logs from comment #26:

[https://tbpl.mozilla.org/php/getParsedLog.php?id=35939153&tree=Elm&full=1]
06:11:13 INFO - --DOMWINDOW == 0 (0x1fa59d0) [pid = 2581] [serial = 3] [outer = (nil)] [url = http://example.com/#newtab_background_captures]
06:11:13 INFO - [Parent 2444] WARNING: waitpid failed pid:2581 errno:10: file /builds/slave/m-cen-l64-d-000000000000000000/build/ipc/chromium/src/base/process_util_posix.cc, line 254
06:13:00 INFO - ++DOMWINDOW == 1 (0x10a2000) [pid = 2681] [serial = 1] [outer = (nil)]
06:13:01 INFO - [Parent 2444] WARNING: waitpid failed pid:2681 errno:10: file /builds/slave/m-cen-l64-d-000000000000000000/build/ipc/chromium/src/base/process_util_posix.cc, line 254

And I think those are content processes?

[https://tbpl.mozilla.org/php/getParsedLog.php?id=35939153&tree=Elm&full=1]
06:12:08 INFO - Writing to log: /tmp/tmpPK4Gp8/runtests_leaks_plugin_pid2513.log
06:12:08 INFO - [Parent 2382] WARNING: waitpid failed pid:2513 errno:10: file /builds/slave/m-cen-lx-d-0000000000000000000/build/ipc/chromium/src/base/process_util_posix.cc, line 254
06:27:00 INFO - ==> process 2513 will purposefully crash
06:27:00 INFO - TEST-INFO | leakcheck | plugin process: deliberate crash and thus no leak log

But that definitely looks like a plugin, and I still don't understand how this would happen to a plugin process (or anything else that doesn't have an instance of ContentParent).
Thanks for your work on this bug! It looks like we are getting closer to figuring it out.
The first mention of PID 2513 is for a plugin process:

06:09:47 INFO - For application/x-test found plugin libnptest.so
06:09:48 INFO - WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /builds/slave/m-cen-lx-d-0000000000000000000/build/xpcom/base/nsTraceRefcnt.cpp, line 142

And this plugin intentionally crashes. Have you broken on DidProcessCrash when running a single test where a plugin intentionally crashes, such as dom/plugins/test/mochitest/test_crashing.html, to see if it gets hit and what the stack is?
(In reply to Benjamin Smedberg [:bsmedberg] from comment #40)
> And this plugin intentionally crashes. Have you broken on DidProcessCrash
> when running a single test where a plugin intentionally crashes such as
> dom/plugins/test/mochitest/test_crashing.html to see if it gets hit and what
> the stack is?

I see it being called once per process, in the ~GeckoChildProcessHost → EnsureProcessTerminated path, and there are no other calls to waitpid/waitid/wait4 on the same pid. This also holds for browser_pluginCrashCommentAndURL.js, when I run it.
I'm doing a try run with a patch that should avoid the known issues from bug 943174, and an abort() in the "waitpid failed" path: https://tbpl.mozilla.org/?tree=Try&rev=3ac2bb012d54 And it's crashing there, sometimes with plugins and apparently also with remote browsers (test_aboutmemory5.xul). But this is good, because future try runs with more instrumentation might yield some information.
…also: If that patch can be made acceptable and landed (not the abort(), but the other part), it might avoid the failures that this bug was filed for, because a nonexistent process will be treated as already dead, so code like ChildLaxReaper won't keep repeatedly trying to wait on that pid forever. There would still be a race condition, but it would be a lot smaller and might not drag in unrelated tests.
I've found the waitpid that's been stealing IPC's zombies. It's in NSPR: https://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/unix/uxproces.c#648 The first time a child process is created through NSPR, it installs a SIGCHLD handler that uses a pipe-to-self to wake up a thread that loops on waitpid(-1, ...) to collect all outstanding dead children. After that point, whenever an IPC child exits, there will be a race for the child's exit status. At some point afterwards, IPC loses the race for one of its children and sets off a ChildLaxReaper for that pid, and so on. The right solution is probably to fix NSPR so that it ignores other children. Note that this is in addition to the ContentParent problems from bug 943174.
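[Editorial aside: the zombie-stealing race described above is easy to demonstrate in miniature. Illustrative Python only; NSPR's actual reaper is the C code linked above.]

```python
import os

# "IPC" creates a child process.
pid = os.fork()
if pid == 0:
    os._exit(7)

# An NSPR-style reaper calls waitpid(-1, ...) and collects *any*
# dead child, including ones it did not create.
reaped, status = os.waitpid(-1, 0)
assert reaped == pid  # the global reaper just stole IPC's zombie

# When IPC later waits for its own child, the status is gone:
# it gets ECHILD, the "waitpid failed ... errno:10" from the logs.
try:
    os.waitpid(pid, 0)
    stolen = False
except ChildProcessError:
    stolen = True
assert stolen is True
```

This is why every IPC child exit becomes a race once NSPR's SIGCHLD handler is installed: whichever waiter runs first takes the exit status, and the loser sees ECHILD.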
Hey Jed, if you get another try build without the abort() (and anything else that you think might be useful), I can test it on the ec2 machines to see if it helps with the original problem that is only reproducible there
This is the patch. If this unbreaks those tests, then the change to DidProcessCrash would probably work by itself, but at minimum it needs the SetPriorityAndCheckIsAlive change (or something along those lines) to not break b2g.
Attachment #8391004 - Flags: feedback?(felipc)
Jed, thanks for getting the patch together, and Felipe, thanks for testing it and making sure it works for us! If we get this addressed, we should be ready to transition our linux debug tests. We only have some intermittent issues; this is the last persistent issue!
Comment on attachment 8391004 [details] [diff] [review]
bug933680-wip-hg0.diff

Review of attachment 8391004 [details] [diff] [review]:
-----------------------------------------------------------------

An unchunked browser-chrome run with this patch finished successfully twice on ec2. There were some failures, which I think are existing oranges, as they look unrelated to this problem. I also isolated the 3 tests mentioned in comment 21 and they passed.
Attachment #8391004 - Flags: feedback?(felipc) → feedback+
This is great news. Looking forward to a final patch!
we don't have anybody working on bug 687369 which appears to be the root cause. This will disable the plugin test case.
Attachment #8392794 - Flags: review?(felipc)
Attachment #8392794 - Flags: review?(felipc) → review+
After this landed I continue to see browser_bug410900.js failing on chunk bc1. Should I disable both tests, or enable browser_pluginCrashCommentAndURL.js and only disable browser_bug410900.js?
FYI, we recently gained the ability to push patches for this to try (bug 984480). I haven't tried what the exact syntax might be, though.
Based on previous comments here, my vote is to disable both tests, but I have a strong inclination towards only disabling browser_bug410900.js.
Hmm, I really don't know; I expected removing browser_pluginCrashCommentAndURL.js would fix the problem. Where can I see a log where that test was deactivated and browser_bug410900.js still failed?
Felipe, here is a log from inbound that shows browser_bug410900.js failing while not running browser_pluginCrashCommentAndURL.js: https://tbpl.mozilla.org/php/getParsedLog.php?id=36411871&tree=Mozilla-Inbound&full=1
:felipe, I would like to move on this bug, can you please provide feedback. Otherwise I will just have both tests disabled.
If disabling browser_bug410900.js makes it green, sure, let's do that. But I suspect the next test will then hang. From that log it does seem to be the same root cause as discussed here. If disabling browser_pluginCrashCommentAndURL.js wasn't enough to get rid of the problem, we will really need a real fix for this bug (some combination of bug 943174, bug 933680 comment 44, bug 678369).
Odd; I went to look at this and somehow we turned green on bc1 this morning. This looks to be the magic push:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=Ubuntu%20VM%2012.04%20mozilla-inbound%20debug%20test%20mochitest-browser-chrome-&rev=907d1ff0c7ab

Both linux32 and linux64 debug builds seem to be doing much better. We can let them work over the weekend and verify on Monday that all is green.
Cool. I don't think that's a coincidence, bug 943174 is in that push :) After it rests for a few days we can even attempt to re-enable browser_pluginCrashCommentAndURL.js and see how that goes.
Glad it is working as we think. On Monday, if all is well, I can push to try and see if we can re-enable browser_pluginCrashCommentAndURL.js!
Attachment #8396993 - Flags: review?(felipc) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/82622cb23446 I vote for removing the 'leave-open' tag.
Assignee: nobody → jmaher
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31