749010 - Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Release 13

Reporter

Description

•

13 years ago

+++ This bug was initially created as a clone of Bug #737244 +++ This crash originally happened intermittently, but it started happening 100% of the time on Linux64 opt builds in Aurora 13, continuing in Beta 13. The perma-orange started with https://hg.mozilla.org/releases/mozilla-aurora/rev/8d0f70b87fba (bug 736028). Requesting tracking-firefox13 because this is a new reproducible crash in Firefox 13. https://tbpl.mozilla.org/php/getParsedLog.php?id=11206374&tree=Mozilla-Beta Rev3 Fedora 12x64 mozilla-beta pgo test mochitest-other on 2012-04-25 15:17:39 PDT for push 7e8c20249a4f 13504 INFO TEST-START | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | Exited with code 1 during test run INFO | automation.py | Application ran for: 0:02:12.925341 INFO | automation.py | Reading PID log: /tmp/tmpx_VsOppidlog ==> process 2357 launched child process 2391 ==> process 2357 launched child process 2399 ==> process 2357 launched child process 2407 ==> process 2357 launched child process 2409 ==> process 2357 launched child process 2411 INFO | automation.py | Checking for orphan process with PID: 2391 INFO | automation.py | Checking for orphan process with PID: 2399 INFO | automation.py | Checking for orphan process with PID: 2407 INFO | automation.py | Checking for orphan process with PID: 2409 INFO | automation.py | Checking for orphan process with PID: 2411 Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-linux64/1335386323/firefox-13.0.en-US.linux-x86_64.crashreporter-symbols.zip PROCESS-CRASH | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | application crashed (minidump found) Crash dump filename: /tmp/tmpMB1Rf3/minidumps/36986f3b-61c9-5f43-4ee091f4-3f3d1a4d.dmp Operating system: Linux 0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64 CPU: amd64 family 6 model 23 stepping 10 2 CPUs Crash reason: SIGSEGV Crash address: 0x10 Thread 0 (crashed) 0 libpthread-2.11.so + 0x8db0 rbx = 0x00000000 r12 = 0x4909cff0 r13 = 0x4909cf9f r14 = 0xa8b5c010 r15 = 0x4909cff0 rip = 0xd3608db0 rsp = 0x4909cf28 rbp = 0x00000001 Found by: given as instruction pointer in context 1 libnspr4.so!PR_Lock [ptsynch.c:7e8c20249a4f : 206 + 0x4] rip = 0xc37e6f99 rsp = 0x4909cf30 Found by: stack scanning 2 libxul.so!mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats [Mutex.h : 106 + 0x8] rbx = 0xa8b5c000 rip = 0xc0bd74da rsp = 0x4909cf40 Found by: call frame info 3 libxul.so!WorkerMemoryReporter::GetExplicitNonHeap [WorkerPrivate.cpp:7e8c20249a4f : 223 + 0xe] rbx = 0x9415f700 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250 r15 = 0x4909cff0 rip = 0xc0bd77ab rsp = 0x4909cf90 rbp = 0x00000000 Found by: call frame info 4 libxul.so!nsMemoryReporterManager::GetExplicit [nsMemoryReporterManager.cpp:7e8c20249a4f : 805 + 0x9] rbx = 0x4909d3e0 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250 r15 = 0x4909cff0 rip = 0xc119eea2 rsp = 0x4909cfb0 rbp = 0x00000000 Found by: call frame info 5 libxul.so!NS_InvokeByIndex_P [xptcinvoke_x86_64_unix.cpp : 195 + 0xb] rbx = 0x4909d3e0 r12 = 0x00000000 r13 = 0x00000002 r14 = 0xc17841f4 r15 = 0xc2333bb8 rip = 0xc11a0d8a rsp = 0x4909d0d0 rbp = 0x4909d1b0 Found by: call frame info 6 libxul.so!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp:7e8c20249a4f : 3025 + 0x19] rbx = 0x4909d3e0 r12 = 0x4909d3e0 r13 = 0x4909d870 r14 = 0x4909d3a0 r15 = 0xc2333bb8 rip = 0xc0de61cb rsp = 0x4909d1c0 rbp = 0x00000001 Found by: call frame info 7 libxul.so!XPC_WN_GetterSetter [xpcprivate.h:7e8c20249a4f : 2660 + 0xc] rbx = 0x4909d870 r12 = 0x00000000 r13 = 0xb3aff0c0 r14 = 0xa853d700 r15 = 0x981a0fd0 rip = 0xc0ded719 rsp = 0x4909d840 rbp = 0xb3aff0b0 Found by: call frame info 8 libxul.so!js::InvokeKernel [jscntxtinlines.h:7e8c20249a4f : 314 + 0x8] rbx = 0x90d68400 r12 = 0xa853d700 r13 = 0x00000000 r14 = 0x00000000 r15 = 0x00000000 rip = 0xc14907d0 rsp = 0x4909d9a0 rbp = 0xb3aff0c0 Found by: call frame info 9 libxul.so!js::Invoke [jsinterp.h:7e8c20249a4f : 172 + 0xd] rbx = 0xa853d700 r12 = 0x4909db50 r13 = 0x4909db00 r14 = 0x4909da80 r15 = 0x00000000 rip = 0xc1490d65 rsp = 0x4909da70 rbp = 0xffffffff Found by: call frame info 10 libxul.so!js::InvokeGetterOrSetter [jsinterp.cpp:7e8c20249a4f : 635 + 0xb]

Matt Brubeck (:mbrubeck)

Reporter

Updated

•

13 years ago

Keywords: regression

Comment hidden (Legacy TBPL/Treeherder Robot)

Scoobidiver (away)

Updated

•

13 years ago

Crash Signature: [@ libpthread-2.11.so@0x0x8db0]

Jared Wein [:jaws] (please needinfo? me)

Comment 2

•

13 years ago

Is this crash only happening on Fx13?

Matt Brubeck (:mbrubeck)

Reporter

Comment 3

•

13 years ago

(In reply to Jared Wein [:jaws] from comment #2) > Is this crash only happening on Fx13? Yes.

Jared Wein [:jaws] (please needinfo? me)

Comment 4

•

13 years ago

I'm fine with backing this out of Fx13 and letting the patch ride the Fx14 train then. I'll talk with Asa later today (before I do the backout) to make sure that he knows about it since I know he really wanted it in for Fx13.

Assignee: nobody → jwein

Status: NEW → ASSIGNED

Matt Brubeck (:mbrubeck)

Reporter

Comment 5

•

13 years ago

It would also be nice to figure whether this crash is caused by 736028, or if it is an existing bug that was exposed in tests when bug 736028 landed. (Will backing out bug 736028 prevent users from seeing this crash, or just prevent it from showing up in our test suite?)

Comment hidden (Legacy TBPL/Treeherder Robot)

Alex Keybl [:akeybl]

Updated

•

13 years ago

tracking-firefox13: ? → +

Jared Wein [:jaws] (please needinfo? me)

Comment 10

•

13 years ago

Attached patch Backout of 736028 (obsolete) — Details — Splinter Review

[Approval Request Comment] Regression caused by (bug #): bug 736028 User impact if declined: Some crashes on Linux64 Testing completed (on m-c, etc.): The crashes went from intermittent to permanent when this patch went on to Fx13. These crashes have not been seen on Fx14 though, so the belief is that backing out this patch from Fx13 will reduce the crashes, but there has not been an investigation yet as to why this is causing crashes. Risk to taking this patch (and alternatives if risky): No risk expected, but thumbnail images will have a lower quality on the new tab page for Fx13. String changes made by this patch: none

Attachment #619725 - Flags: approval-mozilla-beta?

Comment hidden (Legacy TBPL/Treeherder Robot)

Alex Keybl [:akeybl]

Comment 12

•

13 years ago

(In reply to Jared Wein [:jaws] from comment #10) > Risk to taking this patch (and alternatives if risky): > No risk expected, but thumbnail images will have a lower quality on the new > tab page for Fx13. Have you spoken with Asa about backing out bug 736028? https://bugzilla.mozilla.org/show_bug.cgi?id=736028#c24 suggested he really wanted that change, and it's not yet clear that our beta users (which has a significantly higher Linux population than our release channel) are running into this problem.

Comment hidden (Legacy TBPL/Treeherder Robot)

Asa Dotzler [:asa]

Comment 14

•

13 years ago

I would prefer that we not back this out. The feature quality without it goes way down.

Comment hidden (Legacy TBPL/Treeherder Robot)

Lukas Blakk [:lsblakk] use ?needinfo

Comment 16

•

13 years ago

Comment on attachment 619725 [details] [diff] [review] Backout of 736028 a- for the backout due to Asa's request to keep the feature in, and also since there appears to be 0 user effect on crash-stats at present. We can revisit the option of backing out if evidence turns up to show users are being affected in larger numbers.

Attachment #619725 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

Jared Wein [:jaws] (please needinfo? me)

Updated

•

13 years ago

Assignee: jwein → nobody

Status: ASSIGNED → NEW

Lukas Blakk [:lsblakk] use ?needinfo

Comment 17

•

13 years ago

changing status to wontfix for 13 but leaving this bug open in case we need to revisit the backout during 13's cycle.

status-firefox13: affected → wontfix

Comment hidden (Legacy TBPL/Treeherder Robot)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 25

•

13 years ago

Why is this WONTFIX? If one of our tests starts failing permanently we need to do something about that.

status-firefox13: wontfix → affected

Phil Ringnalda (:philor)

Comment 26

•

13 years ago

And, because this is a crash (which stops the run, and kills reporting of what happened before it as well as not running what comes after it at all), keep in mind that this means we do not run mochitest-chrome on Linux-64 on 13, and we have not run it on 13 for three weeks.

Whiteboard: [orange][perma-orange] → [orange][perma-orange][test which aborts the suite]

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Updated

•

13 years ago

Depends on: 752585

Comment hidden (Legacy TBPL/Treeherder Robot)

Nicholas Nethercote [inactive]

Assignee

Comment 31

•

13 years ago

Jared, can you summarize the situation w.r.t. bug 736028 and why this crash occurs on 13 and not 14? I wrote this test but I think the crash doesn't really have anything to do with me, but I've found this whole bug horribly confusing. Thanks.

Jared Wein [:jaws] (please needinfo? me)

Comment 32

•

13 years ago

Nicholas, I'm not sure why this patch triggered the failures to occur and why they're not happening on 14. I can speculate that there are some other patches that landed, and the combination of them and the potentially larger memory that is consumed by the larger thumbnails introduced by bug 736028 results in this crash. Since this is not appearing on 14 I can only think that something landed during the 14 cycle that fixed this, but I'm not sure what that is.

Comment hidden (Legacy TBPL/Treeherder Robot)

Alex Keybl [:akeybl]

Comment 34

•

13 years ago

Jared - how would you like to attack this problem? Can we ask QA to bisect builds with the test case and get back to us with what build in FF14 fixed the issue?

Assignee: nobody → jaws

Comment hidden (Legacy TBPL/Treeherder Robot)

Jared Wein [:jaws] (please needinfo? me)

Comment 38

•

13 years ago

(In reply to Alex Keybl [:akeybl] from comment #34) > Jared - how would you like to attack this problem? Can we ask QA to bisect > builds with the test case and get back to us with what build in FF14 fixed > the issue? Yeah, that sounds like a good plan to me. Should this be reassigned to somebody in QA?

Comment hidden (Legacy TBPL/Treeherder Robot)

Alex Keybl [:akeybl]

Comment 40

•

13 years ago

(In reply to Jared Wein [:jaws] from comment #38) > Yeah, that sounds like a good plan to me. Should this be reassigned to > somebody in QA? I've added the qawanted and regressionwindow-wanted keywords (although we're really looking for a fix window on FF14). QA regularly triages through the tracked/qawanted bugs.

Keywords: qawanted, regressionwindow-wanted

Comment hidden (Legacy TBPL/Treeherder Robot)

Alex Keybl [:akeybl]

Comment 43

•

13 years ago

(In reply to Jared Wein [:jaws] from comment #38) > Yeah, that sounds like a good plan to me. Should this be reassigned to > somebody in QA? Would it be possible to provide QA with the associated testcase in a fashion that can be run as part of a bisection?

Comment hidden (Legacy TBPL/Treeherder Robot)

Jared Wein [:jaws] (please needinfo? me)

Comment 46

•

13 years ago

(In reply to Alex Keybl [:akeybl] from comment #43) > (In reply to Jared Wein [:jaws] from comment #38) > > Yeah, that sounds like a good plan to me. Should this be reassigned to > > somebody in QA? > > Would it be possible to provide QA with the associated testcase in a fashion > that can be run as part of a bisection? Yes, use Linux64 opt builds and run: TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make -C obj-dir mochitest-plain If test passes, then it is a GOOD revision. If the test crashes, then it is a BAD revision.

Comment hidden (Legacy TBPL/Treeherder Robot)

edmorley https://tbpl.mozilla.org/php/getParsedLog.php?id=11763242&tree=Mozilla-Beta Rev3 Fedora 12x64 mozilla-beta pgo test mochitest-other on 2012-05-15 06:46:39 slave: talos-r3-fed64-040 4482 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Must be in full-screen mode 4484 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen mode for script initiated key events for VK_CONVERT 4485 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen for VK_CONVERT press - got false, expected true 4486 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should receive MozShowFullScreenWarning for VK_CONVERT press - got false, expected true 4515 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen for VK_PAGE_DOWN press - got false, expected true 4517 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Must be in full-screen mode 4519 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen mode for script initiated key events for VK_END 4520 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen for VK_END press - got false, expected true 4522 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Must be in full-screen mode 4524 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen mode for script initiated key events for VK_HOME 4525 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen for VK_HOME press - got false, expected true 4527 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Must be in full-screen mode 4529 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen mode for script initiated key events for VK_LEFT 4530 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen for VK_LEFT press - got false, expected true 4532 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Must be in full-screen mode 4534 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen mode for script initiated key events for VK_UP 4535 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen for VK_UP press - got false, expected true 4545 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/tests/mochitest/chrome/test_dom_fullscreen_warning.xul | Should remain in full-screen for VK_DOWN press - got false, expected true TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | Exited with code 1 during test run PROCESS-CRASH | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | application crashed (minidump found) Thread 0 (crashed)

Comment hidden (Legacy TBPL/Treeherder Robot)

u279076

Comment 52

•

13 years ago

(In reply to Jared Wein [:jaws] from comment #46) > Yes, use Linux64 opt builds and run: > TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make > -C obj-dir mochitest-plain I'm not sure where I can get Linux64 opt builds from, in particular for the beta branch.

Comment hidden (Legacy TBPL/Treeherder Robot)

Jared Wein [:jaws] (please needinfo? me)

Comment 54

•

13 years ago

(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #52) > (In reply to Jared Wein [:jaws] from comment #46) > > Yes, use Linux64 opt builds and run: > > TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make > > -C obj-dir mochitest-plain > > I'm not sure where I can get Linux64 opt builds from, in particular for the > beta branch. I think these are what you're looking for: https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b1/linux-x86_64/en-US/ https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b2/linux-x86_64/en-US/ https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b3/linux-x86_64/en-US/

Phil Ringnalda (:philor)

Comment 55

•

13 years ago

Why? You want to know "what is the first build of 14 which has the fix for this problem?," but the first build of 14 which could have exhibited the problem, a build on inbound on 2012-04-12 when you landed the patch for bug 736028, does not exhibit the problem, so you know that the difference between non-permaorange 14 and permaorange 13 landed between when 13 merged from mozilla-central and then. I'd say that makes the steps * hg qimport https://bug736028.bugzilla.mozilla.org/attachment.cgi?id=614206 * hg up -r (whatever the tag is for when 13 merged away) * hg qpush * make -f client.mk build && TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make -C obj-dir mochitest-plain * hg qpop * hg up -r something else (Or more accurately, an hg bisect between the 13 merge point and something around when you pushed bug 736028, but I can't do the commands to bisect off the top of my head.)

Jared Wein [:jaws] (please needinfo? me)

Comment 56

•

13 years ago

I'll do a bisect tomorrow then if nobody else gets to it.

Comment hidden (Legacy TBPL/Treeherder Robot)

Alex Keybl [:akeybl]

Comment 58

•

13 years ago

(In reply to Jared Wein [:jaws] from comment #56) > I'll do a bisect tomorrow then if nobody else gets to it. Any updates? Let us know if you believe this can be deprioritized/untracked for whatever reason - our main concern is the user experience post-release, as well as the possibility of regressions being hidden behind this perma-orange.

Jared Wein [:jaws] (please needinfo? me)

Comment 59

•

13 years ago

Yeah I tried to do a bisect yesterday and today but I'm struggling with how to get the correct revisions to bisect between due to this being on a branch of mozilla-central.

Comment hidden (Legacy TBPL/Treeherder Robot)

Jared Wein [:jaws] (please needinfo? me)

Comment 64

•

13 years ago

Can I get help from somebody on doing the bisect? Patches landing on different branches has confused me as to how to get this to reliably work.

Comment hidden (Legacy TBPL/Treeherder Robot)

u279076

Comment 67

•

13 years ago

Removing QAWANTED as per discussion in Channel meeting today. Please re-add if there is something specific QA can do to help here.

Keywords: qawanted

Comment hidden (Legacy TBPL/Treeherder Robot)

Justin Dolske [:Dolske]

Comment 71

•

13 years ago

Hmmm... Looked at this with Jared, just to see if we could maybe figure out what's going wrong. The test seems completely unrelated to his patch (and was an existing orange), so it seems likely to me that this is a case of slightly changing some timing/memory and making an existing issue more frequent. [Which would imply there's no actual user impact, and that a backout wouldn't help with the underlying issue.] So here's what I was poking through... If njn or bent can shed some enlightenment here, it would be helpful. :) Stack for convenience (full version in comment 0): Thread 0 (crashed) 0 libpthread-2.11.so + 0x8db0 rbx = 0x00000000 r12 = 0x4909cff0 r13 = 0x4909cf9f r14 = 0xa8b5c010 r15 = 0x4909cff0 rip = 0xd3608db0 rsp = 0x4909cf28 rbp = 0x00000001 Found by: given as instruction pointer in context 1 libnspr4.so!PR_Lock [ptsynch.c:7e8c20249a4f : 206 + 0x4] rip = 0xc37e6f99 rsp = 0x4909cf30 Found by: stack scanning 2 WorkerPrivate::BlockAndCollectRuntimeStats [Mutex.h : 106 + 0x8] rbx = 0xa8b5c000 rip = 0xc0bd74da rsp = 0x4909cf40 Found by: call frame info 3 WorkerMemoryReporter::GetExplicitNonHeap [WorkerPrivate.cpp:7e8c20249a4f : 223 + 0xe] rbx = 0x9415f700 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250 r15 = 0x4909cff0 rip = 0xc0bd77ab rsp = 0x4909cf90 rbp = 0x00000000 Found by: call frame info Things appear happy going into frame 3, we're gathering memory reporter data, calling into a WorkerMemoryReporter. (Pretty sure we're actually in ::CollectForRuntime, because ::GetExplicitNonHeap is just a two-liner calling it... Opt build and inlined). 217 CollectForRuntime(bool aIsQuick, void* aData) 218 { 219 AssertIsOnMainThread(); 220 221 if (mWorkerPrivate) { 222 bool disabled; 223 if (!mWorkerPrivate->BlockAndCollectRuntimeStats(...)) { so we end up in: 2828 WorkerPrivate::BlockAndCollectRuntimeStats(...) 2829 { 2830 AssertIsOnMainThread(); 2831 NS_ASSERTION(aData, "Null data!"); 2832 2833 { 2834 MutexAutoLock lock(mMutex); 2835 2836 if (mMemoryReporterDisabled) { And we're crashing when the MutexAutoLock is accessing |mMutex| to lock [it's also possible we're at the MutexAutoLock further down during the same thing, but I'm guessing we're not actually making it that far). The area where my suspicion falls is on the lifetime of the WorkerPrivate vs the WorkerMemoryReporter (which has a raw pointer -- |mWorkerPrivate| -- for the associated WorkerPrivate). When the worker and it's WorkerPrivate go away, where is it ensuring that that the WorkerMemoryReporter is appropriately nuked? WorkerPrivate::DoRunLoop is the place creating and nuking the reporter. The reporter is refcounted, but the WorkerPrivate is not. DisableMemoryReporter() is called, but BlockAndCollectRuntimeStats() uses the lock before checking that. So here's what I think happens. My uncertainty is around exactly what gets destroyed/cleared when... 1) main thread (the test) gets ahold of a WorkerMemoryReporter (refcount++). -- context switch -- 2) DoRunLoop decides to quit. It nulls mMemoryReporter (refcount--), but the reporter stays alive because of #1 3) worker thread goes away, destructs its WorkerPrivate (didn't find exactly where that happens) -- context switch -- 4) main thread calls WorkerMemoryReporter::GetExplicitNonHeap (ok) 5) ...in turn calls CollectForRuntime (ok) 6) ...in turn calls mWorkerPrivate->BlockAndCollectRuntimeStats (woah! mWorkerPrivate is now use-after-free? Guess enough bits are around / valid for the method to be invoked successfully) 7) Trues to do the locking, but the mutex has already been kablooied. Let me know where I'm wrong, for I am just a simple JS hacker. :)

Comment hidden (Legacy TBPL/Treeherder Robot)

Nicholas Nethercote [inactive]

Assignee

Updated

•

13 years ago

Whiteboard: [orange][perma-orange][test which aborts the suite] → [see comment 71][orange][perma-orange][test which aborts the suite]

Comment hidden (Legacy TBPL/Treeherder Robot)

Nicholas Nethercote [inactive]

Assignee

Comment 78

•

13 years ago

> DisableMemoryReporter() is called, but BlockAndCollectRuntimeStats() uses > the lock before checking that. I *think* this is the crucial part. The disabling is achieved by recording some state in WorkerPrivate, but I think it needs to be recorded in the WorkerMemoryReporter instead, so it can be checked before WorkerPrivate::BlockAndCollectRuntimeStats() is called.

Nicholas Nethercote [inactive]

Assignee

Comment 79

•

13 years ago

Attached patch patch (obsolete) — Details — Splinter Review

This patch changes how the disabling of WorkerMemoryReporter happens. Previously, WorkerPrivate would (a) set mMemoryReporterDisabled. Then when (b) WorkerMemoryReporter called WorkerPrivate::BlockAndCollectRuntimeStats, a |disabled| outparam would be set, and WorkerMemoryReporter would check for that and set mWorkerPrivate to nsnull. If dolske is right, the WorkerPrivate could be destroyed between (a) and (b). Now, WorkerPrivate just calls WorkerMemoryReporter::Disable() which directly sets mWorkerPrivate to nsnull. I've cut out the middle-man. Some caveats: - I've never seen this crash locally, despite the fact that I run this test on my Linux64 box frequently. (I ran the test myself and it didn't crash, so that's a start.) And despite this bug's title, it only fails sporadically, so a try server run won't tell me much. So if someone who can reliably reproduce the crash (dolske, jaws?) could try this patch, that'd be great. - Also, I didn't write this code and don't really understand it and it makes me nervous and I am just making this stuff up as I go along...

Attachment #628193 - Flags: feedback?(jaws)

Attachment #628193 - Flags: feedback?(dolske)

Jared Wein [:jaws] (please needinfo? me)

Comment 80

•

13 years ago

(In reply to Nicholas Nethercote [:njn] from comment #79) > And despite this bug's title, it only fails > sporadically, > so a try server run won't tell me much. So if someone who can reliably > reproduce the crash (dolske, jaws?) could try this patch, that'd be great. Did you try it locally with an optimized build?

Nicholas Nethercote [inactive]

Assignee

Comment 81

•

13 years ago

I just tried an Linux64 opt build of mozilla-beta, and couldn't get it to crash. (This is without my patch.) So I tried a PGO build, and it failed with: make[16060]: vfork: Cannot allocate memory This on a 64-bit machine with 16GB of RAM. Hmph. Currently trying a push to mozilla-beta, with some magic incantations from philor to get PGO happening.

Phil Ringnalda (:philor)

Comment 82

•

13 years ago

https://tbpl.mozilla.org/?tree=Try&rev=9801753cb194 is looking both green, and PGO'd, but then I realized we didn't have a control for "can we repro on try without the patch?" so I pushed https://tbpl.mozilla.org/?tree=Try&rev=c06b6b58bae6 which I hope will wind up orange.

Nicholas Nethercote [inactive]

Assignee

Comment 83

•

13 years ago

Comment on attachment 628193 [details] [diff] [review] patch (In reply to Phil Ringnalda (:philor) from comment #82) > https://tbpl.mozilla.org/?tree=Try&rev=9801753cb194 is looking both green, > and PGO'd, but then I realized we didn't have a control for "can we repro on > try without the patch?" so I pushed > https://tbpl.mozilla.org/?tree=Try&rev=c06b6b58bae6 which I hope will wind > up orange. It did wind up orange. Good news! Time for a review... bent, I'd appreciate a very careful review because (a) I'm not that familiar with this code and (b) RACE CONDITIONS.

Attachment #628193 - Flags: review?(bent.mozilla)

Attachment #628193 - Flags: feedback?(jaws)

Attachment #628193 - Flags: feedback?(dolske)

Jared Wein [:jaws] (please needinfo? me)

Updated

•

13 years ago

Assignee: jaws → n.nethercote

Comment hidden (Legacy TBPL/Treeherder Robot)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 86

•

13 years ago

Comment on attachment 628193 [details] [diff] [review] patch Review of attachment 628193 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for digging in here! I think this fix is good, two nits and a DEBUG fix below should do it. ::: dom/workers/WorkerPrivate.cpp @@ +122,5 @@ > static_cast<typename ISupportsBaseInfo<T>::ISupportsBase*>(raw); > dest->swap(rawSupports); > } > > +} /* anonymous namespace */ Nit: Please just move WorkerMemoryReporter below the namespace block rather than splitting the block in two. @@ +142,5 @@ > cstats->extra = const_cast<char *>(name); > } > }; > > +class mozilla::dom::workers::WorkerMemoryReporter : public nsIMemoryMultiReporter Nit: I'd prefer a BEGIN_WORKERS_NAMESPACE/END_WORKERS_NAMESPACE block around this rather than having to spell it all out like this. @@ +242,5 @@ > } > + > + void Disable() > + { > + mWorkerPrivate = nsnull; This is only safe because we've locked the worker's mutex, and I'd like to assert that here. To do this I think we should make WorkerMemoryReporter a friend class of WorkerPrivate (#ifdef DEBUG only) and then call AssertCurrentThreadOwns. Sound good?

Nicholas Nethercote [inactive]

Assignee

Comment 87

•

13 years ago

Attached patch patch, v2 (obsolete) — Details — Splinter Review

> This is only safe because we've locked the worker's mutex, and I'd like to > assert that here. To do this I think we should make WorkerMemoryReporter a > friend class of WorkerPrivate (#ifdef DEBUG only) and then call > AssertCurrentThreadOwns. Sound good? I've done this (see attached patch) but I'm getting assertion failures like the following (https://tbpl.mozilla.org/php/getParsedLog.php?id=12376609&tree=Try). I can't reproduce them on my own Linux64 debug builds, unfortunately. TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/workers/test/test_chromeWorkerJSM.xul | Exited with code 1 during test run INFO | automation.py | Application ran for: 0:03:36.383366 INFO | automation.py | Reading PID log: /tmp/tmpLZ48pIpidlog ==> process 2161 launched child process 2189 ==> process 2161 launched child process 2215 ==> process 2161 launched child process 2222 ==> process 2161 launched child process 2224 ==> process 2161 launched child process 2226 INFO | automation.py | Checking for orphan process with PID: 2189 INFO | automation.py | Checking for orphan process with PID: 2215 INFO | automation.py | Checking for orphan process with PID: 2222 INFO | automation.py | Checking for orphan process with PID: 2224 INFO | automation.py | Checking for orphan process with PID: 2226 Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-4c6a523c293f/try-linux-debug/firefox-15.0a1.en-US.linux-i686.crashreporter-symbols.zip PROCESS-CRASH | chrome://mochitests/content/chrome/dom/workers/test/test_chromeWorkerJSM.xul | application crashed (minidump found) Crash dump filename: /tmp/tmpo7OmGS/minidumps/7b9f7c77-dbcf-528d-08ca8fc3-3db40cdd.dmp Operating system: Linux 0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686 CPU: x86 GenuineIntel family 6 model 23 stepping 10 2 CPUs Crash reason: SIGSEGV Crash address: 0x28 Thread 28 (crashed) 0 libxul.so!mozilla::dom::workers::WorkerPrivate::DisableMemoryReporter [Mutex.h : 108 + 0x0] eip = 0x0197bc25 esp = 0xa7bff0f4 ebp = 0xa7bff148 ebx = 0x02d8b528 esi = 0xb4c54530 edi = 0x00000001 eax = 0x00000000 ecx = 0x0a0db1e8 edx = 0xa7bff128 efl = 0x00010286 Found by: given as instruction pointer in context 1 libxul.so!mozilla::dom::workers::WorkerPrivate::DoRunLoop [WorkerPrivate.cpp : 2795 + 0x8] eip = 0x0197c395 esp = 0xa7bff150 ebp = 0xa7bff1f8 ebx = 0x02d8b528 esi = 0x0a0db1c8 edi = 0xa7bff1ac Found by: call frame info 2 libxul.so!WorkerThreadRunnable::Run [RuntimeService.cpp : 329 + 0x6] eip = 0x0196ab65 esp = 0xa7bff200 ebp = 0xa7bff248 ebx = 0x02d8b528 esi = 0xb4c53d18 edi = 0x0a0db1c8 Found by: call frame info

Attachment #619725 - Attachment is obsolete: true

Attachment #628193 - Attachment is obsolete: true

Attachment #628193 - Flags: review?(bent.mozilla)

Comment hidden (Legacy TBPL/Treeherder Robot)

Phil Ringnalda (:philor)

Updated

•

13 years ago

Summary: Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Beta 13 → Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Release 13

Comment hidden (Legacy TBPL/Treeherder Robot)

(no longer active)

Comment 96

•

13 years ago

What's the status of this bug? Nick, have you got a chance to work on this again since the last time you looked at it?

Nicholas Nethercote [inactive]

Assignee

Comment 97

•

13 years ago

The assertion on try that I couldn't reproduce locally was discouraging enough that I haven't, esp. given that I only half understand this code. I'll try to get back to it next week.

Jason Orendorff [:jorendorff]

Comment 98

•

13 years ago

I have a patch in bug 725907 that also causes this to become perma-orange. Nick's v1 patch fixes it. https://tbpl.mozilla.org/?tree=Try&pusher=jorendorff@mozilla.com You might have to click the green arrow several times; the relevant pushes here are: 80f70b78ac67 - Fri Jul 6 16:56:13 2012 PDT - green MOth 23371b36f13c - Fri Jul 6 16:27:10 2012 PDT - orange MOth Can we land v1 + nits, and add the problematic assert in a follow-up bug?

Jason Orendorff [:jorendorff]

Updated

•

13 years ago

Blocks: 725907

Nicholas Nethercote [inactive]

Assignee

Comment 99

•

13 years ago

I just looked at this again, quite closely. I don't understand how the assertion could cause a crash. We lock the WorkerPrivate's |mMutex| in WorkerPrivate::DisableMemoryReporter() immediately before calling WorkerMemoryReporter::Disable(), and then this line causes the crash: mWorkerPrivate->mMutex.AssertCurrentThreadOwns(); It's not clear to me exactly what causes the crash because there's lots of inlining going on even in debug builds, and I can't reproduce it locally. I see two possible ways to proceed. - We land my patch with the assertion disabled, cross our fingers, and file a follow-up bug. - We disable the worker memory reporter entirely, and file a follow-up bug to re-enable it. This is safer, but obviously a shame. In both cases, I don't know how progress would be made on the follow-up bug, since I wasn't able to make any progress in this one. Maybe bent could try. Preferences?

(no longer active)

Comment 100

•

13 years ago

(In reply to comment #99) > It's not clear to me exactly what causes the crash because there's lots of > inlining going on even in debug builds, and I can't reproduce it locally. Not sure if you're using our tinderbox debug builds, but those also have --enable-optimized which can explain this. Have you tried doing a build with --enable-debug --disable-optimized (sorry if I'm preaching to the choir :-)

Nicholas Nethercote [inactive]

Assignee

Comment 101

•

13 years ago

Attached patch (part 1) - Make WorkerMemoryReporter less crashy. (obsolete) — Details — Splinter Review

bent, here's the patch with the assertion disabled. I'll let you decide if that's reasonable, since you wrote this code originally. If you r- it I will write a patch that disables the worker memory reporter. Thanks.

Attachment #641353 - Flags: review?(bent.mozilla)

Nicholas Nethercote [inactive]

Assignee

Comment 102

•

13 years ago

Attached patch (part 2) - Move code around to merge two anonymous namespaces; no functional changes. — Details — Splinter Review

Trivial follow-up patch.

Attachment #641354 - Flags: review?(bent.mozilla)

Nicholas Nethercote [inactive]

Assignee

Updated

•

13 years ago

Attachment #630084 - Attachment is obsolete: true

Nicholas Nethercote [inactive]

Assignee

Comment 103

•

13 years ago

> Not sure if you're using our tinderbox debug builds, but those also have > --enable-optimized which can explain this. Have you tried doing a build > with --enable-debug --disable-optimized Good suggestion. I just tried debug Linux32 and Linux64 builds with this. The stack traces were only marginally better, and not in a way that gave me any more salient information. Crash reason: SIGSEGV Crash address: 0x28 Thread 27 (crashed) 0 libxul.so!mozilla::Mutex::AssertCurrentThreadOwns [Mutex.h : 108 + 0x3] eip = 0x01546c9b esp = 0xae086090 ebp = 0xae0860a8 ebx = 0x03fa57b4 esi = 0x0a2c41d0 edi = 0x02cd8990 eax = 0x00000020 ecx = 0xae086c04 edx = 0x087b8a20 efl = 0x00010206 Found by: given as instruction pointer in context 1 libxul.so!mozilla::dom::workers::WorkerMemoryReporter::Disable [WorkerPrivate.cpp : 260 + 0x10] eip = 0x02037087 esp = 0xae0860b0 ebp = 0xae0860c8 ebx = 0x03fa57b4 esi = 0x0a2c41d0 edi = 0x02cd8990 Found by: call frame info 2 libxul.so!mozilla::dom::workers::WorkerPrivate::DisableMemoryReporter [WorkerPrivate.cpp : 2944 + 0x17] eip = 0x0203d43d esp = 0xae0860d0 ebp = 0xae086108 ebx = 0x03fa57b4 esi = 0x0a2c41d0 edi = 0x02cd8990 Found by: call frame info I also tried a local build that used the exact mozconfig used on the test machines, but I still wasn't able to reproduce the crash :(

Nicholas Nethercote [inactive]

Assignee

Comment 104

•

13 years ago

Attached patch (part 1) - Make WorkerMemoryReporter less crashy. — Details — Splinter Review

I finally managed to catch this in a debugger. Just a missing NULL check on mWorkerPrivate!

Attachment #641676 - Flags: review?(bent.mozilla)

Nicholas Nethercote [inactive]

Assignee

Updated

•

13 years ago

Attachment #641353 - Attachment is obsolete: true

Attachment #641353 - Flags: review?(bent.mozilla)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 105

•

13 years ago

Comment on attachment 641676 [details] [diff] [review] (part 1) - Make WorkerMemoryReporter less crashy. I'm nervous that we don't know why this null check should be necessary... But if this helps unblock others then we should do it.

Attachment #641676 - Flags: review?(bent.mozilla) → review+

Ben Turner (not reading bugmail, use the needinfo flag!)

Updated

•

13 years ago

Attachment #641354 - Flags: review?(bent.mozilla) → review+

Nicholas Nethercote [inactive]

Assignee

Comment 106

•

13 years ago

> I'm nervous that we don't know why this null check should be necessary... > But if this helps unblock others then we should do it. It turns out the memory reporter can be disabled (for reasons I don't understand but aren't important) when CTypes are used by a worker. Then when the worker shuts down, Disable() is called again, whereupon mWorkerPrivate will already be NULL. So the NULL check is the right thing to do. Phew. I'll land this once my try server run has finished.

Nicholas Nethercote [inactive]

Assignee

Comment 107

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/7e78cbe9ea94 https://hg.mozilla.org/integration/mozilla-inbound/rev/bdc169cbbbac \o/

Jared Wein [:jaws] (please needinfo? me)

Comment 108

•

13 years ago

\o/

Ryan VanderMeulen [:RyanVM]

Comment 109

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/7e78cbe9ea94 https://hg.mozilla.org/mozilla-central/rev/bdc169cbbbac

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla16

Nicholas Nethercote [inactive]

Assignee

Updated

•

13 years ago

Depends on: 812935

Nobody; OK to take it and work on it

Updated

•

13 years ago

Keywords: intermittent-failure

Nobody; OK to take it and work on it

Updated

•

13 years ago

Whiteboard: [see comment 71][orange][perma-orange][test which aborts the suite] → [see comment 71][perma-orange][test which aborts the suite]

Ryan VanderMeulen [:RyanVM]

Updated

•

10 years ago

Keywords: regressionwindow-wanted

Backout of 736028 13 years ago Jared Wein [:jaws] (please needinfo? me) 4.39 KB, patch	lsblakk : approval-mozilla-beta-	Details \| Diff \| Splinter Review
patch 13 years ago Nicholas Nethercote [inactive] 7.81 KB, patch		Details \| Diff \| Splinter Review
patch, v2 13 years ago Nicholas Nethercote [inactive] 8.38 KB, patch		Details \| Diff \| Splinter Review
(part 1) - Make WorkerMemoryReporter less crashy. 13 years ago Nicholas Nethercote [inactive] 8.67 KB, patch		Details \| Diff \| Splinter Review
(part 2) - Move code around to merge two anonymous namespaces; no functional changes. 13 years ago Nicholas Nethercote [inactive] 77.48 KB, patch	bent.mozilla : review+	Details \| Diff \| Splinter Review
(part 1) - Make WorkerMemoryReporter less crashy. 13 years ago Nicholas Nethercote [inactive] 8.65 KB, patch	bent.mozilla : review+	Details \| Diff \| Splinter Review