Closed Bug 749010 Opened 13 years ago Closed 12 years ago

Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Release 13

Categories

(Core :: DOM: Workers, defect)

13 Branch
x86_64
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla16
Tracking Status
firefox12 --- unaffected
firefox13 + affected
firefox14 --- unaffected

People

(Reporter: mbrubeck, Assigned: n.nethercote)

References

Details

(4 keywords, Whiteboard: [see comment 71][perma-orange][test which aborts the suite])

Crash Data

Attachments

(2 files, 4 obsolete files)

+++ This bug was initially created as a clone of Bug #737244 +++ This crash originally happened intermittently, but it started happening 100% of the time on Linux64 opt builds in Aurora 13, continuing in Beta 13. The perma-orange started with https://hg.mozilla.org/releases/mozilla-aurora/rev/8d0f70b87fba (bug 736028). Requesting tracking-firefox13 because this is a new reproducible crash in Firefox 13. https://tbpl.mozilla.org/php/getParsedLog.php?id=11206374&tree=Mozilla-Beta Rev3 Fedora 12x64 mozilla-beta pgo test mochitest-other on 2012-04-25 15:17:39 PDT for push 7e8c20249a4f 13504 INFO TEST-START | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | Exited with code 1 during test run INFO | automation.py | Application ran for: 0:02:12.925341 INFO | automation.py | Reading PID log: /tmp/tmpx_VsOppidlog ==> process 2357 launched child process 2391 ==> process 2357 launched child process 2399 ==> process 2357 launched child process 2407 ==> process 2357 launched child process 2409 ==> process 2357 launched child process 2411 INFO | automation.py | Checking for orphan process with PID: 2391 INFO | automation.py | Checking for orphan process with PID: 2399 INFO | automation.py | Checking for orphan process with PID: 2407 INFO | automation.py | Checking for orphan process with PID: 2409 INFO | automation.py | Checking for orphan process with PID: 2411 Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-linux64/1335386323/firefox-13.0.en-US.linux-x86_64.crashreporter-symbols.zip PROCESS-CRASH | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | application crashed (minidump found) Crash dump filename: /tmp/tmpMB1Rf3/minidumps/36986f3b-61c9-5f43-4ee091f4-3f3d1a4d.dmp Operating system: Linux 0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64 CPU: amd64 family 6 model 23 stepping 10 2 CPUs Crash reason: SIGSEGV Crash address: 0x10 Thread 0 (crashed) 0 libpthread-2.11.so + 0x8db0 rbx = 0x00000000 r12 = 0x4909cff0 r13 = 0x4909cf9f r14 = 0xa8b5c010 r15 = 0x4909cff0 rip = 0xd3608db0 rsp = 0x4909cf28 rbp = 0x00000001 Found by: given as instruction pointer in context 1 libnspr4.so!PR_Lock [ptsynch.c:7e8c20249a4f : 206 + 0x4] rip = 0xc37e6f99 rsp = 0x4909cf30 Found by: stack scanning 2 libxul.so!mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats [Mutex.h : 106 + 0x8] rbx = 0xa8b5c000 rip = 0xc0bd74da rsp = 0x4909cf40 Found by: call frame info 3 libxul.so!WorkerMemoryReporter::GetExplicitNonHeap [WorkerPrivate.cpp:7e8c20249a4f : 223 + 0xe] rbx = 0x9415f700 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250 r15 = 0x4909cff0 rip = 0xc0bd77ab rsp = 0x4909cf90 rbp = 0x00000000 Found by: call frame info 4 libxul.so!nsMemoryReporterManager::GetExplicit [nsMemoryReporterManager.cpp:7e8c20249a4f : 805 + 0x9] rbx = 0x4909d3e0 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250 r15 = 0x4909cff0 rip = 0xc119eea2 rsp = 0x4909cfb0 rbp = 0x00000000 Found by: call frame info 5 libxul.so!NS_InvokeByIndex_P [xptcinvoke_x86_64_unix.cpp : 195 + 0xb] rbx = 0x4909d3e0 r12 = 0x00000000 r13 = 0x00000002 r14 = 0xc17841f4 r15 = 0xc2333bb8 rip = 0xc11a0d8a rsp = 0x4909d0d0 rbp = 0x4909d1b0 Found by: call frame info 6 libxul.so!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp:7e8c20249a4f : 3025 + 0x19] rbx = 0x4909d3e0 r12 = 0x4909d3e0 r13 = 0x4909d870 r14 = 0x4909d3a0 r15 = 0xc2333bb8 rip = 0xc0de61cb rsp = 0x4909d1c0 rbp = 0x00000001 Found by: call frame info 7 libxul.so!XPC_WN_GetterSetter [xpcprivate.h:7e8c20249a4f : 2660 + 0xc] rbx = 0x4909d870 r12 = 0x00000000 r13 = 0xb3aff0c0 r14 = 0xa853d700 r15 = 0x981a0fd0 rip = 0xc0ded719 rsp = 0x4909d840 rbp = 0xb3aff0b0 Found by: call frame info 8 libxul.so!js::InvokeKernel [jscntxtinlines.h:7e8c20249a4f : 314 + 0x8] rbx = 0x90d68400 r12 = 0xa853d700 r13 = 0x00000000 r14 = 0x00000000 r15 = 0x00000000 rip = 0xc14907d0 rsp = 0x4909d9a0 rbp = 0xb3aff0c0 Found by: call frame info 9 libxul.so!js::Invoke [jsinterp.h:7e8c20249a4f : 172 + 0xd] rbx = 0xa853d700 r12 = 0x4909db50 r13 = 0x4909db00 r14 = 0x4909da80 r15 = 0x00000000 rip = 0xc1490d65 rsp = 0x4909da70 rbp = 0xffffffff Found by: call frame info 10 libxul.so!js::InvokeGetterOrSetter [jsinterp.cpp:7e8c20249a4f : 635 + 0xb]
Keywords: regression
Crash Signature: [@ libpthread-2.11.so@0x0x8db0]
Is this crash only happening on Fx13?
(In reply to Jared Wein [:jaws] from comment #2) > Is this crash only happening on Fx13? Yes.
I'm fine with backing this out of Fx13 and letting the patch ride the Fx14 train then. I'll talk with Asa later today (before I do the backout) to make sure that he knows about it since I know he really wanted it in for Fx13.
Assignee: nobody → jwein
Status: NEW → ASSIGNED
It would also be nice to figure whether this crash is caused by 736028, or if it is an existing bug that was exposed in tests when bug 736028 landed. (Will backing out bug 736028 prevent users from seeing this crash, or just prevent it from showing up in our test suite?)
Attached patch Backout of 736028 (obsolete) — Splinter Review
[Approval Request Comment] Regression caused by (bug #): bug 736028 User impact if declined: Some crashes on Linux64 Testing completed (on m-c, etc.): The crashes went from intermittent to permanent when this patch went on to Fx13. These crashes have not been seen on Fx14 though, so the belief is that backing out this patch from Fx13 will reduce the crashes, but there has not been an investigation yet as to why this is causing crashes. Risk to taking this patch (and alternatives if risky): No risk expected, but thumbnail images will have a lower quality on the new tab page for Fx13. String changes made by this patch: none
Attachment #619725 - Flags: approval-mozilla-beta?
(In reply to Jared Wein [:jaws] from comment #10) > Risk to taking this patch (and alternatives if risky): > No risk expected, but thumbnail images will have a lower quality on the new > tab page for Fx13. Have you spoken with Asa about backing out bug 736028? https://bugzilla.mozilla.org/show_bug.cgi?id=736028#c24 suggested he really wanted that change, and it's not yet clear that our beta users (which has a significantly higher Linux population than our release channel) are running into this problem.
I would prefer that we not back this out. The feature quality without it goes way down.
Comment on attachment 619725 [details] [diff] [review] Backout of 736028 a- for the backout due to Asa's request to keep the feature in, and also since there appears to be 0 user effect on crash-stats at present. We can revisit the option of backing out if evidence turns up to show users are being affected in larger numbers.
Attachment #619725 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Assignee: jwein → nobody
Status: ASSIGNED → NEW
changing status to wontfix for 13 but leaving this bug open in case we need to revisit the backout during 13's cycle.
Why is this WONTFIX? If one of our tests starts failing permanently we need to do something about that.
And, because this is a crash (which stops the run, and kills reporting of what happened before it as well as not running what comes after it at all), keep in mind that this means we do not run mochitest-chrome on Linux-64 on 13, and we have not run it on 13 for three weeks.
Whiteboard: [orange][perma-orange] → [orange][perma-orange][test which aborts the suite]
Jared, can you summarize the situation w.r.t. bug 736028 and why this crash occurs on 13 and not 14? I wrote this test but I think the crash doesn't really have anything to do with me, but I've found this whole bug horribly confusing. Thanks.
Nicholas, I'm not sure why this patch triggered the failures to occur and why they're not happening on 14. I can speculate that there are some other patches that landed, and the combination of them and the potentially larger memory that is consumed by the larger thumbnails introduced by bug 736028 results in this crash. Since this is not appearing on 14 I can only think that something landed during the 14 cycle that fixed this, but I'm not sure what that is.
Jared - how would you like to attack this problem? Can we ask QA to bisect builds with the test case and get back to us with what build in FF14 fixed the issue?
Assignee: nobody → jaws
(In reply to Alex Keybl [:akeybl] from comment #34) > Jared - how would you like to attack this problem? Can we ask QA to bisect > builds with the test case and get back to us with what build in FF14 fixed > the issue? Yeah, that sounds like a good plan to me. Should this be reassigned to somebody in QA?
(In reply to Jared Wein [:jaws] from comment #38) > Yeah, that sounds like a good plan to me. Should this be reassigned to > somebody in QA? I've added the qawanted and regressionwindow-wanted keywords (although we're really looking for a fix window on FF14). QA regularly triages through the tracked/qawanted bugs.
(In reply to Jared Wein [:jaws] from comment #38) > Yeah, that sounds like a good plan to me. Should this be reassigned to > somebody in QA? Would it be possible to provide QA with the associated testcase in a fashion that can be run as part of a bisection?
(In reply to Alex Keybl [:akeybl] from comment #43) > (In reply to Jared Wein [:jaws] from comment #38) > > Yeah, that sounds like a good plan to me. Should this be reassigned to > > somebody in QA? > > Would it be possible to provide QA with the associated testcase in a fashion > that can be run as part of a bisection? Yes, use Linux64 opt builds and run: TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make -C obj-dir mochitest-plain If test passes, then it is a GOOD revision. If the test crashes, then it is a BAD revision.
(In reply to Jared Wein [:jaws] from comment #46) > Yes, use Linux64 opt builds and run: > TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make > -C obj-dir mochitest-plain I'm not sure where I can get Linux64 opt builds from, in particular for the beta branch.
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #52) > (In reply to Jared Wein [:jaws] from comment #46) > > Yes, use Linux64 opt builds and run: > > TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make > > -C obj-dir mochitest-plain > > I'm not sure where I can get Linux64 opt builds from, in particular for the > beta branch. I think these are what you're looking for: https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b1/linux-x86_64/en-US/ https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b2/linux-x86_64/en-US/ https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b3/linux-x86_64/en-US/
Why? You want to know "what is the first build of 14 which has the fix for this problem?," but the first build of 14 which could have exhibited the problem, a build on inbound on 2012-04-12 when you landed the patch for bug 736028, does not exhibit the problem, so you know that the difference between non-permaorange 14 and permaorange 13 landed between when 13 merged from mozilla-central and then. I'd say that makes the steps * hg qimport https://bug736028.bugzilla.mozilla.org/attachment.cgi?id=614206 * hg up -r (whatever the tag is for when 13 merged away) * hg qpush * make -f client.mk build && TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make -C obj-dir mochitest-plain * hg qpop * hg up -r something else (Or more accurately, an hg bisect between the 13 merge point and something around when you pushed bug 736028, but I can't do the commands to bisect off the top of my head.)
I'll do a bisect tomorrow then if nobody else gets to it.
(In reply to Jared Wein [:jaws] from comment #56) > I'll do a bisect tomorrow then if nobody else gets to it. Any updates? Let us know if you believe this can be deprioritized/untracked for whatever reason - our main concern is the user experience post-release, as well as the possibility of regressions being hidden behind this perma-orange.
Yeah I tried to do a bisect yesterday and today but I'm struggling with how to get the correct revisions to bisect between due to this being on a branch of mozilla-central.
Can I get help from somebody on doing the bisect? Patches landing on different branches has confused me as to how to get this to reliably work.
Removing QAWANTED as per discussion in Channel meeting today. Please re-add if there is something specific QA can do to help here.
Keywords: qawanted
Hmmm... Looked at this with Jared, just to see if we could maybe figure out what's going wrong. The test seems completely unrelated to his patch (and was an existing orange), so it seems likely to me that this is a case of slightly changing some timing/memory and making an existing issue more frequent. [Which would imply there's no actual user impact, and that a backout wouldn't help with the underlying issue.] So here's what I was poking through... If njn or bent can shed some enlightenment here, it would be helpful. :) Stack for convenience (full version in comment 0): Thread 0 (crashed) 0 libpthread-2.11.so + 0x8db0 rbx = 0x00000000 r12 = 0x4909cff0 r13 = 0x4909cf9f r14 = 0xa8b5c010 r15 = 0x4909cff0 rip = 0xd3608db0 rsp = 0x4909cf28 rbp = 0x00000001 Found by: given as instruction pointer in context 1 libnspr4.so!PR_Lock [ptsynch.c:7e8c20249a4f : 206 + 0x4] rip = 0xc37e6f99 rsp = 0x4909cf30 Found by: stack scanning 2 WorkerPrivate::BlockAndCollectRuntimeStats [Mutex.h : 106 + 0x8] rbx = 0xa8b5c000 rip = 0xc0bd74da rsp = 0x4909cf40 Found by: call frame info 3 WorkerMemoryReporter::GetExplicitNonHeap [WorkerPrivate.cpp:7e8c20249a4f : 223 + 0xe] rbx = 0x9415f700 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250 r15 = 0x4909cff0 rip = 0xc0bd77ab rsp = 0x4909cf90 rbp = 0x00000000 Found by: call frame info Things appear happy going into frame 3, we're gathering memory reporter data, calling into a WorkerMemoryReporter. (Pretty sure we're actually in ::CollectForRuntime, because ::GetExplicitNonHeap is just a two-liner calling it... Opt build and inlined). 217 CollectForRuntime(bool aIsQuick, void* aData) 218 { 219 AssertIsOnMainThread(); 220 221 if (mWorkerPrivate) { 222 bool disabled; 223 if (!mWorkerPrivate->BlockAndCollectRuntimeStats(...)) { so we end up in: 2828 WorkerPrivate::BlockAndCollectRuntimeStats(...) 2829 { 2830 AssertIsOnMainThread(); 2831 NS_ASSERTION(aData, "Null data!"); 2832 2833 { 2834 MutexAutoLock lock(mMutex); 2835 2836 if (mMemoryReporterDisabled) { And we're crashing when the MutexAutoLock is accessing |mMutex| to lock [it's also possible we're at the MutexAutoLock further down during the same thing, but I'm guessing we're not actually making it that far). The area where my suspicion falls is on the lifetime of the WorkerPrivate vs the WorkerMemoryReporter (which has a raw pointer -- |mWorkerPrivate| -- for the associated WorkerPrivate). When the worker and it's WorkerPrivate go away, where is it ensuring that that the WorkerMemoryReporter is appropriately nuked? WorkerPrivate::DoRunLoop is the place creating and nuking the reporter. The reporter is refcounted, but the WorkerPrivate is not. DisableMemoryReporter() is called, but BlockAndCollectRuntimeStats() uses the lock before checking that. So here's what I think happens. My uncertainty is around exactly what gets destroyed/cleared when... 1) main thread (the test) gets ahold of a WorkerMemoryReporter (refcount++). -- context switch -- 2) DoRunLoop decides to quit. It nulls mMemoryReporter (refcount--), but the reporter stays alive because of #1 3) worker thread goes away, destructs its WorkerPrivate (didn't find exactly where that happens) -- context switch -- 4) main thread calls WorkerMemoryReporter::GetExplicitNonHeap (ok) 5) ...in turn calls CollectForRuntime (ok) 6) ...in turn calls mWorkerPrivate->BlockAndCollectRuntimeStats (woah! mWorkerPrivate is now use-after-free? Guess enough bits are around / valid for the method to be invoked successfully) 7) Trues to do the locking, but the mutex has already been kablooied. Let me know where I'm wrong, for I am just a simple JS hacker. :)
Whiteboard: [orange][perma-orange][test which aborts the suite] → [see comment 71][orange][perma-orange][test which aborts the suite]
> DisableMemoryReporter() is called, but BlockAndCollectRuntimeStats() uses > the lock before checking that. I *think* this is the crucial part. The disabling is achieved by recording some state in WorkerPrivate, but I think it needs to be recorded in the WorkerMemoryReporter instead, so it can be checked before WorkerPrivate::BlockAndCollectRuntimeStats() is called.
Attached patch patch (obsolete) — Splinter Review
This patch changes how the disabling of WorkerMemoryReporter happens. Previously, WorkerPrivate would (a) set mMemoryReporterDisabled. Then when (b) WorkerMemoryReporter called WorkerPrivate::BlockAndCollectRuntimeStats, a |disabled| outparam would be set, and WorkerMemoryReporter would check for that and set mWorkerPrivate to nsnull. If dolske is right, the WorkerPrivate could be destroyed between (a) and (b). Now, WorkerPrivate just calls WorkerMemoryReporter::Disable() which directly sets mWorkerPrivate to nsnull. I've cut out the middle-man. Some caveats: - I've never seen this crash locally, despite the fact that I run this test on my Linux64 box frequently. (I ran the test myself and it didn't crash, so that's a start.) And despite this bug's title, it only fails sporadically, so a try server run won't tell me much. So if someone who can reliably reproduce the crash (dolske, jaws?) could try this patch, that'd be great. - Also, I didn't write this code and don't really understand it and it makes me nervous and I am just making this stuff up as I go along...
Attachment #628193 - Flags: feedback?(jaws)
Attachment #628193 - Flags: feedback?(dolske)
(In reply to Nicholas Nethercote [:njn] from comment #79) > And despite this bug's title, it only fails > sporadically, > so a try server run won't tell me much. So if someone who can reliably > reproduce the crash (dolske, jaws?) could try this patch, that'd be great. Did you try it locally with an optimized build?
I just tried an Linux64 opt build of mozilla-beta, and couldn't get it to crash. (This is without my patch.) So I tried a PGO build, and it failed with: make[16060]: vfork: Cannot allocate memory This on a 64-bit machine with 16GB of RAM. Hmph. Currently trying a push to mozilla-beta, with some magic incantations from philor to get PGO happening.
https://tbpl.mozilla.org/?tree=Try&rev=9801753cb194 is looking both green, and PGO'd, but then I realized we didn't have a control for "can we repro on try without the patch?" so I pushed https://tbpl.mozilla.org/?tree=Try&rev=c06b6b58bae6 which I hope will wind up orange.
Comment on attachment 628193 [details] [diff] [review] patch (In reply to Phil Ringnalda (:philor) from comment #82) > https://tbpl.mozilla.org/?tree=Try&rev=9801753cb194 is looking both green, > and PGO'd, but then I realized we didn't have a control for "can we repro on > try without the patch?" so I pushed > https://tbpl.mozilla.org/?tree=Try&rev=c06b6b58bae6 which I hope will wind > up orange. It did wind up orange. Good news! Time for a review... bent, I'd appreciate a very careful review because (a) I'm not that familiar with this code and (b) RACE CONDITIONS.
Attachment #628193 - Flags: review?(bent.mozilla)
Attachment #628193 - Flags: feedback?(jaws)
Attachment #628193 - Flags: feedback?(dolske)
Assignee: jaws → n.nethercote
Comment on attachment 628193 [details] [diff] [review] patch Review of attachment 628193 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for digging in here! I think this fix is good, two nits and a DEBUG fix below should do it. ::: dom/workers/WorkerPrivate.cpp @@ +122,5 @@ > static_cast<typename ISupportsBaseInfo<T>::ISupportsBase*>(raw); > dest->swap(rawSupports); > } > > +} /* anonymous namespace */ Nit: Please just move WorkerMemoryReporter below the namespace block rather than splitting the block in two. @@ +142,5 @@ > cstats->extra = const_cast<char *>(name); > } > }; > > +class mozilla::dom::workers::WorkerMemoryReporter : public nsIMemoryMultiReporter Nit: I'd prefer a BEGIN_WORKERS_NAMESPACE/END_WORKERS_NAMESPACE block around this rather than having to spell it all out like this. @@ +242,5 @@ > } > + > + void Disable() > + { > + mWorkerPrivate = nsnull; This is only safe because we've locked the worker's mutex, and I'd like to assert that here. To do this I think we should make WorkerMemoryReporter a friend class of WorkerPrivate (#ifdef DEBUG only) and then call AssertCurrentThreadOwns. Sound good?
Attached patch patch, v2 (obsolete) — Splinter Review
> This is only safe because we've locked the worker's mutex, and I'd like to > assert that here. To do this I think we should make WorkerMemoryReporter a > friend class of WorkerPrivate (#ifdef DEBUG only) and then call > AssertCurrentThreadOwns. Sound good? I've done this (see attached patch) but I'm getting assertion failures like the following (https://tbpl.mozilla.org/php/getParsedLog.php?id=12376609&tree=Try). I can't reproduce them on my own Linux64 debug builds, unfortunately. TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/workers/test/test_chromeWorkerJSM.xul | Exited with code 1 during test run INFO | automation.py | Application ran for: 0:03:36.383366 INFO | automation.py | Reading PID log: /tmp/tmpLZ48pIpidlog ==> process 2161 launched child process 2189 ==> process 2161 launched child process 2215 ==> process 2161 launched child process 2222 ==> process 2161 launched child process 2224 ==> process 2161 launched child process 2226 INFO | automation.py | Checking for orphan process with PID: 2189 INFO | automation.py | Checking for orphan process with PID: 2215 INFO | automation.py | Checking for orphan process with PID: 2222 INFO | automation.py | Checking for orphan process with PID: 2224 INFO | automation.py | Checking for orphan process with PID: 2226 Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-4c6a523c293f/try-linux-debug/firefox-15.0a1.en-US.linux-i686.crashreporter-symbols.zip PROCESS-CRASH | chrome://mochitests/content/chrome/dom/workers/test/test_chromeWorkerJSM.xul | application crashed (minidump found) Crash dump filename: /tmp/tmpo7OmGS/minidumps/7b9f7c77-dbcf-528d-08ca8fc3-3db40cdd.dmp Operating system: Linux 0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686 CPU: x86 GenuineIntel family 6 model 23 stepping 10 2 CPUs Crash reason: SIGSEGV Crash address: 0x28 Thread 28 (crashed) 0 libxul.so!mozilla::dom::workers::WorkerPrivate::DisableMemoryReporter [Mutex.h : 108 + 0x0] eip = 0x0197bc25 esp = 0xa7bff0f4 ebp = 0xa7bff148 ebx = 0x02d8b528 esi = 0xb4c54530 edi = 0x00000001 eax = 0x00000000 ecx = 0x0a0db1e8 edx = 0xa7bff128 efl = 0x00010286 Found by: given as instruction pointer in context 1 libxul.so!mozilla::dom::workers::WorkerPrivate::DoRunLoop [WorkerPrivate.cpp : 2795 + 0x8] eip = 0x0197c395 esp = 0xa7bff150 ebp = 0xa7bff1f8 ebx = 0x02d8b528 esi = 0x0a0db1c8 edi = 0xa7bff1ac Found by: call frame info 2 libxul.so!WorkerThreadRunnable::Run [RuntimeService.cpp : 329 + 0x6] eip = 0x0196ab65 esp = 0xa7bff200 ebp = 0xa7bff248 ebx = 0x02d8b528 esi = 0xb4c53d18 edi = 0x0a0db1c8 Found by: call frame info
Attachment #619725 - Attachment is obsolete: true
Attachment #628193 - Attachment is obsolete: true
Attachment #628193 - Flags: review?(bent.mozilla)
Summary: Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Beta 13 → Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Release 13
What's the status of this bug? Nick, have you got a chance to work on this again since the last time you looked at it?
The assertion on try that I couldn't reproduce locally was discouraging enough that I haven't, esp. given that I only half understand this code. I'll try to get back to it next week.
I have a patch in bug 725907 that also causes this to become perma-orange. Nick's v1 patch fixes it. https://tbpl.mozilla.org/?tree=Try&pusher=jorendorff@mozilla.com You might have to click the green arrow several times; the relevant pushes here are: 80f70b78ac67 - Fri Jul 6 16:56:13 2012 PDT - green MOth 23371b36f13c - Fri Jul 6 16:27:10 2012 PDT - orange MOth Can we land v1 + nits, and add the problematic assert in a follow-up bug?
I just looked at this again, quite closely. I don't understand how the assertion could cause a crash. We lock the WorkerPrivate's |mMutex| in WorkerPrivate::DisableMemoryReporter() immediately before calling WorkerMemoryReporter::Disable(), and then this line causes the crash: mWorkerPrivate->mMutex.AssertCurrentThreadOwns(); It's not clear to me exactly what causes the crash because there's lots of inlining going on even in debug builds, and I can't reproduce it locally. I see two possible ways to proceed. - We land my patch with the assertion disabled, cross our fingers, and file a follow-up bug. - We disable the worker memory reporter entirely, and file a follow-up bug to re-enable it. This is safer, but obviously a shame. In both cases, I don't know how progress would be made on the follow-up bug, since I wasn't able to make any progress in this one. Maybe bent could try. Preferences?
(In reply to comment #99) > It's not clear to me exactly what causes the crash because there's lots of > inlining going on even in debug builds, and I can't reproduce it locally. Not sure if you're using our tinderbox debug builds, but those also have --enable-optimized which can explain this. Have you tried doing a build with --enable-debug --disable-optimized (sorry if I'm preaching to the choir :-)
bent, here's the patch with the assertion disabled. I'll let you decide if that's reasonable, since you wrote this code originally. If you r- it I will write a patch that disables the worker memory reporter. Thanks.
Attachment #641353 - Flags: review?(bent.mozilla)
Attachment #630084 - Attachment is obsolete: true
> Not sure if you're using our tinderbox debug builds, but those also have > --enable-optimized which can explain this. Have you tried doing a build > with --enable-debug --disable-optimized Good suggestion. I just tried debug Linux32 and Linux64 builds with this. The stack traces were only marginally better, and not in a way that gave me any more salient information. Crash reason: SIGSEGV Crash address: 0x28 Thread 27 (crashed) 0 libxul.so!mozilla::Mutex::AssertCurrentThreadOwns [Mutex.h : 108 + 0x3] eip = 0x01546c9b esp = 0xae086090 ebp = 0xae0860a8 ebx = 0x03fa57b4 esi = 0x0a2c41d0 edi = 0x02cd8990 eax = 0x00000020 ecx = 0xae086c04 edx = 0x087b8a20 efl = 0x00010206 Found by: given as instruction pointer in context 1 libxul.so!mozilla::dom::workers::WorkerMemoryReporter::Disable [WorkerPrivate.cpp : 260 + 0x10] eip = 0x02037087 esp = 0xae0860b0 ebp = 0xae0860c8 ebx = 0x03fa57b4 esi = 0x0a2c41d0 edi = 0x02cd8990 Found by: call frame info 2 libxul.so!mozilla::dom::workers::WorkerPrivate::DisableMemoryReporter [WorkerPrivate.cpp : 2944 + 0x17] eip = 0x0203d43d esp = 0xae0860d0 ebp = 0xae086108 ebx = 0x03fa57b4 esi = 0x0a2c41d0 edi = 0x02cd8990 Found by: call frame info I also tried a local build that used the exact mozconfig used on the test machines, but I still wasn't able to reproduce the crash :(
I finally managed to catch this in a debugger. Just a missing NULL check on mWorkerPrivate!
Attachment #641676 - Flags: review?(bent.mozilla)
Attachment #641353 - Attachment is obsolete: true
Attachment #641353 - Flags: review?(bent.mozilla)
Comment on attachment 641676 [details] [diff] [review] (part 1) - Make WorkerMemoryReporter less crashy. I'm nervous that we don't know why this null check should be necessary... But if this helps unblock others then we should do it.
Attachment #641676 - Flags: review?(bent.mozilla) → review+
Attachment #641354 - Flags: review?(bent.mozilla) → review+
> I'm nervous that we don't know why this null check should be necessary... > But if this helps unblock others then we should do it. It turns out the memory reporter can be disabled (for reasons I don't understand but aren't important) when CTypes are used by a worker. Then when the worker shuts down, Disable() is called again, whereupon mWorkerPrivate will already be NULL. So the NULL check is the right thing to do. Phew. I'll land this once my try server run has finished.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla16
Depends on: 812935
Whiteboard: [see comment 71][orange][perma-orange][test which aborts the suite] → [see comment 71][perma-orange][test which aborts the suite]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: