Closed
Bug 749010
Opened 13 years ago
Closed 12 years ago
Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Release 13
Categories
(Core :: DOM: Workers, defect)
Tracking
()
RESOLVED
FIXED
mozilla16
Tracking | Status | |
---|---|---|
firefox12 | --- | unaffected |
firefox13 | + | affected |
firefox14 | --- | unaffected |
People
(Reporter: mbrubeck, Assigned: n.nethercote)
References
Details
(4 keywords, Whiteboard: [see comment 71][perma-orange][test which aborts the suite])
Crash Data
Attachments
(2 files, 4 obsolete files)
77.48 KB,
patch
|
bent.mozilla
:
review+
|
Details | Diff | Splinter Review |
8.65 KB,
patch
|
bent.mozilla
:
review+
|
Details | Diff | Splinter Review |
+++ This bug was initially created as a clone of Bug #737244 +++
This crash originally happened intermittently, but it started happening 100% of the time on Linux64 opt builds in Aurora 13, continuing in Beta 13. The perma-orange started with https://hg.mozilla.org/releases/mozilla-aurora/rev/8d0f70b87fba (bug 736028).
Requesting tracking-firefox13 because this is a new reproducible crash in Firefox 13.
https://tbpl.mozilla.org/php/getParsedLog.php?id=11206374&tree=Mozilla-Beta
Rev3 Fedora 12x64 mozilla-beta pgo test mochitest-other on 2012-04-25 15:17:39 PDT for push 7e8c20249a4f
13504 INFO TEST-START | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:02:12.925341
INFO | automation.py | Reading PID log: /tmp/tmpx_VsOppidlog
==> process 2357 launched child process 2391
==> process 2357 launched child process 2399
==> process 2357 launched child process 2407
==> process 2357 launched child process 2409
==> process 2357 launched child process 2411
INFO | automation.py | Checking for orphan process with PID: 2391
INFO | automation.py | Checking for orphan process with PID: 2399
INFO | automation.py | Checking for orphan process with PID: 2407
INFO | automation.py | Checking for orphan process with PID: 2409
INFO | automation.py | Checking for orphan process with PID: 2411
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-linux64/1335386323/firefox-13.0.en-US.linux-x86_64.crashreporter-symbols.zip
PROCESS-CRASH | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | application crashed (minidump found)
Crash dump filename: /tmp/tmpMB1Rf3/minidumps/36986f3b-61c9-5f43-4ee091f4-3f3d1a4d.dmp
Operating system: Linux
0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64
family 6 model 23 stepping 10
2 CPUs
Crash reason: SIGSEGV
Crash address: 0x10
Thread 0 (crashed)
0 libpthread-2.11.so + 0x8db0
rbx = 0x00000000 r12 = 0x4909cff0 r13 = 0x4909cf9f r14 = 0xa8b5c010
r15 = 0x4909cff0 rip = 0xd3608db0 rsp = 0x4909cf28 rbp = 0x00000001
Found by: given as instruction pointer in context
1 libnspr4.so!PR_Lock [ptsynch.c:7e8c20249a4f : 206 + 0x4]
rip = 0xc37e6f99 rsp = 0x4909cf30
Found by: stack scanning
2 libxul.so!mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats [Mutex.h : 106 + 0x8]
rbx = 0xa8b5c000 rip = 0xc0bd74da rsp = 0x4909cf40
Found by: call frame info
3 libxul.so!WorkerMemoryReporter::GetExplicitNonHeap [WorkerPrivate.cpp:7e8c20249a4f : 223 + 0xe]
rbx = 0x9415f700 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250
r15 = 0x4909cff0 rip = 0xc0bd77ab rsp = 0x4909cf90 rbp = 0x00000000
Found by: call frame info
4 libxul.so!nsMemoryReporterManager::GetExplicit [nsMemoryReporterManager.cpp:7e8c20249a4f : 805 + 0x9]
rbx = 0x4909d3e0 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250
r15 = 0x4909cff0 rip = 0xc119eea2 rsp = 0x4909cfb0 rbp = 0x00000000
Found by: call frame info
5 libxul.so!NS_InvokeByIndex_P [xptcinvoke_x86_64_unix.cpp : 195 + 0xb]
rbx = 0x4909d3e0 r12 = 0x00000000 r13 = 0x00000002 r14 = 0xc17841f4
r15 = 0xc2333bb8 rip = 0xc11a0d8a rsp = 0x4909d0d0 rbp = 0x4909d1b0
Found by: call frame info
6 libxul.so!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp:7e8c20249a4f : 3025 + 0x19]
rbx = 0x4909d3e0 r12 = 0x4909d3e0 r13 = 0x4909d870 r14 = 0x4909d3a0
r15 = 0xc2333bb8 rip = 0xc0de61cb rsp = 0x4909d1c0 rbp = 0x00000001
Found by: call frame info
7 libxul.so!XPC_WN_GetterSetter [xpcprivate.h:7e8c20249a4f : 2660 + 0xc]
rbx = 0x4909d870 r12 = 0x00000000 r13 = 0xb3aff0c0 r14 = 0xa853d700
r15 = 0x981a0fd0 rip = 0xc0ded719 rsp = 0x4909d840 rbp = 0xb3aff0b0
Found by: call frame info
8 libxul.so!js::InvokeKernel [jscntxtinlines.h:7e8c20249a4f : 314 + 0x8]
rbx = 0x90d68400 r12 = 0xa853d700 r13 = 0x00000000 r14 = 0x00000000
r15 = 0x00000000 rip = 0xc14907d0 rsp = 0x4909d9a0 rbp = 0xb3aff0c0
Found by: call frame info
9 libxul.so!js::Invoke [jsinterp.h:7e8c20249a4f : 172 + 0xd]
rbx = 0xa853d700 r12 = 0x4909db50 r13 = 0x4909db00 r14 = 0x4909da80
r15 = 0x00000000 rip = 0xc1490d65 rsp = 0x4909da70 rbp = 0xffffffff
Found by: call frame info
10 libxul.so!js::InvokeGetterOrSetter [jsinterp.cpp:7e8c20249a4f : 635 + 0xb]
Reporter | ||
Updated•13 years ago
|
Keywords: regression
Comment hidden (Legacy TBPL/Treeherder Robot) |
Updated•13 years ago
|
Crash Signature: [@ libpthread-2.11.so@0x0x8db0]
Comment 2•13 years ago
|
||
Is this crash only happening on Fx13?
Reporter | ||
Comment 3•13 years ago
|
||
(In reply to Jared Wein [:jaws] from comment #2)
> Is this crash only happening on Fx13?
Yes.
Comment 4•13 years ago
|
||
I'm fine with backing this out of Fx13 and letting the patch ride the Fx14 train then. I'll talk with Asa later today (before I do the backout) to make sure that he knows about it since I know he really wanted it in for Fx13.
Assignee: nobody → jwein
Status: NEW → ASSIGNED
Reporter | ||
Comment 5•13 years ago
|
||
It would also be nice to figure whether this crash is caused by 736028, or if it is an existing bug that was exposed in tests when bug 736028 landed. (Will backing out bug 736028 prevent users from seeing this crash, or just prevent it from showing up in our test suite?)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Updated•13 years ago
|
Comment 10•13 years ago
|
||
[Approval Request Comment]
Regression caused by (bug #): bug 736028
User impact if declined: Some crashes on Linux64
Testing completed (on m-c, etc.):
The crashes went from intermittent to permanent when this patch went on to Fx13. These crashes have not been seen on Fx14 though, so the belief is that backing out this patch from Fx13 will reduce the crashes, but there has not been an investigation yet as to why this is causing crashes.
Risk to taking this patch (and alternatives if risky):
No risk expected, but thumbnail images will have a lower quality on the new tab page for Fx13.
String changes made by this patch: none
Attachment #619725 -
Flags: approval-mozilla-beta?
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 12•13 years ago
|
||
(In reply to Jared Wein [:jaws] from comment #10)
> Risk to taking this patch (and alternatives if risky):
> No risk expected, but thumbnail images will have a lower quality on the new
> tab page for Fx13.
Have you spoken with Asa about backing out bug 736028? https://bugzilla.mozilla.org/show_bug.cgi?id=736028#c24 suggested he really wanted that change, and it's not yet clear that our beta users (which has a significantly higher Linux population than our release channel) are running into this problem.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 14•13 years ago
|
||
I would prefer that we not back this out. The feature quality without it goes way down.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 16•13 years ago
|
||
Comment on attachment 619725 [details] [diff] [review]
Backout of 736028
a- for the backout due to Asa's request to keep the feature in, and also since there appears to be 0 user effect on crash-stats at present. We can revisit the option of backing out if evidence turns up to show users are being affected in larger numbers.
Attachment #619725 -
Flags: approval-mozilla-beta? → approval-mozilla-beta-
Updated•13 years ago
|
Assignee: jwein → nobody
Status: ASSIGNED → NEW
Comment 17•13 years ago
|
||
changing status to wontfix for 13 but leaving this bug open in case we need to revisit the backout during 13's cycle.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Why is this WONTFIX? If one of our tests starts failing permanently we need to do something about that.
Comment 26•13 years ago
|
||
And, because this is a crash (which stops the run, and kills reporting of what happened before it as well as not running what comes after it at all), keep in mind that this means we do not run mochitest-chrome on Linux-64 on 13, and we have not run it on 13 for three weeks.
Whiteboard: [orange][perma-orange] → [orange][perma-orange][test which aborts the suite]
Depends on: 752585
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 31•13 years ago
|
||
Jared, can you summarize the situation w.r.t. bug 736028 and why this crash occurs on 13 and not 14? I wrote this test but I think the crash doesn't really have anything to do with me, but I've found this whole bug horribly confusing. Thanks.
Comment 32•13 years ago
|
||
Nicholas, I'm not sure why this patch triggered the failures to occur and why they're not happening on 14.
I can speculate that there are some other patches that landed, and the combination of them and the potentially larger memory that is consumed by the larger thumbnails introduced by bug 736028 results in this crash.
Since this is not appearing on 14 I can only think that something landed during the 14 cycle that fixed this, but I'm not sure what that is.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 34•13 years ago
|
||
Jared - how would you like to attack this problem? Can we ask QA to bisect builds with the test case and get back to us with what build in FF14 fixed the issue?
Assignee: nobody → jaws
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 38•13 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #34)
> Jared - how would you like to attack this problem? Can we ask QA to bisect
> builds with the test case and get back to us with what build in FF14 fixed
> the issue?
Yeah, that sounds like a good plan to me. Should this be reassigned to somebody in QA?
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 40•13 years ago
|
||
(In reply to Jared Wein [:jaws] from comment #38)
> Yeah, that sounds like a good plan to me. Should this be reassigned to
> somebody in QA?
I've added the qawanted and regressionwindow-wanted keywords (although we're really looking for a fix window on FF14). QA regularly triages through the tracked/qawanted bugs.
Keywords: qawanted,
regressionwindow-wanted
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 43•13 years ago
|
||
(In reply to Jared Wein [:jaws] from comment #38)
> Yeah, that sounds like a good plan to me. Should this be reassigned to
> somebody in QA?
Would it be possible to provide QA with the associated testcase in a fashion that can be run as part of a bisection?
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 46•13 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #43)
> (In reply to Jared Wein [:jaws] from comment #38)
> > Yeah, that sounds like a good plan to me. Should this be reassigned to
> > somebody in QA?
>
> Would it be possible to provide QA with the associated testcase in a fashion
> that can be run as part of a bisection?
Yes, use Linux64 opt builds and run:
TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make -C obj-dir mochitest-plain
If test passes, then it is a GOOD revision. If the test crashes, then it is a BAD revision.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 52•13 years ago
|
||
(In reply to Jared Wein [:jaws] from comment #46)
> Yes, use Linux64 opt builds and run:
> TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make
> -C obj-dir mochitest-plain
I'm not sure where I can get Linux64 opt builds from, in particular for the beta branch.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 54•13 years ago
|
||
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #52)
> (In reply to Jared Wein [:jaws] from comment #46)
> > Yes, use Linux64 opt builds and run:
> > TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make
> > -C obj-dir mochitest-plain
>
> I'm not sure where I can get Linux64 opt builds from, in particular for the
> beta branch.
I think these are what you're looking for:
https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b1/linux-x86_64/en-US/
https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b2/linux-x86_64/en-US/
https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b3/linux-x86_64/en-US/
Comment 55•13 years ago
|
||
Why?
You want to know "what is the first build of 14 which has the fix for this problem?," but the first build of 14 which could have exhibited the problem, a build on inbound on 2012-04-12 when you landed the patch for bug 736028, does not exhibit the problem, so you know that the difference between non-permaorange 14 and permaorange 13 landed between when 13 merged from mozilla-central and then.
I'd say that makes the steps
* hg qimport https://bug736028.bugzilla.mozilla.org/attachment.cgi?id=614206
* hg up -r (whatever the tag is for when 13 merged away)
* hg qpush
* make -f client.mk build && TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make -C obj-dir mochitest-plain
* hg qpop
* hg up -r something else
(Or more accurately, an hg bisect between the 13 merge point and something around when you pushed bug 736028, but I can't do the commands to bisect off the top of my head.)
Comment 56•13 years ago
|
||
I'll do a bisect tomorrow then if nobody else gets to it.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 58•13 years ago
|
||
(In reply to Jared Wein [:jaws] from comment #56)
> I'll do a bisect tomorrow then if nobody else gets to it.
Any updates? Let us know if you believe this can be deprioritized/untracked for whatever reason - our main concern is the user experience post-release, as well as the possibility of regressions being hidden behind this perma-orange.
Comment 59•13 years ago
|
||
Yeah I tried to do a bisect yesterday and today but I'm struggling with how to get the correct revisions to bisect between due to this being on a branch of mozilla-central.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 64•13 years ago
|
||
Can I get help from somebody on doing the bisect? Patches landing on different branches has confused me as to how to get this to reliably work.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 67•13 years ago
|
||
Removing QAWANTED as per discussion in Channel meeting today. Please re-add if there is something specific QA can do to help here.
Keywords: qawanted
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 71•13 years ago
|
||
Hmmm... Looked at this with Jared, just to see if we could maybe figure out what's going wrong. The test seems completely unrelated to his patch (and was an existing orange), so it seems likely to me that this is a case of slightly changing some timing/memory and making an existing issue more frequent. [Which would imply there's no actual user impact, and that a backout wouldn't help with the underlying issue.]
So here's what I was poking through... If njn or bent can shed some enlightenment here, it would be helpful. :)
Stack for convenience (full version in comment 0):
Thread 0 (crashed)
0 libpthread-2.11.so + 0x8db0
rbx = 0x00000000 r12 = 0x4909cff0 r13 = 0x4909cf9f r14 = 0xa8b5c010
r15 = 0x4909cff0 rip = 0xd3608db0 rsp = 0x4909cf28 rbp = 0x00000001
Found by: given as instruction pointer in context
1 libnspr4.so!PR_Lock [ptsynch.c:7e8c20249a4f : 206 + 0x4]
rip = 0xc37e6f99 rsp = 0x4909cf30
Found by: stack scanning
2 WorkerPrivate::BlockAndCollectRuntimeStats [Mutex.h : 106 + 0x8]
rbx = 0xa8b5c000 rip = 0xc0bd74da rsp = 0x4909cf40
Found by: call frame info
3 WorkerMemoryReporter::GetExplicitNonHeap [WorkerPrivate.cpp:7e8c20249a4f : 223 + 0xe]
rbx = 0x9415f700 r12 = 0x0697d000 r13 = 0x4909d010 r14 = 0xc2038250
r15 = 0x4909cff0 rip = 0xc0bd77ab rsp = 0x4909cf90 rbp = 0x00000000
Found by: call frame info
Things appear happy going into frame 3, we're gathering memory reporter data, calling into a WorkerMemoryReporter. (Pretty sure we're actually in ::CollectForRuntime, because ::GetExplicitNonHeap is just a two-liner calling it... Opt build and inlined).
217 CollectForRuntime(bool aIsQuick, void* aData)
218 {
219 AssertIsOnMainThread();
220
221 if (mWorkerPrivate) {
222 bool disabled;
223 if (!mWorkerPrivate->BlockAndCollectRuntimeStats(...)) {
so we end up in:
2828 WorkerPrivate::BlockAndCollectRuntimeStats(...)
2829 {
2830 AssertIsOnMainThread();
2831 NS_ASSERTION(aData, "Null data!");
2832
2833 {
2834 MutexAutoLock lock(mMutex);
2835
2836 if (mMemoryReporterDisabled) {
And we're crashing when the MutexAutoLock is accessing |mMutex| to lock [it's also possible we're at the MutexAutoLock further down during the same thing, but I'm guessing we're not actually making it that far).
The area where my suspicion falls is on the lifetime of the WorkerPrivate vs the WorkerMemoryReporter (which has a raw pointer -- |mWorkerPrivate| -- for the associated WorkerPrivate). When the worker and it's WorkerPrivate go away, where is it ensuring that that the WorkerMemoryReporter is appropriately nuked?
WorkerPrivate::DoRunLoop is the place creating and nuking the reporter. The reporter is refcounted, but the WorkerPrivate is not. DisableMemoryReporter() is called, but BlockAndCollectRuntimeStats() uses the lock before checking that.
So here's what I think happens. My uncertainty is around exactly what gets destroyed/cleared when...
1) main thread (the test) gets ahold of a WorkerMemoryReporter (refcount++).
-- context switch --
2) DoRunLoop decides to quit. It nulls mMemoryReporter (refcount--), but the reporter stays alive because of #1
3) worker thread goes away, destructs its WorkerPrivate (didn't find exactly where that happens)
-- context switch --
4) main thread calls WorkerMemoryReporter::GetExplicitNonHeap (ok)
5) ...in turn calls CollectForRuntime (ok)
6) ...in turn calls mWorkerPrivate->BlockAndCollectRuntimeStats (woah! mWorkerPrivate is now use-after-free? Guess enough bits are around / valid for the method to be invoked successfully)
7) Trues to do the locking, but the mutex has already been kablooied.
Let me know where I'm wrong, for I am just a simple JS hacker. :)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Updated•13 years ago
|
Whiteboard: [orange][perma-orange][test which aborts the suite] → [see comment 71][orange][perma-orange][test which aborts the suite]
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 78•12 years ago
|
||
> DisableMemoryReporter() is called, but BlockAndCollectRuntimeStats() uses
> the lock before checking that.
I *think* this is the crucial part. The disabling is achieved by recording some state in WorkerPrivate, but I think it needs to be recorded in the WorkerMemoryReporter instead, so it can be checked before WorkerPrivate::BlockAndCollectRuntimeStats() is called.
Assignee | ||
Comment 79•12 years ago
|
||
This patch changes how the disabling of WorkerMemoryReporter happens.
Previously, WorkerPrivate would (a) set mMemoryReporterDisabled. Then when
(b) WorkerMemoryReporter called WorkerPrivate::BlockAndCollectRuntimeStats, a
|disabled| outparam would be set, and WorkerMemoryReporter would check for
that and set mWorkerPrivate to nsnull. If dolske is right, the
WorkerPrivate could be destroyed between (a) and (b).
Now, WorkerPrivate just calls WorkerMemoryReporter::Disable() which directly
sets mWorkerPrivate to nsnull. I've cut out the middle-man.
Some caveats:
- I've never seen this crash locally, despite the fact that I run this test
on my Linux64 box frequently. (I ran the test myself and it didn't crash,
so that's a start.) And despite this bug's title, it only fails sporadically,
so a try server run won't tell me much. So if someone who can reliably
reproduce the crash (dolske, jaws?) could try this patch, that'd be great.
- Also, I didn't write this code and don't really understand it and it makes
me nervous and I am just making this stuff up as I go along...
Attachment #628193 -
Flags: feedback?(jaws)
Attachment #628193 -
Flags: feedback?(dolske)
Comment 80•12 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #79)
> And despite this bug's title, it only fails
> sporadically,
> so a try server run won't tell me much. So if someone who can reliably
> reproduce the crash (dolske, jaws?) could try this patch, that'd be great.
Did you try it locally with an optimized build?
Assignee | ||
Comment 81•12 years ago
|
||
I just tried an Linux64 opt build of mozilla-beta, and couldn't get it to crash. (This is without my patch.)
So I tried a PGO build, and it failed with:
make[16060]: vfork: Cannot allocate memory
This on a 64-bit machine with 16GB of RAM. Hmph. Currently trying a push to mozilla-beta, with some magic incantations from philor to get PGO happening.
Comment 82•12 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=9801753cb194 is looking both green, and PGO'd, but then I realized we didn't have a control for "can we repro on try without the patch?" so I pushed https://tbpl.mozilla.org/?tree=Try&rev=c06b6b58bae6 which I hope will wind up orange.
Assignee | ||
Comment 83•12 years ago
|
||
Comment on attachment 628193 [details] [diff] [review]
patch
(In reply to Phil Ringnalda (:philor) from comment #82)
> https://tbpl.mozilla.org/?tree=Try&rev=9801753cb194 is looking both green,
> and PGO'd, but then I realized we didn't have a control for "can we repro on
> try without the patch?" so I pushed
> https://tbpl.mozilla.org/?tree=Try&rev=c06b6b58bae6 which I hope will wind
> up orange.
It did wind up orange. Good news!
Time for a review... bent, I'd appreciate a very careful review because (a) I'm not that familiar with this code and (b) RACE CONDITIONS.
Attachment #628193 -
Flags: review?(bent.mozilla)
Attachment #628193 -
Flags: feedback?(jaws)
Attachment #628193 -
Flags: feedback?(dolske)
Updated•12 years ago
|
Assignee: jaws → n.nethercote
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment on attachment 628193 [details] [diff] [review]
patch
Review of attachment 628193 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for digging in here! I think this fix is good, two nits and a DEBUG fix below should do it.
::: dom/workers/WorkerPrivate.cpp
@@ +122,5 @@
> static_cast<typename ISupportsBaseInfo<T>::ISupportsBase*>(raw);
> dest->swap(rawSupports);
> }
>
> +} /* anonymous namespace */
Nit: Please just move WorkerMemoryReporter below the namespace block rather than splitting the block in two.
@@ +142,5 @@
> cstats->extra = const_cast<char *>(name);
> }
> };
>
> +class mozilla::dom::workers::WorkerMemoryReporter : public nsIMemoryMultiReporter
Nit: I'd prefer a BEGIN_WORKERS_NAMESPACE/END_WORKERS_NAMESPACE block around this rather than having to spell it all out like this.
@@ +242,5 @@
> }
> +
> + void Disable()
> + {
> + mWorkerPrivate = nsnull;
This is only safe because we've locked the worker's mutex, and I'd like to assert that here. To do this I think we should make WorkerMemoryReporter a friend class of WorkerPrivate (#ifdef DEBUG only) and then call AssertCurrentThreadOwns. Sound good?
Assignee | ||
Comment 87•12 years ago
|
||
> This is only safe because we've locked the worker's mutex, and I'd like to
> assert that here. To do this I think we should make WorkerMemoryReporter a
> friend class of WorkerPrivate (#ifdef DEBUG only) and then call
> AssertCurrentThreadOwns. Sound good?
I've done this (see attached patch) but I'm getting assertion failures like the following (https://tbpl.mozilla.org/php/getParsedLog.php?id=12376609&tree=Try).
I can't reproduce them on my own Linux64 debug builds, unfortunately.
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/workers/test/test_chromeWorkerJSM.xul | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:03:36.383366
INFO | automation.py | Reading PID log: /tmp/tmpLZ48pIpidlog
==> process 2161 launched child process 2189
==> process 2161 launched child process 2215
==> process 2161 launched child process 2222
==> process 2161 launched child process 2224
==> process 2161 launched child process 2226
INFO | automation.py | Checking for orphan process with PID: 2189
INFO | automation.py | Checking for orphan process with PID: 2215
INFO | automation.py | Checking for orphan process with PID: 2222
INFO | automation.py | Checking for orphan process with PID: 2224
INFO | automation.py | Checking for orphan process with PID: 2226
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-4c6a523c293f/try-linux-debug/firefox-15.0a1.en-US.linux-i686.crashreporter-symbols.zip
PROCESS-CRASH | chrome://mochitests/content/chrome/dom/workers/test/test_chromeWorkerJSM.xul | application crashed (minidump found)
Crash dump filename: /tmp/tmpo7OmGS/minidumps/7b9f7c77-dbcf-528d-08ca8fc3-3db40cdd.dmp
Operating system: Linux
0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686
CPU: x86
GenuineIntel family 6 model 23 stepping 10
2 CPUs
Crash reason: SIGSEGV
Crash address: 0x28
Thread 28 (crashed)
0 libxul.so!mozilla::dom::workers::WorkerPrivate::DisableMemoryReporter [Mutex.h : 108 + 0x0]
eip = 0x0197bc25 esp = 0xa7bff0f4 ebp = 0xa7bff148 ebx = 0x02d8b528
esi = 0xb4c54530 edi = 0x00000001 eax = 0x00000000 ecx = 0x0a0db1e8
edx = 0xa7bff128 efl = 0x00010286
Found by: given as instruction pointer in context
1 libxul.so!mozilla::dom::workers::WorkerPrivate::DoRunLoop [WorkerPrivate.cpp : 2795 + 0x8]
eip = 0x0197c395 esp = 0xa7bff150 ebp = 0xa7bff1f8 ebx = 0x02d8b528
esi = 0x0a0db1c8 edi = 0xa7bff1ac
Found by: call frame info
2 libxul.so!WorkerThreadRunnable::Run [RuntimeService.cpp : 329 + 0x6]
eip = 0x0196ab65 esp = 0xa7bff200 ebp = 0xa7bff248 ebx = 0x02d8b528
esi = 0xb4c53d18 edi = 0x0a0db1c8
Found by: call frame info
Attachment #619725 -
Attachment is obsolete: true
Attachment #628193 -
Attachment is obsolete: true
Attachment #628193 -
Flags: review?(bent.mozilla)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Updated•12 years ago
|
Summary: Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Beta 13 → Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Release 13
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 96•12 years ago
|
||
What's the status of this bug? Nick, have you got a chance to work on this again since the last time you looked at it?
Assignee | ||
Comment 97•12 years ago
|
||
The assertion on try that I couldn't reproduce locally was discouraging enough that I haven't, esp. given that I only half understand this code. I'll try to get back to it next week.
Comment 98•12 years ago
|
||
I have a patch in bug 725907 that also causes this to become perma-orange. Nick's v1 patch fixes it.
https://tbpl.mozilla.org/?tree=Try&pusher=jorendorff@mozilla.com
You might have to click the green arrow several times; the relevant pushes here are:
80f70b78ac67 - Fri Jul 6 16:56:13 2012 PDT - green MOth
23371b36f13c - Fri Jul 6 16:27:10 2012 PDT - orange MOth
Can we land v1 + nits, and add the problematic assert in a follow-up bug?
Assignee | ||
Comment 99•12 years ago
|
||
I just looked at this again, quite closely. I don't understand how the assertion could cause a crash. We lock the WorkerPrivate's |mMutex| in WorkerPrivate::DisableMemoryReporter() immediately before calling WorkerMemoryReporter::Disable(), and then this line causes the crash:
mWorkerPrivate->mMutex.AssertCurrentThreadOwns();
It's not clear to me exactly what causes the crash because there's lots of inlining going on even in debug builds, and I can't reproduce it locally.
I see two possible ways to proceed.
- We land my patch with the assertion disabled, cross our fingers, and file a follow-up bug.
- We disable the worker memory reporter entirely, and file a follow-up bug to re-enable it. This is safer, but obviously a shame.
In both cases, I don't know how progress would be made on the follow-up bug, since I wasn't able to make any progress in this one. Maybe bent could try.
Preferences?
Comment 100•12 years ago
|
||
(In reply to comment #99)
> It's not clear to me exactly what causes the crash because there's lots of
> inlining going on even in debug builds, and I can't reproduce it locally.
Not sure if you're using our tinderbox debug builds, but those also have --enable-optimized which can explain this. Have you tried doing a build with --enable-debug --disable-optimized (sorry if I'm preaching to the choir :-)
Assignee | ||
Comment 101•12 years ago
|
||
bent, here's the patch with the assertion disabled. I'll let you decide if
that's reasonable, since you wrote this code originally. If you r- it I
will write a patch that disables the worker memory reporter. Thanks.
Attachment #641353 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 102•12 years ago
|
||
Trivial follow-up patch.
Attachment #641354 -
Flags: review?(bent.mozilla)
Assignee | ||
Updated•12 years ago
|
Attachment #630084 -
Attachment is obsolete: true
Assignee | ||
Comment 103•12 years ago
|
||
> Not sure if you're using our tinderbox debug builds, but those also have
> --enable-optimized which can explain this. Have you tried doing a build
> with --enable-debug --disable-optimized
Good suggestion. I just tried debug Linux32 and Linux64 builds with this. The stack traces were only marginally better, and not in a way that gave me any more salient information.
Crash reason: SIGSEGV
Crash address: 0x28
Thread 27 (crashed)
0 libxul.so!mozilla::Mutex::AssertCurrentThreadOwns [Mutex.h : 108 + 0x3]
eip = 0x01546c9b esp = 0xae086090 ebp = 0xae0860a8 ebx = 0x03fa57b4
esi = 0x0a2c41d0 edi = 0x02cd8990 eax = 0x00000020 ecx = 0xae086c04
edx = 0x087b8a20 efl = 0x00010206
Found by: given as instruction pointer in context
1 libxul.so!mozilla::dom::workers::WorkerMemoryReporter::Disable [WorkerPrivate.cpp : 260 + 0x10]
eip = 0x02037087 esp = 0xae0860b0 ebp = 0xae0860c8 ebx = 0x03fa57b4
esi = 0x0a2c41d0 edi = 0x02cd8990
Found by: call frame info
2 libxul.so!mozilla::dom::workers::WorkerPrivate::DisableMemoryReporter [WorkerPrivate.cpp : 2944 + 0x17]
eip = 0x0203d43d esp = 0xae0860d0 ebp = 0xae086108 ebx = 0x03fa57b4
esi = 0x0a2c41d0 edi = 0x02cd8990
Found by: call frame info
I also tried a local build that used the exact mozconfig used on the test machines, but I still wasn't able to reproduce the crash :(
Assignee | ||
Comment 104•12 years ago
|
||
I finally managed to catch this in a debugger. Just a missing NULL check on
mWorkerPrivate!
Attachment #641676 -
Flags: review?(bent.mozilla)
Assignee | ||
Updated•12 years ago
|
Attachment #641353 -
Attachment is obsolete: true
Attachment #641353 -
Flags: review?(bent.mozilla)
Comment on attachment 641676 [details] [diff] [review]
(part 1) - Make WorkerMemoryReporter less crashy.
I'm nervous that we don't know why this null check should be necessary... But if this helps unblock others then we should do it.
Attachment #641676 -
Flags: review?(bent.mozilla) → review+
Updated•12 years ago
|
Attachment #641354 -
Flags: review?(bent.mozilla) → review+
Assignee | ||
Comment 106•12 years ago
|
||
> I'm nervous that we don't know why this null check should be necessary...
> But if this helps unblock others then we should do it.
It turns out the memory reporter can be disabled (for reasons I don't understand but aren't important) when CTypes are used by a worker. Then when the worker shuts down, Disable() is called again, whereupon mWorkerPrivate will already be NULL. So the NULL check is the right thing to do. Phew. I'll land this once my try server run has finished.
Assignee | ||
Comment 107•12 years ago
|
||
Comment 108•12 years ago
|
||
\o/
Comment 109•12 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/7e78cbe9ea94
https://hg.mozilla.org/mozilla-central/rev/bdc169cbbbac
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla16
Updated•12 years ago
|
Keywords: intermittent-failure
Updated•12 years ago
|
Whiteboard: [see comment 71][orange][perma-orange][test which aborts the suite] → [see comment 71][perma-orange][test which aborts the suite]
Updated•9 years ago
|
Keywords: regressionwindow-wanted
You need to log in
before you can comment on or make changes to this bug.
Description
•