Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Release 13

RESOLVED FIXED in mozilla16

Status

()

Core
DOM: Workers
--
critical
RESOLVED FIXED
5 years ago
2 years ago

People

(Reporter: mbrubeck, Assigned: njn)

Tracking

(4 keywords)

13 Branch
mozilla16
x86_64
Linux
crash, intermittent-failure, regression, reproducible
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox12 unaffected, firefox13+ affected, firefox14 unaffected)

Details

(Whiteboard: [see comment 71][perma-orange][test which aborts the suite], crash signature)

Attachments

(2 attachments, 4 obsolete attachments)

(Reporter)

Description

5 years ago
+++ This bug was initially created as a clone of Bug #737244 +++

This crash originally happened intermittently, but it started happening 100% of the time on Linux64 opt builds in Aurora 13, continuing in Beta 13.  The perma-orange started with https://hg.mozilla.org/releases/mozilla-aurora/rev/8d0f70b87fba (bug 736028).

Requesting tracking-firefox13 because this is a new reproducible crash in Firefox 13.

https://tbpl.mozilla.org/php/getParsedLog.php?id=11206374&tree=Mozilla-Beta
Rev3 Fedora 12x64 mozilla-beta pgo test mochitest-other on 2012-04-25 15:17:39 PDT for push 7e8c20249a4f

13504 INFO TEST-START | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:02:12.925341
INFO | automation.py | Reading PID log: /tmp/tmpx_VsOppidlog
==> process 2357 launched child process 2391
==> process 2357 launched child process 2399
==> process 2357 launched child process 2407
==> process 2357 launched child process 2409
==> process 2357 launched child process 2411
INFO | automation.py | Checking for orphan process with PID: 2391
INFO | automation.py | Checking for orphan process with PID: 2399
INFO | automation.py | Checking for orphan process with PID: 2407
INFO | automation.py | Checking for orphan process with PID: 2409
INFO | automation.py | Checking for orphan process with PID: 2411
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-linux64/1335386323/firefox-13.0.en-US.linux-x86_64.crashreporter-symbols.zip
PROCESS-CRASH | chrome://mochitests/content/chrome/toolkit/components/aboutmemory/tests/test_memoryReporters.xul | application crashed (minidump found)
Crash dump filename: /tmp/tmpMB1Rf3/minidumps/36986f3b-61c9-5f43-4ee091f4-3f3d1a4d.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64
CPU: amd64
     family 6 model 23 stepping 10
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x10

Thread 0 (crashed)
 0  libpthread-2.11.so + 0x8db0
    rbx = 0x00000000   r12 = 0x4909cff0   r13 = 0x4909cf9f   r14 = 0xa8b5c010
    r15 = 0x4909cff0   rip = 0xd3608db0   rsp = 0x4909cf28   rbp = 0x00000001
    Found by: given as instruction pointer in context
 1  libnspr4.so!PR_Lock [ptsynch.c:7e8c20249a4f : 206 + 0x4]
    rip = 0xc37e6f99   rsp = 0x4909cf30
    Found by: stack scanning
 2  libxul.so!mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats [Mutex.h : 106 + 0x8]
    rbx = 0xa8b5c000   rip = 0xc0bd74da   rsp = 0x4909cf40
    Found by: call frame info
 3  libxul.so!WorkerMemoryReporter::GetExplicitNonHeap [WorkerPrivate.cpp:7e8c20249a4f : 223 + 0xe]
    rbx = 0x9415f700   r12 = 0x0697d000   r13 = 0x4909d010   r14 = 0xc2038250
    r15 = 0x4909cff0   rip = 0xc0bd77ab   rsp = 0x4909cf90   rbp = 0x00000000
    Found by: call frame info
 4  libxul.so!nsMemoryReporterManager::GetExplicit [nsMemoryReporterManager.cpp:7e8c20249a4f : 805 + 0x9]
    rbx = 0x4909d3e0   r12 = 0x0697d000   r13 = 0x4909d010   r14 = 0xc2038250
    r15 = 0x4909cff0   rip = 0xc119eea2   rsp = 0x4909cfb0   rbp = 0x00000000
    Found by: call frame info
 5  libxul.so!NS_InvokeByIndex_P [xptcinvoke_x86_64_unix.cpp : 195 + 0xb]
    rbx = 0x4909d3e0   r12 = 0x00000000   r13 = 0x00000002   r14 = 0xc17841f4
    r15 = 0xc2333bb8   rip = 0xc11a0d8a   rsp = 0x4909d0d0   rbp = 0x4909d1b0
    Found by: call frame info
 6  libxul.so!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp:7e8c20249a4f : 3025 + 0x19]
    rbx = 0x4909d3e0   r12 = 0x4909d3e0   r13 = 0x4909d870   r14 = 0x4909d3a0
    r15 = 0xc2333bb8   rip = 0xc0de61cb   rsp = 0x4909d1c0   rbp = 0x00000001
    Found by: call frame info
 7  libxul.so!XPC_WN_GetterSetter [xpcprivate.h:7e8c20249a4f : 2660 + 0xc]
    rbx = 0x4909d870   r12 = 0x00000000   r13 = 0xb3aff0c0   r14 = 0xa853d700
    r15 = 0x981a0fd0   rip = 0xc0ded719   rsp = 0x4909d840   rbp = 0xb3aff0b0
    Found by: call frame info
 8  libxul.so!js::InvokeKernel [jscntxtinlines.h:7e8c20249a4f : 314 + 0x8]
    rbx = 0x90d68400   r12 = 0xa853d700   r13 = 0x00000000   r14 = 0x00000000
    r15 = 0x00000000   rip = 0xc14907d0   rsp = 0x4909d9a0   rbp = 0xb3aff0c0
    Found by: call frame info
 9  libxul.so!js::Invoke [jsinterp.h:7e8c20249a4f : 172 + 0xd]
    rbx = 0xa853d700   r12 = 0x4909db50   r13 = 0x4909db00   r14 = 0x4909da80
    r15 = 0x00000000   rip = 0xc1490d65   rsp = 0x4909da70   rbp = 0xffffffff
    Found by: call frame info
10  libxul.so!js::InvokeGetterOrSetter [jsinterp.cpp:7e8c20249a4f : 635 + 0xb]
(Reporter)

Updated

5 years ago
Keywords: regression
Comment hidden (Treeherder Robot)

Updated

5 years ago
Crash Signature: [@ libpthread-2.11.so@0x0x8db0]
Is this crash only happening on Fx13?
(Reporter)

Comment 3

5 years ago
(In reply to Jared Wein [:jaws] from comment #2)
> Is this crash only happening on Fx13?

Yes.
I'm fine with backing this out of Fx13 and letting the patch ride the Fx14 train then. I'll talk with Asa later today (before I do the backout) to make sure that he knows about it since I know he really wanted it in for Fx13.
Assignee: nobody → jwein
Status: NEW → ASSIGNED
(Reporter)

Comment 5

5 years ago
It would also be nice to figure whether this crash is caused by 736028, or if it is an existing bug that was exposed in tests when bug 736028 landed.  (Will backing out bug 736028 prevent users from seeing this crash, or just prevent it from showing up in our test suite?)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)

Updated

5 years ago
tracking-firefox13: ? → +
Created attachment 619725 [details] [diff] [review]
Backout of 736028

[Approval Request Comment]
Regression caused by (bug #): bug 736028

User impact if declined: Some crashes on Linux64

Testing completed (on m-c, etc.):
The crashes went from intermittent to permanent when this patch went on to Fx13. These crashes have not been seen on Fx14 though, so the belief is that backing out this patch from Fx13 will reduce the crashes, but there has not been an investigation yet as to why this is causing crashes. 

Risk to taking this patch (and alternatives if risky):
No risk expected, but thumbnail images will have a lower quality on the new tab page for Fx13.

String changes made by this patch: none
Attachment #619725 - Flags: approval-mozilla-beta?
Comment hidden (Treeherder Robot)
(In reply to Jared Wein [:jaws] from comment #10)
> Risk to taking this patch (and alternatives if risky):
> No risk expected, but thumbnail images will have a lower quality on the new
> tab page for Fx13.

Have you spoken with Asa about backing out bug 736028? https://bugzilla.mozilla.org/show_bug.cgi?id=736028#c24 suggested he really wanted that change, and it's not yet clear that our beta users (which has a significantly higher Linux population than our release channel) are running into this problem.
Comment hidden (Treeherder Robot)

Comment 14

5 years ago
I would prefer that we not back this out. The feature quality without it goes way down.
Comment hidden (Treeherder Robot)
Comment on attachment 619725 [details] [diff] [review]
Backout of 736028

a- for the backout due to Asa's request to keep the feature in, and also since there appears to be 0 user effect on crash-stats at present. We can revisit the option of backing out if evidence turns up to show users are being affected in larger numbers.
Attachment #619725 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Assignee: jwein → nobody
Status: ASSIGNED → NEW
changing status to wontfix for 13 but leaving this bug open in case we need to revisit the backout during 13's cycle.
status-firefox13: affected → wontfix
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Why is this WONTFIX?  If one of our tests starts failing permanently we need to do something about that.
status-firefox13: wontfix → affected
And, because this is a crash (which stops the run, and kills reporting of what happened before it as well as not running what comes after it at all), keep in mind that this means we do not run mochitest-chrome on Linux-64 on 13, and we have not run it on 13 for three weeks.
Whiteboard: [orange][perma-orange] → [orange][perma-orange][test which aborts the suite]
Depends on: 752585
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(Assignee)

Comment 31

5 years ago
Jared, can you summarize the situation w.r.t. bug 736028 and why this crash occurs on 13 and not 14?  I wrote this test but I think the crash doesn't really have anything to do with me, but I've found this whole bug horribly confusing.  Thanks.
Nicholas, I'm not sure why this patch triggered the failures to occur and why they're not happening on 14.

I can speculate that there are some other patches that landed, and the combination of them and the potentially larger memory that is consumed by the larger thumbnails introduced by bug 736028 results in this crash.

Since this is not appearing on 14 I can only think that something landed during the 14 cycle that fixed this, but I'm not sure what that is.
Comment hidden (Treeherder Robot)
Jared - how would you like to attack this problem? Can we ask QA to bisect builds with the test case and get back to us with what build in FF14 fixed the issue?
Assignee: nobody → jaws
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Alex Keybl [:akeybl] from comment #34)
> Jared - how would you like to attack this problem? Can we ask QA to bisect
> builds with the test case and get back to us with what build in FF14 fixed
> the issue?

Yeah, that sounds like a good plan to me. Should this be reassigned to somebody in QA?
Comment hidden (Treeherder Robot)
(In reply to Jared Wein [:jaws] from comment #38)
> Yeah, that sounds like a good plan to me. Should this be reassigned to
> somebody in QA?

I've added the qawanted and regressionwindow-wanted keywords (although we're really looking for a fix window on FF14). QA regularly triages through the tracked/qawanted bugs.
Keywords: qawanted, regressionwindow-wanted
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Jared Wein [:jaws] from comment #38)
> Yeah, that sounds like a good plan to me. Should this be reassigned to
> somebody in QA?

Would it be possible to provide QA with the associated testcase in a fashion that can be run as part of a bisection?
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Alex Keybl [:akeybl] from comment #43)
> (In reply to Jared Wein [:jaws] from comment #38)
> > Yeah, that sounds like a good plan to me. Should this be reassigned to
> > somebody in QA?
> 
> Would it be possible to provide QA with the associated testcase in a fashion
> that can be run as part of a bisection?

Yes, use Linux64 opt builds and run:
TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make -C obj-dir mochitest-plain

If test passes, then it is a GOOD revision. If the test crashes, then it is a BAD revision.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Jared Wein [:jaws] from comment #46)
> Yes, use Linux64 opt builds and run:
> TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make
> -C obj-dir mochitest-plain

I'm not sure where I can get Linux64 opt builds from, in particular for the beta branch.
Comment hidden (Treeherder Robot)
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #52)
> (In reply to Jared Wein [:jaws] from comment #46)
> > Yes, use Linux64 opt builds and run:
> > TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make
> > -C obj-dir mochitest-plain
> 
> I'm not sure where I can get Linux64 opt builds from, in particular for the
> beta branch.

I think these are what you're looking for:
https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b1/linux-x86_64/en-US/
https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b2/linux-x86_64/en-US/
https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/13.0b3/linux-x86_64/en-US/
Why?

You want to know "what is the first build of 14 which has the fix for this problem?," but the first build of 14 which could have exhibited the problem, a build on inbound on 2012-04-12 when you landed the patch for bug 736028, does not exhibit the problem, so you know that the difference between non-permaorange 14 and permaorange 13 landed between when 13 merged from mozilla-central and then.

I'd say that makes the steps

* hg qimport https://bug736028.bugzilla.mozilla.org/attachment.cgi?id=614206
* hg up -r (whatever the tag is for when 13 merged away)
* hg qpush
* make -f client.mk build && TEST_PATH=toolkit/components/aboutmemory/tests/test_memoryReporters.xul make -C obj-dir mochitest-plain
* hg qpop
* hg up -r something else

(Or more accurately, an hg bisect between the 13 merge point and something around when you pushed bug 736028, but I can't do the commands to bisect off the top of my head.)
I'll do a bisect tomorrow then if nobody else gets to it.
Comment hidden (Treeherder Robot)
(In reply to Jared Wein [:jaws] from comment #56)
> I'll do a bisect tomorrow then if nobody else gets to it.

Any updates? Let us know if you believe this can be deprioritized/untracked for whatever reason - our main concern is the user experience post-release, as well as the possibility of regressions being hidden behind this perma-orange.
Yeah I tried to do a bisect yesterday and today but I'm struggling with how to get the correct revisions to bisect between due to this being on a branch of mozilla-central.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Can I get help from somebody on doing the bisect? Patches landing on different branches has confused me as to how to get this to reliably work.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Removing QAWANTED as per discussion in Channel meeting today. Please re-add if there is something specific QA can do to help here.
Keywords: qawanted
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Hmmm... Looked at this with Jared, just to see if we could maybe figure out what's going wrong. The test seems completely unrelated to his patch (and was an existing orange), so it seems likely to me that this is a case of slightly changing some timing/memory and making an existing issue more frequent. [Which would imply there's no actual user impact, and that a backout wouldn't help with the underlying issue.]

So here's what I was poking through... If njn or bent can shed some enlightenment here, it would be helpful. :)

Stack for convenience (full version in comment 0):

Thread 0 (crashed)
 0  libpthread-2.11.so + 0x8db0
    rbx = 0x00000000   r12 = 0x4909cff0   r13 = 0x4909cf9f   r14 = 0xa8b5c010
    r15 = 0x4909cff0   rip = 0xd3608db0   rsp = 0x4909cf28   rbp = 0x00000001
    Found by: given as instruction pointer in context
 1  libnspr4.so!PR_Lock [ptsynch.c:7e8c20249a4f : 206 + 0x4]
    rip = 0xc37e6f99   rsp = 0x4909cf30
    Found by: stack scanning
 2  WorkerPrivate::BlockAndCollectRuntimeStats [Mutex.h : 106 + 0x8]
    rbx = 0xa8b5c000   rip = 0xc0bd74da   rsp = 0x4909cf40
    Found by: call frame info
 3  WorkerMemoryReporter::GetExplicitNonHeap [WorkerPrivate.cpp:7e8c20249a4f : 223 + 0xe]
    rbx = 0x9415f700   r12 = 0x0697d000   r13 = 0x4909d010   r14 = 0xc2038250
    r15 = 0x4909cff0   rip = 0xc0bd77ab   rsp = 0x4909cf90   rbp = 0x00000000
    Found by: call frame info

Things appear happy going into frame 3, we're gathering memory reporter data, calling into a WorkerMemoryReporter. (Pretty sure we're actually in ::CollectForRuntime, because ::GetExplicitNonHeap is just a two-liner calling it... Opt build and inlined).


217   CollectForRuntime(bool aIsQuick, void* aData)
218   {
219     AssertIsOnMainThread();
220 
221     if (mWorkerPrivate) {
222       bool disabled;
223       if (!mWorkerPrivate->BlockAndCollectRuntimeStats(...)) {


so we end up in:

2828 WorkerPrivate::BlockAndCollectRuntimeStats(...)
2829 {
2830   AssertIsOnMainThread();
2831   NS_ASSERTION(aData, "Null data!");
2832 
2833   {
2834     MutexAutoLock lock(mMutex);
2835 
2836     if (mMemoryReporterDisabled) {

And we're crashing when the MutexAutoLock is accessing |mMutex| to lock [it's also possible we're at the MutexAutoLock further down during the same thing, but I'm guessing we're not actually making it that far).

The area where my suspicion falls is on the lifetime of the WorkerPrivate vs the WorkerMemoryReporter (which has a raw pointer -- |mWorkerPrivate| -- for the associated WorkerPrivate). When the worker and it's WorkerPrivate go away, where is it ensuring that that the WorkerMemoryReporter is appropriately nuked?

WorkerPrivate::DoRunLoop is the place creating and nuking the reporter. The reporter is refcounted, but the WorkerPrivate is not. DisableMemoryReporter() is called, but BlockAndCollectRuntimeStats() uses the lock before checking that.

So here's what I think happens. My uncertainty is around exactly what gets destroyed/cleared when...

1) main thread (the test) gets ahold of a WorkerMemoryReporter (refcount++).
-- context switch --
2) DoRunLoop decides to quit. It nulls mMemoryReporter (refcount--), but the reporter stays alive because of #1
3) worker thread goes away, destructs its WorkerPrivate (didn't find exactly where that happens)
-- context switch --
4) main thread calls WorkerMemoryReporter::GetExplicitNonHeap (ok)
5) ...in turn calls CollectForRuntime (ok)
6) ...in turn calls mWorkerPrivate->BlockAndCollectRuntimeStats (woah! mWorkerPrivate is now use-after-free? Guess enough bits are around / valid for the method to be invoked successfully)
7) Trues to do the locking, but the mutex has already been kablooied.


Let me know where I'm wrong, for I am just a simple JS hacker. :)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(Assignee)

Updated

5 years ago
Whiteboard: [orange][perma-orange][test which aborts the suite] → [see comment 71][orange][perma-orange][test which aborts the suite]
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(Assignee)

Comment 78

5 years ago
> DisableMemoryReporter() is called, but BlockAndCollectRuntimeStats() uses
> the lock before checking that.

I *think* this is the crucial part.  The disabling is achieved by recording some state in WorkerPrivate, but I think it needs to be recorded in the WorkerMemoryReporter instead, so it can be checked before WorkerPrivate::BlockAndCollectRuntimeStats() is called.
(Assignee)

Comment 79

5 years ago
Created attachment 628193 [details] [diff] [review]
patch

This patch changes how the disabling of WorkerMemoryReporter happens.

Previously, WorkerPrivate would (a) set mMemoryReporterDisabled.  Then when
(b) WorkerMemoryReporter called WorkerPrivate::BlockAndCollectRuntimeStats, a
|disabled| outparam would be set, and WorkerMemoryReporter would check for
that and set mWorkerPrivate to nsnull.  If dolske is right, the
WorkerPrivate could be destroyed between (a) and (b).

Now, WorkerPrivate just calls WorkerMemoryReporter::Disable() which directly
sets mWorkerPrivate to nsnull.  I've cut out the middle-man.

Some caveats:  

- I've never seen this crash locally, despite the fact that I run this test
  on my Linux64 box frequently.  (I ran the test myself and it didn't crash,
  so that's a start.)  And despite this bug's title, it only fails sporadically,
  so a try server run won't tell me much.  So if someone who can reliably 
  reproduce the crash (dolske, jaws?) could try this patch, that'd be great.  
  
- Also, I didn't write this code and don't really understand it and it makes
  me nervous and I am just making this stuff up as I go along...
Attachment #628193 - Flags: feedback?(jaws)
Attachment #628193 - Flags: feedback?(dolske)
(In reply to Nicholas Nethercote [:njn] from comment #79)
> And despite this bug's title, it only fails
> sporadically,
>   so a try server run won't tell me much.  So if someone who can reliably 
>   reproduce the crash (dolske, jaws?) could try this patch, that'd be great.

Did you try it locally with an optimized build?
(Assignee)

Comment 81

5 years ago
I just tried an Linux64 opt build of mozilla-beta, and couldn't get it to crash.  (This is without my patch.)

So I tried a PGO build, and it failed with:

  make[16060]: vfork: Cannot allocate memory

This on a 64-bit machine with 16GB of RAM.  Hmph.  Currently trying a push to mozilla-beta, with some magic incantations from philor to get PGO happening.
https://tbpl.mozilla.org/?tree=Try&rev=9801753cb194 is looking both green, and PGO'd, but then I realized we didn't have a control for "can we repro on try without the patch?" so I pushed https://tbpl.mozilla.org/?tree=Try&rev=c06b6b58bae6 which I hope will wind up orange.
(Assignee)

Comment 83

5 years ago
Comment on attachment 628193 [details] [diff] [review]
patch

(In reply to Phil Ringnalda (:philor) from comment #82)
> https://tbpl.mozilla.org/?tree=Try&rev=9801753cb194 is looking both green,
> and PGO'd, but then I realized we didn't have a control for "can we repro on
> try without the patch?" so I pushed
> https://tbpl.mozilla.org/?tree=Try&rev=c06b6b58bae6 which I hope will wind
> up orange.

It did wind up orange.  Good news!

Time for a review... bent, I'd appreciate a very careful review because (a) I'm not that familiar with this code and (b) RACE CONDITIONS.
Attachment #628193 - Flags: review?(bent.mozilla)
Attachment #628193 - Flags: feedback?(jaws)
Attachment #628193 - Flags: feedback?(dolske)
Assignee: jaws → n.nethercote
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment on attachment 628193 [details] [diff] [review]
patch

Review of attachment 628193 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for digging in here! I think this fix is good, two nits and a DEBUG fix below should do it.

::: dom/workers/WorkerPrivate.cpp
@@ +122,5 @@
>      static_cast<typename ISupportsBaseInfo<T>::ISupportsBase*>(raw);
>    dest->swap(rawSupports);
>  }
>  
> +} /* anonymous namespace */

Nit: Please just move WorkerMemoryReporter below the namespace block rather than splitting the block in two.

@@ +142,5 @@
>      cstats->extra = const_cast<char *>(name);
>    }
>  };
>    
> +class mozilla::dom::workers::WorkerMemoryReporter : public nsIMemoryMultiReporter

Nit: I'd prefer a BEGIN_WORKERS_NAMESPACE/END_WORKERS_NAMESPACE block around this rather than having to spell it all out like this.

@@ +242,5 @@
>    }
> +
> +  void Disable()
> +  {
> +    mWorkerPrivate = nsnull;

This is only safe because we've locked the worker's mutex, and I'd like to assert that here. To do this I think we should make WorkerMemoryReporter a friend class of WorkerPrivate (#ifdef DEBUG only) and then call AssertCurrentThreadOwns. Sound good?
(Assignee)

Comment 87

5 years ago
Created attachment 630084 [details] [diff] [review]
patch, v2


> This is only safe because we've locked the worker's mutex, and I'd like to
> assert that here. To do this I think we should make WorkerMemoryReporter a
> friend class of WorkerPrivate (#ifdef DEBUG only) and then call
> AssertCurrentThreadOwns. Sound good?

I've done this (see attached patch) but I'm getting assertion failures like the following (https://tbpl.mozilla.org/php/getParsedLog.php?id=12376609&tree=Try).
I can't reproduce them on my own Linux64 debug builds, unfortunately.


TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/dom/workers/test/test_chromeWorkerJSM.xul | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:03:36.383366
INFO | automation.py | Reading PID log: /tmp/tmpLZ48pIpidlog
==> process 2161 launched child process 2189
==> process 2161 launched child process 2215
==> process 2161 launched child process 2222
==> process 2161 launched child process 2224
==> process 2161 launched child process 2226
INFO | automation.py | Checking for orphan process with PID: 2189
INFO | automation.py | Checking for orphan process with PID: 2215
INFO | automation.py | Checking for orphan process with PID: 2222
INFO | automation.py | Checking for orphan process with PID: 2224
INFO | automation.py | Checking for orphan process with PID: 2226
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nnethercote@mozilla.com-4c6a523c293f/try-linux-debug/firefox-15.0a1.en-US.linux-i686.crashreporter-symbols.zip
PROCESS-CRASH | chrome://mochitests/content/chrome/dom/workers/test/test_chromeWorkerJSM.xul | application crashed (minidump found)
Crash dump filename: /tmp/tmpo7OmGS/minidumps/7b9f7c77-dbcf-528d-08ca8fc3-3db40cdd.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686
CPU: x86
     GenuineIntel family 6 model 23 stepping 10
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x28

Thread 28 (crashed)
 0  libxul.so!mozilla::dom::workers::WorkerPrivate::DisableMemoryReporter [Mutex.h : 108 + 0x0]
    eip = 0x0197bc25   esp = 0xa7bff0f4   ebp = 0xa7bff148   ebx = 0x02d8b528
    esi = 0xb4c54530   edi = 0x00000001   eax = 0x00000000   ecx = 0x0a0db1e8
    edx = 0xa7bff128   efl = 0x00010286
    Found by: given as instruction pointer in context
 1  libxul.so!mozilla::dom::workers::WorkerPrivate::DoRunLoop [WorkerPrivate.cpp : 2795 + 0x8]
    eip = 0x0197c395   esp = 0xa7bff150   ebp = 0xa7bff1f8   ebx = 0x02d8b528
    esi = 0x0a0db1c8   edi = 0xa7bff1ac
    Found by: call frame info
 2  libxul.so!WorkerThreadRunnable::Run [RuntimeService.cpp : 329 + 0x6]
    eip = 0x0196ab65   esp = 0xa7bff200   ebp = 0xa7bff248   ebx = 0x02d8b528
    esi = 0xb4c53d18   edi = 0x0a0db1c8
    Found by: call frame info
Attachment #619725 - Attachment is obsolete: true
Attachment #628193 - Attachment is obsolete: true
Attachment #628193 - Flags: review?(bent.mozilla)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Summary: Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Beta 13 → Perma-orange Linux64 test_memoryReporters.xul crash above mozilla::dom::workers::WorkerPrivate::BlockAndCollectRuntimeStats on Release 13
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
What's the status of this bug?  Nick, have you got a chance to work on this again since the last time you looked at it?
(Assignee)

Comment 97

5 years ago
The assertion on try that I couldn't reproduce locally was discouraging enough that I haven't, esp. given that I only half understand this code.  I'll try to get back to it next week.
I have a patch in bug 725907 that also causes this to become perma-orange. Nick's v1 patch fixes it.

  https://tbpl.mozilla.org/?tree=Try&pusher=jorendorff@mozilla.com

You might have to click the green arrow several times; the relevant pushes here are:

  80f70b78ac67 - Fri Jul 6 16:56:13 2012 PDT - green MOth
  23371b36f13c - Fri Jul 6 16:27:10 2012 PDT - orange MOth

Can we land v1 + nits, and add the problematic assert in a follow-up bug?
Blocks: 725907
(Assignee)

Comment 99

5 years ago
I just looked at this again, quite closely.  I don't understand how the assertion could cause a crash.  We lock the WorkerPrivate's |mMutex| in WorkerPrivate::DisableMemoryReporter() immediately before calling WorkerMemoryReporter::Disable(), and then this line causes the crash:

    mWorkerPrivate->mMutex.AssertCurrentThreadOwns();

It's not clear to me exactly what causes the crash because there's lots of inlining going on even in debug builds, and I can't reproduce it locally.

I see two possible ways to proceed.

- We land my patch with the assertion disabled, cross our fingers, and file a follow-up bug.

- We disable the worker memory reporter entirely, and file a follow-up bug to re-enable it.  This is safer, but obviously a shame.

In both cases, I don't know how progress would be made on the follow-up bug, since I wasn't able to make any progress in this one.  Maybe bent could try.

Preferences?
(In reply to comment #99)
> It's not clear to me exactly what causes the crash because there's lots of
> inlining going on even in debug builds, and I can't reproduce it locally.

Not sure if you're using our tinderbox debug builds, but those also have --enable-optimized which can explain this.  Have you tried doing a build with --enable-debug --disable-optimized (sorry if I'm preaching to the choir :-)
(Assignee)

Comment 101

5 years ago
Created attachment 641353 [details] [diff] [review]
(part 1) - Make WorkerMemoryReporter less crashy.

bent, here's the patch with the assertion disabled.  I'll let you decide if
that's reasonable, since you wrote this code originally.  If you r- it I
will write a patch that disables the worker memory reporter.  Thanks.
Attachment #641353 - Flags: review?(bent.mozilla)
(Assignee)

Comment 102

5 years ago
Created attachment 641354 [details] [diff] [review]
(part 2) - Move code around to merge two anonymous namespaces; no functional changes.

Trivial follow-up patch.
Attachment #641354 - Flags: review?(bent.mozilla)
(Assignee)

Updated

5 years ago
Attachment #630084 - Attachment is obsolete: true
(Assignee)

Comment 103

5 years ago
> Not sure if you're using our tinderbox debug builds, but those also have
> --enable-optimized which can explain this.  Have you tried doing a build
> with --enable-debug --disable-optimized

Good suggestion.  I just tried debug Linux32 and Linux64 builds with this.  The stack traces were only marginally better, and not in a way that gave me any more salient information.

Crash reason:  SIGSEGV
Crash address: 0x28

Thread 27 (crashed)
 0  libxul.so!mozilla::Mutex::AssertCurrentThreadOwns [Mutex.h : 108 + 0x3]
    eip = 0x01546c9b   esp = 0xae086090   ebp = 0xae0860a8   ebx = 0x03fa57b4
    esi = 0x0a2c41d0   edi = 0x02cd8990   eax = 0x00000020   ecx = 0xae086c04
    edx = 0x087b8a20   efl = 0x00010206
    Found by: given as instruction pointer in context
 1  libxul.so!mozilla::dom::workers::WorkerMemoryReporter::Disable [WorkerPrivate.cpp : 260 + 0x10]
    eip = 0x02037087   esp = 0xae0860b0   ebp = 0xae0860c8   ebx = 0x03fa57b4
    esi = 0x0a2c41d0   edi = 0x02cd8990
    Found by: call frame info
 2  libxul.so!mozilla::dom::workers::WorkerPrivate::DisableMemoryReporter [WorkerPrivate.cpp : 2944 + 0x17]
    eip = 0x0203d43d   esp = 0xae0860d0   ebp = 0xae086108   ebx = 0x03fa57b4
    esi = 0x0a2c41d0   edi = 0x02cd8990
    Found by: call frame info


I also tried a local build that used the exact mozconfig used on the test machines, but I still wasn't able to reproduce the crash :(
(Assignee)

Comment 104

5 years ago
Created attachment 641676 [details] [diff] [review]
(part 1) - Make WorkerMemoryReporter less crashy.

I finally managed to catch this in a debugger.  Just a missing NULL check on
mWorkerPrivate!
Attachment #641676 - Flags: review?(bent.mozilla)
(Assignee)

Updated

5 years ago
Attachment #641353 - Attachment is obsolete: true
Attachment #641353 - Flags: review?(bent.mozilla)
Comment on attachment 641676 [details] [diff] [review]
(part 1) - Make WorkerMemoryReporter less crashy.

I'm nervous that we don't know why this null check should be necessary... But if this helps unblock others then we should do it.
Attachment #641676 - Flags: review?(bent.mozilla) → review+
Attachment #641354 - Flags: review?(bent.mozilla) → review+
(Assignee)

Comment 106

5 years ago
> I'm nervous that we don't know why this null check should be necessary...
> But if this helps unblock others then we should do it.

It turns out the memory reporter can be disabled (for reasons I don't understand but aren't important) when CTypes are used by a worker.  Then when the worker shuts down, Disable() is called again, whereupon mWorkerPrivate will already be NULL.  So the NULL check is the right thing to do.  Phew.  I'll land this once my try server run has finished.
(Assignee)

Comment 107

5 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/7e78cbe9ea94
https://hg.mozilla.org/integration/mozilla-inbound/rev/bdc169cbbbac

\o/
\o/
https://hg.mozilla.org/mozilla-central/rev/7e78cbe9ea94
https://hg.mozilla.org/mozilla-central/rev/bdc169cbbbac
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla16
(Assignee)

Updated

5 years ago
Depends on: 812935
Keywords: intermittent-failure
Whiteboard: [see comment 71][orange][perma-orange][test which aborts the suite] → [see comment 71][perma-orange][test which aborts the suite]
Keywords: regressionwindow-wanted
You need to log in before you can comment on or make changes to this bug.