Closed Bug 890840 Opened 6 years ago Closed 6 years ago

Intermittent TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_oop_XFrameOptionsAllowFrom.html | Test timed out. - ###!!! ABORT: ActorDestroy by IPC channel failure at LayerTransactionChild:

Categories

(Core :: Graphics: Layers, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla26
Tracking Status
firefox24 --- unaffected
firefox25 --- wontfix
firefox26 --- fixed

People

(Reporter: cbook, Assigned: markh)

References

()

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 1 obsolete file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=25007029&tree=Mozilla-Inbound

WINNT 6.2 mozilla-inbound pgo test mochitest-2 on 2013-07-08 00:42:05 PDT for push e5d26a0da6e9

slave: t-w864-ix-036

ERROR TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_oop_XFrameOptionsAllowFrom.html | Test timed out.
Component: GFX: Color Management → Graphics: Layers
The patches in bug 891218 cause this same basic crash to happen on 9/10 runs of the browser/base/content/test/social/ directory (see bug 891218 comment 30).  The crash *usually* happens in one of the "chatwindow" tests - if I disable those tests the crash drops to probably 2/10 times.  This is on Windows - I haven't checked on any other platforms.

Hopefully this reliability will help in tracking this down.  I'm marking this bug as blocking bug 891218 and also the crash-e10s meta bug.
Blocks: 891218, 899758
OS: Mac OS X → All
Hardware: x86 → All
gavin and nrc had a (very) brief chat about this on IRC earlier this week - we'd really like to get to the bottom of it so we can better leverage remote processes.

I can *not* reproduce this with a debug build.  However, a stack-trace from a release build with symbols is attached.  Sadly it doesn't look that interesting to me - that crashing thread doesn't seem to have anything useful (and also doesn't have symbols from the Windows DLLs.)  I'm attaching it anyway incase it makes sense to someone.

Nick - anything useful there, or anything else you can suggest I do to get something useful?
Flags: needinfo?(ncameron)
I can't think of anything to do without more info really. And I'm not sure what more info, other than catching this in a debugger would be. Have you tried with a recent build? I have fixed a few bugs recently that could cause this error or ones like it.

Maybe bjacob or dzbarsky know something more, they both know more about IPDL/IPC than I do.
Flags: needinfo?(ncameron)
(In reply to Nick Cameron [:nrc] from comment #3)
> Have you tried with a recent build?

Yes - this was from a current m-c build.
Comment on attachment 787256 [details]
Stack trace from a crashing mochi test run

The problem is that this is a crash stack from the parent process, not the child that crashed.
> The problem is that this is a crash stack from the parent process, not the
> child that crashed.

I think it actually is the parent crashing.  I poked around a little, and in the parent process, debug builds would throw an "InvalidHandle" exception (which seems to be ignored), but release builds crash with the following stack trace.

>	ntdll.dll!_TppWaiterpThread@4()  + 0x49948 bytes	
 	kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes	
 	ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes	
 	ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes	

Googling this stack trace, I found a chromium bug (http://code.google.com/p/chromium/issues/detail?id=86334) that implied the problem might be a handle being closed twice - which matches up with the InvalidHandle exception I see.  Instrumenting ContentParent.cpp, I could indeed see the same process handle closed twice, generally because KillHard() was being called twice.

I haven't deduced exactly why KillHard is called twice, nor why debug builds do not crash, but the attached patch prevent this and prevents the crash - I used to see it on ~19 out of 20 runs, and I haven't seen it once with the patch applied.  The log doesn't show anything too bad happening - a few:

0:26.60 [Child 7108] WARNING: pipe error: 109: file o:/src/mozilla-git/central/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 312
 
(error 109 is ERROR_BROKEN_PIPE) which is probably benign, and:

0:32.58 ###!!! [Parent][SyncChannel] Error: Value error: message was deserialized, but contained an illegal value

which I've seen lots of times playing with e10s related stuff.

Justin seems the most prolific recent contributor to this file: any insights into this, or feedback on the patch?
Flags: needinfo?(justin.lebar+bug)
If this fixes something, that's great!  But since it's not true that [(KillHard() has been called) iff (!!mForceKillTask)] (because mForceKillTask eventually becomes null), should we add a bool variable to ContentParent to actually prevent us from killing the process twice?
Flags: needinfo?(justin.lebar+bug)
> But since it's not true that [(KillHard() has been called)
> iff (!!mForceKillTask)]

Yeah, I started to convince myself that wasn't the case, but just thought I'd get something up incase you had a better idea of how to fix the issue.

This patch just adds 'bool mCalledKillHard' and (obviously) still solves the problem for me.
Assignee: nobody → mhammond
Attachment #788821 - Attachment is obsolete: true
Attachment #791082 - Flags: review?(justin.lebar+bug)
Comment on attachment 791082 [details] [diff] [review]
0001-Bug-890840-calling-KillHard-multiple-times-may-crash.patch

r=me if you add a comment explaining at a high level why we can't KillHard twice (and maybe reference this bug number).
Attachment #791082 - Flags: review?(justin.lebar+bug) → review+
https://hg.mozilla.org/mozilla-central/rev/12a6a12d899d
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla26
This was filed against Fx25, however, it appears this depends on the patch from bug 895204, which landed for Fx26. Is this worth uplifting? What about beta?
Keeping in mind that Fx24 is slated to be the next ESR release as well.
We've seen the original orange exactly once, and this only started to become reproducible with lots of content processes for ongoing social work that *might* land in 26 and surely isn't going to be uplifted.  The issue here is definitely related to content processes.

Fx24 has no content processes IIUC.  25 will have background thumbnails, but we've yet to see this crash caused by them.

Given the above and that this depends on the far less trivial patch in bug 895204, I think the risk of uplifting is greater than the risk of this actually being a problem in 25/24, so I think we should wontfix those branches.

    It's a bit of a margin call though, so someone who is less risk-averse could convince me otherwise :)
You need to log in before you can comment on or make changes to this bug.