Closed Bug 1051567 Opened 5 years ago Closed 3 years ago

[e10s] Crash in [@ mozilla::ipc::MessageChannel::OnChannelErrorFromLink]

Categories: Core :: IPC, defect, P4, critical
Version: 50 Branch
Hardware: x86_64
OS: All

Status: RESOLVED FIXED
Target Milestone: mozilla51

Tracking Status (tracking flag, status):
e10s: +, ---
firefox47: ---, wontfix
firefox48: ---, wontfix
firefox49: -, fixed
firefox50: +, fixed
firefox51: +, fixed

People: Reporter: szx, Assigned: kanru

References: Depends on 1 open bug, Blocks 1 open bug

Details: 5 keywords

Attachments: 2 files

User Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 (Beta/Release)
Build ID: 20140807212602



Actual results:

This is the stack trace:

03e0f298 580a1751 mozalloc!mozalloc_abort(char * msg = 0x03e0f2e0 "[4864] ###!!! ABORT: Aborting on channel error.: file c:/builds/moz2_slave/rel-m-beta-w32_bld-00000000000/build/ipc/glue/MessageChannel.cpp, line 1532")+0x2a [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\memory\mozalloc\mozalloc_abort.cpp @ 30]
03e0f6d0 58112d78 xul!NS_DebugBreak(unsigned int aSeverity = 3, char * aStr = 0x58ecf8f0 "Aborting on channel error.", char * aExpr = 0x00000000 "", char * aFile = 0x58ecf178 "c:/builds/moz2_slave/rel-m-beta-w32_bld-00000000000/build/ipc/glue/MessageChannel.cpp", int aLine = 0n1532)+0x1ff [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\xpcom\base\nsdebugimpl.cpp @ 451]
03e0f6f0 58113e02 xul!mozilla::ipc::MessageChannel::OnChannelErrorFromLink(void)+0x4e [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\glue\messagechannel.cpp @ 1532]
03e0f6fc 581016bc xul!mozilla::ipc::ProcessLink::OnChannelError(void)+0x1b [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\glue\messagelink.cpp @ 356]
03e0f70c 580ff0c1 xul!IPC::Channel::ChannelImpl::OnIOCompleted(struct base::MessagePumpForIO::IOContext * context = 0x0276d004, unsigned long bytes_transfered = 0, unsigned long error = 0x6d)+0x7c [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\chrome\common\ipc_channel_win.cc @ 452]
03e0f734 580ff17f xul!base::MessagePumpForIO::WaitForIOCompletion(unsigned long timeout = 0xffffffff, class base::MessagePumpForIO::IOHandler * filter = 0x00000000)+0x74 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.cc @ 524]
03e0f744 580ff2ed xul!base::MessagePumpForIO::WaitForWork(void)+0x19 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.cc @ 501]
03e0f750 580fe981 xul!base::MessagePumpForIO::DoRunLoop(void)+0x50 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.cc @ 463]
03e0f770 580fedbb xul!base::MessagePumpWin::RunWithDispatcher(class base::MessagePump::Delegate * delegate = 0x03e0f7e8, class base::MessagePumpWin::Dispatcher * dispatcher = 0x00000000)+0x3c [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.cc @ 55]
03e0f77c 581072ae xul!base::MessagePumpWin::Run(class base::MessagePump::Delegate * delegate = 0x03e0f7e8)+0xb [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.h @ 78]
03e0f7b4 5810737d xul!MessageLoop::RunHandler(void)+0x51 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_loop.cc @ 223]
03e0f7d4 58109d40 xul!MessageLoop::Run(void)+0x19 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_loop.cc @ 197]
03e0f8c0 585f35a6 xul!base::Thread::ThreadMain(void)+0xa4 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\thread.cc @ 171]
03e0f8c4 754a919f xul!ThreadEntry(void * arg = 0x027251c8)+0x9 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\tools\profiler\platform-win32.cc @ 248]
03e0f8d0 771ea22b KERNEL32!BaseThreadInitThunk+0xe
03e0f914 771ea201 ntdll!__RtlUserThreadStart+0x20
03e0f924 00000000 ntdll!_RtlUserThreadStart+0x1b
Status: UNCONFIRMED → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1047160
Reopening per bug 1047160 comment 6.
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: DUPLICATE → ---
See Also: → 1047160
By the way, I still have the crash dump on my disk in case you would like me to extract something else from it.
Status: REOPENED → UNCONFIRMED
Component: Untriaged → IPC
Ever confirmed: false
Product: Firefox → Core
I had a similar crash today; it seems to happen when I close the browser window.
As best I can tell this is actually the child process crashing.
I am getting this crash pretty much daily on an e10s build.
Status: UNCONFIRMED → NEW
tracking-e10s: --- → ?
Ever confirmed: true
Attached file OSX crash reporter log
Assignee: nobody → mrbkap
I wonder if this could be related to bug 1035454.
(In reply to Blake Kaplan (:mrbkap) from comment #8)
> I wonder if this could be related to bug 1035454.

Hmm, I have no idea how that could be relevant here.  Can you please clarify?  (I'd be happy to run some diagnostics if you've got things for me to check.)
My nightly is pretty unusable with this crash, and it's getting worse every day...  I'm probably going to stop dogfooding e10s until this bug is fixed.
Keywords: dogfood
Brad, I think we should not enable e10s on Nightly before we fix this bug.  It makes the browser very crashy.  I unfortunately don't have STRs but I get these crashes every hour or so these days.  Youtube hits this a *lot*.
Flags: needinfo?(blassey.bugs)
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #11)
> Brad, I think we should not enable e10s on Nightly before we fix this bug. 
> It makes the browser very crashy.  I unfortunately don't have STRs but I get
> these crashes every hour or so these days.  Youtube hits this a *lot*.

Part of the reason to enable on nightly is to get a volume of crash data that will allow us to prioritize crash bugs. 

Right now I see 56 crashes over the last 30 days and they're all on Windows. Jim, have you seen this signature?
Flags: needinfo?(blassey.bugs) → needinfo?(jmathies)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #12)
> (In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from
> comment #11)
> > Brad, I think we should not enable e10s on Nightly before we fix this bug. 
> > It makes the browser very crashy.  I unfortunately don't have STRs but I get
> > these crashes every hour or so these days.  Youtube hits this a *lot*.
> 
> Part of the reason to enable on nightly is to get a volume of crash data
> that will allow us to prioritize crash bugs. 

According to the crash reporter, not a single one of these crashes has been submitted.  They all trigger the built-in OSX crash reporter, which is why I'm trying to bring this to your attention.
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #16)
> This link reproduces this crash for me very reliably:
> <http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/machine-translation.html>

(The reason is that the said link triggers bug 1079422 which causes an unrelated crash in the content process, and then I get a popup reporting the crash this bug is filed for.)
Just for the record, I am hitting this crash 1+ times a day.  And not a single one of them has so far triggered breakpad.
I noticed this crash happens frequently on treeherder these days.
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=7235805
See Also: → 965705
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #19)
> I noticed this crash happens frequently on treeherder these days.
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=7235805

Hi Blake,
Are you able to help make some progress here? Thank you.
Flags: needinfo?(mrbkap)
retriaging:

<RyanVM|sheriffduty> jimm: oh god, it's *that* ?
<RyanVM|sheriffduty> lord knows we have plenty of OnChannelErrorFromLink issues
<jimm> um, actually OnChannelErrorFromLink aborts are pretty rare afaik
<jimm> we have a few meta signatures
<jimm> this isn't one of the bad ones
<RyanVM|sheriffduty> jimm: bug 1142693?
<RyanVM|sheriffduty> only our top failure on OSX by a country mile
Blocks: 1142693
Flags: needinfo?(mrbkap)
note - bug 1152372 reproduces on Windows.
OS: Windows 8.1 → All
A portion of the mac related crashes are covered by bug 1142693.
> note - bug 1152372 reproduces on Windows.

This bug looks like a poorly filed bug that blames some ipc code for a python automation problem. I don't think this crash happens on Windows.

AFAICT this is a test only crash too - 

https://crash-stats.mozilla.com/search/?product=Firefox&version=40.0a1&signature=~OnChannelErrorFromLink&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

and I think we might have a fix for it in sub bug 1142693.
(In reply to Jim Mathies [:jimm] from comment #25)
> > note - bug 1152372 reproduces on Windows.
> 
> This bug looks like a poorly filed bug that blames some ipc code for a
> python automation problem. I don't think this crash happens on Windows.
> 
> AFAICT this is a test only crash too

I hit this a few days ago on Windows using the latest nightly during normal browsing, though I don't remember what I was doing at the time.

bp-c49b2000-9c7c-4dc6-a79a-9ac502150406
(In reply to Trevor Rowbotham from comment #26)
> (In reply to Jim Mathies [:jimm] from comment #25)
> > > note - bug 1152372 reproduces on Windows.
> > 
> > This bug looks like a poorly filed bug that blames some ipc code for a
> > python automation problem. I don't think this crash happens on Windows.
> > 
> > AFAICT this is a test only crash too
> 
> I hit this a few days ago on Windows using the latest nightly during normal
> browsing, though I don't remember what I was doing at the time.
> 
> bp-c49b2000-9c7c-4dc6-a79a-9ac502150406

Yep, that's this crash on Windows. Still very rare in the wild, or possibly we are having problems getting crash reports submitted - 

https://crash-stats.mozilla.com/report/list?product=Firefox&signature=mozalloc_abort%28char+const*+const%29+|+NS_DebugBreak+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AOnChannelErrorFromLink%28%29
I found a way to reproduce it on FF40/41 with e10s enabled (Win 7).
https://crash-stats.mozilla.com/report/index/8e8c2c6d-5045-4e6b-95a6-f6da92150604

I'm bisecting right now.
I filed bug 1171307 because I'm not sure if it's the same underlying issue.
Priority: -- → P4
See Also: → 1171307
Crash Signature: [@ mozalloc_abort | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::OnChannelErrorFromLink ]
This signature is still happening sometimes on Nightly.
This happens when Nightly is opened on Mac OS X 10.12 Sierra. Usually 2-4 crashes are recorded as soon as Nightly is started with a clean profile (not 100% reproducible).

Here are some crash reports:
bp-4282b3bd-682f-4a19-8f7d-6d0d82160714
bp-81c3cdb7-6c98-46df-b637-05a542160714
bp-867ca5a6-4fdf-4641-b224-6aeb12160714
bp-d674a63b-37d5-4e3e-8941-101f42160714
bp-bf1bce20-8023-475b-a780-33b682160714
bp-9b2ffe73-2674-41b2-8545-470c82160714
Version: 32 Branch → 50 Branch
Crash volume for signature 'mozalloc_abort | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::OnChannelErrorFromLink':
 - nightly (version 50): 336 crashes from 2016-06-06.
 - aurora  (version 49): 742 crashes from 2016-06-07.
 - beta    (version 48): 5 crashes from 2016-06-06.
 - release (version 47): 5 crashes from 2016-05-31.
 - esr     (version 45): 0 crash from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly        139         36         27         19         35         37         10
 - aurora         203         83         71         76        127        106         11
 - beta             1          1          2          1          0          0          0
 - release          0          0          0          0          3          1          0
 - esr              0          0          0          0          0          0          0

Affected platform: Mac OS X
This signature started rising on Nightly using the 20160801074053 build. It is now the top crash on Nightly.
FYI - Seems to be related: I see a crash with Firefox48 (release) and Selenium-beta2 (Java client), platform: Windows 8.1, when invoking .quit()

1470659291121	Marionette	INFO	Listening on port 55104
[Child 8720] WARNING: pipe error: 109: file c:/builds/moz2_slave/m-rel-w32-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 343
[Child 8720] WARNING: pipe error: 109: file c:/builds/moz2_slave/m-rel-w32-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 343
1470659292014	Marionette	INFO	startBrowser 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
1470659292021	Marionette	INFO	sendAsync 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
1470659292157	Marionette	INFO	sendAsync 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
1470659292379	Marionette	INFO	sendAsync 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
[Child 1340] WARNING: pipe error: 232: file c:/builds/moz2_slave/m-rel-w32-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 497
[Child 1340] ###!!! ABORT: Aborting on channel error.: file c:/builds/moz2_slave/m-rel-w32-00000000000000000000/build/src/ipc/glue/MessageChannel.cpp, line 2046
Exception in thread "main" org.openqa.selenium.remote.UnreachableBrowserException: Error communicating with the remote browser. It may have died.
Build info: version: 'unknown', revision: '2aa21c1', time: '2016-08-02 14:59:43 -0700'
System info: host: 'lnz-geralde3', ip: '192.168.56.1', os.name: 'Windows 8.1', os.arch: 'amd64', os.version: '6.3', java.version: '1.8.0_45'
Driver info: driver.version: RemoteWebDriver
Capabilities [{rotatable=false, raisesAccessibilityExceptions=false, marionette=true, appBuildId=20160726073904, version=, platform=XP, proxy={}, command_id=1, specificationLevel=0, acceptSslCerts=false, browserVersion=48.0, platformVersion=6.3, XULappId={ec8030f7-c20a-464f-9b0e-13a3a9e97384}, browserName=Firefox, takesScreenshot=true, takesElementScreenshot=true, platformName=Windows_NT, device=desktop}]
Session ID: 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:670)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:706)
	at org.openqa.selenium.remote.RemoteWebDriver.quit(RemoteWebDriver.java:531)
	at FirefoxSmokeTest.main(FirefoxSmokeTest.java:20)
Caused by: java.lang.IllegalStateException: UnixUtils may not be used on Windows
	at org.openqa.selenium.os.ProcessUtils.getProcessId(ProcessUtils.java:188)
	at org.openqa.selenium.os.UnixProcess$SeleniumWatchDog.getPID(UnixProcess.java:222)
	at org.openqa.selenium.os.UnixProcess$SeleniumWatchDog.access$300(UnixProcess.java:201)
	at org.openqa.selenium.os.UnixProcess.destroy(UnixProcess.java:132)
	at org.openqa.selenium.os.CommandLine.destroy(CommandLine.java:155)
	at org.openqa.selenium.remote.service.DriverService.stop(DriverService.java:196)
	at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:94)
	at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:649)
	... 3 more
[Tracking Requested - why for this release]:
As mentioned in comment 33 this is a topcrash in Nightly now. So asking for tracking the 51 release.
Keywords: topcrash
Duplicate of this bug: 1171307
Assuming that bug 1293090 is correct that we're hitting this same crash in automation, this may not just be "Nightly" the trunk channel, but nightly as in the builds we do at 3am, as distinct from the builds we do on every push. That would mean a major party foul on someone's part, because the binaries and the behavior of tests are not supposed to differ between nightlies and on-push builds.

Is it possible to tell the difference between crash reports from nightly-Nightly users and the rare default-update-channel users of on-push builds, to see whether this is happening on non-nightly builds?
[Tracking Requested - why for this release]: Nominating this for 50 tracking as it currently sits as the #2 top crash on 50, and it is being identified as a startup crash. It sits at #4 on 51.

I asked a few folks to try to get an answer to Phil's question in Comment 37, but so far I haven't found someone who has the answer.
Hello mccr8, I've seen you work on IPC related OOM crashes in e10s. Just wondering, is this something you can help investigate/fix? Please let me know.

Hello overholt, this was mentioned as a top crash in the channel meeting today and as the engineering owner for Fx50, could you please help me find an owner who can investigate this? Thanks!
Flags: needinfo?(overholt)
Flags: needinfo?(continuation)
This seems quite messy. Does bug 1152372 still reproduce this crash?

Jed, can you please take a quick look and see if anything jumps out at you?
Flags: needinfo?(overholt) → needinfo?(jld)
(In reply to Bogdan Maris, QA [:bogdan_maris][PTO 08-22 Aug] from comment #31)
> bp-4282b3bd-682f-4a19-8f7d-6d0d82160714
> bp-81c3cdb7-6c98-46df-b637-05a542160714
> bp-867ca5a6-4fdf-4641-b224-6aeb12160714
> bp-d674a63b-37d5-4e3e-8941-101f42160714
> bp-bf1bce20-8023-475b-a780-33b682160714
> bp-9b2ffe73-2674-41b2-8545-470c82160714

These all show a content process that's starting up, has sent a PCrashReporter constructor (which is sync) to the parent, and gets an IPC channel error while waiting for a reply.  

These are all on OS X, which has a history of OS bugs affecting the features we use for IPC (e.g., bug 1142693), so that might be part of it.  It would help to have a little more detail on the I/O error that caused the crash, but I'm not seeing anything useful in crash-stats.
Flags: needinfo?(jld)
This doesn't look OOM related as far as I can tell.
Flags: needinfo?(continuation)
Assignee: mrbkap → nobody
Severity: normal → critical
Keywords: crash
Summary: Crash in mozilla::ipc::MessageChannel::OnChannelErrorFromLink → [e10s] Crash in [@ mozilla::ipc::MessageChannel::OnChannelErrorFromLink]
Blocks: 1276526
No longer depends on: 1276526
This crash is hitting our Marionette tests on OS X debug builds at a high volume. See all the bugs marked as blocked by this one. It looks like Firefox crashes randomly during the test job. Is there anything we can do here soon to save the sheriffs from having to star that many test failures? That would be great! Thanks.
I looked at some of the blocked bugs, they all crash when the ContentChild is creating the PCrashReporter actor.

gecko.log has this line:
[Child 1935] WARNING: Message needs unreceived descriptors channel:1129c3000 message-type:4849673 header()->num_fds:1 num_fds:0 fds_i:0: file /builds/slave/m-in-m64-000000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 482

Bill, does this ring any alarm bells?
Flags: needinfo?(wmccloskey)
Indeed it looks very similar to bug 1142693.
Flags: needinfo?(wmccloskey) → needinfo?(jld)
https://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/4867c171ae0c3bb1face4f6b2e9025270ee8db27020d976805503f10fff636ed30aaf63144de66bd8c59a7273c0b12330c9485f8a42664b7af18f7cd3bd00b65

The gecko.log also has this line:
[Parent 1931] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/slave/m-in-m64-000000000000000000000/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
One possibility is that we are leaking fds, or that other processes consumed too many fds, so the child process failed to create a new one. However, the fact that all the crashes are in PCrashReporterConstructor is suspicious.
Kan-Ru, out of interest, do those automation crashes map with those reported to crashstats? If not we may have another underlying issue?
(In reply to Henrik Skupin (:whimboo) from comment #48)
> Kan-Ru, out of interest, do those automation crashes map with those reported
> to crashstats? If not we may have another underlying issue?

They have similar crash stacks, so I assume they are the same crashes. That means if we fix this we are fixing not only automation but also real crashes.
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #47)
> One possibility is that we are leaking fds or other processes consumed too
> many fds so the child process failed to create a new one. However all the
> crashes are in PCrashReporterConstructor is suspicious.

The reason we crash at PCrashReporterConstructor is that it's the first sync message sent from ContentChild to the parent, which forces us to consume the incoming messages.
The msgtype 4849673 in the log is PContent::Msg_InitCompositor__ID, so it looks like something is wrong in either the new Endpoints code or the out-of-process compositor code.
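
For readers decoding similar log lines, here is a minimal sketch of how such a number can be split. It assumes the usual IPDL layout, in which a message ID is the protocol's start slot shifted left by 16 bits plus a per-protocol index; that layout is an assumption here and is not stated in this bug. `SplitMsgType` and `SplitMessageType` are hypothetical names. Under that assumption, 4849673 is 0x4A0009, i.e. protocol slot 74 and index 9. Mapping the pair back to PContent::Msg_InitCompositor__ID still requires the generated headers for the exact build.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type; not part of Gecko.
struct SplitMsgType {
  uint32_t protocol;  // upper 16 bits: protocol start slot (assumed layout)
  uint32_t index;     // lower 16 bits: message index within the protocol
};

// Splits a numeric message type under the assumed
// (protocol << 16) + index encoding.
inline SplitMsgType SplitMessageType(uint32_t type) {
  SplitMsgType out;
  out.protocol = type >> 16;
  out.index = type & 0xFFFFu;
  return out;
}
```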
Flags: needinfo?(dvander)
bug 1293580 is similar, I think. Mac runs out of fds a lot and we don't seem to know why. We can fail on either the sending side (by failing to create a channel), or on the receiving side, when SCM_RIGHTS or something fails to find a new descriptor.

Meanwhile, the old IPDL bridging model looked like this:
 1. Allocate fds for a channel.
 2. On failure, return.
 3. On success, send bridge messages.

Most consumers assumed that bridges never failed, because even if they did, nothing would "appear" wrong. The bridge would simply never happen on the other side, and functionality would be silently broken/missing (and in some cases would probably crash later).

Endpoints work differently. The caller must check for failure, and if you send an invalid endpoint, IPDL will crash. When we switched the compositor to use Endpoints, suddenly Mac's file descriptor problems exacerbated the fact that we're not handling errors properly.

We can and should fix our error-checking of Endpoints, which I'll do in bug 1293580. But this won't solve the fact that Mac is running out of descriptors way too often, and that will lead to broken behavior.

Can we get anyone to try bug 1296756? It'd be great if we could see how many of each descriptor type is open.
Flags: needinfo?(dvander)
Depends on: 1293580, 1296756
I'll take a look at bug 1296756
I don't think I have anything useful to add at this point.
Flags: needinfo?(jld)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0cfacd0d39a&selectedJob=26416090

OpenedFileDescriptors: 11 2 0 2 0 0 0 7 0

The total number of open fds is 11, which is not particularly high; that makes sense because the process is just starting up. I'm not sure why SCM_RIGHTS failed to create the fd, though.
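
A tally like the one above can be approximated with a simple probe loop. This is a minimal sketch, not the annotation code from bug 1296756: `FdCounts`, `CountOpenFds`, and the categories are hypothetical, and the eight per-type columns in the real annotation are not identified in this bug. In the numbers above, the first value (11) matches the sum of the remaining eight (2+0+2+0+0+0+7+0).

```cpp
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>

// Hypothetical tally; the categories are illustrative only.
struct FdCounts {
  int total = 0;
  int regular = 0;
  int sockets = 0;
  int pipes = 0;
  int char_devices = 0;
  int other = 0;
};

// Probes descriptors in [0, max_fd) with fstat() and classifies the open ones.
// Descriptors that fstat() cannot describe (typically EBADF) are skipped.
FdCounts CountOpenFds(int max_fd) {
  FdCounts counts;
  for (int fd = 0; fd < max_fd; ++fd) {
    struct stat st;
    if (fstat(fd, &st) != 0) {
      continue;
    }
    ++counts.total;
    if (S_ISREG(st.st_mode)) {
      ++counts.regular;
    } else if (S_ISSOCK(st.st_mode)) {
      ++counts.sockets;
    } else if (S_ISFIFO(st.st_mode)) {
      ++counts.pipes;
    } else if (S_ISCHR(st.st_mode)) {
      ++counts.char_devices;
    } else {
      ++counts.other;
    }
  }
  return counts;
}
```

A low total at the moment of failure, as reported here, argues against plain descriptor exhaustion in the child.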
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #54)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0cfacd0d39a&selectedJob=26416090
> 
> OpenedFileDescriptors: 11 2 0 2 0 0 0 7 0
> 
> total opened fds is 11, not particular high, which makes sense because the
> process is starting up. Not sure why SCM_RIGHTS failed to create fd though.

Where were you able to tell that SCM_RIGHTS failed?
(In reply to David Anderson [:dvander] from comment #55)
> (In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #54)
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0cfacd0d39a&selectedJob=26416090
> > 
> > OpenedFileDescriptors: 11 2 0 2 0 0 0 7 0
> > 
> > total opened fds is 11, not particular high, which makes sense because the
> > process is starting up. Not sure why SCM_RIGHTS failed to create fd though.
> 
> Where were you able to tell that SCM_RIGHTS failed?

I can't 100% tell that SCM_RIGHTS failed but the log is added after the "Message needs unreceived descriptors" error. It means we have completely received the message but not the fds. It doesn't look like the sendmsg failed.

Fortunately or unfortunately, this is easily reproducible on try so I can try to log more information.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb05a72031b2

Not sure what has happened, but with my logging patch I can't reproduce this on try anymore.
Currently this bug is #3 top browser crash on Nightly. I know various people have chimed in on this bug - is someone actually willing to take ownership of it?
Assignee: nobody → kchen
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ee086ef5822

From the log I found that the failure steps are always like this:

1. parent: 1 file descriptor to send
2. parent: successfully writes 64 bytes
3. child: read 4096 bytes
4. parent: 1 file descriptor to send
5. parent: failed to write 64 bytes, EAGAIN
6. parent: successfully writes 64 bytes
7. child: read 128 bytes
8. child: only receives 1 file descriptor

I think that after step 5 we forgot to pack the file descriptors into the message we resend.
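
The failure sequence in this comment can be illustrated with plain POSIX calls. This is a minimal sketch under stated assumptions, not the Gecko code in ipc_channel_posix.cc: `SendChunk`, `SendAll`, and `RecvChunk` are hypothetical helpers, and only one descriptor is passed. The point it shows is that the decision to attach descriptors is keyed on "no bytes of this message have been sent yet" rather than on "this is the first attempt", so a retry after EAGAIN still carries the descriptor.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not Gecko code). Makes one sendmsg() attempt for the
// unsent tail of `data`. The descriptor is attached as SCM_RIGHTS ancillary
// data whenever no byte of the message has gone out yet, so a retry after a
// failed first attempt still carries it.
ssize_t SendChunk(int sock, const char* data, size_t len, size_t sent, int fd) {
  struct iovec iov;
  iov.iov_base = const_cast<char*>(data + sent);
  iov.iov_len = len - sent;

  struct msghdr msg;
  std::memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;

  alignas(struct cmsghdr) char control[CMSG_SPACE(sizeof(int))];
  if (sent == 0 && fd >= 0) {
    std::memset(control, 0, sizeof(control));
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
  }
  return sendmsg(sock, &msg, 0);
}

// Sends the whole message, waiting for writability on EAGAIN.
bool SendAll(int sock, const char* data, size_t len, int fd) {
  size_t sent = 0;
  while (sent < len) {
    ssize_t n = SendChunk(sock, data, len, sent, fd);
    if (n >= 0) {
      sent += static_cast<size_t>(n);
      continue;
    }
    if (errno == EINTR) {
      continue;
    }
    if (errno == EAGAIN) {
      struct pollfd pfd;
      pfd.fd = sock;
      pfd.events = POLLOUT;
      pfd.revents = 0;
      poll(&pfd, 1, -1);  // nothing was sent, so the retry re-attaches fd
      continue;
    }
    return false;
  }
  return true;
}

// Receives one chunk; stores a passed descriptor in *fd_out, or -1 if none.
ssize_t RecvChunk(int sock, char* buf, size_t cap, int* fd_out) {
  *fd_out = -1;
  struct iovec iov;
  iov.iov_base = buf;
  iov.iov_len = cap;

  alignas(struct cmsghdr) char control[CMSG_SPACE(sizeof(int))];
  struct msghdr msg;
  std::memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);

  ssize_t n = recvmsg(sock, &msg, 0);
  if (n < 0) {
    return n;
  }
  for (struct cmsghdr* c = CMSG_FIRSTHDR(&msg); c != nullptr;
       c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
      std::memcpy(fd_out, CMSG_DATA(c), sizeof(int));
    }
  }
  return n;
}
```

If the sender instead attached descriptors only on its very first attempt, a first attempt that fails with EAGAIN would be followed by a successful write that carries the bytes but no descriptor, which matches the receiver-side symptom in steps 5 to 8.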
I'll have a patch to review pretty soon if this try is green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0be1080bcea
Blocks: 1262671
Keywords: regression
tracking-firefox49? because we think this is a regression from bug 1262671, which landed in 49
I noticed this bug was opened 2 years ago, so the patch in comment 60 can't be fixing the original issue. Let's use this bug to fix the recent spike of this signature and open a new one if more crashes with this signature show up later.
Comment on attachment 8786335 [details]
Bug 1051567 - Make sure we resend file descriptors for the first chunk of a message.

https://reviewboard.mozilla.org/r/75320/#review73268

Thanks for tracking this down Kan-Ru!

::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc:583
(Diff revision 1)
>      static const int tmp = CMSG_SPACE(sizeof(
>          int[FileDescriptorSet::MAX_DESCRIPTORS_PER_MESSAGE]));
>      char buf[tmp];
>  
> -    if (partial_write_iter_.isNothing() &&
> +    if ((partial_write_iter_.isNothing() ||
> +         partial_write_iter_.value().Data() == msg->Buffers().Iter().Data()) &&

Instead of Iter().Data() you can use Start().

It also might make sense to move this code:
http://searchfox.org/mozilla-central/rev/064025c802c22cd5ad122746733cbd34ea47393c/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#614-617
up above this check here. Then we can remove the isNothing() check and just check if Data() == Start().
Attachment #8786335 - Flags: review+
(In reply to Bill McCloskey (:billm) from comment #64)
> Comment on attachment 8786335 [details]
> Bug 1051567 - Make sure we resend file descriptors for the first chunk of a
> message.
> 
> https://reviewboard.mozilla.org/r/75320/#review73268
> 
> Thanks for tracking this down Kan-Ru!
> 
> ::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc:583
> (Diff revision 1)
> >      static const int tmp = CMSG_SPACE(sizeof(
> >          int[FileDescriptorSet::MAX_DESCRIPTORS_PER_MESSAGE]));
> >      char buf[tmp];
> >  
> > -    if (partial_write_iter_.isNothing() &&
> > +    if ((partial_write_iter_.isNothing() ||
> > +         partial_write_iter_.value().Data() == msg->Buffers().Iter().Data()) &&
> 
> Instead of Iter().Data() you can use Start().

I can't use Start() because msg->Buffers() is marked as const. I added a const overload to BufferList::Start(). I assume you would rs+ this change ;)

> It also might make sense to move this code:
> http://searchfox.org/mozilla-central/rev/064025c802c22cd5ad122746733cbd34ea47393c/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#614-617
> up above this check here. Then we can remove the isNothing() check and just
> check if Data() == Start().

Sounds good.
Pushed by kchen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/acb978a84753
Make sure we resend file descriptors for the first chunk of a message. r=billm
https://hg.mozilla.org/mozilla-central/rev/acb978a84753
Status: NEW → RESOLVED
Closed: 5 years ago → 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
No longer blocks: 1290815
No longer blocks: 1296792
No longer blocks: 1297917
No longer blocks: 1297916
Please nominate this for Aurora/Beta approval when you get a chance. Also, glory hallelujah at Mn-e10s now!
Flags: needinfo?(kchen)
Comment on attachment 8786335 [details]
Bug 1051567 - Make sure we resend file descriptors for the first chunk of a message.

Approval Request Comment
[Feature/regressing bug #]: bug 1262671
[User impact if declined]: Users on Linux and/or Mac might see an immediate content process crash after startup
[Describe test coverage new/current, TreeHerder]: Landed on m-c and fixed many Mn-e10s intermittent failures on Mac
[Risks and why]: Low. This just restores the behavior before bug 1262671.
[String/UUID change made/needed]: n/a
Flags: needinfo?(kchen)
Attachment #8786335 - Flags: approval-mozilla-beta?
Attachment #8786335 - Flags: approval-mozilla-aurora?
Comment on attachment 8786335 [details]
Bug 1051567 - Make sure we resend file descriptors for the first chunk of a message.

Let's take this as it reverts some of the behavior which made Mn-e10s tests fail. If we land it right away and it sticks this can make it to the beta 10 build today.
Attachment #8786335 - Flags: approval-mozilla-beta?
Attachment #8786335 - Flags: approval-mozilla-beta+
Attachment #8786335 - Flags: approval-mozilla-aurora?
Attachment #8786335 - Flags: approval-mozilla-aurora+
Untracking for 49 for now, as the crash volume is low on beta.