Closed Bug 1258317 Opened 5 years ago Closed 5 years ago

crash in mozilla::ipc::TransferHandleToProcess (PBackground related)

Categories

(Core :: IPC, defect, P1)

x86
Windows NT
defect

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
e10s + ---
firefox46 --- wontfix
firefox47 --- wontfix
firefox48 --- fixed
firefox49 --- affected

People

(Reporter: kanru, Assigned: cyu)

References

Details

(Keywords: crash, Whiteboard: btpp-active)

Crash Data

Attachments

(4 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-fbd78b18-ba2b-4b29-a44e-c71be2160321.
=============================================================

This is #13, with 155 crashes, on the e10s-beta46-noapz a/b experiment.
looks like Windows specific.

Some different call stack:

Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::ipc::TransferHandleToProcess(void*, unsigned long) 	ipc/glue/Transport_win.cpp
1 	xul.dll 	IPC::ParamTraits<mozilla::ipc::TransportDescriptor>::Write(IPC::Message*, mozilla::ipc::TransportDescriptor const&) 	ipc/glue/Transport_win.h
2 	xul.dll 	mozilla::ipc::ChannelOpened::ChannelOpened(mozilla::ipc::TransportDescriptor, unsigned long, IPCMessageStart, IPC::Message::PriorityValue) 	ipc/glue/ProtocolUtils.cpp
3 	xul.dll 	mozilla::ipc::Bridge(mozilla::ipc::PrivateIPDLInterface const&, mozilla::ipc::MessageChannel*, unsigned long, mozilla::ipc::MessageChannel*, unsigned long, IPCMessageStart, IPCMessageStart) 	ipc/glue/ProtocolUtils.cpp
4 	xul.dll 	mozilla::gmp::PGMPContent::Bridge(mozilla::gmp::PGMPServiceParent*, mozilla::gmp::PGMPParent*) 	obj-firefox/ipc/ipdl/PGMPContent.cpp
5 	xul.dll 	mozilla::gmp::GMPParent::Bridge(mozilla::gmp::GMPServiceParent*) 	dom/media/gmp/GMPParent.cpp
6 	xul.dll 	mozilla::gmp::GMPServiceParent::RecvLoadGMP(nsCString const&, nsCString const&, nsTArray<nsCString>&&, nsTArray<unsigned long>&&, unsigned long*, nsCString*, unsigned int*) 	dom/media/gmp/GMPServiceParent.cpp
7 	xul.dll 	mozilla::gmp::PGMPServiceParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) 	obj-firefox/ipc/ipdl/PGMPServiceParent.cpp

Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::ipc::TransferHandleToProcess(void*, unsigned long) 	ipc/glue/Transport_win.cpp
1 	xul.dll 	IPC::ParamTraits<mozilla::ipc::TransportDescriptor>::Write(IPC::Message*, mozilla::ipc::TransportDescriptor const&) 	ipc/glue/Transport_win.h
2 	xul.dll 	mozilla::ipc::ChannelOpened::ChannelOpened(mozilla::ipc::TransportDescriptor, unsigned long, IPCMessageStart, IPC::Message::PriorityValue) 	ipc/glue/ProtocolUtils.cpp
3 	xul.dll 	mozilla::ipc::Open(mozilla::ipc::PrivateIPDLInterface const&, mozilla::ipc::MessageChannel*, unsigned long, IPC::Channel::Mode, IPCMessageStart, IPCMessageStart) 	ipc/glue/ProtocolUtils.cpp
4 	xul.dll 	mozilla::ipc::PBackground::Open(mozilla::dom::PContentChild*) 	obj-firefox/ipc/ipdl/PBackground.cpp
5 	xul.dll 	`anonymous namespace'::ChildImpl::OpenProtocolOnMainThread(nsIEventTarget*) 	ipc/glue/BackgroundImpl.cpp
6 	xul.dll 	`anonymous namespace'::ChildImpl::GetOrCreateForCurrentThread(nsIIPCBackgroundChildCreateCallback*) 	ipc/glue/BackgroundImpl.cpp
7 	xul.dll 	mozilla::ipc::BackgroundChild::GetOrCreateForCurrentThread(nsIIPCBackgroundChildCreateCallback*) 	ipc/glue/BackgroundImpl.cpp
8 	xul.dll 	mozilla::dom::ContentChild::InitXPCOM() 	dom/ipc/ContentChild.cpp
9 	xul.dll 	mozilla::dom::ContentProcess::Init() 	dom/ipc/ContentProcess.cpp
DuplicateHandle on Windows can't just be asserted. Some antivirus software can make it fail. We need to annotate the crash to get more information about it.
Assignee: nobody → cyu
Depends on: 1258663
Crashes rose by 77% between 2016-03-24 and 2016-03-28 in beta.
Maire, did something change recently?
Flags: needinfo?(mreavy)
Priority: -- → P1
cpearce -- Is it likely that any of the recent GMP code changes uplifted to Beta (this started happening around the 24th) would cause this?  Can you take a look?

Billm -- Could this be a bug in the IPC code?  (see Comment 1)
Flags: needinfo?(wmccloskey)
Flags: needinfo?(mreavy)
Flags: needinfo?(cpearce)
Here's a list of the counts of the different paths crashing in mozilla::ipc::TransferHandleToProcess over the past 90 days.

34% are from GMP code, the rest elsewhere.

I don't think anything significant in the GMP code changed around the 24th, so I suspect this isn't caused by a change in GMP.

We did enable unencrypted MP4 decoding using the Adobe GMP in 46 before it hit beta, which is why we're seeing this on Windows XP.
Flags: needinfo?(cpearce)
I checked recent crash reports to see if there are some 3rd party software making DuplicateHandle() fail as in https://bugzilla.mozilla.org/show_bug.cgi?id=1253575#c20. Unlike bug 1253575, most crash reports in this bug don't have suspicious 3rd party dll injected.

In the attachment, we see crashes with protocol PGMPContent, PPluginModule and PBackground. PBackground takes the largest portion. So this is unlikely caused by GMP.

At least, we need to know the value of GetLastError() to know why we failed to transfer the handle as in bug 1258663.
Discussed this in triage: if this is caused by handle exhaustion, it should be seen primarily on session with high uptime.

If this is caused by 3rd-party injection, it will likely show up primarily in low-uptime sessions.
Flags: needinfo?(wmccloskey)
So far 2 reports with the new GetLastError annotation have shown up in nightly reports - 

https://crash-stats.mozilla.com/search/?product=Firefox&signature=~TransferHandleToProcess&version=48.0a1&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

One doesn't have the patch, the other has error code 87, which is "invalid parameter" and appears to happen during shutdown. This is odd in that GMP is initializing at the time.

The invalid param might be mDestinationProcessId implying the child process crashed or shutdown unexpectedly?

http://mxr.mozilla.org/mozilla-central/source/ipc/glue/Transport_win.h#42
http://mxr.mozilla.org/mozilla-central/source/ipc/glue/ProtocolUtils.cpp#266
Thread 0 is in shutdown and is waiting on child processes to terminate, so this looks like a shutdown race between two threads. One is trying to initialize the GMP, the other is terminating it.

mozilla::gmp::GeckoMediaPluginServiceParent::Observe(nsISupports*, char const*, char16_t const*)
nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*)
nsXREDirProvider::DoShutdown()
ScopedXPCOMStartup::~ScopedXPCOMStartup()

https://hg.mozilla.org/mozilla-central/annotate/9bd900888753/dom/media/gmp/GMPServiceParent.cpp#l429

cpearce posted a useful list of which modules get hit by this assuming they are all similar crash types.

https://bug1258317.bmoattachments.org/attachment.cgi?id=8736079
Not all of these are shutdown races. In a current search, 337 total reports, 55 are in shutdown.
So to summarize, it appears we have two groups here.

1) Parent side crashes during GMP init on a background thread, made up of two sub groups: 

a) 55 (18%) Crashes that happen during a shutdown race where the main thread is tearing down the GMP child process at the same time the child is initializing.

b) 38 (12%) A generic TransferHandleToProcess, mostly associated with GMP.

2) 244 (~66%) Failures to transfer a handle in the child, mostly during the init of PBackground::Open. (We don't have GetLastError data for these yet.)
Whiteboard: btpp-active
In addition to error code 87, we also see error code 5 like in this: https://crash-stats.mozilla.com/report/index/01205346-b0a1-4805-a8d9-29fcf2160412 but there I don't see suspicious dlls injected.
Blocks: 1263951
I split GMP related issues off to bug 1263951.
Summary: crash in mozilla::ipc::TransferHandleToProcess → crash in mozilla::ipc::TransferHandleToProcess (PBackground related)
One thing we should do file a bug on exposing 'SystemError' to supersearch so we can facet on it.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #12)
> In addition to error code 87, we also see error code 5 like in this:
> https://crash-stats.mozilla.com/report/index/01205346-b0a1-4805-a8d9-
> 29fcf2160412 but there I don't see suspicious dlls injected.

Some tests against arbitrary pids in DuplicateHandle() produced error 5 and 87. It's unclear when it's access denied and when it's invalid parameter.
See Also: → 1264244
There are several crashes related to PBackground on aurora, one with error 5: https://crash-stats.mozilla.com/report/index/aa44c15b-4495-41b2-ac73-fa34b2160413

There is LavasoftTcpService.dll loaded in the crashed process. It's likely the cause of the crash.
Given that we've seen several ERROR_ACCESS_DENIED, I plan to add more annotation about the target process with GetSecurityInfo().
This annotates the crash report with string representation of the security descriptor of the process handle when we failed to transfer the handle to the target process.
Attachment #8742319 - Flags: feedback?(gkrizsanits)
Depends on: 1265361
(In reply to Jim Mathies [:jimm] from comment #19)
> I see an even split between 87 (parameter is incorrect) and 5 (access
> denied) - 
> 
> https://crash-stats.mozilla.com/search/
> ?product=Firefox&signature=~TransferHandleToProcess&build_id=%3E2016040900000
> 0&_facets=signature&_facets=system_error&_columns=date&_columns=signature&_co
> lumns=product&_columns=version&_columns=build_id&_columns=platform#facet-
> system_error

Interesting, the 87 value correlates to GMP mostly. The 5 value is a mixture of signatures. Of course, I only have about 15 signatures to base this on.

mozilla::dom::ContentParent::RecvLoadPlugin
mozilla::gmp::GMPServiceParent::RecvLoadGMP
mozilla::ipc::PBackground::Open
Comment on attachment 8742319 [details] [diff] [review]
Annotating the crash report with GetSecurityInfo() of the process handle

Has a bug: string not terminated.
Attachment #8742319 - Attachment is obsolete: true
Attachment #8742319 - Flags: feedback?(gkrizsanits)
What data is contained in the "string representation of the security descriptor"? We need to make sure that cannot contain private/personal data without additional protections.
Flags: needinfo?(cyu)
https://reviewboard.mozilla.org/r/47447/#review44145

I noticed that this DuplicateHandle is _not_ the windows API but some helper introduced in bug 1119878 and implemented here: http://hg.mozilla.org/mozilla-central/annotate/ae7413abfa4d/ipc/glue/ProtocolUtils.cpp#l266 Could you also check which operation fails there exactly? I have the feeling that after the changes in bug 1119878 sometimes we get here when the process belonging to the process id died already, in which case trying to open it would give invalid value in most cases, in some cases if another process already took that id we could get access denied...

::: ipc/glue/ProtocolUtils.cpp:379
(Diff revision 1)
> +                      rv));
> +  }
> +
> +  LPTSTR secDescStr = nullptr;
> +  ULONG secDescStrLen = 0;
> +  if (::ConvertSecurityDescriptorToStringSecurityDescriptor(secDesc,

Do we have to call this function if rv!=ERROR_SUCCESS?

::: ipc/glue/ProtocolUtils.cpp:386
(Diff revision 1)
> +      char* output = (char*) malloc(utf8Len + 1);
> +      wcstombs(output, secDescStr, utf8Len);
> +      output[utf8Len] = '\0';
> +      CrashReporter::AnnotateCrashReport(
> +        NS_LITERAL_CSTRING("ExtraSystemError"),
> +        nsPrintfCString("Security Descriptor: %s", output));
> +      free(output);

Could you please use nsCString here instead of malloc/free?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #23)
> What data is contained in the "string representation of the security
> descriptor"? We need to make sure that cannot contain private/personal data
> without additional protections.

I don't think it contains any personal data but, I'm not too familiar with it and it is always worth to double check on these things: https://msdn.microsoft.com/en-us/library/windows/desktop/aa379570(v=vs.85).aspx
https://crash-stats.mozilla.com/search/?product=Firefox&signature=~TransferHandleToProcess&build_id=%3E20160409000000&system_error=5&_facets=signature&_facets=system_error&_facets=proto_signature&_facets=version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature

FWIW, I did a quick survey of the available crashes looking 3rd party modules. They do show up in a few of the reports but not all, so I don't think this correlates to 3rd party virus software.

The parameter incorrect errors are entirely GMP related (bug 1263951). The access denied errors are primarily PBackground, although we get these failures in a mixture of child processes. I think this is due to PBackground always getting created with content so the failure tends to show up with PBackground calls in the stack. The failure though probably has nothing to do with PBackground.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #23)
> What data is contained in the "string representation of the security
> descriptor"? We need to make sure that cannot contain private/personal data
> without additional protections.

An example of the security descriptor string representation, it's from the crash dump on my workstation by removing DuplicateHandle privilege from the content process:

ExtraSystemError=Owner: S-1-5-21-2078660737-3082294025-695398301-1002, Security Descriptor: D:(A;;0x1fffff;;;RC)(A;;0xe1fbf;;;S-1-5-21-2078660737-3082294025-695398301-1002)(A;;0x1fffff;;;SY)(A;;0x121411;;;S-1-5-5-0-484634)

(I'll add owner security ID to the annotation in the next revision in case the ACL has many entries and it can't be easily seen which entry takes effect).

Where each parenthesis represents an access control entry. "RC" and "SY" are allowed 0x1fffff privileges to the process and the active user is allowed 0xe1fbf privileges (0x40, which corresponds to PROCESS_DUPLICATE_HANDLE) is off. The last entry has a very low privilege.

The potential personal data are principals names, but the real names are hidden behind security IDs and are not revealed in the annotation.
Flags: needinfo?(cyu)
Comment on attachment 8742750 [details]
MozReview Request: Bug 1258317 - Part 1: Annotate the crash report with process information on failure to transfer an IPC transport to another process. r?gabor

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/47447/diff/1-2/
https://reviewboard.mozilla.org/r/47447/#review44145

I'll check which operation failed. It's also surprising (from the PoV of \*nix users) that Windwos PIDs are recycled quickly. This sounds that it's a bad idea that we use PID to identify a process istead of using process handle, and we may need to revisit the decision of switching to PID instead of process handle.
Comment on attachment 8742750 [details]
MozReview Request: Bug 1258317 - Part 1: Annotate the crash report with process information on failure to transfer an IPC transport to another process. r?gabor

https://reviewboard.mozilla.org/r/47447/#review44445

Thanks this looks great. Though I would still like to see which system call fails under http://hg.mozilla.org/mozilla-central/annotate/ae7413abfa4d/ipc/glue/ProtocolUtils.cpp#l266 ... But we can do it as followup, let's land this with the typo below fixed. But before you do that let's give Benjamin a chance to respond about the privacy concerns.

::: ipc/glue/ProtocolUtils.cpp:379
(Diff revisions 1 - 2)
>                        rv));
> +    return;
> +  }
> +
> +  LPTSTR ownerSidStr = nullptr;
> +  nsString annotation{};

This does not parse to me... what is {} here?
Attachment #8742750 - Flags: review+
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #30)
> But before you do that let's give Benjamin a
> chance to respond about the privacy concerns.
Flags: needinfo?(benjamin)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #29)
> https://reviewboard.mozilla.org/r/47447/#review44145
> 
> I'll check which operation failed. It's also surprising (from the PoV of
> \*nix users) that Windwos PIDs are recycled quickly. This sounds that it's a
> bad idea that we use PID to identify a process istead of using process
> handle, and we may need to revisit the decision of switching to PID instead
> of process handle.

It's just a theory really... it would be surprising to me as well, but Windows APIs
have their quirks :)
https://reviewboard.mozilla.org/r/47447/#review44449

::: ipc/glue/ProtocolUtils.cpp:319
(Diff revision 2)
>  #elif defined(OS_POSIX)
>    error = errno;
>  #endif
>    if (error) {
> -    CrashReporter::AnnotateCrashReport(NS_LITERAL_CSTRING("SystemError"),
> +    CrashReporter::AnnotateCrashReport(
> +      NS_LITERAL_CSTRING("SystemError"),

Can we update these annotation names so that they match the other ipc annotations in this file?

IPCSystemError
IPCExtraSystemError
IPCFatalErrorProtocol
IPCFatalErrorMsg

::: ipc/glue/ProtocolUtils.cpp:367
(Diff revision 2)
> +                               &ownerSid,
> +                               nullptr,
> +                               nullptr,
> +                               nullptr,
> +                               &secDesc);
> +  if (rv) {

nit, check against ERROR_SUCCESS here.
https://reviewboard.mozilla.org/r/47447/#review44445

> This does not parse to me... what is {} here?

This is C++11 style constructor, equivalent to nsString annotation().
https://reviewboard.mozilla.org/r/47447/#review44449

> Can we update these annotation names so that they match the other ipc annotations in this file?
> 
> IPCSystemError
> IPCExtraSystemError
> IPCFatalErrorProtocol
> IPCFatalErrorMsg

Sure. But we also need to update Soccorro, and there will be a time frame where both SystemError and IPCSystemError annotations come from end users.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #35)
> https://reviewboard.mozilla.org/r/47447/#review44449
> 
> > Can we update these annotation names so that they match the other ipc annotations in this file?
> > 
> > IPCSystemError
> > IPCExtraSystemError
> > IPCFatalErrorProtocol
> > IPCFatalErrorMsg
> 
> Sure. But we also need to update Soccorro, and there will be a time frame
> where both SystemError and IPCSystemError annotations come from end users.

socorro updates are easy, lets try to get this uplifted to beta asap so we don't have to worry about the differences there.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #31)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #30)
> > But before you do that let's give Benjamin a
> > chance to respond about the privacy concerns.

I think this should be ok. For reference - 

https://msdn.microsoft.com/en-us/library/windows/desktop/aa379570(v=vs.85).aspx

My only concern is the inclusion of the owner and group sids but from what I'm finding on the net, these aren't considered security sensitive. We do not want this information exposed through the socorro web interface though, and should probably file a bug on removing this debug code once we resolve the issue.
Comment on attachment 8742750 [details]
MozReview Request: Bug 1258317 - Part 1: Annotate the crash report with process information on failure to transfer an IPC transport to another process. r?gabor

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/47447/diff/2-3/
Comment on attachment 8742750 [details]
MozReview Request: Bug 1258317 - Part 1: Annotate the crash report with process information on failure to transfer an IPC transport to another process. r?gabor

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/47447/diff/3-4/
Comment on attachment 8742750 [details]
MozReview Request: Bug 1258317 - Part 1: Annotate the crash report with process information on failure to transfer an IPC transport to another process. r?gabor

Approval Request Comment
[Feature/regressing bug #]: No bug. e10s internals asserting that system calls always succeed. 
[User impact if declined]: Difficulties in proceeding with some crashes that are seen by the users but hard to reproduce/diagnose/debug.
[Describe test coverage new/current, TreeHerder]: Just landed/tested on m-c.
[Risks and why]: Pretty low since the patch contains only diagnostic code during crashes.
[String/UUID change made/needed]: none
Attachment #8742750 - Flags: approval-mozilla-aurora?
Comment on attachment 8742750 [details]
MozReview Request: Bug 1258317 - Part 1: Annotate the crash report with process information on failure to transfer an IPC transport to another process. r?gabor


Approval Request Comment
[Feature/regressing bug #]: No bug. e10s internals asserting that system calls always succeed. 
[User impact if declined]: Difficulties in proceeding with some crashes that are seen by the users but hard to reproduce/diagnose/debug.
[Describe test coverage new/current, TreeHerder]: Just landed/tested on m-c.
[Risks and why]: Pretty low since the patch contains only diagnostic code during crashes.
[String/UUID change made/needed]: none
Attachment #8742750 - Flags: approval-mozilla-beta?
> ExtraSystemError=Owner: S-1-5-21-2078660737-3082294025-695398301-1002,
> Security Descriptor:
> D:(A;;0x1fffff;;;RC)(A;;0xe1fbf;;;S-1-5-21-2078660737-3082294025-695398301-
> 1002)(A;;0x1fffff;;;SY)(A;;0x121411;;;S-1-5-5-0-484634)

Are these SIDs persistent across runs of the browser? If so, we should:

1) treat this as PII and should not display it publicly in crash-stats. It should only be searchable/displayed to employees who have the special permissions and have logged in
2) we should treat this as temporary data collection and remove it when we no longer need it

data-review=me with that understood

Please coordinate with peterbe for the appropriate crash-stats handling of this data (it's private by default, so we just need to make sure that we don't open it up publicly by accident)
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #43)
> > ExtraSystemError=Owner: S-1-5-21-2078660737-3082294025-695398301-1002,
> > Security Descriptor:
> > D:(A;;0x1fffff;;;RC)(A;;0xe1fbf;;;S-1-5-21-2078660737-3082294025-695398301-
> > 1002)(A;;0x1fffff;;;SY)(A;;0x121411;;;S-1-5-5-0-484634)
> 
> Are these SIDs persistent across runs of the browser? If so, we should:

yes.

> 1) treat this as PII and should not display it publicly in crash-stats. It
> should only be searchable/displayed to employees who have the special
> permissions and have logged in
> 2) we should treat this as temporary data collection and remove it when we
> no longer need it

Yep - completely agree here. I'll file a follow up to remove it.
Blocks: 1266440
Comment on attachment 8742750 [details]
MozReview Request: Bug 1258317 - Part 1: Annotate the crash report with process information on failure to transfer an IPC transport to another process. r?gabor

Too late for beta 46 and we aren't shipping e10s in 46.
Attachment #8742750 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
https://hg.mozilla.org/mozilla-central/rev/4091419d6b1c
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Keywords: leave-open
Comment on attachment 8742750 [details]
MozReview Request: Bug 1258317 - Part 1: Annotate the crash report with process information on failure to transfer an IPC transport to another process. r?gabor

New annotations in Crash reports, Aurora47+
Attachment #8742750 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
No longer blocks: 1263951
Depends on: 1267222
From the annotations of several new crashes, we see that the user has *all* access rights (0x1fffff) to the target process. The process handle has DACL like "D:(A;;0x1fffff;;;(owner sid))(A;;0x1fffff;;;SY)...". So the system doesn't block duplicating the handle to the parent process, and we need to direct our investigation to the sandbox broker.
Bowen, any insight here?
Flags: needinfo?(bobowen.code)
(In reply to Jim Mathies [:jimm] from comment #52)
> Bowen, any insight here?

Maybe this just shouldn't be a release assert.
I think we agreed this when it was originally reviewed, but then there were other issues and I ended up re-writing things a bit.
I put my patch up for reference and that's the patch that was landed, but obviously missed this point.

We'd need to check that it handles the failure gracefully and doesn't just blow up somewhere else.
Flags: needinfo?(bobowen.code)
(In reply to Bob Owen (:bobowen) from comment #53)
> (In reply to Jim Mathies [:jimm] from comment #52)
> > Bowen, any insight here?
> 
> Maybe this just shouldn't be a release assert.
> I think we agreed this when it was originally reviewed, but then there were
> other issues and I ended up re-writing things a bit.
> I put my patch up for reference and that's the patch that was landed, but
> obviously missed this point.
> 
> We'd need to check that it handles the failure gracefully and doesn't just
> blow up somewhere else.

Well, I'm more curious about the cause of the access denied errors, regardless of how we handle them. It appears the user has access to this handle so something else must be getting in the way. I was wondering if the sandbox might be doing something here. (Of course, this happen on beta so, maybe that is unlikely.)
Looks like can avoid asserting on 87 errors, those are invalid parameter errors implying the process id is invalid (closed).
(In reply to Jim Mathies [:jimm] from comment #55)
> Looks like can avoid asserting on 87 errors, those are invalid parameter
> errors implying the process id is invalid (closed).

Here's a good example - 

https://crash-stats.mozilla.com/report/index/db43c1b0-9c27-4db6-935c-796dc2160501

Note ShutdownProgress indicates we are shutting down when this happens.
See Also: → 1263951
I think we should change [1] to:
  nsAutoHandle targetProcess(::OpenProcess(PROCESS_DUP_HANDLE, FALSE, aTargetProcessId));

It would also be useful to know for sure whether it is the OpenProcessHandle that is failing or the "real" ::DuplicateHandle.

Given that we don't get access denied when we open the process for crash annotation and it appears we have full access, maybe it is the pipe handle that has somehow become invalid.
Although I would have thought this would give us ERROR_INVALID_HANDLE.

[1] https://dxr.mozilla.org/mozilla-central/rev/77cead2cd20300623eea2416bc9bce4d5021df09/ipc/glue/ProtocolUtils.cpp#296
Because of sandboxing, DuplicateHandle() actually happens in:
1. https://dxr.mozilla.org/mozilla-central/rev/77cead2cd20300623eea2416bc9bce4d5021df09/security/sandbox/chromium/sandbox/win/src/handle_dispatcher.cc#53 and 
2. https://dxr.mozilla.org/mozilla-central/rev/77cead2cd20300623eea2416bc9bce4d5021df09/security/sandbox/chromium/sandbox/win/src/handle_policy.cc#82

where ERROR_ACCESS_DENIED can be the result of
- the 1st DuplicateHandle() call fails in sandbox::HandleDispatcher::DuplicateHandleProxy()
- policy evaluation result isn't ASK_BROKER in andbox::HandlePolicy::DuplicateHandleProxyAction()
- the 2nd DuplicateHandle() call fails.

The above failures are in the parent process, but the crash is in the child process. We need to find a way to pass this information to child to know it fails.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #58)
> Because of sandboxing, DuplicateHandle() actually happens in:

The content process is only sandboxed on Nightly at the moment.
Why use base::OpenProcessHandle here?

https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/process_util_win.cc#71

We're asking for:

PROCESS_DUP_HANDLE |
PROCESS_TERMINATE |
PROCESS_QUERY_INFORMATION |
PROCESS_VM_READ |
SYNCHRONIZE

When all we need is PROCESS_DUP_HANDLE.
(In reply to Bob Owen (:bobowen) from comment #59)
> (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #58)
> > Because of sandboxing, DuplicateHandle() actually happens in:
> 
> The content process is only sandboxed on Nightly at the moment.

True, and this is showing up in beta right now, so we can rule the sandbox code path out.

https://crash-stats.mozilla.com/search/?product=Firefox&signature=%3Dmozilla%3A%3Aipc%3A%3ATransferHandleToProcess&version=47.0b1&dom_ipc_enabled=!__null__&process_type=content&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=ipc_system_error#crash-reports
(In reply to Jim Mathies [:jimm] from comment #60)
> Why use base::OpenProcessHandle here?
> 
> https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/
> process_util_win.cc#71
> 
> We're asking for:
> 
> PROCESS_DUP_HANDLE |
> PROCESS_TERMINATE |
> PROCESS_QUERY_INFORMATION |
> PROCESS_VM_READ |
> SYNCHRONIZE
> 
> When all we need is PROCESS_DUP_HANDLE.

base::OpenProcessHandle doesn't grant PROCESS_VM_READ permission. Even we use base::OpenPrivilegedProcessHandle() that includes PROCESS_VM_READ, we still expect to see the call to succeed given that the owner has 0x1fffff privileges.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #62)

> base::OpenProcessHandle doesn't grant PROCESS_VM_READ permission. Even we
> use base::OpenPrivilegedProcessHandle() that includes PROCESS_VM_READ, we
> still expect to see the call to succeed given that the owner has 0x1fffff
> privileges.

While this is true, it's possible that something else is preventing the open when these permissions are requested.
So, it seems sensible to eliminate this by just using the standard Windows call and PROCESS_DUP_HANDLE.
(In reply to Bob Owen (:bobowen) from comment #63)
> (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #62)
> 
> > base::OpenProcessHandle doesn't grant PROCESS_VM_READ permission. Even we
> > use base::OpenPrivilegedProcessHandle() that includes PROCESS_VM_READ, we
> > still expect to see the call to succeed given that the owner has 0x1fffff
> > privileges.
> 
> While this is true, it's possible that something else is preventing the open
> when these permissions are requested.
More specifically, some 3rd-party security magic?

> So, it seems sensible to eliminate this by just using the standard Windows
> call and PROCESS_DUP_HANDLE.
If it's something else is failing the OpenProcess() call, I guess PROCESS_DUP_HANDLE will be the target and we still fail when only requesting this privilege because PROCESS_DUP_HANDLE can be used to gain other privileges to another process by duplicating the pseudo handle.
Here's my proposal: let's try OpenProcess() with only PROCESS_DUP_HANDLE and add annotations to see if it really crashes in OpenProcess(). If this doesn't work, then we open the named pipe from the other side and don't do cross-process DuplicateHandle().
Lets also make invalid parameter errors non-fatal.
Attachment #8749613 - Flags: review?(gkrizsanits)
Comment on attachment 8749613 [details]
MozReview Request: Bug 1258317 - Part 2: Remove unused privilege in mozilla::ipc::TransferHandleTorProcess(). r?gabor

https://reviewboard.mozilla.org/r/51061/#review48015

::: ipc/glue/ProtocolUtils.cpp:40
(Diff revision 1)
>  using base::ProcessId;
>  
>  namespace mozilla {
> +
> +#if defined(MOZ_CRASHREPORTER) && defined(XP_WIN)
> +// Generate RAII classes for LPTSTR and PSECURITY_DESCRIPTOR.

Thanks for this!

::: ipc/glue/ProtocolUtils.cpp:309
(Diff revision 1)
> -  ScopedProcessHandle targetProcess;
> -  if (!base::OpenProcessHandle(aTargetProcessId, &targetProcess.rwget())) {
> -    return false;
> +  ScopedProcessHandle targetProcess(OpenProcess(PROCESS_DUP_HANDLE,
> +                                                  FALSE,
> +                                                  aTargetProcessId));

Indentation seems to be off here.

::: ipc/glue/ProtocolUtils.cpp:312
(Diff revision 1)
> +  if (!targetProcess) {
> +    AnnotateSystemError();
> +    AnnotateProcessInformation(aTargetProcessId);
> +    MOZ_CRASH("No access to the process.");
>    }

I agree with Jim and Bob that we should handle the invalid parameter case more gracefully. And do something about that release assert in cases like https://crash-stats.mozilla.com/report/index/db43c1b0-9c27-4db6-935c-796dc2160501#allthreads , and MOZ_CRASH-ing here will kind of make that impossible...
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #68)
> I agree with Jim and Bob that we should handle the invalid parameter case
> more gracefully. And do something about that release assert in cases like
> https://crash-stats.mozilla.com/report/index/db43c1b0-9c27-4db6-935c-
> 796dc2160501#allthreads , and MOZ_CRASH-ing here will kind of make that
> impossible...

While handling nonfatal errors is a must-have for us, it's scope is larger than this bug. If you'd agree, I prefer that we focus this bug on the DuplicateHandle() failure and open another bug for failing the IPC operations more gracefully.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #69)
> While handling nonfatal errors is a must-have for us, it's scope is larger
> than this bug. If you'd agree, I prefer that we focus this bug on the
> DuplicateHandle() failure and open another bug for failing the IPC
> operations more gracefully.

Alright, if you think it's a lot of work, let's do that in a follow-up bug.

> ::: ipc/glue/ProtocolUtils.cpp:312
> (Diff revision 1)
> > +  if (!targetProcess) {
> > +    AnnotateSystemError();
> > +    AnnotateProcessInformation(aTargetProcessId);
> > +    MOZ_CRASH("No access to the process.");
> >    }
> 

Only reason then for me to hesitate about the r+ is that I'm worried that MOZ_CRASH here will introduce new crashes because it is used by CreateTransport [1] and there it has been not fatal so far. What do you think?

http://mxr.mozilla.org/mozilla-central/source/ipc/glue/Transport_win.cpp#43
Depends on: 1271601
Blocks: 1271601
No longer depends on: 1271601
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #70)
> Only reason then for me to hesitate about the r+ is that I'm worried that
> MOZ_CRASH here will introduce new crashes because it is used by
> CreateTransport [1] and there it has been not fatal so far. What do you
> think?
> 
> http://mxr.mozilla.org/mozilla-central/source/ipc/glue/Transport_win.cpp#43

CreateTransport() only duplicate the handle in the same process and doesn't involve cross-process handle duplication. It won't fail because it uses the pseudo process handle from GetCurrentProcess() [1], which has all privileges

I am now working on the fallback mechanism that if we fail in cross-process DuplicateHandle(), we try harder by letting the other side know this and opening the pipe from the other side, which involves lower privilege and is less likely to fail.

[1] https://dxr.mozilla.org/mozilla-central/rev/674a552743785c28c75866969aad513bd8eaf6ae/ipc/glue/ProtocolUtils.cpp#278
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #71)

> CreateTransport() only duplicate the handle in the same process and doesn't
> involve cross-process handle duplication. It won't fail because it uses the
> pseudo process handle from GetCurrentProcess() [1], which has all privileges

It's also used for shared memory.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #71)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #70)
> > Only reason then for me to hesitate about the r+ is that I'm worried that
> > MOZ_CRASH here will introduce new crashes because it is used by
> > CreateTransport [1] and there it has been not fatal so far. What do you
> > think?
> > 
> > http://mxr.mozilla.org/mozilla-central/source/ipc/glue/Transport_win.cpp#43
> 
> CreateTransport() only duplicate the handle in the same process and doesn't
> involve cross-process handle duplication. It won't fail because it uses the
> pseudo process handle from GetCurrentProcess() [1], which has all privileges

What about the INVALID_HANDLE_VALUE case? CreateTransport does an if (!serverPipe) check on the handle. Could you make sure we check for INVALID_HANDLE_VALUE there before trying to call DuplicateHandle? r=me with that fixed.

> 
> I am now working on the fallback mechanism that if we fail in cross-process
> DuplicateHandle(), we try harder by letting the other side know this and
> opening the pipe from the other side, which involves lower privilege and is
> less likely to fail.
> 
> [1]
> https://dxr.mozilla.org/mozilla-central/rev/
> 674a552743785c28c75866969aad513bd8eaf6ae/ipc/glue/ProtocolUtils.cpp#278

Sounds great, thanks.
(In reply to Bob Owen (:bobowen) from comment #72)
> (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #71)
> 
> > CreateTransport() only duplicate the handle in the same process and doesn't
> > involve cross-process handle duplication. It won't fail because it uses the
> > pseudo process handle from GetCurrentProcess() [1], which has all privileges
> 
> It's also used for shared memory.

Anything we should concerned about there?
Flags: needinfo?(bobowen.code)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #74)

> > It's also used for shared memory.
> 
> Anything we should concerned about there?

Sorry, shouldn't have been so brief.
Just a concern that we will start to crash if this is ever failing for the shared memory set up.
Flags: needinfo?(bobowen.code)
Actually I looked into this today, and since the last 10 or so crashes were all access denied errors and they all had sandboxbroker.dll loaded I looked into it and found that sandboxing does set that error manually

here: http://mxr.mozilla.org/mozilla-central/source/security/sandbox/chromium/sandbox/win/src/handle_policy.cc#69
and then we use that value to call :SetLastError with: http://mxr.mozilla.org/mozilla-central/ident?i=win32_result

This is a bit of a foot gun...
Bob: what does (!BrokerServicesBase::GetInstance()->IsActiveTarget(target_process_id), and how can we fail more gracefully here?

Jim: is sandboxing something e10s release is relying on? can/should we filter out users with sandboxing on from the crashes?
Flags: needinfo?(jmathies)
Flags: needinfo?(bobowen.code)
Comment on attachment 8749613 [details]
MozReview Request: Bug 1258317 - Part 2: Remove unused privilege in mozilla::ipc::TransferHandleTorProcess(). r?gabor

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51061/diff/1-2/
Attachment #8749613 - Flags: review?(gkrizsanits)
Comment on attachment 8749613 [details]
MozReview Request: Bug 1258317 - Part 2: Remove unused privilege in mozilla::ipc::TransferHandleTorProcess(). r?gabor

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51061/diff/2-3/
https://reviewboard.mozilla.org/r/51061/#review48015

> I agree with Jim and Bob that we should handle the invalid parameter case more gracefully. And do something about that release assert in cases like https://crash-stats.mozilla.com/report/index/db43c1b0-9c27-4db6-935c-796dc2160501#allthreads , and MOZ_CRASH-ing here will kind of make that impossible...

Revision 3 doesn't MOZ_CRASH() here and let the caller handle the return value. I'll handle the case of invalid handle separately in another bug. The permission denied case will be handled in part 3 of this bug.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #80)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #76)
> > Jim: is sandboxing something e10s release is relying on? can/should we
> > filter out users with sandboxing on from the crashes?
> 
> From comment #59, only nightly has content sandboxing on. And the latest
> several crashes are on 47 so the aceess denied errors are from sandboxing.

Are not from sandboxing.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #76)
> Jim: is sandboxing something e10s release is relying on? can/should we
> filter out users with sandboxing on from the crashes?

No it does not block e10s. The Windows sb will roll out a couple releases after e10s goes live.

I'm fine with splitting sandbox related crashes out into a separate bug. If we do that please add |sb?| to the whiteboard.
Flags: needinfo?(jmathies)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #76)
> Actually I looked into this today, and since the last 10 or so crashes were
> all access denied errors and they all had sandboxbroker.dll loaded I looked
> into it and found that sandboxing does set that error manually
> 
> here:
> http://mxr.mozilla.org/mozilla-central/source/security/sandbox/chromium/
> sandbox/win/src/handle_policy.cc#69
> and then we use that value to call :SetLastError with:
> http://mxr.mozilla.org/mozilla-central/ident?i=win32_result

We shouldn't be hitting this code for the content process apart from on Nightly, as it isn't sandboxed.
sandboxbroker.dll will still be loaded, because we sandbox GMP and 64-bit NPAPI.

Also, if I understand correctly we're duplicating to the broker/main process here, so even if we were hitting this function then |target_process_id != ::GetCurrentProcessId()| just above would be false.

> This is a bit of a foot gun...
> Bob: what does
> (!BrokerServicesBase::GetInstance()->IsActiveTarget(target_process_id), and
> how can we fail more gracefully here?

This is checking that this is a valid child target for a sandboxed child to duplicate a handle into.
As I say, we wouldn't hit this for the crashes in this bug, but this is used for duplicating shared memory from NPAPI to content process. This is chromium code, so we wouldn't want to change it, but ERROR_ACCESS_DENIED seems like the correct thing anyway.
Flags: needinfo?(bobowen.code)
Comment on attachment 8749613 [details]
MozReview Request: Bug 1258317 - Part 2: Remove unused privilege in mozilla::ipc::TransferHandleTorProcess(). r?gabor

https://reviewboard.mozilla.org/r/51061/#review49998
Attachment #8749613 - Flags: review?(gkrizsanits) → review+
(In reply to Bob Owen (:bobowen) from comment #84)
> We shouldn't be hitting this code for the content process apart from on
> Nightly, as it isn't sandboxed.
> sandboxbroker.dll will still be loaded, because we sandbox GMP and 64-bit
> NPAPI.

Sorry, I'm not familiar with how this is enabled/disabled. So you're saying
that it is NOT possible to turn it on with some config flags on beta and
that MOZ_SANDBOX is not even defined so we cannot hit this code. I kind of
got confused that because to me it seemed like MOZ_SANDBOX is defined since
it was merged with the GMP flag... but never mind then.

> 
> Also, if I understand correctly we're duplicating to the broker/main process
> here, so even if we were hitting this function then |target_process_id !=
> ::GetCurrentProcessId()| just above would be false.

True but there is another if branch a few lines above that returns the same error.
Either way, that error should be overwritten by the next operation so I was wrong.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #86)
> (In reply to Bob Owen (:bobowen) from comment #84)
> > We shouldn't be hitting this code for the content process apart from on
> > Nightly, as it isn't sandboxed.
> > sandboxbroker.dll will still be loaded, because we sandbox GMP and 64-bit
> > NPAPI.
> 
> Sorry, I'm not familiar with how this is enabled/disabled. So you're saying
> that it is NOT possible to turn it on with some config flags on beta and
> that MOZ_SANDBOX is not even defined so we cannot hit this code. I kind of
> got confused that because to me it seemed like MOZ_SANDBOX is defined since
> it was merged with the GMP flag... but never mind then.

MOZ_SANDBOX is defined if either MOZ_CONTENT_SANDBOX or MOZ_GMP_SANDBOX is defined; so, yes, it's defined on beta/release.  But, if I understand correctly, what's at issue here is the sandbox dynamically intercepting calls to system interfaces like DuplicateHandle, which depends on whether sandboxing is enabled for that specific process, not (directly) on preprocessor defines.
This is the unpolished part 3 patch that implements fallback when we failed in cross-process DuplicateHandle(). This patch is far uglier than I originally thought:

* We need a busy loop with WaitNamedPipe() until the other side has created the named pipe and calls ConnectNamedPipe().
* const_cast used in ParamTraits<mozilla::ipc::TransportDescriptor>::Write() because the caller needs to know whether we failed in DuplicateHandle().
Now that Bob has a patch for bug 1271601 that pulls the handle from the parent process, we may land that 1271601 to see if the crash is fixed. If not, there is an alternative:

Instead of creating the server pipe in the child process and send the pipe to parent like PBackground, we reverse the direction of the pipe that the opener always uses the server pipe and have the other side always open the pipe.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #91)
> Now that Bob has a patch for bug 1271601 that pulls the handle from the
> parent process, we may land that 1271601 to see if the crash is fixed.

I'm leaning toward that. Let's uplift that patch first and see if the issue is fixed and then we can get back to this patch if that is OK with you.

Also, could you please uplift the previous patch as far as we can?
Flags: needinfo?(cyu)
Comment on attachment 8749613 [details]
MozReview Request: Bug 1258317 - Part 2: Remove unused privilege in mozilla::ipc::TransferHandleTorProcess(). r?gabor

Approval Request Comment
[Feature/regressing bug #]: No bug. e10s internals asserting that system calls always succeed. 
[User impact if declined]: Difficulties in proceeding with some crashes that are seen by the users but hard to reproduce/diagnose/debug.
[Describe test coverage new/current, TreeHerder]: Just landed/tested on m-c.
[Risks and why]: Pretty low since the patch contains only diagnostic code during crashes.
[String/UUID change made/needed]: none
Flags: needinfo?(cyu)
Attachment #8749613 - Flags: approval-mozilla-beta?
Attachment #8749613 - Flags: approval-mozilla-aurora?
Comment on attachment 8749613 [details]
MozReview Request: Bug 1258317 - Part 2: Remove unused privilege in mozilla::ipc::TransferHandleTorProcess(). r?gabor

Anything to help you diagnostic this crash, taking it in aurora. Ritu will make the call for beta.
Attachment #8749613 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8749613 [details]
MozReview Request: Bug 1258317 - Part 2: Remove unused privilege in mozilla::ipc::TransferHandleTorProcess(). r?gabor

At this point, I would like to only accept uplifts for critical recent regressions, severe stability, security, perf, mlk issues only. This is to minimize code churn and to ensure we ship a high-quality Fx47 in 2 weeks.

Since this is e10s related, this does not meet the Beta47 uplift bar.
Attachment #8749613 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
We've had two recent landings for this, one here on 5-24 and bug 1271601 landed 5-27.

Nightly: last occurrence was in a 5-17 build
Aurora: last occurrence was in a 5-21 build

Lets keep this open until 48 merges to beta to see. So far things are looking good.

https://crash-stats.mozilla.com/search/?product=Firefox&version=48.0a2&signature=~TransferHandleToProcess&date=%3E2016-05-01&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports
Flags: needinfo?(cyu)
No crash after on 48b. Given the frequency of crashes on beta and the time 48 merges to beta, we have confidence that bug 1271601 this crash is fixed in bug 1271601. Close this bug as fixed.
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Flags: needinfo?(cyu)
Resolution: --- → FIXED
Just noticed this by accident. This patch uses MOZ_CRASH_REPORTER which doesn't exist. It's MOZ_CRASHREPORTER.
Flags: needinfo?(cyu)
Thanks for catching this. Bug 1295544 is filed for fixing that.
Flags: needinfo?(cyu)
See Also: → 1295544
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.