Closed Bug 1259480 Opened 8 years ago Closed 8 years ago

[e10s] FatalError crash in PContentChild::Read from PContentChild::SendReadPrefsArray

Categories

(Core :: DOM: Content Processes, defect, P1)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
e10s + ---
firefox47 --- affected
firefox48 --- affected
firefox49 --- affected
firefox50 --- affected

People

(Reporter: benjamin, Unassigned)

References

Details

(Whiteboard: btpp-active, e10st?)

Crash Data

Priority: -- → P1
Depends on: 1260270
#19 on the e10s topcrash list.
Blocks: e10s-crashes
This didn't show up in the top 50 in the apz 46 experiment.
(In reply to Jim Mathies [:jimm] from comment #2)
> This didn't show up in the top 50 in the apz 46 experiment.

My mistake, this was #21 in the content process list. It looks like it's startup-related, but not all of the signatures originate from SendReadPrefsArray. The startup crashes all do, though.

This one might be pretty easy to track down. Kanru, do you have anyone who's free to take a look at this?

https://docs.google.com/spreadsheets/d/17P0SBGcgrB-fFycCh4sA8KeK719-srjWEPiuvfcr6y8/edit#gid=777801084
Flags: needinfo?(kchen)
I can't view that spreadsheet, FWIW, even with my MoCo account.
This is an odd one. We're attempting to deserialize what could be a very large array of data that contains the parent's preferences. 

child send:
http://mxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#515

parent side:
http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#2678
http://mxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#394

We're dealing with user prefs here; I wonder if we should have better error checking on the parent side.
In a fairly clean profile, I see this preferences stuff generate a pretty big IPC message, doubling its way up to 262KB. Maybe for somebody with a ton of preferences it could be even larger, and thus be a source of OOMs for people low on memory? We probably need to wait until Kan-ru's patch in bug 1258312 filters out to a beta experiment. This is one of many deserialization crashes.
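As a rough illustration of that doubling (the 4 KB starting capacity below is an assumption, not the actual Pickle segment size), the allocator has to satisfy a chain of progressively larger buffers for this one message, which is why a machine that's already low on memory could fail partway up:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  size_t capacity = 4 * 1024;        // assumed starting capacity
  const size_t payload = 262144;     // ~262KB, the observed message size
  while (capacity < payload) {
    capacity *= 2;                   // doubling growth: 8K, 16K, ..., 256K
    std::printf("grow buffer to %zu bytes\n", capacity);
  }
  return 0;
}
```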
See Also: → 1258312
Cervantes, can you take a look?
Flags: needinfo?(kchen) → needinfo?(cyu)
I'll take a look.
Assignee: nobody → cyu
Flags: needinfo?(cyu)
Whiteboard: btpp-active
Signature is likely to be changed to OOM by bug 1258312 but I think we still need to fix the large preferences over IPC issue. We will need to update the signature once we figure how to collect that for bug 1258312.
Blocks: e10s-oom
No longer blocks: e10s-crashes
Blocks: 1262918
No longer blocks: e10s-oom
From this report https://crash-stats.mozilla.com/report/index/9f96e936-2ecd-4588-a575-e35af2160418 the build already has the fix for bug 1258312, but we still see it crash while deserializing the prefs array. This doesn't look like an OOM. The error message in FatalError() is "unknown union type", that is, MaybePrefValue is neither TPrefValue nor Tnull_t. It's still unclear how such an unexpected union value got past the parent's checks.
I sampled several dumps and saw the union type field holding values like 206424 or 0, and the prefs array having length values > 2000, which is unlikely because, judging from the crash reports, there aren't many add-ons installed. It looks like the IPC message was corrupted.
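For reference, this is roughly the shape of the generated union deserialization that hits the error (a heavily simplified sketch, not the actual IPDL codegen output; the tag enumerator values and function name here are illustrative only):

```cpp
#include <cstdint>

// Simplified stand-in for the generated MaybePrefValue reader: the type tag
// is read from the message first, only the two known variants are accepted,
// and anything else falls through to the "unknown union type" FatalError.
enum MaybePrefValueType : int32_t { TPrefValue = 1, Tnull_t = 2 };

bool ReadMaybePrefValue(int32_t aTag /* tag already read from the message */) {
  switch (aTag) {
    case TPrefValue:
      // ... read the PrefValue payload ...
      return true;
    case Tnull_t:
      // ... nothing further to read for null_t ...
      return true;
    default:
      // Tags like 0 or 206424 (as seen in the dumps) land here, and the
      // child aborts with FatalError("unknown union type").
      return false;
  }
}
```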
This isn't quite the same issue, but I found this crash on beta 47. It is an OOM while trying to send a prefs array, and the OOM allocation size is around 220MB: https://crash-stats.mozilla.com/report/index/01246d16-3c71-4bb1-a148-35b992160428
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PContentChild::Read ] → [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PContentChild::Read ] [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PContentChild::FatalError | mozilla::dom::PContentChild::Read ]
See Also: → 1268662
(In reply to Andrew McCreight [:mccr8] from comment #12)
> This isn't quite the same issue, but I found this crash on beta 47. It is an
> OOM while trying to send a prefs array, and the OOM allocation size is
> around 220MB:
> https://crash-stats.mozilla.com/report/index/01246d16-3c71-4bb1-a148-
> 35b992160428

Maybe the fix in bug 1258312 moved the crash from the child process to the parent process.
I have a preliminary patch for checksumming IPC messages (header and payload separately) using crc32c and ran several talos e10s suites locally (damp, svgr, others). Checksumming adds roughly 1% overhead in some tests and makes no measurable difference in most.

Having checksums doesn't tell us where the problem is, but it allows for early detection of errors (if messages really are getting corrupted). In addition, we may also need to log the message type of the previous message received on the channel.

Jim, do you think it would help to add checksums to detect message corruption?
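For reference, a minimal sketch of the checksum itself, a plain bitwise crc32c over a byte buffer (the real patch hooks this into the message header and payload, which isn't shown here, and would use a table-driven or SSE4.2 implementation rather than this slow reference version):

```cpp
#include <cstddef>
#include <cstdint>

// Reference bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Computed separately over the header bytes and the payload bytes; the
// receiver recomputes both and treats a mismatch as a corrupted message.
uint32_t Crc32c(const uint8_t* aData, size_t aLen) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < aLen; ++i) {
    crc ^= aData[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}
```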
Flags: needinfo?(jmathies)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #14)
> I have a preliminary patch for checksumming IPC messages (header and
> payload separately) using crc32c and ran several talos e10s suites locally
> (damp, svgr, others). Checksumming adds roughly 1% overhead in some tests
> and makes no measurable difference in most.
> 
> Having checksums doesn't tell us where the problem is, but it allows for
> early detection of errors (if messages really are getting corrupted). In
> addition, we may also need to log the message type of the previous message
> received on the channel.
> 
> Jim, do you think it would help to add checksums to detect message
> corruption?

I wouldn't anticipate messages corrupting in transit; I can't really think of anything that could cause that. We're just sending data over pipes. Two more likely scenarios would be memory corruption of the message data on the receiving side, or potentially some sort of flaw in our serializer on the sending side.

I don't think we want to land something like this, but you could run it through try to see if it hits on something.

ni'ing Gabor, who's been doing a lot of work in IPC, for an opinion. We might want to ask billm too.
Flags: needinfo?(jmathies) → needinfo?(gkrizsanits)
Given that these are mostly for this one particular message, it is probably worth looking at the serialization code for prefs and seeing if there are more cases where it can fail somewhere but still end up sending the message.

One idea I had was that we could add an opt-in mechanism to immediately deserialize a message after we create it. This might give us more information. I filed bug 1268900 for that. It might be worth looking into if staring at the serialization code doesn't turn up anything.
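A minimal sketch of the idea (illustrative only: the real opt-in check in bug 1268900 would live in the IPDL code generator, and the WriteParam/ReadParam usage and iterator type below are approximations of the in-tree ParamTraits entry points, not exact copies):

```cpp
#include "chrome/common/ipc_message_utils.h"  // WriteParam/ReadParam (in-tree)
#include "mozilla/Assertions.h"

// Debug-only round-trip check: serialize a value into the outgoing message,
// then immediately read it back in the sending process. A serializer bug or
// sender-side corruption shows up as an assertion in the parent instead of a
// FatalError in the child. Assumes aValue is the first thing written into
// aMsg, so the read starts at the right offset.
#ifdef DEBUG
template <typename T>
void AssertParamRoundTrips(IPC::Message& aMsg, const T& aValue) {
  IPC::WriteParam(&aMsg, aValue);

  PickleIterator iter(aMsg);
  T readBack;
  MOZ_ASSERT(IPC::ReadParam(&aMsg, &iter, &readBack),
             "value failed to deserialize right after serialization");
}
#endif
```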
My first bet would be that the pref file was corrupted somehow. The parsing algorithm code looks pretty scary, so I would not be surprised if it generated something crazy for a corrupted file (OR wrote something crazy while trying to save an already broken state). Can we do some sanity check on the parsed prefs? Comment 11 sounds like something we could filter out with some sanity check on the parsed prefs before sending it over.
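Something along these lines, for example (purely hypothetical names and thresholds; the real array is the nsTArray<PrefSetting> that Preferences builds before SendReadPrefsArray):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for mozilla::dom::PrefSetting, just to sketch the shape of a
// parent-side sanity check run before the array is handed to the IPC send.
struct ToyPrefSetting {
  std::string name;
};

bool PrefsArrayLooksSane(const std::vector<ToyPrefSetting>& aPrefs) {
  // A profile normally has a few thousand prefs; a wildly larger count is a
  // strong hint that the prefs file parser produced garbage. (The threshold
  // is an arbitrary illustration, not a tuned value.)
  if (aPrefs.size() > 50000) {
    return false;
  }
  for (const ToyPrefSetting& pref : aPrefs) {
    // Pref names are short ASCII identifiers; empty or multi-kilobyte names
    // suggest a corrupted prefs file.
    if (pref.name.empty() || pref.name.size() > 4096) {
      return false;
    }
  }
  return true;
}
```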
Flags: needinfo?(gkrizsanits)
Depends on: 1269365
I filed bug 1269365 about turning OOM failures during deserialization of nsTArray into OOM crashes. This isn't consistent with what Cervantes found in comment 10, but it would be nice to at least eliminate that as a possibility.
mccr8, do we still need this with bug 1269365?
Flags: needinfo?(continuation)
Assignee: cyu → nobody
(In reply to Jim Mathies [:jimm] from comment #19)
> mccr8, do we still need this with bug 1269365?

I haven't seen any crashes from bug 1269365, so it still isn't clear what these crashes are caused by.
Flags: needinfo?(continuation)
I tried to implement a specialized version of bug 1268900, only for this bug. It looks like we made the PContentParent::Write() overloads private, which makes it difficult to dry-run message deserialization in ContentParent::RecvReadPrefsArray(). We still need to make this happen in the code generator.
Depends on: 1268900
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #17)
> My first bet would be that the pref file was corrupted somehow. The parsing
> algorithm code looks pretty scary, so I would not be surprised if it
> generated something crazy for a corrupted file (OR wrote something crazy
> while trying to save an already broken state). Can we do some sanity check
> on the parsed prefs? Comment 11 sounds like something we could filter out
> with some sanity check on the parsed prefs before sending it over.

I've looked into this more thoroughly and I don't think this is the case. I found a few parts that I can imagine going wrong, but nothing that could cause what is described in comment 10. In the serialization we write length/data pairs. To read an invalid pref value data type, either the memory would have to be totally corrupted, or we would have to write a length that is shorter or longer than the data that follows it. But even if the pref array is full of invalid data, that can never happen, based on how IPCMessageUtils is implemented.
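To make the length/data point concrete, here is a toy model of the framing (illustrative only, not the in-tree ParamTraits code): the writer derives the length from the data itself, so valid writer code can't produce a mismatched pair, and the reader bounds-checks both the length field and the data against the buffer, so a bad length makes the read fail rather than silently shifting later reads such as the union tag.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Write a string as a 4-byte length followed by the bytes themselves; the
// length always comes from the data, so the pair can't disagree.
void WriteLengthPrefixed(std::vector<uint8_t>& aBuf, const std::string& aData) {
  uint32_t len = static_cast<uint32_t>(aData.size());
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
  aBuf.insert(aBuf.end(), p, p + sizeof(len));           // length
  aBuf.insert(aBuf.end(), aData.begin(), aData.end());   // data
}

// Read one length/data pair. Both the length field and the data are checked
// against the remaining buffer, so a length that overruns the message makes
// the read fail instead of reinterpreting unrelated bytes.
bool ReadLengthPrefixed(const std::vector<uint8_t>& aBuf, size_t& aPos,
                        std::string& aOut) {
  uint32_t len;
  if (aPos + sizeof(len) > aBuf.size()) {
    return false;
  }
  std::memcpy(&len, aBuf.data() + aPos, sizeof(len));
  aPos += sizeof(len);
  if (len > aBuf.size() - aPos) {
    return false;
  }
  aOut.assign(reinterpret_cast<const char*>(aBuf.data() + aPos), len);
  aPos += len;
  return true;
}
```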
Whiteboard: btpp-active → btpp-active, e10st?
A number of mitigations for this issue have landed, so I'm going to mark this WFM. (There are some crashes with this signature, but I think the signature would look like something else now anyways.)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Crash volume for signature 'mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PContentChild::FatalError | mozilla::dom::PContentChild::Read':
 - nightly (version 50): 24 crashes from 2016-06-06.
 - aurora  (version 49): 7 crashes from 2016-06-07.
 - beta    (version 48): 6 crashes from 2016-06-06.
 - release (version 47): 1 crash from 2016-05-31.
 - esr     (version 45): 0 crashes from 2016-04-07.

Crash volume over the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          0          1          0
 - aurora           0          0          5          0          2          0          0
 - beta             1          0          0          5          0          0          0
 - release          1          0          0          0          0          0          0
 - esr              0          0          0          0          0          0          0

Affected platform: Windows