Closed Bug 1259480 Opened 8 years ago Closed 8 years ago

[e10s] FatalError crash in PContentChild::Read from PContentChild::SendReadPrefsArray

Categories

(Core :: DOM: Content Processes, defect, P1)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
e10s + ---
firefox47 --- affected
firefox48 --- affected
firefox49 --- affected
firefox50 --- affected

People

(Reporter: benjamin, Unassigned)

References

Details

(Whiteboard: btpp-active, e10st?)

Crash Data

Priority: -- → P1
Depends on: 1260270
#19 on the e10s topcrash list.
Blocks: e10s-crashes
This didn't show up in the top 50 in the apz 46 experiment.
(In reply to Jim Mathies [:jimm] from comment #2)
> This didn't show up in the top 50 in the apz 46 experiment.

My mistake, this was #21 in the content process list. It looks like it's startup-related, but not all of the signatures originate from SendReadPrefsArray. The startup crashes all do, though.

This one might be pretty easy to track down. Kanru, do you have anyone who's free to take a look at this?

https://docs.google.com/spreadsheets/d/17P0SBGcgrB-fFycCh4sA8KeK719-srjWEPiuvfcr6y8/edit#gid=777801084
Flags: needinfo?(kchen)
I can't view that spreadsheet, FWIW, even with my MoCo account.
This is an odd one. We're attempting to deserialize what could be a very large array of data that contains the parent's preferences. 

child send:
http://mxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#515

parent side:
http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#2678
http://mxr.mozilla.org/mozilla-central/source/modules/libpref/Preferences.cpp#394

We're dealing with user prefs here; I wonder if we should have better error checking on the parent side.
In a fairly clean profile, I see this preferences stuff generate a pretty big IPC message, doubling its way up to 262KB. Maybe for somebody with a ton of preferences it could be even larger, and thus be a source of OOMs for people low on memory? We probably need to wait until Kan-ru's patch in bug 1258312 filters out to a beta experiment. This is one of many deserialization crashes.
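As a rough illustration of that doubling (the 4 KB starting capacity below is an assumption, not the actual Pickle segment size), the allocator has to satisfy a chain of progressively larger buffers for this one message, which is why a machine that's already low on memory could fail partway up:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  size_t capacity = 4 * 1024;        // assumed starting capacity
  const size_t payload = 262144;     // ~262KB, the observed message size
  while (capacity < payload) {
    capacity *= 2;                   // doubling growth: 8K, 16K, ..., 256K
    std::printf("grow buffer to %zu bytes\n", capacity);
  }
  return 0;
}
```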
See Also: → 1258312
Cervantes, can you take a look?
Flags: needinfo?(kchen) → needinfo?(cyu)
I'll take a look.
Assignee: nobody → cyu
Flags: needinfo?(cyu)
Whiteboard: btpp-active
Signature is likely to be changed to OOM by bug 1258312 but I think we still need to fix the large preferences over IPC issue. We will need to update the signature once we figure how to collect that for bug 1258312.
Blocks: e10s-oom
No longer blocks: e10s-crashes
Blocks: 1262918
No longer blocks: e10s-oom
From this report https://crash-stats.mozilla.com/report/index/9f96e936-2ecd-4588-a575-e35af2160418 the build already has the fix for bug 1258312, but we still see it crash while deserializing the prefs array. This doesn't look like an OOM. The error message in FatalError() is "unknown union type", that is, MaybePrefValue is neither TPrefValue nor Tnull_t. It's still unclear how such an unexpected union value got past the parent's checks.
I sampled several dumps and saw the union type field holding values like 206424 or 0, and the prefs array having length values > 2000, which is unlikely because, judging from the crash reports, there aren't many add-ons installed. It looks like the IPC message was corrupted.
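For reference, this is roughly the shape of the generated union deserialization that hits the error (a heavily simplified sketch, not the actual IPDL codegen output; the tag enumerator values and function name here are illustrative only):

```cpp
#include <cstdint>

// Simplified stand-in for the generated MaybePrefValue reader: the type tag
// is read from the message first, only the two known variants are accepted,
// and anything else falls through to the "unknown union type" FatalError.
enum MaybePrefValueType : int32_t { TPrefValue = 1, Tnull_t = 2 };

bool ReadMaybePrefValue(int32_t aTag /* tag already read from the message */) {
  switch (aTag) {
    case TPrefValue:
      // ... read the PrefValue payload ...
      return true;
    case Tnull_t:
      // ... nothing further to read for null_t ...
      return true;
    default:
      // Tags like 0 or 206424 (as seen in the dumps) land here, and the
      // child aborts with FatalError("unknown union type").
      return false;
  }
}
```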
This isn't quite the same issue, but I found this crash on beta 47. It is an OOM while trying to send a prefs array, and the OOM allocation size is around 220MB: https://crash-stats.mozilla.com/report/index/01246d16-3c71-4bb1-a148-35b992160428
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PContentChild::Read ] → [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PContentChild::Read ] [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PContentChild::FatalError | mozilla::dom::PContentChild::Read ]
See Also: → 1268662
(In reply to Andrew McCreight [:mccr8] from comment #12)
> This isn't quite the same issue, but I found this crash on beta 47. It is an
> OOM while trying to send a prefs array, and the OOM allocation size is
> around 220MB:
> https://crash-stats.mozilla.com/report/index/01246d16-3c71-4bb1-a148-
> 35b992160428

Maybe the fix in bug 1258312 moved the crash from the child process to the parent process.
I have a preliminary patch for checksumming IPC messages (header and payload separately) using crc32c and ran several talos e10s suites locally (damp, svgr, others). Checksumming adds roughly 1% overhead in some tests and makes no measurable difference in most.

Having checksums doesn't tell us where the problem is, but it allows for early detection of errors (if messages really are getting corrupted). In addition, we may also need to log the message type of the previous message received on the channel.

Jim, do you think it would help to add checksums to detect message corruption?
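For reference, a minimal sketch of the checksum itself, a plain bitwise crc32c over a byte buffer (the real patch hooks this into the message header and payload, which isn't shown here, and would use a table-driven or SSE4.2 implementation rather than this slow reference version):

```cpp
#include <cstddef>
#include <cstdint>

// Reference bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Computed separately over the header bytes and the payload bytes; the
// receiver recomputes both and treats a mismatch as a corrupted message.
uint32_t Crc32c(const uint8_t* aData, size_t aLen) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < aLen; ++i) {
    crc ^= aData[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}
```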
Flags: needinfo?(jmathies)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #14)
> I have a preliminary patch for checksumming IPC messages (header and
> payload separately) using crc32c and ran several talos e10s suites locally
> (damp, svgr, others). Checksumming adds roughly 1% overhead in some tests
> and makes no measurable difference in most.
> 
> Having checksums doesn't tell us where the problem is, but it allows for
> early detection of errors (if messages really are getting corrupted). In
> addition, we may also need to log the message type of the previous message
> received on the channel.
> 
> Jim, do you think it would help to add checksums to detect message
> corruption?

I wouldn't anticipate messages corrupting in transit; I can't really think of anything that could cause that. We're just sending data over pipes. Two more likely scenarios would be memory corruption of the message data on the receiving side, or potentially some sort of flaw in our serializer on the sending side.

I don't think we want to land something like this, but you could run it through try to see if it hits on something.

ni'ing Gabor, who's been doing a lot of work in IPC, for an opinion. We might want to ask billm too.
Flags: needinfo?(jmathies) → needinfo?(gkrizsanits)
Given that these are mostly for this one particular message, it is probably worth looking at the serialization code for prefs and seeing if there are more cases where it can fail somewhere but still end up sending the message.

One idea I had was that we could add an opt-in mechanism to immediately deserialize a message after we create it. This might give us more information. I filed bug 1268900 for that. It might be worth looking into if staring at the serialization code doesn't turn up anything.
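A minimal sketch of the idea (illustrative only: the real opt-in check in bug 1268900 would live in the IPDL code generator, and the WriteParam/ReadParam usage and iterator type below are approximations of the in-tree ParamTraits entry points, not exact copies):

```cpp
#include "chrome/common/ipc_message_utils.h"  // WriteParam/ReadParam (in-tree)
#include "mozilla/Assertions.h"

// Debug-only round-trip check: serialize a value into the outgoing message,
// then immediately read it back in the sending process. A serializer bug or
// sender-side corruption shows up as an assertion in the parent instead of a
// FatalError in the child. Assumes aValue is the first thing written into
// aMsg, so the read starts at the right offset.
#ifdef DEBUG
template <typename T>
void AssertParamRoundTrips(IPC::Message& aMsg, const T& aValue) {
  IPC::WriteParam(&aMsg, aValue);

  PickleIterator iter(aMsg);
  T readBack;
  MOZ_ASSERT(IPC::ReadParam(&aMsg, &iter, &readBack),
             "value failed to deserialize right after serialization");
}
#endif
```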
My first bet would be that the pref file was corrupted somehow. The parsing algorithm code looks pretty scary, so I would not be surprised if it generated something crazy for a corrupted file (OR wrote something crazy while trying to save an already broken state). Can we do some sanity check on the parsed prefs? Comment 11 sounds like something we could filter out with some sanity check on the parsed prefs before sending it over.
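Something along these lines, for example (purely hypothetical names and thresholds; the real array is the nsTArray<PrefSetting> that Preferences builds before SendReadPrefsArray):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for mozilla::dom::PrefSetting, just to sketch the shape of a
// parent-side sanity check run before the array is handed to the IPC send.
struct ToyPrefSetting {
  std::string name;
};

bool PrefsArrayLooksSane(const std::vector<ToyPrefSetting>& aPrefs) {
  // A profile normally has a few thousand prefs; a wildly larger count is a
  // strong hint that the prefs file parser produced garbage. (The threshold
  // is an arbitrary illustration, not a tuned value.)
  if (aPrefs.size() > 50000) {
    return false;
  }
  for (const ToyPrefSetting& pref : aPrefs) {
    // Pref names are short ASCII identifiers; empty or multi-kilobyte names
    // suggest a corrupted prefs file.
    if (pref.name.empty() || pref.name.size() > 4096) {
      return false;
    }
  }
  return true;
}
```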
Flags: needinfo?(gkrizsanits)
Depends on: 1269365
I filed bug 1269365 about turning OOM failures during deserialization of nsTArray into OOM crashes. This isn't consistent with what Cervantes found in comment 10, but it would be nice to at least eliminate that as a possibility.
mccr8, do we still need this with bug 1269365?
Flags: needinfo?(continuation)
Assignee: cyu → nobody
(In reply to Jim Mathies [:jimm] from comment #19)
> mccr8, do we still need this with bug 1269365?

I haven't seen any crashes from bug 1269365, so it still isn't clear what these crashes are caused by.
Flags: needinfo?(continuation)
I tried to implement a specialized version of bug 1268900, only for this bug. It looks like we made the PContentParent::Write() overloads private, which makes it difficult to dry-run message deserialization in ContentParent::RecvReadPrefsArray(). We still need to make this happen in the code generator.
Depends on: 1268900
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #17)
> My first bet would be that the pref file was corrupted somehow. The parsing
> algorithm code looks pretty scary, so I would not be surprised if it
> generated something crazy for a corrupted file (OR wrote something crazy
> while trying to save an already broken state). Can we do some sanity check
> on the parsed prefs? Comment 11 sounds like something we could filter out
> with some sanity check on the parsed prefs before sending it over.

I've looked into this more thoroughly and I don't think this is the case. I found a few parts that I can imagine going wrong, but nothing that could cause what is described in comment 10. In the serialization we write length/data pairs. To read an invalid pref value data type, either the memory would have to be totally corrupted, or we would have to write a length that is shorter or longer than the data that follows it. But even if the pref array is full of invalid data, that can never happen, based on how IPCMessageUtils is implemented.
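To make the length/data point concrete, here is a toy model of the framing (illustrative only, not the in-tree ParamTraits code): the writer derives the length from the data itself, so valid writer code can't produce a mismatched pair, and the reader bounds-checks both the length field and the data against the buffer, so a bad length makes the read fail rather than silently shifting later reads such as the union tag.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Write a string as a 4-byte length followed by the bytes themselves; the
// length always comes from the data, so the pair can't disagree.
void WriteLengthPrefixed(std::vector<uint8_t>& aBuf, const std::string& aData) {
  uint32_t len = static_cast<uint32_t>(aData.size());
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
  aBuf.insert(aBuf.end(), p, p + sizeof(len));           // length
  aBuf.insert(aBuf.end(), aData.begin(), aData.end());   // data
}

// Read one length/data pair. Both the length field and the data are checked
// against the remaining buffer, so a length that overruns the message makes
// the read fail instead of reinterpreting unrelated bytes.
bool ReadLengthPrefixed(const std::vector<uint8_t>& aBuf, size_t& aPos,
                        std::string& aOut) {
  uint32_t len;
  if (aPos + sizeof(len) > aBuf.size()) {
    return false;
  }
  std::memcpy(&len, aBuf.data() + aPos, sizeof(len));
  aPos += sizeof(len);
  if (len > aBuf.size() - aPos) {
    return false;
  }
  aOut.assign(reinterpret_cast<const char*>(aBuf.data() + aPos), len);
  aPos += len;
  return true;
}
```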
Whiteboard: btpp-active → btpp-active, e10st?
A number of mitigations for this issue have landed, so I'm going to mark this WFM. (There are some crashes with this signature, but I think the signature would look like something else now anyways.)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Crash volume for signature 'mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::PContentChild::FatalError | mozilla::dom::PContentChild::Read':
 - nightly (version 50): 24 crashes from 2016-06-06.
 - aurora  (version 49): 7 crashes from 2016-06-07.
 - beta    (version 48): 6 crashes from 2016-06-06.
 - release (version 47): 1 crash from 2016-05-31.
 - esr     (version 45): 0 crashes from 2016-04-07.

Crash volume over the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          0          1          0
 - aurora           0          0          5          0          2          0          0
 - beta             1          0          0          5          0          0          0
 - release          1          0          0          0          0          0          0
 - esr              0          0          0          0          0          0          0

Affected platform: Windows