Closed Bug 1266517 Opened 5 years ago Closed 4 months ago

OOM | large crash in MessageChannel::OnMessageReceivedFromLink()

Categories

(Core :: IPC, defect, P3)

Tracking

RESOLVED WORKSFORME
Tracking Status
e10s + ---
firefox48 --- affected

People

(Reporter: mccr8, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: btpp-backlog, e10st?)

Crash Data

Attachments

(1 file)

I've come across 3 crashes inside deque::push_back from MessageChannel::OnMessageReceivedFromLink() on Nightly. They are OOM | large crashes with allocation sizes of about 1MB. I'd guess that this is due to the deque itself getting large. Maybe the message queue gets backed up under some circumstances.

https://crash-stats.mozilla.com/report/index/895f658b-8a41-4371-bc49-5d3a02160419
https://crash-stats.mozilla.com/report/index/993337f7-3b2b-4aeb-be52-90d972160419
https://crash-stats.mozilla.com/report/index/b980b271-7c80-4bfc-a0f4-f06d62160421

Three crashes isn't a big deal, but this is worth keeping an eye on in case it gets bad on Beta, where we have seen many more address-space issues.

Note that this crash signature is fairly generic and includes other crashes inside STL data structures. I guess we should add std::_Allocate to the skip list.
Whiteboard: btpp-backlog
Priority: -- → P1
Depends on: 1267306
At about 1MB with 24 bytes per message (on 64-bit), we're talking about ~43,600 pending messages. Assuming a worst-case growth strategy where the deque doubles its storage, the queue may have held half that when the allocation failed, so ~21,800 messages. That's a lot of messages.
In all three crashes in comment 0, the main thread is doing a CC triggered by a memory pressure notification. That's sort of an explanation for why we might not respond to messages for a long time. But it's still unclear what all these messages are that are being sent.
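The estimate above can be checked with simple arithmetic (a sketch only; the ~24-byte entry size is the assumption from this comment, not a measured value):

```cpp
#include <cstddef>

// Back-of-envelope estimate of the pending-message count implied by a
// ~1MB failed deque allocation, assuming ~24 bytes per queued entry on
// 64-bit. Both inputs are assumptions from the comment above.
constexpr size_t kFailedAllocBytes = 1u << 20;  // ~1MB failed allocation
constexpr size_t kBytesPerEntry = 24;           // assumed entry size

constexpr size_t PendingAfterGrowth() {
  return kFailedAllocBytes / kBytesPerEntry;  // ~43,690 entries
}

constexpr size_t PendingBeforeGrowth() {
  // Worst case: the deque doubles its storage, so the queue held roughly
  // half the new capacity at the moment the allocation failed.
  return PendingAfterGrowth() / 2;  // ~21,845 entries
}
```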
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ]
Whiteboard: btpp-backlog → btpp-backlog, e10st?
The crash is still happening, dozens per day.
The flood of messages could be a side effect of bugs where we break one large message into many smaller ones, for example bug 1263028.
Using the tool from bug 1001760, I see a lot of <1M free blocks. For example, bp-182c7af4-5d0a-49f0-b51f-a7b0b2160623:

  Free=1093M Tiny=1050M Misaligned=0M Usable=0M Other=42M 

This is also what I observed in bug 1259512 comment 29; I will try to sort out how these are created.
(In reply to Ting-Yu Chou [:ting] from comment #5)
> The flood of messages could be a side effect of bugs where we break one
> large message into many smaller ones, for example bug 1263028.

That's possible, but each of the messages there is about 128 KB, so 20,000 messages would be quite a bit of data.

One ugly way to address this issue would be to implement a segmented queue, perhaps out of a pair of stacks made from SegmentedVector.

Maybe trying to drain the message queue on memory pressure would help?

Of course, if you can figure out why memory is so fragmented that would also be very useful.
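The two-stack idea above can be sketched like this (a minimal illustration only: the class name is hypothetical, and std::vector stands in for SegmentedVector, which is the part that would avoid large contiguous allocations):

```cpp
#include <utility>
#include <vector>

// Sketch of a FIFO queue built from a pair of stacks. In a real patch the
// stacks would be SegmentedVectors, so growth never requires one large
// contiguous allocation; std::vector here is only a stand-in.
template <typename T>
class TwoStackQueue {
  std::vector<T> mIn;   // receives pushes
  std::vector<T> mOut;  // serves pops, holding elements in reversed order

 public:
  void Push(T aValue) { mIn.push_back(std::move(aValue)); }

  bool Empty() const { return mIn.empty() && mOut.empty(); }

  T Pop() {
    if (mOut.empty()) {
      // Reverse the input stack into the output stack; each element is
      // moved exactly once per traversal, so Pop is amortized O(1).
      while (!mIn.empty()) {
        mOut.push_back(std::move(mIn.back()));
        mIn.pop_back();
      }
    }
    T front = std::move(mOut.back());
    mOut.pop_back();
    return front;
  }
};
```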
:dmajor did notice the WRITECOMBINE memory in bug 1062065...
(In reply to Ting-Yu Chou [:ting] from comment #8)
> :dmajor did notice the WRITECOMBINE memory in bug 1062065...

Check also bug 1101179.
Searched Socorro for crashes where the product contains "Firefox" and the date is >= 2016-06-01:

  695399 100% signature contains OOM
  263061 38%  signature contains OOM && write combine size >= 128M
  83955  12%  signature contains OOM && tiny block size >= 128M
  49475   7%  signature contains OOM && tiny block size >= 128M && write combine size >= 128M
(In reply to Ting-Yu Chou [:ting] from comment #9)
> (In reply to Ting-Yu Chou [:ting] from comment #8)
> > :dmajor did notice the WRITECOMBINE memory in bug 1062065...
> 
> Check also bug 1101179.

And bug 1123465, bug 1127925.
(In reply to Ting-Yu Chou [:ting] from comment #10)
> Searched Socorro for crashes where the product contains "Firefox" and the
> date is >= 2016-06-01:
> 
>   695399 100% signature contains OOM
>   263061 38%  signature contains OOM && write combine size >= 128M
>   83955  12%  signature contains OOM && tiny block size >= 128M
>   49475   7%  signature contains OOM && tiny block size >= 128M && write
> combine size >= 128M

  488134 70%  signature contains OOM && tiny block size >= 64M
  371045 53%  signature contains OOM && write combine size >= 64M
  329172 47%  signature contains OOM && tiny block size >= 64M && write combine size >= 64M
Attached file memorylist.txt
Most of the free space is in tiny (1M) blocks; if we can eliminate them... bug 1127925 is probably still worth working on.

But before that, I will see if I can add annotations to understand what the queued messages are.
(In reply to Ting-Yu Chou [:ting] from comment #13)
> Most of the free space is in tiny (1M) blocks; if we can eliminate them...
> bug 1127925 is probably still worth working on.

Correction: bug 1123465
(In reply to Andrew McCreight [:mccr8] from comment #7)
> One ugly way to address this issue would be to implement a segmented queue,
> perhaps out of a pair of stacks made from SegmentedVector.

std::list may also be an option; it's a doubly-linked list.
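A quick sketch of why std::list's allocation pattern matters here (illustrative only; int stands in for the real message type, and the helper name is made up):

```cpp
#include <list>

// std::list allocates one small heap node per element, so growing the
// queue never requires a single large contiguous block (MSVC's
// std::deque, by contrast, reallocates its internal map array as one
// piece as the queue grows, which is where _Growmap can OOM). The
// trade-off is one allocation plus two pointers of overhead per message.
using PendingQueue = std::list<int>;  // int stands in for a queued message

inline int TakeFront(PendingQueue& aQueue) {
  int front = aQueue.front();
  aQueue.pop_front();
  return front;
}
```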
Assignee: nobody → janus926
Depends on: 1287044
Finally, some OOM crashes from AnnotateAllocator came in:

  bp-b509ad9a-f492-4b79-94e4-281512160905,
  bp-77e09693-ee45-4a54-ae26-2dee22160904,
  bp-0e52bfc7-220f-43cd-a9b8-17b922160903,
  bp-5ddab854-6bde-45e2-b142-897c42160902

however, none of them has NumberOfPendingIPC, TopPendingIPCCount, or TopPendingIPCName in the metadata...
It seems CrashReporter::gExceptionHandler is null, in which case AnnotateCrashReport() returns early.
(In reply to Ting-Yu Chou [:ting] from comment #17)
> It seems CrashReporter::gExceptionHandler is null, in which case
> AnnotateCrashReport() returns early.

Another possibility is that CrashReporterHelperRunnable hasn't had a chance to run on the main thread. None of the 4 reports in comment 16 OOMed on the main thread.
Depends on: 1301022
Even with bug 1301022, bp-9b712aed-dd03-4b99-b181-a82a32160923 still doesn't have the added annotations...
Depends on: 1312326
bp-b4853361-f3cf-4609-b155-d20b82161021 has:

  NumberOfPendingIPC 65535
  TopPendingIPCCount 65351
  TopPendingIPCName  ???
  TopPendingIPCType  0x440003

which seems to be PBluetoothRequestEnd, but that doesn't make sense. Bug 1312326 could get us the message name.
(In reply to Ting-Yu Chou [:ting] from comment #20)
> which seems to be PBluetoothRequestEnd, but that doesn't make sense.

My bad math, it is PCompositorBridge::Msg_DidComposite__ID.
(In reply to Ting-Yu Chou [:ting] from comment #21)
> (In reply to Ting-Yu Chou [:ting] from comment #20)
> > which seems to be PBluetoothRequestEnd, but that doesn't make sense.
> 
> My bad math, it is PCompositorBridge::Msg_DidComposite__ID.

Matt, do you have any idea how this message could flood the queue?

There's an OOM when pushing a message onto the pending queue; there were 65535 pending messages, and 65351 of them were this message.
Flags: needinfo?(matt.woodrow)
(In reply to Ting-Yu Chou [:ting] from comment #22)
> (In reply to Ting-Yu Chou [:ting] from comment #21)
> > (In reply to Ting-Yu Chou [:ting] from comment #20)
> > > which seems to be PBluetoothRequestEnd, but that doesn't make sense.
> > 
> > My bad math, it is PCompositorBridge::Msg_DidComposite__ID.
> 
> Matt, do you have any idea how this message could flood the queue?
> 
> There's an OOM when pushing a message onto the pending queue; there were
> 65535 pending messages, and 65351 of them were this message.

We send one of these back to the content process and main thread of the UI process each time we composite.

At 60fps (possible if a page with an animation is left open), that would require the receiving thread to be blocked for almost 20 minutes.

I guess it's also possible if we're just processing messages more slowly than they are being received, though it seems like it would take a long time to accumulate that many.
Flags: needinfo?(matt.woodrow)
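The "almost 20 minutes" figure above works out as follows (a rough check only; it assumes exactly one DidComposite per composite at a steady 60fps, with the pending count taken from the annotations in comment 20):

```cpp
// How long the receiving thread would have to be blocked to accumulate
// the observed backlog of DidComposite messages, assuming one message
// per composite at a steady 60fps.
constexpr int kPendingDidComposite = 65351;  // from the crash annotations
constexpr int kCompositesPerSecond = 60;     // assumed steady frame rate

constexpr int BlockedSeconds() {
  return kPendingDidComposite / kCompositesPerSecond;  // ~1089 seconds
}

constexpr int BlockedMinutes() {
  return BlockedSeconds() / 60;  // ~18 minutes, i.e. "almost 20"
}
```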
I am not sure whether PCompositorBridge::Msg_DidComposite__ID is the only message that could fill up the queue; I will file a bug for it.
Depends on: 1312945
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ] [@ OOM | unknown | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc …
Assignee: janus926 → nobody
Priority: P1 → P3

Closing because no crashes have been reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → WORKSFORME