Closed Bug 1266517 Opened 5 years ago Closed 4 months ago

OOM | large crash in MessageChannel::OnMessageReceivedFromLink()

Categories

(Core :: IPC, defect, P3)

Tracking

RESOLVED WORKSFORME
Tracking Status
e10s + ---
firefox48 --- affected

People

(Reporter: mccr8, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: btpp-backlog, e10st?)

Crash Data

Attachments

(1 file)

I've come across 3 crashes inside deque::push_back from MessageChannel::OnMessageReceivedFromLink() on Nightly. They are OOM | large crashes with allocation sizes of about 1MB. I'd guess that this is due to the deque itself getting large. Maybe the message queue gets backed up under some circumstances.

https://crash-stats.mozilla.com/report/index/895f658b-8a41-4371-bc49-5d3a02160419
https://crash-stats.mozilla.com/report/index/993337f7-3b2b-4aeb-be52-90d972160419
https://crash-stats.mozilla.com/report/index/b980b271-7c80-4bfc-a0f4-f06d62160421

Three crashes isn't a big deal, but this is worth keeping an eye on in case it gets bad on Beta, where we have seen many more address-space issues.

Note that this crash signature is fairly generic and includes other crashes inside STL data structures. I guess we should add std::_Allocate to the skip list.
Whiteboard: btpp-backlog
Priority: -- → P1
Depends on: 1267306
At about 1MB with 24 bytes per message (on 64-bit), we're talking about ~43,600 pending messages. Assuming a worst-case growth strategy where the deque doubles its storage, the queue may have held half that when the allocation failed, so ~21,800 messages. That's a lot of messages.
In all three crashes in comment 0, the main thread is doing a CC triggered by a memory pressure notification. That's sort of an explanation for why we might not respond to messages for a long time. But it's still unclear what all these messages are that are being sent.
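The estimate above can be checked with simple arithmetic (a sketch only; the ~24-byte entry size is the assumption from this comment, not a measured value):

```cpp
#include <cstddef>

// Back-of-envelope estimate of the pending-message count implied by a
// ~1MB failed deque allocation, assuming ~24 bytes per queued entry on
// 64-bit. Both inputs are assumptions from the comment above.
constexpr size_t kFailedAllocBytes = 1u << 20;  // ~1MB failed allocation
constexpr size_t kBytesPerEntry = 24;           // assumed entry size

constexpr size_t PendingAfterGrowth() {
  return kFailedAllocBytes / kBytesPerEntry;  // ~43,690 entries
}

constexpr size_t PendingBeforeGrowth() {
  // Worst case: the deque doubles its storage, so the queue held roughly
  // half the new capacity at the moment the allocation failed.
  return PendingAfterGrowth() / 2;  // ~21,845 entries
}
```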
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ]
Whiteboard: btpp-backlog → btpp-backlog, e10st?
The crash is still happening, dozens per day.
The flood of messages could be a side effect of bugs where we break one large message into many smaller ones, for example bug 1263028.
Using the tool from bug 1001760, I see a lot of <1M free blocks. For example, bp-182c7af4-5d0a-49f0-b51f-a7b0b2160623:

  Free=1093M Tiny=1050M Misaligned=0M Usable=0M Other=42M 

This is also what I observed in bug 1259512 comment 29; I will try to sort out how these are created.
(In reply to Ting-Yu Chou [:ting] from comment #5)
> The flood of messages could be a side effect of bugs where we break one
> large message into many smaller ones, for example bug 1263028.

That's possible, but each of the messages there is about 128 KB, so 20,000 messages would be quite a bit of data.

One ugly way to address this issue would be to implement a segmented queue, perhaps out of a pair of stacks made from SegmentedVector.

Maybe trying to drain the message queue on memory pressure would help?

Of course, if you can figure out why memory is so fragmented that would also be very useful.
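The two-stack idea above can be sketched like this (a minimal illustration only: the class name is hypothetical, and std::vector stands in for SegmentedVector, which is the part that would avoid large contiguous allocations):

```cpp
#include <utility>
#include <vector>

// Sketch of a FIFO queue built from a pair of stacks. In a real patch the
// stacks would be SegmentedVectors, so growth never requires one large
// contiguous allocation; std::vector here is only a stand-in.
template <typename T>
class TwoStackQueue {
  std::vector<T> mIn;   // receives pushes
  std::vector<T> mOut;  // serves pops, holding elements in reversed order

 public:
  void Push(T aValue) { mIn.push_back(std::move(aValue)); }

  bool Empty() const { return mIn.empty() && mOut.empty(); }

  T Pop() {
    if (mOut.empty()) {
      // Reverse the input stack into the output stack; each element is
      // moved exactly once per traversal, so Pop is amortized O(1).
      while (!mIn.empty()) {
        mOut.push_back(std::move(mIn.back()));
        mIn.pop_back();
      }
    }
    T front = std::move(mOut.back());
    mOut.pop_back();
    return front;
  }
};
```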
:dmajor did notice the WRITECOMBINE memory in bug 1062065...
(In reply to Ting-Yu Chou [:ting] from comment #8)
> :dmajor did notice the WRITECOMBINE memory in bug 1062065...

Check also bug 1101179.
Searched Socorro for crashes where the product contains "Firefox" and the date is >= 2016-06-01:

  695399 100% signature contains OOM
  263061 38%  signature contains OOM && write combine size >= 128M
  83955  12%  signature contains OOM && tiny block size >= 128M
  49475   7%  signature contains OOM && tiny block size >= 128M && write combine size >= 128M
(In reply to Ting-Yu Chou [:ting] from comment #9)
> (In reply to Ting-Yu Chou [:ting] from comment #8)
> > :dmajor did notice the WRITECOMBINE memory in bug 1062065...
> 
> Check also bug 1101179.

And bug 1123465, bug 1127925.
(In reply to Ting-Yu Chou [:ting] from comment #10)
> Searched Socorro for crashes where the product contains "Firefox" and the
> date is >= 2016-06-01:
> 
>   695399 100% signature contains OOM
>   263061 38%  signature contains OOM && write combine size >= 128M
>   83955  12%  signature contains OOM && tiny block size >= 128M
>   49475   7%  signature contains OOM && tiny block size >= 128M && write
> combine size >= 128M

  488134 70%  signature contains OOM && tiny block size >= 64M
  371045 53%  signature contains OOM && write combine size >= 64M
  329172 47%  signature contains OOM && tiny block size >= 64M && write combine size >= 64M
Attached file memorylist.txt
Most of the free space is in tiny (1M) blocks; if we can eliminate them... bug 1127925 is probably still worth working on.

But before that, I will see if I can add annotations to understand what the queued messages are.
(In reply to Ting-Yu Chou [:ting] from comment #13)
> Most of the free space is in tiny (1M) blocks; if we can eliminate them...
> bug 1127925 is probably still worth working on.

Correction: bug 1123465
(In reply to Andrew McCreight [:mccr8] from comment #7)
> One ugly way to address this issue would be to implement a segmented queue,
> perhaps out of a pair of stacks made from SegmentedVector.

std::list may also be an option; it's a doubly-linked list.
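A quick sketch of why std::list's allocation pattern matters here (illustrative only; int stands in for the real message type, and the helper name is made up):

```cpp
#include <list>

// std::list allocates one small heap node per element, so growing the
// queue never requires a single large contiguous block (MSVC's
// std::deque, by contrast, reallocates its internal map array as one
// piece as the queue grows, which is where _Growmap can OOM). The
// trade-off is one allocation plus two pointers of overhead per message.
using PendingQueue = std::list<int>;  // int stands in for a queued message

inline int TakeFront(PendingQueue& aQueue) {
  int front = aQueue.front();
  aQueue.pop_front();
  return front;
}
```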
Assignee: nobody → janus926
Depends on: 1287044
Finally, some OOM crashes from AnnotateAllocator came in:

  bp-b509ad9a-f492-4b79-94e4-281512160905,
  bp-77e09693-ee45-4a54-ae26-2dee22160904,
  bp-0e52bfc7-220f-43cd-a9b8-17b922160903,
  bp-5ddab854-6bde-45e2-b142-897c42160902

however, none of them has NumberOfPendingIPC, TopPendingIPCCount, or TopPendingIPCName in the metadata...
It seems CrashReporter::gExceptionHandler is null, in which case AnnotateCrashReport() returns early.
(In reply to Ting-Yu Chou [:ting] from comment #17)
> It seems CrashReporter::gExceptionHandler is null, in which case
> AnnotateCrashReport() returns early.

Another possibility is that CrashReporterHelperRunnable hasn't had a chance to run on the main thread. None of the 4 reports in comment 16 OOMed on the main thread.
Depends on: 1301022
Even with bug 1301022, bp-9b712aed-dd03-4b99-b181-a82a32160923 still doesn't have the added annotations...
Depends on: 1312326
bp-b4853361-f3cf-4609-b155-d20b82161021 has:

  NumberOfPendingIPC 65535
  TopPendingIPCCount 65351
  TopPendingIPCName  ???
  TopPendingIPCType  0x440003

which seems to be PBluetoothRequestEnd, but that doesn't make sense. Bug 1312326 could get us the message name.
(In reply to Ting-Yu Chou [:ting] from comment #20)
> which seems to be PBluetoothRequestEnd, but that doesn't make sense.

My bad math, it is PCompositorBridge::Msg_DidComposite__ID.
(In reply to Ting-Yu Chou [:ting] from comment #21)
> (In reply to Ting-Yu Chou [:ting] from comment #20)
> > which seems to be PBluetoothRequestEnd, but that doesn't make sense.
> 
> My bad math, it is PCompositorBridge::Msg_DidComposite__ID.

Matt, do you have any idea how this message could flood the queue?

There's an OOM when pushing a message onto the pending queue; there were 65535 pending messages, and 65351 of them were this message.
Flags: needinfo?(matt.woodrow)
(In reply to Ting-Yu Chou [:ting] from comment #22)
> (In reply to Ting-Yu Chou [:ting] from comment #21)
> > (In reply to Ting-Yu Chou [:ting] from comment #20)
> > > which seems to be PBluetoothRequestEnd, but that doesn't make sense.
> > 
> > My bad math, it is PCompositorBridge::Msg_DidComposite__ID.
> 
> Matt, do you have any idea how this message could flood the queue?
> 
> There's an OOM when pushing a message onto the pending queue; there were
> 65535 pending messages, and 65351 of them were this message.

We send one of these back to the content process and main thread of the UI process each time we composite.

At 60fps (possible if a page with an animation is left open), that would require the receiving thread to be blocked for almost 20 minutes.

I guess it's also possible if we're just processing messages more slowly than they are being received, though it seems like it would take a long time to accumulate that many.
Flags: needinfo?(matt.woodrow)
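The "almost 20 minutes" figure above works out as follows (a rough check only; it assumes exactly one DidComposite per composite at a steady 60fps, with the pending count taken from the annotations in comment 20):

```cpp
// How long the receiving thread would have to be blocked to accumulate
// the observed backlog of DidComposite messages, assuming one message
// per composite at a steady 60fps.
constexpr int kPendingDidComposite = 65351;  // from the crash annotations
constexpr int kCompositesPerSecond = 60;     // assumed steady frame rate

constexpr int BlockedSeconds() {
  return kPendingDidComposite / kCompositesPerSecond;  // ~1089 seconds
}

constexpr int BlockedMinutes() {
  return BlockedSeconds() / 60;  // ~18 minutes, i.e. "almost 20"
}
```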
I am not sure whether PCompositorBridge::Msg_DidComposite__ID is the only message that could fill up the queue; I will file a bug for it.
Depends on: 1312945
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ] [@ OOM | unknown | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc …
Assignee: janus926 → nobody
Priority: P1 → P3

Closing because no crashes have been reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → WORKSFORME