Closed
Bug 1266517
Opened 9 years ago
Closed 4 years ago
OOM | large crash in MessageChannel::OnMessageReceivedFromLink()
Categories
(Core :: IPC, defect, P3)
Tracking
RESOLVED
WORKSFORME
People
(Reporter: mccr8, Unassigned)
References
(Depends on 1 open bug)
Details
(Whiteboard: btpp-backlog, e10st?)
Crash Data
Attachments
(1 file)
2.76 KB, text/plain
I've come across 3 crashes in deque::push_back, called from MessageChannel::OnMessageReceivedFromLink(), on Nightly. They are OOM | large crashes with a failed allocation of about 1MB. I'd guess that this is due to the deque itself getting large. Maybe the message queue gets backed up under some circumstances.
https://crash-stats.mozilla.com/report/index/895f658b-8a41-4371-bc49-5d3a02160419
https://crash-stats.mozilla.com/report/index/993337f7-3b2b-4aeb-be52-90d972160419
https://crash-stats.mozilla.com/report/index/b980b271-7c80-4bfc-a0f4-f06d62160421
Three crashes isn't a big deal, but this is worth keeping an eye on in case it is worse on Beta, where we have seen many more address-space issues.
Note that this crash signature is fairly generic and includes other crashes inside STL data structures. I guess we should add std::_Allocate to the skip list.
Updated•9 years ago
Whiteboard: btpp-backlog
Updated•9 years ago
tracking-e10s:
--- → ?
Updated•9 years ago
Priority: -- → P1
Comment 1•9 years ago
At about 1MB with 24 bytes per message (on 64-bit), we're talking about ~43,600 pending messages. Assuming a worst-case growth strategy where the buffer doubles, maybe we had half that, so ~21,800 messages. That's a lot of messages.
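For the record, the rough arithmetic behind that estimate (treating the failed ~1MB allocation as the queue's backing buffer, which is an assumption):

  1,048,576 bytes / 24 bytes per entry ≈ 43,690 entries
  before a doubling grow: roughly half of that, ≈ 21,845 entries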
In all three crashes in comment 0, the main thread is doing a CC triggered by a memory pressure notification. That's sort of an explanation for why we might not respond to messages for a long time. But it's still unclear what all these messages are that are being sent.
Updated•9 years ago
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ]
[@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ]
Comment 3•9 years ago
Updated•8 years ago
Whiteboard: btpp-backlog → btpp-backlog, e10st?
Comment 4•8 years ago
The crash is still happening, dozens per day.
Comment 5•8 years ago
The flood of messages could be a side effect of bugs where we break one large message into many smaller ones, for example bug 1263028.
Comment 6•8 years ago
Using the tool from bug 1001760, I see a lot of <1M free blocks. For example, bp-182c7af4-5d0a-49f0-b51f-a7b0b2160623:
Free=1093M Tiny=1050M Misaligned=0M Usable=0M Other=42M
This is also what I observed in bug 1259512 comment 29; I will try to sort out how these are created.
Reporter
Comment 7•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #5)
> The flood of messages could be a side effect from bugs that we break one
> large message to many smaller ones, for example bug 1263028.
That's possible, but each of the messages there is about 128kb, so 20,000 messages would be quite a bit of data.
One ugly way to address this issue would be to implement a segmented queue, perhaps out of a pair of stacks made from SegmentedVector.
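A minimal sketch of that two-stack idea, with std::vector standing in for SegmentedVector (the class name and interface here are made up for illustration, not the actual MessageChannel queue):

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical two-stack queue: pushes go onto mIn, pops come from mOut,
// and mOut is refilled by reversing mIn when it runs dry. Each operation
// is amortized O(1).
template <typename T>
class TwoStackQueue {
 public:
  void Push(T aItem) { mIn.push_back(std::move(aItem)); }

  bool Pop(T& aOut) {
    if (mOut.empty()) {
      // Move everything across, so the oldest element ends up on top of mOut.
      while (!mIn.empty()) {
        mOut.push_back(std::move(mIn.back()));
        mIn.pop_back();
      }
    }
    if (mOut.empty()) {
      return false;  // nothing queued
    }
    aOut = std::move(mOut.back());
    mOut.pop_back();
    return true;
  }

  std::size_t Length() const { return mIn.size() + mOut.size(); }

 private:
  std::vector<T> mIn;   // recently pushed items, newest at the back
  std::vector<T> mOut;  // older items, oldest at the back
};

The real win would come from backing each stack with SegmentedVector, so neither side ever needs a single large contiguous allocation the way a deque's buffer does.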
Maybe trying to drain the message queue on memory pressure would help?
Of course, if you can figure out why memory is so fragmented that would also be very useful.
Comment 8•8 years ago
:dmajor did notice the WRITECOMBINE memory in bug 1062065...
Comment 9•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #8)
> :dmajor did notice the WRITECOMBINE memory in bug 1062065...
Check also bug 1101179.
Comment 10•8 years ago
Searched Socorro for reports where the product contains "Firefox" and the date is >= 2016-06-01:
695399 100% signature contains OOM
263061 38% signature contains OOM && write combine size >= 128M
83955 12% signature contains OOM && tiny block size >= 128M
49475 7% signature contains OOM && tiny block size >= 128M && write combine size >= 128M
Comment 11•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #9)
> (In reply to Ting-Yu Chou [:ting] from comment #8)
> > :dmajor did notice the WRITECOMBINE memory in bug 1062065...
>
> Check also bug 1101179.
And bug 1123465, bug 127925.
Comment 12•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #10)
> Searched Socorro for reports where the product contains "Firefox" and the date is >= 2016-06-01:
>
> 695399 100% signature contains OOM
> 263061 38% signature contains OOM && write combine size >= 128M
> 83955 12% signature contains OOM && tiny block size >= 128M
> 49475 7% signature contains OOM && tiny block size >= 128M && write combine size >= 128M
488134 70% signature contains OOM && tiny block size >= 64M
371045 53% signature contains OOM && write combine size >= 64M
329172 47% signature contains OOM && tiny block size >= 64M && write combine size >= 64M
Comment 13•8 years ago
Most of the free space is from tiny (1M) blocks; if we can eliminate them, bug 1127925 is probably still worth working on.
But before that, I will see if I can add annotations to understand what the queued messages are.
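A rough sketch of what such an annotation pass might look like (hypothetical helper; the real pending-queue type in MessageChannel and the crash-reporter annotation API may differ):

#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>

// Placeholder for IPC::Message; only the type() accessor matters here.
struct Message {
  uint32_t mType = 0;
  uint32_t type() const { return mType; }
};

struct PendingStats {
  std::size_t total = 0;     // would become NumberOfPendingIPC
  std::size_t topCount = 0;  // would become TopPendingIPCCount
  uint32_t topType = 0;      // would become TopPendingIPCType
};

// Walk the pending queue and find the most common message type, so an OOM
// report can say what flooded the channel.
PendingStats CollectPendingStats(const std::deque<Message>& aPending) {
  std::map<uint32_t, std::size_t> counts;
  PendingStats stats;
  stats.total = aPending.size();
  for (const Message& msg : aPending) {
    std::size_t n = ++counts[msg.type()];
    if (n > stats.topCount) {
      stats.topCount = n;
      stats.topType = msg.type();
    }
  }
  return stats;
}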
Comment 14•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #13)
> Most free space are from tiny (1M) blocks, if we can eliminate them...
> probably bug 1127925 is still worth to work on.
Correction: bug 1123465
Comment 15•8 years ago
(In reply to Andrew McCreight [:mccr8] from comment #7)
> One ugly way to address this issue would be to implement a segmented queue,
> perhaps out of a pair of stacks made from SegmentedVector.
std::list may also be an option; it's a doubly linked list, so it allocates per node rather than needing one large contiguous block.
Updated•8 years ago
Assignee: nobody → janus926
Comment 16•8 years ago
Some OOM reports from AnnotateAllocator finally came in:
bp-b509ad9a-f492-4b79-94e4-281512160905,
bp-77e09693-ee45-4a54-ae26-2dee22160904,
bp-0e52bfc7-220f-43cd-a9b8-17b922160903,
bp-5ddab854-6bde-45e2-b142-897c42160902
however, none of them have NumberOfPendingIPC, TopPendingIPCCount, or TopPendingIPCName in the metadata...
Comment 17•8 years ago
It seems CrashReporter::gExceptionHandler is null, in which case AnnotateCrashReport() returns early.
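In other words, annotations are silently dropped before the exception handler is set up; a simplified sketch of that guard (not the actual crash-reporter code):

#include <string>

// Simplified: gExceptionHandler is set when the crash reporter starts.
static void* gExceptionHandler = nullptr;

void AnnotateCrashReport(const std::string& aKey, const std::string& aValue) {
  if (!gExceptionHandler) {
    return;  // no handler yet: the annotation never reaches the report
  }
  // ... otherwise the key/value pair would be recorded for the next minidump ...
  (void)aKey;
  (void)aValue;
}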
Comment 18•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #17)
> It seems CrashReporter::gExceptionHandler is null, which
> AnnotateCrashReport() returns early.
Another possibility is that CrashReporterHelperRunnable hasn't had a chance to run on the main thread. None of the 4 reports in comment 16 OOMed on the main thread.
Comment 19•8 years ago
Even with bug 1301022, bp-9b712aed-dd03-4b99-b181-a82a32160923 still doesn't have the added annotations...
Comment 20•8 years ago
bp-b4853361-f3cf-4609-b155-d20b82161021 has:
NumberOfPendingIPC 65535
TopPendingIPCCount 65351
TopPendingIPCName ???
TopPendingIPCType 0x440003
which seems to be PBluetoothRequestEnd, but that doesn't make sense. Bug 1312326 could get us the message name.
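For reference, a rough decode of that type value, assuming IPDL message IDs keep the protocol's message-start index in the high 16 bits and the message offset in the low 16 bits (an assumption about the layout, not taken from this bug):

#include <cstdint>
#include <cstdio>

int main() {
  uint32_t type = 0x440003;                // TopPendingIPCType from the report above
  uint32_t protocolIndex = type >> 16;     // 0x44 == 68: index into the message-start enum
  uint32_t messageOffset = type & 0xFFFF;  // 3: message number within that protocol
  // Mapping 68 back to a protocol name needs the enum from the matching build,
  // which is why the first guess here was easy to get wrong (see comment 21).
  std::printf("protocol index %u, message offset %u\n",
              static_cast<unsigned>(protocolIndex),
              static_cast<unsigned>(messageOffset));
  return 0;
}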
Comment 21•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #20)
> which seems is PBluetoothRequestEnd but doesn't make sense.
My bad math, it is PCompositorBridge::Msg_DidComposite__ID.
Comment 22•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #21)
> (In reply to Ting-Yu Chou [:ting] from comment #20)
> > which seems to be PBluetoothRequestEnd, but that doesn't make sense.
>
> My bad math, it is PCompositorBridge::Msg_DidComposite__ID.
Matt, do you have any ideas how this message could flood?
There's an OOM when pushing a message to the pending queue; there were 65535 pending messages and 65351 of them were this message.
Flags: needinfo?(matt.woodrow)
Comment 23•8 years ago
(In reply to Ting-Yu Chou [:ting] from comment #22)
> (In reply to Ting-Yu Chou [:ting] from comment #21)
> > (In reply to Ting-Yu Chou [:ting] from comment #20)
> > > which seems to be PBluetoothRequestEnd, but that doesn't make sense.
> >
> > My bad math, it is PCompositorBridge::Msg_DidComposite__ID.
>
> Matt, do you have any ideas how this message could flood?
>
> There's an OOM when pushing a message to the pending queue; there were 65535
> pending messages and 65351 of them were this message.
We send one of these back to the content process and main thread of the UI process each time we composite.
At 60fps (possible if a page with an animation is left open), that would require the receiving thread to be blocked for almost 20 minutes.
I guess it's also possible if we're just processing messages more slowly than they are being received; it seems like it would take a long time to accumulate that many, though.
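For the record, the arithmetic behind that estimate: 65,351 DidComposite messages at 60 per second is 65,351 / 60 ≈ 1,089 seconds, i.e. roughly 18 minutes of the receiving thread not draining the queue.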
Flags: needinfo?(matt.woodrow)
Comment 24•8 years ago
I am not sure whether PCompositorBridge::Msg_DidComposite__ID is the only message that could fill up the queue; I will file a bug for it.
Updated•8 years ago
Crash Signature: [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ]
[@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ] → [@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::_Allocate ]
[@ OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::deque<T>::_Growmap ]
[@ OOM | unknown | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc …
Updated•7 years ago
Assignee: janus926 → nobody
Updated•7 years ago
Priority: P1 → P3
Comment 25•4 years ago
Closing because no crashes have been reported for 12 weeks.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME