Annotate crashes for pending messages in MessageChannel::OnMessageReceivedFromLink

RESOLVED FIXED in Firefox 50

Status

()

defect
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: ting, Assigned: ting)

Tracking

(Blocks 1 bug)

Trunk
mozilla50
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox50 fixed)

Details

Attachments

(1 attachment, 2 obsolete attachments)

The OOM from trying to push a message to mPending deque in MessageChannel::OnMessageReceivedFromLink() shows there're more than 10000 (could be far more) pending IPC messages. We should annotate the crashes so we can understand what are they.
Posted patch patch v1 (obsolete) — Splinter Review
The patch is for Windows because I can't get the code compiled with GCC (I'll figure out why, but I think on Windows should be enough to get us some data). And there's one thing I'm concerned that what if it's OOM again while annotating, is it possible to avoid using jemalloc at runtime?
Attachment #8771343 - Flags: review?(continuation)
Comment on attachment 8771343 [details] [diff] [review]
patch v1

Review of attachment 8771343 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for working on this. This patch should be reviewed by an IPC peer.
Attachment #8771343 - Flags: review?(continuation) → review?(wmccloskey)
The reason why this can't be compiled with GCC 4.8.3:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61426

Though it's fixed in 2014, but the target milestone is 5.0.
Comment on attachment 8771343 [details] [diff] [review]
patch v1

Review of attachment 8771343 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.h
@@ +474,5 @@
> +      T* allocate(size_t n) {
> +        void* p = ::operator new(n * sizeof(T), std::nothrow);
> +        if (!p && n) {
> +          // Count the number of each different pending message.
> +          std::map<std::string, uint32_t> m;

I suspect this won't work, for a few reasons:

1. The message name is only assigned once it has been processed on the main thread (in PContentParent::OnMessageReceived, for example--you can look for set_name calls). If a message is in the queue, then that hasn't happened yet. So all the names will be "???". You can fix this by using the message type instead. This is in msg.type(). It will be number that we'll have to look up ourselves, but that's fine.

2. If we fail to allocate the mPending vector, then I suspect there's a good chance we won't have space for |m| either. So we should try to do this without allocating. One idea is to pick a random message in the queue and store its type in the crash report. If a given message type consumes most of the queue, then it will be much more likely to be chosen.
Attachment #8771343 - Flags: review?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #4)
> yet. So all the names will be "???". You can fix this by using the message
> type instead. This is in msg.type(). It will be number that we'll have to
> look up ourselves, but that's fine.

I was considering whether to use the name or the type, and chose the name because I found it's not easy to translate a type to a message name on dxr for a specific revision which I can't do:

  https://dxr.mozilla.org/mozilla-central/rev/xxxxxx/source/obj-x86_64-pc-linux-gnu/ipc/ipdl/_ipdlheaders/IPCMessageStart.h

Do you know an easier way to look up the message type?

Note the message through ThreadLink will have a valid name.

> 
> 2. If we fail to allocate the mPending vector, then I suspect there's a good
> chance we won't have space for |m| either. So we should try to do this
> without allocating. One idea is to pick a random message in the queue and
> store its type in the crash report. If a given message type consumes most of
> the queue, then it will be much more likely to be chosen.

If the assumption is "a given message type consumes most of the queue", then the chance of no space for |m|, |v|, or |buf| might be not that high because there won't be many different types of message. Pick a random message could avoid allocation, but we won't be able to know whether it is the one consumes most of the queue, the only thing we can tell is a message seems to be flood when there's a bunch of crashes report the same message type.

:mccr8, what do you think about this?
Flags: needinfo?(continuation)
(In reply to Ting-Yu Chou [:ting] from comment #5)
> (In reply to Bill McCloskey (:billm) from comment #4)
> > yet. So all the names will be "???". You can fix this by using the message
> > type instead. This is in msg.type(). It will be number that we'll have to
> > look up ourselves, but that's fine.
> 
> I was considering whether to use the name or the type, and chose the name
> because I found it's not easy to translate a type to a message name on dxr
> for a specific revision which I can't do:
> 
>  
> https://dxr.mozilla.org/mozilla-central/rev/xxxxxx/source/obj-x86_64-pc-
> linux-gnu/ipc/ipdl/_ipdlheaders/IPCMessageStart.h
> 
> Do you know an easier way to look up the message type?

No, I don't think there's an easy way.

> Note the message through ThreadLink will have a valid name.

OK, but since the crashes in bug 1266517 are happening in the content process, that's not really relevant.
(In reply to Ting-Yu Chou [:ting] from comment #5)
> :mccr8, what do you think about this?

I don't know, sorry.
Flags: needinfo?(continuation)
I got another idea, because we won't handle any pending message, we can sort the deque |mPending| and then walk through it to find the message with the largest quantity.
Traced std::sort, Microsoft implements it as a hybrid algorithm which uses quicksort, heapsort and insertion sort, also it looks like an in-place sorting.
Posted patch patch v2 (obsolete) — Splinter Review
Attachment #8771343 - Attachment is obsolete: true
Attachment #8772719 - Flags: review?(wmccloskey)
Comment on attachment 8772719 [details] [diff] [review]
patch v2

Review of attachment 8772719 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.h
@@ +491,5 @@
> +            Message &msg = *it;
> +            if (msg.type() == curType) {
> +              ++curCount;
> +            } else {
> +              if (curCount > topCount) {

You need to do this block after the loop has exited as well (in case the last type is the top one).
Attachment #8772719 - Flags: review?(wmccloskey) → review+
Posted patch patch v3Splinter Review
Thanks for catching that.
Attachment #8772719 - Attachment is obsolete: true
Try looks good.
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/50ff01c81e09
Annotate pending IPC messages when OOM due to too many of them. r=billm
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/50ff01c81e09
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Can we uplift this to aurora and/or beta?
Flags: needinfo?(janus926)
Mind if I request uplift this after the first report comes in with the annotation?
Flags: needinfo?(janus926)
I am a bit worried as I haven't seen any report with the annotation. Though there're also only three _GrowMap() OOM from MessageLoop::PostTask_Helper() since the patch got landed.
You need to log in before you can comment on or make changes to this bug.