Closed Bug 1439236 Opened 7 years ago Closed 7 years ago

Crash in m_copym

Categories

(Core :: WebRTC: Networking, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla62
Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 --- fixed
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 + wontfix
firefox61 + fixed
firefox62 --- fixed

People

(Reporter: philipp, Assigned: drno)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-4d2fe8c8-b660-4bfb-8894-64ede0180211. ============================================================= Top 10 frames of crashing thread: 0 xul.dll m_copym netwerk/sctp/src/user_mbuf.c:996 1 xul.dll sctp_copy_mbufchain netwerk/sctp/src/netinet/sctp_output.c:6898 2 xul.dll sctp_med_chunk_output netwerk/sctp/src/netinet/sctp_output.c:9072 3 xul.dll sctp_chunk_output netwerk/sctp/src/netinet/sctp_output.c:10490 4 xul.dll sctp_common_input_processing netwerk/sctp/src/netinet/sctp_input.c:6130 5 xul.dll usrsctp_conninput netwerk/sctp/src/user_socket.c:3237 6 xul.dll mozilla::DataChannelConnection::SctpDtlsInput netwerk/sctp/datachannel/DataChannel.cpp:815 7 xul.dll sigslot::signal3<mozilla::TransportLayer*, unsigned char const*, unsigned int, sigslot::single_threaded>::operator media/mtransport/sigslot.h:2486 8 xul.dll mozilla::TransportFlow::PacketReceived media/mtransport/transportflow.cpp:252 9 xul.dll sigslot::signal3<mozilla::TransportLayer*, unsigned char const*, unsigned int, sigslot::single_threaded>::operator media/mtransport/sigslot.h:2486 ============================================================= this is a cross-platform crash signature, happening in the content process most of the time.
Byron, is this something you could take a look at? It's weird. Starting on Dec 27th across seemingly all versions. Could the start of this correlate to the release of a Firefox or chrome version? I'm thinking something at the remote side could be triggering it.
Rank: 12
Flags: needinfo?(docfaraday)
Priority: -- → P2
the date in december is a red herring in this case, because in bug 1427111 we've purged all crash reports recorded before that date. this crash signature would have been around for longer than that.
Looks like a nullptr crash to me at first glance. I can try to give it a closer look, but I need to spend some time digging us out from underneath all the oranges...
Feel free to give it to someone else that knows the area and who is not on PTO :-)
the crash signature here is spiking up after 59 has gone to the release population - it would be great if there was a fix in time for version 60 at least.
So this is caused when we have a sctp_association.send_queue that contains a sctp_tmit_chunk where sctp_tmit_chunk.data is null. I see at least one place where this can happen (and this place also checks whether |data| is null to begin with, so it may be expected that sometimes |data| will be null while the chunk is in the send_queue). Iterating over |send_queue|: https://searchfox.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_output.c#6712 Checking if |data| is null: https://searchfox.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_output.c#6716 Releasing/nulling |data| without removing |chk| from |send_queue|: https://searchfox.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_output.c#6723 This hints to me that we just need to add a check on |data|, maybe here: https://searchfox.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_output.c#7040 What do you think?
Flags: needinfo?(docfaraday) → needinfo?(lennart.grahl)
I need to pass this along to Michael. Btw. this is what has changed since our last usrsctp update: https://github.com/sctplab/usrsctp/compare/0e076261b832121cf120ddc04aaff87ac3a34d30...HEAD
Flags: needinfo?(lennart.grahl) → needinfo?(tuexen)
I'll take a look and try to figure out if I can reproduce and fix the issue.
Flags: needinfo?(tuexen)
This issue has also been reported by Chrome. However, my analysis given in https://github.com/sctplab/usrsctp/issues/160 is still valid... Does FF provide some API how we can get some additional information when this problem occurs?
(In reply to Michael Tüxen from comment #9) > This issue has also been reported by Chrome. However, my analysis given in > https://github.com/sctplab/usrsctp/issues/160 is still valid... I guess I can try looking through the code and seeing if there's anywhere else that has this bug. > Does FF provide some API how we can get some additional information when > this problem occurs? I am not aware of any Firefox API that would give you more info when a third-party library crashes than what you get already with crash-stats.
(In reply to Byron Campen [:bwc] from comment #10) > (In reply to Michael Tüxen from comment #9) > > This issue has also been reported by Chrome. However, my analysis given in > > https://github.com/sctplab/usrsctp/issues/160 is still valid... > > I guess I can try looking through the code and seeing if there's anywhere > else that has this bug. I don't understand the above sentence. But it seems to be a problem showing up in different environments... So I would say it is a bug in usrsctp. However, the data pointer SHOULD NOT be NULL when the chunk is in the send queue. It MAY be null, if the chunk is in the sent queue. Please note that there is a know bug (we are working on it) when you use the buffer priority policy, but this is NOT used within WebRTC. So if you find a place where the pointer is set to NULL for a chunk being in the send queue and not using the buffer priority policy, I would like to know, since might be the root cause. I have not seen this bug in my testing... > > > Does FF provide some API how we can get some additional information when > > this problem occurs? > > I am not aware of any Firefox API that would give you more info when a > third-party library crashes than what you get already with crash-stats. OK. Thanks. I contacted someone for the github issue, who was able to provide access to stack variable. Unfortunately, he can't provide any information about heap variables...
Is anyone still looking at this? This crash is still around across all channels and platforms.
Flags: needinfo?(drno)
I can provide a workaround (avoiding the crash) but have no idea how the problem can happen, so I can't really fix it. Should I provide a work around?
(In reply to Michael Tüxen from comment #13) > I can provide a workaround (avoiding the crash) but have no idea how the > problem can happen, > so I can't really fix it. Should I provide a work around? Yes if there is a workaround it might be worth trying it. Thanks
Flags: needinfo?(drno)
Flags: needinfo?(tuexen)
Please try: https://github.com/sctplab/usrsctp/commit/d89882d04900c80860874b5eb389b3ed3a0bca3d which avoids dereferencing a pointer, which is NULL. However, it does not fix the real cause of the problem, since I have not been able to reproduce this. This problem was also reported in https://github.com/sctplab/usrsctp/issues/160.
Flags: needinfo?(tuexen)
I have a couple of crash reports like this in Nightly from my own usage. Most (if not all) of my crashes involve me dragging a file from macOS' finder to Nightly's content area. And it's not always reproducible, but it often reproduces after a couple of hours of usage. I don't know if it helps, but here are my reports: https://crash-stats.mozilla.com/report/index/d1133362-3abd-4565-9b27-d39df0180507 https://crash-stats.mozilla.com/report/index/8cba118b-b7d2-4f8f-9306-8ede00180427 https://crash-stats.mozilla.com/report/index/37ae14ed-9b28-406b-af81-421a10180507 I hope this helps.
Nils, can we get the upstream patch landed and see if it helps?
Flags: needinfo?(drno)
Assignee: nobody → drno
Flags: needinfo?(drno)
Attachment #8976268 - Flags: review?(dminor)
Comment on attachment 8976268 [details] Bug 1439236: exit early if m_copym gets called with null ptr https://reviewboard.mozilla.org/r/244434/#review250672 ::: commit-message-3c9d6:2 (Diff revision 1) > +Bug 1439236: exit early if m_copym gets called with null ptr > + Please add a reference to the upstream commit in your commit message: https://github.com/sctplab/usrsctp/commit/d89882d04900c80860874b5eb389b3ed3a0bca3d
Attachment #8976268 - Flags: review?(dminor) → review+
Pushed by drno@ohlmeier.org: https://hg.mozilla.org/integration/autoland/rev/2755492fe3d1 exit early if m_copym gets called with null ptr r=dminor
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
No crashes on Nightly since this landed. Worth an uplift request for Beta?
Flags: needinfo?(drno)
Comment on attachment 8976268 [details] Bug 1439236: exit early if m_copym gets called with null ptr Approval Request Comment [Feature/Bug causing the regression]: Probably the sctp in bug 1297418 [User impact if declined]: Crashes in webrtc data channels [Is this code covered by automated tests?]: The code in general yes, but not the crash. [Has the fix been verified in Nightly?]: Since we don't know how repro not really. But crashstats in Nightly look good since the patch landed. [Needs manual test from QE? If yes, steps to reproduce]: No, no known steps to repro [List of other uplifts needed for the feature/fix]: N/A [Is the change risky?]: No [Why is the change risky/not risky?]: It only adds extra safety checks to catch/avoid null pointer crashes. [String changes made/needed]: N/A
Flags: needinfo?(drno)
Attachment #8976268 - Flags: approval-mozilla-beta?
I just discovered this one crash which didn't happen on the timer thread, but after receiving an incoming packet while trying to generate an outgoing packet https://crash-stats.mozilla.com/report/index/22c9bcf8-296d-4da2-9f7d-d182f0180518 Michael does this provide any more insight into how this nullptr can happen?
Flags: needinfo?(tuexen)
Comment on attachment 8976268 [details] Bug 1439236: exit early if m_copym gets called with null ptr Adds some extra safety checks to avoid webrtc crashes. Approved for 61.0b9 and ESR 60.1.
Attachment #8976268 - Flags: approval-mozilla-esr60+
Attachment #8976268 - Flags: approval-mozilla-beta?
Attachment #8976268 - Flags: approval-mozilla-beta+
(In reply to Nils Ohlmeier [:drno] from comment #26) > I just discovered this one crash which didn't happen on the timer thread, > but after receiving an incoming packet while trying to generate an outgoing > packet > https://crash-stats.mozilla.com/report/index/22c9bcf8-296d-4da2-9f7d- > d182f0180518 > > Michael does this provide any more insight into how this nullptr can happen? Unfortunately not... The cause of the problem is when the pointer becomes NULL. The problem gets visible, when the code hits the assumption that the pointer is not NULL. Best regards Michael
Flags: needinfo?(tuexen)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: