Status

()

defect
P2
critical
Rank:
12
RESOLVED FIXED
a year ago
11 months ago

People

(Reporter: philipp, Assigned: drno)

Tracking

({crash, regression})

Trunk
mozilla62
Points:
---

Firefox Tracking Flags

(firefox-esr52 wontfix, firefox-esr60 fixed, firefox58 wontfix, firefox59 wontfix, firefox60+ wontfix, firefox61+ fixed, firefox62 fixed)

Details

(crash signature)

Attachments

(1 attachment)

(Reporter)

Description

a year ago
This bug was filed from the Socorro interface and is
report bp-4d2fe8c8-b660-4bfb-8894-64ede0180211.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll m_copym netwerk/sctp/src/user_mbuf.c:996
1 xul.dll sctp_copy_mbufchain netwerk/sctp/src/netinet/sctp_output.c:6898
2 xul.dll sctp_med_chunk_output netwerk/sctp/src/netinet/sctp_output.c:9072
3 xul.dll sctp_chunk_output netwerk/sctp/src/netinet/sctp_output.c:10490
4 xul.dll sctp_common_input_processing netwerk/sctp/src/netinet/sctp_input.c:6130
5 xul.dll usrsctp_conninput netwerk/sctp/src/user_socket.c:3237
6 xul.dll mozilla::DataChannelConnection::SctpDtlsInput netwerk/sctp/datachannel/DataChannel.cpp:815
7 xul.dll sigslot::signal3<mozilla::TransportLayer*, unsigned char const*, unsigned int, sigslot::single_threaded>::operator media/mtransport/sigslot.h:2486
8 xul.dll mozilla::TransportFlow::PacketReceived media/mtransport/transportflow.cpp:252
9 xul.dll sigslot::signal3<mozilla::TransportLayer*, unsigned char const*, unsigned int, sigslot::single_threaded>::operator media/mtransport/sigslot.h:2486

=============================================================

this is a cross-platform crash signature, happening in the content process most of the time.
Byron, is this something you could take a look at?

It's weird. Starting on Dec 27th across seemingly all versions. Could the start of this correlate to the release of a Firefox or chrome version? I'm thinking something at the remote side could be triggering it.
Rank: 12
Flags: needinfo?(docfaraday)
Priority: -- → P2
(Reporter)

Comment 2

a year ago
the date in december is a red herring in this case, because in bug 1427111 we've purged all crash reports recorded before that date.
this crash signature would have been around for longer than that.
Looks like a nullptr crash to me at first glance. I can try to give it a closer look, but I need to spend some time digging us out from underneath all the oranges...
Feel free to give it to someone else that knows the area and who is not on PTO :-)
(Reporter)

Comment 5

a year ago
the crash signature here is spiking up after 59 has gone to the release population - it would be great if there was a fix in time for version 60 at least.
So this is caused when we have a sctp_association.send_queue that contains a sctp_tmit_chunk where sctp_tmit_chunk.data is null. I see at least one place where this can happen (and this place also checks whether |data| is null to begin with, so it may be expected that sometimes |data| will be null while the chunk is in the send_queue).

Iterating over |send_queue|: https://searchfox.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_output.c#6712
Checking if |data| is null: https://searchfox.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_output.c#6716
Releasing/nulling |data| without removing |chk| from |send_queue|: https://searchfox.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_output.c#6723

This hints to me that we just need to add a check on |data|, maybe here: https://searchfox.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_output.c#7040

What do you think?
Flags: needinfo?(docfaraday) → needinfo?(lennart.grahl)

Comment 7

a year ago
I need to pass this along to Michael.

Btw. this is what has changed since our last usrsctp update: https://github.com/sctplab/usrsctp/compare/0e076261b832121cf120ddc04aaff87ac3a34d30...HEAD
Flags: needinfo?(lennart.grahl) → needinfo?(tuexen)

Comment 8

a year ago
I'll take a look and try to figure out if I can reproduce and fix the issue.
Flags: needinfo?(tuexen)

Comment 9

a year ago
This issue has also been reported by Chrome. However, my analysis given in https://github.com/sctplab/usrsctp/issues/160 is still valid...

Does FF provide some API how we can get some additional information when this problem occurs?
(In reply to Michael Tüxen from comment #9)
> This issue has also been reported by Chrome. However, my analysis given in
> https://github.com/sctplab/usrsctp/issues/160 is still valid...

   I guess I can try looking through the code and seeing if there's anywhere else that has this bug.

> Does FF provide some API how we can get some additional information when
> this problem occurs?

   I am not aware of any Firefox API that would give you more info when a third-party library crashes than what you get already with crash-stats.

Comment 11

a year ago
(In reply to Byron Campen [:bwc] from comment #10)
> (In reply to Michael Tüxen from comment #9)
> > This issue has also been reported by Chrome. However, my analysis given in
> > https://github.com/sctplab/usrsctp/issues/160 is still valid...
> 
>    I guess I can try looking through the code and seeing if there's anywhere
> else that has this bug.
I don't understand the above sentence. But it seems to be a problem showing up
in different environments... So I would say it is a bug in usrsctp. However,
the data pointer SHOULD NOT be NULL when the chunk is in the send queue. It
MAY be null, if the chunk is in the sent queue.
Please note that there is a know bug (we are working on it) when you use the
buffer priority policy, but this is NOT used within WebRTC.
So if you find a place where the pointer is set to NULL for a chunk being
in the send queue and not using the buffer priority policy, I would like to
know, since might be the root cause.

I have not seen this bug in my testing...
> 
> > Does FF provide some API how we can get some additional information when
> > this problem occurs?
> 
>    I am not aware of any Firefox API that would give you more info when a
> third-party library crashes than what you get already with crash-stats.
OK. Thanks.

I contacted someone for the github issue, who was able to provide access to
stack variable. Unfortunately, he can't provide any information about heap
variables...
Is anyone still looking at this?  This crash is still around across all channels and platforms.
Flags: needinfo?(drno)

Comment 13

a year ago
I can provide a workaround (avoiding the crash) but have no idea how the problem can happen,
so I can't really fix it. Should I provide a work around?
(Assignee)

Comment 14

a year ago
(In reply to Michael Tüxen from comment #13)
> I can provide a workaround (avoiding the crash) but have no idea how the
> problem can happen,
> so I can't really fix it. Should I provide a work around?

Yes if there is a workaround it might be worth trying it. Thanks
Flags: needinfo?(drno)
Flags: needinfo?(tuexen)

Comment 15

a year ago
Please try:
https://github.com/sctplab/usrsctp/commit/d89882d04900c80860874b5eb389b3ed3a0bca3d
which avoids dereferencing a pointer, which is NULL. However, it does not fix the
real cause of the problem, since I have not been able to reproduce this.

This problem was also reported in https://github.com/sctplab/usrsctp/issues/160.
Flags: needinfo?(tuexen)
I have a couple of crash reports like this in Nightly from my own usage.

Most (if not all) of my crashes involve me dragging a file from macOS' finder to Nightly's content area. And it's not always reproducible, but it often reproduces after a couple of hours of usage.

I don't know if it helps, but here are my reports:
https://crash-stats.mozilla.com/report/index/d1133362-3abd-4565-9b27-d39df0180507
https://crash-stats.mozilla.com/report/index/8cba118b-b7d2-4f8f-9306-8ede00180427
https://crash-stats.mozilla.com/report/index/37ae14ed-9b28-406b-af81-421a10180507

I hope this helps.
Nils, can we get the upstream patch landed and see if it helps?
Flags: needinfo?(drno)
(Assignee)

Updated

11 months ago
Assignee: nobody → drno
Flags: needinfo?(drno)
Comment hidden (mozreview-request)
(Assignee)

Updated

11 months ago
Attachment #8976268 - Flags: review?(dminor)

Comment 20

11 months ago
mozreview-review
Comment on attachment 8976268 [details]
Bug 1439236: exit early if m_copym gets called with null ptr

https://reviewboard.mozilla.org/r/244434/#review250672

::: commit-message-3c9d6:2
(Diff revision 1)
> +Bug 1439236: exit early if m_copym gets called with null ptr
> +

Please add a reference to the upstream commit in your commit message:

https://github.com/sctplab/usrsctp/commit/d89882d04900c80860874b5eb389b3ed3a0bca3d
Attachment #8976268 - Flags: review?(dminor) → review+
Comment hidden (mozreview-request)

Comment 22

11 months ago
Pushed by drno@ohlmeier.org:
https://hg.mozilla.org/integration/autoland/rev/2755492fe3d1
exit early if m_copym gets called with null ptr r=dminor

Comment 23

11 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/2755492fe3d1
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
No crashes on Nightly since this landed. Worth an uplift request for Beta?
Flags: needinfo?(drno)
(Assignee)

Comment 25

11 months ago
Comment on attachment 8976268 [details]
Bug 1439236: exit early if m_copym gets called with null ptr

Approval Request Comment
[Feature/Bug causing the regression]: Probably the sctp in bug 1297418
[User impact if declined]: Crashes in webrtc data channels
[Is this code covered by automated tests?]: The code in general yes, but not the crash.
[Has the fix been verified in Nightly?]: Since we don't know how repro not really. But crashstats in Nightly look good since the patch landed.
[Needs manual test from QE? If yes, steps to reproduce]: No, no known steps to repro
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: No
[Why is the change risky/not risky?]: It only adds extra safety checks to catch/avoid null pointer crashes.
[String changes made/needed]: N/A
Flags: needinfo?(drno)
Attachment #8976268 - Flags: approval-mozilla-beta?
(Assignee)

Comment 26

11 months ago
I just discovered this one crash which didn't happen on the timer thread, but after receiving an incoming packet while trying to generate an outgoing packet https://crash-stats.mozilla.com/report/index/22c9bcf8-296d-4da2-9f7d-d182f0180518

Michael does this provide any more insight into how this nullptr can happen?
Flags: needinfo?(tuexen)
Comment on attachment 8976268 [details]
Bug 1439236: exit early if m_copym gets called with null ptr

Adds some extra safety checks to avoid webrtc crashes. Approved for 61.0b9 and ESR 60.1.
Attachment #8976268 - Flags: approval-mozilla-esr60+
Attachment #8976268 - Flags: approval-mozilla-beta?
Attachment #8976268 - Flags: approval-mozilla-beta+

Comment 30

11 months ago
(In reply to Nils Ohlmeier [:drno] from comment #26)
> I just discovered this one crash which didn't happen on the timer thread,
> but after receiving an incoming packet while trying to generate an outgoing
> packet
> https://crash-stats.mozilla.com/report/index/22c9bcf8-296d-4da2-9f7d-
> d182f0180518
> 
> Michael does this provide any more insight into how this nullptr can happen?

Unfortunately not... The cause of the problem is when the pointer becomes NULL. The problem gets visible, when the code hits the assumption that the pointer is not NULL.

Best regards
Michael
Flags: needinfo?(tuexen)
Keywords: regression
You need to log in before you can comment on or make changes to this bug.