Open Bug 1513458 Opened Last year Updated 5 months ago

Crash in mozilla::net::HttpChannelChild::OnStartRequest

Categories

(Core :: Networking: HTTP, defect, P3, critical)

Hardware: Unspecified
OS: Linux

Tracking


Tracking Status
firefox-esr60 --- unaffected
firefox64 --- unaffected
firefox65 --- unaffected
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- fix-optional
firefox70 --- fix-optional

People

(Reporter: calixte, Assigned: keeler)

References

(Blocks 1 open bug)

Details

(Keywords: crash, leave-open, regression, Whiteboard: [necko-triaged])

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is
report bp-14c80ad8-46f8-4639-b1c0-286920181211.
=============================================================

Top 10 frames of crashing thread:

0 libxul.so mozilla::net::HttpChannelChild::OnStartRequest netwerk/protocol/http/HttpChannelChild.cpp:551
1 libxul.so mozilla::net::StartRequestEvent::Run netwerk/protocol/http/HttpChannelChild.cpp:432
2 libxul.so mozilla::net::ChannelEventQueue::RunOrEnqueue netwerk/ipc/ChannelEventQueue.h:200
3 libxul.so mozilla::net::HttpChannelChild::RecvOnStartRequest netwerk/protocol/http/HttpChannelChild.cpp:487
4 libxul.so mozilla::net::PHttpChannelChild::OnMessageReceived ipc/ipdl/PHttpChannelChild.cpp:743
5 libxul.so mozilla::dom::PContentChild::OnMessageReceived ipc/ipdl/PContentChild.cpp:5417
6 libxul.so mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2159
7 libxul.so mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1935
8 libxul.so mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:299
9 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1157

=============================================================

There is 1 crash in nightly 66 with buildid 20181211093801. Based on the backtrace, the regression may have been introduced by patch [1], which fixed bug 1380148.

[1] https://hg.mozilla.org/mozilla-central/rev?node=bf2f21326169
Flags: needinfo?(valentin.gosu)
Priority: -- → P1
Whiteboard: [necko-triaged]
Assignee: nobody → valentin.gosu
Flags: needinfo?(valentin.gosu)
Crash Signature: [@ mozilla::net::HttpChannelChild::OnStartRequest] → [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run]
This is still showing up in nightly, but in very low volume.
I'm not sure if this issue is caused by switching to a profile that still has an older representation of the security info, or if there's a bug with serializing the info into a representation that can't be deserialized (which would mean a bug in TransportSecurityInfo::Read/ReadSSLStatus).
The representation changed in bug 1468222.

If this does happen when switching profiles, then I'm really confused about where the serialization is coming from. If it were from the cache, the parent process should have produced a null securityInfo, since the call to deserialize would have failed in the parent.

Dana, do you have any suggestion about what we should do here?
Remove the assertion? Land some code to annotate the crash with the securityInfo that's failing so we get a better idea of what's going on? Or maybe something else?
Flags: needinfo?(dkeeler)

I'm not seeing a way in which the current serialization could create something the deserialization will fail to decode. It's possible that the current deserialization code doesn't properly handle all possible previous representations, but as you say presumably the parent would have already asserted at this point. Annotating the crash might help, although I imagine there's a limit on the size of the annotation? (a serialized TransportSecurityInfo can be a few kB)
My only concern with adding this annotation would be that doing so could reveal domain names users are visiting, which I think normally we don't make available to everyone by default.

Flags: needinfo?(dkeeler)

Wild theory: could this be caused by memory error/bit flips?

We could try to narrow the error down by adding diagnostic assertions to all the error checks in secinfo::read() (the deserialization code).
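
For illustration, a minimal sketch of what such diagnostic assertions could look like. The function name, parameters, and the exact fields read here are made up; only the pattern of asserting after every fallible read matters.

#include "mozilla/Assertions.h"
#include "nsIObjectInputStream.h"
#include "nsString.h"

// Sketch only: a hypothetical excerpt of a deserialization routine, with a
// MOZ_DIAGNOSTIC_ASSERT after each fallible read so that a failing step shows
// up as a distinct crash reason instead of a silent early return.
nsresult ReadSecurityInfoChecked(nsIObjectInputStream* aStream,
                                 uint32_t* aSecurityState,
                                 nsCString& aSerializedCert) {
  nsresult rv = aStream->Read32(aSecurityState);
  MOZ_DIAGNOSTIC_ASSERT(NS_SUCCEEDED(rv), "Deserialization should not fail");
  if (NS_FAILED(rv)) {
    return rv;
  }

  rv = aStream->ReadCString(aSerializedCert);
  MOZ_DIAGNOSTIC_ASSERT(NS_SUCCEEDED(rv), "Deserialization should not fail");
  if (NS_FAILED(rv)) {
    return rv;
  }

  return NS_OK;
}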

Selena pointed out that an increase in reports here correlates pretty well with the shield study for increasing the number of content processes to 8 (see 1511183 comment 11). Correlation != causation, of course, but I wonder if this is one of the crashes wbeard noticed increasing.

Flags: needinfo?(wbeard)
Flags: needinfo?(erahm)

(In reply to Andrew Overholt [:overholt] from comment #6)

Selena pointed out that an increase in reports here correlates pretty well with the shield study for increasing the number of content processes to 8 (see 1511183 comment 11). Correlation != causation, of course, but I wonder if this is one of the crashes wbeard noticed increasing.

The shield study was on release (64); AFAICT this spike was on Nightly. 8 content processes have also been on Nightly since 64.

Flags: needinfo?(erahm)
Flags: needinfo?(wbeard)
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/243733368484
Add more specific MOZ_DIAGNOSTIC_ASSERTS to TransportSecurityInfo::Read in order to pinpoint crashes r=mayhemer

There are 6 crashes with: MOZ_RELEASE_ASSERT(((bool)(__builtin_expect(!!(!NS_FAILED_impl(rv)), 1)))) (Deserialization should not fail)
and 4 with MOZ_RELEASE_ASSERT(false) (Deserialization should not fail).
Some reports: https://bit.ly/2FTvgNY

Crash Signature: [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run] → [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run] [@ mozilla::psm::TransportSecurityInfo::Read]

So it seems the faulty securityInfo is indeed coming from the cache. I don't know if there's a way of knowing if the cache entry is for a regular web page or for something generated by a service worker.
I see us failing at multiple points. This might mean either that the cache entry was corrupted on disk, or that we serialized it incorrectly.
However, looking through the code, it seems that whenever we fail to parse the securityInfo in OpenCacheInputStream we doom the entry and return the error, so a bad serialization couldn't be propagated to the child unless there's a bug in TransportSecurityInfo::Write.
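
For reference, a minimal sketch of the doom-on-parse-failure pattern described above, assuming a "security-info" metadata key and the generic NS_DeserializeObject helper; this is not a copy of the real OpenCacheInputStream.

#include "nsICacheEntry.h"
#include "nsISupports.h"
#include "nsSerializationHelper.h"  // NS_DeserializeObject
#include "nsString.h"

// Sketch only: if the cached security-info metadata can't be deserialized,
// doom the entry and propagate the error instead of handing the bad
// serialization to the child process.
nsresult ReadCachedSecurityInfo(nsICacheEntry* aEntry,
                                nsISupports** aSecurityInfo) {
  nsCString serialized;
  nsresult rv = aEntry->GetMetaDataElement("security-info",  // assumed key name
                                           getter_Copies(serialized));
  if (NS_FAILED(rv) || serialized.IsEmpty()) {
    return NS_ERROR_NOT_AVAILABLE;
  }

  rv = NS_DeserializeObject(serialized, aSecurityInfo);
  if (NS_FAILED(rv)) {
    // Corrupt or unreadable entry: doom it so it can't be served again.
    aEntry->AsyncDoom(nullptr);
    return rv;
  }
  return NS_OK;
}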

My other theory was that we call OverrideSecurityInfo with the wrong object, but given the places where we assert, that doesn't seem likely.
Dana, any ideas here?

Flags: needinfo?(dkeeler)

(In reply to Calixte Denizet (:calixte) from comment #11)

There are 6 crashes with: MOZ_RELEASE_ASSERT(((bool)(__builtin_expect(!!(!NS_FAILED_impl(rv)), 1)))) (Deserialization should not fail)
and 4 with MOZ_RELEASE_ASSERT(false) (Deserialization should not fail).
Some reports: https://bit.ly/2FTvgNY

I can't see those reports for some reason. Is there some special permission my account would need to do so?
(I'll hopefully be able to provide a useful answer to comment 12 if I can have a look at some of the diagnostic assertions)

Flags: needinfo?(cdenizet)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #13)

I can't see those reports for some reason. Is there some special permission my account would need to do so?

I can see them in private browsing, so it should work. Here is a non-bitly link

Can you link to some reports directly? I'm consistently getting "no results were found" with this profile, a clean profile, a different computer, and on the vpn.

Thanks. The ones of those that aren't content-process crashes can, I think, be ignored: if the cached version is too old, deserialization will fail in the parent, but if I understand correctly, that's not what we're concerned about here. Would it be possible to a) make this crash in content processes only, and b) use MOZ_CRASH_ANNOTATE with the first 1024 bytes of the base64 encoding of the data we're trying to deserialize? (To avoid including sensitive data, we could limit it to ~76 bytes of (binary) data, which in a valid serialization would be as much as we could get before hitting e.g. the serial number of the server's certificate.)
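
A sketch of what such a size-limited annotation could look like, using MOZ_CRASH_UNSAFE_PRINTF as a stand-in for the MOZ_CRASH_ANNOTATE approach suggested above. The helper name and the exact handling of the 76-byte cap are assumptions, not taken from any actual patch.

#include <algorithm>
#include "mozilla/Assertions.h"
#include "mozilla/Base64.h"
#include "nsError.h"
#include "nsString.h"

// Sketch only: crash with a truncated, base64-encoded prefix of the data that
// failed to deserialize, so the report carries some of the bad serialization
// without exposing e.g. the server certificate's serial number or host names.
void CrashWithSerializationPrefix(const nsACString& aSerializedData) {
  // Cap the raw data at ~76 bytes so the annotation stops before fields that
  // could identify the site (mirrors the limit suggested in the comment above).
  const size_t kMaxBytes = 76;
  size_t len = std::min(static_cast<size_t>(aSerializedData.Length()), kMaxBytes);
  nsAutoCString prefix;
  prefix.Assign(Substring(aSerializedData, 0, len));

  nsAutoCString encoded;
  if (NS_FAILED(mozilla::Base64Encode(prefix, encoded))) {
    MOZ_CRASH("Deserialization failed (and encoding the prefix failed too)");
  }
  // Dynamic crash reason; a real patch might use MOZ_CRASH_ANNOTATE or a
  // dedicated crash annotation instead.
  MOZ_CRASH_UNSAFE_PRINTF("Deserialization failed; data prefix: %s",
                          encoded.get());
}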

Flags: needinfo?(dkeeler)
Flags: needinfo?(cdenizet)

I downloaded a few of the raw data dumps a while ago, and some of them do actually include some of the serialization. I'm away from my Windows machine now, so I can't point you to the exact crash reports, but hopefully such reports aren't too rare or hard to find.

Duplicate of this bug: 1522023
Duplicate of this bug: 1522459

Adding a similar signature, which has MOZ_RELEASE_ASSERT(((bool)(__builtin_expect(!!(!NS_FAILED_impl(rv)), 1)))) (Deserialization should not fail) as the Moz Crash Reason.

Crash Signature: [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run] [@ mozilla::psm::TransportSecurityInfo::Read] → [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run] [@ mozilla::psm::TransportSecurityInfo::Read] [@ mozilla::psm::TransportSecurityInfo::ReadSSLStatus]

Dana, were you able to get some data from the crashes?
I noticed that some of the crashes happen here, and that code changed recently in bug 1496340. Do you think that could be the problem (or part of it)?

Flags: needinfo?(dkeeler)

Turns out I don't have the account permissions to download raw dumps. Would you be able to send me a serialization that's failing?
In terms of the changes from bug 1496340, I'm still not seeing a way we could serialize something that we would then fail to deserialize.

Flags: needinfo?(dkeeler)

Tracking for 66 since the crash rate is high for early beta (Dev Edition).

I found a dump that contained some of the serialization of the securityInfo and sent it to Dana via email.

This doesn't need to be tracked for release, as the crashes are caused by MOZ_DIAGNOSTIC_ASSERT, which is not enabled in release builds.
The asserts are needed to investigate why the deserialization of TransportSecurityInfo sometimes fails.

Removing tracking flags as per comment #27

This reproduces consistently in one of my test profiles, e.g. bp-951e8661-c2be-4892-a3b0-fa0350190215; loading https://support.mozilla.org is sufficient. Let me know if you want a copy of the profile.

Flags: needinfo?(valentin.gosu)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #29)

This reproduces consistently in one of my test profiles, e.g. bp-951e8661-c2be-4892-a3b0-fa0350190215; loading https://support.mozilla.org is sufficient. Let me know if you want a copy of the profile.

Yes, please! It seems to be a different assert than most others we get, but it would be really useful to see how we got into this state.

Flags: needinfo?(valentin.gosu)

Dana, can you take this bug? I am rather busy with fission work and don't have time to investigate.
I forwarded you the profile via email.

Assignee: valentin.gosu → dkeeler

(In reply to Valentin Gosu [:valentin] from comment #30)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #29)

This reproduces consistently in one of my test profiles, e.g. bp-951e8661-c2be-4892-a3b0-fa0350190215; loading https://support.mozilla.org is sufficient. Let me know if you want a copy of the profile.

Yes, please! It seems to be a different assert than most others we get, but it would be really useful to see how we got into this state.

Crashes in the parent process are unrelated to this issue. In this case, we're trying to deserialize a security info serialization from around 2013, so I'm not surprised it's failing. Attachment 9045072 [details] should filter out cases like this.

Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2ca136370e18
update diagnostic assertions to only assert in content processes r=valentin
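
A minimal sketch of the shape of that change, assuming a hypothetical helper around the XRE_IsContentProcess() check; this is not the actual patch.

#include "mozilla/Assertions.h"
#include "nsError.h"
#include "nsXULAppAPI.h"  // XRE_IsContentProcess()

// Sketch only: in the parent, a stale on-disk serialization failing to parse
// is expected and recoverable, so only treat a deserialization failure as
// fatal in content processes.
static void AssertDeserializationSucceededInContent(nsresult aRv) {
  if (XRE_IsContentProcess()) {
    MOZ_DIAGNOSTIC_ASSERT(NS_SUCCEEDED(aRv),
                          "Deserialization should not fail");
  }
}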

Crashes resulting from the diagnostic assertions added in 2ca136370e18 suggest that certificate decoding is failing in the content process (which seems impossible, given that presumably we successfully decoded the very same certificate in the parent). This should tell us what error code NSS is returning when this happens, which may illustrate the issue.
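
A sketch of how the NSS error code could be surfaced in the crash reason, assuming a hypothetical helper around the standard NSS/NSPR calls CERT_NewTempCertificate and PR_GetError; this is not the actual patch.

#include "cert.h"     // CERT_NewTempCertificate, CERT_GetDefaultCertDB
#include "prerror.h"  // PR_GetError
#include "mozilla/Assertions.h"

// Sketch only: if decoding the DER certificate from the serialization fails,
// crash with the NSS error code (e.g. SEC_ERROR_REUSED_ISSUER_AND_SERIAL) so
// the report explains why decoding failed.
void DecodeCertOrExplain(SECItem* aDerCert) {
  CERTCertificate* cert = CERT_NewTempCertificate(
      CERT_GetDefaultCertDB(), aDerCert, /* nickname */ nullptr,
      /* isperm */ PR_FALSE, /* copyDER */ PR_TRUE);
  if (!cert) {
    // In a real patch this would be gated to content processes, as above.
    MOZ_CRASH_UNSAFE_PRINTF("certificate decoding failed: NSS error %d",
                            PR_GetError());
  }
  CERT_DestroyCertificate(cert);
}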

Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b59840ac7358
attempt to determine why certificate decoding would fail in content processes r=mayhemer

https://crash-stats.mozilla.com/report/index/de7c9aca-1f0a-46f7-b0ea-a56430190426 says the error is SEC_ERROR_REUSED_ISSUER_AND_SERIAL, which is maybe plausible. One way I could see this happening is:

  • user connects to a site, caches some resources (including the server certificate)
  • user closes the browser
  • site changes its certificate but re-uses the issuer/serial (which violates the Baseline Requirements, so no public CA is likely to do this, but it appears to happen occasionally in non-public PKIs or on e.g. routers)
  • user connects to the site again. If the parent process never deserializes the original cached server certificate (I don't know whether this would be the case), it connects just fine, because it doesn't otherwise know that the original certificate exists. Presumably the child process will at some point see the new certificate. Then, if the child process tries to load the security info of a cached resource, it will also see the original certificate and fail to decode it, because NSS treats this situation as absolutely forbidden (indeed, it basically has to, or it would break).

Due to the architecture of NSS, this is not something we're going to fix at that level, so we have to try and work around it.

I guess one question I have is why does HttpChannelChild need to deserialize the securityInfo anyway? Does anything in the content process use it?

Flags: needinfo?(kershaw)

I guess one question I have is why does HttpChannelChild need to deserialize the securityInfo anyway? Does anything in the content process use it?

I think the main reason is that the nsIChannel interface [1] requires the channel to return a securityInfo. According to the searchfox results [2], there could be a lot of places in the content process using it.

[1] https://searchfox.org/mozilla-central/rev/66086345467c69685434dd1c5177b30a7511b1a5/netwerk/base/nsIChannel.idl#96
[2] https://searchfox.org/mozilla-central/search?q=symbol:%23securityInfo%2C_ZN10nsIChannel15GetSecurityInfoEPP11nsISupports&redirect=false
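
For illustration, a typical consumer pattern in the content process might look like the following sketch; it is a generic example, not any specific caller from the searchfox results.

#include "nsCOMPtr.h"
#include "nsIChannel.h"
#include "nsITransportSecurityInfo.h"

// Sketch only: content-side code typically pulls the (deserialized) security
// info off the channel and QIs it, e.g. to nsITransportSecurityInfo, to
// inspect the connection's security state.
void InspectChannelSecurity(nsIChannel* aChannel) {
  nsCOMPtr<nsISupports> securityInfo;
  if (NS_FAILED(aChannel->GetSecurityInfo(getter_AddRefs(securityInfo))) ||
      !securityInfo) {
    return;  // e.g. a plain http:// load has no security info
  }
  nsCOMPtr<nsITransportSecurityInfo> tsi = do_QueryInterface(securityInfo);
  if (tsi) {
    uint32_t state = 0;
    tsi->GetSecurityState(&state);  // STATE_IS_SECURE / _BROKEN / _INSECURE bits
  }
}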

Flags: needinfo?(kershaw)
Duplicate of this bug: 1541318
Crash Signature: [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run] [@ mozilla::psm::TransportSecurityInfo::Read] [@ mozilla::psm::TransportSecurityInfo::ReadSSLStatus] → [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run] [@ mozilla::psm::TransportSecurityInfo::Read] [@ mozilla::psm::TransportSecurityInfo::ReadSSLStatus] [@ mozilla::psm::TransportSecurityInfo::Read(nsIObjectInput…

I see a set of crashes that have the assert in this bug: https://bit.ly/2LnINl1

Crash Signature: [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run] [@ mozilla::psm::TransportSecurityInfo::Read] [@ mozilla::psm::TransportSecurityInfo::ReadSSLStatus] [@ mozilla::psm::TransportSecurityInfo::Read(nsIObjectInput… → [@ mozilla::net::HttpChannelChild::OnStartRequest] [@ mozilla::net::StartRequestEvent::Run] [@ mozilla::psm::TransportSecurityInfo::Read] [@ mozilla::psm::TransportSecurityInfo::ReadSSLStatus] [@ mozilla::psm::TransportSecurityInfo::Read(nsIObjectInpu…
Priority: P1 → P3

Crashes are down to only a handful per day now; wontfix for 68.

This continues into 70, but most of the crashes appear to come from single installations.

Low-volume crash; marking fix-optional to remove it from weekly triage.
