Closed Bug 1582647 Opened 5 months ago Closed 3 months ago

Crash in [@ mozilla::ipc::PrincipalInfoToPrincipal]

Categories

(Core :: Networking: DNS, defect, critical)

Unspecified
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla72
Tracking Status
firefox-esr68 --- unaffected
firefox70 + wontfix
firefox71 + fixed
firefox72 --- fixed

People

(Reporter: pascalc, Assigned: valentin)

References

(Blocks 1 open bug, Regression)

Details

(4 keywords)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-c08ebc62-368b-41f4-8ff4-fce2e0190919.

Top 10 frames of crashing thread:

0 xul.dll mozilla::ipc::PrincipalInfoToPrincipal ipc/glue/BackgroundUtils.cpp:119
1 xul.dll static bool mozilla::ipc::IPDLParamTraits<nsIPrincipal*>::Read dom/ipc/PermissionMessageUtils.cpp:39
2 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::dom::WindowGlobalInit>::Read ipc/ipdl/DOMTypes.cpp:1818
3 xul.dll mozilla::dom::PBrowserParent::OnMessageReceived ipc/ipdl/PBrowserParent.cpp:6217
4 xul.dll mozilla::dom::PContentParent::OnMessageReceived ipc/ipdl/PContentParent.cpp:5852
5 xul.dll mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2109
6 xul.dll mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1985
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1225
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
9 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:88

There is a big spike of this signature in both nightly 71 and 70 beta 7. See also bug 1564107 which is a similar crash affecting Thunderbird.
20 crashes in 70b6 but 618 in 70b7 so we may have uplifted a top crasher.

List of uplifts in beta 7 https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_70_0b6_RELEASE&tochange=FIREFOX_70_0b7_RELEASE

this just started spiking up yesterday on 21:00 UTC - b7 was released two days earlier, which makes me think that something external might have caused this spike. users with italian builds are predominantly affected (>60% of all crashes), but the most common reported crashing urls are about:newtab & about:blank and don't provide many clues.

the reports come with MOZ_CRASH(Base domain must be available when deserialized) from bug 1517089.

Flags: needinfo?(jvarga)

the big crashing spike has stopped again.

Removing the blocking flags as the spike has gone but let's keep tracking them for a while in case it shows up again.

I have attempted reproducing a crash with Nightly v71.0a1 from 2019-09-20 (Italian build) with no success. As suggested from crashes' comments, it appears to have some relevance with the Windows update received around 20th of September, so I have had a few updates pending for my Windows 10, which I have been installed one by one and attempted to reproduce the crash after each update. Unfortunately, I could not reproduce any kind of crash.

Could somebody read the crash reports in an attempt to deduce any kind of steps to reproduce this crash?

Flags: needinfo?(jld)

The crash relates to the base domain not matching the principal base domain. I can't think of any reson why this might be out of sync, maybe it's something weird like the Public Suffix List updated between passing it to the IPC?

https://searchfox.org/mozilla-central/source/caps/ContentPrincipal.cpp#426-432

It also looks like this would happen when the ThirdPartyUtil service isn't initialised or available too.

This code lives in ipc/ for historical reasons, but it's not really IPC, and I'm told the component most likely to have people who understand it is DOM: Security.

Component: IPC → DOM: Security
Flags: needinfo?(jld)

Yes, it's weird. If we can't reproduce it in a debug build, we will have to add a crash annotation or something to see what's wrong.
We can also change the method to avoid the base domain lookup by default and do it for quota manager and quota clients only.

Flags: needinfo?(jvarga)

The updated correlations for beta are posted below. Lots of the comments mention crashing when opening a new tab

  • Every time a new tab is opened by a mouse LMB click or wheel click the browser crash Totally unavoidable It crash every time
  • Opening e new tab by link with "ctrl" pressed

(99.71% in signature vs 09.65% overall) moz_crash_reason = MOZ_CRASH(Base domain must be available when deserialized)
(37.30% in signature vs 21.10% overall) Module "comctl32.dll" = true [92.28% vs 23.89% if platform_version = 6.1.7601 Service Pack 1]
(90.49% in signature vs 36.35% overall) build_id = 20190916074538
(05.13% in signature vs 59.08% overall) "D2D1.1?" in app_notes = true [26.33% vs 82.27% if platform_pretty_version = Windows 7]
(13.80% in signature vs 64.62% overall) cpu_arch = x86 [17.18% vs 70.32% if platform = Windows NT]
(67.54% in signature vs 15.37% overall) Addon "amazon@search.mozilla.org" = true
(00.17% in signature vs 50.14% overall) build_id = 20190919164641
(61.94% in signature vs 06.32% overall) useragent_locale = it [61.94% vs 15.36% if startup_crash = 0]
(30.52% in signature vs 05.99% overall) Module "cdp.dll" = true [87.02% vs 40.54% if platform_version = 10.0.18362]
(30.78% in signature vs 76.25% overall) Addon "twitter@search.mozilla.org" = true
(56.06% in signature vs 15.27% overall) Module "OneCoreUAPCommonProxyStub.dll" = true [100.0% vs 56.24% if platform_version = 10.0.18362]
(77.59% in signature vs 33.96% overall) ipc_fatal_error_msg = null
(22.41% in signature vs 65.93% overall) is_garbage_collecting = null
(27.59% in signature vs 71.10% overall) Addon "amazondotcom@search.mozilla.org" = true

QA Whiteboard: [qa-regression-triage]

Hi, sounds like there's a plan in comment 8, but not assigned. Is someone planning to pick this up?

Flags: needinfo?(jvarga)
Flags: needinfo?(ckerschb)

Yeah

Assignee: nobody → jvarga
Component: DOM: Security → DOM: Quota Manager
Flags: needinfo?(jvarga)
Priority: P1 → P2

(In reply to Jan Varga [:janv] from comment #11)

Yeah

Thanks Yan for taking this one.

Flags: needinfo?(ckerschb)

This looks to have improved a lot after "EARLY_BETA_OR_EARLIER" turned off. Does that help identify the cause of the spike?

Since the volume is down so far I'll go ahead and wontfix for 70.

Flags: needinfo?(jvarga)

Removing the tracking flag for 71 and marking as fix-optional since we had a handful of crashes only on nightly since the Sept 20 giant spike and our dev edition builds only had a couple crashes last week, it doesn't seem like this is going to be a problem for the release at this point.

this crash signature is spiking up again on release currently - the spike seems to mostly come from various *.wpcomstaging.com domains.

It looks like wpcomstaging.com was removed from the public suffix list on November 8th:
https://github.com/publicsuffix/list/commit/41cb01612341c7ce3bcdd0cc4e696ae9f6416600

Hi! Barry from Automattic (WordPress.com) here. As the owner of wpcomstaging.com, how can we help? We have quite a few internal Firefox users affected by this bug.

It looks like none of the crashes are related to QuotaManager. Based on the fact that we're seeing this across all branches, I think this is a transient spike caused by bug 1563246 resulting in an asymmetry where the parent process and the content process have different views of base domain lookups. (They each have their own caches). https://bugzilla.mozilla.org/show_bug.cgi?id=1563246#c0 mentioned the issue of multiprocessing sharing and races and https://phabricator.services.mozilla.com/D42470#1297238 mentioned a follow-up to deal with the issue, but I see no follow-up on the bug. I should note that even if all the processes coordinated a simultaneous switchover, it still seems unsound to change Firefox's interpretation of URL's while Firefox is running.

Moving to that bug's component and clearing priority for re-triage.

Assignee: jvarga → nobody
Component: Storage: Quota Manager → Networking: DNS
Priority: P2 → --
Regressed by: 1563246

I should note that this presumably is a self-healing problem in that each profile should experience a parent process crash at most once because there should be a consistent view of base domains on restart.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #19)

I should note that this presumably is a self-healing problem in that each profile should experience a parent process crash at most once because there should be a consistent view of base domains on restart.

In my experience of this issue there were multiple crashes and I can still crash my browser by visiting a .wpcomstaging.com subdomain - https://westiatomictwo.wpcomstaging.com/

I just submitted a crash report that references this issue in the comments.

MozCrashReason: MOZ_CRASH(Base domain must be available when deserialized)
Notes: FP(D00-L1000-W00000000-T000) WR? WR- OMTP? OMTP+3 GL Layers? GL Context? GL Context+ GL Layers+ 

i've come across reports from affected users who crashed hitting this signature numerous times as well

I am retracking for 71 given the big spike of crashes in our last 2 betas.

Hm, that's strange that it crashes multiple times.

@pa(In reply to Pascal Chevrel:pascalc from comment #22)

I am retracking for 71 given the big spike of crashes in our last 2 betas.

Thanks

(In reply to Jan Varga [:janv] from comment #23)

Hm, that's strange that it crashes multiple times.

I had it happen multiple times a day, which was pretty frustrating (I also work at Automattic).

I'm happy to test any fixes to see if the resolve the issue.

One other note here, after reading through the implementation patches for this it sounds like update process implemented should end up with a file somewhere within my Firefox profile?

From what I can tell I can't find this file - I may be looking in the wrong place because it wasn't clear from the implementation code exactly where it was being saved.

But I guess if the file isn't actually getting written to disk then this would explain why it happens repeatedly.

Adding needinfo's to the author and reviewers of bug 1563246, plus bug 1083971 which seems to be the bug on auto-updating in the first place.

Flags: needinfo?(mathieu)
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(erahm)
Flags: needinfo?(arpitbharti73)

Mathieu is probably the right person to triage this. I can help if we find that this is specific to the low-level Dafsa implementation.

Flags: needinfo?(erahm)

The Dafsa file should be downloaded in ./settings/main/public-suffix-list/dafsa.bin (btw mine was in ~/.cache/mozilla/firefox/profile/ instead of ~/.mozilla/firefox/profile/)

When the file is downloaded and written on disk, a signal is sent ("public-suffix-list-updated") and the Dafsa is reload in memory.

The crash may happen when a new PSL is published. The last updates were at those datetimes, see if you observe some spikes around those dates:

    {
      "datetime": "2019-11-21T01:04:50.454851",
      "by": "ldap:mleplatre@mozilla.com",
      "changes": 1
    },
    {
      "datetime": "2019-11-18T08:20:37.781460",
      "by": "ldap:mleplatre@mozilla.com",
      "changes": 2
    },
    {
      "datetime": "2019-11-13T11:29:09.194072",
      "by": "ldap:mleplatre@mozilla.com",
      "changes": 1
    },
    {
      "datetime": "2019-11-11T12:09:57.886129",
      "by": "ldap:mleplatre@mozilla.com",
      "changes": 1
    },
    {
      "datetime": "2019-11-05T17:24:15.024196",
      "by": "ldap:mleplatre@mozilla.com",
      "changes": 3
    },
    {
      "datetime": "2019-10-16T13:02:50.756325",
      "by": "ldap:mleplatre@mozilla.com",
      "changes": 1
    }

Let me know if you want me to dig more into Remote Settings uptake for this particular collection. I'm also working on Bug 1597337 to distinguish download error from real update error.

Flags: needinfo?(mathieu)

So all processes get the notification ?

Flags: needinfo?(mathieu)

I think Andrew is right that updating the public suffix list during runtime is a footgun. Even if we update the list in all processes, we don't (and really can't) update the base domain of existing principals, so if we try to IPDL serialize a principal which will have a different base domain before and after the PSL update, we'll crash on that assertion.

I don't know if that's actually what's happening here, but it's my best guess.

Flags: needinfo?(kmaglione+bmo)

Yeah, I also think it would be hard to address all edge cases of the current (on the fly) approach. I don't know all the implementation details of the PSL update mechanism, but can we just prepare a new Dafsa file during runtime and activate it at next startup ?

Flags: needinfo?(jvarga)

(In reply to Mathieu Leplatre [:leplatrem] from comment #28)

The Dafsa file should be downloaded in ./settings/main/public-suffix-list/dafsa.bin (btw mine was in ~/.cache/mozilla/firefox/profile/ instead of ~/.mozilla/firefox/profile/)

Ah, I found mine now - ./Library/Caches/Firefox/Profiles/lkoxassr.default/settings/main/public-suffix-list/dafsa.bin (on OS X)

(In reply to Jan Varga [:janv] from comment #31)

Yeah, I also think it would be hard to address all edge cases of the current (on the fly) approach. I don't know all the implementation details of the PSL update mechanism, but can we just prepare a new Dafsa file during runtime and activate it at next startup ?

Hey, I wrote the DAFSA updation system, the DAFSA is built with Firefox and is part of the compiled binary. The updated DAFSA is loaded from disk at every startup. I don't think we can prepare a new DAFSA at runtime, The best way to get an updated DAFSA should be through the current Remote Settings mechanism.

Currently we reload after receiving the signal and until then we use the built in DAFSA, activating on next startup would mean pointing to a new file on disk, which may or may not exist yet.

Flags: needinfo?(arpitbharti73)

Can we disable updates of the public suffix list from remote settings, and fall back to the builtin one, until this is resolved?

Nhi, given the crash volume on 70 and the spikes in 71, can we have this bug prioritized and assigned and possibly a solution found for 71/72 or a 71 dot release? Thanks

Flags: needinfo?(nhnguyen)

Jan, could you answer Julien's question in comment #34? Thanks

Flags: needinfo?(jvarga)

(In reply to Julien Cristau [:jcristau] from comment #34)

Can we disable updates of the public suffix list from remote settings, and fall back to the builtin one, until this is resolved?

Yes this can be done if we delete the record on the Remote Settings server, in case of deletion it will be removed from the users system and we should fallback on the built in dafsa.
Someone with access rights to records can trigger it, Mathieu would know better about it.

(In reply to Pascal Chevrel:pascalc from comment #36)

Jan, could you answer Julien's question in comment #34? Thanks

This belongs to Networking: DNS now (not QuotaManager).

Flags: needinfo?(jvarga)

Valentin, could you confirm whether this is an DNS bug and take it on if it is? Thanks!

Flags: needinfo?(nhnguyen) → needinfo?(valentin.gosu)

(In reply to Nhi Nguyen (:nhi) from comment #39)

Valentin, could you confirm whether this is an DNS bug and take it on if it is? Thanks!

It is a DNS bug in the sense that the code belongs in netwerk/ but from what I can tell necko people haven't worked on the code that much.
Based on comments 28..31 I think the solution here is to avoid responding to the onUpdate event.
I'll submit a patch shortly that disables that bit via a pref.

Assignee: nobody → valentin.gosu
Flags: needinfo?(valentin.gosu)

Updating the PSL while Firefox is running could cause principals to have
a different base domain before/after the update. It is therefore
preferable that we only update the PSL at startup.
See bug 1582647 comment 30

Attachment #9112067 - Attachment description: Bug 1582647 - Don't send the "public-suffix-list-updated" immediately after the PSL update r=Gijs,leplatrem → Bug 1582647 - Pref off the "public-suffix-list-updated" event
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/45a717eb127f
Pref off the "public-suffix-list-updated" event r=Gijs
Blocks: 1600043

Valentin, could you request the uplift to release? We have an RC2 coming and I think it's worth taking this patch in

Flags: needinfo?(valentin.gosu)

Gijs, could you request an uplift to RC2 as the reviewer? Thanks

Flags: needinfo?(gijskruitbosch+bugs)

Comment on attachment 9112067 [details]
Bug 1582647 - Pref off the "public-suffix-list-updated" event

Beta/Release Uplift Approval Request

  • User impact if declined: Potential crashes when Firefox loads an updated etld list and has principals in memory that are affected by the changes.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch simply disables the mechanism that loads the file into memory.
    This also means that etld changes would reach users more slowly, but that was also the case before we implemented the update mechanism.
  • String changes made/needed:
Flags: needinfo?(valentin.gosu)
Attachment #9112067 - Flags: approval-mozilla-release?
Flags: needinfo?(gijskruitbosch+bugs)

Comment on attachment 9112067 [details]
Bug 1582647 - Pref off the "public-suffix-list-updated" event

Fix for a current top crash in release, approved for our RC2, thanks!

Attachment #9112067 - Flags: approval-mozilla-release? → approval-mozilla-release+
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Flags: needinfo?(mathieu)

(In reply to Arpit Bharti [:arpit73] from comment #37)

(In reply to Julien Cristau [:jcristau] from comment #34)

Can we disable updates of the public suffix list from remote settings, and fall back to the builtin one, until this is resolved?

Yes this can be done if we delete the record on the Remote Settings server, in case of deletion it will be removed from the users system and we should fallback on the built in dafsa.
Someone with access rights to records can trigger it, Mathieu would know better about it.

Mathieu?

Flags: needinfo?(mathieu)

The feature was introduced in 70, and it's marked as wontfix.

A patch has landed in 71 and 72, that disables reloading (preventing crash)

I would let the server part the way it is, don't you think?

Flags: needinfo?(mathieu)

If we can do something server-side so the users on 70 stop crashing that'd be useful.

Flags: needinfo?(mathieu)

I tested the deletion on the STAGE server, and the client behaviour looks sound.

  • A warning is shown, as expected: Unsupported sync event for Public Suffix List (source)
  • The record is deleted in local DB (browser toolbox IDB)
  • The file is deleted on disk ~/.cache/mozilla/firefox/profile/settings/main/public-suffix-list/dafsa.bin
  • No issue on startup (no-op if no record in code)

Ethan and I will proceed with the deletion of the record in PROD, in order to prevent 70 users to reload the file from disk after startup.

Flags: needinfo?(mathieu)
You need to log in before you can comment on or make changes to this bug.