Closed Bug 1667613 Opened 1 year ago Closed 1 year ago

Startup Crash in [@ mozilla::intl::LocaleService::GetAppLocalesAsLangTags]

Categories

(Core :: Internationalization, defect, P1)

defect

Tracking

()

RESOLVED FIXED
84 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox81 --- unaffected
firefox82 --- unaffected
firefox83 blocking verified
firefox84 + verified

People

(Reporter: aryx, Assigned: m_kato)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

During the past week, there were 4 crashes with this signature on 4 different devices. They hit in the first 3 minutes after the application got launched.

Crash report: https://crash-stats.mozilla.org/report/index/629e45c6-0fad-4f1e-9936-ba27b0200924

Top 10 frames of crashing thread:

0 xul.dll mozilla::intl::LocaleService::GetAppLocalesAsLangTags intl/locale/LocaleService.cpp:400
1 xul.dll mozilla::intl::LocaleService::GetAppLocaleAsLangTag intl/locale/LocaleService.cpp:427
2 xul.dll nsChromeRegistryChrome::GetBaseURIFromPackage chrome/nsChromeRegistryChrome.cpp:330
3 xul.dll nsChromeRegistry::ConvertChromeURL chrome/nsChromeRegistry.cpp:253
4 xul.dll nsChromeProtocolHandler::NewChannel chrome/nsChromeProtocolHandler.cpp:127
5 xul.dll mozilla::net::nsIOService::NewChannelFromURIWithProxyFlagsInternal netwerk/base/nsIOService.cpp:1134
6 xul.dll mozilla::net::nsIOService::NewChannelFromURIWithClientAndController netwerk/base/nsIOService.cpp:1076
7 xul.dll NS_NewChannelInternal netwerk/base/nsNetUtil.cpp:404
8 xul.dll NS_NewChannel netwerk/base/nsNetUtil.cpp:345
9 xul.dll nsStringBundleBase::ParseProperties intl/strres/nsStringBundle.cpp:446

Zibi, any idea what started this?

Flags: needinfo?(gandalf)

Zibi, any idea what started this?

I don't, but I'm pretty sure that historically nsTArray<nsCString> mAppLocales should never be nullptr, and that there has been some work on nsTArray recently, so maybe :sg or :froydnj will know since they're on the log as last people touching nsTArray?

It seems that something changed about nsTArray which makes it possible to be nullptr and it was not the case before? That's my first guess.

Flags: needinfo?(sgiesecke)
Flags: needinfo?(gandalf)
Flags: needinfo?(froydnj+bz)

I tried to look at the dumps of three of the crash reports (https://crash-stats.mozilla.org/report/index/27ad659c-321a-4d7d-898f-d14530200928, https://crash-stats.mozilla.org/report/index/cb234549-9183-4890-b592-8f3e50200928, https://crash-stats.mozilla.org/report/index/967137e2-bfca-4e8a-9ad8-fe59b0200927) with Visual Studio, but unfortunately, it doesn't load xul.dll symbols for any of those properly. Not sure what that means.

I am not sure what exactly you suspect to be nullptr here. The nsTArray mLocales, which is a member of LocaleService is nothing what would be affected by changes to nsTArray. It's not exactly clear, since there might be inline frames involved, but to me it looks as if LocaleService::GetInstance() returns nullptr at https://hg.mozilla.org/mozilla-central/annotate/b7717ee20ba9d676a6f559efbcb45027db0003c6/chrome/nsChromeRegistryChrome.cpp#l330

This can happen when the call to ClearOnShutdown gets into https://searchfox.org/mozilla-central/source/xpcom/base/ClearOnShutdown.cpp#21 from https://searchfox.org/mozilla-central/source/intl/locale/LocaleService.cpp#152, which doesn't check again if sLocaleService was cleared immediately.

Flags: needinfo?(sgiesecke)
Flags: needinfo?(froydnj+bz)
Flags: needinfo?(gandalf)

Thank you for your help and investigation!

I am not sure what exactly you suspect to be nullptr here. The nsTArray mLocales, which is a member of LocaleService is nothing what would be affected by changes to nsTArray.

Thanks, that's what it looked like, but I wasn't sure.

It's not exactly clear, since there might be inline frames involved, but to me it looks as if LocaleService::GetInstance() returns nullptr at https://hg.mozilla.org/mozilla-central/annotate/b7717ee20ba9d676a6f559efbcb45027db0003c6/chrome/nsChromeRegistryChrome.cpp#l330

How did you conclude that from the stack?
The top frame is https://searchfox.org/mozilla-central/source/intl/locale/LocaleService.cpp#400

Flags: needinfo?(gandalf) → needinfo?(sgiesecke)

Sure, here are some more details on my analysis when looking at https://crash-stats.mozilla.org/report/index/cb234549-9183-4890-b592-8f3e50200928:

  1. The top frame shows https://hg.mozilla.org/mozilla-central/annotate/4668c7b5e842a7afef9e4c183c8a85af950eb1ad/intl/locale/LocaleService.cpp#l400, which calls mAppLocales.IsEmpty(). Assuming this is not affected by other optimizations, since nsTArray_base::IsEmpty is not a virtual function, this can't happen when tying to access the vtable, but we are probably inside that function, which is likely to have been inlined here, since it is very small. So https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/xpcom/ds/nsTArray.h#416 calls Length which accesses mHdr at https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/xpcom/ds/nsTArray.h#413. It then dereferences mHdr, so this could cause an access violation like the one mentioned if mHdr is nullptr, but I think this is rather unlikely (if nsTArray had some bug that causes this to be nullptr we would probably have seen this elsewhere). The other possibility is that the base address of the nsTArray object itself is already nullptr + x, so trying to access mHdr will try to access nullptr + x + offsetof(nsTArray<...>, mHdr). I then went on to see if the pointer of the object containing the nsTArray might be a nullptr.
  2. The second stack frame just delegates to the first.
  3. The third stack frame is at https://hg.mozilla.org/mozilla-central/annotate/4668c7b5e842a7afef9e4c183c8a85af950eb1ad/chrome/nsChromeRegistryChrome.cpp#l330 and is actually calling a method on the object containing the nsTArray through a pointer obtained by calling LocaleService::GetInstance(). If that returned nullptr, this would explain the access violation.
  4. So I looked into https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/intl/locale/LocaleService.cpp#135. At first sight, this looks as if it does if (!sInstance) { sInstance= new LocaleService{...}; } return sInstance; which would never return nullptr. However, there's the call to ClearOnShutdown at https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/intl/locale/LocaleService.cpp#152.
  5. ClearOnShutdown calls InsertIntoShutdownList at https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/xpcom/base/ClearOnShutdown.h#117
  6. InsertIntoShutdownList checks PastShutdownPhase(aPhase) and in that case calls aObserver->Shutdown(); at https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/xpcom/base/ClearOnShutdown.cpp#21
  7. aObserver is a PointerClearer for sInstance, so calling Shutdown will set sInstance to nullptr, which makes LocaleService::GetInstance return nullptr, which explains the access violation.

What I now also saw in the crash metadata is that it has a GraphicsCriticalError saying |[0][GFX1-]: Killing GPU process due to IPC reply timeout (t=502.452) |[1][GFX1-]: Failed as lost WebRenderBridgeChild. (t=502.452) |[2][GFX1-]: Failed to create remote compositor (t=504.172) |[3][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=545.582), so this sounds as if we are actually shutting down.

So it seems to me as if nsChromeRegistryChrome::GetBaseURIFromPackage should check the return value of LocaleService::GetInstance() before dereferencing.

Flags: needinfo?(sgiesecke)

Tracking for 83 as startup crashes means loosing users.

Thanks so much for the analysis and guidance through the stack trace debugging! Super helpful :)

I take your conclusion as actionable!

Priority: -- → P2

Setting P2 because it's a crash and a startup one at that, the fix is relatively simple but the volume is low, so below P1s.

Severity: -- → S2

Zibi, do you have a fix coming for this crash? Thanks

Flags: needinfo?(gandalf)

Not at the moment, I don't have cycles.

Flags: needinfo?(gandalf)
Priority: P2 → P3

Insofar the root cause for this is the GPU process problems mentioned in comment 4, maybe a graphics person should give this a look?

Flags: needinfo?(jyavenard)

Changing the priority to p2 as the bug is tracked by a release manager for the current nightly.
See What Do You Triage for more information

Priority: P3 → P2

Forwarding to JEff. But I think gfx is just a red herring, and isn't related to the problem at hand

(In reply to Simon Giesecke [:sg] [he/him] from comment #4)

  1. ClearOnShutdown calls InsertIntoShutdownList at https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/xpcom/base/ClearOnShutdown.h#117
  2. InsertIntoShutdownList checks PastShutdownPhase(aPhase) and in that case calls aObserver->Shutdown(); at https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/xpcom/base/ClearOnShutdown.cpp#21
  3. aObserver is a PointerClearer for sInstance, so calling Shutdown will set sInstance to nullptr, which makes LocaleService::GetInstance return nullptr, which explains the access violation.

I think you're on the money, we've started shutdown and something is calling GetInstance()

So it seems to me as if nsChromeRegistryChrome::GetBaseURIFromPackage should check the return value of LocaleService::GetInstance() before dereferencing.

The easiest will be to dispatch a task on the main thread to call ClearOnShutdown(&sInstance, ShutdownPhase::Shutdown) then.
worse case it leaks a bit because we failed to dispatch the task, and as we're shutting down it doesn't matter.

Flags: needinfo?(jyavenard) → needinfo?(jgilbert)

I agree with :jya, thanks!

Flags: needinfo?(jgilbert)

The startup crashes are going up since 83 moved to beta, makoto could we get somebody assigned on this one to have time to find a fix and uplift it to beta? Thanks

Flags: needinfo?(m_kato)

Zibi, could you look this issue? Since shutting down process won't get XPCOM instance, LocaleService and nsChromeRegistryChrome should consider that these are called during shunting down.

Flags: needinfo?(m_kato) → needinfo?(gandalf)

Hundreds of startup crashes on Beta, I consider it as a blocker for the 83 release. Makoto, Zibi, this needs an owner and a fix we can uplift in beta. Thanks.

Severity: S2 → S1
Flags: needinfo?(m_kato)
Priority: P2 → P1

Related to bug 1663747. I think that nsStringBundle::LoadProperties should check shutdown phase like bug 1663747.

Flags: needinfo?(m_kato)
Flags: needinfo?(gandalf)
Assignee: nobody → m_kato

ClearOnShutdown is called before shut down observer is notified. So when this
is notified, some services are already shut down.

So string bundle won't get the value since Locale service is shut down, so we
should return error when shutting down status.

Pushed by m_kato@ga2.so-net.ne.jp:
https://hg.mozilla.org/integration/autoland/rev/83bf4fd3b1fb
String bundle should return error after ClearOnShutdown is called. r=zbraniecki
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 84 Branch

Makoto, could you request an uplift to beta? Thanks

Flags: needinfo?(m_kato)

Comment on attachment 9184143 [details]
Bug 1667613 - String bundle should return error after ClearOnShutdown is called.

Beta/Release Uplift Approval Request

  • User impact if declined: When user quits Firefox, Firefox may crash. Although we don't know how to reproduce this, but this is potential fix.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): String bundle returns error instead when shutting down is started. After user quits Firefox, remained HTTP request (most is captive portals service) will have no Accept-Language header. But it is unnecessary during shutdown.
  • String changes made/needed: No
Flags: needinfo?(m_kato)
Attachment #9184143 - Flags: approval-mozilla-beta?
Regressions: 1674077

Comment on attachment 9184143 [details]
Bug 1667613 - String bundle should return error after ClearOnShutdown is called.

Approved for 83 beta 6, thanks.

Attachment #9184143 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.