Open Bug 1724336 Opened 3 years ago Updated 5 months ago

Startup Crash in [@ mozilla::URLPreloader::ReadCache] after "MOZ_DIAGNOSTIC_ASSERT(key.mPath.Length() > 0) (Path should be non-empty)"

Categories

(Core :: XPConnect, defect)

Unspecified
Windows
defect

Tracking

()

REOPENED
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- wontfix
firefox-esr102 --- wontfix
firefox90 --- unaffected
firefox91 --- wontfix
firefox92 --- wontfix
firefox93 --- wontfix
firefox94 --- wontfix
firefox95 --- wontfix
firefox96 --- wontfix
firefox97 --- wontfix
firefox98 --- wontfix
firefox99 --- wontfix
firefox112 --- wontfix
firefox113 --- wontfix
firefox114 --- wontfix

People

(Reporter: RyanVM, Unassigned)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1719319 +++

We're starting to see some new reports now from builds with the diagnostic assert added in bug 1719319. Let's use this bug to continue the investigation since bug 1719319 appears to have successfully mitigated the crashes for release users.

Crash report: https://crash-stats.mozilla.org/report/index/99c4953b-57da-4d41-bddd-ec4470210805

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(false) (Entry should be new and not in any list)

Top 10 frames of crashing thread:

0 xul.dll mozilla::URLPreloader::ReadCache js/xpconnect/loader/URLPreloader.cpp:308
1 xul.dll mozilla::URLPreloader::BackgroundReadFiles js/xpconnect/loader/URLPreloader.cpp:340
2 xul.dll mozilla::detail::RunnableMethodImpl<D3DVsyncSource::D3DVsyncDisplay*, void  xpcom/threads/nsThreadUtils.h:1201
3 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1142
4 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:300
5 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:324
6 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:306
7 xul.dll static nsThread::ThreadFunc xpcom/threads/nsThread.cpp:390
8 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:399
9 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c:139
Severity: -- → S2
Keywords: sec-other
Depends on: 1724509

We are ending the 92 cycle so wontfix 92. This is a startup crash and such crashes are not good for user retention, I think it would be good to at least have an assignee and some investigation done, thanks.

:kmag is back from vacation. He has patches in bug 1724509. He is looking into landing them.

We mitigated crashes in release builds in bug 1719319, so firefox93 should be unaffected. But we definitely should look into this issue deeper.

(In reply to Hsin-Yi Tsai [:hsinyi] from comment #3)

We mitigated crashes in release builds in bug 1719319, so firefox93 should be unaffected. But we definitely should look into this issue deeper.

Hey Kris, can you please confirm if my understanding about the whether firefox-93 is affected or not right? If my understanding is right, we ppbly would want to hold off landing patches in bug 1724509 until soft-freeze ends.

Flags: needinfo?(kmaglione+bmo)
Has Regression Range: --- → yes

Bug 1717778 is on all branches now (and is unhidden) so this doesn't need to be hidden any more. Maybe we could disable this fatal assert on early beta, too, to avoid bothering release managers every cycle.

Group: dom-core-security
Keywords: sec-other

https://searchfox.org/mozilla-central/rev/02bd3e02b72d431f2c600a7328697a521f87d9b6/js/xpconnect/loader/URLPreloader.cpp#336,338
we have pendingURLs outside the lock. Isn't it possible that we end up waiting for the lock and while that is happening some other
call ends up passing cached urls to another list?

Flags: needinfo?(continuation)

Writing a patch (once a build is ready)

Oh wait, BackgroundReadFiles seems to have other issues too. Thinking.

Depends on: 1755539
Flags: needinfo?(continuation)

I put up a patch in bug 1755539 that changes a few things around. I don't know how it could lead to this issue, but it might eliminate some weirdness at least.

Flags: needinfo?(kmaglione+bmo)

My patch didn't help. I guess I can try looking at how the preloader list is written out.

Depends on: 1757227

It looks like my changes in bug 1757227 did finally turn up a useful tidbit. The two crashes since my patch landed are hitting MOZ_DIAGNOSTIC_ASSERT(key.mPath.Length() > 0) (Path should be non-empty) so we're ending up with multiple empty paths.

I clicked a bit in searchfox. So we read this CacheKey from our file. IIUC that means we already wrote it before with an empty path. Now there seem to be quite a few constructors for CacheKey that seem not to check if we actually get a path, should we have some assert there?

Yeah, there may be some way we end up with an entry that has a blank path. My question is, how can we end up with two of them? The file is written out by iterating over a hash table where the keyed off of a CacheKey, so how can the same CacheKey appear twice? I spent some time trying to figure out if there's something weird with the hashtable that could allow that, but I couldn't come up with anything.

The type of the hashtable is nsClassHashtable<nsGenericHashKey<CacheKey>, URLEntry>, where the CacheKeys are also URLEntries, so we end up with two copies of the data (one for the key and one for the value) but aside from the small excess usage of space I can't think of anything wrong with that. I couldn't see any place that mutates the values in the hash table.

Whiteboard: [domcore-s2-revisit]
See Also: → 1761667

:hsinyi given the low volume crashes here, is this still considered an S2?

Flags: needinfo?(htsai)

(In reply to Dianna Smith [:diannaS] from comment #14)

:hsinyi given the low volume crashes here, is this still considered an S2?

Just talked with mccr8, yeah, it probably makes more sense to downgrade this to S3, given the crash in beta/release has been mitigated.
Note that we sill want to figure the issue out, but it's unfortunately stalled after we gave this several try.

Severity: S2 → S3
Flags: needinfo?(htsai)
Keywords: stalled

The severity field for this bug is set to S3. However, the bug has the topcrash keyword.
:peterv, could you consider increasing the severity of this top-crash bug? If the crash isn't "top" anymore, could you drop the topcrash keyword?

For more information, please visit auto_nag documentation.

Flags: needinfo?(peterv)
Flags: needinfo?(peterv)

We've bypassed the assertion that made it a top crash. It is still present on Nightly.

Whiteboard: [domcore-s2-revisit]

Kris going to take another look.

Flags: needinfo?(kmaglione+bmo)

There's a flurry of new crashes under this signature on nightly, if it can be useful I can extract the contents of the strings causing the crash from the reports. Just NI? me if it could help.

(In reply to Gabriele Svelto [:gsvelto] from comment #20)

There's a flurry of new crashes under this signature on nightly, if it can be useful I can extract the contents of the strings causing the crash from the reports. Just NI? me if it could help.

Do you intend the string of MOZ_CRASH_REASON? In all the nightly instances that seems to be MOZ_DIAGNOSTIC_ASSERT(key.mPath.Length() > 0) (Path should be non-empty), so we have a key without path, apparently. Maybe we should assert if key.mType is any of the expected types or just garbage, too?

Edit: It seems we already knew this from comment 12 and comment 13...

I mean the actual contents of the key itself.

(In reply to Gabriele Svelto [:gsvelto] from comment #22)

I mean the actual contents of the key itself.

That sounds helpful, thanks.

Edit: Assuming that we can see the type this way, too. Can we see the content of buf ?

Unfortunately the contents of buf are unavailable as it's been optimized away and VS is unable to reconstruct it. key->mType is TypeAppJar and pendingURLs seems to contain only one entry aside from the sentinel (I don't know what it contains though). If I go up one frame this.mGREPrefix is jar:file:///C:/Program%20Files/Firefox%20Nightly/omni.ja!/.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly (startup)

:hsinyi, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(htsai)

This doesn't look a true topcrash. There were 14 reports from 1 install and it seems that contributes to topcrash or topcrash-startup calculation.

Hey :marco, we could perhaps teach our bot to do something differently for the number of reports coming from a single install.

Flags: needinfo?(htsai) → needinfo?(mcastelluccio)

(In reply to Hsin-Yi Tsai (she/her) [:hsinyi] from comment #26)

This doesn't look a true topcrash. There were 14 reports from 1 install and it seems that contributes to topcrash or topcrash-startup calculation.

Hey :marco, we could perhaps teach our bot to do something differently for the number of reports coming from a single install.

Ah, looks we've already taken the above into consideration in our criteria to some extent - "Top 10 desktop browser crashes on Nightly, if they happen for enough different installations."

From Socorro, it looks like in the last week the crash happened 18 times to 8 different installations on 110.0a1 and 4 times to 1 installation on 111.0a1.

Flags: needinfo?(mcastelluccio)

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly (startup)

:peterv, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(peterv)
Duplicate of this bug: 1824666
Flags: needinfo?(peterv)

The assertion is different now.

Summary: Startup Crash in [@ mozilla::URLPreloader::ReadCache] after "MOZ_DIAGNOSTIC_ASSERT(false) (Entry should be new and not in any list)" → Startup Crash in [@ mozilla::URLPreloader::ReadCache] after "MOZ_DIAGNOSTIC_ASSERT(key.mPath.Length() > 0) (Path should be non-empty)"

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Most of the startup assertions we're seeing are from when we decode multiple
paths that are empty strings. That doesn't seem likely to happen very often
from on disk corruption, so presumably it must be happening before/when we
encode the cache file?

I don't have much hope that this will turn up the cause, but it seems worth a
try, and it definitely shouldn't hurt.

Assignee: nobody → kmaglione+bmo
Status: NEW → ASSIGNED
Pushed by maglione.k@gmail.com: https://hg.mozilla.org/integration/autoland/rev/f2fbfecc33d1 Add some more URLPreloader path sanity checks. r=mccr8
Pushed by maglione.k@gmail.com: https://hg.mozilla.org/integration/autoland/rev/b3b312008642 Add some more URLPreloader path sanity checks. r=mccr8
Status: ASSIGNED → RESOLVED
Closed: 1 years ago
Resolution: --- → FIXED
Target Milestone: --- → 114 Branch

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled
Status: RESOLVED → REOPENED
Flags: needinfo?(kmaglione+bmo)
Resolution: FIXED → ---
Target Milestone: 114 Branch → ---

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: kmaglione+bmo → nobody
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: