Crash in [@ mozilla::dom::cache::db::(anonymous namespace)::DeleteSecurityInfo]
Categories
(Core :: Storage: Cache API, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr115 | --- | unaffected |
firefox124 | --- | wontfix |
firefox125 | --- | wontfix |
firefox126 | --- | fixed |
People
(Reporter: aryx, Assigned: asuth)
References
(Blocks 1 open bug)
Details
(Keywords: crash, topcrash)
Crash Data
Attachments
(1 file, 1 obsolete file)
This affects only early betas (b1-b6). <20 crashes for <=6 installs per beta affected. 85% of crashes with 32-bit builds.
Crash report: https://crash-stats.mozilla.org/report/index/c364aa86-3e6e-4370-8a9a-4f9900240326
MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(refcount >= aCount)
Top 10 frames of crashing thread:
0 xul.dll mozilla::dom::cache::db:: dom/cache/DBSchema.cpp:1649
0 xul.dll mozilla::dom::cache::db:: dom/cache/DBSchema.cpp:1687
1 xul.dll mozilla::dom::cache::db::DeleteCacheId dom/cache/DBSchema.cpp:696
2 xul.dll mozilla::dom::cache:: const dom/cache/Manager.cpp:129
2 xul.dll mozilla::ReduceEach<`lambda at /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/QuotaCommon.h:1218:7', mozilla::CheckedInt<long long>, `lambda at /builds/worker/checkouts/gecko/dom/cache/Manager.cpp:129:7'>::<lambda_1>::operator const dom/quota/QuotaCommon.h:1197
2 xul.dll mozilla::CollectEach dom/quota/QuotaCommon.h:1180
2 xul.dll mozilla::ReduceEach dom/quota/QuotaCommon.h:1197
2 xul.dll mozilla::Reduce dom/quota/QuotaCommon.h:1217
2 xul.dll mozilla::dom::cache:: dom/cache/Manager.cpp:129
3 xul.dll mozilla::dom::cache::SyncDBAction::RunWithDBOnTarget dom/cache/DBAction.cpp:140
Comment 1•9 months ago
|
||
FWIW this seems to happen only with Windows 32 Bit builds.
Updated•9 months ago
|
Comment 2•9 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 20 desktop browser crashes on beta
:janv, could you consider increasing the severity of this top-crash bug?
For more information, please visit BugBot documentation.
Comment 3•9 months ago
|
||
Andrew, can you help with deciding if this should be S2 ?
Comment 4•9 months ago
|
||
My understanding is that this is just a diagnostic assertion and this crash will go away, as comment 0 says. But in release we would write negative values to our DB, IIUC. I do not know what consequences this would have, but it sounds like something will never be deleted from the table.
FWIW, I do not think that our recent change to DBAction::RunOnTarget
affects things here, as the crash arrives inside RunWithDBOnTarget
, which is beyond that check, which once true will be true for all callers.
Assignee | ||
Comment 5•9 months ago
•
|
||
(In reply to Jens Stutte [:jstutte] from comment #1)
FWIW this seems to happen only with Windows 32 Bit builds.
Unfortunately expanding the time range to 6 months shows Linux and Mac as well as amd64 and arm64, although the crashes are definitely biased towards these older machines.
A notable thing about the canonical crash https://crash-stats.mozilla.org/report/index/c364aa86-3e6e-4370-8a9a-4f9900240326 is that it is taking place during the cleanup of an unclean shutdown.
I don't know if this link encodes what I want it to, but the core idea is that faceting based on proto signature can let us tell what the high level action was. And we get:
- Marker cleanup: 132 (66%)
- SetupAction calling DeleteCacheId: 132 = (99 + 23 + 4 + 3 + 3)
- Active usage: 68 (34%)
- CachePutAllAction 45 = (25 + 14 + 6)
- CacheDeleteAction 14 = (9 + 5)
- DeleteOrphanedCacheAction 9 = (6 + 1 + 1 + 1)
So this is definitely catching an inconsistency and the possible options from more likely to less likely (where there can be multiple things going on amongst those crashes):
- Database corruption due to cosmic rays or similar. Everything we did involving the database was perfect, but the numbers don't add up now.
- Logic bug related to the context marker cleanup logic; this could manifest both directly during cleanup processing or subsequent actions if the math gets off.
- We're seeing inconsistent database state due to insufficient / inconsistent use of database transactions.
- Serious logic bug that happens even without shutdown.
I think we can rule out option 4 (big logic bug) based on the rate of incidence. I think option 3 is also unlikely because we explicitly use transactions at the manager level, so even though we currently do manual refcounting of the security data, that integrity should be maintained. Option 2 seems like something we need to take seriously since 2/3 of these crashes are startup recovery crashes and the other 1/3 might be fallout from bungled recovery. Option 1 is always a thing that can happen.
In terms of actions we can/should take, I think we should:
- On this bug:
- Drop the severity of the assertion to
MOZ_ASSERT_DEBUG_OR_FUZZING
and request uplift since this is clearly a thing that happens (although early beta goes away April 1st).
- Drop the severity of the assertion to
- In follow-up(s), edit: bug 1888508 now filed for this:
- Investigate the shutdown edge cases more.
- Try and return a corruption error code in this case which can be handled by a call to mozilla::dom::cache::WipeDatabase.
The currently control-flow isn't well-suited to calling WipeDatabase and that would be a bit scary to uplift, plus there may be a correctness problem we'd want to address, hence why I'd suggest we just lower the assertion in this bug, so we can uplift it to beta where we're seeing the crashes.
(In reply to Jens Stutte [:jstutte] from comment #4)
My understanding is that this is just a diagnostic assertion and this crash will go away, as comment 0 says. But in release we would write negative values to our DB, IIUC. I do not know what consequences this would have, but it sounds like something will never be deleted from the table.
At a surface level we're looking at leaking security rows, yes. But depending on what's going on with the bad math, it also seems possible that mozilla::dom::cache::db::ReadResponse could fail in some cases.
FWIW, I do not think that our recent change to
DBAction::RunOnTarget
affects things here, as the crash arrives insideRunWithDBOnTarget
, which is beyond that check, which once true will be true for all callers.
Agreed. That change might slightly increase the incidence of the bug if this is case 2 from above (given that some of the cleanup steps relate to deleted caches which might have otherwise been processed on shutdown), but it wouldn't be the cause of the bug and the shutdown hang timer already means we would be facing that scenario anyways.
Assignee | ||
Updated•9 months ago
|
Assignee | ||
Comment 6•9 months ago
|
||
We are downgrading this assertion since it's clearly happening and we
want a patch that is upliftable to beta; we will investigate and
address the potential logic problem separately. (It could also just
be flipped memory bits; we need to investigate.)
Assignee | ||
Comment 8•9 months ago
|
||
I filed bug 1888508 for next steps, consistent with comment 5.
Assignee | ||
Comment 9•9 months ago
|
||
We are downgrading this assertion since it's clearly happening and we
want a patch that is upliftable to beta; we will investigate and
address the potential logic problem separately. (It could also just
be flipped memory bits; we need to investigate.)
Original Revision: https://phabricator.services.mozilla.com/D206027
Updated•9 months ago
|
Comment 10•9 months ago
|
||
Uplift Approval Request
- Code covered by automated testing: yes
- Explanation of risk level: The assertion is being downgraded.
- Needs manual QE test: no
- Is Android affected?: yes
- User impact if declined: Some users with unhappy databases are experience MOZ_DIAGNOSTIC_ASSERT crashes on beta until early beta ends.
- String changes made/needed: none
- Steps to reproduce for manual QE testing: n/a
- Fix verified in Nightly: no
- Risk associated with taking this patch: Extremely low
Comment 11•9 months ago
|
||
bugherder |
Updated•9 months ago
|
Updated•8 months ago
|
Description
•