Closed Bug 1538541 Opened 5 years ago Closed 3 years ago

Crash in [@ mdb_cursor_put]

Categories

(Toolkit :: Storage, defect, P1)

defect

Tracking

()

RESOLVED FIXED
85 Branch
Tracking Status
firefox-esr60 --- unaffected
firefox-esr78 --- unaffected
firefox66 --- unaffected
firefox67 --- unaffected
firefox67.0.1 --- unaffected
firefox68 --- disabled
firefox69 --- disabled
firefox70 + wontfix
firefox71 + disabled
firefox72 --- disabled
firefox83 --- unaffected
firefox84 + fixed
firefox85 --- fixed

People

(Reporter: calixte, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, regression, Whiteboard: [psm-assigned])

Crash Data

Attachments

(1 obsolete file)

This bug is for crash report bp-5d77bd19-41ce-459f-9c1c-7f9fb0190324.

Top 10 frames of crashing thread:

0 xul.dll mdb_cursor_put third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c:6938
1 xul.dll mdb_put third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c:9022
2 xul.dll static union core::result::Result< third_party/rust/rkv/src/store/single.rs:51
3 xul.dll struct nserror::nsresult cert_storage::cert_storage_constructor security/manager/ssl/cert_storage/src/lib.rs:449
4 xul.dll static void construct_cert_storage::<unnamed-tag>::operator security/manager/ssl/cert_storage/src/cert_storage.cpp:20
5 xul.dll nsresult mozilla::detail::RunnableFunction<`lambda at z:/task_1553378114/build/src/security/manager/ssl/cert_storage/src/cert_storage.cpp:19:66'>::Run xpcom/threads/nsThreadUtils.h:562
6 xul.dll mozilla::SyncRunnable::Run xpcom/threads/SyncRunnable.h:100
7 xul.dll mozilla::SyncRunnable::DispatchToThread xpcom/threads/SyncRunnable.h:55
8 xul.dll mozilla::SyncRunnable::DispatchToThread xpcom/threads/SyncRunnable.h:89
9 xul.dll construct_cert_storage security/manager/ssl/cert_storage/src/cert_storage.cpp:17

There is 1 crash in nightly 68 with buildid 20190323214211. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1429796.

[1] https://hg.mozilla.org/mozilla-central/rev?node=99079ab7e52e

Flags: needinfo?(mgoodwin)

I'm guessing this is an lmdb thing?

Flags: needinfo?(mgoodwin) → needinfo?(myk)

(In reply to Mark Goodwin [:mgoodwin] from comment #1)

I'm guessing this is an lmdb thing?

Indeed. I'm unsure what's going on here, but it's likely to be an LMDB issue—and unlikely to be a bug in cert_storage. I'll investigate…

Assignee: nobody → myk
Flags: needinfo?(myk)
Priority: -- → P1
Whiteboard: [psm-assigned]

[Tracking Requested - why for this release]:
The crash volume is pretty low but after patch https://hg.mozilla.org/mozilla-central/rev?node=76a363f06ebd has landed, there are 5 startup crashes.

startup crashes are bad. myk did your investigation turn up anything so far?

I've been investigating this, but so far I haven't been able to reproduce it. It isn't specific to cert_storage, as some of the crashes occurred during writes to other datastores (like xulstore).

I've tried generating a massive number of writes of random key/value pairs, but that didn't trigger a crash on my Windows 10 VM. My next step will be to try writing many key/value pairs with duplicate keys, on the off chance that this is related to updating of key/value pairs for keys that already exist in a datastore.

Status: NEW → ASSIGNED
Component: Security: PSM → Storage
Product: Core → Toolkit

In the interest of reducing risk for Beta/Release users, I've disabled rkv usage on those channels in bug 1547877. (It remains enabled on the Nightly channel, where I will continue to investigate it.) Thus this no longer affects Firefox 68.

in early stability data for 68.0b3, this signature is accounting for 4% of browser crashes - the current mitigation doesn't seem to be fully effective yet.

(In reply to [:philipp] from comment #7)

in early stability data for 68.0b3, this signature is accounting for 4% of browser crashes - the current mitigation doesn't seem to be fully effective yet.

The mitigation was effective when I landed it, however MOZ_NEW_CERT_STORAGE has since been enabled for EARLY_BETA_OR_EARLIER by bug 1548365, and that is presumably why this issue is showing up in the early beta builds.

changing the status for this, as 68.0b7 disabled the intermediate cert preloading (bug 1555110).

Here's another report of a crash at the same line in mdb_cursor_put:

https://www.openldap.org/its/index.cgi/Incoming?id=9037;selectid=9037

(The crash occurs at line 6688 in LMDB 0.9.18, which is the same as line 6938 in LMDB 0.9.23 and line 6937 in our slightly modified version of LMDB 0.9.23.)

Attachment #9073075 - Attachment is obsolete: true

This is very low volume in nightly at this point - Marco do you still think it is P1?
Myk is this something you are still going to work on or - maybe not? Thanks!

p.s. It was kind of fun just now to n-i to :myk and :mak.

Flags: needinfo?(myk)
Flags: needinfo?(mak77)

I don't know what's the priority of the feature this drives (intermediate preloading / new cert storage), most of the bugs seem stuck from a couple months, and it's only enabled in Nightly. it's a blocker to that feature, but not to the product or Storage in general.

Flags: needinfo?(mak77)
Priority: P1 → P3

(In reply to Marco Bonardo [::mak] (Away 7-25 Aug) from comment #13)

I don't know what's the priority of the feature this drives (intermediate preloading / new cert storage), most of the bugs seem stuck from a couple months, and it's only enabled in Nightly. it's a blocker to that feature, but not to the product or Storage in general.

Those features are P1 - we want to ship them as soon as we can. This bug, bug 1550174, and bug 1538539 are blocking them from shipping.

Priority: P3 → P1
Assignee: myk → vporof

Apologies for the delay responding. I'm unlikely to work on this anytime soon. So reassigning to Victor (as he's already done) is the right thing to do.

Flags: needinfo?(myk)

Victor: a few weeks ago I had a thought that perhaps this could be due to LMDB keys that are too long. LMDB keys are required to be between 1 and 511 bytes long. And I don't think rkv currently protects against attempting to insert a key/value pair with a key that is too long.

LMDB should return an error in that case. And I don't know of any XULStore/Cert Storage keys that are too long (although traditional extensions could have inserted anything into XULStore in the past, and Firefox would have retained that data after the extension was uninstalled).

Also, I was unable to reproduce this crash with an automated test in rkv that inserts overlong keys. But perhaps there are circumstances under which it can occur nonetheless. Food for thought.

See Also: → 1584446

hi, do you know why this crash signature is already showing up in firefox 70 when the underlying feature apparently isn't shipping yet?

Flags: needinfo?(vporof)

RKV is now enabled in nightly and early beta. For those signatures to show up in release, I suspect that means RKV is manually turned on; however, I haven't confirmed that yet. I'll investigate.

Flags: needinfo?(vporof)

This is a fairly high crash volume on release 70, tracking to keep an eye on it though it may not be something we can fix in 70.

This is indeed a bit bizarre, we use build-time configuration variables to determine whether to ship the old (non-rkv) or new (rkv-based) versions of the three consumers which can currently use RKV. The variables are MOZ_NEW_XULSTORE, MOZ_NEW_NOTIFICATION_STORE, and MOZ_NEW_CERT_STORAGE for the XULStore, Notification Store, and Cert Storage consumers, respectively.

I wonder if there's a new consumer we don't know about.

A quick search reveals bug 1573832 and this line[0] which uses kvstore.

Alex, are we using kvstore in release (70) for about:support?

[0] https://searchfox.org/mozilla-central/source/toolkit/content/aboutSupport.js#872

Flags: needinfo?(achronop)

I am afraid we don't. The enumeration of kvstore database in about:support added in 71.

Flags: needinfo?(achronop)

(In reply to Alex Chronopoulos [:achronop] from comment #23)

I am afraid we don't. The enumeration of kvstore database in about:support added in 71.

Are you planning to ship this to release with the 71 train? It's slightly problematic since kvstore is an RKV wrapper, and RKV has some troubles (see bug 1586179) which prevent us from enabling it in release.

However, the good news is that we have a custom backend that you can use without having to change anything, but we should get to it before 71 ships. It's going to be similar to bug 1594995.

Filed bug 1596012.

Here's a list of all the crashes that happen in release: https://crash-stats.mozilla.org/signature/?product=Firefox&release_channel=release&signature=mdb_cursor_put&date=%3E%3D2018-11-14T00%3A00%3A00.000Z&date=%3C2019-11-13T23%3A59%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=1#reports

I'm looking into which exact consumers use RKV in release.

For now, it seems that it's kvstore only, which should only used by

  • Notification Store (but guarded by MOZ_NEW_NOTIFICATION_STORE, in nightly and early beta, so this shouldn't affect 70 release)
  • about:support (unguarded, but shipping in 71, not 70, and we'll fix this in bug 1596012).

Are there any other consumers of kvstore that you know about Alex?

Flags: needinfo?(achronop)

A search for KeyValueService pointed me to https://searchfox.org/mozilla-release/source/dom/media/mediacapabilities/KeyValueStorage.cpp#67 and bug 1530996. This seems to wrap kvstore in C++.

I was planning to ship it in Release with 71 since I was not aware of the crashes. In mediacapabilities we have been using the RKV wrapper directly from C++ land. This is in 70. I am putting a patch to pref it off.

Flags: needinfo?(achronop)

Here's the current status.

There are two new features which use RKV through kvstore.

  1. Bug 1530996, landed in 70, for performance bencharks. We plan to disable via a hotfix to media.mediacapabilities.from-database.
  2. Bug 1573832, landed in 71, for about:support. We plan to disable it for now while still in beta.

For both those features, using RKV with safe mode (similar to bug 1594995), will allow us to re-enable them. We want RKV with safe mode to bake in nightly for a cycle first, so we can probably re-enable everything in 72. If there's a sense of urgency, we can look into enabling for 71 if our implementation is robust enough.

Alex is working on 1. and 2. above for disabling those features. Alex, please link the bug numbers here for reference, and make them block the ship-rkv meta-bug.

I am working on bug 1596012 and bug 1596051, which will re-enable those features after proving that crashes don't happen anymore when using safe-mode. Switching to non-safe-mode (for performance) will happen in bug 1596063.

Flags: needinfo?(achronop)
No longer blocks: crlite
No longer blocks: 1429796

I have filed Bug 1596064 for 1, and Bug 1596065 for 2. I am working on Bug 1596064 for 1.

Flags: needinfo?(achronop)

[Tracking Requested - why for this release]: Pascal, FYI probably you want to track this or the related bugs

Flags: needinfo?(pascalc)
Flags: needinfo?(pascalc)
OS: Windows 7 → Unspecified
Version: Trunk → unspecified

For 70, this is fixed by bug 1596065, where tracking flag for firefox70 is not set yet. Should we uplift, or wontfix both for 70?

Flags: needinfo?(pascalc)

(In reply to Victor Porof [:vporof][:vp] from comment #32)

For 70, this is fixed by bug 1596065, where tracking flag for firefox70 is not set yet. Should we uplift, or wontfix both for 70?

We don't have plans for a dot release for 70 because 71 ships in less than 2 weeks, all the RKV crash fixes should be uplifted to 71.

Flags: needinfo?(pascalc)

(In reply to Alex Chronopoulos [:achronop] from comment #30)

I have filed Bug 1596064 for 1, and Bug 1596065 for 2. I am working on Bug 1596064 for 1.

We uplifted both bug 1596064 and bug 1596065 in our last betas and had no crash after beta 11 so marking 71 as fixed.

Hi, a couple questions - has anyone encountering this crash saved a copy of their DB files, that we can get a look at?

What is the impact of this crash - is it persistent, and so once a crash happens, it continues to happen on
subsequent restarts? (Which implies that some corrupted state has persisted in the LMDB files.) Or does a
restart tend to work without further problems?

Blocks: rkv-perf-mode
No longer blocks: ship-rkv
Severity: critical → blocker
Priority: P1 → P2

Not affecting release anymore.

Status: ASSIGNED → NEW
Priority: P2 → P3
Assignee: vporof → nobody

[Tracking Requested - why for this release]:
the signature is regressing again in 84.0b

Looks like the new crashes in 84 are in Glean code. Alessio, can you please take a look?

Flags: needinfo?(alessio.placitelli)

Changing the priority to p1 as the bug is tracked by a release manager for the current beta.
See What Do You Triage for more information

Priority: P3 → P1

Hey Jan-Erik,

this should have been solved by the latest Glean release, right?

Flags: needinfo?(alessio.placitelli) → needinfo?(jrediger)

As stated in the RKV documentation, it is absolutely required to use the "safe-mode" backend for any product that is intended to ride the trains to beta/release: https://github.com/mozilla/rkv#%EF%B8%8F-warning-%EF%B8%8F

To my knowledge, Glean has been using the LMDB backend for RKV successfully in the past, because the platforms it was shipping on were coincidentally not affected by the crash. In order to ship Glean on desktop however, an update to a recent version of RKV was needed.

It was done in https://github.com/mozilla/glean/pull/1322. We should either uplift that to beta, or skip the current train.

Yes, it landed in nightly ... checks ... 3 days ago. I'm going to ask to uplift that. Sorry for the delay there.

Flags: needinfo?(jrediger)
See Also: → 1677701

Bug 1677701 has been uplifted to Beta for 84.0b4. If this crash goes away, let's resolve this bug and file a new one for any future instances of this signature.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: