Closed Bug 1892230 Opened 3 months ago Closed 1 month ago

Glean serializes its entire database thousands of times during startup

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

defect

Tracking

(firefox127 fixed, firefox128 fixed)

RESOLVED FIXED
Tracking Status
firefox127 --- fixed
firefox128 --- fixed

People

(Reporter: jrmuizel, Assigned: janerik)

References

(Blocks 4 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(4 files, 2 obsolete files)

See https://share.firefox.dev/4d4f5tY

This is happening in rkv::backend::impl_safe::environment::EnvironmentImpl::write_to_disk because the bincode serialization of the database seems to do a lot of allocation.

It spends about 400ms doing this which causes resources contention (malloc lock and cpu time) for other parts of startup.

Component: Glean Platform → Glean: SDK
Summary: glean searilizes it's entire database multiple times during startup. → glean seriailizes it's entire database multiple times during startup.
No longer blocks: fenix-startup
Summary: glean seriailizes it's entire database multiple times during startup. → glean seriailizes its entire database multiple times during startup.

Jan-Erik, is it possible for us to do the stuff that happens at startup in a single transaction so we don't have to re-serialize multiple times?

Flags: needinfo?(jrediger)

Not with how the current things are set up. rkv is very suboptimal in that regard and we don't really use "transactions".
I might have an idea to lessen it a bit though (right now we commit very fine granularly, we might be able to make that a bit more broad).

Flags: needinfo?(jrediger)

(In reply to Jan-Erik Rediger [:janerik] from comment #2)

Not with how the current things are set up. rkv is very suboptimal in that regard and we don't really use "transactions"

Can you elaborate why rkv is suboptimal in that regard for the sake of my curiosity?

Flags: needinfo?(jrediger)

It's current "safe-mode" database mode is keeping a hashmap of everything in memory and serializing that on disk for persistence on every write.

Flags: needinfo?(jrediger)
Assignee: nobody → jrediger
Priority: -- → P3

serializing that on disk for persistence on every write.

Should this be the case or can we somehow let it happen less frequently?

Glean essentially relies on this persistence right now. We would need to significantly overhaul the backend, or bring in an already existing storage that does this better.
For Glean I've also been exploring using sqlite directly instead.

Summary: glean seriailizes its entire database multiple times during startup. → Glean serializes its entire database multiple times during startup

Here's a profile that shows this even worse:
https://share.firefox.dev/3WUbWrp

Glean is keeping a CPU at 100% for 2.1 seconds during startup.

See Also: → 1896193

It looks like that majority of these large times are caused by the number of serializations to disk. In this profile https://share.firefox.dev/4bBUJqJ Glean serializes the whole database 6123 times during startup.

Summary: Glean serializes its entire database multiple times during startup → Glean serializes its entire database thousands of times during startup
Blocks: 1894804

It seems like this number can grow without bound? I now get 23,618 serializations during startup. Do you know of a way to reset the count to 0?

Flags: needinfo?(jrediger)

I'm looking into this right now. I need to understand which metrics are actually recorded there.
Where in the profile do you see the number of total serializations it does?

Flags: needinfo?(jrediger) → needinfo?(jmuizelaar)

I patched rkv to get a count with a patch like:

diff --git a/src/backend/impl_safe/environment.rs b/src/backend/impl_safe/environment.rs
index f9c44cf..d36b892 100644
--- a/src/backend/impl_safe/environment.rs
+++ b/src/backend/impl_safe/environment.rs
@@ -14,7 +14,7 @@ use std::{
     fs,
     ops::DerefMut,
     path::{Path, PathBuf},
-    sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard},
+    sync::{atomic::AtomicI32, Arc, RwLock, RwLockReadGuard, RwLockWriteGuard},
 };

 use id_arena::Arena;
@@ -236,9 +236,14 @@ impl EnvironmentImpl {
             path.to_mut().push(DEFAULT_DB_FILENAME);
         };

+        let count = WRITE_COUNT.fetch_add(1, std::sync::atomic::Ordering::SeqCst);
+        let txt_path = path.with_extension(format!("txt{}", count));
+
         // Write to a temp file first.
         let tmp_path = path.with_extension("tmp");
-        fs::write(&tmp_path, self.serialize()?)?;
+        let data = self.serialize()?;
+        fs::write(&txt_path, format!("{}", data.len()));
+        fs::write(&tmp_path, data)?;

         // Atomically move that file to the database file.
         fs::rename(tmp_path, path)?;
@@ -254,6 +259,8 @@ impl EnvironmentImpl {
     }
 }

+static WRITE_COUNT: AtomicI32 = AtomicI32::new(0);
+
 impl<'e> BackendEnvironment<'e> for EnvironmentImpl {
     type Database = DatabaseImpl;
     type Error = ErrorImpl;
Flags: needinfo?(jmuizelaar)

So it turns out, that on Android, glean writes the DB every time a metric is accumulated. So the number of serializations that we're seeing makes sense and seems like it is a very real world problem.

For example, during YouTube playback I see the performance.clone associated metrics causing DB writes about 30-40 times/second.

Travis and I used Server Knobs to disable the three performance.clone.deserialize metrics on Firefox for Android release: https://experimenter.services.mozilla.com/nimbus/disable-performanceclone-metrics/summary

It wouldn't be unreasonable to also submit a code change to #ifdef these off, as there's presently a conflicting metric configuration on 10% of release that won't enroll in this.

Blocks: perf-android
Depends on: 1898515
Regressed by: 1877843
Keywords: regression
Depends on: 1898649
Blocks: 1898810
Priority: P3 → P1
Attachment #9404994 - Attachment description: Bug 1892230 - Fenix: Test delayPingLifetimeIo with a local Glean r?TravisLong! → Bug 1892230 - Fenix: Use delayPingLifetimeIo=true to batch db writes in memory r?TravisLong!

Backported for beta.

This buffers metric recordings in memory and flushes them to disk periodically.

Attachment #9405039 - Flags: approval-mozilla-beta?
Attachment #9405038 - Flags: approval-mozilla-beta?
Pushed by jrediger@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3dd749801bc9
Update to Glean v60.1.1 r=TravisLong,supply-chain-reviewers,mach-reviewers,android-reviewers,daabel
https://hg.mozilla.org/integration/autoland/rev/7bb8c93621ba
Fenix: Use delayPingLifetimeIo=true to batch db writes in memory r=TravisLong,android-reviewers,Roger,daabel
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED

Comment on attachment 9405038 [details]
Bug 1892230 - Update to Glean v60.0.1 r?TravisLong!

Merged to the beta branch before the beta to release merge

Attachment #9405038 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9405039 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
See Also: → 1900346

Backported for beta.

Original Revision: https://phabricator.services.mozilla.com/D212286

Attachment #9405258 - Flags: approval-mozilla-beta?

This buffers metric recordings in memory and flushes them to disk periodically.

Original Revision: https://phabricator.services.mozilla.com/D212288

Attachment #9405259 - Attachment is obsolete: true
Attachment #9405258 - Attachment is obsolete: true
Attachment #9405258 - Flags: approval-mozilla-beta?

^ that was an accidental addition, which I abandoned now. The correct 2 patches were approved and merged above.

Regressions: 1904702
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: