Glean serializes its entire database thousands of times during startup
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(firefox127 fixed, firefox128 fixed)
People
(Reporter: jrmuizel, Assigned: janerik)
References
(Blocks 4 open bugs, Regression)
Details
(Keywords: regression)
Attachments
(4 files, 2 obsolete files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
See https://share.firefox.dev/4d4f5tY
This is happening in rkv::backend::impl_safe::environment::EnvironmentImpl::write_to_disk
because the bincode serialization of the database seems to do a lot of allocation.
It spends about 400ms doing this which causes resources contention (malloc lock and cpu time) for other parts of startup.
Reporter | ||
Updated•5 months ago
|
Reporter | ||
Updated•5 months ago
|
Reporter | ||
Updated•5 months ago
|
Updated•5 months ago
|
Updated•5 months ago
|
Updated•5 months ago
|
Reporter | ||
Comment 1•5 months ago
|
||
Jan-Erik, is it possible for us to do the stuff that happens at startup in a single transaction so we don't have to re-serialize multiple times?
Assignee | ||
Comment 2•5 months ago
|
||
Not with how the current things are set up. rkv
is very suboptimal in that regard and we don't really use "transactions".
I might have an idea to lessen it a bit though (right now we commit
very fine granularly, we might be able to make that a bit more broad).
Reporter | ||
Comment 3•5 months ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #2)
Not with how the current things are set up.
rkv
is very suboptimal in that regard and we don't really use "transactions"
Can you elaborate why rkv
is suboptimal in that regard for the sake of my curiosity?
Assignee | ||
Comment 4•5 months ago
|
||
It's current "safe-mode" database mode is keeping a hashmap of everything in memory and serializing that on disk for persistence on every write.
Updated•5 months ago
|
serializing that on disk for persistence on every write.
Should this be the case or can we somehow let it happen less frequently?
Assignee | ||
Comment 6•5 months ago
|
||
Glean essentially relies on this persistence right now. We would need to significantly overhaul the backend, or bring in an already existing storage that does this better.
For Glean I've also been exploring using sqlite directly instead.
Assignee | ||
Updated•5 months ago
|
Reporter | ||
Comment 7•5 months ago
|
||
Here's a profile that shows this even worse:
https://share.firefox.dev/3WUbWrp
Glean is keeping a CPU at 100% for 2.1 seconds during startup.
Reporter | ||
Comment 8•5 months ago
|
||
It looks like that majority of these large times are caused by the number of serializations to disk. In this profile https://share.firefox.dev/4bBUJqJ Glean serializes the whole database 6123 times during startup.
Reporter | ||
Updated•5 months ago
|
Reporter | ||
Comment 9•4 months ago
|
||
It seems like this number can grow without bound? I now get 23,618 serializations during startup. Do you know of a way to reset the count to 0?
Assignee | ||
Comment 10•4 months ago
|
||
I'm looking into this right now. I need to understand which metrics are actually recorded there.
Where in the profile do you see the number of total serializations it does?
Reporter | ||
Comment 11•4 months ago
|
||
I patched rkv to get a count with a patch like:
diff --git a/src/backend/impl_safe/environment.rs b/src/backend/impl_safe/environment.rs
index f9c44cf..d36b892 100644
--- a/src/backend/impl_safe/environment.rs
+++ b/src/backend/impl_safe/environment.rs
@@ -14,7 +14,7 @@ use std::{
fs,
ops::DerefMut,
path::{Path, PathBuf},
- sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard},
+ sync::{atomic::AtomicI32, Arc, RwLock, RwLockReadGuard, RwLockWriteGuard},
};
use id_arena::Arena;
@@ -236,9 +236,14 @@ impl EnvironmentImpl {
path.to_mut().push(DEFAULT_DB_FILENAME);
};
+ let count = WRITE_COUNT.fetch_add(1, std::sync::atomic::Ordering::SeqCst);
+ let txt_path = path.with_extension(format!("txt{}", count));
+
// Write to a temp file first.
let tmp_path = path.with_extension("tmp");
- fs::write(&tmp_path, self.serialize()?)?;
+ let data = self.serialize()?;
+ fs::write(&txt_path, format!("{}", data.len()));
+ fs::write(&tmp_path, data)?;
// Atomically move that file to the database file.
fs::rename(tmp_path, path)?;
@@ -254,6 +259,8 @@ impl EnvironmentImpl {
}
}
+static WRITE_COUNT: AtomicI32 = AtomicI32::new(0);
+
impl<'e> BackendEnvironment<'e> for EnvironmentImpl {
type Database = DatabaseImpl;
type Error = ErrorImpl;
Reporter | ||
Comment 12•4 months ago
|
||
So it turns out, that on Android, glean writes the DB every time a metric is accumulated. So the number of serializations that we're seeing makes sense and seems like it is a very real world problem.
For example, during YouTube playback I see the performance.clone associated metrics causing DB writes about 30-40 times/second.
Comment 13•4 months ago
|
||
Travis and I used Server Knobs to disable the three performance.clone.deserialize
metrics on Firefox for Android release: https://experimenter.services.mozilla.com/nimbus/disable-performanceclone-metrics/summary
It wouldn't be unreasonable to also submit a code change to #ifdef
these off, as there's presently a conflicting metric configuration on 10% of release that won't enroll in this.
Updated•4 months ago
|
Updated•4 months ago
|
Assignee | ||
Comment 14•4 months ago
|
||
Assignee | ||
Comment 15•4 months ago
|
||
Assignee | ||
Updated•4 months ago
|
Updated•4 months ago
|
Assignee | ||
Comment 16•4 months ago
|
||
Backported for beta.
Assignee | ||
Comment 17•4 months ago
|
||
This buffers metric recordings in memory and flushes them to disk periodically.
Updated•4 months ago
|
Updated•4 months ago
|
Comment 18•4 months ago
|
||
Comment 19•4 months ago
|
||
bugherder |
Updated•4 months ago
|
Comment 20•4 months ago
|
||
uplift |
Comment 21•4 months ago
|
||
Comment on attachment 9405038 [details]
Bug 1892230 - Update to Glean v60.0.1 r?TravisLong!
Merged to the beta branch before the beta to release merge
Updated•4 months ago
|
Assignee | ||
Comment 22•4 months ago
|
||
Backported for beta.
Original Revision: https://phabricator.services.mozilla.com/D212286
Updated•4 months ago
|
Assignee | ||
Comment 23•4 months ago
|
||
This buffers metric recordings in memory and flushes them to disk periodically.
Original Revision: https://phabricator.services.mozilla.com/D212288
Updated•4 months ago
|
Updated•4 months ago
|
Assignee | ||
Comment 24•4 months ago
|
||
^ that was an accidental addition, which I abandoned now. The correct 2 patches were approved and merged above.
Description
•