Glean serializes its entire database thousands of times during startup
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(firefox127 fixed, firefox128 fixed)
People
(Reporter: jrmuizel, Assigned: janerik)
References
(Blocks 4 open bugs, Regression)
Details
(Keywords: regression)
Attachments
(4 files, 2 obsolete files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
See https://share.firefox.dev/4d4f5tY
This is happening in rkv::backend::impl_safe::environment::EnvironmentImpl::write_to_disk
because the bincode serialization of the database seems to do a lot of allocation.
It spends about 400ms doing this which causes resources contention (malloc lock and cpu time) for other parts of startup.
Reporter | ||
Updated•1 year ago
|
Reporter | ||
Updated•1 year ago
|
Reporter | ||
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Reporter | ||
Comment 1•1 year ago
|
||
Jan-Erik, is it possible for us to do the stuff that happens at startup in a single transaction so we don't have to re-serialize multiple times?
Assignee | ||
Comment 2•1 year ago
|
||
Not with how the current things are set up. rkv
is very suboptimal in that regard and we don't really use "transactions".
I might have an idea to lessen it a bit though (right now we commit
very fine granularly, we might be able to make that a bit more broad).
Reporter | ||
Comment 3•1 year ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #2)
Not with how the current things are set up.
rkv
is very suboptimal in that regard and we don't really use "transactions"
Can you elaborate why rkv
is suboptimal in that regard for the sake of my curiosity?
Assignee | ||
Comment 4•1 year ago
|
||
It's current "safe-mode" database mode is keeping a hashmap of everything in memory and serializing that on disk for persistence on every write.
Updated•1 year ago
|
Comment 5•1 year ago
|
||
serializing that on disk for persistence on every write.
Should this be the case or can we somehow let it happen less frequently?
Assignee | ||
Comment 6•1 year ago
|
||
Glean essentially relies on this persistence right now. We would need to significantly overhaul the backend, or bring in an already existing storage that does this better.
For Glean I've also been exploring using sqlite directly instead.
Assignee | ||
Updated•1 year ago
|
Reporter | ||
Comment 7•1 year ago
|
||
Here's a profile that shows this even worse:
https://share.firefox.dev/3WUbWrp
Glean is keeping a CPU at 100% for 2.1 seconds during startup.
Reporter | ||
Comment 8•1 year ago
|
||
It looks like that majority of these large times are caused by the number of serializations to disk. In this profile https://share.firefox.dev/4bBUJqJ Glean serializes the whole database 6123 times during startup.
Reporter | ||
Updated•1 year ago
|
Reporter | ||
Comment 9•1 year ago
|
||
It seems like this number can grow without bound? I now get 23,618 serializations during startup. Do you know of a way to reset the count to 0?
Assignee | ||
Comment 10•1 year ago
|
||
I'm looking into this right now. I need to understand which metrics are actually recorded there.
Where in the profile do you see the number of total serializations it does?
Reporter | ||
Comment 11•1 year ago
|
||
I patched rkv to get a count with a patch like:
diff --git a/src/backend/impl_safe/environment.rs b/src/backend/impl_safe/environment.rs
index f9c44cf..d36b892 100644
--- a/src/backend/impl_safe/environment.rs
+++ b/src/backend/impl_safe/environment.rs
@@ -14,7 +14,7 @@ use std::{
fs,
ops::DerefMut,
path::{Path, PathBuf},
- sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard},
+ sync::{atomic::AtomicI32, Arc, RwLock, RwLockReadGuard, RwLockWriteGuard},
};
use id_arena::Arena;
@@ -236,9 +236,14 @@ impl EnvironmentImpl {
path.to_mut().push(DEFAULT_DB_FILENAME);
};
+ let count = WRITE_COUNT.fetch_add(1, std::sync::atomic::Ordering::SeqCst);
+ let txt_path = path.with_extension(format!("txt{}", count));
+
// Write to a temp file first.
let tmp_path = path.with_extension("tmp");
- fs::write(&tmp_path, self.serialize()?)?;
+ let data = self.serialize()?;
+ fs::write(&txt_path, format!("{}", data.len()));
+ fs::write(&tmp_path, data)?;
// Atomically move that file to the database file.
fs::rename(tmp_path, path)?;
@@ -254,6 +259,8 @@ impl EnvironmentImpl {
}
}
+static WRITE_COUNT: AtomicI32 = AtomicI32::new(0);
+
impl<'e> BackendEnvironment<'e> for EnvironmentImpl {
type Database = DatabaseImpl;
type Error = ErrorImpl;
Reporter | ||
Comment 12•1 year ago
|
||
So it turns out, that on Android, glean writes the DB every time a metric is accumulated. So the number of serializations that we're seeing makes sense and seems like it is a very real world problem.
For example, during YouTube playback I see the performance.clone associated metrics causing DB writes about 30-40 times/second.
Comment 13•1 year ago
|
||
Travis and I used Server Knobs to disable the three performance.clone.deserialize
metrics on Firefox for Android release: https://experimenter.services.mozilla.com/nimbus/disable-performanceclone-metrics/summary
It wouldn't be unreasonable to also submit a code change to #ifdef
these off, as there's presently a conflicting metric configuration on 10% of release that won't enroll in this.
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 14•1 year ago
|
||
Assignee | ||
Comment 15•1 year ago
|
||
Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 16•1 year ago
|
||
Backported for beta.
Assignee | ||
Comment 17•1 year ago
|
||
This buffers metric recordings in memory and flushes them to disk periodically.
Updated•1 year ago
|
Updated•1 year ago
|
Comment 18•1 year ago
|
||
Comment 19•1 year ago
|
||
bugherder |
Updated•1 year ago
|
Comment 20•1 year ago
|
||
uplift |
Comment 21•1 year ago
|
||
Comment on attachment 9405038 [details]
Bug 1892230 - Update to Glean v60.0.1 r?TravisLong!
Merged to the beta branch before the beta to release merge
Updated•1 year ago
|
Assignee | ||
Comment 22•1 year ago
|
||
Backported for beta.
Original Revision: https://phabricator.services.mozilla.com/D212286
Updated•1 year ago
|
Assignee | ||
Comment 23•1 year ago
|
||
This buffers metric recordings in memory and flushes them to disk periodically.
Original Revision: https://phabricator.services.mozilla.com/D212288
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 24•1 year ago
|
||
^ that was an accidental addition, which I abandoned now. The correct 2 patches were approved and merged above.
Description
•