URL classifier startup takes too long (90ms in nsUrlClassifierDBServiceWorker::OpenDb)
Categories
(Toolkit :: Safe Browsing, defect, P2)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox140 | --- | fixed |
People
(Reporter: mstange, Assigned: timhuang)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
Profile: https://share.firefox.dev/3Y6f1UG
Opening the URL classifier database takes too long. The work in mozilla::safebrowsing::Classifier::Open takes 90ms on a Pixel 6. This has a large impact on our applink startup times (bug 1945906) because it delays the initial network request for the app link URL.
The profile shows that the time is spent in two places:
- Loading the prefix list into memory hits malloc lock contention because it creates lots of small allocations. Can this be one big allocation instead?
- The other half of the time is spent copying the file contents into memory a second time, and computing the CRC32 on those contents.
Is the CRC32 verification necessary? Does it need to happen on every startup? On Android, users switch apps frequently so it's common that Firefox gets killed in the background and has to be restarted when the user switches back to it.
| Reporter | ||
Comment 1•11 months ago
|
||
I've filed bug 1956923 with a suggestion that would absorb the slow initialization time into other delays, but in the long run we should just make initialization faster.
Comment 2•10 months ago
|
||
The severity field is not set for this bug.
:dimi, could you have a look please?
For more information, please visit BugBot documentation.
Comment 3•10 months ago
|
||
(In reply to Markus Stange [:mstange] from comment #0)
- The other half of the time is spent copying the file contents into memory a second time, and computing the CRC32 on those contents.
Is the CRC32 verification necessary?
If my understanding is correct, we originally implemented CRC32 verification to avoid cases where, if Firefox crashes while updating or saving SafeBrowsing data, we can ensure the file will be re-downloaded at the next startup.
Currently, we don't have telemetry to know how common this problem is. I think we should add telemetry for it, but until we have that data, keeping the CRC32 verification might be the safer choice.
Does it need to happen on every startup? On Android, users switch apps frequently so it's common that Firefox gets killed in the background and has to be restarted when the user switches back to it.
@Tim and i are thinking using a faster algorithm, xxhash for example. Do you think changing the algorithm would improve the performance?
Comment 4•10 months ago
|
||
How big is the file?
Comment 5•10 months ago
|
||
An easy win would be to just the ARM64 CRC instructions: https://github.com/google/crc32c/blob/main/src/crc32c_arm64.cc
Comment 6•10 months ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #4)
How big is the file?
There are multiple files. The largest one is around ~10 MB, and the others are around a few hundred KB each. There are about 20 files in total.
Comment 7•10 months ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #5)
An easy win would be to just the ARM64 CRC instructions: https://github.com/google/crc32c/blob/main/src/crc32c_arm64.cc
This is a great suggestion, we will move forward with this option.
| Reporter | ||
Comment 8•10 months ago
|
||
Sure, faster CRC32 is something that may help a bit but it won't fix this bug. I think it should be handled in a new dependent bug.
I haven't seen a response to the first question yet. Can we change the in-memory representation so that it's one big allocation rather than many small allocations?
(In reply to Dimi Lee [:dimi] from comment #3)
Is the CRC32 verification necessary?
If my understanding is correct, we originally implemented CRC32 verification to avoid cases where, if Firefox crashes while updating or saving SafeBrowsing data, we can ensure the file will be re-downloaded at the next startup.
Isn't this problem usually avoided by writing out to a different filename and then moving the file to the final spot once all the data has been written? Or if we really want the CRC32 check, could we do it once after we change the file, rather than on every startup when we should be able to know that the file hasn't changed since the last check?
Comment 9•10 months ago
|
||
(In reply to Markus Stange [:mstange] from comment #8)
Sure, faster CRC32 is something that may help a bit but it won't fix this bug. I think it should be handled in a new dependent bug.
I haven't seen a response to the first question yet. Can we change the in-memory representation so that it's one big allocation rather than many small allocations?
sorry i forgot to answer this one!
So we intentionally change the in-memory representation from one big-array to array of array in Bug 1046038.
And i did update the algorithm so we only use array of array for tables with a lot of prefixes in this patch.
From the comment in Bug 1046037, we had OOM issue with 1-2M allocations (11 years ago). The entries we have is even more than 11 years ago. Taking phishing table for example, the table has around 1M entries, so we would probably need 4MB allocation with a flat array. Will that be a problem on mobile?
We might also test whether one flat array does improve the performance by setting this pref to 2 * 1024 * 1024 on mobile.
Isn't this problem usually avoided by writing out to a different filename and then moving the file to the final spot once all the data has been written? Or if we really want the CRC32 check, could we do it once after we change the file, rather than on every startup when we should be able to know that the file hasn't changed since the last check?
yes, you are right. I misread those bugs we added/updated the checksum implementation :(
We indeed move the update files to the final spot when update is done to address the crash during update problem. The checksum was implemented when we move the storage from sqlitedb to file 13 years ago and it was to prevent file corruption.
So to answer this question again - "Is the CRC32 verification necessary? Does it need to happen on every startup?"
I'd prefer not getting rid of the integrity check entirely, to avoid the case where a really unlucky user gets a bad file unnoticed and we’re never able to recover it for them. But I do think we can avoid doing the integrity check during every startup. Here's what I'd suggest:
Since SafeBrowsing updates every 30 or 45 minutes, I think we can do the integrity check only the first time we update in a session - meaning that even if the file is corrupted, it only stays corrupted for 45 minutes. I think the chance that users are not warned about a phishing site, or that a safe site is marked as phishing due to file corruption, is fairly low.
Updated•9 months ago
|
| Assignee | ||
Comment 10•9 months ago
|
||
To improve the startup performance, we delay the CRC32 verfication to
the time before we send the list update to the SafeBrowsing server.
Comment 11•9 months ago
|
||
Comment 12•9 months ago
|
||
Comment 13•9 months ago
|
||
Revert "Bug 1956920 - Delay the CRC32 verification on .vlp files to the list update. r=dimi" for causing multiple failures.
Comment 14•9 months ago
|
||
| Assignee | ||
Updated•9 months ago
|
Comment 15•9 months ago
|
||
| bugherder | ||
| Assignee | ||
Comment 16•9 months ago
|
||
Hi Markus,
In this bug, we introduced a pref urlclassifier.delay_prefixes_crc32_check to delay the CRC32 check. The delay is currently disabled by the pref. Would you be able to test if enabling this pref will improve the performance? Thanks.
Updated•9 months ago
|
Comment 17•8 months ago
|
||
ni myself to measure the impact of delaying the crc32_check, urlclassifier.delay_prefixes_crc32_check
Comment 18•8 months ago
|
||
The results from combining early initialization of the URL classifier along with the urlclassifier.delay_prefixes_crc32_check pref are looking very promising: https://bugzilla.mozilla.org/show_bug.cgi?id=1956356#c19
I'll separate those out next.
| Reporter | ||
Comment 19•8 months ago
|
||
(In reply to Tim Huang[:timhuang] from comment #16)
Hi Markus,
In this bug, we introduced a pref
urlclassifier.delay_prefixes_crc32_checkto delay the CRC32 check. The delay is currently disabled by the pref. Would you be able to test if enabling this pref will improve the performance? Thanks.
Thanks Tim! And sorry for the delay here. Andrew will be taking a look at validating the performance impact.
Could you file follow-up bugs about the other source of slowness identified in the profiles, and about flipping this pref to true by default? This bug got closed when the off-by-default-pref patch landed. It would have been better to land that patch in a separate bug.
Comment 20•8 months ago
|
||
Quick update -- we've discovered some bugs in our CI test scores, bug 1970614 and bug 1972093, which are delaying the evaluation of this change.
Comment 21•6 months ago
|
||
Still having problems evaluating the impact of enabling this pref, but have not forgotten it.
Comment 22•6 months ago
|
||
Apologies for the delays.
We're seeing reproducible improvements to Fenix startup with urlclassifier.delay_prefixes_crc32_check enabled.
~8 to 24ms on newssite-applink-startup applink_startup
and ~6 to 60ms on tab-restore-shopify, (Please ignore the Android p6 results as they are a test issue, bug 1985798)
Startup CPU times looks to also have improved.
I think we should enable urlclassifier.delay_prefixes_crc32_check, bug 1971949.
(Leaving my ni as Tim is on PTO and not accepting ni's).
Updated•4 months ago
|
Comment 23•3 months ago
|
||
Tim,
We believe that enabling urlclassifier.delay_prefixes_crc32_check would be beneficial to Fenix cold start applink performance (based on profiling and CI tests, Comment 22).
We have telemetry in place to observe any field impact.
Would you be able to enable the urlclassifier.delay_prefixes_crc32_check pref ?
| Assignee | ||
Comment 24•3 months ago
|
||
Sur, I will work on Bug 1971949 to enable this for mobile.
Comment 25•3 months ago
|
||
Thank you, I'll keep my eyes open for perf alerts and changes to telemetry.
Comment 26•3 months ago
|
||
I ran a perf compare of this change, bug 1971949 and it looks good.
a55, newssite-applink-startup
Difference of means: -1.75% (-37.9 ms)
Difference of medians: -0.5% (-10.76 ms)
pixel 6, newssite-applink-startup
Difference of means: -2.29% (-37.46 ms)
Difference of medians: -2.3% (-36.9 ms)
And in a higher variance test
android-hw-p6-13-0-aarch64-shippable
Difference of means: -4.27% (-75.88 ms)
Difference of medians: -6.13% (-111.34 ms)
Comment 27•3 months ago
|
||
And while we cannot directly tie the impact of this change to our telemetry, since landing we are seeing median time to request start for cold applinks drop from 778ms to 605ms, which is by far the lowest value recorded.
Comment 28•3 months ago
|
||
(In reply to Andrew Creskey [:acreskey] from comment #27)
And while we cannot directly tie the impact of this change to our telemetry, since landing we are seeing median
time to request startfor cold applinks drop from778msto605ms, which is by far the lowest value recorded.
I've taken a closer look at the telemetry and its implementation and I believe the measured improvements in time to request start are valid.
More views here
However we've been landing a series of improvements over the past few weeks and it's not possible to pinpoint which improvements came from which changes without reverting patches. (Which we did initially intend to do, but we are now instead focusing on continued improvements).
Description
•