Closed Bug 1334616 Opened 9 years ago Closed 8 years ago

nsChannelClassifier::IsTrackerWhitelisted is super inefficient

Categories

(Toolkit :: Safe Browsing, defect, P1)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1341506

People

(Reporter: ehsan.akhgari, Assigned: hchang)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

See this profile: https://clptr.io/2jFVXaI What's happening is that we get to nsChannelClassifier::OnClassifyComplete on the parent process' main thread, where if we see NS_ERROR_TRACKING_URI we call IsTrackerWhitelisted() which calls ClassifyLocalWithTables which can block on a background thread. In this profile, this blocking wait takes 5 *seconds*. This is really bad, especially after bug 1334241. But it is an existing problem in private browsing windows. Francois, Gian-Carlo, do you have any idea what's going on here? Why do we need to call ClassifyLocalWithTables after classification is finished?
Flags: needinfo?(gpascutto)
Flags: needinfo?(francois)
(In reply to :Ehsan Akhgari from comment #0) > See this profile: https://clptr.io/2jFVXaI > Francois, Gian-Carlo, do you have any idea what's going on here? Why do we > need to call ClassifyLocalWithTables after classification is finished? If by "after classification is finished" you're referring to the naming of ClassifyComplete, then take note that "Complete" here stands for "Complete SHA256 hash" as opposed to "SHA 256 hash prefix", not necessarily the state of the channel. As far as I remember, the blocking was required because we don't even want to start opening a connection to a tracking server, and this was understood as having a clearly measurable pageload regression, which we ignored in private browsing because skipping the trackers meant we load pages much faster anyhow. I believe the background discussion is in https://bugzilla.mozilla.org/show_bug.cgi?id=1100024 or related bugs.
Flags: needinfo?(gpascutto)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #1) > (In reply to :Ehsan Akhgari from comment #0) > > See this profile: https://clptr.io/2jFVXaI > > Francois, Gian-Carlo, do you have any idea what's going on here? Why do we > > need to call ClassifyLocalWithTables after classification is finished? > > If by "after classification is finished" you're referring to the naming of > ClassifyComplete, then take note that "Complete" here stands for "Complete > SHA256 hash" as opposed to "SHA 256 hash prefix", not necessarily the state > of the channel. Oh, I was totally misunderstanding the meaning of the name! Thanks for correcting me. > As far as I remember, the blocking was required because we don't even want > to start opening a connection to a tracking server, and this was understood > as having a clearly measurable pageload regression, which we ignored in > private browsing because skipping the trackers meant we load pages much > faster anyhow. > > I believe the background discussion is in > https://bugzilla.mozilla.org/show_bug.cgi?id=1100024 or related bugs. Hmm, so this is essentially pausing the UI thread on a network request, if I understand bug 1100024 comment 2 correctly. Is that true?
(In reply to :Ehsan Akhgari from comment #0) > Francois, Gian-Carlo, do you have any idea what's going on here? Why do we > need to call ClassifyLocalWithTables after classification is finished? Once we find a tracker, we need to apply our entity whitelist to determine whether or not it's actually a third-party tracker (e.g. google-analytics.com on google.com is not considered third-party, thanks to that entity list). That call however is entirely local and doesn't need to go out to the network. I wonder whether it might be blocking because we're in the middle of updating the list and we have locked the whitelist.
Flags: needinfo?(francois)
(In reply to François Marier [:francois] from comment #3) > (In reply to :Ehsan Akhgari from comment #0) > > Francois, Gian-Carlo, do you have any idea what's going on here? Why do we > > need to call ClassifyLocalWithTables after classification is finished? > > Once we find a tracker, we need to apply our entity whitelist to determine > whether or not it's actually a third-party tracker (e.g. > google-analytics.com on google.com is not considered third-party, thanks to > that entity list). > > That call however is entirely local and doesn't need to go out to the > network. Good news! > I wonder whether it might be blocking because we're in the middle of > updating the list and we have locked the whitelist. That's possible... The profile unfortunately doesn't capture the worker thread so it's not possible to tell what was happening on that thread, but your hypothesis is the only one I can think of. We need to find an alternative solution though, especially if we want to ship bug 1334241 and anything depeneding on it. Is it possible to do the whitelist checking in the background as part of the initial classification? Or can we change things on the Necko side to start the classification before the connection is established, so that we can delay doing that until the TP classification is finished?
(In reply to :Ehsan Akhgari from comment #4) > We need to find an alternative solution though, especially if we want to > ship bug 1334241 and anything depeneding on it. Is it possible to do the > whitelist checking in the background as part of the initial classification? > Or can we change things on the Necko side to start the classification before > the connection is established, so that we can delay doing that until the TP > classification is finished? Maybe we could fire off IsTrackerWhitelisted() and set a flag on the channel based on that result just before returning from nsChannelClassifier::StartInternal() (http://searchfox.org/mozilla-central/rev/d20e4431d0cf40311afa797868bc5c58c54790a2/netwerk/base/nsChannelClassifier.cpp#421). We'd have to be careful about a potential race here if nsChannelClassifier::OnClassifyComplete() returns before we're done checking the whitelist. However, I'm not sure this would help. Whatever is causing ClassifyLocalWithTables() to block for 5 seconds(!) is going to stall everything no matter where it's called. Can you think of anything else we could instrument to detect where this might be blocked?
(In reply to François Marier [:francois] from comment #6) > Created attachment 8833274 [details] > Bug 1334616 - Add telemetry probe to monitor ClassifyLocalWithTables() perf. Let's add a probe to detect any future performance regressions (or improvements).
(In reply to François Marier [:francois] from comment #5) > (In reply to :Ehsan Akhgari from comment #4) > > We need to find an alternative solution though, especially if we want to > > ship bug 1334241 and anything depeneding on it. Is it possible to do the > > whitelist checking in the background as part of the initial classification? > > Or can we change things on the Necko side to start the classification before > > the connection is established, so that we can delay doing that until the TP > > classification is finished? > > Maybe we could fire off IsTrackerWhitelisted() and set a flag on the channel > based on that result just before returning from > nsChannelClassifier::StartInternal() > (http://searchfox.org/mozilla-central/rev/ > d20e4431d0cf40311afa797868bc5c58c54790a2/netwerk/base/nsChannelClassifier. > cpp#421). We'd have to be careful about a potential race here if > nsChannelClassifier::OnClassifyComplete() returns before we're done checking > the whitelist. > > However, I'm not sure this would help. Whatever is causing > ClassifyLocalWithTables() to block for 5 seconds(!) is going to stall > everything no matter where it's called. Stalling a channel load for 5 seconds is bad, sure, but it's much better than locking the parent process's main thread, which means the entire application will become unresponsive. In general doing synchronous operations like this on the parent process's main thread isn't OK because it's not possible to predict how long the operation takes. > Can you think of anything else we could instrument to detect where this > might be blocked? Unfortunately not. The url-classifier code is completely foreign to me and I don't know much about how it works... :/
Comment on attachment 8833274 [details] Bug 1334616 - Add telemetry probe to monitor ClassifyLocalWithTables() perf. https://reviewboard.mozilla.org/r/109530/#review111168 ::: toolkit/components/telemetry/Histograms.json:3985 (Diff revision 1) > "high": 120000, > "n_buckets": 30, > "bug_numbers": [1315893], > "description": "Time spent per classifier update (ms), keyed by the name of the provider." > }, > + "URLCLASSIFIER_CLWT_TIME": { A less cryptic name please. _CLASSIFYLOCAL_ is OK as that is a trivial wrapper for CLWT. ::: toolkit/components/telemetry/Histograms.json:3989 (Diff revision 1) > }, > + "URLCLASSIFIER_CLWT_TIME": { > + "alert_emails": ["safebrowsing-telemetry@mozilla.org"], > + "expires_in_version": "never", > + "kind": "exponential", > + "high": 5000, 5 seconds is what was reported, it's not necessarily the worst out there. Don't we want to know what is the worst? Given that this is exponential with plenty of buckets, you can make the maximum 30s or a minute.
Attachment #8833274 - Flags: review?(gpascutto) → review+
Assignee: nobody → hchang
Pushed by fmarier@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2e9eab35d1ea Add telemetry probe to monitor ClassifyLocalWithTables() perf. r=gcp
See Also: → 1124307
Pushed by fmarier@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d2fafcf20ceb Add telemetry probe to monitor ClassifyLocalWithTables() perf. r=gcp
Status: NEW → ASSIGNED
Priority: -- → P1
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Reopening because that patch is just the first step, Henry will be working on an actual fix.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla54 → ---
Was there any data collection review [1] on the new histogram? That is required for all new probes. 1: https://wiki.mozilla.org/Firefox/Data_Collection
(In reply to Georg Fritzsche [:gfritzsche] from comment #17) > Was there any data collection review [1] on the new histogram? Benjamin, I meant to ask for your datareview here but forgot to flag you before landing the patch. Does this look good to you? We want to measure this for two reasons: 1. so we can confirm that we have fixed the bug 2. so that we don't have a perf regression in the future
Flags: needinfo?(benjamin)
Depends on: 1338017
Yeah this is fine, the automated alerting should work for this as a scalar time metric.
Flags: needinfo?(benjamin)
Blocks: UIJank
I have a prototype at [1], which I've initially verified can do lookup and update "almost" concurrently. How I verify it is just to add a infinite loop at [2]. If update has to be entirely mutual exclusive to lookup, no channel can be opened once an update request is issued. :Ehsan, could you suggest how to verify if the original issue of this bug is fixed by [1]? We believe Bug 1338017 is "one of" the root causes but are not sure if it's the only one. It looks like you ran a profiler with "privacy.trackingprotection.annotate_channels=true". Is that the only configuration? How often do you see this kind of long delay? Do I have to profile a long while to reproduce or it happens all the time? Thanks for your help in advance :) [1] https://github.com/elefant/gecko-dev/commits/sb-background-db-update [2] http://searchfox.org/mozilla-central/rev/ea31c4b64f34a29415a69fb768f8051495547315/toolkit/components/url-classifier/Classifier.cpp#573
Flags: needinfo?(ehsan)
(In reply to Henry Chang [:henry][:hchang] from comment #20) > I have a prototype at [1], which I've initially verified can do lookup > and update "almost" concurrently. How I verify it is just to add a infinite > loop at [2]. If update has to be entirely mutual exclusive to lookup, > no channel can be opened once an update request is issued. Fantastic! > :Ehsan, could you suggest how to verify if the original issue > of this bug is fixed by [1]? We believe Bug 1338017 is "one of" > the root causes but are not sure if it's the only one. > > It looks like you ran a profiler with > > "privacy.trackingprotection.annotate_channels=true". Yes, and this pref landed on Nightly as the default in bug 1334241, so you shouldn't need to set any prefs now. > Is that the only configuration? How often do you see this kind of > long delay? Do I have to profile a long while to reproduce or it > happens all the time? Unfortunately I don't know of a way to reproduce the jank in the profile in comment 0. AFAIK we can't even necessarily verify this on Talos since we disable safebrowsing updates on Talos. I think the best that we can do is to land the fix and watch out the telemetry probe as the fix gets to the beta channel and then decide whether this is fast enough or whether more work needs to be done. Sorry I don't have a better answer.
Flags: needinfo?(ehsan)
Attached patch AsyncWhitelisting.patch (obsolete) — Splinter Review
Attachment #8839422 - Attachment is obsolete: true
(In reply to François Marier [:francois] from comment #5) > (In reply to :Ehsan Akhgari from comment #4) > > We need to find an alternative solution though, especially if we want to > > ship bug 1334241 and anything depeneding on it. Is it possible to do the > > whitelist checking in the background as part of the initial classification? > > Or can we change things on the Necko side to start the classification before > > the connection is established, so that we can delay doing that until the TP > > classification is finished? > > Maybe we could fire off IsTrackerWhitelisted() and set a flag on the channel > based on that result just before returning from > nsChannelClassifier::StartInternal() > (http://searchfox.org/mozilla-central/rev/ > d20e4431d0cf40311afa797868bc5c58c54790a2/netwerk/base/nsChannelClassifier. > cpp#421). We'd have to be careful about a potential race here if > nsChannelClassifier::OnClassifyComplete() returns before we're done checking > the whitelist. > I am still worried about the synchronous operation on the main thread so I experimentally changed the whitelist checking to async in attachment 8839424 [details] [diff] [review]. (Probably I didn't "flip" the result correctly but it should not be far from the answer) Francois, gcp, do you have any concern or comment regarding this kind of approach? I haven't tested the patch yet but wanna know if I am missing something which forces us to do it synchronously.
Flags: needinfo?(gpascutto)
Flags: needinfo?(francois)
I don't know enough about nsChannelClassifier to give useful feedback here.
Flags: needinfo?(gpascutto)
Bug 1334616 has steps to reproduce, and also a profile with the URL Classifier thread included: https://perfht.ml/2miKkrd The URL Classifier thread is busy in HashStore::WriteFile().
(In reply to Henry Chang [:henry][:hchang] from comment #24) > (In reply to François Marier [:francois] from comment #5) > > (In reply to :Ehsan Akhgari from comment #4) > > > We need to find an alternative solution though, especially if we want to > > > ship bug 1334241 and anything depeneding on it. Is it possible to do the > > > whitelist checking in the background as part of the initial classification? > > > Or can we change things on the Necko side to start the classification before > > > the connection is established, so that we can delay doing that until the TP > > > classification is finished? > > > > Maybe we could fire off IsTrackerWhitelisted() and set a flag on the channel > > based on that result just before returning from > > nsChannelClassifier::StartInternal() > > (http://searchfox.org/mozilla-central/rev/ > > d20e4431d0cf40311afa797868bc5c58c54790a2/netwerk/base/nsChannelClassifier. > > cpp#421). We'd have to be careful about a potential race here if > > nsChannelClassifier::OnClassifyComplete() returns before we're done checking > > the whitelist. > > > > I am still worried about the synchronous operation on the main thread so > I experimentally changed the whitelist checking to async in attachment > 8839424 [details] [diff] [review]. > (Probably I didn't "flip" the result correctly but it should not be far from > the answer) > > Francois, gcp, do you have any concern or comment regarding this kind of > approach? > I haven't tested the patch yet but wanna know if I am missing something which > forces us to do it synchronously. The patch implementation is totally wrong. It will block itself. I'll just file a new bug for this topic and provide revised patch.
Flags: needinfo?(francois)
Depends on: 1341506
(In reply to :Ehsan Akhgari from comment #27) > Bug 1334616 has steps to reproduce Oops, this should have been bug 1325054.
Hi Ehsan, As Bug 1341506 has landed IsTrackerWhitelisted() becomes a async call. Should we close this bug and continue to work on Bug 1338017 for the background DB update? Thanks :)
Flags: needinfo?(ehsan)
(In reply to Henry Chang [:henry][:hchang] from comment #30) > Hi Ehsan, > > As Bug 1341506 has landed IsTrackerWhitelisted() becomes a async call. > Should we close this bug and continue to work on Bug 1338017 for the > background DB update? Sure but note that bug 1325054 still remains and I think it's more important since it does affect release channel population already.
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Flags: needinfo?(ehsan)
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: