Closed
Bug 1334616
Opened 9 years ago
Closed 8 years ago
nsChannelClassifier::IsTrackerWhitelisted is super inefficient
Categories
(Toolkit :: Safe Browsing, defect, P1)
Toolkit
Safe Browsing
Tracking
()
RESOLVED
DUPLICATE
of bug 1341506
People
(Reporter: ehsan.akhgari, Assigned: hchang)
References
(Blocks 1 open bug)
Details
Attachments
(2 files, 1 obsolete file)
59 bytes,
text/x-review-board-request
|
gcp
:
review+
|
Details |
5.87 KB,
patch
|
Details | Diff | Splinter Review |
See this profile: https://clptr.io/2jFVXaI
What's happening is that we get to nsChannelClassifier::OnClassifyComplete on the parent process' main thread, where if we see NS_ERROR_TRACKING_URI we call IsTrackerWhitelisted() which calls ClassifyLocalWithTables which can block on a background thread.
In this profile, this blocking wait takes 5 *seconds*. This is really bad, especially after bug 1334241. But it is an existing problem in private browsing windows.
Francois, Gian-Carlo, do you have any idea what's going on here? Why do we need to call ClassifyLocalWithTables after classification is finished?
Flags: needinfo?(gpascutto)
Flags: needinfo?(francois)
Comment 1•9 years ago
|
||
(In reply to :Ehsan Akhgari from comment #0)
> See this profile: https://clptr.io/2jFVXaI
> Francois, Gian-Carlo, do you have any idea what's going on here? Why do we
> need to call ClassifyLocalWithTables after classification is finished?
If by "after classification is finished" you're referring to the naming of ClassifyComplete, then take note that "Complete" here stands for "Complete SHA256 hash" as opposed to "SHA 256 hash prefix", not necessarily the state of the channel.
As far as I remember, the blocking was required because we don't even want to start opening a connection to a tracking server, and this was understood as having a clearly measurable pageload regression, which we ignored in private browsing because skipping the trackers meant we load pages much faster anyhow.
I believe the background discussion is in https://bugzilla.mozilla.org/show_bug.cgi?id=1100024 or related bugs.
Flags: needinfo?(gpascutto)
Reporter | ||
Comment 2•9 years ago
|
||
(In reply to Gian-Carlo Pascutto [:gcp] from comment #1)
> (In reply to :Ehsan Akhgari from comment #0)
> > See this profile: https://clptr.io/2jFVXaI
> > Francois, Gian-Carlo, do you have any idea what's going on here? Why do we
> > need to call ClassifyLocalWithTables after classification is finished?
>
> If by "after classification is finished" you're referring to the naming of
> ClassifyComplete, then take note that "Complete" here stands for "Complete
> SHA256 hash" as opposed to "SHA 256 hash prefix", not necessarily the state
> of the channel.
Oh, I was totally misunderstanding the meaning of the name! Thanks for correcting me.
> As far as I remember, the blocking was required because we don't even want
> to start opening a connection to a tracking server, and this was understood
> as having a clearly measurable pageload regression, which we ignored in
> private browsing because skipping the trackers meant we load pages much
> faster anyhow.
>
> I believe the background discussion is in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1100024 or related bugs.
Hmm, so this is essentially pausing the UI thread on a network request, if I understand bug 1100024 comment 2 correctly. Is that true?
Comment 3•9 years ago
|
||
(In reply to :Ehsan Akhgari from comment #0)
> Francois, Gian-Carlo, do you have any idea what's going on here? Why do we
> need to call ClassifyLocalWithTables after classification is finished?
Once we find a tracker, we need to apply our entity whitelist to determine whether or not it's actually a third-party tracker (e.g. google-analytics.com on google.com is not considered third-party, thanks to that entity list).
That call however is entirely local and doesn't need to go out to the network.
I wonder whether it might be blocking because we're in the middle of updating the list and we have locked the whitelist.
Flags: needinfo?(francois)
Reporter | ||
Comment 4•9 years ago
|
||
(In reply to François Marier [:francois] from comment #3)
> (In reply to :Ehsan Akhgari from comment #0)
> > Francois, Gian-Carlo, do you have any idea what's going on here? Why do we
> > need to call ClassifyLocalWithTables after classification is finished?
>
> Once we find a tracker, we need to apply our entity whitelist to determine
> whether or not it's actually a third-party tracker (e.g.
> google-analytics.com on google.com is not considered third-party, thanks to
> that entity list).
>
> That call however is entirely local and doesn't need to go out to the
> network.
Good news!
> I wonder whether it might be blocking because we're in the middle of
> updating the list and we have locked the whitelist.
That's possible... The profile unfortunately doesn't capture the worker thread so it's not possible to tell what was happening on that thread, but your hypothesis is the only one I can think of.
We need to find an alternative solution though, especially if we want to ship bug 1334241 and anything depeneding on it. Is it possible to do the whitelist checking in the background as part of the initial classification? Or can we change things on the Necko side to start the classification before the connection is established, so that we can delay doing that until the TP classification is finished?
Comment 5•9 years ago
|
||
(In reply to :Ehsan Akhgari from comment #4)
> We need to find an alternative solution though, especially if we want to
> ship bug 1334241 and anything depeneding on it. Is it possible to do the
> whitelist checking in the background as part of the initial classification?
> Or can we change things on the Necko side to start the classification before
> the connection is established, so that we can delay doing that until the TP
> classification is finished?
Maybe we could fire off IsTrackerWhitelisted() and set a flag on the channel based on that result just before returning from nsChannelClassifier::StartInternal() (http://searchfox.org/mozilla-central/rev/d20e4431d0cf40311afa797868bc5c58c54790a2/netwerk/base/nsChannelClassifier.cpp#421). We'd have to be careful about a potential race here if nsChannelClassifier::OnClassifyComplete() returns before we're done checking the whitelist.
However, I'm not sure this would help. Whatever is causing ClassifyLocalWithTables() to block for 5 seconds(!) is going to stall everything no matter where it's called.
Can you think of anything else we could instrument to detect where this might be blocked?
Comment hidden (mozreview-request) |
Comment 7•9 years ago
|
||
(In reply to François Marier [:francois] from comment #6)
> Created attachment 8833274 [details]
> Bug 1334616 - Add telemetry probe to monitor ClassifyLocalWithTables() perf.
Let's add a probe to detect any future performance regressions (or improvements).
Reporter | ||
Comment 8•9 years ago
|
||
(In reply to François Marier [:francois] from comment #5)
> (In reply to :Ehsan Akhgari from comment #4)
> > We need to find an alternative solution though, especially if we want to
> > ship bug 1334241 and anything depeneding on it. Is it possible to do the
> > whitelist checking in the background as part of the initial classification?
> > Or can we change things on the Necko side to start the classification before
> > the connection is established, so that we can delay doing that until the TP
> > classification is finished?
>
> Maybe we could fire off IsTrackerWhitelisted() and set a flag on the channel
> based on that result just before returning from
> nsChannelClassifier::StartInternal()
> (http://searchfox.org/mozilla-central/rev/
> d20e4431d0cf40311afa797868bc5c58c54790a2/netwerk/base/nsChannelClassifier.
> cpp#421). We'd have to be careful about a potential race here if
> nsChannelClassifier::OnClassifyComplete() returns before we're done checking
> the whitelist.
>
> However, I'm not sure this would help. Whatever is causing
> ClassifyLocalWithTables() to block for 5 seconds(!) is going to stall
> everything no matter where it's called.
Stalling a channel load for 5 seconds is bad, sure, but it's much better than locking the parent process's main thread, which means the entire application will become unresponsive. In general doing synchronous operations like this on the parent process's main thread isn't OK because it's not possible to predict how long the operation takes.
> Can you think of anything else we could instrument to detect where this
> might be blocked?
Unfortunately not. The url-classifier code is completely foreign to me and I don't know much about how it works... :/
Comment 9•9 years ago
|
||
mozreview-review |
Comment on attachment 8833274 [details]
Bug 1334616 - Add telemetry probe to monitor ClassifyLocalWithTables() perf.
https://reviewboard.mozilla.org/r/109530/#review111168
::: toolkit/components/telemetry/Histograms.json:3985
(Diff revision 1)
> "high": 120000,
> "n_buckets": 30,
> "bug_numbers": [1315893],
> "description": "Time spent per classifier update (ms), keyed by the name of the provider."
> },
> + "URLCLASSIFIER_CLWT_TIME": {
A less cryptic name please. _CLASSIFYLOCAL_ is OK as that is a trivial wrapper for CLWT.
::: toolkit/components/telemetry/Histograms.json:3989
(Diff revision 1)
> },
> + "URLCLASSIFIER_CLWT_TIME": {
> + "alert_emails": ["safebrowsing-telemetry@mozilla.org"],
> + "expires_in_version": "never",
> + "kind": "exponential",
> + "high": 5000,
5 seconds is what was reported, it's not necessarily the worst out there. Don't we want to know what is the worst? Given that this is exponential with plenty of buckets, you can make the maximum 30s or a minute.
Attachment #8833274 -
Flags: review?(gpascutto) → review+
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → hchang
Comment hidden (mozreview-request) |
Comment 11•9 years ago
|
||
Pushed by fmarier@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2e9eab35d1ea
Add telemetry probe to monitor ClassifyLocalWithTables() perf. r=gcp
Comment 12•9 years ago
|
||
Comment hidden (mozreview-request) |
Comment 14•9 years ago
|
||
Pushed by fmarier@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d2fafcf20ceb
Add telemetry probe to monitor ClassifyLocalWithTables() perf. r=gcp
Updated•9 years ago
|
Status: NEW → ASSIGNED
Priority: -- → P1
Comment 15•9 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
status-firefox54:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment 16•9 years ago
|
||
Reopening because that patch is just the first step, Henry will be working on an actual fix.
Status: RESOLVED → REOPENED
status-firefox54:
fixed → ---
Resolution: FIXED → ---
Target Milestone: mozilla54 → ---
Comment 17•9 years ago
|
||
Was there any data collection review [1] on the new histogram?
That is required for all new probes.
1: https://wiki.mozilla.org/Firefox/Data_Collection
Comment 18•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #17)
> Was there any data collection review [1] on the new histogram?
Benjamin, I meant to ask for your datareview here but forgot to flag you before landing the patch. Does this look good to you?
We want to measure this for two reasons:
1. so we can confirm that we have fixed the bug
2. so that we don't have a perf regression in the future
Flags: needinfo?(benjamin)
Comment 19•9 years ago
|
||
Yeah this is fine, the automated alerting should work for this as a scalar time metric.
Flags: needinfo?(benjamin)
Assignee | ||
Comment 20•9 years ago
|
||
I have a prototype at [1], which I've initially verified can do lookup
and update "almost" concurrently. How I verify it is just to add a infinite
loop at [2]. If update has to be entirely mutual exclusive to lookup,
no channel can be opened once an update request is issued.
:Ehsan, could you suggest how to verify if the original issue
of this bug is fixed by [1]? We believe Bug 1338017 is "one of"
the root causes but are not sure if it's the only one.
It looks like you ran a profiler with
"privacy.trackingprotection.annotate_channels=true".
Is that the only configuration? How often do you see this kind of
long delay? Do I have to profile a long while to reproduce or it
happens all the time?
Thanks for your help in advance :)
[1] https://github.com/elefant/gecko-dev/commits/sb-background-db-update
[2] http://searchfox.org/mozilla-central/rev/ea31c4b64f34a29415a69fb768f8051495547315/toolkit/components/url-classifier/Classifier.cpp#573
Flags: needinfo?(ehsan)
Reporter | ||
Comment 21•9 years ago
|
||
(In reply to Henry Chang [:henry][:hchang] from comment #20)
> I have a prototype at [1], which I've initially verified can do lookup
> and update "almost" concurrently. How I verify it is just to add a infinite
> loop at [2]. If update has to be entirely mutual exclusive to lookup,
> no channel can be opened once an update request is issued.
Fantastic!
> :Ehsan, could you suggest how to verify if the original issue
> of this bug is fixed by [1]? We believe Bug 1338017 is "one of"
> the root causes but are not sure if it's the only one.
>
> It looks like you ran a profiler with
>
> "privacy.trackingprotection.annotate_channels=true".
Yes, and this pref landed on Nightly as the default in bug 1334241, so you shouldn't need to set any prefs now.
> Is that the only configuration? How often do you see this kind of
> long delay? Do I have to profile a long while to reproduce or it
> happens all the time?
Unfortunately I don't know of a way to reproduce the jank in the profile in comment 0. AFAIK we can't even necessarily verify this on Talos since we disable safebrowsing updates on Talos. I think the best that we can do is to land the fix and watch out the telemetry probe as the fix gets to the beta channel and then decide whether this is fast enough or whether more work needs to be done. Sorry I don't have a better answer.
Flags: needinfo?(ehsan)
Assignee | ||
Comment 22•8 years ago
|
||
Assignee | ||
Comment 23•8 years ago
|
||
Attachment #8839422 -
Attachment is obsolete: true
Assignee | ||
Comment 24•8 years ago
|
||
(In reply to François Marier [:francois] from comment #5)
> (In reply to :Ehsan Akhgari from comment #4)
> > We need to find an alternative solution though, especially if we want to
> > ship bug 1334241 and anything depeneding on it. Is it possible to do the
> > whitelist checking in the background as part of the initial classification?
> > Or can we change things on the Necko side to start the classification before
> > the connection is established, so that we can delay doing that until the TP
> > classification is finished?
>
> Maybe we could fire off IsTrackerWhitelisted() and set a flag on the channel
> based on that result just before returning from
> nsChannelClassifier::StartInternal()
> (http://searchfox.org/mozilla-central/rev/
> d20e4431d0cf40311afa797868bc5c58c54790a2/netwerk/base/nsChannelClassifier.
> cpp#421). We'd have to be careful about a potential race here if
> nsChannelClassifier::OnClassifyComplete() returns before we're done checking
> the whitelist.
>
I am still worried about the synchronous operation on the main thread so
I experimentally changed the whitelist checking to async in attachment 8839424 [details] [diff] [review].
(Probably I didn't "flip" the result correctly but it should not be far from
the answer)
Francois, gcp, do you have any concern or comment regarding this kind of approach?
I haven't tested the patch yet but wanna know if I am missing something which
forces us to do it synchronously.
Flags: needinfo?(gpascutto)
Flags: needinfo?(francois)
Comment 25•8 years ago
|
||
I don't know enough about nsChannelClassifier to give useful feedback here.
Flags: needinfo?(gpascutto)
Reporter | ||
Comment 27•8 years ago
|
||
Bug 1334616 has steps to reproduce, and also a profile with the URL Classifier thread included: https://perfht.ml/2miKkrd
The URL Classifier thread is busy in HashStore::WriteFile().
Assignee | ||
Comment 28•8 years ago
|
||
(In reply to Henry Chang [:henry][:hchang] from comment #24)
> (In reply to François Marier [:francois] from comment #5)
> > (In reply to :Ehsan Akhgari from comment #4)
> > > We need to find an alternative solution though, especially if we want to
> > > ship bug 1334241 and anything depeneding on it. Is it possible to do the
> > > whitelist checking in the background as part of the initial classification?
> > > Or can we change things on the Necko side to start the classification before
> > > the connection is established, so that we can delay doing that until the TP
> > > classification is finished?
> >
> > Maybe we could fire off IsTrackerWhitelisted() and set a flag on the channel
> > based on that result just before returning from
> > nsChannelClassifier::StartInternal()
> > (http://searchfox.org/mozilla-central/rev/
> > d20e4431d0cf40311afa797868bc5c58c54790a2/netwerk/base/nsChannelClassifier.
> > cpp#421). We'd have to be careful about a potential race here if
> > nsChannelClassifier::OnClassifyComplete() returns before we're done checking
> > the whitelist.
> >
>
> I am still worried about the synchronous operation on the main thread so
> I experimentally changed the whitelist checking to async in attachment
> 8839424 [details] [diff] [review].
> (Probably I didn't "flip" the result correctly but it should not be far from
> the answer)
>
> Francois, gcp, do you have any concern or comment regarding this kind of
> approach?
> I haven't tested the patch yet but wanna know if I am missing something which
> forces us to do it synchronously.
The patch implementation is totally wrong. It will block itself.
I'll just file a new bug for this topic and provide revised patch.
Flags: needinfo?(francois)
Reporter | ||
Comment 29•8 years ago
|
||
(In reply to :Ehsan Akhgari from comment #27)
> Bug 1334616 has steps to reproduce
Oops, this should have been bug 1325054.
Assignee | ||
Comment 30•8 years ago
|
||
Hi Ehsan,
As Bug 1341506 has landed IsTrackerWhitelisted() becomes a async call.
Should we close this bug and continue to work on Bug 1338017 for the
background DB update?
Thanks :)
Flags: needinfo?(ehsan)
Reporter | ||
Comment 31•8 years ago
|
||
(In reply to Henry Chang [:henry][:hchang] from comment #30)
> Hi Ehsan,
>
> As Bug 1341506 has landed IsTrackerWhitelisted() becomes a async call.
> Should we close this bug and continue to work on Bug 1338017 for the
> background DB update?
Sure but note that bug 1325054 still remains and I think it's more important since it does affect release channel population already.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Flags: needinfo?(ehsan)
Resolution: --- → DUPLICATE
You need to log in
before you can comment on or make changes to this bug.
Description
•