Closed Bug 1122691 Opened 9 years ago Closed 9 years ago

3.21% linux* tp5 regression on inbound (v.38) Jan 14th from push 25d9ba8ceb7a

Categories

(Core :: DOM: Security, defect)

All
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla38

People

(Reporter: jmaher, Assigned: mmc)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(1 file, 5 obsolete files)

:mmc, can you look into this and help us determine if this is expected or unexpected.  I would like to get a decision made on this by the 21st where we either:
* backout
* have an idea for a fix and a general timeline to work on it
* understand why this caused a regression and accept it
Flags: needinfo?(mmc)
Joel, could you tell me if safebrowsing (browser.safebrowsing.enabled, browser.safebrowsing.malware.enabled) is intended to be used on talos? The changes that you point to should only affect talos if safebrowsing is on, and even then only minimally.

Also if you could tell me if network predictive actions (network.predictor.enable) are enabled on talos?
Flags: needinfo?(mmc) → needinfo?(jmaher)
https://hg.mozilla.org/integration/mozilla-inbound/rev/25d9ba8ceb7a#l6.121

Isn't this doing the extra thread pingpong even if it's disabled? 3% sounds surprisingly high for that, but I don't see much in that patch that'd be affected by SB being on or off anyway?
Here are the preferences we explicitly set on talos:
http://hg.mozilla.org/build/talos/file/d6bcfd393641/talos/PerfConfigurator.py#l252


related to safebrowsing:
'browser.safebrowsing.gethashURL': 'http://127.0.0.1/safebrowsing-dummy/gethash',
'browser.safebrowsing.keyURL': 'http://127.0.0.1/safebrowsing-dummy/newkey',
'browser.safebrowsing.updateURL': 'http://127.0.0.1/safebrowsing-dummy/update',
'browser.safebrowsing.enabled': False,
'browser.safebrowsing.malware.enabled': False,


related to network.*:
'network.proxy.http': 'localhost',
'network.proxy.http_port': 80,
'network.proxy.type': 1,
'network.http.speculative-parallel-limit': 0
Flags: needinfo?(jmaher)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #3)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/25d9ba8ceb7a#l6.121
> 
> Isn't this doing the extra thread pingpong even if it's disabled? 3% sounds
> surprisingly high for that, but I don't see much in that patch that'd be
> affected by SB being on or off anyway?

It is calling into DoLocalLookup, but it should be with empty tables. I'm surprised that a thread switch would take so long as well. We can shortcircuit if the tables are empty to avoid the thread switch and see if that helps.
Assignee: nobody → mmc
Status: NEW → ASSIGNED
Attachment #8550571 - Flags: review?(gpascutto)
this try push does raise red flags on the tp5 data, I did a few retriggers and we will see if it remains good!
and with the retriggers it looks pretty good.
Comment on attachment 8550571 [details] [diff] [review]
Short-circuit ClassifyLocal if there aren't any tables to look up (

Review of attachment 8550571 [details] [diff] [review]:
-----------------------------------------------------------------

There's nothing wrong with this code per se, but the only thing it's really accomplishing is making sure our tests are no longer able to detect the very real (because real builds have it enabled by default!) performance regression we just introduced :-/
(In reply to Gian-Carlo Pascutto [:gcp] from comment #10)
> There's nothing wrong with this code per se, but the only thing it's really
> accomplishing is making sure our tests are no longer able to detect the very
> real (because real builds have it enabled by default!) performance
> regression we just introduced :-/

That doesn't sound fun, and we probably don't want performance issues hidden from us.

Preferably you'll know why either bug 1100024 or bug 1120499 cause this slight regression. If you know why it regressed and you think it's a fair price to pay for the new code, then that's fine - no need to hide the regression from the test - as we understand it's an acceptable the price to pay.

If, however, you don't understand why it ended up with a regression or think it isn't related to either of those bugs, then just say so and we'll think again what to do with this regression.

Either way, the goal is not to hide regressions from tests, but rather to understand where they came from and then take an educated decision on what to do about them (and doing nothing about them is a fair decision as well).
Bug 1100024 must stay in the tree when tracking protection is enabled. It requires either bug 1120499 or an alternative solution that increases the amount of locking on the worker thread in order to work.

I don't think that there's a way around this overhead with the current architecture:
- Necko runs on the main thread
- Lookups and updates to SB database currently happen on a worker thread and either need to lock to run on multiple threads, or stay on the worker thread (current solution)

The overhead of thread switching happens only in ClassifyLocal in necko during BeginConnect. The existing lookup from nsChannelClassifier should be exactly the same, because our usage of SyncRunnable does not force a dispatch if execution is already on the worker thread.

In order to reduce the scope of this problem, we could limit the ClassifyLocal call (https://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#4889) to when tracking protection should be enabled, and let speculative network traffic pass otherwise. The net effect of this would be that phishing and malware pages *would* be pre-fetched.

In the tracking protection case, I think we are more than making up the time by not rendering blocked elements. It's still a 20% page load speed up on average for people with tracking protection enabled.

Patrick, what do you think?
Flags: needinfo?(mcmanus)
monica, can we confirm some things?

1] my understanding is that for tp5, classifylocal() should be returning NS_OK for aResponse.. right?

2] and it runs on the same thread (the main thread) it is called from in nsHttpChannel - so there isn't any thread ping ponging going on.

3] and it really is just using cpu before dispatch the networking activity that it previously overlapped with the network activity (i.e. the networking was happening on the socket thread while this same cpu work was on the main thread). This serialziation is actually the point of the fix.

given that, its not surprising that page load is regressed - more stuff is checked before we send of the http request.

if all that is true - then we should accept the regression.

otoh, can we look at a profile of classifyLocal() - as it suddenly finds itself on a very critical path and is apprently CPU bound, a little micro optimization might really pay off. Separate bug.
Flags: needinfo?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #13)
> monica, can we confirm some things?
> 
> 1] my understanding is that for tp5, classifylocal() should be returning
> NS_OK for aResponse.. right?

Yes.

> 2] and it runs on the same thread (the main thread) it is called from in
> nsHttpChannel - so there isn't any thread ping ponging going on.

Unfortunately no. Originally I landed it that way, then I had to revert it due to https://bugzilla.mozilla.org/show_bug.cgi?id=1120145#c2. The fixes were to add locking to the DB lookup or dispatch to the worker thread. I chose the dispatch -- but don't anticipate that locking would add any less overhead.

> 3] and it really is just using cpu before dispatch the networking activity
> that it previously overlapped with the network activity (i.e. the networking
> was happening on the socket thread while this same cpu work was on the main
> thread). This serialziation is actually the point of the fix.

Yes.

> given that, its not surprising that page load is regressed - more stuff is
> checked before we send of the http request.
> 
> if all that is true - then we should accept the regression.

Should we accept it if tracking protection is not enabled?
 
> otoh, can we look at a profile of classifyLocal() - as it suddenly finds
> itself on a very critical path and is apprently CPU bound, a little micro
> optimization might really pay off. Separate bug.
I guess the question is did the code that did the lockless lookup on the main thread have the same tp5 regresssion? (I understand it has correctness problems :))

We ought to be able to figure that out from try.

If they answer is yes, then I guess we can live with it and pursue a profile and improvement in another bug.

If the answer is no, then it seems to me we ought to be able to optimize this with a reader/writer lock.. Are updates the only write case? That sounds like its an outlier and a good use case.
if you have a try push, I would be happy to analyze it to see if it is good or regressed.
(In reply to Patrick McManus [:mcmanus] from comment #15)
> I guess the question is did the code that did the lockless lookup on the
> main thread have the same tp5 regresssion? (I understand it has correctness
> problems :))
> We ought to be able to figure that out from try.
> 
> If they answer is yes, then I guess we can live with it and pursue a profile
> and improvement in another bug.
>
> If the answer is no, then it seems to me we ought to be able to optimize this with a reader/writer lock.. > Are updates the only write case? That sounds like its an outlier and a good use case.

I think the answer is no, because it was never flagged for talos regressions for the 3 days or so it was live in Nightly, and this bug was filed the day after the fix landed on m-c.

Unfortunately updates are not the only write case. Lookups can also cause writes if we need to store hash completions. (gcp, please correct me if I'm wrong.)

(In reply to Joel Maher (:jmaher) from comment #16)
> if you have a try push, I would be happy to analyze it to see if it is good
> or regressed.

The crashy checkin was

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=671ad56e6e12

Unfortunately there were a lot of orange tests that day.
>Unfortunately updates are not the only write case. Lookups can also cause writes if we need to store hash completions. (gcp, please correct me if I'm wrong.)

Yes, but that's only on a hit (treated the same as an update). For regular SafeBrowsing, that's a 1/5000 case. I don't know for Tracking Protection? A lock that is almost always uncontested does look like it would be faster than a synchronous dispatch.

What's the expected effect if we're dispatching updates while lookups are also being dispatched? Do the lookups have to wait till the update (with IO!) finishes? That seems like it could lead to bad yank every 30 minutes (depending on the exact effect of those synchronous dispatches, if they let the event loop spin it might be ok).

If the above happens a more fine grained lock seems preferable, though also more error prone. The race we hit is in mPrefixSet.Contains(), but there's potential others in mTableFreshness, ActiveTables() and all the ones I missed :-)

Thoughts:

1) If we enable tracking protection by default the release this lands, we can live with the perf regression and file a follow-up to fix afterwards. If we never expect to enable tracking protection by default, I am far less willing to take this hit. 3% is not nothing. Is it possible to avoid the ClassifyLocal call if *only* tracking protection is disabled? The patch I saw in this bug only avoids it if *all* SafeBrowsing is disabled. If we only have the perf hit when TP is enabled, it's a no brainer to go forward with it (3% perf hit for big privacy win and 20% less stuff to fetch).

2) It's not entirely clear why we get the 3% now. Is it because dispatching is high overhead or because of the extra serialization? The fact that the first (broken) patch didn't regress is evidence of the former, correct?

3) If there's evidence dispatching is the issue, we can do a fine grained lock (I don't think reader/writer really helps as there aren't concurrent SB lookups). This is possible, but tricky. The original implementation of UrlClassifierPrefixSet supported queries from either thread, but getting rid of it sure was a simplification.

My preference would be to do (1), and if we are certain that dispatching is the issue, tackle (3) in a follow-up but not let it block TP.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #18)
> 
> What's the expected effect if we're dispatching updates while lookups are
> also being dispatched? Do the lookups have to wait till the update (with
> IO!) finishes? That seems like it could lead to bad yank every 30 minutes
> (depending on the exact effect of those synchronous dispatches, if they let
> the event loop spin it might be ok).

so the current synchronous dispatch does not let the main thread event loop spin - so it has this problem (with I/O) too I assume? We probably can't let that go.

Is it possible that if the main thread tries to get the lock and fails it can just be dispatched in the normal async way a hit is dispatched?

> 
> 1) If we enable tracking protection by default the release this lands, we
> can live with the perf regression and file a follow-up to fix afterwards

I'm not exactly sure what is being proposed.. I'll just say that in the default config we definitely can't have the jank problem (synchronoized threads with IO) and we should try hard to avoid tp5 regression (though that may be impossible).

> 2) It's not entirely clear why we get the 3% now. Is it because dispatching
> is high overhead or because of the extra serialization? The fact that the
> first (broken) patch didn't regress is evidence of the former, correct?

sounds that way, though its probably good to have joel actually evaluate a build with the broken patch to be certain.
I did look at the linked revision in comment 17, that is on inbound from Jan 9th.  I would prefer to see that applied to the current state today- is there something specific with that push which it should fix something or make things specifically worse?
I'm not positive how the current Dispatch and (heavy IO) updates interact, and I can't tell if you (mcmanus) are confirming my fears or not :-)

>Is it possible that if the main thread tries to get the lock and fails it can just be dispatched in the normal async way a hit is dispatched?

I think that just leads to bug 1100024 again.

>I'm not exactly sure what is being proposed.

If the yank-on-update isn't real, then 3% pref regression IFF tracking protection is enabled is OK.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #21)
> I'm not positive how the current Dispatch and (heavy IO) updates interact,
> and I can't tell if you (mcmanus) are confirming my fears or not :-)
> 
> >Is it possible that if the main thread tries to get the lock and fails it can just be dispatched in the normal async way a hit is dispatched?
> 
> I think that just leads to bug 1100024 again.

I don't mean using the safe browsing data in parallel unsafely, I mean if ClassifyLocal finds the lock contended, that it returns something other than NS_OK for the response (i.e. local block list hit), which should start the classifer::start logic and a slow ping pong callback to the main thread.
I'm not sure what the advantage of that is? The dispatch isn't async, it's sync-as-can-be: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierProxies.cpp#134

That also answers my own "does the event loop spin" question :P
(In reply to Gian-Carlo Pascutto [:gcp] from comment #18)
> If the above happens a more fine grained lock seems preferable, though also
> more error prone. The race we hit is in mPrefixSet.Contains(), but there's
> potential others in mTableFreshness, ActiveTables() and all the ones I
> missed :-)

Yes, I am not optimistic about this.

> Thoughts:
> 
> 1) If we enable tracking protection by default the release this lands, we
> can live with the perf regression and file a follow-up to fix afterwards.

Based on how getting the settings UI out is going, I don't think we'll ever enable tracking protection by default.

> If
> we never expect to enable tracking protection by default, I am far less
> willing to take this hit. 3% is not nothing. Is it possible to avoid the
> ClassifyLocal call if *only* tracking protection is disabled?

Yes, and I think that's the easiest solution. Dispatching is the issue based on comment 9.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #23)
> I'm not sure what the advantage of that is? The dispatch isn't async, it's
> sync-as-can-be:
> https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-
> classifier/nsUrlClassifierProxies.cpp#134
> 

I'm saying change DoLookupLookup to not use SyncRunnable. Instead it should do some variation of TryLock(). Should tryLock() fail, ClassifyLocal() should return something other than NS_OK immediately.. and then you're in the unlikely slow path.

> That also answers my own "does the event loop spin" question :P

the main thread does not advance - comment 19. So that's janky.
OK -- we can try reverting the worker dispatch and doing a TryLock(). Realistically I don't have time to work on this until next week or (more likely) mid-Feb, so if someone else has cycles before then, that would be great.

gcp, can you think of anything else in comment 18 that needs to be locked?
Now that I understand mcmanus' point (ClassifyLocal can return a false positive, it'll get fixed up later), we can be extremely crude and just lock everything at the start of an update. We'll just lose some prefetching while the update runs, no biggy.

I would prefer just fixing up the patch here so ClassifyLocal only gets called if TP is enabled, which means that you can ship it, and the performance regression will be limited to the new feature which isn't enabled by default. That way we can have TP if we want it now. We can improve the performance via the above trick later.
Comment on attachment 8552573 [details] [diff] [review]
Skip ClassifyLocal unless tracking protection is enabled (

Review of attachment 8552573 [details] [diff] [review]:
-----------------------------------------------------------------

Hey Pat,

Chatted with gcp over irc. He suggested scoping the regression to when tracking protection is enabled in case comment 26 takes longer than expected. The old codepath in nsChannelClassifier works the same as before ClassifyLocal was introduced. Does that work for you?

Thanks,
Monica
Attachment #8552573 - Flags: review?(mcmanus)
Attachment #8552573 - Flags: review?(mcmanus) → review+
I'm sorry to hear the path to turning tracking on by default is so challenging. I want this feature :)

If you have a bug for turning the pref on, then making it dependent on a bug for optimizing this would be good.
(In reply to Patrick McManus [:mcmanus] from comment #31)
> I'm sorry to hear the path to turning tracking on by default is so
> challenging. I want this feature :)
> 
> If you have a bug for turning the pref on, then making it dependent on a bug
> for optimizing this would be good.

The bugs for exposing the preference in the settings UI at all are https://bugzilla.mozilla.org/show_bug.cgi?id=1103241 and https://bugzilla.mozilla.org/show_bug.cgi?id=1122132, but I don't think they should block on this :)
Ugh, sorry -- I see the fix.
Flags: needinfo?(mmc)
Attachment #8550571 - Attachment is obsolete: true
Attachment #8550571 - Flags: review?(gpascutto)
Attachment #8552573 - Attachment is obsolete: true
Comment on attachment 8553332 [details] [diff] [review]
Skip ClassifyLocal unless tracking protection is enabled (

Review of attachment 8553332 [details] [diff] [review]:
-----------------------------------------------------------------

I missed the case where we need to call the classifier for phishing and malware sites and ended up having to refactor nsChannelClassifier a bit.

green try: https://tbpl.mozilla.org/?tree=Try&rev=1e3b263edaf6
Attachment #8553332 - Flags: review?(mcmanus)
Comment on attachment 8553332 [details] [diff] [review]
Skip ClassifyLocal unless tracking protection is enabled (

Review of attachment 8553332 [details] [diff] [review]:
-----------------------------------------------------------------

oops, one of the xpcshells came back red
Attachment #8553332 - Flags: review?(mcmanus)
Attached patch skip-classify-local (obsolete) — Splinter Review
OK, now try really is green: https://tbpl.mozilla.org/?tree=Try&rev=888cc2c8084f

There were 2 bugs: one was I wasn't handling the phishing/malware ContinueBeginConnect case correctly, and the other is a longstanding bug in nsUrlClassifierDBService that merrily continues with nullptr if it can't get a profile dir that the previous patch triggered more often.
Attachment #8553332 - Attachment is obsolete: true
Attachment #8553587 - Flags: review?(mcmanus)
Attachment #8553587 - Flags: review?(gpascutto)
Attachment #8553587 - Flags: review?(gpascutto) → review+
Comment on attachment 8553587 [details] [diff] [review]
skip-classify-local

Review of attachment 8553587 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsChannelClassifier.h
@@ +26,2 @@
>      // ContinueBeginConnect once Start has successfully returned.
> +    void Start(nsIChannel *aChannel, nsIHttpChannelInternal *aChannelInternal);

I guess you're using it as a flag on whether or not to call continuebeginconnect() but wouldn't it be saner to just pass a bool explicitly defined to do that? (and qi aChannel)

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +4898,5 @@
>      nsRefPtr<nsChannelClassifier> channelClassifier = new nsChannelClassifier();
>      if (mLoadFlags & LOAD_CLASSIFY_URI) {
> +        nsresult rv;
> +        nsCOMPtr<nsIURIClassifier> classifier = do_GetService(NS_URICLASSIFIERSERVICE_CONTRACTID, &rv);
> +        if (NS_SUCCEEDED(rv)) {

you don't use that rv in any way the !classifier check didn't already do (e.g. return it), so it can stay !classifier.

@@ +4988,2 @@
>      if (mCanceled || !mLocalBlocklist) {
> +        ContinueBeginConnect();

rv = ContinueBeginConnect()
if (ns_failed(rv)) return rv;

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +1132,5 @@
>    if (NS_FAILED(rv)) {
>      rv = NS_GetSpecialDirectory(NS_APP_USER_PROFILE_50_DIR,
>                                  getter_AddRefs(cacheDir));
>    }
> +  if (NS_FAILED(rv)) {

I think this reads a lot easier inside the brace on the previous line
Attachment #8553587 - Flags: review?(mcmanus)
Attachment #8553587 - Attachment is obsolete: true
Comment on attachment 8555544 [details] [diff] [review]
Skip ClassifyLocal unless tracking protection is enabled (

Review of attachment 8555544 [details] [diff] [review]:
-----------------------------------------------------------------

Address Pat's comments in comment 43, please take another look. Try looks green except for an flaky test which doesn't reproduce locally, so it's probably unrelated.
Attachment #8555544 - Flags: review?(mcmanus)
Comment on attachment 8555544 [details] [diff] [review]
Skip ClassifyLocal unless tracking protection is enabled (

Review of attachment 8555544 [details] [diff] [review]:
-----------------------------------------------------------------

one thing to figure out - then r+

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +4898,5 @@
>      nsRefPtr<nsChannelClassifier> channelClassifier = new nsChannelClassifier();
>      if (mLoadFlags & LOAD_CLASSIFY_URI) {
> +        nsresult rv;
> +        nsCOMPtr<nsIURIClassifier> classifier = do_GetService(NS_URICLASSIFIERSERVICE_CONTRACTID, &rv);
> +        if (!classifier) {

so I screwed this up in my last review when I said "!classifier" instead of rv.

it of course should be "if (classifier)" - right? unchanged in the patch. You also don't need the declaration of rv or the second parameter t do_GetService() as you aren't using rv.. that's what I was trying to say and screwed up.
Attachment #8555544 - Flags: review?(mcmanus) → review+
Yes, that's right. Let me fix it and run another try.
Attachment #8555544 - Attachment is obsolete: true
Comment on attachment 8555646 [details] [diff] [review]
Skip ClassifyLocal unless tracking protection is enabled (

Review of attachment 8555646 [details] [diff] [review]:
-----------------------------------------------------------------

Green try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=11f47835dfe8
Attachment #8555646 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/b6d3a8960d38
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Depends on: 1130893
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: