Closed Bug 1130893 Opened 5 years ago Closed 5 years ago

Wrong principal is used for nsChannelClassifier checks (tracking Protection blocks videos on facebook, and gifs-reencoded-as-h.264 on Twitter)

Categories

(Firefox :: Security, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 38
Tracking Status
firefox35 --- unaffected
firefox36 --- unaffected
firefox37 --- unaffected
firefox38 --- fixed

People

(Reporter: dholbert, Assigned: mmc)

References

(Blocks 1 open bug, )

Details

(Keywords: regression)

Attachments

(2 files, 4 obsolete files)

I haven't been able to play facebook videos on my facebook timeline for at least the past day or so. (not sure exactly how long, as I don't often try to play fb videos)

I just fired up error console to see what was going on, and I noticed this appear when I hit "play" on a video:
{
The resource at "https://fbcdn-video-f-a.akamaihd.net/hvideo-ak-xfp1/v/t42.1790-2/109772[...]_n.mp4?rl=[...]" was blocked because tracking protection is enabled.
}

After I disabled tracking protection in my URL bar, the video played.

STR:
 * Enable tracking protection.
 * Visit https://www.facebook.com/
 * Scroll down your timeline to find a video that a friend posted directly on facebook (not like an embedded youtube video or something)
 * Click "play" on the video (or click the post's timestamp to go to its own facebook page, and then try to click "play" on the video there)

EXPECTED RESULTS: Video should play.

ACTUAL RESULTS: The video doesn't play, and a message like the one I posted above appears in your error console.
I just confirmed that I can reproduce this in a fresh profile. I don't reliably see the message in the error console (not sure why not), but the video definitely doesn't play unless I disable tracking protection.

Here's a sample video from the "Firefox" page that can be used for testing this (you have to be logged in to view it, for some reason):  https://www.facebook.com/video.php?v=10154812622600022

I'm using 64-bit Nightly 38.0a1 (2015-02-07) on Ubuntu 14.10 (linux), on my Lenovo W530 thinkpad.
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0
Summary: Tracking protection breaks Facebook videos → Tracking protection blocks you from playing Facebook videos (on Facebook itself)
needinfo=mmc to take a look or redirect to someone who can take a look. (This seems like a pretty bad thing for tracking-protection to break.)

The user-exposed workaround -- manually disabling tracking protection on Facebook -- is OK, but suboptimal, because facebook is a site where I'd really *like* to be protected from tracking by ads / 3rd-party scripts that show up on the site.
Flags: needinfo?(mmc)
Blocks: tp-breakage
Flags: needinfo?(mmc)
Hmm, I don't see akamai in the web console (which is good, because we shouldn't be blocking akamai), but I do see fbcdn:

The resource at "https://scontent-a.xx.fbcdn.net/hvideo-xpa1/v/t43.1792-2/10801438_10154812625010022_2136059214_n.mp4?oh=a868958a851c42a9cc6eadceef548cde&oe=54D915F6" was blocked because tracking protection is enabled.

That's actually troubling, because fbcdn is in the Content category on Disconnect's list, and since they don't block that category by default, neither should we. I have a feeling this is an issue with broken updates. Could I trouble you to attach the contents of your profile's safebrowsing directory? On Linux, that's in ~/.cache/mozilla/firefox/<profile_name>/safebrowsing.
Flags: needinfo?(dholbert)
Well, seeing as how I can reproduce in my (old) profile, I'm pretty sure I know what the problem is -- but it would be great to see your profile, especially if you are able to reproduce in a new one, for confirmation.
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #3)
> Hmm, I don't see akamai in the web console (which is good, because we
> shouldn't be blocking akamai), but I do see fbcdn:

I do too, when trying this from the MV office, in a fresh profile with Nightly & tracking protection enabled & e10s disabled.[1]

[1] (It seems to behave oddly when I have e10s on, so I turned it off).

> Could I trouble you to attach the contents of your profile's safebrowsing
> directory? On Linux, that's in
> ~/.cache/mozilla/firefox/<profile_name>/safebrowsing.

Sure -- I've just emailed those to you, for my normal profile and also for the fresh profile that I just created & reproduced this with, in the MV office (as noted above).
Flags: needinfo?(dholbert)
Thanks dholbert. Well, shoot -- it looks like the logging may be incorrect, because I don't see hashes for fbcdn or akamaihd in these profiles.
(In reply to Daniel Holbert [:dholbert] from comment #5)
> [1] (It seems to behave oddly when I have e10s on, so I turned it off).

(I filed bug 1131182 for this e10s + tracking protection weirdness, btw. Feel free to dupe / mark dependencies over there as-appropriate.)
Summary: Tracking protection blocks you from playing Facebook videos (on Facebook itself) → Wrong principal is used for ClassifyLocal checks
Assignee: nobody → mmc
Status: NEW → ASSIGNED
Comment on attachment 8561763 [details] [diff] [review]
Wrong principal is used for ClassifyLocal checks

Review of attachment 8561763 [details] [diff] [review]:
-----------------------------------------------------------------

This patch is not for check-in, but to illustrate a problem with the way that we call classifyLocal in nsHttpChannel. This is a relatively recent change that we need to fix in 38.

1927148304[1003292d0]: nsChannelClassifier[13772e8e0]: Enabling TP checks on channel[13f42d050] with uri https://scontent-a.xx.fbcdn.net/hvideo-xpa1/v/t43.1792-2/10801438_10154812625010022_2136059214_n.mp4?oh=59bc99c618767850f677e9c284205a4f&oe=54D973B6 for toplevel https://www.facebook.com
1927148304[1003292d0]: nsChannelClassifier: Classifying principal https://www.facebook.com/video.php?v=10154812622600022 on channel with uri https://scontent-a.xx.fbcdn.net/hvideo-xpa1/v/t43.1792-2/10801438_10154812625010022_2136059214_n.mp4?oh=59bc99c618767850f677e9c284205a4f&oe=54D973B6 [this=13772e8e0]

^This line above means that we are classifying https://www.facebook.com/<blah> when trying to determine whether to cancel channel fbcdn.net. That is the wrong thing to do and explains why fbcdn.net is being cancelled though it is not on the blocklist (facebook.com *is*, but should only be cancelled in third party contexts).

I thought that nsIChannel->GetPrincipal was the right thing to do here, but it looks like maybe we should just do nsIChannel->GetURI and then wrap that in a principal.
[Tracking Requested - why for this release]: This is a regression for tracking protection users only caused by fixing https://bugzilla.mozilla.org/show_bug.cgi?id=1100024.
This wrong principal is not set for ClassifyLocal checks, but inside nsChannelClassifier::StartInternal.
Summary: Wrong principal is used for ClassifyLocal checks → Wrong principal is used for nsChannelClassifier checks
Attachment #8561763 - Attachment is obsolete: true
Comment 11 is wrong: the bug that caused the regression is not https://bugzilla.mozilla.org/show_bug.cgi?id=1100024. I suspect that it has more to do with changes to loadinfo and how the principal is calculated that cause a change to GetChannelResultPrincipal.

I verified that this bug only appears in Nightly. STR:
1) On a clean profile, enable privacy.trackingprotection.enabled
2) Wait 30 seconds and visit http://people.mozilla.org/~mchew/test_tp.html. You should see a bunch of empty boxes. If not, wait a bit more and reload.
3) Visit https://www.facebook.com/video.php?v=10154812622600022.

Expected results: The video autoplays

With https://bugzilla.mozilla.org/attachment.cgi?id=8562212, this bug is fixed on Nightly. The call to GetChannelResultPrincipal has been in nsChannelClassifier for approximately forever.

However, in Nightly GetChannelResultPrincipal seems to return the toplevel URI for subresource https://scontent-a.xx.fbcdn.net/hvideo-xpa1/v/t43.1792-2/10801438_10154812625010022_2136059214_n.mp4?oh=59bc99c618767850f677e9c284205a4f&oe=54D973B6, and the subresource URI itself in FF 37 and older.

Tanvi, Chris: is this working as intended?
Flags: needinfo?(tanvi)
Flags: needinfo?(mozilla)
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #15)
> However, in Nightly GetChannelResultPrincipal seems to return the toplevel
> URI for subresource
> https://scontent-a.xx.fbcdn.net/hvideo-xpa1/v/t43.1792-2/
> 10801438_10154812625010022_2136059214_n.
> mp4?oh=59bc99c618767850f677e9c284205a4f&oe=54D973B6, and the subresource URI
> itself in FF 37 and older.
> 
> Tanvi, Chris: is this working as intended?

If you are looking for the subresource's channel uri, you probably need GetChanneURIPrincipal() instead of GetChannelResultPrincipal().  GetChannelResultPrincipal() returns the owner if available - http://mxr.mozilla.org/mozilla-central/source/caps/nsIScriptSecurityManager.idl#190 .  My guess as to why this came up now and not earlier is that we are adding loadInfo to more and more channels.  In GetChannelResultPrincipal() we use the loadInfo if we have it, and fallback to GetChannelURIPrincipal() if we don't.  So I assume that the channel didn't have a loadInfo before and does now.
Comment on attachment 8562212 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier

Review of attachment 8562212 [details] [diff] [review]:
-----------------------------------------------------------------

See comment 16. I think this is a long-standing bug that just happened to work because loadinfo was not previously available, and now that it is, we are classifying the wrong thing.

Pat, this loadinfo change might have repercussions in https://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/HttpBaseChannel.cpp#1896.
Attachment #8562212 - Flags: review?(mcmanus)
Attachment #8562212 - Flags: review?(gpascutto)
Flags: needinfo?(tanvi)
Flags: needinfo?(mozilla)
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #17)
> Comment on attachment 8562212 [details] [diff] [review]
> Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in
> nsChannelClassifier
> 
> Review of attachment 8562212 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> See comment 16. I think this is a long-standing bug that just happened to
> work because loadinfo was not previously available, and now that it is, we
> are classifying the wrong thing.
> 
> Pat, this loadinfo change might have repercussions in
> https://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/
> HttpBaseChannel.cpp#1896.

Hopefully this bug doesn't show up in too many other places.  IIRC, Jonas and Christoph went through all instances of GetChannelPrincipal and updated them to either GetChannelURIPrincipal or GetChannelResultPrincipal after inspecting the callsite.  However, it is not always clear from the code which one is needed.
filing this one under "good to know"
Attachment #8562212 - Flags: review?(mcmanus) → review+
tanvi - mmc suggests that HttpBaseChannel::GetPrinicipal might need to use GetChannelPrincipal() instead of ChannelResultPrinicipal() as well.. what do you think? (makes sense to me based on this thread, but I'm not the expert here.)
Flags: needinfo?(tanvi)
Using GetChannelURIPrincipal sounds surprising to me that it would fix things. I'd like to better understand why before we make this change.

IIRC GetChannelURIPrincipal and GetChannelResultPrincipal only return different things in a small number of exceptional cases:

* Sandboxed loads, such as when loading things into an <iframe sandbox>.
* "Force inherit" cross-site loads, such as when loading a document using XMLHttpRequest.
* When loading from protocols that set an owner, which nowadays is mainly chrome: and about:

I suspect that the force-inherit flag (SEC_FORCE_INHERIT_PRINCIPAL) is set though since MediaResource.cpp seems to set that flag.

But I also think that MediaResource.cpp used to set an owner. And was recently switched to using SEC_FORCE_INHERIT_PRINCIPAL instead of using an owner.

But that means that I think that GetChannelResultPrincipal (formerly GetChannelPrincipal) always has returned the principal that it currently returns. So I don't *think* the recent changes to nsILoadInfo has actually changed anything here (that was certainly the goal).

I.e. I think that the GetChannel*Principal code here has for a long long time returned the principal of the calling page. Rather than the principal of the URI that is being loaded.


In summary, I'd like to understand two things:

* What has changed? Why did facebook recently stop working?
* What is this code actually trying to do? I.e. what's the security check that you're trying to do?


As an aside, it feels like we're over-using SEC_FORCE_INHERIT_PRINCIPAL. I don't understand why the media code sets that flag. But that's a separate discussion probably.
(In reply to Jonas Sicking (:sicking) from comment #21)
> Using GetChannelURIPrincipal sounds surprising to me that it would fix
> things. I'd like to better understand why before we make this change.
> 
> IIRC GetChannelURIPrincipal and GetChannelResultPrincipal only return
> different things in a small number of exceptional cases:
> 
> * Sandboxed loads, such as when loading things into an <iframe sandbox>.
> * "Force inherit" cross-site loads, such as when loading a document using
> XMLHttpRequest.
> * When loading from protocols that set an owner, which nowadays is mainly
> chrome: and about:
> 
> I suspect that the force-inherit flag (SEC_FORCE_INHERIT_PRINCIPAL) is set
> though since MediaResource.cpp seems to set that flag.
> 
> But I also think that MediaResource.cpp used to set an owner. And was
> recently switched to using SEC_FORCE_INHERIT_PRINCIPAL instead of using an
> owner.

Perhaps this is responsible for the change?

> But that means that I think that GetChannelResultPrincipal (formerly
> GetChannelPrincipal) always has returned the principal that it currently
> returns. So I don't *think* the recent changes to nsILoadInfo has actually
> changed anything here (that was certainly the goal).
> 
> I.e. I think that the GetChannel*Principal code here has for a long long
> time returned the principal of the calling page. Rather than the principal
> of the URI that is being loaded.
> 
> 
> In summary, I'd like to understand two things:
> 
> * What has changed? Why did facebook recently stop working?

This is a client-side regression, I don't think anything has changed on Facebook. To verify this I applied the logging part of the patch only to mozilla-aurora and observed that GetChannelResultPrincipal returned the URI of the resource being requested, while Nightly does not.

Unfortunately I don't have time to do a bisection, but perhaps Tanvi or Chris can point out the relevant changes in https://bugzilla.mozilla.org/show_bug.cgi?id=1087720.

> * What is this code actually trying to do? I.e. what's the security check
> that you're trying to do?

Check the URI of the resource being loaded against the blocklist. These checks should not care at all about the parent resource.
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #22)
> (In reply to Jonas Sicking (:sicking) from comment #21)
> > Using GetChannelURIPrincipal sounds surprising to me that it would fix
> > things. I'd like to better understand why before we make this change.
> > 
> > IIRC GetChannelURIPrincipal and GetChannelResultPrincipal only return
> > different things in a small number of exceptional cases:
> > 
> > * Sandboxed loads, such as when loading things into an <iframe sandbox>.
> > * "Force inherit" cross-site loads, such as when loading a document using
> > XMLHttpRequest.
> > * When loading from protocols that set an owner, which nowadays is mainly
> > chrome: and about:
> > 
> > I suspect that the force-inherit flag (SEC_FORCE_INHERIT_PRINCIPAL) is set
> > though since MediaResource.cpp seems to set that flag.
> > 
> > But I also think that MediaResource.cpp used to set an owner. And was
> > recently switched to using SEC_FORCE_INHERIT_PRINCIPAL instead of using an
> > owner.
> 
> Perhaps this is responsible for the change?

Indeed. This changed recently in bug 604496. So bug 604496 is the cause of regression here.

> > * What has changed? Why did facebook recently stop working?
> 
> This is a client-side regression, I don't think anything has changed on
> Facebook. To verify this I applied the logging part of the patch only to
> mozilla-aurora and observed that GetChannelResultPrincipal returned the URI
> of the resource being requested, while Nightly does not.

What URI did you get on mozilla-aurora, and what URI are you getting now? Is either of them a data: URI?

> Unfortunately I don't have time to do a bisection, but perhaps Tanvi or
> Chris can point out the relevant changes in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1087720.

Fortunately we won't need a bisection. Bug 604496 is definitely the cause here.

However note that none of the work that Tanvi or Christoph has been doing is the cause here. While the regression uses nsILoadInfo, in the past we would have just set a channel owner which would have had the exact same effect.

> > * What is this code actually trying to do? I.e. what's the security check
> > that you're trying to do?
> 
> Check the URI of the resource being loaded against the blocklist. These
> checks should not care at all about the parent resource.

If you're just interested in getting the URI of the resource being loaded, then why call Get*Principal anything? Why not just call nsIChannel.GetURI?
Blocks: 604496
(In reply to Jonas Sicking (:sicking) from comment #23)
> (In reply to [:mmc] Monica Chew (please use needinfo) from comment #22)
> > (In reply to Jonas Sicking (:sicking) from comment #21)
> > > Using GetChannelURIPrincipal sounds surprising to me that it would fix
> > > things. I'd like to better understand why before we make this change.
> > > 
> > > IIRC GetChannelURIPrincipal and GetChannelResultPrincipal only return
> > > different things in a small number of exceptional cases:
> > > 
> > > * Sandboxed loads, such as when loading things into an <iframe sandbox>.
> > > * "Force inherit" cross-site loads, such as when loading a document using
> > > XMLHttpRequest.
> > > * When loading from protocols that set an owner, which nowadays is mainly
> > > chrome: and about:
> > > 
> > > I suspect that the force-inherit flag (SEC_FORCE_INHERIT_PRINCIPAL) is set
> > > though since MediaResource.cpp seems to set that flag.
> > > 
> > > But I also think that MediaResource.cpp used to set an owner. And was
> > > recently switched to using SEC_FORCE_INHERIT_PRINCIPAL instead of using an
> > > owner.
> > 
> > Perhaps this is responsible for the change?
> 
> Indeed. This changed recently in bug 604496. So bug 604496 is the cause of
> regression here.
> 
> > > * What has changed? Why did facebook recently stop working?
> > 
> > This is a client-side regression, I don't think anything has changed on
> > Facebook. To verify this I applied the logging part of the patch only to
> > mozilla-aurora and observed that GetChannelResultPrincipal returned the URI
> > of the resource being requested, while Nightly does not.
> 
> What URI did you get on mozilla-aurora, and what URI are you getting now? Is
> either of them a data: URI?

No data URIs. On aurora, GetChannelResultPrincipal returns https://scontent-a.xx.fbcdn.net/hvideo-xpa1/... (the resource that's being loaded), and on Nightly it returns https://www.facebook.com/video.php?v=10154812622600022, the URI of the toplevel page.

> > Unfortunately I don't have time to do a bisection, but perhaps Tanvi or
> > Chris can point out the relevant changes in
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1087720.
> 
> Fortunately we won't need a bisection. Bug 604496 is definitely the cause
> here.
> 
> However note that none of the work that Tanvi or Christoph has been doing is
> the cause here. While the regression uses nsILoadInfo, in the past we would
> have just set a channel owner which would have had the exact same effect.
> 
> > > * What is this code actually trying to do? I.e. what's the security check
> > > that you're trying to do?
> > 
> > Check the URI of the resource being loaded against the blocklist. These
> > checks should not care at all about the parent resource.
> 
> If you're just interested in getting the URI of the resource being loaded,
> then why call Get*Principal anything? Why not just call nsIChannel.GetURI?

The only reason is that safebrowsing checks require a principal, not a URI. However they don't do anything interesting with the principal except for extracting the URI from it.

So, is our usecase of GetChannelURIPrincipal sane here?
I'd really like to understand what's going on here, but the URL in the URL field of this bug just says "This content is currently unavailable".

Is there a testcase showing the problem?  The patch for bug 604496 should not have changed the return value of GetChannelResultPrincipal for anything like what's being described here, and we need to get to the bottom of why it seems to.
Oh, I guess per comment 1 I have to have a Facebook account to reproduce.  That's no good.

Daniel, can you try setting breakpoints on the "securityFlags = nsILoadInfo::SEC_FORCE_INHERIT_PRINCIPAL;" that got added in bug 604496 and seeing whether you hit them, and if so which codepath it's on and what the relevant URI is?
(In reply to Boris Zbarsky [:bz] from comment #26)
> Daniel, can you try setting breakpoints on the "securityFlags =
> nsILoadInfo::SEC_FORCE_INHERIT_PRINCIPAL;" that got added in bug 604496 and
> seeing whether you hit them, and if so which codepath it's on and what the
> relevant URI is?

Discussed in IRC a bit -- since I'm on Linux, I'm not actually hitting those securityFlags assignments from bug 604496 (because they're in media code, and since I don't have a h264 decoder on linux, I get flash fallback, which means I don't hit any media code when loading this facebook video).

So IIUC, bug 604496's changes aren't relevant here (i.e. aren't required to trigger the bug), at least for people like me who get the flash fallback.  bz says there's some equivalent code for plugins, but it hasn't changed in a while.

I ran mozregression (probably should've done so sooner, after it became clear this was a regression), and I got this range:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=5a3d567fd8b5&tochange=b6d3a8960d38
...which just includes one changeset:
> Bug 1122691: Skip ClassifyLocal unless tracking protection is enabled (r=mcmanus,gcp)

Based on the commit message, it sounds like that wasn't intended to have any effect on behavior when tracking-protection is enabled -- but given that it *did* have an effect, it seems likely that there's a bug in that cset.
Blocks: 1122691
Flags: needinfo?(mmc)
Keywords: regression
<tangent>
There was one time (only once) during my mozregression run where a single session went from "block the video" to "don't block the video", mid-session (across a few reloads). Specifically:
 - I loaded the page in a "broken" (post-regression) build, and the video was blocked [and an error console message appeared, to indicate the block was from tracking protection].
 - I reloaded the page, *and the video showed up and was playable*, to my surprise.
 - I reloaded again, and it continued to work correctly.
 - But then I did Ctrl+shift+del to clear recent data, & re-logged into facebook, and once again the video was blocked & error console messages showed up again.

Not sure what's was going on here, but it seems there's an element of randomness (or mutability-over-time) here, somehow.

I'd be less surprised to see things going from not-blocked to blocked (due to blocking signatures being loaded). But it seems odd for something to go in the reverse direction, spontaneously...
</tangent>
No longer blocks: 604496
Comment on attachment 8562212 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier

Review of attachment 8562212 [details] [diff] [review]:
-----------------------------------------------------------------

From the further comments it looks like this needs more investigation.
Attachment #8562212 - Flags: review?(gpascutto)
Comment on attachment 8562212 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier

Yep. Thanks for your help, everyone. I will look more today.
Flags: needinfo?(mmc)
Attachment #8562212 - Flags: review+
FWIW, I am putting a debug note in nsChannelClassifier [1] right before we call GetChannelResultPrincipal and print the loadInfo for the channel in question:

{
  channel-uri:         https://scontent-a.xx.fbcdn.net/hvideo-xpa1/v/t43.1792-2/...
  loadingPrincipal:    https://www.facebook.com/video.php?v=10154812622600022
  triggeringPrincipal: https://www.facebook.com/video.php?v=10154812622600022
  content-type:        12 - TYPE_OBJECT_SUBREQUEST
  sec-flags:           1 - SEC_FORCE_INHERIT_PRINCIPAL
}

we create channels using TYPE_OBJECT_SUBREQUEST only in one place really.
  http://mxr.mozilla.org/mozilla-central/source/dom/plugins/base/nsPluginHost.cpp#3112
  http://mxr.mozilla.org/mozilla-central/source/dom/plugins/base/nsPluginHost.cpp#3126

I am not completely sure, but maybe we should call GetChannelURIPrincipal after all.

[1] http://mxr.mozilla.org/mozilla-central/source/netwerk/base/nsChannelClassifier.cpp#290
Summary: Wrong principal is used for nsChannelClassifier checks → Wrong principal is used for nsChannelClassifier checks (tracking Protection blocks videos on facebook)
Yeah, I suspect the right fix here is to use GetChannelURIPrincipal. But I'd like to better understand the code. Usually principals represent an actor and they are used to check if that actor is allowed to take a particular action.

But if all you are doing is a same-origin check between "something" and the principal discussed here, then using GetChannelURIPrincipal sounds like it would be the right fix.
To mmc: even if  GetChannelURIPrincipal is the right fix I'm not sure what to make of comment 27.
Just chatted with Monica on IRC, but potentially there is yet a different problem:

I backed out Monica's changes from
> Bug 1122691: Skip ClassifyLocal unless tracking protection is enabled (r=mcmanus,gcp)
using
>  hg backout -r b6d3a8960d38
and run the test from Comment 15 leaving my debug statement @:
> http://mxr.mozilla.org/mozilla-central/source/netwerk/base/nsChannelClassifier.cpp#290

and get calls to GetChannelResultPrincipal only for the following channels:

> [1] aChannel-URI: http://pagead2.googlesyndication.com/pagead/show_ads.js?js
> [2] aChannel-URI: https://pagead2.googlesyndication.com/pagead/show_ads.js?img
> [3] aChannel-URI: https://pagead2.googlesyndication.com/pagead/show_ads.js?img
> [4] aChannel-URI: http://pagead2.googlesyndication.com/pagead/show_ads.js?iframe
> [5] aChannel-URI: http://pagead2.googlesyndication.com/pagead/show_ads.js?iframe
> [6] aChannel-URI: http://doubleclick.net/

In contrast, if I use the latest code from mozilla-central we get calls to GetChannelResultPrincipal for 300+ channels.

Looking at the patch from Bug 1122691:
> https://bugzilla.mozilla.org/attachment.cgi?id=8555646&action=diff
I don't see anything obviously off. I'll keep debugging, but I thought it's worth sharing; if someone has suggestions, please let us know.
OK, so the main change in that changeset that would cause more stuff to be classified is that in nsHttpChannel::BeginConnect if mCanceled || !mLocalBlockList we used to just return immediately.  But now instead we go ahead and call into the classifier if !mCanceled.

So it used to be that we only did channelClassifier->Start() if mLocalBlockList, but not we always do it.  That's why a lot more stuff goes through there than used to, presumably.

I don't understand the comment that was added starting with "mLocalBlocklist is true only if...".
(In reply to Boris Zbarsky [:bz] from comment #35)
> OK, so the main change in that changeset that would cause more stuff to be
> classified is that in nsHttpChannel::BeginConnect if mCanceled ||
> !mLocalBlockList we used to just return immediately.  But now instead we go
> ahead and call into the classifier if !mCanceled.
> 
> So it used to be that we only did channelClassifier->Start() if
> mLocalBlockList, but not we always do it.  That's why a lot more stuff goes
> through there than used to, presumably.
> 
> I don't understand the comment that was added starting with "mLocalBlocklist
> is true only if...".

That's a typo, it should have been "mLocalBlocklist is true only if tracking protection is enabled and the URI is a tracking domain"
OK, so...

I thought the idea of the LOAD_CLASSIFY_URI flag was that we should not check things for the phishing/malware bits if that flag is not set.

Furthermore, I thought the idea was that we want to do the ClassifyLocal bit first because it's fast and only do the full classify if the local check says stuff should be blocked, which is why the mLocalBlockList bit was as it was.
(In reply to Boris Zbarsky [:bz] from comment #37)
> OK, so...
> 
> I thought the idea of the LOAD_CLASSIFY_URI flag was that we should not
> check things for the phishing/malware bits if that flag is not set.
> 
> Furthermore, I thought the idea was that we want to do the ClassifyLocal bit
> first because it's fast and only do the full classify if the local check
> says stuff should be blocked, which is why the mLocalBlockList bit was as it
> was.

That was the original idea, until I caused a performance regression: https://bugzilla.mozilla.org/show_bug.cgi?id=1122691

The outcome of that bug was to limit ClassifyLocal checks to when tracking protection was enabled, hence https://bugzilla.mozilla.org/show_bug.cgi?id=1122691 which somehow caused this regression.

There is a logic error somewhere. In this particular case, fbcdn.net is not on the blocklist, so we really shouldn't even be getting to nsChannelClassifier::Start if that's the case, because mLocalBlocklist should not be true.
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #38)

OK -- I think I may understand what's going on here. The following comment is wrong and illustrates why we are hitting this case.

> There is a logic error somewhere. In this particular case, fbcdn.net is not
> on the blocklist, so we really shouldn't even be getting to
> nsChannelClassifier::Start if that's the case, because mLocalBlocklist
> should not be true.

Prior to bug 1122691, we only call nsChannelClassifier::Start for URIs that appear on the local blocklist, for any type of list (phishing, malware, tracking). The local blocklist check happens using nsHttpChannel->GetPrincipal. This is important, because it doesn't fallback to GetChannelResultPrincipal unless the appId is required, which in our case, it isn't. So ClassifyLocal has always used the correct principal for blocklist lookups.

After bug 1122691, we call nsChannelClassifier::Start for all URIs unless they are cancelled, in order to catch the phishing and malware cases. So, the resource being loaded from fbcdn.net is going through nsChannelClassifier::Start in case it is phishing or malware.

Sometime between https://hg.mozilla.org/integration/mozilla-inbound/rev/25d9ba8ceb7a (landed 1/14) and now, GetChannelResultPrincipal changed behavior in a way that we are inheriting the parent resource (facebook.com) for fbcdn.net, resulting in an incorrect classification.

I believe that the original patch is still the correct fix for this.
Comment on attachment 8562212 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier

Review of attachment 8562212 [details] [diff] [review]:
-----------------------------------------------------------------

Please see comment 39. If this is confusing, let me know and I will try to explain it better.
Attachment #8562212 - Flags: review?(mcmanus)
Attachment #8562212 - Flags: review?(jonas)
> That was the original idea, until I caused a performance regression: 

No, I mean original idea before tracking protection existed at all.  The original design of the phishing/malware stuff was to:

1)  Only do it for channels that explicitly opt in via LOAD_CLASSIFY_URI .
2)  For those cannels, check the local data before doing a full remote classify and only
    do the remote thing if the local check indicates the URI _might_ be on the full list.

These were critical performance optimizations for the phishing/malware bits.

> The outcome of that bug was to limit ClassifyLocal checks to when tracking protection was
> enabled

That's clearly wrong, since the idea for tracking/malware is to do that check before doing a full (and slow) remote classify, no?

> so we really shouldn't even be getting to nsChannelClassifier::Start if that's the case,
> because mLocalBlocklist should not be true.

The patch in bug 1122691 calls nsChannelClassifier::Start() unconditionally for all HTTP channels, no?  As in, ignoring the value of mLocalBlocklist.  It even has a comment about why it's ignoring the value of mLocalBlockList....

Stepping back from the code for a second, can you describe the overall control flow we're aiming for here?  The pre-tracking-protection control flow is described earlier in this comment.  How, if at all, do we want tracking protection to modify that?
> I believe that the original patch is still the correct fix for this.

I believe it's not the right fix for the "we now spend time classifying all this stuff we shouldn't be classifying at all" regression.
And, importantly, plugin channels are NOT LOAD_CLASSIFY_URI, which explains why they were the canary in the coal mine here.

I'm not sure why various other stuff (like images) is, exactly.  Again, LOAD_CLASSIFY_URI was, iirc, a performance optimization to avoid doing all the malware/phishing work except in the cases where it was actually useful.
(In reply to Boris Zbarsky [:bz] from comment #41)
> > That was the original idea, until I caused a performance regression: 
> 
> No, I mean original idea before tracking protection existed at all.  The
> original design of the phishing/malware stuff was to:
> 
> 1)  Only do it for channels that explicitly opt in via LOAD_CLASSIFY_URI .

Yes, that was the intent. I see now that I broke this behavior in https://hg.mozilla.org/integration/mozilla-inbound/rev/25d9ba8ceb7a when mLocalBlocklist changed.

> 2)  For those cannels, check the local data before doing a full remote
> classify and only
>     do the remote thing if the local check indicates the URI _might_ be on
> the full list.

There is no remote check, ever, unless a URI is on a local list. I think we are talking about 2 different things here. Please see https://bugzilla.mozilla.org/show_bug.cgi?id=1100024#c2 for background information.

> These were critical performance optimizations for the phishing/malware bits.
> 
> > The outcome of that bug was to limit ClassifyLocal checks to when tracking protection was
> > enabled
> 
> That's clearly wrong, since the idea for tracking/malware is to do that
> check before doing a full (and slow) remote classify, no?
> 
> > so we really shouldn't even be getting to nsChannelClassifier::Start if that's the case,
> > because mLocalBlocklist should not be true.
> 
> The patch in bug 1122691 calls nsChannelClassifier::Start() unconditionally
> for all HTTP channels, no?  As in, ignoring the value of mLocalBlocklist. 
> It even has a comment about why it's ignoring the value of
> mLocalBlockList....
> 
> Stepping back from the code for a second, can you describe the overall
> control flow we're aiming for here?  The pre-tracking-protection control
> flow is described earlier in this comment.  How, if at all, do we want
> tracking protection to modify that?

We want the presence of a URI on the tracking protection list to skip speculative connections, which is what was causing bug 1100024.
Attachment #8562212 - Flags: review?(mcmanus)
Attachment #8562212 - Flags: review?(jonas)
> There is no remote check, ever, unless a URI is on a local list. 

Ah, so Start() does _another_ local check?  And the only point of the early ClassifyLocal call is to prevent the initial connect which would otherwise happen while the async Start() bits happen?  Thank you for the pointer; I'd clearly not understood the idea of ClassifyLocal.

So just to make sure I understand:

1)  We don't care about speculative connections to malware/phishing stuff, but do care for tracking protection.  I buy that; the danger with malware/phishing is what we do with the response, but the danger with tracking is what the server does with the request.  We should clearly document this reasoning in the code that conditions the ClassifyLocal on tracking protection being enabled.

2)  We added the CLASSIFY_URI flag to various places unconditionally, but in some pf those places (e.g. the image loader) we only care about it if tracking protection is enabled, whereas in others we need it for malware/phishing protection.  Should we make those places that only care about tracking protection not set the flag unless tracking protection is enabled?  I'm assuming we can check that more cheaply than we can do whatever nsChannelClassifier::Start does.

3)  We should only do Start() when CLASSIFY_URI is set.

4)  We should in fact use the channel URI principal, not the channel result principal, when classifying.
Attachment #8562212 - Attachment is obsolete: true
(In reply to Boris Zbarsky [:bz] from comment #45)
> > There is no remote check, ever, unless a URI is on a local list. 
> 
> Ah, so Start() does _another_ local check?  

Yes, sorry for the confusing naming of these. It turned out to be the sanest thing to do.

> And the only point of the early
> ClassifyLocal call is to prevent the initial connect which would otherwise
> happen while the async Start() bits happen?  Thank you for the pointer; I'd
> clearly not understood the idea of ClassifyLocal.
> 
> So just to make sure I understand:
> 
> 1)  We don't care about speculative connections to malware/phishing stuff,
> but do care for tracking protection.  I buy that; the danger with
> malware/phishing is what we do with the response, but the danger with
> tracking is what the server does with the request.  We should clearly
> document this reasoning in the code that conditions the ClassifyLocal on
> tracking protection being enabled.

We definitely care for tracking protection. I think we should care for phishing and malware -- we may only be protecting against phishing and malware if we happen to call nsIChannel->Cancel() before the page renders.

> 
> 2)  We added the CLASSIFY_URI flag to various places unconditionally, but in
> some pf those places (e.g. the image loader) we only care about it if
> tracking protection is enabled, whereas in others we need it for
> malware/phishing protection.  Should we make those places that only care
> about tracking protection not set the flag unless tracking protection is
> enabled?  I'm assuming we can check that more cheaply than we can do
> whatever nsChannelClassifier::Start does.

I think we should be classifying images for phishing, so I don't think so.
> 
> 3)  We should only do Start() when CLASSIFY_URI is set.

Yes, thank you for catching that. mLocalBlocklist was guarded by this but obviously I broke that. I fixed it in my latest patch.

> 
> 4)  We should in fact use the channel URI principal, not the channel result
> principal, when classifying.

Yes.
Comment on attachment 8563600 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier, only call nsChannelClassifier if LOAD_CLASSIFY_URI is set

Review of attachment 8563600 [details] [diff] [review]:
-----------------------------------------------------------------

Address bz's comments about LOAD_CLASSIFY_URI.
Attachment #8563600 - Flags: review?(mcmanus)
Attachment #8563600 - Flags: review?(jonas)
> we may only be protecting against phishing and malware if we happen to call
> nsIChannel->Cancel() before the page renders

Hmm.  I would have thought we waited in some way on the classifier result before delivering things like OnStartRequest.  Do we not do that?

In any case, the upshot of the current code is that if tracking protection is not enabled we do NOT prevent preconnects based on malware/phishing bits, right?  So clearly either we're OK with that (for reasons that need documenting, please) or we're not and we should fix that.

Looking at this last patch, are we guaranteed that mCanceled can only be true if (mLoadFlags & LOAD_CLASSIFY_URI)?  Because if not, why are we moving that check inside there?
(In reply to Boris Zbarsky [:bz] from comment #50)
> > we may only be protecting against phishing and malware if we happen to call
> > nsIChannel->Cancel() before the page renders
> 
> Hmm.  I would have thought we waited in some way on the classifier result
> before delivering things like OnStartRequest.  Do we not do that?

We call nsIChannel->Suspend when the classifier starts but that seems like a race condition to me.

> In any case, the upshot of the current code is that if tracking protection
> is not enabled we do NOT prevent preconnects based on malware/phishing bits,
> right?  So clearly either we're OK with that (for reasons that need
> documenting, please) or we're not and we should fix that.

I will expand the comments before ClassifyLocal.

> Looking at this last patch, are we guaranteed that mCanceled can only be
> true if (mLoadFlags & LOAD_CLASSIFY_URI)?  Because if not, why are we moving
> that check inside there?

ContinueBeginConnect handles the case where mCanceled is true. We need to be able to handle that there in any case, because the classifier can cancel the channel.
Speaking of race conditions, I can't read and get a comment into this bug :)


(In reply to Boris Zbarsky [:bz] from comment #45)
> 
> 1)  We don't care about speculative connections 

So this has never been (just) about speculative operations - these can be normal channel loads.

The socket thread stuff that actually does the HTTP proceeds in parallel with the main thread stuff doing the lookup. That's how phishing has always worked - it juts suspend's the channel until it figures out if it needs to cancel it or not - preventing the consumer from ever seeing malware. That suspend doesn't really stop the socket thread from doing its thing (at least until some buffering back pressure builds up). And I would assert that's a good thing.

Tracking protection piggy backed on that, but unfortunately as you point out - its not enough to keep the consumer away from the tracking site's response - we need to prevent the request. And that's what monica's change sought to do by deferring the BeginConnect() part of the channel in the case where a url got a hit on the synchronous local list.. deferring it until it could be looked up in the full list asynchronously.

That is going to slow stuff down - hopefully just a little bit - because now the local check is serialized with starting the networking rather than being done in parallel with it. (And there is a lot idle cpu time involved in the networking - so best to get it started asap). otoh that's kind of the point of the feature.
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #51)
> (In reply to Boris Zbarsky [:bz] from comment #50)
> > > we may only be protecting against phishing and malware if we happen to call
> > > nsIChannel->Cancel() before the page renders
> > 
> > Hmm.  I would have thought we waited in some way on the classifier result
> > before delivering things like OnStartRequest.  Do we not do that?
> 
> We call nsIChannel->Suspend when the classifier starts but that seems like a
> race condition to me.
> 

as long as suspend() is called synchronously on the main thread before beginOpen() is called that should not be racy.
(In reply to Patrick McManus [:mcmanus] from comment #20)
> tanvi - mmc suggests that HttpBaseChannel::GetPrinicipal might need to use
> GetChannelPrincipal() instead of ChannelResultPrinicipal() as well.. what do
> you think? (makes sense to me based on this thread, but I'm not the expert
> here.)

Hi Patrick,

GetChannelPrincipal() was renamed to get GetChannelResultPrincipal() in bug 1062529.  There was a need for a GetChannelURIPrincipal() that returned the codebase principal, instead of the channel's owner in nsCORSProxyListener (https://bugzilla.mozilla.org/show_bug.cgi?id=1038756#c24).

HttpBaseChannel::GetPrincipal was added in Bug 974018 to get redirect history.  It has always called GetChannelPrincipal()/GetChannelResultPrincipal().  To determine if you want GetChannelURIPrincipal() instead, you have to evaluate the callers of GetPrincipal to identify if they expect the channel owner's principal or the principal of the uri that is being loaded.  Either way, we should clearly document which principal is being returned in both HttpBaseChannel::GetPrincipal and in GetChannel*Principal().
(In reply to Patrick McManus [:mcmanus] from comment #53)

> as long as suspend() is called synchronously on the main thread before
> beginOpen() is called that should not be racy.

that wasn't very precise, was it?

s/beginOpen()/continueBeginConnect() is called/

that way mSuspended will be set before any of the routines that deliver events to the listeners run (and they check suspended).

the way it is coded seems fine.
Comment on attachment 8563600 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier, only call nsChannelClassifier if LOAD_CLASSIFY_URI is set

Review of attachment 8563600 [details] [diff] [review]:
-----------------------------------------------------------------

r=mcmanus on the nsHttpChannel change. as for the appropriate Principal call I've learned as much as anyone on this thread, so I shouldn't be holding the r+ bit if someone more experienced is avail.

::: netwerk/base/nsChannelClassifier.cpp
@@ +298,5 @@
>          do_GetService(NS_SCRIPTSECURITYMANAGER_CONTRACTID, &rv);
>      NS_ENSURE_SUCCESS(rv, rv);
>  
>      nsCOMPtr<nsIPrincipal> principal;
> +    rv = securityManager->GetChannelURIPrincipal(mChannel, getter_AddRefs(principal));

I'm going to defer to jonas or cristoph on this because of lack of self confidence. You've convinced me though!

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +4988,5 @@
>          }
>          mCaps &= ~NS_HTTP_ALLOW_PIPELINING;
>      }
> +    // mLocalBlocklist is true only if tracking protection is enabled and the
> +    // URI is a tracking domain, it makes not guarantees about phishing or

s/not/no

@@ +4994,3 @@
>      // nsChannelClassifier to catch phishing and malware URIs.
>      bool callContinueBeginConnect = true;
> +    if (mLoadFlags & LOAD_CLASSIFY_URI) {

I think this might be more clearly written as

if (!(mLoadFlags & LOAD_CLASSIFY_URI)) {
  return ContinueBeginConnect();
}

if (mCanceled ..

it feels simpler. but its up to your discretion.
Attachment #8563600 - Flags: review?(mcmanus) → review+
Comment on attachment 8563600 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier, only call nsChannelClassifier if LOAD_CLASSIFY_URI is set

Ha, it looks like Chris is the one who picked GetChannelResultPrincipal instead of GetChannelURIPrincipal in nsChannelClassifier in the first place in https://hg.mozilla.org/integration/mozilla-inbound/rev/8a1273ab0767, redirecting to him for review :)
Attachment #8563600 - Flags: review?(jonas) → review?(mozilla)
(In reply to Patrick McManus [:mcmanus] from comment #20)
> tanvi - mmc suggests that HttpBaseChannel::GetPrinicipal might need to use
> GetChannelPrincipal() instead of ChannelResultPrinicipal() as well.. what do
> you think? (makes sense to me based on this thread, but I'm not the expert
> here.)

Yes, this function should likely use GetChannelURIPrincipal as well (I assume that's what you meant to ask).

But it depends on what the consumers of this function actually use the principal for. I.e. which principal they are expecting to get.

GetChannelResultPrincipal() returns the principal that the resource will have once it has been downloaded and parsed. This is affected by who loads the resource, and how it's loaded. For example a resource inside a <iframe sandbox> will always return a nullprincipal for GetChannelResultPrincipal().

GetChannelURIPrincipal() returns the principal for the URI which is loaded. In a sense it's the principal for the server. It is not affected by who or how a resource is loaded.

Then there's the triggeringPrincipal and loadingPrincipal, which both live on the nsILoadInfo on the channel. They are documented on nsIIOService.newChannel2.

So it would probably be good to rename HttpBaseChannel::GetPrinicipal to something more specific.
Comment on attachment 8563600 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier, only call nsChannelClassifier if LOAD_CLASSIFY_URI is set

Review of attachment 8563600 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsChannelClassifier.cpp
@@ +298,5 @@
>          do_GetService(NS_SCRIPTSECURITYMANAGER_CONTRACTID, &rv);
>      NS_ENSURE_SUCCESS(rv, rv);
>  
>      nsCOMPtr<nsIPrincipal> principal;
> +    rv = securityManager->GetChannelURIPrincipal(mChannel, getter_AddRefs(principal));

When we did the split from GetChannelPrincipal into GetChannelResultPrincipal and GetChannelURIPricnipal we evaluated each callsite, apparently we got this one wrong. Sorry for breaking your stuff Monica; it is indeed the right change.
Attachment #8563600 - Flags: review?(mozilla) → review+
For the record, I just hit a similar issue on Twitter, when viewing a tweet with an animated GIF (which twitter reencodes as mp4/h.264 and plays via flash fallback).

URL:  https://twitter.com/snookca/status/566042697430228992
Error message:
{
20:59:41.990 The resource at "https://pbs.twimg.com/tweet_video/B9r81WUCcAAqUMZ.mp4" was blocked because tracking protection is enabled.1 twitter.com
}

Noting it here because I'm almost sure it's the same issue.
Summary: Wrong principal is used for nsChannelClassifier checks (tracking Protection blocks videos on facebook) → Wrong principal is used for nsChannelClassifier checks (tracking Protection blocks videos on facebook, and gifs-reencoded-as-h.264 on Twitter)
Attachment #8563600 - Attachment is obsolete: true
Attachment #8563911 - Attachment is obsolete: true
Comment on attachment 8563913 [details] [diff] [review]
Use GetChannelURIPrincipal instead of GetChannelResultPrincipal in nsChannelClassifier, only call nsChannelClassifier if LOAD_CLASSIFY_URI is set (

Review of attachment 8563913 [details] [diff] [review]:
-----------------------------------------------------------------

Address Pat's review comments and carry over r+. I also verified that the attached patch fixes https://twitter.com/snookca/status/566042697430228992. Thanks Daniel!
Attachment #8563913 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/b009014abdb3
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-fx-team]
Target Milestone: --- → Firefox 38
Forgot to clear info along with comment 54.
Flags: needinfo?(tanvi)
You need to log in before you can comment on or make changes to this bug.