Closed Bug 454792 Opened 16 years ago Closed 16 years ago

Should safebrowsing code react to private browsing mode?

Categories

(Toolkit :: Safe Browsing, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: johnath, Unassigned)

References

Details

Attachments

(1 file)

Discussion arose in bug 248970 around whether and how our anti-phishing/anti-malware code should respond to the user entering private browsing mode.

There are three obvious options, but I'd welcome other suggestions:

1) Do nothing
2) Turn off all anti-phishing/anti-malware protections
3) Continue to check sites against the local DB, and to pull updates, but do not perform double-checks of hash-matches.

I don't think option 2 is any good - it would be an unexpected consequence of turning on PB mode - a mode where people may well be *more* likely to visit "high-risk" parts of the internet.  I also think it offers no material advantage over option 3.

The "Do Nothing" option shouldn't be dispensed with too quickly.  The double-check pingbacks are done by requesting a full-length hash, given a match of the partial hash we store locally.  An unscrupulous list provider *could* easily find out which site corresponded to a particular hash, and conclude that the user either visited that site or a hash-colliding one, but information like cookies or user-specific query parameters from the reported bad site are not included in that request.  We also have a fair bit of confidence in the strength of Google's privacy policy with regard to this service.

Nevertheless, as I argued in the private browsing bug, users who activate this mode are likely to be people sensitive to any kind of information leakage, however mitigated, so we should consider whether more than "Do nothing" is needed here.

If private browsing put us into a mode where, as described above, we kept updating the DB, and checking page loads against it, but stopped the double checks, that would close the source of leakage.  The downside, though, is that we would risk blocking sites that had, since our last list update, been cleared.  Given that we update the DB about every half hour, and allow click-through to the blocked site, this is arguably not a terrible trade-off to make but I understand the principled argument too, that we don't want to be blocking sites any more than absolutely necessary.

I'm not settled on a decision here yet - I wanted to open up the discussion, but my personal leaning is towards option 3.  This gets our PB user all the protection they would normally have, risks some false positives (until the next regular DB update clears them) and I think adheres to the principle of least surprise.
Is there a mechanism by which we could force a DB update when going private, to ensure we have the latest data available? That would, I think, make option 3 even less problematic.
Remember that completely random false positives (32-bit hash collisions) are also possible.  The pingback was originally added because of those collisions; the fact that it protected us against stale blocks was just a happy side effect.
Another option which Dave mentioned in bug 248970 comment 194 is to show a specialized error page stating that the URL *may* be a phishing/malware site, but we can't verify this exactly because you're in the private browsing mode, and asking the user to confirm the match by explicit action in order to send the pingback.

I think this may be the best option.  It has all the merits of option 3 mentioned in comment 0, with the added plus that false positives can be verified and eliminated by requiring an explicit action on the user's part, and since we already require user interaction on safebrowsing error pages, I think it's a small trade-off in terms of usability.

Beltzner could possibly add some insight here as well.
OS: Mac OS X → All
Hardware: PC → All
IIRC, the phishing/malware database is large enough that there's a 1/1000 chance of any site having a 32-bit hash collision :(
(In reply to comment #0)
> (...) An unscrupulous list provider *could*
> easily find out which site corresponded to a particular hash, and conclude that
> the user either visited that site or a hash-colliding one, but information like
> cookies or user-specific query parameters from the reported bad site are not
> included in that request.

Cookies from the reported bad site are not included in the gethash request, yes, but cookies from the "safebrowsing" provider (.google.com) ARE included in the gethash request. (Related bug: bug 368255)

> We also have a fair bit of confidence in the
> strength of Google's privacy policy with regard to this service.
> (...)

Cool; however, many FF users (probably most of them) are totally unaware that their browser is remotely controlled by Google. (Related bug: bug 430741)
(In reply to comment #4)
> IIRC, the phishing/malware database is large enough that there's a 1/1000
> chance of any site having a 32-bit hash collision :(

How did you come up with this number? I think the probability of a collision is much, much smaller than 1/1000.
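As a back-of-the-envelope check (a sketch only, under the assumption of uniformly distributed, independent prefixes; the database size used below is hypothetical and not taken from this bug), the two estimates differ mainly in the assumed number of list entries:

```python
# Probability that a random URL's 32-bit SHA-256 prefix matches at least
# one of n_entries prefixes already in the local safebrowsing database.
# Assumes prefixes are uniformly distributed and independent.
def collision_probability(n_entries, prefix_bits=32):
    return 1.0 - (1.0 - 2.0 ** -prefix_bits) ** n_entries

# With roughly four million entries this comes out near 1/1000,
# which would be consistent with the figure quoted in comment #4.
p = collision_probability(4_000_000)
```

So the quoted 1/1000 is plausible if the list holds on the order of millions of entries, and much smaller for a smaller list.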
(In reply to comment #6)
> (In reply to comment #4)
> > IIRC, the phishing/malware database is large enough that there's a 1/1000
> > chance of any site having a 32-bit hash collision :(
> 
> How did you come up with this number? I think that probability of collision is
> much, much smaller than 1/1000.

Oh, and remember that it will be possible to reduce the uncertainty almost completely once bug 441359 gets fixed... (Think of several gethash requests related to just one visited page... Sure, the probability of any single collision is the same, but several specific collisions in a short time span are very unlikely...)
I wrote:
> Oh, and remember that it will be possible to reduce uncertainty almost
> completely, when bug 441359 gets fixed...

Well, I was definitely too conservative, too cautious, when assessing this whole issue with gethash requests -- it is _already_ possible for Google to monitor visits to a predefined set of pages (and the number of pages can be in the hundreds of thousands; sqlite is quite efficient, after all...).

Jesse: it is not just the one 32-bit hash prefix (the "domain" field in the moz_classifier table in the urlclassifier3.sqlite file) that gets checked; if the domain matches, "partial_data" (when non-empty) is checked as well, and that is also a 32-bit hash prefix, but calculated from the full URL (or part of the URL). Hence a collision is _extremely_ unlikely. And a gethash request goes out only when both prefixes match, so Google can get information about the visited page very reliably.
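To make the two-prefix check concrete, here is a small sketch (the hostnames and the helper name `prefix32` are hypothetical illustrations; the real client canonicalizes URLs per the protocol before hashing, which this skips):

```python
import hashlib

def prefix32(s: str) -> str:
    """First 4 bytes of SHA-256 as uppercase hex, the form the 'domain'
    and 'partial_data' columns use in urlclassifier3.sqlite."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:8].upper()

# A gethash request fires only when BOTH prefixes match database entries,
# so an accidental trigger needs a combined 64-bit collision, not 32-bit.
domain_prefix = prefix32("example.com/")             # hypothetical input
url_prefix = prefix32("example.com/path/page.html")  # hypothetical input
```

That combined 64-bit match is why, from the server's perspective, a gethash request is such a strong signal about which page was visited.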
Ping?  Who will have the final call on this?  We need to reach a decision sooner rather than later.
Ehsan - this doesn't need to be part of the initial private browsing landing.  Given the discussions in bug 248970 about the scope of private browsing mode, about the primary emphasis being "Eliminate local traces", followed by "Eliminate certain attacks that allow sites to correlate private and non-private browsing activities" I think this falls into the latter category.

That doesn't mean we shouldn't do it, but since it doesn't require string changes, it isn't bound by the Thursday string freeze, and I also don't want that PB patch to get any more complicated than it already is!  :)
As I said in comment #8 above, it is possible for the data provider of the "safebrowsing" service to monitor visits to some chosen set of pages (at least in some cases).
I've created a project that demonstrates this issue: http://bb.homelinux.org/firefox/sb2/

Note that it also works with "private browsing" mode enabled (tested with Firefox 3.1b2). I know that you don't care about leaving non-local traces, but IMO "private browsing mode" should be renamed to "not so private browsing mode" ;) or perhaps "don't leave LOCAL traces" mode. If not, then users may get the false impression that the browser cares about their privacy when, in fact, it doesn't, IMO.
Actually "safebrowsing" may leave some LOCAL traces, too.

Follow these steps (I assume you use some Unix-like system...):

0. download the script check-for-domain-in-urlcl3-classifier.sh (get your copy from http://bb.homelinux.org/soft/misc/sb2/check-for-domain-in-urlcl3-classifier.sh) and put it somewhere in your PATH (you may need to modify the value of the SQLITE3 variable in the script if your sqlite version 3 command-line binary is named something other than "sqlite3"; the script also requires perl and sha256sum)

1. make sure you have "safebrowsing" enabled

2. make sure that the "safebrowsing" database is complete -- it should contain a hash for "ianfette.org" (I guess this is a "real" testing site; Ian Fette is a Google employee who works on "safebrowsing", see e.g. bug 441359...):
   $ check-for-domain-in-urlcl3-classifier.sh ianfette.org
   Looking for "ianfette.org" (292E6556)...
   FROM moz_classifier:
   [ id  | domain |partial_data|                  complete_data                     |chunk_id|table_id ]
   693310|292E6556|||7895|3
   
   Note that the "complete_data" field is empty. (If the record doesn't show up at all, it means either that you haven't downloaded the full database yet OR that Google decided to remove this hash from the database.)

3. now, run Firefox and visit http://ianfette.org (a warning shows up, but you may ignore it; the site is completely harmless, at least at the moment ;->)

4. see record for this domain in "safebrowsing" database again: 
   $ check-for-domain-in-urlcl3-classifier.sh ianfette.org
   Looking for "ianfette.org" (292E6556)...
   FROM moz_classifier:
   [ id  | domain |partial_data|                  complete_data                     |chunk_id|table_id ]
   693310|292E6556|292E6556|292E655691C847BDA21327EE3893028095996460C537639EABBC2D2E2FDF1C1C|7895|3

   Note that "complete_data" is now filled in, which means that the user (probably...) visited, or wanted to visit, ianfette.org.

In practice, however, this is not a very big issue (but only in the context of leaving LOCAL traces), because the browser requests (and usually receives and caches) MORE than one hash (i.e. not only the hash that matches ianfette.org, but also 4 more adjacent hashes from the database).
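The local-trace check in the steps above can be mimicked with a toy in-memory table (a sketch only: the column names come from the listing above, but the rows and the `has_complete_hash` helper are fabricated for illustration):

```python
import sqlite3

# Toy stand-in for the moz_classifier table in urlclassifier3.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE moz_classifier (
    id INTEGER, domain TEXT, partial_data TEXT,
    complete_data TEXT, chunk_id INTEGER, table_id INTEGER)""")

# Before any visit, only the 32-bit domain prefix is present.
conn.execute("INSERT INTO moz_classifier VALUES "
             "(693310, '292E6556', '', '', 7895, 3)")

def has_complete_hash(prefix):
    """True when complete_data is filled in for this domain prefix,
    i.e. a gethash round-trip already happened (the local trace)."""
    row = conn.execute("SELECT complete_data FROM moz_classifier "
                       "WHERE domain = ?", (prefix,)).fetchone()
    return bool(row and row[0])

before = has_complete_hash('292E6556')   # no trace yet

# A visit to a matching page triggers gethash and caches the full hash:
conn.execute("UPDATE moz_classifier SET complete_data = "
             "'292E6556...' WHERE domain = '292E6556'")
after = has_complete_hash('292E6556')    # trace now visible locally
```

This is exactly the before/after difference the shell script surfaces in steps 2 and 4.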
I don't think that tells you what you think it does. All you really know is that the user visited (or had some site unknowingly load content from) a site whose hash had the same first four bytes as ianfette.org. Maybe it was ianfette.org, but it could just as easily be treehuggersforjesus.org, and when the complete hash didn't match we let them on through. We save the complete hash so we don't have to ping Google for it next time there's a partial match.

The vast majority of safe sites aren't going to be in the database at all so you won't get any hints that the user might have visited them.
(In reply to comment #13)
> I don't think that tells you what you think it does. All you really know is
> that the user visited (or had some site unknowingly load content from)

Let's be clear here: AFAIK (I should play with FF and my server-side implementation of SB to check all my doubts, I guess...), in the 3.0.x line the "safebrowsing" database is consulted only for URLs coming from:
- visited site, ie. address field (including redirs?)
- frames (?)
- URLs embedded in <object> (bug 394485).

Bug 441359 is going to add also checking of these embedded resources in a visited site:
- scripts
- stylesheet files.

Most of these usually come from the same domain (at least considering only the top 2 or 3 domain components, as "safebrowsing" does) as the embedding site (but not always, of course).

> a site
> whose hash had the same first four bytes as ianfette.org. Maybe it was
> ianfette.org, but it could just as easily be treehuggersforjesus.org

I agree that there is a possibility of hash collisions, especially in the "block whole domain" case (i.e. when the partial_data field is empty, as in the example above). I've chosen this example because I don't know any other "real" test site that is "stable" enough and intended to block a full URL (rather than all pages from a given domain, as in the example above). (The example test sites in the mozilla.com domain are special cases -- visiting them never results in /gethash requests and never changes the state of the database, AFAIK.)

However, in the "block full URL (or part of URL)" case, effectively 8 bytes are compared, not 4: not only the hash from the "domain" field, but also the one from the "partial_data" field when "domain" matches. In this case the probability of a collision seems much smaller...

> and when
> the complete hash didn't match we let them on through. We save the complete so
> we don't have to ping Google for it next time there's a partial match.

Not always (for example, the server may send bogus hashes (i.e. ones not starting with the requested hash prefixes), return HTTP 204 and give nothing (this case is mentioned in the spec), or deliberately miscalculate the HMAC (*)). There are also "freshness" requirements in play on the client side...

> The vast majority of safe sites aren't going to be in the database at all so
> you won't get any hints that the user might have visited them.

For this reason (among others) I don't consider this issue very important from the angle of leaving LOCAL traces. From the server's point of view, however, the client leaves some interesting data... (It also depends on what data a given instance of Firefox is fed by the server during regular updates...)


(*) Especially the case of a miscalculated HMAC could be signaled to the user, because it usually indicates some kind of "badness" anyway, i.e. either the server playing games or an attempted man-in-the-middle attack.
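The client-side sanity check implied above (rejecting full hashes that don't extend a requested prefix) can be sketched as follows; `accept_full_hashes` is a hypothetical helper for illustration, not the actual Firefox code, and the HMAC verification mentioned in the footnote would happen separately, over the whole response body:

```python
def accept_full_hashes(requested_prefixes, returned_hashes):
    """Keep only full hashes that extend one of the prefixes we asked
    for; anything else is bogus and must not be cached.  An empty
    response (the HTTP 204 case above) simply yields an empty list."""
    return [h for h in returned_hashes
            if any(h.startswith(p) for p in requested_prefixes)]
```

With a check like this, a misbehaving server can withhold full hashes (leaving stale or colliding blocks in place) but cannot plant arbitrary complete hashes in the local cache.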
Johnath: have you reached a decision here?  I'm thinking that option 1 in comment 0 (do nothing) may be the way to go, given the discussion that has happened here...
(In reply to comment #12)
> Actually "safebrowsing" may leave some LOCAL traces, too.

Since this small subset of the bigger issue would be relatively easy to fix, I filed bug 472421 to decide on that one.
Attaching script mentioned in my comment #12. (Save/rename/make a symlink to it as "check-for-domain-in-urlcl3-classifier.sh".)
(In reply to comment #15)
> Johnath: have you reached a decision here?  I'm thinking that option 1 in
> comment 0 (do nothing) may be the way to go here, given the discussion happened
> here...

Particularly given the work happening in bug 472421, I am inclined to agree. Jesse's right that the false-positive rate, if we can't get full hashes, is profoundly high and would hurt the usability of PB mode, and more to the point, PB mode is, as I said above, principally concerned with local traces.  People who want genuine anonymity when browsing should look to addons like TorButton, which can actually obfuscate your source IP.

I'm resolving this as WONTFIX, beyond the changes being made in bug 472421.  If someone wants to re-open it, they should have concrete suggestions for the style of change, addressing the risk of false positive cache hits and the desire to continue to offer up to date protection in PB mode.  Otherwise, we're done here.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → WONTFIX
Comment on attachment 355854 [details]
script used in comment #12 (GPL3 license)

Noted the license (GPL3 and later) in the attachment description, to help people know what they'll be reading.
Attachment #355854 - Attachment description: script used in comment #12 → script used in comment #12 (GPL3 license)
Product: Firefox → Toolkit