Auto-update the public suffix list out-of-Firefox-release-band

Status: NEW (defect, Unassigned)
Priority: P5; Severity: normal
Opened: 5 years ago; Last updated: 9 days ago

People

(Reporter: Gijs, Unassigned)

Tracking

(Blocks 2 bugs)
Version: Trunk
Firefox Tracking Flags: (Not tracked)

Details

(Whiteboard: [necko-would-take])

Attachments

(3 attachments)

Reporter

Description

5 years ago
The PSL being out of date has bitten us a few times now, also wrt ESR releases and so on, because it means we can't accurately determine what public domain suffixes are in use. We're also planning to use it for deciding whether URL bar input is likely to be a real domain rather than a search, for which keeping it up to date becomes still more important. We should fetch updates to it every now and again (roughly every week sounds sensible to me) instead of having it be fixed per-Firefox-release.

Comment 1

5 years ago
Weekly sounds fine. One issue is that the PSL is "compiled" before being shipped so that it's more compact and quicker to access. This is done by a build script: netwerk/dns/prepare_tlds.py. You'd need to decide whether to build that compilation into Firefox, or ship the compiled version (which I believe is C++, so that would be complex), or invent a new intermediate format.

Gerv
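For reference, the "compilation" step mentioned above starts from the raw public_suffix_list.dat text. A minimal sketch of the parsing stage (hypothetical, much simpler than the real netwerk/dns/prepare_tlds.py, which also emits a compact binary representation):

```python
# Minimal sketch of PSL parsing (hypothetical; prepare_tlds.py does more).
def parse_psl(text):
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        exception = line.startswith("!")  # e.g. "!www.ck" overrides "*.ck"
        if exception:
            line = line[1:]
        wildcard = line.startswith("*.")
        entries.append((line, exception, wildcard))
    return entries

sample = "// ===BEGIN ICANN DOMAINS===\ncom\n*.ck\n!www.ck\n"
print(parse_psl(sample))
```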
Reporter

Comment 2

5 years ago
Gavin, this isn't exactly a frontend bug... do you think we have time to pick this up in one of the 36 iterations so we can pick up bug 1080682 in one of the 37 ones? (or, alternatively, can we convince one of the core folks in this area to pick this up?)
Flags: needinfo?(gavin.sharp)

Comment 3

5 years ago
This sounds like something that needs to be broken down - is that what you're suggesting?
Flags: needinfo?(gavin.sharp)
Reporter

Comment 4

5 years ago
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #3)
> This sounds like something that needs to be broken down - is that what
> you're suggesting?

I meant that the other bug being blocked by this (bug 1080682) probably should be fixed after this one.

As for breaking it up, I'm not sure if it makes sense for front-end to pick up any of the pieces, but I suppose breaking it up makes sense, at least to split the list data structure off in such a way that it can be updated and isn't compiled into the binary, and then more steps to actually update it - or to support reading updates on top of the compiled-in list or something. Still not sure if front-end is the best to decide the architecture here and/or do the breakdown, though...

Gavin/Patrick, thoughts on this?
Flags: needinfo?(mcmanus)
Flags: needinfo?(gavin.sharp)

Comment 5

5 years ago
All I'm saying is that it seems like a large enough undertaking that we need to break down the multiple steps required. I agree that someone closer to the core bits of this would be helpful in that process.
Flags: needinfo?(gavin.sharp)

Comment 6

5 years ago
I think jason was the last person to touch this code - so I'll pass the buck for his opinion.

this is starting to sound like the phishing list, and the tracker list.. maybe we can reuse that infrastructure? (cc monica)
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(mcmanus) → needinfo?(mmc)

Comment 7

5 years ago
Hey folks,

The phishing list and tracker list use the safebrowsing infrastructure, which fetches updates every 45 minutes or so from a pre-determined server. That sounds like it may be overkill for PSL, unless it thrashes a lot. For other lists related to security stuff, we use buildbot which runs weekly and updates Nightly and Aurora, but not Beta. That would basically mean that the PSL list needs to be live for 14 weeks when it hits Beta. The buildbot work for HPKP and HSTS was done in

https://bugzilla.mozilla.org/show_bug.cgi?id=836097
https://bugzilla.mozilla.org/show_bug.cgi?id=1004279
Flags: needinfo?(mmc)
Reporter

Comment 8

5 years ago
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #7)
> Hey folks,
> 
> The phishing list and tracker list use the safebrowsing infrastructure,
> which fetches updates every 45 minutes or so from a pre-determined server.
> That sounds like it may be overkill for PSL, unless it thrashes a lot. For
> other lists related to security stuff, we use buildbot which runs weekly and
> updates Nightly and Aurora, but not Beta. That would basically mean that the
> PSL list needs to be live for 14 weeks when it hits Beta. The buildbot work
> for HPKP and HSTS was done in
> 
> https://bugzilla.mozilla.org/show_bug.cgi?id=836097
> https://bugzilla.mozilla.org/show_bug.cgi?id=1004279

every 45 minutes /would/ be overkill, but we're looking for something that's updatable out-of-release-bands, so the buildbot work isn't a great fit either, unfortunately.

Comment 9

5 years ago
(In reply to Patrick McManus [:mcmanus] from comment #6)
> this is starting to sound like the phishing list, and the tracker list..
> maybe we can reuse that infrastructure? (cc monica)

I keep saying we need a generic local-info-updating infrastructure!

As Gijs says, this needs to happen at non-release times, and needs to keep happening even when we stop making updates to a particular release. Every 45 minutes would be OK, if that's all we have, as long as the server can return Not Modified, and can cope with that many hits.

The PSL tends to change approximately monthly, but it can be more frequent. However, a month's delay from checkin to full deployment would still be much, much better than what we have now.

Gerv
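The "Not Modified" idea above relies on standard HTTP conditional requests. A minimal sketch (hypothetical helper, not existing Firefox code):

```python
# Hypothetical helper: build conditional-request headers so the periodic PSL
# check can receive a cheap "304 Not Modified" when the list hasn't changed.
def conditional_headers(etag=None, last_modified=None):
    headers = {}
    if etag:
        headers["If-None-Match"] = etag  # validator from a previous response
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

print(conditional_headers(etag='"abc123"'))
```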
wget recently got PSL support built-in (so they now have the Firefox problem with possibly stale PSL information for a long time) and we're looking at doing the same for curl/libcurl soonish.

If this turns out to be a decent generic binary PSL download solution, there might be more projects out there that can benefit... By that I don't mean sharing the hosting infrastructure for downloads, necessarily, but rather a decent binary format and associated code for doing PSL operations. I'm sure Tim over at libpsl will be interested: https://github.com/rockdaboot/libpsl

Updated

4 years ago
Flags: needinfo?(jduell.mcbugs)
Whiteboard: [necko-would-take]
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P5
Reporter

Updated

2 years ago
Blocks: 1416247

Comment 12

Last year
Hi,

Honestly, right now, my number one frustration with Firefox is bug 1080682 which is blocked by this.  I'm very interested in getting involved with open source development.  What can I do to help get this started?  It looks like (at a high level) the following steps would need to happen:

1. Download PSL from https://publicsuffix.org/list/public_suffix_list.dat, leveraging caching
2. Store it somewhere semi-permanent (not sure where, maybe same place as cache?)
3. Use https://github.com/rockdaboot/libpsl or similar library to decode the list
4. Redownload weekly

These changes should either be made in a way that clients of the existing built-in PSL can use it, or if more appropriate, clients of the existing PSL should be modified to use this new implementation.

Any thoughts, suggestions?  I'm willing to put in the work on this because bug 1080682 makes me want to punch Firefox in the face on a regular basis.
Reporter

Comment 13

Last year
(In reply to Gervase Markham [:gerv] (not reading bugmail) from comment #9)
> (In reply to Patrick McManus [:mcmanus] from comment #6)
> > this is starting to sound like the phishing list, and the tracker list..
> > maybe we can reuse that infrastructure? (cc monica)
> 
> I keep saying we need a generic local-info-updating infrastructure!

We've got one of these now, it's called "remote settings" (formerly/also known as kinto). Mathieu, what would it take to use that for the PSL updates? Especially considering the PSL code right now is largely written in C++, not JS... I guess we could do a similar thing to what the gfx blocklist does right now and have JS code to handle the updates and have that broadcast to C++ with any changes or something.

(In reply to Connor from comment #12)
> Hi,
> 
> Honestly, right now, my number one frustration with Firefox is bug 1080682
> which is blocked by this.  I'm very interested in getting involved with open
> source development.  What can I do to help get this started?  It looks like
> (at a high level) the following steps would need to happen:
> 
> 1. Download PSL from https://publicsuffix.org/list/public_suffix_list.dat,
> leveraging caching

I think we'd probably want to use a more structured format than a text file, ideally with diffing and/or push notification support so we don't have to redownload the entire list, as well as strong security/integrity guarantees (stronger than "just" downloading it over https). I believe kinto offers all of this. We already use it for the onecrl list of distrusted intermediate certificates (ie certs that sit somewhere between root certificates and site/endpoint certificates).

> 2. Store it somewhere semi-permanent (not sure where, maybe same place as
> cache?)

The data would be stored in the user's Firefox profile, but this is something the existing remotesettings JS client takes care of already - you wouldn't need to be concerned with the exact specifics of this.

> Any thoughts, suggestions?  I'm willing to put in the work on this because
> bug 1080682 makes me want to punch Firefox in the face on a regular basis.

This is really encouraging, thanks for offering to help with this bug.
Flags: needinfo?(mathieu)

Comment 14

Last year
This would indeed be a perfect use-case for RemoteSettings.

Depending how often it gets updated and how big it is, there are several strategies.
For example, the two obvious ones would be:
- have one record per suffix (optimal sync but tedious to edit)
- have one record with the list as attachment (redownloaded completely on each server side update)

Then, regarding its load by the Firefox code, we also have several approaches available.
For example, in the "sync" event callback we could either:
- write a dump on disk and send a signal for the C++ to reload it [0]
- serialize the list and send it to C++ as string in the message payload [1]
- define a C++ interface and call it from JS [2]

[0] https://searchfox.org/mozilla-central/rev/ef51c56995c72e21683b1db390f920fedd93a91c/services/common/blocklist-clients.js#309-328
[1] https://searchfox.org/mozilla-central/rev/d2966246905102b36ef5221b0e3cbccf7ea15a86/toolkit/mozapps/extensions/Blocklist.jsm#1206-1232 
[2] https://searchfox.org/mozilla-central/rev/d2966246905102b36ef5221b0e3cbccf7ea15a86/services/common/blocklist-clients.js#69-104
Flags: needinfo?(mathieu)

Comment 15

Last year
I don't know what the work administration-wise would be for updating this, so I'm not sure whether one record per suffix vs. one record for the whole list is better.  For context, the list right now has 12661 lines and 7849 suffixes (the other lines are blank or comments). The copy in Firefox's source [0] hasn't really ever been systematically updated, but the "official source" on GitHub [1] is updated anywhere from multiple times a week to every other week.  I propose that the optimal solution would use one record per suffix, and that a script could be used to add and remove records based on changes to the file.  Should I move forward with implementing it that way (one record per suffix)?

[0] https://hg.mozilla.org/mozilla-central/filelog/28ad9a9e95d518e1163e550ae19c972aabb44df5/netwerk/dns/effective_tld_names.dat
[1] https://github.com/publicsuffix/list/commits/master/public_suffix_list.dat
Flags: needinfo?(mathieu)
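The add/remove script proposed in comment 15 could be sketched as (hypothetical helper):

```python
# Hypothetical sketch of the add/remove script: diff the previous and current
# suffix sets to decide which per-suffix records to create and delete.
def plan_record_changes(old_suffixes, new_suffixes):
    old, new = set(old_suffixes), set(new_suffixes)
    return sorted(new - old), sorted(old - new)

to_add, to_remove = plan_record_changes(["com", "co.uk"], ["com", "dev"])
print(to_add, to_remove)  # ['dev'] ['co.uk']
```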
(In reply to Connor from comment #15)
> I don't know what the work administration-wise would be for updating this,
> so I'm not sure whether one record per suffix vs one record for the whole
> list is better.  For context, the list right now has 12661 lines and 7849
> suffixes (the other lines are blank or comments). The copy in Firefox's
> source [0] hasn't really ever been systematically updated, but the "official
> source" on github [1] is updated ranging from multiples times a week to
> every other week.  I propose that the optimal solution would use one record
> per suffix, and that a script could be used to add and remove records based
> on changes to the file.  Should I move forward with implementing it that way
> (one record per suffix)?

Is it possible for the list to be preprocessed on the server at all, and then Firefox could just download the new blob of data structure, rather than doing a bunch of munging client side?  IIUC, this would require one record for the whole list, so we might be downloading many largish updates (sorry, not terribly familiar with the proposed update mechanism, apologies for sounding ignorant), but I think that's OK, if we can get the list in some sort of thing that's easily indexable.

Comment 17

Last year
On closer examination, the current implementation requires that it be a DAFSA [0], so unless there's a compelling reason to change away from that, I will implement it with that expectation.  The DAFSA should probably be generated server side, which means it will be one record for the entire serialized data structure.  As it stands now, the raw serialized DAFSA is around 35k.  I think this is a perfectly reasonable size for updates, especially knowing that it will only download when changed.

[0] https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
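For intuition: a DAFSA is essentially a trie whose equivalent subtrees have been merged. A minimal sketch of suffix lookup using a plain, unmerged trie (hypothetical; the real implementation also handles wildcard and exception rules):

```python
# Hypothetical sketch: build a plain trie of suffixes and find the longest
# public suffix of a domain. A real DAFSA would merge equivalent subtrees
# before serializing, which is what shrinks the list to ~35k.
def build_trie(suffixes):
    root = {}
    for s in suffixes:
        node = root
        for label in reversed(s.split(".")):  # match from the TLD inward
            node = node.setdefault(label, {})
        node["$"] = True  # end-of-suffix marker
    return root

def longest_public_suffix(trie, domain):
    labels = domain.split(".")
    node, match = trie, 0
    for i, label in enumerate(reversed(labels), 1):
        if label not in node:
            break
        node = node[label]
        if "$" in node:
            match = i
    return ".".join(labels[len(labels) - match:]) if match else None
```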
That sounds ideal, thank you!

Comment 19

Last year
From its description:

> A "public suffix" is one under which Internet users can (or historically could) directly register names.

It looks like it has fully degenerated into a list of subdomain providers.
https://github.com/publicsuffix/list/commit/65ddeb3eca4cfd9f436f7b2fed49df57624d40f7#diff-7a8a497c39dadd4b04d30f5e8e679bf8

Good luck with that.
FYI: at least Fedora is already shipping a package called "publicsuffix-list" which is exactly the PSL as a DAFSA file. So there seems to be general consensus that this is "the way" to do it. curl will use that file to load the PSL dynamically and allow it to be updated independently.
So, a possible approach would be:

1. have a script that builds the DAFSA file from the latest data 
2. publish the file on Remote Settings server using the REST API
3. have someone in charge of signing off the change
4. let the client download and ingest the file using the Remote Settings client API


For 1, I can't help much ;)

For 2, we can use the DEV server to build the prototype. Publishing the record with the attached file would just consist of running something like this Gist [0].

Once the prototype is done, we'll have to set up STAGE/PROD with a new collection, sign-off, give VPN access, etc. (the full procedure is on Mana [1]).

Step 3 only makes sense once using STAGE/PROD.

For step 4. the official API documentation is here: https://searchfox.org/mozilla-central/source/services/common/docs/RemoteSettings.rst
In order to use the DEV server, a tutorial is being published [2].

Ping me or send me an email if you need more info ;)

[0] https://gist.github.com/leplatrem/b67c3465321d61aa05e3f07f8f3ca05a
[1] https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=66655528
[2] https://github.com/mozilla/remote-settings/pull/66
Flags: needinfo?(mathieu)
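Step 2 above might be sketched as follows, assuming the kinto-attachment multipart endpoint layout; the server URL, bucket, collection, and record id are placeholders (the linked Gist shows the real flow):

```python
import urllib.request

# Hypothetical sketch of step 2: upload the compiled DAFSA as a record
# attachment on a Remote Settings (Kinto) DEV server. All names below are
# placeholders, not the real deployment.
SERVER = "https://kinto.dev.example/v1"
RECORD = SERVER + "/buckets/main/collections/public-suffix-list/records/psl"

def build_attachment_request(dafsa_bytes):
    # kinto-attachment expects a multipart/form-data POST on .../attachment
    boundary = "----pslboundary"
    body = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="attachment"; '
        'filename="public_suffix_list.dafsa"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).format(b=boundary).encode() + dafsa_bytes + (
        "\r\n--{b}--\r\n".format(b=boundary)
    ).encode()
    return urllib.request.Request(
        RECORD + "/attachment",
        data=body,
        method="POST",
        headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
    )

req = build_attachment_request(b"\x00DAFSA...")
print(req.get_method(), req.full_url)
```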
The tutorials for Remote Settings were published:
https://remote-settings.readthedocs.io

And I added one about attachments:
https://remote-settings.readthedocs.io/en/latest/tutorial-attachments.html 

Have fun :)
Connor,

Did you make some progress? Can I help you in some way?

I updated the docs with new tutorials, screencasts etc. https://remote-settings.readthedocs.io
Flags: needinfo?(connor.hewitt)

Comment 24

11 months ago
Mathieu,

Yep, I've been making progress.  Since this is my first contribution, I've been taking some time to familiarize myself with the Firefox source.  I will check out the new tutorials, and I'll let you know if I have any questions!

Thanks!
Flags: needinfo?(connor.hewitt)
Hi Connor,

Are you still interested to work on this? Can I help in some way?

Let us know ;)

Comment 26

8 months ago
Hi Mathieu,

Unfortunately school has started again, and while I keep hoping I'll find time to work on this, I don't think I will for the foreseeable future.  I haven't made any significant progress, so if someone else wants to work on it, they should!

Thanks!

(In reply to Mathieu Leplatre [:leplatrem] from comment #22)
> The tutorials for Remote Settings were published:
> https://remote-settings.readthedocs.io
>
> And I added one about attachments:
> https://remote-settings.readthedocs.io/en/latest/tutorial-attachments.html
>
> Have fun :)

Hello Mathieu, there hasn't been any development on this issue in a while, and I see it is mentioned in the GSoC 2019 brainstorming document.
Will you be mentoring that project? I'll be going through the resources you've linked to and getting more familiar with this code. Do you have any further information about how this project will proceed in GSoC?

Hi, Mathieu, I would also like to contribute here.
(In reply to Arpit Bharti from comment #27)
> Will you be mentoring that project? I'll be going through the resources
> you've linked to and get more familiar with this code.

In addition to what Arpit asked, I would also like to know of any mini tasks or starter issues relevant to this one that I could work on, in order to get started here.

Flags: needinfo?(mathieu)

I can't really spend time on this right now, but once the project gets approved we'll start by writing a detailed roadmap.

In the mean time, if you've gone through all the tutorials mentioned above and want to dig more stuff, there's also some easy-pick issues in the ecosystem that powers Remote Settings: https://github.com/issues?q=is%3Aopen+is%3Aissue+archived%3Afalse+user%3AKinto+label%3Aeasy-pick

Thanks for your interest and motivation!

Flags: needinfo?(mathieu)

Hello everyone, this bug will be worked on as a project under Google's Summer of Code program. I am Arpit from Delhi, India and I will be working under the mentorship of Mathieu[:leplatrem] for the next three months to submit patches for this bug.
We have come up with a strategy detailed in this blueprint document:
https://docs.google.com/document/d/1kxlAhu87MQtATxYfBdfRO-WjMHVNo1jA9Gr5mdVBnN8/edit?usp=sharing
