Closed Bug 1083971 Opened 8 years ago Closed 3 years ago
Auto-update the public suffix list out-of-Firefox-release-band
The PSL being out of date has bitten us a few times now, also wrt ESR releases and so on, because it means we can't accurately determine what public domain suffixes are in use. We're also planning to use it for deciding whether URL bar input is likely to be a real domain rather than a search, for which keeping it up to date becomes still more important. We should fetch updates to it every now and again (roughly every week sounds sensible to me) instead of having it be fixed per-Firefox-release.
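To illustrate the URL bar use case mentioned in this comment, here is a minimal Python sketch of a "domain or search?" heuristic. It is not Firefox's actual logic: `looks_like_domain` and the tiny `SUFFIXES` set are hypothetical, and wildcard and exception rules from the real list are deliberately ignored. It only shows why a stale suffix list would make this kind of check misfire.

```python
# Hypothetical, simplified heuristic; not Firefox's actual URL bar logic.
# Wildcard ("*.ck") and exception ("!www.ck") PSL rules are ignored here.
SUFFIXES = {"com", "org", "uk", "co.uk"}  # tiny stand-in for the real list


def looks_like_domain(text, suffixes=SUFFIXES):
    """Return True if `text` ends in a known suffix with a label in front of it."""
    labels = text.strip().lower().rstrip(".").split(".")
    # A single label, an empty label, or a label with spaces is treated as a search.
    if len(labels) < 2 or any(not label or " " in label for label in labels):
        return False
    # Try every trailing run of labels, always leaving at least one label in front.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return True
    return False
```

With an outdated `SUFFIXES` set, a newly delegated suffix would be classified as a search, which is the failure mode this bug is about.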
Weekly sounds fine. One issue is that the PSL is "compiled" before being shipped so that it's more compact and quicker to access. This is done by a build script: netwerk/dns/prepare_tlds.py. You'd need to decide whether to build that compilation into Firefox, or ship the compiled version (which I believe is C++, so that would be complex), or invent a new intermediate format. Gerv
Gavin, this isn't exactly a frontend bug... do you think we have time to pick this up in one of the 36 iterations so we can pick up bug 1080682 in one of the 37 ones? (or, alternatively, can we convince one of the core folks in this area to pick this up?)
This sounds like something that needs to be broken down - is that what you're suggesting?
(In reply to :Gavin Sharp [email: firstname.lastname@example.org] from comment #3)
> This sounds like something that needs to be broken down - is that what
> you're suggesting?
I meant that the other bug being blocked by this (bug 1080682) probably should be fixed after this one. As for breaking it up, I'm not sure if it makes sense for front-end to pick up any of the pieces, but I suppose breaking it up makes sense, at least to split the list data structure off in such a way that it can be updated and isn't compiled into the binary, and then more steps to actually update it - or to support reading updates on top of the compiled-in list or something. Still not sure if front-end is the best to decide the architecture here and/or do the breakdown, though... Gavin/Patrick, thoughts on this?
All I'm saying is that it seems like a large enough undertaking that we need to break down the multiple steps required. I agree that someone closer to the core bits of this would be helpful in that process.
I think jason was the last person to touch this code - so I'll pass the buck for his opinion. this is starting to sound like the phishing list, and the tracker list.. maybe we can reuse that infrastructure? (cc monica)
Hey folks, The phishing list and tracker list use the safebrowsing infrastructure, which fetches updates every 45 minutes or so from a pre-determined server. That sounds like it may be overkill for PSL, unless it thrashes a lot. For other lists related to security stuff, we use buildbot which runs weekly and updates Nightly and Aurora, but not Beta. That would basically mean that the PSL list needs to be live for 14 weeks when it hits Beta. The buildbot work for HPKP and HSTS was done in https://bugzilla.mozilla.org/show_bug.cgi?id=836097 https://bugzilla.mozilla.org/show_bug.cgi?id=1004279
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #7)
> Hey folks,
>
> The phishing list and tracker list use the safebrowsing infrastructure,
> which fetches updates every 45 minutes or so from a pre-determined server.
> That sounds like it may be overkill for PSL, unless it thrashes a lot. For
> other lists related to security stuff, we use buildbot which runs weekly and
> updates Nightly and Aurora, but not Beta. That would basically mean that the
> PSL list needs to be live for 14 weeks when it hits Beta. The buildbot work
> for HPKP and HSTS was done in
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=836097
> https://bugzilla.mozilla.org/show_bug.cgi?id=1004279
Every 45 minutes /would/ be overkill, but we're looking for something that's updatable out-of-release-bands, so the buildbot work isn't a great fit either, unfortunately.
(In reply to Patrick McManus [:mcmanus] from comment #6)
> this is starting to sound like the phishing list, and the tracker list..
> maybe we can reuse that infrastructure? (cc monica)
I keep saying we need a generic local-info-updating infrastructure! As Gijs says, this needs to happen at non-release times, and needs to keep happening even when we stop making updates to a particular release. Every 45 minutes would be OK, if that's all we have, as long as the server can return Not Modified, and can cope with that many hits. The PSL tends to change approximately monthly, but it can be more frequent. However, a month's delay from checkin to full deployment would still be much, much better than what we have now. Gerv
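The "server can return Not Modified" point can be sketched as a conditional HTTP GET. This is a rough Python illustration using only the standard library, not a proposal for how Firefox should do it: `build_conditional_request` and `fetch_if_changed` are hypothetical helper names, and the URL is the published list location cited elsewhere in this bug.

```python
import urllib.error
import urllib.request

PSL_URL = "https://publicsuffix.org/list/public_suffix_list.dat"


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that lets the server answer 304 Not Modified."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(req, opener=urllib.request.urlopen):
    """Return (body, etag) for a fresh copy, or (None, None) if unchanged."""
    try:
        with opener(req) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "nothing new".
        if err.code == 304:
            return None, None
        raise
```

A client would store the returned ETag next to its cached copy and pass it back on the next poll, so frequent checks cost the server only a header exchange when the list has not changed.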
wget recently got PSL support built-in (so they now have the Firefox problem with possibly stale PSL information for a long time) and we're looking at doing the same for curl/libcurl soonish. If this turns out to be a suitable decent generic binary PSL download solution, there might be more out there that can benefit... Then I don't mean using the hosting infrastructure for downloads perhaps, but more a decent binary format and associated code for doing PSL operations. I'm sure Tim over at libpsl will be interested: https://github.com/rockdaboot/libpsl
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P5
Hi, Honestly, right now, my number one frustration with Firefox is bug 1080682 which is blocked by this. I'm very interested in getting involved with open source development. What can I do to help get this started? It looks like (at a high level) the following steps would need to happen: 1. Download PSL from https://publicsuffix.org/list/public_suffix_list.dat, leveraging caching 2. Store it somewhere semi-permanent (not sure where, maybe same place as cache?) 3. Use https://github.com/rockdaboot/libpsl or similar library to decode the list 4. Redownload weekly These changes should either be made in a way that clients of the existing built-in PSL can use it, or if more appropriate, clients of the existing PSL should be modified to use this new implementation. Any thoughts, suggestions? I'm willing to put in the work on this because bug 1080682 makes me want to punch Firefox in the face on a regular basis.
(In reply to Gervase Markham [:gerv] (not reading bugmail) from comment #9)
> (In reply to Patrick McManus [:mcmanus] from comment #6)
> > this is starting to sound like the phishing list, and the tracker list..
> > maybe we can reuse that infrastructure? (cc monica)
>
> I keep saying we need a generic local-info-updating infrastructure!
We've got one of these now, it's called "remote settings" (formerly/also known as kinto). Mathieu, what would it take to use that for the PSL updates? Especially considering the PSL code right now is largely written in C++, not JS... I guess we could do a similar thing to what the gfx blocklist does right now and have JS code to handle the updates and have that broadcast to C++ with any changes or something.
(In reply to Connor from comment #12)
> Hi,
>
> Honestly, right now, my number one frustration with Firefox is bug 1080682
> which is blocked by this. I'm very interested in getting involved with open
> source development. What can I do to help get this started? It looks like
> (at a high level) the following steps would need to happen:
>
> 1. Download PSL from https://publicsuffix.org/list/public_suffix_list.dat,
> leveraging caching
I think we'd probably want to use a more structured format than a text file, ideally with diffing and/or push notification support so we don't have to redownload the entire list, as well as strong security/integrity guarantees (stronger than "just" downloading it over https). I believe kinto offers all of this. We already use it for the onecrl list of distrusted intermediate certificates (i.e. certs that sit somewhere between root certificates and site/endpoint certificates).
> 2. Store it somewhere semi-permanent (not sure where, maybe same place as
> cache?)
The data would be stored in the user's Firefox profile, but this is something the existing remotesettings JS client takes care of already - you wouldn't need to be concerned with the exact specifics of this.
> Any thoughts, suggestions? I'm willing to put in the work on this because
> bug 1080682 makes me want to punch Firefox in the face on a regular basis.
This is really encouraging, thanks for offering to help with this bug.
This would indeed be a perfect use-case for RemoteSettings. Depending on how often it gets updated and how big it is, there are several strategies. For example, the two obvious ones would be:
- have one record per suffix (optimal sync but tedious to edit)
- have one record with the list as attachment (redownloaded completely on each server side update)
Then, regarding its load by the Firefox code, we also have several approaches available. For example, in the "sync" event callback we could either:
- write a dump on disk and send a signal for the C++ to reload it [1]
- serialize the list and send it to C++ as string in the message payload [2]
- define a C++ interface and call it from JS [3]
[1] https://searchfox.org/mozilla-central/rev/ef51c56995c72e21683b1db390f920fedd93a91c/services/common/blocklist-clients.js#309-328
[2] https://searchfox.org/mozilla-central/rev/d2966246905102b36ef5221b0e3cbccf7ea15a86/toolkit/mozapps/extensions/Blocklist.jsm#1206-1232
[3] https://searchfox.org/mozilla-central/rev/d2966246905102b36ef5221b0e3cbccf7ea15a86/services/common/blocklist-clients.js#69-104
I don't know what the administrative work for updating this would be, so I'm not sure whether one record per suffix or one record for the whole list is better. For context, the list right now has 12661 lines and 7849 suffixes (the other lines are blank or comments). The copy in Firefox's source [1] hasn't really ever been systematically updated, but the "official source" on github [2] is updated anywhere from multiple times a week to every other week. I propose that the optimal solution would use one record per suffix, and that a script could be used to add and remove records based on changes to the file. Should I move forward with implementing it that way (one record per suffix)?
[1] https://hg.mozilla.org/mozilla-central/filelog/28ad9a9e95d518e1163e550ae19c972aabb44df5/netwerk/dns/effective_tld_names.dat
[2] https://github.com/publicsuffix/list/commits/master/public_suffix_list.dat
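As a rough illustration of the "script that adds and removes records" idea, the Python sketch below separates suffix rules from blank lines and `//` comments and then diffs two versions of the list. `parse_rules` and `diff_rules` are hypothetical names, and the sketch says nothing about how records would actually be pushed to a server.

```python
def parse_rules(text):
    """Return the set of suffix rules in PSL text, skipping blanks and comments."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        # A rule is the text up to the first whitespace on the line.
        rules.add(line.split()[0])
    return rules


def diff_rules(old_text, new_text):
    """Return (added, removed) rules between two versions of the list."""
    old, new = parse_rules(old_text), parse_rules(new_text)
    return sorted(new - old), sorted(old - new)
```

Under the one-record-per-suffix strategy, each entry in `added` would become a record to create and each entry in `removed` a record to delete.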
(In reply to Connor from comment #15)
> I don't know what the work administration-wise would be for updating this,
> so I'm not sure whether one record per suffix vs one record for the whole
> list is better. For context, the list right now has 12661 lines and 7849
> suffixes (the other lines are blank or comments). The copy in Firefox's
> source [1] hasn't really ever been systematically updated, but the "official
> source" on github [2] is updated ranging from multiples times a week to
> every other week. I propose that the optimal solution would use one record
> per suffix, and that a script could be used to add and remove records based
> on changes to the file. Should I move forward with implementing it that way
> (one record per suffix)?
Is it possible for the list to be preprocessed on the server at all, and then Firefox could just download the new blob of data structure, rather than doing a bunch of munging client side? IIUC, this would require one record for the whole list, so we might be downloading many largish updates (sorry, not terribly familiar with the proposed update mechanism, apologies for sounding ignorant), but I think that's OK, if we can get the list in some sort of thing that's easily indexable.
On closer examination, the current implementation requires that it be a DAFSA [1], so unless there's a compelling reason to change away from that, I will implement it with that expectation. The DAFSA should probably be generated server side, which means it will be one record for the entire serialized data structure. As it stands now, the raw serialized DAFSA is around 35k. I think this is a perfectly reasonable size for updates, especially knowing that it will only download when changed.
[1] https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
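For readers unfamiliar with the data structure, here is a minimal Python sketch of why a DAFSA is compact: it builds a plain trie and then merges identical subtrees, so shared endings such as ".uk" are stored once. This is only an illustration under my own assumptions; it is not the algorithm or the serialized format used by make_dafsa.py, and all names in it are hypothetical.

```python
class Node:
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}     # character -> child Node
        self.final = False  # True if a word ends at this node


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up and return the canonical node."""
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # Two nodes are equivalent if they agree on finality and on their
    # (already canonical) children; the registry keeps one node per signature.
    key = (node.final, tuple(sorted((ch, id(c)) for ch, c in node.edges.items())))
    return registry.setdefault(key, node)


def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


def contains(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final
```

Comparing `count_nodes` before and after `minimize` on a handful of suffixes that share endings shows the saving, while `contains` keeps giving the same answers.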
That sounds ideal, thank you!
From its description:
> A "public suffix" is one under which Internet users can (or historically could) directly register names.
It looks like it fully degenerated into a list of subdomain providers. https://github.com/publicsuffix/list/commit/65ddeb3eca4cfd9f436f7b2fed49df57624d40f7#diff-7a8a497c39dadd4b04d30f5e8e679bf8 Good luck with that.
FYI: at least Fedora is already shipping a package called "publicsuffix-list" which is exactly the PSL as a DAFSA file. So it seems to be a general consensus to be "the way" to do it. curl will use that file to load PSL dynamically and allow it to be independently updated.
So, a possible approach would be:
1. have a script that builds the DAFSA file from the latest data
2. publish the file on Remote Settings server using the REST API
3. have someone in charge of signing off the change
4. let the client download and ingest the file using the Remote Settings client API
For 1, I can't help much ;) For 2, we can use the DEV server to build the prototype. Publishing the record with the attached file would just consist in running something like this Gist [1]. Once the prototype is done, we'll have to setup STAGE/PROD with a new collection, signoff, give VPN access etc. (full procedure is on Mana [2]). Step 3 only makes sense once using STAGE/PROD. For step 4, the official API documentation is here: https://searchfox.org/mozilla-central/source/services/common/docs/RemoteSettings.rst In order to use the DEV server, a tutorial is being published [3]. Ping me or send me an email if you need more info ;)
[1] https://gist.github.com/leplatrem/b67c3465321d61aa05e3f07f8f3ca05a
[2] https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=66655528
[3] https://github.com/mozilla/remote-settings/pull/66
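For step 4, one sanity check a client could run before ingesting a downloaded file is sketched below in Python. It assumes the record's attachment metadata carries a SHA-256 hex digest under `hash` and a byte count under `size`; those field names are an assumption on my part and should be checked against the Remote Settings documentation. `attachment_matches` is a hypothetical helper, not part of any client API.

```python
import hashlib


def attachment_matches(blob, metadata):
    """Check downloaded bytes against the size and SHA-256 recorded with them.

    `metadata` is assumed (not confirmed here) to look like
    {"hash": "<sha256 hex digest>", "size": <number of bytes>}.
    """
    if len(blob) != metadata["size"]:
        return False
    return hashlib.sha256(blob).hexdigest() == metadata["hash"]
```

A client would keep using its current list whenever this check fails, instead of loading a truncated or corrupted file.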
The tutorials for Remote Settings were published: https://remote-settings.readthedocs.io And I added one about attachments: https://remote-settings.readthedocs.io/en/latest/tutorial-attachments.html Have fun :)
Connor, Did you make some progress? Can I help you in some way? I updated the docs with new tutorials, screencasts etc. https://remote-settings.readthedocs.io
Mathieu, Yep, I've been making progress. Since this is my first contribution, I've been taking some time to familiarize myself with the Firefox source. I will check out the new tutorials, and I'll let you know if I have any questions! Thanks!
Hi Connor, Are you still interested in working on this? Can I help in some way? Let us know ;)
Hi Mathieu, Unfortunately school has started again, and while I keep hoping I'll find time to work on this, I don't think I will for the foreseeable future. I haven't made any significant progress, so if someone else wants to work on it, they should! Thanks!
Attachment #9070986 - Attachment description: Bug 1083971 - Created to_bin(), words_to_bin() functions in make_dafsa and modified prepare_tlds.py to deal with different number of arguments → Bug 1083971 - Add an option to output a binary file for the PSL data
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/822cb68b6ab7 Add an option to output a binary file for the PSL data r=leplatrem,erahm
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/27de3a352a39 Added a new line in xpcom/ds/tools/make_dafsa.py to fix lint failure
Assignee: nobody → arpitbharti73
Resolution: INCOMPLETE → FIXED