Closed Bug 414122 Opened 17 years ago Closed 16 years ago

Preprocess effective TLD data into C++ code (eliminate file I/O to read, move parsing out of C++, remove an arena, etc.)

Categories

(Core :: Networking, defect, P2)


Tracking


RESOLVED FIXED
mozilla1.9beta4

People

(Reporter: Waldo, Assigned: Waldo)

Details

(Keywords: perf)

Attachments

(4 files, 4 obsolete files)

Attached patch Patch (obsolete) — Splinter Review
Effective TLD data is currently kept in two files and read at startup (within app data and possibly within the profile), which, while convenient for editing, has a few problems:

- A stat that usually fails for the profile location
- Trips into kernel space to get the data
- Data has to be copied from kernel->user space
- Getting that data into Mozilla strings might often
  involve yet another copy
- Lowercasing/punycoding for ASCII/Unicode entries
  means that even if the previous copy is optimized
  away, a copy is still necessary to get entries to
  their final string form
- A *final* copy happens when the string data is
  dumped into an arena for storage
- We're parsing the file in C++, not some safer
  language

We can avoid all this if we just embed all the data in the program directly.  This makes editing harder, but that's a rare operation.  It removes the profile-level extensibility, but this is security-conscious code -- I don't see a good reason for extensibility.  It might make updating the list harder -- but I doubt we'd ever update the list any way other than with .mars and AUS (except as a workaround for a publicly disclosed security hole, and then only for a week or so until we fixed it), and presumably bsdiff performs such insertions reasonably efficiently.

This first pass moves all the data into a huge static array of struct { const char domain[]; PRPackedBool ...; }.  There are better schemes that can avoid hashing the entire array at startup.  We can do those after this first step (and probably in another bug), which I predict will get us much of the possible win while avoiding overcomplexity.
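For illustration, the preprocessing step could be sketched like this in Python. The struct layout, flag set, and function names below are hypothetical, not the actual prepare_tlds.py output:

```python
# Hypothetical sketch of preprocessing effective-TLD data into a C++
# static array; the real prepare_tlds.py differs in names and details.
def make_entries(lines):
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip comments and blank lines
        exception = line.startswith("!")
        wild = line.startswith("*.")
        if exception:
            domain = line[1:]
        elif wild:
            domain = line[2:]
        else:
            domain = line
        entries.append((domain, exception, wild))
    return entries

def emit_table(entries):
    out = ["static const ETLDEntry gEntries[] = {"]
    for domain, exception, wild in entries:
        out.append('  { "%s", %d, %d },' % (domain, exception, wild))
    out.append("  { nsnull, 0, 0 }")  # sentinel terminator
    out.append("};")
    return "\n".join(out)
```

The C++ side then only has to walk the generated array at startup and insert each entry into the hash, with no runtime file I/O or parsing.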

Bonus points: this is a net reduction in lines of code, even with a new license header!
Attachment #299414 - Flags: review?(dwitte)
Keywords: perf
Making sure the etld data file isn't included in installers and such will be a followup patch.
nice work!

i have some reservations here, but hopefully we can work through them:

1) codesize. by doing this you're effectively increasing codesize for all gecko embedders/clients by about 30k, without any kind of flag to chop it down. (for instance, currently minimo could ship without the file if it wanted.)

2) client editability. people who want to add entries to or remove entries from the file can presently do so - i imagine the primary use of this would be to debug things. (if there's a legitimate mistake in the file, it should get reported to us so we can correct it.) removing entries i'm therefore not too worried about, but adding entries might be useful. i don't know enough about this use case to effectively argue for it, so i'll do what all good reviewers do and put the onus on you - can you skim through the original etld bug and see if any arguments were mentioned for having this in-profile extensibility?

3) i'd like to see some quick-n-dirty numbers on how much the entire file read actually costs. can you sprinkle some PR_Now()'s around and give numbers - both for the original file read, and your just-hash-the-static-data approach?
(In reply to comment #2)
> 1) codesize. by doing this you're effectively increasing codesize for all
> gecko embeddors/clients by about 30k, without any kind of flag to chop it
> down. (for instance, currently minimo could ship without the file if it
> wanted.)

I don't really buy this, as this is a security mechanism.  Second, if they care that much, they can replace the file at build time.

> 2) client editability. people who want to add or remove entries to the file can
> presently do so - i imagine the primary use of this would be to debug things.
> (if there's a legitimate mistake in the file, it should get reported to us so
> we can correct it.) removing entries i'm therefore not too worried about, but
> adding entries might be useful. i don't know enough about this use case to
> effectively argue for it, so i'll do what all good reviewers do and put the
> onus on you - can you skim through the original etld bug and see if any
> arguments were mentioned for having this in-profile extensibility?

Looking...

> 3) i'd like to see some quick-n-dirty numbers on how much the entire file read
> actually costs. can you sprinkle some PR_Now()'s around and give numbers - both
> for the original file read, and your just-hash-the-static-data approach?

Time to call LoadEffectiveTLDFiles() looks to be around 6-7ms according to PR_IntervalNow(), and time to run the loop in the posted patch is consistently 1ms (with 1ms resolution in both cases).
As far as I can tell from bug 331510 and from the m.d.platform thread, no real use cases were provided -- it was just provided as a way for local extensibility if people wanted it.  I don't see anyone saying they'd actually *use* it -- it was a hypothetical.  Also, part of the separate-file thing seems to have been a holdover from the time before there was agreement that updates would be served using the update system -- and as roc noted in bug 331510 comment 21, there isn't really a compelling need to update the list any more often than security updates happen.

Also, for cold starts the time pre-patch is 27ms, while the time post-patch is 6ms.  Possibly excepting after-crash starts, the file isn't likely to be cached by the OS, so we're really looking at 20ms for real-world use.

...oh, blah.  These numbers are all debug numbers, so they're larger than they'd be in release builds.  Regardless, the proportionate speedup should still hold.  If absolutely necessary I could check the numbers in non-debug builds, but I don't think it's a great use of my time to do so, to be honest.
Flags: blocking1.9?
Comment on attachment 299414 [details] [diff] [review]
Patch

>Index: netwerk/dns/src/nsEffectiveTLDService.cpp
>===================================================================
> nsresult
> nsEffectiveTLDService::Init()
> {
>   if (!mHash.Init())
>     return NS_ERROR_OUT_OF_MEMORY;

you should use |mHash.Init(NS_ARRAY_LENGTH(gEntries) - 1)| here to presize the hash.

>+  // Initialize eTLD hash from static array
>+  for (PRUint32 i = 0; i < NS_ARRAY_LENGTH(gEntries) - 1; i++) {      
>+    nsDomainEntry *entry = mHash.PutEntry(gEntries[i].domain);
>+    NS_ENSURE_TRUE(entry, NS_ERROR_FAILURE);

NS_ERROR_OUT_OF_MEMORY here instead.

>Index: netwerk/dns/src/nsEffectiveTLDService.h
>===================================================================
>   typedef const char* KeyType;
>   typedef const char* KeyTypePointer;
> 
>-  nsDomainEntry(const char* aDomain);
>+  nsDomainEntry(const char* entry)

if you |typedef const ETLDEntry& KeyType| and |typedef const ETLDEntry* KeyTypePointer|, you can do away with SetData() and things should look a bit cleaner! (also, not your fault, but that |const char*| type in the ctor should really be |KeyTypePointer|). and, s/entry/aEntry/, per style.

>+  PRBool IsNormal() { return mData.normal; }
>+  PRBool IsException() { return mData.exception; }
>+  PRBool IsWild() { return mData.wild; }

leave the return type as PRPackedBool here, unless you're sure the compiler won't generate upconversion code that it can't optimize away?

> private:
>-  const char   *mDomain;
>-  PRPackedBool  mIsNormal;
>-  PRPackedBool  mIsException;
>-  PRPackedBool  mIsWild;
>+  ETLDEntry mData;

could make this |const ETLDEntry& mData|, and save about 8k of memory by not duplicating data (at the cost of an extra deref - though there's a good chance that second one will be cache-local, since it's in the static section). thoughts?

>-  virtual ~nsEffectiveTLDService();
>+  virtual ~nsEffectiveTLDService() { }

i'm not a fan of inline-looking virtual dtors - leave this part the way it was?

>Index: netwerk/dns/src/prepare_tlds.py
>===================================================================
>+def _normalizeHostname(domain):
>+  if _isASCII(domain):
>+    return domain.lower()
>+  return ".".join(map(encodings.idna.ToASCII, domain.split(".")))

for a domain like "herøy.møre-og-romsdal.NO" this will convert the UTF8 sections, but it won't lowercase the ASCII sections. need to handle those too.
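A normalizer that handles mixed-case, mixed-script names like this one might look as follows. This is a hedged sketch (not the patch's code): it ACE-encodes non-ASCII names label by label and then lowercases the result, so ASCII labels like ".NO" in a mixed name are covered too:

```python
import encodings.idna

def normalize_hostname(domain):
    """Lowercase pure-ASCII names directly; otherwise ACE-encode
    each label, then lowercase the whole result so ASCII labels
    in a mixed name are handled (hypothetical sketch)."""
    if all(ord(c) < 128 for c in domain):
        return domain.lower()
    ace = ".".join(encodings.idna.ToASCII(label).decode("ascii")
                   for label in domain.split("."))
    # ToASCII passes pure-ASCII labels through unchanged, so a
    # final lowercase pass is still needed for labels like "NO".
    return ace.lower()
```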

>+  for etld in etldSet:
>+    normal = boolStr(etld.normal())
>+    exception = boolStr(etld.exception())
>+    wild = boolStr(etld.wild())

multiple entries for the same domain are legal (e.g. "fr.fr" and "*.fr.fr"), and the C++ handles these by ||'ing up the flags:
http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/nsEffectiveTLDService.cpp#334

you should do that here too - just a pre-loop to collapse entries?
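The pre-loop collapse suggested here might look something like this (a hypothetical sketch with illustrative flag names; a later comment settles on disallowing duplicates in the data file instead):

```python
def collapse_duplicates(entries):
    """OR together the flags of duplicate domains, preserving
    first-seen order; entries are (domain, exception, wild)."""
    flags = {}
    order = []
    for domain, exception, wild in entries:
        if domain in flags:
            e, w = flags[domain]
            flags[domain] = (e or exception, w or wild)
        else:
            flags[domain] = (exception, wild)
            order.append(domain)
    return [(d,) + flags[d] for d in order]
```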

>+  print "  { NULL, 0 }"

nsnull?

great stuff - ftr, my own measurements (opt build) showed 3200us for file read-in (without patch - warm start), 420us for hash construction with patch, and 170us with the hash presizing fix. definitely want this :)

as you point out, could also write a unit test to compare this generated file with what nsIIDNService says is the normalised, ACE-encoded form for the etld file entries - ugly stuff though, won't hold you to that for getting this patch in.
Attachment #299414 - Flags: review?(dwitte) → review-
Just making sure I'm following this...the end result of this patch/bug is that effective_tld_names.dat will no longer be a packagable file (and won't end up in dist/foo anymore), correct?
(In reply to comment #5)
> if you |typedef const ETLDEntry& KeyType| and |typedef const ETLDEntry*
> KeyTypePointer|, you can do away with SetData() and things should look a bit
> cleaner! (also, not your fault, but that |const char*| type in the ctor
> should really be |KeyTypePointer|). and, s/entry/aEntry/, per style.

I don't think this works, because it's GetEntry(KeyType) and PutEntry(KeyType), and when we want to call GetEntry we only have the domain.

> > private:
> >+  ETLDEntry mData;
> 
> could make this |const ETLDEntry& mData|, and save about 8k of memory by not
> duplicating data (at the cost of an extra deref - though there's a good
> chance that second one will be cache-local, since it's in the static
> section). thoughts?

Yeah, probably a good idea.  Next step after this will be to gperf (probably offline, sadly, since adding to the tree or making it a build requirement's kinda dubious) and pre-hash completely to eliminate a lot of these problems, but one step at a time.

> >Index: netwerk/dns/src/prepare_tlds.py
> >===================================================================
> >+def _normalizeHostname(domain):
> >+  if _isASCII(domain):
> >+    return domain.lower()
> >+  return ".".join(map(encodings.idna.ToASCII, domain.split(".")))
> 
> for a domain like "herøy.møre-og-romsdal.NO" this will convert the UTF8
> sections, but it won't lowercase the ASCII sections. need to handle those too.

I was just copying the implementation -- but the implementation uses ConvertUTF8toACE, which doesn't lowercase in this case either -- or at least it didn't prior to this version of the patch.

> as you point out, could also write a unit test to compare this generated
> file with what nsIIDNService says is the normalised, ACE-encoded form for
> the etld file entries - ugly stuff though, won't hold you to that for
> getting this patch in.

I actually think this is happening; more in the patch...
(In reply to comment #6)
> Just making sure I'm following this...the end result of this patch/bug is that
> effective_tld_names.dat will no longer be a packagable file (and won't end up
> in dist/foo anymore), correct?

Yes.
Attached patch Take two (obsolete) — Splinter Review
As mentioned, I don't think making the key an ETLDEntry can work given GetEntry; also, mData is a pointer because a reference would have to be initialized at construction time.
Attachment #299414 - Attachment is obsolete: true
Attachment #299718 - Flags: review?(dwitte)
Attached patch Take three (obsolete) — Splinter Review
We managed to agree that having more than one of a.b.c, *.a.b.c, and !a.b.c makes no sense, so we can simplify this a bit and return to in-order output.  This also picks up a change to make a dtor non-virtual and to remove the now-clearly-redundant normal flag.  I'm still not sure normalness need even exist as a concept, but one step of reasoning at a time.  :-)

This step also demonstrated that our data file had a duplicate entry in it, which I've since removed.  :-)
Attachment #299718 - Attachment is obsolete: true
Attachment #299726 - Flags: review?(dwitte)
Attachment #299718 - Flags: review?(dwitte)
Comment on attachment 299726 [details] [diff] [review]
Take three

>Index: netwerk/dns/src/nsEffectiveTLDService.cpp
>===================================================================
> nsresult
> nsEffectiveTLDService::NormalizeHostname(nsCString &aHostname)
> {

could combine paths a bit here:

if (!IsASCII)
  ConvertUTF8toACE();
ToLowerCase();

r=me... the gperf stuff sounds pretty neat (i wonder how many other large, static tables we have in the tree? - i know we have a medium-size one in windows gfx font code - it's a linear list even - ew!), and it'd be nice to figure out if |static const char*| alignment is 4 bytes on all our platforms. ship it!
Attachment #299726 - Flags: review?(dwitte) → review+
Jeff, here's a patch that will keep Camino from burning when you kill effective_tld_names.dat as a build product.  If the patch doesn't apply when time comes to land your patch, you can poke me (or smorgan or mento) in #camino and I can spin a new one.  Thanks!
Attached patch With non-duplicated ToLowerCase (obsolete) — Splinter Review
Attachment #299726 - Attachment is obsolete: true
Attachment #299801 - Flags: superreview?(benjamin)
Attachment #299801 - Flags: review+
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
Target Milestone: mozilla1.9beta3 → mozilla1.9beta4
Comment on attachment 299801 [details] [diff] [review]
With non-duplicated ToLowerCase

>Index: netwerk/dns/src/nsEffectiveTLDService.cpp

> nsresult
> nsEffectiveTLDService::Init()

>+  // Initialize eTLD hash from static array
>+  for (PRUint32 i = 0; i < NS_ARRAY_LENGTH(gEntries) - 1; i++) {      
>+    nsDomainEntry *entry = mHash.PutEntry(gEntries[i].domain);
>+    NS_ENSURE_TRUE(entry, NS_ERROR_OUT_OF_MEMORY);
>+    entry->SetData(&gEntries[i]);

As mentioned on IRC, I'd like a debug assert that the entry data is already properly normalized, just so the python normalizer and the C++ normalizer don't accidentally get out of synch.

>Index: netwerk/dns/src/prepare_tlds.py

>+class EffectiveTLDSet:

This class makes me unhappy: is there a reason you couldn't write a single generator function instead of an object? Something like:

def getEffectiveTLDs(path):
  file = codecs.open(path, "r", "UTF-8")
  domains = set()
  for line in file:
    line = line.rstrip()
    if line.startswith('//') or "." not in line:
      continue

    ... etc

    entry = EffectiveTLDEntry(line)

    domain = entry.domain
    if domain in domains:
      raise Exception(...)
    domains.add(domain)

    yield entry

sr=me with these two changes

Does this patch need moa= also? It's not clear which module it's actually in.
Attachment #299801 - Flags: superreview?(benjamin) → superreview+
Attachment #299801 - Attachment is obsolete: true
Attachment #302852 - Flags: superreview+
Attachment #302852 - Flags: review+
moa=biesi. Given that this only touches netwerk/ what module would it be if not necko?
Comment on attachment 302852 [details] [diff] [review]
With assert, generator function

Cuts down on file I/O on every startup for what's basically constant data...
Attachment #302852 - Flags: approval1.9?
Comment on attachment 302852 [details] [diff] [review]
With assert, generator function

This bug is blocking - no approval needed.  Land away
Attachment #302852 - Flags: approval1.9?
Comment on attachment 299794 [details] [diff] [review]
Patch to remove effective_tld_names.dat from Camino's packaging

I checked in an updated version of this patch to keep Camino from burning.
So I went to create the don't-package patch here and found Camino had already picked it up for a bustage fix.  I don't have any idea why they would have broken; this change should have just been some temporary extra bloat.  Anyway, this picks up the rest of the manifests that include the etld data file.
I went ahead and committed the packaging fix; it's not worth someone's time to review this basically-mechanical removal of lines.  And with that, fixed!  I'll probably file a followup bug to do some sort of gperf thing to move the hashing into a compile-time step.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Don't you have to add this file to removed-files.in?
"have to", probably not.  Certainly wouldn't hurt to add it, tho...I'll cook up a patch tomorrow probably.
+1 for adding to removed-files.in, it keeps Build's update-verification for releases nice and squeaky clean.
Attachment #303491 - Flags: review?(jwalden+bmo)
Comment on attachment 303491 [details] [diff] [review]
Add to removed-files.in

>Index: mail/installer/removed-files.in

So, um, why this change?  I can't find evidence that mail/ (like calendar/, which you similarly omitted) ever shipped with this file.  Other than that looks exactly like what I'd done locally.
Comment on attachment 303491 [details] [diff] [review]
Add to removed-files.in

Not sure if this is even relevant any more, but in any case I said my part in the last comment...
Attachment #303491 - Flags: review?(jwalden+bmo)