Closed Bug 607115 Opened 9 years ago Closed 9 years ago

use a much smaller guid format than we currently use for bookmarks

Categories

(Toolkit :: Places, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla2.0b9
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: mconnor, Assigned: sdwilsh)

References

Details

(Whiteboard: [fixed-in-places])

Attachments

(3 files, 2 obsolete files)

Sync generates smaller GUIDs for history than the UUID + sequence number that bookmarks currently uses.  It's 10 characters from a base70 charset that is URL and command-line safe.  Collision probability is 1% after 100 million records.   See bug 518146 for details.

I'm proposing that we take Utils.makeGUID from Sync and make it an API (nsIProbablyUniqueID anyone?), which we can then for bookmarks and history.

Key benefits:
* Smaller places DBs (especially in conjunction with bug 607112)
* Less overhead to generate.
* Sync can stick more of these in a URL for partial sync on mobile.
(In reply to comment #0)
> I'm proposing that we take Utils.makeGUID from Sync and make it an API
> (nsIProbablyUniqueID anyone?), which we can then for bookmarks and history.
Prefer to see this as a SQL function in SQLFunctions.cpp, but yes.
I guess if we're assuming no one outside of sqlite-land will use these, that's fine.  I was thinking that we'd make Utils.makeGUID use the built in impl, and also make passwords/forms/other stuff use this method so we don't have to insert our own guids anywhere.
(In reply to comment #2)
> I guess if we're assuming no one outside of sqlite-land will use these, that's
> fine.  I was thinking that we'd make Utils.makeGUID use the built in impl, and
> also make passwords/forms/other stuff use this method so we don't have to
> insert our own guids anywhere.
Well, at least for places, it probably makes sense to generate the guid when we insert it.  It's possible we could also just add it as a generic function for all storage consumers too.  However, it's also possible I'm over-engineering this without thinking about it much yet ;)
blocking2.0: --- → betaN+
Depends on: 518146
I am proposing that we go with this algorithm (which is different from what Mardak did in bug 518146, but has the benefit of being driven by PRNG:

Use nsIRandomGenerator::generateRandomBytes, and base64 encode that.  Base64 encoding gives us 60 characters (a-z, A-Z, 0-9, /, +), with the last two being not so good for urls.  We can replace them with ! and * respectively (or one of the other characters we do not use from the current implementation).  The current encoding allows for 70^10 possibilities (2.82 x 10^18).  If we keep the length of only ten characters, we allow for 64^10 possibilities (1.15 x 10^18), which is a bit less than half, which I believe doubles the likelihood of collision.  Adding one additional character nets us 64^11 possibilities (73.8 x 10^18) which is 25 times the number of possibilities with the current implementation in Sync.  I'm a bit leery at implementing something that increases the chance of collisions, so my question is this: is the sync team OK with an 11 character guid?
(In reply to comment #4)
> Use nsIRandomGenerator::generateRandomBytes, and base64 encode that.  Base64
> encoding gives us 60 characters (a-z, A-Z, 0-9, /, +),

(That's 64 characters, not 60.)

> with the last two being
> not so good for urls.  We can replace them with ! and * respectively (or one of
> the other characters we do not use from the current implementation).

I suggest using the "base64url" encoding as described in http://tools.ietf.org/html/rfc4648#section-5, basically using - and _ instead of / and +.

> The current encoding allows for 70^10 possibilities (2.82 x 10^18). If we keep the
> length of only ten characters, we allow for 64^10 possibilities (1.15 x 10^18),
> which is a bit less than half, which I believe doubles the likelihood of
> collision.  Adding one additional character nets us 64^11 possibilities (73.8 x
> 10^18) which is 25 times the number of possibilities with the current
> implementation in Sync.  I'm a bit leery at implementing something that
> increases the chance of collisions, so my question is this: is the sync team OK
> with an 11 character guid?

Let's make it 12 characters base64 (because base64 pads to multiples of 4 characters anyway). That's 9 random bytes as input to base64-encode:

  guid = btoa(prng.generateRandomBytes(9))
First, we'll need to make nsIRandomGenerator threadsafe so we can use it both on the main thread and the background thread.
Assignee: nobody → sdwilsh
Status: NEW → ASSIGNED
Attachment #492380 - Flags: review?(kaie)
Nod. If we go with base64, better to go with multiples of 4 characters, so 12 is better than 8. Although we could always just truncate to 11 characters, etc.
Status: ASSIGNED → NEW
(In reply to comment #5)
> (That's 64 characters, not 60.)
Uh, yeah.  Luckily everything after that had the right number.

> I suggest using the "base64url" encoding as described in
> http://tools.ietf.org/html/rfc4648#section-5, basically using - and _ instead
> of / and +.
Yeah, that's simple enough to tweak.

> Let's make it 12 characters base64 (because base64 pads to multiples of 4
> characters anyway). That's 9 random bytes as input to base64-encode:
> 
>   guid = btoa(prng.generateRandomBytes(9))
OK, but this is being implemented in c++ land so it's a bit more complicated ;)
Status: NEW → ASSIGNED
Comment on attachment 492380 [details] [diff] [review]
Part 1 - mark and make nsIRandomGenerator threadsafe

In my understanding PK11_GenerateRandom should be fully threadsafe, it should be fine to call it from any thread.

r=kaie
Attachment #492380 - Flags: review?(kaie) → review+
Attached patch Part 2 - helper method (obsolete) — Splinter Review
Attachment #492442 - Flags: review?(mak77)
Attached patch Part 3 - SQL method with tests (obsolete) — Splinter Review
This includes tests for part 3 and part 2.  Brute force it by making guids until we've seen every single expected character, and the whole time make sure we never see an illegal one.

Also tests that nothing blows up on the background thread.
Attachment #492444 - Flags: review?(mak77)
Parts 1 - 3 are all we need in this bug.  The work to actually use this stuff will be in bug 607117.
Blocks: 607117
No longer blocks: 607107
Whiteboard: [needs review mak]
Comment on attachment 492442 [details] [diff] [review]
Part 2 - helper method

>diff --git a/toolkit/components/places/src/Helpers.cpp b/toolkit/components/places/src/Helpers.cpp

>+nsresult
>+GenerateGUID(nsCString& _guid)
>+{
>+  _guid.Truncate();
>+
>+  // Request raw random bytes and base64 encode it.  For each set of three
>+  // bytes, we get one character.

nit: "encode them"

>+  const PRUint32 kGUIDLength = 12;
>+  const PRUint32 kRequiredBytesLength =
>+    static_cast<PRUint32>((kGUIDLength + 1) / 4 * 3);

nit: add parentheses to make operator precedence explicit.
What is the +1 saving? comment plz

>+  PRUint8* buffer;
>+  nsresult rv = rg->GenerateRandomBytes(kRequiredBytesLength, &buffer);
>+  NS_ENSURE_SUCCESS(rv, rv);

I don't see where you NS_Free(buffer), looks like being leaked
As an alternative, you could make it being adopted by a nsCAutoString, and const_cast<char*>(string.get()) later.

>+  // SetLength doesn't set aside space for NULL termination, but PL_Base64Encode
>+  // will NULL terminate it's string.

nit: s/it's/its, but better "PL_Base64Encode generates a NULL terminated string."

>+  NS_ENSURE_TRUE(_guid.SetCapacity(kGUIDLength + 1), NS_ERROR_OUT_OF_MEMORY);
>+  _guid.SetLength(kGUIDLength);
>+  (void)PL_Base64Encode(reinterpret_cast<const char*>(buffer),
>+                        kRequiredBytesLength, _guid.BeginWriting());

Actually, the documentation for base64encode says that the buffer is not null terminated if the third agument is not null.
While it's null terminated if you use the direct return value and pass null as the third argument.
http://mxr.mozilla.org/mozilla-central/source/nsprpub/lib/libc/src/base64.c#137
137  * If the destination argument is not NULL, it is assumed to
138  * be of sufficient size, and the contents will not be null-
139  * terminated by this routine.
Attachment #492442 - Flags: review?(mak77) → review-
Comment on attachment 492444 [details] [diff] [review]
Part 3 - SQL method with tests

>diff --git a/toolkit/components/places/src/SQLFunctions.cpp b/toolkit/components/places/src/SQLFunctions.cpp
>   MatchAutoCompleteFunction::create(mozIStorageConnection *aDBConn)
>   {
>     nsRefPtr<MatchAutoCompleteFunction> function(new MatchAutoCompleteFunction);
>-    NS_ENSURE_TRUE(function, NS_ERROR_OUT_OF_MEMORY);

nit: Could you make "function = new" per our code style?

>+////////////////////////////////////////////////////////////////////////////////
>+//// GUID Creation Function
>+
>+  //////////////////////////////////////////////////////////////////////////////
>+  //// GenerateGUIDFunction

nit: I find having both a bit redundant, but it's ok if you think it's worth it.

>+  /* static */

nit: I think we usually use // static

>diff --git a/toolkit/components/places/tests/unit/test_sql_guid_function.js b/toolkit/components/places/tests/unit/test_sql_guid_function.js

>+++ b/toolkit/components/places/tests/unit/test_sql_guid_function.js
>@@ -0,0 +1,125 @@
>+/* Any copyright is dedicated to the Public Domain.
>+   http://creativecommons.org/publicdomain/zero/1.0/ */

nit and OT: We should ask gerv to figure out the common style for these pd declarations, some file puts them between BEGIN/END LICENSE BLOCK, some file uses a javadoc init, some prefixes with *...

>+/**
>+ * Checks all our invariants about our guids for a given result.
>+ *
>+ * @param aGuid
>+ *        The guid to check.
>+ */
>+function check_invariants(aGuid)
>+{
>+  print("TEST-INFO | " + tests[index - 1].name + " | Checking guid '" +
>+        aGuid + "'");
>+
>+  const kGUIDLength = 12;
>+  do_check_eq(aGuid.length, kGUIDLength);
>+  do_check_true(/[a-zA-Z0-9\-_]{12}/.test(aGuid));

you could avoid the length check and do /^[a-zA-Z0-9\-_]{12}$/ so the regex will also do length check
the output should not be worse since you print the guid before

>+function test_guid_on_background()
>+{
>+  // We should not assert if we execute this asynchronously.
>+  let stmt = DBConn().createAsyncStatement("SELECT GENERATE_GUID()");
>+  let checked = false;
>+  stmt.executeAsync({

no handleError?

>+let tests = [
>+  test_guid_invariants,
>+  test_guid_on_background,
>+];
>+let index = 0;

nit: using tests.shift() is usually more compact than using a test index

mostly nits, nothing major.
Attachment #492444 - Flags: review?(mak77) → review+
(In reply to comment #14)
> nit and OT: We should ask gerv to figure out the common style for these pd
> declarations, some file puts them between BEGIN/END LICENSE BLOCK, some file
> uses a javadoc init, some prefixes with *...
I did this a while ago, and this is why we have boilerplate for it now:
http://www.mozilla.org/MPL/boilerplate-1.1/
ah cool, I don't like it with that trailing */ but my personal taste is not part of a discussion :) thanks for notifying me of the boilerplate.
Whiteboard: [needs review mak]
(In reply to comment #13)
> >+  const PRUint32 kGUIDLength = 12;
> >+  const PRUint32 kRequiredBytesLength =
> >+    static_cast<PRUint32>((kGUIDLength + 1) / 4 * 3);
> 
> nit: add parentheses to make operator precedence explicit.
operator precedence doesn't matter for / and *.  Those operations are commutative, so their order doesn't matter (however, the addition is wrapped because it does matter).

> What is the +1 saving? comment plz
It's due to integer math, but we don't need to do this as long as we are a multiple of 4.  I have an assert at the end that makes sure the guid we return is of the right length, so we should be OK.

> I don't see where you NS_Free(buffer), looks like being leaked
> As an alternative, you could make it being adopted by a nsCAutoString, and
> const_cast<char*>(string.get()) later.
Naw, I'll just free it right after I'm done using it.

> Actually, the documentation for base64encode says that the buffer is not null
> terminated if the third agument is not null.
> While it's null terminated if you use the direct return value and pass null as
> the third argument.
nsCStrings need to be NULL terminated though.  I've fixed up the wording to better reflect reality.
I refactored this a bit because of how I'm doing things in bug 607112.  These changes will make that bug's patch smaller, which I think is a good thing.

This should address the review comments.
Attachment #492442 - Attachment is obsolete: true
Attachment #492673 - Flags: review?(mak77)
(In reply to comment #14)
> nit: I find having both a bit redundant, but it's ok if you think it's worth
> it.
Agreed, but it's how this file is.  I can change it all over if you'd like, but that means re-indenting the world.

> nit: I think we usually use // static
not in this file though

> you could avoid the length check and do /^[a-zA-Z0-9\-_]{12}$/ so the regex
> will also do length check
> the output should not be worse since you print the guid before
Our regex isn't really any different ;)
I kept the length check separate just because I wanted the failure to be very clear.  I don't think having the separate check hurts us, so I'm going to leave it in.

> no handleError?
handleCompletion checks the reason, so we don't need it

> nit: using tests.shift() is usually more compact than using a test index
While this is true, it makes it less clear when I use the functions name for logging output.  Using the index makes all of those call sites clearer, IMO.
For checkin.
Attachment #492444 - Attachment is obsolete: true
Attachment #492673 - Flags: review?(mak77) → review+
http://hg.mozilla.org/mozilla-central/rev/952cedf2302e
http://hg.mozilla.org/mozilla-central/rev/510ee4557bb9
http://hg.mozilla.org/mozilla-central/rev/4ff3fc8162d8
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0b9
Duplicate of this bug: 419739
You need to log in before you can comment on or make changes to this bug.