Closed Bug 1362449 Opened 7 years ago Closed 7 years ago

make Base64{Encode,Decode} on nsString not require an ASCII-fication

Categories

(Core :: XPCOM, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: froydnj, Assigned: froydnj)

References

Details

Attachments

(14 files, 1 obsolete file)

7.98 KB, patch
erahm
: review+
Details | Diff | Splinter Review
2.13 KB, patch
erahm
: review+
Details | Diff | Splinter Review
1.28 KB, patch
erahm
: review+
Details | Diff | Splinter Review
2.77 KB, patch
erahm
: review+
Details | Diff | Splinter Review
2.20 KB, patch
erahm
: review+
Details | Diff | Splinter Review
1.65 KB, patch
erahm
: review+
Details | Diff | Splinter Review
3.41 KB, patch
erahm
: review+
Details | Diff | Splinter Review
2.21 KB, patch
erahm
: review+
Details | Diff | Splinter Review
2.40 KB, patch
erahm
: review+
Details | Diff | Splinter Review
2.47 KB, patch
erahm
: review+
Details | Diff | Splinter Review
2.12 KB, patch
erahm
: review+
Details | Diff | Splinter Review
1.78 KB, patch
erahm
: review+
Details | Diff | Splinter Review
1.49 KB, patch
erahm
: review+
Details | Diff | Splinter Review
3.02 KB, patch
erahm
: review+
Details | Diff | Splinter Review
Base64{Encode,Decode}, when given an nsString, first convert the string to ASCII, then do the requested operation, then convert back to nsString.  We should skip the conversion steps, and just do the encoding directly on char16_t strings.  Crash reports indicate that people do pass large strings through these functions sometimes, and trying to reduce the memory footprint here would be helpful.
I have patches for encoding.  Patches for decoding are harder, but very doable.
Assignee: nobody → nfroyd
Priority: -- → P3
We already had testcases for streaming encoding, but we didn't have any
for one-shot encoding and decoding.  Let's fix that.
Attachment #8903713 - Flags: review?(erahm)
We'll need this to implement direct encoding from nsString, and the
changes are trivial.
Attachment #8903714 - Flags: review?(erahm)
We need to explicitly reduce the incoming character to 8 bits, rather
than assuming that a conversion to uint32_t will do the right thing.
Attachment #8903715 - Flags: review?(erahm)
One less use of NSPR is a good thing.  The failure cases that
PL_Base64Encode would have caught for us are:

1. "Truncation".
2. Integer overflow when computing destination string length.
3. Malloc failures.

The first one only gets checked if we pass in zero for the source
length, which we never do.  The latter two only get checked if we pass
in a null pointer for the destination, which we never do.  So removing
the error handling PL_Base64Encode implies here is a good thing.
Attachment #8903716 - Flags: review?(erahm)
The nsACString -> nsACString encode routine has several checks in it for
correct operation, and the nsAString -> nsAString encode routine relies
on those checks happening via the nsACString -> nsACString routine.
Once we start encoding nsAStrings directly, we'll still need those
checks, and the easiest way to ensure they happen is to move the core
base64 encode logic for strings into a templated helper.
Attachment #8903717 - Flags: review?(erahm)
After all the previous work, we can base64 encoding nsString values
directly into nsString values, without having to go through intermediate
nsCString values.  Since this routine backs base64 routines exposed to
the web, this change should help with OOMs that we see associated with
base64 encoding.
Attachment #8903718 - Flags: review?(erahm)
These tables are nearly identical to the ones for base64url decoding,
but ideally will be slightly more readable, since things are broken up
into sets of eight entries at a time.
Attachment #8903719 - Flags: review?(erahm)
The decoding logic is the same for Base64 and Base64URL; we might as
well reuse the routines that we already have for Base64URL decoding so
we don't make mistakes in the logic.
Attachment #8903720 - Flags: review?(erahm)
The existing Base64URL code converts from `const char` to `uint8_t`.
We're going to want versions that convert from character types to
character types, so make the decode routines accept generic input and
output types.
Attachment #8903721 - Flags: review?(erahm)
Our base64 decoding routines are a little nicer, and using them is a
necessary step to templating all the necessary routines.
Attachment #8903722 - Flags: review?(erahm)
Similar to the work we did for encoding, we're going to want this to be
type-generic.
Attachment #8903723 - Flags: review?(erahm)
The current nsString decoding routine indirectly relies on the various
checks this routine performs; making it generic over string types
ensures that we can eventually call it directly from the nsString
decoding routine.
Attachment #8903724 - Flags: review?(erahm)
The result of decoding needs to be an explicit 8-bit type before being
widened to the output type to avoid undesirable effects like sign
extension.
Attachment #8903725 - Flags: review?(erahm)
After all the previous work, we can now base64 decode nsString types without
intermediate conversion steps to nsCString, which is faster and more
memory-efficient.
Attachment #8903726 - Flags: review?(erahm)
Comment on attachment 8903713 [details] [diff] [review]
part 0 - add test cases for base64 string encode/decode

Review of attachment 8903713 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, thanks for adding this.
Attachment #8903713 - Flags: review?(erahm) → review+
Comment on attachment 8903714 [details] [diff] [review]
part 1 - make Base64 encode generic over source character types

Review of attachment 8903714 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/io/Base64.cpp
@@ +45,2 @@
>  static void
> +Encode2to4(const SrcT* aSrc, DestT* aDest)

Do we still want to constrain SrcT to be an unsigned type? I'm thinking mainly in regards to the shifts but maybe it doesn't matter.
Attachment #8903714 - Flags: review?(erahm) → review+
Comment on attachment 8903715 [details] [diff] [review]
part 1a - fix Encode3to4 for non-8-bit character types

Review of attachment 8903715 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/io/Base64.cpp
@@ +23,5 @@
>                    "0123456789+/";
>  
> +// The Base64 encoder assumes all characters are less than 256; for 16-bit
> +// strings, that means assuming that all characters are within range, and
> +// masking off high bits if necessary.

After discussing with you this explanation makes sense, but I think we need to add a comment to the public interface saying we ignore the high order byte in nsStrings (if it's not already there). This is pretty unexpected.
Attachment #8903715 - Flags: review?(erahm) → review+
Comment on attachment 8903716 [details] [diff] [review]
part 2 - use our Base64 encode implementation rather than NSPR's

Review of attachment 8903716 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/io/Base64.cpp
@@ +310,5 @@
>    *aBase64 = nullptr;
>    uint32_t base64Len = ((aBinaryLen + 2) / 3) * 4;
>  
>    // Add one byte for null termination.
>    UniqueFreePtr<char[]> base64((char*)malloc(base64Len + 1));

Huh, I think that type is wrong. Should just be `char` right?

Looks like we can just get rid of the UniqueFreePtr now and malloc directly into `aBase64`. I don't feel too strongly about that.

@@ +338,5 @@
>  
>    uint32_t base64Len = ((aBinary.Length() + 2) / 3) * 4;
>  
>    // Add one byte for null termination.
>    if (!aBase64.SetCapacity(base64Len + 1, fallible)) {

While we're in here we could get rid of the +1, our string classes do that stuff for us.
Attachment #8903716 - Flags: review?(erahm) → review+
Comment on attachment 8903717 [details] [diff] [review]
part 3 - templatify core nsCString base64 encode routine

Review of attachment 8903717 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/io/Base64.cpp
@@ +343,5 @@
>    if (!aBase64.SetCapacity(base64Len + 1, fallible)) {
>      return NS_ERROR_OUT_OF_MEMORY;
>    }
>  
> +  typename T::char_type* base64 = aBase64.BeginWriting();

`auto` would be a little less awful here
Attachment #8903717 - Flags: review?(erahm) → review+
Comment on attachment 8903718 [details] [diff] [review]
part 4 - avoid conversions in nsString Base64 encoding

Review of attachment 8903718 [details] [diff] [review]:
-----------------------------------------------------------------

\o/
Attachment #8903718 - Flags: review?(erahm) → review+
Comment on attachment 8903719 [details] [diff] [review]
part 5 - add tables for normal base64 decoding

Review of attachment 8903719 [details] [diff] [review]:
-----------------------------------------------------------------

r- for now, but that's just minor cleanup (that I care enough for you to fix).

::: xpcom/io/Base64.cpp
@@ +238,5 @@
>    return NS_OK;
>  }
>  
> +// Maps an encoded character to a value in the Base64 URL alphabet, per
> +// RFC 4648, Table 2. Invalid input characters map to UINT8_MAX.

So this *isn't* the URL encoding table and it's *not* using RFC 4648 Table 2. A short blurb saying "these are the same but '+' and '/' are used instead of '-' and '_'" or something to that effect.

@@ +257,5 @@
> +
> +  /* 64 */ 255, 0, 1, 2, 3, 4, 5, 6,
> +  /* 72 */ 7, 8, 9, 10, 11, 12, 13, 14,
> +  /* 80 */ 15, 16, 17, 18, 19, 20, 21, 22,
> +  /* 88 */ 23, 24, 25, /* A - Z */ 255, 255, 255, 255, 255,

FWIW I don't find the placement of the comment helpful, it would be me more useful at the start of the sequence but meh.

@@ +266,5 @@
> +  255, 255, 255, 255, 255,
> +};
> +
> +template<typename T>
> +bool

MUST_USE?

@@ +270,5 @@
> +bool
> +Base64CharToValue(T aChar, uint8_t* aValue)
> +{
> +  uint8_t index = static_cast<uint8_t>(aChar);
> +  *aValue = kBase64DecodeTable[index & 0x7f];

Can we add some asserts on the table size and make 0x7f a const with a decent name?

@@ +271,5 @@
> +Base64CharToValue(T aChar, uint8_t* aValue)
> +{
> +  uint8_t index = static_cast<uint8_t>(aChar);
> +  *aValue = kBase64DecodeTable[index & 0x7f];
> +  return (*aValue != 255) && !(index & ~0x7f);

a) Maybe just use UINT8_MAX or std::numeric_limits<uint8_t>::max()?
b) This is pretty weird, why do we do a bounds check after? What's the use case of getting the masked value and also returning false?*

* I get this is copied from the URL version, but I still think it's weird.
Attachment #8903719 - Flags: review?(erahm) → review-
Comment on attachment 8903720 [details] [diff] [review]
part 6 - factor out base64url decoding routines

Review of attachment 8903720 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/io/Base64.cpp
@@ +400,5 @@
>  {
>    return Base64EncodeHelper(aBinary, aBase64);
>  }
>  
> +template<typename F>

Do we need to template here, can we just take a function pointer?  If you want/need to keep it a template can you call 'F' something more obvious (CharacterMapping or something, just not F).
Attachment #8903720 - Flags: review?(erahm) → review+
Attachment #8903721 - Flags: review?(erahm) → review+
Comment on attachment 8903722 [details] [diff] [review]
part 8 - use our base64 decoding routines instead of NSPR's

Review of attachment 8903722 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/io/Base64.cpp
@@ +456,5 @@
> +  uint32_t inputLength = aBase64Len;
> +  char* binary = aBinary;
> +  uint32_t binaryLength = 0;
> +
> +  // Handle trailing '=' characters.

As followup it might be nice to share this logic w/ the URL version.
Attachment #8903722 - Flags: review?(erahm) → review+
Attachment #8903723 - Flags: review?(erahm) → review+
Comment on attachment 8903724 [details] [diff] [review]
part 10 - templatify CString base64 decode routine

Review of attachment 8903724 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/io/Base64.cpp
@@ +564,5 @@
>    if (!aBinary.SetCapacity(binaryLen + 1, fallible)) {
>      return NS_ERROR_OUT_OF_MEMORY;
>    }
>  
> +  typename T::char_type* binary = aBinary.BeginWriting();

Again, this could be `auto`, but meh.
Attachment #8903724 - Flags: review?(erahm) → review+
Attachment #8903725 - Flags: review?(erahm) → review+
Comment on attachment 8903726 [details] [diff] [review]
part 12 - avoid conversions in nsString decoding

Review of attachment 8903726 [details] [diff] [review]:
-----------------------------------------------------------------

\o/
Attachment #8903726 - Flags: review?(erahm) → review+
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #17)
> ::: xpcom/io/Base64.cpp
> @@ +45,2 @@
> >  static void
> > +Encode2to4(const SrcT* aSrc, DestT* aDest)
> 
> Do we still want to constrain SrcT to be an unsigned type? I'm thinking
> mainly in regards to the shifts but maybe it doesn't matter.

I changed part 1a to explicitly truncate to 8 bits in all Encode*to* routines instead.

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #18)
> ::: xpcom/io/Base64.cpp
> @@ +23,5 @@
> >                    "0123456789+/";
> >  
> > +// The Base64 encoder assumes all characters are less than 256; for 16-bit
> > +// strings, that means assuming that all characters are within range, and
> > +// masking off high bits if necessary.
> 
> After discussing with you this explanation makes sense, but I think we need
> to add a comment to the public interface saying we ignore the high order
> byte in nsStrings (if it's not already there). This is pretty unexpected.

Comments added.

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #19)
> ::: xpcom/io/Base64.cpp
> @@ +310,5 @@
> >    *aBase64 = nullptr;
> >    uint32_t base64Len = ((aBinaryLen + 2) / 3) * 4;
> >  
> >    // Add one byte for null termination.
> >    UniqueFreePtr<char[]> base64((char*)malloc(base64Len + 1));
> 
> Huh, I think that type is wrong. Should just be `char` right?

It should indeed be just `char`.

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #22)
> @@ +271,5 @@
> > +Base64CharToValue(T aChar, uint8_t* aValue)
> > +{
> > +  uint8_t index = static_cast<uint8_t>(aChar);
> > +  *aValue = kBase64DecodeTable[index & 0x7f];
> > +  return (*aValue != 255) && !(index & ~0x7f);
> 
> a) Maybe just use UINT8_MAX or std::numeric_limits<uint8_t>::max()?

I don't know if that clarifies or obscures, since we should really be using the same value in the table above.

> b) This is pretty weird, why do we do a bounds check after? What's the use
> case of getting the masked value and also returning false?*
> 
> * I get this is copied from the URL version, but I still think it's weird.

I do not know why it was written that way in the first place.  I don't know if there's a lot of value in changing it to be different just because, but I can change it...

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #23)
> ::: xpcom/io/Base64.cpp
> @@ +400,5 @@
> >  {
> >    return Base64EncodeHelper(aBinary, aBase64);
> >  }
> >  
> > +template<typename F>
> 
> Do we need to template here, can we just take a function pointer?  If you
> want/need to keep it a template can you call 'F' something more obvious
> (CharacterMapping or something, just not F).

I thought with taking the template, we might encourage inlining, and writing templates is nicer than writing function pointer types.

It seems odd to complain about `typename F` here, but not complain about `typename T` and similar.  But I've changed it to `typename Decoder` locally.
(In reply to Nathan Froyd [:froydnj] from comment #27)
> (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> #19)
> > ::: xpcom/io/Base64.cpp
> > @@ +310,5 @@
> > >    *aBase64 = nullptr;
> > >    uint32_t base64Len = ((aBinaryLen + 2) / 3) * 4;
> > >  
> > >    // Add one byte for null termination.
> > >    UniqueFreePtr<char[]> base64((char*)malloc(base64Len + 1));
> > 
> > Huh, I think that type is wrong. Should just be `char` right?
> 
> It should indeed be just `char`.

I take that back, it should be `char[]`.  We're specifying the deletion policy explicitly with `UniqueFreePtr`, so we don't have to worry about allocator/deallocator mismatches.  `char[]` also permits us to use `operator[]` on the `UniqueFreePtr`, rather than having to do something ugly like `base64.get()[...]`.  We're storing what is conceptually an array in the smart pointer, rather than just a pointer to a single dynamically-allocated `char`.
With review comments addressed, I think.
Attachment #8905067 - Flags: review?(erahm)
Attachment #8903719 - Attachment is obsolete: true
Comment on attachment 8905067 [details] [diff] [review]
part 5 - add tables for normal base64 decoding

Review of attachment 8905067 [details] [diff] [review]:
-----------------------------------------------------------------

r=me

::: xpcom/io/Base64.cpp
@@ +280,5 @@
> +  if (index & ~mask) {
> +    return false;
> +  }
> +  *aValue = kBase64DecodeTable[index & mask];
> +  

nit: spaces
Attachment #8905067 - Flags: review?(erahm) → review+
(In reply to Nathan Froyd [:froydnj] from comment #28)
> (In reply to Nathan Froyd [:froydnj] from comment #27)
> > (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> > #19)
> > > ::: xpcom/io/Base64.cpp
> > > @@ +310,5 @@
> > > >    *aBase64 = nullptr;
> > > >    uint32_t base64Len = ((aBinaryLen + 2) / 3) * 4;
> > > >  
> > > >    // Add one byte for null termination.
> > > >    UniqueFreePtr<char[]> base64((char*)malloc(base64Len + 1));
> > > 
> > > Huh, I think that type is wrong. Should just be `char` right?
> > 
> > It should indeed be just `char`.
> 
> I take that back, it should be `char[]`.  We're specifying the deletion
> policy explicitly with `UniqueFreePtr`, so we don't have to worry about
> allocator/deallocator mismatches.  `char[]` also permits us to use
> `operator[]` on the `UniqueFreePtr`, rather than having to do something ugly
> like `base64.get()[...]`.  We're storing what is conceptually an array in
> the smart pointer, rather than just a pointer to a single
> dynamically-allocated `char`.

Oh right, that's pretty bizarre behavior on our part. You either do UniqueFreePtr<char> or UniqueFreePtr<char[]> but *not* UniqueFreePtr<char*>.
(In reply to Nathan Froyd [:froydnj] from comment #27) 
> > > +template<typename F>
> > 
> > Do we need to template here, can we just take a function pointer?  If you
> > want/need to keep it a template can you call 'F' something more obvious
> > (CharacterMapping or something, just not F).
> 
> I thought with taking the template, we might encourage inlining, and writing
> templates is nicer than writing function pointer types.
> 
> It seems odd to complain about `typename F` here, but not complain about
> `typename T` and similar.  But I've changed it to `typename Decoder` locally.

It's just that templating a function param is a little out of the norm and it wasn't obvious it was a function (to me at least). A comment would have sufficed I guess. With `typename T` it's generally clear that's a character type in it's usage (T* etc).
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c0d320e5bad1
part 0 - add test cases for base64 string encode/decode; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/959dd58ea6df
part 1 - make Base64 encode generic over source character types; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/bde5d6755cf0
part 1a - fix Encode*to* for non-8-bit character types; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/014a075a5a0d
part 2 - use our Base64 encode implementation rather than NSPR's; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/53447e7c6c29
part 3 - templatify core nsCString base64 encode routine; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/837aaf616d85
part 4 - avoid conversions in nsString Base64 encoding; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/9080c5fb0769
part 5 - add tables for normal base64 decoding; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/966a99fcb8ce
part 6 - factor out base64url decoding routines; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/e364a330eb3e
part 7 - make decode routines type-generic; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/6fed9af7f5e2
part 8 - use our base64 decoding routines instead of NSPR's; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/8454297462a3
part 9 - templatify core base64 decode routine; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/c92f1975d936
part 10 - templatify CString base64 decode routine; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/f2669ac73b90
part 11 - fix decoding subroutines to cope with wide output types; r=erahm
https://hg.mozilla.org/integration/mozilla-inbound/rev/981709ff412e
part 12 - avoid conversions in nsString Base64 decoding; r=erahm
You need to log in before you can comment on or make changes to this bug.