Closed Bug 995321 Opened 10 years ago Closed 10 years ago

nsIDocumentEncoder.encodeToString should offer a way to limit the size of the output

Categories

(Core :: DOM: Serializers, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla31

People

(Reporter: florian, Assigned: florian)

References

Details

(Whiteboard: [translation] p=3 s=it-31c-30a-29b.3 [qa-])

Attachments

(1 file, 3 obsolete files)

Attached patch Patch v1 (obsolete) — Splinter Review
In bug 971048 I need a way to feed the language detection library added in bug 971047 with a significant amount of plain text data (chromimum seems to pass CLD 64k of data), but I don't need to serialize the whole page (and it's awfully slow for large pages) so I'm looking for a way to make the serializer stop after some amount of data has been generated.

The attached patch adds an optional parameter to the encodeToString method of nsDocumentEncoder. It does what I need, but I'm a bit hesitant on some details:
- should I have added an encodeToStringWithMaxLength method instead of adding an optional parameter? Having to add a "0, " to callers in random places was kinda annoying.
- is it acceptable to use the end offset as a way to pass the max remaining length to the serializers? I was surprised to discover some serializers don't check the endoffset value and crash if the caller requested an out of bound value. I think I fixed all these cases, but double checking could be useful.

The reason why I'm requesting feedback rather than review is the 2 questions above + I think this needs a test but didn't want to spend time writing it until I'm sure I'm not going to start over :).
Attachment #8405497 - Flags: feedback?(hsivonen)
Attachment #8405497 - Flags: feedback?(bugs)
(In reply to Florian Quèze [:florian] [:flo] from comment #0)
> - should I have added an encodeToStringWithMaxLength method instead of
> adding an optional parameter? Having to add a "0, " to callers in random
> places was kinda annoying.
I think I'd prefer encodeToStringWithMaxLength

> - is it acceptable to use the end offset as a way to pass the max remaining
> length to the serializers? I was surprised to discover some serializers
> don't check the endoffset value and crash if the caller requested an out of
> bound value.
Old code.

But looks ok.

However, what kind of data do you need?
Could we not just use document.textContent and perhaps have a form of that which limits the
max length.  Probably not since we want some extra space characters.
(.textContent and .innerHTML sure are optimized better than any old serializers.)
Though, if we limit serialization anyway to xxK, it should be pretty fast.
(In reply to Olli Pettay [:smaug] from comment #1)

> However, what kind of data do you need?

We need a string containing a sample of the text displayed on the page.

> Could we not just use document.textContent

If I remember correctly, felipe tried this and it wasn't suitable for our use case because it included the content of <script> and <style> tags.

> Probably not since we want some extra space characters.

Whitespace isn't really relevant for our use case.

> (.textContent and .innerHTML sure are optimized better than any old
> serializers.)

Yes, I remember timing this compared to the plain text serializer and they were much faster.

> Though, if we limit serialization anyway to xxK, it should be pretty fast.

I'm limiting the size of the serializer's output mostly to limit the time it takes to return a result.
Comment on attachment 8405497 [details] [diff] [review]
Patch v1

(In reply to Florian Quèze [:florian] [:flo] from comment #0)
> The attached patch adds an optional parameter to the encodeToString method
> of nsDocumentEncoder.

Do you actually need the pretty printing, etc., features from nsDocumentEncoder? If not, would it make more sense from the performance and maintainability perspective to write a custom (independent of nsDocumentEncoder) .textContent-like content grabber that skips over display: none; nodes?

> - should I have added an encodeToStringWithMaxLength method instead of
> adding an optional parameter? Having to add a "0, " to callers in random
> places was kinda annoying.

I would prefer adding a new method.

> - is it acceptable to use the end offset as a way to pass the max remaining
> length to the serializers? I was surprised to discover some serializers
> don't check the endoffset value and crash if the caller requested an out of
> bound value. I think I fixed all these cases, but double checking could be
> useful.

I defer to smaug on this one.  I'm not familiar with the code in this level.
Attachment #8405497 - Flags: feedback?(hsivonen) → feedback+
Attached patch Patch v2 (obsolete) — Splinter Review
(In reply to Henri Sivonen (:hsivonen) from comment #3)

> (In reply to Florian Quèze [:florian] [:flo] from comment #0)

> Do you actually need the pretty printing, etc., features from
> nsDocumentEncoder?

Not really no.

> If not, would it make more sense from the performance and
> maintainability perspective to write a custom (independent of
> nsDocumentEncoder) .textContent-like content grabber that skips over
> display: none; nodes?

I think it makes sense to re-use the existing code unless we have clearly identified issues with it. If it turns out running this code has a significant performance cost, I'll be happy to look into a more optimized solution similar to .textContent.
Attachment #8405497 - Attachment is obsolete: true
Attachment #8405497 - Flags: feedback?(bugs)
Attachment #8406161 - Flags: review?(bugs)
Comment on attachment 8406161 [details] [diff] [review]
Patch v2

>diff --git a/content/base/public/nsIDocumentEncoder.idl b/content/base/public/nsIDocumentEncoder.idl
>--- a/content/base/public/nsIDocumentEncoder.idl
>+++ b/content/base/public/nsIDocumentEncoder.idl
>@@ -325,13 +325,28 @@ interface nsIDocumentEncoder : nsISuppor
Update uuid of nsIDocumentEncoder

>    *              be stored.
>    * @return The document encoded as a string.
>    * 
>    */
>   AString encodeToStringWithContext( out AString aContextString,
>                                      out AString aInfoString);
> 
>   /**
>+   * Encode the document into a string of limited size.
>+   * @param aMaxLength After aMaxLength characters, the encoder will stop
>+   *                   encoding new data.
>+   *                   Only values > 0 will be considered.
>+   *                   The returned string may be slightly larger than
>+   *                   aMaxLength because some serializers (eg. HTML)
>+   *                   may need to close some tags after they stop
>+   *                   encoding new data, or finish a line (72 columns
>+   *                   by default for the plain text serializer).
>+   *
>+   * @return The document encoded into a string.
>+   */
>+  AString encodeToStringWithMaxLength(in long aMaxLength);
Make the param unsigned

>   if (!aDontSerializeRoot) {
>-    rv = SerializeNodeStart(maybeFixedNode, 0, -1, aStr, aNode);
>+    int32_t endOffset = -1;
>+    if (aMaxLength > 0)
>+      endOffset = aMaxLength - aStr.Length();
always {} with if in new code.
Please MOZ_ASSERT that aMaxLength >= aStr.Length()
>-  if (aStartOffset || (aEndOffset != -1)) {
>-    int32_t length = (aEndOffset == -1) ? data.Length() : aEndOffset;
>+  int32_t dataLength = data.Length();
>+  if (aStartOffset || (aEndOffset != -1 && aEndOffset < dataLength)) {
>+    int32_t length =
>+      (aEndOffset == -1) ? dataLength : std::min(aEndOffset, dataLength);
>     length -= aStartOffset;
Isn't it now rather easy to get length to be some huge value, when
aEndOffset is for example 1 and aStartOffset 2 and dataLength 10000.
And if aEndOffset now depends on the maxlength, the caller doesn't really
verify the value.
So perhaps check that length is > 0 before passing it to Mid, which seems to take uint32

>+    // Create a plaintext encoder without flags.
>+    var encoder = Cc["@mozilla.org/layout/documentEncoder;1?type=text/plain"]
>+                  .createInstance(de);
>+    encoder.init(document, "text/plain", 0);
>+
>+    var maxLength = 300;
>+    ok(encoder.encodeToStringWithMaxLength(maxLength).length < maxLength + 72,
>+       "test limiting the length to " + maxLength);

Could you please add some more tests.
At least length 0, 1 and then some huge values.
Attachment #8406161 - Flags: review?(bugs) → review-
Attached patch Patch v3 (obsolete) — Splinter Review
Attachment #8406161 - Attachment is obsolete: true
Attachment #8406493 - Flags: review?(bugs)
Status: NEW → ASSIGNED
Whiteboard: p=3 s=it-31c-30a-29b.3 [qa?]
Please needinfo me to request QA verification of this bug.
Whiteboard: p=3 s=it-31c-30a-29b.3 [qa?] → p=3 s=it-31c-30a-29b.3 [qa-]
Comment on attachment 8406493 [details] [diff] [review]
Patch v3

>+  if (aMaxLength > 0 && aStr.Length() >= aMaxLength)
>+    return NS_OK;
I'd prefer {} with if in all the new code, and this file isn't consistent, so
better to move to the right coding style here.

>+    // Create a plaintext encoder without flags.
>+    var encoder = Cc["@mozilla.org/layout/documentEncoder;1?type=text/plain"]
>+                  .createInstance(de);
>+    encoder.init(document, "text/plain", 0);
>+
>+    ok(encoder.encodeToStringWithMaxLength(1).length < 1 + 72,
>+       "test limiting the length to 1");
>+
>+    ok(encoder.encodeToStringWithMaxLength(300).length < 300 + 72,
>+       "test limiting the length to 300");
>+
>+    is(encoder.encodeToStringWithMaxLength(0), encoder.encodeToString(),
>+       "limiting the length to 0 should be ignored");
>+
>+    is(encoder.encodeToStringWithMaxLength(10000), encoder.encodeToString(),
>+       "limiting the length to a huge value should return the whole page");

Could you also check that the encoded string is actually the expected one, not
just the length
Attachment #8406493 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #8)

> >+    // Create a plaintext encoder without flags.
> >+    var encoder = Cc["@mozilla.org/layout/documentEncoder;1?type=text/plain"]
> >+                  .createInstance(de);
> >+    encoder.init(document, "text/plain", 0);
> >+
> >+    ok(encoder.encodeToStringWithMaxLength(1).length < 1 + 72,
> >+       "test limiting the length to 1");
> >+
> >+    ok(encoder.encodeToStringWithMaxLength(300).length < 300 + 72,
> >+       "test limiting the length to 300");
> >+
> >+    is(encoder.encodeToStringWithMaxLength(0), encoder.encodeToString(),
> >+       "limiting the length to 0 should be ignored");
> >+
> >+    is(encoder.encodeToStringWithMaxLength(10000), encoder.encodeToString(),
> >+       "limiting the length to a huge value should return the whole page");
> 
> Could you also check that the encoded string is actually the expected one,
> not
> just the length

I'm already checking the encoded string for the last 2 cases.
For the first 2 cases, it's difficult because:
- How many characters actually need to be returned isn't strongly specified.
- Bug 996302 is in my way.

I would like to check that the returned string is a prefix of the encoded string without max length, but even that is problematic given bug 996302.
Apparently the first time the encoder is called, there's a leading space in the result. One hack around the issue could be to run the encoder once and discard the result before doing the actual test, but that felt too hackish.
Hmm, that is unfortunate.
Perhaps create a new encoder each time and compare the results? That would be less hackish.
Attached patch Patch v4Splinter Review
(In reply to Olli Pettay [:smaug] from comment #10)

> Perhaps create a new encoder each time and compare the results? That would
> be less hackish.

Do you like the test this way? The trimRight calls I added are because the strings have a trailing line break.
Attachment #8406493 - Attachment is obsolete: true
Flags: needinfo?(bugs)
Flags: firefox-backlog+
I guess you didn't get my comment on irc. The test looks ok.
Flags: needinfo?(bugs)
Comment on attachment 8407576 [details] [diff] [review]
Patch v4

Review of attachment 8407576 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/base/test/test_encodeToStringWithMaxLength.html
@@ +19,5 @@
> +    encoder.init(document, "text/plain", 0);
> +    return encoder;
> +  }
> +
> +  function testPaintextSerializerWithMaxLength() {

nit: Paintext :)
https://hg.mozilla.org/mozilla-central/rev/8c982eaeef86
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Status: RESOLVED → VERIFIED
Whiteboard: p=3 s=it-31c-30a-29b.3 [qa-] p=3 s=it-31c-30a-29b.3 [qa-] → [translation] p=3 s=it-31c-30a-29b.3 [qa-] [translation] p=3 s=it-31c-30a-29b.3 [qa-]
You need to log in before you can comment on or make changes to this bug.