Closed
Bug 995321
Opened 10 years ago
Closed 10 years ago
nsIDocumentEncoder.encodeToString should offer a way to limit the size of the output
Categories
(Core :: DOM: Serializers, defect)
Core
DOM: Serializers
Tracking
()
VERIFIED
FIXED
mozilla31
People
(Reporter: florian, Assigned: florian)
References
Details
(Whiteboard: [translation] p=3 s=it-31c-30a-29b.3 [qa-])
Attachments
(1 file, 3 obsolete files)
13.11 KB,
patch
|
Details | Diff | Splinter Review |
In bug 971048 I need a way to feed the language detection library added in bug 971047 with a significant amount of plain text data (chromimum seems to pass CLD 64k of data), but I don't need to serialize the whole page (and it's awfully slow for large pages) so I'm looking for a way to make the serializer stop after some amount of data has been generated. The attached patch adds an optional parameter to the encodeToString method of nsDocumentEncoder. It does what I need, but I'm a bit hesitant on some details: - should I have added an encodeToStringWithMaxLength method instead of adding an optional parameter? Having to add a "0, " to callers in random places was kinda annoying. - is it acceptable to use the end offset as a way to pass the max remaining length to the serializers? I was surprised to discover some serializers don't check the endoffset value and crash if the caller requested an out of bound value. I think I fixed all these cases, but double checking could be useful. The reason why I'm requesting feedback rather than review is the 2 questions above + I think this needs a test but didn't want to spend time writing it until I'm sure I'm not going to start over :).
Attachment #8405497 -
Flags: feedback?(hsivonen)
Attachment #8405497 -
Flags: feedback?(bugs)
Comment 1•10 years ago
|
||
(In reply to Florian Quèze [:florian] [:flo] from comment #0) > - should I have added an encodeToStringWithMaxLength method instead of > adding an optional parameter? Having to add a "0, " to callers in random > places was kinda annoying. I think I'd prefer encodeToStringWithMaxLength > - is it acceptable to use the end offset as a way to pass the max remaining > length to the serializers? I was surprised to discover some serializers > don't check the endoffset value and crash if the caller requested an out of > bound value. Old code. But looks ok. However, what kind of data do you need? Could we not just use document.textContent and perhaps have a form of that which limits the max length. Probably not since we want some extra space characters. (.textContent and .innerHTML sure are optimized better than any old serializers.) Though, if we limit serialization anyway to xxK, it should be pretty fast.
Assignee | ||
Comment 2•10 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #1) > However, what kind of data do you need? We need a string containing a sample of the text displayed on the page. > Could we not just use document.textContent If I remember correctly, felipe tried this and it wasn't suitable for our use case because it included the content of <script> and <style> tags. > Probably not since we want some extra space characters. Whitespace isn't really relevant for our use case. > (.textContent and .innerHTML sure are optimized better than any old > serializers.) Yes, I remember timing this compared to the plain text serializer and they were much faster. > Though, if we limit serialization anyway to xxK, it should be pretty fast. I'm limiting the size of the serializer's output mostly to limit the time it takes to return a result.
Comment 3•10 years ago
|
||
Comment on attachment 8405497 [details] [diff] [review] Patch v1 (In reply to Florian Quèze [:florian] [:flo] from comment #0) > The attached patch adds an optional parameter to the encodeToString method > of nsDocumentEncoder. Do you actually need the pretty printing, etc., features from nsDocumentEncoder? If not, would it make more sense from the performance and maintainability perspective to write a custom (independent of nsDocumentEncoder) .textContent-like content grabber that skips over display: none; nodes? > - should I have added an encodeToStringWithMaxLength method instead of > adding an optional parameter? Having to add a "0, " to callers in random > places was kinda annoying. I would prefer adding a new method. > - is it acceptable to use the end offset as a way to pass the max remaining > length to the serializers? I was surprised to discover some serializers > don't check the endoffset value and crash if the caller requested an out of > bound value. I think I fixed all these cases, but double checking could be > useful. I defer to smaug on this one. I'm not familiar with the code in this level.
Attachment #8405497 -
Flags: feedback?(hsivonen) → feedback+
Assignee | ||
Comment 4•10 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #3) > (In reply to Florian Quèze [:florian] [:flo] from comment #0) > Do you actually need the pretty printing, etc., features from > nsDocumentEncoder? Not really no. > If not, would it make more sense from the performance and > maintainability perspective to write a custom (independent of > nsDocumentEncoder) .textContent-like content grabber that skips over > display: none; nodes? I think it makes sense to re-use the existing code unless we have clearly identified issues with it. If it turns out running this code has a significant performance cost, I'll be happy to look into a more optimized solution similar to .textContent.
Attachment #8405497 -
Attachment is obsolete: true
Attachment #8405497 -
Flags: feedback?(bugs)
Attachment #8406161 -
Flags: review?(bugs)
Comment 5•10 years ago
|
||
Comment on attachment 8406161 [details] [diff] [review] Patch v2 >diff --git a/content/base/public/nsIDocumentEncoder.idl b/content/base/public/nsIDocumentEncoder.idl >--- a/content/base/public/nsIDocumentEncoder.idl >+++ b/content/base/public/nsIDocumentEncoder.idl >@@ -325,13 +325,28 @@ interface nsIDocumentEncoder : nsISuppor Update uuid of nsIDocumentEncoder > * be stored. > * @return The document encoded as a string. > * > */ > AString encodeToStringWithContext( out AString aContextString, > out AString aInfoString); > > /** >+ * Encode the document into a string of limited size. >+ * @param aMaxLength After aMaxLength characters, the encoder will stop >+ * encoding new data. >+ * Only values > 0 will be considered. >+ * The returned string may be slightly larger than >+ * aMaxLength because some serializers (eg. HTML) >+ * may need to close some tags after they stop >+ * encoding new data, or finish a line (72 columns >+ * by default for the plain text serializer). >+ * >+ * @return The document encoded into a string. >+ */ >+ AString encodeToStringWithMaxLength(in long aMaxLength); Make the param unsigned > if (!aDontSerializeRoot) { >- rv = SerializeNodeStart(maybeFixedNode, 0, -1, aStr, aNode); >+ int32_t endOffset = -1; >+ if (aMaxLength > 0) >+ endOffset = aMaxLength - aStr.Length(); always {} with if in new code. Please MOZ_ASSERT that aMaxLength >= aStr.Length() >- if (aStartOffset || (aEndOffset != -1)) { >- int32_t length = (aEndOffset == -1) ? data.Length() : aEndOffset; >+ int32_t dataLength = data.Length(); >+ if (aStartOffset || (aEndOffset != -1 && aEndOffset < dataLength)) { >+ int32_t length = >+ (aEndOffset == -1) ? dataLength : std::min(aEndOffset, dataLength); > length -= aStartOffset; Isn't it now rather easy to get length to be some huge value, when aEndOffset is for example 1 and aStartOffset 2 and dataLength 10000. And if aEndOffset now depends on the maxlength, the caller doesn't really verify the value. So perhaps check that length is > 0 before passing it to Mid, which seems to take uint32 >+ // Create a plaintext encoder without flags. >+ var encoder = Cc["@mozilla.org/layout/documentEncoder;1?type=text/plain"] >+ .createInstance(de); >+ encoder.init(document, "text/plain", 0); >+ >+ var maxLength = 300; >+ ok(encoder.encodeToStringWithMaxLength(maxLength).length < maxLength + 72, >+ "test limiting the length to " + maxLength); Could you please add some more tests. At least length 0, 1 and then some huge values.
Attachment #8406161 -
Flags: review?(bugs) → review-
Assignee | ||
Comment 6•10 years ago
|
||
Attachment #8406161 -
Attachment is obsolete: true
Attachment #8406493 -
Flags: review?(bugs)
Updated•10 years ago
|
Status: NEW → ASSIGNED
Whiteboard: p=3 s=it-31c-30a-29b.3 [qa?]
Please needinfo me to request QA verification of this bug.
Whiteboard: p=3 s=it-31c-30a-29b.3 [qa?] → p=3 s=it-31c-30a-29b.3 [qa-]
Comment 8•10 years ago
|
||
Comment on attachment 8406493 [details] [diff] [review] Patch v3 >+ if (aMaxLength > 0 && aStr.Length() >= aMaxLength) >+ return NS_OK; I'd prefer {} with if in all the new code, and this file isn't consistent, so better to move to the right coding style here. >+ // Create a plaintext encoder without flags. >+ var encoder = Cc["@mozilla.org/layout/documentEncoder;1?type=text/plain"] >+ .createInstance(de); >+ encoder.init(document, "text/plain", 0); >+ >+ ok(encoder.encodeToStringWithMaxLength(1).length < 1 + 72, >+ "test limiting the length to 1"); >+ >+ ok(encoder.encodeToStringWithMaxLength(300).length < 300 + 72, >+ "test limiting the length to 300"); >+ >+ is(encoder.encodeToStringWithMaxLength(0), encoder.encodeToString(), >+ "limiting the length to 0 should be ignored"); >+ >+ is(encoder.encodeToStringWithMaxLength(10000), encoder.encodeToString(), >+ "limiting the length to a huge value should return the whole page"); Could you also check that the encoded string is actually the expected one, not just the length
Attachment #8406493 -
Flags: review?(bugs) → review+
Assignee | ||
Comment 9•10 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #8) > >+ // Create a plaintext encoder without flags. > >+ var encoder = Cc["@mozilla.org/layout/documentEncoder;1?type=text/plain"] > >+ .createInstance(de); > >+ encoder.init(document, "text/plain", 0); > >+ > >+ ok(encoder.encodeToStringWithMaxLength(1).length < 1 + 72, > >+ "test limiting the length to 1"); > >+ > >+ ok(encoder.encodeToStringWithMaxLength(300).length < 300 + 72, > >+ "test limiting the length to 300"); > >+ > >+ is(encoder.encodeToStringWithMaxLength(0), encoder.encodeToString(), > >+ "limiting the length to 0 should be ignored"); > >+ > >+ is(encoder.encodeToStringWithMaxLength(10000), encoder.encodeToString(), > >+ "limiting the length to a huge value should return the whole page"); > > Could you also check that the encoded string is actually the expected one, > not > just the length I'm already checking the encoded string for the last 2 cases. For the first 2 cases, it's difficult because: - How many characters actually need to be returned isn't strongly specified. - Bug 996302 is in my way. I would like to check that the returned string is a prefix of the encoded string without max length, but even that is problematic given bug 996302. Apparently the first time the encoder is called, there's a leading space in the result. One hack around the issue could be to run the encoder once and discard the result before doing the actual test, but that felt too hackish.
Comment 10•10 years ago
|
||
Hmm, that is unfortunate. Perhaps create a new encoder each time and compare the results? That would be less hackish.
Assignee | ||
Comment 11•10 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #10) > Perhaps create a new encoder each time and compare the results? That would > be less hackish. Do you like the test this way? The trimRight calls I added are because the strings have a trailing line break.
Attachment #8406493 -
Attachment is obsolete: true
Flags: needinfo?(bugs)
Updated•10 years ago
|
Flags: firefox-backlog+
Comment 12•10 years ago
|
||
I guess you didn't get my comment on irc. The test looks ok.
Flags: needinfo?(bugs)
Comment 13•10 years ago
|
||
Comment on attachment 8407576 [details] [diff] [review] Patch v4 Review of attachment 8407576 [details] [diff] [review]: ----------------------------------------------------------------- ::: content/base/test/test_encodeToStringWithMaxLength.html @@ +19,5 @@ > + encoder.init(document, "text/plain", 0); > + return encoder; > + } > + > + function testPaintextSerializerWithMaxLength() { nit: Paintext :)
Assignee | ||
Comment 14•10 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/8c982eaeef86
Comment 15•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/8c982eaeef86
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Updated•10 years ago
|
Status: RESOLVED → VERIFIED
Updated•9 years ago
|
Whiteboard: p=3 s=it-31c-30a-29b.3 [qa-] p=3 s=it-31c-30a-29b.3 [qa-] → [translation] p=3 s=it-31c-30a-29b.3 [qa-] [translation] p=3 s=it-31c-30a-29b.3 [qa-]
You need to log in
before you can comment on or make changes to this bug.
Description
•