Closed
Bug 320883
Opened 19 years ago
Closed 19 years ago
Extra newline when textfile has CRLF at offset 1024+1025
Categories
(Core :: DOM: HTML Parser, defect, P4)
Tracking
()
VERIFIED
FIXED
mozilla1.9alpha1
People
(Reporter: bugzilla, Assigned: mrbkap)
References
()
Details
(Keywords: fixed1.8.1, Whiteboard: [patch])
Attachments
(1 file, 1 obsolete file)
3.16 KB,
patch
|
bzbarsky
:
review+
brendan
:
superreview+
jst
:
approval-branch-1.8.1+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
When a file of type text/plain has a "Windows" newline (CRLF) at byte offset 1024+1025 two newlines are shown.
I assume Firefox reads chunks of 1024 bytes at a time. In that case the CR is not seen in connection with the LF and both are rendered as a newline.
Reproducible: Always
Steps to Reproduce:
1. Create a text/plain file with 1023 bytes of text
2. Add CR+LF
3. Add some more text
Actual Results:
The CRLF will show as two newlines.
Expected Results:
Only one newline should be shown.
This occurs when the resource is requested over HTTP. It does not occur when the file is accessed through the file system.
E.g. visit http://stock.ter.dk/chunks.txt , save the resource, open the resource via the local file system.
Maybe related to Core bug 197075?
Updated•19 years ago
|
Assignee: nobody → mrbkap
Component: General → HTML: Parser
Product: Firefox → Core
QA Contact: general → parser
Version: unspecified → Trunk
Comment 2•19 years ago
|
||
*** Bug 323613 has been marked as a duplicate of this bug. ***
Comment 3•19 years ago
|
||
Also being reported heavily at Project Gutenberg Distributed Proofreaders. http://www.pgdp.net/phpBB2/viewtopic.php?t=18705
Another testcase at http://www.xs4all.nl/~iarnell/test1024.txt
Assignee | ||
Comment 4•19 years ago
|
||
This is easy to fix, but it brings into question the newline handling code in CTextToken::Consume. It currently leaves CR-LF pairs in instead of stripping the CR out (which seems wrong to me). I'll attach a patch that leaves their behavior alone, but fixes the testcase here.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Priority: -- → P4
Target Milestone: --- → mozilla1.9alpha
Assignee | ||
Comment 5•19 years ago
|
||
If we find a carriage return, then we need to check the next character. If we're at the end of a packet boundary, then we'll need to wait longer. Otherwise, we don't need to worry about more data coming, and we can treat it as a standalone CR.
Attachment #209368 -
Flags: review?(bzbarsky)
Assignee | ||
Comment 6•19 years ago
|
||
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs
>Index: parser/htmlparser/src/nsHTMLTokens.cpp
> result = aScanner.Peek(aChar);
> ...
>+ aScanner.Peek(aChar);
This second Peek is superfluous, please ignore it.
Comment 7•19 years ago
|
||
> This is easy to fix, but it brings into question the newline handling code in
> CTextToken::Consume.
This is probably handled by us stripping \r in the content sink.... Perhaps we should consider fixing the code here and removing the stripping there? Or would that lead to less-efficient string-copying than we have now?
Comment 8•19 years ago
|
||
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs
>Index: parser/htmlparser/src/nsHTMLTokens.cpp
>+ aScanner.Peek(aChar);
You're removing this, right?
>+ if (aScanner.IsIncremental()) {
>+ return result;
So... don't we need to reset the scanner position? Or does something else handle it for us?
I have to admit; I was looking at nsHTMLTokenizer::ConsumeTex and I'm not sure it deals with a failure return at all well. In particular, it will IF_FREE aToken, set it to null, then set it to theToken, and AddToken theToken.... But theToken has nothing useful in it (we just bailed out of this code). So what's up with that?
Assignee | ||
Comment 9•19 years ago
|
||
(In reply to comment #7)
> should consider fixing the code here and removing the stripping there? Or
> would that lead to less-efficient string-copying than we have now?
That's my fear. I think I'd rather leave the current code as-is, since it's the content sink that does the stripping.
(In reply to comment #8)
> >+ aScanner.Peek(aChar);
> You're removing this, right?
Yes.
> >+ if (aScanner.IsIncremental()) {
> >+ return result;
>
> So... don't we need to reset the scanner position? Or does something else
> handle it for us?
Yes, see the code in nsParser::Tokenize that calls nsITokenizer::ConsumeToken. If the parser returns an error, then it rewinds the scanner to the last mark. Marks are set at the end of each successfully consumed token.
> But theToken has nothing useful in it (we just bailed out of this code).
> So what's up with that?
D'oh, my |return result| should be a |break;|, so we'll bail out of the while loop and bind our substring to something useful. Want a new patch?
Comment 10•19 years ago
|
||
> If the parser returns an error, then it rewinds the scanner to the last mark.
Ok.
> so we'll bail out of the while loop and bind our substring to something useful.
But if the scanner is going to be rewound... should we really be binding here? That will convert the error code to NS_OK on ConsumeText and we'll get doubled text, no?
Assignee | ||
Comment 11•19 years ago
|
||
(In reply to comment #10)
> > so we'll bail out of the while loop and bind our substring to something useful.
> But if the scanner is going to be rewound... should we really be binding here?
> That will convert the error code to NS_OK on ConsumeText and we'll get doubled
> text, no?
No, because CTextToken::Consume only returns to nsHTMLTokenizer::ConsumeText, which will convert the return value immediately, so the parser will never see it, and it will mark the scanner right before the carriage return. Therefore, when the next chunk of data comes in, we'll start reading at the carriage return and do the Right Thing (TM).
Comment 12•19 years ago
|
||
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs
Ah, I see. Can you post a patch with the various things fixed so I can do some more mental modeling of parser? ;)
Attachment #209368 -
Flags: review?(bzbarsky) → review-
Assignee | ||
Comment 13•19 years ago
|
||
I made one large change to this patch: I no longer set result to NS_OK in CTextToken::Consume. This is a micro optimization so that we don't go around the loop again if we found EOF.
Attachment #209368 -
Attachment is obsolete: true
Attachment #209480 -
Flags: review?(bzbarsky)
Updated•19 years ago
|
Attachment #209480 -
Flags: review?(bzbarsky) → review+
Assignee | ||
Updated•19 years ago
|
Attachment #209480 -
Flags: superreview?(brendan)
Assignee | ||
Updated•19 years ago
|
Whiteboard: [patch]
Comment 14•19 years ago
|
||
Comment on attachment 209480 [details] [diff] [review]
Updated patch
Sure, rs=me.
/be
Attachment #209480 -
Flags: superreview?(brendan) → superreview+
Assignee | ||
Comment 15•19 years ago
|
||
Fix checked into trunk.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Comment 16•19 years ago
|
||
Thanks. Fix works for me on trunk. Any chance of getting this into 1.8 as well? Reports from distributed proofreaders show this is occurring on firefox 1.5 (and 1.0.7) too.
Reporter | ||
Comment 17•19 years ago
|
||
This issue is still present in Firefox 2.0a1 (PortableFirefox)
Assignee | ||
Updated•19 years ago
|
Attachment #209480 -
Flags: approval-branch-1.8.1?(jst)
Comment 18•19 years ago
|
||
Yes; it hasn't been checked on branch. See the flags on the patch attachment.
Updated•19 years ago
|
Attachment #209480 -
Flags: approval-branch-1.8.1?(jst) → approval-branch-1.8.1+
Assignee | ||
Comment 19•19 years ago
|
||
Fix checked into the MOZILLA_1_8_BRANCH. Marking this verified on trunk per comment 16.
Status: RESOLVED → VERIFIED
Keywords: fixed1.8.1
You need to log in
before you can comment on or make changes to this bug.
Description
•