Closed Bug 320883 Opened 19 years ago Closed 19 years ago

Extra newline when textfile has CRLF at offset 1024+1025

Categories

(Core :: DOM: HTML Parser, defect, P4)

x86
Windows XP
defect

Tracking

()

VERIFIED FIXED
mozilla1.9alpha1

People

(Reporter: bugzilla, Assigned: mrbkap)

References

()

Details

(Keywords: fixed1.8.1, Whiteboard: [patch])

Attachments

(1 file, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

When a file of type text/plain has a "Windows" newline (CRLF) at byte offset 1024+1025 two newlines are shown.

I assume Firefox reads chunks of 1024 bytes at a time. In that case the CR is not seen in connection with the LF and both are rendered as a newline.


Reproducible: Always

Steps to Reproduce:
1. Create a text/plain file with 1023 bytes of text
2. Add CR+LF
3. Add some more text

Actual Results:  
The CRLF will show as two newlines.

Expected Results:  
Only one newline should be shown.

This occurs when the resource is requested over HTTP. It does not occur when the file is accessed through the file system.

E.g. visit http://stock.ter.dk/chunks.txt , save the resource, open the resource via the local file system.
Maybe related to Core bug 197075?
Assignee: nobody → mrbkap
Component: General → HTML: Parser
Product: Firefox → Core
QA Contact: general → parser
Version: unspecified → Trunk
*** Bug 323613 has been marked as a duplicate of this bug. ***
Also being reported heavily at Project Gutenberg Distributed Proofreaders.  http://www.pgdp.net/phpBB2/viewtopic.php?t=18705
Another testcase at http://www.xs4all.nl/~iarnell/test1024.txt
This is easy to fix, but it brings into question the newline handling code in CTextToken::Consume. It currently leaves CR-LF pairs in instead of stripping the CR out (which seems wrong to me). I'll attach a patch that leaves their behavior alone, but fixes the testcase here.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Priority: -- → P4
Target Milestone: --- → mozilla1.9alpha
Attached patch Treat CR-LF pairs as pairs (obsolete) — Splinter Review
If we find a carriage return, then we need to check the next character. If we're at the end of a packet boundary, then we'll need to wait longer. Otherwise, we don't need to worry about more data coming, and we can treat it as a standalone CR.
Attachment #209368 - Flags: review?(bzbarsky)
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs

>Index: parser/htmlparser/src/nsHTMLTokens.cpp
>       result = aScanner.Peek(aChar);
> ...
>+        aScanner.Peek(aChar);

This second Peek is superfluous, please ignore it.
> This is easy to fix, but it brings into question the newline handling code in
> CTextToken::Consume.

This is probably handled by us stripping \r in the content sink....  Perhaps we should consider fixing the code here and removing the stripping there?  Or would that lead to less-efficient string-copying than we have now?
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs

>Index: parser/htmlparser/src/nsHTMLTokens.cpp
>+        aScanner.Peek(aChar);

You're removing this, right?

>+              if (aScanner.IsIncremental()) {
>+                return result;

So... don't we need to reset the scanner position?  Or does something else handle it for us?

I have to admit; I was looking at nsHTMLTokenizer::ConsumeTex and I'm not sure it deals with a failure return at all well.  In particular, it will IF_FREE aToken, set it to null, then set it to theToken, and AddToken theToken.... But theToken has nothing useful in it (we just bailed out of this code).  So what's up with that?
(In reply to comment #7)
> should consider fixing the code here and removing the stripping there?  Or
> would that lead to less-efficient string-copying than we have now?

That's my fear. I think I'd rather leave the current code as-is, since it's the content sink that does the stripping.

(In reply to comment #8)
> >+        aScanner.Peek(aChar);
> You're removing this, right?

Yes.

> >+              if (aScanner.IsIncremental()) {
> >+                return result;
> 
> So... don't we need to reset the scanner position?  Or does something else
> handle it for us?

Yes, see the code in nsParser::Tokenize that calls nsITokenizer::ConsumeToken. If the parser returns an error, then it rewinds the scanner to the last mark. Marks are set at the end of each successfully consumed token.

> But theToken has nothing useful in it (we just bailed out of this code).
> So what's up with that?

D'oh, my |return result| should be a |break;|, so we'll bail out of the while loop and bind our substring to something useful. Want a new patch?
> If the parser returns an error, then it rewinds the scanner to the last mark.

Ok.

> so we'll bail out of the while loop and bind our substring to something useful.

But if the scanner is going to be rewound... should we really be binding here?  That will convert the error code to NS_OK on ConsumeText and we'll get doubled text, no?
(In reply to comment #10)
> > so we'll bail out of the while loop and bind our substring to something useful.
> But if the scanner is going to be rewound... should we really be binding here? 
> That will convert the error code to NS_OK on ConsumeText and we'll get doubled
> text, no?

No, because CTextToken::Consume only returns to nsHTMLTokenizer::ConsumeText, which will convert the return value immediately, so the parser will never see it, and it will mark the scanner right before the carriage return. Therefore, when the next chunk of data comes in, we'll start reading at the carriage return and do the Right Thing (TM).
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs

Ah, I see.  Can you post a patch with the various things fixed so I can do some more mental modeling of parser?  ;)
Attachment #209368 - Flags: review?(bzbarsky) → review-
Attached patch Updated patchSplinter Review
I made one large change to this patch: I no longer set result to NS_OK in CTextToken::Consume. This is a micro optimization so that we don't go around the loop again if we found EOF.
Attachment #209368 - Attachment is obsolete: true
Attachment #209480 - Flags: review?(bzbarsky)
Attachment #209480 - Flags: review?(bzbarsky) → review+
Attachment #209480 - Flags: superreview?(brendan)
Whiteboard: [patch]
Comment on attachment 209480 [details] [diff] [review]
Updated patch

Sure, rs=me.

/be
Attachment #209480 - Flags: superreview?(brendan) → superreview+
Fix checked into trunk.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Thanks. Fix works for me on trunk. Any chance of getting this into 1.8 as well? Reports from distributed proofreaders show this is occurring on firefox 1.5 (and 1.0.7) too.
This issue is still present in Firefox 2.0a1 (PortableFirefox)
Attachment #209480 - Flags: approval-branch-1.8.1?(jst)
Yes; it hasn't been checked on branch.   See the flags on the patch attachment.
Attachment #209480 - Flags: approval-branch-1.8.1?(jst) → approval-branch-1.8.1+
Fix checked into the MOZILLA_1_8_BRANCH. Marking this verified on trunk per comment 16.
Status: RESOLVED → VERIFIED
Keywords: fixed1.8.1
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: