Extra newline when textfile has CRLF at offset 1024+1025

VERIFIED FIXED in mozilla1.9alpha1

Status

()

Core
HTML: Parser
P4
minor
VERIFIED FIXED
13 years ago
12 years ago

People

(Reporter: Peter Brodersen, Assigned: mrbkap)

Tracking

({fixed1.8.1})

Trunk
mozilla1.9alpha1
x86
Windows XP
fixed1.8.1
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [patch], URL)

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

13 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

When a file of type text/plain has a "Windows" newline (CRLF) at byte offset 1024+1025 two newlines are shown.

I assume Firefox reads chunks of 1024 bytes at a time. In that case the CR is not seen in connection with the LF and both are rendered as a newline.


Reproducible: Always

Steps to Reproduce:
1. Create a text/plain file with 1023 bytes of text
2. Add CR+LF
3. Add some more text

Actual Results:  
The CRLF will show as two newlines.

Expected Results:  
Only one newline should be shown.

This occurs when the resource is requested over HTTP. It does not occur when the file is accessed through the file system.

E.g. visit http://stock.ter.dk/chunks.txt , save the resource, open the resource via the local file system.

Comment 1

13 years ago
Maybe related to Core bug 197075?
Assignee: nobody → mrbkap
Component: General → HTML: Parser
Product: Firefox → Core
QA Contact: general → parser
Version: unspecified → Trunk

Comment 2

13 years ago
*** Bug 323613 has been marked as a duplicate of this bug. ***

Comment 3

13 years ago
Also being reported heavily at Project Gutenberg Distributed Proofreaders.  http://www.pgdp.net/phpBB2/viewtopic.php?t=18705
Another testcase at http://www.xs4all.nl/~iarnell/test1024.txt
(Assignee)

Comment 4

13 years ago
This is easy to fix, but it brings into question the newline handling code in CTextToken::Consume. It currently leaves CR-LF pairs in instead of stripping the CR out (which seems wrong to me). I'll attach a patch that leaves their behavior alone, but fixes the testcase here.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Priority: -- → P4
Target Milestone: --- → mozilla1.9alpha
(Assignee)

Comment 5

13 years ago
Created attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs

If we find a carriage return, then we need to check the next character. If we're at the end of a packet boundary, then we'll need to wait longer. Otherwise, we don't need to worry about more data coming, and we can treat it as a standalone CR.
Attachment #209368 - Flags: review?(bzbarsky)
(Assignee)

Comment 6

13 years ago
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs

>Index: parser/htmlparser/src/nsHTMLTokens.cpp
>       result = aScanner.Peek(aChar);
> ...
>+        aScanner.Peek(aChar);

This second Peek is superfluous, please ignore it.
> This is easy to fix, but it brings into question the newline handling code in
> CTextToken::Consume.

This is probably handled by us stripping \r in the content sink....  Perhaps we should consider fixing the code here and removing the stripping there?  Or would that lead to less-efficient string-copying than we have now?
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs

>Index: parser/htmlparser/src/nsHTMLTokens.cpp
>+        aScanner.Peek(aChar);

You're removing this, right?

>+              if (aScanner.IsIncremental()) {
>+                return result;

So... don't we need to reset the scanner position?  Or does something else handle it for us?

I have to admit; I was looking at nsHTMLTokenizer::ConsumeTex and I'm not sure it deals with a failure return at all well.  In particular, it will IF_FREE aToken, set it to null, then set it to theToken, and AddToken theToken.... But theToken has nothing useful in it (we just bailed out of this code).  So what's up with that?
(Assignee)

Comment 9

13 years ago
(In reply to comment #7)
> should consider fixing the code here and removing the stripping there?  Or
> would that lead to less-efficient string-copying than we have now?

That's my fear. I think I'd rather leave the current code as-is, since it's the content sink that does the stripping.

(In reply to comment #8)
> >+        aScanner.Peek(aChar);
> You're removing this, right?

Yes.

> >+              if (aScanner.IsIncremental()) {
> >+                return result;
> 
> So... don't we need to reset the scanner position?  Or does something else
> handle it for us?

Yes, see the code in nsParser::Tokenize that calls nsITokenizer::ConsumeToken. If the parser returns an error, then it rewinds the scanner to the last mark. Marks are set at the end of each successfully consumed token.

> But theToken has nothing useful in it (we just bailed out of this code).
> So what's up with that?

D'oh, my |return result| should be a |break;|, so we'll bail out of the while loop and bind our substring to something useful. Want a new patch?
> If the parser returns an error, then it rewinds the scanner to the last mark.

Ok.

> so we'll bail out of the while loop and bind our substring to something useful.

But if the scanner is going to be rewound... should we really be binding here?  That will convert the error code to NS_OK on ConsumeText and we'll get doubled text, no?
(Assignee)

Comment 11

13 years ago
(In reply to comment #10)
> > so we'll bail out of the while loop and bind our substring to something useful.
> But if the scanner is going to be rewound... should we really be binding here? 
> That will convert the error code to NS_OK on ConsumeText and we'll get doubled
> text, no?

No, because CTextToken::Consume only returns to nsHTMLTokenizer::ConsumeText, which will convert the return value immediately, so the parser will never see it, and it will mark the scanner right before the carriage return. Therefore, when the next chunk of data comes in, we'll start reading at the carriage return and do the Right Thing (TM).
Comment on attachment 209368 [details] [diff] [review]
Treat CR-LF pairs as pairs

Ah, I see.  Can you post a patch with the various things fixed so I can do some more mental modeling of parser?  ;)
Attachment #209368 - Flags: review?(bzbarsky) → review-
(Assignee)

Comment 13

13 years ago
Created attachment 209480 [details] [diff] [review]
Updated patch

I made one large change to this patch: I no longer set result to NS_OK in CTextToken::Consume. This is a micro optimization so that we don't go around the loop again if we found EOF.
Attachment #209368 - Attachment is obsolete: true
Attachment #209480 - Flags: review?(bzbarsky)
(Assignee)

Updated

13 years ago
Attachment #209480 - Flags: superreview?(brendan)
(Assignee)

Updated

13 years ago
Whiteboard: [patch]
Comment on attachment 209480 [details] [diff] [review]
Updated patch

Sure, rs=me.

/be
Attachment #209480 - Flags: superreview?(brendan) → superreview+
(Assignee)

Comment 15

13 years ago
Fix checked into trunk.
Status: ASSIGNED → RESOLVED
Last Resolved: 13 years ago
Resolution: --- → FIXED

Comment 16

13 years ago
Thanks. Fix works for me on trunk. Any chance of getting this into 1.8 as well? Reports from distributed proofreaders show this is occurring on firefox 1.5 (and 1.0.7) too.
(Reporter)

Comment 17

12 years ago
This issue is still present in Firefox 2.0a1 (PortableFirefox)
(Assignee)

Updated

12 years ago
Attachment #209480 - Flags: approval-branch-1.8.1?(jst)
Yes; it hasn't been checked on branch.   See the flags on the patch attachment.

Updated

12 years ago
Attachment #209480 - Flags: approval-branch-1.8.1?(jst) → approval-branch-1.8.1+
(Assignee)

Comment 19

12 years ago
Fix checked into the MOZILLA_1_8_BRANCH. Marking this verified on trunk per comment 16.
Status: RESOLVED → VERIFIED
Keywords: fixed1.8.1
You need to log in before you can comment on or make changes to this bug.