UnknownDecoder uses a 1024-byte buffer. This leads to some text files being incorrectly recognized as HTML. We should consider using a smaller buffer; 128 or 256 bytes are probably the best candidates. Do many web pages have more than 128 characters of whitespace at the beginning?
You have to read past the first newline or 128 bytes to handle Unix #! interpreter lines (bug 110767). I have seen web pages with >70 blank lines at the beginning in a simple-minded attempt to hide code. With CRLF newlines, that's > 128 bytes. That would indicate at least 256 bytes. Maybe the decoder should be smarter.
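To make the arithmetic concrete, here is a minimal sketch, assuming a made-up page that starts with 70 blank CRLF lines. The helper name `window_has_content` is hypothetical and is not part of the decoder; it only checks whether a sniffing window of a given size sees anything other than whitespace.

```python
# Hypothetical page: 70 blank CRLF lines (2 bytes each), then markup.
LEADING_BLANKS = b"\r\n" * 70
page = LEADING_BLANKS + b"<html><body>hidden code</body></html>"


def window_has_content(data: bytes, size: int) -> bool:
    """Return True if the first `size` bytes hold any non-whitespace byte."""
    return bool(data[:size].strip())


print(len(LEADING_BLANKS))            # 140 bytes of leading whitespace
print(window_has_content(page, 128))  # False: the window is all whitespace
print(window_has_content(page, 256))  # True: the window reaches the markup
```

With a 128-byte window such a page gives the sniffer nothing to work with, which is the case for going to at least 256.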
OK... Does 256 sound like a reasonable compromise? Any hints on making the decoder smarter are much appreciated, btw. :)
Priority: -- → P4
Target Milestone: --- → mozilla1.0
A thought. Perhaps we should look for "<tagname" as the first non-whitespace text instead of just anywhere in the 1024/256/whatever bytes?
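A rough sketch of that idea, in Python for brevity rather than as decoder code. The function name `starts_with_tag`, the 256-byte default window, and the exact pattern are all assumptions made for illustration; in particular, this version only accepts "<" followed by an ASCII letter, so a leading doctype or comment would not match.

```python
import re

# Assumed pattern: optional ASCII whitespace, then "<" followed by a letter,
# i.e. something that looks like the start of a tag name.
_LEADING_TAG = re.compile(rb"\s*<[A-Za-z]")


def starts_with_tag(data: bytes, window: int = 256) -> bool:
    """Return True if the first non-whitespace text in the window is '<tagname'."""
    return _LEADING_TAG.match(data[:window]) is not None


print(starts_with_tag(b"  \r\n<html><head>"))         # True
print(starts_with_tag(b"if (a <b) then c = 1\n"))     # False: "<b" is not first
print(starts_with_tag(b"\r\n" * 70 + b"<html>", 128)) # False: window is all blank
```

Anchoring the match at the start means a stray "<b" in the middle of a text file no longer counts, but the result still depends on the window being large enough to reach the first tag.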
I would think 256 is probably fine unless some Unix reads 256 bytes for the shebang hack. In that case, I would go for 512. Testing for <tag may be good, but beware of perversities.
    #!/bin/sh
    <foo cat >bar
    ...
is obviously a shell script, but without the first line it becomes
    <foo cat >bar
    ...
which could be hard to detect. I think the only proper solution is to look for a text-like distribution in the first n bytes, but that may be harder still. That is what the unknown decoder is trying to do now. It's just applying a simplistic statistical model.
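The perverse case can be reproduced with a small sketch. This is not the decoder's logic; `looks_like_html`, the "#!" shortcut, and the tag pattern are hypothetical choices made only to show where a leading-tag test goes wrong.

```python
import re

# Assumed pattern: optional ASCII whitespace, then "<" and a letter.
_TAG_START = re.compile(rb"\s*<[A-Za-z]")


def looks_like_html(data: bytes, window: int = 256) -> bool:
    """Guess HTML from a leading tag, unless the data opens with '#!'."""
    head = data[:window]
    if head.startswith(b"#!"):
        return False  # interpreter line: assume a script, not HTML
    return _TAG_START.match(head) is not None


script = b"#!/bin/sh\n<foo cat >bar\n"
body_only = script.split(b"\n", 1)[1]  # same script without the first line

print(looks_like_html(script))     # False
print(looks_like_html(body_only))  # True: a false positive
```

The interpreter line is the only thing keeping the script from being taken for markup, so a check of this kind is at best a heuristic.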
No time to work on this.
Target Milestone: mozilla1.0 → mozilla1.2
Resolving as a duplicate. While this bug calls for changes that are different from those in bug 126782, the changes affect the exact same thing. The technical discussion should occur in just one bug. If my thinking is off-base in resolving this as a duplicate, please reopen. *** This bug has been marked as a duplicate of 126782 ***
Status: NEW → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → DUPLICATE
-> file handling
Component: Networking → File Handling
QA Contact: benc → sairuh