Closed Bug 465572 Opened 14 years ago Closed 14 years ago
[silme] Improve BOM handling in io/file
As discussed with Axel, we need to handle BOM even if the file would be properly loaded as 'utf_8'. This patch also adds removing of other BOMs which were previously left untouched.
Component: is / Icelandic → Infrastructure
QA Contact: icelandic.is → infrastructure
I'm sorry for the bugspam...
Assignee: nobody → akalla
IMHO we really shouldn't removing any content of source files and fail silently with BOMs - this is bad idea, example: bom sesitive file format/app/buildsystem. BOMs should be handled on a per app/file format level.
Stef: we are only removing BOMs if they exist and the BOMs will be recreated when the object will be written to file (see the "writeToFile()" method). So looking at the encoding parameter you always see if the object had a BOM, and because of that no BOM gets lost (if in a 'utf_8' text a BOM will be detected, the encoding parameter will be changed to 'utf_8_sig').
In some situations, BOM in UTF-8 encoded file will be a problem and should raise error (or add to logger) with optional skip bomsign rule (to load the rest of a file) - this is the silent aspect I'm writing about, not the strips when (re)writing file. This is also wrong place, especially to hardcode, such default format independent rules.
changes compared with v1: in getSourceWithEncoding(): -delete a BOM only if it really exists -if the encoding was 'utf_8_sig', but no BOM was found, change the encoding to 'utf_8' in getSource(): -if after finding a BOM reading fails, raise an UnicodeError Regarding Stef's comment: I'm not sure what's the best solution for that problem... One option would be to add a new attribute "strict" to parsers and have something like that: (...) def getSource(cls, path, encoding=None, fallback=None, strict=False): if encoding is not None: output = cls.getSourceWithEncoding(path, encoding) if strict and encoding != output: raise Exception('The output is not encoded in ' + encoding + \ ' like expected, but in ' + output) return output (...)
don't we already cover the strict issue with encoding=''? If encoding is different than the given one we throw an Exception. If someone wants to TRY encoding but not fail if it's not the one he should use fallback. Am I missing sonething?
patch changed as discussed on IRC with Gandalf. The problem with the existing implementation is, that even if you do 'encoding='utf_8', reading will NOT fail if the encoding is in fact 'utf_8_sig'. This patch tries to solve that problem.
Attachment #348915 - Attachment is obsolete: true
parallel working on multiple repos is hard... Here the patch with a few additional crash fixes.
Attachment #349265 - Attachment is obsolete: true
changed default fallback value from "None" to "", because fallback has to be a list
Attachment #349266 - Attachment is obsolete: true
Attachment #349386 - Flags: review?(gandalf)
Attachment #349386 - Flags: review?(gandalf) → review-
Comment on attachment 349386 [details] [diff] [review] v2 patch adding 'fallback' attribute to parsers you should not put param= like this if it's not static because it will be shared between all instances. You need __init__ for this.
fixed a few short problems
fixed with todays checkin http://hg.mozilla.org/users/zbraniecki_mozilla.com/silme/rev/50b2918343a7
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.