Closed Bug 465572 Opened 16 years ago Closed 15 years ago

[silme] Improve BOM handling in io/


(Mozilla Localizations :: Infrastructure, defect)

Not set


(Not tracked)



(Reporter: adriank, Assigned: adriank)




(2 files, 5 obsolete files)

Attached patch patch v1 (obsolete) — Splinter Review
As discussed with Axel, we need to handle BOM even if the file would be properly loaded as 'utf_8'.

This patch also adds removing of other BOMs which were previously left untouched.
Attachment #348802 - Flags: review?(gandalf)
Component: is / Icelandic → Infrastructure
QA Contact: → infrastructure
I'm sorry for the bugspam...
Assignee: nobody → akalla
IMHO we really shouldn't removing any content of source files and fail silently with BOMs - this is bad idea, example: bom sesitive file format/app/buildsystem.

BOMs should be handled on a per app/file format level.
Stef: we are only removing BOMs if they exist and the BOMs will be recreated when the object will be written to file (see the "writeToFile()" method).

So looking at the encoding parameter you always see if the object had a BOM, and because of that no BOM gets lost (if in a 'utf_8' text a BOM will be detected, the encoding parameter will be changed to 'utf_8_sig').
In some situations, BOM in UTF-8 encoded file will be a problem and should raise error (or add to logger) with optional skip bomsign rule (to load the rest of a file) - this is the silent aspect I'm writing about, not the strips when (re)writing file.

This is also wrong place, especially to hardcode, such default format independent rules.
Attached patch (main) patch v2 (obsolete) — Splinter Review
changes compared with v1:

in getSourceWithEncoding():
-delete a BOM only if it really exists
-if the encoding was 'utf_8_sig', but no BOM was found, change the encoding to 'utf_8'

in getSource():
-if after finding a BOM reading fails, raise an UnicodeError

Regarding Stef's comment:
I'm not sure what's the best solution for that problem... One option would be to add a new attribute "strict" to parsers and have something like that:

  def getSource(cls, path, encoding=None, fallback=None, strict=False):
    if encoding is not None:
      output = cls.getSourceWithEncoding(path, encoding)
      if strict and encoding != output[1]:
        raise Exception('The output is not encoded in ' + encoding + \
                        ' like expected, but in ' + output[1])
      return output
Attachment #348802 - Attachment is obsolete: true
Attachment #348802 - Flags: review?(gandalf)
don't we already cover the strict issue with encoding=''?

If encoding is different than the given one we throw an Exception. If someone wants to TRY encoding but not fail if it's not the one he should use fallback.

Am I missing sonething?
Attached patch (main) patch v3 (obsolete) — Splinter Review
patch changed as discussed on IRC with Gandalf.

The problem with the existing implementation is, that even if you do 'encoding='utf_8', reading will NOT fail if the encoding is in fact 'utf_8_sig'.
This patch tries to solve that problem.
Attachment #348915 - Attachment is obsolete: true
Attached patch (main) patch v4 (obsolete) — Splinter Review
parallel working on multiple repos is hard...
Here the patch with a few additional crash fixes.
Attachment #349265 - Attachment is obsolete: true
changed default fallback value from "None" to "[]", because fallback has to be a list
Attachment #349266 - Attachment is obsolete: true
Attachment #349385 - Flags: review?(gandalf)
Attachment #349386 - Flags: review?(gandalf)
Attachment #349386 - Flags: review?(gandalf) → review-
Comment on attachment 349386 [details] [diff] [review]
v2 patch adding 'fallback' attribute to parsers

you should not put param=[] like this if it's not static because it will be shared between all instances.

You need __init__ for this.
Attached patch (main) patch v5Splinter Review
fixed a few short problems
Attachment #349385 - Attachment is obsolete: true
Attachment #349385 - Flags: review?(gandalf)
fixed with todays checkin
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.