
Guess encoding of non-UTF8 app manifest files



Developer Pages
6 years ago


(Reporter: kumar, Assigned: robhudson)


Mac OS X



(1 attachment)

When a non-UTF8 manifest file is submitted (see bug 780823) and the server does not specify a charset in the header (bug 754487), we need to guess the encoding. We can do this with the chardet module, which is already included in zamboni.
Created attachment 658296 [details]
Example of non-UTF8 app manifest

Example of fixing with chardet:

In [2]: mf = open('/Users/kumar/tmp/w2mo.webapp').read()
In [3]: mf.decode('utf8')
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (2, 0))

UnicodeDecodeError                        Traceback (most recent call last)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 41: unexpected code byte

In [4]: import chardet
In [5]: chardet.detect(mf)
Out[5]: {'confidence': 0.55246051025679954, 'encoding': 'ISO-8859-2'}
In [6]: mf.decode('ISO-8859-2')
Out[6]: u'{  \r\n  "version": "3.8",\r\n  "name": "W2MO\u017d",\r\n...'
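
The session above is Python 2. A rough Python 3 equivalent might look like the sketch below: try UTF-8 first and only fall back to a guess when that fails. The function name `decode_manifest`, the `detect` parameter, and the 0.5 confidence threshold are assumptions made for illustration, not anything in zamboni; `chardet.detect` is the same call used in the session above.

```python
def decode_manifest(raw, detect=None, min_confidence=0.5):
    """Decode manifest bytes: strict UTF-8 first, then a guessed encoding.

    `detect` is any callable returning a dict with 'encoding' and
    'confidence' keys (the shape chardet.detect returns). The threshold
    is an arbitrary illustrative value.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    if detect is None:
        # Third-party dependency, imported lazily so the UTF-8 path
        # works without it.
        import chardet
        detect = chardet.detect

    guess = detect(raw)
    encoding = guess.get("encoding")
    if not encoding or guess.get("confidence", 0.0) < min_confidence:
        raise ValueError("could not guess manifest encoding")
    # The guess can still be wrong; this only guarantees *a* decoding.
    return raw.decode(encoding)
```

Note that a successful decode here does not mean the text is right: in the session above, the guessed ISO-8859-2 turns byte 0xae into "Ž".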

Comment 2

6 years ago
We're not already doing this?
Oh, that's good! We just need to fix up all the places that aren't already doing that. escape_all() in bug 780823 might not be the only place.
Rob: can you take a quick grep through the dev tools and see if there is anywhere else to do this?  Otherwise, please close.  Thanks.
Assignee: nobody → robhudson.mozbugs
Priority: -- → P4
Target Milestone: --- → 2012-11-01
Comment on attachment 658296 [details]
Example of non-UTF8 app manifest

This file has mixed encodings.

>  "name": "W2MO®",

This part is windows-1252.

>  "locales": {  
>    "de": {  
>      "description": "W2MO: Logistikoptimierung, 3D-Simulation, Personalplanung"    },
>    "en": {  
>      "description": "W2MO: Logistics 3D-Simulation, Optimization, Workforce Planning"
>    },  
>    "fr": {  
>      "description": "W2MO: Conception logistique, 3D-simulation, optimisation, planification du personnel"    },  
>    "nl": {  
>      "description": "W2MO: Logistiek planning. 3D Simulatie, optimalisatie, arbeidskrachten planning"    },  
>    "es": {  
>      "description": "W2MO: diseño logístico, simulación 3D, optimización, planificación de la plantilla"    },  
>    "pt-br": {  
>      "description": "W2MO: logística, simulação 3D e animação, otimização, planejamento de pessoal"    },  
>    "pt-pt": {  
>      "description": "W2MO: logística, simulação 3D e animação, otimização, planejamento de pessoal"    },  
>    "ru": {  
>      "description": "W2MO: Логистический дизайн, 3D-симуляция, оптимизация, кадровое планирование, расчет стоимости"    },  
>    "cn": {  
>      "description": "W2MO: Логістичний дизайн, 3D-симуляція, оптимізація, планування персоналу, калькуляція витрат"    }  
>  },  

But this part is UTF-8.
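
To illustrate why a file like this cannot be decoded correctly with any single encoding, here is a small sketch. The byte string is a hypothetical reconstruction (one windows-1252 byte for the registered sign, followed by UTF-8 text taken from the "es" description), not the actual attachment.

```python
# Hypothetical sample: "W2MO®" saved as windows-1252, the rest as UTF-8.
name_part = "W2MO®".encode("windows-1252")      # ® becomes the single byte 0xAE
desc_part = " diseño logístico".encode("utf-8")  # ñ and í become two bytes each
raw = name_part + desc_part


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Strict UTF-8 fails on the lone 0xAE byte.
as_utf8 = try_decode(raw, "utf-8")

# windows-1252 "succeeds", but the UTF-8 part comes out as mojibake.
as_cp1252 = try_decode(raw, "windows-1252")
```

Either choice is wrong for part of the file, which is the situation an encoding guesser cannot recover from.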

Comment 6

6 years ago
Per the manifest specification:

> The document must be UTF-8 in order for the app to be submitted to Firefox Marketplace. It is recommended to omit the byte order mark (BOM). Other encodings can be specified with a charset parameter on the Content-Type header (i.e. Content-Type: application/x-web-app-manifest+json; charset=ISO-8859-4), though this will not be respected by the Marketplace.
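
For reference, reading the charset parameter mentioned in the specification could be sketched with the standard library as follows. The helper name `charset_from_content_type` is made up for this example; it only shows how a consumer might read the parameter, not what the Marketplace does with it.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


declared = charset_from_content_type(
    "application/x-web-app-manifest+json; charset=ISO-8859-4"
)
undeclared = charset_from_content_type("application/x-web-app-manifest+json")
```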


The current submission flow will raise a validation error that says "Your manifest file was not encoded as valid UTF-8." if a non-UTF-8 manifest is provided.
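
A minimal sketch of that kind of check, assuming the validator works on the raw bytes of the manifest. The function name and the list-of-errors return shape are illustrative assumptions; only the message text comes from the submission flow described above.

```python
UTF8_ERROR = "Your manifest file was not encoded as valid UTF-8."


def validate_manifest_encoding(raw):
    """Return a list of validation error messages (empty when valid)."""
    try:
        raw.decode("utf-8")  # strict by default: any invalid byte raises
    except UnicodeDecodeError:
        return [UTF8_ERROR]
    return []
```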

I don't see any good reason to support non-UTF-8 manifests; doing so would only open a whole realm of bugs and implementation difficulties. We should do everything we can to ease the technical burdens placed on app consumers and other web app marketplaces. UTF-8 is not by any means an esoteric encoding, and guessing the encoding with chardet does not guarantee that the guess is accurate (especially with mixed encodings*), so I don't think we should venture down that path unless it becomes a pain point.

* Danny tried to copy from ISO-8859-4 into GB2312. Now his documeniIÈsQ}#Ր…­µÝ. Mixed Encodings: Not even once.
Last Resolved: 6 years ago
Resolution: --- → WONTFIX
If we show a validation warning, that's probably good enough. I think we do need to continue stripping BOMs because many editors on Windows add them. Not all devs are advanced enough to convert the encoding of their source files, so my main concern here is that we're making life hard by not supporting their valid non-UTF8 manifests. I guess we can wait to see if we get bugs and/or complaints about that.
I meant: as long as we show a validation error when the manifest is not UTF-8.

Comment 9

6 years ago
We do show a validation error, and we do continue to strip the BOM.
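
For anyone unfamiliar with the BOM handling, stripping a UTF-8 byte order mark before decoding can be sketched like this. The helper name `strip_bom` is hypothetical and this is not the zamboni implementation.

```python
import codecs


def strip_bom(raw):
    """Drop a leading UTF-8 byte order mark (EF BB BF), if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


with_bom = codecs.BOM_UTF8 + b'{"name": "W2MO"}'
cleaned = strip_bom(with_bom)
```

Decoding with the `utf-8-sig` codec has the same effect in one step, at the cost of making the stripping less explicit.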