Closed Bug 567660 Opened 10 years ago Closed 9 years ago

cfx xpi can't produce install.rdf with non-ASCII characters

Categories

(Add-on SDK Graveyard :: General, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: asqueella, Assigned: avarma)

References

Details

Attachments

(3 files)

If you have non-ASCII data to be written to install.rdf, "cfx xpi" fails with the following trace:

Exporting extension to test.xpi.
Traceback (most recent call last):
  File "/Users/nickolay/dev/jetpack-sdk/bin/cfx", line 6, in <module>
    cuddlefish.run()
  File "/Users/nickolay/dev/jetpack-sdk/python-lib/cuddlefish/__init__.py", line 541, in run
    xpts=xpts)
  File "/Users/nickolay/dev/jetpack-sdk/python-lib/cuddlefish/xpi.py", line 10, in build_xpi
    open('.install.rdf', 'w').write(str(manifest))
  File "/Users/nickolay/dev/jetpack-sdk/python-lib/cuddlefish/rdf.py", line 11, in __str__
    self.dom.writexml(buf, encoding="utf-8")
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 1749, in writexml
    node.writexml(writer, indent, addindent, newl)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 817, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 817, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 817, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 1036, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 303, in _write_data
    writer.write(data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017b' in position 0: ordinal not in range(128)

I originally got this when I had:
    <em:translator>&#x017B;</em:translator>
in the template install.rdf, but you can get the same effect by using \u escapes in package.json, e.g.:
{
    "author": "\u017b",
    ...
}
Duplicate of this bug: 573540
Attached patch patchSplinter Review
Oops, my bad. This patch fixes the bug... we also use StringIO instead of cStringIO because the latter doesn't support unicode string buffers.
Attachment #452881 - Flags: review?(warner-bugzilla)
Comment on attachment 452881 [details] [diff] [review]
patch

I'm not too familiar with xml.dom.minidom . Does the encoding= argument to writexml() merely set an attribute, and not actually affect the way writexml() encodes its output? Equivalently, does writexml() write unicode into the buf object?

If so, this patch looks good. If not, I'd be worried about double-encoding.
Attachment #452881 - Flags: review?(warner-bugzilla) → review+
oh, of course, it'd be nice to have a unit test for this.
Atul: can you land this?
Assignee: nobody → avarma
Keywords: checkin-needed
Comment on attachment 470087 [details] [diff] [review]
patch_with_unit_test

Here's a version of the patch w/ a unit test. Indeed, the 'encoding' kwarg to dom.writexml() just sets the value of the eponymous attribute on the xml header, and the actual unicode object returned isn't encoded to that encoding, so we have to do that ourselves.
Attachment #470087 - Flags: review?(warner-bugzilla)
Removed 'checkin-needed' since I've busted a patch w/ unit test...
Status: NEW → ASSIGNED
Keywords: checkin-needed
Comment on attachment 470087 [details] [diff] [review]
patch_with_unit_test

I like it. The one suggestion I'd make is to add a comment into RDF.__str__
to explain the apparent double-encoding there, something like:

 # real files have an .encoding attribute and use it when you write() unicode
 # into them: they read()/write() unicode and put encoded bytes in the
 # backend file. StringIO objects read()/write() unicode and put unicode in
 # the backing store, so we must encode the output of getvalue() to get a
 # bytestring. (cStringIO objects are weirder: they effectively have
 # .encoding hardwired to "ascii" and put only bytes in the backing store, so
 # we can't use them here).
 #
 # The encoding= argument to dom.writexml() merely sets the XML header's
 # encoding= attribute. It still writes unencoded unicode to the output file,
 # so we have to encode it for real afterwards.
Attachment #470087 - Flags: review?(warner-bugzilla) → review+
Good idea, I've added the comment and pushed:

Bug 567660 - cfx xpi can't produce install.rdf with non-ASCII characters
Atul Varma [:atul]
http://hg.mozilla.org/labs/jetpack-sdk/rev/479b348157dc
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Duplicate of this bug: 592270
I was wondering why you use StringIO at all and not just dom.toxml(encoding="utf-8")? And is the differing behaviour between StringIO and cStringIO really intended and can be expected to stay that way?
The Add-on SDK is no longer a Mozilla Labs experiment and has become a big enough project to warrant its own Bugzilla product, so the "Add-on SDK" product has been created for it, and I am moving its bugs to that product.

To filter bugmail related to this change, filter on the word "looptid".
Component: Jetpack SDK → General
Product: Mozilla Labs → Add-on SDK
QA Contact: jetpack-sdk → general
Version: Trunk → unspecified
> I was wondering why you use StringIO at all and not just
> dom.toxml(encoding="utf-8")? And is the differing behaviour between
> StringIO and cStringIO really intended and can be expected to stay that
> way?

panzi: good points. From what I figured out, dom.toxml(encoding=) doesn't
actually affect the encoding at all: it merely sets an attribute in the XML
header, so we need some mechanism to get from our unicode data to the UTF-8
bytes in the output.

The behavioral differences between StringIO and cStringIO are surprising, and
probably a result of the optimized cStringIO version not being as
general-purpose as the original. In most cases, I'd expect the pure-Python
original (StringIO) to be canonical, for for alternative implementations to
try and converge on its behavior. So for our purposes, since StringIO's
performance is plenty, I think we're best off sticking to StringIO.
You need to log in before you can comment on or make changes to this bug.