Closed Bug 533890 Opened 11 years ago Closed 11 years ago

Need to figure out encoding/decoding of patches in qimportbz

Categories

(Developer Services :: Mercurial: qimportbz, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: robarnold, Assigned: sgautherie)

References

Details

(Keywords: crash)

Attachments

(4 files, 2 obsolete files)

I thought we had it down but crowder gave me this trace:
$ hg qimport bz:502933
Fetching...done
Parsing...** unknown exception encountered, details follow
** report bug details to http://www.selenic.com/mercurial/bts
** or mercurial@selenic.com
** Mercurial Distributed SCM (version 1.2.1)
** Extensions loaded: extdiff, graphlog, mq, win32text, qimportbz
Traceback (most recent call last):
  File "hg", line 27, in <module>
  File "mercurial\dispatch.pyc", line 16, in run
  File "mercurial\dispatch.pyc", line 25, in dispatch
  File "mercurial\dispatch.pyc", line 41, in _runcatch
  File "mercurial\dispatch.pyc", line 372, in _dispatch
  File "mercurial\dispatch.pyc", line 247, in runcommand
  File "mercurial\dispatch.pyc", line 417, in _runcommand
  File "mercurial\dispatch.pyc", line 377, in checkargs
  File "mercurial\dispatch.pyc", line 371, in <lambda>
  File "mercurial\util.pyc", line 718, in check
  File "mercurial\extensions.pyc", line 100, in wrap
  File "mercurial\util.pyc", line 718, in check
  File "c:\Documents and Settings\crowder\qimportbz\__init__.py", line 120, in qimporthook
    orig(ui, repo, *files, **opts)
  File "mercurial\util.pyc", line 718, in check
  File "hgext\mq.pyc", line 1686, in qimport
  File "hgext\mq.pyc", line 1600, in qimport
  File "mercurial\url.pyc", line 314, in open
  File "urllib2.pyc", line 387, in open
  File "c:\Documents and Settings\crowder\qimportbz\bzhandler.py", line 63, in bz_response
    bug = bz.Bug(self.ui, data)
  File "c:\Documents and Settings\crowder\qimportbz\bz.py", line 176, in __init__
    self.attachments = [Attachment.parse(self, a) for a in xml.findall("bug/attachment")]
  File "c:\Documents and Settings\crowder\qimportbz\bz.py", line 34, in parse
    return ctor(bug, node)
  File "c:\Documents and Settings\crowder\qimportbz\bz.py", line 72, in __init__
    self.data = fp.read().decode('utf-8')
  File "encodings\utf_8.pyc", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1063149-1063150: invalid data
Another example, with a smaller bug, (on Windows 2000):
{
$ hg qimport bz:531576

"UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2391-2393: invalid data"
}
(In reply to comment #1)
> "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2391-2393:

Ftr, testing with 'ascii' reports:
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 2391: ordinal not in range(128)"

To be explicit,
*that's the 'ã' character in "Dão".
*and that patch is an "ansi" version of the previously attached "utf-8" patch (which decodes correctly).
Can't do much more, as there is no way to know what the actual file character set would be :-<
Assignee: nobody → sgautherie.bz
Status: NEW → ASSIGNED
Attachment #420704 - Flags: review?(tellrob)
Blocks: 502933, 531576
Severity: normal → major
Comment on attachment 420704 [details] [diff] [review]
(Av1) Discard patch file on UnicodeDecodeError, Improve console outputs.

>-      self.data = fp.read().decode('utf-8')
>+      try:
>+        self.data = fp.read().decode('utf-8')
>+      except UnicodeDecodeError:
>+        bug.settings.ui.warn("Patch id=%s desc=\"%s\" file was discarded:\n" % (self.id, self.desc))
>+        # Print the exception without its traceback.
>+        import sys
>+        sys.excepthook(sys.exc_info()[0], sys.exc_info()[1], None)
>+        # Can't do better than discard data:
>+        # trying |.decode('utf-8', 'replace')| as a fallback would be too risky
>+        #   if user imports the result then forgets to fix it.
>+        self.data = ''

I don't like the import sys in the middle here. Also, can we try decoding in other formats? What is the testcase here and what is its encoding? A latin-1/ascii decoding seems like a common mis-encoding for a patch (we could convert it to utf-8 for them with a warning).

The rest of the changes in the patch are unrelated to this bug (and I'm not sure I like them).
Attachment #420704 - Flags: review?(tellrob) → review-
Av1, with comment 4 suggestion(s).

(In reply to comment #4)

> I don't like the import sys in the middle here. Also, can we try decoding in

Moved to the start of the method.

> other formats? What is the testcase here and what is its encoding? A
> latin-1/ascii decoding seems like a common mis-encoding for a patch (we could
> convert it to utf-8 for them with a warning).

Testcase is comment 1 and 2.
BugZilla doesn't know what the exact encoding is: that's the issue!
(ascii is a subpart of utf-8 and causes (nor solves) no problem.)
I don't know of a sure way to select the correct ISO-8859-* variant...

Anyway, this patch is meant to stop aborting and let proceed with the other patches.
As a follow-up patch, you could add a charset override feature, for example.

> The rest of the changes in the patch are unrelated to this bug (and I'm not
> sure I like them).

I filed bug 540164...
Attachment #420704 - Attachment is obsolete: true
Attachment #422005 - Flags: review?(tellrob)
Comment on attachment 422005 [details] [diff] [review]
(Av2) Discard patch file on UnicodeDecodeError

>From: Serge Gautherie <sgautherie.bz@free.fr>
>
>Bug 533890 - Need to figure out encoding/decoding of patches in qimportbz;
>(Av2) Discard patch file on UnicodeDecodeError.
>
>diff --git a/bz.py b/bz.py
>--- a/bz.py
>+++ b/bz.py
>@@ -77,25 +77,35 @@ class Flag(object):
> 
>   # Compare by flag name
>   def __cmp__(self, other):
>     flagorder = ['r', 'sr', 'ui-r', 'a', 'c']
>     return cmp(flagorder.index(self.abbrev()), flagorder.index(other.abbrev()))
> 
> class Patch(Attachment):
>   def __init__(self, bug, node):
>+    import sys

I actually meant that this should be at the top level of the file.

>     Attachment.__init__(self, bug, node)
>     self.flags = list(sorted(Flag(bug, n) for n in node.findall('flag')))
>     rawtext = base64.b64decode(node.find('data').text)
>     data = StringIO.StringIO(rawtext)
>     filename, message, user, date, branch, nodeid, p1, p2 = patch.extract(bug.settings.ui, data)
>     # for some reason, patch.extract writes a temporary file with the diff hunks
>     if filename:
>       fp = file(filename)
>-      self.data = fp.read().decode('utf-8')
>+      try:
>+        self.data = fp.read().decode('utf-8')
>+      except UnicodeDecodeError:
>+        bug.settings.ui.warn("Patch id=%s desc=\"%s\" file was discarded:\n" % (self.id, self.desc))
>+        # Print the exception without its traceback.
>+        sys.excepthook(sys.exc_info()[0], sys.exc_info()[1], None)
>+        # Can't do better than discard data:
>+        # trying |.decode('utf-8', 'replace')| as a fallback would be too risky
>+        #   if user imports the result then forgets to fix it.
>+        self.data = ''

How do you feel about prompting the user for the text? We could provide the url for them and throw up an editor via HG's API.
Av2, with comment 6 suggestion(s).

(In reply to comment #6)

> I actually meant that this should be at the top level of the file.

Done, with some reordering too.

> How do you feel about prompting the user for the text? We could provide the url

The qbz user could only guess at what the charset the patch author used is :-|

> for them and throw up an editor via HG's API.

From my experience (but ymmv), the case (does exist but) is rare and of low value:
if one is really interested in the problematic patch, one can either download+fix it "manually" or just ask the author to attach a fixed version (= fix it at the source).
Attachment #422005 - Attachment is obsolete: true
Attachment #427248 - Flags: review?(tellrob)
Attachment #422005 - Flags: review?(tellrob)
(In reply to comment #7)
> Created an attachment (id=427248) [details]
> (Av2a) Discard patch file on UnicodeDecodeError
> 
> Av2, with comment 6 suggestion(s).
> 
> (In reply to comment #6)
> 
> > I actually meant that this should be at the top level of the file.
> 
> Done, with some reordering too.
> 
> > How do you feel about prompting the user for the text? We could provide the url
> 
> The qbz user could only guess at what the charset the patch author used is :-|

Well I figure that Firefox will decode it correctly and they can just copy & paste (probably some unicode).

> > for them and throw up an editor via HG's API.
> 
> From my experience (but ymmv), the case (does exist but) is rare and of low
> value:
> if one is really interested in the problematic patch, one can either
> download+fix it "manually" or just ask the author to attach a fixed version (=
> fix it at the source).

I think it at least makes the download&fix case easier.
(In reply to comment #8)

> Well I figure that Firefox will decode it correctly and they can just copy &
> paste (probably some unicode).

I assume the browser downloads the file "as is": I don't think it can guess/fix the (missing) charset either.
But anyway.

> I think it at least makes the download&fix case easier.

I understood you: you asked for my feeling, I gave you my opinion.
Then, I already replied in comment 5 that my patch cares about the exception fix only/first and leaves enabling an enhanced behavior to someone who cares about it (as you do and as I don't atm).
Attachment #427248 - Flags: review?(tellrob) → review+
Comment on attachment 427248 [details] [diff] [review]
(Av2a) Discard patch file on UnicodeDecodeError
[Checkin: Comment 10]


http://hg.mozilla.org/users/robarnold_cmu.edu/qimportbz/rev/0fa47807ab5e
Attachment #427248 - Attachment description: (Av2a) Discard patch file on UnicodeDecodeError → (Av2a) Discard patch file on UnicodeDecodeError [Checkin: Comment 10]
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Case is: hg qimport -p bz:213932

NB: I assume 'message' should need the same fix too ... but I prefer to wait for an actual (test)case :-|
Attachment #436790 - Flags: review?(tellrob)
Blocks: 213932
Attachment #436790 - Flags: review?(tellrob) → review+
Comment on attachment 436790 [details] [diff] [review]
(Bv1) Discard user data on UnicodeDecodeError, Some rewrite and documentation too
[Checkin: Comment 12]


http://hg.mozilla.org/users/robarnold_cmu.edu/qimportbz/rev/518ac050d9fa
Attachment #436790 - Attachment description: (Bv1) Discard user data on UnicodeDecodeError, Some rewrite and documentation too → (Bv1) Discard user data on UnicodeDecodeError, Some rewrite and documentation too [Checkin: Comment 12]
Testcase: hg qimport -p bz:533890
Attachment #440976 - Flags: review?(tellrob)
Keywords: crash
Severity: major → critical
Blocks: 535674
Attachment #440976 - Flags: review?(tellrob) → review+
Comment on attachment 440976 [details] [diff] [review]
(Cv1) Discard message data on UnicodeDecodeError
[Checkin: Comment 15+16]


http://hg.mozilla.org/users/robarnold_cmu.edu/qimportbz/rev/3a0e20e87a1c
Attachment #440976 - Attachment description: (Cv1) Discard message data on UnicodeDecodeError → (Cv1) Discard message data on UnicodeDecodeError [Checkin: Comment 15]
Comment on attachment 440976 [details] [diff] [review]
(Cv1) Discard message data on UnicodeDecodeError
[Checkin: Comment 15+16]


+
http://hg.mozilla.org/users/robarnold_cmu.edu/qimportbz/rev/6bb40df6000e
(Dv1) Fix patch Cv1 (empty message) regression.
Attachment #440976 - Attachment description: (Cv1) Discard message data on UnicodeDecodeError [Checkin: Comment 15] → (Cv1) Discard message data on UnicodeDecodeError [Checkin: Comment 15+16]
Product: Other Applications → Developer Services
You need to log in before you can comment on or make changes to this bug.