Open Bug 1453945 Opened 3 years ago Updated 2 years ago

[mozprofile] Add Unicode support to "mozprofile.diff()"

Categories

(Testing :: Mozbase, defect, P3)

defect

Tracking

(Not tracked)

People

(Reporter: whimboo, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

https://treeherder.mozilla.org/logviewer.html#?job_id=173459232&repo=try&lineNumber=998

[task 2018-04-13T09:21:19.397Z]  2:04.32 =================================== FAILURES ===================================
[task 2018-04-13T09:21:19.398Z]  2:04.32 ______________________ TestProfilePrint.test_profile_diff ______________________
[task 2018-04-13T09:21:19.398Z]  2:04.32 
[task 2018-04-13T09:21:19.398Z]  2:04.32 self = <test_profile_view.TestProfilePrint testMethod=test_profile_diff>
[task 2018-04-13T09:21:19.398Z]  2:04.32 
[task 2018-04-13T09:21:19.400Z]  2:04.32     def test_profile_diff(self):
[task 2018-04-13T09:21:19.400Z]  2:04.32         profile1 = mozprofile.Profile()
[task 2018-04-13T09:21:19.401Z]  2:04.32         profile2 = mozprofile.Profile(preferences=dict(foo='bar'))
[task 2018-04-13T09:21:19.401Z]  2:04.32 
[task 2018-04-13T09:21:19.402Z]  2:04.32         # diff a profile against itself; no difference
[task 2018-04-13T09:21:19.402Z]  2:04.32         self.assertEqual([], mozprofile.diff(profile1, profile1))
[task 2018-04-13T09:21:19.405Z]  2:04.32 
[task 2018-04-13T09:21:19.405Z]  2:04.32         # diff two profiles
[task 2018-04-13T09:21:19.405Z]  2:04.32 >       diff = dict(mozprofile.diff(profile1, profile2))
[task 2018-04-13T09:21:19.406Z]  2:04.32 
[task 2018-04-13T09:21:19.406Z]  2:04.33 testing/mozbase/mozprofile/tests/test_profile_view.py:65:
[task 2018-04-13T09:21:19.406Z]  2:04.33 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[task 2018-04-13T09:21:19.407Z]  2:04.33 testing/mozbase/mozprofile/mozprofile/diff.py:51: in diff
[task 2018-04-13T09:21:19.407Z]  2:04.33     section_diff = list(diff_function(value, other, profile1.profile, profile2.profile))
[task 2018-04-13T09:21:19.407Z]  2:04.33 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[task 2018-04-13T09:21:19.408Z]  2:04.33 
[task 2018-04-13T09:21:19.408Z]  2:04.33 a = [], b = ['foo: bar'], fromfile = '/tmp/tmpr8U2zJ.mozrunner\U0001f36a'
[task 2018-04-13T09:21:19.408Z]  2:04.33 tofile = '/tmp/tmpbDjJvy.mozrunner\U0001f36a', fromfiledate = '', tofiledate = '', n = 3
[task 2018-04-13T09:21:19.408Z]  2:04.33 lineterm = '\n'
[task 2018-04-13T09:21:19.408Z]  2:04.33 
[task 2018-04-13T09:21:19.409Z]  2:04.33     def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
[task 2018-04-13T09:21:19.409Z]  2:04.33                      tofiledate='', n=3, lineterm='\n'):
[task 2018-04-13T09:21:19.409Z]  2:04.33         r"""
[task 2018-04-13T09:21:19.409Z]  2:04.33         Compare two sequences of lines; generate the delta as a unified diff.
[task 2018-04-13T09:21:19.409Z]  2:04.33 
[task 2018-04-13T09:21:19.410Z]  2:04.33         Unified diffs are a compact way of showing line changes and a few
[task 2018-04-13T09:21:19.410Z]  2:04.33         lines of context.  The number of context lines is set by 'n' which
[task 2018-04-13T09:21:19.410Z]  2:04.33         defaults to three.
[task 2018-04-13T09:21:19.410Z]  2:04.33 
[task 2018-04-13T09:21:19.411Z]  2:04.33         By default, the diff control lines (those with ---, +++, or @@) are
[task 2018-04-13T09:21:19.411Z]  2:04.33         created with a trailing newline.  This is helpful so that inputs
[task 2018-04-13T09:21:19.412Z]  2:04.33         created from file.readlines() result in diffs that are suitable for
[task 2018-04-13T09:21:19.412Z]  2:04.33         file.writelines() since both the inputs and outputs have trailing
[task 2018-04-13T09:21:19.412Z]  2:04.33         newlines.
[task 2018-04-13T09:21:19.413Z]  2:04.33 
[task 2018-04-13T09:21:19.413Z]  2:04.33         For inputs that do not have trailing newlines, set the lineterm
[task 2018-04-13T09:21:19.413Z]  2:04.33         argument to "" so that the output will be uniformly newline free.
[task 2018-04-13T09:21:19.413Z]  2:04.33 
[task 2018-04-13T09:21:19.413Z]  2:04.33         The unidiff format normally has a header for filenames and modification
[task 2018-04-13T09:21:19.413Z]  2:04.33         times.  Any or all of these may be specified using strings for
[task 2018-04-13T09:21:19.414Z]  2:04.33         'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'.
[task 2018-04-13T09:21:19.414Z]  2:04.33         The modification times are normally expressed in the ISO 8601 format.
[task 2018-04-13T09:21:19.414Z]  2:04.33 
[task 2018-04-13T09:21:19.414Z]  2:04.33         Example:
[task 2018-04-13T09:21:19.414Z]  2:04.33 
[task 2018-04-13T09:21:19.414Z]  2:04.33         >>> for line in unified_diff('one two three four'.split(),
[task 2018-04-13T09:21:19.414Z]  2:04.33         ...             'zero one tree four'.split(), 'Original', 'Current',
[task 2018-04-13T09:21:19.414Z]  2:04.33         ...             '2005-01-26 23:30:50', '2010-04-02 10:20:52',
[task 2018-04-13T09:21:19.414Z]  2:04.33         ...             lineterm=''):
[task 2018-04-13T09:21:19.414Z]  2:04.33         ...     print line                  # doctest: +NORMALIZE_WHITESPACE
[task 2018-04-13T09:21:19.414Z]  2:04.33         --- Original        2005-01-26 23:30:50
[task 2018-04-13T09:21:19.414Z]  2:04.33         +++ Current         2010-04-02 10:20:52
[task 2018-04-13T09:21:19.415Z]  2:04.33         @@ -1,4 +1,4 @@
[task 2018-04-13T09:21:19.415Z]  2:04.33         +zero
[task 2018-04-13T09:21:19.415Z]  2:04.33          one
[task 2018-04-13T09:21:19.415Z]  2:04.33         -two
[task 2018-04-13T09:21:19.415Z]  2:04.33         -three
[task 2018-04-13T09:21:19.415Z]  2:04.33         +tree
[task 2018-04-13T09:21:19.415Z]  2:04.33          four
[task 2018-04-13T09:21:19.415Z]  2:04.33         """
[task 2018-04-13T09:21:19.415Z]  2:04.33 
[task 2018-04-13T09:21:19.415Z]  2:04.33         started = False
[task 2018-04-13T09:21:19.415Z]  2:04.33         for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
[task 2018-04-13T09:21:19.415Z]  2:04.33             if not started:
[task 2018-04-13T09:21:19.415Z]  2:04.33                 started = True
[task 2018-04-13T09:21:19.415Z]  2:04.33                 fromdate = '\t{}'.format(fromfiledate) if fromfiledate else ''
[task 2018-04-13T09:21:19.415Z]  2:04.33                 todate = '\t{}'.format(tofiledate) if tofiledate else ''
[task 2018-04-13T09:21:19.415Z]  2:04.33 >               yield '--- {}{}{}'.format(fromfile, fromdate, lineterm)
[task 2018-04-13T09:21:19.416Z]  2:04.33 E               UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001f36a' in position 24: ordinal not in range(128)
So what we make use of here by default is `difflib.unified_diff()`, and that method doesn't format the parameters into a Unicode string:

>            yield '--- {}{}{}'.format(fromfile, fromdate, lineterm)
>            yield '+++ {}{}{}'.format(tofile, todate, lineterm)

Andrew, given that this external method is small enough, do you think we can just fork it into mozprofile's diff.py? Or do you know of another helper which we could make use of, and which is Unicode aware?
Flags: needinfo?(ahalberstadt)
What if we feed it unicode to begin with? It looks like it'll just operate on whatever types we give it.

Would you mind uploading your test case?
Flags: needinfo?(ahalberstadt)
Attached file testcase
(In reply to Andrew Halberstadt [:ahal] from comment #2)
> What if we feed it unicode to begin with? It looks like it'll just operate
> on whatever types we give it.

Well, feeding a Unicode string as we do in this case will fail because `unified_diff` uses `format()`, and this method expects the target to also be a Unicode literal, and this is not the case here:

> yield '+++ {}{}{}'.format(tofile, todate, lineterm)

As such using a Unicode string causes:

> UnicodeEncodeError: 'ascii' codec can't encode characters in position 58-59: ordinal not in range(128) 

We could encode the profile path and other arguments in `diff.py` as UTF-8, which will allow this method to work, but then we fail in mozprofile's `diff.py`:

> retval.append((key, '\n'.join(section_diff)))

The thrown exception:

> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 29: ordinal not in range(128)

> Would you mind uploading your test case?

Please see the attached minimized testcase. I really don't see a way without forking. Please let me know what you think.
Flags: needinfo?(ahalberstadt)
Hm, maybe we could use `from __future__ import unicode_literals` in all of our mozbase packages? It might be a good first step, but not sure if that helps us here completely.
(In reply to Henrik Skupin (:whimboo) from comment #4)
> We could encode the profile path and other arguments in `diff.py` as UTF-8,
> which will allow this method to work, but then we fail in mozprofile's
> `diff.py`:
> 
> > retval.append((key, '\n'.join(section_diff)))
> 
> The thrown exception:
> 
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 29: ordinal not in range(128)

Ok, so I think the issue isn't with difflib. I did a bit of Googling and other people were also raising similar issues, but they all got WONTFIXED due to it being the consumer's responsibility to pass in the appropriate types being diffed.

So I think we need to A) be consistent with the parameters we pass into difflib (make sure they are all unicode strings), and B) utf-8 encode the result before manipulating/returning it. This way it will leave as a str, though other parts of mozbase might also break when using it, but at least the diff part will be fixed.
Flags: needinfo?(ahalberstadt)
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.