Closed Bug 1384570 Opened 7 years ago Closed 7 years ago

[compare-locales] Merge/unify content from different branches in cross-channel localization

Categories

(Localization Infrastructure and Tools :: compare-locales, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: stas)

References

(Depends on 1 open bug)

Details

User Story

The content generation part of x-channel takes an ordered list of content, as byte strings.

It returns a byte string that has all localizable strings from each of the contents, and that is as close to the left-most content.

It's a reduce algorithm of N contents, where the first content is merged with the 2nd, and that content is merged with the 3rd, etc.

The resulting content will match the first content exactly if there's nothing to add.

Attachments

(1 file, 6 obsolete files)

This graph needs to be paired with the heads of the other repositories in the same DAG (m-c vs c-c), and then we need to replay that history, generating unified content like this:

Start with the most up-to-date version of the file in question, and then, usefully, add missing entities from the other revisions.

To create stable results, the revisions should be ordered, and we need to deal with the fact that any individiual file can be on any revision, or not.

The result is then commited with time, user, and description of the original commit.
Comment on attachment 8904318 [details]
Bug 1384570 - Part 1 - Merge localization content from multiple channels.

https://reviewboard.mozilla.org/r/176086/#review181214

::: cross-channel-l10n/mozxchannel/merge.py:44
(Diff revision 1)
> +
> +    contents = [(key, get_entity(key)) for _, key in diff]
> +    return OrderedDict(contents)
> +
> +
> +def serialize(entity):

I wonder if we'd be better off if the algorithm inspected the previous entity for trailing '\n' instead of post-fixing.

That way, we can also move a file without trailing newline through the algorithm unchanged.

::: cross-channel-l10n/tests/test-merge-whitespace.py:8
(Diff revision 1)
> +
> +import unittest
> +from mozxchannel.merge import merge_channels
> +
> +
> +class TestMergeWhitespace(unittest.TestCase):

I'd love to see a test case for 

"""

foo = Foo 1

baz = Baz 1

"""
and

"""

foo = Foo 1

bar = Bar 2

baz = Baz 1

"""

Result should be the second content.
Attachment #8904319 - Attachment is obsolete: true
Attachment #8904318 - Flags: review?(l10n)
Comment on attachment 8904539 [details]
Bug 1384570 - Part 2 - De-duplicate comments.

https://reviewboard.mozilla.org/r/176368/#review181718

I didn't look into this patch deeply, as I found a caveat about the core data structure in this algorithm.

::: cross-channel-l10n/mozxchannel/merge.py:15
(Diff revision 1)
>  
>  def merge_channels(name, *resources):
>      p = parser.getParser(name)
>  
> +    # A map of comments to the keys of entities they belong to.
> +    comments = {}

This maps comment strings to IDs, that won't work in practice, as we have files with duplicate comments.

http://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/aboutServiceWorkers.dtd is an example (I did search in the pontoon db to find that, ugh)
Comment on attachment 8904542 [details]
Bug 1384570 - Part 3 - Normalize whitespace in a non-destructive manner.

https://reviewboard.mozilla.org/r/176374/#review181726

I like this, just a pythonic nit.

::: cross-channel-l10n/mozxchannel/merge.py:95
(Diff revision 1)
>  
>  
> -def serialize_entity(entity):
> -    snippet = entity.all
> +def serialize_legacy_entity(acc, entity):
> +    # Ensure there's a newline at the end of the resource content so far before
> +    # we append the current entity's content.
> +    if acc and acc[-1] != "\n":

Nit, .endswith('\n')
Comment on attachment 8904318 [details]
Bug 1384570 - Part 1 - Merge localization content from multiple channels.

https://reviewboard.mozilla.org/r/176088/#review181212

I'm a bit sceptical on the fluent serializer, it seems to me that getting the whitespace EntityBase objects emitted in compare-locales would be the right thing to do in c-l to begin with, and we could get away with just one algorithm here?

::: cross-channel-l10n/mozxchannel/merge.py:48
(Diff revision 3)
> +
> +def serialize_legacy_resource(entities):
> +    return "".join(map(serialize_entity, entities.values()))
> +
> +
> +def serialize_fluent_resource(entities):

I'm wondering if the compare-locales FluentParser should just emit Whitespace objects, and then we can just have one algorithm?

Sounds like a sane thing to do for c-l anyway?
Blocks: 1397720
Depends on: 1397802
Assigning to stas, as he's working on this.

Also clarifying the user story, as most of that story was actually talking about graphs and such, while this bug is really just about the content.

Also clarifying that we'll use binary data in and out of the code here. That's what we get from hg and what we should write back to disk. Encodings should be handled on the content side that knows file formats etc.
Assignee: nobody → stas
User Story: (updated)
Comment on attachment 8904318 [details]
Bug 1384570 - Part 1 - Merge localization content from multiple channels.

https://reviewboard.mozilla.org/r/176088/#review183724

::: cross-channel-l10n/mozxchannel/merge.py:48
(Diff revision 3)
> +
> +def serialize_legacy_resource(entities):
> +    return "".join(map(serialize_entity, entities.values()))
> +
> +
> +def serialize_fluent_resource(entities):

I'm going to remove the support for Fluent for now since we don't need it for the initial x-channel implementation and I'd like to get it right. I filed bug 1399055 to track adding Fluent support to x-channel with dependecies set to bugs about the required changes to compare-locales Parsers.
Attachment #8904542 - Attachment is obsolete: true
Axel suggested that we move this code to compare-locales. Morphing the bug accordingly.
Status: NEW → ASSIGNED
Component: General → compare-locales
Product: Developer Services → Localization Infrastructure and Tools
Summary: Merge/unify content from different branches in cross-channel localization → [compare-locales] Merge/unify content from different branches in cross-channel localization
Attachment #8904318 - Attachment is obsolete: true
Attachment #8904318 - Flags: review?(l10n)
Attachment #8904539 - Attachment is obsolete: true
Attachment #8904539 - Flags: review?(l10n)
Attachment #8911257 - Attachment is obsolete: true
Attachment #8911257 - Flags: review?(l10n)
Attachment #8911258 - Attachment is obsolete: true
Attachment #8911258 - Flags: review?(l10n)
Comment on attachment 8912129 [details]
Bug 1384570 - Merge resources across channels.

https://reviewboard.mozilla.org/r/183502/#review191812

This is mostly good, though I only skimmed the patch on the surface so far. We talked so much about the actual code that I didn't spend much time looking for detail snags.

A couple of things, though.

Supporting the unsupported, notably, ftl, and things compare-locales doesn't understand. In the inline comments, I talk about raising a dedicated exception, which I could catch and ignore as a caller.
I also wonder if we shouldn't just return the first content. Though, maybe, in the actual l10n-merge use-case, we shouldn't? 'cause I think in that case we want to fall back to English for the whole file, and not to the l10n file. (which I expect to be "newer" or "left")
I'm rambling, it might be good to be explicit here, but to use an Error class that doesn't hide other things.

I noticed that there are no license headers on the tests, can you add those, too?

And I got a bit confused on when you encode in tests, and when you don't. My gut instinct is that we should always encode, that should reduce the number of places that we need to fix up for py3.

And as we're talking about encodings, please don't hard-code utf-8, but use parser.encoding instead. Yes, in practice this works, but we do have an existing abstraction, and we should use it. Ignore me here, as I'm saying refactor: Should we put encode and decode as methods on Parser?

::: compare_locales/merge.py:16
(Diff revision 2)
> +from compare_locales import parser as cl
> +from compare_locales.compare import AddRemove
> +
> +
> +def merge_channels(name, *resources):
> +    parser = cl.getParser(name)

Can you catch the UserWarning (bad me) this raises for unknown file formats, and raise a better one?

Like, MergeNotSupportedError, as a subclass of ValueError?

Also ...

::: compare_locales/merge.py:19
(Diff revision 2)
> +
> +def merge_channels(name, *resources):
> +    parser = cl.getParser(name)
> +
> +    if isinstance(parser, cl.FluentParser):
> +        raise Exception("Fluent files (.ftl) are not supported (bug 1399055).")

... use that explicit class here?

::: compare_locales/merge.py:57
(Diff revision 2)
> +
> +    entities = reduce(
> +        lambda x, y: merge_two(comments, x, y),
> +        map(parse_resource, resources))
> +
> +    return encode(serialize_legacy_resource(entities), 'utf8')

Let's use parser.encoding here instead of hard-coding 'utf-8'. Or parser.encode, if you're up for it.

::: compare_locales/tests/test_merge_dtd.py:8
(Diff revision 2)
> +
> +import unittest
> +from compare_locales.merge import merge_channels
> +
> +
> +class TestMergeDTD(unittest.TestCase):

The tests here generally make flake8 unhappy for line lengths.

I haven't found a way to disable that check for a file, so I think we should shorten the lines in this test. Despite the fact that they're real-life examples.
Attachment #8912129 - Flags: review?(l10n) → review-
Thanks for the feedback, Axel. I'll update the patch tomorrow!

(In reply to Axel Hecht [:Pike] from comment #20)
> Supporting the unsupported, notably, ftl, and things compare-locales doesn't
> understand. In the inline comments, I talk about raising a dedicated
> exception, which I could catch and ignore as a caller.
> I also wonder if we shouldn't just return the first content. Though, maybe,
> in the actual l10n-merge use-case, we shouldn't? 'cause I think in that case
> we want to fall back to English for the whole file, and not to the l10n
> file. (which I expect to be "newer" or "left")
> I'm rambling, it might be good to be explicit here, but to use an Error
> class that doesn't hide other things.

I'm close to having a reviewable patch for bug 1399055. Let's just raise a custom Exception for now for FTL to be explicit about it and hopefully we'll enable proper FTL support next week.

> And I got a bit confused on when you encode in tests, and when you don't. My
> gut instinct is that we should always encode, that should reduce the number
> of places that we need to fix up for py3.

IIUC this will increase the noise in the tests. Given that there's a dedicated test for encoding, I feel like not encoding in other tests is closer to the "right way" of unit testing. I can still do it if you want me to.
I first went for Parser.encode but then I decided against it. Let's not grow the Parser API just for this one use-case. codecs.encode coupled with Parser.encoding are good enough and more versatile.
Blocks: 1399055
Comment on attachment 8912129 [details]
Bug 1384570 - Merge resources across channels.

https://reviewboard.mozilla.org/r/183502/#review192212

::: compare_locales/tests/test_merge_dtd.py:8
(Diff revision 2)
> +
> +import unittest
> +from compare_locales.merge import merge_channels
> +
> +
> +class TestMergeDTD(unittest.TestCase):

flake8 still complains about the trailing whitespace in the test about trailing whitespace. I haven't found a way to silnce it.
(In reply to Staś Małolepszy :stas from comment #21)
> > And I got a bit confused on when you encode in tests, and when you don't. My
> > gut instinct is that we should always encode, that should reduce the number
> > of places that we need to fix up for py3.
> 
> IIUC this will increase the noise in the tests. Given that there's a
> dedicated test for encoding, I feel like not encoding in other tests is
> closer to the "right way" of unit testing. I can still do it if you want me
> to.

For python3, we'll need to do something:

Fuchsia:compare-locales axelhecht$ python3
>>> import codecs
>>> codecs.getdecoder('utf-8')('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'

But we can use b'' byte literals, those work in both py2.6+ and py3. We could do that now for new code, or do that for all tests in a follow-up patch.

Re flake8, I found two ways to get Guido happy with us, used both in my patchlet to the test:

diff --git a/compare_locales/tests/test_merge_dtd.py b/compare_locales/tests/test_merge_dtd.py
--- a/compare_locales/tests/test_merge_dtd.py
+++ b/compare_locales/tests/test_merge_dtd.py
@@ -29,12 +29,11 @@
         channels = ("""
 <!ENTITY foo "Foo 1">
 """, """
-<!ENTITY foo "Foo 2"> 
+<!ENTITY foo "Foo 2">\x20
 """)
         self.assertEqual(
             merge_channels(self.name, *channels), """
-<!ENTITY foo "Foo 1"> 
-""")
+<!ENTITY foo "Foo 1"> \n""")
 
     def test_browser_dtd(self):
         channels = ("""\


Not sure if it'd be a bug or a feature to test one variant against the other, given that it's just source code
Comment on attachment 8912129 [details]
Bug 1384570 - Merge resources across channels.

https://reviewboard.mozilla.org/r/183502/#review192234

r=me with the flake8 issues fixed. Also, I'd prefer to have this patch lead by example re byte literals for tests going in to readContents. Then I don't have to touch the tests in 10 minutes and request review on that from you ;-)

::: compare_locales/tests/test_merge_dtd.py:32
(Diff revision 4)
> +
> +    def test_trailing_whitespace(self):
> +        channels = ("""
> +<!ENTITY foo "Foo 1">
> +""", """
> +<!ENTITY foo "Foo 2"> 

Please make this a \x20 or a """ \n""" to make Guido happy with us.

::: compare_locales/tests/test_merge_dtd.py:36
(Diff revision 4)
> +""", """
> +<!ENTITY foo "Foo 2"> 
> +""")
> +        self.assertEqual(
> +            merge_channels(self.name, *channels), """
> +<!ENTITY foo "Foo 1"> 

Same here.
Attachment #8912129 - Flags: review?(l10n) → review+
Comment on attachment 8912129 [details]
Bug 1384570 - Merge resources across channels.

https://reviewboard.mozilla.org/r/183502/#review192252
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: