Closed Bug 1310980 Opened 8 years ago Closed 8 years ago

compare-locales parser should reflect full structure of files

Categories

(Localization Infrastructure and Tools :: compare-locales, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: Pike)

References

Details

Attachments

(10 files)

We need better support for file structures in compare-locales, in particular, comments need to be handled independently of entities. We need this to support more constructive algorithms for merging l10n files.
Blocks: 1310981
Attachment #8803918 - Flags: review?(l10n)
Comment on attachment 8803918 [details] bug 1310980, remove bookmarks.html parser, we haven't used that in ages. https://reviewboard.mozilla.org/r/87642/#review87060 r=me for trivial code removal.
Attachment #8803918 - Flags: review?(l10n) → review+
I'd love to get reviews on this, but then I know that this is 9 patches deep, and some 200 net+ lines of code. Looking at flod as a consumer of this for transvision, stas in general, and matjaz as a well as a possible pontoon consumer. As a consumer of this dashboard-wise, I only see a few changes in the output, and they're good. I.e. bugs in the old code, like the defines parser never worked really. Which is demonstrated by its complete lack of tests so far. High-level: The Parser.parse() interface didn't change, it now filters the more complex parsed structure to just return Entity and Junk objects, which is all that existed prior. There's new tests for the defines parser, for a bunch of edge cases like files with just comments and whitespaces, and the tests to verify entity positions in the file are new. Customers interested in the nitty gritty details would look at Parser.walk() now.
Attachment #8803927 - Flags: review?(stas)
Comment on attachment 8803918 [details] bug 1310980, remove bookmarks.html parser, we haven't used that in ages. https://reviewboard.mozilla.org/r/87640/#review90324 This change looks good to me. I did't review all the small changes to regexps and others, seeing that you have a good test coverage for those. I have a few high-level questions which mostly relate to the fact that I don't know how this code came to be. ::: compare_locales/parser.py:30 (Diff revision 1) > > - <-------[3]---------><------[6]------> > + <-------[2]---------> > ''' > - def __init__(self, ctx, pp, > - span, pre_ws_span, pre_comment_span, def_span, > + def __init__(self, ctx, pp, pre_comment, > + span, pre_ws_span, def_span, > key_span, val_span, post_span): (This might be beyond the scope of this bug but I'm curious.) What is the reason for storing almost all information about the Entity in form of `span`s? Intuitively I would expect something like: def __init__(self, ctx, pp, pre_comment, pre_ws, definition, key, val, post): ... and then in the `Parser`: def creatEntity(self, ctx, m): return Entity(ctx, self.processValue, **m.groupdict()) This way you could get rid of almost all `Entity`'s getters which currently have to do all the slicing according to the span indexes. It seems that it would make more sense now that comments and white-space are separate classes. Is this some sort of legacy decision, or is there a use-case that I'm missing right now? ::: compare_locales/parser.py:91 (Diff revision 1) > def get_post(self): > return self.ctx.contents[self.post_span[0]:self.post_span[1]] > > # getters > > all = property(get_all) Nit: you use the `@property` decorator in new code so maybe use it here as well? ::: compare_locales/parser.py:235 (Diff revision 1) > return val > > def __iter__(self): > + return self.walk(onlyEntities=True) > + > + def walk(self, onlyEntities=False): If the parser understands comments now, I'd expect `__iter__` to iterate over all contents, and something like `entities()` to return a generator yielding only `Entity` instances. This is likely a matter of personal preference, so I'll leave it up to you to decide.
Comment on attachment 8803927 [details] bug 1310980, part 8: defined and tested behaviour for all-whitespace files https://reviewboard.mozilla.org/r/88120/#review90382
Attachment #8803927 - Flags: review?(stas) → review+
(In reply to Staś Małolepszy :stas from comment #13) > Comment on attachment 8803918 [details] > bug 1310980, remove bookmarks.html parser, we haven't used that in ages. > > https://reviewboard.mozilla.org/r/87640/#review90324 > > This change looks good to me. I did't review all the small changes to > regexps and others, seeing that you have a good test coverage for those. I > have a few high-level questions which mostly relate to the fact that I don't > know how this code came to be. > > ::: compare_locales/parser.py:30 > (Diff revision 1) > > > > - <-------[3]---------><------[6]------> > > + <-------[2]---------> > > ''' > > - def __init__(self, ctx, pp, > > - span, pre_ws_span, pre_comment_span, def_span, > > + def __init__(self, ctx, pp, pre_comment, > > + span, pre_ws_span, def_span, > > key_span, val_span, post_span): > > (This might be beyond the scope of this bug but I'm curious.) > > What is the reason for storing almost all information about the Entity in > form of `span`s? It's an optimization such that compare-locales doesn't store a ton of small strings. I never really tested if it is an optimization, though. I'd not change that as a ride along, as it could have all kinds of perf impact. I also need the spans to return entity and value positions. <...> > This way you could get rid of almost all `Entity`'s getters which currently > have to do all the slicing according to the span indexes. It seems that it > would make more sense now that comments and white-space are separate > classes. That's one thing I'm not totally happy with yet, whitespace is somewhat still part of the entity, unless it's at the end of the file in some file formats :-/ > Is this some sort of legacy decision, or is there a use-case that I'm > missing right now? > > ::: compare_locales/parser.py:91 > (Diff revision 1) > > def get_post(self): > > return self.ctx.contents[self.post_span[0]:self.post_span[1]] > > > > # getters > > > > all = property(get_all) > > Nit: you use the `@property` decorator in new code so maybe use it here as > well? I guess there's a general reason to overhaul the py2.6-isms. > ::: compare_locales/parser.py:235 > (Diff revision 1) > > return val > > > > def __iter__(self): > > + return self.walk(onlyEntities=True) > > + > > + def walk(self, onlyEntities=False): > > If the parser understands comments now, I'd expect `__iter__` to iterate > over all contents, and something like `entities()` to return a generator > yielding only `Entity` instances. > > This is likely a matter of personal preference, so I'll leave it up to you > to decide. The iterator is also a somewhat public API, so I kept its functionality untouched, as I did for .parse().
Pushed this in three patches, remove bookmarks.html, fix tests, new parser: https://hg.mozilla.org/l10n/compare-locales/pushloghtml?changeset=90fed0b23a4d. Marking FIXED.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: