compare-locales parser should reflect full structure of files

RESOLVED FIXED

Status

P1
normal
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: Pike, Assigned: Pike)

Tracking

(Blocks: 1 bug)

Details

Attachments

(10 attachments)

(Assignee)

Description

2 years ago
We need better support for file structures in compare-locales, in particular, comments need to be handled independently of entities.

We need this to support more constructive algorithms for merging l10n files.
(Assignee)

Updated

2 years ago
Blocks: 1310981
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Updated

2 years ago
Attachment #8803918 - Flags: review?(l10n)
(Assignee)

Comment 11

2 years ago
mozreview-review
Comment on attachment 8803918 [details]
bug 1310980, remove bookmarks.html parser, we haven't used that in ages.

https://reviewboard.mozilla.org/r/87642/#review87060

r=me for trivial code removal.
Attachment #8803918 - Flags: review?(l10n) → review+
(Assignee)

Comment 12

2 years ago
I'd love to get reviews on this, but then I know that this is 9 patches deep, and some 200 net+ lines of code.

Looking at flod as a consumer of this for transvision, stas in general, and matjaz as a well as a possible pontoon consumer.

As a consumer of this dashboard-wise, I only see a few changes in the output, and they're good. I.e. bugs in the old code, like the defines parser never worked really. Which is demonstrated by its complete lack of tests so far.

High-level:

The Parser.parse() interface didn't change, it now filters the more complex parsed structure to just return Entity and Junk objects, which is all that existed prior.

There's new tests for the defines parser, for a bunch of edge cases like files with just comments and whitespaces, and the tests to verify entity positions in the file are new.

Customers interested in the nitty gritty details would look at Parser.walk() now.
(Assignee)

Updated

2 years ago
Attachment #8803927 - Flags: review?(stas)

Comment 13

2 years ago
mozreview-review
Comment on attachment 8803918 [details]
bug 1310980, remove bookmarks.html parser, we haven't used that in ages.

https://reviewboard.mozilla.org/r/87640/#review90324

This change looks good to me.  I did't review all the small changes to regexps and others, seeing that you have a good test coverage for those.  I have a few high-level questions which mostly relate to the fact that I don't know how this code came to be.

::: compare_locales/parser.py:30
(Diff revision 1)
>  
> -    <-------[3]---------><------[6]------>
> +    <-------[2]--------->
>      '''
> -    def __init__(self, ctx, pp,
> -                 span, pre_ws_span, pre_comment_span, def_span,
> +    def __init__(self, ctx, pp, pre_comment,
> +                 span, pre_ws_span, def_span,
>                   key_span, val_span, post_span):

(This might be beyond the scope of this bug but I'm curious.)

What is the reason for storing almost all information about the Entity in form of `span`s?

Intuitively I would expect something like:

    def __init__(self, ctx, pp, pre_comment, pre_ws,
                 definition, key, val, post):
        ...

and then in the `Parser`:

    def creatEntity(self, ctx, m):
        return Entity(ctx, self.processValue,
                      **m.groupdict())
                      
This way you could get rid of almost all `Entity`'s getters which currently have to do all the slicing according to the span indexes.  It seems that it would make more sense now that comments and white-space are separate classes. 

Is this some sort of legacy decision, or is there a use-case that I'm missing right now?

::: compare_locales/parser.py:91
(Diff revision 1)
>      def get_post(self):
>          return self.ctx.contents[self.post_span[0]:self.post_span[1]]
>  
>      # getters
>  
>      all = property(get_all)

Nit: you use the `@property` decorator in new code so maybe use it here as well?

::: compare_locales/parser.py:235
(Diff revision 1)
>          return val
>  
>      def __iter__(self):
> +        return self.walk(onlyEntities=True)
> +
> +    def walk(self, onlyEntities=False):

If the parser understands comments now, I'd expect `__iter__` to iterate over all contents, and something like `entities()` to return a generator yielding only `Entity` instances.

This is likely a matter of personal preference, so I'll leave it up to you to decide.

Comment 14

2 years ago
mozreview-review
Comment on attachment 8803927 [details]
bug 1310980, part 8: defined and tested behaviour for all-whitespace files

https://reviewboard.mozilla.org/r/88120/#review90382
Attachment #8803927 - Flags: review?(stas) → review+
(Assignee)

Comment 15

2 years ago
(In reply to Staś Małolepszy :stas from comment #13)
> Comment on attachment 8803918 [details]
> bug 1310980, remove bookmarks.html parser, we haven't used that in ages.
> 
> https://reviewboard.mozilla.org/r/87640/#review90324
> 
> This change looks good to me.  I did't review all the small changes to
> regexps and others, seeing that you have a good test coverage for those.  I
> have a few high-level questions which mostly relate to the fact that I don't
> know how this code came to be.
> 
> ::: compare_locales/parser.py:30
> (Diff revision 1)
> >  
> > -    <-------[3]---------><------[6]------>
> > +    <-------[2]--------->
> >      '''
> > -    def __init__(self, ctx, pp,
> > -                 span, pre_ws_span, pre_comment_span, def_span,
> > +    def __init__(self, ctx, pp, pre_comment,
> > +                 span, pre_ws_span, def_span,
> >                   key_span, val_span, post_span):
> 
> (This might be beyond the scope of this bug but I'm curious.)
> 
> What is the reason for storing almost all information about the Entity in
> form of `span`s?

It's an optimization such that compare-locales doesn't store a ton of small strings.

I never really tested if it is an optimization, though. I'd not change that as a ride along, as it could have all kinds of perf impact. I also need the spans to return entity and value positions.

<...>

> This way you could get rid of almost all `Entity`'s getters which currently
> have to do all the slicing according to the span indexes.  It seems that it
> would make more sense now that comments and white-space are separate
> classes. 

That's one thing I'm not totally happy with yet, whitespace is somewhat still part of the entity, unless it's at the end of the file in some file formats :-/

> Is this some sort of legacy decision, or is there a use-case that I'm
> missing right now?
> 
> ::: compare_locales/parser.py:91
> (Diff revision 1)
> >      def get_post(self):
> >          return self.ctx.contents[self.post_span[0]:self.post_span[1]]
> >  
> >      # getters
> >  
> >      all = property(get_all)
> 
> Nit: you use the `@property` decorator in new code so maybe use it here as
> well?

I guess there's a general reason to overhaul the py2.6-isms.
 
> ::: compare_locales/parser.py:235
> (Diff revision 1)
> >          return val
> >  
> >      def __iter__(self):
> > +        return self.walk(onlyEntities=True)
> > +
> > +    def walk(self, onlyEntities=False):
> 
> If the parser understands comments now, I'd expect `__iter__` to iterate
> over all contents, and something like `entities()` to return a generator
> yielding only `Entity` instances.
> 
> This is likely a matter of personal preference, so I'll leave it up to you
> to decide.

The iterator is also a somewhat public API, so I kept its functionality untouched, as I did for .parse().
(Assignee)

Comment 16

2 years ago
Pushed this in three patches, remove bookmarks.html, fix tests, new parser: https://hg.mozilla.org/l10n/compare-locales/pushloghtml?changeset=90fed0b23a4d.

Marking FIXED.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.