Junk entities in reference files should not count towards word counts

RESOLVED FIXED

Status

RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: stas, Assigned: Pike)

Tracking

Details

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
In ContentComparer.compare and ContentComparer.add Junk entities in the reference file are treated the same as good entities.  For instance, their word totals count towards the total "missing_w" for the file.
(Assignee)

Comment 1

2 years ago
The support for Junk entries in the reference is poor in general.

I think it'd be good to report those as warnings in compare-locales. I'm somewhat hopeful that doing so might get us to the point where we can land a test in mozilla-central to make sure that errors in en-US are caught on push.

flod, what'd be your take?
Flags: needinfo?(francesco.lodolo)
(In reply to Axel Hecht [:Pike] from comment #1)
> I think it'd be good to report those as warnings in compare-locales. I'm
> somewhat hopeful that doing so might get us to the point where we can land a
> test in mozilla-central to make sure that errors in en-US are caught on push.

I think showing a warning would be a good idea, we had an example just a few days ago where they decided to escape a forward slash in devtools.

As for stats, truth is that we're still going to expose the string and count it as missing if not translated, so the current behavior still makes sense, as confusing as it might be.
Flags: needinfo?(francesco.lodolo)
(Assignee)

Updated

2 years ago
Assignee: nobody → l10n
Comment hidden (mozreview-request)
(Reporter)

Comment 4

2 years ago
mozreview-review
Comment on attachment 8876080 [details]
bug 1371342, don't count parse errors in en-US as missing strings, warn on them,

https://reviewboard.mozilla.org/r/147512/#review151778

This looks good although I have a few questions and comments.  Feel free to treat them as suggestions and land at will.

::: compare_locales/compare.py:466
(Diff revision 1)
>          skips = []
>          checker = getChecker(l10n, reference=ref[0], extra_tests=extra_tests)
>          for action, entity in ar:
>              if action == 'delete':
>                  # missing entity
> +                if isinstance(ref[0][ref[1][entity]], parser.Junk):

This is a bit unwieldy in how you have to jump extra hoops to get to the Entity object.

I wonder if it would make sense to implement this filtering behavior directly in the Parser.  Perhaps add a named argument to Parser.parse, with_junk (True by default), so that you can do something like:

    ref = p.parser(with_junk=False)

::: compare_locales/compare.py:467
(Diff revision 1)
>          checker = getChecker(l10n, reference=ref[0], extra_tests=extra_tests)
>          for action, entity in ar:
>              if action == 'delete':
>                  # missing entity
> +                if isinstance(ref[0][ref[1][entity]], parser.Junk):
> +                    self.notify('warning', l10n, 'Parser error in en-US')

Can we reporting this warning in `ref` rather than `l10n`? Also, are you okay with calling `en-US` out explicitly here.  Perhaps `reference` would be more fitting?

::: compare_locales/tests/test_merge.py:279
(Diff revision 1)
>          self.assertEqual(map(lambda e: e.key,  m), ["foo", "eff", "bar"])
>  
> +    def test_reference_junk(self):
> +        self.assertTrue(os.path.isdir(self.tmp))
> +        self.reference("""<!ENTITY foo 'fooVal'>
> +<!ENT bar 'bad val'>

If `bar` was present in the l10n (and parsed okay), would it be reported as obsolete? Perhaps add a test for this scenario as well?

::: compare_locales/tests/test_merge.py:327
(Diff revision 1)
> +                    'changed': 1,
> +                    'changed_w': 2
> +                }},
> +             'details': {
> +                 'l10n.dtd': [
> +                     {'warning': u"can't parse en-US value at line 1, column 0 for bar"}]

Is there any way to get the column value closer to the real location of the error? I don't know what level of granularity SAXParseException can give us.
Attachment #8876080 - Flags: review?(stas) → review+
(Reporter)

Comment 5

2 years ago
(In reply to Staś Małolepszy :stas from comment #4)

> I wonder if it would make sense to implement this filtering behavior
> directly in the Parser.  Perhaps add a named argument to Parser.parse,
> with_junk (True by default), so that you can do something like:
> 
>     ref = p.parser(with_junk=False)

Ah, this probably wouldn't work in the form that I suggested, because you still need to report warning about those Junk entities.
(Assignee)

Comment 6

2 years ago
mozreview-review
Comment on attachment 8876080 [details]
bug 1371342, don't count parse errors in en-US as missing strings, warn on them,

https://reviewboard.mozilla.org/r/147512/#review151790

::: compare_locales/compare.py:466
(Diff revision 1)
>          skips = []
>          checker = getChecker(l10n, reference=ref[0], extra_tests=extra_tests)
>          for action, entity in ar:
>              if action == 'delete':
>                  # missing entity
> +                if isinstance(ref[0][ref[1][entity]], parser.Junk):

Yeah, it's unwieldy but mostly just because of how I store the two parts of the c-l parse result as a tuple. That used to be more useful before bug 1361037 dropped caching en-US parse results. It'd be easier to read if I used the same names as for l10n. I'll fix that in a different bug.

As for not showing Junk at all, you already found that out yourself ;-)

::: compare_locales/compare.py:467
(Diff revision 1)
>          checker = getChecker(l10n, reference=ref[0], extra_tests=extra_tests)
>          for action, entity in ar:
>              if action == 'delete':
>                  # missing entity
> +                if isinstance(ref[0][ref[1][entity]], parser.Junk):
> +                    self.notify('warning', l10n, 'Parser error in en-US')

I need to use the l10n file here, as I only report to the localization. I.e. no observer would actually take notice of something for mozilla-central/browser/locales/en-US/foo.dtd.

The underlying problem here is that we don't have a test running on mozilla-central, making sure it's not broken :-/ The warning is really for localizers to know that something's fishy with en-US.

As for en-US, I think that's easier to digest than "reference". That's jargon that is probably even harder to understand for a localizer, and that they can't do anything about this.

::: compare_locales/tests/test_merge.py:279
(Diff revision 1)
>          self.assertEqual(map(lambda e: e.key,  m), ["foo", "eff", "bar"])
>  
> +    def test_reference_junk(self):
> +        self.assertTrue(os.path.isdir(self.tmp))
> +        self.reference("""<!ENTITY foo 'fooVal'>
> +<!ENT bar 'bad val'>

Yes, it's going to be shown as obsolete entry. We have existing tests that they're reported, I don't think there's value in an extra test here.

::: compare_locales/tests/test_merge.py:327
(Diff revision 1)
> +                    'changed': 1,
> +                    'changed_w': 2
> +                }},
> +             'details': {
> +                 'l10n.dtd': [
> +                     {'warning': u"can't parse en-US value at line 1, column 0 for bar"}]

I looked into that, and I had to replicate or refactor the 10+ lines at https://hg.mozilla.org/l10n/compare-locales/file/fca288f0e156/compare_locales/checks.py#l287. A lot of work for something that we should prevent in the first place.

Note, if we added a test to mozilla-central to fail on try for things like this, it'd get a good error location from using the English file as localization.
Comment hidden (mozreview-request)
(Assignee)

Comment 8

2 years ago
https://hg.mozilla.org/l10n/compare-locales/rev/30240c3976dc110636d180f1d890e6e6c1d69fa3, FIXED.

I'll send the ref[][][] stuff into bug 1371601.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Assignee)

Comment 9

2 years ago
Forgot to flake8 the result, landed some follow-ups.
(Assignee)

Updated

2 years ago
Blocks: 1372254
You need to log in before you can comment on or make changes to this bug.