Closed Bug 797571 Opened 7 years ago Closed 6 years ago

Repair orphaned translations automatically where possible

Categories

(developer.mozilla.org :: Localization, defect, P1)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: lorchard, Unassigned)

References

Details

(Whiteboard: [localization])

Some notes continued over from bug 792417, describing some data spelunking on a data set that's a few weeks old:

select count(d1.id)
from wiki_document as d1 
where d1.parent_id is null and 
    d1.locale <> 'en-US' and
    d1.slug in (select d2.slug 
                    from wiki_document as d2 
                    where d2.locale='en-US')

That yields 6758 documents in non-en-US locales whose slugs correspond to documents in the en-US locale, yet do not claim to be translations. But since the slugs match, they probably *are* translations.

Another interesting stat:

select count(d1.id)
from wiki_document as d1 
where d1.parent_id is null and 
    d1.locale <> 'en-US' and
    left(d1.html,8) <> 'REDIRECT' and
    d1.slug in (select d2.slug 
                    from wiki_document as d2 
                    where d2.locale='en-US')

This yields 4523 documents - meaning that 4523 of the 6758 apparent orphaned translations whose slugs correspond to en-US pages are actually redirects.

A quick glance at some of those 4523 tells me that they seem to point at pages that have been moved from en-US-inspired slugs to translated slugs (e.g. /de/docs/Code_snippets/Scrollbar -> /de/docs/Codeschnipsel/Scrollbar). In an arbitrary sampling of these, it looked like the redirect target was itself marked as a translation of an en-US page. So, no real orphan problem here.

That leaves 2235 remaining non-en-US documents whose slugs correspond to en-US documents. Those are probably real orphaned translations, and so we can probably fix those. (Thus, this bug)
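
The relationship between the two counts and their difference can be checked on a toy data set. This is only a sketch under stated assumptions: it uses Python's sqlite3 with an in-memory table holding just the columns the queries touch, every row is invented, the redirect markup is a guess, and SQLite's substr() stands in for MySQL's left(). The helper names are hypothetical.

```python
import sqlite3

# Invented rows: (locale, slug, parent_id, html). Not real MDN data.
TOY_ROWS = [
    ("en-US", "Code_snippets/Scrollbar", None, "<p>English page</p>"),   # id 1
    ("de", "Code_snippets/Scrollbar", None,
     'REDIRECT <a href="/de/docs/Codeschnipsel/Scrollbar">moved</a>'),   # id 2
    ("de", "Codeschnipsel/Scrollbar", 1, "<p>Deutsche Seite</p>"),       # id 3
    ("fr", "Code_snippets/Scrollbar", None, "<p>Page en francais</p>"),  # id 4
    ("en-US", "DOM", None, "<p>DOM</p>"),                                # id 5
    ("ja", "DOM", 5, "<p>DOM (ja)</p>"),                                 # id 6
]

# Same shape as the queries above; {extra} is an optional additional filter.
BASE_QUERY = """
    SELECT COUNT(d1.id)
    FROM wiki_document AS d1
    WHERE d1.parent_id IS NULL
      AND d1.locale <> 'en-US'
      {extra}
      AND d1.slug IN (SELECT d2.slug
                      FROM wiki_document AS d2
                      WHERE d2.locale = 'en-US')
"""


def build_toy_db():
    """Create an in-memory table with only the columns the queries need."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wiki_document ("
        "id INTEGER PRIMARY KEY, locale TEXT, slug TEXT, "
        "parent_id INTEGER, html TEXT)"
    )
    conn.executemany(
        "INSERT INTO wiki_document (locale, slug, parent_id, html) "
        "VALUES (?, ?, ?, ?)",
        TOY_ROWS,
    )
    return conn


def orphan_counts(conn):
    """Return (slug_matches, not_starting_with_redirect, difference)."""
    total = conn.execute(BASE_QUERY.format(extra="")).fetchone()[0]
    # substr(html, 1, 8) is SQLite's equivalent of MySQL's left(html, 8);
    # the <> filter keeps rows whose html does NOT start with 'REDIRECT'.
    filtered = conn.execute(
        BASE_QUERY.format(extra="AND substr(d1.html, 1, 8) <> 'REDIRECT'")
    ).fetchone()[0]
    return total, filtered, total - filtered


print(orphan_counts(build_toy_db()))
```

On the toy rows, the first number counts both the de redirect and the fr page, the second counts only the fr page, and the third is the remainder.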
More research:

select count(id) from wiki_document 
where parent_id is null and 
    html like '%languages(%' 

This yields 6174 documents which contain 'languages(' - a likely indicator that there's a {{ languages() }} or {{ wiki.languages() }} macro in the page that migration failed to parse and use for setting a translation parent. 

These are probably translation orphans we can address by trying harder to parse that macro to locate an en-US parent.
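
A rough sketch of what "trying harder to parse that macro" could look like. Everything here is an assumption: the macro argument is taken to be an object literal of quoted locale/path pairs, the sample paths are invented, and english_target() is a hypothetical helper, not the parser the migration used. Mapping the returned path onto a wiki_document row is left out.

```python
import re

# Matches {{ languages( { ... } ) }} and {{ wiki.languages( { ... } ) }}.
# Assumed macro shape; real pages may deviate from it.
LANGUAGES_MACRO = re.compile(
    r"\{\{\s*(?:wiki\.)?languages\s*\(\s*\{(.*?)\}\s*\)\s*\}\}",
    re.DOTALL | re.IGNORECASE,
)

# One quoted "locale": "path" pair inside the macro argument.
LOCALE_PATH_PAIR = re.compile(
    r"""["']([^"']+)["']\s*:\s*["']([^"']+)["']"""
)


def english_target(html):
    """Return the path listed for English in the macro, or None."""
    match = LANGUAGES_MACRO.search(html)
    if match is None:
        return None
    for locale, path in LOCALE_PATH_PAIR.findall(match.group(1)):
        if locale.lower() in ("en", "en-us"):
            return path
    return None


sample = (
    '<p>Text</p>{{ languages( { "en": "en/Code_snippets/Scrollbar", '
    '"fr": "fr/Extraits_de_code/Scrollbar" } ) }}'
)
print(english_target(sample))
```

A page whose macro lists no English entry, or that has no macro at all, yields None and would still need manual attention.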
Actually, let's refine that last query:

select count(id) from wiki_document 
where parent_id is null and 
    locale <> 'en-US' and 
    html like '%languages(%' 

That yields 2901 documents - en-US documents are never translations of other en-US documents :)
Whiteboard: s=2012-10-30
Whiteboard: s=2012-10-30 → s=2012-10-30 p=
Whiteboard: s=2012-10-30 p= → s=2012-10-30 u=user
Priority: -- → P2
Whiteboard: s=2012-10-30 u=user → s=2012-10-30 u=user c=Localization
Priority: P2 → P1
Whiteboard: s=2012-10-30 u=user c=Localization → [localization]
Jean-Yves and I spent a few hundred hours fixing pages with lost parents.
Some time ago I implemented a maintenance page that lists pages without a parent for that purpose:
https://developer.mozilla.org/de/docs/without-parent (0 in most locales now)

Les, do we still have a problem here? Do you want to run these queries again and describe whether anything still needs repair?
Flags: needinfo?(lorchard)
mysql> select count(d1.id)
    -> from wiki_document as d1
    -> where d1.parent_id is null and
    ->     d1.locale <> 'en-US' and
    ->     d1.slug in (select d2.slug
    ->                     from wiki_document as d2
    ->                     where d2.locale='en-US');
+--------------+
| count(d1.id) |
+--------------+
|         8663 |
+--------------+
1 row in set (0.28 sec)

So there are even more non-English pages with identical slugs to English pages that don't have an English "parent" page.



mysql> select count(d1.id)
    -> from wiki_document as d1
    -> where d1.parent_id is null and
    ->     d1.locale <> 'en-US' and
    ->     left(d1.html,8) <> 'REDIRECT' and
    ->     d1.slug in (select d2.slug
    ->                     from wiki_document as d2
    ->                     where d2.locale='en-US');
+--------------+
| count(d1.id) |
+--------------+
|          740 |
+--------------+
1 row in set (0.16 sec)

I'm not sure what accounts for such a big drop in this number and this ratio, unless:

* We've cleaned out old redirects?
* redirects aren't in the first 8 characters anymore?


mysql> select count(id) from wiki_document
    -> where parent_id is null and
    ->     locale <> 'en-US' and
    ->     html like '%languages(%' ;
+-----------+
| count(id) |
+-----------+
|       585 |
+-----------+
1 row in set (0.34 sec)

So, far fewer non-English pages without a parent use the "languages" macro. Not sure what that means ...
Flags: needinfo?(lorchard)
I think the first query is useless because redirects have no parent. So that increased because we have moved pages in the locales.

I guess a query that excludes redirects and deleted pages could use d1.is_redirect = 0 and deleted = 0.
Furthermore we are not worrying about Talk and User pages.

select count(d1.id)
from wiki_document as d1
where d1.parent_id is null and
    d1.locale <> 'en-US' and
    d1.is_redirect = 0 and d1.deleted = 0 and
    d1.slug not like 'Talk:%' and d1.slug not like 'User:%' and
    d1.slug in (select d2.slug
                    from wiki_document as d2
                    where d2.locale='en-US');

That gives us 475 pages. A spot check tells me that these pages are already listed on our "without-parent" maintenance pages.
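
For anyone who wants to see what each exclusion in the refined query removes, here is a small sketch against invented data. Assumptions: Python's sqlite3 stands in for MySQL, the table has only the columns the query touches, all rows are made up, and the helper names are hypothetical. It selects locale and slug instead of a count so the surviving rows are visible.

```python
import sqlite3

REFINED_QUERY = """
    SELECT d1.locale, d1.slug
    FROM wiki_document AS d1
    WHERE d1.parent_id IS NULL
      AND d1.locale <> 'en-US'
      AND d1.is_redirect = 0
      AND d1.deleted = 0
      AND d1.slug NOT LIKE 'Talk:%'
      AND d1.slug NOT LIKE 'User:%'
      AND d1.slug IN (SELECT d2.slug
                      FROM wiki_document AS d2
                      WHERE d2.locale = 'en-US')
    ORDER BY d1.locale, d1.slug
"""

# Invented rows: (locale, slug, parent_id, is_redirect, deleted).
SAMPLE_ROWS = [
    ("en-US", "DOM", None, 0, 0),       # id 1, English original
    ("en-US", "Talk:DOM", None, 0, 0),  # id 2, English talk page
    ("en-US", "CSS", None, 0, 0),       # id 3, English original
    ("fr", "DOM", None, 0, 0),          # kept: parentless, slug matches
    ("de", "DOM", None, 1, 0),          # dropped: redirect
    ("ja", "DOM", None, 0, 1),          # dropped: deleted
    ("es", "Talk:DOM", None, 0, 0),     # dropped: Talk page
    ("pl", "DOM", 1, 0, 0),             # dropped: already has a parent
    ("fr", "CSS", None, 0, 0),          # kept: parentless, slug matches
]


def build_sample_db():
    """Create an in-memory table with only the columns the query needs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wiki_document ("
        "id INTEGER PRIMARY KEY, locale TEXT, slug TEXT, parent_id INTEGER, "
        "is_redirect INTEGER, deleted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO wiki_document "
        "(locale, slug, parent_id, is_redirect, deleted) "
        "VALUES (?, ?, ?, ?, ?)",
        SAMPLE_ROWS,
    )
    return conn


def orphan_candidates(conn):
    """Return (locale, slug) pairs that survive every exclusion."""
    return conn.execute(REFINED_QUERY).fetchall()


print(orphan_candidates(build_sample_db()))
```

Only the two fr rows remain in this sample; the redirect, the deleted page, the Talk page and the page that already has a parent are all filtered out.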


For the language macro usage: It's going down because we've fixed a bunch of these pages and are removing the macro little by little. Again, no need for redirects, deleted, Talk and User pages.

select locale, slug from wiki_document 
where parent_id is null and is_redirect = 0 and deleted = 0
    and slug not LIKE "Talk:%" and slug not LIKE "User:%" and
    locale <> 'en-US' and 
    html like '%languages(%'

Same here: Spot check says that these pages are already listed on the "without-parent" maintenance page.


So, unless I'm missing something here, this should be covered by the listing on https://developer.mozilla.org/<locale>/docs/without-parent

We have already fixed a lot there and intend to get those lists down to 0 in every locale.

This bug proposes an automatic repair. That never happened and probably will not.
Anything else left to do here?
Great comment and information Florian. We're done here!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX