Closed Bug 1288146 Opened 8 years ago Closed 6 years ago

[decision][l10n-conversion] which data do we need for the l10n migration process

Categories

(L20n :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: Pike, Assigned: Pike)

References

(Blocks 1 open bug)

Details

Taking this out of bug 1280685 to figure out what data we need to feed in to the migration code.

For one, we can just use hg annotate. But I'm also wondering if we should go for an intermediate format that we could get out of pootle.

That depends a bit on the timing of pootle fs. An alternative thought I have is that we could rewrite all of the pootle-generated files to get attribution right in hg, and then have a consistent input.

I wonder if there are any traces of narro left at this point ;-) .
So here's what I think we should have as data going in.

Granted, it's based on what I can get out of hg for locales that have attribution there, which is why I'm looking at you, Dwayne: can you get something similar out of the mlo db?

I'd basically do a JSON object, global or per file, that looks something like:
{
 "quitApplicationCmd.label": [
  "Dwayne Bailey <dwayne@translate.org.za>",
  1300482324
 ],
 "openCmd.commandkey": [
  "Dwayne Bailey <dwayne@translate.org.za>",
  1284512337
 ],
 "panicButton.view.2hr": [
  "af team [Pootle] <https://wiki.mozilla.org/L10n:Teams:af>",
  1418627869
 ],
 "identity.description.passiveLoaded": [
  "af team [Pootle] <https://wiki.mozilla.org/L10n:Teams:af>",
  1439385459
 ],
 "pointerLock.notification.message": [
  "Friedel Wolff [pootle] <friedel@translate.org.za>",
  1368194401
 ],
 "goOfflineCmd.accesskey": [
  "Friedel Wolff [pootle] <friedel@translate.org.za>",
  1355451421
 ],
 "editPopupSettings.accesskey": [
  "Dwayne Bailey <dwayne@translate.org.za>",
  1283936009
 ],
 "showAllTabsCmd.label": [
  "Dwayne Bailey <dwayne@translate.org.za>",
  1283936009
 ],
 "pageStyleNoStyle.accesskey": [
  "Friedel Wolff <friedel@translate.org.za>",
  1242741390
 ]
}

On the hg side, we have the following attribution for Afrikaans browser.dtd:

Afrikaans team <https://wiki.mozilla.org/L10n:Teams:af>
Axel Hecht <l10n@mozilla.com>
Dwayne Bailey <dwayne@translate.org.za>
Friedel Wolff <friedel@translate.org.za>
Friedel Wolff [pootle] <friedel@translate.org.za>
Walter Leibbrandt <walter@translate.org.za>
af team [Pootle] <https://wiki.mozilla.org/L10n:Teams:af>

The general idea is that we'll have a hash mapping each string ID to a ["username", posix_timestamp] tuple. The posix_timestamp lets us order attribution data from different sources.
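
To illustrate how the timestamp could order data from different sources, here is a minimal Python sketch. It assumes that the newest timestamp should win for each string ID, which this bug does not settle. The function name merge_attribution and the Pootle-side entry (name, address and timestamp) are made up for illustration; the hg-side entries come from the listing above.

```python
def merge_attribution(*sources):
    """Merge {string_id: [username, posix_timestamp]} mappings.

    Assumption: for each string ID, the entry with the newest
    timestamp wins, regardless of which source it came from.
    """
    merged = {}
    for source in sources:
        for string_id, (username, timestamp) in source.items():
            current = merged.get(string_id)
            if current is None or timestamp > current[1]:
                merged[string_id] = [username, timestamp]
    return merged


# hg-side entries, copied from the listing above.
from_hg = {
    "panicButton.view.2hr": [
        "af team [Pootle] <https://wiki.mozilla.org/L10n:Teams:af>",
        1418627869,
    ],
    "openCmd.commandkey": [
        "Dwayne Bailey <dwayne@translate.org.za>",
        1284512337,
    ],
}
# Hypothetical Pootle-side entry; not real data.
from_pootle = {
    "panicButton.view.2hr": ["Jane Doe <jane@example.com>", 1418630000],
}

merged = merge_attribution(from_hg, from_pootle)
```

If Pootle events turn out to be older than the hg commits that synced them, "newest wins" would keep the team attribution, so a real merge might instead need to prefer individual authors over team entries.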

"username" should be something per individual, and satisfy the requirements on hg/git usernames, i.e., name <email@dress.tld>

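A rough way to spot the entries that don't meet that requirement is to check for the name <email> shape. This is only a sketch: the regular expression is an assumption about what counts as a usable username, not something hg or git enforce, and is_individual is a hypothetical helper name.

```python
import re

# Assumed shape: "Some Name <local@domain>". Team entries that carry a
# wiki URL instead of an email address do not match.
USERNAME_RE = re.compile(r"^[^<>]+ <[^<>\s@]+@[^<>\s@]+>$")


def is_individual(username):
    """Return True if username looks like 'name <email@domain>'."""
    return USERNAME_RE.match(username) is not None


# Attribution found in hg for the Afrikaans browser.dtd (from above).
authors = [
    "Afrikaans team <https://wiki.mozilla.org/L10n:Teams:af>",
    "Axel Hecht <l10n@mozilla.com>",
    "Dwayne Bailey <dwayne@translate.org.za>",
    "Friedel Wolff <friedel@translate.org.za>",
    "Friedel Wolff [pootle] <friedel@translate.org.za>",
    "Walter Leibbrandt <walter@translate.org.za>",
    "af team [Pootle] <https://wiki.mozilla.org/L10n:Teams:af>",
]

# Entries for which better per-individual data would be needed.
needs_better_data = [a for a in authors if not is_individual(a)]
```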

needinfo on Dwayne: can you check the data you have, in particular for the Afrikaans browser.dtd, and see if you can fill some of the gaps here?
Assignee: nobody → l10n
Flags: needinfo?(dwayne)
I've done a small modification to the schema here in the meantime, creating a dict like this:

{
  'authors': ['Axel Hecht <l10n@mozilla.com>', ...],
  'blame': {
    'my/filename.dtd': {
      'entity_id': [0, timestamp]
    }
  }
}

The tuple for each ID holds the index into the given author list and the epoch timestamp.

The major difference is a bit of space saved by storing each author only once.
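
For illustration, a minimal Python sketch of how per-file data in the earlier ["username", timestamp] shape could be turned into this indexed shape, and read back. This is not the script from the gist linked below; compact_blame and author_of are hypothetical names, and the path 'browser.dtd' is a placeholder.

```python
def compact_blame(per_file):
    """Convert {path: {entity_id: [username, timestamp]}} into
    {'authors': [...], 'blame': {path: {entity_id: [index, timestamp]}}}.
    """
    authors = []
    author_index = {}  # username -> position in authors
    blame = {}
    for path, entities in per_file.items():
        file_blame = blame.setdefault(path, {})
        for entity_id, (username, timestamp) in entities.items():
            if username not in author_index:
                author_index[username] = len(authors)
                authors.append(username)
            file_blame[entity_id] = [author_index[username], timestamp]
    return {"authors": authors, "blame": blame}


def author_of(data, path, entity_id):
    """Look up (username, timestamp) for one entity in the indexed shape."""
    index, timestamp = data["blame"][path][entity_id]
    return data["authors"][index], timestamp


# Entries taken from the earlier listing; the path is a placeholder.
per_file = {
    "browser.dtd": {
        "showAllTabsCmd.label": [
            "Dwayne Bailey <dwayne@translate.org.za>", 1283936009],
        "editPopupSettings.accesskey": [
            "Dwayne Bailey <dwayne@translate.org.za>", 1283936009],
        "pageStyleNoStyle.accesskey": [
            "Friedel Wolff <friedel@translate.org.za>", 1242741390],
    }
}
data = compact_blame(per_file)
```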

The script that does that is up on https://gist.github.com/Pike/9d52f8b912dfdf3ba9d6bd2bc67084fa.

I've talked to Ryan about this data set so that we can pull attribution data we have historically in hg together with the data that's only in pootle right now. In particular, the strings currently attributed to ... team ...<https://wiki...> hopefully have better data in pootle.

Ryan, did you have a chance to look if that data's reasonably straightforward to get out of pootle, and what data we'd get?
Flags: needinfo?(dwayne) → needinfo?(ryan)
Hi @Pike

Looking at the data available, Pootle has different types of "unit events", e.g. for units (translations) being created, or for the translation changing. I'm wondering which events we need attribution for, e.g. only those where the translation has changed?

I'm guessing we also want to exclude events where the user is the system user, which I think will remove almost all "unit creation" events anyway.

The other thing I need to check is whether translations that have been accepted from suggestions are properly attributed. I think this should be correct, but it's worth checking.

Regarding the format, I would like to clarify a couple of things.

Does the dictionary key "blame" denote the source - so in the case of Pootle should it be "pootle" ?

In this example should `'entity_id'` be an actual entity id?

    'my/filename.dtd': {
      'entity_id': [0, timestamp]
    }

Should `[0, timestamp]` be a list of tuples, or do we only want the last event?
Flags: needinfo?(ryan) → needinfo?(l10n)
The last contributor to an entity is what we're looking for, and yes, we'd like to exclude bots as much as possible. So yes, last non-system event sounds like the corresponding pootle thing.
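
As a sketch of that "last non-system event" rule in Python: the event records below are a hypothetical shape (path, entity_id, username, timestamp), not Pootle's actual schema, and the name "system" for the bot account is an assumption too.

```python
# Assumption: bot accounts can be recognized by username.
SYSTEM_USERS = {"system"}


def last_contributors(events, system_users=SYSTEM_USERS):
    """Return {(path, entity_id): (username, timestamp)} for the newest
    event per entity that was not made by a system user.

    Each event is a dict with the keys 'path', 'entity_id', 'username'
    and 'timestamp' (a hypothetical shape, for illustration only).
    """
    latest = {}
    for event in events:
        if event["username"] in system_users:
            continue
        key = (event["path"], event["entity_id"])
        current = latest.get(key)
        if current is None or event["timestamp"] > current[1]:
            latest[key] = (event["username"], event["timestamp"])
    return latest


# Made-up events for a single entity.
events = [
    {"path": "my/filename.dtd", "entity_id": "foo.label",
     "username": "system", "timestamp": 100},
    {"path": "my/filename.dtd", "entity_id": "foo.label",
     "username": "Jane Doe <jane@example.com>", "timestamp": 200},
    {"path": "my/filename.dtd", "entity_id": "foo.label",
     "username": "John Roe <john@example.com>", "timestamp": 150},
    {"path": "my/filename.dtd", "entity_id": "foo.label",
     "username": "system", "timestamp": 300},
]
result = last_contributors(events)
```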

The dictionary key 'blame' is in contrast to 'authors', and should stay the same. 'entity_id' should be "foo.label" for
<!ENTITY foo.label "this is foo">
Flags: needinfo?(l10n)
Resolving this.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE