Closed Bug 1673043 Opened 5 years ago Closed 5 years ago

Data request: word count and TM hits for mozilla.org strings in 2020

Categories

(Webtools Graveyard :: Pontoon, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gueroJeff, Assigned: mathjazz)

Details

I'd like to get a word count analysis for mozilla.org for the year between 1 November 2019 and 1 November 2020. This should include the total new words and TM hits (ideally broken down into 100% matches and percentage-based fuzzy match bands down to 70%).

We enabled the new (FTL) Mozilla.org project on 2020-05-22.

I assume I should include words added to both projects: Mozilla.lang (words added after 1 November 2019) and Mozilla.ftl (all words added so far)? Regarding the latter, do we need to exclude strings we migrated from .lang to .ftl (might be tricky)?

Assignee: nobody → m
Status: NEW → ASSIGNED
Flags: needinfo?(jbeatty)
Priority: -- → P2

Yes, we should count strings added to both projects. Treat the migrated strings as TM matches rather than trying to exclude them. You can use German, French, or Italian as the base target language for the matches.

Flags: needinfo?(jbeatty)

I need to tweak the scope of this request.

In addition to the total word count stats for 2020 (Jan 2020-Jan 2021), I need the current (as of today) missing word count for the mozilla.org project for the following locales:

  • ar
  • zh-CN
  • zh-TW
  • nl
  • hi-IN
  • id
  • it
  • ja
  • ko
  • ms
  • pt-BR
  • pt-PT
  • pl
  • ru
  • es-ES
  • es-MX
  • tr
  • vi
Flags: needinfo?(m)

Sorry to be a jerk, but I have one more dimension to add here: I also need a month-by-month word count for new (added) words from Jan 2020 to Jan 2021.
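As a rough illustration of what a month-by-month count of added words means, here is a minimal sketch that groups (creation date, word count) pairs by calendar month. The function name `words_per_month` and the sample data are hypothetical and only illustrate the grouping; they are not taken from Pontoon.

```python
from collections import defaultdict
from datetime import date


def words_per_month(entries):
    """Sum word counts by (year, month) of each entry's creation date."""
    totals = defaultdict(int)
    for created, word_count in entries:
        totals[(created.year, created.month)] += word_count
    # Return a plain dict ordered chronologically.
    return dict(sorted(totals.items()))


# Made-up sample data, for illustration only.
sample = [
    (date(2020, 1, 14), 12),
    (date(2020, 1, 30), 5),
    (date(2020, 2, 3), 40),
]
print(words_per_month(sample))
```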

Missing string/word count for the given locales:

locale		missing strings	missing words
ar		1086		23908
zh-CN		0		0
zh-TW		124		8169
nl		0		0
hi-IN		1004		25378
id		573		10245
it		0		0
ja		1277		26020
ko		411		9690
ms		1736		29955
pt-BR		0		0
pt-PT		446		8365
pl		534		17206
ru		0		0
es-ES		0		0
es-MX		501		8592
tr		285		15850
vi		1		93

Keeping DB query around for future re-use:

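from pontoon.base.models import Entity, Translation

# Locales from the request above.
LOCALES = [
    "ar", "zh-CN", "zh-TW", "nl", "hi-IN", "id", "it", "ja", "ko",
    "ms", "pt-BR", "pt-PT", "pl", "ru", "es-ES", "es-MX", "tr", "vi",
]

for l in LOCALES:
    # Entities that already have an approved translation in this locale.
    translated = Translation.objects.filter(
        entity__resource__project__slug="mozillaorg",
        locale__code=l,
        approved=True,
    ).values_list("entity__pk", flat=True)
    # Non-obsolete entities enabled for this locale without an approved translation.
    entities = Entity.objects.filter(
        resource__project__slug="mozillaorg",
        resource__translatedresources__locale__code=l,
        obsolete=False,
    ).exclude(pk__in=translated)
    print("{}\t\t{}\t\t{}".format(
        l,
        entities.count(),
        sum([e.word_count for e in entities])
    ))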

I'll add word counts for newly added strings in a separate comment.

Flags: needinfo?(m)

Word counts for the old (mozilla.lang) and new (mozilla.ftl) Mozilla.org projects:

mozilla.lang

month		str	words	no_tm	tm70%+	tm80%+	tm90%+	tm100%+
January		49	844	511	333	330	222	74
February	192	6323	6068	255	224	141	124
March		63	696	289	407	398	379	325
April		50	395	277	118	102	57	54
May		12	116	77	39	39	34	34

mozilla.ftl

month		str	words	no_tm	tm70%+	tm80%+	tm90%+	tm100%+
May		466	3014	1297	1717	1551	1131	781
June		540	12530	6017	6513	6334	5177	1727
July		400	4529	1425	3104	2889	2073	908
August		35	332	230	102	91	46	41
September	122	2406	882	1524	1471	1121	411
October		310	3672	1610	2062	1866	1367	1090
November	238	1940	362	1578	1364	923	508
December	500	13679	6517	7162	7025	6570	2933
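Note that the tm columns are cumulative: the script below counts a word under tm70%+ whenever its best match is at least 70%, so that column also contains everything in tm80%+, tm90%+ and tm100%+. For non-overlapping bands, subtract adjacent columns. A minimal sketch using the June row of the mozilla.ftl table (the function name `exclusive_bands` is made up for this example):

```python
def exclusive_bands(words, no_tm, tm70, tm80, tm90, tm100):
    """Convert cumulative threshold counts into non-overlapping bands."""
    bands = {
        "no match": no_tm,
        "70-79%": tm70 - tm80,
        "80-89%": tm80 - tm90,
        "90-99%": tm90 - tm100,
        "100%": tm100,
    }
    # Sanity check: the bands should account for every word in the month.
    assert sum(bands.values()) == words
    return bands


# June row of the mozilla.ftl table above.
june = exclusive_bands(12530, 6017, 6513, 6334, 5177, 1727)
print(june)
```

The same check (no_tm + tm70%+ equals words) also holds for the other rows, for example January in mozilla.lang: 511 + 333 = 844.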

Code:

from dateutil.relativedelta import relativedelta
from pontoon.base.models import Entity, Project, TranslationMemoryEntry
from pontoon.base.templatetags.helpers import as_simple_translation
from pontoon.base.utils import aware_datetime, get_last_months

months = sorted(
    [aware_datetime(year, month, 1) for year, month in get_last_months(14)]
)

PROJECT = Project.objects.get(slug='mozillaorg')
print("month\t\tstr\t\twords\t\tno_tm\t\ttm70%+\t\ttm80%+\t\ttm90%+\t\ttm100%+")

for month in months:
    output = []
    entities = Entity.objects.filter(
        resource__project=PROJECT,
        date_created__gte=month,
        date_created__lt=month + relativedelta(months=1),
    )
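    for e in entities:
        # Best German TM match for this string, taken only from entities
        # created before the current month and excluding the entity itself.
        try:
            tm = (
                TranslationMemoryEntry.objects
                .filter(locale__code="de")
                .minimum_levenshtein_ratio(as_simple_translation(e.string))
                .exclude(entity=e)
                .exclude(translation__approved=False, translation__fuzzy=False)
                .filter(entity__date_created__lt=month)
                .order_by('-quality')
            )[0]
        except Exception:
            # No match found (or the lookup failed): count as no TM match.
            tm = None
        output.append((e, tm))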
    print("{}\t\t{}\t\t{}\t\t{}\t\t{}\t\t{}\t\t{}\t\t{}".format(
        month.strftime("%B"),
        entities.count(),
        sum([e.word_count for e, _ in output]),
        sum([e.word_count for e, tm in output if tm is None]),
        sum([e.word_count for e, tm in output if tm is not None and tm.quality >= 70]),
        sum([e.word_count for e, tm in output if tm is not None and tm.quality >= 80]),
        sum([e.word_count for e, tm in output if tm is not None and tm.quality >= 90]),
        sum([e.word_count for e, tm in output if tm is not None and tm.quality == 100]),
    ))
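For readers unfamiliar with the quality score used above, the following is a minimal, standalone sketch of a Levenshtein-based similarity percentage. It assumes the ratio is defined as 1 minus the edit distance divided by the length of the longer string; Pontoon computes its score in the database and the exact formula may differ. The names `levenshtein` and `match_quality` are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def match_quality(source: str, candidate: str) -> float:
    """Similarity as a percentage; 100 means the strings are identical."""
    longest = max(len(source), len(candidate))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(source, candidate) / longest) * 100


print(round(match_quality("Download Firefox", "Download Firefox now")))
```

Under this assumed definition, a new string that only appends a short word to an existing one lands in the 80%+ band, which is why lightly edited or migrated strings show up as TM hits rather than as new words.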
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard