Data request: word count and TM hits for mozilla.org strings in 2020
Categories
(Webtools Graveyard :: Pontoon, task, P2)
Tracking
(Not tracked)
People
(Reporter: gueroJeff, Assigned: mathjazz)
Details
I'd like to get a word count analysis for mozilla.org for the year from November 2019 to November 1, 2020. This should include the total new words and TM hits (ideally broken down into 100% matches and percentage-based fuzzy match bands down to 70%).
Assignee
Comment 1•5 years ago
We enabled the new (FTL) Mozilla.org project on 2020-05-22.
I assume I should include words added to both projects: Mozilla.lang (words added after November 1, 2019) and Mozilla.ftl (all words added so far)? Regarding the latter, do we need to exclude the strings we migrated from .lang to .ftl (might be tricky)?
Reporter
Comment 2•5 years ago
Yes, we should include strings added to both projects. Treat the migrated strings as TM matches rather than trying to exclude them. You can use German, French, or Italian as the base target language for the matches.
Reporter
Comment 3•5 years ago
I need to tweak the scope of this request.
In addition to the total word count stats for 2020 (Jan 2020-Jan 2021), I need the current (as of today) missing word count for the mozilla.org project for the following locales:
- ar
- zh-CN
- zh-TW
- nl
- hi-IN
- id
- it
- ja
- ko
- ms
- pt-BR
- pt-PT
- pl
- ru
- es-ES
- es-MX
- tr
- vi
Reporter
Comment 4•5 years ago
Sorry to be a jerk, but I have one more dimension to add here: I also need a month-by-month count of newly added words from Jan 2020 to Jan 2021.
Assignee
Comment 5•5 years ago
Missing string/word count for the given locales:
locale   missing strings   missing words
ar                  1086           23908
zh-CN                  0               0
zh-TW                124            8169
nl                     0               0
hi-IN               1004           25378
id                   573           10245
it                     0               0
ja                  1277           26020
ko                   411            9690
ms                  1736           29955
pt-BR                  0               0
pt-PT                446            8365
pl                   534           17206
ru                     0               0
es-ES                  0               0
es-MX                501            8592
tr                   285           15850
vi                     1              93
Keeping DB query around for future re-use:
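from pontoon.base.models import Entity, Translation

# Locales requested in comment 3.
LOCALES = [
    "ar", "zh-CN", "zh-TW", "nl", "hi-IN", "id", "it", "ja", "ko",
    "ms", "pt-BR", "pt-PT", "pl", "ru", "es-ES", "es-MX", "tr", "vi",
]

for locale in LOCALES:
    # Entities that already have an approved translation in this locale.
    translated = Translation.objects.filter(
        entity__resource__project__slug="mozillaorg",
        locale__code=locale,
        approved=True,
    ).values_list("entity__pk", flat=True)
    # Active entities enabled for this locale that lack an approved translation.
    entities = Entity.objects.filter(
        resource__project__slug="mozillaorg",
        resource__translatedresources__locale__code=locale,
        obsolete=False,
    ).exclude(pk__in=translated)
    print("{}\t\t{}\t\t{}".format(
        locale,
        entities.count(),
        sum(e.word_count for e in entities),
    ))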
I'll add word counts for newly added strings in a separate comment.
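For anyone re-using the query outside a Pontoon shell, the counting logic can be sketched in plain Python. This is only an illustration: `SourceString` and `missing_counts` are hypothetical names that stand in for the Django models, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceString:
    """Hypothetical stand-in for a Pontoon entity (not the real model)."""
    pk: int
    word_count: int
    obsolete: bool = False


def missing_counts(strings, approved_pks):
    """Return (missing strings, missing words) for one locale.

    A string counts as missing when it is not obsolete and has no
    approved translation in the locale.
    """
    missing = [
        s for s in strings
        if not s.obsolete and s.pk not in approved_pks
    ]
    return len(missing), sum(s.word_count for s in missing)


# Made-up sample data: four strings, one obsolete, one already approved.
strings = [
    SourceString(1, 5),
    SourceString(2, 12),
    SourceString(3, 7, obsolete=True),
    SourceString(4, 3),
]
approved = {1}

print(missing_counts(strings, approved))  # (2, 15)
```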
Assignee
Comment 6•5 years ago
Word counts for old (mozilla.lang) and new (mozilla.ftl) Mozilla.org project:
mozilla.lang
month       str   words   no_tm   tm70%+   tm80%+   tm90%+   tm100%+
January      49     844     511      333      330      222        74
February    192    6323    6068      255      224      141       124
March        63     696     289      407      398      379       325
April        50     395     277      118      102       57        54
May          12     116      77       39       39       34        34
mozilla.ftl
month       str   words   no_tm   tm70%+   tm80%+   tm90%+   tm100%+
May         466    3014    1297     1717     1551     1131       781
June        540   12530    6017     6513     6334     5177      1727
July        400    4529    1425     3104     2889     2073       908
August       35     332     230      102       91       46        41
September   122    2406     882     1524     1471     1121       411
October     310    3672    1610     2062     1866     1367      1090
November    238    1940     362     1578     1364      923       508
December    500   13679    6517     7162     7025     6570      2933
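Note that the tm columns are cumulative: tm70%+ includes everything counted under tm80%+, and so on, and no_tm plus tm70%+ equals the total word count. If disjoint bands are needed, they can be derived by subtraction. A minimal sketch, where `disjoint_bands` is a hypothetical helper and the input is the January row of mozilla.lang:

```python
def disjoint_bands(words, no_tm, tm70, tm80, tm90, tm100):
    """Convert cumulative TM columns into non-overlapping bands."""
    if no_tm + tm70 != words:
        raise ValueError("no_tm + tm70%+ should equal the total word count")
    return {
        "no_match": no_tm,
        "70-79": tm70 - tm80,
        "80-89": tm80 - tm90,
        "90-99": tm90 - tm100,
        "100": tm100,
    }


# January row of the mozilla.lang table.
january_lang = disjoint_bands(844, 511, 333, 330, 222, 74)
print(january_lang)
# {'no_match': 511, '70-79': 3, '80-89': 108, '90-99': 148, '100': 74}
```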
Code:
from dateutil.relativedelta import relativedelta

from pontoon.base.models import Entity, Project, TranslationMemoryEntry
from pontoon.base.templatetags.helpers import as_simple_translation
from pontoon.base.utils import aware_datetime, get_last_months

# First day of each of the last 14 months, oldest first.
months = sorted(
    aware_datetime(year, month, 1) for year, month in get_last_months(14)
)
PROJECT = Project.objects.get(slug="mozillaorg")

print("month\t\tstr\t\twords\t\tno_tm\t\ttm70%+\t\ttm80%+\t\ttm90%+\t\ttm100%+")
for month in months:
    output = []
    # Entities created during this calendar month.
    entities = Entity.objects.filter(
        resource__project=PROJECT,
        date_created__gte=month,
        date_created__lt=month + relativedelta(months=1),
    )
    for e in entities:
        try:
            # Best German TM match whose entity was created before this month.
            tm = (
                TranslationMemoryEntry.objects
                .filter(locale__code="de")
                .minimum_levenshtein_ratio(as_simple_translation(e.string))
                .exclude(entity=e)
                .exclude(translation__approved=False, translation__fuzzy=False)
                .filter(entity__date_created__lt=month)
                .order_by("-quality")
            )[0]
        except Exception:
            # Typically IndexError: no TM entry reached the minimum ratio.
            tm = None
        output.append((e, tm))
    # The tm columns are cumulative: tm70%+ includes tm80%+, and so on.
    print("{}\t\t{}\t\t{}\t\t{}\t\t{}\t\t{}\t\t{}\t\t{}".format(
        month.strftime("%B"),
        entities.count(),
        sum(e.word_count for e, _ in output),
        sum(e.word_count for e, tm in output if tm is None),
        sum(e.word_count for e, tm in output if tm is not None and tm.quality >= 70),
        sum(e.word_count for e, tm in output if tm is not None and tm.quality >= 80),
        sum(e.word_count for e, tm in output if tm is not None and tm.quality >= 90),
        sum(e.word_count for e, tm in output if tm is not None and tm.quality == 100),
    ))
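To give a feel for what a match percentage means, here is a standalone sketch of a Levenshtein-based score and the band it falls into. The normalization by the longer string's length is an assumption made for illustration; the `quality` value that Pontoon computes in the database may use a different formula, so these numbers are not guaranteed to agree with the tables above. The helper names `levenshtein`, `quality`, and `band` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def quality(a, b):
    """Similarity in percent, normalized by the longer string (assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100 * (longest - levenshtein(a, b)) / longest


def band(q):
    """Map a similarity percentage to a match band label."""
    if q == 100:
        return "100"
    for floor in (90, 80, 70):
        if q >= floor:
            return "{}-{}".format(floor, floor + 9)
    return "no_match"


print(band(quality("Download Firefox", "Download Firefox")))      # 100
print(band(quality("Download Firefox", "Download Firefox now")))  # 80-89
```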