[decision] Where should we use "Number of literate speakers" info?

NEW
Unassigned

Status

P3
normal
2 years ago
20 days ago

People

(Reporter: mathjazz, Unassigned)

Tracking

Trunk

Details

User Story

Some facts:

1. "Number of literate speakers" data is taken from CLDR:
http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html

2. There have been reports that the numbers are not always accurate:
http://unicode.org/cldr/trac/ticket/10099

3. Even if the absolute numbers are not necessarily accurate, they still provide an estimate of how big a particular locale is compared to others.

4. We aren't aware of any other data source we could use for this purpose.

5. Data is shown on the Teams dashboard, on each project dashboard and in the heading of each team dashboard:
https://pontoon.mozilla.org/teams/
https://pontoon.mozilla.org/projects/pontoon-intro/
https://pontoon.mozilla.org/gd/

6. Data is useful at least for project managers, for example to identify big locales that aren't complete before the deadline.

Attachments

(2 attachments)

(Reporter)

Description

2 years ago
In dev-l10n a suggestion has been made to adjust the way we use the "Number of literate speakers" information:
https://groups.google.com/forum/#!topic/mozilla.dev.l10n/lNN0N_xt-Xo

Let's figure out where and when to present this information. See User story for more details.

Here are some of the options we have:

1. Get rid of the data completely, including from the database.

2. Keep the data, but only show it in the Locale Admin list.

3. Keep the data, but only show it in the Locale Admin list and in the heading of each team dashboard along with other locale data.

4. Keep everything unchanged, but only show the "Number of literate speakers" in locale listings to Admins.

5. Keep everything unchanged.

Comment 1

2 years ago
Re 3 (in first post): This won't even give you that data. Let's take Austria. CLDR makes the (ludicrous) claim that 95.0% speak Bavarian and that 98% of them are literate. The population of Austria is 8.4 million. That would suggest that there are 7.8 million people who are literate in Bavarian. That's so wrong it gives me a nosebleed (even though I'd *like* this to be the case). If there are 1,000 people literate in Bavarian I'd be surprised. So if any dev worked on the basis of this figure, they'd be assuming 7.8 million users/speakers who simply don't exist.
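
For reference, the 7.8 million figure follows directly from the three numbers quoted above. A minimal sketch of that arithmetic, where the function name is hypothetical and the percentages are passed as whole numbers to avoid floating-point rounding:

```python
def implied_literate_speakers(population: int, speaker_pct: int, literacy_pct: int) -> int:
    """Population x share of speakers x share of literate speakers (integer math)."""
    return population * speaker_pct * literacy_pct // 10_000


# Figures as quoted in the comment: 8.4 million people, 95% speakers, 98% literate.
implied = implied_literate_speakers(8_400_000, 95, 98)
print(f"{implied:,}")  # 7,820,400
```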

I would get rid of the data completely (option 1) or simply replace it with a tier system i.e. if we split our locales into 
Tier 3 < 1 million speakers
Tier 2 1-5 million speakers
Tier 1 > 5 million speakers

That would give devs a very rough guide about whether a big locale is being impacted by translations being behind but without getting into details that we could argue about till the cows come home.
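
As a rough illustration of the tier idea, the split could be sketched as below. The function name is hypothetical, and the handling of the exact boundaries (exactly 1 million and exactly 5 million both land in Tier 2) is an assumption read off the ranges as written:

```python
def speaker_tier(speakers: int) -> int:
    """Return 1, 2 or 3 using the thresholds proposed in this comment."""
    if speakers > 5_000_000:
        return 1  # Tier 1: more than 5 million speakers
    if speakers >= 1_000_000:
        return 2  # Tier 2: 1-5 million speakers
    return 3      # Tier 3: fewer than 1 million speakers
```

With only three buckets, an error of even a factor of two in the source data rarely changes the tier, which is the point of the proposal.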

There are not so many locales on Mozilla. I would suggest pulling the data of Wikipedia rather than CLDR which in this instance is not reliable at all. Wikipedia isn't entirely reliable for speaker numbers either but if we did a very broad tier system such as I suggested, it wouldn't matter because it's a rough guideline and for rough speaker figures, Wikipedia is better than CLDR or Ethnologue.

Comment 2

I would suggest not to overblow a CLDR error. Is there an easy way for us to compare how much CLDR differs from data we could pull from Wikipedia?

I'd be curious to see if it's a meaningful chunk of data or a couple exceptions that we can upstream as bugs to CLDR.

Comment 3

2 years ago
It is not a single CLDR error. It's virtually all off.

I already filed a bug with CLDR, but they work to their own timescales. In my experience, it will be at least half a year before a fix filters through, once it has been agreed what the fix actually entails, and I suspect that will be a long process with many cooks.

Comment 4

2 years ago
Michael, your assertions are way off. Take a look at https://de.wikipedia.org/wiki/Deutsche_Sprache#/media/File:Deutsche_Dialekte.PNG.

That said, I wish that CLDR had more documentation on why they change which number to which value. They'd effectively document the value of the data. But asserting that their data isn't data at all is just a fallacy.

Comment 5

All this being said, I feel confident with the idea of replacing number of speakers with "countries where spoken." This would be useful information for us to be able to advocate for localization when leadership identifies marketing focus territories. More so than number of speakers in that region.

Comment 6

> It is not a single CLDR error. It's virtually all off.

I'm sorry, but this is an opinion, not data. I asked for data.

Comment 7

2 years ago
I know what the German dialect map looks like. The map represents maximum geographical spread, NOT speaker density. According to that map, all of Munich speaks Bavarian and all of Hamburg speaks Platt. Which they don't (cf. this report https://www.welt.de/wissenschaft/article113938439/Muetter-Medien-Mobilitaet-Warum-Dialekte-sterben.html which states that in Hamburg the share of people who speak Platt dropped from 29% to 10% between 1984 and 2007).

Zibi, you know code, I know linguistics, I have a degree in the stuff. Which, if you want to label it "opinion", makes it an "expert opinion". I don't have time to research the entire dataset for your convenience. Read the bug on CLDR, there are some specifics there.

Comment 8

I agree with :guerojeff.
(Reporter)

Comment 9

a year ago
Created attachment 8868781 [details]
a-fullpage.png

I agree there's a value in having the "countries where spoken" information available for each locale.

We should spin that off as a separate bug though, because it opens a few additional questions. In particular:

- How do we present the countries? With names? Codes? Flags? Something else?
- How does the UI look for languages like French that are spoken in dozens of countries?
- How do we stay out of politics?
- Shall we talk territories instead of countries?

Going back to the original problem: attached is an example of implementing the tier system that Michael proposed. I used the same ranges as Google Play uses for the number of downloads, which splits languages into ~10 groups. I find that number to be a good compromise between A) solving the problem of virtually all speaker numbers being wrong and B) giving us a granular enough grouping of teams.

Comment 10

I find that data hard to digest, and tedious to read. Like, 50-100 thousand and 50-100 million are almost the same.

Maybe there's a way to color-code this? On a logarithmic scale or so?
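
One way to read the logarithmic idea, purely as a sketch: map each speaker count to a shade between 0.0 and 1.0 based on its order of magnitude. The function name and the 1 thousand to 1 billion clamp range are assumptions for illustration, not anything Pontoon does today.

```python
import math


def magnitude_shade(speakers: int, low: int = 1_000, high: int = 1_000_000_000) -> float:
    """Map a speaker count to a shade in [0.0, 1.0] on a log10 scale."""
    clamped = min(max(speakers, low), high)  # keep values inside the assumed range
    span = math.log10(high) - math.log10(low)
    return (math.log10(clamped) - math.log10(low)) / span
```

On this scale a locale around 75 thousand and one around 75 million end up half the range apart, so the two "50-100" rows would no longer look alike.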

Also, curious, what happens if you sort by the population column in the draft patch?
(Reporter)

Comment 11

a year ago
(In reply to Axel Hecht [:Pike] from comment #10)
> I find that data hard to digest, and tedious to read. Like, 50-100 thousand
> and 50-100 million are almost the same.

I agree, I tried not to make the column too wide.

There are at least two other variants:
http://stackoverflow.com/a/11537826
http://stackoverflow.com/a/34025940

I prefer the second, because it's a closed interval.

> Also, curious, what happens if you sort by the population column in the
> draft patch?

There's no patch yet, but we can make numbers sort properly regardless of the presentation (similarly as we do in the latest activity column for example).
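
A minimal sketch of what sorting "regardless of the presentation" could mean: keep the raw number next to the displayed label and sort on the number only. The row type, locale codes and figures below are made up for illustration and are not Pontoon's actual data model.

```python
from typing import NamedTuple


class LocaleRow(NamedTuple):
    code: str       # hypothetical locale code
    speakers: int   # raw number, used only as the sort key
    label: str      # text actually shown in the column


# Made-up example rows.
rows = [
    LocaleRow("xx-small", 75_000, "50-100 thousand"),
    LocaleRow("xx-large", 75_000_000, "50-100 million"),
    LocaleRow("xx-medium", 3_000_000, "1-5 million"),
]

# Sort on the underlying number, not on the label string.
by_speakers = sorted(rows, key=lambda row: row.speakers, reverse=True)
print([row.label for row in by_speakers])
```

Sorting on the label strings instead would put "1-5 million" ahead of "50-100 million", which is why the key has to be the raw value.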
(Reporter)

Comment 12

a year ago
Created attachment 8868785 [details]
Numbers only

Slightly updated proposal, using numbers instead of words.

1-5 million vs. 1 - 5.000.000

Comment 13

a year ago
That's a lot of zeroes to take in ;)

How about we just use m (million) and k (thousand) so you'd get
1-5m
5-10m
0.5-1k

It would keep the column narrow, and certainly in the English-speaking world m and k are very, very common abbreviations; even in spoken English "five kay" (instead of five thousand) is very common these days.
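
A small sketch of how such labels could be generated from a range's bounds. The function name is hypothetical, and picking the unit from the upper bound is an assumption that happens to reproduce the three examples above:

```python
def format_range(low: int, high: int) -> str:
    """Format a speaker range compactly, e.g. 1_000_000-5_000_000 as '1-5m'."""
    if high >= 1_000_000:
        divisor, suffix = 1_000_000, "m"
    elif high >= 1_000:
        divisor, suffix = 1_000, "k"
    else:
        divisor, suffix = 1, ""
    # ':g' drops trailing zeros, so 5.0 prints as '5' and 0.5 stays '0.5'.
    return f"{low / divisor:g}-{high / divisor:g}{suffix}"
```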

Comment 14

10 months ago
Why not just use the Ethnologue data for number of speakers? That's what UNESCO, Wikipedia etc. use. Focusing on literate speakers is highly discriminatory against regional and minority languages as well.
(Reporter)

Updated

3 months ago
Priority: P2 → P3

Comment 15

20 days ago
> 1. Get rid of the data completely, including from the database.

That would be my option. But since you said devs use it, I would only show it to devs. If that's not possible, let's hide it from as many people as possible.

Facts: the data is way off, it serves no real purpose but to help devs see if a "big" language is not complete. Let's keep it to the devs and spare the rest of us the pain/anger/disbelief of seeing numbers that do not reflect reality.

If you insist on keeping those numbers, for whatever reason, then let's ask each community to provide them, with references, of course.

Comment 16

20 days ago
That information is not only useless for speakers/linguists (who surely have better/preferred sources for the languages they are interested in), but it could also be an active impediment to the development of some languages that are already neglected at the national/regional level. Take Triqui as an example of an active locale in Pontoon:

Pontoon: 4,500 literate speakers
Mexico Census: 25,000 speakers (in 2010, on a seemingly upward trend) [1]

A Triqui speaker might go from "why do it?" to "hey, let's do it!" based on those numbers. We surely don't want to influence communities and collaboration by publishing data that is not really that accurate.

As a side note, they use the Latin script and, lately, up to three writing systems (depending on the intended audience), so "literate speakers" doesn't mean what it would mean for other languages with one established writing system.

[1] http://site.inali.gob.mx/pdf/libro_lenguas_indigenas_nacionales_en_riesgo_de_desaparicion.pdf