Should we collect share of IMEs with Telemetry?

RESOLVED FIXED in Firefox 63

Status

enhancement
RESOLVED FIXED
4 years ago
Last year

People

(Reporter: masayuki, Assigned: masayuki)

Tracking

({inputmethod})

44 Branch
mozilla63
Points:
---

Firefox Tracking Flags

(firefox63 fixed)

Details

(Whiteboard: tpi:-)

Attachments

(4 attachments)

I'm not familiar with Telemetry, but it would be useful to collect the share of IMEs (or keyboard layouts) on Windows, Mac OS X, Linux and Android, if that's possible.

On Windows, we can perhaps collect whether the user uses TSF mode or IMM mode and, in TSF mode, the TIP description, CLSID, GUID and lang ID.

On Mac, perhaps, we can collect the language and selected TIS name.

I'm not sure about Linux and Android.
Seems like a useful metric to have.
Whiteboard: tpi:-
Currently, Telemetry::ScalarSet() allows us to collect a string value.

On Windows, we need the TIP description or IMM name, plus the GUID if it's a TIP (perhaps a single string value is better, since we need to know the relation between the TIP name and the GUID), in case we need to do some hack without testers who can provide a TSFTextStore log. Note that we can/should get only the first CJK locale's TIP name. Typically, CJK users launch Firefox with a CJK IME active. If CJK is a secondary language for users in other areas, Firefox may be launched with their native language; however, if a CJK IME is activated after that, we should collect it.

On Linux, we can now get the context ID as a string. It won't change during a session.

On macOS, I'm still not sure. But could it be the name of the first keyboard layout with which the IME is opened?
I'll request review for the patches.

I think that we should collect the IME name when an IME is selected by the user (or when our process is launched while an IME is active). Nobody knows how many IMEs are in use in the world; in particular, there are too many third-party IMEs in China. Therefore, we cannot collect the data with a predefined enum. Additionally, some users may use two or more IMEs. So, *I* think that we need to use a keyed boolean Scalar to collect all IMEs which are selected in our process. The value is always set to true, but the key should carry enough information to identify the IME and be human-readable, because if we collect only a GUID or something similar, we may encounter an unknown IME.

So, I think that the key should be:

- on Windows, we should take the primary language of the IME and the IME name. Both a localized name and an English name may be collected for the same IME, since the retrieved name depends on the system locale. But I think that we can merge the data by hand if we need stricter data.

- on macOS, we should take the Input Source ID of the IME for non-Japanese IMEs and the Bundle ID of the IME for Japanese IMEs. The Input Source ID includes input mode information. This is important for non-Japanese IMEs because the input mode changes how characters are input. On the other hand, the input mode of a Japanese IME only changes the type of characters (Hiragana, Katakana, alphanumeric, etc.), and users may switch it even while typing a short paragraph. So, for a Japanese IME, the information we want is only the IME itself, and we should collect the Bundle ID instead.

- on Linux, we should collect the IM (e.g., fcitx or ibus; we cannot collect the conversion engine, such as Mozc). An IM is available even if the user does not use an "IME". For example, even when the user installs only the en-US keyboard layout, fcitx or ibus is installed in most environments. The information we need is the IM of users who use a composition string to input characters. So, we should collect the information only when a compositionstart is being dispatched.
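
The key shapes proposed in the list above could be sketched roughly as follows. This is only an illustration, not the patch itself: the function names, the `|` separator, and the hex formatting of the language ID are assumptions, and the identifiers in the usage note are made up.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Windows: combine the IME's language ID with its (possibly localized) name.
// The "0xLLLL|name" layout is a hypothetical format chosen for readability.
std::string MakeWindowsIMEKey(std::uint16_t aLangID,
                              const std::string& aIMEName) {
  char langID[8];  // "0x" + 4 hex digits + terminating NUL needs 7 bytes.
  std::snprintf(langID, sizeof(langID), "0x%04X",
                static_cast<unsigned>(aLangID));
  return std::string(langID) + "|" + aIMEName;
}

// macOS: a Japanese IME is identified by its Bundle ID (input mode ignored),
// any other IME by its Input Source ID (input mode is part of the identity).
std::string MakeMacIMEKey(bool aIsJapaneseIME,
                          const std::string& aInputSourceID,
                          const std::string& aBundleID) {
  return aIsJapaneseIME ? aBundleID : aInputSourceID;
}
```

With this sketch, `MakeWindowsIMEKey(0x0411, "Example IME")` gives `0x0411|Example IME`, and a Japanese IME reports the same key whether it is in Hiragana or Katakana mode.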

Ideally, on any platform, the IME name should be recorded only at the first compositionstart with an IME. However, I don't do that on Windows or macOS because it would make the code more complicated. If a user has installed IMEs which are not normally used but are selected temporarily while switching layouts, we'd receive noise. However, such users (like me) must be rare. I believe that we can ignore such noise because the collected data should be large enough to drown it out.
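
The "ideal" behavior described here (record an IME only at its first compositionstart, and ignore mere selection) might look like the sketch below. `IMEUsageRecorder` and its methods are hypothetical names for illustration; they are not part of Gecko.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical recorder: remembers which IMEs were already reported in this
// session so that each one is reported at most once.
class IMEUsageRecorder {
 public:
  // Call when a compositionstart is about to be dispatched while aIMEKey is
  // active. Returns true only the first time a given key is seen.
  bool OnCompositionStart(const std::string& aIMEKey) {
    return mReported.insert(aIMEKey).second;
  }

  // Merely selecting an IME (e.g. while cycling through layouts) records
  // nothing, which avoids the noise discussed above.
  void OnIMESelected(const std::string& /* aIMEKey */) {}

  std::size_t ReportedCount() const { return mReported.size(); }

 private:
  std::set<std::string> mReported;
};
```

A caller would forward the key to Telemetry only when `OnCompositionStart()` returns true.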
jimm:

Makoto-san and I are not so familiar with Telemetry. So, could you check whether I'm using the correct type (keyed boolean Scalar) to collect every IME which is selected by the user in our process, counting each only once per session (between the start and the end of the main process)? I.e., even if the user selects the same IME multiple times in a session, we should count the usage of that IME as 1.
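
To make the once-per-session counting concrete, here is a small stand-in for a keyed boolean scalar. It is a plain `std::map`, not the real Telemetry API, and `SetKeyedBool` and `CountSessionsPerIME` are hypothetical helpers for illustration only.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for one session's keyed boolean scalar: IME key -> true.
using KeyedBoolScalar = std::map<std::string, bool>;

// Setting the same key again is a no-op, so repeated selection of one IME
// within a session still leaves a single entry.
void SetKeyedBool(KeyedBoolScalar& aScalar, const std::string& aKey) {
  aScalar[aKey] = true;
}

// Aggregation side: each session contributes at most 1 per IME.
std::map<std::string, int> CountSessionsPerIME(
    const std::vector<KeyedBoolScalar>& aSessions) {
  std::map<std::string, int> counts;
  for (const auto& session : aSessions) {
    for (const auto& entry : session) {
      if (entry.second) {
        ++counts[entry.first];
      }
    }
  }
  return counts;
}
```

If one session sets "IME A" twice and "IME B" once, and another session sets only "IME A", the result counts "IME A" as 2 and "IME B" as 1.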
> What questions will you answer with this data?

The data will tell us what percentage of users (out of all IME users) use each IME.

> Why does Mozilla need to answer these questions? Are there benefits for users?

The share matters to us for deciding how important each IME-related bug report is.

> What alternative methods did you consider to answer these questions? Why were they not sufficient?

We've been fixing bugs in no particular order because we don't know how many users use each IME. We especially don't know the Chinese market, which has too many third-party IMEs.

> Can current instrumentation answer these questions?

I believe so.

> List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.

Language code and IME name of IME on Windows, category 1, tracking bug is this.
IME internal name of macOS, category 1, tracking bug is this.
IM name of Linux, category 1, tracking bug is this.

> How long will this data be collected? Choose one of the following:

I want to collect the data permanently since the IME share may change. New IMEs may be released, and some existing IMEs may be discontinued.

> What populations will you measure?

I want to collect data from the majority of users, so even on the release channel. However, this data is not collected from non-IME users. Only on Linux may we collect data from some Western users who use dead keys, since on Linux we cannot distinguish whether the active keyboard layout is an IME or a keyboard layout which just has dead key sequences.

> If this data collection is default on, what is the opt-out mechanism for users?

I use the Telemetry API, so users must be able to opt out from the privacy settings.

> Please provide a general description of how you will analyze this data.

I want to know what share of our users uses each IME. If the number of users is small enough and the bug reported for the IME is minor, we can postpone the fix until we have more time.

If we have a serious bug report and the IME has many users, we should fix it as soon as possible. I.e., the data is useful for deciding whether each fix should be uplifted.

> Where do you intend to share the results of your analysis?

In Bugzilla, for making the decisions mentioned above.
Assignee: nobody → masayuki
Flags: needinfo?(francois)

Comment 10

Last year
mozreview-review
Comment on attachment 8986393 [details]
Bug 1215818 - part 1: Add telemetry probe to collect TIP names of TSF which are actually used by the users

https://reviewboard.mozilla.org/r/251766/#review258258
Attachment #8986393 - Flags: review?(jmathies) → review+
BTW, I would suggest copying your answers into a .txt file and attaching it to this bug. It will be easier to keep track of which version of the data review was eventually r+.

(In reply to Masayuki Nakano [:masayuki] (JST, +0900) from comment #9)
> > What questions will you answer with this data?
> 
> The data will tell us how percentage of users (based on all IME users) use
> which IME.

What does IME stand for? Input Method E...?

It would be good to expand the acronym at least once in here.

> > Can current instrumentation answer these questions?
> 
> I believe so.

I think here you meant to say "No.". The current instrumentation does not capture the information, therefore you need to add new probes.

> > List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.
> 
> Language code and IME name of IME on Windows, category 1, tracking bug is
> this.
> IME internal name of macOS, category 1, tracking bug is this.

Are you collecting the language code in ISO 639-1 (e.g. en, fr, de, ja) -- https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes -- or are you collecting it in a different format?

Are you also including the name of the country? (e.g. en_US, en_GB, fr_CA)

Can you give me examples of what the IME name could be? Is it the vendor name of a piece of software that's installed on the user's machine? Does it include the version number?

> IM name of Linux, category 1, tracking bug is this.

Is that a typo (IM -> IME) or is it called "IM" on Linux?

> > What populations will you measure?
> want to collect data of majority users, so, even in release channel. However,
> these data are not collected from non-IME users. Only on Linux, we may
> collect data from some Western users who use dead keys since we cannot
> distinguish whether active keyboard layout is IME or keyboard layout which
> just has dead key sequence on Linux.

According to https://reviewboard.mozilla.org/r/251770/diff/1#index_header, it sounds like you're collecting the contents of an environment variable?

Are there really a lot of different third-party IMEs on Linux or is the list relatively fixed? Would whitelisting the known ones (and otherwise submitting "other") be an option on this platform to avoid collecting data on users with unusual configurations?
Flags: needinfo?(francois)
Posted file bug1215181-data.txt
(In reply to François Marier [:francois] from comment #11)
> BTW, I would suggest copying your answers into a .txt file and attaching it
> to this bug. It will be easier to keep track of which version of the data
> review was eventually r+.

Sure.

> (In reply to Masayuki Nakano [:masayuki] (JST, +0900) from comment #9)
> > > What questions will you answer with this data?
> > 
> > The data will tell us how percentage of users (based on all IME users) use
> > which IME.
> 
> What does IME stand for? Input Method E...?
> 
> It would be good to expand the acronym at least once in here.

Sure, I added a lot of explanation to the answer to the first question.

> > > Can current instrumentation answer these questions?
> > 
> > I believe so.
> 
> I think here you meant to say "No.". The current instrumentation does not
> capture the information, therefore you need to add new probes.

Oh, sorry, I misunderstood the question. We don't have any data as far as I know (neither internal nor public).

> > > List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.
> > 
> > Language code and IME name of IME on Windows, category 1, tracking bug is
> > this.
> > IME internal name of macOS, category 1, tracking bug is this.
> 
> Are you collecting the language code in ISO 639-1 (e.g. en, fr, de, ja) --
> https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes -- or are you
> collecting it in a different format?
> 
> Are you also including the name of the country? (e.g. en_US, en_GB, fr_CA)

It's a number between 0x0000 and 0xFFFF, called a Locale Identifier on Windows. I added the URL of the documentation.
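
For readers more used to codes like `ja` or `en_US`: assuming the 16-bit value follows the usual Windows language identifier layout (the same split the Windows SDK macros PRIMARYLANGID and SUBLANGID perform), it can be decomposed as in this sketch. The helper names here are my own.

```cpp
#include <cstdint>

// Low 10 bits: primary language (e.g. 0x11 for Japanese).
constexpr std::uint16_t PrimaryLangID(std::uint16_t aLangID) {
  return static_cast<std::uint16_t>(aLangID & 0x3FF);
}

// High 6 bits: sublanguage, i.e. the regional or script variant.
constexpr std::uint16_t SubLangID(std::uint16_t aLangID) {
  return static_cast<std::uint16_t>(aLangID >> 10);
}
```

For example, 0x0411 (Japanese) splits into primary language 0x11 and sublanguage 0x01, and 0x0804 (Simplified Chinese) into primary language 0x04 and sublanguage 0x02, so the single number carries both the language and the regional variant.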

> Can you give me examples of what the IME name could be? Is it the vendor
> name of a piece of software that's installed on the user's machine? Does it
> include the version number?

Okay, I added an example for each platform.

> > IM name of Linux, category 1, tracking bug is this.
> 
> Is that a typo (IM -> IME) or is it called "IM" on Linux?

Sorry for the confusion. An IME on Linux is split into two layers: one is the IM and the other is the conversion engine. The former is what matters to applications, and applications cannot retrieve information about the latter. See the new text for details.

> > > What populations will you measure?
> > want to collect data of majority users, so, even in release channel. However,
> > these data are not collected from non-IME users. Only on Linux, we may
> > collect data from some Western users who use dead keys since we cannot
> > distinguish whether active keyboard layout is IME or keyboard layout which
> > just has dead key sequence on Linux.
> 
> According to https://reviewboard.mozilla.org/r/251770/diff/1#index_header,
> it sounds like you're collecting the contents of an environment variable?

Almost yes.

> Are there really a lot of different third-party IMEs on Linux or is the list
> relatively fixed? Would whitelisting the known ones (and otherwise
> submitting "other") be an option on this platform to avoid collecting data
> on users with unusual configurations?

I think that collecting only the major IMs is enough on Linux, since an IM needs to be supported by the distribution in most cases. I know there are some minor IMs which are not in our list of known IMs, but as far as I've tested, such minor IMs are too unstable. So, if we see too large a number of "unknown", we should collect IM names more strictly. If you (or the reviewer, Makoto-san) think that we should collect non-whitelisted IM names from the start, I'll change the patch, though.
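
The whitelisting option raised in the review could be sketched like this. It is a hypothetical illustration of that option, not necessarily what the patch does: `NormalizeIMName` is an invented name, the list contains only the two IMs named in this bug, and "other" follows the reviewer's wording.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Returns the context ID unchanged if it is a known IM, otherwise "other",
// so that unusual configurations are not reported verbatim.
std::string NormalizeIMName(const std::string& aContextID) {
  // Illustrative whitelist; a real list would need more entries.
  static const std::array<const char*, 2> kKnownIMs = {"fcitx", "ibus"};
  const bool known =
      std::any_of(kKnownIMs.begin(), kKnownIMs.end(),
                  [&](const char* aName) { return aContextID == aName; });
  return known ? aContextID : std::string("other");
}
```

The trade-off is the one discussed above: a whitelist avoids collecting arbitrary strings, but a large "other" bucket would hide which IMs are missing from the list.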
Attachment #8986666 - Flags: review?(francois)
Hmm, I checked the code for Linux again. Perhaps we can make it collect any IM with a simple change.

Comment 14

Last year
mozreview-review
Comment on attachment 8986393 [details]
Bug 1215818 - part 1: Add telemetry probe to collect TIP names of TSF which are actually used by the users

https://reviewboard.mozilla.org/r/251766/#review258376
Attachment #8986393 - Flags: review?(m_kato) → review+

Comment 18

Last year
mozreview-review
Comment on attachment 8986394 [details]
Bug 1215818 - part 2: Add telemetry probe to collect IME usage on macOS

https://reviewboard.mozilla.org/r/251768/#review258382
Attachment #8986394 - Flags: review?(m_kato) → review+
Okay, I made IMContextWrapper set the IM name in Telemetry even if we don't know the name.

Comment 21

Last year
mozreview-review
Comment on attachment 8986395 [details]
Bug 1215818 - part 3: Add telemetry probe to collect IM share on Linux

https://reviewboard.mozilla.org/r/251770/#review258396

::: widget/gtk/IMContextWrapper.cpp:326
(Diff revision 2)
> +    const char* contextIDChar =
> +        gtk_im_multicontext_get_context_id(GTK_IM_MULTICONTEXT(mContext));
> +    if (!contextIDChar) {
> +        return nsDependentCSubstring();
> +    }
> +

We need to investigate the Wayland case once we support Wayland in the default build.
Attachment #8986395 - Flags: review?(m_kato) → review+
Comment on attachment 8986395 [details]
Bug 1215818 - part 3: Add telemetry probe to collect IM share on Linux

https://reviewboard.mozilla.org/r/251770/#review258396

> We need to investigate wayland case when we support wayland as default build.

Do you have an environment which can run Wayland?
(In reply to Masayuki Nakano [:masayuki] (JST, +0900) from comment #22)
> Comment on attachment 8986395 [details]
> Bug 1215818 - part 3: Add telemetry probe to collect IM share on Linux
> 
> https://reviewboard.mozilla.org/r/251770/#review258396
> 
> > We need to investigate wayland case when we support wayland as default build.
> 
> Do you have environment which can run wayland?

Yes, I have an environment (Debian/sid) on a GPD Pocket. But the default build of Firefox doesn't support Wayland yet. To build the Wayland version, we need the --enable-default-toolkit=cairo-gtk3-wayland build option.
francois: ping
Flags: needinfo?(francois)
Thanks for the updated document. I'll take a look on Monday.
Flags: needinfo?(francois)
Comment on attachment 8986666 [details]
bug1215181-data.txt

1) Is there or will there be **documentation** that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes, in Scalars.yaml.

2) Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, telemetry setting.

3) If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Masayuki Nakano.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1.

Based on comments in this bug, we have code that ensures that we are not collecting the arbitrary contents of the XMODIFIERS environment variable on Linux.

5) Is the data collection request for default-on or default-off?

Default on, all channels.

6) Does the instrumentation include the addition of **any *new* identifiers** (whether anonymous or otherwise; e.g., username, random IDs, etc.  See the appendix for more details)?

No.

7) Is the data collection covered by the existing Firefox privacy notice?

Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data?

No, permanent.
Attachment #8986666 - Flags: review?(francois) → review+
This was one of the most detailed data review requests I've seen. Thanks for all of the explanations and examples, it was very helpful!
(In reply to François Marier [:francois] from comment #27)
> This was one of the most detailed data review request I've seen. Thanks for
> all of the explanations and examples, it was very helpful!

Thank you!

Comment 29

Last year
Pushed by masayuki@d-toybox.com:
https://hg.mozilla.org/integration/autoland/rev/cbbdabbed7e0
part 1: Add telemetry probe to collect TIP names of TSF which are actually used by the users r=jimm,m_kato
https://hg.mozilla.org/integration/autoland/rev/1d5d0381e51d
part 2: Add telemetry probe to collect IME usage on macOS r=m_kato
https://hg.mozilla.org/integration/autoland/rev/fa829ff77595
part 3: Add telemetry probe to collect IM share on Linux r=m_kato