Closed Bug 1272354 Opened 5 years ago Closed 4 years ago

Add telemetry for DLC service

Categories

(Firefox for Android Graveyard :: General, defect)

All
Android
defect
Not set
normal

Tracking

(firefox55 fixed)

RESOLVED FIXED
Firefox 55
Tracking Status
firefox55 --- fixed

People

(Reporter: sebastian, Assigned: sebastian)

References

Details

Attachments

(5 files)

Now that DLC is shipping and we are enabling catalog sync (bug 1271352), it would be interesting to see how this performs in the wild. Especially to implement a better scheduling (bug 1257492).
Before we move l10n into DLC (bug 945123), we should know how DLC performs in practice.

I'll mark bug 945123 to block switching on ICU for Android, so this would be really good to have soon, and even think about potentially uplifting it to get more reliable data.
Blocks: 945123
Sebastian, is this something we might have soon? (I realize there are always competing priorities)
Flags: needinfo?(s.kaspari)
Yeah, I can squeeze this in. Let me know if you have specific questions / probe requirements. From IRC I remember that :Pike is interested in "install time until DLC ready". In addition to that I wanted to have some simple probes here for download completed/failed, synchronization of catalog completed/failed and maybe the type of network.
Assignee: nobody → s.kaspari
Status: NEW → ASSIGNED
Flags: needinfo?(s.kaspari)
Thanks. Yeah, what I said on irc is my main point of interest, the other things sound like good additional information to me, too.
I looked into this a bit today and one of the things I'm still not sure about is whether we want to track this (success, failure, time etc.) for every item in the catalog or generate more something like a summarized report. Currently we have 48 files in the catalog (fonts + hyphenation dictionaries) and my only concern here is that this could result in a lot of telemetry data. Although we do not perform downloads often (as there are not many updates to the current set of files). Synchronizing the catalog (or more like checking that there's no new data available) is something we do more often though.

Ignoring probe sizes I'd track for success: content id, time, network type. And for errors additionally the type of error (We already handle specific error type classes internally). For the catalog sync I'd track the same type of data but instead of content id I'd track the number of updated records.
(In reply to Sebastian Kaspari (:sebastian) from comment #5)
> I looked into this a bit today and one of the things I'm still not sure
> about is whether we want to track this (success, failure, time etc.) for
> every item in the catalog or generate more something like a summarized
> report. Currently we have 48 files in the catalog (fonts + hyphenation
> dictionaries) and my only concern here is that this could result in a lot of
> telemetry data. Although we do not perform downloads often (as there are not
> many updates to the current set of files). Synchronizing the catalog (or
> more like checking that there's no new data available) is something we do
> more often though.

One efficient way to track this would be using a categorical histogram [1] for each success & failure scenario.
For times i'd assume monitoring this in one histogram is a good start, if this data shows real issues then this could be followed up by QA, more detailed measurements, ...

Are you looking into getting this from opt-in Telemetry or opt-out (i.e. beta vs. release population)?

1: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type
FWIW, i'm just limiting categorical histograms to 50 labels today.
This sounds like a potential reason to increase that limit or allow overriding it.
Thanks for the hints!

(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> Are you looking into getting this from opt-in Telemetry or opt-out (i.e.
> beta vs. release population)?

Oh, yeah, this is an interesting problem. Most of the initial and interesting events happen way before the user had a chance to opt-in - so we will not be able to measure that. That's a problem.
On pre-release (Beta, Aurora, Nightly) that is not a problem (we default to "opt-in" data being enabled there).
So the question is mainly: do you need the data from most release users or is the beta population enough to track the status?

The "core" ping is what we get for all release users, but is designed for very minimal b/w impact.
The richer Telemetry data we only get for pre-release users (and the few release users that opted in).
:sebastian, any update on this? It's a blocked on my DLC l10n in Fennec list.
Flags: needinfo?(s.kaspari)
Hi Sebastian
I just walked by and have no idea what is downloadable content and I found this page : https://wiki.mozilla.org/Mobile/Fennec/Android/Downloadable_Content.

Is this page still valid? Thank you!
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #10)
> :sebastian, any update on this? It's a blocked on my DLC l10n in Fennec list.

Sorry that this is still taking time. This week I'm still busy getting Focus for Android out of the door.

(In reply to Nevin Chen [:nechen] from comment #11)
> I just walked by and have no idea what is downloadable content and I found
> this page :
> https://wiki.mozilla.org/Mobile/Fennec/Android/Downloadable_Content.
> 
> Is this page still valid? Thank you!

Yes, this page is still valid. Although not as detailed as I would like it to be. If you are going to work on DLC bugs then let's jump on Vidyo and I can onboard you. :)
Flags: needinfo?(s.kaspari)
Those patches add some basic telemetry to monitor successful and failed downloads as well as catalog synchronizations. This could be even more detailed. However this will give us a good overview and based on the data we can decide what details might be interesting to see.
Comment on attachment 8854025 [details]
Bug 1272354 - (DLC) Download Action: If there's no download scheduled then exit early.

https://reviewboard.mozilla.org/r/126018/#review128762
Attachment #8854025 - Flags: review?(gkruglov) → review+
Comment on attachment 8854026 [details]
Bug 1272354 - (DLC) DownloadAction: Move network checks inside download loop.

https://reviewboard.mozilla.org/r/126020/#review128764

::: commit-message-79c77:4
(Diff revision 2)
> +Bug 1272354 - (DLC) DownloadAction: Move network checks inside download loop. r?grisha
> +
> +This has two advantages:
> +* If the network changes while we are downloading content then we do not continue on an unmetered network.

I think you meant "... on a metered network."?
Attachment #8854026 - Flags: review?(gkruglov) → review+
Comment on attachment 8854027 [details]
Bug 1272354 - (DLC) Download Action: Add telemetry to track download success/failure.

https://reviewboard.mozilla.org/r/126022/#review128766

It seems like a histogram event to record % success might be interesting. E.g. scheduled 13 things to download, succeeded with 10, record 0.76. Although, rescheduling files might muddy that ratio a bit.

::: mobile/android/base/java/org/mozilla/gecko/dlc/DownloadAction.java:82
(Diff revision 2)
>  
>                  File destinationFile = getDestinationFile(context, content);
>                  if (destinationFile.exists() && verify(destinationFile, content.getChecksum())) {
>                      Log.d(LOGTAG, "Content already exists and is up-to-date.");
>                      catalog.markAsDownloaded(content);
> +                    DownloadContentTelemetry.eventDownloadSuccess(content);

I'm not familiar with the scheduler - under what conditions would we have a scheduled file already present and matching checksums locally? If there isn't a good reason why this might happen, perhaps that's not a "success" (even though "nothing to do" is a great outcome for this iteration of the loop!)
Attachment #8854027 - Flags: review?(gkruglov) → review+
Comment on attachment 8854028 [details]
Bug 1272354 - (DLC) Sync Action: Add telemetry to track sync success/failure.

https://reviewboard.mozilla.org/r/126024/#review128768
Attachment #8854028 - Flags: review?(gkruglov) → review+
Comment on attachment 8854029 [details]
Bug 1272354 - Add in-tree documentation of DLC telemetry.

https://reviewboard.mozilla.org/r/126026/#review128770

Looks great!

::: mobile/android/docs/downloadcontenttelemetry.rst:17
(Diff revision 2)
> +All UI events use action ``action.1`` and method ``service``. Additional information is structured as JSON blobs (``extras`` field).
> +
> +Downloading content
> +===================
> +
> +Telemetry about downloaded content has the extra field ``action`` set to ``dlc_download``. The ``result`` field will be set to "success" or ``failure`` depending on whether the content could be downloaded successfully. The ``content`` field will contain the UUID of the content that was downloaded or failed to be downloaded. This is helpful to identify potentially corrupt content.

s/"success"/``success``

::: mobile/android/docs/downloadcontenttelemetry.rst:56
(Diff revision 2)
> +    }
> +
> +Synchronizing catalog
> +=====================
> +
> +The app has a local catalog of content that can be downloaded. This catalog can be updated and is synchronized from a server. Telemetry about synchronizing the catalog has the extra field ``action`` set to ``dlc_sync``. Like for content downloads the ``result`` field will be set to "success" or ``failure`` depending on whether the synchronization was successful or not.

s/"success"/``success``

::: mobile/android/docs/downloadcontenttelemetry.rst:70
(Diff revision 2)
> +.. code-block:: js
> +
> +    extras: {
> +        "action": "dlc_sync",
> +        "result": "success",
> +        "updated": false,

This is just "fetched catalog size > 0", right? Would it be more useful to just record the number itself?
Attachment #8854029 - Flags: review?(gkruglov) → review+
Comment on attachment 8854029 [details]
Bug 1272354 - Add in-tree documentation of DLC telemetry.

https://reviewboard.mozilla.org/r/126026/#review129060

::: mobile/android/docs/downloadcontenttelemetry.rst:17
(Diff revision 2)
> +All UI events use action ``action.1`` and method ``service``. Additional information is structured as JSON blobs (``extras`` field).
> +
> +Downloading content
> +===================
> +
> +Telemetry about downloaded content has the extra field ``action`` set to ``dlc_download``. The ``result`` field will be set to "success" or ``failure`` depending on whether the content could be downloaded successfully. The ``content`` field will contain the UUID of the content that was downloaded or failed to be downloaded. This is helpful to identify potentially corrupt content.

If you could clarify that the "content" is described in the "Synchronizing catalog" section, that'd be helpful.

::: mobile/android/docs/downloadcontenttelemetry.rst:19
(Diff revision 2)
> +Downloading content
> +===================
> +
> +Telemetry about downloaded content has the extra field ``action`` set to ``dlc_download``. The ``result`` field will be set to "success" or ``failure`` depending on whether the content could be downloaded successfully. The ``content`` field will contain the UUID of the content that was downloaded or failed to be downloaded. This is helpful to identify potentially corrupt content.
> +
> +Telemetry extras send for a successful content download might look like this:

Nit: sent
Attachment #8854029 - Flags: review?(liuche) → review+
Comment on attachment 8854027 [details]
Bug 1272354 - (DLC) Download Action: Add telemetry to track download success/failure.

https://reviewboard.mozilla.org/r/126022/#review128766

> I'm not familiar with the scheduler - under what conditions would we have a scheduled file already present and matching checksums locally? If there isn't a good reason why this might happen, perhaps that's not a "success" (even though "nothing to do" is a great outcome for this iteration of the loop!)

For example:
* An entry in the catalog could change. In this case we might trigger a download but the actual file has not changed.
* Our process is killed after the download but before we could update the catalog on disk
* We ship a very basic catalog in the app. After updating the catalog from the server we might end up with a somehow different catalog but we already have the file locally.
* And: If there's already a file in the location we want to put it: Let's just test if its already the right one.. :)

The reason why I want to report this as success is that if we haven't been able to mark this file as correctly downloaded in the catalog then it probably wasn't reported to telemetry either.
Comment on attachment 8854029 [details]
Bug 1272354 - Add in-tree documentation of DLC telemetry.

https://reviewboard.mozilla.org/r/126026/#review128770

> This is just "fetched catalog size > 0", right? Would it be more useful to just record the number itself?

Good question. Right now I don't think the actual number if meaningful - the value heavily depends on when the sync happened and at what time we updated the catalog. But let's see - this is the first iteration.
Pushed by s.kaspari@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/2512c717a261
(DLC) Download Action: If there's no download scheduled then exit early. r=Grisha
https://hg.mozilla.org/integration/autoland/rev/1b8c117b4c96
(DLC) DownloadAction: Move network checks inside download loop. r=Grisha
https://hg.mozilla.org/integration/autoland/rev/ff4b04a63533
(DLC) Download Action: Add telemetry to track download success/failure. r=Grisha
https://hg.mozilla.org/integration/autoland/rev/1f4417a2d084
(DLC) Sync Action: Add telemetry to track sync success/failure. r=Grisha
https://hg.mozilla.org/integration/autoland/rev/65632edcba8b
Add in-tree documentation of DLC telemetry. r=Grisha,liuche
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.