1528278 - Provide baseline for update download time and tools to keep measuring

Reporter

Description

7 years ago

Brief description of the request:
The first part of the update agent (bug 1343669), which downloads updates while Firefox is closed (the download starts while Firefox is open and continues to completion after Firefox is closed), is landing in Firefox 67. Our goal is to reduce user orphaning through faster update downloads, and we want to validate that once it lands we see improvements compared to previous updates.
This bug is about setting a baseline for the time it takes to download a MAR file (Firefox N to Firefox N+1 updates) so that we can measure improvements as the BITS download feature (download while Firefox is closed) lands on Nightly and then rides the trains. A distribution of download times across time buckets, created for major updates, feels more appropriate than the average or median download time because it helps highlight the scenarios where we make the most improvements.
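A minimal sketch of the bucketed distribution described above, in Python. The bucket edges, the labels, and the function names are assumptions chosen for illustration; they are not taken from this bug or from any existing dashboard.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket edges in seconds (not taken from this bug).
BUCKET_EDGES = [60, 300, 1800, 7200, 86400]
BUCKET_LABELS = ["<1m", "1-5m", "5-30m", "30m-2h", "2h-1d", ">=1d"]


def bucket_label(seconds):
    """Return the label of the bucket a download duration falls into."""
    # bisect_right puts a value equal to an edge into the bucket above it.
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES, seconds)]


def bucket_shares(durations):
    """Return the fraction of downloads in each bucket, in bucket order."""
    counts = Counter(bucket_label(d) for d in durations)
    total = len(durations)
    return {label: (counts[label] / total if total else 0.0) for label in BUCKET_LABELS}
```

Comparing the per-bucket shares before and after the feature lands would show whether the slow buckets shrink, which an average or median can hide.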

Link to any assets:

Is there a specific data scientist you would like or someone who has helped to triage this request:
No

Jim Mathies [:jimm]

Comment 1

7 years ago

Data science will reach out to Rob Strong to identify the telemetry / analysis methods we can use to measure this.

Jess Mokrzecki [:jmok]

Comment 2

7 years ago

(In reply to Jim Mathies [:jimm] from comment #1)

Data science will reach out to Rob Strong to identify the telemetry / analysis methods we can use to measure this.
Was there someone you had already spoken to about this and agreed to reach out to Rob?

Romain Testard [:RT]

Reporter

Comment 3

7 years ago

(In reply to Jess Mokrzecki [:jmok] from comment #2)

(In reply to Jim Mathies [:jimm] from comment #1)

Data science will reach out to Rob Strong to identify the telemetry / analysis methods we can use to measure this.
Was there someone you had already spoken to about this and agreed to reach out to Rob?
Jim and I spoke about this. This note is really only about informing the data scientist that Robert is the expert who can help understand how to access the data and interpret it.

Jess Mokrzecki [:jmok]

Updated

7 years ago

Assignee: nobody → sguha

Status: NEW → ASSIGNED

Points: --- → 3

"Saptarshi Guha[:joy]"

Assignee

Comment 4

7 years ago

Hi Rob,
This is a very clear ask, i.e.

X = time taken to download a MAR file (Firefox N to Firefox N+1 updates)

Do we measure successful and incomplete downloads of MAR files, the bytes downloaded, and the versions involved (from what value of N to what value of M)? In the above M = N+1; is M always equal to N+1? IIRC the answer is yes.

Is this measurement in main_summary?

Flags: needinfo?(robert.strong.bugs)

Robert Strong (they/them - no direct email)

Comment 5

7 years ago

Success / failure for downloads are in telemetry but mar size and time to download are not in telemetry. Also, downloads can occur across multiple sessions which will complicate the measurements.

Flags: needinfo?(robert.strong.bugs)

"Saptarshi Guha[:joy]"

Assignee

Comment 6

7 years ago

Is there a way we can get into telemetry something that measures

bytes downloaded, time taken, MAR file id, from version, completed

where given you mentioned that it can span multiple sessions

bytes: the bytes downloaded in that session
time: the time taken to download those bytes
MAR file id: corresponds to the MAR file being downloaded, so it would remain the same across sessions (if multiple sessions are required)
from version: the N in comment 4
completed: whether this session completed the download of the MAR file

I'm less interested in success meaning that the MAR file could be applied correctly than in success defined as the MAR file being completely and correctly downloaded. Hence completed would correspond to success here.
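To make the proposed measurement concrete, here is a rough sketch of the per-session record and of how records from several sessions could be combined. The class, field, and function names are hypothetical; nothing like this exists in telemetry at the time of this comment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadSessionRecord:
    """One session's share of a download (hypothetical record layout)."""
    download_id: str       # same value for every session of one download
    from_version: str      # the N in "Firefox N to Firefox N+1"
    bytes_downloaded: int  # bytes downloaded in this session only
    seconds: int           # time spent downloading in this session
    completed: bool        # True if this session finished the download


def summarize_downloads(records):
    """Combine per-session records into one summary per download_id."""
    summaries = defaultdict(
        lambda: {"bytes": 0, "seconds": 0, "sessions": 0, "completed": False}
    )
    for record in records:
        summary = summaries[record.download_id]
        summary["bytes"] += record.bytes_downloaded
        summary["seconds"] += record.seconds
        summary["sessions"] += 1
        summary["completed"] = summary["completed"] or record.completed
    return dict(summaries)
```

Grouping on the shared ID is what lets a download that spans multiple sessions be counted once.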

Do you think we could get these measurements (or something akin to this) into client telemetry?

Flags: needinfo?(robert.strong.bugs)

Robert Strong (they/them - no direct email)

Comment 7

7 years ago

I think so for the current download mechanism. I'll find out how close we can get to similar measurements for the new method.

We have success for download already as noted in comment #5 but if necessary we could also add another one.

For MAR file ID, something could be created to denote that the download is the same download across sessions. Since there are cases where the same mar could be downloaded again I think this is closer to what you want vs an ID for the mar file?

Flags: needinfo?(robert.strong.bugs) → needinfo?(sguha)

"Saptarshi Guha[:joy]"

Assignee

Comment 8

7 years ago

For MAR file ID, something could be created to denote that the download is the same download across sessions.

yes this exactly. And yes, we would like similar (or as close as possible) readings from the BITS interface.

thanks much

Flags: needinfo?(sguha)

Robert Strong (they/them - no direct email)

Comment 9

7 years ago

Edited

Hi Saptarshi,

We're limited in what data points we can get from BITS with just the BITS download implementation. We should be able to get many more BITS data points with the complete Update Agent implementation but with just the BITS download implementation the data points are much more limited. Below I've outlined the data points we can get for the BITS update download and the Firefox update download that I hope will suffice for the analysis.

The overall difference between downloading with BITS and downloading in Firefox can be measured using the average time it takes to finish an update with the BITS update download and with the Firefox update download. All intervals will be recorded in seconds.

Interval from the start of the "Update Check" to the start of the "Update Download".
Interval from start of the "Update Download" to "Update Download Ready".
Note: The phrase "Update Download Ready" represents a couple of different possible next steps:

Start of "Update Staging" when staging is available. This will be followed by "Update Pending".
"Update Pending" when the update will be applied without staging.

Interval from "Update Download Ready" to "Update Pending". This will be the time in seconds to stage the update or 0 when it isn't possible to stage the update.
Interval from "Update Pending" to "Update Applied".

To provide a consistent comparison of the time it takes for a client to update, the intervals from the start of "Update Check" or "Update Download" through any of the phases that come after "Update Download" can be added together. I suggest using "Update Applied" to get the overall effect that using BITS to download an update has on getting clients updated, though measuring through the next phase, "Update Download Ready", would exclude failures that occur in later phases.
Note: the initial implementation of BITS download will likely have little effect on the majority of clients, which already update within a short period of time, but it should show an improvement for clients that run Firefox only seldom and / or for short periods of time. The full Update Agent implementation which is planned to happen after BITS download should show an improvement for the majority of clients.
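As an illustration of adding the intervals, the sketch below sums them from "Update Check" through a chosen phase. The phase keys and the function name are assumptions made for this example, and it follows the plan above in which the staging interval is 0 when staging isn't possible.

```python
# Hypothetical phase keys, in the order the intervals are listed above.
PHASES = ["check", "download", "stage", "apply"]


def seconds_through(intervals, last_phase="apply"):
    """Sum the intervals from "check" through last_phase.

    Returns None when any phase on the way is missing, meaning the
    update never got that far.
    """
    total = 0
    for phase in PHASES[: PHASES.index(last_phase) + 1]:
        value = intervals.get(phase)
        if value is None:
            return None
        total += value
    return total
```

For example, `seconds_through({"check": 2, "download": 40, "stage": 0, "apply": 600})` adds up all four intervals, while passing `last_phase="download"` stops at "Update Download Ready".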

It is possible to just record a single interval from "Update Download" through "Update Download Ready" for this, but the additional intervals will provide telemetry that we can use to better evaluate issues with the different phases and their effect on updating clients.

Additional data that will be submitted to telemetry:
The app version that was updated from.
Bytes per second for the download. We won't always be able to record this for the full download in the BITS case but we can get bytes per second for BITS while Firefox is running.
Total bytes downloaded.

Since there are cases where different telemetry client IDs would submit individual intervals, they will be stored in active-update.xml until the update has finished so that they are all submitted at the same time and by the same telemetry client ID.

Flags: needinfo?(sguha)

"Saptarshi Guha[:joy]"

Assignee

Comment 10

7 years ago

So if I understand correctly, these are time durations (intervals)?
If a profile successfully updates, then the time between Update Check and Update Applied is the time taken to successfully update.
If "Update Applied" never happened, one would have to backtrack to the last successful stage to see the time taken to reach it. What would be the values of missing stages? I guess their entries would be missing?

Looking at Romain's comment, I had some thoughts

BITS might not lead to faster downloads because IIRC BITS can throttle the downloads, right?
but the promise of BITS is that downloads will continue even if Firefox is closed and hence the following hypotheses
(a) we have more profiles with completed downloads (of MAR files)
(b) we have more profiles that successfully update to a newer version of Firefox

Thus while Romain did ask for a baseline of download rates, ultimately I think we're interested in hypotheses (a) and (b), correct?
Hence once this lands we hope to see

for seldom/infrequent users of Firefox, a higher rate of successful MAR downloads (do we currently measure the success of MAR downloads?) after they get this feature.
for the same set of users, a higher update rate (do we also measure whether the MAR file was successfully applied?)

Lastly, why do you say "The full Update Agent implementation which is planned to happen after BITS download should show an improvement for the majority of clients."

Flags: needinfo?(sguha)

Flags: needinfo?(rtestard)

Flags: needinfo?(robert.strong.bugs)

Robert Strong (they/them - no direct email)

Comment 11

7 years ago

Edited

(In reply to "Saptarshi Guha[:joy]" from comment #10)

So if I understand correctly, these are time durations (intervals)?

Correct and they are in seconds.

If a profile successfully updates, then the time between Update Check and Update Applied is the time taken to successfully update.

Correct

If "Update Applied" never happened, one would have to backtrack to the last successful stage to see the time taken to reach it. What would be the values of missing stages? I guess their entries would be missing?

That is what I am leaning towards. It is also possible to check for a success code in existing telemetry if we want to measure the time it took to fail.

Looking at Romain's comment, i had some thoughts

BITS might not lead to faster downloads because IIRC BITS can throttle the downloads, right?

That is one reason. If BITS fails for whatever reason then we'll fall back to using Firefox to download the update, which will also take longer due to the time it takes to fail and start the download in Firefox. There are probably other reasons as well.

but the promise of BITS is that downloads will continue even if Firefox is closed and hence the following hypotheses

(a)we have more profiles with completed downloads (of MAR files)

(b) we have more profiles that successfully update to newer version of Firefox

More installations (profiles if you prefer since that more closely matches telemetry) that would download the MAR file sooner and hence update sooner.
I don't anticipate that it will have a significant effect on the success of downloading or updating.

The main reason for older Firefox versions taking a very long time to update was identified using telemetry and fixed in Firefox 49 and further improvements were made in Firefox 52. This can be seen with the "Update Watersheds" by comparing Firefox 43.0.1 and 47.0.2 with Firefox 56.0 in the "Out of date, of concern client distribution across Firefox versions" section of the "Firefox Application Update Out Of Date Dashboard".
https://telemetry.mozilla.org/update-orphaning/#version-dist-chart

The main reason I recommended using BITS to download while Firefox isn't running is to update the clients that seldom run Firefox or when they do run Firefox they only have it open for a very short period of time. This can be seen in the "Out of date, potentially of concern reason distribution" section of the "Firefox Application Update Out Of Date Dashboard". Specifically, the clients that "Ran for less than 2 hours" during the previous 12 weeks (84 days) and have "Less than 4 update pings" during the previous 12 weeks (84 days).
https://telemetry.mozilla.org/update-orphaning/#not-min-reqs-chart
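The two dashboard criteria quoted above can be written as a small predicate. This is only a sketch of the thresholds as stated in this comment; the function and parameter names are hypothetical and the dashboard's actual implementation may differ.

```python
# Window used by the dashboard sections quoted above: 12 weeks.
WINDOW_DAYS = 84


def of_concern_reasons(hours_run, update_pings):
    """Return which of the two quoted reasons apply to a client.

    hours_run and update_pings are totals over the previous WINDOW_DAYS.
    """
    reasons = []
    if hours_run < 2:
        reasons.append("Ran for less than 2 hours")
    if update_pings < 4:
        reasons.append("Less than 4 update pings")
    return reasons
```

Clients matching either reason are the ones the BITS download is expected to help most, since they rarely keep Firefox open long enough to finish a download.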

Thus while Romain did ask for a baseline of download rates, ultimately I think we're interested in hypotheses (a) and (b), correct?

For this phase of the project I am interested in getting Windows clients that seldom run Firefox or only run Firefox for a short period of time updated faster without adversely affecting the clients that are already considered to be updating at a reasonable pace. If all Windows clients update faster with the BITS download implementation then that is all the better. For the full implementation I am interested in all Windows clients updating faster, having less interaction with updating, and possibly some other improvements we've been considering.

Hence once this lands we hope to see

for seldom/infrequent users of Firefox, a higher rate of successful MAR downloads (do we currently measure the success of MAR downloads?) after they get this feature.

Faster successful download rate for these users. We don't expect there to be a change in success vs. failure of downloads.

We already measure download success and failure (including error code). Also, the likely path we take is that a value for the interval between "Update Download" and "Update Download Ready" represents a successful download.

for the same set of users, a higher update rate (do we also measure whether the MAR file was successfully applied?)

Faster successful update rate for these users. We don't expect there to be a change in success vs. failure of applying updates.

We already measure applied success and failure (including error code). Also, the likely path we take is that a value for the interval between "Update Pending" and "Update Applied" represents an update that was successfully applied.

Lastly, why do you say "The full Update Agent implementation which is planned to happen after BITS download should show an improvement for the majority of clients."

This is just the BITS download portion of the Update Agent. The full project would also perform the update check, the download, and apply the update - all without Firefox running.

Flags: needinfo?(robert.strong.bugs)

Robert Strong (they/them - no direct email)

Updated

7 years ago

Flags: needinfo?(sguha)

Robert Strong (they/them - no direct email)

Comment 12

7 years ago

I didn't go into detail explaining "Update Watersheds", why the data shows an improvement, or the other data points. If you'd like more detail feel free to ask and I can either respond in a comment or we can have a vidyo meeting to discuss it if you prefer.

Robert Strong (they/them - no direct email)

Updated

7 years ago

Blocks: 1539154

Robert Strong (they/them - no direct email)

Comment 13

7 years ago

Edited

Saptarshi, I'm just about done with the implementation and changed a couple of things.

Since it isn't possible to guarantee the bytes per second for the full download and it is possible that the time recording the download is extremely small I decided to go with recording the number of seconds spent recording the download and the number of bytes downloaded during that period. This makes it possible to exclude records based off of the time spent recording the download, the number of bytes downloaded, or both if desired. This will of course require dividing the bytes downloaded by the seconds downloading when querying telemetry.
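The division at query time could look roughly like the sketch below. The function name and the default thresholds are assumptions; the point is only that records with a very short recording time or very few bytes can be excluded before dividing.

```python
def bytes_per_second(bytes_downloaded, seconds, min_seconds=1, min_bytes=1):
    """Return the download rate, or None when the record should be excluded.

    min_seconds and min_bytes are hypothetical exclusion thresholds; a record
    below either one is skipped rather than producing a misleading rate.
    """
    if seconds < min_seconds or bytes_downloaded < min_bytes:
        return None
    return bytes_downloaded / seconds
```

With the defaults, a record of 0 seconds is excluded, which also avoids a division by zero.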

Since a BITS update download can fall back to an in-Firefox update download, I am going to add a new data point so it is possible to tell if the update download was done using BITS, Firefox, or BITS then falling back to Firefox.

I went with only recording successful updates since a successful update only occurs after restarting and scalars can only be recorded once per session. This way there won't be any chance of records being overwritten.

The MAR file size for the entire download will still be recorded as well.

The other data points mentioned will still be recorded at the same time after a successful update.

Is this acceptable for you?

"Saptarshi Guha[:joy]"

Assignee

Comment 14

•

7 years ago

I guess it is, but to be very honest, this makes much more sense when I get the data in hand and I can explore it, state hypotheses, and either confirm or deny them (and challenge assumptions).

I think, like we discussed before, that by getting this out early in pre-release we can explore the data and then tweak as desired, but everything you said makes sense.

Though one question seems natural: if we only record on successful update, how can we infer that an update failed? My answer is: look at the last stage seen in [Comment 9] (all of which are recorded), and if it's not (eventually) "Update Applied" then the update did not happen. Not saying I need this data point (failed/not failed updates), but just to clarify my understanding.

Flags: needinfo?(sguha)

Robert Strong (they/them - no direct email)

Comment 15

7 years ago

Edited

The failure can be determined via the detailed failure information that is already recorded in telemetry histograms, though intervals aren't included in this. Regretfully, telemetry scalars, which are needed for the intervals, don't support multiple values in one session, so if an update failed before restarting, the next attempt would overwrite the failure due to how scalars work.

Since there are multiple fallbacks for the download phase I am adding a bitfield value so it is possible to determine which download methods were tried.
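For readers unfamiliar with bitfields, this is roughly how such a value could be decoded. The bit assignments and names below are purely hypothetical; this comment does not specify them.

```python
from enum import IntFlag


class DownloadMethod(IntFlag):
    """Hypothetical bit assignments, for illustration only."""
    BITS = 1
    INTERNAL = 2


def methods_tried(bitfield):
    """Return the names of the download methods whose bit is set."""
    flags = DownloadMethod(bitfield)
    return [method.name for method in DownloadMethod if method in flags]
```

Under these assumed bits, a value of 3 would mean that both BITS and the internal Firefox download were tried.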

Robert Strong (they/them - no direct email)

Updated

7 years ago

No longer blocks: 1539154

Depends on: 1520321, 1539154

Robert Strong (they/them - no direct email)

Updated

7 years ago

Depends on: 1542100

Jim Mathies [:jimm]

Updated

7 years ago

Blocks: update-agent

Robert Strong (they/them - no direct email)

Comment 16

6 years ago

The new telemetry probes landed last Friday.

The telemetry I've added for this are scalars and are automatically added to MainSummary.
Note: the pre-existing telemetry probes mentioned below predate the addition of Scalars, they are histograms, and they aren't automatically added to MainSummary.

There are two branches of the same telemetry scalar probes.
update.startup
update.session

update.startup contains both success and failures.
update.session contains only failures.

The telemetry is only submitted when an update has finished.
An update is finished when all attempts to update have been performed and this includes both success and failures.
The update.session telemetry only contains updates that have failed all attempts to update with the last attempt failing before the client restarts to apply the update.
The update.startup telemetry only contains updates that have finished with either success or failure during application startup.

To check if update.startup is for a success or failure you can check the following histograms
UPDATE_STATE_CODE_COMPLETE_STARTUP
UPDATE_STATE_CODE_PARTIAL_STARTUP
UPDATE_STATE_CODE_UNKNOWN_STARTUP

The existence of a value means the histogram is the one associated with the update.startup scalars.
A value of 10 means the update was successful.
The most predominant failure case per the dashboard has been due to cancelling elevation requests, in which case the value will be 12 and one of the following histograms will have a value of 9
UPDATE_STATUS_ERROR_CODE_COMPLETE_STARTUP
UPDATE_STATUS_ERROR_CODE_PARTIAL_STARTUP
UPDATE_STATUS_ERROR_CODE_UNKNOWN_STARTUP

When the client cancels the elevation request the client will continue to report the same values on startup until they approve the elevation request to complete the update.
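A small sketch of how the codes above could be interpreted when querying. Only the values 10, 12, and 9 come from this comment; the constant names, the function name, and the result labels are assumptions.

```python
STATE_CODE_SUCCESS = 10             # UPDATE_STATE_CODE_*_STARTUP value for success
STATE_CODE_ELEVATION_CASE = 12      # state value seen when elevation is cancelled
ERROR_CODE_ELEVATION_CANCELLED = 9  # UPDATE_STATUS_ERROR_CODE_*_STARTUP value


def classify_startup_update(state_code, error_code=None):
    """Map the histogram values described above to a coarse label."""
    if state_code is None:
        return "no-value"  # histogram not associated with these scalars
    if state_code == STATE_CODE_SUCCESS:
        return "success"
    if (state_code == STATE_CODE_ELEVATION_CASE
            and error_code == ERROR_CODE_ELEVATION_CANCELLED):
        return "elevation-cancelled"
    return "other"  # any other combination; not interpreted here
```

Because a client that cancels the elevation request keeps reporting the same values on each startup, repeated "elevation-cancelled" results from one client likely describe a single pending update.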

The new telemetry along with descriptions of each probe is available starting at:
https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/Scalars.yaml#2695

Several of the intervals have multiple probes. A couple of examples that show why there are so many probes:
Client has a failed partial MAR BITS download, failed partial MAR internal download, failed complete MAR BITS download, and a failed complete MAR internal download.
Client has a successful partial MAR BITS download, a failed partial MAR stage, and falls back to a complete MAR BITS download. This would typically be followed by either a successful or a failed complete MAR stage. Depending on the error the client will either try to apply the update on the next startup or inform the client that they should download and install the latest version.
Client performs the update phases with a partial MAR successfully except for the apply phase on restart. They will then try to update using the complete MAR and could finish the update with either a success or failure when it tries to apply the complete MAR on restart.

In all of the above cases all of the telemetry will be submitted at the same time. This should make it easier to work with the data and it also makes it easier to prevent mixing data from two different updates since scalars can only be submitted once per session.

Robert Strong (they/them - no direct email)

Comment 17

6 years ago

Edited

Some more notes

The values for update.startup.intervals.check will typically be a small number. When it isn't, the client likely does not have automatic updates enabled, which would have started the update download immediately. Also, the value is not affected by BITS.
The values for update.startup.intervals.download_<> are the time it took to complete the download either using the internal Firefox code or BITS.
The values for update.startup.intervals.stage_<> represent the time it took to stage the update and the value isn't affected by BITS. No value represents that staging was not attempted (e.g. disabled by pref, requirements not met, etc.).
The values for update.startup.intervals.apply_<> represent the time it took for the client to restart so the value can vary by quite a lot. This value isn't affected by BITS.

When the client exits while downloading, the value of update.startup.downloads.<>_bytes will differ from the value of update.startup.mar_<>_size_bytes.

A couple of examples interpreting the data

If there are values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.internal_partial_seconds
Without values for:
update.startup.downloads.bits_complete_seconds
update.startup.downloads.internal_complete_seconds
This means:
the BITS partial download failed
the internal Firefox code partial download succeeded
there should also be a value for update.startup.mar_complete_size_bytes since there should always be a complete update MAR file

If there are values for:
update.startup.downloads.bits_complete_seconds
update.startup.downloads.internal_complete_seconds
Without values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.internal_partial_seconds
This means:
the BITS complete download failed
the internal Firefox code complete download succeeded
there was no partial update MAR file advertised which can also be determined by there not being a value for update.startup.mar_partial_size_bytes

If there are values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.bits_complete_seconds
update.startup.intervals.stage_partial
update.startup.intervals.stage_complete
update.startup.intervals.apply_complete
Without values for:
update.startup.downloads.internal_partial_seconds
update.startup.downloads.internal_complete_seconds
update.startup.intervals.apply_partial
This means:
download of the partial using BITS succeeded
download of the complete using BITS succeeded
staging of the partial update failed
staging of the complete update succeeded
the complete update made it through to the end of the apply phase and whether it was successful or not can be determined by the value of the UPDATE_STATE_CODE_COMPLETE_STARTUP histogram.
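The presence/absence reading used in the examples above can be sketched as follows. The function name and the returned labels are hypothetical. The labels describe only which download methods left a value; whether the last attempt succeeded still has to be confirmed elsewhere, since all attempts can fail.

```python
def download_methods_used(scalars, kind):
    """Describe which download methods recorded a value for one MAR kind.

    scalars is a dict of probe name -> value for one update;
    kind is "partial" or "complete".
    """
    bits = f"update.startup.downloads.bits_{kind}_seconds" in scalars
    internal = f"update.startup.downloads.internal_{kind}_seconds" in scalars
    if bits and internal:
        return "bits-then-internal"  # BITS failed, internal download was used
    if bits:
        return "bits-only"
    if internal:
        return "internal-only"
    return "not-downloaded"
```

Applied to the first example, the partial MAR would read as "bits-then-internal" and the complete MAR as "not-downloaded".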

Below are two examples (one using the internal Firefox download code and one with BITS) of the telemetry submitted that I simulated on my system where a partial that had been successfully staged failed during the apply phase.

| Scalar Telemetry probe name                         |  Value   |
+-----------------------------------------------------+----------+
| update.startup.from_app_version                     |   68.0a1 |
| update.startup.mar_partial_size_bytes               |  5597747 |
| update.startup.mar_complete_size_bytes              | 52520731 |
| update.startup.intervals.check                      |      167 |
| update.startup.intervals.download_internal_partial  |        4 |
| update.startup.intervals.download_internal_complete |       32 |
| update.startup.intervals.stage_partial              |        5 |
| update.startup.intervals.stage_complete             |        6 |
| update.startup.intervals.apply_partial              |       91 |
| update.startup.intervals.apply_complete             |        8 |
| update.startup.downloads.internal_partial_bytes     |  5597747 |
| update.startup.downloads.internal_partial_seconds   |        4 |
| update.startup.downloads.internal_complete_bytes    | 52520731 |
| update.startup.downloads.internal_complete_seconds  |       31 |
+-----------------------------------------------------+----------+

| Scalar Telemetry probe name                         |  Value   |
+-----------------------------------------------------+----------+
| update.startup.from_app_version                     |   68.0a1 |
| update.startup.mar_partial_size_bytes               |  5079059 |
| update.startup.mar_complete_size_bytes              | 52537067 |
| update.startup.intervals.check                      |       36 |
| update.startup.intervals.download_bits_partial      |        3 |
| update.startup.intervals.download_bits_complete     |       24 |
| update.startup.intervals.stage_partial              |        5 |
| update.startup.intervals.stage_complete             |        7 |
| update.startup.intervals.apply_partial              |       81 |
| update.startup.intervals.apply_complete             |       19 |
| update.startup.downloads.bits_partial_bytes         |  5079059 |
| update.startup.downloads.bits_partial_seconds       |        1 |
| update.startup.downloads.bits_complete_bytes        | 52537067 |
| update.startup.downloads.bits_complete_seconds      |       24 |
+-----------------------------------------------------+----------+

Below is a typical example using BITS without any failures.

| Scalar Telemetry probe name                         |  Value   |
+-----------------------------------------------------+----------+
| update.startup.from_app_version                     |   68.0a1 |
| update.startup.mar_partial_size_bytes               |  6383239 |
| update.startup.mar_complete_size_bytes              | 52539591 |
| update.startup.intervals.check                      |        1 |
| update.startup.intervals.download_bits_partial      |        3 |
| update.startup.intervals.stage_partial              |        6 |
| update.startup.intervals.apply_partial              |    12486 |
| update.startup.downloads.bits_partial_bytes         |  6383239 |
| update.startup.downloads.bits_partial_seconds       |        2 |
+-----------------------------------------------------+----------+
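As a worked example on the last table, the sketch below parses the text table into a dict, then sums the intervals and computes the download rate. The parser and variable names are assumptions written for this illustration; real analysis would query the scalars from telemetry rather than parse text.

```python
TYPICAL_BITS_TABLE = """
| Scalar Telemetry probe name                         |  Value   |
+-----------------------------------------------------+----------+
| update.startup.from_app_version                     |   68.0a1 |
| update.startup.mar_partial_size_bytes               |  6383239 |
| update.startup.mar_complete_size_bytes              | 52539591 |
| update.startup.intervals.check                      |        1 |
| update.startup.intervals.download_bits_partial      |        3 |
| update.startup.intervals.stage_partial              |        6 |
| update.startup.intervals.apply_partial              |    12486 |
| update.startup.downloads.bits_partial_bytes         |  6383239 |
| update.startup.downloads.bits_partial_seconds       |        2 |
+-----------------------------------------------------+----------+
"""


def parse_scalar_table(text):
    """Parse a two-column pipe table into {probe name: value}."""
    scalars = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip blank lines, border lines, and the header row.
        if len(cells) != 2 or cells[0].startswith("Scalar"):
            continue
        name, value = cells
        scalars[name] = int(value) if value.isdigit() else value
    return scalars


scalars = parse_scalar_table(TYPICAL_BITS_TABLE)

# Total seconds from update check through apply: 1 + 3 + 6 + 12486.
total_seconds = sum(
    value for name, value in scalars.items()
    if name.startswith("update.startup.intervals.")
)

# Bytes per second while the download was being recorded.
rate = (scalars["update.startup.downloads.bits_partial_bytes"]
        / scalars["update.startup.downloads.bits_partial_seconds"])
```

Almost all of the total is the apply interval, that is, the time until the client restarted, which matches the note above that this value can vary by quite a lot.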

"Saptarshi Guha[:joy]"

Assignee

Comment 18

6 years ago

A couple of examples interpreting the data

If there are values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.internal_partial_seconds
Without values for:
update.startup.downloads.bits_complete_seconds
update.startup.downloads.internal_complete_seconds
This means:
the BITS partial download failed
the internal Firefox code partial download succeeded
there should also be a value for update.startup.mar_complete_size_bytes since there should always be a complete update MAR file

my understanding:

If we see update.startup.downloads.bits_partial_seconds and
update.startup.downloads.internal_partial_seconds this means the bits download
failed and the internal downloader kicked in and hence the latter value is present.

Therefore we should see update.startup.mar_partial_size_bytes and not
necessarily update.startup.mar_complete_size_bytes since a partial MAR was
downloaded, not a complete MAR (as mentioned above,
update.startup.downloads.internal_complete_seconds is missing)

If there are values for:
update.startup.downloads.bits_complete_seconds
update.startup.downloads.internal_complete_seconds
Without values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.internal_partial_seconds
This means:
the BITS complete download failed
the internal Firefox code complete download succeeded
there was no partial update MAR file advertised which can also be determined by there not being a value for update.startup.mar_partial_size_bytes

Makes sense. If i understood correctly, if we only saw update.startup.downloads.bits_{partial|complete}_seconds then the BITS succeeded?

If there are values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.bits_complete_seconds
update.startup.intervals.stage_partial
update.startup.intervals.stage_complete
update.startup.intervals.apply_complete
Without values for:
update.startup.downloads.internal_partial_seconds
update.startup.downloads.internal_complete_seconds
update.startup.intervals.apply_partial
This means:
download of the partial using BITS succeeded
download of the complete using BITS succeeded
staging of the partial update failed
staging of the complete update succeeded
the complete update made it through to the end of the apply phase and whether it was successful or not can be determined by the value of the UPDATE_STATE_CODE_COMPLETE_STARTUP histogram.

Makes sense. I think I'm confused in that I thought the update will download either a partial MAR or a complete MAR but looking at your examples it seems both the partial and complete are downloaded. And from your second 'scenario' we can have complete downloads without having partials, but we cannot have a partial without a complete?

Thanks much

Can the histograms UPDATE_CAN_USE_BITS_NOTIFY and UPDATE_STATE_CODE_COMPLETE_STARTUP be pushed into main_summary?

Flags: needinfo?(robert.strong.bugs)

Robert Strong (they/them - no direct email)

Comment 19

6 years ago

(In reply to "Saptarshi Guha[:joy]" from comment #18)

A couple of examples interpreting the data

If there are values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.internal_partial_seconds
Without values for:
update.startup.downloads.bits_complete_seconds
update.startup.downloads.internal_complete_seconds
This means:
the BITS partial download failed
the internal Firefox code partial download succeeded
there should also be a value for update.startup.mar_complete_size_bytes since there should always be a complete update MAR file

my understanding:

If we see update.startup.downloads.bits_partial_seconds and
update.startup.downloads.internal_partial_seconds this means the bits download
failed and the internal downloader kicked in and hence the latter value is present.

Correct

Therefore we should see update.startup.mar_partial_size_bytes and not
necessarily update.startup.mar_complete_size_bytes since a partial MAR was
downloaded, not a complete MAR (as mentioned above,
update.startup.downloads.internal_complete_seconds is missing)

No, the values for update.startup.mar_partial_size_bytes and update.startup.mar_complete_size_bytes represent the MAR files advertised by the update server, are always recorded, and this is not dependent on whether the MAR was downloaded by the client.

If there are values for:
update.startup.downloads.bits_complete_seconds
update.startup.downloads.internal_complete_seconds
Without values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.internal_partial_seconds
This means:
the BITS complete download failed
the internal Firefox code complete download succeeded
there was no partial update MAR file advertised which can also be determined by there not being a value for update.startup.mar_partial_size_bytes

Makes sense. If I understood correctly, if we only saw update.startup.downloads.bits_{partial|complete}_seconds then the BITS download succeeded?

Correct

If there are values for:
update.startup.downloads.bits_partial_seconds
update.startup.downloads.bits_complete_seconds
update.startup.intervals.stage_partial
update.startup.intervals.stage_complete
update.startup.intervals.apply_complete
Without values for:
update.startup.downloads.internal_partial_seconds
update.startup.downloads.internal_complete_seconds
update.startup.intervals.apply_partial
This means:
download of the partial using BITS succeeded
download of the complete using BITS succeeded
staging of the partial update failed
staging of the complete update succeeded
the complete update made it through to the end of the apply phase and whether it was successful or not can be determined by the value of the UPDATE_STATE_CODE_COMPLETE_STARTUP histogram.

Makes sense. I think I'm confused in that I thought the update will download either a partial MAR or a complete MAR, but looking at your examples it seems both the partial and complete are downloaded.

In the typical case, when a partial is advertised only the partial is downloaded. In the above example, staging of the partial failed so the complete was downloaded. For both the partial and the complete only BITS was used because it was a staging failure and not a download failure. If there was a BITS download failure it would fall back to using the internal Firefox download code.
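
The interpretation rules in this comment can be sketched as a small Python helper. This is only an illustration: the function name `download_outcome`, the outcome labels, and passing the scalars as a plain dict are hypothetical, and the `internal_only` case (an internal value without a BITS value) is an assumption that is not spelled out above.

```python
from typing import Mapping, Optional

PREFIX = "update.startup."


def download_outcome(scalars: Mapping[str, Optional[int]], kind: str) -> str:
    """Classify the download of one MAR kind ("partial" or "complete").

    A scalar counts as present when its key maps to a non-None value.
    """
    if kind not in ("partial", "complete"):
        raise ValueError("kind must be 'partial' or 'complete'")
    bits = scalars.get(f"{PREFIX}downloads.bits_{kind}_seconds")
    internal = scalars.get(f"{PREFIX}downloads.internal_{kind}_seconds")
    if bits is not None and internal is not None:
        # BITS was tried, failed, and the internal downloader took over.
        return "bits_failed_internal_succeeded"
    if bits is not None:
        return "bits_succeeded"
    if internal is not None:
        # Assumption: BITS was not used at all for this download.
        return "internal_only"
    return "not_downloaded"


def partial_advertised(scalars: Mapping[str, Optional[int]]) -> bool:
    """True when the update server advertised a partial MAR."""
    return scalars.get(f"{PREFIX}mar_partial_size_bytes") is not None


# Third scenario above: both downloads went through BITS without fallback.
# The values are made up; only presence or absence matters here.
third_scenario = {
    "update.startup.downloads.bits_partial_seconds": 5,
    "update.startup.downloads.bits_complete_seconds": 19,
}
print(download_outcome(third_scenario, "partial"))
print(download_outcome(third_scenario, "complete"))
```

Staging and apply outcomes are deliberately left out; they would need the update.startup.intervals.* scalars and the UPDATE_STATE_CODE_* histograms.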

And from your second 'scenario' we can have complete downloads without having partials, but we cannot have a partial without a complete?

Technically we can have a partial without a complete but in practice it is extremely unlikely we will.

Thanks much

No problem

Can the histograms UPDATE_CAN_USE_BITS_NOTIFY and UPDATE_STATE_CODE_COMPLETE_STARTUP be pushed into main_summary?

I'll file a telemetry bug to see if they can. I suspect you would also like at a minimum UPDATE_STATE_CODE_PARTIAL_STARTUP. Do you think it should be added permanently? If so, there are other histograms that should probably be added as well.

Flags: needinfo?(robert.strong.bugs) → needinfo?(sguha)

"Saptarshi Guha[:joy]"

Assignee

Comment 20

•

6 years ago

I guess if we continue this analysis into beta and release then they ought to be there for the near future.

Flags: needinfo?(sguha)

Robert Strong (they/them - no direct email)

Comment 21

•

6 years ago

Filed bug 1552213 for the additions to main summary

Robert Strong (they/them - no direct email)

Comment 22

•

6 years ago

An example with just a complete MAR file advertised

| Scalar Telemetry probe name                         |  Value   |
+-----------------------------------------------------+----------+
| update.startup.from_app_version                     |   68.0a1 |
| update.startup.mar_complete_size_bytes              | 52728655 |
| update.startup.intervals.check                      |        7 |
| update.startup.intervals.download_bits_complete     |       20 |
| update.startup.intervals.stage_complete             |        7 |
| update.startup.intervals.apply_complete             |       12 |
| update.startup.downloads.bits_complete_bytes        | 52728655 |
| update.startup.downloads.bits_complete_seconds      |       19 |
+-----------------------------------------------------+----------+
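
As a rough worked example, the two update.startup.downloads.bits_complete_* values in this table give an average download rate. The helper name `throughput_bytes_per_second` is hypothetical, and treating bytes divided by seconds as the download speed is a simplification, since the seconds value is reported as a whole number.

```python
from typing import Optional


def throughput_bytes_per_second(size_bytes: int, seconds: int) -> Optional[float]:
    """Average download rate, or None when the duration is not positive."""
    if seconds <= 0:
        return None
    return size_bytes / seconds


# Values copied from the table above.
bits_complete_bytes = 52728655
bits_complete_seconds = 19

rate = throughput_bytes_per_second(bits_complete_bytes, bits_complete_seconds)
if rate is not None:
    print(f"{rate / 1_000_000:.2f} MB/s")
```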

"Saptarshi Guha[:joy]"

Assignee

Comment 23

•

6 years ago

Commenting here on my progress since we've been communicating a lot in this thread.

I took the data from main_summary for a 60% sample of Nightly profiles on app_build_id >= 20190515 for dates after 2019-05-15 (till now) and on WindowsNT. The code for the data extract can be found here (unfortunately behind LDAP).

Overview Stats

This resulted in a total of 7,508,494 profiles and 38,070,761 sub-sessions.
90% of the profiles have been active only one day, and a breakdown of the higher end is

90%	91%	92%	93%	94%	95%	96%	97%	98%	99%	100%
1	1	1	1	2	2	3	3	4	4	7

In terms of sub-sessions, 90% have 11 or fewer sub-sessions and the distribution on the higher end is

90%	91%	92%	93%	94%	95%	96%	97%	98%	99%	99.9%
11	12	12	13	14	16	18	20	24	33	79

However, the bulk of these profiles have missing entries for update_startup_mar_(complete|partial)_size_bytes across all their sub-sessions and thus would never update in the week. Of the profiles listed above, 159,919 (~2%) had this advertised. Those who have a non-missing value have much higher usage:

days active is higher

90%	91%	92%	93%	94%	95%	96%	97%	98%	99%	100%
5	5	5	5	6	6	6	6	7	7	7

number of subsessions is higher

90%	91%	92%	93%	94%	95%	96%	97%	98%	99%	99.9%
32	34	36	39	42	46	51	59	70	94	273

Based on email conversations, I'm going to assume the 'test' group is the group that used BITS (with internal as a fallback) and the 'control' group is those that used internal exclusively. For a profile, if either download_bits_partial_seconds or download_bits_complete_seconds is non-missing in any sub-session, they are in the test group. If both are missing for all their sub-sessions, they are in the control group.

This does not appear to be a 50% split: 28% of the 159,919 are in the Test group and the remainder in the Control group.
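
A minimal sketch of the group assignment described above, in plain Python. The row layout (one dict per sub-session with a `client_id` key) and the function name `assign_groups` are hypothetical; the actual extract ran against main_summary and is not reproduced here.

```python
from typing import Dict, Iterable, Mapping, Optional

# Column names as written in this comment.
BITS_COLUMNS = ("download_bits_partial_seconds", "download_bits_complete_seconds")


def assign_groups(rows: Iterable[Mapping[str, Optional[object]]]) -> Dict[str, str]:
    """Map each client_id to "test" or "control".

    A profile is "test" if any of its sub-sessions has a non-missing BITS
    download time, otherwise "control".
    """
    groups: Dict[str, str] = {}
    for row in rows:
        client = str(row["client_id"])
        used_bits = any(row.get(col) is not None for col in BITS_COLUMNS)
        if used_bits:
            groups[client] = "test"
        else:
            # Keep an earlier "test" label if one was already assigned.
            groups.setdefault(client, "control")
    return groups


# Made-up sub-session rows, for illustration only.
example_rows = [
    {"client_id": "a", "download_bits_partial_seconds": None},
    {"client_id": "a", "download_bits_partial_seconds": 12},
    {"client_id": "b", "download_bits_complete_seconds": None},
]
print(assign_groups(example_rows))
```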

Next comment will have timing results.

Robert Strong (they/them - no direct email)

Comment 24

•

6 years ago

I suspect that this is due to the telemetry only being submitted for a session when an update has occurred.

I checked using one day’s worth of data on Nightly, only Windows, app name equals Firefox, one build ID that has the BITS patches, unique client IDs, and pings with the UPDATE_CAN_USE_BITS_NOTIFY histogram to identify the client’s BITS enabled status. There are approximately 49.1% enabled, 47.6% disabled by pref, and 3.2% disabled due to proxy. The values were
Totals: CanUseBits: 5685, NoBits_NotWindows: 5246, NoBits_Pref: 5503, NoBits_Proxy: 374, NoBits_OtherUser: 1
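
The shares can be recomputed from these totals with a short Python sketch. It assumes the NoBits_NotWindows bucket is left out of the denominator, which is what brings the results close to the figures quoted above; the function name `bits_status_shares` is hypothetical.

```python
from typing import Dict, Iterable

# Totals copied from this comment.
TOTALS = {
    "CanUseBits": 5685,
    "NoBits_NotWindows": 5246,
    "NoBits_Pref": 5503,
    "NoBits_Proxy": 374,
    "NoBits_OtherUser": 1,
}


def bits_status_shares(
    counts: Dict[str, int],
    exclude: Iterable[str] = ("NoBits_NotWindows",),
) -> Dict[str, float]:
    """Percentage share of each bucket after dropping the excluded ones."""
    excluded = set(exclude)
    included = {k: v for k, v in counts.items() if k not in excluded}
    total = sum(included.values())
    if total == 0:
        raise ValueError("no counts left after exclusion")
    return {k: 100.0 * v / total for k, v in included.items()}


for bucket, share in bits_status_shares(TOTALS).items():
    print(f"{bucket}: {share:.2f}%")
```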

Robert Strong (they/them - no direct email)

Comment 25

•

6 years ago

Saptarshi, could you please comment in bug 1552213 to let the telemetry devs know when you will need that bug fixed?

Jim Mathies [:jimm]

Updated

•

6 years ago

Blocks: 1553977

"Saptarshi Guha[:joy]"

Assignee

Comment 26

•

6 years ago

The histograms were added to BigQuery. We'll see data over the weekend.

"Saptarshi Guha[:joy]"

Assignee

Comment 27

•

6 years ago

Okay, starting a new comment; kindly ignore Comment 23. Moving the analysis to a report for easier edits.

See https://metrics.mozilla.com/~sguha/mz/download_baselines/report.1.html

I can rerun this as we get new data. There isn't much in yet.

"Saptarshi Guha[:joy]"

Assignee

Comment 28

•

6 years ago

Report has been updated (it has been a full week now). There isn't any difference (download speed has not regressed), and my guess as to why there isn't a difference (i.e. in terms of profiles on older versions/builds) is that Nightly users are generally more active.

https://metrics.mozilla.com/~sguha/mz/download_baselines/report.1.html

I'll update as we get more data and confirm if download speeds indeed increased.

"Saptarshi Guha[:joy]"

Assignee

Comment 29

•

6 years ago

Also, I can make this report public (since the bug is).

Romain Testard [:RT]

Reporter

Comment 30

•

6 years ago

Hi Saptarshi, it sounds like we got what we needed from this bug and BITS shipped to 68. OK to close the bug?

Flags: needinfo?(rtestard) → needinfo?(sguha)

"Saptarshi Guha[:joy]"

Assignee

Comment 31

•

6 years ago

I believe things looked okay. Though beta did not show differences as large as we saw in Nightly, there were not any regressions.
There was one section where the data was funky, but that appears to be mostly because of targeting the wrong version.

+r

Flags: needinfo?(sguha)

Romain Testard [:RT]

Reporter

Comment 32

•

6 years ago

Thanks!
Closing now.

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED