Closed Bug 1284915 Opened 5 years ago Closed 5 years ago

Investigate large number (~37% and ~3.76% of all) of orphaned users stuck on Firefox 43.0.1

Categories

(Toolkit :: Application Update, defect, P4)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: spohl, Assigned: robert.strong.bugs)

References

Details

Attachments

(2 files)

Preliminary work on bug 1284899 revealed that a large number (~48%) of our orphaned users are stuck on Firefox 43.0.1. We should investigate whether or not this is legitimate and if there are any mitigations that we could apply.
Attached file JSON data
This data is the basis of the graph in attachment 8768444 [details].
The percentage of users stuck on 43.0.1 remains virtually unchanged over the last few weeks. I ran an analysis on OS version distribution on 43.0.1 vs. 44.*. Users on XP are represented at about twice the rate on 43.0.1 vs. 44.*, but I believe this is expected since 43.0.1 is a watershed release. It would have been more worrisome if 80% of orphaned users were XP users.

Posting analysis data here as fyi:

Firefox 43.0.1 using longitudinal_v20160731: (total count: 54490)
Windows 95:         0.02%
Windows 2000:       0.04%
Windows XP:        19.85%
Windows XP Pro 64:  0.44%
Windows Vista:      3.79%
Windows 7:         48.01%
Windows 8:          3.56%
Windows 8.1:       10.96%
Windows 10:        13.32%

Firefox 44.* using longitudinal_v20160731: (total count: 23260)
Windows 95:        <0.01%
Windows 2000:       0.03%
Windows XP:        10.34%
Windows XP Pro 64:  0.31%
Windows Vista:      2.13%
Windows 7:         53.20%
Windows 8:          3.07%
Windows 8.1:       12.21%
Windows 10:        18.70%
Priority: -- → P4
Assignee: mhowell → robert.strong.bugs
Another data point.

Firefox is 8.9% of Wikipedia visitors for the first week of September. The versions can be split up in https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-browser/browser-family-and-major-hierarchical-view

47: 1.1%
43: 0.95%
40: 0.91%
45: 0.87%
44: 0.82%
42: 0.81%
38: 0.72%
…
Ignore the (In reply to Harald Kirschner :digitarald from comment #4)
> Another data point.

Ignore that comment, filtering actually to the last month (I had 2015 for the start date), the numbers look much better with 47 and 48 in the lead: https://www.dropbox.com/s/2j5f3i6e79q7l35/Screenshot%202016-09-06%2011.46.43.PNG?dl=0
Not saying this is the reason for the bump at this time but it definitely has affected successful updates.

UPDATE_STATUS_ERROR_CODE_COMPLETE_STAGE with an error 19 (CERT_VERIFY_ERROR) across versions

Version |  41  |  42  |  43  |  44
--------+------+------+------+-----
Win All |  49  |  11  | 7630 | 4420
Win 7   |  27  |   8  | 3180 | 1870

https://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/updater/archivereader.cpp#63
Should have mentioned that this data was gathered from the telemetry histogram dashboard
Looking at beta it appears that these numbers are more consistent going back to earlier releases. The releaseChannelCollection for app update was added for Firefox 40. Was there a period when the release channel wasn't reporting consistently?
Note: Bug 1157581 added releaseChannelCollection to app update.
No updates found using 8/01/16 - 8/31/16 for the date range

Version |   41   |   42   |   43   |   44   |   45
--------+--------+--------+--------+--------+--------
Win All |  1340  |  35070 | 348340 |  64230 |  89400
Win 7   |  3710  |  16360 | 157540 |  36110 |  49920

I am not confident in this but I suspect that this difference in Firefox 43 is mainly due to the large number of orphaned Firefox 43 installs.
Same numbers including the app update telemetry ping count.

Version |      41       |       42       |       43        |       44        |      45
--------+---------------+----------------+-----------------+-----------------+----------------
Win All | 1340 / 135780 | 35070 / 168000 | 348340 / 943820 |  64230 / 249440 |  89400 / 341850
Win 7   | 3710 /  76020 | 16360 /  92030 | 157540 / 439740 |  36110 / 136140 |  49920 / 200280
Using the notify - no updates found divided by the notify - ping count I get the following percentage of the time that a notify event to check for updates occurs with the end result being no updates found.

Version |  42 |  43 |  44 |   45
--------+-----+-----+-----+------
Win All | 20% | 37% | 26% |  26%
Win 7   | 18% | 35% | 26% |  25%
I didn't include 41 since the telemetry numbers seem skewed for that release.
I did a comparison of no updates found on release and beta for the date range of 2016/08/01 - 2016/08/31 and though the no updates found didn't stand out it does appear that for that date range there is a large number of Release 43 and Beta 44.

The first number is notify - no updates found
The second number is notify - count
Numbers are in thousands

Release
Version |    42    |     43    |    44    |   45
--------+----------+-----------+----------+---------
Win All | 35 / 168 | 348 / 943 | 64 / 249 | 89 / 342
Win 7   | 16 /  92 | 157 / 439 | 36 / 136 | 50 / 200


Beta
Version |     42    |    43     |    44     |      45
--------+-----------+-----------+-----------+----------------
Win All | 35 / 1520 | 33 / 1670 | 46 / 6690 |  30 / 1540
Win 7   | 28 /  930 | 14 /  975 | 32 / 3980 |  16 /  915

Focusing in on just notify - count

Release
Version |  42 |  43 |  44 |  45
--------+-----+-----+-----+-----
Win All | 168 | 943 | 249 | 342
Win 7   |  92 | 439 | 136 | 200


Beta
Version |  42  |  43  |  44  |  45
--------+------+------+------+-----
Win All | 1520 | 1670 | 6690 | 1540
Win 7   |  930 |  975 | 3980 |  915

So, Firefox Release 43 and Firefox Beta 44 both have a significantly larger update notify ping count than other releases for both all versions of Windows and for Windows 7.

This makes me suspect something on the server side might have caused this since if it were client side the Firefox version should be the same for the large number of update notify ping count. In other words, if Firefox 43 Release had a client side change that caused the larger number of update notify ping count then Firefox Beta 43 should show a similar increase in the update notify ping count.

This is not to say that this isn't a client side bug. For example, I think the aus SSL certificate was changed during this time period and it might be due to a bug in the code exposed after this change.
It could also be due to some inconsistency with how we collect telemetry though I have no reason to think that it is.
Stephen, do you think comment #15 might be on the right track?
Flags: needinfo?(spohl.mozilla.bugs)
Note that even though there is discussion of 43 in bug 1277925 it is actually about beta 44.
Another thing to try to verify is that it isn't caused by a late landing change to Firefox 43.
Flags: needinfo?(spohl.mozilla.bugs)
accidentally cleared the need info for comment #17
Flags: needinfo?(spohl.mozilla.bugs)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #10)
> No updates found using 8/01/16 - 8/31/16 for the date range
> 
> Version |   41   |   42   |   43   |   44   |   45
> --------+--------+--------+--------+--------+--------
> Win All |  1340  |  35070 | 348340 |  64230 |  89400
> Win 7   |  3710  |  16360 | 157540 |  36110 |  49920

I haven't been following this terribly closely, but is this saying that in the month of August, users on 43.0.1 were not even offered an update at all?
Flags: needinfo?(robert.strong.bugs)
No, it is saying that the some users reported no updates found during this date range. Other date ranges also show similar ratio of users reporting no update found. The following is from comment #15 and if no users were offered an update the first number would be approximately the the second number.

The first number is notify - no updates found
The second number is notify - count
Numbers are in thousands
Release
Version |    42    |     43    |    44    |   45
--------+----------+-----------+----------+---------
Win All | 35 / 168 | 348 / 943 | 64 / 249 | 89 / 342
Win 7   | 16 /  92 | 157 / 439 | 36 / 136 | 50 / 200
Flags: needinfo?(robert.strong.bugs)
Update ping count across Aurora, Beta, and Release shows that Release 43, Beta 44, and Aurora 45 all have by far the largest count.

Release
Version |      |      |   42  |   43  |   44  |   45 |
--------+------+------+-------+-------+-------+------+
Win All |      |      |   168 |   943 |   249 |  342 |
Win 7   |      |      |    92 |   439 |   136 |  200 |


Beta
Version |      |  42  |   43  |   44  |   45  |
--------+------+------+-------+-------+-------+
Win All |      | 1520 |  1670 |  6690 |  1540 |
Win 7   |      |  930 |   975 |  3980 |   915 |

Aurora
Version |  42  |  43  |   44  |   45  |   46  |
--------+------+------+-------+-------+-------+
Win All | 2.34 | 3.05 | 19.02 | 53.87 | 0.632 |
Win 7   | 1.12 | 1.63 | 10.03 | 26.52 | 0.283 |
I believe the histogram dashboard only shows a total count of pings, not broken down by individual users, correct? So if people get orphaned on 43 the counts would be expected to be higher for this version. So it might be a consequence of, rather than a cause for, people being orphaned. Then again, I'm not very familiar with the histogram dashboard.
Flags: needinfo?(spohl.mozilla.bugs)
Discussed this over irc with Stephen and he also thinks this is likely a server side change that triggered this per comment #15.

For anyone following along: there is a large count for release 43. There is also a large count for Beta 44 and a large count for Aurora 45. I would expect that if a client side change caused the orphaning it should be on Release 43, Beta 43, and Aurora 43.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #25)
> Discussed this over irc with Stephen and he also thinks this is likely a
> server side change that triggered this per comment #15.
> 
> For anyone following along: there is a large count for release 43. There is
> also a large count for Beta 44 and a large count for Aurora 45. I would
> expect that if a client side change caused the orphaning it should be on
> Release 43, Beta 43, and Aurora 43.

When you say server, does that mean AUS? We can have a look and see if there was any notable changes in the window (August, 2016 if I'm reading correctly?).
I'm using a recent date range to simplify the analysis. When I say server change I really mean infrastructure. This could be anything from balrog, SSL certificates rolled out around the same time across all channels, signing changes rolled out around the same time across all channels, CDN changes rolled out around the same time across all channels, etc. since it appears to affect 43 Release, 44 Beta, 45 Aurora, etc. I'm in no way saying that there is a bug in any of those but it does appear that something changed in infrastructure that triggered this. It appears to have started around the SHA1 to SHA2 change.

It has been stated that this primarily affects Windows 7. Next I'm going to check percentages across versions and channels to see if that is the case.
Ben, it looks like Aurora is getting quite a few more no updates found for builds prior to Aurora 46 than other channels. Could you look into this?

No updates found sum divided by ping sum both for the notify event
For the date range of 2016/08/01 - 2016/08/31

Aurora
Version |   42   |   43   |   44   |   45   |   46   |   47   |   48   |
--------+--------+--------+--------+--------+--------+--------+--------+
Win 10  | 22.06% | 21.25% | 51.66% | 73.99% |  3.70% |  2.39% |  3.39% |
Win 8.1 | 16.38% | 14.58% | 39.61% | 57.90% |  1.90% |  1.15% |  1.89% |
Win 8   | 16.07% | 19.75% | 36.13% | 44.13% |  9.09% |  4.38% |  0.94% |
Win 7   | 15.16% | 20.05% | 33.43% | 61.88% |  2.31% |  2.99% |  2.04% |
Win XP  | 23.14% | 25.34% | 42.90% | 71.55% | 12.50% |  0.74% |  2.33% |


Beta
Version |   42   |   43   |   44   |   45   |   46   |   47   |
--------+--------+--------+--------+--------+--------+--------+
Win 10  |  0.91% |  0.33% |  0.47% |  0.84% |  0.28% |  0.43% |
Win 8.1 |  0.29% |  0.42% |  0.04% |  0.24% |  0.42% |  0.27% |
Win 8   |  0.71% |  1.97% |  0.03% |  0.33% |  0.15% |  0.34% |
Win 7   |  2.87% |  1.55% |  0.88% |  1.76% |  2.15% |  1.62% |
Win XP  |  2.89% |  4.89% |  1.10% |  5.82% |  7.22% |  4.77% |
Flags: needinfo?(bhearsum)
Do the percentages shift for Aurora in September ? Bug 1280142 was fixed right at the end of August.
Don't know but I'll look in a bit.

Did bug 1280142 affect multiple versions of aurora?
Any OS_VERSION containing Windows_NT and buildID <20151223004004. Similarly nightly with buildID <20151209095500.
Doesn't appear to have made a significant difference

No updates found sum divided by ping sum both for the notify event
For the date range of 2016/09/01 - 2016/09/18
Aurora
Version |   45   |
--------+--------+
Win 10  | 71.50% |
Win 8.1 | 54.58% |
Win 8   | 44.13% |
Win 7   | 60.53% |
Win XP  | 68.32% |

Perhaps there are other cases like the one in bug 1280142
Comparison of aurora, beta, and release. The release numbers are disturbing and I will sanity check them.

Nick / Ben, do either of you have any idea as to why the Release channel users on old builds wouldn't see updates so often during the month of August?

No updates found sum divided by ping sum both for the notify event
For the date range of 2016/08/01 - 2016/08/31

Aurora
Version |   42   |   43   |   44   |   45   |  46    |   47   |   48   |
--------+--------+--------+--------+--------+--------+--------+--------+
Win 10  | 22.06% | 21.25% | 51.66% | 73.99% |  3.70% |  2.39% |  3.39% |
Win 8.1 | 16.38% | 14.58% | 39.61% | 57.90% |  1.90% |  1.15% |  1.89% |
Win 8   | 16.07% | 19.75% | 36.13% | 44.13% |  9.09% |  4.38% |  0.94% |
Win 7   | 15.16% | 20.05% | 33.43% | 61.88% |  2.31% |  2.99% |  2.04% |
Win XP  | 23.14% | 25.34% | 42.90% | 71.55% | 12.50% |  0.74% |  2.33% |


Beta
Version |   42   |   43   |   44   |   45   |  46    |   47   |
--------+--------+--------+--------+--------+--------+--------+
Win 10  |  0.91% |  0.33% |  0.47% |  0.84% |  0.28% |  0.43% |
Win 8.1 |  0.29% |  0.42% |  0.04% |  0.24% |  0.42% |  0.27% |
Win 8   |  0.71% |  1.97% |  0.03% |  0.33% |  0.15% |  0.34% |
Win 7   |  2.87% |  1.55% |  0.88% |  1.76% |  2.15% |  1.62% |
Win XP  |  2.89% |  4.89% |  1.10% |  5.82% |  7.22% |  4.77% |

Release
Version |   42   |   43   |   44   |   45   |  46    |
--------+--------+--------+--------+--------+--------+
Win 10  | 40.56% | 28.31% | 16.02% | 21.38% | 29.60% |
Win 8.1 | 25.28% | 33.17% | 26.96% | 24.78% | 31.83% |
Win 8   |  7.24% | 33.77% | 26.89% | 25.57% | 31.67% |
Win 7   | 17.50% | 33.83% | 25.13% | 22.06% | 31.16% |
Win XP  | 10.29% | 37.28% | 32.95% | 32.47% | 36.65% |
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #33)
> Comparison of aurora, beta, and release. The release numbers are disturbing
> and I will sanity check them.
> 
> Nick / Ben, do either of you have any idea as to why the Release channel
> users on old builds wouldn't see updates so often during the month of August?
> 
> No updates found sum divided by ping sum both for the notify event
> For the date range of 2016/08/01 - 2016/08/31
> 
> Aurora
> Version |   42   |   43   |   44   |   45   |  46    |   47   |   48   |
> --------+--------+--------+--------+--------+--------+--------+--------+
> Win 10  | 22.06% | 21.25% | 51.66% | 73.99% |  3.70% |  2.39% |  3.39% |
> Win 8.1 | 16.38% | 14.58% | 39.61% | 57.90% |  1.90% |  1.15% |  1.89% |
> Win 8   | 16.07% | 19.75% | 36.13% | 44.13% |  9.09% |  4.38% |  0.94% |
> Win 7   | 15.16% | 20.05% | 33.43% | 61.88% |  2.31% |  2.99% |  2.04% |
> Win XP  | 23.14% | 25.34% | 42.90% | 71.55% | 12.50% |  0.74% |  2.33% |
> 
> 
> Beta
> Version |   42   |   43   |   44   |   45   |  46    |   47   |
> --------+--------+--------+--------+--------+--------+--------+
> Win 10  |  0.91% |  0.33% |  0.47% |  0.84% |  0.28% |  0.43% |
> Win 8.1 |  0.29% |  0.42% |  0.04% |  0.24% |  0.42% |  0.27% |
> Win 8   |  0.71% |  1.97% |  0.03% |  0.33% |  0.15% |  0.34% |
> Win 7   |  2.87% |  1.55% |  0.88% |  1.76% |  2.15% |  1.62% |
> Win XP  |  2.89% |  4.89% |  1.10% |  5.82% |  7.22% |  4.77% |
> 
> Release
> Version |   42   |   43   |   44   |   45   |  46    |
> --------+--------+--------+--------+--------+--------+
> Win 10  | 40.56% | 28.31% | 16.02% | 21.38% | 29.60% |
> Win 8.1 | 25.28% | 33.17% | 26.96% | 24.78% | 31.83% |
> Win 8   |  7.24% | 33.77% | 26.89% | 25.57% | 31.67% |
> Win 7   | 17.50% | 33.83% | 25.13% | 22.06% | 31.16% |
> Win XP  | 10.29% | 37.28% | 32.95% | 32.47% | 36.65% |

We had background updates off for the Firefox release channel for versions <48.0 for the vast majority of August. It looks like they were disabled on August 3rd at 10:17 UTC, reenabled for about 20 minutes on the 18th (most likely in error), and then re-enabled again on August 22nd at 11:04 UTC - this time permanently.
Flags: needinfo?(bhearsum)
Thanks Ben! I did the same for the date range of 2016/07/01 - 2016/07/31. It looks like a few things need to be looked into but I'd like to start with the Firefox 42 systems that aren't being offered updates especially on Windows 10. Any ideas as to why?

 No updates found sum divided by ping sum both for the notify event
For the date range of 2016/07/01 - 2016/07/31
Release
Version |   42   |   43   |   44   |   45   |  46    |
--------+--------+--------+--------+--------+--------+
Win 10  | 38.12% |  1.49% |  4.10% |  4.25% |  8.64% |
Win 8.1 | 27.40% |  1.14% |  3.27% |  2.60% |  5.88% |
Win 8   |  8.14% |  0.66% |  1.15% |  2.24% |  4.17% |
Win 7   | 18.24% |  1.60% |  3.86% |  3.24% |  6.95% |
Win XP  | 10.10% |  1.56% |  6.53% |  7.54% |  9.08% |


scale = thousands
Release
Version |       42       |      43       |       44      |      45       |      46
--------+----------------+---------------+---------------+---------------+---------------
Win 10  | 12.78 /  33.53 | 1.36 /  91.05 | 2.67 /  65.07 | 3.52 /  82.75 | 16.99 / 196.69
Win 8.1	|  3.59 /  13.10 | 0.61 /  53.72 | 0.74 /  22.66 | 0.76 /  29.18 |  3.38 /  57.49
Win 8   |  0.36 /   4.42 | 0.17 /  25.83 | 0.10 /   8.72 | 0.26 /  11.61 |  0.80 /  19.19
Win 7   | 21.04 / 115.37 | 7.12 / 443.72 | 6.93 / 179.48 | 9.78 / 302.22 | 34.05 / 489.73
Win XP  |  3.61 /  35.75 | 3.86 / 246.85 | 3.11 /  47.60 | 5.00 /  66.28 | 11.91 / 131.21
Flags: needinfo?(bhearsum)
I spoke with Ben this morning. Turns out that background updates have been turned off for Firefox <48.0 because of a problem with the WebSense plugin (see bug 1294650).

It has since been reenabled for users who don't use WebSense (see bug 1298404), so it might make sense to filter out users who use WebSense from the update orphaning dashboard. I suggest filtering out because we deliberately don't offer updates to these users and the update orphaning dashboard is intended to track unknown reasons why users are out of date.
Flags: needinfo?(bhearsum)
I'll just use a different date range from before the websense issue as I did in comment #35
The reason I would rather use a different date range instead of filtering is that I can quickly gather this data via the telemetry dashboard. Once problem areas such as comment #28, comment #32, and comment #35 are found if necessary I'll dive deeper into the data which takes much more time than just using the telemetry dashboard.
For comment #35
Flags: needinfo?(bhearsum)
It may be unrelated, but just in case.

Until today we served 43.0.1 full installers (no stubs) to all XP and Vista users by sniffing their UA. 43.0.1 was the last release signed with the SHA1 cert by release automation.

Starting with 49.0, we are serving up-to-date full installer (no stubs) signed with the new SHA1 cert, again, detecting them by sniffing their UA.

$ curl -A 'Mozilla/5.0 (Windows; U; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)' -sLI 'https://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US' | grep ^Location:

Location: https://download-installer.cdn.mozilla.net/pub/firefox/releases/49.0/win32-sha1/en-US/Firefox%20Setup%2049.0.exe
Definitely unrelated to Firefox 42 on Windows 10 not seeing updates. I'll keep the installers in mind but I don't think it has anything to do with this.
Any chance those Windows 10 pings are using the "MetroFirefox" product? I know we had that at one point, but I'm pretty sure it's dead. We definitely don't serve updates to anyone requesting under that product name. Other than that, I'm not aware of anything that would affect Windows 10 more than any other version. I don't see any special casing for Windows 10 in the Balrog rules.

And just for the record, it looks like updates to 47.0.1 were enabled at a 100% background rate for the entire month of July.
Flags: needinfo?(bhearsum)
I don't see how since MetroFirefox never made it to the release channel. Other Windows versions for Firefox 42 also have high percentages per comment #35... just not as dramatic.
As quick test I installed 42.0 win32 en-US on Windows 10, which updated to 43.0.1 (for the SHA1 watershed), and then 47.0.1 (websense). It didn't find 49.0 yet, because the hotfix for websense hasn't been picked up.

Do 'No updates found' telemetry only fire if the request works and comes back as <updates></updates> ? Or does it also include SSL certs issues, MITM by security products etc ?
It should only report that code when <updates></updates> and there are separate codes for cert issues, general failures, etc.
Here is the same data including Firefox 41
No updates found sum divided by ping sum both for the notify event
For the date range of 2016/07/01 - 2016/07/31
Release
Version |   41   |   42   |   43   |   44   |   45   |  46    |
--------+--------+--------+--------+--------+--------+--------+
Win 10  |  1.99% | 38.12% |  1.49% |  4.10% |  4.25% |  8.64% |
Win 8.1 |  1.82% | 27.40% |  1.14% |  3.27% |  2.60% |  5.88% |
Win 8   |  0.72% |  8.14% |  0.66% |  1.15% |  2.24% |  4.17% |
Win 7   |  1.87% | 18.24% |  1.60% |  3.86% |  3.24% |  6.95% |
Win XP  |  4.72% | 10.10% |  1.56% |  6.53% |  7.54% |  9.08% |

scale = thousands
Release
Version |      41
--------+-------------
Win 10  | 0.33 / 16.61
Win 8.1 | 0.17 /  9.32
Win 8   | 0.03 /  4.19
Win 7   | 1.74 / 92.98
Win XP  | 1.79 / 37.95
I believe that arch was added to build target in bug 1183077 for Firefox 42. Perhaps something due to that change triggered this?

Nick, what do you think about my filing new aus bugs for this and for the aurora issue pointed out in comment #28 and comment #32 since this bug is about release 43.0.1?
Flags: needinfo?(nthomas)
Just talked with nthomas over irc and he is going to try to get server log(s) so the url being sent by the Firefox 42 clients can be examined. I know that this isn't about this bug but it definitely appears to be an issue that we have a solid path to investigating and I'll continue investigating Firefox 43.0.1 in the hope of finding a solid path for it as well as soon as time permits.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #48)
> Just talked with nthomas over irc and he is going to try to get server
> log(s) so the url being sent by the Firefox 42 clients can be examined. I
> know that this isn't about this bug but it definitely appears to be an issue
> that we have a solid path to investigating and I'll continue investigating
> Firefox 43.0.1 in the hope of finding a solid path for it as well as soon as
> time permits.

I happened to have a ~12h slice of logs from September 16th on my laptop for unrelated reasons. Here's an example entry:
balrog_web.access.log.custom.1:1474038063.638 "w.x.y.z" GET /update/3/Firefox/42.0/20151029151421/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/default/default/update.xml HTTP/1.1" 200 861 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0" 0.000 - "-" HIT

And here's a count of of buildids sent by Firefox 42.0 (GMP and SystemAddon pings excluded):
➜  log grep '/42.0/' * | grep '/Firefox/' | grep WIN | cut -d/ -f7 | sort | uniq -c
    552 WINNT_x86_64-msvc
   4226 WINNT_x86_64-msvc-x64
  29512 WINNT_x86-msvc
 208414 WINNT_x86-msvc-x64
 110575 WINNT_x86-msvc-x86
What conditions did you use to grep the log to get the example entry? I'd like to see the unique entries along with count of the entries that contain 'Windows NT 10' and 'Firefox/42'
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #50)
> What conditions did you use to grep the log to get the example entry? I'd
> like to see the unique entries along with count of the entries that contain
> 'Windows NT 10' and 'Firefox/42'

I put the grep in the comment above :). It was filtered on '/42.0/', '/Firefox/', and 'WIN'.

Here's a query using your conditions:
➜  log grep 'Firefox/42' * | grep 'Windows NT 10' | cut -d/ -f7 | sort | uniq -c
grep: new: Is a directory
    227 Linux_x86_64-gcc3 (wtf!?)
    428 WINNT_x86_64-msvc
   1217 WINNT_x86_64-msvc-x64
  10856 WINNT_x86-msvc
  56788 WINNT_x86-msvc-x64
   5052 WINNT_x86-msvc-x86
For that data can I get the unique values for /update/3/Firefox/42.0/20151029151421/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/default/default/update.xml along with the counts?
I don't have the same set of logs on hand as Ben, but I tried using the Content-Length of the response:
* 317771 Firefox update requests in dataset [1]
*   2368 matching a regexp of '/Firefox/42\.0', to exclude GMP, SystemAddons. Broadly consistent with ADI.

Of those
*   1983 non-empty responses (84%) [2]
*    385 empty responses (16%)
*      6 empty responses (0.25%) if limited to 'WINNT.*/release' [3]
*      0 empty responses if deprecated platforms are ignored

So in this sample there's no problem with returning some content, at least. Perhaps we should provide some of these logs for your own analysis rather than acting as a middleman ?


[1] logs for one loadbalancer IP for the hour 20160921 2300 UTC
[2] Content-Length more than 42 bytes, aka <?xml version="1.0"?><updates></updates>
[3] Most empty repsonses are from Darwin, which is pointed at the latest release if new enough OSX and therefore vulnerable to throttling. This doesn't apply to Windows because of the 43.0.1 watershed for SHA1.
Flags: needinfo?(nthomas)
43.0.1 data - 2016-09-01 to current from the longitudinal datasource

Client ID's that have reported as Firefox 43.0.1 for the last 40 pings

Note: I checked the last 5 entries for all of the data below and the percentages didn't change significantly so the numbers below represent the most recent entry.

Total unique client ID's: 28941

                                      |       |  Count  |  Total  |
Update Check Code                     | Count | Percent | Percent |
--------------------------------------+-------+---------+---------+
CHK_IS_DOWNLOADING                    |  4867 |  40.15% |  16.82% |
CHK_PREF_DISABLED                     |  2260 |  18.65% |   7.81% |
CHK_ADDON_NO_INCOMPAT                 |  1504 |  12.41% |   5.20% |
CHK_GENERAL_ERROR_SILENT              |  1004 |   8.28% |   3.47% |
CHK_HAS_ACTIVEUPDATE                  |   702 |   5.79% |   2.43% |
CHK_IS_STAGED                         |   592 |   4.88% |   2.05% |
CHK_SHOWPROMPT_PREF                   |   532 |   4.39% |   1.84% |
CHK_UNABLE_TO_APPLY                   |   343 |   2.83% |   1.19% |
CHK_NO_MUTEX                          |    86 |   0.71% |   0.30% |
CHK_NO_UPDATE_FOUND                   |    65 |   0.54% |   0.22% |
CHK_GENERAL_ERROR_PROMPT              |    56 |   0.46% |   0.19% |
CHK_IS_DOWNLOADED                     |    55 |   0.45% |   0.19% |
CHK_ADDON_UPDATES_FOR_INCOMPAT        |    31 |   0.26% |   0.11% |
CHK_ADDON_HAVE_INCOMPAT               |    18 |   0.15% |   0.06% |
CHK_ADDON_PREF_DISABLED               |     5 |   0.04% |   0.02% |
CHK_OFFLINE                           |     1 |   0.01% |   0.00% |
--------------------------------------+-------+---------+---------+
All                                   | 12121 | 100.00% |  41.88% |

CHK_IS_DOWNLOADING - a background download is already in progress (no notification)
CHK_PREF_DISABLED - background update checks disabled by preference (no notification)
CHK_ADDON_NO_INCOMPAT - no incompatible add-ons found during incompatible check (background download)
CHK_GENERAL_ERROR_SILENT - a general update check failure and threshold of 10 consecutive failures not reached (update notification). Will retry after 12 hours. Also see Update Check Extended Error Code below for specific errors.
CHK_HAS_ACTIVEUPDATE - already has an active update in progress (no notification)
CHK_IS_STAGED - an update is already staged and ready to apply (no notification)
CHK_SHOWPROMPT_PREF - showing prompt to update due to preference (update notification)
CHK_UNABLE_TO_APPLY - unable to apply updates (manual install to upgrade notification)
CHK_NO_MUTEX is unable to check for updates per hasUpdateMutex() (no notification and will retry in 12 hours)
CHK_NO_UPDATE_FOUND - no update found (no notification)
CHK_GENERAL_ERROR_PROMPT - a general update check failure and threshold of 10 consecutive failures reached (check failure notification). Also see Update Check Extended Error Code below for specific errors.
CHK_IS_DOWNLOADED - an update is already downloaded and ready to apply without staging (no notification)
CHK_ADDON_UPDATES_FOR_INCOMPAT - incompatible add-ons found and all of them have updates (background download)
CHK_ADDON_HAVE_INCOMPAT - incompatible add-ons found (update notification)
CHK_ADDON_PREF_DISABLED - incompatible add-on check disabled by preference (background download)
CHK_OFFLINE is unable to perform a background check while offline (no notification)


                                      |       |  Count  |  Total  |
Network Error Code                    | Count | Percent | Percent |
--------------------------------------+-------+---------+---------+
NS_ERROR_UNKNOWN_HOST                 |   743 |  76.84% |   2.57% |
NS_ERROR_ABORT                        |    65 |   6.72% |   0.22% |
NS_ERROR_NET_INTERRUPT                |    36 |   3.72% |   0.12% |
NS_ERROR_CONNECTION_REFUSED           |    35 |   3.62% |   0.12% |
NS_ERROR_NET_RESET                    |    32 |   3.31% |   0.11% |
NS_ERROR_PROXY_CONNECTION_REFUSED     |    28 |   2.90% |   0.10% |
NS_ERROR_NET_TIMEOUT                  |    16 |   1.65% |   0.06% |
NS_ERROR_FAILURE                      |     7 |   0.72% |   0.02% |
NS_ERROR_NOT_AVAILABLE                |     5 |   0.52% |   0.02% |
--------------------------------------+-------+---------+---------+
All                                   |   967 | 100.00% |   3.34% |


                                      |       |  Count  |  Total  |
Download Code                         | Count | Percent | Percent |
--------------------------------------+-------+---------+---------+
DWNLD_ERR_DOCUMENT_NOT_CACHED         |  5097 |  49.83% |  17.61% |
DWNLD_RETRY_NET_RESET |               |  3799 |  37.14% |  13.13% |
DWNLD_RETRY_CONNECTION_REFUSED        |   468 |   4.58% |   1.62% |
DWNLD_RETRY_NET_TIMEOUT               |   444 |   4.34% |   1.53% |
DWNLD_SUCCESS                         |   255 |   2.49% |   0.88% |
DWNLD_ERR_PATCH_SIZE_NOT_EQUAL        |   128 |   1.25% |   0.44% |
DWNLD_ERR_VERIFY_PATCH_SIZE_NOT_EQUAL |    33 |   0.32% |   0.11% |
DWNLD_RETRY_OFFLINE                   |     4 |   0.04% |   0.01% |
--------------------------------------+-------+---------+---------+
All                                   | 10228 | 100.00% |  35.34% |

DWNLD_ERR_DOCUMENT_NOT_CACHED - most likely from tearing down the connection during shutdown. See bug 1272585 for the details I know of.
DWNLD_RETRY_NET_RESET - when the connection is reset. Will retry again after 2000 ms and after 10 consecutive failures will start the update process over.
DWNLD_RETRY_CONNECTION_REFUSED - when the connection was refused. Will retry again after 2000 ms and after 10 consecutive failures will start the update process over.
DWNLD_RETRY_NET_TIMEOUT - when the connection timed out. Will retry again after 2000 ms and after 10 consecutive failures will start the update process over.
DWNLD_SUCCESS - successfully downloaded.
DWNLD_ERR_PATCH_SIZE_NOT_EQUAL - when the size of the file per the onProgress maxProgress param is different than the expected size per the update.xml. Will start the update process over.
DWNLD_ERR_VERIFY_PATCH_SIZE_NOT_EQUAL - when the download finishes and the size of the file is different than the expected size per the update.xml. Will start the update process over.
DWNLD_RETRY_OFFLINE - when Firefox goes offline and the download will be retried after going online
The data points to clients having issues downloading the mar file. I believe this is due to a combination of the following:

* bug 1272585 (fixed in Firefox 49) and bug 1309124 causing clients to restart the update from the beginning.

* bug 1309125 making it so the download takes more time so the client is more likely to experience an error such as the cases in bug 1272585 and bug 1309124.

* the watershed requiring all previous versions to update to 43.0.1 before updating to the latest version and thereby not allowing any previous versions to update directly to the latest version.

* there only being a complete mar file offered to update from 43.0.1 to any newer version. This makes it so the download takes more time so the client is more likely to experience an error such as the cases in bug 1272585 and bug 1309124. Filed bug 1309130 to provide a partial mar file to 43.0.1 clients.


Current version distribution per the MainSummary parquet
43.0.1 has 7528112 unique client ID's
43.0.1 is 3.76% of the 42.x through 49.x Firefox population
43.0.1 is 33.24% of the 42.x through 46.x Firefox population

42.x through 49.x has 200273527 unique client ID's
42.x through 46.x has 22644372 unique client ID's

                    |   All   | 42 - 46 |
Version |   Count   | Percent | Percent |
 42.0.0 |   1785285 |   0.89% |   7.88% |
 43.0.0 |    128906 |   0.06% |   0.57% |
 43.0.1 |   7528112 |   3.76% |  33.24% |
 43.0.2 |    145835 |   0.07% |   0.64% |
 43.0.3 |    215578 |   0.11% |   0.95% |
 43.0.4 |   1261773 |   0.63% |   5.57% |
 44.0.0 |    394595 |   0.20% |   1.74% |
 44.0.1 |    105961 |   0.05% |   0.47% |
 44.0.2 |   2059561 |   1.03% |   9.10% |
 45.0.0 |    684240 |   0.34% |   3.02% |
 45.0.1 |   2081016 |   1.04% |   9.19% |
 45.0.2 |   1524286 |   0.76% |   6.73% |
 46.0.0 |    557760 |   0.28% |   2.46% |
 46.0.1 |   4171464 |   2.08% |  18.42% |
 47.0.0 |   8048831 |   4.02% |
 47.0.1 |  32263986 |  16.11% |
 48.0.0 |   5425794 |   2.71% |
 48.0.1 |   6281067 |   3.14% |
 48.0.2 | 117656129 |  58.75% |
 49.0.0 |   7953348 |   3.97% |
Note: it is possible to set throttling to 0 via a preference so we might also consider using a system or hotfix add-on to set this preference.
(In reply to Nick Thomas [:nthomas] from comment #53)
> I don't have the same set of logs on hand as Ben, but I tried using the
> Content-Length of the response:
> * 317771 Firefox update requests in dataset [1]
> *   2368 matching a regexp of '/Firefox/42\.0', to exclude GMP,
> SystemAddons. Broadly consistent with ADI.
> 
> Of those
> *   1983 non-empty responses (84%) [2]
> *    385 empty responses (16%)
> *      6 empty responses (0.25%) if limited to 'WINNT.*/release' [3]
> *      0 empty responses if deprecated platforms are ignored
> 
> So in this sample there's no problem with returning some content, at least.
> Perhaps we should provide some of these logs for your own analysis rather
> than acting as a middleman ?
> 
> 
> [1] logs for one loadbalancer IP for the hour 20160921 2300 UTC
> [2] Content-Length more than 42 bytes, aka <?xml
> version="1.0"?><updates></updates>
> [3] Most empty repsonses are from Darwin, which is pointed at the latest
> release if new enough OSX and therefore vulnerable to throttling. This
> doesn't apply to Windows because of the 43.0.1 watershed for SHA1.
Thanks Nick! I'll dive into the data as time permits and file a new bug if needed.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #56)
> Note: it is possible to set throttling to 0 via a preference so we might
> also consider using a system or hotfix add-on to set this preference.
I forgot to mention that we don't throttle foreground downloads on release and beta and we don't throttle foreground or background downloads on aurora or nightly so I don't expect there to be a problem if we disable throttling for the remaining 43.0.1 clients.
Summary: Investigate/fix large number (~48%) of orphaned users stuck on Firefox 43.0.1 → Investigate large number (~48%) of orphaned users stuck on Firefox 43.0.1
Updating summary to reflect the current state per the dashboard
https://telemetry.mozilla.org/update-orphaning/
Summary: Investigate large number (~48%) of orphaned users stuck on Firefox 43.0.1 → Investigate large number (~37%) of orphaned users stuck on Firefox 43.0.1
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #59)
> Updating summary to reflect the current state per the dashboard
> https://telemetry.mozilla.org/update-orphaning/

Please note that the decrease in percentage is almost exclusively due to the fact that for every release cycle, there is one more version of Firefox that we consider orphaned. Last time I checked, the actual total number of users stuck on 43.0.1 appeared to be fairly constant.
(In reply to Stephen A Pohl [:spohl] from comment #60)
> (In reply to Robert Strong [:rstrong] (use needinfo to contact me) from
> comment #59)
> > Updating summary to reflect the current state per the dashboard
> > https://telemetry.mozilla.org/update-orphaning/
> 
> Please note that the decrease in percentage is almost exclusively due to the
> fact that for every release cycle, there is one more version of Firefox that
> we consider orphaned. Last time I checked, the actual total number of users
> stuck on 43.0.1 appeared to be fairly constant.
The major change is mostly due to this.

 also looked at updates to 43.0.1 along with updates from 43.0.1 and there is definitely movement. The  appearance of the subset being fairly constant is likely due to the amount of time it takes to successfully update due to the issues I mentioned in comment #55
Summary: Investigate large number (~37%) of orphaned users stuck on Firefox 43.0.1 → Investigate large number (~37% and ~3.76% of all) of orphaned users stuck on Firefox 43.0.1
There are bugs filed for the identified issues so resolving since the investigation is done.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Data for clients updating to Firefox 43.0.1 from a previous version and updating from 43.0.1 to a newer version

longitudinal datasource (1% of profiles) pings from 2016-09-01 to today:

8983 clients with a Firefox version greater than 43.0.1 for the latest ping and a Firefox version of 43.0.1 for the previous ping.
3456 clients with a Firefox version of 43.0.1 for the latest ping and a Firefox version less than 43.0.1 for the previous ping.

69169 clients with a Firefox version greater than 43.0.1 for the latest ping and a Firefox version of 43.0.1 for one of the previous 10 pings.
22977 clients with a Firefox version of 43.0.1 for the latest ping and a Firefox version less than 43.0.1 for one of the previous 10 pings.

173037 clients with a Firefox version greater than 43.0.1 for the latest ping and a Firefox version of 43.0.1 for one of the previous 40 pings.
35288 clients with a Firefox version of 43.0.1 for the latest ping and a Firefox version less than 43.0.1 for one of the previous 40 pings.
Some numbers for the 47.0.2 watershed that I think are relevant to the findings in this bug.

Clients that reported 47.0.2

Date Range: 20161102 - 20161104

47.0.2 Total: 21897

Matches in MainSummary: 21772 (99.43% of all pings)

Version | Count | Percent |
--------+-------+---------+
 39.0   |     2 |   0.01% |
 40.0.3 |     1 |   0.00% |
 41.0.2 |     1 |   0.00% |
 42.0   |     1 |   0.00% |
 43.0.1 |    11 |   0.05% |
 43.0.4 |     1 |   0.00% |
 44.0   |     1 |   0.00% |
 44.0.2 |     3 |   0.01% |
 45.0   |     2 |   0.01% |
 45.0.1 |     3 |   0.01% |
 45.0.2 |     2 |   0.01% |
 46.0.1 |    10 |   0.05% |
 47.0   |     7 |   0.03% |
 47.0.1 |    66 |   0.30% |
 47.0.2 | 14327 |  65.80% |
 48.0   |     1 |   0.00% |
 48.0.2 |    61 |   0.28% |
 49.0   |     2 |   0.01% |
 49.0.1 |    11 |   0.05% |
 49.0.2 |  7259 |  33.34% |
--------+-------+---------+
        | 21772 | 100.00% |


Clients that reported 47.0.2

From all pings
Date Range: 20161102 - 20161104

From MainSummary
Date Range: 20161102 - 20161114 (updated every night)

47.0.2 Total: 21897

Matches in MainSummary: 21851 (99.79% of all pings)

Version | Count | Percent |
--------+-------+---------+
 39.0   |     2 |   0.01% |
 39.0.3 |     1 |   0.00% |
 40.0.3 |     1 |   0.00% |
 41.0   |     2 |   0.01% |
 41.0.2 |     1 |   0.00% |
 42.0   |     2 |   0.01% |
 43.0.1 |    55 |   0.25% |
 43.0.2 |     1 |   0.00% |
 43.0.3 |     1 |   0.00% |
 43.0.4 |     7 |   0.03% |
 44.0   |     2 |   0.01% |
 44.0.2 |     8 |   0.04% |
 45.0   |     9 |   0.04% |
 45.0.1 |     7 |   0.03% |
 45.0.2 |     7 |   0.03% |
 46.0   |     1 |   0.00% |
 46.0.1 |    23 |   0.11% |
 47.0   |    19 |   0.09% |
 47.0.1 |   107 |   0.49% |
 47.0.2 |  7177 |  32.85% |
 48.0   |     3 |   0.01% |
 48.0.1 |     1 |   0.00% |
 48.0.2 |   103 |   0.47% |
 49.0   |     1 |   0.00% |
 49.0.1 |     9 |   0.04% |
 49.0.2 | 14300 |  65.44% |
 50.0   |     1 |   0.00% |
--------+-------+---------+
        | 21851 | 100.00% |

This analysis uses the same clients that updated to 47.0.2 to show the movement of clients over time. So, from 20161104 to 20161114 the percentage of clients that have updated to the latest available increased from 33.34% to 65.44% and 32.85% of the clients that updated to 47.0.2 are still on 47.0.2. I believe it is reasonable to assume that the vast majority of clients that have 47.0.2 updated to it using application update since the latest Firefox version is the primary version we offer to download, etc. and from that application update is working for these clients. Since these are still a large percentage of these users on 47.0.2 I think the causes are at least in part the causes identified in this bug along with that a large portion of users on older versions likely using Firefox much less often than the clients that have updated to the latest version. I also highly suspect that 47.0.2 is going to have similar orphaning percentages as we have seen with 43.0.1.

As for possible future solutions for this issue ( besides the ones listed in comment #55 ) there is also the possibility of using something along the lines of the Windows' Background Intelligent Transfer Service (BITS) to perform the download when Firefox isn't running and then applying the update when Firefox is next launched.
You need to log in before you can comment on or make changes to this bug.