Closed Bug 818370 Opened 13 years ago Closed 13 years ago

Slow uptake for Fx-17

Categories: Release Engineering :: General, defect
Platform: x86 macOS
Severity: normal
Tracking: firefox17 + fixed
Status: RESOLVED FIXED

People: (Reporter: aphadke, Unassigned)

Details

(pasting email thread out here so we can track it)

Annie was looking at the Fx-17 uptake graph and found that Fx-17 uptake is far lower than for our previous releases: http://screencast.com/t/ilxm5oV7 OR https://metrics.mozilla.com/pentaho/content/pentaho-cdf-dd/Render?solution=metrics2&path=%2Ftwitter%2Ffirefox&file=fx17_release.wcdf

I verified the raw data from two different sources and doubt we are missing any data. Are we throttling updates, or have we changed the way Fx-17 pings AMO servers?

Replies:

Anthony Hughes: I just did a quick spotcheck and I am getting a background update, which would imply I am unthrottled (checked 17.0, 16.0.2, and 15.0). I can do more detailed testing if required, but not until tomorrow morning.

Alex Keybl: Thanks for testing, Anthony. In the morning, it would be good to double check that a blocklist ping is happening as expected. I think it's very unlikely that it's broken, however, since we've lost a similar percentage from 16.0.* that we've gained in 17.0.*. This is concerning - not sure if there's anything else we can be checking here. Including rstrong to see if he has any ideas.

Nick Thomas: I agree that throttling is disabled, based on making lots of manual requests against aus3.m.o to check that all the nodes behind it are responding properly. It might be fallout from bug 818038 instead.
I did a quick additional check of the failure case: after removing the hotfix, I'm able to update.

1. Install Firefox 15
2. Install the hotfix
3. Wait for background update
4. Quit Firefox and pave-over install Firefox 16
5. Wait for background update
Firefox fails to background update the second time (step 5) with the following errors in the Error Console.
6. Remove the hotfix
7. Wait for background update
Firefox successfully updates to Firefox 17.0.1

Anthony Hughes
QA Engineer, Desktop Firefox
Mozilla Corporation

----- Original Message -----
From: "Anthony Hughes" <ahughes@mozilla.com>
To: "Alex Keybl" <akeybl@mozilla.com>
Cc: "Nick Thomas" <nthomas@mozilla.com>, "Gilbert FitzGerald" <gfitzgerald@mozilla.com>, "Anurag Phadke" <aphadke@mozilla.com>, "release" <release@mozilla.com>, "Lukas Blakk" <lsblakk@mozilla.com>, "Jake Maul" <jakem@mozilla.com>, "Corey Shields" <cshields@mozilla.com>, "Dave Townsend" <dtownsend@mozilla.com>, "Annie Elliott" <aelliott@mozilla.com>, "Robert Strong" <rstrong@mozilla.com>
Sent: Wednesday, December 5, 2012 10:28:35 AM
Subject: Re: firefox 17 uptake

I tested the following scenarios successfully.

1. Install Firefox 15
2. Install the hotfix
3. Wait for background update
4. Restart Firefox
Firefox background updates to Firefox 17.0.1 successfully.

1. Install Firefox 15
2. Install the hotfix
3. Quit Firefox and pave-over install Firefox 16
4. Start Firefox and wait for background update
Firefox background updates to Firefox 17.0.1 successfully.

However, the following scenario fails.

1. Install Firefox 15
2. Install the hotfix
3. Wait for background update
4. Quit Firefox and pave-over install Firefox 16
5. Wait for background update
6. Restart Firefox
Firefox fails to background update the second time (step 5) with the following errors in the Error Console:

Error: Search service falling back to synchronous initialization at SRCH_SVC__ensureInitialized@resource:///components/nsSearchService.js:2498 @resource:///components/nsSearchService.js:3462 AHU_loadDefaultSearchEngine@resource:///components/nsBrowserContentHandler.js:825 @resource:///components/nsBrowserContentHandler.js:554 dch_handle@resource:///components/nsBrowserContentHandler.js:804
Source File: resource:///components/nsSearchService.js Line: 2499

Error: ERROR addons.xpi: Failed to remove file C:\Users\ashughes\AppData\Roaming\Mozilla\Firefox\Profiles\v25sutgw.default\extensions\trash\firefox-hotfix@mozilla.org.xpi: [Exception... "Component returned failure code: 0x80520015 (NS_ERROR_FILE_ACCESS_DENIED) [nsIFile.remove]" nsresult: "0x80520015 (NS_ERROR_FILE_ACCESS_DENIED)" location: "JS frame :: resource:///modules/XPIProvider.jsm :: recursiveRemove :: line 1256" data: no]
Source File: resource:///modules/XPIProvider.jsm Line: 1256

Error: Expected certificate attribute 'sha1Fingerprint' value incorrect, expected: 'CA:C4:7D:BF:63:4D:24:E9:DC:93:07:2F:E3:C8:EA:6D:C3:94:6E:89', got: 'F1:DB:F9:6A:7B:B8:04:FA:48:3C:16:95:C7:2F:17:C6:5B:C2:9F:45'.
Source File: resource:///modules/CertUtils.jsm Line: 103

Error: Certificate checks failed. See previous errors for details.
Source File: resource:///modules/CertUtils.jsm Line: 106

Anthony Hughes
QA Engineer, Desktop Firefox
Mozilla Corporation

----- Original Message -----
From: "Alex Keybl" <akeybl@mozilla.com>
To: "Anthony Hughes" <ashughes@mozilla.com>
Cc: "Nick Thomas" <nthomas@mozilla.com>, "Anthony Hughes" <ahughes@mozilla.com>, "Gilbert FitzGerald" <gfitzgerald@mozilla.com>, "Anurag Phadke" <aphadke@mozilla.com>, "release" <release@mozilla.com>, "Lukas Blakk" <lsblakk@mozilla.com>, "Jake Maul" <jakem@mozilla.com>, "Corey Shields" <cshields@mozilla.com>, "Dave Townsend" <dtownsend@mozilla.com>, "Annie Elliott" <aelliott@mozilla.com>, "Robert Strong" <rstrong@mozilla.com>
Sent: Wednesday, December 5, 2012 9:47:58 AM
Subject: Re: firefox 17 uptake

I'm also wondering if this could at all be related to https://bugzilla.mozilla.org/show_bug.cgi?id=803596 (either the cert or the fix for https://bugzilla.mozilla.org/show_bug.cgi?id=790096). For this point, Anthony, can you make sure to install the hotfix on Firefox 15 and then try to bg update to FF17? Another path that may be good to test would be Firefox 15 with the hotfix, paved over by FF16, followed by a bg update to FF17.
-Alex

On Dec 5, 2012, at 9:41 AM, Alex Keybl <akeybl@mozilla.com> wrote:

Is there any analysis that we can do around whether update pings are being received, and whether update downloads are being initiated? We need to understand where the failure is occurring as soon as possible. For instance, whether:
1) The update is not being requested
2) The download is not starting
3) The download is not completing
4) The update is not being applied

I'm also wondering if this could at all be related to https://bugzilla.mozilla.org/show_bug.cgi?id=803596 (either the cert or the fix for https://bugzilla.mozilla.org/show_bug.cgi?id=790096).
-Alex

On Dec 5, 2012, at 6:30 AM, Annie Elliott <aelliott@mozilla.com> wrote:

Good Morning All -

I've attached 2 graphs. There is what I would consider to be an expected post-release drop in Beta 17. Beta 16 is, much like Release 16 seems to be, unaffected. I've also attached the updated ADI for yesterday, which shows no rebound, just the natural weekday lift (i.e., it follows the same Su->Sa shape as all the others). Both are inclusive of yesterday.
-Annie
Annie Elliott, Mozilla Metrics Team / e: aelliott@mozilla.com / i: aelliott
<Screen Shot 2012-12-05 at 6.19.58 AM.png> <Screen Shot 2012-12-05 at 6.25.12 AM.png>

On Dec 4, 2012, at 11:36 PM, Nick Thomas <nthomas@mozilla.com> wrote:

Bug 818038 should not be version specific at all. You could try looking at the uptake of the 18.0 betas (vs 17.0), since that also coincides with the issue on the load balancer. Throttling isn't used on the beta channel at all, and the first 18.0 beta was available for update on Mon Nov 26. From a quick look at the ADU of 17.0 on the beta channel we can rule out the blocklist ping being broken.
-Nick

On 5/12/12 6:54 PM, Annie Elliott wrote:

Hi All -
I will throw in my 2 cents that that bug sounds fairly plausible, and according to comment 4, if it is the culprit, it is now fixed. If that is the case, we should see some rebound tomorrow, to my thinking. However, I am worried that it somehow seems limited to *just* version 17, so to me it would make more sense if comment 2 held and it were a CDN problem (my understanding of MoCo release machinations may be off a bit though).
Some observations:
- If you look at the attached, there is a 7th peak for version 16 that shouldn't be where it is; it should instead be in the end-of-cycle downslope like the one for v15. Additionally, note that the downslope in week 7, where there would normally be the crossover (overtaking) of one version by another, is missing.
- v16 *appears* to be chugging along just fine, and v17 seems to be mostly MIA. If it were the case that the ping was simply broken in 17, then I would posit that we should instead be seeing the 7th-week downward slope where folks are upgrading but not showing up as upgraded.

I am leaning towards the download being broken for a lot of folks. If it were simply the Zeus caching problem, would it discriminate between one release and another like that?
-Annie
Annie Elliott, Mozilla Metrics Team / e: aelliott@mozilla.com / i: aelliott

On Dec 4, 2012, at 7:25 PM, Nick Thomas <nthomas@mozilla.com> wrote:

I agree that throttling is disabled, based on making lots of manual requests against aus3.m.o to check that all the nodes behind it are responding properly. It might be fallout from bug 818038 instead.
-Nick

On 5/12/12 3:47 PM, Alex Keybl wrote:

Thanks for testing, Anthony. In the morning, it would be good to double check that a blocklist ping is happening as expected. I think it's very unlikely that it's broken, however, since we've lost a similar percentage from 16.0.* that we've gained in 17.0.*. This is concerning - not sure if there's anything else we can be checking here. Including rstrong to see if he has any ideas.
-Alex

On Dec 4, 2012, at 6:29 PM, Anthony Hughes <ahughes@mozilla.com> wrote:

I just did a quick spotcheck and I am getting a background update, which would imply I am unthrottled (checked 17.0, 16.0.2, and 15.0). I can do more detailed testing if required, but not until tomorrow morning.

Anthony Hughes
QA Engineer, Desktop Firefox
Mozilla Corporation

----- Original Message -----
From: "Alex Keybl" <akeybl@mozilla.com>
To: "Anurag Phadke" <aphadke@mozilla.com>, "release" <release@mozilla.com>, "Anthony Hughes" <ashughes@mozilla.com>
Cc: "Gilbert FitzGerald" <gfitzgerald@mozilla.com>, "Annie Elliott" <aelliott@mozilla.com>
Sent: Tuesday, December 4, 2012 6:24:04 PM
Subject: Re: firefox 17 uptake

Hi Anurag,

It makes sense that the curve started slow (we were initially unthrottled for a couple of days over Thanksgiving), but I would have expected uptake to increase significantly over the weekend once we released 17.0.1 unthrottled on Friday. RelEng & QA - can we double verify that we're unthrottled properly?
-Alex

On Dec 4, 2012, at 6:06 PM, Anurag Phadke <aphadke@mozilla.com> wrote:

Hey Alex,
Annie was looking at the Fx-17 uptake graph and found out that Fx-17 uptake is far lower than our previous uptakes, http://screencast.com/t/ilxm5oV7 OR https://metrics.mozilla.com/pentaho/content/pentaho-cdf-dd/Render?solution=metrics2&path=%2Ftwitter%2Ffirefox&file=fx17_release.wcdf
I verified the raw data from two different sources and doubt we are missing any data.
Are we throttling updates or have we changed the way Fx-17 pings AMO servers? -anurag
On Dec 5, 2012, at 10:01 AM, Dave Townsend <dtownsend@mozilla.com> wrote:

On 5 December 2012 09:41, Alex Keybl <akeybl@mozilla.com> wrote:
I'm also wondering if this could at all be related to https://bugzilla.mozilla.org/show_bug.cgi?id=803596 (either the cert or the fix for https://bugzilla.mozilla.org/show_bug.cgi?id=790096).

The cert fix part of this is very unlikely, I think; it should have no effect on the app update path. The other fix is maybe possible, but there would be a way to check that with metrics. Does the uptake look the same on all platforms, or are only some affected? The fix for bug 790096 should have only impacted OS X and Linux, but based on the vast reduction in uptake I would assume Windows is affected too, so it probably isn't related. That, and all that fix does is flip a pref, a pref that has no effect in Firefox 16, so it would be very surprising if it were breaking Firefox 16's ability to update in some way.
That, and all that fix does is flip a pref, a pref that has no effect in Firefox 16, so it would be very surprising if it were breaking Firefox 16's ability to update in some way.

Agreed - having re-read the bugs, it's very unlikely that the hotfix would have changed update behavior (especially since we renamed the pref in FF16). I'm just trying to cover our bases, looking for issues that may only be impacting external users. The hotfix was a candidate, since Anthony's testing may not have installed it prior to updating. It's looking more and more like this isn't an in-product issue, but we'll need to analyze those update pings and subsequent downloads to make sure.
-Alex
Are bugs 817688 or 816472 related?
(In reply to Chris AtLee [:catlee] from comment #5)
> Are bugs 817688 or 816472 related?

Possibly, though we don't know for sure what the cause of the slow uptake is yet.
"1) The updates is not being requested 2) The download is not starting 3) The download is not completing 4) The update is not being applied" Annie - is this something that the Metrics team can pull together? We'd like to understand the number of update pings that are being received on a daily basis from Firefox 16 (note that Firefox 17 now pings every 12 hours instead of every 24 hours), and the number of update downloads that have been initiated daily? We should be able to discern the number of unsuccessful updates from the number of initiated updates versus the ADI. While Beta 18 is definitely not impacted as badly as Release 17, it has had a depressed uptake it's got about half as many users as Beta 17 did this many days since release.
One quick run on my XP machine with a fresh profile and app.update.log enabled shows some odd messages:

AUS:SVC Checker:getUpdateURL - update URL: https://aus3.mozilla.org/update/3/Firefox/16.0.2/20121024073032/WINNT_x86-msvc/en-US/release/Windows_NT%205.1.3.0%20(x86)/default/default/update.xml
AUS:SVC gCanCheckForUpdates - able to check for updates
AUS:SVC Checker:checkForUpdates - sending request to: https://aus3.mozilla.org/update/3/Firefox/16.0.2/20121024073032/WINNT_x86-msvc/en-US/release/Windows_NT%205.1.3.0%20(x86)/default/default/update.xml
UTM:SVC TimerManager:notify - notified @mozilla.org/updates/update-service;1
AUS:SVC Checker:onError - request.status: 2153390067
AUS:SVC getStatusTextFromCode - transfer error: Update XML file malformed (200), default code: 200
AUS:SVC UpdateService:notify:listener - error during background update: Update XML file malformed (200)

I'll keep the machine running to see if there's more.
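For reference, the update URL in that log follows the AUS /update/3/ path layout, so a monitoring probe can rebuild the URL for any build instead of hard-coding one request. A minimal sketch in Python; the field order is inferred from the URL above, and the helper name is made up:

from urllib.parse import quote

# Field order inferred from the URL in the log above:
# /update/3/<product>/<version>/<buildid>/<buildtarget>/<locale>/<channel>/<osversion>/<dist>/<distversion>/update.xml
AUS_TEMPLATE = ("https://aus3.mozilla.org/update/3/"
                "{product}/{version}/{buildid}/{buildtarget}/{locale}/"
                "{channel}/{osversion}/{dist}/{distversion}/update.xml")

def aus_url(**fields):
    """Build an AUS v3 update URL, percent-encoding each path segment."""
    encoded = {k: quote(str(v), safe="") for k, v in fields.items()}
    return AUS_TEMPLATE.format(**encoded)

# The request Juan's 16.0.2 build made (parentheses get encoded here, as in
# the 16.0 URL quoted a few comments down):
print(aus_url(product="Firefox", version="16.0.2", buildid="20121024073032",
              buildtarget="WINNT_x86-msvc", locale="en-US", channel="release",
              osversion="Windows_NT 5.1.3.0 (x86)", dist="default",
              distversion="default"))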
Juan - can you grab updates.xml, active-update.xml, and any other XML files after hitting that?
I just got this from AUS: <updates><update type="minor" displayVersion="17.0.1" appVersion="17.0.1" platformVersion="17.0.1" buildID="20121128204232" detailsURL="https://www.mozilla.com/en-US/firefox/17.0.1/releasenotes/" actions="silent"><patch type="complete" URL="http://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US" hashFunction="SHA512" hashValue="00296049fbe1d59ec1579f77bb931f06e2ea29ded42a0fa80524cb8a42a322e76363f53eb5bb38c8fdab7398e9246e5e20b83ee544c03e31724d24762eb2067d" size="24284002"/><patch type="partial" URL="http://download.mozilla.org/?product=firefox-17.0.1-partial-16.0.2&os=win&lang=en-US" hashFunction="SHA512" hashValue="3ff770ed7cf1c2e75722d59c9d2017066b5c83ee4ad536bb94c38fa5d30bc1eeecd7cb19f55d4087337ac5e9dfefda635208cf5b8d1ffd09a13a1f0e40230f4a" size="10352377"/></update></updates> and validator.w3.org throws a bunch of errors about it. Not sure if the Gecko parser is more liberal, though.
(In reply to Ben Hearsum [:bhearsum] from comment #10) > I just got this from AUS: > <updates><update type="minor" displayVersion="17.0.1" appVersion="17.0.1" > platformVersion="17.0.1" buildID="20121128204232" > detailsURL="https://www.mozilla.com/en-US/firefox/17.0.1/releasenotes/" > actions="silent"><patch type="complete" > URL="http://download.mozilla.org/?product=firefox-17.0.1- > complete&os=win&lang=en-US" hashFunction="SHA512" > hashValue="00296049fbe1d59ec1579f77bb931f06e2ea29ded42a0fa80524cb8a42a322e763 > 63f53eb5bb38c8fdab7398e9246e5e20b83ee544c03e31724d24762eb2067d" > size="24284002"/><patch type="partial" > URL="http://download.mozilla.org/?product=firefox-17.0.1-partial-16.0. > 2&os=win&lang=en-US" hashFunction="SHA512" > hashValue="3ff770ed7cf1c2e75722d59c9d2017066b5c83ee4ad536bb94c38fa5d30bc1eeec > d7cb19f55d4087337ac5e9dfefda635208cf5b8d1ffd09a13a1f0e40230f4a" > size="10352377"/></update></updates> > > and validator.w3.org throws a bunch of errors about it. Not sure if the > Gecko parser is more liberal, though. Any more errors than with a 16.0 update XML?
(In reply to Alex Keybl [:akeybl] from comment #11) > (In reply to Ben Hearsum [:bhearsum] from comment #10) > > I just got this from AUS: > > <snip> > > and validator.w3.org throws a bunch of errors about it. Not sure if the > > Gecko parser is more liberal, though. > > Any more errors than with a 16.0 update XML? Nope -- https://aus3.mozilla.org/update/3/Firefox/16.0/20121005155445/WINNT_x86-msvc/en-US/release/Windows_NT%205.1.3.0%20%28x86%29/default/default/update.xml seems to give the same ones. Maybe a couple less because there's no partial for it.
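Worth noting: validator.w3.org validates (X)HTML markup, so it is expected to complain about any update.xml regardless of its health, which is consistent with the 16.0 snippet producing the same errors. What actually matters to the client is plain XML well-formedness, and that is cheap to check locally. A sketch, Python standard library only; note that the bare & characters in the URLs as pasted above would themselves be a well-formedness error, so presumably the live response escapes them as &amp; and the paste lost that:

import sys
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (ok, message) for a candidate update.xml body."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as e:
        return False, "not well-formed: %s" % e
    return True, "well-formed, %d patch element(s)" % len(root.findall("./update/patch"))

if __name__ == "__main__":
    ok, msg = check_well_formed(sys.stdin.read())
    print(msg)
    sys.exit(0 if ok else 1)

Piping the real aus3.mozilla.org response through a check like this is a closer approximation of what the Gecko parser will accept than an HTML validator is.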
I believe this may be resolved now, just based on Akamai bandwidth data. Bandwidth for download.cdn.mozilla.net picks up *sharply* around the time that bug 818038 was resolved. That bug also provides a small explanation for why other (lesser-used) versions would be less affected. The Zeus cache is very short (15 seconds)... less frequently accessed things will naturally have a lower hit rate. Only hits were affected... misses did not hit that bug. Therefore, since the hit rate is lower with infrequently-accessed things, they were more likely to work properly, and thus should have a higher uptake. It won't be a huge difference, but it should be non-zero.
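To put rough numbers on that hit-rate argument, a back-of-the-envelope model: with a 15-second cache TTL and steady traffic, each TTL window has one miss and the rest hits. The request rates below are made-up illustrative figures, and the model ignores everything else the Zeus cache actually does:

def cache_hit_rate(requests_per_second, ttl_seconds=15.0):
    """Idealized model: one miss per TTL window, every other request in the window is a hit."""
    requests_per_window = requests_per_second * ttl_seconds
    if requests_per_window <= 1:
        return 0.0
    return (requests_per_window - 1) / requests_per_window

# Hypothetical rates, purely for illustration:
for label, rps in [("17.0.1 complete MAR", 500.0),
                   ("an older, rarely requested build", 0.02)]:
    print("%-35s hit rate ~%.2f%%" % (label, 100 * cache_hit_rate(rps)))

Under those assumptions the popular 17.0.1 files are served from cache roughly 99.99% of the time, so nearly every request hit the caching bug, while a rarely requested file almost never hits the cache and keeps working - which matches 17.0 being hit hardest and lesser-used versions being less affected.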
(In reply to juan becerra [:juanb] from comment #8)
> One quick run on my XP machine with a fresh profile and app.update.log enabled shows some odd messages:
>
> AUS:SVC Checker:getUpdateURL - update URL: https://aus3.mozilla.org/update/3/Firefox/16.0.2/20121024073032/WINNT_x86-msvc/en-US/release/Windows_NT%205.1.3.0%20(x86)/default/default/update.xml
> AUS:SVC gCanCheckForUpdates - able to check for updates
> AUS:SVC Checker:checkForUpdates - sending request to: https://aus3.mozilla.org/update/3/Firefox/16.0.2/20121024073032/WINNT_x86-msvc/en-US/release/Windows_NT%205.1.3.0%20(x86)/default/default/update.xml
> UTM:SVC TimerManager:notify - notified @mozilla.org/updates/update-service;1
> AUS:SVC Checker:onError - request.status: 2153390067
> AUS:SVC getStatusTextFromCode - transfer error: Update XML file malformed (200), default code: 200
> AUS:SVC UpdateService:notify:listener - error during background update: Update XML file malformed (200)
>
> I'll keep the machine running to see if there's more.

Hey Juan, just to confirm - the update did not complete successfully?
(In reply to Alex Keybl [:akeybl] from comment #14)
> Hey Juan, just to confirm - the update did not complete successfully?

It did not complete successfully.
(In reply to Ben Hearsum [:bhearsum] from comment #9)
> Juan - can you grab updates.xml, active-update.xml, and any other XML files after hitting that?

I wasn't able to locate these files.
juanb's going to reach out to rstrong to take a look at his unsuccessful update, given the fact that it's reproducible.
It turned out Fiddler was interfering and that caused the problem. So we can invalidate comment #8 as evidence in this investigation.
This has been verified to be fixed in bug 818038 comment 4 and watching the data over the last couple of days has shown that our downloads are getting the expected spikes (while 16.0.2 users are dropping accordingly).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
I've been asked to provide details about why QA could not have caught this; there are multiple reasons.

1. Our release automation runs within a small 1-2 hour window as soon as snippets are live (this problem did not exist until well after that window).
2. Our release automation only tests manual check-for-updates and fallback updates (it is incapable of testing background updates).
3. When unthrottling happens, QA manually spotchecks a couple of builds and locales to make sure background updates are working. We did not encounter any of these issues during said spotchecks.

The only way QA *might* have seen this is if we were running update tests on release 24/7. Our automation infrastructure is incapable of handling this kind of testing, and it's not a reasonable proposition that we could get to this point any time soon.

I personally think the realistic mitigating strategy here is to have RelMan/Metrics lower the bar for our "red flag" scenario and not give so much weight to the holidays/weekends assumption. Once a red flag is raised, QA and RelEng can investigate.
FF17 Download Post Mortem - 12 Dec 12
Attendees: Alex K, Annie E, Anthony H, Anurag P, Bhavana B, Lukas B

A. What went wrong
IT was under extreme load w/ ff17, and enabled better caching on the CDN side of things, which did not accept the correct headers in the download itself. Ultimately, this was not noticed until approximately 12/4.

B. When did it go wrong?
This happened for 17.0.0, which was released on 11/20 and then unthrottled on 11/30 with the dot release.

C. Why did it go wrong?
There were communication issues, stemming from a fast reaction to an overload problem on the CDN; there may be Nagios alerts on an Apache worker. There is a release-drivers channel that can be used for these notifications. These configuration changes happen as a natural part of operational IT, and there were no red flags to escalate.

D. Who could/should have been watching closer?
Release management could/should have been watching more closely. It is, however, unclear how soon we should have noticed; there was an assumption, given the proximity to the holidays, that this was not anything more than a post-holiday issue.

E. What is there to watch?
Anthony's work might have been an early indicator (he will provide a short writeup of his process, and that will be added to the 818370 bug). The Metrics data warehouse has a day's delay in processing time, and it should be noted that there is an ~5 day delay between download and ADI peaks per release (as expected - this is a user behavior issue that cannot be mitigated). Alex notes that it was an obvious problem only AFTER we unthrottled on the 30th with the 17.0.1 release. Given the varying post-release processes (DoW, whether or not there is a .1 release, as well as unthrottle timing), there is no fixed "on day x" timeframe we can use.

F. How do we identify this type of issue earlier next time?
To Lukas' concern, we have put up the download/release dash, which displays clear, cyclic behaviors; anomalies should be easily noted from here on out. In hindsight, the lack of a download spike early on is completely obvious.
To Annie's concern, to drop our detection and response timeframe to intra-day, release managers need two things:
1. Access to CDN vendor near-real-time monitoring (TODO: Get accounts/login to CDN-vendor supplied dashes)
2. Timely notifications from IT when changes are made that could affect downloads, to provide context for when we should be watching more closely (TODO: Set up light notification protocol with IT)
Having an automated process that periodically (once an hour?) checks the availability of update snippets / performs a download of the mar file in the same way as app update and reports any issues could have caught this.
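A rough sketch of what such a periodic check could look like: fetch the snippet, then start a ranged download of each offered MAR roughly the way the updater would. The AUS URL below is the 16.0.2 one from comment 8, and the pass/fail criteria (well-formed XML, plus the bouncer redirecting ranged requests to the CDN rather than answering them itself) are assumptions drawn from this thread, not a spec:

import sys
import urllib.request
import xml.etree.ElementTree as ET

# The AUS URL from comment 8; a real check would build this per version/locale.
AUS_URL = ("https://aus3.mozilla.org/update/3/Firefox/16.0.2/20121024073032/"
           "WINNT_x86-msvc/en-US/release/"
           "Windows_NT%205.1.3.0%20(x86)/default/default/update.xml")

def check_once():
    problems = []

    # 1. Update check: the snippet must fetch and parse as XML.
    with urllib.request.urlopen(AUS_URL, timeout=30) as resp:
        body = resp.read()
    try:
        root = ET.fromstring(body)
    except ET.ParseError as e:
        return ["update.xml malformed: %s" % e]

    # 2. Download check: ask for the first 64 KB of each offered MAR with a
    #    Range header, roughly the way the background updater resumes a download.
    for patch in root.findall("./update/patch"):
        url = patch.get("URL")
        req = urllib.request.Request(url, headers={"Range": "bytes=0-65535"})
        with urllib.request.urlopen(req, timeout=60) as resp:
            # The bouncer URL should have redirected us to the CDN; serving
            # bytes directly from download.mozilla.org (as in this incident)
            # is treated as a failure.
            if resp.geturl() == url:
                problems.append("%s patch was not redirected off the bouncer" % patch.get("type"))
            elif resp.status not in (200, 206):
                problems.append("%s patch: unexpected HTTP %s" % (patch.get("type"), resp.status))
    return problems

if __name__ == "__main__":
    issues = check_once()
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)

Run hourly from cron, or as a Nagios/Jenkins job alerting on a non-zero exit, something along these lines would have flagged the bouncer answering ranged MAR requests directly instead of redirecting.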
:rstrong, is infrastructure in place to set something like this up?
I wouldn't know, but doubtful.
(In reply to Annie Elliott from comment #23)
> :rstrong, is infrastructure in place to set something like this up?

Not in QA. It would likely take considerable effort from QA and RelEng to get our existing automation to remove any blind spots here. At least initially, we would have to:

1. Write tests which could handle background updates
2. Make considerable investment in new infrastructure to be able to run 24/7 without impacting other tests (Nightly, Aurora, Endurance, l10n) -- would require some IT and A-Team work
3. Make the automation fully automated (remove the requirement for someone in QA to initiate the tests) -- would require some RelEng and A-Team work

Even looking at this at a high level, it's not a trivial amount of work. It would be optimistic to think we could get to this point by the end of next year. I'd like to aim for the lowest-hanging fruit, which to me seems like being more cautious about what existing reporting mechanisms are telling us.
PS: If it would be seen as valuable, I can try to get the wheels in motion on the three items I highlighted above. One of the key players in this would be the A-Team, and they are fairly heavily committed to B2G work in the short term. However, I might be able to get buy-in to get these checked off by the end of next year. I would not hinge a short-term solution on this.
(In reply to Annie Elliott from comment #21)
> A. What went wrong
> IT was under extreme load w/ ff17, and enabled better caching on the CDN side of things, which did not accept the correct headers in the download itself. Ultimately, this was not noticed until approximately 12/4.

Yes to load, no to CDN. The load issue was on download.mozilla.org, aka Bouncer. This is earlier in the chain than the CDN.

> C. Why did it go wrong?
> There were communication issues, stemming from a fast reaction to an overload problem on the CDN; there may be Nagios alerts on an Apache worker. There is a release-drivers channel that can be used for these notifications. These configuration changes happen as a natural part of operational IT, and there were no red flags to escalate.

No to CDN, yes to Apache workers for the Bouncer cluster. Nagios is indeed what prompted us to take action... we have a check that looks for this specific problem (MaxClients). This check alerted and caused us to begin diagnostics.

The problem that affected uptake was caused by the solution to the load issue (namely: load balancer caching). It was not caught in testing after implementation because the testing did not encompass the Range header, which was necessary to trigger the load balancer bug which ultimately caused the uptake problem.

> D. Who could/should have been watching closer?
> Release management could/should have been watching more closely. It is, however, unclear how soon we should have noticed; there was an assumption, given the proximity to the holidays, that this was not anything more than a post-holiday issue.

I believe it should be fairly easy to set up a simple WebQA "Jenkins" automation test to check for this specific failure (bad status code for certain requests). I recommend getting in touch with Stephen Donner about setting this up... I don't think it would be particularly complicated, although that definitely depends on how thorough we want to be.

Similarly, we could have a Nagios check that looks for a proper return code to a specific query. We have this already (we'll notice if download.mozilla.org suddenly starts throwing "500 Internal Server Error" for everything), but it doesn't catch this particular scenario that we hit... it was returning a 206, which is a perfectly legitimate return code under normal circumstances... but not the expected one, given the Range request header and the Location response header.

The primary difference there is who's watching it, and who maintains it. A Jenkins project is more easily extended to encompass other testing scenarios... a Nagios check is generally geared towards a single specific case. Nagios, however, can typically run much more frequently, and is actively monitored by IT... it can page us if it breaks. I recommend we go for both: a simple Nagios check to validate this particular scenario, and a Jenkins project to test this scenario as well as others.

> F. How do we identify this type of issue earlier next time?
> To Lukas' concern, we have put up the download/release dash, which displays clear, cyclic behaviors; anomalies should be easily noted from here on out. In hindsight, the lack of a download spike early on is completely obvious.
> To Annie's concern, to drop our detection and response timeframe to intra-day, release managers need two things:
> 1. Access to CDN vendor near-real-time monitoring (TODO: Get accounts/login to CDN-vendor supplied dashes)
> 2. Timely notifications from IT when changes are made that could affect downloads to provide context for when we should be watching more closely (TODO: Set up light notification protocol with IT)

#1 might be doable - I recommend opening a bug in mozilla.org / Server Operations::Web Operations, asking for access to the Akamai CDN portal. Please include a list of everyone that should have access. We'll want to run this by some folks before doing it, but it seems feasible.

PLEASE NOTE: the problem was *not* the CDN. All this will show you is the same metrics you already have... namely a lack of bandwidth spike (mentioned in F, above). We could have noticed the bad status codes based on logs from download.mozilla.org that Metrics already collects... we just weren't looking for them. Note also that CDN stats tend to lag, too... with Akamai we get partial data within about 15 minutes, and full data after 3-5 hours. There are other methods that would be much faster for this particular scenario.

#2 ... fair enough. I propose a mention of the impending change in an IRC channel of your choice. I'm not sure an email to release-drivers@ is appropriate, as that seems likely to generate more concern and delays than will generally be necessary (it's easy for something to sound scary but actually be very minor), but I could certainly be swayed on this. One problem will be how to enforce this, as it's outside the natural workflow... not insurmountable, but needs some thinking.

I propose a #3: Nagios and Jenkins tests to check for valid behavior from download.mozilla.org and download.cdn.mozilla.net. This is minimal now, but could be expanded. It could easily have caught this within 5 minutes of there being an issue.
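As a concrete sketch of that Nagios check: request a bouncer URL with a Range header, refuse to follow redirects, and exit 0 only if a redirect with a Location header comes back. The product string in the URL reuses the 17.0.1 complete one from comment 10; treating any direct 200/206 from the bouncer as CRITICAL is an assumption based on the behavior described above, and the exact policy would be IT's call:

#!/usr/bin/env python3
"""check_bouncer_redirect - Nagios-style probe for download.mozilla.org.

Sketch only: a ranged request to the bouncer should come back as a
redirect (with a Location header) to the CDN, never as file bytes served
directly by the bouncer itself.
"""
import sys
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2
URL = "https://download.mozilla.org/?product=firefox-17.0.1-complete&os=win&lang=en-US"

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # keep urllib from following the redirect

def main():
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(URL, headers={"Range": "bytes=0-1023"})
    try:
        resp = opener.open(req, timeout=30)
        # No exception means a 2xx (e.g. the incident's 206) came straight
        # back from the bouncer instead of a redirect.
        print("CRITICAL: got HTTP %d from the bouncer, expected a redirect" % resp.status)
        return CRITICAL
    except urllib.error.HTTPError as e:
        if e.code in (301, 302, 303, 307) and e.headers.get("Location"):
            print("OK: HTTP %d redirect to %s" % (e.code, e.headers["Location"]))
            return OK
        print("CRITICAL: HTTP %d without a usable Location header" % e.code)
        return CRITICAL

if __name__ == "__main__":
    sys.exit(main())

Scheduled at a one-to-five-minute interval, a check along these lines matches the estimate above of catching the condition within about 5 minutes.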
I like #3! Could alerts go out to concerned parties in release management?
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #25)
> (In reply to Annie Elliott from comment #23)
> > :rstrong, is infrastructure in place to set something like this up?
>
> Not in QA. It would likely take considerable effort from QA and RelEng to get our existing automation to remove any blind spots here.

You may want to rope in WebQA here. I suspect a good bit of what needs to happen here essentially boils down to "get this thing, see what it says, get this other thing", which roughly describes our usage of Jenkins automated testing. Of course that won't be a complete end-to-end test of product delivery and updating, as it doesn't involve Firefox directly, but it would have been sufficient to notice this issue.

> I'd like to aim for the lowest-hanging fruit, which to me seems like being more cautious about what existing reporting mechanisms are telling us.

Agreed... an end-to-end test is far more complicated, but the low-hanging fruit is easier. We can detect this particular problem (and many like it) with minimal additional infrastructure:

1) Nagios checks (IT)
2) Jenkins automated tests (WebQA)
3) download.mozilla.org log analysis - too many 206's, not enough 302's (Metrics)
4) CDN bandwidth didn't spike when expected
5) uptake didn't spike when expected
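For item 3, the log-side version of that signal can be a very small batch job over the download.mozilla.org access logs. A sketch that assumes Apache-style combined logs, where the three-digit status code follows the quoted request line; the real field positions would need checking against whatever Metrics actually collects:

import re
import sys
from collections import Counter

# In combined-format logs the status code is the first 3-digit field after
# the closing quote of the request line: ... "GET /?product=... HTTP/1.1" 302 ...
STATUS_RE = re.compile(r'" (\d{3}) ')

def status_counts(lines):
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    counts = status_counts(sys.stdin)
    total = sum(counts.values()) or 1
    for code in sorted(counts):
        print("%s  %9d  %5.1f%%" % (code, counts[code], 100.0 * counts[code] / total))

Run over a day of bouncer logs, a healthy distribution should be dominated by 302s; the kind of 206 surge seen in this incident is exactly the "too many 206's, not enough 302's" anomaly to alert on.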
(In reply to Annie Elliott from comment #28) > I like #3! Could alerts go out to concerned parties in release management? Don't know for sure if this is feasible in an *automated* fashion. I do know for Nagios monitoring we're getting quite organized, and it's very feasible to have an alert troubleshooting step of "email <someone> about the problem". I'm fairly sure Jenkins makes it possible to send an email or other notification on test failure, but I wouldn't want to speak for WebQA. CC'ing Stephen Donner for confirmation.
As per comments in email, future post mortems need to have representation by a wider diversity of groups.
(In reply to Jake Maul [:jakem] from comment #30)
> (In reply to Annie Elliott from comment #28)
> > I like #3! Could alerts go out to concerned parties in release management?
>
> Don't know for sure if this is feasible in an *automated* fashion.
>
> I do know for Nagios monitoring we're getting quite organized, and it's very feasible to have an alert troubleshooting step of "email <someone> about the problem".
>
> I'm fairly sure Jenkins makes it possible to send an email or other notification on test failure, but I wouldn't want to speak for WebQA. CC'ing Stephen Donner for confirmation.

+1

Regarding end-to-end testing, I think that is only necessary for releases, which Mozmill is taking care of at the moment, and we have thousands of checks in the tests that run in the tree for the client code. I think what would be best to focus on here is network / server verification.
(In reply to Jake Maul [:jakem] from comment #29)
> 3) download.mozilla.org log analysis - too many 206's, not enough 302's (Metrics)

This is an operational (real-time / near-real-time) monitoring need, not metrics. We are staffed precisely for a business-day SLA, for business purposes, not for operational metrics. This would be no faster than waiting for download counts.
REVISED POST-MORTEM WRITE-UP

A. What went wrong
The Bouncer cluster (managed by IT) was under extreme load w/ ff17 on download.mozilla.org. This is earlier in the chain than the CDN. IT enabled better load balancer caching to alleviate this, which did not accept the correct headers in the download itself. Ultimately, this was not noticed until approximately 12/4.

B. When did it go wrong?
This happened for 17.0.0, which was released on 11/20 and then unthrottled on 11/30 with the dot release.

C. Why did it go wrong?
There were communication issues, stemming from a fast reaction to an overload problem on the Bouncer cluster; there are/were Nagios alerts on an Apache worker. The specific problem was with MaxClients. These configuration changes happen as a natural part of operational IT, and there were no red flags to escalate further; it was not caught in testing after implementation because the testing did not encompass the Range header, which was necessary to trigger the load balancer bug which ultimately caused the uptake problem.

D. Who could/should have been watching closer?
Release management could/should have been watching more closely. It is, however, unclear how soon we should have noticed; there was an assumption, given the proximity to the holidays, that this was not anything more than a post-holiday issue.

E. What is there to watch?
Anthony's work could not have been an early indicator, as spot-checks were fine, and the automated test window is very small, occurring well before the deluge that prompted the config change. The Metrics data warehouse has a day's delay in processing time, and it should be noted that there is an ~5 day delay between download and ADI peaks per release (as expected - this is a user behavior issue that cannot be mitigated). Alex notes that it was an obvious problem only AFTER we unthrottled on the 30th with the 17.0.1 release. Given the varying post-release processes (DoW, whether or not there is a .1 release, as well as unthrottle timing), there is no fixed "on day x" timeframe we can use.

F. How could we identify this type of issue earlier next time?
To Lukas' concern, we have put up the download/release dash, which displays clear, cyclic behaviors; anomalies should be easily noted from here on out. In hindsight, the lack of a download spike early on is completely obvious.

To Annie's concern, to drop our detection and response timeframe to intra-day, release managers need two things:

1. Access to CDN vendor near-real-time monitoring (TODO: Get accounts/login to CDN-vendor supplied dashes)

Jake Maul notes the following for this: I recommend opening a bug in mozilla.org / Server Operations::Web Operations, asking for access to the Akamai CDN portal. Please include a list of everyone that should have access. We'll want to run this by some folks before doing it, but it seems feasible. PLEASE NOTE: the problem was *not* the CDN. All this will show you is the same metrics you already have... namely a lack of bandwidth spike (mentioned in F, above). We could have noticed the bad status codes based on logs from download.mozilla.org that Metrics already collects... we just weren't looking for them. Note also that CDN stats tend to lag, too... with Akamai we get partial data within about 15 minutes, and full data after 3-5 hours. There are other methods that would be much faster for this particular scenario.

2. Timely notifications from IT when changes are made that could affect downloads, to provide context for when we should be watching more closely (TODO: Set up light notification protocol with IT)

Jake Maul notes the following for this: I propose a mention of the impending change in an IRC channel of your choice. I'm not sure an email to release-drivers@ is appropriate, as that seems likely to generate more concern and delays than will generally be necessary (it's easy for something to sound scary but actually be very minor), but I could certainly be swayed on this. One problem will be how to enforce this, as it's outside the natural workflow... not insurmountable, but needs some thinking.

To which Annie Elliott responds: IRC is hit or miss, and is highly dependent upon the right people looking at a scrolling stream of information that may or may not persist, depending upon how folks have their IRC clients set up, and what behavioral patterns around looking for stale information in feeds are. An alternative proposal would be to put up a simple, internal web page logging date/time and a simple description of changes. This passes the persistence test, and cleans up the flood of push notifications. Folks who need to monitor this then need to habitualise monitoring it. (TODO: Pick a communication process)

To Rob Strong's concern, having an automated process that periodically (once an hour?) checks the availability of update snippets / performs a download of the mar file in the same way as app update, and reports any issues, could have caught this. The best existing infrastructure for this may lie within WebQA's Jenkins framework. (TODO: Talk to Stephen Donner)

Jake Maul believes that we are at a point with Nagios where it is feasible to have an alert troubleshooting step of "email <someone> about the problem." Further, he believes that either this or a Jenkins alert could have caught this condition in 5 minutes. Rob Strong further proposes that the Nagios check could look for a proper return code to a specific query. We have this already (we'll notice if download.mozilla.org suddenly starts throwing "500 Internal Server Error" for everything), but it doesn't catch this particular scenario that we hit... it was returning a 206, which is a perfectly legitimate return code under normal circumstances... but not the expected one, given the Range request header and the Location response header. (TODO: Set up Nagios check)

It should be noted that several people are concerned about the group ("the who") that is tasked with doing the watching, as well as the extent of the net cast to do forensics on this release. Comments have suggested including the following actors:
- RelEng
- IT
- Incident Management teams (Not sure if these exist here yet. They are generally rotating, multidisciplinary on-call response teams.)
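On the internal change-log page idea, the mechanism itself can be tiny; the sketch below appends timestamped notices to a page served by any internal web server. The path, the HTML-fragment format, and the command-line interface are all hypothetical:

#!/usr/bin/env python3
"""Append a timestamped change notice to a shared internal page (sketch)."""
import datetime
import html
import sys

LOG_PATH = "/data/www/infra-changes/index.html"  # hypothetical location

def log_change(description, author):
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M UTC")
    entry = "<li><b>%s</b> (%s): %s</li>\n" % (
        stamp, html.escape(author), html.escape(description))
    with open(LOG_PATH, "a") as page:
        page.write(entry)

if __name__ == "__main__":
    # e.g.: log-change.py "Enabled load balancer caching for download.mozilla.org" jakem
    log_change(sys.argv[1], sys.argv[2])

Anything persistent and timestamped would do; the point is that the notice outlives an IRC scrollback and gives release managers the context for when to watch the dashboards more closely.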
Product: mozilla.org → Release Engineering