Closed
Bug 904259
Opened 11 years ago
Closed 11 years ago
Get http uptime statistics for application update server
Categories
(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)
Infrastructure & Operations Graveyard
WebOps: Product Delivery
x86_64
Windows 8
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 904710
People
(Reporter: taras.mozilla, Unassigned)
Details
Application update service is critical to making sure our users have an up-to-date version of Firefox. According to a discussion we had today, the service was down for some non-trivial amount of time: https://www.yammer.com/mozilla.com/threads/316211652?m=1735801253&message_id=316604157&nid=1531

I would like to see stats on past availability (last 6 months) of the application update servers that aus3.mozilla.org resolves to. In the future this should be monitored on status.mozilla.com.
Comment 1•11 years ago
It was down today for a scheduled maintenance: bug 889688. This was approved by the IT Change Advisory Board, which includes (among other people) Release Engineering. IT's on-call was aware as well (living in #it).

There was not a maintenance notification email sent on this, because it only affects people manually triggering an update... automatic updates simply retry later, so there's no user-visible impact. We're still working out where the line is on what does and does not deserve a maintenance notification, so perhaps this should have been on the other side of the line. We don't want to send so many (especially to the whole company) that people become desensitized to them... there's a legitimate fear that we've already passed that point. To that end we're working on a better notification system.

I'll reply separately about the monitoring aspect. That's worthy of a dedicated comment, I feel. :)
Comment 2•11 years ago
Worth mentioning explicitly, there *was* an unforeseen problem with this maintenance that resulted in an unexpected outage. So yes, it was down for far longer than it should have been. :(
Comment 3•11 years ago
As for monitoring, unfortunately we can't actually provide past availability data. We don't have it. The best we could provide would be bits and pieces of the whole picture.

This particular service returns a "200 OK" even when things are not okay. Metrics may be able to dig up access logs going back that far, but they won't help for availability monitoring. Because of this design flaw we can't easily do much with normal HTTP monitoring tools... unless the network is down, the service is going to look good. We do have basic server-level ping/load/disk monitoring for that situation.

Due to the nature of the app, it's not easy for us to know *what* to monitor. Unlike other sites we can't simply check the homepage... for a sensible response we have to check a specific URL. Unfortunately the URL changes every 6 weeks. We could monitor *something*, but it wouldn't usually be the latest thing. And again, if it broke, we wouldn't necessarily know. :(

The right solution here is to build a "status" page into the app that returns a 4xx or 5xx error code when something is broken, and a 200 otherwise. This page would trigger various internal health checks so it could return a sane result.

The bad news is, as mentioned in that Yammer thread, there is no active development work on the current version of AUS. So there's nobody allocated and up-to-speed who could implement this. It could potentially be built into the new AUS4 (aka: balrog), which will soon be in use for Nightly, but obviously that doesn't help right now. In the interim, we *can* set up some monitoring... but don't be fooled: it will not be very good. I would not rely on it for an accurate picture of availability.

As a broader idea, we should probably spend some time making sure the whole product delivery stack is both monitor-able and actually monitored. Many pieces are, but some (like this one) are not so good. This will be fruitless without commitments from the responsible teams (primarily IT, RelEng, and WebDev) to help out, and I don't know how quickly we can feasibly audit the whole thing.
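The "status" page idea above could work roughly like this: a minimal Python sketch, assuming hypothetical check callables standing in for the app's real internal health checks (nothing here is the actual AUS code).

```python
# Hypothetical /status page logic: run every internal health check and
# collapse the results into a single HTTP status, so ordinary HTTP
# monitoring tools can finally tell "up" from "broken".
def status_page(checks):
    """Run each named check; 200 OK if all pass, 500 listing failures."""
    failures = [name for name, check in checks.items() if not check()]
    if failures:
        return 500, "FAIL: " + ", ".join(sorted(failures))
    return 200, "OK"
```

A monitoring tool would then only need to poll one URL and alert on any non-200 response, instead of guessing which release-specific update URL is still live.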
Comment 4•11 years ago
Just to be clear, this bug is specifically about aus3.mozilla.org. When I say "Product Delivery", I'm referring to the whole pipeline... www.mozilla.org, download.mozilla.org, AUS (2, 3, and soon 4), ftp.mozilla.org, and the CDNs.

The flow for an update is:
1. Firefox hits aus3.mozilla.org with a particular URL
2. AUS responds... if there's an update, it redirects to download.mozilla.org (bouncer)
3. bouncer chooses a mirror - a CDN - and redirects to it
4. the chosen CDN caches and serves the actual files, using ftp.mozilla.org as the origin

For an installer, the flow is:
1. user visits www.mozilla.org and clicks the download button, which goes to bouncer
2. bouncer chooses a mirror - a CDN - and redirects to it
3. the chosen CDN caches and serves the actual files, using ftp.mozilla.org as the origin
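The hop sequence above could be verified with a small redirect-chain walker. This is only an illustrative sketch, not Mozilla tooling: the fetch function is pluggable (so it can be exercised without network access), and the example URLs below are hypothetical stand-ins for the real ones.

```python
# Sketch: walk a redirect chain and record which hosts were involved,
# so a probe can confirm the aus3 -> bouncer -> CDN path is intact.
from urllib.parse import urlparse

def follow_chain(url, fetch, max_hops=5):
    """Return the hostnames visited until a non-redirect response.

    `fetch(url)` must return (status_code, location_or_None)."""
    hops = [urlparse(url).hostname]
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in (301, 302) or location is None:
            return hops
        url = location
        hops.append(urlparse(url).hostname)
    raise RuntimeError("too many redirects")
```

With a real fetch function (e.g. one doing HEAD requests with redirects disabled), a monitor could assert the chain still passes through download.mozilla.org before reaching a CDN.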
Reporter
Comment 5•11 years ago
16:12 < jakem> taras: howdy
16:12 < jakem> I just commented on your bug about aus uptime stats
16:12 < jakem> and about to comment again :)
16:37 < taras> jakem: hi
16:37 < taras> jakem: to be clear
16:37 < taras> i'm not concerned about this specific outage
16:38 < taras> i'd like to know how often the update servers are unavailable
16:39 < jakem> sadly, there's not much we can do to answer that :(
16:39 < jakem> the AUS app returns a 200 OK even when things are not ok
16:39 < taras> so there are no past logs to analyze?
16:39 < jakem> there might be some access logs, but they won't be of much value
16:39 < taras> that's shitty
16:39 < jakem> yep
16:39 < jakem> code has no active maintenance
16:39 < taras> so is that what it was returning today?
16:39 < jakem> they're rewriting it from the ground up
16:39 < jakem> yeah
16:40 < jakem> it was returning a 200 OK with incomplete XML, apparently
16:40 < taras> fantastic
16:40 < jakem> the new version is *much* better, operationally
16:40 < taras> can you remind me of the bug # for the new one?
16:40 < jakem> let me find it.....
16:40 < jakem> bhearsum is the guy working on it
16:40 < taras> jakem: so to be absolutely clear
16:40 < taras> we have http-level monitoring
16:40 < taras> but it's useless cos app sucks?
16:41 < taras> so we know for sure atleast http is highly available?
16:44 < jakem> we have access logs, that get shipped to metrics hourly (dunno how long they're preserved for) ... which are good for hit-rate analysis but not availability
16:44 < jakem> we have server-level ping logs from nagios
16:44 < jakem> and we have basic "does this return a valid HTTP response" monitoring for the purpose of yanking dead servers out of the load balancer config
16:45 < jakem> so we can tell, for example, when Apache stops responding, or a node goes entirely offline
16:45 < jakem> but if they start going haywire and returning bad data... that's what we can't tell
16:47 < jakem> so... yeah, TL;DR: it sucks
16:47 < taras> ok
16:47 < taras> i'll talk to my people
16:47 < jakem> okie dokie... lmk if there's anything we can do
16:47 < taras> see how we want to get out of this
16:48 < jakem> I'd like to someday set up an audit of the whole product delivery stack from a monitoring perspective
16:48 < taras> this is probably the most critical ff service
16:48 < jakem> when you say "your people" and "get out of this"... what's the situation?
16:48 < taras> i'm sad that we dont strongly own it
16:48 < jakem> nah, I wouldn't say that
16:48 < jakem> product delivery as a whole is for sure
16:49 < jakem> but... aus is only used for updates... surely delivering the original installer is more important
16:49 < taras> nope
16:49 < taras> it's important
16:49 < taras> but it's as important to keep those users up to date
16:49 < taras> any risk we introduce in not updating users
16:49 < taras> may contribute to our not-optimal upgrading rates
Comment 6•11 years ago
(In reply to Jake Maul [:jakem] from comment #3)
> As for monitoring, unfortunately we can't actually provide past availability
> data. We don't have it. The best we could provide would be bits and pieces
> of the whole picture.
>
> This particular service returns a "200 OK" even when things are not okay.

This isn't quite accurate. It returns 200 OK when it believes that the request doesn't need to be updated. This is by design and will not be changing.

If the datastore becomes unavailable, the app can't find any updates, and will return an empty 200 OK for all responses. This specific thing could be considered a limitation of the current AUS software. For this, something like checking whether 10.8.74.14:/aus2 is mounted would be a decent proxy for "has data". We won't have this problem when Balrog is in production - if its datastore goes offline, the server will bubble up 500 ISEs.

If the app has internal issues (eg, an error or exception bubbles up) it will return a 500 ISE. We've seen this when we've had syntax errors in the config file, for example.

> Due to the nature of the app, it's not easy for us to know *what* to
> monitor. Unlike other sites we can't simply check the homepage... for a
> sensible response we have to check a specific URL. Unfortunately the URL
> changes every 6 weeks. We could monitor *something*, but it wouldn't usually
> be the latest thing. And again, if it broke, we wouldn't necessarily know. :(

I want to emphasize that this is also by design - updates are very fluid. We ship at least every 6 weeks, adjust throttling multiple times per cycle, and sometimes make other tweaks. However, through all of this you can still build an update URL that should always return a non-empty update. For example:

https://aus3.mozilla.org/update/3/Firefox/16.0.2/20121024073032/WINNT_x86-msvc/de/release/default/default/default/update.xml?force=1

The above URL simulates a Windows 16.0.2 "de" build asking for an update. It doesn't include specific OS information, to avoid getting rejected due to an unsupported OS. The ?force=1 on the end bypasses the throttling. Until we stop serving 16.0.2 updates altogether, this URL will always return a non-empty 200 response unless the datastore is offline.

Does that help?
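Since a bare HTTP status check can't distinguish "healthy" from "empty 200" here, a probe against that URL would also have to inspect the response body. A minimal Python sketch of that classification, assuming the healthy case is a well-formed updates document containing at least one update element (the function name and the exact health criteria are this sketch's assumptions, not AUS internals):

```python
# Hypothetical availability check for an AUS response: only a 200 whose
# XML parses and actually carries an <update> counts as healthy. This
# catches both the "empty 200" (datastore gone) and the "200 with
# incomplete XML" failure modes mentioned in this bug.
import xml.etree.ElementTree as ET

def aus_response_healthy(status_code, body):
    """Return True only for a 200 response whose XML carries an update."""
    if status_code != 200:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # truncated or malformed XML
    return root.tag == "updates" and root.find("update") is not None
```

A cron job or Nagios plugin could fetch the pinned update URL, feed the result through this check, and alert on False, which would give a much truer availability signal than status codes alone.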
Comment 7•11 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #6)
> (In reply to Jake Maul [:jakem] from comment #3)
> > This particular service returns a "200 OK" even when things are not okay.
>
> This isn't quite accurate. It returns 200 OK when it believes that the
> request doesn't need to be updated.

Er, "the application making the request".
Reporter
Updated•11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Updated•8 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard