Closed Bug 904259 Opened 11 years ago Closed 11 years ago

Get http uptime statistics for application update server

Categories: Infrastructure & Operations Graveyard :: WebOps: Product Delivery
Type: task
Platform: x86_64 Windows 8
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 904710

People

(Reporter: taras.mozilla, Unassigned)

Details

Application update service is critical to making sure our users have an up to date version of Firefox. 

According to a discussion we had today the service was down for some non-trivial amount of time: https://www.yammer.com/mozilla.com/threads/316211652?m=1735801253&message_id=316604157&nid=1531

I would like to see stats on past availability (last 6 months) of our application update servers that aus3.mozilla.org resolves to. In the future this should be monitored on status.mozilla.com.

It was down today for a scheduled maintenance: bug 889688. This was approved by the IT Change Advisory Board, which includes (among other people) Release Engineering. IT's on-call was aware as well (living in #it).

No maintenance notification email was sent for this, because it only affects people manually triggering an update... automatic updates simply retry later, so there's no user-visible impact. We're still working out where the line is on what does and does not deserve a maintenance notification, so perhaps this should have been on the other side of the line. We don't want to send so many (especially to the whole company) that people become desensitized to them... there's a legitimate fear that we've already passed that point. To that end we're working on a better notification system.

I'll reply separately about the monitoring aspect. That's worthy of a dedicated comment, I feel. :)

Worth mentioning explicitly, there *was* an unforeseen problem with this maintenance that resulted in an unexpected outage. So yes, it was down for far longer than it should have been. :(
As for monitoring, unfortunately we can't actually provide past availability data. We don't have it. The best we could provide would be bits and pieces of the whole picture.

This particular service returns a "200 OK" even when things are not okay. Metrics may be able to dig up access logs going back that far, but they won't help for availability monitoring. Because of this design flaw we can't easily do much with normal HTTP monitoring tools... unless the network is down, the service is going to look good. We do have basic server-level ping/load/disk monitoring for that situation.
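
To spell out what a "normal HTTP monitoring tool" buys us here, a minimal Python sketch (the URL in the usage note is a placeholder): it confirms a 200 came back and says nothing about whether the XML body is usable.

# Sketch of a plain status-code probe, i.e. what generic HTTP monitoring amounts to.
# It stays "green" as long as the app answers 200, even if the XML body is empty or broken.
import urllib.error
import urllib.request

def http_looks_ok(url, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200   # says nothing about the response body
    except urllib.error.URLError:
        return False                    # only network/HTTP-level failures register here

# Usage (placeholder URL - any AUS update URL would do):
# http_looks_ok("https://aus3.mozilla.org/update/3/.../update.xml")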

Due to the nature of the app, it's not easy for us to know *what* to monitor. Unlike other sites we can't simply check the homepage... for a sensible response we have to check a specific URL. Unfortunately the URL changes every 6 weeks. We could monitor *something*, but it wouldn't usually be the latest thing. And again, if it broke, we wouldn't necessarily know. :(


The right solution here is to build a "status" page into the app that returns a 4xx or 5xx error code when something is broken, and a 200 otherwise. This page would trigger various internal health checks so it could return a sane result.
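
For illustration only, a minimal sketch of the shape of such a page (Python WSGI, standard library) - this is not AUS code, and the /__status__ path and the datastore check inside it are invented stand-ins for whatever internal health checks the app would actually run.

# Hypothetical sketch of a status page - not actual AUS code.
# The endpoint path and the internal check are invented for illustration.
from wsgiref.simple_server import make_server

def datastore_readable():
    # Stand-in for a real internal health check, e.g. "can we read update data?"
    try:
        with open("/var/aus2/README", "rb"):   # hypothetical path
            return True
    except OSError:
        return False

def status_app(environ, start_response):
    if environ.get("PATH_INFO") != "/__status__":        # hypothetical endpoint
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    healthy = datastore_readable()                        # run all internal checks here
    status = "200 OK" if healthy else "503 Service Unavailable"
    body = b"OK\n" if healthy else b"FAIL: datastore unreadable\n"
    start_response(status, [("Content-Type", "text/plain")])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, status_app).serve_forever()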

The bad news is, as mentioned in that Yammer thread, there is no active development work on the current version of AUS. So there's nobody allocated and up-to-speed who could implement this. It could potentially be built into the new AUS4 (aka: balrog), which will soon be in use for Nightly, but obviously that doesn't help right now.

In the interim, we *can* set up some monitoring... but don't be fooled, it will not be very good. I would not rely on it to be an accurate picture of availability.


As a broader idea, we should probably spend some time making sure the whole product delivery stack is both monitor-able and actually monitored. Many pieces are, but some (like this one) are not so good. This is fruitless without commitments from responsible teams (primarily IT, RelEng, and WebDev) to help out, and I don't know how quickly we can feasibly audit the whole thing.

Just to be clear, this bug is specifically about aus3.mozilla.org. When I say "Product Delivery", I'm referring to the whole pipeline... www.mozilla.org, download.mozilla.org, AUS (2, 3, and soon 4), ftp.mozilla.org, and the CDNs.

The flow for an update is (there's a rough sketch of walking these redirect chains after the installer flow below):

Firefox hits aus3.mozilla.org with a particular URL
AUS responds... if there's an update, it redirects to download.mozilla.org (bouncer)
bouncer chooses a mirror - a CDN - and redirects to it
the chosen CDN caches and serves the actual files, using ftp.mozilla.org as the origin

For an installer, the flow is:

user visits www.mozilla.org, clicks download button, which goes to bouncer
bouncer chooses a mirror - a CDN - and redirects to it
the chosen CDN caches and serves the actual files, using ftp.mozilla.org as the origin
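
For the record, here's a rough way to watch that chain hop-by-hop from the outside - a Python sketch using only the standard library. The example URL and query parameters at the bottom are illustrative guesses at a valid bouncer request, not an authoritative description of its API.

# Sketch: walk the bouncer -> CDN redirect chain one hop at a time,
# printing each status code and URL instead of letting urllib follow it.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None   # tell urllib not to follow redirects automatically

def trace(url, max_hops=5):
    opener = urllib.request.build_opener(NoRedirect)
    for _ in range(max_hops):
        try:
            resp = opener.open(url, timeout=15)
            code, location = resp.status, resp.headers.get("Location")
        except urllib.error.HTTPError as e:
            code, location = e.code, e.headers.get("Location")   # 3xx responses surface here
        print(code, url)
        if not location:
            break
        url = location

# Example (query parameters are a guess at a valid bouncer request):
# trace("https://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US")
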
16:12 < jakem> taras: howdy
16:12 < jakem> I just commented on your bug about aus uptime stats
16:12 < jakem> and about to comment again :)
16:37 < taras> jakem: hi
16:37 < taras> jakem: to be clear
16:37 < taras> i'm not concerned about this specific outage
16:38 < taras> i'd like to know how often the update servers are unavailable
16:39 < jakem> sadly, there's not much we can do to answer that :(
16:39 < jakem> the AUS app returns a 200 OK even when things are not ok
16:39 < taras> so there are no past logs to analyze?
16:39 < jakem> there might be some access logs, but they won't be of much value
16:39 < taras> that's shitty
16:39 < jakem> yep
16:39 < jakem> code has no active maintenance
16:39 < taras> so is that what it was returning today?
16:39 < jakem> they're rewriting it from the ground up
16:39 < jakem> yeah
16:40 < jakem> it was returning a 200 OK with incomplete XML, apparently
16:40 < taras> fantastic
16:40 < jakem> the new version is *much* better, operationally
16:40 < taras> can you remind me of the bug # for the new one?
16:40 < jakem> let me find it.....
16:40 < jakem> bhearsum is the guy working on it
16:40 < taras> jakem: so to be absolutely clear
16:40 < taras> we have http-level monitoring
16:40 < taras> but it's useless cos app sucks?
16:41 < taras> so we know for sure atleast http is highly available?
16:44 < jakem> we have access logs, that get shipped to metrics hourly (dunno how long they're preserved for) ... which are good for hit-rate analysis but not availability
16:44 < jakem> we have server-level ping logs from nagios
16:44 < jakem> and we have basic "does this return a valid HTTP response" monitoring for the purpose of yanking dead servers out of the load balancer config
16:45 < jakem> so we can tell, for example, when Apache stops responding, or a node goes entirely offline
16:45 < jakem> but if they start going haywire and returning bad data... that's what we can't tell
16:47 < jakem> so... yeah, TL;DR: it sucks
16:47 < taras> ok
16:47 < taras> i'll talk to my people
16:47 < jakem> okie dokie... lmk if there's anything we can do
16:47 < taras> see how we want to get out of this
16:48 < jakem> I'd like to someday set up an audit of the whole product delivery stack from a monitoring perspective
16:48 < taras> this is probably the most critical ff service
16:48 < jakem> when you say "your people" and "get out of this"... what's the situation?
16:48 < taras> i'm sad that we dont strongly own it
16:48 < jakem> nah, I wouldn't say that
16:48 < jakem> product delivery as a whole is for sure
16:49 < jakem> but... aus is only used for updates... surely delivering the original installer is more important
16:49 < taras> nope
16:49 < taras> it's important
16:49 < taras> but it's as important to keep those users up to date
16:49 < taras> any risk we introduce in not updating users
16:49 < taras> may contribute to our not-optimal upgrading rates

(In reply to Jake Maul [:jakem] from comment #3)
> As for monitoring, unfortunately we can't actually provide past availability
> data. We don't have it. The best we could provide would be bits and pieces
> of the whole picture.
> 
> This particular service returns a "200 OK" even when things are not okay.

This isn't quite accurate. It returns 200 OK when it believes that the request doesn't need to be updated. This is by design and will not be changing.

If the datastore becomes unavailable, the app can't find any updates, and will return an empty 200 OK for all responses. This specific thing could be considered a limitation of the current AUS software. For this, something like checking whether 10.8.74.14:/aus2 is mounted would be a decent proxy for "has data". We won't have this problem when Balrog is in production - if its datastore goes offline, the server will bubble up 500 ISEs.
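
For what it's worth, that kind of "is the datastore mounted" proxy could be as small as the sketch below (Python, reading /proc/mounts on the web node; the local mount point is an assumption about where 10.8.74.14:/aus2 would appear, not the real path):

# Sketch: crude "has data" proxy - is the NFS share mounted where we expect?
# The local mount point below is an assumption, not the real path on the AUS nodes.
EXPECTED_MOUNT = "/mnt/aus2"   # hypothetical mount point for 10.8.74.14:/aus2

def datastore_mounted(mount_point=EXPECTED_MOUNT):
    with open("/proc/mounts") as mounts:
        return any(line.split()[1] == mount_point for line in mounts)

if __name__ == "__main__":
    # Nagios plugin convention: exit 0 = OK, 2 = CRITICAL
    raise SystemExit(0 if datastore_mounted() else 2)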

If the app has internal issues (e.g., an error or exception bubbles up) it will return a 500 ISE. We've seen this when we've had syntax errors in the config file, for example.


> Due to the nature of the app, it's not easy for us to know *what* to
> monitor. Unlike other sites we can't simply check the homepage... for a
> sensible response we have to check a specific URL. Unfortunately the URL
> changes every 6 weeks. We could monitor *something*, but it wouldn't usually
> be the latest thing. And again, if it broke, we wouldn't necessarily know. :(

I want to emphasize that this is also by design - updates are very fluid. We ship at least every 6 weeks, adjust throttling multiple times per cycle, and sometimes make other tweaks. However, through all of this you can still build an update URL that should always return a non-empty update. For example:
https://aus3.mozilla.org/update/3/Firefox/16.0.2/20121024073032/WINNT_x86-msvc/de/release/default/default/default/update.xml?force=1

The above URL simulates a Windows 16.0.2 "de" build asking for an update. It doesn't include specific OS information, to avoid getting rejected due to an unsupported OS. The ?force=1 on the end bypasses the throttling. Until we stop serving 16.0.2 updates altogether, this URL will always return a non-empty 200 response unless the datastore is offline. Does that help?
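
A check built around that kind of canary URL could look roughly like the sketch below (Python, standard library only). It treats "non-empty" as "the response parses and contains at least one <update> element", which is my reading of the XML format and worth double-checking:

# Sketch: probe the canary update URL and treat an empty or unparseable
# response as a failure. Assumes a good response contains an <update> element.
import urllib.request
import xml.etree.ElementTree as ET

CANARY_URL = ("https://aus3.mozilla.org/update/3/Firefox/16.0.2/20121024073032/"
              "WINNT_x86-msvc/de/release/default/default/default/update.xml?force=1")

def update_served(url=CANARY_URL, timeout=15):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            root = ET.fromstring(resp.read())
    except (OSError, ET.ParseError):
        return False   # network error, HTTP error, or truncated/invalid XML
    return root.find("update") is not None   # an empty <updates/> means nothing was served

if __name__ == "__main__":
    # Nagios plugin convention: exit 0 = OK, 2 = CRITICAL
    raise SystemExit(0 if update_served() else 2)
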
(In reply to Ben Hearsum [:bhearsum] from comment #6)
> (In reply to Jake Maul [:jakem] from comment #3)
> > As for monitoring, unfortunately we can't actually provide past availability
> > data. We don't have it. The best we could provide would be bits and pieces
> > of the whole picture.
> > 
> > This particular service returns a "200 OK" even when things are not okay.
> 
> This isn't quite accurate. It returns 200 OK when it believes that the
> request doesn't need to be updated.

Er, "the application making the request".
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard