Closed Bug 804347 Opened 7 years ago Closed 7 years ago

Sentry/Bouncer location availability improvements and reporting

Categories

(Webtools :: Bouncer, defect)

defect
Not set

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: nmaul, Assigned: brandon)

References

()

Details

(Whiteboard: [bouncer-improvement])

This is the bug for the work discussed earlier today.

To summarize the situation:

We have a handful of mirrors in Bouncer, some of which are CDN accounts and are "nearly" always available. Sentry however, is not, and sometimes will be unable to reach a mirror or file on a mirror, and decide that mirror (or that file) must be unavailable, and deactivate it.

In the situation where Sentry has some sort of problem reaching external hosts, it can wreck havoc by deciding that all of the CDN mirrors are unavailable, leaving only our internal FTP cluster mirrors... which are incapable of handling the full traffic load.

The proposed solution is 2-pronged:

1) Disable the "FTP Cluster - non-SSL" mirror in bouncer... we *know* this can't handle all the traffic, and is believed to exist only for fallback purposes anyway. A fallback that doesn't actually work (and in fact causes additional problems if used) is not useful.

2) Re-work the logic for failed location checks. Currently, a location/mirror combo being "active" implies that the most recent health check succeeded. A location/mirror combo being "not active" implies that the most recent health check failed.

The additional logic should be that if *all* mirrors for a given location are "not active", then we should attempt to proceed with the standard weighting decision we would have made if all mirrors were "active".

In other words, if Sentry thinks every mirror for a particular location is down, Bouncer should assume Sentry itself is broken and that in fact all the mirrors for that location are actually *up*, and proceed accordingly.

Note that this should not override the "active" column in the mirrors table... if we mark a mirror (table.column: mirror_mirrors.active) as inactive, it should not be eligible to receive traffic, even if Sentry believes none of the active mirrors have the file.

This should also not override the SSL-Only logic. If a product is marked as being SSL only, but Sentry believes no such mirrors have the file, it should still 404 as it currently does.


Finally, if we do hit such a situation where a location has no "active" mirrors, someone should be notified of it. Ideally, this would come in via Nagios checks. We can handle this as a separate item. I believe some of the code is already present and may be re-usable. It's already possible to get a list of all eligible mirrors for a given product:

https://bounceradmin.mozilla.com/stats/locations/?p=2046

Perhaps we can build on that and have a much simpler page that shows all "locations" that have no usable mirrors. Then we could simply have Nagios check that status page and alert if it's wrong.



In pseudo-code + pseudo-SQL, the main feature would look roughly like this:

user requests a product/os/locale
bouncer determines what "location" this corresponds to
if "select * from mirror_location_mirror_map where location_id = <location> and active = 1" is EMPTY_SET
then
    round up all the enabled mirrors (mirror_mirrors.active = 1)
    assume they all have this location 
    proceed as normal (process according to normal weighted delivery algorithm)
fi

I believe this logic would best be integrated into download.mozilla.org rather than Sentry or bounceradmin.mozilla.com. However, we could put this check into Sentry if we wanted to... we would need to be careful, though, because checks and updates are done serially. It would be bad to serially mark each location/mirror row as inactive, and then go back a the end to mark them all active. I think a wrapper around the disabling function would do the job. I'm not sure if this is better than reworking download.mozilla.org directly, though.


This could also be easily modified to also send an email when this triggers, but we're afraid that would result in *far* too many emails if/when a problem occurs. Hence the "have a webpage that Nagios can check" idea above. That ties in nicely with existing alerting mechanisms, and would provide a convenient status page.
Whiteboard: [bouncer-improvements] → [bouncer-improvement]
A number of clarifications I want to request before we go forward:

* So to clarify #2, if WE uncheck the active box, we don't want to have Bouncer use it or consider it possibly available in the event of a CDN meltdown. But if Sentry unchecks it, we DO want it to be available. If this is correct, how do you intend for us to know the difference between a manual unset and a Sentry unset?

* If the SSL-only mirrors are disabled by Sentry, do we not wish to also assume that they might be available to our users and continue to attempt to serve files?

I'm going to break this bug into component pieces and block this bug on those parts. It appears there are a number of things that need to happen here:

* The disabling of the FTP mirror in production (IT or Releng)
* The development of code that properly handles the unavailability of mirrors
* The modification of the sentry code to write to whatever columns we need or might create (depending on your feedback)
* The creation of Nagios alerts by IT.
Flags: needinfo?(nmaul)
Good questions!

In reply to Brandon Savage [:brandon] from comment #1)
> A number of clarifications I want to request before we go forward:
> 
> * So to clarify #2, if WE uncheck the active box, we don't want to have
> Bouncer use it or consider it possibly available in the event of a CDN
> meltdown. But if Sentry unchecks it, we DO want it to be available. If this
> is correct, how do you intend for us to know the difference between a manual
> unset and a Sentry unset?

Sentry never un-sets that particular option (mirror_mirrors.active). It only ever plays with mirror_location_mirror_map, and in some rare circumstances will alter the mirror's weight. If it decides it needs to pull an entire mirror, it does so by setting every single location on that mirror to inactive.

mirror_mirrors.active is only ever set manually.

> * If the SSL-only mirrors are disabled by Sentry, do we not wish to also
> assume that they might be available to our users and continue to attempt to
> serve files?

You are right, I misspoke. This new logic should apply to SSL mirrors and SSL-only products as well.

What I was intending to convey was just that we should not attempt to "cross over" from SSL to non-SSL in the event that all SSL location/mirrors combos are active=0 but the same locations *are* available on a non-SSL mirror. The SSL-only flag should "win", and we should attempt to serve from an SSL mirror, always.

Hopefully that's more clear. :)

> I'm going to break this bug into component pieces and block this bug on
> those parts. It appears there are a number of things that need to happen
> here:
> 
> * The disabling of the FTP mirror in production (IT or Releng)

I'll do this... today, hopefully. :)

> * The development of code that properly handles the unavailability of mirrors

To my eyes this could be done in either download.mozilla.org's PHP code (per-request) or Sentry's Perl code (per-availability-check). Either way is fine with me. :)

> * The modification of the sentry code to write to whatever columns we need
> or might create (depending on your feedback)

I think we can get away without needing to add/remove any columns, but simply change how we use the mirror_location_mirror_map.active column.

> * The creation of Nagios alerts by IT.

Definitely.


Let's add an additional bug:

* The creation of a status page that Nagios can check, and SRE's can use to figure out specifically what's wrong. Possibly based on code from the existing https://bounceradmin.mozilla.com/stats/locations/ page, but summarized.
Flags: needinfo?(nmaul)
(In reply to Jake Maul [:jakem] from comment #0)
> We have a handful of mirrors in Bouncer, some of which are CDN accounts and
> are "nearly" always available. Sentry however, is not, and sometimes will be
> unable to reach a mirror or file on a mirror, and decide that mirror (or
> that file) must be unavailable, and deactivate it.

This is s hangover from when we had ftp.m.o with everything and mirrors with an active set of files. Now that we have CDN using ftp as the origin it's fair to rely on that, so long as ftp is up and we don't have any left over excludes on the origin vhost.
1. I'm thinking about an edge case, and I just want to talk through the logic here.

If a file is requested that is not on any of the CDNs, but Sentry thinks at least one CDN is up, what is the desired behavior?

Is this a different desired behavior from the situation where Sentry thinks there are no valid CDNs?

2. What's the logic for the required check for nagios?
Are we looking, for example, for zero valid locations for firefox-latest, or something like that?
 
3. In Socorro we have some nagios checks that directly run SQL.  Should we do that instead of adding a webpage, so as to avoid adding a level of complexity/possible failure?
To answer #1, if the file is unavailable we should 404 regardless of how many CDNs are up. That's our present behavior (if you request something that doesn't exist, or you request something that is unavailable).

This is different than the desired behavior from a situation where Sentry thinks there are no valid CDNs. In the case when Sentry believes there are no valid CDNs, the files are still there and the CDNs may be (are probably) still up, but Sentry doesn't know this. 

There is a possible problem here, though, using the column mirror_location_mirror_map.active. When we create the file, we mark it as active; when Sentry believes it's unavailable, it marks it as inactive. What happens if we ever remove the file from that mirror (I don't even know if we do this)? Sentry would mark it inactive for that mirror, but in the event that Sentry marks ALL mirrors as inactive, we would attempt then to serve a file from a mirror that legitimately doesn't have the file.

Example: Mirror A and B are given file xyz.exe. We remove xyz.exe from Mirror A, and Sentry marks it inactive per design. Sentry then fails and marks xyz.exe inactive on Mirror B; per our present recommendation, we fall back to serving from the inactive mirrors while IT sorts out the problem. Mirror A and Mirror B would both then be used, resulting in 50% of download requests getting a 404.

One possible solution to the above stated problem might be the following: Sentry has a column that it marks as "unhealthy" when it can't detect a file. It also would then log the unhealthy state in a separate table, and if after 24 hours, it's unable to access the file, it marks it as inactive (for that mirror). I believe it is highly unlikely that Sentry would be unable to access the CDN for 24 hours.
(In reply to Laura Thomson :laura from comment #4)
> 1. I'm thinking about an edge case, and I just want to talk through the
> logic here.
> 
> If a file is requested that is not on any of the CDNs, but Sentry thinks at
> least one CDN is up, what is the desired behavior?
> 
> Is this a different desired behavior from the situation where Sentry thinks
> there are no valid CDNs?

The crux of the issue is that it can be difficult to discriminate between "mirror doesn't have this file" and "mirror has this file but Sentry checks are failing for some reason". The symptoms can be similar.


A reasonable test might be *how* the check failed:


A 404 reasonably indicates that the mirror being checked legitimately doesn't have the file. If a mirror is 404'ing for a file, we can probably safely assume it really doesn't have it. Even in the case of CDNs this seems reasonable... if one CDN node is 404'ing, the others most likely are too, since they share a common config.

A 403 is the same. This happens for example on the SSL CDN mirror- it 403's for any update.mar files, because the origin forbids the SSL CDN from having those files. So we can probably treat these just like 404s... real cases of not having that file.

A 500 indicates that the mirror is unavailable... however in the case of CDN mirrors, this breaks down because a "mirror" is now many many nodes. This breaks even a little for the FTP cluster mirrors, because that's multiple LBs fronting multiple web nodes. We should probably assuming in this case that the test was a fluke and move on. In fact, Sentry might already be retrying these, with the logic I added some weeks ago.

A timeout can be anything- I suspect this is what we can really dig into. Right now a timeout counts as "mirror is down" or "doesn't have that file", depending on which check is happening. It's a bad assumption within Sentry that any time a HEAD fails for any reason, the mirror must be down.


> 2. What's the logic for the required check for nagios?
> Are we looking, for example, for zero valid locations for firefox-latest, or
> something like that?

I want nagios to be able to determine if Sentry believes:

1) any mirrors are hard-down. That is: the mirror is turned on (mirror_mirrors.active=1), but have zero active products in mirror_location_mirror_map.

2) any product/location has no available mirrors. There are likely to be some right now that are legitimately broken in this way, which we'll need to fix.


> 3. In Socorro we have some nagios checks that directly run SQL.  Should we
> do that instead of adding a webpage, so as to avoid adding a level of
> complexity/possible failure?

That's okay, but I want it to be obvious to the SRE's what action needs to be taken. They need to be able to determine what's wrong. We can possibly solve that via documentation, but ideally the alert itself would be obvious enough that they wouldn't need much more to go on. That's what led me to a report webpage. However, I'm definitely open to suggestions on this. A SQL check does seem likely to be simpler to make work, but doesn't as easily illuminate for them what to do about it.
In order to test this for QA purposes, we'll want to do a few different tests. We'll want to test that unhealthy mirrors are removed from circulation, and that when all mirrors are reported as unhealthy, that we do in fact serve from them. Finally, we'll want to test that mirrors that are marked as healthy or unhealthy are not served at all, if they are marked as inactive.

TEST CASE 1:

Mark all mirrors but one as "unhealthy" for a product. Verify that a single "healthy" mirror is used to serve all traffic.

TEST CASE 2:

Mark all mirrors as "unhealthy" for a product. Verify that all unhealthy mirrors are used to serve all traffic.

TEST CASE 3:

Deactivate all mirrors for a product except one, but mark them as healthy. Verify that the one mirror is used to serve the product. Repeat this test, but mark all mirrors as unhealthy. Correct results will be that only the active and healthy/unhealthy mirror is used.

TEST CASE 4:

Deactivate all mirrors, and mark them as healthy. Ensure the product is not served at all. Repeat the test with all mirrors marked unhealthy.
Depends on: 807731
Please ping me when it's a good time for you -- this is still on my radar and I'd love to spend a good amount of time verifying it so we can push this into production asap.
I saw that :stephend had blocked this bug on the unit testing bug which I've been actively working on. Does deploying this code NOT depend on that unit testing bug? If not, let's knock this down first thing next week.
:brandonsavage -- I tend to agree with Stephen but we should discuss it if you disagree. 

This is a high visibility and priority application. Having repeatable automated tests to verify our assumptions on how the code works should (imho) block this feature. Your tests will be able to expose and more easily test certain paths through the code than what we can do manually or with our automation tools. 

QA has limited amounts of time and any projects that we can increase the automation coverage on is a huge boon for us particularly when it helps dramatically lower the risk on a project.
I tentatively blocked on bug 807731, based on IRC discussion a couple of weeks ago -- I thought we left it that we'd get unit tests to at least cover the basics of the new functionality, and then manually test around it.  I'm still not sure the absolute timeline this has to ship, but I'd really hope/prefer that we get automated testing beyond our requests automation (https://github.com/mozilla/bouncer-tests), so that whenever we need make any code changes to Bouncer, we can turn it around quicker.

Laura, do you have a sense of timing, and what's your recommendation?
Assignee: nobody → bsavage
After talking with :mbrandt, we've decided we don't need to block on unit tests for this bug, but that we want to release this while unit tests are being developed. The risk for the Bouncer code change is low.
Status: NEW → RESOLVED
Closed: 7 years ago
No longer depends on: 807731
Resolution: --- → FIXED
See Also: → 807731
Nicely done :brandonsavage ... very nice! QA verifed on dev running through the steps noted in comment 7. The Bouncer-test suite as well as the our curl tests (https://gist.github.com/3842439) ran (pass & fail) as expected.

Setup on dev:
mysql> SELECT * FROM mirror_mirrors WHERE id IN (493, 492, 494, 486);
+-----+----------------------------+----------------------------------------------------------------------------------+----------+--------+-------+
| id | name | baseurl | rating | active | count |
+-----+----------------------------+----------------------------------------------------------------------------------+----------+--------+-------+
| 486 | Mozilla CDN - non-SSL | http://download.cdn.mozilla.net/pub/mozilla.org | 50000000 | 1 | 590 |
| 492 | Mozilla CDN - SSL | https://download-installer.cdn.mozilla.net/pub/mozilla.org | 10000 | 1 | 0 |
| 493 | Mozilla FTP Cluster - SSL | https://ftp.mozilla.org/pub/mozilla.org | 1 | 1 | 0 |
| 494 | Mozilla Edgecast - non-SSL | http://wpc.1237.edgecastcdn.net/801237/download.cdn.mozilla.net/pub/mozilla.org/ | 500 | 1 | 0 |
+-----+----------------------------+----------------------------------------------------------------------------------+----------+--------+-------+


TEST CASE 1:

Mark all mirrors but one as "unhealthy" for a product. Verify that a single "healthy" mirror is used to serve all traffic.

mysql> SELECT mlmm.*, mp.name, ml.id as ml_id FROM mirror_location_mirror_map mlmm LEFT JOIN mirror_locations ml ON ml.id = mlmm.location_id LEFT JOIN mirror_products mp on mp.id = ml.product_id WHERE ml.os_id = 2 AND mp.name LIKE 'Firefox-16.0.2';
+------------+-------------+-----------+--------+---------+----------------+-------+
| id | location_id | mirror_id | active | healthy | name | ml_id |
+------------+-------------+-----------+--------+---------+----------------+-------+
| 2498505505 | 9875 | 493 | 1 | 0 | Firefox-16.0.2 | 9875 |
| 2498505587 | 9875 | 492 | 1 | 0 | Firefox-16.0.2 | 9875 |
| 2498505546 | 9875 | 259 | 0 | 0 | Firefox-16.0.2 | 9875 |
| 2498505669 | 9875 | 494 | 1 | 0 | Firefox-16.0.2 | 9875 |
| 2498505628 | 9875 | 486 | 1 | 0 | Firefox-16.0.2 | 9875 |
+------------+-------------+-----------+--------+---------+----------------+-------+
curl -IL https://download-dev.allizom.org/?product=firefox-16.0.2&os=win&lang=en-US
Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/16.0.2/win32/en-US/Firefox%20Setup%2016.0.2.exe

curl -IL https://download-dev.allizom.org/?product=firefox-16.0.2&os=win&lang=en-US
Location: http://wpc.1237.edgecastcdn.net/801237/download.cdn.mozilla.net/pub/mozilla.org//firefox/releases/16.0.2/win32/en-US/Firefox%20Setup%2016.0.2.exe

TEST CASE 2:

Mark all mirrors as "unhealthy" for a product. Verify that all unhealthy mirrors are used to serve all traffic.

mysql> SELECT mlmm.*, mp.name, ml.id as ml_id FROM mirror_location_mirror_map mlmm LEFT JOIN mirror_locations ml ON ml.id = mlmm.location_id LEFT JOIN mirror_products mp on mp.id = ml.product_id WHERE ml.os_id = 2 AND mp.name LIKE 'Firefox-16.0.2';
+------------+-------------+-----------+--------+---------+----------------+-------+
| id | location_id | mirror_id | active | healthy | name | ml_id |
+------------+-------------+-----------+--------+---------+----------------+-------+
| 2498505505 | 9875 | 493 | 1 | 0 | Firefox-16.0.2 | 9875 |
| 2498505587 | 9875 | 492 | 1 | 0 | Firefox-16.0.2 | 9875 |
| 2498505546 | 9875 | 259 | 0 | 0 | Firefox-16.0.2 | 9875 |
| 2498505669 | 9875 | 494 | 1 | 0 | Firefox-16.0.2 | 9875 |
| 2498505628 | 9875 | 486 | 1 | 0 | Firefox-16.0.2 | 9875 |
+------------+-------------+-----------+--------+---------+----------------+-------+
curl -IL https://download-dev.allizom.org/?product=firefox-16.0.2&os=win&lang=en-US
Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/16.0.2/win32/en-US/Firefox%20Setup%2016.0.2.exe

TEST CASE 3:

Deactivate all mirrors for a product except one, but mark them as healthy. Verify that the one mirror is used to serve the product. Repeat this test, but mark all mirrors as unhealthy. Correct results will be that only the active and healthy/unhealthy mirror is used.

mysql> SELECT mlmm.*, mp.name, ml.id as ml_id FROM mirror_location_mirror_map mlmm LEFT JOIN mirror_locations ml ON ml.id = mlmm.location_id LEFT JOIN mirror_products mp on mp.id = ml.product_id WHERE ml.os_id = 2 AND mp.name LIKE 'Firefox-16.0.2';
+------------+-------------+-----------+--------+---------+----------------+-------+
| id | location_id | mirror_id | active | healthy | name | ml_id |
+------------+-------------+-----------+--------+---------+----------------+-------+
| 2498505505 | 9875 | 493 | 0 | 1 | Firefox-16.0.2 | 9875 |
| 2498505587 | 9875 | 492 | 0 | 1 | Firefox-16.0.2 | 9875 |
| 2498505546 | 9875 | 259 | 0 | 1 | Firefox-16.0.2 | 9875 |
| 2498505669 | 9875 | 494 | 1 | 1 | Firefox-16.0.2 | 9875 |
| 2498505628 | 9875 | 486 | 0 | 1 | Firefox-16.0.2 | 9875 |
+------------+-------------+-----------+--------+---------+----------------+-------+
curl -IL https://download-dev.allizom.org/?product=firefox-16.0.2&os=win&lang=en-US
Location: http://wpc.1237.edgecastcdn.net/801237/download.cdn.mozilla.net/pub/mozilla.org//firefox/releases/16.0.2/win32/en-US/Firefox%20Setup%2016.0.2.exe

TEST CASE 4:

Deactivate all mirrors, and mark them as healthy. Ensure the product is not served at all. Repeat the test with all mirrors marked unhealthy.

mysql> SELECT mlmm.*, mp.name, ml.id as ml_id FROM mirror_location_mirror_map mlmm LEFT JOIN mirror_locations ml ON ml.id = mlmm.location_id LEFT JOIN mirror_products mp on mp.id = ml.product_id WHERE ml.os_id = 2 AND mp.name LIKE 'Firefox-16.0.2';
+------------+-------------+-----------+--------+---------+----------------+-------+
| id | location_id | mirror_id | active | healthy | name | ml_id |
+------------+-------------+-----------+--------+---------+----------------+-------+
| 2498505505 | 9875 | 493 | 0 | 1 | Firefox-16.0.2 | 9875 |
| 2498505587 | 9875 | 492 | 0 | 1 | Firefox-16.0.2 | 9875 |
| 2498505546 | 9875 | 259 | 0 | 1 | Firefox-16.0.2 | 9875 |
| 2498505669 | 9875 | 494 | 0 | 1 | Firefox-16.0.2 | 9875 |
| 2498505628 | 9875 | 486 | 0 | 1 | Firefox-16.0.2 | 9875 |
+------------+-------------+-----------+--------+---------+----------------+-------+
curl -IL https://download-dev.allizom.org/?product=firefox-16.0.2&os=win&lang=en-US
HTTP/1.1 404 Not Found
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.