plugins.mozilla.org is funky

RESOLVED FIXED

Status

RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: Usul, Assigned: rwatson)

Tracking

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1493] )

(Reporter)

Description

3 years ago
curl -I https://plugins.mozilla.org/
HTTP/1.1 401 Authorization Required

Since when do we need auth for plugins?

Also, watson saw:
tail -f curl.txt | grep HTTP
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 500 Internal Server Error
HTTP/1.1 401 Authorization Required
HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 500 Internal Server Error
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
HTTP/1.1 401 Authorization Required
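
The sample above can be tallied to put a number on the failure rate. This is a minimal sketch in Python, assuming the captured status lines are available as a string. The `tally_statuses` helper name is illustrative and not part of any tool used in this bug.

```python
from collections import Counter


def tally_statuses(text):
    """Count status codes in lines shaped like 'HTTP/1.1 <code> <reason>'."""
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("HTTP/"):
            counts[parts[1]] += 1
    return counts


# Same counts as the curl.txt sample above (21 x 401, 4 x 500).
# The order is not preserved here because it does not affect the tally.
sample = "\n".join(
    ["HTTP/1.1 401 Authorization Required"] * 21
    + ["HTTP/1.1 500 Internal Server Error"] * 4
)

counts = tally_statuses(sample)
error_rate = counts["500"] / sum(counts.values())
print(f"500s: {counts['500']} of {sum(counts.values())} ({error_rate:.0%})")
```

On this sample, 4 of 25 responses are 500s, which is about one failure in every six requests.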

Updated

3 years ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1493]
(Assignee)

Updated

3 years ago
Assignee: server-ops-webops → rwatson
Pretty sure that has always required auth.
(Assignee)

Comment 4

3 years ago
I think we are focusing on the wrong thing here. 

The problem is that plugins is failing roughly once every 5 requests. Usul's mention of the auth thing was a side question.

We are seeing more failures from New Relic:
Alert open: Error rate > 5.0% on 'plugins.mozilla.org'
That is strange; it should not be hit that often, since the JSON is being cached. That said, there is a new backend for this service; I'm just not sure exactly what stage it is at.

From what I remember, it will run on Heroku and the JSON will be served from the mozcdn and updated/cache busted when there are updates made to the content.

Looping in Lonnen, as he will have more details on the above.
Flags: needinfo?(chris.lonnen)

Comment 6

3 years ago
The error rate on NR was fixed late last week.

401's for the homepage are expected... health checks should look at the json file that :espressive linked.

The 500 ISE errors are interesting. I'm wondering if they're related to the NR errors we were seeing. Can anyone confirm if they're still getting them?
(Assignee)

Comment 7

3 years ago
Yeah, the 500's occurred before the NR stuff and are still occurring (I just tried it).
Roughly 1 in every 5 to 7 requests gets:

curl -I https://plugins.mozilla.org/
HTTP/1.1 500 Internal Server Error

So I believe this to be completely unrelated to the NR stuff.

Comment 8

3 years ago
That error is being generated by Zeus (not apache). Haven't found anything interesting in the ZLB logs, though.

Comment 9

3 years ago
The 500's do not seem to depend on the User-Agent or the file being requested. Most 500's are for /en-us/plugins_list.json, but some are for /pfs/v2<something> as well.

In order to diagnose this further, we'll need to add some log fields to the plugins.mozilla.org logfile in Zeus. CC'ing :sheeri as she manages the Data Team, where these logs are currently being sent.

@sheeri: Can you confirm if any processing is using plugins.mozilla.org log data? We're currently sending this data to metrics-logger1.private.scl3, but I'm fairly sure it's not being used and we can change it as we please (and stop sending it). https://bugzilla.mozilla.org/show_bug.cgi?id=1002626#c9 doesn't mention it.
Flags: needinfo?(scabral)
500 errors for /en-us/plugins_list.json seem strange, unless there was a botched update and the JSON does not parse. The 500's for /pfs/v2<something> are the real question though, as there should be no requests for this anymore.

Is it possible to know where the requests for /pfs/v2<something> are coming from?
These 500 errors are being generated automatically by Zeus when the backend server is unavailable to serve the request (either by returning a non-HTTP response, or by failing due to connection reset by peer or connection refused).

Typically this indicates that an httpd worker thread crashed for some reason - and, if it crashes at certain points, the request won't be written to the apache logs at all.

Comment 13

3 years ago
I enabled extra log fields on the ZLB logs, and atoll's comment 12 is correct: it is contacting backend nodes. The next step is to figure out what causes ZLB to decide to return a 500 to the user. Likely troubleshooting steps will be to bump up the LogLevel in Apache, and/or PHP's own log_level.

Comment 14

3 years ago
This seems to be fixed, and I'm thinking that this is a ZLB bug.

Specifically:

If max_connection_attempts on a pool is set to 0 (unlimited, and the default) or 2, no errors. Also no retries reported, indicating it's not ever actually failing.

If max_connection_attempts on a pool is set to 1, occasional 500 ISE errors thrown by the ZLB. These report that a node was used, but do not report any node response time... I believe they never actually attempt a connection at all.


I've found nothing on the web nodes indicating they're having problems.

The 500's seem not to be specific to User-Agent, SSL cipher used, or backend node ZLB reports "using".


I'm clearing the NEEDINFO's on this because they're no longer relevant. I'm going to watch this for a bit longer, but I think all is well here.


Note: I haven't found any other vservers in PHX1 that are throwing regular Zeus-generated 500 ISE's... only plugins. This seems reasonable, because we had tweaked the plugins pool connection settings during a load spike some time ago, when caching didn't work properly... we probably unwittingly introduced this problem at that time.
Flags: needinfo?(scabral)
Flags: needinfo?(chris.lonnen)
Ah. With 1, it's going to fail from time to time when the backend server closes the connection without Zeus's knowledge as part of shutting down due to MaxRequestsPerChild or equivalent. So we should probably audit sitewide and make sure it's undefined, 0, or 2 for all vservers.
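
A toy model of the failure mode described above, as a sketch only. It assumes the load balancer reuses an idle connection that the backend may already have closed, and that any retry opens a fresh connection that works. `serve` and `risky_pools` are hypothetical helpers, not Zeus APIs, and the pool names are made up; only the `max_connection_attempts` setting name comes from this bug.

```python
BACKEND_RESPONSE = "backend response"
LB_ERROR = "500 generated by load balancer"


def serve(max_connection_attempts, first_connection_stale):
    """Return what the client sees for one request in this toy model.

    max_connection_attempts: 0 means unlimited, mirroring the pool setting.
    first_connection_stale: True if the backend already closed the reused
    connection (for example after hitting MaxRequestsPerChild).
    Assumption: every attempt after the first uses a fresh, working connection.
    """
    attempt = 1
    stale = first_connection_stale
    while True:
        if not stale:
            return BACKEND_RESPONSE
        # This attempt failed; give up if the attempt budget is exhausted.
        if max_connection_attempts != 0 and attempt >= max_connection_attempts:
            return LB_ERROR
        attempt += 1
        stale = False


def risky_pools(settings):
    """Return names of pools whose max_connection_attempts is exactly 1.

    settings maps pool name to the configured value, or None if undefined.
    """
    return sorted(name for name, value in settings.items() if value == 1)


# Hypothetical pool names, for illustration only.
example_settings = {"plugins": 1, "example-a": 0, "example-b": 2, "example-c": None}
print(risky_pools(example_settings))
```

In this model a stale connection is only visible to the client when the setting is 1; with 0 or 2 the retry hides it, which matches the behaviour observed in comment 14.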

Comment 16

3 years ago
Okay, all is still well here. Calling this one resolved.

MOC, if you disabled any monitoring of plugins due to this issue, you can safely re-enable it now. :)
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(ludovic)
Resolution: --- → FIXED
(Reporter)

Updated

3 years ago
Flags: needinfo?(ludovic)
(Assignee)

Comment 17

3 years ago
I've just re-added the pingdom check "HTTP status code should be 401"
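
As a rough sketch of what that check amounts to: the homepage is expected to answer 401, so a monitor should treat 401 as healthy and anything else, including the intermittent 500, as a failure. The function names below are illustrative and do not reflect how Pingdom is implemented or configured; the input is assumed to be the header text printed by `curl -I`.

```python
EXPECTED_STATUS = 401


def status_from_head(output):
    """Extract the numeric status code from the first line of `curl -I` output."""
    first_line = output.strip().splitlines()[0]
    return int(first_line.split()[1])


def homepage_check_passes(output, expected=EXPECTED_STATUS):
    """Return True when the response carries the expected status code."""
    return status_from_head(output) == expected


print(homepage_check_passes("HTTP/1.1 401 Authorization Required"))
print(homepage_check_passes("HTTP/1.1 500 Internal Server Error"))
```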