Closed
Bug 1196418
Opened 10 years ago
Closed 8 years ago
[prod] Intermittent downtime on the webapp
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: mbrandt, Unassigned)
Details
Per today's discussion at the Socorro meeting, we've noticed a potentially recurring, yet intermittent, downtime that occurs roughly once per day.
Both Pingdom (:jd) and our Selenium-backed front-end tests have noticed this.
Manifestation in Web QA's environment is:
- ReadTimeout: HTTPSConnectionPool(host='crash-stats.mozilla.com', port=443): Read timed out. (read timeout=10)
The current idea is to add additional instrumentation - New Relic - to collect more information so we can trace why this occurs.
Comment 1•10 years ago
If it takes longer than 10 seconds, even if it's not "down" it's almost the same as down. As in, it's not working. It might be interesting to see how things progressed if you changed that 10s to 30s or something.
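One way to run that experiment is to record how long each request actually takes under a more generous limit, instead of only seeing a failure at 10 seconds. The following is a minimal sketch, not the Web QA test code: `classify`, `timed_fetch`, and the 30-second limit are hypothetical, and it uses the standard library's `urllib` to stay self-contained (with `requests`, the equivalent knob is the `timeout=` argument).

```python
import time
import urllib.request

ACCEPTABLE_SECONDS = 10.0  # the acceptance limit the tests use today
HARD_LIMIT_SECONDS = 30.0  # hypothetical longer limit for the experiment


def classify(elapsed, acceptable=ACCEPTABLE_SECONDS, hard_limit=HARD_LIMIT_SECONDS):
    """Label a response time as 'ok', 'slow' or 'down'."""
    if elapsed is None or elapsed >= hard_limit:
        return "down"
    if elapsed > acceptable:
        return "slow"
    return "ok"


def timed_fetch(url, hard_limit=HARD_LIMIT_SECONDS, opener=urllib.request.urlopen):
    """Return the seconds taken to fetch url, or None on error or timeout."""
    start = time.monotonic()
    try:
        with opener(url, timeout=hard_limit) as response:
            response.read()
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None
    return time.monotonic() - start


# Example (performs a real request, so it is left commented out):
# print(classify(timed_fetch("https://crash-stats.mozilla.com/")))
```

A "slow" result would mean the site answered between 10 and 30 seconds, which still fails the acceptance criterion but distinguishes a latency spike from a real outage. Note that this sketch also reports HTTP error responses as "down".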
Reporter
Comment 2•10 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If it takes longer than 10 seconds, even if it's not "down" it's almost the
> same as down. As in, it's not working. It might be interesting to see how
> things progressed if you changed that 10s to 30s or something.
Agreed, although the point of the front-end UI tests is to act more in the role of acceptance tests. Load times greater than 10s aren't permissible.
I'm not against running an experiment and bumping this number -- though I'd like to do it in parallel with the added insight that New Relic would give us into the layers of the app. Does that sound like sane reasoning?
Comment 3•10 years ago
I'd love to see (hint hint JP) a graph of the ping times that Pingdom is getting.
I worry that we might have *something* in the webapp that is horribly, horribly slow, and when it gets triggered stuff doesn't stop working, it just spikes in slowness. We used to have this with the stuff that talked to ElasticSearch. Certain queries would cause downloads of tens of megabytes from the ES cluster to the webapp. Arguably nothing stopped working, but everything became slow because of resources being spent on those rogue queries. So it's interesting to get a feel for these spikes and possibly what URLs might cause them.
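To get a feel for which URLs might cause such spikes, something like the following could be run over request timings. It is only a sketch: the `(url, seconds)` record shape, the `slowest_urls` name, and the sample paths are made up for illustration and are not taken from Socorro's logs.

```python
from collections import defaultdict


def slowest_urls(records, threshold=10.0):
    """Summarise requests slower than threshold seconds.

    records is an iterable of (url, seconds) pairs. Returns a list of
    (url, slow_count, max_seconds), sorted with the worst offenders first.
    """
    stats = defaultdict(lambda: [0, 0.0])  # url -> [slow_count, max_seconds]
    for url, seconds in records:
        if seconds > threshold:
            entry = stats[url]
            entry[0] += 1
            entry[1] = max(entry[1], seconds)
    rows = [(url, count, worst) for url, (count, worst) in stats.items()]
    return sorted(rows, key=lambda row: (-row[1], -row[2]))


# Hypothetical sample data, for illustration only.
sample = [
    ("/search/", 12.5),
    ("/home/", 0.3),
    ("/search/", 31.0),
    ("/report/", 11.0),
]
for row in slowest_urls(sample):
    print(row)
```

Counting slow hits per URL, rather than averaging, keeps a few rogue queries from being hidden by many fast requests to the same path.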
Comment 4•10 years ago
I've set https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/347 to be kept "forever," just in case, but it's an example with a timestamp of such an occurrence: Aug 20, 2015 4:29:00 AM
Comment 5•10 years ago
Let's need-info :jp so it gets added to his queue :-)
Flags: needinfo?(jschneider)
Comment 6•10 years ago
/me is summoned
https://www.dropbox.com/s/qiyq14rw14s27qt/Screenshot%202015-08-24%2022.13.44.JPG
Those are the ping times for the last 7 days. Our downtime issue seems to now be fixed, ever since we upsized the analysis server.
Hooray.
Flags: needinfo?(jschneider)
Comment 7•10 years ago
(In reply to JP Schneider [:jp] from comment #6)
> /me is summoned
> https://www.dropbox.com/s/qiyq14rw14s27qt/Screenshot%202015-08-24%2022.13.44.JPG
>
> There's ping times in the last 7 days. Our downtime issue seems to now be
> fixed, ever since we upsized the analysis server.
>
> Hooray.
According to our requests-module based tests, that doesn't jibe -- https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/355/console - we had another downtime @ Aug 24, 2015 4:29:00 AM
Comment 8•10 years ago
What's the latest here? Are we still seeing it being a yo-yo?
Comment 9•10 years ago
It looks like we haven't seen this since August 24th.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Reporter
Comment 10•10 years ago
Was just waiting for this to be knocked to resolved/fixed :-) Bumping to verified.
Status: RESOLVED → VERIFIED
Comment 11
Looks like we saw this again - reopening. How are we monitoring this (are we)?
https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/464/console
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Comment 12•10 years ago
JP,
I haven't received anything from Pingdom about prod being down today. Have you?
Is our monitoring not working?
There was no release today so I don't see that as an obvious reason why the site would be down.
Flags: needinfo?(jschneider)
Comment 13
(In reply to Peter Bengtsson [:peterbe] from comment #12)
> JP,
> I haven't received anything from Pingdom about prod being down today. Have
> you?
> Is our monitoring not working?
> There was no release today so I don't that as an obvious reason why the site
> would be down.
This isn't from today - the timestamp from the build (https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/464/) is Oct 8, 2015 4:29:00 AM (and that's PDT). So this was yesterday and, to my knowledge, is only now being reported here.
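All three failing builds cited in this bug (347, 355 and 464) carry the same time of day. A quick way to check for that kind of clustering is sketched below. The `time_of_day_counts` helper is hypothetical, and these are build timestamps, so a shared time may simply reflect when the job is scheduled rather than when the site is slow.

```python
from collections import Counter
from datetime import datetime

# Build timestamps quoted in this bug (builds 347, 355 and 464).
FAILED_BUILDS = [
    "Aug 20, 2015 4:29:00 AM",
    "Aug 24, 2015 4:29:00 AM",
    "Oct 8, 2015 4:29:00 AM",
]


def time_of_day_counts(stamps, fmt="%b %d, %Y %I:%M:%S %p"):
    """Count how many timestamps fall on each HH:MM time of day."""
    return Counter(datetime.strptime(s, fmt).strftime("%H:%M") for s in stamps)


for time_of_day, count in time_of_day_counts(FAILED_BUILDS).most_common():
    print(time_of_day, count)
```

If failures from jobs running at other times of day were added to the list, a single dominant bucket would point toward something periodic on the server side.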
Comment 14•10 years ago
We need those New Relic licenses. Then I can do something about all of this.
Flags: needinfo?(jschneider)
Comment 15
(In reply to JP Schneider [:jp] from comment #14)
> We need those newrelic licenses. Then, I can do something about all of this.
For cross-reference, we have them, per bug 1223949.
Comment 16•8 years ago
This is just too old to track any more. Prod is up :)
Status: REOPENED → RESOLVED
Closed: 10 years ago → 8 years ago
Resolution: --- → INVALID