Closed Bug 1196418 Opened 10 years ago Closed 8 years ago

[prod] Intermittent downtime on the webapp

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: mbrandt, Unassigned)

Details

Per today's discussion at the Socorro meeting, we've noticed a recurring but intermittent downtime that occurs roughly once per day. Both Pingdom (:jd) and our Selenium-backed front-end tests have noticed this. In Web QA's environment it manifests as:
ReadTimeout: HTTPSConnectionPool(host='crash-stats.mozilla.com', port=443): Read timed out. (read timeout=10)
The current idea is to add additional instrumentation (New Relic) to collect more information so we can trace why this occurs.
If it takes longer than 10 seconds, even if it's not "down" it's almost the same as down. As in, it's not working. It might be interesting to see how things progressed if you changed that 10s to 30s or something.
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If it takes longer than 10 seconds, even if it's not "down" it's almost the
> same as down. As in, it's not working. It might be interesting to see how
> things progressed if you changed that 10s to 30s or something.
Agreed, although the point of the front-end UI tests is to act more in the role of acceptance tests. Load times greater than 10s aren't permissible. I'm not against running an experiment and bumping this number, though I'd like to do it in parallel with the added insight that New Relic would give us into the layers of the app. Does that sound like sane reasoning?
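One way to run the experiment suggested above without loosening the acceptance criterion is to wait up to 30s for a response while still flagging anything slower than 10s. The following is a minimal sketch using only the Python standard library, not the actual Web QA test code (which is requests-based). The names `probe`, `fetch_url`, `ACCEPTANCE_LIMIT_S` and `HARD_TIMEOUT_S` are made up for illustration.

```python
import time
import urllib.request

ACCEPTANCE_LIMIT_S = 10.0  # the read timeout quoted in the ReadTimeout error
HARD_TIMEOUT_S = 30.0      # the longer, experimental timeout (assumed value)


def fetch_url(url, timeout):
    """Fetch the URL and discard the body.

    Note: urllib applies the timeout to each blocking socket operation,
    not to the request as a whole.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def probe(url, fetch=fetch_url, clock=time.monotonic):
    """Time one request and classify it as 'ok', 'slow' or 'failed'.

    `fetch` and `clock` are injectable so the logic can be exercised
    without touching the network.
    """
    start = clock()
    try:
        fetch(url, HARD_TIMEOUT_S)
    except OSError as exc:  # covers socket timeouts, URLError and HTTPError
        return {"url": url, "status": "failed",
                "elapsed": clock() - start, "error": repr(exc)}
    elapsed = clock() - start
    status = "ok" if elapsed <= ACCEPTANCE_LIMIT_S else "slow"
    return {"url": url, "status": status, "elapsed": elapsed, "error": None}
```

Calling `probe("https://crash-stats.mozilla.com/")` on a schedule and logging the result would separate "slow but alive" (between 10s and 30s) from "no answer at all", which a single 10s timeout cannot distinguish.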
I'd love to see (hint hint JP) a graph of the ping times that Pingdom is getting. I worry that we might have *something* in the webapp that is horribly horribly slow, and when it gets triggered stuff doesn't stop working, it just spikes in slowness. We used to have this with the stuff that talked to ElasticSearch. Certain queries would cause downloads of tens of megabytes from the ES cluster to the webapp. Arguably nothing stopped working, but everything became slow because of resources being spent on those rogue queries. So it would be interesting to get a feel for these spikes and possibly what URLs might cause them.
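To get a feel for which URLs are behind the spikes, per-request timings could be grouped by URL and ranked by how often they exceed the threshold. This is a rough sketch that assumes timings are already available as `(url, elapsed_seconds)` pairs, for example parsed from access logs. The function name `slowest_urls` and the input shape are assumptions, not anything that exists in Socorro.

```python
from collections import defaultdict
from statistics import median


def slowest_urls(samples, threshold_s=10.0):
    """Summarize URLs that had at least one request slower than threshold_s.

    `samples` is an iterable of (url, elapsed_seconds) pairs. The result
    is sorted with the most frequently slow URLs first.
    """
    by_url = defaultdict(list)
    for url, elapsed in samples:
        by_url[url].append(elapsed)

    report = []
    for url, times in by_url.items():
        slow = sum(1 for t in times if t > threshold_s)
        if slow:
            report.append({
                "url": url,
                "requests": len(times),
                "slow": slow,
                "median": median(times),
                "max": max(times),
            })
    # Rank by number of slow requests, then by the worst single request.
    report.sort(key=lambda row: (row["slow"], row["max"]), reverse=True)
    return report
```

A URL that is rarely requested but almost always slow would surface here, which is the "rogue query" pattern described above.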
I've set https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/347 to be kept "forever," just in case, but it's an example with a timestamp of such an occurrence: Aug 20, 2015 4:29:00 AM
Let's need-info :jp so it gets added to his queue :-)
Flags: needinfo?(jschneider)
/me is summoned
https://www.dropbox.com/s/qiyq14rw14s27qt/Screenshot%202015-08-24%2022.13.44.JPG
Those are the ping times for the last 7 days. Our downtime issue seems to be fixed now, ever since we upsized the analysis server. Hooray.
Flags: needinfo?(jschneider)
(In reply to JP Schneider [:jp] from comment #6)
> /me is summoned
> https://www.dropbox.com/s/qiyq14rw14s27qt/Screenshot%202015-08-24%2022.13.44.JPG
>
> There's ping times in the last 7 days. Our downtime issue seems to now be
> fixed, ever since we upsized the analysis server.
>
> Hooray.
According to our requests-module-based tests, that doesn't jibe: https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/355/console shows we had another downtime at Aug 24, 2015 4:29:00 AM.
What's the latest here? Are we still seeing it being a yo-yo?
It looks like we haven't seen this since August 24th.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Was just waiting for this to be knocked to resolved/fixed :-) Bumping to verified.
Status: RESOLVED → VERIFIED
Looks like we saw this again, so reopening. How are we monitoring this (or are we)? https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/464/console
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
JP, I haven't received anything from Pingdom about prod being down today. Have you? Is our monitoring not working? There was no release today, so I don't see that as an obvious reason why the site would be down.
Flags: needinfo?(jschneider)
(In reply to Peter Bengtsson [:peterbe] from comment #12)
> JP,
> I haven't received anything from Pingdom about prod being down today. Have
> you?
> Is our monitoring not working?
> There was no release today so I don't see that as an obvious reason why the
> site would be down.
This isn't from today: the timestamp from the build (https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/464/) is Oct 8, 2015 4:29:00 AM (and that's PDT). So this was yesterday and, to my knowledge, is only now being reported here.
We need those New Relic licenses. Then I can do something about all of this.
Flags: needinfo?(jschneider)
(In reply to JP Schneider [:jp] from comment #14)
> We need those newrelic licenses. Then, I can do something about all of this.
For cross-reference, we have them, per bug 1223949.
This is just too old to track any more. Prod is up :)
Status: REOPENED → RESOLVED
Closed: 10 years ago → 8 years ago
Resolution: --- → INVALID