1196418 - [prod] Intermittent downtime on the webapp

Reporter

Description

•

10 years ago

Per today's discussion at the Socorro meeting, we've noticed a potential recurring, yet intermittent, downtime that occurs roughly once per day. Both Pingdom (:jd) and our Selenium-backed front-end tests have noticed this.

Manifestation in Web QA's environment is:

ReadTimeout: HTTPSConnectionPool(host='crash-stats.mozilla.com', port=443): Read timed out. (read timeout=10)

The current idea is to add additional instrumentation (New Relic) to collect more information so we can trace why this occurs.

Peter Bengtsson [:peterbe]

Updated

•

10 years ago

URL: https://crash-stats.mozilla.org → https://crash-stats.mozilla.com

Peter Bengtsson [:peterbe]

Comment 1

•

10 years ago

If it takes longer than 10 seconds, even if it's not "down" it's almost the same as down. As in, it's not working. It might be interesting to see how things progressed if you changed that 10s to 30s or something.
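A minimal sketch of that experiment, not the actual Web QA test code: time each request against a looser 30s ceiling while still recording whether it would have broken the 10s limit. The names `fetch_with_urllib`, `timed_fetch`, and `classify` are hypothetical, and the sketch uses the standard library's `urllib` instead of the client the real tests use.

```python
import time
import urllib.request
from typing import Callable, Optional

ACCEPTANCE_LIMIT_S = 10.0  # the read timeout quoted in the description
EXPERIMENT_LIMIT_S = 30.0  # the looser value suggested for the experiment


def fetch_with_urllib(url: str, timeout_s: float) -> None:
    """Fetch the URL and read the body; raises OSError subclasses on failure."""
    # Note: urllib applies the timeout per socket operation, not to the
    # whole request, so the total elapsed time can exceed timeout_s.
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        response.read()


def timed_fetch(
    url: str,
    fetch: Callable[[str, float], None] = fetch_with_urllib,
    timeout_s: float = EXPERIMENT_LIMIT_S,
) -> Optional[float]:
    """Return elapsed seconds, or None if the request timed out or failed."""
    start = time.monotonic()
    try:
        fetch(url, timeout_s)
    except OSError:  # covers socket timeouts and urllib.error.URLError
        return None
    return time.monotonic() - start


def classify(elapsed_s: Optional[float]) -> str:
    """Separate 'slow but alive' from 'no answer at all'."""
    if elapsed_s is None:
        return "failed"
    if elapsed_s <= ACCEPTANCE_LIMIT_S:
        return "ok"
    return "slow"  # answered, but over the 10s acceptance limit
```

Calling `classify(timed_fetch("https://crash-stats.mozilla.com/"))` on a schedule would show whether the failures are slow responses in the 10s to 30s range or requests that never complete.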

Matt Brandt [:mbrandt]

Reporter

Comment 2

•

10 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #1)
> If it takes longer than 10 seconds, even if it's not "down" it's almost the
> same as down. As in, it's not working. It might be interesting to see how
> things progressed if you changed that 10s to 30s or something.

Agreed, although the point of the front-end UI tests is to act more as acceptance tests. Load times greater than 10s aren't permissible. I'm not against running an experiment and bumping this number, though I'd like to do it in parallel with the added insight that New Relic would give us into the layers of the app. Does that sound like sane reasoning?

Peter Bengtsson [:peterbe]

Comment 3

•

10 years ago

I'd love to see (hint hint JP) a graph of the ping times that Pingdom is getting. I worry that we might have *something* in the webapp that is horribly, horribly slow, and when it gets triggered, stuff doesn't stop working, it just spikes in slowness.

We used to have this with the stuff that talked to ElasticSearch. Certain queries would cause downloads of tens of megabytes from the ES cluster to the webapp. Arguably nothing stopped working, but everything became slow because resources were being spent on those rogue queries.

So it's interesting to get a feel for these spikes and possibly what URLs might cause them.
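One way to get a feel for such spikes, assuming per-request timings can be exported as (URL, seconds) pairs from access logs or a monitoring tool: group them by URL and list the URLs that ever crossed a threshold. This is an illustrative sketch only. `slowest_urls`, the 10s default, and the sample paths are assumptions, not Socorro code or real measurements.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def slowest_urls(
    samples: Iterable[Tuple[str, float]], threshold_s: float = 10.0
) -> List[Dict[str, object]]:
    """Summarise URLs with at least one request slower than threshold_s."""
    by_url: Dict[str, List[float]] = defaultdict(list)
    for url, seconds in samples:
        by_url[url].append(seconds)

    report = []
    for url, times in by_url.items():
        spikes = [t for t in times if t > threshold_s]
        if spikes:
            report.append(
                {
                    "url": url,
                    "requests": len(times),
                    "spikes": len(spikes),
                    "median_s": median(times),  # the usual cost of this URL
                    "max_s": max(times),        # how bad the worst spike was
                }
            )
    # Worst offenders first.
    report.sort(key=lambda row: row["max_s"], reverse=True)
    return report


# Made-up sample data, purely for illustration.
example = [
    ("/hypothetical/home", 0.4),
    ("/hypothetical/search", 0.5),
    ("/hypothetical/search", 14.0),
    ("/hypothetical/search", 0.7),
]
for row in slowest_urls(example):
    print(row)
```

A URL with a low median but a high maximum fits the pattern described above: normally cheap, occasionally triggering something very expensive.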

Stephen Donner [:stephend] Not actively reading bugmail

Comment 4

•

10 years ago

I've set https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/347 to be kept "forever," just in case, but it's an example with a timestamp of such an occurrence: Aug 20, 2015 4:29:00 AM

Stephen Donner [:stephend] Not actively reading bugmail

Comment 5

•

10 years ago

Let's need-info :jp so it gets added to his queue :-)

Flags: needinfo?(jschneider)

JP Schneider [:jp]

Comment 6

•

10 years ago

/me is summoned
https://www.dropbox.com/s/qiyq14rw14s27qt/Screenshot%202015-08-24%2022.13.44.JPG

Those are the ping times for the last 7 days. Our downtime issue seems to now be fixed, ever since we upsized the analysis server.

Hooray.

Flags: needinfo?(jschneider)

Stephen Donner [:stephend] Not actively reading bugmail

Comment 7

•

10 years ago

(In reply to JP Schneider [:jp] from comment #6)
> /me is summoned
> https://www.dropbox.com/s/qiyq14rw14s27qt/Screenshot%202015-08-24%2022.13.44.JPG
>
> There's ping times in the last 7 days. Our downtime issue seems to now be
> fixed, ever since we upsized the analysis server.
>
> Hooray.

According to our requests-module based tests, that doesn't jibe. We had another downtime @ Aug 24, 2015 4:29:00 AM:
https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/355/console

Peter Bengtsson [:peterbe]

Comment 8

•

10 years ago

What's the latest here? Are we still seeing it being a yo-yo?

Dave Hunt [:davehunt] [he/him] ⌚BST

Comment 9

•

10 years ago

It looks like we haven't seen this since August 24th.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Matt Brandt [:mbrandt]

Reporter

Comment 10

•

10 years ago

Was just waiting for this to be knocked to resolved/fixed :-) Bumping to verified.

Status: RESOLVED → VERIFIED

Stephen Donner [:stephend] Not actively reading bugmail

Comment 11

•

10 years ago

Looks like we saw this again, so reopening. How are we monitoring this (are we)?
https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/464/console

Status: VERIFIED → REOPENED

Resolution: FIXED → ---

Peter Bengtsson [:peterbe]

Comment 12

•

10 years ago

JP,
I haven't received anything from Pingdom about prod being down today. Have you?
Is our monitoring not working?
There was no release today, so I don't see that as an obvious reason why the site would be down.

Flags: needinfo?(jschneider)

Stephen Donner [:stephend] Not actively reading bugmail

Comment 13

•

10 years ago

(In reply to Peter Bengtsson [:peterbe] from comment #12)
> JP,
> I haven't received anything from Pingdom about prod being down today. Have
> you?
> Is our monitoring not working?
> There was no release today so I don't that as an obvious reason why the site
> would be down.

This isn't from today. The timestamp from the build (https://webqa-ci.mozilla.com/view/Socorro/job/socorro.prod/464/) is Oct 8, 2015 4:29:00 AM (and that's PDT). So this was yesterday and, to my knowledge, is only now being reported here.
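To line the build timestamps up with alerts from other tools, it can help to normalise them to UTC first. A small sketch under stated assumptions: the fixed UTC-7 offset is only valid while Pacific Daylight Time is in effect, and only the Oct 8 timestamp is confirmed as PDT in this bug. Treating the August ones the same way is an assumption. `BUILD_STAMP_FORMAT` and `build_stamp_to_utc` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Pacific Daylight Time; not valid for dates outside DST.
PDT = timezone(timedelta(hours=-7), "PDT")

# Matches strings such as "Oct 8, 2015 4:29:00 AM" (English month names).
BUILD_STAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"


def build_stamp_to_utc(stamp: str) -> datetime:
    """Parse a build timestamp given in PDT and return it in UTC."""
    local = datetime.strptime(stamp, BUILD_STAMP_FORMAT).replace(tzinfo=PDT)
    return local.astimezone(timezone.utc)


# Timestamps quoted in this bug.
for stamp in (
    "Aug 20, 2015 4:29:00 AM",
    "Aug 24, 2015 4:29:00 AM",
    "Oct 8, 2015 4:29:00 AM",
):
    print(stamp, "->", build_stamp_to_utc(stamp).isoformat())
```

With every source expressed in UTC, a failure reported a day late is easier to match against the Pingdom history for the time it actually happened.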

JP Schneider [:jp]

Comment 14

•

10 years ago

We need those New Relic licenses. Then I can do something about all of this.

Flags: needinfo?(jschneider)

Stephen Donner [:stephend] Not actively reading bugmail

Comment 15

•

10 years ago

(In reply to JP Schneider [:jp] from comment #14)
> We need those newrelic licenses. Then, I can do something about all of this.

For cross-reference, we have them, per bug 1223949.

Peter Bengtsson [:peterbe]

Comment 16

•

8 years ago

This is just too old to track any more. Prod is up :)

Status: REOPENED → RESOLVED

Closed: 10 years ago → 8 years ago

Resolution: --- → INVALID

Bugzilla


Categories

(Socorro :: General, task)

Tracking

(Not tracked)

People

(Reporter: mbrandt, Unassigned)


Details

Crash Data

Security

(public)
