Closed Bug 1160856 Opened 9 years ago Closed 9 years ago

Stage is suffering from high web transaction response time & log parser backlogs

Component: Tree Management :: Treeherder: Infrastructure (defect, P1)
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: emorley; Assignee: emorley
We're seeing a few issues on stage that need sorting out before we can push to prod. We also need to do a prod push ASAP, since we need esr38 repo support, and I'd prefer not to have to cherry-pick.

We're seeing:
* Higher web API response times (not surprising given the log parser changes, but I think we need to reconsider the approach)
* More worryingly, spikes in web API response times and resultant Apdex warnings (https://docs.newrelic.com/docs/apm/new-relic-apm/apdex/apdex-measuring-user-satisfaction; a quick sketch of how the score is computed follows this list)
* Even more worryingly, the log parsers are disconnecting repeatedly, and this morning there was a backlog of ~80,000 logs.
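For reference, the Apdex score New Relic alerts on is the standard Apdex formula: satisfied requests plus half the tolerating ones, divided by the total. A small illustrative computation; the 0.5s threshold and sample response times below are made up, and the real threshold T is whatever is configured for the app in New Relic:

```python
# Illustrative Apdex computation (standard formula). T is the configured
# response-time threshold; the 0.5s default here is just an example.
def apdex(response_times_secs, t=0.5):
    """Apdex = (satisfied + tolerating / 2) / total, where
    satisfied: response time <= T, tolerating: T < response time <= 4T."""
    total = len(response_times_secs)
    satisfied = sum(1 for rt in response_times_secs if rt <= t)
    tolerating = sum(1 for rt in response_times_secs if t < rt <= 4 * t)
    return (satisfied + tolerating / 2.0) / total if total else 1.0


# Example: a few slow spikes drag the score below the 0.7 alert threshold.
print(apdex([0.2, 0.3, 0.4, 1.2, 2.5, 3.0]))  # => ~0.58
```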

Several things have landed in quick succession:
1) Changing how we parse logs (moving bug suggestion generation from the log processors to the web API nodes)
2) Switching treeherder-client to use requests rather than httplib (possibly with different timeout behaviour?)
3) Moving treeherder-client into the service repo (not directly risky in itself, but it makes backouts more of a pain)

I think we should back out our way to victory, and soon. On previous occasions we've been more lax about leaving breakage in, and that has either held up deploys or meant we pushed anyway and residual bugs were still unassigned some time later.
From my email on Friday:

"""
New Relic alerts that started today ("Alert open: Apdex score < 0.7 on 'treeherder-stage'")

These alerts started with today's stage deploy, which as far as I can tell only contains the treeherder-client update:
https://github.com/mozilla/treeherder/compare/9e29351...b0a0ea7
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.stage/logs/master.1430384791
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.stage/logs/master.1430492897

However, the web transaction load has been higher since yesterday's deploy (unsurprising given the log parser/bug suggestion changes):
https://rpm.newrelic.com/accounts/677903/applications/5585473?tw[end]=1430502214&tw[start]=1430243014

I wonder if we need to make the bug suggestions async after all? (And add appropriate handling so the API/UI copes with the bug suggestions not yet being complete.)
"""
To rule out processes getting stuck and not picking up changes, I pushed 9e29351 to stage (i.e. what was on stage until now, minus https://github.com/mozilla/treeherder/commit/b0a0ea7a8c71e5e887c2e0b450aff270bb83c875) to see if the Apdex warnings go away.

I suspect they will, since when we switched to requests, we stopped specifying a timeout:
https://github.com/mozilla/treeherder/commit/437a3c379a6a1ec545e348763dd141e67af64217#diff-8f20c700b32fc2a7be610d1089eb9eb4L744
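For context on the timeout point: requests waits indefinitely by default when no timeout is passed, whereas the httplib-based code did set one. A minimal sketch of restoring an explicit timeout; the function name, URL, and 120-second value are illustrative, not the client's actual values.

```python
# A minimal sketch of the issue: requests has no default timeout, so a stalled
# connection can block a log-parser worker indefinitely, which matches the
# "workers hanging / disconnecting" symptom. Names and values are illustrative.
import requests


def post_jobs(url, payload, timeout=120):
    """Submit a payload with an explicit timeout so hung connections fail fast."""
    # Without timeout=..., requests.post() will wait forever on an unresponsive
    # server instead of raising, leaving the worker stuck.
    response = requests.post(url, json=payload, timeout=timeout)
    response.raise_for_status()
    return response


# Example usage (hypothetical endpoint):
# post_jobs("https://treeherder.example.org/api/jobs/", {"jobs": []})
```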
Log parsers aren't hanging any more, no apdex alerts since, and log backlog down from 80,000 to 43,000 jobs. Though interestingly we're still not maxing out the CPU on the log processor nodes, but oh well.

(In reply to Ed Morley [:emorley] from comment #3)
> Log parsers aren't hanging any more, no apdex alerts since, and log backlog
> down from 80,000 to 43,000 jobs.

That's good to know. So my vote would probably be to back out the requests (437a3c379) and the log suggestions (01d54ee72a) commits and see if that works OK. If it does (and I don't see why it wouldn't), I'd say we're probably good to go.

I would dispute that moving treeherder-client into the repo makes backouts more of a pain -- I'd argue that it's easier, because we can try removing problematic commits from treeherder directly, rather than being forced to upgrade/downgrade coarse-grained versions.

> Though interestingly we're still not maxing
> out the CPU on the log processor nodes, but oh well.

We never were. I think we might actually be somewhat over-provisioned on log parsing under normal circumstances.

(In reply to William Lachance (:wlach) from comment #4)
> That's good to know. So my vote would probably be to backout the requests
> (437a3c379) and the log suggestions (01d54ee72a) commits and see if that
> works ok. If it does (and I don't see why it wouldn't), I'd say we're
> probably good to go.

01d54ee72a is not the cause - that's never been deployed to stage. The newest commit that's been pushed to stage at any point is b0a0ea7 (and I've since rolled back to 9e29351, which is fine).

The culprit is more like 3fa859e (and we'd also have to back out bfef91b for the tests, plus the followup fix 77fae37).

> I would dispute that moving treeherder-client into the repo makes backouts
> more of a pain -- I'd argue that it's easier, because we can try removing
> problematic commits from treeherder directly, rather than being forced to
> upgrade/downgrade coarse-grained versions.

Not in general, but in this specific case. There are now conflicts with the backouts, given how many files had to be touched; just unfortunate timing. 

> > Though interestingly we're still not maxing
> > out the CPU on the log processor nodes, but oh well.
> 
> We never were. I think we might actually be somewhat over provisioned on log
> parsing under normal circumstances.

Yeah I realise we never were - my point is more that during the meeting the other week we theorised that the recent log parsing improvements would only be seen to have a major impact when we were backlogged. We're backlogged at the moment and are still using <50% CPU.

> (In reply to William Lachance (:wlach) from comment #4)
> > That's good to know. So my vote would probably be to backout the requests
> > (437a3c379) and the log suggestions (01d54ee72a) commits and see if that
> > works ok. If it does (and I don't see why it wouldn't), I'd say we're
> > probably good to go.
> 
> 01d54ee72a is not the cause - that's never been deployed to stage. The
> newest commit that's been pushed to stage at any point is b0a0ea7 (and I've
> since rolled back to 9e29351, which is fine).
> 
> The culprit is more like 3fa859e (and we'd also have to back out bfef91b for
> the tests, plus the followup fix 77fae37).

I still think we'd want to pull out 01d54ee72a if you want to promote to master soon, just because it's unproven. We can always reapply it right after.

> > > Though interestingly we're still not maxing
> > > out the CPU on the log processor nodes, but oh well.
> > 
> > We never were. I think we might actually be somewhat over provisioned on log
> > parsing under normal circumstances.
> 
> Yeah I realise we never were - my point is more that during the meeting the
> other week we theorised that the recent log parsing improvements would only
> be seen to have a major impact when we were backlogged. We're backlogged at
> the moment and are still using <50% CPU.

Ah I see. Well, feel free to experiment with the concurrency setting in the log parser if you're so inclined. :) I'd wager we could easily handle a setting of 10 or 20.

https://github.com/mozilla/treeherder/blob/master/bin/run_celery_worker_log_parser#L27
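For illustration, the concurrency knob being referred to is the Celery worker pool size; a minimal sketch of raising it, where the queue name, broker URL, and exact setting spelling are assumptions (the deployed script sets it via a command-line flag on the worker invocation):

```python
# Hypothetical sketch of raising log-parser worker concurrency. In the deployed
# script this is set on the celery worker command line, along the lines of:
#   celery worker -Q log_parser --concurrency=10
# (queue name illustrative). The rough Python-config equivalent:
from celery import Celery

app = Celery("treeherder-sketch", broker="amqp://localhost//")

# Number of prefork worker processes per node; the <50% CPU observed while
# backlogged suggests 10-20 may be comfortably within capacity.
app.conf.update(worker_concurrency=10)  # older Celery versions: CELERYD_CONCURRENCY
```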
Blocks: 1080760