Closed Bug 1160856 Opened 9 years ago Closed 9 years ago

Stage is suffering from high web transaction response time & log parser backlogs

Component: Tree Management :: Treeherder: Infrastructure (defect, P1)
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: emorley; Assignee: emorley
We're seeing a few issues on stage that need sorting out before we can push to prod. We also need to do a prod push ASAP, since we need esr38 repo support, and I'd prefer not to have to cherry-pick.

We're seeing:
* Higher web API response times (not surprising given the log parser changes, but I think we need to reconsider the approach)
* More worryingly, spikes in web API response times and resultant Apdex warnings (https://docs.newrelic.com/docs/apm/new-relic-apm/apdex/apdex-measuring-user-satisfaction; a quick sketch of how the score is computed follows this list)
* Even more worryingly, the log parsers are disconnecting repeatedly, and this morning there was a backlog of ~80,000 logs.
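For reference, the Apdex score New Relic alerts on is the standard Apdex formula: satisfied requests plus half the tolerating ones, divided by the total. A small illustrative computation; the 0.5s threshold and sample response times below are made up, and the real threshold T is whatever is configured for the app in New Relic:

```python
# Illustrative Apdex computation (standard formula). T is the configured
# response-time threshold; the 0.5s default here is just an example.
def apdex(response_times_secs, t=0.5):
    """Apdex = (satisfied + tolerating / 2) / total, where
    satisfied: response time <= T, tolerating: T < response time <= 4T."""
    total = len(response_times_secs)
    satisfied = sum(1 for rt in response_times_secs if rt <= t)
    tolerating = sum(1 for rt in response_times_secs if t < rt <= 4 * t)
    return (satisfied + tolerating / 2.0) / total if total else 1.0


# Example: a few slow spikes drag the score below the 0.7 alert threshold.
print(apdex([0.2, 0.3, 0.4, 1.2, 2.5, 3.0]))  # => ~0.58
```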

Several things have landed in quick succession:
1) Changing how we parse logs (moving bug suggestion generation from the log processors to the web API nodes)
2) Switching treeherder-client to use requests rather than httplib (possibly with different timeout behaviour?)
3) Moving treeherder-client into the service repo (not directly risky in itself, but it makes backouts more of a pain)

I think we should back out our way to victory, and soon. On previous occasions we've been more lax about leaving breakage in, and that has either held up deploys or meant we pushed anyway and residual bugs were still unassigned some time later.
From my email on Friday:

"""
New Relic alerts that started today ("Alert open: Apdex score < 0.7 on 'treeherder-stage'")

These alerts started with today's stage deploy, which as far as I can tell only contains the treeherder-client update:
https://github.com/mozilla/treeherder/compare/9e29351...b0a0ea7
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.stage/logs/master.1430384791
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.stage/logs/master.1430492897

However, the web transaction load has been higher since yesterday's deploy (unsurprising given the log parser/bug suggestion changes):
https://rpm.newrelic.com/accounts/677903/applications/5585473?tw[end]=1430502214&tw[start]=1430243014

I wonder if we need to make the bug suggestions async after all? (And add appropriate handling so the API/UI copes with the bug suggestions not yet being complete.)
"""
To rule out processes getting stuck and not picking up changes, I pushed 9e29351 to stage (i.e. what was on stage until now, minus https://github.com/mozilla/treeherder/commit/b0a0ea7a8c71e5e887c2e0b450aff270bb83c875) to see if the Apdex warnings go away.

I suspect they will, since when we switched to requests, we stopped specifying a timeout:
https://github.com/mozilla/treeherder/commit/437a3c379a6a1ec545e348763dd141e67af64217#diff-8f20c700b32fc2a7be610d1089eb9eb4L744
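For context on the timeout point: requests waits indefinitely by default when no timeout is passed, whereas the httplib-based code did set one. A minimal sketch of restoring an explicit timeout; the function name, URL, and 120-second value are illustrative, not the client's actual values.

```python
# A minimal sketch of the issue: requests has no default timeout, so a stalled
# connection can block a log-parser worker indefinitely, which matches the
# "workers hanging / disconnecting" symptom. Names and values are illustrative.
import requests


def post_jobs(url, payload, timeout=120):
    """Submit a payload with an explicit timeout so hung connections fail fast."""
    # Without timeout=..., requests.post() will wait forever on an unresponsive
    # server instead of raising, leaving the worker stuck.
    response = requests.post(url, json=payload, timeout=timeout)
    response.raise_for_status()
    return response


# Example usage (hypothetical endpoint):
# post_jobs("https://treeherder.example.org/api/jobs/", {"jobs": []})
```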
Log parsers aren't hanging any more, no apdex alerts since, and log backlog down from 80,000 to 43,000 jobs. Though interestingly we're still not maxing out the CPU on the log processor nodes, but oh well.

(In reply to Ed Morley [:emorley] from comment #3)
> Log parsers aren't hanging any more, no apdex alerts since, and log backlog
> down from 80,000 to 43,000 jobs.

That's good to know. So my vote would probably be to back out the requests (437a3c379) and the log suggestions (01d54ee72a) commits and see if that works OK. If it does (and I don't see why it wouldn't), I'd say we're probably good to go.

I would dispute that moving treeherder-client into the repo makes backouts more of a pain -- I'd argue that it's easier, because we can try removing problematic commits from treeherder directly, rather than being forced to upgrade/downgrade coarse-grained versions.

> Though interestingly we're still not maxing
> out the CPU on the log processor nodes, but oh well.

We never were. I think we might actually be somewhat over-provisioned on log parsing under normal circumstances.

(In reply to William Lachance (:wlach) from comment #4)
> That's good to know. So my vote would probably be to backout the requests
> (437a3c379) and the log suggestions (01d54ee72a) commits and see if that
> works ok. If it does (and I don't see why it wouldn't), I'd say we're
> probably good to go.

01d54ee72a is not the cause - that's never been deployed to stage. The newest commit that's been pushed to stage at any point is b0a0ea7 (and I've since rolled back to 9e29351, which is fine).

The culprit is more like 3fa859e (and we'd also have to back out bfef91b for the tests, plus the followup fix 77fae37).

> I would dispute that moving treeherder-client into the repo makes backouts
> more of a pain -- I'd argue that it's easier, because we can try removing
> problematic commits from treeherder directly, rather than being forced to
> upgrade/downgrade coarse-grained versions.

Not in general, but in this specific case. There are now conflicts with the backouts, given how many files had to be touched; just unfortunate timing. 

> > Though interestingly we're still not maxing
> > out the CPU on the log processor nodes, but oh well.
> 
> We never were. I think we might actually be somewhat over provisioned on log
> parsing under normal circumstances.

Yeah I realise we never were - my point is more that during the meeting the other week we theorised that the recent log parsing improvements would only be seen to have a major impact when we were backlogged. We're backlogged at the moment and are still using <50% CPU.

> (In reply to William Lachance (:wlach) from comment #4)
> > That's good to know. So my vote would probably be to backout the requests
> > (437a3c379) and the log suggestions (01d54ee72a) commits and see if that
> > works ok. If it does (and I don't see why it wouldn't), I'd say we're
> > probably good to go.
> 
> 01d54ee72a is not the cause - that's never been deployed to stage. The
> newest commit that's been pushed to stage at any point is b0a0ea7 (and I've
> since rolled back to 9e29351, which is fine).
> 
> The culprit is more like 3fa859e (and we'd also have to back out bfef91b for
> the tests, plus the followup fix 77fae37).

I still think we'd want to pull out 01d54ee72a if you want to promote to master soon, just because it's unproven. We can always reapply it right after.

> > > Though interestingly we're still not maxing
> > > out the CPU on the log processor nodes, but oh well.
> > 
> > We never were. I think we might actually be somewhat over provisioned on log
> > parsing under normal circumstances.
> 
> Yeah I realise we never were - my point is more that during the meeting the
> other week we theorised that the recent log parsing improvements would only
> be seen to have a major impact when we were backlogged. We're backlogged at
> the moment and are still using <50% CPU.

Ah I see. Well, feel free to experiment with the concurrency setting in the log parser if you're so inclined. :) I'd wager we could easily handle a setting of 10 or 20.

https://github.com/mozilla/treeherder/blob/master/bin/run_celery_worker_log_parser#L27
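For illustration, the concurrency knob being referred to is the Celery worker pool size; a minimal sketch of raising it, where the queue name, broker URL, and exact setting spelling are assumptions (the deployed script sets it via a command-line flag on the worker invocation):

```python
# Hypothetical sketch of raising log-parser worker concurrency. In the deployed
# script this is set on the celery worker command line, along the lines of:
#   celery worker -Q log_parser --concurrency=10
# (queue name illustrative). The rough Python-config equivalent:
from celery import Celery

app = Celery("treeherder-sketch", broker="amqp://localhost//")

# Number of prefork worker processes per node; the <50% CPU observed while
# backlogged suggests 10-20 may be comfortably within capacity.
app.conf.update(worker_concurrency=10)  # older Celery versions: CELERYD_CONCURRENCY
```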
Blocks: 1080760