Remove as much use of datasource as possible from the log parsing tasks and autoclassification code

RESOLVED FIXED


People

(Reporter: jgraham, Assigned: jgraham)


Description

2 years ago
Now that we have information about jobs in the main database, we can start using it directly for orchestrating the log parsing tasks, rather than using the repository/project-specific id/job guid information.
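A rough sketch of the shift in task input this describes (function names here are hypothetical, not treeherder's real ones): the old tasks needed the project/repository plus the job guid to locate a job in a per-project datasource, whereas with jobs in the main database a single numeric id suffices.

```python
# Hypothetical sketch; the real tasks live in treeherder's task modules.

def parse_log_old(project, job_guid):
    # Old style: resolve the per-project datasource, then find the job by guid.
    return {"project": project, "job_guid": job_guid}

def parse_log_new(job_id):
    # New style: the job id alone identifies the row in the main database.
    return {"job_id": job_id}
```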
Created attachment 8802060 [details] [review]
[treeherder] mozilla:log_tasks_job_id > mozilla:master
Updated

2 years ago
Attachment #8802060 - Flags: review?(wlachance)
Comment on attachment 8802060 [details] [review]
[treeherder] mozilla:log_tasks_job_id > mozilla:master

Sorry, meant to r+ this earlier.
Attachment #8802060 - Flags: review?(wlachance) → review+

Comment 3

2 years ago
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/d38acb67ab7e1685ef3404859a59683cbca075f0
Bug 1310972 - Remove parse_job_logs function.

We now parse logs one at a time, so this isn't used.

https://github.com/mozilla/treeherder/commit/3a7b5dc651c643accdae7319890bef6b24b679c9
Bug 1310972 - Use job id as input to log parsing tasks

Now that log information is in the main treeherder database use the id
information there as the input to the log parsing and autoclassification
tasks.

As part of the same change there is a mild refactor of the
autoclassification-related tasks so they no longer go through the
call_command infrastructure, which was just unnecessary overhead.
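A before/after sketch of dropping that indirection (names are illustrative, not treeherder's actual ones): routing through Django's management-command layer only re-parses arguments the task already has, so the task body can call the function directly.

```python
# Hypothetical sketch of removing the call_command indirection.

def autoclassify(job_id):
    """The function doing the actual classification work."""
    return f"classified {job_id}"

# Before (sketch): the celery task round-tripped through Django's
# management-command layer, which just re-parses its arguments:
#     call_command("autoclassify", str(job_id))

# After (sketch): the task body calls the function directly.
def autoclassify_task(job_id):
    return autoclassify(job_id)
```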
Comment 4

2 years ago

Before this is deployed to prod we need to fix it so already-queued tasks aren't burnt:

crossreference-error-lines
exceptions:TypeError
crossreference_error_lines() takes exactly 1 argument (2 given)

store-failure-lines
exceptions:TypeError
inner() takes exactly 2 arguments (4 given)

log-parser
exceptions:TypeError
inner() takes exactly 2 arguments (4 given)

autoclassify
exceptions:TypeError
autoclassify() takes exactly 1 argument (2 given)

e.g.:
https://rpm.newrelic.com/accounts/677903/applications/14179733/filterable_errors#/show/aba08b5b-9616-11e6-9c7c-b82a72d22a14_15754_19345/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=p9c0zd
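The failure mode above can be reproduced in miniature (a sketch, with hypothetical values): messages already sitting in the queue carry the old (project, job_guid) argument pair, while the newly deployed task function accepts only a job id.

```python
def autoclassify(job_id):
    # Newly deployed signature: a single positional argument.
    return job_id

# An already-queued message still carries the old two-argument payload:
old_args = ("mozilla-central", "abc123")  # (project, job_guid)

try:
    autoclassify(*old_args)
except TypeError as exc:
    # Python 2 of the era phrased this "takes exactly 1 argument (2 given)",
    # matching the New Relic traces above; Python 3 words it slightly
    # differently, but the TypeError is the same.
    print(exc)
```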

As a one-off workaround, I guess we could temporarily disable the celerybeat+pulse-listener dynos to let the queues empty out whilst this deploys.

We need to be extra careful in the future when making changes like this (and the same goes for schema changes) to make sure that the changes are not only correct, but also backwards- and forwards-compatible for the deploy itself.
Assignee: nobody → james
Comment 5

2 years ago
> As a one-off workaround, I guess we could temporarily disable the celerybeat+pulse-listener dynos to let the queues empty out whilst this deploys.

If we think that will be acceptable, it's the easiest way to get this landed.

> We need to be extra careful in the future when making changes like this (and same for schema changes) to make sure that the changes are not only correct, but are backwards-forwards compatible for the deploy itself.

Yeah, I think the right thing for breaking changes is to add version numbers to the queues: add new tasks with an incremented version number in one patch, then remove the old queues in a second patch once the tasks are drained. I can rewrite this patch along those lines if you think it's a better idea.
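The queue-versioning idea could look something like this (a minimal sketch under assumed names; treeherder's actual celery routing is configured elsewhere): bump the version in the patch that changes a task signature, so old workers keep draining the old queue while new workers consume from the new one.

```python
# Hypothetical sketch: version the queue name so a breaking signature change
# lands in a fresh queue instead of hitting already-queued old messages.
QUEUE_VERSION = 2

def queue_name(base):
    return f"{base}_v{QUEUE_VERSION}"

TASK_ROUTES = {
    "log-parser": {"queue": queue_name("log_parser")},
    "autoclassify": {"queue": queue_name("autoclassify")},
}
```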
Comment 6

2 years ago

For today's prod deploy I:
* reduced the worker_beat and worker_read_pulse_jobs dynos to zero
* waited until the rabbitmq admin panel queue counts reached zero
* started the deploy
* waited for the release phase tasks to run and new code to be pushed out
* returned the worker_beat and worker_read_pulse_jobs dyno counts to 1
* checked NR for errors
* checked queues caught up using rabbitmq admin panel

All looks fine, no exceptions shown in NR.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED