We don't clear expiration dates

Status: NEW
Assignee: Unassigned
Product: Tree Management
Component: Treeherder: SETA
Priority: P3
Severity: normal
Opened: 7 months ago
Updated: a month ago
Reporter: armenzg

We have 343 jobs that have an expiration date in the past:
https://sql.telemetry.mozilla.org/queries/34879/source

jmaher: could you please help me determine which jobs I should *not* clear?
I assume any talos jobs should be part of preseed.json and use the 2100 date.

We will also have to prevent this from happening in the future.

>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0))[0].expiration_date
datetime.datetime(2017, 8, 13, 0, 2, 46, 419828)
The data from Redash could be out of date. There are a few more jobs:
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=1).count()
373
I looked over those jobs; we should mark them as P5, at the very least all the devedition failures. I am sure there is a small bug related to expiration; hopefully it is easy to find and fix.
I fixed them all. [1]

Here's the logic to clear the expiration date for old jobs:
https://github.com/mozilla/treeherder/blob/c1dfddc5f604b9948f712498f04569c1426f5782/treeherder/seta/models.py#L14-L20
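
As a rough illustration of what clearing expiration dates involves (this is not the code behind the link above), here is a minimal sketch. `Job`, `clear_expired` and the sample job names are hypothetical, the records are in-memory stand-ins for JobPriority rows, and treating the 2100-12-31 date as "never clear" is an assumption based on the queries in this bug.

```python
import datetime
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the 2100-12-31 date excluded by the queries in this bug marks
# jobs whose expiration date should never be cleared.
NEVER_EXPIRES = datetime.datetime(2100, 12, 31, 0, 0)


@dataclass
class Job:
    """Hypothetical in-memory stand-in for a JobPriority row."""
    name: str
    priority: int
    expiration_date: Optional[datetime.datetime]


def clear_expired(jobs: List[Job], now: datetime.datetime) -> int:
    """Clear expiration_date on jobs whose date is in the past; return the count."""
    cleared = 0
    for job in jobs:
        # Skip jobs without a date and jobs pinned to the sentinel date.
        # (The sentinel check is redundant with the comparison below, but
        # keeps the intent explicit.)
        if job.expiration_date is None or job.expiration_date == NEVER_EXPIRES:
            continue
        if job.expiration_date < now:
            job.expiration_date = None
            cleared += 1
    return cleared


# Sample data: only the first job has a date in the past.
jobs = [
    Job("expired-job", 1, datetime.datetime(2017, 8, 13, 0, 2, 46)),
    Job("pinned-job", 1, NEVER_EXPIRES),
    Job("no-date-job", 5, None),
]
cleared = clear_expired(jobs, now=datetime.datetime(2017, 9, 12))
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes this kind of logic easy to test with fixed dates.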

It seems we have the logic to run this once in a while (when analyze failures runs):
https://github.com/mozilla/treeherder/blob/4fcd020fa08987ed5c9a6d5e45dd0475a46f159f/treeherder/seta/analyze_failures.py#L22-L36

We even have tests for it:
https://github.com/mozilla/treeherder/blob/6d6f904317c86c776ff3fdfa4c2d085c003f38c9/tests/seta/test_models.py#L14-L21

It seems analyze failures is having trouble:
https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors#/show/20b24c8b-974f-11e7-86ce-0242ac110011_8853_13493/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=vmji30
> django.db.utils:OperationalError: (3024, 'Query execution was interrupted, maximum statement execution time exceeded')
> ...
> File "/app/treeherder/workers/task.py", line 43, in inner
> File "/app/treeherder/seta/tasks.py", line 8, in seta_analyze_failures
> File "/app/treeherder/seta/analyze_failures.py", line 27, in run
> File "/app/treeherder/seta/analyze_failures.py", line 82, in get_failures_fixed_by_commit
> File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 53, in __iter__
> File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 894, in execute_sql

Maybe we should separate the logic to clear the expiration dates to here:
https://github.com/mozilla/treeherder/blob/master/treeherder/seta/update_job_priority.py#L158
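
The point of separating the two is that a timeout in the failure analysis should not stop the expiration clean-up from running. A minimal sketch of that decoupling, assuming nothing about the real task code: `run_independently` and both step functions below are hypothetical stand-ins, and the raised error only imitates the one in the stack trace.

```python
from typing import Callable, Dict, List, Tuple


def run_independently(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, str]:
    """Run every step, recording failures instead of letting one abort the rest."""
    results = {}
    for name, step in steps:
        try:
            step()
            results[name] = "ok"
        except Exception as exc:  # broad on purpose: record the error and move on
            results[name] = "failed: %s" % exc
    return results


def analyze_failures() -> None:
    # Hypothetical stand-in for the query that timed out.
    raise RuntimeError("maximum statement execution time exceeded")


cleared = []


def clear_expiration_dates() -> None:
    # Hypothetical stand-in for the clean-up step.
    cleared.append("done")


results = run_independently([
    ("analyze_failures", analyze_failures),
    ("clear_expiration_dates", clear_expiration_dates),
])
```

Moving the clean-up into a different task, as suggested above, achieves the same isolation without needing a wrapper like this.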

emorley: what process can we have that it would have allowed me to know of this issue sooner? (e.g log alerting or something else)

[1]
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=1).count()
373
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=1).update(priority=5)
373L
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=5).update(expiration_date=None)
373L
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=1)
<QuerySet []>
Flags: needinfo?(emorley)

Comment 4

7 months ago
(In reply to Armen [:armenzg] from comment #3)
> emorley: what process can we have that it would have allowed me to know of
> this issue sooner? (e.g log alerting or something else)

We have alerts set up for higher exception rates on prod, which get sent to treeherder-internal. However, these cover all web and non-web transactions, so they include some very frequent transactions like ingesting a pulse job, parsing a log, or serving one API request. As such, an exception that occurs once or twice a day on a task like this doesn't trigger an alert.
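
To show why an aggregate error rate hides an infrequent task, here is a small worked example. All counts and the first three transaction labels are made up for illustration; they are not real Treeherder or New Relic figures.

```python
def error_rate(errors: int, total: int) -> float:
    """Percentage of transactions that raised an exception."""
    return 100.0 * errors / total if total else 0.0


# Hypothetical daily counts: three high-frequency transactions and one task
# that runs twice a day and fails both times.
daily = {
    "ingest-pulse-job": {"total": 500_000, "errors": 50},
    "parse-log": {"total": 300_000, "errors": 30},
    "api-request": {"total": 1_000_000, "errors": 100},
    "seta-analyze-failures": {"total": 2, "errors": 2},
}

total = sum(t["total"] for t in daily.values())
errors = sum(t["errors"] for t in daily.values())

# Rate across everything: the two failures barely move it.
aggregate = error_rate(errors, total)

# Rate for the infrequent task on its own: every run failed.
seta = daily["seta-analyze-failures"]
seta_rate = error_rate(seta["errors"], seta["total"])

print("aggregate: %.4f%%, seta task alone: %.1f%%" % (aggregate, seta_rate))
```

With these numbers the aggregate rate stays near 0.01% even though the seta task fails on every run, which is why tracking that task separately is needed for it to stand out.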

I added the seta-analyze-failures task as a "key transaction" just now, by finding the transaction info here:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?type=other&show_browser=false#id=5b224f746865725472616e73616374696f6e2f43656c6572792f736574612d616e616c797a652d6661696c75726573222c22225d

...then using the "add as key transaction" link at the top.

This created an entry here:
https://rpm.newrelic.com/accounts/677903/key_transactions#

Currently that entry is under the default key transaction alert policy, here:
https://rpm.newrelic.com/accounts/677903/key_transaction_alert_policies

It's not clear whether the alert policy's percentage error setting would have caught this (e.g. does it need more than one failure per timeframe to trigger?).

An alternative would be to find whatever error output was sent to Papertrail (adding such output if needed), and then set up a Papertrail alert, here:
https://papertrailapp.com/alerts

Other than that, I'm not sure if there's a better approach, given the contrast between the high and low frequency tasks we have within Treeherder (the seta task isn't the only infrequent task we have; there's also cycle-data, calculate-durations, fetch-bugs, etc., which could suffer from the same problem).
Flags: needinfo?(emorley)

Comment 5

7 months ago
It's worth noting that the seta task has been timing out on and off for several months now (bug 1368982).

I'll add another comment over there with more info.
Priority: -- → P3