Closed
Bug 1399106
Opened 7 years ago
Closed 4 years ago
We don't clear expiration dates
Categories
(Tree Management Graveyard :: Treeherder: SETA, enhancement, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: armenzg, Unassigned)
Details
We have 343 jobs that have an expiration date in the past: https://sql.telemetry.mozilla.org/queries/34879/source

jmaher: could you please help me determine which jobs I should *not* clear? I assume any talos jobs should be part of preseed.json and use the 2100 date.

We will also have to prevent this from happening in the future.

>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0))[0].expiration_date
datetime.datetime(2017, 8, 13, 0, 2, 46, 419828)
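The filter/exclude chain above selects rows whose expiration date is set but is not the 2100-12-31 preseed sentinel. As a hedged illustration, the same condition can be sketched outside the ORM with plain dicts (the sample rows and `stale_priorities` helper below are hypothetical stand-ins, not real data or Treeherder code):

```python
from datetime import datetime

# Sentinel date used for preseed.json jobs, per this bug: 2100-12-31.
PRESEED_DATE = datetime(2100, 12, 31)

def stale_priorities(job_priorities, now):
    """Return entries whose expiration_date is set, is not the preseed
    sentinel, and already lies in the past -- the same condition as the
    filter(expiration_date__isnull=False).exclude(...) chain above."""
    return [
        jp for jp in job_priorities
        if jp["expiration_date"] is not None
        and jp["expiration_date"] != PRESEED_DATE
        and jp["expiration_date"] < now
    ]

# Hypothetical sample rows mirroring the shapes seen in the query output.
rows = [
    {"testtype": "talos-g1", "expiration_date": PRESEED_DATE},
    {"testtype": "mochitest-1", "expiration_date": datetime(2017, 8, 13, 0, 2, 46)},
    {"testtype": "reftest-2", "expiration_date": None},
]
stale = stale_priorities(rows, now=datetime(2017, 9, 12))
# stale contains only the mochitest-1 entry
```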
Reporter
Comment 1•7 years ago
The data from redash could be out of date; there are a few more jobs:
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=1).count()
373
Comment 2•7 years ago
I looked over those jobs; we should mark them as p5, at the very least all the devedition failures. I am sure there is a small bug related to expiration; hopefully that is easy to find and fix.
Reporter
Comment 3•7 years ago
I fixed them all. [1]

Here's the logic to clear the expiration date for old jobs:
https://github.com/mozilla/treeherder/blob/c1dfddc5f604b9948f712498f04569c1426f5782/treeherder/seta/models.py#L14-L20

It seems we have the logic to run this once in a while (when analyze failures runs):
https://github.com/mozilla/treeherder/blob/4fcd020fa08987ed5c9a6d5e45dd0475a46f159f/treeherder/seta/analyze_failures.py#L22-L36

We even have tests for it:
https://github.com/mozilla/treeherder/blob/6d6f904317c86c776ff3fdfa4c2d085c003f38c9/tests/seta/test_models.py#L14-L21

It seems analyze failures is having trouble:
https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors#/show/20b24c8b-974f-11e7-86ce-0242ac110011_8853_13493/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=vmji30

> django.db.utils:OperationalError: (3024, 'Query execution was interrupted, maximum statement execution time exceeded')
> ...
> File "/app/treeherder/workers/task.py", line 43, in inner
> File "/app/treeherder/seta/tasks.py", line 8, in seta_analyze_failures
> File "/app/treeherder/seta/analyze_failures.py", line 27, in run
> File "/app/treeherder/seta/analyze_failures.py", line 82, in get_failures_fixed_by_commit
> File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 53, in __iter__
> File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 894, in execute_sql

Maybe we should separate the logic to clear the expiration dates to here:
https://github.com/mozilla/treeherder/blob/master/treeherder/seta/update_job_priority.py#L158

emorley: what process could we have that would have allowed me to know of this issue sooner? (e.g. log alerting or something else)

[1]
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=1).count()
373
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=1).update(priority=5)
373L
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=5).update(expiration_date=None)
373L
>>> JobPriority.objects.filter(expiration_date__isnull=False).exclude(expiration_date=datetime.datetime(2100, 12, 31, 0, 0)).filter(priority=1)
<QuerySet []>
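The two bulk `update()` calls in [1] first demote the stale rows to priority 5 and then null out their expiration dates. The combined effect can be sketched in plain Python (the dict rows and `clear_expired` helper are hypothetical illustrations; the real logic lives in the linked `models.py`):

```python
from datetime import datetime

# Sentinel date kept for preseed.json jobs (from this bug).
PRESEED_DATE = datetime(2100, 12, 31)

def clear_expired(job_priorities, now):
    """For rows whose non-sentinel expiration date has passed, drop the
    priority to 5 and clear the expiration date; return how many rows
    changed -- mirroring the two update() calls shown in [1]."""
    changed = 0
    for jp in job_priorities:
        exp = jp["expiration_date"]
        if exp is not None and exp != PRESEED_DATE and exp < now:
            jp["priority"] = 5
            jp["expiration_date"] = None
            changed += 1
    return changed

rows = [
    {"priority": 1, "expiration_date": datetime(2017, 8, 13)},
    {"priority": 1, "expiration_date": PRESEED_DATE},  # preseed: untouched
]
n = clear_expired(rows, now=datetime(2017, 9, 12))
# n == 1; the first row is now priority 5 with no expiration date
```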
Flags: needinfo?(emorley)
Comment 4•7 years ago
(In reply to Armen [:armenzg] from comment #3)
> emorley: what process can we have that it would have allowed me to know of
> this issue sooner? (e.g log alerting or something else)

We have alerts set up for higher exception rates on prod, which get sent to treeherder-internal. However, these include all web and non-web transactions, so they include some very frequent transactions like ingesting a pulse job, parsing a log, or serving an API request. As such, a once- or twice-a-day exception on a task like this doesn't trigger an alert.

I added the seta-analyse-tasks task as a "key transaction" just now, by finding the transaction info here:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?type=other&show_browser=false#id=5b224f746865725472616e73616374696f6e2f43656c6572792f736574612d616e616c797a652d6661696c75726573222c22225d
...then using the "add as key transaction" link at the top. This created an entry here:
https://rpm.newrelic.com/accounts/677903/key_transactions#

Currently that entry is under the default key transaction alert policy, here:
https://rpm.newrelic.com/accounts/677903/key_transaction_alert_policies
It's not clear whether the alert policy's percentage error setting would have caught this (e.g. does it need more than one failure occurrence per timeframe to trigger?).

An alternative would be to find whatever error output was sent to Papertrail (adding such output if needed), and then set up a Papertrail alert, here:
https://papertrailapp.com/alerts

Other than that, I'm not sure there's a better approach, given the contrast between the high- and low-frequency tasks we have within Treeherder (the seta task isn't the only infrequent task we have; there's also cycle-data, calculate-durations, fetch-bugs, etc., which could suffer from the same).
Flags: needinfo?(emorley)
Comment 5•7 years ago
It's worth noting that the seta task has been timing out on and off for several months now (bug 1368982). I'll add another comment over there with more info.
Reporter
Updated•6 years ago
Priority: -- → P3
Reporter
Comment 6•4 years ago
We're moving away from SETA; wontfixing.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Updated•4 years ago
Product: Tree Management → Tree Management Graveyard