Open Bug 1180732 Opened 9 years ago Updated 2 years ago

[tracker] Automatic backfilling

Categories: Testing :: General, defect

Tracking: Not tracked

People: Reporter: armenzg, Unassigned

References: Depends on 2 open bugs, Blocks 1 open bug

Attachments: 3 files

By listening to trees for finished *test* jobs, we should be able to re-trigger a job a couple of times (similar to what trigger bot does on try) and backfill back to the last known good run.

This will be added to pulse_actions.
https://github.com/adusca/pulse_actions
adusca, what is the latest on pulse_actions having multiple pulse consumers?
adusca is going to talk to Kyle on how to determine which jobs are always failing through active data.

Adding chmanchester to keep in the loop since he had the same issue for triggerbot.
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #2)
> adusca is going to talk to Kyle on how to determine which jobs are always
> failing through active data.
> 
> Adding chmanchester to keep in the loop since he had the same issue for
> triggerbot.

The proposed solution for this is to use treeherder to get visibility, as in https://github.com/chmanchester/trigger-bot/pull/4
TODO:
* Determine plan on how to deal with changesets that are full of failures
This csv contains a list of everything that would be triggered on Jul 21st between 0:00 and 19:00. The file has ~2400 lines representing ~1300 requests (some requests take 3 lines of logging, some take just 1).
Some of the requests in the other csv (the ones of the form "We would make a POST ...") were not very easy to read, so I generated another csv with the buildernames and revisions for those request_ids.
That sounds like 700 jobs on inbound for almost a day. That is not bad at all.

What is left to remove dry_run mode?
We need to write up clearly what this system does.

Could you please document it in here?
https://wiki.mozilla.org/Auto-tools/Projects/Automatic_backfilling

We will also need to publicize this and coordinate with sheriffs.
We could enable it for a few hours on one day, disable it again and ask sheriffs to give us their feedback.
The csv table slightly overestimates what we would trigger, because mozci guarantees that we have at most one job of each kind and there are a lot of duplicates in the table. In dry-run mode the job doesn't actually get triggered the first time, so on later passes we log again that we would trigger it.
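
To get a number closer to what would really be triggered, the duplicates can be collapsed before counting. This is a minimal sketch that assumes a csv with one `buildername,revision` pair per row and no header; the column layout and the function name are assumptions for illustration, not the actual format of the attached files.

```python
import csv
import io


def count_unique_triggers(csv_text):
    """Count distinct (buildername, revision) pairs in dry-run csv output.

    Assumed layout: two columns per row, buildername then revision.
    Rows with fewer than two columns are ignored.
    """
    reader = csv.reader(io.StringIO(csv_text))
    unique_pairs = {(row[0], row[1]) for row in reader if len(row) >= 2}
    return len(unique_pairs)


# Hypothetical sample: the same job is logged twice for the same revision.
sample = (
    "Windows 7 32-bit mozilla-inbound opt test mochitest-2,e2e7ff226eca\n"
    "Windows 7 32-bit mozilla-inbound opt test mochitest-2,e2e7ff226eca\n"
    "Windows 7 32-bit mozilla-inbound opt test mochitest-2,b92398507d91\n"
)
unique_count = count_unique_triggers(sample)
```

Counting pairs rather than lines mirrors the "at most one job of each" guarantee described above.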

Steps missing to turn this on:

1. Decide if we need to blacklist some jobs. This may involve implementing a check for whether a job is hidden on TH.

2. Decide what to do if there is no build job available to trigger a test job. While the queuing functionality for pulse_actions is not ready, I think it is best to just skip those jobs.
#1 - I think that is an optimization at this moment. Let's document it on the wiki and follow up on it.
#2 - Let's skip it for now and document it.

I would like to see this running somewhere live even if we're not completely optimal (minimum viable product).

Works for you?
Email sent to sheriffs:
#######################
We are planning to turn on a service that automatically backfills failed test jobs on m-i.
If there are no concerns, we would like to turn this on experimentally for a couple of hours on Monday. We hope this will make it easier to identify which revision broke a test. Suggestions are welcome.
The backfilling works like this:
- It triggers the job that failed one extra time
- Then it looks for a successful run of the job on the previous 5 revisions. If a good run is found, it triggers the job only on revisions up to this good run. If not, it triggers the job on every one of the previous 5 revisions. Previous jobs will be triggered one time.
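
The two steps in the email can be sketched as follows. This is an illustration only, not the pulse_actions code: `backfill_plan`, `has_good_run` and the constant are hypothetical names, and it assumes the revision holding the good run is itself not re-triggered.

```python
from typing import Callable, List, Tuple

# The email says the service looks at the previous 5 revisions.
MAX_PREVIOUS_REVISIONS = 5


def backfill_plan(
    failed_revision: str,
    previous_revisions: List[str],
    has_good_run: Callable[[str], bool],
    max_previous: int = MAX_PREVIOUS_REVISIONS,
) -> List[Tuple[str, int]]:
    """Return (revision, number_of_triggers) pairs for one failed job.

    previous_revisions must be ordered newest first and must not
    include failed_revision. has_good_run is a hypothetical callback
    that reports whether the job already succeeded on a revision.
    """
    # Step 1: trigger the failed job one extra time.
    plan = [(failed_revision, 1)]
    # Step 2: walk back until a good run is found or the limit is hit.
    for revision in previous_revisions[:max_previous]:
        if has_good_run(revision):
            break  # assumption: the good revision needs no new job
        plan.append((revision, 1))
    return plan


# Hypothetical usage: r3 already has a green run of the job.
history = ["r5", "r4", "r3", "r2", "r1", "r0"]
plan_with_good_run = backfill_plan("r6", history, lambda rev: rev == "r3")
plan_without_good_run = backfill_plan("r6", history, lambda rev: False)
```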
This made for hours of hilarity today when Windows mochitest-2 went permafail and I was stuck starring auto-retriggers on old pushes for hours on end.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20mochitest-2&fromchange=e2e7ff226eca&tochange=b92398507d91
Attached image [screenshot]
1) a perma failure is introduced
2) we re-triggered in some cases up to 6 times (6 orange jobs)

In order to mitigate this, I suggest we do this:
* When an orange job finishes, we don't retrigger it any more times
** I think we now re-trigger 1 or 2 times
** This way we *only* backfill revisions that have that missing job

This will prevent having more than 1 job on any revision due to automatic backfilling.
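
A minimal sketch of that rule, assuming we can look up how many runs of the job (in any state) each revision already has. The function name and the `job_counts` mapping are hypothetical, not part of pulse_actions or mozci.

```python
from typing import Dict, List


def revisions_missing_job(revisions: List[str], job_counts: Dict[str, int]) -> List[str]:
    """Return only the revisions that have no run of the job at all.

    job_counts maps a revision to the number of existing runs of the
    job (pending, running, green or orange). Revisions absent from the
    mapping are treated as having zero runs.
    """
    return [rev for rev in revisions if job_counts.get(rev, 0) == 0]


# Hypothetical usage: r5 already has an orange run, r4 and r3 have none.
to_backfill = revisions_missing_job(["r5", "r4", "r3"], {"r5": 1})
```

Because a revision with any existing run is skipped, automatic backfilling by itself never adds a second job to a revision.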

In the longer term, we can re-evaluate and probably start looking into using active data to do smarter analysis.
See Also: → 1194213
Depends on: 1194213
Depends on: 1195809
Depends on: 1195821
Depends on: 1195824
Depends on: 1195851
Status summary
##############
We have automated backfilling working only for mozilla-inbound.
Before we go ahead and enable it in more places, we will need to address some of these to prevent automated backfilling from causing harm (extra jobs) when we hit edge cases.

* Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout" and *not* backfilling
* Bug 1195821 - Automatic backfilling should use follow some rules to decide when not to backfill
* bug 1195824 - Automatic backfilling should deal better with perma failures
* bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavely back logged
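
As an illustration of the first item (bug 1195809), a check over the comments of more recent pushes could look like the sketch below. The function name and the idea of passing push comments as plain strings are assumptions; the bug does not specify how the comments would be fetched or matched.

```python
import re
from typing import Iterable

# Matches "Backed out" or "backout" as whole words, ignoring case.
BACKOUT_PATTERN = re.compile(r"\b(backed out|backout)\b", re.IGNORECASE)


def newer_push_is_backout(push_comments: Iterable[str]) -> bool:
    """Return True if any more recent push comment mentions a backout."""
    return any(BACKOUT_PATTERN.search(comment) for comment in push_comments)


# Hypothetical usage with made-up push comments.
skip_backfill = newer_push_is_backout(
    ["Backed out changeset e2e7ff226eca for win mochitest-2 failures"]
)
```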
Depends on: 1197223
Depends on: 1197238
Summary: Automatic backfilling → [tracker] Automatic backfilling
Depends on: 1198273
Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout"
Bug 1195821 - Automatic backfilling should use follow some rules to decide when not to backfill
Bug 1195824 - Automatic backfilling should deal better with perma failures
Bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavely back logged
Bug 1197223 - Automatic backfilling should respect hidden jobs
Bug 1197238 - Create report per builder on how many jobs are scheduled on a day
Bug 1198273 - Have metrics for scheduling
For anyone contributing, here are the steps to get started:
* Check out https://github.com/adusca/pulse_actions.git
* Set up a virtualenv
* Install the project inside of it with python setup.py develop
* Run python pulse_actions/worker.py --topic-base backfilling --dry-run

A good starter bug is the following, since it would allow using automated backfilling without special credentials:
Bug 1203141 - Teach automatic backfilling to use Treeherder as a source of jobs (instead of BuildApi)
Related pulse_actions bugs that do not block resuming automated backfilling:
Bug 1203146 - Pulse Actions throws an unreadable exception when the proper env variables are not set up
Bug 1203141 - Teach automatic backfilling to use Treeherder as a source of jobs (instead of BuildApi)
Depends on: 1212002
Depends on: 1213308
Depends on: 1210390
In order to re-enable automatic backfilling we will have to fix the following:

Bug 1195824 - Automatic backfilling should deal better with perma failures
Bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavely back logged
Bug 1197223 - Automatic backfilling should respect hidden jobs
To be filed - Automatic backfilling should clear the cache when the list of builders changes due to a reconfig
Bug 1210390 - allthethings.json does not have up-to-date information

Optimization:
Bug 1195821 - Automatic backfilling should use follow some rules to decide when not to backfill
Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout"

Nice to have:
Bug 1197238 - Create report per builder on how many jobs are scheduled on a day
Bug 1198273 - Have metrics for scheduling
Depends on: 1216677
Assignee: alicescarpa → armenzg
We're putting this to the side until we have an automatic starring service to prevent unnecessary load.
Assignee: armenzg → nobody
Severity: normal → S3