Open Bug 1180732 Opened 10 years ago Updated 3 years ago

[tracker] Automatic backfilling

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: armenzg, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

Attachments

(3 files)

things_we_would_trigger.csv 10 years ago Alice Scarpa [:adusca] 525.09 KB, text/csv
requests_with_buildernames.csv 10 years ago Alice Scarpa [:adusca] 84.41 KB, text/csv
[screenshot] 10 years ago Armen [:armenzg] 222.78 KB, image/png

Armen [:armenzg]

Reporter

Description

•

10 years ago

By listening to trees for finished *test* jobs, we should be able to re-trigger a job a couple of times (similar to what trigger bot does on try) and backfill back to the last known good run. This will be added to pulse_actions: https://github.com/adusca/pulse_actions

Armen [:armenzg]

Reporter

Comment 1

•

10 years ago

adusca, what is the latest on pulse_actions having multiple pulse consumers?

Armen [:armenzg]

Reporter

Comment 2

•

10 years ago

adusca is going to talk to Kyle on how to determine which jobs are always failing through active data. Adding chmanchester to keep in the loop since he had the same issue for triggerbot.

Chris Manchester (limited bugmail, email directly)

Comment 3

•

10 years ago

(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #2)
> adusca is going to talk to Kyle on how to determine which jobs are always
> failing through active data.
>
> Adding chmanchester to keep in the loop since he had the same issue for
> triggerbot.

The proposed solution for this is to use treeherder to get visibility, as in https://github.com/chmanchester/trigger-bot/pull/4

Armen [:armenzg]

Reporter

Comment 4

•

10 years ago

TODO:
* Determine a plan for how to deal with changesets that are full of failures

Alice Scarpa [:adusca]

Comment 5

•

10 years ago

Attached file things_we_would_trigger.csv — Details

This csv contains a list of everything that would be triggered on Jul 21st between 0:00 and 19:00. The file has ~2400 lines representing ~1300 requests (some requests take 3 lines of logging, some take just 1).

Alice Scarpa [:adusca]

Comment 6

•

10 years ago

Attached file requests_with_buildernames.csv — Details

Some of the requests in the other csv (the ones of the form "We would make a POST ...") were not very easy to read, so I generated another csv with the buildernames and revisions for those request_ids.

Armen [:armenzg]

Reporter

Comment 7

•

10 years ago

That sounds like 700 jobs on inbound for almost a day. That is not bad at all.

What is left to remove dry_run mode?

We need to write up clearly what this system does. Could you please document it here? https://wiki.mozilla.org/Auto-tools/Projects/Automatic_backfilling

We will also need to publicize this and coordinate with sheriffs. We could enable it for a few hours on a day, disable it again and ask sheriffs to give us their feedback.

Alice Scarpa [:adusca]

Comment 8

•

10 years ago

The csv table slightly overestimates what we would trigger, because mozci guarantees that we have at most one job of each and there are a lot of duplicates in the table. With dry-run the job doesn't get triggered the first time, and because of that we said that we would trigger it again.

Steps missing to turn this on:
1. Decide if we need to blacklist some jobs. This may involve implementing a check for whether a job is hidden on TH.
2. Decide what to do if there is no build job available to trigger a test job. While the queuing functionality for pulse_actions is not ready, I think it is best to just skip those jobs.
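To get a rough idea of how much the duplicates inflate the estimate, something like the following could be run over the second csv. This is only a sketch: the column names `buildername` and `revision` are assumptions (the attachment's real header may differ), and `count_unique_triggers` is a hypothetical helper, not part of mozci or pulse_actions.

```python
import csv
import io


def count_unique_triggers(csv_text):
    """Return (total_rows, unique_jobs) for a csv of would-be triggers.

    Assumes (hypothetically) that the csv has 'buildername' and 'revision'
    columns. Two rows with the same pair count as one job, mirroring the
    "at most one job of each" guarantee described above.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    unique = {(row["buildername"], row["revision"]) for row in rows}
    return len(rows), len(unique)


# Made-up sample data, only to show the shape of the input.
sample = (
    "request_id,buildername,revision\n"
    "1,builder-a,rev1\n"
    "2,builder-a,rev1\n"
    "3,builder-b,rev1\n"
)
total, unique = count_unique_triggers(sample)
print(total, unique)
```

The difference between the two numbers is the part of the dry-run log that would not turn into real jobs.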

Armen [:armenzg]

Reporter

Comment 9

•

10 years ago

#1 - I think that is an optimization at this moment. Let's document it on the wiki and follow up on it.
#2 - Let's skip it for now and document it.

I would like to see this running somewhere live even if we're not completely optimal (minimum viable product). Does that work for you?

Armen [:armenzg]

Reporter

Comment 10

•

10 years ago

Email sent to sheriffs:
#######################
We are planning to turn on a service that automatically backfills failed test jobs on m-i. If there are no concerns, we would like to turn this on experimentally for a couple of hours on Monday. We hope this will make it easier to identify which revision broke a test. Suggestions are welcome.

The backfilling works like this:
- It triggers the job that failed one extra time
- Then it looks for a successful run of the job on the previous 5 revisions. If a good run is found, it triggers the job only on revisions up to this good run. If not, it triggers the job on every one of the previous 5 revisions.

Previous jobs will be triggered one time.
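The rules in the email can be sketched roughly as below. This is an illustration under stated assumptions, not the pulse_actions code: `plan_backfill` and the `job_status` callback are hypothetical names, revisions are assumed to be listed newest first, and the good revision itself is assumed not to be re-triggered.

```python
MAX_REVISIONS = 5  # "the previous 5 revisions" from the email


def plan_backfill(failed_revision, previous_revisions, job_status):
    """Return a list of (revision, times_to_trigger) pairs.

    previous_revisions: revisions older than failed_revision, newest first.
    job_status: hypothetical callback mapping a revision to "success",
        "failure", or None when the job never ran there.
    """
    # The job that failed is triggered one extra time.
    plan = [(failed_revision, 1)]
    for revision in previous_revisions[:MAX_REVISIONS]:
        if job_status(revision) == "success":
            # Last known good run found: stop, nothing older is needed.
            break
        # Previous jobs are triggered one time each.
        plan.append((revision, 1))
    return plan


# Made-up statuses: r2 is the last known good run.
statuses = {"r4": None, "r3": "failure", "r2": "success", "r1": None}
print(plan_backfill("r5", ["r4", "r3", "r2", "r1", "r0"], statuses.get))
```

If no good run exists in the window, the loop simply runs through all five revisions, which matches the fallback described in the email.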

Ryan VanderMeulen [:RyanVM]

Comment 11

•

10 years ago

This made for hours of hilarity today when Windows mochitest-2 went permafail and I was stuck starring auto-retriggers on old pushes for hours on end. https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20mochitest-2&fromchange=e2e7ff226eca&tochange=b92398507d91

Armen [:armenzg]

Reporter

Comment 12

•

10 years ago

Attached image [screenshot] — Details

1) a perma failure is introduced
2) we re-triggered in some cases up to 6 times (6 orange jobs)

In order to mitigate this, I suggest we do this:
* When an orange job finishes, we don't retrigger it any more times
** I think we now re-trigger 1 or 2 times
** This way we *only* backfill revisions that have that missing job

This will prevent having more than 1 job on any revision due to automatic backfilling.

In the longer term, we can re-evaluate and probably start looking into using active data to do smarter analysis.
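A minimal sketch of the suggested mitigation, assuming a hypothetical `jobs_on_revision` mapping from revision to `{buildername: status}`; none of these names come from pulse_actions or mozci. The failed job is not re-triggered at all, and an older revision is only picked if the job is missing there. Stopping at a good run is carried over from the behaviour described in comment 10, which is also an assumption here.

```python
def revisions_to_backfill(previous_revisions, jobs_on_revision, buildername,
                          max_revisions=5):
    """Return the older revisions on which the job should be triggered once.

    previous_revisions: newest first (assumption).
    jobs_on_revision: hypothetical dict, revision -> {buildername: status}.
    """
    to_trigger = []
    for revision in previous_revisions[:max_revisions]:
        status = jobs_on_revision.get(revision, {}).get(buildername)
        if status == "success":
            # Last known good run: no need to look further back.
            break
        if status is None:
            # The job never ran on this revision, so backfill it.
            to_trigger.append(revision)
        # Any other status means a job already exists on this revision;
        # adding another would put more than 1 job there, so skip it.
    return to_trigger


# Made-up data: r4 already has an orange job, r3 has none, r2 is green.
jobs = {
    "r4": {"win mochitest-2": "failure"},
    "r3": {},
    "r2": {"win mochitest-2": "success"},
    "r1": {},
}
print(revisions_to_backfill(["r4", "r3", "r2", "r1"], jobs, "win mochitest-2"))
```

With this rule a perma failure leaves at most one orange job per revision instead of the 6 seen in the screenshot.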

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1194213

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1195809

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1195821

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1195824

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1195851

Armen [:armenzg]

Reporter

Comment 13

•

10 years ago

Status summary
##############
We have automated backfilling working only for mozilla-inbound.

Before we go ahead and enable it in more places, we will need to address some of these to prevent automated backfilling from causing harm (extra jobs) when we hit edge cases.
* Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout" and *not* backfill
* Bug 1195821 - Automatic backfilling should use follow some rules to decide when not to backfill
* Bug 1195824 - Automatic backfilling should deal better with perma failures
* Bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavely back logged
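As an illustration of what Bug 1195809 asks for, the check could look something like this. It is a sketch only: `should_skip_backfill` is a hypothetical helper, the push descriptions are assumed to be plain strings, and only the two phrases named in the bug title are matched.

```python
import re

# Matches the two phrases from the bug title, case-insensitively.
BACKOUT_RE = re.compile(r"\b(?:backed out|backout)\b", re.IGNORECASE)


def should_skip_backfill(newer_push_descriptions):
    """Return True if any more recent push looks like a backout.

    newer_push_descriptions: commit/push messages of pushes that landed
    after the failing one (how they are fetched is out of scope here).
    """
    return any(BACKOUT_RE.search(text) for text in newer_push_descriptions)


# Made-up push messages.
print(should_skip_backfill(["Bug 1 - fix a typo",
                            "Backed out changeset abc123 for bustage"]))
print(should_skip_backfill(["Bug 2 - add a feature"]))
```

A text match like this can miss backouts that are worded differently, so it would only be a first approximation.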

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1197223

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1197238

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Summary: Automatic backfilling → [tracker] Automatic backfilling

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1198273

Armen [:armenzg]

Reporter

Comment 14

•

10 years ago

Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout"
Bug 1195821 - Automatic backfilling should use follow some rules to decide when not to backfill
Bug 1195824 - Automatic backfilling should deal better with perma failures
Bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavely back logged
Bug 1197223 - Automatic backfilling should respect hidden jobs
Bug 1197238 - Create report per builder on how many jobs are scheduled on a day
Bug 1198273 - Have metrics for scheduling

Armen [:armenzg]

Reporter

Comment 15

•

10 years ago

For anyone contributing, here are the steps to get started:
* Check out https://github.com/adusca/pulse_actions.git
* Set up a virtualenv
* Install the project inside of it with python setup.py develop
* Run python pulse_actions/worker.py --topic-base backfilling --dry-run

A good starting bug would be this one, since it would give the ability to use automated backfilling without special credentials:
Bug 1203141 - Teach automatic backfilling to use Treeherder as a source of jobs (instead of BuildApi)

Armen [:armenzg]

Reporter

Comment 16

•

10 years ago

Related pulse_actions bugs, but not blockers to resuming automated backfilling:
Bug 1203146 - Pulse Actions throws an unreadable exception when the proper env variables are not set up
Bug 1203141 - Teach automatic backfilling to use Treeherder as a source of jobs (instead of BuildApi)

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1212002

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1213308

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1210390

Armen [:armenzg]

Reporter

Comment 17

•

10 years ago

In order to re-enable automatic backfilling we will have to fix the following:
Bug 1195824 - Automatic backfilling should deal better with perma failures
Bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavely back logged
Bug 1197223 - Automatic backfilling should respect hidden jobs
To be filed - Automatic backfilling should clear the cache when the list of builders changes due to a reconfig
Bug 1210390 - allthethings.json does not have up-to-date information

Optimization:
Bug 1195821 - Automatic backfilling should use follow some rules to decide when not to backfill
Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout"

Nice to have:
Bug 1197238 - Create report per builder on how many jobs are scheduled on a day
Bug 1198273 - Have metrics for scheduling

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Depends on: 1216677

Armen [:armenzg]

Reporter

Updated

•

10 years ago

Assignee: alicescarpa → armenzg

Armen [:armenzg]

Reporter

Comment 18

•

10 years ago

We're putting this to the side until we have an automatic starring service to prevent unnecessary load.

Assignee: armenzg → nobody

BMO Automation

Updated

•

3 years ago

Severity: normal → S3
