Bug 1180732 - [tracker] Automatic backfilling
Status: NEW
Product: Testing
Classification: Components
Component: General
Assigned To: Nobody; OK to take it and work on it
Mentors:
Depends on: 1195851 1212002 1194213 1195809 1195821 1195824 1197223 1197238 1198273 1210390 1213308 1216677
Blocks: 1178522
Reported: 2015-07-06 08:29 PDT by Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4)
Modified: 2015-11-06 08:40 PST (History)

Attachments
things_we_would_trigger.csv (525.09 KB, text/csv)
2015-07-22 09:09 PDT, Alice Scarpa [:adusca]
requests_with_buildernames.csv (84.41 KB, text/csv)
2015-07-22 10:07 PDT, Alice Scarpa [:adusca]
[screenshot] (222.78 KB, image/png)
2015-07-28 12:51 PDT, Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4)

Description Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-07-06 08:29:55 PDT
By listening to trees for finished *test* jobs, we should be able to re-trigger a job a couple of times (similar to what trigger bot does on try) and backfill back to the last known good run.

This will be added to pulse_actions.
https://github.com/adusca/pulse_actions
Comment 1 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-07-06 08:34:10 PDT
adusca, what is the latest on pulse_actions having multiple pulse consumers?
Comment 2 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-07-06 12:36:59 PDT
adusca is going to talk to Kyle about how to determine which jobs are always failing through active data.

Adding chmanchester to keep him in the loop since he had the same issue for triggerbot.
Comment 3 Chris Manchester (:chmanchester) 2015-07-06 12:39:54 PDT
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #2)
> adusca is going to talk to Kyle on how to determine which jobs are always
> failing through active data.
> 
> Adding chmanchester to keep in the loop since he had the same issue for
> triggerbot.

The proposed solution for this is to use treeherder to get visibility, as in https://github.com/chmanchester/trigger-bot/pull/4
Comment 4 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-07-06 12:44:01 PDT
TODO:
* Determine a plan for dealing with changesets that are full of failures
Comment 5 Alice Scarpa [:adusca] 2015-07-22 09:09:43 PDT
Created attachment 8637272 [details]
things_we_would_trigger.csv

This csv contains a list of everything that would be triggered on Jul 21st between 0:00 and 19:00. The file has ~2400 lines representing ~1300 requests (some requests take 3 lines of logging, some take just 1).
Comment 6 Alice Scarpa [:adusca] 2015-07-22 10:07:08 PDT
Created attachment 8637315 [details]
requests_with_buildernames.csv

Some of the requests in the other csv (the ones of the form "We would make a POST ...") were not very easy to read, so I generated another csv with the buildernames and revisions for those request_ids.
Comment 7 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-07-24 09:52:29 PDT
That sounds like 700 jobs on inbound for almost a day. That is not bad at all.

What is left to remove dry_run mode?
We need to write up clearly what this system does.

Could you please document it in here?
https://wiki.mozilla.org/Auto-tools/Projects/Automatic_backfilling

We will also need to publicize this and coordinate with sheriffs.
We could enable it for a few hours one day, disable it again and ask sheriffs to give us their feedback.
Comment 8 Alice Scarpa [:adusca] 2015-07-24 10:07:11 PDT
The csv table slightly overestimates what we would trigger: mozci guarantees that we have at most one of each job, and there are a lot of duplicates in the table. In dry-run mode the job does not actually get triggered the first time, so the log says that we would trigger it again.
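
A rough way to estimate how much the table overestimates is to count distinct jobs rather than logged requests. The sketch below assumes the csv has `buildername` and `revision` columns and that "one of each job" means one job per builder per revision; both the column names and the helper `count_unique_jobs` are hypothetical and not taken from the attachments.

```python
import csv
import io
from typing import Set, Tuple


def count_unique_jobs(csv_text: str) -> int:
    """Count distinct (buildername, revision) pairs in a csv dump.

    Assumption: the csv has 'buildername' and 'revision' columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    unique: Set[Tuple[str, str]] = {
        (row["buildername"], row["revision"]) for row in reader
    }
    return len(unique)


# Hypothetical sample: three logged requests, two of them duplicates.
sample = (
    "request_id,buildername,revision\n"
    "1,builder-a,rev1\n"
    "2,builder-a,rev1\n"
    "3,builder-b,rev1\n"
)
unique_jobs = count_unique_jobs(sample)
```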

Steps missing to turn this on:

1. Decide if we need to blacklist some jobs. This may involve implementing a check for whether a job is hidden on TH.

2. Decide what to do if there is no build job available to trigger a test job. Until the queuing functionality for pulse_actions is ready, I think the best option is to just skip those jobs.
Comment 9 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-07-24 10:11:46 PDT
#1 - I think that is an optimization at this moment. Let's document it on the wiki and follow up on it.
#2 - Let's skip it for now and document it.

I would like to see this running somewhere live even if we're not completely optimal (minimum viable product).

Works for you?
Comment 10 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-07-24 12:15:04 PDT
Email sent to sheriffs:
#######################
We are planning to turn on a service that automatically backfills failed test jobs on m-i.
If there are no concerns, we would like to turn this on experimentally for a couple of hours on Monday. We hope this will make it easier to identify which revision broke a test. Suggestions are welcome.
The backfilling works like this:
- It triggers the job that failed one extra time
- Then it looks for a successful run of the job on the previous 5 revisions. If a good run is found, it triggers the job only on revisions up to this good run. If not, it triggers the job on every one of the previous 5 revisions. Previous jobs will be triggered one time.
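
The selection rule in the email can be sketched roughly as follows. This is only an illustration, not the pulse_actions or mozci code: the function name, the status strings and the `job_status` callback are all hypothetical, and only the lookback of 5 revisions comes from the email.

```python
from typing import Callable, List, Optional

MAX_REVISIONS = 5  # lookback window stated in the email


def revisions_to_backfill(
    previous_revisions: List[str],
    job_status: Callable[[str], Optional[str]],
    max_revisions: int = MAX_REVISIONS,
) -> List[str]:
    """Return the earlier revisions on which the failed job would be triggered.

    previous_revisions is ordered newest first and excludes the failing revision.
    job_status(revision) returns "success", "failure", or None (no run there).
    """
    to_trigger = []
    for revision in previous_revisions[:max_revisions]:
        if job_status(revision) == "success":
            # Good run found: trigger only on revisions newer than it.
            break
        to_trigger.append(revision)
    # If no good run was found, this is every revision in the window.
    return to_trigger


# Hypothetical history: r2 has a good run, so only r4 and r3 are selected.
statuses = {"r4": None, "r3": "failure", "r2": "success", "r1": None, "r0": None}
selected = revisions_to_backfill(["r4", "r3", "r2", "r1", "r0"], statuses.get)
```

Note that this rule selects revisions that already have a failed run (r3 above) as well as revisions with no run at all.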
Comment 11 Ryan VanderMeulen [:RyanVM] 2015-07-28 11:31:17 PDT
This made for hours of hilarity today when Windows mochitest-2 went permafail and I was stuck starring auto-retriggers on old pushes for hours on end.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20mochitest-2&fromchange=e2e7ff226eca&tochange=b92398507d91
Comment 12 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-07-28 12:51:21 PDT
Created attachment 8640054 [details]
[screenshot]

1) a perma failure is introduced
2) we re-triggered in some cases up to 6 times (6 orange jobs)

In order to mitigate this, I suggest we do the following:
* When an orange job finishes, we do not retrigger it at all
** I think we currently re-trigger it 1 or 2 times
** This way we *only* backfill revisions that are missing that job

This will prevent having more than 1 job on any revision due to automatic backfilling.
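
A minimal sketch of this mitigation, under the assumption that it keeps the lookback of 5 revisions and the stop at the last good run from comment 10. The function name and status strings are hypothetical. Revisions that already have a run of the job, orange or not, are skipped, so automatic backfilling never adds a second job to a revision.

```python
from typing import Dict, List, Optional


def revisions_missing_job(
    previous_revisions: List[str],
    job_status: Dict[str, Optional[str]],
    max_revisions: int = 5,  # assumption: same window as in comment 10
) -> List[str]:
    """Select only revisions with no run of the job, up to the last good run.

    previous_revisions is ordered newest first and excludes the failing revision.
    job_status maps a revision to "success", "failure", or None (no run).
    """
    to_trigger = []
    for revision in previous_revisions[:max_revisions]:
        status = job_status.get(revision)
        if status == "success":
            break  # last known good run, nothing older needs backfilling
        if status is None:
            to_trigger.append(revision)  # the job is missing here
        # A revision that already has an orange run is left alone.
    return to_trigger


# Hypothetical history: r4 is already orange, r3 and r2 have no run, r1 is good.
history = {"r4": "failure", "r3": None, "r2": None, "r1": "success"}
missing = revisions_missing_job(["r4", "r3", "r2", "r1"], history)
```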

In the longer term, we can re-evaluate and probably start looking into using active data to do smarter analysis.
Comment 13 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-08-19 10:33:00 PDT
Status summary
##############
We have automated backfilling working only for mozilla-inbound.
Before we go ahead and enable it in more places, we will need to address some of the following bugs to prevent automated backfilling from causing harm (extra jobs) when we hit edge cases.

* Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout" and *not* backfill
* Bug 1195821 - Automatic backfilling should follow some rules to decide when not to backfill
* Bug 1195824 - Automatic backfilling should deal better with perma failures
* Bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavily backlogged
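
As an illustration of the first item in the list, a guard along these lines could skip backfilling when a more recent push looks like a backout. This is a sketch only: the function name and the idea of passing in push comments as plain strings are assumptions, and the two marker strings are the ones quoted in the bug title.

```python
from typing import Iterable

# Marker strings quoted in the title of Bug 1195809, matched case-insensitively.
BACKOUT_MARKERS = ("backed out", "backout")


def newer_push_is_backout(newer_push_comments: Iterable[str]) -> bool:
    """Return True if any more recent push comment looks like a backout."""
    return any(
        marker in comment.lower()
        for comment in newer_push_comments
        for marker in BACKOUT_MARKERS
    )


# Hypothetical push comments for pushes newer than the failing one.
skip_backfill = newer_push_is_backout(
    ["Add a new test", "Backed out changeset for test failures"]
)
```

A plain substring match like this is deliberately crude; it would also skip backfilling when the backout is unrelated to the failing job.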
Comment 14 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-08-26 13:42:38 PDT
Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout"
Bug 1195821 - Automatic backfilling should follow some rules to decide when not to backfill
Bug 1195824 - Automatic backfilling should deal better with perma failures
Bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavily backlogged
Bug 1197223 - Automatic backfilling should respect hidden jobs
Bug 1197238 - Create report per builder on how many jobs are scheduled on a day
Bug 1198273 - Have metrics for scheduling
Comment 15 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-09-09 07:41:52 PDT
For anyone contributing, here are the steps to get started:
* Check out https://github.com/adusca/pulse_actions.git
* Set up a virtualenv
* Install the project inside of it with python setup.py develop
* Run python pulse_actions/worker.py --topic-base backfilling --dry-run

A good starting bug would be the following, since it would make it possible to use automated backfilling without special credentials:
Bug 1203141 - Teach automatic backfilling to use Treeherder as a source of jobs (instead of BuildApi)
Comment 16 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-09-09 11:52:16 PDT
Related pulse_actions bugs that do not block resuming automated backfilling:
Bug 1203146 - Pulse Actions throws an unreadable exception when the proper env variables are not set up
Bug 1203141 - Teach automatic backfilling to use Treeherder as a source of jobs (instead of BuildApi)
Comment 17 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-10-19 12:54:55 PDT
In order to re-enable automatic backfilling we will have to fix the following:

Bug 1195824 - Automatic backfilling should deal better with perma failures
Bug 1195851 - Automatic backfilling should not trigger jobs for a pool when it is heavily backlogged
Bug 1197223 - Automatic backfilling should respect hidden jobs
To be filed - Automatic backfilling should clear the cache when the list of builders changes due to a reconfig
Bug 1210390 - allthethings.json does not have up-to-date information

Optimization:
Bug 1195821 - Automatic backfilling should follow some rules to decide when not to backfill
Bug 1195809 - Automatic backfilling should consider more recent pushes that say "Backed out" or "backout"

Nice to have:
Bug 1197238 - Create report per builder on how many jobs are scheduled on a day
Bug 1198273 - Have metrics for scheduling
Comment 18 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2015-11-06 08:40:07 PST
We're putting this to the side until we have an automatic starring service to prevent unnecessary load.