Open Bug 1745813 Opened 3 years ago Updated 2 years ago

Implement "mutually exclusive" test settings to ease migrations

Categories

(Firefox Build System :: Task Configuration, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: ahal, Unassigned)

Details

Whenever we migrate tests we:

  1. Stand up a new set of tasks that use the new setting (e.g. a new Ubuntu image)
  2. Do lots of try pushes to triage failures
  3. Land the new setting in parallel with the old one
  4. Fix disabled tests
  5. Turn off the old setting

Or sometimes we do this instead:

  1. Fix all failing tests on try
  2. Update tasks to new setting in-place

Both approaches are time-consuming, prone to bit-rot (new tests keep getting added, and these often fail), and gated on people familiar with the CI configuration, since they typically require fiddling with the test configs.

What if we could instead define a "migration": two test settings that are mutually exclusive? That is, every test manifest can be in one or the other, but not both.

We would define which manifests belong to which setting either in the test manifests themselves or in moz.build files, but crucially not in taskgraph. This way, the new process for migrations becomes:

  1. Define the migration (including the new test setting)
  2. Ensure all manifests are pointing at the old setting
  3. Migrate tests manifest by manifest (or directory by directory if we want less granularity).
  4. Delete the old setting
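
To make the idea concrete, here is a minimal sketch of what a "migration" could look like as a data structure. It is not an existing taskgraph or moz.build API: the `Migration` class, its methods, and the setting names and manifest paths used in the usage note are all hypothetical. The point is only that each manifest maps to exactly one of the two settings, so the settings are mutually exclusive by construction, and step 4 becomes a simple check.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Migration:
    """Two mutually exclusive test settings (hypothetical structure)."""

    name: str
    old_setting: str
    new_setting: str
    # Manifests that have been moved to the new setting. Anything not
    # recorded here is implicitly still on the old setting (step 2).
    assignments: Dict[str, str] = field(default_factory=dict)

    def setting_for(self, manifest: str) -> str:
        """Return the single setting a manifest runs under."""
        return self.assignments.get(manifest, self.old_setting)

    def migrate(self, manifest: str) -> None:
        """Step 3: move one manifest over to the new setting."""
        self.assignments[manifest] = self.new_setting

    def partition(self, manifests: Iterable[str]) -> Dict[str, List[str]]:
        """Group manifests by setting; each manifest lands in exactly one group."""
        groups: Dict[str, List[str]] = {self.old_setting: [], self.new_setting: []}
        for manifest in manifests:
            groups[self.setting_for(manifest)].append(manifest)
        return groups

    def can_delete_old_setting(self, manifests: Iterable[str]) -> bool:
        """Step 4 is safe once no manifest is left on the old setting."""
        return all(self.setting_for(m) == self.new_setting for m in manifests)
```

With a migration from a hypothetical `ubuntu-old` setting to `ubuntu-new`, calling `migrate("dom/tests/mochitest.ini")` moves just that manifest, and `can_delete_old_setting(...)` stays false until every manifest has been moved. In the actual proposal the assignment would live in the manifests or moz.build files rather than in a Python object.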

The main benefits of this system are:

  1. Step 3 can be handled in parallel by developers themselves (the people most likely to be able to fix test issues)
  2. No more bit-rot, as individual manifests are much faster and more manageable to transition
  3. Apart from setup/teardown, no knowledge of CI configuration is required

I believe this proposal can help avoid the single-team bottleneck when managing migrations.

In general I like this idea. In practice, I don't see it helping as much.

  1. A hardware migration introduces other constraints, like limited resources. This will be the case for OSX* and Android-hw*.
  2. A migration often needs OS changes. A small group (<=3 people) hacking in coordination can ensure that OS changes are communicated properly, and then revisit previous work. This can be confusing: let's say the clipboard is broken; that would cross multiple teams.
  3. Adding a variant (like fission): possibly this would work, but typically that means greening up rapidly and fixing later.
  4. Perf tests are often part of a migration, and those need overlap (it takes 1+ week of data to start posting regressions, so an abrupt change means there is no coverage for that week).
  5. There are often deadlines for completing a migration. Relying on the masses will work for 80-90% of the cases; the rest will need someone from releng to nag or do the work.

I see this helping in scenarios like an in-tree Ubuntu image upgrade, or something like bringing Windows 11 online. Given how often we update just an OS, I don't see this saving as much time as it might appear to.

Severity: -- → N/A