Closed Bug 1676487 Opened 2 years ago Closed 2 months ago

Reduce max-run-time for Marionette jobs in CI

Categories

(Testing :: Marionette, task, P1)

Points: 1

Tracking

(firefox109 fixed)

RESOLVED FIXED
109 Branch
Tracking Status
firefox109 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [webdriver:m5])

Attachments

(1 file)

As discovered in https://phabricator.services.mozilla.com/D96466#inline-544594, we currently use a max-run-time of 5400s. That is quite a lot for the Marionette jobs.

We should reduce it to a lower value for all Mn jobs. I will check what the usual runtime of the job is across platforms, and whether we can simply reduce the timeout or have to divide the jobs into chunks.

Joel, does one of you have a tool that can scrape jobs of specific types on Treeherder/Taskcluster and fetch their duration? Doing that manually is rather tedious, and I could imagine that other job types would benefit from it as well.

Flags: needinfo?(jmaher)

There is no simple tool that I know of; ./mach test-info .. can provide some insight into specific tests, but not really into test jobs. You could query the Treeherder database on Redash, or use ActiveData as well.

I did a quick Redash query using the Treeherder data source:

select
  jt.symbol,
  jt.name,
  j.end_time-j.start_time
from
  job j,
  job_type jt
where 
  j.job_type_id=jt.id
  and jt.name like '%marionette%'
  and j.result='success'
limit 100

It should give you a head start if you want to use Redash.

Flags: needinfo?(jmaher)

As discussed on Matrix, the above query does not work, for several reasons. Here is an updated one:

select
  jt.name,
  j.start_time,
  j.end_time,
  timestampdiff(second, j.start_time, j.end_time) as seconds,
  jl.url
from
  job j,
  job_type jt,
  job_log jl
where
  j.job_type_id=jt.id
  and jl.job_id=j.id
  and jt.name like '%opt-marionette-e10s%'
  and jt.name not like '%ccov%'
  and jt.name not like '%devedition%'
  and j.result='success'
  and timestampdiff(second, j.start_time, j.end_time) > 2000
limit 100
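
Once the rows from the updated query are exported, the durations can be summarized per job type to see how close each one gets to a proposed timeout. The following is only a minimal sketch: it assumes the result was saved as CSV with at least the `name` and `seconds` columns from the query, and the job names and numbers in `SAMPLE` are made up for illustration.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_durations(csv_text):
    """Group job durations by job type name and report count, median, and max."""
    by_name = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "name" and "seconds" match the column aliases of the updated query.
        by_name[row["name"]].append(int(row["seconds"]))
    return {
        name: {
            "count": len(secs),
            "median": statistics.median(secs),
            "max": max(secs),
        }
        for name, secs in by_name.items()
    }


# Hypothetical sample data, not real Treeherder results.
SAMPLE = """name,seconds
test-linux1804-64/opt-marionette-e10s,2100
test-linux1804-64/opt-marionette-e10s,2300
test-windows10-64/opt-marionette-e10s,2500
"""

summary = summarize_durations(SAMPLE)
for name, stats in sorted(summary.items()):
    print(name, stats)
```

Note that the query only returns successful jobs longer than 2000s and is capped at 100 rows, so a summary built from it describes the slow tail rather than the typical runtime.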

Basically, here is what was suggested to me:

  • No job should run longer than 3600s
  • For jobs on opt builds, 1800s is fine
  • For jobs on debug/asan/ccov builds, 2700s - 3600s is fine

Jobs that take longer should be split into multiple chunks.
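
As a rough illustration of how the limits suggested above could be applied, the sketch below maps a build type to a limit and estimates how many chunks a job would need. The function names are hypothetical, the upper end of the 2700s - 3600s range is picked arbitrarily for non-opt builds, and the chunk estimate naively assumes that runtime splits evenly with no per-chunk setup overhead, which is unlikely to hold exactly in practice.

```python
import math

# Limits taken from the suggestions above, in seconds.
OPT_LIMIT = 1800
NON_OPT_LIMIT = 3600  # assumption: upper end of the 2700s - 3600s range
HARD_CAP = 3600       # no job should run longer than this


def suggested_limit(build_type):
    """Return the suggested max-run-time in seconds for a build type."""
    if build_type == "opt":
        return OPT_LIMIT
    if build_type in ("debug", "asan", "ccov"):
        return NON_OPT_LIMIT
    # Unknown build types fall back to the overall cap.
    return HARD_CAP


def chunks_needed(observed_seconds, limit):
    """Estimate the chunk count, assuming runtime divides evenly across chunks."""
    return max(1, math.ceil(observed_seconds / limit))


# Hypothetical example: an opt job observed at 2500s exceeds the 1800s limit.
print(chunks_needed(2500, suggested_limit("opt")))
print(chunks_needed(2500, suggested_limit("debug")))
```

In this example the opt job would need two chunks, while the same runtime on a debug build would still fit into a single job.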

Let's test whether the proposed timeouts from the last comment work and, if not, whether we can chunk the Mn jobs.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b034b4e4cb99
[marionette] Reduce max-run-time for jobs in Taskcluster. r=jmaher
Blocks: webdriver:m5
Points: --- → 1
Priority: P3 → P1
Whiteboard: [webdriver:m5]
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch