Bug 1846041 Comment 16 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Just to put numbers on what we saw in terms of indexing job runtimes (which was expected and explicitly fine), runtimes before and after landing (anecdotal, no math done):
- linux x64 debug: [85 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=WpbJr2cUSOmod5Hjs-or3w.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [109 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=Q1lfxSpzQfCuGB8H3302vw.2&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- win 2012 x64 debug: [88 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=V1qeFusTTmu0-1HXlohh1w.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [113 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=GLlvWI0yQTGE8pbEtHQvnQ.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- osx cross debug: [86 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=HCJSehLOQjGuKMFOgPFDow.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462) to [113 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=M4dmygCBRsuVb1hdJC-gjg.1&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- osx aarch64 cross debug: [84 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=by2p6S4FQCi9kBI7RmGh_Q.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [107 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=C5_qa1PrTK-rG8B9Kc1ydw.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- android 5.0 aarch64 debug: [107 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=Ff_Dg1aYQPCTs3TQFBRHzg.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [159 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=HCJSehLOQjGuKMFOgPFDow.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- ios debug: [71 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=GKeT02RCR_qlhYyyQu5lZA.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [96 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=JTAjprQLQrSJwc0Ccls5Pg.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)

This does raise the question of whether we need to split the Rust part of the job out so it can be parallelized, or increase the capacity of the builder, to avoid causing indexing latency problems for searchfox. Since we already wait 5 hours (300 minutes) for coverage data, we still have quite a large amount of headroom.

I think the only question is then whether the longer runtime increases the probability of "infrastructure failures", like when taskcluster was on EC2 and using spot instances and the instances could potentially be killed. Right now I do see some "retry" and "exception" failures, with the exceptions explicitly marked as bug 1922641, where it seems like the ~AZ has run out of resources; that could also explain the retries being requested after 12 minutes and only being started after 42 minutes. The retry deadlines still leave us with enough headroom, and I think the meta issue of worker retries is probably best left to taskcluster ops, who can file a bug against searchfox if there's a need to slice and dice the jobs.

Just to put numbers on what we saw in terms of indexing job runtimes (which was expected and explicitly fine), runtimes before and after landing (anecdotal, no math done):
- linux x64 debug: [85 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=WpbJr2cUSOmod5Hjs-or3w.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [109 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=Q1lfxSpzQfCuGB8H3302vw.2&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- win 2012 x64 debug: [88 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=V1qeFusTTmu0-1HXlohh1w.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [113 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=GLlvWI0yQTGE8pbEtHQvnQ.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- osx cross debug: [86 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=HCJSehLOQjGuKMFOgPFDow.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462) to [113 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=M4dmygCBRsuVb1hdJC-gjg.1&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- osx aarch64 cross debug: [84 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=by2p6S4FQCi9kBI7RmGh_Q.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [107 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=C5_qa1PrTK-rG8B9Kc1ydw.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- android 5.0 aarch64 debug: [107 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=Ff_Dg1aYQPCTs3TQFBRHzg.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [159 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=HCJSehLOQjGuKMFOgPFDow.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
- ios debug: [71 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=GKeT02RCR_qlhYyyQu5lZA.0&searchStr=searchfox&revision=ed6d212df8707d8634eedfe6b0c5dbfbeedde584) to [96 minutes](https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=JTAjprQLQrSJwc0Ccls5Pg.0&searchStr=searchfox&revision=5a6d4f139a8b138ece7c4f7eeb8d800d669af462)
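
For anyone who does want the math, here is a minimal sketch that turns the single-run numbers above into added minutes and a rough percentage increase per job. The dictionary layout and the `slowdown` helper are illustrative names only, and since each figure comes from one run, the output is as anecdotal as the list itself.

```python
# Runtimes in minutes as (before, after), copied from the list above.
# Each value is a single observed run, not an average.
runtimes = {
    "linux x64 debug": (85, 109),
    "win 2012 x64 debug": (88, 113),
    "osx cross debug": (86, 113),
    "osx aarch64 cross debug": (84, 107),
    "android 5.0 aarch64 debug": (107, 159),
    "ios debug": (71, 96),
}


def slowdown(before, after):
    """Return (added minutes, percent increase) for one job."""
    delta = after - before
    return delta, 100.0 * delta / before


for job, (before, after) in runtimes.items():
    delta, pct = slowdown(before, after)
    print(f"{job}: +{delta} min ({pct:.0f}%)")
```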

This does raise the question of whether we need to split the Rust part of the job out so it can be parallelized, or increase the capacity of the builder, to avoid causing indexing latency problems for searchfox. Since we already wait 5 hours (300 minutes) for coverage data, we still have quite a large amount of headroom, which is to say we don't need to split the job.
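
As a rough illustration of that headroom, the sketch below subtracts each post-landing runtime from the 300 minute coverage wait. It assumes the indexing jobs and the coverage wait are measured from the same starting point and ignores queueing and retry delays, so treat the result as a back-of-the-envelope figure; `COVERAGE_WAIT_MINUTES` and `headroom` are illustrative names only.

```python
# The 5 hour wait for coverage data mentioned above, in minutes.
COVERAGE_WAIT_MINUTES = 5 * 60

# Post-landing runtimes in minutes, from the single runs listed earlier.
after_minutes = {
    "linux x64 debug": 109,
    "win 2012 x64 debug": 113,
    "osx cross debug": 113,
    "osx aarch64 cross debug": 107,
    "android 5.0 aarch64 debug": 159,
    "ios debug": 96,
}


def headroom(runtime_minutes, budget=COVERAGE_WAIT_MINUTES):
    """Minutes left before the job would outlast the coverage wait."""
    return budget - runtime_minutes


# The slowest job determines how much slack remains overall.
slowest = max(after_minutes, key=after_minutes.get)
print(f"slowest job: {slowest}, headroom: {headroom(after_minutes[slowest])} min")
```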

I think the only question is then whether the longer runtime increases the probability of "infrastructure failures", like when taskcluster was on EC2 and using spot instances and the instances could potentially be killed. Right now I do see some "retry" and "exception" failures, with the exceptions explicitly marked as bug 1922641, where it seems like the ~AZ has run out of resources; that could also explain the retries being requested after 12 minutes and only being started after 42 minutes. The retry deadlines still leave us with enough headroom, and I think the meta issue of worker retries is probably best left to taskcluster ops, who can file a bug against searchfox if there's a need to slice and dice the jobs.
