Closed Bug 1677903 Opened 2 years ago Closed 2 years ago

Ensure coverage data is always exactly for the indexed build and ensure indexers have a high probability of having coverage data available.

Categories

(Webtools :: Searchfox, enhancement)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: asuth, Assigned: asuth)

References

(Blocks 2 open bugs)

Details

Attachments

(3 files)

:sg had noticed stale code coverage in https://bugzilla.mozilla.org/show_bug.cgi?id=1603383#c5 and per :sg (via other discussion channels) when :marco looked into it, the coverage data had been broken for ~10 days. When initially developing the code coverage direct integration in bug 1566874 I'd added a fallback to the "latest" route for the data because the code coverage automation was being delayed by tier 3 taskcluster jobs which meant we'd never have coverage data otherwise, see https://bugzilla.mozilla.org/show_bug.cgi?id=1566874#c16 for more details on the conclusions there.

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1566874#c15 I hadn't actually meant to leave this in place, although I think I may have come around to hand-waving that the occasional 1-day lag would be okay. Since the lag can clearly end up higher than this and it's likely searchfox's continuing to show stale data could have contributed to this, I think it makes sense to go back to my original intent of only showing coverage data that corresponds exactly to the code you're looking at.

At the moment, the problem appears to be that the Taskcluster hook to parse the coverage artifacts and upload a summary artifact is never triggered because Taskcluster is not sending task-group-resolved pulse messages for m-c groups (https://github.com/taskcluster/taskcluster/issues/3926).

I'm going to add a daily cron task that will automatically trigger the hook for m-c pushes that were not yet ingested. This should at least make it so that "latest" is never too old.

(In reply to Marco Castelluccio [:marco] from comment #1)

I'm going to add a daily cron task that will automatically trigger the hook for m-c pushes that were not yet ingested. This should at least make it so that "latest" is never too old.

What time of day were you thinking of setting the cron for? Right now the mozilla-central indexing job "start-release1-indexer"/"config1.json" starts doing things at 8am EST / 12pm GMT which has been moved earlier over time thanks to the Windows jobs getting faster via the move to cross-compilation.

I think it's worth moving this later if we can get guaranteed matching coverage results. If it's a LOT later, it may be worth considering moving Searchfox to index after both of the nightlies (10 UTC and 22 UTC) instead of just 10 UTC. I don't know if coverage gets run for both nightlies though?

Attached file GitHub Pull Request

(This did already land and took effect in today's mozilla-central indexing.)

The missing code coverage data was due to Taskcluster pulse messages misbehaving (https://github.com/taskcluster/taskcluster/issues/3926).
I now have a cron hook that goes through all mozilla-pushes and checks if they have been ingested. If they haven't been ingested, it will trigger ingestion jobs for them.

As agreed on Matrix, we can do the following:

  • The searchfox cron will take the latest ccov revision. If it is older than X hours (2?) it will ingest the latest tip and not include coverage (or include older coverage, as long as it is not older than X+Y hours (12?)). If it is newer than X hours, it will use that revision and include that revision's coverage. The hours will be counted between the time of the last coverage revision and the revision of the latest searchfox linux job.
  • We will sync the crons to make it so coverage is often caught. The first proposal was to take advantage of the nightly by setting ccov cron at 13:30 UTC, Searchfox cron at 15:00 UTC. We might need to change this as the coverage job can take a while (I need to check how long, and I need to check if we can optimize it).
Flags: needinfo?(bugmail)

Recent runs using https://firefox-ci-tc.services.mozilla.com/hooks/project-relman/code-coverage-repo-production to find the aggregation jobs (which maybe is wrong? But it does seem like some of these are cron triggered and some are pulse triggered judging by the differences in payloads?):

  • Today, Nov 30th: treeherder aggregation job
    • Started: 2020-11-30T14:06:57.217Z. This is 11/30/2020, 9:06:57 AM in local ET. I think all of the "Linux 18.04 x64 CCov opt" jobs were done by ~6:30am and the slower "Windows 10 x64 CCov WebRender opt" jobs (do they matter) were more like ~7:45am.
    • Finished: 2020-11-30T16:38:38.904Z which misses the 15:00 UTC target. This is 11/30/2020, 11:38:38 AM in local ET.
    • Duration: 2h 32m
  • Sunday, Nov 29th: treeherder aggregation job
    • Started: 2020-11-30T00:04:19.169Z Note that this is massively after the push. This should have started before that?
    • Finished: 2020-11-30T01:31:29.398Z
    • Duration: 1h 27m
  • Saturday, Nov 28th: treeherder aggregation job
    • Started: 2020-11-29T00:03:36.461Z This is also massively after the push.
    • Finished: 2020-11-29T01:35:55.270Z
    • Duration: 1h 32m
  • Friday, Nov 27th. The treeherder push here was from Thursday evening on Nov 26th. Here's the aggregation job though.

Because the aggregation jobs don't seem to show up directly in the treeherder UI, it's possible I'm not looking at the right thing. (Or maybe they do and I'm really not looking at the right thing?)

I noticed we had coverage yesterday in searchfox. It appears this happened because the nightly was built off of a Mozilla-central commit that was significantly older than usual (Tuesday at 10:49pm ET for Wednesday rather than the normal 4:30am+ ET).

Because the manual process of comment 5 was laborious and maybe wrong and I need to figure out how to implement the logic in comment 4, I created a terrifying bash script at https://gist.github.com/asutherland/0f820de77856e459d7e1d4196be6f7ae that provides us with the helpful output at the bottom of the comment with the headings.

There's 3 passes in the script as displayed below:

  1. "Searchfox rev artifact creator times":
    • Find the recent mozilla-central searchfox treeherder linux64 jobs and use that to figure out the revisions that searchfox indexed.
    • Lookup the task that corresponds to the coverage artifact index root for the given tree and revision.
    • If the was a task, get its info, then output it.
  2. "Hook Recent Runs for code-coverage-cron-production"
    • This just shows the completion and start times from jobs for this hook. These tasks don't have any revision information in them. They also seem to only produce precursor artifacts and not the actual aggregation we would assume if I run taskcluster task def en9AhwDYQGSSWZ32PF30Tw for example on the 12-04 cron task's id.
  3. "Hook Results for code-coverage-repo-production":
    • Find the recent hook runs
    • Filter the runs down to those that correspond to mozilla-central and a revision for which searchfox indexing run (as determined in the first pass)
    • Output the info.

What the output and my investigation seems to be showing is that:

  • Maybe the cron job is currently required to make things happen.
  • The cron job is actually running at midnight UTC time instead of when we want it to?

:marco, thoughts?

Output

Searchfox rev artifact creator times

  • d4b21131a094a0819b8856fbf4165cfb32d671c6 artifact project.relman.code-coverage.production.repo.mozilla-central.d4b21131a094a0819b8856fbf4165cfb32d671c6
    • 2020-12-03T10:44:38.906Z - indexing completed in job IZjIJ121TMO3s4OwZFMl2Q
    • 2020-12-04T01:40:41.803Z - artifact task completed in job VrUpFqafShWpEv5-DCqHQA started at 2020-12-04T00:03:31.464Z
  • b836812230d1274efa92eb0a0d38601ef4715cda artifact project.relman.code-coverage.production.repo.mozilla-central.b836812230d1274efa92eb0a0d38601ef4715cda
    • 2020-12-04T10:45:32.316Z - indexing completed in job WQ8-4NBtTWqaXTA74wjWRg
    • 2020-12-05T01:50:36.159Z - artifact task completed in job bsAQop-BTfGXOa-Mp-C9wQ started at 2020-12-05T00:05:40.519Z
  • 7ce95b6cde2643be60ec53d8c743db28aec30884 artifact project.relman.code-coverage.production.repo.mozilla-central.7ce95b6cde2643be60ec53d8c743db28aec30884
    • 2020-12-05T10:42:01.895Z - indexing completed in job TrEpg3-2RneHzhzUVccmxw
    • 2020-12-06T01:40:05.383Z - artifact task completed in job OLpXEJx7RDeOuTSkrSuKOA started at 2020-12-06T00:06:45.971Z
  • 5a17a9893fe09fd912f1972b204d0f9ca1184b28 artifact project.relman.code-coverage.production.repo.mozilla-central.5a17a9893fe09fd912f1972b204d0f9ca1184b28
    • 2020-12-06T10:45:04.382Z - indexing completed in job BTicAleYQ2uOBMbZXS_gPw
    • 2020-12-07T01:33:25.315Z - artifact task completed in job RBIL-Rv-RNG34DHcz-LM0g started at 2020-12-07T00:05:12.411Z
  • 7167ab3f6e8d7b77439ac1cc7d9356bbaff02510 artifact project.relman.code-coverage.production.repo.mozilla-central.7167ab3f6e8d7b77439ac1cc7d9356bbaff02510
    • 2020-12-07T10:37:57.384Z - indexing completed in job Yy6RvYLzQ9y7QF3TFD7Clw
    • 2020-12-08T01:46:59.147Z - artifact task completed in job d3yZDYm2R9St4Eiwn4_P2g started at 2020-12-08T00:05:12.404Z
  • 1a164819adefcdb42bdba8a24976b62ee8d03c80 artifact project.relman.code-coverage.production.repo.mozilla-central.1a164819adefcdb42bdba8a24976b62ee8d03c80
    • 2020-12-08T10:43:20.898Z - indexing completed in job TC6zbrHwR8S_dJ7DyBO-eg
    • 2020-12-09T01:41:03.400Z - artifact task completed in job QxoUycv-TAaXLVAn54gn1Q started at 2020-12-09T00:09:07.369Z
  • b4cd29abb74ae321be6176130fcd2130c7179d6d artifact project.relman.code-coverage.production.repo.mozilla-central.b4cd29abb74ae321be6176130fcd2130c7179d6d
    • 2020-12-09T10:56:35.407Z - indexing completed in job V7vA_AsxQiGPH1_2aGgB9g
    • 2020-12-09T11:10:07.410Z - artifact task completed in job L6wue_-7TPug_39vNRaEAA started at 2020-12-09T08:59:40.346Z
  • 73f050da7d204fcfb59aa2528f3d22665bc043d3 artifact project.relman.code-coverage.production.repo.mozilla-central.73f050da7d204fcfb59aa2528f3d22665bc043d3
    • 2020-12-10T10:46:45.880Z - indexing completed in job erxttPsYRDSqf8YWwDr5Cg
    • No such artifact!

Hook Recent Runs for code-coverage-cron-production

  • 2020-12-04T00:28:22.567Z completed after starting at 2020-12-04T00:00:53.062Z for task en9AhwDYQGSSWZ32PF30Tw
  • 2020-12-05T00:42:28.150Z completed after starting at 2020-12-05T00:02:45.325Z for task ZccIOU_jT6ylGydRTjQQIA
  • 2020-12-06T01:09:56.020Z completed after starting at 2020-12-06T00:01:58.761Z for task cIjBJx1bQ42IdsuBiuq8WQ
  • 2020-12-07T00:37:33.837Z completed after starting at 2020-12-07T00:03:03.062Z for task KRPevd9WRs2iXICruGPFLQ
  • 2020-12-08T00:22:34.855Z completed after starting at 2020-12-08T00:02:14.910Z for task KJ1Iwt9lTu2sJkY4gwP_DQ
  • 2020-12-09T00:21:27.892Z completed after starting at 2020-12-09T00:03:04.293Z for task dbWUaFy0RmKMbJId8fHRhA
  • 2020-12-10T00:27:41.105Z completed after starting at 2020-12-10T00:01:59.470Z for task WxOg4C3mSRa4YXhwhIKanw

Hook Results for code-coverage-repo-production

  • rev d4b21131a094a0819b8856fbf4165cfb32d671c6
    • 2020-12-10T11:48:40.246Z - indexing completed in job IZjIJ121TMO3s4OwZFMl2Q
    • 2020-12-04T01:40:41.803Z - aggregation completed in job VrUpFqafShWpEv5-DCqHQA
  • rev b836812230d1274efa92eb0a0d38601ef4715cda
    • 2020-12-10T11:48:40.246Z - indexing completed in job WQ8-4NBtTWqaXTA74wjWRg
    • 2020-12-05T01:50:36.159Z - aggregation completed in job bsAQop-BTfGXOa-Mp-C9wQ
  • rev 7ce95b6cde2643be60ec53d8c743db28aec30884
    • 2020-12-10T11:48:40.246Z - indexing completed in job TrEpg3-2RneHzhzUVccmxw
    • 2020-12-06T01:40:05.383Z - aggregation completed in job OLpXEJx7RDeOuTSkrSuKOA
  • rev 5a17a9893fe09fd912f1972b204d0f9ca1184b28
    • 2020-12-10T11:48:40.246Z - indexing completed in job BTicAleYQ2uOBMbZXS_gPw
    • 2020-12-07T01:33:25.315Z - aggregation completed in job RBIL-Rv-RNG34DHcz-LM0g
  • rev 7167ab3f6e8d7b77439ac1cc7d9356bbaff02510
    • 2020-12-10T11:48:40.246Z - indexing completed in job Yy6RvYLzQ9y7QF3TFD7Clw
    • 2020-12-08T01:46:59.147Z - aggregation completed in job d3yZDYm2R9St4Eiwn4_P2g
  • rev 1a164819adefcdb42bdba8a24976b62ee8d03c80
    • 2020-12-10T11:48:40.246Z - indexing completed in job TC6zbrHwR8S_dJ7DyBO-eg
    • 2020-12-09T01:41:03.400Z - aggregation completed in job QxoUycv-TAaXLVAn54gn1Q
  • rev b4cd29abb74ae321be6176130fcd2130c7179d6d
    • 2020-12-10T11:48:40.246Z - indexing completed in job V7vA_AsxQiGPH1_2aGgB9g
    • 2020-12-09T11:10:07.410Z - aggregation completed in job L6wue_-7TPug_39vNRaEAA
  • rev b4cd29abb74ae321be6176130fcd2130c7179d6d
    • 2020-12-10T11:48:40.246Z - indexing completed in job V7vA_AsxQiGPH1_2aGgB9g
    • 2020-12-10T06:33:31.604Z - aggregation completed in job cCIzobjoTuu6BI3iNIBYdw
Flags: needinfo?(bugmail) → needinfo?(mcastelluccio)

Given that pulse messages are not working reliably, I made the tasks triggered by the code-coverage-cron-production hook in turn trigger aggregation tasks (the code-coverage-repo-production hook).
So, code-coverage-cron-production is just the cron task that triggers the actual code-coverage-repo-production aggregation tasks.

Currently the code-coverage-cron-production hook is set up to run at midnight UTC (https://hg.mozilla.org/ci/ci-configuration/file/4ad9577b4316ed8669e8e697e90c83dfec8fb5b0/hooks.yml#l47), I haven't changed that yet.
Through your script, can you measure the runtime (finished - scheduled) of code-coverage-repo-production tasks for m-c? Then, I can set up the cron hook to run at 15:00 UTC - ~95th percentile runtime.

It'd also be interesting to know, for any given Searchfox indexing task, what was the latest code-coverage-repo-production job that run and how far the pushes linked to them are (if they're often max 1 day apart, then we're often safe to use "latest" when the exact revision is not available).

Note: there seems to be some inconsistencies between 1 and 3, e.g. for the indexing task for 7167ab3f6e8d7b77439ac1cc7d9356bbaff02510.

Flags: needinfo?(mcastelluccio)

(In reply to Marco Castelluccio [:marco] from comment #7)

Note: there seems to be some inconsistencies between 1 and 3, e.g. for the indexing task for 7167ab3f6e8d7b77439ac1cc7d9356bbaff02510.

Ah, yeah, all the indexing times in the hook results (3) are the same time because I'd defined SF_COMPLETED locally but was referencing COMPLETED_SF from the first phase. An optimization gone wrong in the face of inconsistent naming. I'll clean things up and make it help provide the data for the runtimes.

Attached file gist run data

I updated the gist.

Google sheets says the 95th percentile over the last 2 weeks is 192.51 minutes. That's from pasting the comma-delimited values from the first section into a cell, choosing "Data...Split Text to Columns", then doing =PERCENTILE(B1:N1, 0.95) (I pasted the data into B1 and it split to N1 by inspection.) Note that the 3rd section had a lot of fast runs because in many cases the cron job triggered the artifact rather than the pulse hook. The 1st section I believe properly figures out which of the cron or pulse hook jobs ended up getting there first.

Flags: needinfo?(mcastelluccio)

OK, so I can setup the cron hook to run at noon UTC.

I could also:

  1. currently the cron hook is doing a few expensive things other than triggering repo hook tasks. If I split it in two, I could run the cron to trigger repo more frequently;
  2. in the repo hook, upload the artifact used by Searchfox and index the task as soon as possible instead of waiting for the task to finish (the artifact is available relatively early in the runtime of the task, so the 190 minutes would definitely be way lower).

Both are easy to achieve. 1 requires some additional configuration, so 2 should be quicker to implement.

Depends on: 1684105
Flags: needinfo?(mcastelluccio)

All of this sounds great, thank you! As I work on the searchfox automation logic we discussed in comment 4, I'll make sure to handle cases where the task that generated the artifact is still actively running!

Bug 1684105 landed moving the coverage cron to noon UTC but the results so far are inconclusive. The first day's run (Jan 12th) artifact-producing job started at 2021-01-12T14:26:10.169Z and ended rather late at 2021-01-12T17:02:13.256Z despite the cron job starting at 2021-01-12T12:04:20.607Z and finishing at 2021-01-12T12:30:02.368Z. The second job/today's run (Jan 13th) actually beat the searchfox job because the mozilla-central merge ended up happening a few minutes after the nightly cron job was triggered, so the coverage had an extra 5 hours of time to run.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)

Created attachment 9196873 [details]
post-cron-job changes output

Bug 1684105 landed moving the coverage cron to noon UTC but the results so far are inconclusive. The first day's run (Jan 12th) artifact-producing job started at 2021-01-12T14:26:10.169Z and ended rather late at 2021-01-12T17:02:13.256Z despite the cron job starting at 2021-01-12T12:04:20.607Z and finishing at 2021-01-12T12:30:02.368Z. The second job/today's run (Jan 13th) actually beat the searchfox job because the mozilla-central merge ended up happening a few minutes after the nightly cron job was triggered, so the coverage had an extra 5 hours of time to run.

The cron which started at 2021-01-12T12:04:20.607Z only triggered one artifact-producing job for b97f7d02c91242976b9f88dd8994875923cbeadf at 12:10, which ended at 14:28.
I guess Searchfox tried to use a91c74c24d25c15a4c3a55fd09217bf20ad2d4b5, which is the push right after b97f7d02c91242976b9f88dd8994875923cbeadf.
And I guess at the time of the cron job execution, the coverage test tasks for a91c74c24d25c15a4c3a55fd09217bf20ad2d4b5 weren't finished yet.

In a case like this, if we use the rules from comment 4, Searchfox could use "latest" in place of a91c74c24d25c15a4c3a55fd09217bf20ad2d4b5.
(Note: we could also, instead of a time-based logic, use a number of pushes logic: if there are less than X pushes between the latest revision at the time Searchfox runs and the latest coverage revision, use coverage, otherwise if there are too many pushes between them, don't use coverage)

(In reply to Marco Castelluccio [:marco] from comment #13)

The cron which started at 2021-01-12T12:04:20.607Z only triggered one artifact-producing job for b97f7d02c91242976b9f88dd8994875923cbeadf at 12:10, which ended at 14:28.

That makes sense then, thanks!

I guess Searchfox tried to use a91c74c24d25c15a4c3a55fd09217bf20ad2d4b5, which is the push right after b97f7d02c91242976b9f88dd8994875923cbeadf.

To clarify for anyone reading along, searchfox currently is just using the commit that taskcluster triggers all the when {hour: 10, minute: 0} cron jobs.

In a case like this, if we use the rules from comment 4, Searchfox could use "latest" in place of a91c74c24d25c15a4c3a55fd09217bf20ad2d4b5.
(Note: we could also, instead of a time-based logic, use a number of pushes logic: if there are less than X pushes between the latest revision at the time Searchfox runs and the latest coverage revision, use coverage, otherwise if there are too many pushes between them, don't use coverage)

When you say pushes, are we ignoring the number of commits in the merges? mozilla-central basically only gets 2 or 3 non-backout pushes per day and those are merges from autoland with 30-40+ commits in them. The time-based heuristic is nice because if we do find a commit with coverage from within the preceding 2 hours, there's a good chance we're actually looking at an early autoland merge that was followed up by a backout or perhaps another merge but one that will be smaller in size.

That said, pushes that would match this 2-hr criteria seem anecdotally rare based on skimming treeherder now and in the past; it actually seems like it's much more common for the push picked by the {hour: 10, minute: 0} cron to end up being a merge that's anywhere from 1.5 to 5.5 hours old which has been frequently resulting in searchfox's existing scheduling to work out okay. I've also had to modify my scripts again as treeherder started returning 500 errors for timestamps that include the timezone, so I'm thinking maybe for the time being I won't implement this heuristic given the risk-reward factor being off in terms of breakage, but we'll just adjust the searchfox AWS lambda job to be later and I'll compensate by making searchfox run against both daily nightlies (adding {hour: 22, minute: 0}) as has been discussed somewhat.

Weirdness on the 2021-01-14 run

It looks like the cron job didn't work here? Does this mean the cron job itself was running too early?

Searchfox rev artifact creator times

  • cf6956a5ec8e21896736f96237b1476c9d0aaf45 artifact project.relman.code-coverage.production.repo.mozilla-central.cf6956a5ec8e21896736f96237b1476c9d0aaf45
    • 2021-01-14T10:42:51.769Z - indexing completed in job WlWYPOdnTqyep1OKeXix_w
    • 2021-01-15T13:52:30.011Z - artifact task completed in job H24JtpffTjG8Be0d73ccHQ started at 2021-01-15T12:07:21.869Z

Hook Recent Runs for code-coverage-cron-production

  • 2021-01-14T12:31:19.564Z completed after starting at 2021-01-14T12:02:31.823Z for task To9L_0iXQ8uWPYYjLTZ5HA after 28.8 minutes

Hook Results for code-coverage-repo-production

  • rev cf6956a5ec8e21896736f96237b1476c9d0aaf45
    • 2021-01-14T10:42:51.769Z - indexing completed in job WlWYPOdnTqyep1OKeXix_w
    • 2021-01-15T13:52:30.011Z - aggregation completed in job H24JtpffTjG8Be0d73ccHQ after 105.15 minutes
See Also: → 1686969

So bug 1686969 adds the 2nd mozilla-central nightly (22 utc) scheduling, and I've now also updated the cron jobs for the release1 indexer speculatively to match all of this.

Right now this means that for the release1 (config1.json) indexer, we trigger it from the following EventBridge events:

  • for-10utc-nightly: 15:00 UTC as per comment 4 and the preceding discussion. We may end up moving this as late as 17:00 UTC. This gives coverage 5 hours to happen and is helped by the coverage cron job that as part of bug 1684105 was made to run at 12:00 UTC to help make sure things get scheduled if the pulse notification doesn't happen in a timely fashion.
  • for-22utc-nightly: 04:00 UTC chosen because per the discussion in bug 1686969 we have some slack at least for morning in Europe so this gives coverage 6 hours to happen. There currently isn't a coverage cron job to help make sure the coverage jobs are triggered if the pulse notification doesn't happen in a timely fashion. So this might want an additional cron job added.

This leaves us with the other EventBridge events being used:

  • everyday-8am: 12:00 UTC triggering of start-release4-indexer
  • everyday-9am: 13:00 UTC triggering of start-release2-indexer

release3 continues to be manually scheduled because it holds indexes that don't change much.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #14)

In a case like this, if we use the rules from comment 4, Searchfox could use "latest" in place of a91c74c24d25c15a4c3a55fd09217bf20ad2d4b5.
(Note: we could also, instead of a time-based logic, use a number of pushes logic: if there are less than X pushes between the latest revision at the time Searchfox runs and the latest coverage revision, use coverage, otherwise if there are too many pushes between them, don't use coverage)

When you say pushes, are we ignoring the number of commits in the merges? mozilla-central basically only gets 2 or 3 non-backout pushes per day and those are merges from autoland with 30-40+ commits in them. The time-based heuristic is nice because if we do find a commit with coverage from within the preceding 2 hours, there's a good chance we're actually looking at an early autoland merge that was followed up by a backout or perhaps another merge but one that will be smaller in size.

Yes, I was thinking Searchfox could use coverage's latest revision when the coverage data is not available for the Searchfox revision. By using the number of pushes (or commits in the pushes) between Searchfox revision and coverage "latest", you can tell whether the data is too stale to use, or stale but not too much (e.g. if there are only few commits between the Searchfox revision and the latest coverage revision you could accept the staleness).

Weirdness on the 2021-01-14 run

It looks like the cron job didn't work here? Does this mean the cron job itself was running too early?

Searchfox rev artifact creator times

  • cf6956a5ec8e21896736f96237b1476c9d0aaf45 artifact project.relman.code-coverage.production.repo.mozilla-central.cf6956a5ec8e21896736f96237b1476c9d0aaf45
    • 2021-01-14T10:42:51.769Z - indexing completed in job WlWYPOdnTqyep1OKeXix_w
    • 2021-01-15T13:52:30.011Z - artifact task completed in job H24JtpffTjG8Be0d73ccHQ started at 2021-01-15T12:07:21.869Z

Hook Recent Runs for code-coverage-cron-production

  • 2021-01-14T12:31:19.564Z completed after starting at 2021-01-14T12:02:31.823Z for task To9L_0iXQ8uWPYYjLTZ5HA after 28.8 minutes

Hook Results for code-coverage-repo-production

  • rev cf6956a5ec8e21896736f96237b1476c9d0aaf45
    • 2021-01-14T10:42:51.769Z - indexing completed in job WlWYPOdnTqyep1OKeXix_w
    • 2021-01-15T13:52:30.011Z - aggregation completed in job H24JtpffTjG8Be0d73ccHQ after 105.15 minutes

The cron run (task https://firefox-ci-tc.services.mozilla.com/tasks/To9L_0iXQ8uWPYYjLTZ5HA) and triggered an ingestion task (WsgLOYGASrqgJylUR8fSqA) for revision f45a46bfdd345168bd2ed8537dd330f029338f57. The ingestion task failed (some coverage artifact could not be downloaded from a test task).
The coverage test tasks for cf6956a5ec8e21896736f96237b1476c9d0aaf45 (which is the revision Searchfox picked) were still in progress at the time the cron task for 2021-01-14 run, so the ingestion task could not start yet. It was started in the next day's (2021-01-15) cron task.
https://github.com/mozilla/code-coverage/issues/718 would help with this, if we make the coverage cron run every hour (or even every few minutes).

On a related note, I've fixed https://github.com/mozilla/code-coverage/issues/719, which will make the coverage ingestion task upload the artifact you need and index itself ASAP (~20 minutes instead of more than 2 hours). So the delay between a push and its ingestion will be around 2 hours shorter (in total it is: runtime of the decision task + runtime of the coverage build + runtime of the coverage tests + partial runtime of the ingestion task).

(In reply to Marco Castelluccio [:marco] from comment #16)
Re: staleness, what you're saying makes sense but I think the answer for all non-weekend merges at this point is "that's too many". As of when I initially started responding to this message last week, the extra time I'd added to the cron jobs seemed to have been really helping... as of now when I'm submitting, I think things haven't been 100% perfect, but with your enhancements and planned enhancements below, I think we'll be in fantastic shape.

https://github.com/mozilla/code-coverage/issues/718 would help with this, if we make the coverage cron run every hour (or even every few minutes).

Yeah, something like every half hour would be great. Presumably there's some level of cost associated with this, but presumably that can help quantify the importance of putting resource into helping move more of the coverage implementation (and searchfox) directly into taskcluster.

On a related note, I've fixed https://github.com/mozilla/code-coverage/issues/719, which will make the coverage ingestion task upload the artifact you need and index itself ASAP (~20 minutes instead of more than 2 hours). So the delay between a push and its ingestion will be around 2 hours shorter (in total it is: runtime of the decision task + runtime of the coverage build + runtime of the coverage tests + partial runtime of the ingestion task).

This is fantastic, thank you!

Depends on: 1697857

Most bugs blocking this have been fixed, we should be in a much better shape now.

(In reply to Marco Castelluccio [:marco] from comment #18)

Most bugs blocking this have been fixed, we should be in a much better shape now.

Yeah, I think we should be in fantastic shape now. Thank you so much! I am very excited to have the coverage data consistently available, I've definitely regretted its absence on the runs where it's not present.

To summarize where we are and what happened in the bug:

  • We now only pull coverage data for the explicit revision that we were indexing, ensuring that there is no misleading mismatch between the coverage data and the code.
  • We now expect to always have coverage data available:
    • Bug 1697857 runs a cron job every 30 minutes to make sure that the code coverage ingestion/aggregation task is scheduled in a timely fashion even if there are other jobs in the task group that take a long time.
    • https://github.com/mozilla/code-coverage/issues/719 changed the ingestion/aggregation task to upload its artifacts ASAP, shaving ~2 hours off the latency.
    • The mozilla-central searchfox indexing runs were moved to provide adequate time for all of this too happen. https://github.com/mozsearch/mozsearch-mozilla#how-searchfoxorg-stays-up-to-date documents the current times and will be updated as they are changed.
      • The searchfox runs probably now are too generous and can have their padding reduced. Currently we allow 6 hours from the nightly cron jobs being initiated (which the coverage jobs can sometimes get a head start on because they run on every merge to mozilla-central). I'll file a follow-up on this, but it's not urgent.
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Summary: Remove "-latest" code coverage fallback to avoid stale data. → Ensure coverage data is always exactly for the indexed build and ensure indexers have a high probability of having coverage data available.
Blocks: 1699350
Blocks: 1750240
You need to log in before you can comment on or make changes to this bug.