Closed Bug 1759071 Opened 4 years ago Closed 3 years ago

move mozilla-central indexing to use coverage aggregation jobs as the basis for indexing revisions instead of the latest searchfox indexing job

Categories

(Webtools :: Searchfox, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: asuth, Assigned: asuth)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

Bug 1758746 has landed and I can confirm that the searchfox indexing jobs are now running against pushes as opposed to based on the nightly cron revision.

As long as merges only happen twice a day, within the few hours preceding the nightly, the existing time-based behavior should reliably get us coverage data. In general, though, there's a risk of a new push landing in mozilla-central during the coverage "bake" time, which could result in searchfox picking that run (which won't have had the bake time).

So we should switch the mozsearch-mozilla logic for mozilla-central to use the coverage aggregation artifact as the basis for the revision to index. The only real concern is that if the coverage automation breaks, we wouldn't want searchfox to get stuck indexing that same revision[1]. So ideally the logic should require that the aggregation is from within the last 24 hours or so.
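As a rough illustration of the recency requirement described above, here is a minimal Python sketch. It is not the mozsearch-mozilla implementation (that lives in shell scripts); the function names, the exact 24-hour threshold, and the fallback to "latest" when the coverage task is too old are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold; the bug only says "within the last 24 hours or so".
MAX_COVERAGE_AGE = timedelta(hours=24)


def parse_tc_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp such as 2022-07-24T09:23:20.628Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def choose_revision_id(task_created: str, coverage_rev: str, now: datetime) -> str:
    """Pick the revision id to index.

    Uses the coverage task's revision when the task was created recently
    enough; otherwise falls back to "latest" (an assumed fallback) so that
    indexing does not get stuck on an old revision.
    """
    age = now - parse_tc_timestamp(task_created)
    if timedelta(0) <= age <= MAX_COVERAGE_AGE:
        return f"revision.{coverage_rev}"
    return "latest"
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the decision reproducible when replaying a past run.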

The script I've used to help figure out the timing of the various artifacts at https://gist.github.com/asutherland/61e44bee962a15a9154bbd76838493c4 may be the best basis for this since that already works and it could potentially also help provide some diagnostic information about the decision process. We may need to modify the provisioning scripts, however.

1: In particular, as relates to bug 1750240 about us not being good about indicating when we're missing all coverage data, we're even worse about conveying the searchfox index being out of date. We might actually want to consider linking to treeherder and capturing the date-stamp of the push that we've indexed.

I landed the script I've been using to diagnose the general situation when searchfox is missing coverage data; it's the gist I referenced previously. The script likely still demonstrates some of the underlying requests we could make to accomplish the goals of this bug, although maybe without the coreapi or taskcluster helpers: at least the coreapi provisioning will have bitrotted already, and the taskcluster install steps are currently maybe a bit manual?

Assignee: nobody → bugmail
Status: NEW → ASSIGNED

This appears to have worked correctly on the test run. The previous heuristic would have picked the right revision too, but in the log below we can clearly see the new heuristic operating and intervening:

+ REVISION_TREE=mozilla-central
+ REVISION_ID=latest
+ TC_ROOT_URL=https://firefox-ci-tc.services.mozilla.com
+ TC_INDEX_API_URL=https://firefox-ci-tc.services.mozilla.com/api/index/v1/task
+ COVERAGE_ROUTE=project.relman.code-coverage.production.repo.mozilla-central.latest
+ TC_QUEUE_API_URL=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task
+ TRYPUSH_REV=
+ '[' -n '' ']'
++ curl -ssL https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/project.relman.code-coverage.production.repo.mozilla-central.latest
++ jq -Mr .taskId
+ COV_TASKID=VhqwNlu7Q9aFt0zLhLysQQ
++ curl -ssL https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VhqwNlu7Q9aFt0zLhLysQQ
+ COV_TASK_INFO='{
  "provisionerId": "code-coverage",
  "workerType": "bot",
  "taskQueueId": "code-coverage/bot",
  "schedulerId": "relman",
  "projectId": "none",
  "taskGroupId": "VhqwNlu7Q9aFt0zLhLysQQ",
  "dependencies": [],
  "requires": "all-completed",
  "routes": [],
  "priority": "lowest",
  "retries": 5,
  "created": "2022-07-24T09:23:20.628Z",
  "deadline": "2022-07-24T13:23:20.628Z",
  "expires": "2022-08-23T09:23:20.628Z",
  "scopes": [
    "secrets:get:project/relman/code-coverage/runtime-production",
    "notify:email:*",
    "docker-worker:cache:code-coverage-bot-production",
    "index:insert-task:project.relman.code-coverage.production.repo.*"
  ],
  "payload": {
    "env": {
      "REVISION": "f69015bf0e0a23e95d424016cde68ee18534881d",
      "REPOSITORY": "https://hg.mozilla.org/mozilla-central"
    },
    "cache": {
      "code-coverage-bot-production": "/cache"
    },
    "image": {
      "path": "public/code-coverage-bot.tar",
      "type": "indexed-image",
      "namespace": "code-analysis.v2.code-coverage.branch.production"
    },
    "command": [
      "code-coverage-repo",
      "--taskcluster-secret",
      "project/relman/code-coverage/runtime-production",
      "--cache-root",
      "/cache",
      "--working-dir",
      "/build"
    ],
    "features": {
      "taskclusterProxy": true
    },
    "maxRunTime": 14400,
    "capabilities": {}
  },
  "metadata": {
    "name": "Code Coverage aggregation task - repo (production)",
    "owner": "release-mgmt-analysis@mozilla.com",
    "source": "https://github.com/mozilla/code-coverage",
    "description": ""
  },
  "tags": {},
  "extra": {}
}'
++ jq -Mr .payload.env.REVISION
+ COV_HG_REV=f69015bf0e0a23e95d424016cde68ee18534881d
++ jq -Mr '.created | sub("\\.[0-9]+Z$"; "Z") | (now - fromdate) / (24 * 60 * 60) | floor'
+ COV_RECENCY_DAYS=0
+ [[ 0 = \0 ]]
+ REVISION_ID=revision.f69015bf0e0a23e95d424016cde68ee18534881d
+ echo 'Coverage data is recent enough, using explicit revision: revision.f69015bf0e0a23e95d424016cde68ee18534881d'
Coverage data is recent enough, using explicit revision: revision.f69015bf0e0a23e95d424016cde68ee18534881d
+ source /home/ubuntu/config/shared/resolve-gecko-revs.sh mozilla-central revision.f69015bf0e0a23e95d424016cde68ee18534881d
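The two requests in the log above (index route to task id, then task id to task definition) can be sketched in Python as follows. This is only an illustration of the lookup chain, not the script that produced the log; the URLs and JSON field names are taken from the log, while the function names and the injectable `fetch` parameter are assumptions added so the chain can be exercised without network access.

```python
import json
from typing import Callable, Dict, Tuple
from urllib.request import urlopen

TC_ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"
COVERAGE_ROUTE = "project.relman.code-coverage.production.repo.mozilla-central.latest"


def fetch_json(url: str) -> Dict:
    """Default fetcher: GET the URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def resolve_coverage_revision(
    fetch: Callable[[str], Dict] = fetch_json,
) -> Tuple[str, str, str]:
    """Return (task id, hg revision, created timestamp) for the latest coverage task."""
    # Step 1: the index route tells us which task currently holds "latest".
    index_entry = fetch(f"{TC_ROOT_URL}/api/index/v1/task/{COVERAGE_ROUTE}")
    task_id = index_entry["taskId"]
    # Step 2: the task definition carries the revision and creation time.
    task = fetch(f"{TC_ROOT_URL}/api/queue/v1/task/{task_id}")
    return task_id, task["payload"]["env"]["REVISION"], task["created"]
```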

And then in resolve-gecko-revs.sh, as sourced, we can see it come to the right conclusion as a result of using the explicit revision:

++ '[' -z /home/ubuntu/mozsearch ']'
++ '[' -z /mnt/index-scratch/mozilla-central ']'
++ REVISION_TREE=mozilla-central
++ REVISION_ID=revision.f69015bf0e0a23e95d424016cde68ee18534881d
++ date
Sun Jul 24 22:50:36 UTC 2022
++ REVISION=mozilla-central.revision.f69015bf0e0a23e95d424016cde68ee18534881d
++ CURL='curl -SsfL --compressed'
++ pushd /mnt/index-scratch/mozilla-central
/mnt/index-scratch/mozilla-central ~
++ PREEXISTING_HG_REV=
++ '[' -f target.json ']'
++ curl -SsfL --compressed https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/gecko.v2.mozilla-central.revision.f69015bf0e0a23e95d424016cde68ee18534881d.firefox.linux64-searchfox-debug/artifacts/public/build/target.json
+++ jq -r .moz_source_stamp target.json
++ INDEXED_HG_REV=f69015bf0e0a23e95d424016cde68ee18534881d
++ popd
~
+ /home/ubuntu/config/shared/checkout-gecko-repos.sh mozilla-central master f69015bf0e0a23e95d424016cde68ee18534881d

Note that there is a little redundancy there because we fetch a specific revision's target.json which tells us that same revision, but this is fine.
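To make the redundancy concrete, here is a small Python sketch of the same step. The URL layout and the `moz_source_stamp` field come from the log above; the function names are hypothetical and this is not how resolve-gecko-revs.sh itself is written.

```python
import json

TC_INDEX_API_URL = "https://firefox-ci-tc.services.mozilla.com/api/index/v1/task"


def target_json_url(tree: str, revision_id: str) -> str:
    """Build the target.json artifact URL for a tree and revision id."""
    route = f"gecko.v2.{tree}.{revision_id}.firefox.linux64-searchfox-debug"
    return f"{TC_INDEX_API_URL}/{route}/artifacts/public/build/target.json"


def indexed_hg_rev(target_json_text: str) -> str:
    """Extract the source revision recorded in a target.json document."""
    return json.loads(target_json_text)["moz_source_stamp"]


def matches_requested_revision(revision_id: str, target_json_text: str) -> bool:
    """Check that an explicit "revision.<hash>" id agrees with target.json."""
    return revision_id == f"revision.{indexed_hg_rev(target_json_text)}"
```

With an explicit revision id the check is trivially true; with "latest" the target.json is the only source of the hash, which is why the lookup is kept in both cases.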

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Apparently, as implemented, the route gets updated before the artifact gets uploaded. I'm not sure whether the route logic would operate differently if we were explicitly downloading the artifact.

Quoting what I just noted in #codecoverage: the utc10 run doesn't have coverage because it decided at Mon Jul 25 16:06:28 UTC 2022 to use 00a40cdda673bdbe4f7831d1cf078909ad54182e but according to the (updated) tc-coverage-timestamps.sh the artifact was only uploaded (per subtracting off expiration) at 2022-07-25T16:41:09Z, which is 35 minutes after the decision was made.
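For reference, the gap between the two timestamps quoted above works out to 34 minutes 41 seconds, which rounds to the 35 minutes mentioned. A quick Python check (the variable names are just for this example):

```python
from datetime import datetime, timedelta, timezone


def parse_utc(value: str) -> datetime:
    """Parse a UTC timestamp such as 2022-07-25T16:41:09Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# When searchfox decided which revision to index (from the log's `date` output).
decision_time = parse_utc("2022-07-25T16:06:28Z")
# When tc-coverage-timestamps.sh says the coverage artifact was uploaded.
artifact_uploaded = parse_utc("2022-07-25T16:41:09Z")

gap = artifact_uploaded - decision_time
print(gap)  # 0:34:41
```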

Here's the log excerpt for the decision-making process:

+ date
Mon Jul 25 16:06:28 UTC 2022
+ REVISION_TREE=mozilla-central
+ REVISION_ID=latest
+ TC_ROOT_URL=https://firefox-ci-tc.services.mozilla.com
+ TC_INDEX_API_URL=https://firefox-ci-tc.services.mozilla.com/api/index/v1/task
+ COVERAGE_ROUTE=project.relman.code-coverage.production.repo.mozilla-central.latest
+ TC_QUEUE_API_URL=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task
+ TRYPUSH_REV=
+ '[' -n '' ']'
++ curl -ssL https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/project.relman.code-coverage.production.repo.mozilla-central.latest
++ jq -Mr .taskId
+ COV_TASKID=fVTPqf9WSmSPXnaRBJGgww
++ curl -ssL https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/fVTPqf9WSmSPXnaRBJGgww
+ COV_TASK_INFO='{
  "provisionerId": "code-coverage",
  "workerType": "bot",
  "taskQueueId": "code-coverage/bot",
  "schedulerId": "relman",
  "projectId": "none",
  "taskGroupId": "Fqf8NxCiSk6npfEaAE0-DA",
  "dependencies": [],
  "requires": "all-completed",
  "routes": [],
  "priority": "lowest",
  "retries": 5,
  "created": "2022-07-25T14:01:49.527Z",
  "deadline": "2022-07-25T18:01:49.527Z",
  "expires": "2022-08-24T14:01:49.527Z",
  "scopes": [
    "secrets:get:project/relman/code-coverage/runtime-production",
    "notify:email:*",
    "docker-worker:cache:code-coverage-bot-production",
    "index:insert-task:project.relman.code-coverage.production.repo.*"
  ],
  "payload": {
    "env": {
      "REVISION": "00a40cdda673bdbe4f7831d1cf078909ad54182e",
      "taskName": "covdir for 00a40cdda673bdbe4f7831d1cf078909ad54182e",
      "REPOSITORY": "https://hg.mozilla.org/mozilla-central",
      "taskGroupId": "Fqf8NxCiSk6npfEaAE0-DA"
    },
    "cache": {
      "code-coverage-bot-production": "/cache"
    },
    "image": {
      "path": "public/code-coverage-bot.tar",
      "type": "indexed-image",
      "namespace": "code-analysis.v2.code-coverage.branch.production"
    },
    "command": [
      "code-coverage-repo",
      "--taskcluster-secret",
      "project/relman/code-coverage/runtime-production",
      "--cache-root",
      "/cache",
      "--working-dir",
      "/build"
    ],
    "features": {
      "taskclusterProxy": true
    },
    "maxRunTime": 14400,
    "capabilities": {}
  },
  "metadata": {
    "name": "covdir for 00a40cdda673bdbe4f7831d1cf078909ad54182e",
    "owner": "release-mgmt-analysis@mozilla.com",
    "source": "https://github.com/mozilla/code-coverage",
    "description": ""
  },
  "tags": {},
  "extra": {}
}'
++ jq -Mr .payload.env.REVISION
+ COV_HG_REV=00a40cdda673bdbe4f7831d1cf078909ad54182e
++ jq -Mr '.created | sub("\\.[0-9]+Z$"; "Z") | (now - fromdate) / (24 * 60 * 60) | floor'
+ COV_RECENCY_DAYS=0
+ [[ 0 = \0 ]]
+ REVISION_ID=revision.00a40cdda673bdbe4f7831d1cf078909ad54182e
+ echo 'Coverage data is recent enough, using explicit revision: revision.00a40cdda673bdbe4f7831d1cf078909ad54182e'
Coverage data is recent enough, using explicit revision: revision.00a40cdda673bdbe4f7831d1cf078909ad54182e
+ source /home/ubuntu/config/shared/resolve-gecko-revs.sh mozilla-central revision.00a40cdda673bdbe4f7831d1cf078909ad54182e

And the output of tc-coverage-timestamps.sh from the mozsearch tree where our relevant job is the 3rd from the bottom:

- f69015bf0e0a23e95d424016cde68ee18534881d artifact project.relman.code-coverage.production.repo.mozilla-central.f69015bf0e0a23e95d424016cde68ee18534881d
  - 2022-07-23T10:06:44.381Z - indexing completed in job Cz0lVi4TSO2kSKCaJCP0Jg
  - 2022-07-24T09:36:13Z - artifact uploaded
- a957b284c6130bed2c72787e7795fcc10237e8fc artifact project.relman.code-coverage.production.repo.mozilla-central.a957b284c6130bed2c72787e7795fcc10237e8fc
  - 2022-07-23T22:46:54.782Z - indexing completed in job FZFg5wtrTBmZslzdPwAneg
  - 2022-07-24T01:48:28Z - artifact uploaded
- 99811200c0e185760002042ee00ba431011eedbb artifact project.relman.code-coverage.production.repo.mozilla-central.99811200c0e185760002042ee00ba431011eedbb
  - 2022-07-25T05:11:50.250Z - indexing completed in job J-v4jWksTemu6B2Hg6aHRQ
  - 2022-07-25T07:44:04Z - artifact uploaded
- 00a40cdda673bdbe4f7831d1cf078909ad54182e artifact project.relman.code-coverage.production.repo.mozilla-central.00a40cdda673bdbe4f7831d1cf078909ad54182e
  - 2022-07-25T10:47:57.226Z - indexing completed in job NFUTDH1GQ1SxXsZOGcIp6g
  - 2022-07-25T16:41:09Z - artifact uploaded
- aefc088708a85e810290b4793474e4582d75adf9 artifact project.relman.code-coverage.production.repo.mozilla-central.aefc088708a85e810290b4793474e4582d75adf9
  - 2022-07-25T16:45:40.582Z - indexing completed in job AARO9Y6CRpiDJ0Uh5FVJXg
  - No such artifact!
- 823b580f3a922ceb7c0fb6af0d340fd12c2367d0 artifact project.relman.code-coverage.production.repo.mozilla-central.823b580f3a922ceb7c0fb6af0d340fd12c2367d0
  - 2022-07-25T17:35:46.646Z - indexing completed in job XUypBzwUSMOZRUlbsoS8YA
  - No such artifact!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

From discussion in https://chat.mozilla.org/#/room/#firefox-ci:mozilla.org it sounds like the searchfox logic is good here, but there was a hiccup with the coverage mechanism yesterday. Specifically, the task id for the "latest" route that we see in comment 3 experienced a bunch of failures:

$ taskcluster api queue status fVTPqf9WSmSPXnaRBJGgww
{
  "status": {
    "taskId": "fVTPqf9WSmSPXnaRBJGgww",
    "provisionerId": "code-coverage",
    "workerType": "bot",
    "taskQueueId": "code-coverage/bot",
    "schedulerId": "relman",
    "projectId": "none",
    "taskGroupId": "Fqf8NxCiSk6npfEaAE0-DA",
    "deadline": "2022-07-25T18:01:49.527Z",
    "expires": "2022-08-24T14:01:49.527Z",
    "retriesLeft": 2,
    "state": "exception",
    "runs": [
      {
        "runId": 0,
        "state": "exception",
        "reasonCreated": "scheduled",
        "reasonResolved": "claim-expired",
        "workerGroup": "us-east-1",
        "workerId": "i-00dc80360afb1e7e1",
        "takenUntil": "2022-07-25T15:13:15.629Z",
        "scheduled": "2022-07-25T14:01:49.559Z",
        "started": "2022-07-25T14:03:14.929Z",
        "resolved": "2022-07-25T15:13:17.025Z"
      },
      {
        "runId": 1,
        "state": "exception",
        "reasonCreated": "retry",
        "reasonResolved": "claim-expired",
        "workerGroup": "us-east-1",
        "workerId": "i-0a2f2fa4b1e7f39cd",
        "takenUntil": "2022-07-25T15:43:17.444Z",
        "scheduled": "2022-07-25T15:13:17.025Z",
        "started": "2022-07-25T15:13:17.160Z",
        "resolved": "2022-07-25T15:43:17.611Z"
      },
      {
        "runId": 2,
        "state": "exception",
        "reasonCreated": "retry",
        "reasonResolved": "claim-expired",
        "workerGroup": "us-west-2",
        "workerId": "i-0b7b3d814882a3756",
        "takenUntil": "2022-07-25T16:13:18.755Z",
        "scheduled": "2022-07-25T15:43:17.611Z",
        "started": "2022-07-25T15:43:18.537Z",
        "resolved": "2022-07-25T16:13:20.305Z"
      },
      {
        "runId": 3,
        "state": "exception",
        "reasonCreated": "retry",
        "reasonResolved": "deadline-exceeded",
        "workerGroup": "us-east-1",
        "workerId": "i-0420724aabbea5471",
        "takenUntil": "2022-07-25T18:13:22.113Z",
        "scheduled": "2022-07-25T16:13:20.305Z",
        "started": "2022-07-25T16:13:20.563Z",
        "resolved": "2022-07-25T18:02:50.465Z"
      }
    ]
  }
}
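To relate this status output to the decision time from the earlier log (16:06:28 UTC), here is a small Python sketch that tallies how the runs ended and finds which run was in flight at a given moment. The run data is abbreviated from the output above; the helper names are made up for this example.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Abbreviated from the `taskcluster api queue status` output above.
RUNS = [
    {"runId": 0, "reasonResolved": "claim-expired",
     "started": "2022-07-25T14:03:14.929Z", "resolved": "2022-07-25T15:13:17.025Z"},
    {"runId": 1, "reasonResolved": "claim-expired",
     "started": "2022-07-25T15:13:17.160Z", "resolved": "2022-07-25T15:43:17.611Z"},
    {"runId": 2, "reasonResolved": "claim-expired",
     "started": "2022-07-25T15:43:18.537Z", "resolved": "2022-07-25T16:13:20.305Z"},
    {"runId": 3, "reasonResolved": "deadline-exceeded",
     "started": "2022-07-25T16:13:20.563Z", "resolved": "2022-07-25T18:02:50.465Z"},
]


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp with or without fractional seconds."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in value else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


def resolution_counts(runs: List[Dict]) -> Counter:
    """Count runs by how they were resolved."""
    return Counter(run["reasonResolved"] for run in runs)


def run_active_at(runs: List[Dict], moment: datetime) -> Optional[Dict]:
    """Return the run that had started but not yet resolved at `moment`, if any."""
    for run in runs:
        if parse_ts(run["started"]) <= moment < parse_ts(run["resolved"]):
            return run
    return None


decision_time = parse_ts("2022-07-25T16:06:28Z")
active = run_active_at(RUNS, decision_time)
print(resolution_counts(RUNS), active["runId"] if active else None)
```

On this data, the run in flight at the decision time is run 2, which later resolved as claim-expired; none of the four runs completed.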

And the newly revised coverage script now tells us that the coverage aggregation job went back again and redid its work today, presumably because of all those failures yesterday:

- 00a40cdda673bdbe4f7831d1cf078909ad54182e artifact project.relman.code-coverage.production.repo.mozilla-central.00a40cdda673bdbe4f7831d1cf078909ad54182e
  - 2022-07-25T10:47:57.226Z - indexing completed in job NFUTDH1GQ1SxXsZOGcIp6g
  - 2022-07-26T10:12:51Z - artifact uploaded from task JM4UWLKcSmiZqrNJ_AuzBQ
  - status: {"state":"completed","reasonResolved":"completed","started":"2022-07-26T09:57:17.231Z","resolved":"2022-07-26T11:32:22.782Z"}

If we see more instances like this, it probably makes sense for searchfox to generate more diagnostic logs and to make sure they are available somewhere public on the server, such as a status file exposed by the web server. Specifically, since there are moment-in-time aspects to the decision-making, capturing the state as presented by taskcluster at that time seems useful.
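One possible shape for such a snapshot, sketched in Python. Nothing like this exists in searchfox as far as this bug shows; the function name, the field names, and the idea of serializing to a single JSON document are all assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Dict


def build_decision_snapshot(
    decided_at: datetime,
    coverage_task_id: str,
    coverage_rev: str,
    task_status: Dict,
    chosen_revision_id: str,
) -> str:
    """Serialize what taskcluster reported at decision time, plus the outcome."""
    snapshot = {
        "decidedAt": decided_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "coverageTaskId": coverage_task_id,
        "coverageRevision": coverage_rev,
        # Stored verbatim so later retries or reruns cannot rewrite history.
        "taskStatusAtDecision": task_status,
        "chosenRevisionId": chosen_revision_id,
    }
    return json.dumps(snapshot, indent=2, sort_keys=True)
```

The resulting text could then be written next to the other index output so the web server can expose it; where exactly that file would live is left open here.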

Status: REOPENED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Blocks: 1804315