Closed Bug 1528249 Opened 6 years ago Closed 6 years ago

taskcluster-github stops posting comments to PRs / creating tasks

Categories

(Taskcluster :: Services, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: dustin)

Details

Three times today, people reporting that taskcluster-github had stopped working. After restarting taskcluster-github dynos, it kicked into action and created tasks, and posted comments on PRs, annotated commits with ticks/crosses, etc.

The papertrail logs didn't display any crashes as far as I could see.

Affected people were bastien, grenade, marco.

See #ci and #taskcluster from today.

(although grenade contacted me privately)

https://papertrailapp.com/systems/taskcluster-github/events?focus=1032853272819605525&selected=1032853272819605525

Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:11.779Z taskcluster-github:handlers:3e2269c0-30b4-11e9-8754-d6b5c4a0f1ea Received message. Starting processing... 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:11.871Z taskcluster-github:github-auth Authenticated as installation: 42268 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:11.871Z taskcluster-github:handlers:3e2269c0-30b4-11e9-8754-d6b5c4a0f1ea handling pull_request.synchronize webhook for: mozilla-mobile/android-components@60bb3eb6ae9299fd6837ac918be2845064af27aa 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:11.871Z taskcluster-github:handlers:3e2269c0-30b4-11e9-8754-d6b5c4a0f1ea Checking pull request permission for mozilla-mobile/android-components@60bb3eb6ae9299fd6837ac918be2845064af27aa... 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:12.044Z taskcluster-github:handlers:3e2269c0-30b4-11e9-8754-d6b5c4a0f1ea Trying to fetch the YML for mozilla-mobile/android-components@60bb3eb6ae9299fd6837ac918be2845064af27aa 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:12.142Z taskcluster-github:intree intree config for mozilla-mobile/android-components matches valid schema. 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:12.146Z taskcluster-github:handlers:3e2269c0-30b4-11e9-8754-d6b5c4a0f1ea Trying to create a record for mozilla-mobile/android-components@60bb3eb6ae9299fd6837ac918be2845064af27aa (pending) in Builds table 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:12.146Z azure:authorization Refreshing shared-access-signature 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:12.146Z taskcluster-client Calling: azureTableSAS, retry: 0 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:12.200Z taskcluster-client Success calling: azureTableSAS, (0 retries) 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:12.200Z azure:authorization Refreshed shared-access-signature, will refresh at 2019-02-15T00:07:12.000Z 
Feb 14 23:57:12 taskcluster-github app/worker.1: 2019-02-14T23:57:12.200Z azure:table Request: POST /TaskclusterGithubBuilds, retry: 0 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:12.620Z taskcluster-github:handlers:3e2269c0-30b4-11e9-8754-d6b5c4a0f1ea Creating tasks for mozilla-mobile/android-components@60bb3eb6ae9299fd6837ac918be2845064af27aa (taskGroupId: MHuk9RMCQqCSk7Bpws1aMg) 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:12.620Z taskcluster-client Calling: createTask, retry: 0 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:13.584Z taskcluster-client Success calling: createTask, (0 retries) 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:13.584Z taskcluster-github:handlers:3e2269c0-30b4-11e9-8754-d6b5c4a0f1ea Publishing status exchange for mozilla-mobile/android-components@60bb3eb6ae9299fd6837ac918be2845064af27aa (pending) 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:13.585Z taskcluster-lib-pulse.publisher Publishing message on exchange exchange/taskcluster-github/v1/task-group-creation-requested, routing key primary.mozilla-mobile.android-components 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:13.647Z taskcluster-github:handlers:taskGroup-handler Task group MHuk9RMCQqCSk7Bpws1aMg was defined. Creating group status... 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:13.648Z azure:table Request: GET /TaskclusterGithubBuilds(PartitionKey='MHuk9RMCQqCSk7Bpws1aMg',RowKey='taskGroupId'), retry: 0 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:13.650Z taskcluster-github:handlers:3e2269c0-30b4-11e9-8754-d6b5c4a0f1ea Job handling for mozilla-mobile/android-components@60bb3eb6ae9299fd6837ac918be2845064af27aa completed. 
Feb 14 23:57:13 taskcluster-github app/worker.1: 2019-02-14T23:57:13.815Z taskcluster-github:github-auth Authenticated as installation: 42268 
Feb 14 23:57:49 taskcluster-github app/worker.1: 2019-02-14T23:57:49.363Z taskcluster-lib-pulse.client-1 recycling 
Feb 14 23:57:49 taskcluster-github app/worker.1: 2019-02-14T23:57:49.364Z taskcluster-lib-pulse.conn-1 retiring 
Feb 14 23:57:49 taskcluster-github app/worker.1: 2019-02-14T23:57:49.365Z taskcluster-lib-pulse.conn-2 waiting 
Feb 14 23:57:49 taskcluster-github app/worker.1: 2019-02-14T23:57:49.367Z taskcluster-lib-pulse.conn-2 connecting 
Feb 14 23:57:50 taskcluster-github app/worker.1: 2019-02-14T23:57:49.793Z taskcluster-lib-pulse.conn-2 connected 
Feb 14 23:57:50 taskcluster-github app/worker.1: 2019-02-14T23:57:49.916Z taskcluster-lib-pulse.publisher using new channel 
Feb 14 23:58:19 taskcluster-github app/worker.1: 2019-02-14T23:58:19.373Z taskcluster-lib-pulse.conn-1 finished; closing AMQP connection 
Feb 15 00:57:49 taskcluster-github app/worker.1: 2019-02-15T00:57:49.365Z taskcluster-lib-pulse.client-1 recycling 
Feb 15 00:57:49 taskcluster-github app/worker.1: 2019-02-15T00:57:49.366Z taskcluster-lib-pulse.conn-2 retiring 
Feb 15 00:57:49 taskcluster-github app/worker.1: 2019-02-15T00:57:49.366Z taskcluster-lib-pulse.conn-3 waiting 
Feb 15 00:57:49 taskcluster-github app/worker.1: 2019-02-15T00:57:49.367Z taskcluster-lib-pulse.conn-3 connecting 
Feb 15 00:57:50 taskcluster-github app/worker.1: 2019-02-15T00:57:49.797Z taskcluster-lib-pulse.conn-3 connected 
Feb 15 00:57:50 taskcluster-github app/worker.1: 2019-02-15T00:57:49.924Z taskcluster-lib-pulse.publisher using new channel 

and that pattern of just logging about lib-pulse continued until the restart. It's normal for lib-pulse to reconnect once an hour, so it's more the absence of anything else here that's unusual. After the restart, the handler started handling a bunch of queued messages:

https://papertrailapp.com/systems/taskcluster-github/events?focus=1032984015067619336&q=program%3Aapp%2Fworker.1&selected=1032984015067619336


Feb 15 08:36:43 taskcluster-github app/worker.1: 2019-02-15T08:36:43.199Z taskcluster-github:handlers Constructing handlers... 
Feb 15 08:36:43 taskcluster-github app/worker.1: 2019-02-15T08:36:43.200Z taskcluster-github:handlers Setting up handlers... 
Feb 15 08:36:44 taskcluster-github app/worker.1: 2019-02-15T08:36:44.020Z taskcluster-github:handlers:86673166-30b4-11e9-9dfd-fdc9f852c3d9 Received message. Starting processing... 
Feb 15 08:36:44 taskcluster-github app/worker.1: 2019-02-15T08:36:44.086Z taskcluster-github:github-auth Authenticated as installation: 9218 
Feb 15 08:36:44 taskcluster-github app/worker.1: 2019-02-15T08:36:44.086Z taskcluster-github:handlers:86673166-30b4-11e9-9dfd-fdc9f852c3d9 handling push webhook for: mozilla/treeherder@112d79c55178c088b93834983c9f4b7b8857459d 
Feb 15 08:36:44 taskcluster-github app/worker.1: 2019-02-15T08:36:44.086Z taskcluster-github:handlers:86673166-30b4-11e9-9dfd-fdc9f852c3d9 handling push webhook for: mozilla/treeherder@112d79c55178c088b93834983c9f4b7b8857459d 
Feb 15 08:36:44 taskcluster-github app/worker.1: 2019-02-15T08:36:44.086Z taskcluster-github:handlers:86673166-30b4-11e9-9dfd-fdc9f852c3d9 Trying to fetch the YML for mozilla/treeherder@112d79c55178c088b93834983c9f4b7b8857459d 
Feb 15 08:36:44 taskcluster-github app/worker.1: 2019-02-15T08:36:44.091Z taskcluster-github:handlers:87261fe0-30b4-11e9-99b1-43832321c115 Received message. Starting processing... 
Feb 15 08:36:44 taskcluster-github app/worker.1: 2019-02-15T08:36:44.140Z taskcluster-github:handlers:87706f50-30b4-11e9-80b9-9ec88b62e128 Received message. Starting processing... 
Feb 15 08:36:44 taskcluster-github app/worker.1: 2019-02-15T08:36:44.147Z taskcluster-github:handlers:877689d0-30b4-11e9-87b9-d1044833a0bd Received message. Starting processing...

It looks like webhooks were still being received and published to pulse after 23:57:13. For example:

Feb 15 00:01:35 taskcluster-github app/web.1: 2019-02-15T00:01:35.543Z taskcluster-lib-pulse.publisher Publishing message on exchange exchange/taskcluster-github/v1/push, routing key primary.web-platform-tests.wpt 

At the moment, queue/taskcluster-github/stat-result and queue/taskcluster-github/jobs have pending messages, starting about 10 minutes ago.

This seems to be associated with lib-pulse reconnects: after the first reconnection ("conn-1 retiring"), nothing happens. That seems to point to https://github.com/taskcluster/taskcluster/pull/236 / bug 1527427.

Assignee: nobody → dustin

Yup, that was dumb: https://github.com/taskcluster/taskcluster/pull/236/files#diff-ec0662c7dee98f0a1f01cae19043c47fR86 doesn't shut down the whole consumer, it just shuts down use of the current connection. But the patch happily stops listening for new connection messages in _shutdown. Ergo, on reconnection, it never starts using the new connection.

I rolled back to the Feb 11 deployment of tc-github temporarily (/ni owlish to avoid surprise). But this will be affecting anything deployed in the last 17 hours, so we should get the fix landed expeditiously and roll forward.

Flags: needinfo?(bugzeeeeee)
Flags: needinfo?(bugzeeeeee)

Deployed -- hopefully that will fix things up!

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.