1296679 - Autophone - pulse processing stalled 2016-08-19

Assignee

Description

•

8 years ago

I received growth warnings for the autophone pulse build queues. It appears the processing of the pulse messages on autophone has stalled.

queue/autophone1/builds
queue/autophone1/task-completed
queue/autophone1/jobactions

This may be fall out from Bug 1277653 with potential dead locks due to the io to get the changesets or due to some other pulse/treeherder related issue.

Bob Clary [:bc] (inactive)

Assignee

Comment 1

•

8 years ago

I've temporarily reverted Bug 1277653 locally on each host and enabled debug level logging. I am processing the pulse messages but it doesn't appear to be removing them from the queues according to pulseguardian.

Bob Clary [:bc] (inactive)

Assignee

Comment 2

•

8 years ago

After some time on Friday the queues begain updating. I've just finished upgrading from Ubuntu 14 to 16 and once again pulseguardian isn't updating the queue sizes in a timely fashion.

Mark, any ideas on why pulse guardian is slow to update?

Flags: needinfo?(mcote)

Bob Clary [:bc] (inactive)

Assignee

Comment 3

•

8 years ago

My first attempt to fix this issue failed during this morning's test. My attempt involved moving get_changeset_dirs to get_build_data and caching the changeset_dirs in the build meta data and the jobs table.

I didn't see any problems while testing over the weekend, so I must assume this is load dependent. I believe I may have a deadlock which is exacerbated by the file io getting the build, the symbols, tests and build's meta data. Still investigating.

Mark Côté [:mcote]

Updated

•

8 years ago

Depends on: 1297269

Mark Côté [:mcote]

Comment 4

•

8 years ago

This was very similar to something TaskCluster recently experienced, with a slowly growing queue being deleted without a warning message (only a deletion message).

I did some digging, and I think I figured out what's going on.  See bug 1297269.

Flags: needinfo?(mcote)

Bob Clary [:bc] (inactive)

Assignee

Comment 5

•

8 years ago

Mark, thanks! I think that explains my issues with PulseGuardian.

With respect to the original cause of this bug...

I reproduced the "dead-lock" running with verbose debug loglevel locally even with the patch I was testing. In this example the problem was due to a merge into the larch branch which had over 7000 revisions. This essentially stalled the Pulse Monitor thread which prevented any new pulse messages from being processed. This explains the "random" nature of the hang.

Note that getting the IO off of the various threads is still important since it keeps autophone more responsive to commands.

I can get the repo from the task definition's payload env MH_BRANCH and eliminate irrelevant branches before we attempt to call get_build_data. In get_changeset_dirs we can also check if the number of changesets exceeds a threshold value. If the number of changesets exceeds the threshold, I intend to return [''] which will effectively match any directory and cause the build to be tested regardless of the test's directory limitations. I think ignoring directory restrictions if there are more than 32 revisions should be sufficient.

Running a test now.

Dan Minor [:dminor]

Comment 6

•

8 years ago

I think it makes good sense to ignore directory restrictions if there are a large number of revisions, as merges happen relatively seldom anyway. Fixing Bug 1286353 would help a bit here, but I think over 7000 revisions would still be slow even if we were to hit a json endpoint for the data.

Bob Clary [:bc] (inactive)

Assignee

Comment 7

•

8 years ago

Attached patch bug-1296679-v1.patch — Details — Splinter Review

Limits number of changesets to 32 when detecting changeset directories.

Uses an empty string for a wild card changeset directory. Also handles case where a file in the source root directory was changed.

Fixes a bug where the diff url was not updated for each changeset when retrieving changeset directories.

Retrieves changeset directories as a part of get_build_data.

Caches changeset directories in build meta data and jobs.

Increases wait when attempting to recover from lost Pulse connection while releasing shared_lock to enable other Threads/Processes to continue. Should prevent some previous dead lock conditions.

Captures repo from Taskcluster task definition to perform early filter on repositories while processing Pulse messages.

REQUIRES:

jobs.sqlite schema change requires jobs.sqlite be empty and deleted prior to deployment.

build metadata change requires cached builds to be deleted prior to deployment.

Attachment #8784413 - Flags: review?(jmaher)

Attachment #8784413 - Flags: feedback?(dminor)

Dan Minor [:dminor]

Comment 8

•

8 years ago

Comment on attachment 8784413 [details] [diff] [review]
bug-1296679-v1.patch

Review of attachment 8784413 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks for investigating and fixing this!

::: builds.py
@@ +1382,4 @@
>                   id=None,
>                   revision=None,
>                   changeset=None,
> +                 changeset_dirs=[],

Should be a set()?

::: phonetest.py
@@ +225,4 @@
>                      if matched:
>                          break
>                      for td in test.run_if_changed:
> +                        if cd == "" or cd.startswith(td):

I thought we used an empty set() to signal that we wanted the tests to run regardless of filtering?

Attachment #8784413 - Flags: feedback?(dminor) → feedback+

Bob Clary [:bc] (inactive)

Assignee

Comment 9

•

8 years ago

(In reply to Dan Minor [:dminor] from comment #8)
> Comment on attachment 8784413 [details] [diff] [review]
> bug-1296679-v1.patch
> 
> Review of attachment 8784413 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good, thanks for investigating and fixing this!
> 
> ::: builds.py
> @@ +1382,4 @@
> >                   id=None,
> >                   revision=None,
> >                   changeset=None,
> > +                 changeset_dirs=[],
> 
> Should be a set()?
> 

Actually the only place we need a set is in PhoneTest. Making it a list elsewhere makes it possible to convert it to a json string for storage in the db or the meta data file. Forgot to mention that.

> ::: phonetest.py
> @@ +225,4 @@
> >                      if matched:
> >                          break
> >                      for td in test.run_if_changed:
> > +                        if cd == "" or cd.startswith(td):
> 
> I thought we used an empty set() to signal that we wanted the tests to run
> regardless of filtering?

That's true. In the case where we are limiting the changesets we could have returned an empty list. We would still need the cd == "" though since when we change a file in the source's root directory the changeset_dirs include "" which I think should force a test as well.

Bob Clary [:bc] (inactive)

Assignee

Comment 10

•

8 years ago

Forgot to mention that there are probably too many logger.debug statements in get_changeset_dirs. I can drop them if you want.

Joel Maher ( :jmaher ) (UTC -8)

Comment 11

•

8 years ago

Comment on attachment 8784413 [details] [diff] [review]
bug-1296679-v1.patch

Review of attachment 8784413 [details] [diff] [review]:
-----------------------------------------------------------------

some of this is rubber stamp, the rest looks good- I think the debug messages are good to leave in.

Attachment #8784413 - Flags: review?(jmaher) → review+

Bob Clary [:bc] (inactive)

Assignee

Comment 12

•

8 years ago

https://github.com/mozilla/autophone/commit/8ab7eb189723aab5b52127ff65cb4a06920fc2b5

per dminor's comment I changed get_changeset_dirs to return [] when the number of changesets exceeds the maximum.

I'll deploy as soon as the job count reaches zero.

Blocks: autophone-deployments

Joel Maher ( :jmaher ) (UTC -8)

Comment 13

•

8 years ago

I have updated autophone-1 with the latest code and it is running.

Bob Clary [:bc] (inactive)

Assignee

Updated

•

8 years ago

Blocks: 1277653

Bob Clary [:bc] (inactive)

Assignee

Comment 14

•

8 years ago

deployed autophone-2,3 2016-08-27 ~14:00:00 PDT

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

3 years ago

Product: Testing → Testing Graveyard

Bugzilla

Quick Search

Autophone - pulse processing stalled 2016-08-19

Categories

(Testing Graveyard :: Autophone, defect, P1)

Tracking

(Not tracked)

People

(Reporter: bc, Assigned: bc)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Updated

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Updated

Comment 14

Updated

Attachment

General

Description

File Name

Content Type