Closed Bug 1296679 Opened 8 years ago Closed 8 years ago

Autophone - pulse processing stalled 2016-08-19

Categories

(Testing Graveyard :: Autophone, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(1 file)

I received growth warnings for the autophone pulse build queues. It appears the processing of the pulse messages on autophone has stalled.

queue/autophone1/builds
queue/autophone1/task-completed
queue/autophone1/jobactions

This may be fall out from Bug 1277653 with potential dead locks due to the io to get the changesets or due to some other pulse/treeherder related issue.
I've temporarily reverted Bug 1277653 locally on each host and enabled debug level logging. I am processing the pulse messages but it doesn't appear to be removing them from the queues according to pulseguardian.
After some time on Friday the queues begain updating. I've just finished upgrading from Ubuntu 14 to 16 and once again pulseguardian isn't updating the queue sizes in a timely fashion.

Mark, any ideas on why pulse guardian is slow to update?
Flags: needinfo?(mcote)
My first attempt to fix this issue failed during this morning's test. My attempt involved moving get_changeset_dirs to get_build_data and caching the changeset_dirs in the build meta data and the jobs table.

I didn't see any problems while testing over the weekend, so I must assume this is load dependent. I believe I may have a deadlock which is exacerbated by the file io getting the build, the symbols, tests and build's meta data. Still investigating.
Depends on: 1297269
This was very similar to something TaskCluster recently experienced, with a slowly growing queue being deleted without a warning message (only a deletion message).

I did some digging, and I think I figured out what's going on.  See bug 1297269.
Flags: needinfo?(mcote)
Mark, thanks! I think that explains my issues with PulseGuardian.

With respect to the original cause of this bug...

I reproduced the "dead-lock" running with verbose debug loglevel locally even with the patch I was testing. In this example the problem was due to a merge into the larch branch which had over 7000 revisions. This essentially stalled the Pulse Monitor thread which prevented any new pulse messages from being processed. This explains the "random" nature of the hang.

Note that getting the IO off of the various threads is still important since it keeps autophone more responsive to commands.

I can get the repo from the task definition's payload env MH_BRANCH and eliminate irrelevant branches before we attempt to call get_build_data. In get_changeset_dirs we can also check if the number of changesets exceeds a threshold value. If the number of changesets exceeds the threshold, I intend to return [''] which will effectively match any directory and cause the build to be tested regardless of the test's directory limitations. I think ignoring directory restrictions if there are more than 32 revisions should be sufficient.

Running a test now.
I think it makes good sense to ignore directory restrictions if there are a large number of revisions, as merges happen relatively seldom anyway. Fixing Bug 1286353 would help a bit here, but I think over 7000 revisions would still be slow even if we were to hit a json endpoint for the data.
Limits number of changesets to 32 when detecting changeset directories.

Uses an empty string for a wild card changeset directory. Also handles case where a file in the source root directory was changed.

Fixes a bug where the diff url was not updated for each changeset when retrieving changeset directories.

Retrieves changeset directories as a part of get_build_data.

Caches changeset directories in build meta data and jobs.

Increases wait when attempting to recover from lost Pulse connection while releasing shared_lock to enable other Threads/Processes to continue. Should prevent some previous dead lock conditions.

Captures repo from Taskcluster task definition to perform early filter on repositories while processing Pulse messages.

REQUIRES:

jobs.sqlite schema change requires jobs.sqlite be empty and deleted prior to deployment.

build metadata change requires cached builds to be deleted prior to deployment.
Attachment #8784413 - Flags: review?(jmaher)
Attachment #8784413 - Flags: feedback?(dminor)
Comment on attachment 8784413 [details] [diff] [review]
bug-1296679-v1.patch

Review of attachment 8784413 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks for investigating and fixing this!

::: builds.py
@@ +1382,4 @@
>                   id=None,
>                   revision=None,
>                   changeset=None,
> +                 changeset_dirs=[],

Should be a set()?

::: phonetest.py
@@ +225,4 @@
>                      if matched:
>                          break
>                      for td in test.run_if_changed:
> +                        if cd == "" or cd.startswith(td):

I thought we used an empty set() to signal that we wanted the tests to run regardless of filtering?
Attachment #8784413 - Flags: feedback?(dminor) → feedback+
(In reply to Dan Minor [:dminor] from comment #8)
> Comment on attachment 8784413 [details] [diff] [review]
> bug-1296679-v1.patch
> 
> Review of attachment 8784413 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good, thanks for investigating and fixing this!
> 
> ::: builds.py
> @@ +1382,4 @@
> >                   id=None,
> >                   revision=None,
> >                   changeset=None,
> > +                 changeset_dirs=[],
> 
> Should be a set()?
> 

Actually the only place we need a set is in PhoneTest. Making it a list elsewhere makes it possible to convert it to a json string for storage in the db or the meta data file. Forgot to mention that.

> ::: phonetest.py
> @@ +225,4 @@
> >                      if matched:
> >                          break
> >                      for td in test.run_if_changed:
> > +                        if cd == "" or cd.startswith(td):
> 
> I thought we used an empty set() to signal that we wanted the tests to run
> regardless of filtering?

That's true. In the case where we are limiting the changesets we could have returned an empty list. We would still need the cd == "" though since when we change a file in the source's root directory the changeset_dirs include "" which I think should force a test as well.
Forgot to mention that there are probably too many logger.debug statements in get_changeset_dirs. I can drop them if you want.
Comment on attachment 8784413 [details] [diff] [review]
bug-1296679-v1.patch

Review of attachment 8784413 [details] [diff] [review]:
-----------------------------------------------------------------

some of this is rubber stamp, the rest looks good- I think the debug messages are good to leave in.
Attachment #8784413 - Flags: review?(jmaher) → review+
https://github.com/mozilla/autophone/commit/8ab7eb189723aab5b52127ff65cb4a06920fc2b5

per dminor's comment I changed get_changeset_dirs to return [] when the number of changesets exceeds the maximum.

I'll deploy as soon as the job count reaches zero.
I have updated autophone-1 with the latest code and it is running.
Blocks: 1277653
deployed autophone-2,3 2016-08-27 ~14:00:00 PDT
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: