Autophone - download builds from the TaskCluster index

RESOLVED FIXED

Status

Testing
Autophone
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: bc, Assigned: bc)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(6 attachments)

(Assignee)

Description

2 years ago
As build jobs move from buildbot to TaskCluster, the builds will no longer be available from ftp|archive.mozilla.org. trigger_runs.py will need to be updated to use the TC index instead of scraping the archive site. The pulse notification of builds may not require any modification since I would expect them to include the new locations for builds, but I haven't checked that yet.
(Assignee)

Updated

2 years ago
Depends on: 1118394
(Assignee)

Comment 1

a year ago
Created attachment 8773911 [details] [diff] [review]
bug-1275344-part-1-pulse-v1.patch

This adds monitoring for trunk builds for exchange/taskcluster-queue/v1/task-completed and begins the transition from getting metadata from the build urls. With the exception of a minor tweak to trigger_runs.py to support passing the abi/sdk to the autophone process, this does not handle the case where trigger_runs.py downloads taskcluster builds since I am still working on that patch. It also tweaks some of the more verbose logging statements to reduce the log spew.

I had planned to finish the work for supporting triggering taskcluster builds, then consolidate the taskcluster related code into a file/class but am pushing this work at its current stage since the android debug trunk buildbot builds have been turned off. I'll complete the refactoring in the next patch.

We are ok if this doesn't get reviewed until Monday since the weekend is slow and I still have backfilling to do. It should be easy enough to backfill the debug unit tests on mozilla-central even if we miss them this weekend.
Attachment #8773911 - Flags: review?(gbrown)
Comment on attachment 8773911 [details] [diff] [review]
bug-1275344-part-1-pulse-v1.patch

Review of attachment 8773911 [details] [diff] [review]:
-----------------------------------------------------------------

::: autophonepulsemonitor.py
@@ +30,5 @@
>      normalized messages in order to obtain the check-in comment for a
>      build. The comment is used to determine if a try build has
>      requested Autophone testing.
>  
> +    Taskcluster builds are detected by listing to the

s/listing/listening/

@@ +247,5 @@
>          return self.listen_thread.is_alive()
>  
>      def listen(self):
> +        if self.verbose:
> +            logger.debug('AutophonePulseMonitor: start shared_lock.acquire')

Thanks - I think I'll like that better.

@@ +519,5 @@
> +            response = tc_queue.listArtifacts(task_id, run_id, {
> +                'continuationToken': response['continuationToken']})
> +
> +        if 'buildid' not in build_metadata or 'build_type' not in build_metadata:
> +            logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: '

Does this happen often/normally? Should it be a warning? (just checking)

@@ +526,5 @@
> +            return
> +
> +        logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: build_metadata: %s' % (task_id, run_id, build_metadata))
> +
> +        if build_metadata['appName'] != 'fennec':

case sensitivity?

@@ +530,5 @@
> +        if build_metadata['appName'] != 'fennec':
> +            logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName']))
> +            return
> +        if build_metadata['branch'] not in self.trees:
> +            logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch']))

maybe add more to the logging here (and above, below) to differentiate the various early return cases
Attachment #8773911 - Flags: review?(gbrown) → review+
(Assignee)

Comment 3

a year ago
(In reply to Geoff Brown [:gbrown] from comment #2)
> Comment on attachment 8773911 [details] [diff] [review]
> bug-1275344-part-1-pulse-v1.patch
> 
> Review of attachment 8773911 [details] [diff] [review]:
> -----------------------------------------------------------------
> 

> 
> @@ +519,5 @@
> > +            response = tc_queue.listArtifacts(task_id, run_id, {
> > +                'continuationToken': response['continuationToken']})
> > +
> > +        if 'buildid' not in build_metadata or 'build_type' not in build_metadata:
> > +            logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: '
> 
> Does this happen often/normally? Should it be a warning? (just checking)
> 

Seems pretty common in testing. Seems the tasks are for lint checking, or in other cases an error has occurred. I don't think it should be a warning, but I wanted to get examples of the situation in the debug log.

examples:

https://queue.taskcluster.net/v1/task/Qx2GiTUvS3qwGIkEi9swXw/runs/0/artifacts
https://queue.taskcluster.net/v1/task/fw10xa4cRuG68uZ-SXXyVw/runs/0/artifacts

> @@ +526,5 @@
> > +            return
> > +
> > +        logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: build_metadata: %s' % (task_id, run_id, build_metadata))
> > +
> > +        if build_metadata['appName'] != 'fennec':
> 
> case sensitivity?
> 

buildbot's messages return in the payload, build, properties:

                [
                    "appName", 
                    "Fennec", 
                    "SetProperty Step"
                ], 

but mozinfo has appname	"fennec"

https://public-artifacts.taskcluster.net/UAA7utzeRHq2CpvH7S54CQ/0/public/build/target.mozinfo.json

The appName stuff in the handle_taskcompleted just matches the buildbot property name since I want the builds to be handled by the same build_callback. I'll clean up the camelCase when we drop buildbot support in November.


> @@ +530,5 @@
> > +        if build_metadata['appName'] != 'fennec':
> > +            logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName']))
> > +            return
> > +        if build_metadata['branch'] not in self.trees:
> > +            logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch']))
> 
> maybe add more to the logging here (and above, below) to differentiate the
> various early return cases

Not sure what you mean here. I'll go ahead with the patch and we can fix this in the part 2 when we add taskcluster builds to trigger.
(In reply to Bob Clary [:bc:] from comment #3)
> > @@ +530,5 @@
> > > +        if build_metadata['appName'] != 'fennec':
> > > +            logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName']))
> > > +            return
> > > +        if build_metadata['branch'] not in self.trees:
> > > +            logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch']))
> > 
> > maybe add more to the logging here (and above, below) to differentiate the
> > various early return cases
> 
> Not sure what you mean here. I'll go ahead with the patch and we can fix
> this in the part 2 when we add taskcluster builds to trigger.

Sorry for the lack of clarity. I was just trying to say that there are several different conditions that can trigger a log message of 'handle_taskcompleted: task_id: %s, run_id: %s: skip %s'; it will be easier to debug some conditions if each message is unique.
(Assignee)

Comment 5

a year ago
Ok, so instead of 

logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName']))

we could do

logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip appName %s' % (task_id, run_id, build_metadata['appName']))

etc?
Yes.
(Assignee)

Updated

a year ago
Assignee: nobody → bob
(Assignee)

Comment 7

a year ago
Created attachment 8778408 [details] [diff] [review]
bug-1275344-part-1-pulse-v2.patch

v1 with tweaks to logging as discussed. for application of following patches. carrying forward r+.
Attachment #8778408 - Flags: review+
(Assignee)

Comment 8

a year ago
Created attachment 8778409 [details] [diff] [review]
bug-1275344-part-2-ftp-v1.patch

Simple patch which basically renames the BuildLocation stuff to prior to adding the Taskcluster bits. Also removes inbound-archive since that is 404.
Attachment #8778409 - Flags: review?(gbrown)
(Assignee)

Comment 9

a year ago
Created attachment 8778410 [details] [diff] [review]
bug-1275344-part-3-taskcluster-v1.patch

Adding Taskcluster support was more problematic than I thought and the patch exemplifies that fact.

Previously we had relied on the meta data contained in the ftp build urls. This is no longer available using taskcluster build urls so we need to obtain it elsewhere and need to make pass it around so we don't have to keep looking it up. This necessitated changes to the jobs table in the jobs database as well as other changes to the treeherder submission code.

utils.get_build_data() was expanded to look at the additional json files if they are available and to collect additional information such as abi, sdk, build type, nightly status, platform, the full changeset url and the revision id as well as whether the build was produced by buildbot or taskcluster.

Taskcluster also forced a change to use UTC build ids. Tracking the builder_type is necessary for example since we need to be able to tell which timezone the original build id was in: for buildbot it is Pacific time; for taskcluster it is UTC. This also made it necessary to make the code timezone aware.

During the transition from buildbot to taskcluster, the code prefers buildbot builds or taskcluster tier 1 builds and will ignore taskcluster builds which are not tier 1. If during a transition period where both a being produced, we will get "duplicate' builds from both but only for the short period where they are both running.

When reviewing, I suggest this order:

utils.py
build_dates.py
builds.py
trigger_runs.py
buildserver.py
jobs.py
autophonepulsemonitor.py
autophonetreeherder.py
phonetest.py
autophone.py
worker.py
tests/*.py
Attachment #8778410 - Flags: review?(gbrown)
(Assignee)

Comment 10

a year ago
Created attachment 8778412 [details] [diff] [review]
bug-1275344-phonedash-buildids-in-utc-v1.patch

Phonedash will now use UTC. Note that no changes are required to the database which uses System time UTC but if you have a local database and do not have it set to use UTC you will need to add default-time-zone='+00:00' to your [mysqld] section in /etc/my.cnf and restart mysqld/mariadbd or whatever you have.
Attachment #8778412 - Flags: review?(gbrown)
(Assignee)

Comment 11

a year ago
example runs on staging.

https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&exclusion_profile=false&filter-searchStr=autophone&filter-tier=1&filter-tier=2&filter-tier=3&fromchange=be293e09860e&tochange=0f66d9a52316

https://treeherder.allizom.org/#/jobs?repo=fx-team&exclusion_profile=false&filter-searchStr=autophone&filter-tier=1&filter-tier=2&filter-tier=3&fromchange=c60886d5c003&tochange=c60886d5c003

Note the debug builds are tier 1 taskcluster and have Build: Build: unknown arm-unknown-api15 android in the jobinfo panel. I'll need to deal with that some day but left it for later.
Attachment #8778409 - Flags: review?(gbrown) → review+
Comment on attachment 8778412 [details] [diff] [review]
bug-1275344-phonedash-buildids-in-utc-v1.patch

Review of attachment 8778412 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know phonedash well, but this looks sane and I think you know what you are doing.
Attachment #8778412 - Flags: review?(gbrown) → review+
Comment on attachment 8778410 [details] [diff] [review]
bug-1275344-part-3-taskcluster-v1.patch

Review of attachment 8778410 [details] [diff] [review]:
-----------------------------------------------------------------

Epic. Finding all the date changes must have been painful.

I'm not sure I fully appreciate all that's going on here, but it seems to work and I don't see a single thing wrong with it.
Attachment #8778410 - Flags: review?(gbrown) → review+
(Assignee)

Comment 14

a year ago
https://github.com/mozilla/autophone/commit/07ff8deb73558990d31e459d245b05eb3d1689a9
https://github.com/mozilla/autophone/commit/473e6380fbb614db003d1ef2ab3779371418980f
https://github.com/mozilla/autophone/commit/7145da02367b342031d10d94b64147afb7bea763
Blocks: 1157427
Status: NEW → ASSIGNED
(Assignee)

Comment 15

a year ago
https://github.com/mozilla/phonedash/commit/dccd2bbdfaa3ffb2b7b9880b5e340f8b3b037d48

deployed 2016-08-10 ~00:00 PDT
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
(Assignee)

Comment 16

a year ago
The production Ubuntu servers running Python 2.7.6 show the following error after starting:

Traceback (most recent call last):
  File "autophone.py", line 1503, in <module>
    exit_code = main(options)
  File "autophone.py", line 1208, in main
    autophone.run()
  File "autophone.py", line 219, in run
    self.worker_msg_loop()
  File "autophone.py", line 340, in worker_msg_loop
    self.check_for_unrecoverable_errors()
  File "autophone.py", line 329, in check_for_unrecoverable_errors
    elapsed = datetime.datetime.now(tz=pytz.utc) - worker.last_status_msg.timestamp
TypeError: can't subtract offset-naive and offset-aware datetimes

which does not occur on my local Fedora 23 system with Python 2.7.11
(Assignee)

Comment 17

a year ago
Created attachment 8779657 [details] [diff] [review]
bug-1275344-part-4-worker-status-v1.patch

This replaces the utcnow in the worker status with now(tz=pytz.utc).

https://github.com/mozilla/autophone/commit/7d54bd34a5fe18068277d8965c9c4f0e39dc7253

Hopefully there aren't other timebombs waiting to be found.
Attachment #8779657 - Flags: review+
(Assignee)

Updated

a year ago
Duplicate of this bug: 1285276
You need to log in before you can comment on or make changes to this bug.