Closed Bug 1275344 Opened 9 years ago Closed 9 years ago

Autophone - download builds from the TaskCluster index

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(6 files)

bug-1275344-part-1-pulse-v1.patch 9 years ago Bob Clary [:bc] (inactive) 24.48 KB, patch	gbrown : review+	Details \| Diff \| Splinter Review
bug-1275344-part-1-pulse-v2.patch 9 years ago Bob Clary [:bc] (inactive) 24.51 KB, patch	bc : review+	Details \| Diff \| Splinter Review
bug-1275344-part-2-ftp-v1.patch 9 years ago Bob Clary [:bc] (inactive) 9.79 KB, patch	gbrown : review+	Details \| Diff \| Splinter Review
bug-1275344-part-3-taskcluster-v1.patch 9 years ago Bob Clary [:bc] (inactive) 142.86 KB, patch	gbrown : review+	Details \| Diff \| Splinter Review
bug-1275344-phonedash-buildids-in-utc-v1.patch 9 years ago Bob Clary [:bc] (inactive) 7.61 KB, patch	gbrown : review+	Details \| Diff \| Splinter Review
bug-1275344-part-4-worker-status-v1.patch 9 years ago Bob Clary [:bc] (inactive) 4.29 KB, patch	bc : review+	Details \| Diff \| Splinter Review

Bob Clary [:bc] (inactive)

Assignee

Description

•

9 years ago

As build jobs move from buildbot to TaskCluster, the builds will no longer be available from ftp|archive.mozilla.org. trigger_runs.py will need to be updated to use the TC index instead of scraping the archive site. The pulse notification of builds may not require any modification since I would expect them to include the new locations for builds, but I haven't checked that yet.

Bob Clary [:bc] (inactive)

Assignee

Updated

•

9 years ago

Depends on: 1118394

Bob Clary [:bc] (inactive)

Assignee

Comment 1

•

9 years ago

Attached patch bug-1275344-part-1-pulse-v1.patch — Details — Splinter Review

This adds monitoring for trunk builds for exchange/taskcluster-queue/v1/task-completed and begins the transition from getting metadata from the build urls. With the exception of a minor tweak to trigger_runs.py to support passing the abi/sdk to the autophone process, this does not handle the case where trigger_runs.py downloads taskcluster builds since I am still working on that patch. It also tweaks some of the more verbose logging statements to reduce the log spew. I had planned to finish the work for supporting triggering taskcluster builds, then consolidate the taskcluster related code into a file/class but am pushing this work at its current stage since the android debug trunk buildbot builds have been turned off. I'll complete the refactoring in the next patch. We are ok if this doesn't get reviewed until Monday since the weekend is slow and I still have backfilling to do. It should be easy enough to backfill the debug unit tests on mozilla-central even if we miss them this weekend.

Attachment #8773911 - Flags: review?(gbrown)

Geoff Brown [:gbrown]

Comment 2

•

9 years ago

Comment on attachment 8773911 [details] [diff] [review] bug-1275344-part-1-pulse-v1.patch Review of attachment 8773911 [details] [diff] [review]: ----------------------------------------------------------------- ::: autophonepulsemonitor.py @@ +30,5 @@ > normalized messages in order to obtain the check-in comment for a > build. The comment is used to determine if a try build has > requested Autophone testing. > > + Taskcluster builds are detected by listing to the s/listing/listening/ @@ +247,5 @@ > return self.listen_thread.is_alive() > > def listen(self): > + if self.verbose: > + logger.debug('AutophonePulseMonitor: start shared_lock.acquire') Thanks - I think I'll like that better. @@ +519,5 @@ > + response = tc_queue.listArtifacts(task_id, run_id, { > + 'continuationToken': response['continuationToken']}) > + > + if 'buildid' not in build_metadata or 'build_type' not in build_metadata: > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: ' Does this happen often/normally? Should it be a warning? (just checking) @@ +526,5 @@ > + return > + > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: build_metadata: %s' % (task_id, run_id, build_metadata)) > + > + if build_metadata['appName'] != 'fennec': case sensitivity? @@ +530,5 @@ > + if build_metadata['appName'] != 'fennec': > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName'])) > + return > + if build_metadata['branch'] not in self.trees: > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch'])) maybe add more to the logging here (and above, below) to differentiate the various early return cases

Attachment #8773911 - Flags: review?(gbrown) → review+

Bob Clary [:bc] (inactive)

Assignee

Comment 3

•

9 years ago

(In reply to Geoff Brown [:gbrown] from comment #2) > Comment on attachment 8773911 [details] [diff] [review] > bug-1275344-part-1-pulse-v1.patch > > Review of attachment 8773911 [details] [diff] [review]: > ----------------------------------------------------------------- > > > @@ +519,5 @@ > > + response = tc_queue.listArtifacts(task_id, run_id, { > > + 'continuationToken': response['continuationToken']}) > > + > > + if 'buildid' not in build_metadata or 'build_type' not in build_metadata: > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: ' > > Does this happen often/normally? Should it be a warning? (just checking) > Seems pretty common in testing. Seems the tasks are for lint checking, or in other cases an error has occurred. I don't think it should be a warning, but I wanted to get examples of the situation in the debug log. examples: https://queue.taskcluster.net/v1/task/Qx2GiTUvS3qwGIkEi9swXw/runs/0/artifacts https://queue.taskcluster.net/v1/task/fw10xa4cRuG68uZ-SXXyVw/runs/0/artifacts > @@ +526,5 @@ > > + return > > + > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: build_metadata: %s' % (task_id, run_id, build_metadata)) > > + > > + if build_metadata['appName'] != 'fennec': > > case sensitivity? > buildbot's messages return in the payload, build, properties: [ "appName", "Fennec", "SetProperty Step" ], but mozinfo has appname "fennec" https://public-artifacts.taskcluster.net/UAA7utzeRHq2CpvH7S54CQ/0/public/build/target.mozinfo.json The appName stuff in the handle_taskcompleted just matches the buildbot property name since I want the builds to be handled by the same build_callback. I'll clean up the camelCase when we drop buildbot support in November. > @@ +530,5 @@ > > + if build_metadata['appName'] != 'fennec': > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName'])) > > + return > > + if build_metadata['branch'] not in self.trees: > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch'])) > > maybe add more to the logging here (and above, below) to differentiate the > various early return cases Not sure what you mean here. I'll go ahead with the patch and we can fix this in the part 2 when we add taskcluster builds to trigger.

Geoff Brown [:gbrown]

Comment 4

•

9 years ago

(In reply to Bob Clary [:bc:] from comment #3) > > @@ +530,5 @@ > > > + if build_metadata['appName'] != 'fennec': > > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName'])) > > > + return > > > + if build_metadata['branch'] not in self.trees: > > > + logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['branch'])) > > > > maybe add more to the logging here (and above, below) to differentiate the > > various early return cases > > Not sure what you mean here. I'll go ahead with the patch and we can fix > this in the part 2 when we add taskcluster builds to trigger. Sorry for the lack of clarity. I was just trying to say that there are several different conditions that can trigger a log message of 'handle_taskcompleted: task_id: %s, run_id: %s: skip %s'; it will be easier to debug some conditions if each message is unique.

Bob Clary [:bc] (inactive)

Assignee

Comment 5

•

9 years ago

Ok, so instead of logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip %s' % (task_id, run_id, build_metadata['appName'])) we could do logger.debug('handle_taskcompleted: task_id: %s, run_id: %s: skip appName %s' % (task_id, run_id, build_metadata['appName'])) etc?

Geoff Brown [:gbrown]

Comment 6

•

9 years ago

Yes.

Bob Clary [:bc] (inactive)

Assignee

Updated

•

9 years ago

Assignee: nobody → bob

Bob Clary [:bc] (inactive)

Assignee

Comment 7

•

9 years ago

Attached patch bug-1275344-part-1-pulse-v2.patch — Details — Splinter Review

v1 with tweaks to logging as discussed. for application of following patches. carrying forward r+.

Attachment #8778408 - Flags: review+

Bob Clary [:bc] (inactive)

Assignee

Comment 8

•

9 years ago

Attached patch bug-1275344-part-2-ftp-v1.patch — Details — Splinter Review

Simple patch which basically renames the BuildLocation stuff to prior to adding the Taskcluster bits. Also removes inbound-archive since that is 404.

Attachment #8778409 - Flags: review?(gbrown)

Bob Clary [:bc] (inactive)

Assignee

Comment 9

•

9 years ago

Attached patch bug-1275344-part-3-taskcluster-v1.patch — Details — Splinter Review

Adding Taskcluster support was more problematic than I thought and the patch exemplifies that fact. Previously we had relied on the meta data contained in the ftp build urls. This is no longer available using taskcluster build urls so we need to obtain it elsewhere and need to make pass it around so we don't have to keep looking it up. This necessitated changes to the jobs table in the jobs database as well as other changes to the treeherder submission code. utils.get_build_data() was expanded to look at the additional json files if they are available and to collect additional information such as abi, sdk, build type, nightly status, platform, the full changeset url and the revision id as well as whether the build was produced by buildbot or taskcluster. Taskcluster also forced a change to use UTC build ids. Tracking the builder_type is necessary for example since we need to be able to tell which timezone the original build id was in: for buildbot it is Pacific time; for taskcluster it is UTC. This also made it necessary to make the code timezone aware. During the transition from buildbot to taskcluster, the code prefers buildbot builds or taskcluster tier 1 builds and will ignore taskcluster builds which are not tier 1. If during a transition period where both a being produced, we will get "duplicate' builds from both but only for the short period where they are both running. When reviewing, I suggest this order: utils.py build_dates.py builds.py trigger_runs.py buildserver.py jobs.py autophonepulsemonitor.py autophonetreeherder.py phonetest.py autophone.py worker.py tests/*.py

Attachment #8778410 - Flags: review?(gbrown)

Bob Clary [:bc] (inactive)

Assignee

Comment 10

•

9 years ago

Attached patch bug-1275344-phonedash-buildids-in-utc-v1.patch — Details — Splinter Review

Phonedash will now use UTC. Note that no changes are required to the database which uses System time UTC but if you have a local database and do not have it set to use UTC you will need to add default-time-zone='+00:00' to your [mysqld] section in /etc/my.cnf and restart mysqld/mariadbd or whatever you have.

Attachment #8778412 - Flags: review?(gbrown)

Bob Clary [:bc] (inactive)

Assignee

Comment 11

•

9 years ago

example runs on staging. https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&exclusion_profile=false&filter-searchStr=autophone&filter-tier=1&filter-tier=2&filter-tier=3&fromchange=be293e09860e&tochange=0f66d9a52316 https://treeherder.allizom.org/#/jobs?repo=fx-team&exclusion_profile=false&filter-searchStr=autophone&filter-tier=1&filter-tier=2&filter-tier=3&fromchange=c60886d5c003&tochange=c60886d5c003 Note the debug builds are tier 1 taskcluster and have Build: Build: unknown arm-unknown-api15 android in the jobinfo panel. I'll need to deal with that some day but left it for later.

Geoff Brown [:gbrown]

Updated

•

9 years ago

Attachment #8778409 - Flags: review?(gbrown) → review+

Geoff Brown [:gbrown]

Comment 12

•

9 years ago

Comment on attachment 8778412 [details] [diff] [review] bug-1275344-phonedash-buildids-in-utc-v1.patch Review of attachment 8778412 [details] [diff] [review]: ----------------------------------------------------------------- I don't know phonedash well, but this looks sane and I think you know what you are doing.

Attachment #8778412 - Flags: review?(gbrown) → review+

Geoff Brown [:gbrown]

Comment 13

•

9 years ago

Comment on attachment 8778410 [details] [diff] [review] bug-1275344-part-3-taskcluster-v1.patch Review of attachment 8778410 [details] [diff] [review]: ----------------------------------------------------------------- Epic. Finding all the date changes must have been painful. I'm not sure I fully appreciate all that's going on here, but it seems to work and I don't see a single thing wrong with it.

Attachment #8778410 - Flags: review?(gbrown) → review+

Bob Clary [:bc] (inactive)

Assignee

Comment 14

•

9 years ago

https://github.com/mozilla/autophone/commit/07ff8deb73558990d31e459d245b05eb3d1689a9 https://github.com/mozilla/autophone/commit/473e6380fbb614db003d1ef2ab3779371418980f https://github.com/mozilla/autophone/commit/7145da02367b342031d10d94b64147afb7bea763

Blocks: autophone-deployments

Status: NEW → ASSIGNED

Bob Clary [:bc] (inactive)

Assignee

Comment 15

•

9 years ago

https://github.com/mozilla/phonedash/commit/dccd2bbdfaa3ffb2b7b9880b5e340f8b3b037d48 deployed 2016-08-10 ~00:00 PDT

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Bob Clary [:bc] (inactive)

Assignee

Comment 16

•

9 years ago

The production Ubuntu servers running Python 2.7.6 show the following error after starting: Traceback (most recent call last): File "autophone.py", line 1503, in <module> exit_code = main(options) File "autophone.py", line 1208, in main autophone.run() File "autophone.py", line 219, in run self.worker_msg_loop() File "autophone.py", line 340, in worker_msg_loop self.check_for_unrecoverable_errors() File "autophone.py", line 329, in check_for_unrecoverable_errors elapsed = datetime.datetime.now(tz=pytz.utc) - worker.last_status_msg.timestamp TypeError: can't subtract offset-naive and offset-aware datetimes which does not occur on my local Fedora 23 system with Python 2.7.11

Bob Clary [:bc] (inactive)

Assignee

Comment 17

•

9 years ago

Attached patch bug-1275344-part-4-worker-status-v1.patch — Details — Splinter Review

This replaces the utcnow in the worker status with now(tz=pytz.utc). https://github.com/mozilla/autophone/commit/7d54bd34a5fe18068277d8965c9c4f0e39dc7253 Hopefully there aren't other timebombs waiting to be found.

Attachment #8779657 - Flags: review+

BMO Automation

Updated

•

4 years ago

Product: Testing → Testing Graveyard

You need to log in before you can comment on or make changes to this bug.