Somehow Treeherder is stuck at https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=6c094e2b6e57 when it should show https://hg.mozilla.org/integration/mozilla-inbound/rev/dedc769ea8a8 . I wonder if this is the pushlog problem we have seen before?
Also, the trees are now closed because of this problem.
15:15 <•mdoglio> edmorley: the etltasks stopped working 15 minutes ago, I guess because of the deployment
15:16 <•mdoglio> fubar: hey there, can you please have a look at the celery worker on the etl nodes?
15:17 <fubar> mdoglio: processes are running
15:17 <fubar> [treeherder-etl1.private.scl3.mozilla.com] out: celery RUNNING pid 2013, uptime 0:17:37
15:17 <fubar> [treeherder-etl2.private.scl3.mozilla.com] out: celery RUNNING pid 22710, uptime 0:18:51
15:17 <fubar> [treeherder-etl1.private.scl3.mozilla.com] out: 495 2034 2013 0 13:59 ? 00:00:04 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
15:18 <•mdoglio> fubar: I'm on etl and I see the worker is started with the wrong script
15:19 <•mdoglio> s/etl/etl1
15:20 <•edmorley> mdoglio: fubar: this is the first deploy after bug 1086934 I guess?
15:20 <firebot> https://bugzil.la/1086934 - FIXED, firstname.lastname@example.org - Production's commander_settings.py is missing treeherder-etl from CELERY_HOSTGROUP
15:21 <•mdoglio> fubar: this is the supervisord conf for the etl nodes https://github.com/mozilla/treeherder-service/blob/master/deployment/supervisord/etl_node.conf
15:22 <fubar> mdoglio: I think we missed a step between building the etl nodes and actually using those queues
15:23 <•mdoglio> oh okey
15:23 * fubar adds those to puppet
15:23 <•mdoglio> thanks fubar
The pushlog and buildapi ingestion services are running now. Thanks fubar!
This was presumably a combination of:
1) A recent change to unplug the default celery worker from the etl (buildapi, pushlog) queues (https://github.com/mozilla/treeherder-service/commit/5df6bd4212778425adff0e743dd9a09c63bb01c4).
2) A recent fix to the Chief deploy script, since the new ETL nodes were not included in the machines it was updating (bug 1086934).
3) The ETL nodes using the wrong supervisord conf. This wasn't noticed until now because, before #1, the default worker was doing the ETL work, and the deploy problem fixed in #2 meant that even after #1 landed, the ETL workers didn't get the new code until the first deploy after that fix (which was the deploy 30 mins ago).
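To illustrate how #1 and #3 interact, here is a minimal sketch in plain Python (no celery involved). It only models which queues each worker subscribes to. The queue names come from the comments above (`default` from the `-Q default` command line, `buildapi` and `pushlog` from #1); the worker labels, the `unconsumed_queues` helper, and the assumption that the correct ETL conf subscribes to exactly `buildapi` and `pushlog` are hypothetical and not taken from the Treeherder code.

```python
from typing import Dict, Set

# Queue names mentioned in this bug; everything else here is illustrative.
ETL_QUEUES: Set[str] = {"buildapi", "pushlog"}


def unconsumed_queues(workers: Dict[str, Set[str]], required: Set[str]) -> Set[str]:
    """Return the queues in `required` that no worker subscribes to."""
    consumed: Set[str] = set().union(*workers.values())
    return required - consumed


# Before #1: the default worker also consumes the ETL queues, so the ETL
# nodes running with "-Q default" (the wrong conf) go unnoticed.
before_change = {
    "default_worker": {"default", "buildapi", "pushlog"},
    "etl_node_wrong_conf": {"default"},
}

# After #1 is deployed everywhere: nothing listens on the ETL queues.
after_change = {
    "default_worker": {"default"},
    "etl_node_wrong_conf": {"default"},
}

# After the puppet fix (assumed subscription for the correct ETL conf).
after_fix = {
    "default_worker": {"default"},
    "etl_node_fixed_conf": set(ETL_QUEUES),
}

if __name__ == "__main__":
    print(sorted(unconsumed_queues(before_change, ETL_QUEUES)))  # []
    print(sorted(unconsumed_queues(after_change, ETL_QUEUES)))   # ['buildapi', 'pushlog']
    print(sorted(unconsumed_queues(after_fix, ETL_QUEUES)))      # []
```

A check along these lines, run against the real worker command lines after a deploy, might have flagged the missing subscribers before ingestion stalled, though that is only a suggestion and not something that exists in the deploy scripts.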