I noticed that the data from the probe-scraper [1] job hasn't been updated since Apr 6.
From what I see in ATMO, it hasn't run since then:
> Identifier: gfritzsche-telemetry-probe-scraper
> Notebook name: load_and_run.ipynb
> Result visibility: Public
> Cluster size: 1
> Run interval: 24 hours
> Job timeout: 1
> Start date: 2017-04-03 07:00
> Last scheduled date: 2017-04-06 14:55
> Last run date: n/a
> Last terminated date: 2017-04-08 00:25
> Is enabled

1: https://github.com/mozilla/probe-scraper
Summary: Scheduled ATMO job not run → Scheduled ATMO job for probe-scraper not run
Marc mentioned something similar yesterday as well, in https://github.com/mozilla/telemetry-analysis-service/issues/385, and I ran the job manually from the admin. It seems as if the workers are stuck, since the job I started wasn't updated on ATMO either. In other words, the cluster status wasn't pulled from AWS and written to the ATMO db. :robotblake Can you restart the prod cluster and see if that unclogs the system?
Just restarted the scheduler and the workers on all nodes. Not sure if it's a similar issue, but on Redash we've run into cases where Celery workers (using the Redis backend) get into a strange state where they'll ack a job, hang without processing it, and then never accept new work. There were similar bugs filed against Celery on GitHub that got closed with the release of v4, but they may not have been entirely resolved. I can dig up references if need be.
:robotblake: So what could we do to monitor Celery? Add CPU/memory monitors via Cloudwatch? Datadog? https://deadmanssnitch.com/? https://healthchecks.io/?
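For the dead-man's-switch style options above, a minimal sketch of what the check could look like, assuming a periodic task on the same Celery workers calls `report_heartbeat` with a check URL provided by the monitoring service. If the workers hang, the pings stop and the service raises the alert. The function names, the `opener` parameter, and the 25-hour threshold are all hypothetical and not part of ATMO; only the Python standard library is used.

```python
import urllib.request
from datetime import datetime, timedelta


def is_stale(last_seen: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True if the last known run is older than max_age."""
    return now - last_seen > max_age


def report_heartbeat(ping_url: str, opener=urllib.request.urlopen, timeout: float = 10) -> bool:
    """Ping a check URL; return True on a 2xx response, False on any network error.

    Hypothetical wiring: call this at the end of a periodic task that runs on
    the same workers as the scheduled jobs, so a hung worker stops the pings.
    """
    try:
        with opener(ping_url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # urllib.error.URLError, HTTPError and socket timeouts are OSError subclasses.
        return False
```

With the dates from the job details in this bug (last scheduled 2017-04-06 14:55, checked at 2017-04-08 00:25) and an assumed threshold of 25 hours for a 24-hour run interval, `is_stale` would flag the job as overdue.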
I rescheduled the job "gfritzsche-telemetry-probe-scraper", but it still did not run. How can I get this working?
Ok, this is confusing:
- the output suggests something did run [1]
- ATMO says "last run date: N/A" [2]
- the output data suggests it did not run since my last manual run [3]

1: https://nbviewer.jupyter.org/url/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/gfritzsche-telemetry-probe-scraper/data/load_and_run.ipynb
2: https://analysis.telemetry.mozilla.org/jobs/104/#results
3: https://analysis-output.telemetry.mozilla.org/probe-scraper/data/general.json
Is this still valid, Jannis?
I suspect this has been resolved... Georg, can you confirm whether or not this is still a problem? Thanks!
Flags: needinfo?(jezdez) → needinfo?(gfritzsche)
It works fine for me now.
Status: NEW → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → WORKSFORME