
Scheduled ATMO job for probe-scraper not run



Data Platform and Tools
Telemetry Analysis Service (ATMO)
a year ago
9 months ago


(Reporter: gfritzsche, Unassigned)



I noticed that the data from the probe-scraper [1] job hasn't been updated since Apr 6.

From what I see in ATMO, it hasn't run since then:

> Identifier
>     gfritzsche-telemetry-probe-scraper
> Notebook name
>     load_and_run.ipynb
> Result visibility
>     Public
> Cluster size
>     1
> Run interval
>     24 hours
> Job timeout
>     1
> Start date
>     2017-04-03 07:00
> Last scheduled date
>     2017-04-06 14:55
> Last run date
>     n/a
> Last terminated date
>     2017-04-08 00:25
> Is enabled

1: https://github.com/mozilla/probe-scraper
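
For illustration, the gap can be estimated from the job details quoted above. This is a minimal sketch, not ATMO's scheduler logic: it takes the "Last scheduled date" and "Run interval" fields and lists the runs that should have fired since. The `missed_runs` helper and the check time `now` are hypothetical.

```python
from datetime import datetime, timedelta


def missed_runs(last_scheduled, interval, now):
    """Return the run times that were due after last_scheduled, up to now."""
    missed = []
    due = last_scheduled + interval
    while due <= now:
        missed.append(due)
        due += interval
    return missed


last_scheduled = datetime(2017, 4, 6, 14, 55)  # "Last scheduled date" from the job details
interval = timedelta(hours=24)                 # "Run interval" from the job details
now = datetime(2017, 4, 10, 12, 0)             # hypothetical time of the check

overdue = missed_runs(last_scheduled, interval, now)
for due in overdue:
    print("missed run expected at", due.isoformat(sep=" "))
```

A non-empty result for an enabled job suggests the scheduler or the workers are not picking it up.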
Summary: Scheduled ATMO job not run → Scheduled ATMO job for probe-scraper not run
Marc mentioned something similar yesterday as well, in https://github.com/mozilla/telemetry-analysis-service/issues/385, and I ran the job manually from the admin. The workers seem to be stuck, since the job I started wasn't updated on ATMO either; in other words, the cluster status wasn't pulled from AWS and written to the ATMO db.

:robotblake Can you restart the prod cluster and see if that unclogs the system?
Flags: needinfo?(bimsland)
Just restarted the scheduler and the workers on all nodes.

Not sure if it's a similar issue, but on redash we've run into cases where celery workers (using the redis backend) get into a strange state where they'll ack a job, hang without processing said job, and then never accept new work. There were similar bugs filed against celery on GitHub that got closed with the release of v4, but the problem may not have been entirely resolved. I can dig up references if need be.
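
As a rough sketch of settings that might limit the impact of a hung worker, assuming Celery 4 style lowercase setting names and the prefork pool: the values below are illustrative guesses, not ATMO's or redash's actual configuration, and they are not a confirmed fix for the hang described above.

```python
# celeryconfig.py style settings module (hypothetical values throughout).

# Acknowledge a task only after it finishes, so a task picked up by a worker
# that then hangs or dies can be redelivered instead of being lost.
task_acks_late = True

# Reserve one task at a time per worker process, so a stuck process
# does not sit on a backlog of prefetched tasks.
worker_prefetch_multiplier = 1

# Raise SoftTimeLimitExceeded inside the task after 10 minutes, and
# terminate the worker process running it after 11 minutes.
task_soft_time_limit = 600
task_time_limit = 660

# Recycle each worker process after a fixed number of tasks.
worker_max_tasks_per_child = 100

# Redis transport: how long an unacknowledged task stays invisible
# before it is redelivered to another worker (seconds).
broker_transport_options = {"visibility_timeout": 3600}
```

The hard time limit only helps when the hang happens inside a running task; a worker that stops consuming from the broker altogether would still need external monitoring.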
Flags: needinfo?(bimsland)

Comment 3

11 months ago
:robotblake: So what could we do to monitor Celery? Add CPU/memory monitors via Cloudwatch? Datadog? https://deadmanssnitch.com/? https://healthchecks.io/?
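
To make the dead-man's-switch idea concrete, here is a minimal sketch using only the standard library. It assumes a periodic task calls `ping()` against a check-in URL issued by whichever service is chosen, and that something outside the workers evaluates `heartbeat_is_stale()`. Both function names and the URL are hypothetical; none of this is an existing ATMO or Celery API.

```python
import time
import urllib.request


def heartbeat_is_stale(last_beat, max_age_seconds, now=None):
    """Return True if the last heartbeat is older than max_age_seconds."""
    now = time.time() if now is None else now
    return (now - last_beat) > max_age_seconds


def ping(url, timeout=10):
    """Send a check-in request; return True on an HTTP 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError, HTTPError and socket timeouts.
        return False


# Hypothetical usage inside a task scheduled every few minutes:
#     ping("https://example.invalid/check-in/<token>")
```

If the workers stop processing tasks, the pings stop too, and the monitoring service can alert once the expected interval has passed without a check-in.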
I rescheduled the job "gfritzsche-telemetry-probe-scraper", but it still did not run.
How can I get this working?
Flags: needinfo?(jezdez)

Comment 6

11 months ago
Still valid, Jannis?
I suspect this has been resolved... Georg, can you confirm whether or not this is still a problem? Thanks!
Flags: needinfo?(jezdez) → needinfo?(gfritzsche)
It works fine for me now.
Last Resolved: 9 months ago
Flags: needinfo?(gfritzsche)
Resolution: --- → WORKSFORME