Closed Bug 1467860 Opened 7 years ago Closed 7 years ago

Telemetry Aggregates failed on 2018-06-06

Categories

(Data Platform and Tools :: General, defect, P1)

Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: hwoo)

References

Details

Attachments

(1 file)

Full error below. We may have partial aggregates from that day, since this is during the database loading stage:

```
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1528458705378_0001/container_1528458705378_0001_01_000023/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1528458705378_0001/container_1528458705378_0001_01_000023/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1041, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1041, in <genexpr>
  File "/mnt/anaconda2/lib/python2.7/site-packages/mozaggregator/db.py", line 68, in <lambda>
    map(lambda x: _upsert_build_id_aggregates(x[0], x[1], connection_string, dry_run=dry_run)).\
  File "/mnt/anaconda2/lib/python2.7/site-packages/mozaggregator/db.py", line 168, in _upsert_build_id_aggregates
    cursor.copy_from(StringIO(stage_table), stage_table_name, columns=("dimensions", "histogram"))
DataError: unsupported Unicode escape sequence
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: ...led":true,"application":"Firefox","architecture":...
COPY staging_build_id_beta_61_20180528, line 42368, column dimensions: "{"metric":"USE_COUNTER2_DEPRECATED_SyncXMLHttpRequest_DOCUMENT","label":"","e10sEnabled":true,"appli..."
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
We've fixed a series of issues so far:

- The NULL char above in [0]
- Failed build in [1]
- New deploy in [2]

The job runs successfully now in an ATMO cluster, but is failing in Airflow after 4+ hours. Working on reproducing that issue. Keeping this bug open until the job is able to complete successfully in Airflow.

[0] https://github.com/mozilla/telemetry-batch-view/commit/14741db20dd3873b94944b8238dfc48a003c744d
[1] https://github.com/mozilla/python_mozaggregator/commit/4b9db0dd68eb641c84084d10d20ff65df46f57e3
[2] https://github.com/mozilla/telemetry-airflow/commit/0f6985a3350bbce0c7ebe19516b825ed2c51163c
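For context on the NULL-char part: Postgres rejects the escaped NUL character (\u0000) in text/JSON columns, which is what killed the copy_from call in the traceback above. The real fix landed upstream in [0]; the snippet below is only a rough illustration of stripping NULs from the dimensions payload before the COPY stage. The helper names here are made up for the example and are not mozaggregator code.

```
# Illustration only: Postgres text/jsonb columns reject NUL (\u0000), so one
# way to guard the COPY stage is to strip NUL characters from string values
# before serializing the dimensions. `sanitize_dimensions` and `to_copy_row`
# are hypothetical helpers, not part of mozaggregator.
import json


def sanitize_dimensions(dimensions):
    """Return a copy of the dimensions dict with NUL chars removed from strings."""
    clean = {}
    for key, value in dimensions.items():
        if isinstance(value, str):
            value = value.replace(u"\u0000", u"")
        clean[key] = value
    return clean


def to_copy_row(dimensions, histogram):
    """Serialize one aggregate as a tab-separated row for cursor.copy_from()."""
    return "%s\t%s\n" % (json.dumps(sanitize_dimensions(dimensions)),
                         json.dumps(histogram))
```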
Confirmed that the job runs from `./run_telemetry_aggregator.sh` in an ATMO cluster without any issues. We have data for 20180608 as a result.

Given that doesn't clarify the issue, running the full EMR Step using:

```
aws s3 cp s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar .
hadoop jar script-runner.jar s3://telemetry-airflow/steps/airflow.sh --job-name "Telemetry Aggregate View" --user frank@mozilla.com --uri https://raw.githubusercontent.com/mozilla/telemetry-airflow/master/jobs/run_telemetry_aggregator.sh --data-bucket telemetry-parquet --environment date=20180607
```

This is the exact step EMR runs, so hopefully it recreates the error.
Running with script-runner.jar on an ATMO cluster still succeeded. Going to try to reproduce the failure from a local Airflow instance.
Well certainly something is up. Buried in one of the executor logs from a failed attempt is:

```
18/06/19 11:04:50 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.31.29.176:34695
18/06/19 11:04:50 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
18/06/19 11:04:50 INFO CoarseGrainedExecutorBackend: Driver from 172.31.29.176:34695 disconnected during shutdown
18/06/19 11:04:50 INFO CoarseGrainedExecutorBackend: Driver from 172.31.29.176:34695 disconnected during shutdown
18/06/19 11:04:50 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.31.29.176:34695
18/06/19 11:04:50 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
```

Log: s3://telemetry-airflow/logs/frank@mozilla.com/Telemetry Aggregate View/j-3JTWT8M7BHEYJ/node/i-0f9c93fb5937ce1ff/applications/spark/spark.log.gz

Checking now to see if 20180610 fails in Airflow, but actually succeeds in entering data into the database.
The 20180610 run failed in Airflow, and the data did not make it into the database. I'm wondering now if this is an authentication issue with the database. The credentials should be set up via environment variables, and I don't think they are getting properly loaded.

Note that the proper credentials can be found in my personal .bashrc file on ATMO machines. The following will work for anyone on the telemetry team:

```
scp -i ~/.ssh/id_rsa hadoop@ec2-54-202-96-12.us-west-2.compute.amazonaws.com:/home/hadoop/efs/fbertsch@mozilla.com/.bashrc .
```

Harold, can you check that the credentials are properly loaded onto the production Airflow machines?
Flags: needinfo?(hwoo)
Harold and I collaborated on deploying prod Airflow yesterday morning, so that certainly could have introduced a change in the environment such as credentials.
What environment variables are you referring to? I see all the ones from cloudops-deployment and hiera-sops on the host, inside the containers.
Flags: needinfo?(hwoo)
I logged into an ATMO cluster and see the following env vars defined in the .bashrc :frank mentioned:

```
-bash-4.2$ cat /mnt/efs/fbertsch\@mozilla.com/.bashrc | cut -f1 -d'='
[[ $TERM !
export PS1
export POSTGRES_PASS
export POSTGRES_USER
export POSTGRES_HOST
export POSTGRES_RO_HOST
export POSTGRES_DB
```
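As a quick sanity check, something like the following could be run inside the Airflow containers to confirm those variables are actually set before the job starts. This is just a sketch using the names from the .bashrc above; it is not part of mozaggregator or the deployment.

```
# Hedged sketch: verify the POSTGRES_* credentials from the .bashrc above are
# present in the environment the job runs under.
import os
import sys

REQUIRED = ("POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASS",
            "POSTGRES_HOST", "POSTGRES_RO_HOST")

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit("Missing database credentials: " + ", ".join(missing))
print("All POSTGRES_* credentials are set.")
```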
Pushing an update to staging & prod that will hopefully fix this. I updated hiera-sops to include the postgres creds in the wtmo env.

https://github.com/mozilla/python_mozaggregator/blame/4b9db0dd68eb641c84084d10d20ff65df46f57e3/mozaggregator/db.py#L48

A few months ago I had some PRs that changed how mozaggregator worked (migration to docker, etc.). Prior to that, Frank had a local file set up that held all the db credentials. It's likely the Airflow job had been executing against that old code, so when we pushed wtmo yesterday to track master, the job broke because it now expects environment variables rather than a conf file.
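For anyone following along: after the docker migration, db.py builds its connection string from those environment variables (see the blame link above). Below is a rough sketch of the idea; the option names and read-only handling are assumptions for illustration, not the actual implementation.

```
# Rough sketch of building a libpq-style connection string from the
# POSTGRES_* environment variables, in the spirit of mozaggregator/db.py
# after the docker migration. Details are assumed, not copied from the code.
import os


def connection_string(read_only=False):
    host = os.environ["POSTGRES_RO_HOST"] if read_only else os.environ["POSTGRES_HOST"]
    return "dbname={} user={} password={} host={}".format(
        os.environ["POSTGRES_DB"],
        os.environ["POSTGRES_USER"],
        os.environ["POSTGRES_PASS"],
        host,
    )
```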
Assignee: fbertsch → hwoo
We have the job kicked off again, and it looks like it's able to connect to the database this time. If that's true, Airflow should run through the last few days of failed jobs overnight and we should be up to date in the morning.
The failed jobs now report as complete, so resolving this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
See Also: → 1471207
Component: Datasets: Telemetry Aggregates → General