Bug 1467860: Telemetry Aggregates failed on 2018-06-06
Status: RESOLVED FIXED (opened 7 years ago, closed 7 years ago)
Categories: Data Platform and Tools :: General, defect, P1
Tracking: Not tracked
People: Reporter: frank, Assigned: hwoo
Attachments: 1 file

Description
Full error below. We may have partial aggregates from that day, since the failure occurred during the database-loading stage:
```
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1528458705378_0001/container_1528458705378_0001_01_000023/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1528458705378_0001/container_1528458705378_0001_01_000023/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1041, in <lambda>
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1041, in <genexpr>
File "/mnt/anaconda2/lib/python2.7/site-packages/mozaggregator/db.py", line 68, in <lambda>
map(lambda x: _upsert_build_id_aggregates(x[0], x[1], connection_string, dry_run=dry_run)).\
File "/mnt/anaconda2/lib/python2.7/site-packages/mozaggregator/db.py", line 168, in _upsert_build_id_aggregates
cursor.copy_from(StringIO(stage_table), stage_table_name, columns=("dimensions", "histogram"))
DataError: unsupported Unicode escape sequence
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: ...led":true,"application":"Firefox","architecture":...
COPY staging_build_id_beta_61_20180528, line 42368, column dimensions: "{"metric":"USE_COUNTER2_DEPRECATED_SyncXMLHttpRequest_DOCUMENT","label":"","e10sEnabled":true,"appli..."
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
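For context on the DataError above: Postgres text/JSON values cannot contain the NUL character, so a staged row whose JSON carries a `\u0000` escape is rejected during `copy_from`. Below is a minimal sketch of the general mitigation, stripping NUL from strings before they are JSON-encoded; this is not the fix that actually landed (see comment 1), and `sanitize_dimensions` is a hypothetical helper, not mozaggregator code.
```
# Minimal sketch (Python 3; the production job ran Python 2.7): strip NUL
# characters from dimension strings before JSON-encoding them, since
# Postgres rejects \u0000 in text/JSON columns at COPY time.
import json

def sanitize_dimensions(dimensions):
    """Remove NUL characters from all string values in a dimensions dict."""
    return {
        key: value.replace("\u0000", "") if isinstance(value, str) else value
        for key, value in dimensions.items()
    }

# Illustrative only: a NUL embedded in one of the string fields.
row = {"metric": "USE_COUNTER2_DEPRECATED_SyncXMLHttpRequest_DOCUMENT",
       "label": "", "e10sEnabled": True, "application": "Firefox\u0000"}
print(json.dumps(sanitize_dimensions(row)))  # no \u0000 escape in the output
```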
Reporter
Comment 1 • 7 years ago
We've fixed a series of issues so far:
- The NULL char above in [0]
- Failed build in [1]
- New deploy in [2]
The job runs successfully now in an ATMO cluster, but is failing in Airflow after 4+ hours. Working on reproducing that issue. Keeping this bug open until the job is able to complete successfully in Airflow.
[0] https://github.com/mozilla/telemetry-batch-view/commit/14741db20dd3873b94944b8238dfc48a003c744d
[1] https://github.com/mozilla/python_mozaggregator/commit/4b9db0dd68eb641c84084d10d20ff65df46f57e3
[2] https://github.com/mozilla/telemetry-airflow/commit/0f6985a3350bbce0c7ebe19516b825ed2c51163c
Reporter
Comment 2 • 7 years ago
Confirmed that the job runs from `./run_telemetry_aggregator.sh` on an ATMO cluster without any issues; as a result, we have data for 20180608. Since that doesn't pinpoint the problem, I'm now running the full EMR step with:
```
aws s3 cp s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar .
hadoop jar script-runner.jar s3://telemetry-airflow/steps/airflow.sh --job-name "Telemetry Aggregate View" --user frank@mozilla.com --uri https://raw.githubusercontent.com/mozilla/telemetry-airflow/master/jobs/run_telemetry_aggregator.sh --data-bucket telemetry-parquet --environment date=20180607
```
This is the exact step EMR runs, so hopefully it recreates the error.
Reporter
Comment 3 • 7 years ago
Running with script-runner.jar on an ATMO cluster still succeeded. I'm going to try to reproduce from a local Airflow instance.
Reporter
Comment 4 • 7 years ago
Something is certainly up. Buried in one of the executor logs from a failed attempt is:
```
18/06/19 11:04:50 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.31.29.176:34695
18/06/19 11:04:50 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
18/06/19 11:04:50 INFO CoarseGrainedExecutorBackend: Driver from 172.31.29.176:34695 disconnected during shutdown
18/06/19 11:04:50 INFO CoarseGrainedExecutorBackend: Driver from 172.31.29.176:34695 disconnected during shutdown
18/06/19 11:04:50 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.31.29.176:34695
18/06/19 11:04:50 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
```
Log: s3://telemetry-airflow/logs/frank@mozilla.com/Telemetry Aggregate View/j-3JTWT8M7BHEYJ/node/i-0f9c93fb5937ce1ff/applications/spark/spark.log.gz
Checking now whether the 20180610 run fails in Airflow but still manages to load data into the database.
Reporter
Comment 5 • 7 years ago
The 20180610 run failed in Airflow, and the data did not make it into the database.
I'm now wondering if this is an authentication issue with the database. The credentials should be set up via environment variables, and I don't think they are being loaded properly.
Note that the proper credentials can be found in my personal .bashrc file on ATMO machines. The following will work for anyone on the telemetry team:
```
scp -i ~/.ssh/id_rsa hadoop@ec2-54-202-96-12.us-west-2.compute.amazonaws.com:/home/hadoop/efs/fbertsch@mozilla.com/.bashrc .
```
Harold, can you check that the credentials are properly loaded onto production Airflow machines?
Flags: needinfo?(hwoo)
Comment 6 • 7 years ago
Harold and I collaborated on deploying prod Airflow yesterday morning, so that certainly could have introduced a change in the environment such as credentials.
Assignee
Comment 7 • 7 years ago
What environment variables are you referring to? I see all the ones from cloudops-deployment and hiera-sops on the host, inside the containers.
Flags: needinfo?(hwoo)
Comment 8 • 7 years ago
I logged into an ATMO cluster and see the following env vars defined in the .bashrc that :frank mentioned:
```
-bash-4.2$ cat /mnt/efs/fbertsch\@mozilla.com/.bashrc | cut -f1 -d'='
[[ $TERM !
export PS1
export POSTGRES_PASS
export POSTGRES_USER
export POSTGRES_HOST
export POSTGRES_RO_HOST
export POSTGRES_DB
```
Assignee
Comment 9 • 7 years ago
Pushing an update to staging and prod that will hopefully fix this. I updated hiera-sops to include the Postgres creds in the wtmo env.
https://github.com/mozilla/python_mozaggregator/blame/4b9db0dd68eb641c84084d10d20ff65df46f57e3/mozaggregator/db.py#L48
A few months ago I had some PRs that changed how mozaggregator works (migration to Docker, etc.). Prior to that, Frank had a local file set up with all the DB credentials. The Airflow job was likely still executing that old code, so when we pushed wtmo to track master yesterday, I'm assuming that broke the job (the new code expects environment variables rather than a conf file).
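For illustration, here is a minimal sketch of the kind of env-var lookup the current code relies on. The variable names come from the .bashrc listing in comment 8; the helper name and error message are hypothetical, not the actual db.py code linked above.
```
# Hypothetical sketch, not mozaggregator/db.py: build a libpq-style
# connection string from the POSTGRES_* variables listed in comment 8,
# and fail loudly if any are missing (the suspected failure mode on the
# Airflow workers after the wtmo deploy).
import os

REQUIRED = ("POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASS", "POSTGRES_HOST")

def connection_string_from_env():
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError("missing Postgres credentials: " + ", ".join(missing))
    return ("dbname={POSTGRES_DB} user={POSTGRES_USER} "
            "password={POSTGRES_PASS} host={POSTGRES_HOST}").format(**os.environ)
```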
Assignee
Updated • 7 years ago
Assignee: fbertsch → hwoo
Comment 10 • 7 years ago

Comment 11 • 7 years ago
We have the job kicked off again, and it looks like it's able to connect to the database this time. If so, Airflow should work through the last few days of failed jobs overnight, and we should be up to date in the morning.
Comment 12 • 7 years ago
The failed jobs now report as complete, so resolving this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated • 3 years ago
Component: Datasets: Telemetry Aggregates → General