Closed Bug 1467860 Opened 7 years ago Closed 7 years ago

Telemetry Aggregates failed on 2018-06-06

Categories

(Data Platform and Tools :: General, defect, P1)

Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: hwoo)

References

Details

Attachments

(1 file)

Full error below. We may have partial aggregates from that day, since this is during the database loading stage:

```
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1528458705378_0001/container_1528458705378_0001_01_000023/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1528458705378_0001/container_1528458705378_0001_01_000023/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1041, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1041, in <genexpr>
  File "/mnt/anaconda2/lib/python2.7/site-packages/mozaggregator/db.py", line 68, in <lambda>
    map(lambda x: _upsert_build_id_aggregates(x[0], x[1], connection_string, dry_run=dry_run)).\
  File "/mnt/anaconda2/lib/python2.7/site-packages/mozaggregator/db.py", line 168, in _upsert_build_id_aggregates
    cursor.copy_from(StringIO(stage_table), stage_table_name, columns=("dimensions", "histogram"))
DataError: unsupported Unicode escape sequence
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: ...led":true,"application":"Firefox","architecture":...
COPY staging_build_id_beta_61_20180528, line 42368, column dimensions: "{"metric":"USE_COUNTER2_DEPRECATED_SyncXMLHttpRequest_DOCUMENT","label":"","e10sEnabled":true,"appli..."
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
We've fixed a series of issues so far:

- The NULL char above in [0]
- Failed build in [1]
- New deploy in [2]

The job runs successfully now in an ATMO cluster, but is failing in Airflow after 4+ hours. Working on reproducing that issue. Keeping this bug open until the job is able to complete successfully in Airflow.

[0] https://github.com/mozilla/telemetry-batch-view/commit/14741db20dd3873b94944b8238dfc48a003c744d
[1] https://github.com/mozilla/python_mozaggregator/commit/4b9db0dd68eb641c84084d10d20ff65df46f57e3
[2] https://github.com/mozilla/telemetry-airflow/commit/0f6985a3350bbce0c7ebe19516b825ed2c51163c
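For context on the NULL-char part: Postgres rejects the escaped NUL character (\u0000) in text/JSON columns, which is what killed the copy_from call in the traceback above. The real fix landed upstream in [0]; the snippet below is only a rough illustration of stripping NULs from the dimensions payload before the COPY stage. The helper names here are made up for the example and are not mozaggregator code.

```
# Illustration only: Postgres text/jsonb columns reject NUL (\u0000), so one
# way to guard the COPY stage is to strip NUL characters from string values
# before serializing the dimensions. `sanitize_dimensions` and `to_copy_row`
# are hypothetical helpers, not part of mozaggregator.
import json


def sanitize_dimensions(dimensions):
    """Return a copy of the dimensions dict with NUL chars removed from strings."""
    clean = {}
    for key, value in dimensions.items():
        if isinstance(value, str):
            value = value.replace(u"\u0000", u"")
        clean[key] = value
    return clean


def to_copy_row(dimensions, histogram):
    """Serialize one aggregate as a tab-separated row for cursor.copy_from()."""
    return "%s\t%s\n" % (json.dumps(sanitize_dimensions(dimensions)),
                         json.dumps(histogram))
```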
Confirmed that the job runs from `./run_telemetry_aggregator.sh` in an ATMO cluster without any issues. We have data for 20180608 as a result.

Given that doesn't clarify the issue, running the full EMR Step using:

```
aws s3 cp s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar .
hadoop jar script-runner.jar s3://telemetry-airflow/steps/airflow.sh --job-name "Telemetry Aggregate View" --user frank@mozilla.com --uri https://raw.githubusercontent.com/mozilla/telemetry-airflow/master/jobs/run_telemetry_aggregator.sh --data-bucket telemetry-parquet --environment date=20180607
```

This is the exact step EMR runs, so hopefully it recreates the error.
Running with script-runner.jar on an ATMO cluster still succeeded. Going to try to reproduce the failure from a local Airflow instance.
Well certainly something is up. Buried in one of the executor logs from a failed attempt is:

```
18/06/19 11:04:50 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.31.29.176:34695
18/06/19 11:04:50 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
18/06/19 11:04:50 INFO CoarseGrainedExecutorBackend: Driver from 172.31.29.176:34695 disconnected during shutdown
18/06/19 11:04:50 INFO CoarseGrainedExecutorBackend: Driver from 172.31.29.176:34695 disconnected during shutdown
18/06/19 11:04:50 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.31.29.176:34695
18/06/19 11:04:50 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
```

Log: s3://telemetry-airflow/logs/frank@mozilla.com/Telemetry Aggregate View/j-3JTWT8M7BHEYJ/node/i-0f9c93fb5937ce1ff/applications/spark/spark.log.gz

Checking now to see if 20180610 fails in Airflow, but actually succeeds in entering data into the database.
The 20180610 run failed in Airflow, and the data did not make it into the database. I'm wondering now if this is an authentication issue with the database. The credentials should be set up via environment variables, and I don't think they are getting properly loaded.

Note that the proper credentials can be found in my personal .bashrc file on ATMO machines. The following will work for anyone on the telemetry team:

```
scp -i ~/.ssh/id_rsa hadoop@ec2-54-202-96-12.us-west-2.compute.amazonaws.com:/home/hadoop/efs/fbertsch@mozilla.com/.bashrc .
```

Harold, can you check that the credentials are properly loaded onto the production Airflow machines?
Flags: needinfo?(hwoo)
Harold and I collaborated on deploying prod Airflow yesterday morning, so that certainly could have introduced a change in the environment such as credentials.
What environment variables are you referring to? I see all the ones from cloudops-deployment and hiera-sops on the host, inside the containers.
Flags: needinfo?(hwoo)
I logged into an ATMO cluster and see the following env vars defined in the .bashrc :frank mentioned:

```
-bash-4.2$ cat /mnt/efs/fbertsch\@mozilla.com/.bashrc | cut -f1 -d'='
[[ $TERM !
export PS1
export POSTGRES_PASS
export POSTGRES_USER
export POSTGRES_HOST
export POSTGRES_RO_HOST
export POSTGRES_DB
```
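As a quick sanity check, something like the following could be run inside the Airflow containers to confirm those variables are actually set before the job starts. This is just a sketch using the names from the .bashrc above; it is not part of mozaggregator or the deployment.

```
# Hedged sketch: verify the POSTGRES_* credentials from the .bashrc above are
# present in the environment the job runs under.
import os
import sys

REQUIRED = ("POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASS",
            "POSTGRES_HOST", "POSTGRES_RO_HOST")

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit("Missing database credentials: " + ", ".join(missing))
print("All POSTGRES_* credentials are set.")
```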
Pushing an update to staging & prod that will hopefully fix this. I updated hiera-sops to include the postgres creds in the wtmo env.

https://github.com/mozilla/python_mozaggregator/blame/4b9db0dd68eb641c84084d10d20ff65df46f57e3/mozaggregator/db.py#L48

A few months ago I had some PRs that changed how mozaggregator worked (migration to docker, etc.). Prior to that, Frank had a local file set up that held all the db credentials. It's likely the Airflow job had been executing against that old code, so when we pushed wtmo yesterday to track master, the job broke because it now expects environment variables rather than a conf file.
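For anyone following along: after the docker migration, db.py builds its connection string from those environment variables (see the blame link above). Below is a rough sketch of the idea; the option names and read-only handling are assumptions for illustration, not the actual implementation.

```
# Rough sketch of building a libpq-style connection string from the
# POSTGRES_* environment variables, in the spirit of mozaggregator/db.py
# after the docker migration. Details are assumed, not copied from the code.
import os


def connection_string(read_only=False):
    host = os.environ["POSTGRES_RO_HOST"] if read_only else os.environ["POSTGRES_HOST"]
    return "dbname={} user={} password={} host={}".format(
        os.environ["POSTGRES_DB"],
        os.environ["POSTGRES_USER"],
        os.environ["POSTGRES_PASS"],
        host,
    )
```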
Assignee: fbertsch → hwoo
We have the job kicked off again, and it looks like it's able to connect to the database this time. If that's true, Airflow should run through the last few days of failed jobs overnight and we should be up to date in the morning.
The failed jobs now report as complete, so resolving this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
See Also: → 1471207
Component: Datasets: Telemetry Aggregates → General