Closed Bug 1388351 Opened 7 years ago Closed 6 years ago

Bump up dataset version and backfill data

Categories

(Data Platform and Tools Graveyard :: Datasets: Error Aggregates, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdoglio, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

This is due to a change in the way we handle experiments data (see bug 1383759).
There are a couple of changes incoming that will require a schema change/backfill. Let's wait for them to land so we make the best use of machine and human hours.
Depends on: 1388361, 1380753
No longer depends on: 1380753
Depends on: 1380753
Depends on: 1387610
This should be relatively painless as compared to previous deploys because (a) we're actually bumping the version so we can run the two stacks in parallel and (b) the backfilling logic is mostly automated: https://github.com/mozilla-services/cloudops-deployment/blob/telemetry_streaming/projects/data/ansible/README-backfill.md.
I started this job on a new cluster in prod, and also attempted a backfill, but both failed with memory issues (looks like yarn is SIGKILLing executors). Doubling the cluster size did not improve the situation, and I confirmed that using the current provisioning logic (which should be the same as the old) I'm able to run the old version of the code. I will continue to investigate tomorrow, get logs forwarded to the relevant parties, and set up dev for proper investigation.

https://app.datadoghq.com/dash/298014/telemetry-streaming---spark?live=true&page=0&is_auto=false&from_ts=1504049517403&to_ts=1504053117403&tile_size=s&tpl_var_stackid=2 (stack id 2) shows my attempts at running the job, where the currently running job is the old version of the code to a test bucket.

I also filed https://github.com/mozilla/telemetry-streaming/issues/66 for updating the build logic for newer versions of sbt.
Here's some debugging information for the dev cluster. :mdoglio and/or :frank are going to take a look next week.

EMR cluster id: j-1LGCF3MP2EB8F
Master address: ec2-34-211-48-179.us-west-2.compute.amazonaws.com
Key name: 20161025-dataops-dev
Master node: c3.4xl (previous version in prod: c3.xl)
Core nodes: 4 c3.4xl (previous version in prod: 4 c3.2xls)
EMR version: 5.6.0

Debug dashboards
https://app.datadoghq.com/dash/298014/telemetry-streaming---spark?live=true&page=0&is_auto=false&from_ts=1504301362856&to_ts=1504304962856&tile_size=s&tpl_var_stackid=2&tpl_var_env=dev
https://app.datadoghq.com/dash/290159/telemetry-streaming?live=true&page=0&is_auto=false&from_ts=1504301374542&to_ts=1504304974542&tile_size=xl&tpl_var_stackid=2&tpl_var_env=dev

Log locations:
hdfs:/var/log/spark/apps
/var/log/spark/spark.log
(stdout)

The repo is checked out in

/mnt1/telemetry-streaming

and sbt 0.13.5 is installed.

The command I'm testing looks like:

spark-submit --master yarn --deploy-mode client --conf spark.yarn.am.memory=16g --conf spark.executor.memory=16g --class com.mozilla.telemetry.streaming.ErrorAggregator /mnt1/telemetry-streaming/target/scala-2.11/telemetry-streaming-assembly-0.1-SNAPSHOT.jar --kafkaBroker kafka-2.pipeline.us-west-2.dev.mozaws.net:6667 --outputPath /tmp/test --checkpointPath /tmp/checkpoint-1

Notably I'm not writing to s3 in this configuration.

We could try moving to spark 2.2.0 and EMR 5.8.0, but for reasons not yet elucidated the production provisioning logic failed during bootstrap when I tried setting up such a cluster.
Attached file parquet-dump.out
I ran error_aggregates on one day (20170801); the data is in `s3://telemetry-test-bucket/frank/submission_date=20170801/error_aggregates/v1`. That one day resulted in ~180GB of data, over 90MM rows.

I pulled down a sample of it in parquet (50000 rows) and confirmed what :mdoglio mentioned: the HLL columns are what's blowing up the file sizes. I have attached the parquet dump; note the size (SZ).
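For reference, a quick way to double-check which columns dominate the parquet files (independent of the dump above) is to sum the per-column compressed sizes from the parquet footer with pyarrow. This is a minimal sketch, and the file path is a placeholder for one of the sampled files:

import pyarrow.parquet as pq

# Placeholder path: point this at one of the sampled parquet files.
md = pq.ParquetFile("sample.parquet").metadata

# Sum compressed bytes per column across all row groups; the HLL columns
# should stand out if they are what's inflating the files.
sizes = {}
for rg in range(md.num_row_groups):
    row_group = md.row_group(rg)
    for c in range(row_group.num_columns):
        col = row_group.column(c)
        sizes[col.path_in_schema] = sizes.get(col.path_in_schema, 0) + col.total_compressed_size

for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(name, size)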

I also ran some analysis on the sizes of the dimensions. The two largest dimensions, by far, are build_id and os_version; each is in the 10^4 range, while everything else is only in the 10^3 range [0].

I think if we:
1. Reduce the number of dimensions by whitelisting (a rough sketch follows below)
2. Increase the number of partitions

we should be able to run the job as-is. Another possibility is removing the client_counts temporarily and working iteratively on adding them back.
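For illustration only (not necessarily how we'd implement it in telemetry-streaming), whitelisting a dimension could look like collapsing values outside an allowed set into a catch-all bucket before aggregating, which keeps the key space bounded. The input path and whitelist here are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample input, purely for illustration.
df = spark.read.parquet("/tmp/error_aggregates_sample")
ALLOWED_OS_VERSIONS = ["10.0", "6.1", "6.3"]  # hypothetical whitelist

# Values outside the whitelist collapse into "other", bounding cardinality.
pruned = df.withColumn(
    "os_version",
    F.when(F.col("os_version").isin(ALLOWED_OS_VERSIONS), F.col("os_version"))
     .otherwise(F.lit("other")))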


[0] - output of the approximate number of distinct values per dimension (a sketch for reproducing these follows the list).
[Row(approx_count_distinct(window_start)=295)]
[Row(approx_count_distinct(window_end)=293)]
[Row(approx_count_distinct(submission_date)=1)]
[Row(approx_count_distinct(channel)=6)]
[Row(approx_count_distinct(version)=111)]
[Row(approx_count_distinct(build_id)=4605)]
[Row(approx_count_distinct(application)=1)]
[Row(approx_count_distinct(os_name)=8)]
[Row(approx_count_distinct(os_version)=9528)]
[Row(approx_count_distinct(architecture)=9)]
[Row(approx_count_distinct(country)=236)]
[Row(approx_count_distinct(experiment_id)=33)]
[Row(approx_count_distinct(experiment_branch)=88)]
[Row(approx_count_distinct(e10s_enabled)=2)]
[Row(approx_count_distinct(e10s_cohort)=42)]
[Row(approx_count_distinct(gfx_compositor)=7)]
[Row(approx_count_distinct(quantum_ready)=2)]
[Row(approx_count_distinct(profile_age_days)=92)]
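A minimal PySpark sketch of how counts like these can be regenerated, assuming the one-day sample in the test-bucket path mentioned above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet(
    "s3://telemetry-test-bucket/frank/submission_date=20170801/error_aggregates/v1")

dimensions = [
    "window_start", "window_end", "submission_date", "channel", "version",
    "build_id", "application", "os_name", "os_version", "architecture",
    "country", "experiment_id", "experiment_branch", "e10s_enabled",
    "e10s_cohort", "gfx_compositor", "quantum_ready", "profile_age_days"]

for dim in dimensions:
    # One collect per dimension mirrors the one-Row-per-line output above.
    print(df.select(approx_count_distinct(dim)).collect())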
:whd, we have updated the job to be drastically smaller in size. Deploying should work (relatively) seamlessly now [0].

[0] The gist: the HLL columns were too large (see my earlier attachment), so we removed all but one of them (https://github.com/mozilla/telemetry-streaming/pull/72).
Flags: needinfo?(whd)
I've been attempting to run this in prod but am still running into memory issues after 10-15 minutes of execution. See https://app.datadoghq.com/dash/290159/telemetry-streaming?live=false&page=0&is_auto=false&from_ts=1505862263528&to_ts=1505867752897&tile_size=xl&tpl_var_env=prod&tpl_var_stackid=2&fullscreen=false for some attempted runs in production yesterday. The logs indicate the Spark executor heartbeat is timing out, which is usually caused by OOM issues, and the YARN application logs confirm this. Looking at the memory graph, it's usually a single node that appears to run out of memory. It doesn't appear to be the driver program, because the affected node changes across restarts and we're running in client deploy mode. I'll start looking at the various EMR debug interfaces tomorrow.

I also had some issues running the backfill, but I've got a job running now for about a month of data that appears to be working well.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #7)
> I also had some issues running the backfill, but I've got a job running now
> for about a month of data that appears to be working well.

For the backfill, are you using batch mode? If so, what file sizes are you seeing? I bumped up the partition count but with the recent changes there are probably way too many, so the file sizes are probably much smaller than we'd like.
(In reply to Frank Bertsch [:frank] from comment #8)
> For the backfill, are you using batch mode? If so, what file sizes are you
> seeing? I bumped up the partition count but with the recent changes there
> are probably way too many, so the file sizes are probably much smaller than
> we'd like.

See s3://telemetry-backfill/error_aggregates/20170919/20170601/error_aggregates/v2/ for some such data, though it was generated before the latest round of PRs landed today. The files are indeed very much smaller than before; IIRC batch mode previously created a single file per submission date (not ideal either).
Blocks: 1376589
(In reply to Wesley Dawson [:whd] from comment #9)
> See
> s3://telemetry-backfill/error_aggregates/20170919/20170601/error_aggregates/
> v2/ for some such data, though generated before the latest round of PRs
> landed today. They are indeed very much smaller than previously, where IIRC
> batch mode created a single file per submission date (not ideal either).

Great, those files are still too small though. I'll make the file size configurable so we can get those to a good size.
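As a rough sketch of one way the file sizes could be controlled (names and paths here are placeholders, not the job's actual API): bound the number of partitions before the write, since the partition count determines how many output files are produced and hence how big each one is.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths, purely for illustration.
aggregates = spark.read.parquet("/tmp/error_aggregates/v2")

# Tune so each parquet file lands at a reasonable size (e.g. a few hundred MB).
target_partitions = 16

(aggregates
    .coalesce(target_partitions)
    .write
    .mode("overwrite")
    .parquet("/tmp/error_aggregates/v2-compacted"))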
:whd, I think we can try again. We're now pruning build_id and os_version, the two largest dimensions.
Flags: needinfo?(whd)
I'm seeing the same symptoms with the latest build, both in dev for batch backfill and live production data:

https://app.datadoghq.com/dash/290159/telemetry-streaming?live=false&page=0&is_auto=false&from_ts=1506026617118&to_ts=1506028608595&tile_size=xl&tpl_var_env=prod&tpl_var_stackid=2&fullscreen=false

https://app.datadoghq.com/dash/290159/telemetry-streaming?live=false&page=0&is_auto=false&from_ts=1506022489292&to_ts=1506032399627&tile_size=xl&tpl_var_env=dev&tpl_var_stackid=*&fullscreen=false

Once again it looks like a single executor OOMs, at which point the job stops producing data and then eventually fails. I am not sure why this would be, but it appears possibly unrelated to the large number of dimensions.
Flags: needinfo?(whd)
Interestingly while poking around today I added a process supervisor that kills the executor that is going to OOM before it can actually do so, hoping that the process will be restarted and complete successfully on a different executor. It seems to be working albeit unstably, as a new executor always starts up and begins to leak memory: https://app.datadoghq.com/dash/290159/telemetry-streaming?live=true&page=0&is_auto=false&from_ts=1506483592284&to_ts=1506497992284&tile_size=xl&tpl_var_env=prod&tpl_var_stackid=2&fullscreen=false. I don't know what specifically is leaking memory yet but I will continue to look into it. If this continues to run and produce parquet output this may be a viable workaround for running the live production job until we've figured out the proper fix.
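For the record, a purely illustrative sketch of what such a "kill before OOM" supervisor could look like (this is not the supervisor actually deployed, and the threshold and polling interval are made up): watch the Spark executor JVMs and kill any that grow past a memory limit so the work gets rescheduled on a fresh executor.

import time
import psutil

# Hypothetical threshold, set somewhat below the YARN container limit.
RSS_LIMIT_BYTES = 14 * 1024 ** 3

while True:
    for proc in psutil.process_iter(["cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        mem = proc.info["memory_info"]
        # Spark executors run as CoarseGrainedExecutorBackend JVMs under YARN.
        if "CoarseGrainedExecutorBackend" in cmdline and mem and mem.rss > RSS_LIMIT_BYTES:
            proc.kill()
    time.sleep(30)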

This was done using the master minus the latest PR that broke running the jar, which should have the same fix as is available in https://github.com/mozilla/telemetry-streaming/pull/79.
No longer blocks: 1376589
Blocks: 1397998
Blocks: 1411397
Blocks: 1386766
This will probably be done in Q1, either with a move to Databricks or with a deploy of spark-master.
Points: --- → 3
Component: Mission Control → Datasets: Error Aggregates
Priority: -- → P3
Product: Cloud Services → Data Platform and Tools
s/Q1/Q2/

We'll eventually move this dataset from its current provisional s3 bucket, but it's now available from Athena and is backfilled to the beginning of 2018.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard