Closed Bug 1388351 Opened 7 years ago Closed 6 years ago

Bump up dataset version and backfill data

Categories

(Data Platform and Tools Graveyard :: Datasets: Error Aggregates, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdoglio, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

This is due to a change in the way we handle experiments data (see bug 1383759).
There are a couple of changes incoming that will require a schema change/backfill. Let's wait for them to land so we make the best use of machine and human hours.
Depends on: 1388361, 1380753
No longer depends on: 1380753
Depends on: 1380753
Depends on: 1387610
This should be relatively painless as compared to previous deploys because (a) we're actually bumping the version so we can run the two stacks in parallel and (b) the backfilling logic is mostly automated: https://github.com/mozilla-services/cloudops-deployment/blob/telemetry_streaming/projects/data/ansible/README-backfill.md.
I started this job on a new cluster in prod, and also attempted a backfill, but both failed with memory issues (looks like yarn is SIGKILLing executors). Doubling the cluster size did not improve the situation, and I confirmed that using the current provisioning logic (which should be the same as the old) I'm able to run the old version of the code. I will continue to investigate tomorrow, get logs forwarded to the relevant parties, and set up dev for proper investigation.

https://app.datadoghq.com/dash/298014/telemetry-streaming---spark?live=true&page=0&is_auto=false&from_ts=1504049517403&to_ts=1504053117403&tile_size=s&tpl_var_stackid=2 (stack id 2) shows my attempts at running the job, where the currently running job is the old version of the code to a test bucket.

I also filed https://github.com/mozilla/telemetry-streaming/issues/66 for updating the build logic for newer versions of sbt.
Here's some debugging information for the dev cluster. :mdoglio and/or :frank are going to take a look next week.

EMR cluster id: j-1LGCF3MP2EB8F
Master address: ec2-34-211-48-179.us-west-2.compute.amazonaws.com
Key name: 20161025-dataops-dev
Master node: c3.4xl (previous version in prod: c3.xl)
Core nodes: 4 c3.4xl (previous version in prod: 4 c3.2xls)
EMR version: 5.6.0

Debug dashboards
https://app.datadoghq.com/dash/298014/telemetry-streaming---spark?live=true&page=0&is_auto=false&from_ts=1504301362856&to_ts=1504304962856&tile_size=s&tpl_var_stackid=2&tpl_var_env=dev
https://app.datadoghq.com/dash/290159/telemetry-streaming?live=true&page=0&is_auto=false&from_ts=1504301374542&to_ts=1504304974542&tile_size=xl&tpl_var_stackid=2&tpl_var_env=dev

Log locations:
hdfs:/var/log/spark/apps
/var/log/spark/spark.log
(stdout)

The repo is checked out in

/mnt1/telemetry-streaming

and sbt 0.13.5 is installed.

The command I'm testing looks like:

spark-submit --master yarn --deploy-mode client --conf spark.yarn.am.memory=16g --conf spark.executor.memory=16g --class com.mozilla.telemetry.streaming.ErrorAggregator /mnt1/telemetry-streaming/target/scala-2.11/telemetry-streaming-assembly-0.1-SNAPSHOT.jar --kafkaBroker kafka-2.pipeline.us-west-2.dev.mozaws.net:6667 --outputPath /tmp/test --checkpointPath /tmp/checkpoint-1

Notably I'm not writing to s3 in this configuration.

We could try moving to spark 2.2.0 and EMR 5.8.0, but for reasons not yet elucidated the production provisioning logic failed during bootstrap when I tried setting up such a cluster.
Attached file parquet-dump.out
I ran error_aggregates on one day (20170801); the data is in `s3://telemetry-test-bucket/frank/submission_date=20170801/error_aggregates/v1`. That one day resulted in ~180GB of data, over 90MM rows.

I pulled down a sample of it in parquet (50000 rows) and confirmed what :mdoglio mentioned: the HLL columns are what's blowing up the file sizes. I have attached the parquet dump; note the size (SZ).
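For reference, a quick way to double-check which columns dominate the parquet files (independent of the dump above) is to sum the per-column compressed sizes from the parquet footer with pyarrow. This is a minimal sketch, and the file path is a placeholder for one of the sampled files:

import pyarrow.parquet as pq

# Placeholder path: point this at one of the sampled parquet files.
md = pq.ParquetFile("sample.parquet").metadata

# Sum compressed bytes per column across all row groups; the HLL columns
# should stand out if they are what's inflating the files.
sizes = {}
for rg in range(md.num_row_groups):
    row_group = md.row_group(rg)
    for c in range(row_group.num_columns):
        col = row_group.column(c)
        sizes[col.path_in_schema] = sizes.get(col.path_in_schema, 0) + col.total_compressed_size

for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(name, size)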

I also ran some analysis on the sizes of the dimensions. The two largest dimensions, by far, are build_id and os_version; each is in the 10^4 range, while everything else is only in the 10^3 range [0].

I think if we:
1. Reduce the number of dimensions by whitelisting (a rough sketch follows below)
2. Increase the number of partitions

we should be able to run the job as-is. Another possibility is removing the client_counts temporarily and working iteratively on adding them back.
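For illustration only (not necessarily how we'd implement it in telemetry-streaming), whitelisting a dimension could look like collapsing values outside an allowed set into a catch-all bucket before aggregating, which keeps the key space bounded. The input path and whitelist here are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample input, purely for illustration.
df = spark.read.parquet("/tmp/error_aggregates_sample")
ALLOWED_OS_VERSIONS = ["10.0", "6.1", "6.3"]  # hypothetical whitelist

# Values outside the whitelist collapse into "other", bounding cardinality.
pruned = df.withColumn(
    "os_version",
    F.when(F.col("os_version").isin(ALLOWED_OS_VERSIONS), F.col("os_version"))
     .otherwise(F.lit("other")))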


[0] - output of the approximate number of distinct values per dimension (a sketch for reproducing these follows the list).
[Row(approx_count_distinct(window_start)=295)]
[Row(approx_count_distinct(window_end)=293)]
[Row(approx_count_distinct(submission_date)=1)]
[Row(approx_count_distinct(channel)=6)]
[Row(approx_count_distinct(version)=111)]
[Row(approx_count_distinct(build_id)=4605)]
[Row(approx_count_distinct(application)=1)]
[Row(approx_count_distinct(os_name)=8)]
[Row(approx_count_distinct(os_version)=9528)]
[Row(approx_count_distinct(architecture)=9)]
[Row(approx_count_distinct(country)=236)]
[Row(approx_count_distinct(experiment_id)=33)]
[Row(approx_count_distinct(experiment_branch)=88)]
[Row(approx_count_distinct(e10s_enabled)=2)]
[Row(approx_count_distinct(e10s_cohort)=42)]
[Row(approx_count_distinct(gfx_compositor)=7)]
[Row(approx_count_distinct(quantum_ready)=2)]
[Row(approx_count_distinct(profile_age_days)=92)]
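A minimal PySpark sketch of how counts like these can be regenerated, assuming the one-day sample in the test-bucket path mentioned above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet(
    "s3://telemetry-test-bucket/frank/submission_date=20170801/error_aggregates/v1")

dimensions = [
    "window_start", "window_end", "submission_date", "channel", "version",
    "build_id", "application", "os_name", "os_version", "architecture",
    "country", "experiment_id", "experiment_branch", "e10s_enabled",
    "e10s_cohort", "gfx_compositor", "quantum_ready", "profile_age_days"]

for dim in dimensions:
    # One collect per dimension mirrors the one-Row-per-line output above.
    print(df.select(approx_count_distinct(dim)).collect())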
:whd, we have updated the job to be drastically smaller in size. Deploying should work (relatively) seamlessly now [0].

[0] The gist: the HLL columns were too large (see my earlier attachment), so we removed all but one of them (https://github.com/mozilla/telemetry-streaming/pull/72).
Flags: needinfo?(whd)
I've been attempting to run this in prod but am still running into memory issues after 10-15 minutes of execution. See https://app.datadoghq.com/dash/290159/telemetry-streaming?live=false&page=0&is_auto=false&from_ts=1505862263528&to_ts=1505867752897&tile_size=xl&tpl_var_env=prod&tpl_var_stackid=2&fullscreen=false for some attempted runs in production yesterday. The logs indicate the Spark executor heartbeat is timing out, which is usually caused by OOM issues, and the YARN application logs confirm this. Looking at the memory graph, it's usually a single node that appears to run out of memory. It doesn't appear to be the driver program, because the affected node changes across restarts and we're running in client deploy mode. I'll start looking at the various EMR debug interfaces tomorrow.

I also had some issues running the backfill, but I've got a job running now for about a month of data that appears to be working well.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #7)
> I also had some issues running the backfill, but I've got a job running now
> for about a month of data that appears to be working well.

For the backfill, are you using batch mode? If so, what file sizes are you seeing? I bumped up the partition count but with the recent changes there are probably way too many, so the file sizes are probably much smaller than we'd like.
(In reply to Frank Bertsch [:frank] from comment #8)
> For the backfill, are you using batch mode? If so, what file sizes are you
> seeing? I bumped up the partition count but with the recent changes there
> are probably way too many, so the file sizes are probably much smaller than
> we'd like.

See s3://telemetry-backfill/error_aggregates/20170919/20170601/error_aggregates/v2/ for some such data, though it was generated before the latest round of PRs landed today. The files are indeed very much smaller than before; IIRC batch mode previously created a single file per submission date (not ideal either).
Blocks: 1376589
(In reply to Wesley Dawson [:whd] from comment #9)
> See
> s3://telemetry-backfill/error_aggregates/20170919/20170601/error_aggregates/
> v2/ for some such data, though generated before the latest round of PRs
> landed today. They are indeed very much smaller than previously, where IIRC
> batch mode created a single file per submission date (not ideal either).

Great, those files are still too small though. I'll make the file size configurable so we can get those to a good size.
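As a rough sketch of one way the file sizes could be controlled (names and paths here are placeholders, not the job's actual API): bound the number of partitions before the write, since the partition count determines how many output files are produced and hence how big each one is.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths, purely for illustration.
aggregates = spark.read.parquet("/tmp/error_aggregates/v2")

# Tune so each parquet file lands at a reasonable size (e.g. a few hundred MB).
target_partitions = 16

(aggregates
    .coalesce(target_partitions)
    .write
    .mode("overwrite")
    .parquet("/tmp/error_aggregates/v2-compacted"))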
:whd, I think we can try again. We're now pruning build_id and os_version, the two largest dimensions.
Flags: needinfo?(whd)
I'm seeing the same symptoms with the latest build, both in dev for batch backfill and live production data:

https://app.datadoghq.com/dash/290159/telemetry-streaming?live=false&page=0&is_auto=false&from_ts=1506026617118&to_ts=1506028608595&tile_size=xl&tpl_var_env=prod&tpl_var_stackid=2&fullscreen=false

https://app.datadoghq.com/dash/290159/telemetry-streaming?live=false&page=0&is_auto=false&from_ts=1506022489292&to_ts=1506032399627&tile_size=xl&tpl_var_env=dev&tpl_var_stackid=*&fullscreen=false

Once again it looks like a single executor OOMs, at which point the job stops producing data and then eventually fails. I am not sure why this would be, but it appears possibly unrelated to the large number of dimensions.
Flags: needinfo?(whd)
Interestingly while poking around today I added a process supervisor that kills the executor that is going to OOM before it can actually do so, hoping that the process will be restarted and complete successfully on a different executor. It seems to be working albeit unstably, as a new executor always starts up and begins to leak memory: https://app.datadoghq.com/dash/290159/telemetry-streaming?live=true&page=0&is_auto=false&from_ts=1506483592284&to_ts=1506497992284&tile_size=xl&tpl_var_env=prod&tpl_var_stackid=2&fullscreen=false. I don't know what specifically is leaking memory yet but I will continue to look into it. If this continues to run and produce parquet output this may be a viable workaround for running the live production job until we've figured out the proper fix.
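For the record, a purely illustrative sketch of what such a "kill before OOM" supervisor could look like (this is not the supervisor actually deployed, and the threshold and polling interval are made up): watch the Spark executor JVMs and kill any that grow past a memory limit so the work gets rescheduled on a fresh executor.

import time
import psutil

# Hypothetical threshold, set somewhat below the YARN container limit.
RSS_LIMIT_BYTES = 14 * 1024 ** 3

while True:
    for proc in psutil.process_iter(["cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        mem = proc.info["memory_info"]
        # Spark executors run as CoarseGrainedExecutorBackend JVMs under YARN.
        if "CoarseGrainedExecutorBackend" in cmdline and mem and mem.rss > RSS_LIMIT_BYTES:
            proc.kill()
    time.sleep(30)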

This was done using the master minus the latest PR that broke running the jar, which should have the same fix as is available in https://github.com/mozilla/telemetry-streaming/pull/79.
No longer blocks: 1376589
Blocks: 1397998
Blocks: 1411397
Blocks: 1386766
This will probably be done in Q1, either with a move to Databricks or with a deploy of spark-master.
Points: --- → 3
Component: Mission Control → Datasets: Error Aggregates
Priority: -- → P3
Product: Cloud Services → Data Platform and Tools
s/Q1/Q2/

We'll eventually move this dataset from its current provisional s3 bucket, but it's now available from Athena and is backfilled to the beginning of 2018.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard