Closed Bug 1476067 Opened 7 years ago Closed 7 years ago

Set sane defaults for main_summary DAG when the deploy environment is dev

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points: 2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

Attachments

(1 file)

When adding a new dependency to the main_summary DAG, we generally want to check that the data dependencies are set up correctly and that the job will run without any serious errors. Currently, the full main_summary DAG is run when debugging locally. Instead, a subset of the main pings should be selected, and the size of the clusters adjusted for the smaller volume. It might be useful to have a macro or function that decides on the size of the cluster for us, assuming performance scales linearly with cluster size. For example, if the job originally used a 30-node cluster and now only uses 1 node, downstream jobs could use ceil(nodes/30). In practice, everything will probably run on a single machine when DEPLOY_ENVIRONMENT=dev.
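A minimal sketch of what such a sizing function could look like, assuming linear scaling. The function name, the `BASELINE_NODES` constant, and the `environ` parameter are hypothetical and are not taken from the PR; only `DEPLOY_ENVIRONMENT=dev` and the ceil(nodes/30) rule come from the description above.

```python
import math
import os

# Hypothetical constant: node count of the full-size upstream job, used as
# the baseline for linear scaling (30 in the example above).
BASELINE_NODES = 30


def scaled_cluster_size(prod_nodes, baseline_nodes=BASELINE_NODES, environ=None):
    """Return the number of nodes to request for a job.

    Outside of dev, the production size is returned unchanged. In dev, the
    size is scaled down linearly to ceil(prod_nodes / baseline_nodes), and
    never drops below a single node.
    """
    # Allow a dict to be passed in so the function can be exercised without
    # touching the real process environment.
    env = os.environ if environ is None else environ
    if env.get("DEPLOY_ENVIRONMENT") != "dev":
        return prod_nodes
    return max(1, math.ceil(prod_nodes / baseline_nodes))
```

With these assumptions, a downstream job that normally requests 45 nodes would request 2 in dev, and anything at or below 30 nodes would collapse to a single machine.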
The PR describes the changes involved in setting sane defaults. One thing I ended up doing in the Dockerfile that might be relevant for reference is setting `EMR_INSTANCE_TYPE=c3.2xlarge`. It takes roughly 2 hours to run the entire pipeline, and there is quite a bit of cluster startup overhead involved in this process.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Scheduling → General