Closed Bug 1043504 Opened 10 years ago Closed 9 years ago

Migrate FHR endpoint to Amazon EC2

Categories

(Mozilla Metrics :: Metrics Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED WONTFIX
Target Milestone: Unreviewed

People

(Reporter: mreid, Unassigned)

References

Details

Get data flowing into AWS and process it using EC2 / EMR.
Assignee: nobody → mreid
So far, I have a Kafka consumer that reads FHR data and batches individual requests into larger files, then uploads them to S3.

I've also been able to read these files and send them along to an HBase EMR instance.

Next steps are to shore this all up with a bit more plumbing and make sure we can run the deorphaning job on EMR.

Once that is in place, it should be a fairly simple matter to pull out two weeks of source data at a time and drop it into the pipeline (to replicate the current snapshotting/deorphaning behaviour).
A few notes on this.

The code for the Kafka consumer is here:
https://github.com/mreid-moz/bagheera/blob/fhr_consumer/src/main/java/com/mozilla/bagheera/consumer/S3BatchConsumer.java

I provisioned a VM to run this consumer in bug 1057444, and have confirmed that the consumer can keep up with the incoming FHR data rate.

The consumer batches up raw requests into 500MB blobs and sends them to S3.
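
For reference, the batching logic amounts to something like the following (a minimal Python sketch using kafka-python and boto3; the real consumer is the Java S3BatchConsumer linked above, and the topic and bucket names here are made up):

import io
import time

import boto3
from kafka import KafkaConsumer

BATCH_BYTES = 500 * 1024 * 1024   # flush once the buffer reaches ~500MB
BUCKET = "example-fhr-batches"    # hypothetical bucket name

consumer = KafkaConsumer("fhr_submissions",              # hypothetical topic name
                         bootstrap_servers=["localhost:9092"])
s3 = boto3.client("s3")

buf = io.BytesIO()
for msg in consumer:
    buf.write(msg.value)
    buf.write(b"\n")
    if buf.tell() >= BATCH_BYTES:
        key = "batches/%d.log" % int(time.time())
        s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())
        buf = io.BytesIO()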

The next step is to read these batches and update/delete from HBase accordingly.  The current import code lives in a git branch on my computer.  It uses the 'happybase' python lib to talk to HBase via thrift.
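
The happybase calls involved are essentially the following (a sketch; the Thrift host is hypothetical, while the table and column names match the Pig snapshot below):

import happybase

connection = happybase.Connection("hbase-thrift.example.internal")  # hypothetical host
table = connection.table("metrics")

def apply_put(doc_id, payload):
    # Upsert one FHR submission; payload is the raw JSON bytes.
    table.put(doc_id, {"data:json": payload})

def apply_delete(doc_id):
    # Remove a submission the client asked us to delete.
    table.delete(doc_id)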

In order to have parity with the current FHR backend (and to make sure that Hadoop Streaming MR jobs actually work...), we must filter out submissions that contain invalid UTF-8 characters (see bug 1055102 for details).
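
The filter itself can be as simple as attempting a strict decode and dropping anything that fails (a sketch; the real check lives in the import code):

def is_valid_utf8(payload):
    # Reject raw submissions that Hadoop Streaming would choke on.
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False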

The next step is to run a simple pig script to snapshot the HBase data into S3.  The guts of this pig script are as simple as:
fhr_table = LOAD 'hbase://metrics'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('data:json', '-loadKey true -caching 1000')
    AS (id:chararray, val:chararray);

STORE fhr_table INTO 's3://mreid-fhr-test/snapshot/whatever'
    USING com.twitter.elephantbird.pig.store.SequenceFileStorage (
        '-c com.twitter.elephantbird.pig.util.TextConverter',
        '-c com.twitter.elephantbird.pig.util.TextConverter');

Once the snapshot is finished, we can run a modified version of Brendan's deorphaning pipeline (https://github.com/bcolloran/deorphaning_mrjob). The modded code for this also lives in a branch on my computer, but I'll push it to github as soon as I get a chance to parameterize the job flows and S3 paths.
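
For anyone unfamiliar with mrjob, the jobs in that repo follow the usual mapper/reducer pattern, roughly like this stripped-down sketch (the grouping key and merge rule here are placeholders, not the real deorphaning logic):

from mrjob.job import MRJob

class DeorphanSketch(MRJob):
    # Toy stand-in: group snapshot records by the first tab-separated field.

    def mapper(self, _, line):
        doc_id, payload = line.split("\t", 1)
        yield doc_id, payload

    def reducer(self, doc_id, payloads):
        # Placeholder merge: just collect everything seen for this id.
        yield doc_id, list(payloads)

if __name__ == "__main__":
    DeorphanSketch.run()

A job like this runs locally with "python deorphan_sketch.py input.txt" and on EMR with the -r emr runner.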

The deorphaned data set ends up in S3, and we can run FHR analysis jobs on it using EMR.

To make sure the deorphaning code was working properly, I ran bsmedberg's aggregate-collection.py script, and the output it produced looked correct given the toy data set on which it ran.

So we have something working on AWS for each stage of the data pipeline. My next focus will be on getting the performance up to par with a full data set.
I've run a test of the deorphaning pipeline on a full data set (exported from the prod HBase), and found the following:

A full run takes 9h36m on a 36-node EMR cluster and costs between $65 and $155.

The cluster contained:
 1 m1.large @ spot bid of $0.03
15 c3.8xlarge @ spot bid of $0.50 (actual spot price hovered around $0.26)
20 c3.4xlarge @ spot bid of $0.40 (actual spot price hovered around $0.13)

So the price at the full spot bid would be $155.30 (10 hours at the max price), and the actual price was closer to $65.
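
For the record, the arithmetic behind those two numbers (per-node hourly rates from the list above, rounded up to 10 billable hours):

nodes = [(1,  0.03, 0.03),   # m1.large:   count, max bid, approx. actual spot
         (15, 0.50, 0.26),   # c3.8xlarge
         (20, 0.40, 0.13)]   # c3.4xlarge
hours = 10
max_cost    = hours * sum(n * bid  for n, bid, _  in nodes)   # 155.30
actual_cost = hours * sum(n * spot for n, _, spot in nodes)   #  65.30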
My understanding is that the FHR backend (Bagheera, HBase) will not be migrated to AWS. Benjamin, can you confirm?
Flags: needinfo?(benjamin)
I'll have to talk to IT about that; I don't have an answer yet, but it's not a high priority right now.
Flags: needinfo?(benjamin)
Right now that pipeline involves hadoop, which we want to migrate off of, so that will move, and likely to AWS. I confess I haven't digested the entire comment thread yet, so I'm not sure of the context...
Per conversation with Sheeri, this is a project that IT will be taking, not part of the data pipeline project.
Assignee: mreid → nobody
Component: Telemetry Server → Metrics Operations
Product: Webtools → Mozilla Metrics
Target Milestone: --- → Unreviewed
Version: Trunk → unspecified
Bagheera is being replaced by the new data pipeline, so it will eventually be phased out.
Plan right now is to keep bagheera in place until a good replacement is made, which may be 6-12 months or more.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX