Create data-pipeline github repo

RESOLVED FIXED

Status

Cloud Services
Metrics: Pipeline
3 years ago

People

(Reporter: Katie Parlante, Assigned: mreid)

Tracking

Firefox Tracking Flags

(Not tracked)

(Reporter)

Description

3 years ago
Create a repo for data-pipeline code. It should contain:
- web front end that receives submissions
- Go plugin to populate the datastore
- code to snapshot sandbox state, used to repopulate it after an outage
- code that sets up the data analysis solution (whatever we settle on)

And probably some other parts that I'm not capturing. My understanding is that we're starting with one repo, so some other bits might live here for now (example filters, dashboards, etc.).
(Assignee)

Comment 1

3 years ago
I've set up an initial repo here:
https://github.com/mreid-moz/data-pipeline

There's a build script to generate a heka RPM with all our goodies baked in.

If this looks OK to everyone, I'll get it moved over to "mozilla-services".

At the moment it just contains the stuff I've been working on, namely the Go plugin to populate the data store (`heka/plugins/s3splitfile`) and a command-line tool for exporting data to S3 (`heka/cmd/heka-export`).

I'm envisioning other pieces (web frontend, analysis code) going into different top-level directories, while other components that need to be compiled into heka can go into the "heka" dir.
Flags: needinfo?(whd)
Flags: needinfo?(mtrinkala)
(Assignee)

Updated

3 years ago
OS: Mac OS X → All
Hardware: x86 → All

Comment 2

3 years ago
heka-export is a little too generic of a name, right now it is just an S3 uploader (unless you plan on adding other functionality?).

https://github.com/mreid-moz/data-pipeline/blob/master/heka/sandbox/decoders/extract_fhr_dimensions.lua#L28 
https://github.com/mreid-moz/data-pipeline/blob/master/heka/sandbox/decoders/extract_telemetry_dimensions.lua#L27

can be set once in the initializer list.
Flags: needinfo?(mtrinkala)
(Assignee)

Comment 3

3 years ago
(In reply to Mike Trinkala [:trink] from comment #2)
> heka-export is a little too generic of a name, right now it is just an S3
> uploader (unless you plan on adding other functionality?).
How about "heka-s3export"?

> https://github.com/mreid-moz/data-pipeline/blob/master/heka/sandbox/decoders/extract_fhr_dimensions.lua#L28
> https://github.com/mreid-moz/data-pipeline/blob/master/heka/sandbox/decoders/extract_telemetry_dimensions.lua#L27
> 
> can be set once in the initializer list.
Ok.

How does the overall structure of the repo look?

Comment 4

3 years ago
If the nested plugin directory isn't causing problems with Go then it is fine with me.

Comment 5

3 years ago
The overall structure seems reasonable to me.

The build script runs and builds heka successfully. It prepares and builds heka in ".." which seems a bit scary to me, and I would prefer it build in a .gitignore'd subdirectory of the repo like build. The script also can only be run once effectively, but that's probably fine.

I think the plan is to build https://github.com/mreid-moz/data-pipeline/tree/master/heka/sandbox filters into the RPM as well, but that currently doesn't happen. It's probably as simple as copying them into their respective directories in https://github.com/mozilla-services/heka/tree/dev/sandbox/lua.

I'll file a PR with the rpmrebuild addition mentioned in bug #1122964. There's still some CentOS 7 and/or CMake 2.8 weirdness with the RPM (the /usr/bin conflict problem). I will investigate this more fully.
Flags: needinfo?(whd)
(Assignee)

Comment 6

3 years ago
(In reply to Wesley Dawson [:whd] from comment #5)
> The overall structure seems reasonable to me.
> 
> The build script runs and builds heka successfully. It prepares and builds
> heka in ".." which seems a bit scary to me, and I would prefer it build in a
> .gitignore'd subdirectory of the repo like build. The script also can only
> be run once effectively, but that's probably fine.
I'll change the build script to work in a subdir - that would make me feel better too.
I wasn't sure if repeated builds were worth the effort, but we could add that if needed.

> I think the plan is to build
> https://github.com/mreid-moz/data-pipeline/tree/master/heka/sandbox filters
> into the RPM as well, but that currently doesn't happen. It's probably as
> simple as copying them into their respective directories in
> https://github.com/mozilla-services/heka/tree/dev/sandbox/lua.
Yeah, I think it makes the most sense to put the lua filters in the RPM too.

> 
> I'll file a PR with the rpmrebuild addition mentioned in bug #1122964.
> There's still some CentOS 7 and/or CMake 2.8 weirdness with the RPM (the
> /usr/bin conflict problem). I will investigate this more fully.
Sounds good re: PR.

I've filed bug 1126309 to follow up on the CentOS 7 stuff.
(Assignee)

Comment 7

3 years ago
Ok, it seems like the general structure looks OK - Toby, can you help me get this repo moved over to mozilla-services?
Flags: needinfo?(telliott)
(Assignee)

Comment 8

3 years ago
Repo migrated:

https://github.com/mozilla-services/data-pipeline
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(telliott)
Resolution: --- → FIXED