Closed Bug 1023176 — Opened 10 years ago, Closed 10 years ago

[Baloo] Setup/configure Bagheera for Baloo

Categories: Mozilla Metrics :: Data/Backend Reports (defect)
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking: Not tracked
Resolution: RESOLVED DUPLICATE of bug 1024059

People: Reporter: pierros; Assignee: scabral

Hello, we would like to set up an entry point in Bagheera specific to Baloo so that submitted payloads end up in HBase: https://data.mozilla.com/submit/baloo/ Thanks!
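For reference, a minimal sketch of what a submission to that endpoint might look like, assuming Bagheera's usual pattern of POSTing a JSON document under a client-generated id; the payload fields below are made up, not the Baloo schema:

import json
import uuid

import requests

# Endpoint requested in this bug; the trailing document id is an assumption
# based on how Bagheera-style submit URLs are usually structured.
ENDPOINT = "https://data.mozilla.com/submit/baloo/"

# Illustrative payload only -- these are not the real Baloo schema fields.
payload = {
    "source": "bugzilla",
    "contributor": "example@mozilla.com",
    "activity_date": "2014-06-08",
}

doc_id = str(uuid.uuid4())
resp = requests.post(
    ENDPOINT + doc_id,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print("submitted", doc_id, resp.status_code)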
Group: metrics-private
Pierros - the data warehouse team will handle the intermediate database. We haven't identified a need for HBase; right now the aggregations are simple enough that we're doing them in a MySQL database. Does Bagheera require HBase? Is Bagheera required for Baloo?

It seems like we could run the aggregations in a relational database, such as MySQL or Postgres, and put Tableau in front of it, so that each functional/product area (coding, QA, etc.) can build its own reports and drill-downs. The idea is that this lessens the work on the Metrics folks, because we eliminate unnecessary aggregations. The Metrics folks will still have access to all the data through the data warehouse, so nothing is lost or different. It also lessens the dependence on HBase, which is overkill for the data we have, and avoids the extra step of converting from a relational schema to JSON for HBase. Moving all the data to HBase just seems like overkill and adds a dependency that isn't needed.
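To illustrate the kind of simple aggregation meant here, run straight against a relational store; the connection details, table, and column names are placeholders, not the real schema:

import mysql.connector

# Placeholder connection details, table, and columns.
conn = mysql.connector.connect(host="localhost", database="baloo", user="report")
cur = conn.cursor()
cur.execute(
    """
    SELECT contributor, COUNT(*) AS activity_count
    FROM contributions
    WHERE activity_date >= %s
    GROUP BY contributor
    """,
    ("2014-06-02",),
)
for contributor, activity_count in cur.fetchall():
    print(contributor, activity_count)
cur.close()
conn.close()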
Assignee: nobody → scabral
In discussion today with Pierros, Adam and David, and in a follow-up with :tmary, it seems that the information should flow like this:

0) Submit to Kafka using the schema at https://wiki.mozilla.org/Baloo/Schema/0.1 (to be done by Sheeri for Bugzilla data, and by Pierros, Adam, etc. for the rest); a rough sketch of this submission step is included below.
1) The BI/DW team will write a consumer to read from Kafka and store in the endpoint (to be done by :tmary).
2) The BI/DW team will aggregate the activity from *all* the submissions, including de-duping (to be done by Sheeri).
3) The BI/DW team will extract the results of the aggregations and submit them to Adam using the schema at https://docs.google.com/document/d/16Sas-dbBzSftWqacYhFRojjXLCkAXu6h2XxbkfALG-Q/edit#heading=h.6b13hi21db4l

Do those steps sound right to you? The end result is that Adam gets the data in the format he needs. For now, let's focus on the data for the year ending Sunday, June 8th (I think that's the data point for June 2nd? or June 9th?).
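A rough, untested sketch of step 0, assuming the kafka-python client; the broker address, the topic name "baloo", and the record fields are placeholders rather than anything agreed in this bug:

import json

from kafka import KafkaProducer

# Placeholder broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers="kafka.example.mozilla.com:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

# One record per contribution, shaped per the Baloo 0.1 schema on the wiki;
# the fields shown here are illustrative, not the actual schema.
record = {
    "source": "bugzilla",
    "contributor": "example@mozilla.com",
    "activity_date": "2014-06-08",
}
producer.send("baloo", record)
producer.flush()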
:tmary has a consumer for step 1, but it needs testing with a real data extract (step 0). I'm working on that extract from the MySQL data I have, although if Pierros and Adam have JSON for their parts, that would be a good test too.
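A sketch of one way to generate that JSON test extract from MySQL, one document per line, so the consumer has something to read; the connection details, table, and column names are placeholders:

import json

import mysql.connector

# Placeholder connection, table, and column names.
conn = mysql.connector.connect(host="localhost", database="baloo_staging", user="etl")
cur = conn.cursor(dictionary=True)
cur.execute("SELECT source, contributor, activity_date FROM raw_contributions")

# Write one JSON document per row to feed the Kafka consumer as test data.
with open("baloo_test_extract.json", "w") as out:
    for row in cur:
        row["activity_date"] = str(row["activity_date"])
        out.write(json.dumps(row) + "\n")

cur.close()
conn.close()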
Summary: [Baloo] Setup bagheera production entry point for Baloo → [Baloo] Setup/configure Bagheera for Baloo
I've just shared Github data via bug 1010190.
Adam, I was able to parse the data in CSV format and get it into our midpoint data store for aggregations.
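A simplified sketch of what such a CSV parse-and-load could look like; the file name, connection details, table, and columns are made up, not the actual Github/Bugzilla extract:

import csv

import mysql.connector

# Placeholder file, connection, table, and columns.
conn = mysql.connector.connect(host="localhost", database="baloo_midpoint", user="etl")
cur = conn.cursor()

with open("github_contributions.csv", newline="") as f:
    for row in csv.DictReader(f):
        cur.execute(
            "INSERT INTO contributions (source, contributor, activity_date) "
            "VALUES (%s, %s, %s)",
            ("github", row["contributor"], row["activity_date"]),
        )

conn.commit()
cur.close()
conn.close()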
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE