Closed Bug 970406 Opened 6 years ago Closed 6 years ago

Create crontabber job to pull ADI data, rather than receive a push from metrics

Categories

(Socorro :: General, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: selenamarie, Assigned: selenamarie)

References

Details

(Whiteboard: [DB change])

Look into what it will take to pull data and populate our raw_adu tables. 

To start: 

* Document flow of data from load balancers to Vertica
* Document logic in ADI rollup
* Investigate ways of implementing a pull system for data
Assignee: nobody → sdeckelmann
Note that this is rather urgent because we do not get any ADI for Nightly right now, as the ADI push does not include ADI from the new blocklist.amo systems which are already used exclusively for Nightly and will soon be used for everything.
The new ADI source is missing channel and build, and possibly other dimensions.
The metrics block list ping format is defined at https://intranet.mozilla.org/Metrics/Blocklist
Current BAMO log line:
115.42.79.246 blocklist.addons.mozilla.org - [18/May/2014:11:00:17
-0700] "GET /blocklist/3/{ec8030f7-c20a-464f-9b0e-13a3a9e97384}/32.0a1/
HTTP/1.1" 200 117560 "-" "Mozilla/5.0 (Windows NT 5.1; rv:32.0)
Gecko/20100101 Firefox/32.0" "-" "DNT:-"

Old/valid log line:
raw     114.143.68.17   addons.mozilla.org      null
18/May/2014:10:16:07 +0000      GET
/blocklist/3/%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D/29.0.1/Firefox/20140506152807/WINNT_x86-msvc/en-US/release/Windows_NT%205.1/default/default/6/215/1/
200     117405  -       Mozilla/5.0 (Windows NT 5.1; rv:29.0)
Gecko/20100101 Firefox/29.0
__utmx=150903082.NxsqlQfBRSSBvkbX6YJy7w$71153379-26:1;
__utmxx=150903082.NxsqlQfBRSSBvkbX6YJy7w$71153379-26:1375365532:15552000;
__utma=150903082.1542016220.1375365555.1375365555.1375365555.1
DNT:-

The current BAMO format is missing quite a few parameters. We won't be able to decipher channel name and other information based on the current BAMO ping.
(In reply to Anurag Phadke[:aphadke@mozilla.com] from comment #3)
> The current BAMO format is missing quite a few parameters. We won't be able
> to decipher channel name and other information based on the current BAMO
> ping.

Hrm, that sounds like a problem in logging, then, we send the exact same ping with the same fields and length to BAMO than what we did send to AMO.
That sucks, I guess I need to revert the bug 1006615 patches as we need to get decent data again by the start of next week.
OK, I backed out the switch to BAMO and I will not land it again before we can't get that data into Socorro in some way.
Note that even with this, the issue is urgent, as we know we are missing ADI and have unstable data with the "old" AMO systems serving the blocklist and ADI data.
Blocks: 1006615
Depends on: 1011648
Depends on: 1015383
Blocks: 1020320
> Blocks: 1020320

Can we get a document that explains how this export process is blocking the ability to move to blocklist.addons.org? 

I'm confused about where data is going to and coming from and how the new process differs from the old with regard to the v2_raw_logs table. 

What I want: two diagrams - one that shows the data flow from the old system into whatever the old tables are (that we are currently pulling ADI from) and a new diagram that shows the data flow in the new system (that will talk with blocklist.addons.org). 

Thank you!
progress!

(hive)[sdeckelmann@socorro1.dev.db.phx1 hive-test]$ python test-script.py  > /pgdata/tmp/test_hive/full_output 
QUERY START 06/06/14 09:24:08AM
QUERY COMPLETE 06/06/14 09:27:23AM
DONE 06/06/14 09:28:33AM

Moving this into a cronjob that can push data into postgres.
(hive)[sdeckelmann@socorro1.dev.db.phx1 hive-test]$ python test-script.py  > /pgdata/tmp/test_hive/full_output 
QUERY START 06/06/14 01:09:58PM
QUERY COMPLETE 06/06/14 01:13:04PM
DONE 06/06/14 01:14:30PM
(hive)[sdeckelmann@socorro1.dev.db.phx1 hive-test]$ wc /pgdata/tmp/test_hive/full_output 
  188990  1699415 21978586 /pgdata/tmp/test_hive/full_output
(hive)[sdeckelmann@socorro1.dev.db.phx1 hive-test]$ time python load-data.py 

real    0m2.274s
user    0m0.039s
sys     0m0.031s

So we're looking at it taking about 5-6 minutes to do a rollup of this data and get it into Postgres from Hive :)


Now looking into normalizing it for inserting into our raw_adu table.
Side note: I investigated using StringIO instead of a temporary file. Unfortunately it added 10 minutes to the runtime vs using a temporary file. :/

QUERY START 10/06/14 02:41:09PM
QUERY COMPLETE 10/06/14 02:53:25PM
DONE 10/06/14 02:54:36PM

I'll set up two implementations of this in the event that we decide that's a reasonable tradeoff.
No longer blocks: 1024666
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/cc1c0679f21b3759d5982e21cf84125a989e6ea1
Fixes bug 970406 - request ADI/blocklist ping data from Hive daily
* adds new crontabber job: fetch_adi_from_hive
* adds pyhs2 dependency
* adds a new table for raw log aggregate data

https://github.com/mozilla/socorro/commit/b06802119b838124188523ab419547b5217aeff2
Migration for bug 970406

https://github.com/mozilla/socorro/commit/c69284ca88e82739cf400231c06f6e6c467926b4
Fixes bug 970406 r=peterbe

https://github.com/mozilla/socorro/commit/2c0e7a7cac8746b897abad2c023ff49e45863102
Merge pull request #2102 from selenamarie/bug970406-pull-adi-crontabber

Fixes bug 970406 - pull ADI/blocklist pings from Hive with crontabber
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90
Whiteboard: [DB change]
Target Milestone: 90 → 91
Blocks: 1011648
No longer depends on: 1011648
Summary: Update our ADI process to pull data, rather than receive a push from metrics → Create crontabber job to pull ADI data, rather than receive a push from metrics
You need to log in before you can comment on or make changes to this bug.