Bug 1011648 (Closed): opened 8 years ago, closed 7 years ago

Enable ADI pull from Hive export scripts for Socorro

Categories: Socorro :: Backend (task)
Severity: normal
Tracking: not tracked
Status: RESOLVED FIXED
People: Reporter: aphadke; Assignee: rhelmer
Attachments: 1 file

The data team currently pushes nightly ADI data to Socorro, which involves a lot of moving pieces and teams. Under the new plan, the Socorro team will write an ADI import job that will:
a.) Connect to Hive
b.) Run a Hive query
c.) Pull in the output of the Hive query
d.) Format and export the output to Socorro's own internal tables.
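A minimal Python sketch of steps a-d, assuming a DB-API-style Hive cursor. The query, table, and column names here are illustrative placeholders only; the real source tables and schema are worked out later in this bug.

```python
import datetime

# Illustrative only: "some_adi_table" is a placeholder, not a real Hive table.
HIVE_QUERY = (
    "SELECT ds, product_name, adu_count "
    "FROM some_adi_table WHERE ds = '{day}'"
)

def fetch_adi_rows(cursor, day):
    """Steps a-c: run the Hive query for one day and pull the output."""
    cursor.execute(HIVE_QUERY.format(day=day))
    return cursor.fetchall()

def format_row(hive_row):
    """Step d: reshape one Hive row for Socorro's own internal table."""
    ds, product_name, adu_count = hive_row
    return {
        "date": datetime.datetime.strptime(ds, "%Y-%m-%d").date(),
        "product_name": product_name,
        "adu_count": int(adu_count),
    }
```

The formatted rows would then be written to Socorro's Postgres tables by whatever job runner ends up scheduling this (a plain cron job is discussed below).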

-anurag
Assignee: nobody → sdeckelmann
:aphadke@mozilla.com -- could we get:

* a copy of the XML from the Kettle scripts used to generate this data currently? My understanding is that there's a bunch of logic embedded in there.
* a copy of the HBase schema we're talking to
* credentials for connecting to the prod/stage environment(s)
OS: Mac OS X → All
Hardware: x86 → All
Ricardo - can you provide the same to Selena?
Flags: needinfo?(rfradinho)
Attached file socorro_export.sql
The kettle script is really simple - just a SQL select step against Vertica and a "SQL write step" on the Socorro database. I've attached the Vertica SQL query. You'll need to tweak it for Hive.
Flags: needinfo?(rfradinho)
Depends on: 1015383
:rfradinho -- can you provide credentials for running a query against Hive?
Flags: needinfo?(rfradinho)
09:37 < tmary_away_maybe> selenamarie: if you are using thrift/jdbc to connect to hive, all you need is access to peach-gw.peach.metrics.scl3.mozilla.com port 10000 (username: use your posix username, password: ignored, auth: simple)
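For reference, a hedged sketch of opening such a connection from Python. PyHive is an assumption here (the bug doesn't say which client library was used), and actually running this obviously requires network access to the gateway; only the host/port/auth details come from the IRC paste above.

```python
# Connection details are from the IRC paste above; PyHive is an assumed
# client library (any Thrift/JDBC HiveServer client would work similarly).
HIVE_HOST = "peach-gw.peach.metrics.scl3.mozilla.com"
HIVE_PORT = 10000

def hive_cursor(username):
    """Open a Hive cursor; the password is ignored, auth is plain/simple."""
    from pyhive import hive  # third-party dependency, assumed installed
    conn = hive.connect(host=HIVE_HOST, port=HIVE_PORT, username=username)
    return conn.cursor()
```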
Flags: needinfo?(rfradinho)
Hey Anurag and/or tmary,

I had a look at the tables available in Hive, and none of the table names correspond to the table referenced in the query that was attached to this bug (blocklist_requests_by_day).

Could you please let me know which table I should be looking at?

Here is a dump of the tables visible when I connect to production Hive: 

(hive)[sdeckelmann@socorro1.dev.db.phx1 hive-test]$ python test-script.py 
[['default', '']]
['addons_pings']
['adi_backfill']
['adi_backfill_ipv6']
['aus_logs']
['bad_addons']
['bing_logs']
['boo']
['complete_logs']
['csv_table']
['dev_www_logs']
['fhr_rc']
['fhrdaily1pct']
['fhrweekly1pct']
['firefox_adi']
['full_processed']
['ip_counts']
['joy_temp_1']
['joy_temp_2']
['mango_bing_logs']
['mango_research_logs']
['metlog_json']
['new_table']
['one_pct_fhr']
['onepctofjoy']
['processed_logs']
['raw_logs']
['research_logs']
['snippet_logs']
['startup_ping']
['t_fhr_search_counts']
['telemetry_dupe_docs']
['temp_anonymized']
['temp_aphadke']
['temp_joy_fhr']
['temp_sumo']
['test_tmary']
['v1_processed_logs']
['v1_raw_logs']
['v2_anonymous_logs']
['v2_processed_logs']
['v2_raw_logs']
['xtemp']

Thanks!
-selena
Flags: needinfo?(aphadke)
hey selena - 
v2_anonymous_logs and v2_raw_logs are the tables that relate to the data you want.

sample queries:
select * from v2_raw_logs where domain='addons.mozilla.org' and ds='2014-06-01' limit 5;

select * from v2_anonymous_logs where domain='addons.mozilla.org' and ds='2014-06-01' limit 5;

What dimensions do you look for in terms of ADI queries? # of requests/day by OS/channel/version? anything else?
Flags: needinfo?(aphadke) → needinfo?(sdeckelmann)
Here's our raw_adu table: 

breakpad=# \d raw_adu
                     Table "public.raw_adu"
       Column        |           Type           |   Modifiers   
---------------------+--------------------------+---------------
 adu_count           | integer                  | 
 date                | date                     | 
 product_name        | text                     | 
 product_os_platform | text                     | 
 product_os_version  | text                     | 
 product_version     | text                     | 
 build               | text                     | 
 build_channel       | text                     | 
 product_guid        | text                     | 
 received_at         | timestamp with time zone | default now()
 update_channel      | text                     | 

build_channel and update_channel are the same thing - we're trying to consolidate terminology inside the database, but we're about halfway through that project :)

received_at is an auto-filled column that lets us know when ADI came in each day. 

In looking at the data, it appears that I should parse out information from "request_url". Is that correct?
Flags: needinfo?(sdeckelmann)
yes, it should be request_url, something such as:

where request_url like '/blocklist/3/%'

I'll have the query ready for our 10am meeting.
notes from meeting between Selena and Anurag @ 6/3/2014 - 10:00am PST
Background:

    0-0.5% delta with new data

    old system uses a set of rules that weren't valid

Blocklist ping format: https://intranet.mozilla.org/Metrics/Blocklist
sample query:
select * from v2_raw_logs where request_url like '/blocklist/3/%' and domain='addons.mozilla.org' and ds='2014-06-01' limit 5;
(raw_adu column list omitted - identical to the "public.raw_adu" schema shown above; build_channel and update_channel are the same.)
breakpad:5432# select * from raw_adu limit 5;
-[ RECORD 1 ]-------+-------------------------------------
adu_count           | 1
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120503154655
build_channel       | release
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
-[ RECORD 2 ]-------+-------------------------------------
adu_count           | 60
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120223105947
build_channel       | release
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
-[ RECORD 3 ]-------+-------------------------------------
adu_count           | 1887
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120129020652
build_channel       | release
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
-[ RECORD 4 ]-------+-------------------------------------
adu_count           | 54
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120123232851
build_channel       | beta
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
-[ RECORD 5 ]-------+-------------------------------------
adu_count           | 230
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120118081111
build_channel       | beta
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
Time: 0.336 ms
select ds, split(request_url,'/')[5], split(split(request_url,'/')[10], '%20')[0], split(split(request_url,'/')[10], '%20')[1], split(request_url,'/')[4], split(request_url,'/')[6], split(request_url,'/')[9], split(request_url,'/')[3], request_time from v2_raw_logs where domain='addons.mozilla.org' and ds='2014-06-01' and request_url like '/blocklist/3/%' limit 5;

2014-06-01	Firefox	Windows_NT	6.1	29.0.1	20140506152807	release	%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D	01/Jun/2014:11:32:25 +0000
2014-06-01	Firefox	Windows_NT	6.1	29.0.1	20140506152807	release	%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D	01/Jun/2014:11:32:25 +0000
2014-06-01	Firefox	Windows_NT	5.1	29.0.1	20140506152807	release	%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D	01/Jun/2014:11:32:25 +0000
2014-06-01	Firefox	Windows_NT	6.1	29.0.1	20140506152807	release	%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D	01/Jun/2014:11:32:25 +0000
2014-06-01	Firefox	Windows_NT	6.1	29.0.1	20140506152807	release	%7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D	01/Jun/2014:11:32:25 +0000
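The same field extraction can be mirrored in plain Python. This is a hypothetical helper, not code from the actual job: it uses the same indices as the Hive split() expressions above (Hive arrays are 0-indexed, and the leading '/' makes element 0 empty), and the sample URL in the test is constructed from the blocklist ping format, not taken from real logs.

```python
from urllib.parse import unquote

def parse_blocklist_url(request_url):
    """Extract ADI fields from a blocklist ping URL, mirroring the
    split(request_url, '/') indices used in the Hive query above."""
    parts = request_url.split("/")
    os_field = parts[10].split("%20")  # e.g. "Windows_NT%206.1"
    return {
        "product_name": parts[5],           # split(...)[5]
        "product_os_platform": os_field[0],
        "product_os_version": os_field[1] if len(os_field) > 1 else "",
        "product_version": parts[4],        # split(...)[4]
        "build": parts[6],                  # split(...)[6]
        "update_channel": parts[9],         # split(...)[9]
        "product_guid": unquote(parts[3]),  # split(...)[3], URL-decoded
    }
```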
Working on a pure Python version of the fetch process. It will probably just be a cron job, or possibly something to be run by crontabber if we can parallelize it.
Currently blocked on bug 1025346.
Have access, will begin testing Monday.
Delay in testing due to EOQ goal work.
No longer blocks: 970406
Depends on: 970406
Summary: ADI export scripts → Enable ADI export scripts
Summary: Enable ADI export scripts → Enable ADI pull from Hive export scripts for Socorro
Blocks: 1020320
Assignee: sdeckelmann → rhelmer
Status: NEW → ASSIGNED
Depends on: 1035402
(In reply to Selena Deckelmann :selenamarie :selena from comment #5)
> 09:37 < tmary_away_maybe> selenamarie: if you are using thrift/jdbc to
> connect to hive, all you need is access to
> peach-gw.peach.metrics.scl3.mozilla.com port 10000 (username: use your posix
> username, password: ignored, auth: simple)

Hey tmary - is it possible to use the "socorro" username so it's not tied to any of our accounts ^
Flags: needinfo?(tmeyarivan)
(In reply to Robert Helmer [:rhelmer] from comment #15)
> 
> Hey tmary - is it possible to use the "socorro" username so it's not tied to
> any of our accounts ^

Yes

--
Flags: needinfo?(tmeyarivan)
Depends on: 1035911
Depends on: 1035946
Depends on: 1035957
(In reply to T [:tmary] Meyarivan from comment #16)
> (In reply to Robert Helmer [:rhelmer] from comment #15)
> > 
> > Hey tmary - is it possible to use the "socorro" username so it's not tied to
> > any of our accounts ^
> 
> Yes
> 
> --

Thanks this works!
Hey Anurag (or tmary), does it matter what time we initiate this query? Could I start it as early as e.g. 1 AM UTC (or should it be Pacific?)
Flags: needinfo?(aphadke)
Hey rob,
the current ETL process starts @ 2:30am UTC and gets completed by 6am UTC. We can initiate the query by 7am UTC - is that too late for you?
Flags: needinfo?(aphadke)
(In reply to Anurag Phadke[:aphadke@mozilla.com] from comment #19)
> Hey rob,
> the current ETL process starts @ 2:30am UTC and gets completed by 6am UTC.
> We can initiate the query by 7am UTC, is that too late for you?

I'll set it to start at 7am UTC.
Whiteboard: [config change]
Whiteboard: [config change]
Depends on: 1036462
Depends on: 1037244
Depends on: 1037855
Depends on: 1036559
No longer blocks: 1030840
Depends on: 1030840
We're now using this on Socorro stage - I just dumped the schema and ensured that we're only using the new "raw_adi" table and not the "raw_adu" table anywhere, and a backfill produces reasonable numbers.

KaiRo - would you (or someone familiar with Socorro e.g. crashkill) mind looking over crash-stats.allizom.org ADI for yesterday and today when it's ready? I'll email socorro-dev and crashkill too.
Flags: needinfo?(kairo)
Depends on: 1053632
Please make sure the raw_adu table is archived and dropped before closing this bug. If that's not something you can do, my team is happy to help.
(In reply to Sheeri Cabral [:sheeri] from comment #22)
> Please make sure the raw_adu table is archived and dropped before closing
> this bug. If that's not something you can do, my team is happy to help.

That sounds to me like a low-priority followup to file separately, not something that should block this one and hinder the dependency from being marked fixed.
(In reply to Sheeri Cabral [:sheeri] from comment #22)
> Please make sure the raw_adu table is archived and dropped before closing
> this bug. If that's not something you can do, my team is happy to help.

We can do this in a migration - however I was going to do this as a separate step, not blocking this bug. 

Do you just want to make sure we don't forget it? If there's a more specific reason to block this bug please let me know.
Flags: needinfo?(kairo) → needinfo?(scabral)
Not blocking, just wanted to make sure it wasn't forgotten.
Flags: needinfo?(scabral)
Blocks: 1053956
(In reply to Sheeri Cabral [:sheeri] from comment #25)
> Not blocking, just wanted to make sure it wasn't forgotten.

Ok great! Filed bug 1053956 to track this, assigned to me - I will do it right after we get the current bug finished out.
All blockers down, testing on stage before we deploy this to prod.
Depends on: 1054078
kairo and I looked at this on stage, we are planning to ship this to prod tomorrow.

All ADI numbers were very close to our current source, except Linux Firefox on Nightly, which had a slightly larger delta than expected. It's still a small discrepancy, and it appears to be in the source data - nothing we're filtering on the Socorro side. Worth following up, but I don't think it needs to block (please let me know ASAP if you disagree).

The old ADI worked by filtering on the Metrics side (in Vertica IIRC) which was then pushed to the raw_adu table in Socorro - does anyone know how I can get access to whatever was doing the log parsing, to make sure we're treating Linux the same way? Thanks!
Flags: needinfo?(tmeyarivan)
Flags: needinfo?(scabral)
Flags: needinfo?(kairo)
This is now in production, since we fetch ADI and build ADI-dependent reports once per day it'll be active starting tomorrow.

I'll file followup bugs for the issues mentioned in comment 28.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(tmeyarivan)
Flags: needinfo?(scabral)
Flags: needinfo?(kairo)
Resolution: --- → FIXED
(In reply to Robert Helmer [:rhelmer] from comment #29)
> This is now in production, since we fetch ADI and build ADI-dependent
> reports once per day it'll be active starting tomorrow.
> 
> I'll file followup bugs for the issues mentioned in comment 28.

bug 1055695
:lonnen - I want to retire this Kettle script on 9/4 because I have WebDetails on retainer during September; that way they can provide warranty should anything go wrong after we actually turn down the cron running it.  I want your sign-off that we can do that.  Will you monitor over the next week and provide sign-off by 8/29? Or let me know what issues still exist so that we can close.  Thx!
Flags: needinfo?(chris.lonnen)
Are you referring to the Kettle scripts that pushed the old ADU source into our Postgres? If so, we are good to go.
Flags: needinfo?(chris.lonnen) → needinfo?(srich)
(In reply to Chris Lonnen :lonnen from comment #32)
> are you referring to the kettle scripts that pushed the old adu source into
> our postgres? If so, we are good to go.

Chris - yes.  Okay, I'll have :rfradinho remove them.  Thanks for the reply.
Flags: needinfo?(srich)