Closed
Bug 1011648
Opened 10 years ago
Closed 10 years ago
Enable ADI pull from Hive export scripts for Socorro
Categories: Socorro :: Backend (task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: aphadke, Assigned: rhelmer)
Attachments: (1 file) 1.16 KB, text/sql
The data team currently pushes nightly ADI data to Socorro and that involves a lot of moving pieces/teams. Based on the new plan, the Socorro team will write an ADI import job that will:
a) connect to Hive
b) run the Hive query
c) pull in the output of the Hive query
d) format and export the output to Socorro's own internal tables.
-anurag
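[Editorial sketch] The four steps above could look roughly like the following. This is an illustrative outline only, not the actual Socorro job: the function name, the output column names, and the stubbed sample row are assumptions for the sketch, and a real job would replace the stub with a Thrift/JDBC connection to Hive.

```python
# Hypothetical sketch of the four-step ADI import job; names are assumed.

def format_adi_rows(hive_rows):
    """Step (d): turn raw Hive result rows into dicts shaped like a
    raw_adi-style table (column names chosen for illustration)."""
    formatted = []
    for (report_date, product, os_name, os_version,
         version, build, channel, guid, count) in hive_rows:
        formatted.append({
            "date": report_date,
            "product_name": product,
            "product_os_platform": os_name,
            "product_os_version": os_version,
            "product_version": version,
            "build": build,
            "update_channel": channel,
            "product_guid": guid,
            "adu_count": int(count),  # Hive returns strings; coerce to int
        })
    return formatted


if __name__ == "__main__":
    # Steps (a)-(c) would connect to Hive (e.g. Thrift on port 10000),
    # run the blocklist-ping query, and fetch rows; stubbed here.
    sample = [("2014-06-01", "Firefox", "Windows_NT", "6.1",
               "29.0.1", "20140506152807", "release",
               "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}", "1")]
    for row in format_adi_rows(sample):
        print(row["product_name"], row["adu_count"])
```

In the real pipeline the formatted dicts would then be inserted into Socorro's internal tables (step d's final export).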
Updated•10 years ago
Assignee: nobody → sdeckelmann
Comment 1•10 years ago
:aphadke@mozilla.com -- could we get: * a copy of the XML from the Kettle scripts used to generate this data currently? My understanding is that there's a bunch of logic embedded in there. * a copy of the Hbase schema we're talking to * credentials for connecting to the prod/stage environment(s)
OS: Mac OS X → All
Hardware: x86 → All
Reporter
Comment 2•10 years ago
Ricardo - can you provide the same to Selena?
Flags: needinfo?(rfradinho)
Comment 3•10 years ago
The kettle script is really simple - just a SQL select step against Vertica and a "SQL write step" on the Socorro database. I've attached the Vertica SQL query. You'll need to tweak it for Hive.
Flags: needinfo?(rfradinho)
Comment 4•10 years ago
:rfradinho -- can you provide credentials for running a query against Hive?
Flags: needinfo?(rfradinho)
Comment 5•10 years ago
09:37 < tmary_away_maybe> selenamarie: if you are using thrift/jdbc to connect to hive, all you need is access to peach-gw.peach.metrics.scl3.mozilla.com port 10000 (username: use your posix username, password: ignored, auth: simple)
Flags: needinfo?(rfradinho)
Comment 6•10 years ago
Hey Anurag and/or tmary, I had a look at the tables available in Hive, and none of the table names correspond to the table referenced in the query that was attached to this bug (blocklist_requests_by_day). Could you please let me know which table I should be looking at? Here is a dump of the tables visible when I connect to production Hive:

(hive)[sdeckelmann@socorro1.dev.db.phx1 hive-test]$ python test-script.py
[['default', '']]
['addons_pings']
['adi_backfill']
['adi_backfill_ipv6']
['aus_logs']
['bad_addons']
['bing_logs']
['boo']
['complete_logs']
['csv_table']
['dev_www_logs']
['fhr_rc']
['fhrdaily1pct']
['fhrweekly1pct']
['firefox_adi']
['full_processed']
['ip_counts']
['joy_temp_1']
['joy_temp_2']
['mango_bing_logs']
['mango_research_logs']
['metlog_json']
['new_table']
['one_pct_fhr']
['onepctofjoy']
['processed_logs']
['raw_logs']
['research_logs']
['snippet_logs']
['startup_ping']
['t_fhr_search_counts']
['telemetry_dupe_docs']
['temp_anonymized']
['temp_aphadke']
['temp_joy_fhr']
['temp_sumo']
['test_tmary']
['v1_processed_logs']
['v1_raw_logs']
['v2_anonymous_logs']
['v2_processed_logs']
['v2_raw_logs']
['xtemp']

Thanks!
-selena
Flags: needinfo?(aphadke)
Reporter
Comment 7•10 years ago
hey selena - v2_anonymous_logs and v2_raw_logs are the tables that relate to the data you want. Sample queries:

select * from v2_raw_logs where domain='addons.mozilla.org' and ds='2014-06-01' limit 5;
select * from v2_anonymous_logs where domain='addons.mozilla.org' and ds='2014-06-01' limit 5;

What dimensions do you look for in terms of ADI queries? # of requests/day by OS/channel/version? Anything else?
Flags: needinfo?(aphadke) → needinfo?(sdeckelmann)
Comment 8•10 years ago
Here's our raw_adu table:

breakpad=# \d raw_adu
               Table "public.raw_adu"
       Column        |           Type           |   Modifiers
---------------------+--------------------------+---------------
 adu_count           | integer                  |
 date                | date                     |
 product_name        | text                     |
 product_os_platform | text                     |
 product_os_version  | text                     |
 product_version     | text                     |
 build               | text                     |
 build_channel       | text                     |
 product_guid        | text                     |
 received_at         | timestamp with time zone | default now()
 update_channel      | text                     |

build_channel and update_channel are the same thing - we're trying to consolidate terminology inside the database, but are about halfway through that project :)

received_at is an auto-filled column that lets us know when ADI came in each day.

In looking at the data, it appears that I should parse out information from "request_url". Is that correct?
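[Editorial sketch] For illustration, the table shape above can be exercised with sqlite3 standing in for Postgres. This is a sketch under assumptions: the real table lives in the breakpad Postgres database and would be written via a Postgres driver, and the sqlite column types only approximate the Postgres ones.

```python
import sqlite3

# In-memory sqlite stand-in for public.raw_adu, mirroring the \d output
# above (timestamp/date columns approximated as TEXT in sqlite).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_adu (
        adu_count           INTEGER,
        date                TEXT,
        product_name        TEXT,
        product_os_platform TEXT,
        product_os_version  TEXT,
        product_version     TEXT,
        build               TEXT,
        build_channel       TEXT,
        product_guid        TEXT,
        received_at         TEXT DEFAULT CURRENT_TIMESTAMP,
        update_channel      TEXT
    )
""")

# One row shaped like the sample records later in this bug.
row = (1, "2013-12-30", "Fennec", "Linux", "", "10.0",
       "20120503154655", "release",
       "a23983c0-fd0e-11dc-95ff-0800200c9a66", "release")
conn.execute("""
    INSERT INTO raw_adu (adu_count, date, product_name, product_os_platform,
                         product_os_version, product_version, build,
                         build_channel, product_guid, update_channel)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", row)

count = conn.execute("SELECT count(*) FROM raw_adu").fetchone()[0]
```

Note how received_at is left out of the INSERT so the DEFAULT fills it in, matching the auto-filled behavior described above.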
Flags: needinfo?(sdeckelmann)
Reporter
Comment 9•10 years ago
Yes, it should be request_url - something such as:

where request_url like '/blocklist/3/%'

I'll have the query ready for our 10am meeting.
Reporter
Comment 10•10 years ago
notes from meeting between Selena and Anurag @ 6/3/2014 - 10:00am PST

Background:
- 0-.5% delta with new data
- old system uses a set of rules that weren't valid
- Blocklist ping format: https://intranet.mozilla.org/Metrics/Blocklist

sample query:

select * from v2_raw_logs where request_url like '/blocklist/3/%' and domain='addons.mozilla.org' and ds='2014-06-01' limit 5;

       Column        |           Type           |   Modifiers
---------------------+--------------------------+---------------
 adu_count           | integer                  |
 date                | date                     |
 product_name        | text                     |
 product_os_platform | text                     |
 product_os_version  | text                     |
 product_version     | text                     |
 build               | text                     |
 build_channel       | text                     |
 product_guid        | text                     |
 received_at         | timestamp with time zone | default now()
 update_channel      | text                     |

build_channel and update_channel are the same.

breakpad:5432# select * from raw_adu limit 5;
-[ RECORD 1 ]-------+-------------------------------------
adu_count           | 1
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120503154655
build_channel       | release
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
-[ RECORD 2 ]-------+-------------------------------------
adu_count           | 60
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120223105947
build_channel       | release
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
-[ RECORD 3 ]-------+-------------------------------------
adu_count           | 1887
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120129020652
build_channel       | release
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
-[ RECORD 4 ]-------+-------------------------------------
adu_count           | 54
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120123232851
build_channel       | beta
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
-[ RECORD 5 ]-------+-------------------------------------
adu_count           | 230
date                | 2013-12-30
product_name        | Fennec
product_os_platform | Linux
product_os_version  |
product_version     | 10.0
build               | 20120118081111
build_channel       | beta
product_guid        | a23983c0-fd0e-11dc-95ff-0800200c9a66
received_at         | 2013-12-31 04:54:04.136377+00
update_channel      |
Time: 0.336 ms

select ds,
       split(request_url,'/')[5],
       split(split(request_url,'/')[10], '%20')[0],
       split(split(request_url,'/')[10], '%20')[1],
       split(request_url,'/')[4],
       split(request_url,'/')[6],
       split(request_url,'/')[9],
       split(request_url,'/')[3],
       request_time
  from v2_raw_logs
 where domain='addons.mozilla.org'
   and ds='2014-06-01'
   and request_url like '/blocklist/3/%'
 limit 5;

2014-06-01  Firefox  Windows_NT  6.1  29.0.1  20140506152807  release  %7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D  01/Jun/2014:11:32:25 +0000
2014-06-01  Firefox  Windows_NT  6.1  29.0.1  20140506152807  release  %7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D  01/Jun/2014:11:32:25 +0000
2014-06-01  Firefox  Windows_NT  5.1  29.0.1  20140506152807  release  %7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D  01/Jun/2014:11:32:25 +0000
2014-06-01  Firefox  Windows_NT  6.1  29.0.1  20140506152807  release  %7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D  01/Jun/2014:11:32:25 +0000
2014-06-01  Firefox  Windows_NT  6.1  29.0.1  20140506152807  release  %7Bec8030f7-c20a-464f-9b0e-13a3a9e97384%7D  01/Jun/2014:11:32:25 +0000
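[Editorial sketch] The split() indexing in the Hive query above can be mirrored in Python to make the field positions explicit. A minimal sketch, assuming the blocklist-ping URL layout implied by the query's indices; the field names here are inferred for illustration, not authoritative.

```python
from urllib.parse import unquote


def parse_blocklist_url(request_url):
    """Mirror the Hive split(request_url, '/') indexing from the
    meeting-notes query; field meanings are assumed, not authoritative."""
    parts = request_url.split("/")  # parts[0] is '' due to the leading '/'
    # Index 10 carries OS name and version, joined by an encoded space (%20).
    os_field = unquote(parts[10]).split(" ")
    return {
        "product": parts[5],                                   # split[5]
        "os_name": os_field[0],                                # split[10] %20 [0]
        "os_version": os_field[1] if len(os_field) > 1 else "",# split[10] %20 [1]
        "version": parts[4],                                   # split[4]
        "build": parts[6],                                     # split[6]
        "channel": parts[9],                                   # split[9]
        "guid": unquote(parts[3]),                             # split[3]
    }
```

For example, a URL like the ones in the query output parses to product "Firefox", channel "release", and os_name/os_version "Windows_NT"/"6.1".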
Comment 11•10 years ago
Working on a pure python version of the fetch process. Probably will just be a cron job, or possibly something to be run by crontabber if we can parallelize it.
Comment 12•10 years ago
Currently blocked on bug 1025346.
Comment 13•10 years ago
Have access, will begin testing Monday.
Comment 14•10 years ago
Delay in testing due to EOQ goal work.
Updated•10 years ago
Updated•10 years ago
Summary: ADI export scripts → Enable ADI export scripts
Updated•10 years ago
Summary: Enable ADI export scripts → Enable ADI pull from Hive export scripts for Socorro
Assignee
Updated•10 years ago
Assignee: sdeckelmann → rhelmer
Status: NEW → ASSIGNED
Assignee
Comment 15•10 years ago
(In reply to Selena Deckelmann :selenamarie :selena from comment #5)
> 09:37 < tmary_away_maybe> selenamarie: if you are using thrift/jdbc to
> connect to hive, all you need is access to
> peach-gw.peach.metrics.scl3.mozilla.com port 10000 (username: use your posix
> username, password: ignored, auth: simple)

Hey tmary - is it possible to use the "socorro" username so it's not tied to any of our accounts ^
Flags: needinfo?(tmeyarivan)
Comment 16•10 years ago
(In reply to Robert Helmer [:rhelmer] from comment #15)
> Hey tmary - is it possible to use the "socorro" username so it's not tied to
> any of our accounts ^

Yes

--
Flags: needinfo?(tmeyarivan)
Assignee
Comment 17•10 years ago
(In reply to T [:tmary] Meyarivan from comment #16)
> (In reply to Robert Helmer [:rhelmer] from comment #15)
> > Hey tmary - is it possible to use the "socorro" username so it's not tied
> > to any of our accounts ^
>
> Yes

Thanks, this works!
Assignee
Comment 18•10 years ago
Hey Anurag (or tmary), does it matter what time we initiate this query? Could I start it as early as e.g. 1am UTC (or should it be Pacific)?
Flags: needinfo?(aphadke)
Reporter
Comment 19•10 years ago
Hey rob, the current ETL process starts @ 2:30am UTC and gets completed by 6am UTC. We can initiate the query by 7am UTC, is that too late for you?
Flags: needinfo?(aphadke)
Assignee
Comment 20•10 years ago
(In reply to Anurag Phadke[:aphadke@mozilla.com] from comment #19)
> Hey rob,
> the current ETL process starts @ 2:30am UTC and gets completed by 6am UTC.
> We can initiate the query by 7am UTC, is that too late for you?

I'll set it to start at 7am UTC.
Whiteboard: [config change]
Assignee
Updated•10 years ago
Whiteboard: [config change]
Assignee
Updated•10 years ago
Assignee
Comment 21•10 years ago
We're now using this on Socorro stage - I just dumped the schema and ensured that we're only using the new "raw_adi" table and not the "raw_adu" table anywhere, and a backfill produces reasonable numbers.

KaiRo - would you (or someone familiar with Socorro, e.g. crashkill) mind looking over crash-stats.allizom.org ADI for yesterday and today when it's ready? I'll email socorro-dev and crashkill too.
Flags: needinfo?(kairo)
Comment 22•10 years ago
Please make sure the raw_adu table is archived and dropped before closing this bug. If that's not something you can do, my team is happy to help.
Comment 23•10 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #22)
> Please make sure the raw_adu table is archived and dropped before closing
> this bug. If that's not something you can do, my team is happy to help.

That sounds to me like a low-priority followup to file separately, not something that should block this one and hinder the dependency from being marked fixed.
Assignee
Comment 24•10 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #22)
> Please make sure the raw_adu table is archived and dropped before closing
> this bug. If that's not something you can do, my team is happy to help.

We can do this in a migration - however, I was going to do it as a separate step, not blocking this bug. Do you just want to make sure we don't forget it? If there's a more specific reason to block this bug, please let me know.
Flags: needinfo?(kairo) → needinfo?(scabral)
Comment 25•10 years ago
Not blocking, just wanted to make sure it wasn't forgotten.
Flags: needinfo?(scabral)
Assignee
Comment 26•10 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #25)
> Not blocking, just wanted to make sure it wasn't forgotten.

Ok, great! Filed bug 1053956 to track this, assigned to me - I will do it right after we get the current bug finished out.
Assignee
Comment 27•10 years ago
All blockers down, testing on stage before we deploy this to prod.
Assignee
Comment 28•10 years ago
kairo and I looked at this on stage, and we are planning to ship this to prod tomorrow.

All ADI numbers were very close to our current source, except Linux Firefox on Nightly, which had a slightly larger delta than expected. It's still a small discrepancy and it appears to be in the source data, nothing we're filtering on the Socorro side - worth following up, but I don't think it needs to block (please let me know ASAP if you disagree).

The old ADI worked by filtering on the Metrics side (in Vertica IIRC), which was then pushed to the raw_adu table in Socorro - does anyone know how I can get access to whatever was doing the log parsing, to make sure we're treating Linux the same way? Thanks!
Flags: needinfo?(tmeyarivan)
Flags: needinfo?(scabral)
Flags: needinfo?(kairo)
Assignee
Comment 29•10 years ago
This is now in production; since we fetch ADI and build ADI-dependent reports once per day, it'll be active starting tomorrow.

I'll file followup bugs for the issues mentioned in comment 28.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(tmeyarivan)
Flags: needinfo?(scabral)
Flags: needinfo?(kairo)
Resolution: --- → FIXED
Assignee
Comment 30•10 years ago
(In reply to Robert Helmer [:rhelmer] from comment #29)
> This is now in production, since we fetch ADI and build ADI-dependent
> reports once per day it'll be active starting tomorrow.
>
> I'll file followup bugs for the issues mentioned in bug 28.

bug 1055695
Comment 31•10 years ago
:lonnen - I want to retire this Kettle script on 9/4 because I have WebDetails on retainer during September; that way they can provide warranty should anything go wrong after we actually turn down the cron running it. I want your sign-off that we can do that. Will you monitor over the next week and provide sign-off by 8/29? Or let me know what issues still exist so that we can close. Thx!
Flags: needinfo?(chris.lonnen)
Comment 32•10 years ago
are you referring to the kettle scripts that pushed the old adu source into our postgres? If so, we are good to go.
Flags: needinfo?(chris.lonnen) → needinfo?(srich)
Comment 33•10 years ago
(In reply to Chris Lonnen :lonnen from comment #32)
> are you referring to the kettle scripts that pushed the old adu source into
> our postgres? If so, we are good to go.

Chris - yes. Okay, I'll have :rfradinho remove them. Thanks for the reply.
Flags: needinfo?(srich)