Closed Bug 1246701 Opened 8 years ago Closed 7 years ago

Prepare download-stats.mozilla.org for log ingestion

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Unassigned)

Details

(Whiteboard: [SvcOps][fce-active-legacy])

Stub installer pings are simple HTTP GETs to download-stats.mozilla.org. This bug tracks work that Wes needs to do to prepare for ingesting those logs into heka. This may mean moving that host into cloud services operations or something else.
Please update with blocker tickets.
Priority: -- → P2
Points: --- → 3
Priority: P2 → P1
The code for loading redshift now lives at https://github.com/whd/dsmo_load (based on https://github.com/rafrombrc/push_derived); the non-boilerplate files are:

https://github.com/whd/dsmo_load/blob/master/hindsight/hs_run/output/dsmo_redshift.lua
https://github.com/whd/dsmo_load/blob/master/heka/usr/share/heka/lua_filters/nginx_redshift.lua

I'm going with a hybrid heka+hindsight approach (heka tcpoutput + hindsight tcpinput) because our standard provisioning logic is all heka-based and we don't have all the necessary pieces implemented in hindsight yet.

The steps remaining, roughly, are:

Provisioning logic (puppet+CFN, already a WIP)
Test provisioning in stage
Acquire SSL cert for download-stats from IT
Provision prod, change DNS record from A to CNAME and point at cloud services environment
Set up redshift access for :mhowell, :bsmedberg and others

I anticipate cutting over DNS early next week. Fortunately(?) there is really nothing to cut over from right now, because per :mpressman (and confirmed empirically) d-s.m.o is currently being /dev/null'ed.
I cut this over on Monday, and things appear to be more or less working. I've hooked it into redash instead of setting up VPN access; see https://sql.telemetry.mozilla.org/queries/78 for an example.
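
For instance, a trivial per-version count now works against a daily table (a sketch only; the exact table and column names are whatever the loader created):

SELECT version, COUNT(*) AS pings
FROM download_stats_20160322
GROUP BY version
ORDER BY pings DESC;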

The data from 20160322 had some bugs related to processing the error codes (https://github.com/whd/dsmo_load/commit/0f8cf7b1aa2959ca6e4c07f852f46a1987977f0b), but we can backfill this since we keep the "raw" data in an s3 bucket. I've elected not to fix the data yet because the code in dsmo_load hasn't been reviewed and possibly contains other errors, which should be fixed before I do a proper backfill.

:mhowell can you take a look and see if the data showing up in redshift looks correct?

As an aside, there seem to be a non-zero number of "v5" stub installer pings that the parsing logic is dropping. We can add support for this format if it is desirable.
(In reply to Wesley Dawson [:whd] from comment #3)
> :mhowell can you take a look and see if the data showing up in redshift
> looks correct?

Yeah, I took a quick peek on redash and the data looks good. The only confusing thing I immediately see is a lot of rows (~5%) with version and build_id unset; zero is normal, but empty really should not happen. I don't know what to make of that at the moment; it's as likely to be a bug in the installer as anything else.

> As an aside, there seem to be a non-zero number of "v5" stub installer pings
> that the parsing logic is dropping. We can add support for this format if it
> is desirable.

I don't think that will be necessary; it looks like v6 was introduced about Firefox 30, so those pings are for versions older than that.
(In reply to Matt Howell [:mhowell] from comment #4)
> (In reply to Wesley Dawson [:whd] from comment #3)
> > As an aside, there seem to be a non-zero number of "v5" stub installer pings
> > that the parsing logic is dropping. We can add support for this format if it
> > is desirable.
> 
> I don't think that will be necessary; it looks like v6 was introduced about
> Firefox 30, so those pings are for versions older than that.

Slight adjustment to this: we're fine having the parser drop v5 pings, but we do need to have a count somewhere of dropped pings so that we can tell when we break something. Can that be made available somehow?
Flags: needinfo?(whd)
One option is to throw URLs that fail to parse into a different table with a schema like (timestamp, path, reason) which could then be counted. Does that seem reasonable?
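
Concretely, I'm imagining something like this (a sketch only; the table name and column types are hypothetical):

CREATE TABLE download_stats_errors (
    "timestamp" TIMESTAMP,      -- when the ping was received
    path        VARCHAR(4096),  -- the raw request path that failed to parse
    reason      VARCHAR(256)    -- e.g. 'unsupported version (v5)', 'bad structure'
);

Dropped pings could then be counted per day with something like:

SELECT TRUNC("timestamp") AS day, reason, COUNT(*) AS dropped
FROM download_stats_errors
GROUP BY 1, 2
ORDER BY 1, 2;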
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #6)
> One option is to throw URLs that fail to parse into a different table with a
> schema like (timestamp, path, reason) which could then be counted. Does that
> seem reasonable?

Yes, that sounds like it would work well.
Does it make sense to close this and open a new bug for the dropped pings?
Flags: needinfo?(mhowell)
I don't think so, since it's part of the same data set? But I don't feel strongly either way; if it would be easier to deal with as a separate bug, go for it.

There's also one more thing I need that might or might not merit a separate bug: a download_stats view that unions all the download_stats_{date} tables. I don't think I have the permissions to create that myself.
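
To illustrate what I mean (a sketch only; the real list of daily tables keeps changing, which is exactly the problem):

CREATE VIEW download_stats AS
    SELECT * FROM download_stats_20160322
    UNION ALL
    SELECT * FROM download_stats_20160323
    -- ... one UNION ALL branch per download_stats_{date} table; the view
    -- would need to be recreated whenever a new daily table appears
;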
Flags: needinfo?(mhowell)
Due to e10s work and ops-related fires, work on this bug is going to be stalled. Moving to P2 until whd comes up for air.
Priority: P1 → P2
Assignee: whd → bimsland
Points: 3 → 1
Priority: P2 → P1
I will be taking over work on this.
Whiteboard: [SvcOps]
(In reply to Blake Imsland [:robotblake] from comment #11)
> I will be taking over work on this.

Hi Blake, sorry for taking a few days to get back to this. The one thing I need right now on my end is the view over the download_stats_* tables discussed above; I don't know a good way to write queries against the changing list of tables otherwise. Following that would be the table of failed parses (comment 6). Do you have a sense for when you might be able to work on those?
(In reply to Matt Howell [:mhowell] from comment #12)
> (In reply to Blake Imsland [:robotblake] from comment #11)
> > I will be taking over work on this.
> 
> Hi Blake, sorry for taking a few days to get back to this. The one thing I
> need right now on my end is the view over the download_stats_* tables
> discussed above; I don't know a good way to write queries against the
> changing list of tables otherwise. Following that would be the table of
> failed parses (comment 6). Do you have a sense for when you might be able to
> work on those?

I've done a bit of work on those already; they will be among my priorities over the next week.
FYI, a change to the stub ping URL format recently landed on Nightly.
Bug 1261140 bumped the version field to v7 and added a new field (a path component) at the end which contains attribution data. The format of that data isn't 100% settled, but this process doesn't need to parse anything in there, at least for now; just treat it as an opaque URL-encoded string.
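
On the redshift side I'd assume that just means one more opaque column, something like this (a sketch; the table and column names are hypothetical):

ALTER TABLE download_stats_20160401 ADD COLUMN attribution VARCHAR(1024);
-- store the trailing path component verbatim, still URL-encoded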
Flags: needinfo?(bimsland)
:mhowell, due to some changes in how we're doing this sort of logging and the re:dash work, this fell off my radar. The URL format change shouldn't be too difficult to add, and I've got an idea for the download_stats_* view that is crude but should work for now.

Since this has sort of become a metabug, I just want to be clear about what remaining work it covers:

* Create and keep up to date a download_stats view that covers download_stats_* tables.
* Add a field for v7 pings to the redshift table, and parsing code to dsmo_load to handle the new field.
* Deploy error parsing changes into production (v5 / bad structure).
* Backfill v7 pings and pings that failed to parse into their respective tables.
* Open support request with AWS regarding sporadic redshift batch insert errors.

Let me know if you see anything missing from that list; they're in roughly the order it sounds like they should be prioritised. With that said, I'm going to drop the priority on this metabug and spawn new bugs for each of those (as blockers) so I can get a better handle on the remaining work.
Flags: needinfo?(bimsland) → needinfo?(mhowell)
Priority: P1 → P2
Yes, you've captured all the remaining work that I'm aware of, and the ordering/prioritization makes sense to me. Thanks very much for the update and for getting all that information together.
Flags: needinfo?(mhowell)
(In reply to Blake Imsland [:robotblake] from comment #15)
> With that said, I'm going to drop the priority on this meta bug and spawn new bugs for
> each of those (as blockers) so I can get a better handle on the remaining
> work.
:robotblake, were these created? I'd like to begin tracking this work in our stub attribution check-ins.
Flags: needinfo?(bimsland)
After talking to :ckprice this is no longer a priority for this quarter.
Assignee: bimsland → nobody
Priority: P2 → P3
Depends on: 1290794
Depends on: 1290795
Depends on: 1290798
Depends on: 1290800
Depends on: 1290803
Whiteboard: [SvcOps] → [SvcOps][fce-active]
Deployed the changes needed to store erroneous pings along with handling of v7 pings. The backfill (https://bugzilla.mozilla.org/show_bug.cgi?id=1290800) is next up.
Hey BDS - once Blake has finished the backfill (https://bugzilla.mozilla.org/show_bug.cgi?id=1290800), are we able to close this issue?
Flags: needinfo?(benjamin)
->mhowell
Flags: needinfo?(benjamin) → needinfo?(mhowell)
Looks to me like the answer is yes; there's still work to do on the larger attribution project that spawned this bug, but we seem to be in business with this log ingestion work.

Thanks to robotblake and everyone else involved!
Flags: needinfo?(mhowell)
Depends on: 1319871
(In reply to Matt Howell [:mhowell] from comment #23)
> we seem to be in business with this log ingestion work.

Per Matt's reply, I'm closing this out. If there is more work to do here, please reopen. Thanks!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [SvcOps][fce-active] → [SvcOps][fce-active-legacy]
Product: Cloud Services → Cloud Services Graveyard