Closed Bug 1255184 Opened 8 years ago Closed 8 years ago

Start consuming Test Pilot logs

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: clouserw, Assigned: relud)

References

Details

(Whiteboard: [SvcOps])

Test Pilot lives!  We have a production site at testpilot.firefox.com.  I'd like to start collecting the logs from testpilot and loading them into Redshift (the same Redshift cluster as bug 1255182).  I haven't done this before, so please let me know anything I'm missing below.

I'm planning on using the standard JSON logging format for this, with some custom fields.  A complete example would be (I think):

> {
>   "Timestamp": 1457462527,
>   "Type": "testpilot.test-install",
>   "Logger": "TestPilotWeb",
>   "Hostname": "testpilot.servers.s3",
>   "EnvVersion": "2.0",  // We're conforming to the Fields spec
>   "Severity": 0,
>   "Pid": 1000,
>   "Fields":{
>     "agent": "User Agent String",
>     "path": "/tests/",
>     "method": "GET",
>     "code": 200,
>     "errno": 0,
>     "lang": "en-US",
>     "uid": 1000,  // This is the Test Pilot UID
>     "service": "universal_search@mozilla",  // If the ping comes from a test within the program, the identifier goes here
>     "context": "",
>     "msg": "New Test Installed: Universal Search",
>     "remoteAddressChain": ["1.2.3.4", "4.3.2.1"],
>     "rid": "0b65c6fb-26d5-4367-be29-b3094f0c69d5",
>     "t": 256,
> 
>     "feature_switches": "",
>     "campaign": ""
>   }
> }

Is that the definition you need?  Do you need more?  I'm hoping to analyze and pull the data out of Redshift.

Bug 1255182 is a complement to this bug.

Full metrics plan at https://github.com/mozilla/testpilot/blob/master/docs/README-METRICS.md
See Also: → 1255182
(Oops, disregard my last comment please. It was intended for bug 1255135)
We'll plan the sprint tomorrow and try to make this a P1.
Assignee: nobody → whd
Component: Operations: Metrics/Monitoring → Metrics: Pipeline
Blocks: 1257690
No longer depends on: 1257690
Points: --- → 2
Priority: -- → P1
Is there any information missing from this bug?  Anything I need to do?  Thanks.
(In reply to Wil Clouser [:clouserw] from comment #4)
> Is there any information missing from this bug?  Anything I need to do? 
> Thanks.

I don't think so. It's marked as P1 (i.e. highest priority, to be completed during the current sprint) and assigned to Wes; I know he's got this on his radar, so it should be getting attention Real Soon Now®. Unfortunately ops is our busiest team right now, so sometimes it takes a bit. Thanks for your patience. :)
I talked to :mostlygeek yesterday and apparently Test Pilot is not actually logging anything yet. I am still setting up the generic log routing to S3/ES for this week, but specifically ending up in Redshift requires more of a schema/detail than is currently available.
I was thinking we can consider this bug to cover consuming the logs so they end up in S3, and open another one for getting the log data into Redshift. I can work w/ them to help make that happen.
FWIW, Test Pilot should be ready [1] to start logging request.summary JSON to the console after our next deployment. (Not yet scheduled, but we could probably tag a release & do it whenever.)

The schema should be what's specified in the Dockerflow repo [2].

I've never set up a project to log this way before, so it's possible it won't do the right thing on the first try. It could help to get some eyes on the logging middleware [1], but I think a few folks have already given the thumbs up to the log formatter [3].

[1] https://github.com/mozilla/testpilot/commit/aa7f9c9cc331f9798017591bd52ba48f8a02d9ae
[2] https://github.com/mozilla-services/Dockerflow/blob/master/docs/mozlog.md
[3] https://github.com/mozilla/testpilot/blob/master/testpilot/base/logging.py
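
For illustration only (this is not the code in [1] or [3]): a minimal sketch of what a mozlog-style JSON formatter can look like using just the Python standard library. The class name, the severity mapping, and the "fields" extra key are assumptions; the top-level keys follow the example earlier in this bug and should be checked against [2].

# Illustrative sketch only -- not the actual testpilot/base/logging.py formatter.
# Emits one mozlog-style JSON object per line, matching the example fields above.
import json
import logging
import os
import socket
import sys


class MozLogFormatter(logging.Formatter):
    # Map Python logging levels onto syslog-style mozlog severities (assumed mapping).
    LEVEL_MAP = {
        logging.CRITICAL: 2,
        logging.ERROR: 3,
        logging.WARNING: 4,
        logging.INFO: 6,
        logging.DEBUG: 7,
    }

    def __init__(self, logger_name="TestPilotWeb"):
        super().__init__()
        self.logger_name = logger_name
        self.hostname = socket.gethostname()

    def format(self, record):
        return json.dumps({
            # Seconds since epoch, matching the example above; the mozlog spec
            # may call for nanoseconds, so check [2] before shipping.
            "Timestamp": int(record.created),
            "Type": record.name,
            "Logger": self.logger_name,
            "Hostname": self.hostname,
            "EnvVersion": "2.0",
            "Severity": self.LEVEL_MAP.get(record.levelno, 6),
            "Pid": os.getpid(),
            # Anything passed via extra={"fields": {...}} lands in Fields.
            "Fields": dict({"msg": record.getMessage()},
                           **getattr(record, "fields", {})),
        })


if __name__ == "__main__":
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(MozLogFormatter())
    log = logging.getLogger("testpilot.test-install")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("New Test Installed: Universal Search",
             extra={"fields": {"path": "/tests/", "method": "GET", "code": 200}})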
Deployment bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1260783 .  I don't think that blocks this.  

@wesley -- can you be more specific about what you need when you say "more of a schema/detail"?  Even pointing to an example would help as I haven't done this before.  Thanks.
The deployment bug doesn't necessarily block work being done on this, but it *is* hard to verify that the logs are being processed correctly before there are any logs being generated to process.

Here's an example of what a schema to load a log message structure into Redshift looks like:

https://github.com/rafrombrc/push_derived/blob/master/hindsight/hs_run/output/push_endp_redshift.lua
Bug 1260783 is closed.

What else is blocking this?
Depends on: 1260783
Flags: needinfo?(whd)
whd is currently blocked by ops-related tasks; we will re-prioritize once he comes back up for air.
Priority: P1 → P2
Whiteboard: [SvcOps]
Assignee: whd → dthorn
Priority: P2 → P1
Logs are currently being shipped to S3. I'm unclear on what the plan is for converting them into a Redshift COPY-able format.
Flags: needinfo?(whd)
:clouserw I'm going to write something to read the logs from S3 and put them in Redshift. I need to know: what information do you want in Redshift, and what part of the logs will I find it in?
Flags: needinfo?(wclouser)
Here's a schema (which I'll also make a PR for and put in the official docs):


local schema = {
--   column name                   field type   length  attributes   field name
    {"timestamp",                  "TIMESTAMP", nil,    "SORTKEY",   "Timestamp"},
    {"uuid",                       "VARCHAR",   36,      nil,         get_uuid},
    {"type",                       "VARCHAR",   255,     nil,         "type"},
    {"logger",                     "VARCHAR",   255,     nil,         "logger"},
    {"Hostname",                   "VARCHAR",   255,     nil,         "Hostname"},
    {"Severity",                   "INTEGER",   nil,     nil,         "Severity"},
    {"agent",                      "VARCHAR",   45,      nil,         "Fields[agent]"},
    {"path",                       "VARCHAR",   56,      nil,         "Fields[path]"},
    {"method",                     "VARCHAR",   200,     nil,         "Fields[method]"},
    {"code",                       "VARCHAR",   255,     nil,         "Fields[code]"},
    {"errno",                      "VARCHAR",   255,     nil,         "Fields[errno]"},
    {"lang",                       "VARCHAR",   36,      nil,         "Fields[lang]"},
    {"uid",                        "INTEGER",   nil,     nil,         "Fields[uid]"},
    {"service",                    "VARCHAR",   255,     nil,         "Fields[service]"},
    {"context",                    "VARCHAR",   255,     nil,         "Fields[context]"},
    {"msg",                        "VARCHAR",   1000,    nil,         "Fields[msg]"},
    {"remoteAddressChain",         "VARCHAR",   255,     nil,         "Fields[remoteAddressChain]"},
    {"rid",                        "VARCHAR",   255,     nil,         "Fields[rid]"},
    {"t",                          "VARCHAR",   36,      nil,         "Fields[t]"},
    {"feature_switches",           "VARCHAR",   255,     nil,         "Fields[feature_switches]"},
    {"campaign",                   "VARCHAR",   255,     nil,         "Fields[campaign]"}
}
Flags: needinfo?(wclouser)
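For illustration, a rough sketch of the CREATE TABLE that schema implies plus the Redshift COPY that would load the S3 logs. This is not the production hindsight/Lua output linked above; the table name, bucket, IAM role, jsonpaths file, and connection details are all placeholders.

# Illustrative sketch only -- the real loader is the hindsight/Lua output above.
# Builds a CREATE TABLE from the schema tuples and COPYs newline-delimited
# mozlog JSON from S3 into Redshift. All names and credentials are placeholders.
import psycopg2

# (column name, field type, length, attributes) -- first few rows of the schema
# above; the remaining Fields[...] columns follow the same pattern.
SCHEMA = [
    ("timestamp", "TIMESTAMP", None, "SORTKEY"),
    ("uuid",      "VARCHAR",   36,   None),
    ("type",      "VARCHAR",   255,  None),
    ("agent",     "VARCHAR",   45,   None),
    ("path",      "VARCHAR",   56,   None),
    ("msg",       "VARCHAR",   1000, None),
]

def create_table_sql(table):
    cols = []
    for name, ftype, length, attrs in SCHEMA:
        col = '"%s" %s' % (name, ftype)
        if length:
            col += "(%d)" % length
        if attrs:
            col += " " + attrs
        cols.append(col)
    return "CREATE TABLE IF NOT EXISTS %s (%s);" % (table, ", ".join(cols))

# The jsonpaths file maps nested mozlog fields (e.g. $.Fields.agent) onto the
# flat columns; 'epochsecs' assumes Timestamp is seconds since the epoch.
COPY_SQL = """
COPY testpilot_server
FROM 's3://example-log-bucket/testpilot/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
JSON 's3://example-log-bucket/testpilot_jsonpaths.json'
TIMEFORMAT 'epochsecs'
GZIP;
"""

if __name__ == "__main__":
    conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                            port=5439, dbname="logs",
                            user="loader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(create_table_sql("testpilot_server"))
        cur.execute(COPY_SQL)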
I've duplicated rmiller's autopush/webpush setup for Test Pilot. It can be viewed in sql.telemetry.mozilla.org with the data source Testpilot and tables named testpilot_server_YYYYMMDD.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1270961
Product: Cloud Services → Cloud Services Graveyard