Closed Bug 1213417 Opened 9 years ago Closed 9 years ago

Derived dataset request: basic metadata and crash info

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: trink)

References

Details

User Story

Derived from the unified telemetry "main" pings:

FILTERS:
environment.build.applicationName == "Firefox"
environment.update.channel in ("nightly", "aurora", "beta", "release")

INCLUDE VERBATIM:
clientId
environment.build.version
environment.build.buildId
environment.build.architecture
environment.update.channel
environment.os.name
environment.os.version
environment.os.servicePackMajor (may be missing/NULL)
environment.os.servicePackMinor (may be missing/NULL)
environment.settings.locale
environment.activeExperiment.id
environment.activeExperiment.branch
<IP-based-geocountry>
payload.info.reason
payload.info.timezoneOffset
payload.info.subsessionStartDate
payload.info.subsessionLength
rename "payload.histograms.SUBPROCESS_ABNORMAL_ABORT.plugin.sum" to "aborts-plugin"
rename "payload.histograms.SUBPROCESS_ABNORMAL_ABORT.content.sum" to "aborts-content"
rename "payload.histograms.SUBPROCESS_ABNORMAL_ABORT.gmplugin.sum" to "aborts-gmplugin"
rename "payload.histograms.SUBPROCESS_CRASHES_WITH_DUMP.plugin.sum" to "crashesdetected-plugin"
rename "payload.histograms.SUBPROCESS_CRASHES_WITH_DUMP.pluginhang.sum" to "hangsdetected-plugin"
rename "payload.histograms.SUBPROCESS_CRASHES_WITH_DUMP.content.sum" to "crashesdetected-content"
rename "payload.histograms.SUBPROCESS_CRASHES_WITH_DUMP.gmplugin.sum" to "crashesdetected-gmplugin"

For each of SUBPROCESS_CRASH_SUBMIT_ATTEMPT and SUBPROCESS_CRASH_SUBMIT_SUCCESS (below, X stands for ATTEMPT or SUCCESS, and N for the matching lowercase form):
rename "payload.histograms.SUBPROCESS_CRASH_SUBMIT_X.main.sum" to "crash-N-main"
rename "payload.histograms.SUBPROCESS_CRASH_SUBMIT_X.plugin.sum" to "crash-N-plugin"
rename "payload.histograms.SUBPROCESS_CRASH_SUBMIT_X.content.sum" to "crash-N-content"

CALCULATED FIELDS:

number of active addons:
  len(environment.addons.activeAddons)

most recent version of Flash:
  max([ [int(x) for x in plugin.version.split('.')] for plugin in environment.addons.activePlugins if plugin.name == 'Shockwave Flash' ])
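
As a sketch of the requested extraction, here is a minimal Python version, assuming the ping has already been parsed into a dict. The get_path helper is hypothetical, version strings are assumed to be dotted numerics, and (per the discussion below) the SUBPROCESS_* histograms actually live under payload.keyedHistograms:

  def get_path(obj, path, default=None):
      # Walk a dotted path, tolerating missing intermediate keys
      # (e.g. servicePackMajor may be absent entirely).
      for key in path.split('.'):
          if not isinstance(obj, dict) or key not in obj:
              return default
          obj = obj[key]
      return obj

  def keep(ping):
      # FILTERS from the user story above.
      return (get_path(ping, 'environment.build.applicationName') == 'Firefox' and
              get_path(ping, 'environment.update.channel') in
              ('nightly', 'aurora', 'beta', 'release'))

  def flash_version(ping):
      # Most recent Flash version; compare components numerically so
      # that e.g. 11.2 sorts above 9.0. Assumes dotted numeric versions.
      versions = [tuple(int(x) for x in p['version'].split('.'))
                  for p in get_path(ping, 'environment.addons.activePlugins', [])
                  if p.get('name') == 'Shockwave Flash' and p.get('version')]
      return '.'.join(map(str, max(versions))) if versions else None

  def to_row(ping):
      # Histograms live under payload.keyedHistograms (confirmed later
      # in this bug), not payload.histograms as originally written.
      kh = 'payload.keyedHistograms.'
      return {
          'clientid':      get_path(ping, 'clientId'),
          'channel':       get_path(ping, 'environment.update.channel'),
          'reason':        get_path(ping, 'payload.info.reason'),
          'aborts_plugin': get_path(ping, kh + 'SUBPROCESS_ABNORMAL_ABORT.plugin.sum', 0),
          'active_addons': len(get_path(ping, 'environment.addons.activeAddons', {}) or {}),
          'flash_version': flash_version(ping),
          # ...remaining verbatim and renamed fields follow the same pattern.
      }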
I would like a derived dataset as specified in the "user story" above, which I can then use either within Spark or a Heka filter to create some Redshift rollup tables for usage and crash metrics.
User Story: (updated)
:bsmedberg, are you taking this? Do you need us to do this for you in the next 2 weeks?
Flags: needinfo?(benjamin)
Priority: -- → P2
User Story: (updated)
Flags: needinfo?(benjamin)
I don't know when I'll be able to take this. I'd like to, but other things are still trumping it.
Trink will take this
Assignee: nobody → mtrinkala
Priority: P2 → P1
I assume you mean keyedHistograms for all the SUBPROCESS_*

Also, for the executive summary, plugin hangs are currently counted as
payload.keyedHistograms.SUBPROCESS_ABNORMAL_ABORT.plugin.sum. Is there a reason we are counting them differently here?
Flags: needinfo?(benjamin)
Yes, keyedHistograms.

When you say plugin hangs are currently counted as
payload.keyedHistograms.SUBPROCESS_ABNORMAL_ABORT.plugin.sum, that's not correct. That counts all plugin crashes (and probably some other aborts). ABNORMAL_ABORT can't tell the difference between a crash-abort and a hang-abort.
Flags: needinfo?(benjamin)
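To make the distinction concrete, a small hedged sketch reusing the parsed ping dict from the sketch above (field layout assumed from the paths quoted in this bug; keys may be missing in real pings):

  # ABNORMAL_ABORT.plugin counts every abnormal plugin exit, mixing
  # crash-aborts and hang-aborts, so it cannot be split after the fact.
  kh = ping['payload']['keyedHistograms']
  aborts_plugin          = kh['SUBPROCESS_ABNORMAL_ABORT']['plugin']['sum']
  crashesdetected_plugin = kh['SUBPROCESS_CRASHES_WITH_DUMP']['plugin']['sum']
  hangsdetected_plugin   = kh['SUBPROCESS_CRASHES_WITH_DUMP']['pluginhang']['sum']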
The current behaviour was reviewed and approved in the executive summary meeting with the metrics team. There was some discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1131834, but it is unclear why we went with SUBPROCESS_ABNORMAL_ABORT (I believe it was because we wanted the executive summary to include all plugin failures). Since the name is misleading, we should rename it or change the extraction so it only represents the hangs. I can change it; just file a bug with the desired modification.
I have been trolling the logs for PROCESS_CRASH_SUBMIT_ATTEMPT, since there are no SUBPROCESS versions defined in the code:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Histograms.json#8511

A definitive list of crash types cannot be gleaned from the code (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/crashes/CrashManager.jsm#1189) and I haven't seen anything but an empty object {} in the live data.  Some docs or sample data would be useful.
Flags: needinfo?(gfritzsche)
Just coded it to main, plugin, and content
Flags: needinfo?(gfritzsche)
Available for experimentation on the pipeline-redshift.dev.mozaws.net cluster (see :dthornton for creds). The following table is a test load of the Oct 4 data.

pipeline=# \d main_summary
                      Table "public.main_summary"
            Column            |            Type             | Modifiers 
------------------------------+-----------------------------+-----------
 timestamp                    | timestamp without time zone | 
 subsessiontimestamp          | timestamp without time zone | 
 clientid                     | character(36)               | 
 buildversion                 | character varying(32)       | 
 buildid                      | character(14)               | 
 buildarchitecture            | character varying(32)       | 
 channel                      | character varying(7)        | 
 os                           | character varying(7)        | 
 osversion                    | character varying(32)       | 
 ossericepackmajor            | character varying(32)       | 
 ossericepackminor            | character varying(32)       | 
 locale                       | character varying(32)       | 
 activeexperimentid           | character varying(32)       | 
 activeexperimentbranch       | character varying(32)       | 
 country                      | character varying(5)        | 
 reason                       | character varying(32)       | 
 subsessionlength             | integer                     | 
 timezoneoffset               | integer                     | 
 pluginhangs                  | integer                     | 
 aborts_plugin                | integer                     | 
 aborts_content               | integer                     | 
 aborts_gmplugin              | integer                     | 
 crashesdetected_plugin       | integer                     | 
 crashesdetected_content      | integer                     | 
 crashesdetected_gmplugin     | integer                     | 
 crash_submit_attempt_main    | integer                     | 
 crash_submit_attempt_content | integer                     | 
 crash_submit_success_main    | integer                     | 
 crash_submit_success_content | integer                     | 
 active_addons                | integer                     | 
 flash_version                | character varying(16)       | 

pipeline=# select * from main_summary limit 1;

        timestamp         | subsessiontimestamp |               clientid               | buildversion |    buildid     | buildarchitecture | channel |   os    | osversion | ossericepackmajor | ossericepackminor | locale | activeexperimentid | activeexperimentbranch | country |  reason  | subsessionlength | timezoneoffset | pluginhangs | aborts_plugin | aborts_content | aborts_gmplugin | crashesdetected_plugin | crashesdetected_content | crashesdetected_gmplugin | crash_submit_attempt_main | crash_submit_attempt_content | crash_submit_success_main | crash_submit_success_content | active_addons | flash_version 
--------------------------+---------------------+--------------------------------------+--------------+----------------+-------------------+---------+---------+-----------+-------------------+-------------------+--------+--------------------+------------------------+---------+----------+------------------+----------------+-------------+---------------+----------------+-----------------+------------------------+-------------------------+--------------------------+---------------------------+------------------------------+---------------------------+------------------------------+---------------+---------------
 2015-10-04 00:00:01.2746 | 2015-09-06 04:00:00 | f18cef1c-6e6c-40ea-af28-91617b273d83 | 42.0a2       | 20150818004007 | x86               | aurora  | Windows | 10.0      | 0                 | 0                 | en-US  |                    |                        | US      | shutdown |              879 |           -240 |           0 |             0 |              0 |               0 |                      0 |                       0 |                        0 |                         0 |                            0 |                         0 |                            0 |             4 | 
(1 row)
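As an illustration of the rollups this table is meant to feed, a hypothetical per-channel/per-day plugin-crash query (not from the bug; runnable through any Postgres-protocol client). The column named timestamp is quoted because TIMESTAMP is a reserved word:

  ROLLUP = """
  SELECT channel,
         DATE_TRUNC('day', "timestamp") AS day,
         COUNT(DISTINCT clientid)       AS clients,
         SUM(crashesdetected_plugin)    AS plugin_crashes
  FROM main_summary
  GROUP BY 1, 2
  ORDER BY 1, 2;
  """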
Excellent. Followup questions:

Is `timestamp` the date the ping was received?

Question about the sort key and distribution style for the Redshift data: I presume the sort key is either `timestamp` or `subsessiontimestamp`; is that correct? Is the distribution style KEYed on clientID?

Is it possible to query this same dataset in spark?
Flags: needinfo?(mtrinkala)
- timestamp is the date/time we received the data and serves as the sort key
- no indexes have been created since the primary use cases haven't been defined
- the database tables will be created as a sequence of time series tables (one per day) and hidden with a UNION ALL view
- Spark looks like it has some tools, but I am not a Spark user... deferring to others
Flags: needinfo?(mtrinkala)
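A hedged sketch of the daily-table-plus-view layout described above (table names, days, and connection details are hypothetical; psycopg2 assumed as the client):

  import psycopg2

  # One table per day, hidden behind a UNION ALL view named like the
  # original table. Names and credentials here are illustrative only.
  days = ['20151004', '20151005', '20151006']
  conn = psycopg2.connect(host='pipeline-redshift.dev.mozaws.net',
                          port=5439, dbname='pipeline',
                          user='...', password='...')
  cur = conn.cursor()
  view_sql = 'CREATE OR REPLACE VIEW main_summary AS\n' + \
             '\nUNION ALL\n'.join('SELECT * FROM main_summary_%s' % d
                                  for d in days)
  cur.execute(view_sql)
  conn.commit()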
(In reply to Mike Trinkala [:trink] from comment #11)
- redshift disables indexes so none will ever be added... sadness
Yeah, wasn't worried about the indexes, just about the sharding type. For some of the queries I want to run, sharding (distribution key) the database by clientID can improve query performance.
Depends on: 1219291
Tried this today. Here are my notes and requests, from most to least important:

A (critical): missing fields:

 crash_submit_attempt_plugin
 crash_submit_success_plugin

B (critical): Is `pluginhangs` the name of payload.histograms.SUBPROCESS_CRASHES_WITH_DUMP.pluginhang.sum?

C. (important):

The tables need to be distributed on the clientid key, instead of evenly. That will allow "active users over day/week/month" queries to run on distributed nodes instead of requiring a giant unified join (see the DDL sketch after this comment).

D. (minor)

Store subsessiontimestamp as a date instead of a timestamp (and rename to subsessiondate).

E. (trivial)

ossericepackmajor and ossericepackminor are misspelled, should be "servicepack"

Can you have at least changes A-C in place and data backfilled for the entire month of October by sometime early next week?

If the Redshift bits are too complex, I think I can deal with this just as well in Spark, which I think means S3 output instead of Redshift.
Flags: needinfo?(mtrinkala)
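For reference, a sketch of the Redshift DDL that changes A and C-E imply (column list abbreviated; as noted in the next comment, the shipped column names changed a bit):

  DDL = """
  CREATE TABLE main_summary (
      "timestamp"                 TIMESTAMP,   -- receipt time; sort key
      subsessiondate              DATE,        -- D: date instead of timestamp
      clientid                    CHAR(36),    -- C: distribution key
      osservicepackmajor          VARCHAR(32), -- E: spelling fixed
      osservicepackminor          VARCHAR(32),
      crash_submit_attempt_plugin INTEGER,     -- A: previously missing
      crash_submit_success_plugin INTEGER
      -- ...remaining columns as in the schema above...
  )
  DISTSTYLE KEY
  DISTKEY (clientid)
  SORTKEY ("timestamp");
  """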
A is easy, B yes, C already done in the latest version, D easy, E already fixed (the changes will land today).

Yeah, we will be able to fire up a cluster of machines and backfill October (FYI, the column names have changed a bit in the new version to make them consistent across all derived outputs).
Flags: needinfo?(mtrinkala)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> Is it possible to query this same dataset in spark?

It is. The data on S3 can be converted to Parquet files for fast access with Spark, see [1].

[1] https://mail.mozilla.org/pipermail/fhr-dev/2015-November/000660.html
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #16)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> > Is it possible to query this same dataset in spark?
> 
> It is. The data on S3 can be converted to Parquet files for fast access with
> Spark, see [1].
> 
> [1] https://mail.mozilla.org/pipermail/fhr-dev/2015-November/000660.html

This data set is not currently stored on S3 - it's in a Redshift database. It would be interesting to see if we can use something like [2] to load Redshift into dataframes, but it requires Spark 1.4+ (we're currently on 1.3 AIUI).
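For the Parquet route in [1], a hedged PySpark sketch (bucket and prefix are hypothetical; sqlContext.read needs Spark 1.4+, so a 1.3 cluster would use sqlContext.parquetFile instead):

  from pyspark import SparkContext
  from pyspark.sql import SQLContext

  sc = SparkContext()
  sqlContext = SQLContext(sc)

  # Hypothetical S3 location for a Parquet copy of this dataset.
  df = sqlContext.read.parquet('s3://telemetry-parquet/main_summary/v1/')
  df.registerTempTable('main_summary')
  sqlContext.sql("""
      SELECT channel, COUNT(DISTINCT clientid) AS clients
      FROM main_summary
      GROUP BY channel
  """).show()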
The full October data is now available in Redshift.
Blocks: 1223506
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard