Closed Bug 1245490 Opened 8 years ago Closed 8 years ago

Create crash summary dataset

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)


Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1316860

People

(Reporter: rvitillo, Unassigned)

References

Details

(Whiteboard: loasis)

We should add crash pings to the longitudinal dataset, which currently contains only main pings.
Depends on: 1246137
Whiteboard: loasis
Depends on: 1247571
Assignee: nobody → rvitillo
Component: Metrics: Product Metrics → Metrics: Pipeline
Priority: -- → P2
Our dataset is currently sorted by subsessionStartDate and profileSubsessionCounter, which are not present in crash pings. Sam, Georg, do you have any thoughts on the best way to merge crash and main pings in the same dataset?
Flags: needinfo?(spenrose)
Flags: needinfo?(gfritzsche)
I have a strong, maybe unhelpful opinion: we should not automatically try to fix this problem as it is currently facing us, because to a significant extent it is an artifact of technical debt. Instead we should:

1) Ask the measurement team to look at time handling in pings as a project. Why do we have so many time fields and pseudo-time fields (counters)? Can we consolidate them? Can we at least use a single representation of time? When we looked at profileSubsessionCounters this summer we found that they appeared to be less reliable than the dimensions they were intended to verify. Should we still be using them at all?

2) Ask how crashes (not crash pings) should be tied to longitudinal datasets. Even if the current crash pings are the right long-term solutions, how do they map to subsessions? Does a crash always end a session? If so, then we should consider a datastore which makes that connection clear. Including Brendan for his opinion here.

3) Generally speaking, every time we uncover a difficult issue with data storage I would like to see us ask "should this problem be solved at the client"?

4) In the short term I assume we want to use one of /creationDate (second-level resolution) or /payload/crashDate (day level).
Flags: needinfo?(spenrose) → needinfo?(bcolloran)
(In reply to Sam Penrose from comment #2)

> 1) Ask the measurement team to look at time handling in pings as a project.
> Why do we have so many time fields and pseudo-time fields (counters)? Can we
> consolidate them? Can we at least use a single representation of time?


> When
> we looked at profileSubsessionCounters this summer we found that they
> appeared to be less reliable than the dimensions they were intended to
> verify. Should we still be using them at all?

Hmmm, that was not my recollection. I believe that profileSubsessionCounter still seems like the most reliable and natural way to order pings in the standard case (non-branching histories). What to do in the weird cases is still an open question.
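For illustration, a minimal Python sketch of that ordering in the standard case (the ping records are made up; the field names are assumptions based on the main-ping schema discussed here):

```python
# Hypothetical sketch: order a client's main pings by
# (profileSubsessionCounter, subsessionStartDate), the sort keys
# named above. The ping records are made up for illustration.
pings = [
    {"profileSubsessionCounter": 3, "subsessionStartDate": "2016-02-01"},
    {"profileSubsessionCounter": 1, "subsessionStartDate": "2016-01-28"},
    {"profileSubsessionCounter": 2, "subsessionStartDate": "2016-01-30"},
]

ordered = sorted(
    pings,
    key=lambda p: (p["profileSubsessionCounter"], p["subsessionStartDate"]),
)
counters = [p["profileSubsessionCounter"] for p in ordered]
```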

> 2) Ask how crashes (not crash pings) should be tied to longitudinal
> datasets. Even if the current crash pings are the right long-term solutions,
> how do they map to subsessions? Does a crash always end a session? If so,
> then we should consider a datastore which makes that connection clear.
> Including Brendan for his opinion here.

I believe there is an open bug ATM regarding adding the subsessionId of the subsession that generated a crash to each crash ping. That should be satisfactory for tying the crashes to the longitudinal data (I fear I am missing the point of your question?).

I don't know what constraints Parquet imposes upon us, or what we need to do to represent the data in a way that is reasonably idiomatic to Parquet. But to zoom out to what would be workable in general for a non-lossy, client-oriented data set: the main thing that is really important is that it's possible to get all the data we have for each client without any risk of missing anything. From our perspective, though, the different data items need not all conform to one master template. What I mean is: if pings of different types don't naturally sort together, it would be acceptable to stash them in different lists that are sorted differently. If we were building a json structure, it would be fine to have--
{mainPings: [list of main pings sorted by (profileSubsessionCounter,startDate)],
 crashPings: [list of crash pings sorted by whatever],
 deletionPings: ...,
 etcPings: ...
}
--i.e. for us, it is not a requirement to find a way to fit everything into one list of pings (and have one sensible sorting scheme for that one list); it is just a requirement that everything for a client be readily available in one place, and that we can easily iterate over clients. (In fact, having everything in one list of pings would likely make the job of doing analysis harder-- it would require more filtering and digging around, and I suspect there would be several reinventions of the wheel for doing that filtering).
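To make that concrete, a hypothetical Python sketch of the per-client structure sketched above, with each ping type kept in its own list under its own sort order (field names and values are illustrative, not the real ping schema):

```python
from collections import defaultdict

# Hypothetical sketch of the per-client structure described above:
# pings of different types live in separate lists, each with its own
# natural sort key. Field names and values are made up for illustration.
raw_pings = [
    {"clientId": "c1", "type": "main", "profileSubsessionCounter": 2},
    {"clientId": "c1", "type": "crash", "creationDate": "2016-02-02"},
    {"clientId": "c1", "type": "main", "profileSubsessionCounter": 1},
]

clients = defaultdict(lambda: {"mainPings": [], "crashPings": []})
for ping in raw_pings:
    bucket = "mainPings" if ping["type"] == "main" else "crashPings"
    clients[ping["clientId"]][bucket].append(ping)

# Each list is sorted independently, as suggested above.
for record in clients.values():
    record["mainPings"].sort(key=lambda p: p["profileSubsessionCounter"])
    record["crashPings"].sort(key=lambda p: p["creationDate"])
```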

> 3) Generally speaking, every time we uncover a difficult issue with data
> storage I would like to see us ask "should this problem be solved at the
> client"?

I agree that this is preferred when possible (though of course that may be the exception).

> 4) In the short term I assume we want to use one of /creationDate
> (second-level resolution) or /payload/crashDate (day level).

In the course of looking at some other stuff last week, I spent some time looking at the delta between creationDate (which iiuc is a server timestamp) and the within-ping startDate (I was looking at main pings), and there were a pretty good number of pings that had a fairly large delta. Bona fide date skew aside (which also appears to be a problem), I think that if we rely on server timestamps for sorting we'll end up with pings out of order fairly regularly.
Flags: needinfo?(bcolloran)
(In reply to brendan c from comment #3)
> (In reply to Sam Penrose from comment #2)
> 
> > 1) Ask the measurement team to look at time handling in pings as a project.
> > Why do we have so many time fields and pseudo-time fields (counters)? Can we
> > consolidate them? Can we at least use a single representation of time?
> 
> 
> > When
> > we looked at profileSubsessionCounters this summer we found that they
> > appeared to be less reliable than the dimensions they were intended to
> > verify. Should we still be using them at all?
> 
> Hmmm, that was not my recollection. I believe that
> profileSubsessionCounter still seems like the most reliable and natural
> way to order pings in the standard case (non-branching histories). What
> to do in the weird cases is still an open question.

Here is the notebook holding the work in question: https://gist.github.com/SamPenrose/23b9733c9eb0ebad0e34

> > 2) Ask how crashes (not crash pings) should be tied to longitudinal
> > datasets. Even if the current crash pings are the right long-term solutions,
> > how do they map to subsessions? Does a crash always end a session? If so,
> > then we should consider a datastore which makes that connection clear.
> > Including Brendan for his opinion here.
> 
> I believe there is an open bug ATM regarding adding the subsessionId of the
> subsession that generated a crash to each crash ping. That should be
> satisfactory for tying the crashes to the longitudinal data (I fear I am
> missing the point of your question?).

Great.

> I don't know what constraints Parquet imposes upon us or what we need to do
> to represent the data in some way that is reasonably idiomatic to Parquet,
> but just to zoom out to what would be workable in general for a non-lossy
> client-oriented data set: the main thing that is really important is that it's
> possible to get all the data we have for each client without any risk of
> missing anything, but from our perspective, the different data items need
> not necessarily all conform to one master template. What I mean is that: if
> pings of different type don't naturally sort together, it would be
> acceptable to stash them in different lists that were sorted differently. If
> we were building a json structure, it would be fine to have--
> {mainPings: [list of main pings sorted by
> (profileSubsessionCounter,startDate)],
>  crashPings: [list of crash pings sorted by whatever],
>  deletionPings: ...,
>  etcPings: ...
> }
> --i.e. for us, it is not a requirement to find a way to fit everything into
> one list of pings (and have one sensible sorting scheme for that one list);
> it is just a requirement that everything for a client be readily available
> in one place, and that we can easily iterate over clients. (In fact, having
> everything in one list of pings would likely make the job of doing analysis
> harder-- it would require more filtering and digging around, and I suspect
> there would be several reinventions of the wheel for doing that filtering).

You have consistently advocated that position, for good reasons. A lot of work is being done on the assumption that others in the org need higher-level abstractions, and there has been recent work demonstrating that without them analyses can go wrong.

> > 3) Generally speaking, every time we uncover a difficult issue with data
> > storage I would like to see us ask "should this problem be solved at the
> > client"?
> 
> I agree that this is preferred when possible (though of course that may be
> the exception).
> 
> > 4) In the short term I assume we want to use one of /creationDate
> > (second-level resolution) or /payload/crashDate (day level).
> 
> In the course of looking at some other stuff last week, I spent some time
> looking at the delta between creationDate (which iiuc is a server timestamp)
> and the within-ping startDate (I was looking at main pings), and there were
> a pretty good number of pings that had a fairly large delta. Bona fide date
> skew aside (which also appears to be a problem) I think that if we rely on
> server timestamps for sorting we'll end up with pings out of order fairly
> regularly.

Thanks for the catch -- I keep forgetting that "creationDate" is a misnomer.
Given the above remarks, we should proceed with creating a separate Parquet dataset that contains only our crash pings.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #5)
> Given the above remarks, we should proceed creating a separate Parquet
> dataset that contains only our crash pings.

We would like to be able to work with data objects that include both the main and crash pings for a client. If my comments in #2 are blocking integration of the crash pings, then we can table them for now and whatever sorting method you settle on is OK. If we need to have a second Parquet store just for crash pings, that's OK to the extent that analysts can open both the main and crash pings and unify them into a single object grouped by clientId.
(In reply to Sam Penrose from comment #6)

> If we need to have a second Parquet
> store just for crash pings, that's OK to the extent that analysts can open
> both the main and crash pings and unify them into a single object grouped by
> clientId.

Given that crash pings are smaller and less frequent it should be possible to join the two datasets efficiently.
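A hedged Python sketch of what that client-level unification could look like, assuming analysts have already read both datasets into plain records (the records and field names here are illustrative; in practice both sides would come from the Parquet stores):

```python
from collections import defaultdict

# Hypothetical sketch of the join described above: main and crash pings
# are stored separately, then unified into one object per clientId.
# The records are made up for illustration.
main_pings = [{"clientId": "c1", "sessionId": "s1"}]
crash_pings = [
    {"clientId": "c1", "crashDate": "2016-02-02"},
    {"clientId": "c2", "crashDate": "2016-02-03"},
]

merged = defaultdict(lambda: {"main": [], "crashes": []})
for ping in main_pings:
    merged[ping["clientId"]]["main"].append(ping)
for ping in crash_pings:
    merged[ping["clientId"]]["crashes"].append(ping)
```

Because a client appears in the result even if it only has crash pings (or only main pings), nothing is lost by keeping the two stores separate.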
Blocks: 1251580
It looks like this has no specific question for me now.
For the record, adding the sessionId to crash pings is bug 1187270.
Flags: needinfo?(gfritzsche)
Depends on: 1187270
Depends on: 1252844
Points: --- → 2
No longer blocks: 1242039
(In reply to brendan c from comment #3)
> In the course of looking at some other stuff last week, I spent some time
> looking at the delta between creationDate (which iiuc is a server timestamp)

The creationDate field is *not* a server timestamp.  It is a client-assigned date, and represents when the ping was saved to disk before submission.  It is reasonable to think about it as something very close to "subsessionEndDate", though it may not be that exactly.

The server-assigned timestamp is available in data returned by 'get_pings' in the 'meta/Timestamp' field, and in the longitudinal dataset at day-granularity in the "submission_date" field. I'm not sure if the full-resolution value is available in that data set.
Hmm, that is even more concerning... there were a lot of big deltas (both positive and negative) between creationDate and startDate.
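As an illustration of the kind of check discussed above, a small Python sketch that measures the delta between a ping's creationDate and its subsessionStartDate (the timestamps are made-up ISO 8601 values; real pings would need parsing of their actual date formats):

```python
from datetime import datetime

# Hypothetical sketch: measure the gap between a ping's creationDate
# and its subsessionStartDate. Timestamps here are made-up ISO 8601
# values, not taken from real pings.
def delta_in_days(creation_date, start_date):
    created = datetime.fromisoformat(creation_date)
    started = datetime.fromisoformat(start_date)
    return (created - started).total_seconds() / 86400.0

# A positive delta means the ping was written well after the subsession
# started; large values either way suggest clock skew or delayed writes.
delta_days = delta_in_days("2016-02-05T12:00:00+00:00",
                           "2016-02-01T12:00:00+00:00")
```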
Whereas session duration and session date are very important for how we measure Firefox use, and
Whereas for both temporal measures we lack:
  1. Atomic metrics we trust.
  2. Consensus (either express or implicit) in how atomic measures should be composed to create semantic categories (i.e. use-on-date-X)
Be it resolved that:
We should first organize time values in UT pings around two or more atomic dimensions we trust*, and
Having such trusted dimensions, we should define one approved way to compose atomic dimensions into semantically meaningful temporal measures.

Can I get a second?

* "Trust, but verify" as someone once said.
Blocks: 1255755
No longer blocks: 1251580
Assignee: rvitillo → mreid
Assignee: mreid → nobody
Summary: Add crash pings to longitudinal dataset → Create crash summary dataset
I'm closing this bug in favor of bug 1316860.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Product: Cloud Services → Cloud Services Graveyard