Closed Bug 1254716 Opened 9 years ago Closed 9 years ago

Build search dataset based on "main summary" dataset

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: spenrose, Assigned: spenrose)

References

Details

Attachments

(2 files)

Create a mapping from UT main ping fields (input) to a flat set of output fields, and commit it to a common repository such as mozilla-pipeline-schemas. Example mapping (this is not the final syntax, which will probably be a superset of JSON-Schema):

$environment$partner$distributionId -> distributionId
$payload$info$subsessionStartDate -> subsessionStartDate
Here is a proposal for how the search stream will be derived. Please comment.

1) The fields to be extracted from the ping are loaded from a config file that is expected to grow occasionally.
2) The output is a flat row structure, except for SEARCH_COUNTS.
3) SEARCH_COUNTS are minimally processed and packed into a JSON string.
4) I don't currently have a semantic validation step (access point is legit, search tallies are > 0, etc.).

Step 3 is the tricky one. SEARCH_COUNTS packs the values we care about into key names, and we don't know how many we're going to see. Other approaches I considered:

i. Add keys searchEngine1, searchAccessPoint1, searchCount1, searchEngine2, ... up to some small number. I can check the distribution of cardinality for SEARCH_COUNTS. We could also include all of SEARCH_COUNTS as a JSON blob in a separate field.
ii. Same, but collapse each engine-access-count 3-tuple into a string.
iii. For each ping, emit a (ping-search-overview, (ping-search-count-1, ...)) nested tuple: in effect an SQL breakout table.

As I write, I think I prefer option i. to what I coded in the attachment. LMK.
Attachment #8728657 - Flags: review?(mreid)
Attachment #8728657 - Flags: review?(dzeber)
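As a rough illustration of steps 1-3 above (not the attached code; the config representation and the field paths shown here are assumptions), the extraction might look roughly like this:

import json

def extract_row(ping, field_config):
    # field_config maps output column name -> list of keys into the ping,
    # e.g. {"distributionId": ["environment", "partner", "distributionId"]}
    row = {}
    for out_name, path in field_config.items():
        value = ping
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        row[out_name] = value

    # Step 3: pack SEARCH_COUNTS into a JSON string with minimal processing.
    search_counts = ping.get("payload", {}).get("keyedHistograms", {}).get("SEARCH_COUNTS", {})
    row["searchCounts"] = json.dumps(search_counts)
    return row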
(In reply to Sam Penrose from comment #1)
> Here is a proposal for how the search stream will be derived. Please comment.

This looks fine to me. It seems to me the process of actually extracting field values from the payload is straightforward enough, but the main thing to settle is which fields should be extracted.

> and we don't know how many we're going to see.

Is this an issue for the data storage format? I think the format given in the attachment is the best - an entry for every (search engine, SAP, count) triple. This preserves the complete original search counts in a format that's easier to work with.

> i. Add keys searchEngine1, searchAccessPoint1, searchCount1, searchEngine2,
> ... up to some small number. I can check the distribution of cardinality for
> SEARCH_COUNTS.

We definitely don't want to be throwing away any data at this stage. I think a requirement for this dataset is that it contains the complete data (as is relevant to search).

> We could also include all of SEARCH_COUNTS as a JSON blob in
> a separate field.

I think the approach in the attachment is better. We really want to think of search counts as a mapping with bivariate keys rather than a "histogram". Also, we can get the key splitting logic out of the way of downstream processing.
Attachment #8728657 - Flags: review?(dzeber) → review+
(In reply to Dave Zeber [:dzeber] from comment #2)
> (In reply to Sam Penrose from comment #1)
> > Here is a proposal for how the search stream will be derived. Please comment.
>
> This looks fine to me. It seems to me the process of actually extracting
> field values from the payload is straightforward enough, but the main thing
> to settle is which fields should be extracted.
>
> > and we don't know how many we're going to see.
>
> Is this an issue for the data storage format? I think the format given in
> the attachment is the best - an entry for every (search engine, SAP, count)
> triple. This preserves the complete original search counts in a format
> that's easier to work with.

Easier to work with for some purposes, not for others. In particular, let's say we have a distribution like this:

  95%  of pings: <= 5 provider/point/count triples
  4.9% of pings: <= 10 triples
  0.1% of pings: ridiculously high variety of triples

If we have (5*3=) 15 columns for triples, plus one for the SEARCH_COUNTS keyed histogram as a JSON string and maybe 1 for the total of all sum fields (e.g. TOTAL_SEARCHES_IN_PING), then we will have rows with known numbers of columns, which has real benefits on the data store side. We could use the total column (or an extra column serving as a flag) to see whether there was anything extra in SEARCH_COUNTS. Alternately, we could only include the SEARCH_COUNTS blob if there were more than 5 triples. I will check the distribution of key cardinality in SEARCH_COUNTS and post what I find.

> > i. Add keys searchEngine1, searchAccessPoint1, searchCount1, searchEngine2,
> > ... up to some small number. I can check the distribution of cardinality for
> > SEARCH_COUNTS.
>
> We definitely don't want to be throwing away any data at this stage. I think
> a requirement for this dataset is that it contains the complete data (as is
> relevant to search).

We agree: containing the complete data is the primary requirement for this stream, and initial processing of the data is a valuable secondary requirement.

> > We could also include all of SEARCH_COUNTS as a JSON blob in
> > a separate field.
>
> I think the approach in the attachment is better. We really want to think of
> search counts as a mapping with bivariate keys rather than a "histogram".
> Also, we can get the key splitting logic out of the way of downstream
> processing.
(In reply to Sam Penrose from comment #3)
> If we have (5*3=) 15 columns for triples, plus one for the SEARCH_COUNTS
> keyed histogram as a JSON string and maybe 1 for the total of all sum fields
> (e.g. TOTAL_SEARCHES_IN_PING), then we will have rows with known numbers of
> columns, which has real benefits on the data store side. We could use the
> total column (or an extra column serving as a flag) to see whether there was
> anything extra in SEARCH_COUNTS. Alternately, we could only include the
> SEARCH_COUNTS blob if there were more than 5 triples.

I think I probably don't understand the stream format well enough. If we can have a single column that contains the entire SEARCH_COUNTS histogram, can't we have a single column that contains the entire contents of SEARCH_COUNTS but in the structure described in the attachment?
(In reply to Dave Zeber [:dzeber] from comment #4)
> (In reply to Sam Penrose from comment #3)
> > If we have (5*3=) 15 columns for triples, plus one for the SEARCH_COUNTS
> > keyed histogram as a JSON string and maybe 1 for the total of all sum fields
> > (e.g. TOTAL_SEARCHES_IN_PING), then we will have rows with known numbers of
> > columns, which has real benefits on the data store side. We could use the
> > total column (or an extra column serving as a flag) to see whether there was
> > anything extra in SEARCH_COUNTS. Alternately, we could only include the
> > SEARCH_COUNTS blob if there were more than 5 triples.
>
> I think I probably don't understand the stream format well enough. If we can
> have a single column that contains the entire SEARCH_COUNTS histogram, can't
> we have a single column that contains the entire contents of SEARCH_COUNTS
> but in the structure described in the attachment?

We can. I think it's an open question whether we should. More soon.
Here is the distribution of cardinality of SEARCH_COUNTS in 1% of 1 week of pings, along with the cumulative fraction of all provider-point pairs seen and of all searches (i.e. 'sum' values):

card | # pings  | cum pairs | cum searches
   0 | 11706126 | 73.8%     | 0%
   1 |  3420021 | 95.4%     | 70%
   2 |   498624 | 99.4%     | 94%
   3 |    62135 | 99.9%     | 99%
   4 |     5560 |
   5 |      485 |
   6 |      100 |
   7 |       26 |
   8 |       14 |
   9 |        3 |
 10+ |        4 |

That suggests that we could gain A LOT of convenience with a column structure like this one:

  ... other columns ...
  search_provider_1 varchar(256)     -- or whatever
  search_access_point_1 varchar(32)  -- or whatever
  search_count_1 unsigned integer
  ... 8 more columns ...
  search_count_4 unsigned integer
  search_overflow_json blob          -- arbitrarily large JSON blob, null 99% of the time

A more radical approach would be to have the client send only the top 4 most used provider-point pairs. We should of course replicate these findings with a different slice through the pings.

Notebook: https://gist.github.com/SamPenrose/b846116d672164d7da0a
Table editing fail, doesn't affect conclusions.

card | # pings  | cum pairs | cum searches
   0 | 11706126 | 0%        | 0%
   1 |  3420021 | 73.8%     | 70%
   2 |   498624 | 95.4%     | 94%
   3 |    62135 | 99.4%     | 99%
   4 |     5560 |
   5 |      485 |
   6 |      100 |
   7 |       26 |
   8 |       14 |
   9 |        3 |
 10+ |        4 |

(In reply to Sam Penrose from comment #6)
> card | # pings | cum pairs | cum searches
> 0 | 11706126 | 73.8% | 0%
> 1 | 3420021 | 95.4% | 70%
> 2 | 498624 | 99.4% | 94%
> 3 | 62135 | 99.9% | 99%
> 4 | 5560 |
> 5 | 485 |
> 6 | 100 |
> 7 | 26 |
> 8 | 14 |
> 9 | 3 |
> 10+ | 4 |
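For reference, the tally above could be reproduced with something roughly like the following (a sketch only: it assumes an RDD of parsed ping dicts, here called pings_rdd, and the field path is illustrative rather than copied from the gist):

def search_counts_cardinality(ping):
    counts = ping.get("payload", {}).get("keyedHistograms", {}).get("SEARCH_COUNTS", {})
    return len(counts or {})

cardinality_counts = (
    pings_rdd                                # assumed: RDD of parsed ping dicts
    .map(search_counts_cardinality)
    .map(lambda card: (min(card, 10), 1))    # bucket 10+ together, as in the table
    .reduceByKey(lambda a, b: a + b)
    .sortByKey()
    .collect()
)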
(In reply to Sam Penrose from comment #6)
> That suggests that we could gain A LOT of convenience with a column
> structure like this one:
>
> ... other columns ...
> search_provider_1 varchar(256) -- or whatever
> search_access_point_1 varchar(32) -- or whatever
> search_count_1 unsigned integer
> ... 8 more columns ...
> search_count_4 unsigned integer
> search_overflow_json blob -- arbitrarily large JSON blob, null 99% of the time

If this means that it's faster to work with this dataset, then that's fine. The main requirements that I would like to see for this data stream are:

- complete data: all search data from all Fx profiles.
- provider/SAP splitting performed at this stage, so that we don't need to worry about it downstream.
- quick and easy to facet search measurements by other UT fields: either fields like distribution/locale/geo are included in this dataset, or it is easy to join this dataset to relevant parts of the UT data.

> A more radical approach would be to have the client send only the top 4 most
> used provider-point pairs. We should of course replicate these findings with
> a different slice through the pings.

Personally, I don't think we should be considering this type of thing in the context of this data stream. If anything, this could be done in a downstream dataset.
Attached file search.derive.json
Here's a simple schema file which could drive the code in the other attachment. The keys define field names in a flat output row; the values are search paths through the input.
Attachment #8730248 - Flags: feedback?(mreid)
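The attachment itself isn't reproduced here, but based on this description and the example mapping in comment 0, a hypothetical entry in such a schema file might look like:

{
  "distributionId": "$environment$partner$distributionId",
  "subsessionStartDate": "$payload$info$subsessionStartDate"
}

(The field names and paths above come from the bug description; the actual attachment may differ.)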
Given a schema like this, what will the process be for looking up/joining other UT fields by client/subsession? Two cases I foresee are:

- join to the executive summary dataset
- join to the rest of UT

Will joins like this be quick to run? Or should we plan to include all fields relevant to search count analysis in this dataset directly?
(In reply to Dave Zeber [:dzeber] from comment #10)
> Given a schema like this, what will the process be for looking up/joining
> other UT fields by client/subsession? Two cases I foresee are:
> - join to the executive summary dataset
> - join to the rest of UT
>
> Will joins like this be quick to run? Or should we plan to include all
> fields relevant to search count analysis in this dataset directly?

We have no plan. Generally, if a field is used in search queries, let's add it, up until the sheer # of columns becomes cumbersome. Don't know what that point is. Currently we're at about 4 (plus actual search-specific columns), so we have room to grow.
(In reply to Sam Penrose from comment #11)
> (In reply to Dave Zeber [:dzeber] from comment #10)
> > Given a schema like this, what will the process be for looking up/joining
> > other UT fields by client/subsession? Two cases I foresee are:
> > - join to the executive summary dataset
> > - join to the rest of UT
> >
> > Will joins like this be quick to run? Or should we plan to include all
> > fields relevant to search count analysis in this dataset directly?
>
> We have no plan. Generally, if a field is used in search queries, let's add
> it, up until the sheer # of columns becomes cumbersome. Don't know what that
> point is. Currently we're at about 4 (plus actual search-specific columns),
> so we have room to grow.

I am planning to build out a general-purpose dataset that contains all the columns that are (currently) interesting at 100%. Ideally the same dataset can be used for this bug as well as for engagement ratios, churn, etc.
(In reply to Mark Reid [:mreid] from comment #12)
> (In reply to Sam Penrose from comment #11)
> > (In reply to Dave Zeber [:dzeber] from comment #10)
> > > Given a schema like this, what will the process be for looking up/joining
> > > other UT fields by client/subsession? Two cases I foresee are:
> > > - join to the executive summary dataset
> > > - join to the rest of UT
> > >
> > > Will joins like this be quick to run? Or should we plan to include all
> > > fields relevant to search count analysis in this dataset directly?
> >
> > We have no plan. Generally, if a field is used in search queries, let's add
> > it, up until the sheer # of columns becomes cumbersome. Don't know what that
> > point is. Currently we're at about 4 (plus actual search-specific columns),
> > so we have room to grow.
>
> I am planning to build out a general-purpose dataset that contains all the
> columns that are (currently) interesting at 100%. Ideally the same dataset
> can be used for this bug as well as for engagement ratios, churn, etc.
Flags: needinfo?(mreid)
Sorry for the quick reply: Mark, how are you planning to handle the search representation issue described above?
I like the representation you

(In reply to Dave Zeber [:dzeber] from comment #10)
> Given a schema like this, what will the process be for looking up/joining
> other UT fields by client/subsession? Two cases I foresee are:
> - join to the executive summary dataset
> - join to the rest of UT
>
> Will joins like this be quick to run? Or should we plan to include all
> fields relevant to search count analysis in this dataset directly?

We should plan to include the relevant fields in this dataset. If there's anything missing from Sam's list, aside from fields in the existing executive summary data, please let me know.
(In reply to Sam Penrose from comment #14)
> Sorry for the quick reply: Mark, how are you planning to handle the search
> representation issue described above?

I like the representation described in the python attachment - an array of {engine: "foo", source: "urlbar", count: 24}. This seems like a pretty direct mapping of what we want to analyze.

I think that will be convenient and performant, but if it turns out to be annoying to work with in that form, we can try moving the "top N" search items out to separate sets of fields as described in comment 6. Sound OK?
Flags: needinfo?(mreid)
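A hedged sketch (not the attached code) of how a SEARCH_COUNTS keyed histogram might be turned into the array-of-records form described above. It assumes keys of the form "<engine>.<access point>" and uses each entry's "sum" as the tally; splitting on the last "." is a simplification:

def search_counts_to_records(search_counts):
    # {"google.urlbar": {"sum": 24, ...}} ->
    #     [{"engine": "google", "source": "urlbar", "count": 24}]
    records = []
    for key, histogram in (search_counts or {}).items():
        engine, _, source = key.rpartition(".")
        records.append({"engine": engine, "source": source, "count": histogram.get("sum", 0)})
    return records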
Here are two kinds of work that need to happen:

1) Handle nested data structures at Mozilla scale: hard.
2) Aggregate values from flat rows at Mozilla scale: not easy, but easier.

My concern with your approach is that now 1) has to be done twice: first by your creating an array, and second by the next (?) layer in the processing chain. That is a big pass-through of work. I want us to figure out how that work (breaking a per-ping array into one row per triple) is going to get done.

(In reply to Mark Reid [:mreid] from comment #16)
> (In reply to Sam Penrose from comment #14)
> > Sorry for the quick reply: Mark, how are you planning to handle the search
> > representation issue described above?
>
> I like the representation described in the python attachment - an array of
> {engine: "foo", source: "urlbar", count: 24}. This seems like a pretty
> direct mapping of what we want to analyze.
>
> I think that will be convenient and performant, but if it turns out to be
> annoying to work with in that form, we can try moving the "top N" search
> items out to separate sets of fields as described in comment 6. Sound OK?
(In reply to Sam Penrose from comment #17)
> Here are two kinds of work that need to happen:
>
> 1) Handle nested data structures at Mozilla scale: hard.
> 2) Aggregate values from flat rows at Mozilla scale: not easy, but easier.
>
> My concern with your approach is that now 1) has to be done twice: first by
> your creating an array, and second by the next (?) layer in the processing
> chain. That is a big pass-through of work. I want us to figure out how that
> work (breaking a per-ping array into one row per triple) is going to get
> done.

I think we have working examples here - Sam, are you OK with the "explode" method we've been using so far?
Flags: needinfo?(spenrose)
(In reply to Mark Reid [:mreid] from comment #18)
> (In reply to Sam Penrose from comment #17)
> > Here are two kinds of work that need to happen:
> >
> > 1) Handle nested data structures at Mozilla scale: hard.
> > 2) Aggregate values from flat rows at Mozilla scale: not easy, but easier.
> >
> > My concern with your approach is that now 1) has to be done twice: first by
> > your creating an array, and second by the next (?) layer in the processing
> > chain. That is a big pass-through of work. I want us to figure out how that
> > work (breaking a per-ping array into one row per triple) is going to get
> > done.
>
> I think we have working examples here - Sam, are you OK with the "explode"
> method we've been using so far?

(Recording offline work between Mark and myself.)

explode() has one great strength: it provides excellent performance with nested data structures at Mozilla scale. However, it can't be used simply and cleanly: rather than return 1 row for an empty value, it returns 0. Given table T:

        A    | B
     --------------------
  R1 | 'a1' | {'b': 1}
  R2 | 'a2' | NULL

"SELECT A, B from T where A = 'a2'" returns ('a2', NULL), but "SELECT A, explode(B) from T where A = 'a2'" returns ().

That means that certain queries either can't be done or have to be assembled from e.g. FULL OUTER JOINing a table to itself, or Spark's DataFrame.fillna() first, or what have you -- a task neither Mark nor I was able to complete quickly. Further, Mark's example use of explode() passed through DataFrame.selectExpr() and then registerTempTable() -> sqlContext.sql(...). It is an open question to me whether that approach, used as an example, contains footguns.

The good news is that Mark's main-summary derived data stream is amazingly performant, and we can certainly wrap any footguns in APIs that shield the target audience. But I don't think we can declare victory until we have delivered those APIs. I'm happy to have that job on my plate for now.
Flags: needinfo?(spenrose)
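A minimal sketch of the row-dropping behaviour described above, assuming a sqlContext is available (as in the other snippets in this bug); the toy column names mirror table T:

from pyspark.sql import Row
from pyspark.sql.functions import col, explode

df = sqlContext.createDataFrame([
    Row(A="a1", B={"b": 1}),   # R1: non-empty map
    Row(A="a2", B=None),       # R2: NULL map
])

df.filter(col("A") == "a2").show()
# one row: (a2, null)

df.select("A", explode("B")).filter(col("A") == "a2").show()
# zero rows: explode() emits nothing for a NULL/empty map, so the record vanishes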
(In reply to Sam Penrose from comment #19)
> That means that certain queries either can't be done or have to be assembled
> from e.g. FULL OUTER JOINing a table to itself, or Spark's
> DataFrame.fillna() first

You could use the HiveQL syntax to do that, e.g.:

from pyspark.sql import Row  # needed for the snippet below

sqlContext.createDataFrame([Row(a="a1", b={"foo": "bar"}),
                            Row(a="a2", b=None)]).registerTempTable("table")
sqlContext.sql("SELECT a, k, v FROM table LATERAL VIEW OUTER explode(b) tmp AS k, v").show()

returns

+---+----+----+
|  a|   k|   v|
+---+----+----+
| a1| foo| bar|
| a2|null|null|
+---+----+----+

See also https://issues.apache.org/jira/browse/SPARK-13721.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #20)
> (In reply to Sam Penrose from comment #19)
> > That means that certain queries either can't be done or have to be assembled
> > from e.g. FULL OUTER JOINing a table to itself, or Spark's
> > DataFrame.fillna() first
>
> You could use the HiveQL syntax to do that, e.g.:
>
> sqlContext.createDataFrame([Row(a="a1", b={"foo": "bar"}), Row(a="a2",
> b=None)]).registerTempTable("table")
> sqlContext.sql("SELECT a, k, v FROM table LATERAL VIEW OUTER explode(b) tmp
> AS k, v").show()
>
> returns
>
> +---+----+----+
> |  a|   k|   v|
> +---+----+----+
> | a1| foo| bar|
> | a2|null|null|
> +---+----+----+
>
> See also https://issues.apache.org/jira/browse/SPARK-13721.

Thanks Roberto, very helpful. Mark, I think that this bug no longer reflects what we're actually doing, which is "1) implement a main-summary derived stream, and 2) implement APIs on top of it that correctly extract search data." I'm inclined to change the description of this bug to match 2). What do you think?
Flags: needinfo?(mreid)
(In reply to Sam Penrose from comment #21)
> Mark, I think that this bug no longer reflects what we're actually doing,
> which is "1) implement a main-summary derived stream, and 2) implement APIs
> on top of it that correctly extract search data." I'm inclined to change the
> description of this bug to match 2). What do you think?

Yes, that sounds good. #1 is already filed as bug 1260847.
Depends on: 1260847
Flags: needinfo?(mreid)
Attachment #8728657 - Flags: review?(mreid) → feedback+
Attachment #8730248 - Flags: feedback?(mreid) → feedback+
Summary: Create schema for canonical search data stream → Build search dataset based on "main summary" dataset
This has been delivered as v4_search_rollups, accessible to the BD and Search Product teams via Tableau.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard