Closed Bug 1254716 Opened 9 years ago Closed 9 years ago

Build search dataset based on "main summary" dataset

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: spenrose, Assigned: spenrose)

References

Details

Attachments

(2 files)

Create a mapping from UT main ping fields (input) to a flat set of output fields, and commit it to a common repository such as mozilla-pipeline-schemas. Example mapping (this is not the final syntax, which will probably be a superset of JSON-Schema):

$environment$partner$distributionId -> distributionId
$payload$info$subsessionStartDate -> subsessionStartDate
Here is a proposal for how the search stream will be derived. Please comment.

1) The fields to be extracted from the ping are loaded from a config file that is expected to grow occasionally.
2) The output is a flat row structure, except for SEARCH_COUNTS.
3) SEARCH_COUNTS are minimally processed and packed into a JSON string.
4) I don't currently have a semantic validation step (access point is legit, search tallies are > 0, etc.).

Step 3 is the tricky one. SEARCH_COUNTS packs the values we care about into key names, and we don't know how many we're going to see. Other approaches I considered:

i. Add keys searchEngine1, searchAccessPoint1, searchCount1, searchEngine2, ... up to some small number. I can check the distribution of cardinality for SEARCH_COUNTS. We could also include all of SEARCH_COUNTS as a JSON blob in a separate field.
ii. Same, but collapse each engine-access-count 3-tuple into a string.
iii. For each ping, emit a (ping-search-overview, (ping-search-count-1, ...)) nested tuple: in effect an SQL breakout table.

As I write, I think I prefer option i. to what I coded in the attachment. LMK.
Attachment #8728657 - Flags: review?(mreid)
Attachment #8728657 - Flags: review?(dzeber)
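As a rough illustration of steps 1-3 above (not the attached code; the config representation and the field paths shown here are assumptions), the extraction might look roughly like this:

import json

def extract_row(ping, field_config):
    # field_config maps output column name -> list of keys into the ping,
    # e.g. {"distributionId": ["environment", "partner", "distributionId"]}
    row = {}
    for out_name, path in field_config.items():
        value = ping
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        row[out_name] = value

    # Step 3: pack SEARCH_COUNTS into a JSON string with minimal processing.
    search_counts = ping.get("payload", {}).get("keyedHistograms", {}).get("SEARCH_COUNTS", {})
    row["searchCounts"] = json.dumps(search_counts)
    return row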
(In reply to Sam Penrose from comment #1)
> Here is a proposal for how the search stream will be derived. Please comment.

This looks fine to me. It seems to me the process of actually extracting field values from the payload is straightforward enough, but the main thing to settle is which fields should be extracted.

> and we don't know how many we're going to see.

Is this an issue for the data storage format? I think the format given in the attachment is the best - an entry for every (search engine, SAP, count) triple. This preserves the complete original search counts in a format that's easier to work with.

> i. Add keys searchEngine1, searchAccessPoint1, searchCount1, searchEngine2,
> ... up to some small number. I can check the distribution of cardinality for
> SEARCH_COUNTS.

We definitely don't want to be throwing away any data at this stage. I think a requirement for this dataset is that it contains the complete data (as is relevant to search).

> We could also include all of SEARCH_COUNTS as a JSON blob in
> a separate field.

I think the approach in the attachment is better. We really want to think of search counts as a mapping with bivariate keys rather than a "histogram". Also, we can get the key splitting logic out of the way of downstream processing.
Attachment #8728657 - Flags: review?(dzeber) → review+
(In reply to Dave Zeber [:dzeber] from comment #2)
> (In reply to Sam Penrose from comment #1)
> > Here is a proposal for how the search stream will be derived. Please comment.
>
> This looks fine to me. It seems to me the process of actually extracting
> field values from the payload is straightforward enough, but the main thing
> to settle is which fields should be extracted.
>
> > and we don't know how many we're going to see.
>
> Is this an issue for the data storage format? I think the format given in
> the attachment is the best - an entry for every (search engine, SAP, count)
> triple. This preserves the complete original search counts in a format
> that's easier to work with.

Easier to work with for some purposes, not for others. In particular, let's say we have a distribution like this:

  95%  of pings: <= 5 provider/point/count triples
  4.9% of pings: <= 10 triples
  0.1% of pings: ridiculously high variety of triples

If we have (5*3=) 15 columns for triples, plus one for the SEARCH_COUNTS keyed histogram as a JSON string and maybe 1 for the total of all sum fields (e.g. TOTAL_SEARCHES_IN_PING), then we will have rows with known numbers of columns, which has real benefits on the data store side. We could use the total column (or an extra column serving as a flag) to see whether there was anything extra in SEARCH_COUNTS. Alternately, we could only include the SEARCH_COUNTS blob if there were more than 5 triples. I will check the distribution of key cardinality in SEARCH_COUNTS and post what I find.

> > i. Add keys searchEngine1, searchAccessPoint1, searchCount1, searchEngine2,
> > ... up to some small number. I can check the distribution of cardinality for
> > SEARCH_COUNTS.
>
> We definitely don't want to be throwing away any data at this stage. I think
> a requirement for this dataset is that it contains the complete data (as is
> relevant to search).

We agree: containing the complete data is the primary requirement for this stream, and initial processing of the data is a valuable secondary requirement.

> > We could also include all of SEARCH_COUNTS as a JSON blob in
> > a separate field.
>
> I think the approach in the attachment is better. We really want to think of
> search counts as a mapping with bivariate keys rather than a "histogram".
> Also, we can get the key splitting logic out of the way of downstream
> processing.
(In reply to Sam Penrose from comment #3)
> If we have (5*3=) 15 columns for triples, plus one for the SEARCH_COUNTS
> keyed histogram as a JSON string and maybe 1 for the total of all sum fields
> (e.g. TOTAL_SEARCHES_IN_PING), then we will have rows with known numbers of
> columns, which has real benefits on the data store side. We could use the
> total column (or an extra column serving as a flag) to see whether there was
> anything extra in SEARCH_COUNTS. Alternately, we could only include the
> SEARCH_COUNTS blob if there were more than 5 triples.

I think I probably don't understand the stream format well enough. If we can have a single column that contains the entire SEARCH_COUNTS histogram, can't we have a single column that contains the entire contents of SEARCH_COUNTS but in the structure described in the attachment?
(In reply to Dave Zeber [:dzeber] from comment #4)
> (In reply to Sam Penrose from comment #3)
> > If we have (5*3=) 15 columns for triples, plus one for the SEARCH_COUNTS
> > keyed histogram as a JSON string and maybe 1 for the total of all sum fields
> > (e.g. TOTAL_SEARCHES_IN_PING), then we will have rows with known numbers of
> > columns, which has real benefits on the data store side. We could use the
> > total column (or an extra column serving as a flag) to see whether there was
> > anything extra in SEARCH_COUNTS. Alternately, we could only include the
> > SEARCH_COUNTS blob if there were more than 5 triples.
>
> I think I probably don't understand the stream format well enough. If we can
> have a single column that contains the entire SEARCH_COUNTS histogram, can't
> we have a single column that contains the entire contents of SEARCH_COUNTS
> but in the structure described in the attachment?

We can. I think it's an open question whether we should. More soon.
Here is the distribution of cardinality of SEARCH_COUNTS in 1% of 1 week of pings, along with the cumulative fraction of all provider-point pairs seen and of all searches (i.e. 'sum' values):

card | # pings  | cum pairs | cum searches
   0 | 11706126 | 73.8%     | 0%
   1 |  3420021 | 95.4%     | 70%
   2 |   498624 | 99.4%     | 94%
   3 |    62135 | 99.9%     | 99%
   4 |     5560 |
   5 |      485 |
   6 |      100 |
   7 |       26 |
   8 |       14 |
   9 |        3 |
 10+ |        4 |

That suggests that we could gain A LOT of convenience with a column structure like this one:

  ... other columns ...
  search_provider_1 varchar(256)     -- or whatever
  search_access_point_1 varchar(32)  -- or whatever
  search_count_1 unsigned integer
  ... 8 more columns ...
  search_count_4 unsigned integer
  search_overflow_json blob          -- arbitrarily large JSON blob, null 99% of the time

A more radical approach would be to have the client send only the top 4 most used provider-point pairs. We should of course replicate these findings with a different slice through the pings.

Notebook: https://gist.github.com/SamPenrose/b846116d672164d7da0a
Table editing fail, doesn't affect conclusions.

card | # pings  | cum pairs | cum searches
   0 | 11706126 | 0%        | 0%
   1 |  3420021 | 73.8%     | 70%
   2 |   498624 | 95.4%     | 94%
   3 |    62135 | 99.4%     | 99%
   4 |     5560 |
   5 |      485 |
   6 |      100 |
   7 |       26 |
   8 |       14 |
   9 |        3 |
 10+ |        4 |

(In reply to Sam Penrose from comment #6)
> card | # pings | cum pairs | cum searches
> 0 | 11706126 | 73.8% | 0%
> 1 | 3420021 | 95.4% | 70%
> 2 | 498624 | 99.4% | 94%
> 3 | 62135 | 99.9% | 99%
> 4 | 5560 |
> 5 | 485 |
> 6 | 100 |
> 7 | 26 |
> 8 | 14 |
> 9 | 3 |
> 10+ | 4 |
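For reference, the tally above could be reproduced with something roughly like the following (a sketch only: it assumes an RDD of parsed ping dicts, here called pings_rdd, and the field path is illustrative rather than copied from the gist):

def search_counts_cardinality(ping):
    counts = ping.get("payload", {}).get("keyedHistograms", {}).get("SEARCH_COUNTS", {})
    return len(counts or {})

cardinality_counts = (
    pings_rdd                                # assumed: RDD of parsed ping dicts
    .map(search_counts_cardinality)
    .map(lambda card: (min(card, 10), 1))    # bucket 10+ together, as in the table
    .reduceByKey(lambda a, b: a + b)
    .sortByKey()
    .collect()
)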
(In reply to Sam Penrose from comment #6)
> That suggests that we could gain A LOT of convenience with a column
> structure like this one:
>
> ... other columns ...
> search_provider_1 varchar(256) -- or whatever
> search_access_point_1 varchar(32) -- or whatever
> search_count_1 unsigned integer
> ... 8 more columns ...
> search_count_4 unsigned integer
> search_overflow_json blob -- arbitrarily large JSON blob, null 99% of the time

If this means that it's faster to work with this dataset, then that's fine. The main requirements that I would like to see for this data stream are:

- complete data: all search data from all Fx profiles.
- provider/SAP splitting performed at this stage, so that we don't need to worry about it downstream.
- quick and easy to facet search measurements by other UT fields: either fields like distribution/locale/geo are included in this dataset, or it is easy to join this dataset to relevant parts of the UT data.

> A more radical approach would be to have the client send only the top 4 most
> used provider-point pairs. We should of course replicate these findings with
> a different slice through the pings.

Personally, I don't think we should be considering this type of thing in the context of this data stream. If anything, this could be done in a downstream dataset.
Attached file search.derive.json
Here's a simple schema file which could drive the code in the other attachment. The keys define field names in a flat output row; the values are search paths through the input.
Attachment #8730248 - Flags: feedback?(mreid)
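The attachment itself isn't reproduced here, but based on this description and the example mapping in comment 0, a hypothetical entry in such a schema file might look like:

{
  "distributionId": "$environment$partner$distributionId",
  "subsessionStartDate": "$payload$info$subsessionStartDate"
}

(The field names and paths above come from the bug description; the actual attachment may differ.)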
Given a schema like this, what will the process be for looking up/joining other UT fields by client/subsession? Two cases I foresee are:

- join to the executive summary dataset
- join to the rest of UT

Will joins like this be quick to run? Or should we plan to include all fields relevant to search count analysis in this dataset directly?
(In reply to Dave Zeber [:dzeber] from comment #10)
> Given a schema like this, what will the process be for looking up/joining
> other UT fields by client/subsession? Two cases I foresee are:
> - join to the executive summary dataset
> - join to the rest of UT
>
> Will joins like this be quick to run? Or should we plan to include all
> fields relevant to search count analysis in this dataset directly?

We have no plan. Generally, if a field is used in search queries, let's add it, up until the sheer # of columns becomes cumbersome. Don't know what that point is. Currently we're at about 4 (plus actual search-specific columns), so we have room to grow.
(In reply to Sam Penrose from comment #11)
> (In reply to Dave Zeber [:dzeber] from comment #10)
> > Given a schema like this, what will the process be for looking up/joining
> > other UT fields by client/subsession? Two cases I foresee are:
> > - join to the executive summary dataset
> > - join to the rest of UT
> >
> > Will joins like this be quick to run? Or should we plan to include all
> > fields relevant to search count analysis in this dataset directly?
>
> We have no plan. Generally, if a field is used in search queries, let's add
> it, up until the sheer # of columns becomes cumbersome. Don't know what that
> point is. Currently we're at about 4 (plus actual search-specific columns),
> so we have room to grow.

I am planning to build out a general-purpose dataset that contains all the columns that are (currently) interesting at 100%. Ideally the same dataset can be used for this bug as well as for engagement ratios, churn, etc.
(In reply to Mark Reid [:mreid] from comment #12)
> (In reply to Sam Penrose from comment #11)
> > (In reply to Dave Zeber [:dzeber] from comment #10)
> > > Given a schema like this, what will the process be for looking up/joining
> > > other UT fields by client/subsession? Two cases I foresee are:
> > > - join to the executive summary dataset
> > > - join to the rest of UT
> > >
> > > Will joins like this be quick to run? Or should we plan to include all
> > > fields relevant to search count analysis in this dataset directly?
> >
> > We have no plan. Generally, if a field is used in search queries, let's add
> > it, up until the sheer # of columns becomes cumbersome. Don't know what that
> > point is. Currently we're at about 4 (plus actual search-specific columns),
> > so we have room to grow.
>
> I am planning to build out a general-purpose dataset that contains all the
> columns that are (currently) interesting at 100%. Ideally the same dataset
> can be used for this bug as well as for engagement ratios, churn, etc.
Flags: needinfo?(mreid)
Sorry for the quick reply: Mark, how are you planning to handle the search representation issue described above?
I like the representation you

(In reply to Dave Zeber [:dzeber] from comment #10)
> Given a schema like this, what will the process be for looking up/joining
> other UT fields by client/subsession? Two cases I foresee are:
> - join to the executive summary dataset
> - join to the rest of UT
>
> Will joins like this be quick to run? Or should we plan to include all
> fields relevant to search count analysis in this dataset directly?

We should plan to include the relevant fields in this dataset. If there's anything missing from Sam's list, aside from fields in the existing executive summary data, please let me know.
(In reply to Sam Penrose from comment #14)
> Sorry for the quick reply: Mark, how are you planning to handle the search
> representation issue described above?

I like the representation described in the python attachment - an array of {engine: "foo", source: "urlbar", count: 24}. This seems like a pretty direct mapping of what we want to analyze.

I think that will be convenient and performant, but if it turns out to be annoying to work with in that form, we can try moving the "top N" search items out to separate sets of fields as described in comment 6. Sound OK?
Flags: needinfo?(mreid)
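A hedged sketch (not the attached code) of how a SEARCH_COUNTS keyed histogram might be turned into the array-of-records form described above. It assumes keys of the form "<engine>.<access point>" and uses each entry's "sum" as the tally; splitting on the last "." is a simplification:

def search_counts_to_records(search_counts):
    # {"google.urlbar": {"sum": 24, ...}} ->
    #     [{"engine": "google", "source": "urlbar", "count": 24}]
    records = []
    for key, histogram in (search_counts or {}).items():
        engine, _, source = key.rpartition(".")
        records.append({"engine": engine, "source": source, "count": histogram.get("sum", 0)})
    return records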
Here are two kinds of work that need to happen:

1) Handle nested data structures at Mozilla scale: hard.
2) Aggregate values from flat rows at Mozilla scale: not easy, but easier.

My concern with your approach is that now 1) has to be done twice: first by your creating an array, and second by the next (?) layer in the processing chain. That is a big pass-through of work. I want us to figure out how that work (breaking a per-ping array into one row per triple) is going to get done.

(In reply to Mark Reid [:mreid] from comment #16)
> (In reply to Sam Penrose from comment #14)
> > Sorry for the quick reply: Mark, how are you planning to handle the search
> > representation issue described above?
>
> I like the representation described in the python attachment - an array of
> {engine: "foo", source: "urlbar", count: 24}. This seems like a pretty
> direct mapping of what we want to analyze.
>
> I think that will be convenient and performant, but if it turns out to be
> annoying to work with in that form, we can try moving the "top N" search
> items out to separate sets of fields as described in comment 6. Sound OK?
(In reply to Sam Penrose from comment #17)
> Here are two kinds of work that need to happen:
>
> 1) Handle nested data structures at Mozilla scale: hard.
> 2) Aggregate values from flat rows at Mozilla scale: not easy, but easier.
>
> My concern with your approach is that now 1) has to be done twice: first by
> your creating an array, and second by the next (?) layer in the processing
> chain. That is a big pass-through of work. I want us to figure out how that
> work (breaking a per-ping array into one row per triple) is going to get
> done.

I think we have working examples here - Sam, are you OK with the "explode" method we've been using so far?
Flags: needinfo?(spenrose)
(In reply to Mark Reid [:mreid] from comment #18)
> (In reply to Sam Penrose from comment #17)
> > Here are two kinds of work that need to happen:
> >
> > 1) Handle nested data structures at Mozilla scale: hard.
> > 2) Aggregate values from flat rows at Mozilla scale: not easy, but easier.
> >
> > My concern with your approach is that now 1) has to be done twice: first by
> > your creating an array, and second by the next (?) layer in the processing
> > chain. That is a big pass-through of work. I want us to figure out how that
> > work (breaking a per-ping array into one row per triple) is going to get
> > done.
>
> I think we have working examples here - Sam, are you OK with the "explode"
> method we've been using so far?

(Recording offline work between Mark and myself.)

explode() has one great strength: it provides excellent performance with nested data structures at Mozilla scale. However, it can't be used simply and cleanly: rather than return 1 row for an empty value, it returns 0. Given table T:

        A    | B
     --------------------
  R1 | 'a1' | {'b': 1}
  R2 | 'a2' | NULL

"SELECT A, B from T where A = 'a2'" returns ('a2', NULL), but "SELECT A, explode(B) from T where A = 'a2'" returns ().

That means that certain queries either can't be done or have to be assembled from e.g. FULL OUTER JOINing a table to itself, or Spark's DataFrame.fillna() first, or what have you -- a task neither Mark nor I was able to complete quickly. Further, Mark's example use of explode() passed through DataFrame.selectExpr() and then registerTempTable() -> sqlContext.sql(...). It is an open question to me whether that approach, used as an example, contains footguns.

The good news is that Mark's main-summary derived data stream is amazingly performant, and we can certainly wrap any footguns in APIs that shield the target audience. But I don't think we can declare victory until we have delivered those APIs. I'm happy to have that job on my plate for now.
Flags: needinfo?(spenrose)
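A minimal sketch of the row-dropping behaviour described above, assuming a sqlContext is available (as in the other snippets in this bug); the toy column names mirror table T:

from pyspark.sql import Row
from pyspark.sql.functions import col, explode

df = sqlContext.createDataFrame([
    Row(A="a1", B={"b": 1}),   # R1: non-empty map
    Row(A="a2", B=None),       # R2: NULL map
])

df.filter(col("A") == "a2").show()
# one row: (a2, null)

df.select("A", explode("B")).filter(col("A") == "a2").show()
# zero rows: explode() emits nothing for a NULL/empty map, so the record vanishes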
(In reply to Sam Penrose from comment #19)
> That means that certain queries either can't be done or have to be assembled
> from e.g. FULL OUTER JOINing a table to itself, or Spark's
> DataFrame.fillna() first

You could use the HiveQL syntax to do that, e.g.:

from pyspark.sql import Row  # needed for the snippet below

sqlContext.createDataFrame([Row(a="a1", b={"foo": "bar"}),
                            Row(a="a2", b=None)]).registerTempTable("table")
sqlContext.sql("SELECT a, k, v FROM table LATERAL VIEW OUTER explode(b) tmp AS k, v").show()

returns

+---+----+----+
|  a|   k|   v|
+---+----+----+
| a1| foo| bar|
| a2|null|null|
+---+----+----+

See also https://issues.apache.org/jira/browse/SPARK-13721.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #20)
> (In reply to Sam Penrose from comment #19)
> > That means that certain queries either can't be done or have to be assembled
> > from e.g. FULL OUTER JOINing a table to itself, or Spark's
> > DataFrame.fillna() first
>
> You could use the HiveQL syntax to do that, e.g.:
>
> sqlContext.createDataFrame([Row(a="a1", b={"foo": "bar"}), Row(a="a2",
> b=None)]).registerTempTable("table")
> sqlContext.sql("SELECT a, k, v FROM table LATERAL VIEW OUTER explode(b) tmp
> AS k, v").show()
>
> returns
>
> +---+----+----+
> |  a|   k|   v|
> +---+----+----+
> | a1| foo| bar|
> | a2|null|null|
> +---+----+----+
>
> See also https://issues.apache.org/jira/browse/SPARK-13721.

Thanks Roberto, very helpful. Mark, I think that this bug no longer reflects what we're actually doing, which is "1) implement a main-summary derived stream, and 2) implement APIs on top of it that correctly extract search data." I'm inclined to change the description of this bug to match 2). What do you think?
Flags: needinfo?(mreid)
(In reply to Sam Penrose from comment #21)
> Mark, I think that this bug no longer reflects what we're actually doing,
> which is "1) implement a main-summary derived stream, and 2) implement APIs
> on top of it that correctly extract search data." I'm inclined to change the
> description of this bug to match 2). What do you think?

Yes, that sounds good. #1 is already filed as bug 1260847.
Depends on: 1260847
Flags: needinfo?(mreid)
Attachment #8728657 - Flags: review?(mreid) → feedback+
Attachment #8730248 - Flags: feedback?(mreid) → feedback+
Summary: Create schema for canonical search data stream → Build search dataset based on "main summary" dataset
This has been delivered as v4_search_rollups, accessible to the BD and Search Product teams via Tableau.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard