
Build pipeline for cliqz testpilot result data

RESOLVED FIXED

Status

Cloud Services
Metrics: Pipeline
P1
normal
RESOLVED FIXED
8 months ago
7 months ago

People

(Reporter: harter, Assigned: harter)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

8 months ago
The pipeline will filter all Test Pilot pings down to those relevant to the Cliqz experiment, transform them into a usable format, and make the data available in STMO.

The pipeline will also collate this data with the data provided by Cliqz and with main_summary.
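
As a rough illustration of the intended shape of the job, a minimal PySpark sketch follows; the input location, experiment id, and field names are placeholder assumptions, not the final spec.

# Sketch only: filter Test Pilot pings down to the Cliqz experiment, flatten
# the fields we care about, and write parquet for STMO. The input path,
# experiment id, and field names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cliqz_testpilot_pipeline").getOrCreate()

RAW_PINGS = "s3://<raw-ping-bucket>/testpilot/"  # hypothetical input location

cliqz = (
    spark.read.parquet(RAW_PINGS)
    .where(F.col("payload.test") == "testpilot@cliqz.com")  # assumed experiment id
    .select(
        F.col("clientId").alias("client_id"),
        F.col("meta.submissionDate").alias("submission_date"),
        F.col("payload.payload.event").alias("event"),
    )
)

# Output location matches the dataset listed later in this bug.
cliqz.write.mode("overwrite").parquet(
    "s3://telemetry-parquet/harter/cliqz_testpilot/v1/"
)
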
(Assignee)

Comment 1

8 months ago
David, can I share the cliqz_table document you provided when we last met? I don't see anything sensitive in it, but I wanted to make sure.
Flags: needinfo?(dzeber)
(Assignee)

Updated

8 months ago
Depends on: 1336617
(Assignee)

Comment 2

8 months ago
The testpilot, testpilottest, and search data are available on S3 at the following locations:

s3://telemetry-parquet/harter/cliqz_search/v1/
s3://telemetry-parquet/harter/cliqz_testpilot/v1/
s3://telemetry-parquet/harter/cliqz_testpilottest/v1/

The testpilot client IDs still need to be decrypted, but that should be fixed today. I'll run a two-week backfill once this is addressed.
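
For anyone who wants to poke at these, a quick sanity check from a Spark notebook (reading the S3 locations listed above) could look like this:

# Read each dataset straight from S3 and eyeball the row counts and schemas.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

paths = {
    "cliqz_search": "s3://telemetry-parquet/harter/cliqz_search/v1/",
    "cliqz_testpilot": "s3://telemetry-parquet/harter/cliqz_testpilot/v1/",
    "cliqz_testpilottest": "s3://telemetry-parquet/harter/cliqz_testpilottest/v1/",
}

for name, path in paths.items():
    df = spark.read.parquet(path)
    print(name, df.count(), "rows")
    df.printSchema()
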
Depends on: 1337462
Flags: needinfo?(dzeber)

Comment 3

8 months ago
Nice work, Ryan, thanks. Yes, as we discussed feel free to share the tables doc within Mozilla.
(Assignee)

Comment 4

8 months ago
The Cliqz client ID stored in the cliqz_testpilottest table (previously encrypted) is now decrypted and cleaned to include only the client ID itself. Data should now be available from 2017-01-28 forward.

Here's a link to the original spec:
https://gist.github.com/dzeber/5b8f75432cace22da1b78bbef7f1459f

On Monday, I'll rename the columns as described in this document [0] and add the final profile daily table.


[0] https://docs.google.com/spreadsheets/d/1kCmt4F_pWuJ5lFoXra8tr9j7iDC46GQUrdrGyMlymZw/edit#gid=0
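
The renaming itself is mechanical; a sketch of the approach is below. The old-to-new mapping shown is made up for illustration, the authoritative mapping is the spreadsheet at [0].

# Mechanically apply an old -> new column-name mapping. The entries here are
# hypothetical; the real mapping lives in the linked spreadsheet.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rename_map = {
    "cliqz_client_id_enc": "cliqz_client_id",  # hypothetical example entries
    "tp_event": "event",
}

df = spark.read.parquet("s3://telemetry-parquet/harter/cliqz_testpilottest/v1/")
renamed = reduce(
    lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
    rename_map.items(),
    df,
)
renamed.printSchema()
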
(Assignee)

Comment 5

7 months ago
Hey Dave, I'm building the profile daily table now. I want to make sure I understand the spec. 

It looks like there will be a lot of missing data. For example, every main ping from the 2 week submission_date buffer will have no associated testpilot data. There's also a chance that some testpilot data will not have any main_summary data for a given day. Just making sure this matches your expectations.

It looks like most of the main ping measurements are pretty stable (normalized_channel, os, ...) with the exception of session_hours. Could we use the cross_sectional table to supply these measurements or are you trying to do some sort of timeseries analysis over the data?
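
To make the missing-data point concrete, this is roughly the join I have in mind (paths, versions, and column names below are assumptions):

# Left-joining daily main_summary activity against Test Pilot events: rows from
# the 14-day pre-enrollment buffer get no match, so the event count comes back
# null (0 after fillna). Paths, versions, and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

main = spark.read.parquet("s3://telemetry-parquet/main_summary/v3/")
txp = spark.read.parquet("s3://telemetry-parquet/harter/cliqz_testpilot/v1/")

daily = (
    main.groupBy("client_id", "submission_date_s3")
        .agg((F.sum("subsession_length") / 3600.0).alias("session_hours"))
        .join(
            txp.groupBy("client_id", "submission_date_s3")
               .agg(F.count("*").alias("txp_event_count")),
            ["client_id", "submission_date_s3"],
            "left",
        )
        .fillna({"txp_event_count": 0})
)
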
Flags: needinfo?(dzeber)

Comment 6

7 months ago
(In reply to Ryan Harter [:harter] from comment #5)
> It looks like there will be a lot of missing data. For example, every main
> ping from the 2 week submission_date buffer will have no associated
> testpilot data. There's also a chance that some testpilot data will not have
> any main_summary data for a given day. Just making sure this matches your
> expectations.

Yes, I was thinking in that case those "counts of TxP events" columns would just be 0. Search counts after installing the add-on would also be missing, since searches are handled by the Cliqz data collection.

> It looks like most of the main ping measurements are pretty stable
> (normalized_channel, os, ...) with the exception of session_hours. Could we
> use the cross_sectional table to supply these measurements or are you trying
> to do some sort of timeseries analysis over the data?

Channel and OS should be stable. However, I think we'd want to see the other measurements on a daily basis to check for changes (e.g. isDefaultBrowser).

In general, the idea for this table is to provide longitudinal activity metrics for profiles in the Cliqz experiment over the course of their participation, as well as 14 days prior to entry. This allows for time series aggregate views as well as longitudinal before/after comparisons. The goal is that most questions can be answered without going beyond this activity table and the search table.
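
For concreteness, the kind of before/after aggregate this table should support might look like the sketch below; the path and column names, including enrollment_date, are placeholders.

# Sketch of a before/after aggregate over the activity table: average session
# hours per profile before vs. after entering the experiment. The
# "enrollment_date" column name is an assumption.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

activity = spark.read.parquet("s3://telemetry-parquet/harter/cliqz_profile_daily/v1/")

before_after = (
    activity
    .withColumn(
        "period",
        F.when(F.col("submission_date") < F.col("enrollment_date"), "before")
         .otherwise("after"),
    )
    .groupBy("client_id", "period")
    .agg(F.avg("session_hours").alias("avg_session_hours"))
)
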
Flags: needinfo?(dzeber)
(Assignee)

Comment 7

7 months ago
Thanks Dave. 

The profile_daily table is now available on s3 [0]. It contains data starting 2017-01-01. The job used to generate this table is here [1].

A few caveats:
* As you noted in the spec, `total_uri_count` appears to be unpopulated, so page_views is excluded
* For os, channel, and search_default you asked for the first or last value on a given day. It was more convenient to take an arbitrary value for a given day instead of the first/last; if this is a problem, let me know (there's a sketch of the first/last approach below).
* The data is not yet available in re:dash

Let me know if you notice any oddities in the data.

[0] s3://telemetry-parquet/harter/cliqz_profile_daily/v1/
[1] https://github.com/harterrt/cliqz_ping_pipeline/blob/master/prof_daily_prototype.py
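
Regarding the first/last caveat above, switching would be cheap if needed; here's a sketch over the per-ping data, with assumed paths and column names:

# Take the first value per client/day instead of an arbitrary one, by sorting
# within each group. The path, timestamp column, and other column names here
# are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

pings = spark.read.parquet("s3://telemetry-parquet/main_summary/v3/")

w = Window.partitionBy("client_id", "submission_date_s3").orderBy("timestamp")
first_per_day = (
    pings.withColumn("rn", F.row_number().over(w))
         .where(F.col("rn") == 1)
         .select("client_id", "submission_date_s3", "os",
                 "normalized_channel", "default_search_engine")
)
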

Comment 8

7 months ago
> A few caveats:
> * As you noted in the spec, `total_uri_count` appears to be unpopulated, so
> page_views is excluded

No problem.

> * For os, channel, and search_default you asked for the first or last value
> on a given day. It was more convenient to take an arbitrary value for a given
> day instead of the first/last; if this is a problem, let me know.

That should be fine.

It looks like the only outstanding tasks would be backfilling and making the datasets available in re:dash. Ryan, can you confirm when those are done?

As discussed, the profile_daily table runs back to 2017-01-01. The testpilot tables should run back to 2017-01-10, the start date of the experiment.
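
While we wait on re:dash, a quick way to confirm the backfill from Spark might be something like the following; the partition column name and date format are assumptions.

# Count the distinct days present in each dataset from its expected start date.
# The "submission_date" column name and YYYYMMDD format are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

expected_starts = {
    "s3://telemetry-parquet/harter/cliqz_profile_daily/v1/": "20170101",
    "s3://telemetry-parquet/harter/cliqz_testpilot/v1/": "20170110",
    "s3://telemetry-parquet/harter/cliqz_testpilottest/v1/": "20170110",
}

for path, start in expected_starts.items():
    days = (
        spark.read.parquet(path)
        .where(F.col("submission_date") >= start)
        .select("submission_date")
        .distinct()
        .count()
    )
    print(path, days, "distinct days since", start)
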
(Assignee)

Updated

7 months ago
Depends on: 1340318
(Assignee)

Comment 9

7 months ago
> It looks like the only outstanding tasks would be backfilling and making the
> datasets available in re:dash. Ryan, can you confirm when those are done?
> 
> As discussed, the profile_daily table runs back to 2017-01-01. The testpilot
> tables should run back to 2017-01-10, the start date of the experiment.

I started the backfill last night, but it looks like we hit some trouble. There is a sudden increase in testpilottest pings from other experiments before 2017-01-27. The backfill job is choking, so I'm going to tune the performance and restart it. I'll keep this bug updated with progress.
(Assignee)

Comment 10

7 months ago
I thought I updated this last night, but I guess I failed to save. 

TL;DR: All of the data is backfilled; we're waiting on re:dash to catch up.

The backfill job was taking far too long; for context, I stopped the 2017-01-27 job after 5.5 hours. Last night I made some performance changes and got the execution time down to roughly 20 minutes per affected day. I restarted the backfill last night, and it completed this morning.

There was an additional issue where parquet was choking on the new boolean value introduced in Bug 1340318. It took me a while to figure that out this morning, but we should now be good to go. I started a backfill for the profile_daily table as well, which is now complete.

The data on S3 is now complete; we're waiting on re:dash to pick up the updated data. Let me know if anything doesn't look right.
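
For the record, a generic workaround for that kind of parquet schema mismatch (older partitions predating a newly added column) is to merge schemas at read time; this is a standard Spark option, not necessarily the exact fix applied to this job.

# Reconcile old and new partition schemas at read time so older partitions
# without the new boolean column can still be read alongside newer ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("s3://telemetry-parquet/harter/cliqz_testpilottest/v1/")
)
df.printSchema()
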
(Assignee)

Comment 11

7 months ago
Just checked and the data appears to be there.
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → FIXED
(Assignee)

Updated

7 months ago
Depends on: 1342194
(Assignee)

Updated

7 months ago
See Also: → bug 1344260