Bug 1473641 (Closed): Opened 7 years ago, Closed 7 years ago

"main_summary_experiments" and downstream jobs failed on 2018-07-03

Categories

(Data Platform and Tools :: General, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: klukas)

Details

Attachments

(1 file)

It also failed on 2018-07-04. Once it's fixed, we should re-run downstream jobs as well: main_summary_experiments, experiments_aggregates, and experiments_aggregates_import
The error above is from experiments_aggregates, but it doesn't explain the main_summary_experiments failure. It's less obvious what's going on there; I'm looking through the logs.
Assignee: nobody → jklukas
Priority: -- → P1
On each failing run of the job, it looks like there's one node that eventually hits "ERROR FileFormatWriter: Aborting job null." One example is: s3://telemetry-airflow/logs/ssuh@mozilla.com/Experiments Main Summary View/j-A50N63OJCHMW/node/i-0d51339b792db7cc8/applications/spark/spark.log.gz
The main experiments job is definitely failing while trying to write out data, but it's still unclear what's causing that failure.
I've looked at three different clusters that failed, and the failures don't look obviously consistent. In some cases there's an OutOfMemory exception, but not always. In some cases there seem to be networking issues (NoRouteToHostException). It appears the experiments_main and experiments_aggregates failures are unrelated, so I went ahead and cleared the 07/02 run of the aggregates job, which should hopefully succeed now with the update in https://github.com/mozilla/telemetry-batch-view/pull/448, and we can mark at least that half of the issue solved. I'll continue to poke at the experiments_main logs in the meantime.
It's interesting to note that execution times for experiments_main have been trending sharply up. The 6/29 and 6/30 runs each finished in ~3 hours 40 minutes. 7/1 was a little over 4 hours, then 7/2 was nearly 7 hours. 7/3 failed once early, but the second run went for 10 hours before being terminated by our configured timeout. So, the issue here may have actually started on 7/1 or 7/2.
The re-run of experiments_aggregates for 7/2 succeeded, so it looks like sunah's PR was successful, and we should expect future runs of that job to succeed once we get main_summary_experiments running again.
Confirmation that this is a data volume issue: https://sql.telemetry.mozilla.org/queries/56672/source In mid-June, we averaged something like 10 million rows per day in experiments, but volume shot up to 45 million in the last days of June and hit 74 million on 7/2. The job has been failing since then, so presumably the volume is even greater now. I'm going to look at volumes for main_summary next to see whether the problem exists in that source data or whether some logic in the experiments dataset code is to blame.
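For reference, the same per-day volume check can be approximated directly from Spark; the sketch below is only illustrative, and the S3 path and partition column name are assumptions rather than details taken from the linked Redash queries.

// Rough Spark sketch of the per-day volume check; the path and the
// submission_date_s3 column name are assumptions, not taken from the
// Redash queries linked above.
import org.apache.spark.sql.SparkSession

object ExperimentsVolumeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("experiments-volume-check").getOrCreate()

    spark.read
      .parquet("s3://example-bucket/experiments/v1/") // assumed dataset location
      .groupBy("submission_date_s3")                  // assumed partition column
      .count()                                        // rows per day
      .orderBy("submission_date_s3")
      .show(40, truncate = false)

    spark.stop()
  }
}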
It does look like overall experiment enrollment doubled in the past week: https://sql.telemetry.mozilla.org/queries/56674/source For a 1% sample of clients, we saw a total of ~1 million experiment entries a week ago vs. nearly 2 million this week.
We have a suspect! https://sql.telemetry.mozilla.org/queries/56675/source The 'pref-hotfix-tls-13-avast-rollback' experiment has _much_ higher enrollment than any other experiment in the past month. It turned on on 6/28 with ~4 million appearances, which ballooned to 61M by 7/2. This is at least an order of magnitude higher than any other experiment. I'm pretty certain the problem we're hitting is that ExperimentSummaryView partitions by experiment_id when writing, with no cap on partition size, so it's trying to accumulate all 62M records on a single node before writing out an object. If we apply maxRecordsPerFile as we do in MainSummaryView, then I think we'll be good.
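For illustration, here is a minimal sketch of what capping output file size looks like when partitioning by experiment_id in Spark. The paths, record cap, and overall structure are assumptions, not the actual ExperimentSummaryView/MainSummaryView code.

// Minimal sketch of capping output file size while partitioning by
// experiment_id. Paths and the record cap are assumptions, not the values
// used in telemetry-batch-view.
import org.apache.spark.sql.SparkSession

object ExperimentsWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("experiments-write-sketch").getOrCreate()

    val experiments = spark.read.parquet("s3://example-bucket/main_summary_experiments/v1/") // assumed input

    experiments.write
      .mode("overwrite")
      .partitionBy("experiment_id")
      // Spark 2.2+ rolls over to a new output file once this many records have
      // been written (the session-level equivalent is
      // spark.sql.files.maxRecordsPerFile), so one huge experiment no longer
      // has to be written out as a single object.
      .option("maxRecordsPerFile", 500000L)
      .parquet("s3://example-bucket/experiments/v1/") // assumed output

    spark.stop()
  }
}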
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #10)
> The 'pref-hotfix-tls-13-avast-rollback' experiment has _much_ higher
> enrollment than any other experiment in the past month. It turned on on 6/28
> with ~4 million appearances, which ballooned to 61M by 7/2. This is at least
> an order of magnitude higher than any other experiment.

I suspect this is an "experiment" that is targeting 100% of release and should be blacklisted [1]. We should have a better mechanism for catching these before they happen.

[1] https://github.com/mozilla-services/puppet-config/blob/master/pipeline/modules/pipeline/templates/hindsight/output/telemetry_s3.cfg.erb#L37
I've excluded the experiment at the ExperimentsSummaryView level in the tbv PR. Interesting to see there's an equivalent config at the hindsight level.
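Roughly, the exclusion amounts to filtering the offending experiment_id out before the write. The sketch below uses assumed names and is not the actual tbv change; the experiment id is the one identified in comment 10.

// Sketch of the exclusion (assumed names; not the actual tbv diff).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val excludedExperiments = Seq("pref-hotfix-tls-13-avast-rollback")

def dropExcludedExperiments(df: DataFrame): DataFrame =
  df.filter(!col("experiment_id").isin(excludedExperiments: _*))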
The tbv PR is deployed to master and I've restarted the failed runs for Experiments Summary View in Airflow. If all goes well, the experiments aggregates runs should also backfill automatically as the summary jobs complete.
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #13)
> I've excluded the experiment at the ExperimentsSummaryView level in the tbv
> PR. Interesting to see there's an equivalent config at the hindsight level.

For context, see bug #1381954, bug #1416934, bug #1416945, and https://github.com/mozilla/normandy/issues/1106. At any rate, I've deployed the PR from comment #14, so we should stop wasting space storing these.
The three failed runs of experiments_main_summary are now showing as completed in Airflow and the downstream jobs have also run successfully, so resolving.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: Experiments → General