Closed Bug 1304662 Opened 8 years ago Closed 8 years ago

Increase size of files on S3

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Unassigned)

References

Details

(Whiteboard: [SvcOps])

User Story

We have received complaints that analyses on heartbeat data are slow. I have run some simple benchmarks to compare the Scala & Python API on one day of data on a single machine:

Scala Dataset timings:

time{
  Dataset("telemetry")
  .where("docType"){
    case "heartbeat" => true
  }.where("submissionDate"){
    case x if x == "20160607" => true
  }.where("appName"){
    case "Firefox" => true
  }.where("appUpdateChannel"){
    case "beta" => true
  }.summaries()
  .length
}

time: 860.778365ms
res4: Int = 760

time{ 
  Dataset("telemetry")
  .where("docType"){ 
    case "heartbeat" => true 
  }.where("submissionDate"){ 
    case x if x == "20160607" => true 
  }.where("appName"){ 
    case "Firefox" => true 
  }.where("appUpdateChannel"){ 
    case "beta" => true 
  }.records()
  .count() 
}

time: 26486.169233ms
res2: Long = 1892


Python Dataset timings:

Dataset.from_source('telemetry') \
    .where(docType = 'heartbeat') \
    .where(submissionDate = '20160607') \
    .where(appName = 'Firefox') \
    .where(appUpdateChannel = "beta") \
    ._summaries()
Wall time: 2.41 s

Dataset.from_source('telemetry') \
    .where(docType = 'heartbeat') \
    .where(submissionDate = '20160607') \
    .where(appName = 'Firefox') \
    .where(appUpdateChannel = "beta") \
    .records(sc) \
    .count()
Wall time: 1min 14s


The Scala version is nearly 3x faster, both in fetching the summaries and in running a simple count, which is somewhat expected. That said, both are pretty slow all things considered, since we don't have a lot of heartbeat data. Further investigation into the size of the data received on that day reveals a staggering number of tiny files:

aws s3 ls --summarize --human-readable --recursive s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-2/20160607/telemetry/4/heartbeat/Firefox/beta/ | tail -n2
Total Objects: 760
Total Size: 7.2 MiB

No wonder fetching the data takes that long! Ideally there should be a single file instead of 760 files with an average size of about 10 KB each.
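
For reference, the same tally can be computed with boto3; this is only a minimal sketch, using the bucket and prefix from the listing above:

import boto3

# Count the objects under a prefix and compute their total and average size.
bucket = "net-mozaws-prod-us-west-2-pipeline-data"
prefix = "telemetry-2/20160607/telemetry/4/heartbeat/Firefox/beta/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count, total = 0, 0
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        count += 1
        total += obj["Size"]

print("objects: %d, total: %.1f MiB, avg: %.1f KiB"
      % (count, total / 2.0 ** 20, total / 1024.0 / max(count, 1)))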

Attachments

(1 file)

      No description provided.
Flags: needinfo?(whd)
Blocks: 1295359
This is a problem with all "small" dimensions - we have both size- and time-based limits on upload, so that we don't buffer data for too long (or accumulate huge files) before flushing to long-term storage on S3.
Would it be possible to flush with small time-limits, and have an aggregation job that batches the small files?
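A rough sketch of what such an aggregation job could look like, assuming the stored format tolerates simple concatenation; the destination key below is hypothetical, and the batched object is written to a separate staging prefix rather than in place:

import boto3

# Sketch: concatenate the small objects under a source prefix into one
# larger object under a staging prefix. Assumes the on-disk record format
# can simply be concatenated; the destination key is made up.
bucket = "net-mozaws-prod-us-west-2-pipeline-data"
src_prefix = "telemetry-2/20160607/telemetry/4/heartbeat/Firefox/beta/"
dst_key = "staging/20160607/heartbeat/Firefox/beta/part-00000"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

chunks = []
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    for obj in page.get("Contents", []):
        chunks.append(s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read())

# Write the batched version somewhere else first; swapping it into place
# and deleting the originals would be a separate follow-up step.
s3.put_object(Bucket=bucket, Key=dst_key, Body=b"".join(chunks))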
Hello peeps!  Thanks for looking into this!

I wonder if 'OTHER' has the same issues?  That is where my analyses pull from.
(In reply to Mark Reid [:mreid] from comment #1)
> This is a problem with all "small" dimensions - we have both size- and
> time-based limits on upload, so that we don't buffer data for too long (or
> accumulate huge files) before flushing to long-term storage on S3.

I think we have to increase the time-based limit as it's clearly not doing a great job, unless as Frank suggested we have another entity batching smaller files. The former should be easier to implement though.
(In reply to Frank Bertsch [:frank] from comment #2)
> Would it be possible to flush with small time-limits, and have an
> aggregation job that batches the small files?

We could do that, but S3's consistency guarantees are such that we couldn't batch in place; we'd have to write the small files somewhere else, then move the batched versions into place.

(In reply to Gregg Lind (Fx Strategy and Insights - Shield - Heartbeat ) from comment #3)
> I wonder if 'OTHER' has the same issues?  That is where my analyses pull
> from.

Yup, it would potentially have the same issue.


(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #4)
> I think we have to increase the time-based limit as it's clearly not doing a
> great job, unless as Frank suggested we have another entity batching smaller
> files. The former should be easier to implement though.

We could do that, though it would increase the latency before data is available to get_pings().

Another possibility would be to accumulate data for a longer period, but overwrite on S3 (using the same object name) every X minutes. We would be subject to S3's eventual consistency, but we could sort of "append" by using that approach.
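
A minimal sketch of that overwrite-in-place idea, with hypothetical names: accumulate records locally and re-upload the whole buffer to the same S3 key every X minutes, so readers eventually see one growing object per flush window.

import time
import boto3

# Sketch: keep appending records to a local buffer and periodically
# overwrite the *same* S3 object with everything accumulated so far.
# Bucket, key, and interval are illustrative.
bucket = "net-mozaws-prod-us-west-2-pipeline-data"
key = "telemetry-2/20160607/telemetry/4/heartbeat/Firefox/beta/node0-part0"
flush_interval_seconds = 10 * 60

s3 = boto3.client("s3")
buffer = bytearray()

def on_record(record_bytes):
    buffer.extend(record_bytes)

def flush_forever():
    while True:
        time.sleep(flush_interval_seconds)
        if buffer:
            # Same object name every time, so the file "grows" in place
            # (subject to S3's eventual consistency for overwrites).
            s3.put_object(Bucket=bucket, Key=key, Body=bytes(buffer))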
Is this problem resolved by the fix in Bug 1304693?
Flags: needinfo?(rvitillo)
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #4)
> > I think we have to increase the time-based limit as it's clearly not doing a
> > great job, unless as Frank suggested we have another entity batching smaller
> > files. The former should be easier to implement though.
> 
> We could do that, though it would increase the latency before data is
> available to get_pings().

I don't think that's an issue, as long as we agree on a reasonable maximum latency (1 hour?). I would rather have our analysis jobs run much faster than have the data visible right away, considering that for real-time needs we have Hindsight.
Flags: needinfo?(rvitillo)
Whiteboard: [SvcOps]
Summary: Heartbeat data stored in very inefficient format on S3 → Increase size of files on S3
I can change the time-based limit to an hour (from 15 min), but this does mean that if we lose a node we are going to lose more data. Additionally, we currently need 5 nodes to handle the full telemetry stream, so the already sparse data is being split 5 ways. I think we're only going to increase the current file size by about 4x by batching four times as long. Another option for sparse docTypes is to set up a different kafka topic like we did for testpilot and batch that data much more aggressively and using only a single node, but in that case we still have the data loss problem.

As long as people (:mreid?) are ok with potential extra data loss I will make the change. FWIW, we have only seen a few node failures of the type I am describing in the last year or so.
Flags: needinfo?(whd)
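For a rough sense of the numbers, a back-of-the-envelope check of that 4x estimate, using the figures from the listing in the user story:

# Figures observed in the S3 listing above.
total_mib = 7.2
objects = 760

avg_kib = total_mib * 1024 / objects
print("current average object size: ~%.1f KiB" % avg_kib)   # ~9.7 KiB

# Flushing every 60 min instead of every 15 min should, all else equal,
# cut the object count ~4x and grow the average object ~4x.
print("expected object count: ~%d" % (objects // 4))         # ~190
print("expected average size: ~%.0f KiB" % (avg_kib * 4))    # ~39 KiB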
https://github.com/mozilla-services/puppet-config/pull/2309/

This is a blanket change to all telemetry doc types. I suppose we could set up the message matchers to create a separate output for each small data set if we wanted to (and in fact we do this in some places), but this would be the first data set to share an existing prefix while using a different max-file-age scheme for certain doc types.
Flags: needinfo?(mreid)
(In reply to Wesley Dawson [:whd] from comment #8)

> Another option for sparse
> docTypes is to set up a different kafka topic like we did for testpilot and
> batch that data much more aggressively and using only a single node, but in
> that case we still have the data loss problem.

I like this solution better; in fact, we could probably come up with smarter batching logic based on the whole prefix of a submission.

> As long as people (:mreid?) are ok with potential extra data loss I will
> make the change. FWIW, we have only seen a few node failures of the type I
> am describing in the last year or so.

Is the data loss caused by batching submissions on the edge nodes before dumping them to Kafka?
We can still recover from the loss of an S3-loader EC2 instance by re-processing from Kafka, right? As in, re-read the Kafka queue, limit it to a single day's timestamps (or match submissionDate), and replace the data for that entire day.

As long as we have a means like the above (or relying on "landfill" data) for recovering from the (rare) loss of a machine, I'm OK with increasing the timeout to 1hr.

We will need to increase (and add, where missing) the delay before kicking off our scheduled daily jobs via Airflow to ensure that we've waited for all the data to arrive for the previous day. Specifically, [1] should be increased to allow for the 1 hour delay.

We should also send a message to the relevant mailing lists, as devs occasionally need to wait for this timeout to see if their test submissions have arrived in long-term storage.


[1] https://github.com/mozilla/telemetry-airflow/blob/master/dags/main_summary.py#L21
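
One way to express that kind of delay in an Airflow DAG is to gate the first task on a TimeDeltaSensor; the sketch below only illustrates the approach, not how [1] actually implements it, and the DAG/task names are made up.

from datetime import datetime, timedelta

from airflow import DAG
# Import path varies by Airflow version; this is the 1.x location.
from airflow.operators.sensors import TimeDeltaSensor

default_args = {
    "owner": "telemetry",              # illustrative values only
    "start_date": datetime(2016, 9, 22),
}

dag = DAG("main_summary_example", default_args=default_args,
          schedule_interval="@daily")

# Wait long enough past midnight UTC for the (now 1 hour) S3 flush window
# to close before any downstream task reads the previous day's data.
wait_for_s3_flush = TimeDeltaSensor(
    task_id="wait_for_s3_flush",
    delta=timedelta(hours=2),          # 1h flush window plus some slack
    dag=dag,
)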
Flags: needinfo?(mreid)
(In reply to Mark Reid [:mreid] from comment #11)
> We can still recover from the loss of an S3-loader EC2 instance by
> re-processing from Kafka, right? As in, re-read the Kafka queue, limit it to
> a single day's timestamps (or match submissionDate), and replace the data
> for that entire day.
> 

Yes.

I'll make the change as-is and we can investigate better options separately.
This change has been deployed, and an email has been sent out.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard