A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Server and For Enabling Lazy Validation(eg saving $$$ on AWS)

Status: RESOLVED FIXED
Opened: 6 years ago
Last Resolved: 3 years ago
People

(Reporter: taras.mozilla, Unassigned)

Tracking Flags: (Not tracked)

(Reporter)

Description

6 years ago
So now that we include the JS schema (bug 832007) with every submission, most of the packet is redundant (e.g. histogram name, parameters, bucket values). We can change the format of the packet to be
histogram_id:[sum, log_sum, log_sum_squares, b1, b2, b3...]
Where:
* histogram_id: an integer id referring to the histogram in the JSON file (can be a line number or something else)
* sum, log_sum, log_sum_squares are obvious
* b1, b2, b3 ... are bucket values

According to my calculations, histograms constitute >90% of the packet, and this scheme can result in a ~5x reduction in histogram payload, i.e. a ~4x (conservative) reduction in uncompressed payload size.
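The compaction described above can be sketched as follows. This is a minimal illustration, not the actual telemetry code: the verbose field names (`values`, `sum`, `log_sum`, `log_sum_squares`), the sample histogram, and the name-to-id mapping are all assumptions for the example.

```python
import json

# Verbose per-histogram JSON as submitted (field names assumed for illustration).
verbose = {
    "CYCLE_COLLECTOR": {
        "range": [1, 10000],
        "bucket_count": 50,
        "histogram_type": 0,
        "values": {"1": 3, "13": 7, "52": 1},
        "sum": 120,
        "log_sum": 14.2,
        "log_sum_squares": 41.7,
    }
}

# Assumed schema mapping histogram names to integer ids
# (e.g. a line number in the shipped JSON schema file).
schema_ids = {"CYCLE_COLLECTOR": 17}

def compact(name, h):
    """Encode as histogram_id: [sum, log_sum, log_sum_squares, b1, b2, ...].

    Simplification: only the buckets present in the ping are emitted, in
    key order; a real encoder would lay buckets out positionally per the schema.
    """
    buckets = [h["values"][k] for k in sorted(h["values"], key=int)]
    return schema_ids[name], [h["sum"], h["log_sum"], h["log_sum_squares"]] + buckets

hid, packed = compact("CYCLE_COLLECTOR", verbose["CYCLE_COLLECTOR"])
print(json.dumps({hid: packed}))  # {"17": [120, 14.2, 41.7, 3, 7, 1]}
```

The redundant name, range, and type fields all drop out because the schema already carries them; only the per-ping values survive.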

Alternatively, if we do not care about packet size, we can implement this on the server side as part of the validation step (instead of merely adding an extra field to the packet).

Comment 1

6 years ago
If considering a format change:  Try separating each histogram from the ping, marking each with the ping id, and grouping identical-named histograms together.  This will increase size on disk, but reduce the total data that needs to be read at any one time (assuming only a few histograms are of interest at any one time).  These grouped histograms should compress** well because of their similarity to each other.


** "compress" in the general sense, not just the zip sense.
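The grouping idea from the comment above can be sketched as a simple pivot: per-ping records are regrouped into per-histogram collections tagged with the ping id, so a query over one histogram reads one group. The ping structure and field names here are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical pings, each carrying several named histograms.
pings = [
    {"ping_id": "p1", "histograms": {"GC_MS": [5, 2], "CYCLE_COLLECTOR": [1]}},
    {"ping_id": "p2", "histograms": {"GC_MS": [4, 3]}},
]

# Pivot: group identical-named histograms together, marked with ping ids.
by_name = defaultdict(list)
for ping in pings:
    for name, buckets in ping["histograms"].items():
        by_name[name].append({"ping_id": ping["ping_id"], "buckets": buckets})

# Reading only GC_MS now touches a single group instead of every ping.
print(by_name["GC_MS"])
```

Similar records stored adjacently like this also compress well, which is the "general sense" compression the comment refers to.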
(Reporter)

Comment 2

6 years ago
(In reply to Kyle Lahnakoski from comment #1)
> If considering a format change:  Try separating each histogram from the
> ping, marking each with the ping id, and grouping identical-named histograms
> together.  This will increase size on disk, but reduce the total data that
> needs to be read at any one time (assuming only a few histograms are of
> interest at any one time).  These grouped histograms should compress** well
> because of their similarity to each other.
> 
No, the point of this proposal is to reduce disk footprint. In theory one could have a separate log file or table for every histogram, but the costs of that would outweigh any perceived benefits.

Comment 3

6 years ago
Taras, starting to pick this back up and see what progress we can make in the short term.

I did have a request for you. If we are looking at building a translation/encoder to be run on the server side, could you do a little investigation into protobuf and give some thought to whether we should just have the encoder write the payloads out in that format instead? It would provide even better schema validation, and it would be yet another significant gain in payload size.
(Reporter)

Comment 4

6 years ago
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #3)
> Taras, starting to pick this back up and see what progress we can make in
> the short term.
> 
> I did have a request for you.  If we are looking at building a translation /
> encoder to be run on the server side, could you do a little investigation
> into protobuf and give some thought to whether we should just have the
> encoder write the payloads out into that format instead? It would provide
> even better schema validation, and it would be yet another significant gain
> in payload size.

I'm a bit skeptical of protobuf. I tried a few other binary encoders to see if they would result in a smaller raw packet; the JSON packet was always smaller when compressed. The histogram structure above is already minimal, so the extra complexity of dealing with a binary schema won't be worth it.
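A comparison like the one described above can be reproduced along these lines; this is only a measurement sketch, and the sample payload is illustrative (real conclusions depend on real pings, and very small payloads can even grow under gzip).

```python
import gzip
import json

# A compact-format histogram payload (illustrative sample data).
payload = {"17": [120, 14.2, 41.7, 3, 7, 1]}

raw = json.dumps(payload, separators=(",", ":")).encode()
compressed = gzip.compress(raw)

# Compare raw JSON size against its gzip-compressed size.
print(len(raw), len(compressed))
```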

Comment 5

5 years ago
Here is an initial spec for an updated storage format:
https://github.com/mreid-moz/telemetry-server/blob/master/StorageFormat.md

There are a few changes from the initial description above, namely that we will continue to use the Histogram Name instead of using a numeric Histogram ID (unless it's shown to use too much space) and that the metadata fields (sum, log_sum, etc) are moved to the end of the array for convenience.

Comment 6

5 years ago
Updated link:
https://github.com/mreid-moz/telemetry-server/blob/master/docs/StorageFormat.md

I've also added "sum_squares_lo" and "sum_squares_hi" to the metadata fields at the end of the array.
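Putting comments 5 and 6 together, the stored record for one histogram looks roughly like this. The bucket values and numbers are made up, and the exact ordering of the five trailing metadata fields is an assumption; the spec linked above is authoritative.

```python
# Revised storage layout: keyed by histogram name (not numeric id),
# bucket counts first, metadata fields moved to the end of the array.
histogram = {
    "CYCLE_COLLECTOR": [
        3, 7, 1,   # bucket values b1..bn (sample data)
        120,       # sum
        14.2,      # log_sum
        41.7,      # log_sum_squares
        0,         # sum_squares_lo (position assumed)
        0,         # sum_squares_hi (position assumed)
    ]
}

print(histogram["CYCLE_COLLECTOR"])
```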
(Reporter)

Updated

5 years ago
Blocks: 922743
(Reporter)

Updated

5 years ago
Depends on: 920169
(Reporter)

Updated

5 years ago
Summary: A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Hadoop Cluster and For Making Validation Beneficial → A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Server and For Enabling Lazy Validation(eg saving $$$ on AWS)

Comment 8

3 years ago
This has been implemented in the server-side data storage for ages.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
No longer depends on: 920169