Closed
Bug 856263
Opened 12 years ago
Closed 10 years ago
A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Server and For Enabling Lazy Validation (e.g. saving $$$ on AWS)
Categories
(Toolkit :: Telemetry, defect)
RESOLVED
FIXED
People
(Reporter: taras.mozilla, Unassigned)
Description

Now that we include the JS schema (bug 832007) with every submission, most of the packet is redundant (e.g. histogram name, parameters, bucket values). We can change the format of the packet to:

histogram_id: [sum, log_sum, log_sum_squares, b1, b2, b3, ...]

Where:
* histogram_id: an integer id referring to the histogram in the JSON file (can be a line number or something else)
* sum, log_sum, log_sum_squares are obvious
* b1, b2, b3, ... are the bucket values

According to my calculations, histograms constitute >90% of the packet, so this scheme can yield a ~5x reduction in histogram payload, which works out to a ~4x (conservative) reduction in uncompressed payload size.

Alternatively, if we do not care about packet size, we can implement this on the server side as part of the validation step (instead of merely adding an extra field to the packet).
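For illustration, a minimal sketch of the proposed re-encoding in Python. The verbose input layout and the name-to-id table here are assumptions for the example, not the actual Telemetry code:

def compact_histogram(name, verbose, histogram_ids, bucket_count):
    # Re-encode one verbose histogram as
    # id: [sum, log_sum, log_sum_squares, b1, b2, ...].
    # The schema fixes the bucket count, so we can emit a dense array and
    # drop the bucket labels entirely; absent buckets become 0.
    buckets = [verbose.get("values", {}).get(str(i), 0)
               for i in range(bucket_count)]
    return histogram_ids[name], [verbose["sum"],
                                 verbose.get("log_sum", 0),
                                 verbose.get("log_sum_squares", 0)] + buckets

# Made-up example data:
histogram_ids = {"DNS_LOOKUP_TIME": 42}
verbose = {"values": {"0": 3, "1": 110},
           "sum": 1270, "log_sum": 68.2, "log_sum_squares": 970.1}
print(compact_histogram("DNS_LOOKUP_TIME", verbose, histogram_ids, 4))
# -> (42, [1270, 68.2, 970.1, 3, 110, 0, 0])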
Comment 1 • 12 years ago
If considering a format change: try separating each histogram from the ping, marking each with the ping id, and grouping identically named histograms together. This will increase size on disk, but reduce the total data that needs to be read at any one time (assuming only a few histograms are of interest at any one time). These grouped histograms should compress** well because of their similarity to each other.

**compress in the general sense, not just the zip sense.
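A rough sketch of that grouping idea (the record layout is illustrative, not an implemented format):

from collections import defaultdict

def group_by_histogram(pings):
    # pings: iterable of (ping_id, {histogram_name: record}) pairs.
    # Each histogram is separated from its ping, tagged with the ping id,
    # and appended to the group for its name; each group could then be
    # written to its own file, so reading one histogram of interest only
    # touches that group.
    groups = defaultdict(list)
    for ping_id, histograms in pings:
        for name, record in histograms.items():
            groups[name].append((ping_id, record))
    return groups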
Reporter
Comment 2 • 12 years ago
(In reply to Kyle Lahnakoski from comment #1)
> If considering a format change: try separating each histogram from the
> ping, marking each with the ping id, and grouping identically named
> histograms together. This will increase size on disk, but reduce the total
> data that needs to be read at any one time (assuming only a few histograms
> are of interest at any one time). These grouped histograms should
> compress** well because of their similarity to each other.
>
No. The point of this proposal is to reduce the disk footprint. In theory one could have a separate log file or table for every histogram, but the costs would outweigh any perceived benefits.
Comment 3 • 12 years ago
Taras, I'm starting to pick this back up to see what progress we can make in the short term.

I did have a request for you: if we are looking at building a translation/encoder to run on the server side, could you do a little investigation into protobuf and give some thought to whether we should just have the encoder write the payloads out in that format instead? It would provide even better schema validation, and it would be yet another significant gain in payload size.
Reporter
Comment 4 • 12 years ago
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #3)
> Taras, I'm starting to pick this back up to see what progress we can make
> in the short term.
>
> I did have a request for you: if we are looking at building a
> translation/encoder to run on the server side, could you do a little
> investigation into protobuf and give some thought to whether we should
> just have the encoder write the payloads out in that format instead? It
> would provide even better schema validation, and it would be yet another
> significant gain in payload size.
I'm a bit skeptical of protobuf. I tried a few other binary encoders to see whether they would produce a smaller raw packet; the JSON packet was always smaller once compressed. The histogram structure above is already minimal, so the extra complexity of dealing with a binary schema won't be worth it.
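A quick way to run that kind of comparison yourself (a sketch; the sample payload is made up):

import json, zlib

# A compacted packet in the proposed format (made-up values).
payload = {"42": [1270, 68.2, 970.1, 3, 110, 0, 0]}

json_bytes = json.dumps(payload, separators=(",", ":")).encode("utf-8")
print("raw:", len(json_bytes), "compressed:", len(zlib.compress(json_bytes)))
# Encode the same data with protobuf (or another binary format) and compare
# the compressed sizes; per the tests described above, compressed JSON won.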
Comment 5 • 12 years ago
Here is an initial spec for an updated storage format:
https://github.com/mreid-moz/telemetry-server/blob/master/StorageFormat.md

There are a few changes from the initial description above, namely that we will continue to use the histogram name instead of a numeric histogram ID (unless that is shown to use too much space), and that the metadata fields (sum, log_sum, etc.) are moved to the end of the array for convenience.
Comment 6 • 12 years ago
Updated link:
https://github.com/mreid-moz/telemetry-server/blob/master/docs/StorageFormat.md
I've also added "sum_squares_lo" and "sum_squares_hi" to the metadata fields at the end of the array.
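Based on that description, a converted record would look roughly like the sketch below (histogram name kept as the key, bucket counts first, metadata fields at the end; the name and values are made up, so see StorageFormat.md for the authoritative layout):

converted = {
    "DNS_LOOKUP_TIME": [
        3, 110, 0, 0,   # bucket values
        1270,           # sum
        68.2,           # log_sum
        970.1,          # log_sum_squares
        0,              # sum_squares_lo
        0,              # sum_squares_hi
    ],
}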
Reporter
Comment 7 • 12 years ago
See "converted format" in https://github.com/mreid-moz/telemetry-server/blob/master/docs/StorageFormat.md
Reporter
Updated • 12 years ago
Summary: A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Hadoop Cluster and For Making Validation Beneficial → A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Server and For Enabling Lazy Validation (e.g. saving $$$ on AWS)
Comment 8 • 10 years ago
This has been implemented in the server-side data storage for ages.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED