A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Server and For Enabling Lazy Validation(eg saving $$$ on AWS)

Status: RESOLVED FIXED
Opened: 6 years ago
Last Resolved: 3 years ago
People

(Reporter: taras.mozilla, Unassigned)

Tracking Flags: (Not tracked)

(Reporter)

Description

6 years ago
So now that we include the JS schema (bug 832007) with every submission, most of the packet is redundant (e.g. histogram name, parameters, bucket values). We can change the format of the packet to be
histogram_id:[sum, log_sum, log_sum_squares, b1, b2, b3...]
Where:
* histogram_id: an integer id referring to the histogram in the JSON file (can be a line number or something else)
* sum, log_sum, log_sum_squares are obvious
* b1, b2, b3 ... are bucket values

According to my calculations, histograms constitute >90% of the packet, and this scheme can result in a ~5x reduction in histogram payload, i.e. a ~4x (conservative) reduction in uncompressed payload size.
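The compaction described above can be sketched as follows. This is a minimal illustration, not the actual telemetry code: the verbose field names (`values`, `sum`, `log_sum`, `log_sum_squares`), the sample histogram, and the name-to-id mapping are all assumptions for the example.

```python
import json

# Verbose per-histogram JSON as submitted (field names assumed for illustration).
verbose = {
    "CYCLE_COLLECTOR": {
        "range": [1, 10000],
        "bucket_count": 50,
        "histogram_type": 0,
        "values": {"1": 3, "13": 7, "52": 1},
        "sum": 120,
        "log_sum": 14.2,
        "log_sum_squares": 41.7,
    }
}

# Assumed schema mapping histogram names to integer ids
# (e.g. a line number in the shipped JSON schema file).
schema_ids = {"CYCLE_COLLECTOR": 17}

def compact(name, h):
    """Encode as histogram_id: [sum, log_sum, log_sum_squares, b1, b2, ...].

    Simplification: only the buckets present in the ping are emitted, in
    key order; a real encoder would lay buckets out positionally per the schema.
    """
    buckets = [h["values"][k] for k in sorted(h["values"], key=int)]
    return schema_ids[name], [h["sum"], h["log_sum"], h["log_sum_squares"]] + buckets

hid, packed = compact("CYCLE_COLLECTOR", verbose["CYCLE_COLLECTOR"])
print(json.dumps({hid: packed}))  # {"17": [120, 14.2, 41.7, 3, 7, 1]}
```

The redundant name, range, and type fields all drop out because the schema already carries them; only the per-ping values survive.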

Alternatively, if we do not care about packet size, we can implement this on the server side as part of the validation step (instead of merely adding an extra field to the packet).

Comment 1

6 years ago
If considering a format change:  Try separating each histogram from the ping, marking each with the ping id, and grouping identical-named histograms together.  This will increase size on disk, but reduce the total data that needs to be read at any one time (assuming only a few histograms are of interest at any one time).  These grouped histograms should compress** well because of their similarity to each other.


** "compress" in the general sense, not just the zip sense.
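The grouping idea from the comment above can be sketched as a simple pivot: per-ping records are regrouped into per-histogram collections tagged with the ping id, so a query over one histogram reads one group. The ping structure and field names here are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical pings, each carrying several named histograms.
pings = [
    {"ping_id": "p1", "histograms": {"GC_MS": [5, 2], "CYCLE_COLLECTOR": [1]}},
    {"ping_id": "p2", "histograms": {"GC_MS": [4, 3]}},
]

# Pivot: group identical-named histograms together, marked with ping ids.
by_name = defaultdict(list)
for ping in pings:
    for name, buckets in ping["histograms"].items():
        by_name[name].append({"ping_id": ping["ping_id"], "buckets": buckets})

# Reading only GC_MS now touches a single group instead of every ping.
print(by_name["GC_MS"])
```

Similar records stored adjacently like this also compress well, which is the "general sense" compression the comment refers to.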
(Reporter)

Comment 2

6 years ago
(In reply to Kyle Lahnakoski from comment #1)
> If considering a format change:  Try separating each histogram from the
> ping, marking each with the ping id, and grouping identical-named histograms
> together.  This will increase size on disk, but reduce the total data that
> needs to be read at any one time (assuming only a few histograms are of
> interest at any one time).  These grouped histograms should compress** well
> because of their similarity to each other.
> 
No, the point of this proposal is to reduce disk footprint. In theory one could have a separate log file or table for every histogram, but the costs of that would outweigh any perceived benefits.

Comment 3

6 years ago
Taras, starting to pick this back up and see what progress we can make in the short term.

I did have a request for you. If we are looking at building a translation/encoder to be run on the server side, could you do a little investigation into protobuf and give some thought to whether we should just have the encoder write the payloads out in that format instead? It would provide even better schema validation, and it would be yet another significant gain in payload size.
(Reporter)

Comment 4

6 years ago
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #3)
> Taras, starting to pick this back up and see what progress we can make in
> the short term.
> 
> I did have a request for you.  If we are looking at building a translation /
> encoder to be run on the server side, could you do a little investigation
> into protobuf and give some thought to whether we should just have the
> encoder write the payloads out into that format instead? It would provide
> even better schema validation, and it would be yet another significant gain
> in payload size.

I'm a bit skeptical of protobuf. I tried a few other binary encoders to see if they would result in a smaller raw packet; the JSON packet was always smaller when compressed. The histogram structure above is already minimal, so the extra complexity of dealing with a binary schema won't be worth it.
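A comparison like the one described above can be reproduced along these lines; this is only a measurement sketch, and the sample payload is illustrative (real conclusions depend on real pings, and very small payloads can even grow under gzip).

```python
import gzip
import json

# A compact-format histogram payload (illustrative sample data).
payload = {"17": [120, 14.2, 41.7, 3, 7, 1]}

raw = json.dumps(payload, separators=(",", ":")).encode()
compressed = gzip.compress(raw)

# Compare raw JSON size against its gzip-compressed size.
print(len(raw), len(compressed))
```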

Comment 5

5 years ago
Here is an initial spec for an updated storage format:
https://github.com/mreid-moz/telemetry-server/blob/master/StorageFormat.md

There are a few changes from the initial description above, namely that we will continue to use the Histogram Name instead of using a numeric Histogram ID (unless it's shown to use too much space) and that the metadata fields (sum, log_sum, etc) are moved to the end of the array for convenience.

Comment 6

5 years ago
Updated link:
https://github.com/mreid-moz/telemetry-server/blob/master/docs/StorageFormat.md

I've also added "sum_squares_lo" and "sum_squares_hi" to the metadata fields at the end of the array.
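Putting comments 5 and 6 together, the stored record for one histogram looks roughly like this. The bucket values and numbers are made up, and the exact ordering of the five trailing metadata fields is an assumption; the spec linked above is authoritative.

```python
# Revised storage layout: keyed by histogram name (not numeric id),
# bucket counts first, metadata fields moved to the end of the array.
histogram = {
    "CYCLE_COLLECTOR": [
        3, 7, 1,   # bucket values b1..bn (sample data)
        120,       # sum
        14.2,      # log_sum
        41.7,      # log_sum_squares
        0,         # sum_squares_lo (position assumed)
        0,         # sum_squares_hi (position assumed)
    ]
}

print(histogram["CYCLE_COLLECTOR"])
```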
(Reporter)

Updated

5 years ago
Blocks: 922743
(Reporter)

Updated

5 years ago
Depends on: 920169
(Reporter)

Updated

5 years ago
Summary: A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Hadoop Cluster and For Making Validation Beneficial → A Modest Proposal for Preventing the Histograms of Poor Developers From Being a Burden to Server and For Enabling Lazy Validation(eg saving $$$ on AWS)

Comment 8

3 years ago
This has been implemented in the server-side data storage for ages.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
No longer depends on: 920169