mreid did some analysis of data volumes and costs, which raised questions about the storage model, the data formats that clients actually want to process, and the ping formats that we are sending. mreid is driving the conversation and working toward a proposal for any changes (to storage formats, ping formats sent by the client, etc.).
Telemetry: the monthly cost for storing and processing telemetry is $8-10k per month (roughly 50/50 storage versus processing).

FHR: FHR costs us $20,875 per month in hardware and support costs (Vertica + Bagheera + Hadoop = $5,366.48 + $898 + $14,611):
- Vertica: FHR is 56% of Vertica ($9,583 per month) = $5,366.48 per month
- Bagheera hardware (100% FHR): $32,325 total, amortized monthly over 3 years = $898 per month
- Hadoop: FHR is 59% of Hadoop ($24,765 total per month), so 59% = $14,611 per month. The $24,765 total breaks down as:
  - Pythian Hadoop = $9,440 per month
  - Cloudera support = $6,710 per month
  - Peach hardware costs = $310,155 total, amortized monthly over 3 years = $8,615 per month
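The FHR cost breakdown above can be sanity-checked with a little arithmetic. All dollar figures and percentages come from the comment itself; the 36-month amortization is the 3-year schedule it states.

```python
MONTHS = 36  # 3-year amortization, as stated above

# Vertica: FHR is 56% of a $9,583/month cluster
vertica_fhr = 0.56 * 9_583            # = 5,366.48

# Bagheera: 100% attributed to FHR, $32,325 hardware over 3 years
bagheera_fhr = 32_325 / MONTHS        # ~ 898

# Hadoop: $24,765/month total (Pythian + Cloudera + Peach), FHR share is 59%
hadoop_total = 9_440 + 6_710 + 310_155 / MONTHS   # ~ 24,765
hadoop_fhr = 0.59 * 24_765            # ~ 14,611

fhr_total = vertica_fhr + bagheera_fhr + hadoop_fhr
print(round(fhr_total))               # ~ 20,875 (small rounding drift)
```

The stated $20,875/month figure matches the sum of the rounded per-line amounts ($5,366.48 + $898 + $14,611); the unrounded sum differs by under a dollar.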
What about data center / colo fees for peach?
will update bug with the data - apologies for its omission
via Sheeri: The data center cost for FHR is an additional $10,200 per month, mostly because there is a lot of wasted space in the data center; getting rid of the hardware behind FHR would free us up to no longer need that space (calculated as 20% of 60% of the total wasted space). If we go with absolute cost instead, the FHR data center cost is $1,100 per month. So overall it is either $21,975 or $31,075 per month, depending on how you calculate "data center cost".
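The two "overall" figures above follow directly from adding each data center number to the $20,875/month hardware-and-support total. All figures are from the comments above.

```python
fhr_hw_support = 20_875    # hardware + support total from the earlier breakdown

dc_absolute = 1_100        # absolute data center cost attributed to FHR
dc_wasted_space = 10_200   # 20% of 60% of the wasted space, attributed to FHR

print(fhr_hw_support + dc_absolute)      # 21975
print(fhr_hw_support + dc_wasted_space)  # 31075
```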
Long-term storage of the main corpus of data will be in Amazon S3. Data is stored in immutable chunks (up to 300MB), partitioned by a number of dimensions including submission date, document version, document type, application name, update channel, application version, and build id.

Each message will be individually compressed using the Snappy compression scheme. Messages are compressed individually instead of as entire files so that we retain the ability to index individual messages by clientId.

To facilitate longitudinal data analysis, we will additionally index the data by clientId. This provides the ability to fetch data for a given set of clientIds, albeit at a slower rate than fetching data by "dimension" as described above.

We will additionally store a number of derived data sets for specific purposes. These data sets will capture specific signals in the data, or will capture information about different subsets of the data. Each will be relatively small compared to the full data set, and may use a different backend storage mechanism than S3.

We also retain a "raw" copy of the data as received directly from Firefox. This gives us a failsafe if our data decoding and validation fails or needs to be modified.

Data may be read back out of S3 for analysis using Heka and Spark, and possibly other analysis frameworks in the future.

** Note: This will be significantly more expensive than pre-unification Telemetry and FHR were. ** Pre-unification, Telemetry was opt-in on the Release channel, resulting in far less data being submitted and stored. Even with the smaller "base measures" payloads, the much larger volume of Release submissions will far outweigh all other channels. FHRv2's data representation effectively required only approximately 1kb of storage per day. Unified Telemetry payloads, on the other hand, are more than 100kb each, and are submitted 5-10 times per day on average. As such, we are accumulating a much larger volume of data than FHRv2.
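The dimension-partitioned layout described above can be sketched as a key-building function. The exact key format and separator are assumptions for illustration; the dimensions themselves (submission date, document version, document type, application name, update channel, application version, build id) are the ones listed in the comment, and the example values are hypothetical.

```python
def s3_key(submission_date, doc_version, doc_type, app_name,
           channel, app_version, build_id, chunk_id):
    """Build a hypothetical S3 object key from the partition dimensions.

    Each dimension becomes one path segment, so listing a prefix
    retrieves all chunks sharing those dimension values.
    """
    return "/".join([
        submission_date,   # e.g. "20150601"
        doc_version,       # e.g. "4"
        doc_type,          # e.g. "main"
        app_name,          # e.g. "Firefox"
        channel,           # e.g. "release"
        app_version,       # e.g. "38.0.5"
        build_id,          # e.g. "20150525141253"
        f"{chunk_id}.snappy",
    ])

print(s3_key("20150601", "4", "main", "Firefox", "release",
             "38.0.5", "20150525141253", "chunk-000"))
# 20150601/4/main/Firefox/release/38.0.5/20150525141253/chunk-000.snappy
```

Ordering the most commonly filtered dimensions first (here, submission date) keeps prefix listings cheap for the common "fetch one day of one document type" query; the separate clientId index covers the longitudinal case that this layout serves poorly.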
Notes: The 300MB chunk size is a configuration setting and may be tweaked in the future; see Bug 1175573. The 100kb payload figure is based on pre-release "extended" submissions; "base" submissions are expected to be closer to 20kb.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED