Some clients are sending a large number of pings per day (2,300?), which ties up server resources (e.g. disk space & processing power). We can aggregate session information so that clients really only need to send at most once per day. Let's try to batch pings (keep it simple!) after a certain point during a day.

Question(s) to answer (NI georg):
* When we aggregate session data, is it still correlated back to the day on which the ping was created? For example, if we aggregate pings from Feb. 1 into a ping sent on Feb. 2, are we concerned that the user may have used Fennec for 18 hrs on Feb. 1 but that usage is associated with the ping from Feb. 2?
Georg answered this question via Vidyo: when aggregating, it would be a problem if the ping creation date did not match the day the sessions were recorded but, fwiw, we already have that issue – see bug 1277091 comment 3.

Next steps:
* (georg) Analyze the data to discover if this is actually a problem
* (me) Design a potential implementation to do this, even if I'm not the one to implement it
The problem we're trying to solve is: analysis of pings on the server takes too long. Some options:
* Convert the raw data into an intermediate, faster-to-query format on the server (e.g. maybe a relational database?)
* Change the upload format from the client to something faster (or easier to convert) on the server
* Reduce the amount of data acted upon by the server:
  - Submit data that is unlikely to change (e.g. architecture, device) less frequently
  - Only submit data that has changed (e.g. only send the app version once until it changes again)
  - Collapse redundant pings (i.e. pings created on the same day)

While we should consider the other solutions, I will be discussing a possible implementation of the last one, collapsing redundant pings, because that is what this bug is about. We can do this on either the:

--- Server
+ No loss of data (e.g. by combining separate pings)
- User bandwidth is not reduced (I can't speak for all of the effects)

--- Client
(Opposite of the above)
- Adds a fair amount of complexity: there is no good way of specifying JSON object equality in Java, so we'd have to do something :( like 1) write a method to do JSON equality for our objects, 2) continuously update that equality code to match our schema changes, or 3) write our schemas in separate files and generate our comparison code from those schemas
- Vulnerable to client-side clock skew (though the server-side method wouldn't be much better)
- Both implementations run in generic code (i.e. for any ping type), so we'd have to generalize the ping-collapsing code
---

I'll explain a possible client implementation. The client currently creates a ping on startup and stores it to disk; the background uploader then starts and reads from disk to find all of the pings that still haven't been uploaded (including the one it just created). Pings can be combined when they are from the same date and their non-aggregate-able fields (e.g. Strings) are equal.
Note that this *includes* the stored URL (with the exception of the docID). We can combine pings when...

1) The ping is initially created. We'd read the last ping stored on disk, determine if it's combine-able, and if so, modify the current ping on disk. Trade-offs:
+ Saves disk space by collapsing data
+ Since we reduce the stored-ping count as soon as pings are created, we're less likely to have pings pruned away (i.e. data loss)
+ Only have to do these computations once
- Additional disk accesses (stat, read & write), which can affect performance
- Our current file-locking scheme may become more complex
- The store is conceptually more complicated: it has to know how to combine pings, which can get painful since the code is so generic

2) We attempt to upload all pending pings. Trade-offs:
+ Don't have to modify data that is already on disk, which would become less reliable with each ping we collapse (e.g. debugging errors becomes much more difficult)
+ Simpler conceptually – all we're adding is `List<TelemetryPing> pings = pingCombiner.combinePings(store.getAllPings());` & some book-keeping
- More reading from disk during upload than in 1)
- Have to recompute each time we try to upload
- More likely data loss, since we don't collapse ping storage in the store and pings may get pruned before we have a chance to upload them

Still figuring out which approach I prefer...
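To make approach 2 concrete, here's a rough sketch of what `combinePings` might look like. This is only a sketch under assumed simplifications: the `TelemetryPing` fields here (`creationDate`, `appVersion` standing in for all non-aggregate-able fields, `sessionDurationSecs` standing in for the aggregate-able ones) are hypothetical; the real ping is a full JSON payload.

```java
import java.util.*;

class TelemetryPing {
    // Hypothetical simplified ping; the real one carries a full JSON payload.
    final String creationDate;    // e.g. "2016-02-01"
    final String appVersion;      // stands in for the non-aggregate-able fields
    long sessionDurationSecs;     // stands in for the aggregate-able fields

    TelemetryPing(String creationDate, String appVersion, long sessionDurationSecs) {
        this.creationDate = creationDate;
        this.appVersion = appVersion;
        this.sessionDurationSecs = sessionDurationSecs;
    }
}

public class PingCombiner {
    /**
     * Collapses pings created on the same day whose non-aggregate-able
     * fields match, summing the aggregate-able fields into one ping.
     */
    public List<TelemetryPing> combinePings(List<TelemetryPing> pending) {
        // Key on (date, non-aggregate-able fields); pings sharing a key get merged.
        Map<List<String>, TelemetryPing> merged = new LinkedHashMap<>();
        for (TelemetryPing ping : pending) {
            List<String> key = Arrays.asList(ping.creationDate, ping.appVersion);
            TelemetryPing existing = merged.get(key);
            if (existing == null) {
                // Copy so we don't mutate the pending pings still on disk.
                merged.put(key, new TelemetryPing(
                        ping.creationDate, ping.appVersion, ping.sessionDurationSecs));
            } else {
                existing.sessionDurationSecs += ping.sessionDurationSecs;
            }
        }
        return new ArrayList<>(merged.values());
    }
}
```

Note this keeps the book-keeping out of the store itself, which is the main appeal of approach 2.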
I like 2 for the conceptual simplicity, but I'm leaning towards 1 because the code runs in the background during startup, and the worst case for 2 (comparing a current limit of ~40 JSON objects across however many fields) seems bad enough that it could affect startup time. The next implementer should revisit the relevant trade-offs when this actually gets implemented, however.
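For reference, the JSON-equality method mentioned above (option 1 of the client-side complexity list) might be sketched like this, over parsed-JSON-style `Map`/`List` trees rather than `org.json.JSONObject` (which doesn't override `equals`). The `docID` exclusion and the key names in the test are assumptions for illustration.

```java
import java.util.*;

public class PingEquality {
    // Keys that legitimately differ between otherwise-identical pings
    // (assumed set; the docID changes on every ping).
    private static final Set<String> IGNORED_KEYS =
            new HashSet<>(Collections.singletonList("docID"));

    /**
     * True if two parsed-JSON-style ping payloads match,
     * ignoring IGNORED_KEYS at the top level.
     */
    public static boolean combinable(Map<String, Object> a, Map<String, Object> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        for (String key : keys) {
            if (IGNORED_KEYS.contains(key)) {
                continue;
            }
            // Nested Maps/Lists compare deeply via their own equals() implementations.
            if (!Objects.equals(a.get(key), b.get(key))) {
                return false;
            }
        }
        return true;
    }
}
```

The painful part the comment above alludes to is that this comparison has to track every schema change, which is why generating it from separate schema files (option 3) was floated.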
I don't think we've actually run into any problems with the ping volume from core pings so far. If we can handle this volume, then we should probably just close this bug and move on. Frank, what do you think?
Agreed, we can definitely close this out.