Closed Bug 840651 Opened 12 years ago Closed 11 years ago

Some Telemetry clients are spamming submissions non-stop

Categories: Toolkit :: Telemetry (defect)
Version: 12 Branch
Severity: normal
Status: RESOLVED WONTFIX
People: (Reporter: dre, Unassigned)
Attachments: 1 file (397.53 KB, application/x-zip-compressed)
We have found a small set of submissions from clients that are constantly re-sending the same submission, sometimes 2 or 3 times per second. Most of the cases that we have linked to actual submissions have so far been old Firefox versions (12, 13, 14) with very large uptimes (> 2^32 ms). A conversation with Taras and Nathan on IRC indicates they think there might be a problem with the idle timer if uptime rolls over a 32-bit int. There was also a reference by them to an old Windows XP bug related to long uptimes, but I don't have a bug number. I can make some example payloads available through a secure channel, as well as a separate analysis of the trend of IPs or UAs sending the data.
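[Editorial note: the rollover theory above is easy to quantify. A minimal sketch, assuming the counter in question is an unsigned 32-bit millisecond tick count (as Windows' GetTickCount is); the helper name is hypothetical:]

```python
# Sketch of the suspected failure mode from the IRC conversation: a
# millisecond uptime counter held in an unsigned 32-bit integer wraps
# back to a small value after 2**32 ms.
WRAP_MS = 2 ** 32

def wrapped_uptime_ms(true_uptime_ms):
    """What a 32-bit millisecond counter would report for a given true uptime."""
    return true_uptime_ms % WRAP_MS

# The counter rolls over after roughly 49.7 days:
days_until_wrap = WRAP_MS / (1000 * 60 * 60 * 24)
```

A client up for 50 days would thus report only a fraction of a day of uptime, which could confuse any timer logic comparing against it.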
Providing example payloads from each of the problematic clients would be very helpful in trying to understand the problem.
On IRC you mentioned "the smallest set of useful data" and above you mentioned example payloads. In the event that it's easier to provide individual fields and histograms from the data, I'd like to see:

simpleMeasurements.uptime
info.appVersion
info.appBuildID
info.platformBuildID
info.OS
info.appUpdateChannel

and the values for these histograms (if they don't exist, that's fine):

TELEMETRY_PING
TELEMETRY_SUCCESS
READ_SAVED_PING_SUCCESS

for as many of the offending clients as you can gather (max 100-200 or so).
Some preliminary data I can deliver tonight. This doesn't answer :froydnj's asks yet, but it might be useful regardless. I ran a scan of the access logs for 2012-02-11-00 PDT. In that one hour period we had:

1,097,858 submission requests
3,161 unique IP addresses
2,104 unique IP + UA combinations
354 unique IP + UA + submission ID combinations
426 unique submission IDs

most_common_ip_ua_id (IP not listed for privacy):
8376  1  Firefox/11.0    ddca3530-4c56-4682-a3b2-f94cfe76a127
3321  2  Firefox/14.0.1  8ce123cb-d517-49c4-aade-482ae1072ad7
3120  3  Firefox/17.0    4b74d645-8a74-ca44-a463-39c57b27db67
 700  4  Firefox/17.0    795f3ab5-2b45-e144-92e5-6d83d68e8cd2
 535  5  Firefox/18.0    723b611e-d9aa-4d0c-9668-e126d150dda7

most_common_ua:
957031  Firefox/18.0
 30151  Firefox/17.0
 29276  Firefox/19.0
 15506  Firefox/20.0
 11893  Firefox/11.0
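[Editorial note: uniqueness counts like the ones above can be produced in a single pass over the log. This is only an illustrative sketch; the `summarize` helper and its tuple input shape are hypothetical, since the actual log parsing isn't shown in this bug:]

```python
from collections import Counter

def summarize(records):
    """Summarize one hour of access-log traffic.

    `records` is an iterable of (ip, ua, submission_id) tuples; the tuple
    shape is an assumption, standing in for real access-log parsing.
    """
    ips, ip_ua, ids = set(), set(), set()
    ip_ua_id = Counter()
    total = 0
    for ip, ua, sid in records:
        total += 1
        ips.add(ip)
        ip_ua.add((ip, ua))
        ids.add(sid)
        ip_ua_id[(ip, ua, sid)] += 1
    return {
        "requests": total,
        "unique_ips": len(ips),
        "unique_ip_ua": len(ip_ua),
        "unique_ip_ua_id": len(ip_ua_id),
        "unique_ids": len(ids),
        "most_common": ip_ua_id.most_common(5),  # like most_common_ip_ua_id above
    }
```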
The submission numbers broken down by UA is helpful, though I suppose we can see that information from the dashboard. I'll take a look; the presence of 11.0 and 14.0 in there is quite odd, though.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #3)
> I ran a scan of the access logs for 2012-02-11-00 PDT. In that one hour
> period we had:

Quick check - is the 2012 above a typo? Is the data under most_common_ip_ua_id from 2013? (Probably yes, because FF 18 didn't exist until 2012-08.)

> 1,097,858 submission requests
> 3,161 unique IP addresses
> 2,104 unique IP + UA combinations
> 354 unique IP + UA + submission ID combinations
> 426 unique submission IDs.
>
> most_common_ip_ua_id (IP not listed for privacy):
> 8376 1 Firefox/11.0 ddca3530-4c56-4682-a3b2-f94cfe76a127
> 3321 2 Firefox/14.0.1 8ce123cb-d517-49c4-aade-482ae1072ad7
> 3120 3 Firefox/17.0 4b74d645-8a74-ca44-a463-39c57b27db67
> 700 4 Firefox/17.0 795f3ab5-2b45-e144-92e5-6d83d68e8cd2
> 535 5 Firefox/18.0 723b611e-d9aa-4d0c-9668-e126d150dda7

Is there an easy way to map these build IDs back to platform, or is it easier to pull OS out of the telemetry pings?

> most_common_ua
> 957031 Firefox/18.0
> 30151 Firefox/17.0
> 29276 Firefox/19.0
> 15506 Firefox/20.0
> 11893 Firefox/11.0

So it's likely that 8376 out of 11893 pings for FF 11 that hour came from one misbehaving client, or from a group of clients behind a NAT? Can we tell the difference?
(In reply to Irving Reid (:irving) from comment #5)
> Quick check - is the 2012 above a typo? Is the data under
> most_common_ip_ua_id from 2013? (probably yes, because FF 18 didn't exist
> until 2012-08)

Yep, dumb typo. The query was for data from 2013-02-11-00 (i.e. just a couple of days ago).

> Is there an easy way to map these build IDs back to platform, or is it
> easier to pull OS out of the telemetry pings?

Parsing the OS reliably out of the UA string is a bit harder than what I did here (I just looked for Firefox|Thunderbird/[^ ]+). We will be getting first and last submission samples for the above listed IDs shortly, which should hopefully contain the platform info.

> so it's likely that 8376 out of 11893 pings for FF 11 that hour came from
> one misbehaving client, or from a group of clients behind a NAT? Can we tell
> the difference?

8376 pings came from a single misbehaving client that used the UA string FF 11. It is very unlikely to be a group of clients behind a NAT, because those 8376 requests all had the exact same Telemetry session ID in the URL. According to the contract that I understand from Telemetry, we should never see more than *1* request with the same session ID in a 24 hour period. If the browser stays running multiple days, then you'd see ~1 per day.
On the first startup after a session ends and more than 24 hours have passed since the last submission, we'll see one final submission with that ID that will be the cumulative saved-session info.
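[Editorial note: the contract described above - at most one request per session ID per 24 hours - could be checked mechanically. A sketch, assuming submissions have been reduced to (date, session ID) pairs; the helper and pair shape are hypothetical:]

```python
from collections import Counter

def contract_violators(requests, limit=1):
    """Flag session IDs that break the at-most-one-submission-per-day rule.

    `requests` is an iterable of (yyyymmdd, session_id) pairs, one per POST
    (an assumed shape). Returns {(date, id): count} for every day on which
    an ID was seen more than `limit` times.
    """
    per_day = Counter(requests)  # counts each (date, session_id) pair
    return {key: n for key, n in per_day.items() if n > limit}
```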
Attached file telemetry docs
Please check the attachment for sample telemetry docs based on these ids:

ddca3530-4c56-4682-a3b2-f94cfe76a127
8ce123cb-d517-49c4-aade-482ae1072ad7
4b74d645-8a74-ca44-a463-39c57b27db67
795f3ab5-2b45-e144-92e5-6d83d68e8cd2
723b611e-d9aa-4d0c-9668-e126d150dda7
We can provide more docs as needed, but we would appreciate it if you could look through these first and tell us whether they are useful and what changes might be needed for subsequent runs.
Some initial analyses:

- We had a theory about this being limited to high-uptime clients; there are two examples of low-uptime clients (7-14 days) sending large numbers of pings in this small dataset.

- Very limited evidence does point to something going wrong after a couple of days of uptime, though; in both of the above examples, earlier pings from the same session indicate that we were sending ~1 ping/day, with significant increases later. So the flow of pings is definitely not constant. Eyeballing numbers from the other clients also suggests the pings/day goes up as the sessions get longer.

- The READ_SAVED_PING_SUCCESS histogram is only present in one client and indicates that only a small number of pings were read in from disk. So there's no deluge of saved pings being sent in, at least given this data. But I think READ_SAVED_PING_SUCCESS is a fairly recent invention, so we won't see it in older clients.

Getting full pings is very helpful; thanks for getting these. If whatever means you are using to retrieve data from the database could be convinced to format the results more like:

<8-digit-date> <raw JSON document>

all on one line without extra text or spaces or newlines, that would be welcome.

Also, are these raw pings? I see some "valid":"true" members in some of the histograms, which is definitely not anything the telemetry bits insert into the histograms. Taking out 'invalid' histograms isn't going to be terribly helpful for trying to analyze some things.

Another run with:
- more ids
- more documents from each id
would be helpful. It would be especially helpful if we could get documents spaced at more-or-less regular intervals throughout the life of the id, so that we could get more evidence for/against something going wrong as a session gets older. Or even just a couple near the beginning and a couple near the end. Just not all on a single day.
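[Editorial note: the "pings/day goes up as the session gets longer" observation could be checked crudely per session. A sketch; the helper and its input shape are hypothetical, not part of any actual tooling here:]

```python
def rate_is_growing(daily_counts):
    """Crude check of the 'pings/day grows with session age' theory.

    `daily_counts` is a date-sorted list of (yyyymmdd, count) pairs for one
    session ID (an assumed shape). Compares the mean daily rate in the
    first half of the observed period against the second half.
    """
    if len(daily_counts) < 2:
        return False
    mid = len(daily_counts) // 2
    first = [n for _, n in daily_counts[:mid]]
    second = [n for _, n in daily_counts[mid:]]
    return sum(second) / len(second) > sum(first) / len(first)
```

A halves comparison is deliberately crude: it ignores gaps between days, but it is enough to separate "roughly constant" from "ramping up" for eyeballed data like this.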
One of the limitations we currently have is we are not likely to have more than one document per day for a given id. That is because if a client sends in more than one document on a single day, the old document will be overwritten with the new one.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #10)
> One of the limitations we currently have is we are not likely to have more
> than one document per day for a given id. That is because if a client sends
> in more than one document on a single day, the old document will be
> overwritten with the new one.

Thanks for clarifying. I didn't know exactly how overwriting documents worked. I think having ~1 document/day would work well for the purposes of this particular hypothesis; ideally we'd see the first couple of documents (~days) at regular ~24h intervals, and then they'd start getting closer together.
All the data points so far are for Windows version 5.2, which is the 64-bit version of XP (also Server 2003 and a few other related releases). Are there any problem clients that *don't* have "version":"5.2" in their "info" section?
Yeah, lemme write a query to pick out the most common OS portion of the UA strings for abusive clients.
(In reply to Irving Reid (:irving) from comment #12)
> Are there any problem clients that *don't* have "version":"5.2" in their
> "info" section?

Some of the posted documents (795f3ab5 and 4b74d645) are from OS X.
Oops - I wasn't looking at all the files in the previous group; they aren't all "version":"5.2". I'll keep digging.
This spreadsheet contains an analysis of the data over a 5 day period from the 9th to the 13th. It provides averages and sums of requests broken down into the following groups:

* overall daily numbers
* daily numbers for uuids seen only once per day (i.e. the normal submissions)
* daily numbers for uuids seen multiple times per day, in the following buckets:
** 2 - 10 times a day
** 11 - 100 times a day
** 101 - 1000 times a day
** more than 1000 times a day

It provides stats on the OS and Fx version for each of these groups. I did not see any particular outliers there, but I could have easily overlooked something.
Regarding comment 9 about the "valid" tag: this is part of telemetry validation, which runs every day. It checks whether telemetry submissions are as per spec (Histograms.json). You can ignore those for this analysis.
I've created an extract of all the submission IDs that violated the > 1 per day rule between 2013-02-09 and 2013-02-13. They are grouped by max_range, which is a bucketing of the most times that ID was seen in a single day. Summary:

max101-1000      461
max11-100      3,042
max2-10       13,826
max2         443,957

Here is a sample of just over 1000 of the ids, roughly proportional to the occurrence of their group:
https://metrics.mozilla.com/protected/telemetry_dup_submission_sample.tsv

Each record is a single ID with some stats about how often it was seen:

max_range -- which grouping this ID fell into for the most times it was seen on one day
submission_id -- the uuid submitted by Telemetry; this number is supposed to be unique to a session
total_requests_across_days -- the total number of times a POST was made between the 9th and 13th
avg_requests_per_day -- for the 5 day period
number_days_seen -- how many days out of the 5 we saw the ID
max_requests_across_days -- the most times this ID was seen on one day

Harsha will be running over this set to extract partial payloads per day as requested in comment #9.
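[Editorial note: a sketch of the max_range bucketing described above. The summary reports max2 separately from max2-10, so max2-10 is assumed here to mean 3-10 times a day; that reading, and the function itself, are assumptions:]

```python
def max_range_bucket(max_daily):
    """Map the most times an ID was seen in one day to its max_range group.

    Bucket names follow the summary above; since max2 is listed separately
    from max2-10, max2-10 is assumed to cover 3-10 times a day.
    """
    if max_daily <= 1:
        return "normal"
    if max_daily == 2:
        return "max2"
    if max_daily <= 10:
        return "max2-10"
    if max_daily <= 100:
        return "max11-100"
    if max_daily <= 1000:
        return "max101-1000"
    return "max1000+"
```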
Can we get a longer-term view of one or two of the busier IDs? What I'm looking for is whether there's a pattern to the number of submissions a single browser is sending per day - that is, does the browser send roughly the same number per day (with some days missing), is the number of submissions from that browser completely random, or is it following a curve? This might give us insight into what sort of failure is causing the multiple submissions.

Probably best to pick the worst client, one that is medium bad (maybe around 1000/day), and one that's low (50/day ish), and go as far back into history as we can for each client - when did it start sending multiple submissions per day with the same ID, and what's the growth curve (if any) in the submission rate?

If there's any way for us to tell, it would be interesting to know if the same client was working normally (that is, sending once/day for some time period) with a different session ID before the problem started.
Here are the documents that have frequent submissions for dates 20130210 - 20130213. http://metrics.mozilla.com/protected/telemetry_docs.zip
Please let me know if those documents are easy to parse. They are in the format of hbasekey, json. The json payload is on one line, without breaking into new lines.
(In reply to Harsha [:harsha] from comment #22)
> Please let me know if those documents are easy to parse they are in the
> format of hbasekey, json . Json payload is in one line without breaking into
> new lines.

Why do the hbasekeys start with seemingly random, sometimes unprintable, characters?
(In reply to Nathan Froyd (:froydnj) from comment #23)
> Why do the hbasekeys start with seemingly random, sometimes unprintable,
> characters?

And sometimes they don't start with an extra character at all. Why is that? These special cases make things less easy to parse.
You don't need to parse the hbase keys. If you want, I can get the documents with just the json in them. Let me know.
The first character in the hbase keys is used to distribute the documents among machines. I don't think we need the hbase keys for this bug; you can do "cat docs_20130210 | cut -f2" to get the json doc.
For transparency: the documents are stored in hbase with a leading byte based on the first two characters of the uuid, followed by the submission date, followed by the entire uuid of the submission. This is done to prevent hotspotting of the data into a single location, which would happen if the key started with the date (all data is inserted into hbase ordered by key). The date is there to allow us to scan the table for only documents submitted on a certain day. You could easily truncate the first byte, or the first byte + date prefix, if you want, which would leave you with the uuid that could be used to group all the documents related to a single session.
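[Editorial note: given the key layout described here, and since a later comment notes the files actually showed 0, 1, or 3 leading bytes, simple truncation is fragile. Anchoring a match at the end of the key sidesteps the variable-length prefix. A hedged sketch; the regex and helper are illustrative, not part of the real tooling:]

```python
import re

# A Telemetry uuid is 36 characters of lowercase hex digits and dashes, and
# the key layout puts an 8-digit date immediately before it. Matching from
# the end of the key avoids caring how many leading distribution bytes
# (0, 1, or 3 in practice) the key carries.
KEY_RE = re.compile(rb"(\d{8})([0-9a-f-]{36})$")

def parse_hbase_key(key_bytes):
    """Return (date, uuid) from an hbase row key, or None if it doesn't match."""
    m = KEY_RE.search(key_bytes)
    if not m:
        return None
    return m.group(1).decode(), m.group(2).decode()
```

With (date, uuid) extracted, all documents for a single session can be grouped by uuid and ordered by date.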
I can just look at the json docs, but I want to be able to associate docs with a common uuid together to look for trends in that particular submission over time. So I do need the keys, as those are the only things that have dates and session ids in them. The hbase keys in the files provided have 0, 1, or 3 leading bytes, depending on which keys you're looking at. So simple truncation does not work.
OK, something is clearly weird with a number of these clients. Consider these (uuid, date, uptime) for several clients:

2c8ea920-9885-44d3-89e2-2a19f2130914  20130211  552
2c8ea920-9885-44d3-89e2-2a19f2130914  20130212  552
2c8ea920-9885-44d3-89e2-2a19f2130914  20130213  552
35020845-8ca3-46e5-8288-bf48a8886183  20130212  118
dc788645-de0c-4d94-9774-44c88fba1734  20130210  0
dc788645-de0c-4d94-9774-44c88fba1734  20130211  0
dc788645-de0c-4d94-9774-44c88fba1734  20130212  0
dc788645-de0c-4d94-9774-44c88fba1734  20130213  0
fbf0801a-0a41-495f-bbb9-abbe9c02adfe  20130212  30
fbf0801a-0a41-495f-bbb9-abbe9c02adfe  20130213  80
fd366bc6-eab1-204d-a5bd-da73f4343a66  20130211  30
af4a680a-4319-4364-8afe-6f79c99f7ac7  20130212  77
af4a680a-4319-4364-8afe-6f79c99f7ac7  20130213  81
7f34bb17-3415-4b70-8f89-bbffb958ed64  20130211  1821226
7f34bb17-3415-4b70-8f89-bbffb958ed64  20130212  1821226
7f34bb17-3415-4b70-8f89-bbffb958ed64  20130213  1821226
6b69e91d-ccde-4d0c-b5cb-760261901b93  20130210  0
6b69e91d-ccde-4d0c-b5cb-760261901b93  20130211  0
6b69e91d-ccde-4d0c-b5cb-760261901b93  20130212  0
6b69e91d-ccde-4d0c-b5cb-760261901b93  20130213  0
62b47917-c589-4aa4-872e-1a902ac5a726  20130213  5785968 (!)

We have several instances of 0 uptime (!) and several more where the uptimes barely increase across submissions on consecutive days. We have huge uptimes, small uptimes... Eyeballing the data suggests that the majority of clients sending "frequent" submissions (whatever "frequent submissions" means for this dataset) have low uptimes (< 1 week). I'll look more at the histograms tomorrow.
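[Editorial note: the zero and non-increasing uptimes called out above can be flagged mechanically. A sketch over (uuid, date, uptime) rows shaped like the table; the helper is hypothetical, and uptime units are taken as reported:]

```python
def flag_uptime_anomalies(rows):
    """Flag sessions whose reported uptimes look impossible for a live browser.

    `rows` is a list of (uuid, yyyymmdd, uptime) tuples as in the table
    above. Zero uptime, and uptime that fails to increase across
    submissions on later days, are both flagged.
    """
    by_uuid = {}
    for uuid, date, uptime in rows:
        by_uuid.setdefault(uuid, []).append((date, uptime))
    flagged = {}
    for uuid, obs in by_uuid.items():
        uptimes = [u for _, u in sorted(obs)]  # order observations by date
        if 0 in uptimes:
            flagged[uuid] = "zero uptime"
        elif len(uptimes) > 1 and any(b <= a for a, b in zip(uptimes, uptimes[1:])):
            flagged[uuid] = "uptime not increasing"
    return flagged
```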
Is it possible to get, say, 2 weeks worth of this kind of data? I obviously worked around the hbase keys issue, so that doesn't need fixing up.
I am running the script to get data from 20130206 to 20130220.
Harsha, can you provide a couple of complete json packets from a frequently-repeated submission on the nightly channel, on Windows, that contains chromehangs?
Nathan, you can get docs from Feb 14 to Feb 20 from here:
http://metrics.mozilla.com/protected/telemetry_docs_0214-0220.zip

Taras, I will be working on a script which will get the most frequent uuids over a given period. I'll add those conditions to get your docs.

-Harsha
Depends on: 844325
No longer relevant --
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
