Closed
Bug 840651
Opened 12 years ago
Closed 11 years ago
Some Telemetry clients are spamming submissions non-stop
Categories
(Toolkit :: Telemetry, defect)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: dre, Unassigned)
References
Details
Attachments
(1 file)
|
397.53 KB,
application/x-zip-compressed
|
Details |
We have found a small set of submissions from clients that are constantly re-sending the same submission, sometimes 2 or 3 per second.
Most of the cases that we have linked to actual submissions have so far been old Firefox versions (12, 13, 14) and have very large uptimes (> 2^32 ms).
Conversation with Taras and Nathan in IRC indicate they think there might be a problem with idle-timer if uptime rolls over a 32bit int.
There was also a reference by them to an old Windows XP bug related to long uptimes, but I don't have a bug number.
I can make some example payloads available as well as a separate analysis on the trend of IPs or UAs sending the data through a secure channel.
Comment 1•12 years ago
|
||
Providing example payloads from each of the problematic clients would be very helpful in trying to understand the problem.
Comment 2•12 years ago
|
||
On IRC you mentioned "the smallest set of useful data" and above you mentioned example payloads. In the event that it's easier to provide individual fields and histograms from the data, I'd like to see:
simpleMeasurements.uptime
info.appVersion
info.appBuildID
info.platformBuildID
info.OS
info.appUpdateChannel
and the values for the histograms (if they don't exist, that's fine):
TELEMETRY_PING
TELEMETRY_SUCCESS
READ_SAVED_PING_SUCCESS
for as many of the offending clients as you can gather (max 100-200 or so).
| Reporter | ||
Comment 3•12 years ago
|
||
Some preliminary data I can deliver tonight. This doesn't answer :froydnj's asks yet, but it might be useful regardless.
I ran a scan of the access logs for 2012-02-11-00 PDT. In that one hour period we had:
1,097,858 submission requests
3,161 unique IP addresses
2,104 unique IP + UA combinations
354 unique IP + UA + submission ID combinations
426 unique submission IDs.
most_common_ip_ua_id (IP not listed for privacy):
8376 1 Firefox/11.0 ddca3530-4c56-4682-a3b2-f94cfe76a127
3321 2 Firefox/14.0.1 8ce123cb-d517-49c4-aade-482ae1072ad7
3120 3 Firefox/17.0 4b74d645-8a74-ca44-a463-39c57b27db67
700 4 Firefox/17.0 795f3ab5-2b45-e144-92e5-6d83d68e8cd2
535 5 Firefox/18.0 723b611e-d9aa-4d0c-9668-e126d150dda7
most_common_ua
957031 Firefox/18.0
30151 Firefox/17.0
29276 Firefox/19.0
15506 Firefox/20.0
11893 Firefox/11.0
Comment 4•12 years ago
|
||
The submission numbers broken down by UA is helpful, though I suppose we can see that information from the dashboard. I'll take a look; the presence of 11.0 and 14.0 in there is quite odd, though.
Comment 5•12 years ago
|
||
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #3)
> I ran a scan of the access logs for 2012-02-11-00 PDT. In that one hour
> period we had:
Quick check - is the 2012 above a typo? Is the data under most_common_ip_ua_id from 2013? (probably yes, because FF 18 didn't exist until 2012-08)
> 1,097,858 submission requests
> 3,161 unique IP addresses
> 2,104 unique IP + UA combinations
> 354 unique IP + UA + submission ID combinations
> 426 unique submission IDs.
>
> most_common_ip_ua_id (IP not listed for privacy):
> 8376 1 Firefox/11.0 ddca3530-4c56-4682-a3b2-f94cfe76a127
> 3321 2 Firefox/14.0.1 8ce123cb-d517-49c4-aade-482ae1072ad7
> 3120 3 Firefox/17.0 4b74d645-8a74-ca44-a463-39c57b27db67
> 700 4 Firefox/17.0 795f3ab5-2b45-e144-92e5-6d83d68e8cd2
> 535 5 Firefox/18.0 723b611e-d9aa-4d0c-9668-e126d150dda7
Is there an easy way to map these build IDs back to platform, or is it easier to pull OS out of the telemetry pings?
> most_common_ua
> 957031 Firefox/18.0
> 30151 Firefox/17.0
> 29276 Firefox/19.0
> 15506 Firefox/20.0
> 11893 Firefox/11.0
so it's likely that 8376 out of 11893 pings for FF 11 that hour came from one misbehaving client, or from a group of clients behind a NAT? Can we tell the difference?
| Reporter | ||
Comment 6•12 years ago
|
||
(In reply to Irving Reid (:irving) from comment #5)
> (In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #3)
>
> > I ran a scan of the access logs for 2012-02-11-00 PDT. In that one hour
> > period we had:
>
> Quick check - is the 2012 above a typo? Is the data under
> most_common_ip_ua_id from 2013? (probably yes, because FF 18 didn't exist
> until 2012-08)
Yep. dumb typo. The query was for data from 2013-02-11-00 (i.e. just a couple of days ago).
> > most_common_ip_ua_id (IP not listed for privacy):
> > 8376 1 Firefox/11.0 ddca3530-4c56-4682-a3b2-f94cfe76a127
> > 3321 2 Firefox/14.0.1 8ce123cb-d517-49c4-aade-482ae1072ad7
> > 3120 3 Firefox/17.0 4b74d645-8a74-ca44-a463-39c57b27db67
> > 700 4 Firefox/17.0 795f3ab5-2b45-e144-92e5-6d83d68e8cd2
> > 535 5 Firefox/18.0 723b611e-d9aa-4d0c-9668-e126d150dda7
>
> Is there an easy way to map these build IDs back to platform, or is it
> easier to pull OS out of the telemetry pings?
Parsing the OS reliably out of the UA string is a bit harder than what I did here (I just looked for Firefox|Thunderbird/[^ ]+). We will be getting first and last submission samples for the above listed IDs shortly which should hopefully contain the platform info in it.
>
> > most_common_ua
> > 957031 Firefox/18.0
> > 30151 Firefox/17.0
> > 29276 Firefox/19.0
> > 15506 Firefox/20.0
> > 11893 Firefox/11.0
>
> so it's likely that 8376 out of 11893 pings for FF 11 that hour came from
> one misbehaving client, or from a group of clients behind a NAT? Can we tell
> the difference?
8376 pings came from a single misbehaving client that used the UA string FF 11. It is very unlikely to be a group of clients behind a NAT because those 8376 requests all had the exact same Telemetry session ID in the URL.
According to the contract that I understand from Telemetry, we should never see more than *1* request with the same session ID in a 24 hour period. If the browser stays running multiple days, then you'd see ~1 per day. On the first startup after a session ends and more than 24 hours have passed since the last submission, we'll see one final submission with that ID that will be the cumulative saved-session info.
Comment 7•12 years ago
|
||
Please check the attachment for sample telemetry docs based on these ids
ddca3530-4c56-4682-a3b2-f94cfe76a127
8ce123cb-d517-49c4-aade-482ae1072ad7
4b74d645-8a74-ca44-a463-39c57b27db67
795f3ab5-2b45-e144-92e5-6d83d68e8cd2
723b611e-d9aa-4d0c-9668-e126d150dda7
| Reporter | ||
Comment 8•12 years ago
|
||
We can provide more docs as needed, but would appreciate it if you could look through these first and tell us if it is useful and any changes that might be needed for subsequent runs.
Comment 9•12 years ago
|
||
Some initial analyses:
- We had a theory about this being limited to high uptime clients; there are two examples of low uptime clients (7-14 days) sending large amounts of pings in this small dataset.
- Very limited evidence does point to something going wrong after a couple of days of uptime, though; in both of the above examples, earlier pings from the same session indicate that we've been sending ~1ping/day with significant increases later. So the flow of pings is definitely not constant. Eyeballing numbers from the other clients also suggests the pings/day is going up as the sessions get longer.
- The READ_SAVED_PING_SUCCESS histogram is only present in one client and indicates that only a small number of pings were read in from disk. So there's no deluge of saved pings being sent in, at least given this data. But I think READ_SAVED_PING_SUCCESS is a fairly recent invention, so we won't see it in older clients.
Getting full pings is very helpful, thanks for getting these. If whatever means you are using to retrieve data from the database could be convinced to format the results more like:
<8-digit-date> <raw JSON document>
all on one line without extra text or spaces or newlines, that would be welcome.
Also, are these raw pings? I see some "valid":"true" members in some of the histograms, which is definitely not anything the telemetry bits insert into the histograms. Taking out 'invalid' histograms isn't going to be terribly helpful for trying to analyze some things.
Another run with:
- more ids
- more documents from each id
would be helpful. It would be especially helpful if we could get documents spaced at more-or-less regular intervals throughout the life of the id so that we could get more evidence for/against something going wrong as a session gets older. Or even just a couple near the beginning and a couple near the end. Just not all on a single day.
| Reporter | ||
Comment 10•12 years ago
|
||
One of the limitations we currently have is we are not likely to have more than one document per day for a given id. That is because if a client sends in more than one document on a single day, the old document will be overwritten with the new one.
Comment 11•12 years ago
|
||
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #10)
> One of the limitations we currently have is we are not likely to have more
> than one document per day for a given id. That is because if a client sends
> in more than one document on a single day, the old document will be
> overwritten with the new one.
Thanks for clarifying. I didn't know exactly how overwriting documents worked.
I think having ~1 document/day would work well for the purposes of this particular hypothesis; ideally we'd see the first couple of documents (~days) at regular ~24h intervals and then they'd start getting closer together.
Comment 12•12 years ago
|
||
All the data points so far are for Windows version 5.2, which is the 64 bit version of XP (also Server 2003 and a few other related releases). Are there any problem clients that *don't* have "version":"5.2" in their "info" section?
| Reporter | ||
Comment 13•12 years ago
|
||
Yeah, lemme write a query to pick out the most common OS portion of the UA strings for abusive clients.
Comment 14•12 years ago
|
||
(In reply to Irving Reid (:irving) from comment #12)
> All the data points so far are for Windows version 5.2, which is the 64 bit
> version of XP (also Server 2003 and a few other related releases). Are there
> any problem clients that *don't* have "version":"5.2" in their "info"
> section?
Some of the posted documents (795f3ab5 and 4b74d645) are from OS X.
Comment 15•12 years ago
|
||
oops, wasn't looking at all the files in the previous group - they aren't all "version":"5.2". I'll keep digging.
| Reporter | ||
Comment 16•12 years ago
|
||
This spreadsheet contains an analysis for the data over a 5 day period from the 9th to the 13th.
It provides averages and sums of requests broken down into the following groups:
* overall daily numbers
* daily numbers for uuids seen only once per day (i.e. the normal submissions)
* daily numbers for uuids seen multiple times per day in the following buckets:
** 2 - 10 times a day
** 11 - 100 times a day
** 101 - 1000 times a day
** more than 1000 times a day
It provides stats on the OS and Fx version for each of these groups. I did not see any particular outliers there, but I could have easily overlooked something.
| Reporter | ||
Comment 17•12 years ago
|
||
Comment 18•12 years ago
|
||
regarding comment 9 about "valid" tag this is part of telemetry validation which runs everyday. It checks if telemetry submissions are as per spec(Histograms.json). You can ignore those for this analysis.
| Reporter | ||
Comment 19•12 years ago
|
||
I've created an extract of all the submission IDs that violated the > 1 per day rule between 2013-02-09 and 2013-02-13. They are grouped by max_range which is a bucketing of the most times that ID was seen in a single day.
Summary:
max101-1000 461
max11-100 3,042
max2-10 13,826
max2 443,957
Here is a sample of just over 1000 of the ids, roughly proportional the the occurrence of their group:
https://metrics.mozilla.com/protected/telemetry_dup_submission_sample.tsv
Each record is a single ID with some stats about how often it was seen:
max_range -- Which grouping this ID fell into for the most times it was seen on one day
submission_id -- The uuid submitted by Telemetry. This number is supposed to be unique to a session.
total_requests_across_days -- the total number of times a POST was made between the 9th and 13th
avg_requests_per_day -- for the 5 day period
number_days_seen -- How many days out of the 5 we saw the ID
max_requests_across_days -- The most times this ID was seen on one day
Harsha will be running over this set to extract partial payloads per day as requested in comment #9.
Comment 20•12 years ago
|
||
Can we get a longer term view of one or two of the busier IDs? What I'm looking for is whether there's a pattern to the number of submissions a single browser is sending per day - that is, does the browser send roughly the same number per day (with some days missing) or is the number of submissions from that browser completely random, or is it following a curve? This might give us insight into what sort of failure is causing the multiple submissions.
Probably best to pick the worst client, and one that is medium bad (maybe around 1000/day), and one that's low (50/day ish) and go as far back into history as we can for that client - when did it start receiving multiple submissions per day with the same ID, and what's the growth curve (if any) in the submission rate.
If there's any way for us to tell, it would be interesting to know if the same client was working normally (that is, sending once/day for some time period) with a different session ID before the problem started.
Comment 21•12 years ago
|
||
Here are the documents that have frequent submissions for dates 20130210 - 20130213.
http://metrics.mozilla.com/protected/telemetry_docs.zip
Comment 22•12 years ago
|
||
Please let me know if those documents are easy to parse they are in the format of hbasekey, json . Json payload is in one line without breaking into new lines.
Comment 23•12 years ago
|
||
(In reply to Harsha [:harsha] from comment #22)
> Please let me know if those documents are easy to parse they are in the
> format of hbasekey, json . Json payload is in one line without breaking into
> new lines.
Why do the hbasekeys start with seemingly random, sometimes unprintable, characters?
Comment 24•12 years ago
|
||
(In reply to Nathan Froyd (:froydnj) from comment #23)
> (In reply to Harsha [:harsha] from comment #22)
> > Please let me know if those documents are easy to parse they are in the
> > format of hbasekey, json . Json payload is in one line without breaking into
> > new lines.
>
> Why do the hbasekeys start with seemingly random, sometimes unprintable,
> characters?
And sometimes they don't start with an extra character at all. Why is that? These special cases make things less easy to parse.
Comment 25•12 years ago
|
||
you don't need to parse hbase keys. If you want I can get the documents with json in them. Let me know.
Comment 26•12 years ago
|
||
first character in hbase keys are used to distribute the documents among machines. I don't think we need hbase keys for this bug.
you can do
cat docs_20130210 | cut -f2 to get json doc.
| Reporter | ||
Comment 27•12 years ago
|
||
For transparency, The documents are stored in hbase with a leading byte based on the first two characters of the uuid followed by the submission date followed by the entire uuid of the submission.
This is done to prevent hotspotting of the data into a single location which would happen if the key started with the date (all data is inserted into hbase ordered by key). The date is there to allow us to scan the table for only documents submitted on a certain day.
You could easily truncate the first byte or the first byte + date prefix if you want, which would leave you with the uuid that could be used to group all the documents related to a single session.
Comment 28•12 years ago
|
||
I can just look at the json docs, but I want to be able to associate docs with a common uuid together to look for trends in that particular submission over time. So I do need the keys, as those are the only things that have dates and session ids in them.
The hbase keys in the files provided have 0, 1, or 3 leading bytes, depending on which keys you're looking at. So simple truncation does not work.
Comment 29•12 years ago
|
||
OK, something is clearly weird with a number of these clients. Consider these (uuid, date, uptime) for several clients:
2c8ea920-9885-44d3-89e2-2a19f2130914 20130211 552
2c8ea920-9885-44d3-89e2-2a19f2130914 20130212 552
2c8ea920-9885-44d3-89e2-2a19f2130914 20130213 552
35020845-8ca3-46e5-8288-bf48a8886183 20130212 118
dc788645-de0c-4d94-9774-44c88fba1734 20130210 0
dc788645-de0c-4d94-9774-44c88fba1734 20130211 0
dc788645-de0c-4d94-9774-44c88fba1734 20130212 0
dc788645-de0c-4d94-9774-44c88fba1734 20130213 0
fbf0801a-0a41-495f-bbb9-abbe9c02adfe 20130212 30
fbf0801a-0a41-495f-bbb9-abbe9c02adfe 20130213 80
fd366bc6-eab1-204d-a5bd-da73f4343a66 20130211 30
af4a680a-4319-4364-8afe-6f79c99f7ac7 20130212 77
af4a680a-4319-4364-8afe-6f79c99f7ac7 20130213 81
7f34bb17-3415-4b70-8f89-bbffb958ed64 20130211 1821226
7f34bb17-3415-4b70-8f89-bbffb958ed64 20130212 1821226
7f34bb17-3415-4b70-8f89-bbffb958ed64 20130213 1821226
6b69e91d-ccde-4d0c-b5cb-760261901b93 20130210 0
6b69e91d-ccde-4d0c-b5cb-760261901b93 20130211 0
6b69e91d-ccde-4d0c-b5cb-760261901b93 20130212 0
6b69e91d-ccde-4d0c-b5cb-760261901b93 20130213 0
62b47917-c589-4aa4-872e-1a902ac5a726 20130213 5785968 (!)
We have several instances of 0 uptime (!) and several more where the uptimes barely increase across submissions of consecutive days. We have huge uptimes, small uptimes...
Eyeballing the data suggests that the majority of clients sending "frequent" submissions (what does "frequent submissions" mean for this dataset) have low uptimes (< 1 week).
I'll look more at the histograms tomorrow.
Comment 30•12 years ago
|
||
Is it possible to get, say, 2 weeks worth of this kind of data? I obviously worked around the hbase keys issue, so that doesn't need fixing up.
Comment 31•12 years ago
|
||
I am running the script to get data from 20130206 to 20130220
Comment 32•12 years ago
|
||
Harsha can you provide a couple complete json packets where it's a frequently-repeated submission on the nightly channel on windows and contains chromehangs
Comment 33•12 years ago
|
||
Nathan,
You can get docs from feb 14 to feb 20 from here
http://metrics.mozilla.com/protected/telemetry_docs_0214-0220.zip
Taras,
I will be working on a script which will get most frequent uuid over given period. I'll add those conditions to get your docs.
-Harsha
Comment 34•11 years ago
|
||
No longer relevant
--
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•