Closed Bug 1684920 Opened 4 years ago Closed 4 years ago

Investigate increase in telemetry_stable.event_v4 partition size

Categories

(Data Platform and Tools :: General, task)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ascholtz, Assigned: ascholtz)

References

Details

(Whiteboard: [dataquality])

The partition size for telemetry_stable.event_v4 increase to 2.1TB in the last two weeks (while main_v4 dropped due to lower usage the week after Christmas). It'd be worth investigating whether that is due to the number of pings or ping sizes.

Assignee: nobody → ascholtz

I looked into moz-fx-data-shared-prod.monitoring.average_ping_sizes_v1 and it appears that the number or pings increased while the average ping size has been pretty stable over the last weeks. I'll see if I can find out where the increase in number of pings sent is coming from.

I did some further analysis:

Looking into the events sent now vs. beginning of December 2020 there has been a significant increase in security.unexpectedload events.

It seems this started to happen on 2020-12-15, [the day FF 84] was released. It appears that also recent versions (85, 86) send a significant higher number of these event pings: https://sql.telemetry.mozilla.org/queries/77523/source#193018

Looking at the change log, security.unexpectedload events have been added a while ago (2020-07-08) and have been sent in earlier versions. I'm not entirely sure what changed in recent versions that account for this significant increase.

The event_map_values for these events look mostly like:

contenttype: TYPE_STYLESHEET
remotetype: web
filedetails: unknown

Good work tracking that down. A logical next step here would be to bring in fbraun and ckerschb who are listed for notifications on this event, and ask them for context.

It appears that also recent versions (85, 86) send a significant higher number of these event pings

I believe we'll send an event ping if the number of written events hits 500. It may be worth looking into whether the increase in number of event pings can be accounted for by the extra pings with full event arrays.

It appears event pings are submitted at most once per ten minute interval, and at least once per hour in which an event was recorded. Upon reaching 1000 events, the ping is sent immediately unless it would be within ten minutes of the previous ping. Looking into how often pings are sent compared to an earlier date, it seems like the number of periodically sent pings increased by 3-4x, which is in line with the increase of security.unexpectedload events. There is no increase in full event arrays, so I would guess there are more events within the 10 minute intervals that get recorded and sent.

@fbraun @ckersch: We have been seeing a 3x to 4x increase of security.unexpectedload events that started with the release of Firefox 84.0. Do you have any context of what might be causing this?

Flags: needinfo?(fbraun)
Flags: needinfo?(ckerschb)

FYI, we're looking.

For context, the purpose of this probe is to find incorrect/legacy usages of the SystemPrincipal (a privileged context that isn't subject to security checks).

With the logs exploding as they are, my first guess that someone added a new feature that uses the SystemPrincipal incorrectly. Comments from above look as if these were (mostly) loads with external/internalContentPolicyType of TYPE_STYLESHEET.
Further, given this is so prevalent, my hope is that this shows up during normal loading. To find out more, I've run a recent Nightly for a while (trying different new-ish features, preferences, addons, browsing, about:newtab etc.) and recording all security checks to disk (e.g. running Firefox like so MOZ_LOG_FILE="/tmp/csmlog.txt" MOZ_LOG="CSMLog:5,raw,append" ./firefox --no-remote -p).

Going through the logs with a couple of lines functional Python (so we don't have to parse all files into memory first), gives me some preliminary results. The python script is at https://gist.github.com/mozfreddyb/dfc2155f8a3c135dffc055156b377709.

I'll continue sharing my progress, as I go.

I have been trying to get more interesting stuff out of telemetry by looking at the query from https://sql.telemetry.mozilla.org/queries/73537/source:

source:

SELECT event_string_value,
       event_process,
       TO_JSON_STRING(event_map_values),
       count(*) AS COUNT
FROM telemetry.events
WHERE event_category = 'security'
  AND event_method = "unexpectedload"
  AND submission_date BETWEEN '2020-12-01' AND '2021-01-22'
GROUP BY event_string_value,
         TO_JSON_STRING(event_map_values),
         event_process
ORDER BY COUNT DESC

If I'm not mistaken those are mostly events with TYPE_STYLESHEET that are hosted on a data URL (event_string_value == dataurl). Given that we're only looking at loads with the SystemPrincipal, my guess is that these are loads for ContentScripts/Styles.

AFAIU we use data URLs to cross process/privilege boundaries. I thought I filed a bug to move them to blobs, but can't find it anymore.
This could be a popular addon or an experiment or a webcompat style fix for a popular website.

Continuing.

Flags: needinfo?(fbraun)

I applied some local debugging patches (using no additional addons) and have been surfing the web all day with that. Given comment 2 saying that events should pop up periodically in a 10 minute interval I thought that I might be able to get lucky and see some of those - unfortunately no luck.

Given what Freddy has reported in Comment 8 it could indeed be some popular addon that has changed.

Anyway, I understand that receiving so many pings is sub-optimal. Anything we could do as a hotfix? Does it make sense, or is it necessary, that we disable those pings for now?

Flags: needinfo?(ckerschb)

This issue seems to affect all builds (release, nightly). I plotted a graph in Redash and the spiking starts happening on 2020-12-15: https://sql.telemetry.mozilla.org/queries/77523/source#193018

Not sure if this is helpful but I there is also a way to check what addons clients have installed and in common that send these events. This Redash looks at the addons clients have installed that send a lot of these events (top 1000 clients based on event count): https://sql.telemetry.mozilla.org/queries/77672/source
I was hoping to see some 3rd party addon that all of these clients have installed, but that doesn't seem to be the case.

I think for now it is okay for these events continue to be sent. The size increase isn't super critical at the moment but something we should keep an eye on.

Closing this for now since events_v4 has been pretty stable in the past few weeks. We'll keep an eye on it during our weekly meetings to see if this causes any more unexpected increases.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
See Also: → 1626083
Component: Datasets: General → General
Whiteboard: [data-quality] → [dataquality]
You need to log in before you can comment on or make changes to this bug.