Closed Bug 1007844 Opened 8 years ago Closed 8 years ago

Implement per-host telemetry for operational mozilla properties (aus4, amo)

Categories

(Core :: Security: PSM, defect)

x86_64
Linux
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla32

People

(Reporter: mmc, Assigned: mmc)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

In this mode, pinsets will be evaluated but pinning violations will only be reported to telemetry, not acted on.

The current pinset structure contains:
- hostname
- boolean include_subdomains (for whether to match *.addons.mozilla.org instead of just addons.mozilla.org)
- list of cert fingerprints

The pinning service uses these static pinsets to compute whether the certificate chain for a given host matches one of the allowed fingerprints, and blocks the connection if no match is found. If we add another boolean, test_mode, that is used for telemetry logging only, it becomes much safer to activate a given pinset: we can watch the telemetry for a pinset in test mode, and if violations are zero (or near zero), promote it to production mode.

This lets us do the following rollout plan:
- Convert all pinsets to test mode, including ones imported from Google
- Flip the default pinning enforcement level to on
- Watch the test mode telemetry buckets
- For all the buckets that have zero errors, promote the corresponding pinset to non-test mode.
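As a minimal sketch of the test-mode behavior described above (all names hypothetical; this is not the actual PSM implementation):

```javascript
// Sketch: pinsets are always evaluated and the result recorded, but
// a violation only blocks the connection when the pinset is out of
// test mode. All names here are illustrative.
function evaluatePinset(pinset, chainFingerprints, record) {
  const match = chainFingerprints.some(fp => pinset.pins.includes(fp));
  record(pinset.hostname, match);        // telemetry: pass/fail per host
  return pinset.testMode ? true : match; // test mode never blocks
}

// Example: a test-mode pinset lets a mismatched chain through,
// while still recording the violation.
const results = [];
const testPinset = { hostname: "aus4.mozilla.org", testMode: true,
                     pins: ["fp-good"] };
const ok = evaluatePinset(testPinset, ["fp-bad"],
                          (h, m) => results.push([h, m]));
console.log(ok);      // true (allowed despite mismatch)
console.log(results); // [["aus4.mozilla.org", false]]
```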
Bug 772756 required test mode already, so stealing this bug for telemetry improvements.
Summary: implement test_mode for certificate pinning → Implement telemetry for all pinsets
Assignee: nobody → mmc
Status: NEW → ASSIGNED
Comment on attachment 8420473 [details] [diff] [review]
Implement telemetry for all the buckets, including test mode (

Review of attachment 8420473 [details] [diff] [review]:
-----------------------------------------------------------------

Unfortunately this change adds a huge list of histogram ids to the preload JSON. This is because we import pinsets from JSON files and then sort them by hostname, so simply generating histogram ids wouldn't be stable.

This change also adds an "imported" field to the pinsets, which we technically don't need but which serves as a sanity check when inspecting the generated file.

As far as the number of buckets goes, the largest existing telemetry histogram has 1024, so I didn't want to push my luck. There's also this cool thing http://telemetry.mozilla.org/docs.html#Telemetry that we can use for massaging the data later.
Attachment #8420473 - Flags: review?(dkeeler)
Attachment #8420473 - Flags: review?(cviecco)
Comment on attachment 8420473 [details] [diff] [review]
Implement telemetry for all the buckets, including test mode (

Review of attachment 8420473 [details] [diff] [review]:
-----------------------------------------------------------------

My main concerns are:
1. Updating the ids of new pinning entries. How do we plan to make that scale? Do we at least have a bug on automating it?
2. It seems like the CERT_PINNING_(TEST_)RESULTS telemetry is redundant.
r- for now.

::: security/manager/boot/src/PublicKeyPinningService.cpp
@@ +194,5 @@
>      }
> +    mozilla::Telemetry::ID histogram =
> +      mozilla::Telemetry::CERT_PINNING_RESULTS_BY_HOST;
> +    if (foundEntry->mTestMode) {
> +      histogram = mozilla::Telemetry::CERT_PINNING_TEST_RESULTS_BY_HOST;

Would it make more sense to put this in the other "if (foundEntry->mTestMode)" block?

@@ +196,5 @@
> +      mozilla::Telemetry::CERT_PINNING_RESULTS_BY_HOST;
> +    if (foundEntry->mTestMode) {
> +      histogram = mozilla::Telemetry::CERT_PINNING_TEST_RESULTS_BY_HOST;
> +    }
> +    uint32_t bucket = foundEntry->mId * 2 + result;

I would be more comfortable with " + (result ? 1 : 0);"
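For reference, the indexing scheme in this hunk, sketched as a standalone helper (not the actual C++ code):

```javascript
// Each entry id owns two adjacent histogram buckets:
// failure at id * 2, success at id * 2 + 1. Coercing with
// (result ? 1 : 0) guards against a non-boolean result value.
function pinningBucket(id, result) {
  return id * 2 + (result ? 1 : 0);
}

// The test constants later in this patch follow the same scheme,
// e.g. TEST_MODE_PASSED = 292 * 2 + 1.
console.log(pinningBucket(292, false)); // 584
console.log(pinningBucket(292, true));  // 585
```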

::: security/manager/ssl/tests/unit/test_pinning.js
@@ +99,5 @@
> +  do_check_eq(results.counts[0], 1); // Failure count
> +  do_check_eq(results.counts[1], 3); // Success count
> +
> +  let results = service.getHistogramById("CERT_PINNING_TEST_RESULTS")
> +                .snapshot();

nit: indent two

@@ +115,5 @@
> +  const TEST_MODE_PASSED = 292 * 2 + 1;
> +
> +  // Everything is in production mode except test-mode.pinning.example.com
> +  results = service.getHistogramById("CERT_PINNING_RESULTS_BY_HOST")
> +            .snapshot();

nit: indent two

@@ +116,5 @@
> +
> +  // Everything is in production mode except test-mode.pinning.example.com
> +  results = service.getHistogramById("CERT_PINNING_RESULTS_BY_HOST")
> +            .snapshot();
> +  do_check_eq(1, results.counts[INCLUDE_SUBDOMAINS_FAILED]);

nit: switching the order of (actual, expected) might lead to confusion later. It's fine to do it this way, but let's be consistent in this function.

@@ +124,5 @@
> +  do_check_eq(0, results.counts[TEST_MODE_FAILED]);
> +  do_check_eq(0, results.counts[TEST_MODE_PASSED]);
> +
> +  results = service.getHistogramById("CERT_PINNING_TEST_RESULTS_BY_HOST")
> +            .snapshot();

nit: indent two

::: security/manager/tools/PreloadedHPKPins.json
@@ +87,5 @@
>        "pins": "mozilla_test", "test_mode": true }
> +  ],
> +  // Associate telemetry ids with all of the pins we want to monitor.
> +  "nextUnusedId": 307,
> +  "ids": {

Is this generated by hand? How does it get updated when new hosts are added?

::: security/manager/tools/genHPKPStaticPins.js
@@ +54,5 @@
>    "  const char* mHost;\n" +
>    "  const bool mIncludeSubdomains;\n" +
>    "  const bool mTestMode;\n" +
> +  "  const bool mImported;\n" +
> +  "  const int mId;\n" +

This eventually gets used in a calculation that gets stored in a uint32_t. Why not just start it that way?

@@ +450,5 @@
> +  const UNKNOWN_ID = 0;
> +  if (entry.name in gStaticPins.ids) {
> +    printVal += gStaticPins.ids[entry.name] + ", ";
> +  } else {
> +    printVal += UNKNOWN_ID + ", ";

Shouldn't we warn here? Or automatically update the list of pin ids?

::: toolkit/components/telemetry/Histograms.json
@@ +5916,5 @@
>      "high": "5000",
>      "n_buckets": 10,
>      "extended_statistics_ok": true
>    },
> +  "CERT_PINNING_RESULTS": {

It seems like this can be calculated based on CERT_PINNING_RESULTS_BY_HOST - is it necessary to have right now? (I imagine we'll want it when we do HPKP (i.e. non-preloaded pinning).)

@@ +5921,5 @@
> +    "expires_in_version": "never",
> +    "kind": "boolean",
> +    "description": "Certificate pinning results (0 = failure, 1 = success)"
> +  },
> +  "CERT_PINNING_TEST_RESULTS": {

Same here.
Attachment #8420473 - Flags: review?(dkeeler) → review-
(In reply to David Keeler (:keeler) from comment #4)
> Comment on attachment 8420473 [details] [diff] [review]
> Implement telemetry for all the buckets, including test mode (
> 
> Review of attachment 8420473 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> My main concerns are:
> 1. Updating the ids of new pinning entries. How do we plan to make that
> scale? Do we at least have a bug on automating it?

Yeah, we should have talked about this in the meeting; sorry I had to run out. For new pinning entries, here's what I think we should automatically import from Google:
1) Whenever certs get added to or removed from a pinset (done)
2) Whenever a new host gets pinned, we should import it but not enable it in production mode (done).

We can detect 2) when the UNKNOWN_ID bucket goes up. I'd love for us to start monitoring new hosts automatically when they get added, but I don't see how without having access to strings.

> 2. It seems like the CERT_PINNING_(TEST_)RESULTS telemetry is redundant.
> r- for now.

It is redundant if everything works correctly :) We could leave it in for human-readable sanity checks, or take it out.

> ::: security/manager/tools/PreloadedHPKPins.json
> @@ +87,5 @@
> >        "pins": "mozilla_test", "test_mode": true }
> > +  ],
> > +  // Associate telemetry ids with all of the pins we want to monitor.
> > +  "nextUnusedId": 307,
> > +  "ids": {
> 
> Is this generated by hand? How does it get updated when new hosts are added?

For updating this list:
1) Notice that the UNKNOWN_ID (0) bucket count is increasing.
2) Go and inspect changes to transport_security_static_certs.json.
3) Find the new hosts and add them to the list, starting with nextUnusedId.
4) Update nextUnusedId.

It sounds like I should try to automate this. I'll work on it.
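A sketch of what automating steps 1-4 might look like, operating on the nextUnusedId/ids structure shown above (helper name and shape are hypothetical):

```javascript
// Assign the next unused telemetry id to any pinned host that
// doesn't have one yet; existing assignments are never changed,
// so bucket numbers stay stable across regenerations.
function assignIds(state, hostnames) {
  for (const host of hostnames) {
    if (!(host in state.ids)) {
      state.ids[host] = state.nextUnusedId++;
    }
  }
  return state;
}

const state = { nextUnusedId: 307, ids: { "addons.mozilla.org": 1 } };
assignIds(state, ["addons.mozilla.org", "aus4.mozilla.org"]);
console.log(state.ids["aus4.mozilla.org"]); // 307
console.log(state.nextUnusedId);            // 308
```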
Comment on attachment 8420473 [details] [diff] [review]
Implement telemetry for all the buckets, including test mode (

Review of attachment 8420473 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/tools/PreloadedHPKPins.json
@@ +87,5 @@
>        "pins": "mozilla_test", "test_mode": true }
> +  ],
> +  // Associate telemetry ids with all of the pins we want to monitor.
> +  "nextUnusedId": 307,
> +  "ids": {

I am also worried about this. This file is supposed to be maintained by humans, not machines. If we want something like this list, it should live somewhere similar to the rejects list for HSTS. Another option would be not to use a list like this at all, but just hash the domain in the entry and use that as the index. There will be collisions, but as long as we keep our pinned set reasonably small, they won't happen often.

::: security/manager/tools/genHPKPStaticPins.js
@@ +53,5 @@
>    "struct TransportSecurityPreload {\n" +
>    "  const char* mHost;\n" +
>    "  const bool mIncludeSubdomains;\n" +
>    "  const bool mTestMode;\n" +
> +  "  const bool mImported;\n" +

You don't use this value; please remove it.
Attachment #8420473 - Flags: review?(cviecco) → review-
Here are the collisions using the last 12 bits of sha256 (note: we only have 10 bits as proposed, so the problem is worse in practice). The cdn.mozilla.org/google.com.bd collision is especially bad.

Camilo and David, how about we assign new buckets as we see new domains and keep them in a separate JSON?

This lets us keep the preloads JSON file clean, and have another JSON that's both input and output, to help keep track of buckets.

052 -, cdn.mozilla.org
052 -, google.com.bd
1c1 -, google.com.np
1c1 -, google.tk
32c -, google.gy
32c -, google.vg
51e -, google.gr
51e -, google.rw
61f -, google.co.ke
61f -, translate.googleapis.com
655 -, google.cc
655 -, google.com.cu
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #7)
> Camilo and David, how about it we assign new buckets as we see new domains
> and keep it in a separate JSON?
> 
> This lets us keep the preloads json file clean, and have another json that's
> both input and output to help keep track of buckets.

That's more like what I was thinking.
What is the plan once we have more than 1000 domains? Cycle the numbers?
Adding bsmedberg at his request. He expressed concern about counting pinning violations when I asked about the maximum number of buckets that could exist in a histogram.

bsmedberg, what will it take to move forward on this?

Thanks,
Monica
(In reply to Camilo Viecco (:cviecco) from comment #9)
> what is the plan once we have more than 1000 domains. cycle the numbers?

Switch to a new histogram.
Summary: Implement telemetry for all pinsets → Implement per-host telemetry for operational mozilla properties (aus4, amo)
Attached patch bucket-amoSplinter Review
Also adds aus4 in test mode.
Attachment #8420473 - Attachment is obsolete: true
Attachment #8423470 - Flags: review?(dkeeler)
Comment on attachment 8423470 [details] [diff] [review]
bucket-amo

Review of attachment 8423470 [details] [diff] [review]:
-----------------------------------------------------------------

Great!

::: security/manager/boot/src/PublicKeyPinningService.cpp
@@ +200,5 @@
>          ? Telemetry::CERT_PINNING_MOZ_TEST_RESULTS
>          : Telemetry::CERT_PINNING_TEST_RESULTS;
>        retval = true;
>      }
> +    // We can collect per-host pinning violations for this entry.

I would add: "because it is a Mozilla host necessary for the operation of the browser" or something

::: security/manager/tools/genHPKPStaticPins.js
@@ +444,5 @@
>      printVal += "true, ";
>    } else {
>      printVal += "false, ";
>    }
> +  if (entry.id >= 512) {

How about "if ("id" in entry && entry.id >= 0)"? (id won't be a defined property on all of the entries, right?) (Or I guess entry.hasOwnProperty("id") would work?)

::: toolkit/components/telemetry/Histograms.json
@@ +5950,5 @@
>    },
>    "CERT_PINNING_MOZ_RESULTS": {
>      "expires_in_version": "never",
>      "kind": "boolean",
> +    "description": "Certificate pinning evaluation results (0 = failure, 1 = success)"

Should we be a bit more descriptive in each of these descriptions? (e.g. test mode vs. production mode, non-operational mozilla properties vs. non-mozilla properties)

@@ +5960,5 @@
> +  },
> +  "CERT_PINNING_MOZ_RESULTS_BY_HOST": {
> +    "expires_in_version": "never",
> +    "kind": "enumerated",
> +    "n_values": 1024,

Seems like we're not going to need anywhere near 1024 buckets this time around.

@@ +5967,5 @@
> +  "CERT_PINNING_MOZ_TEST_RESULTS_BY_HOST": {
> +    "expires_in_version": "never",
> +    "kind": "enumerated",
> +    "n_values": 1024,
> +    "description": "Certificate pinning evaluation results by host"

Seems like we should mention these are for mozilla operational hosts. Also, it would be good to differentiate test vs. production in the description.
Attachment #8423470 - Flags: review?(dkeeler) → review+
Fixed comments and descriptions, and cut buckets down to 512 (I haven't audited the moz properties enough to know how many "operational" ones there are). Since increasing n_values breaks existing histograms, better safe than sorry. I think we should write a custom dashboard anyway, so hopefully ugly empty buckets won't be an issue.

> How about "if ("id" in entry && entry.id >= 0)"? (id won't be a defined
> property on all of the entries, right?) (Or I guess
> entry.hasOwnProperty("id") would work?)

For this one, I left it -- if id is not in entry, it will be undefined and default to -1 as intended. test_mode required more care since we wanted it to default to true if not present.
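The JavaScript behavior being relied on here, sketched with the `>= 0` form from the review suggestion (minimal hypothetical example):

```javascript
// Relational comparisons with undefined are always false (undefined
// coerces to NaN), so an entry without an "id" property fails the
// `entry.id >= 0` test and falls through to the default.
const entry = { name: "example.com" }; // no "id" property
console.log(entry.id >= 0);            // false
const id = entry.id >= 0 ? entry.id : -1;
console.log(id);                       // -1
```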
https://hg.mozilla.org/mozilla-central/rev/7cf39ac6f41e
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32