Closed Bug 1538093 Opened 8 months ago Closed 8 months ago

4.86 - 9.53% Heap Unclassified (linux64, osx-10-10, windows10-64, windows10-64-pgo-qr) regression on push 5514aae0e34e81b39a88447094a34e13c0d74aac

Categories

(Core :: Security: PSM, defect)

defect
Not set

Tracking

()

VERIFIED FIXED
mozilla68
Tracking Status
firefox-esr60 --- unaffected
firefox66 --- unaffected
firefox67 --- unaffected
firefox68 --- fixed

People

(Reporter: igoldan, Assigned: myk)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: perf, regression)

Attachments

(1 file)

We have detected an awsy regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=989c91cf970eebff42ea372951399cdfaa8ccf03&tochange=5514aae0e34e81b39a88447094a34e13c0d74aac

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

10% Heap Unclassified windows10-64 pgo stylo 44,913,154.82 -> 49,192,859.81
7% Heap Unclassified osx-10-10 opt stylo 74,598,810.00 -> 79,551,722.98
6% Heap Unclassified windows10-64-pgo-qr opt stylo 54,957,674.79 -> 58,183,361.15
5% Heap Unclassified linux64 pgo stylo 79,531,437.71 -> 83,393,662.70

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=20020

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/AWSY/Tests

Component: General → Security: PSM
Product: Testing → Core
Flags: needinfo?(mgoodwin)

It seems to me that almost all of the heap usage here will be in rkv. Myk, does anything jump out as being an obvious cause of this?

Flags: needinfo?(mgoodwin) → needinfo?(myk)

massif is telling me that merely instantiating an Rkv allocates about 3MB, which appears to account for the majority of these increases:

99.99% (3,151,258B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->66.54% (2,097,152B) 0x133E75: mdb_env_open (in /home/keeler/src/rust-scratch/target/release/scratch)
| ->66.54% (2,097,152B) 0x10E8F6: lmdb::environment::EnvironmentBuilder::open_with_permissions (in /home/keeler/src/rust-scratch/target/release/scratch)
|   ->66.54% (2,097,152B) 0x10E6EE: lmdb::environment::EnvironmentBuilder::open (in /home/keeler/src/rust-scratch/target/release/scratch)
|     ->66.54% (2,097,152B) 0x10DC4D: rkv::env::Rkv::from_env (in /home/keeler/src/rust-scratch/target/release/scratch)
|       ->66.54% (2,097,152B) 0x10DB5D: rkv::env::Rkv::new (in /home/keeler/src/rust-scratch/target/release/scratch)
|         ->66.54% (2,097,152B) 0x10D809: scratch::main (in /home/keeler/src/rust-scratch/target/release/scratch)
|           ->66.54% (2,097,152B) 0x10D571: std::rt::lang_start::{{closure}} (in /home/keeler/src/rust-scratch/target/release/scratch)
|             ->66.54% (2,097,152B) 0x11E981: std::panicking::try::do_call (rt.rs:59)
|               ->66.54% (2,097,152B) 0x12C158: __rust_maybe_catch_panic (lib.rs:102)
|                 ->66.54% (2,097,152B) 0x11F302: std::rt::lang_start_internal (panicking.rs:289)
|                   ->66.54% (2,097,152B) 0x10DA10: main (in /home/keeler/src/rust-scratch/target/release/scratch)
|                     
->33.27% (1,048,584B) 0x112731: mdb_midl_alloc (in /home/keeler/src/rust-scratch/target/release/scratch)
| ->33.27% (1,048,584B) 0x133E53: mdb_env_open (in /home/keeler/src/rust-scratch/target/release/scratch)
|   ->33.27% (1,048,584B) 0x10E8F6: lmdb::environment::EnvironmentBuilder::open_with_permissions (in /home/keeler/src/rust-scratch/target/release/scratch)
|     ->33.27% (1,048,584B) 0x10E6EE: lmdb::environment::EnvironmentBuilder::open (in /home/keeler/src/rust-scratch/target/release/scratch)
|       ->33.27% (1,048,584B) 0x10DC4D: rkv::env::Rkv::from_env (in /home/keeler/src/rust-scratch/target/release/scratch)
|         ->33.27% (1,048,584B) 0x10DB5D: rkv::env::Rkv::new (in /home/keeler/src/rust-scratch/target/release/scratch)
|           ->33.27% (1,048,584B) 0x10D809: scratch::main (in /home/keeler/src/rust-scratch/target/release/scratch)
|             ->33.27% (1,048,584B) 0x10D571: std::rt::lang_start::{{closure}} (in /home/keeler/src/rust-scratch/target/release/scratch)
|               ->33.27% (1,048,584B) 0x11E981: std::panicking::try::do_call (rt.rs:59)
|                 ->33.27% (1,048,584B) 0x12C158: __rust_maybe_catch_panic (lib.rs:102)
|                   ->33.27% (1,048,584B) 0x11F302: std::rt::lang_start_internal (panicking.rs:289)
|                     ->33.27% (1,048,584B) 0x10DA10: main (in /home/keeler/src/rust-scratch/target/release/scratch)
|                       
->00.18% (5,522B) in 1+ places, all below ms_print's threshold (01.00%)

(this is from a little scratch program that basically just calls Rkv::new(Path::new("/tmp/")))

DMD shows several allocations on a local macOS build:

Unreported {
  1 block in heap block record 2 of 6,020
  2,097,152 bytes (2,097,152 requested / 0 slop)
  1.81% of the heap (7.33% cumulative)
  5.51% of unreported (22.27% cumulative)
  Allocated at {
    #01: replace_calloc(unsigned long, unsigned long) (DMD.cpp:1144, in libmozglue.dylib)
    #02: mdb_env_open (in XUL) + 228
    #03: lmdb::environment::EnvironmentBuilder::open_with_permissions::h6871600ac7cbdcc3 (environment.rs:277, in XUL)
    #04: lmdb::environment::EnvironmentBuilder::open::h281e464919916b16 (environment.rs:249, in XUL)
    #05: rkv::env::Rkv::from_env::h0ac73aabfded456d (env.rs:72, in XUL)
    #06: rkv::env::Rkv::with_capacity::h319e356b0d46680f (env.rs:89, in XUL)
    #07: rkv::env::Rkv::new::h2ea2cc226a47ecc7 (env.rs:62, in XUL)
    #08: cert_storage::SecurityState::new::h4c70ddb503b0fed0 (lib.rs:78, in XUL)
  }
}

Unreported {
  1 block in heap block record 5 of 6,020
  1,052,672 bytes (1,048,584 requested / 4,088 slop)
  0.91% of the heap (11.48% cumulative)
  2.77% of unreported (34.88% cumulative)
  Allocated at {
    #01: replace_malloc(unsigned long) (DMD.cpp:1128, in libmozglue.dylib)
    #02: mdb_midl_alloc (midl.c:107, in XUL)
    #03: mdb_env_open (in XUL) + 190
    #04: lmdb::environment::EnvironmentBuilder::open_with_permissions::h6871600ac7cbdcc3 (environment.rs:277, in XUL)
    #05: lmdb::environment::EnvironmentBuilder::open::h281e464919916b16 (environment.rs:249, in XUL)
    #06: rkv::env::Rkv::from_env::h0ac73aabfded456d (env.rs:72, in XUL)
    #07: rkv::env::Rkv::with_capacity::h319e356b0d46680f (env.rs:89, in XUL)
    #08: rkv::env::Rkv::new::h2ea2cc226a47ecc7 (env.rs:62, in XUL)
  }
}

Unreported {
  1 block in heap block record 633 of 6,020
  4,096 bytes (4,096 requested / 0 slop)
  0.00% of the heap (30.44% cumulative)
  0.01% of unreported (92.49% cumulative)
  Allocated at {
    #01: replace_calloc(unsigned long, unsigned long) (DMD.cpp:1144, in libmozglue.dylib)
    #02: mdb_env_open (in XUL) + 2165
    #03: lmdb::environment::EnvironmentBuilder::open_with_permissions::h6871600ac7cbdcc3 (environment.rs:277, in XUL)
    #04: lmdb::environment::EnvironmentBuilder::open::h281e464919916b16 (environment.rs:249, in XUL)
    #05: rkv::env::Rkv::from_env::h0ac73aabfded456d (env.rs:72, in XUL)
    #06: rkv::env::Rkv::with_capacity::h319e356b0d46680f (env.rs:89, in XUL)
    #07: rkv::env::Rkv::new::h2ea2cc226a47ecc7 (env.rs:62, in XUL)
    #08: cert_storage::SecurityState::new::h4c70ddb503b0fed0 (lib.rs:78, in XUL)
  }
}

These don't account for all of the regression seen in CI, but then I didn't run the tp5 test locally, I only started up the browser; so there may be more to uncover.

I don't know why the mdb_env_open stack frames don't have line numbers, but I think the first two (i.e. the significant ones) are https://searchfox.org/mozilla-central/rev/4eaf7c637a303e0f7adcf1ae45064b2d4bef9eb0/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c#4972 and https://searchfox.org/mozilla-central/rev/4eaf7c637a303e0f7adcf1ae45064b2d4bef9eb0/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c#4971.

Those are intentional, and not leaks, but significant nevertheless. At the very least, they suggest that we should aggressively group stores into a small number of environments as we add rkv consumers (to avoid paying this cost for each new consumer). And perhaps there's more we can do (close the environment when we aren't using it, configure it to be read-only when we only expect to read from it, configure LMDB to reduce these allocations, etc.).

I also don't know how we would go about reporting on heap allocations in Rust crates (or rather, in this case, in C libraries within Rust crates) to move these allocations from unclassified to explicit.

Eric, do you have a suggestion for that?

Flags: needinfo?(myk) → needinfo?(erahm)

The new rkv-based cert_storage database caused a Heap Unclassified regression because of memory that LMDB reserves when opening a database in read-write mode. Since cert_storage usage is read-heavy, this change claws back that regression by opening it in read-only mode except when changes are being made.

Assignee: nobody → myk
Status: NEW → ASSIGNED
Pushed by myk@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3561cdc13806
reopen security_state env as read-only when not writing r=keeler
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Status: RESOLVED → VERIFIED

(In reply to Myk Melez [:myk] [@mykmelez] from comment #3)

I also don't know how we would go about reporting on heap allocations in Rust crates (or rather, in this case, in C libraries within Rust crates) to move these allocations from unclassified to explicit.

Eric, do you have a suggestion for that?

I chatted with Eric, who gave me some suggestions, which I've noted in bug 1542928.

Flags: needinfo?(erahm)
You need to log in before you can comment on or make changes to this bug.