Closed Bug 2037089 Opened 18 days ago Closed 8 hours ago

Update Hunspell to v1.7.3

Categories

(Core :: Spelling checker, task)

task

Tracking

()

RESOLVED FIXED
153 Branch
Tracking Status
firefox153 --- fixed

People

(Reporter: RyanVM, Assigned: RyanVM)

References

()

Details

Attachments

(3 files, 2 obsolete files)

We're currently on v1.7.1, so a couple releases behind now. There's some security-related fixes in v1.7.3, though our RLBox sandboxing may mitigate them. Unfortunately, we have some in-tree modifications to the vendored code which complicate our ability to easily update. In particular, bug 1739761 added arena allocation to code which has since been heavily modified upstream.

I had Claude take a look and that was the main point of concern. The other in-tree patches look pretty rebaseable. Here's what it had to say about bug 1739761.

- bug1739761 (HashMgr arena allocator) — the painful one. Upstream rewrote precisely the
  malloc-heavy paths the patch was targeting:
    - tableptr (hentry** + tablesize) → std::vector<hentry*>
  - aliasf/aliasflen/aliasm (unsigned short** / char**) → std::vector
  - add_word gained a bool own_aff parameter; new release_flags(astr, owned) helper
  - decode_flag/encode_flag switched to std::string; lookup/hash gained size_t len
  - csconv is const; remove_forbidden_flag returns void
  - New public API: Hunspell_add_with_flags, Hunspell_suffix_suggest (additive — doesn't
    break wasm bindings).

- Three options, in increasing effort:
  - (a) Drop the patch. Justification: upstream's std::vector migration eliminates most
    of the per-allocation overhead the arena was hiding. Validate by comparing
  	dictionary-load memory/heap-fragmentation on a heavy locale (Hungarian, the original
	motivator) before/after on at least Linux + Win64.
  - (b) Port a slimmer arena. Keep arena only for the residual unsigned short* flag-vector
    allocations in decode_flags / aliasf's per-row vectors that upstream still mallocs.
	Smaller patch, smaller regression surface.
  - (c) Full port. Rewrite the arena to coexist with std::vector storage. Most invasive;
    least appealing.
  - My recommendation: try (a) first with a benchmark; fall back to (b) if regression seen.

@RyanVM: Yup, that rationale seems fine.

The rlbox sandbox will mitigate the vulnerabilities but we should try to update asap.

I think checking if the new update mitigates the fragmentation should be pretty easy: we can run a simpler version of the two tests listed in Bug 1739761

Test 1. Without the patch. For example bg requires about 23.26mb in the sandbox to perform spellchecking.

  • When allocating a 16mb sandbox, spellchecking uses 11.67mb out of 16mb used and then crashes with OOM
  • When allocating a 32mb sandbox, spellchecking uses 22.53mb out of 32mb used and then crashes with OOM
  • When allocating a 64mb sandbox, spellchecking uses 23.26mb out of 64mb used and succeeds

Test 2. With the patch

  • This reduces the number of wasm pages allocated from 64 to 47 on en_us builds, corresponding to about 1 MB of reduced memory usage.

Documenting this in case it comes up in the future

  1. Add the following after the sandbox->create, i.e., after https://searchfox.org/firefox-main/source/extensions/spellcheck/hunspell/glue/RLBoxHunspell.cpp#68
    std::cout << "!!!!!!!!!!!Hunspell RLBox for locale: " << dpath.get() << "\n";
    
  2. Add the following at the bottom of RLBoxHunspell::spell, i.e., after https://searchfox.org/firefox-main/source/extensions/spellcheck/hunspell/glue/RLBoxHunspell.cpp#184
    size_t sz = mSandbox->get_total_memory();
    std::cout << "!!!!!!!!!!!Hunspell RLBox Size: " << sz << " bytes, " << sz / (64 * 1024) << " Wasm pages\n";
    
  3. Open a webpage like below with a textbox and right click to enable check spelling first, and choose the language under "Languages" with the English language then with the Bulgarian language. (you may need to install the required language dictionary extensions from the Mozilla store)
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset='utf-8'>
        <meta http-equiv='X-UA-Compatible' content='IE=edge'>
        <title>Hunspell test</title>
        <meta name='viewport' content='width=device-width, initial-scale=1'>
    </head>
    <body>
        <input type="text" value="Hello world!" />
    </body>
    </html>
    
  4. Update hunspell to the new version without the patch from Bug 1739761 and repeat steps 1 to 3
  5. If you see that the numbers look similar, then the upstream version no longer needs the memory arena patch. If there is a difference, we may need to spend time to figure out what to do
  6. Note we usually collect the "number of page" values from the opt build. The Wasm code when compiled without optimizations is much slower and heavier than the optimized versions.

Thanks for the pointers, super helpful! Here's what I have running that test:

  ┌────────┬────────────────┬─────────────────┬─────────────┐                                                                                                                                                 
  │ Locale │ v1.7.1 + arena │ v1.7.3 no arena │    Delta    │
  ├────────┼────────────────┼─────────────────┼─────────────┤                                                                                                                                                 
  │ en-US  │ 36 pages       │ 65 pages        │ +29 (+81%)  │
  ├────────┼────────────────┼─────────────────┼─────────────┤
  │ bg_BG  │ 487 pages      │ 654 pages       │ +167 (+34%) │                                                                                                                                                 
  └────────┴────────────────┴─────────────────┴─────────────┘  

So yeah, looks like we still need the arena code.

On a related note, is there any reason we shouldn't be trying to upstream these arena changes if they improve memory usage that significantly?

Ok, that looks like the new changes upstream did little to improve the fragmentation. Per Bug 1739761, the English locale on 1.71 without the arena was 64 pages, which is more or less what we have here.

Re upstreaming - unfortunately, I don't know. The patch was written by Bobby and I wasn't involved. glandium may know? From a cursory look at the patch, it looks like the idea was to use a bump allocator that would defer freeing to the time when all allocations are freed. This would certainly hold on to memory for a bit longer than the original code. I'm guessing this didn't matter for the Firefox use case, although I am not familiar enough with the code to know if that actually matters for upstream.

(Also edited the documentation above to note we usually collect this in an opt build. Since we are doing a relative test here, I guess it shouldn't matter as long as both builds are opt or both are debug)

Had Claude do a pass at porting the arena code:

┌────────────────────────────┬──────────┬───────────┐
│          Version           │  en-US   │   bg_BG   │
├────────────────────────────┼──────────┼───────────┤
│ v1.7.1 + arena             │ 36 pages │ 487 pages │
├────────────────────────────┼──────────┼───────────┤
│ v1.7.3 + arena 4KB chunks  │ 39 pages │ 530 pages │
├────────────────────────────┼──────────┼───────────┤
│ v1.7.3 + arena 64KB chunks │ 39 pages │ 526 pages │
└────────────────────────────┴──────────┴───────────┘

Not sure if there's much room for improvement beyond that, but I'm open to suggestions anything people want to try. I'm not sure the 4KB->64KB bump is really worth it either for the very modest reduction for bg_BG.

Claude found one simple change that wins back the rest of the memory regression from v1.7.3 (hentry struct reordering to avoid extra padding):

┌─────────────────────────────────┬────────────┬─────────────┐
│             Version             │   en-US    │    bg_BG    │
├─────────────────────────────────┼────────────┼─────────────┤
│ v1.7.1 + arena baseline         │ 36 pages   │ 487 pages   │
├─────────────────────────────────┼────────────┼─────────────┤
│ v1.7.3 + arena (before reorder) │ 39 pages   │ 526 pages   │
├─────────────────────────────────┼────────────┼─────────────┤
│ v1.7.3 + arena + hentry reorder │ 36 pages   │ 483 pages   │
└─────────────────────────────────┴────────────┴─────────────┘

Still poking for any other easy wins, but it looks like we can pull off taking this update without losing ground from where we were with v1.7.1 already.

Oh nice. That is useful --- there is a part in the rlbox support for 32-bit machines where we need to guess what the upper limit on memory usage is so we can reserve non fragmented space for the sandbox. 64-bit machines don't have to do this.

https://searchfox.org/firefox-main/source/extensions/spellcheck/hunspell/glue/RLBoxHunspell.cpp#53

We'd have to double check our heuristic here if the memory profile got significantly worse. However, it looks like that's not going to be needed.

I haven't looked at the code, but at least struct reordering changes sounds like something that can be upstreamed.

Patches:

  • bug1410214.patch: regenerated for v1.7.3 filemgr.hxx (minor upstream
    cleanup: <cstdio> instead of <stdio.h>, = delete instead of private).
  • bug1653659.patch: regenerated for v1.7.3 csutil.cxx/.hxx; also adds a
    MOZILLA_CLIENT stub for unicodeisalpha (returns 0; the function is never
    reached on the RLBox sandbox path but needs to compile without
    utf_info.hxx, which is excluded from MOZILLA_CLIENT builds).
  • bug1739761.patch: dropped; will be re-added in the following commit
    after rebasing for v1.7.3's significantly restructured HashMgr.
  • bug1838113.patch: dropped; upstream incorporated the change in v1.7.3.

Rebases bug1739761.patch against v1.7.3's significantly restructured
HashMgr. v1.7.3 converted several malloc-heavy paths to std::vector and
added the H_OPT_OWNFLAGS bit and a release_flags() helper, requiring a
non-trivial port.

Key invariant: arena_alloc'd memory must only ever see arena_free() (a
no-op); calling free()/delete[] on arena interior pointers will crash wasm
dlmalloc since they are interior pointers into arena chunks, not the starts
of malloc'd blocks.

The arena (std::vector<std::unique_ptr<uint8_t[]>>, 64KB chunks) covers
hentry structs, flag vectors from decode_flags, flags2 in
add_hidden_capitalized_word, and aliasm strings. Runtime additions via
add/add_with_affix/remove continue to use new[]/delete[] with
H_OPT_OWNFLAGS, which is handled correctly by release_flags.

Reorders the hentry struct fields (pointers first, then 2-byte fields, then
1-byte fields) to eliminate internal padding. On wasm32, sizeof(hentry)
drops from 24 to 20 bytes; on 64-bit native, from 40 to 32 bytes. For
Bulgarian's ~700K hentries this saves ~2.8MB of wasm linear memory, fully
recovering the remaining gap vs the v1.7.1 baseline.

RLBox sandbox memory (64-bit Linux):
v1.7.1 + arena: en-US 36 pages, bg_BG 487 pages
v1.7.3 + arena (before reorder): en-US 39 pages, bg_BG 526 pages
v1.7.3 + arena + reorder: en-US 36 pages, bg_BG 483 pages

Pure layout change, no behavior modification.

Tests that a word added to the personal dictionary is accepted by the spell
checker and rejected again after removal. This exercises the H_OPT_OWNFLAGS
code path in HashMgr::free_table, where heap-owned flag vectors for runtime
personal dictionary additions coexist with arena-owned load-time entries.

Chrome script errors are surfaced as test failures rather than hangs. r?#spelling-checker

Assignee: nobody → ryanvm

Tests that a word added to the personal dictionary is accepted by the spell
checker and rejected again after removal. This exercises the H_OPT_OWNFLAGS
code path in HashMgr::free_table, where heap-owned flag vectors for runtime
personal dictionary additions coexist with arena-owned load-time entries.

Chrome script errors are surfaced as test failures rather than hangs.

Attachment #9583980 - Attachment is obsolete: true
Attachment #9583862 - Attachment is obsolete: true
Attachment #9583861 - Attachment description: Bug 2037089 - Update vendored Hunspell to v1.7.3. → Bug 2037089 - Update vendored Hunspell to v1.7.3. r?glandium,masayuki
Attachment #9583863 - Attachment description: Bug 2037089 - Reorder hentry struct fields to eliminate padding. → Bug 2037089 - Reorder hentry struct fields to eliminate padding. r?masayuki
Attachment #9583864 - Attachment description: Bug 2037089 - Add mochitest for personal dictionary add/check/remove. → Bug 2037089 - Add mochitest for personal dictionary add/check/remove. r?masayuki

Backed out for causing hazard bustages @affixmgr.cxx.

Flags: needinfo?(ryanvm)
Flags: needinfo?(ryanvm)
Status: NEW → RESOLVED
Closed: 8 hours ago
Resolution: --- → FIXED
Target Milestone: --- → 153 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: