Update Hunspell to v1.7.3
Categories
(Core :: Spelling checker, task)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox153 | --- | fixed |
People
(Reporter: RyanVM, Assigned: RyanVM)
References
()
Details
Attachments
(3 files, 2 obsolete files)
We're currently on v1.7.1, so a couple releases behind now. There's some security-related fixes in v1.7.3, though our RLBox sandboxing may mitigate them. Unfortunately, we have some in-tree modifications to the vendored code which complicate our ability to easily update. In particular, bug 1739761 added arena allocation to code which has since been heavily modified upstream.
I had Claude take a look and that was the main point of concern. The other in-tree patches look pretty rebaseable. Here's what it had to say about bug 1739761.
- bug1739761 (HashMgr arena allocator) — the painful one. Upstream rewrote precisely the
malloc-heavy paths the patch was targeting:
- tableptr (hentry** + tablesize) → std::vector<hentry*>
- aliasf/aliasflen/aliasm (unsigned short** / char**) → std::vector
- add_word gained a bool own_aff parameter; new release_flags(astr, owned) helper
- decode_flag/encode_flag switched to std::string; lookup/hash gained size_t len
- csconv is const; remove_forbidden_flag returns void
- New public API: Hunspell_add_with_flags, Hunspell_suffix_suggest (additive — doesn't
break wasm bindings).
- Three options, in increasing effort:
- (a) Drop the patch. Justification: upstream's std::vector migration eliminates most
of the per-allocation overhead the arena was hiding. Validate by comparing
dictionary-load memory/heap-fragmentation on a heavy locale (Hungarian, the original
motivator) before/after on at least Linux + Win64.
- (b) Port a slimmer arena. Keep arena only for the residual unsigned short* flag-vector
allocations in decode_flags / aliasf's per-row vectors that upstream still mallocs.
Smaller patch, smaller regression surface.
- (c) Full port. Rewrite the arena to coexist with std::vector storage. Most invasive;
least appealing.
- My recommendation: try (a) first with a benchmark; fall back to (b) if regression seen.
Comment 1•18 days ago
•
|
||
@RyanVM: Yup, that rationale seems fine.
The rlbox sandbox will mitigate the vulnerabilities but we should try to update asap.
I think checking if the new update mitigates the fragmentation should be pretty easy: we can run a simpler version of the two tests listed in Bug 1739761
Test 1. Without the patch. For example bg requires about 23.26mb in the sandbox to perform spellchecking.
- When allocating a 16mb sandbox, spellchecking uses 11.67mb out of 16mb used and then crashes with OOM
- When allocating a 32mb sandbox, spellchecking uses 22.53mb out of 32mb used and then crashes with OOM
- When allocating a 64mb sandbox, spellchecking uses 23.26mb out of 64mb used and succeeds
Test 2. With the patch
- This reduces the number of wasm pages allocated from 64 to 47 on en_us builds, corresponding to about 1 MB of reduced memory usage.
Documenting this in case it comes up in the future
- Add the following after the
sandbox->create, i.e., after https://searchfox.org/firefox-main/source/extensions/spellcheck/hunspell/glue/RLBoxHunspell.cpp#68std::cout << "!!!!!!!!!!!Hunspell RLBox for locale: " << dpath.get() << "\n"; - Add the following at the bottom of
RLBoxHunspell::spell, i.e., after https://searchfox.org/firefox-main/source/extensions/spellcheck/hunspell/glue/RLBoxHunspell.cpp#184size_t sz = mSandbox->get_total_memory(); std::cout << "!!!!!!!!!!!Hunspell RLBox Size: " << sz << " bytes, " << sz / (64 * 1024) << " Wasm pages\n"; - Open a webpage like below with a textbox and right click to enable check spelling first, and choose the language under "Languages" with the English language then with the Bulgarian language. (you may need to install the required language dictionary extensions from the Mozilla store)
<!DOCTYPE html> <html> <head> <meta charset='utf-8'> <meta http-equiv='X-UA-Compatible' content='IE=edge'> <title>Hunspell test</title> <meta name='viewport' content='width=device-width, initial-scale=1'> </head> <body> <input type="text" value="Hello world!" /> </body> </html> - Update hunspell to the new version without the patch from Bug 1739761 and repeat steps 1 to 3
- If you see that the numbers look similar, then the upstream version no longer needs the memory arena patch. If there is a difference, we may need to spend time to figure out what to do
- Note we usually collect the "number of page" values from the opt build. The Wasm code when compiled without optimizations is much slower and heavier than the optimized versions.
| Assignee | ||
Comment 2•18 days ago
|
||
Thanks for the pointers, super helpful! Here's what I have running that test:
┌────────┬────────────────┬─────────────────┬─────────────┐
│ Locale │ v1.7.1 + arena │ v1.7.3 no arena │ Delta │
├────────┼────────────────┼─────────────────┼─────────────┤
│ en-US │ 36 pages │ 65 pages │ +29 (+81%) │
├────────┼────────────────┼─────────────────┼─────────────┤
│ bg_BG │ 487 pages │ 654 pages │ +167 (+34%) │
└────────┴────────────────┴─────────────────┴─────────────┘
So yeah, looks like we still need the arena code.
| Assignee | ||
Comment 3•18 days ago
|
||
On a related note, is there any reason we shouldn't be trying to upstream these arena changes if they improve memory usage that significantly?
Comment 4•18 days ago
•
|
||
Ok, that looks like the new changes upstream did little to improve the fragmentation. Per Bug 1739761, the English locale on 1.71 without the arena was 64 pages, which is more or less what we have here.
Re upstreaming - unfortunately, I don't know. The patch was written by Bobby and I wasn't involved. glandium may know? From a cursory look at the patch, it looks like the idea was to use a bump allocator that would defer freeing to the time when all allocations are freed. This would certainly hold on to memory for a bit longer than the original code. I'm guessing this didn't matter for the Firefox use case, although I am not familiar enough with the code to know if that actually matters for upstream.
(Also edited the documentation above to note we usually collect this in an opt build. Since we are doing a relative test here, I guess it shouldn't matter as long as both builds are opt or both are debug)
| Assignee | ||
Comment 5•18 days ago
•
|
||
Had Claude do a pass at porting the arena code:
┌────────────────────────────┬──────────┬───────────┐
│ Version │ en-US │ bg_BG │
├────────────────────────────┼──────────┼───────────┤
│ v1.7.1 + arena │ 36 pages │ 487 pages │
├────────────────────────────┼──────────┼───────────┤
│ v1.7.3 + arena 4KB chunks │ 39 pages │ 530 pages │
├────────────────────────────┼──────────┼───────────┤
│ v1.7.3 + arena 64KB chunks │ 39 pages │ 526 pages │
└────────────────────────────┴──────────┴───────────┘
Not sure if there's much room for improvement beyond that, but I'm open to suggestions anything people want to try. I'm not sure the 4KB->64KB bump is really worth it either for the very modest reduction for bg_BG.
| Assignee | ||
Comment 6•17 days ago
|
||
Claude found one simple change that wins back the rest of the memory regression from v1.7.3 (hentry struct reordering to avoid extra padding):
┌─────────────────────────────────┬────────────┬─────────────┐
│ Version │ en-US │ bg_BG │
├─────────────────────────────────┼────────────┼─────────────┤
│ v1.7.1 + arena baseline │ 36 pages │ 487 pages │
├─────────────────────────────────┼────────────┼─────────────┤
│ v1.7.3 + arena (before reorder) │ 39 pages │ 526 pages │
├─────────────────────────────────┼────────────┼─────────────┤
│ v1.7.3 + arena + hentry reorder │ 36 pages │ 483 pages │
└─────────────────────────────────┴────────────┴─────────────┘
Still poking for any other easy wins, but it looks like we can pull off taking this update without losing ground from where we were with v1.7.1 already.
Comment 7•17 days ago
•
|
||
Oh nice. That is useful --- there is a part in the rlbox support for 32-bit machines where we need to guess what the upper limit on memory usage is so we can reserve non fragmented space for the sandbox. 64-bit machines don't have to do this.
https://searchfox.org/firefox-main/source/extensions/spellcheck/hunspell/glue/RLBoxHunspell.cpp#53
We'd have to double check our heuristic here if the memory profile got significantly worse. However, it looks like that's not going to be needed.
I haven't looked at the code, but at least struct reordering changes sounds like something that can be upstreamed.
| Assignee | ||
Comment 8•17 days ago
•
|
||
| Assignee | ||
Comment 9•17 days ago
|
||
Patches:
- bug1410214.patch: regenerated for v1.7.3 filemgr.hxx (minor upstream
cleanup: <cstdio> instead of <stdio.h>, = delete instead of private). - bug1653659.patch: regenerated for v1.7.3 csutil.cxx/.hxx; also adds a
MOZILLA_CLIENT stub for unicodeisalpha (returns 0; the function is never
reached on the RLBox sandbox path but needs to compile without
utf_info.hxx, which is excluded from MOZILLA_CLIENT builds). - bug1739761.patch: dropped; will be re-added in the following commit
after rebasing for v1.7.3's significantly restructured HashMgr. - bug1838113.patch: dropped; upstream incorporated the change in v1.7.3.
| Assignee | ||
Comment 10•17 days ago
|
||
Rebases bug1739761.patch against v1.7.3's significantly restructured
HashMgr. v1.7.3 converted several malloc-heavy paths to std::vector and
added the H_OPT_OWNFLAGS bit and a release_flags() helper, requiring a
non-trivial port.
Key invariant: arena_alloc'd memory must only ever see arena_free() (a
no-op); calling free()/delete[] on arena interior pointers will crash wasm
dlmalloc since they are interior pointers into arena chunks, not the starts
of malloc'd blocks.
The arena (std::vector<std::unique_ptr<uint8_t[]>>, 64KB chunks) covers
hentry structs, flag vectors from decode_flags, flags2 in
add_hidden_capitalized_word, and aliasm strings. Runtime additions via
add/add_with_affix/remove continue to use new[]/delete[] with
H_OPT_OWNFLAGS, which is handled correctly by release_flags.
| Assignee | ||
Comment 11•17 days ago
|
||
Reorders the hentry struct fields (pointers first, then 2-byte fields, then
1-byte fields) to eliminate internal padding. On wasm32, sizeof(hentry)
drops from 24 to 20 bytes; on 64-bit native, from 40 to 32 bytes. For
Bulgarian's ~700K hentries this saves ~2.8MB of wasm linear memory, fully
recovering the remaining gap vs the v1.7.1 baseline.
RLBox sandbox memory (64-bit Linux):
v1.7.1 + arena: en-US 36 pages, bg_BG 487 pages
v1.7.3 + arena (before reorder): en-US 39 pages, bg_BG 526 pages
v1.7.3 + arena + reorder: en-US 36 pages, bg_BG 483 pages
Pure layout change, no behavior modification.
| Assignee | ||
Comment 12•17 days ago
|
||
Tests that a word added to the personal dictionary is accepted by the spell
checker and rejected again after removal. This exercises the H_OPT_OWNFLAGS
code path in HashMgr::free_table, where heap-owned flag vectors for runtime
personal dictionary additions coexist with arena-owned load-time entries.
Chrome script errors are surfaced as test failures rather than hangs. r?#spelling-checker
| Assignee | ||
Updated•17 days ago
|
| Assignee | ||
Comment 13•17 days ago
|
||
Tests that a word added to the personal dictionary is accepted by the spell
checker and rejected again after removal. This exercises the H_OPT_OWNFLAGS
code path in HashMgr::free_table, where heap-owned flag vectors for runtime
personal dictionary additions coexist with arena-owned load-time entries.
Chrome script errors are surfaced as test failures rather than hangs.
Updated•17 days ago
|
Updated•15 days ago
|
Updated•15 days ago
|
Updated•15 days ago
|
Updated•15 days ago
|
Comment 14•3 days ago
|
||
Backed out for causing hazard bustages @affixmgr.cxx.
| Assignee | ||
Updated•3 days ago
|
Comment 15•10 hours ago
|
||
Comment 16•8 hours ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/8d49a1635df8
https://hg.mozilla.org/mozilla-central/rev/1679ef1959d1
https://hg.mozilla.org/mozilla-central/rev/17702a114cde
Description
•