Closed Bug 627220 Opened 13 years ago Closed 7 years ago

Finalize external strings in the background

Categories

(Core :: JavaScript: GC, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla56
Performance Impact high
Tracking Status
firefox56 --- fixed

People

(Reporter: gal, Assigned: jandem, NeedInfo)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

External strings are currently finalized via an API callback. This is lame, at least for our current embedding, because all we end up doing in the callback is nsMemory::Free which does NS_Free which does NS_Free_P which does free. Our default engine-internal treatment is to call free as well, but we do so from a background thread, which is quite a bit faster.
External string allocation profile for a mix of cnn.com, facebook.com and slashdorg.org:

 647 0
1091 1
 467 2
1034 3
2093 4
 660 5
 576 6
1379 7
 853 8
 785 9
 676 10
1676 11
 818 12
 592 13
 453 14
 852 15
 858 16
 327 17
 186 18
 399 19
 174 20
 168 21
 157 22
 179 23
 125 24
 109 25
  42 26
  28 27
 193 28
  96 29
  74 30
  98 31
 114 32
  62 33
 243 34
  42 35
  53 36
 221 37
  54 38
  52 39
  59 40
  51 41
  45 42
 120 43
 147 44
  51 45
  56 46
 293 47
  34 48
  35 49
  55 50
 161 51
 103 52
  82 53
  62 54
  44 55
 215 56
  82 57
  56 58
  83 59
  39 60
  57 61
  55 62
  50 63
  30 64
  35 65
 106 66
  33 67
 104 68
  54 69
  96 70
  68 71
  95 72
 183 73
  56 74
 105 75
  37 76
  75 77
 136 78
  52 79
 136 80
  63 81
 100 82
  49 83
  82 84
  63 85
  71 86
  69 87
  67 88
  85 89
  31 90
  97 91
  34 92
  66 93
  30 94
  38 95
  20 96
  22 97
  35 98
  16 99
  10 100
  35 101
  39 102
  54 103
   7 104
  16 105
  20 106
   6 107
  16 108
   3 109
  10 110
   6 111
  11 112
  14 113
  21 114
   3 115
   6 116
  14 117
   6 118
   5 119
  13 120
   4 121
   3 122
   4 123
   4 124
   7 125
  13 126
   7 127
   3 128
   6 129
   1 130
   2 131
   4 132
   6 133
   3 134
   1 135
  11 136
   1 140
   3 141
   6 142
  12 144
   5 145
   1 147
   2 148
   3 152
   1 153
   3 154
   1 155
   3 156
   3 157
   2 158
   1 159
   2 160
   2 161
   1 162
   1 163
   6 166
  19 168
   1 169
   4 170
  14 171
   2 172
   1 174
  18 175
   8 182
   1 184
   6 185
   9 187
   1 188
   2 191
   6 192
   5 193
   1 194
   1 196
   5 197
   3 198
   2 199
   1 200
   2 201
  22 202
  15 204
   2 208
   1 210
  10 214
   2 218
   2 221
   1 222
   1 224
   6 225
  22 226
   1 230
   1 232
  27 234
   1 236
   1 238
  10 240
  24 242
   1 245
   1 246
   1 247
   1 248
  12 249
   1 250
   1 251
  10 252
   2 254
   1 259
   1 264
   1 266
   4 274
  10 276
   7 282
   8 285
   1 298
   1 307
   6 332
   6 360
  28 427
  10 439
   1 445
   1 452
   2 457
  11 465
  17 491
  35 504
  24 573
   2 581
   1 663
   4 666
   2 699
   1 707
   2 777
  12 827
   1 832
   1 840
   2 843
   2 857
   2 872
   2 876
   2 985
   2 1138
   2 1140
   1 1462
   2 1519
   1 1523
   2 3068
   4 3413
   2 3732
   2 4595
   1 6690
   1 7238
   2 16384
   1 138318
   6 851954
   2 990121

There is a few really huge strings here, and quite a few really short ones that we could use short strings or internal strings for, but no clear simple wins.
(note that this isn't as simple for DOM strings, those to nsStringBuffer::FromData(chars)->Release())
That Release call should, in fact, be the common case of external strings, no?  Unless the callee had an autostring....
Yeah (the data above is for both cases, it was taken in JS_NewExternalString). Both finalizers can run from a background thread though, the ref counting of the string is atomic. anygregor is working on a patch for that.
> the ref counting of the string is atomic

Yes, though note that we'd like it to stop being so...
First of all: yeah, do that. The atomic op is crazy expensive. I don't think it would hurt us here. These strings are relatively infrequent. I can ship them back in a runnable to the main thread and ref-count them down there.
Doing that is bug 585661, fwiw.
Assignee: general → nobody
Component: JavaScript Engine → JavaScript: GC
Summary: improve external string finalization → Finalize external strings in the background
Whiteboard: [qf]
Flags: needinfo?(jdemooij)
Note that this would have been harder before bug 1353758. We used to clear Gecko's string cache in the finalizer but we no longer do that. This also means we can remove the Zone* argument from the finalizer as nobody uses it.
Try looks green. I'll be traveling tomorrow, will post patches for review soon.
Blocks: 1351769
Whiteboard: [qf] → [qf:p1]
None of the finalizers use the Zone* argument and it's best not to expose it once we may be called off thread.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Attachment #8881091 - Flags: review?(sphink)
Flags: needinfo?(jdemooij)
Attachment #8881092 - Flags: review?(sphink)
Attachment #8881091 - Flags: review?(sphink) → review+
Comment on attachment 8881092 [details] [diff] [review]
Part 2 - Finalize external strings in the background

Review of attachment 8881092 [details] [diff] [review]:
-----------------------------------------------------------------

If that passes tests, this is certainly about as straightforward as you can get.
Attachment #8881092 - Flags: review?(sphink) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/efd7a75d3b34
part 1 - Remove unused Zone* argument from external string finalizers. r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/d4bc665ad39b
part 2 - Finalize external strings in the background. r=sfink
https://hg.mozilla.org/mozilla-central/rev/efd7a75d3b34
https://hg.mozilla.org/mozilla-central/rev/d4bc665ad39b
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Henri, did this actually help your encoding/decoding benchmark?  I was seeing some weirdness where profiler-reported mainthread time dropped, but wall-clock measurements did not..
Flags: needinfo?(hsivonen)
(In reply to Boris Zbarsky [:bz] from comment #16)
> Henri, did this actually help your encoding/decoding benchmark?  I was
> seeing some weirdness where profiler-reported mainthread time dropped, but
> wall-clock measurements did not..

Some wall clock times become better (Decode js) but others became worse by quite a bit (Decode en, Decode de, Decode fr).

## Before

What 	Language 	Encoding 	MiB/s 	Mi u16s / s 	ns / iteration
Decode	ar	utf-8	256	217	1173902
Decode	ar	windows-1256	258	258	1008361
Decode	cs	utf-8	323	315	984831
Decode	cs	windows-1250	339	339	934001
Decode	de	utf-8	400	395	759988
Decode	de	windows-1252	419	419	732955
Decode	el	utf-8	167	146	1388130
Decode	el	windows-1253	165	165	1267433
Decode	en	utf-8	409	408	1646406
Decode	en	windows-1252	427	427	1587302
Decode	fr	utf-8	469	463	1911695
Decode	fr	windows-1252	468	468	1913543
Decode	he	utf-8	242	207	1036547
Decode	he	windows-1255	159	159	1392141
Decode	ja	utf-8	177	147	1335255
Decode	ja	shift_jis	142	130	1538963
Decode	ko	utf-8	138	119	1138206
Decode	ko	euc-kr	129	120	1167464
Decode	pt	utf-8	371	367	1524078
Decode	pt	windows-1252	370	370	1529838
Decode	ru	utf-8	265	227	2257596
Decode	ru	windows-1251	275	275	1882883
Decode	th	utf-8	377	308	2467706
Decode	th	windows-874	391	391	1961111
Decode	tr	utf-8	295	286	947936
Decode	tr	windows-1254	317	317	872234
Decode	vi	utf-8	272	257	2040392
Decode	vi	windows-1258	337	337	1755556
Decode	zh_cn	utf-8	312	274	953178
Decode	zh_cn	gb18030	242	225	1158397
Decode	zh_tw	utf-8	310	271	963019
Decode	zh_tw	big5	271	253	1045893
Decode	ja	euc-jp	150	137	1456071
Decode	ja	iso-2022-jp	79.43	67.29	2963345
Decode	bin	x-user-defined	174	174	73810
Decode	js	utf-8	188	188	440061
Decode	js	big5	190	190	434660
Decode	js	euc-kr	182	182	455065
Decode	js	gb18030	191	191	432953
Decode	js	gbk	188	188	439677
Decode	js	shift_jis	192	192	431096
Decode	js	euc-jp	188	188	440894
Decode	js	iso-2022-jp	77.66	77.66	1064802
Decode	js	windows-1250	192	192	430904
Decode	js	windows-1251	178	178	465639
Decode	js	windows-1252	189	189	436603
Decode	js	windows-1253	188	188	439229
Decode	js	windows-1254	185	185	446721
Decode	js	windows-1255	185	185	446721
Decode	js	windows-1256	186	186	445423
Decode	js	windows-1257	185	185	445833
Decode	js	windows-1258	181	181	456131
Decode	js	windows-874	192	192	430392
Decode	js☃	utf-8	187	187	442646
Encode	ar	utf-8	134	74.82	164431
Encode	cs	utf-8	133	121	347444
Encode	de	utf-8	246	242	298258
Encode	el	utf-8	133	73.96	232621
Encode	en	utf-8	274	274	224858
Encode	fr	utf-8	169	163	805228
Encode	he	utf-8	129	74.16	365266
Encode	ja	utf-8	190	68.02	163995
Encode	ko	utf-8	138	59.75	138347
Encode	pt	utf-8	205	199	316752
Encode	ru	utf-8	141	78.94	789866
Encode	th	utf-8	191	68.41	925751
Encode	tr	utf-8	135	122	395010
Encode	vi	utf-8	114	87.60	725922
Encode	zh_cn	utf-8	188	69.64	156411
Encode	zh_tw	utf-8	175	64.83	168321
Encode	js	utf-8	314	314	263765

## After

What 	Language 	Encoding 	MiB/s 	Mi u16s / s 	ns / iteration
Decode	ar	utf-8	171	145	1750520
Decode	ar	windows-1256	204	204	1277857
Decode	cs	utf-8	237	231	1340804
Decode	cs	windows-1250	245	245	1293655
Decode	de	utf-8	260	257	1171449
Decode	de	windows-1252	275	275	1116803
Decode	el	utf-8	203	178	1138713
Decode	el	windows-1253	205	205	1020184
Decode	en	utf-8	302	301	2233333
Decode	en	windows-1252	299	299	2266686
Decode	fr	utf-8	303	299	2960871
Decode	fr	windows-1252	305	305	2931200
Decode	he	utf-8	189	162	1329025
Decode	he	windows-1255	198	198	1115323
Decode	ja	utf-8	214	177	1103170
Decode	ja	shift_jis	172	157	1272135
Decode	ko	utf-8	154	133	1019358
Decode	ko	euc-kr	155	144	973668
Decode	pt	utf-8	270	267	2093879
Decode	pt	windows-1252	278	278	2040381
Decode	ru	utf-8	213	182	2818242
Decode	ru	windows-1251	217	217	2384516
Decode	th	utf-8	275	224	3387314
Decode	th	windows-874	267	267	2877903
Decode	tr	utf-8	214	207	1307787
Decode	tr	windows-1254	220	220	1255750
Decode	vi	utf-8	214	202	2593434
Decode	vi	windows-1258	251	251	2353548
Decode	zh_cn	utf-8	235	206	1265369
Decode	zh_cn	gb18030	193	179	1456071
Decode	zh_tw	utf-8	236	207	1263775
Decode	zh_tw	big5	203	190	1393010
Decode	ja	euc-jp	181	165	1209320
Decode	ja	iso-2022-jp	92.76	78.58	2537500
Decode	bin	x-user-defined	220	220	58543
Decode	js	utf-8	229	229	360997
Decode	js	big5	222	222	372837
Decode	js	euc-kr	223	223	370419
Decode	js	gb18030	223	223	370731
Decode	js	gbk	218	218	378857
Decode	js	shift_jis	216	216	383438
Decode	js	euc-jp	222	222	371745
Decode	js	iso-2022-jp	91.11	91.11	907597
Decode	js	windows-1250	231	231	357297
Decode	js	windows-1251	218	218	378663
Decode	js	windows-1252	230	230	360143
Decode	js	windows-1253	222	222	372268
Decode	js	windows-1254	225	225	368226
Decode	js	windows-1255	221	221	374879
Decode	js	windows-1256	219	219	378375
Decode	js	windows-1257	222	222	371812
Decode	js	windows-1258	219	219	377049
Decode	js	windows-874	219	219	378315
Decode	js☃	utf-8	238	238	347533
Encode	ar	utf-8	133	74.49	165170
Encode	cs	utf-8	135	122	342971
Encode	de	utf-8	238	234	308094
Encode	el	utf-8	139	77.25	222716
Encode	en	utf-8	282	281	219035
Encode	fr	utf-8	165	159	825270
Encode	he	utf-8	134	76.96	351947
Encode	ja	utf-8	194	69.46	160589
Encode	ko	utf-8	138	60.03	137715
Encode	pt	utf-8	207	201	313627
Encode	ru	utf-8	142	79.36	785702
Encode	th	utf-8	199	71.29	888390
Encode	tr	utf-8	136	123	392599
Encode	vi	utf-8	116	89.27	712368
Encode	zh_cn	utf-8	185	68.48	159075
Encode	zh_tw	utf-8	185	68.39	159573
Encode	js	utf-8	319	319	259043
Flags: needinfo?(hsivonen)
Hrm.  So I'm left wondering whether this fix is a win in general at all, given the behavior described in bug 1375566.  :(
Flags: needinfo?(sphink)
Flags: needinfo?(jdemooij)
(In reply to Boris Zbarsky [:bz] from comment #18)
> Hrm.  So I'm left wondering whether this fix is a win in general at all,
> given the behavior described in bug 1375566.  :(

So the problem is that instead of sweeping, we now spend a lot of time waiting for the background thread, right?

We use background sweeping for a lot of other GC things (like non-external strings) and if there's an issue with that, it likely affects those other things as well.

sfink or jonco can probably say more about this..
Flags: needinfo?(jdemooij)
Flags: needinfo?(jcoppeard)
> we now spend a lot of time waiting for the background thread, right?

Well, it's _a_ problem.  I don't know whether it's the only problem in this case.

In particular, I can see how that problem would make this change not improve performance.  It's not clear to me why performance would actually become "much worse" as comment 17 describes.  That seems pretty fishy to me.  Maybe we're ending up with significant contention on the malloc locks or something?
Could it be because of reasons along the lines of Julian's findings in bug 1291355 comment 26?  jemalloc's multi-core performance is terrible: having a background thread call into jemalloc more often can certainly make malloc a lot slower, not due to lock contention, but due to cache line movement since all data is shared across all cores basically.

Perhaps Julian can help?
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog, Away Aug 7-8) from comment #21)
> Could it be because of reasons along the lines of Julian's findings in bug
> 1291355 comment 26?  [..]  Perhaps Julian can help?

I can run some low level memory-system (MESI, etc) profiles to see what's
going on, if someone can give me STR for the alleged slowdown.  In other
words, what test case + configuration should I measure?
Flags: needinfo?(ehsan)
The testcase is https://hsivonen.com/test/moz/encoding_bench_web/english-only.html and then clicking the "Run" button.
I hope comment 23 answers your question Julian.  :-)  Thanks a lot for your help!
Flags: needinfo?(ehsan)
I can't replicate the slowdown, at least using a local copy of the test,
which has been set up to run 4 times back-to-back, per hsivonen's guidance.

I am comparing the following versions:

  without (patch) = 366117:638029b68a85
  with            = 366119:d4bc665ad39b

Measured rates (Encoding MiB/sec)

  117:  699 700 696
  119:  771 772 767

This is on x86_64 Linux, Fedora 25, non-debug build, -g -O2.
Performance Impact: --- → P1
Whiteboard: [qf:p1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: