Closed Bug 926980 Opened 11 years ago Closed 8 years ago

Load mac ICU data from a separate file so that we don't have two copies of it in universal builds

Categories

(Firefox Build System :: General, defect, P2)

x86
macOS
defect

Tracking

(firefox27+ wontfix)

RESOLVED FIXED
Tracking Status
firefox27 + wontfix

People

(Reporter: benjamin, Assigned: ted)

References

Details

(Whiteboard: p=8)

Attachments

(1 obsolete file)

The ICU data is currently being compiled into the mozjs binary. This is probably fine on windows/linux, but on mac it means we're keeping two copies of the data. We should instead build the data as a separate .dat file which is loaded dynamically (ICU already has support for this, so this is primarily a plumbing question).
OS: Linux → Mac OS X
Hardware: x86_64 → x86
This will negatively impact the FF27 download size, if left unresolved.
Ugh. Apparently I've been glancing over the notification emails about this bug because the subject line doesn't seem relevant.

Anyway, I don't think this should be tracking 27 because a beta uplift for the eventual fix is IMO too risky. I will also challenge the importance of fixing this. While having the ICU data packaged twice is certainly suboptimal, I suspect compression in the DMG will minimize the impact on download size. However, I haven't verified this.

Can we bump the tracking down to a newer version? I can try to obtain size data tomorrow (it's almost midnight now) if you need it to make a decision.
(In reply to Gregory Szorc [:gps] from comment #2)
> Ugh. Apparently I've been glancing over the notification emails about this
> bug because the subject line doesn't seem relevant.
> 
> Anyway, I don't think this should be tracking 27 because a beta uplift for
> the eventual fix is IMO too risky. I will also challenge the importance of
> fixing this. While having the ICU data packaged twice is certainly
> suboptimal, I suspect compression in the DMG will minimize the impact on
> download size. However, I haven't verified this.
> 
> Can we bump the tracking down to a newer version? I can try to obtain size
> data tomorrow (it's almost midnight now) if you need it to make a decision.

Hi :gps,

Have you had a chance to check on the compare the sizes yet ? Would our couple of FF27 Beta's that have been spun give you any helpful data to help here ?
Flags: needinfo?(gps)
Beta builds don't include Intl or ICU, because of bug 924839 and attachment 8345073 [details] [diff] [review] (which disables Intl in beta/release builds), so beta is unaffected by any duplication in play here.
Double-checked this -- a build from ftp://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/27.0b4-candidates/build1/linux-x86_64/en-US/ gives "ReferenceError: Intl is not defined" in the console, so no effect at all from this on beta.
Flags: needinfo?(gps)
Priority: -- → P2
wontfixing this given the investigation in comment #4 and comment #5. Please re-open if needed.
Has there been any decision on whether this is desirable? After ICU landed, we can't link XUL on OS X/ppc because the .dat file bulges things by larger than the maximum displacement of a branch instruction (16MB), so we have to build --without-intl-api for now. Using an external .dat file would probably solve the problem.

If there is little interest, we can implement this ourselves; I just want to know where the best place to hack it into the build system might be.
You can probably build with MOZ_SHARED_ICU=1.
I didn't know that was still possible. Thanks!
Whiteboard: p=0
Status: NEW → ASSIGNED
Whiteboard: p=0 → p=13 s=it-31c-30a-29b.2
Whiteboard: p=13 s=it-31c-30a-29b.2 → p=13
Assignee: gps → nobody
Whiteboard: p=13 → p=8
Status: ASSIGNED → NEW
No longer blocks: fxdesktopbacklog
Flags: firefox-backlog+
Still there are some regressions in cpp unit tests though, with separated ICU data file, I confirmed filesize decrease both in dmg (2.5MB) and extracted total files (9.8MB):

file         |  original   |  patched    |  increase/decrease
-------------+-------------+-------------+---------------------------
dmg          |  98,300,400 |  95,656,651 | -  2,643,749 (2.5MB, 2.7%)
             |             |             |
XUL          | 260,032,832 | 239,332,864 | - 20,699,968 (19.7MB, 8%)
icudt52l.dat |           - |  10,386,304 | + 10,386,304 (9.9MB)

  patched
    https://treeherder.mozilla.org/#/jobs?repo=try&revision=a21bdc998376
  original
    https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-central&revision=0500f1032ea1

Also, confirmed that icudt52l.dat works well both with i386 and x86_64, not sure ppc's case ("l" in icudt52l means "little endian"), but it's no more supported and doesn't matter, right?
We may not be a tier-1 port anymore, but we're definitely still out there making PowerPC versions for OS X (TenFourFox and Tenfourbird). If there's an endian problem we'll work around it; just don't break us unnecessarily!
(In reply to Cameron Kaiser [:spectre] from comment #11)
> We may not be a tier-1 port anymore, but we're definitely still out there
> making PowerPC versions for OS X (TenFourFox and Tenfourbird). If there's an
> endian problem we'll work around it; just don't break us unnecessarily!

Oops, sorry for my ignorance, and thank you for letting me know that.

Currently the patch has configure option to disable separated ICU data file (--with-icu-data-archive=no).
Also, it could be possible to store two or more ICU data files with some minor tweak for build scripts,
since ICU API accepts a directory path for the ICU data file, instead of a individual file path.
Blocks: 1239083
I looked into this in the context of bug 1239083 and ICU is very flexible here:
http://userguide.icu-project.org/icudata

We could ship the .dat file as a file and point ICU at the directory containing it, or we could bundle it in omni.ja and call udata_setCommonData to pass a pointer to the data blob in memory.
should we apply same change to other OS/arch at once in this bug?
or apply to OSX first and do others in separated bug?
Flags: needinfo?(ted)
I'm going to look at making this change for all platforms.
Assignee: nobody → ted
Flags: needinfo?(ted)
Okay, I've got a WIP patch that works on Linux, that Try push is to see how it fares on other platforms. This patch configures ICU --with-data-packaging=archive, which builds an "icudt52.dat" file containing the data instead of linking it into a shared library. This works, but I don't know if we'd rather ship it in omni.ja or not.

https://hg.mozilla.org/try/rev/7754651a2957
Comment on attachment 8714787 [details]
MozReview Request: bug 926980 - load ICU data from an archive file. r?glandium,waldo

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/33241/diff/1-2/
I had mucked up the package-manifest in the first Try run, the second one looks slightly better, but it's still broken packaging on Mac, probably because Mac needs special handling for app bundles.
Comment on attachment 8714787 [details]
MozReview Request: bug 926980 - load ICU data from an archive file. r?glandium,waldo

https://reviewboard.mozilla.org/r/33241/#review30075

You'll also want some other reviewers for the C++ code parts once you're all done with the FIXMEs and TODOs.

::: build/autoconf/icu.m4:82
(Diff revision 2)
> +    ICU_DATA_FILE="icudt${version}l.dat"

This should only be set when not building against system ICU, I guess...

BTW, tell me if I got this right: we still end up with a icudt/icudata dll, which itself loads the .dat file?

::: config/external/icu/Makefile.in:22
(Diff revision 2)
> +	#FIXME

#FIXME :)

::: js/src/shell/js.cpp:6981
(Diff revision 2)
> +#if EXPOSE_INTL_API
> +    std::string binpath = sArgv[0];
> +#if defined(XP_WIN)
> +    char* bindir = &binpath[0];
> +    PathRemoveFileSpecA(bindir);
> +#else
> +    char* bindir = dirname(&binpath[0]);
> +#endif
> +    u_setDataDirectory(bindir);
> +#endif

This should presumably be conditional on not building against system ICU.

::: js/src/tests/lib/jittests.py:587
(Diff revision 2)
> -                     'libplc4.so', 'libplds4.so']
> +                     'libplc4.so', 'libplds4.so', 'icudt56l.dat']

This will require a manual change for every icu version bump :(
Attachment #8714787 - Flags: review?(mh+mozilla)
https://reviewboard.mozilla.org/r/33241/#review30075

> This should only be set when not building against system ICU, I guess...
> 
> BTW, tell me if I got this right: we still end up with a icudt/icudata dll, which itself loads the .dat file?

ICU has a "stubdata" library that it builds when you build --with-data-packaging=archive:
https://dxr.mozilla.org/mozilla-central/source/intl/icu/source/stubdata/stubdata.c

The other libraries want to link against that `icudt56` symbol, so it provides it, but then they fall back to loading from the data file. Kind of janky. Chromium has a little patch to link stubdata.c into the icu common library:
https://chromium.googlesource.com/chromium/deps/icu/+/42c58d4e49f2250039f0e98d43e0b76e8f5ca024/icu.gyp#344

...although I'm not really sure what configuration they're shipping.

> #FIXME :)

AFAIK we don't build this codepath right now (non-shared-ICU on Windows), and the patch in bug 1239083 removes this entire Makefile anyway. Do you mind if I just leave this broken?

> This will require a manual change for every icu version bump :(

Yeah, this sucks. We could do some hackery to embed the data into the js shell binary, that might make life simpler there.
https://reviewboard.mozilla.org/r/33241/#review30075

> AFAIK we don't build this codepath right now (non-shared-ICU on Windows), and the patch in bug 1239083 removes this entire Makefile anyway. Do you mind if I just leave this broken?

I fixed this by turning this into an $(error). I plan to try to remove the shared-ICU configuration entirely, which will get rid of the bits here we're not using and simplify this.

> Yeah, this sucks. We could do some hackery to embed the data into the js shell binary, that might make life simpler there.

I looked into this, and there's no way to override this filename without patching ICU, it's hardcoded in a header:
https://dxr.mozilla.org/mozilla-central/rev/584870f1cbc5d060a57e147ce249f736956e2b62/intl/icu/source/common/unicode/utypes.h#137

I think we can do something smarter here like linking the ICU data into the shell binary, but I'd rather try to fix that as part of bug 1239083 where I don't have to work around ICU's build system.
New patch incoming shortly, I fixed the OS X universal build and made some changes to better support standalone JS. Notably, if we're building JS standalone, I made it so we continue building ICU as we currently are, so that the ICU data gets linked against the shell etc, so that people building just the shell or embedders don't have to worry about calling u_setDataDirectory to make things work. There's a variable MOZ_ICU_DATA_ARCHIVE that indicates whether we're using a separate ICU data archive file or not.

I also gave in and added the u_setDataDirectory to jsapi-tests' main() to make that work for now. I would like to make things less bad, but like I said in my previous comment I would rather tackle that as part of the moz.build work because it'll be easier.
https://reviewboard.mozilla.org/r/33241/#review30501

::: js/src/shell/js.cpp:6989
(Diff revision 2)
> +    u_setDataDirectory(bindir);

I'm not happy with forcing embedders to call u_setDataDirectory themselves.  We've never required embedders to call any ICU functions themselves.  Instead, we provide such operations through our own JSAPI functions that call them, e.g. JS_SetICUMemoryFunctions.  This also lets us document where/when these functions should be called, e.g. JS_SetICUMemoryFunctions can *only* be called before JS_Init.

Here, it looks like u_setDataDirectory should be performed by JS_Init: a called-once function that every embedder must call to use the JS engine.  But to be honest, I don't really understand why this sort of thing has to be done by the embedder at all.  ICU is part of SpiderMonkey and is managed/maintained by it.  Why isn't that also true of the data file?  It seems like JS_Init should remain argument-less, and inside it should be calling u_setDataDirectory and passing the location of the SpiderMonkey library as an argument to it, then finding the sibling data file in the directory, more or less.  I'd really prefer we didn't foist data-finding responsibilities on every embedder, if at all possible.

Also regarding adding this to JS_Init: by not adding it there, you're missing all the other JS_Init users.  In particular, it seems very likely that by not changing how jsapi-tests are set up, you've broken js/src/jsapi-tests/testIntlAvailableLocales.cpp (run as $objdir/js/src/jsapi-tests/jsapi-tests testIntlAvailableLocales, if memory serves).  Make sure to give that a shot locally before trying to run with this.

::: js/src/tests/lib/jittests.py:587
(Diff revision 2)
> -                     'libplc4.so', 'libplds4.so']
> +                     'libplc4.so', 'libplds4.so', 'icudt56l.dat']

Add this version-bump change requirement to the NOTE at the end of intl/update-icu.sh, so that people have a fighting chance of running across it during an upgrade.
https://reviewboard.mozilla.org/r/33241/#review30501

> I'm not happy with forcing embedders to call u_setDataDirectory themselves.  We've never required embedders to call any ICU functions themselves.  Instead, we provide such operations through our own JSAPI functions that call them, e.g. JS_SetICUMemoryFunctions.  This also lets us document where/when these functions should be called, e.g. JS_SetICUMemoryFunctions can *only* be called before JS_Init.
> 
> Here, it looks like u_setDataDirectory should be performed by JS_Init: a called-once function that every embedder must call to use the JS engine.  But to be honest, I don't really understand why this sort of thing has to be done by the embedder at all.  ICU is part of SpiderMonkey and is managed/maintained by it.  Why isn't that also true of the data file?  It seems like JS_Init should remain argument-less, and inside it should be calling u_setDataDirectory and passing the location of the SpiderMonkey library as an argument to it, then finding the sibling data file in the directory, more or less.  I'd really prefer we didn't foist data-finding responsibilities on every embedder, if at all possible.
> 
> Also regarding adding this to JS_Init: by not adding it there, you're missing all the other JS_Init users.  In particular, it seems very likely that by not changing how jsapi-tests are set up, you've broken js/src/jsapi-tests/testIntlAvailableLocales.cpp (run as $objdir/js/src/jsapi-tests/jsapi-tests testIntlAvailableLocales, if memory serves).  Make sure to give that a shot locally before trying to run with this.

I fully agree! As written, this patch doesn't require embedders to do anything extra. This `u_setDataDirectory` call is only required when building JS as part of Gecko. For standalone JS builds things will continue to work as they do now.

Making JS_Init do the right thing here is hard, mostly because there just isn't an easy way to get the directory in question, which is why I resorted to doing this somewhere with access to `argv[0]`. If you've got a suggestion that you think will work reliably I'm happy to implement that instead.

All that being said, I think I can get rid of these changes as part of bug 1239083--ICU is actually pretty flexible about its data storage, it's just hard to do in practice here when I also have to wrangle with the terrible ICU build system. Consider parts of this patch a half-measure until I finish the job there.

I will check that jsapi-tests test, but note that I did make the same change to jsapi-tests' `main`.

> Add this version-bump change requirement to the NOTE at the end of intl/update-icu.sh, so that people have a fighting chance of running across it during an upgrade.

Thanks, will do.
Blocks: 1247396
Attachment #8714787 - Attachment description: MozReview Request: bug 926980 - WIP: load ICU data from an archive file. r?glandium → MozReview Request: bug 926980 - load ICU data from an archive file. r?glandium,waldo
Attachment #8714787 - Flags: review?(mh+mozilla)
Attachment #8714787 - Flags: review?(jwalden+bmo)
Comment on attachment 8714787 [details]
MozReview Request: bug 926980 - load ICU data from an archive file. r?glandium,waldo

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/33241/diff/2-3/
https://reviewboard.mozilla.org/r/33241/#review30501

> I fully agree! As written, this patch doesn't require embedders to do anything extra. This `u_setDataDirectory` call is only required when building JS as part of Gecko. For standalone JS builds things will continue to work as they do now.
> 
> Making JS_Init do the right thing here is hard, mostly because there just isn't an easy way to get the directory in question, which is why I resorted to doing this somewhere with access to `argv[0]`. If you've got a suggestion that you think will work reliably I'm happy to implement that instead.
> 
> All that being said, I think I can get rid of these changes as part of bug 1239083--ICU is actually pretty flexible about its data storage, it's just hard to do in practice here when I also have to wrangle with the terrible ICU build system. Consider parts of this patch a half-measure until I finish the job there.
> 
> I will check that jsapi-tests test, but note that I did make the same change to jsapi-tests' `main`.

Oh, apparently I made the change to jsapi-tests but didn't update the review with that version of the patch.
https://reviewboard.mozilla.org/r/33241/#review30075

> I fixed this by turning this into an $(error). I plan to try to remove the shared-ICU configuration entirely, which will get rid of the bits here we're not using and simplify this.

Patch for removing MOZ_SHARED_ICU is in bug 1247396.
Comment on attachment 8714787 [details]
MozReview Request: bug 926980 - load ICU data from an archive file. r?glandium,waldo

https://reviewboard.mozilla.org/r/33241/#review31361

I think you should rebase on top of the MOZ_SHARED_ICU removal, instead of the other way around.

::: build/autoconf/icu.m4:112
(Diff revision 3)
> +        dnl JS standalone so that embeddors don't have to deal with it.

embedders?

::: build/autoconf/icu.m4:115
(Diff revision 3)
> +            #TODO: the l is actually endian-dependent

I bet gaston will come to fix this when it breaks his openbsd sparc builds.

::: build/autoconf/icu.m4:118
(Diff revision 3)
> +        else
> +            # In this case we won't link to the stubdata library.
> +            ICU_LIB_NAMES="$ICU_LIB_NAMES $ICU_DATA_LIB"

You should test -n $JS_STANDALONE, so that this branch ends up close to the comment on lines 111-112.

::: build/autoconf/icu.m4:189
(Diff revision 3)
> +                HOST_ICU_BUILD_OPTS="$HOST_ICU_BUILD_OPTS --with-data-packaging=archive"

All the host ICU part should go away.
Attachment #8714787 - Flags: review?(mh+mozilla)
https://reviewboard.mozilla.org/r/33241/#review31361

I kinda didn't want to deal with that headache since I already had this patch mostly working. Is there value there aside from the patches winding up slightly simpler?

> I bet gaston will come to fix this when it breaks his openbsd sparc builds.

He's gonna be even more broken by bug 1239083, since we're only going to have a little-endian file in-tree. I guess we can just error in configure and require that big-endian builds use --with-system-icu or --disable-intl-api.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #34)
> > I bet gaston will come to fix this when it breaks his openbsd sparc builds.
> 
> He's gonna be even more broken by bug 1239083, since we're only going to
> have a little-endian file in-tree. I guess we can just error in configure
> and require that big-endian builds use --with-system-icu or
> --disable-intl-api.

Why can't there be a big-endian file in the tree as well? We have other big-endian NPOTB stuff.
No real reason other than it's an extra 10MB file and I don't know how to generate one without building on a big-endian machine currently.
I can probably do that in a separate bug if :gaston doesn't get to it first, if you're willing to accept it.
I think we could take it. That'd be part of bug 1239083. This patch probably just needs a tiny check to use 'b' or 'l' depending on target endianness. I punted on that because we didn't have an existing endianness check in configure anywhere.
Attachment #8714787 - Flags: review?(jwalden+bmo)
Comment on attachment 8714787 [details]
MozReview Request: bug 926980 - load ICU data from an archive file. r?glandium,waldo

https://reviewboard.mozilla.org/r/33241/#review31849

::: build/autoconf/icu.m4:112
(Diff revision 3)
> +        dnl JS standalone so that embeddors don't have to deal with it.

*grmbls and cantankers about the goofy "embeddor" spelling versus embedder*

::: js/src/jsapi-tests/tests.cpp:124
(Diff revision 3)
> +#endif

Okay.  I guess we do have to foist work that's something like this onto every caller.

But, I still want the u_setDataDirectory call pushed into JS_Init.  JSAPI's ability to provide a useful Intl API is implicated here.  The JS engine provides Intl functionality via ICU.  The JS engine initializes ICU.  It's responsible for setting ICU's memory functions, for memory-reporting purposes.  It should be responsible for making that call.

JS_Init should take a const char\* in all build configurations.  Then, #ifdef MOZ_ICU_DATA_ARCHIVE, assert the pointer is non-null with an assertion-reason-string indicating the user didn't pass a data location and make the u_setDataDirectory call.  Otherwise, or if !EXPOSE_INTL_API, assert the pointer is null.  The u_setDataDirectory bits should go within the #if EXPOSE_INTL_API in JS_Init, directly before the u_init call.
Attachment #8714787 - Attachment is obsolete: true
This patch got folded into the patch in bug 1239083.
No longer blocks: 1239083, 1247396
Depends on: 1239083
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: