Closed Bug 108134 Opened 24 years ago Closed 22 years ago

history database is endian; corrupts the display of titles

Categories

(Core Graveyard :: History: Global, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmorzins, Assigned: rbasch)

References

()

Details

Attachments

(3 files)

This bug is best experienced with screenshots, so I've put some up at http://web.mit.edu/jmorzins/www/mozilla/history-endian.html What's wrong: Mozilla stores titles into the history database in an endian-dependent fashion. What happens: If a user switches between little-endian machines (like PCs) and big-endian machines (like Sun workstations), they will only be able to read half of their history titles at a time. The little-endian machine is able to read the little-endian titles, and big-endian ones come out looking like garbled Chinese. Likewise, the big-endian machine will be able to read the big-endian titles, with little-endian titles coming out as garbled Chinese. What should have happened: Mozilla should either store names in an endian-independent fashion, or should tag the Unicode strings in a way that allows Mozilla to recognize what endianness the strings are. Got any other information? I don't understand the format of the history.dat file, but if it would be useful to you I can provide mine.
No dupe, can we please get this off UNCO and into something else ??
This bug will be trivial for anyone at a large Unix site to duplicate, but perhaps there aren't that many of us left. I'm attaching my history.dat file to this bug report, in case examining it will help developers believe this bug report. To reproduce the bug: 1) On a big-endian machine, visit a web page. For example, http://web.mit.edu/jmorzins/www/mozilla/big-endian.html 2) Quit mozilla (to save your history). 3) One a little-endian machine, visit a web page. For example, http://web.mit.edu/jmorzins/www/mozilla/little-endian.html 4) Quit mozilla (to save your history). 5) Now re-start mozilla. Notice that: (a) if you type URLs into the location bar, the pop-up autocompletion menu has garbled (Chinese!) titles for other-endian platforms (b) if you view your history, the list of titles in the history has garbled (Chinese!) titles for other-endian platforms.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: mozilla1.1
Here's that bug I mentioned.... Looks like there's no problem with the URL itself (I hate having a lossy memory). And titles are fucked up whether or not they are ascii. So the base-64-decoding hypothesis seems like the most likely one....
Looking at the example history.dat at the end of http://web.mit.edu/jmorzins/www/mozilla/history-endian.html, it is in fact some kind of encoding of UCS-2, not base64-encoded UTF-8, as I said before. (I was thinking of the export format, not the internal one). The odd thing is that I don't see a difference in endian-ness between the lines that show up differently in the screenshots.
Hm. I don't remember if I copied the 30 excerpted lines of my history.dat file at the same moment that I generated the screen shots. If I copied them after some intervening web browsing, the endianness of page titles visible in the excerpt might not correspond to the endianness of titles visible in the screen shots. I've been more careful about only visiting the "big-endian.html" URL when browsing from a big-endian platform, and the "little-endian.html" URL when browsing from a little-endian platform. At present my history.dat file contains the following six entries for those two pages. Notice the difference between the byte order of the 645E9 field and the 645F9 field. (645E7 =http://web.mit.edu/jmorzins/www/mozilla/big-endian.html)(645E8 =1027946404095809)(645E9=$00v$00i$00e$00w$00e$00d$00 $00o$00n$00 $00b$00i$00\ g$00-$00e$00n$00d$00i$00a$00n)(645F7 =http://web.mit.edu/jmorzins/www/mozilla/little-endian.html)(645F8 =1027946493690063)(645F9 =v$00i$00e$00w$00e$00d$00 $00o$00n$00 $00l$00i$00t$00t$00l$00e$00-$00e$00n\ $00d$00i$00a$00n$00)
*** Bug 158127 has been marked as a duplicate of this bug. ***
This patch to xpfe/components/history/src/nsGlobalHistory.cpp fixes bug 108134 (titles in history.dat are byte order dependent), by inserting and detecting a byte order marker in the title string. (The patch is against Mozilla 1.2.1). If the byte order marker is byte-reversed, the resulting Unicode string is byte-swapped.
To robert (and thanks for the patch!)
Assignee: blaker → rbasch
Comment on attachment 117883 [details] [diff] [review] Patch to use byte order marker in history.dat titles + PRUnichar c = (((p[i] >> 8) & 0xff) | (p[i] << 8)); Would it be better to have |(p[i] & 0xff) << 8| at the end there? I guess we're assuming that PRUnichar is 16-bit anyway, so this is safe... alec, would you review, please?
Attachment #117883 - Flags: superreview+
Attachment #117883 - Flags: review?(alecf)
If PRUnichar is not 16 bits, then it should not get to the byte-swap case. (And byte-swapping a 32-bit value, for example, would have to be done differently anyway). I suppose it wouldn't hurt to change it to "(p[i] & 0xff) << 8", or to check the size, but, while I am admittedly a novice with this code, it looks like there will be other problems if PRUnichar is not a 16-bit value.
Yeah, the codebase is littered with assumptions that PRUnichar is 16-bit...
Comment on attachment 117883 [details] [diff] [review] Patch to use byte order marker in history.dat titles oh man this is so wrong! do NOT check this in first of all, the yarn belongs to mork, and is supposed to be read-only. do not attempt to tweak the data in the yarn. Second of all, you're putting the BOM at the front of every title string in the database? that seems like a total waste. I think you need to set an attribute in the file indicating the endianness of the file, and then converting as appropriate whenever you read in the title. the endianness of the file should be set the first time the DB is created.. then when the file gets passed back and forth between PCs and Macs, then the machine can make the decision about how to do the conversion. We really need to get to a world where we're using UTF8... *sigh* (that's a whole other bug dealing with corrupting of short titles with non-zero mYarn_Forms when you compact data)
Attachment #117883 - Flags: review?(alecf) → review-
> first of all, the yarn belongs to mork, and is supposed to be read-only. The patch code does not modify the yarn buffer; it converts the characters read from it as it stores them into the result, analogous to the way PRInt64 data is converted from/to a string. > the endianness of the file should be set the first time the DB is created.. I thought about doing something along this line, but thought it was too problematic for dealing with entries in an existing file. I would also be a bit concerned about the possible "waste" of doing excessive conversions when the user switches endianness. (In our environment, users commonly switch back and forth between big and little-endian machines, which share home directories). > We really need to get to a world where we're using UTF8... *sigh* (that's a > whole other bug dealing with corrupting of short titles with non-zero > mYarn_Forms when you compact data) I gathered that something like this was the case, from comments in the code. How feasible would it be to address that bug in order to solve this problem?
ccing alecf so he'll see the comments and all... ;)
I think it will be helpful to re-read the initial report, before talking about passing a file back and forth between Macs and PCs. The file is on a network filesystem, and never gets passed anywhere. Instead, a client mounts a network filesystem and starts reading the file from the network. It would be delightful if the endianness of the file were fixed. The problem is that the endianness of the *Mozilla client* is not fixed. Intel-Unix Mozilla's assume that the file is of one endianness, and write data into the file in little-endian fashion. Sparc-Unix Mozilla's assume that the file is another endianness, and write data into the file in big-endian fashion. After a few days of browsing, switching platforms randomly each day, a user will be left with a throroughly mixed-up database.
I've read through the comments - it doesn't matter how this file makes it from one platform to the next, etc.. as for your issue of waste when you switch from big to little endian and back, what seems wasteful is to optimize for the case where the user does switch back and forth. when 99.99% of the users never switch platforms (let alone share history.dat between platforms) I don't think we should be optimizing for those 0.01% of the users. If we simply flag the file as "this is how you should be writing to the file" then we don't have to encode 0xFFEF into every title string... it means we MAY break backwards compatibility with any entries that are already in the history file, but again that seems like the right case to optimize for. it seems quite reasonable to add something to the release notes "when upgrading from mozilla 1.3 to mozilla 1.4 and then switching from one platform to another, page titles created with the original platform may be corrupt until they are visited again" i.e. we would check "does this history file have an endianness flag?" if not, we set it to the current machine's endianness and just leave the old wrong-endianness titles as they are. Then we just let the titles work themselves out over time.. since history entries do expire over time, and titles are reupdated as you visit them. by the way, you can look at the current endianness of the platform with the IS_BIG_ENDIAN / IS_LITTLE_ENDIAN flags...
some sample pseudocode: #if IS_BIG_ENDIAN if (file_is_big_endian) SetStringValue(str); else SetStringValue(ConvertToLittleEndian(str)); #else if (file_is_little_endian) SetStringValue(str); else SetStringValue(ConvertToBigEndian(str)); #endif
I agree that the code should optimize for the common case, all things being equal, but the overhead of adding one character per title seems trivial to me. Especially when you consider that the current code essentially doubles the size of title strings in the file, i.e. by not converting to UTF8, apparently merely to work around some other bug. Can you refer me to more information on the nonzero mYarn_Forms bug? Is that the only thing that prevents enabling the code to convert to UTF8?
Yeah, that's the only thing preventing us from going utf8. I don't know where the bug is, but the problem is basically when you start using non-zero forms like we were trying to do for UTF8, the database gets corrupted if you compact it. I narrowed it down to short titles (4 characters or less, like say "TREK" on www.trekbikes.com - though now they've changed it) - mork does not atomize titles if they take less than 4 bytes.. and the parser would barf when it hit one of these non-atomized strings. that would really be a better bug to fix, but it means fixing mork! :(
I tried enabling the UTF8 conversion code, and found that setting a nonzero Form doesn't seem to work at all; all titles were corrupted, regardless of size, apparently because the Form was not properly detected. I started to look at the relevant code, but I fear that fixing this may require a greater effort than I am able to make. (It might help if there were documentation on the database format, and the implementation -- is such documentation available? Or was that the reason for your ":("?...)
Two (obvious) workarounds, documented here only for completeness: * Nuke the history (Edit -> Pref -> Navigator -> History -> Clear History) each time you switch endian machines. or * Have two profiles, one for big-endian machines, one for little-endian
This patch implements the suggested alternate solution, of setting the byte order in the history file, and byte swapping title strings when running on an other-endian machine. Changes include: - A ByteOrder token is added to the history file meta row. - When the file byte order is not yet set, it is initialized to the machine byte order (either "BE" or "LE"). - When the file is opened, its byte order is noted, and a flag is set indicating whether Unicode strings need to be byte-swapped. - In OpenNewFile(), code is added to create the meta row (fixes an existing small bug which becomes more of a problem with the addition of the new token. - By adding byte-swapping methods for the explicit types, we can handle the case where PRUnichar is either a 16-bit or 32-bit value; anything else should cause a compile error. - The file byte order is reset when the history is cleared. The patch is against the Mozilla 1.2.1 base (as stored in our local CVS repository).
Comment on attachment 126046 [details] [diff] [review] Patch to set the byte order in the history file For reference, this applies cleanly (modulo a few offsets) to a current trunk tree. Alec, could you please take a look at this?
Attachment #126046 - Flags: review?(alecf)
great! the approach sounds ideal. I'll take a look at the patch sometime around monday or tuesday (6/23-6/24) however, this is something I'd like to see land on the trunk before it hits a branch like 1.2.. I'm guessing enough has changed since 1.2 such that you're going to need to make a new patch...
oops, just saw boris's comments.. amazing that it applies! :) I'll still review on mon/tues
Comment on attachment 126046 [details] [diff] [review] Patch to set the byte order in the history file wow, love the AFS file path, haven't seen that in ages :) sr=alecf
Comment on attachment 126046 [details] [diff] [review] Patch to set the byte order in the history file wow, love the AFS file path, haven't seen that in ages :) sr=alecf
Attachment #126046 - Flags: review?(alecf) → review+
Attachment #126046 - Flags: superreview+
Changed a few idents to follow Mozilla convention (mFoo for members, aFoo for function arguments, nsnull vs NULL) and checked into the trunk. Robert, thanks for the patch and your patience.... I'm sorry this has taken so ridiculously long. :(
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: