Open Bug 1794177 Opened 2 months ago

Have output-file generate line-centric random-access LZ4 compressed versions of rendered HTML lines and analysis data to make the query mechanism faster and support efficient excerpts

Categories

(Webtools :: Searchfox, enhancement)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: asuth, Assigned: asuth)

Details

The "query" mechanism knows how to read the rendered HTML files and extract the line excerpts it's interested in with context. This is categorically too slow as it exists and my tentative plan had been along the lines of what I propose in https://bugzilla.mozilla.org/show_bug.cgi?id=1779340#c0 where we:

  • Have some pre-computed offsets/ranges in the HTML and/or analysis files
  • Make sure that any compression (which is desirable) is chunked into blocks aligned with our range mechanism so we don't need to read and decompress the file up to the offset in question.

The lz4 project describes an approach for doing this at https://github.com/lz4/lz4/blob/dev/examples/dictionaryRandomAccess.md and has an example program. (The example is GPLv2 whereas the lib is BSD 2-clause, but we are planning to use https://crates.io/crates/lz4_flex instead of an lz4 binding and to roll our own file format a bit anyway.)

The general idea is this:

  • We will run zstd --train <training files> -o <dictionary name> on mozilla-central's HTML output and analysis files and end up using the default dictionary size of "112640" and commit this into the tree under config_defaults so that we can do one-off custom dictionaries for other trees (ex: wubkat probably wants this).
  • In output-file we:
    • Define some constant for our line chunks. Probably something like 32 to start with.
    • As we process lines, we accumulate them into line-centric buffers, which we compress using compress_with_dict and retain in memory along with their lengths. Into these buffers we:
      • Re-output all source records we see for the analysis random-access stream. We don't need target records or structured records or the future diagnostic records.
      • Output the rendered HTML sans any nesting structuring (which complicates things) and maybe without the blame data and coverage data.
    • Once we're done processing the file, we write out 8-byte sequences of [4 bytes: offset for this line chunk, 4 bytes: length of this line chunk], one for each line chunk. The offsets assume the chunk data starts immediately after the last of these 8-byte entries, which is where we then write the chunks. This format obviously explodes if anyone ever asks for more lines than actually exist in the file, so we don't let line requests come from outside the system for now and this is mainly a way to leave a foot-gun that forces me to consider baking the offsets into the crossref file. (That said, we can easily have the per-file info include the number of lines in the file, and now that I type that I realize we absolutely do already want that because that's useful information. So consulting the per-file-info could just be a precondition.)
  • When we are augmenting the search results we know the file and lines we care about, so we:
    • Open the compressed file, seek to the 8 bytes that hold the offset and length for the chunk we want, and read them in. Then seek to the offset, read the given length, and use decompress_with_dict to decompress the data, which we then pass through our existing lol_html line extracting logic versus having some kind of per-line offset table. This is partially because I'm very lazy but also because I think there's quite potentially some value in explicitly running a filtering/transformation pass on our HTML markup so I don't want to pre-optimize that or make adding it back later feel like a regression.
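
The write and read steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual output-file code: the function names are hypothetical, the 4-byte fields are assumed to be little-endian (the description does not say), lines are assumed to be zero-based and joined with newlines, and `compress_chunk` / `decompress_chunk` are identity stand-ins marking where the dictionary-based compress and decompress calls would go, so that the sketch depends only on the standard library.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Lines per chunk; the plan suggests starting with something like 32.
const LINES_PER_CHUNK: usize = 32;
/// Each table entry is [4 bytes: offset][4 bytes: length].
const ENTRY_SIZE: u64 = 8;

/// Stand-in (assumption): the real code would compress with the dictionary here.
fn compress_chunk(raw: &[u8]) -> Vec<u8> {
    raw.to_vec()
}

/// Stand-in (assumption): the real code would decompress with the dictionary here.
fn decompress_chunk(stored: &[u8]) -> Vec<u8> {
    stored.to_vec()
}

/// Write the offset/length table, then the chunks immediately after it.
fn write_chunked<W: Write>(out: &mut W, lines: &[&str]) -> io::Result<()> {
    // Build every chunk in memory first so the lengths are known.
    let chunks: Vec<Vec<u8>> = lines
        .chunks(LINES_PER_CHUNK)
        .map(|group| compress_chunk(group.join("\n").as_bytes()))
        .collect();

    // The first chunk starts right after the last table entry.
    let mut offset = (chunks.len() as u64 * ENTRY_SIZE) as u32;
    for chunk in &chunks {
        out.write_all(&offset.to_le_bytes())?;
        out.write_all(&(chunk.len() as u32).to_le_bytes())?;
        offset += chunk.len() as u32;
    }
    for chunk in &chunks {
        out.write_all(chunk)?;
    }
    Ok(())
}

/// Read and decompress the chunk containing zero-based `line`.
/// The caller must already know that `line` exists in the file.
fn read_chunk_for_line<R: Read + Seek>(file: &mut R, line: usize) -> io::Result<Vec<u8>> {
    // Seek to the 8-byte table entry for this line's chunk.
    let chunk_index = (line / LINES_PER_CHUNK) as u64;
    file.seek(SeekFrom::Start(chunk_index * ENTRY_SIZE))?;
    let mut entry = [0u8; 8];
    file.read_exact(&mut entry)?;
    let offset = u32::from_le_bytes([entry[0], entry[1], entry[2], entry[3]]);
    let len = u32::from_le_bytes([entry[4], entry[5], entry[6], entry[7]]);

    // Seek to the chunk itself and read exactly its stored length.
    file.seek(SeekFrom::Start(offset as u64))?;
    let mut stored = vec![0u8; len as usize];
    file.read_exact(&mut stored)?;
    Ok(decompress_chunk(&stored))
}
```

Note that `read_chunk_for_line` does no bounds checking: asking for a line past the end of the file would read chunk bytes as if they were a table entry, which is the foot-gun described above and the reason a line-count precondition is attractive.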

The choice to do this for both source records and HTML is because:

  • In bug 1779340 we potentially want to have random access to the source records.
  • It's possible that it might be more interesting/efficient/useful to render the HTML on the fly from the analysis records and the token stream (although we're not chunking the raw file source and the tokenizer state going into each line or chunk, which we might need to do to make that viable).

It's not a given that this will actually let us hit the performance targets I think are necessary for this. If it doesn't, we'd likely be looking at baking the HTML output into the crossref output, which might result in us wanting to do some additional crossref optimizations like making crossref-extra be compressed on a per-value basis with lz4 and/or separating the hit excerpts so that the diagram logic doesn't have to parse voluminous JSON data it doesn't care about.
