C-like tokenization can get confused by script tags with HTML contents, violating assumptions in regexp scanning logic
Categories
(Webtools :: Searchfox, defect)
Tracking
(Not tracked)
People
(Reporter: asuth, Assigned: arai)
References
(Regression)
Details
(Keywords: regression)
Attachments
(1 file)
+ parallel --files --halt 2 -X --eta /home/ubuntu/mozsearch/tools/target/release/output-file /mnt/index-scratch/config.json mozilla-mobile
has failed 2 days in a row. Investigating.
| Reporter | ||
Comment 1•5 years ago
|
||
Re-ran on the indexer, and got:
File focus-android/app/src/androidTest/assets/ad.html
thread 'main' panicked at 'index out of bounds: the len is 141 but the index is 141', /rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libcore/slice/mod.rs:2791:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Re-ran with BACKTRACE=1 and got:
File focus-android/app/src/androidTest/assets/ad.html
thread 'main' panicked at 'index out of bounds: the len is 141 but the index is 141', /rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libcore/slice/mod.rs:2791:10
stack backtrace:
0: backtrace::backtrace::libunwind::trace
at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
1: backtrace::backtrace::trace_unsynchronized
at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
2: std::sys_common::backtrace::_print_fmt
at src/libstd/sys_common/backtrace.rs:77
3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
at src/libstd/sys_common/backtrace.rs:59
4: core::fmt::write
at src/libcore/fmt/mod.rs:1057
5: std::io::Write::write_fmt
at src/libstd/io/mod.rs:1426
6: std::sys_common::backtrace::_print
at src/libstd/sys_common/backtrace.rs:62
7: std::sys_common::backtrace::print
at src/libstd/sys_common/backtrace.rs:49
8: std::panicking::default_hook::{{closure}}
at src/libstd/panicking.rs:195
9: std::panicking::default_hook
at src/libstd/panicking.rs:215
10: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:472
11: rust_begin_unwind
at src/libstd/panicking.rs:376
12: core::panicking::panic_fmt
at src/libcore/panicking.rs:84
13: core::panicking::panic_bounds_check
at src/libcore/panicking.rs:62
14: <usize as core::slice::SliceIndex<[T]>>::index
at ./rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libcore/slice/mod.rs:2791
15: core::slice::<impl core::ops::index::Index<I> for [T]>::index
at ./rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libcore/slice/mod.rs:2656
16: <alloc::vec::Vec<T> as core::ops::index::Index<I>>::index
at ./rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/liballoc/vec.rs:1878
17: tools::tokenize::tokenize_c_like::{{closure}}
at src/tokenize.rs:72
18: tools::tokenize::tokenize_c_like
at src/tokenize.rs:420
19: tools::tokenize::tokenize_tag_like
at src/tokenize.rs:1152
20: tools::format::format_code
at src/format.rs:50
21: tools::format::format_file_data
at src/format.rs:346
22: output_file::main
at src/bin/output-file.rs:330
23: std::rt::lang_start::{{closure}}
at ./rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libstd/rt.rs:67
24: std::rt::lang_start_internal::{{closure}}
at src/libstd/rt.rs:52
25: std::panicking::try::do_call
at src/libstd/panicking.rs:296
26: __rust_maybe_catch_panic
at src/libpanic_unwind/lib.rs:79
27: std::panicking::try
at src/libstd/panicking.rs:272
28: std::panic::catch_unwind
at src/libstd/panic.rs:394
29: std::rt::lang_start_internal
at src/libstd/rt.rs:51
30: main
31: __libc_start_main
32: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Adding debug! instrumentation at the top of tokenize_c_like, we find the problem is the line <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"><span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span></script> as reported by the debug output:
DEBUG 2020-01-04T22:12:28Z: tools::tokenize: c-like: <span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span>
| Reporter | ||
Comment 2•5 years ago
|
||
So the problem here is basically that:
- tokenize_tag_like decided to parse the contents of a
<script>tag that has asrcattribute, which means that the inline contents are moot and should not have been tokenized as C-like. - We ended up in the tokenize.rs logic loop consuming characters that tries to find the matching close '/' for the start.
- Because HTML is not valid JS and the loop was written assuming that we would observe a newline before running out of character, it doesn't work out.
Fixes would be some combination of:
- Make the consumption safe by:
- Treat running out of characters like hitting a newline, falling over to plain tokenization.
- Normalize input to include a newline at the end, always.
- If a script tag has a "src" attribute, don't attempt to C-tokenize its contents, instead treating them as plaintext. This would likely only matter when there are clever JS loaders in play, like in this situation where the script tag is clearly being used to tunnel data for a script.
| Reporter | ||
Comment 3•5 years ago
|
||
I've landed a fix on https://github.com/mozsearch/mozsearch/pull/267 with after-the-fact review requested, rationale on PR. I think we'll want to potentially do some follow-ups on this. Un-assigning myself to make it clear that I probably won't be able to dig in for a week or two and others should feel free to take, etc.
Re-triggering the indexer now.
| Assignee | ||
Comment 4•1 month ago
|
||
| Assignee | ||
Updated•1 month ago
|
Description
•