Closed Bug 1607031 Opened 5 years ago Closed 1 month ago

C-like tokenization can get confused by script tags with HTML contents, violating assumptions in regexp scanning logic

Categories

(Webtools :: Searchfox, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: asuth, Assigned: arai)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

 + parallel --files --halt 2 -X --eta /home/ubuntu/mozsearch/tools/target/release/output-file /mnt/index-scratch/config.json mozilla-mobile

has failed 2 days in a row. Investigating.

Re-ran on the indexer, and got:

File focus-android/app/src/androidTest/assets/ad.html
thread 'main' panicked at 'index out of bounds: the len is 141 but the index is 141', /rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libcore/slice/mod.rs:2791:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

Re-ran with BACKTRACE=1 and got:

File focus-android/app/src/androidTest/assets/ad.html
thread 'main' panicked at 'index out of bounds: the len is 141 but the index is 141', /rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libcore/slice/mod.rs:2791:10
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:77
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1057
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:195
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:215
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:472
  11: rust_begin_unwind
             at src/libstd/panicking.rs:376
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:84
  13: core::panicking::panic_bounds_check
             at src/libcore/panicking.rs:62
  14: <usize as core::slice::SliceIndex<[T]>>::index
             at ./rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libcore/slice/mod.rs:2791
  15: core::slice::<impl core::ops::index::Index<I> for [T]>::index
             at ./rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libcore/slice/mod.rs:2656
  16: <alloc::vec::Vec<T> as core::ops::index::Index<I>>::index
             at ./rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/liballoc/vec.rs:1878
  17: tools::tokenize::tokenize_c_like::{{closure}}
             at src/tokenize.rs:72
  18: tools::tokenize::tokenize_c_like
             at src/tokenize.rs:420
  19: tools::tokenize::tokenize_tag_like
             at src/tokenize.rs:1152
  20: tools::format::format_code
             at src/format.rs:50
  21: tools::format::format_file_data
             at src/format.rs:346
  22: output_file::main
             at src/bin/output-file.rs:330
  23: std::rt::lang_start::{{closure}}
             at ./rustc/0a58f5864659ddfe1d95c122abaa75c88220aed0/src/libstd/rt.rs:67
  24: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  25: std::panicking::try::do_call
             at src/libstd/panicking.rs:296
  26: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:79
  27: std::panicking::try
             at src/libstd/panicking.rs:272
  28: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  29: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  30: main
  31: __libc_start_main
  32: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Adding debug! instrumentation at the top of tokenize_c_like, we find the problem is the line <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"><span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span></script> as reported by the debug output:

DEBUG 2020-01-04T22:12:28Z: tools::tokenize: c-like: <span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span>

So the problem here is basically that:

  • tokenize_tag_like decided to parse the contents of a <script> tag that has a src attribute, which means that the inline contents are moot and should not have been tokenized as C-like.
  • We ended up in the tokenize.rs logic loop consuming characters that tries to find the matching close '/' for the start.
  • Because HTML is not valid JS and the loop was written assuming that we would observe a newline before running out of character, it doesn't work out.

Fixes would be some combination of:

  • Make the consumption safe by:
    • Treat running out of characters like hitting a newline, falling over to plain tokenization.
    • Normalize input to include a newline at the end, always.
  • If a script tag has a "src" attribute, don't attempt to C-tokenize its contents, instead treating them as plaintext. This would likely only matter when there are clever JS loaders in play, like in this situation where the script tag is clearly being used to tunnel data for a script.
Regressed by: 1588908

I've landed a fix on https://github.com/mozsearch/mozsearch/pull/267 with after-the-fact review requested, rationale on PR. I think we'll want to potentially do some follow-ups on this. Un-assigning myself to make it clear that I probably won't be able to dig in for a week or two and others should feel free to take, etc.

Re-triggering the indexer now.

Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Summary: mozilla-mobile indexer run failing during output-files → C-like tokenization can get confused by script tags with HTML contents, violating assumptions in regexp scanning logic
Attached file GitHub Pull Request
Assignee: nobody → arai.unmht
Keywords: regression
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: