Open Bug 1277609 Opened 4 years ago Updated 4 years ago

(Coverity) nsBayesianFilter.cpp: else if (IS_JA_TOUTEN(c)) is never executed !?

Categories

(MailNews Core :: Filters, defect, minor)

defect
Not set
minor

Tracking

(Not tracked)

People

(Reporter: ishikawa, Assigned: ishikawa)

References

(Blocks 1 open bug)

Details

(Keywords: coverity, Whiteboard: CID 1137370)

Attachments

(1 file)

(Coverity found this nugget.)

I suspect that macro definitions have become incorrect over the years.

Let me show what Coverity has to say.
Quite interesting, basically, by chasing the possible values of |c|, it figures that the particular check for IS_JA_TOUTEN(c) never gets executed !!!

Something is wrong in the code: most likely incorrect macro definitions.

I am quoting the covery output.
Sorry the line number gets inserted at the beginning of line.

633static char_class getCharClass(char16_t c)
634{
635  char_class charClass = others;
636
   cond_at_least: Condition (uint16_t)(c - 12352) <= 95, taking false branch. Now the value of c is at least 12448.
637  if(IS_JA_HIRAGANA(c))
638    charClass = hiragana;
   cond_at_least: Condition (uint16_t)(c - 65382) <= 57, taking false branch. Now the value of c is at least 65440.
639  else if(IS_JA_KATAKANA(c))
640    charClass = katakana;
   cond_at_least: Condition (uint16_t)(c - 11904) <= 351, taking false branch. Now the value of c is at least 65440.
   cond_at_least: Condition (uint16_t)(c - 19968) <= 20911, taking false branch. Now the value of c is at least 65440.
641  else if(IS_JA_KANJI(c))
   CID 1137370 (#1 of 4): Logically dead code (DEADCODE) [select issue]
642    charClass = kanji;
   cond_cannot_single: Condition c == 12289, taking false branch. Now the value of c cannot be equal to 12289.
   cond_cannot_set: Condition c == 65380, taking false branch. Now the value of c cannot be equal to any of {12289, 65380}.
   cond_cannot_set: Condition c == 65294, taking false branch. Now the value of c cannot be equal to any of {12289, 65294, 65380}.
643  else if(IS_JA_KUTEN(c))
   CID 1137370 (#3 of 4): Logically dead code (DEADCODE) [select issue]
644    charClass = kuten;
   at_least: At condition c == 12290, the value of c must be at least 65440.
   cannot_set: At condition c == 12290, the value of c cannot be equal to any of {12289, 65294, 65380}.
   dead_error_condition: The condition c == 12290 cannot be true.
   at_least: At condition c == 65377, the value of c must be at least 65440.
   cond_cannot_set: Condition c == 12290, taking false branch. Now the value of c cannot be equal to any of {12289, 12290, 65294, 65380}.
   cannot_set: At condition c == 65377, the value of c cannot be equal to any of {12289, 12290, 65294, 65380}.
   dead_error_condition: The condition c == 65377 cannot be true.
   at_least: At condition c == 65292, the value of c must be at least 65440.
   cond_cannot_set: Condition c == 65377, taking false branch. Now the value of c cannot be equal to any of {12289, 12290, 65294, 65377, 65380}.
   cannot_set: At condition c == 65292, the value of c cannot be equal to any of {12289, 12290, 65294, 65377, 65380}.
   dead_error_condition: The condition c == 65292 cannot be true.
645  else if(IS_JA_TOUTEN(c))
   CID 1137370 (#4 of 4): Logically dead code (DEADCODE)dead_error_line: Execution cannot reach this statement: charClass = touten;.
646    charClass = touten;
647  else if(IS_JA_FWLATAIN(c))
   CID 1137370 (#2 of 4): Logically dead code (DEADCODE) [select issue]
648    charClass = fwlatain;
649
650  return charClass;

Please note that when the first if() is not taken, that means that
c is at least 65440 and from that time on, c cannot satisfy
many conditions in the subsequent if() conditions.

 if(IS_JA_HIRAGANA(c))
638    charClass = hiragana;
   cond_at_least: Condition (uint16_t)(c - 65382) <= 57, taking false branch. Now the value of c is at least 65440.

I suspect there is a 
16-bit sign extension or sign vs unsigned issue, and or
simply incorrect macro definition.

TIa
Summary: (coveritry) nsBayesianFilter.cpp: else if (IS_JA_TOUTEN(c)) is never executed !? → (Coverity) nsBayesianFilter.cpp: else if (IS_JA_TOUTEN(c)) is never executed !?
Severity: normal → minor
Component: Untriaged → Filters
Product: Thunderbird → MailNews Core
Oops. I think I found the reason, but this problem seems to have been
with TB since time immemorial when the code was inherited from CVS (!)

The problem may be grave, I have no idea.


From MXR

IN_RANGE

If you can't find what you're looking for, you can always perform a free-text search for it.
Defined as a preprocessor macro in:

    mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp (View Hg log or Hg annotations)
        line 606 -- #define IN_RANGE(x, low, high) ((uint16_t)((x)-(low)) <= (high)-(low)) 

Referenced in:

    mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp (View Hg log or Hg annotations)
        line 608 -- #define IS_JA_HIRAGANA(x) IN_RANGE(x, 0x3040, 0x309F)
        line 610 -- #define IS_JA_KATAKANA(x) (IN_RANGE(x^0x0004, 0x30A0, 0x30FE)||(IN_RANGE(x, 0xFF66, 0xFF9F)))
        line 611 -- #define IS_JA_KANJI(x) (IN_RANGE(x, 0x2E80, 0x2FDF)||IN_RANGE(x, 0x4E00, 0x9FAF))
        line 615 -- #define IS_JA_FWLATAIN(x) IN_RANGE(x, 0xFF01, 0xFF5E)
        line 616 -- #define IS_JA_FWNUMERAL(x) IN_RANGE(x, 0xFF10, 0xFF19)
        line 618 -- #define IS_JAPANESE_SPECIFIC(x) (IN_RANGE(x, 0x3040, 0x30FF)||IN_RANGE(x, 0xFF01, 0xFF9F)) 

in the said function |c| is char16_t.

I wonder why IN_RANGE(x, low, high) is as simple as
  (unsigned) low <= x && x <= (unsigned) high

and/or add "u" to all those constants that appear in the macro 
definitions.

I initially wrote the above donw, but now I think

#define IN_RANGE(x, low, high) ((uint16_t)((x)-(low)) <= (high)-(low)) 

is patently wrong because (forget about the signedness for a moment, but)

(x) - (low <= (high) - (low) means if we add (low) on both sides,
we only get
(x) <= (high) ?

Where does this (low) <= (x) go?

I checked the generated code by gcc -O3 -S of the following simple code and found that the used code does seem to generate only one check for x <= 65439 (and false condition of this means x >= 65440 as noted by Coverity.)
(The other possibility of (x < low) is not taken care of by the buggy IN_RANGE() and that is why IS_JA_TOUTEN() part never gets executed!!!
 

#define low 0xFF66
#define high 0xFF9F

int test1 (int c) {
  return low <= c && c <= high;
}

int test2 (int c) {
  return (c - low) <= (high - low);
}

int utest1 (int c) {
  return (unsigned) low <= c && c <= high;
}

int utest2 (int c) {
  return (c - (unsigned) low) <= ((unsigned) high - (unsigned) low);
}

Generated code.

	.file	"test.c"
	.section	.text.unlikely,"ax",@progbits
.LCOLDB0:
	.text
.LHOTB0:
	.p2align 4,,15
	.globl	test1
	.type	test1, @function
test1:
.LFB5:
	.cfi_startproc
	subl	$65382, %edi
	xorl	%eax, %eax
	cmpl	$57, %edi
	setbe	%al
	ret
	.cfi_endproc
.LFE5:
	.size	test1, .-test1
	.section	.text.unlikely
.LCOLDE0:
	.text
.LHOTE0:
	.section	.text.unlikely
.LCOLDB1:
	.text
.LHOTB1:
	.p2align 4,,15
	.globl	test2
	.type	test2, @function
test2:
.LFB1:
	.cfi_startproc
	xorl	%eax, %eax
	cmpl	$65439, %edi
	setle	%al
	ret
	.cfi_endproc
.LFE1:
	.size	test2, .-test2
	.section	.text.unlikely
.LCOLDE1:
	.text
.LHOTE1:
	.section	.text.unlikely
.LCOLDB2:
	.text
.LHOTB2:
	.p2align 4,,15
	.globl	utest1
	.type	utest1, @function
utest1:
.LFB2:
	.cfi_startproc
	cmpl	$65381, %edi
	seta	%al
	xorl	%edx, %edx
	cmpl	$65439, %edi
	setle	%dl
	andl	%edx, %eax
	ret
	.cfi_endproc
.LFE2:
	.size	utest1, .-utest1
	.section	.text.unlikely
.LCOLDE2:
	.text
.LHOTE2:
	.section	.text.unlikely
.LCOLDB3:
	.text
.LHOTB3:
	.p2align 4,,15
	.globl	utest2
	.type	utest2, @function
utest2:
.LFB3:
	.cfi_startproc
	subl	$65382, %edi
	xorl	%eax, %eax
	cmpl	$57, %edi
	setbe	%al
	ret
	.cfi_endproc
.LFE3:
	.size	utest2, .-utest2
	.section	.text.unlikely
.LCOLDE3:
	.text
.LHOTE3:
	.ident	"GCC: (Debian 5.3.1-13) 5.3.1 20160323"
	.section	.note.GNU-stack,"",@progbits

TIA
Jorg, do you understand this C/assembler voodoo? :)
Flags: needinfo?(mozilla)
Sorry, I did some PDP-10 (https://en.wikipedia.org/wiki/PDP-10) and some 360 Assembler (https://en.wikibooks.org/wiki/360_Assembly). Never on x86.
Flags: needinfo?(mozilla)
But maybe reading the C macros is enough here.
Here is my take on the issue.

At least, there is no bad effect.
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=0c30eb585e1ac216b4a4f774a4237d3153d54c06

But this means probably there has never been a good test for this before.

TIA

PS: I think I can try to create a code sample to show that the old macro and old function that uses it is wrong whereas the new definition of the said macro and the redefined function (that converts the argument to unt16_t) is correct, but short on time at this time. We can just call the tested functions with Japanese character codes that are to be classified and see what values are returned by each of the functions.

In any case, Coverity is right. It correctly deduced the dead code that resulted from the previous macro definitions.
It is rather difficult to test the function and new macros out of the tree because they depend on the mozilla declared types and such within the tree and how they are called, etc. My hope is that once it is inside the tree, a new Coverty scan is performed and all is well then. (To be honest, I could not reproduce the issue outside the tree... It may take a time to get all the type declarations, etc. correct outside the tree.)
Assignee: nobody → ishikawa
Why is using "unsigned" in IN_RANGE better that uint16_t when the 'c' being passed in is uint16_t ?
I'd feel safer if all the constants and arguments were the same size (2bytes) which uint16_t probably says. But how large is "unsigned"? Maybe if it were "unsigned short". And how large the 'u' modifier in the contants?
Keywords: coverity
(In reply to :aceman from comment #7)
> Why is using "unsigned" in IN_RANGE better that uint16_t when the 'c' being
> passed in is uint16_t ?
> I'd feel safer if all the constants and arguments were the same size
> (2bytes) which uint16_t probably says. But how large is "unsigned"? Maybe if
> it were "unsigned short". 

I see. Maybe I should have used "unsigned short" in IN_RANGE() macro.
> And how large the 'u' modifier in the contants?

Hmm. That is another good question. 

Maybe I should cast all the constants to "(unsigned short)", which is indeed two octets under linux 64-bit, GCC 5.x and 4.x.
(I tested the size under Debian GNU/Linux.)

TIA
Whiteboard: CID 1137370
You need to log in before you can comment on or make changes to this bug.