Closed Bug 1167568 Opened 7 years ago Closed 5 years ago
Reader View displays only the first part of specific articles from ehow
Reproduces with Firefox 38.0.5 RC, Firefox 39 beta 1, Nightly 41.0a1. Source page: http://www.ehow.com/how_4851888_throw-graduation-party-budget.html Result: Only the first part of the article is shown in reader view. There are no images. Several subtitles and paragraphs are missing.
Looks like this might be a regression bug. It cannot be reproduced if we use the build. Continue to investigate... : https://github.com/n1k0/readable-proxy/blob/master/vendor/Readability.js
Assignee: nobody → evan
The value of `topCandidates` at is not correct. It shows: ``` <p class="readability-styled" style="display: inline;"> </p><p>Graduation parties are a great way to commemorate the years of hard work teens and college co-eds devote to education. They’re also costly for mom and dad.</p><p class="readability-styled" style="display: inline;"> </p><p>The average cost of a graduation party in 2013 was a whopping $1,200, according to Graduationparty.com; $700 of that was allocated for food. However that budget was based on Midwestern statistics, and parties in urban areas like New York City are thought to have a much higher price tag.</p><p class="readability-styled" style="display: inline;"> </p><p>Thankfully, there are plenty of creative ways to trim a little grad party fat without sacrificing any of the fun or celebratory spirit.</p><p class="readability-styled" style="display: inline;"> </p> ``` Looks like we need to find correct `topCandidates` to fix this issue. : https://github.com/mozilla/readability/blob/master/Readability.js#L782-L806
If we resort `topCandidates ` array and make the item whose content length is maximum as `topCandidates`, then we can get correct result. Looks like the score counting algorithm might be not correct. : https://github.com/evanxd/readability/commit/aa67fe8b25e955051e7a29bd4850de13fd91e84c : https://github.com/mozilla/readability/blob/master/Readability.js#L724-L804
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #3) > If we resort `topCandidates ` array and make the item whose content > length is maximum as `topCandidates`, then we can get correct result. > Looks like the score counting algorithm might be not correct. > > : > https://github.com/evanxd/readability/commit/ > aa67fe8b25e955051e7a29bd4850de13fd91e84c > : > https://github.com/mozilla/readability/blob/master/Readability.js#L724-L804 As you might have realized, changing the score counting algorithm also has effects on all the other sites we use Readability with... so it's a tricky balance of figuring out how to maintain that algorithm and have it be as effective as possible on different websites. Would it be useful to have a vidyo/irc conversation about this code and/or do you have questions that I might be able to help with?
> As you might have realized, changing the score counting algorithm also has > effects on all the other sites we use Readability with... so it's a tricky > balance of figuring out how to maintain that algorithm and have it be as > effective as possible on different websites. Yes, I know that. Looks like we might generate new regression issues easily after land a patch now. Do you think we already have enough test to avoid regression issues on ? > Would it be useful to have a vidyo/irc conversation about this code and/or > do you have questions that I might be able to help with? I have a question now, and I'm finding the root cause. If I apply the patch and use readable-proxy to test the webpage mentioned on Comment 0, the result will be good or failed intermittently. It's weird. Do you think it is possible to happen? I'm investing the issue. If I have problems, I'll pin you. Thanks for asking. :) : https://github.com/mozilla/readability/tree/master/test : https://github.com/evanxd/readability/commit/aa67fe8b25e955051e7a29bd4850de13fd91e84c : https://github.com/n1k0/readable-proxy/ : http://www.ehow.com/how_4851888_throw-graduation-party-budget.html
Evan, could you answer my question yesterday -- did we have enough sample articles as test cases to ensure we don't break things?
I would say no. Currently, we only have 44 pages, some from real world web sites, some are test pages we made. I think we should test more. In first step, we should list the websites we want to support(like popular news websites, e.g. CNN, top 20 website in US, sadly we don't support it now). Then test them. What do you guys think? : https://github.com/mozilla/readability/tree/master/test/test-pages : http://edition.cnn.com
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #5) > > Would it be useful to have a vidyo/irc conversation about this code and/or > > do you have questions that I might be able to help with? > I have a question now, and I'm finding the root cause. If I apply the > patch and use readable-proxy to test the webpage mentioned on > Comment 0, the result will be good or failed intermittently. It's weird. Do > you think it is possible to happen? I don't know anything about readable-proxy and have never used it, so I can't really help with that. At a guess, I suspect the ad intermittently loads by the time readable-proxy gets the article text, or doesn't load, which might influence the result. The easiest way to check this might be to dump the original failing/passing markup to file or a terminal by modifying readable-proxy and/or readability and/or jsdomparser. I usually just test by turning on logging (var debug = false; replace false with true) inside readability either/both in Firefox and in nodejs when running tests. So here I guess I would use the generate-testcase.js script against the reported URL, then fix the XML-to-HTML issues manually (it would be nice if the script could do this but right now it does not - it's possible that this would be an option if we used jsdom to parse the initial article and then wrote a custom stringifier based on the JSDOM version of the DOM that inserted the requisite /> on the self-closing elements like <img>), then try to change the code until the test output matches the "real" expectations, then update the testcase and commit both code + test changes. For both of these avenues, note that you can technically use the webkit or chrome (and these days, maybe the firefox?) debugger with nodejs, so you can step through code that way if that's a technique you find useful. > I'm investing the issue. If I have problems, I'll pin you. > Thanks for asking. :) > > : https://github.com/mozilla/readability/tree/master/test > : > https://github.com/evanxd/readability/commit/ > aa67fe8b25e955051e7a29bd4850de13fd91e84c The topCandidates array is sorted based on the scores of the candidates (or it should be, anyway), so I don't think this change is correct. Parent elements get partial scores from their kids, so in a DOM like this: <div> <p> some text</p> <p> some more text</p> </div> We score the paragraphs and then give the <div> a score bonus (that's less than the score of the paragraphs themselves) for the scored paragraphs that are its children (or grandchildren, or ...). I haven't checked with this article, but from what you've said I suspect that the parent isn't getting enough of these score bonuses to make it score higher than the children. This is sometimes correct, imagine a site like: <main> <p>Main article text in here</p> <footer> Some boring footer links here </footer> </main> Readability should strip the footer (so it should pick the <p> over the <main>). The main way to disambiguate "important" and "unimportant" things should probably be 'helping' to decrease the score of the footer and/or increase the score of the relevant bits in this particular article (so that eventually the parent gets enough of a score boost that it scores higher than the individual kids). Avenues to check include (and when I say "parent" below, I mean the topCandidate that we think should be the readable container): - check both kids with substantial text (IIRC the article text is in 2 containers separated by the ad and an illustration/picture) are already contributing to the score of the parent. It's possible that right now one is so far 'down' the DOM tree that it doesn't. In that case, maybe we can simplify DOM trees (further) that have layers of nested non-semantic elements (div, span) with just 1 child. We already do this for <div><p>Text</p></div> structures, I think. - check the figure gets positive scores that influence the parent - check we remove the ad soon-ish as irrelevant and it doesn't contribute to negative scoring of the parent (it might, I forget exactly how that works right now) - check if there's classes on either the parent/kids that "semantically" (things like "content" or "article" or whatever) that indicate we should be bonusing/up-scoring the paragraphs, that we're not using yet for whatever reason Hope that helps! (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #7) > I would say no. Currently, we only have 44 pages, some from real world > web sites, some are test pages we made. > > I think we should test more. In first step, we should list the websites we > want to support(like popular news websites, e.g. CNN, top 20 website in > US, sadly we don't support it now). Then test them. > > What do you guys think? > > : https://github.com/mozilla/readability/tree/master/test/test-pages > : http://edition.cnn.com The test set was somewhat organically constructed as a result of "things that are broken that we run into and fix" rather than a systematic scraping of popular sites. It might be useful to do that for some high-profile sites, though on the other hand, as you say there are issues with some of these sites. Bits of CNN do work in reader mode (or at least they did the last time I checked), but I also know we have issues on file for some aspects of popular site, esp. in the github tracker. I've just had no cycles to spend on fixing any. So one thing to note would also be that if results change for existing tests that isn't *necessarily* a bad thing. It depends whether the results get significantly better or worse. So, for instance, if we end up selecting 1 div higher up the hierarchy on some site, that will cause test issues, but if there's no added (irrelevant) text in the result then that might be OK. It can be a bit of a judgment call. While it would be interesting to try to improve the test framework to make it less of a judgment call, it's also inherently difficult and somewhat of a circular problem: make the code in the test framework that we write check if the code in the actual readability library we write selects the humanly-considered-best part of a page by... using code to see if this is the humanly-considered-best part of a page. ;-)
Hi Gijs, Thanks for the explanation on Comment 8. Learned a lot. I've written a patch to fix this bug and sent a pull request to the readability repo on GitHub. Let's review it on GitHub, thanks. : https://github.com/mozilla/readability/pull/309
Added tests for the patch. I haven't cleaned up the source code of web page(just copy the source code from the web page), and it causes some JSDOMParser errors. If we try to clean up the web page to fix the JSDOMParser errors, it will not be the original web page and we don't test the readability lib in syntax (or other) error situations. Is that what we want, just test the readability lib with the web pages without syntax errors? But in real world, there might be some syntax errors in web pages. : https://github.com/mozilla/readability/pull/309/commits/6e918c9df42b9d9243a67856a4503fc7f2d2d186 : https://github.com/evanxd/readability/blob/bug-1167568/test/test-pages/ehow-2/source.html
Hi Gijs, I added tests. Could you review it again? Thanks. : https://github.com/mozilla/readability/pull/309
Commented on github.
> The test set was somewhat organically constructed as a result of "things > that are broken that we run into and fix" rather than a systematic scraping > of popular sites. It might be useful to do that for some high-profile sites, > though on the other hand, as you say there are issues with some of these > sites. Yes, it will be useful. It can help avoid regression issues for high-profile sites and understand how many high-profile website we already supported. So let's list high-profile websites we want to protect first, and then write tests for them. I filed a bug (Bug 1309499) for it, let's discuss there. :)
Updated patch for a new idea, find a better parent if there are other (at least three) nodes' score in `topCandidates` array which are quite closed with current `topCandidate` node. https://github.com/mozilla/readability/pull/309/commits/8c0db5e0496dd22f78e9cf247db02e26be48111e
Discussed the patch with reviewer on GitHub. : https://github.com/mozilla/readability/pull/309#issuecomment-264207963
Discussed the patch with reviewer on GitHub. : https://github.com/mozilla/readability/pull/309#issuecomment-266968561
Fixed the ehow-2 test failure and left a comment to discuss patch with reviewer. : https://github.com/mozilla/readability/pull/309#issuecomment-268964135
Updated patch for review comments and fixed the test failures. : https://github.com/mozilla/readability/pull/309/commits/11087a79c07efb5287916afdfceae29f83ed904a
Updated patch for review comments, fixed the test failures and left comments to discuss the patch. : https://github.com/mozilla/readability/pull/309/commits/c80ce19235cb717ad125cfe2c7175e2074c1564b
Updated patch and tests for review comments. : https://github.com/mozilla/readability/pull/309/commits/ea538c7468b39cda666149dcf99a897b6ab52504
Landed in GitHub repo. : https://github.com/mozilla/readability/commit/e1c7170d854ebb9b1a915dc2f8fdc7ba7012a702
We'll land it in m-c in the MozReview patch. : https://reviewboard.mozilla.org/r/109976/diff/2#index_header
Landed in m-c: https://hg.mozilla.org/mozilla-central/rev/d1bef3268b21
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.