Getting the text from a webpage (not the source)
Ivan Shmakov
2012-09-20 11:14:44 UTC
[Cross-posting to news:comp.infosystems.www.misc.]

[...]
Well, I don't need a recursive download as much as a way to ignore
"display:none" in CSS (I'm interested in a specific set of pages,
not the entire web).
[...]
Anything with display:none will not show up in the screen grab,
unless you also take other measures.
Does your system have html2text? How close does
wget -O - <url> | html2text
come to what you want?
Neither html2text nor $ lynx -dump honors CSS (or JavaScript,
for that matter), which is what (AIUI) the OP needs.
--
FSF associate member #7257
Ben Bacarisse
2012-09-20 11:41:09 UTC
Post by Ivan Shmakov
[Cross-posting to news:comp.infosystems.www.misc.]
[...]
Well, I don't need a recursive download as much as a way to ignore
"display:none" in CSS (I'm interested in a specific set of pages,
not the entire web).
[...]
Anything with display:none will not show up in the screen grab,
unless you also take other measures.
Does your system have html2text? How close does
wget -O - <url> | html2text
come to what you want?
Neither html2text nor $ lynx -dump honors CSS (or JavaScript,
for that matter), which is what (AIUI) the OP needs.
Possibly, yes. That's why I said "how close does it come", but I was
responding to a message that seemed to be more specific. It mentioned only
ignoring display:none, which html2text does of course. What other CSS
should be honoured seems to be up in the air.

I get the feeling the requirements are not set in stone and certainly
have not yet been fully stated. For example, the fact that CSS can
generate text has not yet come up, nor has the fact that CSS's
re-ordering of the text can make a significant difference to how usable
the result is.
--
Ben.
Ivan Shmakov
2012-09-20 12:29:54 UTC
[...]
Post by Ben Bacarisse
Does your system have html2text? How close does
wget -O - <url> | html2text
come to what you want?
Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
that matter), which is what (AIUI) the OP needs.
Possibly, yes. That's why I said "how close does it come", but I was
responding to a message that seemed to be more specific. It mentioned
only ignoring display:none which html2text does of course. What
other CSS should be honoured seems to be up in the air.
A quick web search [1] reveals that there're in fact two (at the
least) versions of html2text, and while I haven't checked for
that specifically, I'm pretty sure that the version currently in
Debian [2] (which I was referring to) doesn't honor CSS.

I don't know anything about the other version, though.

[1] http://duckduckgo.com/?q=html2text
[2] http://packages.debian.org/sid/html2text
Post by Ben Bacarisse
I get the feeling the requirements are not set in stone and certainly
have not yet been fully stated. For example, the fact that CSS can
generate text has not yet come up, nor has the fact that CSS's
re-ordering of the text can make a significant difference to how
usable the result is.
Yes.
--
FSF associate member #7257
Ben Bacarisse
2012-09-20 13:23:49 UTC
Post by Ivan Shmakov
[...]
Post by Ben Bacarisse
Does your system have html2text? How close does
wget -O - <url> | html2text
come to what you want?
Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
that matter), which is what (AIUI) the OP needs.
Possibly, yes. That's why I said "how close does it come", but I was
responding to a message that seemed to be more specific. It mentioned
only ignoring display:none which html2text does of course. What
other CSS should be honoured seems to be up in the air.
A quick web search [1] reveals that there're in fact two (at the
least) versions of html2text, and while I haven't checked for
that specifically, I'm pretty sure that the version currently in
Debian [2] (which I was referring to) doesn't honor CSS.
I don't know anything about the other version, though.
Yes, I knew about the two versions but since my reply was just a punt I
didn't think to mention it. I should have, just in case the OP finds it
suitable.

The main differences seem to be that the Debian version can do recoding
but can't fetch the page via HTTP (which means that it can't do the
recoding based on the HTTP response!).
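To make "recoding based on the HTTP response" concrete, here is a rough Python sketch (not taken from either html2text). It assumes the charset is announced in the Content-Type header; the function names and the UTF-8 fallback are illustrative assumptions only.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    The default is an assumption made for this sketch; a real client
    might instead sniff the document for a <meta> declaration.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(body, content_type):
    """Decode raw response bytes using the charset the server announced."""
    charset = charset_from_content_type(content_type)
    # An unknown charset name raises LookupError; undecodable bytes are replaced.
    return body.decode(charset, errors="replace")
```

A tool that only ever reads the page from a file or a pipe never sees this header, so it has to rely on whatever the document itself declares.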

<snip>
--
Ben.
Ivan Shmakov
2012-09-20 14:42:03 UTC
[...]
Post by Ben Bacarisse
Post by Ivan Shmakov
Post by Ben Bacarisse
Neither html2text nor $ lynx -dump honors CSS (or JavaScript, for
that matter), which is what (AIUI) the OP needs.
Possibly, yes. That's why I said "how close does it come", but I
was responding to a message that seemed to be more specific. It
mentioned only ignoring display:none which html2text does of
course. What other CSS should be honoured seems to be up in the
air.
A quick web search [1] reveals that there're in fact two (at the
least) versions of html2text, and while I haven't checked for that
specifically, I'm pretty sure that the version currently in Debian
[2] (which I was referring to) doesn't honor CSS.
I don't know anything about the other version, though.
Yes, I knew about the two versions but since my reply was just a punt
I didn't think to mention it. I should have, just in case the OP
finds it suitable.
The main differences seem to be that the Debian version can do
recoding but can't fetch the page via HTTP
The main difference between the two versions I've been referring
to is that one of them is written in Python, and the other in
C++.

The version of html2text in Debian (written in C++) doesn't seem
to honor CSS (or to decode &#x...; numeric character references,
BTW). E. g.:

$ html2text < 1348151128.xhtml
****** CSS &#x2018;display:none&#x2019; example ******
This text should be visible, and this one shouldn't.
$

Neither does Lynx:

$ lynx -dump -- 1348151128.xhtml
CSS `display:none' example

This text should be visible, and this one shouldn't.
$
Post by Ben Bacarisse
(which means that it can't do the recoding based on the HTTP
response!).
Then it shouldn't be expected to take any external CSS
referenced into account, either.

The document is as follows (it's correctly rendered by
Iceweasel, and passes checks at http://validator.w3.org/.)

$ cat < 1348151128.xhtml
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en">
<head>
<title>CSS &#x2018;display:none&#x2019; example</title>
<style type="text/css">.invis { display: none; }</style>
</head>

<body>
<h1>CSS &#x2018;display:none&#x2019; example</h1>

<p>This text should be
visible<span class="invis"
>, and this one shouldn't</span>.</p>
</body>
</html>
$
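
For illustration only, a minimal Python sketch of what "honoring display:none" could mean for a document like this one. It is nowhere near a CSS engine: it assumes the rule sits in an inline <style> element that precedes the content, and that it is a plain class selector of the form ".name { display: none; }". All names in it (VisibleText, visible_text, SAMPLE) are made up for the example.

```python
import re
from html.parser import HTMLParser

# Elements with no end tag; they must not change the nesting stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Elements whose content a browser does not render as page text.
NOT_RENDERED = {"head", "script", "style"}
# Matches only simple rules such as ".invis { display: none; }".
HIDDEN_RULE = re.compile(r"\.([\w-]+)\s*\{[^}]*display\s*:\s*none[^}]*\}")


class VisibleText(HTMLParser):
    """Collect text, skipping elements hidden by a simple class rule."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_classes = set()
        self.hidden = []        # one flag per open element
        self.style_text = []    # raw content of the current <style>
        self.in_style = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag == "style":
            self.in_style = True
        classes = (dict(attrs).get("class") or "").split()
        inherited = bool(self.hidden) and self.hidden[-1]
        self.hidden.append(inherited or tag in NOT_RENDERED
                           or any(c in self.hidden_classes for c in classes))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag == "style":
            self.in_style = False
            css = "".join(self.style_text)
            self.hidden_classes.update(HIDDEN_RULE.findall(css))
            self.style_text.clear()
        if self.hidden:
            self.hidden.pop()

    def handle_data(self, data):
        if self.in_style:
            self.style_text.append(data)
        elif not (self.hidden and self.hidden[-1]):
            self.chunks.append(data)


def visible_text(html):
    """Return the whitespace-normalized text a reader would see."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# A document modeled on the test case in this thread.
SAMPLE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>CSS &#x2018;display:none&#x2019; example</title>
<style type="text/css">.invis { display: none; }</style>
</head>
<body>
<h1>CSS &#x2018;display:none&#x2019; example</h1>
<p>This text should be
visible<span class="invis"
>, and this one shouldn't</span>.</p>
</body>
</html>
"""

print(visible_text(SAMPLE))
```

External stylesheets, other selectors, the cascade, and anything done by JavaScript are all out of scope here; covering those realistically means driving an actual browser engine.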
--
FSF associate member #7257
Eli the Bearded
2012-09-20 21:58:18 UTC
Post by Ivan Shmakov
$ html2text < 1348151128.xhtml
****** CSS &#x2018;display:none&#x2019; example ******
This text should be visible, and this one shouldn't.
$
$ lynx -dump -- 1348151128.xhtml
CSS `display:none' example
This text should be visible, and this one shouldn't.
$
The document is as follows (it's correctly rendered by
Iceweasel, and passes checks at http://validator.w3.org/.)
$ cat < 1348151128.xhtml
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en">
<head>
<title>CSS &#x2018;display:none&#x2019; example</title>
<style type="text/css">.invis { display: none; }</style>
</head>
<body>
<h1>CSS &#x2018;display:none&#x2019; example</h1>
<p>This text should be
visible<span class="invis"
>, and this one shouldn't</span>.</p>
</body>
</html>
$
Tricky! I tried links, elinks, and w3m (all full-screen text-mode
browsers), and edbrowse (an ed-style, line-by-line text-mode browser),
too. All failed that test:

$ links http://localhost/invis.html
CSS `display:none' example
This text should be visible, and this one shouldn't.

$ elinks http://localhost/invis.html
CSS `display:none' example

This text should be visible, and this one shouldn't.

$ w3m http://localhost/invis.html
CSS ‘display:none’ example

This text should be visible, and this one shouldn't.

$ edbrowse http://localhost/invis.html
no ssl certificate file specified; secure connections cannot be verified
417
85
1,$p
CSS ‘display:none’ example

This text should be visible, and this one shouldn't.
q
$

The last browser was selected because it, alone among the
text browsers, has some support for JavaScript text changes:

$ cat javascript.html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en">
<head>
<title>Javascript &#x2018;innerHTML&#x2019; example</title>
<script type="text/javascript">
function changeText(){
document.getElementById('change_me').innerHTML =
', and so should this';
}
</script>
</head>

<body onLoad="changeText();" >
<h1>Javascript &#x2018;innerHTML&#x2019; example</h1>

<p>This text should be
visible<span id="change_me"
>, and this one shouldn't</span>.</p>
</body>
</html>
$ edbrowse http://localhost/javascript.html
no ssl certificate file specified; secure connections cannot be verified
552
191
Javascript ‘innerHTML’ example

This text should be visible, and this one shouldn't.
