[PLUG] sed regex to strip html code
Randal L. Schwartz
merlyn at stonehenge.com
Wed Jul 3 21:07:54 UTC 2002
>>>>> "Rob" == Rob Hudson <rob at euglug.net> writes:
Rob> This will do it too...
Rob> perl -pi -e 's/<.*?>//g' file.html
For some meaning of "do" and "it".
It'll break on this:
<!-- > this is still inside the comment, but won't get stripped
< -->
Yes, both of those lines are comment text that should be stripped. Or try this:
<div anything="foo>bar">
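That will leave _bar">_ behind when it shouldn't, because the non-greedy match stops at the first ">", which here sits inside the attribute value.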
This is a lot closer:
$ perl -MHTML::Parser -e 'HTML::Parser->new(text_h => [ sub { print @_ }, "dtext" ])->parse_file(\*STDIN)' <<\END
<!-- > this is still inside the comment, but won't get stripped
< -->
And this is outside. Here's &lt;HTML entities&gt; for you!
<a href="foo">And here's a link</a>.
END
which generates:
And this is outside. Here's <HTML entities> for you!
And here's a link.
Proper!
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn at stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!