[PLUG] sed regex to strip html code

Randal L. Schwartz merlyn at stonehenge.com
Wed Jul 3 21:07:54 UTC 2002


>>>>> "Rob" == Rob Hudson <rob at euglug.net> writes:

Rob> This will do it too...
Rob>   perl -pi -e 's/<.*?>//g' file.html

For some meaning of "do" and "it".

It'll break on this:

        <!-- > this is still inside the comment, but won't get stripped
        < -->

Yes, that's all stuff that should be stripped.  Or try this:

        <div anything="foo>bar">

That will leave  bar">  behind, when the whole tag should have been stripped.

This is a lot closer:

$ perl -MHTML::Parser -e 'HTML::Parser->new(text_h => [ sub { print @_ }, "dtext" ])->parse_file(\*STDIN)' <<\END

        <!-- > this is still inside the comment, but won't get stripped
        < -->
        And this is outside.  Here's &lt;HTML entities&gt; for you!
        <a href="foo">And here's a link</a>.

END

which generates:

        And this is outside.  Here's <HTML entities> for you!
        And here's a link.

Proper!
        
-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn at stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


