[PLUG] Parsing HTML with Perl

Wed Jun 30 15:01:02 UTC 2004

On Wed, 2004-06-30 at 14:47, Matt Alexander wrote:
> I'm trying to use Perl to parse out "records" from an HTML page.  I'm able
> to identify the beginning of a record with the following tag:
> 
> <TABLE border=0 width=500 class='A'>
> 
> or
> 
> <TABLE border=0 width=500 class='B'>
> 
> ...but I have no obvious way to find the end of a record except by the
> next starting tag.  There could be numerous additional table tags embedded
> in a record so I can't do anything simple like this:
> 
> @records = /<TABLE border=0 width=500 class=.+?>(.*?)<\/TABLE>/g;
> 
> Does anyone have a suggestion for how to pull everything between each
> occurrence of these beginning tags?  I realize I'll end up losing the last
> record, but I'll deal with that later.
> Thanks,
> ~M

Have you tried using HTML::Parser? It should be included with
Fedora/Red Hat and will probably work a lot better than regular
expressions alone, since it tokenizes the markup for you and reports
each start tag together with its attributes.
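
Something along these lines might be a starting point. It is only a
rough sketch, assuming the HTML::Parser version 3 API and assuming a
record starts at any <TABLE> whose border, width and class attributes
match the ones in your message. The name split_records and the sample
HTML are made up for illustration. Each record here includes its
starting tag, and the last record is kept as well.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: split an HTML string into "records", where a record
# runs from one matching <TABLE ...> start tag up to the next one.
sub split_records {
    my ($html) = @_;
    my @records;
    my $current;    # undef until the first record-starting tag is seen

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: decide whether this tag opens a new record.
        start_h => [
            sub {
                my ($tag, $attr, $text) = @_;
                my $class  = defined $attr->{class}  ? $attr->{class}  : '';
                my $border = defined $attr->{border} ? $attr->{border} : '';
                my $width  = defined $attr->{width}  ? $attr->{width}  : '';
                if (   $tag eq 'table'
                    && $border eq '0'
                    && $width eq '500'
                    && $class =~ /^[AB]$/)
                {
                    push @records, $current if defined $current;
                    $current = '';
                }
                $current .= $text if defined $current;
            },
            'tagname, attr, text',
        ],
        # Everything else (text, end tags, comments, ...) is copied verbatim.
        default_h => [
            sub {
                my ($text) = @_;
                $current .= $text if defined $current && defined $text;
            },
            'text',
        ],
    );

    $parser->parse($html);
    $parser->eof;    # flush any buffered text

    # Keep the final record, which has no following start tag to close it.
    push @records, $current if defined $current;
    return @records;
}

# Made-up sample input with a nested table inside the first record.
my $sample = <<'HTML';
<TABLE border=0 width=500 class='A'>
<tr><td>first <TABLE><tr><td>nested</td></tr></TABLE></td></tr>
</TABLE>
<TABLE border=0 width=500 class='B'>
<tr><td>second</td></tr>
</TABLE>
HTML

my @records = split_records($sample);
print scalar(@records), " records found\n";
```

The nested <TABLE> in the first record does not start a new record
because it lacks the matching attributes, so it simply gets appended to
the record it sits in.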
-- 
Shahms King <shahms at shahms.com>