An HTML parser is a long-term data mining solution; for quick hacks you might be tempted to reach for regular expressions instead. This parser is dedicated to doing one thing well: parsing HTML. It is arguably faster than regular expressions, since it can parse an entire HTML page in a single pass, and it is more structured and extensible.
The fast HTML parser feeds you tags and text as it finds them, through what can be described as OnTag and OnText callbacks or events. When you receive a tag in the OnTag event, you can analyze it with the pwhtmtils.pas and pwhtmtool.pas units (previously named htmlutils/htmutil) to get the name/value pairs from the tag attributes. When you receive text in the OnText event, you can analyze that too.
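Below is a minimal sketch of what hooking those two callbacks can look like in Free Pascal. The unit name (fasthtmlparser), the class name (THTMLParser), the event names (OnFoundTag/OnFoundText), the Exec method, and the callback signatures are assumptions based on common copies of this parser; check the interface section of the unit you actually downloaded and adjust the names to match.

{ A minimal sketch of the callback pattern described above.
  All parser names and signatures here are assumptions; see the lead-in text. }
program TagDumpDemo;

{$mode objfpc}{$H+}

uses
  fasthtmlparser; // assumed unit name; may differ in your copy

type
  { Handlers must be methods ("of object"), so wrap them in a small class. }
  THtmlDump = class
    procedure FoundTag(NoCaseTag, ActualTag: string);   // assumed signature
    procedure FoundText(Text: string);                   // assumed signature
  end;

procedure THtmlDump.FoundTag(NoCaseTag, ActualTag: string);
begin
  WriteLn('TAG : ', ActualTag);
end;

procedure THtmlDump.FoundText(Text: string);
begin
  WriteLn('TEXT: ', Text);
end;

var
  Parser: THTMLParser;
  Dump: THtmlDump;
begin
  Dump := THtmlDump.Create;
  // Feed the raw HTML as one string; the parser walks it in a single pass.
  Parser := THTMLParser.Create('<html><body><p>Hello <b>world</b></p></body></html>');
  try
    Parser.OnFoundTag := @Dump.FoundTag;
    Parser.OnFoundText := @Dump.FoundText;
    Parser.Exec; // fires the callbacks for every tag and text chunk found
  finally
    Parser.Free;
    Dump.Free;
  end;
end.

Because the handlers are ordinary methods, you can keep whatever state you like (counters, buffers, partial results) in the object that owns them and build up exactly the data you care about while the page streams past.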
The fast HTML parser was adapted from a Delphi developer's work and modified a bit; it is an open source and extremely simple HTML parser. It does not use a whitelist ruleset for validating HTML: it just parses the tags, and any HTML is acceptable. That makes it well suited to parsing arbitrary websites, since many pages on the internet are not written in proper, strict HTML form anyway.
At the time of writing, you can find the latest fast HTML parser at:
When I have time, I will write tutorials on using it to parse extremely complex HTML files, and maybe even record some videos of my command line racing through 10,000 HTML files in only a few seconds using some counter and back-buffer tricks.
This parser does not dump tags into a tree, although one could build a tree builder on top of it. Usually you just want to grab a certain part of a web page (such as all URLs or all table cells) and could not care less about the bloat surrounding the important content. This parser handles these situations well, since it only feeds you tags and text, which you can ignore completely or choose to analyze more closely.
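As an illustration of that "grab only what you need" approach, here is a sketch that collects every href from anchor tags and ignores everything else. It relies on the same assumed parser names as the previous example, and it pulls the attribute value out by hand with plain string routines rather than the pwhtmtool/pwhtmtils helpers mentioned above, just to keep the sketch self-contained.

{ Collect all hrefs from anchor tags; everything else is simply ignored.
  Parser names/signatures are assumptions; see the lead-in text. }
program HrefGrabDemo;

{$mode objfpc}{$H+}

uses
  SysUtils, Classes,
  fasthtmlparser; // assumed unit name; may differ in your copy

type
  THrefCollector = class
    Urls: TStringList;
    procedure FoundTag(NoCaseTag, ActualTag: string); // assumed signature
  end;

{ Very small attribute extractor: finds href="..." or href='...' in a tag.
  Good enough for a demo; messy real-world pages may need the helper units. }
function ExtractHref(const Tag: string): string;
var
  p, q: Integer;
  Quote: Char;
begin
  Result := '';
  p := Pos('HREF=', UpperCase(Tag));
  if p = 0 then Exit;
  p := p + Length('HREF=');
  if p > Length(Tag) then Exit;
  Quote := Tag[p];
  if (Quote = '"') or (Quote = '''') then
  begin
    q := p + 1;
    while (q <= Length(Tag)) and (Tag[q] <> Quote) do Inc(q);
    Result := Copy(Tag, p + 1, q - p - 1);
  end;
end;

procedure THrefCollector.FoundTag(NoCaseTag, ActualTag: string);
var
  Url: string;
begin
  // NoCaseTag is assumed to be an upper-cased copy of the whole tag, e.g. '<A HREF="...">'.
  if Pos('<A ', NoCaseTag) = 1 then
  begin
    Url := ExtractHref(ActualTag);
    if Url <> '' then
      Urls.Add(Url);
  end;
end;

var
  Parser: THTMLParser;
  Collector: THrefCollector;
  i: Integer;
begin
  Collector := THrefCollector.Create;
  Collector.Urls := TStringList.Create;
  Parser := THTMLParser.Create(
    '<p>See <a href="http://example.com/">example</a> and ' +
    '<a href=''/local/page.html''>a local page</a>.</p>');
  try
    Parser.OnFoundTag := @Collector.FoundTag;
    Parser.Exec;
    for i := 0 to Collector.Urls.Count - 1 do
      WriteLn(Collector.Urls[i]);
  finally
    Parser.Free;
    Collector.Urls.Free;
    Collector.Free;
  end;
end.

The same pattern works for table cells or any other fragment you care about: test the tag in the tag handler, keep what matches, and let the rest of the page fly by.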