An HTML parser is a long-term data mining solution; for quick hacks you might be tempted to reach for regular expressions instead. This parser is dedicated to doing one thing well: parsing HTML. It is arguably faster than regular expressions, since it can parse an entire HTML page in a single pass, and it is more structured and extensible.
The fast HTML parser feeds you tags and text as it finds them, through what can be described as OnTag and OnText callbacks or events. When you receive a tag in the OnTag event, you can analyze it with the pwhtmtils.pas and pwhtmtool.pas units (previously named htmlutils/htmutil) to get the name/value pairs from the tag attributes. When you receive text in the OnText event, you can analyze that too.
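Below is a minimal sketch of what hooking those two callbacks can look like in Free Pascal. The unit name (fasthtmlparser), the class name (THTMLParser), the event names (OnFoundTag/OnFoundText), the Exec method, and the callback signatures are assumptions based on common copies of this parser; check the interface section of the unit you actually downloaded and adjust the names to match.

{ A minimal sketch of the callback pattern described above.
  All parser names and signatures here are assumptions; see the lead-in text. }
program TagDumpDemo;

{$mode objfpc}{$H+}

uses
  fasthtmlparser; // assumed unit name; may differ in your copy

type
  { Handlers must be methods ("of object"), so wrap them in a small class. }
  THtmlDump = class
    procedure FoundTag(NoCaseTag, ActualTag: string);   // assumed signature
    procedure FoundText(Text: string);                   // assumed signature
  end;

procedure THtmlDump.FoundTag(NoCaseTag, ActualTag: string);
begin
  WriteLn('TAG : ', ActualTag);
end;

procedure THtmlDump.FoundText(Text: string);
begin
  WriteLn('TEXT: ', Text);
end;

var
  Parser: THTMLParser;
  Dump: THtmlDump;
begin
  Dump := THtmlDump.Create;
  // Feed the raw HTML as one string; the parser walks it in a single pass.
  Parser := THTMLParser.Create('<html><body><p>Hello <b>world</b></p></body></html>');
  try
    Parser.OnFoundTag := @Dump.FoundTag;
    Parser.OnFoundText := @Dump.FoundText;
    Parser.Exec; // fires the callbacks for every tag and text chunk found
  finally
    Parser.Free;
    Dump.Free;
  end;
end.

Because the handlers are ordinary methods, you can keep whatever state you like (counters, buffers, partial results) in the object that owns them and build up exactly the data you care about while the page streams past.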
The fast HTML parser was adapted from a Delphi developer's work and modified a bit; it is an open source and extremely simple HTML parser. It does not use a whitelist ruleset for validating HTML: it just parses the tags, and any HTML is acceptable. That makes it well suited to parsing arbitrary websites, since many pages on the internet are not written in proper, strict HTML form anyway.
At the time of writing, you can find the latest fast HTML parser at:
When I have time, I will write tutorials on using it to parse extremely complex HTML files, and maybe even record some videos of my command line racing through 10,000 HTML files in only a few seconds using some counter and back-buffer tricks.
This parser does not dump tags into a tree, although one could build a tree builder on top of it. Usually you just want to grab a certain part of a web page (such as all URLs or all table cells) and could not care less about the bloat surrounding the important content. This parser handles these situations well, since it only feeds you tags and text, which you can ignore completely or choose to analyze more closely.
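As an illustration of that "grab only what you need" approach, here is a sketch that collects every href from anchor tags and ignores everything else. It relies on the same assumed parser names as the previous example, and it pulls the attribute value out by hand with plain string routines rather than the pwhtmtool/pwhtmtils helpers mentioned above, just to keep the sketch self-contained.

{ Collect all hrefs from anchor tags; everything else is simply ignored.
  Parser names/signatures are assumptions; see the lead-in text. }
program HrefGrabDemo;

{$mode objfpc}{$H+}

uses
  SysUtils, Classes,
  fasthtmlparser; // assumed unit name; may differ in your copy

type
  THrefCollector = class
    Urls: TStringList;
    procedure FoundTag(NoCaseTag, ActualTag: string); // assumed signature
  end;

{ Very small attribute extractor: finds href="..." or href='...' in a tag.
  Good enough for a demo; messy real-world pages may need the helper units. }
function ExtractHref(const Tag: string): string;
var
  p, q: Integer;
  Quote: Char;
begin
  Result := '';
  p := Pos('HREF=', UpperCase(Tag));
  if p = 0 then Exit;
  p := p + Length('HREF=');
  if p > Length(Tag) then Exit;
  Quote := Tag[p];
  if (Quote = '"') or (Quote = '''') then
  begin
    q := p + 1;
    while (q <= Length(Tag)) and (Tag[q] <> Quote) do Inc(q);
    Result := Copy(Tag, p + 1, q - p - 1);
  end;
end;

procedure THrefCollector.FoundTag(NoCaseTag, ActualTag: string);
var
  Url: string;
begin
  // NoCaseTag is assumed to be an upper-cased copy of the whole tag, e.g. '<A HREF="...">'.
  if Pos('<A ', NoCaseTag) = 1 then
  begin
    Url := ExtractHref(ActualTag);
    if Url <> '' then
      Urls.Add(Url);
  end;
end;

var
  Parser: THTMLParser;
  Collector: THrefCollector;
  i: Integer;
begin
  Collector := THrefCollector.Create;
  Collector.Urls := TStringList.Create;
  Parser := THTMLParser.Create(
    '<p>See <a href="http://example.com/">example</a> and ' +
    '<a href=''/local/page.html''>a local page</a>.</p>');
  try
    Parser.OnFoundTag := @Collector.FoundTag;
    Parser.Exec;
    for i := 0 to Collector.Urls.Count - 1 do
      WriteLn(Collector.Urls[i]);
  finally
    Parser.Free;
    Collector.Urls.Free;
    Collector.Free;
  end;
end.

The same pattern works for table cells or any other fragment you care about: test the tag in the tag handler, keep what matches, and let the rest of the page fly by.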