Parsing, a science, an art
Parsing is most likely more a science than an art, as there are some well-known techniques that are used over and over again in all parsers, but in slightly different ways - which is what makes parsing interesting and fun.
However, to make the article sound more interesting I wanted to call it the Art of Parsing, since all programmers these days seem to think parsing with C or Pascal while loops (and in some cases, unstructured GOTOs) is some voodoo black art that only elite folks can do. It's not - but I'm not going to try to convince anyone that programming is a science, because people get bored when something isn't a black art or voodoo magic. And don't think that I'm some computer science student or professor trying to push my bullshit on you - I've never taken a computer science course in my life, and I still think computing is more of a study/science than it is an art.
Bison, Yacc, Lex
These don't teach you how to parse. Stay away from them until you know how to parse. Learning about the tape deck is more important than learning how to define language rule sets. If you learn about the tape deck, you learn other things about programming that come up even when you aren't parsing. You will need to know about the tape deck first anyway, to understand what to put in those rule sets that parser generators require. What's this tape deck I speak of? That is explained further below.
We all parse, even non-programmers
When I first started parsing, several eons ago, I didn't even know how to program. At that time I was using a text editor and regular expressions. I didn't know I was parsing at the time; I thought it was just "search and replace". After I learned to program I started making use of search and replace in some websites and programs. Search and replace is a poor man's way of parsing - someone with poor programming skills will never look past it. Now I only use search and replace in programs when there is a really small and quick job that needs to be done.
Spaghetti code is easily made with "search and replace" style parsing. Many PHP programs prove that search and replace is powerful - some so-called "parsers" written in PHP don't actually parse anything, they just search and replace things. There is no doubt that even search and replace can create extremely powerful programs. And some people even maintain these programs and spend hours tweaking the spaghetti to be thrown in slightly different ways - and again, PHP has proven that even search and replace style parsing is only somewhat maintainable. Even if something is hard to maintain or hard on your eyes, that doesn't stop someone from maintaining it.
The biggest myth floating about is that projects are dropped when the code becomes too unmaintainable. The reality is that certain programmers live for this sort of thing - if a script is hard to maintain and they wrote it themselves, they usually can maintain it. Even if it is hard to maintain, that does not mean the project is dropped or that the code is replaced with better code. I've seen hundreds of projects with search and replace style parsers that have been around for longer than 10 years - and they are still maintained today - and they will still be maintained for the next 10 years, and possibly the next 10 after that. It's sort of like wondering why stores still sell white sugar, cigarettes, and Coke if they've been proven to be unhealthy - come on, don't be ridiculous.
But this article is about the science or art of parsing, not search and replacing.
The metaphor: parsing is a tape deck.
Parsing is an advanced tape deck or reel-to-reel system, where you can make your own buttons, analyzers (equalizers), and counters. Audio equipment costs you thousands of dollars, but your parsing equipment is generated with code. No hardware audio tape deck offers you the ability to magically create your own buttons and counters on the fly. Audio tape decks give you a fixed number of buttons and features, a fixed number of counters, and a fixed number of input and output jacks. But parsing lets you build your own three million dollar high quality tape deck with as many counters and input/output jacks as you like.
Parsing allows you to have a tape deck with more tapes than just two. Most tape decks have two tapes - one for playback, one for recording. But your parser can contain 20 tapes if it needs to, or 200 tapes. With all this power, though, your tape deck is only useful if you know how to operate it. If you buy a 3 million dollar tape deck and you aren't expert enough to use it, then you've simply got to study how to use it.
Peek Ahead, Skip, Record, Rewind, Fast Forward, Start Counter
Maintainable, advanced parsing is done via peek-ahead, peek-behind, skip-ahead, and skip-back tactics using incrementing and decrementing. Once you've peeked behind or peeked ahead at the data that you need to see, and once you've recorded any of the data into data structures and buffers, you rest easy knowing that you can mung around with that data and those buffers at a later time, in some other sub-procedure or some other event.
With a tape deck parser you can skip past or skip back if you arrive at data that you don't need to see or analyze. You simply increment or decrement your tape deck to the next part of the tape that you want to see. If you've found data that you need to dissect, you record the data into tape2 or tape3, or tape10 for that matter - this external tape structure or buffer stores your data without losing your position on tape1, your main tape. You simply hit the record button of tape2 or tape3 when you need to. You have to understand that you aren't operating a limited double tape deck here with tapeA and tapeB.
You can rewind tape1 a few counts back once you've got that data onto tape2 and tape3, and you can put some more data onto tape4 that you got from tape2. All the data originally comes from tape1; you are just analyzing tape1 and dissecting the hell out of it. You can dissect the hell out of data with regular expressions too, but you can't rewind and fast forward to certain known places and positions in the data tape with regular expressions.
It's hard to call parsing "char by char parsing" since it is much more than that - sometimes we skip ahead a few hundred characters if we don't care about them. Sometimes we peek ahead to see what that data is and where it ends, and then, if that data is useful, we return back to our held or remembered position.
All this reminds me of a tape deck or a record player. If you want to listen to a certain song you must manually find where this song starts on your record player. With a tape player it is a bit more precise - and with a CD it becomes easier because tracks are used to memorize positions on the CD. But with a programming language, you have even more power.
For if he's stuck do begin
One mistake people make when parsing char by char style or tape deck style is they use a FOR loop. The problem with FOR loops is that they are like a one-way train. The train can't go back and forth if it misses a stop or misses some nice scenery. The train can't jump off the tracks and skip ahead 4 miles when it needs to. A FOR loop is a very limited train going in one direction. Don't get stuck in FOR-loop thinking... remember a tape deck can rewind and fast forward, and a tape deck doesn't just go one way like a train.
A WHILE or REPEAT loop is a much better choice for parsing when you need full control. With a while loop you can skip ahead or skip back 15 chars or 10 chars or 2 characters. You make your own DOWNTOs and FORs on demand with a while or repeat loop. You need to be able to break away from that inc(i, 1). With a while loop that is designed well, you control whether you are doing a DOWNTO, FOR, SKIP PAST, SKIP BACK, SEEK AHEAD, SEEK BACK, READ BACK, READ FORWARD, PEEK AHEAD, PEEK BACK, IGNORE UNTIL, REDO, START COUNTER, RECORD DATA1, RECORD DATA2, etc.
Rather than thinking of a char by char parser it is better to think of a parser as a tapedeck that has the ability to stop and rewind a bit, but not rewind all the way, and at the same time the tape deck can stop and fast forward a bit without fast forwarding all the way.
The term "single pass" parser is confusing, since several pieces of text get munged around while you are passing by. The idea is that you should not RESET your tape deck to ZERO and start all over again if you can avoid it. Sometimes you have to reset to zero, and that is okay. But usually you can get by with rewinding or fast forwarding your tape deck partway, without resetting it all the way back to zero.
Black art or a Science
I've already mentioned this but I want to mention it again. I consider parsing more of a science than an art - because there are tactics which can be used over and over, and tactics that are proven to work in all parsers. There are certain patterns or techniques you can reuse in other parsers.
There are plenty of "scraper" companies and blogs out there who discuss "scraping" as if it were some sort of devilish black art. I've found that "scraping" is just a buzzword for parsing. Some scrapers out there, instead of parsing, use optical character recognition, or a combination of parsing and optical character recognition. But I still don't think this is a black art.
Either way, I still like to call this article "The Art Of Parsing" instead of "The Science of Parsing" because I know more people will read a mystical sounding article.
How can I rewind or fast forward a Regular Expression?
Tape deck parsers are better than regular expression parsers. There, I've said it. You can even use regular expressions inside a tape deck parser if you need to, making it more powerful than a regular expression based parser.
With a tape deck parser, if I'm scanning a piece of text and I want to find the {stuff inside curly braces}, I stop at the open curly { and remember my current position on the tape deck. I record that position. I then peek ahead until I see the close curly }, and once I'm done with that data I still have a memory of where I originally left off at the open curly - because I recorded that position. I've also counted how many characters are in between the open and close curly while I was analyzing that data, so that if I need to, I can jump past that data when I'm done screwing with it - and I can jump past it instantly using an INC(i, amount) call.
None of this is possible with regular expressions, because you aren't controlling the tape deck with regular expressions - you're just telling the tape deck to search for certain pieces of music that have drums and trumpet in them, or music that only has guitar and saxophone in it. But what if I want to stop at the drumming and skip ahead to the saxophone, and then go back to the drumming again? Regular expressions end up causing several passes and resets back to tape position zero. Some regular expressions can be designed so that there aren't any resets back to zero, but those regular expressions quickly become unmaintainable (although unmaintainable doesn't mean that companies don't still waste thousands of dollars "maintaining" the unmaintainable - have a look at PHP code and you will see what I mean). There is no way to precisely control the tape deck with a regular expression.
Sadly regular expressions limit a programmer's thought so much that the programmer sometimes gets stuck in a rut where regexes are the only way and the best way to do everything and anything.
This is one reason why I don't like Perl - because it encourages almost anything and everything to be a regex - and this of course makes code messy too. Larry Wall once said that Lisp code looked like oatmeal with nail clippings mixed in. That's very nice, but I think Larry should look at his own code in the mirror - Perl code looks like Regex Soup and censored swear words. It's as if some kids opened a tin of alphabet soup at school lunch hour and started arranging the letters to spell out swear words with pound signs and whatnot. Remember Chef Boyardee alphabet soup? If that soup had also come with pound signs and curly braces, you'd be sure to find some elementary school kids arranging their soup to spell out censored swear words at lunch hour.
It's like regular expressions are a drug that traps you into one way of thinking.
Anyway, now that I've dealt with the {data between curlies}, I can skip ahead to the first character after the close curly. I'm now ready to scan for my next data area or data zone. All this is very precise, and the programmer has full control over the tape deck - many, many more buttons are available than on an audio tape deck, and you can create your own buttons that do different things. You could create one button that checks the current data for bad characters. This button could run another while loop inside your main while loop, or this button could use a regex. But the point is to not base the parser on regexes from the start - otherwise you lose the ability to rewind, fast forward, and count. You lose the abilities of a tape deck when you use regexes from day one.
Now you may be thinking: yes, that's nice, but what happens when a nest comes up {like {this} and there are {multiple {nests} inside} nests}? Not a problem - just keep track of how many open curlies you arrive on before you arrive on a close curly. If an open curly occurs after another open curly, then you know a nest has started. One can build a funny looking regex to detect nests too - I've seen PHP code that handles nests with regular expressions - it's just kind of silly and amateur, that's all. If one doesn't know about the "tape deck", and he doesn't have his doors open to learn why a tape deck is superior to a pattern match, then he won't be an excellent programmer - he'll just be a good or amateur programmer for his entire life.
When it comes time to upgrade the parser, sometimes the regular expression breaks all your existing code, or you need to rewrite the regular expression from scratch, or you need to use multiple passes to make the code clean (several smaller regular expression passes). And it becomes an unmaintainable piece of poo. But with a proper tape deck parser you can extend the parser to handle nests by simply improving your tape deck.
This article isn't finished; I'll come back to it later. I think this topic could be covered in a book, as it just can't be summed up in a few paragraphs.
Even Better than a Tape Deck
The analogy breaks down because tape decks make copies of tapes. With programming you can use pointers to strings, which are cheap (i.e. fast, and they don't take up extra copied memory). Pointers don't even need to be used purposely - a lot of the pointer work is done automatically for you, without you thinking about pointers. For example, when you send a CONST string parameter (read only) in to a procedure or function in FPC, it's not a copy of the string (a new tape in the tape deck), it's just a pointer to the data! Fast, extremely compact storage. Same goes for slices in GoLang (the best invention ever since sliced bread).
Thought Experiment
At first I thought that the Tape Deck analogy was just an analogy, but later I found out this may be a form of what some call Visual Thinking, or even a "thought experiment". Quoting from Wikipedia:
"In its broadest usage, thought experimentation is the process of employing imaginary situations to help us understand the way things really are.."
"A thought experiment (from the German term Gedankenexperiment, coined by Hans Christian Ørsted) in the broadest sense is the use of a hypothetical scenario to help us understand the way things actually are."
"Thought experiments have been used in a variety of fields, including philosophy, physics, and mathematics. In philosophy, they have been used at least since Greek antiquity, some pre-dating Socrates. In physics and other sciences, notable thought experiments date from the 19th, and especially the 20th Century, but examples can be found at least as early as Galileo."
See Visual Thinking and Thought Experiments on Wikipedia.
Visualizing parsing as a tape deck is an imaginary situation. It helps us understand visually how parsing really works. It allows us to see the advantages of parsing, since we already know visually how tape decks and record players work. Computers do not show any visible movements when you look at the processor chip, so if you just think of parsing as a processor chip processing data, you will never understand parsing. You must visually step inside your parser and pretend you can press buttons to fast forward and rewind the record.
At night when I am going to sleep, sometimes I imagine myself running through a procedure as a VAR param and coming out of the procedure modified. Or if you are a C programmer, imagine being sent in to a C function, pointed to, then modified. That's your brain every day, it's a stdin/stdout system modified by the environment, albeit just a touch more complex.
In the future this article will contain code to back up my claim that parsers can be better than any super advanced multi tape deck system. I will also include code proving that parsers can be written in much more maintainable ways than with regex based hacks. For example, see the Simple Wiki parser in Powtils, or the modified fast html parser for Delphi and FPC. C2 Wiki and Ward Cunningham, and Wikipedia owners, and all other such programmers: if you're using Perl and ugly regexes to power your website, you're doing it wrong. PHP is the new Perl. Wikipedia regex infested source code, please die.