When John released his bindings to HTML Tidy, I joked with him that it would have been far more interesting (as a project) to write a proper HTML lexer rather than bind to an existing library. (Mainly because, having written one in PHP, I didn't think it would be that difficult, and I have a strange idea of fun...)
Well, over the weekend I was re-pondering this, partly because I had used the Flexy parser to try to parse HTML from a web site, and found that the tokenizer in Flexy was getting slower with age (5 seconds on average to parse a page). This is not normally a huge issue, as the parsed result is cached during the compile phase of the template engine. It is a huge issue if you are pulling pages down, parsing out the forms, and reposting the forms in a web test script.
So over the weekend, after a little Google search-and-discover trip, I ran across a little W3C project, "A Lexical Analyzer for HTML and SGML". It looked interesting, but it wasn't until I pulled the code down, untarred it and built it that I realized it could be used to write a really fast and simple HTML tokenizer. (Not only that, it could easily form the basis of a C-based backend for Flexy.)
Creating an extension that used the code (not as a library, just the C code pulled into a PHP extension) and parsed a string of HTML took about 30 minutes. It took an extra 3 hours, on and off over a few days, to make it return an array of tokens (with attributes sorted into a sensible structure).
So now I have a cute extension that has 1 function and 1 result, KISS at its best.
<?php
print_r(
flexyparser_tokenize(
file_get_contents("..some file...")
));
Outputs:
[0] => Array
    (
        [0] => 14 // token type (look up the source)
        [1] => // data (tag name or string)
        [2] => 1 // line number
        [3] => 0 // character position
    )
[1] => Array
    (
        [0] => 1
        [1] =>
        [2] => 2
        [3] => 50
    )
[2] => Array
    (
        [0] => 2
        [1] => HTML
        [2] => 2
        [3] => 51
    )
[3] => Array
    (
        [0] => 2
        [1] => HEAD
        [2] => 2
        [3] => 57
    )
.....
......
[15] => Array
    (
        [0] => 2
        [1] => A
        [2] => 6
        [3] => 212
        [4] => Array // array of attributes
            (
                [HREF] => "/pub/WWW/Consortium/"
            )
    )
[16] => Array
    (
        [0] => 2
        [1] => IMG
        [2] => 7
        [3] => 243
        [4] => Array
            (
                [align] => bottom
                [src] => "/pub/WWW/Icons/WWW/w3c_48x48"
            )
    )
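To give an idea of how that array might be consumed in something like a web test script, here is a rough sketch that pulls one attribute out of every matching tag. It makes a few assumptions: collect_attribute() is a hypothetical helper of mine, not part of the extension; token type 2 is taken to mean "start tag" purely because that is what the sample output suggests; and the $tokens array is typed in by hand in the same shape as the output, so the snippet runs without the extension loaded.

```php
<?php
// Hypothetical helper (not part of the extension): collect one attribute
// from every start tag with the given name.
function collect_attribute(array $tokens, $tagName, $attrName)
{
    // Assumption: type 2 is a start tag, inferred from the sample output.
    $startTag = 2;
    $found = array();

    foreach ($tokens as $token) {
        if ((int) $token[0] !== $startTag) {
            continue;
        }
        // Tag names appear upper-cased in the sample; compare without case.
        if (strcasecmp($token[1], $tagName) !== 0) {
            continue;
        }
        // Index 4 only exists when the tag carries attributes.
        $attributes = isset($token[4]) ? $token[4] : array();
        foreach ($attributes as $name => $value) {
            if (strcasecmp($name, $attrName) === 0) {
                // Quoted values keep their quotes in the sample, so strip them.
                $found[] = trim($value, "\"'");
            }
        }
    }

    return $found;
}

// Hand-written sample in the shape of the output; with the extension
// loaded this would come from flexyparser_tokenize() instead.
$tokens = array(
    array(2, 'HTML', 2, 51),
    array(2, 'HEAD', 2, 57),
    array(2, 'A', 6, 212, array('HREF' => '"/pub/WWW/Consortium/"')),
    array(2, 'IMG', 7, 243, array(
        'align' => 'bottom',
        'src'   => '"/pub/WWW/Icons/WWW/w3c_48x48"',
    )),
);

print_r(collect_attribute($tokens, 'a', 'href'));
```

With the sample tokens this prints an array holding the single link /pub/WWW/Consortium/. Real token type numbers should be checked against the source before relying on the value 2.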
The code is in my svn server, under akpear/flexyparser, and works perfectly with PHP5 and PHP4 at the moment.
I really want to do a tree version of this that loads the data into a user-defined object, e.g.
<?php
$tree = flexyparser_toTree($data, new MyClass);
so it can be used 'how you want it...'