Published 2004-11-01 22:59:54

When John released his bindings to HTML Tidy, I joked with him that it would have been far more interesting (as a project) to write a proper HTML lexer rather than bind to an existing library (mainly because, having written one in PHP, I didn't think it would be that difficult, and I have a strange idea of fun)...

Well, over the weekend I was re-pondering this, partly because I had used the Flexy parser to try to parse HTML from a web site and found that the tokenizer in Flexy was getting slower with age (5 seconds on average to parse a page). Normally this is not a huge issue, as the parsed result is cached during the compile phase of the template engine. It is a huge issue if you are pulling pages down, parsing out the forms, and reposting the forms in a web test script.

So over the weekend, after a little Google search-and-discover trip, I ran across a little W3C project, "A Lexical Analyzer for HTML and SGML". It looked interesting, but it wasn't until I pulled the code down, untarred it and built it that I realized it could be used to write a really fast and simple HTML tokenizer. (Not only that, it could easily form the basis of a C-based backend for Flexy.)

Creating an extension that used the code (not as a library, just the C code pulled into a PHP extension) and parsed a string of HTML took about 30 minutes. It took an extra 3 hours, on and off over a few days, to make it return an array of tokens (with attributes sorted into a sensible structure).

So now I have a cute extension that has 1 function and 1 result, KISS at its best.

<?php
print_r(
flexyparser_tokenize(
file_get_contents("..some file...")
));

Outputs:


    [0] => Array
        (
            [0] => 14 // token type (look up the source)
            [1] =>    // data (tag name or string)
            [2] => 1  // line number
            [3] => 0  // character position
        )

    [1] => Array
        (
            [0] => 1
            [1] =>

            [2] => 2
            [3] => 50
        )

    [2] => Array
        (
            [0] => 2
            [1] => HTML
            [2] => 2
            [3] => 51
        )

    [3] => Array
        (
            [0] => 2
            [1] => HEAD
            [2] => 2
            [3] => 57
        )
.....
......
     [15] => Array
         (
            [0] => 2
            [1] => A
            [2] => 6
            [3] => 212
            [4] => Array  // array of attributes
                (
                    [HREF] => "/pub/WWW/Consortium/"
                )

        )

    [16] => Array
        (
            [0] => 2
            [1] => IMG
            [2] => 7
            [3] => 243
            [4] => Array
                (
                    [align] => bottom

                    [src] => "/pub/WWW/Icons/WWW/w3c_48x48"
                )

        )
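As a rough illustration of working with that structure, here is a minimal sketch that pulls one attribute out of every matching start tag. It makes a few assumptions that are not confirmed by the extension itself: that type 2 means "start tag" (it is the type on HTML, HEAD, A and IMG in the sample), that attribute values keep their surrounding quotes as printed, and that `collect_attribute` and `TOKEN_START_TAG` are names invented here. It uses modern PHP syntax and runs on hand-built tokens copied from the sample, so it does not need the extension.

```php
<?php
// Assumed value: start tags carry type 2 in the sample output above.
// Check the extension source for the real token type numbers.
const TOKEN_START_TAG = 2;

/**
 * Collect the value of one attribute from every start tag with a given name.
 * $tokens is assumed to have the shape printed above:
 * [type, data, line, position, (optional) attributes].
 */
function collect_attribute(array $tokens, string $tag, string $attr): array
{
    $found = [];
    foreach ($tokens as $token) {
        if ($token[0] !== TOKEN_START_TAG || strcasecmp($token[1], $tag) !== 0) {
            continue;
        }
        // Attribute names keep the case used in the page (HREF vs src),
        // so compare them case-insensitively.
        foreach ($token[4] ?? [] as $name => $value) {
            if (strcasecmp($name, $attr) === 0) {
                // Values appear to keep their quotes; strip them.
                $found[] = trim($value, "\"'");
            }
        }
    }
    return $found;
}

// Hand-built tokens copied from the sample output.
$tokens = [
    [2, 'A', 6, 212, ['HREF' => '"/pub/WWW/Consortium/"']],
    [2, 'IMG', 7, 243, ['align' => 'bottom', 'src' => '"/pub/WWW/Icons/WWW/w3c_48x48"']],
];

print_r(collect_attribute($tokens, 'a', 'href'));
```

With the real extension, `$tokens` would simply be whatever `flexyparser_tokenize()` returns.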

The code is in my svn server, under akpear/flexyparser, and works perfectly with PHP5 and PHP4 at the moment.

I really want to do a tree version of this that loads data into a user-defined object, e.g.:
<?php
$tree = flexyparser_toTree($data, new MyClass);

so it can be used 'how you want it...'
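Until something like that exists in C, the same idea can be sketched in userland PHP on top of the flat token array. This is only a guess at how it might behave, not the planned implementation: the end-tag type is not shown in the sample output, so `TREE_END_TAG = 3` is a placeholder, and `MyNode`, `tokens_to_tree` and the list of void elements are all hypothetical names and choices made for this example. Text tokens are ignored to keep it short.

```php
<?php
// Start tags are type 2 in the sample output; the end-tag value is NOT
// shown in the post, so 3 is purely a placeholder assumption.
const TREE_START_TAG = 2;
const TREE_END_TAG = 3;

// Hypothetical user-defined node class, standing in for "MyClass".
class MyNode
{
    public $tag = '#root';
    public $attributes = [];
    public $children = [];
}

/**
 * Build a tree from a flat token list by cloning a prototype object for
 * each start tag. Elements that never have an end tag are listed in $void
 * so they are added as children but never become a parent.
 */
function tokens_to_tree(array $tokens, $prototype, array $void = ['IMG', 'BR', 'HR', 'INPUT', 'META', 'LINK'])
{
    $root = clone $prototype;
    $stack = [$root];
    foreach ($tokens as $token) {
        if ($token[0] === TREE_START_TAG) {
            $node = clone $prototype;
            $node->tag = strtoupper($token[1]);
            $node->attributes = $token[4] ?? [];
            $node->children = [];
            end($stack)->children[] = $node;
            if (!in_array($node->tag, $void, true)) {
                $stack[] = $node;
            }
        } elseif ($token[0] === TREE_END_TAG) {
            // Close the nearest open element with this name, if any.
            $name = strtoupper($token[1]);
            for ($i = count($stack) - 1; $i > 0; $i--) {
                if ($stack[$i]->tag === $name) {
                    array_splice($stack, $i);
                    break;
                }
            }
        }
        // Other token types (text, comments, ...) are skipped in this sketch.
    }
    return $root;
}

// Hand-built tokens in the same shape as the tokenizer output.
$tokens = [
    [2, 'HTML', 2, 51],
    [2, 'BODY', 3, 0],
    [2, 'A', 6, 212, ['HREF' => '"/pub/WWW/Consortium/"']],
    [2, 'IMG', 7, 243, ['src' => '"/pub/WWW/Icons/WWW/w3c_48x48"']],
    [3, 'A', 7, 300],
    [3, 'BODY', 8, 0],
    [3, 'HTML', 9, 0],
];

$tree = tokens_to_tree($tokens, new MyNode);
```

Passing an object and cloning it mirrors the `new MyClass` argument above: the caller decides which class the nodes are, and the builder only fills in properties.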
