HTML

Bill Campbell bill at celestial.com
Tue Nov 12 16:35:28 PST 2019


On Tue, Nov 12, 2019, Fairlight via Filepro-list wrote:
>The whole problem with that tech is that it's a moving target.  It's slower
>moving than it used to be, but there are vast differences between 3, 3.2,
>4, XHTML, and 5.
>
>I would not want to have to support that, long-term.  I do not envy you,
>unless it's one hell of a revenue stream.

I've done a fair amount of HTML parsing, mostly in Python with urllib to
fetch the pages, sometimes running the HTML through tidy to clean it up
before digging into it.  Tidy converts the HTML into well-formed XHTML,
which makes parsing the output much easier.
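
A minimal sketch of that fetch-then-tidy step, assuming the tidy binary is
installed and on PATH.  The function names fetch_html and tidy_to_xhtml and
the particular tidy options are illustrative choices, not a record of the
exact setup described above.

```python
import shutil
import subprocess
import urllib.request


def fetch_html(url, timeout=30):
    """Fetch a page with urllib and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def tidy_to_xhtml(html, tidy_cmd="tidy"):
    """Pipe HTML through the external tidy program and return XHTML text."""
    if shutil.which(tidy_cmd) is None:
        raise FileNotFoundError(f"{tidy_cmd!r} not found on PATH")
    proc = subprocess.run(
        [tidy_cmd, "-quiet", "-asxhtml", "-numeric", "-utf8",
         "--show-warnings", "no"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits 0 when clean, 1 on warnings, 2 on errors it could not fix.
    if proc.returncode >= 2:
        raise RuntimeError(proc.stderr.decode("utf-8", errors="replace"))
    return proc.stdout.decode("utf-8", errors="replace")
```

Typical use would be tidy_to_xhtml(fetch_html(some_url)); tidy closes
unclosed tags, lowercases element names, and quotes attribute values, which
is what makes the later digging simpler.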

I often parse the body of the HTML with regular expressions, finding that
easier than going through libraries like Python's lxml etree parser.
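
As a rough illustration of the regular-expression approach, here is a
sketch that pulls links out of the body.  It assumes the input has already
been normalized (for example by tidy) so that attribute values are
double-quoted; the patterns and the name extract_links are hypothetical,
and this is not a general HTML parser.

```python
import re

# Everything between <body ...> and </body>.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)
# An anchor with a double-quoted href; captures the URL and the inner markup.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*"([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
# Any remaining tag, used to strip markup from the link text.
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(xhtml):
    """Return (href, text) pairs for each link found in the body."""
    match = BODY_RE.search(xhtml)
    body = match.group(1) if match else xhtml
    links = []
    for href, inner in LINK_RE.findall(body):
        # Drop nested tags and collapse runs of whitespace in the link text.
        text = " ".join(TAG_RE.sub("", inner).split())
        links.append((href, text))
    return links
```

This works acceptably on cleaned-up markup with a known shape; on arbitrary
pages with unquoted attributes or nested anchors it will miss things.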

Bill
-- 
INTERNET:   bill at celestial.com  Bill Campbell; Celestial Software LLC
URL: http://www2.celestial.com/ 6641 E. Mercer Way
Mobile:         (206) 947-5591  PO Box 820
Fax:            (206) 232-9186  Mercer Island, WA 98040-0820

Cutting the space budget really restores my faith in humanity.  It
eliminates dreams, goals, and ideals and lets us get straight to the
business of hate, debauchery, and self-annihilation.  -- Johnny Hart

