HTML
Bill Campbell
bill at celestial.com
Tue Nov 12 16:35:28 PST 2019
On Tue, Nov 12, 2019, Fairlight via Filepro-list wrote:
>The whole problem with that tech is that it's a moving target. It's slower
>moving than it used to be, but there are vast differences between 3, 3.2,
>4, XHTML, and 5.
>
>I would not want to have to support that, long-term. I do not envy you,
>unless it's one hell of a revenue stream.
I've done a fair amount of HTML parsing, mostly using Python's urllib,
sometimes running the HTML through tidy to clean it up before digging into
it. Tidy will convert the HTML into well-formed XHTML, which makes parsing
the output much easier.
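A rough sketch of that fetch-then-tidy workflow, for illustration only. It assumes the HTML Tidy command-line tool is installed as `tidy`. The function names `fetch_html` and `tidy_to_xhtml` are hypothetical, and falling back to the raw input when tidy is missing or fails is a choice made for this sketch, not part of the workflow described above.

```python
import shutil
import subprocess
import urllib.request


def fetch_html(url, timeout=30):
    """Download a page with urllib and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Use the charset from the Content-Type header; UTF-8 is a guess otherwise.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def tidy_to_xhtml(html):
    """Pipe HTML through the tidy CLI and return well-formed XHTML.

    Assumption for this sketch: if tidy is not installed, or it reports
    hard errors, the input is returned unchanged.
    """
    tidy = shutil.which("tidy")
    if tidy is None:
        return html
    proc = subprocess.run(
        [tidy, "-quiet", "-asxhtml", "-numeric", "-utf8",
         "--show-warnings", "no"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits with 0 when clean, 1 on warnings, 2 on errors.
    if proc.returncode not in (0, 1) or not proc.stdout:
        return html
    return proc.stdout.decode("utf-8", errors="replace")
```

Typical use would be `xhtml = tidy_to_xhtml(fetch_html(url))`, after which the markup is regular enough that simple pattern matching has a much better chance of working.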
I often parse the body of the HTML with regular expressions, finding that
easier than running through libraries like Python's lxml etree parser.
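As one possible illustration of the regular-expression approach, the sketch below pulls link targets and link text out of the body. The patterns and the name `extract_links` are assumptions for this example. They expect quoted `href` attributes, which is what tidy's XHTML output provides, and they will not cope with arbitrary tag soup.

```python
import re

# Matches <a ... href="...">text</a>; assumes the href value is quoted.
LINK_RE = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return (href, text) pairs for each anchor found in the body."""
    match = BODY_RE.search(html)
    # Fall back to the whole document if there is no <body> element.
    body = match.group(1) if match else html
    return [
        (href, TAG_RE.sub("", text).strip())  # strip nested tags from link text
        for href, text in LINK_RE.findall(body)
    ]


sample = (
    "<html><body><p>"
    '<a href="http://example.com/">Example <b>site</b></a>'
    "</p></body></html>"
)
print(extract_links(sample))
```

The trade-off is the usual one: a handful of patterns is quick to write and easy to adjust for one known page layout, while a real parser is more robust when the markup varies.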
Bill
--
INTERNET: bill at celestial.com Bill Campbell; Celestial Software LLC
URL: http://www2.celestial.com/ 6641 E. Mercer Way
Mobile: (206) 947-5591 PO Box 820
Fax: (206) 232-9186 Mercer Island, WA 98040-0820
Cutting the space budget really restores my faith in humanity. It
eliminates dreams, goals, and ideals and lets us get straight to the
business of hate, debauchery, and self-annihilation. -- Johnny Hart