Parsing HTML with regex

pmjv@lemmy.sdf.org · 9 months ago

Parsing HTML with regex

MonkderZweite@feddit.ch · edit-2 9 months ago

Actually, you can’t even parse HTML5 with specialized tools, or by converting it and then running XML linters (they bail out because of too many errors). The only tools that can parse HTML (mostly) reliably are the big three browser engines. This comes from my experience converting saved webpages to asciidoctor: it still takes manual cleanup, despite tidy and pandoc.

kevincox@lemmy.ml · 9 months ago

This isn’t true. HTML5 defines a very strict set of parsing rules, and there is a good handful of compliant parsers. But yes, you absolutely can’t use an XML parser. You can’t even use an XML emitter, since it can emit valid XML that means something completely different in HTML.

…what a fucking disaster. I still wish XHTML had won.
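
To make the XML emitter point above concrete, here is a minimal sketch using only Python's standard library. The snippet and the `a.js` filename are made up for illustration. The default XML serializer collapses an empty `<script>` into a self-closing tag, which is perfectly valid XML, but an HTML parser ignores the trailing slash on non-void elements, so it would treat that as an opening tag and swallow everything after it as script content.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: an empty <script> element followed by a paragraph.
source = '<div><script src="a.js"></script><p>hi</p></div>'
root = ET.fromstring(source)

# Default XML serialization collapses the empty element to a self-closing tag.
# Valid XML, but in HTML the "/" is ignored and <script> is left open.
xml_out = ET.tostring(root, encoding="unicode")

# method="html" keeps the explicit end tag that HTML requires.
html_out = ET.tostring(root, encoding="unicode", method="html")

print(xml_out)   # <div><script src="a.js" /><p>hi</p></div>
print(html_out)  # <div><script src="a.js"></script><p>hi</p></div>
```

Both strings describe the same tree as XML, but only the second one round-trips through an HTML parser with the `<p>` intact.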