HTML
A pure-PHP HTML5 parser and tag rewriter mirroring WordPress core's HTML API. Treat HTML the way browsers do — without libxml2, DOMDocument, or regex hacks — and rewrite attributes in a single linear pass.
When the native API extension is loaded, tag and fragment processors use native delegates by default and keep the PHP implementation as a fallback. This includes the covered table, list, description-list, select/option/optgroup, omitted-paragraph, and ruby tree-builder cases. Define WP_NATIVE_APIS_DISABLE_DEFAULTS before loading the component to force the pure PHP fallback. Full-document parsing through WP_HTML_Processor::create_full_parser() remains PHP-backed for now.
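A minimal sketch of forcing the PHP fallback. The text above only says the constant must be defined before the component loads, so the value true and the autoloader path are assumptions.

```php
<?php
// Assumption: any defined value disables the native defaults; true is used here.
define( 'WP_NATIVE_APIS_DISABLE_DEFAULTS', true );

// Assumption: the component is loaded through Composer's autoloader at this path.
require __DIR__ . '/vendor/autoload.php';

// From here on, processors should run on the pure PHP implementation.
$tags = new WP_HTML_Tag_Processor( '<p>Hello</p>' );
```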
composer require wp-php-toolkit/html
WordPress runs HTML fragments through filters every time a request renders: post content, block markup, comments, excerpts, widgets, feeds, imported documents. Those fragments can omit <html> and <body>, close tags implicitly, or mix browser-correct markup with author mistakes that DOMDocument and regular expressions do not model well.
The HTML component gives WordPress-style code the same parsing model WordPress core uses: a browser-compatible tokenizer and tree-aware processor that run in pure PHP. Choose it for exact-byte rewrites, imperfect fragments, and post-content filters where a full DOM would do too much work.
The component gives you two processors. WP_HTML_Tag_Processor is a forward-only cursor over tags and tokens — useful for attribute rewriting at scale. WP_HTML_Processor layers HTML5 tree construction on top so you can query by ancestry (breadcrumbs), serialize the parsed document, and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
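The implied-paragraph claim can be checked with a short sketch. It assumes the classes are already loaded (the autoloader path is hypothetical) and that serialize() is available in the installed version, as the paragraph above suggests.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

// Collect the ancestry of every P element in markup that omits the closing tags.
$processor = WP_HTML_Processor::create_fragment( '<p>one<p>two' );
$paths     = array();
while ( $processor->next_tag( 'P' ) ) {
    $paths[] = implode( ' > ', $processor->get_breadcrumbs() );
}

// Both paragraphs report the same ancestry: they are siblings, not nested.
print_r( $paths );

// Serializing a fresh processor writes the implied closing tags back out.
$normalized = WP_HTML_Processor::create_fragment( '<p>one<p>two' )->serialize();
echo $normalized, "\n";
```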
Add loading="lazy" to every image
The "hello world" of tag rewriting. One linear pass, no DOM, no reserialization cost beyond the bytes you actually changed.
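A minimal sketch of that pass. The sample markup and file names are made up for illustration, and the autoloader path is an assumption.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

$html = '<p>Intro</p><img src="hero.jpg" alt="Hero"><img src="team.jpg" alt="Team">';

$tags = new WP_HTML_Tag_Processor( $html );

// Visit each IMG opening tag in document order and set the attribute on it.
while ( $tags->next_tag( 'img' ) ) {
    $tags->set_attribute( 'loading', 'lazy' );
}

$output = $tags->get_updated_html();
echo $output, "\n";
```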
Try this: change 'lazy' to 'eager' on the first image only by guarding the call with $tags->get_attribute( 'src' ) === 'hero.jpg'. get_updated_html() then rewrites only the bytes of that one tag.
Rewrite relative links to absolute URLs
Use this before sending post content to an RSS feed, an email template, or a CDN-backed copy of a site. The processor rewrites only the changed bytes, so untouched markup stays byte-identical.
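One way this could look, as a sketch. The base URL is a placeholder, and only root-relative links (those starting with a single slash) are handled; path-relative links such as ../page would need real URL resolution.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

$base = 'https://example.com'; // Hypothetical site origin.
$html = '<a href="/about">About</a> <a href="https://wordpress.org/">WP</a> <a href="//cdn.example.com/x">CDN</a>';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'a' ) ) {
    $href = $tags->get_attribute( 'href' );

    // get_attribute() returns null when the attribute is missing and true when it has no value.
    if ( ! is_string( $href ) ) {
        continue;
    }

    // Rewrite "/path" but leave protocol-relative "//host/path" and absolute URLs alone.
    if ( 0 === strpos( $href, '/' ) && 0 !== strpos( $href, '//' ) ) {
        $tags->set_attribute( 'href', $base . $href );
    }
}

$output = $tags->get_updated_html();
echo $output, "\n";
```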
Strip every script and inline event handler
A common sanitization step: neutralize untrusted HTML before display. Blank a script's body with set_modifiable_text() and strip every on* attribute via get_attribute_names_with_prefix().
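A sketch of that step, not a complete sanitizer. Removing the src attribute from scripts is an addition beyond the description above, made so external scripts are neutralized too. The on prefix also matches any other attribute whose name starts with those two letters.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

$html = '<p onclick="steal()">Hi</p><script src="x.js">alert(1)</script><img src="a.png" onerror="x()">';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag() ) {
    if ( 'SCRIPT' === $tags->get_tag() ) {
        // Blank the inline body and drop the external source.
        $tags->set_modifiable_text( '' );
        $tags->remove_attribute( 'src' );
    }

    // Names come back lowercased; null means there is nothing to iterate.
    foreach ( $tags->get_attribute_names_with_prefix( 'on' ) ?? array() as $name ) {
        $tags->remove_attribute( $name );
    }
}

$output = $tags->get_updated_html();
echo $output, "\n";
```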
Stamp a CSP nonce on inline scripts and styles
A Content Security Policy that allows inline code through nonce- sources requires every inline <script> and <style> to carry a matching nonce attribute. A tag-by-tag pass is exactly the right granularity.
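A sketch of the stamping pass. Generating the nonce with random_bytes() is an assumption for illustration; in practice the same value must also appear in the Content-Security-Policy header for the response.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

// One fresh nonce per response, shared by the header and every inline tag.
$nonce = base64_encode( random_bytes( 16 ) );
$html  = '<script>a()</script><script src="x.js"></script><style>p{color:red}</style>';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag() ) {
    $tag = $tags->get_tag();

    // Inline scripts have no src attribute; every STYLE element is inline.
    $is_inline_script = 'SCRIPT' === $tag && null === $tags->get_attribute( 'src' );
    if ( $is_inline_script || 'STYLE' === $tag ) {
        $tags->set_attribute( 'nonce', $nonce );
    }
}

$output = $tags->get_updated_html();
echo $output, "\n";
```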
Build a srcset from a single src
Generate responsive image markup at render time without touching the editor data model. Read the existing src, derive a srcset with width descriptors, add a sizes hint.
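A sketch under a loud assumption: the image host resizes on a w query parameter. That URL scheme, the width list, and the sizes value are all hypothetical and would need to match your own image service.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

$widths = array( 480, 960, 1440 ); // Hypothetical breakpoints.
$html   = '<img src="/photo.jpg" alt="">';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    $src = $tags->get_attribute( 'src' );

    // Skip images without a usable src or with an author-provided srcset.
    if ( ! is_string( $src ) || '' === $src || null !== $tags->get_attribute( 'srcset' ) ) {
        continue;
    }

    // Append the width parameter with the right separator for this URL.
    $separator  = false === strpos( $src, '?' ) ? '?' : '&';
    $candidates = array();
    foreach ( $widths as $width ) {
        $candidates[] = "{$src}{$separator}w={$width} {$width}w";
    }

    $tags->set_attribute( 'srcset', implode( ', ', $candidates ) );
    $tags->set_attribute( 'sizes', '(max-width: 960px) 100vw, 960px' );
}

$output = $tags->get_updated_html();
echo $output, "\n";
```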
Decode HTML entities the way the spec demands
The HTML5 entity table has roughly 2,200 named references and a long list of edge cases. WP_HTML_Decoder implements the algorithm — don't roll your own.
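A short sketch of the decoder, including one of the edge cases: a legacy reference without a semicolon decodes in text but not inside an attribute value when a letter follows it. decode_attribute() is used here alongside decode_text_node(); the autoloader path is an assumption.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Decoder' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

// Named and numeric references in text content.
$company = WP_HTML_Decoder::decode_text_node( 'AT&amp;T' );
$word    = WP_HTML_Decoder::decode_text_node( 'caf&eacute; caf&#xE9;' );

// The same bytes decode differently depending on where they appear.
$in_text      = WP_HTML_Decoder::decode_text_node( '&notit;' );
$in_attribute = WP_HTML_Decoder::decode_attribute( '&notit;' );

var_dump( $company, $word, $in_text, $in_attribute );
```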
Find images by ancestry with breadcrumbs
The full WP_HTML_Processor understands HTML5 tree construction, so you can ask "find every <img> directly inside a <figure>" without writing your own DOM walker.
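A sketch of that query. The markup is invented for illustration; the breadcrumbs array lists the parent first and the target element last.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

$html = '<figure><img src="in-figure.jpg"><figcaption>Caption</figcaption></figure>'
    . '<p><img src="in-paragraph.jpg"></p>'
    . '<figure><a href="/x"><img src="linked.jpg"></a></figure>';

$processor = WP_HTML_Processor::create_fragment( $html );
$found     = array();

// Match IMG elements whose direct parent is a FIGURE.
while ( $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
    $found[] = $processor->get_attribute( 'src' );
}

// The linked image is skipped because an A element sits between it and the FIGURE.
print_r( $found );
```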
Outline a document by walking tokens with depth
The full processor exposes get_current_depth() and get_breadcrumbs(). Combine with next_token() to print a structural outline.
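A sketch of such an outline. Depth values include the fragment's implicit ancestors, so the example indents relative to the shallowest element it saw rather than assuming an absolute starting depth.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

$html = '<section><h2>Title</h2><ul><li>One<li>Two</ul></section>';

$processor = WP_HTML_Processor::create_fragment( $html );
$outline   = array();

while ( $processor->next_token() ) {
    // Keep opening tags only; skip text, comments, and closing tags.
    if ( '#tag' !== $processor->get_token_type() || $processor->is_tag_closer() ) {
        continue;
    }
    $outline[] = array( $processor->get_current_depth(), $processor->get_token_name() );
}

// Indent each element relative to the shallowest one.
$base = $outline ? min( array_column( $outline, 0 ) ) : 0;
foreach ( $outline as list( $depth, $name ) ) {
    echo str_repeat( '  ', $depth - $base ), $name, "\n";
}
```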
Bookmarks: annotate a parent based on its children
Bookmarks are the one escape from forward-only scanning. Save a position, scan ahead, decide what to do, then seek() back and rewrite the earlier tag.
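A sketch of the save, scan, seek pattern using the tag processor. It assumes flat lists that are not nested; the bookmark name and the data-item-count attribute are made up for illustration.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

$html = '<ul><li>A</li><li>B</li><li>C</li></ul><ul><li>X</li></ul>';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'ul' ) ) {
    // Remember where the list opened.
    $tags->set_bookmark( 'list' );

    // Scan ahead, counting items until this list closes.
    $count = 0;
    while ( $tags->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
        if ( 'UL' === $tags->get_tag() && $tags->is_tag_closer() ) {
            break;
        }
        if ( 'LI' === $tags->get_tag() && ! $tags->is_tag_closer() ) {
            ++$count;
        }
    }

    // Jump back to the opening tag and annotate it with what was found.
    $tags->seek( 'list' );
    $tags->set_attribute( 'data-item-count', (string) $count );
    $tags->release_bookmark( 'list' );
}

$output = $tags->get_updated_html();
echo $output, "\n";
```

After seek() the cursor sits on the bookmarked opening tag, so the outer loop continues with the next list rather than revisiting the same one.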
When to use which
| Use | For |
|---|---|
| WP_HTML_Tag_Processor | Attribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context. |
| WP_HTML_Processor::create_fragment() | Queries by ancestry (breadcrumbs), heading outline extraction, anything that needs to know "is this tag inside that one." |
| WP_HTML_Decoder::decode_text_node() | Turning entity-encoded text (AT&amp;T) back into raw text (AT&T) correctly. Implements the HTML5 character reference algorithm. |
| WP_HTML_Decoder::attribute_starts_with() | Safe URL-prefix checks that decode HTML character references while comparing, so j&#x61;vascript: (where &#x61; is the letter a) is correctly recognized as starting with javascript:. The classic strpos approach misses these. |
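The last row can be illustrated with a short sketch comparing a raw strpos check against the decoder. The sample attribute value is made up, and the third argument requests ASCII case-insensitive matching.

```php
<?php
// Assumption: outside WordPress, Composer's autoloader provides the WP_HTML_* classes.
if ( ! class_exists( 'WP_HTML_Decoder' ) ) {
    require __DIR__ . '/vendor/autoload.php';
}

// Raw attribute value as it appears in the HTML source.
$raw_href = 'j&#x61;vascript:alert(1)';

// A byte comparison sees "j&#x61;v..." and misses the scheme.
$naive = 0 === strpos( $raw_href, 'javascript:' );

// The decoder resolves character references while it compares.
$decoded = WP_HTML_Decoder::attribute_starts_with( $raw_href, 'javascript:', 'ascii-case-insensitive' );

// An ordinary URL is not flagged.
$safe = WP_HTML_Decoder::attribute_starts_with( 'https://example.com/', 'javascript:', 'ascii-case-insensitive' );

var_dump( $naive, $decoded, $safe );
```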