HTML

A pure-PHP HTML5 parser and tag rewriter mirroring WordPress core's HTML API. Treat HTML the way browsers do — without libxml2, DOMDocument, or regex hacks — and rewrite attributes in a single linear pass.

When the native API extension is loaded, tag and fragment processors can use native delegates by default while preserving the PHP fallback behavior. Native delegation extends to fragment processors, including the covered table, list, description-list, select/option/optgroup, omitted-paragraph, and ruby tree-builder cases. Full-document parsing through WP_HTML_Processor::create_full_parser() remains PHP-backed for now. To force the pure PHP fallback, define WP_NATIVE_APIS_DISABLE_DEFAULTS before loading the component.

composer require wp-php-toolkit/html

Ported from WordPress core: the HTML component is a port of WP_HTML_Tag_Processor and WP_HTML_Processor from the WordPress/wordpress-develop repository. Bug fixes flow in both directions.

WordPress runs HTML fragments through filters every time a request renders: post content, block markup, comments, excerpts, widgets, feeds, imported documents. Those fragments can omit <html> and <body>, close tags implicitly, or mix browser-correct markup with author mistakes that DOMDocument and regular expressions do not model well.

The HTML component gives WordPress-style code the same parsing model WordPress core uses: a browser-compatible tokenizer and tree-aware processor that run in pure PHP. Choose it for exact-byte rewrites, imperfect fragments, and post-content filters where a full DOM would do too much work.

The component gives you two processors. WP_HTML_Tag_Processor is a forward-only cursor over tags and tokens — useful for attribute rewriting at scale. WP_HTML_Processor layers HTML5 tree construction on top so you can query by ancestry (breadcrumbs), serialize the parsed document, and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.

Add loading="lazy" to every image

The "hello world" of tag rewriting. One linear pass, no DOM, no reserialization cost beyond the bytes you actually changed.
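A minimal sketch of that pass follows. It assumes the component was installed with Composer and that vendor/autoload.php loads its classes; the sample markup and file names are invented for illustration.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

// Hypothetical sample markup.
$html = '<p>Intro</p><img src="hero.jpg"><img src="photo.jpg" alt="A photo">';

$tags = new WP_HTML_Tag_Processor( $html );

// Visit every <img> opening tag in document order.
while ( $tags->next_tag( 'img' ) ) {
	$tags->set_attribute( 'loading', 'lazy' );
}

// Serialize once at the end; markup outside the edited tags is copied as-is.
$updated = $tags->get_updated_html();
echo $updated, "\n";
```

The loop never builds a tree: it moves a cursor from tag to tag and queues one attribute edit per match.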

Try this: mark only the hero image as loading="eager" by guarding the set_attribute() call with $tags->get_attribute( 'src' ) === 'hero.jpg'. get_updated_html() then rewrites only the bytes of the tags you actually changed.

Rewrite relative links to absolute URLs

Use this before sending post content to an RSS feed, an email template, or a CDN-backed copy of a site. The processor rewrites only the changed bytes, so untouched markup stays byte-identical.
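One way this could look, as a sketch rather than a complete URL resolver: it only handles root-relative paths (those starting with a single slash) and leaves everything else alone. The base URL, the sample markup, and the Composer autoloader path are assumptions.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

$base = 'https://example.com'; // Hypothetical site origin.
$html = '<a href="/about/">About</a> <a href="https://wordpress.org/">WP</a> <img src="/uploads/cat.png">';

$tags = new WP_HTML_Tag_Processor( $html );

// No tag-name filter: inspect every opening tag for URL-bearing attributes.
while ( $tags->next_tag() ) {
	foreach ( array( 'href', 'src' ) as $name ) {
		$value = $tags->get_attribute( $name );

		// get_attribute() returns null when missing and true for boolean attributes.
		if ( ! is_string( $value ) ) {
			continue;
		}

		// Root-relative only: starts with "/" but not with "//" (protocol-relative).
		if ( 0 === strpos( $value, '/' ) && 0 !== strpos( $value, '//' ) ) {
			$tags->set_attribute( $name, $base . $value );
		}
	}
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```

Document-relative paths such as `../img/a.png` would need real URL resolution against the page's own address, which this sketch deliberately skips.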

Strip every script and inline event handler

A common sanitization step: neutralize untrusted HTML before display. Blank a script's body with set_modifiable_text() and strip every on* attribute via get_attribute_names_with_prefix().
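The two calls named above can be combined roughly as follows. Treat it as a sketch of this one step, not a complete sanitizer: it does nothing about javascript: URLs, for instance. The sample markup and the Composer autoloader path are assumptions.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

// Hypothetical untrusted input.
$html = '<p onclick="steal()">Hi</p><script>alert(1)</script><img src="a.png" onerror="x()">';

$tags = new WP_HTML_Tag_Processor( $html );

while ( $tags->next_tag() ) {
	// Empty the script body; the tag itself stays in place.
	if ( 'SCRIPT' === $tags->get_tag() ) {
		$tags->set_modifiable_text( '' );
	}

	// Collect every attribute whose name starts with "on", then remove each one.
	$handlers = $tags->get_attribute_names_with_prefix( 'on' );
	foreach ( $handlers ? $handlers : array() as $name ) {
		$tags->remove_attribute( $name );
	}
}

$clean = $tags->get_updated_html();
echo $clean, "\n";
```

Removing an attribute can leave a stray space inside the tag, which is harmless to browsers.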

Stamp a CSP nonce on inline scripts and styles

A nonce-based Content Security Policy requires every inline <script> and <style> that should run to carry a matching nonce attribute. Tag-by-tag rewriting is exactly the right granularity.
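A sketch of the stamping pass, under some assumptions: the nonce is generated here with random_bytes() for illustration (a real application would reuse the value it sends in its Content-Security-Policy header), scripts with a src attribute are treated as external and skipped, and Composer's autoloader is assumed to load the classes.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

// Illustrative nonce; it must match the one in the CSP response header.
$nonce = base64_encode( random_bytes( 16 ) );

// Hypothetical sample markup.
$html = '<script>boot()</script><style>p{color:red}</style><script src="/app.js"></script>';

$tags = new WP_HTML_Tag_Processor( $html );

while ( $tags->next_tag() ) {
	$tag = $tags->get_tag();

	// An inline script is a SCRIPT element without a src attribute.
	$is_inline_script = 'SCRIPT' === $tag && null === $tags->get_attribute( 'src' );

	if ( $is_inline_script || 'STYLE' === $tag ) {
		$tags->set_attribute( 'nonce', $nonce );
	}
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```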

Build a srcset from a single src

Generate responsive image markup at render time without touching the editor data model. Read the existing src, derive a srcset with width descriptors, add a sizes hint.
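The read, derive, add sequence might be sketched like this. The `?w=` resizing query parameter, the width list, and the sizes value are all hypothetical; substitute whatever your image service expects. Composer's autoloader is assumed to load the classes.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

$widths = array( 480, 960, 1440 ); // Hypothetical breakpoints.
$html   = '<img src="/uploads/cat.jpg" alt="Cat">';

$tags = new WP_HTML_Tag_Processor( $html );

while ( $tags->next_tag( 'img' ) ) {
	$src = $tags->get_attribute( 'src' );

	// Skip images without a usable src, or with a srcset already in place.
	if ( ! is_string( $src ) || null !== $tags->get_attribute( 'srcset' ) ) {
		continue;
	}

	// Build "URL WIDTHw" candidates; "?w=" is an assumed resizing parameter.
	$candidates = array();
	foreach ( $widths as $width ) {
		$candidates[] = sprintf( '%s?w=%d %dw', $src, $width, $width );
	}

	$tags->set_attribute( 'srcset', implode( ', ', $candidates ) );
	$tags->set_attribute( 'sizes', '(max-width: 960px) 100vw, 960px' );
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```

The original src stays in place as the fallback for clients that ignore srcset.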

Decode HTML entities the way the spec demands

The HTML5 entity table has roughly 2,200 named references and a long list of edge cases. WP_HTML_Decoder implements the algorithm — don't roll your own.
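A short sketch of the decoder in use, assuming Composer's autoloader loads the class. The second half illustrates one of those edge cases: a legacy reference without its semicolon (`&not`) is decoded in text, but left alone in an attribute value when a letter follows it.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

// Named references decode to their UTF-8 characters.
$text = WP_HTML_Decoder::decode_text_node( 'AT&amp;T &copy; 2024 caf&eacute;' );

// The same input decodes differently depending on where it appears.
$in_text      = WP_HTML_Decoder::decode_text_node( '&notit;' );
$in_attribute = WP_HTML_Decoder::decode_attribute( '&notit;' );

echo $text, "\n";
echo $in_text, "\n";
echo $in_attribute, "\n";
```

A generic entity-decoding function has no notion of text versus attribute context, which is one reason the dedicated decoder exists.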

Find images by ancestry with breadcrumbs

The full WP_HTML_Processor understands HTML5 tree construction, so you can ask "find every <img> directly inside a <figure>" without writing your own DOM walker.
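A sketch of such a query, with invented sample markup and the Composer autoloader path as an assumption. The breadcrumbs array lists ancestors from outermost to innermost, ending with the target element.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

// Hypothetical markup: only the first image is a direct child of a FIGURE.
$html = '<figure><img src="a.jpg"><figcaption>A</figcaption></figure>'
	. '<p><img src="b.jpg"></p>'
	. '<figure><a href="#"><img src="c.jpg"></a></figure>';

$found     = array();
$processor = WP_HTML_Processor::create_fragment( $html );

// create_fragment() can return null, so guard before using the processor.
if ( null !== $processor ) {
	// Match IMG elements whose immediate parent is a FIGURE.
	while ( $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
		$found[] = $processor->get_attribute( 'src' );
	}
}

print_r( $found );
```

The image nested inside the link does not match because an A element sits between it and the FIGURE.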

Outline a document by walking tokens with depth

The full processor exposes get_current_depth() and get_breadcrumbs(). Combine with next_token() to print a structural outline.
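As a sketch, the outline below indents each opening tag by its depth relative to the first tag seen. A fragment parser starts inside an implied HTML and BODY context, so absolute depths do not start at zero. The sample markup and the Composer autoloader path are assumptions.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

$html      = '<h1>Title</h1><ul><li>One</li><li>Two <em>!</em></li></ul>';
$processor = WP_HTML_Processor::create_fragment( $html );

$lines = array();
$base  = null; // Depth of the first opening tag, used as indentation zero.

while ( null !== $processor && $processor->next_token() ) {
	// Only opening tags contribute to the outline; skip text and closers.
	if ( '#tag' !== $processor->get_token_type() || $processor->is_tag_closer() ) {
		continue;
	}

	$depth = $processor->get_current_depth();
	if ( null === $base ) {
		$base = $depth;
	}

	$indent  = str_repeat( '  ', max( 0, $depth - $base ) );
	$lines[] = $indent . $processor->get_tag();
}

echo implode( "\n", $lines ), "\n";
```

Calling get_breadcrumbs() at the same point would give the full ancestor path instead of just its length.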

Bookmarks: annotate a parent based on its children

Bookmarks are the one escape from forward-only scanning. Save a position, scan ahead, decide what to do, then seek() back and rewrite the earlier tag.
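The save, scan, seek sequence could be sketched as follows: each figure that contains an image gets a has-image class. The class name, the bookmark name, and the sample markup are invented, nested figures are not handled, and Composer's autoloader is assumed to load the classes.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

$html = '<figure><img src="a.jpg"></figure><figure><p>No image</p></figure>';
$tags = new WP_HTML_Tag_Processor( $html );

while ( $tags->next_tag( 'figure' ) ) {
	// Remember where this figure opens.
	$tags->set_bookmark( 'parent' );

	// Scan ahead until the figure closes, noting any image on the way.
	$has_image = false;
	while ( $tags->next_token() ) {
		if ( '#tag' !== $tags->get_token_type() ) {
			continue;
		}
		if ( 'FIGURE' === $tags->get_tag() && $tags->is_tag_closer() ) {
			break;
		}
		if ( 'IMG' === $tags->get_tag() ) {
			$has_image = true;
		}
	}

	// Jump back to the opening tag and annotate it if needed.
	$tags->seek( 'parent' );
	if ( $has_image ) {
		$tags->add_class( 'has-image' );
	}
	$tags->release_bookmark( 'parent' );
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```

After the seek, the outer loop resumes from the figure's opening tag, so the next call to next_tag( 'figure' ) moves on to the following figure.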

When to use which

Use	For
`WP_HTML_Tag_Processor`	Attribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context.
`WP_HTML_Processor::create_fragment()`	Queries by ancestry (`breadcrumbs`), heading outline extraction, anything that needs to know "is this tag inside that one."
`WP_HTML_Decoder::decode_text_node()`	Turning entity-encoded text (`AT&amp;T`) back into raw text correctly. Implements the HTML5 entity algorithm, so don't roll your own.
`WP_HTML_Decoder::attribute_starts_with()`	Safe URL-prefix checks that decode HTML character references while comparing, so `j&#x61;vascript:` (where `&#x61;` is the letter `a`) is correctly recognized as starting with `javascript:`. The classic `strpos` approach misses these.
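To make the last row concrete, here is a sketch comparing a naive prefix check with the decoding one. The raw attribute value is an invented example in which the letter a is written as a character reference; Composer's autoloader is assumed to load the class.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

// Raw attribute value as it appears in the HTML source.
$raw = 'j&#x61;vascript:alert(1)';

// A plain string comparison sees the undecoded bytes and misses the prefix.
$naive = 0 === strpos( $raw, 'javascript:' );

// The decoder compares against the decoded value, ignoring ASCII case.
$safe = WP_HTML_Decoder::attribute_starts_with( $raw, 'javascript:', 'ascii-case-insensitive' );

var_dump( $naive, $safe );
```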

Pitfalls

Mutations are buffered. Nothing changes in the source string until you call get_updated_html(). If you read get_attribute() after a set_attribute() on the same tag, you see the new value — but downstream tooling reading the original string sees stale HTML until you serialize.
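A small sketch of that behavior, with invented markup and Composer's autoloader assumed: the processor reports the pending value, while the original string stays untouched until you serialize.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

$html = '<img src="a.png">';
$tags = new WP_HTML_Tag_Processor( $html );

$tags->next_tag( 'img' );
$tags->set_attribute( 'alt', 'Logo' );

// The processor already reports the queued value...
$pending = $tags->get_attribute( 'alt' );

// ...but $html is still the original string. Only the serialized copy changes.
$updated = $tags->get_updated_html();

var_dump( $pending, $html, $updated );
```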

next_tag() only stops on opening tags. Closers and text are skipped, so a guard like ! $tags->is_tag_closer() inside a next_tag() loop is harmless but never fires. If you need to visit closing tags or text nodes, use next_token() instead and check get_token_type().
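A sketch of a next_token() walk that records openers, closers, and text in order. The sample markup is invented and Composer's autoloader is assumed to load the class.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

$tags = new WP_HTML_Tag_Processor( '<p>Hello <b>world</b></p>' );
$seen = array();

while ( $tags->next_token() ) {
	switch ( $tags->get_token_type() ) {
		case '#tag':
			// Prefix closing tags with a slash to tell them apart from openers.
			$seen[] = ( $tags->is_tag_closer() ? '/' : '' ) . $tags->get_tag();
			break;

		case '#text':
			$seen[] = '"' . $tags->get_modifiable_text() . '"';
			break;
	}
}

echo implode( ', ', $seen ), "\n";
```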

Tag-name matches are uppercase. get_tag() always returns the tag name in uppercase ('IMG', not 'img'). Compare accordingly. The filter argument to next_tag() is case-insensitive in either direction.
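For instance, in this sketch (invented markup, Composer's autoloader assumed) a lowercase filter finds a mixed-case tag, yet the name comes back uppercase.

```php
<?php
// Assumption: Composer's autoloader makes the component's classes available.
require __DIR__ . '/vendor/autoload.php';

$tags = new WP_HTML_Tag_Processor( '<Img src="a.png">' );

// The filter matches regardless of letter case.
$found = $tags->next_tag( 'img' );

// The reported name is always uppercase, so compare against 'IMG'.
$name = $tags->get_tag();

var_dump( $found, $name, 'img' === $name, 'IMG' === $name );
```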

Don't confuse WP_HTML_Tag_Processor with the full processor. The cursor is forward-only and ancestry-blind, and it doesn't expose get_breadcrumbs() at all — calling that on a WP_HTML_Tag_Processor raises a Call to undefined method error. Breadcrumbs and HTML5 tree construction (implicit <tbody> insertion, automatic <p> closing, and the rest) live only on WP_HTML_Processor.
