PHP Toolkit

HTML

A pure-PHP HTML5 parser and tag rewriter mirroring WordPress core's HTML API. Treat HTML the way browsers do — without libxml2, DOMDocument, or regex hacks — and rewrite attributes in a single linear pass.

When the native API extension is loaded, tag and fragment processors can use native delegates by default while preserving PHP fallback behavior. For fragment processors this includes the covered table, list, description-list, select/option/optgroup, omitted-paragraph, and ruby tree-builder cases. Full-document parsing through WP_HTML_Processor::create_full_parser() remains PHP-backed for now. To force the pure PHP fallback, define WP_NATIVE_APIS_DISABLE_DEFAULTS before loading the component.
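As a minimal sketch of forcing the fallback, the constant can be defined before the autoloader runs. The value `true` and the `vendor/autoload.php` path are assumptions; the text above only says the constant must be defined before the component loads.

```php
<?php
// Assumption: the component only checks that the constant is defined;
// `true` is used here as the conventional value.
define( 'WP_NATIVE_APIS_DISABLE_DEFAULTS', true );

// Assumed Composer autoloader path; load the component after the define.
require __DIR__ . '/vendor/autoload.php';
```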

composer require wp-php-toolkit/html

WordPress runs HTML fragments through filters every time a request renders: post content, block markup, comments, excerpts, widgets, feeds, imported documents. Those fragments can omit <html> and <body>, close tags implicitly, or mix browser-correct markup with author mistakes that DOMDocument and regular expressions do not model well.

The HTML component gives WordPress-style code the same parsing model WordPress core uses: a browser-compatible tokenizer and tree-aware processor that run in pure PHP. Choose it for exact-byte rewrites, imperfect fragments, and post-content filters where a full DOM would do too much work.

The component gives you two processors. WP_HTML_Tag_Processor is a forward-only cursor over tags and tokens — useful for attribute rewriting at scale. WP_HTML_Processor layers HTML5 tree construction on top so you can query by ancestry (breadcrumbs), serialize the parsed document, and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
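The implicit paragraph close can be checked with a short sketch. It assumes the Composer autoloader lives at `vendor/autoload.php` and that the classes are available under their WordPress core names.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

$html = '<p>one<p>two';

// Record the ancestry of every P element the tree builder opens.
$processor = WP_HTML_Processor::create_fragment( $html );
$paths     = array();
while ( null !== $processor && $processor->next_tag() ) {
	if ( 'P' === $processor->get_tag() ) {
		$paths[] = implode( ' > ', $processor->get_breadcrumbs() );
	}
}
print_r( $paths );

// serialize() has to run on a processor that has not started scanning,
// so a fresh instance is used to print the normalized markup.
$fresh = WP_HTML_Processor::create_fragment( $html );
if ( null !== $fresh ) {
	echo $fresh->serialize(), "\n";
}
```

Both paragraphs report the same breadcrumbs, which shows the second `<p>` is a sibling of the first rather than a child.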

Add loading="lazy" to every image

The "hello world" of tag rewriting. One linear pass, no DOM, no reserialization cost beyond the bytes you actually changed.
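A minimal sketch of that pass follows. The sample markup and the `vendor/autoload.php` path are assumptions for illustration.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

// Hypothetical post content with two images.
$html = '<p>Intro</p><img src="hero.jpg" alt="Hero"><img src="team.jpg" loading="eager">';

$tags = new WP_HTML_Tag_Processor( $html );

// Visit every IMG opening tag in document order.
while ( $tags->next_tag( 'img' ) ) {
	// Adds the attribute when missing, overwrites it when present.
	$tags->set_attribute( 'loading', 'lazy' );
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```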

Try this: change 'lazy' to 'eager' on the first image only by guarding the call with $tags->get_attribute( 'src' ) === 'hero.jpg'. Then check that get_updated_html() only rewrites the bytes for that one tag.

Use this before sending post content to an RSS feed, an email template, or a CDN-backed copy of a site. The processor rewrites only the changed bytes, so untouched markup stays byte-identical.

Strip every script and inline event handler

A common sanitization step: neutralize untrusted HTML before display. Blank a script's body with set_modifiable_text(), then pass each name returned by get_attribute_names_with_prefix( 'on' ) to remove_attribute() to strip every on* attribute.
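The two steps can be combined in one loop, sketched below with made-up input and an assumed `vendor/autoload.php` path. The `on` prefix also matches any non-handler attribute whose name happens to start with those letters, and the sketch does not address other vectors such as javascript: URLs.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

// Hypothetical untrusted input.
$untrusted = '<p onclick="steal()">Hi</p><script>alert(1)</script><img src="a.png" onerror="x()">';

$tags = new WP_HTML_Tag_Processor( $untrusted );
while ( $tags->next_tag() ) {
	// Blank the script body; the empty SCRIPT element itself stays in place.
	if ( 'SCRIPT' === $tags->get_tag() ) {
		$tags->set_modifiable_text( '' );
	}

	// Remove every attribute whose name starts with "on".
	foreach ( $tags->get_attribute_names_with_prefix( 'on' ) ?? array() as $name ) {
		$tags->remove_attribute( $name );
	}
}

$clean = $tags->get_updated_html();
echo $clean, "\n";
```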

Stamp a CSP nonce on inline scripts and styles

Content Security Policy in nonce- mode requires every inline <script> and <style> to carry a matching nonce attribute. Tag-by-tag is exactly the right granularity.
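One way to sketch this treats a script as inline when it has no src attribute. The nonce generation, the sample markup, and the `vendor/autoload.php` path are assumptions; in practice the same nonce must also appear in the Content-Security-Policy header.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

// Hypothetical per-request nonce; hex output avoids characters that need escaping.
$nonce = bin2hex( random_bytes( 16 ) );

$html = '<script>init()</script><script src="/app.js"></script><style>p{margin:0}</style>';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag() ) {
	$name      = $tags->get_tag();
	$is_inline = 'SCRIPT' === $name && null === $tags->get_attribute( 'src' );

	// Stamp inline scripts and every STYLE element; skip external scripts.
	if ( $is_inline || 'STYLE' === $name ) {
		$tags->set_attribute( 'nonce', $nonce );
	}
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```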

Build a srcset from a single src

Generate responsive image markup at render time without touching the editor data model. Read the existing src, derive a srcset with width descriptors, add a sizes hint.
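A sketch of the read-derive-write sequence follows. The `?w=` resizing parameter, the width list, and the sizes value are hypothetical, and the code assumes the src has no existing query string.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

$widths = array( 480, 960, 1440 ); // Hypothetical breakpoints.

$tags = new WP_HTML_Tag_Processor( '<img src="/uploads/photo.jpg" alt="Photo">' );
while ( $tags->next_tag( 'img' ) ) {
	$src = $tags->get_attribute( 'src' );

	// Skip images without a usable src or with an author-provided srcset.
	if ( ! is_string( $src ) || null !== $tags->get_attribute( 'srcset' ) ) {
		continue;
	}

	// Build "URL WIDTHw" candidates; "?w=" is a made-up resizing parameter.
	$candidates = array();
	foreach ( $widths as $width ) {
		$candidates[] = "{$src}?w={$width} {$width}w";
	}

	$tags->set_attribute( 'srcset', implode( ', ', $candidates ) );
	$tags->set_attribute( 'sizes', '(max-width: 960px) 100vw, 960px' );
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```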

Decode HTML entities the way the spec demands

The HTML5 entity table has roughly 2,200 named references and a long list of edge cases. WP_HTML_Decoder implements the algorithm — don't roll your own.
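A short sketch of the decoder on text and attribute values, assuming the Composer autoloader is at `vendor/autoload.php`:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

// Text nodes and attribute values follow slightly different decoding rules,
// so the decoder exposes one entry point for each context.
$text  = WP_HTML_Decoder::decode_text_node( 'AT&amp;T serves caf&eacute;s' );
$title = WP_HTML_Decoder::decode_attribute( 'Tom &amp; Jerry' );

echo $text, "\n";
echo $title, "\n";
```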

Find images by ancestry with breadcrumbs

The full WP_HTML_Processor understands HTML5 tree construction, so you can ask "find every <img> directly inside a <figure>" without writing your own DOM walker.
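The query above can be sketched with a breadcrumbs filter. The sample markup and the `vendor/autoload.php` path are assumptions.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

$html = '<figure><img src="a.jpg"><figcaption>A</figcaption></figure><p><img src="b.jpg"></p>';

$processor = WP_HTML_Processor::create_fragment( $html );
$found     = array();

// Match only IMG elements whose direct parent is a FIGURE.
$query = array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) );
while ( null !== $processor && $processor->next_tag( $query ) ) {
	$found[] = $processor->get_attribute( 'src' );
}

print_r( $found );
```

The image inside the paragraph is skipped because its parent is a P element, not a FIGURE.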

Outline a document by walking tokens with depth

The full processor exposes get_current_depth() and get_breadcrumbs(). Combine with next_token() to print a structural outline.
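A minimal outline printer might look like the following. It indents relative to the first tag it sees, so it does not depend on how many context elements the fragment parser places above the content. The sample markup and autoloader path are assumptions.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

$processor = WP_HTML_Processor::create_fragment( '<ul><li>One<li>Two <em>x</em></ul>' );
$lines     = array();
$base      = null;

while ( null !== $processor && $processor->next_token() ) {
	// Keep opening tags only; skip text, comments, and closers.
	if ( '#tag' !== $processor->get_token_type() || $processor->is_tag_closer() ) {
		continue;
	}

	$depth   = $processor->get_current_depth();
	$base    = $base ?? $depth; // Depth of the first tag becomes indent level 0.
	$lines[] = str_repeat( '  ', max( 0, $depth - $base ) ) . $processor->get_tag();
}

echo implode( "\n", $lines ), "\n";
```

Both LI elements land at the same indent because the tree builder closes the first one when the second opens.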

Bookmarks: annotate a parent based on its children

Bookmarks are the one escape from forward-only scanning. Save a position, scan ahead, decide what to do, then seek() back and rewrite the earlier tag.
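The save, scan, seek sequence can be sketched as follows. This version adds a has-image class to each FIGURE that contains an IMG. The class name, the bookmark names, and the sample markup are hypothetical, and nested figures are not handled.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

$html = '<figure><img src="a.jpg"></figure><figure><p>text</p></figure>';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'figure' ) ) {
	$tags->set_bookmark( 'start' ); // Remember the FIGURE opener.

	// Scan ahead to the matching closer, noting any IMG on the way.
	$has_image = false;
	$closed    = false;
	while ( $tags->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
		if ( 'FIGURE' === $tags->get_tag() && $tags->is_tag_closer() ) {
			$closed = true;
			break;
		}
		if ( 'IMG' === $tags->get_tag() ) {
			$has_image = true;
		}
	}
	if ( ! $closed ) {
		break; // Unclosed FIGURE: nothing left to scan.
	}

	// Jump back, rewrite the opener, then return to where scanning stopped.
	if ( $has_image && $tags->set_bookmark( 'end' ) ) {
		$tags->seek( 'start' );
		$tags->add_class( 'has-image' );
		$tags->seek( 'end' );
		$tags->release_bookmark( 'end' );
	}
	$tags->release_bookmark( 'start' );
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```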

When to use which

| Use | For |
| --- | --- |
| WP_HTML_Tag_Processor | Attribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context. |
| WP_HTML_Processor::create_fragment() | Queries by ancestry (breadcrumbs), heading outline extraction, anything that needs to know "is this tag inside that one." |
| WP_HTML_Decoder::decode_text_node() | Turning entity-encoded text (AT&amp;T) back into raw text correctly. Implements the HTML5 entity algorithm, so there is no need to roll your own. |
| WP_HTML_Decoder::attribute_starts_with() | Safe URL-prefix checks that decode HTML character references while comparing, so j&#x61;vascript: (where &#x61; is the letter a) is correctly recognized as starting with javascript:. The classic strpos approach misses these. |
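The difference between the two prefix checks can be seen in a short sketch. The href value is the raw attribute text as it would appear in the markup, and the autoloader path is an assumption.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Assumed Composer autoloader path.

// Raw, still-encoded attribute value; &#x61; is the letter "a".
$href = 'j&#x61;vascript:alert(1)';

// A plain string comparison sees "j&#x61;v..." and misses the scheme.
$naive = 0 === strpos( $href, 'javascript:' );

// The decoder resolves character references while it compares.
$decoded = WP_HTML_Decoder::attribute_starts_with( $href, 'javascript:', 'ascii-case-insensitive' );

var_dump( $naive, $decoded );
```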

Pitfalls

See also