HTML
A pure-PHP HTML5 parser and tag rewriter mirroring WordPress core's HTML API. Treat HTML the way browsers do — without libxml2, DOMDocument, or regex hacks — and rewrite attributes in a single linear pass.
composer require wp-php-toolkit/html
WordPress runs HTML fragments through filters every time a request renders: post content, block markup, comments, excerpts, widgets, feeds, imported documents. Those fragments can omit <html> and <body>, close tags implicitly, or mix browser-correct markup with author mistakes that DOMDocument and regular expressions do not model well.
The HTML component gives WordPress-style code the same parsing model WordPress core uses: a browser-compatible tokenizer and tree-aware processor that run in pure PHP. Choose it for exact-byte rewrites, imperfect fragments, and post-content filters where a full DOM would do too much work.
The component gives you two processors. WP_HTML_Tag_Processor is a forward-only cursor over tags and tokens — useful for attribute rewriting at scale. WP_HTML_Processor layers HTML5 tree construction on top so you can query by ancestry (breadcrumbs), serialize the parsed document, and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
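A minimal sketch of that last claim, assuming the same autoload path the examples on this page use. It asks the full processor where each paragraph sits in the tree; if the second <p> were nested inside the first, the two paths would differ.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

// Two <p> openers and no closers: a browser closes the first <p> implicitly.
$p = WP_HTML_Processor::create_fragment( '<p>one<p>two' );

$paths = array();
while ( $p->next_tag( array( 'breadcrumbs' => array( 'P' ) ) ) ) {
    // get_breadcrumbs() lists the open elements from the root down to this tag.
    $paths[] = implode( ' > ', $p->get_breadcrumbs() );
}

// Both paragraphs should report the same path, which makes them siblings.
echo implode( "\n", $paths ), "\n";
```

WP_HTML_Tag_Processor would also find both tags, but it has no notion of nesting, so it could not tell you whether they are siblings.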
Add loading="lazy" to every image
The "hello world" of tag rewriting. One linear pass, no DOM, no reserialization cost beyond the bytes you actually changed.
Try this: run the example, then set loading to 'eager' on the first image only by checking $tags->get_attribute( 'src' ) === 'hero.jpg' before calling set_attribute(). Run it again and compare the output: get_updated_html() changes only the tags you touched and leaves every other byte as it was.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<article>
<img src="hero.jpg" alt="Hero">
<p>Intro copy.</p>
<img src="inline.jpg" alt="Inline">
</article>
HTML;
$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    // Don't clobber an explicit eager hint the author already set.
    if ( null === $tags->get_attribute( 'loading' ) ) {
        $tags->set_attribute( 'loading', 'lazy' );
    }
    $tags->set_attribute( 'decoding', 'async' );
}
echo $tags->get_updated_html();
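The variation suggested above might look like the following sketch. The comparison against 'hero.jpg' comes straight from the sample markup; in real content you would choose your own rule for spotting the above-the-fold image.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<article>
<img src="hero.jpg" alt="Hero">
<p>Intro copy.</p>
<img src="inline.jpg" alt="Inline">
</article>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    if ( null !== $tags->get_attribute( 'loading' ) ) {
        continue; // Respect whatever the author already chose.
    }
    // Load the hero image eagerly and defer everything else.
    $is_hero = 'hero.jpg' === $tags->get_attribute( 'src' );
    $tags->set_attribute( 'loading', $is_hero ? 'eager' : 'lazy' );
}

$updated = $tags->get_updated_html();
echo $updated;
```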
Rewrite relative links to absolute URLs
Use this before sending post content to an RSS feed, an email template, or a CDN-backed copy of a site. The processor rewrites only the changed bytes, so untouched markup stays byte-identical.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<p>See <a href="/about">about</a>, <a href="https://example.com/x">x</a>,
and <a href="contact.html">contact</a>.</p>
HTML;
$base = 'https://my-site.test/';
$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'a' ) ) {
    $href = $tags->get_attribute( 'href' );
    if ( null === $href || '' === $href ) {
        continue;
    }
    if ( preg_match( '#^[a-z][a-z0-9+.-]*:#i', $href ) || 0 === strpos( $href, '//' ) || 0 === strpos( $href, '#' ) ) {
        continue;
    }
    $tags->set_attribute( 'href', rtrim( $base, '/' ) . '/' . ltrim( $href, '/' ) );
}
echo $tags->get_updated_html();
Strip every script and inline event handler
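One step in sanitizing untrusted HTML before display: blank each inline script's body with set_modifiable_text() and strip every on* attribute via get_attribute_names_with_prefix(). The emptied <script> tags stay in the output, and this pass does not touch src attributes or javascript: URLs, so treat it as one layer of a sanitizer and not the whole thing.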
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
$untrusted = <<<'HTML'
<p onclick="x()">hi</p>
<script>evil()</script>
<img src="x" onerror="boom()">
HTML;
$tags = new WP_HTML_Tag_Processor( $untrusted );
while ( $tags->next_tag() ) {
    // next_tag() never lands on closing tags, so no is_tag_closer() guard
    // is needed here.
    if ( 'SCRIPT' === $tags->get_tag() ) {
        $tags->set_modifiable_text( '' );
    }
    foreach ( $tags->get_attribute_names_with_prefix( 'on' ) as $attr ) {
        $tags->remove_attribute( $attr );
    }
}
echo $tags->get_updated_html();
Stamp a CSP nonce on inline scripts and styles
A nonce-based Content Security Policy requires every inline <script> and <style> to carry a matching nonce attribute. Rewriting tag by tag is exactly the right granularity.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
// The CSP specification recommends at least 128 bits of randomness, fresh for every response.
$nonce = bin2hex( random_bytes( 16 ) );
$html = <<<'HTML'
<head><style>body{font:16px sans-serif}</style></head>
<body><script>console.log("hi")</script><script src="vendor.js"></script></body>
HTML;
$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag() ) {
    $tag = $tags->get_tag();
    if ( 'SCRIPT' === $tag || 'STYLE' === $tag ) {
        $tags->set_attribute( 'nonce', $nonce );
    }
}
echo "nonce: {$nonce}\n\n";
echo $tags->get_updated_html();
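The nonce only takes effect if the response advertises the same value. Below is a minimal sketch of building the matching header in plain PHP. The helper name build_csp_header() is hypothetical and not part of the toolkit, and the two directives are an assumption you would adapt to your own policy.

```php
<?php
/**
 * Hypothetical helper: builds a Content-Security-Policy value that trusts
 * scripts and styles carrying the given nonce.
 */
function build_csp_header( string $nonce ): string {
    $source = "'nonce-{$nonce}'";
    return "script-src {$source}; style-src {$source}";
}

$nonce  = bin2hex( random_bytes( 16 ) );
$policy = build_csp_header( $nonce );

// Send the header before any output, then stamp the same $nonce on the tags.
if ( ! headers_sent() ) {
    header( 'Content-Security-Policy: ' . $policy );
}
echo $policy, "\n";
```

Generate the nonce once per response and pass the same value to both the header and the tag rewriter; a nonce reused across responses defeats its purpose.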
Build a srcset from a single src
Generate responsive image markup at render time without touching the editor data model. Read the existing src, derive a srcset with width descriptors, add a sizes hint.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
$html = '<figure><img src="https://cdn.test/uploads/photo.jpg" alt="Sunset"></figure>';
$widths = array( 480, 768, 1200 );
$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    $src = $tags->get_attribute( 'src' );
    if ( null === $src || null !== $tags->get_attribute( 'srcset' ) ) {
        continue;
    }
    $variants = array();
    foreach ( $widths as $w ) {
        $variants[] = $src . '?w=' . $w . ' ' . $w . 'w';
    }
    $tags->set_attribute( 'srcset', implode( ', ', $variants ) );
    $tags->set_attribute( 'sizes', '(max-width: 768px) 100vw, 768px' );
}
echo $tags->get_updated_html();
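The example above appends ?w= directly, which only works while src has no query string of its own. Here is a sketch of the same derivation as a standalone function that copes with both cases. The name build_srcset() is hypothetical, and the w resize parameter is the same assumption about the image host that the example makes.

```php
<?php
/**
 * Hypothetical helper: derives a srcset value from a single image URL.
 * Assumes the image host resizes on a `w` query parameter.
 */
function build_srcset( string $src, array $widths ): string {
    // Use "&" when the URL already carries a query string, "?" otherwise.
    $separator = false === strpos( $src, '?' ) ? '?' : '&';
    $variants  = array();
    foreach ( $widths as $width ) {
        $variants[] = "{$src}{$separator}w={$width} {$width}w";
    }
    return implode( ', ', $variants );
}

echo build_srcset( 'https://cdn.test/uploads/photo.jpg', array( 480, 768 ) ), "\n";
echo build_srcset( 'https://cdn.test/uploads/photo.jpg?v=2', array( 480 ) ), "\n";
```

The returned string can be passed to set_attribute( 'srcset', ... ) unchanged. URLs that contain a fragment are not handled by this sketch.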
Decode HTML entities the way the spec demands
The HTML5 entity table has roughly 2,200 named references and a long list of edge cases. WP_HTML_Decoder implements the algorithm — don't roll your own.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
echo "attribute: " . WP_HTML_Decoder::decode_attribute( 'path?a=1&b=2&copy' ) . "\n";
echo "text: " . WP_HTML_Decoder::decode_text_node( 'AT&amp;T &mdash; 100% &#x1F600;' ) . "\n";

// Safe URL prefix check that decodes character references while comparing.
// `&#x6A;` is the letter `j`, so this string really does start with javascript:.
// strpos() would miss it.
$is_javascript = WP_HTML_Decoder::attribute_starts_with(
    '&#x6A;avascript:alert(1)',
    'javascript:',
    'ascii-case-insensitive'
);
var_dump( $is_javascript );
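One of those edge cases is that the same input decodes differently depending on where it appears, which is why the decoder has separate methods for attributes and text nodes. The sketch below assumes the decoder follows the HTML specification here: inside an attribute value, a named reference that lacks its semicolon and is followed by = or an alphanumeric character is left alone so that query strings survive, while in a text node the same reference is decoded.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

// "&copy" has no trailing semicolon and is followed by "=".
$raw = '?a=1&copy=2';

$as_attribute = WP_HTML_Decoder::decode_attribute( $raw );
$as_text      = WP_HTML_Decoder::decode_text_node( $raw );

echo "attribute: {$as_attribute}\n"; // Expected to keep the literal "&copy".
echo "text:      {$as_text}\n";      // Expected to contain the copyright sign.
```

Picking the method that matches the context the string came from matters more than it first appears: decoding an href as if it were text can silently corrupt a query parameter.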
Find images by ancestry with breadcrumbs
The full WP_HTML_Processor understands HTML5 tree construction, so you can ask "find every <img> directly inside a <figure>" without writing your own DOM walker.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<article>
<figure><img src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>
<p>Body copy <img src="emoji.png" alt=""> mid-paragraph.</p>
<figure><img src="diagram.png" alt="Diagram"></figure>
</article>
HTML;
$p = WP_HTML_Processor::create_fragment( $html );
$figure_images = 0;
while ( $p->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
    $p->add_class( 'figure-image' );
    $figure_images++;
}
echo "found {$figure_images} figure images\n";
echo $p->get_updated_html();
Outline a document by walking tokens with depth
The full processor exposes get_current_depth() and get_breadcrumbs(). Combine with next_token() to print a structural outline.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<section><h1>Title</h1>
<section><h2>Chapter 1</h2><p>Body</p></section>
<section><h2>Chapter 2</h2><p>More body</p></section>
</section>
HTML;
$p = WP_HTML_Processor::create_fragment( $html );
while ( $p->next_token() ) {
    if ( '#tag' !== $p->get_token_type() || $p->is_tag_closer() ) {
        continue;
    }
    $tag = $p->get_tag();
    if ( ! preg_match( '/^H[1-6]$/', $tag ) ) {
        continue;
    }
    $indent = str_repeat( ' ', max( 0, $p->get_current_depth() - 2 ) );
    $text = '';
    while ( $p->next_token() ) {
        if ( '#text' === $p->get_token_type() ) {
            $text .= $p->get_modifiable_text();
            continue;
        }
        if ( '#tag' === $p->get_token_type() && $tag === $p->get_tag() && $p->is_tag_closer() ) {
            break;
        }
    }
    echo "{$indent}{$tag} {$text}\n";
}
Bookmarks: annotate a parent based on its children
Bookmarks are the one escape from forward-only scanning. Save a position, scan ahead, decide what to do, then seek() back and rewrite the earlier tag.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<ul>
<li><input type="checkbox" checked> Buy milk</li>
<li><input type="checkbox"> Walk the dog</li>
<li><input type="checkbox" checked> Read book</li>
</ul>
HTML;
$tags = new WP_HTML_Tag_Processor( $html );
$tags->next_tag( 'ul' );
$tags->set_bookmark( 'list' );
$total = 0;
$done = 0;
while ( $tags->next_tag( 'input' ) ) {
    $total++;
    if ( null !== $tags->get_attribute( 'checked' ) ) {
        $done++;
    }
}
$tags->seek( 'list' );
$tags->set_attribute( 'data-progress', $done . '/' . $total );
$tags->release_bookmark( 'list' );
echo $tags->get_updated_html();
When to use which
| Use | For |
|---|---|
| WP_HTML_Tag_Processor | Attribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context. |
| WP_HTML_Processor::create_fragment() | Queries by ancestry (breadcrumbs), heading outline extraction, anything that needs to know "is this tag inside that one." |
| WP_HTML_Decoder::decode_text_node() | Turning entity-encoded text (`AT&amp;T`) back into raw text correctly. Implements the HTML5 entity algorithm, so there is no need to roll your own. |
| WP_HTML_Decoder::attribute_starts_with() | Safe URL-prefix checks that decode HTML character references while comparing, so `j&#x61;vascript:` (where `&#x61;` is the letter a) is correctly recognized as starting with javascript:. A plain strpos() check misses these. |