PHP Toolkit

HTML

A pure-PHP HTML5 parser and tag rewriter mirroring WordPress core's HTML API. Treat HTML the way browsers do — without libxml2, DOMDocument, or regex hacks — and rewrite attributes in a single linear pass.

composer require wp-php-toolkit/html

WordPress runs HTML fragments through filters every time a request renders: post content, block markup, comments, excerpts, widgets, feeds, imported documents. Those fragments can omit <html> and <body>, close tags implicitly, or mix browser-correct markup with author mistakes that DOMDocument and regular expressions do not model well.

The HTML component gives WordPress-style code the same parsing model WordPress core uses: a browser-compatible tokenizer and tree-aware processor that run in pure PHP. Choose it for exact-byte rewrites, imperfect fragments, and post-content filters where a full DOM would do too much work.

The component gives you two processors. WP_HTML_Tag_Processor is a forward-only cursor over tags and tokens — useful for attribute rewriting at scale. WP_HTML_Processor layers HTML5 tree construction on top so you can query by ancestry (breadcrumbs), serialize the parsed document, and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
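As a minimal sketch of that last claim, assuming the toolkit's WP_HTML_Processor behaves like WordPress core's for create_fragment() and get_breadcrumbs(), the following collects the ancestry path of each paragraph. Both entries should show the paragraph as a direct child of BODY, because the second <p> implicitly closes the first.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

// No closing </p> tags: the second <p> closes the first, as in a browser.
$p = WP_HTML_Processor::create_fragment( '<p>one<p>two' );

$paths = array();
while ( $p->next_tag( 'P' ) ) {
	// Breadcrumbs list the open elements from the root down to the current tag.
	$paths[] = implode( ' > ', $p->get_breadcrumbs() );
}

print_r( $paths );
```

A tag-only scan would also see two P tags, but it could not tell you whether the second one is nested inside the first; the breadcrumbs can.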

Add loading="lazy" to every image

The "hello world" of tag rewriting. One linear pass, no DOM, no reserialization cost beyond the bytes you actually changed.

Try this: run the example, then set loading to 'eager' on the first image only by branching on $tags->get_attribute( 'src' ) === 'hero.jpg'. Run it again and notice that get_updated_html() changes nothing outside the two <img> tags.

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<article>
	<img src="hero.jpg" alt="Hero">
	<p>Intro copy.</p>
	<img src="inline.jpg" alt="Inline">
</article>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
	// Don't clobber an explicit eager hint the author already set.
	if ( null === $tags->get_attribute( 'loading' ) ) {
		$tags->set_attribute( 'loading', 'lazy' );
	}
	$tags->set_attribute( 'decoding', 'async' );
}

echo $tags->get_updated_html();

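Rewrite relative links as absolute URLs

Resolve relative href values against the site's base URL before sending post content to an RSS feed, an email template, or a CDN-backed copy of a site, where relative links would otherwise break. The processor rewrites only the changed bytes, so untouched markup stays byte-identical.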

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<p>See <a href="/about">about</a>, <a href="https://example.com/x">x</a>, 
and <a href="contact.html">contact</a>.</p>
HTML;

$base = 'https://my-site.test/';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'a' ) ) {
	$href = $tags->get_attribute( 'href' );
	if ( null === $href || '' === $href ) {
		continue;
	}
	if ( preg_match( '#^[a-z][a-z0-9+.-]*:#i', $href ) || 0 === strpos( $href, '//' ) || 0 === strpos( $href, '#' ) ) {
		continue;
	}
	$tags->set_attribute( 'href', rtrim( $base, '/' ) . '/' . ltrim( $href, '/' ) );
}

echo $tags->get_updated_html();
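The skip rules in that loop are easier to reason about in isolation. The sketch below pulls them into two helpers, should_skip_href() and absolutize_href(), which are hypothetical names introduced for illustration and not part of the toolkit. It needs no library code, so it runs on plain PHP.

```php
<?php
// Hypothetical helper: true when an href must be left exactly as written.
function should_skip_href( string $href ): bool {
	return '' === $href
		|| 1 === preg_match( '#^[a-z][a-z0-9+.-]*:#i', $href ) // has a scheme: https:, mailto:, tel:
		|| 0 === strpos( $href, '//' )                         // protocol-relative URL
		|| 0 === strpos( $href, '#' );                         // same-page fragment
}

// Hypothetical helper: joins a site-root base URL and a relative href.
// Assumption: every relative path is treated as relative to the site root.
function absolutize_href( string $base, string $href ): string {
	return rtrim( $base, '/' ) . '/' . ltrim( $href, '/' );
}

$base    = 'https://my-site.test/';
$samples = array( '/about', 'contact.html', 'https://example.com/x', 'mailto:a@b.test', '//cdn.test/a.js', '#top' );

foreach ( $samples as $href ) {
	$result = should_skip_href( $href ) ? $href : absolutize_href( $base, $href );
	echo str_pad( $href, 24 ) . ' => ' . $result . "\n";
}
```

The join is deliberately simple: it does not resolve ../ segments or paths relative to the current page, so a base URL with a path component would need a real URL resolver.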

Strip every script and inline event handler

A common sanitization step: neutralize untrusted HTML before display. Blank each script's body with set_modifiable_text() (the empty <script> element itself stays in the markup) and strip every on* attribute found by get_attribute_names_with_prefix().

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$untrusted = <<<'HTML'
<p onclick="x()">hi</p>
<script>evil()</script>
<img src="x" onerror="boom()">
HTML;

$tags = new WP_HTML_Tag_Processor( $untrusted );
while ( $tags->next_tag() ) {
	// next_tag() never lands on closing tags, so no is_tag_closer() guard
	// is needed here.
	if ( 'SCRIPT' === $tags->get_tag() ) {
		$tags->set_modifiable_text( '' );
	}
	foreach ( $tags->get_attribute_names_with_prefix( 'on' ) as $attr ) {
		$tags->remove_attribute( $attr );
	}
}

echo $tags->get_updated_html();
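The loop above relies on next_tag() skipping closing tags by default. As a sketch, assuming the toolkit keeps WordPress core's tag_closers query option, this is how a caller could opt in to visiting closers when it needs them:

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$tags = new WP_HTML_Tag_Processor( '<p>hi</p><br>' );

$seen = array();
// 'tag_closers' => 'visit' makes next_tag() stop on closing tags as well.
while ( $tags->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
	$seen[] = ( $tags->is_tag_closer() ? '/' : '' ) . $tags->get_tag();
}

echo implode( ' ', $seen ) . "\n";
```

With closers visited, any code that reads or writes attributes should first check is_tag_closer(), since closing tags carry no attributes.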

Stamp a CSP nonce on inline scripts and styles

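A nonce-based Content Security Policy only runs inline <script> and <style> elements whose nonce attribute matches the value announced in the response header. Stamping that attribute is a tag-by-tag job, which is exactly the granularity the tag processor works at.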

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

// 16 random bytes = 128 bits, the minimum nonce length the CSP spec recommends.
$nonce = bin2hex( random_bytes( 16 ) );

$html = <<<'HTML'
<head><style>body{font:16px sans-serif}</style></head>
<body><script>console.log("hi")</script><script src="vendor.js"></script></body>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag() ) {
	$tag = $tags->get_tag();
	if ( 'SCRIPT' === $tag || 'STYLE' === $tag ) {
		$tags->set_attribute( 'nonce', $nonce );
	}
}

echo "nonce: {$nonce}\n\n";
echo $tags->get_updated_html();
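The attribute is only half of the setup: the response also has to announce the same nonce in its Content-Security-Policy header. The sketch below builds that header string; build_csp_header() is a hypothetical helper, and the two-directive policy is an assumption about what a site would want, not a recommendation from the toolkit.

```php
<?php
// Hypothetical helper: a policy that allows only nonce-bearing scripts and styles.
function build_csp_header( string $nonce ): string {
	return sprintf(
		"Content-Security-Policy: script-src 'nonce-%s'; style-src 'nonce-%s'",
		$nonce,
		$nonce
	);
}

// Generate the nonce once per response and reuse it for both the header and the markup.
$nonce  = bin2hex( random_bytes( 16 ) );
$header = build_csp_header( $nonce );

echo $header . "\n";

// In a real response, send it before any output:
// header( $header );
```

The nonce must be freshly generated for every response; a cached page that replays an old nonce defeats the policy.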

Build a srcset from a single src

Generate responsive image markup at render time without touching the editor data model. Read the existing src, derive a srcset with width descriptors, add a sizes hint.

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = '<figure><img src="https://cdn.test/uploads/photo.jpg" alt="Sunset"></figure>';
$widths = array( 480, 768, 1200 );

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
	$src = $tags->get_attribute( 'src' );
	if ( null === $src || $tags->get_attribute( 'srcset' ) !== null ) {
		continue;
	}
	$variants = array();
	foreach ( $widths as $w ) {
		$variants[] = $src . '?w=' . $w . ' ' . $w . 'w';
	}
	$tags->set_attribute( 'srcset', implode( ', ', $variants ) );
	$tags->set_attribute( 'sizes', '(max-width: 768px) 100vw, 768px' );
}

echo $tags->get_updated_html();
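One edge case the loop above does not cover is a src that already carries a query string, where appending ?w= would produce a malformed URL. A possible way to handle it is sketched below; build_srcset() is a hypothetical helper, and the w query parameter is the same assumption about the image host that the example above makes.

```php
<?php
// Hypothetical helper: builds a srcset value from one URL and a list of widths.
function build_srcset( string $src, array $widths ): string {
	// Append with '&' when the URL already has a query string.
	$separator = false === strpos( $src, '?' ) ? '?' : '&';
	$variants  = array();
	foreach ( $widths as $w ) {
		// Each candidate is "<url> <width>w".
		$variants[] = $src . $separator . 'w=' . $w . ' ' . $w . 'w';
	}
	return implode( ', ', $variants );
}

echo build_srcset( 'https://cdn.test/uploads/photo.jpg', array( 480, 768 ) ) . "\n";
echo build_srcset( 'https://cdn.test/uploads/photo.jpg?v=2', array( 480 ) ) . "\n";
```

The returned string can be passed straight to set_attribute( 'srcset', ... ) inside the loop.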

Decode HTML entities the way the spec demands

The HTML5 entity table has roughly 2,200 named references and a long list of edge cases. WP_HTML_Decoder implements the algorithm — don't roll your own.

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

echo "attribute: " . WP_HTML_Decoder::decode_attribute( 'path?a=1&amp;b=2&amp;copy' ) . "\n";
echo "text:      " . WP_HTML_Decoder::decode_text_node( 'AT&amp;T &mdash; 100&percnt; &#x1F600;' ) . "\n";

// Safe URL prefix check that decodes character references while comparing.
// `&#x6A;` is the letter `j`, so this string really does start with javascript:.
// strpos() would miss it.
$is_javascript = WP_HTML_Decoder::attribute_starts_with(
	'&#x6A;avascript:alert(1)',
	'javascript:',
	'ascii-case-insensitive'
);
var_dump( $is_javascript );
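For contrast, here is what a plain substring check sees on the same input. This small sketch uses only PHP built-ins: the check runs on the encoded bytes, where the letter j never appears, so it reports no match.

```php
<?php
$raw = '&#x6A;avascript:alert(1)';

// Case-insensitive substring search on the raw, still-encoded attribute value.
$naive_match = stripos( $raw, 'javascript:' );

var_dump( $naive_match ); // bool(false): the encoded prefix slips through
```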

Find images by ancestry with breadcrumbs

The full WP_HTML_Processor understands HTML5 tree construction, so you can ask "find every <img> directly inside a <figure>" without writing your own DOM walker.

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<article>
<figure><img src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>
<p>Body copy <img src="emoji.png" alt=""> mid-paragraph.</p>
<figure><img src="diagram.png" alt="Diagram"></figure>
</article>
HTML;

$p = WP_HTML_Processor::create_fragment( $html );
$figure_images = 0;
while ( $p->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
	$p->add_class( 'figure-image' );
	$figure_images++;
}

echo "found {$figure_images} figure images\n";
echo $p->get_updated_html();

Outline a document by walking tokens with depth

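The full processor exposes get_current_depth() and get_breadcrumbs(). Combine them with next_token() to print a structural outline. Because create_fragment() parses inside an implicit HTML > BODY context, the example subtracts 2 from the depth before indenting.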

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<section><h1>Title</h1>
<section><h2>Chapter 1</h2><p>Body</p></section>
<section><h2>Chapter 2</h2><p>More body</p></section>
</section>
HTML;

$p = WP_HTML_Processor::create_fragment( $html );
while ( $p->next_token() ) {
	if ( '#tag' !== $p->get_token_type() || $p->is_tag_closer() ) {
		continue;
	}
	$tag = $p->get_tag();
	if ( ! preg_match( '/^H[1-6]$/', $tag ) ) {
		continue;
	}
	$indent = str_repeat( '  ', max( 0, $p->get_current_depth() - 2 ) );
	$text = '';
	while ( $p->next_token() ) {
		if ( '#text' === $p->get_token_type() ) {
			$text .= $p->get_modifiable_text();
			continue;
		}
		if ( '#tag' === $p->get_token_type() && $tag === $p->get_tag() && $p->is_tag_closer() ) {
			break;
		}
	}
	echo "{$indent}{$tag}  {$text}\n";
}

Bookmarks: annotate a parent based on its children

Bookmarks are the one escape from forward-only scanning. Save a position, scan ahead, decide what to do, then seek() back and rewrite the earlier tag.

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<ul>
<li><input type="checkbox" checked> Buy milk</li>
<li><input type="checkbox"> Walk the dog</li>
<li><input type="checkbox" checked> Read book</li>
</ul>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
$tags->next_tag( 'ul' );
$tags->set_bookmark( 'list' );

$total = 0;
$done = 0;
while ( $tags->next_tag( 'input' ) ) {
	$total++;
	if ( null !== $tags->get_attribute( 'checked' ) ) {
		$done++;
	}
}

$tags->seek( 'list' );
$tags->set_attribute( 'data-progress', $done . '/' . $total );
$tags->release_bookmark( 'list' );

echo $tags->get_updated_html();

When to use which

| Use | For |
| --- | --- |
| WP_HTML_Tag_Processor | Attribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context. |
| WP_HTML_Processor::create_fragment() | Queries by ancestry (breadcrumbs), heading outline extraction, anything that needs to know "is this tag inside that one." |
| WP_HTML_Decoder::decode_text_node() | Turning entity-encoded text (AT&amp;T) back into raw text correctly. Implements the HTML5 entity algorithm, so you don't have to roll your own. |
| WP_HTML_Decoder::attribute_starts_with() | Safe URL-prefix checks that decode HTML character references while comparing, so j&#x61;vascript: (where &#x61; is the letter a) is correctly recognized as starting with javascript:. The classic strpos approach misses these. |

Pitfalls

See also