Rewriting HTML safely

In the quickstart you added a single attribute to a single tag. In this chapter we'll do the work the importer actually needs to do on real-world post content: add lazy loading to every image, rewrite relative links to absolute URLs, and neutralize script tags and inline event handlers that creep into pasted HTML. By the end you'll have a clean_post_html() function the next chapter will plug into the importer.

The input we're cleaning

The importer's job is to read a folder of Markdown posts. Each post has frontmatter, prose, and inline HTML that survived from a previous CMS — a mix of helpful markup, sloppy markup, and the occasional <script> tag someone pasted from Stack Overflow. Here's a representative example:

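A hypothetical post body with these features might look like the following. The markup, file names, and URLs are invented for illustration; only the shape (one <img> without a loading hint, four <a> tags of which two need rewriting, a <script> tag, and an onmouseover handler) follows the text.

```php
<?php
// Hypothetical sample input; every name and URL here is invented for illustration.
$post_html = <<<HTML
<figure>
  <img src="/images/hollandaise.jpg" alt="Hollandaise in a copper pan">
  <figcaption>Whisk constantly. See <a href="/recipes/sauces">more sauces</a>.</figcaption>
</figure>
<p onmouseover="trackHover()">
  Adapted from <a href="https://example.org/classic">a classic recipe</a>.
  Jump to the <a href="#method">method</a> or browse the <a href="techniques/emulsions">technique notes</a>.
</p>
<script>console.log("pasted from a forum");</script>
HTML;
```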

Three things need fixing before this HTML belongs in a database:

The <img> has no loading hint, so it'll fetch eagerly even when it's far below the fold.
Two of the four <a> tags use relative URLs. They were correct on the source site; on the destination site they'll point to nothing.
There's a <script> tag and an onmouseover handler. They have to be neutralized.

Each of the next three sections fixes one of these. They all use the same component — WP_HTML_Tag_Processor — and the same shape: open a processor over the HTML, walk it, ask the cursor to make edits, then call get_updated_html() for the result.

Lazy-load every image

Start with the most ergonomic fix: add loading="lazy" to every image that doesn't already have a loading attribute. The processor's filter argument lets us skip everything that isn't an <img>:
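A minimal sketch of that loop, assuming WordPress 6.2 or later (or the toolkit's copy of the HTML API) is already loaded so that WP_HTML_Tag_Processor exists. The wrapper name add_lazy_loading() is hypothetical; the null guard on loading and the unconditional decoding call follow the behavior this section describes.

```php
<?php
/**
 * Adds loading="lazy" to images that lack a loading attribute and
 * decoding="async" to every image. Hypothetical helper name.
 * Assumes WP_HTML_Tag_Processor has been loaded by WordPress or the toolkit.
 */
function add_lazy_loading( string $html ): string {
    $tags = new WP_HTML_Tag_Processor( $html );

    // Passing a tag name as the query makes next_tag() stop only on <img>.
    while ( $tags->next_tag( 'img' ) ) {
        // get_attribute() returns null when the attribute is absent.
        if ( null === $tags->get_attribute( 'loading' ) ) {
            $tags->set_attribute( 'loading', 'lazy' );
        }
        // Unconditional: every image gets this, whatever its loading value.
        $tags->set_attribute( 'decoding', 'async' );
    }

    return $tags->get_updated_html();
}
```

Calling add_lazy_loading( $post_html ) returns the rewritten string; $post_html itself is not modified.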

Notice three things in the output. The first <img> gained both loading="lazy" and decoding="async". The second <img> kept its author-provided loading="eager" — the null === get_attribute( 'loading' ) guard saw it and skipped the lazy line — but still gained decoding="async", because that's an unconditional set_attribute() call. And every byte that wasn't an image attribute came through untouched: the <figure>, the whitespace, the <figcaption>, the <p>, even the prose inside it.

Try this: change the second image to also lack a loading attribute, then add a third <img> wrapped in <picture><source><img></picture>. Run again. The processor walks all of them.

Pitfall — mutations are buffered. If you read get_attribute('loading') after a set_attribute('loading', 'lazy') on the same tag, you'll see 'lazy'. But if a different bit of code reads the original $post_html string, it sees the unmodified HTML — the edits only land in the string when you call get_updated_html(). This trips people up when they pass the source HTML to a logger or a hash function and wonder why the output doesn't match.
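The buffering can be seen directly in a small sketch, again assuming WP_HTML_Tag_Processor is loaded. The helper name demo_buffered_edit() and the array keys it returns are hypothetical.

```php
<?php
/**
 * Shows where an enqueued edit is visible and where it is not.
 * Hypothetical helper; assumes WP_HTML_Tag_Processor is loaded.
 */
function demo_buffered_edit( string $html ): array {
    $tags = new WP_HTML_Tag_Processor( $html );
    $seen = null;

    if ( $tags->next_tag( 'img' ) ) {
        $tags->set_attribute( 'loading', 'lazy' );
        // The processor reports the pending value for the current tag.
        $seen = $tags->get_attribute( 'loading' );
    }

    return array(
        'read_back' => $seen,
        'source'    => $html,                     // The original string, never modified.
        'updated'   => $tags->get_updated_html(), // The edit appears only here.
    );
}
```

If a log line or a hash needs to describe what was stored, feed it the 'updated' value rather than the source string.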

Rewrite relative URLs to absolute

The importer needs every link in a post to be addressable from the destination site. /recipes/sauces meant something on the source site; on the destination it points to nothing. We'll resolve every relative href against a base URL — and leave protocol-relative URLs, fragments, and absolute URLs alone.
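One way to sketch this, assuming WP_HTML_Tag_Processor is loaded. The function names resolve_href() and absolutize_links() are hypothetical, and the classifier is deliberately naive: it prepends the base URL and does nothing about .. segments or query-only URLs.

```php
<?php
/**
 * Classifies an href and resolves it against a base URL when it is relative.
 * Hypothetical helper; deliberately simple (no "..", no query-only URLs, no <base>).
 */
function resolve_href( string $href, string $base_url ): string {
    // A scheme (https:, mailto:, javascript:) or a leading // means already absolute.
    if ( 1 === preg_match( '~^(?:[a-z][a-z0-9+.-]*:|//)~i', $href ) ) {
        return $href;
    }
    // Empty values and fragments stay on the current document.
    if ( '' === $href || '#' === $href[0] ) {
        return $href;
    }
    // Everything else gets the base URL prepended, with exactly one slash at the join.
    return rtrim( $base_url, '/' ) . '/' . ltrim( $href, '/' );
}

/**
 * Applies resolve_href() to every <a> in the document. Hypothetical helper.
 */
function absolutize_links( string $html, string $base_url ): string {
    $tags = new WP_HTML_Tag_Processor( $html );

    while ( $tags->next_tag( 'a' ) ) {
        $href = $tags->get_attribute( 'href' );
        // get_attribute() returns null when absent and true for a valueless attribute.
        if ( ! is_string( $href ) ) {
            continue;
        }
        $resolved = resolve_href( $href, $base_url );
        // Only touch the tag when the value actually changes.
        if ( $resolved !== $href ) {
            $tags->set_attribute( 'href', $resolved );
        }
    }

    return $tags->get_updated_html();
}
```

Keeping the classifier as a plain string function means it can be tested without any HTML at all.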

The pattern in the body of the loop is a small URL classifier: scheme-bearing URLs and scheme-relative ones (//other.test/...) are already absolute; fragments stay on the current document; everything else gets the base URL prepended. The classifier itself is forgettable boilerplate — what matters is that the processor lets us write it once and apply it to every <a> in the document with five lines of loop scaffolding.

Try this: add a mailto:cooks@example.com link and a javascript:void(0) link to the input. Run. The mailto stays intact (good); the javascript link also stays intact (less good — sanitization is the next section's problem, not this one's).

Note. Real-world URL resolution has corners (relative paths with .., query-string-only URLs like ?page=2, base elements). The toolkit ships a focused URL handling utility under DataLiberation that does the full job. We'll meet it in chapter 3. For now the simple classifier is good enough for the inputs the importer actually sees.

Neutralize scripts and inline event handlers

This is the security-shaped fix. A user pasted some HTML, and an onmouseover handler came along with it. Maybe a <script> tag too. The importer needs to neutralize both before the content lands in a database that will later be rendered into someone else's browser.

For inline event handlers, get_attribute_names_with_prefix( 'on' ) returns every attribute on the current tag whose name starts with on — that's onclick, onmouseover, onerror, every variant. We loop over the returned names and remove each.

For <script> tags, set_modifiable_text('') blanks the script's body without disturbing the surrounding markup. Combined with stripping its attributes, the result is an inert <script></script> shell: readable, valid HTML that does nothing when a browser executes it.

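```php
<?php
// Stand-in input for illustration: handlers on <p>, <a>, and <img>, plus two <script> tags.
$untrusted = '<p onclick="track()">Hover <a href="/about" onmouseover="spy()">here</a>.</p>'
    . '<figure><img src="pan.jpg" onerror="alert(1)"><figcaption>A pan.</figcaption></figure>'
    . '<script src="https://ads.example.test/x.js"></script><script>alert("hi")</script>';

$tags = new WP_HTML_Tag_Processor( $untrusted );

while ( $tags->next_tag() ) {
    $tag = $tags->get_tag();

    // 1. Neutralize script bodies and remove their attributes.
    if ( 'SCRIPT' === $tag && ! $tags->is_tag_closer() ) {
        $tags->set_modifiable_text( '' );
        // An empty prefix matches every attribute name; ?? guards against a null return.
        foreach ( $tags->get_attribute_names_with_prefix( '' ) ?? array() as $attr ) {
            $tags->remove_attribute( $attr );
        }
        continue;
    }

    // 2. Remove every on* handler on every other tag.
    foreach ( $tags->get_attribute_names_with_prefix( 'on' ) ?? array() as $handler ) {
        $tags->remove_attribute( $handler );
    }
}

echo $tags->get_updated_html();
```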

Read the output carefully. The onclick, onerror, and onmouseover attributes are gone. Both <script> tags survive structurally — empty, no src, no body — but they're inert. The surrounding <p>, <figure>, and <figcaption> markup is unchanged.

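Pitfall: script tag pairs. A <script> element has an opener and a closer, and only the opener carries attributes and a body to blank. The ! $tags->is_tag_closer() guard makes sure we only act when we're sitting on the opener. By default next_tag() does not stop on closing tags at all, so here the guard is defensive; it starts to matter if the loop is later changed to visit closers, where set_modifiable_text() has nothing to blank and the attribute-stripping loop has nothing to remove. The same shape applies to <style>, <textarea>, and other elements whose content the processor treats as text rather than markup.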

Warning — this is not a complete sanitizer. Stripping on* handlers and emptying <script> tags closes the most common pasted-HTML vectors. It does not close javascript: URLs in href and src, CSS expressions in style, or SVG-borne event handlers. If you're accepting untrusted HTML from end users (not just migrating known-author content like our importer is), use WordPress's wp_kses_post() after this pass, or build an explicit allowlist of tags and attributes. The Tag Processor gives you the surgical tool; a full sanitizer is a policy decision on top.

Combine the three into one function

Each of the three rewrites above used its own WP_HTML_Tag_Processor instance. That's fine for a tutorial, but the importer is going to call this on every post — twelve, then a hundred, then a thousand — and each instance allocates a little state. We'll fold all three into a single pass over a single processor.

This is also the function chapter 2 will import and reuse. Save the shape:
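A single-pass sketch of what such a function could look like. The two-argument signature follows the way clean_post_html() is called in this chapter; the body is an assumption assembled from the three rewrites above, and it presumes a WP_HTML_Tag_Processor that provides set_modifiable_text() (WordPress 6.7 or later, or the toolkit's copy).

```php
<?php
/**
 * Single-pass sketch of clean_post_html(): lazy-loads images, absolutizes
 * links, and neutralizes scripts and on* handlers.
 * Assumes WP_HTML_Tag_Processor (with set_modifiable_text()) is loaded.
 */
function clean_post_html( string $html, string $base_url ): string {
    $tags = new WP_HTML_Tag_Processor( $html );

    while ( $tags->next_tag() ) {
        $tag = $tags->get_tag();

        // Scripts: blank the body, strip every attribute, move on.
        if ( 'SCRIPT' === $tag ) {
            $tags->set_modifiable_text( '' );
            foreach ( $tags->get_attribute_names_with_prefix( '' ) ?? array() as $name ) {
                $tags->remove_attribute( $name );
            }
            continue;
        }

        // Inline event handlers on every other tag.
        foreach ( $tags->get_attribute_names_with_prefix( 'on' ) ?? array() as $handler ) {
            $tags->remove_attribute( $handler );
        }

        if ( 'IMG' === $tag ) {
            if ( null === $tags->get_attribute( 'loading' ) ) {
                $tags->set_attribute( 'loading', 'lazy' );
            }
            $tags->set_attribute( 'decoding', 'async' );
        } elseif ( 'A' === $tag ) {
            $href = $tags->get_attribute( 'href' );
            // Relative here means: a non-empty string with no scheme, no leading //, not a fragment.
            if ( is_string( $href ) && '' !== $href
                && 1 !== preg_match( '~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $href ) ) {
                $tags->set_attribute( 'href', rtrim( $base_url, '/' ) . '/' . ltrim( $href, '/' ) );
            }
        }
    }

    return $tags->get_updated_html();
}
```

The order inside the loop is a design choice: scripts are handled first and skipped with continue, so the handler, image, and link branches never run on a tag that is being emptied anyway.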

```php
echo clean_post_html( $post_html, 'https://recipes.example.com/' );
```

One processor instance, one walk, one allocation of update state. Real importers run this on the body of every post they ingest. Call it ten thousand times in a long export and the difference between one processor per post and three is measurable.

Try this: add a <style> tag with a body to the input, then extend clean_post_html() to blank <style> bodies the same way it blanks <script> bodies. Run. CSS-borne attacks are rare in author-supplied content but they exist; the same surgical fix applies.

When the tag-level cursor is the wrong tool

Everything above used WP_HTML_Tag_Processor, which walks tags as a flat sequence. It doesn't know that <img> is inside <figure>; it just sees them in document order. For attribute rewriting that's perfect — fast, allocation-light, byte-honest.

It's the wrong tool when ancestry matters. If you need "every <img> directly inside a <figure>, but not images in paragraphs," or "the <h1> at the top of the article body, ignoring <h1>s nested inside <blockquote>," reach for WP_HTML_Processor — the same component, one class up. It implements HTML5 tree construction, so you can query by ancestry (breadcrumbs) and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
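For a sense of what an ancestry query looks like, here is a minimal sketch of the first case, assuming WP_HTML_Processor is loaded (WordPress 6.4 or later, or the toolkit's copy). The helper name mark_figure_images() and the in-figure class are hypothetical.

```php
<?php
/**
 * Adds a class to every <img> that is a direct child of a <figure>.
 * Hypothetical helper and class name; assumes WP_HTML_Processor is loaded.
 */
function mark_figure_images( string $html ): string {
    // create_fragment() parses the string as it would appear inside <body>.
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return $html; // The processor could not be created for this input.
    }

    // Breadcrumbs match the tail of the open-element path: an IMG whose parent is a FIGURE.
    while ( $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
        $processor->add_class( 'in-figure' );
    }

    return $processor->get_updated_html();
}
```

The full processor can stop early on markup it does not support, so production code would likely also check get_last_error() before trusting the result.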

The reference page for the HTML component (reference/html.html) shows both processors side by side with worked examples. We won't need the full processor in the importer.

Recap

You can now:

Open a WP_HTML_Tag_Processor over an HTML string and walk it with next_tag().
Add, replace, and remove attributes — and read the result with get_updated_html() — without disturbing untouched bytes.
Use get_attribute_names_with_prefix() to find and remove every on* handler in a single pass.
Blank the body of a special-content tag (<script>, <style>) with set_modifiable_text('').
Combine multiple rewrites into a single processor walk for performance and clarity.
Recognize when the cursor model is the right tool and when ancestry-aware WP_HTML_Processor is.

The clean_post_html() function is the importer's first real piece. We'll use it again in chapter 3.

In chapter 2 the importer's input becomes a real ZIP file: a thousand Markdown posts in a 40 MB archive that you can't afford to extract to disk on a memory-constrained host. We'll wrap the archive as a Filesystem, read entries one at a time, and pipe them into a memory-backed staging filesystem the next chapter will read from.