PHP Toolkit

Rewriting HTML safely

In the quickstart you added a single attribute to a single tag. In this chapter we'll do the work the importer actually needs to do on real-world post content: add lazy loading to every image, rewrite relative links to absolute URLs, and neutralize script tags and inline event handlers that creep into pasted HTML. By the end you'll have a clean_post_html() function the next chapter will plug into the importer.

The input we're cleaning

The importer's job is to read a folder of Markdown posts. Each post has frontmatter, prose, and inline HTML that survived from a previous CMS — a mix of helpful markup, sloppy markup, and the occasional <script> tag someone pasted from Stack Overflow. Here's a representative example:

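A stand-in with the same three problems can be sketched as follows. Everything in it (the file names, paths, link text, and the script body) is hypothetical and invented for illustration; only the overall shape follows the description above: one image without a loading hint, four links of which two are relative, one inline handler, and one script tag.

```php
<?php
// Hypothetical sample post body; every name and URL here is invented for illustration.
$post_html = <<<'HTML'
<figure>
  <img src="/images/aioli.jpg" alt="A bowl of aioli">
  <figcaption>Garlic aioli, ready in five minutes.</figcaption>
</figure>
<p onmouseover="track(this)">
  Start with the <a href="/recipes/sauces">sauces index</a>, skim the
  <a href="tips/emulsions">emulsion tips</a>, then compare notes with
  <a href="https://example.org/aioli">this write-up</a> or jump to the
  <a href="#method">method</a>.
</p>
<script>alert( 'pasted from a forum answer' );</script>
HTML;
```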

Three things need fixing before this HTML belongs in a database:

  1. The <img> has no loading hint, so it'll fetch eagerly even when it's far below the fold.
  2. Two of the four <a> tags use relative URLs. They were correct on the source site; on the destination site they'll point to nothing.
  3. There's a <script> tag and an onmouseover handler. They have to be neutralized.

Each of the next three sections fixes one of these. They all use the same component — WP_HTML_Tag_Processor — and the same shape: open a processor over the HTML, walk it, ask the cursor to make edits, then call get_updated_html() for the result.

Lazy-load every image

Start with the most ergonomic fix: add loading="lazy" to every image that doesn't already have a loading attribute. The processor's filter argument lets us skip everything that isn't an <img>:
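A minimal sketch of that loop, assuming WP_HTML_Tag_Processor is already loaded (through WordPress or the toolkit's autoloader). The add_image_hints() name is hypothetical, and the unconditional decoding="async" call is included because the walkthrough of the output relies on it.

```php
<?php
// Hypothetical helper: add loading/decoding hints to every <img>.
// Assumes WP_HTML_Tag_Processor is available (WordPress or the toolkit autoloader).
function add_image_hints( string $html ): string {
    $tags = new WP_HTML_Tag_Processor( $html );

    // The query restricts next_tag() to IMG elements; other tags are skipped.
    while ( $tags->next_tag( array( 'tag_name' => 'img' ) ) ) {
        // get_attribute() returns null when the attribute is absent,
        // so an author-provided loading="eager" is left alone.
        if ( null === $tags->get_attribute( 'loading' ) ) {
            $tags->set_attribute( 'loading', 'lazy' );
        }

        // Unconditional: every image gets the decoding hint.
        $tags->set_attribute( 'decoding', 'async' );
    }

    return $tags->get_updated_html();
}
```

Called on markup that contains one plain <img> and one <img loading="eager">, this should add both attributes to the first image and only decoding="async" to the second.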

Notice three things in the output. The first <img> gained both loading="lazy" and decoding="async". The second <img> kept its author-provided loading="eager" — the null === get_attribute( 'loading' ) guard saw it and skipped the lazy line — but still gained decoding="async", because that's an unconditional set_attribute() call. And every byte that wasn't an image attribute came through untouched: the <figure>, the whitespace, the <figcaption>, the <p>, even the prose inside it.

Rewrite relative URLs to absolute

The importer needs every link in a post to be addressable from the destination site. /recipes/sauces meant something on the source site; on the destination it points to nothing. We'll resolve every relative href against a base URL — and leave protocol-relative URLs, fragments, and absolute URLs alone.
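One way to sketch this, assuming WP_HTML_Tag_Processor is available. The resolve_href() and absolutize_links() names are hypothetical, and the classifier is deliberately simple: it treats every remaining path as relative to the root of the base URL, which is an assumption rather than full URL resolution (it does not handle ../ segments or query-only links).

```php
<?php
// Hypothetical classifier: return $href unchanged when it is already
// absolute, protocol-relative, or a fragment; otherwise prepend $base.
function resolve_href( string $href, string $base ): string {
    // Scheme-bearing URLs (https:, mailto:, tel:, ...) are already absolute.
    if ( 1 === preg_match( '~^[a-z][a-z0-9+.-]*:~i', $href ) ) {
        return $href;
    }

    // Protocol-relative URLs (//other.test/...) and fragments (#top) stay as-is.
    if ( 0 === strpos( $href, '//' ) || 0 === strpos( $href, '#' ) ) {
        return $href;
    }

    // Simplification: treat every remaining path as relative to the base root.
    return rtrim( $base, '/' ) . '/' . ltrim( $href, '/' );
}

// Hypothetical wrapper; assumes WP_HTML_Tag_Processor is available.
function absolutize_links( string $html, string $base ): string {
    $tags = new WP_HTML_Tag_Processor( $html );

    while ( $tags->next_tag( array( 'tag_name' => 'a' ) ) ) {
        $href = $tags->get_attribute( 'href' );

        // Skip missing (null), value-less (true), and empty href attributes.
        if ( is_string( $href ) && '' !== $href ) {
            $tags->set_attribute( 'href', resolve_href( $href, $base ) );
        }
    }

    return $tags->get_updated_html();
}
```

Keeping the classifier as a plain function means it can be tested without a processor instance at all.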

The pattern in the body of the loop is a small URL classifier: scheme-bearing URLs and scheme-relative ones (//other.test/...) are already absolute; fragments stay on the current document; everything else gets the base URL prepended. The classifier itself is forgettable boilerplate — what matters is that the processor lets us write it once and apply it to every <a> in the document with five lines of loop scaffolding.

Neutralize scripts and inline event handlers

This is the security-shaped fix. A user pasted some HTML, and an onmouseover handler came along with it. Maybe a <script> tag too. The importer needs to neutralize both before the content lands in a database that will later be rendered into someone else's browser.

For inline event handlers, get_attribute_names_with_prefix( 'on' ) returns every attribute on the current tag whose name starts with on — that's onclick, onmouseover, onerror, every variant. We loop over the returned names and remove each.

For <script> tags, set_modifiable_text( '' ) blanks the script's body without disturbing the surrounding markup. Combined with stripping its attributes, the result is an inert <script></script> shell: readable, valid HTML that does nothing when a browser runs it.

    // $untrusted holds the pasted HTML to clean.
    $tags = new WP_HTML_Tag_Processor( $untrusted );

    while ( $tags->next_tag() ) {
        $tag = $tags->get_tag();

        // 1. Neutralize script bodies and remove their attributes.
        if ( 'SCRIPT' === $tag && ! $tags->is_tag_closer() ) {
            $tags->set_modifiable_text( '' );
            foreach ( $tags->get_attribute_names_with_prefix( '' ) as $attr ) {
                $tags->remove_attribute( $attr );
            }
            continue;
        }

        // 2. Remove every on* handler on every other tag.
        foreach ( $tags->get_attribute_names_with_prefix( 'on' ) as $handler ) {
            $tags->remove_attribute( $handler );
        }
    }

    echo $tags->get_updated_html();

Read the output carefully. The onclick, onerror, and onmouseover attributes are gone. Both <script> tags survive structurally — empty, no src, no body — but they're inert. The surrounding <p>, <figure>, and <figcaption> markup is unchanged.

Combine the three into one function

Each of the three rewrites above used its own WP_HTML_Tag_Processor instance. That's fine for a tutorial, but the importer is going to call this on every post — twelve, then a hundred, then a thousand — and each instance allocates a little state. We'll fold all three into a single pass over a single processor.

This is also the function chapter 2 will import and reuse. Save the shape:
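A sketch of what that single pass could look like, assuming WP_HTML_Tag_Processor is available and that the second parameter is the base URL used for relative links. The parameter names and the inline URL check are assumptions made for illustration, and the URL handling is the same simplified prepend-the-base approach described earlier.

```php
<?php
// Sketch only: assumes WP_HTML_Tag_Processor is available and that the
// second parameter is the base URL for relative links.
function clean_post_html( string $html, string $base_url ): string {
    $tags = new WP_HTML_Tag_Processor( $html );

    while ( $tags->next_tag() ) {
        $tag = $tags->get_tag();

        // Scripts: blank the body, drop every attribute, move on.
        if ( 'SCRIPT' === $tag ) {
            $tags->set_modifiable_text( '' );
            foreach ( $tags->get_attribute_names_with_prefix( '' ) ?? array() as $attr ) {
                $tags->remove_attribute( $attr );
            }
            continue;
        }

        // Every other tag: remove inline on* event handlers.
        foreach ( $tags->get_attribute_names_with_prefix( 'on' ) ?? array() as $handler ) {
            $tags->remove_attribute( $handler );
        }

        // Images: lazy-load unless the author already chose a loading mode.
        if ( 'IMG' === $tag ) {
            if ( null === $tags->get_attribute( 'loading' ) ) {
                $tags->set_attribute( 'loading', 'lazy' );
            }
            $tags->set_attribute( 'decoding', 'async' );
        }

        // Links: prepend the base URL unless the href already has a scheme,
        // is protocol-relative, or is a fragment.
        if ( 'A' === $tag ) {
            $href = $tags->get_attribute( 'href' );
            if ( is_string( $href ) && '' !== $href
                && 1 !== preg_match( '~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $href ) ) {
                $tags->set_attribute( 'href', rtrim( $base_url, '/' ) . '/' . ltrim( $href, '/' ) );
            }
        }
    }

    return $tags->get_updated_html();
}
```

Handling SCRIPT first and continuing keeps the script branch from falling through to the attribute rewrites that apply to ordinary tags.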

    echo clean_post_html( $post_html, 'https://recipes.example.com/' );

One processor instance, one walk, one allocation of update state. Real importers run this on the body of every post they ingest. Call it ten thousand times in a long export and the difference between one allocation per post and three is measurable.

When the tag-level cursor is the wrong tool

Everything above used WP_HTML_Tag_Processor, which walks tags as a flat sequence. It doesn't know that <img> is inside <figure>; it just sees them in document order. For attribute rewriting that's perfect — fast, allocation-light, byte-honest.

It's the wrong tool when ancestry matters. If you need "every <img> directly inside a <figure>, but not images in paragraphs," or "the <h1> at the top of the article body, ignoring <h1>s nested inside <blockquote>," reach for WP_HTML_Processor — the same component, one class up. It implements HTML5 tree construction, so you can query by ancestry (breadcrumbs) and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
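For a sense of what an ancestry query looks like, here is a small sketch, assuming WP_HTML_Processor is available. The mark_figure_images() name and the figure-image class are hypothetical; the sketch also assumes the input is a body fragment the processor supports, and falls back to returning the input unchanged if the fragment cannot be created.

```php
<?php
// Hypothetical helper: tag only the images that sit directly inside a <figure>.
// Assumes WP_HTML_Processor is available (WordPress or the toolkit autoloader).
function mark_figure_images( string $html ): string {
    // create_fragment() parses $html as body content and returns null on failure.
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return $html;
    }

    // The breadcrumbs query matches an IMG whose parent element is a FIGURE,
    // so images inside paragraphs are not visited.
    while ( $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
        $processor->add_class( 'figure-image' );
    }

    return $processor->get_updated_html();
}
```

The loop body is the same cursor-style editing as before; only the query changes, from a tag name to a path of ancestors.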

The reference page for the HTML component (reference/html.html) shows both processors side by side with worked examples. We won't need the full processor in the importer.

Recap

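You can now:

  - Add loading="lazy" and decoding="async" to images without touching the rest of the markup.
  - Resolve relative href values against a base URL while leaving absolute, protocol-relative, and fragment URLs alone.
  - Strip on* event handlers and blank out <script> tags.
  - Fold all three rewrites into a single pass with clean_post_html().
  - Recognize when ancestry matters and WP_HTML_Processor is the better tool.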

The clean_post_html() function is the importer's first real piece. We'll use it again in chapter 3.

In chapter 2 the importer's input becomes a real ZIP file: a thousand Markdown posts in a 40 MB archive that you can't afford to extract to disk on a memory-constrained host. We'll wrap the archive as a Filesystem, read entries one at a time, and pipe them into a memory-backed staging filesystem the next chapter will read from.