Rewriting HTML safely
In the quickstart you added a single attribute to a single tag. In this chapter we'll do the work the importer actually needs to do on real-world post content: add lazy loading to every image, rewrite relative links to absolute URLs, and neutralize script tags and inline event handlers that creep into pasted HTML. By the end you'll have a clean_post_html() function the next chapter will plug into the importer.
The input we're cleaning
The importer's job is to read a folder of Markdown posts. Each post has frontmatter, prose, and inline HTML that survived from a previous CMS — a mix of helpful markup, sloppy markup, and the occasional <script> tag someone pasted from Stack Overflow. Here's a representative example:
Three things need fixing before this HTML belongs in a database:
- The <img> has no loading hint, so it'll fetch eagerly even when it's far below the fold.
- Two of the four <a> tags use relative URLs. They were correct on the source site; on the destination site they'll point to nothing.
- There's a <script> tag and an onmouseover handler. They have to be neutralized.
Each of the next three sections fixes one of these. They all use the same component — WP_HTML_Tag_Processor — and the same shape: open a processor over the HTML, walk it, ask the cursor to make edits, then call get_updated_html() for the result.
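As a rough illustration of that four-step shape, here is a minimal sketch. The helper name set_attribute_on_all() is hypothetical (it is not part of WordPress or of this tutorial's importer), and the code assumes WordPress is loaded so that WP_HTML_Tag_Processor is available.

```php
<?php
// Hypothetical helper: shows the open / walk / edit / read-back shape only.
function set_attribute_on_all( string $html, string $tag, string $name, string $value ): string {
	// 1. Open a processor over the HTML.
	$processor = new WP_HTML_Tag_Processor( $html );

	// 2. Walk it: next_tag() advances the cursor to the next matching opening tag.
	while ( $processor->next_tag( $tag ) ) {
		// 3. Ask the cursor to make an edit on the tag it currently points at.
		$processor->set_attribute( $name, $value );
	}

	// 4. Read back the document with all queued edits applied.
	return $processor->get_updated_html();
}
```

Calling set_attribute_on_all( $html, 'img', 'data-imported', 'yes' ) would touch only the <img> tags and leave the rest of the string as it was.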
Lazy-load every image
Start with the most ergonomic fix: add loading="lazy" to every image that doesn't already have a loading attribute. Passing a tag name to next_tag() lets us skip everything that isn't an <img>:
Notice three things in the output. The first <img> gained both loading="lazy" and decoding="async". The second <img> kept its author-provided loading="eager" — the null === get_attribute( 'loading' ) guard saw it and skipped the lazy line — but still gained decoding="async", because that's an unconditional set_attribute() call. And every byte that wasn't an image attribute came through untouched: the <figure>, the whitespace, the <figcaption>, the <p>, even the prose inside it.
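The loop described above can be sketched roughly as follows. This is a reconstruction under assumptions, not necessarily the chapter's exact listing: the function name add_lazy_loading() is hypothetical, and WordPress must be loaded for WP_HTML_Tag_Processor to exist.

```php
<?php
// Hypothetical function name; a sketch of the guard-then-set pattern.
function add_lazy_loading( string $html ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	// The tag-name query skips every tag that is not an <img>.
	while ( $processor->next_tag( 'img' ) ) {
		// get_attribute() returns null when the attribute is absent,
		// so an author-provided loading="eager" is left alone.
		if ( null === $processor->get_attribute( 'loading' ) ) {
			$processor->set_attribute( 'loading', 'lazy' );
		}

		// Unconditional: every image gets the decoding hint.
		$processor->set_attribute( 'decoding', 'async' );
	}

	return $processor->get_updated_html();
}
```

The strict null comparison matters because get_attribute() can also return true for a valueless attribute, which should count as "already present".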
Rewrite relative URLs to absolute
The importer needs every link in a post to be addressable from the destination site. /recipes/sauces meant something on the source site; on the destination it points to nothing. We'll resolve every relative href against a base URL — and leave protocol-relative URLs, fragments, and absolute URLs alone.
The pattern in the body of the loop is a small URL classifier: scheme-bearing URLs and scheme-relative ones (//other.test/...) are already absolute; fragments stay on the current document; everything else gets the base URL prepended. The classifier itself is forgettable boilerplate — what matters is that the processor lets us write it once and apply it to every <a> in the document with five lines of loop scaffolding.
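One way the classifier and its loop might look is sketched below. The names resolve_href() and absolutize_links() are hypothetical, the base URL is passed in as a parameter by assumption, and leaving an empty href untouched is an added assumption too. Plain prepending does not collapse ../ segments, so treat this as a simplification rather than full URL resolution.

```php
<?php
// Hypothetical classifier: decides whether an href needs the base URL prepended.
function resolve_href( string $href, string $base ): string {
	// Empty (assumption) or fragment-only links stay on the current document.
	if ( '' === $href || '#' === $href[0] ) {
		return $href;
	}

	// Scheme-bearing (https:, mailto:) and scheme-relative (//host/path)
	// URLs are already absolute.
	if ( preg_match( '~^(?:[a-z][a-z0-9+.-]*:|//)~i', $href ) ) {
		return $href;
	}

	// Everything else gets the base URL prepended, with exactly one slash between.
	return rtrim( $base, '/' ) . '/' . ltrim( $href, '/' );
}

// Hypothetical wrapper: applies the classifier to every <a> in the document.
// Requires WordPress to be loaded for WP_HTML_Tag_Processor.
function absolutize_links( string $html, string $base ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	while ( $processor->next_tag( 'a' ) ) {
		$href = $processor->get_attribute( 'href' );

		// Skip a missing href (null) and a valueless href (true).
		if ( is_string( $href ) ) {
			$processor->set_attribute( 'href', resolve_href( $href, $base ) );
		}
	}

	return $processor->get_updated_html();
}
```

Keeping the classifier as a plain string function means it can be unit-tested without touching any HTML at all.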
Neutralize scripts and inline event handlers
This is the security-shaped fix. A user pasted some HTML, and an onmouseover handler came along with it. Maybe a <script> tag too. The importer needs to neutralize both before the content lands in a database that will later be rendered into someone else's browser.
For inline event handlers, get_attribute_names_with_prefix( 'on' ) returns every attribute on the current tag whose name starts with on — that's onclick, onmouseover, onerror, every variant. We loop over the returned names and remove each.
For <script> tags, set_modifiable_text('') blanks the script's body without disturbing the surrounding markup. Combined with stripping its attributes, the result is an inert <script></script> shell: readable, valid HTML that executes nothing.
Read the output carefully. The onclick, onerror, and onmouseover attributes are gone. Both <script> tags survive structurally — empty, no src, no body — but they're inert. The surrounding <p>, <figure>, and <figcaption> markup is unchanged.
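A minimal sketch of the two steps, under assumptions: the function name neutralize_scripts_and_handlers() is hypothetical, WordPress must be loaded, and the WordPress version must be recent enough to provide set_modifiable_text() (6.7 or later, to the best of my knowledge). It covers only the two vectors named in this section.

```php
<?php
// Hypothetical function name; strips on* handlers everywhere and empties <script> tags.
function neutralize_scripts_and_handlers( string $html ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	// No query: visit every opening tag in document order.
	while ( $processor->next_tag() ) {
		// Names come back as a list; null means no tag is matched, hence the fallback.
		foreach ( $processor->get_attribute_names_with_prefix( 'on' ) ?? array() as $name ) {
			$processor->remove_attribute( $name );
		}

		// get_tag() reports the tag name in upper case.
		if ( 'SCRIPT' === $processor->get_tag() ) {
			// An empty prefix matches every attribute, so src, type, etc. all go.
			foreach ( $processor->get_attribute_names_with_prefix( '' ) ?? array() as $name ) {
				$processor->remove_attribute( $name );
			}

			// Blank the script body; the tags themselves stay in place.
			$processor->set_modifiable_text( '' );
		}
	}

	return $processor->get_updated_html();
}
```

Removing an attribute can leave a stray space inside the tag, so comparing output byte-for-byte against a hand-written <script></script> may not match exactly.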
Combine the three into one function
Each of the three rewrites above used its own WP_HTML_Tag_Processor instance. That's fine for a tutorial, but the importer is going to call this on every post — twelve, then a hundred, then a thousand — and each instance allocates a little state. We'll fold all three into a single pass over a single processor.
This is also the function chapter 2 will import and reuse. Save the shape:
One processor instance, one walk, one allocation of update state. Real importers run this on the body of every post they ingest. Call it ten thousand times in a long export and the difference between one processor per post and three is measurable.
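For orientation, a single-pass version might be sketched like this. The signature is an assumption (the chapter names clean_post_html() but the base-URL parameter is added here for illustration), the URL check is inlined so the block stands alone, and WordPress with set_modifiable_text() support is assumed to be loaded.

```php
<?php
// Sketch only: the $base_url parameter is an assumption, not the chapter's confirmed signature.
function clean_post_html( string $html, string $base_url ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	while ( $processor->next_tag() ) {
		// Every tag: drop inline event handlers.
		foreach ( $processor->get_attribute_names_with_prefix( 'on' ) ?? array() as $name ) {
			$processor->remove_attribute( $name );
		}

		switch ( $processor->get_tag() ) {
			case 'IMG':
				if ( null === $processor->get_attribute( 'loading' ) ) {
					$processor->set_attribute( 'loading', 'lazy' );
				}
				$processor->set_attribute( 'decoding', 'async' );
				break;

			case 'A':
				$href = $processor->get_attribute( 'href' );
				// Rewrite only relative hrefs: skip empty values, schemes, //host, and #fragments.
				if ( is_string( $href ) && '' !== $href && ! preg_match( '~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $href ) ) {
					$processor->set_attribute( 'href', rtrim( $base_url, '/' ) . '/' . ltrim( $href, '/' ) );
				}
				break;

			case 'SCRIPT':
				// Strip all attributes, then blank the body.
				foreach ( $processor->get_attribute_names_with_prefix( '' ) ?? array() as $name ) {
					$processor->remove_attribute( $name );
				}
				$processor->set_modifiable_text( '' );
				break;
		}
	}

	return $processor->get_updated_html();
}
```

The switch on get_tag() keeps each rewrite in its own branch, so adding a fourth rule later does not mean adding a fourth walk.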
When the tag-level cursor is the wrong tool
Everything above used WP_HTML_Tag_Processor, which walks tags as a flat sequence. It doesn't know that <img> is inside <figure>; it just sees them in document order. For attribute rewriting that's perfect — fast, allocation-light, byte-honest.
It's the wrong tool when ancestry matters. If you need "every <img> directly inside a <figure>, but not images in paragraphs," or "the <h1> at the top of the article body, ignoring <h1>s nested inside <blockquote>," reach for WP_HTML_Processor — the same component, one class up. It implements HTML5 tree construction, so you can query by ancestry (breadcrumbs) and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
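To make the first of those queries concrete, here is a hedged sketch using a breadcrumbs query. The function name lazy_load_figure_images() is hypothetical, and the code assumes a WordPress version that ships WP_HTML_Processor (6.4 or later, as far as I know) and supports the tags in your input.

```php
<?php
// Hypothetical function: lazy-load only images that sit directly inside a <figure>.
function lazy_load_figure_images( string $html ): string {
	// create_fragment() parses the HTML as body content; it can return null on failure.
	$processor = WP_HTML_Processor::create_fragment( $html );
	if ( null === $processor ) {
		return $html;
	}

	// Breadcrumbs match the tail of the ancestry chain, so FIGURE > IMG
	// matches an <img> whose parent is a <figure>, wherever that figure sits.
	while ( $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
		$processor->set_attribute( 'loading', 'lazy' );
	}

	return $processor->get_updated_html();
}
```

An image inside a paragraph has breadcrumbs ending in P, IMG, so the query passes over it.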
The reference page for the HTML component (reference/html.html) shows both processors side by side with worked examples. We won't need the full processor in the importer.
Recap
You can now:
- Open a WP_HTML_Tag_Processor over an HTML string and walk it with next_tag().
- Add, replace, and remove attributes, and read the result with get_updated_html(), without disturbing untouched bytes.
- Use get_attribute_names_with_prefix() to find and remove every on* handler in a single pass.
- Blank the body of a special-content tag (<script>, <style>) with set_modifiable_text('').
- Combine multiple rewrites into a single processor walk for performance and clarity.
- Recognize when the cursor model is the right tool and when ancestry-aware WP_HTML_Processor is.
The clean_post_html() function is the importer's first real piece. We'll use it again in chapter 3.