Rewriting HTML safely
In the quickstart you added a single attribute to a single tag. In this chapter we'll do the work the importer actually needs to do on real-world post content: add lazy loading to every image, rewrite relative links to absolute URLs, and strip the script tags and inline event handlers that creep into pasted HTML. By the end you'll have a clean_post_html() function the next chapter will plug into the importer.
The input we're cleaning
The importer's job is to read a folder of Markdown posts. Each post has frontmatter, prose, and inline HTML that survived from a previous CMS — a mix of helpful markup, sloppy markup, and the occasional <script> tag someone pasted from Stack Overflow. Here's a representative example:
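To make the problems concrete, here is a stand-in post body held in a PHP string. The markup, the URLs, and the $post_html variable name are all assumptions made for illustration; any pasted content with the same mix of issues would serve.

```php
<?php
// Hypothetical post body: one image without a loading hint, four links
// (two of them relative), one <script> tag, and one onmouseover handler.
$post_html = <<<'HTML'
<figure>
  <img src="/uploads/bechamel.jpg" alt="A pan of bechamel">
  <figcaption>See the <a href="/recipes/sauces">sauce index</a>.</figcaption>
</figure>
<p onmouseover="track(this)">
  Read the <a href="https://example.com/guide">full guide</a>,
  jump to the <a href="#notes">notes</a>,
  or browse <a href="/archive/2019">the 2019 archive</a>.
</p>
<script>document.write( 'pasted from a forum answer' );</script>
HTML;
```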
Three things need fixing before this HTML belongs in a database:
- The <img> has no loading hint, so it'll fetch eagerly even when it's far below the fold.
- Two of the four <a> tags use relative URLs. They were correct on the source site; on the destination site they'll point to nothing.
- There's a <script> tag and an onmouseover handler. They have to go.
Each of the next three sections fixes one of these. They all use the same component — WP_HTML_Tag_Processor — and the same shape: open a processor over the HTML, walk it, ask the cursor to make edits, then call get_updated_html() for the result.
Lazy-load every image
Start with the most ergonomic fix: add loading="lazy" to every image that doesn't already have a loading attribute. Passing a tag name as the query to next_tag() lets us skip everything that isn't an <img>:
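A minimal sketch of that loop follows. It assumes WordPress is loaded so that WP_HTML_Tag_Processor is available; the function name add_image_loading_hints() is hypothetical, and the unconditional decoding="async" line is included because the walkthrough of the output mentions it.

```php
<?php
/**
 * Sketch: add loading="lazy" to images that lack a loading attribute,
 * and decoding="async" to every image.
 * add_image_loading_hints() is a hypothetical name, not a WordPress API.
 */
function add_image_loading_hints( string $html ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	// The tag-name query makes next_tag() stop only on <img> tags.
	while ( $processor->next_tag( 'img' ) ) {
		// get_attribute() returns null only when the attribute is absent,
		// so an author-provided loading="eager" is left alone.
		if ( null === $processor->get_attribute( 'loading' ) ) {
			$processor->set_attribute( 'loading', 'lazy' );
		}

		// Unconditional: every image gets the decoding hint.
		$processor->set_attribute( 'decoding', 'async' );
	}

	return $processor->get_updated_html();
}
```

The function only defines behavior; call it with a post body, for example add_image_loading_hints( $post_html ), once WordPress has been bootstrapped.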
Notice three things in the output. The first <img> gained both loading="lazy" and decoding="async". The second <img> kept its author-provided loading="eager" — the null === get_attribute( 'loading' ) guard saw it and skipped the lazy line — but still gained decoding="async", because that's an unconditional set_attribute() call. And every byte that wasn't an image attribute came through untouched: the <figure>, the whitespace, the <figcaption>, the <p>, even the prose inside it.
Rewrite relative URLs to absolute
The importer needs every link in a post to be addressable from the destination site. /recipes/sauces meant something on the source site; on the destination it points to nothing. We'll resolve every relative href against a base URL — and leave protocol-relative URLs, fragments, and absolute URLs alone.
The pattern in the body of the loop is a small URL classifier: scheme-bearing URLs and scheme-relative ones (//other.test/...) are already absolute; fragments stay on the current document; everything else gets the base URL prepended. The classifier itself is forgettable boilerplate — what matters is that the processor lets us write it once and apply it to every <a> in the document with five lines of loop scaffolding.
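The classifier and the loop described above can be sketched as two small functions. Both names (importer_resolve_url() and importer_absolutize_links()) and the base URL in the usage note are hypothetical. Following the prose, "resolving" here simply prepends the base URL; it does not implement full relative-reference resolution (for example ../ segments), which would be a separate concern.

```php
<?php
/**
 * Sketch of the URL classifier: returns $url unchanged when it is a fragment,
 * protocol-relative, or already carries a scheme; otherwise prepends $base.
 * Hypothetical helper, not part of WordPress.
 */
function importer_resolve_url( string $url, string $base ): string {
	$url = trim( $url );

	if ( '' === $url || '#' === $url[0] ) {
		return $url; // Empty or fragment: stays on the current document.
	}
	if ( 0 === strpos( $url, '//' ) ) {
		return $url; // Protocol-relative: already names a host.
	}
	if ( preg_match( '/^[a-z][a-z0-9+.\-]*:/i', $url ) ) {
		return $url; // Has a scheme (https:, mailto:, tel:, ...).
	}

	// Everything else: join base and path with exactly one slash.
	return rtrim( $base, '/' ) . '/' . ltrim( $url, '/' );
}

/**
 * Sketch of the loop: applies the classifier to every <a href>.
 * Assumes WordPress is loaded so WP_HTML_Tag_Processor exists.
 */
function importer_absolutize_links( string $html, string $base ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	while ( $processor->next_tag( 'a' ) ) {
		$href = $processor->get_attribute( 'href' );

		// get_attribute() can return null (missing) or true (valueless attribute).
		if ( ! is_string( $href ) ) {
			continue;
		}

		$resolved = importer_resolve_url( $href, $base );
		if ( $resolved !== $href ) {
			$processor->set_attribute( 'href', $resolved );
		}
	}

	return $processor->get_updated_html();
}
```

With an assumed base of https://new.example, importer_resolve_url( '/recipes/sauces', 'https://new.example' ) yields https://new.example/recipes/sauces, while #notes, //other.test/x, and mailto: links pass through unchanged.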
Strip script tags and inline event handlers
This is the security-shaped fix. A user pasted some HTML, and an onmouseover handler came along with it. Maybe a <script> tag too. The importer needs to neutralize both before the content lands in a database that will later be rendered into someone else's browser.
For inline event handlers, get_attribute_names_with_prefix( 'on' ) returns every attribute on the current tag whose name starts with on — that's onclick, onmouseover, onerror, every variant. We loop over the returned names and remove each.
For <script> tags, set_modifiable_text('') blanks the script's body without disturbing the surrounding markup. Combined with stripping its attributes, the result is an inert <script></script> shell: readable, valid HTML that executes nothing.
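Both steps can be sketched in one walk. The function name importer_strip_active_content() is hypothetical, and the sketch assumes a WordPress version that ships set_modifiable_text() (to my knowledge 6.7 or later). It also relies on an empty prefix matching every attribute name, which is how all of a <script> tag's attributes are removed.

```php
<?php
/**
 * Sketch: remove on* handlers from every tag, and reduce each <script>
 * to an empty, attribute-free shell.
 * importer_strip_active_content() is a hypothetical name.
 */
function importer_strip_active_content( string $html ): string {
	$processor = new WP_HTML_Tag_Processor( $html );

	// No query: visit every opening tag in document order.
	while ( $processor->next_tag() ) {
		$is_script = 'SCRIPT' === $processor->get_tag();

		// For <script>, the empty prefix selects every attribute (src, type, ...).
		// For all other tags, only names starting with "on" are selected.
		$prefix = $is_script ? '' : 'on';
		$names  = $processor->get_attribute_names_with_prefix( $prefix ) ?? array();

		foreach ( $names as $name ) {
			$processor->remove_attribute( $name );
		}

		if ( $is_script ) {
			// Blank the script body; the tags themselves stay in place.
			$processor->set_modifiable_text( '' );
		}
	}

	return $processor->get_updated_html();
}
```

Removing handlers and scripts this way covers the two cases this section names; it is not a full sanitizer (javascript: URLs, for instance, are outside its scope).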
Read the output carefully. The onclick, onerror, and onmouseover attributes are gone. Both <script> tags survive structurally — empty, no src, no body — but they're inert. The surrounding <p>, <figure>, and <figcaption> markup is unchanged.
Combine the three into one function
Each of the three rewrites above used its own WP_HTML_Tag_Processor instance. That's fine for a tutorial, but the importer is going to call this on every post — twelve, then a hundred, then a thousand — and each instance allocates a little state. We'll fold all three into a single pass over a single processor.
This is also the function chapter 2 will import and reuse. Save the shape:
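One possible shape for the combined function is sketched below. The name clean_post_html() comes from this chapter; the $base_url parameter, the inlined URL check, and the decoding="async" line are assumptions. It requires WordPress to be loaded and, for set_modifiable_text(), a version that provides that method.

```php
<?php
/**
 * Sketch: lazy-load images, absolutize relative links, and neutralize
 * scripts and inline handlers in a single pass over one processor.
 * The signature is an assumption; only the function name is from the chapter.
 */
function clean_post_html( string $html, string $base_url ): string {
	$processor = new WP_HTML_Tag_Processor( $html );
	$base      = rtrim( $base_url, '/' );

	while ( $processor->next_tag() ) {
		$tag = $processor->get_tag();

		// <script> loses every attribute (empty prefix); other tags lose on* only.
		$prefix = 'SCRIPT' === $tag ? '' : 'on';
		foreach ( $processor->get_attribute_names_with_prefix( $prefix ) ?? array() as $name ) {
			$processor->remove_attribute( $name );
		}

		if ( 'SCRIPT' === $tag ) {
			$processor->set_modifiable_text( '' );
		} elseif ( 'IMG' === $tag ) {
			if ( null === $processor->get_attribute( 'loading' ) ) {
				$processor->set_attribute( 'loading', 'lazy' );
			}
			$processor->set_attribute( 'decoding', 'async' );
		} elseif ( 'A' === $tag ) {
			$href = $processor->get_attribute( 'href' );

			// Relative means: a non-empty string that is not a fragment,
			// not protocol-relative, and has no scheme.
			$is_relative = is_string( $href )
				&& '' !== $href
				&& '#' !== $href[0]
				&& 0 !== strpos( $href, '//' )
				&& ! preg_match( '/^[a-z][a-z0-9+.\-]*:/i', $href );

			if ( $is_relative ) {
				$processor->set_attribute( 'href', $base . '/' . ltrim( $href, '/' ) );
			}
		}
	}

	return $processor->get_updated_html();
}
```

Branching on get_tag() inside one loop keeps each rewrite readable while sharing a single processor, which is the point of the consolidation.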
One processor instance, one walk, one allocation of update state. Real importers run this on the body of every post they ingest. Call it ten thousand times in a long export, and the difference between one allocation per post and three is measurable.
When the tag-level cursor is the wrong tool
Everything above used WP_HTML_Tag_Processor, which walks tags as a flat sequence. It doesn't know that <img> is inside <figure>; it just sees them in document order. For attribute rewriting that's perfect — fast, allocation-light, byte-honest.
It's the wrong tool when ancestry matters. If you need "every <img> directly inside a <figure>, but not images in paragraphs," or "the <h1> at the top of the article body, ignoring <h1>s nested inside <blockquote>," reach for WP_HTML_Processor, a subclass of WP_HTML_Tag_Processor in the same component. It implements HTML5 tree construction, so you can query by ancestry (breadcrumbs) and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
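As a rough illustration of an ancestry query, the sketch below adds a class to images whose parent is a <figure> and leaves images in paragraphs alone. The function name mark_figure_images() and the class name figure-image are hypothetical. The sketch assumes WordPress is loaded, and it falls back to the original HTML if the processor cannot be created or stops on markup it does not support.

```php
<?php
/**
 * Sketch: tag only the images that sit directly inside a <figure>.
 * mark_figure_images() and the "figure-image" class are hypothetical.
 */
function mark_figure_images( string $html ): string {
	// create_fragment() parses the string as body content; it can return null.
	$processor = WP_HTML_Processor::create_fragment( $html );
	if ( null === $processor ) {
		return $html;
	}

	// Breadcrumbs describe the tail of the ancestor chain:
	// an IMG whose immediate parent is a FIGURE.
	$query = array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) );

	while ( $processor->next_tag( $query ) ) {
		$processor->add_class( 'figure-image' );
	}

	// If parsing stopped early on unsupported markup, keep the input untouched.
	if ( null !== $processor->get_last_error() ) {
		return $html;
	}

	return $processor->get_updated_html();
}
```

Returning the input unchanged on a parse error is a conservative choice for an importer: a partially edited document is harder to reason about than an unedited one.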
The reference page for the HTML component (reference/html.html) shows both processors side by side with worked examples. We won't need the full processor in the importer.
Recap
You can now:
- Open a WP_HTML_Tag_Processor over an HTML string and walk it with next_tag().
- Add, replace, and remove attributes, and read the result with get_updated_html(), without disturbing untouched bytes.
- Use get_attribute_names_with_prefix() to find and remove every on* handler in a single pass.
- Blank the body of a special-content tag (<script>, <style>) with set_modifiable_text('').
- Combine multiple rewrites into a single processor walk for performance and clarity.
- Recognize when the cursor model is the right tool and when ancestry-aware WP_HTML_Processor is.
The clean_post_html() function is the importer's first real piece. We'll use it again in chapter 3.