Markdown to blocks to WXR

Chapter 1 cleaned a single HTML string. Chapter 2 staged a folder of Markdown files in memory. This chapter turns each of those files into a WXR-style extended-RSS export stream. Along the way we meet three more components — Markdown, BlockParser, and DataLiberation — and watch them compose into something none of them could do alone.

Markdown to block markup

The Markdown component does one thing well: it bridges Markdown and WordPress block markup, in either direction. The MarkdownConsumer class takes a Markdown string and returns a result object containing both the rendered block markup and any frontmatter parsed from the document's leading YAML.

Two outputs come back: the post metadata (read with get_meta_value() for scalars, or get_all_metadata() for the raw structure) and the block markup itself, which is the … string that WordPress stores in post_content. From here on we treat that string the way WordPress treats it.

Note — two ways to read frontmatter. get_meta_value( 'title' ) returns the first occurrence of that key as a scalar; for everything else use get_all_metadata() and index into the resulting array. The toolkit wraps repeated keys so the same call works whether a key appears once or many times.

Audit the produced blocks

Before we ship the converted post into a WXR file, the importer should sanity-check what came out. Did Markdown conversion produce blocks the destination site can render? Are there headings out of order? Are there blocks the importer doesn't know how to handle? WP_Block_Parser, the same parser WordPress core uses, turns the block markup into a structured tree.

Two patterns to keep. The flat counter (a queue that walks innerBlocks) answers any "how many" or "does it use" question. The level checker is a domain-specific rule — accessibility wants no jumps in heading depth — but every audit you'll write follows the same shape: walk the tree, gate by blockName, ask the question. The reference page for BlockParser covers both patterns in more depth.
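The two patterns can be sketched in plain PHP against the array shape WP_Block_Parser::parse() returns (blockName, attrs, innerBlocks). This is a minimal illustration, not the toolkit's own listing: the sample tree is hand-written so the code runs without WordPress loaded, and the function names count_block_names(), heading_levels(), and find_heading_jumps() are hypothetical. The level checker walks depth-first rather than through the queue, because heading order only makes sense in document order.

```php
<?php
// Hand-written stand-in for the array WP_Block_Parser::parse() returns.
// Only the keys this audit reads are included; real output also carries
// innerHTML and innerContent.
$blocks = [
    [ 'blockName' => 'core/heading', 'attrs' => [ 'level' => 1 ], 'innerBlocks' => [] ],
    [ 'blockName' => 'core/paragraph', 'attrs' => [], 'innerBlocks' => [] ],
    [ 'blockName' => null, 'attrs' => [], 'innerBlocks' => [] ], // whitespace between blocks
    [
        'blockName'   => 'core/quote',
        'attrs'       => [],
        'innerBlocks' => [
            [ 'blockName' => 'core/paragraph', 'attrs' => [], 'innerBlocks' => [] ],
        ],
    ],
    [ 'blockName' => 'core/heading', 'attrs' => [ 'level' => 3 ], 'innerBlocks' => [] ],
];

/** Flat counter: a queue that walks innerBlocks and tallies by blockName. */
function count_block_names( array $blocks ): array {
    $counts = [];
    $queue  = $blocks;
    while ( $queue ) {
        $block = array_shift( $queue );
        foreach ( $block['innerBlocks'] ?? [] as $inner ) {
            $queue[] = $inner;
        }
        $name = $block['blockName'] ?? null;
        if ( null === $name ) {
            continue; // freeform text or whitespace, not a named block
        }
        $counts[ $name ] = ( $counts[ $name ] ?? 0 ) + 1;
    }
    return $counts;
}

/** Collect heading levels depth-first, so they come back in document order. */
function heading_levels( array $blocks ): array {
    $levels = [];
    foreach ( $blocks as $block ) {
        if ( 'core/heading' === ( $block['blockName'] ?? null ) ) {
            // core/heading omits the level attribute when it is the default, h2.
            $levels[] = (int) ( $block['attrs']['level'] ?? 2 );
        }
        $levels = array_merge( $levels, heading_levels( $block['innerBlocks'] ?? [] ) );
    }
    return $levels;
}

/** Report every place where the heading depth increases by more than one. */
function find_heading_jumps( array $levels ): array {
    $jumps    = [];
    $previous = null; // the first heading sets the baseline and is never flagged
    foreach ( $levels as $level ) {
        if ( null !== $previous && $level > $previous + 1 ) {
            $jumps[] = "h{$previous} -> h{$level}";
        }
        $previous = $level;
    }
    return $jumps;
}

$counts = count_block_names( $blocks );
$jumps  = find_heading_jumps( heading_levels( $blocks ) );
```

With the sample tree, $counts reports two headings, two paragraphs, and one quote, and $jumps contains the single entry "h1 -> h3". In the importer, the hand-written $blocks array would be replaced by the parser's output for the converted post.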

Apply chapter 1's cleaner inside the blocks

Block markup is HTML delimited by HTML comments. Chapter 1's clean_post_html() takes an HTML string and returns a clean one, so we can run it on the whole block-markup string in one pass: WP_HTML_Tag_Processor is happy to walk the HTML between the block-comment delimiters. The block comments themselves don't look like tags to the processor, so they pass through untouched.

Notice how the block-comment delimiters survived the walk verbatim. The Tag Processor only sees real HTML tags; comments and text aren't tags to it. That's why combining the two components here works without any special-casing: Markdown produces block markup, the cleaner walks the HTML inside it, and the comments pass through as plain bytes.

Stream a WXR file with DataLiberation

We have post titles, post content (clean block markup), and post metadata. WXR (WordPress eXtended RSS) is an XML dialect with a fixed shape. DataLiberation's WXRWriter takes ImportEntity objects and streams them into a byte sink, one entity at a time. If you need direct compatibility with the WordPress importer plugin, include its required channel metadata such as wp:wxr_version around this stream.

The writer holds only what it needs to close currently-open XML tags and emit the current entity. Every append_entity() writes one item to the underlying byte sink and forgets it, so large exports should write to a file-backed stream instead of accumulating the final XML in memory.

Note. The byte sink here is MemoryPipe for the example. In the real importer you'd pass FileWriteStream::from_path( 'export.xml', 'truncate' ) and the WXR would stream straight to disk without ever existing as a single in-memory string.
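To make the "write one item and forget it" idea concrete, here is a minimal sketch of the same principle using only PHP's built-in streams. It is not WXRWriter and does not use the toolkit's API: stream_items(), cdata(), and sample_posts() are hypothetical names, the posts are plain arrays, and the output omits the channel metadata (such as wp:wxr_version) that a complete WXR file needs. The element names title, content:encoded, and wp:post_name are the ones WXR uses for a post's title, body, and slug.

```php
<?php
/** Wrap text in CDATA, splitting any "]]>" so it cannot end the section early. */
function cdata( string $text ): string {
    return '<![CDATA[' . str_replace( ']]>', ']]]]><![CDATA[>', $text ) . ']]>';
}

/**
 * Write one <item> per post to $sink as soon as the post is produced.
 * $posts may be a generator, so only the current post is held in memory.
 * Returns the number of items written.
 */
function stream_items( $sink, iterable $posts ): int {
    fwrite( $sink, '<?xml version="1.0" encoding="UTF-8"?>' . "\n" );
    fwrite(
        $sink,
        '<rss version="2.0"'
        . ' xmlns:content="http://purl.org/rss/1.0/modules/content/"'
        . ' xmlns:wp="http://wordpress.org/export/1.2/">' . "\n<channel>\n"
    );

    $count = 0;
    foreach ( $posts as $post ) {
        fwrite(
            $sink,
            "<item>\n"
            . '<title>' . htmlspecialchars( $post['title'], ENT_XML1 | ENT_QUOTES, 'UTF-8' ) . "</title>\n"
            . '<content:encoded>' . cdata( $post['content'] ) . "</content:encoded>\n"
            . '<wp:post_name>' . cdata( $post['slug'] ) . "</wp:post_name>\n"
            . "</item>\n"
        );
        $count++;
    }

    fwrite( $sink, "</channel>\n</rss>\n" );
    return $count;
}

/** Hypothetical source: yields posts one at a time instead of building a list. */
function sample_posts(): Generator {
    yield [
        'title'   => 'Sourdough & rye',
        'slug'    => 'sourdough-rye',
        'content' => '<!-- wp:paragraph --><p>Flour, water, salt.</p><!-- /wp:paragraph -->',
    ];
    yield [
        'title'   => 'Focaccia',
        'slug'    => 'focaccia',
        'content' => '<!-- wp:paragraph --><p>Olive oil.</p><!-- /wp:paragraph -->',
    ];
}

// php://temp spills to a temporary file once it passes 2 MB.
// For a real export, open a file instead: fopen( 'export.xml', 'wb' ).
$sink    = fopen( 'php://temp', 'w+b' );
$written = stream_items( $sink, sample_posts() );

// Reading the stream back is only for inspecting this small demo.
rewind( $sink );
$xml = stream_get_contents( $sink );
fclose( $sink );
```

The design point is the same one the writer makes: the function never concatenates the document, so peak memory depends on the largest single post rather than on the size of the export.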

End-to-end: Markdown folder to WXR file

Now we wire it all together. The pipeline reads the staged Markdown files from chapter 2, converts each to block markup, cleans the HTML inside it, builds an ImportEntity with title and slug from frontmatter, and streams the whole thing into a WXR document. This is the importer's first complete end-to-end run:

One pass. Three components composed (Markdown for parsing, HTML for cleaning, DataLiberation for WXR), each doing one thing well. The output is real WXR — drop it on a WordPress site through the importer plugin and you get three published posts with the cleaned content, the right slugs, and the frontmatter titles.

Try this: add a fourth Markdown post with an onclick handler in raw inline HTML. Run. The handler is gone in the WXR output — the cleaner is doing its job inside block markup the same way it did on a bare HTML string in chapter 1.

Refinement: rewrite URLs across an existing WXR

The pattern above (build WXR from sources) is one half of DataLiberation. The other half is reading and transforming an existing WXR. WXREntityReader emits one entity at a time from a WXR document, and you can wire it to a WXRWriter to produce a transformed copy:

The same pattern handles every "transform an export between sites" job — staging-to-production URL rewrites, theme migrations, slug normalization. Reader on the left, writer on the right, your transformation in the middle. Feed the reader bytes incrementally (instead of append_bytes( $source ) all at once) and pipe the writer to a file sink (instead of MemoryPipe), and the same code can process large exports with memory dominated by the current entity and the buffers you choose.

Note — asymmetric field names. WXREntityReader emits the post body under post_content (matching WordPress's database column); WXRWriter reads it under content (matching its own internal mapping). Pipelines that read on one side and write on the other have to copy the value across, as the example does.
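The "transformation in the middle", including the field-name copy described in the note, might look like the sketch below. It deliberately leaves out the reader and writer and works on plain arrays, which is an assumption: the toolkit's entity objects may expose their fields differently. The names rewrite_origin() and transform_post() are hypothetical, and the regular expression is a simple stand-in for a proper URL-aware rewriter; only the post_content and content keys come from the text above.

```php
<?php
/**
 * Replace one origin with another, but only where the match ends at a URL
 * boundary, so "https://staging.example.com.evil.test" is left alone.
 */
function rewrite_origin( string $text, string $from, string $to ): string {
    $pattern = '~' . preg_quote( $from, '~' ) . '(?=[/?#"\'\s<]|$)~';
    // A callback keeps "$" or "\" in $to from being read as backreferences.
    $result = preg_replace_callback(
        $pattern,
        static fn( array $match ): string => $to,
        $text
    );
    return $result ?? $text; // preg_* returns null on error; keep the input then
}

/**
 * Takes the fields a reader emitted and returns the fields a writer expects.
 * Keys other than the post body pass through unchanged.
 */
function transform_post( array $read, string $from, string $to ): array {
    $out = $read;
    // Reader side names the body post_content; writer side expects content.
    $out['content'] = rewrite_origin( $read['post_content'] ?? '', $from, $to );
    unset( $out['post_content'] );
    return $out;
}

// Illustrative input; the "title" key is a placeholder, not a confirmed field name.
$read = [
    'title'        => 'Bread',
    'post_content' => '<!-- wp:image --><figure class="wp-block-image">'
        . '<img src="https://staging.example.com/uploads/bread.jpg" alt=""/>'
        . '</figure><!-- /wp:image -->',
];

$written = transform_post( $read, 'https://staging.example.com', 'https://example.com' );
```

In a full pipeline this function would be called once per entity between the reader's emit step and the writer's append step, so it never sees more than one post at a time.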

Recap

You can now:

Convert Markdown plus YAML frontmatter into block markup with MarkdownConsumer.
Walk the produced block tree with WP_Block_Parser to count, audit, or rewrite blocks.
Apply HTML rewrites to block markup without breaking the surrounding block comments.
Stream a WXR document with WXRWriter while holding only the current entity and writer state.
Read an existing WXR with WXREntityReader and pipe its entities through a transformation into a new WXR.

The importer is now functionally complete for text content. What's missing is the network — when a Markdown post references ![](https://cdn.example.com/bread.jpg), the destination site doesn't have that image. Chapter 4 fixes that.

In chapter 4 the importer learns to fetch the images referenced from imported posts: ten downloads at a time, with progress reporting, ranged-resume on partial failures, and the option to mount a remote ZIP without downloading it first. The HttpClient component meets the importer.