Recap and where to go next
Across four chapters you built a working content importer. It reads a ZIP of Markdown posts, cleans the HTML inside each one, frontloads referenced images over HTTP, and streams a WXR-style export. Feeding that stream directly to the WordPress importer requires wrapping it in the channel metadata the importer expects. The path avoids libzip, libxml2, and a mandatory curl dependency, and runs on PHP 7.2+ as long as the optional features you choose are available.
What you built
| Chapter | Work covered |
| --- | --- |
| Chapter 1 | clean_post_html() using WP_HTML_Tag_Processor: lazy-load images, rewrite URLs, neutralize scripts, all in one pass. |
| Chapter 2 | Read the input ZIP through ZipFilesystem, stage it in InMemoryFilesystem, defend against zip-slip with ZipDecoder::sanitize_path(). |
| Chapter 3 | Convert each post with MarkdownConsumer, audit the output with WP_Block_Parser, stream the WXR with WXRWriter. |
| Chapter 4 | Frontload images with HttpClient through a sliding-window event loop; mount remote archives with SeekableRequestReadStream. |
What the toolkit does that the tutorial didn't touch
The importer used eight components. The toolkit ships eighteen. Here's what's left, with the use case each one shows up in:
- Git — snapshot your importer's runs into a PHP-backed Git repository for revision history. Useful for "what changed between last week's import and this week's." Reference →
- Merge — three-way diff and merge for content sync. If posts are edited on both the source and the destination side, this is how you reconcile them. Reference →
- HttpServer — a tiny local listening port for OAuth callbacks during a CLI workflow, fixture servers for HttpClient tests, or a status page during a long import. Not for production traffic. Reference →
- CORSProxy — when you ship the importer as a browser tool, a server-side proxy to fetch URLs that don't send the right CORS headers. Reference →
- CLI — POSIX-style argument parser to wrap your importer as importer.php --site-url=… --dry-run. Reference →
- Encoding — UTF-8 validation and scrubbing for inputs that may contain mixed encodings. Most importers eventually need it. Reference →
- XML — the cursor-based XML processor underneath DataLiberation; reach for it directly when you need to walk export-sized files. Reference →
- Blueprints — declarative site setup. Spin up the destination WordPress with the right plugins and options before running the importer against it. Reference →
- Polyfill — WordPress-shaped helpers (esc_html, add_filter, __) so toolkit code can run outside WordPress without ifdefs. Reference →
- ToolkitCodingStandards — PHPCS sniffs encoding the project's review feedback as enforceable rules. Borrow if your project follows WordPress style. Reference →
Patterns worth keeping
Three shapes recurred across the tutorial. Watch for them in your own code:
Cursor over a string
WP_HTML_Tag_Processor walks a string forward, records edits as a side-buffer of byte-range replacements, and emits the modified string only when you call get_updated_html(). The result is byte-honest — bytes you didn't edit come through bit-identical. When you need to make small changes to large markup, that property is gold. The XML component's XMLProcessor applies the same pattern to XML.
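The side-buffer idea can be sketched in a few lines. This is a toy, not the processor's internals: ByteRangeEditor is a hypothetical class invented for illustration, it assumes edits never overlap, and a plain strpos() scan stands in for real HTML parsing.

```php
<?php
// Hypothetical illustration class (not part of the toolkit or WordPress).
// It records byte-range replacements and builds the output only on request.
final class ByteRangeEditor {
    private $source;
    private $edits = array(); // Each entry: array( start, length, replacement ).

    public function __construct( string $source ) {
        $this->source = $source;
    }

    // Record an edit without touching the source string.
    public function replace( int $start, int $length, string $text ): void {
        $this->edits[] = array( $start, $length, $text );
    }

    // Apply recorded edits in offset order. Assumes edits do not overlap.
    public function get_updated(): string {
        $edits = $this->edits;
        usort( $edits, function ( $a, $b ) {
            return $a[0] <=> $b[0];
        } );

        $out = '';
        $at  = 0;
        foreach ( $edits as $edit ) {
            list( $start, $length, $text ) = $edit;
            // Untouched bytes are copied through verbatim.
            $out .= substr( $this->source, $at, $start - $at );
            $out .= $text;
            $at   = $start + $length;
        }
        return $out . substr( $this->source, $at );
    }
}

$html   = '<img src="http://old.example/a.png"><p>kept as-is</p><img src="http://old.example/b.png">';
$needle = 'http://old.example';
$editor = new ByteRangeEditor( $html );

// Walk forward through the string, queuing one edit per match.
$offset = 0;
while ( false !== ( $pos = strpos( $html, $needle, $offset ) ) ) {
    $editor->replace( $pos, strlen( $needle ), 'https://new.example' );
    $offset = $pos + strlen( $needle );
}

$updated = $editor->get_updated();
echo $updated, "\n";
```

Because the source string is never mutated, offsets recorded early in the walk stay valid no matter how many later edits change the output length.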
Pull / consume streams
ZipFilesystem::open_read_stream(), HttpClient response bodies, InflateReadStream, and the rest all share the same shape: pull(N) reads up to N bytes from the underlying source into an internal buffer and returns how many ended up there; consume(N) reads N bytes from that buffer and advances past them. Memory is governed by the stream buffers and chunk sizes you choose, not by a requirement to load the whole file. Once you internalize this loop you can compose any byte source with any byte sink.
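The loop described above might look like the following sketch. ToyReadStream is a hypothetical class written for this example; it borrows the pull/consume names from the text, but its signatures and return values are assumptions and may not match the toolkit's actual stream interface. Here pull() returns the number of bytes it just buffered, and the loop consumes everything after each pull.

```php
<?php
// Hypothetical toy stream (not the toolkit's class): wraps a PHP stream
// resource and exposes the pull/consume shape described in the text.
final class ToyReadStream {
    private $handle;
    private $buffer = '';

    public function __construct( $handle ) {
        $this->handle = $handle;
    }

    // Read up to $max_bytes from the source into the internal buffer.
    // Returns how many bytes this call added (0 at end of input).
    public function pull( int $max_bytes ): int {
        $chunk = fread( $this->handle, $max_bytes );
        if ( false === $chunk ) {
            return 0;
        }
        $this->buffer .= $chunk;
        return strlen( $chunk );
    }

    // Hand back up to $bytes from the front of the buffer and advance past them.
    public function consume( int $bytes ): string {
        $out          = (string) substr( $this->buffer, 0, $bytes );
        $this->buffer = (string) substr( $this->buffer, strlen( $out ) );
        return $out;
    }
}

// An in-memory source and sink stand in for a ZIP entry and an output file.
$source = fopen( 'php://memory', 'w+' );
fwrite( $source, 'streams compose nicely' );
rewind( $source );
$sink = fopen( 'php://memory', 'w+' );

$stream = new ToyReadStream( $source );
$chunks = 0;
// Memory use is bounded by the chunk size (4 bytes here), not the input size.
while ( ( $available = $stream->pull( 4 ) ) > 0 ) {
    fwrite( $sink, $stream->consume( $available ) );
    $chunks++;
}

rewind( $sink );
$copied = stream_get_contents( $sink );
echo $copied, "\n";
```

Swapping the source for a network socket or the sink for a hashing context would leave the loop itself unchanged, which is the composability the pattern is after.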
One interface, multiple backends
Code that takes a Filesystem rather than a path doesn't care if the filesystem is on disk, in memory, in a SQLite database, or inside a ZIP. That's how the importer's staging step works in both production (memory) and debugging (local disk) without a code change. The same pattern shows up in HttpClient (curl vs. sockets transport) and ByteStream (file, memory, deflate, and hash streams all implementing the same byte-stream interface).
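As a rough sketch of the idea, the snippet below defines a hypothetical ToyFilesystem interface with two backends and one function that works against either. The interface, class names, and method names are invented for this example and are not the toolkit's Filesystem API; the disk backend also skips the path sanitization a real importer needs.

```php
<?php
// Hypothetical interface for illustration; the toolkit's own Filesystem
// interface may expose different methods.
interface ToyFilesystem {
    public function put_contents( string $path, string $data ): void;
    public function get_contents( string $path ): string;
}

// Backend 1: everything lives in a PHP array.
final class ToyMemoryFilesystem implements ToyFilesystem {
    private $files = array();

    public function put_contents( string $path, string $data ): void {
        $this->files[ $path ] = $data;
    }

    public function get_contents( string $path ): string {
        if ( ! isset( $this->files[ $path ] ) ) {
            throw new RuntimeException( "No such file: $path" );
        }
        return $this->files[ $path ];
    }
}

// Backend 2: files live under a root directory on local disk.
// Note: no zip-slip style path checks here; this is only a sketch.
final class ToyDiskFilesystem implements ToyFilesystem {
    private $root;

    public function __construct( string $root ) {
        $this->root = rtrim( $root, '/' );
    }

    public function put_contents( string $path, string $data ): void {
        $full = $this->root . '/' . ltrim( $path, '/' );
        $dir  = dirname( $full );
        if ( ! is_dir( $dir ) ) {
            mkdir( $dir, 0777, true );
        }
        file_put_contents( $full, $data );
    }

    public function get_contents( string $path ): string {
        $data = file_get_contents( $this->root . '/' . ltrim( $path, '/' ) );
        if ( false === $data ) {
            throw new RuntimeException( "No such file: $path" );
        }
        return $data;
    }
}

// The staging code depends only on the interface, never on a backend.
function stage_post( ToyFilesystem $fs, string $slug, string $markdown ): string {
    $path = 'posts/' . $slug . '.md';
    $fs->put_contents( $path, $markdown );
    return $fs->get_contents( $path );
}

$from_memory = stage_post( new ToyMemoryFilesystem(), 'hello', "# Hello\n" );
$disk_root   = sys_get_temp_dir() . '/toy-fs-' . uniqid();
$from_disk   = stage_post( new ToyDiskFilesystem( $disk_root ), 'hello', "# Hello\n" );

var_dump( $from_memory === $from_disk );
```

Switching between the two backends is a one-line change at the call site, which is what makes the memory-for-production, disk-for-debugging setup cheap.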