PHP Toolkit

Recap and where to go next

Across four chapters you built a working content importer. It reads a ZIP of Markdown posts, cleans the HTML inside each one, frontloads referenced images over HTTP, and streams a WXR file the WordPress importer plugin will accept. None of it required curl, libzip, libxml2, or DOMDocument; all of it runs on PHP 7.2 through 8.3 and inside a browser via WordPress Playground.

What you built

Chapter 1. clean_post_html() using WP_HTML_Tag_Processor: lazy-load images, rewrite URLs, strip scripts, all in one pass.
Chapter 2. Read the input ZIP through ZipFilesystem, stage it in InMemoryFilesystem, defend against zip-slip with ZipDecoder::sanitize_path().
Chapter 3. Convert each post with MarkdownConsumer, audit the output with WP_Block_Parser, stream the WXR with WXRWriter.
Chapter 4. Frontload images with HttpClient through a sliding-window event loop; mount remote archives with SeekableRequestReadStream.
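The zip-slip defense from Chapter 2 is worth pausing on. A sanitizer's job is to refuse any archive entry whose path could escape the extraction root. The sketch below is a hypothetical stand-in, not the toolkit's actual ZipDecoder::sanitize_path() implementation; the function name and return convention are illustrative.

```php
<?php
// Hypothetical sketch of a zip-slip defense, not the toolkit's actual
// ZipDecoder::sanitize_path(): return a safe relative path, or null
// when the entry could escape the extraction root.
function sanitize_zip_path( $path ) {
	// Normalize separators so "..\evil" is treated like "../evil".
	$path = str_replace( '\\', '/', $path );

	// Absolute paths would escape the extraction root outright.
	if ( '' === $path || '/' === $path[0] ) {
		return null;
	}

	// Resolve the path segment by segment, refusing to climb above root.
	$resolved = array();
	foreach ( explode( '/', $path ) as $segment ) {
		if ( '' === $segment || '.' === $segment ) {
			continue;
		}
		if ( '..' === $segment ) {
			if ( empty( $resolved ) ) {
				return null; // Traversal above the extraction root.
			}
			array_pop( $resolved );
			continue;
		}
		$resolved[] = $segment;
	}

	return implode( '/', $resolved );
}
```

The key design choice is resolving `..` yourself instead of calling realpath(), which would touch the real filesystem and fail on paths that don't exist yet.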

What the toolkit does that the tutorial didn't touch

The importer used eight components. The toolkit ships eighteen. Here's what's left, with the use case each one shows up in:

Patterns worth keeping

Three shapes recurred across the tutorial. Watch for them in your own code:

Cursor over a string

WP_HTML_Tag_Processor walks a string forward, records edits as a side-buffer of byte-range replacements, and emits the modified string only when you call get_updated_html(). The result is byte-honest — bytes you didn't edit come through bit-identical. When you need to make small changes to large markup, that property is gold. The XML component's XMLProcessor applies the same pattern to XML.

Pull / consume streams

ZipFilesystem::open_read_stream(), HttpClient response bodies, InflateReadStream, and the rest all share the same shape: pull(N) reads up to N bytes from the underlying source into an internal buffer and returns how many ended up there; consume(N) reads N bytes from that buffer and advances past them. Memory used is bounded by the chunk size, never by the file size. Once you internalize this loop you can compose any byte source with any byte sink.
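A hedged sketch of that loop, using an in-memory string as the "underlying source" so it stays self-contained. The class name is invented; the toolkit's real streams share this pull/consume interface but read from files, sockets, or inflate contexts.

```php
<?php
// Illustrative pull/consume stream over an in-memory string.
// pull(N) moves up to N bytes from the source into an internal buffer
// and reports how many arrived; consume(N) drains the buffer.
class StringReadStreamSketch {
	private $source;
	private $offset = 0;
	private $buffer = '';

	public function __construct( $source ) {
		$this->source = $source;
	}

	// Read up to $n bytes from the source into the buffer; return the count.
	public function pull( $n ) {
		$chunk = (string) substr( $this->source, $this->offset, $n );
		$this->offset += strlen( $chunk );
		$this->buffer .= $chunk;
		return strlen( $chunk );
	}

	// Take up to $n bytes out of the buffer and advance past them.
	public function consume( $n ) {
		$bytes        = (string) substr( $this->buffer, 0, $n );
		$this->buffer = (string) substr( $this->buffer, strlen( $bytes ) );
		return $bytes;
	}
}

// The composition loop: memory use is bounded by $chunk_size,
// never by the size of the source.
function copy_stream( StringReadStreamSketch $in, $chunk_size = 4 ) {
	$out = '';
	while ( $in->pull( $chunk_size ) > 0 ) {
		$out .= $in->consume( $chunk_size );
	}
	return $out;
}
```

Swap the string source for a socket or a deflate context and copy_stream() doesn't change; that is the composability the paragraph above describes.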

One interface, multiple backends

Code that takes a Filesystem rather than a path doesn't care whether the filesystem lives on disk, in memory, in a SQLite database, or inside a ZIP. That's how the importer's staging step works in both production (memory) and debugging (local disk) without a code change. The same pattern shows up in HttpClient (curl vs. sockets transports) and ByteStream (file, memory, deflate, and hash backends all implementing the same byte-stream interface).
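A minimal sketch of that split, assuming a hypothetical two-method interface; the toolkit's real Filesystem API is richer than this, and the names below are invented for illustration.

```php
<?php
// Illustrative interface/backend split. The caller depends only on the
// interface, so memory, disk, SQLite, or ZIP backends are interchangeable.
interface FilesystemSketch {
	public function get_contents( $path );
	public function put_contents( $path, $contents );
}

// In-memory backend: files are just entries in an array.
class InMemoryFilesystemSketch implements FilesystemSketch {
	private $files = array();

	public function get_contents( $path ) {
		return isset( $this->files[ $path ] ) ? $this->files[ $path ] : null;
	}

	public function put_contents( $path, $contents ) {
		$this->files[ $path ] = $contents;
	}
}

// Caller code never names a backend. Swapping in a disk-backed
// implementation for debugging requires no changes here.
function stage_post( FilesystemSketch $fs, $slug, $markdown ) {
	$fs->put_contents( "staged/{$slug}.md", $markdown );
}
```

A disk-backed class implementing the same interface with file_get_contents()/file_put_contents() would drop in without touching stage_post().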

Where to go from here

Three honest paths:

  1. Take the importer further. Add a --dry-run flag with the CLI component. Snapshot each run into a Git repository so you can diff between imports. Wrap it in a CORSProxy-fronted browser tool. Each of those is a one-component addition; the structure you have already accommodates them.
  2. Pick a single component and go deep. The reference pages all have refinements past the minimal example — bookmarks and breadcrumbs in HTML, three-way merges in Git, sliding windows and resumable downloads in HttpClient. The depth is there when the project asks for it.
  3. Read the source. Each component lives under components/<Name>/. components/HTML/class-wp-html-tag-processor.php is the same code WordPress core ships in wp-includes/html-api/; components/Zip/class-zipdecoder.php is a clean implementation of the parts of the ZIP spec that the toolkit actually uses. The code is written to be read.