PHP Toolkit

Streaming archives

In chapter 1 you wrote clean_post_html(), a function that takes one HTML string and returns a clean one. The importer needs to run that on a thousand posts at a time. In this chapter the input becomes a real ZIP file: a folder of Markdown posts that you can't always extract to disk — the host might not give you a writable scratch directory, the runtime might not have a persistent filesystem at all. We'll wrap the archive as a Filesystem, read entries one at a time, and stage them in memory for chapter 3 to import.

Why we don't extract to disk

The naive approach to importing a ZIP of posts is $zip->extractTo('/tmp/staging'), then walk the directory. That's fine if you control the host. The toolkit's whole point is that you often don't. Shared hosts ration disk quota; WebAssembly runtimes have no persistent disk; Docker containers running as non-root may not be able to write where you'd like.

We sidestep the issue by never extracting. ZipFilesystem reads entry data on demand directly from the archive bytes, and an InMemoryFilesystem gives us a place to stage results that vanishes when the process ends. The importer reads from one and writes to the other; the disk is never involved.

Open the input ZIP as a filesystem

The ZIP component's highest-level type is ZipFilesystem — an archive presented through the same Filesystem interface that InMemoryFilesystem and LocalFilesystem implement. Once you've wrapped it, you call get_contents(), ls(), and is_dir() the same way you would on disk:
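
Here is that snippet as a sketch. The read part and the loop use only calls this chapter names (ZipFilesystem::create(), FileReadStream::from_path(), ls(), is_dir(), get_contents()); the build part assumes a shape for ZipEncoder, FileEntry, and MemoryPipe, so treat append(), finish(), and get_bytes() as illustrative names.

```php
// Build-the-archive scaffolding: an assumed API shape for ZipEncoder,
// FileEntry, and MemoryPipe. In the real importer $path comes from $argv.
$pipe    = new MemoryPipe();
$encoder = new ZipEncoder( $pipe );
$encoder->append( new FileEntry( 'posts/hello-world.md', "# Hello, world\n" ) );
$encoder->append( new FileEntry( 'posts/second-post.md', "# A second post\n" ) );
$encoder->finish();

$path = tempnam( sys_get_temp_dir(), 'posts' ) . '.zip';
file_put_contents( $path, $pipe->get_bytes() );

// The read part: one line wraps the archive bytes as a Filesystem.
$zip = ZipFilesystem::create( FileReadStream::from_path( $path ) );

// Walk the entries; bytes are inflated on demand, never extracted to disk.
// Assumes ls() yields entry names relative to the directory you pass it.
foreach ( $zip->ls( '/posts' ) as $name ) {
    $entry_path = '/posts/' . $name;
    if ( $zip->is_dir( $entry_path ) ) {
        continue;
    }
    echo $entry_path, ":\n", $zip->get_contents( $entry_path ), "\n";
}
```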

Three things matter in that snippet. The build-the-archive part (ZipEncoder, FileEntry, MemoryPipe) is scaffolding so the example runs end-to-end; in your real importer the ZIP comes from argv. The read part is one line: ZipFilesystem::create( FileReadStream::from_path( $path ) ) wraps the archive bytes and gives you the interface. And the loop reads each entry's contents, but doesn't extract — the bytes get inflated on demand and discarded after we're done with them. Memory stays flat regardless of how big the archive is.

Stream a large entry without buffering it

For our small Markdown posts, get_contents() is fine. But the input might also include a data.csv with twenty thousand rows of metadata, or a large JSON file describing categories. open_read_stream() returns a pull-based byte reader instead of a buffered string, so you can process the entry chunk-by-chunk:
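
A sketch of that loop, counting CSV rows. pull() and consume() are the calls the toolkit names; the exact return contract of pull() (a byte count, falsy at end of stream) is an assumption here.

```php
// $zip is the ZipFilesystem from the previous snippet.
$reader = $zip->open_read_stream( '/data.csv' );

$carry = ''; // trailing partial line, carried into the next iteration
$rows  = 0;

// Assumed contract: pull( $n ) buffers up to $n bytes and returns how many
// are available (falsy at end of stream); consume( $n ) returns those bytes
// and advances the stream.
while ( $available = $reader->pull( 8192 ) ) {
    $chunk = $carry . $reader->consume( $available );
    $lines = explode( "\n", $chunk );
    $carry = array_pop( $lines ); // may be incomplete; finished on the next pull
    $rows += count( $lines );     // each complete line is one CSV row
}
if ( '' !== $carry ) {
    ++$rows; // the final row had no trailing newline
}

echo "Imported metadata rows: $rows\n";
```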

That pull-loop is the same shape every byte stream in the toolkit uses. pull(8192) means "buffer up to 8 KB"; consume($n) reads and advances. The trailing partial line gets carried into the next iteration. Memory used is the chunk size plus one partial line — the same regardless of whether the file is 50 KB or 5 GB.

Stage the imports in memory

Now we connect the two halves. The input is the ZIP we just opened. The staging area is an InMemoryFilesystem — same interface, no disk. Walking the input and copying into the stage is one helper:
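
A sketch of that helper. ls(), is_dir(), get_contents(), and ZipDecoder::sanitize_path() are named in this chapter; put_contents() as the writer-side call, and the assumption that the in-memory filesystem creates parent directories implicitly, are mine.

```php
// Walk a read-only source filesystem and mirror every file into a
// writable destination. Both sides are just Filesystem.
function copy_between_filesystems( Filesystem $from, Filesystem $to, string $dir = '/' ): void {
    foreach ( $from->ls( $dir ) as $name ) {
        $path = rtrim( $dir, '/' ) . '/' . $name;
        if ( $from->is_dir( $path ) ) {
            copy_between_filesystems( $from, $to, $path );
            continue;
        }
        // Sanitize before using an archive path as a destination key
        // (the zip-slip defense covered in the next section).
        $safe = ZipDecoder::sanitize_path( $path );
        // put_contents() is an assumed writer method; parent directories
        // are assumed to be created implicitly.
        $to->put_contents( '/' . ltrim( $safe, '/' ), $from->get_contents( $path ) );
    }
}

// Stage the whole input ZIP in memory; the disk is never involved.
$stage = new InMemoryFilesystem();
copy_between_filesystems( $zip, $stage );
```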

Read that example carefully because it's the heart of how the importer composes. The input is read-only (a ZIP) and the output is writable (in-memory). Both expose the same interface, so a generic copy_between_filesystems() works on both. In chapter 3 we'll iterate the staged Markdown files and convert them; in chapter 4 we'll add downloaded media to the same stage. The shape doesn't change between chapters — only what's in the stage.

Defend against malicious archive paths

Every importer that accepts external ZIPs needs to defend against zip-slip: an archive containing an entry named ../../etc/passwd that, if extracted naively, writes outside the intended destination. The toolkit ships a one-line defense:
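
The defense in place, sketched with a hypothetical hostile entry name (put_contents() is again an assumed writer method):

```php
// A hostile archive entry trying to escape the destination root.
$entry = '../../etc/passwd';

// The one line: normalize the path before using it as a destination key.
$safe = ZipDecoder::sanitize_path( $entry );

// Reading the raw name from the archive itself is fine; it is only the
// destination key that must be sanitized. Whatever $safe is, it can no
// longer climb above the root.
$stage->put_contents( '/' . ltrim( $safe, '/' ), $zip->get_contents( $entry ) );
```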

Run any entry path through ZipDecoder::sanitize_path() before using it as a key in your destination filesystem. copy_between_filesystems() already does this; if you build your own loop you must too.

Folding it into the importer

The importer so far has chapter 1's clean_post_html() and chapter 2's stage. Combine them: open the input ZIP, copy it into the stage, then iterate the stage's posts/ directory; clean_post_html() gets applied when we render in chapter 3. We're not invoking it yet, because the Markdown-to-HTML conversion is chapter 3's job, but we can already see the shape:
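
A sketch of that shape, built from the three functions described below. Everything here is either named in this chapter or sketched earlier; the posts/ layout and the $argv wiring are assumptions.

```php
// One line; exists mostly for readability.
function open_input_zip( string $path ): Filesystem {
    return ZipFilesystem::create( FileReadStream::from_path( $path ) );
}

// The composition we just built: input ZIP in, in-memory stage out.
function stage_input( Filesystem $input ): Filesystem {
    $stage = new InMemoryFilesystem();
    copy_between_filesystems( $input, $stage );
    return $stage;
}

// A generator, so callers iterate posts without holding them all at once.
function each_post( Filesystem $stage ): Generator {
    foreach ( $stage->ls( '/posts' ) as $name ) {
        $path = '/posts/' . $name;
        if ( ! $stage->is_dir( $path ) ) {
            yield $path => $stage->get_contents( $path );
        }
    }
}

// The shape so far. Chapter 3 converts $markdown to HTML and runs
// clean_post_html() on the result; nothing is invoked here yet.
$stage = stage_input( open_input_zip( $argv[1] ) );
foreach ( each_post( $stage ) as $path => $markdown ) {
    // chapter 3's conversion goes here
}
```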

Three small functions, each with a single job. open_input_zip() is one line and exists mostly for readability. stage_input() is the composition we just built. each_post() is a generator so the caller can iterate without loading every post's text at once. The signatures take the abstract Filesystem type, not InMemoryFilesystem, which means a future version of the importer that stages on disk for a debugging session would not need any code change.

Recap

You can now:

- open a ZIP archive as a read-only Filesystem and read entries on demand, without extracting to disk;
- stream a large entry chunk-by-chunk with open_read_stream() instead of buffering it with get_contents();
- stage files in an InMemoryFilesystem behind the same Filesystem interface;
- sanitize archive entry paths with ZipDecoder::sanitize_path() to defend against zip-slip.

The stage is now ready to feed chapter 3, where the Markdown-to-blocks conversion actually happens.

In chapter 3 we'll turn each Markdown post into WordPress block markup, run it through clean_post_html(), and stream the whole thing into a WXR file the WordPress importer plugin will accept. Three more components — Markdown, BlockParser, and DataLiberation — finally meet the importer.