PHP Toolkit

Streaming archives

In chapter 1 you wrote clean_post_html(), a function that takes one HTML string and returns a clean one. The importer needs to run that on a thousand posts at a time. In this chapter the input becomes a real ZIP file: a folder of Markdown posts that you can't always extract to disk — the host might not give you a writable scratch directory, the runtime might not have a persistent filesystem at all. We'll wrap the archive as a Filesystem, read entries one at a time, and stage them in memory for chapter 3 to import.

Why we don't extract to disk

The naive approach to importing a ZIP of posts is $zip->extractTo('/tmp/staging'), then walk the directory. That's fine if you control the host. The toolkit's whole point is that you often don't. Shared hosts ration disk quota; WebAssembly runtimes have no persistent disk; Docker containers running as non-root may not be able to write where you'd like.

We sidestep the issue by never extracting. ZipFilesystem reads entry data on demand directly from the archive bytes, and an InMemoryFilesystem gives us a place to stage results that vanishes when the process ends. The importer reads from one and writes to the other; the disk is never involved.

Open the input ZIP as a filesystem

The ZIP component's highest-level type is ZipFilesystem — an archive presented through the same Filesystem interface that InMemoryFilesystem and LocalFilesystem implement. Once you've wrapped it, you call get_contents(), ls(), and is_dir() the same way you would on disk:
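
Here is that snippet as a sketch. The read part and the loop use only calls this chapter names (ZipFilesystem::create(), FileReadStream::from_path(), ls(), is_dir(), get_contents()); the build part assumes a shape for ZipEncoder, FileEntry, and MemoryPipe, so treat append(), finish(), and get_bytes() as illustrative names.

```php
// Build-the-archive scaffolding: an assumed API shape for ZipEncoder,
// FileEntry, and MemoryPipe. In the real importer $path comes from $argv.
$pipe    = new MemoryPipe();
$encoder = new ZipEncoder( $pipe );
$encoder->append( new FileEntry( 'posts/hello-world.md', "# Hello, world\n" ) );
$encoder->append( new FileEntry( 'posts/second-post.md', "# A second post\n" ) );
$encoder->finish();

$path = tempnam( sys_get_temp_dir(), 'posts' ) . '.zip';
file_put_contents( $path, $pipe->get_bytes() );

// The read part: one line wraps the archive bytes as a Filesystem.
$zip = ZipFilesystem::create( FileReadStream::from_path( $path ) );

// Walk the entries; bytes are inflated on demand, never extracted to disk.
// Assumes ls() yields entry names relative to the directory you pass it.
foreach ( $zip->ls( '/posts' ) as $name ) {
    $entry_path = '/posts/' . $name;
    if ( $zip->is_dir( $entry_path ) ) {
        continue;
    }
    echo $entry_path, ":\n", $zip->get_contents( $entry_path ), "\n";
}
```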

Three things matter in that snippet. The build-the-archive part (ZipEncoder, FileEntry, MemoryPipe) is scaffolding so the example runs end-to-end; in your real importer the ZIP comes from argv. The read part is one line: ZipFilesystem::create( FileReadStream::from_path( $path ) ) wraps the archive bytes and gives you the interface. And the loop reads each entry's contents, but doesn't extract — the bytes get inflated on demand and discarded after we're done with them. Memory stays flat regardless of how big the archive is.

Stream a large entry without buffering it

For our small Markdown posts, get_contents() is fine. But the input might also include a data.csv with twenty thousand rows of metadata, or a large JSON file describing categories. open_read_stream() returns a pull-based byte reader instead of a buffered string, so you can process the entry chunk-by-chunk:
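
A sketch of that loop, counting CSV rows. pull() and consume() are the calls the toolkit names; the exact return contract of pull() (a byte count, falsy at end of stream) is an assumption here.

```php
// $zip is the ZipFilesystem from the previous snippet.
$reader = $zip->open_read_stream( '/data.csv' );

$carry = ''; // trailing partial line, carried into the next iteration
$rows  = 0;

// Assumed contract: pull( $n ) buffers up to $n bytes and returns how many
// are available (falsy at end of stream); consume( $n ) returns those bytes
// and advances the stream.
while ( $available = $reader->pull( 8192 ) ) {
    $chunk = $carry . $reader->consume( $available );
    $lines = explode( "\n", $chunk );
    $carry = array_pop( $lines ); // may be incomplete; finished on the next pull
    $rows += count( $lines );     // each complete line is one CSV row
}
if ( '' !== $carry ) {
    ++$rows; // the final row had no trailing newline
}

echo "Imported metadata rows: $rows\n";
```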

That pull-loop is the same shape every byte stream in the toolkit uses. pull(8192) means "buffer up to 8 KB"; consume($n) reads and advances. The trailing partial line gets carried into the next iteration. Memory used is the chunk size plus one partial line — the same regardless of whether the file is 50 KB or 5 GB.

Stage the imports in memory

Now we connect the two halves. The input is the ZIP we just opened. The staging area is an InMemoryFilesystem — same interface, no disk. Walking the input and copying into the stage is one helper:
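
A sketch of that helper. ls(), is_dir(), get_contents(), and ZipDecoder::sanitize_path() are named in this chapter; put_contents() as the writer-side call, and the assumption that the in-memory filesystem creates parent directories implicitly, are mine.

```php
// Walk a read-only source filesystem and mirror every file into a
// writable destination. Both sides are just Filesystem.
function copy_between_filesystems( Filesystem $from, Filesystem $to, string $dir = '/' ): void {
    foreach ( $from->ls( $dir ) as $name ) {
        $path = rtrim( $dir, '/' ) . '/' . $name;
        if ( $from->is_dir( $path ) ) {
            copy_between_filesystems( $from, $to, $path );
            continue;
        }
        // Sanitize before using an archive path as a destination key
        // (the zip-slip defense covered in the next section).
        $safe = ZipDecoder::sanitize_path( $path );
        // put_contents() is an assumed writer method; parent directories
        // are assumed to be created implicitly.
        $to->put_contents( '/' . ltrim( $safe, '/' ), $from->get_contents( $path ) );
    }
}

// Stage the whole input ZIP in memory; the disk is never involved.
$stage = new InMemoryFilesystem();
copy_between_filesystems( $zip, $stage );
```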

Read that example carefully because it's the heart of how the importer composes. The input is read-only (a ZIP) and the output is writable (in-memory). Both expose the same interface, so a generic copy_between_filesystems() works on both. In chapter 3 we'll iterate the staged Markdown files and convert them; in chapter 4 we'll add downloaded media to the same stage. The shape doesn't change between chapters — only what's in the stage.

Defend against malicious archive paths

Every importer that accepts external ZIPs needs to defend against zip-slip: an archive containing an entry named ../../etc/passwd that, if extracted naively, writes outside the intended destination. The toolkit ships a one-line defense:
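
The defense in place, sketched with a hypothetical hostile entry name (put_contents() is again an assumed writer method):

```php
// A hostile archive entry trying to escape the destination root.
$entry = '../../etc/passwd';

// The one line: normalize the path before using it as a destination key.
$safe = ZipDecoder::sanitize_path( $entry );

// Reading the raw name from the archive itself is fine; it is only the
// destination key that must be sanitized. Whatever $safe is, it can no
// longer climb above the root.
$stage->put_contents( '/' . ltrim( $safe, '/' ), $zip->get_contents( $entry ) );
```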

Run any entry path through ZipDecoder::sanitize_path() before using it as a key in your destination filesystem. copy_between_filesystems() already does this; if you build your own loop you must too.

Folding it into the importer

The importer so far has chapter 1's clean_post_html() and chapter 2's stage. Combine them: open the input ZIP, copy it into the stage, then iterate the stage's posts/ directory; clean_post_html() gets applied when we render in chapter 3. We're not invoking it yet, because the Markdown-to-HTML conversion is chapter 3's job, but we can already see the shape:
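
A sketch of that shape, built from the three functions described below. Everything here is either named in this chapter or sketched earlier; the posts/ layout and the $argv wiring are assumptions.

```php
// One line; exists mostly for readability.
function open_input_zip( string $path ): Filesystem {
    return ZipFilesystem::create( FileReadStream::from_path( $path ) );
}

// The composition we just built: input ZIP in, in-memory stage out.
function stage_input( Filesystem $input ): Filesystem {
    $stage = new InMemoryFilesystem();
    copy_between_filesystems( $input, $stage );
    return $stage;
}

// A generator, so callers iterate posts without holding them all at once.
function each_post( Filesystem $stage ): Generator {
    foreach ( $stage->ls( '/posts' ) as $name ) {
        $path = '/posts/' . $name;
        if ( ! $stage->is_dir( $path ) ) {
            yield $path => $stage->get_contents( $path );
        }
    }
}

// The shape so far. Chapter 3 converts $markdown to HTML and runs
// clean_post_html() on the result; nothing is invoked here yet.
$stage = stage_input( open_input_zip( $argv[1] ) );
foreach ( each_post( $stage ) as $path => $markdown ) {
    // chapter 3's conversion goes here
}
```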

Three small functions, each with a single job. open_input_zip() is one line and exists mostly for readability. stage_input() is the composition we just built. each_post() is a generator so the caller can iterate without loading every post's text at once. The signatures take the abstract Filesystem type, not InMemoryFilesystem, which means a future version of the importer that stages on disk for a debugging session would not need any code change.

Recap

You can now:

- open a ZIP archive as a read-only Filesystem and read entries on demand, without extracting to disk;
- stream a large entry chunk-by-chunk with open_read_stream() instead of buffering it with get_contents();
- stage files in an InMemoryFilesystem behind the same Filesystem interface;
- sanitize archive entry paths with ZipDecoder::sanitize_path() to defend against zip-slip.

The stage is now ready to feed chapter 3, where the Markdown-to-blocks conversion actually happens.

In chapter 3 we'll turn each Markdown post into WordPress block markup, run it through clean_post_html(), and stream the whole thing into a WXR file the WordPress importer plugin will accept. Three more components — Markdown, BlockParser, and DataLiberation — finally meet the importer.