Talking to the network
By the end of chapter 3 the importer produces a valid WXR from a folder of Markdown. There's one loose thread: when a post references an image hosted on the source site, say bread.jpg, the destination site has no bread.jpg in its media library. The WordPress importer plugin will try to fetch each remote image as it runs, but that's a fragile thing to do during an import — slow, easy to rate-limit, easy to leave behind half-fetched media. The robust answer is to fetch the images before the import and stage them locally so the import can reference local paths. This chapter covers the fetch side of that work using HttpClient. We'll also see how the same client mounts a remote ZIP for streaming, which means in some workflows you don't need chapter 2's local archive at all.
Why a new HTTP client
The instinct is file_get_contents( $url ) or curl_exec(). Both work — until they don't. file_get_contents on a URL needs allow_url_fopen, which security-conscious hosts disable. curl_exec needs the curl extension, which WebAssembly builds of PHP don't ship. And the simplest forms of both — no CURLOPT_FILE, no chunked stream wrapper — buffer the whole response into one PHP string, which is fatal for a 50 MB media file on a host with a 64 MB memory limit.
HttpClient gives you the same shape regardless of host capabilities: an event loop, response objects with status codes and headers, response bodies as ByteReadStreams you can pipe somewhere instead of buffering. Under the hood it picks curl when available and PHP stream sockets otherwise. From your code's perspective those two transports are identical.
Fetch one URL
The smallest possible request: create a Request, hand it to Client::fetch(), wait for the response, read the body. The result of fetch() is a stream — the response headers arrive at await_response(), and the body bytes come through consume_all() or chunk-by-chunk via pull()/consume():
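The sketch below uses the call names this chapter describes; the namespaces, the Request constructor, and the Response fields are assumptions rather than confirmed API.

```php
<?php
// A minimal single fetch. fetch(), await_response() and consume_all() are the
// calls described in this chapter; the use statements, the Request constructor
// and $response->status_code are assumptions about the surrounding API.
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

$client = new Client();

// fetch() queues the request and hands back a stream right away.
$stream = $client->fetch( new Request( 'https://example.com/bread.jpg' ) );

// await_response() blocks until the response headers have arrived.
$response = $stream->await_response();
echo 'Status: ', $response->status_code, "\n";

// consume_all() reads the body to completion. Fine when the body is small.
$body = $stream->consume_all();
echo 'Downloaded ', strlen( $body ), " bytes\n";
```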
Read the lifecycle. fetch() returns immediately with a stream object — the request is queued, not yet executed. await_response() blocks until the response headers have arrived, then returns the Response object. consume_all() reads the body to completion. Splitting "headers" from "body" matters because for some workflows (progress reporting, redirect logging, content-type sniffing) you act on the headers before deciding what to do with the body.
Download an image to the stage
The importer's job in this chapter is to take an image URL, fetch it, and place the bytes into the staging filesystem under a deterministic local path. We'll write the bytes incrementally so the response never has to fit into memory:
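A sketch of that function, under the same caveat: fetch(), await_response(), pull() and consume() are the names from this chapter, while the Filesystem write calls (open_write_stream(), append_bytes(), close_write_stream()) and the exact pull()/consume() contract are assumptions.

```php
<?php
// Fetch one image and write it into the staging filesystem chunk by chunk.
// The Filesystem streaming-write methods used here are assumptions about the
// Filesystem interface, and pull() is assumed to report how many bytes are
// ready for consume() to return.
use WordPress\Filesystem\Filesystem;
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

function stage_remote_image( Client $client, Filesystem $stage, string $url, string $local_path ): bool {
    $stream   = $client->fetch( new Request( $url ) );
    $response = $stream->await_response();
    if ( 200 !== $response->status_code ) {
        return false;
    }

    // Copy the body incrementally so a 50 MB image never has to fit in memory.
    $stage->open_write_stream( $local_path );
    while ( $length = $stream->pull() ) {
        $stage->append_bytes( $stream->consume( $length ) );
    }
    $stage->close_write_stream();

    return true;
}
```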
Notice the function signature. It takes a Filesystem, not a directory path; it takes a Client, not a URL string transformed into one. That keeps it testable — you can pass an InMemoryFilesystem and a mock client and the function doesn't know the difference. It also keeps the HTTP and storage decisions out of the caller, so when you later swap the in-memory stage for LocalFilesystem, the function is unchanged.
The event loop, with progress
For files small enough that you don't care about memory, consume_all() is fine. For big ones, you want to know how the download is going and write bytes as they arrive. Drop down a layer: Client::enqueue() + await_next_event() exposes every stage of the request as an event you can react to:
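Sketched below with hedges: enqueue(), await_next_event() and the EVENT_* constants are named in this chapter; get_event(), get_request(), get_response_body_chunk() and the Response header access are assumptions.

```php
<?php
// Drive the request one event at a time, writing chunks to disk and printing
// progress. Only enqueue(), await_next_event() and the EVENT_* constants come
// from this chapter's description; the accessors and header lookup are assumed.
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

$client  = new Client();
$request = new Request( 'https://example.com/uploads/video.mp4' );
$client->enqueue( $request );

$fp         = fopen( '/tmp/video.mp4', 'wb' );
$downloaded = 0;
$total      = 0;

while ( $client->await_next_event() ) {
    switch ( $client->get_event() ) {
        case Client::EVENT_GOT_HEADERS:
            // Fires once. A good moment to sniff Content-Length or bail on a bad status.
            $response = $client->get_request()->response;
            $total    = (int) ( $response->headers['content-length'] ?? 0 );
            break;

        case Client::EVENT_BODY_CHUNK_AVAILABLE:
            // Fires repeatedly. Only one chunk is ever held in memory.
            $chunk       = $client->get_response_body_chunk();
            $downloaded += strlen( $chunk );
            fwrite( $fp, $chunk );
            if ( $total > 0 ) {
                printf( "\rDownloaded %.1f%%", 100 * $downloaded / $total );
            }
            break;

        case Client::EVENT_FAILED:
        case Client::EVENT_FINISHED:
            // Either way the request is over.
            break 2;
    }
}
fclose( $fp );
```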
Read the event flow. EVENT_GOT_HEADERS fires once when headers come in — useful for sniffing Content-Length or rejecting based on status. EVENT_BODY_CHUNK_AVAILABLE fires repeatedly as the body comes in — that's where you write to disk, update progress, or compute a hash. EVENT_FINISHED or EVENT_FAILED ends the request. Memory used is one chunk at a time; the importer can stream a 500 MB file under any memory limit large enough to hold the chunk size.
A sliding window of ten concurrent downloads
The importer might reference dozens of images. Doing them one at a time would be unnecessarily slow; firing all of them at once would hammer the upstream and risk being rate-limited. The polite move is a fixed-size window: keep ten requests in flight, and as each one finishes, enqueue the next:
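A sketch of that window; the bookkeeping is plain PHP, the concurrency constructor option is the one this chapter mentions, and the event-loop accessors are the same assumptions as before.

```php
<?php
// Keep at most $window requests in flight; each time one finishes or fails,
// enqueue the next pending URL. The map from request to URL uses
// spl_object_id() so no extra Request fields have to be assumed.
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

function fetch_with_window( array $urls, int $window = 10 ): void {
    $client  = new Client( array( 'concurrency' => $window ) );
    $pending = $urls;      // URLs not yet enqueued.
    $active  = array();    // spl_object_id( $request ) => URL.

    $enqueue_next = function () use ( $client, &$pending, &$active ): void {
        if ( empty( $pending ) ) {
            return;
        }
        $url     = array_shift( $pending );
        $request = new Request( $url );
        $client->enqueue( $request );
        $active[ spl_object_id( $request ) ] = $url;
    };

    // Fill the initial window of ten (or fewer) requests.
    for ( $i = 0; $i < $window; $i++ ) {
        $enqueue_next();
    }

    while ( ! empty( $active ) && $client->await_next_event() ) {
        $request = $client->get_request();
        switch ( $client->get_event() ) {
            case Client::EVENT_BODY_CHUNK_AVAILABLE:
                // Write the chunk into the stage here, as in the earlier example.
                $client->get_response_body_chunk();
                break;

            case Client::EVENT_FINISHED:
            case Client::EVENT_FAILED:
                // One slot freed; slide the window forward.
                unset( $active[ spl_object_id( $request ) ] );
                $enqueue_next();
                break;
        }
    }
}
```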
The sliding window is a small piece of bookkeeping — a pending queue, an active set, an "enqueue next" callback — wrapped around the same event loop you saw above. Real importers do exactly this for media frontloading. The concurrency option in the Client constructor is the upper bound; the bookkeeping enforces a moving window so you don't enqueue more work than the window holds.
Resume a partial download
Long downloads fail. Sometimes the network drops, sometimes the host runs out of execution time. The importer should be able to resume rather than redownload. HTTP's contract for that is Range: bytes=N-. Sending it to a cooperating server returns a 206 Partial Content response with the missing bytes:
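A sketch of resumable_download(): the 206-versus-200 logic is the contract described here, while the way the Range header is attached to the Request and the pull()/consume() loop are assumptions.

```php
<?php
// Ask the server only for the bytes we are missing. If it ignores Range and
// answers with a fresh 200, discard the partial file and start over.
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

function resumable_download( Client $client, string $url, string $local_path ): void {
    $already_have = file_exists( $local_path ) ? filesize( $local_path ) : 0;

    $options = array();
    if ( $already_have > 0 ) {
        // How custom headers are passed to a Request is an assumption here.
        $options['headers'] = array( 'Range' => 'bytes=' . $already_have . '-' );
    }

    $stream   = $client->fetch( new Request( $url, $options ) );
    $response = $stream->await_response();

    if ( $already_have > 0 && 206 !== $response->status_code ) {
        // The server sent the file from byte zero, so our partial bytes no
        // longer line up with what's coming. Fall back to a full download.
        unlink( $local_path );
        resumable_download( $client, $url, $local_path );
        return;
    }

    // 206 appends to the partial file; a fresh 200 overwrites it.
    $fp = fopen( $local_path, 206 === $response->status_code ? 'ab' : 'wb' );
    while ( $length = $stream->pull() ) {
        fwrite( $fp, $stream->consume( $length ) );
    }
    fclose( $fp );
}
```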
The defensive check matters: not every server respects Range, especially when sitting behind a CDN with caching that doesn't know how to pass the header upstream. If you ask for a partial response and the server hands you a fresh 200 instead, your existing bytes don't match what's coming and you have to start over. That's the recursion in resumable_download() — it's a one-line fallback rather than a separate retry path.
Stream a remote ZIP through ZipFilesystem
The importer's input is a ZIP — chapter 2 read it from disk. But what if the ZIP lives on a URL? Downloading it whole, opening it with ZipFilesystem, then deleting the file afterwards works, but it asks you to coordinate a temp path the toolkit could manage for you. SeekableRequestReadStream wraps a Request as a seekable byte stream that ZipFilesystem can read directly: bytes are downloaded sequentially as the consumer reads, the class caches them in an internal temp file (cleaned up when you call close_reading()), and seeks back into already-downloaded ranges hit the cache instead of re-fetching:
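A sketch of that setup; SeekableRequestReadStream and close_reading() are the names this chapter gives, while the namespaces, the ZipFilesystem constructor, and the ls() listing call are assumptions.

```php
<?php
// Mount a remote ZIP without downloading it up front. The stream downloads
// lazily as ZipFilesystem reads, caching what it has seen in a temp file so
// backward seeks (to the central directory, for instance) don't re-fetch.
use WordPress\HttpClient\ByteStream\SeekableRequestReadStream;
use WordPress\HttpClient\Request;
use WordPress\Filesystem\ZipFilesystem;

$stream = new SeekableRequestReadStream(
    new Request( 'https://example.com/export.zip' )
);

// Assumption: ZipFilesystem accepts a seekable byte stream directly.
$zip = new ZipFilesystem( $stream );
foreach ( $zip->ls( '/' ) as $entry ) {
    echo $entry, "\n";
}

// close_reading() removes the temp-file cache.
$stream->close_reading();
```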
That's the entire chapter-2 setup with a remote URL substituted for the local file. SeekableRequestReadStream downloads the response body once, lazily, into a temporary file as ZipFilesystem asks for bytes — so reads work the way they would on a local file (including the seeks that ZipFilesystem performs to find the central directory at the end of the archive). The temp file caches what's been seen, so seeking backwards doesn't re-fetch.
End-to-end: the importer, finally complete
The importer now spans four chapters' worth of components. The full shape:
- Open the input ZIP — locally with ZipFilesystem, or remotely with SeekableRequestReadStream.
- Stage its contents in an InMemoryFilesystem with copy_between_filesystems().
- For each Markdown file in the stage, run MarkdownConsumer, then clean_post_html() on the produced block markup.
- For each image URL referenced from the cleaned content, fetch it with HttpClient through a sliding-window concurrency loop and stage the bytes alongside the WXR.
- Stream the whole thing into a WXR document with WXRWriter, with the cleaned post markup as content and rewritten image references pointing at the local paths under the staged uploads tree.
The full importer is roughly a hundred lines of PHP. It depends on no extension beyond json and mbstring. It runs in browser-side WebAssembly, on PHP 7.2 through 8.3, and on every shared host that's kept up with PHP releases. That's the toolkit's whole pitch — pure-PHP libraries that handle the work the platform usually outsources to extensions.
Recap
You can now:
- Fetch a URL with Client::fetch() and read the body either whole (consume_all()) or in chunks (pull()/consume()).
- Drive the event loop with enqueue() + await_next_event() for progress reporting and per-chunk processing.
- Maintain a sliding window of N concurrent requests by tracking active and pending sets.
- Resume a partial download with the Range header, and fall back to a full download when the server doesn't honor it.
- Mount a remote ZIP through SeekableRequestReadStream so ZipFilesystem can seek over the response — bytes are downloaded lazily into a temp-file cache as they're read.