Talking to the network
By the end of chapter 3 the importer produces a valid WXR from a folder of Markdown. There's one loose thread: when a post references an image hosted on the source site, say bread.jpg, the destination site has no bread.jpg in its media library. The WordPress importer plugin will try to fetch each remote image as it runs, but that's a fragile thing to do during an import — slow, easy to rate-limit, easy to leave behind half-fetched media. The robust answer is to fetch the images before the import and stage them locally so the import can reference local paths. This chapter covers the fetch side of that work using HttpClient. We'll also see how the same client mounts a remote ZIP for streaming, which means in some workflows you don't need chapter 2's local archive at all.
Why a new HTTP client
The instinct is file_get_contents( $url ) or curl_exec(). Both work — until they don't. file_get_contents on a URL needs allow_url_fopen, which security-conscious hosts disable. curl_exec needs the curl extension, which WebAssembly builds of PHP don't ship. And the simplest forms of both — no CURLOPT_FILE, no chunked stream wrapper — buffer the whole response into one PHP string, which is fatal for a 50 MB media file on a host with a 64 MB memory limit.
HttpClient gives you the same shape regardless of host capabilities: an event loop, response objects with status codes and headers, response bodies as ByteReadStreams you can pipe somewhere instead of buffering. Under the hood it picks curl when available and PHP stream sockets otherwise. From your code's perspective those two transports are identical.
Fetch one URL
The smallest possible request: create a Request, hand it to Client::fetch(), wait for the response, read the body. The result of fetch() is a stream — the response headers arrive at await_response(), and the body bytes come through consume_all() or chunk-by-chunk via pull()/consume():
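The sketch below uses the call names this chapter describes; the namespaces, the Request constructor, and the Response fields are assumptions rather than confirmed API.

```php
<?php
// A minimal single fetch. fetch(), await_response() and consume_all() are the
// calls described in this chapter; the use statements, the Request constructor
// and $response->status_code are assumptions about the surrounding API.
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

$client = new Client();

// fetch() queues the request and hands back a stream right away.
$stream = $client->fetch( new Request( 'https://example.com/bread.jpg' ) );

// await_response() blocks until the response headers have arrived.
$response = $stream->await_response();
echo 'Status: ', $response->status_code, "\n";

// consume_all() reads the body to completion. Fine when the body is small.
$body = $stream->consume_all();
echo 'Downloaded ', strlen( $body ), " bytes\n";
```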
Read the lifecycle. fetch() returns immediately with a stream object — the request is queued, not yet executed. await_response() blocks until the response headers have arrived, then returns the Response object. consume_all() reads the body to completion. Splitting "headers" from "body" matters because for some workflows (progress reporting, redirect logging, content-type sniffing) you act on the headers before deciding what to do with the body.
Download an image to the stage
The importer's job in this chapter is to take an image URL, fetch it, and place the bytes into the staging filesystem under a deterministic local path. We'll write the bytes incrementally so the response never has to fit into memory:
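A sketch of that function, under the same caveat: fetch(), await_response(), pull() and consume() are the names from this chapter, while the Filesystem write calls (open_write_stream(), append_bytes(), close_write_stream()) and the exact pull()/consume() contract are assumptions.

```php
<?php
// Fetch one image and write it into the staging filesystem chunk by chunk.
// The Filesystem streaming-write methods used here are assumptions about the
// Filesystem interface, and pull() is assumed to report how many bytes are
// ready for consume() to return.
use WordPress\Filesystem\Filesystem;
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

function stage_remote_image( Client $client, Filesystem $stage, string $url, string $local_path ): bool {
    $stream   = $client->fetch( new Request( $url ) );
    $response = $stream->await_response();
    if ( 200 !== $response->status_code ) {
        return false;
    }

    // Copy the body incrementally so a 50 MB image never has to fit in memory.
    $stage->open_write_stream( $local_path );
    while ( $length = $stream->pull() ) {
        $stage->append_bytes( $stream->consume( $length ) );
    }
    $stage->close_write_stream();

    return true;
}
```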
Notice the function signature. It takes a Filesystem, not a directory path; it takes a Client, not a URL string transformed into one. That keeps it testable — you can pass an InMemoryFilesystem and a mock client and the function doesn't know the difference. It also keeps the HTTP and storage decisions out of the caller, so when you later swap the in-memory stage for LocalFilesystem, the function is unchanged.
The event loop, with progress
For files small enough that you don't care about memory, consume_all() is fine. For big ones, you want to know how the download is going and write bytes as they arrive. Drop down a layer: Client::enqueue() + await_next_event() exposes every stage of the request as an event you can react to:
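Sketched below with hedges: enqueue(), await_next_event() and the EVENT_* constants are named in this chapter; get_event(), get_request(), get_response_body_chunk() and the Response header access are assumptions.

```php
<?php
// Drive the request one event at a time, writing chunks to disk and printing
// progress. Only enqueue(), await_next_event() and the EVENT_* constants come
// from this chapter's description; the accessors and header lookup are assumed.
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

$client  = new Client();
$request = new Request( 'https://example.com/uploads/video.mp4' );
$client->enqueue( $request );

$fp         = fopen( '/tmp/video.mp4', 'wb' );
$downloaded = 0;
$total      = 0;

while ( $client->await_next_event() ) {
    switch ( $client->get_event() ) {
        case Client::EVENT_GOT_HEADERS:
            // Fires once. A good moment to sniff Content-Length or bail on a bad status.
            $response = $client->get_request()->response;
            $total    = (int) ( $response->headers['content-length'] ?? 0 );
            break;

        case Client::EVENT_BODY_CHUNK_AVAILABLE:
            // Fires repeatedly. Only one chunk is ever held in memory.
            $chunk       = $client->get_response_body_chunk();
            $downloaded += strlen( $chunk );
            fwrite( $fp, $chunk );
            if ( $total > 0 ) {
                printf( "\rDownloaded %.1f%%", 100 * $downloaded / $total );
            }
            break;

        case Client::EVENT_FAILED:
        case Client::EVENT_FINISHED:
            // Either way the request is over.
            break 2;
    }
}
fclose( $fp );
```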
Read the event flow. EVENT_GOT_HEADERS fires once when headers come in — useful for sniffing Content-Length or rejecting based on status. EVENT_BODY_CHUNK_AVAILABLE fires repeatedly as the body comes in — that's where you write to disk, update progress, or compute a hash. EVENT_FINISHED or EVENT_FAILED ends the request. Memory used is one chunk at a time; the importer can stream a 500 MB file under any memory limit large enough to hold the chunk size.
A sliding window of ten concurrent downloads
The importer might reference dozens of images. Doing them one at a time would be unnecessarily slow; firing all of them at once would hammer the upstream and risk being rate-limited. The polite move is a fixed-size window: keep ten requests in flight, and as each one finishes, enqueue the next:
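A sketch of that window; the bookkeeping is plain PHP, the concurrency constructor option is the one this chapter mentions, and the event-loop accessors are the same assumptions as before.

```php
<?php
// Keep at most $window requests in flight; each time one finishes or fails,
// enqueue the next pending URL. The map from request to URL uses
// spl_object_id() so no extra Request fields have to be assumed.
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

function fetch_with_window( array $urls, int $window = 10 ): void {
    $client  = new Client( array( 'concurrency' => $window ) );
    $pending = $urls;      // URLs not yet enqueued.
    $active  = array();    // spl_object_id( $request ) => URL.

    $enqueue_next = function () use ( $client, &$pending, &$active ): void {
        if ( empty( $pending ) ) {
            return;
        }
        $url     = array_shift( $pending );
        $request = new Request( $url );
        $client->enqueue( $request );
        $active[ spl_object_id( $request ) ] = $url;
    };

    // Fill the initial window of ten (or fewer) requests.
    for ( $i = 0; $i < $window; $i++ ) {
        $enqueue_next();
    }

    while ( ! empty( $active ) && $client->await_next_event() ) {
        $request = $client->get_request();
        switch ( $client->get_event() ) {
            case Client::EVENT_BODY_CHUNK_AVAILABLE:
                // Write the chunk into the stage here, as in the earlier example.
                $client->get_response_body_chunk();
                break;

            case Client::EVENT_FINISHED:
            case Client::EVENT_FAILED:
                // One slot freed; slide the window forward.
                unset( $active[ spl_object_id( $request ) ] );
                $enqueue_next();
                break;
        }
    }
}
```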
The sliding window is a small piece of bookkeeping — a pending queue, an active set, an "enqueue next" callback — wrapped around the same event loop you saw above. Real importers do exactly this for media frontloading. The concurrency option in the Client constructor is the upper bound; the bookkeeping enforces a moving window so you don't enqueue more work than the window holds.
Resume a partial download
Long downloads fail. Sometimes the network drops, sometimes the host runs out of execution time. The importer should be able to resume rather than redownload. HTTP's contract for that is Range: bytes=N-. Sending it to a cooperating server returns a 206 Partial Content response with the missing bytes:
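A sketch of resumable_download(): the 206-versus-200 logic is the contract described here, while the way the Range header is attached to the Request and the pull()/consume() loop are assumptions.

```php
<?php
// Ask the server only for the bytes we are missing. If it ignores Range and
// answers with a fresh 200, discard the partial file and start over.
use WordPress\HttpClient\Client;
use WordPress\HttpClient\Request;

function resumable_download( Client $client, string $url, string $local_path ): void {
    $already_have = file_exists( $local_path ) ? filesize( $local_path ) : 0;

    $options = array();
    if ( $already_have > 0 ) {
        // How custom headers are passed to a Request is an assumption here.
        $options['headers'] = array( 'Range' => 'bytes=' . $already_have . '-' );
    }

    $stream   = $client->fetch( new Request( $url, $options ) );
    $response = $stream->await_response();

    if ( $already_have > 0 && 206 !== $response->status_code ) {
        // The server sent the file from byte zero, so our partial bytes no
        // longer line up with what's coming. Fall back to a full download.
        unlink( $local_path );
        resumable_download( $client, $url, $local_path );
        return;
    }

    // 206 appends to the partial file; a fresh 200 overwrites it.
    $fp = fopen( $local_path, 206 === $response->status_code ? 'ab' : 'wb' );
    while ( $length = $stream->pull() ) {
        fwrite( $fp, $stream->consume( $length ) );
    }
    fclose( $fp );
}
```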
The defensive check matters: not every server respects Range, especially when sitting behind a CDN with caching that doesn't know how to pass the header upstream. If you ask for a partial response and the server hands you a fresh 200 instead, your existing bytes don't match what's coming and you have to start over. That's the recursion in resumable_download() — it's a one-line fallback rather than a separate retry path.
Stream a remote ZIP through ZipFilesystem
The importer's input is a ZIP — chapter 2 read it from disk. But what if the ZIP lives on a URL? Downloading it whole, opening it with ZipFilesystem, then deleting the file afterwards works, but it asks you to coordinate a temp path the toolkit could manage for you. SeekableRequestReadStream wraps a Request as a seekable byte stream that ZipFilesystem can read directly: bytes are downloaded sequentially as the consumer reads, the class caches them in an internal temp file (cleaned up when you call close_reading()), and seeks back into already-downloaded ranges hit the cache instead of re-fetching:
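A sketch of that setup; SeekableRequestReadStream and close_reading() are the names this chapter gives, while the namespaces, the ZipFilesystem constructor, and the ls() listing call are assumptions.

```php
<?php
// Mount a remote ZIP without downloading it up front. The stream downloads
// lazily as ZipFilesystem reads, caching what it has seen in a temp file so
// backward seeks (to the central directory, for instance) don't re-fetch.
use WordPress\HttpClient\ByteStream\SeekableRequestReadStream;
use WordPress\HttpClient\Request;
use WordPress\Filesystem\ZipFilesystem;

$stream = new SeekableRequestReadStream(
    new Request( 'https://example.com/export.zip' )
);

// Assumption: ZipFilesystem accepts a seekable byte stream directly.
$zip = new ZipFilesystem( $stream );
foreach ( $zip->ls( '/' ) as $entry ) {
    echo $entry, "\n";
}

// close_reading() removes the temp-file cache.
$stream->close_reading();
```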
That's the entire chapter-2 setup with a remote URL substituted for the local file. SeekableRequestReadStream downloads the response body once, lazily, into a temporary file as ZipFilesystem asks for bytes — so reads work the way they would on a local file (including the seeks that ZipFilesystem performs to find the central directory at the end of the archive). The temp file caches what's been seen, so seeking backwards doesn't re-fetch.
End-to-end: the importer, finally complete
The importer now spans four chapters' worth of components. The full shape:
- Open the input ZIP — locally with ZipFilesystem, or remotely with SeekableRequestReadStream.
- Stage its contents in an InMemoryFilesystem with copy_between_filesystems().
- For each Markdown file in the stage, run MarkdownConsumer, then clean_post_html() on the produced block markup.
- For each image URL referenced from the cleaned content, fetch it with HttpClient through a sliding-window concurrency loop and stage the bytes alongside the WXR.
- Stream the whole thing into a WXR document with WXRWriter, with the cleaned post markup as content and rewritten image references pointing at the local paths under the staged uploads tree.
The full importer is roughly a hundred lines of PHP. It depends on no extension beyond json and mbstring. It runs in browser-side WebAssembly, on PHP 7.2 through 8.3, and on every shared host that's kept up with PHP releases. That's the toolkit's whole pitch — pure-PHP libraries that handle the work the platform usually outsources to extensions.
Recap
You can now:
- Fetch a URL with Client::fetch() and read the body either whole (consume_all()) or in chunks (pull()/consume()).
- Drive the event loop with enqueue() + await_next_event() for progress reporting and per-chunk processing.
- Maintain a sliding window of N concurrent requests by tracking active and pending sets.
- Resume a partial download with the Range header, and fall back to a full download when the server doesn't honor it.
- Mount a remote ZIP through SeekableRequestReadStream so ZipFilesystem can seek over the response — bytes are downloaded lazily into a temp-file cache as they're read.