[Feature Request] Asynchronous On-demand FD-passing for virtio-pmem via userfaultfd #5740

Description

@joy-allen

Feature Request

Standard virtio-pmem backends in Firecracker are typically backed by local files. However, in high-density container environments (e.g., using Nydus or EROFS-based lazy loading), image data is often stored in remote registries and fetched on-demand.

Currently, achieving lazy loading in Firecracker requires either virtio-block (which lacks DAX and causes memory-heavy double-buffering) or complex vhost-user-fs setups. This feature request proposes an asynchronous, on-demand FD-passing backend for virtio-pmem using userfaultfd. It allows MicroVMs to start before the full image is available locally, while sharing the Host's page cache without extra copies, in line with Firecracker's performance and security philosophy.

Describe the desired solution

We propose a new UffdBackend for the virtio-pmem device that leverages userfaultfd (UFFD) and Unix Domain Socket (UDS) file descriptor passing.

Architecture

  1. UFFD Registration: The VMM creates an anonymous RO mapping for the virtio-pmem address space and registers it with userfaultfd.
  2. NBD-style Control Plane: The VMM communicates with an external image service (e.g., nydusd) via UDS using a lightweight, asynchronous protocol inspired by NBD.
    • PROBE: Queries the service for existing local chunks at boot to perform initial mappings.
    • FETCH: Triggered by a UFFD page fault; the VMM sends a request with (pos, len) and returns to the event loop.
  3. FD-Passing & Remapping: The service replies using SCM_RIGHTS to pass File Descriptors. The VMM receiver thread then performs an mmap(MAP_FIXED) to map the blob-backed FD into the Guest's address space.
  4. Zero-Copy Execution: Once mapped, the Guest accesses the data via DAX (e.g., EROFS + DAX). This bypasses the VMM entirely for subsequent reads, sharing the Host Page Cache directly.
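
The FD-passing step (3) can be illustrated in isolation. The following is a minimal sketch, not the prototype's code: it uses Python's `socket.send_fds` / `socket.recv_fds` (Python 3.9+, Unix only) over a socket pair standing in for the UDS connection, and then maps the received FD read-only. The `<QQQ` payload layout for `(off, len, dev_off)` and the helper names are assumptions made for illustration. Python's `mmap` module does not expose `MAP_FIXED`, so the sketch maps at a kernel-chosen address; a real VMM would instead map over the UFFD-registered region at `dev_off`.

```python
import mmap
import os
import socket
import struct
import tempfile

# Hypothetical payload layout: off, len, dev_off as little-endian u64 values.
RANGE = struct.Struct("<QQQ")


def send_range(sock, fd, off, length, dev_off):
    """Service side: send one range descriptor, with the FD as SCM_RIGHTS data."""
    socket.send_fds(sock, [RANGE.pack(off, length, dev_off)], [fd])


def recv_range(sock):
    """VMM side: receive the descriptor and the FD duplicated by the kernel."""
    data, fds, _flags, _addr = socket.recv_fds(sock, RANGE.size, 1)
    off, length, dev_off = RANGE.unpack(data)
    return fds[0], off, length, dev_off


def map_range(fd, off, length):
    """Map the blob range read-only and shared, so reads hit the host page cache."""
    return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                     prot=mmap.PROT_READ, offset=off)


def demo():
    page = mmap.ALLOCATIONGRANULARITY  # mmap offsets must be aligned to this
    with tempfile.TemporaryFile() as blob:
        # Two pages of distinct content stand in for a blob file.
        blob.write(b"A" * page + b"B" * page)
        blob.flush()
        service, vmm = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
        with service, vmm:
            # The "service" hands over the second page of the blob.
            send_range(service, blob.fileno(), page, page, 0)
            fd, off, length, dev_off = recv_range(vmm)
        try:
            with map_range(fd, off, length) as view:
                return view[:4], dev_off
        finally:
            os.close(fd)
```

Calling `demo()` reads the first four bytes of the mapped range through the received FD. No data is copied through the socket; only the descriptor and the range metadata cross the connection.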

Protocol Definition

  • Request: magic (u32), type (u32), handle (u64), pos (u64), len (u32)
  • Reply: magic (u32), code (u32), handle (u64), dev_sz (u64), ranges_count (u32)
  • Control Message: [fd, off, len, dev_off], with the fd passed via SCM_RIGHTS
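
To make the header layouts concrete, here is a minimal encoding sketch. The field order and widths come from the list above; everything else is an assumption for illustration: little-endian byte order, a packed layout without padding, the magic constant, and the numeric codes for PROBE and FETCH.

```python
import struct

# Assumed wire format: little-endian, no padding between fields.
REQUEST = struct.Struct("<IIQQI")  # magic, type, handle, pos, len
REPLY = struct.Struct("<IIQQI")    # magic, code, handle, dev_sz, ranges_count

# Hypothetical constants; the proposal does not specify these values.
HYPOTHETICAL_MAGIC = 0x55464644
TYPE_PROBE = 0
TYPE_FETCH = 1


def encode_request(req_type, handle, pos, length):
    """Serialize a request header into 28 bytes."""
    return REQUEST.pack(HYPOTHETICAL_MAGIC, req_type, handle, pos, length)


def decode_request(buf):
    """Parse a request header, rejecting an unexpected magic."""
    magic, req_type, handle, pos, length = REQUEST.unpack(buf)
    if magic != HYPOTHETICAL_MAGIC:
        raise ValueError("bad magic")
    return {"type": req_type, "handle": handle, "pos": pos, "len": length}


def encode_reply(code, handle, dev_sz, ranges_count):
    """Serialize a reply header into 28 bytes."""
    return REPLY.pack(HYPOTHETICAL_MAGIC, code, handle, dev_sz, ranges_count)


def decode_reply(buf):
    """Parse a reply header, rejecting an unexpected magic."""
    magic, code, handle, dev_sz, ranges_count = REPLY.unpack(buf)
    if magic != HYPOTHETICAL_MAGIC:
        raise ValueError("bad magic")
    return {"code": code, "handle": handle, "dev_sz": dev_sz,
            "ranges_count": ranges_count}
```

The `handle` field lets the receiver match an asynchronous reply to the outstanding FETCH that triggered it, which matters because the VMM returns to its event loop after sending a request.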

Describe possible alternatives

1. Virtio-fs with DAX

While virtio-fs is a common solution, it is significantly more complex to implement and audit for Firecracker. It requires a stateful FUSE control plane and a dynamic DAX window management protocol (SetupMapping/RemoveMapping). Our virtio-pmem approach treats the image as a static, flat address space, keeping the VMM implementation stateless and minimizing the attack surface.

2. Standard virtio-block

virtio-block lacks DAX support. Every block access requires a VM-exit and results in "double-buffering" (data exists in both the Host and Guest page caches), which significantly increases memory pressure in high-density deployments.

3. UFFDIO_COPY

The VMM could fetch data into a userspace buffer and use UFFDIO_COPY. However, this introduces an unnecessary memcpy and higher CPU overhead compared to the proposed mmap + FD-passing approach, which natively shares the page cache.

How do you work around not having this feature?

Currently, we have to pre-fetch the entire container image before VM startup to use virtio-pmem, which leads to high "Time To First Byte" (TTFB) and wastes local storage/memory if only a fraction of the image is actually accessed by the Guest.

Additional context

1. Prototype Status

We have already implemented a functional prototype of this mechanism. The VMM is capable of:

  • Handling userfaultfd events in a dedicated thread.
  • Communicating with a modified nydusd via the NBD-style UDS protocol.
  • Performing mmap(MAP_FIXED) with FD passing and successfully waking up the Guest vCPU.

2. Failure Handling & Resilience

A critical concern for this feature is the stability of the UDS connection. Our implementation includes:

  • Auto-reconnect: The VMM can transparently re-establish the connection to the image service (e.g., if nydusd restarts).
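
One way such a reconnect could look is a bounded retry loop with exponential backoff. This is a sketch under assumptions, not the prototype's logic: the function name, attempt count, and delay values are hypothetical, and it does not cover re-issuing FETCH requests that were in flight when the connection dropped.

```python
import socket
import time


def connect_with_retry(path, attempts=5, base_delay=0.05, max_delay=1.0):
    """Connect to a Unix domain socket, retrying while the service is down.

    Retries on a missing socket path or a refused connection, doubling the
    delay after each failure up to max_delay. Re-raises the last error once
    the attempts are exhausted.
    """
    delay = base_delay
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except (FileNotFoundError, ConnectionRefusedError):
            sock.close()
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

A bounded loop keeps a faulting vCPU from blocking indefinitely if the image service never comes back; what the VMM should do once retries are exhausted is a separate policy decision.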

3. Use Case: Large-scale Serverless & AI Inference

In serverless environments, MicroVMs are short-lived but require instant access to large base images (often several GBs for AI models or thick containers). This feature enables:

  • Shared Host Page Cache: Multiple MicroVMs using the same base image will map the same physical pages on the Host, significantly reducing the memory footprint (RSS) compared to virtio-block.
  • Reduced Cold Start Latency: By only fetching the metadata and the entry-point code chunks, the effective "Ready" time is decoupled from the total image size.

4. Comparison with Standard UFFD usage

Unlike the traditional UFFDIO_COPY approach used in post-copy migration, this FD-passing approach is well suited to read-only DAX devices. It treats the VMM as a coordinator rather than a data buffer, which is consistent with Firecracker's goal of being a "thin" VMM.

Checks

  • Have you searched the Firecracker Issues database for similar requests?
  • Have you read all the existing relevant Firecracker documentation?
  • Have you read and understood Firecracker's core tenets?

Labels: Status: Awaiting author (indicates that an issue or pull request requires author action)
