6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,12 @@ and this project adheres to

### Added

- [#6003](https://github.com/firecracker-microvm/firecracker/pull/6003): Added a
new option `Transparent` for the `huge_pages` setting. If set, Firecracker
will use transparent huge pages for the guest memory via
`madvise(MADV_HUGEPAGE)`. Guest memory must be a multiple of 2MB when using
this option.

### Changed

### Deprecated
51 changes: 32 additions & 19 deletions docs/hugepages.md
@@ -1,14 +1,37 @@
# Backing Guest Memory by Huge Pages

Firecracker supports backing the guest memory of a VM by 2MB hugetlbfs pages.
This can be enabled by setting the `huge_pages` field of `PUT` or `PATCH`
requests to the `/machine-config` endpoint to `2M`.

Backing guest memory by huge pages can bring performance improvements for
specific workloads, due to less TLB contention and less overhead during
virtual->physical address resolution. It can also help reduce the number of
KVM_EXITS required to rebuild extended page tables post snapshot restore, as
well as improve boot times (by up to 50% as measured by Firecracker's
Firecracker supports three modes for the `huge_pages` field of `PUT` or `PATCH`
requests to the `/machine-config` endpoint:

- `None` (default): Uses regular 4K pages with no huge page behavior.
- `Transparent`: Uses `madvise(MADV_HUGEPAGE)` to request transparent huge pages
for guest memory. Guest memory size must be a multiple of 2MB.
- `2M`: Backs guest memory by 2MB hugetlbfs pages.
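
The size rule for these modes can be sketched as follows. This is a minimal illustration, not Firecracker's own code: the `HugePages` enum, `accepts_mem_size`, and `machine_config_body` are hypothetical names introduced here, and the sketch assumes the request body carries `vcpu_count` and `mem_size_mib` alongside `huge_pages`.

```rust
/// Hypothetical mirror of the three `huge_pages` values; not Firecracker's own type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum HugePages {
    None,
    Transparent,
    TwoMb,
}

impl HugePages {
    /// The string used for the `huge_pages` field of the request body.
    fn api_name(self) -> &'static str {
        match self {
            HugePages::None => "None",
            HugePages::Transparent => "Transparent",
            HugePages::TwoMb => "2M",
        }
    }

    /// `Transparent` and `2M` require a memory size that is a multiple of 2 MiB;
    /// `None` accepts any whole number of MiB.
    fn accepts_mem_size(self, mem_size_mib: usize) -> bool {
        let divisor = match self {
            HugePages::None => 1,
            HugePages::Transparent | HugePages::TwoMb => 2,
        };
        mem_size_mib % divisor == 0
    }
}

/// Builds a JSON body for `PUT /machine-config` (assumed field set),
/// or returns `None` if the memory size is not accepted for the chosen mode.
fn machine_config_body(vcpus: u8, mem_size_mib: usize, huge_pages: HugePages) -> Option<String> {
    if !huge_pages.accepts_mem_size(mem_size_mib) {
        return None;
    }
    Some(format!(
        r#"{{"vcpu_count": {}, "mem_size_mib": {}, "huge_pages": "{}"}}"#,
        vcpus,
        mem_size_mib,
        huge_pages.api_name()
    ))
}
```

With this sketch, `machine_config_body(2, 1025, HugePages::Transparent)` is rejected because 1025 MiB is not a multiple of 2 MiB, while the same size with `HugePages::None` is accepted.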

## Transparent Huge Pages (THP)

Setting `huge_pages` to `Transparent` enables transparent huge pages for guest
memory via `madvise(MADV_HUGEPAGE)`. This allows the kernel to opportunistically
back guest memory with 2MB pages without requiring a pre-allocated hugetlbfs
pool.

Review comment (Contributor): With mTHP, can't it also be sizes other than 2 MiB?

Limitations:

- THP is only effective for anonymous memory (non-memfd). When vhost-user-blk
devices are in use, guest memory is memfd-backed and THP will not be applied.
- THP does not integrate with UFFD; no transparent huge pages will be allocated
during userfault-handling while resuming from a snapshot.

Please refer to the [Linux Documentation][thp_docs] for more information.
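
The limitations listed in this section can be read as a small decision rule. The sketch below is only an illustration under that reading: `GuestMemorySource` and `thp_can_apply` are hypothetical names, and the sketch covers only the three cases named in the text.

```rust
/// Hypothetical description of how guest memory is provided; illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum GuestMemorySource {
    /// Private anonymous mapping.
    Anonymous,
    /// memfd-backed mapping, e.g. when vhost-user-blk devices are in use.
    Memfd,
    /// Pages populated through userfault handling while resuming from a snapshot.
    Uffd,
}

/// Returns `true` when a `Transparent` request can be expected to have an effect,
/// according to the limitations listed in this section. Even then, THP remains a
/// hint to the kernel and is not a guarantee of huge page backing.
fn thp_can_apply(transparent_requested: bool, source: GuestMemorySource) -> bool {
    transparent_requested && source == GuestMemorySource::Anonymous
}
```

For example, `thp_can_apply(true, GuestMemorySource::Memfd)` is `false`, matching the vhost-user-blk limitation.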

## Hugetlbfs (2M)

Setting `huge_pages` to `2M` backs guest memory by 2MB hugetlbfs pages. This can
bring performance improvements for specific workloads, due to less TLB
contention and less overhead during virtual->physical address resolution. It can
also help reduce the number of KVM_EXITS required to rebuild extended page
tables post snapshot restore, as well as improve boot times (by up to 50% as
measured by Firecracker's
[boot time performance tests](../tests/integration_tests/performance/test_boottime.py))

Using hugetlbfs requires the host running Firecracker to have a pre-allocated
@@ -43,15 +66,5 @@ the device is unable to reclaim the hugepage backing of the guest and drop RSS.
However, the balloon can still be inflated and used to restrict memory usage in
the guest.

## FAQ

### Why does Firecracker not offer a transparent huge pages (THP) setting?

Firecracker's guest memory can be memfd based. Linux (as of 6.1) does not offer
a way to dynamically enable THP for such memory regions. Additionally, UFFD does
not integrate with THP (no transparent huge pages will be allocated during
userfaulting). Please refer to the [Linux Documentation][thp_docs] for more
information.

[hugetlbfs_docs]: https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
[thp_docs]: https://www.kernel.org/doc/html/next/admin-guide/mm/transhuge.html#hugepages-in-tmpfs-shmem
@@ -104,6 +104,7 @@ mod tests {

let huge_pages_cases = [
("None", HugePageConfig::None),
("Transparent", HugePageConfig::Transparent),
("2M", HugePageConfig::Hugetlbfs2M),
];

7 changes: 6 additions & 1 deletion src/firecracker/swagger/firecracker.yaml
@@ -1442,8 +1442,13 @@ definitions:
type: string
enum:
- None
- Transparent
- 2M
description: Which huge pages configuration (if any) should be used to back guest memory.
default: None
description: >-
Which huge pages configuration should be used to back guest memory.
"None" uses regular 4K pages. "Transparent" enables THP via
madvise(MADV_HUGEPAGE). "2M" uses explicit hugetlbfs 2MB pages.
MemoryBackend:
type: object
1 change: 1 addition & 0 deletions src/vmm/src/devices/virtio/vhost_user.rs
@@ -487,6 +487,7 @@ pub(crate) mod tests {
libc::MAP_PRIVATE,
Some(file),
false,
libc::MADV_HUGEPAGE,
)
.unwrap()
.into_iter()
13 changes: 10 additions & 3 deletions src/vmm/src/persist.rs
@@ -449,8 +449,13 @@ pub fn restore_from_snapshot(
.into());
}
(
guest_memory_from_file(mem_backend_path, mem_state, track_dirty_pages)
.map_err(RestoreFromSnapshotGuestMemoryError::File)?,
guest_memory_from_file(
mem_backend_path,
mem_state,
track_dirty_pages,
vm_resources.machine_config.huge_pages,
)
.map_err(RestoreFromSnapshotGuestMemoryError::File)?,
None,
)
}
@@ -512,9 +517,11 @@ fn guest_memory_from_file(
mem_file_path: &Path,
mem_state: &GuestMemoryState,
track_dirty_pages: bool,
huge_pages: HugePageConfig,
) -> Result<Vec<GuestRegionMmap>, GuestMemoryFromFileError> {
let mem_file = File::open(mem_file_path)?;
let guest_mem = memory::snapshot_file(mem_file, mem_state.regions(), track_dirty_pages)?;
let guest_mem =
memory::snapshot_file(mem_file, mem_state.regions(), track_dirty_pages, huge_pages)?;
Ok(guest_mem)
}

21 changes: 21 additions & 0 deletions src/vmm/src/resources.rs
@@ -580,6 +580,7 @@ mod tests {
use crate::vmm_config::RateLimiterConfig;
use crate::vmm_config::boot_source::{BootConfig, BootSource, BootSourceConfig};
use crate::vmm_config::drive::{BlockBuilder, BlockDeviceConfig};
use crate::vmm_config::machine_config::HugePageConfig::{Hugetlbfs2M, Transparent};
use crate::vmm_config::machine_config::{HugePageConfig, MachineConfig, MachineConfigError};
use crate::vmm_config::net::{NetBuilder, NetworkInterfaceConfig};
use crate::vmm_config::vsock::tests::default_config;
@@ -1476,6 +1477,26 @@ mod tests {
Err(MachineConfigError::InvalidMemorySize)
);

// Odd memory size - not supported by THP/Hugetlbfs
aux_vm_config.mem_size_mib = Some(1025);
aux_vm_config.huge_pages = Some(Transparent);
assert_eq!(
vm_resources.update_machine_config(&aux_vm_config),
Err(MachineConfigError::InvalidMemorySize)
);
aux_vm_config.huge_pages = Some(Hugetlbfs2M);
assert_eq!(
vm_resources.update_machine_config(&aux_vm_config),
Err(MachineConfigError::InvalidMemorySize)
);
// Odd size supported by HugePageConfig::None
aux_vm_config.huge_pages = Some(HugePageConfig::None);
vm_resources.update_machine_config(&aux_vm_config).unwrap();
assert_eq!(
MachineConfigUpdate::from(vm_resources.machine_config.clone()),
aux_vm_config
);

// Incompatible mem_size_mib with balloon size.
vm_resources.machine_config.mem_size_mib = 128;
vm_resources
23 changes: 19 additions & 4 deletions src/vmm/src/vmm_config/machine_config.rs
@@ -34,9 +34,11 @@ pub enum MachineConfigError {
/// Describes the possible (huge)page configurations for a microVM's memory.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
pub enum HugePageConfig {
/// Do not use hugepages, e.g. back guest memory by 4K
/// Back guest memory by 4K pages, no hugepage behavior
#[default]
None,
/// Use madvise(MADV_HUGEPAGE) for transparent huge pages
Transparent,
/// Back guest memory by 2MB hugetlbfs pages
#[serde(rename = "2M")]
Hugetlbfs2M,
@@ -49,6 +51,10 @@ impl HugePageConfig {
let divisor = match self {
// Any integer memory size expressed in MiB will be a multiple of 4096KiB.
HugePageConfig::None => 1,
// Note: THP technically supports memory not 2MB aligned, however that would mean
// some pages at the tail would be forced to be 4k size. To avoid performance/fragmentation surprises,
// having a memory multiple of 2MB is wiser.

Review comment (Contributor, on the "not 2MB aligned" line): The wording is slightly confusing because here it talks about alignment (if it's misaligned there can also be 4KiB pages at the head too), but then it talks about enforcing size, not alignment.

Review comment (Contributor, on the "forced to be 4k size" line): Is everybody "aligned" (no pun intended) with this? I'm not necessarily against this, but equally I don't think it's a big problem having a few 4K pages at the end of the memory region (especially since THP is not a guarantee anyway). And we already know that internal customers often use non 2MiB multiples.

HugePageConfig::Transparent => 2,
HugePageConfig::Hugetlbfs2M => 2,
};

@@ -59,11 +65,20 @@ impl HugePageConfig {
/// create a mapping backed by huge pages as described by this [`HugePageConfig`].
pub fn mmap_flags(&self) -> libc::c_int {
match self {
HugePageConfig::None => 0,
HugePageConfig::None | HugePageConfig::Transparent => 0,
HugePageConfig::Hugetlbfs2M => libc::MAP_HUGETLB | libc::MAP_HUGE_2MB,
}
}

/// Returns the flags required to pass to [libc::madvise], after allocating anonymous guest memory.
/// Note: returning [libc::MADV_NORMAL] might skip the call to `madvise` entirely.
pub fn madvise_flags(&self) -> libc::c_int {
match self {
HugePageConfig::Transparent => libc::MADV_HUGEPAGE,
HugePageConfig::None | HugePageConfig::Hugetlbfs2M => libc::MADV_NORMAL,
}
}

/// Returns `true` iff this [`HugePageConfig`] describes a hugetlbfs-based configuration.
pub fn is_hugetlbfs(&self) -> bool {
matches!(self, HugePageConfig::Hugetlbfs2M)
@@ -72,7 +87,7 @@ impl HugePageConfig {
/// Gets the page size in bytes of this [`HugePageConfig`].
pub fn page_size(&self) -> usize {
match self {
HugePageConfig::None => 4096,
HugePageConfig::None | HugePageConfig::Transparent => 4096,
HugePageConfig::Hugetlbfs2M => 2 * 1024 * 1024,
}
}
@@ -81,7 +96,7 @@ impl HugePageConfig {
impl From<HugePageConfig> for Option<memfd::HugetlbSize> {
fn from(value: HugePageConfig) -> Self {
match value {
HugePageConfig::None => None,
HugePageConfig::None | HugePageConfig::Transparent => None,
HugePageConfig::Hugetlbfs2M => Some(memfd::HugetlbSize::Huge2MB),
}
}
51 changes: 42 additions & 9 deletions src/vmm/src/vstate/memory.rs
@@ -6,6 +6,7 @@
// found in the THIRD-PARTY file.

use std::fs::File;
use std::io;
use std::io::SeekFrom;
use std::ops::Deref;
use std::sync::{Arc, Mutex};
@@ -22,12 +23,12 @@ pub use vm_memory::{
};
use vm_memory::{GuestMemoryError, GuestMemoryRegionBytes, VolatileSlice, WriteVolatile};

use crate::DirtyBitmap;
use crate::arch::host_page_size;
use crate::logger::error;
use crate::utils::u64_to_usize;
use crate::vmm_config::machine_config::HugePageConfig;
use crate::vstate::vm::{KvmVm, VmError};
use crate::{DirtyBitmap, warn_unrestricted};

/// Type of GuestRegionMmap.
pub type GuestRegionMmap = vm_memory::GuestRegionMmap<Option<AtomicBitmap>>;
@@ -528,6 +529,7 @@ pub fn create(
mmap_flags: libc::c_int,
file: Option<File>,
track_dirty_pages: bool,
madvise_flags: libc::c_int,
) -> Result<Vec<GuestRegionMmap>, MemoryError> {
let mut offset = 0;
let file = file.map(Arc::new);
@@ -559,6 +561,18 @@ pub fn create(
start,
)
.ok_or(MemoryError::VmMemoryError)
.inspect(|region| {
if madvise_flags != libc::MADV_NORMAL {
// SAFETY: The referenced memory was just mapped.
let ret = unsafe {
libc::madvise(region.as_ptr().cast(), region.size(), madvise_flags)
};
if ret != 0 {
let e = io::Error::last_os_error();
warn_unrestricted!("Madvise call failed for guest memory: {e}");
}
}
})
})
.collect::<Result<Vec<_>, _>>()
}
@@ -577,6 +591,7 @@ pub fn memfd_backed(
libc::MAP_SHARED | huge_pages.mmap_flags(),
Some(memfd_file),
track_dirty_pages,
huge_pages.madvise_flags(),
)
}

@@ -591,6 +606,7 @@ pub fn anonymous(
libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | huge_pages.mmap_flags(),
None,
track_dirty_pages,
huge_pages.madvise_flags(),
)
}

@@ -600,6 +616,7 @@ pub fn snapshot_file(
file: File,
regions: impl Iterator<Item = (GuestAddress, usize)>,
track_dirty_pages: bool,
huge_pages: HugePageConfig,
) -> Result<Vec<GuestRegionMmap>, MemoryError> {
let regions: Vec<_> = regions.collect();
let memory_size = regions
@@ -619,6 +636,7 @@ pub fn snapshot_file(
libc::MAP_PRIVATE,
Some(file),
track_dirty_pages,
huge_pages.madvise_flags(),
)
}

@@ -951,8 +969,13 @@ mod tests {
file.write_all(&vec![0x42u8; page_size]).unwrap();

let regions = vec![(GuestAddress(0), page_size)];
let guest_regions =
snapshot_file(file, regions.into_iter(), dirty_page_tracking).unwrap();
let guest_regions = snapshot_file(
file,
regions.into_iter(),
dirty_page_tracking,
HugePageConfig::None,
)
.unwrap();
assert_eq!(guest_regions.len(), 1);
guest_regions.iter().for_each(|region| {
assert_eq!(region.bitmap().is_some(), dirty_page_tracking);
@@ -973,7 +996,8 @@ mod tests {
(GuestAddress(0x10000), page_size),
(GuestAddress(0x20000), page_size),
];
let guest_regions = snapshot_file(file, regions.into_iter(), false).unwrap();
let guest_regions =
snapshot_file(file, regions.into_iter(), false, HugePageConfig::None).unwrap();
assert_eq!(guest_regions.len(), 3);
}

@@ -985,7 +1009,7 @@ mod tests {
file.write_all(&vec![0x42u8; page_size]).unwrap();

let regions = vec![(GuestAddress(0), 2 * page_size)];
let result = snapshot_file(file, regions.into_iter(), false);
let result = snapshot_file(file, regions.into_iter(), false, HugePageConfig::None);
assert!(matches!(result.unwrap_err(), MemoryError::OffsetTooLarge));
}

@@ -1175,8 +1199,15 @@ mod tests {
let mut memory_file = TempFile::new().unwrap().into_file();
guest_memory.dump(&mut memory_file).unwrap();

let restored_guest_memory =
into_region_ext(snapshot_file(memory_file, memory_state.regions(), false).unwrap());
let restored_guest_memory = into_region_ext(
snapshot_file(
memory_file,
memory_state.regions(),
false,
HugePageConfig::None,
)
.unwrap(),
);

// Check that the region contents are the same.
let mut restored_region = vec![0u8; page_size * 2];
@@ -1240,8 +1271,9 @@ mod tests {
.unwrap();

// We can restore from this because this is the first dirty dump.
let restored_guest_memory =
into_region_ext(snapshot_file(file, memory_state.regions(), false).unwrap());
let restored_guest_memory = into_region_ext(
snapshot_file(file, memory_state.regions(), false, HugePageConfig::None).unwrap(),
);

// Check that the region contents are the same.
let mut restored_region = vec![0u8; region_size];
@@ -1465,6 +1497,7 @@ mod tests {
memory_file,
std::iter::once((GuestAddress(0), 2 * page_size)),
false,
HugePageConfig::None,
)
.unwrap(),
);