[svsm-devel] [EXTERNAL] Re: EDK2 CAA Page Fragmented Allocation

Mon May 19 18:07:49 CEST 2025

Gerd wrote:

> Well.  Linux relocates the CAA page, and I don't think there is some way for UEFI runtime services to use that.
> First because UEFI can't figure the address of the Linux CAA page.  And second because linux wouldn't map
> the CAA page into the efi runtime service sandbox anyway.

This is why MSR C001_F000 is defined in the SVSM specification (see Section 4.2 - Post Boot).  UEFI runtime services can read that MSR, which will result in a #VC being delivered to Linux.  The Linux #VC handler should resolve the #VC by supplying the address of the CAA page.  This will enable UEFI runtime services to obtain the address of the Linux CAA page so it can make SVSM calls.
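
For illustration only, a toy model of that flow, written in Python so it can run anywhere.  It does not touch real hardware: `VcException`, `ToyGuestOs`, `rdmsr` and `runtime_service_find_caa` are hypothetical names invented for this sketch, and the only detail taken from the text above is the MSR number.  It assumes the OS #VC handler answers a read of that MSR with its current CAA address.

```python
# Toy model of the SVSM CAA MSR flow; nothing here touches real hardware.
SVSM_CAA_MSR = 0xC001F000  # MSR C001_F000 as named in the message above


class VcException(Exception):
    """Stands in for the #VC raised when the MSR read is intercepted."""

    def __init__(self, msr):
        super().__init__(f"#VC on RDMSR {msr:#x}")
        self.msr = msr


class ToyGuestOs:
    """Hypothetical OS side: owns the CAA page and the #VC handler."""

    def __init__(self, caa_gpa):
        self.caa_gpa = caa_gpa

    def handle_vc(self, exc):
        # Resolve the intercepted read by supplying the current CAA address.
        if exc.msr == SVSM_CAA_MSR:
            return self.caa_gpa
        raise NotImplementedError(f"unhandled MSR {exc.msr:#x}")


def rdmsr(guest_os, msr):
    """Model of RDMSR from runtime-services code: the read traps and
    the OS #VC handler supplies the result."""
    try:
        raise VcException(msr)
    except VcException as exc:
        return guest_os.handle_vc(exc)


def runtime_service_find_caa(guest_os):
    """What a runtime service would do before making an SVSM call."""
    return rdmsr(guest_os, SVSM_CAA_MSR)
```

Because the address is looked up on every read, the model keeps working after the OS relocates its CAA page, which is the property the MSR is meant to provide.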

It seems that Linux would have to map at least the #VC handler into the UEFI runtime sandbox because there is the potential for UEFI runtime services to cause other #VC exceptions, which must not be fatal to the execution of the UEFI runtime.  If Linux must do that, then it seems like a small additional step to map the CAA page into that sandbox as well.

>     - Is it possible to emulate MMIO devices by letting page access
>       trap into SVSM?

This requires enabling REFLECT_VC in the guest VMSA, which would cause all #VC exceptions in Linux to be routed instead to VMPL 0.  This effectively eliminates the SVSM model (becoming a paravisor model) and that change would be exceptionally intrusive to the operation of both VMPLs.  I would not recommend pursuing this path just to fix UEFI runtime services.

If enabling SVSM calls from the UEFI runtime sandbox is completely impractical, then it would theoretically be possible to have Linux issue an SVSM call immediately before calling UEFI runtime services, where this call would enable REFLECT_VC (no such call is defined today, but it could be), and then have Linux issue another SVSM call immediately after the UEFI runtime services call returns, where this call would disable REFLECT_VC again.  This would require the SVSM to be prepared to handle every #VC that could be generated by UEFI runtime services, which is potentially a lot of code.  It would also require KVM to be aware that when a lower VMPL exits due to #VC, then VMPL 0 should be scheduled for execution (and I don't think KVM knows how to do this today).
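
The enable/disable bracketing described above can be sketched as a toy model.  Everything in it is hypothetical: `ToySvsm`, `set_reflect_vc`, `deliver_vc` and `call_runtime_service` are invented for illustration, and as noted no such SVSM call is defined today.  The sketch only shows the ordering (enable, call, always disable) and where a #VC would be routed in each state.

```python
from contextlib import contextmanager


class ToySvsm:
    """Hypothetical SVSM model; the REFLECT_VC toggle call is an assumption."""

    def __init__(self):
        self.reflect_vc = False
        self.handled = []

    def set_reflect_vc(self, enabled):
        self.reflect_vc = enabled

    def deliver_vc(self, reason):
        """Route a #VC: to the SVSM at VMPL 0 if reflected, else to the guest."""
        if self.reflect_vc:
            self.handled.append(reason)
            return "svsm"
        return "guest"


@contextmanager
def reflect_vc_enabled(svsm):
    svsm.set_reflect_vc(True)
    try:
        yield svsm
    finally:
        # Always restore normal routing, even if the runtime service fails.
        svsm.set_reflect_vc(False)


def call_runtime_service(svsm, service):
    """Bracket one runtime-services call with the two hypothetical SVSM calls."""
    with reflect_vc_enabled(svsm):
        return service(svsm)
```

In this model a #VC raised inside `call_runtime_service` lands in the SVSM, while one raised before or after the call goes to the guest handler as usual.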

-Jon

-----Original Message-----
From: Gerd Hoffmann <kraxel at redhat.com> 
Sent: Monday, May 19, 2025 8:30 AM
To: Adam Dunlap <acdunlap at google.com>
Cc: coconut-svsm at lists.linux.dev; svsm-devel at coconut-svsm.dev
Subject: [EXTERNAL] Re: [svsm-devel] EDK2 CAA Page Fragmented Allocation

  Hi,

> I've been debugging a boot failure on VMs running under Coconut-SVSM 
> with 120 or more vCPUs. The guest kernel (on several distros) is 
> hanging while trying to allocate a 256MiB memblock out of "low" memory 
> for the SWIOTLB.
>
> While debugging this, I noticed that the memory map reported by EDK2 
> has a ton of entries. There are the expected reserved regions, but 
> then there's a section of about 2x #vcpus where there is 1 reserved 
> and 1 usable page alternating. It turns out the reserved pages are 
> allocated as CAA pages and then usable pages are allocated (and then
> freed) as VMSA pages at this[1] point in EDK2. Note that while these 
> pages are allocated many times over the various UEFI phases, it's only 
> the CAA pages from the first allocation that are getting leaked.

Doesn't match what I'm seeing here, using latest upstream edk2.

edk2 allocates three pages (when running under svsm).  Usually it will free the first, use the second as vmsa and the third as caa.  In case the vmsa page happens to be 2M aligned it'll instead use the first as vmsa, second as caa and free the third to work around a processor bug.
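
A minimal sketch of that selection rule, for illustration.  It is not the edk2 code: `pick_vmsa_and_caa` and the constants are hypothetical names, and the sketch assumes the three pages are contiguous 4K pages starting at a page-aligned `base`.

```python
PAGE_SIZE = 0x1000
SIZE_2MB = 0x200000


def pick_vmsa_and_caa(base):
    """Split a three-page allocation at `base` into (vmsa, caa, freed)."""
    if base % PAGE_SIZE != 0:
        raise ValueError("base must be page aligned")
    first, second, third = base, base + PAGE_SIZE, base + 2 * PAGE_SIZE
    if second % SIZE_2MB == 0:
        # The usual VMSA candidate is 2M aligned: shift down by one page.
        return first, second, third
    # Usual case: free the first page, VMSA is the second, CAA the third.
    return second, third, first
```

For example, `pick_vmsa_and_caa(0x10000000)` takes the usual path, while `pick_vmsa_and_caa(0x101FF000)` hits the aligned case because the second page sits at 0x10200000.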

Due to top-down allocation this packs the pages next to each other, so I see a single reserved memory block with two pages per processor (except boot processor).

> I had a few questions that I thought someone here might know the answer to:
> 1. Is UEFI supposed to keep these CAA pages allocated? I believe that 
> UEFI is supposed to be able to talk to the TPM post-ExitBootServices 
> and that would likely require CAA pages, but I might be missing 
> something.

Well.  Linux relocates the CAA page, and I don't think there is some way for UEFI runtime services to use that.  First because UEFI can't figure out the address of the Linux CAA page.  And second because Linux wouldn't map the CAA page into the EFI runtime service sandbox anyway.

UEFI runtime services having their own CAA page doesn't work either because the CAA page address is registered in SVSM, and once the Linux kernel starts using its own CAA page, the UEFI CAA page stops working.

For the TPM this should not be an issue.  There are no TPM runtime services; the Linux kernel talks to the (v)TPM using its own driver.

For the UEFI variable service I'm working on this /is/ a problem though.
I simply can't do SVSM protocol calls after ExitBootServices.  I guess my options are:

 (a) Find some other way to call into svsm.
     - Is it possible to emulate MMIO devices by letting page access
       trap into SVSM?
 (b) Write a linux driver for svsm efi variable access.

Hints or opinions anyone?

> 2. How are the other CAA pages being freed? The commit that adds the 
> allocation [2] does not add any corresponding frees

No clue.  Apparently the pages are allocated twice, once in PEI phase, once in DXE phase, and only the PEI phase allocations stick.  I had expected both to stay because they are allocated as Reserved (so they are not freed automatically at ExitBootServices) and -- as you already figured -- there is no explicit free call.

take care,
  Gerd