[svsm-devel] X86-64 Syscall API/ABI Document

Mon Jun 24 04:50:31 CEST 2024

> -----Original Message-----
> From: Jörg Rödel <jroedel at suse.de>
> Sent: Friday, June 21, 2024 6:09 PM
> To: Dong, Chuanxiao <chuanxiao.dong at intel.com>
> Cc: Jörg Rödel <joro at 8bytes.org>; svsm-devel at coconut-svsm.dev
> Subject: Re: [svsm-devel] X86-64 Syscall API/ABI Document
> 
> Hi Chuanxiao,
> 
> Thanks for your feedback and questions, please see my comments inline.
> 
> On Fri, Jun 21, 2024 at 08:35:03AM +0000, Dong, Chuanxiao wrote:
> > Thank you for sharing the design document! It has greatly enhanced our understanding of the user
> mode design philosophy. Here are some questions and comments from our side:
> >
> > THREAD_CREATE:
> > a) From the input parameters, it seems that it only supports creating
> > a thread on the current CPU. Is there a plan to allow creating a thread
> > on a remote CPU? Or should we avoid this case?
> > b) Is there any plan to add CPU affinity to threads?
> 
> There will be no direct concept of CPU affinity for threads. Affinity is implemented via event sources.
> Each event source is either global or bound to a specific CPU. If a thread waits on a CPU-bound event
> source and is woken up, it will get to run on the CPU the event is bound to.
> 
> So, for example, if user-space acquired an object representing a specific VCPU and waits for events on
> that object, the thread will run on the CPU the VCPU object is bound to, which is the same CPU as
> where the event happened.

I see. So, still taking the VCPU as an example: the user-space process first uses THREAD_CREATE to create a vcpu thread, which initially runs on the current CPU. The vcpu thread then uses the VCPU_CREATE syscall to get a vcpu_obj and calls WAIT_FOR_EVENT(vcpu_obj), which schedules it onto the CPU the vcpu_obj is bound to and then performs the VM enter on the right CPU.
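
A minimal sketch of the placement rule described above, in Rust. The names `EventSource` and `cpu_after_wake` are hypothetical and not taken from the design document. What happens for a global source is not stated in this thread; the sketch assumes the thread simply stays on its current CPU.

```rust
/// Where an event source delivers its wakeups (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum EventSource {
    /// Not tied to any CPU.
    Global,
    /// Bound to one CPU, for example a VCPU object.
    CpuBound(usize),
}

/// CPU a thread runs on after being woken by `source`.
/// Assumption: a global source leaves the thread on its current CPU.
pub fn cpu_after_wake(current_cpu: usize, source: EventSource) -> usize {
    match source {
        EventSource::Global => current_cpu,
        EventSource::CpuBound(cpu) => cpu,
    }
}
```

With this model, a vcpu thread created on CPU 0 that waits on a vcpu_obj bound to CPU 3 resumes on CPU 3, which is where the VM enter would then happen.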

> 
> 
> > BATCH:
> > a) Is the user_data in SysCallIn an identifier, to pair the SysCallOut
> > and SysCallIn which have the same value for the user_data? Would like
> > to better understand the usage of the user_data.
> 
> User-data is just a piece of data forwarded from SysCallIn to SysCallOut to help user-space associate
> the request with its internal state. The data is not interpreted by the COCONUT kernel in any way.

When the BATCH syscall returns, if the user_data in a SysCallOut equals the user_data in the corresponding SysCallIn, it means the kernel has forwarded the user_data to the SysCallOut, which indicates that this sub-syscall has completed. Is this understanding correct?
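
To make the user_data usage concrete, here is a hedged sketch in Rust. Only the names SysCallIn, SysCallOut and user_data come from the document; the other fields, `mock_batch` and `failed_requests` are hypothetical. The mock returns results in reverse order only to show that matching by user_data does not depend on position; whether the real kernel reorders results is not stated in this thread.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug)]
pub struct SysCallIn {
    pub call_nr: u64,   // hypothetical field
    pub arg: u64,       // hypothetical field
    pub user_data: u64, // opaque to the kernel
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct SysCallOut {
    pub result: i64,    // hypothetical field, negative means error
    pub user_data: u64, // copied unchanged from the matching SysCallIn
}

/// Stand-in for the kernel side of BATCH: it never interprets `user_data`,
/// it only copies it. Call number 0 fails here purely for illustration.
pub fn mock_batch(requests: &[SysCallIn]) -> Vec<SysCallOut> {
    requests
        .iter()
        .rev()
        .map(|r| SysCallOut {
            result: if r.call_nr == 0 { -1 } else { 0 },
            user_data: r.user_data,
        })
        .collect()
}

/// User-space side: use `user_data` as a key into its own bookkeeping.
pub fn failed_requests<'a>(
    pending: &HashMap<u64, &'a str>,
    outs: &[SysCallOut],
) -> Vec<&'a str> {
    outs.iter()
        .filter(|o| o.result < 0)
        .filter_map(|o| pending.get(&o.user_data).copied())
        .collect()
}
```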

> 
> > OPEN_PHYSMEM:
> > a) I suppose the SVSM kernel memory range is excluded, right?
> 
> Yes, it definitely is.
> 
> > b) Is there any mechanism to let the user-space process know the valid
> > memory range represented by the object handle? E.g., the end of the
> > memory addresses
> 
> No, not yet, we need to add a way to inform user-space about the valid memory ranges.
> 
> In general there also needs to be a limit on how much user-space can map, as we need to avoid that
> user-space just maps everything forever.
> 
> There is also a requirement that the COCONUT kernel can not allow shared/private state changes of
> pages which are currently mapped to user-space, so any physmem mapping needs to be temporary.

Makes sense. So SET_MEM_STATE can fail to change a page's shared/private state while the page is mapped via MMAP, until the page is unmapped via MUNMAP.
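
A small Rust sketch of that rule, under the assumption that the kernel keeps a per-page count of user-space mappings. `PhysMemTracker`, `MemStateError` and the method names are hypothetical; the real bookkeeping may look quite different.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum MemStateError {
    /// The page is currently mapped into user-space.
    Mapped,
}

/// Tracks how many user-space mappings exist per physical page (hypothetical).
#[derive(Default)]
pub struct PhysMemTracker {
    map_counts: HashMap<u64, usize>,
}

impl PhysMemTracker {
    pub fn mmap(&mut self, paddr: u64) {
        *self.map_counts.entry(paddr).or_insert(0) += 1;
    }

    pub fn munmap(&mut self, paddr: u64) {
        let now_unmapped = match self.map_counts.get_mut(&paddr) {
            Some(count) => {
                *count -= 1;
                *count == 0
            }
            None => false,
        };
        if now_unmapped {
            self.map_counts.remove(&paddr);
        }
    }

    /// Shared/private state changes are refused while the page is mapped.
    pub fn set_mem_state(&self, paddr: u64) -> Result<(), MemStateError> {
        if self.map_counts.contains_key(&paddr) {
            Err(MemStateError::Mapped)
        } else {
            Ok(())
        }
    }
}
```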

> 
> > CAPABILITIES:
> > a) For SEV, VMPL0 is svsm and VM starts from VMPL1, and for TDX, vm-id
> > 0 is svsm and VM starts from vm-id 1. From this point of view, looks
> > like SEV and TDX are similar. So maybe can use a unified VM index
> > bitmap format for them?
> 
> Yes, that is the idea. I want to keep the interface as common as possible between TDX and SEV.
> 
> > VM_OPEN:
> > a) If one VM index is opened by a user-space process, the same index
> > cannot be opened again by another user-space process which guarantees
> > that only one process can operate with this VM object handle. But
> > all the threads created by this user-space process can operate with this
> > VM object handle. Is this a correct understanding?
> 
> Yes, this is correct. Though what is missing so far is the ability to pass object handles to newly created
> processes. This is important for device emulations like the TPM or serial port. The MMIO or IOIO object
> handles have to be created from the VM object handle, but the handling of actual MMIO and IOIO
> events needs to happen in separate processes. So there needs to be a way to forward handles on EXEC.

Yes. Since this involves one process using object handles created by another process, it appears that there is no policy restricting syscall operations on an object handle exclusively to the process that created it, and that object handles are globally unique. Is this correct?

> 
> > VM_CAPABILITIES:
> > a) Should it report the VM memory range? Or always assume VM can see the
> > entire range supported by the ObjHandle created by OPEN_PHYSMEM?
> 
> This depends on how the memory ranges given to specific VMs are going to be defined. It should
> probably be up to user-space to assign memory to a given VM. I need to think more about that.
> 
> > MMIO/IOIO_EVENT:
> > a) Is there any way to know the length of the mappable area provided
> > by the evt_obj? Or the length is some fixed size defined by the kernel
> > mode?
> 
> I consider these architecturally defined, the MMIO area range is all of physical memory, the IOIO range
> is the whole IO port range.

Does this mean that MMIO_EVENT can take any valid range as input parameters, no matter whether it is MMIO or RAM, and that IOIO_EVENT can take any valid IO port range as input parameters?

I would also like to understand the mappable area provided by an evt_obj. Since it is a mappable area, I suppose it can be mapped with MMAP(). How is this mappable area meant to be used?

> 
> > b) Would like to have a better understanding of the usage. We assume
> > that, the MMIO/IOIO range set by the syscalls are monitored by some
> > device model thread in the user space. When a vmexit occurs in the VM due to
> > an MMIO/IOIO access, the kernel vcpu thread decodes the MMIO/IOIO
> > address and if the address is located in the range set by
> > these syscalls, the kernel vcpu thread wakes up the monitoring device
> > model thread and then waits. The device model thread in the user mode
> > will get the address/data via the mappable area and do emulation.
> > After the emulation is completed, the device model thread wakes up the
> > kernel vcpu thread. If the decoding part is not in the kernel-space
> > but in the user-space, then the vcpu should go back to the user mode to
> > decode the instruction and wait to be woken up by the device model thread.
> > Is this the expected flow for handling MMIO/IOIO in the user mode?
> 
> As I have written above, the device models should run in a separate process. When the COCONUT
> kernel is woken up with an MMIO or IOIO event from a guest, it decodes the instruction and, based on
> the address which is accessed, it will wake up the corresponding device emulation process.

So the MMIO_EVENT/IOIO_EVENT syscalls tell the kernel that the range specified in the syscall is emulated by some user-space process. If an MMIO/IOIO address is located in this range, the corresponding user-space process should be woken up, right?

BTW, what we thought is that when an MMIO/IOIO event happens, the vcpu thread exits from guest mode and decodes the instruction to get the MMIO address and data (if it is an MMIO write), then wakes up the device model process. I think this matches what you described here. Please let me know if it does not.
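
The lookup step in this flow could be sketched as follows in Rust. `MmioDispatcher`, `HandlerId` and both method names are hypothetical, and rejecting overlapping ranges is an assumption of this sketch, not something the document specifies. The addresses in the usage note are illustrative only.

```rust
/// Identifies the device emulation process to wake (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct HandlerId(pub u32);

struct MmioRange {
    start: u64,
    end: u64, // exclusive
    handler: HandlerId,
}

/// Maps guest-physical ranges registered via MMIO_EVENT to handlers.
#[derive(Default)]
pub struct MmioDispatcher {
    ranges: Vec<MmioRange>,
}

impl MmioDispatcher {
    /// Register `[start, end)`; returns false for empty or overlapping ranges.
    pub fn register(&mut self, start: u64, end: u64, handler: HandlerId) -> bool {
        let overlaps = self.ranges.iter().any(|r| start < r.end && r.start < end);
        if start >= end || overlaps {
            return false;
        }
        self.ranges.push(MmioRange { start, end, handler });
        true
    }

    /// After decoding the faulting instruction, find who emulates `addr`.
    pub fn lookup(&self, addr: u64) -> Option<HandlerId> {
        self.ranges
            .iter()
            .find(|r| addr >= r.start && addr < r.end)
            .map(|r| r.handler)
    }
}
```

For example, after `register(0xfed4_0000, 0xfed4_5000, HandlerId(1))`, a decoded access to 0xfed4_0080 resolves to `HandlerId(1)`, while an address outside every registered range yields `None` and would need some other handling.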

> 
> That process then handles the request and updates device and CPU state.

How does the device model process get the details of the request? I guess it is via the mappable area provided by the evt_obj, but I am not sure, which is why I raised the above questions about the mappable area.

Is there any consideration for updating CPU state in the device model process? Taking an MMIO read as an example, the data read may be written back to a vCPU register or to the VM's memory (depending on the instruction). My understanding is that this is part of instruction decoding. The device model process can emulate the MMIO/IOIO request and provide the emulated data to the process which did the instruction decoding, and that process can write the data back via the instruction decoder.

> Upon return to kernel mode COCONUT will directly go back into executing the guest OS.
> 
> I think this can be done without or with very minimal involvement of the main VM management
> process.

Does the "main VM management process" refer to the vCPU threads?

> 
> > SET_MEM_STATE:
> > a) Any consideration for using paddr + page_size instead of
> > start_paddr + end_paddr as the input parameters? Using start_paddr +
> > end_paddr may be more efficient to set state for a memory region.
> 
> The SET_MEM_STATE input is defined to also suit AMD's use-case. On AMD the PVALIDATE instruction
> needs a page-size specified, so there needs to be a way to pass that information in. This is difficult to do
> with a range-based interface.
> 
> Also, errors are harder to propagate on a range-based input, as validation can fail in the middle of a
> range and user-space needs a consistent picture of what was validated.
> 
> For validating a whole range the BATCH system call can be used to send as many SET_MEM_STATE
> requests as needed in one invocation.

Got it. Then using paddr + page_size makes more sense.
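
As an illustration of how user-space could cover a whole range with per-page requests before handing them to BATCH, here is a Rust sketch. `SetMemStateReq`, `PageSize` and `split_range` are hypothetical names; the sketch assumes 4 KiB and 2 MiB page sizes and 4 KiB-aligned inputs, and it prefers a 2 MiB request whenever alignment and remaining length allow.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PageSize {
    Size4K,
    Size2M,
}

impl PageSize {
    pub fn bytes(self) -> u64 {
        match self {
            PageSize::Size4K => 0x1000,
            PageSize::Size2M => 0x20_0000,
        }
    }
}

/// One SET_MEM_STATE request: a page address plus its size (hypothetical layout).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SetMemStateReq {
    pub paddr: u64,
    pub page_size: PageSize,
}

/// Split `[start, end)` into per-page requests suitable for one BATCH call.
/// Assumes `start` and `end` are 4 KiB aligned.
pub fn split_range(start: u64, end: u64) -> Vec<SetMemStateReq> {
    let big = PageSize::Size2M.bytes();
    let mut reqs = Vec::new();
    let mut paddr = start;
    while paddr < end {
        let page_size = if paddr % big == 0 && end - paddr >= big {
            PageSize::Size2M
        } else {
            PageSize::Size4K
        };
        reqs.push(SetMemStateReq { paddr, page_size });
        paddr += page_size.bytes();
    }
    reqs
}
```

Because every request names exactly one page, a failure in the middle of the batch identifies precisely which page was not validated, which is the error-reporting property mentioned above.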

> 
> > VCPU_CREATE:
> > a) Regarding "The VCPU can be run with the WAIT_FOR_EVENT() system
> > call",  the expectation is that the WAIT_FOR_EVENT(vcpu_obj) will
> > perform vmenter in the kernel mode, is this understanding correct?
> 
> Yes, this is correct. I had a RUN system call specified initially, but found it redundant with the
> WAIT_FOR_EVENT call in this case.
> 
> > GET/SET_STATE:
> > a) The type struct VmsaGpa is defined as AMD only. Does this mean the
> > user mode should have knowledge about if the platform is SEV or TDX?
> > Or there will be a per-architecture user mode VMM executable binary?
> 
> My hope is that we can have one binary which is mostly common between TDX and SEV-SNP. The binary has
> to be aware of the platform as there will always be platform-specific handling required to some degree,
> but a lot of the main handling infrastructure can be common, I think.

An additional question is how to make the binary aware of the platform: through some compile option or some syscall?

> 
> > b) For TDX, all the VCPU states are stored in the VMCS. Probably need
> > to introduce a new type for TDX to set/get vcpu state:
> 
> > 	Type		Data Structure		Description
> > 	--------------------------------------------------------------------------------------------------------
> > 	VMCS		Struct VmcsField	The VMCS field encoding and the
> > 						corresponding data (Intel Only)
> > 	Struct VmcsField {
> > 		encoding: u32,		// VMCS field encoding
> > 		data: u64,		// VMCS field data
> > 	}
> > 	The GET/SET_STATE syscall can be batched so that multiple VCPU states can be set/get at one time.
> 
> This depends, a lot of state in the VMCS also exists in the VMSA, for those there will be common state
> definitions, e.g. for:
> 
> 	* GPRs
> 	* FPU/XMM/YMM state
> 	* Control registers
> 	* Architectural MSRs (if needed)
> 	* APIC and IRQ injection state
> 
> There are probably some VMCS specific fields which are not covered by common state abstractions, for
> those TDX specific state calls are fine.

I suppose that for GET/SET_STATE(VMSA_GPA), the kernel will return all the CPU state defined in the VMSA page by copying data from one page to another. Is this correct? For TDX, this is more expensive because reading one VMCS field requires one tdcall to the TDX module. For example, if user mode only wants to know CR0, it is better for TDX to send a single tdcall to read the CR0 field instead of sending many tdcalls to read all the CPU state from the VMCS fields to construct the VMSA page.
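
The cost argument can be illustrated with a mock in Rust. The `VmcsField` layout follows the proposal quoted above; `MockVmcs`, its `reads` counter (standing in for tdcalls) and `get_state` are hypothetical. The encoding 0x6800 used in the usage note is meant as the guest CR0 field and should be checked against the Intel SDM before relying on it.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct VmcsField {
    pub encoding: u32, // VMCS field encoding
    pub data: u64,     // VMCS field data
}

/// Mock VMCS: every field read counts as one (simulated) tdcall.
pub struct MockVmcs {
    fields: HashMap<u32, u64>,
    pub reads: usize,
}

impl MockVmcs {
    pub fn new(fields: &[(u32, u64)]) -> Self {
        MockVmcs {
            fields: fields.iter().copied().collect(),
            reads: 0,
        }
    }

    fn read(&mut self, encoding: u32) -> Option<u64> {
        self.reads += 1;
        self.fields.get(&encoding).copied()
    }
}

/// Field-granular GET_STATE: only the requested encodings are read.
pub fn get_state(vmcs: &mut MockVmcs, encodings: &[u32]) -> Vec<Option<VmcsField>> {
    encodings
        .iter()
        .map(|&encoding| vmcs.read(encoding).map(|data| VmcsField { encoding, data }))
        .collect()
}
```

Asking only for 0x6800 on a mock holding three fields leaves `reads` at 1, whereas assembling a full VMSA-style page would cost one read per field.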

> 
> 
> > 	c) The vLAPIC is emulated in the kernel mode. In this case, is
> > 	it necessary to allow user mode to get the vLAPIC state? The
> > 	potential use case is for the user mode to emulate vCPUID 01
> > 	EBX which requires the vLAPIC ID.
> 
> At least a way is needed for device emulation processes to inject an IRQ into the guest OS.

Indeed, we can think more about a way for the device process to inject an IRQ. SET_STATE might be an option.

> Full APIC access might not be needed, the vLAPIC ID can also be propagated via a VCPU GET_STATE call or
> something similar.

Agree.

> 
> > Possible new syscalls for Class0: to help the user mode to execute some privilege instructions:
> > a) The user mode may need to read/write L1's MSR to emulate the vMSR
> > for VM, but rdmsr/wrmsr instructions are not allowed in the user mode
> > for both SEV and TDX. The RDMSR/WRMSR syscalls can help the user mode
> > to achieve this:
> 
> Can you elaborate a bit on that? Which L1 MSRs need to be accessible and for what reason? In general I
> do not think it is a good idea to allow user-space accessing arbitrary MSRs.

Sure. As some of the L1 MSRs are not used by the SVSM, we "pass through" such MSRs to the VM to simplify the MSR emulation. When the VM accesses such an MSR, the vmexit is handled by reading from or writing to the corresponding L1 MSR.
Here is the list of "passthrough" MSRs we made for the TDP guest: https://github.com/intel-staging/td-partitioning-svsm/blob/svsm-tdp/kernel/src/tdx/vmsr.rs#L595

> 
> > b) The user mode may need to access some MMIO or IO port emulated by
> > L0 to emulate certain MMIO/IOIO event for VM. The accessing in user
> > mode can be done via MOV/IOIO instructions, but it can trigger #VC/#VE
> > from the user mode and requires instruction decoding/emulating. To
> > simplify, the MMIO/IOIO access can be done via the enlightened way
> > (VMGEXIT for SEV and TDCALL for TDX). But for TDX, the TDCALL is a
> > privileged command which is not allowed in the user mode. So it has to
> > be done in the kernel mode. If using the enlightened way for the user
> > mode to access MMIO/IOIO is preferred, then new syscalls are necessary
> > for TDX. Shall we?
> 
> Yes, you are right, a way for user-space to forward requests to the host is currently completely missing.
> 
> I wonder what the best way here is, but routing these through user-space is maybe not the best
> approach, or do you see cases where user-space won't forward requests unmodified?

I don't see the need for routing such MMIO/IOIO requests from kernel-space to user-space if we allow kernel-space to forward these MMIO/IOIO requests directly to the host.

> 
> Other options would be:
> 
> 	* Let the COCONUT kernel forward MMIO and IOIO requests for
> 	  which it has no handler to the host and use the result for
> 	  emulation
> 	* Re-inject these events as #VC/#VE exceptions into the guest
> 	  OS and let it deal with it.
> 
> The right approach probably also depends on whether the OS runs in enlightened or paravisor mode.

Is the first option workable for both enlightened and paravisor mode? If so, the first option looks better, as it is more efficient.

> 
> > Questions for ObjHandle:
> > a) Although the ObjHandle seems a common type, a given ObjHandle
> > should only be used as an input for specific syscalls. For example,
> > the ObjHandle returned by OPEN cannot be used as input for the
> > VM_CAPABILITIES. Is this a correct understanding?
> 
> Yes, that is correct. The system call design follows an "everything is an object" model with the system
> call classes being traits for these objects. Every object only implements a subset of system call classes.
> 
> > b) If a) is true, sounds like for a given ObjHandle, the user should
> > know how the ObjHandle is created, otherwise it is hard to
> > distinguish which syscalls can use this ObjHandle as an input?
> 
> Yes, user-space knows what type an object handle is of by the way it created it. VM objects can only
> be created by VM_OPEN, VCPU objects only by VCPU_CREATE, and so on.
> 
> (VCPU_CREATE should be renamed to VCPU_OPEN, as VCPUs conceptually already exist, I will update
> the document).

Got it. This renaming sounds good to me.
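
For reference, the "handle type is fixed by how it was created" rule could be modelled like this in Rust. `HandleTable`, `ObjKind` and `SysError` are hypothetical names; the actual design expresses system call classes as traits on objects, which this sketch simplifies to a kind tag checked at syscall entry.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ObjHandle(pub u32);

/// Kind of object behind a handle, fixed by the creating syscall.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ObjKind {
    PhysMem, // from OPEN_PHYSMEM
    Vm,      // from VM_OPEN
    Vcpu,    // from VCPU_CREATE (to be renamed VCPU_OPEN)
}

#[derive(Debug, PartialEq)]
pub enum SysError {
    BadHandle,
    WrongObjectType,
}

#[derive(Default)]
pub struct HandleTable {
    next: u32,
    kinds: HashMap<ObjHandle, ObjKind>,
}

impl HandleTable {
    /// Called by the creating syscall; the kind never changes afterwards.
    pub fn create(&mut self, kind: ObjKind) -> ObjHandle {
        let handle = ObjHandle(self.next);
        self.next += 1;
        self.kinds.insert(handle, kind);
        handle
    }

    /// Called on syscall entry, e.g. VM_CAPABILITIES expects `ObjKind::Vm`.
    pub fn check(&self, handle: ObjHandle, expected: ObjKind) -> Result<(), SysError> {
        match self.kinds.get(&handle) {
            None => Err(SysError::BadHandle),
            Some(kind) if *kind == expected => Ok(()),
            Some(_) => Err(SysError::WrongObjectType),
        }
    }
}
```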

Thanks
Chuanxiao

> 
> Regards,
> 
> 	Joerg
> 
> --
> Jörg Rödel
> jroedel at suse.de
> 
> SUSE Software Solutions Germany GmbH
> Frankenstraße 146
> 90461 Nürnberg
> Germany
> https://www.suse.com/
> 
> Geschäftsführer: Ivo Totev, Andrew McDonald, Werner Knoblich (HRB 36809, AG Nürnberg)