Commit Graph

54 Commits

Author SHA1 Message Date
Linus Torvalds
113691ce9f * Centralize global metadata infrastructure
* Use new TDX module features for exception suppression and RBP
    clobbering
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmeQO2IACgkQaDWVMHDJ
 krA/UQ/5Abo250b0p6dyez2EhDiKXVJ3v30iriiZvGT+mKDyxQQFvLYE+QDUfHz7
 rKjJLEaHKt0wnHkSmBX05RpSieek2oS2ySpfIDxXW2tMjsap5a5WqcvFyOW6NcKy
 uwim2tRX6/54kr76smGNdGEeLPV7GL/FTAoub8pt/TCfL/88m+qEsQOvYg7RNvrV
 j9lOyL/Tt791E855cTgb1qHHFBu/I9siUWU2RTBKUPo2ILYAlSPS/xRiDe8W5fzS
 kPQxtwa5RSvhLBsz4lVbABJFZ1DmvY5TqxS0MMKwh/Dfm0jBXeod7alvMXrfhddF
 ikeBVKPuoB3u044rIGpoPqk4nM6ICowqI+H36D/4/bzXfz4zxRlIoPS5WdzS6ICm
 23pS1MsoAUOYfWfJTgRtNvKJ7JKEhXgAZFVDWjgLF6kVVIwD0lu550fVTd0OIhaA
 mko4+RmaXx+OfumZV3HZsHauV39rCQOT4Ay5jNhPho73BFExOzK3m1fmIjf2nsKu
 8Pk+2bhPQV7LvBOgc42BKVMusP/IW4O3dyzS8NnD2YO25gw+xd1oUNAb3LFGg9OJ
 vJXLuxlcHe6PmH9TdSZ+Lwj31vpbFtGTX5AOAzVbaIyfIJTuDUmTY5BX9eTW0/N7
 iLiNaY713mapkDBbKqWBsiBi1SEVuDZyrixujUiMMrGUwXRCAH4=
 =nzuM
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 TDX updates from Dave Hansen:
 "Intel Trust Domain updates.

  The existing TDX code needs a _bit_ of metadata from the TDX module.
  But KVM is going to need a bunch more very shortly. Rework the
  interface with the TDX module to be more consistent and handle the new
  higher volume.

  The TDX module has added a few new features. The first is a promise
  not to clobber RBP under any circumstances. Basically the kernel now
  will refuse to use any modules that don't have this promise. Second,
  enable the new "REDUCE_VE" feature. This ensures that the TDX module
  will not send some silly virtualization exceptions that the guest had
  no good way to handle anyway.

   - Centralize global metadata infrastructure

   - Use new TDX module features for exception suppression and RBP
     clobbering"

* tag 'x86_tdx_for_6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation
  x86/virt/tdx: Switch to use auto-generated global metadata reading code
  x86/virt/tdx: Use dedicated struct members for PAMT entry sizes
  x86/virt/tdx: Use auto-generated code to read global metadata
  x86/virt/tdx: Start to track all global metadata in one structure
  x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better
  x86/tdx: Dump attributes and TD_CTLS on boot
  x86/tdx: Disable unnecessary virtualization exceptions
2025-01-24 05:58:31 -08:00
Kai Huang
6f5c71cc42 x86/virt/tdx: Require the module to assert it has the NO_RBP_MOD mitigation
Old TDX modules can clobber RBP in the TDH.VP.ENTER SEAMCALL.  However
RBP is used as frame pointer in the x86_64 calling convention, and
clobbering RBP could result in bad things like being unable to unwind
the stack if any non-maskable exceptions (NMI, #MC etc) happens in that
gap.

A new "NO_RBP_MOD" feature was introduced to more recent TDX modules to
not clobber RBP.  KVM will need to use the TDH.VP.ENTER SEAMCALL to run
TDX guests.  It won't be safe to run TDX guests w/o this feature.  To
prevent it, just don't initialize the TDX module if this feature is not
supported [1].

Note the bit definitions of TDX_FEATURES0 are not auto-generated in
tdx_global_metadata.h.  Manually define a macro for it in "tdx.h".

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/fc0e8ab7-86d4-4428-be31-82e1ece6dd21@intel.com/ [1]
Link: https://lore.kernel.org/all/76ae5025502c84d799e3a56a6fc4f69a82da8f93.1734188033.git.kai.huang%40intel.com
2024-12-18 14:36:02 -08:00
Kai Huang
fae43b24a6 x86/virt/tdx: Switch to use auto-generated global metadata reading code
Continue the process to have a centralized solution for TDX global
metadata reading.  Now that the new autogenerated solution is ready for
use, switch to it and remove the old one.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/all/fc025d1e13b92900323f47cfe9aac3157bf08ee7.1734188033.git.kai.huang%40intel.com
2024-12-18 14:36:02 -08:00
Kai Huang
6bfb77f489 x86/virt/tdx: Use dedicated struct members for PAMT entry sizes
Currently, the 'struct tdmr_sys_info_tdmr' which includes TDMR related
fields defines the PAMT entry sizes for TDX supported page sizes (4KB,
2MB and 1GB) as an array:

	struct tdx_sys_info_tdmr {
		...
		u16 pamt_entry_sizes[TDX_PS_NR];
	};

PAMT entry sizes are needed when allocating PAMTs for each TDMR.  Using
the array to contain PAMT entry sizes reduces the number of arguments
that need to be passed when calling tdmr_set_up_pamt().  It also makes
the code pattern like below clearer:

	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
					pamt_entry_size[pgsz]);
		tdmr_pamt_size += pamt_size[pgsz];
	}

However, the auto-generated metadata reading code generates a structure
member for each field.  The 'global_metadata.json' has a dedicated field
for each PAMT entry size, and the new 'struct tdx_sys_info_tdmr' looks
like:

	struct tdx_sys_info_tdmr {
		...
		u16 pamt_4k_entry_size;
		u16 pamt_2m_entry_size;
		u16 pamt_1g_entry_size;
	};

Prepare to use the autogenerated code by making the existing 'struct
tdx_sys_info_tdmr' look like the generated one.  When passing to
tdmrs_set_up_pamt_all(), build a local array of PAMT entry sizes from
the structure so the code to allocate PAMTs can stay the same.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/all/ccf46f3dacb01be1fb8309592616d443ac17caba.1734188033.git.kai.huang%40intel.com
2024-12-18 14:36:02 -08:00
Paolo Bonzini
04a7bc7316 x86/virt/tdx: Use auto-generated code to read global metadata
The TDX module provides a set of "Global Metadata Fields".  They report
things like TDX module version, supported features, and fields related
to create/run TDX guests and so on.

Currently the kernel only reads "TD Memory Region" (TDMR) related fields
for module initialization.  There are needs to read more global metadata
fields for future use:

 - Supported features ("TDX_FEATURES0") to fail module initialization
   when the module doesn't support "not clobbering host RBP when exiting
   from TDX guest" feature [1].
 - KVM TDX baseline support and other features like TDX Connect will
   need to read more.

The current global metadata reading code has limitations (e.g., it only
has a primitive helper to read metadata field with 16-bit element size,
while TDX supports 8/16/32/64 bits metadata element sizes).  It needs
tweaks in order to read more metadata fields.

But even with the tweaks, when new code is added to read a new field,
the reviewers will still need to review against the spec to make sure
the new code doesn't screw up things like using the wrong metadata
field ID (each metadata field is associated with a unique field ID,
which is a TDX-defined u64 constant) etc.

TDX documents all global metadata fields in a 'global_metadata.json'
file as part of TDX spec [2].  JSON format is machine readable.  Instead
of tweaking the metadata reading code, use a script to generate the code
so that:

  1) Using the generated C is simple.
  2) Adding a field is simple, e.g., the script just pulls the field ID
     out of the JSON for a given field thus no manual review is needed.

Specifically, to match the layout of the 'struct tdx_sys_info' and its
sub-structures, the script uses a table with each entry containing the
the name of the sub-structures (which reflects the "Class") and the
"Field Name" of all its fields, and auto-generate:

  1) The 'struct tdx_sys_info' and all 'struct tdx_sys_info_xx'
     sub-structures in 'tdx_global_metadata.h'.

  2) The main function 'get_tdx_sys_info()' which reads all metadata to
     'struct tdx_sys_info' and the 'get_tdx_sys_info_xx()' functions
     which read 'struct tdx_sys_info_xx()' in 'tdx_global_metadata.c'.

Using the generated C is simple: 1) include "tdx_global_metadata.h" to
the local "tdx.h"; 2) explicitly include "tdx_global_metadata.c" to the
local "tdx.c" after the read_sys_metadata_field() primitive (which is a
wrapper of TDH.SYS.RD SEAMCALL to read global metadata).

Adding a field is also simple: 1) just add the new field to an existing
structure, or add it with a new structure; 2) re-run the script to
generate the new code; 3) update the existing tdx_global_metadata.{hc}
with the new ones.

For now, use the auto-generated code to read the TDMR related fields and
the aforesaid metadata field "TDX_FEATURES0".

The tdx_global_metadata.{hc} can be generated by running below:

 #python tdx_global_metadata.py global_metadata.json \
	tdx_global_metadata.h tdx_global_metadata.c

.. where the 'global_metadata.json' can be fetched from [2] and the
'tdx_global_metadata.py' can be found from [3].

Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/fc0e8ab7-86d4-4428-be31-82e1ece6dd21@intel.com/ [1]
Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [2]
Link: https://lore.kernel.org/762a50133300710771337398284567b299a86f67.camel@intel.com/ [3]
Link: https://lore.kernel.org/all/cbe3f12b1e5479399b53f4873f2ff783d9fc669b.1734188033.git.kai.huang%40intel.com
2024-12-18 14:36:01 -08:00
Kai Huang
c4e0862a62 x86/virt/tdx: Start to track all global metadata in one structure
The TDX module provides a set of "Global Metadata Fields".  They report
things like TDX module version, supported features, and fields related
to create/run TDX guests and so on.

Today the kernel only reads "TD Memory Region" (TDMR) related fields for
module initialization.  KVM will need to read additional metadata fields
to run TDX guests.  Move towards having the TDX host core-kernel provide
a centralized, canonical, and immutable structure for the global
metadata that comes out from the TDX module for all kernel components to
use.

As the first step, introduce a new 'struct tdx_sys_info' to track all
global metadata fields.

TDX categorizes global metadata fields into different "Classes".  E.g.,
the TDMR related fields are under class "TDMR Info".  Instead of making
'struct tdx_sys_info' a plain structure to contain all metadata fields,
organize them in smaller structures based on the "Class".

This allows those metadata fields to be used in finer granularity thus
makes the code clearer.  E.g., construct_tdmrs() can just take the
structure which contains "TDMR Info" metadata fields.

Add get_tdx_sys_info() as the placeholder to read all metadata fields.
Have it only call get_tdx_sys_info_tdmr() to read TDMR related fields
for now.

Place get_tdx_sys_info() as the first step of init_tdx_module() to
enable early prerequisite checks on the metadata to support early module
initialization abort.  This results in moving get_tdx_sys_info_tdmr() to
be before build_tdx_memlist(), but this is fine because there are no
dependencies between these two functions.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/all/bfacb4e90527cf79d4be0d1753e6f318eea21118.1734188033.git.kai.huang%40intel.com
2024-12-18 14:36:01 -08:00
Kai Huang
e8aa393b0a x86/virt/tdx: Rename 'struct tdx_tdmr_sysinfo' to reflect the spec better
The TDX module provides a set of "Global Metadata Fields".  They report
things like TDX module version, supported features, and fields related
to create/run TDX guests and so on.

TDX organizes those metadata fields by "Classes" based on the meaning of
those fields.  E.g., for now the kernel only reads "TD Memory Region"
(TDMR) related fields for module initialization.  Those fields are
defined under class "TDMR Info".

Today the kernel reads some of the global metadata to initialize the TDX
module.  KVM will need to read additional metadata fields to run TDX
guests.  Move towards having the TDX host core-kernel provide a
centralized, canonical, and immutable structure for the global metadata
that comes out from the TDX module for all kernel components to use.

More specifically, prepare the code to end up with an organization like:

       struct tdx_sys_info {
	       struct tdx_sys_info_classA a;
	       struct tdx_sys_info_classB b;
	       ...
       };

Currently the kernel organizes all fields under "TDMR Info" class in
'struct tdx_tdmr_sysinfo'.  Prepare for the above by renaming the
structure to 'struct tdx_sys_info_tdmr' to follow the class name better.

No functional change intended.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/all/de165d09e0b571cfeb119a368f4be6e2888ebb93.1734188033.git.kai.huang%40intel.com
2024-12-18 14:36:01 -08:00
Tom Lendacky
8ae3291f77 x86/sev: Add full support for a segmented RMP table
A segmented RMP table allows for improved locality of reference between
the memory protected by the RMP and the RMP entries themselves.

Add support to detect and initialize a segmented RMP table with multiple
segments as configured by the system BIOS. While the RMPREAD instruction
will be used to read an RMP entry in a segmented RMP, initialization and
debugging capabilities will require the mapping of the segments.

The RMP_CFG MSR indicates if segmented RMP support is enabled and, if
enabled, the amount of memory that an RMP segment covers. When segmented
RMP support is enabled, the RMP_BASE MSR points to the start of the RMP
bookkeeping area, which is 16K in size. The RMP Segment Table (RST) is
located immediately after the bookkeeping area and is 4K in size. The RST
contains up to 512 8-byte entries that identify the location of the RMP
segment and amount of memory mapped by the segment (which must be less
than or equal to the configured segment size). The physical address that
is covered by a segment is based on the segment size and the index of the
segment in the RST. The RMP entry for a physical address is based on the
offset within the segment.

  For example, if the segment size is 64GB (0x1000000000 or 1 << 36), then
  physical address 0x9000800000 is RST entry 9 (0x9000800000 >> 36) and
  RST entry 9 covers physical memory 0x9000000000 to 0x9FFFFFFFFF.

  The RMP entry index within the RMP segment is the physical address
  AND-ed with the segment mask, 64GB - 1 (0xFFFFFFFFF), and then
  right-shifted 12 bits or PHYS_PFN(0x9000800000 & 0xFFFFFFFFF), which
  is 0x800.

CPUID 0x80000025_EBX[9:0] describes the number of RMP segments that can
be cached by the hardware. Additionally, if CPUID 0x80000025_EBX[10] is
set, then the number of actual RMP segments defined cannot exceed the
number of RMP segments that can be cached and can be used as a maximum
RST index.

  [ bp: Unify printk hex format specifiers. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/02afd0ffd097a19cb6e5fb1bb76eb110496c5b11.1734101742.git.thomas.lendacky@amd.com
2024-12-14 11:47:42 +01:00
Tom Lendacky
0f14af0d1d x86/sev: Treat the contiguous RMP table as a single RMP segment
In preparation for support of a segmented RMP table, treat the contiguous
RMP table as a segmented RMP table with a single segment covering all
of memory. By treating a contiguous RMP table as a single segment, much
of the code that initializes and accesses the RMP can be re-used.

Segmented RMP tables can have up to 512 segment entries. Each segment
will have metadata associated with it to identify the segment location,
the segment size, etc. The segment data and the physical address are used
to determine the index of the segment within the table and then the RMP
entry within the segment. For an actual segmented RMP table environment,
much of the segment information will come from a configuration MSR. For
the contiguous RMP, though, much of the information will be statically
defined.

  [ bp: Touchups, explain array_index_nospec() usage. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/8c40fbc9c5217f0d79b37cf861eff03ab0330bef.1733172653.git.thomas.lendacky@amd.com
2024-12-14 11:40:01 +01:00
Tom Lendacky
ac517965a5 x86/sev: Map only the RMP table entries instead of the full RMP range
In preparation for support of a segmented RMP table, map only the RMP
table entries. The RMP bookkeeping area is only ever accessed when
first enabling SNP and does not need to remain mapped. To accomplish
this, split the initialization of the RMP bookkeeping area and the
initialization of the RMP entry area. The RMP bookkeeping area will be
mapped only while it is being initialized.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Reviewed-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/22f179998d319834f49c13a8c01187fbf0fd308d.1733172653.git.thomas.lendacky@amd.com
2024-12-14 11:39:02 +01:00
Tom Lendacky
e2f3d40df8 x86/sev: Move the SNP probe routine out of the way
To make patch review easier for the segmented RMP support, move the SNP
probe function out from in between the initialization-related routines.

No functional change.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/6c2975bbf132d567dd12e1435be1d18c0bf9131c.1733172653.git.thomas.lendacky@amd.com
2024-12-14 11:08:48 +01:00
Tom Lendacky
0cbc025841 x86/sev: Add support for the RMPREAD instruction
The RMPREAD instruction returns an architecture defined format of an
RMP table entry. This is the preferred method for examining RMP entries.

The instruction is advertised in CPUID 0x8000001f_EAX[21]. Use this
instruction when available.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Reviewed-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/72c734ac8b324bbc0c839b2c093a11af4a8881fa.1733172653.git.thomas.lendacky@amd.com
2024-12-14 01:02:30 +01:00
Tom Lendacky
3e43c60eb3 x86/sev: Prepare for using the RMPREAD instruction to access the RMP
The RMPREAD instruction returns an architecture defined format of an RMP
entry. This is the preferred method for examining RMP entries.

In preparation for using the RMPREAD instruction, convert the existing code
that directly accesses the RMP to map the raw RMP information into the
architecture defined format.

RMPREAD output returns a status bit for the 2MB region status. If the input
page address is 2MB aligned and any other pages within the 2MB region are
assigned, then 2MB region status will be set to 1. Otherwise, the 2MB region
status will be set to 0.

For systems that do not support RMPREAD, calculating this value would require
looping over all of the RMP table entries within that range until one is found
with the assigned bit set. Since this bit is not defined in the current
format, and so not used today, do not incur the overhead associated with
calculating it.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/da49d5af1eb7f9039f35f14a32ca091efb2dd818.1733172653.git.thomas.lendacky@amd.com
2024-12-13 22:46:14 +01:00
Linus Torvalds
55db8eb456 - Do the proper memory conversion of guest memory in order to be able to kexec
kernels in SNP guests along with other adjustments and cleanups to that
   effect
 
 - Start converting and moving functionality from the sev-guest driver into
   core code with the purpose of supporting the secure TSC SNP feature where
   the hypervisor cannot influence the TSC exposed to the guest anymore
 
 - Add a "nosnp" cmdline option in order to be able to disable SNP support in
   the hypervisor and thus free-up resources which are not going to be used
 
 - Cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmc7ZToACgkQEsHwGGHe
 VUp61hAArA8taJaGUSdoe3sN60yRWCTe30QiDLvUrDGqmPHbBnDpdYsoaZujkQMI
 334piSWWu/pB6meO93uwv8X/ZO0ryOw46RK3szTz/RhBB5pTO3NbAj1zMF5q2KUy
 a+SYbZffV+qBUEpGujGrqrwT7X3U70yCKJFaZQOGvyYFzo+kyx6euqlYP+StOD+D
 ph7SDrXv0N0uU/2OiwCzF0cKvAuNHG2Cfn3kqSKvcZ+NWF3BKmw1IkgFA9f05P+j
 mOkc+1jCbi26b94MSJHSL33iRtbD0NgUzT9F2tw9Qszw1BQ5Er30Y45ywoudAhsn
 VrpMhBwWRCUdakQ2PsI7O8WB4gnBdWpEuzS2Ssqa1akB+pggH2xQzVb5EznmbzlS
 gz/SqUP75ijTT/oGh+C/hKAES3pmO4pH48J7llOKzb8YpoxxzjSEVb2pVbLzNdIV
 +it12Cap0lW+CTNGF4p2TbuKXKkE1LiGya1JMymQiZL8quCBYJIQUttiBvBg8Ac1
 oCw2DXQZsjDw55Hwwhr95J4FuY4+iQd+o1GgRDQ4MEqaYFEfdcFRA1YCbMHgiAzu
 NOGwjrQ2PB5xGST34qobGtk7Xt2nIilDvl5K5Co2E4s14NLrlBHo2uq33d0unlIZ
 BJMrHG/IWNjuHbKl/vM05fuiKEIvpL5qTKz7oVL6tX8Zphf6ljU=
 =C431
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Do the proper memory conversion of guest memory in order to be able
   to kexec kernels in SNP guests along with other adjustments and
   cleanups to that effect

 - Start converting and moving functionality from the sev-guest driver
   into core code with the purpose of supporting the secure TSC SNP
   feature where the hypervisor cannot influence the TSC exposed to the
   guest anymore

 - Add a "nosnp" cmdline option in order to be able to disable SNP
   support in the hypervisor and thus free-up resources which are not
   going to be used

 - Cleanups

[ Reminding myself about the endless TLA's again: SEV is the AMD Secure
  Encrypted Virtualization    - Linus ]

* tag 'x86_sev_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Cleanup vc_handle_msr()
  x86/sev: Convert shared memory back to private on kexec
  x86/mm: Refactor __set_clr_pte_enc()
  x86/boot: Skip video memory access in the decompressor for SEV-ES/SNP
  virt: sev-guest: Carve out SNP message context structure
  virt: sev-guest: Reduce the scope of SNP command mutex
  virt: sev-guest: Consolidate SNP guest messaging parameters to a struct
  x86/sev: Cache the secrets page address
  x86/sev: Handle failures from snp_init()
  virt: sev-guest: Use AES GCM crypto library
  x86/virt: Provide "nosnp" boot option for sev kernel command line
  x86/virt: Move SEV-specific parsing into arch/x86/virt/svm
2024-11-19 12:21:35 -08:00
Ashish Kalra
88a921aa3c x86/sev: Ensure that RMP table fixups are reserved
The BIOS reserves RMP table memory via e820 reservations. This can still lead
to RMP page faults during kexec if the host tries to access memory within the
same 2MB region.

Commit

  400fea4b96 ("x86/sev: Add callback to apply RMP table fixups for kexec"

adjusts the e820 reservations for the RMP table so that the entire 2MB range
at the start/end of the RMP table is marked reserved.

The e820 reservations are then passed to firmware via SNP_INIT where they get
marked HV-Fixed.

The RMP table fixups are done after the e820 ranges have been added to
memblock, allowing the fixup ranges to still be allocated and used by the
system.

The problem is that this memory range is now marked reserved in the e820
tables and during SNP initialization these reserved ranges are marked as
HV-Fixed.  This means that the pages cannot be used by an SNP guest, only by
the hypervisor.

However, the memory management subsystem does not make this distinction and
can allocate one of those pages to an SNP guest. This will ultimately result
in RMPUPDATE failures associated with the guest, causing it to fail to start
or terminate when accessing the HV-Fixed page.

The issue is captured below with memblock=debug:

  [    0.000000] SEV-SNP: *** DEBUG: snp_probe_rmptable_info:352 - rmp_base=0x280d4800000, rmp_end=0x28357efffff
  ...
  [    0.000000] BIOS-provided physical RAM map:
  ...
  [    0.000000] BIOS-e820: [mem 0x00000280d4800000-0x0000028357efffff] reserved
  [    0.000000] BIOS-e820: [mem 0x0000028357f00000-0x0000028357ffffff] usable
  ...
  ...
  [    0.183593] memblock add: [0x0000028357f00000-0x0000028357ffffff] e820__memblock_setup+0x74/0xb0
  ...
  [    0.203179] MEMBLOCK configuration:
  [    0.207057]  memory size = 0x0000027d0d194000 reserved size = 0x0000000009ed2c00
  [    0.215299]  memory.cnt  = 0xb
  ...
  [    0.311192]  memory[0x9]     [0x0000028357f00000-0x0000028357ffffff], 0x0000000000100000 bytes flags: 0x0
  ...
  ...
  [    0.419110] SEV-SNP: Reserving start/end of RMP table on a 2MB boundary [0x0000028357e00000]
  [    0.428514] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved
  [    0.428517] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved
  [    0.428520] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved
  ...
  ...
  [    5.604051] MEMBLOCK configuration:
  [    5.607922]  memory size = 0x0000027d0d194000 reserved size = 0x0000000011faae02
  [    5.616163]  memory.cnt  = 0xe
  ...
  [    5.754525]  memory[0xc]     [0x0000028357f00000-0x0000028357ffffff], 0x0000000000100000 bytes on node 0 flags: 0x0
  ...
  ...
  [   10.080295] Early memory node ranges[   10.168065]
  ...
  node   0: [mem 0x0000028357f00000-0x0000028357ffffff]
  ...
  ...
  [ 8149.348948] SEV-SNP: RMPUPDATE failed for PFN 28357f7c, pg_level: 1, ret: 2

As shown above, the memblock allocations show 1MB after the end of the RMP as
available for allocation, which is what the RMP table fixups have reserved.
This memory range subsequently gets allocated as SNP guest memory, resulting
in an RMPUPDATE failure.

This can potentially be fixed by not reserving the memory range in the e820
table, but that causes kexec failures when using the KEXEC_FILE_LOAD syscall.

The solution is to use memblock_reserve() to mark the memory reserved for the
system, ensuring that it cannot be allocated to an SNP guest.

Since HV-Fixed memory is still readable/writable by the host, this only ends
up being a problem if the memory in this range requires a page state change,
which generally will only happen when allocating memory in this range to be
used for running SNP guests, which is now possible with the SNP hypervisor
support in kernel 6.11.

Backporter note:

Fixes tag points to a 6.9 change but as the last paragraph above explains,
this whole thing can happen after 6.11 received SNP HV support, therefore
backporting to 6.9 is not really necessary.

  [ bp: Massage commit message. ]

Fixes: 400fea4b96 ("x86/sev: Add callback to apply RMP table fixups for kexec")
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@kernel.org> # 6.11, see Backporter note above.
Link: https://lore.kernel.org/r/20240815221630.131133-1-Ashish.Kalra@amd.com
2024-10-23 12:34:06 +02:00
Pavan Kumar Paluri
2db67aaca5 x86/virt: Provide "nosnp" boot option for sev kernel command line
Provide a "nosnp" kernel command line option to prevent enabling of the RMP
and SEV-SNP features in the host/hypervisor. Not initializing the RMP
removes system overhead associated with RMP checks.

  [ bp: Actually make it a HV-only cmdline option. ]

Co-developed-by: Eric Van Tassell <Eric.VanTassell@amd.com>
Signed-off-by: Eric Van Tassell <Eric.VanTassell@amd.com>
Signed-off-by: Pavan Kumar Paluri <papaluri@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20241014130948.1476946-3-papaluri@amd.com
2024-10-15 20:22:18 +02:00
Pavan Kumar Paluri
4ae47fa7e8 x86/virt: Move SEV-specific parsing into arch/x86/virt/svm
Move SEV-specific kernel command line option parsing support from
arch/x86/coco/sev/core.c to arch/x86/virt/svm/cmdline.c so that both
host and guest related SEV command line options can be supported.

No functional changes intended.

Signed-off-by: Pavan Kumar Paluri <papaluri@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20241014130948.1476946-2-papaluri@amd.com
2024-10-15 19:54:42 +02:00
Linus Torvalds
408323581b - Add support for running the kernel in a SEV-SNP guest, over a Secure
VM Service Module (SVSM).
 
    When running over a SVSM, different services can run at different
    protection levels, apart from the guest OS but still within the
    secure SNP environment.  They can provide services to the guest, like
    a vTPM, for example.
 
    This series adds the required facilities to interface with such a SVSM
    module.
 
  - The usual fixlets, refactoring and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaWQuoACgkQEsHwGGHe
 VUrmEw/+KqM5DK5cfpue3gn0RfH6OYUoFxOdYhGkG53qUMc3c3ka5zPVqLoHPkzp
 WPXha0Z5pVdrcD9mKtVUW9RIuLjInCM/mnoNc3tIUL+09xxemAjyG1+O+4kodiU7
 sZ5+HuKUM2ihoC4Rrm+ApRrZfH4+WcgQNvFky77iObWVBo4yIscS7Pet/MYFvuuz
 zNaGp2SGGExDeoX/pMQNI3S9FKYD26HR17AUI3DHpS0teUl2npVi4xDjFVYZh0dQ
 yAhTKbSX3Q6ekDDkvAQUbxvWTJw9qoIsvLO9dvZdx6SSWmzF9IbuECpQKGQwYcp+
 pVtcHb+3MwfB+nh5/fHyssRTOZp1UuI5GcmLHIQhmhQwCqPgzDH6te4Ud1ovkxOu
 3GoBre7KydnQIyv12I+56/ZxyPbjHWmn8Fg106nAwGTdGbBJhfcVYfPmPvwpI4ib
 nXpjypvM8FkLzLAzDK6GE9QiXqJJlxOn7t66JiH/FkXR4gnY3eI8JLMfnm5blAb+
 97LC7oyeqtstWth9/4tpCILgPR2tirrMQGjUXttgt+2VMzqnEamnFozsKvR95xok
 4j6ulKglZjdpn0ixHb2vAzAcOJvD7NP147jtCmXH7M6/f9H1Lih3MKdxX98MVhWB
 wSp16udXHzu5lF45J0BJG8uejSgBI2y51jc92HLX7kRULOGyaEo=
 =u15r
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Add support for running the kernel in a SEV-SNP guest, over a Secure
   VM Service Module (SVSM).

   When running over a SVSM, different services can run at different
   protection levels, apart from the guest OS but still within the
   secure SNP environment. They can provide services to the guest, like
   a vTPM, for example.

   This series adds the required facilities to interface with such a
   SVSM module.

 - The usual fixlets, refactoring and cleanups

[ And as always: "SEV" is AMD's "Secure Encrypted Virtualization".

  I can't be the only one who gets all the newer x86 TLA's confused,
  can I?
              - Linus ]

* tag 'x86_sev_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation/ABI/configfs-tsm: Fix an unexpected indentation silly
  x86/sev: Do RMP memory coverage check after max_pfn has been set
  x86/sev: Move SEV compilation units
  virt: sev-guest: Mark driver struct with __refdata to prevent section mismatch
  x86/sev: Allow non-VMPL0 execution when an SVSM is present
  x86/sev: Extend the config-fs attestation support for an SVSM
  x86/sev: Take advantage of configfs visibility support in TSM
  fs/configfs: Add a callback to determine attribute visibility
  sev-guest: configfs-tsm: Allow the privlevel_floor attribute to be updated
  virt: sev-guest: Choose the VMPCK key based on executing VMPL
  x86/sev: Provide guest VMPL level to userspace
  x86/sev: Provide SVSM discovery support
  x86/sev: Use the SVSM to create a vCPU when not in VMPL0
  x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0
  x86/sev: Use kernel provided SVSM Calling Areas
  x86/sev: Check for the presence of an SVSM in the SNP secrets page
  x86/irqflags: Provide native versions of the local_irq_save()/restore()
2024-07-16 11:12:25 -07:00
Tom Lendacky
0440feb090 x86/sev: Do RMP memory coverage check after max_pfn has been set
The RMP table is probed early in the boot process before max_pfn has been
set, so the logic to check if the RMP covers all of system memory is not
valid.

Move the RMP memory coverage check from snp_probe_rmptable_info() into
snp_rmptable_init(), which is well after max_pfn has been set. Also, fix
the calculation to use PFN_UP instead of PHYS_PFN, in order to compute
the required RMP size properly.

Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/bec4364c7e34358cc576f01bb197a7796a109169.1718984524.git.thomas.lendacky@amd.com
2024-07-11 12:03:23 +02:00
Tony Luck
189e8d4b98 x86/virt/tdx: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20240520224620.9480-31-tony.luck%40intel.com
2024-05-28 10:59:02 -07:00
Linus Torvalds
c4273a6692 x86/cleanups changes for v6.10:
- Fix function prototypes to address clang function type cast
    warnings in the math-emu code
 
  - Reorder definitions in <asm/msr-index.h>
 
  - Remove unused code
 
  - Fix typos
 
  - Simplify #include sections
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBvHQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jeSBAAqPMBFEYc5nge52ONZ8bzADEPQ6pBohgO
 xfONNuUpjtQ/Xtnhc8FGoFf+C9pnOlf2eX2VfusqvA6M9XJDgZxu1M6QZSOHuILo
 4T4opzTj7VYLbo1DQGLcPMymW/rhJNwKdRwhHr4SNIk9YcIJS7uyxtnLNvqjcCsB
 /iMw2/mhlXRXN1MP1Eg4YM6BXJ4qYkjx79gzKEGbq6tJgUahR37LGvw1aq+GAiap
 Wbo0o2jLgu8ByZXKEfUmUnW5jMR02LeUBg1OqDjaziO48df6eUi4ngaCoSA5qIew
 SDKZ1uq3qTOlDtGlxIGlBznM/HjvPejr+XQXKukCn+B9N62PMtR4fOS5q/4ODTD+
 wQttK0rg/fLpp1zgv33ey2N0qpbUxbtxC4JkA4DPfqstO/uiQXTNJM6H68Pqr9p/
 6TuW+HYrsgUdi54X4KTEHIAGOSUP0bjJrtSP6Tzxt9+epOQl+ymHaR07a4rRn2cw
 SnK7CQcWsjv90PUkCsb3F7gZtYVOkb4C0ZCPn2AlSPo+y0YnBadG+S6uQ6suFwxA
 kX5QNf+OPmqJZz/muqGQ+c7Swc9ONPdv6RSt35nqp2vz0ugp4Q1FNUciQGfOLj2V
 O0KaFVcdFvlkLGgxgYlGZJKxWKeuhh+L5IHyaL5fy7nOUhJtI+djoF5ZaCfR0Ofp
 Piqz80R6w9I=
 =6pkd
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:

 - Fix function prototypes to address clang function type cast
   warnings in the math-emu code

 - Reorder definitions in <asm/msr-index.h>

 - Remove unused code

 - Fix typos

 - Simplify #include sections

* tag 'x86-cleanups-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pci/ce4100: Remove unused 'struct sim_reg_op'
  x86/msr: Move ARCH_CAP_XAPIC_DISABLE bit definition to its rightful place
  x86/math-emu: Fix function cast warnings
  x86/extable: Remove unused fixup type EX_TYPE_COPY
  x86/rtc: Remove unused intel-mid.h
  x86/32: Remove unused IA32_STACK_TOP and two externs
  x86/head: Simplify relative include path to xen-head.S
  x86/fred: Fix typo in Kconfig description
  x86/syscall/compat: Remove ia32_unistd.h
  x86/syscall/compat: Remove unused macro __SYSCALL_ia32_NR
  x86/virt/tdx: Remove duplicate include
  x86/xen: Remove duplicate #include
2024-05-13 18:21:24 -07:00
Ashish Kalra
400fea4b96 x86/sev: Add callback to apply RMP table fixups for kexec
Handle cases where the RMP table placement in the BIOS is not 2M aligned
and the kexec-ed kernel could try to allocate from within that chunk
which then causes a fatal RMP fault.

The kexec failure is illustrated below:

  SEV-SNP: RMP table physical range [0x0000007ffe800000 - 0x000000807f0fffff]
  BIOS-provided physical RAM map:
  BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable
  BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
  ...
  BIOS-e820: [mem 0x0000004080000000-0x0000007ffe7fffff] usable
  BIOS-e820: [mem 0x0000007ffe800000-0x000000807f0fffff] reserved
  BIOS-e820: [mem 0x000000807f100000-0x000000807f1fefff] usable

As seen here in the e820 memory map, the end range of the RMP table is not
aligned to 2MB and not reserved but it is usable as RAM.

Subsequently, kexec -s (KEXEC_FILE_LOAD syscall) loads it's purgatory
code and boot_param, command line and other setup data into this RAM
region as seen in the kexec logs below, which leads to fatal RMP fault
during kexec boot.

  Loaded purgatory at 0x807f1fa000
  Loaded boot_param, command line and misc at 0x807f1f8000 bufsz=0x1350 memsz=0x2000
  Loaded 64bit kernel at 0x7ffae00000 bufsz=0xd06200 memsz=0x3894000
  Loaded initrd at 0x7ff6c89000 bufsz=0x4176014 memsz=0x4176014
  E820 memmap:
  0000000000000000-000000000008efff (1)
  000000000008f000-000000000008ffff (4)
  0000000000090000-000000000009ffff (1)
  ...
  0000004080000000-0000007ffe7fffff (1)
  0000007ffe800000-000000807f0fffff (2)
  000000807f100000-000000807f1fefff (1)
  000000807f1ff000-000000807fffffff (2)
  nr_segments = 4
  segment[0]: buf=0x00000000e626d1a2 bufsz=0x4000 mem=0x807f1fa000 memsz=0x5000
  segment[1]: buf=0x0000000029c67bd6 bufsz=0x1350 mem=0x807f1f8000 memsz=0x2000
  segment[2]: buf=0x0000000045c60183 bufsz=0xd06200 mem=0x7ffae00000 memsz=0x3894000
  segment[3]: buf=0x000000006e54f08d bufsz=0x4176014 mem=0x7ff6c89000 memsz=0x4177000
  kexec_file_load: type:0, start:0x807f1fa150 head:0x1184d0002 flags:0x0

Check if RMP table start and end physical range in the e820 tables are
not aligned to 2MB and in that case map this range to reserved in all
the three e820 tables.

  [ bp: Massage. ]

Fixes: c3b86e61b7 ("x86/cpufeatures: Enable/unmask SEV-SNP CPU feature")
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/df6e995ff88565262c2c7c69964883ff8aa6fc30.1714090302.git.ashish.kalra@amd.com
2024-04-29 11:21:09 +02:00
Borislav Petkov (AMD)
0ecaefb303 x86/CPU/AMD: Track SNP host status with cc_platform_*()
The host SNP worthiness can determined later, after alternatives have
been patched, in snp_rmptable_init() depending on cmdline options like
iommu=pt which is incompatible with SNP, for example.

Which means that one cannot use X86_FEATURE_SEV_SNP and will need to
have a special flag for that control.

Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places.

Move kdump_sev_callback() to its rightful place, while at it.

Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-6-bp@alien8.de
2024-04-04 10:40:30 +02:00
Masahiro Yamada
3f1a9bc5d8 x86/build: Use obj-y to descend into arch/x86/virt/
Commit c33621b4c5 ("x86/virt/tdx: Wire up basic SEAMCALL functions")
introduced a new instance of core-y instead of the standardized obj-y
syntax.

X86 Makefiles descend into subdirectories of arch/x86/virt inconsistently;
into arch/x86/virt/ via core-y defined in arch/x86/Makefile, but into
arch/x86/virt/svm/ via obj-y defined in arch/x86/Kbuild.

This is problematic when you build a single object in parallel because
multiple threads attempt to build the same file.

  $ make -j$(nproc) arch/x86/virt/vmx/tdx/seamcall.o
    [ snip ]
    AS      arch/x86/virt/vmx/tdx/seamcall.o
    AS      arch/x86/virt/vmx/tdx/seamcall.o
  fixdep: error opening file: arch/x86/virt/vmx/tdx/.seamcall.o.d: No such file or directory
  make[4]: *** [scripts/Makefile.build:362: arch/x86/virt/vmx/tdx/seamcall.o] Error 2

Use the obj-y syntax, as it works correctly.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240330060554.18524-1-masahiroy@kernel.org
2024-03-30 10:41:49 +01:00
Jiapeng Chong
27d45fc7df x86/virt/tdx: Remove duplicate include
./arch/x86/virt/vmx/tdx/tdx.c: linux/acpi.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240322061741.9869-1-jiapeng.chong@linux.alibaba.com

Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8609
2024-03-22 09:35:43 +01:00
Ashish Kalra
8ef979584e crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
Add a kdump safe version of sev_firmware_shutdown() and register it as a
crash_kexec_post_notifier so it will be invoked during panic/crash to do
SEV/SNP shutdown. This is required for transitioning all IOMMU pages to
reclaim/hypervisor state, otherwise re-init of IOMMU pages during
crashdump kernel boot fails and panics the crashdump kernel.

This panic notifier runs in atomic context, hence it ensures not to
acquire any locks/mutexes and polls for PSP command completion instead
of depending on PSP command completion interrupt.

  [ mdr: Remove use of "we" in comments. ]

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240126041126.1927228-21-michael.roth@amd.com
2024-01-29 20:34:19 +01:00
Ashish Kalra
8dac642999 x86/sev: Introduce an SNP leaked pages list
Pages are unsafe to be released back to the page-allocator if they
have been transitioned to firmware/guest state and can't be reclaimed
or transitioned back to hypervisor/shared state. In this case, add them
to an internal leaked pages list to ensure that they are not freed or
touched/accessed to cause fatal page faults.

  [ mdr: Relocate to arch/x86/virt/svm/sev.c ]

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20240126041126.1927228-16-michael.roth@amd.com
2024-01-29 20:34:18 +01:00
Michael Roth
661b1c6169 x86/sev: Adjust the directmap to avoid inadvertent RMP faults
If the kernel uses a 2MB or larger directmap mapping to write to an
address, and that mapping contains any 4KB pages that are set to private
in the RMP table, an RMP #PF will trigger and cause a host crash.

SNP-aware code that owns the private PFNs will never attempt such
a write, but other kernel tasks writing to other PFNs in the range may
trigger these checks inadvertently due to writing to those other PFNs
via a large directmap mapping that happens to also map a private PFN.

Prevent this by splitting any 2MB+ mappings that might end up containing
a mix of private/shared PFNs as a result of a subsequent RMPUPDATE for
the PFN/rmp_level passed in.

Another way to handle this would be to limit the directmap to 4K
mappings in the case of hosts that support SNP, but there is potential
risk for performance regressions of certain host workloads.

Handling it as-needed results in the directmap being slowly split over
time, which lessens the risk of a performance regression since the more
the directmap gets split as a result of running SNP guests, the more
likely the host is being used primarily to run SNP guests, where
a mostly-split directmap is actually beneficial since there is less
chance of TLB flushing and cpa_lock contention being needed to perform
these splits.

Cases where a host knows in advance it wants to primarily run SNP guests
and wishes to pre-split the directmap can be handled by adding
a tuneable in the future, but preliminary testing has shown this to not
provide a signficant benefit in the common case of guests that are
backed primarily by 2MB THPs, so it does not seem to be warranted
currently and can be added later if a need arises in the future.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20240126041126.1927228-12-michael.roth@amd.com
2024-01-29 20:34:18 +01:00
Brijesh Singh
2c35819ee0 x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
The RMPUPDATE instruction updates the access restrictions for a page via
its corresponding entry in the RMP Table. The hypervisor will use the
instruction to enforce various access restrictions on pages used for
confidential guests and other specialized functionality. See APM3 for
details on the instruction operations.

The PSMASH instruction expands a 2MB RMP entry in the RMP table into a
corresponding set of contiguous 4KB RMP entries while retaining the
state of the validated bit from the original 2MB RMP entry. The
hypervisor will use this instruction in cases where it needs to re-map a
page as 4K rather than 2MB in a guest's nested page table.

Add helpers to make use of these instructions.

  [ mdr: add RMPUPDATE retry logic for transient FAIL_OVERLAP errors. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://lore.kernel.org/r/20240126041126.1927228-11-michael.roth@amd.com
2024-01-29 20:30:59 +01:00
Brijesh Singh
1f568d3636 x86/fault: Add helper for dumping RMP entries
This information will be useful for debugging things like page faults
due to RMP access violations and RMPUPDATE failures.

  [ mdr: move helper to standalone patch, rework dump logic as suggested
    by Boris. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240126041126.1927228-8-michael.roth@amd.com
2024-01-29 17:26:30 +01:00
Brijesh Singh
94b36bc244 x86/sev: Add RMP entry lookup helpers
Add a helper that can be used to access information contained in the RMP
entry corresponding to a particular PFN. This will be needed to make
decisions on how to handle setting up mappings in the NPT in response to
guest page-faults and handling things like cleaning up pages and setting
them back to the default hypervisor-owned state when they are no longer
being used for private data.

  [ mdr: separate 'assigned' indicator from return code, and simplify
    function signatures for various helpers. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240126041126.1927228-7-michael.roth@amd.com
2024-01-29 17:25:55 +01:00
Brijesh Singh
216d106c7f x86/sev: Add SEV-SNP host initialization support
The memory integrity guarantees of SEV-SNP are enforced through a new
structure called the Reverse Map Table (RMP). The RMP is a single data
structure shared across the system that contains one entry for every 4K
page of DRAM that may be used by SEV-SNP VMs. The APM Volume 2 section
on Secure Nested Paging (SEV-SNP) details a number of steps needed to
detect/enable SEV-SNP and RMP table support on the host:

 - Detect SEV-SNP support based on CPUID bit
 - Initialize the RMP table memory reported by the RMP base/end MSR
   registers and configure IOMMU to be compatible with RMP access
   restrictions
 - Set the MtrrFixDramModEn bit in SYSCFG MSR
 - Set the SecureNestedPagingEn and VMPLEn bits in the SYSCFG MSR
 - Configure IOMMU

RMP table entry format is non-architectural and it can vary by
processor. It is defined by the PPR document for each respective CPU
family. Restrict SNP support to CPU models/families which are compatible
with the current RMP table entry format to guard against any undefined
behavior when running on other system types. Future models/support will
handle this through an architectural mechanism to allow for broader
compatibility.

SNP host code depends on CONFIG_KVM_AMD_SEV config flag which may be
enabled even when CONFIG_AMD_MEM_ENCRYPT isn't set, so update the
SNP-specific IOMMU helpers used here to rely on CONFIG_KVM_AMD_SEV
instead of CONFIG_AMD_MEM_ENCRYPT.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Link: https://lore.kernel.org/r/20240126041126.1927228-5-michael.roth@amd.com
2024-01-29 17:20:23 +01:00
Kai Huang
70060463cb x86/mce: Differentiate real hardware #MCs from TDX erratum ones
The first few generations of TDX hardware have an erratum.  Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.

Make an effort to detect these TDX-induced machine checks and spit out
a new blurb to dmesg so folks do not think their hardware is failing.

== Background ==

Virtually all kernel memory accesses operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

== Problem ==

A partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

To add insult to injury, the Linux machine code will present these as a
literal "Hardware error" when they were, in fact, a software-triggered
issue.

== Solution ==

In the end, this issue is hard to trigger.  Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.

Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not but just dump, for instance, below message:

 [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
 	...
 [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] Kernel panic - not syncing: Fatal local machine check

Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".

Ideally, it's better for the log to say "software bug around TDX private
memory" instead of "Hardware Error".  But in reality the real hardware
memory error can happen, and sadly such software-triggered #MC cannot be
distinguished from the real hardware error.  Also, the error message is
used by userspace tool 'mcelog' to parse, so changing the output may
break userspace.

So keep the "Hardware Error".  The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.

Instead of modifying above error log, improve the error log by printing
additional TDX related message to make the log like:

  ...
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

Adding this additional message requires determination of whether the
memory page is TDX private memory.  There is no existing infrastructure
to do that.  Add an interface to query the TDX module to fill this gap.

== Impact ==

This issue requires some kind of kernel bug to trigger.

TDX private memory should never be mapped UC/WC.  A partial write
originating from these mappings would require *two* bugs, first mapping
the wrong page, then writing the wrong memory.  It would also be
detectable using traditional memory corruption techniques like
DEBUG_PAGEALLOC.

MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map.  It should also be
detectable with normal debug techniques.

The one place where this might get nasty would be if the CPU read data
then wrote back the same data.  That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.

With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.

[ dhansen: changelog tweaks ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
2023-12-12 08:46:46 -08:00
Kai Huang
1e536e1068 x86/cpu: Detect TDX partial write machine check erratum
TDX memory has integrity and confidentiality protections.  Violations of
this integrity protection are supposed to only affect TDX operations and
are never supposed to affect the host kernel itself.  In other words,
the host kernel should never, itself, see machine checks induced by the
TDX integrity hardware.

Alas, the first few generations of TDX hardware have an erratum.  A
partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

Virtually all kernel memory accesses operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

With this erratum, there are additional things need to be done.  To
prepare for those changes, add a CPU bug bit to indicate this erratum.
Note this bug reflects the hardware thus it is detected regardless of
whether the kernel is built with TDX support or not.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-17-dave.hansen%40intel.com
2023-12-12 08:46:40 -08:00
Kai Huang
f3f6aa6864 x86/virt/tdx: Handle TDX interaction with sleep and hibernation
TDX is incompatible with hibernation and some ACPI sleep states.
Users must disable hibernation to use TDX.  Users must also disable
TDX if they want to use ACPI S3 sleep.

This feels a bit wonky and asymmetric, but it avoids adding any new
command-line parameters for now.  It can be improved if users hate it
too much.

Long version:

TDX cannot survive from S3 and deeper states.  The hardware resets and
disables TDX completely when platform goes to S3 and deeper.  Both TDX
guests and the TDX module get destroyed permanently.

The kernel uses S3 to support suspend-to-ram, and S4 or deeper states to
support hibernation.  The kernel also maintains TDX states to track
whether it has been initialized and its metadata resource, etc.  After
resuming from S3 or hibernation, these TDX states won't be correct
anymore.

Theoretically, the kernel can do more complicated things like resetting
TDX internal states and TDX module metadata before going to S3 or
deeper, and re-initialize TDX module after resuming, etc, but there is
no way to save/restore TDX guests for now.

Until TDX supports full save and restore of TDX guests, there is no big
value to handle TDX module in suspend and hibernation alone.  To make
things simple, just choose to make TDX mutually exclusive with S3 and
hibernation.

Note the TDX module is initialized at runtime.  To avoid having to deal
with the fuss of determining TDX state at runtime, just choose TDX vs S3
and hibernation at kernel early boot.  It's a bad user experience if the
choice of TDX and S3/hibernation is done at runtime anyway, i.e., the
user can experience being able to do S3/hibernation but later becoming
unable to due to TDX being enabled.

Disable TDX in kernel early boot when hibernation support is available.
Currently there's no mechanism exposed by the hibernation code to allow
other kernel code to disable hibernation once for all.  Users that want
TDX must disable hibernation, like using hibername=no on the command
line.

Disable ACPI S3 when TDX is enabled by the BIOS.  For now the user needs
to disable TDX in the BIOS to use ACPI S3.  A new kernel command line
can be added in the future if there's a need to let user disable TDX
host via kernel command line.

Alternatively, the kernel could disable TDX when ACPI S3 is supported
and request the user to disable S3 to use TDX.  But there's no existing
kernel command line to do that, and BIOS doesn't always have an option
to disable S3.

[ dhansen: subject / changelog tweaks ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-16-dave.hansen%40intel.com
2023-12-08 09:12:46 -08:00
Kai Huang
0b2bc38131 x86/virt/tdx: Initialize all TDMRs
After the global KeyID has been configured on all packages, initialize
all TDMRs to make all TDX-usable memory regions that are passed to the
TDX module become usable.

This is the last step of initializing the TDX module.

Initializing TDMRs can be time consuming on large memory systems as it
involves initializing all metadata entries for all pages that can be
used by TDX guests.  Initializing different TDMRs can be parallelized.
For now to keep it simple, just initialize all TDMRs one by one.  It can
be enhanced in the future.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-15-dave.hansen%40intel.com
2023-12-08 09:12:45 -08:00
Kai Huang
e56d28df2f x86/virt/tdx: Configure global KeyID on all packages
After the list of TDMRs and the global KeyID are configured to the TDX
module, the kernel needs to configure the key of the global KeyID on all
packages using TDH.SYS.KEY.CONFIG.

This SEAMCALL cannot run parallel on different cpus.  Loop all online
cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of
each package.

To keep things simple, this implementation takes no affirmative steps to
online cpus to make sure there's at least one cpu for each package.  The
callers (aka. KVM) can ensure success by ensuring sufficient CPUs are
online for this to succeed.

Intel hardware doesn't guarantee cache coherency across different
KeyIDs.  The PAMTs are transitioning from being used by the kernel
mapping (KeyId 0) to the TDX module's "global KeyID" mapping.

This means that the kernel must flush any dirty KeyID-0 PAMT cachelines
before the TDX module uses the global KeyID to access the PAMTs.
Otherwise, if those dirty cachelines were written back, they would
corrupt the TDX module's metadata.  Aside: This corruption would be
detected by the memory integrity hardware on the next read of the memory
with the global KeyID.  The result would likely be fatal to the system
but would not impact TDX security.

Following the TDX module specification, flush cache before configuring
the global KeyID on all packages.  Given the PAMT size can be large
(~1/256th of system RAM), just use WBINVD on all CPUs to flush.

If TDH.SYS.KEY.CONFIG fails, the TDX module may already have "converted"
some memory for TDX module use.  Convert the memory back so that it can
be safely used by the kernel again.  Note that this is slower than it
should be because of the "partial write machine check" erratum which
affects TDX-capable hardware.

Also refactor and introduce a new helper: tdmr_do_pamt_func().  This
takes a TDMR and runs a function on its PAMT.  It looks a _bit_ odd to
pass a function pointer around like this, but its use is pretty narrow
and it does eliminate what would otherwise be some copying and pasting.

[ dhansen: * munge changelog as usual
	   * remove weird (*pamd_func)() syntax ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-14-dave.hansen%40intel.com
2023-12-08 09:12:43 -08:00
Kai Huang
554ce1c36d x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
The TDX module uses a private KeyID as the "global KeyID" for mapping
things like the PAMT and other TDX metadata.  This KeyID has already
been reserved when detecting TDX during the kernel early boot.

Now that the "TD Memory Regions" (TDMRs) are fully built, pass them to
the TDX module together with the global KeyID.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-13-dave.hansen%40intel.com
2023-12-08 09:12:41 -08:00
Kai Huang
dde3b60d57 x86/virt/tdx: Designate reserved areas for all TDMRs
As the last step of constructing TDMRs, populate reserved areas for all
TDMRs.  Cover all memory holes and PAMTs with a TMDR reserved area.

[ dhansen: trim down chagnelog ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-12-dave.hansen%40intel.com
2023-12-08 09:12:39 -08:00
Kai Huang
ac3a220884 x86/virt/tdx: Allocate and set up PAMTs for TDMRs
The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory.  This metadata, referred as
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module.  PAMTs are not reserved by hardware
up front.  They must be allocated by the kernel and then given to the
TDX module during module initialization.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TDX guest early during system boot to get
those PAMTs allocated at early time, but the only way to fix is to add a
boot option to allocate or reserve PAMTs during kernel boot.

It is imperfect but will be improved on later.

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR.  If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
    the total number of reserved areas consumed for PAMTs.
  - Try to first allocate PAMT from the local node of the TDMR for better
    NUMA locality.

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully.  This helps answer the eternal "where did
all my memory go?" questions.

[ dhansen: merge in error handling cleanup ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-11-dave.hansen%40intel.com
2023-12-08 09:12:37 -08:00
Kai Huang
f3338ac159 x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
Start to transit out the "multi-steps" to construct a list of "TD Memory
Regions" (TDMRs) to cover all TDX-usable memory regions.

The kernel configures TDX-usable memory regions by passing a list of
TDMRs "TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains
the information of the base/size of a memory region, the base/size of the
associated Physical Address Metadata Table (PAMT) and a list of reserved
areas in the region.

Do the first step to fill out a number of TDMRs to cover all TDX memory
regions.  To keep it simple, always try to use one TDMR for each memory
region.  As the first step only set up the base/size for each TDMR.

Each TDMR must be 1G aligned and the size must be in 1G granularity.
This implies that one TDMR could cover multiple memory regions.  If a
memory region spans the 1GB boundary and the former part is already
covered by the previous TDMR, just use a new TDMR for the remaining
part.

TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
are consumed but there is more memory region to cover.

There are fancier things that could be done like trying to merge
adjacent TDMRs.  This would allow more pathological memory layouts to be
supported.  But, current systems are not even close to exhausting the
existing TDMR resources in practice.  For now, keep it simple.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-10-dave.hansen%40intel.com
2023-12-08 09:12:35 -08:00
Kai Huang
5173d3c5d0 x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
After the kernel selects all TDX-usable memory regions, the kernel needs
to pass those regions to the TDX module via data structure "TD Memory
Region" (TDMR).

Add a placeholder to construct a list of TDMRs (in multiple steps) to
cover all TDX-usable memory regions.

=== Long Version ===

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges is available to the kernel by querying the TDX module.

The TDX architecture needs additional metadata to record things like
which TD guest "owns" a given page of memory.  This metadata essentially
serves as the 'struct page' for the TDX module.  The space for this
metadata is not reserved by the hardware up front and must be allocated
by the kernel and given to the TDX module.

Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory.  If it chooses not
to, the memory cannot receive TDX protections and can not be used by TDX
guests as private memory.

For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.

Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes.  If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.

Let's summarize the concepts:

 CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
       4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
       TDX.  1G granularity and alignment required.  Each TDMR has
       reserved areas where TDX memory holes and overlapping PAMTs can
       be represented.
PAMT - Physically contiguous TDX metadata.  One table for each page size
       per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
       PAMT.

As one step of initializing the TDX module, the kernel configures
TDX-usable memory regions by passing a list of TDMRs to the TDX module.

Constructing the list of TDMRs consists below steps:

1) Fill out TDMRs to cover all memory regions that the TDX module will
   use for TD memory.
2) Allocate and set up PAMT for each TDMR.
3) Designate reserved areas for each TDMR.

Add a placeholder to construct TDMRs to do the above steps.  To keep
things simple, just allocate enough space to hold maximum number of
TDMRs up front.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-9-dave.hansen%40intel.com
2023-12-08 09:12:32 -08:00
Kai Huang
cf72bc4816 x86/virt/tdx: Get module global metadata for module initialization
The TDX module global metadata provides system-wide information about
the module.

TL;DR:

Use the TDH.SYS.RD SEAMCALL to tell if the module is good or not.

Long Version:

1) Only initialize TDX module with version 1.5 and later

TDX module 1.0 has some compatibility issues with the later versions of
module, as documented in the "Intel TDX module ABI incompatibilities
between TDX1.0 and TDX1.5" spec.  Don't bother with module versions that
do not have a stable ABI.

2) Get the essential global metadata for module initialization

TDX reports a list of "Convertible Memory Region" (CMR) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module.  The kernel does this by constructing a list of "TD
Memory Regions" (TDMRs) to cover all these memory regions and passing
them to the TDX module.

Each TDMR is a TDX architectural data structure containing the memory
region that the TDMR covers, plus the information to track (within this
TDMR):
  a) the "Physical Address Metadata Table" (PAMT) to track each TDX
     memory page's status (such as which TDX guest "owns" a given page,
     and
  b) the "reserved areas" to tell memory holes that cannot be used as
     TDX memory.

The kernel needs to get below metadata from the TDX module to build the
list of TDMRs:
  a) the maximum number of supported TDMRs
  b) the maximum number of supported reserved areas per TDMR and,
  c) the PAMT entry size for each TDX-supported page size.

== Implementation ==

The TDX module has two modes of fetching the metadata: a one field at
a time, or all in one blob.  Use the field at a time for now.  It is
slower, but there just are not enough fields now to justify the
complexity of extra unpacking.

The err_free_tdxmem=>out_put_tdxmem goto looks wonky by itself.  But
it is the first of a bunch of error handling that will get stuck at
its site.

[ dhansen: clean up changelog and add a struct to map between
	   the TDX module fields and 'struct tdx_tdmr_sysinfo' ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-8-dave.hansen%40intel.com
2023-12-08 09:12:18 -08:00
Kai Huang
abe8dbab8f x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
Start to transit out the "multi-steps" to initialize the TDX module.

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges is available to the kernel by querying the TDX module.

CMRs tell the kernel which memory is TDX compatible.  The kernel needs
to build a list of memory regions (out of CMRs) as "TDX-usable" memory
and pass them to the TDX module.  Once this is done, those "TDX-usable"
memory regions are fixed during module's lifetime.

To keep things simple, assume that all TDX-protected memory will come
from the page allocator.  Make sure all pages in the page allocator
*are* TDX-usable memory.

As TDX-usable memory is a fixed configuration, take a snapshot of the
memory configuration from memblocks at the time of module initialization
(memblocks are modified on memory hotplug).  This snapshot is used to
enable TDX support for *this* memory configuration only.  Use a memory
hotplug notifier to ensure that no other RAM can be added outside of
this configuration.

This approach requires all memblock memory regions at the time of module
initialization to be TDX convertible memory to work, otherwise module
initialization will fail in a later SEAMCALL when passing those regions
to the module.  This approach works when all boot-time "system RAM" is
TDX convertible memory and no non-TDX-convertible memory is hot-added
to the core-mm before module initialization.

For instance, on the first generation of TDX machines, both CXL memory
and NVDIMM are not TDX convertible memory.  Using kmem driver to hot-add
any CXL memory or NVDIMM to the core-mm before module initialization
will result in failure to initialize the module.  The SEAMCALL error
code will be available in the dmesg to help user to understand the
failure.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-7-dave.hansen%40intel.com
2023-12-08 09:12:16 -08:00
Kai Huang
6162b310bc x86/virt/tdx: Add skeleton to enable TDX on demand
There are essentially two steps to get the TDX module ready:
1) Get each CPU ready to run TDX
2) Set up the shared TDX module data structures

Introduce and export (to KVM) the infrastructure to do both of these
pieces at runtime.

== Per-CPU TDX Initialization ==

Track the initialization status of each CPU with a per-cpu variable.
This avoids failures in the case of KVM module reloads and handles cases
where CPUs come online later.

Generally, the per-cpu SEAMCALLs happen first.  But there's actually one
global call that has to happen before _any_ others (TDH_SYS_INIT).  It's
analogous to the boot CPU having to do a bit of extra work just because
it happens to be the first one.  Track if _any_ CPU has done this call
and then only actually do it during the first per-cpu init.

== Shared TDX Initialization ==

Create the global state function (tdx_enable()) as a simple placeholder.
The TODO list will be pared down as functionality is added.

Use a state machine protected by mutex to make sure the work in
tdx_enable() will only be done once.  This avoids failures if the KVM
module is reloaded.

A CPU must be made ready to run TDX before it can participate in
initializing the shared parts of the module.  Any caller of tdx_enable()
need to ensure that it can never run on a CPU which is not ready to
run TDX.  It needs to be wary of CPU hotplug, preemption and the
VMX enabling state of any CPU on which it might run.

== Why runtime instead of boot time? ==

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach to initialize TDX until there is a real need (e.g
when requested by KVM).  This approach has below pros:

1) It avoids consuming the memory that must be allocated by kernel and
given to the TDX module as metadata (~1/256th of the TDX-usable memory),
and also saves the CPU cycles of initializing the TDX module (and the
metadata) when TDX is not used at all.

2) The TDX module design allows it to be updated while the system is
running.  The update procedure shares quite a few steps with this "on
demand" initialization mechanism.  The hope is that much of "on demand"
mechanism can be shared with a future "update" mechanism.  A boot-time
TDX module implementation would not be able to share much code with the
update mechanism.

3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
code mucks with VMX enabling.  If the TDX module were to be initialized
separately from KVM (like at boot), the boot code would need to be
taught how to muck with VMX enabling and KVM would need to be taught how
to cope with that.  Making KVM itself responsible for TDX initialization
lets the rest of the kernel stay blissfully unaware of VMX.

[ dhansen: completely reorder/rewrite changelog ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-6-dave.hansen%40intel.com
2023-12-08 09:12:10 -08:00
Kai Huang
df01f5ae07 x86/virt/tdx: Add SEAMCALL error printing for module initialization
The SEAMCALLs involved during the TDX module initialization are not
expected to fail.  In fact, they are not expected to return any non-zero
code (except the "running out of entropy error", which can be handled
internally already).

Add yet another set of SEAMCALL wrappers, which treats all non-zero
return code as error, to support printing SEAMCALL error upon failure
for module initialization.  Note the TDX module initialization doesn't
use the _saved_ret() variant thus no wrapper is added for it.

SEAMCALL assembly can also return kernel-defined error codes for three
special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't
loaded; 3) CPU isn't in VMX operation.  Whether they can legally happen
depends on the caller, so leave to the caller to print error message
when desired.

Also convert the SEAMCALL error codes to the kernel error codes in the
new wrappers so that each SEAMCALL caller doesn't have to repeat the
conversion.

[ dhansen: Align the register dump with show_regs().  Zero-pad the
	   contents, split on two lines and use consistent spacing. ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-5-dave.hansen%40intel.com
2023-12-08 09:12:08 -08:00
Kai Huang
765a0542fd x86/virt/tdx: Detect TDX during kernel boot
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME.  The memory encryption hardware underpinning MKTME is also
used for Intel TDX.  TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs.  The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

During machine boot, TDX microcode verifies that the BIOS programmed TDX
private KeyIDs consistently and correctly programmed across all CPU
packages.  The MSRs are locked in this state after verification.  This
is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration:
it indicates not just that the hardware supports TDX, but that all the
boot-time security checks passed.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests.  The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.

Detect platform TDX support by detecting TDX private KeyIDs.

The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
to protect its metadata.  Each TDX guest also needs a TDX KeyID for its
own protection.  Just use the first TDX KeyID as the global KeyID and
leave the rest for TDX guests.  If no TDX KeyID is left for TDX guests,
disable TDX as initializing the TDX module alone is useless.

[ dhansen: add X86_FEATURE, replace helper function ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com
2023-12-08 09:11:58 -08:00
Kai Huang
7b804135d4 x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP
SEAMCALL instruction causes #UD if the CPU isn't in VMX operation.
Currently the TDX_MODULE_CALL assembly doesn't handle #UD, thus making
SEAMCALL when VMX is disabled would cause Oops.

Unfortunately, there are legal cases that SEAMCALL can be made when VMX
is disabled.  For instance, VMX can be disabled due to emergency reboot
while there are still TDX guests running.

Extend the TDX_MODULE_CALL assembly to return an error code for #UD to
handle this case gracefully, e.g., KVM can then quietly eat all SEAMCALL
errors caused by emergency reboot.

SEAMCALL instruction also causes #GP when TDX isn't enabled by the BIOS.
Use _ASM_EXTABLE_FAULT() to catch both exceptions with the trap number
recorded, and define two new error codes by XORing the trap number to
the TDX_SW_ERROR.  This opportunistically handles #GP too while using
the same simple assembly code.

A bonus is when kernel mistakenly calls SEAMCALL when CPU isn't in VMX
operation, or when TDX isn't enabled by the BIOS, or when the BIOS is
buggy, the kernel can get a nicer error code rather than a less
understandable Oops.

This is basically based on Peter's code.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/de975832a367f476aab2d0eb0d9de66019a16b54.1692096753.git.kai.huang%40intel.com
2023-09-12 16:30:27 -07:00
Kai Huang
c33621b4c5 x86/virt/tdx: Wire up basic SEAMCALL functions
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).  This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.  The TDX
module establishes a new SEAMCALL ABI which allows the host to
initialize the module and to manage VMs.

The SEAMCALL ABI is very similar to the TDCALL ABI and leverages much
TDCALL infrastructure.  Wire up basic functions to make SEAMCALLs for
the basic support of running TDX guests: __seamcall(), __seamcall_ret(),
and __seamcall_saved_ret() for TDH.VP.ENTER.  All SEAMCALLs involved in
the basic TDX support don't use "callee-saved" registers as input and
output, except the TDH.VP.ENTER.

To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt-in TDX host kernel support (to distinguish with TDX guest kernel
support).  So far only KVM uses TDX.  Make the new config option depend
on KVM_INTEL.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Link: https://lore.kernel.org/all/4db7c3fc085e6af12acc2932294254ddb3d320b3.1692096753.git.kai.huang%40intel.com
2023-09-12 16:30:27 -07:00
Kai Huang
90f5ecd37f x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm
Now the TDX_HYPERCALL asm is basically identical to the TDX_MODULE_CALL
with both '\saved' and '\ret' enabled, with two minor things though:

1) The way to restore the structure pointer is different

The TDX_HYPERCALL uses RCX as spare to restore the structure pointer,
but the TDX_MODULE_CALL assumes no spare register can be used.  In other
words, TDX_MODULE_CALL already covers what TDX_HYPERCALL does.

2) TDX_MODULE_CALL only clears shared registers for TDH.VP.ENTER

For this just need to make that code available for the non-host case.

Thus, remove the TDX_HYPERCALL and reimplement the __tdx_hypercall()
using the TDX_MODULE_CALL.

Extend the TDX_MODULE_CALL to cover "clear shared registers" for
TDG.VP.VMCALL.  Introduce a new __tdcall_saved_ret() to replace the
temporary __tdcall_hypercall().

The __tdcall_saved_ret() can also be used for those new TDCALLs which
require more input/output registers than the basic TDCALLs do.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/e68a2473fb6f5bcd78b078cae7510e9d0753b3df.1692096753.git.kai.huang%40intel.com
2023-09-12 16:30:14 -07:00