Commit Graph

1350 Commits

Author SHA1 Message Date
Paolo Bonzini
3eba032bb7 Merge branch 'kvm-userspace-hypercall' into HEAD
Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not.  Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
2025-01-20 07:03:06 -05:00
Paolo Bonzini
4f7ff70c05 KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to replace "governed" features
    with per-vCPU tracking of the vCPU's capabailities for all features.  Along
    the way, refactor the code to make it easier to add/modify features, and
    add a variety of self-documenting macro types to again simplify adding new
    features and to help readers understand KVM's handling of existing features.
 
  - Rework KVM's handling of VM-Exits during event vectoring to plug holes where
    KVM unintentionally puts the vCPU into infinite loops in some scenarios,
    e.g. if emulation is triggered by the exit, and to bring parity between VMX
    and SVM.
 
  - Add pending request and interrupt injection information to the kvm_exit and
    kvm_entry tracepoints respectively.
 
  - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
    loading guest/host PKRU due to a refactoring of the kernel helpers that
    didn't account for KVM's pre-checking of the need to do WRPKRU.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJngsACgkQOlYIJqCj
 N/1dfA/+NIZmnd8OV9Zvc6HGxrzgt4QsM9pmsUmrfkDWefxMYIAMeaW8Vn4CJfRf
 zY/UcqyNI7JYxSiuVTckz+Tf54HhqYaLrUwILGCQ49koirZx+aQT1OUfjLroVMlh
 ffX1i6GOoLNtxjb9MXM/heLVdUbvmzQMSFkd/AkOH+nrOtDNOiPlZfjHsewj9zrf
 BNJGhzvT4M6vc/AsScC7tc0yFD5KKFRv8tVwJ6Zf1nWKyUDOSpMTWkVnq6geKJPZ
 iGBZPPNg55Oy1g6uj6VYWmqYTD8Qioz5jtEJ/8pPHdAyIFo21s81bfJc548d+QLh
 KfrL1K7TrCOhSAGC3Cb3lTLeq2immmGHaiTBLwGABG4MhpiX4NVpMMdOyFbVLMOS
 HIYuwXwDckm1pfU7/w+PgPaakCyPrXQntm+3Y2pvDOoY6e2JbwodK4j8BvvQda35
 8TrYKEGFvq5aij7Iw1O9TUoLAocDM/sHIHE6BCazHyzKBIv9xLRFeabiCQ+A1pwv
 gZk5u0+j+DPpLdeLhbMYhIXUtr3bvyMYvc+tRkG716f8ubAE3+Kn5BEDo4Ot2DcT
 vc+NTRYYWN6zavHiJH3Ddt153yj256JCZhLwCdfbryCQdz3Mpy16m36tgkDRd3lR
 QT4IkPQo1Vl/aU0yiE/dhnJgh1rTO26YQjZoHs5Oj16d0HRrKyc=
 =32mM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.14:

 - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.

 - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.

 - Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.

 - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20 06:49:39 -05:00
Sean Christopherson
13b64ce1b6 KVM: x86: Move "emulate hypercall" function declarations to x86.h
Move the declarations for the hypercall emulation APIs to x86.h.  While the
helpers are exported, they are intended to be consumed only by KVM vendor
modules, i.e. don't need to be exposed to the kernel at-large.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20241128004344.4072099-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Chao Gao
ca0245d131 KVM: x86: Remove hwapic_irr_update() from kvm_x86_ops
Remove the redundant .hwapic_irr_update() ops.

If a vCPU has APICv enabled, KVM updates its RVI before VM-enter to L1
in vmx_sync_pir_to_irr(). This guarantees RVI is up-to-date and aligned
with the vIRR in the virtual APIC. So, no need to update RVI every time
the vIRR changes.

Note that KVM never updates vmcs02 RVI in .hwapic_irr_update() or
vmx_sync_pir_to_irr(). So, removing .hwapic_irr_update() has no
impact to the nested case.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241111085947.432645-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 07:34:15 -08:00
Maxim Levitsky
3e633e7e7d KVM: x86: Add interrupt injection information to the kvm_entry tracepoint
Add VMX/SVM specific interrupt injection info the kvm_entry tracepoint.
As is done with kvm_exit, gather the information via a kvm_x86_ops hook
to avoid the moderately costly VMREADs on VMX when the tracepoint isn't
enabled.

Opportunistically rename the parameters in the get_exit_info()
declaration to match the names used by both SVM and VMX.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240910200350.264245-2-mlevitsk@redhat.com
[sean: drop is_guest_mode() change, use intr_info/error_code for names]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:48 -08:00
Ivan Orlov
47ef3ef843 KVM: VMX: Handle event vectoring error in check_emulate_instruction()
Move handling of emulation during event vectoring, which KVM doesn't
support, into VMX's check_emulate_instruction(), so that KVM detects
all unsupported emulation, not just cached emulated MMIO (EPT misconfig).
E.g. on emulated MMIO that isn't cached (EPT Violation) or occurs with
legacy shadow paging (#PF).

Rejecting emulation on other sources of emulation also fixes a largely
theoretical flaw (thanks to the "unprotect and retry" logic), where KVM
could incorrectly inject a #DF:

  1. CPU executes an instruction and hits a #GP
  2. While vectoring the #GP, a shadow #PF occurs
  3. On the #PF VM-Exit, KVM re-injects #GP
  4. KVM emulates because of the write-protected page
  5. KVM "successfully" emulates and also detects the #GP
  6. KVM synthesizes a #GP, and since #GP has already been injected,
     incorrectly escalates to a #DF.

Fix the comment about EMULTYPE_PF as this flag doesn't necessarily
mean MMIO anymore: it can also be set due to the write protection
violation.

Note, handle_ept_misconfig() checks vmx_check_emulate_instruction() before
attempting emulation of any kind.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-5-iorlov@amazon.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:44 -08:00
Ivan Orlov
11c98fa07a KVM: x86: Add function for vectoring error generation
Extract VMX code for unhandleable VM-Exit during vectoring into
vendor-agnostic function so that boiler-plate code can be shared by SVM.

To avoid unnecessarily complexity in the helper, unconditionally report a
GPA to userspace instead of having a conditional entry.  For exits that
don't report a GPA, i.e. everything except EPT Misconfig, simply report
KVM's "invalid GPA".

Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-2-iorlov@amazon.com
[sean: clarify that the INVALID_GPA logic is new]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:41 -08:00
Sean Christopherson
7ea34578ae KVM: x86: Replace guts of "governed" features with comprehensive cpu_caps
Replace the internals of the governed features framework with a more
comprehensive "guest CPU capabilities" implementation, i.e. with a guest
version of kvm_cpu_caps.  Keep the skeleton of governed features around
for now as vmx_adjust_sec_exec_control() relies on detecting governed
features to do the right thing for XSAVES, and switching all guest feature
queries to guest_cpu_cap_has() requires subtle and non-trivial changes,
i.e. is best done as a standalone change.

Tracking *all* guest capabilities that KVM cares will allow excising the
poorly named "governed features" framework, and effectively optimizes all
KVM queries of guest capabilities, i.e. doesn't require making a
subjective decision as to whether or not a feature is worth "governing",
and doesn't require adding the code to do so.

The cost of tracking all features is currently 92 bytes per vCPU on 64-bit
kernels: 100 bytes for cpu_caps versus 8 bytes for governed_features.
That cost is well worth paying even if the only benefit was eliminating
the "governed features" terminology.  And practically speaking, the real
cost is zero unless those 92 bytes pushes the size of vcpu_vmx or vcpu_svm
into a new order-N allocation, and if that happens there are better ways
to reduce the footprint of kvm_vcpu_arch, e.g. making the PMU and/or MTRR
state separate allocations.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:05 -08:00
Sean Christopherson
a5b3271808 KVM: x86: Remove unnecessary caching of KVM's PV CPUID base
Now that KVM only searches for KVM's PV CPUID base when userspace sets
guest CPUID, drop the cache and simply do the search every time.

Practically speaking, this is a nop except for situations where userspace
sets CPUID _after_ running the vCPU, which is anything but a hot path,
e.g. QEMU does so only when hotplugging a vCPU.  And on the flip side,
caching guest CPUID information, especially information that is used to
query/modify _other_ CPUID state, is inherently dangerous as it's all too
easy to use stale information, i.e. KVM should only cache CPUID state when
the performance and/or programming benefits justify it.

Link: https://lore.kernel.org/r/20241128013424.4096668-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:56 -08:00
Sean Christopherson
76bce9f101 KVM: x86: Plumb in the vCPU to kvm_x86_ops.hwapic_isr_update()
Pass the target vCPU to the hwapic_isr_update() vendor hook so that VMX
can defer the update until after nested VM-Exit if an EOI for L1's vAPIC
occurs while L2 is active.

Note, commit d39850f57d ("KVM: x86: Drop @vcpu parameter from
kvm_x86_ops.hwapic_isr_update()") removed the parameter with the
justification that doing so "allows for a decent amount of (future)
cleanup in the APIC code", but it's not at all clear what cleanup was
intended, or if it was ever realized.

No functional change intended.

Cc: stable@vger.kernel.org
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241128000010.4051275-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 15:18:30 -08:00
Paolo Bonzini
d96c77bd4e KVM: x86: switch hugepage recovery thread to vhost_task
kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.

However, because of these threads, cgroups which have kvm instances
inside never complete freezing.  This can be trivially reproduced:

  root@test ~# mkdir /sys/fs/cgroup/test
  root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
  root@test ~# qemu-system-x86_64 -nographic -enable-kvm

and in another terminal:

  root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  root@test ~# cat /sys/fs/cgroup/test/cgroup.events
  populated 1
  frozen 0

The cgroup freezing happens in the signal delivery path but
kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never
calls into the signal delivery path and thus never gets frozen. Because
the cgroup freezer determines whether a given cgroup is frozen by
comparing the number of frozen threads to the total number of threads
in the cgroup, the cgroup never becomes frozen and users waiting for
the state transition may hang indefinitely.

Since the worker kthread is tied to a user process, it's better if
it behaves similarly to user tasks as much as possible, including
being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
that kvm_vm_create_worker_thread() wanted to be and more: not only it
inherits the userspace process's cgroups, it has other niceties like
being parented properly in the process tree.  Use it instead of the
homegrown alternative.

Incidentally, the new code is also better behaved when you flip recovery
back and forth to disabled and back to enabled.  If your recovery period
is 1 minute, it will run the next recovery after 1 minute independent
of how many times you flipped the parameter.

(Commit message based on emails from Tejun).

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Luca Boccassi <bluca@debian.org>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-14 13:20:04 -05:00
Paolo Bonzini
bb4409a9e7 KVM x86 misc changes for 6.13
- Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
 
  - Quirk KVM's misguided behavior of initialized certain feature MSRs to
    their maximum supported feature set, which can result in KVM creating
    invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
    value results in the vCPU having invalid state if userspace hides PDCM
    from the guest, which can lead to save/restore failures.
 
  - Fix KVM's handling of non-canonical checks for vCPUs that support LA57
    to better follow the "architecture", in quotes because the actual
    behavior is poorly documented.  E.g. most MSR writes and descriptor
    table loads ignore CR4.LA57 and operate purely on whether the CPU
    supports LA57.
 
  - Bypass the register cache when querying CPL from kvm_sched_out(), as
    filling the cache from IRQ context is generally unsafe, and harden the
    cache accessors to try to prevent similar issues from occuring in the
    future.
 
  - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
    over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
 
  - Minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmczowUACgkQOlYIJqCj
 N/3G8w//RYIslQkHXZXovQvhHKM9RBxdg6FjU0Do2KLN/xnR+JdjrBAI3HnG0TVu
 TEsA6slaTdvAFudSBZ27rORPARJ3XCSnDT+NflDR2UqtmGYFXxixcs4LQRUC3I2L
 tS/e847Qfp7/+kXFYQuH6YmMftCf7SQNbUPU3EwSXY8seUJB6ZhO89WXgrBtaMH+
 94b7EdkP86L4dqiEMGr/q/46/ewFpPlB4WWFNIAY+SoZebQ+C8gcAsGDzfG6giNV
 HxdehWNp5nJ34k9Tudt2FseL/IclA3nXZZIl2dU6PWLkjgvDqL2kXLHpAQTpKuXf
 ZzBM4wjdVqZRQHscLgX6+0/uOJDf9/iJs/dwD9PbzLhAJnHF3SWRAy/grzpChLFR
 Yil5zagtdhjqSKDf2FsUCJ7lwaST0fhHvmZZx4loIhcmN0/rvvrhhfLsJgCWv/Kf
 RWMkiSQGhxAZUXjOJDQB+wTHv7ZoJy51ov7/rYjr49jCJrCVO6I6yQ6lNKDCYClR
 vDS3yK6fiEXbC09iudm74FBl2KO+BwJKhnidekzzKGv34RHRAux5Oo54J5jJMhBA
 tXFxALGwKr1KAI7vLK8ByZzwZjN5pDmKHUuOAlJNy9XQi+b4zbbREkXlcqZ5VgxA
 xCibpnLzJFoHS8y7c76wfz4mCkWSWzuC9Rzy4sTkHMzgJESmZiE=
 =dESL
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.13

 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.

 - Quirk KVM's misguided behavior of initialized certain feature MSRs to
   their maximum supported feature set, which can result in KVM creating
   invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
   value results in the vCPU having invalid state if userspace hides PDCM
   from the guest, which can lead to save/restore failures.

 - Fix KVM's handling of non-canonical checks for vCPUs that support LA57
   to better follow the "architecture", in quotes because the actual
   behavior is poorly documented.  E.g. most MSR writes and descriptor
   table loads ignore CR4.LA57 and operate purely on whether the CPU
   supports LA57.

 - Bypass the register cache when querying CPL from kvm_sched_out(), as
   filling the cache from IRQ context is generally unsafe, and harden the
   cache accessors to try to prevent similar issues from occuring in the
   future.

 - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
   over-advertises SPEC_CTRL when trying to support cross-vendor VMs.

 - Minor cleanups
2024-11-13 06:33:00 -05:00
Vipin Sharma
4cf20d4254 KVM: x86/mmu: Drop per-VM zapped_obsolete_pages list
Drop the per-VM zapped_obsolete_pages list now that the usage from the
defunct mmu_shrinker is gone, and instead use a local list to track pages
in kvm_zap_obsolete_pages(), the sole remaining user of
zapped_obsolete_pages.

Opportunistically add an assertion to verify and document that slots_lock
must be held, i.e. that there can only be one active instance of
kvm_zap_obsolete_pages() at any given time, and by doing so also prove
that using a local list instead of a per-VM list doesn't change any
functionality (beyond trivialities like list initialization).

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 19:22:53 -08:00
David Matlack
13e2e4f62a KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping
Recover TDP MMU huge page mappings in-place instead of zapping them when
dirty logging is disabled, and rename functions that recover huge page
mappings when dirty logging is disabled to move away from the "zap
collapsible spte" terminology.

Before KVM flushes TLBs, guest accesses may be translated through either
the (stale) small SPTE or the (new) huge SPTE. This is already possible
when KVM is doing eager page splitting (where TLB flushes are also
batched), and when vCPUs are faulting in huge mappings (where TLBs are
flushed after the new huge SPTE is installed).

Recovering huge pages reduces the number of page faults when dirty
logging is disabled:

 $ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

 Before: 393,599      kvm:kvm_page_fault
 After:  262,575      kvm:kvm_page_fault

vCPU throughput and the latency of disabling dirty-logging are about
equal compared to zapping, but avoiding faults can be beneficial to
remove vCPU jitter in extreme scenarios.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04 18:37:22 -08:00
Sean Christopherson
dcb988cdac KVM: x86: Quirk initialization of feature MSRs to KVM's max configuration
Add a quirk to control KVM's misguided initialization of select feature
MSRs to KVM's max configuration, as enabling features by default violates
KVM's approach of letting userspace own the vCPU model, and is actively
problematic for MSRs that are conditionally supported, as the vCPU will
end up with an MSR value that userspace can't restore.  E.g. if the vCPU
is configured with PDCM=0, userspace will save and attempt to restore a
non-zero PERF_CAPABILITIES, thanks to KVM's meddling.

Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:31 -07:00
Sean Christopherson
f0e7012c4b KVM: x86: Bypass register cache when querying CPL from kvm_sched_out()
When querying guest CPL to determine if a vCPU was preempted while in
kernel mode, bypass the register cache, i.e. always read SS.AR_BYTES from
the VMCS on Intel CPUs.  If the kernel is running with full preemption
enabled, using the register cache in the preemption path can result in
stale and/or uninitialized data being cached in the segment cache.

In particular the following scenario is currently possible:

 - vCPU is just created, and the vCPU thread is preempted before
   SS.AR_BYTES is written in vmx_vcpu_reset().

 - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() =>
   vmx_get_cpl() reads and caches '0' for SS.AR_BYTES.

 - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't
   invoke vmx_segment_cache_clear() to invalidate the cache.

As a result, KVM retains a stale value in the cache, which can be read,
e.g. via KVM_GET_SREGS.  Usually this is not a problem because the VMX
segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM
selftests) reads and writes system registers just after the vCPU was
created, _without_ modifying SS.AR_BYTES, userspace will write back the
stale '0' value and ultimately will trigger a VM-Entry failure due to
incorrect SS segment type.

Note, the VM-Enter failure can also be avoided by moving the call to
vmx_segment_cache_clear() until after the vmx_vcpu_reset() initializes all
segments.  However, while that change is correct and desirable (and will
come along shortly), it does not address the underlying problem that
accessing KVM's register caches from !task context is generally unsafe.

In addition to fixing the immediate bug, bypassing the cache for this
particular case will allow hardening KVM register caching log to assert
that the caches are accessed only when KVM _knows_ it is safe to do so.

Fixes: de63ad4cf4 ("KVM: X86: implement the logic for spinlock optimization")
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Closes: https://lore.kernel.org/all/20240716022014.240960-3-mlevitsk@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241009175002.1118178-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:21 -07:00
Paolo Bonzini
3f8df62852 Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.12:

 - Set FINAL/PAGE in the page fault error code for EPT Violations if and only
   if the GVA is valid.  If the GVA is NOT valid, there is no guest-side page
   table walk and so stuffing paging related metadata is nonsensical.

 - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of
   emulating posted interrupt delivery to L2.

 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures.

 - Harden eVMCS loading against an impossible NULL pointer deref (really truly
   should be impossible).

 - Minor SGX fix and a cleanup.
2024-09-17 12:41:23 -04:00
Paolo Bonzini
5d55a052e3 Merge tag 'kvm-x86-mmu-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.12:

 - Overhaul the "unprotect and retry" logic to more precisely identify cases
   where retrying is actually helpful, and to harden all retry paths against
   putting the guest into an infinite retry loop.

 - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in
   the shadow MMU.

 - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for
   adding MGLRU support in KVM.

 - Misc cleanups
2024-09-17 12:39:53 -04:00
Paolo Bonzini
41786cc5ea Merge tag 'kvm-x86-misc-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.12

 - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10
   functionality that is on the horizon).

 - Rework common MSR handling code to suppress errors on userspace accesses to
   unsupported-but-advertised MSRs.  This will allow removing (almost?) all of
   KVM's exemptions for userspace access to MSRs that shouldn't exist based on
   the vCPU model (the actual cleanup is non-trivial future work).

 - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the
   64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv)
   stores the entire 64-bit value a the ICR offset.

 - Fix a bug where KVM would fail to exit to userspace if one was triggered by
   a fastpath exit handler.

 - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when
   there's already a pending wake event at the time of the exit.

 - Finally fix the RSM vs. nested VM-Enter WARN by forcing the vCPU out of
   guest mode prior to signalling SHUTDOWN (architecturally, the SHUTDOWN is
   supposed to hit L1, not L2).
2024-09-17 11:38:23 -04:00
Sean Christopherson
6b3dcabc10 KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
Fold kvm_mmu_unprotect_page() into kvm_mmu_unprotect_gfn_and_retry() now
that all other direct usage is gone.

No functional change intended.

Link: https://lore.kernel.org/r/20240831001538.336683-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:34 -07:00
Sean Christopherson
4df685664b KVM: x86: Update retry protection fields when forcing retry on emulation failure
When retrying the faulting instruction after emulation failure, refresh
the infinite loop protection fields even if no shadow pages were zapped,
i.e. avoid hitting an infinite loop even when retrying the instruction as
a last-ditch effort to avoid terminating the guest.

Link: https://lore.kernel.org/r/20240831001538.336683-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:32 -07:00
Sean Christopherson
01dd4d3192 KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path
Move the anti-infinite-loop protection provided by last_retry_{eip,addr}
into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that
never hits the emulator, as well as reexecute_instruction(), which is the
last ditch "might as well try it" logic that kicks in when emulation fails
on an instruction that faulted on a write-protected gfn.

Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry
fields and deduplicate other code (with more to come).

Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:23 -07:00
Sean Christopherson
4ececec19a KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper
Drop the globally visible PFERR_NESTED_GUEST_PAGE and replace it with a
more appropriately named is_write_to_guest_page_table().  The macro name
is misleading, because while all nNPT walks match PAGE|WRITE|PRESENT, the
reverse is not true.

No functional change intended.

Link: https://lore.kernel.org/r/20240831001538.336683-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:18 -07:00
Sean Christopherson
363010e1dd KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
Move the logic to get the to-be-acknowledge IRQ for a nested VM-Exit from
nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one
and only path where KVM invokes nested_vmx_vmexit() with
EXIT_REASON_EXTERNAL_INTERRUPT.  A future fix will perform a last-minute
check on L2's nested posted interrupt notification vector, just before
injecting a nested VM-Exit.  To handle that scenario correctly, KVM needs
to get the interrupt _before_ injecting VM-Exit, as simply querying the
highest priority interrupt, via kvm_cpu_has_interrupt(), would result in
TOCTOU bug, as a new, higher priority interrupt could arrive between
kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt().

Unfortunately, simply moving the call to kvm_cpu_get_interrupt() doesn't
suffice, as a VMWRITE to GUEST_INTERRUPT_STATUS.SVI is hiding in
kvm_get_apic_interrupt(), and acknowledging the interrupt before nested
VM-Exit would cause the VMWRITE to hit vmcs02 instead of vmcs01.

Open code a rough equivalent to kvm_cpu_get_interrupt() so that the IRQ
is acknowledged after emulating VM-Exit, taking care to avoid the TOCTOU
issue described above.

Opportunistically convert the WARN_ON() to a WARN_ON_ONCE().  If KVM has
a bug that results in a false positive from kvm_cpu_has_interrupt(),
spamming dmesg won't help the situation.

Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as
they are always "wanted" by L0.

Link: https://lore.kernel.org/r/20240906043413.1049633-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:14:58 -07:00
Sean Christopherson
590b09b1d8 KVM: x86: Register "emergency disable" callbacks when virt is enabled
Register the "disable virtualization in an emergency" callback just
before KVM enables virtualization in hardware, as there is no functional
need to keep the callbacks registered while KVM happens to be loaded, but
is inactive, i.e. if KVM hasn't enabled virtualization.

Note, unregistering the callback every time the last VM is destroyed could
have measurable latency due to the synchronize_rcu() needed to ensure all
references to the callback are dropped before KVM is unloaded.  But the
latency should be a small fraction of the total latency of disabling
virtualization across all CPUs, and userspace can set enable_virt_at_load
to completely eliminate the runtime overhead.

Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
There is no reason to force vendor code to do the registration, and either
way KVM would need a new kvm_x86_ops hook.

Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04 11:02:34 -04:00
Sean Christopherson
0617a769ce KVM: x86: Rename virtualization {en,dis}abling APIs to match common KVM
Rename x86's the per-CPU vendor hooks used to enable virtualization in
hardware to align with the recently renamed arch hooks.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240830043600.127750-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04 11:02:33 -04:00
Sean Christopherson
f7f39c50ed KVM: x86: Exit to userspace if fastpath triggers one on instruction skip
Exit to userspace if a fastpath handler triggers such an exit, which can
happen when skipping the instruction, e.g. due to userspace
single-stepping the guest via KVM_GUESTDBG_SINGLESTEP or because of an
emulation failure.

Fixes: 404d5d7bff ("KVM: X86: Introduce more exit_fastpath_completion enum values")
Link: https://lore.kernel.org/r/20240802195120.325560-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 19:50:21 -07:00
Sean Christopherson
73b42dc69b KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's
IPI virtualization support, but only for AMD.  While not stated anywhere
in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs
store the 64-bit ICR as two separate 32-bit values in ICR and ICR2.  When
IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled,
KVM needs to match CPU behavior as some ICR ICR writes will be handled by
the CPU, not by KVM.

Add a kvm_x86_ops knob to control the underlying format used by the CPU to
store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether
or not x2AVIC is enabled.  If KVM is handling all ICR writes, the storage
format for x2APIC mode doesn't matter, and having the behavior follow AMD
versus Intel will provide better test coverage and ease debugging.

Fixes: 4d1d7942e3 ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29 16:25:05 -07:00
Sean Christopherson
b848f24bd7 KVM: x86: Rename get_msr_feature() APIs to get_feature_msr()
Rename all APIs related to feature MSRs from get_msr_feature() to
get_feature_msr().  The APIs get "feature MSRs", not "MSR features".
And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't
describe the helper itself.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:56 -07:00
Sean Christopherson
74c6c98a59 KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entry
Refactor get_msr_feature() to take the index and data pointer as distinct
parameters in anticipation of eliminating "struct kvm_msr_entry" usage
further up the primary callchain.

No functional change intended.

Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 12:06:30 -07:00
Sean Christopherson
653ea4489e KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if
L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has
disallowed access to via an MSR filter.  Intel already disallows including
a handful of "special" MSRs in the VMCS lists, so denying access isn't
completely without precedent.

More importantly, the behavior is well-defined _and_ can be communicated
the end user, e.g. to the customer that owns a VM running as L1 on top of
KVM.  On the other hand, ignoring userspace MSR filters is all but
guaranteed to result in unexpected behavior as the access will hit KVM's
internal state, which is likely not up-to-date.

Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS
fields, the MSRs in the VMCS load/store lists are 100% guest controlled,
thus making it all but impossible to reason about the correctness of
ignoring the MSR filter.  And if userspace *really* wants to deny access
to MSRs via the aforementioned scenarios, userspace can hide the
associated feature from the guest, e.g. by disabling the PMU to prevent
accessing PERF_GLOBAL_CTRL via its VMCS field.  But for the MSR lists, KVM
is blindly processing MSRs; the  MSR filters are the _only_ way for
userspace to deny access.

This partially reverts commit ac8d6cad3c ("KVM: x86: Only do MSR
filtering when access MSR by rdmsr/wrmsr").

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:16 -07:00
Yan Zhao
aa8d1f48d3 KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
VMs. Make sure KVM behave as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs.

The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options:
- when enabled:  Invalidate/zap all SPTEs ("zap-all"),
- when disabled: Precisely zap only the leaf SPTEs within the range of the
                 moving/deleting memory slot ("zap-slot-leafs-only").

"zap-all" is today's KVM behavior to work around a bug [1] where the
changing the zapping behavior of memslot move/deletion would cause VM
instability for VMs with an Nvidia GPU assigned; while
"zap-slot-leafs-only" allows for more precise zapping of SPTEs within the
memory slot range, improving performance in certain scenarios [2], and
meeting the functional requirements for TDX.

Previous attempts to select "zap-slot-leafs-only" include a per-VM
capability approach [3] (which was not preferred because the root cause of
the bug remained unidentified) and a per-memslot flag approach [4]. Sean
and Paolo finally recommended the implementation of this quirk and
explained that it's the least bad option [5].

By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use
"zap-all". Users have the option to disable the quirk to select
"zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are
unaffected by this bug.

For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is
always selected without user's opt-in, regardless of if the user opts for
"zap-all".
This is because it is assumed until proven otherwise that non-
KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most
importantly, it's because TDX must have "zap-slot-leafs-only" always
selected. In TDX's case a memslot's GPA range can be a mixture of "private"
or "shared" memory. Shared is roughly analogous to how EPT is handled for
normal VMs, but private GPAs need lots of special treatment:
1) "zap-all" would require to zap private root page or non-leaf entries or
   at least leaf-entries beyond the deleting memslot scope. However, TDX
   demands that the root page of the private page table remains unchanged,
   with leaf entries being zapped before non-leaf entries, and any dropped
   private guest pages must be re-accepted by the guest.
2) if "zap-all" zaps only shared page tables, it would result in private
   pages still being mapped when the memslot is gone. This may affect even
   other processes if later the gmem fd was whole punched, causing the
   pages being freed on the host while still mapped in the TD, because
   there's no pgoff to the gfn information to zap the private page table
   after memslot is gone.

So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or
complicating quirk disabling interface (current quirk disabling interface
is limited, no way to query quirks, or force them to be disabled).

Add a new function kvm_mmu_zap_memslot_leafs() to implement
"zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(),
bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as
1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can
   it be moved. It is only deleted by KVM when APICv is permanently
   inhibited.
2) kvm_vcpu_reload_apic_access_page() effectively does nothing when
   APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted.
3) Avoid making all cpus request of KVM_REQ_APIC_PAGE_RELOAD can save on
   costly IPIs.

Suggested-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2]
Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3]
Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4]
Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5]
Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 12:29:11 -04:00
Sean Christopherson
66155de93b KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation.  Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.

But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes.  Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC.  And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.

Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0.  But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.

Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots is really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.

Fixes: 26c44aa9e0 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12 ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240809190319.1710470-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 12:28:24 -04:00
Paolo Bonzini
5932ca411e KVM: x86: disallow pre-fault for SNP VMs before initialization
KVM_PRE_FAULT_MEMORY for an SNP guest can race with
sev_gmem_post_populate() in bad ways. The following sequence for
instance can potentially trigger an RMP fault:

  thread A, sev_gmem_post_populate: called
  thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP
  thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i);
  thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
  RMP #PF

Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's
initial private memory contents have been finalized via
KVM_SEV_SNP_LAUNCH_FINISH.

Beyond fixing this issue, it just sort of makes sense to enforce this,
since the KVM_PRE_FAULT_MEMORY documentation states:

  "KVM maps memory as if the vCPU generated a stage-2 read page fault"

which sort of implies we should be acting on the same guest state that a
vCPU would see post-launch after the initial guest memory is all set up.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:14 -04:00
Wei Wang
5d766508fd KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
Similar to kvm_x86_call(), kvm_pmu_call() is added to streamline the usage
of static calls of kvm_pmu_ops, which improves code readability.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-4-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 12:14:12 -04:00
Wei Wang
896046474f KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
Introduces kvm_x86_call(), to streamline the usage of static calls of
kvm_x86_ops. The current implementation of these calls is verbose and
could lead to alignment challenges. This makes the code susceptible to
exceeding the "80 columns per single line of code" limit as defined in
the coding-style document. Another issue with the existing implementation
is that the addition of kvm_x86_ prefix to hooks at the static_call sites
hinders code readability and navigation. kvm_x86_call() is added to
improve code readability and maintainability, while adhering to the coding
style guidelines.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 12:14:12 -04:00
Wei Wang
f4854bf741 KVM: x86: Replace static_call_cond() with static_call()
The use of static_call_cond() is essentially the same as static_call() on
x86 (e.g. static_call() now handles a NULL pointer as a NOP), so replace
it with static_call() to simplify the code.

Link: https://lore.kernel.org/all/3916caa1dcd114301a49beafa5030eca396745c1.1679456900.git.jpoimboe@kernel.org/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-2-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 12:14:11 -04:00
Paolo Bonzini
208a352a54 KVM VMX changes for 6.11
- Remove an unnecessary EPT TLB flush when enabling hardware.
 
  - Fix a series of bugs that cause KVM to fail to detect nested pending posted
    interrupts as valid wake eents for a vCPU executing HLT in L2 (with
    HLT-exiting disable by L1).
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRvX0ACgkQOlYIJqCj
 N/2Aiw/9Htwy4MfJ2zdTX0ypZx6CUAVY0B7R2q9LVaqlBBL02dLoNWn9ndf7J2pd
 TJKtp39sHzf342ghti/Za5+mZgRgXA9IjQ5cvcQQjfmjDdDODygEc12otISeSNqq
 uL2jbUZzzjbcQyUrXkeFptVcNFpaiOG0dFfvnoi1csWzXVf7t+CD+8/3kjVm2Qt7
 vQXkV4yN7tNiYOvaukfXP7Og9ALpF8g8ok3YmXVXDPMu7+R7G+P6j3mVWr9ABMPj
 LOmC+5Z/sscMFw1Io3XHuWoF5socQARXEzJNLCblDaw3GMlSj4LNxif2M/6B7bmR
 nQVtiegj9K1Fc3OGOqPJcAIRPI4O9nMmf7uOwvXmOlwDSk7rCxF/yPk7Cto2+UXm
 6mnLcH1l0/VaidW+a7rUAcDGIlWwgfw0F6tp2j6FdVl2Lx/IThcrkn0teLY1gAW8
 CMi/BfTBEXO5583O3+ZCAzVQzeKnWR3yqwJe0oSftB1/rPkPD8PQ39MH8LuJJJxi
 CN1W4R1/taQdOxMZqggDvS1biz7gwpjNGtnWsO9szAgMEXVjf2M1HOZVcT2e2997
 81xDMdZaJSfd26tm7PhWtQnVPqyMZ6vqqIiq7FlIbEEkAE75Kbg4fUn/4y4WRnh9
 3Gog6MZPu/MA5TbwvcZ/sy/CRfFu0HKm5q98oArhjSyU8C7oGeQ=
 =W1/6
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.11

 - Remove an unnecessary EPT TLB flush when enabling hardware.

 - Fix a series of bugs that cause KVM to fail to detect nested pending posted
   interrupts as valid wake eents for a vCPU executing HLT in L2 (with
   HLT-exiting disable by L1).

 - Misc cleanups
2024-07-16 09:56:41 -04:00
Paolo Bonzini
cda231cd42 KVM x86/pmu changes for 6.11
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
    '0' and writes from userspace are ignored.
 
  - Update to the newfangled Intel CPU FMS infrastructure.
 
  - Use macros instead of open-coded literals to clean up KVM's manipulation of
    FIXED_CTR_CTRL MSRs.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRu3oACgkQOlYIJqCj
 N/0f/Q/+PoRFKrr9ENwlVjxmq7DBJOzrEiht5EH89bQdpYL0pcmv6I+n+Z77o08X
 l49YFO2zVq26dMCe8EFDuQrZpqKjOS/qEc+/zTsLu4lx8NJD1gqYJLJryejgtdQI
 +GefPVIN11TvlDDjuuxSWgKUCAevk8s3PRe+zbUwlsHmw+GVky8dJoe71QbW27rK
 hL7Y2pOe5Y8MgRAadxlhm6QmgOnz3RKKYs9t/HMzi2gQP1TuvPxnYtMC3Gz5pVe+
 w3Ak7M4fh8Z7FbQsoNY5h3IdigG6eFrssqHX4QpCXr/G5L9vAgUmSR93/M8jLjNv
 wAkUulLx7vFeTlOXjqcEJSn0U6mX/48pt68vrPB5ES1Rx28RB5s9tzYXCGtCmSxv
 nHmMDc3YUbg6tp2hvliMqjsN0j5l2GQiX7LJwH2Ma9qQFlTHPmFwJGS4hciki6c5
 obCK2vXBoS1jyxrZx8qUhIcJl2oigv3hwihN2YqZ0Q4QDwllv8cw4BeABQByYR9x
 T91PQ0biiJ9vWCkALbyzYOpy+grdHCblwYW9+FM/qZBGH0ouPzDyZWPrRBLX12pH
 fEgDMB3vT9JqQ5tyafd0MHuAVlrDVHYEY+lmXplzFGKEFonBkN7HmDzAOKafuCuj
 GnIe0Sa1JnHVPNomx2dnG6Sku6/tPIfERuHEXrR9zkUJsacnfRY=
 =pJxP
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86/pmu changes for 6.11

 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
   '0' and writes from userspace are ignored.

 - Update to the newfangled Intel CPU FMS infrastructure.

 - Use macros instead of open-coded literals to clean up KVM's manipulation of
   FIXED_CTR_CTRL MSRs.
2024-07-16 09:55:15 -04:00
Paolo Bonzini
5c5ddf7107 KVM x86 MTRR virtualization removal
Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
 hack, and instead always honor guest PAT on CPUs that support self-snoop.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuwAACgkQOlYIJqCj
 N/32Gg/+Nnnz6TCRno2vursPJme7gvtLdqSxjazAj3u2ZO8IApGYWMyfVpS+ymC9
 Wdpj6gRe2ukSxgTsUI2CYoy5V2NxDaA9YgdTPZUVQvqwujVrqZCJ7L393iPYYnC9
 No3LXZ+SOYRmomiCzknjC6GOlT2hAZHzQsyaXDlEYok7NAA2L6XybbLonEdA4RYi
 V1mS62W5PaA4tUesuxkJjPujXo1nXRWD/aXOruJWjPESdSFSALlx7reFAf2Nwn7K
 Uw8yZqhq6vWAZSph0Nz8OrZOS/kULKA3q2zl1B/qJJ0ToAt2VdXS6abXky52RExf
 KvP+jBAWMO5kHbIqaMRtCHjbIkbhH8RdUIYNJQEUQ5DdydM5+/RDa+KprmLPcmUn
 qvJq+3uyH0MEENtneGegs8uxR+sn6fT32cGMIw790yIywddh562+IJ4Z+C3BuYJi
 yszD71odqKT8+knUd2CaZjE9UZyoQNDfj2OCCTzzZOC/6TuJWCh9CYQ1csssHbQR
 KcvZCKE6ht8tWwi+2HWj0laOdg1reX2kV869k3xH4uCwEaFIj2Wk+/Bw/lg2Tn5h
 5uTnQ01dx5XhAV1klr6IY3VXJ/A8G8895wRfkZEelsA9Wj8qZvNgXhsoXReIUIrn
 aR0ppsFcbqHzC50qE2JT4juTD1EPx95LL9zKT8pI9mGKwxCAxUM=
 =yb10
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MTRR virtualization removal

Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
hack, and instead always honor guest PAT on CPUs that support self-snoop.
2024-07-16 09:54:57 -04:00
Paolo Bonzini
5dcc1e7614 KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
    move "shadow_phys_bits" into the structure as "maxphyaddr".
 
  - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
    bus frequency, because TDX.
 
  - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
 
  - Clean up KVM's handling of vendor specific emulation to consistently act on
    "compatible with Intel/AMD", versus checking for a specific vendor.
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRub0ACgkQOlYIJqCj
 N/2LMxAArGzhcWZ6Qdo2aMRaMIPtSBJHmbEgEuHvHMumgsTZQzDcn9cxDi/hNSrc
 l8ODOwAM2qNcq95YfwjU7F0ae3E+HRzGvKcBnmZWuQeCDp2HhVEoCphFu1sHst+t
 XEJTL02b6OgyJUEU3h40mYk12eiq2S4FCnFYXPCqijwwuL6Y5KQvvTqek3c2/SDn
 c+VneutYGax/S0GiiCkYh4wrwWh9g7qm0IX70ycBwJbW5qBFKgyglvHxvL8JLJC9
 Nkkw/p2657wcOdraH+fOBuRy2dMwE5fv++1tOjWwB5WAAhSOJPZh0BGYvgA2yfN7
 OE+k7APKUQd9Xxtud8H3LrTPoyMA4hz2sdDFyqrrWK9yjpBY7zXNyN50Fxi7VVsm
 T8nTIiKAGyRbjotY+m7krXQPXjfZYhVqrJ/jtxESOZLZ93q2gSWU2p/ZXpUPVHnH
 +YOBAI1owP3wepaYlrthtI4LQx9lF422dnmeSflztfKFGabRbQZxg3uHMCCxIaGc
 lJ6CD546+D45f/uBXRDMqk//qFTqXhKUbDk9sutmU/C2oWufMwW0R8kOyItGPyvk
 9PP1vd8vSsIHj+tpwg+i04jBqYDaAcPBOcTZaHm9SYYP+1e11Uu5Vjep37JL1bkA
 xJWxnDZOCGcfKQi2jkh51HJ/dOAHXY1GQKMfyAoPQOSonYHvGVY=
 =Cf2R
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.11

 - Add a global struct to consolidate tracking of host values, e.g. EFER, and
   move "shadow_phys_bits" into the structure as "maxphyaddr".

 - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
   bus frequency, because TDX.

 - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.

 - Clean up KVM's handling of vendor specific emulation to consistently act on
   "compatible with Intel/AMD", versus checking for a specific vendor.

 - Misc cleanups
2024-07-16 09:53:05 -04:00
Paolo Bonzini
86014c1e20 KVM generic changes for 6.11
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
 
  - Setup empty IRQ routing when creating a VM to avoid having to synchronize
    SRCU when creating a split IRQCHIP on x86.
 
  - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
    that arch code can use for hooking both sched_in() and sched_out().
 
  - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
    truncating a bogus value from userspace, e.g. to help userspace detect bugs.
 
  - Mark a vCPU as preempted if and only if it's scheduled out while in the
    KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
    memory when retrieving guest state during live migration blackout.
 
  - A few minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuOYACgkQOlYIJqCj
 N/1UnQ/8CI5Qfr+/0gzYgtWmtEMczGG+rMNpzD3XVqPjJjXcMcBiQnplnzUVLhha
 vlPdYVK7vgmEt003XGzV55mik46LHL+DX/v4hI3HEdblfyCeNLW3fKEWVRB44qJe
 o+YUQwSK42SORUp9oXuQINxhA//U9EnI7CQxlJ8w8wenv5IJKfIGr01DefmfGPAV
 PKm9t6WLcNqvhZMEyy/zmzM3KVPCJL0NcwI97x6sHxFpQYIDtL0E/VexA4AFqMoT
 QK7cSDC/2US41Zvem/r/GzM/ucdF6vb9suzZYBohwhxtVhwJe2CDeYQZvtNKJ1U7
 GOHPaKL6nBWdZCm/yyWbbX2nstY1lHqxhN3JD0X8wqU5rNcwm2b8Vfyav0Ehc7H+
 jVbDTshOx4YJmIgajoKjgM050rdBK59TdfVL+l+AAV5q/TlHocalYtvkEBdGmIDg
 2td9UHSime6sp20vQfczUEz4bgrQsh4l2Fa/qU2jFwLievnBw0AvEaMximkSGMJe
 b8XfjmdTjlOesWAejANKtQolfrq14+1wYw0zZZ8PA+uNVpKdoovmcqSOcaDC9bT8
 GO/NFUvoG+lkcvJcIlo1SSl81SmGLosijwxWfGvFAqsgpR3/3l3dYp0QtztoCNJO
 d3+HnjgYn5o5FwufuTD3eUOXH4AFjG108DH0o25XrIkb2Kymy0o=
 =BalU
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD

KVM generic changes for 6.11

 - Enable halt poll shrinking by default, as Intel found it to be a clear win.

 - Setup empty IRQ routing when creating a VM to avoid having to synchronize
   SRCU when creating a split IRQCHIP on x86.

 - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
   that arch code can use for hooking both sched_in() and sched_out().

 - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
   truncating a bogus value from userspace, e.g. to help userspace detect bugs.

 - Mark a vCPU as preempted if and only if it's scheduled out while in the
   KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
   memory when retrieving guest state during live migration blackout.

 - A few minor cleanups
2024-07-16 09:51:36 -04:00
Dapeng Mi
f287bef6dd KVM: x86/pmu: Introduce distinct macros for GP/fixed counter max number
Refine the macros which define maximum General Purpose (GP) and fixed
counter numbers.

Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the
maximum supported General Purpose (GP) counter number ambiguously across
Intel and AMD platforms. This would cause issues if AMD begins to support
more GP counters than Intel.

Thus a bunch of new macros including vendor specific and vendor
independent are introduced to replace the old macros. The vendor
independent macros are used in x86 common code to hide vendor difference
and eliminate the ambiguity.

No logic changes are introduced in this patch.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240627021756.144815-1-dapeng1.mi@linux.intel.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 09:12:16 -07:00
Sean Christopherson
321ef62b0c KVM: nVMX: Fold requested virtual interrupt check into has_nested_events()
Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is
pending delivery, in vmx_has_nested_events() and drop the one-off
kvm_x86_ops.guest_apic_has_interrupt() hook.

In addition to dropping a superfluous hook, this fixes a bug where KVM
would incorrectly treat virtual interrupts _for L2_ as always enabled due
to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(),
treating IRQs as enabled if L2 is active and vmcs12 is configured to exit
on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake
event based on L1's IRQ blocking status.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:06 -07:00
Sean Christopherson
32f55e475c KVM: nVMX: Request immediate exit iff pending nested event needs injection
When requesting an immediate exit from L2 in order to inject a pending
event, do so only if the pending event actually requires manual injection,
i.e. if and only if KVM actually needs to regain control in order to
deliver the event.

Avoiding the "immediate exit" isn't simply an optimization, it's necessary
to make forward progress, as the "already expired" VMX preemption timer
trick that KVM uses to force a VM-Exit has higher priority than events
that aren't directly injected.

At present time, this is a glorified nop as all events processed by
vmx_has_nested_events() require injection, but that will not hold true in
the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI.
I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired
VMX preemption timer will trigger VM-Exit before the virtual interrupt is
delivered, and KVM will effectively hang the vCPU in an endless loop of
forced immediate VM-Exits (because the pending virtual interrupt never
goes away).

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-28 08:59:04 -07:00
Sean Christopherson
8fbb696a8f KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying
off the recently added kvm_vcpu.scheduled_out as appropriate.

Note, there is a very slight functional change, as PLE shrink updates will
now happen after blasting WBINVD, but that is quite uninteresting as the
two operations do not interact in any way.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 14:18:44 -07:00
Hou Wenlong
c7d4c5f019 KVM: x86: Drop unused check_apicv_inhibit_reasons() callback definition
The check_apicv_inhibit_reasons() callback implementation was dropped in
the commit b3f257a846 ("KVM: x86: Track required APICv inhibits with
variable, not callback"), but the definition removal was missed in the
final version patch (it was removed in the v4). Therefore, it should be
dropped, and the vmx_check_apicv_inhibit_reasons() function declaration
should also be removed.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10 09:20:52 -07:00
Sean Christopherson
0a7b73559b KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR
adds no value, negatively impacts guest performance, and is a maintenance
burden due to it's complexity and oddities.

KVM's approach to virtualizating MTRRs make no sense, at all.  KVM *only*
honors guest MTRR memtypes if EPT is enabled *and* the guest has a device
that may perform non-coherent DMA access.  From a hardware virtualization
perspective of guest MTRRs, there is _nothing_ special about EPT.  Legacy
shadowing paging doesn't magically account for guest MTRRs, nor does NPT.

Unwinding and deciphering KVM's murky history, the MTRR virtualization
code appears to be the result of misdiagnosed issues when EPT + VT-d with
passthrough devices was enabled years and years ago.  And importantly, the
underlying bugs that were fudged around by honoring guest MTRR memtypes
have since been fixed (though rather poorly in some cases).

The zapping GFNs logic in the MTRR virtualization code came from:

  commit efdfe536d8
  Author: Xiao Guangrong <guangrong.xiao@linux.intel.com>
  Date:   Wed May 13 14:42:27 2015 +0800

    KVM: MMU: fix MTRR update

    Currently, whenever guest MTRR registers are changed
    kvm_mmu_reset_context is called to switch to the new root shadow page
    table, however, it's useless since:
    1) the cache type is not cached into shadow page's attribute so that
       the original root shadow page will be reused

    2) the cache type is set on the last spte, that means we should sync
       the last sptes when MTRR is changed

    This patch fixs this issue by drop all the spte in the gfn range which
    is being updated by MTRR

which was a fix for:

  commit 0bed3b568b
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Oct 9 16:01:54 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Wed Dec 31 16:51:44 2008 +0200

      KVM: Improve MTRR structure

      As well as reset mmu context when set MTRR.

which was part of a "MTRR/PAT support for EPT" series that also added:

+       if (mt_mask) {
+               mt_mask = get_memory_type(vcpu, gfn) <<
+                         kvm_x86_ops->get_mt_mask_shift();
+               spte |= mt_mask;
+       }

where get_memory_type() was a truly gnarly helper to retrieve the guest
MTRR memtype for a given memtype.  And *very* subtly, at the time of that
change, KVM *always* set VMX_EPT_IGMT_BIT,

        kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
                VMX_EPT_WRITABLE_MASK |
                VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
                VMX_EPT_IGMT_BIT);

which came in via:

  commit 928d4bf747
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Thu Nov 6 14:55:45 2008 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Tue Nov 11 21:00:37 2008 +0200

      KVM: VMX: Set IGMT bit in EPT entry

      There is a potential issue that, when guest using pagetable without vmexit when
      EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
      which would be inconsistent with host side and would cause host MCE due to
      inconsistent cache attribute.

      The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
      memory type to protect host (notice that all memory mapped by KVM should be WB).

Note the CommitDates!  The AuthorDates strongly suggests Sheng Yang added
the whole "ignoreIGMT things as a bug fix for issues that were detected
during EPT + VT-d + passthrough enabling, but it was applied earlier
because it was a generic fix.

Jumping back to 0bed3b568b ("KVM: Improve MTRR structure"), the other
relevant code, or rather lack thereof, is the handling of *host* MMIO.
That fix came in a bit later, but given the author and timing, it's safe
to say it was all part of the same EPT+VT-d enabling mess.

  commit 2aaf69dcee
  Author:     Sheng Yang <sheng@linux.intel.com>
  AuthorDate: Wed Jan 21 16:52:16 2009 +0800
  Commit:     Avi Kivity <avi@redhat.com>
  CommitDate: Sun Feb 15 02:47:37 2009 +0200

    KVM: MMU: Map device MMIO as UC in EPT

    Software are not allow to access device MMIO using cacheable memory type, the
    patch limit MMIO region with UC and WC(guest can select WC using PAT and
    PCD/PWT).

In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization
was obviously never tested on NPT until much later, which lends further
credence to the theory/argument that this was all the result of
misdiagnosed issues.

Discussion from the EPT+MTRR enabling thread[*] more or less confirms that
Sheng Yang was trying to resolve issues with passthrough MMIO.

 * Sheng Yang
  : Do you mean host(qemu) would access this memory and if we set it to guest
  : MTRR, host access would be broken? We would cover this in our shadow MTRR
  : patch, for we encountered this in video ram when doing some experiment with
  : VGA assignment.

And in the same thread, there's also what appears to be confirmation of
Intel running into issues with Windows XP related to a guest device driver
mapping DMA with WC in the PAT.

 * Avi Kavity
  : Sheng Yang wrote:
  : > Yes... But it's easy to do with assigned devices' mmio, but what if guest
  : > specific some non-mmio memory's memory type? E.g. we have met one issue in
  : > Xen, that a assigned-device's XP driver specific one memory region as buffer,
  : > and modify the memory type then do DMA.
  : >
  : > Only map MMIO space can be first step, but I guess we can modify assigned
  : > memory region memory type follow guest's?
  : >
  :
  : With ept/npt, we can't, since the memory type is in the guest's
  : pagetable entries, and these are not accessible.

[*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com

So, for the most part, what likely happened is that 15 years ago, a few
engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially
"fixed" passthrough device MMIO by emulating *guest* MTRRs.  Except for
the below case, everything since then has been a result of those two
intertwined changes.

The one exception, which is actually yet more confirmation of all of the
above, is the revert of Paolo's attempt at "full" virtualization of guest
MTRRs:

  commit 606decd670
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:12:47 2015 +0200

    Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"

    This reverts commit fd717f1101.
    It was reported to cause Machine Check Exceptions (bug 104091).

...

  commit fd717f1101
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:38:13 2015 +0200

    KVM: x86: apply guest MTRR virtualization on host reserved pages

    Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
    However, the guest could prefer a different page type than UC for
    such pages. A good example is that pass-throughed VGA frame buffer is
    not always UC as host expected.

    This patch enables full use of virtual guest MTRRs.

I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC
in EPT" and got the same result: machine checks, likely due to the guest
MTRRs not being trustworthy/sane at all times.

Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that
too got reverted.  Unfortunately, it doesn't appear that anyone ever found
a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused
extremely slow boot times doesn't appear to have a definitive root cause.

  commit fc07e76ac7
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Thu Oct 1 13:20:22 2015 +0200

    Revert "KVM: SVM: use NPT page attributes"

    This reverts commit 3c2e7f7de3.
    Initializing the mapping from MTRR to PAT values was reported to
    fail nondeterministically, and it also caused extremely slow boot
    (due to caching getting disabled---bug 103321) with assigned devices.

...

  commit 3c2e7f7de3
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Tue Jul 7 14:32:17 2015 +0200

    KVM: SVM: use NPT page attributes

    Right now, NPT page attributes are not used, and the final page
    attribute depends solely on gPAT (which however is not synced
    correctly), the guest MTRRs and the guest page attributes.

    However, we can do better by mimicking what is done for VMX.
    In the absence of PCI passthrough, the guest PAT can be ignored
    and the page attributes can be just WB.  If passthrough is being
    used, instead, keep respecting the guest PAT, and emulate the guest
    MTRRs through the PAT field of the nested page tables.

    The only snag is that WP memory cannot be emulated correctly,
    because Linux's default PAT setting only includes the other types.

In short, honoring guest MTRRs for VMX was initially a workaround of
sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host
MMIO.  And while there *are* known cases where honoring guest MTRRs is
desirable, e.g. passthrough VGA frame buffers, the desired behavior in
that case is to get WC instead of UC, i.e. at this point it's for
performance, not correctness.

Furthermore, the complete absence of MTRR virtualization on NPT and
shadow paging proves that, while KVM theoretically can do better, it's
by no means necessary for correctnesss.

Lastly, since kernels mostly rely on firmware to do MTRR setup, and the
host typically provides guest firmware, honoring guest MTRRs is effectively
honoring *host* userspace memtypes, which is also backwards.  I.e. it
would be far better for host userspace to communicate its desired memtype
directly to KVM (or perhaps indirectly via VMAs in the host kernel), not
through guest MTRRs.

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 08:13:14 -07:00
Alejandro Jimenez
f992572120 KVM: x86: Keep consistent naming for APICv/AVIC inhibit reasons
Keep kvm_apicv_inhibit enum naming consistent with the current pattern by
renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to
APICV_INHIBIT_REASON_DISABLED.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:28 -07:00
Alejandro Jimenez
69148ccec6 KVM: x86: Print names of apicv inhibit reasons in traces
Use the tracing infrastructure helper __print_flags() for printing flag
bitfields, to enhance the trace output by displaying a string describing
each of the inhibit reasons set.

The kvm_apicv_inhibit_changed tracepoint currently shows the raw bitmap
value, requiring the user to consult the source file where the inhibit
reasons are defined to decode the trace output.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240506225321.3440701-2-alejandro.j.jimenez@oracle.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:27 -07:00