mirror of
git://git.yoctoproject.org/linux-yocto.git
synced 2025-07-19 03:59:45 +02:00
04e317ba72
189 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
a5a9e00605 |
seccomp updates for v5.16-rc1
- set spec_store_bypass_disable & spectre_v2_user to prctl (Andrea Arcangeli) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmGAGAkWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJqOWD/4mMFp84IMa/VdCmD6PS+BhisyI i7+Hyfisg8AWpjgW4+JihU/6hfsDgs/hNNKbiIopcwc/12KV4M0QIQyF7vmceSwB uMsAX7pkobNUUisnrQVbw6boK4hrBvrV3STlVdRHvNlLeQLQIu4UN3+9UMj/qsmh 46ltdxR489oDDLXFgMkKq9auVP2t5t4fbyRmgBPLSKIXaOxIhWck3kUQwt/Rbr44 M87/Xr4iQ0w4ddiBFJz9GOHQ5Iz08ms4dBfO+e5FSl6I69Nt6q836el35c/6j4y8 r7C21WU088MSkjk75RCa3v2sq8db2CjLe+wBugq+yYC29qGgxtTiUZaoiNQCN5bL DIRfl1iU5Ge1wEKorpr3DR6DksmfJO4MNPdMo4CcVZT3Gkdi7udLHfrEI82xgdDl lh1UiJlRx4YNEcDbGBnxCzKGwauqHa2TgPNWulUPdH7OGhUL86FAV49L84uz9lCD C/+PKxDqc2XKjbgqMsbuyQ7hzB2KQK/ieEXzduoHxTxIr5vO/viENrbkUiSL8bsO 6msCVbCIjtFDvW4Ac16IOwGoflJ7vLAIuXIdAYCeN+JXqOVV+FG/MN447Y674FeH R84G6JCT82ULEXrKlwuoSSVJEwA5lzP4IwoWm/ujeUbzi1s+7m+7WRpuJe2jZm6c zPsCVkNPUrvp82L/wA== =NAsc -----END PGP SIGNATURE----- Merge tag 'seccomp-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: "These are x86-specific, but I carried these since they're also seccomp-specific. This flips the defaults for spec_store_bypass_disable and spectre_v2_user from "seccomp" to "prctl", as enough time has passed to allow system owners to have updated the defensive stances of their various workloads, and it's long overdue to unpessimize seccomp threads. Extensive rationale and details are in Andrea's main patch. Summary: - set spec_store_bypass_disable & spectre_v2_user to prctl (Andrea Arcangeli)" * tag 'seccomp-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: x86: deduplicate the spectre_v2_user documentation x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl |
||
![]() |
8cb1ae19bf |
x86/fpu updates:
- Cleanup of extable fixup handling to be more robust, which in turn allows to make the FPU exception fixups more robust as well. - Change the return code for signal frame related failures from explicit error codes to a boolean fail/success as that's all what the calling code evaluates. - A large refactoring of the FPU code to prepare for adding AMX support: - Distangle the public header maze and remove especially the misnomed kitchen sink internal.h which is despite it's name included all over the place. - Add a proper abstraction for the register buffer storage (struct fpstate) which allows to dynamically size the buffer at runtime by flipping the pointer to the buffer container from the default container which is embedded in task_struct::tread::fpu to a dynamically allocated container with a larger register buffer. - Convert the code over to the new fpstate mechanism. - Consolidate the KVM FPU handling by moving the FPU related code into the FPU core which removes the number of exports and avoids adding even more export when AMX has to be supported in KVM. This also removes duplicated code which was of course unnecessary different and incomplete in the KVM copy. - Simplify the KVM FPU buffer handling by utilizing the new fpstate container and just switching the buffer pointer from the user space buffer to the KVM guest buffer when entering vcpu_run() and flipping it back when leaving the function. This cuts the memory requirements of a vCPU for FPU buffers in half and avoids pointless memory copy operations. This also solves the so far unresolved problem of adding AMX support because the current FPU buffer handling of KVM inflicted a circular dependency between adding AMX support to the core and to KVM. With the new scheme of switching fpstate AMX support can be added to the core code without affecting KVM. - Replace various variables with proper data structures so the extra information required for adding dynamically enabled FPU features (AMX) can be added in one place - Add AMX (Advanved Matrix eXtensions) support (finally): AMX is a large XSTATE component which is going to be available with Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD) which allows to trap the (first) use of an AMX related instruction, which has two benefits: 1) It allows the kernel to control access to the feature 2) It allows the kernel to dynamically allocate the large register state buffer instead of burdening every task with the the extra 8K or larger state storage. It would have been great to gain this kind of control already with AVX512. The support comes with the following infrastructure components: 1) arch_prctl() to - read the supported features (equivalent to XGETBV(0)) - read the permitted features for a task - request permission for a dynamically enabled feature Permission is granted per process, inherited on fork() and cleared on exec(). The permission policy of the kernel is restricted to sigaltstack size validation, but the syscall obviously allows further restrictions via seccomp etc. 2) A stronger sigaltstack size validation for sys_sigaltstack(2) which takes granted permissions and the potentially resulting larger signal frame into account. This mechanism can also be used to enforce factual sigaltstack validation independent of dynamic features to help with finding potential victims of the 2K sigaltstack size constant which is broken since AVX512 support was added. 3) Exception handling for #NM traps to catch first use of a extended feature via a new cause MSR. If the exception was caused by the use of such a feature, the handler checks permission for that feature. If permission has not been granted, the handler sends a SIGILL like the #UD handler would do if the feature would have been disabled in XCR0. If permission has been granted, then a new fpstate which fits the larger buffer requirement is allocated. In the unlikely case that this allocation fails, the handler sends SIGSEGV to the task. That's not elegant, but unavoidable as the other discussed options of preallocation or full per task permissions come with their own set of horrors for kernel and/or userspace. So this is the lesser of the evils and SIGSEGV caused by unexpected memory allocation failures is not a fundamentally new concept either. When allocation succeeds, the fpstate properties are filled in to reflect the extended feature set and the resulting sizes, the fpu::fpstate pointer is updated accordingly and the trap is disarmed for this task permanently. 4) Enumeration and size calculations 5) Trap switching via MSR_XFD The XFD (eXtended Feature Disable) MSR is context switched with the same life time rules as the FPU register state itself. The mechanism is keyed off with a static key which is default disabled so !AMX equipped CPUs have zero overhead. On AMX enabled CPUs the overhead is limited by comparing the tasks XFD value with a per CPU shadow variable to avoid redundant MSR writes. In case of switching from a AMX using task to a non AMX using task or vice versa, the extra MSR write is obviously inevitable. All other places which need to be aware of the variable feature sets and resulting variable sizes are not affected at all because they retrieve the information (feature set, sizes) unconditonally from the fpstate properties. 6) Enable the new AMX states Note, this is relatively new code despite the fact that AMX support is in the works for more than a year now. The big refactoring of the FPU code, which allowed to do a proper integration has been started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which has not been caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up and eventual yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words... Many thanks to Chang Bae and Dave Hansen for working hard on this and also to the various test teams at Intel who reserved extra capacity to follow the rapid development of this closely which provides the confidence level required to offer this rather large update for inclusion into 5.16-rc1. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/NkITHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYodDkEADH4+/nN/QoSUHIuuha5Zptj3g2b16a /3TxT9fhwPen/kzMGsUk70s3iWJMA+I5dCfkSZexJ2hfhcRe9cBzZIa1HCawKwf3 YCISTsO/M+LpeORuZ+TpfFLJKnxNr1SEOl+EYffGhq0AkCjifb9Cnr0JZuoMUzGU jpfJZ2bj28ri5lG812DtzSMBM9E3SAwgJv+GNjmZbxZKb9mAfhbAMdBUXHirX7Ej jmx6koQjYOKwYIW8w1BrdC270lUKQUyJTbQgdRkN9Mh/HnKyFixQ18JqGlgaV2cT EtYePUfTEdaHdAhUINLIlEug1MfOslHU+HyGsdywnoChNB4GHPQuePC5Tz60VeFN RbQ9aKcBUu8r95rjlnKtAtBijNMA4bjGwllVxNwJ/ZoA9RPv1SbDZ07RX3qTaLVY YhVQl8+shD33/W24jUTJv1kMMexpHXIlv0gyfMryzpwI7uzzmGHRPAokJdbYKctC dyMPfdE90rxTiMUdL/1IQGhnh3awjbyfArzUhHyQ++HyUyzCFh0slsO0CD18vUy8 FofhCugGBhjuKw3XwLNQ+KsWURz5qHctSzBc3qMOSyqFHbAJCVRANkhsFvWJo2qL 75+Z7OTRebtsyOUZIdq26r4roSxHrps3dupWTtN70HWx2NhQG1nLEw986QYiQu1T hcKvDmehQLrUvg== =x3WL -----END PGP SIGNATURE----- Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: - Cleanup of extable fixup handling to be more robust, which in turn allows to make the FPU exception fixups more robust as well. - Change the return code for signal frame related failures from explicit error codes to a boolean fail/success as that's all what the calling code evaluates. - A large refactoring of the FPU code to prepare for adding AMX support: - Distangle the public header maze and remove especially the misnomed kitchen sink internal.h which is despite it's name included all over the place. - Add a proper abstraction for the register buffer storage (struct fpstate) which allows to dynamically size the buffer at runtime by flipping the pointer to the buffer container from the default container which is embedded in task_struct::tread::fpu to a dynamically allocated container with a larger register buffer. - Convert the code over to the new fpstate mechanism. - Consolidate the KVM FPU handling by moving the FPU related code into the FPU core which removes the number of exports and avoids adding even more export when AMX has to be supported in KVM. This also removes duplicated code which was of course unnecessary different and incomplete in the KVM copy. - Simplify the KVM FPU buffer handling by utilizing the new fpstate container and just switching the buffer pointer from the user space buffer to the KVM guest buffer when entering vcpu_run() and flipping it back when leaving the function. This cuts the memory requirements of a vCPU for FPU buffers in half and avoids pointless memory copy operations. This also solves the so far unresolved problem of adding AMX support because the current FPU buffer handling of KVM inflicted a circular dependency between adding AMX support to the core and to KVM. With the new scheme of switching fpstate AMX support can be added to the core code without affecting KVM. - Replace various variables with proper data structures so the extra information required for adding dynamically enabled FPU features (AMX) can be added in one place - Add AMX (Advanced Matrix eXtensions) support (finally): AMX is a large XSTATE component which is going to be available with Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD) which allows to trap the (first) use of an AMX related instruction, which has two benefits: 1) It allows the kernel to control access to the feature 2) It allows the kernel to dynamically allocate the large register state buffer instead of burdening every task with the the extra 8K or larger state storage. It would have been great to gain this kind of control already with AVX512. The support comes with the following infrastructure components: 1) arch_prctl() to - read the supported features (equivalent to XGETBV(0)) - read the permitted features for a task - request permission for a dynamically enabled feature Permission is granted per process, inherited on fork() and cleared on exec(). The permission policy of the kernel is restricted to sigaltstack size validation, but the syscall obviously allows further restrictions via seccomp etc. 2) A stronger sigaltstack size validation for sys_sigaltstack(2) which takes granted permissions and the potentially resulting larger signal frame into account. This mechanism can also be used to enforce factual sigaltstack validation independent of dynamic features to help with finding potential victims of the 2K sigaltstack size constant which is broken since AVX512 support was added. 3) Exception handling for #NM traps to catch first use of a extended feature via a new cause MSR. If the exception was caused by the use of such a feature, the handler checks permission for that feature. If permission has not been granted, the handler sends a SIGILL like the #UD handler would do if the feature would have been disabled in XCR0. If permission has been granted, then a new fpstate which fits the larger buffer requirement is allocated. In the unlikely case that this allocation fails, the handler sends SIGSEGV to the task. That's not elegant, but unavoidable as the other discussed options of preallocation or full per task permissions come with their own set of horrors for kernel and/or userspace. So this is the lesser of the evils and SIGSEGV caused by unexpected memory allocation failures is not a fundamentally new concept either. When allocation succeeds, the fpstate properties are filled in to reflect the extended feature set and the resulting sizes, the fpu::fpstate pointer is updated accordingly and the trap is disarmed for this task permanently. 4) Enumeration and size calculations 5) Trap switching via MSR_XFD The XFD (eXtended Feature Disable) MSR is context switched with the same life time rules as the FPU register state itself. The mechanism is keyed off with a static key which is default disabled so !AMX equipped CPUs have zero overhead. On AMX enabled CPUs the overhead is limited by comparing the tasks XFD value with a per CPU shadow variable to avoid redundant MSR writes. In case of switching from a AMX using task to a non AMX using task or vice versa, the extra MSR write is obviously inevitable. All other places which need to be aware of the variable feature sets and resulting variable sizes are not affected at all because they retrieve the information (feature set, sizes) unconditonally from the fpstate properties. 6) Enable the new AMX states Note, this is relatively new code despite the fact that AMX support is in the works for more than a year now. The big refactoring of the FPU code, which allowed to do a proper integration has been started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which has not been caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up and eventual yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words... Many thanks to Chang Bae and Dave Hansen for working hard on this and also to the various test teams at Intel who reserved extra capacity to follow the rapid development of this closely which provides the confidence level required to offer this rather large update for inclusion into 5.16-rc1 * tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits) Documentation/x86: Add documentation for using dynamic XSTATE features x86/fpu: Include vmalloc.h for vzalloc() selftests/x86/amx: Add context switch test selftests/x86/amx: Add test cases for AMX state management x86/fpu/amx: Enable the AMX feature in 64-bit mode x86/fpu: Add XFD handling for dynamic states x86/fpu: Calculate the default sizes independently x86/fpu/amx: Define AMX state components and have it used for boot-time checks x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers x86/fpu/xstate: Add fpstate_realloc()/free() x86/fpu/xstate: Add XFD #NM handler x86/fpu: Update XFD state where required x86/fpu: Add sanity checks for XFD x86/fpu: Add XFD state to fpstate x86/msr-index: Add MSRs for XFD x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit x86/fpu: Reset permission and fpstate on exec() x86/fpu: Prepare fpu_clone() for dynamically enabled features x86/fpu/signal: Prepare for variable sigframe length x86/signal: Use fpu::__state_user_size for sigalt stack validation ... |
||
![]() |
f8a66d608a |
x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
Currently Linux prevents usage of retpoline,amd on !AMD hardware, this is unfriendly and gets in the way of testing. Remove this restriction. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Borislav Petkov <bp@suse.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20211026120310.487348118@infradead.org |
||
![]() |
b56d2795b2 |
x86/fpu: Replace the includes of fpu/internal.h
Now that the file is empty, fixup all references with the proper includes and delete the former kitchen sink. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011540.001197214@linutronix.de |
||
![]() |
2f46993d83 |
x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl
Switch the kernel default of SSBD and STIBP to the ones with CONFIG_SECCOMP=n (i.e. spec_store_bypass_disable=prctl spectre_v2_user=prctl) even if CONFIG_SECCOMP=y. Several motivations listed below: - If SMT is enabled the seccomp jail can still attack the rest of the system even with spectre_v2_user=seccomp by using MDS-HT (except on XEON PHI where MDS can be tamed with SMT left enabled, but that's a special case). Setting STIBP become a very expensive window dressing after MDS-HT was discovered. - The seccomp jail cannot attack the kernel with spectre-v2-HT regardless (even if STIBP is not set), but with MDS-HT the seccomp jail can attack the kernel too. - With spec_store_bypass_disable=prctl the seccomp jail can attack the other userland (guest or host mode) using spectre-v2-HT, but the userland attack is already mitigated by both ASLR and pid namespaces for host userland and through virt isolation with libkrun or kata. (if something if somebody is worried about spectre-v2-HT it's best to mount proc with hidepid=2,gid=proc on workstations where not all apps may run under container runtimes, rather than slowing down all seccomp jails, but the best is to add pid namespaces to the seccomp jail). As opposed MDS-HT is not mitigated and the seccomp jail can still attack all other host and guest userland if SMT is enabled even with spec_store_bypass_disable=seccomp. - If full security is required then MDS-HT must also be mitigated with nosmt and then spectre_v2_user=prctl and spectre_v2_user=seccomp would become identical. - Setting spectre_v2_user=seccomp is overall lower priority than to setting javascript.options.wasm false in about:config to protect against remote wasm MDS-HT, instead of worrying about Spectre-v2-HT and STIBP which again is already statistically well mitigated by other means in userland and it's fully mitigated in kernel with retpolines (unlike the wasm assist call with MDS-HT). - SSBD is needed to prevent reading the JIT memory and the primary user being the OpenJDK. However the primary user of SSBD wouldn't be covered by spec_store_bypass_disable=seccomp because it doesn't use seccomp and the primary user also explicitly declined to set PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS despite it easily could. In fact it would need to set it only when the sandboxing mechanism is enabled for javaws applets, but it still declined it by declaring security within the same user address space as an untenable objective for their JIT, even in the sandboxing case where performance would be a lesser concern (for the record: I kind of disagree in not setting PR_SPEC_STORE_BYPASS in the sandbox case and I prefer to run javaws through a wrapper that sets PR_SPEC_STORE_BYPASS if I need). In turn it can be inferred that even if the primary user of SSBD would use seccomp, they would invoke it with SECCOMP_FILTER_FLAG_SPEC_ALLOW by now. - runc/crun already set SECCOMP_FILTER_FLAG_SPEC_ALLOW by default, k8s and podman have a default json seccomp allowlist that cannot be slowed down, so for the #1 seccomp user this change is already a noop. - systemd/sshd or other apps that use seccomp, if they really need STIBP or SSBD, they need to explicitly set the PR_SET_SPECULATION_CTRL by now. The stibp/ssbd seccomp blind catch-all approach was done probably initially with a wishful thinking objective to pretend to have a peace of mind that it could magically fix it all. That was wishful thinking before MDS-HT was discovered, but after MDS-HT has been discovered it become just window dressing. - For qemu "-sandbox" seccomp jail it wouldn't make sense to set STIBP or SSBD. SSBD doesn't help with KVM because there's no JIT (if it's needed with TCG it should be an opt-in with PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS and it shouldn't slowdown KVM for nothing). For qemu+KVM STIBP would be even more window dressing than it is for all other apps, because in the qemu+KVM case there's not only the MDS attack to worry about with SMT enabled. Even after disabling SMT, there's still a theoretical spectre-v2 attack possible within the same thread context from guest mode to host ring3 that the host kernel retpoline mitigation has no theoretical chance to mitigate. On some kernels a ibrs-always/ibrs-retpoline opt-in model is provided that will enabled IBRS in the qemu host ring3 userland which fixes this theoretical concern. Only after enabling IBRS in the host userland it would then make sense to proceed and worry about STIBP and an attack on the other host userland, but then again SMT would need to be disabled for full security anyway, so that would render STIBP again a noop. - last but not the least: the lack of "spec_store_bypass_disable=prctl spectre_v2_user=prctl" means the moment a guest boots and sshd/systemd runs, the guest kernel will write to SPEC_CTRL MSR which will make the guest vmexit forever slower, forcing KVM to issue a very slow rdmsr instruction at every vmexit. So the end result is that SPEC_CTRL MSR is only available in GCE. Most other public cloud providers don't expose SPEC_CTRL, which means that not only STIBP/SSBD isn't available, but IBPB isn't available either (which would cause no overhead to the guest or the hypervisor because it's write only and requires no reading during vmexit). So the current default already net loss in security (missing IBPB) which means most public cloud providers cannot achieve a fully secure guest with nosmt (and nosmt is enough to fully mitigate MDS-HT). It also means GCE and is unfairly penalized in performance because it provides the option to enable full security in the guest as an opt-in (i.e. nosmt and IBPB). So this change will allow all cloud providers to expose SPEC_CTRL without incurring into any hypervisor slowdown and at the same time it will remove the unfair penalization of GCE performance for doing the right thing and it'll allow to get full security with nosmt with IBPB being available (and STIBP becoming meaningless). Example to put things in prospective: the STIBP enabled in seccomp has never been about protecting apps using seccomp like sshd from an attack from a malicious userland, but to the contrary it has always been about protecting the system from an attack from sshd, after a successful remote network exploit against sshd. In fact initially it wasn't obvious STIBP would work both ways (STIBP was about preventing the task that runs with STIBP to be attacked with spectre-v2-HT, but accidentally in the STIBP case it also prevents the attack in the other direction). In the hypothetical case that sshd has been remotely exploited the last concern should be STIBP being set, because it'll be still possible to obtain info even from the kernel by using MDS if nosmt wasn't set (and if it was set, STIBP is a noop in the first place). As opposed kernel cannot leak anything with spectre-v2 HT because of retpolines and the userland is mitigated by ASLR already and ideally PID namespaces too. If something it'd be worth checking if sshd run the seccomp thread under pid namespaces too if available in the running kernel. SSBD also would be a noop for sshd, since sshd uses no JIT. If sshd prefers to keep doing the STIBP window dressing exercise, it still can even after this change of defaults by opting-in with PR_SPEC_INDIRECT_BRANCH. Ultimately setting SSBD and STIBP by default for all seccomp jails is a bad sweet spot and bad default with more cons than pros that end up reducing security in the public cloud (by giving an huge incentive to not expose SPEC_CTRL which would be needed to get full security with IBPB after setting nosmt in the guest) and by excessively hurting performance to more secure apps using seccomp that end up having to opt out with SECCOMP_FILTER_FLAG_SPEC_ALLOW. The following is the verified result of the new default with SMT enabled: (gdb) print spectre_v2_user_stibp $1 = SPECTRE_V2_USER_PRCTL (gdb) print spectre_v2_user_ibpb $2 = SPECTRE_V2_USER_PRCTL (gdb) print ssb_mode $3 = SPEC_STORE_BYPASS_PRCTL Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201104235054.5678-1-aarcange@redhat.com Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/lkml/AAA2EF2C-293D-4D5B-BFA6-FF655105CD84@redhat.com Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/lkml/c0722838-06f7-da6b-138f-e0f26362f16a@redhat.com |
||
![]() |
e893bb1bb4 |
x86, prctl: Hook L1D flushing in via prctl
Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and PR_SPEC_DISABLE_NOEXEC are not supported. Enabling L1D flush does not check if the task is running on an SMT enabled core, rather a check is done at runtime (at the time of flush), if the task runs on a SMT sibling then the task is sent a SIGBUS which is executed before the task returns to user space or to a guest. This is better than the other alternatives of: a. Ensuring strict affinity of the task (hard to enforce without further changes in the scheduler) b. Silently skipping flush for tasks that move to SMT enabled cores. Hook up the core prctl and implement the x86 specific parts which in turn makes it functional. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Balbir Singh <sblbir@amazon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210108121056.21940-5-sblbir@amazon.com |
||
![]() |
b5f06f64e2 |
x86/mm: Prepare for opt-in based L1D flush in switch_mm()
The goal of this is to allow tasks that want to protect sensitive information, against e.g. the recently found snoop assisted data sampling vulnerabilites, to flush their L1D on being switched out. This protects their data from being snooped or leaked via side channels after the task has context switched out. This could also be used to wipe L1D when an untrusted task is switched in, but that's not a really well defined scenario while the opt-in variant is clearly defined. The mechanism is default disabled and can be enabled on the kernel command line. Prepare for the actual prctl based opt-in: 1) Provide the necessary setup functionality similar to the other mitigations and enable the static branch when the command line option is set and the CPU provides support for hardware assisted L1D flushing. Software based L1D flush is not supported because it's CPU model specific and not really well defined. This does not come with a sysfs file like the other mitigations because it is not bound to any specific vulnerability. Support has to be queried via the prctl(2) interface. 2) Add TIF_SPEC_L1D_FLUSH next to L1D_SPEC_IB so the two bits can be mangled into the mm pointer in one go which allows to reuse the existing mechanism in switch_mm() for the conditional IBPB speculation barrier efficiently. 3) Add the L1D flush specific functionality which flushes L1D when the outgoing task opted in. Also check whether the incoming task has requested L1D flush and if so validate that it is not accidentaly running on an SMT sibling as this makes the whole excercise moot because SMT siblings share L1D which opens tons of other attack vectors. If that happens schedule task work which signals the incoming task on return to user/guest with SIGBUS as this is part of the paranoid L1D flush contract. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Balbir Singh <sblbir@amazon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210108121056.21940-1-sblbir@amazon.com |
||
![]() |
33fc379df7 |
x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
When spectre_v2_user={seccomp,prctl},ibpb is specified on the command line, IBPB is force-enabled and STIPB is conditionally-enabled (or not available). However, since |
||
![]() |
1978b3a53a |
x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP
On AMD CPUs which have the feature X86_FEATURE_AMD_STIBP_ALWAYS_ON,
STIBP is set to on and
spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED
At the same time, IBPB can be set to conditional.
However, this leads to the case where it's impossible to turn on IBPB
for a process because in the PR_SPEC_DISABLE case in ib_prctl_set() the
spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED
condition leads to a return before the task flag is set. Similarly,
ib_prctl_get() will return PR_SPEC_DISABLE even though IBPB is set to
conditional.
More generally, the following cases are possible:
1. STIBP = conditional && IBPB = on for spectre_v2_user=seccomp,ibpb
2. STIBP = on && IBPB = conditional for AMD CPUs with
X86_FEATURE_AMD_STIBP_ALWAYS_ON
The first case functions correctly today, but only because
spectre_v2_user_ibpb isn't updated to reflect the IBPB mode.
At a high level, this change does one thing. If either STIBP or IBPB
is set to conditional, allow the prctl to change the task flag.
Also, reflect that capability when querying the state. This isn't
perfect since it doesn't take into account if only STIBP or IBPB is
unconditionally on. But it allows the conditional feature to work as
expected, without affecting the unconditional one.
[ bp: Massage commit message and comment; space out statements for
better readability. ]
Fixes:
|
||
![]() |
f29dfa53cc |
x86/bugs/multihit: Fix mitigation reporting when VMX is not in use
On systems that have virtualization disabled or unsupported, sysfs
mitigation for X86_BUG_ITLB_MULTIHIT is reported incorrectly as:
$ cat /sys/devices/system/cpu/vulnerabilities/itlb_multihit
KVM: Vulnerable
System is not vulnerable to DoS attack from a rogue guest when
virtualization is disabled or unsupported in the hardware. Change the
mitigation reporting for these cases.
Fixes:
|
||
![]() |
4da9f33026 |
Support for FSGSBASE. Almost 5 years after the first RFC to support it,
this has been brought into a shape which is maintainable and actually works. This final version was done by Sasha Levin who took it up after Intel dropped the ball. Sasha discovered that the SGX (sic!) offerings out there ship rogue kernel modules enabling FSGSBASE behind the kernels back which opens an instantanious unpriviledged root hole. The FSGSBASE instructions provide a considerable speedup of the context switch path and enable user space to write GSBASE without kernel interaction. This enablement requires careful handling of the exception entries which go through the paranoid entry path as they cannot longer rely on the assumption that user GSBASE is positive (as enforced via prctl() on non FSGSBASE enabled systemn). All other entries (syscalls, interrupts and exceptions) can still just utilize SWAPGS unconditionally when the entry comes from user space. Converting these entries to use FSGSBASE has no benefit as SWAPGS is only marginally slower than WRGSBASE and locating and retrieving the kernel GSBASE value is not a free operation either. The real benefit of RD/WRGSBASE is the avoidance of the MSR reads and writes. The changes come with appropriate selftests and have held up in field testing against the (sanitized) Graphene-SGX driver. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8pGnoTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoTYJD/9873GkwvGcc/Vq/dJH1szGTgFftPyZ c/Y9gzx7EGBPLo25BS820L+ZlynzXHDxExKfCEaD10TZfe5XIc1vYNR0J74M2NmK IBgEDstJeW93ai+rHCFRXIevhpzU4GgGYJ1MeeOgbVMN3aGU1g6HfzMvtF0fPn8Y n6fsLZa43wgnoTdjwjjikpDTrzoZbaL1mbODBzBVPAaTbim7IKKTge6r/iCKrOjz Uixvm3g9lVzx52zidJ9kWa8esmbOM1j0EPe7/hy3qH9DFo87KxEzjHNH3T6gY5t6 NJhRAIfY+YyTHpPCUCshj6IkRudE6w/qjEAmKP9kWZxoJrvPCTWOhCzelwsFS9b9 gxEYfsnaKhsfNhB6fi0PtWlMzPINmEA7SuPza33u5WtQUK7s1iNlgHfvMbjstbwg MSETn4SG2/ZyzUrSC06lVwV8kh0RgM3cENc/jpFfIHD0vKGI3qfka/1RY94kcOCG AeJd0YRSU2RqL7lmxhHyG8tdb8eexns41IzbPCLXX2sF00eKNkVvMRYT2mKfKLFF q8v1x7yuwmODdXfFR6NdCkGm9IU7wtL6wuQ8Nhu9UraFmcXo6X6FLJC18FqcvSb9 jvcRP4XY/8pNjjf44JB8yWfah0xGQsaMIKQGP4yLv4j6Xk1xAQKH1MqcC7l1D2HN 5Z24GibFqSK/vA== =QaAN -----END PGP SIGNATURE----- Merge tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fsgsbase from Thomas Gleixner: "Support for FSGSBASE. Almost 5 years after the first RFC to support it, this has been brought into a shape which is maintainable and actually works. This final version was done by Sasha Levin who took it up after Intel dropped the ball. Sasha discovered that the SGX (sic!) offerings out there ship rogue kernel modules enabling FSGSBASE behind the kernels back which opens an instantanious unpriviledged root hole. The FSGSBASE instructions provide a considerable speedup of the context switch path and enable user space to write GSBASE without kernel interaction. This enablement requires careful handling of the exception entries which go through the paranoid entry path as they can no longer rely on the assumption that user GSBASE is positive (as enforced via prctl() on non FSGSBASE enabled systemn). All other entries (syscalls, interrupts and exceptions) can still just utilize SWAPGS unconditionally when the entry comes from user space. Converting these entries to use FSGSBASE has no benefit as SWAPGS is only marginally slower than WRGSBASE and locating and retrieving the kernel GSBASE value is not a free operation either. The real benefit of RD/WRGSBASE is the avoidance of the MSR reads and writes. The changes come with appropriate selftests and have held up in field testing against the (sanitized) Graphene-SGX driver" * tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) x86/fsgsbase: Fix Xen PV support x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbase selftests/x86/fsgsbase: Add a missing memory constraint selftests/x86/fsgsbase: Fix a comment in the ptrace_write_gsbase test selftests/x86: Add a syscall_arg_fault_64 test for negative GSBASE selftests/x86/fsgsbase: Test ptracer-induced GS base write with FSGSBASE selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base write Documentation/x86/64: Add documentation for GS/FS addressing mode x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2 x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit x86/entry/64: Introduce the FIND_PERCPU_BASE macro x86/entry/64: Switch CR3 before SWAPGS in paranoid entry x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation x86/process/64: Use FSGSBASE instructions on thread copy and ptrace x86/process/64: Use FSBSBASE in switch_to() if available x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE ... |
||
![]() |
978e1342c3 |
x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation
Before enabling FSGSBASE the kernel could safely assume that the content of GS base was a user address. Thus any speculative access as the result of a mispredicted branch controlling the execution of SWAPGS would be to a user address. So systems with speculation-proof SMAP did not need to add additional LFENCE instructions to mitigate. With FSGSBASE enabled a hostile user can set GS base to a kernel address. So they can make the kernel speculatively access data they wish to leak via a side channel. This means that SMAP provides no protection. Add FSGSBASE as an additional condition to enable the fence-based SWAPGS mitigation. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200528201402.1708239-9-sashal@kernel.org |
||
![]() |
a5ce9f2bb6 |
x86/speculation: Merge one test in spectre_v2_user_select_mitigation()
Merge the test whether the CPU supports STIBP into the test which determines whether STIBP is required. Thus try to simplify what is already an insane logic. Remove a superfluous newline in a comment, while at it. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Anthony Steinhauser <asteinhauser@google.com> Link: https://lkml.kernel.org/r/20200615065806.GB14668@zn.tnic |
||
![]() |
6a45a65888 |
A set of fixes and updates for x86:
- Unbreak paravirt VDSO clocks. While the VDSO code was moved into lib for sharing a subtle check for the validity of paravirt clocks got replaced. While the replacement works perfectly fine for bare metal as the update of the VDSO clock mode is synchronous, it fails for paravirt clocks because the hypervisor can invalidate them asynchronous. Bring it back as an optional function so it does not inflict this on architectures which are free of PV damage. - Fix the jiffies to jiffies64 mapping on 64bit so it does not trigger an ODR violation on newer compilers - Three fixes for the SSBD and *IB* speculation mitigation maze to ensure consistency, not disabling of some *IB* variants wrongly and to prevent a rogue cross process shutdown of SSBD. All marked for stable. - Add yet more CPU models to the splitlock detection capable list !@#%$! - Bring the pr_info() back which tells that TSC deadline timer is enabled. - Reboot quirk for MacBook6,1 -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7ie1oTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYofXrEACDD0mNBU2c4vQiR+n4d41PqW1p15DM /wG7dYqYt2RdR6qOAspmNL5ilUP+L+eoT/86U9y0g4j3FtTREqyy6mpWE4MQzqaQ eKWVoeYt7l9QbR1kP4eks1CN94OyVBUPo3P78UPruWMB11iyKjyrkEdsDmRSLOdr 6doqMFGHgowrQRwsLPFUt7b2lls6ssOSYgM/ChHi2Iga431ZuYYcRe2mNVsvqx3n 0N7QZlJ/LivXdCmdpe3viMBsDaomiXAloKUo+HqgrCLYFXefLtfOq09U7FpddYqH ztxbGW/7gFn2HEbmdeaiufux263MdHtnjvdPhQZKHuyQmZzzxDNBFgOILSrBJb5y qLYJGhMa0sEwMBM9MMItomNgZnOITQ3WGYAdSCg3mG3jK4EXzr6aQm/Qz5SI+Cte bQKB2dgR53Gw/1uc7F5qMGQ2NzeUbKycT0ZbF3vkUPVh1kdU3juIntsovv2lFeBe Rog/rZliT1xdHrGAHRbubb2/3v66CSodMoYz0eQtr241Oz0LGwnyFqLN3qcZVLDt OtxHQ3bbaxevDEetJXfSh3CfHKNYMToAcszmGDse3MJxC7DL5AA51OegMa/GYOX6 r5J99MUsEzZQoQYyXFf1MjwgxH4CQK1xBBUXYaVG65AcmhT21YbNWnCbxgf7hW+V hqaaUSig4V3NLw== =VlBk -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more x86 updates from Thomas Gleixner: "A set of fixes and updates for x86: - Unbreak paravirt VDSO clocks. While the VDSO code was moved into lib for sharing a subtle check for the validity of paravirt clocks got replaced. While the replacement works perfectly fine for bare metal as the update of the VDSO clock mode is synchronous, it fails for paravirt clocks because the hypervisor can invalidate them asynchronously. Bring it back as an optional function so it does not inflict this on architectures which are free of PV damage. - Fix the jiffies to jiffies64 mapping on 64bit so it does not trigger an ODR violation on newer compilers - Three fixes for the SSBD and *IB* speculation mitigation maze to ensure consistency, not disabling of some *IB* variants wrongly and to prevent a rogue cross process shutdown of SSBD. All marked for stable. - Add yet more CPU models to the splitlock detection capable list !@#%$! - Bring the pr_info() back which tells that TSC deadline timer is enabled. - Reboot quirk for MacBook6,1" * tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Unbreak paravirt VDSO clocks lib/vdso: Provide sanity check for cycles (again) clocksource: Remove obsolete ifdef x86_64: Fix jiffies ODR violation x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches. x86/speculation: Prevent rogue cross-process SSBD shutdown x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS. x86/cpu: Add Sapphire Rapids CPU model number x86/split_lock: Add Icelake microserver and Tigerlake CPU models x86/apic: Make TSC deadline timer detection message visible x86/reboot/quirks: Add MacBook6,1 reboot quirk |
||
![]() |
a5ad5742f6 |
Merge branch 'akpm' (patches from Andrew)
Merge even more updates from Andrew Morton: - a kernel-wide sweep of show_stack() - pagetable cleanups - abstract out accesses to mmap_sem - prep for mmap_sem scalability work - hch's user acess work Subsystems affected by this patch series: debug, mm/pagemap, mm/maccess, mm/documentation. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (93 commits) include/linux/cache.h: expand documentation over __read_mostly maccess: return -ERANGE when probe_kernel_read() fails x86: use non-set_fs based maccess routines maccess: allow architectures to provide kernel probing directly maccess: move user access routines together maccess: always use strict semantics for probe_kernel_read maccess: remove strncpy_from_unsafe tracing/kprobes: handle mixed kernel/userspace probes better bpf: rework the compat kernel probe handling bpf:bpf_seq_printf(): handle potentially unsafe format string better bpf: handle the compat string in bpf_trace_copy_string better bpf: factor out a bpf_trace_copy_string helper maccess: unify the probe kernel arch hooks maccess: remove probe_read_common and probe_write_common maccess: rename strnlen_unsafe_user to strnlen_user_nofault maccess: rename strncpy_from_unsafe_strict to strncpy_from_kernel_nofault maccess: rename strncpy_from_unsafe_user to strncpy_from_user_nofault maccess: update the top of file comment maccess: clarify kerneldoc comments maccess: remove duplicate kerneldoc comments ... |
||
![]() |
65fddcfca8 |
mm: reorder includes after introduction of linux/pgtable.h
The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include of the latter in the middle of asm includes. Fix this up with the aid of the below script and manual adjustments here and there. import sys import re if len(sys.argv) is not 3: print "USAGE: %s <file> <header>" % (sys.argv[0]) sys.exit(1) hdr_to_move="#include <linux/%s>" % sys.argv[2] moved = False in_hdrs = False with open(sys.argv[1], "r") as f: lines = f.readlines() for _line in lines: line = _line.rstrip(' ') if line == hdr_to_move: continue if line.startswith("#include <linux/"): in_hdrs = True elif not moved and in_hdrs: moved = True print hdr_to_move print line Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
ca5999fde0 |
mm: introduce include/linux/pgtable.h
The include/linux/pgtable.h is going to be the home of generic page table manipulation functions. Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and make the latter include asm/pgtable.h. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
4d8df8cbb9 |
x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
Currently, it is possible to enable indirect branch speculation even after
it was force-disabled using the PR_SPEC_FORCE_DISABLE option. Moreover, the
PR_GET_SPECULATION_CTRL command gives afterwards an incorrect result
(force-disabled when it is in fact enabled). This also is inconsistent
vs. STIBP and the documention which cleary states that
PR_SPEC_FORCE_DISABLE cannot be undone.
Fix this by actually enforcing force-disabled indirect branch
speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails
with -EPERM as described in the documentation.
Fixes:
|
||
![]() |
21998a3515 |
x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.
When STIBP is unavailable or enhanced IBRS is available, Linux
force-disables the IBPB mitigation of Spectre-BTB even when simultaneous
multithreading is disabled. While attempts to enable IBPB using
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with
EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...) equivalent)
which are used e.g. by Chromium or OpenSSH succeed with no errors but the
application remains silently vulnerable to cross-process Spectre v2 attacks
(classical BTB poisoning). At the same time the SYSFS reporting
(/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays that IBPB is
conditionally enabled when in fact it is unconditionally disabled.
STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP is
unavailable, it makes no sense to force-disable also IBPB, because IBPB
protects against cross-process Spectre-BTB attacks regardless of the SMT
state. At the same time since missing STIBP was only observed on AMD CPUs,
AMD does not recommend using STIBP, but recommends using IBPB, so disabling
IBPB because of missing STIBP goes directly against AMD's advice:
https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf
Similarly, enhanced IBRS is designed to protect cross-core BTB poisoning
and BTB-poisoning attacks from user space against kernel (and
BTB-poisoning attacks from guest against hypervisor), it is not designed
to prevent cross-process (or cross-VM) BTB poisoning between processes (or
VMs) running on the same core. Therefore, even with enhanced IBRS it is
necessary to flush the BTB during context-switches, so there is no reason
to force disable IBPB when enhanced IBRS is available.
Enable the prctl control of IBPB even when STIBP is unavailable or enhanced
IBRS is available.
Fixes:
|
||
![]() |
7e5b3c267d |
x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigation
SRBDS is an MDS-like speculative side channel that can leak bits from the random number generator (RNG) across cores and threads. New microcode serializes the processor access during the execution of RDRAND and RDSEED. This ensures that the shared buffer is overwritten before it is released for reuse. While it is present on all affected CPU models, the microcode mitigation is not needed on models that enumerate ARCH_CAPABILITIES[MDS_NO] in the cases where TSX is not supported or has been disabled with TSX_CTRL. The mitigation is activated by default on affected processors and it increases latency for RDRAND and RDSEED instructions. Among other effects this will reduce throughput from /dev/urandom. * Enable administrator to configure the mitigation off when desired using either mitigations=off or srbds=off. * Export vulnerability status via sysfs * Rename file-scoped macros to apply for non-whitelist table initializations. [ bp: Massage, - s/VULNBL_INTEL_STEPPING/VULNBL_INTEL_STEPPINGS/g, - do not read arch cap MSR a second time in tsx_fused_off() - just pass it in, - flip check in cpu_set_bug_bits() to save an indentation level, - reflow comments. jpoimboe: s/Mitigated/Mitigation/ in user-visible strings tglx: Dropped the fused off magic for now ] Signed-off-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> |
||
![]() |
72c2ce9867 |
x86/bugs: Move enum taa_mitigations to bugs.c
... because it is used only there. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: x86@kernel.org Link: https://lkml.kernel.org/r/20191112221823.19677-1-bp@alien8.de |
||
![]() |
cd5a2aa89e |
x86/speculation: Fix redundant MDS mitigation message
Since MDS and TAA mitigations are inter-related for processors that are affected by both vulnerabilities, the followiing confusing messages can be printed in the kernel log: MDS: Vulnerable MDS: Mitigation: Clear CPU buffers To avoid the first incorrect message, defer the printing of MDS mitigation after the TAA mitigation selection has been done. However, that has the side effect of printing TAA mitigation first before MDS mitigation. [ bp: Check box is affected/mitigations are disabled first before printing and massage. ] Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mark Gross <mgross@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191115161445.30809-3-longman@redhat.com |
||
![]() |
64870ed1b1 |
x86/speculation: Fix incorrect MDS/TAA mitigation status
For MDS vulnerable processors with TSX support, enabling either MDS or
TAA mitigations will enable the use of VERW to flush internal processor
buffers at the right code path. IOW, they are either both mitigated
or both not. However, if the command line options are inconsistent,
the vulnerabilites sysfs files may not report the mitigation status
correctly.
For example, with only the "mds=off" option:
vulnerabilities/mds:Vulnerable; SMT vulnerable
vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
The mds vulnerabilities file has wrong status in this case. Similarly,
the taa vulnerability file will be wrong with mds mitigation on, but
taa off.
Change taa_select_mitigation() to sync up the two mitigation status
and have them turned off if both "mds=off" and "tsx_async_abort=off"
are present.
Update documentation to emphasize the fact that both "mds=off" and
"tsx_async_abort=off" have to be specified together for processors that
are affected by both TAA and MDS to be effective.
[ bp: Massage and add kernel-parameters.txt change too. ]
Fixes:
|
||
![]() |
012206a822 |
x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
For new IBRS_ALL CPUs, the Enhanced IBRS check at the beginning of
cpu_bugs_smt_update() causes the function to return early, unintentionally
skipping the MDS and TAA logic.
This is not a problem for MDS, because there appears to be no overlap
between IBRS_ALL and MDS-affected CPUs. So the MDS mitigation would be
disabled and nothing would need to be done in this function anyway.
But for TAA, the TAA_MSG_SMT string will never get printed on Cascade
Lake and newer.
The check is superfluous anyway: when 'spectre_v2_enabled' is
SPECTRE_V2_IBRS_ENHANCED, 'spectre_v2_user' is always
SPECTRE_V2_USER_NONE, and so the 'spectre_v2_user' switch statement
handles it appropriately by doing nothing. So just remove the check.
Fixes:
|
||
![]() |
b8e8c8303f |
kvm: mmu: ITLB_MULTIHIT mitigation
With some Intel processors, putting the same virtual address in the TLB as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit and cause the processor to issue a machine check resulting in a CPU lockup. Unfortunately when EPT page tables use huge pages, it is possible for a malicious guest to cause this situation. Add a knob to mark huge pages as non-executable. When the nx_huge_pages parameter is enabled (and we are using EPT), all huge pages are marked as NX. If the guest attempts to execute in one of those pages, the page is broken down into 4K pages, which are then marked executable. This is not an issue for shadow paging (except nested EPT), because then the host is in control of TLB flushes and the problematic situation cannot happen. With nested EPT, again the nested guest can cause problems shadow and direct EPT is treated in the same way. [ tglx: Fixup default to auto and massage wording a bit ] Originally-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
![]() |
db4d30fbb7 |
x86/bugs: Add ITLB_MULTIHIT bug infrastructure
Some processors may incur a machine check error possibly resulting in an unrecoverable CPU lockup when an instruction fetch encounters a TLB multi-hit in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. The relevant erratum can be found here: https://bugzilla.kernel.org/show_bug.cgi?id=205195 There are other processors affected for which the erratum does not fully disclose the impact. This issue affects both bare-metal x86 page tables and EPT. It can be mitigated by either eliminating the use of large pages or by using careful TLB invalidations when changing the page size in the page tables. Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which are mitigated against this issue. Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
![]() |
6608b45ac5 |
x86/speculation/taa: Add sysfs reporting for TSX Async Abort
Add the sysfs reporting file for TSX Async Abort. It exposes the vulnerability and the mitigation state similar to the existing files for the other hardware vulnerabilities. Sysfs file path is: /sys/devices/system/cpu/vulnerabilities/tsx_async_abort Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Mark Gross <mgross@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> |
||
![]() |
1b42f01741 |
x86/speculation/taa: Add mitigation for TSX Async Abort
TSX Async Abort (TAA) is a side channel vulnerability to the internal buffers in some Intel processors similar to Microachitectural Data Sampling (MDS). In this case, certain loads may speculatively pass invalid data to dependent operations when an asynchronous abort condition is pending in a TSX transaction. This includes loads with no fault or assist condition. Such loads may speculatively expose stale data from the uarch data structures as in MDS. Scope of exposure is within the same-thread and cross-thread. This issue affects all current processors that support TSX, but do not have ARCH_CAP_TAA_NO (bit 8) set in MSR_IA32_ARCH_CAPABILITIES. On CPUs which have their IA32_ARCH_CAPABILITIES MSR bit MDS_NO=0, CPUID.MD_CLEAR=1 and the MDS mitigation is clearing the CPU buffers using VERW or L1D_FLUSH, there is no additional mitigation needed for TAA. On affected CPUs with MDS_NO=1 this issue can be mitigated by disabling the Transactional Synchronization Extensions (TSX) feature. A new MSR IA32_TSX_CTRL in future and current processors after a microcode update can be used to control the TSX feature. There are two bits in that MSR: * TSX_CTRL_RTM_DISABLE disables the TSX sub-feature Restricted Transactional Memory (RTM). * TSX_CTRL_CPUID_CLEAR clears the RTM enumeration in CPUID. The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled with updated microcode but still enumerated as present by CPUID(EAX=7).EBX{bit4}. The second mitigation approach is similar to MDS which is clearing the affected CPU buffers on return to user space and when entering a guest. Relevant microcode update is required for the mitigation to work. More details on this approach can be found here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html The TSX feature can be controlled by the "tsx" command line parameter. If it is force-enabled then "Clear CPU buffers" (MDS mitigation) is deployed. The effective mitigation state can be read from sysfs. [ bp: - massage + comments cleanup - s/TAA_MITIGATION_TSX_DISABLE/TAA_MITIGATION_TSX_DISABLED/g - Josh. - remove partial TAA mitigation in update_mds_branch_idle() - Josh. - s/tsx_async_abort_cmdline/tsx_async_abort_parse_cmdline/g ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> |
||
![]() |
c5f12fdb8b |
Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Thomas Gleixner: - Cleanup the apic IPI implementation by removing duplicated code and consolidating the functions into the APIC core. - Implement a safe variant of the IPI broadcast mode. Contrary to earlier attempts this uses the core tracking of which CPUs have been brought online at least once so that a broadcast does not end up in some dead end in BIOS/SMM code when the CPU is still waiting for init. Once all CPUs have been brought up once, IPI broadcasting is enabled. Before that regular one by one IPIs are issued. - Drop the paravirt CR8 related functions as they have no user anymore - Initialize the APIC TPR to block interrupt 16-31 as they are reserved for CPU exceptions and should never be raised by any well behaving device. - Emit a warning when vector space exhaustion breaks the admin set affinity of an interrupt. - Make sure to use the NMI fallback when shutdown via reboot vector IPI fails. The original code had conditions which prevent the code path to be reached. - Annotate various APIC config variables as RO after init. [ The ipi broadcase change came in earlier through the cpu hotplug branch, but I left the explanation in the commit message since it was shared between the two different branches - Linus ] * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits) x86/apic/vector: Warn when vector space exhaustion breaks affinity x86/apic: Annotate global config variables as "read-only after init" x86/apic/x2apic: Implement IPI shorthands support x86/apic/flat64: Remove the IPI shorthand decision logic x86/apic: Share common IPI helpers x86/apic: Remove the shorthand decision logic x86/smp: Enhance native_send_call_func_ipi() x86/smp: Move smp_function_call implementations into IPI code x86/apic: Provide and use helper for send_IPI_allbutself() x86/apic: Add static key to Control IPI shorthands x86/apic: Move no_ipi_broadcast() out of 32bit x86/apic: Add NMI_VECTOR wait to IPI shorthand x86/apic: Remove dest argument from __default_send_IPI_shortcut() x86/hotplug: Silence APIC and NMI when CPU is dead x86/cpu: Move arch_smt_update() to a neutral place x86/apic/uv: Make x2apic_extra_bits static x86/apic: Consolidate the apic local headers x86/apic: Move apic_flat_64 header into apic directory x86/apic: Move ipi header into apic directory x86/apic: Cleanup the include maze ... |
||
![]() |
5e741407ea |
x86/intel: Aggregate big core graphics naming
Currently big core clients with extra graphics on have: - _G - _GT3E Make it uniformly: _G for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_GT3E"` do sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_GT3E/\1_G/g' ${i} done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20190827195122.622802314@infradead.org |
||
![]() |
af239c44e3 |
x86/intel: Aggregate big core mobile naming
Currently big core mobile chips have either: - _L - _ULT - _MOBILE Make it uniformly: _L. for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(MOBILE\|ULT\)"` do sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_\(MOBILE\|ULT\)/\1_L/g' ${i} done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190827195122.568978530@infradead.org |
||
![]() |
c66f78a6de |
x86/intel: Aggregate big core client naming
Currently the big core client models either have: - no OPTDIFF - _CORE - _DESKTOP Make it uniformly: 'no OPTDIFF'. for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(CORE\|DESKTOP\)"` do sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_\(CORE\|DESKTOP\)/\1/g' ${i} done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190827195122.513945586@infradead.org |
||
![]() |
7a30bdd99f |
Merge branch master from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Pick up the spectre documentation so the Grand Schemozzle can be added. |
||
![]() |
f36cf386e3 |
x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS
Intel provided the following information: On all current Atom processors, instructions that use a segment register value (e.g. a load or store) will not speculatively execute before the last writer of that segment retires. Thus they will not use a speculatively written segment value. That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS entry paths can be excluded from the extra LFENCE if PTI is disabled. Create a separate bug flag for the through SWAPGS speculation and mark all out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order ATOMs are excluded from the whole mitigation mess anyway. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> |
||
![]() |
9c92374b63 |
x86/cpu: Move arch_smt_update() to a neutral place
arch_smt_update() will be used to control IPI/NMI broadcasting via the shorthand mechanism. Keeping it in the bugs file and calling the apic function from there is possible, but not really intuitive. Move it to a neutral place and invoke the bugs function from there. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190722105219.910317273@linutronix.de |
||
![]() |
517c3ba009 |
x86/speculation/mds: Apply more accurate check on hypervisor platform
X86_HYPER_NATIVE isn't accurate for checking if running on native platform,
e.g. CONFIG_HYPERVISOR_GUEST isn't set or "nopv" is enabled.
Checking the CPU feature bit X86_FEATURE_HYPERVISOR to determine if it's
running on native platform is more accurate.
This still doesn't cover the platforms on which X86_FEATURE_HYPERVISOR is
unsupported, e.g. VMware, but there is nothing which can be done about this
scenario.
Fixes:
|
||
![]() |
a205982598 |
x86/speculation: Enable Spectre v1 swapgs mitigations
The previous commit added macro calls in the entry code which mitigate the Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are enabled. Enable those features where applicable. The mitigations may be disabled with "nospectre_v1" or "mitigations=off". There are different features which can affect the risk of attack: - When FSGSBASE is enabled, unprivileged users are able to place any value in GS, using the wrgsbase instruction. This means they can write a GS value which points to any value in kernel space, which can be useful with the following gadget in an interrupt/exception/NMI handler: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg1 // dependent load or store based on the value of %reg // for example: mov %(reg1), %reg2 If an interrupt is coming from user space, and the entry code speculatively skips the swapgs (due to user branch mistraining), it may speculatively execute the GS-based load and a subsequent dependent load or store, exposing the kernel data to an L1 side channel leak. Note that, on Intel, a similar attack exists in the above gadget when coming from kernel space, if the swapgs gets speculatively executed to switch back to the user GS. On AMD, this variant isn't possible because swapgs is serializing with respect to future GS-based accesses. NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case doesn't exist quite yet. - When FSGSBASE is disabled, the issue is mitigated somewhat because unprivileged users must use prctl(ARCH_SET_GS) to set GS, which restricts GS values to user space addresses only. That means the gadget would need an additional step, since the target kernel address needs to be read from user space first. Something like: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg1 mov (%reg1), %reg2 // dependent load or store based on the value of %reg2 // for example: mov %(reg2), %reg3 It's difficult to audit for this gadget in all the handlers, so while there are no known instances of it, it's entirely possible that it exists somewhere (or could be introduced in the future). Without tooling to analyze all such code paths, consider it vulnerable. Effects of SMAP on the !FSGSBASE case: - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not susceptible to Meltdown), the kernel is prevented from speculatively reading user space memory, even L1 cached values. This effectively disables the !FSGSBASE attack vector. - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP still prevents the kernel from speculatively reading user space memory. But it does *not* prevent the kernel from reading the user value from L1, if it has already been cached. This is probably only a small hurdle for an attacker to overcome. Thanks to Dave Hansen for contributing the speculative_smap() function. Thanks to Andrew Cooper for providing the inside scoop on whether swapgs is serializing on AMD. [ tglx: Fixed the USER fence decision and polished the comment as suggested by Dave Hansen ] Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> |
||
![]() |
c1f7fec1eb |
x86/speculation: Allow guests to use SSBD even if host does not
The bits set in x86_spec_ctrl_mask are used to calculate the guest's value
of SPEC_CTRL that is written to the MSR before VMENTRY, and control which
mitigations the guest can enable. In the case of SSBD, unless the host has
enabled SSBD always on mode (by passing "spec_store_bypass_disable=on" in
the kernel parameters), the SSBD bit is not set in the mask and the guest
can not properly enable the SSBD always on mitigation mode.
This has been confirmed by running the SSBD PoC on a guest using the SSBD
always on mitigation mode (booted with kernel parameter
"spec_store_bypass_disable=on"), and verifying that the guest is vulnerable
unless the host is also using SSBD always on mode. In addition, the guest
OS incorrectly reports the SSB vulnerability as mitigated.
Always set the SSBD bit in x86_spec_ctrl_mask when the host CPU supports
it, allowing the guest to use SSBD whether or not the host has chosen to
enable the mitigation in any of its modes.
Fixes:
|
||
![]() |
fa4bff1650 |
Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MDS mitigations from Thomas Gleixner: "Microarchitectural Data Sampling (MDS) is a hardware vulnerability which allows unprivileged speculative access to data which is available in various CPU internal buffers. This new set of misfeatures has the following CVEs assigned: CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory MDS attacks target microarchitectural buffers which speculatively forward data under certain conditions. Disclosure gadgets can expose this data via cache side channels. Contrary to other speculation based vulnerabilities the MDS vulnerability does not allow the attacker to control the memory target address. As a consequence the attacks are purely sampling based, but as demonstrated with the TLBleed attack samples can be postprocessed successfully. The mitigation is to flush the microarchitectural buffers on return to user space and before entering a VM. It's bolted on the VERW instruction and requires a microcode update. As some of the attacks exploit data structures shared between hyperthreads, full protection requires to disable hyperthreading. The kernel does not do that by default to avoid breaking unattended updates. The mitigation set comes with documentation for administrators and a deeper technical view" * 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/speculation/mds: Fix documentation typo Documentation: Correct the possible MDS sysfs values x86/mds: Add MDSUM variant to the MDS documentation x86/speculation/mds: Add 'mitigations=' support for MDS x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off x86/speculation/mds: Fix comment x86/speculation/mds: Add SMT warning message x86/speculation: Move arch_smt_update() call to after mitigation decisions x86/speculation/mds: Add mds=full,nosmt cmdline option Documentation: Add MDS vulnerability documentation Documentation: Move L1TF to separate directory x86/speculation/mds: Add mitigation mode VMWERV x86/speculation/mds: Add sysfs reporting for MDS x86/speculation/mds: Add mitigation control for MDS x86/speculation/mds: Conditionally clear CPU buffers on idle entry x86/kvm/vmx: Add MDS protection when L1D Flush is not active x86/speculation/mds: Clear CPU buffers on exit to user x86/speculation/mds: Add mds_clear_cpu_buffers() x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests x86/speculation/mds: Add BUG_MSBDS_ONLY ... |
||
![]() |
0a499fc5c3 |
Merge branch 'core-speculation-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull speculation mitigation update from Ingo Molnar: "This adds the "mitigations=" bootline option, which offers a cross-arch set of options that will work on x86, PowerPC and s390 that will map to the arch specific option internally" * 'core-speculation-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: s390/speculation: Support 'mitigations=' cmdline option powerpc/speculation: Support 'mitigations=' cmdline option x86/speculation: Support 'mitigations=' cmdline option cpu/speculation: Add 'mitigations=' cmdline option |
||
![]() |
1de7edbb59 |
x86/cpu/bugs: Use __initconst for 'const' init data
Some of the recently added const tables use __initdata which causes section
attribute conflicts.
Use __initconst instead.
Fixes:
|
||
![]() |
5c14068f87 |
x86/speculation/mds: Add 'mitigations=' support for MDS
Add MDS to the new 'mitigations=' cmdline option. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
![]() |
e9fee6fe08 |
Merge branch 'core/speculation' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
Pull in the command line updates from the tip tree so the MDS parts can be added. |
||
![]() |
d68be4c4d3 |
x86/speculation: Support 'mitigations=' cmdline option
Configure x86 runtime CPU speculation bug mitigations in accordance with the 'mitigations=' cmdline option. This affects Meltdown, Spectre v2, Speculative Store Bypass, and L1TF. The default behavior is unchanged. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/6616d0ae169308516cfdf5216bedd169f8a8291b.1555085500.git.jpoimboe@redhat.com |
||
![]() |
e2c3c94788 |
x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off
This code is only for CPUs which are affected by MSBDS, but are *not* affected by the other two MDS issues. For such CPUs, enabling the mds_idle_clear mitigation is enough to mitigate SMT. However if user boots with 'mds=off' and still has SMT enabled, we should not report that SMT is mitigated: $cat /sys//devices/system/cpu/vulnerabilities/mds Vulnerable; SMT mitigated But rather: Vulnerable; SMT vulnerable Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20190412215118.294906495@localhost.localdomain |
||
![]() |
cae5ec3426 |
x86/speculation/mds: Fix comment
s/L1TF/MDS/ Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> |
||
![]() |
39226ef02b |
x86/speculation/mds: Add SMT warning message
MDS is vulnerable with SMT. Make that clear with a one-time printk whenever SMT first gets enabled. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Jiri Kosina <jkosina@suse.cz> |
||
![]() |
7c3658b201 |
x86/speculation: Move arch_smt_update() call to after mitigation decisions
arch_smt_update() now has a dependency on both Spectre v2 and MDS mitigations. Move its initial call to after all the mitigation decisions have been made. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Jiri Kosina <jkosina@suse.cz> |
||
![]() |
d71eb0ce10 |
x86/speculation/mds: Add mds=full,nosmt cmdline option
Add the mds=full,nosmt cmdline option. This is like mds=full, but with SMT disabled if the CPU is vulnerable. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Jiri Kosina <jkosina@suse.cz> |
||
![]() |
65fd4cb65b |
Documentation: Move L1TF to separate directory
Move L!TF to a separate directory so the MDS stuff can be added at the side. Otherwise the all hardware vulnerabilites have their own top level entry. Should have done that right away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jon Masters <jcm@redhat.com> |