linux-yocto

mirror of git://git.yoctoproject.org/linux-yocto.git synced 2025-08-22 00:42:01 +02:00

Author	SHA1	Message	Date
Ashutosh Dixit	c0403e4cee	drm/xe/oa: Fix "Missing outer runtime PM protection" warning Fix the following drm_WARN: [953.586396] xe 0000:00:02.0: [drm] Missing outer runtime PM protection ... <4> [953.587090] ? xe_pm_runtime_get_noresume+0x8d/0xa0 [xe] <4> [953.587208] guc_exec_queue_add_msg+0x28/0x130 [xe] <4> [953.587319] guc_exec_queue_fini+0x3a/0x40 [xe] <4> [953.587425] xe_exec_queue_destroy+0xb3/0xf0 [xe] <4> [953.587515] xe_oa_release+0x9c/0xc0 [xe] Suggested-by: John Harrison <john.c.harrison@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Fixes: `e936f885f1` ("drm/xe/oa/uapi: Expose OA stream fd") Cc: stable@vger.kernel.org Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241109032003.3093811-1-ashutosh.dixit@intel.com (cherry picked from commit `b107c63d29`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-13 11:37:22 -08:00
Matthew Auld	be7eeaba2a	drm/xe: handle flat ccs during hibernation on igpu Starting from LNL, CCS has moved over to flat CCS model where there is now dedicated memory reserved for storing compression state. On platforms like LNL this reserved memory lives inside graphics stolen memory, which is not treated like normal RAM and is therefore skipped by the core kernel when creating the hibernation image. Currently if something was compressed and we enter hibernation all the corresponding CCS state is lost on such HW, resulting in corrupted memory. To fix this evict user buffers from TT -> SYSTEM to ensure we take a snapshot of the raw CCS state when entering hibernation, where upon resuming we can restore the raw CCS state back when next validating the buffer. This has been confirmed to fix display corruption on LNL when coming back from hibernation. Fixes: `cbdc52c11c` ("drm/xe/xe2: Support flat ccs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3409 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241112162827.116523-2-matthew.auld@intel.com (cherry picked from commit `c8b3c6db94`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-13 07:58:58 -08:00
Matthew Auld	46f1f4b0f3	drm/xe: improve hibernation on igpu The GGTT looks to be stored inside stolen memory on igpu which is not treated as normal RAM. The core kernel skips this memory range when creating the hibernation image, therefore when coming back from hibernation the GGTT programming is lost. This seems to cause issues with broken resume where GuC FW fails to load: [drm] ERROR GT0: load failed: status = 0x400000A0, time = 10ms, freq = 1250MHz (req 1300MHz), done = -1 [drm] ERROR GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01 [drm] ERROR GT0: firmware signature verification failed [drm] ERROR CRITICAL: Xe has declared device 0000:00:02.0 as wedged. Current GGTT users are kernel internal and tracked as pinned, so it should be possible to hook into the existing save/restore logic that we use for dgpu, where the actual evict is skipped but on restore we importantly restore the GGTT programming. This has been confirmed to fix hibernation on at least ADL and MTL, though likely all igpu platforms are affected. This also means we have a hole in our testing, where the existing s4 tests only really test the driver hooks, and don't go as far as actually rebooting and restoring from the hibernation image and in turn powering down RAM (and therefore losing the contents of stolen). v2 (Brost) - Remove extra newline and drop unnecessary parentheses. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3275 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241101170156.213490-2-matthew.auld@intel.com (cherry picked from commit `f2a6b8e396`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-13 07:58:31 -08:00
Matthew Brost	dd886a63d6	drm/xe: Restore system memory GGTT mappings GGTT mappings reside on the device and this state is lost during suspend / d3cold thus this state must be restored resume regardless if the BO is in system memory or VRAM. v2: - Unnecessary parentheses around bo->placements[0] (Checkpatch) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241031182257.2949579-1-matthew.brost@intel.com (cherry picked from commit `a19d1db9a3`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-13 07:58:22 -08:00
Matthew Brost	ce0d697023	drm/xe: Ensure all locks released in exec IOCTL In couple of places the wrong error handling goto was used to release locks. Fix these to ensure all locks dropped on exec IOCTL errors. Cc: Francois Dugast <francois.dugast@intel.com> Fixes: `d16ef1a18e` ("drm/xe/exec: Switch hw engine group execution mode upon job submission") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241106224944.30130-1-matthew.brost@intel.com (cherry picked from commit `9e7aacd840`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-13 07:49:14 -08:00
Lucas De Marchi	514447a121	drm/xe: Stop accumulating LRC timestamp on job_free The exec queue timestamp is only really useful when it's being queried through the fdinfo. There's no need to update it so often, on every job_free. Tracing a simple app like vkcube running shows an update rate of ~ 120Hz. In case of discrete, the BO is on vram, creating a lot of pcie transactions. The update on job_free() is used to cover a gap: if exec queue is created and destroyed rapidly, before a new query, the timestamp still needs to be accumulated and accounted for in the xef. Initial implementation in commit `6109f24f87` ("drm/xe: Add helper to accumulate exec queue runtime") couldn't do it on the exec_queue_fini since the xef could be gone at that point. However since commit `ce8c161cba` ("drm/xe: Add ref counting for xe_file") the xef is refcounted and the exec queue always holds a reference, making this safe now. Improve the fix in commit `2149ded630` ("drm/xe: Fix use after free when client stats are captured") by reducing the frequency in which the update is needed. Fixes: `2149ded630` ("drm/xe: Fix use after free when client stats are captured") Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104143815.2112272-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `83db047d94`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-05 15:40:13 -08:00
Michal Wajdeczko	a353c78459	drm/xe/pf: Fix potential GGTT allocation leak In unlikely event that we fail during sending the new VF GGTT configuration to the GuC, we will free only the GGTT node data struct but will miss to release the actual GGTT allocation. This will later lead to list corruption, GGTT space leak and finally risking crash when unloading the driver: [ ] ... [drm] GT0: PF: Failed to provision VF1 with 1073741824 (1.00 GiB) GGTT (-EIO) [ ] ... [drm] GT0: PF: VF1 provisioning remains at 0 (0 B) GGTT [ ] list_add corruption. next->prev should be prev (ffff88813cfcd628), but was 0000000000000000. (next=ffff88813cfe2028). [ ] RIP: 0010:__list_add_valid_or_report+0x6b/0xb0 [ ] Call Trace: [ ] drm_mm_insert_node_in_range+0x2c0/0x4e0 [ ] xe_ggtt_node_insert+0x46/0x70 [xe] [ ] pf_provision_vf_ggtt+0x7f5/0xa70 [xe] [ ] xe_gt_sriov_pf_config_set_ggtt+0x5e/0x770 [xe] [ ] ggtt_set+0x4b/0x70 [xe] [ ] simple_attr_write_xsigned.constprop.0.isra.0+0xb0/0x110 [ ] ... [drm] GT0: PF: Failed to provision VF1 with 1073741824 (1.00 GiB) GGTT (-ENOSPC) [ ] ... [drm] GT0: PF: VF1 provisioning remains at 0 (0 B) GGTT [ ] Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b7b: 0000 [#1] PREEMPT SMP NOPTI [ ] RIP: 0010:drm_mm_remove_node+0x1b7/0x390 [ ] Call Trace: [ ] <TASK> [ ] ? die_addr+0x2e/0x80 [ ] ? exc_general_protection+0x1a1/0x3e0 [ ] ? asm_exc_general_protection+0x22/0x30 [ ] ? drm_mm_remove_node+0x1b7/0x390 [ ] ggtt_node_remove+0xa5/0xf0 [xe] [ ] xe_ggtt_node_remove+0x35/0x70 [xe] [ ] xe_ttm_bo_destroy+0x123/0x220 [xe] [ ] intel_user_framebuffer_destroy+0x44/0x70 [xe] [ ] intel_plane_destroy_state+0x3b/0xc0 [xe] [ ] drm_atomic_state_default_clear+0x1cd/0x2f0 [ ] intel_atomic_state_clear+0x9/0x20 [xe] [ ] __drm_atomic_state_free+0x1d/0xb0 Fix that by using pf_release_ggtt() on the error path, which now works regardless if the node has GGTT allocation or not. Fixes: `34e804220f` ("drm/xe: Make xe_ggtt_node struct independent") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104144901.1903-1-michal.wajdeczko@intel.com (cherry picked from commit `43b1dd2b55`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-05 15:40:12 -08:00
Matthew Brost	64a2b6ed4b	drm/xe: Drop VM dma-resv lock on xe_sync_in_fence_get failure in exec IOCTL Upon failure all locks need to be dropped before returning to the user. Fixes: `58480c1c91` ("drm/xe: Skip VMAs pin when requesting signal to the last XE_EXEC") Cc: <stable@vger.kernel.org> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241105043524.4062774-3-matthew.brost@intel.com (cherry picked from commit `7d1a4258e6`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-05 15:40:12 -08:00
Matthew Brost	af797b831d	drm/xe: Fix possible exec queue leak in exec IOCTL In a couple of places after an exec queue is looked up the exec IOCTL returns on input errors without dropping the exec queue ref. Fix this ensuring the exec queue ref is dropped on input error. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: <stable@vger.kernel.org> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241105043524.4062774-2-matthew.brost@intel.com (cherry picked from commit `07064a200b`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-05 15:40:12 -08:00
Nirmoy Das	1491efb39a	drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout Flush the g2h worker explicitly if TLB timeout happens which is observed on LNL and that points to the recent scheduling issue with E-cores on LNL. This is similar to the recent fix: commit `e515272338` ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is E core scheduling fix. v2: Add platform check(Himal) v3: Remove gfx platform check as the issue related to cpu platform(John) Use the common WA macro(John) and print when the flush resolves timeout(Matt B) v4: Remove the resolves log and do the flush before taking pending_lock(Matt A) Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687 Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241029120117.449694-3-nirmoy.das@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `e1f6fa5566`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-04 08:12:30 -08:00
Nirmoy Das	7d1e2580ed	drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout Flush xe ordered_wq in case of ufence timeout which is observed on LNL and that points to recent scheduling issue with E-cores. This is similar to the recent fix: commit `e515272338` ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is a E-core scheduling fix for LNL. v2: Add platform check(Himal) s/__flush_workqueue/flush_workqueue(Jani) v3: Remove gfx platform check as the issue related to cpu platform(John) v4: Use the Common macro(John) and print when the flush resolves timeout(Matt B) Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754 Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241029120117.449694-2-nirmoy.das@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `38c4c8722b`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-04 08:12:30 -08:00
Nirmoy Das	55e8a3f37e	drm/xe: Move LNL scheduling WA to xe_device.h Move LNL scheduling WA to xe_device.h so this can be used in other places without needing keep the same comment about removal of this WA in the future. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed. Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241029120117.449694-1-nirmoy.das@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `cbe006a649`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-04 08:12:30 -08:00
Balasubramani Vivekanandan	4b468a92dd	drm/xe: Use the filelist from drm for ccs_mode change Drop the exclusive client count tracking and use the filelist from the drm to track the active clients. This also ensures the clients created internally by the driver won't block changing the ccs mode. Fixes: `ce8c161cba` ("drm/xe: Add ref counting for xe_file") Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008073628.377433-3-balasubramani.vivekanandan@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `1c35f1ed1f`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-04 08:12:30 -08:00
Balasubramani Vivekanandan	7fd3fa006f	drm/xe: Set mask bits for CCS_MODE register CCS_MODE register requires setting mask bits from Xe2+ platforms. Set the mask bits unconditionally, as those bits are unused for older platforms. Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Cc: stable@vger.kernel.org # v6.11+ Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008073628.377433-2-balasubramani.vivekanandan@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `23ea2c7572`) [ Fix conflict with mmio refactors ] Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-11-04 08:03:40 -08:00
Matthew Brost	fe05cee4d9	drm/xe: Don't short circuit TDR on jobs not started Short circuiting TDR on jobs not started is an optimization which is not required. On LNL we are facing an issue where jobs do not get scheduled by the GuC if it misses a GGTT page update. When this occurs let the TDR fire, toggle the scheduling which may get the job unstuck, and print a warning message. If the TDR fires twice on job that hasn't started, timeout the job. v2: - Add warning message (Paulo) - Add fixes tag (Paulo) - Timeout job which hasn't started after TDR firing twice v3: - Include local change v4: - Short circuit check_timeout on job not started - use warn level rather than notice (Paulo) Fixes: `7ddb9403dd` ("drm/xe: Sample ctx timestamp to determine if jobs have timed out") Cc: stable@vger.kernel.org Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241025214330.2010521-2-matthew.brost@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `35d25a4a00`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-31 07:03:14 -07:00
Matthew Brost	993ca0ecce	drm/xe: Add mmio read before GGTT invalidate On LNL without a mmio read before a GGTT invalidate the GuC can incorrectly read the GGTT scratch page upon next access leading to jobs not getting scheduled. A mmio read before a GGTT invalidate seems to fix this. Since a GGTT invalidate is not a hot code path, blindly do a mmio read before each GGTT invalidate. Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: stable@vger.kernel.org Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Reported-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3164 Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241023221200.1797832-1-matthew.brost@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `5a71019688`) [ Fix conflict with mmio vs gt argument ] Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-31 07:02:42 -07:00
Imre Deak	6a9d2e2988	drm/xe/display: Add missing HPD interrupt enabling during non-d3cold RPM resume Atm the display HPD interrupts that got disabled during runtime suspend, are re-enabled only if d3cold is enabled. Fix things by also re-enabling the interrupts if d3cold is disabled. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009194358.1321200-5-imre.deak@intel.com (cherry picked from commit `bbc4a30de0`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-29 07:12:56 -07:00
Imre Deak	dcb6c1d071	drm/xe/display: Separate the d3cold and non-d3cold runtime PM handling For clarity separate the d3cold and non-d3cold runtime PM handling. The only change in behavior is disabling polling later during runtime resume. This shouldn't make a difference, since the poll disabling is handled from a work, which could run at any point wrt. the runtime resume handler. The work will also require a runtime PM reference, syncing it with the resume handler. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009194358.1321200-4-imre.deak@intel.com (cherry picked from commit `a4de6beb83`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-29 07:12:56 -07:00
Maarten Lankhorst	25f2ff5383	drm/xe: Remove runtime argument from display s/r functions The previous change ensures that pm_suspend is only called when suspending or resuming. This ensures no further bugs like those in the previous commit. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240905150052.174895-3-maarten.lankhorst@linux.intel.com (cherry picked from commit `f90491d4b6`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-29 07:12:55 -07:00
Nirmoy Das	cdc21021f0	drm/xe: Don't restart parallel queues multiple times on GT reset In case of parallel submissions multiple GuC id will point to the same exec queue and on GT reset such exec queues will get restarted multiple times which is not desirable. v2: don't use exec_queue_enabled() which could race, do the same for xe_guc_submit_stop (Matt B) Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2295 Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241022103555.731557-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> (cherry picked from commit `c8b0acd6d8`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-24 12:42:52 -05:00
Nirmoy Das	9c1813b325	drm/xe/ufence: Prefetch ufence addr to catch bogus address access_ok() only checks for addr overflow so also try to read the addr to catch invalid addr sent from userspace. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1630 Cc: Francois Dugast <francois.dugast@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241016082304.66009-2-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> (cherry picked from commit `9408c45084`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-24 12:42:52 -05:00
Shuicheng Lin	69418db678	drm/xe: Handle unreliable MMIO reads during forcewake In some cases, when the driver attempts to read an MMIO register, the hardware may return 0xFFFFFFFF. The current force wake path code treats this as a valid response, as it only checks the BIT. However, 0xFFFFFFFF should be considered an invalid value, indicating a potential issue. To address this, we should add a log entry to highlight this condition and return failure. The force wake failure log level is changed from notice to err to match the failure return value. v2 (Matt Brost): - set ret value (-EIO) to kick the error to upper layers v3 (Rodrigo): - add commit message for the log level promotion from notice to err v4: - update reviewed info Suggested-by: Alex Zuo <alex.zuo@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Acked-by: Badal Nilawar <badal.nilawar@intel.com> Cc: Anshuman Gupta <anshuman.gupta@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241017221547.1564029-1-shuicheng.lin@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit `a9fbeabe72`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-24 12:42:52 -05:00
Badal Nilawar	22ef43c786	drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout In case if g2h worker doesn't get opportunity to within specified timeout delay then flush the g2h worker explicitly. v2: - Describe change in the comment and add TODO (Matt B/John H) - Add xe_gt_warn on fence done after G2H flush (John H) v3: - Updated the comment with root cause - Clean up xe_gt_warn message (John H) Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620 Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902 Signed-off-by: Badal Nilawar <badal.nilawar@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241017111410.2553784-2-badal.nilawar@intel.com (cherry picked from commit `e515272338`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-24 12:42:52 -05:00
Shuicheng Lin	c8fb95e7a5	drm/xe: Enlarge the invalidation timeout from 150 to 500 There are error messages like below that are occurring during stress testing: "[ 31.004009] xe 0000:03:00.0: [drm] ERROR GT0: Global invalidation timeout". Previously it was hitting this 3 out of 1000 executions of warm reboot. After raising it to 500, 1000 warm reboot executions passed and it didn't fail. Due to the way xe_mmio_wait32() is implemented, the timeout is able to expire early when the register matches the expected value due to the wait increments starting small. So, the larger timeout value should have no effect during normal use cases. v2 (Jonathan): - rework the commit message v3 (Lucas): - add conclusive message for the fail rate and test case v4: - add suggested-by Suggested-by: Jia Yao <jia.yao@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Tested-by: Zongyao Bai <zongyao.bai@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241015161207.1373401-1-shuicheng.lin@intel.com (cherry picked from commit `2eb460ab9f`) [ Fix conflict with gt->mmio ] Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-24 12:42:52 -05:00
Matthew Auld	6df106e93f	drm/xe/bmg: improve cache flushing behaviour The BSpec says that EN_L3_RW_CCS_CACHE_FLUSH must be toggled on for manual global invalidation to take effect and actually flush device cache, however this also turns on flushing for things like pipecontrol, which occurs between submissions for compute/render. This sounds like massive overkill for our needs, where we already have the manual flushing on the display side with the global invalidation. Some observations on BMG: 1. Disabling l2 caching for host writes and stubbing out the driver global invalidation but keeping EN_L3_RW_CCS_CACHE_FLUSH enabled, has no impact on wb-transient-vs-display IGT, which makes sense since the pipecontrol is now flushing the device cache after the render copy. Without EN_L3_RW_CCS_CACHE_FLUSH the test then fails, which is also expected since device cache is now dirty and display engine can't see the writes. 2. Disabling EN_L3_RW_CCS_CACHE_FLUSH, but keeping the driver global invalidation also has no impact on wb-transient-vs-display. This suggests that the global invalidation still works as expected and is flushing the device cache without EN_L3_RW_CCS_CACHE_FLUSH turned on. With that drop EN_L3_RW_CCS_CACHE_FLUSH. This helps some workloads since we no longer flush the device cache between submissions as part of pipecontrol. Edit: We now also have clarification from HW side that BSpec was indeed wrong here. v2: - Rebase and update commit message. BSpec: 71718 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Vitasta Wattal <vitasta.wattal@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241007074541.33937-2-matthew.auld@intel.com (cherry picked from commit `67ec9f87bd`) [ Fix conflict due to changed xe_mmio_write32() signature ] Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Matthew Auld	816b186ce2	drm/xe/xe_sync: initialise ufence.signalled We can incorrectly think that the fence has signalled, if we get a non-zero value here from the kmalloc, which is quite plausible. Just use kzalloc to prevent stuff like this. Fixes: `977e5b82e0` ("drm/xe: Expose user fence from xe_sync_entry") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: <stable@vger.kernel.org> # v6.10+ Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241011133633.388008-2-matthew.auld@intel.com (cherry picked from commit `26f69e88dc`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Nirmoy Das	4e8b5a1651	drm/xe/ufence: ufence can be signaled right after wait_woken do_comapre() can return success after a timedout wait_woken() which was treated as -ETIME. The loop calling wait_woken() sets correct err so there is no need to re-evaluate err. v2: Remove entire check that reevaluate err at the end(Matt) Fixes: `e670f0b4ef` ("drm/xe/uapi: Return correct error code for xe_wait_user_fence_ioctl") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1630 Cc: stable@vger.kernel.org # v6.8+ Cc: Bommu Krishnaiah <krishnaiah.bommu@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241011151029.4160630-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> (cherry picked from commit `ec7e6a1d52`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Matthew Brost	e7518276e9	drm/xe: Use bookkeep slots for external BO's in exec IOCTL Fix external BO's dma-resv usage in exec IOCTL using bookkeep slots rather than write slots. This leaves syncing to user space rather than the KMD blindly enforcing write semantics on every external BO. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Kenneth Graunke <kenneth.w.graunke@intel.com> Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> Reported-by: Simona Vetter <simona.vetter@ffwll.ch> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2673 Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> Link: https://patchwork.freedesktop.org/patch/msgid/20240911152622.903058-1-matthew.brost@intel.com (cherry picked from commit `b8b1163248`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Lucas De Marchi	477d665e9b	drm/xe/query: Increase timestamp width Starting with Xe2 the timestamp is a full 64 bit counter, contrary to the 36 bit that was available before. Although 36 should be sufficient for any reasonable delta calculation (for Xe2, of about 30min), it's surprising to userspace to get something truncated. Also if the timestamp being compared to is coming from the GPU and the application is not careful enough to apply the width there, a delta calculation would be wrong. Extend it to full 64-bits starting with Xe2. v2: Expand width=64 to media gt, as it's just a wrong tagging in the spec - empirical tests show it goes beyond 36 bits and match the engines for the main gt Bspec: 60411 Cc: Szymon Morek <szymon.morek@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241011035618.1057602-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `9d559cdcb2`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Matthew Brost	82926f52d7	drm/xe: Don't free job in TDR Freeing job in TDR is not safe as TDR can pass the run_job thread resulting in UAF. It is only safe for free job to naturally be called by the scheduler. Rather free job in TDR, add to pending list. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2811 Cc: Matthew Auld <matthew.auld@intel.com> Fixes: `e275d61c5f` ("drm/xe/guc: Handle timing out of signaled jobs gracefully") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003001657.3517883-3-matthew.brost@intel.com (cherry picked from commit `ea2f6a77d0`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Matthew Brost	ed931fb40e	drm/xe: Take job list lock in xe_sched_add_pending_job A fragile micro optimization in xe_sched_add_pending_job relied on both the GPU scheduler being stopped and fence signaling stopped to safely add a job to the pending list without the job list lock in xe_sched_add_pending_job. Remove this optimization and just take the job list lock. Fixes: `7ddb9403dd` ("drm/xe: Sample ctx timestamp to determine if jobs have timed out") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003001657.3517883-2-matthew.brost@intel.com (cherry picked from commit `90521df5fc`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Matthew Auld	761f916af4	drm/xe: fix unbalanced rpm put() with declare_wedged() Technically the or_reset() means we call the action on failure, however that would lead to unbalanced rpm put(). Move the get() earlier to fix this. It should be extremely unlikely to ever trigger this in practice. Fixes: `90936a0a4c` ("drm/xe: Don't suspend device upon wedge") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009084808.204432-4-matthew.auld@intel.com (cherry picked from commit `a187c1b0a8`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Matthew Auld	03a86c24ae	drm/xe: fix unbalanced rpm put() with fence_fini() Currently we can call fence_fini() twice if something goes wrong when sending the GuC CT for the tlb request, since we signal the fence and return an error, leading to the caller also calling fini() on the error path in the case of stack version of the flow, which leads to an extra rpm put() which might later cause device to enter suspend when it shouldn't. It looks like we can just drop the fini() call since the fence signaller side will already call this for us. There are known mysterious splats with device going to sleep even with an rpm ref, and this could be one candidate. v2 (Matt B): - Prefer warning if we detect double fini() Fixes: `f002702290` ("drm/xe: Hold a PM ref when GT TLB invalidations are inflight") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009084808.204432-3-matthew.auld@intel.com (cherry picked from commit `cfcbc0520d`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Aradhya Bhatia	4ceead37ca	drm/xe/xe2lpg: Extend Wa_15016589081 for xe2lpg Add workaround (wa) 15016589081 which applies to Xe2_v3_LPG_MD. Xe2_v3_LPG_MD is a Lunar Lake platform with GFX version: 20.04. This wa is type: permanent, and hence is applicable on all steppings. Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009065542.283151-1-aradhya.bhatia@intel.com (cherry picked from commit `8fb1da9f9b`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-16 09:00:22 -05:00
Dave Airlie	ac44ff7cec	Driver Changes: - Fix error checking with xa_store() (Matthe Auld) - Fix missing freq restore on GSC load error (Vinay) - Fix wedged_mode file permission (Matt Roper) - Fix use-after-free in ct communication (Matthew Auld) -----BEGIN PGP SIGNATURE----- iQJNBAABCAA3FiEE6rM8lpABPHM5FqyDm6KlpjDL6lMFAmcIlxMZHGx1Y2FzLmRl bWFyY2hpQGludGVsLmNvbQAKCRCboqWmMMvqUwwnD/0bJkBEUHey7Q9SK02ZAV9Q Cw/Im7wM1CQd8dWfpJqF0mwoEYhgQDsm9crReDCzfhl6QxdT4TdcGs5HfQwfTlMJ 7rZxmJ0VC5kAMvUq0uXiTvjC5KdmfySCCZ9in6vBBXmDzs9FKrycDjVjnutwaKY0 TeH/D265RWY0U6Iy/0vs+dvJfmcV79xCwuHN8U6b80jq4RA+2VeSo0gK8MaOB5Rt bIkzTgSHFY8I+x9eApW24eIoTMr9svr1uQNeSR0J7ueM/iXCRMvQAqyzFjuNxrgo htcsNeDUFlLQF9cbhVgmDgIDrHas31nB8E7bXrBZEoSeV3ZuWKO4V0H6WdVjd4kG Dyjl6l36VhalGhsKz2p+eSNbwTtFTB6JMRipA0HXG28PgiClkIWFl17tQn8gjGD7 eVk11+u165FzHAeGp0KTSrV5RiOG07XEE+sZsnHV+shFaK78Cn5+uzSTbsJNHQu2 IZdVrUGro0lScdJXrul7qcEzlLxXKmVa8qwM00jwGw+GgNJ5KZQwVx75cA3XM/hh Sdhr+zbsx3Qi/yr1vUNnSTGWRIleHRa0bZMOvpmCinv+tQsnqaUBSTTRSVOj/xc4 5NpD2Sum0Gz8drxeDvFdFpT4nvDHrtDvfrbigzHPABWO0CR14AYFS0UDXZKBNM9B R8GzPbS6tLwI9tcsZtYu3Q== =dYfY -----END PGP SIGNATURE----- Merge tag 'drm-xe-fixes-2024-10-10' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Fix error checking with xa_store() (Matthe Auld) - Fix missing freq restore on GSC load error (Vinay) - Fix wedged_mode file permission (Matt Roper) - Fix use-after-free in ct communication (Matthew Auld) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/jri65tmv3bjbhqhxs5smv45nazssxzhtwphojem4uufwtjuliy@gsdhlh6kzsdy	2024-10-11 13:54:10 +10:00
Dave Airlie	b634acb2a0	Short summary of fixes pull: fbdev-dma: - Only clean up deferred I/O if instanciated nouveau: - dmem: Fix privileged error in copy engine channel; Fix possible data leak in migrate_to_ram() - gsp: Fix coding style sched: - Avoid leaking lockdep map v3d: - Stop active perfmon before destroying it vc4: - Stop active perfmon before destroying it xe: - Drop GuC submit_wq pool -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEchf7rIzpz2NEoWjlaA3BHVMLeiMFAmcH2E4ACgkQaA3BHVML eiNz3ggAl2M7ysgSptizfEdQnwlitswqqVIYCgBjZCjWKwzm5WkEdGZ0w1N8KQqA 9pGuMFh3n6VOfOsVqAEUlNGonZwYMU5SppAoR7qJrOcqkBFJEE9woRnicWsFxYss HshaPLkNUCbmSvZXQN2O0c2HfcF1hKrc7zKBKUZWdl8aMROJGByvgpbwFARlNANO zb7QtBhVmCnbYIbyiCPwnEIgAG1QAxHQze/swR3pVL53XLteJ0hPWOugp7jc0d4+ FtmW0djjzIWD+6yfyYhA+VvDlCWucKoy81UqijcsX5ALXNWqNuGa9SROW+OomLgv HERn5RyCGySa4OACDr5/5o0rxhlSjw== =uxYD -----END PGP SIGNATURE----- Merge tag 'drm-misc-fixes-2024-10-10' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes Short summary of fixes pull: fbdev-dma: - Only clean up deferred I/O if instanciated nouveau: - dmem: Fix privileged error in copy engine channel; Fix possible data leak in migrate_to_ram() - gsp: Fix coding style sched: - Avoid leaking lockdep map v3d: - Stop active perfmon before destroying it vc4: - Stop active perfmon before destroying it xe: - Drop GuC submit_wq pool Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20241010133708.GA461532@localhost.localdomain	2024-10-11 09:03:30 +10:00
Matt Roper	1badf48281	drm/xe: Make wedged_mode debugfs writable The intent of this debugfs entry is to allow modification of wedging behavior, either from IGT tests or during manual debug; it should be marked as writable to properly reflect this. In practice this hasn't caused a problem because we always access wedged_mode as root, which ignores file permissions, but it's still misleading to have the entry incorrectly marked as RO. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Fixes: `6b8ef44cc0` ("drm/xe: Introduce the wedged_mode debugfs") Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241002230620.1249258-2-matthew.d.roper@intel.com (cherry picked from commit `93d9381342`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-08 18:06:24 -05:00
Vinay Belgaumkar	3fd76be868	drm/xe: Restore GT freq on GSC load error As part of a Wa_22019338487, ensure that GT freq is restored even when GSC reload is not successful. Fixes: `3b1592fb78` ("drm/xe/lnl: Apply Wa_22019338487") Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240925204918.1989574-1-vinay.belgaumkar@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit `491418a258`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-08 18:06:20 -05:00
Matthew Auld	42465603a3	drm/xe/guc_submit: fix xa_store() error checking Looks like we are meant to use xa_err() to extract the error encoded in the ptr. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-7-matthew.auld@intel.com (cherry picked from commit `f040327238`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-08 18:06:16 -05:00
Matthew Auld	e863781abe	drm/xe/ct: fix xa_store() error checking Looks like we are meant to use xa_err() to extract the error encoded in the ptr. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-6-matthew.auld@intel.com (cherry picked from commit `1aa4b78647`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-08 18:06:10 -05:00
Matthew Auld	db7f92af62	drm/xe/ct: prevent UAF in send_recv() Ensure we serialize with completion side to prevent UAF with fence going out of scope on the stack, since we have no clue if it will fire after the timeout before we can erase from the xa. Also we have some dependent loads and stores for which we need the correct ordering, and we lack the needed barriers. Fix this by grabbing the ct->lock after the wait, which is also held by the completion side. v2 (Badal): - Also print done after acquiring the lock and seeing timeout. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241001084346.98516-5-matthew.auld@intel.com (cherry picked from commit `52789ce35c`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-08 18:06:05 -05:00
Matthew Brost	a6f3b25273	drm/xe: Fix memory leak when aborting binds Make sure to call xe_pt_update_ops_fini in xe_pt_update_ops_abort to free any memory the bind allocated. Caught by kmemleak when running Vulkan CTS tests on LNL. The leak seems to happen only when there's some kind of failure happening, like the lack of memory. Example output: unreferenced object 0xffff9120bdf62000 (size 8192): comm "deqp-vk", pid 115008, jiffies 4310295728 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 1b 05 f9 28 01 00 00 40 ...........(...@ 00 00 00 00 00 00 00 00 1b 15 f9 28 01 00 00 40 ...........(...@ backtrace (crc 7a56be79): [<ffffffff86dd81f0>] __kmalloc_cache_noprof+0x310/0x3d0 [<ffffffffc08e8211>] xe_pt_new_shared.constprop.0+0x81/0xb0 [xe] [<ffffffffc08e8309>] xe_pt_insert_entry+0xb9/0x140 [xe] [<ffffffffc08eab6d>] xe_pt_stage_bind_entry+0x12d/0x5b0 [xe] [<ffffffffc08ecbca>] xe_pt_walk_range+0xea/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08e9eff>] xe_pt_stage_bind.constprop.0+0x25f/0x580 [xe] [<ffffffffc08eb21a>] bind_op_prepare+0xea/0x6e0 [xe] [<ffffffffc08ebab8>] xe_pt_update_ops_prepare+0x1c8/0x440 [xe] [<ffffffffc08ffbf3>] ops_execute+0x143/0x850 [xe] [<ffffffffc0900b64>] vm_bind_ioctl_ops_execute+0x244/0x800 [xe] [<ffffffffc0906467>] xe_vm_bind_ioctl+0x1877/0x2370 [xe] [<ffffffffc05e92b3>] drm_ioctl_kernel+0xb3/0x110 [drm] unreferenced object 0xffff9120bdf72000 (size 8192): comm "deqp-vk", pid 115008, jiffies 4310295728 hex dump (first 32 bytes): 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk backtrace (crc 23b2f0b5): [<ffffffff86dd81f0>] __kmalloc_cache_noprof+0x310/0x3d0 [<ffffffffc08e8211>] xe_pt_new_shared.constprop.0+0x81/0xb0 [xe] [<ffffffffc08e8453>] xe_pt_stage_unbind_post_descend+0xb3/0x150 [xe] [<ffffffffc08ecd26>] xe_pt_walk_range+0x246/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08ece31>] xe_pt_walk_shared+0xc1/0x110 [xe] [<ffffffffc08e7b2a>] xe_pt_stage_unbind+0x9a/0xd0 [xe] [<ffffffffc08e913d>] unbind_op_prepare+0xdd/0x270 [xe] [<ffffffffc08eb9f6>] xe_pt_update_ops_prepare+0x106/0x440 [xe] [<ffffffffc08ffbf3>] ops_execute+0x143/0x850 [xe] [<ffffffffc0900b64>] vm_bind_ioctl_ops_execute+0x244/0x800 [xe] [<ffffffffc0906467>] xe_vm_bind_ioctl+0x1877/0x2370 [xe] [<ffffffffc05e92b3>] drm_ioctl_kernel+0xb3/0x110 [drm] [<ffffffffc05e95a0>] drm_ioctl+0x280/0x4e0 [drm] Reported-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2877 Fixes: `a708f6501c` ("drm/xe: Update PT layer with better error handling") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240927232228.3255246-1-matthew.brost@intel.com (cherry picked from commit `63e0695597`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:24:54 -05:00
Zhanjun Dong	7257d9c9a3	drm/xe: Prevent null pointer access in xe_migrate_copy xe_migrate_copy designed to copy content of TTM resources. When source resource is null, it will trigger a NULL pointer dereference in xe_migrate_copy. To avoid this situation, update lacks source flag to true for this case, the flag will trigger xe_migrate_clear rather than xe_migrate_copy. Issue trace: <7> [317.089847] xe 0000:00:02.0: [drm:xe_migrate_copy [xe]] Pass 14, sizes: 4194304 & 4194304 <7> [317.089945] xe 0000:00:02.0: [drm:xe_migrate_copy [xe]] Pass 15, sizes: 4194304 & 4194304 <1> [317.128055] BUG: kernel NULL pointer dereference, address: 0000000000000010 <1> [317.128064] #PF: supervisor read access in kernel mode <1> [317.128066] #PF: error_code(0x0000) - not-present page <6> [317.128069] PGD 0 P4D 0 <4> [317.128071] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI <4> [317.128074] CPU: 1 UID: 0 PID: 1440 Comm: kunit_try_catch Tainted: G U N 6.11.0-rc7-xe #1 <4> [317.128078] Tainted: [U]=USER, [N]=TEST <4> [317.128080] Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3221.D80.2407291239 07/29/2024 <4> [317.128082] RIP: 0010:xe_migrate_copy+0x66/0x13e0 [xe] <4> [317.128158] Code: 00 00 48 89 8d e0 fe ff ff 48 8b 40 10 4c 89 85 c8 fe ff ff 44 88 8d bd fe ff ff 65 48 8b 3c 25 28 00 00 00 48 89 7d d0 31 ff <8b> 79 10 48 89 85 a0 fe ff ff 48 8b 00 48 89 b5 d8 fe ff ff 83 ff <4> [317.128162] RSP: 0018:ffffc9000167f9f0 EFLAGS: 00010246 <4> [317.128164] RAX: ffff8881120d8028 RBX: ffff88814d070428 RCX: 0000000000000000 <4> [317.128166] RDX: ffff88813cb99c00 RSI: 0000000004000000 RDI: 0000000000000000 <4> [317.128168] RBP: ffffc9000167fbb8 R08: ffff88814e7b1f08 R09: 0000000000000001 <4> [317.128170] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88814e7b1f08 <4> [317.128172] R13: ffff88814e7b1f08 R14: ffff88813cb99c00 R15: 0000000000000001 <4> [317.128174] FS: 0000000000000000(0000) GS:ffff88846f280000(0000) knlGS:0000000000000000 <4> [317.128176] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [317.128178] CR2: 0000000000000010 CR3: 000000011f676004 CR4: 0000000000770ef0 <4> [317.128180] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4> [317.128182] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 <4> [317.128184] PKRU: 55555554 <4> [317.128185] Call Trace: <4> [317.128187] <TASK> <4> [317.128189] ? show_regs+0x67/0x70 <4> [317.128194] ? __die_body+0x20/0x70 <4> [317.128196] ? __die+0x2b/0x40 <4> [317.128198] ? page_fault_oops+0x15f/0x4e0 <4> [317.128203] ? do_user_addr_fault+0x3fb/0x970 <4> [317.128205] ? lock_acquire+0xc7/0x2e0 <4> [317.128209] ? exc_page_fault+0x87/0x2b0 <4> [317.128212] ? asm_exc_page_fault+0x27/0x30 <4> [317.128216] ? xe_migrate_copy+0x66/0x13e0 [xe] <4> [317.128263] ? __lock_acquire+0xb9d/0x26f0 <4> [317.128265] ? __lock_acquire+0xb9d/0x26f0 <4> [317.128267] ? sg_free_append_table+0x20/0x80 <4> [317.128271] ? lock_acquire+0xc7/0x2e0 <4> [317.128273] ? mark_held_locks+0x4d/0x80 <4> [317.128275] ? trace_hardirqs_on+0x1e/0xd0 <4> [317.128278] ? _raw_spin_unlock_irqrestore+0x31/0x60 <4> [317.128281] ? __pm_runtime_resume+0x60/0xa0 <4> [317.128284] xe_bo_move+0x682/0xc50 [xe] <4> [317.128315] ? lock_is_held_type+0xaa/0x120 <4> [317.128318] ttm_bo_handle_move_mem+0xe5/0x1a0 [ttm] <4> [317.128324] ttm_bo_validate+0xd1/0x1a0 [ttm] <4> [317.128328] shrink_test_run_device+0x721/0xc10 [xe] <4> [317.128360] ? find_held_lock+0x31/0x90 <4> [317.128363] ? lock_release+0xd1/0x2a0 <4> [317.128365] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10 [kunit] <4> [317.128370] xe_bo_shrink_kunit+0x11/0x20 [xe] <4> [317.128397] kunit_try_run_case+0x6e/0x150 [kunit] <4> [317.128400] ? trace_hardirqs_on+0x1e/0xd0 <4> [317.128402] ? _raw_spin_unlock_irqrestore+0x31/0x60 <4> [317.128404] kunit_generic_run_threadfn_adapter+0x1e/0x40 [kunit] <4> [317.128407] kthread+0xf5/0x130 <4> [317.128410] ? __pfx_kthread+0x10/0x10 <4> [317.128412] ret_from_fork+0x39/0x60 <4> [317.128415] ? __pfx_kthread+0x10/0x10 <4> [317.128416] ret_from_fork_asm+0x1a/0x30 <4> [317.128420] </TASK> Fixes: `266c858852` ("drm/xe/xe2: Handle flat ccs move for igfx.") Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240927161308.862323-2-zhanjun.dong@intel.com (cherry picked from commit `59a1c9c7e1`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:24:22 -05:00
José Roberto de Souza	8135f1c09d	drm/xe/oa: Don't reset OAC_CONTEXT_ENABLE on OA stream close Mesa testing on Xe2+ revealed that when OA metrics are collected for an exec_queue, after the OA stream is closed, future batch buffers submitted on that exec_queue do not complete. Not resetting OAC_CONTEXT_ENABLE on OA stream close resolves these hangs and should not have any adverse effects. v2: Make the change that we don't reset the bit clearer (Ashutosh) Also make the same fix for OAC as OAR (Ashutosh) Bspec: 60314 Fixes: `2f4a730fcd` ("drm/xe/oa: Add OAR support") Fixes: `14e077f800` ("drm/xe/oa: Add OAC support") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2821 Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240924213713.3497992-1-ashutosh.dixit@intel.com (cherry picked from commit `0c8650b09a`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:23:18 -05:00
Matthew Auld	67801fa67b	drm/xe/queue: move xa_alloc to prevent UAF Evil user can guess the next id of the queue before the ioctl completes and then call queue destroy ioctl to trigger UAF since create ioctl is still referencing the same queue. Move the xa_alloc all the way to the end to prevent this. v2: - Rebase Fixes: `2149ded630` ("drm/xe: Fix use after free when client stats are captured") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240925071426.144015-4-matthew.auld@intel.com (cherry picked from commit `16536582dd`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:22:50 -05:00
Matthew Auld	74231870cf	drm/xe/vm: move xa_alloc to prevent UAF Evil user can guess the next id of the vm before the ioctl completes and then call vm destroy ioctl to trigger UAF since create ioctl is still referencing the same vm. Move the xa_alloc all the way to the end to prevent this. v2: - Rebase Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240925071426.144015-3-matthew.auld@intel.com (cherry picked from commit `dcfd397132`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:22:05 -05:00
Matthew Brost	9e3c85ddea	drm/xe: Clean up VM / exec queue file lock usage. Both the VM / exec queue file lock protect the lookup and reference to the object, nothing more. These locks are not intended anything else underneath them. XA have their own locking too, so no need to take the VM / exec queue file lock aside from when doing a lookup and reference get. Add some kernel doc to make this clear and cleanup a few typos too. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240921011712.2681510-1-matthew.brost@intel.com (cherry picked from commit `fe4f5d4b66`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:21:55 -05:00
Matthew Brost	1b30f87e08	drm/xe: Resume TDR after GT reset Not starting the TDR after GT reset on exec queue which have been restarted can lead to jobs being able to be run forever. Fix this by restarting the TDR. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240724235919.1917216-1-matthew.brost@intel.com (cherry picked from commit `8ec5a4e5ce`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:19:44 -05:00
Gustavo Sousa	6ef5a04221	drm/xe/xe2: Add performance tuning for L3 cache flushing A recommended performance tuning for LNL related to L3 cache flushing was recently introduced in Bspec. Implement it. Unlike the other existing tuning settings, we limit this one for LNL only, since there is no info about whether this would be applicable to other platforms yet. In the future we can come back and use IP version ranges if applicable. v2: - Fix reference to Bspec. (Sai Teja, Tejas) - Use correct register name for "Tuning: L3 RW flush all Cache". (Sai Teja) - Use SCRATCH3_LBCF (with the underscore) for better readability. v3: - Limit setting to LNL only. (Matt) Bspec: 72161 Cc: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240920211459.255181-5-gustavo.sousa@intel.com (cherry picked from commit `876253165f`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:13:55 -05:00
Gustavo Sousa	3bf90935aa	drm/xe/xe2: Extend performance tuning to media GT With exception of "Tuning: L3 cache - media", we are currently applying recommended performance tuning settings only for the primary GT. Let's also implement them for the media GT when applicable. According to our spec, media GT registers CCCHKNREG1 and L3SQCREG* exist only in Xe2_LPM and their offsets do not match their primary GT counterparts. Furthermore, the range where CCCHKNREG1 belongs is not listed as a multicast range on the media GT. As such, we need to have Xe2_LPM-specific definitions for those registers and apply the setting only for that specific IP. Both Xe2_HPM and Xe2_LPM contain STATELESS_COMPRESSION_CTRL and the offset on the media GT matches the one on the primary one. So we can simply have a copy of "Tuning: Stateless compression control" for the media GT. v2: - Fix implementation with respect to multicast vs non-multicast registers. (Matt) - Add missing XE2LPM_CCCHKNREG1 on second action of "Tuning: Compression Overfetch - media". v3: - STATELESS_COMPRESSION_CTRL on Xe2_HPM is also a multicast register, do not define a XE2HPM_STATELESS_COMPRESSION_CTRL register. (Tejas) Bspec: 72161 Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240920211459.255181-3-gustavo.sousa@intel.com (cherry picked from commit `e1f813947c`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-10-03 01:13:55 -05:00

1 2 3 4 5 ...

2384 Commits