Commit Graph

511 Commits

Author SHA1 Message Date
Kyle Huey
4a01398066 perf: Move perf_event_fasync() to perf_event.h
This will allow it to be called from perf_output_wakeup().

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240413141618.4160-2-khuey@kylehuey.com
2024-04-14 22:26:32 +02:00
Kyle Huey
76f6d58845 perf/bpf: Remove unneeded uses_default_overflow_handler()
Now that struct perf_event's orig_overflow_handler is gone, there's no need
for the functions and macros to support looking past overflow_handler to
orig_overflow_handler.

This patch is solely a refactoring and results in no behavior change.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240412015019.7060-6-khuey@kylehuey.com
2024-04-12 11:49:50 +02:00
Kyle Huey
f11f10bfa1 perf/bpf: Call BPF handler directly, not through overflow machinery
To ultimately allow BPF programs attached to perf events to completely
suppress all of the effects of a perf event overflow (rather than just the
sample output, as they do today), call bpf_overflow_handler() from
__perf_event_overflow() directly rather than modifying struct perf_event's
overflow_handler. Return the BPF program's return value from
bpf_overflow_handler() so that __perf_event_overflow() knows how to
proceed. Remove the now unnecessary orig_overflow_handler from struct
perf_event.

This patch is solely a refactoring and results in no behavior change.

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240412015019.7060-5-khuey@kylehuey.com
2024-04-12 11:49:49 +02:00
Kyle Huey
14e40a9578 perf/bpf: Remove #ifdef CONFIG_BPF_SYSCALL from struct perf_event members
This will allow __perf_event_overflow() (which is independent of
CONFIG_BPF_SYSCALL) to use struct perf_event's prog to decide whether to
call bpf_overflow_handler().

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240412015019.7060-4-khuey@kylehuey.com
2024-04-12 11:49:49 +02:00
Namhyung Kim
0259bf63f7 perf/core: Optimize perf_adjust_freq_unthr_context()
It was unnecessarily disabling and enabling PMUs for each event.  It
should be done at PMU level.  Add pmu_ctx->nr_freq counter to check it
at each PMU.  As PMU context has separate active lists for pinned group
and flexible group, factor out a new function to do the job.

Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT
even if it needs to unthrottle sampling events.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20240207050545.2727923-1-namhyung@kernel.org
2024-04-10 06:13:57 +02:00
Peter Zijlstra
5d2d4a9f60 Merge branch 'tip/perf/urgent'
Avoid conflicts, base on fixes.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2023-11-15 10:15:40 +01:00
Peter Zijlstra
889c58b315 perf/core: Fix cpuctx refcounting
Audit of the refcounting turned up that perf_pmu_migrate_context()
fails to migrate the ctx refcount.

Fixes: bd27568117 ("perf: Rewrite core context handling")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20230612093539.085862001@infradead.org
Cc: <stable@vger.kernel.org>
2023-11-15 04:18:31 +01:00
Linus Torvalds
bceb7accb7 Performance events changes for v6.7 are:
- Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
  - Simplify & clean up the uncore management code
  - Fall back from RDPMC to RDMSR on certain uncore PMUs
  - Improve per-package and cstate event reading
  - Extend the Intel ref-cycles event to GP counters
  - Fix Intel MTL event constraints
  - Improve the Intel hybrid CPU handling code
  - Micro-optimize the RAPL code
  - Optimize perf_cgroup_switch()
  - Improve large AUX area error handling
 
  - Misc fixes and cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU89YsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iQqQ/9EF9mG4te5By4qN+B7jADCmE71xG5ViKz
 sp4Thl86SHxhwFuiHn8dMUixrp+qbcemi5yTbQ9TF8cKl4s3Ju2CihU8jaauUp0a
 iS5W0IliMqLD1pxQoXAPLuPVInVYgrNOCbR4l6l7D6ervh5Z6PVEf7SVeAP3L5wo
 QV/V3NKkrYeNQL+FoKhCH8Vhxw0HxUmKJO7UhW6yuCt7BAok9Es18h3OVnn+7es4
 BB7VI/JvdmXf2ioKhTPnDXJjC+vh5vnwiBoTcdQ2W9ADhWUvfL4ozxOXT6z7oC3A
 nwBOdXf8w8Rqnqqd8hduop1QUrusMxlEVgOMCk27qHx97uWgPceZWdoxDXGHBiRK
 fqJAwXERf9wp5/M57NDlPwyf/43Hocdx2CdLkQBpfD78/k/sB5hW0KxnzY0FUI9x
 jBRQyWD05IDJATBaMHz+VbrexS+Itvjp2QvSiSm9zislYD4zA9fQ3lAgFhEpcUbA
 ZA/nN4t+CbiGEAsJEuBPlvSC1ahUwVP/0nz3PFlVWFDqAx0mXgVNKBe083A9yh7I
 dVisVY6KPAVDzyOc1LqzU8WFXNFnIkIIaLrb6fRHJVEM8MDfpLPS/a+7AHdRcDP4
 yq6fjVVjyP7e9lSQLYBUP3/3uiVnWQj92l6V6CrcgDMX5rDOb0VN+BQrmhPR6fWY
 WEim6WZrZj4=
 =OLsv
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance event updates from Ingo Molnar:
 - Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
 - Simplify & clean up the uncore management code
 - Fall back from RDPMC to RDMSR on certain uncore PMUs
 - Improve per-package and cstate event reading
 - Extend the Intel ref-cycles event to GP counters
 - Fix Intel MTL event constraints
 - Improve the Intel hybrid CPU handling code
 - Micro-optimize the RAPL code
 - Optimize perf_cgroup_switch()
 - Improve large AUX area error handling
 - Misc fixes and cleanups

* tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  perf/x86/amd/uncore: Pass through error code for initialization failures, instead of -ENODEV
  perf/x86/amd/uncore: Fix uninitialized return value in amd_uncore_init()
  x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations
  perf: Optimize perf_cgroup_switch()
  perf/x86/amd/uncore: Add memory controller support
  perf/x86/amd/uncore: Add group exclusivity
  perf/x86/amd/uncore: Use rdmsr if rdpmc is unavailable
  perf/x86/amd/uncore: Move discovery and registration
  perf/x86/amd/uncore: Refactor uncore management
  perf/core: Allow reading package events from perf_event_read_local
  perf/x86/cstate: Allow reading the package statistics from local CPU
  perf/x86/intel/pt: Fix kernel-doc comments
  perf/x86/rapl: Annotate 'struct rapl_pmus' with __counted_by
  perf/core: Rename perf_proc_update_handler() -> perf_event_max_sample_rate_handler(), for readability
  perf/x86/rapl: Fix "Using plain integer as NULL pointer" Sparse warning
  perf/x86/rapl: Use local64_try_cmpxchg in rapl_event_update()
  perf/x86/rapl: Stop doing cpu_relax() in the local64_cmpxchg() loop in rapl_event_update()
  perf/core: Bail out early if the request AUX area is out of bound
  perf/x86/intel: Extend the ref-cycles event to GP counters
  perf/x86/intel: Fix broken fixed event constraints extension
  ...
2023-10-30 13:44:35 -10:00
Kan Liang
1f2376cd03 perf: Add branch_sample_call_stack
Add a helper function to check call stack sample type.

The later patch will invoke the function in several places.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-3-kan.liang@linux.intel.com
2023-10-27 15:05:09 +02:00
Kan Liang
571d91dcad perf: Add branch stack counters
Currently, the additional information of a branch entry is stored in a
u64 space. With more and more information added, the space is running
out. For example, the information of occurrences of events will be added
for each branch.

Two places were suggested to append the counters.
https://lore.kernel.org/lkml/20230802215814.GH231007@hirez.programming.kicks-ass.net/
One place is right after the flags of each branch entry. It changes the
existing struct perf_branch_entry. The later ARCH specific
implementation has to be really careful to consistently pick
the right struct.
The other place is right after the entire struct perf_branch_stack.
The disadvantage is that the pointer of the extra space has to be
recorded. The common interface perf_sample_save_brstack() has to be
updated.

The latter is much straightforward, and should be easily understood and
maintained. It is implemented in the patch.

Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to indicate
the event which is recorded in the branch info.

The "u64 counters" may store the occurrences of several events. The
information regarding the number of events/counters and the width of
each counter should be exposed via sysfs as a reference for the perf
tool. Define the branch_counter_nr and branch_counter_width ABI here.
The support will be implemented later in the Intel-specific patch.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-1-kan.liang@linux.intel.com
2023-10-27 15:05:08 +02:00
Peter Zijlstra
32671e3799 perf: Disallow mis-matched inherited group reads
Because group consistency is non-atomic between parent (filedesc) and children
(inherited) events, it is possible for PERF_FORMAT_GROUP read() to try and sum
non-matching counter groups -- with non-sensical results.

Add group_generation to distinguish the case where a parent group removes and
adds an event and thus has the same number, but a different configuration of
events as inherited groups.

This became a problem when commit fa8c269353 ("perf/core: Invert
perf_read_group() loops") flipped the order of child_list and sibling_list.
Previously it would iterate the group (sibling_list) first, and for each
sibling traverse the child_list. In this order, only the group composition of
the parent is relevant. By flipping the order the group composition of the
child (inherited) events becomes an issue and the mis-match in group
composition becomes evident.

That said; even prior to this commit, while reading of a group that is not
equally inherited was not broken, it still made no sense.

(Ab)use ECHILD as error return to indicate issues with child process group
composition.

Fixes: fa8c269353 ("perf/core: Invert perf_read_group() loops")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231018115654.GK33217@noisy.programming.kicks-ass.net
2023-10-19 10:09:42 +02:00
Peter Zijlstra
f06cc667f7 perf: Optimize perf_cgroup_switch()
Namhyung reported that bd27568117 ("perf: Rewrite core context handling")
regresses context switch overhead when perf-cgroup is in use together
with 'slow' PMUs like uncore.

Specifically, perf_cgroup_switch()'s perf_ctx_disable() /
ctx_sched_out() etc.. all iterate the full list of active PMUs for
that CPU, even if they don't have cgroup events.

Previously there was cgrp_cpuctx_list which linked the relevant PMUs
together, but that got lost in the rework. Instead of re-instruducing
a similar list, let the perf_event_pmu_context iteration skip those
that do not have cgroup events. This avoids growing multiple versions
of the perf_event_pmu_context iteration.

Measured performance (on a slightly different patch):

Before)

  $ taskset -c 0 ./perf bench sched pipe -l 10000 -G AAA,BBB
  # Running 'sched/pipe' benchmark:
  # Executed 10000 pipe operations between two processes

       Total time: 0.901 [sec]

        90.128700 usecs/op
            11095 ops/sec

After)

  $ taskset -c 0 ./perf bench sched pipe -l 10000 -G AAA,BBB
  # Running 'sched/pipe' benchmark:
  # Executed 10000 pipe operations between two processes

       Total time: 0.065 [sec]

         6.560100 usecs/op
           152436 ops/sec

Fixes: bd27568117 ("perf: Rewrite core context handling")
Reported-by: Namhyung Kim <namhyung@kernel.org>
Debugged-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231009210425.GC6307@noisy.programming.kicks-ass.net
2023-10-12 19:28:38 +02:00
Xiu Jianfeng
e6814ec3ba perf/core: Rename perf_proc_update_handler() -> perf_event_max_sample_rate_handler(), for readability
Follow the naming pattern of the other sysctl handlers in perf.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230721090607.172002-1-xiujianfeng@huawei.com
2023-10-05 10:26:50 +02:00
Linus Torvalds
e0152e7481 RISC-V Patches for the 6.6 Merge Window, Part 1
* Support for the new "riscv,isa-extensions" and "riscv,isa-base" device
   tree interfaces for probing extensions.
 * Support for userspace access to the performance counters.
 * Support for more instructions in kprobes.
 * Crash kernels can be allocated above 4GiB.
 * Support for KCFI.
 * Support for ELFs in !MMU configurations.
 * ARCH_KMALLOC_MINALIGN has been reduced to 8.
 * mmap() defaults to sv48-sized addresses, with longer addresses hidden
   behind a hint (similar to Arm and Intel).
 * Also various fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmTx96kTHHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYiVjRD/9DYVLlkQ/OEDJjPaEcYCP49xgIVUUU
 lhs3XbSs2VNHBaiG114f6Q0AaT/uNi+uqSej3CeTmEot2kZkBk/f2yu+UNIriPZ9
 GQiZsdyXhu921C+5VFtiI47KDWOVZ+Jpy3M1ll61IWt3yPSQHr1xOP0AOiyHHqe3
 cmqpNnzjajlfVDoXPc2mGGzUJt/7ar4thcwnMNi98raXR5Qh7SP6rrHjoQhE1oFk
 LMP3CHqEAcHE2tE4CxZVpc6HOQ5m0LpQIOK7ypufGMyoIYESm5dt/JOT4MlhTtDw
 6JzyVKtiM7lartUnUaW3ZoX4trQYT5gbXxWrJ2gCnUGy3VulikoXr1Rpz0qfdeOR
 XN8OLkVAqHfTGFI7oKk24f9Adw96R5NPZcdCay90h4J/kMfCiC7ZyUUI1XIa5iy1
 np5pZCkf8HNcdywML7qcFd5n2O0wchyFnRLFZo6kJP9Ls5cEi6kBx/1jSdTcNgx/
 fUKXyoEcriGoQiiwn29+4RZnU69gJV3zqQNLPpuwDQ5F/Q1zHTlrr+dqzezKkzcO
 dRTV2d2Q4A5vIDXPptzNNLlRQdrc8qxPJ1lxQVkPIU4/mtqczmZBwlyY2u9zwPyS
 sehJgJZnoAf+jm71NgQAKLck4MUBsMnMogOWunhXkVRCoZlbbkUWX4ECZYwPKsVk
 W7zVPmLvSM0l5g==
 =/tXb
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:

 - Support for the new "riscv,isa-extensions" and "riscv,isa-base"
   device tree interfaces for probing extensions

 - Support for userspace access to the performance counters

 - Support for more instructions in kprobes

 - Crash kernels can be allocated above 4GiB

 - Support for KCFI

 - Support for ELFs in !MMU configurations

 - ARCH_KMALLOC_MINALIGN has been reduced to 8

 - mmap() defaults to sv48-sized addresses, with longer addresses hidden
   behind a hint (similar to Arm and Intel)

 - Also various fixes and cleanups

* tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits)
  lib/Kconfig.debug: Restrict DEBUG_INFO_SPLIT for RISC-V
  riscv: support PREEMPT_DYNAMIC with static keys
  riscv: Move create_tmp_mapping() to init sections
  riscv: Mark KASAN tmp* page tables variables as static
  riscv: mm: use bitmap_zero() API
  riscv: enable DEBUG_FORCE_FUNCTION_ALIGN_64B
  riscv: remove redundant mv instructions
  RISC-V: mm: Document mmap changes
  RISC-V: mm: Update pgtable comment documentation
  RISC-V: mm: Add tests for RISC-V mm
  RISC-V: mm: Restrict address space for sv39,sv48,sv57
  riscv: enable DMA_BOUNCE_UNALIGNED_KMALLOC for !dma_coherent
  riscv: allow kmalloc() caches aligned to the smallest value
  riscv: support the elf-fdpic binfmt loader
  binfmt_elf_fdpic: support 64-bit systems
  riscv: Allow CONFIG_CFI_CLANG to be selected
  riscv/purgatory: Disable CFI
  riscv: Add CFI error handling
  riscv: Add ftrace_stub_graph
  riscv: Add types to indirectly called assembly functions
  ...
2023-09-01 08:09:48 -07:00
Linus Torvalds
542034175c arm64 updates for 6.6
CPU features and system registers:
 	* Advertise hinted conditional branch support (FEAT_HBC) to
 	  userspace
 
 	* Avoid false positive "SANITY CHECK" warning when xCR registers
 	  differ outside of the length field
 
 Documentation:
 	* Fix macro name typo in SME documentation
 
 Entry code:
 	* Unmask exceptions earlier on the system call entry path
 
 Memory management:
 	* Don't bother clearing PTE_RDONLY for dirty ptes in
 	  pte_wrprotect() and pte_modify()
 
 Perf and PMU drivers:
 	* Initial support for Coresight TRBE devices on ACPI systems (the
 	  coresight driver changes will come later)
 
 	* Fix hw_breakpoint single-stepping when called from bpf
 
 	* Fixes for DDR PMU on i.MX8MP SoC
 
 	* Add NUMA-awareness to Hisilicon PCIe PMU driver
 
 	* Fix locking dependency issue in Arm DMC620 PMU driver
 
 	* Workaround Hisilicon erratum 162001900 in the SMMUv3 PMU driver
 
 	* Add support for Arm CMN-700 r3 parts to the CMN PMU driver
 
 	* Add support for recent Arm Cortex CPU PMUs
 
 	* Update Hisilicon PMU maintainers
 
 Selftests:
 	* Add a bunch of new features to the hwcap test (JSCVT, PMULL,
 	  AES, SHA1, etc)
 
 	* Fix SSVE test to leave streaming-mode after grabbing the
 	  signal context
 
 	* Add new test for SVE vector-length changes with SME enabled
 
 Miscellaneous:
 	* Allow compiler to warn on suspicious looking system register
 	  expressions
 
 	* Work around SDEI firmware bug by aborting any running
 	  handlers on a kernel crash
 
 	* Fix some harmless warnings when building with W=1
 
 	* Remove some unused function declarations
 
 	* Other minor fixes and cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmTon4QQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNG0nCAC9lTqppELnqXPA3FswONhtDBnKEufZHp0+
 4+Z6CPjAYZpd7ruiezvxeZA62tZl3eX+tYOx+6lf4xYxFA5W/RQdmxM7e0mGJd+n
 sgps85kxArApCgJR9zJiTCAIPXzKH5ObsFWWbcRljI9fiISVDTYn1JFAEx9UERI5
 5yr6blYF2H115oD8V2f/0vVObGOAuiqNnzqJIuKL1I8H9xBK0pssrKvuCCN8J2o4
 28+PeO7PzwWPiSfnO15bLd/bGuzbMCcexv4/DdjtLZaAanW7crJRVAzOon+URuVx
 JXmkzQvXkOgSKnEFwfVRYTsUbtOz2cBafjSujVmjwIBymhbBCZR/
 =WqmX
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "I think we have a bit less than usual on the architecture side, but
  that's somewhat balanced out by a large crop of perf/PMU driver
  updates and extensions to our selftests.

  CPU features and system registers:

   - Advertise hinted conditional branch support (FEAT_HBC) to userspace

   - Avoid false positive "SANITY CHECK" warning when xCR registers
     differ outside of the length field

  Documentation:

   - Fix macro name typo in SME documentation

  Entry code:

   - Unmask exceptions earlier on the system call entry path

  Memory management:

   - Don't bother clearing PTE_RDONLY for dirty ptes in pte_wrprotect()
     and pte_modify()

  Perf and PMU drivers:

   - Initial support for Coresight TRBE devices on ACPI systems (the
     coresight driver changes will come later)

   - Fix hw_breakpoint single-stepping when called from bpf

   - Fixes for DDR PMU on i.MX8MP SoC

   - Add NUMA-awareness to Hisilicon PCIe PMU driver

   - Fix locking dependency issue in Arm DMC620 PMU driver

   - Workaround Hisilicon erratum 162001900 in the SMMUv3 PMU driver

   - Add support for Arm CMN-700 r3 parts to the CMN PMU driver

   - Add support for recent Arm Cortex CPU PMUs

   - Update Hisilicon PMU maintainers

  Selftests:

   - Add a bunch of new features to the hwcap test (JSCVT, PMULL, AES,
     SHA1, etc)

   - Fix SSVE test to leave streaming-mode after grabbing the signal
     context

   - Add new test for SVE vector-length changes with SME enabled

  Miscellaneous:

   - Allow compiler to warn on suspicious looking system register
     expressions

   - Work around SDEI firmware bug by aborting any running handlers on a
     kernel crash

   - Fix some harmless warnings when building with W=1

   - Remove some unused function declarations

   - Other minor fixes and cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (62 commits)
  drivers/perf: hisi: Update HiSilicon PMU maintainers
  arm_pmu: acpi: Add a representative platform device for TRBE
  arm_pmu: acpi: Refactor arm_spe_acpi_register_device()
  kselftest/arm64: Fix hwcaps selftest build
  hw_breakpoint: fix single-stepping when using bpf_overflow_handler
  arm64/sysreg: refactor deprecated strncpy
  kselftest/arm64: add jscvt feature to hwcap test
  kselftest/arm64: add pmull feature to hwcap test
  kselftest/arm64: add AES feature check to hwcap test
  kselftest/arm64: add SHA1 and related features to hwcap test
  arm64: sysreg: Generate C compiler warnings on {read,write}_sysreg_s arguments
  kselftest/arm64: build BTI tests in output directory
  perf/imx_ddr: don't enable counter0 if none of 4 counters are used
  perf/imx_ddr: speed up overflow frequency of cycle
  drivers/perf: hisi: Schedule perf session according to locality
  kselftest/arm64: fix a memleak in zt_regs_run()
  perf/arm-dmc620: Fix dmc620_pmu_irqs_lock/cpu_hotplug_lock circular lock dependency
  perf/smmuv3: Add MODULE_ALIAS for module auto loading
  perf/smmuv3: Enable HiSilicon Erratum 162001900 quirk for HIP08/09
  kselftest/arm64: Size sycall-abi buffers for the actual maximum VL
  ...
2023-08-28 17:34:54 -07:00
Tomislav Novak
d11a69873d hw_breakpoint: fix single-stepping when using bpf_overflow_handler
Arm platforms use is_default_overflow_handler() to determine if the
hw_breakpoint code should single-step over the breakpoint trigger or
let the custom handler deal with it.

Since bpf_overflow_handler() currently isn't recognized as a default
handler, attaching a BPF program to a PERF_TYPE_BREAKPOINT event causes
it to keep firing (the instruction triggering the data abort exception
is never skipped). For example:

  # bpftrace -e 'watchpoint:0x10000:4:w { print("hit") }' -c ./test
  Attaching 1 probe...
  hit
  hit
  [...]
  ^C

(./test performs a single 4-byte store to 0x10000)

This patch replaces the check with uses_default_overflow_handler(),
which accounts for the bpf_overflow_handler() case by also testing
if one of the perf_event_output functions gets invoked indirectly,
via orig_default_handler.

Signed-off-by: Tomislav Novak <tnovak@meta.com>
Tested-by: Samuel Gosselin <sgosselin@google.com> # arm64
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/linux-arm-kernel/20220923203644.2731604-1-tnovak@fb.com/
Link: https://lore.kernel.org/r/20230605191923.1219974-1-tnovak@meta.com
Signed-off-by: Will Deacon <will@kernel.org>
2023-08-18 17:04:09 +01:00
Alexandre Ghiti
366d259ff5
perf: Fix wrong comment about default event_idx
Since commit c719f56092 ("perf: Fix and clean up initialization of
pmu::event_idx"), event_idx default implementation has returned 0, not
idx + 1, so fix the comment that can be misleading.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
2023-08-16 07:28:15 -07:00
YueHaibing
62af032237 perf: Remove unused extern declaration arch_perf_get_page_size()
commit 8af26be062 ("perf/core: Fix arch_perf_get_page_size()")
left behind this.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230725135038.25060-1-yuehaibing@huawei.com
2023-07-26 12:28:48 +02:00
James Clark
0cb52ad7bb perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
Since commit bd27568117 ("perf: Rewrite core context handling") the
relationship between perf_event_context and PMUs has changed so that
the error scenario that PERF_PMU_CAP_HETEROGENEOUS_CPUS originally
silenced no longer exists.

Remove the capability to avoid confusion that it actually influences
any perf core behavior and shift down the following capability bits to
fill in the unused space. This change should be a no-op.

Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20230724134500.970496-5-james.clark@arm.com
2023-07-26 12:28:47 +02:00
Ravi Bangoria
5c6e623f1b perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA
Add PERF_MEM_LVLNUM_NA wherever PERF_MEM_NA is used to set default values.

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230725150206.184-3-ravi.bangoria@amd.com
2023-07-26 12:28:45 +02:00
Linus Torvalds
d25f002575 cxl for v6.5
- Add infrastructure for supporting background commands along with
   support for device sanitization and firmware update
 
 - Introduce a CXL performance monitoring unit driver based on the common
   definition in the specification.
 
 - Land some preparatory cleanup and refactoring for the anticipated
   arrival of CXL type-2 (accelerator devices) and CXL RCH (CXL-v1.1
   topology) error handling.
 
 - Rework CPU cache management with respect to region configuration
   (device hotplug or other dynamic changes to memory interleaving)
 
 - Fix region reconfiguration vs CXL decoder ordering rules.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCZJ9fkQAKCRDfioYZHlFs
 ZyWcAP9THJ6ZzX1mbAfHhPz9r+oxsrE3l1jQpNjNbh7MNW29MAEA36dmTE62JaHK
 lTPDgHxqBt1vrHPktYWOM9ZPHE2tLwA=
 =3fFL
 -----END PGP SIGNATURE-----

Merge tag 'cxl-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull CXL updates from Dan Williams:
 "The highlights in terms of new functionality are support for the
  standard CXL Performance Monitor definition that appeared in CXL 3.0,
  support for device sanitization (wiping all data from a device),
  secure-erase (re-keying encryption of user data), and support for
  firmware update. The firmware update support is notable as it reuses
  the simple sysfs_upload interface to just cat(1) a blob to a sysfs
  file and pipe that to the device.

  Additionally there are a substantial number of cleanups and
  reorganizations to get ready for RCH error handling (RCH == Restricted
  CXL Host == current shipping hardware generation / pre CXL-2.0
  topologies) and type-2 (accelerator / vendor specific) devices.

  For vendor specific devices they implement a subset of what the
  generic type-3 (generic memory expander) driver expects. As a result
  the rework decouples optional infrastructure from the core driver
  context.

  For RCH topologies, where the specification working group did not want
  to confuse pre-CXL-aware operating systems, many of the standard
  registers are hidden which makes support standard bus features like
  AER (PCIe Advanced Error Reporting) difficult. The rework arranges for
  the driver to help the PCI-AER core. Bjorn is on board with this
  direction but a late regression disocvery means the completion of this
  functionality needs to cook a bit longer, so it is code
  reorganizations only for now.

  Summary:

   - Add infrastructure for supporting background commands along with
     support for device sanitization and firmware update

   - Introduce a CXL performance monitoring unit driver based on the
     common definition in the specification.

   - Land some preparatory cleanup and refactoring for the anticipated
     arrival of CXL type-2 (accelerator devices) and CXL RCH (CXL-v1.1
     topology) error handling.

   - Rework CPU cache management with respect to region configuration
     (device hotplug or other dynamic changes to memory interleaving)

   - Fix region reconfiguration vs CXL decoder ordering rules"

* tag 'cxl-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (51 commits)
  cxl: Fix one kernel-doc comment
  cxl/pci: Use correct flag for sanitize polling
  docs: perf: Minimal introduction the the CXL PMU device and driver
  perf: CXL Performance Monitoring Unit driver
  tools/testing/cxl: add firmware update emulation to CXL memdevs
  tools/testing/cxl: Use named effects for the Command Effect Log
  tools/testing/cxl: Fix command effects for inject/clear poison
  cxl: add a firmware update mechanism using the sysfs firmware loader
  cxl/test: Add Secure Erase opcode support
  cxl/mem: Support Secure Erase
  cxl/test: Add Sanitize opcode support
  cxl/mem: Wire up Sanitization support
  cxl/mbox: Add sanitization handling machinery
  cxl/mem: Introduce security state sysfs file
  cxl/mbox: Allow for IRQ_NONE case in the isr
  Revert "cxl/port: Enable the HDM decoder capability for switch ports"
  cxl/memdev: Formalize endpoint port linkage
  cxl/pci: Unconditionally unmask 256B Flit errors
  cxl/region: Manage decoder target_type at decoder-attach time
  cxl/hdm: Default CXL_DEVTYPE_DEVMEM decoders to CXL_DECODER_DEVMEM
  ...
2023-07-01 08:58:41 -07:00
Linus Torvalds
a193cc7506 Perf events changes for v6.5:
- Rework & fix the event forwarding logic by extending the
   core interface. This fixes AMD PMU events that have to
   be forwarded from the core PMU to the IBS PMU.
 
 - Add self-tests to test AMD IBS invocation via core PMU events
 
 - Clean up Intel FixCntrCtl MSR encoding & handling
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSayC0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jlWxAAqUPtfst1p6H5sSyCBPYo5Y/Rh0SyyqJj
 w0YZ8p2nbB/+EmIN3WS1uYhx1/AalTP254q2BgVF4DlDFQA1MlJCmSNJ9HhtzOgt
 mbpNKzy50cQCR/iH+s3ldcFsLGhSG07j6w8xeb6BGiABm2JoiZeg6iVU76zRe5A1
 iPnjC7qoqjKH+sq8pu32fBClMjzf05/LGMd0MqFuYfl5950xRW61olstjo93XWgK
 O5z+5wm5H3MhJ2mzU6x+0C/xurIEQ0zRf6AqLbFp41BbJJJORgTCK746flghiqd5
 DiADc7oj9eOqL1X9jFPHgE07T/6QPrMC8BoH64pOcM3PoZ6Iq3zTkUHxAw3qK5j+
 kqduxzlVaFLFnf7R/vxUvjMg1PM+qP3pqgCrT+NFUdqsdLgSPxRzt5pAM6aAUwmU
 1lhuapESH44RUFZGWrfOwzQE5q/FDmUc2yGyGW2aYDmwkclNjVpnvHEJrQMugI3M
 M3/y9a+ErcPDUJfHcodutBDGw9l7VhsxJFMt4ydOTkNbEfZLbi2TzNapui6SKFja
 G2efrB/HhrV9nE+21Wfa3uxoKMuJ/UPiGrVr2qyGOnShQpK7sdyGDshO1s6TTPye
 OoVf9I0LhewMPap52SU/KDP7GJVPW1BhL/C7w6OSnXxlS5k4lOji7z4Dj2hqXHib
 19Jm7BhqZwE=
 =xn05
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events updates from Ingo Molnar:

 - Rework & fix the event forwarding logic by extending the core
   interface.

   This fixes AMD PMU events that have to be forwarded from the
   core PMU to the IBS PMU.

 - Add self-tests to test AMD IBS invocation via core PMU events

 - Clean up Intel FixCntrCtl MSR encoding & handling

* tag 'perf-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Re-instate the linear PMU search
  perf/x86/intel: Define bit macros for FixCntrCtl MSR
  perf test: Add selftest to test IBS invocation via core pmu events
  perf/core: Remove pmu linear searching code
  perf/ibs: Fix interface via core pmu events
  perf/core: Rework forwarding of {task|cpu}-clock events
2023-06-27 14:43:02 -07:00
Marc Zyngier
b50f26a448 perf/core: Drop __weak attribute from arch_perf_update_userpage() prototype
Reiji reports that the arm64 implementation of arch_perf_update_userpage()
is now ignored and replaced by the dummy stub in core code.
This seems to happen since the PMUv3 driver was moved to driver/perf.

As it turns out, dropping the __weak attribute from the *prototype*
of the function solves the problem. You're right, this doesn't seem
to make much sense. And yet... It appears that both symbols get
flagged as weak, and that the first one to appear in the link order
wins:

$ nm drivers/perf/arm_pmuv3.o|grep arch_perf_update_userpage
0000000000001db0 W arch_perf_update_userpage

Dropping the attribute from the prototype restores the expected
behaviour, and arm64 is able to enjoy arch_perf_update_userpage()
again.

Fixes: 7755cec63a ("arm64: perf: Move PMUv3 driver to drivers/perf")
Fixes: f1ec3a517b ("kernel/events: Add a missing prototype for arch_perf_update_userpage()")
Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Reiji Watanabe <reijiw@google.com>
Link: https://lkml.kernel.org/r/20230616114831.3186980-1-maz@kernel.org
2023-06-16 16:46:33 +02:00
Jonathan Cameron
143f83e200 perf: Allow a PMU to have a parent
Some PMUs have well defined parents such as PCI devices.
As the device_initialize() and device_add() are all within
pmu_dev_alloc() which is called from perf_pmu_register()
there is no opportunity to set the parent from within a driver.

Add a struct device *parent field to struct pmu and use that
to set the parent.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20230526095824.16336-2-Jonathan.Cameron@huawei.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-05-30 11:20:35 -07:00
Ravi Bangoria
0d6d062ca2 perf/core: Rework forwarding of {task|cpu}-clock events
Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock and
cpu-clock events are interfaced through it but internally gets forwarded
to their own pmus.

Rework this by overwriting event->attr.type in perf_swevent_init() which
will cause perf_init_event() to retry with updated type and event will
automatically get forwarded to right pmu. With the change, SW pmu no
longer needs to be treated specially and can be included in 'pmu_idr'
list.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504110003.2548-2-ravi.bangoria@amd.com
2023-05-08 10:58:30 +02:00
Namhyung Kim
f6e707156e perf/core: Introduce perf_prepare_header()
Factor out perf_prepare_header() so that it can call
perf_prepare_sample() without a header if not needed.

Also it checks the filtered_sample_type to avoid duplicate
work when perf_prepare_sample() is called twice (or more).

Suggested-by: Peter Zijlstr <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-8-namhyung@kernel.org
2023-01-18 11:57:20 +01:00
Namhyung Kim
eb55b455ef perf/core: Add perf_sample_save_brstack() helper
When we saves the branch stack to the perf sample data, we needs to
update the sample flags and the dynamic size.  To make sure this is
done consistently, add the perf_sample_save_brstack() helper and
convert all call sites.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-5-namhyung@kernel.org
2023-01-18 11:57:20 +01:00
Namhyung Kim
0a9081cf0a perf/core: Add perf_sample_save_raw_data() helper
When we save the raw_data to the perf sample data, we need to update
the sample flags and the dynamic size.  To make sure this is done
consistently, add the perf_sample_save_raw_data() helper and convert
all call sites.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-4-namhyung@kernel.org
2023-01-18 11:57:19 +01:00
Namhyung Kim
31046500c1 perf/core: Add perf_sample_save_callchain() helper
When we save the callchain to the perf sample data, we need to update
the sample flags and the dynamic size.  To ensure this is done consistently,
add the perf_sample_save_callchain() helper and convert all call sites.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-3-namhyung@kernel.org
2023-01-18 11:57:19 +01:00
Namhyung Kim
4cf7a13611 perf/core: Save the dynamic parts of sample data size
The perf sample data can be divided into parts.  The event->header_size
and event->id_header_size keep the static part of the sample data which
is determined by the sample_type flags.

But other parts like CALLCHAIN and BRANCH_STACK are changing dynamically
so it needs to see the actual data.  In preparation of handling repeated
calls for perf_prepare_sample(), it can save the dynamic size to the
perf sample data to avoid the duplicate work.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-2-namhyung@kernel.org
2023-01-18 11:57:19 +01:00
Namhyung Kim
7bdb1767bf perf/core: Change the layout of perf_sample_data
The layout of perf_sample_data is designed to minimize cache-line
access.  The perf_sample_data_init() used to initialize a couple of
fields unconditionally so they were placed together at the head.

But it's changed now to set the fields according to the actual
sample_type flags.  The main user (the perf tools) sets the IP, TID,
TIME, PERIOD always.  Also group relevant fields like addr, phys_addr
and data_page_size.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20221229204101.1099430-1-namhyung@kernel.org
2023-01-09 12:22:09 +01:00
Kan Liang
b6c00fb994 perf: Add PMU_FORMAT_ATTR_SHOW
The macro PMU_FORMAT_ATTR facilitates the definition of both the "show"
function and "format_attr". But it only works for a non-hybrid platform.
For a hybrid platform, the name "format_attr_hybrid_" is used.

The definition of the "show" function can be shared between a non-hybrid
platform and a hybrid platform. Add a new macro PMU_FORMAT_ATTR_SHOW.

No functional change. The PMU_FORMAT_ATTR_SHOW will be used in the
following patch.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230104201349.1451191-1-kan.liang@linux.intel.com
2023-01-09 12:22:07 +01:00
Peter Zijlstra
bd27568117 perf: Rewrite core context handling
There have been various issues and limitations with the way perf uses
(task) contexts to track events. Most notable is the single hardware
PMU task context, which has resulted in a number of yucky things (both
proposed and merged).

Notably:
 - HW breakpoint PMU
 - ARM big.little PMU / Intel ADL PMU
 - Intel Branch Monitoring PMU
 - AMD IBS PMU
 - S390 cpum_cf PMU
 - PowerPC trace_imc PMU

*Current design:*

Currently we have a per task and per cpu perf_event_contexts:

  task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
       ^                                 |    ^     |           ^
       `---------------------------------'    |     `--> pmu ---'
                                              v           ^
                                         perf_event ------'

Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of
events for that PMU. The task related perf_event_context's have a
pointer back to that task.

Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to
that PMU, and a group of events for that PMU.

The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation etc.

Each perf_event is then associated with its PMU and one
perf_event_context.

*Proposed design:*

New design proposed by this patch reduce to a single task context and
a single CPU context but adds some intermediate data-structures:

  task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
       ^                           |   ^ ^
       `---------------------------'   | |
                                       | |    perf_cpu_pmu_context <--.
                                       | `----.    ^                  |
                                       |      |    |                  |
                                       |      v    v                  |
                                       | ,--> perf_event_pmu_context  |
                                       | |                            |
                                       | |                            |
                                       v v                            |
                                  perf_event ---> pmu ----------------'

With the new design, perf_event_context will hold all events for all
pmus in the (respective pinned/flexible) rbtrees. This can be achieved
by adding pmu to rbtree key:

  {cpu, pmu, cgroup, group_index}

Each perf_event_context carries a list of perf_event_pmu_context which
is used to hold per-pmu-per-context state. For example, it keeps track
of currently active events for that pmu, a pmu specific task_ctx_data,
a flag to tell whether rotation is required or not etc.

Additionally, perf_cpu_pmu_context is used to hold per-pmu-per-cpu
state like hrtimer details to drive the event rotation, a pointer to
perf_event_pmu_context of currently running task and some other
ancillary information.

Each perf_event is associated to it's pmu, perf_event_context and
perf_event_pmu_context.

Further optimizations to current implementation are possible. For
example, ctx_resched() can be optimized to reschedule only single pmu
events.

Much thanks to Ravi for picking this up and pushing it towards
completion.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221008062424.313-1-ravi.bangoria@amd.com
2022-10-27 20:12:16 +02:00
Peter Zijlstra
ca6c21327c perf: Fix missing SIGTRAPs
Marco reported:

Due to the implementation of how SIGTRAP are delivered if
perf_event_attr::sigtrap is set, we've noticed 3 issues:

  1. Missing SIGTRAP due to a race with event_sched_out() (more
     details below).

  2. Hardware PMU events being disabled due to returning 1 from
     perf_event_overflow(). The only way to re-enable the event is
     for user space to first "properly" disable the event and then
     re-enable it.

  3. The inability to automatically disable an event after a
     specified number of overflows via PERF_EVENT_IOC_REFRESH.

The worst of the 3 issues is problem (1), which occurs when a
pending_disable is "consumed" by a racing event_sched_out(), observed
as follows:

		CPU0			|	CPU1
	--------------------------------+---------------------------
	__perf_event_overflow()		|
	 perf_event_disable_inatomic()	|
	  pending_disable = CPU0	| ...
					| _perf_event_enable()
					|  event_function_call()
					|   task_function_call()
					|    /* sends IPI to CPU0 */
	<IPI>				| ...
	 __perf_event_enable()		+---------------------------
	  ctx_resched()
	   task_ctx_sched_out()
	    ctx_sched_out()
	     group_sched_out()
	      event_sched_out()
	       pending_disable = -1
	</IPI>
	<IRQ-work>
	 perf_pending_event()
	  perf_pending_event_disable()
	   /* Fails to send SIGTRAP because no pending_disable! */
	</IRQ-work>

In the above case, not only is that particular SIGTRAP missed, but also
all future SIGTRAPs because 'event_limit' is not reset back to 1.

To fix, rework pending delivery of SIGTRAP via IRQ-work by introduction
of a separate 'pending_sigtrap', no longer using 'event_limit' and
'pending_disable' for its delivery.

Additionally; and different to Marco's proposed patch:

 - recognise that pending_disable effectively duplicates oncpu for
   the case where it is set. As such, change the irq_work handler to
   use ->oncpu to target the event and use pending_* as boolean toggles.

 - observe that SIGTRAP targets the ctx->task, so the context switch
   optimization that carries contexts between tasks is invalid. If
   the irq_work were delayed enough to hit after a context switch the
   SIGTRAP would be delivered to the wrong task.

 - observe that if the event gets scheduled out
   (rotation/migration/context-switch/...) the irq-work would be
   insufficient to deliver the SIGTRAP when the event gets scheduled
   back in (the irq-work might still be pending on the old CPU).

   Therefore have event_sched_out() convert the pending sigtrap into a
   task_work which will deliver the signal at return_to_user.

Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Debugged-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Marco Elver <elver@google.com>
Debugged-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
2022-10-17 16:32:05 +02:00
Peter Zijlstra
0ce38047e8 perf: Fix lockdep_assert_event_ctx()
I'm a flamin' moron; because even after Mark told me it should be '&&'
I still got it wrong in the final commit.

Fixes: f3c0eba287 ("perf: Add a few assertions")
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/YvvIWmDBWdIUCMZj@FVFF77S0Q05N
2022-10-04 13:32:08 +02:00
Namhyung Kim
838d9bb62d perf: Use sample_flags for raw_data
Use the new sample_flags to indicate whether the raw data field is
filled by the PMU driver.  Although it could check with the NULL,
follow the same rule with other fields.

Remove the raw field from the perf_sample_data_init() to minimize
the number of cache lines touched.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220921220032.2858517-2-namhyung@kernel.org
2022-09-27 22:50:24 +02:00
Namhyung Kim
7b08463015 perf: Use sample_flags for addr
Use the new sample_flags to indicate whether the addr field is filled by
the PMU driver.  As most PMU drivers pass 0, it can set the flag only if
it has a non-zero value.  And use 0 in perf_sample_output() if it's not
filled already.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220921220032.2858517-1-namhyung@kernel.org
2022-09-27 22:50:24 +02:00
Peter Zijlstra
f3c0eba287 perf: Add a few assertions
While auditing 6b959ba22d ("perf/core: Fix reentry problem in
perf_output_read_group()") a few spots were found that wanted
assertions.

Notable for_each_sibling_event() relies on exclusion from
modification. This would normally be holding either ctx->lock or
ctx->mutex, however due to how things are constructed disabling IRQs
is a valid and sufficient substitute for ctx->lock.

Another possible site to add assertions would be the various
pmu::{add,del,read,..}() methods, but that's not trivially expressable
in C -- the best option is wrappers, but those are easy enough to
forget.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2022-09-07 21:54:01 +02:00
Anshuman Khandual
f67dd218fa perf/core: Assert PERF_EVENT_FLAG_ARCH does not overlap with generic flags
This just ensures that PERF_EVENT_FLAG_ARCH does not overlap with generic
hardware event flags.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: James Clark <james.clark@arm.com>
Link: https://lkml.kernel.org/r/20220907091924.439193-3-anshuman.khandual@arm.com
2022-09-07 21:54:00 +02:00
Anshuman Khandual
7517f08b9a perf/core: Expand PERF_EVENT_FLAG_ARCH
Two hardware event flags on x86 platform has overshot PERF_EVENT_FLAG_ARCH
(0x0000ffff). These flags are PERF_X86_EVENT_PEBS_LAT_HYBRID (0x20000) and
PERF_X86_EVENT_AMD_BRS (0x10000). Lets expand PERF_EVENT_FLAG_ARCH mask to
accommodate those flags, and also create room for two more in the future.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: James Clark <james.clark@arm.com>
Link: https://lkml.kernel.org/r/20220907091924.439193-2-anshuman.khandual@arm.com
2022-09-07 21:54:00 +02:00
Anshuman Khandual
03b02db93b perf: Consolidate branch sample filter helpers
Besides the branch type filtering requests, 'event.attr.branch_sample_type'
also contains various flags indicating which additional information should
be captured, along with the base branch record. These flags help configure
the underlying hardware, and capture the branch records appropriately when
required e.g after PMU interrupt. But first, this moves an existing helper
perf_sample_save_hw_index() into the header before adding some more helpers
for other branch sample filter flags.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220906084414.396220-1-anshuman.khandual@arm.com
2022-09-07 21:54:00 +02:00
Kan Liang
ee9db0e14b perf: Use sample_flags for txn
Use the new sample_flags to indicate whether the txn field is filled by
the PMU driver.

Remove the txn field from the perf_sample_data_init() to minimize the
number of cache lines touched.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-7-kan.liang@linux.intel.com
2022-09-06 11:33:03 +02:00
Kan Liang
e16fd7f2cb perf: Use sample_flags for data_src
Use the new sample_flags to indicate whether the data_src field is
filled by the PMU driver.

Remove the data_src field from the perf_sample_data_init() to minimize
the number of cache lines touched.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-6-kan.liang@linux.intel.com
2022-09-06 11:33:03 +02:00
Kan Liang
2abe681da0 perf: Use sample_flags for weight
Use the new sample_flags to indicate whether the weight field is filled
by the PMU driver.

Remove the weight field from the perf_sample_data_init() to minimize the
number of cache lines touched.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-5-kan.liang@linux.intel.com
2022-09-06 11:33:02 +02:00
Kan Liang
a9a931e266 perf: Use sample_flags for branch stack
Use the new sample_flags to indicate whether the branch stack is filled
by the PMU driver.

Remove the br_stack from the perf_sample_data_init() to minimize the number
of cache lines touched.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-4-kan.liang@linux.intel.com
2022-09-06 11:33:02 +02:00
Kan Liang
3aac580d5c perf: Add sample_flags to indicate the PMU-filled sample data
On some platforms, some data e.g., timestamps, can be retrieved from
the PMU driver. Usually, the data from the PMU driver is more accurate.
The current perf kernel should output the PMU-filled sample data if
it's available.

To check the availability of the PMU-filled sample data, the current
perf kernel initializes the related fields in the
perf_sample_data_init(). When outputting a sample, the perf checks
whether the field is updated by the PMU driver. If yes, the updated
value will be output. If not, the perf uses an SW way to calculate the
value or just outputs the initialized value if an SW way is unavailable
either.

With more and more data being provided by the PMU driver, more fields
has to be initialized in the perf_sample_data_init(). That will
increase the number of cache lines touched in perf_sample_data_init()
and be harmful to the performance.

Add new "sample_flags" to indicate the PMU-filled sample data. The PMU
driver should set the corresponding PERF_SAMPLE_ flag when the field is
updated. The initialization of the corresponding field is not required
anymore. The following patches will make use of it and remove the
corresponding fields from the perf_sample_data_init(), which will
further minimize the number of cache lines touched.

Only clear the sample flags that have already been done by the PMU
driver in the perf_prepare_sample() for the PERF_RECORD_SAMPLE. For the
other PERF_RECORD_ event type, the sample data is not available.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220901130959.1285717-2-kan.liang@linux.intel.com
2022-09-06 11:33:01 +02:00
Marco Elver
0370dc314d perf/hw_breakpoint: Optimize list of per-task breakpoints
On a machine with 256 CPUs, running the recently added perf breakpoint
benchmark results in:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 236.418 [sec]
 |
 |   123134.794271 usecs/op
 |  7880626.833333 usecs/op/cpu

The benchmark tests inherited breakpoint perf events across many
threads.

Looking at a perf profile, we can see that the majority of the time is
spent in various hw_breakpoint.c functions, which execute within the
'nr_bp_mutex' critical sections which then results in contention on that
mutex as well:

    37.27%  [kernel]       [k] osq_lock
    34.92%  [kernel]       [k] mutex_spin_on_owner
    12.15%  [kernel]       [k] toggle_bp_slot
    11.90%  [kernel]       [k] __reserve_bp_slot

The culprit here is task_bp_pinned(), which has a runtime complexity of
O(#tasks) due to storing all task breakpoints in the same list and
iterating through that list looking for a matching task. Clearly, this
does not scale to thousands of tasks.

Instead, make use of the "rhashtable" variant "rhltable" which stores
multiple items with the same key in a list. This results in average
runtime complexity of O(1) for task_bp_pinned().

With the optimization, the benchmark shows:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.208 [sec]
 |
 |      108.422396 usecs/op
 |     6939.033333 usecs/op/cpu

On this particular setup that's a speedup of ~1135x.

While one option would be to make task_struct a breakpoint list node,
this would only further bloat task_struct for infrequently used data.
Furthermore, after all optimizations in this series, there's no evidence
it would result in better performance: later optimizations make the time
spent looking up entries in the hash table negligible (we'll reach the
theoretical ideal performance i.e. no constraints).

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com
2022-08-30 10:56:21 +02:00
Sandipan Das
93315e46b0 perf/core: Add speculation info to branch entries
Add a new "spec" bitfield to branch entries for providing speculation
information. This will be populated using hints provided by branch sampling
features on supported hardware. The following cases are covered:

  * No branch speculation information is available
  * Branch is speculative but taken on the wrong path
  * Branch is non-speculative but taken on the correct path
  * Branch is speculative and taken on the correct path

Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/834088c302faf21c7b665031dd111f424e509a64.1660211399.git.sandipan.das@amd.com
2022-08-27 00:05:45 +02:00
Namhyung Kim
119a784c81 perf/core: Add a new read format to get a number of lost samples
Sometimes we want to know an accurate number of samples even if it's
lost.  Currenlty PERF_RECORD_LOST is generated for a ring-buffer which
might be shared with other events.  So it's hard to know per-event
lost count.

Add event->lost_samples field and PERF_FORMAT_LOST to retrieve it from
userspace.

Original-patch-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220616180623.1358843-1-namhyung@kernel.org
2022-06-28 09:08:31 +02:00
Stephane Eranian
2a606a18cd ACPI: Add perf low power callback
Add an optional callback needed by some PMU features, e.g., AMD
BRS, to give a chance to the perf_events code to change its state before
a CPU goes to low power and after it comes back.

The callback is void when the PERF_NEEDS_LOPWR_CB flag is not set.
This flag must be set in arch specific perf_event.h header whenever needed.
When not set, there is no impact on the ACPI code.

Signed-off-by: Stephane Eranian <eranian@google.com>
[peterz: build fix]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220322221517.2510440-9-eranian@google.com
2022-04-05 10:24:38 +02:00