Commit Graph

1621 Commits

Author SHA1 Message Date
Peter Zijlstra
76a4efa809 perf/arch: Remove perf_sample_data::regs_user_copy
struct perf_sample_data lives on-stack, we should be careful about it's
size. Furthermore, the pt_regs copy in there is only because x86_64 is a
trainwreck, solve it differently.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030151955.258178461@infradead.org
2020-11-09 18:12:34 +01:00
Peter Zijlstra
09da9c8125 perf: Optimize get_recursion_context()
"Look ma, no branches!"

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lkml.kernel.org/r/20201030151955.187580298@infradead.org
2020-11-09 18:12:34 +01:00
Peter Zijlstra
ce0f17fc93 perf: Fix get_recursion_context()
One should use in_serving_softirq() to detect SoftIRQ context.

Fixes: 96f6d44443 ("perf_counter: avoid recursion")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201030151955.120572175@infradead.org
2020-11-09 18:12:34 +01:00
Peter Zijlstra
267fb27352 perf: Reduce stack usage of perf_output_begin()
__perf_output_begin() has an on-stack struct perf_sample_data in the
unlikely case it needs to generate a LOST record. However, every call
to perf_output_begin() must already have a perf_sample_data on-stack.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201030151954.985416146@infradead.org
2020-11-09 18:12:33 +01:00
kiyin(尹亮)
7bdb157cde perf/core: Fix a memory leak in perf_event_parse_addr_filter()
As shown through runtime testing, the "filename" allocation is not
always freed in perf_event_parse_addr_filter().

There are three possible ways that this could happen:

 - It could be allocated twice on subsequent iterations through the loop,
 - or leaked on the success path,
 - or on the failure path.

Clean up the code flow to make it obvious that 'filename' is always
freed in the reallocation path and in the two return paths as well.

We rely on the fact that kfree(NULL) is NOP and filename is initialized
with NULL.

This fixes the leak. No other side effects expected.

[ Dan Carpenter: cleaned up the code flow & added a changelog. ]
[ Ingo Molnar: updated the changelog some more. ]

Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Signed-off-by: "kiyin(尹亮)" <kiyin@tencent.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "Srivatsa S. Bhat" <srivatsa@csail.mit.edu>
Cc: Anthony Liguori <aliguori@amazon.com>
--
 kernel/events/core.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
2020-11-07 13:07:26 +01:00
Thomas Gleixner
01be83eea0 Merge branch 'core/urgent' into core/entry
Pick up the entry fix before further modifications.
2020-11-04 18:14:52 +01:00
Peter Zijlstra
51b646b2d9 perf,mm: Handle non-page-table-aligned hugetlbfs
A limited nunmber of architectures support hugetlbfs sizes that do not
align with the page-tables (ARM64, Power, Sparc64). Add support for
this to the generic perf_get_page_size() implementation, and also
allow an architecture to override this implementation.

This latter is only needed when it uses non-page-table aligned huge
pages in its kernel map.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29 11:00:39 +01:00
Stephane Eranian
995f088efe perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE
When studying code layout, it is useful to capture the page size of the
sampled code address.

Add a new sample type for code page size.
The new sample type requires collecting the ip. The code page size can
be calculated from the NMI-safe perf_get_page_size().

For large PEBS, it's very unlikely that the mapping is gone for the
earlier PEBS records. Enable the feature for the large PEBS. The worst
case is that page-size '0' is returned.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201001135749.2804-5-kan.liang@linux.intel.com
2020-10-29 11:00:39 +01:00
Kan Liang
8d97e71811 perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
Current perf can report both virtual addresses and physical addresses,
but not the MMU page size. Without the MMU page size information of the
utilized page, users cannot decide whether to promote/demote large pages
to optimize memory usage.

Add a new sample type for the data MMU page size.

Current perf already has a facility to collect data virtual addresses.
A page walker is required to walk the pages tables and calculate the
MMU page size from a given virtual address.

On some platforms, e.g., X86, the page walker is invoked in an NMI
handler. So the page walker must be NMI-safe and low overhead. Besides,
the page walker should work for both user and kernel virtual address.
The existing generic page walker, e.g., walk_page_range_novma(), is a
little bit complex and doesn't guarantee the NMI-safe. The follow_page()
is only for user-virtual address.

Add a new function perf_get_page_size() to walk the page tables and
calculate the MMU page size. In the function:
- Interrupts have to be disabled to prevent any teardown of the page
  tables.
- For user space threads, the current->mm is used for the page walker.
  For kernel threads and the like, the current->mm is NULL. The init_mm
  is used for the page walker. The active_mm is not used here, because
  it can be NULL.
  Quote from Peter Zijlstra,
  "context_switch() can set prev->active_mm to NULL when it transfers it
   to @next. It does this before @current is updated. So an NMI that
   comes in between this active_mm swizzling and updating @current will
   see !active_mm."
- The MMU page size is calculated from the page table level.

The method should work for all architectures, but it has only been
verified on X86. Should there be some architectures, which support perf,
where the method doesn't work, it can be fixed later separately.
Reporting the wrong page size would not be fatal for the architecture.

Some under discussion features may impact the method in the future.
Quote from Dave Hansen,
  "There are lots of weird things folks are trying to do with the page
   tables, like Address Space Isolation.  For instance, if you get a
   perf NMI when running userspace, current->mm->pgd is *different* than
   the PGD that was in use when userspace was running. It's close enough
   today, but it might not stay that way."
If the case happens later, lots of consecutive page walk errors will
happen. The worst case is that lots of page-size '0' are returned, which
would not be fatal.
In the perf tool, a check is implemented to detect this case. Once it
happens, a kernel patch could be implemented accordingly then.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201001135749.2804-2-kan.liang@linux.intel.com
2020-10-29 11:00:38 +01:00
Jens Axboe
5c251e9dc0 signal: Add task_sigpending() helper
This is in preparation for maintaining signal_pending() as the decider of
whether or not a schedule() loop should be broken, or continue sleeping.
This is different than the core signal use cases, which really need to know
whether an actual signal is pending or not. task_sigpending() returns
non-zero if TIF_SIGPENDING is set.

Only core kernel use cases should care about the distinction between
the two, make sure those use the task_sigpending() helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-2-axboe@kernel.dk
2020-10-29 09:37:36 +01:00
Jens Axboe
91989c7078 task_work: cleanup notification modes
A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (on notification sent) or 1, the latter
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

Clean this up properly, and define a proper enum for the notification
mode. Now we have:

- TWA_NONE. This is 0, same as before the original change, meaning no
  notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
  that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
  notification.

Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.

Fixes: e91b481623 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 15:05:30 -06:00
Linus Torvalds
3bff6112c8 These are the performance events changes for v5.10:
x86 Intel updates:
 
  - Add Jasper Lake support
 
  - Add support for TopDown metrics on Ice Lake
 
  - Fix Ice Lake & Tiger Lake uncore support, add Snow Ridge support
 
  - Add a PCI sub driver to support uncore PMUs where the PCI resources
    have been claimed already - extending the range of supported systems.
 
 x86 AMD updates:
 
  - Restore 'perf stat -a' behaviour to program the uncore PMU
    to count all CPU threads.
 
  - Fix setting the proper count when sampling Large Increment
    per Cycle events / 'paired' events.
 
  - Fix IBS Fetch sampling on F17h and some other IBS fine tuning,
    greatly reducing the number of interrupts when large sample
    periods are specified.
 
  - Extends Family 17h RAPL support to also work on compatible
    F19h machines.
 
 Core code updates:
 
  - Fix race in perf_mmap_close()
 
  - Add PERF_EV_CAP_SIBLING, to denote that sibling events should be
    closed if the leader is removed.
 
  - Smaller fixes and updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+Ef40RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h7NQ//ZdQ26Yg79ZaxBX1QSINJ9AgXDi6rXs75
 qU9qNwr/6EF+633RZoPQGAE0Iy5v6h7iLFokcJzM9+kK/rE3ax44tSnPlcMa0+6N
 SHXKCa5iL+hH7o2Spo2MZwCYseH79rloX3TSH7ajnN3X8PvwgWshF0lUE3WEWtCs
 eHSojdCk43IuL9TpusuNOBM2FvgnheFYWiMbFHd0MTBUMxul30sLVCG8IIWCPA+q
 TwG4RJS3X42VbL3SuAGFmOv4OmqNsfkvHvjpDs4NF07tRB9zjXzGrxmGhgSw0NAN
 2KK25qbmrpKATIb4Eqsgk/yikX/SCrDEXrjhg3r8FnyPvRfctq1crZjjf672PI2E
 bDda76dH6Lq9jv5fsyJjas5OsYdMKBCnA+tGQxXPGbmTXeEcYMRbDnwhYnevI/Q/
 8pP+xstF0pmBA3tvpDPrQnYH72Qt7CLJSdcTB15NqZftU2tJxaAyJGx4gJy33jxQ
 wu6BIEGHQ7onQYiIyTwsBHyz6xNsF/CRHwAPcGdYrRRbXB5K5nxHiXNb4awciTMx
 2HF31/S4OqURNpfcpxOQo+1fb/cLqj3loGqE4jCTwkbS3lrHcAcfxyv9QNn77l1f
 hdQ0jworbUNVLUYEUQz1bkZ06GD3LSSas2ZlY1NNdHo62mjyXMQmgirNcZmrFgWl
 tl2gNFAU9x4=
 =2fuY
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "x86 Intel updates:

   - Add Jasper Lake support

   - Add support for TopDown metrics on Ice Lake

   - Fix Ice Lake & Tiger Lake uncore support, add Snow Ridge support

   - Add a PCI sub driver to support uncore PMUs where the PCI resources
     have been claimed already - extending the range of supported
     systems.

  x86 AMD updates:

   - Restore 'perf stat -a' behaviour to program the uncore PMU to count
     all CPU threads.

   - Fix setting the proper count when sampling Large Increment per
     Cycle events / 'paired' events.

   - Fix IBS Fetch sampling on F17h and some other IBS fine tuning,
     greatly reducing the number of interrupts when large sample periods
     are specified.

   - Extends Family 17h RAPL support to also work on compatible F19h
     machines.

  Core code updates:

   - Fix race in perf_mmap_close()

   - Add PERF_EV_CAP_SIBLING, to denote that sibling events should be
     closed if the leader is removed.

   - Smaller fixes and updates"

* tag 'perf-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  perf/core: Fix race in the perf_mmap_close() function
  perf/x86: Fix n_metric for cancelled txn
  perf/x86: Fix n_pair for cancelled txn
  x86/events/amd/iommu: Fix sizeof mismatch
  perf/x86/intel: Check perf metrics feature for each CPU
  perf/x86/intel: Fix Ice Lake event constraint table
  perf/x86/intel/uncore: Fix the scale of the IMC free-running events
  perf/x86/intel/uncore: Fix for iio mapping on Skylake Server
  perf/x86/msr: Add Jasper Lake support
  perf/x86/intel: Add Jasper Lake support
  perf/x86/intel/uncore: Reduce the number of CBOX counters
  perf/x86/intel/uncore: Update Ice Lake uncore units
  perf/x86/intel/uncore: Split the Ice Lake and Tiger Lake MSR uncore support
  perf/x86/intel/uncore: Support PCIe3 unit on Snow Ridge
  perf/x86/intel/uncore: Generic support for the PCI sub driver
  perf/x86/intel/uncore: Factor out uncore_pci_pmu_unregister()
  perf/x86/intel/uncore: Factor out uncore_pci_pmu_register()
  perf/x86/intel/uncore: Factor out uncore_pci_find_dev_pmu()
  perf/x86/intel/uncore: Factor out uncore_pci_get_dev_die_info()
  perf/amd/uncore: Inform the user how many counters each uncore PMU has
  ...
2020-10-12 14:14:35 -07:00
Jiri Olsa
f91072ed1b perf/core: Fix race in the perf_mmap_close() function
There's a possible race in perf_mmap_close() when checking ring buffer's
mmap_count refcount value. The problem is that the mmap_count check is
not atomic because we call atomic_dec() and atomic_read() separately.

  perf_mmap_close:
  ...
   atomic_dec(&rb->mmap_count);
   ...
   if (atomic_read(&rb->mmap_count))
      goto out_put;

   <ring buffer detach>
   free_uid

out_put:
  ring_buffer_put(rb); /* could be last */

The race can happen when we have two (or more) events sharing same ring
buffer and they go through atomic_dec() and then they both see 0 as refcount
value later in atomic_read(). Then both will go on and execute code which
is meant to be run just once.

The code that detaches ring buffer is probably fine to be executed more
than once, but the problem is in calling free_uid(), which will later on
demonstrate in related crashes and refcount warnings, like:

  refcount_t: addition on 0; use-after-free.
  ...
  RIP: 0010:refcount_warn_saturate+0x6d/0xf
  ...
  Call Trace:
  prepare_creds+0x190/0x1e0
  copy_creds+0x35/0x172
  copy_process+0x471/0x1a80
  _do_fork+0x83/0x3a0
  __do_sys_wait4+0x83/0x90
  __do_sys_clone+0x85/0xa0
  do_syscall_64+0x5b/0x1e0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using atomic decrease and check instead of separated calls.

Tested-by: Michael Petlan <mpetlan@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Wade Mealing <wmealing@redhat.com>
Fixes: 9bb5d40cd9 ("perf: Fix mmap() accounting hole");
Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava
2020-10-12 13:24:26 +02:00
Kajol Jain
6d6b8b9f4f perf: Fix task_function_call() error handling
The error handling introduced by commit:

  2ed6edd33a ("perf: Add cond_resched() to task_function_call()")

looses any return value from smp_call_function_single() that is not
{0, -EINVAL}. This is a problem because it will return -EXNIO when the
target CPU is offline. Worse, in that case it'll turn into an infinite
loop.

Fixes: 2ed6edd33a ("perf: Add cond_resched() to task_function_call()")
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Barret Rhoden <brho@google.com>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20200827064732.20860-1-kjain@linux.ibm.com
2020-10-09 08:18:33 +02:00
Kan Liang
44fae179ce perf/core: Pull pmu::sched_task() into perf_event_context_sched_out()
The pmu::sched_task() is a context switch callback. It passes the
cpuctx->task_ctx as a parameter to the lower code. To find the
cpuctx->task_ctx, the current code iterates a cpuctx list.
The same context will iterated in perf_event_context_sched_out() soon.
Share the cpuctx->task_ctx can avoid the unnecessary iteration of the
cpuctx list.

The pmu::sched_task() is also required for the optimization case for
equivalent contexts.

The task_ctx_sched_out() will eventually disable and reenable the PMU
when schedule out events. Add perf_pmu_disable() and perf_pmu_enable()
around task_ctx_sched_out() don't break anything.

Drop the cpuctx->ctx.lock for the pmu::sched_task(). The lock is for
per-CPU context, which is not necessary for the per-task context
schedule.

No one uses sched_cb_entry, perf_sched_cb_usages, sched_cb_list, and
perf_pmu_sched_task() any more.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200821195754.20159-2-kan.liang@linux.intel.com
2020-09-10 11:19:34 +02:00
Kan Liang
556cccad38 perf/core: Pull pmu::sched_task() into perf_event_context_sched_in()
The pmu::sched_task() is a context switch callback. It passes the
cpuctx->task_ctx as a parameter to the lower code. To find the
cpuctx->task_ctx, the current code iterates a cpuctx list.

The same context was just iterated in perf_event_context_sched_in(),
which is invoked right before the pmu::sched_task().

Reuse the cpuctx->task_ctx from perf_event_context_sched_in() can avoid
the unnecessary iteration of the cpuctx list.

Both pmu::sched_task and perf_event_context_sched_in() have to disable
PMU. Pull the pmu::sched_task into perf_event_context_sched_in() can
also save the overhead from the PMU disable and reenable.

The new and old tasks may have equivalent contexts. The current code
optimize this case by swapping the context, which avoids the scheduling.
For this case, pmu::sched_task() is still required, e.g., restore the
LBR content.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200821195754.20159-1-kan.liang@linux.intel.com
2020-09-10 11:19:34 +02:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Hugh Dickins
c17c3dc9d0 uprobes: __replace_page() avoid BUG in munlock_vma_page()
syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), when
called from uprobes __replace_page().  Which of many ways to fix it?
Settled on not calling when PageCompound (since Head and Tail are equals
in this context, PageCompound the usual check in uprobes.c, and the prior
use of FOLL_SPLIT_PMD will have cleared PageMlocked already).

Fixes: 5a52c9df62 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>	[5.4+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008161338360.20413@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-21 09:52:53 -07:00
Kan Liang
9f0c4fa111 perf/core: Add a new PERF_EV_CAP_SIBLING event capability
Current perf assumes that events in a group are independent. Close an
event doesn't impact the value of the other events in the same group.
If the closed event is a member, after the event closure, other events
are still running like a group. If the closed event is a leader, other
events are running as singleton events.

Add PERF_EV_CAP_SIBLING to allow events to indicate they require being
part of a group, and when the leader dies they cannot exist
independently.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-8-kan.liang@linux.intel.com
2020-08-18 16:34:36 +02:00
Linus Torvalds
7f5faaaa59 Misc fixes, an expansion of perf syscall access to CAP_PERFMON privileged tools,
plus a RAPL HW-enablement for Intel SPR platforms.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl83xBQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gb9RAArM0jJemRPHv1a/xLhrRo/cKURrOWpNl0
 OQtgppEv9axkavYL34eyoax4LTDCFXxE+NDClSC16abFEVPNriGODNE+CMbFgMbW
 AyDfP+AsDdNExwl+JWR+J37KIpEIzWLqtjzEjVxZqsuov3C+EaLU4gv947UFohxM
 QE93d8q3znBSdMjeC/aZyL8iX4aCB0oMjrP7BMXo9a61/oseKLnvE8Zu/ESFDe1S
 TYZ+VlCxyaZOUBkEyd8+h/CBL8kOvQ2ObBEBxmyQQdGuRZ20BcJRodk3g+mOdnHJ
 zeohRcXvIHskHTEVeQv+Eh4EitFT3bEFrbk0LwMhKubIhFTKIB42sAzyeC6iUGc/
 O5+Qe+bn3kYMynMHNo1yfh0s0S3cU3cfBnC1I2A/NyAn49H0UPr+rjynuKHtCA1+
 S36Q9BydZegU/jyhbbDs+h/cdOiKY2F3MPEAZg3u/7EM+NIrmvuQoA7+C33fmLA+
 tZzpeDpqNKz65JgYDQ2sZdghyVp41KTogeTm6Xu5O3sLhCnATiyqL2z2LCoWj+yZ
 KuZ+zHtV8ajRwt1bhq7qFUIyQLsHHUlz5z7TiUC7qqB48LpxO7LiTZ7CxUDY432N
 Xz8QPD/D71HAWmbkAXUih+JXG0nQSdlF6Xpwquczqc/8odJ46xdQ+i5wIgBOcudP
 A+kEXRqz5rA=
 =NsxB
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Misc fixes, an expansion of perf syscall access to CAP_PERFMON
  privileged tools, plus a RAPL HW-enablement for Intel SPR platforms"

* tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/rapl: Add support for Intel SPR platform
  perf/x86/rapl: Support multiple RAPL unit quirks
  perf/x86/rapl: Fix missing psys sysfs attributes
  hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration
  kprobes: Remove show_registers() function prototype
  perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
2020-08-15 10:34:24 -07:00
Peter Xu
64019a2e46 mm/gup: remove task_struct pointer for all gup code
After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more.  Remove that parameter in the whole gup
stack.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:04 -07:00
Christoph Hellwig
3d13f313ce uaccess: add force_uaccess_{begin,end} helpers
Add helpers to wrap the get_fs/set_fs magic for undoing any damange done
by set_fs(KERNEL_DS).  There is no real functional benefit, but this
documents the intent of these calls better, and will allow stubbing the
functions out easily for kernels builds that do not allow address space
overrides in the future.

[hch@lst.de: drop two incorrect hunks, fix a commit log typo]
  Link: http://lkml.kernel.org/r/20200714105505.935079-6-hch@lst.de

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Link: http://lkml.kernel.org/r/20200710135706.537715-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:59 -07:00
Joonsoo Kim
b518154e59 mm/vmscan: protect the workingset on anonymous LRU
In current implementation, newly created or swap-in anonymous page is
started on active list.  Growing active list results in rebalancing
active/inactive list so old pages on active list are demoted to inactive
list.  Hence, the page on active list isn't protected at all.

Following is an example of this situation.

Assume that 50 hot pages on active list.  Numbers denote the number of
pages on active/inactive list (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.  Like as file LRU, newly created or
swap-in anonymous pages will be inserted to the inactive list.  They are
promoted to active list if enough reference happens.  This simple
modification changes the above example as following.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, hot pages on active list would be protected.

Note that, this implementation has a drawback that the page cannot be
promoted and will be swapped-out if re-access interval is greater than the
size of inactive list but less than the size of total(active+inactive).
To solve this potential issue, following patch will apply workingset
detection similar to the one that's already applied to file LRU.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:55 -07:00
Alexey Budankov
45fd22da97 perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
Open access to per-process monitoring for CAP_PERFMON only
privileged processes [1]. Extend ptrace_may_access() check
in perf_events subsystem with perfmon_capable() to simplify
user experience and make monitoring more secure by reducing
attack surface.

[1] https://lore.kernel.org/lkml/7776fa40-6c65-2aa6-1322-eb3a01201000@linux.intel.com/

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/6e8392ff-4732-0012-2949-e1587709f0f6@linux.intel.com
2020-08-06 15:03:20 +02:00
Linus Torvalds
47ec5303d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.

 2) Support UDP segmentation in code TSO code, from Eric Dumazet.

 3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

 4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

 6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

 8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

10) Switch qca8k driver over to phylink, from Jonathan McDowell.

11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

12) Several conversions over to generic power management, from Vaibhav
    Gupta.

13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

14) Various https url conversions, from Alexander A. Klimov.

15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

20) XDP support for xen-netfront, from Denis Kirjanov.

21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

22) Support EF100 chip in sfc driver, from Edward Cree.

23) Add XDP support to mvpp2 driver, from Matteo Croce.

24) Support MPTCP in sock_diag, from Paolo Abeni.

25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

30) Add rfc4884 support to icmp code, from Willem de Bruijn.

31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

34) Support TCP syncookies in MPTCP, from Flowian Westphal.

35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
    Brivio.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
  net: thunderx: initialize VF's mailbox mutex before first usage
  usb: hso: remove bogus check for EINPROGRESS
  usb: hso: no complaint about kmalloc failure
  hso: fix bailout in error case of probe
  ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
  selftests/net: relax cpu affinity requirement in msg_zerocopy test
  mptcp: be careful on subflow creation
  selftests: rtnetlink: make kci_test_encap() return sub-test result
  selftests: rtnetlink: correct the final return value for the test
  net: dsa: sja1105: use detected device id instead of DT one on mismatch
  tipc: set ub->ifindex for local ipv6 address
  ipv6: add ipv6_dev_find()
  net: openvswitch: silence suspicious RCU usage warning
  Revert "vxlan: fix tos value before xmit"
  ptp: only allow phase values lower than 1 period
  farsync: switch from 'pci_' to 'dma_' API
  wan: wanxl: switch from 'pci_' to 'dma_' API
  hv_netvsc: do not use VF device if link is down
  dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
  net: macb: Properly handle phylink on at91sam9x
  ...
2020-08-05 20:13:21 -07:00
Linus Torvalds
99ea1521a0 Remove uninitialized_var() macro for v5.9-rc1
- Clean up non-trivial uses of uninitialized_var()
 - Update documentation and checkpatch for uninitialized_var() removal
 - Treewide removal of uninitialized_var()
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq
 il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o
 XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp
 UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2
 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8
 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI
 h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z
 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I
 AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG
 k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa
 dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+
 drhmmImUr1YylrtVOw==
 =xHmk
 -----END PGP SIGNATURE-----

Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull uninitialized_var() macro removal from Kees Cook:
 "This is long overdue, and has hidden too many bugs over the years. The
  series has several "by hand" fixes, and then a trivial treewide
  replacement.

   - Clean up non-trivial uses of uninitialized_var()

   - Update documentation and checkpatch for uninitialized_var() removal

   - Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  compiler: Remove uninitialized_var() macro
  treewide: Remove uninitialized_var() usage
  checkpatch: Remove awareness of uninitialized_var() macro
  mm/debug_vm_pgtable: Remove uninitialized_var() usage
  f2fs: Eliminate usage of uninitialized_var() macro
  media: sur40: Remove uninitialized_var() usage
  KVM: PPC: Book3S PR: Remove uninitialized_var() usage
  clk: spear: Remove uninitialized_var() usage
  clk: st: Remove uninitialized_var() usage
  spi: davinci: Remove uninitialized_var() usage
  ide: Remove uninitialized_var() usage
  rtlwifi: rtl8192cu: Remove uninitialized_var() usage
  b43: Remove uninitialized_var() usage
  drbd: Remove uninitialized_var() usage
  x86/mm/numa: Remove uninitialized_var() usage
  docs: deprecated.rst: Add uninitialized_var()
2020-08-04 13:49:43 -07:00
Ingo Molnar
e89d4ca1df Linux 5.8-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8d8h4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGd0sH/2iktYhMwPxzzpnb
 eI3OuTX/mRn4vUFOfpx9dmGVleMfKkpbvnn3IY7wA62Qfv7J7lkFRa1Bd1DlqXfW
 yyGTGDSKG5chiRCOU3s9ni92M4xIzFlrojyt/dIK2lUGMzUPI9FGlZRGQLKqqwLh
 2syOXRWbcQ7e52IHtDSy3YBNveKRsP4NyqV+GxGiex18SMB/M3Pw9EMH614eDPsE
 QAGQi5uGv4hPJtFHgXgUyBPLFHIyFAiVxhFRIj7u2DSEKY79+wO1CGWFiFvdTY4B
 CbqKXLffY3iQdFsLJkj9Dl8cnOQnoY44V0EBzhhORxeOp71StUVaRwQMFa5tp48G
 171s5Hs=
 =BQIl
 -----END PGP SIGNATURE-----

Merge tag 'v5.8-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-28 13:18:01 +02:00
Song Liu
5d99cb2c86 bpf: Fail PERF_EVENT_IOC_SET_BPF when bpf_get_[stack|stackid] cannot work
bpf_get_[stack|stackid] on perf_events with precise_ip uses callchain
attached to perf_sample_data. If this callchain is not presented, do not
allow attaching BPF program that calls bpf_get_[stack|stackid] to this
event.

In the error case, -EPROTO is returned so that libbpf can identify this
error and print proper hint message.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723180648.1429892-3-songliubraving@fb.com
2020-07-25 20:16:34 -07:00
David S. Miller
a57066b1a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The UDP reuseport conflict was a little bit tricky.

The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.

At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.

This requires moving the reuseport_has_conns() logic into the callers.

While we are here, get rid of inline directives as they do not belong
in foo.c files.

The other changes were cases of more straightforward overlapping
modifications.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25 17:49:04 -07:00
Oleg Nesterov
fe5ed7ab99 uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix GDB regression
If a tracee is uprobed and it hits int3 inserted by debugger, handle_swbp()
does send_sig(SIGTRAP, current, 0) which means si_code == SI_USER. This used
to work when this code was written, but then GDB started to validate si_code
and now it simply can't use breakpoints if the tracee has an active uprobe:

	# cat test.c
	void unused_func(void)
	{
	}
	int main(void)
	{
		return 0;
	}

	# gcc -g test.c -o test
	# perf probe -x ./test -a unused_func
	# perf record -e probe_test:unused_func gdb ./test -ex run
	GNU gdb (GDB) 10.0.50.20200714-git
	...
	Program received signal SIGTRAP, Trace/breakpoint trap.
	0x00007ffff7ddf909 in dl_main () from /lib64/ld-linux-x86-64.so.2
	(gdb)

The tracee hits the internal breakpoint inserted by GDB to monitor shared
library events but GDB misinterprets this SIGTRAP and reports a signal.

Change handle_swbp() to use force_sig(SIGTRAP), this matches do_int3_user()
and fixes the problem.

This is the minimal fix for -stable, arch/x86/kernel/uprobes.c is equally
wrong; it should use send_sigtrap(TRAP_TRACE) instead of send_sig(SIGTRAP),
but this doesn't confuse GDB and needs another x86-specific patch.

Reported-by: Aaron Merey <amerey@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200723154420.GA32043@redhat.com
2020-07-24 15:38:37 +02:00
Kees Cook
3f649ab728 treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:35:15 -07:00
Kan Liang
5a09928d33 perf/x86: Remove task_ctx_size
A new kmem_cache method has replaced the kzalloc() to allocate the PMU
specific data. The task_ctx_size is not required anymore.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-19-git-send-email-kan.liang@linux.intel.com
2020-07-08 11:38:55 +02:00
Kan Liang
217c2a633e perf/core: Use kmem_cache to allocate the PMU specific data
Currently, the PMU specific data task_ctx_data is allocated by the
function kzalloc() in the perf generic code. When there is no specific
alignment requirement for the task_ctx_data, the method works well for
now. However, there will be a problem once a specific alignment
requirement is introduced in future features, e.g., the Architecture LBR
XSAVE feature requires 64-byte alignment. If the specific alignment
requirement is not fulfilled, the XSAVE family of instructions will fail
to save/restore the xstate to/from the task_ctx_data.

The function kzalloc() itself only guarantees a natural alignment. A
new method to allocate the task_ctx_data has to be introduced, which
has to meet the requirements as below:
- must be a generic method can be used by different architectures,
  because the allocation of the task_ctx_data is implemented in the
  perf generic code;
- must be an alignment-guarantee method (The alignment requirement is
  not changed after the boot);
- must be able to allocate/free a buffer (smaller than a page size)
  dynamically;
- should not cause extra CPU overhead or space overhead.

Several options were considered as below:
- One option is to allocate a larger buffer for task_ctx_data. E.g.,
    ptr = kmalloc(size + alignment, GFP_KERNEL);
    ptr &= ~(alignment - 1);
  This option causes space overhead.
- Another option is to allocate the task_ctx_data in the PMU specific
  code. To do so, several function pointers have to be added. As a
  result, both the generic structure and the PMU specific structure
  will become bigger. Besides, extra function calls are added when
  allocating/freeing the buffer. This option will increase both the
  space overhead and CPU overhead.
- The third option is to use a kmem_cache to allocate a buffer for the
  task_ctx_data. The kmem_cache can be created with a specific alignment
  requirement by the PMU at boot time. A new pointer for kmem_cache has
  to be added in the generic struct pmu, which would be used to
  dynamically allocate a buffer for the task_ctx_data at run time.
  Although the new pointer is added to the struct pmu, the existing
  variable task_ctx_size is not required anymore. The size of the
  generic structure is kept the same.

The third option which meets all the aforementioned requirements is used
to replace kzalloc() for the PMU specific data allocation. A later patch
will remove the kzalloc() method and the related variables.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-17-git-send-email-kan.liang@linux.intel.com
2020-07-08 11:38:55 +02:00
Kan Liang
ff9ff92688 perf/core: Factor out functions to allocate/free the task_ctx_data
The method to allocate/free the task_ctx_data is going to be changed in
the following patch. Currently, the task_ctx_data is allocated/freed in
several different places. To avoid repeatedly modifying the same codes
in several different places, alloc_task_ctx_data() and
free_task_ctx_data() are factored out to allocate/free the
task_ctx_data. The modification only needs to be applied once.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-16-git-send-email-kan.liang@linux.intel.com
2020-07-08 11:38:54 +02:00
Song Liu
d141b8bc57 perf: Expose get/put_callchain_entry()
Sanitize and expose get/put_callchain_entry(). This would be used by bpf
stack map.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630062846.664389-2-songliubraving@fb.com
2020-07-01 08:22:08 -07:00
Adrian Hunter
e17d43b93e perf: Add perf text poke event
Record (single instruction) changes to the kernel text (i.e.
self-modifying code) in order to support tracers like Intel PT and
ARM CoreSight.

A copy of the running kernel code is needed as a reference point (e.g.
from /proc/kcore). The text poke event records the old bytes and the
new bytes so that the event can be processed forwards or backwards.

The basic problem is recording the modified instruction in an
unambiguous manner given SMP instruction cache (in)coherence. That is,
when modifying an instruction concurrently any solution with one or
multiple timestamps is not sufficient:

	CPU0				CPU1
 0
 1	write insn A
 2					execute insn A
 3	sync-I$
 4

Due to I$, CPU1 might execute either the old or new A. No matter where
we record tracepoints on CPU0, one simply cannot tell what CPU1 will
have observed, except that at 0 it must be the old one and at 4 it
must be the new one.

To solve this, take inspiration from x86 text poking, which has to
solve this exact problem due to variable length instruction encoding
and I-fetch windows.

 1) overwrite the instruction with a breakpoint and sync I$

This guarantees that that code flow will never hit the target
instruction anymore, on any CPU (or rather, it will cause an
exception).

 2) issue the TEXT_POKE event

 3) overwrite the breakpoint with the new instruction and sync I$

Now we know that any execution after the TEXT_POKE event will either
observe the breakpoint (and hit the exception) or the new instruction.

So by guarding the TEXT_POKE event with an exception on either side;
we can now tell, without doubt, which instruction another CPU will
have observed.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512121922.8997-2-adrian.hunter@intel.com
2020-06-15 14:09:48 +02:00
Linus Torvalds
a5ad5742f6 Merge branch 'akpm' (patches from Andrew)
Merge even more updates from Andrew Morton:

 - a kernel-wide sweep of show_stack()

 - pagetable cleanups

 - abstract out accesses to mmap_sem - prep for mmap_sem scalability work

 - hch's user acess work

Subsystems affected by this patch series: debug, mm/pagemap, mm/maccess,
mm/documentation.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (93 commits)
  include/linux/cache.h: expand documentation over __read_mostly
  maccess: return -ERANGE when probe_kernel_read() fails
  x86: use non-set_fs based maccess routines
  maccess: allow architectures to provide kernel probing directly
  maccess: move user access routines together
  maccess: always use strict semantics for probe_kernel_read
  maccess: remove strncpy_from_unsafe
  tracing/kprobes: handle mixed kernel/userspace probes better
  bpf: rework the compat kernel probe handling
  bpf:bpf_seq_printf(): handle potentially unsafe format string better
  bpf: handle the compat string in bpf_trace_copy_string better
  bpf: factor out a bpf_trace_copy_string helper
  maccess: unify the probe kernel arch hooks
  maccess: remove probe_read_common and probe_write_common
  maccess: rename strnlen_unsafe_user to strnlen_user_nofault
  maccess: rename strncpy_from_unsafe_strict to strncpy_from_kernel_nofault
  maccess: rename strncpy_from_unsafe_user to strncpy_from_user_nofault
  maccess: update the top of file comment
  maccess: clarify kerneldoc comments
  maccess: remove duplicate kerneldoc comments
  ...
2020-06-09 09:54:46 -07:00
Oleg Nesterov
013b2deba9 uprobes: ensure that uprobe->offset and ->ref_ctr_offset are properly aligned
uprobe_write_opcode() must not cross page boundary; prepare_uprobe()
relies on arch_uprobe_analyze_insn() which should validate "vaddr" but
some architectures (csky, s390, and sparc) don't do this.

We can remove the BUG_ON() check in prepare_uprobe() and validate the
offset early in __uprobe_register(). The new IS_ALIGNED() check matches
the alignment check in arch_prepare_kprobe() on supported architectures,
so I think that all insns must be aligned to UPROBE_SWBP_INSN_SIZE.

Another problem is __update_ref_ctr() which was wrong from the very
beginning, it can read/write outside of kmap'ed page unless "vaddr" is
aligned to sizeof(short), __uprobe_register() should check this too.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:49:24 -07:00
Michel Lespinasse
c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Christoph Hellwig
885f7f8e30 mm: rename flush_icache_user_range to flush_icache_user_page
The function currently known as flush_icache_user_range only operates on
a single page.  Rename it to flush_icache_user_page as we'll need the
name flush_icache_user_range for something else soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20200515143646.3857579-20-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:58 -07:00
Souptick Joarder
dadbb612f6 mm/gup.c: convert to use get_user_{page|pages}_fast_only()
API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
align with pin_user_pages_fast_only().

As part of this we will get rid of write parameter.  Instead caller will
pass FOLL_WRITE to get_user_pages_fast_only().  This will not change any
existing functionality of the API.

All the callers are changed to pass FOLL_WRITE.

Also introduce get_user_page_fast_only(), and use it in a few places
that hard-code nr_pages to 1.

Updated the documentation of the API.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>		[arch/powerpc/kvm]
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Suchanek <msuchanek@suse.de>
Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Linus Torvalds
7ae77150d9 powerpc updates for 5.8
- Support for userspace to send requests directly to the on-chip GZIP
    accelerator on Power9.
 
  - Rework of our lockless page table walking (__find_linux_pte()) to make it
    safe against parallel page table manipulations without relying on an IPI for
    serialisation.
 
  - A series of fixes & enhancements to make our machine check handling more
    robust.
 
  - Lots of plumbing to add support for "prefixed" (64-bit) instructions on
    Power10.
 
  - Support for using huge pages for the linear mapping on 8xx (32-bit).
 
  - Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound driver.
 
  - Removal of some obsolete 40x platforms and associated cruft.
 
  - Initial support for booting on Power10.
 
  - Lots of other small features, cleanups & fixes.
 
 Thanks to:
   Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Andrey Abramov,
   Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent Abali, Cédric Le
   Goater, Chen Zhou, Christian Zigotzky, Christophe JAILLET, Christophe Leroy,
   Dmitry Torokhov, Emmanuel Nicolet, Erhard F., Gautham R. Shenoy, Geoff Levand,
   George Spelvin, Greg Kurz, Gustavo A. R. Silva, Gustavo Walbon, Haren Myneni,
   Hari Bathini, Joel Stanley, Jordan Niethe, Kajol Jain, Kees Cook, Leonardo
   Bras, Madhavan Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael
   Neuling, Michal Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao,
   Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram
   Pai, Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
   Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler, Wolfram
   Sang, Xiongfeng Wang.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl7aYZ8THG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPiKD/9zNCuZLFMAFrIdbm0HlYA2RGYZFT75
 GUHsqYyei1pxA7PgM3KwJiXELVODsBv0eQbgNh1tbecKrxPRegN/cywd1KLjPZ7I
 v5/qweQP8MvR0RhzjbhvUcO0jq/f8u2LbJr5mUfVzjU6tAvrvcWo3oZqDElsekCS
 kgyOH3r1vZ2PLTMiGFhb0gWi2iqc+6BHU1AFCGPCMjB1Vu5d5+54VvZ/6lllGsOF
 yg9CBXmmVvQ+Bn6tH4zdEB78FYxnAIwBqlbmL79i5ca+HQJ0Sw6HuPRy9XYq35p6
 2EiXS4Wrgp7i7+1TN3HO362u5Onb8TSyQU7NS6yCFPoJ6JQxcJMBIw6mHhnXOPuZ
 CrjgcdwUMjx8uDoKmX1Epbfuex2w+AysW+4yBHPFiSgl3klKC3D0wi95mR485w2F
 rN8uzJtrDeFKcYZJG7IoB/cgFCCPKGf9HaXr8q0S/jBKMffx91ul3cfzlfdIXOCw
 FDNw/+ZX7UD6ddFEG12ZTO+vdL8yf1uCRT/DIZwUiDMIA0+M6F4nc7j3lfyZfoO1
 65f9UlhoLxScq7VH2fKH4UtZatO9cPID2z1CmiY4UbUIPtFDepSuYClgLF+Duf4b
 rkfxhKU0+Ja1zNH5XNc+L+Bc5/W4lFiJXz02dYIjtHoUpWkc1aToOETVwzggYFNM
 G3PXIBOI0jRgRw==
 =o0WU
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Support for userspace to send requests directly to the on-chip GZIP
   accelerator on Power9.

 - Rework of our lockless page table walking (__find_linux_pte()) to
   make it safe against parallel page table manipulations without
   relying on an IPI for serialisation.

 - A series of fixes & enhancements to make our machine check handling
   more robust.

 - Lots of plumbing to add support for "prefixed" (64-bit) instructions
   on Power10.

 - Support for using huge pages for the linear mapping on 8xx (32-bit).

 - Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound
   driver.

 - Removal of some obsolete 40x platforms and associated cruft.

 - Initial support for booting on Power10.

 - Lots of other small features, cleanups & fixes.

Thanks to: Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Andrey Abramov, Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent
Abali, Cédric Le Goater, Chen Zhou, Christian Zigotzky, Christophe
JAILLET, Christophe Leroy, Dmitry Torokhov, Emmanuel Nicolet, Erhard F.,
Gautham R. Shenoy, Geoff Levand, George Spelvin, Greg Kurz, Gustavo A.
R. Silva, Gustavo Walbon, Haren Myneni, Hari Bathini, Joel Stanley,
Jordan Niethe, Kajol Jain, Kees Cook, Leonardo Bras, Madhavan
Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Michal
Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram Pai,
Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler,
Wolfram Sang, Xiongfeng Wang.

* tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (299 commits)
  powerpc/pseries: Make vio and ibmebus initcalls pseries specific
  cxl: Remove dead Kconfig options
  powerpc: Add POWER10 architected mode
  powerpc/dt_cpu_ftrs: Add MMA feature
  powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
  powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
  powerpc: Add support for ISA v3.1
  powerpc: Add new HWCAP bits
  powerpc/64s: Don't set FSCR bits in INIT_THREAD
  powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
  powerpc/64s: Don't let DT CPU features set FSCR_DSCR
  powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
  powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUG
  powerpc/module_64: Use special stub for _mcount() with -mprofile-kernel
  powerpc/module_64: Simplify check for -mprofile-kernel ftrace relocations
  powerpc/module_64: Consolidate ftrace code
  powerpc/32: Disable KASAN with pages bigger than 16k
  powerpc/uaccess: Don't set KUEP by default on book3s/32
  powerpc/uaccess: Don't set KUAP by default on book3s/32
  powerpc/8xx: Reduce time spent in allow_user_access() and friends
  ...
2020-06-05 12:39:30 -07:00
Linus Torvalds
15a2bc4dbb Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "Last cycle for the Nth time I ran into bugs and quality of
  implementation issues related to exec that could not be easily be
  fixed because of the way exec is implemented. So I have been digging
  into exec and cleanup up what I can.

  I don't think I have exec sorted out enough to fix the issues I
  started with but I have made some headway this cycle with 4 sets of
  changes.

   - promised cleanups after introducing exec_update_mutex

   - trivial cleanups for exec

   - control flow simplifications

   - remove the recomputation of bprm->cred

  The net result is code that is a bit easier to understand and work
  with and a decrease in the number of lines of code (if you don't count
  the added tests)"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
  exec: Compute file based creds only once
  exec: Add a per bprm->file version of per_clear
  binfmt_elf_fdpic: fix execfd build regression
  selftests/exec: Add binfmt_script regression test
  exec: Remove recursion from search_binary_handler
  exec: Generic execfd support
  exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
  exec: Move the call of prepare_binprm into search_binary_handler
  exec: Allow load_misc_binary to call prepare_binprm unconditionally
  exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
  exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
  exec: Teach prepare_exec_creds how exec treats uids & gids
  exec: Set the point of no return sooner
  exec: Move handling of the point of no return to the top level
  exec: Run sync_mm_rss before taking exec_update_mutex
  exec: Fix spelling of search_binary_handler in a comment
  exec: Move the comment from above de_thread to above unshare_sighand
  exec: Rename flush_old_exec begin_new_exec
  exec: Move most of setup_new_exec into flush_old_exec
  exec: In setup_new_exec cache current in the local variable me
  ...
2020-06-04 14:07:08 -07:00
Linus Torvalds
ee01c4d72a Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "More mm/ work, plenty more to come

  Subsystems affected by this patch series: slub, memcg, gup, kasan,
  pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
  thp, mmap, kconfig"

* akpm: (131 commits)
  arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  riscv: support DEBUG_WX
  mm: add DEBUG_WX support
  drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
  powerpc/mm: drop platform defined pmd_mknotpresent()
  mm: thp: don't need to drain lru cache when splitting and mlocking THP
  hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
  sparc32: register memory occupied by kernel as memblock.memory
  include/linux/memblock.h: fix minor typo and unclear comment
  mm, mempolicy: fix up gup usage in lookup_node
  tools/vm/page_owner_sort.c: filter out unneeded line
  mm: swap: memcg: fix memcg stats for huge pages
  mm: swap: fix vmstats for huge pages
  mm: vmscan: limit the range of LRU type balancing
  mm: vmscan: reclaim writepage is IO cost
  mm: vmscan: determine anon/file pressure balance at the reclaim root
  mm: balance LRU lists based on relative thrashing
  mm: only count actual rotations as LRU reclaim cost
  ...
2020-06-03 20:24:15 -07:00
Johannes Weiner
d9eb1ea2bf mm: memcontrol: delete unused lrucare handling
Swapin faults were the last event to charge pages after they had already
been put on the LRU list.  Now that we charge directly on swapin, the
lrucare portion of the charge code is unused.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
9d82c69438 mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API
With the page->mapping requirement gone from memcg, we can charge anon and
file-thp pages in one single step, right after they're allocated.

This removes two out of three API calls - especially the tricky commit
step that needed to happen at just the right time between when the page is
"set up" and when it's "published" - somewhat vague and fluid concepts
that varied by page type.  All we need is a freshly allocated page and a
memcg context to charge.

v2: prevent double charges on pre-allocated hugepages in khugepaged

[hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
  Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
be5d0a74c6 mm: memcontrol: switch to native NR_ANON_MAPPED counter
Memcg maintains a private MEMCG_RSS counter.  This divergence from the
generic VM accounting means unnecessary code overhead, and creates a
dependency for memcg that page->mapping is set up at the time of charging,
so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and friends
to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED.  We use
lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
same way we do for NR_FILE_MAPPED.

With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
counter, this patch finally eliminates the need to have page->mapping set
up at charge time.  However, we need to have page->mem_cgroup set up by
the time rmap runs and does the accounting, so switch the commit and the
rmap callbacks around.

v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Johannes Weiner
3fba69a56e mm: memcontrol: drop @compound parameter from memcg charging API
The memcg charging API carries a boolean @compound parameter that tells
whether the page we're dealing with is a hugepage.
mem_cgroup_commit_charge() has another boolean @lrucare that indicates
whether the page needs LRU locking or not while charging.  The majority of
callsites know those parameters at compile time, which results in a lot of
naked "false, false" argument lists.  This makes for cryptic code and is a
breeding ground for subtle mistakes.

Thankfully, the huge page state can be inferred from the page itself and
doesn't need to be passed along.  This is safe because charging completes
before the page is published and somebody may split it.

Simplify the callsites by removing @compound, and let memcg infer the
state by using hpage_nr_pages() unconditionally.  That function does
PageTransHuge() to identify huge pages, which also helpfully asserts that
nobody passes in tail pages by accident.

The following patches will introduce a new charging API, best not to carry
over unnecessary weight.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Linus Torvalds
cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Ingo Molnar
0bffedbce9 Linux 5.7-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7K9iEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGzTAH/0ifZEG4BQ8x/WlB
 8YLSLE6QQTSXYi25nyExuJbFkkKY5Tik8M2HD/36xwY/HnZOlH9jH6m0ntqZxpaA
 3EU9lr1ct79nCBMYhiJssvz8d9AOZXlyogFW9y2y9pmPjlmUtseZ7yGh1xD465cj
 B5Ty2w2W34cs7zF3og2xn5agOJMtWWXLXZ5mRa9EOquKC5zeYyRicmd0T+plYQD6
 hbRYmxFfDfppVnBCBARPNN0+NU5JJD94H+8bOuf1tl48XNrLiZMOicmtohKNQ6+W
 rZNpJNEGEp7KMtqWH0Nl3hmy3yfZHMwe1DXM/AZDqR7jTHZY4mZ0GEpLyfI9AU4n
 34jVHwU=
 =SmJ9
 -----END PGP SIGNATURE-----

Merge tag 'v5.7-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 07:58:12 +02:00
Gustavo A. R. Silva
c50c75e9b8 perf/core: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor
2020-05-19 20:34:16 +02:00
Ravi Bangoria
29da4f91c0 powerpc/watchpoint: Don't allow concurrent perf and ptrace events
With Book3s DAWR, ptrace and perf watchpoints on powerpc behaves
differently. Ptrace watchpoint works in one-shot mode and generates
signal before executing instruction. It's ptrace user's job to
single-step the instruction and re-enable the watchpoint. OTOH, in
case of perf watchpoint, kernel emulates/single-steps the instruction
and then generates event. If perf and ptrace creates two events with
same or overlapping address ranges, it's ambiguous to decide who
should single-step the instruction. Because of this issue, don't
allow perf and ptrace watchpoint at the same time if their address
range overlaps.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-15-ravi.bangoria@linux.ibm.com
2020-05-19 00:14:45 +10:00
Eric W. Biederman
96ecee29b0 exec: Merge install_exec_creds into setup_new_exec
The two functions are now always called one right after the
other so merge them together to make future maintenance easier.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07 16:55:47 -05:00
Barret Rhoden
2ed6edd33a perf: Add cond_resched() to task_function_call()
Under rare circumstances, task_function_call() can repeatedly fail and
cause a soft lockup.

There is a slight race where the process is no longer running on the cpu
we targeted by the time remote_function() runs.  The code will simply
try again.  If we are very unlucky, this will continue to fail, until a
watchdog fires.  This can happen in a heavily loaded, multi-core virtual
machine.

Reported-by: syzbot+bb4935a5c09b5ff79940@syzkaller.appspotmail.com
Signed-off-by: Barret Rhoden <brho@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200414222920.121401-1-brho@google.com
2020-04-30 20:14:36 +02:00
Daniel Borkmann
0b54142e4b Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-28 21:23:38 +02:00
Christoph Hellwig
32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Ian Rogers
f3bed55e85 perf/core: fix parent pid/tid in task exit events
Current logic yields the child task as the parent.

Before:
$ perf record bash -c "perf list > /dev/null"
$ perf script -D |grep 'FORK\|EXIT'
4387036190981094 0x5a70 [0x30]: PERF_RECORD_FORK(10472:10472):(10470:10470)
4387036606207580 0xf050 [0x30]: PERF_RECORD_EXIT(10472:10472):(10472:10472)
4387036607103839 0x17150 [0x30]: PERF_RECORD_EXIT(10470:10470):(10470:10470)
                                                   ^
  Note the repeated values here -------------------/

After:
383281514043 0x9d8 [0x30]: PERF_RECORD_FORK(2268:2268):(2266:2266)
383442003996 0x2180 [0x30]: PERF_RECORD_EXIT(2268:2268):(2266:2266)
383451297778 0xb70 [0x30]: PERF_RECORD_EXIT(2266:2266):(2265:2265)

Fixes: 94d5d1b2d8 ("perf_counter: Report the cloning task as parent on perf_counter_fork()")
Reported-by: KP Singh <kpsingh@google.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200417182842.12522-1-irogers@google.com
2020-04-22 23:10:14 +02:00
Alexey Budankov
c9e0924e5c perf/core: open access to probes for CAP_PERFMON privileged process
Open access to monitoring via kprobes and uprobes and eBPF tracing for
CAP_PERFMON privileged process. Providing the access under CAP_PERFMON
capability singly, without the rest of CAP_SYS_ADMIN credentials,
excludes chances to misuse the credentials and makes operation more
secure.

perf kprobes and uprobes are used by ftrace and eBPF. perf probe uses
ftrace to define new kprobe events, and those events are treated as
tracepoint events. eBPF defines new probes via perf_event_open interface
and then the probes are used in eBPF tracing.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Cc: linux-man@vger.kernel.org
Link: http://lore.kernel.org/lkml/3c129d9a-ba8a-3483-ecc5-ad6c8e7c203f@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-04-16 12:19:08 -03:00
Alexey Budankov
18aa185662 perf/core: Open access to the core for CAP_PERFMON privileged process
Open access to monitoring of kernel code, CPUs, tracepoints and
namespaces data for a CAP_PERFMON privileged process. Providing the
access under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons the access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: linux-man@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/471acaef-bb8a-5ce2-923f-90606b78eef9@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-04-16 12:19:08 -03:00
Jiri Olsa
d3296fb372 perf/core: Disable page faults when getting phys address
We hit following warning when running tests on kernel
compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y:

 WARNING: CPU: 19 PID: 4472 at mm/gup.c:2381 __get_user_pages_fast+0x1a4/0x200
 CPU: 19 PID: 4472 Comm: dummy Not tainted 5.6.0-rc6+ #3
 RIP: 0010:__get_user_pages_fast+0x1a4/0x200
 ...
 Call Trace:
  perf_prepare_sample+0xff1/0x1d90
  perf_event_output_forward+0xe8/0x210
  __perf_event_overflow+0x11a/0x310
  __intel_pmu_pebs_event+0x657/0x850
  intel_pmu_drain_pebs_nhm+0x7de/0x11d0
  handle_pmi_common+0x1b2/0x650
  intel_pmu_handle_irq+0x17b/0x370
  perf_event_nmi_handler+0x40/0x60
  nmi_handle+0x192/0x590
  default_do_nmi+0x6d/0x150
  do_nmi+0x2f9/0x3c0
  nmi+0x8e/0xd7

While __get_user_pages_fast() is IRQ-safe, it calls access_ok(),
which warns on:

  WARN_ON_ONCE(!in_task() && !pagefault_disabled())

Peter suggested disabling page faults around __get_user_pages_fast(),
which gets rid of the warning in access_ok() call.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200407141427.3184722-1-jolsa@kernel.org
2020-04-08 11:33:46 +02:00
Ian Rogers
24fb6b8e7c perf/cgroup: Correct indirection in perf_less_group_idx()
The void* in perf_less_group_idx() is to a member in the array which points
at a perf_event*, as such it is a perf_event**.

Reported-By: John Sperbeck <jsperbeck@google.com>
Fixes: 6eef8a7116 ("perf/core: Use min_heap in visit_groups_merge()")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200321164331.107337-1-irogers@google.com
2020-04-08 11:33:45 +02:00
Peter Zijlstra
33238c5045 perf/core: Fix event cgroup tracking
Song reports that installing cgroup events is broken since:

  db0503e4f6 ("perf/core: Optimize perf_install_in_event()")

The problem being that cgroup events try to track cpuctx->cgrp even
for disabled events, which is pointless and actively harmful since the
above commit. Rework the code to have explicit enable/disable hooks
for cgroup events, such that we can limit cgroup tracking to active
events.

More specifically, since the above commit disabled events are no
longer added to their context from the 'right' CPU, and we can't
access things like the current cgroup for a remote CPU.

Cc: <stable@vger.kernel.org> # v5.5+
Fixes: db0503e4f6 ("perf/core: Optimize perf_install_in_event()")
Reported-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200318193337.GB20760@hirez.programming.kicks-ass.net
2020-04-08 11:33:44 +02:00
Anshuman Khandual
03911132aa mm/vma: replace all remaining open encodings with is_vm_hugetlb_page()
This replaces all remaining open encodings with is_vm_hugetlb_page().

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:37 -07:00
Linus Torvalds
c48b07226b perf updates all over the place:
core:
 
    - Support for cgroup tracking in samples to allow cgroup based
      analysis
 
  tools:
 
    - Support for cgroup analysis
 
    - Commandline option and hotkey for perf top to change the sort order
 
    - A set of fixes all over the place
 
    - Various build system related improvements
 
    - Updates of the X86 pmu event JSON data
 
    - Documentation updates
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6J2iATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXMEEACy6WaNabX7EzkLUnX0WCgxnZVSryNR
 4EnxLJSg5lChEe4q2mOS3mRMpzlXHQieWcFxlwda7kjIbFlgQvjJQUiYlAvxRRO/
 giY/GwCtTi/Flcb+7wKxTgMYmtAUOZDWeQbBZUlFLi9vyeCHVkjget9EyVsgbe/W
 EmRsrPuKOVMUTeEwm3zpIE051DObpiWLNge++My70q5W/yNsS94PbNydgKO7osqk
 pX37YVyBFpI2IQxMGzaE3yK7OxXRjYljZaz1tONFDMEYOX9gmxpDsCCflsP1ZOzL
 4/P4faRvAOElwxtYBelKmRl8eboqhRpTEK0Et0TI0LYbUZrE2nisDi0LTKPWQb0k
 Om2Qi6AfZs67PVzx9htlx6rfee72+sUluz5BDKOGH0pNJ8CFy70ns8InLsZqbBZ7
 SgFVNjx6bHxB58VuVE9WEzr/KVs6zI/SuJlH7WG7FLXbm5j0cETfCjg40JlDpSNh
 CZs+Epky1zyytrVJ9Gc7KnRlw8VB2eWEQ2cQ0sqj5w7WxhfrnsCmQf1zk4sofhOF
 iz3Dvb9Llz/pYWGZbEiQAuI+8bo0psJptidzpmpbIXs/woKDhW49w1FxqowJQMWe
 +lGWcauSfo3gjgEnTkOWAx3yiH4i9rvRChX8Jh0Z07d6Kwf19YYfxcuhlkx0Wutj
 eKaaErWDtWAxpA==
 =2UUD
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull more perf updates from Thomas Gleixner:
 "Perf updates all over the place:

  core:

   - Support for cgroup tracking in samples to allow cgroup based
     analysis

  tools:

   - Support for cgroup analysis

   - Commandline option and hotkey for perf top to change the sort order

   - A set of fixes all over the place

   - Various build system related improvements

   - Updates of the X86 pmu event JSON data

   - Documentation updates"

* tag 'perf-urgent-2020-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
  perf python: Fix clang detection to strip out options passed in $CC
  perf tools: Support Python 3.8+ in Makefile
  perf script: Fix invalid read of directory entry after closedir()
  perf script report: Fix SEGFAULT when using DWARF mode
  perf script: add -S/--symbols documentation
  perf pmu-events x86: Use CPU_CLK_UNHALTED.THREAD in Kernel_Utilization metric
  perf events parser: Add missing Intel CPU events to parser
  perf script: Allow --symbol to accept hexadecimal addresses
  perf report/top TUI: Fix title line formatting
  perf top: Support hotkey to change sort order
  perf top: Support --group-sort-idx to change the sort order
  perf symbols: Fix arm64 gap between kernel start and module end
  perf build-test: Honour JOBS to override detection of number of cores
  perf script: Add --show-cgroup-events option
  perf top: Add --all-cgroups option
  perf record: Add --all-cgroups option
  perf record: Support synthesizing cgroup events
  perf report: Add 'cgroup' sort key
  perf cgroup: Maintain cgroup hierarchy
  perf tools: Basic support for CGROUP event
  ...
2020-04-05 12:26:24 -07:00
Linus Torvalds
d987ca1c6b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exec/proc updates from Eric Biederman:
 "This contains two significant pieces of work: the work to sort out
  proc_flush_task, and the work to solve a deadlock between strace and
  exec.

  Fixing proc_flush_task so that it no longer requires a persistent
  mount makes improvements to proc possible. The removal of the
  persistent mount solves an old regression that that caused the hidepid
  mount option to only work on remount not on mount. The regression was
  found and reported by the Android folks. This further allows Alexey
  Gladkov's work making proc mount options specific to an individual
  mount of proc to move forward.

  The work on exec starts solving a long standing issue with exec that
  it takes mutexes of blocking userspace applications, which makes exec
  extremely deadlock prone. For the moment this adds a second mutex with
  a narrower scope that handles all of the easy cases. Which makes the
  tricky cases easy to spot. With a little luck the code to solve those
  deadlocks will be ready by next merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
  signal: Extend exec_id to 64bits
  pidfd: Use new infrastructure to fix deadlocks in execve
  perf: Use new infrastructure to fix deadlocks in execve
  proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  proc: Use new infrastructure to fix deadlocks in execve
  kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  kernel: doc: remove outdated comment cred.c
  mm: docs: Fix a comment in process_vm_rw_core
  selftests/ptrace: add test cases for dead-locks
  exec: Fix a deadlock in strace
  exec: Add exec_update_mutex to replace cred_guard_mutex
  exec: Move exec_mmap right after de_thread in flush_old_exec
  exec: Move cleanup of posix timers on exec out of de_thread
  exec: Factor unshare_sighand out of de_thread and call it separately
  exec: Only compute current once in flush_old_exec
  pid: Improve the comment about waiting in zap_pid_ns_processes
  proc: Remove the now unnecessary internal mount of proc
  uml: Create a private mount of proc for mconsole
  uml: Don't consult current to find the proc_mnt in mconsole_proc
  proc: Use a list of inodes to flush from proc
  ...
2020-04-02 11:22:17 -07:00
Linus Torvalds
29d9f30d4c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Fix the iwlwifi regression, from Johannes Berg.

   2) Support BSS coloring and 802.11 encapsulation offloading in
      hardware, from John Crispin.

   3) Fix some potential Spectre issues in qtnfmac, from Sergey
      Matyukevich.

   4) Add TTL decrement action to openvswitch, from Matteo Croce.

   5) Allow paralleization through flow_action setup by not taking the
      RTNL mutex, from Vlad Buslov.

   6) A lot of zero-length array to flexible-array conversions, from
      Gustavo A. R. Silva.

   7) Align XDP statistics names across several drivers for consistency,
      from Lorenzo Bianconi.

   8) Add various pieces of infrastructure for offloading conntrack, and
      make use of it in mlx5 driver, from Paul Blakey.

   9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.

  10) Lots of parallelization improvements during configuration changes
      in mlxsw driver, from Ido Schimmel.

  11) Add support to devlink for generic packet traps, which report
      packets dropped during ACL processing. And use them in mlxsw
      driver. From Jiri Pirko.

  12) Support bcmgenet on ACPI, from Jeremy Linton.

  13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei
      Starovoitov, and your's truly.

  14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.

  15) Fix sysfs permissions when network devices change namespaces, from
      Christian Brauner.

  16) Add a flags element to ethtool_ops so that drivers can more simply
      indicate which coalescing parameters they actually support, and
      therefore the generic layer can validate the user's ethtool
      request. Use this in all drivers, from Jakub Kicinski.

  17) Offload FIFO qdisc in mlxsw, from Petr Machata.

  18) Support UDP sockets in sockmap, from Lorenz Bauer.

  19) Fix stretch ACK bugs in several TCP congestion control modules,
      from Pengcheng Yang.

  20) Support virtual functiosn in octeontx2 driver, from Tomasz
      Duszynski.

  21) Add region operations for devlink and use it in ice driver to dump
      NVM contents, from Jacob Keller.

  22) Add support for hw offload of MACSEC, from Antoine Tenart.

  23) Add support for BPF programs that can be attached to LSM hooks,
      from KP Singh.

  24) Support for multiple paths, path managers, and counters in MPTCP.
      From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
      and others.

  25) More progress on adding the netlink interface to ethtool, from
      Michal Kubecek"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
  net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
  cxgb4/chcr: nic-tls stats in ethtool
  net: dsa: fix oops while probing Marvell DSA switches
  net/bpfilter: remove superfluous testing message
  net: macb: Fix handling of fixed-link node
  net: dsa: ksz: Select KSZ protocol tag
  netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
  net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
  net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
  net: stmmac: create dwmac-intel.c to contain all Intel platform
  net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
  net: dsa: bcm_sf2: Add support for matching VLAN TCI
  net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
  net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
  net: dsa: bcm_sf2: Disable learning for ASP port
  net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
  net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
  net: dsa: b53: Restore VLAN entries upon (re)configuration
  net: dsa: bcm_sf2: Fix overflow checks
  hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
  ...
2020-03-31 17:29:33 -07:00
Namhyung Kim
6546b19f95 perf/core: Add PERF_SAMPLE_CGROUP feature
The PERF_SAMPLE_CGROUP bit is to save (perf_event) cgroup information in
the sample.  It will add a 64-bit id to identify current cgroup and it's
the file handle in the cgroup file system.  Userspace should use this
information with PERF_RECORD_CGROUP event to match which cgroup it
belongs.

I put it before PERF_SAMPLE_AUX for simplicity since it just needs a
64-bit word.  But if we want bigger samples, I can work on that
direction too.

Committer testing:

  $ pahole perf_sample_data | grep -w cgroup -B5 -A5
  	/* --- cacheline 4 boundary (256 bytes) was 56 bytes ago --- */
  	struct perf_regs           regs_intr;            /*   312    16 */
  	/* --- cacheline 5 boundary (320 bytes) was 8 bytes ago --- */
  	u64                        stack_user_size;      /*   328     8 */
  	u64                        phys_addr;            /*   336     8 */
  	u64                        cgroup;               /*   344     8 */

  	/* size: 384, cachelines: 6, members: 22 */
  	/* padding: 32 */
  };
  $

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lore.kernel.org/lkml/20200325124536.2800725-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-03-27 10:41:44 -03:00
Namhyung Kim
96aaab6865 perf/core: Add PERF_RECORD_CGROUP event
To support cgroup tracking, add CGROUP event to save a link between
cgroup path and id number.  This is needed since cgroups can go away
when userspace tries to read the cgroup info (from the id) later.

The attr.cgroup bit was also added to enable cgroup tracking from
userspace.

This event will be generated when a new cgroup becomes active.
Userspace might need to synthesize those events for existing cgroups.

Committer testing:

From the resulting kernel, using /sys/kernel/btf/vmlinux:

  $ pahole perf_event_attr | grep -w cgroup -B5 -A1
  	__u64                      write_backward:1;     /*    40:27  8 */
  	__u64                      namespaces:1;         /*    40:28  8 */
  	__u64                      ksymbol:1;            /*    40:29  8 */
  	__u64                      bpf_event:1;          /*    40:30  8 */
  	__u64                      aux_output:1;         /*    40:31  8 */
  	__u64                      cgroup:1;             /*    40:32  8 */
  	__u64                      __reserved_1:31;      /*    40:33  8 */
  $

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
[staticize perf_event_cgroup function]
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lore.kernel.org/lkml/20200325124536.2800725-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-03-27 10:39:11 -03:00
Bernd Edlinger
6914303824 perf: Use new infrastructure to fix deadlocks in execve
This changes perf_event_set_clock to use the new exec_update_mutex
instead of cred_guard_mutex.

This should be safe, as the credentials are only used for reading.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:04:01 -05:00
Dan Carpenter
a6763625ae perf/core: Fix reversed NULL check in perf_event_groups_less()
This NULL check is reversed so it leads to a Smatch warning and
presumably a NULL dereference.

    kernel/events/core.c:1598 perf_event_groups_less()
    error: we previously assumed 'right->cgrp->css.cgroup' could be null
	(see line 1590)

Fixes: 95ed6c707f ("perf/cgroup: Order events in RB tree by cgroup id")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200312105637.GA8960@mwanda
2020-03-20 13:06:22 +01:00
Peter Zijlstra
90c91dfb86 perf/core: Fix endless multiplex timer
Kan and Andi reported that we fail to kill rotation when the flexible
events go empty, but the context does not. XXX moar

Fixes: fd7d55172d ("perf/cgroups: Don't rotate events for cgroups unnecessarily")
Reported-by: Andi Kleen <ak@linux.intel.com>
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200305123851.GX2596@hirez.programming.kicks-ass.net
2020-03-20 13:06:22 +01:00
Jiri Olsa
bfea9a8574 bpf: Add name to struct bpf_ksym
Adding name to 'struct bpf_ksym' object to carry the name
of the symbol for bpf_prog, bpf_trampoline, bpf_dispatcher
objects.

The current benefit is that name is now generated only when
the symbol is added to the list, so we don't need to generate
it every time it's accessed.

The future benefit is that we will have all the bpf objects
symbols represented by struct bpf_ksym.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200312195610.346362-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-13 12:49:51 -07:00
Ian Rogers
95ed6c707f perf/cgroup: Order events in RB tree by cgroup id
If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will
hold 120 events. The scheduling in of the events currently iterates
over all events looking to see which events match the task's cgroup or
its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114
events are considered unnecessarily.

This change orders events in the RB tree by cgroup id if it is present.
This means scheduling in may go directly to events associated with the
task's cgroup if one is present. The per-CPU iterator storage in
visit_groups_merge is sized sufficent for an iterator per cgroup depth,
where different iterators are needed for the task's cgroup and parent
cgroups. By considering the set of iterators when visiting, the lowest
group_index event may be selected and the insertion order group_index
property is maintained. This also allows event rotation to function
correctly, as although events are grouped into a cgroup, rotation always
selects the lowest group_index event to rotate (delete/insert into the
tree) and the min heap of iterators make it so that the group_index order
is maintained.

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com
2020-03-06 11:57:01 +01:00
Ian Rogers
c2283c9368 perf/cgroup: Grow per perf_cpu_context heap storage
Allow the per-CPU min heap storage to have sufficient space for per-cgroup
iterators.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-6-irogers@google.com
2020-03-06 11:57:00 +01:00
Ian Rogers
836196beb3 perf/core: Add per perf_cpu_context min_heap storage
The storage required for visit_groups_merge's min heap needs to vary in
order to support more iterators, such as when multiple nested cgroups'
events are being visited. This change allows for 2 iterators and doesn't
support growth.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-5-irogers@google.com
2020-03-06 11:57:00 +01:00
Ian Rogers
6eef8a7116 perf/core: Use min_heap in visit_groups_merge()
visit_groups_merge will pick the next event based on when it was
inserted in to the context (perf_event group_index). Events may be per CPU
or for any CPU, but in the future we'd also like to have per cgroup events
to avoid searching all events for the events to schedule for a cgroup.
Introduce a min heap for the events that maintains a property that the
earliest inserted event is always at the 0th element. Initialize the heap
with per-CPU and any-CPU events for the context.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-4-irogers@google.com
2020-03-06 11:56:59 +01:00
Peter Zijlstra
98add2af89 perf/cgroup: Reorder perf_cgroup_connect()
Move perf_cgroup_connect() after perf_event_alloc(), such that we can
find/use the PMU's cpu context.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-2-irogers@google.com
2020-03-06 11:56:58 +01:00
Peter Zijlstra
2c2366c754 perf/core: Remove 'struct sched_in_data'
We can deduce the ctx and cpuctx from the event, no need to pass them
along. Remove the structure and pass in can_add_hw directly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 11:56:58 +01:00
Peter Zijlstra
ab6f824cfd perf/core: Unify {pinned,flexible}_sched_in()
Less is more; unify the two very nearly identical function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 11:56:55 +01:00
Thomas Gleixner
1d7bf6b7d3 perf/bpf: Remove preempt disable around BPF invocation
The BPF invocation from the perf event overflow handler does not require to
disable preemption because this is called from NMI or at least hard
interrupt context which is already non-preemptible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.151953573@linutronix.de
2020-02-24 16:18:20 -08:00
Kan Liang
bbfd5e4fab perf/core: Add new branch sample type for HW index of raw branch records
The low level index is the index in the underlying hardware buffer of
the most recently captured taken branch which is always saved in
branch_entries[0]. It is very useful for reconstructing the call stack.
For example, in Intel LBR call stack mode, the depth of reconstructed
LBR call stack limits to the number of LBR registers. With the low level
index information, perf tool may stitch the stacks of two samples. The
reconstructed LBR call stack can break the HW limitation.

Add a new branch sample type to retrieve low level index of raw branch
records. The low level index is between -1 (unknown) and max depth which
can be retrieved in /sys/devices/cpu/caps/branches.

Only when the new branch sample type is set, the low level index
information is dumped into the PERF_SAMPLE_BRANCH_STACK output.
Perf tool should check the attr.branch_sample_type, and apply the
corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
Otherwise, some user case may be broken. For example, users may parse a
perf.data, which include the new branch sample type, with an old version
perf tool (without the check). Users probably get incorrect information
without any warning.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com
2020-02-11 13:23:49 +01:00
Linus Torvalds
ca21b9b370 A set of fixes and improvements for the perf subsystem:
- Kernel fixes:
 
    - Install cgroup events to the correct CPU context to prevent a
      potential list double add
 
    - Prevent am intgeer underflow in the perf mlock acounting
 
    - Add a missing prototyp for arch_perf_update_userpage()
 
  - Tooling:
 
    - Add a missing unlock in the error path of maps__insert() in perf maps.
 
    - Fix the build with the latest libbfd
 
    - Fix the perf parser so it does not delete parse event terms, which
      caused a regression for using perf with the ARM CoreSight as the sink
      confuguration was missing due to the deletion.
 
    - Fix the double free in the perf CPU map merging test case
 
    - Add the missing ustring support for the perf probe command
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5AC0ITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaJtD/4jEdN6KNGVJIQ5jOYdchXK/zb68plS
 3By6CegbaNq1SU5UPIdMX4BkznVGaVtJU/0hWuvD/ycpBTAMgKjwalYJtAC+anVi
 JhG7NiPRV1Nhm+7eZ/78mUpW4CUimTlvZVzU/yneYdFm2klvcxUHblJYSqEGp0AS
 r2aZRsqQnWSoI/+z+0THO8tI+HLSpkmKy2slLxaZphI0VjSrjWPDHfF6eAOyl/dq
 lTCz+tjd6EytELL+lhWFsGXYAi6HPKP3T4yPRH+eDYKQmByYaEYbK3E8wg/0XB/J
 2AHgSBf9pSPDBIkLOWOidmkmWgZD9ykCTyOPu4N0S70+NeaCm2nXLTOQ7dnyLE7t
 WCx8mvnIS2hshNUoXMkarG5LYexPupDMMEfHyUT5+T2rKxacKWLaRoIV+JCsUpQb
 m6eU3+n/YsN1C05V75Fuztt4irGhltlQxcG8F3gH/vqSy6VDdZb8lMU6+iyE2VKG
 ezsI7AMQkT6LrTGa2hXHHnnluaxHHSA32GPe4W1QTwMCMWMtRTwQHBBLoJ4mC0wk
 iujB9DVuh7ljmr7QSG9ZYV91eplpzJDUC54P6Qs/p7ouG4YzkIO6glt6BOgBmbp7
 YkrJtGpV6npjJmLckktcSd9rtnCzot6yGxeaIVfLPhhtf2KECSCckCyddwkakt0A
 wwVVBe8RNxXf2A==
 =xu7D
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "A set of fixes and improvements for the perf subsystem:

  Kernel fixes:

   - Install cgroup events to the correct CPU context to prevent a
     potential list double add

   - Prevent an integer underflow in the perf mlock accounting

   - Add a missing prototype for arch_perf_update_userpage()

  Tooling:

   - Add a missing unlock in the error path of maps__insert() in perf
     maps.

   - Fix the build with the latest libbfd

   - Fix the perf parser so it does not delete parse event terms, which
     caused a regression for using perf with the ARM CoreSight as the
     sink configuration was missing due to the deletion.

   - Fix the double free in the perf CPU map merging test case

   - Add the missing ustring support for the perf probe command"

* tag 'perf-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf maps: Add missing unlock to maps__insert() error case
  perf probe: Add ustring support for perf probe command
  perf: Make perf able to build with latest libbfd
  perf test: Fix test case Merge cpu map
  perf parse: Copy string to perf_evsel_config_term
  perf parse: Refactor 'struct perf_evsel_config_term'
  kernel/events: Add a missing prototype for arch_perf_update_userpage()
  perf/cgroups: Install cgroup events to correct cpuctx
  perf/core: Fix mlock accounting in perf_mmap()
2020-02-09 12:04:09 -08:00
Linus Torvalds
e310396bb8 Tracing updates:
- Added new "bootconfig".
    Looks for a file appended to initrd to add boot config options.
    This has been discussed thoroughly at Linux Plumbers.
    Very useful for adding kprobes at bootup.
    Only enabled if "bootconfig" is on the real kernel command line.
 
  - Created dynamic event creation.
    Merges common code between creating synthetic events and
      kprobe events.
 
  - Rename perf "ring_buffer" structure to "perf_buffer"
 
  - Rename ftrace "ring_buffer" structure to "trace_buffer"
    Had to rename existing "trace_buffer" to "array_buffer"
 
  - Allow trace_printk() to work withing (some) tracing code.
 
  - Sort of tracing configs to be a little better organized
 
  - Fixed bug where ftrace_graph hash was not being protected properly
 
  - Various other small fixes and clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXjtAURQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qshOAQDzopQmvAVrrI6oogghr8JQA30Z2yqT
 i+Ld7vPWL2MV9wEA1S+zLGDSYrj8f/vsCq6BxRYT1ApO+YtmY6LTXiUejwg=
 =WNds
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Added new "bootconfig".

   This looks for a file appended to initrd to add boot config options,
   and has been discussed thoroughly at Linux Plumbers.

   Very useful for adding kprobes at bootup.

   Only enabled if "bootconfig" is on the real kernel command line.

 - Created dynamic event creation.

   Merges common code between creating synthetic events and kprobe
   events.

 - Rename perf "ring_buffer" structure to "perf_buffer"

 - Rename ftrace "ring_buffer" structure to "trace_buffer"

   Had to rename existing "trace_buffer" to "array_buffer"

 - Allow trace_printk() to work withing (some) tracing code.

 - Sort of tracing configs to be a little better organized

 - Fixed bug where ftrace_graph hash was not being protected properly

 - Various other small fixes and clean ups

* tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits)
  bootconfig: Show the number of nodes on boot message
  tools/bootconfig: Show the number of bootconfig nodes
  bootconfig: Add more parse error messages
  bootconfig: Use bootconfig instead of boot config
  ftrace: Protect ftrace_graph_hash with ftrace_sync
  ftrace: Add comment to why rcu_dereference_sched() is open coded
  tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
  tracing: Annotate ftrace_graph_hash pointer with __rcu
  bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline
  tracing: Use seq_buf for building dynevent_cmd string
  tracing: Remove useless code in dynevent_arg_pair_add()
  tracing: Remove check_arg() callbacks from dynevent args
  tracing: Consolidate some synth_event_trace code
  tracing: Fix now invalid var_ref_vals assumption in trace action
  tracing: Change trace_boot to use synth_event interface
  tracing: Move tracing selftests to bottom of menu
  tracing: Move mmio tracer config up with the other tracers
  tracing: Move tracing test module configs together
  tracing: Move all function tracing configs together
  tracing: Documentation for in-kernel synthetic event API
  ...
2020-02-06 07:12:11 +00:00
Linus Torvalds
6aee4badd8 Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull openat2 support from Al Viro:
 "This is the openat2() series from Aleksa Sarai.

  I'm afraid that the rest of namei stuff will have to wait - it got
  zero review the last time I'd posted #work.namei, and there had been a
  leak in the posted series I'd caught only last weekend. I was going to
  repost it on Monday, but the window opened and the odds of getting any
  review during that... Oh, well.

  Anyway, openat2 part should be ready; that _did_ get sane amount of
  review and public testing, so here it comes"

From Aleksa's description of the series:
 "For a very long time, extending openat(2) with new features has been
  incredibly frustrating. This stems from the fact that openat(2) is
  possibly the most famous counter-example to the mantra "don't silently
  accept garbage from userspace" -- it doesn't check whether unknown
  flags are present[1].

  This means that (generally) the addition of new flags to openat(2) has
  been fraught with backwards-compatibility issues (O_TMPFILE has to be
  defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
  kernels gave errors, since it's insecure to silently ignore the
  flag[2]). All new security-related flags therefore have a tough road
  to being added to openat(2).

  Furthermore, the need for some sort of control over VFS's path
  resolution (to avoid malicious paths resulting in inadvertent
  breakouts) has been a very long-standing desire of many userspace
  applications.

  This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
  (which was a variant of David Drysdale's O_BENEATH patchset[4] which
  was a spin-off of the Capsicum project[5]) with a few additions and
  changes made based on the previous discussion within [6] as well as
  others I felt were useful.

  In line with the conclusions of the original discussion of
  AT_NO_JUMPS, the flag has been split up into separate flags. However,
  instead of being an openat(2) flag it is provided through a new
  syscall openat2(2) which provides several other improvements to the
  openat(2) interface (see the patch description for more details). The
  following new LOOKUP_* flags are added:

  LOOKUP_NO_XDEV:

     Blocks all mountpoint crossings (upwards, downwards, or through
     absolute links). Absolute pathnames alone in openat(2) do not
     trigger this. Magic-link traversal which implies a vfsmount jump is
     also blocked (though magic-link jumps on the same vfsmount are
     permitted).

  LOOKUP_NO_MAGICLINKS:

     Blocks resolution through /proc/$pid/fd-style links. This is done
     by blocking the usage of nd_jump_link() during resolution in a
     filesystem. The term "magic-links" is used to match with the only
     reference to these links in Documentation/, but I'm happy to change
     the name.

     It should be noted that this is different to the scope of
     ~LOOKUP_FOLLOW in that it applies to all path components. However,
     you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
     will *not* fail (assuming that no parent component was a
     magic-link), and you will have an fd for the magic-link.

     In order to correctly detect magic-links, the introduction of a new
     LOOKUP_MAGICLINK_JUMPED state flag was required.

  LOOKUP_BENEATH:

     Disallows escapes to outside the starting dirfd's
     tree, using techniques such as ".." or absolute links. Absolute
     paths in openat(2) are also disallowed.

     Conceptually this flag is to ensure you "stay below" a certain
     point in the filesystem tree -- but this requires some additional
     to protect against various races that would allow escape using
     "..".

     Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
     can trivially beam you around the filesystem (breaking the
     protection). In future, there might be similar safety checks done
     as in LOOKUP_IN_ROOT, but that requires more discussion.

  In addition, two new flags are added that expand on the above ideas:

  LOOKUP_NO_SYMLINKS:

     Does what it says on the tin. No symlink resolution is allowed at
     all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
     can still be used with NOFOLLOW to open an fd for the symlink as
     long as no parent path had a symlink component.

  LOOKUP_IN_ROOT:

     This is an extension of LOOKUP_BENEATH that, rather than blocking
     attempts to move past the root, forces all such movements to be
     scoped to the starting point. This provides chroot(2)-like
     protection but without the cost of a chroot(2) for each filesystem
     operation, as well as being safe against race attacks that
     chroot(2) is not.

     If a race is detected (as with LOOKUP_BENEATH) then an error is
     generated, and similar to LOOKUP_BENEATH it is not permitted to
     cross magic-links with LOOKUP_IN_ROOT.

     The primary need for this is from container runtimes, which
     currently need to do symlink scoping in userspace[7] when opening
     paths in a potentially malicious container.

     There is a long list of CVEs that could have bene mitigated by
     having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
     CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
     few).

  In order to make all of the above more usable, I'm working on
  libpathrs[8] which is a C-friendly library for safe path resolution.
  It features a userspace-emulated backend if the kernel doesn't support
  openat2(2). Hopefully we can get userspace to switch to using it, and
  thus get openat2(2) support for free once it's ready.

  Future work would include implementing things like
  RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
  programs to be sure they don't hit DoSes though stale NFS handles)"

* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Documentation: path-lookup: include new LOOKUP flags
  selftests: add openat2(2) selftests
  open: introduce openat2(2) syscall
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: allow set_root() to produce errors
  namei: allow nd_jump_link() to produce errors
  nsfs: clean-up ns_get_path() signature to return int
  namei: only return -ECHILD from follow_dotdot_rcu()
2020-01-29 11:20:24 -08:00
Song Liu
07c5972951 perf/cgroups: Install cgroup events to correct cpuctx
cgroup events are always installed in the cpuctx. However, when it is not
installed via IPI, list_update_cgroup_event() adds it to cpuctx of current
CPU, which triggers list corruption:

  [] list_add double add: new=ffff888ff7cf0db0, prev=ffff888ff7ce82f0, next=ffff888ff7cf0db0.

To reproduce this, we can simply run:

  # perf stat -e cs -a &
  # perf stat -e cs -G anycgroup

Fix this by installing it to cpuctx that contains event->ctx, and the
proper cgrp_cpuctx_list.

Fixes: db0503e4f6 ("perf/core: Optimize perf_install_in_event()")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200122195027.2112449-1-songliubraving@fb.com
2020-01-28 21:20:19 +01:00
Song Liu
003461559e perf/core: Fix mlock accounting in perf_mmap()
Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
a perf ring buffer may lead to an integer underflow in locked memory
accounting. This may lead to the undesired behaviors, such as failures in
BPF map creation.

Address this by adjusting the accounting logic to take into account the
possibility that the amount of already locked memory may exceed the
current limit.

Fixes: c4b7547974 ("perf/core: Make the mlock accounting simple again")
Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
2020-01-28 21:20:18 +01:00
Mark Rutland
da9ec3d3dd perf: Correctly handle failed perf_get_aux_event()
Vince reports a worrying issue:

| so I was tracking down some odd behavior in the perf_fuzzer which turns
| out to be because perf_even_open() sometimes returns 0 (indicating a file
| descriptor of 0) even though as far as I can tell stdin is still open.

... and further the cause:

| error is triggered if aux_sample_size has non-zero value.
|
| seems to be this line in kernel/events/core.c:
|
| if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader))
|                goto err_locked;
|
| (note, err is never set)

This seems to be a thinko in commit:

  ab43762ef0 ("perf: Allow normal events to output AUX data")

... and we should probably return -EINVAL here, as this should only
happen when the new event is mis-configured or does not have a
compatible aux_event group leader.

Fixes: ab43762ef0 ("perf: Allow normal events to output AUX data")
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
2020-01-17 11:32:44 +01:00
Steven Rostedt (VMware)
56de4e8f91 perf: Make struct ring_buffer less ambiguous
eBPF requires needing to know the size of the perf ring buffer structure.
But it unfortunately has the same name as the generic ring buffer used by
tracing and oprofile. To make it less ambiguous, rename the perf ring buffer
structure to "perf_buffer".

As other parts of the ring buffer code has "perf_" as the prefix, it only
makes sense to give the ring buffer the "perf_" prefix as well.

Link: https://lore.kernel.org/r/20191213153553.GE20583@krava
Acked-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:38 -05:00
Sebastian Andrzej Siewior
9f0bff1180 perf/core: Add SRCU annotation for pmus list walk
Since commit
   28875945ba ("rcu: Add support for consolidated-RCU reader checking")

there is an additional check to ensure that a RCU related lock is held
while the RCU list is iterated.
This section holds the SRCU reader lock instead.

Add annotation to list_for_each_entry_rcu() that pmus_srcu must be
acquired during the list traversal.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lkml.kernel.org/r/20191119121429.zhcubzdhm672zasg@linutronix.de
2019-12-17 13:32:46 +01:00
Aleksa Sarai
ce623f8987 nsfs: clean-up ns_get_path() signature to return int
ns_get_path() and ns_get_path_cb() only ever return either NULL or an
ERR_PTR. It is far more idiomatic to simply return an integer, and it
makes all of the callers of ns_get_path() more straightforward to read.

Fixes: e149ed2b80 ("take the targets of /proc/*/ns/* symlinks to separate fs")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-08 19:09:37 -05:00
Gaowei Pu
ff68dac6d6 mm/mmap.c: use IS_ERR_VALUE to check return value of get_unmapped_area
get_unmapped_area() returns an address or -errno on failure.  Historically
we have checked for the failure by offset_in_page() which is correct but
quite hard to read.  Newer code started using IS_ERR_VALUE which is much
easier to read.  Convert remaining users of offset_in_page as well.

[mhocko@suse.com: rewrite changelog]
[mhocko@kernel.org: fix mremap.c and uprobes.c sites also]
Link: http://lkml.kernel.org/r/20191012102512.28051-1-pugaowei@gmail.com
Signed-off-by: Gaowei Pu <pugaowei@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:19 -08:00
Linus Torvalds
3f59dbcace Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main kernel side changes in this cycle were:

   - Various Intel-PT updates and optimizations (Alexander Shishkin)

   - Prohibit kprobes on Xen/KVM emulate prefixes (Masami Hiramatsu)

   - Add support for LSM and SELinux checks to control access to the
     perf syscall (Joel Fernandes)

   - Misc other changes, optimizations, fixes and cleanups - see the
     shortlog for details.

  There were numerous tooling changes as well - 254 non-merge commits.
  Here are the main changes - too many to list in detail:

   - Enhancements to core tooling infrastructure, perf.data, libperf,
     libtraceevent, event parsing, vendor events, Intel PT, callchains,
     BPF support and instruction decoding.

   - There were updates to the following tools:

        perf annotate
        perf diff
        perf inject
        perf kvm
        perf list
        perf maps
        perf parse
        perf probe
        perf record
        perf report
        perf script
        perf stat
        perf test
        perf trace

   - And a lot of other changes: please see the shortlog and Git log for
     more details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (279 commits)
  perf parse: Fix potential memory leak when handling tracepoint errors
  perf probe: Fix spelling mistake "addrees" -> "address"
  libtraceevent: Fix memory leakage in copy_filter_type
  libtraceevent: Fix header installation
  perf intel-bts: Does not support AUX area sampling
  perf intel-pt: Add support for decoding AUX area samples
  perf intel-pt: Add support for recording AUX area samples
  perf pmu: When using default config, record which bits of config were changed by the user
  perf auxtrace: Add support for queuing AUX area samples
  perf session: Add facility to peek at all events
  perf auxtrace: Add support for dumping AUX area samples
  perf inject: Cut AUX area samples
  perf record: Add aux-sample-size config term
  perf record: Add support for AUX area sampling
  perf auxtrace: Add support for AUX area sample recording
  perf auxtrace: Move perf_evsel__find_pmu()
  perf record: Add a function to test for kernel support for AUX area sampling
  perf tools: Add kernel AUX area sampling definitions
  perf/core: Make the mlock accounting simple again
  perf report: Jump to symbol source view from total cycles view
  ...
2019-11-26 15:04:47 -08:00
Linus Torvalds
386403a115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
 "Another merge window, another pull full of stuff:

   1) Support alternative names for network devices, from Jiri Pirko.

   2) Introduce per-netns netdev notifiers, also from Jiri Pirko.

   3) Support MSG_PEEK in vsock/virtio, from Matias Ezequiel Vara
      Larsen.

   4) Allow compiling out the TLS TOE code, from Jakub Kicinski.

   5) Add several new tracepoints to the kTLS code, also from Jakub.

   6) Support set channels ethtool callback in ena driver, from Sameeh
      Jubran.

   7) New SCTP events SCTP_ADDR_ADDED, SCTP_ADDR_REMOVED,
      SCTP_ADDR_MADE_PRIM, and SCTP_SEND_FAILED_EVENT. From Xin Long.

   8) Add XDP support to mvneta driver, from Lorenzo Bianconi.

   9) Lots of netfilter hw offload fixes, cleanups and enhancements,
      from Pablo Neira Ayuso.

  10) PTP support for aquantia chips, from Egor Pomozov.

  11) Add UDP segmentation offload support to igb, ixgbe, and i40e. From
      Josh Hunt.

  12) Add smart nagle to tipc, from Jon Maloy.

  13) Support L2 field rewrite by TC offloads in bnxt_en, from Venkat
      Duvvuru.

  14) Add a flow mask cache to OVS, from Tonghao Zhang.

  15) Add XDP support to ice driver, from Maciej Fijalkowski.

  16) Add AF_XDP support to ice driver, from Krzysztof Kazimierczak.

  17) Support UDP GSO offload in atlantic driver, from Igor Russkikh.

  18) Support it in stmmac driver too, from Jose Abreu.

  19) Support TIPC encryption and auth, from Tuong Lien.

  20) Introduce BPF trampolines, from Alexei Starovoitov.

  21) Make page_pool API more numa friendly, from Saeed Mahameed.

  22) Introduce route hints to ipv4 and ipv6, from Paolo Abeni.

  23) Add UDP segmentation offload to cxgb4, Rahul Lakkireddy"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1857 commits)
  libbpf: Fix usage of u32 in userspace code
  mm: Implement no-MMU variant of vmalloc_user_node_flags
  slip: Fix use-after-free Read in slip_open
  net: dsa: sja1105: fix sja1105_parse_rgmii_delays()
  macvlan: schedule bc_work even if error
  enetc: add support Credit Based Shaper(CBS) for hardware offload
  net: phy: add helpers phy_(un)lock_mdio_bus
  mdio_bus: don't use managed reset-controller
  ax88179_178a: add ethtool_op_get_ts_info()
  mlxsw: spectrum_router: Fix use of uninitialized adjacency index
  mlxsw: spectrum_router: After underlay moves, demote conflicting tunnels
  bpf: Simplify __bpf_arch_text_poke poke type handling
  bpf: Introduce BPF_TRACE_x helper for the tracing tests
  bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JIT
  bpf, testing: Add various tail call test cases
  bpf, x86: Emit patchable direct jump as tail call
  bpf: Constant map key tracking for prog array pokes
  bpf: Add poke dependency tracking for prog array maps
  bpf: Add initial poke descriptor table for jit images
  bpf: Move owner type, jited info into array auxiliary data
  ...
2019-11-25 20:02:57 -08:00
Linus Torvalds
752272f16d ARM:
- Data abort report and injection
 - Steal time support
 - GICv4 performance improvements
 - vgic ITS emulation fixes
 - Simplify FWB handling
 - Enable halt polling counters
 - Make the emulated timer PREEMPT_RT compliant
 
 s390:
 - Small fixes and cleanups
 - selftest improvements
 - yield improvements
 
 PPC:
 - Add capability to tell userspace whether we can single-step the guest.
 - Improve the allocation of XIVE virtual processor IDs
 - Rewrite interrupt synthesis code to deliver interrupts in virtual
   mode when appropriate.
 - Minor cleanups and improvements.
 
 x86:
 - XSAVES support for AMD
 - more accurate report of nested guest TSC to the nested hypervisor
 - retpoline optimizations
 - support for nested 5-level page tables
 - PMU virtualization optimizations, and improved support for nested
   PMU virtualization
 - correct latching of INITs for nested virtualization
 - IOAPIC optimization
 - TSX_CTRL virtualization for more TAA happiness
 - improved allocation and flushing of SEV ASIDs
 - many bugfixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJd27PMAAoJEL/70l94x66DspsH+gPc6YWtKJFJH58Zj8NrNh6y
 t0FwDFcvUa51+m4jaY4L5Y8+zqu1dZFnPPhFGqNWpxrjCEvE/glQJv3BiUX06Seh
 aYUHNymGoYCTJOHaaGhV+NlgQaDuZOCOkIsOLAPehyFd1KojwB+FRC0xmO6aROPw
 9yQgYrKuK1UUn5HwxBNrMS4+Xv+2iKv/9sTnq1G4W2qX2NZQg84LVPg1zIdkCh3D
 3GOvoCBEk3ivQqjmdE7rP/InPr0XvW0b6TFhchIk8J6jEIQFHsmOUefiTvTxsIHV
 OKAZwvyeYPrYHA/aDZpaBmY2aR0ydfKDUQcviNIJoF1vOktGs0hvl3VbsmG8QCg=
 =OSI1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - data abort report and injection
   - steal time support
   - GICv4 performance improvements
   - vgic ITS emulation fixes
   - simplify FWB handling
   - enable halt polling counters
   - make the emulated timer PREEMPT_RT compliant

  s390:
   - small fixes and cleanups
   - selftest improvements
   - yield improvements

  PPC:
   - add capability to tell userspace whether we can single-step the
     guest
   - improve the allocation of XIVE virtual processor IDs
   - rewrite interrupt synthesis code to deliver interrupts in virtual
     mode when appropriate.
   - minor cleanups and improvements.

  x86:
   - XSAVES support for AMD
   - more accurate report of nested guest TSC to the nested hypervisor
   - retpoline optimizations
   - support for nested 5-level page tables
   - PMU virtualization optimizations, and improved support for nested
     PMU virtualization
   - correct latching of INITs for nested virtualization
   - IOAPIC optimization
   - TSX_CTRL virtualization for more TAA happiness
   - improved allocation and flushing of SEV ASIDs
   - many bugfixes and cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
  kvm: nVMX: Relax guest IA32_FEATURE_CONTROL constraints
  KVM: x86: Grab KVM's srcu lock when setting nested state
  KVM: x86: Open code shared_msr_update() in its only caller
  KVM: Fix jump label out_free_* in kvm_init()
  KVM: x86: Remove a spurious export of a static function
  KVM: x86: create mmu/ subdirectory
  KVM: nVMX: Remove unnecessary TLB flushes on L1<->L2 switches when L1 use apic-access-page
  KVM: x86: remove set but not used variable 'called'
  KVM: nVMX: Do not mark vmcs02->apic_access_page as dirty when unpinning
  KVM: vmx: use MSR_IA32_TSX_CTRL to hard-disable TSX on guest that lack it
  KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality
  KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUID
  KVM: x86: do not modify masked bits of shared MSRs
  KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
  KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path
  KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one
  KVM: nVMX: Assume TLB entries of L1 and L2 are tagged differently if L0 use EPT
  KVM: x86: Unexport kvm_vcpu_reload_apic_access_page()
  KVM: nVMX: add CR4_LA57 bit to nested CR4_FIXED1
  KVM: nVMX: Use semi-colon instead of comma for exit-handlers initialization
  ...
2019-11-25 18:02:36 -08:00
Ingo Molnar
c494cd6469 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25 09:08:29 +01:00
Paolo Bonzini
46f4f0aabc Merge branch 'kvm-tsx-ctrl' into HEAD
Conflicts:
	arch/x86/kvm/vmx/vmx.c
2019-11-21 12:03:40 +01:00
Alexander Shishkin
c4b7547974 perf/core: Make the mlock accounting simple again
Commit:

  d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")

does a lot of things to the mlock accounting arithmetics, while the only
thing that actually needed to happen is subtracting the part that is
charged to the mm from the part that is charged to the user, so that the
former isn't charged twice.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: songliubraving@fb.com
Link: https://lkml.kernel.org/r/20191120170640.54123-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21 07:37:50 +01:00
David S. Miller
ee5a489fd9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-11-20

The following pull-request contains BPF updates for your *net-next* tree.

We've added 81 non-merge commits during the last 17 day(s) which contain
a total of 120 files changed, 4958 insertions(+), 1081 deletions(-).

There are 3 trivial conflicts, resolve it by always taking the chunk from
196e8ca74886c433:

<<<<<<< HEAD
=======
void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
>>>>>>> 196e8ca748

<<<<<<< HEAD
void *bpf_map_area_alloc(u64 size, int numa_node)
=======
static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
>>>>>>> 196e8ca748

<<<<<<< HEAD
        if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
=======
        /* kmalloc()'ed memory can't be mmap()'ed */
        if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>>>>> 196e8ca748

The main changes are:

1) Addition of BPF trampoline which works as a bridge between kernel functions,
   BPF programs and other BPF programs along with two new use cases: i) fentry/fexit
   BPF programs for tracing with practically zero overhead to call into BPF (as
   opposed to k[ret]probes) and ii) attachment of the former to networking related
   programs to see input/output of networking programs (covering xdpdump use case),
   from Alexei Starovoitov.

2) BPF array map mmap support and use in libbpf for global data maps; also a big
   batch of libbpf improvements, among others, support for reading bitfields in a
   relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko.

3) Extend s390x JIT with usage of relative long jumps and loads in order to lift
   the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich.

4) Add BPF audit support and emit messages upon successful prog load and unload in
   order to have a timeline of events, from Daniel Borkmann and Jiri Olsa.

5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode
   (XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson.

6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API
   call named bpf_get_link_xdp_info() for retrieving the full set of prog
   IDs attached to XDP, from Toke Høiland-Jørgensen.

7) Add BTF support for array of int, array of struct and multidimensional arrays
   and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau.

8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo.

9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid
   xdping to be run as standalone, from Jiri Benc.

10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song.

11) Fix a memory leak in BPF fentry test run data, from Colin Ian King.

12) Various smaller misc cleanups and improvements mostly all over BPF selftests and
    samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 18:11:23 -08:00
Alexander Shishkin
36b3db03b4 perf/core: Fix the mlock accounting, again
Commit:

  5e6c3c7b1e ("perf/aux: Fix tracking of auxiliary trace buffer allocation")

tried to guess the correct combination of arithmetic operations that would
undo the AUX buffer's mlock accounting, and failed, leaking the bottom part
when an allocation needs to be charged partially to both user->locked_vm
and mm->pinned_vm, eventually leaving the user with no locked bonus:

  $ perf record -e intel_pt//u -m1,128 uname
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.061 MB perf.data ]

  $ perf record -e intel_pt//u -m1,128 uname
  Permission error mapping pages.
  Consider increasing /proc/sys/kernel/perf_event_mlock_kb,
  or try again with a smaller value of -m/--mmap_pages.
  (current value: 1,128)

Fix this by subtracting both locked and pinned counts when AUX buffer is
unmapped.

Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 16:27:37 +01:00
Andrii Nakryiko
85192dbf4d bpf: Convert bpf_prog refcnt to atomic64_t
Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
and remove artificial 32k limit. This allows to make bpf_prog's refcounting
non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc.

Validated compilation by running allyesconfig kernel build.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com
2019-11-18 11:41:59 +01:00
Like Xu
52ba4b0b99 perf/core: Provide a kernel-internal interface to pause perf_event
Exporting perf_event_pause() as an external accessor for kernel users (such
as KVM) who may do both disable perf_event and read count with just one
time to hold perf_event_ctx_lock. Also the value could be reset optionally.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15 11:44:07 +01:00
Like Xu
3ca270fc9e perf/core: Provide a kernel-internal interface to recalibrate event period
Currently, perf_event_period() is used by user tools via ioctl. Based on
naming convention, exporting perf_event_period() for kernel users (such
as KVM) who may recalibrate the event period for their assigned counter
according to their requirements.

The perf_event_period() is an external accessor, just like the
perf_event_{en,dis}able() and should thus use perf_event_ctx_lock().

Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15 11:44:06 +01:00
Alexander Shishkin
a4faf00d99 perf/aux: Allow using AUX data in perf samples
AUX data can be used to annotate perf events such as performance counters
or tracepoints/breakpoints by including it in sample records when
PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
and profiling by providing, for example, a history of instruction flow
leading up to the event's overflow.

The implementation makes use of grouping an AUX event with all the events
that wish to take samples of the AUX data, such that the former is the
group leader. The samplees should also specify the desired size of the AUX
sample via attr.aux_sample_size.

AUX capable PMUs need to explicitly add support for sampling, because it
relies on a new callback to take a snapshot of the buffer without touching
the event states.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:06:14 +01:00
Qian Cai
deb0c3c29d perf/core: Fix unlock balance in perf_init_event()
Commit:

  66d258c5b0 ("perf/core: Optimize perf_init_event()")

introduced an unlock imbalance in perf_init_event() where it calls
"goto again" and then only repeat rcu_read_unlock().

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 66d258c5b0 ("perf/core: Optimize perf_init_event()")
Link: https://lkml.kernel.org/r/20191106052935.8352-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:06:13 +01:00
Ingo Molnar
fed4c9c681 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:04:43 +01:00
Ben Dooks (Codethink)
d00dbd2981 perf/core: Fix missing static inline on perf_cgroup_switch()
It looks like a "static inline" has been missed in front
of the empty definition of perf_cgroup_switch() under
certain configurations.

Fixes the following sparse warning:

  kernel/events/core.c:1035:1: warning: symbol 'perf_cgroup_switch' was not declared. Should it be static?

Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191106132527.19977-1-ben.dooks@codethink.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:44 +01:00
Alexander Shishkin
697d877849 perf/core: Consistently fail fork on allocation failures
Commit:

  313ccb9615 ("perf: Allocate context task_ctx_data for child event")

makes the inherit path skip over the current event in case of task_ctx_data
allocation failure. This, however, is inconsistent with allocation failures
in perf_event_alloc(), which would abort the fork.

Correct this by returning an error code on task_ctx_data allocation
failure and failing the fork in that case.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191105075702.60319-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:43 +01:00
Alexander Shishkin
dce5affb94 perf/aux: Disallow aux_output for kernel events
Commit

  ab43762ef0 ("perf: Allow normal events to output AUX data")

added 'aux_output' bit to the attribute structure, which relies on AUX
events and grouping, neither of which is supported for the kernel events.
This notwithstanding, attempts have been made to use it in the kernel
code, suggesting the necessity of an explicit hard -EINVAL.

Fix this by rejecting attributes with aux_output set for kernel events.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191030134731.5437-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:42 +01:00
Alexander Shishkin
f25d8ba9e1 perf/core: Reattach a misplaced comment
A comment is in a wrong place in perf_event_create_kernel_counter().
Fix that.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191030134731.5437-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:41 +01:00
Alexander Shishkin
00496fe5e0 perf/aux: Fix the aux_output group inheritance fix
Commit

  f733c6b508 ("perf/core: Fix inheritance of aux_output groups")

adds a NULL pointer dereference in case inherit_group() races with
perf_release(), which causes the below crash:

 > BUG: kernel NULL pointer dereference, address: 000000000000010b
 > #PF: supervisor read access in kernel mode
 > #PF: error_code(0x0000) - not-present page
 > PGD 3b203b067 P4D 3b203b067 PUD 3b2040067 PMD 0
 > Oops: 0000 [#1] SMP KASAN
 > CPU: 0 PID: 315 Comm: exclusive-group Tainted: G B 5.4.0-rc3-00181-g72e1839403cb-dirty #878
 > RIP: 0010:perf_get_aux_event+0x86/0x270
 > Call Trace:
 >  ? __perf_read_group_add+0x3b0/0x3b0
 >  ? __kasan_check_write+0x14/0x20
 >  ? __perf_event_init_context+0x154/0x170
 >  inherit_task_group.isra.0.part.0+0x14b/0x170
 >  perf_event_init_task+0x296/0x4b0

Fix this by skipping over events that are getting closed, in the
inheritance path.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: f733c6b508 ("perf/core: Fix inheritance of aux_output groups")
Link: https://lkml.kernel.org/r/20191101151248.47327-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:40 +01:00
Peter Zijlstra
09f4e8f05d perf/core: Disallow uncore-cgroup events
While discussing uncore event scheduling, I noticed we do not in fact
seem to dis-allow making uncore-cgroup events. Such events make no
sense what so ever because the cgroup is a CPU local state where
uncore counts across a number of CPUs.

Disallow them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:39 +01:00
Liang, Kan
d44f821b0e perf/core: Optimize perf_init_event() for TYPE_SOFTWARE
Andi reported that he was hitting the linear search in
perf_init_event() a lot. Now that all !TYPE_SOFTWARE events should hit
the IDR, make sure the TYPE_SOFTWARE events are at the head of the
list such that we'll quickly find the right PMU (provided a valid
event was given).

Signed-off-by: Liang, Kan <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:53:28 +01:00
Peter Zijlstra
66d258c5b0 perf/core: Optimize perf_init_event()
Andi reported that he was hitting the linear search in
perf_init_event() a lot. Make more agressive use of the IDR lookup to
avoid hitting the linear search.

With exception of PERF_TYPE_SOFTWARE (which relies on a hideous hack),
we can put everything in the IDR. On top of that, we can alias
TYPE_HARDWARE and TYPE_HW_CACHE to TYPE_RAW on the lookup side.

This greatly reduces the chances of hitting the linear search.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:51:02 +01:00
Peter Zijlstra
db0503e4f6 perf/core: Optimize perf_install_in_event()
Andi reported that when creating a lot of events, a lot of time is
spent in IPIs and asked if it would be possible to elide some of that.

Now when, as for example the perf-tool always does, events are created
disabled, then these events will not need to be scheduled when added
to the context (they're still disable) and therefore the IPI is not
required -- except for the very first event, that will need to set
ctx->is_active.

( It might be possible to set ctx->is_active remotely for cpu_ctx, but
  we really need the IPI for task_ctx, so lets not make that
  distinction. )

Also use __perf_effective_state() since group events depend on the
state of the leader, if the leader is OFF, the whole group is OFF.

So when sibling events are created enabled (XXX check tool) then we
only need a single IPI to create and enable the whole group (+ that
initial IPI to initialize the context).

Suggested-by: Andi Kleen <andi@firstfloor.org>
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:51:02 +01:00
Alexey Budankov
c2b98a8661 perf/x86: Synchronize PMU task contexts on optimized context switches
Install Intel specific PMU task context synchronization adapter and
extend optimized context switch path with PMU specific task context
synchronization to fix LBR callstack virtualization on context switches.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/9c6445a9-bdba-ef03-3859-f1f91198f27a@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:51:01 +01:00
Ingo Molnar
65133033ee Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:38:26 +01:00
Alexander Shishkin
8c7e975667 perf/core: Start rejecting the syscall with attr.__reserved_2 set
Commit:

  1a59413124 ("perf: Add wakeup watermark control to the AUX area")

added attr.__reserved_2 padding, but forgot to add an ABI check to reject
attributes with this field set. Fix that.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025121636.75182-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 11:01:59 +01:00
Linus Torvalds
a8a31fdcca Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A set of perf fixes:

  kernel:

   - Unbreak the tracking of auxiliary buffer allocations which got
     imbalanced causing recource limit failures.

   - Fix the fallout of splitting of ToPA entries which missed to shift
     the base entry PA correctly.

   - Use the correct context to lookup the AUX event when unmapping the
     associated AUX buffer so the event can be stopped and the buffer
     reference dropped.

  tools:

   - Fix buildiid-cache mode setting in copyfile_mode_ns() when copying
     /proc/kcore

   - Fix freeing id arrays in the event list so the correct event is
     closed.

   - Sync sched.h anc kvm.h headers with the kernel sources.

   - Link jvmti against tools/lib/ctype.o to have weak strlcpy().

   - Fix multiple memory and file descriptor leaks, found by coverity in
     perf annotate.

   - Fix leaks in error handling paths in 'perf c2c', 'perf kmem', found
     by a static analysis tool"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix AUX output stopping
  perf/aux: Fix tracking of auxiliary trace buffer allocation
  perf/x86/intel/pt: Fix base for single entry topa
  perf kmem: Fix memory leak in compact_gfp_flags()
  tools headers UAPI: Sync sched.h with the kernel
  tools headers kvm: Sync kvm.h headers with the kernel sources
  tools headers kvm: Sync kvm headers with the kernel sources
  tools headers kvm: Sync kvm headers with the kernel sources
  perf c2c: Fix memory leak in build_cl_output()
  perf tools: Fix mode setting in copyfile_mode_ns()
  perf annotate: Fix multiple memory and file descriptor leaks
  perf tools: Fix resource leak of closedir() on the error paths
  perf evlist: Fix fix for freed id arrays
  perf jvmti: Link against tools/lib/ctype.h to have weak strlcpy()
2019-10-27 06:59:34 -04:00
Alexander Shishkin
f3a519e4ad perf/aux: Fix AUX output stopping
Commit:

  8a58ddae23 ("perf/core: Fix exclusive events' grouping")

allows CAP_EXCLUSIVE events to be grouped with other events. Since all
of those also happen to be AUX events (which is not the case the other
way around, because arch/s390), this changes the rules for stopping the
output: the AUX event may not be on its PMU's context any more, if it's
grouped with a HW event, in which case it will be on that HW event's
context instead. If that's the case, munmap() of the AUX buffer can't
find and stop the AUX event, potentially leaving the last reference with
the atomic context, which will then end up freeing the AUX buffer. This
will then trip warnings:

Fix this by using the context's PMU context when looking for events
to stop, instead of the event's PMU context.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20191022073940.61814-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-22 14:39:37 +02:00
Ingo Molnar
aa7a7b7297 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-22 01:15:32 +02:00
Thomas Richter
5e6c3c7b1e perf/aux: Fix tracking of auxiliary trace buffer allocation
The following commit from the v5.4 merge window:

  d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")

... breaks auxiliary trace buffer tracking.

If I run command 'perf record -e rbd000' to record samples and saving
them in the **auxiliary** trace buffer then the value of 'locked_vm' becomes
negative after all trace buffers have been allocated and released:

During allocation the values increase:

  [52.250027] perf_mmap user->locked_vm:0x87 pinned_vm:0x0 ret:0
  [52.250115] perf_mmap user->locked_vm:0x107 pinned_vm:0x0 ret:0
  [52.250251] perf_mmap user->locked_vm:0x188 pinned_vm:0x0 ret:0
  [52.250326] perf_mmap user->locked_vm:0x208 pinned_vm:0x0 ret:0
  [52.250441] perf_mmap user->locked_vm:0x289 pinned_vm:0x0 ret:0
  [52.250498] perf_mmap user->locked_vm:0x309 pinned_vm:0x0 ret:0
  [52.250613] perf_mmap user->locked_vm:0x38a pinned_vm:0x0 ret:0
  [52.250715] perf_mmap user->locked_vm:0x408 pinned_vm:0x2 ret:0
  [52.250834] perf_mmap user->locked_vm:0x408 pinned_vm:0x83 ret:0
  [52.250915] perf_mmap user->locked_vm:0x408 pinned_vm:0x103 ret:0
  [52.251061] perf_mmap user->locked_vm:0x408 pinned_vm:0x184 ret:0
  [52.251146] perf_mmap user->locked_vm:0x408 pinned_vm:0x204 ret:0
  [52.251299] perf_mmap user->locked_vm:0x408 pinned_vm:0x285 ret:0
  [52.251383] perf_mmap user->locked_vm:0x408 pinned_vm:0x305 ret:0
  [52.251544] perf_mmap user->locked_vm:0x408 pinned_vm:0x386 ret:0
  [52.251634] perf_mmap user->locked_vm:0x408 pinned_vm:0x406 ret:0
  [52.253018] perf_mmap user->locked_vm:0x408 pinned_vm:0x487 ret:0
  [52.253197] perf_mmap user->locked_vm:0x408 pinned_vm:0x508 ret:0
  [52.253374] perf_mmap user->locked_vm:0x408 pinned_vm:0x589 ret:0
  [52.253550] perf_mmap user->locked_vm:0x408 pinned_vm:0x60a ret:0
  [52.253726] perf_mmap user->locked_vm:0x408 pinned_vm:0x68b ret:0
  [52.253903] perf_mmap user->locked_vm:0x408 pinned_vm:0x70c ret:0
  [52.254084] perf_mmap user->locked_vm:0x408 pinned_vm:0x78d ret:0
  [52.254263] perf_mmap user->locked_vm:0x408 pinned_vm:0x80e ret:0

The value of user->locked_vm increases to a limit then the memory
is tracked by pinned_vm.

During deallocation the size is subtracted from pinned_vm until
it hits a limit. Then a larger value is subtracted from locked_vm
leading to a large number (because of type unsigned):

  [64.267797] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x78d
  [64.267826] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x70c
  [64.267848] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x68b
  [64.267869] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x60a
  [64.267891] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x589
  [64.267911] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x508
  [64.267933] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x487
  [64.267952] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x406
  [64.268883] perf_mmap_close mmap_user->locked_vm:0x307 pinned_vm:0x406
  [64.269117] perf_mmap_close mmap_user->locked_vm:0x206 pinned_vm:0x406
  [64.269433] perf_mmap_close mmap_user->locked_vm:0x105 pinned_vm:0x406
  [64.269536] perf_mmap_close mmap_user->locked_vm:0x4 pinned_vm:0x404
  [64.269797] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff84 pinned_vm:0x303
  [64.270105] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff04 pinned_vm:0x202
  [64.270374] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe84 pinned_vm:0x101
  [64.270628] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe04 pinned_vm:0x0

This value sticks for the user until system is rebooted, causing
follow-on system calls using locked_vm resource limit to fail.

Note: There is no issue using the normal trace buffer.

In fact the issue is in perf_mmap_close(). During allocation auxiliary
trace buffer memory is either traced as 'extra' and added to 'pinned_vm'
or trace as 'user_extra' and added to 'locked_vm'. This applies for
normal trace buffers and auxiliary trace buffer.

However in function perf_mmap_close() all auxiliary trace buffer is
subtraced from 'locked_vm' and never from 'pinned_vm'. This breaks the
ballance.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: gor@linux.ibm.com
Cc: hechaol@fb.com
Cc: heiko.carstens@de.ibm.com
Cc: linux-perf-users@vger.kernel.org
Cc: songliubraving@fb.com
Fixes: d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")
Link: https://lkml.kernel.org/r/20191021083354.67868-1-tmricht@linux.ibm.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 11:31:24 +02:00
Song Liu
aa5de305c9 kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
Attaching uprobe to text section in THP splits the PMD mapped page table
into PTE mapped entries.  On uprobe detach, we would like to regroup PMD
mapped page table entry to regain performance benefit of THP.

However, the regroup is broken For perf_event based trace_uprobe.  This
is because perf_event based trace_uprobe calls uprobe_unregister twice
on close: first in TRACE_REG_PERF_CLOSE, then in
TRACE_REG_PERF_UNREGISTER.  The second call will split the PMD mapped
page table entry, which is not the desired behavior.

Fix this by only use FOLL_SPLIT_PMD for uprobe register case.

Add a WARN() to confirm uprobe unregister never work on huge pages, and
abort the operation when this WARN() triggers.

Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
Fixes: 5a52c9df62 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
Yunfeng Ye
d7e78706e4 perf/ring_buffer: Matching the memory allocate and free, in rb_alloc()
Currently perf_mmap_alloc_page() is used to allocate memory in
rb_alloc(), but using free_page() to free memory in the failure path.

It's better to use perf_mmap_free_page() instead.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.co>
Cc: <acme@kernel.org>
Cc: <mingo@redhat.com>
Cc: <mark.rutland@arm.com>
Cc: <namhyung@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/575c7e8c-90c7-4e3a-b41d-f894d8cdbd7f@huawei.com
2019-10-17 21:31:55 +02:00
Yunfeng Ye
8a9f91c51e perf/ring_buffer: Modify the parameter type of perf_mmap_free_page()
In perf_mmap_free_page(), the unsigned long type is converted to the
pointer type, but where the call is made, the pointer type is converted
to the unsigned long type. There is no need to do these operations.

Modify the parameter type of perf_mmap_free_page() to pointer type.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.co>
Cc: <acme@kernel.org>
Cc: <mingo@redhat.com>
Cc: <mark.rutland@arm.com>
Cc: <namhyung@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/e6ae3f0c-d04c-50f9-544a-aee3b30330cd@huawei.com
2019-10-17 21:31:55 +02:00
Joel Fernandes (Google)
da97e18458 perf_event: Add support for LSM and SELinux checks
In current mainline, the degree of access to perf_event_open(2) system
call depends on the perf_event_paranoid sysctl.  This has a number of
limitations:

1. The sysctl is only a single value. Many types of accesses are controlled
   based on the single value thus making the control very limited and
   coarse grained.
2. The sysctl is global, so if the sysctl is changed, then that means
   all processes get access to perf_event_open(2) opening the door to
   security issues.

This patch adds LSM and SELinux access checking which will be used in
Android to access perf_event_open(2) for the purposes of attaching BPF
programs to tracepoints, perf profiling and other operations from
userspace. These operations are intended for production systems.

5 new LSM hooks are added:
1. perf_event_open: This controls access during the perf_event_open(2)
   syscall itself. The hook is called from all the places that the
   perf_event_paranoid sysctl is checked to keep it consistent with the
   systctl. The hook gets passed a 'type' argument which controls CPU,
   kernel and tracepoint accesses (in this context, CPU, kernel and
   tracepoint have the same semantics as the perf_event_paranoid sysctl).
   Additionally, I added an 'open' type which is similar to
   perf_event_paranoid sysctl == 3 patch carried in Android and several other
   distros but was rejected in mainline [1] in 2016.

2. perf_event_alloc: This allocates a new security object for the event
   which stores the current SID within the event. It will be useful when
   the perf event's FD is passed through IPC to another process which may
   try to read the FD. Appropriate security checks will limit access.

3. perf_event_free: Called when the event is closed.

4. perf_event_read: Called from the read(2) and mmap(2) syscalls for the event.

5. perf_event_write: Called from the ioctl(2) syscalls for the event.

[1] https://lwn.net/Articles/696240/

Since Peter had suggest LSM hooks in 2016 [1], I am adding his
Suggested-by tag below.

To use this patch, we set the perf_event_paranoid sysctl to -1 and then
apply selinux checking as appropriate (default deny everything, and then
add policy rules to give access to domains that need it). In the future
we can remove the perf_event_paranoid sysctl altogether.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: rostedt@goodmis.org
Cc: Yonghong Song <yhs@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: jeffv@google.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: primiano@google.com
Cc: Song Liu <songliubraving@fb.com>
Cc: rsavitski@google.com
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Matthew Garrett <matthewgarrett@google.com>
Link: https://lkml.kernel.org/r/20191014170308.70668-1-joel@joelfernandes.org
2019-10-17 21:31:55 +02:00
Song Liu
7fa343b7fd perf/core: Fix corner case in perf_rotate_context()
In perf_rotate_context(), when the first cpu flexible event fail to
schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is
NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus
cpuctx->ctx.is_active will have EVENT_FLEXIBLE set. Then, the next
perf_event_sched_in() will skip all cpu flexible events because of the
EVENT_FLEXIBLE bit.

In the next call of perf_rotate_context(), cpu_rotate stays 1, and
cpu_event stays NULL, so this process repeats. The end result is, flexible
events on this cpu will not be scheduled (until another event being added
to the cpuctx).

Here is an easy repro of this issue. On Intel CPUs, where ref-cycles
could only use one counter, run one pinned event for ref-cycles, one
flexible event for ref-cycles, and one flexible event for cycles. The
flexible ref-cycles is never scheduled, which is expected. However,
because of this issue, the cycles event is never scheduled either.

 $ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000

           time             counts unit events
    1.000152973         15,412,480      ref-cycles:D
    1.000152973      <not counted>      ref-cycles     (0.00%)
    1.000152973      <not counted>      cycles         (0.00%)
    2.000486957         18,263,120      ref-cycles:D
    2.000486957      <not counted>      ref-cycles     (0.00%)
    2.000486957      <not counted>      cycles         (0.00%)

To fix this, when the flexible_active list is empty, try rotate the
first event in the flexible_groups. Also, rename ctx_first_active() to
ctx_event_to_rotate(), which is more accurate.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8d5bce0c37 ("perf/core: Optimize perf_rotate_context() event scheduling")
Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:44:13 +02:00
Song Liu
d44248a413 perf/core: Rework memory accounting in perf_mmap()
perf_mmap() always increases user->locked_vm. As a result, "extra" could
grow bigger than "user_extra", which doesn't make sense. Here is an
example case:

(Note: Assume "user_lock_limit" is very small.)

  | # of perf_mmap calls |vma->vm_mm->pinned_vm|user->locked_vm|
  | 0                    | 0                   | 0             |
  | 1                    | user_extra          | user_extra    |
  | 2                    | 3 * user_extra      | 2 * user_extra|
  | 3                    | 6 * user_extra      | 3 * user_extra|
  | 4                    | 10 * user_extra     | 4 * user_extra|

Fix this by maintaining proper user_extra and extra.

Reviewed-By: Hechao Li <hechaol@fb.com>
Reported-by: Hechao Li <hechaol@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Jie Meng <jmeng@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:44:12 +02:00
Alexander Shishkin
f733c6b508 perf/core: Fix inheritance of aux_output groups
Commit:

  ab43762ef0 ("perf: Allow normal events to output AUX data")

forgets to configure aux_output relation in the inherited groups, which
results in child PEBS events forever failing to schedule.

Fix this by setting up the AUX output link in the inheritance path.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191004125729.32397-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-07 16:50:42 +02:00
Aleksa Sarai
c2ba8f41ad perf_event_open: switch to copy_struct_from_user()
Switch perf_event_open() syscall from it's own copying
struct perf_event_attr from userspace to the new dedicated
copy_struct_from_user() helper.

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-5-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-01 15:45:22 +02:00
Linus Torvalds
aefcf2f4b5 Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull kernel lockdown mode from James Morris:
 "This is the latest iteration of the kernel lockdown patchset, from
  Matthew Garrett, David Howells and others.

  From the original description:

    This patchset introduces an optional kernel lockdown feature,
    intended to strengthen the boundary between UID 0 and the kernel.
    When enabled, various pieces of kernel functionality are restricted.
    Applications that rely on low-level access to either hardware or the
    kernel may cease working as a result - therefore this should not be
    enabled without appropriate evaluation beforehand.

    The majority of mainstream distributions have been carrying variants
    of this patchset for many years now, so there's value in providing a
    doesn't meet every distribution requirement, but gets us much closer
    to not requiring external patches.

  There are two major changes since this was last proposed for mainline:

   - Separating lockdown from EFI secure boot. Background discussion is
     covered here: https://lwn.net/Articles/751061/

   -  Implementation as an LSM, with a default stackable lockdown LSM
      module. This allows the lockdown feature to be policy-driven,
      rather than encoding an implicit policy within the mechanism.

  The new locked_down LSM hook is provided to allow LSMs to make a
  policy decision around whether kernel functionality that would allow
  tampering with or examining the runtime state of the kernel should be
  permitted.

  The included lockdown LSM provides an implementation with a simple
  policy intended for general purpose use. This policy provides a coarse
  level of granularity, controllable via the kernel command line:

    lockdown={integrity|confidentiality}

  Enable the kernel lockdown feature. If set to integrity, kernel features
  that allow userland to modify the running kernel are disabled. If set to
  confidentiality, kernel features that allow userland to extract
  confidential information from the kernel are also disabled.

  This may also be controlled via /sys/kernel/security/lockdown and
  overriden by kernel configuration.

  New or existing LSMs may implement finer-grained controls of the
  lockdown features. Refer to the lockdown_reason documentation in
  include/linux/security.h for details.

  The lockdown feature has had signficant design feedback and review
  across many subsystems. This code has been in linux-next for some
  weeks, with a few fixes applied along the way.

  Stephen Rothwell noted that commit 9d1f8be5cf ("bpf: Restrict bpf
  when kernel lockdown is in confidentiality mode") is missing a
  Signed-off-by from its author. Matthew responded that he is providing
  this under category (c) of the DCO"

* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
  kexec: Fix file verification on S390
  security: constify some arrays in lockdown LSM
  lockdown: Print current->comm in restriction messages
  efi: Restrict efivar_ssdt_load when the kernel is locked down
  tracefs: Restrict tracefs when the kernel is locked down
  debugfs: Restrict debugfs when the kernel is locked down
  kexec: Allow kexec_file() with appropriate IMA policy when locked down
  lockdown: Lock down perf when in confidentiality mode
  bpf: Restrict bpf when kernel lockdown is in confidentiality mode
  lockdown: Lock down tracing and perf kprobes when in confidentiality mode
  lockdown: Lock down /proc/kcore
  x86/mmiotrace: Lock down the testmmiotrace module
  lockdown: Lock down module params that specify hardware parameters (eg. ioport)
  lockdown: Lock down TIOCSSERIAL
  lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
  acpi: Disable ACPI table override if the kernel is locked down
  acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
  ACPI: Limit access to custom_method when the kernel is locked down
  x86/msr: Restrict MSR access when the kernel is locked down
  x86: Lock down IO port access when the kernel is locked down
  ...
2019-09-28 08:14:15 -07:00
Linus Torvalds
a7b7b772bb Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more perf updates from Ingo Molnar:
 "The only kernel change is comment typo fixes.

  The rest is mostly tooling fixes, but also new vendor event additions
  and updates, a bigger libperf/libtraceevent library and a header files
  reorganization that came in a bit late"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (108 commits)
  perf unwind: Fix libunwind build failure on i386 systems
  perf parser: Remove needless include directives
  perf build: Add detection of java-11-openjdk-devel package
  perf jvmti: Include JVMTI support for s390
  perf vendor events: Remove P8 HW events which are not supported
  perf evlist: Fix access of freed id arrays
  perf stat: Fix free memory access / memory leaks in metrics
  perf tools: Replace needless mmap.h with what is needed, event.h
  perf evsel: Move config terms to a separate header
  perf evlist: Remove unused perf_evlist__fprintf() method
  perf evsel: Introduce evsel_fprintf.h
  perf evsel: Remove need for symbol_conf in evsel_fprintf.c
  perf copyfile: Move copyfile routines to separate files
  libperf: Add perf_evlist__poll() function
  libperf: Add perf_evlist__add_pollfd() function
  libperf: Add perf_evlist__alloc_pollfd() function
  libperf: Add libperf_init() call to the tests
  libperf: Merge libperf_set_print() into libperf_init()
  libperf: Add libperf dependency for tests targets
  libperf: Use sys/types.h to get ssize_t, not unistd.h
  ...
2019-09-26 15:38:07 -07:00
Song Liu
f385cb85a4 uprobe: collapse THP pmd after removing all uprobes
After all uprobes are removed from the huge page (with PTE pgtable), it is
possible to collapse the pmd and benefit from THP again.  This patch does
the collapse by calling collapse_pte_mapped_thp().

Link: http://lkml.kernel.org/r/20190815164525.1848545-7-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu
5a52c9df62 uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT
Use the newly added FOLL_SPLIT_PMD in uprobe.  This preserves the huge
page when the uprobe is enabled.  When the uprobe is disabled, newer
instances of the same application could still benefit from huge page.

For the next step, we will enable khugepaged to regroup the pmd, so that
existing instances of the application could also benefit from huge page
after the uprobe is disabled.

Link: http://lkml.kernel.org/r/20190815164525.1848545-5-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu
fb4fb04ff4 uprobe: use original page when all uprobes are removed
Currently, uprobe swaps the target page with a anonymous page in both
install_breakpoint() and remove_breakpoint().  When all uprobes on a page
are removed, the given mm is still using an anonymous page (not the
original page).

This patch allows uprobe to use original page when possible (all uprobes
on the page are already removed, and the original page is in page cache
and uptodate).

As suggested by Oleg, we unmap the old_page and let the original page
fault in.

Link: http://lkml.kernel.org/r/20190815164525.1848545-3-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Roy Ben Shlomo
9f014e3a66 perf/core: Fix several typos in comments
Fix typos in a few functions' documentation comments.

Signed-off-by: Roy Ben Shlomo <royb@sentinelone.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: royb@sentinelone.com
Link: http://lore.kernel.org/lkml/20190920171254.31373-1-royb@sentinelone.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-09-20 16:05:20 -03:00
Linus Torvalds
7f2444d38f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
2019-09-17 12:35:15 -07:00
Linus Torvalds
7e67a85999 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
   Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
   Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

   As perf and the scheduler is getting bigger and more complex,
   document the status quo of current responsibilities and interests,
   and spread the review pain^H^H^H^H fun via an increase in the Cc:
   linecount generated by scripts/get_maintainer.pl. :-)

 - Add another series of patches that brings the -rt (PREEMPT_RT) tree
   closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
   into a new CONFIG_PREEMPTION category that will allow the eventual
   introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
   to go though.

 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
   allow the finer shaping of CPU bandwidth usage.

 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

 - Improve the behavior of high CPU count, high thread count
   applications running under cpu.cfs_quota_us constraints.

 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

 - Improve CPU isolation housekeeping CPU allocation NUMA locality.

 - Fix deadline scheduler bandwidth calculations and logic when cpusets
   rebuilds the topology, or when it gets deadline-throttled while it's
   being offlined.

 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
   setscheduler() system calls without creating global serialization.
   Add new synchronization between cpuset topology-changing events and
   the deadline acceptance tests in setscheduler(), which were broken
   before.

 - Rework the active_mm state machine to be less confusing and more
   optimal.

 - Rework (simplify) the pick_next_task() slowpath.

 - Improve load-balancing on AMD EPYC systems.

 - ... and misc cleanups, smaller fixes and improvements - please see
   the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq->lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...
2019-09-16 17:25:49 -07:00
Linus Torvalds
772c1d06bd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Improved kbprobes robustness

   - Intel PEBS support for PT hardware tracing

   - Other Intel PT improvements: high order pages memory footprint
     reduction and various related cleanups

   - Misc cleanups

  The perf tooling side has been very busy in this cycle, with over 300
  commits. This is an incomplete high-level summary of the many
  improvements done by over 30 developers:

   - Lots of updates to the following tools:

      'perf c2c'
      'perf config'
      'perf record'
      'perf report'
      'perf script'
      'perf test'
      'perf top'
      'perf trace'

   - Updates to libperf and libtraceevent, and a consolidation of the
     proliferation of x86 instruction decoder libraries.

   - Vendor event updates for Intel and PowerPC CPUs,

   - Updates to hardware tracing tooling for ARM and Intel CPUs,

   - ... and lots of other changes and cleanups - see the shortlog and
     Git log for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (322 commits)
  kprobes: Prohibit probing on BUG() and WARN() address
  perf/x86: Make more stuff static
  x86, perf: Fix the dependency of the x86 insn decoder selftest
  objtool: Ignore intentional differences for the x86 insn decoder
  objtool: Update sync-check.sh from perf's check-headers.sh
  perf build: Ignore intentional differences for the x86 insn decoder
  perf intel-pt: Use shared x86 insn decoder
  perf intel-pt: Remove inat.c from build dependency list
  perf: Update .gitignore file
  objtool: Move x86 insn decoder to a common location
  perf metricgroup: Support multiple events for metricgroup
  perf metricgroup: Scale the metric result
  perf pmu: Change convert_scale from static to global
  perf symbols: Move mem_info and branch_info out of symbol.h
  perf auxtrace: Uninline functions that touch perf_session
  perf tools: Remove needless evlist.h include directives
  perf tools: Remove needless evlist.h include directives
  perf tools: Remove needless thread_map.h include directives
  perf tools: Remove needless thread.h include directives
  perf tools: Remove needless map.h include directives
  ...
2019-09-16 17:06:21 -07:00
Ingo Molnar
563c4f85f9 Merge branch 'sched/rt' into sched/core, to pick up -rt changes
Pick up the first couple of patches working towards PREEMPT_RT.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-16 14:05:04 +02:00
Mark-PK Tsai
310aa0a25b perf/hw_breakpoint: Fix arch_hw_breakpoint use-before-initialization
If we disable the compiler's auto-initialization feature, if
-fplugin-arg-structleak_plugin-byref or -ftrivial-auto-var-init=pattern
are disabled, arch_hw_breakpoint may be used before initialization after:

  9a4903dde2 ("perf/hw_breakpoint: Split attribute parse and commit")

On our ARM platform, the struct step_ctrl in arch_hw_breakpoint, which
used to be zero-initialized by kzalloc(), may be used in
arch_install_hw_breakpoint() without initialization.

Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alix Wu <alix.wu@mediatek.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Link: https://lkml.kernel.org/r/20190906060115.9460-1-mark-pk.tsai@mediatek.com
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06 08:24:01 +02:00
Alexander Shishkin
ab43762ef0 perf: Allow normal events to output AUX data
In some cases, ordinary (non-AUX) events can generate data for AUX events.
For example, PEBS events can come out as records in the Intel PT stream
instead of their usual DS records, if configured to do so.

One requirement for such events is to consistently schedule together, to
ensure that the data from the "AUX output" events isn't lost while their
corresponding AUX event is not scheduled. We use grouping to provide this
guarantee: an "AUX output" event can be added to a group where an AUX event
is a group leader, and provided that the former supports writing to the
latter.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kan.liang@linux.intel.com
Link: https://lkml.kernel.org/r/20190806084606.4021-2-alexander.shishkin@linux.intel.com
2019-08-28 11:29:38 +02:00
David Howells
b0c8fdc7fd lockdown: Lock down perf when in confidentiality mode
Disallow the use of certain perf facilities that might allow userspace to
access kernel data.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
Sebastian Andrzej Siewior
30f9028b6c perf/core: Mark hrtimers to expire in hard interrupt context
To guarantee that the multiplexing mechanism and the hrtimer driven events
work on PREEMPT_RT enabled kernels it's required that the related hrtimers
expire in hard interrupt context. Mark them so PREEMPT_RT kernels wont
defer them to soft interrupt context.

No functional change.

[ tglx: Split out of larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.169509224@linutronix.de
2019-08-01 20:51:20 +02:00
Matthew Wilcox (Oracle)
7b3c92b85a sched/core: Convert get_task_struct() to return the task
Returning the pointer that was passed in allows us to write
slightly more idiomatic code.  Convert a few users.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:54 +02:00
Leonard Crestez
4ce54af8b3 perf/core: Fix creating kernel counters for PMUs that override event->cpu
Some hardware PMU drivers will override perf_event.cpu inside their
event_init callback. This causes a lockdep splat when initialized through
the kernel API:

 WARNING: CPU: 0 PID: 250 at kernel/events/core.c:2917 ctx_sched_out+0x78/0x208
 pc : ctx_sched_out+0x78/0x208
 Call trace:
  ctx_sched_out+0x78/0x208
  __perf_install_in_context+0x160/0x248
  remote_function+0x58/0x68
  generic_exec_single+0x100/0x180
  smp_call_function_single+0x174/0x1b8
  perf_install_in_context+0x178/0x188
  perf_event_create_kernel_counter+0x118/0x160

Fix this by calling perf_install_in_context with event->cpu, just like
perf_event_open

Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Frank Li <Frank.li@nxp.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/c4ebe0503623066896d7046def4d6b1e06e0eb2e.1563972056.git.leonard.crestez@nxp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:41:31 +02:00
Linus Torvalds
4f5ed1318c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  perf_event_get(): don't bother with fget_raw()
  vfs: update d_make_root() description
2019-07-19 11:35:08 -07:00
Alexander Shishkin
8a58ddae23 perf/core: Fix exclusive events' grouping
So far, we tried to disallow grouping exclusive events for the fear of
complications they would cause with moving between contexts. Specifically,
moving a software group to a hardware context would violate the exclusivity
rules if both groups contain matching exclusive events.

This attempt was, however, unsuccessful: the check that we have in the
perf_event_open() syscall is both wrong (looks at wrong PMU) and
insufficient (group leader may still be exclusive), as can be illustrated
by running:

  $ perf record -e '{intel_pt//,cycles}' uname
  $ perf record -e '{cycles,intel_pt//}' uname

ultimately successfully.

Furthermore, we are completely free to trigger the exclusivity violation
by:

   perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'

even though the helpful perf record will not allow that, the ABI will.

The warning later in the perf_event_open() path will also not trigger, because
it's also wrong.

Fix all this by validating the original group before moving, getting rid
of broken safeguards and placing a useful one to perf_install_in_context().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mathieu.poirier@linaro.org
Cc: will.deacon@arm.com
Fixes: bed5b25ad9 ("perf: Add a pmu capability for "exclusive" events")
Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:21:28 +02:00
Peter Zijlstra
1cf8dfe8a6 perf/core: Fix race between close() and fork()
Syzcaller reported the following Use-after-Free bug:

	close()						clone()

							  copy_process()
							    perf_event_init_task()
							      perf_event_init_context()
							        mutex_lock(parent_ctx->mutex)
								inherit_task_group()
								  inherit_group()
								    inherit_event()
								      mutex_lock(event->child_mutex)
								      // expose event on child list
								      list_add_tail()
								      mutex_unlock(event->child_mutex)
							        mutex_unlock(parent_ctx->mutex)

							    ...
							    goto bad_fork_*

							  bad_fork_cleanup_perf:
							    perf_event_free_task()

	  perf_release()
	    perf_event_release_kernel()
	      list_for_each_entry()
		mutex_lock(ctx->mutex)
		mutex_lock(event->child_mutex)
		// event is from the failing inherit
		// on the other CPU
		perf_remove_from_context()
		list_move()
		mutex_unlock(event->child_mutex)
		mutex_unlock(ctx->mutex)

							      mutex_lock(ctx->mutex)
							      list_for_each_entry_safe()
							        // event already stolen
							      mutex_unlock(ctx->mutex)

							    delayed_free_task()
							      free_task()

	     list_for_each_entry_safe()
	       list_del()
	       free_event()
	         _free_event()
		   // and so event->hw.target
		   // is the already freed failed clone()
		   if (event->hw.target)
		     put_task_struct(event->hw.target)
		       // WHOOPSIE, already quite dead

Which puts the lie to the the comment on perf_event_free_task():
'unexposed, unused context' not so much.

Which is a 'fun' confluence of fail; copy_process() doing an
unconditional free_task() and not respecting refcounts, and perf having
creative locking. In particular:

  82d94856fa ("perf/core: Fix lock inversion between perf,trace,cpuhp")

seems to have overlooked this 'fun' parade.

Solve it by using the fact that detached events still have a reference
count on their (previous) context. With this perf_event_free_task()
can detect when events have escaped and wait for their destruction.

Debugged-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 82d94856fa ("perf/core: Fix lock inversion between perf,trace,cpuhp")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:21:25 +02:00
Linus Torvalds
608745f124 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main changes in this cycle on the kernel side were:

   - CPU PMU and uncore driver updates to Intel Snow Ridge, IceLake,
     KabyLake, AmberLake and WhiskeyLake CPUs.

   - Rework the MSR probing infrastructure to make it more robust, make
     it work better on virtualized systems and to better expose it on
     sysfs.

   - Rework PMU attributes group support based on the feedback from
     Greg. The core sysfs patch that adds sysfs_update_groups() was
     acked by Greg.

  There's a lot of perf tooling changes as well, all around the place:

   - vendor updates to Intel, cs-etm (ARM), ARM64, s390,

   - various enhancements to Intel PT tooling support:
      - Improve CBR (Core to Bus Ratio) packets support.
      - Export power and ptwrite events to sqlite and postgresql.
      - Add support for decoding PEBS via PT packets.
      - Add support for samples to contain IPC ratio, collecting cycles
        information from CYC packets, showing the IPC info periodically
      - Allow using time ranges

   - lots of updates to perf pmu, perf stat, perf trace, eBPF support,
     perf record, perf diff, etc. - please see the shortlog and Git log
     for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (252 commits)
  tools arch x86: Sync asm/cpufeatures.h with the with the kernel
  tools build: Check if gettid() is available before providing helper
  perf jvmti: Address gcc string overflow warning for strncpy()
  perf python: Remove -fstack-protector-strong if clang doesn't have it
  perf annotate TUI browser: Do not use member from variable within its own initialization
  perf tests: Fix record+probe_libc_inet_pton.sh for powerpc64
  perf evsel: Do not rely on errno values for precise_ip fallback
  perf thread: Allow references to thread objects after machine__exit()
  perf header: Assign proper ff->ph in perf_event__synthesize_features()
  tools arch kvm: Sync kvm headers with the kernel sources
  perf script: Allow specifying the files to process guest samples
  perf tools metric: Don't include duration_time in group
  perf list: Avoid extra : for --raw metrics
  perf vendor events intel: Metric fixes for SKX/CLX
  perf tools: Fix typos / broken sentences
  perf jevents: Add support for Hisi hip08 L3C PMU aliasing
  perf jevents: Add support for Hisi hip08 HHA PMU aliasing
  perf jevents: Add support for Hisi hip08 DDRC PMU aliasing
  perf pmu: Support more complex PMU event aliasing
  perf diff: Documentation -c cycles option
  ...
2019-07-09 11:15:52 -07:00
Linus Torvalds
5ad18b2e60 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
 "A source of error over the years has been that force_sig has taken a
  task parameter when it is only safe to use force_sig with the current
  task.

  The force_sig function is built for delivering synchronous signals
  such as SIGSEGV where the userspace application caused a synchronous
  fault (such as a page fault) and the kernel responded with a signal.

  Because the name force_sig does not make this clear, and because the
  force_sig takes a task parameter the function force_sig has been
  abused for sending other kinds of signals over the years. Slowly those
  have been fixed when the oopses have been tracked down.

  This set of changes fixes the remaining abusers of force_sig and
  carefully rips out the task parameter from force_sig and friends
  making this kind of error almost impossible in the future"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
  signal: Remove the signal number and task parameters from force_sig_info
  signal: Factor force_sig_info_to_task out of force_sig_info
  signal: Generate the siginfo in force_sig
  signal: Move the computation of force into send_signal and correct it.
  signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
  signal: Remove the task parameter from force_sig_fault
  signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
  signal: Explicitly call force_sig_fault on current
  signal/unicore32: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from ptrace_break
  signal/nds32: Remove tsk parameter from send_sigtrap
  signal/riscv: Remove tsk parameter from do_trap
  signal/sh: Remove tsk parameter from force_sig_info_fault
  signal/um: Remove task parameter from send_sigtrap
  signal/x86: Remove task parameter from send_sigtrap
  signal: Remove task parameter from force_sig_mceerr
  signal: Remove task parameter from force_sig
  signal: Remove task parameter from force_sigsegv
  ...
2019-07-08 21:48:15 -07:00
Linus Torvalds
46f1ec23a4 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The changes in this cycle are:

   - RCU flavor consolidation cleanups and optmizations

   - Documentation updates

   - Miscellaneous fixes

   - SRCU updates

   - RCU-sync flavor consolidation

   - Torture-test updates

   - Linux-kernel memory-consistency-model updates, most notably the
     addition of plain C-language accesses"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  tools/memory-model: Improve data-race detection
  tools/memory-model: Change definition of rcu-fence
  tools/memory-model: Expand definition of barrier
  tools/memory-model: Do not use "herd" to refer to "herd7"
  tools/memory-model: Fix comment in MP+poonceonces.litmus
  Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()
  rcu: Don't return a value from rcu_assign_pointer()
  rcu: Force inlining of rcu_read_lock()
  rcu: Fix irritating whitespace error in rcu_assign_pointer()
  rcu: Upgrade sync_exp_work_done() to smp_mb()
  rcutorture: Upper case solves the case of the vanishing NULL pointer
  torture: Suppress propagating trace_printk() warning
  rcutorture: Dump trace buffer for callback pipe drain failures
  torture: Add --trust-make to suppress "make clean"
  torture: Make --cpus override idleness calculations
  torture: Run kernel build in source directory
  torture: Add function graph-tracing cheat sheet
  torture: Capture qemu output
  rcutorture: Tweak kvm options
  rcutorture: Add trivial RCU implementation
  ...
2019-07-08 15:45:14 -07:00
Linus Torvalds
927ba67a63 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The timer and timekeeping departement delivers:

  Core:

   - The consolidation of the VDSO code into a generic library including
     the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
     route through the relevant maintainer trees and should end up in
     5.4.

     This gets rid of the unnecessary different copies of the same code
     and brings all architectures on the same level of VDSO
     functionality.

   - Make the NTP user space interface more robust by restricting the
     TAI offset to prevent undefined behaviour. Includes a selftest.

   - Validate user input in the compat settimeofday() syscall to catch
     invalid values which would be turned into valid values by a
     multiplication overflow

   - Consolidate the time accessors

   - Small fixes, improvements and cleanups all over the place

  Drivers:

   - Support for the NXP system counter, TI davinci timer

   - Move the Microsoft HyperV clocksource/events code into the
     drivers/clocksource directory so it can be shared between x86 and
     ARM64.

   - Overhaul of the Tegra driver

   - Delay timer support for IXP4xx

   - Small fixes, improvements and cleanups as usual"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  time: Validate user input in compat_settimeofday()
  timer: Document TIMER_PINNED
  clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
  clocksource/drivers: Make Hyper-V clocksource ISA agnostic
  MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
  hrtimer: Use a bullet for the returns bullet list
  arm64: vdso: Fix compilation with clang older than 8
  arm64: compat: Fix __arch_get_hw_counter() implementation
  arm64: Fix __arch_get_hw_counter() implementation
  lib/vdso: Make delta calculation work correctly
  MAINTAINERS: Add entry for the generic VDSO library
  arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
  arm64: vdso: Remove unnecessary asm-offsets.c definitions
  vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
  clocksource/drivers/davinci: Add support for clocksource
  clocksource/drivers/davinci: Add support for clockevents
  clocksource/drivers/tegra: Set up maximum-ticks limit properly
  clocksource/drivers/tegra: Cycles can't be 0
  clocksource/drivers/tegra: Restore base address before cleanup
  clocksource/drivers/tegra: Add verbose definition for 1MHz constant
  ...
2019-07-08 11:06:29 -07:00
Ingo Molnar
552a031ba1 Linux 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0idTweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGZesIAJDKicw2Voyx8K8m
 3pXSK+71RuO/d3Y9M51mdfTMKRP4PHR9/4wVZ9wHPwC4dV6wxgsmIYCF69a1Wety
 LD1MpDCP1DK5wVfPNKVX2xmj7ua6iutPtSsJHzdzM2TlscgsrFKjmUccqJ5JLwL5
 c34nqwXWnzzRyI5Ga9cQSlwzAXq0vDHXyML3AnCosSsLX0lKFrHlK1zttdOPNkfj
 dXRN62g3q+9kVQozzhDXb8atZZ7IkBk8Q0lujpNXW83Ci1VjaVNv3SB8GZTXIlLj
 U15VdyuwfJDfpBgFBN6/unzVaAB6FFrEKy0jT1aeTyKarMKDKgOnJjn10aKjDNno
 /bXsKKc=
 =TVqV
 -----END PGP SIGNATURE-----

Merge tag 'v5.2' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-08 18:04:41 +02:00
Ingo Molnar
83086d654d Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull rcu/next + tools/memory-model changes from Paul E. McKenney:

 - RCU flavor consolidation cleanups and optmizations
 - Documentation updates
 - Miscellaneous fixes
 - SRCU updates
 - RCU-sync flavor consolidation
 - Torture-test updates
 - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-28 19:46:47 +02:00
Al Viro
02e5ad9738 perf_event_get(): don't bother with fget_raw()
... since we immediately follow that with check that it *is* an
opened perf file, with O_PATH ones ending with with the same
-EBADF we'd get for descriptor that isn't opened at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-26 20:43:53 -04:00
Ian Rogers
fd7d55172d perf/cgroups: Don't rotate events for cgroups unnecessarily
Currently perf_rotate_context assumes that if the context's nr_events !=
nr_active a rotation is necessary for perf event multiplexing. With
cgroups, nr_events is the total count of events for all cgroups and
nr_active will not include events in a cgroup other than the current
task's. This makes rotation appear necessary for cgroups when it is not.

Add a perf_event_context flag that is set when rotation is necessary.
Clear the flag during sched_out and set it when a flexible sched_in
fails due to resources.

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20190601082722.44543-1-irogers@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:30:04 +02:00
Kan Liang
e321d02db8 perf/x86: Disable extended registers for non-supported PMUs
The perf fuzzer caused Skylake machine to crash:

[ 9680.085831] Call Trace:
[ 9680.088301]  <IRQ>
[ 9680.090363]  perf_output_sample_regs+0x43/0xa0
[ 9680.094928]  perf_output_sample+0x3aa/0x7a0
[ 9680.099181]  perf_event_output_forward+0x53/0x80
[ 9680.103917]  __perf_event_overflow+0x52/0xf0
[ 9680.108266]  ? perf_trace_run_bpf_submit+0xc0/0xc0
[ 9680.113108]  perf_swevent_hrtimer+0xe2/0x150
[ 9680.117475]  ? check_preempt_wakeup+0x181/0x230
[ 9680.122091]  ? check_preempt_curr+0x62/0x90
[ 9680.126361]  ? ttwu_do_wakeup+0x19/0x140
[ 9680.130355]  ? try_to_wake_up+0x54/0x460
[ 9680.134366]  ? reweight_entity+0x15b/0x1a0
[ 9680.138559]  ? __queue_work+0x103/0x3f0
[ 9680.142472]  ? update_dl_rq_load_avg+0x1cd/0x270
[ 9680.147194]  ? timerqueue_del+0x1e/0x40
[ 9680.151092]  ? __remove_hrtimer+0x35/0x70
[ 9680.155191]  __hrtimer_run_queues+0x100/0x280
[ 9680.159658]  hrtimer_interrupt+0x100/0x220
[ 9680.163835]  smp_apic_timer_interrupt+0x6a/0x140
[ 9680.168555]  apic_timer_interrupt+0xf/0x20
[ 9680.172756]  </IRQ>

The XMM registers can only be collected by PEBS hardware events on the
platforms with PEBS baseline support, e.g. Icelake, not software/probe
events.

Add capabilities flag PERF_PMU_CAP_EXTENDED_REGS to indicate the PMU
which support extended registers. For X86, the extended registers are
XMM registers.

Add has_extended_regs() to check if extended registers are applied.

The generic code define the mask of extended registers as 0 if arch
headers haven't overridden it.

Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 878068ea27 ("perf/x86: Support outputting XMM registers")
Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:23 +02:00
Ravi Bangoria
913a90bc5a perf/ioctl: Add check for the sample_period value
perf_event_open() limits the sample_period to 63 bits. See:

  0819b2e30c ("perf: Limit perf_event_attr::sample_period to 63 bits")

Make ioctl() consistent with it.

Also on PowerPC, negative sample_period could cause a recursive
PMIs leading to a hang (reported when running perf-fuzzer).

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: maddy@linux.vnet.ibm.com
Cc: mpe@ellerman.id.au
Fixes: 0819b2e30c ("perf: Limit perf_event_attr::sample_period to 63 bits")
Link: https://lkml.kernel.org/r/20190604042953.914-1-ravi.bangoria@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:22 +02:00
Jason A. Donenfeld
9285ec4c8b timekeeping: Use proper clock specifier names in functions
This makes boot uniformly boottime and tai uniformly clocktai, to
address the remaining oversights.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com
2019-06-22 12:11:27 +02:00
Peter Zijlstra
085ebfe937 perf/core: Fix perf_sample_regs_user() mm check
perf_sample_regs_user() uses 'current->mm' to test for the presence of
userspace, but this is insufficient, consider use_mm().

A better test is: '!(current->flags & PF_KTHREAD)', exec() clears
PF_KTHREAD after it sets the new ->mm but before it drops to userspace
for the first time.

Possibly obsoletes: bf05fc25f2 ("powerpc/perf: Fix oops when kthread execs user process")

Reported-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reported-by: Young Xiao <92siuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4018994f3d ("perf: Add ability to attach user level registers dump to sample")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:11:58 +02:00
Jiri Olsa
f3a3a8257e perf/core: Add attr_groups_update into struct pmu
Adding attr_update attribute group into pmu, to allow
having multiple attribute groups for same group name.

This will allow us to update "events" or "format"
directories with attributes that depend on various
HW conditions.

For example having group_format_extra group that updates
"format" directory only if pmu version is 2 and higher:

  static umode_t
  exra_is_visible(struct kobject *kobj, struct attribute *attr, int i)
  {
         return x86_pmu.version >= 2 ? attr->mode : 0;
  }

  static struct attribute_group group_format_extra = {
         .name       = "format",
         .is_visible = exra_is_visible,
  };

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190512155518.21468-3-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:58:21 +02:00
Song Liu
9fd2e48b9a perf/core: Allow non-privileged uprobe for user processes
Currently, non-privileged user could only use uprobe with

    kernel.perf_event_paranoid = -1

However, setting perf_event_paranoid to -1 leaks other users' processes to
non-privileged uprobes.

To introduce proper permission control of uprobes, we are building the
following system:

  A daemon with CAP_SYS_ADMIN is in charge to create uprobes via tracefs;
  Users asks the daemon to create uprobes;
  Then user can attach uprobe only to processes owned by the user.

This patch allows non-privileged user to attach uprobe to processes owned
by the user.

The following example shows how to use uprobe with non-privileged user.
This is based on Brendan's blog post [1]

1. Create uprobe with root:

  sudo perf probe -x 'readline%return +0($retval):string'

2. Then non-root user can use the uprobe as:

  perf record -vvv -e probe_bash:readline__return -p <pid> sleep 20
  perf script

[1] http://www.brendangregg.com/blog/2015-06-28/linux-ftrace-uprobe.html

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190507161545.788381-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:58:18 +02:00
Oleg Nesterov
2bf1acc299 uprobes: Use DEFINE_STATIC_PERCPU_RWSEM() to initialize dup_mmap_sem
Use DEFINE_STATIC_PERCPU_RWSEM() to initialize dup_mmap_sem.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:05:23 -07:00
Eric W. Biederman
3cf5d076fb signal: Remove task parameter from force_sig
All of the remaining callers pass current into force_sig so
remove the task parameter to make this obvious and to make
misuse more difficult in the future.

This also makes it clear force_sig passes current into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Peter Zijlstra
5322ea58a0 perf/ring-buffer: Use regular variables for nesting
While the IRQ/NMI will nest, the nest-count will be invariant over the
actual exception, since it will decrement equal to increment.

This means we can -- carefully -- use a regular variable since the
typical LOAD-STORE race doesn't exist (similar to preempt_count).

This optimizes the ring-buffer for all LOAD-STORE architectures, since
they need to use atomic ops to implement local_t.

Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Cc: yabinc@google.com
Link: http://lkml.kernel.org/r/20190517115418.481392777@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:00:11 +02:00
Peter Zijlstra
4d839dd9e4 perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data
We must use {READ,WRITE}_ONCE() on rb->user_page data such that
concurrent usage will see whole values. A few key sites were missing
this.

Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Fixes: 7b732a7504 ("perf_counter: new output ABI - part 1")
Link: http://lkml.kernel.org/r/20190517115418.394192145@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:00:11 +02:00
Peter Zijlstra
3f9fbe9bd8 perf/ring_buffer: Add ordering to rb->nest increment
Similar to how decrementing rb->next too early can cause data_head to
(temporarily) be observed to go backward, so too can this happen when
we increment too late.

This barrier() ensures the rb->head load happens after the increment,
both the one in the 'goto again' path, as the one from
perf_output_get_handle() -- albeit very unlikely to matter for the
latter.

Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Fixes: ef60777c9a ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.309516009@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:00:10 +02:00
Yabin Cui
1b038c6e05 perf/ring_buffer: Fix exposing a temporarily decreased data_head
In perf_output_put_handle(), an IRQ/NMI can happen in below location and
write records to the same ring buffer:

	...
	local_dec_and_test(&rb->nest)
	...                          <-- an IRQ/NMI can happen here
	rb->user_page->data_head = head;
	...

In this case, a value A is written to data_head in the IRQ, then a value
B is written to data_head after the IRQ. And A > B. As a result,
data_head is temporarily decreased from A to B. And a reader may see
data_head < data_tail if it read the buffer frequently enough, which
creates unexpected behaviors.

This can be fixed by moving dec(&rb->nest) to after updating data_head,
which prevents the IRQ/NMI above from updating data_head.

[ Split up by peterz. ]

Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mark.rutland@arm.com
Fixes: ef60777c9a ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.224478157@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:00:10 +02:00
Jérôme Glisse
7269f99993 mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier
event that represent what is happening to the CPU page table.  See the
patch which introduced the events to see the rational behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Jérôme Glisse
6f4f13e8d9 mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).

Users of mmu notifier API track changes to the CPU page table and take
specific action for them.  While current API only provide range of virtual
address affected by the change, not why the changes is happening.

This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma).  Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.

The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens.  A latter patch do convert mm call path to use a more appropriate
events for each call.

This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:

%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }

@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)

@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Linus Torvalds
0968621917 Printk changes for 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA
 lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0
 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO
 EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK
 r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k
 FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq
 uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj
 lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR
 waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w
 wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55
 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10
 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY=
 =WZfC
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Allow state reset of printk_once() calls.

 - Prevent crashes when dereferencing invalid pointers in vsprintf().
   Only the first byte is checked for simplicity.

 - Make vsprintf warnings consistent and inlined.

 - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
   modifiers.

 - Some clean up of vsprintf and test_printf code.

* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  lib/vsprintf: Make function pointer_string static
  vsprintf: Limit the length of inlined error messages
  vsprintf: Avoid confusion between invalid address and value
  vsprintf: Prevent crash when dereferencing invalid pointers
  vsprintf: Consolidate handling of unknown pointer specifiers
  vsprintf: Factor out %pO handler as kobject_string()
  vsprintf: Factor out %pV handler as va_format()
  vsprintf: Factor out %p[iI] handler as ip_addr_string()
  vsprintf: Do not check address of well-known strings
  vsprintf: Consistent %pK handling for kptr_restrict == 0
  vsprintf: Shuffle restricted_pointer()
  printk: Tie printk_once / printk_deferred_once into .data.once for reset
  treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
  lib/test_printf: Switch to bitmap_zalloc()
2019-05-07 09:18:12 -07:00
Linus Torvalds
0bc40e549a Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "The changes in here are:

   - text_poke() fixes and an extensive set of executability lockdowns,
     to (hopefully) eliminate the last residual circumstances under
     which we are using W|X mappings even temporarily on x86 kernels.
     This required a broad range of surgery in text patching facilities,
     module loading, trampoline handling and other bits.

   - tweak page fault messages to be more informative and more
     structured.

   - remove DISCONTIGMEM support on x86-32 and make SPARSEMEM the
     default.

   - reduce KASLR granularity on 5-level paging kernels from 512 GB to
     1 GB.

   - misc other changes and updates"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/mm: Initialize PGD cache during mm initialization
  x86/alternatives: Add comment about module removal races
  x86/kprobes: Use vmalloc special flag
  x86/ftrace: Use vmalloc special flag
  bpf: Use vmalloc special flag
  modules: Use vmalloc special flag
  mm/vmalloc: Add flag for freeing of special permsissions
  mm/hibernation: Make hibernation handle unmapped pages
  x86/mm/cpa: Add set_direct_map_*() functions
  x86/alternatives: Remove the return value of text_poke_*()
  x86/jump-label: Remove support for custom text poker
  x86/modules: Avoid breaking W^X while loading modules
  x86/kprobes: Set instruction page as executable
  x86/ftrace: Set trampoline pages as executable
  x86/kgdb: Avoid redundant comparison of patched code
  x86/alternatives: Use temporary mm for text poking
  x86/alternatives: Initialize temporary mm for patching
  fork: Provide a function for copying init_mm
  uprobes: Initialize uprobes earlier
  x86/mm: Save debug registers when loading a temporary mm
  ...
2019-05-06 16:13:31 -07:00
Linus Torvalds
90489a72fb Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main kernel changes were:

   - add support for Intel's "adaptive PEBS v4" - which embedds LBS data
     in PEBS records and can thus batch up and reduce the IRQ (NMI) rate
     significantly - reducing overhead and making call-graph profiling
     less intrusive.

   - add Intel CPU core and uncore support updates for Tremont, Icelake,

   - extend the x86 PMU constraints scheduler with 'constraint ranges'
     to better support Icelake hw constraints,

   - make x86 call-chain support work better with CONFIG_FRAME_POINTER=y

   - misc other changes

  Tooling changes:

   - updates to the main tools: 'perf record', 'perf trace', 'perf
     stat'

   - updated Intel and S/390 vendor events

   - libtraceevent updates

   - misc other updates and fixes"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER
  watchdog: Fix typo in comment
  perf/x86/intel: Add Tremont core PMU support
  perf/x86/intel/uncore: Add Intel Icelake uncore support
  perf/x86/msr: Add Icelake support
  perf/x86/intel/rapl: Add Icelake support
  perf/x86/intel/cstate: Add Icelake support
  perf/x86/intel: Add Icelake support
  perf/x86: Support constraint ranges
  perf/x86/lbr: Avoid reading the LBRs when adaptive PEBS handles them
  perf/x86/intel: Support adaptive PEBS v4
  perf/x86/intel/ds: Extract code of event update in short period
  perf/x86/intel: Extract memory code PEBS parser for reuse
  perf/x86: Support outputting XMM registers
  perf/x86/intel: Force resched when TFA sysctl is modified
  perf/core: Add perf_pmu_resched() as global function
  perf/headers: Fix stale comment for struct perf_addr_filter
  perf/core: Make perf_swevent_init_cpu() static
  perf/x86: Add sanity checks to x86_schedule_events()
  perf/x86: Optimize x86_schedule_events()
  ...
2019-05-06 14:16:36 -07:00
Alexander Shishkin
26ae4f4406 perf/ring_buffer: Fix AUX software double buffering
This recent commit:

  5768402fd9 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically")

overlooked the fact that the previous one page granularity of the AUX buffer
provided an implicit double buffering capability to the PMU driver, which
went away when the entire buffer became one high-order page.

Always make the full-trace mode AUX allocation at least two-part to preserve
the previous behavior and allow the implicit double buffering to continue.

Reported-by: Ammy Yi <ammy.yi@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Fixes: 5768402fd9 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically")
Link: http://lkml.kernel.org/r/20190503085536.24119-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 12:46:10 +02:00
Nadav Amit
aad42dd44d uprobes: Initialize uprobes earlier
In order to have a separate address space for text poking, we need to
duplicate init_mm early during start_kernel(). This, however, introduces
a problem since uprobes functions are called from dup_mmap(), but
uprobes is still not initialized in this early stage.

Since uprobes initialization is necassary for fork, and since all the
dependant initialization has been done when fork is initialized (percpu
and vmalloc), move uprobes initialization to fork_init(). It does not
seem uprobes introduces any security problem for the poking_mm.

Crash and burn if uprobes initialization fails, similarly to other early
initializations. Change the init_probes() name to probes_init() to match
other early initialization functions name convention.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: ard.biesheuvel@linaro.org
Cc: deneen.t.dock@intel.com
Cc: kernel-hardening@lists.openwall.com
Cc: kristen@linux.intel.com
Cc: linux_dti@icloud.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190426232303.28381-6-nadav.amit@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:51 +02:00
Stephane Eranian
c68d224e5e perf/core: Add perf_pmu_resched() as global function
This patch add perf_pmu_resched() a global function that can be called
to force rescheduling of events for a given PMU. The function locks
both cpuctx and task_ctx internally. This will be used by a subsequent
patch.

Signed-off-by: Stephane Eranian <eranian@google.com>
[ Simplified the calling convention. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: kan.liang@intel.com
Cc: nelson.dsouza@intel.com
Cc: tonyj@suse.com
Link: https://lkml.kernel.org/r/20190408173252.37932-2-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16 12:19:34 +02:00
Ingo Molnar
cc8670945d Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16 12:14:46 +02:00
Alexander Shishkin
339bc41835 perf/ring_buffer: Fix AUX record suppression
The following commit:

  1627314fb5 ("perf: Suppress AUX/OVERWRITE records")

has an unintended side-effect of also suppressing all AUX records with no flags
and non-zero size, so all the regular records in the full trace mode.
This breaks some use cases for people.

Fix this by restoring "regular" AUX records.

Reported-by: Ben Gainey <Ben.Gainey@arm.com>
Tested-by: Ben Gainey <Ben.Gainey@arm.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 1627314fb5 ("perf: Suppress AUX/OVERWRITE records")
Link: https://lkml.kernel.org/r/20190329091338.29999-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16 12:13:57 +02:00
Alexander Shishkin
52a44f83fc perf/core: Fix the address filtering fix
The following recent commit:

  c60f83b813 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset")

changes the address filtering logic to communicate filter ranges to the PMU driver
via a single address range object, instead of having the driver do the final bit of
math.

That change forgets to take into account kernel filters, which are not calculated
the same way as DSO based filters.

Fix that by passing the kernel filters the same way as file-based filters.
This doesn't require any additional changes in the drivers.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: c60f83b813 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset")
Link: https://lkml.kernel.org/r/20190329091212.29870-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16 12:13:57 +02:00
Ingo Molnar
496156e364 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16 12:02:43 +02:00
Peter Zijlstra
1d54ad9440 perf/core: Fix perf_event_disable_inatomic() race
Thomas-Mich Richter reported he triggered a WARN()ing from event_function_local()
on his s390. The problem boils down to:

	CPU-A				CPU-B

	perf_event_overflow()
	  perf_event_disable_inatomic()
	    @pending_disable = 1
	    irq_work_queue();

	sched-out
	  event_sched_out()
	    @pending_disable = 0

					sched-in
					perf_event_overflow()
					  perf_event_disable_inatomic()
					    @pending_disable = 1;
					    irq_work_queue(); // FAILS

	irq_work_run()
	  perf_pending_event()
	    if (@pending_disable)
	      perf_event_disable_local(); // WHOOPS

The problem exists in generic, but s390 is particularly sensitive
because it doesn't implement arch_irq_work_raise(), nor does it call
irq_work_run() from it's PMU interrupt handler (nor would that be
sufficient in this case, because s390 also generates
perf_event_overflow() from pmu::stop). Add to that the fact that s390
is a virtual architecture and (virtual) CPU-A can stall long enough
for the above race to happen, even if it would self-IPI.

Adding a irq_work_sync() to event_sched_in() would work for all hardare
PMUs that properly use irq_work_run() but fails for software PMUs.

Instead encode the CPU number in @pending_disable, such that we can
tell which CPU requested the disable. This then allows us to detect
the above scenario and even redirect the IPI to make up for the failed
queue.

Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-12 08:55:55 +02:00
Sakari Ailus
d75f773c86 treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
%pF and %pf are functionally equivalent to %pS and %ps conversion
specifiers. The former are deprecated, therefore switch the current users
to use the preferred variant.

The changes have been produced by the following command:

	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

And verifying the result.

Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-pci@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-mm@kvack.org
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-04-09 14:19:06 +02:00
Valdis Kletnieks
d18bf4229b perf/core: Make perf_swevent_init_cpu() static
'make W=1' causes GCC to complain:

  kernel/events/core.c:11877:6: warning: no previous prototype for 'perf_swevent_init_cpu' [-Wmissing-prototypes]

It's not referenced anywhere else, make it static.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/28974.1552377997@turing-police
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 09:52:34 +02:00
Thomas Gleixner
4a98be8293 perf/core improvements and fixes:
kernel:
 
   Stephane Eranian :
 
   - Restore mmap record type correctly when handling PERF_RECORD_MMAP2
     events, as the same template is used for all the threads interested
     in mmap events, some may want just PERF_RECORD_MMAP, while some
     may want the extra info in MMAP2 records.
 
 perf probe:
 
   Adrian Hunter:
 
   - Fix getting the kernel map, because since changes related to x86 PTI
     entry trampolines handling, there are more than one kernel map.
 
 perf script:
 
   Andi Kleen:
 
   - Support insn output for normal samples, i.e.:
 
     perf script -F ip,sym,insn --xed
 
     Will fetch the sample IP from the thread address space and feed it
     to Intel's XED disassembler, producing lines such as:
 
       ffffffffa4068804 native_write_msr            wrmsr
       ffffffffa415b95e __hrtimer_next_event_base   movq  0x18(%rax), %rdx
 
     That match 'perf annotate's output.
 
   - Make the --cpu filter apply to  PERF_RECORD_COMM/FORK/... events, in
     addition to PERF_RECORD_SAMPLE.
 
 perf report:
 
   - Add a new --samples option to save a small random number of samples
     per hist entry, using a reservoir technique to select a representative
     number of samples.
 
     Then allow browsing the samples using 'perf script' as part of the hist
     entry context menu. This automatically adds the right filters, so only
     the thread or CPU of the sample is displayed. Then we use less' search
     functionality to directly jump to the time stamp of the selected sample.
 
     It uses different menus for assembler and source display.  Assembler
     needs xed installed and source needs debuginfo.
 
   - Fix the UI browser scripts pop up menu when there are many scripts
     available.
 
 perf report:
 
   Andi Kleen:
 
   - Add 'time' sort option. E.g.:
 
     % perf report --sort time,overhead,symbol --time-quantum 1ms --stdio
     ...
          0.67%  277061.87300  [.] _dl_start
          0.50%  277061.87300  [.] f1
          0.50%  277061.87300  [.] f2
          0.33%  277061.87300  [.] main
          0.29%  277061.87300  [.] _dl_lookup_symbol_x
          0.29%  277061.87300  [.] dl_main
          0.29%  277061.87300  [.] do_lookup_x
          0.17%  277061.87300  [.] _dl_debug_initialize
          0.17%  277061.87300  [.] _dl_init_paths
          0.08%  277061.87300  [.] check_match
          0.04%  277061.87300  [.] _dl_count_modids
          1.33%  277061.87400  [.] f1
          1.33%  277061.87400  [.] f2
          1.33%  277061.87400  [.] main
          1.17%  277061.87500  [.] main
          1.08%  277061.87500  [.] f1
          1.08%  277061.87500  [.] f2
          1.00%  277061.87600  [.] main
          0.83%  277061.87600  [.] f1
          0.83%  277061.87600  [.] f2
          1.00%  277061.87700  [.] main
 
 tools headers:
 
   Arnaldo Carvalho de Melo:
 
   - Update x86's syscall_64.tbl, no change in tools/perf behaviour.
 
   -  Sync copies asm-generic/unistd.h and linux/in with the kernel sources.
 
 perf data:
 
   Jiri Olsa:
 
   - Prep work to support having perf.data stored as a directory, with one
     file per CPU, that ultimately will allow having one ring buffer reading
     thread per CPU.
 
 Vendor events:
 
   Martin Liška:
 
   - perf PMU events for AMD Family 17h.
 
 perf script python:
 
   Tony Jones:
 
   - Add python3 support for the remaining Intel PT related scripts, with
     these we should have a clean build of perf with python3 while still
     supporting the build with python2.
 
 libbpf:
 
   Arnaldo Carvalho de Melo:
 
   - Fix the build on uCLibc, adding the missing stdarg.h since we use
     va_list in one typedef.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXIbMlgAKCRCyPKLppCJ+
 J/fzAQDNlP1cEuryAfWCDZ/sf5N/76srvkt/kIyYO0CliCjiBAEAiHRWrhsNs1Gd
 Z8626lCTYt7BTdz5yfTb7gbt/n7xNAY=
 =Ycye
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-mingo-5.1-20190311' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/core improvements and fixes from Arnaldo:

kernel:

  Stephane Eranian :

  - Restore mmap record type correctly when handling PERF_RECORD_MMAP2
    events, as the same template is used for all the threads interested
    in mmap events, some may want just PERF_RECORD_MMAP, while some
    may want the extra info in MMAP2 records.

perf probe:

  Adrian Hunter:

  - Fix getting the kernel map, because since changes related to x86 PTI
    entry trampolines handling, there are more than one kernel map.

perf script:

  Andi Kleen:

  - Support insn output for normal samples, i.e.:

    perf script -F ip,sym,insn --xed

    Will fetch the sample IP from the thread address space and feed it
    to Intel's XED disassembler, producing lines such as:

      ffffffffa4068804 native_write_msr            wrmsr
      ffffffffa415b95e __hrtimer_next_event_base   movq  0x18(%rax), %rdx

    That match 'perf annotate's output.

  - Make the --cpu filter apply to  PERF_RECORD_COMM/FORK/... events, in
    addition to PERF_RECORD_SAMPLE.

perf report:

  - Add a new --samples option to save a small random number of samples
    per hist entry, using a reservoir technique to select a representative
    number of samples.

    Then allow browsing the samples using 'perf script' as part of the hist
    entry context menu. This automatically adds the right filters, so only
    the thread or CPU of the sample is displayed. Then we use less' search
    functionality to directly jump to the time stamp of the selected sample.

    It uses different menus for assembler and source display.  Assembler
    needs xed installed and source needs debuginfo.

  - Fix the UI browser scripts pop up menu when there are many scripts
    available.

perf report:

  Andi Kleen:

  - Add 'time' sort option. E.g.:

    % perf report --sort time,overhead,symbol --time-quantum 1ms --stdio
    ...
         0.67%  277061.87300  [.] _dl_start
         0.50%  277061.87300  [.] f1
         0.50%  277061.87300  [.] f2
         0.33%  277061.87300  [.] main
         0.29%  277061.87300  [.] _dl_lookup_symbol_x
         0.29%  277061.87300  [.] dl_main
         0.29%  277061.87300  [.] do_lookup_x
         0.17%  277061.87300  [.] _dl_debug_initialize
         0.17%  277061.87300  [.] _dl_init_paths
         0.08%  277061.87300  [.] check_match
         0.04%  277061.87300  [.] _dl_count_modids
         1.33%  277061.87400  [.] f1
         1.33%  277061.87400  [.] f2
         1.33%  277061.87400  [.] main
         1.17%  277061.87500  [.] main
         1.08%  277061.87500  [.] f1
         1.08%  277061.87500  [.] f2
         1.00%  277061.87600  [.] main
         0.83%  277061.87600  [.] f1
         0.83%  277061.87600  [.] f2
         1.00%  277061.87700  [.] main

tools headers:

  Arnaldo Carvalho de Melo:

  - Update x86's syscall_64.tbl, no change in tools/perf behaviour.

  -  Sync copies asm-generic/unistd.h and linux/in with the kernel sources.

perf data:

  Jiri Olsa:

  - Prep work to support having perf.data stored as a directory, with one
    file per CPU, that ultimately will allow having one ring buffer reading
    thread per CPU.

Vendor events:

  Martin Liška:

  - perf PMU events for AMD Family 17h.

perf script python:

  Tony Jones:

  - Add python3 support for the remaining Intel PT related scripts, with
    these we should have a clean build of perf with python3 while still
    supporting the build with python2.

libbpf:

  Arnaldo Carvalho de Melo:

  - Fix the build on uCLibc, adding the missing stdarg.h since we use
    va_list in one typedef.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-03-22 22:50:41 +01:00
Linus Torvalds
6cdfa54cd2 The biggest change for this release is in the histogram code.
- Add "onchange(var)" histogram handler that executes a action when $var
    changes.
 
  - Add new "snapshot()" action for histogram handlers, that causes a
    snapshot of the ring buffer when triggered.
    ie. onchange(var).snapshot() will trigger a snapshot if var changes.
 
  - Add alternative for "trace()" action.
    Currently, to trigger a synthetic event, the name of that event is used
    as the handler name, which is inconsistent with the other actions.
    onchange(var).synthetic(param) where it can now be
    onchange(var).trace(synthetic, param). The older method will still be
    allowed, as long as the synthetic events do not overlap with other
    handler names.
 
  - The histogram documentation at testcases were updated for the new
    changes.
 
 Added a quicker way to enable set_ftrace_filter files, that will make
 it much quicker to bisect tracing a function that shouldn't be traced and
 crashes the kernel. (You can echo in numbers to set_ftrace_filter, and it
 will select the corresponding function that is in
 available_filter_functions).
 
 Some better displaying of the tracing data (and more information was added).
 
 The rest are small fixes and more clean ups to the code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXIXXjRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrSJAQCbGXAvZE+shfKRhbU1cu1C1nwRMHhH
 eeRecJs1RChGFgD/TwatD4FzARQPjfk7snQD5KWPpoRc9grUACC2cZcaWwQ=
 =LVBI
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "The biggest change for this release is in the histogram code:

   - Add "onchange(var)" histogram handler that executes a action when
     $var changes.

   - Add new "snapshot()" action for histogram handlers, that causes a
     snapshot of the ring buffer when triggered. ie.
     onchange(var).snapshot() will trigger a snapshot if var changes.

   - Add alternative for "trace()" action. Currently, to trigger a
     synthetic event, the name of that event is used as the handler
     name, which is inconsistent with the other actions.
     onchange(var).synthetic(param) where it can now be
     onchange(var).trace(synthetic, param). The older method will still
     be allowed, as long as the synthetic events do not overlap with
     other handler names.

   - The histogram documentation at testcases were updated for the new
     changes.

  Outside of the histogram code, we have:

   - Added a quicker way to enable set_ftrace_filter files, that will
     make it much quicker to bisect tracing a function that shouldn't be
     traced and crashes the kernel. (You can echo in numbers to
     set_ftrace_filter, and it will select the corresponding function
     that is in available_filter_functions).

   - Some better displaying of the tracing data (and more information
     was added).

  The rest are small fixes and more clean ups to the code"

* tag 'trace-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (37 commits)
  tracing: Use strncpy instead of memcpy when copying comm in trace.c
  tracing: Use strncpy instead of memcpy when copying comm for hist triggers
  tracing: Use strncpy instead of memcpy for string keys in hist triggers
  tracing: Use str_has_prefix() in synth_event_create()
  x86/ftrace: Fix warning and considate ftrace_jmp_replace() and ftrace_call_replace()
  tracing/perf: Use strndup_user() instead of buggy open-coded version
  doc: trace: Fix documentation for uprobe_profile
  tracing: Fix spelling mistake: "analagous" -> "analogous"
  tracing: Comment why cond_snapshot is checked outside of max_lock protection
  tracing: Add hist trigger action 'expected fail' test case
  tracing: Add alternative synthetic event trace action test case
  tracing: Add hist trigger onchange() handler test case
  tracing: Add hist trigger snapshot() action test case
  tracing: Add SPDX license GPL-2.0 license identifier to inter-event testcases
  tracing: Add alternative synthetic event trace action syntax
  tracing: Add hist trigger onchange() handler Documentation
  tracing: Add hist trigger onchange() handler
  tracing: Add hist trigger snapshot() action Documentation
  tracing: Add hist trigger snapshot() action
  tracing: Add conditional snapshot
  ...
2019-03-11 17:01:32 -07:00
Stephane Eranian
d9c1bb2f6a perf/core: Restore mmap record type correctly
On mmap(), perf_events generates a RECORD_MMAP record and then checks
which events are interested in this record. There are currently 2
versions of mmap records: RECORD_MMAP and RECORD_MMAP2. MMAP2 is larger.
The event configuration controls which version the user level tool
accepts.

If the event->attr.mmap2=1 field then MMAP2 record is returned.  The
perf_event_mmap_output() takes care of this. It checks attr->mmap2 and
corrects the record fields before putting it in the sampling buffer of
the event.  At the end the function restores the modified MMAP record
fields.

The problem is that the function restores the size but not the type.
Thus, if a subsequent event only accepts MMAP type, then it would
instead receive an MMAP2 record with a size of MMAP record.

This patch fixes the problem by restoring the record type on exit.

Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Fixes: 13d7a2410f ("perf: Add attr->mmap2 attribute to an event")
Link: http://lkml.kernel.org/r/20190307185233.225521-1-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-03-11 11:56:02 -03:00
Linus Torvalds
12ad143e1b Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Perf updates and fixes:

  Kernel:
   - Handle events which have the bpf_event attribute set as side band
     events as they carry information about BPF programs.
   - Add missing switch-case fall-through comments

  Libraries:
   - Fix leaks and double frees in error code paths.
   - Prevent buffer overflows in libtraceevent

  Tools:
   - Improvements in handling Intel BT/PTS
   - Add BTF ELF markers to perf trace BPF programs to improve output
   - Support --time, --cpu, --pid and --tid filters for perf diff
   - Calculate the column width in perf annotate as the hardcoded 6
     characters for the instruction are not sufficient
   - Small fixes all over the place"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  perf/core: Mark expected switch fall-through
  perf/x86/intel/uncore: Fix client IMC events return huge result
  perf/ring_buffer: Use high order allocations for AUX buffers optimistically
  perf data: Force perf_data__open|close zero data->file.path
  perf session: Fix double free in perf_data__close
  perf evsel: Probe for precise_ip with simple attr
  perf tools: Read and store caps/max_precise in perf_pmu
  perf hist: Fix memory leak of srcline
  perf hist: Add error path into hist_entry__init
  perf c2c: Fix c2c report for empty numa node
  perf script python: Add Python3 support to intel-pt-events.py
  perf script python: Add Python3 support to event_analyzing_sample.py
  perf script python: add Python3 support to check-perf-trace.py
  perf script python: Add Python3 support to futex-contention.py
  perf script python: Remove mixed indentation
  perf diff: Support --pid/--tid filter options
  perf diff: Support --cpu filter option
  perf diff: Support --time filter option
  perf thread: Generalize function to copy from thread addr space from intel-bts code
  perf annotate: Calculate the max instruction name, align column to that
  ...
2019-03-10 15:22:03 -07:00
Linus Torvalds
a50243b1dd 5.1 Merge Window Pull Request
This has been a slightly more active cycle than normal with ongoing core
 changes and quite a lot of collected driver updates.
 
 - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
 
 - A new data transfer mode for HFI1 giving higher performance
 
 - Significant functional and bug fix update to the mlx5 On-Demand-Paging MR
   feature
 
 - A chip hang reset recovery system for hns
 
 - Change mm->pinned_vm to an atomic64
 
 - Update bnxt_re to support a new 57500 chip
 
 - A sane netlink 'rdma link add' method for creating rxe devices and fixing
   the various unregistration race conditions in rxe's unregister flow
 
 - Allow lookup up objects by an ID over netlink
 
 - Various reworking of the core to driver interface:
   * Drivers should not assume umem SGLs are in PAGE_SIZE chunks
   * ucontext is accessed via udata not other means
   * Start to make the core code responsible for object memory
     allocation
   * Drivers should convert struct device to struct ib_device
     via a helper
   * Drivers have more tools to avoid use after unregister problems
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlyAJYYACgkQOG33FX4g
 mxrWwQ/+OyAx4Moru7Aix0C6GWxTJp/wKgw21CS3reZxgLai6x81xNYG/s2wCNjo
 IccObVd7mvzyqPdxOeyHBsJBbQDqWvoD6O2duH8cqGMgBRgh3CSdUep2zLvPpSAx
 2W1SvWYCLDnCuarboFrCA8c4AN3eCZiqD7z9lHyFQGjy3nTUWzk1uBaOP46uaiMv
 w89N8EMdXJ/iY6ONzihvE05NEYbMA8fuvosKLLNdghRiHIjbMQU8SneY23pvyPDd
 ZziPu9NcO3Hw9OVbkwtJp47U3KCBgvKHmnixyZKkikjiD+HVoABw2IMwcYwyBZwP
 Bic/ddONJUvAxMHpKRnQaW7znAiHARk21nDG28UAI7FWXH/wMXgicMp6LRcNKqKF
 vqXdxHTKJb0QUR4xrYI+eA8ihstss7UUpgSgByuANJ0X729xHiJtlEvPb1DPo1Dz
 9CB4OHOVRl5O8sA5Jc6PSusZiKEpvWoyWbdmw0IiwDF5pe922VLl5Nv88ta+sJ38
 v2Ll5AgYcluk7F3599Uh9D7gwp5hxW2Ph3bNYyg2j3HP4/dKsL9XvIJPXqEthgCr
 3KQS9rOZfI/7URieT+H+Mlf+OWZhXsZilJG7No0fYgIVjgJ00h3SF1/299YIq6Qp
 9W7ZXBfVSwLYA2AEVSvGFeZPUxgBwHrSZ62wya4uFeB1jyoodPk=
 =p12E
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a slightly more active cycle than normal with ongoing
  core changes and quite a lot of collected driver updates.

   - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe

   - A new data transfer mode for HFI1 giving higher performance

   - Significant functional and bug fix update to the mlx5
     On-Demand-Paging MR feature

   - A chip hang reset recovery system for hns

   - Change mm->pinned_vm to an atomic64

   - Update bnxt_re to support a new 57500 chip

   - A sane netlink 'rdma link add' method for creating rxe devices and
     fixing the various unregistration race conditions in rxe's
     unregister flow

   - Allow lookup up objects by an ID over netlink

   - Various reworking of the core to driver interface:
       - drivers should not assume umem SGLs are in PAGE_SIZE chunks
       - ucontext is accessed via udata not other means
       - start to make the core code responsible for object memory
         allocation
       - drivers should convert struct device to struct ib_device via a
         helper
       - drivers have more tools to avoid use after unregister problems"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
  net/mlx5: ODP support for XRC transport is not enabled by default in FW
  IB/hfi1: Close race condition on user context disable and close
  RDMA/umem: Revert broken 'off by one' fix
  RDMA/umem: minor bug fix in error handling path
  RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
  cxgb4: kfree mhp after the debug print
  IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
  IB/rdmavt: Fix loopback send with invalidate ordering
  IB/iser: Fix dma_nents type definition
  IB/mlx5: Set correct write permissions for implicit ODP MR
  bnxt_re: Clean cq for kernel consumers only
  RDMA/uverbs: Don't do double free of allocated PD
  RDMA: Handle ucontext allocations by IB/core
  RDMA/core: Fix a WARN() message
  bnxt_re: fix the regression due to changes in alloc_pbl
  IB/mlx4: Increase the timeout for CM cache
  IB/core: Abort page fault handler silently during owning process exit
  IB/mlx5: Validate correct PD before prefetch MR
  IB/mlx5: Protect against prefetch of invalid MR
  RDMA/uverbs: Store PR pointer before it is overwritten
  ...
2019-03-09 15:53:03 -08:00
Ingo Molnar
b339da4803 perf bpf:
Arnaldo Carvalho de Melo:
 
   - Automatically add BTF ELF markers to 'perf trace' BPF programs, so that
     tools such as 'bpftool map dump' can pretty print map keys and values.
 
 perf c2c:
 
   Jiri Olsa:
 
   - Fix report for empty NUMA node.
 
 perf diff:
 
   Jin Yao:
 
   - Support --time, --cpu, --pid and --tid filter options.
 
 perf probe:
 
   Arnaldo Carvalho de Melo:
 
   - Clarify error message about not finding kernel modules debuginfo.
 
 perf record:
 
   Jiri Olsa:
 
   - Fixup probing for max attr.precise_ip.
 
 perf trace:
 
   Arnaldo Carvalho de Melo:
 
   - Add missing %s lost in the 'msg_flags' recvmmsg arg when adding prefix suppression logic.
 
 perf annotate:
 
   Arnaldo Carvalho de Melo:
 
   - Calculate the max instruction name, align column to that, removing the
     hardcoded max 6 chars and cope with instructions with names longer than that,
     such as vpmovmskb, vpcmpeqb, etc.
 
 kernel:
 
   Song Liu:
 
   - Consider events with attr.bpf_event set as side-band.
 
   Gustavo A. R. Silva:
 
   - Mark expected switch fall-through in perf_event_parse_addr_filter().
 
 Libraries:
 
   Jiri Olsa:
 
   - Fix leaks and double frees on error paths.
 
 libtraceevent:
 
   Tony Jones:
 
   - Fix buffer overflow in arg_eval().
 
 python scripting:
 
   Tony Jones:
 
   - More python3 fixes.
 
 Trivial:
 
   Yang Wei:
 
   - Remove needless extra semicolon in clang C++ glue code.
 
 Intel PT/BTS:
 
   Adrian Hunter:
 
   - Improve auxtrace address filter error message when there is no DSO.
 
   - Fix divide by zero when TSC is not available.
 
   - Further improvements to the export to sqlite/posgresql python scripts
     and to the GUI sqlviewer, exporting 'parent_id' so that we have enable
     the creation of call trees.
 
   Andi Kleen:
 
   - Generalize function to copy from thread addr space from intel-bts code.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXIFXsgAKCRCyPKLppCJ+
 Jz++AQDVDXs1rKyZ5JDmnDpJ1tvVPZM1tTAU+6C/GnnoSDgX/AD+L3smvLoPihbu
 msd3TpSroXuQ7nZ4BQ894jHyX3STqQE=
 =MN9Q
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-mingo-5.1-20190307' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/core changes from Arnaldo Carvalho de Melo:

perf bpf:

  Arnaldo Carvalho de Melo:

  - Automatically add BTF ELF markers to 'perf trace' BPF programs, so that
    tools such as 'bpftool map dump' can pretty print map keys and values.

perf c2c:

  Jiri Olsa:

  - Fix report for empty NUMA node.

perf diff:

  Jin Yao:

  - Support --time, --cpu, --pid and --tid filter options.

perf probe:

  Arnaldo Carvalho de Melo:

  - Clarify error message about not finding kernel modules debuginfo.

perf record:

  Jiri Olsa:

  - Fixup probing for max attr.precise_ip.

perf trace:

  Arnaldo Carvalho de Melo:

  - Add missing %s lost in the 'msg_flags' recvmmsg arg when adding prefix suppression logic.

perf annotate:

  Arnaldo Carvalho de Melo:

  - Calculate the max instruction name, align column to that, removing the
    hardcoded max 6 chars and cope with instructions with names longer than that,
    such as vpmovmskb, vpcmpeqb, etc.

kernel:

  Song Liu:

  - Consider events with attr.bpf_event set as side-band.

  Gustavo A. R. Silva:

  - Mark expected switch fall-through in perf_event_parse_addr_filter().

Libraries:

  Jiri Olsa:

  - Fix leaks and double frees on error paths.

libtraceevent:

  Tony Jones:

  - Fix buffer overflow in arg_eval().

python scripting:

  Tony Jones:

  - More python3 fixes.

Trivial:

  Yang Wei:

  - Remove needless extra semicolon in clang C++ glue code.

Intel PT/BTS:

  Adrian Hunter:

  - Improve auxtrace address filter error message when there is no DSO.

  - Fix divide by zero when TSC is not available.

  - Further improvements to the export to sqlite/posgresql python scripts
    and to the GUI sqlviewer, exporting 'parent_id' so that we have enable
    the creation of call trees.

  Andi Kleen:

  - Generalize function to copy from thread addr space from intel-bts code.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-09 17:00:17 +01:00
Gustavo A. R. Silva
43aa378b41 perf/core: Mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch fixes the following warning:

  kernel/events/core.c: In function ‘perf_event_parse_addr_filter’:
  kernel/events/core.c:9154:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
      kernel = 1;
      ~~~~~~~^~~
  kernel/events/core.c:9156:3: note: here
     case IF_SRC_FILEADDR:
     ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20190212205430.GA8446@embeddedor
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-09 14:10:32 +01:00
Alexander Shishkin
5768402fd9 perf/ring_buffer: Use high order allocations for AUX buffers optimistically
Currently, the AUX buffer allocator will use high-order allocations
for PMUs that don't support hardware scatter-gather chaining to ensure
large contiguous blocks of pages, and always use an array of single
pages otherwise.

There is, however, a tangible performance benefit in using larger chunks
of contiguous memory even in the latter case, that comes from not having
to fetch the next page's address at every page boundary. In particular,
a task running under Intel PT on an Atom CPU shows 1.5%-2% less runtime
penalty with a single multi-page output region in snapshot mode (no PMI)
than with multiple single-page output regions, from ~6% down to ~4%. For
the snapshot mode it does make a difference as it is intended to run over
long periods of time.

For this reason, change the allocation policy to always optimistically
start with the highest possible order when allocating pages for the AUX
buffer, desceding until the allocation succeeds or order zero allocation
fails.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20190215114727.62648-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-09 14:10:30 +01:00
Gustavo A. R. Silva
10c3405f06 perf: Mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch fixes the following warning:

  kernel/events/core.c: In function ‘perf_event_parse_addr_filter’:
  kernel/events/core.c:9154:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
      kernel = 1;
      ~~~~~~~^~~
  kernel/events/core.c:9156:3: note: here
     case IF_SRC_FILEADDR:
     ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Gustavo A. R. Silva <gustavo@embeddedor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Kook <keescook@chromium.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20190212205430.GA8446@embeddedor
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-03-01 10:54:00 -03:00
Song Liu
21038f2baa perf, bpf: Consider events with attr.bpf_event as side-band events
Events with attr.bpf_event set should be considered as side-band events,
as they carry information about BPF programs.

Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-team@fb.com
Cc: netdev@vger.kernel.org
Fixes: 6ee52e2a3f ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT")
Link: http://lkml.kernel.org/r/20190226002019.3748539-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-02-28 14:20:35 -03:00
Ingo Molnar
c978b9460f perf/core improvements and fixes:
perf annotate:
 
   Wei Li:
 
   - Fix getting source line failure
 
 perf script:
 
   Andi Kleen:
 
   - Handle missing fields with -F +...
 
 perf data:
 
   Jiri Olsa:
 
   - Prep work to support per-cpu files in a directory.
 
 Intel PT:
 
   Adrian Hunter:
 
   - Improve thread_stack__no_call_return()
 
   - Hide x86 retpolines in thread stacks.
 
   - exported SQL viewer refactorings, new 'top calls' report..
 
   Alexander Shishkin:
 
   - Copy parent's address filter offsets on clone
 
   - Fix address filters for vmas with non-zero offset. Applies to
     ARM's CoreSight as well.
 
 python scripts:
 
   Tony Jones:
 
   - Python3 support for several 'perf script' python scripts.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXHRYNwAKCRCyPKLppCJ+
 J8XmAQDKY7gb3GhkX+4aE8cGffFYB2YV5mD9Bbu4AM9tuFFBJwD+KAq87FMCy7m7
 h7xyWk3UILpz6y235AVdfOmgcNDkpAQ=
 =SJCG
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-mingo-5.1-20190225' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

perf annotate:

  Wei Li:

  - Fix getting source line failure

perf script:

  Andi Kleen:

  - Handle missing fields with -F +...

perf data:

  Jiri Olsa:

  - Prep work to support per-cpu files in a directory.

Intel PT:

  Adrian Hunter:

  - Improve thread_stack__no_call_return()

  - Hide x86 retpolines in thread stacks.

  - exported SQL viewer refactorings, new 'top calls' report..

  Alexander Shishkin:

  - Copy parent's address filter offsets on clone

  - Fix address filters for vmas with non-zero offset. Applies to
    ARM's CoreSight as well.

python scripts:

  Tony Jones:

  - Python3 support for several 'perf script' python scripts.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-28 08:29:50 +01:00
Ingo Molnar
9ed8f1a6e7 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-28 08:27:17 +01:00
Alexander Shishkin
c60f83b813 perf, pt, coresight: Fix address filters for vmas with non-zero offset
Currently, the address range calculation for file-based filters works as
long as the vma that maps the matching part of the object file starts
from offset zero into the file (vm_pgoff==0). Otherwise, the resulting
filter range would be off by vm_pgoff pages. Another related problem is
that in case of a partially matching vma, that is, a vma that matches
part of a filter region, the filter range size wouldn't be adjusted.

Fix the arithmetics around address filter range calculations, taking
into account vma offset, so that the entire calculation is done before
the filter configuration is passed to the PMU drivers instead of having
those drivers do the final bit of arithmetics.

Based on the patch by Adrian Hunter <adrian.hunter.intel.com>.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/20190215115655.63469-3-alexander.shishkin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-02-22 16:52:07 -03:00
Alexander Shishkin
18736eef12 perf: Copy parent's address filter offsets on clone
When a child event is allocated in the inherit_event() path, the VMA
based filter offsets are not copied from the parent, even though the
address space mapping of the new task remains the same, which leads to
no trace for the new task until exec.

Reported-by: Mansour Alharthi <malharthi9@gatech.edu>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/20190215115655.63469-2-alexander.shishkin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-02-22 16:52:07 -03:00
Elena Reshetova
ce59b8e99c uprobes: convert uprobe.ref to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable uprobe.ref is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

**Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.
The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the uprobe.ref it might make a difference
in following places:
 - put_uprobe(): decrement in refcount_dec_and_test() only
   provides RELEASE ordering and control dependency on success
   vs. fully ordered atomic counterpart

Link: http://lkml.kernel.org/r/1547637627-29526-1-git-send-email-elena.reshetova@intel.com

Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-15 13:10:14 -05:00
Ingo Molnar
528871b456 perf/core: Fix impossible ring-buffer sizes warning
The following commit:

  9dff0aa95a ("perf/core: Don't WARN() for impossible ring-buffer sizes")

results in perf recording failures with larger mmap areas:

  root@skl:/tmp# perf record -g -a
  failed to mmap with 12 (Cannot allocate memory)

The root cause is that the following condition is buggy:

	if (order_base_2(size) >= MAX_ORDER)
		goto fail;

The problem is that @size is in bytes and MAX_ORDER is in pages,
so the right test is:

	if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
		goto fail;

Fix it.

Reported-by: "Jin, Yao" <yao.jin@linux.intel.com>
Bisected-by: Borislav Petkov <bp@alien8.de>
Analyzed-by: Peter Zijlstra <peterz@infradead.org>
Cc: Julien Thierry <julien.thierry@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Fixes: 9dff0aa95a ("perf/core: Don't WARN() for impossible ring-buffer sizes")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-13 08:05:02 +01:00
Jiri Olsa
81ec3f3c4c perf/x86: Add check_period PMU callback
Vince (and later on Ravi) reported crashes in the BTS code during
fuzzing with the following backtrace:

  general protection fault: 0000 [#1] SMP PTI
  ...
  RIP: 0010:perf_prepare_sample+0x8f/0x510
  ...
  Call Trace:
   <IRQ>
   ? intel_pmu_drain_bts_buffer+0x194/0x230
   intel_pmu_drain_bts_buffer+0x160/0x230
   ? tick_nohz_irq_exit+0x31/0x40
   ? smp_call_function_single_interrupt+0x48/0xe0
   ? call_function_single_interrupt+0xf/0x20
   ? call_function_single_interrupt+0xa/0x20
   ? x86_schedule_events+0x1a0/0x2f0
   ? x86_pmu_commit_txn+0xb4/0x100
   ? find_busiest_group+0x47/0x5d0
   ? perf_event_set_state.part.42+0x12/0x50
   ? perf_mux_hrtimer_restart+0x40/0xb0
   intel_pmu_disable_event+0xae/0x100
   ? intel_pmu_disable_event+0xae/0x100
   x86_pmu_stop+0x7a/0xb0
   x86_pmu_del+0x57/0x120
   event_sched_out.isra.101+0x83/0x180
   group_sched_out.part.103+0x57/0xe0
   ctx_sched_out+0x188/0x240
   ctx_resched+0xa8/0xd0
   __perf_event_enable+0x193/0x1e0
   event_function+0x8e/0xc0
   remote_function+0x41/0x50
   flush_smp_call_function_queue+0x68/0x100
   generic_smp_call_function_single_interrupt+0x13/0x30
   smp_call_function_single_interrupt+0x3e/0xe0
   call_function_single_interrupt+0xf/0x20
   </IRQ>

The reason is that while event init code does several checks
for BTS events and prevents several unwanted config bits for
BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
to create BTS event without those checks being done.

Following sequence will cause the crash:

If we create an 'almost' BTS event with precise_ip and callchains,
and it into a BTS event it will crash the perf_prepare_sample()
function because precise_ip events are expected to come
in with callchain data initialized, but that's not the
case for intel_pmu_drain_bts_buffer() caller.

Adding a check_period callback to be called before the period
is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
if the event would become BTS. Plus adding also the limit_period
check as well.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11 11:46:43 +01:00
Davidlohr Bueso
70f8a3ca68 mm: make mm->pinned_vm an atomic64 counter
Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.

By making the counter atomic we no longer need to hold the mmap_sem and
can simply some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
Mathieu Poirier
840018668c perf/aux: Make perf_event accessible to setup_aux()
When pmu::setup_aux() is called the coresight PMU needs to know which
sink to use for the session by looking up the information in the
event's attr::config2 field.

As such simply replace the cpu information by the complete perf_event
structure and change all affected customers.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Suzuki Poulouse <suzuki.poulose@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-02-06 10:00:39 -03:00
Elena Reshetova
ca3bb3d027 perf/ring_buffer: Convert ring_buffer.aux_refcount to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable ring_buffer.aux_refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

** Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
for more information.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the ring_buffer.aux_refcount it might make a difference
in following places:

 - perf_aux_output_begin(): increment in refcount_inc_not_zero() only
   guarantees control dependency on success vs. fully ordered
   atomic counterpart
 - rb_free_aux(): decrement in refcount_dec_and_test() only
   provides RELEASE ordering and ACQUIRE ordering + control dependency
   on success vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: namhyung@kernel.org
Link: https://lkml.kernel.org/r/1548678448-24458-4-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:46:17 +01:00
Elena Reshetova
fecb8ed2ce perf/ring_buffer: Convert ring_buffer.refcount to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable ring_buffer.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

** Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
for more information.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the ring_buffer.refcount it might make a difference
in following places:

 - ring_buffer_get(): increment in refcount_inc_not_zero() only
   guarantees control dependency on success vs. fully ordered
   atomic counterpart
 - ring_buffer_put(): decrement in refcount_dec_and_test() only
   provides RELEASE ordering and ACQUIRE ordering + control dependency
   on success vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: namhyung@kernel.org
Link: https://lkml.kernel.org/r/1548678448-24458-3-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:46:16 +01:00
Elena Reshetova
8c94abbbe1 perf: Convert perf_event_context.refcount to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable perf_event_context.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

** Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
for more information.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the perf_event_context.refcount it might make a difference
in following places:

 - get_ctx(), perf_event_ctx_lock_nested(), perf_lock_task_context()
   and __perf_event_ctx_lock_double(): increment in
   refcount_inc_not_zero() only guarantees control dependency
   on success vs. fully ordered atomic counterpart
 - put_ctx(): decrement in refcount_dec_and_test() provides
   RELEASE ordering and ACQUIRE ordering + control dependency on success
   vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: namhyung@kernel.org
Link: https://lkml.kernel.org/r/1548678448-24458-2-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:46:15 +01:00
Thomas Gleixner
720e596a16 perf/uprobes: Convert to SPDX license identifier
Replace the license boiler plate with a SPDX license identifier.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul McKenney <paulmck@linux.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190116111308.211981422@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:46:13 +01:00
Thomas Gleixner
469eb32eaf perf/hw_breakpoints: Convert to SPDX license identifier
Replace the license boiler plate with a SPDX license identifier.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul McKenney <paulmck@linux.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190116111308.105855650@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:46:13 +01:00
Thomas Gleixner
8e86e01526 perf/core: Convert to SPDX license identifiers
Use proper SPDX license identifiers instead of the bogus reference to
kernel-base/COPYING.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190116111308.012666937@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:46:11 +01:00
Ingo Molnar
98cb621081 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:45:42 +01:00
Mark Rutland
9dff0aa95a perf/core: Don't WARN() for impossible ring-buffer sizes
The perf tool uses /proc/sys/kernel/perf_event_mlock_kb to determine how
large its ringbuffer mmap should be. This can be configured to arbitrary
values, which can be larger than the maximum possible allocation from
kmalloc.

When this is configured to a suitably large value (e.g. thanks to the
perf fuzzer), attempting to use perf record triggers a WARN_ON_ONCE() in
__alloc_pages_nodemask():

   WARNING: CPU: 2 PID: 5666 at mm/page_alloc.c:4511 __alloc_pages_nodemask+0x3f8/0xbc8

Let's avoid this by checking that the requested allocation is possible
before calling kzalloc.

Reported-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190110142745.25495-1-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:45:25 +01:00
Song Liu
6ee52e2a3f perf, bpf: Introduce PERF_RECORD_BPF_EVENT
For better performance analysis of BPF programs, this patch introduces
PERF_RECORD_BPF_EVENT, a new perf_event_type that exposes BPF program
load/unload information to user space.

Each BPF program may contain up to BPF_MAX_SUBPROGS (256) sub programs.
The following example shows kernel symbols for a BPF program with 7 sub
programs:

    ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F
    ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F
    ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F
    ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F
    ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F
    ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F
    ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F
    ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi

When a bpf program is loaded, PERF_RECORD_KSYMBOL is generated for each
of these sub programs. Therefore, PERF_RECORD_BPF_EVENT is not needed
for simple profiling.

For annotation, user space need to listen to PERF_RECORD_BPF_EVENT and
gather more information about these (sub) programs via sys_bpf.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradeaed.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-team@fb.com
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/20190117161521.1341602-4-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-21 17:00:57 -03:00
Song Liu
76193a9452 perf, bpf: Introduce PERF_RECORD_KSYMBOL
For better performance analysis of dynamically JITed and loaded kernel
functions, such as BPF programs, this patch introduces
PERF_RECORD_KSYMBOL, a new perf_event_type that exposes kernel symbol
register/unregister information to user space.

The following data structure is used for PERF_RECORD_KSYMBOL.

    /*
     * struct {
     *      struct perf_event_header        header;
     *      u64                             addr;
     *      u32                             len;
     *      u16                             ksym_type;
     *      u16                             flags;
     *      char                            name[];
     *      struct sample_id                sample_id;
     * };
     */

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-team@fb.com
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/20190117161521.1341602-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-21 17:00:57 -03:00
Arnaldo Carvalho de Melo
5620196951 perf: Make perf_event_output() propagate the output() return
For the original mode of operation it isn't needed, since we report back
errors via PERF_RECORD_LOST records in the ring buffer, but for use in
bpf_perf_event_output() it is convenient to return the errors, basically
-ENOSPC.

Currently bpf_perf_event_output() returns an error indication, the last
thing it does, which is to push it to the ring buffer is that can fail
and if so, this failure won't be reported back to its users, fix it.

Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/20190118150938.GN5823@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-21 17:00:57 -03:00
Andrew Murray
cc6795aeff perf/core: Add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs
Many PMU drivers do not have the capability to exclude counting events
that occur in specific contexts such as idle, kernel, guest, etc. These
drivers indicate this by returning an error in their event_init upon
testing the events attribute flags. This approach is error prone and
often inconsistent.

Let's instead allow PMU drivers to advertise their inability to exclude
based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This
allows the perf core to reject requests for exclusion events where
there is no support in the PMU.

Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: robin.murphy@arm.com
Cc: suzuki.poulose@arm.com
Link: https://lkml.kernel.org/r/1547128414-50693-4-git-send-email-andrew.murray@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-21 11:01:20 +01:00
Stephane Eranian
1a51c5da5a perf core: Fix perf_proc_update_handler() bug
The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate
syctl variable.  When the PMU IRQ handler timing monitoring is disabled, i.e,
when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
then no modification to sysctl_perf_event_sample_rate is allowed to prevent
possible hang from wrong values.

The problem is that the test to prevent modification is made after the
sysctl variable is modified in perf_proc_update_handler().

You get an error:

  $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
  echo: write error: invalid argument

But the value is still modified causing all sorts of inconsistencies:

  $ cat /proc/sys/kernel/perf_event_max_sample_rate
  10001

This patch fixes the problem by moving the parsing of the value after
the test.

Committer testing:

  # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
  # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
  -bash: echo: write error: Invalid argument
  # cat /proc/sys/kernel/perf_event_max_sample_rate
  10001
  #

Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-18 11:10:38 -03:00
Linus Torvalds
96d4f267e4 Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access.  But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model.  And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

 - csky still had the old "verify_area()" name as an alias.

 - the iter_iov code had magical hardcoded knowledge of the actual
   values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
   really used it)

 - microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something.  Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03 18:57:57 -08:00
Jérôme Glisse
ac46d4f3c4 mm/mmu_notifier: use structure for invalidate_range_start/end calls v2
To avoid having to change many call sites everytime we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end cakks.  No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
From: Jérôme Glisse <jglisse@redhat.com>
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Linus Torvalds
116b081c28 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main changes in this cycle on the kernel side:

   - rework kprobes blacklist handling (Masami Hiramatsu)

   - misc cleanups

  on the tooling side these areas were the main focus:

   - 'perf trace'     enhancements (Arnaldo Carvalho de Melo)

   - 'perf bench'     enhancements (Davidlohr Bueso)

   - 'perf record'    enhancements (Alexey Budankov)

   - 'perf annotate'  enhancements (Jin Yao)

   - 'perf top'       enhancements (Jiri Olsa)

   - Intel hw tracing enhancements (Adrian Hunter)

   - ARM hw tracing   enhancements (Leo Yan, Mathieu Poirier)

   - ... plus lots of other enhancements, cleanups and fixes"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (171 commits)
  tools uapi asm: Update asm-generic/unistd.h copy
  perf symbols: Relax checks on perf-PID.map ownership
  perf trace: Wire up the fadvise 'advice' table generator
  perf beauty: Add generator for fadvise64's 'advice' arg constants
  tools headers uapi: Grab a copy of fadvise.h
  perf beauty mmap: Print mmap's 'offset' arg in hexadecimal
  perf beauty mmap: Print PROT_READ before PROT_EXEC to match strace output
  perf trace beauty: Beautify arch_prctl()'s arguments
  perf trace: When showing string prefixes show prefix + ??? for unknown entries
  perf trace: Move strarrays to beauty.h for further reuse
  perf beauty: Wire up the x86_arch prctl code table generator
  perf beauty: Add a string table generator for x86's 'arch_prctl' codes
  tools include arch: Grab a copy of x86's prctl.h
  perf trace: Show NULL when syscall pointer args are 0
  perf trace: Enclose the errno strings with ()
  perf augmented_raw_syscalls: Copy 'access' arg as well
  perf trace: Add alignment spaces after the closing parens
  perf trace beauty: Print O_RDONLY when (flags & O_ACCMODE) == 0
  perf trace: Allow asking for not suppressing common string prefixes
  perf trace: Add a prefix member to the strarray class
  ...
2018-12-26 14:45:18 -08:00
Linus Torvalds
792bf4d871 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The biggest RCU changes in this cycle were:

   - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

   - Replace calls of RCU-bh and RCU-sched update-side functions to
     their vanilla RCU counterparts. This series is a step towards
     complete removal of the RCU-bh and RCU-sched update-side functions.

     ( Note that some of these conversions are going upstream via their
       respective maintainers. )

   - Documentation updates, including a number of flavor-consolidation
     updates from Joel Fernandes.

   - Miscellaneous fixes.

   - Automate generation of the initrd filesystem used for rcutorture
     testing.

   - Convert spin_is_locked() assertions to instead use lockdep.

     ( Note that some of these conversions are going upstream via their
       respective maintainers. )

   - SRCU updates, especially including a fix from Dennis Krein for a
     bag-on-head-class bug.

   - RCU torture-test updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
  rcutorture: Don't do busted forward-progress testing
  rcutorture: Use 100ms buckets for forward-progress callback histograms
  rcutorture: Recover from OOM during forward-progress tests
  rcutorture: Print forward-progress test age upon failure
  rcutorture: Print time since GP end upon forward-progress failure
  rcutorture: Print histogram of CB invocation at OOM time
  rcutorture: Print GP age upon forward-progress failure
  rcu: Print per-CPU callback counts for forward-progress failures
  rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
  rcutorture: Dump grace-period diagnostics upon forward-progress OOM
  rcutorture: Prepare for asynchronous access to rcu_fwd_startat
  torture: Remove unnecessary "ret" variables
  rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
  rcutorture: Break up too-long rcu_torture_fwd_prog() function
  rcutorture: Remove cbflood facility
  torture: Bring any extra CPUs online during kernel startup
  rcutorture: Add call_rcu() flooding forward-progress tests
  rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
  tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
  net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
  ...
2018-12-26 13:07:19 -08:00
Ingo Molnar
76aea1eeb9 Linux 4.20-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwW4/oeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG2QMH/Rl6iMpTUX23tMHe
 eXQzAOSvQXaWlFoX25j1Jvt8nhS7Uy8vkdpYTCOI/7DF0Jg4O/6uxcZkErlwWxb8
 MW1rMgpfO+OpDLSLXAO2GKxaKI3ArqF2BcOQA2mji1/jR2VUTqmIvBoudn5d+GYz
 19aCyfdzmVTC38G9sBhhcqJ10EkxLiHe2K74bf4JxVuSf2EnTI4LYt5xJPDoT0/C
 6fOeUNwVhvv5a4svvzJmortq7x7BwyxBQArc7PbO0MPhabLU4wyFUOTRszgsGd76
 o5JuOFwgdIIHlSSacGla6rKq10nmkwR07fHfRFFwbvrfBOEHsXOP2hvzMZX+FLBK
 IXOzdtc=
 =XlMc
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-17 17:46:26 +01:00
Linus Torvalds
abb8d6ecbd This is a single commit that fixes a bug in uprobes SDT code
due to a missing mutex protection.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXAlffRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qq0KAP0eIy6/kwoBocygRLgB6N4naX/zFcw4
 m2NiSlYe3NpC6AD/Z1g3wg8bKlm7ar2OzaqE4wQdeKjrvPlUtymUKiwFxA8=
 =8Huu
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "This is a single commit that fixes a bug in uprobes SDT code due to a
  missing mutex protection"

* tag 'trace-v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  Uprobes: Fix kernel oops with delayed_uprobe_remove()
2018-12-06 10:35:19 -08:00
Ravi Bangoria
1aed58e67a Uprobes: Fix kernel oops with delayed_uprobe_remove()
There could be a race between task exit and probe unregister:

  exit_mm()
  mmput()
  __mmput()                     uprobe_unregister()
  uprobe_clear_state()          put_uprobe()
  delayed_uprobe_remove()       delayed_uprobe_remove()

put_uprobe() is calling delayed_uprobe_remove() without taking
delayed_uprobe_lock and thus the race sometimes results in a
kernel crash. Fix this by taking delayed_uprobe_lock before
calling delayed_uprobe_remove() from put_uprobe().

Detailed crash log can be found at:
  Link: http://lkml.kernel.org/r/000000000000140c370577db5ece@google.com

Link: http://lkml.kernel.org/r/20181205033423.26242-1-ravi.bangoria@linux.ibm.com

Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reported-by: syzbot+cb1fb754b771caca0a88@syzkaller.appspotmail.com
Fixes: 1cc33161a8 ("uprobes: Support SDT markers having reference count (semaphore)")
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-05 23:05:13 -05:00
Ingo Molnar
4bbfd7467c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

- Replace calls of RCU-bh and RCU-sched update-side functions
  to their vanilla RCU counterparts.  This series is a step
  towards complete removal of the RCU-bh and RCU-sched update-side
  functions.

  ( Note that some of these conversions are going upstream via their
    respective maintainers. )

- Documentation updates, including a number of flavor-consolidation
  updates from Joel Fernandes.

- Miscellaneous fixes.

- Automate generation of the initrd filesystem used for
  rcutorture testing.

- Convert spin_is_locked() assertions to instead use lockdep.

  ( Note that some of these conversions are going upstream via their
    respective maintainers. )

- SRCU updates, especially including a fix from Dennis Krein
  for a bag-on-head-class bug.

- RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-04 07:52:30 +01:00
Ingo Molnar
fca0c11650 perf: Fix typos in comments
Fix two typos in kernel/events/*.

No change in functionality intended.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-03 11:22:32 +01:00
Paul E. McKenney
0809d95451 events: Replace synchronize_sched() with synchronize_rcu()
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu().  This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
2018-11-27 09:21:44 -08:00
Andrea Parri
09d3f015d1 uprobes: Fix handle_swbp() vs. unregister() + register() race once more
Commit:

  142b18ddc8 ("uprobes: Fix handle_swbp() vs unregister() + register() race")

added the UPROBE_COPY_INSN flag, and corresponding smp_wmb() and smp_rmb()
memory barriers, to ensure that handle_swbp() uses fully-initialized
uprobes only.

However, the smp_rmb() is mis-placed: this barrier should be placed
after handle_swbp() has tested for the flag, thus guaranteeing that
(program-order) subsequent loads from the uprobe can see the initial
stores performed by prepare_uprobe().

Move the smp_rmb() accordingly.  Also amend the comments associated
to the two memory barriers to indicate their actual locations.

Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@kernel.org
Fixes: 142b18ddc8 ("uprobes: Fix handle_swbp() vs unregister() + register() race")
Link: http://lkml.kernel.org/r/20181122161031.15179-1-andrea.parri@amarulasolutions.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-23 08:31:19 +01:00
Linus Torvalds
01897f3e05 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates and fixes from Ingo Molnar:
 "These are almost all tooling updates: 'perf top', 'perf trace' and
  'perf script' fixes and updates, an UAPI header sync with the merge
  window versions, license marker updates, much improved Sparc support
  from David Miller, and a number of fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
  perf intel-pt/bts: Calculate cpumode for synthesized samples
  perf intel-pt: Insert callchain context into synthesized callchains
  perf tools: Don't clone maps from parent when synthesizing forks
  perf top: Start display thread earlier
  tools headers uapi: Update linux/if_link.h header copy
  tools headers uapi: Update linux/netlink.h header copy
  tools headers: Sync the various kvm.h header copies
  tools include uapi: Update linux/mmap.h copy
  perf trace beauty: Use the mmap flags table generated from headers
  perf beauty: Wire up the mmap flags table generator to the Makefile
  perf beauty: Add a generator for MAP_ mmap's flag constants
  tools include uapi: Update asound.h copy
  tools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies
  tools include uapi: Update linux/fs.h copy
  perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}
  perf cs-etm: Correct CPU mode for samples
  perf unwind: Take pgoff into account when reporting elf to libdwfl
  perf top: Do not use overwrite mode by default
  perf top: Allow disabling the overwrite mode
  perf trace: Beautify mount's first pathname arg
  ...
2018-11-03 18:13:43 -07:00
Linus Torvalds
343a9f3540 The biggest change here is the updates to kprobes
Back in January I posted patches to create function based events. These were
 the events that you suggested I make to allow developers to easily create
 events in code where no trace event exists. After posting those changes for
 review, it was suggested that we implement this instead with kprobes.
 
 The problem with kprobes is that the interface is too complex and needs to
 be simplified. Masami Hiramatsu posted patches in March and I've been
 playing with them a bit. There's been a bit of clean up in the kprobe code
 that was inspired by the function based event patches, and a couple of
 enhancements to the kprobe event interface.
 
  - If the arch supports it (we added support for x86), you can place a
    kprobe event at the start of a function and use $arg1, $arg2, etc
    to reference the arguments of a function. (Before you needed to know
    what register or where on the stack the argument was).
 
  - The second is a way to see array of events. For example, if you reference
    a mac address, you can add:
 
    echo 'p:mac ip_rcv perm_addr=+574($arg2):x8[6]' > kprobe_events
 
    And this will produce:
 
    mac: (ip_rcv+0x0/0x140) perm_addr={0x52,0x54,0x0,0xc0,0x76,0xec}
 
 Other changes include
 
  - Exporting trace_dump_stack to modules
 
  - Have the stack tracer trace the entire stack (stop trying to remove
    tracing itself, as we keep removing too much).
 
  - Added support for SDT in uprobes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW9hdjxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmtbAP9GS/o2WSvsYLSIw4+mF94eCL06lUxp
 rRrktkEofm/PagEAl2JNmvHrAJN+LIrajqXTbwlZ7Ckk1rZhCW41Am7qnQs=
 =sTUM
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "The biggest change here is the updates to kprobes

  Back in January I posted patches to create function based events.
  These were the events that you suggested I make to allow developers to
  easily create events in code where no trace event exists. After
  posting those changes for review, it was suggested that we implement
  this instead with kprobes.

  The problem with kprobes is that the interface is too complex and
  needs to be simplified. Masami Hiramatsu posted patches in March and
  I've been playing with them a bit. There's been a bit of clean up in
  the kprobe code that was inspired by the function based event patches,
  and a couple of enhancements to the kprobe event interface.

   - If the arch supports it (we added support for x86), you can place a
     kprobe event at the start of a function and use $arg1, $arg2, etc
     to reference the arguments of a function. (Before you needed to
     know what register or where on the stack the argument was).

   - The second is a way to see array of events. For example, if you
     reference a mac address, you can add:

	echo 'p:mac ip_rcv perm_addr=+574($arg2):x8[6]' > kprobe_events

     And this will produce:

	mac: (ip_rcv+0x0/0x140) perm_addr={0x52,0x54,0x0,0xc0,0x76,0xec}

  Other changes include

   - Exporting trace_dump_stack to modules

   - Have the stack tracer trace the entire stack (stop trying to remove
     tracing itself, as we keep removing too much).

   - Added support for SDT in uprobes"

[ SDT - "Statically Defined Tracing" are userspace markers for tracing.
  Let's not use random TLA's in explanations unless they are fairly
  well-established as generic (at least for kernel people) - Linus ]

* tag 'trace-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (24 commits)
  tracing: Have stack tracer trace full stack
  tracing: Export trace_dump_stack to modules
  tracing: probeevent: Fix uninitialized used of offset in parse args
  tracing/kprobes: Allow kprobe-events to record module symbol
  tracing/kprobes: Check the probe on unloaded module correctly
  tracing/uprobes: Fix to return -EFAULT if copy_from_user failed
  tracing: probeevent: Add $argN for accessing function args
  x86: ptrace: Add function argument access API
  tracing: probeevent: Add array type support
  tracing: probeevent: Add symbol type
  tracing: probeevent: Unify fetch_insn processing common part
  tracing: probeevent: Append traceprobe_ for exported function
  tracing: probeevent: Return consumed bytes of dynamic area
  tracing: probeevent: Unify fetch type tables
  tracing: probeevent: Introduce new argument fetching code
  tracing: probeevent: Remove NOKPROBE_SYMBOL from print functions
  tracing: probeevent: Cleanup argument field definition
  tracing: probeevent: Cleanup print argument functions
  trace_uprobe: support reference counter in fd-based uprobe
  perf probe: Support SDT markers having reference counter (semaphore)
  ...
2018-10-30 09:49:56 -07:00
Colin Ian King
28fa741c27 perf/core: Clean up inconsisent indentation
Replace a bunch of spaces with tab, cleans up indentation

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20181029233211.21475-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-30 09:51:58 +01:00
Linus Torvalds
ba9f6f8954 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo updates from Eric Biederman:
 "I have been slowly sorting out siginfo and this is the culmination of
  that work.

  The primary result is in several ways the signal infrastructure has
  been made less error prone. The code has been updated so that manually
  specifying SEND_SIG_FORCED is never necessary. The conversion to the
  new siginfo sending functions is now complete, which makes it
  difficult to send a signal without filling in the proper siginfo
  fields.

  At the tail end of the patchset comes the optimization of decreasing
  the size of struct siginfo in the kernel from 128 bytes to about 48
  bytes on 64bit. The fundamental observation that enables this is by
  definition none of the known ways to use struct siginfo uses the extra
  bytes.

  This comes at the cost of a small user space observable difference.
  For the rare case of siginfo being injected into the kernel only what
  can be copied into kernel_siginfo is delivered to the destination, the
  rest of the bytes are set to 0. For cases where the signal and the
  si_code are known this is safe, because we know those bytes are not
  used. For cases where the signal and si_code combination is unknown
  the bits that won't fit into struct kernel_siginfo are tested to
  verify they are zero, and the send fails if they are not.

  I made an extensive search through userspace code and I could not find
  anything that would break because of the above change. If it turns out
  I did break something it will take just the revert of a single change
  to restore kernel_siginfo to the same size as userspace siginfo.

  Testing did reveal dependencies on preferring the signo passed to
  sigqueueinfo over si->signo, so bit the bullet and added the
  complexity necessary to handle that case.

  Testing also revealed bad things can happen if a negative signal
  number is passed into the system calls. Something no sane application
  will do but something a malicious program or a fuzzer might do. So I
  have fixed the code that performs the bounds checks to ensure negative
  signal numbers are handled"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
  signal: Guard against negative signal numbers in copy_siginfo_from_user32
  signal: Guard against negative signal numbers in copy_siginfo_from_user
  signal: In sigqueueinfo prefer sig not si_signo
  signal: Use a smaller struct siginfo in the kernel
  signal: Distinguish between kernel_siginfo and siginfo
  signal: Introduce copy_siginfo_from_user and use it's return value
  signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
  signal: Fail sigqueueinfo if si_signo != sig
  signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
  signal/unicore32: Use force_sig_fault where appropriate
  signal/unicore32: Generate siginfo in ucs32_notify_die
  signal/unicore32: Use send_sig_fault where appropriate
  signal/arc: Use force_sig_fault where appropriate
  signal/arc: Push siginfo generation into unhandled_exception
  signal/ia64: Use force_sig_fault where appropriate
  signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
  signal/ia64: Use the generic force_sigsegv in setup_frame
  signal/arm/kvm: Use send_sig_mceerr
  signal/arm: Use send_sig_fault where appropriate
  signal/arm: Use force_sig_fault where appropriate
  ...
2018-10-24 11:22:39 +01:00
Song Liu
a6ca88b241 trace_uprobe: support reference counter in fd-based uprobe
This patch enables uprobes with reference counter in fd-based uprobe.
Highest 32 bits of perf_event_attr.config is used to stored offset
of the reference count (semaphore).

Format information in /sys/bus/event_source/devices/uprobe/format/ is
updated to reflect this new feature.

Link: http://lkml.kernel.org/r/20181002053636.1896903-1-songliubraving@fb.com

Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-10-10 22:14:17 -04:00
Ingo Molnar
97e831e130 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:50:34 +02:00
Jiri Olsa
cd6fb677ce perf/ring_buffer: Prevent concurent ring buffer access
Some of the scheduling tracepoints allow the perf_tp_event
code to write to ring buffer under different cpu than the
code is running on.

This results in corrupted ring buffer data demonstrated in
following perf commands:

  # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 0.383 [sec]
  [ perf record: Woken up 8 times to write data ]
  0x42b890 [0]: failed to process type: -1765585640
  [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]

  # perf report --stdio
  0x42b890 [0]: failed to process type: -1765585640

The reason for the corruption are some of the scheduling tracepoints,
that have __perf_task dfined and thus allow to store data to another
cpu ring buffer:

  sched_waking
  sched_wakeup
  sched_wakeup_new
  sched_stat_wait
  sched_stat_sleep
  sched_stat_iowait
  sched_stat_blocked

The perf_tp_event function first store samples for current cpu
related events defined for tracepoint:

    hlist_for_each_entry_rcu(event, head, hlist_entry)
      perf_swevent_event(event, count, &data, regs);

And then iterates events of the 'task' and store the sample
for any task's event that passes tracepoint checks:

  ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);

  list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
    if (event->attr.type != PERF_TYPE_TRACEPOINT)
      continue;
    if (event->attr.config != entry->type)
      continue;

    perf_swevent_event(event, count, &data, regs);
  }

Above code can race with same code running on another cpu,
ending up with 2 cpus trying to store under the same ring
buffer, which is specifically not allowed.

This patch prevents the problem, by allowing only events with the same
current cpu to receive the event.

NOTE: this requires the use of (per-task-)per-cpu buffers for this
feature to work; perf-record does this.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
[peterz: small edits to Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: e6dab5ffab ("perf/trace: Add ability to set a target task for events")
Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:37:59 +02:00
Peter Zijlstra
a9f9772114 perf/core: Fix perf_pmu_unregister() locking
When we unregister a PMU, we fail to serialize the @pmu_idr properly.
Fix that by doing the entire thing under pmu_lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 2e80a82a49 ("perf: Dynamic pmu types")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:37:56 +02:00
Reinette Chatre
befb1b3c27 perf/core: Add sanity check to deal with pinned event failure
It is possible that a failure can occur during the scheduling of a
pinned event. The initial portion of perf_event_read_local() contains
the various error checks an event should pass before it can be
considered valid. Ensure that the potential scheduling failure
of a pinned event is checked for and have a credible error.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com
2018-09-28 22:44:53 +02:00
Ravi Bangoria
22bad38286 uprobes/sdt: Prevent multiple reference counter for same uprobe
We assume to have only one reference counter for one uprobe.
Don't allow user to register multiple uprobes having same
inode+offset but different reference counter.

Link: http://lkml.kernel.org/r/20180820044250.11659-3-ravi.bangoria@linux.ibm.com

Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-09-24 04:44:54 -04:00
Ravi Bangoria
1cc33161a8 uprobes: Support SDT markers having reference count (semaphore)
Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
have these markers embedded in them. These markers are added by developer
at important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:

    if (reference_counter > 0) {
        Execute marker instructions;
    }

Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.

Implement the reference counter logic in core uprobe. User will be
able to use it from trace_uprobe as well as from kernel module. New
trace_uprobe definition with reference counter will now be:

    <path>:<offset>[(ref_ctr_offset)]

where ref_ctr_offset is an optional field. For kernel module, new
variant of uprobe_register() has been introduced:

    uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer)

No new variant for uprobe_unregister() because it's assumed to have
only one reference counter for one uprobe.

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

Note: 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are calling it a 'reference counter'
in kernel / perf code.

Link: http://lkml.kernel.org/r/20180820044250.11659-2-ravi.bangoria@linux.ibm.com

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
[Only trace_uprobe.c]
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-09-24 04:44:53 -04:00
Alexander Shishkin
1627314fb5 perf: Suppress AUX/OVERWRITE records
It has been pointed out to me many times that it is useful to be able to
switch off AUX records to save the bandwidth for records that actually
matter, for example, in AUX overwrite mode.

The usefulness of PERF_RECORD_AUX is in some of its flags, like the
TRUNCATED flag that tells the decoder where exactly gaps in the trace
are.  The OVERWRITE flag, on the other hand will be set on every single
record in overwrite mode. However, a PERF_RECORD_AUX[flags=OVERWRITE] is
generated on every target task's sched_out, which over time adds up to a
lot of useless information.

If any folks out there have userspace that depends on a constant stream
of OVERWRITE records for a good reason, they'll have to let us know.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Markus T Metzger <markus.t.metzger@intel.com>
Link: http://lkml.kernel.org/r/20180404145323.28651-1-alexander.shishkin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-09-18 17:21:13 -03:00
Eric W. Biederman
55a3235fc7 signal: Properly deliver SIGILL from uprobes
For userspace to tell the difference between a random signal and an
exception, the exception must include siginfo information.

Using SEND_SIG_FORCED for SIGILL is thus wrong, and it will result
in userspace seeing si_code == SI_USER (like a random signal) instead
of si_code == SI_KERNEL or a more specific si_code as all exceptions
deliver.

Therefore replace force_sig_info(SIGILL, SEND_SIG_FORCE, current)
with force_sig(SIG_ILL, current) which gets this right and is
shorter and easier to type.

Fixes: 014940bad8 ("uprobes/x86: Send SIGILL if arch_uprobe_post_xol() fails")
Fixes: 0b5256c7f1 ("uprobes: Send SIGILL if handle_trampoline() fails")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11 21:18:45 +02:00
Yabin Cui
02e184476e perf/core: Force USER_DS when recording user stack data
Perf can record user stack data in response to a synchronous request, such
as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we
end up reading user stack data using __copy_from_user_inatomic() under
set_fs(KERNEL_DS). I think this conflicts with the intention of using
set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64
when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used.

So fix this by forcing USER_DS when recording user stack data.

Signed-off-by: Yabin Cui <yabinc@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 88b0193d94 ("perf/callchain: Force USER_DS when invoking perf_callchain_user()")
Link: http://lkml.kernel.org/r/20180823225935.27035-1-yabinc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10 14:01:46 +02:00
Ingo Molnar
fa94351b56 perf/urgent fixes:
Kernel:
 
 - Modify breakpoint fixes (Jiri Olsa)
 
 perf annotate:
 
 - Fix parsing aarch64 branch instructions after objdump update (Kim Phillips)
 
 - Fix parsing indirect calls in 'perf annotate' (Martin Liška)
 
 perf probe:
 
 - Ignore SyS symbols irrespective of endianness on PowerPC (Sandipan Das)
 
 perf trace:
 
 - Fix include path for asm-generic/unistd.h on arm64 (Kim Phillips)
 
 Core libraries:
 
 - Fix potential null pointer dereference in perf_evsel__new_idx() (Hisao Tanabe)
 
 - Use fixed size string for comms instead of scanf("%m"), that is
   not present in the bionic libc and leads to a crash (Chris Phlipot)
 
 - Fix bad memory access in trace info on 32-bit systems, we were reading
   8 bytes from a 4-byte long variable when saving the command line in the
   perf.data file.  (Chris Phlipot)
 
 Build system:
 
 - Streamline bpf examples and headers installation, clarifying
   some install messages. (Arnaldo Carvalho de Melo)
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCW41FnAAKCRCyPKLppCJ+
 J8MCAP4/RC5GwNrO5KYJ+G1iYb7QiNq9X/wsM7jCBlqWnTH+zgD9GYPIT3WQWKBN
 Rv94N4PNsYP4cpP7hTzWG0ar7p70owo=
 =+ODq
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-for-mingo-4.19-20180903' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/urgent fixes from Arnaldo Carvalho de Melo:

Kernel:

- Modify breakpoint fixes (Jiri Olsa)

perf annotate:

- Fix parsing aarch64 branch instructions after objdump update (Kim Phillips)

- Fix parsing indirect calls in 'perf annotate' (Martin Liška)

perf probe:

- Ignore SyS symbols irrespective of endianness on PowerPC (Sandipan Das)

perf trace:

- Fix include path for asm-generic/unistd.h on arm64 (Kim Phillips)

Core libraries:

- Fix potential null pointer dereference in perf_evsel__new_idx() (Hisao Tanabe)

- Use fixed size string for comms instead of scanf("%m"), that is
  not present in the bionic libc and leads to a crash (Chris Phlipot)

- Fix bad memory access in trace info on 32-bit systems, we were reading
  8 bytes from a 4-byte long variable when saving the command line in the
  perf.data file.  (Chris Phlipot)

Build system:

- Streamline bpf examples and headers installation, clarifying
  some install messages. (Arnaldo Carvalho de Melo)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-09 21:36:31 +02:00
Jiri Olsa
bf06278c3f perf/hw_breakpoint: Simplify breakpoint enable in perf_event_modify_breakpoint
We can safely enable the breakpoint back for both the fail and success
paths by checking only the bp->attr.disabled, which either holds the new
'requested' disabled state or the original breakpoint state.

Committer testing:

At the end of the series, the 'perf test' entry introduced as the first
patch now runs to completion without finding the fixed issues:

  # perf test "bp modify"
  62: x86 bp modify                                         : Ok
  #

In verbose mode:

  # perf test -v "bp modify"
  62: x86 bp modify                                         :
  --- start ---
  test child forked, pid 5161
  rip 5950a0, bp_1 0x5950a0
  in bp_1
  rip 5950a0, bp_1 0x5950a0
  in bp_1
  test child finished with 0
  ---- end ----
  x86 bp modify: Ok

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180827091228.2878-6-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-30 14:49:24 -03:00
Jiri Olsa
969558371b perf/hw_breakpoint: Enable breakpoint in modify_user_hw_breakpoint
Currently we enable the breakpoint back only if the breakpoint
modification was successful. If it fails we can leave the breakpoint in
disabled state with attr->disabled == 0.

We can safely enable the breakpoint back for both the fail and success
paths by checking the bp->attr.disabled, which either holds the new
'requested' disabled state or the original breakpoint state.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180827091228.2878-5-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-30 14:49:23 -03:00
Jiri Olsa
cb45302d7c perf/hw_breakpoint: Remove superfluous bp->attr.disabled = 0
Once the breakpoint was succesfully modified, the attr->disabled value
is in bp->attr.disabled. So there's no reason to set it again, removing
that.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180827091228.2878-4-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-30 14:49:23 -03:00
Jiri Olsa
bd14406b78 perf/hw_breakpoint: Modify breakpoint even if the new attr has disabled set
We need to change the breakpoint even if the attr with new fields has
disabled set to true.

Current code prevents following user code to change the breakpoint
address:

  ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[0]), addr_1)
  ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[0]), addr_2)
  ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[7]), dr7)

The first PTRACE_POKEUSER creates the breakpoint with attr.disabled set
to true:

  ptrace_set_breakpoint_addr(nr = 0)
    struct perf_event *bp = t->ptrace_bps[nr];

    ptrace_register_breakpoint(..., disabled = true)
      ptrace_fill_bp_fields(..., disabled)
      register_user_hw_breakpoint

So the second PTRACE_POKEUSER will be omitted:

  ptrace_set_breakpoint_addr(nr = 0)
    struct perf_event *bp = t->ptrace_bps[nr];
    struct perf_event_attr attr = bp->attr;

    modify_user_hw_breakpoint(bp, &attr)
      if (!attr->disabled)
        modify_user_hw_breakpoint_check

Reported-by: Milind Chabbi <chabbi.milind@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180827091228.2878-3-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-30 14:49:23 -03:00
Arnd Bergmann
3723c63247 treewide: convert ISO_8859-1 text comments to utf-8
Almost all files in the kernel are either plain text or UTF-8 encoded.  A
couple however are ISO_8859-1, usually just a few characters in a C
comments, for historic reasons.

This converts them all to UTF-8 for consistency.

Link: http://lkml.kernel.org/r/20180724111600.4158975-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Simon Horman <horms@verge.net.au>			[IPVS portion]
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>	[IIO]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>			[powerpc]
Acked-by: Rob Herring <robh@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rob Herring <robh+dt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Linus Torvalds
0214f46b3a Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.

  This set of changes is split into several parts:

   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.

   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ...
2018-08-21 13:47:29 -07:00
Linus Torvalds
7140ad3898 Updates for v4.19:
- Restructure of lockdep and latency tracers
 
    This is the biggest change. Joel Fernandes restructured the hooks
    from irqs and preemption disabling and enabling. He got rid of
    a lot of the preprocessor #ifdef mess that they caused.
 
    He turned both lockdep and the latency tracers to use trace events
    inserted in the preempt/irqs disabling paths. But unfortunately,
    these started to cause issues in corner cases. Thus, parts of the
    code was reverted back to where lockde and the latency tracers
    just get called directly (without using the trace events).
    But because the original change cleaned up the code very nicely
    we kept that, as well as the trace events for preempt and irqs
    disabling, but they are limited to not being called in NMIs.
 
  - Have trace events use SRCU for "rcu idle" calls. This was required
    for the preempt/irqs off trace events. But it also had to not
    allow them to be called in NMI context. Waiting till Paul makes
    an NMI safe SRCU API.
 
  - New notrace SRCU API to allow trace events to use SRCU.
 
  - Addition of mcount-nop option support
 
  - SPDX headers replacing GPL templates.
 
  - Various other fixes and clean ups.
 
  - Some fixes are marked for stable, but were not fully tested
    before the merge window opened.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
 c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
 =vgEr
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Restructure of lockdep and latency tracers

   This is the biggest change. Joel Fernandes restructured the hooks
   from irqs and preemption disabling and enabling. He got rid of a lot
   of the preprocessor #ifdef mess that they caused.

   He turned both lockdep and the latency tracers to use trace events
   inserted in the preempt/irqs disabling paths. But unfortunately,
   these started to cause issues in corner cases. Thus, parts of the
   code was reverted back to where lockdep and the latency tracers just
   get called directly (without using the trace events). But because the
   original change cleaned up the code very nicely we kept that, as well
   as the trace events for preempt and irqs disabling, but they are
   limited to not being called in NMIs.

 - Have trace events use SRCU for "rcu idle" calls. This was required
   for the preempt/irqs off trace events. But it also had to not allow
   them to be called in NMI context. Waiting till Paul makes an NMI safe
   SRCU API.

 - New notrace SRCU API to allow trace events to use SRCU.

 - Addition of mcount-nop option support

 - SPDX headers replacing GPL templates.

 - Various other fixes and clean ups.

 - Some fixes are marked for stable, but were not fully tested before
   the merge window opened.

* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
  tracing: Fix SPDX format headers to use C++ style comments
  tracing: Add SPDX License format tags to tracing files
  tracing: Add SPDX License format to bpf_trace.c
  blktrace: Add SPDX License format header
  s390/ftrace: Add -mfentry and -mnop-mcount support
  tracing: Add -mcount-nop option support
  tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
  tracing: Handle CC_FLAGS_FTRACE more accurately
  Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
  Uprobes: Simplify uprobe_register() body
  tracepoints: Free early tracepoints after RCU is initialized
  uprobes: Use synchronize_rcu() not synchronize_sched()
  tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
  ftrace: Remove unused pointer ftrace_swapper_pid
  tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
  tracing/irqsoff: Handle preempt_count for different configs
  tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
  tracing: irqsoff: Account for additional preempt_disable
  trace: Use rcu_dereference_raw for hooks from trace-event subsystem
  tracing/kprobes: Fix within_notrace_func() to check only notrace functions
  ...
2018-08-20 18:32:00 -07:00
Linus Torvalds
1202f4fdbc arm64 updates for 4.19
A bunch of good stuff in here:
 
 - Wire up support for qspinlock, replacing our trusty ticket lock code
 
 - Add an IPI to flush_icache_range() to ensure that stale instructions
   fetched into the pipeline are discarded along with the I-cache lines
 
 - Support for the GCC "stackleak" plugin
 
 - Support for restartable sequences, plus an arm64 port for the selftest
 
 - Kexec/kdump support on systems booting with ACPI
 
 - Rewrite of our syscall entry code in C, which allows us to zero the
   GPRs on entry from userspace
 
 - Support for chained PMU counters, allowing 64-bit event counters to be
   constructed on current CPUs
 
 - Ensure scheduler topology information is kept up-to-date with CPU
   hotplug events
 
 - Re-enable support for huge vmalloc/IO mappings now that the core code
   has the correct hooks to use break-before-make sequences
 
 - Miscellaneous, non-critical fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJbbV41AAoJELescNyEwWM0WoEIALhrKtsIn6vqFlSs/w6aDuJL
 cMWmFxjTaKLmIq2+cJIdFLOJ3CH80Pu9gB+nEv/k+cZdCTfUVKfRf28HTpmYWsht
 bb4AhdHMC7yFW752BHk+mzJspeC8h/2Rm8wMuNVplZ3MkPrwo3vsiuJTofLhVL/y
 BihlU3+5sfBvCYIsWnuEZIev+/I/s/qm1ASiqIcKSrFRZP6VTt5f9TC75vFI8seW
 7yc3odKb0CArexB8yBjiPNziehctQF42doxQyL45hezLfWw4qdgHOSiwyiOMxEz9
 Fwwpp8Tx33SKLNJgqoqYznGW9PhYJ7n2Kslv19uchJrEV+mds82vdDNaWRULld4=
 =kQn6
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "A bunch of good stuff in here. Worth noting is that we've pulled in
  the x86/mm branch from -tip so that we can make use of the core
  ioremap changes which allow us to put down huge mappings in the
  vmalloc area without screwing up the TLB. Much of the positive
  diffstat is because of the rseq selftest for arm64.

  Summary:

   - Wire up support for qspinlock, replacing our trusty ticket lock
     code

   - Add an IPI to flush_icache_range() to ensure that stale
     instructions fetched into the pipeline are discarded along with the
     I-cache lines

   - Support for the GCC "stackleak" plugin

   - Support for restartable sequences, plus an arm64 port for the
     selftest

   - Kexec/kdump support on systems booting with ACPI

   - Rewrite of our syscall entry code in C, which allows us to zero the
     GPRs on entry from userspace

   - Support for chained PMU counters, allowing 64-bit event counters to
     be constructed on current CPUs

   - Ensure scheduler topology information is kept up-to-date with CPU
     hotplug events

   - Re-enable support for huge vmalloc/IO mappings now that the core
     code has the correct hooks to use break-before-make sequences

   - Miscellaneous, non-critical fixes and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
  arm64: alternative: Use true and false for boolean values
  arm64: kexec: Add comment to explain use of __flush_icache_range()
  arm64: sdei: Mark sdei stack helper functions as static
  arm64, kaslr: export offset in VMCOREINFO ELF notes
  arm64: perf: Add cap_user_time aarch64
  efi/libstub: Only disable stackleak plugin for arm64
  arm64: drop unused kernel_neon_begin_partial() macro
  arm64: kexec: machine_kexec should call __flush_icache_range
  arm64: svc: Ensure hardirq tracing is updated before return
  arm64: mm: Export __sync_icache_dcache() for xen-privcmd
  drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
  arm64: Add support for STACKLEAK gcc plugin
  arm64: Add stack information to on_accessible_stack
  drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
  arm64: fix ACPI dependencies
  rseq/selftests: Add support for arm64
  arm64: acpi: fix alignment fault in accessing ACPI
  efi/arm: map UEFI memory map even w/o runtime services enabled
  efi/arm: preserve early mapping of UEFI memory map longer for BGRT
  drivers: acpi: add dependency of EFI for arm64
  ...
2018-08-14 16:39:13 -07:00
Ravi Bangoria
6d43743e90 Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Add addition argument 'arch_uprobe' to uprobe_write_opcode().
We need this in later set of patches.

Link: http://lkml.kernel.org/r/20180809041856.1547-3-ravi.bangoria@linux.ibm.com

Reviewed-by: Song Liu <songliubraving@fb.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-13 20:08:33 -04:00
Ravi Bangoria
38e967ae1e Uprobes: Simplify uprobe_register() body
Simplify uprobe_register() function body and let __uprobe_register()
handle everything. Also move dependency functions around to avoid build
failures.

Link: http://lkml.kernel.org/r/20180809041856.1547-2-ravi.bangoria@linux.ibm.com

Reviewed-by: Song Liu <songliubraving@fb.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-08-13 20:07:06 -04:00
Michael O'Farrell
9d2dcc8fc6 arm64: perf: Add cap_user_time aarch64
It is useful to get the running time of a thread.  Doing so in an
efficient manner can be important for performance of user applications.
Avoiding system calls in `clock_gettime` when handling
CLOCK_THREAD_CPUTIME_ID is important.  Other clocks are handled in the
VDSO, but CLOCK_THREAD_CPUTIME_ID falls back on the system call.

CLOCK_THREAD_CPUTIME_ID is not handled in the VDSO since it would have
costs associated with maintaining updated user space accessible time
offsets.  These offsets have to be updated everytime the a thread is
scheduled/descheduled.  However, for programs regularly checking the
running time of a thread, this is a performance improvement.

This patch takes a middle ground, and adds support for cap_user_time an
optional feature of the perf_event API.  This way costs are only
incurred when the perf_event api is enabled.  This is done the same way
as it is in x86.

Ultimately this allows calculating the thread running time in userspace
on aarch64 as follows (adapted from perf_event_open manpage):

u32 seq, time_mult, time_shift;
u64 running, count, time_offset, quot, rem, delta;
struct perf_event_mmap_page *pc;
pc = buf;  // buf is the perf event mmaped page as documented in the API.

if (pc->cap_usr_time) {
    do {
        seq = pc->lock;
        barrier();
        running = pc->time_running;

        count = readCNTVCT_EL0();  // Read ARM hardware clock.
        time_offset = pc->time_offset;
        time_mult   = pc->time_mult;
        time_shift  = pc->time_shift;

        barrier();
    } while (pc->lock != seq);

    quot = (count >> time_shift);
    rem = count & (((u64)1 << time_shift) - 1);
    delta = time_offset + quot * time_mult +
            ((rem * time_mult) >> time_shift);

    running += delta;
    // running now has the current nanosecond level thread time.
}

Summary of changes in the patch:

For aarch64 systems, make arch_perf_update_userpage update the timing
information stored in the perf_event page.  Requiring the following
calculations:
  - Calculate the appropriate time_mult, and time_shift factors to convert
    ticks to nano seconds for the current clock frequency.
  - Adjust the mult and shift factors to avoid shift factors of 32 bits.
    (possibly unnecessary)
  - The time_offset userspace should apply when doing calculations:
    negative the current sched time (now), because time_running and
    time_enabled fields of the perf_event page have just been updated.
Toggle bits to appropriate values:
  - Enable cap_user_time

Signed-off-by: Michael O'Farrell <micpof@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-07-31 10:14:00 +01:00
Ingo Molnar
93081caaae Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:47:02 +02:00
Mathieu Poirier
7f635ff187 perf/core: Fix crash when using HW tracing kernel filters
In function perf_event_parse_addr_filter(), the path::dentry of each struct
perf_addr_filter is left unassigned (as it should be) when the pattern
being parsed is related to kernel space.  But in function
perf_addr_filter_match() the same dentries are given to d_inode() where
the value is not expected to be NULL, resulting in the following splat:

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
  pc : perf_event_mmap+0x2fc/0x5a0
  lr : perf_event_mmap+0x2c8/0x5a0
  Process uname (pid: 2860, stack limit = 0x000000001cbcca37)
  Call trace:
   perf_event_mmap+0x2fc/0x5a0
   mmap_region+0x124/0x570
   do_mmap+0x344/0x4f8
   vm_mmap_pgoff+0xe4/0x110
   vm_mmap+0x2c/0x40
   elf_map+0x60/0x108
   load_elf_binary+0x450/0x12c4
   search_binary_handler+0x90/0x290
   __do_execve_file.isra.13+0x6e4/0x858
   sys_execve+0x3c/0x50
   el0_svc_naked+0x30/0x34

This patch is fixing the problem by introducing a new check in function
perf_addr_filter_match() to see if the filter's dentry is NULL.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: miklos@szeredi.hu
Cc: namhyung@kernel.org
Cc: songliubraving@fb.com
Fixes: 9511bce9fe ("perf/core: Fix bad use of igrab()")
Link: http://lkml.kernel.org/r/1531782831-1186-1-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:46:22 +02:00
Peter Zijlstra
6cbc304f2f perf/x86/intel: Fix unwind errors from PEBS entries (mk-II)
Vince reported the perf_fuzzer giving various unwinder warnings and
Josh reported:

> Deja vu.  Most of these are related to perf PEBS, similar to the
> following issue:
>
>   b8000586c9 ("perf/x86/intel: Cure bogus unwind from PEBS entries")
>
> This is basically the ORC version of that.  setup_pebs_sample_data() is
> assembling a franken-pt_regs which ORC isn't happy about.  RIP is
> inconsistent with some of the other registers (like RSP and RBP).

And where the previous unwinder only needed BP,SP ORC also requires
IP. But we cannot spoof IP because then the sample will get displaced,
entirely negating the point of PEBS.

So cure the whole thing differently by doing the unwind early; this
does however require a means to communicate we did the unwind early.
We (ab)use an unused sample_type bit for this, which we set on events
that fill out the data->callchain before the normal
perf_prepare_sample().

Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:46:21 +02:00
Eric W. Biederman
6883f81aac pid: Implement PIDTYPE_TGID
Everywhere except in the pid array we distinguish between a tasks pid and
a tasks tgid (thread group id).  Even in the enumeration we want that
distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array.  Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.

The net size increase is just an extra pointer added to struct pid and
an extra pair of pointers of an hlist_node added to task_struct.

The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to it's fullest.  The long term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
Tobias Tefke
788faab70d perf, tools: Use correct articles in comments
Some of the comments in the perf events code use articles incorrectly,
using 'a' for words beginning with a vowel sound, where 'an' should be
used.

Signed-off-by: Tobias Tefke <tobias.tefke@tutanota.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/20180709105715.22938-1-tobias.tefke@tutanota.com
[ Fix a few more perf related 'a event' typo fixes from all around the kernel and tooling tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:21:03 +02:00
Mathieu Malaterre
9331510135 perf/core: Move inline keyword at the beginning of declaration
Fix non-fatal warning triggered during compilation with W=1:

  kernel/events/core.c:6106:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
   static void __always_inline
   ^~~~~~

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180626202301.20270-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-27 09:55:58 +02:00
Frederic Weisbecker
26c6ccdf5c perf/hw_breakpoint: Clean up and consolidate modify_user_hw_breakpoint_check()
Remove the dance around old and new attributes. Just don't modify the
previous breakpoint at all until we have verified everything.

Original-patch-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-13-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-26 09:07:59 +02:00
Frederic Weisbecker
cb8b78815b perf/hw_breakpoint: Pass new breakpoint type to modify_breakpoint_slot()
We soon won't be able to rely on bp->attr anymore to get the new
type of the modifying breakpoint because the new attributes are going
to be copied only once we successfully modified the breakpoint slot.

This will fix the current misdesigned layout where the new attr are
copied to the modifying breakpoint before we actually know if the
modification will be validated.

In order to prepare for that, allow modify_breakpoint_slot() to take
the new breakpoint type.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-12-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-26 09:07:59 +02:00
Frederic Weisbecker
cffbb3bd44 perf/hw_breakpoint: Remove default hw_breakpoint_arch_parse()
All architectures have implemented it, we can now remove the poor weak
version.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-11-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-26 09:07:58 +02:00
Frederic Weisbecker
8e983ff9ac perf/hw_breakpoint: Pass arch breakpoint struct to arch_check_bp_in_kernelspace()
We can't pass the breakpoint directly on arch_check_bp_in_kernelspace()
anymore because its architecture internal datas (struct arch_hw_breakpoint)
are not yet filled by the time we call the function, and most
implementation need this backend to be up to date. So arrange the
function to take the probing struct instead.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-3-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-26 09:07:54 +02:00
Frederic Weisbecker
9a4903dde2 perf/hw_breakpoint: Split attribute parse and commit
arch_validate_hwbkpt_settings() mixes up attribute check and commit into
a single code entity. Therefore the validation may return an error due to
incorrect atributes while still leaving halfway modified architecture
breakpoint data.

This is harmless when we deal with a new breakpoint but it becomes a
problem when we modify an existing breakpoint.

Split attribute parse and commit to fix that. The architecture is
passed a "struct arch_hw_breakpoint" to fill on top of the new attr
and the core takes care about copying the backend data once it's fully
validated. The architectures then need to implement the new API.

Original-patch-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-26 09:07:53 +02:00
Ingo Molnar
f446474889 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-26 09:02:41 +02:00
Linus Torvalds
c81b995f00 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A pile of perf updates:

  Kernel side:

   - Remove an incorrect warning in uprobe_init_insn() when
     insn_get_length() fails. The error return code is handled at the
     call site.

   - Move the inline keyword to the right place in the perf ringbuffer
     code to address a W=1 build warning.

  Tooling:

  perf stat:

   - Fix metric column header display alignment

   - Improve error messages for default attributes, providing better
     output for error in command line.

   - Add --interval-clear option, to provide a 'watch' like printing

  perf script:

   - Show hw-cache events too

  perf c2c:

   - Fix data dependency problem in layout of 'struct c2c_hist_entry'

  Core:

   - Do not blindly assume that 'struct perf_evsel' can be obtained via
     a straight forward container_of() as there are call sites which
     hand in a plain 'struct hist' which is not part of a container.

   - Fix error index in the PMU event parser, so that error messages can
     point to the problematic token"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Move the inline keyword at the beginning of the function declaration
  uprobes/x86: Remove incorrect WARN_ON() in uprobe_init_insn()
  perf script: Show hw-cache events
  perf c2c: Keep struct hist_entry at the end of struct c2c_hist_entry
  perf stat: Add event parsing error handling to add_default_attributes
  perf stat: Allow to specify specific metric column len
  perf stat: Fix metric column header display alignment
  perf stat: Use only color_fprintf call in print_metric_only
  perf stat: Add --interval-clear option
  perf tools: Fix error index for pmu event parser
  perf hists: Reimplement hists__has_callchains()
  perf hists browser gtk: Use hist_entry__has_callchains()
  perf hists: Make hist_entry__has_callchains() work with 'perf c2c'
  perf hists: Save the callchain_size in struct hist_entry
2018-06-24 20:29:15 +08:00
Mathieu Malaterre
57d6a7938a perf/core: Move the inline keyword at the beginning of the function declaration
When building perf with W=1 the following warning triggers:

  CC      kernel/events/ring_buffer.o
  kernel/events/ring_buffer.c:105:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
   static bool __always_inline
   ^~~~~~
  ...

Move the inline keyword to the beginning of the function declaration.

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: trival@kernel.org
Link: http://lkml.kernel.org/r/20180308202856.9378-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-22 11:07:47 +02:00
Souptick Joarder
9e3ed2d759 perf/core: Change perf_mmap_fault() return type to 'vm_fault_t'
Use new return type 'vm_fault_t' for fault handlers.

For now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.

See the following commit:

  1c8f422059 ("mm: change return type to vm_fault_t")

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: akpm@linux-foundation.org
Cc: alexander.shishkin@linux.intel.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: https://lkml.kernel.org/lkml/20180521182520.GA19677@jordon-HP-15-Notebook-PC
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 16:26:25 +02:00
Kees Cook
590b5b7d86 treewide: kzalloc_node() -> kcalloc_node()
The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:

        kzalloc_node(a * b, gfp, node)

with:
        kcalloc_node(a * b, gfp, node)

as well as handling cases of:

        kzalloc_node(a * b * c, gfp, node)

with:

        kzalloc_node(array3_size(a, b, c), gfp, node)

as it's slightly less ugly than:

        kcalloc_node(array_size(a, b), c, gfp, node)

This does, however, attempt to ignore constant size factors like:

        kzalloc_node(4 * 1024, gfp, node)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc_node(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc_node(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc_node(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc_node(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc_node(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc_node(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc_node(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc_node(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc_node(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc_node(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc_node
+ kcalloc_node
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc_node(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc_node(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc_node(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc_node(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc_node(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc_node(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc_node(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc_node(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc_node(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc_node(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc_node(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc_node(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc_node(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc_node(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc_node(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc_node(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc_node(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc_node(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc_node(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc_node(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc_node(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc_node(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc_node(C1 * C2 * C3, ...)
|
  kzalloc_node(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc_node(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc_node(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc_node(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc_node(sizeof(THING) * C2, ...)
|
  kzalloc_node(sizeof(TYPE) * C2, ...)
|
  kzalloc_node(C1 * C2 * C3, ...)
|
  kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc_node
+ kcalloc_node
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a * b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
1c8c5a9d38 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add Maglev hashing scheduler to IPVS, from Inju Song.

 2) Lots of new TC subsystem tests from Roman Mashak.

 3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

 4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

 5) Add ttl inherit support to vxlan, from Hangbin Liu.

 6) Properly separate ipv6 routes into their logically independant
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

 7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

 8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

 9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

11) Plumb extack down into fib_rules, from Roopa Prabhu.

12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

13) Add UDP GSO support, from Willem de Bruijn.

14) Add documentation for eBPF helpers, from Quentin Monnet.

15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

24) Add TCP SACK compression, from Eric Dumazet.

25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

28) Add lots of forwarding selftests, from Petr Machata.

29) Add generic network device failover driver, from Sridhar Samudrala.

* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
  strparser: Add __strp_unpause and use it in ktls.
  rxrpc: Fix terminal retransmission connection ID to include the channel
  net: hns3: Optimize PF CMDQ interrupt switching process
  net: hns3: Fix for VF mailbox receiving unknown message
  net: hns3: Fix for VF mailbox cannot receiving PF response
  bnx2x: use the right constant
  Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
  net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
  enic: fix UDP rss bits
  netdev-FAQ: clarify DaveM's position for stable backports
  rtnetlink: validate attributes in do_setlink()
  mlxsw: Add extack messages for port_{un, }split failures
  netdevsim: Add extack error message for devlink reload
  devlink: Add extack to reload and port_{un, }split operations
  net: metrics: add proper netlink validation
  ipmr: fix error path when ipmr_new_table fails
  ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
  net: hns3: remove unused hclgevf_cfg_func_mta_filter
  netfilter: provide udp*_lib_lookup for nf_tproxy
  qed*: Utilize FW 8.37.2.0
  ...
2018-06-06 18:39:49 -07:00
Eugene Syromiatnikov
82489c5fe5 perf/core: Wire up compat PERF_EVENT_IOC_QUERY_BPF, PERF_EVENT_IOC_MODIFY_ATTRIBUTES
Since pointer size is different in compat, and switching in _perf_ioctl
is done using exact ioctl numbers, all new ioctl numbers that use pointer
should be added to perf_compat_ioctl for _IOC_SIZE fixup before passing
to perf_ioctl routine (this shouldn't be needed if semantics of the size
argument of _IO* macros was honored).

Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20180521123420.GA24291@asgard.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25 08:11:11 +02:00
Song Liu
9511bce9fe perf/core: Fix bad use of igrab()
As Miklos reported and suggested:

 "This pattern repeats two times in trace_uprobe.c and in
  kernel/events/core.c as well:

      ret = kern_path(filename, LOOKUP_FOLLOW, &path);
      if (ret)
          goto fail_address_parse;

      inode = igrab(d_inode(path.dentry));
      path_put(&path);

  And it's wrong.  You can only hold a reference to the inode if you
  have an active ref to the superblock as well (which is normally
  through path.mnt) or holding s_umount.

  This way unmounting the containing filesystem while the tracepoint is
  active will give you the "VFS: Busy inodes after unmount..." message
  and a crash when the inode is finally put.

  Solution: store path instead of inode."

This patch fixes the issue in kernel/event/core.c.

Reviewed-and-tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/20180418062907.3210386-2-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25 08:11:10 +02:00
Song Liu
a1150c2022 perf/core: Fix group scheduling with mixed hw and sw events
When hw and sw events are mixed in the same group, they are all attached
to the hw perf_event_context. This sometimes requires moving group of
perf_event to a different context.

We found a bug in how the kernel handles this, for example if we do:

   perf stat -e '{faults,ref-cycles,faults}'  -I 1000

     1.005591180              1,297      faults
     1.005591180        457,476,576      ref-cycles
     1.005591180    <not supported>      faults

First, sw event "faults" is attached to the sw context, and becomes the
group leader. Then, hw event "ref-cycles" is attached, so both events
are moved to the hw context. Last, another sw "faults" tries to attach,
but it fails because of mismatch between the new target ctx (from sw
pmu) and the group_leader's ctx (hw context, same as ref-cycles).

The broken condition is:
   group_leader is sw event;
   group_leader is on hw context;
   add a sw event to the group.

Fix this scenario by checking group_leader's context (instead of just
event type). If group_leader is on hw context, use the ->pmu of this
context to look up context for the new event.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: b04243ef70 ("perf: Complete software pmu grouping")
Link: http://lkml.kernel.org/r/20180503194716.162815-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25 08:11:10 +02:00
David S. Miller
90fed9c946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2018-05-24

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).

2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.

3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.

4) Jiong Wang adds support for indirect and arithmetic shifts to NFP

5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.

6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing
   to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions.

7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.

8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.

9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24 22:20:51 -04:00
Yonghong Song
f8d959a5b1 perf/core: add perf_get_event() to return perf_event given a struct file
A new extern function, perf_get_event(), is added to return a perf event
given a struct file. This function will be used in later patches.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24 18:18:19 -07:00
Peter Zijlstra
4411ec1d19 perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
> kernel/events/ring_buffer.c:871 perf_mmap_to_page() warn: potential spectre issue 'rb->aux_pages'

Userspace controls @pgoff through the fault address. Sanitize the
array index before doing the array dereference.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 08:37:27 +02:00
Linus Torvalds
f4ef6a438c Various fixes in tracing:
- Tracepoints should not give warning on OOM failures
 
  - Use special field for function pointer in trace event
 
  - Fix igrab issues in uprobes
 
  - Fixes to the new histogram triggers
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWuoYdBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtFnAP9X4+AVDQH0VfsMLSc9D+rK6WmcRIhv
 q8J2gNPv3anM+AD/SFXWGO4ihN+0KDw/TqmJxESNEybq47vTZ/s5lM6A4gQ=
 =fQbj
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various fixes in tracing:

   - Tracepoints should not give warning on OOM failures

   - Use special field for function pointer in trace event

   - Fix igrab issues in uprobes

   - Fixes to the new histogram triggers"

* tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoint: Do not warn on ENOMEM
  tracing: Add field modifier parsing hist error for hist triggers
  tracing: Add field parsing hist error for hist triggers
  tracing: Restore proper field flag printing when displaying triggers
  tracing: initcall: Ordered comparison of function pointers
  tracing: Remove igrab() iput() call from uprobes.c
  tracing: Fix bad use of igrab in trace_uprobe.c
2018-05-02 17:38:37 -10:00
Song Liu
61f94203c9 tracing: Remove igrab() iput() call from uprobes.c
Caller of uprobe_register is required to keep the inode and containing
mount point referenced.

There was misuse of igrab() in uprobes.c and trace_uprobe.c. This is
because igrab() will not prevent umount of the containing mount point.
To fix this, we added path to struct trace_uprobe, which keeps the inode
and containing mount reference.

For uprobes.c, it is not necessary to call igrab() in uprobe_register(),
as the caller is required to keep the inode reference. The igrab() is
removed and comments on this requirement is added to uprobe_register().

Link: http://lkml.kernel.org/r/CAELBmZB2XX=qEOLAdvGG4cPx4GEntcSnWQquJLUK1ongRj35cA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180423172135.4050588-2-songliubraving@fb.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Howard McLauchlan <hmclauchlan@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-26 14:50:56 -04:00
Jiri Olsa
bfb3d7b8b9 perf: Remove superfluous allocation error check
If the get_callchain_buffers fails to allocate the buffer it will
decrease the nr_callchain_events right away.

There's no point of checking the allocation error for
nr_callchain_events > 1. Removing that check.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: syzkaller-bugs@googlegroups.com
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20180415092352.12403-3-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-04-17 09:47:40 -03:00
Jiri Olsa
5af44ca53d perf: Fix sample_max_stack maximum check
The syzbot hit KASAN bug in perf_callchain_store having the entry stored
behind the allocated bounds [1].

We miss the sample_max_stack check for the initial event that allocates
callchain buffers. This missing check allows to create an event with
sample_max_stack value bigger than the global sysctl maximum:

  # sysctl -a | grep perf_event_max_stack
  kernel.perf_event_max_stack = 127

  # perf record -vv -C 1 -e cycles/max-stack=256/ kill
  ...
  perf_event_attr:
    size                             112
    ...
    sample_max_stack                 256
  ------------------------------------------------------------
  sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 4

Note the '-C 1', which forces perf record to create just single event.
Otherwise it opens event for every cpu, then the sample_max_stack check
fails on the second event and all's fine.

The fix is to run the sample_max_stack check also for the first event
with callchains.

[1] https://marc.info/?l=linux-kernel&m=152352732920874&w=2

Reported-by: syzbot+7c449856228b63ac951e@syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: syzkaller-bugs@googlegroups.com
Cc: x86@kernel.org
Fixes: 97c79a38cd ("perf core: Per event callchain limit")
Link: http://lkml.kernel.org/r/20180415092352.12403-2-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-04-17 09:47:40 -03:00
Jiri Olsa
78b562fbfa perf: Return proper values for user stack errors
Return immediately when we find issue in the user stack checks. The
error value could get overwritten by following check for
PERF_SAMPLE_REGS_INTR.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: syzkaller-bugs@googlegroups.com
Cc: x86@kernel.org
Fixes: 60e2364e60 ("perf: Add ability to sample machine state on interrupt")
Link: http://lkml.kernel.org/r/20180415092352.12403-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-04-17 09:47:39 -03:00
Alexey Budankov
101592b490 perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]
Store preempting context switch out event into Perf trace as a part of
PERF_RECORD_SWITCH[_CPU_WIDE] record.

Percentage of preempting and non-preempting context switches help
understanding the nature of workloads (CPU or IO bound) that are running
on a machine;

The event is treated as preemption one when task->state value of the
thread being switched out is TASK_RUNNING. Event type encoding is
implemented using PERF_RECORD_MISC_SWITCH_OUT_PREEMPT bit;

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/9ff84e83-a0ca-dd82-a6d0-cb951689be74@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-04-17 09:47:39 -03:00
Song Liu
32e6e967fb perf/core: Need CAP_SYS_ADMIN to create k/uprobe with perf_event_open()
Non-root user cannot create kprobe or uprobe through the text-based
interface (kprobe_events, uprobe_events),so they should not be able
to create probes via perf_event_open() either.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 33ea4b2427 ("perf/core: Implement the 'perf_uprobe' PMU")
Fixes: e12f03d703 ("perf/core: Implement the 'perf_kprobe' PMU")
Link: http://lkml.kernel.org/r/C0B2EFB5-C403-4BDB-9046-C14B3EE66999@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-12 09:55:50 +02:00
Prashant Bhole
621b6d2ea2 perf/core: Fix use-after-free in uprobe_perf_close()
A use-after-free bug was caught by KASAN while running usdt related
code (BCC project. bcc/tests/python/test_usdt2.py):

	==================================================================
	BUG: KASAN: use-after-free in uprobe_perf_close+0x222/0x3b0
	Read of size 4 at addr ffff880384f9b4a4 by task test_usdt2.py/870

	CPU: 4 PID: 870 Comm: test_usdt2.py Tainted: G        W         4.16.0-next-20180409 #215
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
	Call Trace:
	 dump_stack+0xc7/0x15b
	 ? show_regs_print_info+0x5/0x5
	 ? printk+0x9c/0xc3
	 ? kmsg_dump_rewind_nolock+0x6e/0x6e
	 ? uprobe_perf_close+0x222/0x3b0
	 print_address_description+0x83/0x3a0
	 ? uprobe_perf_close+0x222/0x3b0
	 kasan_report+0x1dd/0x460
	 ? uprobe_perf_close+0x222/0x3b0
	 uprobe_perf_close+0x222/0x3b0
	 ? probes_open+0x180/0x180
	 ? free_filters_list+0x290/0x290
	 trace_uprobe_register+0x1bb/0x500
	 ? perf_event_attach_bpf_prog+0x310/0x310
	 ? probe_event_disable+0x4e0/0x4e0
	 perf_uprobe_destroy+0x63/0xd0
	 _free_event+0x2bc/0xbd0
	 ? lockdep_rcu_suspicious+0x100/0x100
	 ? ring_buffer_attach+0x550/0x550
	 ? kvm_sched_clock_read+0x1a/0x30
	 ? perf_event_release_kernel+0x3e4/0xc00
	 ? __mutex_unlock_slowpath+0x12e/0x540
	 ? wait_for_completion+0x430/0x430
	 ? lock_downgrade+0x3c0/0x3c0
	 ? lock_release+0x980/0x980
	 ? do_raw_spin_trylock+0x118/0x150
	 ? do_raw_spin_unlock+0x121/0x210
	 ? do_raw_spin_trylock+0x150/0x150
	 perf_event_release_kernel+0x5d4/0xc00
	 ? put_event+0x30/0x30
	 ? fsnotify+0xd2d/0xea0
	 ? sched_clock_cpu+0x18/0x1a0
	 ? __fsnotify_update_child_dentry_flags.part.0+0x1b0/0x1b0
	 ? pvclock_clocksource_read+0x152/0x2b0
	 ? pvclock_read_flags+0x80/0x80
	 ? kvm_sched_clock_read+0x1a/0x30
	 ? sched_clock_cpu+0x18/0x1a0
	 ? pvclock_clocksource_read+0x152/0x2b0
	 ? locks_remove_file+0xec/0x470
	 ? pvclock_read_flags+0x80/0x80
	 ? fcntl_setlk+0x880/0x880
	 ? ima_file_free+0x8d/0x390
	 ? lockdep_rcu_suspicious+0x100/0x100
	 ? ima_file_check+0x110/0x110
	 ? fsnotify+0xea0/0xea0
	 ? kvm_sched_clock_read+0x1a/0x30
	 ? rcu_note_context_switch+0x600/0x600
	 perf_release+0x21/0x40
	 __fput+0x264/0x620
	 ? fput+0xf0/0xf0
	 ? do_raw_spin_unlock+0x121/0x210
	 ? do_raw_spin_trylock+0x150/0x150
	 ? SyS_fchdir+0x100/0x100
	 ? fsnotify+0xea0/0xea0
	 task_work_run+0x14b/0x1e0
	 ? task_work_cancel+0x1c0/0x1c0
	 ? copy_fd_bitmaps+0x150/0x150
	 ? vfs_read+0xe5/0x260
	 exit_to_usermode_loop+0x17b/0x1b0
	 ? trace_event_raw_event_sys_exit+0x1a0/0x1a0
	 do_syscall_64+0x3f6/0x490
	 ? syscall_return_slowpath+0x2c0/0x2c0
	 ? lockdep_sys_exit+0x1f/0xaa
	 ? syscall_return_slowpath+0x1a3/0x2c0
	 ? lockdep_sys_exit+0x1f/0xaa
	 ? prepare_exit_to_usermode+0x11c/0x1e0
	 ? enter_from_user_mode+0x30/0x30
	random: crng init done
	 ? __put_user_4+0x1c/0x30
	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
	RIP: 0033:0x7f41d95f9340
	RSP: 002b:00007fffe71e4268 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
	RAX: 0000000000000000 RBX: 000000000000000d RCX: 00007f41d95f9340
	RDX: 0000000000000000 RSI: 0000000000002401 RDI: 000000000000000d
	RBP: 0000000000000000 R08: 00007f41ca8ff700 R09: 00007f41d996dd1f
	R10: 00007fffe71e41e0 R11: 0000000000000246 R12: 00007fffe71e4330
	R13: 0000000000000000 R14: fffffffffffffffc R15: 00007fffe71e4290

	Allocated by task 870:
	 kasan_kmalloc+0xa0/0xd0
	 kmem_cache_alloc_node+0x11a/0x430
	 copy_process.part.19+0x11a0/0x41c0
	 _do_fork+0x1be/0xa20
	 do_syscall_64+0x198/0x490
	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

	Freed by task 0:
	 __kasan_slab_free+0x12e/0x180
	 kmem_cache_free+0x102/0x4d0
	 free_task+0xfe/0x160
	 __put_task_struct+0x189/0x290
	 delayed_put_task_struct+0x119/0x250
	 rcu_process_callbacks+0xa6c/0x1b60
	 __do_softirq+0x238/0x7ae

	The buggy address belongs to the object at ffff880384f9b480
	 which belongs to the cache task_struct of size 12928

It occurs because task_struct is freed before perf_event which refers
to the task and task flags are checked while teardown of the event.
perf_event_alloc() assigns task_struct to hw.target of perf_event,
but there is no reference counting for it.

As a fix we get_task_struct() in perf_event_alloc() at above mentioned
assignment and put_task_struct() in _free_event().

Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 63b6da39bb ("perf: Fix perf_event_exit_task() race")
Link: http://lkml.kernel.org/r/20180409100346.6416-1-bhole_prashant_q7@lab.ntt.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09 18:15:58 +02:00
Alexander Shishkin
6ed70cf342 perf/x86/pt, coresight: Clean up address filter structure
This is a cosmetic patch that deals with the address filter structure's
ambiguous fields 'filter' and 'range'. The former stands to mean that the
filter's *action* should be to filter the traces to its address range if
it's set or stop tracing if it's unset. This is confusing and hard on the
eyes, so this patch replaces it with 'action' enum. The 'range' field is
completely redundant (meaning that the filter is an address range as
opposed to a single address trigger), as we can use zero size to mean the
same thing.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180329120648.11902-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-29 16:07:22 +02:00
Ingo Molnar
2d074918fb Merge branch 'perf/urgent' into perf/core
Conflicts:
	kernel/events/hw_breakpoint.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-29 16:03:48 +02:00
Linus Torvalds
f67b15037a perf/hwbp: Simplify the perf-hwbp code, fix documentation
Annoyingly, modify_user_hw_breakpoint() unnecessarily complicates the
modification of a breakpoint - simplify it and remove the pointless
local variables.

Also update the stale Docbook while at it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-28 17:41:50 +02:00
Ingo Molnar
7054e4e0b1 Merge branch 'perf/urgent' into perf/core, to pick up fixes
With the cherry-picked perf/urgent commit merged separately we can now
merge all the fixes without conflicts.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-24 09:21:47 +01:00
Song Liu
c917e0f259 perf/cgroup: Fix child event counting bug
When a perf_event is attached to parent cgroup, it should count events
for all children cgroups:

   parent_group   <---- perf_event
     \
      - child_group  <---- process(es)

However, in our tests, we found this perf_event cannot report reliable
results. Here is an example case:

  # create cgroups
  mkdir -p /sys/fs/cgroup/p/c
  # start perf for parent group
  perf stat -e instructions -G "p"

  # on another console, run test process in child cgroup:
  stressapptest -s 2 -M 1000 & echo $! > /sys/fs/cgroup/p/c/cgroup.procs

  # after the test process is done, stop perf in the first console shows

       <not counted>      instructions              p

The instruction should not be "not counted" as the process runs in the
child cgroup.

We found this is because perf_event->cgrp and cpuctx->cgrp are not
identical, thus perf_event->cgrp are not updated properly.

This patch fixes this by updating perf_cgroup properly for ancestor
cgroup(s).

Reported-by: Ephraim Park <ephiepark@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.com>
Cc: <kernel-team@fb.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20180312165943.1057894-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:58:47 +01:00
Ingo Molnar
134933e557 Linux 4.16-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlqvCPYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOaAH/171cgZGFEXSONxK
 3O1AAv61wN5K/ISMt6mnelWR6fZg195FarOx0Rnq7Ot8OWuVa8CGcyT4vX4Z7nb9
 SVMQKNMPCVQE4WCDOv6S0njChmRC0BxBoVJtTN9fhywdYgX1KcaTS/drMRHACF5n
 rB9eouMQScfMzKGAW08gp5NvEGJ6W1SLX7La3/u0751dYisdJSP7+vFZNxUrGXEA
 yIPOQjFu0Tfo8GXz/BwC678RZVzVLN0sE6+/vM7zNnoDlsRVkdDIVMo3UiVqm/NK
 B37/TlZz8CYoapoKnRRB5giXnSPDSXtsikbGy3mcy0u5imGe+ZgdjrdYSaLk31cR
 NVZY08k=
 =pu3X
 -----END PGP SIGNATURE-----

Merge tag 'v4.16-rc6' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-19 20:37:35 +01:00
Mark Rutland
24868367cd perf/core: Clear sibling list of detached events
When perf_group_dettach() is called on a group leader, it updates each
sibling's group_leader field to point to that sibling, effectively
upgrading each siblnig to a group leader. After perf_group_detach has
completed, the caller may free the leader event.

We only remove siblings from the group leader's sibling_list when the
leader has a non-empty group_node. This was fine prior to commit:

  8343aae661 ("perf/core: Remove perf_event::group_entry")

... as the sibling's sibling_list would be empty. However, now that we
use the sibling_list field as both the list head and the list entry,
this leaves each sibling with a non-empty sibling list, including the
stale leader event.

If perf_group_detach() is subsequently called on a sibling, it will
appear to be a group leader, and we'll walk the sibling_list,
potentially dereferencing these stale events. In 0day testing, this has
been observed to result in kernel panics.

Let's avoid this by always removing siblings from the sibling list when
we promote them to leaders.

Fixes: 8343aae661 ("perf/core: Remove perf_event::group_entry")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: vincent.weaver@maine.edu
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: valery.cherepennikov@intel.com
Cc: linux-tip-commits@vger.kernel.org
Cc: eranian@google.com
Cc: acme@redhat.com
Cc: alexander.shishkin@linux.intel.com
Cc: davidcc@google.com
Cc: kan.liang@intel.com
Cc: Dmitry.Prohorov@intel.com
Cc: Jiri Olsa <jolsa@redhat.com>
Link: https://lkml.kernel.org/r/20180316131741.3svgr64yibc6vsid@lakrids.cambridge.arm.com
2018-03-16 20:44:32 +01:00
Peter Zijlstra
edb39592a5 perf: Fix sibling iteration
Mark noticed that the change to sibling_list changed some iteration
semantics; because previously we used group_list as list entry,
sibling events would always have an empty sibling_list.

But because we now use sibling_list for both list head and list entry,
siblings will report as having siblings.

Fix this with a custom for_each_sibling_event() iterator.

Fixes: 8343aae661 ("perf/core: Remove perf_event::group_entry")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: vincent.weaver@maine.edu
Cc: alexander.shishkin@linux.intel.com
Cc: torvalds@linux-foundation.org
Cc: alexey.budankov@linux.intel.com
Cc: valery.cherepennikov@intel.com
Cc: eranian@google.com
Cc: acme@redhat.com
Cc: linux-tip-commits@vger.kernel.org
Cc: davidcc@google.com
Cc: kan.liang@intel.com
Cc: Dmitry.Prohorov@intel.com
Cc: jolsa@redhat.com
Link: https://lkml.kernel.org/r/20180315170129.GX4043@hirez.programming.kicks-ass.net
2018-03-16 20:44:12 +01:00
Milind Chabbi
32ff77e8cc perf/core: Implement fast breakpoint modification via _IOC_MODIFY_ATTRIBUTES
Problem and motivation: Once a breakpoint perf event (PERF_TYPE_BREAKPOINT)
is created, there is no flexibility to change the breakpoint type
(bp_type), breakpoint address (bp_addr), or breakpoint length (bp_len). The
only option is to close the perf event and configure a new breakpoint
event. This inflexibility has a significant performance overhead. For
example, sampling-based, lightweight performance profilers (and also
concurrency bug detection tools),  monitor different addresses for a short
duration using PERF_TYPE_BREAKPOINT and change the address (bp_addr) to
another address or change the kind of breakpoint (bp_type) from  "write" to
a "read" or vice-versa or change the length (bp_len) of the address being
monitored. The cost of these modifications is prohibitive since it involves
unmapping the circular buffer associated with the perf event, closing the
perf event, opening another perf event and mmaping another circular buffer.

Solution: The new ioctl flag for perf events,
PERF_EVENT_IOC_MODIFY_ATTRIBUTES, introduced in this patch takes a pointer
to a struct perf_event_attr as an argument to update an old breakpoint
event with new address, type, and size. This facility allows retaining a
previous mmaped perf events ring buffer and avoids having to close and
reopen another perf event.

This patch supports only changing PERF_TYPE_BREAKPOINT event type; future
implementations can extend this feature. The patch replicates some of its
functionality of modify_user_hw_breakpoint() in
kernel/events/hw_breakpoint.c. modify_user_hw_breakpoint cannot be called
directly since perf_event_ctx_lock() is already held in _perf_ioctl().

Evidence: Experiments show that the baseline (not able to modify an already
created breakpoint) costs an order of magnitude (~10x) more than the
suggested optimization (having the ability to dynamically modifying a
configured breakpoint via ioctl). When the breakpoints typically do not
trap, the speedup due to the suggested optimization is ~10x; even when the
breakpoints always trap, the speedup is ~4x due to the suggested
optimization.

Testing: tests posted at
https://github.com/linux-contrib/perf_event_modify_bp demonstrate the
performance significance of this patch. Tests also check the functional
correctness of the patch.

Signed-off-by: Milind Chabbi <chabbi.milind@gmail.com>
[ Using modify_user_hw_breakpoint_check function. ]
[ Reformated PERF_EVENT_IOC_*, so the values are all in one column. ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-8-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 15:24:02 +01:00
Jiri Olsa
5f970521d3 perf/core: Move perf_event_attr::sample_max_stack into perf_copy_attr()
Move the sample_max_stack check and setup into perf_copy_attr(),
so we have all perf_event_attr initial setup in one place
and can easily compare attrs in the new ioctl introduced
in following change.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-7-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:08 +01:00
Jiri Olsa
705feaf321 hw_breakpoint: Add perf_event_attr fields check in __modify_user_hw_breakpoint()
And rename it to modify_user_hw_breakpoint_check().

We are about to use modify_user_hw_breakpoint_check() for user space
breakpoints modification, we must be very strict to check only the
fields we can change have changed. As Peter explained:

 "Suppose someone does:

        attr = malloc(sizeof(*attr)); // uninitialized memory
        attr->type = BP;
        attr->bp_addr = new_addr;
        attr->bp_type = bp_type;
        attr->bp_len = bp_len;
        ioctl(fd, PERF_IOC_MOD_ATTR, &attr);

  And feeds absolute shite for the rest of the fields.
  Then we later want to extend IOC_MOD_ATTR to allow changing
  attr::sample_type but we can't, because that would break the
  above application."

I'm making this check optional because we already export
modify_user_hw_breakpoint() and with this check we could
break existing users.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-6-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:08 +01:00
Jiri Olsa
18ff57b220 hw_breakpoint: Factor out __modify_user_hw_breakpoint() function
Moving out all the functionality without the events
disabling/enabling calls, because we want to call another
disabling/enabling functions in following change.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-5-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:08 +01:00
Jiri Olsa
ea6a9d530c hw_breakpoint: Add modify_bp_slot() function
Add the modify_bp_slot() function to keep slot numbers
correct when changing the breakpoint type.

Using existing __release_bp_slot()/__reserve_bp_slot()
call sequence to update the slot counts.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-4-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:07 +01:00
Jiri Olsa
1ad9ff7dea hw_breakpoint: Pass bp_type argument to __reserve_bp_slot|__release_bp_slot()
Passing bp_type argument to __reserve_bp_slot() and __release_bp_slot()
functions, so we can pass another bp_type than the one defined in
bp->attr.bp_type. This will be handy in following change that fixes
breakpoint slot counts during its modification.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-3-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:07 +01:00
Jiri Olsa
cbd9d9f114 hw_breakpoint: Pass bp_type directly as find_slot_idx() argument
Pass bp_type directly as a find_slot_idx() argument,
so we don't need to have whole event to get the
breakpoint slot type. It will be used in following
changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-2-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:07 +01:00
leilei.lin
33801b9474 perf/core: Fix installing cgroup events on CPU
There's two problems when installing cgroup events on CPUs: firstly
list_update_cgroup_event() only tries to set cpuctx->cgrp for the
first event, if that mismatches on @cgrp we'll not try again for later
additions.

Secondly, when we install a cgroup event into an active context, only
issue an event reprogram when the event matches the current cgroup
context. This avoids a pointless event reprogramming.

Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
[ Improved the changelog and comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: brendan.d.gregg@gmail.com
Cc: eranian@gmail.com
Cc: linux-kernel@vger.kernel.org
Cc: yang_oliver@hotmail.com
Link: http://lkml.kernel.org/r/20180306093637.28247-1-linxiulei@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:51 +01:00
Peter Zijlstra
8d5bce0c37 perf/core: Optimize perf_rotate_context() event scheduling
The event schedule order (as per perf_event_sched_in()) is:

 - cpu  pinned
 - task pinned
 - cpu  flexible
 - task flexible

But perf_rotate_context() will unschedule cpu-flexible even if it
doesn't need a rotation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:50 +01:00
Peter Zijlstra
8703a7cfe1 perf/core: Fix tree based event rotation
Similar to how first programming cpu=-1 and then cpu=# is wrong, so is
rotating both. It was especially wrong when we were still programming
the PMU in this same order, because in that scenario we might never
actually end up running cpu=# events at all.

Cure this by using the active_list to pick the rotation event; since
at programming we already select the left-most event.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:50 +01:00
Peter Zijlstra
6e6804d2fa perf/core: Simpify perf_event_groups_for_each()
The last argument is, and always must be, the same.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:50 +01:00
Peter Zijlstra
6668128a9e perf/core: Optimize ctx_sched_out()
When an event group contains more events than can be scheduled on the
hardware, iterating the full event group for ctx_sched_out is a waste
of time.

Keep track of the events that got programmed on the hardware, such
that we can iterate this smaller list in order to schedule them out.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:50 +01:00
Peter Zijlstra
8343aae661 perf/core: Remove perf_event::group_entry
Now that all the grouping is done with RB trees, we no longer need
group_entry and can replace the whole thing with sibling_list.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
Peter Zijlstra
1cac7b1ae3 perf/core: Fix event schedule order
Scheduling in events with cpu=-1 before events with cpu=# changes
semantics and is undesirable in that it would priorize these events.

Given that groups->index is across all groups we actually have an
inter-group ordering, meaning we can merge-sort two groups, which is
just what we need to preserve semantics.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
Peter Zijlstra
161c85fab7 perf/core: Cleanup the rb-tree code
Trivial comment and code fixups..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
Alexey Budankov
8e1a2031e4 perf/cor: Use RB trees for pinned/flexible groups
Change event groups into RB trees sorted by CPU and then by a 64bit
index, so that multiplexing hrtimer interrupt handler would be able
skipping to the current CPU's list and ignore groups allocated for the
other CPUs.

New API for manipulating event groups in the trees is implemented as well
as adoption on the API in the current implementation.

pinned_group_sched_in() and flexible_group_sched_in() API are
introduced to consolidate code enabling the whole group from pinned
and flexible groups appropriately.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/372f9c8b-0cfe-4240-e44d-83d863d40813@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
Peter Zijlstra
9e5b127d6f perf/core: Fix perf_output_read_group()
Mark reported his arm64 perf fuzzer runs sometimes splat like:

  armv8pmu_read_counter+0x1e8/0x2d8
  armpmu_event_update+0x8c/0x188
  armpmu_read+0xc/0x18
  perf_output_read+0x550/0x11e8
  perf_event_read_event+0x1d0/0x248
  perf_event_exit_task+0x468/0xbb8
  do_exit+0x690/0x1310
  do_group_exit+0xd0/0x2b0
  get_signal+0x2e8/0x17a8
  do_signal+0x144/0x4f8
  do_notify_resume+0x148/0x1e8
  work_pending+0x8/0x14

which asserts that we only call pmu::read() on ACTIVE events.

The above callchain does:

  perf_event_exit_task()
    perf_event_exit_task_context()
      task_ctx_sched_out() // INACTIVE
      perf_event_exit_event()
        perf_event_set_state(EXIT) // EXIT
        sync_child_event()
          perf_event_read_event()
            perf_output_read()
              perf_output_read_group()
                leader->pmu->read()

Which results in doing a pmu::read() on an !ACTIVE event.

I _think_ this is 'new' since we added attr.inherit_stat, which added
the perf_event_read_event() to the exit path, without that
perf_event_read_output() would only trigger from samples and for
@event to trigger a sample, it's leader _must_ be ACTIVE too.

Still, adding this check makes it consistent with the @sub case for
the siblings.

Reported-and-Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:48 +01:00
Song Liu
bd903afeb5 perf/core: Fix ctx_event_type in ctx_resched()
In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is
added. However, ctx_resched() calculates ctx_event_type before checking
this condition. As a result, pinned events will NOT get higher priority
than flexible events.

The following shows this issue on an Intel CPU (where ref-cycles can
only use one hardware counter).

  1. First start:
       perf stat -C 0 -e ref-cycles  -I 1000
  2. Then, in the second console, run:
       perf stat -C 0 -e ref-cycles:D -I 1000

The second perf uses pinned events, which is expected to have higher
priority. However, because it failed in ctx_resched(). It is never
run.

This patch fixes this by calculating ctx_event_type after re-evaluating
event_type.

Reported-by: Ephraim Park <ephiepark@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.com>
Cc: <kernel-team@fb.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts")
Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 08:03:02 +01:00
Ingo Molnar
7057bb975d Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-17 11:39:28 +01:00
Linus Torvalds
a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Song Liu
33ea4b2427 perf/core: Implement the 'perf_uprobe' PMU
This patch adds perf_uprobe support with similar pattern as previous
patch (for kprobe).

Two functions, create_local_trace_uprobe() and
destroy_local_trace_uprobe(), are created so a uprobe can be created
and attached to the file descriptor created by perf_event_open().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: <daniel@iogearbox.net>
Cc: <davem@davemloft.net>
Cc: <kernel-team@fb.com>
Cc: <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171206224518.3598254-7-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 11:29:28 +01:00
Song Liu
e12f03d703 perf/core: Implement the 'perf_kprobe' PMU
A new PMU type, perf_kprobe is added. Based on attr from perf_event_open(),
perf_kprobe creates a kprobe (or kretprobe) for the perf_event. This
kprobe is private to this perf_event, and thus not added to global
lists, and not available in tracefs.

Two functions, create_local_trace_kprobe() and
destroy_local_trace_kprobe()  are added to created and destroy these
local trace_kprobe.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: <daniel@iogearbox.net>
Cc: <davem@davemloft.net>
Cc: <kernel-team@fb.com>
Cc: <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171206224518.3598254-6-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06 11:29:26 +01:00
Linus Torvalds
b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Linus Torvalds
168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Linus Torvalds
0aebc6a440 arm64 updates for 4.16:
- Security mitigations:
   - variant 2: invalidating the branch predictor with a call to secure firmware
   - variant 3: implementing KPTI for arm64
 
 - 52-bit physical address support for arm64 (ARMv8.2)
 
 - arm64 support for RAS (firmware first only) and SDEI (software
   delegated exception interface; allows firmware to inject a RAS error
   into the OS)
 
 - Perf support for the ARM DynamIQ Shared Unit PMU
 
 - CPUID and HWCAP bits updated for new floating point multiplication
   instructions in ARMv8.4
 
 - Removing some virtual memory layout printks during boot
 
 - Fix initial page table creation to cope with larger than 32M kernel
   images when 16K pages are enabled
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAlpwxDMACgkQa9axLQDI
 XvF55BAAniMpxPXnYNfv6l7/4O8eKo1lJIaG1wbej4JRZ/rT3K4Z3OBXW1dKHO8d
 /PTbVmZ90IqIGROkoDrE+6xyjjn9yK3uuW4ytN2zQkBa8VFaHAnHlX+zKQcuwy9f
 yxwiHk+C7vK5JR7mpXTazjRknsUv1MPtlTt7DQrSdq0KRDJVDNFC+grmbew2rz0X
 cjQDqZqgzuFyrKxdiQVjDmc3zH9NsNBhDo0hlGHf2jK6bGJsAPtI8M2JcLrK8ITG
 Ye/dD7BJp1mWD8ff0BPaMxu24qfAMNLH8f2dpTa986/H78irVz7i/t5HG0/1+5Jh
 EE4OFRTKZ59Qgyo1zWcaJvdp8YjiaX/L4PWJg8CxM5OhP9dIac9ydcFQfWzpKpUs
 xyZfmK6XliGFReAkVOOf5tEqFUDhMtsqhzPYmbmU1lp61wmSYIZ8CTenpWWCJSRO
 NOGyG1X2uFBvP69+iPNlfTGz1r7tg1URY5iO8fUEIhY8LrgyORkiqw4OvPEgnMXP
 Ngy+dXhyvnps2AAWbSX0O4puRlTgEYLT5KaMLzH/+gWsXATT0rzUCD/aOwUQq/Y7
 SWXZHkb3jpmOZZnzZsLL2MNzEIPCFBwSUE9fSv4dA9d/N6tUmlmZALJjHkfzCDpj
 +mPsSmAMTj72kUYzm0b5GCtOu/iQ2kDWOZjOM1m4+v/B+f7JoEE=
 =iEjP
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "The main theme of this pull request is security covering variants 2
  and 3 for arm64. I expect to send additional patches next week
  covering an improved firmware interface (requires firmware changes)
  for variant 2 and way for KPTI to be disabled on unaffected CPUs
  (Cavium's ThunderX doesn't work properly with KPTI enabled because of
  a hardware erratum).

  Summary:

   - Security mitigations:
      - variant 2: invalidate the branch predictor with a call to
        secure firmware
      - variant 3: implement KPTI for arm64

   - 52-bit physical address support for arm64 (ARMv8.2)

   - arm64 support for RAS (firmware first only) and SDEI (software
     delegated exception interface; allows firmware to inject a RAS
     error into the OS)

   - perf support for the ARM DynamIQ Shared Unit PMU

   - CPUID and HWCAP bits updated for new floating point multiplication
     instructions in ARMv8.4

   - remove some virtual memory layout printks during boot

   - fix initial page table creation to cope with larger than 32M kernel
     images when 16K pages are enabled"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (104 commits)
  arm64: Fix TTBR + PAN + 52-bit PA logic in cpu_do_switch_mm
  arm64: Turn on KPTI only on CPUs that need it
  arm64: Branch predictor hardening for Cavium ThunderX2
  arm64: Run enable method for errata work arounds on late CPUs
  arm64: Move BP hardening to check_and_switch_context
  arm64: mm: ignore memory above supported physical address size
  arm64: kpti: Fix the interaction between ASID switching and software PAN
  KVM: arm64: Emulate RAS error registers and set HCR_EL2's TERR & TEA
  KVM: arm64: Handle RAS SErrors from EL2 on guest exit
  KVM: arm64: Handle RAS SErrors from EL1 on guest exit
  KVM: arm64: Save ESR_EL2 on guest SError
  KVM: arm64: Save/Restore guest DISR_EL1
  KVM: arm64: Set an impdef ESR for Virtual-SError using VSESR_EL2.
  KVM: arm/arm64: mask/unmask daif around VHE guests
  arm64: kernel: Prepare for a DISR user
  arm64: Unconditionally enable IESB on exception entry/return for firmware-first
  arm64: kernel: Survive corrected RAS errors notified by SError
  arm64: cpufeature: Detect CPU RAS Extentions
  arm64: sysreg: Move to use definitions for all the SCTLR bits
  arm64: cpufeature: __this_cpu_has_cap() shouldn't stop early
  ...
2018-01-30 13:57:43 -08:00
Linus Torvalds
d8b91dde38 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Clean up the x86 instruction decoder (Masami Hiramatsu)

   - Add new uprobes optimization for PUSH instructions on x86 (Yonghong
     Song)

   - Add MSR_IA32_THERM_STATUS to the MSR events (Stephane Eranian)

   - Fix misc bugs, update documentation, plus various cleanups (Jiri
     Olsa)

  There's a large number of tooling side improvements:

   - Intel-PT/BTS improvements (Adrian Hunter)

   - Numerous 'perf trace' improvements (Arnaldo Carvalho de Melo)

   - Introduce an errno code to string facility (Hendrik Brueckner)

   - Various build system improvements (Jiri Olsa)

   - Add support for CoreSight trace decoding by making the perf tools
     use the external openCSD (Mathieu Poirier, Tor Jeremiassen)

   - Add ARM Statistical Profiling Extensions (SPE) support (Kim
     Phillips)

   - libtraceevent updates (Steven Rostedt)

   - Intel vendor event JSON updates (Andi Kleen)

   - Introduce 'perf report --mmaps' and 'perf report --tasks' to show
     info present in 'perf.data' (Jiri Olsa, Arnaldo Carvalho de Melo)

   - Add infrastructure to record first and last sample time to the
     perf.data file header, so that when processing all samples in a
     'perf record' session, such as when doing build-id processing, or
     when specifically requesting that that info be recorded, use that
     in 'perf report --time', that also got support for percent slices
     in addition to absolute ones.

     I.e. now it is possible to ask for the samples in the 10%-20% time
     slice of a perf.data file (Jin Yao)

   - Allow system wide 'perf stat --per-thread', sorting the result (Jin
     Yao)

     E.g.:

      [root@jouet ~]# perf stat --per-thread --metrics IPC
      ^C
       Performance counter stats for 'system wide':

                  make-22229  23,012,094,032  inst_retired.any   #  0.8 IPC
                   cc1-22419     692,027,497  inst_retired.any   #  0.8 IPC
                   gcc-22418     328,231,855  inst_retired.any   #  0.9 IPC
                   cc1-22509     220,853,647  inst_retired.any   #  0.8 IPC
                   gcc-22486     199,874,810  inst_retired.any   #  1.0 IPC
                    as-22466     177,896,365  inst_retired.any   #  0.9 IPC
                   cc1-22465     150,732,374  inst_retired.any   #  0.8 IPC
                   gcc-22508     112,555,593  inst_retired.any   #  0.9 IPC
                   cc1-22487     108,964,079  inst_retired.any   #  0.7 IPC
       qemu-system-x86-2697       21,330,550  inst_retired.any   #  0.3 IPC
       systemd-journal-551        20,642,951  inst_retired.any   #  0.4 IPC
       docker-containe-17651       9,552,892  inst_retired.any   #  0.5 IPC
       dockerd-current-9809        7,528,586  inst_retired.any   #  0.5 IPC
                  make-22153  12,504,194,380  inst_retired.any   #  0.8 IPC
               python2-22429  12,081,290,954  inst_retired.any   #  0.8 IPC
      <SNIP>
               python2-22429  15,026,328,103  cpu_clk_unhalted.thread
                   cc1-22419     826,660,193  cpu_clk_unhalted.thread
                   gcc-22418     365,321,295  cpu_clk_unhalted.thread
                   cc1-22509     279,169,362  cpu_clk_unhalted.thread
                   gcc-22486     210,156,950  cpu_clk_unhalted.thread
      <SNIP>

           5.638075538 seconds time elapsed

     [root@jouet ~]#

   - Improve shell auto-completion of perf events (Jin Yao)

   - 'perf probe' improvements (Masami Hiramatsu)

   - Improve PMU infrastructure to support amp64's ThunderX2
     implementation defined core events (Ganapatrao Kulkarni)

   - Various annotation related improvements and fixes (Thomas Richter)

   - Clarify usage of 'overwrite' and 'backward' in the evlist/mmap
     code, removing the 'overwrite' parameter from several functions as
     it was always used it as 'false' (Wang Nan)

   - Fix/improve 'perf record' reverse recording support (Wang Nan)

   - Improve command line options documentation (Sihyeon Jang)

   - Optimize sample parsing for ordering events, where we don't need to
     parse all the PERF_SAMPLE_ bits, just the ones leading to the
     timestamp needed to reorder events (Jiri Olsa)

   - Generalize the annotation code to support other source information
     besides objdump/DWARF obtained ones, starting with python scripts,
     that will is slated to be merged soon (Jiri Olsa)

   - ... and a lot more that I failed to list, see the shortlog and
     changelog for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (262 commits)
  perf trace beauty flock: Move to separate object file
  perf evlist: Remove fcntl.h from evlist.h
  perf trace beauty futex: Beautify FUTEX_BITSET_MATCH_ANY
  perf trace: Do not print from time delta for interrupted syscall lines
  perf trace: Add --print-sample
  perf bpf: Remove misplaced __maybe_unused attribute
  MAINTAINERS: Adding entry for CoreSight trace decoding
  perf tools: Add mechanic to synthesise CoreSight trace packets
  perf tools: Add full support for CoreSight trace decoding
  pert tools: Add queue management functionality
  perf tools: Add functionality to communicate with the openCSD decoder
  perf tools: Add support for decoding CoreSight trace data
  perf tools: Add decoder mechanic to support dumping trace data
  perf tools: Add processing of coresight metadata
  perf tools: Add initial entry point for decoder CoreSight traces
  perf tools: Integrating the CoreSight decoding library
  perf vendor events intel: Update IvyTown files to V20
  perf vendor events intel: Update IvyBridge files to V20
  perf vendor events intel: Update BroadwellDE events to V7
  perf vendor events intel: Update SkylakeX events to V1.06
  ...
2018-01-30 11:15:14 -08:00
Linus Torvalds
d772794637 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main RCU changes in this cycle were:

   - Updates to use cond_resched() instead of cond_resched_rcu_qs()
     where feasible (currently everywhere except in kernel/rcu and in
     kernel/torture.c). Also a couple of fixes to avoid sending IPIs to
     offline CPUs.

   - Updates to simplify RCU's dyntick-idle handling.

   - Updates to remove almost all uses of smp_read_barrier_depends() and
     read_barrier_depends().

   - Torture-test updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  torture: Save a line in stutter_wait(): while -> for
  torture: Eliminate torture_runnable and perf_runnable
  torture: Make stutter less vulnerable to compilers and races
  locking/locktorture: Fix num reader/writer corner cases
  locking/locktorture: Fix rwsem reader_delay
  torture: Place all torture-test modules in one MAINTAINERS group
  rcutorture/kvm-build.sh: Skip build directory check
  rcutorture: Simplify functions.sh include path
  rcutorture: Simplify logging
  rcutorture/kvm-recheck-*: Improve result directory readability check
  rcutorture/kvm.sh: Support execution from any directory
  rcutorture/kvm.sh: Use consistent help text for --qemu-args
  rcutorture/kvm.sh: Remove unused variable, `alldone`
  rcutorture: Remove unused script, config2frag.sh
  rcutorture/configinit: Fix build directory error message
  rcutorture: Preempt RCU-preempt readers more vigorously
  torture: Reduce #ifdefs for preempt_schedule()
  rcu: Remove have_rcu_nocb_mask from tree_plugin.h
  rcu: Add comment giving debug strategy for double call_rcu()
  tracing, rcu: Hide trace event rcu_nocb_wake when not used
  ...
2018-01-30 10:15:30 -08:00
Peter Zijlstra
0c7296cad6 perf/core: Fix ctx::mutex deadlock
Lockdep noticed the following 3-way lockup scenario:

	sys_perf_event_open()
	  perf_event_alloc()
	    perf_try_init_event()
 #0	      ctx = perf_event_ctx_lock_nested(1)
	      perf_swevent_init()
		swevent_hlist_get()
 #1		  mutex_lock(&pmus_lock)

	perf_event_init_cpu()
 #1	  mutex_lock(&pmus_lock)
 #2	  mutex_lock(&ctx->mutex)

	sys_perf_event_open()
	  mutex_lock_double()
 #2	   mutex_lock()
 #0	   mutex_lock_nested()

And while we need that perf_event_ctx_lock_nested() for HW PMUs such
that they can iterate the sibling list, trying to match it to the
available counters, the software PMUs need do no such thing. Exclude
them.

In particular the swevent triggers the above invertion, while the
tpevent PMU triggers a more elaborate one through their event_mutex.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-25 14:48:30 +01:00
Peter Zijlstra
43fa87f7de perf/core: Fix another perf,trace,cpuhp lock inversion
Lockdep noticed the following 3-way lockup race:

        perf_trace_init()
 #0       mutex_lock(&event_mutex)
          perf_trace_event_init()
            perf_trace_event_reg()
              tp_event->class->reg() := tracepoint_probe_register
 #1              mutex_lock(&tracepoints_mutex)
                  trace_point_add_func()
 #2                  static_key_enable()

 #2	do_cpu_up()
	  perf_event_init_cpu()
 #3	    mutex_lock(&pmus_lock)
 #4	    mutex_lock(&ctx->mutex)

	perf_ioctl()
 #4	  ctx = perf_event_ctx_lock()
	  _perf_iotcl()
	    ftrace_profile_set_filter()
 #0	      mutex_lock(&event_mutex)

Fudge it for now by noting that the tracepoint state does not depend
on the event <-> context relation. Ugly though :/

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-25 14:48:30 +01:00
Peter Zijlstra
82d94856fa perf/core: Fix lock inversion between perf,trace,cpuhp
Lockdep gifted us with noticing the following 4-way lockup scenario:

        perf_trace_init()
 #0       mutex_lock(&event_mutex)
          perf_trace_event_init()
            perf_trace_event_reg()
              tp_event->class->reg() := tracepoint_probe_register
 #1             mutex_lock(&tracepoints_mutex)
                  trace_point_add_func()
 #2                 static_key_enable()

 #2     do_cpu_up()
          perf_event_init_cpu()
 #3         mutex_lock(&pmus_lock)
 #4         mutex_lock(&ctx->mutex)

        perf_event_task_disable()
          mutex_lock(&current->perf_event_mutex)
 #4       ctx = perf_event_ctx_lock()
 #5       perf_event_for_each_child()

        do_exit()
          task_work_run()
            __fput()
              perf_release()
                perf_event_release_kernel()
 #4               mutex_lock(&ctx->mutex)
 #5               mutex_lock(&event->child_mutex)
                  free_event()
                    _free_event()
                      event->destroy() := perf_trace_destroy
 #0                     mutex_lock(&event_mutex);

Fix that by moving the free_event() out from under the locks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-25 14:48:29 +01:00
Jiri Olsa
99e818cc88 perf: Return empty callchain instead of NULL
It simplifies the code a bit, because we dump the callchain
Link: http://lkml.kernel.org/n/tip-uqp7qd6aif47g39glnbu95yl@git.kernel.org
even if it's empty. With 'empty' callchain we can remove
all the NULL-checking code paths.

Original-patch-from: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20180107160356.28203-7-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08 12:35:01 -03:00
Jiri Olsa
8cf7e0e224 perf: Make perf_callchain function static
And move it to core.c, because there's no caller of this function other
than the one in core.c

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180107160356.28203-6-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08 12:33:01 -03:00
Jiri Olsa
313ccb9615 perf: Allocate context task_ctx_data for child event
Currently we use perf_event_context::task_ctx_data to save and restore
the LBR status when the task is scheduled out and in.

We don't allocate it for child contexts, which results in shorter task's
LBR stack, because we don't save the history from previous run and start
over every time we schedule the task in.

I made a test to generate samples with LBR call stack and got higher
numbers on bigger chain depths:

                            before:     after:
  LBR call chain: nr: 1       60561     498127
  LBR call chain: nr: 2           0          0
  LBR call chain: nr: 3      107030       2172
  LBR call chain: nr: 4      466685      62758
  LBR call chain: nr: 5     2307319     878046
  LBR call chain: nr: 6       48713     495218
  LBR call chain: nr: 7        1040       4551
  LBR call chain: nr: 8         481        172
  LBR call chain: nr: 9         878        120
  LBR call chain: nr: 10       2377       6698
  LBR call chain: nr: 11      28830     151487
  LBR call chain: nr: 12      29347     339867
  LBR call chain: nr: 13          4         22
  LBR call chain: nr: 14          3         53

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 4af57ef28c ("perf: Add pmu specific data for perf task context")
Link: http://lkml.kernel.org/r/20180107160356.28203-4-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08 12:26:46 -03:00
Ingo Molnar
475c5ee193 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

- Updates to use cond_resched() instead of cond_resched_rcu_qs()
  where feasible (currently everywhere except in kernel/rcu and
  in kernel/torture.c).  Also a couple of fixes to avoid sending
  IPIs to offline CPUs.

- Updates to simplify RCU's dyntick-idle handling.

- Updates to remove almost all uses of smp_read_barrier_depends()
  and read_barrier_depends().

- Miscellaneous fixes.

- Torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-03 14:14:18 +01:00
Suzuki K Poulose
82975c46da perf: Export perf_event_update_userpage
Export perf_event_update_userpage() so that PMU driver using them,
can be built as modules.

Acked-by: Peter Zilstra <peterz@infradead.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-01-02 16:43:12 +00:00
David S. Miller
59436c9ee1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2017-12-18

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Allow arbitrary function calls from one BPF function to another BPF function.
   As of today when writing BPF programs, __always_inline had to be used in
   the BPF C programs for all functions, unnecessarily causing LLVM to inflate
   code size. Handle this more naturally with support for BPF to BPF calls
   such that this __always_inline restriction can be overcome. As a result,
   it allows for better optimized code and finally enables to introduce core
   BPF libraries in the future that can be reused out of different projects.
   x86 and arm64 JIT support was added as well, from Alexei.

2) Add infrastructure for tagging functions as error injectable and allow for
   BPF to return arbitrary error values when BPF is attached via kprobes on
   those. This way of injecting errors generically eases testing and debugging
   without having to recompile or restart the kernel. Tags for opting-in for
   this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef.

3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper
   call for XDP programs. First part of this work adds handling of BPF
   capabilities included in the firmware, and the later patches add support
   to the nfp verifier part and JIT as well as some small optimizations,
   from Jakub.

4) The bpftool now also gets support for basic cgroup BPF operations such
   as attaching, detaching and listing current BPF programs. As a requirement
   for the attach part, bpftool can now also load object files through
   'bpftool prog load'. This reuses libbpf which we have in the kernel tree
   as well. bpftool-cgroup man page is added along with it, from Roman.

5) Back then commit e87c6bc385 ("bpf: permit multiple bpf attachments for
   a single perf event") added support for attaching multiple BPF programs
   to a single perf event. Given they are configured through perf's ioctl()
   interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF
   command in this work in order to return an array of one or multiple BPF
   prog ids that are currently attached, from Yonghong.

6) Various minor fixes and cleanups to the bpftool's Makefile as well
   as a new 'uninstall' and 'doc-uninstall' target for removing bpftool
   itself or prior installed documentation related to it, from Quentin.

7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is
   required for the test_dev_cgroup test case to run, from Naresh.

8) Fix reporting of XDP prog_flags for nfp driver, from Jakub.

9) Fix libbpf's exit code from the Makefile when libelf was not found in
   the system, also from Jakub.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 10:51:06 -05:00
Yonghong Song
f4e2298e63 bpf/tracing: fix kernel/events/core.c compilation error
Commit f371b304f1 ("bpf/tracing: allow user space to
query prog array on the same tp") introduced a perf
ioctl command to query prog array attached to the
same perf tracepoint. The commit introduced a
compilation error under certain config conditions, e.g.,
  (1). CONFIG_BPF_SYSCALL is not defined, or
  (2). CONFIG_TRACING is defined but neither CONFIG_UPROBE_EVENTS
       nor CONFIG_KPROBE_EVENTS is defined.

Error message:
  kernel/events/core.o: In function `perf_ioctl':
  core.c:(.text+0x98c4): undefined reference to `bpf_event_query_prog_array'

This patch fixed this error by guarding the real definition under
CONFIG_BPF_EVENTS and provided static inline dummy function
if CONFIG_BPF_EVENTS was not defined.
It renamed the function from bpf_event_query_prog_array to
perf_event_query_prog_array and moved the definition from linux/bpf.h
to linux/trace_events.h so the definition is in proximity to
other prog_array related functions.

Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-13 22:44:10 +01:00
Josef Bacik
9802d86585 bpf: add a bpf_override_function helper
Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with it's kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_funciton
helper.  This will modify the probe'd callers return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-12 09:02:34 -08:00
Yonghong Song
f371b304f1 bpf/tracing: allow user space to query prog array on the same tp
Commit e87c6bc385 ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).

Commit 2541517c32 ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.

The new uapi ioctl command:
  PERF_EVENT_IOC_QUERY_BPF

The new uapi/linux/perf_event.h structure:
  struct perf_event_query_bpf {
       __u32	ids_len;
       __u32	prog_cnt;
       __u32	ids[0];
  };

User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".

The usage:
  struct perf_event_query_bpf *query =
    malloc(sizeof(*query) + sizeof(u32) * ids_len);
  query.ids_len = ids_len;
  err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
  if (err == 0) {
    /* query.prog_cnt is the number of available progs,
     * number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
     */
  } else if (errno == ENOSPC) {
    /* query.ids_len number of progs copied,
     * query.prog_cnt is the number of available progs
     */
  } else {
      /* other errors */
  }

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-12 08:46:40 -08:00
Linus Torvalds
e9ef1fe312 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb
    drivers).

 2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
    to userspace and broke some apps. From Johannes Berg.

 3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong
    Wang.

 4) Gianfar MAC can't do EEE so don't advertise it by default, from
    Claudiu Manoil.

 5) Relax strict netlink attribute validation, but emit a warning. From
    David Ahern.

 6) Fix regression in checksum offload of thunderx driver, from Florian
    Westphal.

 7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.

 8) New card support in iwlwifi, from Ihab Zhaika.

 9) BBR congestion control bug fixes from Neal Cardwell.

10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.

11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.

12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.

13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
  net: mvpp2: fix the RSS table entry offset
  tcp: evaluate packet losses upon RTT change
  tcp: fix off-by-one bug in RACK
  tcp: always evaluate losses in RACK upon undo
  tcp: correctly test congestion state in RACK
  bnxt_en: Fix sources of spurious netpoll warnings
  tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
  tcp_bbr: reset full pipe detection on loss recovery undo
  tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
  sfc: pass valid pointers from efx_enqueue_unwind
  gianfar: Disable EEE autoneg by default
  tcp: invalidate rate samples during SACK reneging
  can: peak/pcie_fd: fix potential bug in restarting tx queue
  can: usb_8dev: cancel urb on -EPIPE and -EPROTO
  can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
  can: esd_usb2: cancel urb on -EPIPE and -EPROTO
  can: ems_usb: cancel urb on -EPIPE and -EPROTO
  can: mcba_usb: cancel urb on -EPROTO
  usbnet: fix alignment for frames with no ethernet header
  tcp: use current time in tcp_rcv_space_adjust()
  ...
2017-12-08 13:32:44 -08:00
Ingo Molnar
d6eabce257 Merge branch 'linus' into perf/urgent, to synchronize UAPI headers
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-06 22:39:39 +01:00
Hendrik Brueckner
c895f6f703 bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type
Commit 0515e5999a ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT
program type") introduced the bpf_perf_event_data structure which
exports the pt_regs structure.  This is OK for multiple architectures
but fail for s390 and arm64 which do not export pt_regs.  Programs
using them, for example, the bpf selftest fail to compile on these
architectures.

For s390, exporting the pt_regs is not an option because s390 wants
to allow changes to it.  For arm64, there is a user_pt_regs structure
that covers parts of the pt_regs structure for use by user space.

To solve the broken uapi for s390 and arm64, introduce an abstract
type for pt_regs and add an asm/bpf_perf_event.h file that concretes
the type.  An asm-generic header file covers the architectures that
export pt_regs today.

The arch-specific enablement for s390 and arm64 follows in separate
commits.

Reported-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Fixes: 0515e5999a ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type")
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-and-tested-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-05 15:02:40 +01:00
Paul E. McKenney
5c6338b487 uprobes: Remove now-redundant smp_read_barrier_depends()
Now that READ_ONCE() implies smp_read_barrier_depends(), the
get_xol_area() and get_trampoline_vaddr() no longer need their
smp_read_barrier_depends() calls, which this commit removes.
While we are here, convert the corresponding smp_wmb() to an
smp_store_release().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-12-04 10:52:55 -08:00
Ingo Molnar
6e948c67c4 Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf tooling fixes from Arnaldo Carvalho de Melo:

"- Fix window dimensions change handling in 'perf top' (Jiri Olsa)

- Fix 'perf record -c/-F' options for CPU event aliases (Andi Kleen)

- Generate PERF_RECORD_{MMAP,COMM,EXEC} with 'perf record --delay'
  fixing symbol resolution for processes created, maps put in place
  while --delay happens (Arnaldo Carvalho de Melo)

- Fix up leftover perf_evsel_stat usage via evsel->priv, plugging
  a SEGV when using event groups as in:

     $ perf stat -e '{cpu-clock,instructions}' workload

- Fix 'perf script --per-event-dump' for auxtrace synth evsels (Arnaldo Carvalho de Melo)

- Ignore kptr_restrict when not sampling the kernel (Arnaldo Carvalho de Melo)

- Synchronize kernel ABI headers wrt SPDX tags and ABI changes,
  taking minimal action to handle new syscall args and silencing
  perf build warnings (Arnaldo Carvalho de Melo, Ingo Molnar)

- Fix header.size for namespace events (Jiri Olsa)

- Fix a bug during strstart() conversion in 'perf help' (Namhyung Kim)

- Do not truncate instruction names at 6 chars in 'perf annotate', there
  are really long instruction names in PPC (Ravi Bangoria)

- Fixup discontiguous/sparse numa nodes in 'perf bench numa' (Satheesh Rajendran)

- Fix an exit code of trace__symbols_init in 'perf trace' (Andrei Vagin)

- Fix 'perf test' entries on s/390 (Thomas Richter)

- Bring instruction decoder files used by Intel PT into line with the kernel,
  silencing build warning (Adrian Hunter)"

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-29 07:15:09 +01:00
Ingo Molnar
4fc31ba13d Merge branch 'linus' into perf/urgent, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-29 07:11:24 +01:00
Jiri Olsa
34900ec5c9 perf: Fix header.size for namespace events
Reset header size for namespace events, otherwise it only gets bigger in
ctx iterations.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Fixes: e422267322 ("perf: Add PERF_RECORD_NAMESPACES to include namespaces related info")
Link: http://lkml.kernel.org/n/tip-nlo4gonz9d4guyb8153ukzt0@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-11-28 14:27:05 -03:00
Al Viro
9dd957485d ipc, kernel, mm: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:05 -05:00
Linus Torvalds
580e3d552d Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes: two PMU driver fixes and a memory leak fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix memory leak triggered by perf --namespace
  perf/x86/intel/uncore: Add event constraint for BDX PCU
  perf/x86/intel: Hide TSX events when RTM is not supported
2017-11-26 13:41:48 -08:00
Linus Torvalds
2dcd9c71c1 Tracing updates for 4.15:
- Now allow module init functions to be traced
 
  - Clean up some unused or not used by config events (saves space)
 
  - Clean up of trace histogram code
 
  - Add support for preempt and interrupt enabled/disable events
 
  - Other various clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAloPGgkUHHJvc3RlZHRA
 Z29vZG1pcy5vcmcACgkQ1a05Y9njSUmfaAwAjge5FWBCBQeby8tVuw4RGAorRgl5
 IFuijFSygcKRMhQFP6B+haHsezeCbNaBBtIncXhoJGDC5XuhUhr9foYf1SChEmYp
 tCOK2o71FgZ8yG539IYCVjG9cJZxPLM0OI7RQ8hcMETAr+eiXPXxHrmrm9kdBtYM
 ZAQERvqI5yu2HWIb87KBc38H0rgYrOJKZt9Rx20as/aqAME7hFvYErFlcnxdmHo+
 LmovJOQBCTicNJ4TXJc418JaUWi9cm/A3uhW3o5aLMoRAxCc/8FD+dq2rg4qlHDH
 tOtK6pwIPHfqRZ3nMLXXWhaa+w+swsxBOnegkvgP2xCyibKjFgh9kzcpaj41w3x1
 0FCfvS7flx9ob//fAB8kxLvJyY5p3Qp3xdvj0+gp2qa3Ga5lSqcMzS419TLY1Yfa
 Jpi2oAagDqP94m0EjAGTkhZMOrsFIDr49g3h7nqz3T3Z54luyXniDoYoO11d+dUF
 vCUiIJz/PsQIE3NVViZiaRtcLVXneLHISmnz
 =h3F2
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from

 - allow module init functions to be traced

 - clean up some unused or not used by config events (saves space)

 - clean up of trace histogram code

 - add support for preempt and interrupt enabled/disable events

 - other various clean ups

* tag 'trace-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (30 commits)
  tracing, thermal: Hide cpu cooling trace events when not in use
  tracing, thermal: Hide devfreq trace events when not in use
  ftrace: Kill FTRACE_OPS_FL_PER_CPU
  perf/ftrace: Small cleanup
  perf/ftrace: Fix function trace events
  perf/ftrace: Revert ("perf/ftrace: Fix double traces of perf on ftrace:function")
  tracing, dma-buf: Remove unused trace event dma_fence_annotate_wait_on
  tracing, memcg, vmscan: Hide trace events when not in use
  tracing/xen: Hide events that are not used when X86_PAE is not defined
  tracing: mark trace_test_buffer as __maybe_unused
  printk: Remove superfluous memory barriers from printk_safe
  ftrace: Clear hashes of stale ips of init memory
  tracing: Add support for preempt and irq enable/disable events
  tracing: Prepare to add preempt and irq trace events
  ftrace/kallsyms: Have /proc/kallsyms show saved mod init functions
  ftrace: Add freeing algorithm to free ftrace_mod_maps
  ftrace: Save module init functions kallsyms symbols for tracing
  ftrace: Allow module init functions to be traced
  ftrace: Add a ftrace_free_mem() function for modules to use
  tracing: Reimplement log2
  ...
2017-11-17 14:58:01 -08:00
Linus Torvalds
5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Linus Torvalds
c9b012e5f4 arm64 updates for 4.15
Plenty of acronym soup here:
 
 - Initial support for the Scalable Vector Extension (SVE)
 - Improved handling for SError interrupts (required to handle RAS events)
 - Enable GCC support for 128-bit integer types
 - Remove kernel text addresses from backtraces and register dumps
 - Use of WFE to implement long delay()s
 - ACPI IORT updates from Lorenzo Pieralisi
 - Perf PMU driver for the Statistical Profiling Extension (SPE)
 - Perf PMU driver for Hisilicon's system PMUs
 - Misc cleanups and non-critical fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJaCcLqAAoJELescNyEwWM0JREH/2FbmD/khGzEtP8LW+o9D8iV
 TBM02uWQxS1bbO1pV2vb+512YQO+iWfeQwJH9Jv2FZcrMvFv7uGRnYgAnJuXNGrl
 W+LL6OhN22A24LSawC437RU3Xe7GqrtONIY/yLeJBPablfcDGzPK1eHRA0pUzcyX
 VlyDruSHWX44VGBPV6JRd3x0vxpV8syeKOjbRvopRfn3Nwkbd76V3YSfEgwoTG5W
 ET1sOnXLmHHdeifn/l1Am5FX1FYstpcd7usUTJ4Oto8y7e09tw3bGJCD0aMJ3vow
 v1pCUWohEw7fHqoPc9rTrc1QEnkdML4vjJvMPUzwyTfPrN+7uEuMIEeJierW+qE=
 =0qrg
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "The big highlight is support for the Scalable Vector Extension (SVE)
  which required extensive ABI work to ensure we don't break existing
  applications by blowing away their signal stack with the rather large
  new vector context (<= 2 kbit per vector register). There's further
  work to be done optimising things like exception return, but the ABI
  is solid now.

  Much of the line count comes from some new PMU drivers we have, but
  they're pretty self-contained and I suspect we'll have more of them in
  future.

  Plenty of acronym soup here:

   - initial support for the Scalable Vector Extension (SVE)

   - improved handling for SError interrupts (required to handle RAS
     events)

   - enable GCC support for 128-bit integer types

   - remove kernel text addresses from backtraces and register dumps

   - use of WFE to implement long delay()s

   - ACPI IORT updates from Lorenzo Pieralisi

   - perf PMU driver for the Statistical Profiling Extension (SPE)

   - perf PMU driver for Hisilicon's system PMUs

   - misc cleanups and non-critical fixes"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (97 commits)
  arm64: Make ARMV8_DEPRECATED depend on SYSCTL
  arm64: Implement __lshrti3 library function
  arm64: support __int128 on gcc 5+
  arm64/sve: Add documentation
  arm64/sve: Detect SVE and activate runtime support
  arm64/sve: KVM: Hide SVE from CPU features exposed to guests
  arm64/sve: KVM: Treat guest SVE use as undefined instruction execution
  arm64/sve: KVM: Prevent guests from using SVE
  arm64/sve: Add sysctl to set the default vector length for new processes
  arm64/sve: Add prctl controls for userspace vector length management
  arm64/sve: ptrace and ELF coredump support
  arm64/sve: Preserve SVE registers around EFI runtime service calls
  arm64/sve: Preserve SVE registers around kernel-mode NEON use
  arm64/sve: Probe SVE capabilities and usable vector lengths
  arm64: cpufeature: Move sys_caps_initialised declarations
  arm64/sve: Backend logic for setting the vector length
  arm64/sve: Signal handling support
  arm64/sve: Support vector length resetting for new processes
  arm64/sve: Core task context handling
  arm64/sve: Low-level CPU setup
  ...
2017-11-15 10:56:56 -08:00
Vasily Averin
4a31b424ac perf/core: Fix memory leak triggered by perf --namespace
perf with --namespace key leaks various memory objects including namespaces

  4.14.0+
  pid_namespace          1     12   2568   12    8
  user_namespace         1     39    824   39    8
  net_namespace          1      5   6272    5    8

This happen because perf_fill_ns_link_info() struct patch ns_path:
during initialization ns_path incremented counters on related mnt and dentry,
but without lost path_put nobody decremented them back.
Leaked dentry is name of related namespace,
and its leak does not allow to free unused namespace.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: commit e422267322 ("perf: Add PERF_RECORD_NAMESPACES to include namespaces related info")
Link: http://lkml.kernel.org/r/c510711b-3904-e5e1-d296-61273d21118d@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-15 09:48:10 +01:00
Vasily Averin
0e18dd1206 perf/core: Fix memory leak triggered by perf --namespace
perf with --namespace key leaks various memory objects including namespaces

  4.14.0+
  pid_namespace          1     12   2568   12    8
  user_namespace         1     39    824   39    8
  net_namespace          1      5   6272    5    8

This happen because perf_fill_ns_link_info() struct patch ns_path:
during initialization ns_path incremented counters on related mnt and dentry,
but without lost path_put nobody decremented them back.
Leaked dentry is name of related namespace,
and its leak does not allow to free unused namespace.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: commit e422267322 ("perf: Add PERF_RECORD_NAMESPACES to include namespaces related info")
Link: http://lkml.kernel.org/r/c510711b-3904-e5e1-d296-61273d21118d@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-15 09:47:27 +01:00
Linus Torvalds
31486372a1 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main changes in this cycle were:

  Kernel:

   - kprobes updates: use better W^X patterns for code modifications,
     improve optprobes, remove jprobes. (Masami Hiramatsu, Kees Cook)

   - core fixes: event timekeeping (enabled/running times statistics)
     fixes, perf_event_read() locking fixes and cleanups, etc. (Peter
     Zijlstra)

   - Extend x86 Intel free-running PEBS support and support x86
     user-register sampling in perf record and perf script. (Andi Kleen)

  Tooling:

   - Completely rework the way inline frames are handled. Instead of
     querying for the inline nodes on-demand in the individual tools, we
     now create proper callchain nodes for inlined frames. (Milian
     Wolff)

   - 'perf trace' updates (Arnaldo Carvalho de Melo)

   - Implement a way to print formatted output to per-event files in
     'perf script' to facilitate generate flamegraphs, elliminating the
     need to write scripts to do that separation (yuzhoujian, Arnaldo
     Carvalho de Melo)

   - Update vendor events JSON metrics for Intel's Broadwell, Broadwell
     Server, Haswell, Haswell Server, IvyBridge, IvyTown, JakeTown,
     Sandy Bridge, Skylake, SkyLake Server - and Goldmont Plus V1 (Andi
     Kleen, Kan Liang)

   - Multithread the synthesizing of PERF_RECORD_ events for
     pre-existing threads in 'perf top', speeding up that phase, greatly
     improving the user experience in systems such as Intel's Knights
     Mill (Kan Liang)

   - Introduce the concept of weak groups in 'perf stat': try to set up
     a group, but if it's not schedulable fallback to not using a group.
     That gives us the best of both worlds: groups if they work, but
     still a usable fallback if they don't. E.g: (Andi Kleen)

   - perf sched timehist enhancements (David Ahern)

   - ... various other enhancements, updates, cleanups and fixes"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (139 commits)
  kprobes: Don't spam the build log with deprecation warnings
  arm/kprobes: Remove jprobe test case
  arm/kprobes: Fix kretprobe test to check correct counter
  perf srcline: Show correct function name for srcline of callchains
  perf srcline: Fix memory leak in addr2inlines()
  perf trace beauty kcmp: Beautify arguments
  perf trace beauty: Implement pid_fd beautifier
  tools include uapi: Grab a copy of linux/kcmp.h
  perf callchain: Fix double mapping al->addr for children without self period
  perf stat: Make --per-thread update shadow stats to show metrics
  perf stat: Move the shadow stats scale computation in perf_stat__update_shadow_stats
  perf tools: Add perf_data_file__write function
  perf tools: Add struct perf_data_file
  perf tools: Rename struct perf_data_file to perf_data
  perf script: Print information about per-event-dump files
  perf trace beauty prctl: Generate 'option' string table from kernel headers
  tools include uapi: Grab a copy of linux/prctl.h
  perf script: Allow creating per-event dump files
  perf evsel: Restore evsel->priv as a tool private area
  perf script: Use event_format__fprintf()
  ...
2017-11-13 13:05:08 -08:00
David S. Miller
f3edacbd69 bpf: Revert bpf_overrid_function() helper changes.
NACK'd by x86 maintainer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 18:24:55 +09:00
Josef Bacik
dd0bb688ea bpf: add a bpf_override_function helper
Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with it's kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_funciton
helper.  This will modify the probe'd callers return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 12:18:05 +09:00
David S. Miller
4dc6758d78 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Simple cases of overlapping changes in the packet scheduler.

Must easier to resolve this time.

Which probably means that I screwed it up somehow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 10:00:18 +09:00
Frederic Weisbecker
164446455a perf/core: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-9-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:51 +01:00
Ingo Molnar
8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Ingo Molnar
15bcdc9477 Merge branch 'linus' into perf/core, to fix conflicts
Conflicts:
	tools/perf/arch/arm/annotate/instructions.c
	tools/perf/arch/arm64/annotate/instructions.c
	tools/perf/arch/powerpc/annotate/instructions.c
	tools/perf/arch/s390/annotate/instructions.c
	tools/perf/arch/x86/tests/intel-cqm.c
	tools/perf/ui/tui/progress.c
	tools/perf/util/zlib.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:30:18 +01:00
David S. Miller
2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Ingo Molnar
294cbd05e3 Merge branch 'linus' into perf/urgent, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-03 12:30:12 +01:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Tejun Heo
be96b316de perf/cgroup: Fix perf cgroup hierarchy support
The following commit:

  864c2357ca ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")

made list_update_cgroup_event() skip setting cpuctx->cgrp if no cgroup event
targets %current's cgroup.

This breaks perf_event's hierarchical support because events which target one
of the ancestors get ignored.

Fix it by using cgroup_is_descendant() test instead of equality.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-team@fb.com
Cc: stable@vger.kernel.org # v4.9+
Fixes: 864c2357ca ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
Link: http://lkml.kernel.org/r/20171028164237.GA972780@devbig577.frc2.facebook.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-30 11:58:51 +01:00
Peter Zijlstra
0d3d73aac2 perf/core: Rewrite event timekeeping
The current even timekeeping, which computes enabled and running
times, uses 3 distinct timestamps to reflect the various event states:
OFF (stopped), INACTIVE (enabled) and ACTIVE (running).

Furthermore, the update rules are such that even INACTIVE events need
their timestamps updated. This is undesirable because we'd like to not
touch INACTIVE events if at all possible, this makes event scheduling
(much) more expensive than needed.

Rewrite the timekeeping to directly use event->state, this greatly
simplifies the code and results in only having to update things when
we change state, or an up-to-date value is requested (read).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:59 +02:00
Peter Zijlstra
0c1cbc18df perf/core: Fix perf_event_read()
perf_event_read() has a number of issues regarding the timekeeping bits.

 - The IPI didn't update group times when it found INACTIVE

 - The direct call would not re-check ->state after taking ctx->lock
   which can result in ->count and timestamps getting out of sync.

And we can make use of the ordering introduced for perf_event_stop()
to make it more accurate for ACTIVE.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:59 +02:00
Peter Zijlstra
7f0ec32526 perf/core: Remove wrong barrier
The barrier and comment make no sense:

 - if what the barrier says is true, it should be wmb() but that
   should then be part of the arch driver, not the generic code.

 - if it is an SMP barrier, there must be a matching barrier, and
   there isn't one.

So kill it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:58 +02:00
Peter Zijlstra
8ca2bd41c7 perf/core: Rename 'enum perf_event_active_state'
Its a weird name, active is one of the states, it should not be part
of the name, also, its too long.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:58 +02:00
Peter Zijlstra
3c5c8711dc perf/core: Make sure to update ctx time before using it
We should make sure to update ctx time before we use it to update
event times.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:58 +02:00
Peter Zijlstra
a9cd8194e1 perf/core: Fix __perf_read_group_add() locking
Event timestamps are serialized using ctx->lock, make sure to hold it
over reading all values.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:57 +02:00
Peter Zijlstra
0ee098c97a perf/core: Update ctx time before detaching events
We should make sure the ctx time is updated before we detach events;
which will want to update event times.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:57 +02:00
Peter Zijlstra
ca0dd44cf3 perf/core: Fix perf_event_read_value() locking
perf_event_read_value() is an external accessor, just like
perf_event_{en,dis}able() and should thus use perf_event_ctx_lock().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f63a8daa58 ("perf: Fix event->ctx locking")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:57 +02:00
Yonghong Song
7d9285e82d perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"
eBPF programs would like access to the (perf) event enabled and
running times along with the event value, such that they can deal with
event multiplexing (among other things).

This patch extends the interface; a future eBPF patch will utilize
the new functionality.

[ Note, there's a same-content commit with a poor changelog and a meaningless
  title in the networking tree as well - but we need this change for subsequent
  perf work, so apply it here as well, with a proper changelog. Hopefully Git
  will be able to sort out this somewhat messy workflow, if there are no other,
  conflicting changes to these files. ]

Signed-off-by: Yonghong Song <yhs@fb.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <ast@fb.com>
Cc: <daniel@iogearbox.net>
Cc: <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David S. Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20171005161923.332790-2-yhs@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-27 10:31:56 +02:00
Mark Rutland
6aa7de0591 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:01:08 +02:00
Yonghong Song
e87c6bc385 bpf: permit multiple bpf attachments for a single perf event
This patch enables multiple bpf attachments for a
kprobe/uprobe/tracepoint single trace event.
Each trace_event keeps a list of attached perf events.
When an event happens, all attached bpf programs will
be executed based on the order of attachment.

A global bpf_event_mutex lock is introduced to protect
prog_array attaching and detaching. An alternative will
be introduce a mutex lock in every trace_event_call
structure, but it takes a lot of extra memory.
So a global bpf_event_mutex lock is a good compromise.

The bpf prog detachment involves allocation of memory.
If the allocation fails, a dummy do-nothing program
will replace to-be-detached program in-place.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-25 10:47:47 +09:00
Yonghong Song
0b4c6841fe bpf: use the same condition in perf event set/free bpf handler
This is a cleanup such that doing the same check in
perf_event_free_bpf_prog as we already do in
perf_event_set_bpf_prog step.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-25 10:47:46 +09:00
Will Deacon
506458efaf locking/barriers: Convert users of lockless_dereference() to READ_ONCE()
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 13:17:33 +02:00
David S. Miller
f8ddadc4db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
There were quite a few overlapping sets of changes here.

Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.

Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly.  If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.

In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().

Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.

The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 13:39:14 +01:00
Ingo Molnar
ca4b9c3b74 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-20 11:02:05 +02:00
Will Deacon
bc1d202023 perf/core: Export AUX buffer helpers to modules
Perf PMU drivers using AUX buffers cannot be built as modules unless
the AUX helpers are exported.

This patch exports perf_aux_output_{begin,end,skip} and perf_get_aux to
modules.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-10-18 12:53:30 +01:00
Peter Zijlstra
8fd0fbbe88 perf/ftrace: Revert ("perf/ftrace: Fix double traces of perf on ftrace:function")
Revert commit:

  75e8387685 ("perf/ftrace: Fix double traces of perf on ftrace:function")

The reason I instantly stumbled on that patch is that it only addresses the
ftrace situation and doesn't mention the other _5_ places that use this
interface. It doesn't explain why those don't have the problem and if not, why
their solution doesn't work for ftrace.

It doesn't, but this is just putting more duct tape on.

Link: http://lkml.kernel.org/r/20171011080224.200565770@infradead.org

Cc: Zhou Chengming <zhouchengming1@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-10-16 18:11:02 -04:00
leilei.lin
e6a5203399 perf/core: Fix cgroup time when scheduling descendants
Update cgroup time when an event is scheduled in by descendants.

Reviewed-and-tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: brendan.d.gregg@gmail.com
Cc: yang_oliver@hotmail.com
Link: http://lkml.kernel.org/r/CALPjY3mkHiekRkRECzMi9G-bjUQOvOjVBAqxmWkTzc-g+0LwMg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 10:06:55 +02:00
Will Deacon
df0062b27e perf/core: Avoid freeing static PMU contexts when PMU is unregistered
Since commit:

  1fd7e41699 ("perf/core: Remove perf_cpu_context::unique_pmu")

... when a PMU is unregistered then its associated ->pmu_cpu_context is
unconditionally freed. Whilst this is fine for dynamically allocated
context types (i.e. those registered using perf_invalid_context), this
causes a problem for sharing of static contexts such as
perf_{sw,hw}_context, which are used by multiple built-in PMUs and
effectively have a global lifetime.

Whilst testing the ARM SPE driver, which must use perf_sw_context to
support per-task AUX tracing, unregistering the driver as a result of a
module unload resulted in:

 Unable to handle kernel NULL pointer dereference at virtual address 00000038
 Internal error: Oops: 96000004 [#1] PREEMPT SMP
 Modules linked in: [last unloaded: arm_spe_pmu]
 PC is at ctx_resched+0x38/0xe8
 LR is at perf_event_exec+0x20c/0x278
 [...]
 ctx_resched+0x38/0xe8
 perf_event_exec+0x20c/0x278
 setup_new_exec+0x88/0x118
 load_elf_binary+0x26c/0x109c
 search_binary_handler+0x90/0x298
 do_execveat_common.isra.14+0x540/0x618
 SyS_execve+0x38/0x48

since the software context has been freed and the ctx.pmu->pmu_disable_count
field has been set to NULL.

This patch fixes the problem by avoiding the freeing of static PMU contexts
altogether. Whilst the sharing of dynamic contexts is questionable, this
actually requires the caller to share their context pointer explicitly
and so the burden is on them to manage the object lifetime.

Reported-by: Kim Phillips <kim.phillips@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1fd7e41699 ("perf/core: Remove perf_cpu_context::unique_pmu")
Link: http://lkml.kernel.org/r/1507040450-7730-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 10:06:54 +02:00
Yonghong Song
97562633bc bpf: perf event change needed for subsequent bpf helpers
This patch does not impact existing functionalities.
It contains the changes in perf event area needed for
subsequent bpf_perf_event_read_value and
bpf_perf_prog_read_value helpers.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 23:05:57 +01:00
Alexander Shishkin
5bce9db189 perf/core: Explain perf_sched_mutex
To clarify why atomic_inc_return(&perf_sched_events) is not sufficient and
a mutex is needed to order static branch enabling vs the atomic counter
increment, this adds a comment with a short explanation.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170829140103.6563-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 13:28:30 +02:00
Alexander Shishkin
441430eb54 perf/aux: Only update ->aux_wakeup in non-overwrite mode
The following commit:

  d9a50b0256 ("perf/aux: Ensure aux_wakeup represents most recent wakeup index")

changed the AUX wakeup position calculation to rounddown(), which causes
a division-by-zero in AUX overwrite mode (aka "snapshot mode").

The zero denominator results from the fact that perf record doesn't set
aux_watermark to anything, in which case the kernel will set it to half
the AUX buffer size, but only for non-overwrite mode. In the overwrite
mode aux_watermark stays zero.

The good news is that, AUX overwrite mode, wakeups don't happen and
related bookkeeping is not relevant, so we can simply forego the whole
wakeup updates.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/20170906160811.16510-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-29 10:06:45 +02:00
Yonghong Song
ec9dd352d5 bpf: one perf event close won't free bpf program attached by another perf event
This patch fixes a bug exhibited by the following scenario:
  1. fd1 = perf_event_open with attr.config = ID1
  2. attach bpf program prog1 to fd1
  3. fd2 = perf_event_open with attr.config = ID1
     <this will be successful>
  4. user program closes fd2 and prog1 is detached from the tracepoint.
  5. user program with fd1 does not work properly as tracepoint
     no output any more.

The issue happens at step 4. Multiple perf_event_open can be called
successfully, but only one bpf prog pointer in the tp_event. In the
current logic, any fd release for the same tp_event will free
the tp_event->prog.

The fix is to free tp_event->prog only when the closing fd
corresponds to the one which registered the program.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-20 14:10:29 -07:00
Linus Torvalds
608c1d3c17 Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Several notable changes this cycle:

   - Thread mode was merged. This will be used for cgroup2 support for
     CPU and possibly other controllers. Unfortunately, CPU controller
     cgroup2 support didn't make this pull request but most contentions
     have been resolved and the support is likely to be merged before
     the next merge window.

   - cgroup.stat now shows the number of descendant cgroups.

   - cpuset now can enable the easier-to-configure v2 behavior on v1
     hierarchy"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cpuset: Allow v2 behavior in v1 cgroup
  cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
  cgroup: remove unneeded checks
  cgroup: misc changes
  cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
  cgroup: re-use the parent pointer in cgroup_destroy_locked()
  cgroup: add cgroup.stat interface with basic hierarchy stats
  cgroup: implement hierarchy limits
  cgroup: keep track of number of descent cgroups
  cgroup: add comment to cgroup_enable_threaded()
  cgroup: remove unnecessary empty check when enabling threaded mode
  cgroup: update debug controller to print out thread mode information
  cgroup: implement cgroup v2 thread support
  cgroup: implement CSS_TASK_ITER_THREADED
  cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
  cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
  cgroup: reorganize cgroup.procs / task write path
  cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
  cgroup: distinguish local and children populated states
  cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
  ...
2017-09-06 22:25:25 -07:00
Linus Torvalds
aae3dbb477 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

 3) Allow generic XDP to work on virtual devices, from John Fastabend.

 4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

 5) Remove UFO offloads from the tree, gave us little other than bugs.

 6) Remove the IPSEC flow cache, from Florian Westphal.

 7) Support ipv6 route offload in mlxsw driver.

 8) Support VF representors in bnxt_en, from Sathya Perla.

 9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

13) Add new jump instructions to eBPF, from Daniel Borkmann.

14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

15) Support XDP in tap driver, from Jason Wang.

16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

17) Add Huawei hinic ethernet driver.

18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
  i40e: avoid NVM acquire deadlock during NVM update
  drivers: net: xgene: Remove return statement from void function
  drivers: net: xgene: Configure tx/rx delay for ACPI
  drivers: net: xgene: Read tx/rx delay for ACPI
  rocker: fix kcalloc parameter order
  rds: Fix non-atomic operation on shared flag variable
  net: sched: don't use GFP_KERNEL under spin lock
  vhost_net: correctly check tx avail during rx busy polling
  net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
  rxrpc: Make service connection lookup always check for retry
  net: stmmac: Delete dead code for MDIO registration
  gianfar: Fix Tx flow control deactivation
  cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
  cxgb4: Fix pause frame count in t4_get_port_stats
  cxgb4: fix memory leak
  tun: rename generic_xdp to skb_xdp
  tun: reserve extra headroom only when XDP is set
  net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
  net: dsa: bcm_sf2: Advertise number of egress queues
  ...
2017-09-06 14:45:08 -07:00
Linus Torvalds
f57091767a Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache quality monitoring update from Thomas Gleixner:
 "This update provides a complete rewrite of the Cache Quality
  Monitoring (CQM) facility.

  The existing CQM support was duct taped into perf with a lot of issues
  and the attempts to fix those turned out to be incomplete and
  horrible.

  After lengthy discussions it was decided to integrate the CQM support
  into the Resource Director Technology (RDT) facility, which is the
  obvious choise as in hardware CQM is part of RDT. This allowed to add
  Memory Bandwidth Monitoring support on top.

  As a result the mechanisms for allocating cache/memory bandwidth and
  the corresponding monitoring mechanisms are integrated into a single
  management facility with a consistent user interface"

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  x86/intel_rdt: Turn off most RDT features on Skylake
  x86/intel_rdt: Add command line options for resource director technology
  x86/intel_rdt: Move special case code for Haswell to a quirk function
  x86/intel_rdt: Remove redundant ternary operator on return
  x86/intel_rdt/cqm: Improve limbo list processing
  x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
  x86/intel_rdt: Modify the intel_pqr_state for better performance
  x86/intel_rdt/cqm: Clear the default RMID during hotcpu
  x86/intel_rdt: Show bitmask of shareable resource with other executing units
  x86/intel_rdt/mbm: Handle counter overflow
  x86/intel_rdt/mbm: Add mbm counter initialization
  x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
  x86/intel_rdt/cqm: Add CPU hotplug support
  x86/intel_rdt/cqm: Add sched_in support
  x86/intel_rdt: Introduce rdt_enable_key for scheduling
  x86/intel_rdt/cqm: Add mount,umount support
  x86/intel_rdt/cqm: Add rmdir support
  x86/intel_rdt: Separate the ctrl bits from rmdir
  x86/intel_rdt/cqm: Add mon_data
  x86/intel_rdt: Prepare for RDT monitor data support
  ...
2017-09-04 13:56:37 -07:00
Linus Torvalds
9657752cb5 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Add branch type profiling/tracing support. (Jin Yao)

   - Add the PERF_SAMPLE_PHYS_ADDR ABI to allow the tracing/profiling of
     physical memory addresses, where the PMU supports it. (Kan Liang)

   - Export some PMU capability details in the new
     /sys/bus/event_source/devices/cpu/caps/ sysfs directory. (Andi
     Kleen)

   - Aux data fixes and updates (Will Deacon)

   - kprobes fixes and updates (Masami Hiramatsu)

   - AMD uncore PMU driver fixes and updates (Janakarajan Natarajan)

  On the tooling side, here's a (limited!) list of highlights - there
  were many other changes that I could not list, see the shortlog and
  git history for details:

  UI improvements:

   - Implement a visual marker for fused x86 instructions in the
     annotate TUI browser, available now in 'perf report', more work
     needed to have it available as well in 'perf top' (Jin Yao)

     Further explanation from one of Jin's patches:

             │   ┌──cmpl   $0x0,argp_program_version_hook
       81.93 │   ├──je     20
             │   │  lock   cmpxchg %esi,0x38a9a4(%rip)
             │   │↓ jne    29
             │   │↓ jmp    43
       11.47 │20:└─→cmpxch %esi,0x38a999(%rip)

     That means the cmpl+je is a fused instruction pair and they should
     be considered together.

   - Record the branch type and then show statistics and info about in
     callchain entries (Jin Yao)

     Example from one of Jin's patches:

        # perf record -g -j any,save_type
        # perf report --branch-history --stdio --no-children

        38.50%  div.c:45                [.] main                    div
                |
                ---main div.c:42 (RET CROSS_2M cycles:2)
                   compute_flag div.c:28 (cycles:2)
                   compute_flag div.c:27 (RET CROSS_2M cycles:1)
                   rand rand.c:28 (cycles:1)
                   rand rand.c:28 (RET CROSS_2M cycles:1)
                   __random random.c:298 (cycles:1)
                   __random random.c:297 (COND_BWD CROSS_2M cycles:1)
                   __random random.c:295 (cycles:1)
                   __random random.c:295 (COND_BWD CROSS_2M cycles:1)
                   __random random.c:295 (cycles:1)
                   __random random.c:295 (RET CROSS_2M cycles:9)

  namespaces support:

   - Add initial support for namespaces, using setns to access files in
     namespaces, grabbing their build-ids, etc. (Krister Johansen)

  perf trace enhancements:

   - Beautify pkey_{alloc,free,mprotect} arguments in 'perf trace'
     (Arnaldo Carvalho de Melo)

   - Add initial 'clone' syscall args beautifier in 'perf trace'
     (Arnaldo Carvalho de Melo)

   - Ignore 'fd' and 'offset' args for MAP_ANONYMOUS in 'perf trace'
     (Arnaldo Carvalho de Melo)

   - Beautifiers for the 'cmd' arg of several ioctl types, including:
     sound, DRM, KVM, vhost virtio and perf_events. (Arnaldo Carvalho de
     Melo)

   - Add PERF_SAMPLE_CALLCHAIN and PERF_RECORD_MMAP[2] to 'perf data'
     CTF conversion, allowing CTF trace visualization tools to show
     callchains and to resolve symbols (Geneviève Bastien)

   - Beautify the fcntl syscall, which is an interesting one in the
     sense that infrastructure had to be put in place to change the
     formatters of some arguments according to the value in a previous
     one, i.e. cmd dictates how arg and the syscall return will be
     formatted. (Arnaldo Carvalho de Melo

  perf stat enhancements:

   - Use group read for event groups in 'perf stat', reducing overhead
     when groups are defined in the event specification, i.e. when using
     {} to enclose a list of events, asking them to be read at the same
     time, e.g.: "perf stat -e '{cycles,instructions}'" (Jiri Olsa)

  pipe mode improvements:

   - Process tracing data in 'perf annotate' pipe mode (David
     Carrillo-Cisneros)

   - Add header record types to pipe-mode, now this command:

        $ perf record -o - -e cycles sleep 1 | perf report --stdio --header

     Will show the same as in non-pipe mode, i.e. involving a perf.data
     file (David Carrillo-Cisneros)

  Vendor specific hardware event support updates/enhancements:

   - Update POWER9 vendor events tables (Sukadev Bhattiprolu)

   - Add POWER9 PMU events Sukadev (Bhattiprolu)

   - Support additional POWER8+ PVR in PMU mapfile (Shriya)

   - Add Skylake server uncore JSON vendor events (Andi Kleen)

   - Support exporting Intel PT data to sqlite3 with python perf
     scripts, this is in addition to the postgresql support that was
     already there (Adrian Hunter)"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (253 commits)
  perf symbols: Fix plt entry calculation for ARM and AARCH64
  perf probe: Fix kprobe blacklist checking condition
  perf/x86: Fix caps/ for !Intel
  perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR
  perf/core, pt, bts: Get rid of itrace_started
  perf trace beauty: Beautify pkey_{alloc,free,mprotect} arguments
  tools headers: Sync cpu features kernel ABI headers with tooling headers
  perf tools: Pass full path of FEATURES_DUMP
  perf tools: Robustify detection of clang binary
  tools lib: Allow external definition of CC, AR and LD
  perf tools: Allow external definition of flex and bison binary names
  tools build tests: Don't hardcode gcc name
  perf report: Group stat values on global event id
  perf values: Zero value buffers
  perf values: Fix allocation check
  perf values: Fix thread index bug
  perf report: Add dump_read function
  perf record: Set read_format for inherit_stat
  perf c2c: Fix remote HITM detection for Skylake
  perf tools: Fix static build with newer toolchains
  ...
2017-09-04 08:39:02 -07:00
Linus Torvalds
e92d51aff5 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:

 - Prevent a potential inconistency in the perf user space access which
   might lead to evading sanity checks.

 - Prevent perf recording function trace entries twice

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/ftrace: Fix double traces of perf on ftrace:function
  perf/core: Fix potential double-fetch bug
2017-09-03 09:23:23 -07:00
David S. Miller
6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Eric Biggers
355627f518 mm, uprobes: fix multiple free of ->uprobes_state.xol_area
Commit 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().

However, it was overlooked that this introduced an new error path before
the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
being copied from the old mm_struct by the memcpy in dup_mm().  For a
task that has previously hit a uprobe tracepoint, this resulted in the
'struct xol_area' being freed multiple times if the task was killed at
just the right time while forking.

Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
than in uprobe_dup_mmap().

With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
program given by commit 2b7e8665b4 ("fork: fix incorrect fput of
->exe_file causing use-after-free"), provided that a uprobe tracepoint
has been set on the fork_thread() function.  For example:

    $ gcc reproducer.c -o reproducer -lpthread
    $ nm reproducer | grep fork_thread
    0000000000400719 t fork_thread
    $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
    $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
    $ ./reproducer

Here is the use-after-free reported by KASAN:

    BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
    Read of size 8 at addr ffff8800320a8b88 by task reproducer/198

    CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
     dump_stack+0xdb/0x185
     print_address_description+0x7e/0x290
     kasan_report+0x23b/0x350
     __asan_report_load8_noabort+0x19/0x20
     uprobe_clear_state+0x1c4/0x200
     mmput+0xd6/0x360
     do_exit+0x740/0x1670
     do_group_exit+0x13f/0x380
     get_signal+0x597/0x17d0
     do_signal+0x99/0x1df0
     exit_to_usermode_loop+0x166/0x1e0
     syscall_return_slowpath+0x258/0x2c0
     entry_SYSCALL_64_fastpath+0xbc/0xbe

    ...

    Allocated by task 199:
     save_stack_trace+0x1b/0x20
     kasan_kmalloc+0xfc/0x180
     kmem_cache_alloc_trace+0xf3/0x330
     __create_xol_area+0x10f/0x780
     uprobe_notify_resume+0x1674/0x2210
     exit_to_usermode_loop+0x150/0x1e0
     prepare_exit_to_usermode+0x14b/0x180
     retint_user+0x8/0x20

    Freed by task 199:
     save_stack_trace+0x1b/0x20
     kasan_slab_free+0xa8/0x1a0
     kfree+0xba/0x210
     uprobe_clear_state+0x151/0x200
     mmput+0xd6/0x360
     copy_process.part.8+0x605f/0x65d0
     _do_fork+0x1a5/0xbd0
     SyS_clone+0x19/0x20
     do_syscall_64+0x22f/0x660
     return_from_SYSCALL_64+0x0/0x7a

Note: without KASAN, you may instead see a "Bad page state" message, or
simply a general protection fault.

Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
Fixes: 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>    [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 16:33:15 -07:00
Kan Liang
fc7ce9c74c perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR
For understanding how the workload maps to memory channels and hardware
behavior, it's very important to collect address maps with physical
addresses. For example, 3D XPoint access can only be found by filtering
the physical address.

Add a new sample type for physical address.

perf already has a facility to collect data virtual address. This patch
introduces a function to convert the virtual address to physical address.
The function is quite generic and can be extended to any architecture as
long as a virtual address is provided.

 - For kernel direct mapping addresses, virt_to_phys is used to convert
   the virtual addresses to physical address.

 - For user virtual addresses, __get_user_pages_fast is used to walk the
   pages tables for user physical address.

 - This does not work for vmalloc addresses right now. These are not
   resolved, but code to do that could be added.

The new sample type requires collecting the virtual address. The
virtual address will not be output unless SAMPLE_ADDR is applied.

For security, the physical address can only be exposed to root or
privileged user.

Tested-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29 15:09:25 +02:00
Alexander Shishkin
8d4e6c4caa perf/core, pt, bts: Get rid of itrace_started
I just noticed that hw.itrace_started and hw.config are aliased to the
same location. Now, the PT driver happens to use both, which works out
fine by sheer luck:

 - STORE(hw.itrace_start) is ordered before STORE(hw.config), in the
    program order, although there are no compiler barriers to ensure that,

 - to the perf_log_itrace_start() hw.itrace_start looks set at the same
   time as when it is intended to be set because both stores happen in the
   same path,

 - hw.config is never reset to zero in the PT driver.

Now, the use of hw.config by the PT driver makes more sense (it being a
HW PMU) than messing around with itrace_started, which is an awkward API
to begin with.

This patch replaces hw.itrace_started with an attach_state bit and an
API call for the PMU drivers to use to communicate the condition.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170330153956.25994-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29 15:09:24 +02:00
Ingo Molnar
e0563e0495 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29 15:09:03 +02:00
Zhou Chengming
75e8387685 perf/ftrace: Fix double traces of perf on ftrace:function
When running perf on the ftrace:function tracepoint, there is a bug
which can be reproduced by:

  perf record -e ftrace:function -a sleep 20 &
  perf record -e ftrace:function ls
  perf script

              ls 10304 [005]   171.853235: ftrace:function:
  perf_output_begin
              ls 10304 [005]   171.853237: ftrace:function:
  perf_output_begin
              ls 10304 [005]   171.853239: ftrace:function:
  task_tgid_nr_ns
              ls 10304 [005]   171.853240: ftrace:function:
  task_tgid_nr_ns
              ls 10304 [005]   171.853242: ftrace:function:
  __task_pid_nr_ns
              ls 10304 [005]   171.853244: ftrace:function:
  __task_pid_nr_ns

We can see that all the function traces are doubled.

The problem is caused by the inconsistency of the register
function perf_ftrace_event_register() with the probe function
perf_ftrace_function_call(). The former registers one probe
for every perf_event. And the latter handles all perf_events
on the current cpu. So when two perf_events on the current cpu,
the traces of them will be doubled.

So this patch adds an extra parameter "event" for perf_tp_event,
only send sample data to this event when it's not NULL.

Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: huawei.libin@huawei.com
Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29 13:29:29 +02:00
Meng Xu
f12f42acdb perf/core: Fix potential double-fetch bug
While examining the kernel source code, I found a dangerous operation that
could turn into a double-fetch situation (a race condition bug) where the same
userspace memory region are fetched twice into kernel with sanity checks after
the first fetch while missing checks after the second fetch.

  1. The first fetch happens in line 9573 get_user(size, &uattr->size).

  2. Subsequently the 'size' variable undergoes a few sanity checks and
     transformations (line 9577 to 9584).

  3. The second fetch happens in line 9610 copy_from_user(attr, uattr, size)

  4. Given that 'uattr' can be fully controlled in userspace, an attacker can
     race condition to override 'uattr->size' to arbitrary value (say, 0xFFFFFFFF)
     after the first fetch but before the second fetch. The changed value will be
     copied to 'attr->size'.

  5. There is no further checks on 'attr->size' until the end of this function,
     and once the function returns, we lose the context to verify that 'attr->size'
     conforms to the sanity checks performed in step 2 (line 9577 to 9584).

  6. My manual analysis shows that 'attr->size' is not used elsewhere later,
     so, there is no working exploit against it right now. However, this could
     easily turns to an exploitable one if careless developers start to use
     'attr->size' later.

To fix this, override 'attr->size' from the second fetch to the one from the
first fetch, regardless of what is actually copied in.

In this way, it is assured that 'attr->size' is consistent with the checks
performed after the first fetch.

Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: meng.xu@gatech.edu
Cc: sanidhya@gatech.edu
Cc: taesoo@gatech.edu
Link: http://lkml.kernel.org/r/1503522470-35531-1-git-send-email-meng.xu@gatech.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29 13:26:22 +02:00
Jesper Dangaard Brouer
d0618410ec tracing, perf: Adjust code layout in get_recursion_context()
In an XDP redirect applications using tracepoint xdp:xdp_redirect to
diagnose TX overrun, I noticed perf_swevent_get_recursion_context()
was consuming 2% CPU. This was reduced to 1.85% with this simple
change.

Looking at the annotated asm code, it was clear that the unlikely case
in_nmi() test was chosen (by the compiler) as the most likely
event/branch.  This small adjustment makes the compiler (GCC version
7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely branch.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/150342256382.16595.986861478681783732.stgit@firesoul
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:18 +02:00
Oleg Nesterov
1d953111b6 perf/core: Don't report zero PIDs for exiting tasks
The exiting/dead task has no PIDs and in this case perf_event_pid/tid()
return zero, change them to return -1 to distinguish this case from
idle threads.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170822155928.GA6892@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:17 +02:00
Will Deacon
d9a50b0256 perf/aux: Ensure aux_wakeup represents most recent wakeup index
The aux_watermark member of struct ring_buffer represents the period (in
terms of bytes) at which wakeup events should be generated when data is
written to the aux buffer in non-snapshot mode. On hardware that cannot
generate an interrupt when the aux_head reaches an arbitrary wakeup index
(such as ARM SPE), the aux_head sampled from handle->head in
perf_aux_output_{skip,end} may in fact be past the wakeup index. This
can lead to wakeup slowly falling behind the head. For example, consider
the case where hardware can only generate an interrupt on a page-boundary
and the aux buffer is initialised as follows:

  // Buffer size is 2 * PAGE_SIZE
  rb->aux_head = rb->aux_wakeup = 0
  rb->aux_watermark = PAGE_SIZE / 2

following the first perf_aux_output_begin call, the handle is
initialised with:

  handle->head = 0
  handle->size = 2 * PAGE_SIZE
  handle->wakeup = PAGE_SIZE / 2

and the hardware will be programmed to generate an interrupt at
PAGE_SIZE.

When the interrupt is raised, the hardware head will be at PAGE_SIZE,
so calling perf_aux_output_end(handle, PAGE_SIZE) puts the ring buffer
into the following state:

  rb->aux_head = PAGE_SIZE
  rb->aux_wakeup = PAGE_SIZE / 2
  rb->aux_watermark = PAGE_SIZE / 2

and then the next call to perf_aux_output_begin will result in:

  handle->head = handle->wakeup = PAGE_SIZE

for which the semantics are unclear and, for a smaller aux_watermark
(e.g. PAGE_SIZE / 4), then the wakeup would in fact be behind head at
this point.

This patch fixes the problem by rounding down the aux_head (as sampled
from the handle) to the nearest aux_watermark boundary when updating
rb->aux_wakeup, therefore taking into account any overruns by the
hardware.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1502900297-21839-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:16 +02:00
Will Deacon
2ab346cfb0 perf/aux: Make aux_{head,wakeup} ring_buffer members long
The aux_head and aux_wakeup members of struct ring_buffer are defined
using the local_t type, despite the fact that they are only accessed via
the perf_aux_output_*() functions, which cannot race with each other for a
given ring buffer.

This patch changes the type of the members to long, so we can avoid
using the local_*() API where it isn't needed.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1502900297-21839-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:15 +02:00
Ingo Molnar
290d9bf281 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:01:05 +02:00
Mark Rutland
64aee2a965 perf/core: Fix group {cpu,task} validation
Regardless of which events form a group, it does not make sense for the
events to target different tasks and/or CPUs, as this leaves the group
inconsistent and impossible to schedule. The core perf code assumes that
these are consistent across (successfully intialised) groups.

Core perf code only verifies this when moving SW events into a HW
context. Thus, we can violate this requirement for pure SW groups and
pure HW groups, unless the relevant PMU driver happens to perform this
verification itself. These mismatched groups subsequently wreak havoc
elsewhere.

For example, we handle watchpoints as SW events, and reserve watchpoint
HW on a per-CPU basis at pmu::event_init() time to ensure that any event
that is initialised is guaranteed to have a slot at pmu::add() time.
However, the core code only checks the group leader's cpu filter (via
event_filter_match()), and can thus install follower events onto CPUs
violating thier (mismatched) CPU filters, potentially installing them
into a CPU without sufficient reserved slots.

This can be triggered with the below test case, resulting in warnings
from arch backends.

  #define _GNU_SOURCE
  #include <linux/hw_breakpoint.h>
  #include <linux/perf_event.h>
  #include <sched.h>
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			   int group_fd, unsigned long flags)
  {
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  char watched_char;

  struct perf_event_attr wp_attr = {
	.type = PERF_TYPE_BREAKPOINT,
	.bp_type = HW_BREAKPOINT_RW,
	.bp_addr = (unsigned long)&watched_char,
	.bp_len = 1,
	.size = sizeof(wp_attr),
  };

  int main(int argc, char *argv[])
  {
	int leader, ret;
	cpu_set_t cpus;

	/*
	 * Force use of CPU0 to ensure our CPU0-bound events get scheduled.
	 */
	CPU_ZERO(&cpus);
	CPU_SET(0, &cpus);
	ret = sched_setaffinity(0, sizeof(cpus), &cpus);
	if (ret) {
		printf("Unable to set cpu affinity\n");
		return 1;
	}

	/* open leader event, bound to this task, CPU0 only */
	leader = perf_event_open(&wp_attr, 0, 0, -1, 0);
	if (leader < 0) {
		printf("Couldn't open leader: %d\n", leader);
		return 1;
	}

	/*
	 * Open a follower event that is bound to the same task, but a
	 * different CPU. This means that the group should never be possible to
	 * schedule.
	 */
	ret = perf_event_open(&wp_attr, 0, 1, leader, 0);
	if (ret < 0) {
		printf("Couldn't open mismatched follower: %d\n", ret);
		return 1;
	} else {
		printf("Opened leader/follower with mismastched CPUs\n");
	}

	/*
	 * Open as many independent events as we can, all bound to the same
	 * task, CPU0 only.
	 */
	do {
		ret = perf_event_open(&wp_attr, 0, 0, -1, 0);
	} while (ret >= 0);

	/*
	 * Force enable/disble all events to trigger the erronoeous
	 * installation of the follower event.
	 */
	printf("Opened all events. Toggling..\n");
	for (;;) {
		prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
		prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0);
	}

	return 0;
  }

Fix this by validating this requirement regardless of whether we're
moving events.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhou Chengming <zhouchengming1@huawei.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:00:34 +02:00
David S. Miller
e2a7c34fb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
leilei.lin
fdccc3fb7a perf/core: Reduce context switch overhead
Skip most of the PMU context switching overhead when ctx->nr_events is 0.

50% performance overhead was observed under an extreme testcase.

Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: eranian@gmail.com
Cc: jolsa@redhat.com
Cc: linxiulei@gmail.com
Cc: yang_oliver@hotmail.com
Link: http://lkml.kernel.org/r/20170809002921.69813-1-leilei.lin@alibaba-inc.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:08:40 +02:00