Commit Graph

520 Commits

Author SHA1 Message Date
Wanpeng Li
26a8a3c531 KVM: Fix stack-out-of-bounds read in write_mmio
commit e39d200fa5 upstream.

Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741).  This patch fixes
it by just accessing the bytes which we operate on.

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[bwh: Backported to 3.2:
 - Drop ARM changes
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-01-01 20:51:04 +00:00
Rik van Riel
417d604d7f tracing: Add #undef to fix compile error
commit bf7165cfa2 upstream.

There are several trace include files that define TRACE_INCLUDE_FILE.

Include several of them in the same .c file (as I currently have in
some code I am working on), and the compile will blow up with a
"warning: "TRACE_INCLUDE_FILE" redefined #define TRACE_INCLUDE_FILE syscalls"

Every other include file in include/trace/events/ avoids that issue
by having a #undef TRACE_INCLUDE_FILE before the #define; syscalls.h
should have one, too.

Link: http://lkml.kernel.org/r/20160928225554.13bd7ac6@annuminas.surriel.com

Fixes: b8007ef742 ("tracing: Separate raw syscall from syscall tracer")
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2017-07-18 18:38:30 +01:00
Jan Kara
a3ceb22921 jbd2: issue cache flush after checkpointing even with internal journal
commit 79feb521a4 upstream.

When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
checkpointed buffers are on a stable storage - especially if buffers were
written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's
caches. Thus when we update journal superblock effectively removing old
transaction from journal, this write of superblock can get to stable storage
before those checkpointed buffers which can result in filesystem corruption
after a crash. Thus we must unconditionally issue a cache flush before we
update journal superblock in these cases.

A similar problem can also occur if journal superblock is written only in
disk's caches, other transaction starts reusing space of the transaction
cleaned from the log and power failure happens. Subsequent journal replay would
still try to replay the old transaction but some of it's blocks may be already
overwritten by the new transaction. For this reason we must use WRITE_FUA when
updating log tail and we must first write new log tail to disk and update
in-memory information only after that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
[bwh: Prerequisite for "jbd2: fix ocfs2 corrupt when updating journal
 superblock fails".
 Backported to 3.2:
 - Adjust context
 - Drop changes to jbd2_journal_update_sb_log_tail trace event]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-08-12 16:33:15 +02:00
Oleg Nesterov
cd58af0522 tracing: Fix syscall_*regfunc() vs copy_process() race
commit 4af4206be2 upstream.

syscall_regfunc() and syscall_unregfunc() should set/clear
TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
with copy_process() and miss the new child which was not added to
the process/thread lists yet.

Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
under tasklist.

Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

Fixes: a871bd33a6 "tracing: Add syscall tracepoints"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-07-11 13:33:53 +01:00
Romain Izard
1ad1054c3b trace: module: Maintain a valid user count
commit 098507ae3e upstream.

The replacement of the 'count' variable by two variables 'incs' and
'decs' to resolve some race conditions during module unloading was done
in parallel with some cleanup in the trace subsystem, and was integrated
as a merge.

Unfortunately, the formula for this replacement was wrong in the tracing
code, and the refcount in the traces was not usable as a result.

Use 'count = incs - decs' to compute the user count.

Link: http://lkml.kernel.org/p/1393924179-9147-1-git-send-email-romain.izard.pro@gmail.com

Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Fixes: c1ab9cab75 "merge conflict resolution"
Signed-off-by: Romain Izard <romain.izard.pro@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-06-09 13:29:10 +01:00
Roman Pen
5b85afa68e blktrace: fix accounting of partially completed requests
commit af5040da01 upstream.

trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:

  C   R 232 + 240 [0]
  C   R 240 + 232 [0]
  C   R 248 + 224 [0]
  C   R 256 + 216 [0]

but should be:

  C   R 232 + 8 [0]
  C   R 240 + 8 [0]
  C   R 248 + 8 [0]
  C   R 256 + 8 [0]

Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.

This patch takes into account real completion size of the request and
fixes wrong completion accounting.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
[bwh: Backported to 3.2: drop change in blk-mq.c]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-30 16:23:20 +01:00
Steven Rostedt (Red Hat)
e8a972ca97 tracing: Allow events to have NULL strings
commit 4e58e54754 upstream.

If an TRACE_EVENT() uses __assign_str() or __get_str on a NULL pointer
then the following oops will happen:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<c127a17b>] strlen+0x10/0x1a
*pde = 00000000 ^M
Oops: 0000 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.13.0-rc1-test+ #2
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006^M
task: f5cde9f0 ti: f5e5e000 task.ti: f5e5e000
EIP: 0060:[<c127a17b>] EFLAGS: 00210046 CPU: 1
EIP is at strlen+0x10/0x1a
EAX: 00000000 EBX: c2472da8 ECX: ffffffff EDX: c2472da8
ESI: c1c5e5fc EDI: 00000000 EBP: f5e5fe84 ESP: f5e5fe80
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00000000 CR3: 01f32000 CR4: 000007d0
Stack:
 f5f18b90 f5e5feb8 c10687a8 0759004f 00000005 00000005 00000005 00200046
 00000002 00000000 c1082a93 f56c7e28 c2472da8 c1082a93 f5e5fee4 c106bc61^M
 00000000 c1082a93 00000000 00000000 00000001 00200046 00200082 00000000
Call Trace:
 [<c10687a8>] ftrace_raw_event_lock+0x39/0xc0
 [<c1082a93>] ? ktime_get+0x29/0x69
 [<c1082a93>] ? ktime_get+0x29/0x69
 [<c106bc61>] lock_release+0x57/0x1a5
 [<c1082a93>] ? ktime_get+0x29/0x69
 [<c10824dd>] read_seqcount_begin.constprop.7+0x4d/0x75
 [<c1082a93>] ? ktime_get+0x29/0x69^M
 [<c1082a93>] ktime_get+0x29/0x69
 [<c108a46a>] __tick_nohz_idle_enter+0x1e/0x426
 [<c10690e8>] ? lock_release_holdtime.part.19+0x48/0x4d
 [<c10bc184>] ? time_hardirqs_off+0xe/0x28
 [<c1068c82>] ? trace_hardirqs_off_caller+0x3f/0xaf
 [<c108a8cb>] tick_nohz_idle_enter+0x59/0x62
 [<c1079242>] cpu_startup_entry+0x64/0x192
 [<c102299c>] start_secondary+0x277/0x27c
Code: 90 89 c6 89 d0 88 c4 ac 38 e0 74 09 84 c0 75 f7 be 01 00 00 00 89 f0 48 5e 5d c3 55 89 e5 57 66 66 66 66 90 83 c9 ff 89 c7 31 c0 <f2> ae f7 d1 8d 41 ff 5f 5d c3 55 89 e5 57 66 66 66 66 90 31 ff
EIP: [<c127a17b>] strlen+0x10/0x1a SS:ESP 0068:f5e5fe80
CR2: 0000000000000000
---[ end trace 01bc47bf519ec1b2 ]---

New tracepoints have been added that have allowed for NULL pointers
being assigned to strings. To fix this, change the TRACE_EVENT() code
to check for NULL and if it is, it will assign "(null)" to it instead
(similar to what glibc printf does).

Reported-by: Shuah Khan <shuah.kh@samsung.com>
Reported-by: Jovi Zhangwei <jovi.zhangwei@gmail.com>
Link: http://lkml.kernel.org/r/CAGdX0WFeEuy+DtpsJzyzn0343qEEjLX97+o1VREFkUEhndC+5Q@mail.gmail.com
Link: http://lkml.kernel.org/r/528D6972.9010702@samsung.com
Fixes: 9cbf117662 ("tracing/events: provide string with undefined size support")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-01-03 04:33:26 +00:00
Konrad Rzeszutek Wilk
c2ba72569b xen/mmu: Use Xen specific TLB flush instead of the generic one.
commit 95a7d76897 upstream.

As Mukesh explained it, the MMUEXT_TLB_FLUSH_ALL allows the
hypervisor to do a TLB flush on all active vCPUs. If instead
we were using the generic one (which ends up being xen_flush_tlb)
we end up making the MMUEXT_TLB_FLUSH_LOCAL hypercall. But
before we make that hypercall the kernel will IPI all of the
vCPUs (even those that were asleep from the hypervisor
perspective). The end result is that we needlessly wake them
up and do a TLB flush when we can just let the hypervisor
do it correctly.

This patch gives around 50% speed improvement when migrating
idle guest's from one host to another.

Oracle-bug: 14630170

Tested-by:  Jingjie Jiang <jingjie.jiang@oracle.com>
Suggested-by:  Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-11-16 16:47:03 +00:00
Wen Congyang
b72a2bd8e4 tracing: Don't call page_to_pfn() if page is NULL
commit 85f2a2ef1d upstream.

When allocating memory fails, page is NULL. page_to_pfn() will
cause the kernel panicked if we don't use sparsemem vmemmap.

Link: http://lkml.kernel.org/r/505AB1FF.8020104@cn.fujitsu.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-10 03:31:00 +01:00
Wu Fengguang
422204b779 writeback: fix dereferencing NULL bdi->dev on trace_writeback_queue
commit 977b7e3a52 upstream.

When a SD card is hot removed without umount, del_gendisk() will call
bdi_unregister() without destroying/freeing it. This leaves the bdi in
the bdi->dev = NULL, bdi->wb.task = NULL, bdi->bdi_list removed state.

When sync(2) gets the bdi before bdi_unregister() and calls
bdi_queue_work() after the unregister, trace_writeback_queue will be
dereferencing the NULL bdi->dev. Fix it with a simple test for NULL.

LKML-reference: http://lkml.org/lkml/2012/1/18/346
Reported-by: Rabin Vincent <rabin@rab.in>
Tested-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-20 12:46:17 -08:00
Wu Fengguang
aec14f459c writeback: fix NULL bdi->dev in trace writeback_single_inode
commit 15eb77a07c upstream.

bdi_prune_sb() resets sb->s_bdi to default_backing_dev_info when the
tearing down the original bdi. Fix trace_writeback_single_inode to
use sb->s_bdi=default_backing_dev_info rather than bdi->dev=NULL for a
teared down bdi.

Reported-by: Rabin Vincent <rabin@rab.in>
Tested-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-20 12:46:16 -08:00
Wu Fengguang
b3bba872dd writeback: show writeback reason with __print_symbolic
This makes the binary trace understandable by trace-cmd.

CC: Dave Chinner <david@fromorbit.com>
CC: Curt Wohlgemuth <curtw@google.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-18 14:20:17 +08:00
Linus Torvalds
32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds
208bca0860 Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Add a 'reason' to wb_writeback_work
  writeback: send work item to queue_io, move_expired_inodes
  writeback: trace event balance_dirty_pages
  writeback: trace event bdi_dirty_ratelimit
  writeback: fix ppc compile warnings on do_div(long long, unsigned long)
  writeback: per-bdi background threshold
  writeback: dirty position control - bdi reserve area
  writeback: control dirty pause time
  writeback: limit max dirty pause time
  writeback: IO-less balance_dirty_pages()
  writeback: per task dirty rate limit
  writeback: stabilize bdi->dirty_ratelimit
  writeback: dirty rate control
  writeback: add bg_threshold parameter to __bdi_update_bandwidth()
  writeback: dirty position control
  writeback: account per-bdi accumulated dirtied pages
2011-11-06 19:02:23 -08:00
Linus Torvalds
f1f8935a5c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits)
  jbd2: Unify log messages in jbd2 code
  jbd/jbd2: validate sb->s_first in journal_get_superblock()
  ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
  ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
  ext4: fix a typo in struct ext4_allocation_context
  ext4: Don't normalize an falloc request if it can fit in 1 extent.
  ext4: remove comments about extent mount option in ext4_new_inode()
  ext4: let ext4_discard_partial_buffers handle unaligned range correctly
  ext4: return ENOMEM if find_or_create_pages fails
  ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock()
  ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
  ext4: optimize locking for end_io extent conversion
  ext4: remove unnecessary call to waitqueue_active()
  ext4: Use correct locking for ext4_end_io_nolock()
  ext4: fix race in xattr block allocation path
  ext4: trace punch_hole correctly in ext4_ext_map_blocks
  ext4: clean up AGGRESSIVE_TEST code
  ext4: move variables to their scope
  ext4: fix quota accounting during migration
  ext4: migrate cleanup
  ...
2011-11-02 10:06:20 -07:00
Minchan Kim
4356f21d09 mm: change isolate mode from #define to bitwise type
Change ISOLATE_XXX macro with bitwise isolate_mode_t type.  Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.

Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."

This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:44 -07:00
Paul Gortmaker
67b84999b1 Revert "tracing: Include module.h in define_trace.h"
This reverts commit 3a9f987b31.

With all the files that are real modules now having module.h
explicitly called out for inclusion, and no reliance on any
implicit presence of module.h assumed, we should no longer
need this workaround.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:32:35 -04:00
Paul Gortmaker
de47725421 include: replace linux/module.h with "struct module" wherever possible
The <linux/module.h> pretty much brings in the kitchen sink along
with it, so it should be avoided wherever reasonably possible in
terms of being included from other commonly used <linux/something.h>
files, as it results in a measureable increase on compile times.

The worst culprit was probably device.h since it is used everywhere.
This file also had an implicit dependency/usage of mutex.h which was
masked by module.h, and is also fixed here at the same time.

There are over a dozen other headers that simply declare the
struct instead of pulling in the whole file, so follow their lead
and simply make it a few more.

Most of the implicit dependencies on module.h being present by
these headers pulling it in have been now weeded out, so we can
finally make this change with hopefully minimal breakage.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:32:32 -04:00
Curt Wohlgemuth
0e175a1835 writeback: Add a 'reason' to wb_writeback_work
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity.  A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.

The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.

And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31 00:33:36 +08:00
Curt Wohlgemuth
ad4e38dd6a writeback: send work item to queue_io, move_expired_inodes
Instead of sending ->older_than_this to queue_io() and
move_expired_inodes(), send the entire wb_writeback_work
structure.  There are other fields of a work item that are
useful in these routines and in tracepoints.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31 00:33:27 +08:00
Wu Fengguang
ece13ac31b writeback: trace event balance_dirty_pages
Useful for analyzing the dynamics of the throttling algorithms and
debugging user reported problems.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31 00:29:38 +08:00
Wu Fengguang
b48c104d22 writeback: trace event bdi_dirty_ratelimit
It helps understand how various throttle bandwidths are updated.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31 00:29:21 +08:00
Linus Torvalds
68d99b2c8e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (549 commits)
  ALSA: hda - Fix ADC input-amp handling for Cx20549 codec
  ALSA: hda - Keep EAPD turned on for old Conexant chips
  ALSA: hda/realtek - Fix missing volume controls with ALC260
  ASoC: wm8940: Properly set codec->dapm.bias_level
  ALSA: hda - Fix pin-config for ASUS W90V
  ALSA: hda - Fix surround/CLFE headphone and speaker pins order
  ALSA: hda - Fix typo
  ALSA: Update the sound git tree URL
  ALSA: HDA: Add new revision for ALC662
  ASoC: max98095: Convert codec->hw_write to snd_soc_write
  ASoC: keep pointer to resource so it can be freed
  ASoC: sgtl5000: Fix wrong mask in some snd_soc_update_bits calls
  ASoC: wm8996: Fix wrong mask for setting WM8996_AIF_CLOCKING_2
  ASoC: da7210: Add support for line out and DAC
  ASoC: da7210: Add support for DAPM
  ALSA: hda/realtek - Fix DAC assignments of multiple speakers
  ASoC: Use SGTL5000_LINREG_VDDD_MASK instead of hardcoded mask value
  ASoC: Set sgtl5000->ldo in ldo_regulator_register
  ASoC: wm8996: Use SND_SOC_DAPM_AIF_OUT for AIF2 Capture
  ASoC: wm8994: Use SND_SOC_DAPM_AIF_OUT for AIF3 Capture
  ...
2011-10-28 14:25:01 -07:00
Eric Gouriou
6f91bc5fda ext4: optimize ext4_ext_convert_to_initialized()
This patch introduces a fast path in ext4_ext_convert_to_initialized()
for the case when the conversion can be performed by transferring
the newly initialized blocks from the uninitialized extent into
an adjacent initialized extent. Doing so removes the expensive
invocations of memmove() which occur during extent insertion and
the subsequent merge.

In practice this should be the common case for clients performing
append writes into files pre-allocated via
fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
direct IO and when using a suboptimal implementation of memmove()
(x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
consumption by 32%.

Two new trace points are added to ext4_ext_convert_to_initialized()
to offer visibility into its operations. No exit trace point has
been added due to the multiplicity of return points. This can be
revisited once the upstream cleanup is backported.

Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-27 11:43:23 -04:00
Linus Torvalds
8a4a8918ed Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  llist: Add back llist_add_batch() and llist_del_first() prototypes
  sched: Don't use tasklist_lock for debug prints
  sched: Warn on rt throttling
  sched: Unify the ->cpus_allowed mask copy
  sched: Wrap scheduler p->cpus_allowed access
  sched: Request for idle balance during nohz idle load balance
  sched: Use resched IPI to kick off the nohz idle balance
  sched: Fix idle_cpu()
  llist: Remove cpu_relax() usage in cmpxchg loops
  sched: Convert to struct llist
  llist: Add llist_next()
  irq_work: Use llist in the struct irq_work logic
  llist: Return whether list is empty before adding in llist_add()
  llist: Move cpu_relax() to after the cmpxchg()
  llist: Remove the platform-dependent NMI checks
  llist: Make some llist functions inline
  sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch
  sched: Remove redundant test in check_preempt_tick()
  sched: Add documentation for bandwidth control
  sched: Return unused runtime on group dequeue
  ...
2011-10-26 17:08:43 +02:00
Linus Torvalds
7115e3fcf4 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (121 commits)
  perf symbols: Increase symbol KSYM_NAME_LEN size
  perf hists browser: Refuse 'a' hotkey on non symbolic views
  perf ui browser: Use libslang to read keys
  perf tools: Fix tracing info recording
  perf hists browser: Elide DSO column when it is set to just one DSO, ditto for threads
  perf hists: Don't consider filtered entries when calculating column widths
  perf hists: Don't decay total_period for filtered entries
  perf hists browser: Honour symbol_conf.show_{nr_samples,total_period}
  perf hists browser: Do not exit on tab key with single event
  perf annotate browser: Don't change selection line when returning from callq
  perf tools: handle endianness of feature bitmap
  perf tools: Add prelink suggestion to dso update message
  perf script: Fix unknown feature comment
  perf hists browser: Apply the dso and thread filters when merging new batches
  perf hists: Move the dso and thread filters from hist_browser
  perf ui browser: Honour the xterm colors
  perf top tui: Give color hints just on the percentage, like on --stdio
  perf ui browser: Make the colors configurable and change the defaults
  perf tui: Remove unneeded call to newtCls on startup
  perf hists: Don't format the percentage on hist_entry__snprintf
  ...

Fix up conflicts in arch/x86/kernel/kprobes.c manually.

Ingo's tree did the insane "add volatile to const array", which just
doesn't make sense ("volatile const"?).  But we could remove the const
*and* make the array volatile to make doubly sure that gcc doesn't
optimize it away..

Also fix up kernel/trace/ring_buffer.c non-data-conflicts manually: the
reader_lock has been turned into a raw lock by the core locking merge,
and there was a new user of it introduced in this perf core merge.  Make
sure that new use also uses the raw accessor functions.
2011-10-26 17:03:38 +02:00
Linus Torvalds
19b4a8d520 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp()
  rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states
  rcu: Wire up RCU_BOOST_PRIO for rcutree
  rcu: Make rcu_torture_boost() exit loops at end of test
  rcu: Make rcu_torture_fqs() exit loops at end of test
  rcu: Permit rt_mutex_unlock() with irqs disabled
  rcu: Avoid having just-onlined CPU resched itself when RCU is idle
  rcu: Suppress NMI backtraces when stall ends before dump
  rcu: Prohibit grace periods during early boot
  rcu: Simplify unboosting checks
  rcu: Prevent early boot set_need_resched() from __rcu_pending()
  rcu: Dump local stack if cannot dump all CPUs' stacks
  rcu: Move __rcu_read_unlock()'s barrier() within if-statement
  rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentation
  rcu: Make rcu_assign_pointer() unconditionally insert a memory barrier
  rcu: Make rcu_implicit_dynticks_qs() locals be correct size
  rcu: Eliminate in_irq() checks in rcu_enter_nohz()
  nohz: Remove nohz_cpu_mask
  rcu: Document interpretation of RCU-lockdep splats
  rcu: Allow rcutorture's stat_interval parameter to be changed at runtime
  ...
2011-10-26 16:26:53 +02:00
Linus Torvalds
e33bae14fd Merge branch 'for-linus' of git://github.com/ericvh/linux
* 'for-linus' of git://github.com/ericvh/linux:
  9p: fix 9p.txt to advertise msize instead of maxdata
  net/9p: Convert net/9p protocol dumps to tracepoints
  fs/9p: change an int to unsigned int
  fs/9p: Cleanup option parsing in 9p
  9p: move dereference after NULL check
  fs/9p: inode file operation is properly initialized init_special_inode
  fs/9p: Update zero-copy implementation in 9p
2011-10-26 14:20:53 +02:00
Linus Torvalds
7e0bb71e75 Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (63 commits)
  PM / Clocks: Remove redundant NULL checks before kfree()
  PM / Documentation: Update docs about suspend and CPU hotplug
  ACPI / PM: Add Sony VGN-FW21E to nonvs blacklist.
  ARM: mach-shmobile: sh7372 A4R support (v4)
  ARM: mach-shmobile: sh7372 A3SP support (v4)
  PM / Sleep: Mark devices involved in wakeup signaling during suspend
  PM / Hibernate: Improve performance of LZO/plain hibernation, checksum image
  PM / Hibernate: Do not initialize static and extern variables to 0
  PM / Freezer: Make fake_signal_wake_up() wake TASK_KILLABLE tasks too
  PM / Hibernate: Add resumedelay kernel param in addition to resumewait
  MAINTAINERS: Update linux-pm list address
  PM / ACPI: Blacklist Vaio VGN-FW520F machine known to require acpi_sleep=nonvs
  PM / ACPI: Blacklist Sony Vaio known to require acpi_sleep=nonvs
  PM / Hibernate: Add resumewait param to support MMC-like devices as resume file
  PM / Hibernate: Fix typo in a kerneldoc comment
  PM / Hibernate: Freeze kernel threads after preallocating memory
  PM: Update the policy on default wakeup settings
  PM / VT: Cleanup #if defined uglyness and fix compile error
  PM / Suspend: Off by one in pm_suspend()
  PM / Hibernate: Include storage keys in hibernation image on s390
  ...
2011-10-25 15:18:39 +02:00
Linus Torvalds
4e7e2a2008 Merge branch 'for-linus' of git://opensource.wolfsonmicro.com/regmap
* 'for-linus' of git://opensource.wolfsonmicro.com/regmap: (62 commits)
  mfd: Enable rbtree cache for wm831x devices
  regmap: Support some block operations on cached devices
  regmap: Allow caches for devices with no defaults
  regmap: Ensure rbtree syncs registers set to zero properly
  regmap: Allow rbtree to cache zero default values
  regmap: Warn on raw I/O as well as bulk reads that bypass cache
  regmap: Return a sensible error code if we fail to read the cache
  regmap: Use bsearch() to search the register defaults
  regmap: Fix doc comment
  regmap: Optimize the lookup path to use binary search
  regmap: Ensure we scream if we enable cache bypass/only at the same time
  regmap: Implement regcache_cache_bypass helper function
  regmap: Save/restore the bypass state upon syncing
  regmap: Lock the sync path, ensure we use the lockless _regmap_write()
  regmap: Fix apostrophe usage
  regmap: Make _regmap_write() global
  regmap: Fix lock used for regcache_cache_only()
  regmap: Grab the lock in regcache_cache_only()
  regmap: Modify map->cache_bypass directly
  regmap: Fix regcache_sync generic implementation
  ...
2011-10-25 13:57:45 +02:00
Aneesh Kumar K.V
348b59012e net/9p: Convert net/9p protocol dumps to tracepoints
This helps in more control over debugging.
root@qemu-img-64:~# ls /pass/123
ls: cannot access /pass/123: No such file or directory
root@qemu-img-64:~# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
              ls-1536  [001]    70.928584: 9p_protocol_dump: clnt 18446612132784021504 P9_TWALK(tag = 1)
000: 16 00 00 00 6e 01 00 01 00 00 00 02 00 00 00 01
010: 00 03 00 31 32 33 00 00 00 ff ff ff ff 00 00 00

              ls-1536  [001]    70.928587: <stack trace>
 => trace_9p_protocol_dump
 => p9pdu_finalize
 => p9_client_rpc
 => p9_client_walk
 => v9fs_vfs_lookup
 => d_alloc_and_lookup
 => walk_component
 => path_lookupat
              ls-1536  [000]    70.929696: 9p_protocol_dump: clnt 18446612132784021504 P9_RLERROR(tag = 1)
000: 0b 00 00 00 07 01 00 02 00 00 00 4e 03 00 02 00
010: 00 00 00 00 03 00 02 00 00 00 00 00 ff 43 00 00

              ls-1536  [000]    70.929697: <stack trace>
 => trace_9p_protocol_dump
 => p9_client_rpc
 => p9_client_walk
 => v9fs_vfs_lookup
 => d_alloc_and_lookup
 => walk_component
 => path_lookupat
 => do_path_lookup

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-10-24 11:13:12 -05:00
Rafael J. Wysocki
d727b60659 Merge branch 'pm-runtime' into pm-for-linus
* pm-runtime:
  PM / Tracing: build rpm-traces.c only if CONFIG_PM_RUNTIME is set
  PM / Runtime: Replace dev_dbg() with trace_rpm_*()
  PM / Runtime: Introduce trace points for tracing rpm_* functions
  PM / Runtime: Don't run callbacks under lock for power.irq_safe set
  USB: Add wakeup info to debugging messages
  PM / Runtime: pm_runtime_idle() can be called in atomic context
  PM / Runtime: Add macro to test for runtime PM events
  PM / Runtime: Add might_sleep() to runtime PM functions
2011-10-07 23:16:55 +02:00
Ingo Molnar
9d01402023 Merge commit 'v3.1-rc9' into perf/core
Merge reason: pick up latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-06 12:49:21 +02:00
Ingo Molnar
22f92bacbe Merge branch 'linus' into sched/core
Merge reason: pick up the latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 11:09:08 +02:00
Andrew Vagin
92e51938f5 perf: Fix counter of ftrace events
Each event adds some points to its counters. By default it adds 1,
and a number of points may be transmited in event's parameters.

E.g. sched:sched_stat_runtime adds how long process has been running.

But this functionality was broken by v2.6.31-rc5-392-gf413cdb
and now the event's parameters doesn't affect on a number of points.

TP_perf_assign isn't defined, so __perf_count(c) isn't executed and
__count is always equal to 1.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317052535-1765247-2-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 11:07:54 +02:00
Wu Fengguang
143dfe8611 writeback: IO-less balance_dirty_pages()
As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpy in legacy kernel, and throughput is
  improved by 67% by this patchset. (plus the larger write chunk size,
  it will be 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03 21:08:57 +08:00
Ingo Molnar
048b718029 Merge branch 'rcu/next' of git://github.com/paulmckrcu/linux into core/rcu 2011-10-01 14:21:36 +02:00
Paul E. McKenney
d4c08f2ac3 rcu: Add grace-period, quiescent-state, and call_rcu trace events
Add trace events to record grace-period start and end, quiescent states,
CPUs noticing grace-period start and end, grace-period initialization,
call_rcu() invocation, tasks blocking in RCU read-side critical sections,
tasks exiting those same critical sections, force_quiescent_state()
detection of dyntick-idle and offline CPUs, CPUs entering and leaving
dyntick-idle mode (except from NMIs), CPUs coming online and going
offline, and CPUs being kicked for staying in dyntick-idle mode for too
long (as in many weeks, even on 32-bit systems).

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

rcu: Add the rcu flavor to callback trace events

The earlier trace events for registering RCU callbacks and for invoking
them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched).
This commit adds the RCU flavor to those trace events.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28 21:38:21 -07:00
Paul E. McKenney
385680a948 rcu: Add event-trace markers to TREE_RCU kthreads
Add event-trace markers to TREE_RCU kthreads to allow including these
kthread's CPU time in the utilization calculations.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28 21:38:19 -07:00
Paul E. McKenney
72fe701b70 rcu: Add RCU type to callback-invocation tracing
Add a string to the rcu_batch_start() and rcu_batch_end() trace
messages that indicates the RCU type ("rcu_sched", "rcu_bh", or
"rcu_preempt").  The trace messages for the actual invocations
themselves are not marked, as it should be clear from the
rcu_batch_start() and rcu_batch_end() events before and after.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28 21:38:15 -07:00
Paul E. McKenney
300df91ca9 rcu: Event-trace markers for computing RCU CPU utilization
This commit adds the trace_rcu_utilization() marker that is to be
used to allow postprocessing scripts compute RCU's CPU utilization,
give or take event-trace overhead.  Note that we do not include RCU's
dyntick-idle interface because event tracing requires RCU protection,
which is not available in dyntick-idle mode.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28 21:38:13 -07:00
Paul E. McKenney
29c00b4a1d rcu: Add event-tracing for RCU callback invocation
There was recently some controversy about the overhead of invoking RCU
callbacks.  Add TRACE_EVENT()s to obtain fine-grained timings for the
start and stop of a batch of callbacks and also for each callback invoked.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28 21:38:12 -07:00
Ming Lei
53b615ccca PM / Runtime: Introduce trace points for tracing rpm_* functions
This patch introduces 3 trace points to prepare for tracing
rpm_idle/rpm_suspend/rpm_resume functions, so we can use these
trace points to replace the current dev_dbg().

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-09-27 22:53:27 +02:00
Peter Zijlstra
557ab42542 sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch
We had need to see the difference between scheduling a runnable task and
a runnable task being involuntarily preempted.

No app should rely on the old string output (the binary
trace event record format is not changed).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1316164603.10174.11.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-26 13:25:54 +02:00
Mark Brown
e56235e099 ASoC: Add another DAPM stat for neighbour checks
The number of times we look at a potentially connected neighbour is just
as important as the number of times we actually recurse into looking at
that neighbour so also collect that statistic.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2011-09-22 17:24:40 +01:00
Mark Brown
de02d0786d ASoC: Trace and collect statistics for DAPM graph walking
One of the longest standing areas for improvement in ASoC has been the
DAPM algorithm - it repeats the same checks many times whenever it is run
and makes no effort to limit the areas of the graph it checks meaning we
do an awful lot of walks over the full graph. This has never mattered too
much as the size of the graph has generally been small in relation to the
size of the devices supported and the speed of CPUs but it is annoying.

In preparation for work on improving this insert a trace point after the
graph walk has been done. This gives us specific timing information for
the walk, and in order to give quantifiable (non-benchmark) numbers also
count every time we check a link or check the power for a widget and report
those numbers. Substantial changes in the algorithm may require tweaks to
the stats but they should be useful for simpler things.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2011-09-21 14:53:44 +01:00
Dimitris Papastamos
5936008901 regmap: Add the regcache_sync trace event
Signed-off-by: Dimitris Papastamos <dp@opensource.wolfsonmicro.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2011-09-19 19:06:34 +01:00
Aditya Kali
d8990240d8 ext4: add some tracepoints in ext4/extents.c
This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
ext4/inode.c.

Tested: Built and ran the kernel and verified that these tracepoints work.
Also ran xfstests.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:18:51 -04:00
Wu Fengguang
c8ad620638 writeback: show raw dirtied_when in trace writeback_single_inode
Save inode->dirtied_when in the raw trace output for reliable scripting,
and to also show in formatted output the relative age in seconds for
easy human reading.

CC: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-08-31 08:48:15 +08:00
Linus Torvalds
5ccc38740a Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
  Revert "cfq: Remove special treatment for metadata rqs."
  block: fix flush machinery for stacking drivers with differring flush flags
  block: improve rq_affinity placement
  blktrace: add FLUSH/FUA support
  Move some REQ flags to the common bio/request area
  allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
  xen/blkback: Make description more obvious.
  cfq-iosched: Add documentation about idling
  block: Make rq_affinity = 1 work as expected
  block: swim3: fix unterminated of_device_id table
  block/genhd.c: remove useless cast in diskstats_show()
  drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
  drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
  bsg-lib: add module.h include
  cfq-iosched: Reduce linked group count upon group destruction
  blk-throttle: correctly determine sync bio
  loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
  loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
  loop: add management interface for on-demand device allocation
  loop: replace linked list of allocated devices with an idr index
  ...
2011-08-19 10:47:07 -07:00