Commit Graph

14963 Commits

Author SHA1 Message Date
Steven Rostedt 2b6080f28c tracing: Encapsulate global_trace and remove dependencies on global vars
The global_trace variable in kernel/trace/trace.c has been kept 'static' and
local to that file so that it would not be used too much outside of that
file. This has paid off, even though there were lots of changes to make
the trace_array structure more generic (not depending on global_trace).

Removal of a lot of direct usages of global_trace is needed to be able to
create more trace_arrays such that we can add multiple buffers.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:42 -04:00
Steven Rostedt ae3b5093ad tracing: Use RING_BUFFER_ALL_CPUS for TRACE_PIPE_ALL_CPU
Both RING_BUFFER_ALL_CPUS and TRACE_PIPE_ALL_CPU are defined as
-1 and used to say that all the ring buffers are to be modified
or read (instead of just a single cpu, which would be >= 0).

There's no reason to keep TRACE_PIPE_ALL_CPU as it is also started
to be used for more than what it was created for, and now that
the ring buffer code added a generic RING_BUFFER_ALL_CPUS define,
we can clean up the trace code to use that instead and remove
the TRACE_PIPE_ALL_CPU macro.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:41 -04:00
Steven Rostedt ae63b31e4d tracing: Separate out trace events from global variables
The trace events for ftrace are all defined via global variables.
The arrays of events and event systems are linked to a global list.
This prevents multiple users of the event system (what to enable and
what not to).

By adding descriptors to represent the event/file relation, as well
as to which trace_array descriptor they are associated with, allows
for more than one set of events to be defined. Once the trace events
files have a link between the trace event and the trace_array they
are associated with, we can create multiple trace_arrays that can
record separate events in separate buffers.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-15 00:34:40 -04:00
Steven Rostedt (Red Hat) 613f04a0f5 tracing: Prevent buffer overwrite disabled for latency tracers
The latency tracers require the buffers to be in overwrite mode,
otherwise they get screwed up. Force the buffers to stay in overwrite
mode when latency tracers are enabled.

Added a flag_changed() method to the tracer structure to allow
the tracers to see what flags are being changed, and also be able
to prevent the change from happing.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-14 23:40:21 -04:00
Steven Rostedt (Red Hat) 8090282265 tracing: Keep overwrite in sync between regular and snapshot buffers
Changing the overwrite mode for the ring buffer via the trace
option only sets the normal buffer. But the snapshot buffer could
swap with it, and then the snapshot would be in non overwrite mode
and the normal buffer would be in overwrite mode, even though the
option flag states otherwise.

Keep the two buffers overwrite modes in sync.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-14 23:40:15 -04:00
Steven Rostedt (Red Hat) 69d34da298 tracing: Protect tracer flags with trace_types_lock
Seems that the tracer flags have never been protected from
synchronous writes. Luckily, admins don't usually modify the
tracing flags via two different tasks. But if scripts were to
be used to modify them, then they could get corrupted.

Move the trace_types_lock that protects against tracers changing
to also protect the flags being set.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-14 13:50:56 -04:00
Steven Rostedt (Red Hat) 740466bc89 tracing: Fix free of probe entry by calling call_rcu_sched()
Because function tracing is very invasive, and can even trace
calls to rcu_read_lock(), RCU access in function tracing is done
with preempt_disable_notrace(). This requires a synchronize_sched()
for updates and not a synchronize_rcu().

Function probes (traceon, traceoff, etc) must be freed after
a synchronize_sched() after its entry has been removed from the
hash. But call_rcu() is used. Fix this by using call_rcu_sched().

Also fix the usage to use hlist_del_rcu() instead of hlist_del().

Cc: stable@vger.kernel.org
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-13 17:57:44 -04:00
Steven Rostedt (Red Hat) 2721e72dd1 tracing: Fix race in snapshot swapping
Although the swap is wrapped with a spin_lock, the assignment
of the temp buffer used to swap is not within that lock.
It needs to be moved into that lock, otherwise two swaps
happening on two different CPUs, can end up using the wrong
temp buffer to assign in the swap.

Luckily, all current callers of the swap function appear to have
their own locks. But in case something is added that allows two
different callers to call the swap, then there's a chance that
this race can trigger and corrupt the buffers.

New code is coming soon that will allow for this race to trigger.

I've Cc'd stable, so this bug will not show up if someone backports
one of the changes that can trigger this bug.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-12 11:56:33 -04:00
Steven Rostedt (Red Hat) c9960e4854 tracing: Do not return EINVAL in snapshot when not allocated
To use the tracing snapshot feature, writing a '1' into the snapshot
file causes the snapshot buffer to be allocated if it has not already
been allocated and dose a 'swap' with the main buffer, so that the
snapshot now contains what was in the main buffer, and the main buffer
now writes to what was the snapshot buffer.

To free the snapshot buffer, a '0' is written into the snapshot file.

To clear the snapshot buffer, any number but a '0' or '1' is written
into the snapshot file. But if the file is not allocated it returns
-EINVAL error code. This is rather pointless. It is better just to
do nothing and return success.

Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-07 10:31:38 -05:00
Steven Rostedt (Red Hat) d8741e2e88 tracing: Add help of snapshot feature when snapshot is empty
When cat'ing the snapshot file, instead of showing an empty trace
header like the trace file does, show how to use the snapshot
feature.

Also, this is a good place to show if the snapshot has been allocated
or not. Users may want to "pre allocate" the snapshot to have a fast
"swap" of the current buffer. Otherwise, a swap would be slow and might
fail as it would need to allocate the snapshot buffer, and that might
fail under tight memory constraints.

Here's what it looked like before:

 # tracer: nop
 #
 # entries-in-buffer/entries-written: 0/0   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |

Here's what it looks like now:

 # tracer: nop
 #
 #
 # * Snapshot is freed *
 #
 # Snapshot commands:
 # echo 0 > snapshot : Clears and frees snapshot buffer
 # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
 #                      Takes a snapshot of the main buffer.
 # echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
 #                      (Doesn't have to be '2' works with any number that
 #                       is not a '0' or '1')

Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-03-07 10:31:22 -05:00
Steven Rostedt db05021d49 ftrace: Update the kconfig for DYNAMIC_FTRACE
The prompt to enable DYNAMIC_FTRACE (the ability to nop and
enable function tracing at run time) had a confusing statement:

 "enable/disable ftrace tracepoints dynamically"

This was written before tracepoints were added to the kernel,
but now that tracepoints have been added, this is very confusing
and has confused people enough to give wrong information during
presentations.

Not only that, I looked at the help text, and it still references
that dreaded daemon that use to wake up once a second to update
the nop locations and brick NICs, that hasn't been around for over
five years.

Time to bring the text up to the current decade.

Cc: stable@vger.kernel.org
Reported-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-27 21:58:01 -05:00
Ingo Molnar ff1fb5f6b4 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull two fixes from Steven Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-20 11:26:21 +01:00
Linus Torvalds ece8e0b2f9 Merge branch 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull async changes from Tejun Heo:
 "These are followups for the earlier deadlock issue involving async
  ending up waiting for itself through block requesting module[1].  The
  following changes are made by these commits.

   - Instead of requesting default elevator on each request_queue init,
     block now requests it once early during boot.

   - Kmod triggers warning if invoked from an async worker.

   - Async synchronization implementation has been reimplemented.  It's
     a lot simpler now."

* 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  async: initialise list heads to fix crash
  async: replace list of active domains with global list of pending items
  async: keep pending tasks on async_domain and remove async_pending
  async: use ULLONG_MAX for infinity cookie value
  async: bring sanity to the use of words domain and running
  async, kmod: warn on synchronous request_module() from async workers
  block: don't request module during elevator init
  init, block: try to load default elevator module early during boot
2013-02-19 22:10:26 -08:00
Linus Torvalds 67cb104b4c Merge branch 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "A lot of reorganization is going on mostly to prepare for worker pools
  with custom attributes so that workqueue can replace custom pool
  implementations in places including writeback and btrfs and make CPU
  assignment in crypto more flexible.

  workqueue evolved from purely per-cpu design and implementation, so
  there are a lot of assumptions regarding being bound to CPUs and even
  unbound workqueues are implemented as an extension of the model -
  workqueues running on the special unbound CPU.  Bulk of changes this
  round are about promoting worker_pools as the top level abstraction
  replacing global_cwq (global cpu workqueue).  At this point, I'm
  fairly confident about getting custom worker pools working pretty soon
  and ready for the next merge window.

  Lai's patches are replacing the convoluted mb() dancing workqueue has
  been doing with much simpler mechanism which only depends on
  assignment atomicity of long.  For details, please read the commit
  message of 0b3dae68ac ("workqueue: simplify is-work-item-queued-here
  test").  While the change ends up adding one pointer to struct
  delayed_work, the inflation in percentage is less than five percent
  and it decouples delayed_work logic a lot more cleaner from usual work
  handling, removes the unusual memory barrier dancing, and allows for
  further simplification, so I think the trade-off is acceptable.

  There will be two more workqueue related pull requests and there are
  some shared commits among them.  I'll write further pull requests
  assuming this pull request is pulled first."

* 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (37 commits)
  workqueue: un-GPL function delayed_work_timer_fn()
  workqueue: rename cpu_workqueue to pool_workqueue
  workqueue: reimplement is_chained_work() using current_wq_worker()
  workqueue: fix is_chained_work() regression
  workqueue: pick cwq instead of pool in __queue_work()
  workqueue: make get_work_pool_id() cheaper
  workqueue: move nr_running into worker_pool
  workqueue: cosmetic update in try_to_grab_pending()
  workqueue: simplify is-work-item-queued-here test
  workqueue: make work->data point to pool after try_to_grab_pending()
  workqueue: add delayed_work->wq to simplify reentrancy handling
  workqueue: make work_busy() test WORK_STRUCT_PENDING first
  workqueue: replace WORK_CPU_NONE/LAST with WORK_CPU_END
  workqueue: post global_cwq removal cleanups
  workqueue: rename nr_running variables
  workqueue: remove global_cwq
  workqueue: remove worker_pool->gcwq
  workqueue: replace for_each_worker_pool() with for_each_std_worker_pool()
  workqueue: make freezing/thawing per-pool
  workqueue: make hotplug processing per-pool
  ...
2013-02-19 22:01:33 -08:00
Linus Torvalds 1eaec8212e Merge branch 'for-3.9-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue [delayed_]work_pending() cleanups from Tejun Heo:
 "This is part of on-going cleanups to remove / minimize usages of
  workqueue interfaces which are deprecated and/or misleading.

  This round drops a number of usages of [delayed_]work_pending(), which
  are dangerous as they lack any form of synchronization and thus often
  lead to buggy / unnecessary code.  There are a couple legitimate use
  cases in kernel.  Hopefully, they can be converted and
  [delayed_]work_pending() can be removed completely.  Even if not,
  removing most of misuses should make it more difficult to find
  examples of misuses and thus slow down growth of them.

  These changes are independent from other workqueue changes."

* 'for-3.9-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  wimax/i2400m: fix i2400m->wake_tx_skb handling
  kprobes: fix wait_for_kprobe_optimizer()
  ipw2x00: simplify scan_event handling
  video/exynos: don't use [delayed_]work_pending()
  tty/max3100: don't use [delayed_]work_pending()
  x86/mce: don't use [delayed_]work_pending()
  rfkill: don't use [delayed_]work_pending()
  wl1251: don't use [delayed_]work_pending()
  thinkpad_acpi: don't use [delayed_]work_pending()
  mwifiex: don't use [delayed_]work_pending()
  sja1000: don't use [delayed_]work_pending()
2013-02-19 21:58:52 -08:00
Linus Torvalds 121027a7a6 Merge branch 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull two x86 kernel build changes from Ingo Molnar:
 "The first change modifies how 'make oldconfig' works on cross-bitness
  situations on x86.  It was felt the new behavior of preserving the
  bitness of the .config is more logical.  This is a leftover of the
  merge.

  The second change eliminates a Perl warning.  (There's another, more
  complete fix resulting of this warning fix, which second fix in flight
  to you via the kbuild tree, which will remove the timeconst.pl script
  altogether.)"

* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timeconst.pl: Eliminate Perl warning
  x86: Default to ARCH=x86 to avoid overriding CONFIG_64BIT
2013-02-19 19:12:03 -08:00
Linus Torvalds 5800700f66 Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/apic changes from Ingo Molnar:
 "Main changes:

   - Multiple MSI support added to the APIC, PCI and AHCI code - acked
     by all relevant maintainers, by Alexander Gordeev.

     The advantage is that multiple AHCI ports can have multiple MSI
     irqs assigned, and can thus spread to multiple CPUs.

     [ Drivers can make use of this new facility via the
       pci_enable_msi_block_auto() method ]

   - x86 IOAPIC code from interrupt remapping cleanups from Joerg
     Roedel:

     These patches move all interrupt remapping specific checks out of
     the x86 core code and replaces the respective call-sites with
     function pointers.  As a result the interrupt remapping code is
     better abstraced from x86 core interrupt handling code.

   - Various smaller improvements, fixes and cleanups."

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  x86/intel/irq_remapping: Clean up x2apic opt-out security warning mess
  x86, kvm: Fix intialization warnings in kvm.c
  x86, irq: Move irq_remapped out of x86 core code
  x86, io_apic: Introduce eoi_ioapic_pin call-back
  x86, msi: Introduce x86_msi.compose_msi_msg call-back
  x86, irq: Introduce setup_remapped_irq()
  x86, irq: Move irq_remapped() check into free_remapped_irq
  x86, io-apic: Remove !irq_remapped() check from __target_IO_APIC_irq()
  x86, io-apic: Move CONFIG_IRQ_REMAP code out of x86 core
  x86, irq: Add data structure to keep AMD specific irq remapping information
  x86, irq: Move irq_remapping_enabled declaration to iommu code
  x86, io_apic: Remove irq_remapping_enabled check in setup_timer_IRQ0_pin
  x86, io_apic: Move irq_remapping_enabled checks out of check_timer()
  x86, io_apic: Convert setup_ioapic_entry to function pointer
  x86, io_apic: Introduce set_affinity function pointer
  x86, msi: Use IRQ remapping specific setup_msi_irqs routine
  x86, hpet: Introduce x86_msi_ops.setup_hpet_msi
  x86, io_apic: Introduce x86_io_apic_ops.print_entries for debugging
  x86, io_apic: Introduce x86_io_apic_ops.disable()
  x86, apic: Mask IO-APIC and PIC unconditionally on LAPIC resume
  ...
2013-02-19 19:07:27 -08:00
Linus Torvalds 266d7ad7f4 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer changes from Ingo Molnar:
 "Main changes:

   - ntp: Add CONFIG_RTC_SYSTOHC: a generic RTC driver facility
     complementing the existing CONFIG_RTC_HCTOSYS, which uses NTP to
     keep the hardware clock updated.

   - posix-timers: Fix clock_adjtime to always return timex data on
     success.  This is changing the ABI, but no breakage was expected
     and found - caution is warranted nevertheless.

   - platform persistent clock improvements/cleanups.

   - clockevents: refactor timer broadcast handling to be more generic
     and less duplicated with matching architecture code (mostly ARM
     motivated.)

   - various fixes and cleanups"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/x86/hpet: Use HPET_COUNTER to specify the hpet counter in vread_hpet()
  posix-cpu-timers: Fix nanosleep task_struct leak
  clockevents: Fix generic broadcast for FEAT_C3STOP
  time, Fix setting of hardware clock in NTP code
  hrtimer: Prevent hrtimer_enqueue_reprogram race
  clockevents: Add generic timer broadcast function
  clockevents: Add generic timer broadcast receiver
  timekeeping: Switch HAS_PERSISTENT_CLOCK to ALWAYS_USE_PERSISTENT_CLOCK
  x86/time/rtc: Don't print extended CMOS year when reading RTC
  x86: Select HAS_PERSISTENT_CLOCK on x86
  timekeeping: Add CONFIG_HAS_PERSISTENT_CLOCK option
  rtc: Skip the suspend/resume handling if persistent clock exist
  timekeeping: Add persistent_clock_exist flag
  posix-timers: Fix clock_adjtime to always return timex data on success
  Round the calculated scale factor in set_cyc2ns_scale()
  NTP: Add a CONFIG_RTC_SYSTOHC configuration
  MAINTAINERS: Update John Stultz's email
  time: create __getnstimeofday for WARNless calls
2013-02-19 19:05:45 -08:00
Linus Torvalds bcbd818c06 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull preparatory smp/hotplug patches from Ingo Molnar:
 "Some early preparatory changes for the WIP hotplug rework by Thomas
  Gleixner."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stop_machine: Use smpboot threads
  stop_machine: Store task reference in a separate per cpu variable
  smpboot: Allow selfparking per cpu threads
2013-02-19 19:04:55 -08:00
Linus Torvalds d652e1eb8e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Main changes:

   - scheduler side full-dynticks (user-space execution is undisturbed
     and receives no timer IRQs) preparation changes that convert the
     cputime accounting code to be full-dynticks ready, from Frederic
     Weisbecker.

   - Initial sched.h split-up changes, by Clark Williams

   - select_idle_sibling() performance improvement by Mike Galbraith:

        " 1 tbench pair (worst case) in a 10 core + SMT package:

          pre   15.22 MB/sec 1 procs
          post 252.01 MB/sec 1 procs "

  - sched_rr_get_interval() ABI fix/change.  We think this detail is not
    used by apps (so it's not an ABI in practice), but lets keep it
    under observation.

  - misc RT scheduling cleanups, optimizations"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
  cputime: Remove irqsave from seqlock readers
  sched, powerpc: Fix sched.h split-up build failure
  cputime: Restore CPU_ACCOUNTING config defaults for PPC64
  sched/rt: Move rt specific bits into new header file
  sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
  sched: Move sched.h sysctl bits into separate header
  sched: Fix signedness bug in yield_to()
  sched: Fix select_idle_sibling() bouncing cow syndrome
  sched/rt: Further simplify pick_rt_task()
  sched/rt: Do not account zero delta_exec in update_curr_rt()
  cputime: Safely read cputime of full dynticks CPUs
  kvm: Prepare to add generic guest entry/exit callbacks
  cputime: Use accessors to read task cputime stats
  cputime: Allow dynamic switch between tick/virtual based cputime accounting
  cputime: Generic on-demand virtual cputime accounting
  cputime: Move default nsecs_to_cputime() to jiffies based cputime file
  cputime: Librarize per nsecs resolution cputime definitions
  cputime: Avoid multiplication overflow on utime scaling
  context_tracking: Export context state for generic vtime
  ...

Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-19 18:19:48 -08:00
Linus Torvalds 8f55cea410 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf changes from Ingo Molnar:
 "There are lots of improvements, the biggest changes are:

  Main kernel side changes:

   - Improve uprobes performance by adding 'pre-filtering' support, by
     Oleg Nesterov.

   - Make some POWER7 events available in sysfs, equivalent to what was
     done on x86, from Sukadev Bhattiprolu.

   - tracing updates by Steve Rostedt - mostly misc fixes and smaller
     improvements.

   - Use perf/event tracing to report PCI Express advanced errors, by
     Tony Luck.

   - Enable northbridge performance counters on AMD family 15h, by Jacob
     Shin.

   - This tracing commit:

        tracing: Remove the extra 4 bytes of padding in events

     changes the ABI.  All involved parties (PowerTop in particular)
     seem to agree that it's safe to do now with the introduction of
     libtraceevent, but the devil is in the details ...

  Main tooling side changes:

   - Add 'event group view', from Namyung Kim:

     To use it, 'perf record' should group events when recording.  And
     then perf report parses the saved group relation from file header
     and prints them together if --group option is provided.  You can
     use the 'perf evlist' command to see event group information:

        $ perf record -e '{ref-cycles,cycles}' noploop 1
        [ perf record: Woken up 2 times to write data ]
        [ perf record: Captured and wrote 0.385 MB perf.data (~16807 samples) ]

        $ perf evlist --group
        {ref-cycles,cycles}

     With this example, default perf report will show you each event
     separately.

     You can use --group option to enable event group view:

        $ perf report --group
        ...
        # group: {ref-cycles,cycles}
        # ========
        # Samples: 7K of event 'anon group { ref-cycles, cycles }'
        # Event count (approx.): 6876107743
        #
        #         Overhead  Command      Shared Object                      Symbol
        # ................  .......  .................  ..........................
            99.84%  99.76%  noploop  noploop            [.] main
             0.07%   0.00%  noploop  ld-2.15.so         [.] strcmp
             0.03%   0.00%  noploop  [kernel.kallsyms]  [k] timerqueue_del
             0.03%   0.03%  noploop  [kernel.kallsyms]  [k] sched_clock_cpu
             0.02%   0.00%  noploop  [kernel.kallsyms]  [k] account_user_time
             0.01%   0.00%  noploop  [kernel.kallsyms]  [k] __alloc_pages_nodemask
             0.00%   0.00%  noploop  [kernel.kallsyms]  [k] native_write_msr_safe
             0.00%   0.11%  noploop  [kernel.kallsyms]  [k] _raw_spin_lock
             0.00%   0.06%  noploop  [kernel.kallsyms]  [k] find_get_page
             0.00%   0.02%  noploop  [kernel.kallsyms]  [k] rcu_check_callbacks
             0.00%   0.02%  noploop  [kernel.kallsyms]  [k] __current_kernel_time

     As you can see the Overhead column now contains both of ref-cycles
     and cycles and header line shows group information also - 'anon
     group { ref-cycles, cycles }'.  The output is sorted by period of
     group leader first.

   - Initial GTK+ annotate browser, from Namhyung Kim.

   - Add option for runtime switching perf data file in perf report,
     just press 's' and a menu with the valid files found in the current
     directory will be presented, from Feng Tang.

   - Add support to display whole group data for raw columns, from Jiri
     Olsa.

   - Add per processor socket count aggregation in perf stat, from
     Stephane Eranian.

   - Add interval printing in 'perf stat', from Stephane Eranian.

   - 'perf test' improvements

   - Add support for wildcards in tracepoint system name, from Jiri
     Olsa.

   - Add anonymous huge page recognition, from Joshua Zhu.

   - perf build-id cache now can show DSOs present in a perf.data file
     that are not in the cache, to integrate with build-id servers being
     put in place by organizations such as Fedora.

   - perf top now shares more of the evsel config/creation routines with
     'record', paving the way for further integration like 'top'
     snapshots, etc.

   - perf top now supports DWARF callchains.

   - Fix mmap limitations on 32-bit, fix from David Miller.

   - 'perf bench numa mem' NUMA performance measurement suite

   - ... and lots of fixes, performance improvements, cleanups and other
     improvements I failed to list - see the shortlog and git log for
     details."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (270 commits)
  perf/x86/amd: Enable northbridge performance counters on AMD family 15h
  perf/hwbp: Fix cleanup in case of kzalloc failure
  perf tools: Fix build with bison 2.3 and older.
  perf tools: Limit unwind support to x86 archs
  perf annotate: Make it to be able to skip unannotatable symbols
  perf gtk/annotate: Fail early if it can't annotate
  perf gtk/annotate: Show source lines with gray color
  perf gtk/annotate: Support multiple event annotation
  perf ui/gtk: Implement basic GTK2 annotation browser
  perf annotate: Fix warning message on a missing vmlinux
  perf buildid-cache: Add --update option
  uprobes/perf: Avoid uprobe_apply() whenever possible
  uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
  uprobes/perf: Teach trace_uprobe/perf code to pre-filter
  uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
  uprobes: Introduce uprobe_apply()
  perf: Introduce hw_perf_event->tp_target and ->tp_list
  uprobes/perf: Always increment trace_uprobe->nhit
  uprobes/tracing: Kill uprobe_trace_consumer, embed uprobe_consumer into trace_uprobe
  uprobes/tracing: Introduce is_trace_uprobe_enabled()
  ...
2013-02-19 17:49:41 -08:00
Linus Torvalds b7133a9a10 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core changes from Ingo Molnar:
 "The biggest changes are the IRQ-work and printk changes from Frederic
  Weisbecker, which prepare the code for 'full dynticks' (the ability to
  stop or slow down the periodic tick arbitrarily, not just in idle time
  as today):

   - Don't stop tick with irq works pending.  This fix is generally
     useful and concerns archs that can't raise self IPIs.

   - Flush irq works before CPU offlining.

   - Introduce "lazy" irq works that can wait for the next tick to be
     executed, unless it's stopped.

   - Implement klogd wake up using irq work.  This removes the ad-hoc
     printk_tick()/printk_needs_cpu() hooks and make it working even in
     dynticks mode.

   - Cleanups and fixes."

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Export enable/disable_percpu_irq()
  arch Kconfig: Remove references to IRQ_PER_CPU
  irq_work: Remove return value from the irq_work_queue() function
  genirq: Avoid deadlock in spurious handling
  printk: Wake up klogd using irq_work
  irq_work: Make self-IPIs optable
  irq_work: Warn if there's still work on cpu_down
  irq_work: Flush work on CPU_DYING
  irq_work: Don't stop the tick with pending works
  nohz: Add API to check tick state
  irq_work: Remove CONFIG_HAVE_IRQ_WORK
  irq_work: Fix racy check on work pending flag
  irq_work: Fix racy IRQ_WORK_BUSY flag setting
2013-02-19 17:47:58 -08:00
Linus Torvalds e84cf5d0fd Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
 "SRCU changes:

   - These include debugging aids, updates that move towards the goal of
     permitting srcu_read_lock() and srcu_read_unlock() to be used from
     idle and offline CPUs, and a few small fixes.

  Changes to rcutorture and to RCU documentation:

   - Posted to LKML at https://lkml.org/lkml/2013/1/26/188

  Enhancements to uniprocessor handling in tiny RCU:

   - Posted to LKML at https://lkml.org/lkml/2013/1/27/2

  Tag RCU callbacks with grace-period number to simplify callback
  advancement:

   - Posted to LKML at https://lkml.org/lkml/2013/1/26/203

  Miscellaneous fixes:

   - Posted to LKML at https://lkml.org/lkml/2013/1/26/204"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  srcu: use ACCESS_ONCE() to access sp->completed in srcu_read_lock()
  srcu: Update synchronize_srcu_expedited()'s comments
  srcu: Update synchronize_srcu()'s comments
  srcu: Remove checks preventing idle CPUs from calling srcu_read_lock()
  srcu: Remove checks preventing offline CPUs from calling srcu_read_lock()
  srcu: Simple cleanup for cleanup_srcu_struct()
  srcu: Add might_sleep() annotation to synchronize_srcu()
  srcu: Simplify __srcu_read_unlock() via this_cpu_dec()
  rcu: Allow rcutorture to be built at low optimization levels
  rcu: Make rcutorture's shuffler task shuffle recently added tasks
  rcu: Allow TREE_PREEMPT_RCU on UP systems
  rcu: Provide RCU CPU stall warnings for tiny RCU
  context_tracking: Add comments on interface and internals
  rcu: Remove obsolete Kconfig option from comment
  rcu: Remove unused code originally used for context tracking
  rcu: Consolidate debugging Kconfig options
  rcu: Correct 'optimized' to 'optimize' in header comment
  rcu: Trace callback acceleration
  rcu: Tag callback lists with corresponding grace-period number
  rcutorture: Don't compare ptr with 0
  ...
2013-02-19 17:45:20 -08:00
Konstantin Khlebnikov 1438ade567 workqueue: un-GPL function delayed_work_timer_fn()
commit d8e794dfd5 ("workqueue: set
delayed_work->timer function on initialization") exports function
delayed_work_timer_fn() only for GPL modules. This makes delayed-works
unusable for non-GPL modules, because initialization macro now requires
GPL symbol. For example schedule_delayed_work() available for non-GPL.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # 3.7
2013-02-19 10:09:13 -08:00
Thomas Gleixner cdc4e86b58 cputime: Remove irqsave from seqlock readers
The reader side code has no requirement to disable interrupts while
sampling data. The sequence counter is enough to ensure consistency.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-19 08:05:53 +01:00
Steven Rostedt (Red Hat) 8c189ea64e ftrace: Call ftrace cleanup module notifier after all other notifiers
Commit: c1bf08ac "ftrace: Be first to run code modification on modules"

changed ftrace module notifier's priority to INT_MAX in order to
process the ftrace nops before anything else could touch them
(namely kprobes). This was the correct thing to do.

Unfortunately, the ftrace module notifier also contains the ftrace
clean up code. As opposed to the set up code, this code should be
run *after* all the module notifiers have run in case a module is doing
correct clean-up and unregisters its ftrace hooks. Basically, ftrace
needs to do clean up on module removal, as it needs to know about code
being removed so that it doesn't try to modify that code. But after it
removes the module from its records, if a ftrace user tries to remove
a probe, that removal will fail due as the record of that code segment
no longer exists.

Nothing really bad happens if the probe removal is called after ftrace
did the clean up, but the ftrace removal function will return an error.
Correct code (such as kprobes) will produce a WARN_ON() if it fails
to remove the probe. As people get annoyed by frivolous warnings, it's
best to do the ftrace clean up after everything else.

By splitting the ftrace_module_notifier into two notifiers, one that
does the module load setup that is run at high priority, and the other
that is called for module clean up that is run at low priority, the
problem is solved.

Cc: stable@vger.kernel.org
Reported-by: Frank Ch. Eigler <fche@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-18 23:09:26 -05:00
Chris Metcalf 36a5df85e9 genirq: Export enable/disable_percpu_irq()
These functions are used by the tilegx onchip network driver, and it's
useful to be able to load that driver as a module.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Link: http://lkml.kernel.org/r/201302012043.r11KhNZF024371@farm-0021.internal.tilera.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-18 21:42:25 +01:00
Stanislaw Gruszka e6c42c295e posix-cpu-timers: Fix nanosleep task_struct leak
The trinity fuzzer triggered a task_struct reference leak via
clock_nanosleep with CPU_TIMERs. do_cpu_nanosleep() calls
posic_cpu_timer_create(), but misses a corresponding
posix_cpu_timer_del() which leads to the task_struct reference leak.

Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20130215100810.GF4392@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-15 11:41:56 +01:00
Daniel Baluta 02e176af92 perf/hwbp: Fix cleanup in case of kzalloc failure
Obviously this is a typo and could result in memory leaks if kzalloc
fails on a given cpu.

Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1360186160-7566-1-git-send-email-dbaluta@ixiacom.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-02-14 17:06:39 -03:00
Thomas Gleixner 9f4646d283 Merge branch 'fortglx/3.9/time' of git://git.linaro.org/people/jstultz/linux into timers/core 2013-02-14 19:46:10 +01:00
Thomas Gleixner 14e568e78f stop_machine: Use smpboot threads
Use the smpboot thread infrastructure. Mark the stopper thread
selfparking and park it after it has finished the take_cpu_down()
work.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Weinberger <rw@linutronix.de>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130131120741.686315164@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-14 15:29:38 +01:00
Thomas Gleixner 860a0ffaa3 stop_machine: Store task reference in a separate per cpu variable
To allow the stopper thread being managed by the smpboot thread
infrastructure separate out the task storage from the stopper data
structure.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Weinberger <rw@linutronix.de>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130131120741.626690384@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-14 15:29:37 +01:00
Thomas Gleixner 7d7e499f73 smpboot: Allow selfparking per cpu threads
The stop machine threads are still killed when a cpu goes offline. The
reason is that the thread is used to bring the cpu down, so it can't
be parked along with the other per cpu threads.

Allow a per cpu thread to be excluded from automatic parking, so it
can park itself once it's done

Add a create callback function as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Arjan van de Veen <arjan@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Weinberger <rw@linutronix.de>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130131120741.553993267@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-14 15:29:37 +01:00
Tejun Heo 112202d909 workqueue: rename cpu_workqueue to pool_workqueue
workqueue has moved away from global_cwqs to worker_pools and with the
scheduled custom worker pools, wforkqueues will be associated with
pools which don't have anything to do with CPUs.  The workqueue code
went through significant amount of changes recently and mass renaming
isn't likely to hurt much additionally.  Let's replace 'cpu' with
'pool' so that it reflects the current design.

* s/struct cpu_workqueue_struct/struct pool_workqueue/
* s/cpu_wq/pool_wq/
* s/cwq/pwq/

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-13 19:29:12 -08:00
Tejun Heo 8d03ecfe47 workqueue: reimplement is_chained_work() using current_wq_worker()
is_chained_work() was added before current_wq_worker() and implemented
its own ham-fisted way of finding out whether %current is a workqueue
worker - it iterates through all possible workers.

Drop the custom implementation and reimplement using
current_wq_worker().

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-13 19:29:10 -08:00
Tejun Heo 1dd638149f workqueue: fix is_chained_work() regression
c9e7cf273f ("workqueue: move busy_hash from global_cwq to
worker_pool") incorrectly converted is_chained_work() to use
get_gcwq() inside for_each_gcwq_cpu() while removing get_gcwq().

As cwq might not exist for all possible workqueue CPUs, @cwq can be
NULL and the following cwq deferences can lead to oops.

Fix it by using for_each_cwq_cpu() instead, which is the better one to
use anyway as we only need to check pools that the wq is associated
with.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-13 19:29:07 -08:00
Steven Rostedt f431b634f2 tracing/syscalls: Allow archs to ignore tracing compat syscalls
The tracing of ia32 compat system calls has been a bit of a pain as they
use different system call numbers than the 64bit equivalents.

I wrote a simple 'lls' program that lists files. I compiled it as a i686
ELF binary and ran it under a x86_64 box. This is the result:

echo 0 > /debug/tracing/tracing_on
echo 1 > /debug/tracing/events/syscalls/enable
echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on

grep lls /debug/tracing/trace

[.. skipping calls before TS_COMPAT is set ...]

             lls-1127  [005] d...   936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420)
             lls-1127  [005] d...   936.409190: sys_recvfrom -> 0x8a77000
             lls-1127  [005] d...   936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22)
             lls-1127  [005] d...   936.409215: sys_lgetxattr -> 0xf76ff000
             lls-1127  [005] d...   936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4)
             lls-1127  [005] d...   936.409228: sys_dup2 -> 0xfffffffffffffffe
             lls-1127  [005] d...   936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000)
             lls-1127  [005] d...   936.409242: sys_newfstat -> 0x3
             lls-1127  [005] d...   936.409243: sys_removexattr(pathname: 3, name: ffcd0060)
             lls-1127  [005] d...   936.409244: sys_removexattr -> 0x0
             lls-1127  [005] d...   936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2)
             lls-1127  [005] d...   936.409248: sys_lgetxattr -> 0xf76e5000
             lls-1127  [005] d...   936.409248: sys_newlstat(filename: 3, statbuf: 19614)
             lls-1127  [005] d...   936.409249: sys_newlstat -> 0x0
             lls-1127  [005] d...   936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000)
             lls-1127  [005] d...   936.409279: sys_newfstat -> 0x3
             lls-1127  [005] d...   936.409279: sys_close(fd: 3)
             lls-1127  [005] d...   936.421550: sys_close -> 0x200
             lls-1127  [005] d...   936.421558: sys_removexattr(pathname: 3, name: ffcd00d0)
             lls-1127  [005] d...   936.421560: sys_removexattr -> 0x0
             lls-1127  [005] d...   936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802)
             lls-1127  [005] d...   936.421574: sys_lgetxattr -> 0x4d564000
             lls-1127  [005] d...   936.421575: sys_capget(header: 4d70f000, dataptr: 1000)
             lls-1127  [005] d...   936.421580: sys_capget -> 0x0
             lls-1127  [005] d...   936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812)
             lls-1127  [005] d...   936.421589: sys_lgetxattr -> 0x4d710000
             lls-1127  [005] d...   936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32)
             lls-1127  [005] d...   936.426141: sys_lgetxattr -> 0x4d713000
             lls-1127  [005] d...   936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0)
             lls-1127  [005] d...   936.426146: sys_newlstat -> 0x0
             lls-1127  [005] d...   936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22)

Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are
obviously incorrect, and confusing.

Other efforts have been made to fix this:

https://lkml.org/lkml/2012/3/26/367

But the real solution is to rewrite the syscall internals and come up
with a fixed solution. One that doesn't require all the kluge that the
current solution has.

Thus for now, instead of outputting incorrect data, simply ignore them.
With this patch the changes now have:

 #> grep lls /debug/tracing/trace
 #>

Compat system calls simply are not traced. If users need compat
syscalls, then they should just use the raw syscall tracepoints.

For an architecture to make their compat syscalls ignored, it must
define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also
define an arch_trace_is_compat_syscall() function that will return true
if the current task should ignore tracing the syscall.

I want to stress that this change does not affect actual syscalls in any
way, shape or form. It is only used within the tracing system and
doesn't interfere with the syscall logic at all. The changes are
consolidated nicely into trace_syscalls.c and asm/ftrace.h.

I had to make one small modification to asm/thread_info.h and that was
to remove the include of asm/ftrace.h. As asm/ftrace.h required the
current_thread_info() it was causing include hell. That include was
added back in 2008 when the function graph tracer was added:

 commit caf4b323 "tracing, x86: add low level support for ftrace return tracing"

It does not need to be included there.

Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home

Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-12 17:46:28 -05:00
Eric W. Biederman 6e6668845f kernel/pid.c: reenable interrupts when alloc_pid() fails because init has exited
We're forgetting to reenable local interrupts on an error path.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Josh Boyer <jwboyer@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12 14:34:00 -08:00
Thomas Gleixner 86c8ead593 Merge branch 'timers/for-arm' into timers/core 2013-02-12 20:22:56 +01:00
Mark Rutland 5d1d9a29bc clockevents: Fix generic broadcast for FEAT_C3STOP
Commit 12ad100046: "clockevents: Add generic timer broadcast function"
made tick_device_uses_broadcast set up the generic broadcast function
for dummy devices (where !tick_device_is_functional(dev)), but neglected
to set up the broadcast function for devices that stop in low power
states (with the CLOCK_EVT_FEAT_C3STOP flag).

When these devices enter low power states they will not have the generic
broadcast function assigned, and will bring down the system when an
attempt is made to broadcast to them.

This patch ensures that the broadcast function is also assigned for
devices which require broadcast in low power states.

Reported-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Stephen Warren <swarren@nvidia.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: nico@linaro.org
Cc: Marc.Zyngier@arm.com
Cc: Will.Deacon@arm.com
Cc: santosh.shilimkar@ti.com
Cc: john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-12 20:22:28 +01:00
Tejun Heo ad72b3bea7 kprobes: fix wait_for_kprobe_optimizer()
wait_for_kprobe_optimizer() seems largely broken.  It uses
optimizer_comp which is never re-initialized, so
wait_for_kprobe_optimizer() will never wait for anything once
kprobe_optimizer() finishes all pending jobs for the first time.

Also, aside from completion, delayed_work_pending() is %false once
kprobe_optimizer() starts execution and wait_for_kprobe_optimizer()
won't wait for it.

Reimplement it so that it flushes optimizing_work until
[un]optimizing_lists are empty.  Note that this also makes
optimizing_work execute immediately if someone's waiting for it, which
is the nicer behavior.

Only compile tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
2013-02-09 11:32:42 -08:00
Prarit Bhargava 84e345e4e2 time, Fix setting of hardware clock in NTP code
At init time, if the system time is "warped" forward in warp_clock()
it will differ from the hardware clock by sys_tz.tz_minuteswest.  This time
difference is not taken into account when ntp updates the hardware clock,
and this causes the system time to jump forward by this offset every reboot.

The kernel must take this offset into account when writing the system time
to the hardware clock in the ntp code.  This patch adds
persistent_clock_is_local which indicates that an offset has been applied
in warp_clock() and accounts for the "warp" before writing the hardware
clock.

x86 does not have this problem as rtc writes are software limited to a
+/-15 minute window relative to the current rtc time.  Other arches, such
as powerpc, however do a full synchronization of the system time to the
rtc and will see this problem.

[v2]: generated against tip/timers/core

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-02-08 15:07:05 -08:00
Oleg Nesterov b2fe8ba674 uprobes/perf: Avoid uprobe_apply() whenever possible
uprobe_perf_open/close call the costly uprobe_apply() every time,
we can avoid it if:

	- "nr_systemwide != 0" is not changed.

	- There is another process/thread with the same ->mm.

	- copy_proccess() does inherit_event(). dup_mmap() preserves the
	  inserted breakpoints.

	- event->attr.enable_on_exec == T, we can rely on uprobe_mmap()
	  called by exec/mmap paths.

	- tp_target is exiting. Only _close() checks PF_EXITING, I don't
	  think TRACE_REG_PERF_OPEN can hit the dying task too often.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 18:28:08 +01:00
Oleg Nesterov f42d24a1d2 uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
Change uprobe_trace_func() and uprobe_perf_func() to return "int". Change
uprobe_dispatcher() to return "trace_ret | perf_ret" although this is not
needed, currently TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive.

The only functional change is that uprobe_perf_func() checks the filtering
too and returns UPROBE_HANDLER_REMOVE if nobody wants to trace current.

Testing:

	# perf probe -x /lib/libc.so.6 syscall

	# perf record -e probe_libc:syscall -i perl -e 'fork; syscall -1 for 1..10; wait'

	# perf report --show-total-period
		100.00%            10     perl  libc-2.8.so    [.] syscall

Before this patch:

	# cat /sys/kernel/debug/tracing/uprobe_profile
		/lib/libc.so.6 syscall				20

A child process doesn't have a counter, but still it hits this breakoint
"copied" by dup_mmap().

After the patch:

	# cat /sys/kernel/debug/tracing/uprobe_profile
		/lib/libc.so.6 syscall				11

The child process hits this int3 only once and does unapply_uprobe().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 18:28:07 +01:00
Oleg Nesterov 31ba334836 uprobes/perf: Teach trace_uprobe/perf code to pre-filter
Finally implement uprobe_perf_filter() which checks ->nr_systemwide or
->perf_events to figure out whether we need to insert the breakpoint.

uprobe_perf_open/close are changed to do uprobe_apply(true/false) when
the new perf event comes or goes away.

Note that currently this is very suboptimal:

	- uprobe_register() called by TRACE_REG_PERF_REGISTER becomes a
	  heavy nop, consumer->filter() always returns F at this stage.

	  As it was already discussed we need uprobe_register_only() to
	  avoid the costly register_for_each_vma() when possible.

	- uprobe_apply() is oftenly overkill. Unless "nr_systemwide != 0"
	  changes we need uprobe_apply_mm(), unapply_uprobe() is almost
	  what we need.

	- uprobe_apply() can be simply avoided sometimes, see the next
	  changes.

Testing:

	# perf probe -x /lib/libc.so.6 syscall

	# perl -e 'syscall -1 while 1' &
	[1] 530

	# perf record -e probe_libc:syscall perl -e 'syscall -1 for 1..10; sleep 1'

	# perf report --show-total-period
		100.00%            10     perl  libc-2.8.so    [.] syscall

Before this patch:

	# cat /sys/kernel/debug/tracing/uprobe_profile
		/lib/libc.so.6 syscall				79291

A huge ->nrhit == 79291 reflects the fact that the background process
530 constantly hits this breakpoint too, even if doesn't contribute to
the output.

After the patch:

	# cat /sys/kernel/debug/tracing/uprobe_profile
		/lib/libc.so.6 syscall				10

This shows that only the target process was punished by int3.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 18:28:07 +01:00
Oleg Nesterov 736288ba50 uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
Introduce "struct trace_uprobe_filter" which records the "active"
perf_event's attached to ftrace_event_call. For the start we simply
use list_head, we can optimize this later if needed. For example, we
do not really need to record an event with ->parent != NULL, we can
rely on parent->child_list. And we can certainly do some optimizations
for the case when 2 events have the same ->tp_target or tp_target->mm.

Change trace_uprobe_register() to process TRACE_REG_PERF_OPEN/CLOSE
and add/del this perf_event to the list.

We can probably avoid any locking, but lets start with the "obvioulsy
correct" trace_uprobe_filter->rwlock which protects everything.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 18:28:06 +01:00
Oleg Nesterov bdf8647c44 uprobes: Introduce uprobe_apply()
Currently it is not possible to change the filtering constraints after
uprobe_register(), so a consumer can not, say, start to trace a task/mm
which was previously filtered out, or remove the no longer needed bp's.

Introduce uprobe_apply() which simply does register_for_each_vma() again
to consult uprobe_consumer->filter() and install/remove the breakpoints.
The only complication is that register_for_each_vma() can no longer
assume that uprobe->consumers should be consulter if is_register == T,
so we change it to accept "struct uprobe_consumer *new" instead.

Unlike uprobe_register(), uprobe_apply(true) doesn't do "unregister" if
register_for_each_vma() fails, it is up to caller to handle the error.

Note: we probably need to cleanup the current interface, it is strange
that uprobe_apply/unregister need inode/offset. We should either change
uprobe_register() to return "struct uprobe *", or add a private ->uprobe
member in uprobe_consumer. And in the long term uprobe_apply() should
take a single argument, uprobe or consumer, even "bool add" should go
away.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 18:28:04 +01:00
Oleg Nesterov f22c1bb6b4 perf: Introduce hw_perf_event->tp_target and ->tp_list
sys_perf_event_open()->perf_init_event(event) is called before
find_get_context(event), this means that event->ctx == NULL when
class->reg(TRACE_REG_PERF_REGISTER/OPEN) is called and thus it
can't know if this event is per-task or system-wide.

This patch adds hw_perf_event->tp_target for PERF_TYPE_TRACEPOINT,
this is analogous to PERF_TYPE_BREAKPOINT/bp_target we already have.
The patch also moves ->bp_target up so that it can overlap with the
new member, this can help the compiler to generate the better code.

trace_uprobe_register() will use it for prefiltering to avoid the
unnecessary breakpoints in mm's we do not want to trace.

->tp_target doesn't have its own reference, but we can rely on the
fact that either sys_perf_event_open() holds a reference, or it is
equal to event->ctx->task. So this pointer is always valid until
free_event().

Also add the "struct list_head tp_list" into this union. It is not
strictly necessary, but it can simplify the next changes and we can
add it for free.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 18:28:02 +01:00
Oleg Nesterov 1b47aefd9b uprobes/perf: Always increment trace_uprobe->nhit
Move tu->nhit++ from uprobe_trace_func() to uprobe_dispatcher().

->nhit counts how many time we hit the breakpoint inserted by this
uprobe, we do not want to loose this info if uprobe was enabled by
sys_perf_event_open().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 18:24:34 +01:00
Oleg Nesterov a932b7381f uprobes/tracing: Kill uprobe_trace_consumer, embed uprobe_consumer into trace_uprobe
trace_uprobe->consumer and "struct uprobe_trace_consumer" add the
unnecessary indirection and complicate the code for no reason.

This patch simply embeds uprobe_consumer into "struct trace_uprobe",
all other changes only fix the compilation errors.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 18:24:33 +01:00
Oleg Nesterov b64b007797 uprobes/tracing: Introduce is_trace_uprobe_enabled()
probe_event_enable/disable() check tu->consumer != NULL to avoid the
wrong uprobe_register/unregister().

We are going to kill this pointer and "struct uprobe_trace_consumer",
so we add the new helper, is_trace_uprobe_enabled(), which can rely
on TP_FLAG_TRACE/TP_FLAG_PROFILE instead.

Note: the current logic doesn't look optimal, it is not clear why
TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive, we will probably
change this later.

Also kill the unused TP_FLAG_UPROBE.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 18:24:30 +01:00
Oleg Nesterov 7e4e28c539 uprobes/tracing: Ensure inode != NULL in create_trace_uprobe()
probe_event_enable/disable() check tu->inode != NULL at the start.
This is ugly, if igrab() can fail create_trace_uprobe() should not
succeed and "postpone" the failure.

And S_ISREG(inode->i_mode) check added by d24d7dbf is not safe.

Note: alloc_uprobe() should probably check igrab() != NULL as well.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 18:24:14 +01:00
Oleg Nesterov 4161824f18 uprobes/tracing: Fully initialize uprobe_trace_consumer before uprobe_register()
probe_event_enable() does uprobe_register() and only after that sets
utc->tu and tu->consumer/flags. This can race with uprobe_dispatcher()
which can miss these assignments or see them out of order. Nothing
really bad can happen, but this doesn't look clean/safe.

And this does not allow to use uprobe_consumer->filter() we are going
to add, it is called by uprobe_register() and it needs utc->tu.

Change this code to initialize everything before uprobe_register(), and
reset tu->consumer/flags if it fails. We can't race with event_disable(),
the caller holds event_mutex, and if we could the code would be wrong
anyway.

In fact I think uprobe_trace_consumer should die, it buys nothing but
complicates the code. We can simply add uprobe_consumer into trace_uprobe.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 18:10:19 +01:00
Oleg Nesterov 84d7ed799f uprobes/tracing: Fix dentry/mount leak in create_trace_uprobe()
create_trace_uprobe() does kern_path() to find ->d_inode, but forgets
to do path_put(). We can do this right after igrab().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 18:10:17 +01:00
Josh Stone e8440c1458 uprobes: Add exports for module use
The original pull message for uprobes (commit 654443e2) noted:

  This tree includes uprobes support in 'perf probe' - but SystemTap
  (and other tools) can take advantage of user probe points as well.

In order to actually be usable in module-based tools like SystemTap, the
interface needs to be exported.  This patch first adds the obvious
exports for uprobe_register and uprobe_unregister.  Then it also adds
one for task_user_regset_view, which is necessary to get the correct
state of userspace registers.

Signed-off-by: Josh Stone <jistone@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-02-08 17:47:13 +01:00
Oleg Nesterov af4355e91f uprobes: Kill the bogus IS_ERR_VALUE(xol_vaddr) check
utask->xol_vaddr is either zero or valid, remove the bogus
IS_ERR_VALUE() check in xol_free_insn_slot().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:13 +01:00
Oleg Nesterov 608e7427c0 uprobes: Do not allocate current->utask unnecessary
handle_swbp() does get_utask() before can_skip_sstep() for no reason,
we do not need ->utask if can_skip_sstep() succeeds.

Move get_utask() to pre_ssout() who actually starts to use it. Move
the initialization of utask->active_uprobe/state as well. This way
the whole initialization is consolidated in pre_ssout().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
2013-02-08 17:47:12 +01:00
Oleg Nesterov aba51024e7 uprobes: Fix utask->xol_vaddr leak in pre_ssout()
pre_ssout() should do xol_free_insn_slot() if arch_uprobe_pre_xol()
fails, otherwise nobody will free the allocated slot.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:12 +01:00
Oleg Nesterov a6cb3f6d51 uprobes: Do not play with utask in xol_get_insn_slot()
pre_ssout()->xol_get_insn_slot() path is confusing and buggy. This patch
cleanups the code, the next one fixes the bug.

Change xol_get_insn_slot() to only allocate the slot and do nothing more,
move the initialization of utask->xol_vaddr/vaddr into pre_ssout().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:12 +01:00
Oleg Nesterov 5a2df662aa uprobes: Turn add_utask() into get_utask()
Rename add_utask() into get_utask() and change it to allocate on
demand to simplify the caller. Like get_xol_area() it will have
more users.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:12 +01:00
Oleg Nesterov 9b545df809 uprobes: Fold xol_alloc_area() into get_xol_area()
Currently only xol_get_insn_slot() does get_xol_area() + xol_alloc_area(),
but this will have more users and we do not want to copy-and-paste this
code. This patch simply moves xol_alloc_area() into get_xol_area() to
simplify the current and future code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:11 +01:00
Oleg Nesterov c8a8253800 uprobes: Move alloc_page() from xol_add_vma() to xol_alloc_area()
Move alloc_page() from xol_add_vma() to xol_alloc_area() to cleanup
the code. This separates the memory allocations and consolidates the
-EALREADY cleanups and the error handling.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:11 +01:00
Oleg Nesterov 74e59dfc6b uprobes: Change handle_swbp() to expose bp_vaddr to handler_chain()
Change handle_swbp() to set regs->ip = bp_vaddr in advance, this is
what consumer->handler() needs but uprobe_get_swbp_addr() is not
exported.

This also simplifies the code and makes it more consistent across
the supported architectures. handle_swbp() becomes the only caller
of uprobe_get_swbp_addr().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
2013-02-08 17:47:11 +01:00
Oleg Nesterov da1816b1ca uprobes: Teach handler_chain() to filter out the probed task
Currrently the are 2 problems with pre-filtering:

1. It is not possible to add/remove a task (mm) after uprobe_register()

2. A forked child inherits all breakpoints and uprobe_consumer can not
   control this.

This patch does the first step to improve the filtering. handler_chain()
removes the breakpoints installed by this uprobe from current->mm if all
handlers return UPROBE_HANDLER_REMOVE.

Note that handler_chain() relies on ->register_rwsem to avoid the race
with uprobe_register/unregister which can add/del a consumer, or even
remove and then insert the new uprobe at the same address.

Perhaps we will add uprobe_apply_mm(uprobe, mm, is_register) and teach
copy_mm() to do filter(UPROBE_FILTER_FORK), but I think this change makes
sense anyway.

Note: instead of checking the retcode from uc->handler, we could add
uc->filter(UPROBE_FILTER_BPHIT). But I think this is not optimal to
call 2 hooks in a row. This buys nothing, and if handler/filter do
something nontrivial they will probably do the same work twice.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:11 +01:00
Oleg Nesterov 8a7f2fa0de uprobes: Reintroduce uprobe_consumer->filter()
Finally add uprobe_consumer->filter() and change consumer_filter()
to actually call this method.

Note that ->filter() accepts mm_struct, not task_struct. Because:

	1. We do not have for_each_mm_user(mm, task).

	2. Even if we implement for_each_mm_user(), ->filter() can
	   use it itself.

	3. It is not clear who will actually need this interface to
	   do the "nontrivial" filtering.

Another argument is "enum uprobe_filter_ctx", consumer->filter() can
use it to figure out why/where it was called. For example, perhaps
we can add UPROBE_FILTER_PRE_REGISTER used by build_map_info() to
quickly "nack" the unwanted mm's. In this case consumer should know
that it is called under ->i_mmap_mutex.

See the previous discussion at http://marc.info/?t=135214229700002
Perhaps we should pass more arguments, vma/vaddr?

Note: this patch obviously can't help to filter out the child created
by fork(), this will be addressed later.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:10 +01:00
Oleg Nesterov 806a98bdf2 uprobes: Rationalize the usage of filter_chain()
filter_chain() was added into install_breakpoint/remove_breakpoint to
simplify the initial changes but this is sub-optimal.

This patch shifts the callsite to the callers, register_for_each_vma()
and uprobe_mmap(). This way:

- It will be easier to add the new arguments. This is the main reason,
  we can do more optimizations later.

- register_for_each_vma(is_register => true) can be optimized, we only
  need to consult the new consumer. The previous consumers were already
  asked when they called uprobe_register().

This patch also moves the MMF_HAS_UPROBES check from remove_breakpoint(),
this allows to avoid the potentionally costly filter_chain(). Note that
register_for_each_vma(is_register => false) doesn't really need to take
->consumer_rwsem, but I don't think it makes sense to optimize this and
introduce filter_chain_lockless().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:10 +01:00
Oleg Nesterov 66d06dffa5 uprobes: Kill uprobes_mutex[], separate alloc_uprobe() and __uprobe_register()
uprobe_register() and uprobe_unregister() are the only users of
mutex_lock(uprobes_hash(inode)), and the only reason why we can't
simply remove it is that we need to ensure that delete_uprobe() is
not possible after alloc_uprobe() and before consumer_add().

IOW, we need to ensure that when we take uprobe->register_rwsem
this uprobe is still valid and we didn't race with _unregister()
which called delete_uprobe() in between.

With this patch uprobe_register() simply checks uprobe_is_active()
and retries if it hits this very unlikely race. uprobes_mutex[] is
no longer needed and can be removed.

There is another reason for this change, prepare_uprobe() should be
folded into alloc_uprobe() and we do not want to hold the extra locks
around read_mapping_page/etc.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:10 +01:00
Oleg Nesterov 06b7bcd8cb uprobes: Introduce uprobe_is_active()
The lifetime of uprobe->rb_node and uprobe->inode is not refcounted,
delete_uprobe() is called when we detect that uprobe has no consumers,
and it would be deadly wrong to do this twice.

Change delete_uprobe() to WARN() if it was already called. We use
RB_CLEAR_NODE() to mark uprobe "inactive", then RB_EMPTY_NODE() can
be used to detect this case.

RB_EMPTY_NODE() is not used directly, we add the trivial helper for
the next change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:09 +01:00
Oleg Nesterov 441f1eb7db uprobes: Kill uprobe_events, use RB_EMPTY_ROOT() instead
uprobe_events counts the number of uprobes in uprobes_tree but
it is used as a boolean. We can use RB_EMPTY_ROOT() instead.

Probably no_uprobe_events() added by this patch can have more
callers, say, mmf_recalc_uprobes().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:08 +01:00
Oleg Nesterov d4d3ccc6d1 uprobes: Kill uprobe->copy_mutex
Now that ->register_rwsem is safe under ->mmap_sem we can kill
->copy_mutex and abuse down_write(&uprobe->consumer_rwsem).

This makes prepare_uprobe() even more ugly, but we should kill
it anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:08 +01:00
Oleg Nesterov bb929284be uprobes: Kill UPROBE_RUN_HANDLER flag
Simply remove UPROBE_RUN_HANDLER and the corresponding code.

It can only help if uprobe has a single consumer, and in fact
it is no longer needed after handler_chain() was changed to use
->register_rwsem, we simply can not race with uprobe_register().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:06 +01:00
Oleg Nesterov 1ff6fee5e6 uprobes: Change filter_chain() to iterate ->consumers list
Now that it safe to use ->consumer_rwsem under ->mmap_sem we can
almost finish the implementation of filter_chain(). It still lacks
the actual uc->filter(...) call but othewrwise it is ready, just
it pretends that ->filter() always returns true.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:05 +01:00
Oleg Nesterov e591c8d78e uprobes: Introduce uprobe->register_rwsem
Introduce uprobe->register_rwsem. It is taken for writing around
__uprobe_register/unregister.

Change handler_chain() to use this sem rather than consumer_rwsem.

The main reason for this change is that we have the nasty problem
with mmap_sem/consumer_rwsem dependency. filter_chain() needs to
protect uprobe->consumers like handler_chain(), but they can not
use the same lock. filter_chain() can be called under ->mmap_sem
(currently this is always true), but we want to allow ->handler()
to play with the probed task's memory, and this needs ->mmap_sem.

Alternatively we could use srcu, but synchronize_srcu() is very
slow and ->register_rwsem allows us to do more. In particular, we
can teach handler_chain() to do remove_breakpoint() if this bp is
"nacked" by all consumers, we know that we can't race with the
new consumer which does uprobe_register().

See also the next patches. uprobes_mutex[] is almost ready to die.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:03 +01:00
Oleg Nesterov 9a98e03cc1 uprobes: _register() should always do register_for_each_vma(true)
To support the filtering uprobe_register() should do
register_for_each_vma(true) every time the new consumer comes,
we need to install the previously nacked breakpoints.

Note:
	- uprobes_mutex[] should die, what it actually protects is
	  alloc_uprobe().

	- UPROBE_RUN_HANDLER should die too, obviously it can't work
	  unless uprobe has a single consumer. The consumer should
	  serialize with _register/_unregister itself. Or this flag
	  should live in uprobe_consumer->state.

	- Perhaps we can do some optimizations later. For example, if
	  filter_chain() never returns false uprobe can record this
	  fact and avoid the unnecessary register_for_each_vma().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:03 +01:00
Oleg Nesterov 04aab9b200 uprobes: _unregister() should always do register_for_each_vma(false)
uprobe_unregister() removes the breakpoints only if the last consumer
goes away. To support the filtering it should do this every time, we
want to remove the breakpoints which nobody else want to keep.

Note: given that filter_chain() is not actually implemented, this patch
itself doesn't change the behaviour yet, register_for_each_vma(false)
is a heavy "nop" unless there are no more consumers.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:03 +01:00
Oleg Nesterov 63633cbf82 uprobes: Introduce filter_chain()
Add the new helper filter_chain(). Currently it is only placeholder,
the comment explains what is should do. We will change it later to
consult every consumer to decide whether we need to install the swbp.
Until then it works as if any consumer returns true, this matches the
current behavior.

Change install_breakpoint() to call filter_chain() instead of checking
uprobe->consumers != NULL. We obviously need this, and this equally
closes the race with _unregister().

Change remove_breakpoint() to call this helper too. Currently this is
pointless because remove_breakpoint() is only called when the last
consumer goes away, but we will change this.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:02 +01:00
Oleg Nesterov fe20d71f25 uprobes: Kill uprobe_consumer->filter()
uprobe_consumer->filter() is pointless in its current form, kill it.

We will add it back, but with the different signature/semantics. Perhaps
we will even re-introduce the callsite in handler_chain(), but not to
just skip uc->handler().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:02 +01:00
Oleg Nesterov f0744af7d0 uprobes: Kill the pointless inode/uc checks in register/unregister
register/unregister verifies that inode/uc != NULL. For what?
This really looks like "hide the potential problem", the caller
should pass the valid data.

register() also checks uc->next == NULL, probably to prevent the
double-register but the caller can do other stupid/wrong things.
If we do this check, then we should document that uc->next should
be cleared before register() and add BUG_ON().

Also add the small comment about the i_size_read() check.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:47:01 +01:00
Oleg Nesterov bbc33d0593 uprobes: Move __set_bit(UPROBE_SKIP_SSTEP) into alloc_uprobe()
Cosmetic. __set_bit(UPROBE_SKIP_SSTEP) is the part of initialization,
it is not clear why it is set in insert_uprobe().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-02-08 17:46:59 +01:00
H. Peter Anvin 63a3f60341 timeconst.pl: Eliminate Perl warning
defined(@array) is deprecated in Perl and gives off a warning.
Restructure the code to remove that warning.

[ hpa: it would be interesting to revert to the timeconst.bc script.
  It appears that the failures reported by akpm during testing of
  that script was due to a known broken version of make, not a problem
  with bc.  The Makefile rules could probably be restructured to avoid
  the make bug, or it is probably old enough that it doesn't matter. ]

Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
2013-02-07 17:14:08 -08:00
Lai Jiangshan 7a6b55e710 srcu: use ACCESS_ONCE() to access sp->completed in srcu_read_lock()
The old SRCU implementation loads sp->completed within an
RCU-sched section, courtesy of preempt_disable().  This was required
due to the use of synchronize_sched() in the old implemenation's
synchronize_srcu().  However, the new implementation does not rely
on synchronize_sched(), so it in turn does not require the load of
sp->completed and the ->c[] counter to be in a single preempt-disabled
region of code.  This commit therefore moves the sp->completed access
outside of the preempt-disabled region and applies ACCESS_ONCE().

The resulting code is almost as the same as before, but it removes the
now-misleading rcu_dereference_index_check() call.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-07 15:19:36 -08:00
Lai Jiangshan 49271ca606 srcu: Update synchronize_srcu_expedited()'s comments
Because synchronize_srcu_expedited() no longer uses
synchronize_rcu_sched_expedited(), synchronize_srcu_expedited() no longer
indirectly acquires any CPU-hotplug-related locks.  This commit therefore
updates the comments accordingly.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-07 15:17:45 -08:00
Lai Jiangshan 34a64b6bb6 srcu: Update synchronize_srcu()'s comments
The core of SRCU is changed, but synchronize_srcu()'s comments describe
the old algorithm.  This commit therefore updates them to match the
new algorithm.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-07 15:17:01 -08:00
Lai Jiangshan ab4d2986e4 srcu: Simple cleanup for cleanup_srcu_struct()
Pack six lines of code into two lines.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-07 15:10:10 -08:00
Lai Jiangshan 6e6f1b307e srcu: Add might_sleep() annotation to synchronize_srcu()
Although synchronize_srcu() can sleep, it will not sleep if the fast
path succeeds, which means that illegal use of synchronize_rcu()
might go unnoticed.  This commit therefore adds might_sleep(), which
unconditionally catches illegal use of synchronize_rcu() from atomic
context.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-07 15:08:19 -08:00
Lai Jiangshan 5a41344a3d srcu: Simplify __srcu_read_unlock() via this_cpu_dec()
This commit replaces disabling of preemption and decrement of a per-CPU
variable with this_cpu_dec(), which avoids preemption disabling on x86
and shortens the code on all platforms.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-02-07 15:06:25 -08:00
Lai Jiangshan 8594fade39 workqueue: pick cwq instead of pool in __queue_work()
Currently, __queue_work() chooses the pool to queue a work item to and
then determines cwq from the target wq and the chosen pool.  This is a
bit backwards in that we can determine cwq first and simply use
cwq->pool.  This way, we can skip get_std_worker_pool() in queueing
path which will be a hurdle when implementing custom worker pools.

Update __queue_work() such that it chooses the target cwq and then use
cwq->pool instead of the other way around.  While at it, add missing
{} in an if statement.

This patch doesn't introduce any functional changes.

tj: The original patch had two get_cwq() calls - the first to
    determine the pool by doing get_cwq(cpu, wq)->pool and the second
    to determine the matching cwq from get_cwq(pool->cpu, wq).
    Updated the function such that it chooses cwq instead of pool and
    removed the second call.  Rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-07 13:17:51 -08:00
Lai Jiangshan 54d5b7d079 workqueue: make get_work_pool_id() cheaper
get_work_pool_id() currently first obtains pool using get_work_pool()
and then return pool->id.  For an off-queue work item, this involves
obtaining pool ID from worker->data, performing idr_find() to find the
matching pool and then returning its pool->id which of course is the
same as the one which went into idr_find().

Just open code WORK_STRUCT_CWQ case and directly return pool ID from
work->data.

tj: The original patch dropped on-queue work item handling and renamed
    the function to offq_work_pool_id().  There isn't much benefit in
    doing so.  Handling it only requires a single if() and we need at
    least BUG_ON(), which is also a branch, even if we drop on-queue
    handling.  Open code WORK_STRUCT_CWQ case and keep the function in
    line with get_work_pool().  Rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-07 13:14:20 -08:00
Tejun Heo e19e397a85 workqueue: move nr_running into worker_pool
As nr_running is likely to be accessed from other CPUs during
try_to_wake_up(), it was kept outside worker_pool; however, while less
frequent, other fields in worker_pool are accessed from other CPUs
for, e.g., non-reentrancy check.  Also, with recent pool related
changes, accessing nr_running matching the worker_pool isn't as simple
as it used to be.

Move nr_running inside worker_pool.  Keep it aligned to cacheline and
define CPU pools using DEFINE_PER_CPU_SHARED_ALIGNED().  This should
give at least the same cacheline behavior.

get_pool_nr_running() is replaced with direct pool->nr_running
accesses.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
2013-02-07 13:14:20 -08:00
Clark Williams 8bd75c77b7 sched/rt: Move rt specific bits into new header file
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 20:51:08 +01:00
Clark Williams ce0dbbbb30 sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
Add a /proc/sys/kernel scheduler knob named
sched_rr_timeslice_ms that allows global changing of the
SCHED_RR timeslice value. User visable value is in milliseconds
but is stored as jiffies.  Setting to 0 (zero) resets to the
default (currently 100ms).

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 20:51:07 +01:00
Clark Williams cf4aebc292 sched: Move sched.h sysctl bits into separate header
Move the sysctl-related bits from include/linux/sched.h into
a new file: include/linux/sched/sysctl.h. Then update source
files requiring access to those bits by including the new
header file.

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 20:50:54 +01:00
Tejun Heo 1606283622 workqueue: cosmetic update in try_to_grab_pending()
With the recent is-work-queued-here test simplification, the nested
if() in try_to_grab_pending() can be collapsed.  Collapse it.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 0b3dae68ac workqueue: simplify is-work-item-queued-here test
Currently, determining whether a work item is queued on a locked pool
involves somewhat convoluted memory barrier dancing.  It goes like the
following.

* When a work item is queued on a pool, work->data is updated before
  work->entry is linked to the pending list with a wmb() inbetween.

* When trying to determine whether a work item is currently queued on
  a pool pointed to by work->data, it locks the pool and looks at
  work->entry.  If work->entry is linked, we then do rmb() and then
  check whether work->data points to the current pool.

This works because, work->data can only point to a pool if it
currently is or were on the pool and,

* If it currently is on the pool, the tests would obviously succeed.

* It it left the pool, its work->entry was cleared under pool->lock,
  so if we're seeing non-empty work->entry, it has to be from the work
  item being linked on another pool.  Because work->data is updated
  before work->entry is linked with wmb() inbetween, work->data update
  from another pool is guaranteed to be visible if we do rmb() after
  seeing non-empty work->entry.  So, we either see empty work->entry
  or we see updated work->data pointin to another pool.

While this works, it's convoluted, to put it mildly.  With recent
updates, it's now guaranteed that work->data points to cwq only while
the work item is queued and that updating work->data to point to cwq
or back to pool is done under pool->lock, so we can simply test
whether work->data points to cwq which is associated with the
currently locked pool instead of the convoluted memory barrier
dancing.

This patch replaces the memory barrier based "are you still here,
really?" test with much simpler "does work->data points to me?" test -
if work->data points to a cwq which is associated with the currently
locked pool, the work item is guaranteed to be queued on the pool as
work->data can start and stop pointing to such cwq only under
pool->lock and the start and stop coincide with queue and dequeue.

tj: Rewrote the comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 4468a00fd9 workqueue: make work->data point to pool after try_to_grab_pending()
We plan to use work->data pointing to cwq as the synchronization
invariant when determining whether a given work item is on a locked
pool or not, which requires work->data pointing to cwq only while the
work item is queued on the associated pool.

With delayed_work updated not to overload work->data for target
workqueue recording, the only case where we still have off-queue
work->data pointing to cwq is try_to_grab_pending() which doesn't
update work->data after stealing a queued work item.  There's no
reason for try_to_grab_pending() to not update work->data to point to
the pool instead of cwq, like the normal execution does.

This patch adds set_work_pool_and_keep_pending() which makes
work->data point to pool instead of cwq but keeps the pending bit
unlike set_work_pool_and_clear_pending() (surprise!).

After this patch, it's guaranteed that only queued work items point to
cwqs.

This patch doesn't introduce any visible behavior change.

tj: Renamed the new helper function to match
    set_work_pool_and_clear_pending() and rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 60c057bca2 workqueue: add delayed_work->wq to simplify reentrancy handling
To avoid executing the same work item from multiple CPUs concurrently,
a work_struct records the last pool it was on in its ->data so that,
on the next queueing, the pool can be queried to determine whether the
work item is still executing or not.

A delayed_work goes through timer before actually being queued on the
target workqueue and the timer needs to know the target workqueue and
CPU.  This is currently achieved by modifying delayed_work->work.data
such that it points to the cwq which points to the target workqueue
and the last CPU the work item was on.  __queue_delayed_work()
extracts the last CPU from delayed_work->work.data and then combines
it with the target workqueue to create new work.data.

The only thing this rather ugly hack achieves is encoding the target
workqueue into delayed_work->work.data without using a separate field,
which could be a trade off one can make; unfortunately, this entangles
work->data management between regular workqueue and delayed_work code
by setting cwq pointer before the work item is actually queued and
becomes a hindrance for further improvements of work->data handling.

This can be easily made sane by adding a target workqueue field to
delayed_work.  While delayed_work is used widely in the kernel and
this does make it a bit larger (<5%), I think this is the right
trade-off especially given the prospect of much saner handling of
work->data which currently involves quite tricky memory barrier
dancing, and don't expect to see any measureable effect.

Add delayed_work->wq and drop the delayed_work->work.data overloading.

tj: Rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 038366c5cf workqueue: make work_busy() test WORK_STRUCT_PENDING first
Currently, work_busy() first tests whether the work has a pool
associated with it and if not, considers it idle.  This works fine
even for delayed_work.work queued on timer, as __queue_delayed_work()
sets cwq on delayed_work.work - a queued delayed_work always has its
cwq and thus pool associated with it.

However, we're about to update delayed_work queueing and this won't
hold.  Update work_busy() such that it tests WORK_STRUCT_PENDING
before the associated pool.  This doesn't make any noticeable behavior
difference now.

With work_pending() test moved, the function read a lot better with
"if (!pool)" test flipped to positive.  Flip it.

While at it, lose the comment about now non-existent reentrant
workqueues.

tj: Reorganized the function and rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 6be195886a workqueue: replace WORK_CPU_NONE/LAST with WORK_CPU_END
Now that workqueue has moved away from gcwqs, workqueue no longer has
the need to have a CPU identifier indicating "no cpu associated" - we
now use WORK_OFFQ_POOL_NONE instead - and most uses of WORK_CPU_NONE
are gone.

The only left usage is as the end marker for for_each_*wq*()
iterators, where the name WORK_CPU_NONE is confusing w/o actual
WORK_CPU_NONE usages.  Similarly, WORK_CPU_LAST which equals
WORK_CPU_NONE no longer makes sense.

Replace both WORK_CPU_NONE and LAST with WORK_CPU_END.  This patch
doesn't introduce any functional difference.

tj: s/WORK_CPU_LAST/WORK_CPU_END/ and rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Ingo Molnar 0351096eb0 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU build fixlet from Paul E. McKenney.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-06 12:43:04 +01:00
Ingo Molnar b2c77a57e4 This implements the cputime accounting on full dynticks CPUs.
Typical cputime stats infrastructure relies on the timer tick and
 its periodic polling on the CPU to account the amount of time
 spent by the CPUs and the tasks per high level domains such as
 userspace, kernelspace, guest, ...
 
 Now we are preparing to implement full dynticks capability on
 Linux for Real Time and HPC users who want full CPU isolation.
 This feature requires a cputime accounting that doesn't depend
 on the timer tick.
 
 To implement it, this new cputime infrastructure plugs into
 kernel/user/guest boundaries to take snapshots of cputime and
 flush these to the stats when needed. This performs pretty
 much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
 and cputime snaphots are synchronized between write and read
 side such that the latter can safely retrieve the pending tickless
 cputime of a task and add it to its latest cputime snapshot to
 return the correct result to the user.
 
 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRBsKnAAoJEIUkVEdQjox3lMgP/2R6DU2f8PyGIao3hne4M3Pu
 L3q+mAG53b24Dy014KeW7gd8yv45fE7wp/rs8CGLte9VzbLkRCDSFQPgBuXVagRj
 tV5nfAuqD0wHTnA+HhBE3l3C2RKAPGIu79rBpnIR/QIPPl8Z3Dby8YgmxEQKDf8G
 j7MEBu2LthSuqEi2ZXemnO5r0oEnQAzAp4TTi/M38k0Fmt59nOGyjLnI+xHYCBMa
 1pnz7j3jjR9NJExGu8iVvbo+jupuQngP8qmkLXHvYnj/TEJNwzO1hHVoSwOpjYpS
 9ycl+T8IKQLbAkBywLtq3Mzde43xt/t8wYyGZ0oAV+Z7MIpz/9YIfDJwqQeqoNbD
 dAdbNjKMbsxCgmrnyqSagfMQg/r3CPZ4vf40TMCaN4gNUJC4Ie+E4kPRKRh59+PB
 Ukthmqujn0f40LAa+HXTUuzafd3b0s/ewH+8FuQ6LAG9b5+WnoN8JTJ5u6+ydokO
 ZleeOowuRZZEg+abQ8Sm2GRm/BzN29gi/npb//I+ZDXWv/+3yccgsiPjCRzCAAaO
 g1RmYryFSRUwHQbGNNypVWVuOLWvrBQ4jqbGO7BBuBByZMSHryKxR6mb+inH3qLE
 xIDM9SdSJisc292OzoFKwVZki4MaXaadJXJduVvqYlZQvXXs7eAa4wo3euhtVITD
 NLQO5OZXE4oIQmDFb0FV
 =1Tzp
 -----END PGP SIGNATURE-----

Merge tag 'full-dynticks-cputime-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core

Pull full-dynticks (user-space execution is undisturbed and
receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready,
from Frederic Weisbecker:

 "This implements the cputime accounting on full dynticks CPUs.

  Typical cputime stats infrastructure relies on the timer tick and
  its periodic polling on the CPU to account the amount of time
  spent by the CPUs and the tasks per high level domains such as
  userspace, kernelspace, guest, ...

  Now we are preparing to implement full dynticks capability on
  Linux for Real Time and HPC users who want full CPU isolation.
  This feature requires a cputime accounting that doesn't depend
  on the timer tick.

  To implement it, this new cputime infrastructure plugs into
  kernel/user/guest boundaries to take snapshots of cputime and
  flush these to the stats when needed. This performs pretty
  much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
  and cputime snaphots are synchronized between write and read
  side such that the latter can safely retrieve the pending tickless
  cputime of a task and add it to its latest cputime snapshot to
  return the correct result to the user."

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-05 13:10:33 +01:00