Commit Graph

35107 Commits

Author SHA1 Message Date
Thomas Gleixner aa3b66f401 tick/sched: Make jiffies update quick check more robust
The quick check in tick_do_update_jiffies64() whether jiffies need to be
updated is not really correct under all circumstances and on all
architectures, especially not on 32bit systems.

The quick check does:

    if (now < READ_ONCE(tick_next_period))
    	return;

and the counterpart in the update is:

    WRITE_ONCE(tick_next_period, next_update_time);

This has two problems:

  1) On weakly ordered architectures there is no guarantee that the stores
     before the WRITE_ONCE() are visible which means that other CPUs can
     operate on a stale jiffies value.

  2) On 32bit the store of tick_next_period which is an u64 is split into
     two 32bit stores. If the first 32bit store advances tick_next_period
     far out and the second 32bit store is delayed (virt, NMI ...) then
     jiffies will become stale until the second 32bit store happens.

Address this by seperating the handling for 32bit and 64bit.

On 64bit problem #1 is addressed by replacing READ_ONCE() / WRITE_ONCE()
with smp_load_acquire() / smp_store_release().

On 32bit problem #2 is addressed by protecting the quick check with the
jiffies sequence counter. The load and stores can be plain because the
sequence count mechanics provides the required barriers already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87czzpc02w.fsf@nanos.tec.linutronix.de
2020-12-11 23:19:10 +01:00
Andrii Nakryiko b7906b70a2 bpf: Fix enum names for bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers
Remove bpf_ prefix, which causes these helpers to be reported in verifier
dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Lets
fix it as long as it is still possible before UAPI freezes on these helpers.

Fixes: eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-11 14:19:07 -08:00
Arnd Bergmann 6e7b64b9dd elfcore: fix building with clang
kernel/elfcore.c only contains weak symbols, which triggers a bug with
clang in combination with recordmcount:

  Cannot find symbol for section 2: .text.
  kernel/elfcore.o: failed

Move the empty stubs into linux/elfcore.h as inline functions.  As only
two architectures use these, just use the architecture specific Kconfig
symbols to key off the declaration.

Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Barret Rhoden <brho@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-11 14:02:14 -08:00
Rafael J. Wysocki 90ac908a41 cpufreq: schedutil: Simplify sugov_update_next_freq()
Rearrange a conditional to make it more straightforward.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-11 19:53:58 +01:00
John Garry 1d3aec8928 genirq/affinity: Add irq_update_affinity_desc()
Add a function to allow the affinity of an interrupt be switched to
managed, such that interrupts allocated for platform devices may be
managed.

This new interface has certain limitations, and attempts to use it in the
following circumstances will fail:
- For when the kernel is configured for generic IRQ reservation mode (in
  config GENERIC_IRQ_RESERVATION_MODE). The reason being that it could
  conflict with managed vs. non-managed interrupt accounting.
- The interrupt is already started, which should not be the case during
  init
- The interrupt is already configured as managed, which means double init

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1606905417-183214-2-git-send-email-john.garry@huawei.com
2020-12-11 14:47:50 +00:00
Valentin Schneider b388fa5014 Revert "genirq: Add fasteoi IPI flow"
handle_percpu_devid_fasteoi_ipi() has no more users, and
handle_percpu_devid_irq() can do all that it was supposed to do. Get rid of
it.

This reverts commit c5e5ec033c.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20201109094121.29975-6-valentin.schneider@arm.com
2020-12-11 14:47:50 +00:00
Thomas Gleixner 76e87d96b3 ntp: Consolidate the RTC update implementation
The code for the legacy RTC and the RTC class based update are pretty much
the same. Consolidate the common parts into one function and just invoke
the actual setter functions.

For RTC class based devices the update code checks whether the offset is
valid for the device, which is usually not the case for the first
invocation. If it's not the same it stores the correct offset and lets the
caller try again. That's not much different from the previous approach
where the first invocation had a pretty low probability to actually hit the
allowed window.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20201206220542.355743355@linutronix.de
2020-12-11 10:40:53 +01:00
Thomas Gleixner 69eca258c8 ntp: Make the RTC sync offset less obscure
The current RTC set_offset_nsec value is not really intuitive to
understand. 

  tsched       twrite(t2.tv_sec - 1) 	 t2 (seconds increment)

The offset is calculated from twrite based on the assumption that t2 -
twrite == 1s. That means for the MC146818 RTC the offset needs to be
negative so that the write happens 500ms before t2.

It's easier to understand when the whole calculation is based on t2. That
avoids negative offsets and the meaning is obvious:

 t2 - twrite:     The time defined by the chip when seconds increment
      		  after the write.

 twrite - tsched: The time for the transport to the point where the chip
 	  	  is updated. 

==> set_offset_nsec =  t2 - tsched
    ttransport      =  twrite - tsched
    tRTCinc         =  t2 - twrite
==> set_offset_nsec =  ttransport + tRTCinc

tRTCinc is a chip property and can be obtained from the data sheet.

ttransport depends on how the RTC is connected. It is close to 0 for
directly accessible RTCs. For RTCs behind a slow bus, e.g. i2c, it's the
time required to send the update over the bus. This can be estimated or
even calibrated, but that's a different problem.

Adjust the implementation and update comments accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20201206220542.263204937@linutronix.de
2020-12-11 10:40:53 +01:00
Thomas Gleixner 33e62e8323 ntp, rtc: Move rtc_set_ntp_time() to ntp code
rtc_set_ntp_time() is not really RTC functionality as the code is just a
user of RTC. Move it into the NTP code which allows further cleanups.

Requested-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20201206220542.166871172@linutronix.de
2020-12-11 10:40:52 +01:00
Thomas Gleixner c9e6189fb0 ntp: Make the RTC synchronization more reliable
Miroslav reported that the periodic RTC synchronization in the NTP code
fails more often than not to hit the specified update window.

The reason is that the code uses delayed_work to schedule the update which
needs to be in thread context as the underlying RTC might be connected via
a slow bus, e.g. I2C. In the update function it verifies whether the
current time is correct vs. the requirements of the underlying RTC.

But delayed_work is using the timer wheel for scheduling which is
inaccurate by design. Depending on the distance to the expiry the wheel
gets less granular to allow batching and to avoid the cascading of the
original timer wheel. See 500462a9de ("timers: Switch to a non-cascading
wheel") and the code for further details.

The code already deals with this by splitting the 660 seconds period into a
long 659 seconds timer and then retrying with a smaller delta.

But looking at the actual granularities of the timer wheel (which depend on
the HZ configuration) the 659 seconds timer ends up in an outer wheel level
and is affected by a worst case granularity of:

HZ          Granularity
1000        32s
 250        16s
 100        40s

So the initial timer can be already off by max 12.5% which is not a big
issue as the period of the sync is defined as ~11 minutes.

The fine grained second attempt schedules to the desired update point with
a timer expiring less than a second from now. Depending on the actual delta
and the HZ setting even the second attempt can end up in outer wheel levels
which have a large enough granularity to make the correctness check fail.

As this is a fundamental property of the timer wheel there is no way to
make this more accurate short of iterating in one jiffies steps towards the
update point.

Switch it to an hrtimer instead which schedules the actual update work. The
hrtimer will expire precisely (max 1 jiffie delay when high resolution
timers are not available). The actual scheduling delay of the work is the
same as before.

The update is triggered from do_adjtimex() which is a bit racy but not much
more racy than it was before:

     if (ntp_synced())
     	queue_delayed_work(system_power_efficient_wq, &sync_work, 0);

which is racy when the work is currently executed and has not managed to
reschedule itself.

This becomes now:

     if (ntp_synced() && !hrtimer_is_queued(&sync_hrtimer))
     	queue_work(system_power_efficient_wq, &sync_work, 0);

which is racy when the hrtimer has expired and the work is currently
executed and has not yet managed to rearm the hrtimer.

Not a big problem as it just schedules work for nothing.

The new implementation has a safe guard in place to catch the case where
the hrtimer is queued on entry to the work function and avoids an extra
update attempt of the RTC that way.

Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Miroslav Lichvar <mlichvar@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20201206220542.062910520@linutronix.de
2020-12-11 10:40:52 +01:00
Barry Song 5b78f2dc31 sched/fair: Trivial correction of the newidle_balance() comment
idle_balance() has been renamed to newidle_balance(). To differentiate
with nohz_idle_balance, it seems refining the comment will be helpful
for the readers of the code.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao.hua@hisilicon.com
2020-12-11 10:30:44 +01:00
Mel Gorman 13d5a5e9f9 sched/fair: Clear SMT siblings after determining the core is not idle
The clearing of SMT siblings from the SIS mask before checking for an idle
core is a small but unnecessary cost. Defer the clearing of the siblings
until the scan moves to the next potential target. The cost of this was
not measured as it is borderline noise but it should be self-evident.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201130144020.GS3371@techsingularity.net
2020-12-11 10:30:38 +01:00
Mauro Carvalho Chehab 59a74b1544 sched: Fix kernel-doc markup
Kernel-doc requires that a kernel-doc markup to be immediately
below the function prototype, as otherwise it will rename it.
So, move sys_sched_yield() markup to the right place.

Also fix the cpu_util() markup: Kernel-doc markups
should use this format:
        identifier - description

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/50cd6f460aeb872ebe518a8e9cfffda2df8bdb0a.1606823973.git.mchehab+huawei@kernel.org
2020-12-11 10:30:31 +01:00
Linus Torvalds 4d31058b82 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) IPsec compat fixes, from Dmitry Safonov.

 2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai.

 3) Fix polling in xsk sockets by using sk_poll_wait() instead of
    datagram_poll() which keys off of sk_wmem_alloc and such which xsk
    sockets do not update. From Xuan Zhuo.

 4) Missing init of rekey_data in cfgh80211, from Sara Sharon.

 5) Fix destroy of timer before init, from Davide Caratti.

 6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd
    Bergmann.

 7) Missing error return in rtm_to_fib_config() switch case, from Zhang
    Changzhong.

 8) Fix some src/dest address handling in vrf and add a testcase. From
    Stephen Suryaputra.

 9) Fix multicast handling in Seville switches driven by mscc-ocelot
    driver. From Vladimir Oltean.

10) Fix proto value passed to skb delivery demux in udp, from Xin Long.

11) HW pkt counters not reported correctly in enetc driver, from Claudiu
    Manoil.

12) Fix deadlock in bridge, from Joseph Huang.

13) Missing of_node_pur() in dpaa2 driver, fromn Christophe JAILLET.

14) Fix pid fetching in bpftool when there are a lot of results, from
    Andrii Nakryiko.

15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso.

16) Various stymmac fixes, from Fugang Duan.

17) Fix null deref in tipc, from Cengiz Can.

18) When mss is biog, coose more resonable rcvq_space in tcp, fromn Eric
    Dumazet.

19) Revert a geneve change that likely isnt necessary, from Jakub
    Kicinski.

20) Avoid premature rx buffer reuse in various Intel driversm from Björn
    Töpel.

21) retain EcT bits during TIS reflection in tcp, from Wei Wang.

22) Fix Tso deferral wrt. cwnd limiting in tcp, from Neal Cardwell.

23) MPLS_OPT_LSE_LABEL attribute is 342 ot 8 bits, from Guillaume Nault

24) Fix propagation of 32-bit signed bounds in bpf verifier and add test
    cases, from Alexei Starovoitov.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  selftests: fix poll error in udpgro.sh
  selftests/bpf: Fix "dubious pointer arithmetic" test
  selftests/bpf: Fix array access with signed variable test
  selftests/bpf: Add test for signed 32-bit bound check bug
  bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
  MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver
  net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
  net/mlx4_en: Handle TX error CQE
  net/mlx4_en: Avoid scheduling restart task if it is already running
  tcp: fix cwnd-limited bug for TSO deferral where we send nothing
  net: flow_offload: Fix memory leak for indirect flow block
  tcp: Retain ECT bits for tos reflection
  ethtool: fix stack overflow in ethnl_parse_bitset()
  e1000e: fix S0ix flow to allow S0i3.2 subset entry
  ice: avoid premature Rx buffer reuse
  ixgbe: avoid premature Rx buffer reuse
  i40e: avoid premature Rx buffer reuse
  igb: avoid transmit queue timeout in xdp path
  igb: use xdp_do_flush
  igb: skb add metasize for xdp
  ...
2020-12-10 15:30:13 -08:00
David S. Miller d9838b1d39 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-12-10

The following pull-request contains BPF updates for your *net* tree.

We've added 21 non-merge commits during the last 12 day(s) which contain
a total of 21 files changed, 163 insertions(+), 88 deletions(-).

The main changes are:

1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.

2) Fix ring_buffer__poll() return value, from Andrii.

3) Fix race in lwt_bpf, from Cong.

4) Fix test_offload, from Toke.

5) Various xsk fixes.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

Also thanks to reporters, reviewers and testers of commits in this pull-request:

Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10 14:29:30 -08:00
Alexei Starovoitov b02709587e bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
The 64-bit signed bounds should not affect 32-bit signed bounds unless the
verifier knows that upper 32-bits are either all 1s or all 0s. For example the
register with smin_value==1 doesn't mean that s32_min_value is also equal to 1,
since smax_value could be larger than 32-bit subregister can hold.
The verifier refines the smax/s32_max return value from certain helpers in
do_refine_retval_range(). Teach the verifier to recognize that smin/s32_min
value is also bounded. When both smin and smax bounds fit into 32-bit
subregister the verifier can propagate those bounds.

Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-12-10 13:02:53 -08:00
Dmitry Torokhov b2058cd93d Input: gtco - remove driver
The driver has its own HID descriptor parsing code, that had and still
has several issues discovered by syzbot and other tools. Ideally we
should move the driver over to the HID subsystem, so that it uses proven
parsing code.  However the devices in question are EOL, and GTCO is not
willing to extend resources for that, so let's simply remove the driver.

Note that our HID support has greatly improved over the last 10 years,
we may also consider reverting 6f8d9e26e7 ("hid-core.c: Adds all GTCO
CalComp Digitizers and InterWrite School Products to blacklist") and see
if GTCO devices actually work with normal HID drivers.

Link: https://lore.kernel.org/r/X8wbBtO5KidME17K@google.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2020-12-09 17:47:36 -08:00
Saravana Kannan 01bb86b380 driver core: Add fwnode_init()
There are multiple locations in the kernel where a struct fwnode_handle
is initialized. Add fwnode_init() so that we have one way of
initializing a fwnode_handle.

Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Link: https://lore.kernel.org/r/20201121020232.908850-8-saravanak@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-09 19:10:20 +01:00
Catalin Marinas d889797530 Merge remote-tracking branch 'arm64/for-next/fixes' into for-next/core
* arm64/for-next/fixes: (26 commits)
  arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE
  arm64: mte: Fix typo in macro definition
  arm64: entry: fix EL1 debug transitions
  arm64: entry: fix NMI {user, kernel}->kernel transitions
  arm64: entry: fix non-NMI kernel<->kernel transitions
  arm64: ptrace: prepare for EL1 irq/rcu tracking
  arm64: entry: fix non-NMI user<->kernel transitions
  arm64: entry: move el1 irq/nmi logic to C
  arm64: entry: prepare ret_to_user for function call
  arm64: entry: move enter_from_user_mode to entry-common.c
  arm64: entry: mark entry code as noinstr
  arm64: mark idle code as noinstr
  arm64: syscall: exit userspace before unmasking exceptions
  arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
  arm64: pgtable: Fix pte_accessible()
  ACPI/IORT: Fix doc warnings in iort.c
  arm64/fpsimd: add <asm/insn.h> to <asm/kprobes.h> to fix fpsimd build
  arm64: cpu_errata: Apply Erratum 845719 to KRYO2XX Silver
  arm64: proton-pack: Add KRYO2XX silver CPUs to spectre-v2 safe-list
  arm64: kpti: Add KRYO2XX gold/silver CPU cores to kpti safelist
  ...

# Conflicts:
#	arch/arm64/include/asm/exception.h
#	arch/arm64/kernel/sdei.c
2020-12-09 18:04:55 +00:00
Catalin Marinas d45056ad73 Merge remote-tracking branch 'arm64/for-next/scs' into for-next/core
* arm64/for-next/scs:
  arm64: sdei: Push IS_ENABLED() checks down to callee functions
  arm64: scs: use vmapped IRQ and SDEI shadow stacks
  scs: switch to vmapped shadow stacks
2020-12-09 18:04:48 +00:00
peterz@infradead.org 78af4dc949 perf: Break deadlock involving exec_update_mutex
Syzbot reported a lock inversion involving perf. The sore point being
perf holding exec_update_mutex() for a very long time, specifically
across a whole bunch of filesystem ops in pmu::event_init() (uprobes)
and anon_inode_getfile().

This then inverts against procfs code trying to take
exec_update_mutex.

Move the permission checks later, such that we need to hold the mutex
over less code.

Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-12-09 17:08:57 +01:00
Waiman Long 617f3ef951 locking/rwsem: Remove reader optimistic spinning
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It also improves
the chance that a reader can get the lock as writer optimistic spinning
disproportionally favors writers much more than readers.

Since commit d3681e269f ("locking/rwsem: Wake up almost all readers
in wait queue"), all the waiting readers are woken up so that they can
all get the read lock and run in parallel. When the number of contending
readers is large, allowing reader optimistic spinning will likely cause
reader fragmentation where multiple smaller groups of readers can get
the read lock in a sequential manner separated by writers. That reduces
reader parallelism.

One possible way to address that drawback is to limit the number of
readers (preferably one) that can do optimistic spinning. These readers
act as representatives of all the waiting readers in the wait queue as
they will wake up all those waiting readers once they get the lock.

Alternatively, as reader optimistic lock stealing has already enhanced
fairness to readers, it may be easier to just remove reader optimistic
spinning and simplifying the optimistic spinning code as a result.

Performance measurements (locking throughput kops/s) using a locking
microbenchmark with 50/50 reader/writer distribution and turbo-boost
disabled was done on a 2-socket Cascade Lake system (48-core 96-thread)
to see the impacts of these changes:

  1) Vanilla     - 5.10-rc3 kernel
  2) Before      - 5.10-rc3 kernel with previous patches in this series
  2) limit-rspin - 5.10-rc3 kernel with limited reader spinning patch
  3) no-rspin    - 5.10-rc3 kernel with reader spinning disabled

  # of threads  CS Load   Vanilla  Before   limit-rspin   no-rspin
  ------------  -------   -------  ------   -----------   --------
       2            1      5,185    5,662      5,214       5,077
       4            1      5,107    4,983      5,188       4,760
       8            1      4,782    4,564      4,720       4,628
      16            1      4,680    4,053      4,567       3,402
      32            1      4,299    1,115      1,118       1,098
      64            1      3,218      983      1,001         957
      96            1      1,938      944        957         930

       2           20      2,008    2,128      2,264       1,665
       4           20      1,390    1,033      1,046       1,101
       8           20      1,472    1,155      1,098       1,213
      16           20      1,332    1,077      1,089       1,122
      32           20        967      914        917         980
      64           20        787      874        891         858
      96           20        730      836        847         844

       2          100        372      356        360         355
       4          100        492      425        434         392
       8          100        533      537        529         538
      16          100        548      572        568         598
      32          100        499      520        527         537
      64          100        466      517        526         512
      96          100        406      497        506         509

The column "CS Load" represents the number of pause instructions issued
in the locking critical section. A CS load of 1 is extremely short and
is not likey in real situations. A load of 20 (moderate) and 100 (long)
are more realistic.

It can be seen that the previous patches in this series have reduced
performance in general except in highly contended cases with moderate
or long critical sections that performance improves a bit. This change
is mostly caused by the "Prevent potential lock starvation" patch that
reduce reader optimistic spinning and hence reduce reader fragmentation.

The patch that further limit reader optimistic spinning doesn't seem to
have too much impact on overall performance as shown in the benchmark
data.

The patch that disables reader optimistic spinning shows reduced
performance at lightly loaded cases, but comparable or slightly better
performance on with heavier contention.

This patch just removes reader optimistic spinning for now. As readers
are not going to do optimistic spinning anymore, we don't need to
consider if the OSQ is empty or not when doing lock stealing.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-6-longman@redhat.com
2020-12-09 17:08:48 +01:00
Waiman Long 1a728dff85 locking/rwsem: Enable reader optimistic lock stealing
If the optimistic spinning queue is empty and the rwsem does not have
the handoff or write-lock bits set, it is actually not necessary to
call rwsem_optimistic_spin() to spin on it. Instead, it can steal the
lock directly as its reader bias is in the count already.  If it is
the first reader in this state, it will try to wake up other readers
in the wait queue.

With this patch applied, the following were the lock event counts
after rebooting a 2-socket system and a "make -j96" kernel rebuild.

  rwsem_opt_rlock=4437
  rwsem_rlock=29
  rwsem_rlock_steal=19

So lock stealing represents about 0.4% of all the read locks acquired
in the slow path.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-4-longman@redhat.com
2020-12-09 17:08:48 +01:00
Waiman Long 2f06f70292 locking/rwsem: Prevent potential lock starvation
The lock handoff bit is added in commit 4f23dbc1e6 ("locking/rwsem:
Implement lock handoff to prevent lock starvation") to avoid lock
starvation. However, allowing readers to do optimistic spinning does
introduce an unlikely scenario where lock starvation can happen.

The lock handoff bit may only be set when a waiter is being woken up.
In the case of reader unlock, wakeup happens only when the reader count
reaches 0. If there is a continuous stream of incoming readers acquiring
read lock via optimistic spinning, it is possible that the reader count
may never reach 0 and so the handoff bit will never be asserted.

One way to prevent this scenario from happening is to disallow optimistic
spinning if the rwsem is currently owned by readers. If the previous
or current owner is a writer, optimistic spinning will be allowed.

If the previous owner is a reader but the reader count has reached 0
before, a wakeup should have been issued. So the handoff mechanism
will be kicked in to prevent lock starvation. As a result, it should
be OK to do optimistic spinning in this case.

This patch may have some impact on reader performance as it reduces
reader optimistic spinning especially if the lock critical sections
are short the number of contending readers are small.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-3-longman@redhat.com
2020-12-09 17:08:48 +01:00
Waiman Long c8fe8b0564 locking/rwsem: Pass the current atomic count to rwsem_down_read_slowpath()
The atomic count value right after reader count increment can be useful
to determine the rwsem state at trylock time. So the count value is
passed down to rwsem_down_read_slowpath() to be used when appropriate.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-2-longman@redhat.com
2020-12-09 17:08:47 +01:00
Peter Zijlstra c995e638cc locking/rwsem: Fold __down_{read,write}*()
There's a lot needless duplication in __down_{read,write}*(), cure
that with a helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass.net
2020-12-09 17:08:47 +01:00
Peter Zijlstra 285c61aedf locking/rwsem: Introduce rwsem_write_trylock()
One copy of this logic is better than three.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass.net
2020-12-09 17:08:47 +01:00
Peter Zijlstra 3379116a0c locking/rwsem: Better collate rwsem_read_trylock()
All users of rwsem_read_trylock() do rwsem_set_reader_owned(sem) on
success, move it into rwsem_read_trylock() proper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass.net
2020-12-09 17:08:47 +01:00
Peter Zijlstra 2b3c99ee63 Merge branch 'locking/rwsem' 2020-12-09 17:08:45 +01:00
Eric W. Biederman 31784cff7e rwsem: Implement down_read_interruptible
In preparation for converting exec_update_mutex to a rwsem so that
multiple readers can execute in parallel and not deadlock, add
down_read_interruptible.  This is needed for perf_event_open to be
converted (with no semantic changes) from working on a mutex to
wroking on a rwsem.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87k0tybqfy.fsf@x220.int.ebiederm.org
2020-12-09 17:08:42 +01:00
Eric W. Biederman 0f9368b5bf rwsem: Implement down_read_killable_nested
In preparation for converting exec_update_mutex to a rwsem so that
multiple readers can execute in parallel and not deadlock, add
down_read_killable_nested.  This is needed so that kcmp_lock
can be converted from working on a mutexes to working on rw_semaphores.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87o8jabqh3.fsf@x220.int.ebiederm.org
2020-12-09 17:08:41 +01:00
Andy Lutomirski e45cdc71d1 membarrier: Execute SYNC_CORE on the calling thread
membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as
syncing the core on all sibling threads but not necessarily the calling
thread.  This behavior is fundamentally buggy and cannot be used safely.

Suppose a user program has two threads.  Thread A is on CPU 0 and thread B
is on CPU 1.  Thread A modifies some text and calls
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).

Then thread B executes the modified code.  If, at any point after
membarrier() decides which CPUs to target, thread A could be preempted and
replaced by thread B on CPU 0.  This could even happen on exit from the
membarrier() syscall.  If this happens, thread B will end up running on CPU
0 without having synced.

In principle, this could be fixed by arranging for the scheduler to issue
sync_core_before_usermode() whenever switching between two threads in the
same mm if there is any possibility of a concurrent membarrier() call, but
this would have considerable overhead.  Instead, make membarrier() sync the
calling CPU as well.

As an optimization, this avoids an extra smp_mb() in the default
barrier-only mode and an extra rseq preempt on the caller.

Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org
2020-12-09 09:37:43 +01:00
Andy Lutomirski 758c9373d8 membarrier: Explicitly sync remote cores when SYNC_CORE is requested
membarrier() does not explicitly sync_core() remote CPUs; instead, it
relies on the assumption that an IPI will result in a core sync.  On x86,
this may be true in practice, but it's not architecturally reliable.  In
particular, the SDM and APM do not appear to guarantee that interrupt
delivery is serializing.  While IRET does serialize, IPI return can
schedule, thereby switching to another task in the same mm that was
sleeping in a syscall.  The new task could then SYSRET back to usermode
without ever executing IRET.

Make this more robust by explicitly calling sync_core_before_usermode()
on remote cores.  (This also helps people who search the kernel tree for
instances of sync_core() and sync_core_before_usermode() -- one might be
surprised that the core membarrier code doesn't currently show up in a
such a search.)

Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org
2020-12-09 09:37:43 +01:00
Andy Lutomirski 2ecedd7569 membarrier: Add an actual barrier before rseq_preempt()
It seems that most RSEQ membarrier users will expect any stores done before
the membarrier() syscall to be visible to the target task(s).  While this
is extremely likely to be true in practice, nothing actually guarantees it
by a strict reading of the x86 manuals.  Rather than providing this
guarantee by accident and potentially causing a problem down the road, just
add an explicit barrier.

Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org
2020-12-09 09:37:43 +01:00
Florent Revest b60da4955f bpf: Only provide bpf_sock_from_file with CONFIG_NET
This moves the bpf_sock_from_file definition into net/core/filter.c
which only gets compiled with CONFIG_NET and also moves the helper proto
usage next to other tracing helpers that are conditional on CONFIG_NET.

This avoids
  ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file':
  bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file'
When compiling a kernel with BPF and without NET.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org
2020-12-08 18:23:36 -08:00
Andrii Nakryiko 8bdd8e275e bpf: Return -ENOTSUPP when attaching to non-kernel BTF
Return -ENOTSUPP if tracing BPF program is attempted to be attached with
specified attach_btf_obj_fd pointing to non-kernel (neither vmlinux nor
module) BTF object. This scenario might be supported in the future and isn't
outright invalid, so -EINVAL isn't the most appropriate error code.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201208064326.667389-1-andrii@kernel.org
2020-12-08 17:14:27 +01:00
Lukas Bulwahn 2f4b03195f bpf: Propagate __user annotations properly
__htab_map_lookup_and_delete_batch() stores a user pointer in the local
variable ubatch and uses that in copy_{from,to}_user(), but ubatch misses a
__user annotation.

So, sparse warns in the various assignments and uses of ubatch:

  kernel/bpf/hashtab.c:1415:24: warning: incorrect type in initializer
    (different address spaces)
  kernel/bpf/hashtab.c:1415:24:    expected void *ubatch
  kernel/bpf/hashtab.c:1415:24:    got void [noderef] __user *

  kernel/bpf/hashtab.c:1444:46: warning: incorrect type in argument 2
    (different address spaces)
  kernel/bpf/hashtab.c:1444:46:    expected void const [noderef] __user *from
  kernel/bpf/hashtab.c:1444:46:    got void *ubatch

  kernel/bpf/hashtab.c:1608:16: warning: incorrect type in assignment
    (different address spaces)
  kernel/bpf/hashtab.c:1608:16:    expected void *ubatch
  kernel/bpf/hashtab.c:1608:16:    got void [noderef] __user *

  kernel/bpf/hashtab.c:1609:26: warning: incorrect type in argument 1
    (different address spaces)
  kernel/bpf/hashtab.c:1609:26:    expected void [noderef] __user *to
  kernel/bpf/hashtab.c:1609:26:    got void *ubatch

Add the __user annotation to repair this chain of propagating __user
annotations in __htab_map_lookup_and_delete_batch().

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201207123720.19111-1-lukas.bulwahn@gmail.com
2020-12-07 19:26:09 -08:00
Eric Dumazet e1868b9e36 bpf: Avoid overflows involving hash elem_size
Use of bpf_map_charge_init() was making sure hash tables would not use more
than 4GB of memory.

Since the implicit check disappeared, we have to be more careful
about overflows, to support big hash tables.

syzbot triggers a panic using :

bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_LRU_HASH, key_size=16384, value_size=8,
                     max_entries=262200, map_flags=0, inner_map_fd=-1, map_name="",
                     map_ifindex=0, btf_fd=-1, btf_key_type_id=0, btf_value_type_id=0,
                     btf_vmlinux_value_type_id=0}, 64) = ...

BUG: KASAN: vmalloc-out-of-bounds in bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
BUG: KASAN: vmalloc-out-of-bounds in bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
Write of size 2 at addr ffffc90017e4a020 by task syz-executor.5/19786

CPU: 0 PID: 19786 Comm: syz-executor.5 Not tainted 5.10.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
 __kasan_report mm/kasan/report.c:545 [inline]
 kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
 bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
 bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
 prealloc_init kernel/bpf/hashtab.c:319 [inline]
 htab_map_alloc+0xf6e/0x1230 kernel/bpf/hashtab.c:507
 find_and_alloc_map kernel/bpf/syscall.c:123 [inline]
 map_create kernel/bpf/syscall.c:829 [inline]
 __do_sys_bpf+0xa81/0x5170 kernel/bpf/syscall.c:4336
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fd93fbc0c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000001a40 RCX: 000000000045deb9
RDX: 0000000000000040 RSI: 0000000020000280 RDI: 0000000000000000
RBP: 000000000119bf60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000119bf2c
R13: 00007ffc08a7be8f R14: 00007fd93fbc19c0 R15: 000000000119bf2c

Fixes: 755e5d5536 ("bpf: Eliminate rlimit-based memory accounting for hashtab maps")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20201207182821.3940306-1-eric.dumazet@gmail.com
2020-12-07 12:57:25 -08:00
Linus Torvalds cd796ed334 Fix userstacktrace option for instances
While writing an application that requires user stack trace option
 to work in instances, I found that the instance option has a bug
 that makes it a nop. The check for performing the user stack trace
 in an instance, checks the top level options (not the instance options)
 to determine if a user stack trace should be performed or not.
 
 This is not only incorrect, but also confusing for users. It confused
 me for a bit!
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX85BzBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qv7XAQCzfq6DpHUv1gVaBkGytOszLW4IuEtd
 /09jDPbVSG8T2QEA+e6fpBP1aIixgLN6+vFVl3wAK3SaIQeIA/blkbBNFQ0=
 =o1Ew
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix userstacktrace option for instances

  While writing an application that requires user stack trace option to
  work in instances, I found that the instance option has a bug that
  makes it a nop. The check for performing the user stack trace in an
  instance, checks the top level options (not the instance options) to
  determine if a user stack trace should be performed or not.

  This is not only incorrect, but also confusing for users. It confused
  me for a bit!"

* tag 'trace-v5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix userstacktrace option for instances
2020-12-07 11:20:26 -08:00
Linus Torvalds 592d9a0835 A set of updates for the interrupt subsystem:
- Make multiqueue devices which use the managed interrupt affinity
     infrastructure work on PowerPC/Pseries. PowerPC does not use the
     generic infrastructure for setting up PCI/MSI interrupts and the
     multiqueue changes failed to update the legacy PCI/MSI infrastructure.
     Make this work by passing the affinity setup information down to the
     mapping and allocation functions.
 
   - Move Jason Cooper from MAINTAINERS to CREDITS as his mail is bouncing
     and he's not reachable. We hope all is well with him and say thanks
     for his work over the years.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/M1GwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoW8xD/4uG/0ayYgSdRf4nXcyXu4JKoHV5oK5
 y7IWY9s04fqTFbVO2fRaD1hBYHavWfdV80obP8dJio1g6R1BqzZiEVUmCdWI0tHJ
 recAsGYxqPrNj9soHEZ7ZmuGX6VhuzQj57srU+lhzsqk+88uY/n1d/TlrHCH7miU
 0cfBSoolP2l2p6UYHvXfH2wk1hRHg8sySOfxGSp6KSrewoOwAOT2CCNX8gIcmy1n
 dUsJaHEFzU547p55zDs5DTHfM0yJdsqqUpdxvpiZWpZhsIzoQvd8taiH7/uaRGqd
 yJI4sMWudJUGGas2Vq0yjG6L0uAJ7M+kjqodJzn0hAKq6MhAIKaPMEbpPx9TuZYb
 zZg9ce5o4LwzTphNPmcEMCjpPKRGNiEbcl1XY4qhQWnBvuOb1mIBFV4+6srd+Lpg
 o7kEt+XyjKZARgw01yDf9tHSJYOcBQuHqGUdRZQAWSCThizpQsOZJaUWB8l4mLDy
 fScYx4cH12oPmCg3Fdd22oq7JN0ed9O3M7BLuzmI006uSWsB8fbfEcM+k+g+63Go
 xpHYKM6VOzLlspFPFvVo3nwzvc787he2I9tIOPtCU0Hl4BBjjfGyB9FcnhNShNoe
 dfVVzTaEWmHNGonTI61suwZJeyOTudzHqbgw5rAmqmJeV5R8gFSiGmzINlqoWZX6
 TWDVz4Y1ma8QwA==
 =/DKJ
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "A set of updates for the interrupt subsystem:

   - Make multiqueue devices which use the managed interrupt affinity
     infrastructure work on PowerPC/Pseries. PowerPC does not use the
     generic infrastructure for setting up PCI/MSI interrupts and the
     multiqueue changes failed to update the legacy PCI/MSI
     infrastructure. Make this work by passing the affinity setup
     information down to the mapping and allocation functions.

   - Move Jason Cooper from MAINTAINERS to CREDITS as his mail is
     bouncing and he's not reachable. We hope all is well with him and
     say thanks for his work over the years"

* tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  powerpc/pseries: Pass MSI affinity to irq_create_mapping()
  genirq/irqdomain: Add an irq_create_mapping_affinity() function
  MAINTAINERS: Move Jason Cooper to CREDITS
2020-12-06 11:15:55 -08:00
Linus Torvalds 32f741b02f powerpc fixes for 5.10 #5
Three commits fixing possible missed TLB invalidations for multi-threaded
 processes when CPUs are hotplugged in and out.
 
 A fix for a host crash triggerable by host userspace (qemu) in KVM on Power9.
 
 A fix for a host crash in machine check handling when running HPT guests on a
 HPT host.
 
 One commit fixing potential missed TLB invalidations when using the hash MMU on
 Power9 or later.
 
 A regression fix for machines with CPUs on node 0 but no memory.
 
 Thanks to:
   Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan Mohanty, Milton Miller,
   Nicholas Piggin, Paul Mackerras, Srikar Dronamraju.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl/LcwsTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgOeiD/wKGX8eE7AJ5ZxoFLwpGEJhp9QgMDhe
 nP82CkKobwMM3UCbde9MC8PqYGC7/7PhRPM0GI03uh6EfeHUtle7AZlBAlZoGaeJ
 MwdQBQrZSqf1QJOyhUEa6CI0XTfCEOrsw+AkZQKdsv9JLcFBz7IyfP61gf7MHfyo
 QKlfYYilXHbJ7M9oiM9gKUdtrpPfMGH0YnIp0FR+JowJAWUfFY626H9j7chNwWK+
 7nrphtLHwsBVNtIoKWvPocuLKPsziOqXWnOP/do/RuCoKXMbGjtOJHhUgEYC5PM7
 eQug43YDaws4K1fxaHvQto/u92nL2GFY6FfKNeJ5FcQYgCIvi/T8jzEsJyqGbpVz
 YihZj1MbhhGr/neVtJW4SbdCTCU7R7X9QBy4He6XoWHR0fNoQDQvjNT/ziiuHiN0
 tU+Y9aoHwI/0Pb44ceiQ/T10nxYtk+6Cj5Cm9Ll7MvfjUsE/BpxlYdi+KMqRSGOb
 itOwFLQpgy28feMRKGZNKFURwTophASFaKO88yhjeSnlcGqxvicSIUpz8UD1jxwt
 o/tsger09ZXqBYVdVKLpqbKsifVbzUfJmmycvuDF37B+VjwHACP+VZltwdOqnX13
 BM9ndcDW2p6UnNLfs47FWJM+czmShrgwqI/W7qcCFleYL3r5XOS8hJHfgvJEcE04
 n7A9cNvK5q6nvg==
 =tIAZ
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Some more powerpc fixes for 5.10:

   - Three commits fixing possible missed TLB invalidations for
     multi-threaded processes when CPUs are hotplugged in and out.

   - A fix for a host crash triggerable by host userspace (qemu) in KVM
     on Power9.

   - A fix for a host crash in machine check handling when running HPT
     guests on a HPT host.

   - One commit fixing potential missed TLB invalidations when using the
     hash MMU on Power9 or later.

   - A regression fix for machines with CPUs on node 0 but no memory.

  Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
  Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
  Dronamraju"

* tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
  KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
  powerpc/numa: Fix a regression on memoryless node 0
  powerpc/64s: Trim offlined CPUs from mm_cpumasks
  kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
  powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
  powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation
2020-12-05 11:16:21 -08:00
Steven Rostedt (VMware) bcee527895 tracing: Fix userstacktrace option for instances
When the instances were able to use their own options, the userstacktrace
option was left hardcoded for the top level. This made the instance
userstacktrace option bascially into a nop, and will confuse users that set
it, but nothing happens (I was confused when it happened to me!)

Cc: stable@vger.kernel.org
Fixes: 16270145ce ("tracing: Add trace options for core options to instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-04 16:36:16 -05:00
Florent Revest 4f19cab761 bpf: Add a bpf_sock_from_file helper
While eBPF programs can check whether a file is a socket by file->f_op
== &socket_file_ops, they cannot convert the void private_data pointer
to a struct socket BTF pointer. In order to do this a new helper
wrapping sock_from_file is added.

This is useful to tracing programs but also other program types
inheriting this set of helpers such as iterators or LSM programs.

Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: KP Singh <kpsingh@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-2-revest@google.com
2020-12-04 22:32:40 +01:00
Jakub Kicinski a1dd1d8697 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-12-03

The main changes are:

1) Support BTF in kernel modules, from Andrii.

2) Introduce preferred busy-polling, from Björn.

3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

4) Memcg-based memory accounting for bpf objects, from Roman.

5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
  selftests/bpf: Fix invalid use of strncat in test_sockmap
  libbpf: Use memcpy instead of strncpy to please GCC
  selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
  selftests/bpf: Add tp_btf CO-RE reloc test for modules
  libbpf: Support attachment of BPF tracing programs to kernel modules
  libbpf: Factor out low-level BPF program loading helper
  bpf: Allow to specify kernel module BTFs when attaching BPF programs
  bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
  selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
  selftests/bpf: Add support for marking sub-tests as skipped
  selftests/bpf: Add bpf_testmod kernel module for testing
  libbpf: Add kernel module BTF support for CO-RE relocations
  libbpf: Refactor CO-RE relocs to not assume a single BTF object
  libbpf: Add internal helper to load BTF data by FD
  bpf: Keep module's btf_data_size intact after load
  bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
  selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
  bpf: Adds support for setting window clamp
  samples/bpf: Fix spelling mistake "recieving" -> "receiving"
  bpf: Fix cold build of test_progs-no_alu32
  ...
====================

Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 07:48:12 -08:00
Andrii Nakryiko 290248a5b7 bpf: Allow to specify kernel module BTFs when attaching BPF programs
Add ability for user-space programs to specify non-vmlinux BTF when attaching
BTF-powered BPF programs: raw_tp, fentry/fexit/fmod_ret, LSM, etc. For this,
attach_prog_fd (now with the alias name attach_btf_obj_fd) should specify FD
of a module or vmlinux BTF object. For backwards compatibility reasons,
0 denotes vmlinux BTF. Only kernel BTF (vmlinux or module) can be specified.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-11-andrii@kernel.org
2020-12-03 17:38:21 -08:00
Andrii Nakryiko 22dc4a0f5e bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
Remove a permeating assumption thoughout BPF verifier of vmlinux BTF. Instead,
wherever BTF type IDs are involved, also track the instance of struct btf that
goes along with the type ID. This allows to gradually add support for kernel
module BTFs and using/tracking module types across BPF helper calls and
registers.

This patch also renames btf_id() function to btf_obj_id() to minimize naming
clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID.

Also, altough btf_vmlinux can't get destructed and thus doesn't need
refcounting, module BTFs need that, so apply BTF refcounting universally when
BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This
makes for simpler clean up code.

Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF
trampoline key to include BTF object ID. To differentiate that from target
program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are
not allowed to take full 32 bits, so there is no danger of confusing that bit
with a valid BTF type ID.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
2020-12-03 17:38:21 -08:00
Andrii Nakryiko 2fe8890848 bpf: Keep module's btf_data_size intact after load
Having real btf_data_size stored in struct module is benefitial to quickly
determine which kernel modules have associated BTF object and which don't.
There is no harm in keeping this info, as opposed to keeping invalid pointer.

Fixes: 607c543f93 ("bpf: Sanitize BTF data pointer after module is loaded")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-3-andrii@kernel.org
2020-12-03 17:38:20 -08:00
Andrii Nakryiko 12cc126df8 bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
__module_address() needs to be called with preemption disabled or with
module_mutex taken. preempt_disable() is enough for read-only uses, which is
what this fix does. Also, module_put() does internal check for NULL, so drop
it as well.

Fixes: a38d1107f9 ("bpf: support raw tracepoints in modules")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-2-andrii@kernel.org
2020-12-03 17:38:20 -08:00
Jakub Kicinski 55fd59b003 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:
	drivers/net/ethernet/ibm/ibmvnic.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 15:44:09 -08:00
Peter Zijlstra 8af26be062 perf/core: Fix arch_perf_get_page_size()
The (new) page-table walker in arch_perf_get_page_size() is broken in
various ways. Specifically while it is used in a lockless manner, it
doesn't depend on CONFIG_HAVE_FAST_GUP nor uses the proper _lockless
offset methods, nor is careful to only read each entry only once.

Also the hugetlb support is broken due to calling pte_page() without
first checking pte_special().

Rewrite the whole thing to be a proper lockless page-table walker and
employ the new pXX_leaf_size() pgtable functions to determine the
pagetable size without looking at the page-frames.

Fixes: 51b646b2d9 ("perf,mm: Handle non-page-table-aligned hugetlbfs")
Fixes: 8d97e71811 ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20201126124207.GM3040@hirez.programming.kicks-ass.net
2020-12-03 10:14:51 +01:00
Roman Gushchin 3ac1f01b43 bpf: Eliminate rlimit-based memory accounting for bpf progs
Do not use rlimit-based memory accounting for bpf progs. It has been
replaced with memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-34-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin 80ee81e040 bpf: Eliminate rlimit-based memory accounting infra for bpf maps
Remove rlimit-based accounting infrastructure code, which is not used
anymore.

To provide a backward compatibility, use an approximation of the
bpf map memory footprint as a "memlock" value, available to a user
via map info. The approximation is based on the maximal number of
elements and key and value sizes.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-33-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin ab31be378a bpf: Eliminate rlimit-based memory accounting for bpf local storage maps
Do not use rlimit-based memory accounting for bpf local storage maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-32-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin 370868107b bpf: Eliminate rlimit-based memory accounting for stackmap maps
Do not use rlimit-based memory accounting for stackmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-30-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin abbdd0813f bpf: Eliminate rlimit-based memory accounting for bpf ringbuffer
Do not use rlimit-based memory accounting for bpf ringbuffer.
It has been replaced with the memcg-based memory accounting.

bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM)
and a valid pointer, so to simplify the code make it return NULL
in the first case. This allows to drop a couple of lines in
ringbuf_map_alloc() and also makes it look similar to other memory
allocating function like kmalloc().

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-28-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin db54330d3e bpf: Eliminate rlimit-based memory accounting for reuseport_array maps
Do not use rlimit-based memory accounting for reuseport_array maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-27-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin a37fb7ef24 bpf: Eliminate rlimit-based memory accounting for queue_stack_maps maps
Do not use rlimit-based memory accounting for queue_stack maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-26-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin cbddcb574d bpf: Eliminate rlimit-based memory accounting for lpm_trie maps
Do not use rlimit-based memory accounting for lpm_trie maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-25-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 755e5d5536 bpf: Eliminate rlimit-based memory accounting for hashtab maps
Do not use rlimit-based memory accounting for hashtab maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-24-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 844f157f6c bpf: Eliminate rlimit-based memory accounting for devmap maps
Do not use rlimit-based memory accounting for devmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-23-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 087b0d39fe bpf: Eliminate rlimit-based memory accounting for cgroup storage maps
Do not use rlimit-based memory accounting for cgroup storage maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-22-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 711cabaf14 bpf: Eliminate rlimit-based memory accounting for cpumap maps
Do not use rlimit-based memory accounting for cpumap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-21-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin f043733f31 bpf: Eliminate rlimit-based memory accounting for bpf_struct_ops maps
Do not use rlimit-based memory accounting for bpf_struct_ops maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-20-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 1bc5975613 bpf: Eliminate rlimit-based memory accounting for arraymap maps
Do not use rlimit-based memory accounting for arraymap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-19-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin e9aae8beba bpf: Memcg-based memory accounting for bpf local storage maps
Account memory used by bpf local storage maps:
per-socket, per-inode and per-task storages.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-16-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin be4035c734 bpf: Memcg-based memory accounting for bpf ringbuffer
Enable the memcg-based memory accounting for the memory used by
the bpf ringbuffer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-15-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 353e7af4bf bpf: Memcg-based memory accounting for lpm_trie maps
Include lpm trie and lpm trie node objects into the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-14-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 881456811a bpf: Refine memcg-based memory accounting for hashtab maps
Include percpu objects and the size of map metadata into the
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-13-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 1440290adf bpf: Refine memcg-based memory accounting for devmap maps
Include map metadata and the node size (struct bpf_dtab_netdev)
into the accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-12-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 3a61c7c58b bpf: Memcg-based memory accounting for cgroup storage maps
Account memory used by cgroup storage maps including metadata
structures.

Account the percpu memory for the percpu flavor of cgroup storage.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-11-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin e88cc05b61 bpf: Refine memcg-based memory accounting for cpumap maps
Include metadata and percpu data into the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-10-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 6d192c7938 bpf: Refine memcg-based memory accounting for arraymap maps
Include percpu arrays and auxiliary data into the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-9-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin d5299b67dd bpf: Memcg-based memory accounting for bpf maps
This patch enables memcg-based memory accounting for memory allocated
by __bpf_map_area_alloc(), which is used by many types of bpf maps for
large initial memory allocations.

Please note, that __bpf_map_area_alloc() should not be used outside of
map creation paths without setting the active memory cgroup to the
map's memory cgroup.

Following patches in the series will refine the accounting for
some of the map types.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-8-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 48edc1f78a bpf: Prepare for memcg-based memory accounting for bpf maps
Bpf maps can be updated from an interrupt context and in such
case there is no process which can be charged. It makes the memory
accounting of bpf maps non-trivial.

Fortunately, after commit 4127c6504f ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and commit b87d8cefe4
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.

To make the ownership model simple and consistent, when the map
is created, the memory cgroup of the current process is recorded.
All subsequent allocations related to the bpf map are charged to
the same memory cgroup. It includes allocations made by any processes
(even if they do belong to a different cgroup) and from interrupts.

This commit introduces 3 new helpers, which will be used by following
commits to enable the accounting of bpf maps memory:
  - bpf_map_kmalloc_node()
  - bpf_map_kzalloc()
  - bpf_map_alloc_percpu()

They are wrapping popular memory allocation functions. They set
the active memory cgroup to the map's memory cgroup and add
__GFP_ACCOUNT to the passed gfp flags. Then they call into
the corresponding memory allocation function and restore
the original active memory cgroup.

These helpers are supposed to use everywhere except the map creation
path. During the map creation when the map structure is allocated by
itself, it cannot be passed to those helpers. In those cases default
memory allocation function will be used with the __GFP_ACCOUNT flag.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-7-guro@fb.com
2020-12-02 18:32:34 -08:00
Roman Gushchin ddf8503c7c bpf: Memcg-based memory accounting for bpf progs
Include memory used by bpf programs into the memcg-based accounting.
This includes the memory used by programs itself, auxiliary data,
statistics and bpf line info. A memory cgroup containing the
process which loads the program is getting charged.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-6-guro@fb.com
2020-12-02 18:28:06 -08:00
Roman Gushchin bcfe06bf26 mm: memcontrol: Use helpers to read page's memcg data
Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace.  The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter.  Pages with a type set can't be mapped to
userspace.

But in general the kmemcg flag has nothing to do with mapping to
userspace.  It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.

Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.

This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions.  As the
result the code became more robust with fewer open-coded bit tricks.

This patch (of 4):

Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.

It creates an obstacle on a way to reuse some bits of the pointer for
storing additional bits of information.  In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.

This commits uses 2 existing helpers and introduces a new helper to
converts all read sides to calls of these helpers:
  struct mem_cgroup *page_memcg(struct page *page);
  struct mem_cgroup *page_memcg_rcu(struct page *page);
  struct mem_cgroup *page_memcg_check(struct page *page);

page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector.  It does
check the lowest bit, and if set, returns NULL.  page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.

To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
2020-12-02 18:28:05 -08:00
Frederic Weisbecker d14ce74f1f irq: Call tick_irq_enter() inside HARDIRQ_OFFSET
Now that account_hardirq_enter() is called after HARDIRQ_OFFSET has
been incremented, there is nothing left that prevents us from also
moving tick_irq_enter() after HARDIRQ_OFFSET is incremented.

The desired outcome is to remove the nasty hack that prevents softirqs
from being raised through ksoftirqd instead of the hardirq bottom half.
Also tick_irq_enter() then becomes appropriately covered by lockdep.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-6-frederic@kernel.org
2020-12-02 20:20:05 +01:00
Frederic Weisbecker d3759e7184 irqtime: Move irqtime entry accounting after irq offset incrementation
IRQ time entry is currently accounted before HARDIRQ_OFFSET or
SOFTIRQ_OFFSET are incremented. This is convenient to decide to which
index the cputime to account is dispatched.

Unfortunately it prevents tick_irq_enter() from being called under
HARDIRQ_OFFSET because tick_irq_enter() has to be called before the IRQ
entry accounting due to the necessary clock catch up. As a result we
don't benefit from appropriate lockdep coverage on tick_irq_enter().

To prepare for fixing this, move the IRQ entry cputime accounting after
the preempt offset is incremented. This requires the cputime dispatch
code to handle the extra offset.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201202115732.27827-5-frederic@kernel.org
2020-12-02 20:20:05 +01:00
Frederic Weisbecker 8a6a5920d3 sched/vtime: Consolidate IRQ time accounting
The 3 architectures implementing CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
all have their own version of irq time accounting that dispatch the
cputime to the appropriate index: hardirq, softirq, system, idle,
guest... from an all-in-one function.

Instead of having these ad-hoc versions, move the cputime destination
dispatch decision to the core code and leave only the actual per-index
cputime accounting to the architecture.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-4-frederic@kernel.org
2020-12-02 20:20:05 +01:00
Frederic Weisbecker 2b91ec9f55 s390/vtime: Use the generic IRQ entry accounting
s390 has its own version of IRQ entry accounting because it doesn't
account the idle time the same way the other architectures do. Only
the actual idle sleep time is accounted as idle time, the rest of the
idle task execution is accounted as system time.

Make the generic IRQ entry accounting aware of architectures that have
their own way of accounting idle time and convert s390 to use it.

This prepares s390 to get involved in further consolidations of IRQ
time accounting.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-3-frederic@kernel.org
2020-12-02 20:20:04 +01:00
Frederic Weisbecker 7197688b20 sched/cputime: Remove symbol exports from IRQ time accounting
account_irq_enter_time() and account_irq_exit_time() are not called
from modules. EXPORT_SYMBOL_GPL() can be safely removed from the IRQ
cputime accounting functions called from there.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-2-frederic@kernel.org
2020-12-02 20:20:04 +01:00
Sven Schnelle c6156e1da6 entry: Add syscall_exit_to_user_mode_work()
This is the same as syscall_exit_to_user_mode() but without calling
exit_to_user_mode(). This can be used if there is an architectural reason
to avoid the combo function, e.g. restarting a syscall without returning to
userspace. Before returning to user space the caller has to invoke
exit_to_user_mode().

[ tglx: Amended comments ]

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-6-svens@linux.ibm.com
2020-12-02 15:07:58 +01:00
Sven Schnelle 310de1a678 entry: Add exit_to_user_mode() wrapper
Called from architecture specific code when syscall_exit_to_user_mode() is
not suitable. It simply calls __exit_to_user_mode().

This way __exit_to_user_mode() can still be inlined because it is declared
static __always_inline.

[ tglx: Amended comments and moved it to a different place in the header ]

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-5-svens@linux.ibm.com
2020-12-02 15:07:57 +01:00
Sven Schnelle 96e2fbccd0 entry_Add_enter_from_user_mode_wrapper
To be called from architecture specific code if the combo interfaces are
not suitable. It simply calls __enter_from_user_mode(). This way
__enter_from_user_mode will still be inlined because it is declared static
__always_inline.

[ tglx: Amend comments and move it to a different location in the header ]

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-4-svens@linux.ibm.com
2020-12-02 15:07:57 +01:00
Sven Schnelle bb793562f0 entry: Rename exit_to_user_mode()
In order to make this function publicly available rename it so it can still
be inlined. An additional exit_to_user_mode() function will be added with
a later commit.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-3-svens@linux.ibm.com
2020-12-02 15:07:57 +01:00
Sven Schnelle 6666bb714f entry: Rename enter_from_user_mode()
In order to make this function publicly available rename it so it can still
be inlined. An additional enter_from_user_mode() function will be added with
a later commit.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-2-svens@linux.ibm.com
2020-12-02 15:07:57 +01:00
Gabriel Krisman Bertazi 11894468e3 entry: Support Syscall User Dispatch on common syscall entry
Syscall User Dispatch (SUD) must take precedence over seccomp and
ptrace, since the use case is emulation (it can be invoked with a
different ABI) such that seccomp filtering by syscall number doesn't
make sense in the first place.  In addition, either the syscall is
dispatched back to userspace, in which case there is no resource for to
trace, or the syscall will be executed, and seccomp/ptrace will execute
next.

Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as
well, just to prevent a trace exit event when dispatch was triggered.
For that, the on_syscall_dispatch() examines context to skip the
tracepoint, audit and other work.

[ tglx: Add a comment on the exit side ]

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-5-krisman@collabora.com
2020-12-02 15:07:56 +01:00
Gabriel Krisman Bertazi 1446e1df9e kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])

The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector.  This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-12-02 15:07:56 +01:00
Linus Torvalds ef6900acc8 Tracing fixes for 5.10-rc6
- Use correct timestamp variable for ring buffer write stamp update
  - Fix up before stamp and write stamp when crossing ring buffer sub
    buffers
  - Keep a zero delta in ring buffer in slow path if cmpxchg fails
  - Fix trace_printk static buffer for archs that care
  - Fix ftrace record accounting for ftrace ops with trampolines
  - Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency
  - Remove WARN_ON in hwlat tracer that triggers on something that is OK
  - Make "my_tramp" trampoline in ftrace direct sample code global
  - Fixes in the bootconfig tool for better alignment management
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX8ZzghQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qg0JAQCII1bDQyF3APLlNFRqfHf3bTo7Zl5z
 WaUd1Cd7JkY+WAD/eF1dWjN0JRtfU+oRlk6UZ4oNmp8WMJvQ7oV26ub2egE=
 =lts8
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Use correct timestamp variable for ring buffer write stamp update

 - Fix up before stamp and write stamp when crossing ring buffer sub
   buffers

 - Keep a zero delta in ring buffer in slow path if cmpxchg fails

 - Fix trace_printk static buffer for archs that care

 - Fix ftrace record accounting for ftrace ops with trampolines

 - Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency

 - Remove WARN_ON in hwlat tracer that triggers on something that is OK

 - Make "my_tramp" trampoline in ftrace direct sample code global

 - Fixes in the bootconfig tool for better alignment management

* tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Always check to put back before stamp when crossing pages
  ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency
  ftrace: Fix updating FTRACE_FL_TRAMP
  tracing: Fix alignment of static buffer
  tracing: Remove WARN_ON in start_thread()
  samples/ftrace: Mark my_tramp[12]? global
  ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next()
  ring-buffer: Update write stamp with the correct ts
  docs: bootconfig: Update file format on initrd image
  tools/bootconfig: Align the bootconfig applied initrd image size to 4
  tools/bootconfig: Fix to check the write failure correctly
  tools/bootconfig: Fix errno reference after printf()
2020-12-01 15:30:18 -08:00
Sami Tolvanen a2abe7cbd8 scs: switch to vmapped shadow stacks
The kernel currently uses kmem_cache to allocate shadow call stacks,
which means an overflows may not be immediately detected and can
potentially result in another task's shadow stack to be overwritten.

This change switches SCS to use virtually mapped shadow stacks for
tasks, which increases shadow stack size to a full page and provides
more robust overflow detection, similarly to VMAP_STACK.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20201130233442.2562064-2-samitolvanen@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2020-12-01 10:30:28 +00:00
Steven Rostedt (VMware) 68e10d5ff5 ring-buffer: Always check to put back before stamp when crossing pages
The current ring buffer logic checks to see if the updating of the event
buffer was interrupted, and if it is, it will try to fix up the before stamp
with the write stamp to make them equal again. This logic is flawed, because
if it is not interrupted, the two are guaranteed to be different, as the
current event just updated the before stamp before allocation. This
guarantees that the next event (this one or another interrupting one) will
think it interrupted the time updates of a previous event and inject an
absolute time stamp to compensate.

The correct logic is to always update the timestamps when traversing to a
new sub buffer.

Cc: stable@vger.kernel.org
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 23:21:51 -05:00
Naveen N. Rao 49a962c075 ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency
DYNAMIC_FTRACE_WITH_DIRECT_CALLS should depend on
DYNAMIC_FTRACE_WITH_REGS since we need ftrace_regs_caller().

Link: https://lkml.kernel.org/r/fc4b257ea8689a36f086d2389a9ed989496ca63a.1606412433.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 21:43:08 -05:00
Naveen N. Rao 4c75b0ff4e ftrace: Fix updating FTRACE_FL_TRAMP
On powerpc, kprobe-direct.tc triggered FTRACE_WARN_ON() in
ftrace_get_addr_new() followed by the below message:
  Bad trampoline accounting at: 000000004222522f (wake_up_process+0xc/0x20) (f0000001)

The set of steps leading to this involved:
- modprobe ftrace-direct-too
- enable_probe
- modprobe ftrace-direct
- rmmod ftrace-direct <-- trigger

The problem turned out to be that we were not updating flags in the
ftrace record properly. From the above message about the trampoline
accounting being bad, it can be seen that the ftrace record still has
FTRACE_FL_TRAMP set though ftrace-direct module is going away. This
happens because we are checking if any ftrace_ops has the
FTRACE_FL_TRAMP flag set _before_ updating the filter hash.

The fix for this is to look for any _other_ ftrace_ops that also needs
FTRACE_FL_TRAMP.

Link: https://lkml.kernel.org/r/56c113aa9c3e10c19144a36d9684c7882bf09af5.1606412433.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: a124692b69 ("ftrace: Enable trampoline when rec count returns back to one")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 21:43:07 -05:00
Minchan Kim 8fa655a3a0 tracing: Fix alignment of static buffer
With 5.9 kernel on ARM64, I found ftrace_dump output was broken but
it had no problem with normal output "cat /sys/kernel/debug/tracing/trace".

With investigation, it seems coping the data into temporal buffer seems to
break the align binary printf expects if the static buffer is not aligned
with 4-byte. IIUC, get_arg in bstr_printf expects that args has already
right align to be decoded and seq_buf_bprintf says ``the arguments are saved
in a 32bit word array that is defined by the format string constraints``.
So if we don't keep the align under copy to temporal buffer, the output
will be broken by shifting some bytes.

This patch fixes it.

Link: https://lkml.kernel.org/r/20201125225654.1618966-1-minchan@kernel.org

Cc: <stable@vger.kernel.org>
Fixes: 8e99cf91b9 ("tracing: Do not allocate buffer in trace_find_next_entry() in atomic")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 21:43:07 -05:00
Vasily Averin 310e3a4b5a tracing: Remove WARN_ON in start_thread()
This patch reverts commit 978defee11 ("tracing: Do a WARN_ON()
 if start_thread() in hwlat is called when thread exists")

.start hook can be legally called several times if according
tracer is stopped

screen window 1
[root@localhost ~]# echo 1 > /sys/kernel/tracing/events/kmem/kfree/enable
[root@localhost ~]# echo 1 > /sys/kernel/tracing/options/pause-on-trace
[root@localhost ~]# less -F /sys/kernel/tracing/trace

screen window 2
[root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on
0
[root@localhost ~]# echo hwlat >  /sys/kernel/debug/tracing/current_tracer
[root@localhost ~]# echo 1 > /sys/kernel/debug/tracing/tracing_on
[root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on
0
[root@localhost ~]# echo 2 > /sys/kernel/debug/tracing/tracing_on

triggers warning in dmesg:
WARNING: CPU: 3 PID: 1403 at kernel/trace/trace_hwlat.c:371 hwlat_tracer_start+0xc9/0xd0

Link: https://lkml.kernel.org/r/bd4d3e70-400d-9c82-7b73-a2d695e86b58@virtuozzo.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 978defee11 ("tracing: Do a WARN_ON() if start_thread() in hwlat is called when thread exists")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 21:43:07 -05:00
Andrea Righi 8785f51a17 ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next()
In the slow path of __rb_reserve_next() a nested event(s) can happen
between evaluating the timestamp delta of the current event and updating
write_stamp via local_cmpxchg(); in this case the delta is not valid
anymore and it should be set to 0 (same timestamp as the interrupting
event), since the event that we are currently processing is not the last
event in the buffer.

Link: https://lkml.kernel.org/r/X8IVJcp1gRE+FJCJ@xps-13-7390

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lwn.net/Articles/831207
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 15:22:05 -05:00
Steven Rostedt (VMware) 55ea4cf403 ring-buffer: Update write stamp with the correct ts
The write stamp, used to calculate deltas between events, was updated with
the stale "ts" value in the "info" structure, and not with the updated "ts"
variable. This caused the deltas between events to be inaccurate, and when
crossing into a new sub buffer, had time go backwards.

Link: https://lkml.kernel.org/r/20201124223917.795844-1-elavila@google.com

Cc: stable@vger.kernel.org
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reported-by: "J. Avila" <elavila@google.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Tested-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 15:21:31 -05:00
Marc Zyngier 4615fbc378 genirq/irqdomain: Don't try to free an interrupt that has no mapping
When an interrupt allocation fails for N interrupts, it is pretty
common for the error handling code to free the same number of interrupts,
no matter how many interrupts have actually been allocated.

This may result in the domain freeing code to be unexpectedly called
for interrupts that have no mapping in that domain. Things end pretty
badly.

Instead, add some checks to irq_domain_free_irqs_hierarchy() to make sure
that thiss does not follow the hierarchy if no mapping exists for a given
interrupt.

Fixes: 6a6544e520 ("genirq/irqdomain: Remove auto-recursive hierarchy support")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201129135551.396777-1-maz@kernel.org
2020-11-30 14:50:21 +01:00
Laurent Vivier bb4c6910c8 genirq/irqdomain: Add an irq_create_mapping_affinity() function
There is currently no way to convey the affinity of an interrupt
via irq_create_mapping(), which creates issues for devices that
expect that affinity to be managed by the kernel.

In order to sort this out, rename irq_create_mapping() to
irq_create_mapping_affinity() with an additional affinity parameter that
can be passed down to irq_domain_alloc_descs().

irq_create_mapping() is re-implemented as a wrapper around
irq_create_mapping_affinity().

No functional change.

Fixes: e75eafb9b0 ("genirq/msi: Switch to new irq spreading infrastructure")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201126082852.1178497-2-lvivier@redhat.com
2020-11-30 12:21:31 +01:00
Linus Torvalds f91a3aa6bc Yet two more places which invoke tracing from RCU disabled regions in the
idle path. Similar to the entry path the low level idle functions have to
 be non-instrumentable.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/DpAUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXSLD/9klc0YimnEnROW6Q5Svb2IcyIutmXF
 bOIY1bYYoKILOBj3wyvDUhmdMuq5zh7H9yG11hO8MaVVWVQcLcOMLdHTYm9dcdmF
 xQk33+xqjuhRShB+nEmC9ayYtWogtH6W6uZ6WDtF9ZltMKU85n5ddGJ/Fvo+HoCb
 NbOdHGJdJ3/3ZCeHnxOnxM+5/GwjkBuccTV/tXmb3yXrfU9DBySyQ4/UchcpF43w
 LcEb0kiQbpZsBTByKJOQV8+RR654S0sILlvRwVXpmj94vrgGwhlVk1/9rz7tkOhF
 ksoo1mTVu75LMt22G/hXxE63787yRvFdHjapf0+kCOAuhl992NK+xlGDH8o9DXcu
 9y73D4bI0HnDFs20w6vs20iLvxECJiYHJqlgR5ZwFUToceaNgtiYr8kzuD7Zbae1
 KG2E7BuNSwHWMtf97fGn44GZknPEOaKdDn4Wv6/bvKHxLm77qe11RKF70Stcz2AI
 am13KmQzzsHGF5qNWwpElRUxSdxfJMR66RnOdTQULGrRedaZTFol/y2pnVzTSe3k
 SZnlpL5kE7y92UYDogPb5wWA7b+YkJN0OdSkRFy1FH26ZG8E4M7ZJ2tql5Sw7pGM
 lsTjXpAUphnK5rz7QcYE8KAZWj//fIAcElIrvdklVcBnS3IqjfksYW27B64133vx
 cT1B/lA1PHXj6Q==
 =raED
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two more places which invoke tracing from RCU disabled regions in the
  idle path.

  Similar to the entry path the low level idle functions have to be
  non-instrumentable"

* tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  intel_idle: Fix intel_idle() vs tracing
  sched/idle: Fix arch_cpu_idle() vs tracing
2020-11-29 11:19:26 -08:00