Commit Graph

840602 Commits

Author SHA1 Message Date
Stefano Brivio 73f51d151e selftests: pmtu: Fix encapsulating device in pmtu_vti6_link_change_mtu
In the pmtu_vti6_link_change_mtu test, both local and remote addresses
for the vti6 tunnel are assigned to the same address given to the dummy
interface that we use as encapsulating device with a known MTU.

This works as long as the dummy interface is actually selected, via
rt6_lookup(), as encapsulating device. But if the remote address of the
tunnel is a local address too, the loopback interface could also be
selected, and there's nothing wrong with it.

This is what some older -stable kernels do (3.18.z, at least), and
nothing prevents us from subtly changing FIB implementation to revert
back to that behaviour in the future.

Define an IPv6 prefix instead, and use two separate addresses as local
and remote for vti6, so that the encapsulating device can't be a
loopback interface.

Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: 1fad59ea1c ("selftests: pmtu: Add pmtu_vti6_link_change_mtu test")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:15:06 -07:00
Camelia Groza cbe9e83594 enetc: Enable TC offloading with mqprio
Add support to configure multiple prioritized TX traffic
classes with mqprio.

Configure one BD ring per TC for the moment, one netdev
queue per TC.

Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:11:02 -07:00
David S. Miller 7f3343234c Merge branch 'stmmac-SPDX'
Neil Armstrong says:

====================
net: stmmac: dwmac-meson: update with SPDX Licence identifier

Update the SPDX Licence identifier for the Amlogic Meson6 and Meson8 dwmac
glue drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:09:15 -07:00
Neil Armstrong 56aaa114f0 net: stmmac: dwmac-meson8b: update with SPDX Licence identifier
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:09:15 -07:00
Neil Armstrong f87845cf0f net: stmmac: dwmac-meson: update with SPDX Licence identifier
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:09:15 -07:00
Sasha Neftin 62a5b8429e igc: Cleanup the redundant code
The default flow control settings for the i225 device is both
'rx' and 'tx' pause frames. There is no depend on the NVM value.
This patch comes to fix this and clean up the driver code.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 16:15:55 -07:00
Sasha Neftin 0373ad4d05 igc: Add flow control support
This change adds flow control settings. This is required to
enable the legacy flow control support.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 16:13:57 -07:00
Konstantin Khlebnikov d17ba0f616 e1000e: start network tx queue only when link is up
Driver does not want to keep packets in Tx queue when link is lost.
But present code only reset NIC to flush them, but does not prevent
queuing new packets. Moreover reset sequence itself could generate
new packets via netconsole and NIC falls into endless reset loop.

This patch wakes Tx queue only when NIC is ready to send packets.

This is proper fix for problem addressed by commit 0f9e980bf5
("e1000e: fix cyclic resets at link up with active tx").

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Tested-by: Joseph Yasi <joe.yasi@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 16:08:43 -07:00
Konstantin Khlebnikov caff422ea8 Revert "e1000e: fix cyclic resets at link up with active tx"
This reverts commit 0f9e980bf5.

That change cased false-positive warning about hardware hang:

e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
   TDH                  <0>
   TDT                  <1>
   next_to_use          <1>
   next_to_clean        <0>
buffer_info[next_to_clean]:
   time_stamp           <fffba7a7>
   next_to_watch        <0>
   jiffies              <fffbb140>
   next_to_watch.status <0>
MAC Status             <40080080>
PHY Status             <7949>
PHY 1000BASE-T Status  <0>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Besides warning everything works fine.
Original issue will be fixed property in following patch.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Joseph Yasi <joe.yasi@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203175
Tested-by: Joseph Yasi <joe.yasi@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 16:01:01 -07:00
Sasha Neftin 16ecd8d9af igc: Remove the obsolete workaround
Enables a resend request after the completion timeout workaround is not
relevant for i225 device. This patch is clean code relevant this
workaround.
Minor cosmetic fixes, replace the 'spaces' with 'tabs'

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 15:57:59 -07:00
Sasha Neftin 796bfb1035 igc: Clean up unused pointers
Few function pointers from phy_operations structure were unused.
This patch cleans those.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 15:56:11 -07:00
Sasha Neftin ae586f0b39 igc: Fix double definitions
Collision threshold and threshold's shift has been defined twice.
This patch comes to fix that.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 15:54:39 -07:00
Gustavo A. R. Silva 42277cedba igb: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch fixes the following warning:

drivers/net/ethernet/intel/igb/e1000_82575.c: In function ‘igb_get_invariants_82575’:
drivers/net/ethernet/intel/igb/e1000_82575.c:636:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (igb_sgmii_uses_mdio_82575(hw)) {
      ^
drivers/net/ethernet/intel/igb/e1000_82575.c:642:2: note: here
  case E1000_CTRL_EXT_LINK_MODE_PCIE_SERDES:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

Notice that, in this particular case, the code comment is modified
in accordance with what GCC is expecting to find.

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 15:52:37 -07:00
Gustavo A. R. Silva b7b3ad7aaf igb: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch fixes the following warning:

drivers/net/ethernet/intel/igb/igb_main.c: In function ‘__igb_notify_dca’:
drivers/net/ethernet/intel/igb/igb_main.c:6694:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (dca_add_requester(dev) == 0) {
      ^
drivers/net/ethernet/intel/igb/igb_main.c:6701:2: note: here
  case DCA_PROVIDER_REMOVE:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

Notice that, in this particular case, the code comment is modified
in accordance with what GCC is expecting to find.

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 15:48:42 -07:00
Feng Tang 47e16692b2 igb/igc: warn when fatal read failure happens
Failed in read the HW register is very serious for igb/igc driver,
as its hw_addr will be set to NULL and cause the adapter be seen as
"REMOVED".

We saw the error only a few times in the MTBF test for suspend/resume,
but can hardly get any useful info to debug.

Adding WARN() so that we can get the necessary information about
where and how it happens, and use it for root causing and fixing
this "PCIe link lost issue"

This affects igb, igc.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Acked-by: Sasha Neftin <sasha.neftin@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-05-28 15:42:32 -07:00
Tomas Bortoli dfb4a6f219 tracing: Avoid memory leak in predicate_parse()
In case of errors, predicate_parse() goes to the out_free label
to free memory and to return an error code.

However, predicate_parse() does not free the predicates of the
temporary prog_stack array, thence leaking them.

Link: http://lkml.kernel.org/r/20190528154338.29976-1-tomasbortoli@gmail.com

Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: syzbot+6b8e0fb820e570c59e19@syzkaller.appspotmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
[ Added protection around freeing prog_stack[i].pred ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-28 16:27:58 -04:00
Madalin Bucur c7ae09253c fsl/fman: include IPSEC SPI in the Keygen extraction
The keygen extracted fields are used as input for the hash that
determines the incoming frames distribution. Adding IPSEC SPI so
different IPSEC flows can be distributed to different CPUs.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 11:21:03 -07:00
Madalin Bucur 7aae703f80 dpaa_eth: use only online CPU portals
Make sure only the portals for the online CPUs are used.
Without this change, there are issues when someone boots with
maxcpus=n, with n < actual number of cores available as frames
either received or corresponding to the transmit confirmation
path would be offered for dequeue to the offline CPU portals,
getting lost.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 11:19:55 -07:00
Maxime Chevallier 7af5b07751 net: mvpp2: cls: Check RSS table index validity when creating a context
Make sure we don't use an out-of-bound index for the per-port RSS
context array.

As of today, the global context creation in mvpp22_rss_context_create
will prevent us from reaching this case, but we should still make sure
we are using a sane value anyway.

Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 11:18:44 -07:00
Jisheng Zhang d484e06e25 net: mvneta: Fix err code path of probe
Fix below issues in err code path of probe:
1. we don't need to unregister_netdev() because the netdev isn't
registered.
2. when register_netdev() fails, we also need to destroy bm pool for
HWBM case.

Fixes: dc35a10f68 ("net: mvneta: bm: add support for hardware buffer management")
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 11:05:00 -07:00
Thierry Reding 54ed6fd2e0 net: stmmac: Do not output error on deferred probe
If the subdriver defers probe, do not show an error message. It's
perfectly fine for this error to occur since the driver will get another
chance to probe after some time and will usually succeed after all of
the resources that it requires have been registered.

Signed-off-by: Thierry Reding <treding@nvidia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 11:00:09 -07:00
Filipe Manana 06989c799f Btrfs: fix race updating log root item during fsync
When syncing the log, the final phase of a fsync operation, we need to
either create a log root's item or update the existing item in the log
tree of log roots, and that depends on the current value of the log
root's log_transid - if it's 1 we need to create the log root item,
otherwise it must exist already and we update it. Since there is no
synchronization between updating the log_transid and checking it for
deciding whether the log root's item needs to be created or updated, we
end up with a tiny race window that results in attempts to update the
item to fail because the item was not yet created:

              CPU 1                                    CPU 2

  btrfs_sync_log()

    lock root->log_mutex

    set log root's log_transid to 1

    unlock root->log_mutex

                                               btrfs_sync_log()

                                                 lock root->log_mutex

                                                 sets log root's
                                                 log_transid to 2

                                                 unlock root->log_mutex

    update_log_root()

      sees log root's log_transid
      with a value of 2

        calls btrfs_update_root(),
        which fails with -EUCLEAN
        and causes transaction abort

Until recently the race lead to a BUG_ON at btrfs_update_root(), but after
the recent commit 7ac1e464c4 ("btrfs: Don't panic when we can't find a
root key") we just abort the current transaction.

A sample trace of the BUG_ON() on a SLE12 kernel:

  ------------[ cut here ]------------
  kernel BUG at ../fs/btrfs/root-tree.c:157!
  Oops: Exception in kernel mode, sig: 5 [#1]
  SMP NR_CPUS=2048 NUMA pSeries
  (...)
  Supported: Yes, External
  CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G                 X 4.4.156-94.57-default #1
  task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
  NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
  REGS: c00000ff42b0b860 TRAP: 0700   Tainted: G                 X  (4.4.156-94.57-default)
  MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 22444484  XER: 20000000
  CFAR: d000000036aba66c SOFTE: 1
  GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
  GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
  GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
  GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
  GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
  GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
  GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
  GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
  NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
  LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
  Call Trace:
  [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
  [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
  [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
  [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
  [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
  [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
  [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
  Instruction dump:
  7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
  e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
  ---[ end trace 8f2dc8f919cabab8 ]---

So fix this by doing the check of log_transid and updating or creating the
log root's item while holding the root's log_mutex.

Fixes: 7237f18336 ("Btrfs: fix tree logs parallel sync")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 19:26:46 +02:00
Filipe Manana 5338e43abb Btrfs: fix wrong ctime and mtime of a directory after log replay
When replaying a log that contains a new file or directory name that needs
to be added to its parent directory, we end up updating the mtime and the
ctime of the parent directory to the current time after we have set their
values to the correct ones (set at fsync time), efectivelly losing them.

Sample reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/dir
  $ touch /mnt/dir/file

  # fsync of the directory is optional, not needed
  $ xfs_io -c fsync /mnt/dir
  $ xfs_io -c fsync /mnt/dir/file

  $ stat -c %Y /mnt/dir
  1557856079

  <power failure>

  $ sleep 3
  $ mount /dev/sdb /mnt
  $ stat -c %Y /mnt/dir
  1557856082

    --> should have been 1557856079, the mtime is updated to the current
        time when replaying the log

Fix this by not updating the mtime and ctime to the current time at
btrfs_add_link() when we are replaying a log tree.

This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".

Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 19:16:16 +02:00
Filipe Manana 60d9f50308 Btrfs: fix fsync not persisting changed attributes of a directory
While logging an inode we follow its ancestors and for each one we mark
it as logged in the current transaction, even if we have not logged it.
As a consequence if we change an attribute of an ancestor, such as the
UID or GID for example, and then explicitly fsync it, we end up not
logging the inode at all despite returning success to user space, which
results in the attribute being lost if a power failure happens after
the fsync.

Sample reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/dir
  $ chown 6007:6007 /mnt/dir

  $ sync

  $ chown 9003:9003 /mnt/dir
  $ touch /mnt/dir/file
  $ xfs_io -c fsync /mnt/dir/file

  # fsync our directory after fsync'ing the new file, should persist the
  # new values for the uid and gid.
  $ xfs_io -c fsync /mnt/dir

  <power failure>

  $ mount /dev/sdb /mnt
  $ stat -c %u:%g /mnt/dir
  6007:6007

    --> should be 9003:9003, the uid and gid were not persisted, despite
        the explicit fsync on the directory prior to the power failure

Fix this by not updating the logged_trans field of ancestor inodes when
logging an inode, since we have not logged them. Let only future calls to
btrfs_log_inode() to mark inodes as logged.

This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".

Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 18:56:50 +02:00
Qu Wenruo 57949d033a btrfs: qgroup: Check bg while resuming relocation to avoid NULL pointer dereference
[BUG]
When mounting a fs with reloc tree and has qgroup enabled, it can cause
NULL pointer dereference at mount time:

  BUG: kernel NULL pointer dereference, address: 00000000000000a8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  RIP: 0010:btrfs_qgroup_add_swapped_blocks+0x186/0x300 [btrfs]
  Call Trace:
   replace_path.isra.23+0x685/0x900 [btrfs]
   merge_reloc_root+0x26e/0x5f0 [btrfs]
   merge_reloc_roots+0x10a/0x1a0 [btrfs]
   btrfs_recover_relocation+0x3cd/0x420 [btrfs]
   open_ctree+0x1bc8/0x1ed0 [btrfs]
   btrfs_mount_root+0x544/0x680 [btrfs]
   legacy_get_tree+0x34/0x60
   vfs_get_tree+0x2d/0xf0
   fc_mount+0x12/0x40
   vfs_kern_mount.part.12+0x61/0xa0
   vfs_kern_mount+0x13/0x20
   btrfs_mount+0x16f/0x860 [btrfs]
   legacy_get_tree+0x34/0x60
   vfs_get_tree+0x2d/0xf0
   do_mount+0x81f/0xac0
   ksys_mount+0xbf/0xe0
   __x64_sys_mount+0x25/0x30
   do_syscall_64+0x65/0x240
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
In btrfs_recover_relocation(), we don't have enough info to determine
which block group we're relocating, but only to merge existing reloc
trees.

Thus in btrfs_recover_relocation(), rc->block_group is NULL.
btrfs_qgroup_add_swapped_blocks() hasn't taken this into consideration,
and causes a NULL pointer dereference.

The bug is introduced by commit 3d0174f78e ("btrfs: qgroup: Only trace
data extents in leaves if we're relocating data block group"), and
later qgroup refactoring still keeps this optimization.

[FIX]
Thankfully in the context of btrfs_recover_relocation(), there is no
other progress can modify tree blocks, thus those swapped tree blocks
pair will never affect qgroup numbers, no matter whatever we set for
block->trace_leaf.

So we only need to check if @bg is NULL before accessing @bg->flags.

Reported-by: Juan Erbes <jerbes@gmail.com>
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1134806
Fixes: 3d0174f78e ("btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 18:54:10 +02:00
Qu Wenruo 30d40577e3 btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()
[BUG]
When a fs has orphan reloc tree along with unfinished balance:
  ...
        item 16 key (TREE_RELOC ROOT_ITEM FS_TREE) itemoff 12090 itemsize 439
                generation 12 root_dirid 256 bytenr 300400640 level 1 refs 0 <<<
                lastsnap 8 byte_limit 0 bytes_used 1359872 flags 0x0(none)
                uuid 7c48d938-33a3-4aae-ab19-6e5c9d406e46
        item 17 key (BALANCE TEMPORARY_ITEM 0) itemoff 11642 itemsize 448
                temporary item objectid BALANCE offset 0
                balance status flags 14

Then at mount time, we can hit the following kernel BUG_ON():
  BTRFS info (device dm-3): relocating block group 298844160 flags metadata|dup
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/relocation.c:1413!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 1 PID: 897 Comm: btrfs-balance Tainted: G           O      5.2.0-rc1-custom #15
  RIP: 0010:create_reloc_root+0x1eb/0x200 [btrfs]
  Call Trace:
   btrfs_init_reloc_root+0x96/0xb0 [btrfs]
   record_root_in_trans+0xb2/0xe0 [btrfs]
   btrfs_record_root_in_trans+0x55/0x70 [btrfs]
   select_reloc_root+0x7e/0x230 [btrfs]
   do_relocation+0xc4/0x620 [btrfs]
   relocate_tree_blocks+0x592/0x6a0 [btrfs]
   relocate_block_group+0x47b/0x5d0 [btrfs]
   btrfs_relocate_block_group+0x183/0x2f0 [btrfs]
   btrfs_relocate_chunk+0x4e/0xe0 [btrfs]
   btrfs_balance+0x864/0xfa0 [btrfs]
   balance_kthread+0x3b/0x50 [btrfs]
   kthread+0x123/0x140
   ret_from_fork+0x27/0x50

[CAUSE]
In btrfs, reloc trees are used to record swapped tree blocks during
balance.
Reloc tree either get merged (replace old tree blocks of its parent
subvolume) in next transaction if its ref is 1 (fresh).
Or is already merged and will be cleaned up if its ref is 0 (orphan).

After commit d2311e6985 ("btrfs: relocation: Delay reloc tree deletion
after merge_reloc_roots"), reloc tree cleanup is delayed until one block
group is balanced.

Since fresh reloc roots are recorded during merge, as long as there
is no power loss, those orphan reloc roots converted from fresh ones are
handled without problem.

However when power loss happens, orphan reloc roots can be recorded
on-disk, thus at next mount time, we will have orphan reloc roots from
on-disk data directly, and ignored by clean_dirty_subvols() routine.

Then when background balance starts to balance another block group, and
needs to create new reloc root for the same root, btrfs_insert_item()
returns -EEXIST, and trigger that BUG_ON().

[FIX]
For orphan reloc roots, also queue them to rc->dirty_subvol_roots, so
all reloc roots no matter orphan or not, can be cleaned up properly and
avoid above BUG_ON().

And to cooperate with above change, clean_dirty_subvols() will check if
the queued root is a reloc root or a subvol root.
For a subvol root, do the old work, and for a orphan reloc root, clean it
up.

Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 18:54:10 +02:00
Filipe Manana 3c850b4511 Btrfs: incremental send, fix emission of invalid clone operations
When doing an incremental send we can now issue clone operations with a
source range that ends at the source's file eof and with a destination
range that ends at an offset smaller then the destination's file eof.
If the eof of the source file is not aligned to the sector size of the
filesystem, the receiver will get a -EINVAL error when trying to do the
operation or, on older kernels, silently corrupt the destination file.
The corruption happens on kernels without commit ac765f83f1
("Btrfs: fix data corruption due to cloning of eof block"), while the
failure to clone happens on kernels with that commit.

Example reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt/sdb

  $ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
  $ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
  $ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
  $ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo

  $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base

  $ btrfs send -f /tmp/base.send /mnt/sdb/base

  $ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
  $ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
  $ xfs_io -c "truncate 550K" /mnt/sdb/bar

  $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr

  $ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ btrfs receive -f /tmp/base.send /mnt/sdc
  $ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
  (...)
  truncate bar size=563200
  utimes bar
  clone zoo - source=bar source offset=512000 offset=0 length=51200
  ERROR: failed to clone extents to zoo
  Invalid argument

The failure happens because the clone source range ends at the eof of file
bar, 563200, which is not aligned to the filesystems sector size (4Kb in
this case), and the destination range ends at offset 0 + 51200, which is
less then the size of the file zoo (2Mb).

So fix this by detecting such case and instead of issuing a clone
operation for the whole range, do a clone operation for smaller range
that is sector size aligned followed by a write operation for the block
containing the eof. Here we will always be pessimistic and assume the
destination filesystem of the send stream has the largest possible sector
size (64Kb), since we have no way of determining it.

This fixes a recent regression introduced in kernel 5.2-rc1.

Fixes: 040ee6120c ("Btrfs: send, improve clone range")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 18:54:10 +02:00
Filipe Manana 6b1f72e5b8 Btrfs: incremental send, fix file corruption when no-holes feature is enabled
When using the no-holes feature, if we have a file with prealloc extents
with a start offset beyond the file's eof, doing an incremental send can
cause corruption of the file due to incorrect hole detection. Such case
requires that the prealloc extent(s) exist in both the parent and send
snapshots, and that a hole is punched into the file that covers all its
extents that do not cross the eof boundary.

Example reproducer:

  $ mkfs.btrfs -f -O no-holes /dev/sdb
  $ mount /dev/sdb /mnt/sdb

  $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
  $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar

  $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base

  $ btrfs send -f /tmp/base.snap /mnt/sdb/base

  $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar

  $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr

  $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr

  $ md5sum /mnt/sdb/incr/foobar
  816df6f64deba63b029ca19d880ee10a   /mnt/sdb/incr/foobar

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ btrfs receive -f /tmp/base.snap /mnt/sdc
  $ btrfs receive -f /tmp/incr.snap /mnt/sdc

  $ md5sum /mnt/sdc/incr/foobar
  cf2ef71f4a9e90c2f6013ba3b2257ed2   /mnt/sdc/incr/foobar

    --> Different checksum, because the prealloc extent beyond the
        file's eof confused the hole detection code and it assumed
        a hole starting at offset 0 and ending at the offset of the
        prealloc extent (1200Kb) instead of ending at the offset
        500Kb (the file's size).

Fix this by ensuring we never cross the file's size when issuing the
write operations for a hole.

Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
CC: stable@vger.kernel.org # 3.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 18:54:10 +02:00
Dennis Zhou fee13fe965 btrfs: correct zstd workspace manager lock to use spin_lock_bh()
The btrfs zstd workspace manager uses a background timer to reclaim not
recently used workspaces. I used spin_lock() from this context which
should have been caught with lockdep, but was not. This deadlock was
reported in bugzilla. The fix is to switch the zstd wsm lock to use
spin_lock_bh() from the softirq context.

This happened quite relibably on ppc64, unlike on other architectures.

  [  313.402874] ================================
  [  313.402875] WARNING: inconsistent lock state
  [  313.402879] 5.1.0-rc7 #1 Not tainted
  [  313.402880] --------------------------------
  [  313.402882] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  [  313.402885] swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  [  313.402888] 0000000080d1120c (&(&wsm.lock)->rlock){+.?.}, at: .zstd_reclaim_timer_fn+0x40/0x230
  [  313.402895] {SOFTIRQ-ON-W} state was registered at:
  [  313.402899]   .lock_acquire+0xd0/0x240
  [  313.402903]   ._raw_spin_lock+0x34/0x60
  [  313.402906]   .zstd_get_workspace+0xd0/0x360
  [  313.402908]   .end_compressed_bio_read+0x3b8/0x540
  [  313.402911]   .bio_endio+0x174/0x2c0
  [  313.402914]   .end_workqueue_fn+0x4c/0x70
  [  313.402917]   .normal_work_helper+0x138/0x7e0
  [  313.402920]   .process_one_work+0x324/0x790
  [  313.402922]   .worker_thread+0x68/0x570
  [  313.402925]   .kthread+0x19c/0x1b0
  [  313.402928]   .ret_from_kernel_thread+0x58/0x78
  [  313.402930] irq event stamp: 2629216
  [  313.402933] hardirqs last  enabled at (2629216): [<c0000000009da738>] ._raw_spin_unlock_irq+0x38/0x60
  [  313.402936] hardirqs last disabled at (2629215): [<c0000000009da4c4>] ._raw_spin_lock_irq+0x24/0x70
  [  313.402939] softirqs last  enabled at (2629212): [<c0000000000af9fc>] .irq_enter+0x8c/0xd0
  [  313.402942] softirqs last disabled at (2629213): [<c0000000000afb58>] .irq_exit+0x118/0x170
  [  313.402944]
		 other info that might help us debug this:
  [  313.402945]  Possible unsafe locking scenario:

  [  313.402947]        CPU0
  [  313.402948]        ----
  [  313.402949]   lock(&(&wsm.lock)->rlock);
  [  313.402951]   <Interrupt>
  [  313.402952]     lock(&(&wsm.lock)->rlock);
  [  313.402954]
		  *** DEADLOCK ***

  [  313.402957] 1 lock held by swapper/5/0:
  [  313.402958]  #0: 000000004b612042 ((&wsm.timer)){+.-.}, at: .call_timer_fn+0x0/0x3c0
  [  313.402963]
		 stack backtrace:
  [  313.402967] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.1.0-rc7 #1
  [  313.402968] Call Trace:
  [  313.402972] [c0000007fa262e70] [c0000000009b3294] .dump_stack+0xe0/0x15c (unreliable)
  [  313.402975] [c0000007fa262f10] [c000000000125548] .print_usage_bug+0x348/0x390
  [  313.402978] [c0000007fa262fd0] [c000000000125cb4] .mark_lock+0x724/0x930
  [  313.402981] [c0000007fa263080] [c000000000126c20] .__lock_acquire+0xc90/0x16a0
  [  313.402984] [c0000007fa2631b0] [c000000000128040] .lock_acquire+0xd0/0x240
  [  313.402987] [c0000007fa263280] [c0000000009da2b4] ._raw_spin_lock+0x34/0x60
  [  313.402990] [c0000007fa263300] [c00000000054b0b0] .zstd_reclaim_timer_fn+0x40/0x230
  [  313.402993] [c0000007fa2633d0] [c000000000158b38] .call_timer_fn+0xc8/0x3c0
  [  313.402996] [c0000007fa2634a0] [c000000000158f74] .expire_timers+0x144/0x260
  [  313.402999] [c0000007fa263550] [c000000000159178] .run_timer_softirq+0xe8/0x230
  [  313.403002] [c0000007fa263680] [c0000000009db288] .__do_softirq+0x188/0x5d4
  [  313.403004] [c0000007fa263790] [c0000000000afb58] .irq_exit+0x118/0x170
  [  313.403008] [c0000007fa263800] [c000000000028d88] .timer_interrupt+0x158/0x430
  [  313.403012] [c0000007fa2638b0] [c0000000000091d4] decrementer_common+0x134/0x140
  [  313.403017] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
		     LR = .arch_local_irq_restore.part.0+0x68/0x80
  [  313.403020] [c0000007fa263bb0] [c00000000001a3ac] .arch_local_irq_restore.part.0+0x2c/0x80 (unreliable)
  [  313.403024] [c0000007fa263c30] [c0000000007bbbcc] .cpuidle_enter_state+0xec/0x670
  [  313.403027] [c0000007fa263d00] [c0000000000f5130] .call_cpuidle+0x40/0x90
  [  313.403031] [c0000007fa263d70] [c0000000000f554c] .do_idle+0x2dc/0x3a0
  [  313.403034] [c0000007fa263e30] [c0000000000f59ac] .cpu_startup_entry+0x2c/0x30
  [  313.403037] [c0000007fa263ea0] [c000000000045674] .start_secondary+0x644/0x650
  [  313.403041] [c0000007fa263f90] [c00000000000ad5c] start_secondary_prolog+0x10/0x14

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203517
Fixes: 3f93aef535 ("btrfs: add zstd compression level support")
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 18:54:09 +02:00
Nikolay Borisov debd1c065d btrfs: Ensure replaced device doesn't have pending chunk allocation
Recent FITRIM work, namely bbbf7243d6 ("btrfs: combine device update
operations during transaction commit") combined the way certain
operations are recoded in a transaction. As a result an ASSERT was added
in dev_replace_finish to ensure the new code works correctly.
Unfortunately I got reports that it's possible to trigger the assert,
meaning that during a device replace it's possible to have an unfinished
chunk allocation on the source device.

This is supposed to be prevented by the fact that a transaction is
committed before finishing the replace oepration and alter acquiring the
chunk mutex. This is not sufficient since by the time the transaction is
committed and the chunk mutex acquired it's possible to allocate a chunk
depending on the workload being executed on the replaced device. This
bug has been present ever since device replace was introduced but there
was never code which checks for it.

The correct way to fix is to ensure that there is no pending device
modification operation when the chunk mutex is acquire and if there is
repeat transaction commit. Unfortunately it's not possible to just
exclude the source device from btrfs_fs_devices::dev_alloc_list since
this causes ENOSPC to be hit in transaction commit.

Fixing that in another way would need to add special cases to handle the
last writes and forbid new ones. The looped transaction fix is more
obvious, and can be easily backported. The runtime of dev-replace is
long so there's no noticeable delay caused by that.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 391cd9df81 ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-28 18:54:00 +02:00
Linus Torvalds 9fb67d643f Pin control fixes for v5.2:
- Interrupt clearing fix for the Intel pin controllers affecting
   touchpads on some laptops.
 - Compile Kconfig fix for the STMFX expander pin controller.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc7POsAAoJEEEQszewGV1zh1sP+gLYet9hCIOdinJFznOIT6P9
 ncl6clClf+mhKpNX8DRhXGF1OWxx2Us2uXjtanUA/K7GmdazUDK3b80OX/xYFMnX
 RumR2VcABGk8vs6hh6qD92xm5PXc/IOelM1i4k29o9TpKXffaqME7BgTCsR9gjSL
 G4X3JH5kPxLjy41iKskO8ipyhtocUXhL43WrXh4tGilVKYFMtvJhwUpw2/UDj4Sw
 ex6ZjUwlj5sjYHpB52jb4NeHqPMOgPDu7St6YakoMtfu5aR0kaTOvqRwnQWjmY6m
 6/Sna3/x8oly9k77uw/ruZrmGoseYeGjrEpAEdYt2bcNM0TsudULc24vnArVG/n3
 ucv4RZ4Km2M1G/EOskSyTuxomjAcyIbIJ+ZdllGsAOcyraqd2kHvVIwReOdDyLta
 FQWDedA1KZRRnCC7GOIVtlDIoyMS3JnxXCAvVuFW2ybqoJy4HRgCot7ZaAtpeosm
 Bk7Yb4zd4wIyQvxHruaaT7MLfG1mRShkeV8c4NCot/ILO5m2uJ6L55x7ta+OU/vy
 sd29+ww8BN1UR6x06faQpvXv6KXKJmSWFGENGq1aW3WmkmOJnkit8gvAzhTbtW1S
 17ezYJLwzD0tUzj/laAzqV3gXc9XwvA1nVciwWPzm+yEaa2VlsATam8+THPQF3aU
 UJODDGHn9EY5yM1Q5Eyt
 =XtxG
 -----END PGP SIGNATURE-----

Merge tag 'pinctrl-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
 "The commits that stand out are the Intel fixes that arrived during the
  merge window and I got relayed by pull request from Andy.

  Apart from that a minor Kconfig noise.

   - Interrupt clearing fix for the Intel pin controllers affecting
     touchpads on some laptops.

   - Compile Kconfig fix for the STMFX expander pin controller"

* tag 'pinctrl-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: stmfx: Fix compile issue when CONFIG_OF_GPIO is not defined
  pinctrl: intel: Clear interrupt status in mask/unmask callback
  pinctrl: intel: Use GENMASK() consistently
2019-05-28 09:35:04 -07:00
Linus Torvalds ca6584a331 GPIO fixes for v5.2:
- Fix a build error in gpio-adp5588
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc7PG1AAoJEEEQszewGV1zsqMP/3t7ajMMuDSNO8EaBeYD7XiK
 3hqbyKjh81r83Zc3xQh7S204LQnuasSrkAzsmxcqGEX9a/UVoH4yOk0PTdeS53OH
 tCfmzg0sGpYj7bLaJtOXDKQDQW1op4OqhHFg8nE6U583EDsafn+oKLO9T4IOTTyz
 fyecVD/8QWs9fyiSoOSMjXWYLHJMCMzdeiV+84r8CdWK7MWI4RmfRe0ThhPA6ARM
 DhvqFwO4rwc3JoxLh85Ag+m3X0JkWTboLeaadRP2SWzXcfmuguwhUnkReWh3LPNX
 qqHPkMCiNST3Hag27oxg5ygioVoCz8SwoptzOiyer00JVQNsCsoFNCxlJS3kBTov
 AHJFr452Bu7kOCu2R2XMnFzHR0NVY0WIYoimRaTd6dMlIarsK4zHSbX4XKy0jvc2
 JpVUCz6SCdzrwc1G/udaj1JEbDEdLWgUEdr3QcX2yMFXpjfXCixCH6hxxCNPxQdO
 Nkk2rFHJ/I5TCpZovKbzYYUg9gjVAKnQKVchjjRPu/QPvcr4KbNwABRrRtvoWHgz
 aA2jMCNMohm6uDq6dnkLIHs7V+i5SwX3lHEVJWJRL17yjeppXaLjOjn0dapXjRMd
 FVHezaHwxVSLDFEI9E5Eh6aF93j5PwyoF2VyeZyNjbg+AHmDX0fQBJg4lM6KeehL
 eqjSQQypf2lIAzdxfs8q
 =eby+
 -----END PGP SIGNATURE-----

Merge tag 'gpio-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Pull GPIO fix from Linus Walleij:
 "Fix a build error in gpio-adp5588"

* tag 'gpio-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: fix gpio-adp5588 build errors
2019-05-28 09:33:05 -07:00
Alexei Starovoitov d0a3a4b218 Merge branch 'cgroup-auto-detach'
Roman Gushchin says:

====================
This patchset implements a cgroup bpf auto-detachment functionality:
bpf programs are detached as soon as possible after removal of the
cgroup, without waiting for the release of all associated resources.

Patches 2 and 3 are required to implement a corresponding kselftest
in patch 4.

v5:
  1) rebase

v4:
  1) release cgroup bpf data using a workqueue
  2) add test_cgroup_attach to .gitignore

v3:
  1) some minor changes and typo fixes

v2:
  1) removed a bogus check in patch 4
  2) moved buf[len] = 0 in patch 2
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-28 09:30:03 -07:00
Roman Gushchin d5506591d5 selftests/bpf: add auto-detach test
Add a kselftest to cover bpf auto-detachment functionality.
The test creates a cgroup, associates some resources with it,
attaches a couple of bpf programs and deletes the cgroup.

Then it checks that bpf programs are going away in 5 seconds.

Expected output:
  $ ./test_cgroup_attach
  #override:PASS
  #multi:PASS
  #autodetach:PASS
  test_cgroup_attach:PASS

On a kernel without auto-detaching:
  $ ./test_cgroup_attach
  #override:PASS
  #multi:PASS
  #autodetach:FAIL
  test_cgroup_attach:FAIL

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-28 09:30:02 -07:00
Roman Gushchin 596092ef8b selftests/bpf: enable all available cgroup v2 controllers
Enable all available cgroup v2 controllers when setting up
the environment for the bpf kselftests. It's required to properly test
the bpf prog auto-detach feature. Also it will generally increase
the code coverage.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-28 09:30:02 -07:00
Roman Gushchin ba0c0cc05d selftests/bpf: convert test_cgrp2_attach2 example into kselftest
Convert test_cgrp2_attach2 example into a proper test_cgroup_attach
kselftest. It's better because we do run kselftest on a constant
basis, so there are better chances to spot a potential regression.

Also make it slightly less verbose to conform kselftests output style.

Output example:
  $ ./test_cgroup_attach
  #override:PASS
  #multi:PASS
  test_cgroup_attach:PASS

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-28 09:30:02 -07:00
Roman Gushchin 4bfc0bb2c6 bpf: decouple the lifetime of cgroup_bpf from cgroup itself
Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets (or intentionally avoids) to detach a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks a possibility to implement
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will be ever released.

A dying cgroup can not contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.

Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.

The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, had such a need arise in the future.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-28 09:30:02 -07:00
Randy Dunlap 9a626c4a63 ia64: fix build errors by exporting paddr_to_nid()
Fix build errors on ia64 when DISCONTIGMEM=y and NUMA=y by
exporting paddr_to_nid().

Fixes these build errors:

ERROR: "paddr_to_nid" [sound/core/snd-pcm.ko] undefined!
ERROR: "paddr_to_nid" [net/sunrpc/sunrpc.ko] undefined!
ERROR: "paddr_to_nid" [fs/cifs/cifs.ko] undefined!
ERROR: "paddr_to_nid" [drivers/video/fbdev/core/fb.ko] undefined!
ERROR: "paddr_to_nid" [drivers/usb/mon/usbmon.ko] undefined!
ERROR: "paddr_to_nid" [drivers/usb/core/usbcore.ko] undefined!
ERROR: "paddr_to_nid" [drivers/md/raid1.ko] undefined!
ERROR: "paddr_to_nid" [drivers/md/dm-mod.ko] undefined!
ERROR: "paddr_to_nid" [drivers/md/dm-crypt.ko] undefined!
ERROR: "paddr_to_nid" [drivers/md/dm-bufio.ko] undefined!
ERROR: "paddr_to_nid" [drivers/ide/ide-core.ko] undefined!
ERROR: "paddr_to_nid" [drivers/ide/ide-cd_mod.ko] undefined!
ERROR: "paddr_to_nid" [drivers/gpu/drm/drm.ko] undefined!
ERROR: "paddr_to_nid" [drivers/char/agp/agpgart.ko] undefined!
ERROR: "paddr_to_nid" [drivers/block/nbd.ko] undefined!
ERROR: "paddr_to_nid" [drivers/block/loop.ko] undefined!
ERROR: "paddr_to_nid" [drivers/block/brd.ko] undefined!
ERROR: "paddr_to_nid" [crypto/ccm.ko] undefined!

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: linux-ia64@vger.kernel.org
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-28 09:17:03 -07:00
Ard Biesheuvel 3fd00beb14 arm64/module: revert to unsigned interpretation of ABS16/32 relocations
Commit 1cf24a2cc3

  ("arm64/module: deal with ambiguity in PRELxx relocation ranges")

updated the overflow checking logic in the relocation handling code to
ensure that PREL16/32 relocations don't overflow signed quantities.

However, the same code path is used for absolute relocations, where the
interpretation is the opposite: the only current use case for absolute
relocations operating on non-native word size quantities is the CRC32
handling in the CONFIG_MODVERSIONS code, and these CRCs are unsigned
32-bit quantities, which are now being rejected by the module loader
if bit 31 happens to be set.

So let's use different ranges for quanties subject to absolute vs.
relative relocations:
- ABS16/32 relocations should be in the range [0, Uxx_MAX)
- PREL16/32 relocations should be in the range [Sxx_MIN, Sxx_MAX)
- otherwise, print an error since no other 16 or 32 bit wide data
  relocations are currently supported.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-05-28 15:15:53 +01:00
Catalin Marinas 0037727716 arm64: Fix the arm64_personality() syscall wrapper redirection
Following commit 4378a7d4be ("arm64: implement syscall wrappers"), the
syscall function names gained the '__arm64_' prefix. Ensure that we
have the correct #define for redirecting a default syscall through a
wrapper.

Fixes: 4378a7d4be ("arm64: implement syscall wrappers")
Cc: <stable@vger.kernel.org> # 4.19.x-
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-05-28 13:37:18 +01:00
Daniel T. Lee 37b54aed12 samples/bpf: fix a couple of style issues in bpf_load
This commit fixes a few style problems in samples/bpf/bpf_load.c:

 - Magic string use of 'DEBUGFS'
 - Useless zero initialization of a global variable
 - Minor style fix with whitespace

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 11:10:37 +02:00
Stanislav Fomichev 486d3f22c0 selftests/bpf: fail test_tunnel.sh if subtests fail
Right now test_tunnel.sh always exits with success even if some
of the subtests fail. Since the output is very verbose, it's
hard to spot the issues with subtests. Let's fail the script
if any subtest fails.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 11:08:57 +02:00
Daniel Borkmann 463910a5dd Merge branch 'bpf-bpftool-dbg-output'
Quentin Monnet says:

====================
This series adds an option to bpftool to make it print additional
information via libbpf and the kernel verifier when attempting to load
programs.

A new API function is added to libbpf in order to pass the log_level
from bpftool with the bpf_object__* part of the API.

Options for a finer control over the log levels to use for libbpf and
the verifier could be added in the future, if desired.

v3:
- Fix and clarify commit logs.

v2:
- Do not add distinct options for libbpf and verifier logs, just keep the
  one that sets all log levels to their maximum. Rename the option.
- Do not offer a way to pick desired log levels. The choice is "use the
  option to print all logs" or "stick with the defaults".
- Do not export BPF_LOG_* flags to user header.
- Update all man pages (most bpftool operations use libbpf and may print
  libbpf logs). Verifier logs are only used when attempting to load
  programs for now, so bpftool-prog.rst and bpftool.rst remain the only
  pages updated in that regard.

Previous discussion available at:

https://lore.kernel.org/bpf/20190523105426.3938-1-quentin.monnet@netronome.com/
https://lore.kernel.org/bpf/20190429095227.9745-1-quentin.monnet@netronome.com/
====================

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 11:05:57 +02:00
Quentin Monnet 55d778076b tools: bpftool: make -d option print debug output from verifier
The "-d" option is used to require all logs available for bpftool. So
far it meant telling libbpf to print even debug-level information. But
there is another source of info that can be made more verbose: when we
attemt to load programs with bpftool, we can pass a log_level parameter
to the verifier in order to control the amount of information that is
printed to the console.

Reuse the "-d" option to print all information the verifier can tell. At
this time, this means logs related to BPF_LOG_LEVEL1, BPF_LOG_LEVEL2 and
BPF_LOG_STATS. As mentioned in the discussion on the first version of
this set, these macros are internal to the kernel
(include/linux/bpf_verifier.h) and are not meant to be part of the
stable user API, therefore we simply use the related constants to print
whatever we can at this time, without trying to tell users what is
log_level1 or what is statistics.

Verifier logs are only used when loading programs for now (In the
future: for loading BTF objects with bpftool? Although libbpf does not
currently offer to print verifier info at debug level if no error
occurred when loading BTF objects), so bpftool.rst and bpftool-prog.rst
are the only man pages to get the update.

v3:
- Add details on log level and BTF loading at the end of commit log.

v2:
- Remove the possibility to select the log levels to use (v1 offered a
  combination of "log_level1", "log_level2" and "stats").
- The macros from kernel header bpf_verifier.h are not used (and
  therefore not moved to UAPI header).
- In v1 this was a distinct option, but is now merged in the only "-d"
  switch to activate libbpf and verifier debug-level logs all at the
  same time.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 11:03:26 +02:00
Quentin Monnet 60276f9849 libbpf: add bpf_object__load_xattr() API function to pass log_level
libbpf was recently made aware of the log_level attribute for programs,
used to specify the level of information expected to be dumped by the
verifier. Function bpf_prog_load_xattr() got support for this log_level
parameter.

But some applications using libbpf rely on another function to load
programs, bpf_object__load(), which does accept any parameter for log
level. Create an API function based on bpf_object__load(), but accepting
an "attr" object as a parameter. Then add a log_level field to that
object, so that applications calling the new bpf_object__load_xattr()
can pick the desired log level.

v3:
- Rewrite commit log.

v2:
- We are in a new cycle, bump libbpf extraversion number.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 11:03:26 +02:00
Quentin Monnet 775bc8ada8 tools: bpftool: add -d option to get debug output from libbpf
libbpf has three levels of priority for output messages: warn, info,
debug. By default, debug output is not printed to the console.

Add a new "--debug" (short name: "-d") option to bpftool to print libbpf
logs for all three levels.

Internally, we simply use the function provided by libbpf to replace the
default printing function by one that prints logs regardless of their
level.

v2:
- Remove the possibility to select the log-levels to use (v1 offered a
  combination of "warn", "info" and "debug").
- Rename option and offer a short name: -d|--debug.
- Add option description to all bpftool manual pages (instead of
  bpftool-prog.rst only), as all commands use libbpf.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 11:02:21 +02:00
Hariprasad Kelam d98363b510 libbpf: fix warning that PTR_ERR_OR_ZERO can be used
Fix below warning reported by coccicheck:

/tools/lib/bpf/libbpf.c:3461:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 10:56:05 +02:00
Chang-Hsien Tsai 92bd6820f2 bpf: style fix in while(!feof()) loop
Use fgets() as the while loop condition.

Signed-off-by: Chang-Hsien Tsai <luke.tw@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 10:54:01 +02:00
Yonghong Song e1afb70252 bpf: check signal validity in nmi for bpf_send_signal() helper
Commit 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
introduced bpf_send_signal() helper. If the context is nmi,
the sending signal work needs to be deferred to irq_work.
If the signal is invalid, the error will appear in irq_work
and it won't be propagated to user.

This patch did an early check in the helper itself to notify
user invalid signal, as suggested by Daniel.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 10:51:33 +02:00
Andrii Nakryiko 0d97dacc46 bpftool: auto-complete BTF IDs for btf dump
Auto-complete BTF IDs for `btf dump id` sub-command. List of possible BTF
IDs is scavenged from loaded BPF programs that have associated BTFs, as
there is currently no API in libbpf to fetch list of all BTFs in the
system.

Suggested-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 10:50:27 +02:00