platform_kernel-5.15

Commit Graph

Author	SHA1	Message	Date
Greg Kroah-Hartman	bbc1b6a8a4	Merge 5.15.11 into android13-5.15 Changes in 5.15.11 reset: tegra-bpmp: Revert Handle errors in BPMP response KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled KVM: selftests: Make sure kvm_create_max_vcpus test won't hit RLIMIT_NOFILE KVM: downgrade two BUG_ONs to WARN_ON_ONCE x86/kvm: remove unused ack_notifier callbacks KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid() mac80211: fix rate control for retransmitted frames mac80211: fix regression in SSN handling of addba tx mac80211: mark TX-during-stop for TX in in_reconfig mac80211: send ADDBA requests using the tid/queue of the aggregation session mac80211: validate extended element ID is present firmware: arm_scpi: Fix string overflow in SCPI genpd driver bpf: Fix kernel address leakage in atomic fetch bpf, selftests: Add test case for atomic fetch on spilled pointer bpf: Fix signed bounds propagation after mov32 bpf: Make 32->64 bounds propagation slightly more robust bpf, selftests: Add test case trying to taint map value pointer bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux reg bpf, selftests: Update test case for atomic cmpxchg on r0 with pointer vduse: fix memory corruption in vduse_dev_ioctl() vduse: check that offset is within bounds in get_config() virtio_ring: Fix querying of maximum DMA mapping size for virtio device vdpa: check that offsets are within bounds s390/entry: fix duplicate tracking of irq nesting level recordmcount.pl: look for jgnop instruction as well as bcrl on s390 arm64: dts: ten64: remove redundant interrupt declaration for gpio-keys ceph: fix up non-directory creation in SGID directories dm btree remove: fix use after free in rebalance_children() audit: improve robustness of the audit queue handling btrfs: convert latest_bdev type to btrfs_device and rename btrfs: use latest_dev in btrfs_show_devname btrfs: update latest_dev when we create a sprout device btrfs: remove stale comment about the btrfs_show_devname scsi: ufs: core: Retry START_STOP on UNIT_ATTENTION drm/i915/hdmi: convert intel_hdmi_to_dev to intel_hdmi_to_i915 drm/i915/hdmi: Turn DP++ TMDS output buffers back on in encoder->shutdown() pinctrl: amd: Fix wakeups when IRQ is shared with SCI arm64: dts: rockchip: remove mmc-hs400-enhanced-strobe from rk3399-khadas-edge arm64: dts: rockchip: fix rk3308-roc-cc vcc-sd supply arm64: dts: rockchip: fix rk3399-leez-p710 vcc3v3-lan supply arm64: dts: rockchip: fix audio-supply for Rock Pi 4 arm64: dts: rockchip: fix poweroff on helios64 dmaengine: idxd: add halt interrupt support dmaengine: idxd: fix calling wq quiesce inside spinlock mac80211: track only QoS data frames for admission control tee: amdtee: fix an IS_ERR() vs NULL bug ceph: fix duplicate increment of opened_inodes metric ceph: initialize pathlen variable in reconnect_caps_cb ARM: socfpga: dts: fix qspi node compatible arm64: dts: imx8mq: remove interconnect property from lcdif clk: Don't parent clks until the parent is fully registered soc: imx: Register SoC device only on i.MX boards iwlwifi: mvm: don't crash on invalid rate w/o STA virtio: always enter drivers/virtio/ virtio/vsock: fix the transport to work with VMADDR_CID_ANY vdpa: Consider device id larger than 31 Revert "drm/fb-helper: improve DRM fbdev emulation device names" selftests: net: Correct ping6 expected rc from 2 to 1 s390/kexec_file: fix error handling when applying relocations sch_cake: do not call cake_destroy() from cake_init() inet_diag: fix kernel-infoleak for UDP sockets netdevsim: don't overwrite read only ethtool parms selftests: icmp_redirect: pass xfail=0 to log_test() net: hns3: fix use-after-free bug in hclgevf_send_mbx_msg net: hns3: fix race condition in debugfs selftests: Add duplicate config only for MD5 VRF tests selftests: Fix raw socket bind tests with VRF selftests: Fix IPv6 address bind tests dmaengine: idxd: fix missed completion on abort path dmaengine: st_fdma: fix MODULE_ALIAS drm: simpledrm: fix wrong unit with pixel clock net/sched: sch_ets: don't remove idle classes from the round-robin list selftests/net: toeplitz: fix udp option net: dsa: mv88e6xxx: Unforce speed & duplex in mac_link_down() selftest/net/forwarding: declare NETIFS p9 p10 mptcp: never allow the PM to close a listener subflow drm/ast: potential dereference of null pointer drm/i915/display: Fix an unsigned subtraction which can never be negative. mac80211: agg-tx: don't schedule_and_wake_txq() under sta->lock cfg80211: Acquire wiphy mutex on regulatory work mac80211: fix lookup when adding AddBA extension element net: stmmac: fix tc flower deletion for VLAN priority Rx steering flow_offload: return EOPNOTSUPP for the unsupported mpls action type rds: memory leak in __rds_conn_create() ice: Use div64_u64 instead of div_u64 in adjfine ice: Don't put stale timestamps in the skb drm/amd/display: Set exit_optimized_pwr_state for DCN31 drm/amd/pm: fix a potential gpu_metrics_table memory leak mptcp: remove tcp ulp setsockopt support mptcp: clear 'kern' flag from fallback sockets mptcp: fix deadlock in __mptcp_push_pending() soc/tegra: fuse: Fix bitwise vs. logical OR warning igb: Fix removal of unicast MAC filters of VFs igbvf: fix double free in `igbvf_probe` igc: Fix typo in i225 LTR functions ixgbe: Document how to enable NBASE-T support ixgbe: set X550 MDIO speed before talking to PHY netdevsim: Zero-initialize memory for new map's value in function nsim_bpf_map_alloc net/packet: rx_owner_map depends on pg_vec net: stmmac: dwmac-rk: fix oob read in rk_gmac_setup sfc_ef100: potential dereference of null pointer dsa: mv88e6xxx: fix debug print for SPEED_UNFORCED net: Fix double 0x prefix print in SKB dump net/smc: Prevent smc_release() from long blocking net: systemport: Add global locking for descriptor lifecycle sit: do not call ipip6_dev_free() from sit_init_net() afs: Fix mmap arm64: kexec: Fix missing error code 'ret' warning in load_other_segments() bpf: Fix extable fixup offset. bpf, selftests: Fix racing issue in btf_skc_cls_ingress test powerpc/85xx: Fix oops when CONFIG_FSL_PMC=n USB: gadget: bRequestType is a bitfield, not a enum Revert "usb: early: convert to readl_poll_timeout_atomic()" KVM: x86: Drop guest CPUID check for host initiated writes to MSR_IA32_PERF_CAPABILITIES tty: n_hdlc: make n_hdlc_tty_wakeup() asynchronous USB: NO_LPM quirk Lenovo USB-C to Ethernet Adapher(RTL8153-04) usb: dwc2: fix STM ID/VBUS detection startup delay in dwc2_driver_probe PCI/MSI: Clear PCI_MSIX_FLAGS_MASKALL on error PCI/MSI: Mask MSI-X vectors only on success usb: xhci-mtk: fix list_del warning when enable list debug usb: xhci: Extend support for runtime power management for AMD's Yellow carp. usb: cdnsp: Fix incorrect status for control request usb: cdnsp: Fix incorrect calling of cdnsp_died function usb: cdnsp: Fix issue in cdnsp_log_ep trace event usb: cdnsp: Fix lack of spin_lock_irqsave/spin_lock_restore usb: typec: tcpm: fix tcpm unregister port but leave a pending timer usb: gadget: u_ether: fix race in setting MAC address in setup phase USB: serial: cp210x: fix CP2105 GPIO registration USB: serial: option: add Telit FN990 compositions selinux: fix sleeping function called from invalid context btrfs: fix memory leak in __add_inode_ref() btrfs: fix double free of anon_dev after failure to create subvolume btrfs: check WRITE_ERR when trying to read an extent buffer btrfs: fix missing blkdev_put() call in btrfs_scan_one_device() zonefs: add MODULE_ALIAS_FS iocost: Fix divide-by-zero on donation from low hweight cgroup serial: 8250_fintek: Fix garbled text for console timekeeping: Really make sure wall_to_monotonic isn't positive cifs: sanitize multiple delimiters in prepath locking/rtmutex: Fix incorrect condition in rtmutex_spin_on_owner() riscv: dts: unleashed: Add gpio card detect to mmc-spi-slot riscv: dts: unmatched: Add gpio card detect to mmc-spi-slot perf inject: Fix segfault due to close without open perf inject: Fix segfault due to perf_data__fd() without open libata: if T_LENGTH is zero, dma direction should be DMA_NONE powerpc/module_64: Fix livepatching for RO modules drm/amdgpu: correct register access for RLC_JUMP_TABLE_RESTORE drm/amdgpu: don't override default ECO_BITs setting drm/amd/pm: fix reading SMU FW version from amdgpu_firmware_info on YC Revert "can: m_can: remove support for custom bit timing" can: m_can: make custom bittiming fields const can: m_can: pci: use custom bit timings for Elkhart Lake ARM: dts: imx6ull-pinfunc: Fix CSI_DATA07__ESAI_TX0 pad name xsk: Do not sleep in poll() when need_wakeup set mptcp: add missing documented NL params bpf, x64: Factor out emission of REX byte in more cases bpf: Fix extable address check. USB: core: Make do_proc_control() and do_proc_bulk() killable media: mxl111sf: change mutex_init() location fuse: annotate lock in fuse_reverse_inval_entry() ovl: fix warning in ovl_create_real() scsi: scsi_debug: Don't call kcalloc() if size arg is zero scsi: scsi_debug: Fix type in min_t to avoid stack OOB scsi: scsi_debug: Sanity check block descriptor length in resp_mode_select() io-wq: remove spurious bit clear on task_work addition io-wq: check for wq exit after adding new worker task_work rcu: Mark accesses to rcu_state.n_force_qs io-wq: drop wqe lock before creating new worker bus: ti-sysc: Fix variable set but not used warning for reinit_modules selftests/damon: test debugfs file reads/writes with huge count Revert "xsk: Do not sleep in poll() when need_wakeup set" xen/blkfront: harden blkfront against event channel storms xen/netfront: harden netfront against event channel storms xen/console: harden hvc_xen against event channel storms xen/netback: fix rx queue stall detection xen/netback: don't queue unlimited number of packages Linux 5.15.11 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I20c400f64f45729c6f833c31ee18eb4b92f5ed89	2021-12-29 11:32:19 +01:00
Miklos Szeredi	d2ccdd4e4e	ovl: fix warning in ovl_create_real() commit 1f5573cfe7a7056e80a92c7a037a3e69f3a13d1c upstream. Syzbot triggered the following warning in ovl_workdir_create() -> ovl_create_real(): if (!err && WARN_ON(!newdentry->d_inode)) { The reason is that the cgroup2 filesystem returns from mkdir without instantiating the new dentry. Weird filesystems such as this will be rejected by overlayfs at a later stage during setup, but to prevent such a warning, call ovl_mkdir_real() directly from ovl_workdir_create() and reject this case early. Reported-and-tested-by: syzbot+75eab84fd0af9e8bf66b@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-12-22 09:32:50 +01:00
David Anderson	84a6506ce6	FROMLIST: overlayfs: override_creds=off option bypass creator_cred By default, all access to the upper, lower and work directories is the recorded mounter's MAC and DAC credentials. The incoming accesses are checked against the caller's credentials. If the principles of least privilege are applied, the mounter's credentials might not overlap the credentials of the caller's when accessing the overlayfs filesystem. For example, a file that a lower DAC privileged caller can execute, is MAC denied to the generally higher DAC privileged mounter, to prevent an attack vector. We add the option to turn off override_creds in the mount options; all subsequent operations after mount on the filesystem will be only the caller's credentials. The module boolean parameter and mount option override_creds is also added as a presence check for this "feature", existence of /sys/module/overlay/parameters/override_creds. It was not always this way. Circa 4.6 there was no recorded mounter's credentials, instead privileged access to upper or work directories were temporarily increased to perform the operations. The MAC (selinux) policies were caller's in all cases. override_creds=off partially returns us to this older access model minus the insecure temporary credential increases. This is to permit use in a system with non-overlapping security models for each executable including the agent that mounts the overlayfs filesystem. In Android this is the case since init, which performs the mount operations, has a minimal MAC set of privileges to reduce any attack surface, and services that use the content have a different set of MAC privileges (eg: read, for vendor labelled configuration, execute for vendor libraries and modules). The caveats are not a problem in the Android usage model, however they should be fixed for completeness and for general use in time. Bug: 204981027 Link: https://lore.kernel.org/lkml/20211117015806.2192263-4-dvander@google.com Signed-off-by: David Anderson <dvander@google.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Change-Id: I21516b3af483790fb0a7bfdeabf4770c486dbcbc	2021-11-23 19:13:38 +00:00
David Anderson	e884438aa5	Revert "FROMLIST: Add flags option to get xattr method paired to..." Revert submission 1881578 Reason for revert: broken build in CI Reverted Changes: Id2c6fa6ee:FROMLIST: Add flags option to get xattr method pai... Ifa966dabd:FROMLIST: overlayfs: inode_owner_or_capable called... I46e6c74ff:FROMLIST: overlayfs: override_creds=off option byp... I0b8fe9f1f:FROMLIST: overlayfs: handle XATTR_NOSECURITY flag ... Change-Id: Ic4f9a8dd92dc492ed0a474c783497ec525f1c762 Signed-off-by: David Anderson <dvander@google.com>	2021-11-20 03:15:20 +00:00
David Anderson	df1cc768a5	Revert "FROMLIST: overlayfs: handle XATTR_NOSECURITY flag for ge..." Revert submission 1881578 Reason for revert: broken build in CI Reverted Changes: Id2c6fa6ee:FROMLIST: Add flags option to get xattr method pai... Ifa966dabd:FROMLIST: overlayfs: inode_owner_or_capable called... I46e6c74ff:FROMLIST: overlayfs: override_creds=off option byp... I0b8fe9f1f:FROMLIST: overlayfs: handle XATTR_NOSECURITY flag ... Change-Id: Ie59bb4f7ad86f64c320a3e6f328a4e6f4fa1d204 Signed-off-by: David Anderson <dvander@google.com>	2021-11-20 03:15:14 +00:00
David Anderson	650b7fa416	Revert "FROMLIST: overlayfs: override_creds=off option bypass cr..." Revert submission 1881578 Reason for revert: broken build in CI Reverted Changes: Id2c6fa6ee:FROMLIST: Add flags option to get xattr method pai... Ifa966dabd:FROMLIST: overlayfs: inode_owner_or_capable called... I46e6c74ff:FROMLIST: overlayfs: override_creds=off option byp... I0b8fe9f1f:FROMLIST: overlayfs: handle XATTR_NOSECURITY flag ... Change-Id: Ib78cfc2cfb717a0c4a1997399aaa47d38c6f366d Signed-off-by: David Anderson <dvander@google.com>	2021-11-20 03:15:07 +00:00
David Anderson	0792ff2e87	FROMLIST: overlayfs: override_creds=off option bypass creator_cred By default, all access to the upper, lower and work directories is the recorded mounter's MAC and DAC credentials. The incoming accesses are checked against the caller's credentials. If the principles of least privilege are applied, the mounter's credentials might not overlap the credentials of the caller's when accessing the overlayfs filesystem. For example, a file that a lower DAC privileged caller can execute, is MAC denied to the generally higher DAC privileged mounter, to prevent an attack vector. We add the option to turn off override_creds in the mount options; all subsequent operations after mount on the filesystem will be only the caller's credentials. The module boolean parameter and mount option override_creds is also added as a presence check for this "feature", existence of /sys/module/overlay/parameters/override_creds. It was not always this way. Circa 4.6 there was no recorded mounter's credentials, instead privileged access to upper or work directories were temporarily increased to perform the operations. The MAC (selinux) policies were caller's in all cases. override_creds=off partially returns us to this older access model minus the insecure temporary credential increases. This is to permit use in a system with non-overlapping security models for each executable including the agent that mounts the overlayfs filesystem. In Android this is the case since init, which performs the mount operations, has a minimal MAC set of privileges to reduce any attack surface, and services that use the content have a different set of MAC privileges (eg: read, for vendor labelled configuration, execute for vendor libraries and modules). The caveats are not a problem in the Android usage model, however they should be fixed for completeness and for general use in time. Bug: 204981027 Link: https://lore.kernel.org/lkml/20211117015806.2192263-4-dvander@google.com Change-Id: I46e6c74ff634eb064cf9d714017432171a898890 Signed-off-by: David Anderson <dvander@google.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com>	2021-11-16 18:15:06 -08:00
David Anderson	78626d4b82	FROMLIST: overlayfs: handle XATTR_NOSECURITY flag for get xattr method __vfs_getxattr({dentry...XATTR_NOSECURITY}) -> handler->get({dentry...XATTR_NOSECURITY}) -> __vfs_getxattr({realdentry...XATTR_NOSECURITY}) -> lower_handler->get({realdentry...XATTR_NOSECURITY}) which would report back through the chain data and success as expected, the logging security layer at the top would have the data to determine the access permissions and report back to the logs and the caller that the target context was blocked. For selinux this would solve the cosmetic issue of the selinux log and allow audit2allow to correctly report the rule needed to address the access problem. Check impure, opaque, origin & meta xattr with no sepolicy audit (using __vfs_getxattr) since these operations are internal to overlayfs operations and do not disclose any data. This became an issue for credential override off since sys_admin would have been required by the caller; whereas would have been inherently present for the creator since it performed the mount. This is a change in operations since we do not check in the new ovl_do_getxattr function if the credential override is off or not. Reasoning is that the sepolicy check is unnecessary overhead, especially since the check can be expensive. Because for override credentials off, this affects _everyone_ that underneath performs private xattr calls without the appropriate sepolicy permissions and sys_admin capability. Providing blanket support for sys_admin would be bad for all possible callers. For the override credentials on, this will affect only the mounter, should it lack sepolicy permissions. Not considered a security problem since mounting by definition has sys_admin capabilities, but sepolicy contexts would still need to be crafted. It should be noted that there is precedence, __vfs_getxattr is used in other filesystems for their own internal trusted xattr management. Change-Id: I0b8fe9f1fe6c763fbd27a09c6de8209d1dc9d2f7 Signed-off-by: David Anderson <dvander@google.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Link: https://lore.kernel.org/lkml/20211117015806.2192263-3-dvander@google.com Bug: 204981027	2021-11-16 18:14:59 -08:00
David Anderson	8c288004aa	FROMLIST: Add flags option to get xattr method paired to __vfs_getxattr Add a flag option to get xattr method that could have a bit flag of XATTR_NOSECURITY passed to it. XATTR_NOSECURITY is generally then set in the __vfs_getxattr path when called by security infrastructure. This handles the case of a union filesystem driver that is being requested by the security layer to report back the xattr data. For the use case where access is to be blocked by the security layer. The path then could be security(dentry) -> __vfs_getxattr(dentry...XATTR_NOSECURITY) -> handler->get(dentry...XATTR_NOSECURITY) -> __vfs_getxattr(lower_dentry...XATTR_NOSECURITY) -> lower_handler->get(lower_dentry...XATTR_NOSECURITY) which would report back through the chain data and success as expected, the logging security layer at the top would have the data to determine the access permissions and report back the target context that was blocked. Without the get handler flag, the path on a union filesystem would be the errant security(dentry) -> __vfs_getxattr(dentry) -> handler->get(dentry) -> vfs_getxattr(lower_dentry) -> nested -> security(lower_dentry, log off) -> lower_handler->get(lower_dentry) which would report back through the chain no data, and -EACCES. For selinux for both cases, this would translate to a correctly determined blocked access. In the first case with this change a correct avc log would be reported, in the second legacy case an incorrect avc log would be reported against an uninitialized u:object_r:unlabeled:s0 context making the logs cosmetically useless for audit2allow. This patch series is inert and is the wide-spread addition of the flags option for xattr functions, and a replacement of __vfs_getxattr with __vfs_getxattr(...XATTR_NOSECURITY). Bug: 204981027 Link: https://lore.kernel.org/lkml/20211117015806.2192263-2-dvander@google.com Change-Id: Id2c6fa6eeb2b5cca5a11e0cd02a3fbf2a5fcbef4 Signed-off-by: David Anderson <dvander@google.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com>	2021-11-16 18:14:17 -08:00
Vyacheslav Yurkov	ca45275cd6	ovl: add ovl_allow_offline_changes() helper Allows to check whether any of extended features are enabled Signed-off-by: Vyacheslav Yurkov <Vyacheslav.Yurkov@bruker.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-08-17 11:47:44 +02:00
Vyacheslav Yurkov	e4522bc873	ovl: disable decoding null uuid with redirect_dir Currently decoding origin with lower null uuid is not allowed unless user opted-in to one of the new features that require following the lower inode of non-dir upper (index, xino, metacopy). Now we add redirect_dir too to that feature list. Signed-off-by: Vyacheslav Yurkov <Vyacheslav.Yurkov@bruker.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-08-17 11:47:44 +02:00
Miklos Szeredi	708fa01597	ovl: allow upperdir inside lowerdir Commit `146d62e5a5` ("ovl: detect overlapping layers") made sure we don't have overlapping layers, but it also broke the arguably valid use case of mount -olowerdir=/,upperdir=/subdir,.. where upperdir overlaps lowerdir on the same filesystem. This has been causing regressions. Revert the check, but only for the specific case where upperdir and/or workdir are subdirectories of lowerdir. Any other overlap (e.g. lowerdir is subdirectory of upperdir, etc) case is crazy, so leave the check in place for those. Overlaps are detected at lookup time too, so reverting the mount time check should be safe. Fixes: `146d62e5a5` ("ovl: detect overlapping layers") Cc: <stable@vger.kernel.org> # v5.2 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-04-12 12:00:37 +02:00
Giuseppe Scrivano	321b46b904	ovl: show "userxattr" in the mount data This was missed when adding the option. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Fixes: `2d2f2d7322` ("ovl: user xattr") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-04-12 12:00:37 +02:00
Chengguang Xu	568edee485	ovl: do not copy attr several times In ovl_xattr_set() we have already copied attr of real inode so no need to copy it again in ovl_posix_acl_xattr_set(). Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-04-12 12:00:37 +02:00
Chengguang Xu	d7b49b10d5	ovl: fix error for ovl_fill_super() There are some places should return -EINVAL instead of -ENOMEM in ovl_fill_super(). [Amir] Consistently set error before checking the error condition. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-04-12 12:00:36 +02:00
Amir Goldstein	b0e0f69731	ovl: restrict lower null uuid for "xino=auto" Commit `a888db3101` ("ovl: fix regression with re-formatted lower squashfs") attempted to fix a regression with existing setups that use a practice that we are trying to discourage. The discourage part was described this way in the commit message: "To avoid the reported regression while still allowing the new features with single lower squashfs, do not allow decoding origin with lower null uuid unless user opted-in to one of the new features that require following the lower inode of non-dir upper (index, xino, metacopy)." The three mentioned features are disabled by default in Kconfig, so it was assumed that if they are enabled, the user opted-in for them. Apparently, distros started to configure CONFIG_OVERLAY_FS_XINO_AUTO=y some time ago, so users upgrading their kernels can still be affected by said regression even though they never opted-in for any new feature. To fix this, treat "xino=on" as "user opted-in", but not "xino=auto". Since we are changing the behavior of "xino=auto" to no longer follow to lower origin with null uuid, take this one step further and disable xino in that corner case. To be consistent, disable xino also in cases of lower fs without file handle support and upper fs without xattr support. Update documentation w.r.t the new "xino=auto" behavior and fix the out dated bits of documentation regarding "xino" and regarding offline modifications to lower layers. Link: https://lore.kernel.org/linux-unionfs/b36a429d7c563730c28d763d4d57a6fc30508a4f.1615216996.git.kevin@kevinlocke.name/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-04-12 12:00:36 +02:00
Amir Goldstein	e21a6c57e3	ovl: check that upperdir path is not on a read-only mount So far we only checked that sb is not read-only. Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-04-12 12:00:36 +02:00
Linus Torvalds	7d6beb71da	idmapped-mounts-v5.12 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg= =yPaw -----END PGP SIGNATURE----- Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: `1d7b902e28` In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...	2021-02-23 13:39:45 -08:00
Sargun Dhillon	335d3fc579	ovl: implement volatile-specific fsync error behaviour Overlayfs's volatile option allows the user to bypass all forced sync calls to the upperdir filesystem. This comes at the cost of safety. We can never ensure that the user's data is intact, but we can make a best effort to expose whether or not the data is likely to be in a bad state. The best way to handle this in the time being is that if an overlayfs's upperdir experiences an error after a volatile mount occurs, that error will be returned on fsync, fdatasync, sync, and syncfs. This is contradictory to the traditional behaviour of VFS which fails the call once, and only raises an error if a subsequent fsync error has occurred, and been raised by the filesystem. One awkward aspect of the patch is that we have to manually set the superblock's errseq_t after the sync_fs callback as opposed to just returning an error from syncfs. This is because the call chain looks something like this: sys_syncfs -> sync_filesystem -> __sync_filesystem -> /* The return value is ignored here sb->s_op->sync_fs(sb) _sync_blockdev /* Where the VFS fetches the error to raise to userspace */ errseq_check_and_advance Because of this we call errseq_set every time the sync_fs callback occurs. Due to the nature of this seen / unseen dichotomy, if the upperdir is an inconsistent state at the initial mount time, overlayfs will refuse to mount, as overlayfs cannot get a snapshot of the upperdir's errseq that will increment on error until the user calls syncfs. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Fixes: `c86243b090` ("ovl: provide a mount option "volatile"") Cc: stable@vger.kernel.org Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-01-28 10:22:48 +01:00
Miklos Szeredi	9efb069de4	ovl: add warning on user_ns mismatch Currently there's no way to create an overlay filesystem outside of the current user namespace. Make sure that if this assumption changes it doesn't go unnoticed. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-01-28 10:22:48 +01:00
Christian Brauner	029a52ada6	overlayfs: do not mount on top of idmapped mounts Prevent overlayfs from being mounted on top of idmapped mounts. Stacking filesystems need to be prevented from being mounted on top of idmapped mounts until they have have been converted to handle this. Link: https://lore.kernel.org/r/20210121131959.646623-29-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:20 +01:00
Christian Brauner	549c729771	fs: make helpers idmap mount aware Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:20 +01:00
Tycho Andersen	c7c7a1a18a	xattr: handle idmapped mounts When interacting with extended attributes the vfs verifies that the caller is privileged over the inode with which the extended attribute is associated. For posix access and posix default extended attributes a uid or gid can be stored on-disk. Let the functions handle posix extended attributes on idmapped mounts. If the inode is accessed through an idmapped mount we need to map it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. This has no effect for e.g. security xattrs since they don't store uids or gids and don't perform permission checks on them like posix acls do. Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:17 +01:00
Christian Brauner	e65ce2a50c	acl: handle idmapped mounts The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:17 +01:00
Christian Brauner	2f221d6f7b	attr: handle idmapped mounts When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:16 +01:00
Christian Brauner	21cb47be6f	inode: make init and permission helpers idmapped mount aware The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:16 +01:00
Christian Brauner	0558c1bf5a	capability: handle idmapped mounts In order to determine whether a caller holds privilege over a given inode the capability framework exposes the two helpers privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former verifies that the inode has a mapping in the caller's user namespace and the latter additionally verifies that the caller has the requested capability in their current user namespace. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. If the initial user namespace is passed all operations are a nop so non-idmapped mounts will not see a change in behavior. Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:16 +01:00
Miklos Szeredi	459c7c565a	ovl: unprivieged mounts Enable unprivileged user namespace mounts of overlayfs. Overlayfs's permission model () ensures that the mounter itself cannot gain additional privileges by the act of creating an overlayfs mount. This feature request is coming from the "rootless" container crowd. () Documentation/filesystems/overlayfs.txt#Permission model Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	2d2f2d7322	ovl: user xattr Optionally allow using "user.overlay." namespace instead of "trusted.overlay." This is necessary for overlayfs to be able to be mounted in an unprivileged namepsace. Make the option explicit, since it makes the filesystem format be incompatible. Disable redirect_dir and metacopy options, because these would allow privilege escalation through direct manipulation of the "user.overlay.redirect" or "user.overlay.metacopy" xattrs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	cef4cbff06	ovl: expand warning in ovl_d_real() There was a syzbot report with this warning but insufficient information... Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-11-12 11:31:55 +01:00
Pavel Tikhomirov	5830fb6b54	ovl: introduce new "uuid=off" option for inodes index feature This replaces uuid with null in overlayfs file handles and thus relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are unique. In other case when we have multiple filesystems lets just fallback to "uuid=on" which is and equivalent of how it worked before with all uuid checks. This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when create the copy of container and we require fs uuid to be unique for a new container. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-11-12 11:31:55 +01:00
Miklos Szeredi	610afc0bd4	ovl: pass ovl_fs down to functions accessing private xattrs This paves the way for optionally using the "user.overlay." xattr namespace. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-09-02 10:58:49 +02:00
Miklos Szeredi	26150ab5ea	ovl: drop flags argument from ovl_do_setxattr() All callers pass zero flags to ovl_do_setxattr(). So drop this argument. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-09-02 10:58:48 +02:00
Miklos Szeredi	7109704705	ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrs Call ovl_do_xattr() when accessing an overlay private xattr, vfs_xattr() otherwise. This has an effect on debug output, which is made more consistent by this patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-09-02 10:58:48 +02:00
Vivek Goyal	c86243b090	ovl: provide a mount option "volatile" Container folks are complaining that dnf/yum issues too many sync while installing packages and this slows down the image build. Build requirement is such that they don't care if a node goes down while build was still going on. In that case, they will simply throw away unfinished layer and start new build. So they don't care about syncing intermediate state to the disk and hence don't want to pay the price associated with sync. So they are asking for mount options where they can disable sync on overlay mount point. They primarily seem to have two use cases. - For building images, they will mount overlay with nosync and then sync upper layer after unmounting overlay and reuse upper as lower for next layer. - For running containers, they don't seem to care about syncing upper layer because if node goes down, they will simply throw away upper layer and create a fresh one. So this patch provides a mount option "volatile" which disables all forms of sync. Now it is caller's responsibility to throw away upper if system crashes or shuts down and start fresh. With "volatile", I am seeing roughly 20% speed up in my VM where I am just installing emacs in an image. Installation time drops from 31 seconds to 25 seconds when nosync option is used. This is for the case of building on top of an image where all packages are already cached. That way I take out the network operations latency out of the measurement. Giuseppe is also looking to cut down on number of iops done on the disk. He is complaining that often in cloud their VMs are throttled if they cross the limit. This option can help them where they reduce number of iops (by cutting down on frequent sync and writebacks). Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-09-02 10:58:48 +02:00
Amir Goldstein	235ce9ed96	ovl: check for incompatible features in work dir An incompatible feature is marked by a non-empty directory nested 2 levels deep under "work" dir, e.g.: workdir/work/incompat/volatile. This commit checks for marked incompat features, warns about them and fails to mount the overlay, for example: overlayfs: overlay with incompat feature 'volatile' cannot be mounted Very old kernels (i.e. v3.18) will fail to remove a non-empty "work" dir and fail the mount. Newer kernels will fail to remove a "work" dir with entries nested 3 levels and fall back to read-only mount. User mounting with old kernel will see a warning like these in dmesg: overlayfs: cleanup of 'incompat/...' failed (-39) overlayfs: cleanup of 'work/incompat' failed (-39) overlayfs: cleanup of 'ovl-work/work' failed (-39) overlayfs: failed to create directory /vdf/ovl-work/work (errno: 17); mounting read-only These warnings should give the hint to the user that: 1. mount failure is caused by backward incompatible features 2. mount failure can be resolved by manually removing the "work" directory There is nothing preventing users on old kernels from manually removing workdir entirely or mounting overlay with a new workdir, so this is in no way a full proof backward compatibility enforcement, but only a best effort. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-09-02 10:58:48 +02:00
Amir Goldstein	f0e1266ed2	ovl: fix mount option checks for nfs_export with no upperdir Without upperdir mount option, there is no index dir and the dependency checks nfs_export => index for mount options parsing are incorrect. Allow the combination nfs_export=on,index=off with no upperdir and move the check for dependency redirect_dir=nofollow for non-upper mount case to mount options parsing. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:11:15 +02:00
Amir Goldstein	470c156361	ovl: force read-only sb on failure to create index dir With index feature enabled, on failure to create index dir, overlay is being mounted read-only. However, we do not forbid user to remount overlay read-write. Fix that by setting ofs->workdir to NULL, which prevents remount read-write. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:11:15 +02:00
Amir Goldstein	a888db3101	ovl: fix regression with re-formatted lower squashfs Commit `9df085f3c9` ("ovl: relax requirement for non null uuid of lower fs") relaxed the requirement for non null uuid with single lower layer to allow enabling index and nfs_export features with single lower squashfs. Fabian reported a regression in a setup when overlay re-uses an existing upper layer and re-formats the lower squashfs image. Because squashfs has no uuid, the origin xattr in upper layer are decoded from the new lower layer where they may resolve to a wrong origin file and user may get an ESTALE or EIO error on lookup. To avoid the reported regression while still allowing the new features with single lower squashfs, do not allow decoding origin with lower null uuid unless user opted-in to one of the new features that require following the lower inode of non-dir upper (index, xino, metacopy). Reported-by: Fabian <godi.beat@gmx.net> Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/ Fixes: `9df085f3c9` ("ovl: relax requirement for non null uuid of lower fs") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:10:31 +02:00
Amir Goldstein	20396365a1	ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=on Mounting with nfs_export=on, xfstests overlay/031 triggers a kernel panic since v5.8-rc1 overlayfs updates. overlayfs: orphan index entry (index/00fb1..., ftype=4000, nlink=2) BUG: kernel NULL pointer dereference, address: 0000000000000030 RIP: 0010:ovl_cleanup_and_whiteout+0x28/0x220 [overlay] Bisect point at commit `c21c839b84` ("ovl: whiteout inode sharing") Minimal reproducer: -------------------------------------------------- rm -rf l u w m mkdir -p l u w m mkdir -p l/testdir touch l/testdir/testfile mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m echo 1 > m/testdir/testfile umount m rm -rf u/testdir mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m umount m -------------------------------------------------- When mount with nfs_export=on, and fail to verify an orphan index, we're cleaning this index from indexdir by calling ovl_cleanup_and_whiteout(). This dereferences ofs->workdir, that was earlier set to NULL. The design was that ovl->workdir will point at ovl->indexdir, but we are assigning ofs->indexdir to ofs->workdir only after ovl_indexdir_cleanup(). There is no reason not to do it sooner, because once we get success from ofs->indexdir = ovl_workdir_create(... there is no turning back. Reported-and-tested-by: Murphy Zhou <jencce.kernel@gmail.com> Fixes: `c21c839b84` ("ovl: whiteout inode sharing") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:09:59 +02:00
youngjun	24f14009b8	ovl: inode reference leak in ovl_is_inuse true case. When "ovl_is_inuse" true case, trap inode reference not put. plus adding the comment explaining sequence of ovl_is_inuse after ovl_setup_trap. Fixes: `0be0bfd2de` ("ovl: fix regression caused by overlapping layers detection") Cc: <stable@vger.kernel.org> # v4.19+ Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: youngjun <her0gyugyu@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:05:40 +02:00
youngjun	2068cf7dfb	ovl: remove unnecessary lock check Directory is always locked until "out_unlock" label. So lock check is not needed. Signed-off-by: youngjun <her0gyugyu@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-08 09:57:19 +02:00
Miklos Szeredi	df820f8de4	ovl: make private mounts longterm Overlayfs is using clone_private_mount() to create internal mounts for underlying layers. These are used for operations requiring a path, such as dentry_open(). Since these private mounts are not in any namespace they are treated as short term, "detached" mounts and mntput() involves taking the global mount_lock, which can result in serious cacheline pingpong. Make these private mounts longterm instead, which trade the penalty on mntput() for a slightly longer shutdown time due to an added RCU grace period when putting these mounts. Introduce a new helper kern_unmount_many() that can take care of multiple longterm mounts with a single RCU grace period. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Miklos Szeredi	b8e42a651b	ovl: get rid of redundant members in struct ovl_fs ofs->upper_mnt is copied to ->layers[0].mnt and ->layers[0].trap could be used instead of a separate ->upperdir_trap. Split the lowerdir option early to get the number of layers, then allocate the ->layers array, and finally fill the upper and lower layers, as before. Get rid of path_put_init() in ovl_lower_dir(), since the only caller will take care of that. [Colin Ian King] Fix null pointer dereference on null stack pointer on error return found by Coverity. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Miklos Szeredi	08f4c7c86d	ovl: add accessor for ofs->upper_mnt Next patch will remove ofs->upper_mnt, so add an accessor function for this field. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Chengguang Xu	399c109d35	ovl: sync dirty data when remounting to ro mode sync_filesystem() does not sync dirty data for readonly filesystem during umount, so before changing to readonly filesystem we should sync dirty data for data integrity. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Chengguang Xu	c21c839b84	ovl: whiteout inode sharing Share inode with different whiteout files for saving inode and speeding up delete operation. If EMLINK is encountered when linking a shared whiteout, create a new one. In case of any other error, disable sharing for this super block. Note: ofs->whiteout is protected by inode lock on workdir. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Jeffle Xu	654255fa20	ovl: inherit SB_NOSEC flag from upperdir Since the stacking of regular file operations [1], the overlayfs edition of write_iter() is called when writing regular files. Since then, xattr lookup is needed on every write since file_remove_privs() is called from ovl_write_iter(), which would become the performance bottleneck when writing small chunks of data. In my test case, file_remove_privs() would consume ~15% CPU when running fstime of unixbench (the workload is repeadly writing 1 KB to the same file) [2]. Inherit the SB_NOSEC flag from upperdir. Since then xattr lookup would be done only once on the first write. Unixbench fstime gets a ~20% performance gain with this patch. [1] https://lore.kernel.org/lkml/20180606150905.GC9426@magnolia/T/ [2] https://www.spinics.net/lists/linux-unionfs/msg07153.html Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Konstantin Khlebnikov	32b1924b21	ovl: skip overlayfs superblocks at global sync Stacked filesystems like overlayfs has no own writeback, but they have to forward syncfs() requests to backend for keeping data integrity. During global sync() each overlayfs instance calls method ->sync_fs() for backend although it itself is in global list of superblocks too. As a result one syscall sync() could write one superblock several times and send multiple disk barriers. This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that. Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Amir Goldstein	62a8a85be8	ovl: index dir act as work dir With index=on, let index dir act as the work dir for copy up and cleanups. This will help implementing whiteout inode sharing. We still create the "work" dir on mount regardless of index=on and it is used to test the features supported by upper fs. One reason is that before the feature tests, we do not know if index could be enabled or not. The reason we do not use "index" directory also as workdir with index=off is because the existence of the "index" directory acts as a simple persistent signal that index was enabled on this filesystem and tools may want to use that signal. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00

1 2 3 4 5 ...

278 Commits