linux

Commit Graph

Author	SHA1	Message	Date
Sean Christopherson	b6b80c78af	KVM: x86/mmu: Allocate PAE root array when using SVM's 32-bit NPT SVM's Nested Page Tables (NPT) reuses x86 paging for the host-controlled page walk. For 32-bit KVM, this means PAE paging is used even when TDP is enabled, i.e. the PAE root array needs to be allocated. Fixes: `ee6268ba3a` ("KVM: x86: Skip pae_root shadow allocation if tdp enabled") Cc: stable@vger.kernel.org Reported-by: Jiri Palecek <jpalecek@web.de> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-19 16:11:53 +02:00
Liran Alon	6ca00dfafd	KVM: x86: Modify struct kvm_nested_state to have explicit fields for data Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format of VMX nested state data in a struct. In order to avoid changing the ioctl values of KVM_{GET,SET}_NESTED_STATE, there is a need to preserve sizeof(struct kvm_nested_state). This is done by defining the data struct as "data.vmx[0]". It was the most elegant way I found to preserve struct size while still keeping struct readable and easy to maintain. It does have a misfortunate side-effect that now it has to be accessed as "data.vmx[0]" rather than just "data.vmx". Because we are already modifying these structs, I also modified the following: * Define the "format" field values as macros. * Rename vmcs_pa to vmcs12_pa for better readability. Signed-off-by: Liran Alon <liran.alon@oracle.com> [Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo] Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-19 16:11:52 +02:00
David S. Miller	cfecf0d001	Merge branch 'mlxsw-Implement-flower-ingress-device-matching-offload' Ido Schimmel says: ==================== mlxsw: Implement flower ingress device matching offload Jiri says: In case of using shared block, user might find it handy to be able to insert filters to match on particular ingress device. This patchset exposes the ingress ifindex through flow_dissector and flow_offload so mlxsw can use it to push down to HW. See the selftests for examples of usage. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:22 -04:00
Jiri Pirko	dcc5e1f9ca	selftests: tc: add ingress device matching support Extend tc_flower to test plain ingress device matching and also tc_shblock to test ingress device matching on shared block. Add new tc_flower_router.sh where ingress device matching on egress (after routing) is done. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:22 -04:00
Jiri Pirko	0c1f391d19	mlxsw: spectrum_flower: Implement support for ingress device matching Benefit from the previously extended flow_dissector infrastructure and offload matching on ingress port. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:22 -04:00
Jiri Pirko	d8e9461446	mlxsw: spectrum_acl: Fix SRC_SYS_PORT element size Fix the size of the SRC_SYS_PORT element to be 16. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:22 -04:00
Jiri Pirko	ff5405f690	mlxsw: spectrum_acl: Avoid size check for RX_ACL_SYSTEM_PORT element RX_ACL_SYSTEM_PORT is 8 bit but SRC_SYS_PORT is 16 bits. Internally, SRC_SYS_PORT is used to carry the value. Relax the checker in case of RX_ACL_SYSTEM_PORT and allow different size. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:22 -04:00
Jiri Pirko	511a5adcaa	mlxsw: spectrum_acl: Write RX_ACL_SYSTEM_PORT acl element correctly RX_ACL_SYSTEM_PORT is equal to SRC_SYS_PORT - 1. So before write to block we need to adjust the key value. Introduce new "EXT" helper to implement this. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:22 -04:00
Jiri Pirko	9558a83aee	net: flow_offload: implement support for meta key Implement support for previously added flow dissector meta key. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:22 -04:00
Jiri Pirko	8212ed777f	net: sched: cls_flower: use flow_dissector for ingress ifindex Use previously introduced infra to obtain and store ingress ifindex instead doing it locally. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:22 -04:00
Jiri Pirko	82828b88f0	flow_dissector: add support for ingress ifindex dissection Add new key meta that contains ingress ifindex value and add a function to dissect this from skb. The key and function is prepared to cover other potential skb metadata values dissection. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-19 10:09:21 -04:00
Amir Goldstein	c285a2f01d	fanotify: update connector fsid cache on add mark When implementing connector fsid cache, we only initialized the cache when the first mark added to object was added by FAN_REPORT_FID group. We forgot to update conn->fsid when the second mark is added by FAN_REPORT_FID group to an already attached connector without fsid cache. Reported-and-tested-by: syzbot+c277e8e2f46414645508@syzkaller.appspotmail.com Fixes: `77115225ac` ("fanotify: cache fsid in fsnotify_mark_connector") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2019-06-19 15:53:58 +02:00
yangerkun	c6d9c35d16	quota: fix a problem about transfer quota Run below script as root, dquot_add_space will return -EDQUOT since __dquot_transfer call dquot_add_space with flags=0, and dquot_add_space think it's a preallocation. Fix it by set flags as DQUOT_SPACE_WARN. mkfs.ext4 -O quota,project /dev/vdb mount -o prjquota /dev/vdb /mnt setquota -P 23 1 1 0 0 /dev/vdb dd if=/dev/zero of=/mnt/test-file bs=4K count=1 chattr -p 23 test-file Fixes: `7b9ca4c61b` ("quota: Reduce contention on dq_data_lock") Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2019-06-19 15:25:41 +02:00
Ville Syrjälä	475df5d0f3	drm/i915: Don't clobber M/N values during fastset check We're now calling intel_pipe_config_compare(..., true) uncoditionally which means we're always going clobber the calculated M/N values with the old values if the fuzzy M/N check passes. That causes problems because the fuzzy check allows for a huge difference in the values. I'm actually tempted to just make the M/N checks exact, but that might prevent fastboot from kicking in when people want it. So for now let's overwrite the computed values with the old values only if decide to skip the modeset. v2: Copy has_drrs along with M/N M2/N2 values Cc: stable@vger.kernel.org Cc: Blubberbub@protonmail.com Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Hans de Goede <hdegoede@redhat.com> Tested-by: Blubberbub@protonmail.com Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110782 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110675 Fixes: `d19f958db2` ("drm/i915: Enable fastset for non-boot modesets.") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190612172423.25231-1-ville.syrjala@linux.intel.com Reviewed-by: Imre Deak <imre.deak@intel.com> (cherry picked from commit `f0521558a2`) Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190619120929.4057-1-ville.syrjala@linux.intel.com	2019-06-19 15:57:09 +03:00
Amir Goldstein	6dde1e42f4	ovl: make i_ino consistent with st_ino in more cases Relax the condition that overlayfs supports nfs export, to require that i_ino is consistent with st_ino/d_ino. It is enough to require that st_ino and d_ino are consistent. This fixes the failure of xfstest generic/504, due to mismatch of st_ino to inode number in the output of /proc/locks. Fixes: `12574a9f4c` ("ovl: consistent i_ino for non-samefs with xino") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-06-19 09:04:19 +02:00
Jani Nikula	f5633efced	Merge tag 'gvt-fixes-2019-06-19' of https://github.com/intel/gvt-linux into drm-intel-fixes gvt-fixes-2019-06-19 - Fix reserved PVINFO register write (Weinan) Signed-off-by: Jani Nikula <jani.nikula@intel.com> From: Zhenyu Wang <zhenyuw@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190619062240.GM9684@zhen-hp.sh.intel.com	2019-06-19 09:59:59 +03:00
Colin Ian King	39f5886032	net/mlx5: add missing void argument to function mlx5_devlink_alloc Function mlx5_devlink_alloc is missing a void argument, add it to clean up the non-ANSI function declaration. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:28:50 -04:00
David S. Miller	da21ad276a	Merge branch 'net-mvpp2-cls-Allow-steering-based-on-vlan-tag' Maxime Chevallier says: ==================== net: mvpp2: cls: Allow steering based on vlan tag The PPv2 classifier can perform flow steering based on keys extracted from the VLAN tag. This series adds support for using the vlan id and the vlan prio as keys, using the ethtool interface. Patch 1 is a preparatory patch that prevent false-positive matches, using a dedicated lookup id for the RSS C2 lookup. Patch 2 allows to separate the flows based on the header fields they contain. The main goal is to be able to separate tagged traffic from untagged traffic for flow steering, just as we already do for RSS. Patch 3 solves an issue we have when extracting fields that aren't full bytes, such as the vlan tag which is 12 bits wide, or the priority which is 3 bits wide. Finally, patch 4 adds support for steering based on both vlan id and priority, extracted from the outermost tag. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:26:05 -04:00
Maxime Chevallier	1274daede3	net: mvpp2: cls: Add steering based on vlan Id and priority. This commit allows using the vlan Id and priority as parts of the key for classification offload. These fields are extracted from the outermost tag, if multiple tags are present. Vlan Id and priority are considered as 2 different fields by the classifier, however the fields are both appended in the Header Extracted Key in the same layout as they are found in the tags. This means that when steering only based on the prio, a 16-bit slot is still taken in the HEK. The classifier doesn't allow extracting the DEI bit from the tag, so we explicitly prevent user from using this bit in the key. This commit adds the vlan priotity as a compatible HEK field for tagged traffic, meaning that we limit the possibility of extracting this field only to the flows that contain tagged traffic. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:26:05 -04:00
Maxime Chevallier	12b8e2dd01	net: mvpp2: cls: right-justify the C2 TCAM keys The C2 TCAM used for classification uses a key (Header Extracted Key) built by concatenating several fields extracted from the packet header. After a lot of trial-and-error and some guess work, it seems the HEK is right justified, with the first fields being stored in the MSB, then concatenated up until the LSB. Until now, this doesn't cause any issue since all HEK fields we use are full bytes. However this is an issue for the upcoming VLAN id and pri extraction, which aren't full bytes. Rework the way we built that TCAM key, by changing the order in which we append the fields. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:26:05 -04:00
Maxime Chevallier	834df6ea95	net: mvpp2: cls: Only select applicable flows of classification offload The way we currently handle classification offload and RSS is by having dedicated lookup sequences in the flow table, each being selected depending on several fields being present in the packet header. We need to make sure the classification operation we want to perform can be done in each flow we want to insert it into. As an example, classifying on VLAN tag can only be done on flows used for tagged traffic. This commit makes sure we don't insert rules in flows we aren't compatible with. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:26:04 -04:00
Maxime Chevallier	c641af4f6f	net: mvpp2: cls: Use a dedicated lu_type for the RSS lookup When performing a TCAM lookup in the C2 engine, it's possible that multiple entries match the packet. To make sure the correct entry match when performing a lookup, the Flow Table can set a lookup type, which will be used in the TCAM lookup, thus preventing such false-positives. We need to make sure the RSS match doesn't interfere with other classification lookups, hence we use a dedicated lookup_type for it. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:26:04 -04:00
David S. Miller	9368b8e24b	Merge branch 'macb-SiFive-FU540-C000' Yash Shah says: ==================== Add macb support for SiFive FU540-C000 On FU540, the management IP block is tightly coupled with the Cadence MACB IP block. It manages many of the boundary signals from the MACB IP This patchset controls the tx_clk input signal to the MACB IP. It switches between the local TX clock (125MHz) and PHY TX clocks. This is necessary to toggle between 1Gb and 100/10Mb speeds. Future patches may add support for monitoring or controlling other IP boundary signals. This patchset is mostly based on work done by Wesley Terpstra <wesley@sifive.com> This patchset is based on Linux v5.2-rc1 and tested on HiFive Unleashed board with additional board related patches needed for testing can be found at dev/yashs/ethernet_v3 branch of: https://github.com/yashshah7/riscv-linux.git Change History: V3: - Revert "MACB_SIFIVE_FU540" config changes in Kconfig and driver code. The driver does not depend on SiFive GPIO driver. V2: - Change compatible string from "cdns,fu540-macb" to "sifive,fu540-macb" - Add "MACB_SIFIVE_FU540" in Kconfig to support SiFive FU540 in macb driver. This is needed because on FU540, the macb driver depends on SiFive GPIO driver. - Avoid writing the result of a comparison to a register. - Fix the issue of probe fail on reloading the module reported by: Andreas Schwab <schwab@suse.de> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:02:27 -04:00
Yash Shah	c218ad5590	macb: Add support for SiFive FU540-C000 The management IP block is tightly coupled with the Cadence MACB IP block on the FU540, and manages many of the boundary signals from the MACB IP. This patch only controls the tx_clk input signal to the MACB IP. Future patches may add support for monitoring or controlling other IP boundary signals. Signed-off-by: Yash Shah <yash.shah@sifive.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:02:27 -04:00
Yash Shah	d4993e19da	macb: bindings doc: add sifive fu540-c000 binding Add the compatibility string documentation for SiFive FU540-C0000 interface. On the FU540, this driver also needs to read and write registers in a management IP block that monitors or drives boundary signals for the GEMGXL IP block that are not directly mapped to GEMGXL registers. Therefore, add additional range to "reg" property for SiFive GEMGXL management IP registers. Signed-off-by: Yash Shah <yash.shah@sifive.com> Reviewed-by: Paul Walmsley <paul.walmsley@sifive.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 22:02:27 -04:00
David S. Miller	d75d5f9764	Merge branch 'hinic-add-rss-support-and-rss-parameters-configuration' Xue Chaojing says: ==================== hinic: add rss support and rss parameters configuration This series add rss support for HINIC driver and implement the ethtool interface related to rss parameter configuration. user can use ethtool configure rss parameters or show rss parameters. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:52:27 -04:00
Xue Chaojing	4fdc51bb4e	hinic: add support for rss parameters with ethtool This patch adds support rss parameters with ethtool, user can change hash key, hash indirection table, hash function by ethtool -X, and show rss parameters by ethtool -x. Signed-off-by: Xue Chaojing <xuechaojing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:52:27 -04:00
Xue Chaojing	eb8ce9ac16	hinic: move ethtool code into hinic_ethtool This patch moves ethtool code from hinic_main.c to hinic_ethtool.c Signed-off-by: Xue Chaojing <xuechaojing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:52:27 -04:00
Xue Chaojing	421e952628	hinic: add rss support This patch adds rss support for the HINIC driver. Signed-off-by: Xue Chaojing <xuechaojing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:52:27 -04:00
David S. Miller	d470e720ef	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Module autoload for masquerade and redirection does not work. 2) Leak in unqueued packets in nf_ct_frag6_queue(). Ignore duplicated fragments, pretend they are placed into the queue. Patches from Guillaume Nault. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:43:40 -04:00
Sunil Muthuswamy	cb359b6041	hvsock: fix epollout hang from race condition Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will not return even when the hvsock socket is writable, under some race condition. This can happen under the following sequence: - fd = socket(hvsocket) - fd_out = dup(fd) - fd_in = dup(fd) - start a writer thread that writes data to fd_out with a combination of epoll_wait(fd_out, EPOLLOUT) and - start a reader thread that reads data from fd_in with a combination of epoll_wait(fd_in, EPOLLIN) - On the host, there are two threads that are reading/writing data to the hvsocket stack: hvs_stream_has_space hvs_notify_poll_out vsock_poll sock_poll ep_poll Race condition: check for epollout from ep_poll(): assume no writable space in the socket hvs_stream_has_space() returns 0 check for epollin from ep_poll(): assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE) hvs_stream_has_space() will clear the channel pending send size host will not notify the guest because the pending send size has been cleared and so the hvsocket will never mark the socket writable Now, the EPOLLOUT will never return even if the socket write buffer is empty. The fix is to set the pending size to the default size and never change it. This way the host will always notify the guest whenever the writable space is bigger than the pending size. The host is already optimized to only notify the guest when the pending size threshold boundary is crossed and not everytime. This change also reduces the cpu usage somewhat since hv_stream_has_space() is in the hotpath of send: vsock_stream_sendmsg()->hv_stream_has_space() Earlier hv_stream_has_space was setting/clearing the pending size on every call. Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:41:12 -04:00
Fred Klassen	76e21533a4	net/udp_gso: Allow TX timestamp with UDP GSO Fixes an issue where TX Timestamps are not arriving on the error queue when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING. This can be illustrated with an updated updgso_bench_tx program which includes the '-T' option to test for this condition. It also introduces the '-P' option which will call poll() before reading the error queue. ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18 poll timeout udp tx: 0 MB/s 1 calls/s 1 msg/s The "poll timeout" message above indicates that TX timestamp never arrived. This patch preserves tx_flags for the first UDP GSO segment. Only the first segment is timestamped, even though in some cases there may be benefital in timestamping both the first and last segment. Factors in deciding on first segment timestamp only: - Timestamping both first and last segmented is not feasible. Hardware can only have one outstanding TS request at a time. - Timestamping last segment may under report network latency of the previous segments. Even though the doorbell is suppressed, the ring producer counter has been incremented. - Timestamping the first segment has the upside in that it reports timestamps from the application's view, e.g. RTT. - Timestamping the first segment has the downside that it may underreport tx host network latency. It appears that we have to pick one or the other. And possibly follow-up with a config flag to choose behavior. v2: Remove tests as noted by Willem de Bruijn <willemb@google.com> Moving tests from net to net-next v3: Update only relevant tx_flag bits as per Willem de Bruijn <willemb@google.com> v4: Update comments and commit message as per Willem de Bruijn <willemb@google.com> Fixes: `ee80d1ebe5` ("udp: add udp gso") Signed-off-by: Fred Klassen <fklassen@appneta.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:38:07 -04:00
David S. Miller	e11e1007a1	Merge branch 'net-netem-fix-issues-with-corrupting-GSO-frames' Jakub Kicinski says: ==================== net: netem: fix issues with corrupting GSO frames Corrupting GSO frames currently leads to crashes, due to skb use after free. These stem from the skb list handling - the segmented skbs come back on a list, and this list is not properly unlinked before enqueuing the segments. Turns out this condition is made very likely to occur because of another bug - in backlog accounting. Segments are counted twice, which means qdisc's limit gets reached leading to drops and making the use after free very likely to happen. The bugs are fixed in order in which they were added to the tree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:30:39 -04:00
Jakub Kicinski	3e14c383de	net: netem: fix use after free and double free with packet corruption Brendan reports that the use of netem's packet corruption capability leads to strange crashes. This seems to be caused by commit `d66280b12b` ("net: netem: use a list in addition to rbtree") which uses skb->next pointer to construct a fast-path queue of in-order skbs. Packet corruption code has to invoke skb_gso_segment() in case of skbs in need of GSO. skb_gso_segment() returns a list of skbs. If next pointers of the skbs on that list do not get cleared fast path list may point to freed skbs or skbs which are also on the RB tree. Let's say skb gets segmented into 3 frames: A -> B -> C A gets hooked to the t_head t_tail list by tfifo_enqueue(), but it's next pointer didn't get cleared so we have: h t \|/ A -> B -> C Now if B and C get also get enqueued successfully all is fine, because tfifo_enqueue() will overwrite the list in order. IOW: Enqueue B: h t \| \| A -> B C Enqueue C: h t \| \| A -> B -> C But if B and C get reordered we may end up with: h t RB tree \|/ \| A -> B -> C B \ C Or if they get dropped just: h t \|/ A -> B -> C where A and B are already freed. To reproduce either limit has to be set low to cause freeing of segs or reorders have to happen (due to delay jitter). Note that we only have to mark the first segment as not on the list, "finish_segs" handling of other frags already does that. Another caveat is that qdisc_drop_all() still has to free all segments correctly in case of drop of first segment, therefore we re-link segs before calling it. v2: - re-link before drop, v1 was leaking non-first segs if limit was hit at the first seg - better commit message which lead to discovering the above :) Reported-by: Brendan Galloway <brendan.galloway@netronome.com> Fixes: `d66280b12b` ("net: netem: use a list in addition to rbtree") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:30:38 -04:00
Jakub Kicinski	177b800746	net: netem: fix backlog accounting for corrupted GSO frames When GSO frame has to be corrupted netem uses skb_gso_segment() to produce the list of frames, and re-enqueues the segments one by one. The backlog length has to be adjusted to account for new frames. The current calculation is incorrect, leading to wrong backlog lengths in the parent qdisc (both bytes and packets), and incorrect packet backlog count in netem itself. Parent backlog goes negative, netem's packet backlog counts all non-first segments twice (thus remaining non-zero even after qdisc is emptied). Move the variables used to count the adjustment into local scope to make 100% sure they aren't used at any stage in backports. Fixes: `6071bd1aa1` ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:30:38 -04:00
Colin Ian King	760f1dc295	net: stmmac: add sanity check to device_property_read_u32_array call Currently the call to device_property_read_u32_array is not error checked leading to potential garbage values in the delays array that are then used in msleep delays. Add a sanity check to the property fetching. Addresses-Coverity: ("Uninitialized scalar variable") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:03:38 -04:00
Colin Ian King	9476274093	net: lio_core: fix potential sign-extension overflow on large shift Left shifting the signed int value 1 by 31 bits has undefined behaviour and the shift amount oq_no can be as much as 63. Fix this by using BIT_ULL(oq_no) instead. Addresses-Coverity: ("Bad shift operation") Fixes: `f21fb3ed36` ("Add support of Cavium Liquidio ethernet adapters") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 21:00:40 -04:00
Geert Uytterhoeven	cf29a49879	net: hns3: Add missing newline at end of file "git diff" says: \ No newline at end of file after modifying the file. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 20:56:42 -04:00
David S. Miller	55458d2f40	Merge branch 'net-fix-quite-a-few-dst_cache-crashes-reported-by-syzbot' Xin Long says: ==================== net: fix quite a few dst_cache crashes reported by syzbot There are two kinds of crashes reported many times by syzbot with no reproducer. Call Traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... They were caused by the fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten by another percpu variable 'dev->tstats' access overflow in tipc udp media xmit path when counting packets on a non tunnel device. The fix is to make udp tunnel work with no tunnel device by allowing not to count packets on the tstats when the tunnel dev is NULL in Patches 1/3 and 2/3, then pass a NULL tunnel dev in tipc_udp_tunnel() in Patch 3/3. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 20:48:45 -04:00
Xin Long	c3bcde0266	tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device to count packets on dev->tstats, a perpcu variable. However, TIPC is using udp tunnel with no tunnel device, and pass the lower dev, like veth device that only initializes dev->lstats(a perpcu variable) when creating it. Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each pointer points to a bigger struct than lstats, so when tstats->tx_bytes is increased, other percpu variable's members could be overwritten. syzbot has reported quite a few crashes due to fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten, call traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... The issue exists since tunnel stats update is moved to iptunnel_xmit by Commit `039f50629b` ("ip_tunnel: Move stats update to iptunnel_xmit()"), and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb so that the packets counting won't happen on dev->tstats. Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com Fixes: `039f50629b` ("ip_tunnel: Move stats update to iptunnel_xmit()") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 20:48:45 -04:00
Xin Long	6f6a862205	ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL A similar fix to Patch "ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL" is also needed by ip6_tunnel. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 20:48:45 -04:00
Xin Long	5684abf702	ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL iptunnel_xmit() works as a common function, also used by a udp tunnel which doesn't have to have a tunnel device, like how TIPC works with udp media. In these cases, we should allow not to count pkts on dev's tstats, so that udp tunnel can work with no tunnel device safely. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-18 20:48:45 -04:00
Daniel Borkmann	94079b6425	Merge branch 'bpf-bounded-loops' Alexei Starovoitov says: ==================== v2->v3: fixed issues in backtracking pointed out by Andrii. The next step is to add a lot more tests for backtracking. v1->v2: addressed Andrii's feedback. this patch set introduces verifier support for bounded loops and adds several other improvements. Ideally they would be introduced one at a time, but to support bounded loop the verifier needs to 'step back' in the patch 1. That patch introduces tracking of spill/fill of constants through the stack. Though it's a useful feature it hurts cilium tests. Patch 3 introduces another feature by extending is_branch_taken logic to 'if rX op rY' conditions. This feature is also necessary to support bounded loops. Then patch 4 adds support for the loops while adding key heuristics with jmp_processed. Introduction of parentage chain of verifier states in patch 4 allows patch 9 to add backtracking of precise scalar registers which finally resolves degradation from patch 1. The end result is much faster verifier for existing programs and new support for loops. See patch 8 for many kinds of loops that are now validated. Patch 9 is the most tricky one and could be rewritten with a different algorithm in the future. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:53 +02:00
Alexei Starovoitov	b5dc0163d8	bpf: precise scalar_value tracking Introduce precision tracking logic that helps cilium programs the most: old clang old clang new clang new clang with all patches with all patches bpf_lb-DLB_L3.o 1838 2283 1923 1863 bpf_lb-DLB_L4.o 3218 2657 3077 2468 bpf_lb-DUNKNOWN.o 1064 545 1062 544 bpf_lxc-DDROP_ALL.o 26935 23045 166729 22629 bpf_lxc-DUNKNOWN.o 34439 35240 174607 28805 bpf_netdev.o 9721 8753 8407 6801 bpf_overlay.o 6184 7901 5420 4754 bpf_lxc_jit.o 39389 50925 39389 50925 Consider code: 654: (85) call bpf_get_hash_recalc#34 655: (bf) r7 = r0 656: (15) if r8 == 0x0 goto pc+29 657: (bf) r2 = r10 658: (07) r2 += -48 659: (18) r1 = 0xffff8881e41e1b00 661: (85) call bpf_map_lookup_elem#1 662: (15) if r0 == 0x0 goto pc+23 663: (69) r1 = (u16 )(r0 +0) 664: (15) if r1 == 0x0 goto pc+21 665: (bf) r8 = r7 666: (57) r8 &= 65535 667: (bf) r2 = r8 668: (3f) r2 /= r1 669: (2f) r2 = r1 670: (bf) r1 = r8 671: (1f) r1 -= r2 672: (57) r1 &= 255 673: (25) if r1 > 0x1e goto pc+12 R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f)) 674: (67) r1 <<= 1 675: (0f) r0 += r1 At this point the verifier will notice that scalar R1 is used in map pointer adjustment. R1 has to be precise for later operations on R0 to be validated properly. The verifier will backtrack the above code in the following way: last_idx 675 first_idx 664 regs=2 stack=0 before 675: (0f) r0 += r1 // started backtracking R1 regs=2 is a bitmask regs=2 stack=0 before 674: (67) r1 <<= 1 regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12 regs=2 stack=0 before 672: (57) r1 &= 255 regs=2 stack=0 before 671: (1f) r1 -= r2 // now both R1 and R2 has to be precise -> regs=6 mask regs=6 stack=0 before 670: (bf) r1 = r8 // after this insn R8 and R2 has to be precise regs=104 stack=0 before 669: (2f) r2 = r1 // after this one R8, R2, and R1 regs=106 stack=0 before 668: (3f) r2 /= r1 regs=106 stack=0 before 667: (bf) r2 = r8 regs=102 stack=0 before 666: (57) r8 &= 65535 regs=102 stack=0 before 665: (bf) r8 = r7 regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21 // this is the end of verifier state. The following regs will be marked precised: R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0) parent didn't have regs=82 stack=0 marks // so backtracking continues into parent state last_idx 663 first_idx 655 regs=82 stack=0 before 663: (69) r1 = (u16 )(r0 +0) // R1 was assigned no need to track it further regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23 // keep tracking R7 regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1 // keep tracking R7 regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00 regs=80 stack=0 before 658: (07) r2 += -48 regs=80 stack=0 before 657: (bf) r2 = r10 regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29 regs=80 stack=0 before 655: (bf) r7 = r0 // here the assignment into R7 // mark R0 to be precise: R0_rw=invP(id=0) parent didn't have regs=1 stack=0 marks // regs=1 -> tracking R0 last_idx 654 first_idx 644 regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value // nothing further to backtrack Two scalar registers not marked precise are equivalent from state pruning point of view. More details in the patch comments. It doesn't support bpf2bpf calls yet and enabled for root only. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:52 +02:00
Alexei Starovoitov	b061017f8b	selftests/bpf: add realistic loop tests Add a bunch of loop tests. Most of them are created by replacing '#pragma unroll' with '#pragma clang loop unroll(disable)' Several tests are artificially large: /* partial unroll. llvm will unroll loop ~150 times. * C loop count -> 600. * Asm loop count -> 4. * 16k insns in loop body. * Total of 5 such loops. Total program size ~82k insns. / "./pyperf600.o", / no unroll at all. * C loop count -> 600. * ASM loop count -> 600. * ~110 insns in loop body. * Total of 5 such loops. Total program size ~1500 insns. / "./pyperf600_nounroll.o", / partial unroll. 19k insn in a loop. * Total program size 20.8k insn. * ~350k processed_insns */ "./strobemeta.o", Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:52 +02:00
Alexei Starovoitov	0d3679e99a	selftests/bpf: add basic verifier tests for loops This set of tests is a rewrite of Edward's earlier tests: https://patchwork.ozlabs.org/patch/877221/ Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:52 +02:00
Alexei Starovoitov	aeee380ccf	selftests/bpf: fix tests Fix tests that assumed no loops. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:52 +02:00
Alexei Starovoitov	eea1c227b9	bpf: fix callees pruning callers The commit `7640ead939` partially resolved the issue of callees incorrectly pruning the callers. With introduction of bounded loops and jmps_processed heuristic single verifier state may contain multiple branches and calls. It's possible that new verifier state (for future pruning) will be allocated inside callee. Then callee will exit (still within the same verifier state). It will go back to the caller and there R6-R9 registers will be read and will trigger mark_reg_read. But the reg->live for all frames but the top frame is not set to LIVE_NONE. Hence mark_reg_read will fail to propagate liveness into parent and future walking will incorrectly conclude that the states are equivalent because LIVE_READ is not set. In other words the rule for parent/live should be: whenever register parentage chain is set the reg->live should be set to LIVE_NONE. is_state_visited logic already follows this rule for spilled registers. Fixes: `7640ead939` ("bpf: verifier: make sure callees don't prune with caller differences") Fixes: `f4d7e40a5b` ("bpf: introduce function calls (verification)") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:51 +02:00
Alexei Starovoitov	2589726d12	bpf: introduce bounded loops Allow the verifier to validate the loops by simulating their execution. Exisiting programs have used '#pragma unroll' to unroll the loops by the compiler. Instead let the verifier simulate all iterations of the loop. In order to do that introduce parentage chain of bpf_verifier_state and 'branches' counter for the number of branches left to explore. See more detailed algorithm description in bpf_verifier.h This algorithm borrows the key idea from Edward Cree approach: https://patchwork.ozlabs.org/patch/877222/ Additional state pruning heuristics make such brute force loop walk practical even for large loops. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:51 +02:00
Alexei Starovoitov	fb8d251ee2	bpf: extend is_branch_taken to registers This patch extends is_branch_taken() logic from JMP+K instructions to JMP+X instructions. Conditional branches are often done when src and dst registers contain known scalars. In such case the verifier can follow the branch that is going to be taken when program executes. That speeds up the verification and is essential feature to support bounded loops. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-19 02:22:51 +02:00

... 2 3 4 5 6 ...

842735 Commits All Branches Search

842735 Commits

All Branches