diff --git a/SOURCES/openvswitch-2.13.0.patch b/SOURCES/openvswitch-2.13.0.patch index 956ac75..c446119 100644 --- a/SOURCES/openvswitch-2.13.0.patch +++ b/SOURCES/openvswitch-2.13.0.patch @@ -629,10 +629,18 @@ index fe3935fca2..4c8772f63a 100644 ankur dwivedi ankurengg2003@gmail.com chen zhang 3zhangchen9211@gmail.com diff --git a/Documentation/automake.mk b/Documentation/automake.mk -index 22976a3cd6..f46ec988a3 100644 +index 22976a3cd6..0052c99061 100644 --- a/Documentation/automake.mk +++ b/Documentation/automake.mk -@@ -217,8 +217,13 @@ install-man-rst: docs-check +@@ -57,6 +57,7 @@ DOC_SOURCE = \ + Documentation/topics/ovsdb-replication.rst \ + Documentation/topics/porting.rst \ + Documentation/topics/tracing.rst \ ++ Documentation/topics/usdt-probes.rst \ + Documentation/topics/userspace-tso.rst \ + Documentation/topics/windows.rst \ + Documentation/howto/index.rst \ +@@ -217,8 +218,13 @@ install-man-rst: docs-check $(extract_stem_and_section); \ echo " $(MKDIR_P) '$(DESTDIR)'\"$$mandir\""; \ $(MKDIR_P) '$(DESTDIR)'"$$mandir"; \ @@ -885,6 +893,15 @@ index c6c6fd8bde..14b32ab3b8 100644 $ export DPDK_TARGET=x86_64-native-linuxapp-gcc $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET $ cd $DPDK_DIR +diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst +index 08af3a24d0..019d60703f 100644 +--- a/Documentation/topics/index.rst ++++ b/Documentation/topics/index.rst +@@ -53,3 +53,4 @@ OVS + userspace-tso + idl-compound-indexes + ovs-extensions ++ usdt-probes diff --git a/Documentation/topics/testing.rst b/Documentation/topics/testing.rst index 161e9d442e..fb1cbdf25e 100644 --- a/Documentation/topics/testing.rst @@ -943,6 +960,393 @@ index 161e9d442e..fb1cbdf25e 100644 vsperf ------ +diff --git a/Documentation/topics/usdt-probes.rst b/Documentation/topics/usdt-probes.rst +new file mode 100644 +index 0000000000..cdded4f906 +--- /dev/null ++++ b/Documentation/topics/usdt-probes.rst +@@ -0,0 +1,381 @@ ++.. ++ Licensed under the Apache License, Version 2.0 (the "License"); you may ++ not use this file except in compliance with the License. You may obtain ++ a copy of the License at ++ ++ http://www.apache.org/licenses/LICENSE-2.0 ++ ++ Unless required by applicable law or agreed to in writing, software ++ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT ++ WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the ++ License for the specific language governing permissions and limitations ++ under the License. ++ ++ Convention for heading levels in Open vSwitch documentation: ++ ++ ======= Heading 0 (reserved for the title in a document) ++ ------- Heading 1 ++ ~~~~~~~ Heading 2 ++ +++++++ Heading 3 ++ ''''''' Heading 4 ++ ++ Avoid deeper levels because they do not render well. ++ ++============================================= ++User Statically-Defined Tracing (USDT) probes ++============================================= ++ ++Sometimes it's desired to troubleshoot one of OVS's components in the field. ++One of the techniques used is to add dynamic tracepoints, for example using ++perf_. However, the desired dynamic tracepoint and/or the desired variable, ++might not be available due to compiler optimizations. ++ ++In this case, a well-thought-off, static tracepoint could be permanently added, ++so it's always available. For OVS we use the DTrace probe macro's, which have ++little to no overhead when disabled. Various tools exist to enable them. See ++some examples below. ++ ++ ++Compiling with USDT probes enabled ++---------------------------------- ++ ++Since USDT probes are compiled out by default, a compile-time option is ++available to include them. To add the probes to the generated code, use the ++following configure option :: ++ ++ $ ./configure --enable-usdt-probes ++ ++The following line should be seen in the configure output when the above option ++is used :: ++ ++ checking whether USDT probes are enabled... yes ++ ++ ++Listing available probes ++------------------------ ++ ++There are various ways to display USDT probes available in a userspace ++application. Here we show three examples. All assuming ovs-vswitchd is in the ++search path with USDT probes enabled: ++ ++You can use the **perf** tool as follows :: ++ ++ $ perf buildid-cache --add $(which ovs-vswitchd) ++ $ perf list | grep sdt_ ++ sdt_main:poll_block [SDT event] ++ sdt_main:run_start [SDT event] ++ ++You can use the bpftrace_ tool :: ++ ++ # bpftrace -l "usdt:$(which ovs-vswitchd):*" ++ usdt:/usr/sbin/ovs-vswitchd:main:poll_block ++ usdt:/usr/sbin/ovs-vswitchd:main:run_start ++ ++.. note:: ++ ++ If you execute this on a running process, ++ ``bpftrace -lp $(pidof ovs-vswitchd) "usdt:*"`` , it will list all USDT ++ events, i.e., also the ones available in the used shared libraries. ++ ++Finally, you can use the **tplist** tool which is part of the bcc_ framework :: ++ ++ $ /usr/share/bcc/tools/tplist -vv -l $(which ovs-vswitchd) ++ b'main':b'poll_block' [sema 0x0] ++ location #1 b'/usr/sbin/ovs-vswitchd' 0x407fdc ++ b'main':b'run_start' [sema 0x0] ++ location #1 b'/usr/sbin/ovs-vswitchd' 0x407ff6 ++ ++ ++Using probes ++------------ ++ ++We will use the OVS sandbox environment in combination with the probes shown ++above to try out some of the available trace tools. To start up the virtual ++environment use the ``make sandbox`` command. In addition we have to create ++a bridge to kick of the main bridge loop :: ++ ++ $ ovs-vsctl add-br test_bridge ++ $ ovs-vsctl show ++ 055acdca-2f0c-4f6e-b542-f4b6d2c44e08 ++ Bridge test_bridge ++ Port test_bridge ++ Interface test_bridge ++ type: internal ++ ++perf ++~~~~ ++ ++Perf is using Linux uprobe based event tracing to for capturing the events. ++To enable the main:\* probes as displayed above and take an actual trace, you ++need to execute the following sequence of perf commands :: ++ ++ # perf buildid-cache --add $(which ovs-vswitchd) ++ ++ # perf list | grep sdt_ ++ sdt_main:poll_block [SDT event] ++ sdt_main:run_start [SDT event] ++ ++ # perf probe --add=sdt_main:poll_block --add=sdt_main:run_start ++ Added new events: ++ sdt_main:poll_block (on %poll_block in /usr/sbin/ovs-vswitchd) ++ sdt_main:run_start (on %run_start in /usr/sbin/ovs-vswitchd) ++ ++ You can now use it in all perf tools, such as: ++ ++ perf record -e sdt_main:run_start -aR sleep 1 ++ ++ # perf record -e sdt_main:run_start -e sdt_main:poll_block \ ++ -p $(pidof ovs-vswitchd) sleep 30 ++ [ perf record: Woken up 1 times to write data ] ++ [ perf record: Captured and wrote 0.039 MB perf.data (132 samples) ] ++ ++ # perf script ++ ovs-vswitchd 8576 [011] 21031.340433: sdt_main:run_start: (407ff6) ++ ovs-vswitchd 8576 [011] 21031.340516: sdt_main:poll_block: (407fdc) ++ ovs-vswitchd 8576 [011] 21031.841726: sdt_main:run_start: (407ff6) ++ ovs-vswitchd 8576 [011] 21031.842088: sdt_main:poll_block: (407fdc) ++ ... ++ ++Note that the above examples works with the sandbox environment, so make sure ++you execute the above command while in the sandbox shell! ++ ++There are a lot more options available with perf, for example, the ++``--call-graph dwarf`` option, which would give you a call graph in the ++``perf script`` output. See the perf documentation for more information. ++ ++One other interesting feature is that the perf data can be converted for use ++by the trace visualizer `Trace Compass`_. This can be done using the ++``--all --to-ctf`` option to the ``perf data convert`` tool. ++ ++ ++bpftrace ++~~~~~~~~ ++ ++bpftrace is a high-level tracing language based on eBPF, which can be used to ++script USDT probes. Here we will show a simple one-liner to display the ++USDT probes being hit. However, the script section below reference some more ++advanced bpftrace scripts. ++ ++This is a simple bpftrace one-liner to show all ``main:*`` USDT probes :: ++ ++ # bpftrace -p $(pidof ovs-vswitchd) -e \ ++ 'usdt::main:* { printf("%s %u [%u] %u %s\n", ++ comm, pid, cpu, elapsed, probe); }' ++ Attaching 2 probes... ++ ovs-vswitchd 8576 [11] 203833199 usdt:main:run_start ++ ovs-vswitchd 8576 [11] 204086854 usdt:main:poll_block ++ ovs-vswitchd 8576 [11] 221611985 usdt:main:run_start ++ ovs-vswitchd 8576 [11] 221892019 usdt:main:poll_block ++ ++ ++bcc ++~~~ ++ ++The BPF Compiler Collection (BCC) is a set of tools and scripts that also use ++eBPF for tracing. The example below uses the ``trace`` tool to show the events ++while they are being generated :: ++ ++ # /usr/share/bcc/tools/trace -T -p $(pidof ovs-vswitchd) \ ++ 'u::main:run_start' 'u::main:poll_block' ++ TIME PID TID COMM FUNC ++ 15:49:06 8576 8576 ovs-vswitchd main:run_start ++ 15:49:06 8576 8576 ovs-vswitchd main:poll_block ++ 15:49:06 8576 8576 ovs-vswitchd main:run_start ++ 15:49:06 8576 8576 ovs-vswitchd main:poll_block ++ ^C ++ ++ ++Scripts ++------- ++To not have to re-invent the wheel when trying to debug complex OVS issues, a ++set of scripts are provided in the source repository. They are located in the ++``utilities/usdt-scripts/`` directory, and each script contains detailed ++information on how they should be used, and what information they provide. ++ ++ ++Available probes ++---------------- ++The next sections describes all the available probes, their use case, and if ++used in any script, which one. Any new probes being added to OVS should get ++their own section. See the below "Adding your own probes" section for the ++used naming convention. ++ ++Available probes in ``ovs_vswitchd``: ++ ++- dpif_netlink_operate\_\_:op_flow_del ++- dpif_netlink_operate\_\_:op_flow_execute ++- dpif_netlink_operate\_\_:op_flow_get ++- dpif_netlink_operate\_\_:op_flow_put ++- dpif_recv:recv_upcall ++- main:poll_block ++- main:run_start ++ ++ ++dpif_netlink_operate\_\_:op_flow_del ++~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ++ ++**Description**: ++ ++This probe gets triggered when the Netlink datapath is about to execute the ++DPIF_OP_FLOW_DEL operation as part of the dpif ``operate()`` callback. ++ ++**Arguments**: ++ ++- *arg0*: ``(struct dpif_netlink *) dpif`` ++- *arg1*: ``(struct dpif_flow_del *) del`` ++- *arg2*: ``(struct dpif_netlink_flow *) flow`` ++- *arg3*: ``(struct ofpbuf *) aux->request`` ++ ++**Script references**: ++ ++- *None* ++ ++ ++dpif_netlink_operate\_\_:op_flow_execute ++~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ++ ++**Description**: ++ ++This probe gets triggered when the Netlink datapath is about to execute the ++DPIF_OP_FLOW_EXECUTE operation as part of the dpif ``operate()`` callback. ++ ++**Arguments**: ++ ++- *arg0*: ``(struct dpif_netlink *) dpif`` ++- *arg1*: ``(struct dpif_execute *) op->execute`` ++- *arg2*: ``dp_packet_data(op->execute.packet)`` ++- *arg3*: ``dp_packet_size(op->execute.packet)`` ++- *arg4*: ``(struct ofpbuf *) aux->request`` ++ ++**Script references**: ++ ++- ``utilities/usdt-scripts/upcall_cost.py`` ++ ++ ++dpif_netlink_operate\_\_:op_flow_get ++~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ++ ++**Description**: ++ ++This probe gets triggered when the Netlink datapath is about to execute the ++DPIF_OP_FLOW_GET operation as part of the dpif ``operate()`` callback. ++ ++**Arguments**: ++ ++- *arg0*: ``(struct dpif_netlink *) dpif`` ++- *arg1*: ``(struct dpif_flow_get *) get`` ++- *arg2*: ``(struct dpif_netlink_flow *) flow`` ++- *arg3*: ``(struct ofpbuf *) aux->request`` ++ ++**Script references**: ++ ++- *None* ++ ++ ++dpif_netlink_operate\_\_:op_flow_put ++~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ++ ++**Description**: ++ ++This probe gets triggered when the Netlink datapath is about to execute the ++DPIF_OP_FLOW_PUT operation as part of the dpif ``operate()`` callback. ++ ++**Arguments**: ++ ++- *arg0*: ``(struct dpif_netlink *) dpif`` ++- *arg1*: ``(struct dpif_flow_put *) put`` ++- *arg2*: ``(struct dpif_netlink_flow *) flow`` ++- *arg3*: ``(struct ofpbuf *) aux->request`` ++ ++**Script references**: ++ ++- ``utilities/usdt-scripts/upcall_cost.py`` ++ ++ ++probe dpif_recv:recv_upcall ++~~~~~~~~~~~~~~~~~~~~~~~~~~~ ++ ++**Description**: ++ ++This probe gets triggered when the datapath independent layer gets notified ++that a packet needs to be processed by userspace. This allows the probe to ++intercept all packets sent by the kernel to ``ovs-vswitchd``. The ++``upcall_monitor.py`` script uses this probe to display and capture all packets ++sent to ``ovs-vswitchd``. ++ ++**Arguments**: ++ ++- *arg0*: ``(struct dpif *)->full_name`` ++- *arg1*: ``(struct dpif_upcall *)->type`` ++- *arg2*: ``dp_packet_data((struct dpif_upcall *)->packet)`` ++- *arg3*: ``dp_packet_size((struct dpif_upcall *)->packet)`` ++- *arg4*: ``(struct dpif_upcall *)->key`` ++- *arg5*: ``(struct dpif_upcall *)->key_len`` ++ ++**Script references**: ++ ++- ``utilities/usdt-scripts/upcall_cost.py`` ++- ``utilities/usdt-scripts/upcall_monitor.py`` ++ ++ ++probe main:run_start ++~~~~~~~~~~~~~~~~~~~~ ++ ++**Description**: ++The ovs-vswitchd's main process contains a loop that runs every time some work ++needs to be done. This probe gets triggered every time the loop starts from the ++beginning. See also the ``main:poll_block`` probe below. ++ ++**Arguments**: ++ ++*None* ++ ++**Script references**: ++ ++- ``utilities/usdt-scripts/bridge_loop.bt`` ++ ++ ++probe main:poll_block ++~~~~~~~~~~~~~~~~~~~~~ ++ ++**Description**: ++The ovs-vswitchd's main process contains a loop that runs every time some work ++needs to be done. This probe gets triggered every time the loop is done, and ++it's about to wait for being re-started by a poll_block() call returning. ++See also the ``main:run_start`` probe above. ++ ++**Arguments**: ++ ++*None* ++ ++**Script references**: ++ ++- ``utilities/usdt-scripts/bridge_loop.bt`` ++ ++ ++Adding your own probes ++---------------------- ++ ++Adding your own probes is as simple as adding the ``OVS_USDT_PROBE()`` macro ++to the code. It's similar to the ``DTRACE_PROBExx`` macro's with the difference ++that it does automatically determine the number of optional arguments. ++ ++The macro requires at least two arguments. The first one being the *provider*, ++and the second one being the *name*. To keep some consistency with the probe ++naming, please use the following convention. The *provider* should be the ++function name, and the *name* should be the name of the tracepoint. If you do ++function entry and exit like probes, please use ``entry`` and ``exit``. ++ ++If, for some reason, you do not like to use the function name as a *provider*, ++please prefix it with ``__``, so we know it's not a function name. ++ ++The remaining parameters, up to 10, can be variables, pointers, etc., that ++might be of interest to capture at this point in the code. Note that the ++provided variables can cause the compiler to be less effective in optimizing. ++ ++ ++ ++.. _perf : https://developers.redhat.com/blog/2020/05/29/debugging-vhost-user-tx-contention-in-open-vswitch# ++.. _bpftrace : https://github.com/iovisor/bpftrace ++.. _bcc : https://github.com/iovisor/bcc ++.. _Trace Compass : https://www.eclipse.org/tracecompass/ diff --git a/Documentation/topics/userspace-tso.rst b/Documentation/topics/userspace-tso.rst index 94eddc0b2f..f7b6b2639a 100644 --- a/Documentation/topics/userspace-tso.rst @@ -1256,7 +1660,7 @@ index f6b88ca2d0..9429702db9 100755 manpages=`cd $distdir && echo *` diff --git a/configure.ac b/configure.ac -index 92b52f6712..b2bb3796ac 100644 +index 92b52f6712..cb88e03e89 100644 --- a/configure.ac +++ b/configure.ac @@ -13,7 +13,7 @@ @@ -1268,7 +1672,15 @@ index 92b52f6712..b2bb3796ac 100644 AC_CONFIG_SRCDIR([datapath/datapath.c]) AC_CONFIG_MACRO_DIR([m4]) AC_CONFIG_AUX_DIR([build-aux]) -@@ -100,6 +100,7 @@ OVS_CHECK_IF_DL +@@ -88,6 +88,7 @@ OVS_CHECK_WIN32 + OVS_CHECK_VISUAL_STUDIO_DDK + OVS_CHECK_COVERAGE + OVS_CHECK_NDEBUG ++OVS_CHECK_USDT + OVS_CHECK_NETLINK + OVS_CHECK_OPENSSL + OVS_CHECK_LIBCAPNG +@@ -100,6 +101,7 @@ OVS_CHECK_IF_DL OVS_CHECK_STRTOK_R OVS_CHECK_LINUX_AF_XDP AC_CHECK_DECLS([sys_siglist], [], [], [[#include ]]) @@ -1276,7 +1688,7 @@ index 92b52f6712..b2bb3796ac 100644 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec], [], [], [[#include ]]) AC_CHECK_MEMBERS([struct ifreq.ifr_flagshigh], [], [], [[#include ]]) -@@ -188,6 +189,7 @@ OVS_CHECK_LINUX +@@ -188,6 +190,7 @@ OVS_CHECK_LINUX OVS_CHECK_LINUX_NETLINK OVS_CHECK_LINUX_TC OVS_CHECK_LINUX_SCTP_CT @@ -80309,6 +80721,18 @@ index b0a5ce8bec..bc51a5767f 100644 }; enum { +diff --git a/include/openvswitch/automake.mk b/include/openvswitch/automake.mk +index 73c3461757..14a5e3b34b 100644 +--- a/include/openvswitch/automake.mk ++++ b/include/openvswitch/automake.mk +@@ -43,6 +43,7 @@ openvswitchinclude_HEADERS = \ + include/openvswitch/tun-metadata.h \ + include/openvswitch/type-props.h \ + include/openvswitch/types.h \ ++ include/openvswitch/usdt-probes.h \ + include/openvswitch/util.h \ + include/openvswitch/uuid.h \ + include/openvswitch/version.h \ diff --git a/include/openvswitch/compiler.h b/include/openvswitch/compiler.h index 5289a70f6e..cf009f8264 100644 --- a/include/openvswitch/compiler.h @@ -80334,6 +80758,62 @@ index 1f81d830e7..4b9893388b 100644 bool mf_is_pipeline_field(const struct mf_field *); bool mf_is_set(const struct mf_field *, const struct flow *); void mf_mask_field(const struct mf_field *, struct flow_wildcards *); +diff --git a/include/openvswitch/usdt-probes.h b/include/openvswitch/usdt-probes.h +new file mode 100644 +index 0000000000..a49563f071 +--- /dev/null ++++ b/include/openvswitch/usdt-probes.h +@@ -0,0 +1,50 @@ ++/* ++ * Copyright (c) 2021 Red Hat, Inc. ++ * ++ * Licensed under the Apache License, Version 2.0 (the "License"); ++ * you may not use this file except in compliance with the License. ++ * You may obtain a copy of the License at: ++ * ++ * http://www.apache.org/licenses/LICENSE-2.0 ++ * ++ * Unless required by applicable law or agreed to in writing, software ++ * distributed under the License is distributed on an "AS IS" BASIS, ++ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ++ * See the License for the specific language governing permissions and ++ * limitations under the License. ++ */ ++ ++#ifndef OPENVSWITCH_USDT_PROBES_H ++#define OPENVSWITCH_USDT_PROBES_H 1 ++ ++#ifdef HAVE_USDT_PROBES ++#include ++#endif ++ ++#ifdef __cplusplus ++extern "C" { ++#endif ++ ++#ifdef HAVE_USDT_PROBES ++ ++#define GET_DTRACE_FUNC(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, \ ++ NAME, ...) NAME ++ ++#define OVS_USDT_PROBE(provider, name, ...) \ ++ GET_DTRACE_FUNC(_0, ##__VA_ARGS__, DTRACE_PROBE10, DTRACE_PROBE9, \ ++ DTRACE_PROBE8, DTRACE_PROBE7, DTRACE_PROBE6, \ ++ DTRACE_PROBE5, DTRACE_PROBE4, DTRACE_PROBE3, \ ++ DTRACE_PROBE2, DTRACE_PROBE1, DTRACE_PROBE) \ ++ (provider, name, ##__VA_ARGS__) ++ ++#else ++ ++#define OVS_USDT_PROBE(...) ++ ++#endif ++ ++#ifdef __cplusplus ++} ++#endif ++ ++#endif /* usdt-probes.h */ diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in index 37e3703245..41ef886194 100755 --- a/ipsec/ovs-monitor-ipsec.in @@ -81887,10 +82367,18 @@ index d393aab5e3..a73102ea98 100644 } else { atomic_count_set(&pmd->pmd_overloaded, 0); diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c -index 5b5c96d727..2c6c23c645 100644 +index 5b5c96d727..805f1d31f3 100644 --- a/lib/dpif-netlink.c +++ b/lib/dpif-netlink.c -@@ -691,6 +691,7 @@ dpif_netlink_set_features(struct dpif *dpif_, uint32_t new_features) +@@ -56,6 +56,7 @@ + #include "openvswitch/poll-loop.h" + #include "openvswitch/shash.h" + #include "openvswitch/thread.h" ++#include "openvswitch/usdt-probes.h" + #include "openvswitch/vlog.h" + #include "packets.h" + #include "random.h" +@@ -691,6 +692,7 @@ dpif_netlink_set_features(struct dpif *dpif_, uint32_t new_features) dpif_netlink_dp_init(&request); request.cmd = OVS_DP_CMD_SET; @@ -81898,7 +82386,50 @@ index 5b5c96d727..2c6c23c645 100644 request.dp_ifindex = dpif->dp_ifindex; request.user_features = dpif->user_features | new_features; -@@ -2046,6 +2047,7 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put) +@@ -1874,6 +1876,9 @@ dpif_netlink_operate__(struct dpif_netlink *dpif, + aux->txn.reply = &aux->reply; + } + dpif_netlink_flow_to_ofpbuf(&flow, &aux->request); ++ ++ OVS_USDT_PROBE(dpif_netlink_operate__, op_flow_put, ++ dpif, put, &flow, &aux->request); + break; + + case DPIF_OP_FLOW_DEL: +@@ -1884,6 +1889,9 @@ dpif_netlink_operate__(struct dpif_netlink *dpif, + aux->txn.reply = &aux->reply; + } + dpif_netlink_flow_to_ofpbuf(&flow, &aux->request); ++ ++ OVS_USDT_PROBE(dpif_netlink_operate__, op_flow_del, ++ dpif, del, &flow, &aux->request); + break; + + case DPIF_OP_EXECUTE: +@@ -1904,6 +1912,12 @@ dpif_netlink_operate__(struct dpif_netlink *dpif, + } else { + dpif_netlink_encode_execute(dpif->dp_ifindex, &op->execute, + &aux->request); ++ ++ OVS_USDT_PROBE(dpif_netlink_operate__, op_flow_execute, ++ dpif, &op->execute, ++ dp_packet_data(op->execute.packet), ++ dp_packet_size(op->execute.packet), ++ &aux->request); + } + break; + +@@ -1912,6 +1926,9 @@ dpif_netlink_operate__(struct dpif_netlink *dpif, + dpif_netlink_init_flow_get(dpif, get, &flow); + aux->txn.reply = get->buffer; + dpif_netlink_flow_to_ofpbuf(&flow, &aux->request); ++ ++ OVS_USDT_PROBE(dpif_netlink_operate__, op_flow_get, ++ dpif, get, &flow, &aux->request); + break; + + default: +@@ -2046,6 +2063,7 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put) uint8_t csum_on = false; int err; @@ -81906,7 +82437,7 @@ index 5b5c96d727..2c6c23c645 100644 if (put->flags & DPIF_FP_PROBE) { return EOPNOTSUPP; } -@@ -2141,7 +2143,11 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put) +@@ -2141,7 +2159,11 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put) out: if (err && err != EEXIST && (put->flags & DPIF_FP_MODIFY)) { /* Modified rule can't be offloaded, try and delete from HW */ @@ -81919,7 +82450,7 @@ index 5b5c96d727..2c6c23c645 100644 if (!del_err) { /* Delete from hw success, so old flow was offloaded. -@@ -2902,8 +2908,6 @@ dpif_netlink_ct_set_limits(struct dpif *dpif OVS_UNUSED, +@@ -2902,8 +2924,6 @@ dpif_netlink_ct_set_limits(struct dpif *dpif OVS_UNUSED, const uint32_t *default_limits, const struct ovs_list *zone_limits) { @@ -81928,7 +82459,7 @@ index 5b5c96d727..2c6c23c645 100644 if (ovs_ct_limit_family < 0) { return EOPNOTSUPP; } -@@ -2920,8 +2924,10 @@ dpif_netlink_ct_set_limits(struct dpif *dpif OVS_UNUSED, +@@ -2920,8 +2940,10 @@ dpif_netlink_ct_set_limits(struct dpif *dpif OVS_UNUSED, size_t opt_offset; opt_offset = nl_msg_start_nested(request, OVS_CT_LIMIT_ATTR_ZONE_LIMIT); if (default_limits) { @@ -81941,7 +82472,7 @@ index 5b5c96d727..2c6c23c645 100644 nl_msg_put(request, &req_zone_limit, sizeof req_zone_limit); } -@@ -2929,8 +2935,10 @@ dpif_netlink_ct_set_limits(struct dpif *dpif OVS_UNUSED, +@@ -2929,8 +2951,10 @@ dpif_netlink_ct_set_limits(struct dpif *dpif OVS_UNUSED, struct ct_dpif_zone_limit *zone_limit; LIST_FOR_EACH (zone_limit, node, zone_limits) { @@ -81954,7 +82485,7 @@ index 5b5c96d727..2c6c23c645 100644 nl_msg_put(request, &req_zone_limit, sizeof req_zone_limit); } } -@@ -3014,8 +3022,9 @@ dpif_netlink_ct_get_limits(struct dpif *dpif OVS_UNUSED, +@@ -3014,8 +3038,9 @@ dpif_netlink_ct_get_limits(struct dpif *dpif OVS_UNUSED, size_t opt_offset = nl_msg_start_nested(request, OVS_CT_LIMIT_ATTR_ZONE_LIMIT); @@ -81966,7 +82497,7 @@ index 5b5c96d727..2c6c23c645 100644 nl_msg_put(request, &req_zone_limit, sizeof req_zone_limit); struct ct_dpif_zone_limit *zone_limit; -@@ -3065,8 +3074,9 @@ dpif_netlink_ct_del_limits(struct dpif *dpif OVS_UNUSED, +@@ -3065,8 +3090,9 @@ dpif_netlink_ct_del_limits(struct dpif *dpif OVS_UNUSED, struct ct_dpif_zone_limit *zone_limit; LIST_FOR_EACH (zone_limit, node, zone_limits) { @@ -81978,7 +82509,7 @@ index 5b5c96d727..2c6c23c645 100644 nl_msg_put(request, &req_zone_limit, sizeof req_zone_limit); } nl_msg_end_nested(request, opt_offset); -@@ -3140,6 +3150,20 @@ dpif_netlink_ct_get_timeout_policy_name(struct dpif *dpif OVS_UNUSED, +@@ -3140,6 +3166,20 @@ dpif_netlink_ct_get_timeout_policy_name(struct dpif *dpif OVS_UNUSED, return 0; } @@ -81999,7 +82530,7 @@ index 5b5c96d727..2c6c23c645 100644 #define CT_DPIF_NL_TP_TCP_MAPPINGS \ CT_DPIF_NL_TP_MAPPING(TCP, TCP, SYN_SENT, SYN_SENT) \ CT_DPIF_NL_TP_MAPPING(TCP, TCP, SYN_RECV, SYN_RECV) \ -@@ -3982,6 +4006,7 @@ const struct dpif_class dpif_netlink_class = { +@@ -3982,6 +4022,7 @@ const struct dpif_class dpif_netlink_class = { dpif_netlink_ct_timeout_policy_dump_next, dpif_netlink_ct_timeout_policy_dump_done, dpif_netlink_ct_get_timeout_policy_name, @@ -82007,7 +82538,7 @@ index 5b5c96d727..2c6c23c645 100644 NULL, /* ipf_set_enabled */ NULL, /* ipf_set_min_frag */ NULL, /* ipf_set_max_nfrags */ -@@ -4638,7 +4663,7 @@ report_loss(struct dpif_netlink *dpif, struct dpif_channel *ch, uint32_t ch_idx, +@@ -4638,7 +4679,7 @@ report_loss(struct dpif_netlink *dpif, struct dpif_channel *ch, uint32_t ch_idx, time_msec() - ch->last_poll); } @@ -82041,10 +82572,51 @@ index b77317bca1..9941c9ba8a 100644 /* Disables or enables conntrack fragment reassembly. The default diff --git a/lib/dpif.c b/lib/dpif.c -index 9d9c716c13..1847689114 100644 +index 9d9c716c13..8ff7ea2adf 100644 --- a/lib/dpif.c +++ b/lib/dpif.c -@@ -1235,6 +1235,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_, +@@ -24,22 +24,18 @@ + #include + + #include "coverage.h" +-#include "dpctl.h" + #include "dp-packet.h" ++#include "dpctl.h" + #include "dpif-netdev.h" +-#include "openvswitch/dynamic-string.h" + #include "flow.h" ++#include "netdev-provider.h" + #include "netdev.h" + #include "netlink.h" + #include "odp-execute.h" + #include "odp-util.h" +-#include "openvswitch/ofp-print.h" +-#include "openvswitch/ofpbuf.h" + #include "packets.h" +-#include "openvswitch/poll-loop.h" + #include "route-table.h" + #include "seq.h" +-#include "openvswitch/shash.h" + #include "sset.h" + #include "timeval.h" + #include "tnl-neigh-cache.h" +@@ -47,9 +43,14 @@ + #include "util.h" + #include "uuid.h" + #include "valgrind.h" ++#include "openvswitch/dynamic-string.h" + #include "openvswitch/ofp-errors.h" ++#include "openvswitch/ofp-print.h" ++#include "openvswitch/ofpbuf.h" ++#include "openvswitch/poll-loop.h" ++#include "openvswitch/shash.h" ++#include "openvswitch/usdt-probes.h" + #include "openvswitch/vlog.h" +-#include "lib/netdev-provider.h" + + VLOG_DEFINE_THIS_MODULE(dpif); + +@@ -1235,6 +1236,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_, execute.needs_help = false; execute.probe = false; execute.mtu = 0; @@ -82052,6 +82624,19 @@ index 9d9c716c13..1847689114 100644 aux->error = dpif_execute(aux->dpif, &execute); log_execute_message(aux->dpif, &this_module, &execute, true, aux->error); +@@ -1580,6 +1582,12 @@ dpif_recv(struct dpif *dpif, uint32_t handler_id, struct dpif_upcall *upcall, + if (dpif->dpif_class->recv) { + error = dpif->dpif_class->recv(dpif, handler_id, upcall, buf); + if (!error) { ++ OVS_USDT_PROBE(dpif_recv, recv_upcall, dpif->full_name, ++ upcall->type, ++ dp_packet_data(&upcall->packet), ++ dp_packet_size(&upcall->packet), ++ upcall->key, upcall->key_len); ++ + dpif_print_packet(dpif, upcall); + } else if (error != EAGAIN) { + log_operation(dpif, "recv", error); diff --git a/lib/dpif.h b/lib/dpif.h index 4df8f7c8b7..31d3139845 100644 --- a/lib/dpif.h @@ -85830,6 +86415,45 @@ index 7ad8758fe6..067dcad157 100644 void bitwise_copy(const void *src, unsigned int src_len, unsigned int src_ofs, void *dst, unsigned int dst_len, unsigned int dst_ofs, unsigned int n_bits); +diff --git a/m4/openvswitch.m4 b/m4/openvswitch.m4 +index add3aabcc2..391d6d8f7c 100644 +--- a/m4/openvswitch.m4 ++++ b/m4/openvswitch.m4 +@@ -60,6 +60,34 @@ AC_DEFUN([OVS_CHECK_NDEBUG], + [ndebug=false]) + AM_CONDITIONAL([NDEBUG], [test x$ndebug = xtrue])]) + ++dnl Checks for --enable-usdt-probes and defines HAVE_USDT if it is specified. ++AC_DEFUN([OVS_CHECK_USDT], [ ++ AC_ARG_ENABLE( ++ [usdt-probes], ++ [AC_HELP_STRING([--enable-usdt-probes], ++ [Enable User Statically Defined Tracing (USDT) probes])], ++ [case "${enableval}" in ++ (yes) usdt=true ;; ++ (no) usdt=false ;; ++ (*) AC_MSG_ERROR([bad value ${enableval} for --enable-usdt-probes]) ;; ++ esac], ++ [usdt=false]) ++ ++ AC_MSG_CHECKING([whether USDT probes are enabled]) ++ if test "$usdt" != true; then ++ AC_MSG_RESULT([no]) ++ else ++ AC_MSG_RESULT([yes]) ++ ++ AC_CHECK_HEADER([sys/sdt.h], [], ++ [AC_MSG_ERROR([unable to find sys/sdt.h needed for USDT support])]) ++ ++ AC_DEFINE([HAVE_USDT_PROBES], [1], ++ [Define to 1 if USDT probes are enabled.]) ++ fi ++ AM_CONDITIONAL([HAVE_USDT_PROBES], [test $usdt = true]) ++]) ++ + dnl Checks for MSVC x64 compiler. + AC_DEFUN([OVS_CHECK_WIN64], + [AC_CACHE_CHECK( diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c index 51d656cba9..fd926cbb82 100644 --- a/ofproto/connmgr.c @@ -92679,6 +93303,42 @@ index bc49aa480e..39679dd444 100755 import re import sys +diff --git a/utilities/automake.mk b/utilities/automake.mk +index e2e22c39ae..f7aa1418d8 100644 +--- a/utilities/automake.mk ++++ b/utilities/automake.mk +@@ -56,12 +56,15 @@ EXTRA_DIST += \ + utilities/ovs-vlan-test.in \ + utilities/ovs-vsctl-bashcomp.bash \ + utilities/checkpatch.py \ +- utilities/docker/Makefile \ +- utilities/docker/ovs-override.conf \ +- utilities/docker/start-ovs \ +- utilities/docker/create_ovs_db.sh \ +- utilities/docker/debian/Dockerfile \ +- utilities/docker/debian/build-kernel-modules.sh ++ utilities/docker/Makefile \ ++ utilities/docker/ovs-override.conf \ ++ utilities/docker/start-ovs \ ++ utilities/docker/create_ovs_db.sh \ ++ utilities/docker/debian/Dockerfile \ ++ utilities/docker/debian/build-kernel-modules.sh \ ++ utilities/usdt-scripts/bridge_loop.bt \ ++ utilities/usdt-scripts/upcall_cost.py \ ++ utilities/usdt-scripts/upcall_monitor.py + MAN_ROOTS += \ + utilities/ovs-testcontroller.8.in \ + utilities/ovs-dpctl.8.in \ +@@ -128,6 +131,8 @@ FLAKE8_PYFILES += utilities/ovs-pcap.in \ + utilities/checkpatch.py utilities/ovs-dev.py \ + utilities/ovs-check-dead-ifs.in \ + utilities/ovs-tcpdump.in \ +- utilities/ovs-pipegen.py ++ utilities/ovs-pipegen.py \ ++ utilities/usdt-scripts/upcall_monitor.py \ ++ utilities/usdt-scripts/upcall_cost.py + + include utilities/bugtool/automake.mk diff --git a/utilities/bugtool/ovs-bugtool.in b/utilities/bugtool/ovs-bugtool.in index e55bfc2ed5..987d211069 100755 --- a/utilities/bugtool/ovs-bugtool.in @@ -93409,6 +94069,2464 @@ index bd3972636e..37cc72d401 100644 } ds_chomp(&ctx->output, ' '); ds_put_char(&ctx->output, '\n'); +diff --git a/utilities/usdt-scripts/bridge_loop.bt b/utilities/usdt-scripts/bridge_loop.bt +new file mode 100755 +index 0000000000..0116269475 +--- /dev/null ++++ b/utilities/usdt-scripts/bridge_loop.bt +@@ -0,0 +1,120 @@ ++#!/usr/bin/env bpftrace ++/* ++ * Copyright (c) 2021 Red Hat, Inc. ++ * ++ * Licensed under the Apache License, Version 2.0 (the "License"); ++ * you may not use this file except in compliance with the License. ++ * You may obtain a copy of the License at: ++ * ++ * http://www.apache.org/licenses/LICENSE-2.0 ++ * ++ * Unless required by applicable law or agreed to in writing, software ++ * distributed under the License is distributed on an "AS IS" BASIS, ++ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ++ * See the License for the specific language governing permissions and ++ * limitations under the License. ++ * ++ * ++ * Script information: ++ * ------------------- ++ * bridge_loop.bt uses the USDTs in the ovs-vswitchd's main bridge loop ++ * to report how long it spends running a single loop, as well as the time ++ * it spends waiting in the poll_loop(). Once done, it will also print a ++ * histogram for the time spent. ++ * ++ * The following is an example of how to use the script on the running ++ * ovs-vswitchd process: ++ * ++ * $ bridge_loop.bt -p `pgrep -n ovs-vswitchd` ++ * Attaching 4 probes... ++ * -------------------------------------------------------------- ++ * Tracing ovs-vswitchd's main() loop... Hit Ctrl-C to end. ++ * -------------------------------------------------------------- ++ * - [886467@ovs-vswitchd] bridge run loop time : 0:00:00.000230706 ++ * - [886467@ovs-vswitchd] poll_block() wait time: 0:00:00.501854292 ++ * - [886467@ovs-vswitchd] bridge run loop time : 0:00:00.000266445 ++ * - [886467@ovs-vswitchd] poll_block() wait time: 0:00:00.499750288 ++ * - [886467@ovs-vswitchd] bridge run loop time : 0:00:00.000254856 ++ * - [886467@ovs-vswitchd] poll_block() wait time: 0:00:00.499944280 ++ * - [886467@ovs-vswitchd] bridge run loop time : 0:00:00.000267390 ++ * - [886467@ovs-vswitchd] poll_block() wait time: 0:00:00.093566288 ++ * - [886467@ovs-vswitchd] bridge run loop time : 0:00:00.000316774 ++ * - [886467@ovs-vswitchd] poll_block() wait time: 0:00:00.406697754 ++ * - [886467@ovs-vswitchd] bridge run loop time : 0:00:00.000264505 ++ * ^C ++ * -------------------------------------------------------------- ++ * Showing run time histograms in micro seconds: ++ * -------------------------------------------------------------- ++ * ++ * @bridge_run_time: ++ * [0, 1000) 6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| ++ * ++ * ++ * @poll_block_wait_time: ++ * [90000, 120000) 1 |@@@@@@@@@@@@@@@@@ | ++ * [120000, 150000) 0 | | ++ * [150000, 180000) 0 | | ++ * [180000, 210000) 0 | | ++ * [210000, 240000) 0 | | ++ * [240000, 270000) 0 | | ++ * [270000, 300000) 0 | | ++ * [300000, 330000) 0 | | ++ * [330000, 360000) 0 | | ++ * [360000, 390000) 0 | | ++ * [390000, 420000) 1 |@@@@@@@@@@@@@@@@@ | ++ * [420000, 450000) 0 | | ++ * [450000, 480000) 0 | | ++ * [480000, 510000) 3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| ++ */ ++ ++ ++BEGIN ++{ ++ printf("--------------------------------------------------------------\n"); ++ printf("Tracing ovs-vswitchd's main() loop... Hit Ctrl-C to end.\n"); ++ printf("--------------------------------------------------------------\n"); ++} ++ ++ ++usdt::main:poll_block ++{ ++ @pb_start[tid] = nsecs; ++ if (@rs_start[tid] != 0) { ++ $delta = nsecs - @rs_start[tid]; ++ printf("- [%d@%s] bridge run loop time : %u:%2.2u:%2.2u.%9.9u\n", ++ tid, comm, ++ $delta / 3600 / 1000000000, ++ $delta / 60 / 1000000000 % 60, ++ $delta / 1000000000 % 60, ++ $delta % 1000000000); ++ @bridge_run_time = lhist($delta / 1000, 0, 1000000, 1000); ++ } ++} ++ ++ ++usdt::main:run_start ++{ ++ @rs_start[tid] = nsecs; ++ if (@pb_start[tid] != 0) { ++ $delta = nsecs - @pb_start[tid]; ++ printf("- [%d@%s] poll_block() wait time: %u:%2.2u:%2.2u.%9.9u\n", ++ tid, comm, ++ $delta / 3600 / 1000000000, ++ $delta / 60 / 1000000000 % 60, ++ $delta / 1000000000 % 60, ++ $delta % 1000000000); ++ @poll_block_wait_time = lhist($delta / 1000, 0, 30000000, 30000); ++ } ++} ++ ++ ++END ++{ ++ clear(@rs_start); ++ clear(@pb_start); ++ ++ printf("\n"); ++ printf("--------------------------------------------------------------\n"); ++ printf("Showing run time histograms in micro seconds:\n"); ++ printf("--------------------------------------------------------------"); ++} +diff --git a/utilities/usdt-scripts/upcall_cost.py b/utilities/usdt-scripts/upcall_cost.py +new file mode 100755 +index 0000000000..7656695857 +--- /dev/null ++++ b/utilities/usdt-scripts/upcall_cost.py +@@ -0,0 +1,1787 @@ ++#!/usr/bin/env python3 ++# ++# Copyright (c) 2021 Red Hat, Inc. ++# ++# Licensed under the Apache License, Version 2.0 (the "License"); ++# you may not use this file except in compliance with the License. ++# You may obtain a copy of the License at: ++# ++# http://www.apache.org/licenses/LICENSE-2.0 ++# ++# Unless required by applicable law or agreed to in writing, software ++# distributed under the License is distributed on an "AS IS" BASIS, ++# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ++# See the License for the specific language governing permissions and ++# limitations under the License. ++# ++# Script information: ++# ------------------- ++# upcall_cost.py uses various user space and kernel space probes to determine ++# the costs (in time) for handling the first packet in user space. It ++# calculates the following costs: ++# ++# - Time it takes from the kernel sending the upcall till it's received by the ++# ovs-vswitchd process. ++# - Time it takes from ovs-vswitchd sending the execute actions command till ++# the kernel receives it. ++# - The total time it takes from the kernel to sent the upcall until it ++# receives the packet execute command. ++# - The total time of the above, minus the time it takes for the actual lookup. ++# ++# In addition, it will also report the number of packets batched, as OVS will ++# first try to read UPCALL_MAX_BATCH(64) packets from kernel space and then ++# does the flow lookups and execution. So the smaller the batch size, the more ++# realistic are the cost estimates. ++# ++# The script does not need any options to attach to a running instance of ++# ovs-vswitchd. However, it's recommended always run the script with the ++# --write-events option. This way, if something does go wrong, the collected ++# data is saved. Use the --help option to see all the available options. ++# ++# Note: In addition to the bcc tools for your specific setup, you need the ++# following Python packages: ++# pip install alive-progress halo psutil scapy strenum text_histogram3 ++# ++ ++try: ++ from bcc import BPF, USDT, USDTException ++except ModuleNotFoundError: ++ print("WARNING: Can't find the BPF Compiler Collection (BCC) tools!") ++ print(" This is NOT problem if you analyzing previously collected" ++ " data.\n") ++from alive_progress import alive_bar ++from collections import namedtuple ++from halo import Halo ++from scapy.all import TCP, UDP ++from scapy.layers.l2 import Ether ++from strenum import StrEnum ++from text_histogram3 import histogram ++from time import process_time ++ ++import argparse ++import ast ++import psutil ++import re ++import struct ++import subprocess ++import sys ++import time ++ ++# ++# Global definitions ++# ++DP_TUNNEL_PORT = -1 ++ ++ ++# ++# Actual eBPF source code ++# ++ebpf_source = """ ++#include ++#include ++#include ++ ++#define MAX_PACKET ++#define MAX_KEY ++ ++enum { ++ EVENT_RECV_UPCALL = 0, ++ EVENT_DP_UPCALL, ++ EVENT_OP_FLOW_PUT, ++ EVENT_OP_FLOW_EXECUTE, ++ EVENT_OVS_PKT_EXEC, ++ _EVENT_MAX_EVENT ++}; ++ ++#define barrier_var(var) asm volatile("" : "=r"(var) : "0"(var)) ++ ++struct event_t { ++ u32 event; ++ u32 cpu; ++ u32 pid; ++ u32 upcall_type; ++ u64 ts; ++ u32 pkt_frag_size; ++ u32 pkt_size; ++ u64 key_size; ++ char comm[TASK_COMM_LEN]; ++ char dpif_name[32]; ++ char dev_name[16]; ++ unsigned char pkt[MAX_PACKET]; ++ unsigned char key[MAX_KEY]; ++}; ++ ++BPF_RINGBUF_OUTPUT(events, ); ++BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, _EVENT_MAX_EVENT); ++ ++static struct event_t *init_event(u32 type) ++{ ++ struct event_t *event = events.ringbuf_reserve(sizeof(struct event_t)); ++ ++ if (!event) { ++ uint64_t *value = dropcnt.lookup(&type); ++ if (value) ++ __sync_fetch_and_add(value, 1); ++ ++ return NULL; ++ } ++ ++ event->event = type; ++ event->ts = bpf_ktime_get_ns(); ++ event->cpu = bpf_get_smp_processor_id(); ++ event->pid = bpf_get_current_pid_tgid(); ++ bpf_get_current_comm(&event->comm, sizeof(event->comm)); ++ ++ return event; ++} ++ ++int trace__recv_upcall(struct pt_regs *ctx) { ++ uint32_t upcall_type; ++ uint64_t addr; ++ uint64_t size; ++ ++ bpf_usdt_readarg(2, ctx, &upcall_type); ++ if (upcall_type != 0) ++ return 0; ++ ++ struct event_t *event = init_event(EVENT_RECV_UPCALL); ++ if (!event) ++ return 1; ++ ++ bpf_usdt_readarg(1, ctx, &addr); ++ bpf_probe_read_str(&event->dpif_name, sizeof(event->dpif_name), ++ (void *)addr); ++ ++ event->upcall_type = upcall_type; ++ bpf_usdt_readarg(4, ctx, &event->pkt_size); ++ bpf_usdt_readarg(6, ctx, &event->key_size); ++ ++ if (event->pkt_size > MAX_PACKET) ++ size = MAX_PACKET; ++ else ++ size = event->pkt_size; ++ bpf_usdt_readarg(3, ctx, &addr); ++ bpf_probe_read(&event->pkt, size, (void *)addr); ++ ++ if (event->key_size > MAX_KEY) ++ size = MAX_KEY; ++ else ++ size = event->key_size; ++ bpf_usdt_readarg(5, ctx, &addr); ++ bpf_probe_read(&event->key, size, (void *)addr); ++ ++ events.ringbuf_submit(event, 0); ++ return 0; ++}; ++ ++ ++int trace__op_flow_put(struct pt_regs *ctx) { ++ uint64_t addr; ++ uint64_t size; ++ ++ struct event_t *event = init_event(EVENT_OP_FLOW_PUT); ++ if (!event) { ++ return 1; ++ } ++ ++ events.ringbuf_submit(event, 0); ++ return 0; ++}; ++ ++ ++int trace__op_flow_execute(struct pt_regs *ctx) { ++ uint64_t addr; ++ uint64_t size; ++ ++ struct event_t *event = init_event(EVENT_OP_FLOW_EXECUTE); ++ if (!event) { ++ return 1; ++ } ++ ++ bpf_usdt_readarg(4, ctx, &event->pkt_size); ++ ++ if (event->pkt_size > MAX_PACKET) ++ size = MAX_PACKET; ++ else ++ size = event->pkt_size; ++ bpf_usdt_readarg(3, ctx, &addr); ++ bpf_probe_read(&event->pkt, size, (void *)addr); ++ ++ events.ringbuf_submit(event, 0); ++ return 0; ++}; ++ ++ ++TRACEPOINT_PROBE(openvswitch, ovs_dp_upcall) { ++ uint64_t size; ++ struct sk_buff *skb = args->skbaddr; ++ ++ if (args->upcall_cmd != 1 || skb == NULL || skb->data == NULL) ++ return 0; ++ ++ struct event_t *event = init_event(EVENT_DP_UPCALL); ++ if (!event) { ++ return 1; ++ } ++ ++ event->upcall_type = args->upcall_cmd; ++ event->pkt_size = args->len; ++ ++ TP_DATA_LOC_READ_CONST(&event->dpif_name, dp_name, ++ sizeof(event->dpif_name)); ++ TP_DATA_LOC_READ_CONST(&event->dev_name, dev_name, ++ sizeof(event->dev_name)); ++ ++ if (skb->data_len != 0) { ++ event->pkt_frag_size = (skb->len - skb->data_len) & 0xfffffff; ++ size = event->pkt_frag_size; ++ } else { ++ event->pkt_frag_size = 0; ++ size = event->pkt_size; ++ } ++ ++ /* Prevent clang from using register mirroring (or any optimization) on ++ * the 'size' variable. */ ++ barrier_var(size); ++ ++ if (size > MAX_PACKET) ++ size = MAX_PACKET; ++ bpf_probe_read_kernel(event->pkt, size, skb->data); ++ ++ events.ringbuf_submit(event, 0); ++ return 0; ++} ++ ++int kprobe__ovs_packet_cmd_execute(struct pt_regs *ctx, struct sk_buff *skb) ++{ ++ uint64_t size; ++ ++ if (skb == NULL || skb->data == NULL) ++ return 0; ++ ++ struct event_t *event = init_event(EVENT_OVS_PKT_EXEC); ++ if (!event) { ++ return 1; ++ } ++ ++ events.ringbuf_submit(event, 0); ++ return 0; ++} ++""" ++ ++ ++# ++# Event types ++# ++class EventType(StrEnum): ++ RECV_UPCALL = 'dpif_recv__recv_upcall' ++ DP_UPCALL = 'openvswitch__dp_upcall' ++ OP_FLOW_PUT = 'dpif_netlink_operate__op_flow_put' ++ OP_FLOW_EXECUTE = 'dpif_netlink_operate__op_flow_execute' ++ OVS_PKT_EXEC = 'ktrace__ovs_packet_cmd_execute' ++ ++ def short_name(name, length=22): ++ if len(name) < length: ++ return name ++ ++ return '..' + name[-(length - 2):] ++ ++ def from_trace(trace_event): ++ if trace_event == 0: ++ return EventType.RECV_UPCALL ++ elif trace_event == 1: ++ return EventType.DP_UPCALL ++ elif trace_event == 2: ++ return EventType.OP_FLOW_PUT ++ elif trace_event == 3: ++ return EventType.OP_FLOW_EXECUTE ++ elif trace_event == 4: ++ return EventType.OVS_PKT_EXEC ++ ++ raise ValueError ++ ++ ++# ++# Simple event class ++# ++class Event(object): ++ def __init__(self, ts, pid, comm, cpu, event_type): ++ self.ts = ts ++ self.pid = pid ++ self.comm = comm ++ self.cpu = cpu ++ self.event_type = event_type ++ ++ def __str__(self): ++ return "[{:<22}] {:<16} {:8} [{:03}] {:18.9f}".format( ++ EventType.short_name(self.event_type), ++ self.comm, ++ self.pid, ++ self.cpu, ++ self.ts / 1000000000) ++ ++ def __repr__(self): ++ more = "" ++ if self.__class__.__name__ != "Event": ++ more = ", ..." ++ ++ return "{}({}, {}, {}, {}, {}{})".format(self.__class__.__name__, ++ self.ts, self.pid, ++ self.comm, self.cpu, ++ self.event_type, more) ++ ++ def handle_event(event): ++ event = Event(event.ts, event.pid, event.comm.decode("utf-8"), ++ event.cpu, EventType.from_trace(event.event)) ++ ++ if not options.quiet: ++ print(event) ++ ++ return event ++ ++ def get_event_header_str(): ++ return "{:<24} {:<16} {:>8} {:<3} {:<18} {}".format( ++ "EVENT", "COMM", "PID", "CPU", "TIME", ++ "EVENT DATA[dpif_name/dp_port/pkt_len/pkt_frag_len]") ++ ++ ++# ++# dp_upcall event class ++# ++class DpUpcall(Event): ++ def __init__(self, ts, pid, comm, cpu, dpif_name, port, pkt, pkt_len, ++ pkt_frag_len): ++ super(DpUpcall, self).__init__(ts, pid, comm, cpu, EventType.DP_UPCALL) ++ self.dpif_name = dpif_name ++ self.dp_port = get_dp_mapping(dpif_name, port) ++ if self.dp_port is None: ++ # ++ # As we only identify interfaces at startup, new interfaces could ++ # have been added, causing the lookup to fail. Just something to ++ # keep in mind when running this in a dynamic environment. ++ # ++ raise LookupError("Can't find datapath port mapping!") ++ self.pkt = pkt ++ self.pkt_len = pkt_len ++ self.pkt_frag_len = pkt_frag_len ++ ++ def __str__(self): ++ return "[{:<22}] {:<16} {:8} [{:03}] {:18.9f}: " \ ++ "{:<17} {:4} {:4} {:4}".format(self.event_type, ++ self.comm, ++ self.pid, ++ self.cpu, ++ self.ts / 1000000000, ++ self.dpif_name, ++ self.dp_port, ++ self.pkt_len, ++ self.pkt_frag_len) ++ ++ def handle_event(event): ++ if event.pkt_size < options.packet_size: ++ pkt_len = event.pkt_size ++ else: ++ pkt_len = options.packet_size ++ ++ pkt_data = bytes(event.pkt)[:pkt_len] ++ ++ if len(pkt_data) <= 0 or event.pkt_size == 0: ++ return ++ ++ try: ++ event = DpUpcall(event.ts, event.pid, event.comm.decode("utf-8"), ++ event.cpu, event.dpif_name.decode("utf-8"), ++ event.dev_name.decode("utf-8"), ++ pkt_data, ++ event.pkt_size, ++ event.pkt_frag_size) ++ except LookupError: ++ # ++ # If we can't do the port lookup, ignore this event. ++ # ++ return None ++ ++ if not options.quiet: ++ print(event) ++ ++ return event ++ ++ ++# ++# recv_upcall event class ++# ++class RecvUpcall(Event): ++ def __init__(self, ts, pid, comm, cpu, dpif_name, key, pkt, pkt_len): ++ super(RecvUpcall, self).__init__(ts, pid, comm, cpu, ++ EventType.RECV_UPCALL) ++ ++ if dpif_name.startswith("system@"): ++ dpif_name = dpif_name[len("system@"):] ++ self.dpif_name = dpif_name ++ ++ nla = RecvUpcall.decode_nlm(key, dump=False) ++ if "OVS_KEY_ATTR_IN_PORT" in nla: ++ self.dp_port = struct.unpack('=L', nla["OVS_KEY_ATTR_IN_PORT"])[0] ++ elif "OVS_KEY_ATTR_TUNNEL" in nla: ++ self.dp_port = DP_TUNNEL_PORT ++ else: ++ self.dp_port = RecvUpcall.get_system_dp_port(self.dpif_name) ++ ++ if self.dp_port is None: ++ raise LookupError("Can't find RecvUpcall dp port mapping!") ++ ++ self.pkt = pkt ++ self.pkt_len = pkt_len ++ ++ def __str__(self): ++ return "[{:<22}] {:<16} {:8} [{:03}] {:18.9f}: {:<17} {:4} {:4}". \ ++ format( ++ self.event_type, ++ self.comm, ++ self.pid, ++ self.cpu, ++ self.ts / 1000000000, ++ self.dpif_name, ++ self.dp_port, ++ self.pkt_len) ++ ++ def get_system_dp_port(dpif_name): ++ dp_map = get_dp_mapping(dpif_name, "ovs-system", return_map=True) ++ if dpif_name not in dp_map: ++ return None ++ try: ++ return dp_map[dpif_name]["ovs-system"] ++ except KeyError: ++ return None ++ ++ def decode_nlm(msg, indent=4, dump=True): ++ bytes_left = len(msg) ++ result = {} ++ ++ while bytes_left: ++ if bytes_left < 4: ++ if dump: ++ print("{}WARN: decode truncated; can't read header".format( ++ ' ' * indent)) ++ break ++ ++ nla_len, nla_type = struct.unpack("=HH", msg[:4]) ++ ++ if nla_len < 4: ++ if dump: ++ print("{}WARN: decode truncated; nla_len < 4".format( ++ ' ' * indent)) ++ break ++ ++ nla_data = msg[4:nla_len] ++ trunc = "" ++ ++ if nla_len > bytes_left: ++ trunc = "..." ++ nla_data = nla_data[:(bytes_left - 4)] ++ if RecvUpcall.get_ovs_key_attr_str(nla_type) == \ ++ "OVS_KEY_ATTR_TUNNEL": ++ # ++ # If we have truncated tunnel information, we still would ++ # like to know. This is due to the special tunnel handling ++ # needed for port matching. ++ # ++ result[RecvUpcall.get_ovs_key_attr_str(nla_type)] = bytes() ++ else: ++ result[RecvUpcall.get_ovs_key_attr_str(nla_type)] = nla_data ++ ++ if dump: ++ print("{}nla_len {}, nla_type {}[{}], data: {}{}".format( ++ ' ' * indent, nla_len, ++ RecvUpcall.get_ovs_key_attr_str(nla_type), ++ nla_type, ++ "".join("{:02x} ".format(b) for b in nla_data), trunc)) ++ ++ if trunc != "": ++ if dump: ++ print("{}WARN: decode truncated; nla_len > msg_len[{}] ". ++ format(' ' * indent, bytes_left)) ++ break ++ ++ # Update next offset, but make sure it's aligned correctly. ++ next_offset = (nla_len + 3) & ~(3) ++ msg = msg[next_offset:] ++ bytes_left -= next_offset ++ ++ return result ++ ++ def get_ovs_key_attr_str(attr): ++ ovs_key_attr = ["OVS_KEY_ATTR_UNSPEC", ++ "OVS_KEY_ATTR_ENCAP", ++ "OVS_KEY_ATTR_PRIORITY", ++ "OVS_KEY_ATTR_IN_PORT", ++ "OVS_KEY_ATTR_ETHERNET", ++ "OVS_KEY_ATTR_VLAN", ++ "OVS_KEY_ATTR_ETHERTYPE", ++ "OVS_KEY_ATTR_IPV4", ++ "OVS_KEY_ATTR_IPV6", ++ "OVS_KEY_ATTR_TCP", ++ "OVS_KEY_ATTR_UDP", ++ "OVS_KEY_ATTR_ICMP", ++ "OVS_KEY_ATTR_ICMPV6", ++ "OVS_KEY_ATTR_ARP", ++ "OVS_KEY_ATTR_ND", ++ "OVS_KEY_ATTR_SKB_MARK", ++ "OVS_KEY_ATTR_TUNNEL", ++ "OVS_KEY_ATTR_SCTP", ++ "OVS_KEY_ATTR_TCP_FLAGS", ++ "OVS_KEY_ATTR_DP_HASH", ++ "OVS_KEY_ATTR_RECIRC_ID", ++ "OVS_KEY_ATTR_MPLS", ++ "OVS_KEY_ATTR_CT_STATE", ++ "OVS_KEY_ATTR_CT_ZONE", ++ "OVS_KEY_ATTR_CT_MARK", ++ "OVS_KEY_ATTR_CT_LABELS", ++ "OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4", ++ "OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6", ++ "OVS_KEY_ATTR_NSH"] ++ ++ if attr < 0 or attr > len(ovs_key_attr): ++ return "" ++ ++ return ovs_key_attr[attr] ++ ++ def handle_event(event): ++ # ++ # For us, only upcalls with a packet, flow_key, and upcall_type ++ # DPIF_UC_MISS are of interest. ++ # ++ if event.pkt_size <= 0 or event.key_size <= 0 or \ ++ event.upcall_type != 0: ++ return ++ ++ if event.key_size < options.flow_key_size: ++ key_len = event.key_size ++ else: ++ key_len = options.flow_key_size ++ ++ if event.pkt_size < options.packet_size: ++ pkt_len = event.pkt_size ++ else: ++ pkt_len = options.packet_size ++ ++ try: ++ event = RecvUpcall(event.ts, event.pid, event.comm.decode("utf-8"), ++ event.cpu, event.dpif_name.decode("utf-8"), ++ bytes(event.key)[:key_len], ++ bytes(event.pkt)[:pkt_len], ++ event.pkt_size) ++ except LookupError: ++ return None ++ ++ if not options.quiet: ++ print(event) ++ ++ return event ++ ++ ++# ++# op_flow_execute event class ++# ++class OpFlowExecute(Event): ++ def __init__(self, ts, pid, comm, cpu, pkt, pkt_len): ++ super(OpFlowExecute, self).__init__(ts, pid, comm, cpu, ++ EventType.OP_FLOW_EXECUTE) ++ self.pkt = pkt ++ self.pkt_len = pkt_len ++ ++ def __str__(self): ++ return "[{:<22}] {:<16} {:8} [{:03}] {:18.9f}: " \ ++ "{:<17} {:4} {:4}".format(EventType.short_name(self.event_type), ++ self.comm, ++ self.pid, ++ self.cpu, ++ self.ts / 1000000000, ++ "", ++ "", ++ self.pkt_len) ++ ++ def handle_event(event): ++ if event.pkt_size < options.packet_size: ++ pkt_len = event.pkt_size ++ else: ++ pkt_len = options.packet_size ++ ++ pkt_data = bytes(event.pkt)[:pkt_len] ++ ++ if len(pkt_data) <= 0 or event.pkt_size == 0: ++ return ++ ++ event = OpFlowExecute(event.ts, event.pid, event.comm.decode("utf-8"), ++ event.cpu, pkt_data, event.pkt_size) ++ ++ if not options.quiet: ++ print(event) ++ ++ return event ++ ++ ++# ++# get_dp_mapping() ++# ++def get_dp_mapping(dp, port, return_map=False, dp_map=None): ++ if options.unit_test: ++ return port ++ ++ if dp_map is not None: ++ get_dp_mapping.dp_port_map_cache = dp_map ++ ++ # ++ # Build a cache, so we do not have to execue the ovs command each time. ++ # ++ if not hasattr(get_dp_mapping, "dp_port_map_cache"): ++ try: ++ output = subprocess.check_output(['ovs-appctl', 'dpctl/show'], ++ encoding='utf8').split("\n") ++ except subprocess.CalledProcessError: ++ output = "" ++ pass ++ ++ current_dp = None ++ get_dp_mapping.dp_port_map_cache = {} ++ ++ for line in output: ++ match = re.match("^system@(.*):$", line) ++ if match is not None: ++ current_dp = match.group(1) ++ ++ match = re.match("^ port ([0-9]+): ([^ /]*)", line) ++ if match is not None and current_dp is not None: ++ try: ++ get_dp_mapping.dp_port_map_cache[ ++ current_dp][match.group(2)] = int(match.group(1)) ++ except KeyError: ++ get_dp_mapping.dp_port_map_cache[current_dp] = \ ++ {match.group(2): int(match.group(1))} ++ ++ if return_map: ++ return get_dp_mapping.dp_port_map_cache ++ ++ if dp not in get_dp_mapping.dp_port_map_cache or \ ++ port not in get_dp_mapping.dp_port_map_cache[dp]: ++ return None ++ ++ return get_dp_mapping.dp_port_map_cache[dp][port] ++ ++ ++# ++# event_to_dict() ++# ++def event_to_dict(event): ++ event_dict = {} ++ ++ for field, _ in event._fields_: ++ if isinstance(getattr(event, field), (int, bytes)): ++ event_dict[field] = getattr(event, field) ++ else: ++ if (field == "key" and event.key_size == 0) or \ ++ (field == "pkt" and event.pkt_size == 0): ++ data = bytes() ++ else: ++ data = bytes(getattr(event, field)) ++ ++ event_dict[field] = data ++ ++ return event_dict ++ ++ ++# ++# receive_event_bcc() ++# ++def receive_event_bcc(ctx, data, size): ++ global events_received ++ events_received += 1 ++ ++ event = b['events'].event(data) ++ ++ if export_file is not None: ++ export_file.write("event = {}\n".format(event_to_dict(event))) ++ ++ receive_event(event) ++ ++ ++# ++# receive_event() ++# ++def receive_event(event): ++ global event_count ++ ++ if event.event == 0: ++ trace_event = RecvUpcall.handle_event(event) ++ elif event.event == 1: ++ trace_event = DpUpcall.handle_event(event) ++ elif event.event == 2: ++ trace_event = Event.handle_event(event) ++ elif event.event == 3: ++ trace_event = OpFlowExecute.handle_event(event) ++ elif event.event == 4: ++ trace_event = Event.handle_event(event) ++ ++ try: ++ event_count['total'][EventType.from_trace(event.event)] += 1 ++ except KeyError: ++ event_count['total'][EventType.from_trace(event.event)] = 1 ++ event_count['valid'][EventType.from_trace(event.event)] = 0 ++ ++ if trace_event is not None: ++ event_count['valid'][EventType.from_trace(event.event)] += 1 ++ trace_data.append(trace_event) ++ ++ ++# ++# collect_event_sets() ++# ++def collect_event_sets(events, collect_stats=False, profile=False, ++ spinner=False): ++ t1_time = 0 ++ ++ def t1_start(): ++ nonlocal t1_time ++ t1_time = process_time() ++ ++ def t1_stop(description): ++ print("* PROFILING: {:<50}: {:.06f} seconds".format( ++ description, process_time() - t1_time)) ++ ++ warn_parcial_match = False ++ warn_frag = False ++ ++ if profile: ++ t1_start() ++ # ++ # First let's create a dict of per handler thread events. ++ # ++ threads = {} ++ threads_result = {} ++ for idx, event in enumerate(events): ++ if event.event_type == EventType.DP_UPCALL: ++ continue ++ if event.pid not in threads: ++ threads[event.pid] = [] ++ threads[event.pid].append([idx, event]) ++ ++ if profile: ++ t1_stop("Creating per thread dictionary") ++ t1_start() ++ # ++ # Now spit them in per upcall sets, but remember that ++ # RecvUpcall event can be batched. ++ # ++ batch_stats = [] ++ for thread, items in threads.items(): ++ thread_set = [] ++ batch = [] ++ ovs_pkt_exec_set = [] ++ batching = True ++ collecting = 0 ++ has_flow_put = False ++ has_flow_exec = False ++ ++ def next_batch(): ++ nonlocal batching, batch, collecting, has_flow_put, has_flow_exec ++ nonlocal ovs_pkt_exec_set, thread_set ++ ++ if len(batch) > 0: ++ # ++ # If we are done with the batch, see if we need to match up ++ # any batched OVS_PKT_EXEC events. ++ # ++ for event in batch: ++ if len(ovs_pkt_exec_set) <= 0: ++ break ++ if any(isinstance(item, ++ OpFlowExecute) for item in event[2]): ++ event[2].append(ovs_pkt_exec_set.pop(0)) ++ # ++ # Append the batch to the thread-specific set. ++ # ++ thread_set = thread_set + batch ++ if collect_stats: ++ batch_stats.append(len(batch)) ++ batching = True ++ batch = [] ++ ovs_pkt_exec_set = [] ++ has_flow_put = False ++ has_flow_exec = False ++ collecting = 0 ++ ++ def next_batch_set(): ++ nonlocal has_flow_put, has_flow_exec, collecting ++ has_flow_put = False ++ has_flow_exec = False ++ collecting += 1 ++ ++ for item in items: ++ idx, event = item ++ ++ if batching: ++ if event.event_type == EventType.RECV_UPCALL: ++ batch.append(item + [[]]) ++ elif len(batch) > 0: ++ batching = False ++ collecting = 0 ++ else: ++ continue ++ ++ if not batching: ++ if event.event_type == EventType.RECV_UPCALL: ++ next_batch() ++ batch.append(item + [[]]) ++ else: ++ if event.event_type == EventType.OP_FLOW_PUT: ++ if has_flow_put: ++ next_batch_set() ++ if collecting >= len(batch): ++ next_batch() ++ continue ++ ++ batch[collecting][2].append(item[1]) ++ has_flow_put = True ++ ++ elif event.event_type == EventType.OP_FLOW_EXECUTE: ++ if has_flow_exec: ++ next_batch_set() ++ if collecting >= len(batch): ++ next_batch() ++ continue ++ ++ if (event.pkt_len == batch[collecting][1].pkt_len ++ and event.pkt == batch[collecting][1].pkt): ++ batch[collecting][2].append(item[1]) ++ has_flow_put = True ++ has_flow_exec = True ++ else: ++ # ++ # If we end up here it could be that an upcall in a ++ # batch did not generate an EXECUTE and we are out ++ # of sync. Try to match it to the next batch entry. ++ # ++ next_idx = collecting + 1 ++ while True: ++ if next_idx >= len(batch): ++ next_batch() ++ break ++ if (event.pkt_len == batch[next_idx][1].pkt_len ++ and event.pkt == batch[next_idx][1].pkt): ++ ++ batch[next_idx][2] = batch[collecting][2] ++ batch[collecting][2] = [] ++ collecting = next_idx ++ batch[collecting][2].append(item[1]) ++ has_flow_put = True ++ has_flow_exec = True ++ break ++ ++ next_idx += 1 ++ ++ elif event.event_type == EventType.OVS_PKT_EXEC: ++ # ++ # The OVS_PKT_EXEC might also be batched, so we keep ++ # them in a separate list and assign them to the ++ # correct set when completing the set. ++ # ++ ovs_pkt_exec_set.append(item[1]) ++ continue ++ ++ if collecting >= len(batch): ++ next_batch() ++ next_batch() ++ threads_result[thread] = thread_set ++ ++ if profile: ++ t1_stop("Creating upcall sets") ++ t1_start() ++ ++ # ++ # Move thread results from list to dictionary ++ # ++ thread_stats = {} ++ for thread, sets in threads_result.items(): ++ if len(sets) > 0: ++ thread_stats[sets[0][1].comm] = len(sets) ++ ++ threads_result[thread] = {} ++ for upcall in sets: ++ threads_result[thread][upcall[0]] = [upcall[1]] + upcall[2] ++ ++ if profile: ++ t1_stop("Moving upcall list to dictionary") ++ t1_start() ++ ++ if options.debug & 0x4000000 != 0: ++ print() ++ for thread, sets in threads_result.items(): ++ for idx, idx_set in sets.items(): ++ print("DBG: {}".format(idx_set)) ++ ++ # ++ # Create two lists on with DP_UPCALLs and RECV_UPCALLs ++ # ++ dp_upcall_list = [] ++ recv_upcall_list = [] ++ for idx, event in enumerate(events): ++ if event.event_type == EventType.DP_UPCALL: ++ dp_upcall_list.append([idx, event]) ++ elif event.event_type == EventType.RECV_UPCALL: ++ recv_upcall_list.append([idx, event]) ++ ++ if profile: ++ t1_stop("Creating DP_UPCALL and RECV_UPCALL lists") ++ t1_start() ++ ++ if options.debug & 0x4000000 != 0: ++ print() ++ for dp_upcall in dp_upcall_list: ++ print("DBG: {}".format(dp_upcall)) ++ print() ++ for recv_upcall in recv_upcall_list: ++ print("DBG: {}".format(recv_upcall)) ++ ++ # ++ # Now find the matching DP_UPCALL and RECV_UPCALL events ++ # ++ event_sets = [] ++ if spinner: ++ print() ++ with alive_bar(len(dp_upcall_list), ++ title="- Matching DP_UPCALLs to RECV_UPCALLs", ++ spinner=None, disable=not spinner) as bar: ++ ++ for (idx, event) in dp_upcall_list: ++ remove_indexes = [] ++ this_set = None ++ # ++ # TODO: This part needs some optimization, as it's slow in the ++ # PVP test scenario. This is because a lot of DP_UPCALLS ++ # will not have a matching RECV_UPCALL leading to walking ++ # the entire recv_upcall_list list. ++ # ++ # Probably some dictionary, but in the PVP scenario packets ++ # come from a limited set of ports, and the length is all the ++ # same. So we do need the key to be recv.dport + ++ # len(recv.pkt) + recv.pkt, however, the recv.pkt compare ++ # needs to happen on min(len(event.pkt), len(recv.pkt)). ++ # ++ for idx_in_list, (idx_recv, recv) in enumerate(recv_upcall_list): ++ match = False ++ ++ if idx_recv < idx: ++ remove_indexes.append(idx_in_list) ++ continue ++ # ++ # If the RecvUpcall is a tunnel port, we can not map it to ++ # the correct tunnel. For now, we assume the first matching ++ # packet is the correct one. For more details see the OVS ++ # ukey_to_flow_netdev() function. ++ # ++ if (event.dp_port == recv.dp_port or ++ recv.dp_port == DP_TUNNEL_PORT) \ ++ and event.pkt_len == recv.pkt_len: ++ ++ compare_len = min(len(event.pkt), len(recv.pkt)) ++ ++ if len(event.pkt) != len(recv.pkt) \ ++ and event.pkt_frag_len == 0: ++ warn_parcial_match = True ++ elif event.pkt_frag_len != 0: ++ warn_frag = True ++ compare_len = min(compare_len, event.pkt_frag_len) ++ ++ if event.pkt[:compare_len] == recv.pkt[:compare_len]: ++ match = True ++ else: ++ # ++ # There are still some corner cases due to the fact ++ # the kernel dp_upcall tracepoint is hit before the ++ # packet is prepared/modified for upcall pass on. ++ # Example cases are packet checksum update, VLAN ++ # insertion, etc., etc. For now, we try to handle the ++ # checksum part, but we might need to add more ++ # exceptions in the future. ++ # ++ diff_bytes = sum(i != j for i, j in zip( ++ event.pkt[:compare_len], recv.pkt[:compare_len])) ++ ++ if diff_bytes <= 2 and compare_len > 56: ++ # This could be a TCP or UDP checksum ++ event_pkt = Ether(bytes(event.pkt)[:compare_len]) ++ recv_pkt = Ether(bytes(recv.pkt)[:compare_len]) ++ if (event_pkt.haslayer(TCP) and ++ recv_pkt.haslayer(TCP)) or ( ++ event_pkt.haslayer(UDP) and ++ recv_pkt.haslayer(UDP)): ++ ++ if event_pkt.haslayer(TCP): ++ event_chksum = event_pkt[TCP].chksum ++ recv_chksum = recv_pkt[TCP].chksum ++ else: ++ event_chksum = event_pkt[UDP].chksum ++ recv_chksum = recv_pkt[UDP].chksum ++ ++ if event_chksum & 0xff != recv_chksum & 0xff: ++ diff_bytes -= 1 ++ if event_chksum & 0xff00 != \ ++ recv_chksum & 0xff00: ++ diff_bytes -= 1 ++ ++ if diff_bytes == 0: ++ match = True ++ ++ if match: ++ this_set = {event.event_type: event} ++ for sevent in threads_result[recv.pid][idx_recv]: ++ this_set[sevent.event_type] = sevent ++ event_sets.append(this_set) ++ remove_indexes.append(idx_in_list) ++ ++ if options.debug & 0x4000000 != 0: ++ print("DBG: Matched DpUpcall({:6}) => " ++ "RecvUpcall({:6})".format(idx, idx_recv)) ++ ++ break ++ ++ elif options.debug & 0x8000000 != 0: ++ print("DBG: COMPARE DpUpcall({:6}) != " ++ "RecvUpcall({:6})".format(idx, idx_recv)) ++ event_pkt = Ether(bytes(event.pkt)[:compare_len]) ++ recv_pkt = Ether(bytes(recv.pkt)[:compare_len]) ++ print(re.sub('^', 'DBG:' + ' ' * 4, ++ event_pkt.show(dump=True), ++ flags=re.MULTILINE)) ++ print(re.sub('^', 'DBG:' + ' ' * 4, ++ recv_pkt.show(dump=True), ++ flags=re.MULTILINE)) ++ ++ elif options.debug & 0x8000000 != 0: ++ print("DBG: COMPATE DpUpcall({:6}) != " ++ "RecvUpcall({:6}) -> port {}, {} -> " ++ "len = {}, {}".format(idx, idx_recv, ++ event.dp_port, ++ recv.dp_port, ++ event.pkt_len, ++ recv.pkt_len)) ++ ++ bar() ++ for remove_idx in sorted(remove_indexes, reverse=True): ++ del recv_upcall_list[remove_idx] ++ ++ if profile: ++ t1_stop("Matching DP_UPCALLs to a set") ++ ++ if warn_parcial_match: ++ print("WARNING: Packets not fully captured for matching!\n " ++ "Increase the packet buffer with the '--packet-size' option.") ++ if warn_frag: ++ print("WARNING: SKB from kernel had fragments, we could only copy/" ++ "compare the first part!") ++ ++ if collect_stats: ++ return event_sets, batch_stats, thread_stats ++ ++ return event_sets ++ ++ ++# ++# unit_test() ++# ++def unit_test(): ++ pkt1 = b'\x01\x02\x03\x04\x05' ++ pkt2 = b'\x01\x02\x03\x04\x06' ++ pkt3 = b'\x01\x02\x03\x04\x07' ++ key = b'\x08\x00\x03\x00\x01\x00\x00\x00' # Port 1 ++ # ++ # Basic test with all events in line ++ # ++ t1_events = [DpUpcall(1, 100, "ping", 1, "system", 1, pkt1, len(pkt1), 0), ++ RecvUpcall(2, 1, "hndl", 1, "systen", key, pkt1, len(pkt1)), ++ Event(3, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(4, 1, "hndl", 1, pkt1, len(pkt1)), ++ Event(5, 1, "hndl", 1, EventType.OVS_PKT_EXEC)] ++ t1_result = [{EventType.DP_UPCALL: t1_events[0], ++ EventType.RECV_UPCALL: t1_events[1], ++ EventType.OP_FLOW_PUT: t1_events[2], ++ EventType.OP_FLOW_EXECUTE: t1_events[3], ++ EventType.OVS_PKT_EXEC: t1_events[4]}] ++ # ++ # Basic test with missing flow put ++ # ++ t2_events = [DpUpcall(1, 100, "ping", 1, "system", 1, pkt1, len(pkt1), 0), ++ RecvUpcall(2, 1, "hndl", 1, "systen", key, pkt1, len(pkt1)), ++ OpFlowExecute(4, 1, "hndl", 1, pkt1, len(pkt1)), ++ Event(5, 1, "hndl", 1, EventType.OVS_PKT_EXEC)] ++ t2_result = [{EventType.DP_UPCALL: t2_events[0], ++ EventType.RECV_UPCALL: t2_events[1], ++ EventType.OP_FLOW_EXECUTE: t2_events[2], ++ EventType.OVS_PKT_EXEC: t2_events[3]}] ++ # ++ # Test with RecvUpcall's being batched ++ # ++ t3_events = [DpUpcall(1, 101, "ping", 1, "system", 1, pkt1, len(pkt1), 0), ++ DpUpcall(2, 102, "ping", 2, "system", 1, pkt2, len(pkt2), 0), ++ DpUpcall(3, 101, "ping", 3, "system", 1, pkt3, len(pkt3), 0), ++ RecvUpcall(4, 1, "hndl", 1, "systen", key, pkt1, len(pkt1)), ++ RecvUpcall(5, 1, "hndl", 1, "systen", key, pkt3, len(pkt3)), ++ RecvUpcall(6, 1, "hndl", 1, "systen", key, pkt2, len(pkt2)), ++ Event(7, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(8, 1, "hndl", 1, pkt1, len(pkt1)), ++ Event(9, 1, "hndl", 1, EventType.OVS_PKT_EXEC), ++ Event(10, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(11, 1, "hndl", 1, pkt3, len(pkt3)), ++ Event(12, 1, "hndl", 1, EventType.OVS_PKT_EXEC), ++ Event(13, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(14, 1, "hndl", 1, pkt2, len(pkt2)), ++ Event(15, 1, "hndl", 1, EventType.OVS_PKT_EXEC)] ++ t3_result = [{EventType.DP_UPCALL: t3_events[0], ++ EventType.RECV_UPCALL: t3_events[3], ++ EventType.OP_FLOW_PUT: t3_events[6], ++ EventType.OP_FLOW_EXECUTE: t3_events[7], ++ EventType.OVS_PKT_EXEC: t3_events[8]}, ++ {EventType.DP_UPCALL: t3_events[1], ++ EventType.RECV_UPCALL: t3_events[5], ++ EventType.OP_FLOW_PUT: t3_events[12], ++ EventType.OP_FLOW_EXECUTE: t3_events[13], ++ EventType.OVS_PKT_EXEC: t3_events[14]}, ++ {EventType.DP_UPCALL: t3_events[2], ++ EventType.RECV_UPCALL: t3_events[4], ++ EventType.OP_FLOW_PUT: t3_events[9], ++ EventType.OP_FLOW_EXECUTE: t3_events[10], ++ EventType.OVS_PKT_EXEC: t3_events[11]}] ++ # ++ # Test with RecvUpcall's single + batch ++ # ++ t4_events = [DpUpcall(1, 100, "ping", 1, "system", 1, pkt1, len(pkt1), 0), ++ RecvUpcall(2, 1, "hndl", 1, "systen", key, pkt1, len(pkt1)), ++ Event(3, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(4, 1, "hndl", 1, pkt1, len(pkt1)), ++ Event(5, 1, "hndl", 1, EventType.OVS_PKT_EXEC), ++ DpUpcall(6, 101, "ping", 1, "system", 1, pkt1, len(pkt1), 0), ++ DpUpcall(7, 102, "ping", 2, "system", 1, pkt2, len(pkt2), 0), ++ DpUpcall(8, 101, "ping", 3, "system", 1, pkt3, len(pkt3), 0), ++ RecvUpcall(9, 1, "hndl", 1, "systen", key, pkt1, len(pkt1)), ++ RecvUpcall(10, 1, "hndl", 1, "systen", key, pkt3, len(pkt3)), ++ RecvUpcall(11, 1, "hndl", 1, "systen", key, pkt2, len(pkt2)), ++ Event(12, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(13, 1, "hndl", 1, pkt1, len(pkt1)), ++ Event(14, 1, "hndl", 1, EventType.OVS_PKT_EXEC), ++ Event(15, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(16, 1, "hndl", 1, pkt3, len(pkt3)), ++ Event(17, 1, "hndl", 1, EventType.OVS_PKT_EXEC), ++ Event(18, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(14, 1, "hndl", 1, pkt2, len(pkt2)), ++ Event(19, 1, "hndl", 1, EventType.OVS_PKT_EXEC)] ++ t4_result = [{EventType.DP_UPCALL: t4_events[0], ++ EventType.RECV_UPCALL: t4_events[1], ++ EventType.OP_FLOW_PUT: t4_events[2], ++ EventType.OP_FLOW_EXECUTE: t4_events[3], ++ EventType.OVS_PKT_EXEC: t4_events[4]}, ++ {EventType.DP_UPCALL: t4_events[5], ++ EventType.RECV_UPCALL: t4_events[8], ++ EventType.OP_FLOW_PUT: t4_events[11], ++ EventType.OP_FLOW_EXECUTE: t4_events[12], ++ EventType.OVS_PKT_EXEC: t4_events[13]}, ++ {EventType.DP_UPCALL: t4_events[6], ++ EventType.RECV_UPCALL: t4_events[10], ++ EventType.OP_FLOW_PUT: t4_events[17], ++ EventType.OP_FLOW_EXECUTE: t4_events[18], ++ EventType.OVS_PKT_EXEC: t4_events[19]}, ++ {EventType.DP_UPCALL: t4_events[7], ++ EventType.RECV_UPCALL: t4_events[9], ++ EventType.OP_FLOW_PUT: t4_events[14], ++ EventType.OP_FLOW_EXECUTE: t4_events[15], ++ EventType.OVS_PKT_EXEC: t4_events[16]}] ++ # ++ # Test with two threads interleaved ++ # ++ t5_events = [DpUpcall(1, 100, "ping", 1, "system", 1, pkt1, len(pkt1), 0), ++ DpUpcall(2, 100, "ping", 1, "system", 1, pkt2, len(pkt2), 0), ++ RecvUpcall(3, 1, "hndl", 1, "systen", key, pkt1, len(pkt1)), ++ RecvUpcall(4, 2, "hndl", 2, "systen", key, pkt2, len(pkt2)), ++ Event(5, 2, "hndl", 2, EventType.OP_FLOW_PUT), ++ Event(6, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(7, 2, "hndl", 1, pkt2, len(pkt2)), ++ OpFlowExecute(8, 1, "hndl", 1, pkt1, len(pkt1)), ++ Event(9, 1, "hndl", 1, EventType.OVS_PKT_EXEC), ++ Event(10, 2, "hndl", 1, EventType.OVS_PKT_EXEC)] ++ t5_result = [{EventType.DP_UPCALL: t5_events[0], ++ EventType.RECV_UPCALL: t5_events[2], ++ EventType.OP_FLOW_PUT: t5_events[5], ++ EventType.OP_FLOW_EXECUTE: t5_events[7], ++ EventType.OVS_PKT_EXEC: t5_events[8]}, ++ {EventType.DP_UPCALL: t5_events[1], ++ EventType.RECV_UPCALL: t5_events[3], ++ EventType.OP_FLOW_PUT: t5_events[4], ++ EventType.OP_FLOW_EXECUTE: t5_events[6], ++ EventType.OVS_PKT_EXEC: t5_events[9]}] ++ # ++ # Test batch with missing events ++ # ++ t6_events = [DpUpcall(1, 101, "ping", 1, "system", 1, pkt1, len(pkt1), 0), ++ DpUpcall(2, 102, "ping", 2, "system", 1, pkt2, len(pkt2), 0), ++ RecvUpcall(3, 1, "hndl", 1, "systen", key, pkt1, len(pkt1)), ++ RecvUpcall(4, 1, "hndl", 1, "systen", key, pkt2, len(pkt2)), ++ Event(5, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(6, 1, "hndl", 1, pkt2, len(pkt2)), ++ Event(7, 1, "hndl", 1, EventType.OVS_PKT_EXEC)] ++ t6_result = [{EventType.DP_UPCALL: t6_events[0], ++ EventType.RECV_UPCALL: t6_events[2]}, ++ {EventType.DP_UPCALL: t6_events[1], ++ EventType.RECV_UPCALL: t6_events[3], ++ EventType.OP_FLOW_PUT: t6_events[4], ++ EventType.OP_FLOW_EXECUTE: t6_events[5], ++ EventType.OVS_PKT_EXEC: t6_events[6]}] ++ # ++ # Test with RecvUpcall's and OVS_PKT_EXEC being batched ++ # ++ t7_events = [DpUpcall(1, 101, "ping", 1, "system", 1, pkt1, len(pkt1), 0), ++ DpUpcall(2, 102, "ping", 2, "system", 1, pkt2, len(pkt2), 0), ++ DpUpcall(3, 101, "ping", 3, "system", 1, pkt3, len(pkt3), 0), ++ RecvUpcall(4, 1, "hndl", 1, "systen", key, pkt1, len(pkt1)), ++ RecvUpcall(5, 1, "hndl", 1, "systen", key, pkt2, len(pkt2)), ++ RecvUpcall(6, 1, "hndl", 1, "systen", key, pkt3, len(pkt3)), ++ Event(7, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(8, 1, "hndl", 1, pkt1, len(pkt1)), ++ Event(9, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(10, 1, "hndl", 1, pkt2, len(pkt2)), ++ Event(11, 1, "hndl", 1, EventType.OP_FLOW_PUT), ++ OpFlowExecute(12, 1, "hndl", 1, pkt3, len(pkt3)), ++ Event(13, 1, "hndl", 1, EventType.OVS_PKT_EXEC), ++ Event(14, 1, "hndl", 1, EventType.OVS_PKT_EXEC), ++ Event(15, 1, "hndl", 1, EventType.OVS_PKT_EXEC)] ++ t7_result = [{EventType.DP_UPCALL: t7_events[0], ++ EventType.RECV_UPCALL: t7_events[3], ++ EventType.OP_FLOW_PUT: t7_events[6], ++ EventType.OP_FLOW_EXECUTE: t7_events[7], ++ EventType.OVS_PKT_EXEC: t7_events[12]}, ++ {EventType.DP_UPCALL: t7_events[1], ++ EventType.RECV_UPCALL: t7_events[4], ++ EventType.OP_FLOW_PUT: t7_events[8], ++ EventType.OP_FLOW_EXECUTE: t7_events[9], ++ EventType.OVS_PKT_EXEC: t7_events[13]}, ++ {EventType.DP_UPCALL: t7_events[2], ++ EventType.RECV_UPCALL: t7_events[5], ++ EventType.OP_FLOW_PUT: t7_events[10], ++ EventType.OP_FLOW_EXECUTE: t7_events[11], ++ EventType.OVS_PKT_EXEC: t7_events[14]}] ++ # ++ # Actual test sets ++ # ++ test_set = [["Simple single event", t1_events, t1_result], ++ ["Single event, missing flow_put", t2_events, t2_result], ++ ["Batched events", t3_events, t3_result], ++ ["Single + batched events", t4_events, t4_result], ++ ["Two sets, different threads", t5_events, t5_result], ++ ["Batch with missing exec", t6_events, t6_result], ++ ["Batched events including exec", t7_events, t7_result]] ++ ++ print("Running some simple unit tests:") ++ ++ for test in test_set: ++ print("- {:<32} ".format(test[0]), end="") ++ result = collect_event_sets(test[1][:]) ++ if result == test[2]: ++ print("PASS") ++ else: ++ print("FAIL") ++ print(" OUTPUT :") ++ for event_set in result: ++ hdr = " - " ++ for event_type, event in event_set.items(): ++ print("{} {:<16}: {}".format(hdr, event_type.name, event)) ++ hdr = " " ++ print(" EXPECTED:") ++ for event_set in test[2]: ++ hdr = " - " ++ for event_type, event in event_set.items(): ++ print("{} {:<16}: {}".format(hdr, event_type.name, event)) ++ hdr = " " ++ ++ ++# ++# show_key_value() ++# ++def show_key_value(data_set, description=None): ++ if description is not None: ++ print("\n=> {}:".format(description)) ++ ++ for k, v in data_set.items(): ++ print(" {:36}: {:>10}".format(str(k), str(v))) ++ ++ ++# ++# show_batch_histogram() ++# ++def show_batch_histogram(data_set, description=None): ++ nr_of_buckets = 64 ++ ++ if description is not None: ++ print("\n=> {}:".format(description)) ++ ++ if len(data_set) == 0: ++ print("# NumSamples = 0") ++ return ++ ++ min_val = nr_of_buckets ++ max_val = 0 ++ entries = 0 ++ high_buckets = 0 ++ buckets = [0] * nr_of_buckets ++ ++ for entry in data_set: ++ min_val = min(min_val, entry) ++ max_val = max(max_val, entry) ++ if entry == 0: ++ continue ++ elif entry > nr_of_buckets: ++ high_buckets += 1 ++ else: ++ buckets[entry - 1] += 1 ++ ++ entries += 1 ++ ++ if max(buckets + [high_buckets]) > 4: ++ scale = int(max(buckets + [high_buckets]) / 4) ++ else: ++ scale = 1 ++ ++ print("# NumSamples = {}; Min = {}; Max = {}".format(entries, min_val, ++ max_val)) ++ print("# each ∎ represents a count of {}".format(scale)) ++ ++ for idx in range(int(nr_of_buckets / 2)): ++ idx_2nd = idx + int(nr_of_buckets / 2) ++ print("{:5} [{:8}]: {:22} {:5} [{:8}]: {:22}".format( ++ idx + 1, buckets[idx], "∎" * int(buckets[idx] / scale), ++ idx_2nd + 1, buckets[idx_2nd], ++ "∎" * int(buckets[idx_2nd] / scale))) ++ ++ if high_buckets > 0: ++ print("{:>5} [{:8}]: {:22}".format(">" + str(nr_of_buckets), ++ high_buckets, ++ "∎" * int(high_buckets / scale))) ++ ++ ++# ++# show_histogram() ++# ++def show_histogram(data_set, description=None, options=None, ++ minimum=None, maximum=None, buckets=None, custbuckets=None): ++ if description is not None: ++ print("\n=> {}:".format(description)) ++ ++ if options is not None: ++ if buckets is None: ++ buckets = options.histogram_buckets ++ if options is not None and options.sets: ++ print(data_set) ++ ++ if len(data_set) == 0: ++ print("# NumSamples = 0") ++ elif len(data_set) == 1: ++ print("# NumSamples = 1; Min = {0:.4f}; Max = {0:.4f}". ++ format(data_set[0])) ++ elif len(set(data_set)) == 1 and maximum is None and minimum is None and \ ++ custbuckets is None: ++ histogram(data_set, buckets=buckets, minimum=list(set(data_set))[0], ++ maximum=list(set(data_set))[0] + 1) ++ else: ++ histogram(data_set, buckets=buckets, ++ minimum=minimum, maximum=maximum, custbuckets=custbuckets) ++ ++ ++# ++# buffer_size_type() ++# ++def buffer_size_type(astr, min=64, max=2048): ++ value = int(astr) ++ if min <= value <= max: ++ return value ++ else: ++ raise argparse.ArgumentTypeError( ++ 'value not in range {}-{}'.format(min, max)) ++ ++ ++# ++# next_power_of_two() ++# ++def next_power_of_two(val): ++ np = 1 ++ while np < val: ++ np *= 2 ++ return np ++ ++ ++# ++# main() ++# ++def main(): ++ # ++ # Don't like these globals, but ctx passing does not seem to work with the ++ # existing open_ring_buffer() API :( ++ # ++ global b ++ global options ++ global trace_data ++ global events_received ++ global event_count ++ global export_file ++ ++ # ++ # Argument parsing ++ # ++ parser = argparse.ArgumentParser() ++ ++ parser.add_argument("-b", "--histogram-buckets", ++ help="Number of buckets per histogram, default 20", ++ type=int, default=20, metavar="BUCKETS") ++ parser.add_argument("--buffer-page-count", ++ help="Number of BPF ring buffer pages, default 1024", ++ type=int, default=1024, metavar="NUMBER") ++ parser.add_argument("-D", "--debug", ++ help="Enable eBPF debugging", ++ type=lambda x: int(x, 0), const=0x3f, default=0, ++ nargs='?') ++ parser.add_argument("-f", "--flow-key-size", ++ help="Set maximum flow key size to capture, " ++ "default 64", type=buffer_size_type, default=64, ++ metavar="[64-2048]") ++ parser.add_argument("--handler-filter", ++ help="Post processing handler thread filter", ++ type=str, default=None, metavar="HANDLERS") ++ parser.add_argument("-P", "--packet-size", ++ help="Set maximum packet size to capture, " ++ "default 256", type=buffer_size_type, default=256, ++ metavar="[64-2048]") ++ parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID", ++ help="ovs-vswitch's PID", ++ type=int, default=None) ++ parser.add_argument("-q", "--quiet", action="store_true", ++ help="Do not show individual events") ++ parser.add_argument("-r", "--read-events", ++ help="Read events from FILE instead of installing " ++ "tracepoints", type=str, default=None, metavar="FILE") ++ parser.add_argument("--sets", action="store_true", ++ help="Dump content of data sets") ++ parser.add_argument("-s", "--stop", ++ help="Stop after receiving EVENTS number of trace " ++ "events", ++ type=int, default=0, metavar="EVENTS") ++ parser.add_argument("--unit-test", action="store_true", ++ help=argparse.SUPPRESS) ++ parser.add_argument("-w", "--write-events", ++ help="Write events to FILE", ++ type=str, default=None, metavar="FILE") ++ ++ options = parser.parse_args() ++ ++ if options.unit_test: ++ unit_test() ++ sys.exit(0) ++ ++ # ++ # Find the PID of the ovs-vswitchd daemon if not specified. ++ # ++ if options.pid is None and options.read_events is None: ++ for proc in psutil.process_iter(): ++ if 'ovs-vswitchd' in proc.name(): ++ if options.pid is not None: ++ print("ERROR: Multiple ovs-vswitchd daemons running, " ++ "use the -p option!") ++ sys.exit(-1) ++ ++ options.pid = proc.pid ++ ++ # ++ # Error checking on input parameters. ++ # ++ if options.pid is None and options.read_events is None: ++ print("ERROR: Failed to find ovs-vswitchd's PID!") ++ sys.exit(-1) ++ ++ if options.read_events is not None and options.write_events is not None: ++ print("ERROR: Either supply the read or write events option, " ++ "not both!") ++ sys.exit(-1) ++ ++ if options.handler_filter is not None and options.read_events is None: ++ print("ERROR: The --handler-filter option is only valid with the " ++ "--read-events option!") ++ sys.exit(-1) ++ ++ options.buffer_page_count = next_power_of_two(options.buffer_page_count) ++ ++ # ++ # Open write handle if needed. ++ # ++ if options.write_events is not None: ++ try: ++ export_file = open(options.write_events, "w") ++ except (FileNotFoundError, IOError, PermissionError) as e: ++ print("ERROR: Can't create export file \"{}\": {}".format( ++ options.write_events, e.strerror)) ++ sys.exit(-1) ++ else: ++ export_file = None ++ ++ trace_data = [] ++ event_count = {'total': {}, 'valid': {}, 'miss': {}} ++ if options.read_events is None: ++ # ++ # Call get_dp_mapping() to prepare the cache ++ # ++ dp_port_map = get_dp_mapping("ovs-system", "eth0", return_map=True) ++ if export_file is not None: ++ export_file.write("dp_port_map = {}\n".format(dp_port_map)) ++ ++ # ++ # Attach the usdt probe ++ # ++ u = USDT(pid=int(options.pid)) ++ try: ++ u.enable_probe(probe="recv_upcall", fn_name="trace__recv_upcall") ++ u.enable_probe(probe="op_flow_put", fn_name="trace__op_flow_put") ++ u.enable_probe(probe="op_flow_execute", ++ fn_name="trace__op_flow_execute") ++ except USDTException as e: ++ print("ERROR: {}" ++ "ovs-vswitchd!".format( ++ (re.sub('^', ' ' * 7, str(e), ++ flags=re.MULTILINE)).strip(). ++ replace("--with-dtrace or --enable-dtrace", ++ "--enable-usdt-probes"))) ++ sys.exit(-1) ++ ++ # ++ # Uncomment to see how arguments are decoded. ++ # print(u.get_text()) ++ # ++ print("- Compiling eBPF programs...") ++ ++ # ++ # Attach probes to the running process ++ # ++ source = ebpf_source.replace("", ++ str(options.packet_size)) ++ source = source.replace("", str(options.flow_key_size)) ++ source = source.replace("", ++ str(options.buffer_page_count)) ++ ++ b = BPF(text=source, usdt_contexts=[u], debug=options.debug & 0xffffff) ++ ++ # ++ # Dump out all events ++ # ++ print("- Capturing events [Press ^C to stop]...") ++ events_received = 0 ++ ++ if not options.quiet: ++ print("\n" + Event.get_event_header_str()) ++ ++ b['events'].open_ring_buffer(receive_event_bcc) ++ while 1: ++ try: ++ b.ring_buffer_poll() ++ if options.stop != 0 and events_received >= options.stop: ++ break ++ time.sleep(0.5) ++ except KeyboardInterrupt: ++ break ++ ++ dropcnt = b.get_table("dropcnt") ++ export_misses = {} ++ for k in dropcnt.keys(): ++ event = EventType.from_trace(k.value) ++ count = dropcnt.sum(k).value ++ if count > 0: ++ if event not in event_count['total']: ++ event_count['total'][event] = 0 ++ event_count['valid'][event] = 0 ++ event_count['miss'][event] = count ++ export_misses[k.value] = count ++ ++ if options.write_events is not None: ++ if sum(event_count['miss'].values()) > 0: ++ export_file.write("event_miss = {}\n".format(export_misses)) ++ ++ export_file.close() ++ ++ print() ++ else: ++ # ++ # Here we are requested to read event from an event export ++ # ++ thread_filter = None ++ if options.handler_filter is not None: ++ thread_filter = options.handler_filter.split(',') ++ ++ try: ++ dp_port_mapping_valid = False ++ with open(options.read_events, 'r') as fd: ++ events_received = 0 ++ ++ if options.quiet: ++ spinner = Halo(spinner="dots", color="cyan", ++ text="Reading events from \"{}\"...".format( ++ options.read_events)) ++ spinner.start() ++ else: ++ print("- Reading events from \"{}\"...".format( ++ options.read_events)) ++ ++ if not options.quiet: ++ print("\n" + Event.get_event_header_str()) ++ ++ for entry in fd: ++ if options.stop != 0 and events_received >= options.stop: ++ break ++ ++ entry.rstrip() ++ if entry.startswith('dp_port_map = {'): ++ if not dp_port_mapping_valid: ++ dp_port_mapping_valid = True ++ get_dp_mapping("", "", ++ dp_map=ast.literal_eval(entry[14:])) ++ elif (entry.startswith('event = {') and ++ dp_port_mapping_valid): ++ event = ast.literal_eval(entry[8:]) ++ event = namedtuple("EventObject", ++ event.keys())(*event.values()) ++ ++ if thread_filter is not None \ ++ and EventType.from_trace(event.event) != \ ++ EventType.DP_UPCALL \ ++ and event.comm.decode("utf-8") not in thread_filter: ++ # Skip none filtered threads ++ continue ++ ++ if len(event.pkt) > 0: ++ options.packet_size = len(event.pkt) ++ if len(event.key) > 0: ++ options.flow_key_size = len(event.key) ++ receive_event(event) ++ events_received += 1 ++ elif entry.startswith('event_miss = {'): ++ misses = ast.literal_eval(entry[13:]) ++ for e, count in misses.items(): ++ event = EventType.from_trace(e) ++ if count > 0: ++ if event not in event_count['total']: ++ event_count['total'][event] = 0 ++ event_count['valid'][event] = 0 ++ event_count['miss'][event] = count ++ ++ if options.quiet: ++ spinner.stop() ++ print("- Reading events from \"{}\"...".format( ++ options.read_events)) ++ ++ except (FileNotFoundError, PermissionError): ++ print("ERROR: Can't open file \"{}\" for reading!".format( ++ options.read_events)) ++ sys.exit(-1) ++ ++ # ++ # Start analyzing the data ++ # ++ print("- Analyzing results ({} events)...".format(len(trace_data))) ++ ++ if events_received > 0: ++ if sum(event_count['miss'].values()) > 0: ++ print("\nWARNING: Not all events were captured!\n " ++ "Increase the BPF ring buffer size with the " ++ "--buffer-page-count option.") ++ ++ print("\n=> Events received per type (usable/total) [missed events]:") ++ for event, total in sorted(event_count['total'].items()): ++ miss = event_count['miss'][event] if event in event_count['miss'] \ ++ else 0 ++ print(" {:36}: {:10}/{:10} [{:10}]".format( ++ event, event_count['valid'][event], total, miss)) ++ ++ collection, batch_stats, thread_stats = collect_event_sets( ++ trace_data, collect_stats=True, spinner=True) ++ ++ if len(collection) <= 0: ++ print("No upcall data sets where found!!") ++ sys.exit(0) ++ ++ print("\n- Analyzing {} event sets...".format(len(collection))) ++ ++ if options.debug & 0x1000000 != 0: ++ for upcall in collection: ++ print("DBG: {}{}{}{}{}".format( ++ "U" if EventType.DP_UPCALL in upcall else "-", ++ "u" if EventType.RECV_UPCALL in upcall else "-", ++ "p" if EventType.OP_FLOW_PUT in upcall else "-", ++ "e" if EventType.OP_FLOW_EXECUTE in upcall else "-", ++ "E" if EventType.OVS_PKT_EXEC in upcall else "-")) ++ if options.debug & 0x2000000 != 0: ++ try: ++ print("DBG: - {}".format(upcall[EventType.DP_UPCALL])) ++ print("DBG: - {}".format(upcall[EventType.RECV_UPCALL])) ++ print("DBG: - {}".format(upcall[EventType.OP_FLOW_PUT])) ++ print("DBG: - {}".format( ++ upcall[EventType.OP_FLOW_EXECUTE])) ++ print("DBG: - {}".format(upcall[EventType.OVS_PKT_EXEC])) ++ except LookupError: ++ continue ++ ++ show_key_value(thread_stats, description="Upcalls handled per thread") ++ show_batch_histogram(batch_stats, ++ description="Histogram of upcalls per batch") ++ ++ kernel_to_vswitchd = [] ++ kernel_to_kernel_exec = [] ++ vswitchd_to_kernel = [] ++ time_minus_lookup = [] ++ for upcall in collection: ++ kernel_to_vswitchd.append((upcall[EventType.RECV_UPCALL].ts - ++ upcall[EventType.DP_UPCALL].ts) / ++ 1000) ++ ++ if EventType.OP_FLOW_PUT in upcall and \ ++ EventType.OVS_PKT_EXEC in upcall: ++ time_minus_lookup.append( ++ ((upcall[EventType.OVS_PKT_EXEC].ts - ++ upcall[EventType.DP_UPCALL].ts) - ++ (upcall[EventType.OP_FLOW_PUT].ts - ++ upcall[EventType.RECV_UPCALL].ts)) / 1000) ++ ++ if EventType.OP_FLOW_EXECUTE in upcall and \ ++ EventType.OVS_PKT_EXEC in upcall: ++ vswitchd_to_kernel.append((upcall[EventType.OVS_PKT_EXEC].ts ++ - upcall[EventType.OP_FLOW_EXECUTE].ts) ++ / 1000) ++ ++ if EventType.OVS_PKT_EXEC in upcall: ++ kernel_to_kernel_exec.append((upcall[EventType.OVS_PKT_EXEC].ts - ++ upcall[EventType.DP_UPCALL].ts) / ++ 1000) ++ ++ show_histogram(kernel_to_vswitchd, ++ description="Kernel upcall action to vswitchd receive " ++ "(microseconds)", ++ options=options) ++ show_histogram(vswitchd_to_kernel, ++ description="vswitchd execute to kernel receive " ++ "(microseconds)", ++ options=options) ++ show_histogram(time_minus_lookup, ++ description="Upcall overhead (total time minus lookup) " ++ "(microseconds)", ++ options=options) ++ show_histogram(kernel_to_kernel_exec, ++ description="Kernel upcall to kernel packet execute " ++ "(microseconds)", ++ options=options) ++ ++ ++# ++# Start main() as the default entry point... ++# ++if __name__ == '__main__': ++ main() +diff --git a/utilities/usdt-scripts/upcall_monitor.py b/utilities/usdt-scripts/upcall_monitor.py +new file mode 100755 +index 0000000000..d856547f03 +--- /dev/null ++++ b/utilities/usdt-scripts/upcall_monitor.py +@@ -0,0 +1,533 @@ ++#!/usr/bin/env python3 ++# ++# Copyright (c) 2021 Red Hat, Inc. ++# ++# Licensed under the Apache License, Version 2.0 (the "License"); ++# you may not use this file except in compliance with the License. ++# You may obtain a copy of the License at: ++# ++# http://www.apache.org/licenses/LICENSE-2.0 ++# ++# Unless required by applicable law or agreed to in writing, software ++# distributed under the License is distributed on an "AS IS" BASIS, ++# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ++# See the License for the specific language governing permissions and ++# limitations under the License. ++# ++# Script information: ++# ------------------- ++# upcall_monitor.py uses the dpif_recv:recv_upcall USDT to receive all upcall ++# packets sent by the kernel to ovs-vswitchd. By default, it will show all ++# upcall events, which looks something like this: ++# ++# TIME CPU COMM PID DPIF_NAME TYPE PKT_LEN... ++# 5952147.003848809 2 handler4 1381158 system@ovs-system 0 98 132 ++# 5952147.003879643 2 handler4 1381158 system@ovs-system 0 70 160 ++# 5952147.003914924 2 handler4 1381158 system@ovs-system 0 98 152 ++# ++# In addition, the packet and flow key data can be dumped. This can be done ++# using the --packet-decode and --flow-key decode options (see below). ++# ++# Note that by default only 64 bytes of the packet and flow key are retrieved. ++# If you would like to capture all or more of the packet and/or flow key data, ++# the ----packet-size and --flow-key-size options can be used. ++# ++# If required, the received packets can also be stored in a pcap file using the ++# --pcap option. ++# ++# The following are the available options: ++# ++# usage: upcall_monitor.py [-h] [-D [DEBUG]] [-d {none,hex,decode}] ++# [-f [64-2048]] [-k {none,hex,nlraw}] ++# [-p VSWITCHD_PID] [-s [64-2048]] [-w PCAP_FILE] ++# ++# optional arguments: ++# -h, --help show this help message and exit ++# -D [DEBUG], --debug [DEBUG] ++# Enable eBPF debugging ++# -d {none,hex,decode}, --packet-decode {none,hex,decode} ++# Display packet content in selected mode, ++# default none ++# -f [64-2048], --flow-key-size [64-2048] ++# Set maximum flow key size to capture, default 64 ++# -k {none,hex,nlraw}, --flow-key-decode {none,hex,nlraw} ++# Display flow-key content in selected mode, default ++# none ++# -p VSWITCHD_PID, --pid VSWITCHD_PID ++# ovs-vswitch's PID ++# -s [64-2048], --packet-size [64-2048] ++# Set maximum packet size to capture, default 64 ++# -w PCAP_FILE, --pcap PCAP_FILE ++# Write upcall packets to specified pcap file. ++# ++# The following is an example of how to use the script on the running ++# ovs-vswitchd process with a packet and flow key dump enabled: ++# ++# $ ./upcall_monitor.py --packet-decode decode --flow-key-decode nlraw \ ++# --packet-size 128 --flow-key-size 256 ++# TIME CPU COMM PID DPIF_NAME ... ++# 5953013.333214231 2 handler4 1381158 system@ovs-system ... ++# Flow key size 132 bytes, size captured 132 bytes. ++# nla_len 8, nla_type OVS_KEY_ATTR_RECIRC_ID[20], data: 00 00 00 00 ++# nla_len 8, nla_type OVS_KEY_ATTR_DP_HASH[19], data: 00 00 00 00 ++# nla_len 8, nla_type OVS_KEY_ATTR_PRIORITY[2], data: 00 00 00 00 ++# nla_len 8, nla_type OVS_KEY_ATTR_IN_PORT[3], data: 02 00 00 00 ++# nla_len 8, nla_type OVS_KEY_ATTR_SKB_MARK[15], data: 00 00 00 00 ++# nla_len 8, nla_type OVS_KEY_ATTR_CT_STATE[22], data: 00 00 00 00 ++# nla_len 6, nla_type OVS_KEY_ATTR_CT_ZONE[23], data: 00 00 ++# nla_len 8, nla_type OVS_KEY_ATTR_CT_MARK[24], data: 00 00 00 00 ++# nla_len 20, nla_type OVS_KEY_ATTR_CT_LABELS[25], data: 00 00 00 00 ... ++# nla_len 16, nla_type OVS_KEY_ATTR_ETHERNET[4], data: 04 f4 bc 28 57 ... ++# nla_len 6, nla_type OVS_KEY_ATTR_ETHERTYPE[6], data: 08 00 ++# nla_len 16, nla_type OVS_KEY_ATTR_IPV4[7], data: 01 01 01 64 01 01 ... ++# nla_len 6, nla_type OVS_KEY_ATTR_ICMP[11], data: 00 00 ++# 1: Receive dp_port 2, packet size 98 bytes, size captured 98 bytes. ++# ###[ Ethernet ]### ++# dst = 3c:fd:fe:9e:7f:68 ++# src = 04:f4:bc:28:57:01 ++# type = IPv4 ++# ###[ IP ]### ++# version = 4 ++# ihl = 5 ++# tos = 0x0 ++# len = 84 ++# id = 41404 ++# flags = DF ++# frag = 0 ++# ttl = 64 ++# proto = icmp ++# chksum = 0x940c ++# src = 1.1.1.100 ++# dst = 1.1.1.123 ++# \options \ ++# ###[ ICMP ]### ++# type = echo-reply ++# code = 0 ++# chksum = 0x2f55 ++# id = 0x90e6 ++# seq = 0x1 ++# ###[ Raw ]### ++# load = 'GBTa\x00\x00\x00\x00\xd8L\r\x00\x00\x00\x00\... ++# ++ ++from bcc import BPF, USDT, USDTException ++from os.path import exists ++from scapy.all import hexdump, wrpcap ++from scapy.layers.l2 import Ether ++ ++import argparse ++import psutil ++import re ++import struct ++import sys ++import time ++ ++# ++# Actual eBPF source code ++# ++ebpf_source = """ ++#include ++ ++#define MAX_PACKET ++#define MAX_KEY ++ ++struct event_t { ++ u32 cpu; ++ u32 pid; ++ u32 upcall_type; ++ u64 ts; ++ u32 pkt_size; ++ u64 key_size; ++ char comm[TASK_COMM_LEN]; ++ char dpif_name[32]; ++ unsigned char pkt[MAX_PACKET]; ++ unsigned char key[MAX_KEY]; ++}; ++BPF_RINGBUF_OUTPUT(events, ); ++BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1); ++ ++int do_trace(struct pt_regs *ctx) { ++ uint64_t addr; ++ uint64_t size; ++ ++ struct event_t *event = events.ringbuf_reserve(sizeof(struct event_t)); ++ if (!event) { ++ uint32_t type = 0; ++ uint64_t *value = dropcnt.lookup(&type); ++ if (value) ++ __sync_fetch_and_add(value, 1); ++ ++ return 1; ++ } ++ ++ event->ts = bpf_ktime_get_ns(); ++ event->cpu = bpf_get_smp_processor_id(); ++ event->pid = bpf_get_current_pid_tgid(); ++ bpf_get_current_comm(&event->comm, sizeof(event->comm)); ++ ++ bpf_usdt_readarg(1, ctx, &addr); ++ bpf_probe_read_str(&event->dpif_name, sizeof(event->dpif_name), ++ (void *)addr); ++ ++ bpf_usdt_readarg(2, ctx, &event->upcall_type); ++ bpf_usdt_readarg(4, ctx, &event->pkt_size); ++ bpf_usdt_readarg(6, ctx, &event->key_size); ++ ++ if (event->pkt_size > MAX_PACKET) ++ size = MAX_PACKET; ++ else ++ size = event->pkt_size; ++ bpf_usdt_readarg(3, ctx, &addr); ++ bpf_probe_read(&event->pkt, size, (void *)addr); ++ ++ if (event->key_size > MAX_KEY) ++ size = MAX_KEY; ++ else ++ size = event->key_size; ++ bpf_usdt_readarg(5, ctx, &addr); ++ bpf_probe_read(&event->key, size, (void *)addr); ++ ++ events.ringbuf_submit(event, 0); ++ return 0; ++}; ++""" ++ ++ ++# ++# print_event() ++# ++def print_event(ctx, data, size): ++ event = b['events'].event(data) ++ print("{:<18.9f} {:<4} {:<16} {:<10} {:<32} {:<4} {:<10} {:<10}". ++ format(event.ts / 1000000000, ++ event.cpu, ++ event.comm.decode("utf-8"), ++ event.pid, ++ event.dpif_name.decode("utf-8"), ++ event.upcall_type, ++ event.pkt_size, ++ event.key_size)) ++ ++ # ++ # Dump flow key information ++ # ++ if event.key_size < options.flow_key_size: ++ key_len = event.key_size ++ else: ++ key_len = options.flow_key_size ++ ++ if options.flow_key_decode != 'none': ++ print(" Flow key size {} bytes, size captured {} bytes.". ++ format(event.key_size, key_len)) ++ ++ if options.flow_key_decode == 'hex': ++ # ++ # Abuse scapy's hex dump to dump flow key ++ # ++ print(re.sub('^', ' ' * 4, hexdump(Ether(bytes(event.key)[:key_len]), ++ dump=True), ++ flags=re.MULTILINE)) ++ ++ if options.flow_key_decode == 'nlraw': ++ nla = decode_nlm(bytes(event.key)[:key_len]) ++ else: ++ nla = decode_nlm(bytes(event.key)[:key_len], dump=False) ++ ++ if "OVS_KEY_ATTR_IN_PORT" in nla: ++ port = struct.unpack("=I", nla["OVS_KEY_ATTR_IN_PORT"])[0] ++ else: ++ port = "Unknown" ++ ++ # ++ # Decode packet only if there is data ++ # ++ if event.pkt_size <= 0: ++ return ++ ++ pkt_id = get_pkt_id() ++ ++ if event.pkt_size < options.packet_size: ++ pkt_len = event.pkt_size ++ pkt_data = bytes(event.pkt)[:event.pkt_size] ++ else: ++ pkt_len = options.packet_size ++ pkt_data = bytes(event.pkt) ++ ++ if options.packet_decode != 'none' or options.pcap is not None: ++ print(" {}: Receive dp_port {}, packet size {} bytes, size " ++ "captured {} bytes.".format(pkt_id, port, event.pkt_size, ++ pkt_len)) ++ ++ if options.packet_decode == 'hex': ++ print(re.sub('^', ' ' * 4, hexdump(pkt_data, dump=True), ++ flags=re.MULTILINE)) ++ ++ packet = Ether(pkt_data) ++ packet.wirelen = event.pkt_size ++ ++ if options.packet_decode == 'decode': ++ print(re.sub('^', ' ' * 4, packet.show(dump=True), flags=re.MULTILINE)) ++ ++ if options.pcap is not None: ++ wrpcap(options.pcap, packet, append=True, snaplen=options.packet_size) ++ ++ ++# ++# decode_nlm() ++# ++def decode_nlm(msg, indent=4, dump=True): ++ bytes_left = len(msg) ++ result = {} ++ ++ while bytes_left: ++ if bytes_left < 4: ++ if dump: ++ print("{}WARN: decode truncated; can't read header".format( ++ ' ' * indent)) ++ break ++ ++ nla_len, nla_type = struct.unpack("=HH", msg[:4]) ++ ++ if nla_len < 4: ++ if dump: ++ print("{}WARN: decode truncated; nla_len < 4".format( ++ ' ' * indent)) ++ break ++ ++ nla_data = msg[4:nla_len] ++ trunc = "" ++ ++ if nla_len > bytes_left: ++ trunc = "..." ++ nla_data = nla_data[:(bytes_left - 4)] ++ else: ++ result[get_ovs_key_attr_str(nla_type)] = nla_data ++ ++ if dump: ++ print("{}nla_len {}, nla_type {}[{}], data: {}{}".format( ++ ' ' * indent, nla_len, get_ovs_key_attr_str(nla_type), ++ nla_type, ++ "".join("{:02x} ".format(b) for b in nla_data), trunc)) ++ ++ if trunc != "": ++ if dump: ++ print("{}WARN: decode truncated; nla_len > msg_len[{}] ". ++ format(' ' * indent, bytes_left)) ++ break ++ ++ # update next offset, but make sure it's aligned correctly ++ next_offset = (nla_len + 3) & ~(3) ++ msg = msg[next_offset:] ++ bytes_left -= next_offset ++ ++ return result ++ ++ ++# ++# get_ovs_key_attr_str() ++# ++def get_ovs_key_attr_str(attr): ++ ovs_key_attr = ["OVS_KEY_ATTR_UNSPEC", ++ "OVS_KEY_ATTR_ENCAP", ++ "OVS_KEY_ATTR_PRIORITY", ++ "OVS_KEY_ATTR_IN_PORT", ++ "OVS_KEY_ATTR_ETHERNET", ++ "OVS_KEY_ATTR_VLAN", ++ "OVS_KEY_ATTR_ETHERTYPE", ++ "OVS_KEY_ATTR_IPV4", ++ "OVS_KEY_ATTR_IPV6", ++ "OVS_KEY_ATTR_TCP", ++ "OVS_KEY_ATTR_UDP", ++ "OVS_KEY_ATTR_ICMP", ++ "OVS_KEY_ATTR_ICMPV6", ++ "OVS_KEY_ATTR_ARP", ++ "OVS_KEY_ATTR_ND", ++ "OVS_KEY_ATTR_SKB_MARK", ++ "OVS_KEY_ATTR_TUNNEL", ++ "OVS_KEY_ATTR_SCTP", ++ "OVS_KEY_ATTR_TCP_FLAGS", ++ "OVS_KEY_ATTR_DP_HASH", ++ "OVS_KEY_ATTR_RECIRC_ID", ++ "OVS_KEY_ATTR_MPLS", ++ "OVS_KEY_ATTR_CT_STATE", ++ "OVS_KEY_ATTR_CT_ZONE", ++ "OVS_KEY_ATTR_CT_MARK", ++ "OVS_KEY_ATTR_CT_LABELS", ++ "OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4", ++ "OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6", ++ "OVS_KEY_ATTR_NSH"] ++ ++ if attr < 0 or attr > len(ovs_key_attr): ++ return "" ++ ++ return ovs_key_attr[attr] ++ ++ ++# ++# get_pkt_id() ++# ++def get_pkt_id(): ++ if not hasattr(get_pkt_id, "counter"): ++ get_pkt_id.counter = 0 ++ get_pkt_id.counter += 1 ++ return get_pkt_id.counter ++ ++ ++# ++# buffer_size_type() ++# ++def buffer_size_type(astr, min=64, max=2048): ++ value = int(astr) ++ if min <= value <= max: ++ return value ++ else: ++ raise argparse.ArgumentTypeError( ++ 'value not in range {}-{}'.format(min, max)) ++ ++ ++# ++# next_power_of_two() ++# ++def next_power_of_two(val): ++ np = 1 ++ while np < val: ++ np *= 2 ++ return np ++ ++ ++# ++# main() ++# ++def main(): ++ # ++ # Don't like these globals, but ctx passing does not seem to work with the ++ # existing open_ring_buffer() API :( ++ # ++ global b ++ global options ++ ++ # ++ # Argument parsing ++ # ++ parser = argparse.ArgumentParser() ++ ++ parser.add_argument("--buffer-page-count", ++ help="Number of BPF ring buffer pages, default 1024", ++ type=int, default=1024, metavar="NUMBER") ++ parser.add_argument("-D", "--debug", ++ help="Enable eBPF debugging", ++ type=int, const=0x3f, default=0, nargs='?') ++ parser.add_argument('-d', '--packet-decode', ++ help='Display packet content in selected mode, ' ++ 'default none', ++ choices=['none', 'hex', 'decode'], default='none') ++ parser.add_argument("-f", "--flow-key-size", ++ help="Set maximum flow key size to capture, " ++ "default 64", type=buffer_size_type, default=64, ++ metavar="[64-2048]") ++ parser.add_argument('-k', '--flow-key-decode', ++ help='Display flow-key content in selected mode, ' ++ 'default none', ++ choices=['none', 'hex', 'nlraw'], default='none') ++ parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID", ++ help="ovs-vswitch's PID", ++ type=int, default=None) ++ parser.add_argument("-s", "--packet-size", ++ help="Set maximum packet size to capture, " ++ "default 64", type=buffer_size_type, default=64, ++ metavar="[64-2048]") ++ parser.add_argument("-w", "--pcap", metavar="PCAP_FILE", ++ help="Write upcall packets to specified pcap file.", ++ type=str, default=None) ++ ++ options = parser.parse_args() ++ ++ # ++ # Find the PID of the ovs-vswitchd daemon if not specified. ++ # ++ if options.pid is None: ++ for proc in psutil.process_iter(): ++ if 'ovs-vswitchd' in proc.name(): ++ if options.pid is not None: ++ print("ERROR: Multiple ovs-vswitchd daemons running, " ++ "use the -p option!") ++ sys.exit(-1) ++ ++ options.pid = proc.pid ++ ++ # ++ # Error checking on input parameters ++ # ++ if options.pid is None: ++ print("ERROR: Failed to find ovs-vswitchd's PID!") ++ sys.exit(-1) ++ ++ if options.pcap is not None: ++ if exists(options.pcap): ++ print("ERROR: Destination capture file \"{}\" already exists!". ++ format(options.pcap)) ++ sys.exit(-1) ++ ++ options.buffer_page_count = next_power_of_two(options.buffer_page_count) ++ ++ # ++ # Attach the usdt probe ++ # ++ u = USDT(pid=int(options.pid)) ++ try: ++ u.enable_probe(probe="recv_upcall", fn_name="do_trace") ++ except USDTException as e: ++ print("ERROR: {}" ++ "ovs-vswitchd!".format( ++ (re.sub('^', ' ' * 7, str(e), flags=re.MULTILINE)).strip(). ++ replace("--with-dtrace or --enable-dtrace", ++ "--enable-usdt-probes"))) ++ sys.exit(-1) ++ ++ # ++ # Uncomment to see how arguments are decoded. ++ # print(u.get_text()) ++ # ++ ++ # ++ # Attach probe to running process ++ # ++ source = ebpf_source.replace("", str(options.packet_size)) ++ source = source.replace("", str(options.flow_key_size)) ++ source = source.replace("", ++ str(options.buffer_page_count)) ++ ++ b = BPF(text=source, usdt_contexts=[u], debug=options.debug) ++ ++ # ++ # Print header ++ # ++ print("{:<18} {:<4} {:<16} {:<10} {:<32} {:<4} {:<10} {:<10}".format( ++ "TIME", "CPU", "COMM", "PID", "DPIF_NAME", "TYPE", "PKT_LEN", ++ "FLOW_KEY_LEN")) ++ ++ # ++ # Dump out all events ++ # ++ b['events'].open_ring_buffer(print_event) ++ while 1: ++ try: ++ b.ring_buffer_poll() ++ time.sleep(0.5) ++ except KeyboardInterrupt: ++ break ++ ++ dropcnt = b.get_table("dropcnt") ++ for k in dropcnt.keys(): ++ count = dropcnt.sum(k).value ++ if k.value == 0 and count > 0: ++ print("\nWARNING: Not all upcalls were captured, {} were dropped!" ++ "\n Increase the BPF ring buffer size with the " ++ "--buffer-page-count option.".format(count)) ++ ++ ++# ++# Start main() as the default entry point... ++# ++if __name__ == '__main__': ++ main() diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c index e591c26a6c..ce348b9d16 100644 --- a/vswitchd/bridge.c @@ -93426,6 +96544,34 @@ index e591c26a6c..ce348b9d16 100644 } } +diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c +index 1e72b628b1..f325c752a9 100644 +--- a/vswitchd/ovs-vswitchd.c ++++ b/vswitchd/ovs-vswitchd.c +@@ -48,6 +48,7 @@ + #include "timeval.h" + #include "unixctl.h" + #include "util.h" ++#include "openvswitch/usdt-probes.h" + #include "openvswitch/vconn.h" + #include "openvswitch/vlog.h" + #include "lib/vswitch-idl.h" +@@ -115,6 +116,7 @@ main(int argc, char *argv[]) + exiting = false; + cleanup = false; + while (!exiting) { ++ OVS_USDT_PROBE(main, run_start); + memory_run(); + if (memory_should_report()) { + struct simap usage; +@@ -135,6 +137,7 @@ main(int argc, char *argv[]) + if (exiting) { + poll_immediate_wake(); + } ++ OVS_USDT_PROBE(main, poll_block); + poll_block(); + if (should_service_stop()) { + exiting = true; diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 3ddaaefda8..fd13281a7f 100644 --- a/vswitchd/vswitch.xml diff --git a/SPECS/openvswitch2.13.spec b/SPECS/openvswitch2.13.spec index e4adeb1..4063cba 100644 --- a/SPECS/openvswitch2.13.spec +++ b/SPECS/openvswitch2.13.spec @@ -59,7 +59,7 @@ Summary: Open vSwitch Group: System Environment/Daemons daemon/database/utilities URL: http://www.openvswitch.org/ Version: 2.13.0 -Release: 146%{?commit0:.%{date}git%{shortcommit0}}%{?commit1:dpdk%{shortcommit1}}%{?dist} +Release: 152%{?commit0:.%{date}git%{shortcommit0}}%{?commit1:dpdk%{shortcommit1}}%{?dist} # Nearly all of openvswitch is ASL 2.0. The bugtool is LGPLv2+, and the # lib/sflow*.[ch] files are SISSL @@ -162,6 +162,7 @@ BuildRequires: python3-devel python3-setuptools BuildRequires: desktop-file-utils BuildRequires: groff-base graphviz BuildRequires: unbound-devel +BuildRequires: systemtap-sdt-devel # make check dependencies BuildRequires: procps-ng %if 0%{?rhel} > 7 || 0%{?fedora} @@ -373,7 +374,8 @@ ln -s ../configure --disable-static \ --enable-shared \ --enable-ssl \ - --with-pkidir=%{_sharedstatedir}/openvswitch/pki + --with-pkidir=%{_sharedstatedir}/openvswitch/pki \ + --enable-usdt-probes make %{?_smp_mflags} popd pushd build-static @@ -389,6 +391,7 @@ ln -s ../configure --with-dpdk=$(pwd)/../%{dpdkdir}/%{dpdktarget} \ %endif --with-pkidir=%{_sharedstatedir}/openvswitch/pki \ + --enable-usdt-probes \ ac_cv_search_mlx5dv_create_wq=-lmlx5 # mlx5dv_create_wq was added in rdma-core 16 since it's the first # officially released release that adds support for mlx offloading, @@ -712,6 +715,74 @@ exit 0 %endif %changelog +* Tue Jan 25 2022 Eelco Chaudron - 2.13.0-152 +- downstream: Enable usdt probes in build [RH git: 30f6bfcfca] + Fixes: BZ1840877 + + +* Tue Jan 25 2022 Eelco Chaudron - 2.13.0-151 +- utilities: Add netlink flow operation USDT probes and upcall_cost script. [RH git: bb307de169] + This patch adds a series of NetLink flow operation USDT probes. + These probes are in turn used in the upcall_cost Python script, + which in addition of some kernel tracepoints, give an insight into + the time spent on processing upcall. + + Signed-off-by: Eelco Chaudron + Acked-by: Paolo Valerio + Signed-off-by: Ilya Maximets + + +* Tue Jan 25 2022 Eelco Chaudron - 2.13.0-150 +- utilities: Add upcall USDT probe and associated script. [RH git: 2c68e0b4c2] + Added the dpif_recv:recv_upcall USDT probe, which is used by the + included upcall_monitor.py script. This script receives all upcall + packets sent by the kernel to ovs-vswitchd. By default, it will + show all upcall events, which looks something like this: + + TIME CPU COMM PID DPIF_NAME TYPE PKT_LEN FLOW_KEY_LEN + 5952147.003848809 2 handler4 1381158 system@ovs-system 0 98 132 + 5952147.003879643 2 handler4 1381158 system@ovs-system 0 70 160 + 5952147.003914924 2 handler4 1381158 system@ovs-system 0 98 152 + + It can also dump the packet and NetLink content, and if required, + the packets can also be written to a pcap file. + + Signed-off-by: Eelco Chaudron + Acked-by: Paolo Valerio + Signed-off-by: Ilya Maximets + + +* Tue Jan 25 2022 Eelco Chaudron - 2.13.0-149 +- Documentation: Add USDT documentation and bpftrace example. [RH git: 714f513833] + Add the USDT documentation and a bpftrace example using the + bridge run USDT probes. + + Signed-off-by: Eelco Chaudron + Acked-by: Paolo Valerio + Signed-off-by: Ilya Maximets + + +* Tue Jan 25 2022 Eelco Chaudron - 2.13.0-148 +- openvswitch: Define the OVS_STATIC_TRACE() macro. [RH git: c47c990c6b] + This patch defines the OVS_STATIC_TRACE() macro, and as an + example, adds two of them in the bridge run loop. + + Signed-off-by: Eelco Chaudron + Acked-by: Paolo Valerio + Signed-off-by: Ilya Maximets + + +* Tue Jan 25 2022 Eelco Chaudron - 2.13.0-147 +- configure: Add --enable-usdt-probes option to enable USDT probes. [RH git: 6bad1cd13d] + Allow inclusion of User Statically Defined Trace (USDT) probes + in the OVS binaries using the --enable-usdt-probes option to the + ./configure script. + + Signed-off-by: Eelco Chaudron + Acked-by: Paolo Valerio + Signed-off-by: Ilya Maximets + + * Fri Jan 21 2022 Open vSwitch CI - 2.13.0-146 - Merging upstream branch-2.13 [RH git: 0b2dac74a2] Commit list: