Blame SOURCES/0145-doc-Add-my-article-about-tc-filters-and-actions.patch

049c96
From bcccddebc0476f32a4cc725f7f9f20a85c7db47b Mon Sep 17 00:00:00 2001
From: Phil Sutter <psutter@redhat.com>
Date: Wed, 30 Mar 2016 16:51:38 +0200
Subject: [PATCH] doc: Add my article about tc, filters and actions

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1275426
Upstream Status: iproute2.git commit 5f4d27d533917

commit 5f4d27d533917ccce4249c1d367aabf606167c47
Author: Phil Sutter <phil@nwl.cc>
Date:   Fri Mar 4 13:11:47 2016 +0100

    doc: Add my article about tc, filters and actions

    Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 doc/Makefile       |   2 +-
 doc/tc-filters.tex | 529 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 530 insertions(+), 1 deletion(-)
 create mode 100644 doc/tc-filters.tex

diff --git a/doc/Makefile b/doc/Makefile
index b92957e..5e9da17 100644
--- a/doc/Makefile
+++ b/doc/Makefile
@@ -1,4 +1,4 @@
-PSFILES=ip-cref.ps ip-tunnels.ps api-ip6-flowlabels.ps ss.ps nstat.ps arpd.ps rtstat.ps
+PSFILES=ip-cref.ps ip-tunnels.ps api-ip6-flowlabels.ps ss.ps nstat.ps arpd.ps rtstat.ps tc-filters.ps
 # tc-cref.ps
 # api-rtnl.tex api-pmtudisc.tex api-news.tex
 # iki-netdev.ps iki-neighdst.ps
diff --git a/doc/tc-filters.tex b/doc/tc-filters.tex
new file mode 100644
index 0000000..59127d6
--- /dev/null
+++ b/doc/tc-filters.tex
@@ -0,0 +1,529 @@
+\documentclass[12pt,twoside]{article}
+
+\usepackage[hidelinks]{hyperref}	% \url
+\usepackage{booktabs}			% nicer tabulars
+\usepackage{fancyvrb}
+\usepackage{fullpage}
+\usepackage{float}
+
+\newcommand{\iface}{\textit}
+\newcommand{\cmd}{\texttt}
+\newcommand{\man}{\textit}
+\newcommand{\qdisc}{\texttt}
+\newcommand{\filter}{\texttt}
+
+\begin{document}
+\title{QoS in Linux with TC and Filters}
+\author{Phil Sutter (phil@nwl.cc)}
+\date{January 2016}
+\maketitle
+
+TC, the Traffic Control utility, has been around for a very long time - forever
+in my humble perception. It is still (and always has been, if I'm not mistaken)
+the only tool to configure QoS in Linux.
+
+Standard practice when transmitting packets over a medium which may block (due
+to congestion, for example) is to use a queue which temporarily holds these
+packets. In Linux, this queueing approach is where QoS happens: A Queueing
+Discipline (qdisc) holds multiple packet queues with different priorities for
+dequeueing to the network driver. The classification (i.e. deciding which queue
+a packet should go into) is typically done based on the Type Of Service (IPv4)
+or Traffic Class (IPv6) header fields but, depending on qdisc implementation,
+might be controlled by the user as well.
+
+Qdiscs come in two flavors, classful and classless. While classless qdiscs are
+not as flexible as classful ones, they also require much less customizing. Often
+it is enough to just attach them to an interface, without exact knowledge of
+what is done internally. Classful qdiscs are the exact opposite: flexible in
+application, they are often not even usable without insightful configuration.
+
+As the name implies, classful qdiscs provide configurable classes to sort
+traffic into. In its basic form, this is not much different from, say, the
+classless \qdisc{pfifo\_fast}, which holds three queues and classifies each
+packet based on its priority field. Typically, though, classes go beyond that by
+supporting nesting and additional characteristics such as a maximum traffic
+rate or quantum.
+
+When it comes to controlling the classification process, filters come into play.
+They attach to the parent of a set of classes (i.e. either the qdisc itself or
+a parent class) and specify what a packet (or its associated flow) has to look
+like in order to suit a given class. To go beyond this simple scheme, it is
+possible to attach multiple filters to the same parent, which then consults each
+of them in turn until the first one accepts the packet.
+
+Before getting into detail about what filters there are and how to use them, a
+simple setup of a qdisc with classes is necessary:
+\begin{figure}[H]
+\begin{Verbatim}
+  .-------------------------------------------------------.
+  |                                                       |
+  |  HTB                                                  |
+  |                                                       |
+  | .----------------------------------------------------.|
+  | |                                                    ||
+  | |  Class 1:1                                         ||
+  | |                                                    ||
+  | | .---------------..---------------..---------------.||
+  | | |               ||               ||               |||
+  | | |  Class 1:10   ||  Class 1:20   ||  Class 1:30   |||
+  | | |               ||               ||               |||
+  | | | .------------.|| .------------.|| .------------.|||
+  | | | |            ||| |            ||| |            ||||
+  | | | |  fq_codel  ||| |  fq_codel  ||| |  fq_codel  ||||
+  | | | |            ||| |            ||| |            ||||
+  | | | '------------'|| '------------'|| '------------'|||
+  | | '---------------''---------------''---------------'||
+  | '----------------------------------------------------'|
+  '-------------------------------------------------------'
+\end{Verbatim}
+\end{figure}
+\noindent
+The following commands establish the basic setup shown:
+\begin{Verbatim}
+(1) # tc qdisc replace dev eth0 root handle 1: htb default 30
+(2) # tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit
+(3) # alias tclass='tc class add dev eth0 parent 1:1'
+(4) # tclass classid 1:10 htb rate 1mbit ceil 20mbit prio 1
+(4) # tclass classid 1:20 htb rate 90mbit ceil 95mbit prio 2
+(4) # tclass classid 1:30 htb rate 1mbit ceil 95mbit prio 3
+(5) # tc qdisc add dev eth0 parent 1:10 fq_codel
+(5) # tc qdisc add dev eth0 parent 1:20 fq_codel
+(5) # tc qdisc add dev eth0 parent 1:30 fq_codel
+\end{Verbatim}
+A little explanation for the unfamiliar reader:
+\begin{enumerate}
+\item Replace the root qdisc of \iface{eth0} with an instance of \qdisc{HTB}.
+  Specifying the handle is necessary so it can be referenced in subsequent
+  calls to \cmd{tc}. The default class for unclassified traffic is set to
+  1:30.
+\item Create a single top-level class with handle 1:1 which limits the total
+   bandwidth allowed to 95mbit/s. It is assumed that \iface{eth0} is a 100mbit/s link;
+   staying a little below that helps to keep the main point of enqueueing in
+   the qdisc layer instead of the interface hardware queue or at another
+   bottleneck in the network.
+\item Define an alias for the common part of the remaining three calls in order
+   to improve readability. This means all remaining classes are attached to the
+   common parent class from (2).
+\item Create three child classes for different uses: Class 1:10 has highest
+   priority but is tightly limited in bandwidth - fine for interactive
+   connections.  Class 1:20 has medium priority and high guaranteed bandwidth, for
+   high priority bulk traffic. Finally, there's the default class 1:30 with
+   lowest priority, low guaranteed bandwidth and the ability to use the full
+   link in case it's unused otherwise. This should be fine for uninteresting
+   traffic not explicitly taken care of.
+\item Attach a leaf qdisc to each of the child classes created in (4). Since
+   \qdisc{HTB} by default attaches \qdisc{pfifo} as leaf qdisc, this step is optional. Still,
+   the fairness between different flows provided by the classless \qdisc{fq\_codel} is
+   worth the effort.
+\end{enumerate}
+More information about the qdiscs and fine-tuning parameters can be found in
+\man{tc-htb(8)} and \man{tc-fq\_codel(8)}.
+
+Without any additional setup done, all traffic leaving \iface{eth0} is now shaped to
+95mbit/s and directed through class 1:30. This can be verified by looking at the
+\texttt{Sent} field of the class statistics printed via \cmd{tc -s class show dev eth0}:
+Only the root class 1:1 and its child 1:30 should show any traffic.
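+
+As a quick check, the commands below generate a little traffic and then print
+the statistics. This is only a sketch: the address 192.0.2.1 is a placeholder
+taken from a documentation range and has to be replaced by a host which is
+actually reachable via \iface{eth0}.
+\begin{Verbatim}
+# ping -c 3 192.0.2.1
+# tc -s qdisc show dev eth0
+# tc -s class show dev eth0
+\end{Verbatim}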
+
+
+\section*{Finally time to start filtering!}
+
+Let's begin with a simple one, i.e. reestablishing what \qdisc{pfifo\_fast} did
+automatically based on the TOS/Priority field. Linux internally translates the
+header field into the priority field of \texttt{struct sk\_buff}, which
+\qdisc{pfifo\_fast} uses for
+classification. \man{tc-prio(8)} contains a table listing the priority (and
+ultimately, \qdisc{pfifo\_fast} queue index) each TOS value is translated into.
+Here is a shorter version:
+\begin{center}
+\begin{tabular}{lll}
+TOS Values & Linux Priority (Number) & Queue Index \\
+\midrule
+0x0  - 0x6  & Best Effort (0)      & 1 \\
+0x8  - 0xe  & Bulk (2)             & 2 \\
+0x10 - 0x16 & Interactive (6)      & 0 \\
+0x18 - 0x1e & Interactive Bulk (4) & 1 \\
+\end{tabular}
+\end{center}
+Using the \filter{basic} filter, it is possible to match packets based on that
+\texttt{sk\_buff} field, which has the added benefit of being IP version agnostic.
+Since the \qdisc{HTB} setup above defaults to class ID 1:30, the Bulk priority can
+be ignored. The \filter{basic} filter allows combining matches, so two filters
+are sufficient:
+\begin{Verbatim}
+# tc filter add dev eth0 parent 1: basic \
+        match 'meta(priority eq 6)' classid 1:10
+# tc filter add dev eth0 parent 1: basic \
+        match 'meta(priority eq 0)' \
+        or 'meta(priority eq 4)' classid 1:20
+\end{Verbatim}
+A detailed description of the \filter{basic} filter and the ematch syntax it uses can be
+found in \man{tc-basic(8)} and \man{tc-ematch(8)}.
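+
+One way to try these filters out is to send packets with an explicit TOS value,
+for instance using the \texttt{-Q} option of \cmd{ping} as provided by iputils
+(other implementations may differ). Again, the address is merely a placeholder.
+According to the table above, a TOS value of 0x10 translates to Interactive
+priority, so the counters of class 1:10 should increase:
+\begin{Verbatim}
+# ping -Q 0x10 -c 3 192.0.2.1
+# tc -s class show dev eth0
+\end{Verbatim}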
+
+Obviously, this first example cries out for optimization. A simple one would be to
+just change the default class from 1:30 to 1:20, so filters are only needed for
+Bulk and Interactive priorities:
+\begin{Verbatim}
+# tc filter add dev eth0 parent 1: basic \
+        match 'meta(priority eq 6)' classid 1:10
+# tc filter add dev eth0 parent 1: basic \
+        match 'meta(priority eq 2)' classid 1:30
+\end{Verbatim}
+Given that class IDs are arbitrary, choosing them wisely allows for a direct
+mapping. So first, recreate the qdisc and classes configuration:
+\begin{Verbatim}
+# tc qdisc replace dev eth0 root handle 1: htb default 10
+# tc class add dev eth0 parent 1: classid 1:1 htb rate 95mbit
+# alias tclass='tc class add dev eth0 parent 1:1'
+# tclass classid 1:16 htb rate 1mbit ceil 20mbit prio 1
+# tclass classid 1:10 htb rate 90mbit ceil 95mbit prio 2
+# tclass classid 1:12 htb rate 1mbit ceil 95mbit prio 3
+# tc qdisc add dev eth0 parent 1:16 fq_codel
+# tc qdisc add dev eth0 parent 1:10 fq_codel
+# tc qdisc add dev eth0 parent 1:12 fq_codel
+\end{Verbatim}
+This is basically identical to above, but with changed leaf class IDs and the
+second priority class being the default. Using the \filter{flow} filter with its \texttt{map}
+functionality, a single filter command is enough:
+\begin{Verbatim}
+# tc filter add dev eth0 parent 1: handle 0x1337 flow \
+        map key priority baseclass 1:10
+\end{Verbatim}
+The \filter{flow} filter now uses the priority value to construct a destination class ID
+by adding it to the value of \texttt{baseclass}. While this works for priority values of
+0, 2 and 6, it will result in the non-existent class ID 1:14 for Interactive Bulk
+traffic. In that case, the \qdisc{HTB} default applies so that traffic goes into class
+ID 1:10 just as intended. Please note that specifying a handle is mandatory
+for the \filter{flow} filter, although I didn't see where one would use that
+later. For more information about \filter{flow}, see \man{tc-flow(8)}.
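+
+To spell out the arithmetic (keeping in mind that class IDs are hexadecimal
+numbers), the mapping for the four priorities from the table further above
+works out as follows:
+\begin{center}
+\begin{tabular}{lll}
+Linux Priority (Number) & Computed Class ID & Effective Class ID \\
+\midrule
+Best Effort (0)      & 1:10 & 1:10 \\
+Bulk (2)             & 1:12 & 1:12 \\
+Interactive Bulk (4) & 1:14 & 1:10 (\qdisc{HTB} default) \\
+Interactive (6)      & 1:16 & 1:16 \\
+\end{tabular}
+\end{center}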
+
+While \filter{flow} and \filter{basic} filters are relatively easy to apply and understand, they
+are also quite limited to their intended purpose. A more flexible option is
+the \filter{u32} filter, which allows matching on arbitrary parts of the packet data -
+yet only on that, not any meta data associated with it by the kernel (with the
+exception of the firewall mark value). So in order to continue this little
+exercise with \filter{u32}, we have to base classification directly upon the actual TOS
+value. An intuitive attempt might look like this:
+\begin{Verbatim}
+# alias tcfilter='tc filter add dev eth0 parent 1:'
+# tcfilter u32 match ip dsfield 0x10 0x1e classid 1:16
+# tcfilter u32 match ip dsfield 0x12 0x1e classid 1:16
+# tcfilter u32 match ip dsfield 0x14 0x1e classid 1:16
+# tcfilter u32 match ip dsfield 0x16 0x1e classid 1:16
+# tcfilter u32 match ip dsfield 0x8 0x1e classid 1:12
+# tcfilter u32 match ip dsfield 0xa 0x1e classid 1:12
+# tcfilter u32 match ip dsfield 0xc 0x1e classid 1:12
+# tcfilter u32 match ip dsfield 0xe 0x1e classid 1:12
+\end{Verbatim}
+The obvious drawback here is the number of filters needed. And without the
+default class, eight more filters would be necessary. This also has performance
+implications: A packet with TOS value 0xe will be checked eight times in total
+in order to determine its destination class. While there's not much to be done
+about the number of filters, at least the performance problem can be eliminated
+by using \filter{u32}'s hash table support:
+\begin{Verbatim}
+# tc filter add dev eth0 parent 1: prio 99 handle 1: u32 divisor 16
+\end{Verbatim}
+This creates a hash table with 16 buckets. The table size is arbitrary, but not
+random: Since the lowest bit of the TOS field is not interesting, it can be
+ignored and therefore the range of values to consider is just [0;15], i.e.
+16 different values. The next step is to populate the hash table:
+\begin{Verbatim}
+# alias tcfilter='tc filter add dev eth0 parent 1: prio 99'
+# tcfilter u32 match u8 0 0 ht 1:0: classid 1:10
+# tcfilter u32 match u8 0 0 ht 1:1: classid 1:10
+# tcfilter u32 match u8 0 0 ht 1:2: classid 1:10
+# tcfilter u32 match u8 0 0 ht 1:3: classid 1:10
+# tcfilter u32 match u8 0 0 ht 1:4: classid 1:12
+# tcfilter u32 match u8 0 0 ht 1:5: classid 1:12
+# tcfilter u32 match u8 0 0 ht 1:6: classid 1:12
+# tcfilter u32 match u8 0 0 ht 1:7: classid 1:12
+# tcfilter u32 match u8 0 0 ht 1:8: classid 1:16
+# tcfilter u32 match u8 0 0 ht 1:9: classid 1:16
+# tcfilter u32 match u8 0 0 ht 1:a: classid 1:16
+# tcfilter u32 match u8 0 0 ht 1:b: classid 1:16
+# tcfilter u32 match u8 0 0 ht 1:c: classid 1:10
+# tcfilter u32 match u8 0 0 ht 1:d: classid 1:10
+# tcfilter u32 match u8 0 0 ht 1:e: classid 1:10
+# tcfilter u32 match u8 0 0 ht 1:f: classid 1:10
+\end{Verbatim}
+The parameter \texttt{ht} denotes the hash table and bucket the filter should be added
+to. Since the lowest TOS bit is ignored, the TOS value has to be divided by two in
+order to get to the bucket it maps to. A TOS value of 0x10, for example, will
+therefore map to bucket 0x8. For the sake of completeness, all possible values are
+mapped and therefore a configurable default class is not required. Note that the
+match expression used is functionally unnecessary, but syntactically mandatory.
+Therefore anything that matches any packet will suffice. Finally, a filter which
+links to the defined hash table is needed:
+\begin{Verbatim}
+# tc filter add dev eth0 parent 1: prio 1 protocol ip u32 \
+        link 1: hashkey mask 0x001e0000 match u8 0 0
+\end{Verbatim}
+Here again, the actual match statement is not necessary, but syntactically
+required. All the magic lies within the \texttt{hashkey} parameter, which defines which
+part of the packet should be used directly as hash key. Here's a drawing of the
+first four bytes of the IPv4 header, with the area selected by \texttt{hashkey mask}
+highlighted:
+\begin{figure}[H]
+\begin{Verbatim}
+ 0                1                2                3
+ .-----------------------------------------------------------------.
+ |        |       | ########  |    |                               |
+ | Version|  IHL  | #DSCP###  | ECN|  Total Length                 |
+ |        |       | ########  |    |                               |
+ '-----------------------------------------------------------------'
+\end{Verbatim}
+\end{figure}
+\noindent
+This may look confusing at first, but keep in mind that bit- as well as
+byte-ordering here is LSB while the mask value is written in MSB we humans use.
+Therefore reading the mask is done like so, starting from left:
+\begin{enumerate}
+\item Skip the first byte (which contains Version and IHL fields).
+\item Skip the lowest bit of the second byte (0x1e is even).
+\item Mark the four following bits (0x1e is 11110 in binary).
+\item Skip the remaining three bits of the second byte as well as the remaining two
+   bytes.
+\end{enumerate}
+Before doing the lookup, the kernel right-shifts the masked value by the number
+of trailing zero bits in \texttt{mask}, which implicitly also does the division by two which the
+hash table depends on. With this setup, every packet has to pass exactly two
+filters to be classified. Note that this filter is limited to IPv4 packets: Due
+to the related Traffic Class field being at a different offset in the packet, it
+would not work for IPv6. To use the same setup for IPv6 as well, a second
+entry-level filter is necessary:
+\begin{Verbatim}
+# tc filter add dev eth0 parent 1: prio 2 protocol ipv6 u32 \
+        link 1: hashkey mask 0x01e00000 match u8 0 0
+\end{Verbatim}
+For illustration purposes, here again is a drawing of the first four bytes of
+the IPv6 header, again with masked area highlighted:
+\begin{figure}[H]
+\begin{Verbatim}
+ 0                1                2                3
+ .-----------------------------------------------------------------.
+ |        | ########      |                                        |
+ | Version| #Traffic Class|   Flow Label                           |
+ |        | ########      |                                        |
+ '-----------------------------------------------------------------'
+\end{Verbatim}
+\end{figure}
+\noindent
+Reading the mask value is analogous to IPv4 with the added complexity that
+Traffic Class spans over two bytes. Yet, for comparison there's a simple trick:
+IPv6 has the interesting field shifted by four bits to the left, and the new
+mask's value is shifted by the same amount. For further information about
+\filter{u32} and what can be done with it, consult its man page
+\man{tc-u32(8)}.
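+
+As a worked example of the hash key arithmetic, assume an IPv4 packet whose
+first header word is \texttt{0x45100054}, i.e. version 4, IHL 5, a TOS value of
+0x10 and a total length of 84 bytes (these values are made up for illustration).
+The lowest bit set in the mask \texttt{0x001e0000} is bit 17, hence:
+\[
+(\texttt{0x45100054} \mathbin{\&} \texttt{0x001e0000}) \gg 17
+  = \texttt{0x00100000} \gg 17 = \texttt{0x8}
+\]
+Likewise, for an IPv6 packet with a Traffic Class of 0x10 and a flow label of
+zero, the first word is \texttt{0x61000000} and the lowest bit set in the mask
+\texttt{0x01e00000} is bit 21:
+\[
+(\texttt{0x61000000} \mathbin{\&} \texttt{0x01e00000}) \gg 21
+  = \texttt{0x01000000} \gg 21 = \texttt{0x8}
+\]
+Both packets therefore end up in bucket 8 of the hash table.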
+
+Of course, the kernel provides many more filters than just \filter{basic},
+\filter{flow} and \filter{u32} which have been presented above. As of now, the
+remaining ones are:
+\begin{description}
+\item[bpf]
+        Filtering using Berkeley Packet Filter programs. The program's return
+        code determines the packet's destination class ID.
+
+\item[cgroup]
+        Filter packets based on control groups. This is only useful for packets
+        originating from the local host, as control groups only exist in that
+        scope.
+
+\item[flower]
+        Match flows against a configurable set of packet header fields (keys).
+
+\item[fw]
+        Matches on firewall mark values previously assigned to the packet by
+        netfilter (or a filter action, see below for details). This allows
+        moving the classification algorithm into netfilter, which is very
+        convenient if appropriate rules already exist there on the same
+        system.
+
+\item[route]
+        Filter packets based on the matching routing table entry. Similar in
+        spirit to the \texttt{fw} filter above, it makes use of an already
+        existing extensive routing table setup.
+
+\item[rsvp, rsvp6]
+        Implementation of the Resource Reservation Protocol in Linux, to react
+        upon requests sent by an RSVP daemon.
+
+\item[tcindex]
+        Match packets based on tcindex value, which is usually set by the dsmark
+        qdisc. This is part of an approach to support Differentiated Services in
+        Linux, which is a topic of its own.
+\end{description}
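+
+To illustrate the \texttt{fw} filter, the following sketch lets netfilter mark
+outgoing SSH traffic and classifies it into the interactive class 1:16 of the
+setup above. It assumes that no other filters are attached to the qdisc; both
+the mark value 10 and the choice of SSH are arbitrary picks for this example:
+\begin{Verbatim}
+# iptables -t mangle -A POSTROUTING -o eth0 -p tcp --dport 22 \
+        -j MARK --set-mark 10
+# tc filter add dev eth0 parent 1: protocol ip \
+        handle 10 fw classid 1:16
+\end{Verbatim}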
+
+
+\section*{Filter Actions}
+
+The tc filter framework provides the infrastructure for another extensible set of
+tools as well, namely tc actions. As the name suggests, they allow doing things
+with packets (or associated data). Actions (or rather a list of them) are part
+of a given filter. If it matches, each action it contains is executed in order
+before returning the classification result. Since the action has direct access to
+the latter, it is in theory possible for an action to react upon or even change
+the filtering result - as long as the packet matched, of course. Yet none of the
+currently in-tree actions make use of this.
+
+The Generic Actions framework originally evolved out of the filters' ability to
+police traffic to a given maximum bandwidth. One common use case for that is to
+limit ingress traffic, dropping packets which exceed the threshold. A classic
+setup example looks like this:
+\begin{Verbatim}
+# tc qdisc add dev eth0 handle ffff: ingress
+# tc filter add dev eth0 parent ffff: u32 \
+        match u32 0 0 \
+        police rate 1mbit burst 100k
+\end{Verbatim}
+The ingress qdisc is not a real one, but merely a point of reference for filters
+to attach to which should get applied to incoming traffic. The \filter{u32} filter added
+above matches on any packet and therefore limits the total incoming bandwidth to
+1mbit/s, allowing bursts of up to 100kbytes. Using the newer action syntax, the
+filter command changes slightly:
+\begin{Verbatim}
+# tc filter add dev eth0 parent ffff: u32 \
+        match u32 0 0 \
+        action police rate 1mbit burst 100k
+\end{Verbatim}
+The important detail is that this syntax allows defining multiple actions.
+For testing purposes, for example, it is possible to redirect exceeding traffic
+to the loopback interface instead of dropping it:
+\begin{Verbatim}
+# tc filter add dev eth0 parent ffff: u32 \
+        match u32 0 0 \
+        action police rate 1mbit burst 100k conform-exceed pipe \
+        action mirred egress redirect dev lo
+\end{Verbatim}
+The added parameter \texttt{conform-exceed pipe} tells the police action to allow
+further actions to handle the exceeding packet.
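+
+One way to check whether policing takes effect is to look at the filter
+statistics, which also list the attached actions. Once done with
+experimenting, the second command removes the ingress qdisc again, along with
+all filters attached to it:
+\begin{Verbatim}
+# tc -s filter show dev eth0 parent ffff:
+# tc qdisc del dev eth0 ingress
+\end{Verbatim}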
+
+Apart from \texttt{police} and \texttt{mirred} actions, there are a few more. Here's a full
+list of the currently implemented ones:
+\begin{description}
+\item[bpf]
+        Apply a Berkeley Packet Filter program to the packet.
+
+\item[connmark]
+        Set the packet's firewall mark to that of its connection. This works by
+        searching the conntrack table for a matching entry. If found, the mark
+        is restored.
+
+\item[csum]
+        Trigger recalculation of packet checksums. The supported protocols are:
+        IPv4, ICMP, IGMP, TCP, UDP and UDPLite.
+
+\item[ipt]
+        Pass the packet to an iptables target. This allows using iptables
+        extensions directly instead of having to go the extra mile via setting
+        an arbitrary firewall mark and matching on that from within netfilter.
+
+\item[mirred]
+        Mirror or redirect packets. This is often combined with the ifb pseudo
+        device to share a common QoS setup between multiple interfaces or even
+        ingress traffic.
+
+\item[nat]
+        Perform stateless Network Address Translation. This is certainly not
+        complete and therefore inferior to NAT using iptables: Although the
+        kernel module distinguishes between TCP, UDP and ICMP traffic, it does
+        not handle typical problematic protocols such as active FTP or SIP.
+
+\item[pedit]
+        Generic packet editing. This allows altering arbitrary bytes of the
+        packet, either by specifying an offset into the packet or by naming a
+        packet header and field name to change. Currently, the latter is
+        implemented only for IPv4.
+
+\item[police]
+        Apply a bandwidth rate limiting policy. Packets exceeding it are dropped
+        by default, but may optionally be handled differently.
+
+\item[simple]
+        This is an example rather than a real action. All it does is print a
+        user-defined string together with a packet counter. Useful maybe for
+        debugging when filter statistics are not available or too complicated.
+
+\item[skbedit]
+        Edit associated packet data; supports changing queue mapping, priority
+        field and firewall mark value.
+
+\item[vlan]
+        Add/remove a VLAN header to/from the packet. This might serve as an
+        alternative to using 802.1Q pseudo-interfaces in combination with
+        routing rules when e.g. packets for a given destination need to be
+        encapsulated.
+\end{description}
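+
+As a sketch of the difference between mirroring and redirecting, the following
+commands copy incoming ICMP packets to a dummy interface for inspection while
+the originals continue on their way. It assumes that no ingress qdisc is
+attached to \iface{eth0} yet; the interface name \iface{dummy0} is an arbitrary
+choice:
+\begin{Verbatim}
+# ip link add dummy0 type dummy
+# ip link set dummy0 up
+# tc qdisc add dev eth0 handle ffff: ingress
+# tc filter add dev eth0 parent ffff: protocol ip u32 \
+        match ip protocol 1 0xff \
+        action mirred egress mirror dev dummy0
+# tcpdump -ni dummy0
+\end{Verbatim}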
+
+
+\section*{Intermediate Functional Block}
+
+The Intermediate Functional Block (\texttt{ifb}) pseudo network interface acts as a QoS
+concentrator for multiple different sources of traffic. Packets from or to other
+interfaces have to be redirected to it using the \texttt{mirred} action in order to be
+handled; regularly routed traffic will be dropped. This way, a single stack of
+qdiscs, classes and filters can be shared between multiple interfaces.
+
+Here's a simple example to feed incoming traffic from multiple interfaces
+through a Stochastic Fairness Queue (\qdisc{sfq}):
+\begin{Verbatim}
+(1) # modprobe ifb
+(2) # ip link set ifb0 up
+(3) # tc qdisc add dev ifb0 root sfq
+\end{Verbatim}
+The first step is to load the \texttt{ifb} kernel module (1). By default, this will
+create two ifb devices: \iface{ifb0} and \iface{ifb1}. After setting
+\iface{ifb0} up in (2), the root
+qdisc is replaced by \qdisc{sfq} in (3). Finally, one can start redirecting ingress
+traffic to \iface{ifb0}, e.g. from \iface{eth0}:
+\begin{Verbatim}
+# tc qdisc add dev eth0 handle ffff: ingress
+# tc filter add dev eth0 parent ffff: u32 \
+        match u32 0 0 \
+        action mirred egress redirect dev ifb0
+\end{Verbatim}
+The same can be done for other interfaces, just replacing \iface{eth0} in the two
+commands above. One thing to keep in mind here is the asymmetrical routing this
+creates within the host doing the QoS: Incoming packets enter the system via
+\iface{ifb0}, while corresponding replies leave directly via \iface{eth0}. This can be observed
+using \cmd{tcpdump} on \iface{ifb0}, which shows the input part of the traffic only. What's
+more confusing is that \cmd{tcpdump} on \iface{eth0} shows both incoming and outgoing traffic,
+but the redirection is still effective - a simple proof is setting
+\iface{ifb0} down,
+which will interrupt the communication. Obviously \cmd{tcpdump} catches the packets to
+dump before they enter the ingress qdisc, which is why it sees them while the
+kernel itself doesn't.
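+
+For instance, to feed ingress traffic of a second interface through the same
+\qdisc{sfq} instance and watch the redirected packets arrive (the interface
+name \iface{eth1} is just an assumption here), something along these lines
+should do:
+\begin{Verbatim}
+# tc qdisc add dev eth1 handle ffff: ingress
+# tc filter add dev eth1 parent ffff: u32 \
+        match u32 0 0 \
+        action mirred egress redirect dev ifb0
+# tc -s qdisc show dev ifb0
+# tcpdump -ni ifb0
+\end{Verbatim}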
+
+
+\section*{Conclusion}
+
+My personal impression is that although the \cmd{tc} utility is an absolute
+necessity for anyone aiming at doing QoS in Linux professionally, there are way
+too many loose ends and trip wires present in its environment. Contributing to
+this is the fact that much of the non-essential functionality is redundantly
+available in netfilter. Another problem which adds weight to the first one is a
+general lack of documentation. Of course, there are many HOWTOs and guides on
+the internet, but since it's often not clear how up to date these are, I prefer
+the usual resources such as man or info pages. Surely nothing one couldn't fix
+in hindsight, but quality certainly suffers if the original author of the code
+does not or cannot contribute to that.
+
+All that being said, once the steep learning curve has been mastered, the
+conglomerate of (classful) qdiscs, filters and actions provides a highly
+sophisticated and flexible infrastructure to perform QoS, which plays nicely
+along with routing and firewalling setups.
+
+
+\section*{Further Reading}
+
+A good starting point for novice users and experienced ones diving into unknown
+areas is the extensive HOWTO at \url{http://lartc.org}. The iproute2 package ships
+some examples (usually in /usr/share/doc/, depending on distribution) as well as
+man pages for \cmd{tc} in general, qdiscs and filters. The latter have been added
+just recently though, so if your distribution does not ship iproute2 version
+4.3.0 yet, these are not in there. Apart from that, the internet is a rich source of
+HOWTOs and scripts people wrote - though these should be taken with a grain of
+salt: The complexity of the matter often leads to copying others' solutions
+without much validation, which allows for less optimal or even obsolete
+implementations to survive much longer than desired.
+
+\end{document}
-- 
1.8.3.1