Planet Netfilter

February 18, 2015

Eric Leblond: Slides of my talks at Lecce

I was invited by SaLUG to Lecce to give some talks during their Geek Evening. I gave one talk on nftables and one on Suricata.

Lecce by night

The nftables talk was about the motivation behind the change from iptables. Here are the slides: Nftables

The talk on Suricata explained the different features of Suricata and showed how I used it to study SSH bruteforce attacks. Here are the slides: Suricata, Netfilter and the PRC.

Thanks a lot to Giuseppe Longo, Luca Greco and all the SaLUG team, you have been wonderful hosts!

February 13, 2015

Rusty Russell: lguest Learns PCI

With the 1.0 virtio standard finalized by the committee (though minor non-material corrections and clarifications are still trickling in), Michael Tsirkin did the heavy lifting of writing the Linux drivers (based partly on an early prototype of mine).

But I wanted an independent implementation to test, both because OASIS insists on multiple implementations before standard ratification and because I wanted to make sure the code which is about to go into the merge window works well.

Thus, I began the task of making lguest understand PCI.  Fortunately, the osdev wiki has an excellent introduction on how to talk PCI on an x86 machine.  It didn’t take me too long to get a successful PCI bus scan from the guest, and to start implementing the virtio parts.

The final part (over which I procrastinated for a week) was to step through the spec and document all the requirements in the lguest comments.  I also added checks that the guest driver was behaving sufficiently, but now it’s finally done.

It also resulted in a few minor patches, and some clarification patches for the spec.  No red flags, however, so I’m reasonably confident that 3.20 will have compliant 1.0 virtio support!

November 09, 2014

Eric Leblond: Efficient search of string in a list of strings in Python


I’m currently working on a script that parses Suricata EVE log files and tries to detect whether some fields in the log are present in a list of bad patterns. So the script has two parts: reading the log file and searching for a string in a list of strings. This list can be big, with a target of around 20000 strings.

Note: This post may seem trivial to real Python developers, but as I did not manage to find any documentation on this, here is this blog post.

The tests

For my test I used a 653MB log file containing 896077 lines. Reading this JSON formatted file takes 5.0s. As my list of strings was around 3000 elements, far below the targeted size, a rule of thumb said I would be OK if the script stayed below 6 seconds with the matching added. The first test was a simple Python-style inclusion test with the hostnames being put in a list:
if event['http']['hostname'] in hostname_list:
For that test the result was 9.5s, so not awful but a bit over my expectation. Just to check, I ran a test with a C-like implementation:
for elt in hostname_list:
    if elt == target:
        # we have a winner
        break
The result was a nice surprise, … for Python, with an execution time of 20.20s.

I was beginning to fear that reaching the speed I needed would require real development work, so I gave it one last try. As I only care about whether there is a match, I can transform my list of strings into a Python set, thus keeping only unique elements. So I ran the test using:

hostname_set = set(hostname_list)  # build the set once, before reading the log
if event['http']['hostname'] in hostname_set:

Result was an amazing execution time of 5.20s. Only 0.20s were used to check data against my set of strings.


A Python set requires its elements to be hashable, because the internal implementation is a hash table, much like a dictionary. So looking for an element in a set is equivalent to looking for an element in a hash table. And this is much faster than searching in a list, where no real magic is possible.

So if you only care about matches and your elements are hashable, use a Python set to test for the existence of an object.
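The difference can be reproduced without any log file. The following is a minimal sketch, not the script from the post: the hostnames are made-up placeholders and the list size is simply the 20000-string target mentioned above. It times a membership test against a list and against a set built once from that list.

```python
import timeit

# Hypothetical data: 20000 made-up hostnames (the target size from the post).
hostname_list = ["host%d.example.com" % i for i in range(20000)]
hostname_set = set(hostname_list)   # built once, outside any loop

# Worst case for the list: the name we look for is not present at all.
target = "not-in-list.example.com"

def in_list():
    return target in hostname_list   # linear scan over all elements

def in_set():
    return target in hostname_set    # single hash lookup on average

list_time = timeit.timeit(in_list, number=200)
set_time = timeit.timeit(in_set, number=200)
print("list: %.4fs  set: %.6fs" % (list_time, set_time))
```

The exact timings depend on the machine, but the list lookup grows with the number of strings while the set lookup stays roughly constant.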

November 07, 2014

Jesper Dangaard Brouer: Announce github repo prototype-kernel

Kernel prototype development made easy

For easier and quicker kernel development cycles, I've created a git repo called "prototype-kernel". To also help your productivity, I've shared it on GitHub (isn't it great that work at Red Hat defaults to being Open Source):

Getting started developing your first kernel code is really easy: follow this guide.

The basic idea is to compile modules outside the kernel tree, but use the kernel's kbuild infrastructure. Thus, this does require that you have a kernel source tree available on your system. Most distributions offer a kernel-devel package that only installs what you need for this to work.

 On Fedora do: 'yum install kernel-devel'
 On Debian do: 'aptitude install linux-headers'

Directory layout:

There are two main directories: "minimalist" and "kernel".

If you just want to test a small code change, go to the "minimalist" directory.  This contains the minimal Makefile needed to compile the kernel modules in this directory. Your development cycle is then: modify, make, insmod, rmmod.

The "kernel" directory is more interesting to me. It maintains the same directory layout as the mainline kernel. The basic idea is that code developed should be in the same locations that you plan to submit upstream. Here you need to edit the Kbuild files when adding new modules/code. Notice these Kbuild files are located in each sub-directory.

Detecting kernel and compiling

In the "kernel" directory you can simply run: make

The system will detect / assume-to-find the kernel source by looking in /lib/modules/`uname -r`/build/. If the kernel you are compiling for is not in this location, then specify it to make on the command line like:

  $ make kbuilddir=~/git/kernel/net-next/

Compiling for a remote host

If you are compiling locally, but want to upload the kernel modules for testing on a remote host, then the make system supports this. Compile and upload the modules with a single make command:

  $ make push_remote kbuilddir=~/git/kernel/net-next/ HOST=

Localhost development

Development on the same host is sometimes easier.  To help this process there is a script that sets up symlinks from the running kernel's module directory into your git repo.  This means you just need to compile, with no need to install the modules again and again (perhaps run 'depmod -a' if needed).  To set this up, be in the "kernel" directory and run:

  $ sudo ./scripts/

Development cycle made simple

After this setup, standard modprobe and modinfo work.  Thus, a one-liner work cycle for testing a given kernel module can be as simple as:

  $ make && (sudo rmmod time_bench_sample; sudo modprobe time_bench_sample) && sudo dmesg -c

Happy kernel-hacking :-)

October 09, 2014

Jesper Dangaard Brouer: Unlocked 10Gbps TX wirespeed smallest packet single core

Achievement unlocked: 10Gbit/s full TX wirespeed at the smallest packet size on a single CPU core (14.8Mpps). Recent changes to the Linux kernel's lower TX layers have unlocked the full (TX side) potential of the NIC drivers and hardware.

The single core 14.8Mpps performance number is an artificial benchmark performed with pktgen, which besides spinning the same packet (skb), now also notifies the NIC hardware after populating its TX ring buffer with a "burst" of packets. The new pktgen option is called "burst" (commit 38b2cf2982d ("net: pktgen: packet bursting via skb->xmit_more")).

Hint: for ixgbe driver you need to adjust cleanup interval as described in my earlier blogpost.

  • Via cmdline:  ethtool -C ethX rx-usecs 30

Fortunately, we have also found a way to take advantage of these new capabilities for normal/real use-cases, by adjusting GSO segmentation and modifying the qdisc (traffic control) layer, more details later.

As mentioned before, I personally only see pktgen as a tool for performance testing and tuning the NIC driver TX layer.   Thus, the pktgen test merely shows that we have finally unlocked the full potential of the NIC hardware and driver.  We started this journey seeing 4Mpps on a single core.

The SKB xmit_more API

The new bulking API for the driver layer requires some extra explanation, because we wanted to avoid API changes that would require updating every driver.

The new API for driver layer TX bulking is called "skb->xmit_more". To avoid changing the external driver API and to keep the new API light weight (read fast), the "skb" (struct sk_buff) has been extended with an "xmit_more" (bit) indication.

If the driver sees "skb->xmit_more" set, then the stack is guaranteeing that another packet will be given to it immediately (when ->ndo_start_xmit() returns).
 This means that, unless the driver has a full TX queue, it can simply add the packet to the TX queue and defer the expensive indication to the HW (updating the TX ring queue tail pointer).
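The effect of the xmit_more hint can be illustrated with a toy model. This is only a sketch in Python, not driver code: the class ToyTxRing and the function send_burst are hypothetical names, and the full-TX-queue case mentioned above is deliberately left out. It just counts how many expensive tail pointer updates happen with and without the hint.

```python
class ToyTxRing:
    """Toy model of a NIC TX ring; not real driver code."""

    def __init__(self):
        self.queued = 0        # packets placed in the ring
        self.tail_writes = 0   # expensive "notify the hardware" operations

    def start_xmit(self, packet, xmit_more):
        self.queued += 1
        # Defer the tail pointer update while the stack promises more packets.
        if not xmit_more:
            self.tail_writes += 1


def send_burst(ring, packets, use_xmit_more):
    """Hand a list of packets to the ring, optionally flagging all but the last."""
    for i, pkt in enumerate(packets):
        last = (i == len(packets) - 1)
        ring.start_xmit(pkt, xmit_more=use_xmit_more and not last)


packets = ["pkt%d" % i for i in range(32)]
plain, bulk = ToyTxRing(), ToyTxRing()
send_burst(plain, packets, use_xmit_more=False)
send_burst(bulk, packets, use_xmit_more=True)
print(plain.tail_writes, bulk.tail_writes)   # 32 1
```

Both rings end up with the same 32 packets queued; only the number of hardware notifications differs.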

Challenge: bulk without added latency

It is difficult to take advantage of bulking without introducing added latency.  Nevertheless, this has been our goal: don't speculatively/artificially introduce a delay, betting on other packets arriving shortly. The principle is to bulk packets only when really needed, based on some solid indication from the stack.

Existing stack bulking: GSO

The network stack already has packet aggregation facilities like TSO (TCP Segmentation Offload), GSO (Generic Segmentation Offload) and GRO (Generic Receive Offload), which are primarily targeted at TCP flows. Let's take advantage of these packets already being aggregated.

David Miller restructured central parts of the xmit layer, allowing it to work with skb lists, via merge commit 53fda7f7f9e8 ("Merge branch 'xmit_list'").

This restructure also allowed GRO/GSO segmented lists to take advantage of the performance boost of being bulk transmitted in the driver layer (via xmit_more indication).

Note that hardware-supported TSO packets have in principle always benefited from the tail pointer update, as the driver can (usually) put the entire TSO packet (up to 64k) into the TX ring as though it were a single packet, before updating the tail pointer.

After these optimizations, some situations exist (for certain HW) where software GSO is faster than hardware-based TSO.  To enable the software-only approach use cmdline:

  • ethtool -K ethX gro on gso on tso off

GRO/GSO is focused on TCP and limited to bulking/aggregation on a per-flow basis.  What about UDP or many simultaneous shorter-lived flows?

Qdisc layer bulk dequeue

Bulking from the qdisc layer is a good general solution that will benefit all kinds of packets.  A queue in the qdisc layer provides a very solid opportunity for bulking: packets have already been delayed, and it's easy to construct a skb list from the packets available in the queue.

In our solution for qdisc bulk dequeue (commit 5772e9a3463b) we have further extended the performance benefit of bulking.  We can amortize the cost of locking both at the qdisc layer and for the TXQ lock.
 Several packets are dequeued while holding the qdisc root_lock, amortizing the locking cost over several packets.  The dequeued SKB list is processed under the TXQ lock in dev_hard_start_xmit(), thus also amortizing the cost of the TXQ lock.

 Details: based on earlier benchmarking for locking, the cost of the lock/unlock sequence is 16ns (in the optimal non-contended case); with two locking sections this is a 32ns cost that will get amortized.

 Note: We benefit from this lock amortization saving regardless of whether the device driver takes advantage of the xmit_more API, which is good, as most drivers have not yet implemented this xmit_more API.
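The amortization is simple arithmetic. As a rough sketch using only the 16ns figure and the two locked sections quoted above (the function name and the bulk sizes are chosen just for illustration), the locking cost charged to each packet shrinks with the number of packets dequeued per lock round-trip:

```python
LOCK_UNLOCK_NS = 16.0   # from the post: uncontended lock/unlock sequence
LOCKED_SECTIONS = 2     # qdisc root_lock and the TXQ lock


def lock_cost_per_packet(bulk_size):
    """Locking cost in ns per packet when bulk_size packets share the locks."""
    return LOCK_UNLOCK_NS * LOCKED_SECTIONS / bulk_size


# Illustrative bulk sizes only; the real bulk size depends on BQL and the queue.
for n in (1, 2, 4, 8):
    print("bulk=%d -> %.1f ns/packet" % (n, lock_cost_per_packet(n)))
```

With a bulk of 8 packets the 32ns of locking drops to 4ns per packet, which is significant against a 67.2ns per-packet budget at 10Gbit/s wirespeed.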

Lower CPU usage when using the full bandwidth of a link is the most direct benefit that most people will experience from these changes.

Sending small packets "all-the-way" from userspace RAW sockets (but still through the qdisc layer), I can also demonstrate that it's now possible to send 14.8Mpps (10G wirespeed small frames); it just required using a minimum of 11 CPUs (using trafgen --qdisc-path, CPU E5-2695).  This is obviously an artificial benchmark, but it demonstrates that we are scaling with the number of CPUs.

For testing details see merge commit c2bf5ec20488f ('qdisc_bulk_dequeue').

Avoiding Head-of-Line blocking

We have explicitly chosen to only support qdisc bulking for BQL supported drivers.

As the bufferbloat fighters can testify, device drivers have a tendency to buffer too many bytes in the hardware.  The solution to this problem is BQL (Byte Queue Limit) by Tom Herbert.  Sadly, not many device drivers have actually implemented BQL.

The reason for needing BQL is that the qdisc layer needs an indication it can use to determine how many packets (actually bytes) it should bulk dequeue, before sending it to the driver layer.

This is needed because we want to avoid overshooting the driver layer's capacity, as that would result in a need to requeue the excess packets back into the qdisc layer.  Currently the qdisc layer does not support a requeue facility for a scheduler; instead it stores requeued packets in the root qdisc.  These requeued packets cause Head-of-Line blocking, as they must be transmitted before dequeueing new packets to avoid packet reordering.

Implementing BQL support in your driver also helps move any queue to the qdisc layer, which gives your driver the opportunity to benefit from these amazing performance improvements (and also allows us to fix bufferbloat by applying an appropriate qdisc like fq_codel).

Joint work

Getting to this level of performance has been the joint work of, and feedback from, many people.

Thanks to,
 David Miller (Red Hat),
 Eric Dumazet (Google),
 Hannes Frederic Sowa (Red Hat),
 Tom Herbert (Google),
 Daniel Borkmann (Red Hat),
 Florian Westphal (Red Hat),
 Alexei Starovoitov (Plumgrid),
 Alexander Duyck (Intel / Red Hat),
 John Fastabend (Intel),
 Jamal Hadi Salim (Mojatatu Networks),
 Neil Horman (Red Hat),
 Dave Taht,
 Toke Høiland-Jørgensen (Karlstads University)
 and reviewers on

September 29, 2014

Eric Leblond: Slides of my nftables talk at Kernel Recipes

I’ve been lucky enough to give a talk during the third edition of Kernel Recipes. I presented the evolution of nftables during the previous year.

You can get the slides from here: 2014_kernel_recipes_nftables.

Thanks to Hupstream for uploading the video of the talk:

Not much material, just these slides and a video of the work done during the previous year on nftables and its components:

September 24, 2014

Eric Leblond: Using DOM with nftables

DOM and SSH honeypot

DOM is a solution comparable to fail2ban, but it uses the Suricata SSH log instead of SSH server logs. The goal of DOM is to redirect the attacker based on their SSH client version. This makes it possible to send an attacker to a honeypot like pshitt directly after the first attempt. And this can be done for a whole network, as Suricata does not need to be on the targeted box.

Using DOM with nftables

I’ve pushed basic nftables support to DOM. Instead of adding elements via ipset, it uses an nftables set. It is simple to use, as you just need to add a -n flag to specify which table the set has been defined in:
./dom -f /usr/local/var/log/suricata/eve.json -n nat -s libssh -vvv -i -m OpenSSH
To activate the network address translation based on the set, you can use something like:
table ip nat {
        set libssh {
                type ipv4_addr
        }

        chain prerouting {
                 type nat hook prerouting priority -150;
                 ip saddr @libssh ip protocol tcp counter dnat
        }
}

A complete basic ruleset

Here’s the ruleset running on the box implementing pshitt and DOM:
table inet filter {
        chain input {
                 type filter hook input priority 0;
                 ct state established,related accept
                 iif lo accept
                 ct state new iif != lo tcp dport {ssh, 2200} tcp flags == syn counter log prefix "SSH attempt" group 1 accept
                 iif br0 ip daddr accept
                 ip saddr tcp dport {9300, 3142} counter accept
                 ip saddr counter accept
                 counter log prefix "Input Default DROP" group 2 drop
        }
}

table ip nat {
        set libssh {
                type ipv4_addr
        }

        chain prerouting {
                 type nat hook prerouting priority -150;
                 ip saddr @libssh ip protocol tcp counter dnat
        }

        chain postrouting {
                 type nat hook postrouting priority -150;
                 ip saddr snat
        }
}
There is an interesting rule in this ruleset:
ct state new iif != lo tcp dport {ssh, 2200} tcp flags == syn counter log prefix "SSH attempt" group 1 accept
It uses a negative construction to match on the interface, iif != lo, which means the interface is not lo. Note that it also uses an unnamed set to define the port list via tcp dport {ssh, 2200}. That way we have one single rule for normal and honeypot ssh. Finally, this rule is logging and accepting, and the logging is done via nfnetlink_log because of the group parameter. This allows ulogd to capture the log messages triggered by this rule.

September 12, 2014

Jesper Dangaard Brouer: Mini-tutorial for netperf-wrapper setup on RHEL6/CentOS6

The tool "netperf-wrapper" (by Toke Høiland-Jørgensen <toke(at)>) is very useful for repeating network measurements that involve running multiple concurrent instances of testing tools (primarily netperf, iperf and ping, but also tools like d-itg and http-getter).

The tool is best known in the bufferbloat community for its Realtime Response Under Load (RRUL) test, but the netperf-wrapper tool has other tests that I find useful.
Core software dependencies are recent versions of netperf, python, matplotlib and fping (d-itg and http_runner are optional).

Dependency issues on RHEL6

The first dependencies are solved easily by installing "python-matplotlib":
 $ sudo yum install -y python-matplotlib python-pip

The other software dependencies turned out to be a challenge on my RHEL6 box.

The "ping" program is too old to support option "-D" (prints a timestamp before each line).  The work-around is to install "fping", which I chose to do from "rpmforge":

Commands needed to install "fping":
 # rpm -Uvh
 # yum install -y fping

The "netperf" tool itself (on RHEL6) was not compiled with the configure option "--enable-demo=yes", which is needed to get timestamps and continuous result output during a test-run.

Thus, I needed to recompile "netperf" manually:

Install netperf-wrapper

Installation is quite simple, once the dependencies have been met:

  • git clone
  • cd netperf-wrapper
  • sudo python2 install

GUI mode

There is a nice GUI mode for investigating and comparing results, started by:
 $ netperf-wrapper --gui

This depends on matplotlib with Qt4 (and PyQt4), which unfortunately was not available for RHEL6. Fortunately there was a software package for this on Fedora, named "python-matplotlib-qt4".

For GUI mode netperf-wrapper needs: matplotlib with Qt4
 $ sudo yum install -y python-matplotlib-qt4 PyQt4

Thus, the workflow is to run the tests on my RHEL6 machines, and analyze the result files on my Fedora laptop.

Using the tool

The same tool "netperf-wrapper" is both used for running the test, and later analyzing the result.

Listing the tests available:
 $ netperf-wrapper --list-tests

For listing which plots are available for a given test e.g. "rrul":
 $ netperf-wrapper --list-plots rrul

Before running a test towards a target system, remember to start the "netserver" daemon process on the target host (just run command "netserver" nothing else).

Start a test-run towards e.g. IP with test rrul
 $ netperf-wrapper -H -t my_title rrul

It is recommended to use the option "-t" to give your test a title, which makes it easier to distinguish when comparing two or more test files in e.g. the GUI tool.

The results of the test-run will be stored in a compressed json formatted text file, with the naming convention: rrul-2014-MM-DDTHHMMSS.milisec.json.gz

To view the result, without the GUI, run:
 $ netperf-wrapper -i rrul_prio-2014-09-10T125650.993908.json.gz -f plot
Or e.g. selecting a specific plot like "ping_cdf"
 $ netperf-wrapper -i rrul_prio-2014-09-10T125650.993908.json.gz -f plot -p ping_cdf

netperf-wrapper can also output numeric data suitable for plotting in org-mode or .csv (spreadsheet) format, but I didn't play with those options.

Updates: A release 0.7.0 of netperf-wrapper is pending

Extra: On bufferbloat
Interested in more about bufferbloat?

Too few people are linking to the best talk explaining bufferbloat and how it's solved by Van Jacobson (slides).  The video quality is unfortunately not very good.

I've used some of Van's points in my own talk about bufferbloat: Beyond the existences of Bufferbloat, have we found the cure? (slides)

September 11, 2014

Jesper Dangaard Brouer: Network setup for accurate nanosec measurements

As I described in my previous blogpost, I'm leveraging the PPS measurements to deduce the nanosec improvements I'm making to the code.

One problem with using this on the nanosec scale is the accuracy of your measurements, which depends on the accuracy of the hardware you are using.

Modern systems have power-saving and turbo-boosting features built into the CPUs, and Hyper-Threading technology that allows one CPU core to appear as two CPUs by sharing ALUs etc.

While establishing an accurate baseline for some upstream measurements (subj: Get rid of ndo_xmit_flush / commit 0b725a2ca61), I was starting to see too much variation in my trafgen measurements.

I created a rather large one-liner, which I have converted into a script here:
This allowed me to get a picture of the accuracy of my measurements, and they were not accurate enough. (For more real stats like std-dev consider running these measurements through Rusty Russell's tool

My findings:

  1. Disable all C states and P states.
  2. Disabling Hyper-Threading and power-management in BIOS helped the accuracy
  3. 10Gbit/s ixgbe ring-buffer cleanup interval also influenced accuracy

According to Jeremy Eder's blog post, the best method for disabling these C and P states, and keeping all CPUs in C0/C1 state, seems to be:

 # tuned-adm profile latency-performance

I found the most stable ring-buffer cleanup interval for the ixgbe driver was 30 usecs, configured by:

 # ethtool -C eth5 rx-usecs 30

Besides these tunings: my blogpost on "Basic tuning for network overload testing" should still be followed.
Generally I've started using the "profile latency-performance", but unless I need to measure some specific code change, I'm still using the ixgbe's default "dynamic" ring-buffer cleanup interval.

Details about the "ethtool -C" tuning are available in the blogpost "Pktgen for network overload testing".

Jesper Dangaard Brouer: Packet Per Sec measurements for improving the Linux Kernel network stack

This blogpost explains why I'm using Packet Per Second (PPS) tests for measuring performance improvements (of the Linux Kernel network stack).  Many people (e.g. other kernel developers) do not understand why I'm using PPS measurements.

The basic problem with using large MTU-size packets (usually 1500 bytes) is that the transmission delay itself is enough to hide any improvement I'm making (e.g. a faster lookup function).

The transmission delay of 1514 bytes (+20 bytes of Ethernet overhead) at 10Gbit/s is 1227 nanosec:

  • ((bytes+wireoverhead)*8) / 10 Gbits = time-unit
  • ((1500+14+20)*8)/((10000*10^6))*10^9 = 1227.20 ns

This means, if the network stack can generate (alloc/fill/free) a 1500 byte packet faster than every 1227ns, then it can utilize the bandwidth of the 10Gbit/s link fully.  And yes, we can already do so. Thus, with 1500-byte frames any stack performance improvement will only be measurable as a lower CPU utilization.

Let's face it: the kernel has been optimized heavily for the last 20 years.  Thus, the improvements we are able to come up with are going to be on the nanosec scale.
For example, I've found a faster way to clear the SKB, which saves 7 nanosec.  Being able to measure this performance improvement was essential while developing this faster clearing.

Let's assume the stack cost (alloc/fill/syscall/free) is 1200ns (thus faster than 1227ns); then a 7ns improvement will only be 0.58%, which I can only measure as a lower CPU utilization (as the bandwidth limit has been reached), and which in practice cannot be measured accurately enough.

By lowering the packet size, the transmission delay that the stack cost (alloc/fill/syscall/free) can "hide behind" is reduced. With the smallest packet size of 64 bytes, it is reduced significantly, to:

  • ((64+20)*8)/((10000*10^6))*10^9 = 67.2ns

This basically exposes the stack's cost, as its current cost is larger than 67.2ns.  This can be used for getting measurements that allow us to actually measure the improvement of the code changes we are making, even though this "only" translates into reduced CPU usage with big frames (which translates into more processor time for your application).

In packets per sec (pps) this corresponds to approx 14.8Mpps:

  • 1sec/67.2ns =>  1/(67.2/10^9) = 14,880,952 pps
  • or directly from the packet size as:
  • 10Gbit / (bytes*8) = (10000*10^6)/((64+20)*8) = 14,880,952 pps

Measuring packets per sec (PPS) instead of bandwidth has another advantage.  Instead of just comparing how much PPS improvement is seen, translate the PPS into nanosec (between packets).
Comparing the nanosec used before and after will show us the nanosec saved by the given code change.

See how I used it in this and this commit to document the actual improvement of the changes I made.

Update: For the deduced nanosec saving of a given code change to be valid, you usually need to isolate your test to utilize a single CPU.

Let's use the 14.8Mpps as an example of how to translate PPS to nanosec:

  • 1sec / pps => (1/14880952*10^9) = 67.2ns

Extra: Just for the calculation exercise.

  • How many packets per sec does 1227.20 ns correspond to:
    • 1sec/1227.2ns =>  1/(1227.2/10^9) = 814,863 pps
  • Can also be calculated directly from the packet size as:
    • 10Gbit / (bytes*8) = (10000*10^6)/((1514+20)*8) = 814,863 pps
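The conversions above can be wrapped in a few helper functions. This is a minimal sketch: the function names are mine, and the before/after PPS numbers in the last example are hypothetical, picked only to show how a saving of roughly 7ns would appear.

```python
LINK_BPS = 10e9        # 10Gbit/s
WIRE_OVERHEAD = 20     # bytes of per-frame wire overhead, as used in the post


def tx_delay_ns(frame_bytes):
    """Transmission delay of one frame on the wire, in nanoseconds."""
    return (frame_bytes + WIRE_OVERHEAD) * 8 / LINK_BPS * 1e9


def pps_to_ns(pps):
    """Time between packets, in nanoseconds, at a given packet rate."""
    return 1e9 / pps


def ns_saved(pps_before, pps_after):
    """Nanoseconds saved per packet by a change that raised the packet rate."""
    return pps_to_ns(pps_before) - pps_to_ns(pps_after)


print(round(tx_delay_ns(1514), 1))       # 1227.2
print(round(tx_delay_ns(64), 1))         # 67.2
print(int(1e9 / tx_delay_ns(64)))        # 14880952

# Hypothetical single-CPU measurement before and after a code change.
print(round(ns_saved(1000000, 1007049), 1))   # 7.0
```

Note how a gain of only about 7000 pps at 1Mpps already corresponds to the 7ns saving discussed earlier, which is why the PPS numbers are easier to reason about once translated to nanosec.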

August 19, 2014

Rusty Russell: POLLOUT doesn’t mean write(2) won’t block: Part II

My previous discovery that poll() indicating an fd was writable didn’t mean write() wouldn’t block led to some interesting discussion on Google+.

It became clear that there is much confusion over read and write; eg. Linus thought read() was like write() whereas I thought (prior to my last post) that write() was like read(). Both wrong…

Both Linux and v6 UNIX always returned from read() once data was available (v6 didn’t have sockets, but they had pipes). POSIX even suggests this:

The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.

But write() is different. Presumably so simple UNIX filters didn’t have to check the return and loop (they’d just die with EPIPE anyway), write() tries hard to write all the data before returning. And that leads to a simple rule.  Quoting Linus:

Sure, you can try to play games by knowing socket buffer sizes and look at pending buffers with SIOCOUTQ etc, and say “ok, I can probably do a write of size X without blocking” even on a blocking file descriptor, but it’s hacky, fragile and wrong.
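The practical consequence seems to be: if a write must not block, put the descriptor in non-blocking mode rather than guessing from buffer sizes. The following is a small sketch of that behaviour, using a pipe as a stand-in for a socket (my choice, purely to keep the example self-contained). With O_NONBLOCK set, write() fails with EAGAIN once the pipe is full instead of sleeping.

```python
import os

r, w = os.pipe()
os.set_blocking(w, False)   # put the write end in non-blocking mode

chunk = b"x" * 4096
written = 0
would_block = False
try:
    while True:
        # With O_NONBLOCK, write() returns a short count or fails with
        # EAGAIN instead of sleeping until a reader drains the pipe.
        written += os.write(w, chunk)
except BlockingIOError:
    would_block = True

print("pipe accepted %d bytes before EAGAIN" % written)
os.close(r)
os.close(w)
```

Without the os.set_blocking() call, the same loop would simply hang once the pipe buffer filled up, since nobody reads from the other end.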

I’m travelling, so I built an Ubuntu-compatible kernel with a printk() into select() and poll() to see who else was making this mistake on my laptop:

cups-browsed: (1262): fd 5 poll() for write without nonblock
cups-browsed: (1262): fd 6 poll() for write without nonblock
Xorg: (1377): fd 1 select() for write without nonblock
Xorg: (1377): fd 3 select() for write without nonblock
Xorg: (1377): fd 11 select() for write without nonblock

This first one is actually OK; fd 5 is an eventfd (which should never block). But the rest seem to be sockets, and thus probably bugs.

What’s worse is the Linux select() man page:

       A file descriptor is considered ready if it is possible to
       perform the corresponding I/O operation (e.g., read(2)) without
       ... those in writefds will be watched to see if a write will
       not block...

And poll():

		Writing now will not block.

Man page patches have been submitted…

August 02, 2014

Rusty Russell: ccan/io: revisited

There are numerous C async I/O libraries; tevent being the one I’m most familiar with.  Yet, tevent has a very wide API, and programs using it inevitably descend into “callback hell”.  So I wrote ccan/io.

The idea is that each I/O callback returns a “struct io_plan” which says what I/O to do next, and what callback to call.  Examples are “io_read(buf, len, next, next_arg)” to read a fixed number of bytes, and “io_read_partial(buf, lenp, next, next_arg)” to perform a single read.  You could also write your own, such as pettycoin’s “io_read_packet()” which reads a length, then allocates and reads in the rest of the packet.

This should enable a convenient debug mode: you turn each io_read() etc. into synchronous operations and now you have a nice callchain showing what happened to a file descriptor.  In practice, however, debug was painful to use and a frequent source of bugs inside ccan/io, so I never used it for debugging.

And I became less happy when I used it in anger for pettycoin, but at some point you’ve got to stop procrastinating and start producing code, so I left it alone.

Now I’ve revisited it.   820 insertions(+), 1042 deletions(-) and the code is significantly less hairy, and the API a little simpler.  In particular, writing the normal “read-then-write” loops is still very nice, while doing full duplex I/O is possible, but more complex.  Let’s see if I’m still happy once I’ve merged it into pettycoin…

July 29, 2014

Rusty Russell: Pettycoin Alpha01 Tagged

As with all software, it took longer than I expected, but today I tagged the first version of pettycoin.  Now, lots more polish and features, but at least there’s something more than the git repo for others to look at!

July 17, 2014

Rusty Russell: API Bug of the Week: getsockname().

A “non-blocking” IPv6 connect() call was in fact, blocking.  Tracking that down made me realize the IPv6 address was mostly random garbage, which was caused by this function:

bool get_fd_addr(int fd, struct protocol_net_address *addr)
   union {
      struct sockaddr sa;
      struct sockaddr_in in;
      struct sockaddr_in6 in6;
   } u;
   socklen_t len = sizeof(len);
   if (getsockname(fd, &, &len) != 0)
      return false;

The bug: “sizeof(len)” should be “sizeof(u)”.  But when presented with a too-short length, getsockname() truncates, and otherwise “succeeds”; you have to check the resulting len value to see what you should have passed.

Obviously an error return would be better here, but the writable len arg is pretty useless: I don’t know of any callers who check the length return and do anything useful with it.  Provide getsocklen() for those who do care, and have getsockname() take a size_t as its third arg.

Oh, and the blocking?  That was because I was calling “fcntl(fd, F_SETFD, …)” instead of “F_SETFL”!
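The F_SETFD/F_SETFL mix-up is easy to demonstrate. In this sketch a pipe stands in for the socket (an assumption made only to keep the example self-contained): F_SETFD changes descriptor flags such as FD_CLOEXEC, so passing O_NONBLOCK to it leaves the descriptor blocking, while F_SETFL changes the file status flags where O_NONBLOCK actually lives.

```python
import fcntl
import os

r, w = os.pipe()

# Wrong: F_SETFD sets descriptor flags (FD_CLOEXEC), not file status flags,
# so this call does not make the descriptor non-blocking.
fcntl.fcntl(w, fcntl.F_SETFD, os.O_NONBLOCK)
still_blocking = not (fcntl.fcntl(w, fcntl.F_GETFL) & os.O_NONBLOCK)

# Right: read the status flags with F_GETFL and update them with F_SETFL.
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)
nonblocking = bool(fcntl.fcntl(w, fcntl.F_GETFL) & os.O_NONBLOCK)

print("after F_SETFD still blocking:", still_blocking)
print("after F_SETFL non-blocking:", nonblocking)
os.close(r)
os.close(w)
```

Checking the flags with F_GETFL after setting them is a cheap way to catch this kind of mistake, since the wrong call succeeds silently.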

July 02, 2014

Jesper Dangaard Brouer: The calculations: 10Gbit/s wirespeed

In this blogpost, I'll try to make you understand the engineering challenge behind processing 10Gbit/s wirespeed, at the smallest Ethernet packet size.

The peak packet rate is 14.88 Mpps (million packets per sec) uni-directional on 10Gbit/s with the smallest frame size.

Details: What is the smallest Ethernet frame
Ethernet frame overhead:

Thus, the minimum size Ethernet frame is: 84 bytes (20 + 64)

Max 1500 bytes MTU Ethernet frame size is: 1538 bytes (calc: (12+8) + (14) + 1500 + (4) = 1538 bytes)

Packet rate calculations

Peak packet rate calculated as:  (10*10^9) bits/sec / (84 bytes * 8) = 14,880,952 pps
1500 MTU packet rate calculated as: (10*10^9) bits/sec / (1538 bytes * 8) = 812,744 pps

Time budget
This is the important part to wrap your head around.

With 14.88 Mpps the time budget for processing a single packet is:

  • 67.2 ns (nanosecond) (calc as: 1/14880952*10^9 ns)

This corresponds to approx. 201 CPU cycles on a 3GHz CPU (assuming only one instruction per cycle, disregarding superscalar/pipelined CPUs). Having only 201 clock-cycles of processing time per packet is very little.
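The packet rate and time budget arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation; the function and constant names are made up for illustration:

```python
LINK_BITS_PER_SEC = 10 * 10**9  # 10Gbit/s


def packets_per_second(wire_bytes, link_bps=LINK_BITS_PER_SEC):
    """Packets per second when each packet occupies wire_bytes on the wire."""
    return link_bps / (wire_bytes * 8)


def ns_per_packet(pps):
    """Time budget per packet, in nanoseconds."""
    return 1e9 / pps


def cycles_per_packet(pps, cpu_hz=3e9):
    """CPU cycles available per packet on a CPU running at cpu_hz."""
    return cpu_hz / pps


min_frame_pps = packets_per_second(84)    # 20 bytes overhead + 64 byte frame
mtu_frame_pps = packets_per_second(1538)  # 1500 bytes MTU frame on the wire

print("smallest frame: %.0f pps" % min_frame_pps)
print("1500 MTU frame: %.0f pps" % mtu_frame_pps)
print("budget: %.1f ns per packet" % ns_per_packet(min_frame_pps))
print("budget: %d cycles at 3GHz" % cycles_per_packet(min_frame_pps))
```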

Relate these numbers to something
This 67.2ns number is hard to use for anything, if we cannot relate this to some other time measurements.

A single cache-miss takes: 32 ns (measured on a E5-2650 CPU). Thus, with just two cache-misses (2x32=64ns), almost the total 67.2 ns budget is gone. The Linux skb (sk_buff) is 4 cache-lines (on 64-bit), and the kernel e.g. insists on writing zeros to these cache-lines, during allocation of an skb.

We might not "suffer" a full cache-miss, sometimes the memory is available in L2 or L3 cache.  Thus, it is useful to know these time measurements.  Measured on my E5-2630 CPU (with lmbench command "lat_mem_rd 1024 128"), L2 access costs 4.3ns, and L3 access costs 7.9ns.

The "LOCK" operation
Assembler instructions can be prefixed with "LOCK", which means that they perform an atomic operation. This is used every time e.g. a spinlock is locked or unlocked, and by cmpxchg and atomic_inc (some operations are even implicitly LOCK prefixed, like xchg).

I've measured the cost of this atomic "LOCK" operation to be 8.25ns on my CPU (with this program). Even for the completely optimal situation of a spinlock only being touched by one CPU, we have two LOCK calls, which cost 16.5ns.

System call overhead
A FreeBSD case study of sendto(), in Luigi Rizzo's netmap paper, shows that the cost of the system call alone is 96ns, which is above the 67.2 ns budget.  The total overhead of sendto() was 950 ns.  These 950ns correspond to 1,052,631 pps (calc as 1/(950/10^9)).
On Linux I measured the system call getuid(2) to take 87.77 ns and 201 CPU-cycles (TSC measurement) (the CPU efficiency was 1.42 insns per cycle, measured with perf stat). Thus, the syscall itself eats up the entire budget.

  • Update: Most of the syscall overhead comes from kernel option CONFIG_AUDITSYSCALL, without it, the syscall overhead drops to 41.85 ns.

How to overcome this syscall problem?  We can amortize the cost by sending several packets in a single syscall.  It is not very well known, but we actually already have a syscall for this, called "sendmmsg(2)". Notice the extra "m" (and the corresponding receive version "recvmmsg(2)"). Not many examples exist on the Internet for using these syscalls. Thus, I've provided some example code here for sendmmsg and recvmmsg.

RAW socket speeds
Daniel Borkmann and I recently optimized AF_PACKET, to scale to several CPUs (trafgen, kernel qdisc bypass and trafgen use qdisc bypass). But let us look at the performance numbers for only a single CPU:

  • Qdisc path = 1,226,776 pps => 815 ns per packet (calc: 1/pps*10^9)
  • Qdisc bypass = 1,382,075 pps => 723 ns per packet (calc: 1/pps*10^9)

This is also interesting, because it shows us the cost of the qdisc code path, which is 92 ns.  In this 10Gbit/s context that is fairly large, e.g. corresponding to almost 3 cache-line misses (92/32=2.9).
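The per-packet costs follow directly from the two measured packet rates. A small Python sketch of that calculation (the names are made up for illustration; the 32 ns cache-miss cost is the figure measured earlier in the post). Without rounding the intermediate values, the qdisc cost comes out slightly under 92 ns:

```python
def ns_per_packet(pps):
    """Nanoseconds spent per packet at a given packet rate."""
    return 1e9 / pps


QDISC_PATH_PPS = 1226776    # measured, single CPU, via the qdisc
QDISC_BYPASS_PPS = 1382075  # measured, single CPU, qdisc bypass
CACHE_MISS_NS = 32          # measured cost of one cache-miss

qdisc_ns = ns_per_packet(QDISC_PATH_PPS)
bypass_ns = ns_per_packet(QDISC_BYPASS_PPS)
qdisc_cost_ns = qdisc_ns - bypass_ns
cache_misses = qdisc_cost_ns / CACHE_MISS_NS

print("qdisc path:   %.1f ns per packet" % qdisc_ns)
print("qdisc bypass: %.1f ns per packet" % bypass_ns)
print("qdisc cost:   %.1f ns (%.1f cache-misses)" % (qdisc_cost_ns, cache_misses))
```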

June 26, 2014

Eric Leblond: pshitt: collect passwords used in SSH bruteforce


I’ve lately been playing with the analysis and characterization of SSH bruteforce attacks. I was a bit frustrated at just getting partial information:

  • ulogd can give information about scanner settings
  • suricata can give me information about the software version
  • sshd server logs show the username
But having the username without having the password is really frustrating.

So I decided to try to get them. Looking for an SSH server honeypot, I found kippo, but it went too far for me by providing fake shell access. So I decided to build my own based on paramiko.

pshitt, Passwords of SSH Intruders Transferred to Text, was born. It is a lightweight fake SSH server that collects authentication data sent by intruders. It basically collects the username and password and writes the extracted data to a file in JSON format. For each authentication attempt, pshitt dumps a JSON formatted entry:

{"username": "admin", "src_ip": "", "password": "passw0rd", "src_port": 36221, "timestamp": "2014-06-26T10:48:05.799316"}
The data can then be easily imported in Logstash (see pshitt README) or Splunk.
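For a quick look at the data without Logstash or Splunk, the file can also be processed directly. The following is a minimal Python sketch, assuming one JSON entry per line as in the example above. The function name is made up, and every sample entry except the first is invented for illustration (the 192.0.2.x addresses are documentation placeholders):

```python
import json
from collections import Counter


def top_passwords(lines, username, count=10):
    """Return the most frequently tried passwords for one username."""
    counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("username") == username:
            counter[entry.get("password")] += 1
    return counter.most_common(count)


# First entry is the one shown above; the others are made-up sample data.
sample = [
    '{"username": "admin", "src_ip": "", "password": "passw0rd", "src_port": 36221, "timestamp": "2014-06-26T10:48:05.799316"}',
    '{"username": "admin", "src_ip": "192.0.2.1", "password": "passw0rd", "src_port": 40001, "timestamp": "2014-06-26T10:48:06.000000"}',
    '{"username": "admin", "src_ip": "192.0.2.1", "password": "admin", "src_port": 40002, "timestamp": "2014-06-26T10:48:07.000000"}',
    '{"username": "root", "src_ip": "192.0.2.2", "password": "toor", "src_port": 40003, "timestamp": "2014-06-26T10:48:08.000000"}',
]

for password, tries in top_passwords(sample, "admin"):
    print(tries, password)
```

With a real log, the list can be replaced by an open file object, since the function only iterates over lines.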

The setup

As I want to really connect to the box running ssh with a regular client, I needed a setup to automatically redirect the offenders, and only them, to the pshitt server. A simple solution was to use DOM. DOM parses the Suricata EVE JSON log file, in which Suricata gives us the software version of IPs connecting to the SSH server. If DOM sees a software version containing libssh, it adds the originating IP to an ipset set. So, the idea of our honeypot setup is simple:
  • Suricata outputs SSH software version to EVE
  • DOM adds IPs using libssh to the ipset set
  • Netfilter NAT redirects all IPs of the set to pshitt when they try to connect to our ssh server
Getting the setup in place is really easy. We first create the set:
ipset create libssh hash:ip
then we start DOM so it adds all offenders to the set named libssh:
cd DOM
./dom -f /usr/local/var/log/suricata/eve.json -s libssh
A more accurate setup for dom can be the following. If you know that your legitimate clients are only based on OpenSSH, then you can run dom to put in the list all IPs that do not (-i) use an OpenSSH client (-m OpenSSH):
./dom -f /usr/local/var/log/suricata/eve.json -s libssh -vvv -i -m OpenSSH
If we want to list the elements of the set, we can use:
ipset list libssh
Now, we can start pshitt:
cd pshitt
And finally we redirect the connections coming from IPs of the libssh set to port 2200:
iptables -A PREROUTING -m set --match-set libssh src -t nat -i eth0 -p tcp -m tcp --dport 22 -j REDIRECT --to-ports 2200
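To make the DOM step more concrete, here is a rough Python sketch of the idea. It is not DOM's actual code. It assumes EVE SSH events carry the client version under ssh.client.software_version, and it only builds the ipset command lines instead of running them. The function names and the sample events are made up for illustration:

```python
import json


def offending_ips(eve_lines, needle="libssh"):
    """Return source IPs whose SSH client software version contains needle."""
    ips = []
    for line in eve_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("event_type") != "ssh":
            continue
        client = event.get("ssh", {}).get("client", {})
        version = client.get("software_version", "")
        ip = event.get("src_ip")
        if needle in version and ip and ip not in ips:
            ips.append(ip)
    return ips


def ipset_commands(ips, set_name="libssh"):
    """Build (but do not run) one 'ipset add' command per IP."""
    return [["ipset", "add", set_name, ip] for ip in ips]


# Made-up sample events using documentation addresses.
sample_eve = [
    '{"event_type": "ssh", "src_ip": "192.0.2.10", "ssh": {"client": {"software_version": "libssh-0.5.2"}}}',
    '{"event_type": "ssh", "src_ip": "192.0.2.20", "ssh": {"client": {"software_version": "OpenSSH_6.6"}}}',
    '{"event_type": "dns", "src_ip": "192.0.2.30"}',
]

for command in ipset_commands(offending_ips(sample_eve)):
    print(" ".join(command))
```

Each command list could be handed to subprocess by a caller with the needed privileges.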

Some results

Here’s an extract of the most used passwords when trying to get access to the root account:

real root passwords

And here’s the same thing for the admin account attempts:

Root passwords

Both datasets show around 24 hours of attempts on an anonymous box.


Thanks to paramiko, it was really fast to code pshitt. I’m now collecting data and I think that they will help to improve the categorization of SSH bruteforce tools.

June 21, 2014

Rusty Russell: Alternate Blog for my Pettycoin Work

I decided to use github for pettycoin, and tested out their blogging integration (summary: it’s not very integrated, but once set up, Jekyll is nice).  I’m keeping a blow-by-blow development blog over there.

June 16, 2014

Rusty Russell: Rusty Goes on Sabbatical, June to December

At I spoke about my pre-alpha implementation of Pettycoin, but progress since then has been slow.  That’s partially due to yak shaving (like rewriting ccan/io library), partially reimplementation of parts I didn’t like, and partially due to the birth of my son, but mainly because I have a day job which involves working on Power 8 KVM issues for IBM.  So Alex convinced me to take 6 months off from the day job, and work 4 days a week on pettycoin.

I’m going to be blogging my progress, so expect several updates a week.  The first few alpha releases will be useless for doing any actual transactions, but by the first beta the major pieces should be in place…

June 11, 2014

Eric Leblond: Let’s talk about SELKS

The slides of my lightning talk at SSTIC are available: Let’s talk about SELKS. The slides are in French and are intended to be humorous.

The presentation is about defensive security that needs to get sexier. And Suricata 2.0 with EVE logging combined with Elasticsearch and Kibana can really help to reach that target. If you want to try Suricata and Elasticsearch, you can download and test SELKS.


The talk also presents a small tool named Deny On Monitoring, which demonstrates how easy it is to extract information from Suricata EVE JSON logging.

June 07, 2014

Rusty Russell: Donation to Jupiter Broadcasting

Chris Fisher’s Jupiter Broadcasting pod/vodcasting started 8 years ago with the Linux Action Show: still their flagship show, and how I discovered them 3 years ago.  Shows like this give access to FOSS to those outside the LWN-reading crowd; community building can be a thankless task, and as a small shop Chris has had ups and downs along the way.  After listening to them for a few years, I feel a weird bond with this bunch of people I’ve never met.

I regularly listen to Techsnap for security news, Scibyte for science with my daughter, and Unfilter to get an insight into the NSA and what the US looks like from the inside.  I bugged Chris a while back to accept bitcoin donations, and when they did I subscribed to Unfilter for a year at 2 BTC.  To congratulate them on reaching the 100th Unfilter episode, I repeated that donation.

They’ve started doing new and ambitious things, like Linux HOWTO, so I know they’ll put the funds to good use!

June 04, 2014

Jesper Dangaard Brouer: Pktgen for network overload testing

Want to get maximum performance out of the kernel level packet generator (pktgen)?
Then read this blogpost:

  • Simple tuning will increase performance from 4Mpps to 5.5Mpps (per CPU)

You might see pktgen as a fast packet generator, which it is, but I (as a kernel developer) also see it as a network stack testing tool for the TX code path.

Pktgen has a parameter "clone_skb", which specifies how many times to send the same packet before freeing and allocating a new packet for transmission.  This affects performance significantly, as it can remove a lot of memory allocation and access overhead.

I have two distinctly different use-cases for stack testing:

  1. clone_skb=1      tests the stack alloc/free overhead (related to the SKB)
  2. clone_skb=100000 tests the NIC driver layer
Let's focus on case 2, the driver layer.

Tuning NIC driver layer for max performance:
The default NIC settings are not tuned for pktgen's artificial overload type of benchmarking, as this could hurt the normal use-case.

Specifically increasing the TX ring buffer in the NIC:
 # ethtool -G ethX tx 1024

A larger TX ring can improve pktgen's performance, while it can hurt in the general case, 1) because the TX ring buffer might get larger than the CPU's L1/L2 cache, 2) because it allows more queueing in the NIC HW layer (which is bad for bufferbloat).

One should be careful about concluding that packets/descriptors in the HW TX ring cause delay.  Drivers usually delay cleaning up the ring-buffers (for various performance reasons), thus packets stalling the TX ring might just be waiting for cleanup.

This "slow" cleanup issue is specifically the case for the ixgbe driver (Intel 82599 chip).  This driver combines TX+RX ring cleanups, and the cleanup interval is affected by the ethtool --coalesce setting of parameter "rx-usecs".

For ixgbe use e.g. "30", resulting in approx 33K interrupts/sec (1/30*10^6):
 # ethtool -C ethX rx-usecs 30

Performance data:
Packet Per Sec (pps) performance tests using a single pktgen CPU thread, CPU E5-2630, 10Gbit/s driver ixgbe. (using net-next development kernel v3.15-rc1-2680-g6623b41)

Adjusting the "ethtool -C ethX rx-usecs" value affects how often we clean up the ring.  Keeping the default TX ring size at 512, and adjusting "rx-usecs":
  • 3,935,002 pps - rx-usecs:  1 (irqs:  9346)
  • 5,132,350 pps - rx-usecs: 10 (irqs: 99157)
  • 5,375,111 pps - rx-usecs: 20 (irqs: 50154)
  • 5,454,050 pps - rx-usecs: 30 (irqs: 33872)
  • 5,496,320 pps - rx-usecs: 40 (irqs: 26197)
  • 5,502,510 pps - rx-usecs: 50 (irqs: 21527)
Performance when adjusting the TX ring buffer size. Keeping "rx-usecs==1" (default) while adjusting TX ring size (ethtool -G):
  • 3,935,002 pps - tx-size:  512
  • 5,354,401 pps - tx-size:  768
  • 5,356,847 pps - tx-size: 1024
  • 5,327,595 pps - tx-size: 1536
  • 5,356,779 pps - tx-size: 2048
  • 5,353,438 pps - tx-size: 4096
Adjusting the cleanup interval (rx-usecs) seems to win over simply increasing the TX ring buffer size. This also proves the theory that the TX queue is filled with old packets/descriptors that need cleaning.
(Edit: updated numbers to be clean upstream, previously included some patches)
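The "1/rx-usecs*10^6" rule of thumb can be checked against the measured interrupt counts listed above. A small Python sketch (the names are made up; the measured values are copied from the rx-usecs list). Note that the rx-usecs 1 row does not follow this simple model, while the rows from 10 upwards are close to it:

```python
# rx-usecs -> measured irqs, copied from the list above
MEASURED_IRQS = {1: 9346, 10: 99157, 20: 50154, 30: 33872, 40: 26197, 50: 21527}


def expected_irqs_per_sec(rx_usecs):
    """Interrupt rate if exactly one interrupt fires every rx_usecs."""
    return 1e6 / rx_usecs


def compare(measured=MEASURED_IRQS):
    """Return (rx_usecs, expected, measured) tuples sorted by rx_usecs."""
    return [(usecs, expected_irqs_per_sec(usecs), irqs)
            for usecs, irqs in sorted(measured.items())]


for usecs, expected, irqs in compare():
    print("rx-usecs %2d: expected %8.0f, measured %6d" % (usecs, expected, irqs))
```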

Tools: Want an easy to use script for pktgen? Look here
More details on pktgen advanced topics by Daniel Turull.

May 27, 2014

Rusty Russell: Effects of packet/data sizes on various networks

I was thinking about peer-to-peer networking (in the context of Pettycoin, of course) and I wondered if sending ~1420 bytes of data is really any slower than sending 1 byte on real networks.  Similarly, is it worth going to extremes to avoid crossing over into two TCP packets?

So I wrote a simple Linux TCP ping pong client and server: the client connects to the server then loops: reads until it gets a ‘1’ byte, then it responds with a single byte ack.  The server sends data ending in a 1 byte, then reads the response byte, printing out how long it took.  First 1 byte of data, then 101 bytes, all the way to 9901 bytes.  It does this 20 times, then closes the socket.
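The ping-pong protocol is simple enough to sketch in Python. This is not the original program, only an illustration under a few assumptions: the payload is filled with '0' bytes so that the only '1' is the last byte, "20 times" is read as 20 passes over all sizes, and both ends run in one process over loopback. All function names are made up:

```python
import socket
import threading
import time

SIZES = range(1, 10000, 100)  # 1, 101, ..., 9901 bytes
ROUNDS = 20


def client(host, port):
    """Read until a chunk ends in b'1', then answer with a one-byte ack."""
    with socket.create_connection((host, port)) as conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # server closed the socket
                return
            if chunk.endswith(b"1"):
                conn.sendall(b"a")


def serve_once(listener, sizes=SIZES, rounds=ROUNDS):
    """Accept one client and return {size: [round-trip seconds, ...]}."""
    timings = {}
    conn, _ = listener.accept()
    with conn:
        for _ in range(rounds):
            for size in sizes:
                payload = b"0" * (size - 1) + b"1"
                start = time.perf_counter()
                conn.sendall(payload)
                if not conn.recv(1):  # client went away
                    return timings
                elapsed = time.perf_counter() - start
                timings.setdefault(size, []).append(elapsed)
    return timings


def run_local(sizes=SIZES, rounds=ROUNDS):
    """Run server and client in one process over the loopback interface."""
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    worker = threading.Thread(target=client, args=("127.0.0.1", port))
    worker.start()
    try:
        return serve_once(listener, sizes, rounds)
    finally:
        listener.close()
        worker.join()


if __name__ == "__main__":
    for size, samples in sorted(run_local().items()):
        print("%5d bytes: min %.1f us" % (size, min(samples) * 1e6))
```

Loopback timings say nothing about real networks, of course; to measure one, the client and server halves would have to run on different hosts.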

Here are the results on various networks (or download the source and result files for your own analysis):

On Our Gigabit Lan

Interestingly, we do win for tiny packets, but there’s no real penalty once we’re over a packet (until we get to three packets worth):

Over the Gigabit Lan


Over Gigabit LAN (closeup)

On Our Wireless Lan

Here we do see a significant decline as we enter the second packet, though extra bytes in the first packet aren’t completely free:

Wireless LAN (all results)

Wireless LAN (closeup)

Via ADSL2 Over The Internet (Same Country)

Ignoring the occasional congestion from other uses of my home net connection, we see a big jump after the first packet, then another as we go from 3 to 4 packets:

ADSL over internet in same country

ADSL over internet in same country (closeup)

Via ADSL2 Over The Internet (Australia <-> USA)

Here, packet size is completely lost in the noise; the carrier pigeons don’t even notice the extra weight:

Wifi + ADSL2 from Adelaide to US

Wifi + ADSL2 from Adelaide to US (closeup)

Via 3G Cellular Network (HSPA)

I initially did this with Wifi tethering, but the results were weird enough that Joel wrote a little Java wrapper so I could run the test natively on the phone.  It didn’t change the resulting pattern much, but I don’t know if this regularity of delay is a 3G or an Android thing.  Here every packet costs, but you don’t win a prize for having a short packet:

3G network

3G network (closeup)

Via 2G Network (EDGE)

This one actually gives you a penalty for short packets!  800 bytes to 2100 bytes is the sweet-spot:

2G (EDGE) network

2G (EDGE) network (closeup)


So if you’re going to send one byte, what’s the penalty for sending more?  Eyeballing the minimum times from the graphs above:

                             Wired LAN  Wireless  ADSL  3G   2G
Penalty for filling packet   30%        15%       5%    0%   0%*
Penalty for second packet    30%        40%       15%   20%  0%
Penalty for fourth packet    60%        80%       25%   40%  25%

* Average for EDGE actually improves by about 35% if you fill packet

May 19, 2014

Eric Leblond: Playing with python-git


I’m currently working on Scirius, the web management interface for Suricata developed by Stamus Networks. Scirius is able to fetch IDS signatures from external sources, and the backend stores them in a git tree. As Scirius is a Django application, this means we need to interact with git in Python.

Usually the documentation of Python modules is good enough to develop with. This is sadly not the case for GitPython. There is documentation, but the overall quality is not excellent, at least for someone who is not a genuine Python developer, and some big parts are missing.

Doing a commit

Doing a commit is really simple once you have understood what to do. You need to open the repository and work on its index, which is the object you add the files to commit to. In the following example, I want to add everything under the rules directory:

    import git
    repo = git.Repo(source_git_dir)
    index = repo.index
    index.add(["rules"])  # stage everything under the rules directory
    message =  'source version at %s' % (self.updated_date)
    index.commit(message)

Set value in the configuration of a repository

It is possible to edit the configuration of a git repository with GitPython. To do that, you need to get the config writer and use its set_value function. For example, the following code snippet creates a repository and sets user.email and user.name for that repository:

    repo = git.Repo.init(source_git_dir)
    config = repo.config_writer()
    config.set_value("user", "email", "")
    config.set_value("user", "name", "Scirius")

OSError 25: Inappropriate ioctl for device

I’ve encountered this fabulous exception when trying to do a commit in Scirius. The problem is only showing up when running the application in wsfcgi mode. It is documented in Issue 39 on GitHub but there is no workaround proposed.

The error comes from the fact that the function used to guess the identity of the user running the application is called even if values are set in the config. This function fails when it is called outside of a real session: it tries to get things from the environment, but these values are not set when the application is started by init. To fix this, it is possible to force the USERNAME environment variable.

Here’s how it is implemented in Scirius:

+    os.environ['USERNAME'] = 'scirius'
    message =  'source version at %s' % (self.updated_date)

You can see the diff on GitHub

May 08, 2014

Rusty Russell: BTC->BPAY gateway (for Australians)

I tested out, which lets you pay any BPAY bill (see explanation from reddit).  Since I’d never heard of the developer, I wasn’t going to send anything large through it, but it worked flawlessly.  At least the exposure is limited to the time between sending the BTC and seeing the BPAY receipt.  Exchange rate was fair, and it was a simple process.

Now I need to convince my wife we should buy some BTC for paying bills…

April 30, 2014

Jesper Dangaard Brouer: trafgen a fast packet generator

The netsniff-ng toolkit version 0.5.8 has been released.

One of the tools included in the netsniff-ng toolkit is: "trafgen" a multi-threaded low-level zero-copy network packet generator.  The recent release contains some significant performance improvements to that traffic generator.

Single CPU generator performance on an E5-2630 CPU, with Intel ixgbe/82599 chip, reaches 1.4 Million Packets Per Sec (Mpps) when using the recent kernel (>= v3.14) feature of qdisc bypass for RAW sockets, and around 1.2 Mpps without bypassing the qdisc system in the kernel. (Default is to use the qdisc bypass if available; for testing purposes the qdisc path can be enabled via command line option "--qdisc-path")

In this release, I've also made "trafgen" scale to more CPUs:

The hard part of using trafgen is specifying and creating the packet description input file.  I really enjoy the flexibility when defining the packet contents, but without good examples as a starting point, it can be a daunting task.

For that reason, I've made some examples available at github here:

I've used the SYN attack example while developing the SYNPROXY module, see my other blogpost. I'm releasing this example now, because solutions for mitigating this attack are now available.

Jon Schipp also has a solution and has created a script "gencfg" for generating trafgen packet description input files, available on github:

Notice: to get these performance numbers you need to tune your packet generator machine for network overload testing.

Jesper Dangaard Brouer: Mitigating DDoS SYN flood attacks with iptables/netfilter

Hey, I'm also blogging on the Red Hat Enterprise Linux Blog

I recently did a very practical post on Mitigating TCP SYN Flood Attacks with iptables/netfilter, in the hope of providing the world with a practical solution to these annoying SYN-flood DDoS attacks, which we have been seeing for the last 20 years.

I've also been touring with a technical talk on the subject, and the most recent version of the slides is here.

There is also a YouTube video of my presentation at DevConf 2013.

April 29, 2014

Jesper Dangaard Brouer: Basic tuning for network overload testing

I'm doing a lot of network testing, where I'm constantly trying to push the limits of the hardware and network stack (in order to improve performance and fix scalability issues in the Linux Kernel).

Some basic tuning of the NICs (Network Interface Cards) and IRQs is required before we can start this "overload" testing mode.

1. First thing I do is to kill "irqbalance", to stop it interfering with my manual IRQ assignments.

 # killall irqbalance

2. Next, I align/bind the NIC's IRQs to CPUs (one-to-one).

I have a script for aligning the IRQs, that I copied from the Intel ixgbe driver tarball:
 # set_irq_affinity $DEV

The easiest way to view the current IRQ assignment is to use this "grep ." trick:
  # grep . /proc/irq/*/eth4{,-*}/../smp_affinity_list

3. Then I disable Ethernet Flow-Control

 # ethtool -A $DEV rx off tx off autoneg off

I'm disabling Ethernet Flow Control (PAUSE frames) because I want to create an overload situation. When my transmitter/generator machine is overloading the target machine, I don't want the target machine to send "backoff" PAUSE frames, especially if I'm testing the limits of the transmitter's network stack.

4. Unload all netfilter and iptables modules

I have a simple script for flushing iptables and unloading all the modules:

I usually also perform benchmarking and tuning of iptables/Netfilter modules, but for overload testing I'm unloading all modules, as these do introduce measurable overhead.

Extra: A word of caution regarding CPU sleep or idle states:
I've experienced issues when doing low-latency measurements with the C-states of Sandy-E Bridge CPUs, because they too aggressively tried to go into a sleep state, even under a high network load. The latency cost of coming out of a sleep state can be significant. Jeremy Eder has described these issues in detail in his blog:

Simply use the tool "turbostat" to measure the different C-states.

And use the tool "tuned-adm" to adjust what profile you want to enable e.g.:
 # tuned-adm profile throughput-performance
 # tuned-adm profile latency-performance

April 27, 2014

Eric Leblond: Slides of my coccigrep lightning talk at HES2014

I gave a lightning talk about coccigrep at Hackito Ergo Sum to show how it can be used to search code during an audit or hacking party. Here are the slides: coccigrep: a semantic grep for the C language.

The slides of my talk Suricata 2.0, Netfilter and the PRC will soon be available on Stamus Networks website.

April 17, 2014

Eric Leblond: Speeding up scapy packets sending

Sending packets with scapy

I’m currently writing some code based on scapy. This code reads data from a possibly huge file and sends a packet for each line in the file using the contained information. So the code contains a simple loop and uses sendp because the frame must be sent at layer 2.

     def run(self):
         filedesc = open(self.filename, 'r')
         # loop on read line
         for line in filedesc:
             # Build and send packet
             sendp(pkt, iface = self.iface, verbose = verbose)
             # Inter packet treatment

Doing that, the performance is a bit disappointing. For 18 packets, we’ve got:

    real    0m2.437s
    user    0m0.056s
    sys     0m0.012s

If we strace the code, the explanation is quite obvious:

socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0S}0@\0*\6\265\373\307;\224\24\300\250"..., 97, 0, NULL, 0) = 97
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0
socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0004}1@\0*\6\266\31\307;\224\24\300\250"..., 66, 0, NULL, 0) = 66
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0

For each packet, a new socket is opened and this takes ages.

Speeding up the sending

To speed up the sending, one solution is to build a list of packets and to send that list via a sendp() call.

     def run(self):
         filedesc = open(self.filename, 'r')
         pkt_list = []
         # loop on read line
         for line in filedesc:
             # Build packet and append it to the list
             pkt_list.append(pkt)
         sendp(pkt_list, iface = self.iface, verbose = verbose)

This is not possible in our case due to the inter packet treatment we have to do. So the best way is to reuse the socket. This can be done easily when you’ve read the documentation^W code:

@@ -27,6 +27,7 @@ class replay:
     def run(self):
         # open filename
         filedesc = open(self.filename, 'r')
+        s = conf.L2socket(iface=self.iface)
         # loop on read line
         for line in filedesc:
             # Build and send packet
-            sendp(pkt, iface = self.iface, verbose = verbose)
+            s.send(pkt)

The idea is to create a socket via the function used in sendp() and to use the send() function of the object to send packets.

With that modification, the performance is far better:

    real    0m0.108s
    user    0m0.064s
    sys     0m0.004s

I’m not a scapy expert so ping me if there is a better way to do this.
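The cost pattern is not specific to scapy. As a rough illustration, here is the same "one socket per packet" versus "one socket for all packets" contrast written with plain UDP sockets from the Python standard library instead of scapy's layer 2 sockets, so the absolute numbers will differ from the ones above. All names are made up, and the loopback receiver only exists to check that both variants deliver the same packets:

```python
import socket


def send_with_new_socket_each_time(payloads, addr):
    """Open and close a socket for every packet (the slow pattern)."""
    for payload in payloads:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(payload, addr)
        sock.close()


def send_with_reused_socket(payloads, addr):
    """Open the socket once and reuse it for every packet."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for payload in payloads:
            sock.sendto(payload, addr)


def count_received(receiver, expected, timeout=1.0):
    """Count datagrams waiting on receiver, giving up after a timeout."""
    received = 0
    receiver.settimeout(timeout)
    try:
        while received < expected:
            receiver.recvfrom(2048)
            received += 1
    except socket.timeout:
        pass
    return received


receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
target = receiver.getsockname()

payloads = [b"packet %d" % i for i in range(18)]
send_with_new_socket_each_time(payloads, target)
send_with_reused_socket(payloads, target)
received = count_received(receiver, 2 * len(payloads))
receiver.close()
print("received", received, "datagrams")
```

Wrapping each sender in time.perf_counter() calls gives a feel for the difference, although a UDP socket is much cheaper to set up than the PF_PACKET socket in the strace output above.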

April 16, 2014

Jesper Dangaard Brouer: Full scalability for Netfilter conntracks

My scalability fixes for Netfilter connection tracking have reached Linus's tree and will appear in kernel release v3.15.

Netfilter’s conntrack has had a bad reputation for being slow. While this was true in the "early-days", it has been offering excellent scalability for established conntracks for a long time now.  Matching against existing conntrack entries is very fast and completely scalable. (The conntrack system actually does lockless RCU (Read-Copy Update) lookups for existing connections).

The conntrack system has, for a long time, had a scalability problem when it comes to creating (or deleting) connections (single central spinlock).  This scalability issue is now fixed.

This work relates to my recent efforts of using conntrack for DDoS protection, as e.g. SYN-floods would hit this "new" connection scalability problem with Netfilter conntracks.

Finally, version 3 of the patchset was accepted on March 7th 2014 (note Eric Dumazet worked on the first attempts back on May 9th 2013). The most important commit is 93bb0ceb75 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock").

Jesper Dangaard Brouer: Announcing: The IPTV-Analyzer

I'm happy to announce the first official release of the IPTV-Analyzer project, as an Open Source project.

The IPTV-Analyzer is a continuous/real-time tool for analyzing the contents of MPEG2 Transport Stream (TS) packets, which is commonly used for IPTV multicast signals. The main purpose is continuous quality measurement, with a focus on detecting MPEG2 TS/CC packet drops.

The core component is an iptables (Linux) kernel module, named "mpeg2ts". This kernel module performs the real-time Deep Packet Inspection of the MPEG2-TS packets. It is highly performance-optimized, written for parallel processing across CPU cores (via RCU locking), and hash tables are used for handling a large number of streams. Statistics are exported via the proc filesystem (scalability is achieved via use of the seq_file proc API). It scales to hundreds of IPTV channels, even on small ATOM based CPUs.

Please send bugreports, patches, improvement, comments or insults to:

March 24, 2014

Rusty Russell: Legal Questions About and Australia

As my previous post documented, I’ve experimented with  Following the arrest of two Miami men for trading on localbitcoins, I decided to seek legal advice on the situation in Australia.

Online research led me to Nick Karagiannis of Kelly and Co, who was already familiar with Bitcoin: I guess it’s a rare opportunity for excitement in financial regulatory circles!  This set me back several thousand dollars (in fiat, unfortunately), but the result was reassuring.

They’ve released an excellent summary of the situation, derived from their research.  I hope that helps other bitcoin users in Australia, and I’ll post more in future should the legal situation change.

March 19, 2014

Rusty Russell: Bitcoin Trading In Australia

I bought 10 BTC to play with back in 2011, and have been slowly spending them to support bitcoin adoption.  One thing which I couldn’t get reliable information on was how to buy and sell bitcoin within Australia, so over the last few months I decided to sell a few via different methods and report the results here (this also helps my budget, since I’m headed off on paternity leave imminently!).

All options listed here use two-factor authentication, otherwise I wouldn’t trust them with more than cents.  And obviously you shouldn’t leave your bitcoins in an exchange for any longer than necessary, since most exchanges over time have gone bankrupt.

Option 1: MtGox AUD

Yes, I transferred some BTC into MtGox and sold them.  This gave the best price, but after over two months of waiting the bank transfer to get my money hadn’t been completed.  So I gave up, bought back into bitcoins (fewer, since the price had jumped) and thus discovered that MtGox was issuing invalid BTC transactions so I couldn’t even get those out.  Then they halted transactions altogether blaming TX malleability.  Then they went bankrupt.  Then they leaked my personal data just for good measure.  The only way their failure could be more complete is if my MtGox Yubikey catches on fire and burns my home to the ground.

Volume: Great (5M AUD/month)
Price Premium: $25 – $50 / BTC
Charge: 0.65%
Hassle: Infinite
Summary: 0/10

Option 2:

According to, localbitcoins is the largest volume method for AUD exchange.  It’s not an exchange, so much as a matching and escrow service, though there are a number of professional traders active on the site.  The bulk of AUD trades are online, though I sold face to face (and I’ll be blogging about the range of people I met doing that). is a great place for online BTC buyers, since they have been around for quite a while and have an excellent reputation with no previous security issues, and they hold bitcoins in escrow as soon as you hit “buy”.  It’s a bit more work than an exchange, since you have to choose the counter-party yourself.

For online sellers, transfers from stolen bank accounts are a real issue.  Electronic Funds Transfer (aka “Pay Anyone”) is reversible, so when the real bank account owner realizes their money is missing, the bank tends to freeze the receiving (ie. BTC seller’s) bank account to make sure they can’t remove the disputed funds.  This process can take weeks or months, and banks’ anti-fraud departments generally treat bitcoin sellers who get defrauded with hostility (ANZ is reported to be the exception here).  A less common scam is fraudsters impersonating the Australian Tax Office and telling the victim to EFT to the localbitcoins seller.

Mitigations for sellers include any combination of:

  1. Only accepting old-fashioned cash deposits via a branch (though I’m aware of one US report where a fraudster convinced the teller to reverse the deposit, I haven’t heard of that in Australia)
  2. Insisting on “” in the transfer message (to avoid the ATO fraud problem)
  3. Only dealing with buyers with significant reputation (100+ trades with over 150 BTC is the Gold Standard)
  4. Insisting on real ID checking (eg. Skype chat of buyer with drivers’ license)
  5. Only dealing with buyers whose accounts are older than two weeks (most fraudsters are in and out before then, though their reputation can be very good until they get caught)
  6. Only allowing internal transfers between the same bank (eg. Commonwealth), relying on the bank’s use of two factor authentication to reduce fraud.

Many buyers on localbitcoins are newcomers, so anticipate honest mistakes for the most part.  The golden rule always applies: if someone is offering an unrealistic price, it’s because they’re trying to cheat you.

Volume: Good (1M AUD/month)
Price Premium: $5 – $20 / BTC
Charge: 1% (selling), 0% (buying)
Hassle: Medium
Summary: 7/10

Option 3:

You’ll need to get your bank account checked to use this fairly low-volume exchange, but it’s reasonably painless.  Their issues are their lack of exposure (I found out about them through and lack of volume (about a quarter of the volume), but they also trade litecoin if you’re into that.  You can leave standing orders, or just manually place one which is going to be matched instantly.

They seem like a small operation, based in Sydney, but my interactions with them have been friendly and fast.

Volume: Low (300k AUD/month)
Price Premium: $0 / BTC
Charge: 1%
Hassle: Low
Summary: 7/10

Option 4:

I heard about this site from a well-circulated blog post on Commonwealth Bank closing their bank account last year.  I didn’t originally consider them since they don’t promote themselves as an exchange, but you can use their filler to sell them bitcoins at a spot rate.  It’s limited to $4000 per day according to their FAQ.

They have an online ID check, using the usual sources which didn’t quite work for me due to out-of-date electoral information, but they cleared that manually within a day.  They deposit 1c into your bank account to verify it, but that hasn’t worked for me, so I’ve no way to withdraw my money and they haven’t responded to my query 5 days ago leaving me feeling nervous.  A search of reddit points to common delays, and founder’s links to the hacked-and-failed Bitcoinica give me a distinct “magical gathering” feel. [Edit: they apparently tried and failed four times to transfer the 1c verification to my ING account; with 1-2 business day support response, this took quite a while.  They never explained why this was failing.  Using my wife’s CBA account worked however, and I got my funds the next day.  Upgraded their score from 4/10 to 5/10.]

Volume: Unknown (self-reports indicate ~250k/month?)
Price Premium: $0 / BTC
Charge: 1.1% (selling) 2% (buying)
Hassle: Medium
Summary: 5/10

If you trade, I’d love to hear corrections, comments etc. or email me on

March 07, 2014

Eric Leblond: Suricata and Ulogd meet Logstash and Splunk

Some progress on the JSON side

Suricata 2.0-rc2 is out and it brings some progress on the JSON side. Logging of the SSH protocol has been added, and the timestamp format has been updated to be ISO 8601 compliant; the field is now named timestamp instead of time.

Ulogd, the Netfilter logging daemon, has seen a similar change: it now also uses an ISO 8601 compliant timestamp for the . This feature is available in git and will be part of ulogd 2.0.4.

Thanks to this format change, the integration with logstash or splunk is easier and more accurate. It fixes a problem with the timestamp that the event and logging manager assigns to an event. At least in logstash, the date used was the time of parsing, which was not really accurate. It could even be a problem when logstash was parsing a file with old entries, because the difference between the two timestamps could be huge.

It is now possible to update the logstash configuration to parse the timestamp correctly. After doing this, the internal @timestamp and the timestamp of the event are synchronized, as shown in the following screenshot:


Logstash configuration

Screenshot from 2014-02-02 13:22:34

To configure logstash, you simply need to tell it that the timestamp field in the JSON message is a date. To do so, you need to add a filter:

      date {
        match => [ "timestamp", "ISO8601" ]
      }

A complete logstash.conf would then look like:

input {
   file {
      path => [ "/usr/local/var/log/suricata/eve.json", "/var/log/ulogd.json" ]
      codec =>   json
      type => "json-log"
   }
}

filter {
   if [type] == "json-log" {
      date {
        match => [ "timestamp", "ISO8601" ]
      }
   }
}

output {
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }
}
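
To see what the date filter buys you, here is a minimal Python sketch of the same idea: take the time from the event itself rather than from the moment of parsing. The sample line is hypothetical (only the timestamp field name comes from the text above), and the exact timestamp layout is an assumption on my side.

```python
import json
from datetime import datetime, timezone

# Hypothetical event line; only the "timestamp" field name comes from the post.
LINE = '{"timestamp": "2014-03-07T18:50:21.123456+0100", "event_type": "ssh"}'


def event_time(line):
    """Return the event's own timestamp as an aware UTC datetime."""
    event = json.loads(line)
    # Assumed layout: ISO 8601 with microseconds and a numeric UTC offset.
    parsed = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S.%f%z")
    return parsed.astimezone(timezone.utc)


print(event_time(LINE).isoformat())
```

Whatever the moment the file is read, the result stays tied to the event, which is the property that matters when old log files are replayed.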

Splunk configuration

Screenshot from 2014-03-07 23:30:40

In splunk, auto detection of the file format is failing and it seems you need to define a type to parse JSON in $SPLUNK_DIR/etc/system/local/props.conf:

KV_MODE = json

Then you can simply declare the log file in $SPLUNK_DIR/etc/system/local/inputs.conf:

sourcetype = suricata

sourcetype = suricata

You can now run searches and build dashboards based on Suricata or Netfilter packet logging.

February 24, 2014

Eric Leblond: Nftables and the Netfilter logging framework

Nftables logging

Nftables brings a lot of changes on the user side, and this is also true in the logging area. There is now a single keyword for logging, log, and this target uses the Netfilter logging framework. A corollary is that you may not see any log messages even if a rule with log is matching, because the Netfilter logging framework has to be configured.

Netfilter logging framework

The Netfilter logging framework is a generic way of logging used in Netfilter components. This framework is implemented in two different kernel modules:

  • xt_LOG: printk based logging, outputting everything to syslog (same module as the one used for the iptables LOG target). It can only log packets for IPv4 and IPv6.
  • nfnetlink_log: netlink based logging, which requires setting up ulogd2 to get the events (same module as the one used for the iptables NFLOG target). It can log packets for any family.

To use one of the two modules, you need to load them with modprobe. It is possible to have both modules loaded and in this case, you can then setup logging on a per-protocol basis. The active configuration is available for reading in /proc:

# cat /proc/net/netfilter/nf_log 
 0 NONE (nfnetlink_log)
 1 NONE (nfnetlink_log)
 2 nfnetlink_log (nfnetlink_log,ipt_LOG)
 3 NONE (nfnetlink_log)
 4 NONE (nfnetlink_log)
 5 NONE (nfnetlink_log)
 6 NONE (nfnetlink_log)
 7 nfnetlink_log (nfnetlink_log)
 8 NONE (nfnetlink_log)
 9 NONE (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
11 NONE (nfnetlink_log)
12 NONE (nfnetlink_log)
The syntax is the following FAMILY ACTIVE_MODULE (AVAILABLE_MODULES). Here nfnetlink_log was loaded first and xt_LOG was loaded afterward (xt_LOG is aliased to ipt_LOG and ip6t_LOG).

Protocol family numbers can look a bit strange. They are in fact the socket address family numbers used in the underlying code. The list is the following:

#define AF_UNSPEC	0
#define AF_UNIX		1	/* Unix domain sockets 		*/
#define AF_INET		2	/* Internet IP Protocol 	*/
#define AF_AX25		3	/* Amateur Radio AX.25 		*/
#define AF_IPX		4	/* Novell IPX 			*/
#define AF_APPLETALK	5	/* Appletalk DDP 		*/
#define	AF_NETROM	6	/* Amateur radio NetROM 	*/
#define AF_BRIDGE	7	/* Multiprotocol bridge 	*/
#define AF_AAL5		8	/* Reserved for Werner's ATM 	*/
#define AF_X25		9	/* Reserved for X.25 project 	*/
#define AF_INET6	10	/* IP version 6			*/
#define AF_MAX		12	/* For now.. */
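
If you want to read this file from a script, the FAMILY ACTIVE_MODULE (AVAILABLE_MODULES) layout is easy to parse. The following is only a sketch: the sample text is abridged from the listing above, the family names are a subset of the AF_* list, and the function names are mine.

```python
# Sample abridged from the /proc/net/netfilter/nf_log listing shown above.
SAMPLE = """\
 0 NONE (nfnetlink_log)
 2 nfnetlink_log (nfnetlink_log,ipt_LOG)
 7 nfnetlink_log (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
"""

# Subset of the AF_* values listed above.
FAMILY_NAMES = {0: "AF_UNSPEC", 2: "AF_INET", 7: "AF_BRIDGE", 10: "AF_INET6"}


def parse_nf_log(text):
    """Return {family: (active_module_or_None, [available_modules])}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        family, active, available = line.split(None, 2)
        modules = [m for m in available.strip().strip("()").split(",") if m]
        result[int(family)] = (None if active == "NONE" else active, modules)
    return result


def describe(text):
    """One human-readable line per family, in numeric order."""
    lines = []
    for family, (active, _modules) in sorted(parse_nf_log(text).items()):
        name = FAMILY_NAMES.get(family, "family %d" % family)
        lines.append("%s -> %s" % (name, active or "no logger bound"))
    return lines


for line in describe(SAMPLE):
    print(line)
```

On a real system you would feed it the content of /proc/net/netfilter/nf_log instead of SAMPLE.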

To update the configuration, you need to write in the file corresponding to the family in /proc/sys/net/netfilter/nf_log/ directory. For example, if you want to use ipt_LOG for IPv4 (2 in the list), you can do:

echo "ipt_LOG" >/proc/sys/net/netfilter/nf_log/2 
This will activate ipt_LOG for IPv4 logging:
# cat /proc/net/netfilter/nf_log 
 0 NONE (nfnetlink_log)
 1 NONE (nfnetlink_log)
 2 ipt_LOG (nfnetlink_log,ipt_LOG)
 3 NONE (nfnetlink_log)
 4 NONE (nfnetlink_log)
 5 NONE (nfnetlink_log)
 6 NONE (nfnetlink_log)
 7 nfnetlink_log (nfnetlink_log)
 8 NONE (nfnetlink_log)
 9 NONE (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
11 NONE (nfnetlink_log)
12 NONE (nfnetlink_log)

The Netfilter logging framework is also used internally by Netfilter for some logging. For example, connection tracking uses it to send messages when invalid packets are seen. These messages are useful because they contain the reason for the rejection. For example, one of the messages is “nf_ct_tcp: ACK is under the lower bound (possible overly delayed ACK)”. These messages are only sent if logging of invalid packets has been requested. This is done by doing:

echo "255"> /proc/sys/net/netfilter/nf_conntrack_log_invalid
More information on the magical 255 value is available in the kernel documentation of the nf_conntrack sysctls. If the nfnetlink_log module is used for the protocol, then the group used is 0. So if you want to activate these messages, it could be a good idea to use a non-zero nfnetlink group in the log rules. This way you will be able to differentiate the log sources in software like ulogd.

Logging with Nftables

As mentioned before, logging is made via a log keyword. A typical log and accept rule will look like:

nft add rule filter input tcp dport 22 ct state new log prefix \"SSH for ever\" group 2 accept
This rule accepts packets to port 22 in the NEW state and logs them with the prefix SSH for ever on group 2. Here the group is only used when the active logging kernel module is nfnetlink_log. The option has no effect if xt_LOG is used. In fact, when used with xt_LOG, the only available option is prefix (at least for nftables 0.099).

The available options when using nfnetlink_log module are the following (at least for nftables 0.099):

  • prefix: A prefix string to include in the log message, up to 64 characters long, useful for distinguishing messages in the logs.
  • group: The netlink group (0 – 2^16-1) to which packets are sent (only applicable for nfnetlink_log). The default value is 0.
  • snaplen: The number of bytes to be copied to userspace (only applicable for nfnetlink_log). nfnetlink_log instances may specify their own range, this option overrides it.
  • queue-threshold: Number of packets to queue inside the kernel before sending them to userspace (only applicable for nfnetlink_log). Higher values result in less overhead per packet, but increase delay until the packets reach userspace. The default value is 1.
Note: the descriptions are extracted from the iptables man pages.

If you want to do some easy testing with nftables, simply load xt_LOG module before nfnetlink_log. It will bind to IPv4 and IPv6 protocol and provide you logging. For more fancy stuff involving nfnetlink_log, you can have a look at Using ulogd and JSON output.

Happy logging to all!

February 23, 2014

Eric Leblond: Logging connection tracking event with ulogd


I recently met @aurelsec and we discussed the value of logging connection tracking entries. This is indeed an undervalued information source in a network.

Quoting Wikipedia: “Connection tracking allows the kernel to keep track of all logical network connections or sessions, and thereby relate all of the packets which may make up that connection. NAT relies on this information to translate all related packets in the same way, and iptables can use this information to act as a stateful firewall.”

Connection tracking being linked with Network Address Translation has a direct impact: it stores both sides of each connection. If we use the conntrack tool from conntrack-tools to list connections:

# conntrack  -L
tcp      6 431999 ESTABLISHED src= dst= sport=53400 dport=443 src= dst= sport=443 dport=53500 [ASSURED] mark=0 use=1
We have the two sides of a connection:
  • Orig: here to This is the packet information as seen by the firewall when the packet reaches it. There is no translation at all.
  • Reply: here to This is what an answer coming from the server will look like. The destination has been changed to the public IP of the firewall (here And there is also a change of the destination port to the one used by the firewall when doing the initial mapping. In fact, as multiple clients could use the same port at the same time, the firewall may have to rewrite the initial source port.

So the connection tracking stores all NAT transformations. This information is important because it is the only way to know which IP in a private network is responsible for something in the outside world. For example, let’s suppose that has been attacked by our internal client (here If the admin of this server sees the attack, they will only see the IP address and source port 53500. If an authority asks you for the IP address responsible in your internal network, you have no instrument but the conntrack to know that this was in fact
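
The lookup described here can be sketched in a few lines of Python. The entry below is hypothetical: the addresses are documentation placeholders that I made up (they are not the ones from the original listing), while the ports and field layout follow the conntrack -L example above. The function names are mine too.

```python
import re

# Hypothetical entry: placeholder addresses, ports taken from the example above.
ENTRY = ("tcp      6 431999 ESTABLISHED "
         "src=192.0.2.10 dst=198.51.100.7 sport=53400 dport=443 "
         "src=198.51.100.7 dst=203.0.113.1 sport=443 dport=53500 "
         "[ASSURED] mark=0 use=1")


def split_directions(entry):
    """Return (orig, reply) dicts built from the two src=/dst= groups."""
    tuples = []
    for key, value in re.findall(r"(src|dst|sport|dport)=(\S+)", entry):
        if key == "src":  # each direction starts with a src= field
            tuples.append({})
        tuples[-1][key] = value
    orig, reply = tuples
    return orig, reply


def internal_client(entry, public_port):
    """Find the original source hidden behind a translated port."""
    orig, reply = split_directions(entry)
    if reply["dport"] == str(public_port):
        return orig["src"], orig["sport"]
    return None


print(internal_client(ENTRY, 53500))
```

The remote admin only knows the reply-side destination port, and the orig side of the same entry gives back the internal address and its real source port.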

That’s why logging connection tracking events is one of the only effective ways to store the information necessary to get back to the internal IP address in case of an external query. Let’s now do this with ulogd 2.

Ulogd setup

Ulogd installation

Ulogd 2 is able to get information from the connection tracking and to log it in files or databases. If your distribution does not provide ulogd and you don’t know how to install it, you can check this post: Using ulogd and JSON output. To be sure that you will be able to log connection tracking events, check that the NFCT plugin is set to yes at the end of the configure output.

Ulogd configuration:
  Input plugins:
    NFLOG plugin:			yes
    NFCT plugin:			yes

Kernel setup

All functionalities are standard since kernel 2.6.14. You only need to load the following module:

modprobe nf_conntrack_netlink
It is the one in charge of kernel and userspace information exchange regarding connection tracking. It provides features to dump the conntrack table or modify entries in the conntrack. For example, the conntrack tool mentioned before uses that communication method to get the listing of connection tracking entries. But the feature that interests us in ulogd is the event mode. For each event in the life of a connection, a message is sent to userspace. Ulogd is able to listen to these messages, and this gives it the ability to store all information on the life of the connection in connection tracking.

Depending on the protocols you have on your network, you may need to run one of the following:

modprobe nf_conntrack_ipv4
modprobe nf_conntrack_ipv6

Ulogd setup

Our first objective will simply be to log all NAT decisions to a syslog-like file on disk. In terms of connection tracking, this means we will log all connections in the NEW state. This way we will get information about any packet going through the firewall with the associated NAT transformation.

If you install from sources, copy the ulogd.conf at the root of the ulogd sources to your config directory (usually /usr/local/etc/) and open it in your favorite editor.

Ulogd does its logging based on stack definitions. A stack is a chain of plugins starting with an input plugin, finishing with an output one, and with filters in the middle. In our case, we want to get packets from Netfilter conntrack and the corresponding plugin is NFCT. The first example of a stack containing NFCT in the ulogd.conf file is the one we are interested in, so we uncomment it:

We are not sure that the setup of the input and output plugins will be correct. For now, let’s just check the output:
As you may have seen, emu1 is also used by packet logging. So it may be a good idea to have our own output file for connection tracking events. To do that, we update the stack:
and create a new config below emu1:
We have changed the file name and kept the sync option, which avoids the write delay caused by buffering; that delay can be very annoying when debugging a setup.

Now, we can test:

ulogd -v
In /var/log/ulogd_nfct.log, we see things like
Feb 22 10:50:36 ice-age2 [DESTROY] ORIG: SRC= DST= PROTO=TCP SPT=6000 DPT=22 PKTS=0 BYTES=0 , REPLY: SRC= DST= PROTO=TCP SPT=22 DPT=6000 PKTS=0 BYTES=0
So we only have destruction messages. This is not exactly what we wanted. We are interested in NEW messages, which give us a correct timing of the event. Reading the ulogd.conf file, there seems to be no information about choosing the event types. But let’s ask the NFCT input plugin about its capabilities. To do that we use the -i option of ulogd:
# ulogd -v -i /usr/local/lib/ulogd/ 
Name: NFCT
Config options:
        Var: pollinterval (Integer, Default: 0)
        Var: hash_enable (Integer, Default: 1)
        Var: hash_buckets (Integer, Default: 8192)
        Var: hash_max_entries (Integer, Default: 32768)
        Var: event_mask (Integer, Default: 5)
        Var: netlink_socket_buffer_size (Integer, Default: 0)
        Var: netlink_socket_buffer_maxsize (Integer, Default: 0)
        Var: netlink_resync_timeout (Integer, Default: 60)
        Var: reliable (Integer, Default: 0)
        Var: accept_src_filter (String, Default: )
        Var: accept_dst_filter (String, Default: )
        Var: accept_proto_filter (String, Default: )
The listing starts with the configuration keys. One of them is event_mask. This is the one controlling which events are sent from kernel to userspace. The value is a mask combining some of the following values:
So the default value of 5 is to listen to NEW and DESTROY events. The clever reader will then ask: why did we only see DESTROY messages in that case? This is because the ulogd NFCT plugin runs by default in hash_enable mode. In this mode, one single message is output for each connection (at its end) and a hash table is maintained by ulogd to store the info (here the initial timestamp of the connection). Our setup doesn’t need this feature because we only want to get the NAT transformation, so we switch the hash feature off and limit the events to NEW.
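
As an aside, the mask arithmetic can be made concrete with a small Python sketch. The bit values are the NF_NETLINK_CONNTRACK_NEW, _UPDATE and _DESTROY group bits as I know them from the netlink headers; they agree with 5 meaning NEW plus DESTROY, but check your own headers before relying on them. The function name is mine.

```python
# Conntrack event group bits (assumed from the netlink headers:
# NF_NETLINK_CONNTRACK_NEW, _UPDATE, _DESTROY).
EVENT_BITS = {0x1: "NEW", 0x2: "UPDATE", 0x4: "DESTROY"}


def decode_event_mask(mask):
    """List the event types selected by an event_mask value."""
    return [name for bit, name in EVENT_BITS.items() if mask & bit]


for value in (5, 1, 4):
    print(value, decode_event_mask(value))
```

With these values, limiting the events to NEW means an event_mask of 1, and a DESTROY-only setup uses 4.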

We can now restart ulogd and check the log file:

Feb 22 11:59:34 ice-age2 [NEW] ORIG: SRC=2a01:e35:1394:5bd0:da50:b6ff:fe3c:4250 DST=2001:41d0:1:9598::1 PROTO=TCP SPT=51162 DPT=22 PKTS=0 BYTES=0 , REPLY: SRC=2001:41d0:1:9598::1 DST=2a01:e35:1394:5bd0:da50:b6ff:fe3c:4250 PROTO=TCP SPT=22 DPT=51162 PKTS=0 BYTES=0
Feb 22 11:59:43 ice-age2 [NEW] ORIG: SRC= DST= PROTO=TCP SPT=60846 DPT=443 PKTS=0 BYTES=0 , REPLY: SRC= DST= PROTO=TCP SPT=443 DPT=60946 PKTS=0 BYTES=0
This is exactly what we wanted: we have a trace of all NAT transformations.

Maintain a history of connection tracking


We want to log all information describing a connection so we have a trace of what is happening on the firewall. This means we need at least:

  • IP information for orig and reply way
  • Timestamp of start and end of connection
  • Bandwidth used by the connection

Kernel setup

By default, recent kernels have limited handling of connection tracking. Some useful fields are not stored for performance reasons. This is the case for accounting (number of packets and bytes) and for the timestamp of the connection creation. The advantage of getting accounting information is obvious, as you get information on bandwidth usage. Regarding the timestamp, the benefit is on the implementation side. It allows ulogd to get all the information needed to describe a connection in one single message (the DESTROY one), so ulogd no longer needs to maintain a hash table to keep the info and propagate it at exit.

To activate both features, you have to do:

 echo "1"> /proc/sys/net/netfilter/nf_conntrack_acct
 echo "1"> /proc/sys/net/netfilter/nf_conntrack_timestamp

Ulogd setup

For the following setup, you will need ulogd built from git or a ulogd version greater than or equal to 2.0.4.

Let’s first use JSON output to get the information in a readable format. We need to define a stack:


On the ct2 side, we don’t want to use the hash and we only want to get DESTROY messages, so our configuration looks like:


Regarding jsonnfct1, we could have reused the default JSON configuration, but for ease of testing we will dedicate a file to the NFCT logging:


After a ulogd restart, we’ve got this type of entries:

{"reply.ip.daddr.str": "2a01:e35:1394:5ad0:da50:e6ff:fe3c:1250", "oob.protocol": 0, "dvc": "Netfilter", "timestamp": "Sat Feb 22 12:27:04 2014", "orig.ip.protocol": 6, "reply.raw.pktcount": 20, "flow.end.sec": 1393068424, "": 51384, "orig.l4.dport": 22, "orig.raw.pktlen": 5600, "": 1384991512, "orig.raw.pktcount": 23, "reply.raw.pktlen": 4328, "reply.ip.protocol": 6, "": 22, "reply.l4.dport": 51384, "ct.mark": 0, "ct.event": 4, "flow.start.sec": 1393068302, "flow.start.usec": 637516, "flow.end.usec": 403240, "reply.ip.saddr.str": "2001:41d0:1:9598::1", "": 10, "src_ip": "2a01:e35:1394:5ad0:da50:e6ff:fe3c:1250", "dest_ip": "2001:41d0:1:9598::1"}
The fields we wanted are here:
  • flow.start.* keys store the timestamp of flow start
  • flow.end.* keys store the end of the connection
  • *.raw.pkt* keys store the accounting information

You can then add this file to the files parsed by logstash. For that, you can use the information from Using ulogd and JSON output and modify the input section:

input {
   file {
      path => [ "/var/log/ulogd.json", "/var/log/ulogd_nfct.json"]
      codec =>   json
   }
}

One interesting piece of information in a connection tracking entry is the duration. But the field is not available in the ulogd JSON output and it is not possible to do mathematical operations in Kibana. A solution is to add a filter in logstash.conf to compute the duration:

filter {
  if [type] == "json-log" {
    ruby {
      code => "if event['']; event['flow.duration.sec']=(event['flow.end.sec'].to_i - event['flow.start.sec'].to_i); end"
    }
  }
}

One thing to note when interpreting the computed duration is that a connection only dies after a context-dependent timeout. For example, in the case of a TCP connection, a timeout is still applied even after a FIN packet. So even a short connection will last at least as long as the timeout.
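
For a quick check outside of logstash, the same duration can be computed in Python straight from a JSON entry. This is only a sketch: the record is a subset of the ulogd entry shown above, and the function name is mine. It also uses the usec fields, which the logstash filter ignores.

```python
import json

# Subset of the ulogd JSON entry shown above.
RECORD = json.loads(
    '{"flow.start.sec": 1393068302, "flow.start.usec": 637516, '
    '"flow.end.sec": 1393068424, "flow.end.usec": 403240}'
)


def duration_usec(record):
    """Connection lifetime in microseconds, or None if a field is missing."""
    try:
        start = int(record["flow.start.sec"]) * 1_000_000 + int(record["flow.start.usec"])
        end = int(record["flow.end.sec"]) * 1_000_000 + int(record["flow.end.usec"])
    except KeyError:
        return None
    return end - start


print(duration_usec(RECORD) / 1_000_000, "seconds")
```

Working in integer microseconds avoids rounding surprises when the two timestamps are close to each other.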

Another logging method is PostgreSQL. The stack to use is almost the same as the JSON one but uses, as you may have guessed, the PGSQL plugin:

The configuration of the PostgreSQL plugin is straightforward, based on the sample setup available in the configuration file:
I’m not the one who will explain how to connect to a PostgreSQL database and create a ulogd2 database. See Pollux’s post for that: ulogd2: the new userspace logging daemon for netfilter/iptables (part 2)

Other setups are possible. For example, you can maintain a copy of the connection tracking table in the database and also keep the history. To do that you need to use the INSERT_OR_REPLACE_CT procedure and a connection tracking input plugin that does not use the hash table but gets NEW and DESTROY events:



A connection will be inserted in the table when the NEW event is received, and its entry in the database will be updated when the DESTROY message is received.

February 05, 2014

Eric Leblond: Suricata and Nftables

Iptables and suricata as IPS

Building a Suricata ruleset with iptables has always been a complicated task when trying to combine the rules that are necessary for the IPS with the firewall rules. Suricata has always used Netfilter advanced features, allowing some more or less tricky methods to be used.

For those not familiar with IPS using Netfilter, here are a few starting points:

  1. The IPS receives packets from the kernel via rules using the NFQUEUE target
  2. The IPS must receive all packets of a given flow to be able to handle detection cleanly
  3. The NFQUEUE target is a terminal target: when the IPS issues a verdict on a packet, it is either dropped or accepted (and leaves the current chain)

So the ruleset needs to send all packets to the IPS. A basic ruleset for an IPS could thus look like:

iptables -A FORWARD -j NFQUEUE
With such a ruleset, all packets going through the box are sent to the IPS.

If now you want to combine this with your ruleset, then usually your first try is to add rules to the filter chain:

iptables -A FORWARD -j NFQUEUE
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED -j ACCEPT
# your firewall rules here
But this will not work because of point 3: all packets sent via NFQUEUE to the IPS are either blocked or, if accepted, leave the FORWARD chain directly and go on to the next chain for evaluation (mangle POSTROUTING in our case). With such a ruleset, the result is that there is no firewall but an IPS in place.

As mentioned before, there are some existing solutions (see Building a Suricata ruleset for extensive information). The simplest one is to dedicate another table, such as mangle, to the IPS:

iptables -A FORWARD -t mangle -j NFQUEUE
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED -j ACCEPT
# your firewall rules here
No conflict here, but you have to be sure nothing in your system will use the mangle table or you will have the same problem as the one seen previously in the filter chain. So there was no universal and simple solution to implement an IPS and a firewall ruleset with iptables.

IPS the easy way with Nftables

In Nftables, chains are defined by the user using nft command line. The user can specify:

  • The hook: the place in packet life where the chain will be set. See this diagram for more info.
    • prerouting: chain will be placed before packets are routed
    • input: chain will receive packets going to the box
    • forward: chain will receive packets routed by the box
    • postrouting: chain will receive packets after routing and before sending packets
    • output: chain will receive packet sent by the host
  • The chain type: define the objective of the chain
    • filter: chain will filter packet
    • nat: chain will only contain NAT rules
    • route: chain contains rules that may change the route (previously known as mangle)
  • The priority: define the evaluation order of the different chains of a given hook. It is an integer that can be freely specified. But it also permits to place chain before or after some internal operation such as connection tracking.

In our case, we want to act on forwarded packets, and we want a chain for filtering followed by a chain for the IPS. So the chain setup is simple:

nft -i
nft> add table filter
nft> add chain filter firewall { type filter hook forward priority 0;}
nft> add chain filter IPS { type filter hook forward priority 10;}
With this setup, a packet will reach the firewall chain first, where it will be filtered. If the packet is blocked, it will be destroyed inside the kernel. If the packet is accepted, it will then move to the next chain, following the order of increasing priority. In our case, the packet reaches the IPS chain.
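
The traversal just described can be modeled with a toy Python sketch. This is not how the kernel is implemented: the Chain class, the verdict strings and the default accept policy are all assumptions of mine, kept only to show the ordering by priority.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

ACCEPT, DROP = "accept", "drop"


@dataclass
class Chain:
    name: str
    hook: str
    priority: int
    # Each rule returns ACCEPT, DROP, or None when it does not match.
    rules: List[Callable[[dict], Optional[str]]] = field(default_factory=list)

    def evaluate(self, packet):
        for rule in self.rules:
            verdict = rule(packet)
            if verdict in (ACCEPT, DROP):
                return verdict
        return ACCEPT  # assumed default policy


def traverse(chains, hook, packet):
    """Run the packet through every chain on the hook, lowest priority first."""
    visited = []
    on_hook = sorted((c for c in chains if c.hook == hook), key=lambda c: c.priority)
    for chain in on_hook:
        visited.append(chain.name)
        if chain.evaluate(packet) == DROP:
            return DROP, visited  # a dropped packet never reaches later chains
    return ACCEPT, visited


chains = [
    Chain("IPS", "forward", 10, [lambda p: ACCEPT]),
    Chain("firewall", "forward", 0, [
        lambda p: ACCEPT if p["dport"] in (22, 443) else None,
        lambda p: DROP,
    ]),
]

print(traverse(chains, "forward", {"dport": 22}))
print(traverse(chains, "forward", {"dport": 80}))
```

The point of the model is that an accept in the firewall chain only ends that chain, so the packet still goes on to the IPS chain.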

Now that we’ve got our chains, we can add filtering rules, for example:
nft add rule filter firewall ct state established accept
nft add rule filter firewall tcp dport ssh counter accept
nft add rule filter firewall tcp dport 443 accept
nft add rule filter firewall counter log drop
And for our Suricata IPS, that’s just trivial:
nft add rule filter IPS queue

A bit more details

The queue target in nftables

The complete support for the queue target will be available in Linux 3.14. The syntax is as follows:

nft add rule filter output queue num 3 total 2 options fanout
This rule sends matching packets to 2 load-balanced queues (total 2) starting at 3 (num 3). fanout is one of the two queue options:
  • fanout: When used together with total load balancing, this will use the CPU ID as an index to map packets to the queues. The idea is that you can improve performance if there’s a queue per CPU. This requires total to be specified with a value greater than 1.
  • bypass: By default, if no userspace program is listening on a Netfilter queue, then all packets that are to be queued are dropped. When this option is used, the queue rule behaves like ACCEPT instead, and the packet will move on to the next table.
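
To illustrate the fanout option, here is a tiny Python sketch of the CPU to queue mapping. It assumes the queue is the first queue number plus the CPU ID modulo the number of queues, which is my understanding of what the kernel does; the function name is mine.

```python
def select_queue(num, total, cpu_id):
    """Queue chosen for a packet handled on cpu_id with fanout enabled.

    Assumption: queue = num + (cpu_id modulo total).
    """
    if total < 2:
        raise ValueError("fanout needs total greater than 1")
    return num + cpu_id % total


# With "queue num 3 total 2 options fanout" on a 4-CPU box:
for cpu in range(4):
    print("CPU", cpu, "-> queue", select_queue(3, 2, cpu))
```

Under that assumption, CPUs 0 and 2 feed queue 3 while CPUs 1 and 3 feed queue 4, so one listener per queue is enough to keep packets of a given CPU together.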

For a complete description of queueing mechanism in Netfilter see Using NFQUEUE and libnetfilter_queue.

If you want to test this before Linux 3.14 release, you can get nft sources from nftables git and use next-3.14 branch.

Chain priority

For reference, here are the priority values of some important internal operations and of iptables static chains:

  • NF_IP_PRI_CONNTRACK_DEFRAG (-400): priority of defragmentation
  • NF_IP_PRI_RAW (-300): traditional priority of the raw table placed before connection tracking operation
  • NF_IP_PRI_SELINUX_FIRST (-225): SELinux operations
  • NF_IP_PRI_CONNTRACK (-200): Connection tracking operations
  • NF_IP_PRI_MANGLE (-150): mangle operation
  • NF_IP_PRI_NAT_DST (-100): destination NAT
  • NF_IP_PRI_FILTER (0): filtering operation, the filter table
  • NF_IP_PRI_SECURITY (50): Place of security table where secmark can be set for example
  • NF_IP_PRI_NAT_SRC (100): source NAT
  • NF_IP_PRI_SELINUX_LAST (225): SELinux at packet exit
  • NF_IP_PRI_CONNTRACK_HELPER (300): connection tracking at exit
For example, one can create in nftables an equivalent of the raw PREROUTING chain of iptables by doing:
# nft -i
nft> add chain filter pre_raw { type filter hook prerouting priority -300;}

Rusty Russell: Pettycoin and working with limited visibility.

At I gave a last-minute talk previewing my work on pettycoin (video, slides), an experiment to shard a bitcoin-like network.  The idea is to trade off some security and robustness in return for scale, but use it only for small amounts where fraud is less worthwhile.  On the bitcoin network today this is already seen with zero-confirmation transactions, and this is the niche pettycoin seeks to fill.

There are numerous problems to be overcome (one being the time taken by my day job, of course).  But segmenting the network and the blockchain is an interesting challenge: bitcoin’s blockchain is already designed so that you can have partial knowledge (mainly so you can prune used outputs).  But there’s a clear divide between full nodes, and second-class partial nodes.  I want a system where no one need know everything, and I’m getting closer to that goal.

Consider the simplest useful transaction in the bitcoin network, with one input (ie. a previous transaction’s output) and one output.  To verify this is a fairly simple process:

  1. Is the transaction well-formed?
  2. Find the transaction whose output this is spending.
  3. Does the signature match the address of that output?
  4. Has that output already been spent?

With bitcoin, you’re expected to know every transaction with unspent outputs, so if you can’t find the transaction at step 2, the verification fails. Even better, you can verify that previous transaction, too, all the way back to the creation of the coins involved.  Your only worry is that the blockchain you have is the same as everyone else’s, so they’ll accept your transaction later.

If you don’t expect to know everything, it’s more difficult.  You can use a merkle proof to show that a transaction was present in a block; it takes just log(N) hashes for an N-transaction block.  So you could prove that all those previous transactions are in the blockchain (though that might be thousands of transactions) by providing me with each transaction and proof.
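
A merkle proof is easier to picture with code.  The sketch below is a generic illustration, not pettycoin’s or bitcoin’s actual code: it uses a single SHA-256 where bitcoin uses a double one, it duplicates the last node of odd levels, and all function names are mine.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Single SHA-256 for brevity; bitcoin itself hashes twice.
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """All tree levels, leaf hashes first; odd levels reuse their last node."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Sibling hashes from leaf to root, tagged with 'sibling is on the left'."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the path to the root using only the leaf and the proof."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


txs = [b"tx%d" % i for i in range(8)]
levels = build_levels(txs)
root = levels[-1][0]
proof = make_proof(levels, 5)
print(len(proof), verify(txs[5], proof, root))
```

For the 8-transaction block used here the proof holds 3 hashes, and the verifier never needs the other seven transactions.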

But this can’t prove that there are not duplicate transactions in the blockchain itself.  Only knowing the entire contents would do that.  So we’re relying on the rest of the network, each with a partial view, to check that this didn’t happen.

This leads to the two requirements for any aspect of the pettycoin system which a node can’t verify by itself:

  1. The information to verify must be seen by some honest nodes.
  2. Each node must have an efficient way of reporting if it sees a problem.

The former is a bit tricky.  Consensus is formed by the blockchain, but you don’t want to see all of it.  You might expect to see some fraction of it, but if you don’t, how would you alert the network in a way that can’t be faked?  Imagine a miner holds back 5 transactions in the block: the miner might wait for your complaint message on one, then release that transaction, making you look like the dishonest one.  By making you cry wolf, they can ensure you are ignored.

The solution used in pettycoin is that miners have to prove that they know the transactions in the 10 previous blocks.  They do this by hashing the transactions from the previous block into a merkle tree like normal, only they prefix each transaction with their payout address (this is called prev_merkle in the code).  The only way to generate this hash is to know the contents of each transaction, and you can’t make a valid block without it.  Unfortunately, the only way to demonstrate that this hash is wrong (thus the block is invalid) is to also know the contents of each transaction in the block.  Thus transactions are batched into groups of 4096; you only need send 4096 transactions to prove that one of the hashes in a block is wrong.  Miners will insist on knowing the transactions for those blocks, knowing that if they fake it they’ll likely be caught.
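
The prev_merkle idea can be sketched as follows. This is only an illustration under assumptions: the real encoding, hash function and tree layout in pettycoin may differ, and the function names other than prev_merkle are hypothetical.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Merkle root over a non-empty list of byte strings (assumed layout)."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def prev_merkle(payout_address: bytes, prev_block_txs):
    """Hash the previous block's transactions, each prefixed with the
    miner's payout address, so the result is specific to that miner."""
    return merkle_root([payout_address + tx for tx in prev_block_txs])


# Hypothetical data: two miners hashing the same previous block.
prev_txs = [b"tx-a", b"tx-b", b"tx-c"]
root_alice = prev_merkle(b"alice-address", prev_txs)
root_bob = prev_merkle(b"bob-address", prev_txs)
```

Because the payout address is mixed into every leaf, a miner cannot copy another miner's hash: producing the value requires the full contents of each transaction.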

Reporting most other problems in a block is fairly easy:

  1. You can prove a duplicate spend in the block chain by showing both transactions and the merkle proofs that they are in each block.  The second block is invalid.
  2. You can prove a malformed transaction by showing the transaction and the merkle proof that it is in the block.  That block is invalid.
  3. You can prove an overspend by showing the transactions used as inputs.  That block is invalid.

But if a transaction in a block relies on an output of a transaction which never existed, you can’t prove it.  Even if you know every transaction which ever happened, you can’t prove that to me (without sending me the whole blockchain).  The initial design lived with such warts in the blockchain, instead insisting that you would have to show all the predecessors when you paid me (via a payment protocol).  That predecessor tree quickly becomes unwieldy, however.

The new approach is that for each input of a transaction in the blockchain, the miner has to include the block and transaction number where it appeared.  Now anyone who knows that previous transaction can check it, and if there’s a problem it’s easy for any node to prove by showing the transaction which is in that previous block (with merkle proof that it is).

This means that the blockchain can be trusted, if half the mining power can be trusted.  This is a weaker guarantee than bitcoin, but sufficiently strong for pettycoin.  If you send me a new transaction along with the transactions it uses as inputs and merkle proofs that they are in the blockchain, I only need ensure that the new transaction isn’t a double-spend.  That’s the same as the bitcoin network, with zero-confirmation transactions (though pettycoin has a special double-spend report message to expedite it a little).

Next post, I’ll respond to the best criticism of pettycoin yet, the problem of gateways (by Jason Gi)…

February 02, 2014

Eric Leblond: Using ulogd and JSON output

Ulogd and JSON output

In February 2014, I committed a new output plugin to ulogd, the userspace logging daemon for Netfilter. This is a JSON output plugin which writes logs into a file in JSON format. The interest of the JSON format is that it is easily parsed by software such as logstash. And once the data is understood by logstash, you can get some nice and useful dashboards in Kibana:

Screenshot from 2014-02-02 13:22:34

This post explains how to configure ulogd and iptables to do packet logging and differentiate accepted and blocked packets. If you want to see how cool the result is, just check my post: Investigation on an attack tool used in China.


At the time of this writing, the JSON output plugin for ulogd is only available in the git tree. Ulogd 2.0.4 will contain the feature.

If you need to get the source, you can do:

git clone git://

Then the build is standard:

./autogen.sh && ./configure && make
sudo make install

Please note that at the end of the configure, you must see:

Ulogd configuration:
  Input plugins:
    NFLOG plugin:			yes
    NFACCT plugin:			yes
  Output plugins:
    PCAP plugin:			yes
    JSON plugin:			yes

If the JSON plugin is not built, you need to install the libjansson development files on your system and rerun configure.


Ulogd configuration

All the edits are made in the ulogd.conf file. With the default configure options, the file is in /usr/local/etc/.

First, you need to activate the JSON plugin:



Then we define two stacks for logging. They will be used to differentiate accepted packets from dropped packets:


The first stack will be used to log accepted packets, so we set numeric_label to 1 in [log2]. In [log3], we use a numeric_label of 0.

[log2]
group=1 # Group has to be different from the one used in log1
numeric_label=1

[log3]
group=2 # Group has to be different from the one used in log1/log2
numeric_label=0 # you can label the log info based on the packet verdict

The last thing to edit is the configuration of the JSON instance:

device="My awesome FW"

Here we say we want logs to be synced to disk as they are written (via sync) and we name our device My awesome FW. The last value, boolean_label, is the trickiest. If this configuration variable is set to 1, the numeric_label will be used to decide whether a packet has been accepted or blocked: if numeric_label is non-zero, the packet is seen as allowed. If not, it is seen as blocked.

Sample Iptables rules

In this example, packets to port 22 are accepted and thus are logged in nflog-group 1. Packets hitting the default drop rule are sent to group 2 because they are dropped.

iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A INPUT ! -i lo -p tcp -m tcp --dport 22 --tcp-flags FIN,SYN,RST,ACK SYN -m state --state NEW -j NFLOG --nflog-prefix  "SSH Attempt" --nflog-group 1
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -j ACCEPT
iptables -A INPUT -j NFLOG --nflog-prefix  "Input IPv4 Default DROP" --nflog-group 2

There is no difference in IPv6; we just use nflog-group 1 and 2 with the same purpose:

ip6tables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
ip6tables -A INPUT ! -i lo -p tcp -m tcp --dport 22 --tcp-flags FIN,SYN,RST,ACK SYN -m state --state NEW -j NFLOG --nflog-prefix  "SSH Attempt" --nflog-group 1
ip6tables -A INPUT ! -i lo -p ipv6-icmp -m icmp6 --icmpv6-type 128 -m state --state NEW -j NFLOG --nflog-prefix  "Input ICMPv6" --nflog-group 1
ip6tables -A INPUT -p ipv6-icmp -j ACCEPT
ip6tables -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -j ACCEPT
ip6tables -A INPUT -i lo -j ACCEPT
ip6tables -A INPUT -j NFLOG --nflog-prefix  "Input IPv6 Default DROP" --nflog-group 2

Logstash configuration

Logstash configuration is simple. You simply declare the ulogd.json file as input and, optionally, you can activate geoip on the src_ip key:

input {
   file {
      path => [ "/var/log/ulogd.json"]
      codec =>   json
   }
}

filter {
  if [src_ip]  {
    geoip {
      source => "src_ip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }
}


To start ulogd in daemon mode, simply run:

ulogd -d

You can download logstash from their website and start it with the following command line:

java -jar logstash-1.3.3-flatjar.jar agent -f etc/logstash.conf --log log/logstash-indexer.out -- web

Once done, just point your browser to localhost:9292 and enjoy nice and interesting graphs.

Screenshot from 2014-02-02 13:57:19

Eric Leblond: Investigation on an attack tool used in China

Log analysis experiment

I’ve been playing lately with logstash using data from the ulogd JSON output plugin and the Suricata full JSON output as well as standard system logs.

Screenshot from 2014-02-02 13:22:34

Ulogd gets Netfilter firewall logs from the Linux kernel and writes them in JSON format. Suricata does the same with alerts and other traces. Logstash gets both logs as well as the system logs. This makes it possible to create dashboards with information coming from multiple sources. If you want to know how to configure ulogd for JSON output check this post. For suricata, you can have a look at this one.

Ulogd output is really new and I was experimenting with it in Kibana. When adding some custom graphs, I observed some strange things and decided to investigate.

Displaying TCP window

The TCP window size at the start of a connection is not defined in the RFC, so every OS has chosen its own default value. It thus looked interesting to display the TCP window to be able to spot strange behavior. With the new ulogd JSON plugin, the window size is available in the tcp.window key. So, after doing a query on tcp.syn:1 to only get TCP SYN packets, I was able to graph the TCP window size of SYN packets.

Screenshot from 2014-02-02 13:22:58

Most of the TCP window sizes are well-known and correspond to standard operating systems:

  • 65535 is either Mac OS X or some MS Windows OS.
  • 14600 is used by some Linux systems.
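
Outside Kibana, the same tally can be approximated with a few lines of Python. This is a sketch under assumptions: it supposes the ulogd JSON file holds one JSON object per line with flat keys named tcp.syn and tcp.window (the key names come from the queries above, the file layout is assumed), and the sample events are made up.

```python
import json
from collections import Counter


def syn_window_histogram(lines):
    """Count TCP window sizes over SYN packets found in JSON log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if not isinstance(event, dict):
            continue
        if event.get("tcp.syn") == 1 and "tcp.window" in event:
            counts[event["tcp.window"]] += 1
    return counts


# Made-up sample events; on a real system one would pass
# open("/var/log/ulogd.json") instead of this list.
sample = [
    '{"src_ip": "192.0.2.1", "tcp.syn": 1, "tcp.window": 16384}',
    '{"src_ip": "192.0.2.2", "tcp.syn": 1, "tcp.window": 65535}',
    '{"src_ip": "192.0.2.3", "tcp.syn": 1, "tcp.window": 16384}',
    '{"src_ip": "192.0.2.4", "tcp.syn": 0, "tcp.window": 14600}',
    'not json',
]
histogram = syn_window_histogram(sample)
```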

The first uncommon value is 16384. Graphs are clickable in Kibana, so I was one click away from some interesting information.

The first piece of information, when looking at the dashboard after selecting TCP SYN packets with a window size of 16384, was that it was only ssh scanning:

Screenshot from 2014-02-02 13:58:15

The second piece of information is that, according to geoip, all IPs are Chinese:

Screenshot from 2014-02-02 13:57:19

A SSH scanning software

When looking at the details of the attempt made on my IP, there was something interesting: Screenshot from 2014-02-02 14:04:32

For all hosts, all requests are made with the same source port (6000). This is not possible with a standard ssh client, where the source port is by default chosen by the operating system. So either we have custom software using standard sockets that performs a bind operation to port 6000 at socket creation; this is possible, and one advantage would be to be easily authorized through a firewall if the country had one. Or we have software developed with low-level (RAW) sockets for performance reasons. This would allow faster scanning of the internet by skipping the OS TCP connection handling. There are a lot of posts regarding the usage of port 6000 as source port for some scanning, but I did not find any really interesting information in them.
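
For reference, the first hypothesis (a bind before connect on a standard socket) looks like this in Python. It is only a sketch of the technique, not a reconstruction of the tool observed here; the function name is hypothetical.

```python
import socket


def connect_from_port(dst_host, dst_port, src_port, timeout=5.0):
    """Open a TCP connection whose source port is fixed instead of
    being chosen by the operating system."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow the same local port to be reused for successive connections.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.settimeout(timeout)
    try:
        sock.bind(("", src_port))           # pin the source port
        sock.connect((dst_host, dst_port))  # kernel performs the handshake
    except OSError:
        sock.close()
        raise
    return sock
```

With this approach the kernel still handles the whole TCP handshake, so the initial window size would be the operating system default. A RAW socket scanner builds the SYN itself, which would fit an unusual window value better.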

On suricata side, most of the source IPs are referenced in ET compromised rules: Screenshot from 2014-02-02 13:25:03

Analysing my SSH logs, I did not see any trace of ssh bruteforce coming from source port 6000. But when selecting an IP, I got traces of brute force from at least one of the IPs: Screenshot from 2014-02-02 14:31:02

These attackers seem to really love the root account. In fact, I did not manage to find any trace of attempts for a user other than root from the IP addresses that are using port 6000.

Getting back to my ulogd dashboard, I displayed more info about the scanning sequence used: Screenshot from 2014-02-02 14:34:05 The host scans the box using a raw-socket scanner, then attacks a few minutes later with an SSH bruteforce tool. The bruteforce tool has an initial TCP window size of 65535, which indicates that separate software is used for scanning. So there should be a queueing mechanism between the scanner and the bruteforce tool. This may explain the delay between the scan and the bruteforce. Regarding the TCP window size value, 65535 seems to indicate a Windows server (which is consistent with the TTL value).

Looking at the scanner traffic

Capturing a sample of the traffic did not give too much information. This is a scanner sending a SYN and cleanly sending a reset when it gets the SYN, ACK:

14:27:54.982273 IP (tos 0x0, ttl 103, id 256, offset 0, flags [none], proto TCP (6), length 40) > Flags [S], cksum 0xa525 (correct), seq 9764864, win 16384, length 0
14:27:54.982314 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 44) > Flags [S.], cksum 0xeee2 (correct), seq 2707606274, ack 9764865, win 29200, options [mss 1460], length 0
14:27:55.340992 IP (tos 0x0, ttl 111, id 14032, offset 0, flags [none], proto TCP (6), length 40) > Flags [R], cksum 0xe48c (correct), seq 9764865, win 0, length 0

But it seems the RST packet after the SYN, ACK is not well crafted: Screenshot from 2014-02-02 16:07:26

More info on SSH bruteforce tool

Knowing that the behavior was scanning from port 6000 and then starting a normal scan, I focused the Suricata dashboard on one IP to see if I had some more information: Screenshot from 2014-02-02 15:21:58

One single IP in the list of scanning hosts is triggering multiple alerts. The event table confirmed this: Screenshot from 2014-02-02 15:16:41

Studying the geographical distribution of the Libssh alerts, it appears that libssh is used in countries other than China: Screenshot from 2014-02-02 15:24:59 So, libssh is not a distinguishing element of the attacks.


A custom attack tool has been deployed on some Chinese IPs. It is a combination of an SSH scanner based on RAW sockets and an SSH bruteforce tool. It tries to gain access to the root account of systems via the ssh service. On an organisational level, it is possible there is a Chinese initiative trying to get the low-hanging fruit (systems with an ssh root account protected by password), or maybe it is just some organization using compromised Chinese IPs to try to get control of more boxes.

January 20, 2014

Eric Leblond: Why you will love nftables

Linux 3.13 is out

Linux 3.13 is out, bringing among other things the first official release of nftables. nftables is the project that aims to replace the existing {ip,ip6,arp,eb}tables framework, aka iptables. The nftables version in Linux 3.13 is not yet complete. Some important features are missing and will be introduced in the following Linux versions. It is already usable in most cases, but complete support (read: nftables at a better level than iptables) should be available in Linux 3.15.

nftables comes with a new command line tool named nft. nft is the successor of iptables and derivatives (ip6tables, arptables), and it has a completely different syntax. Yes, if you are used to iptables, that’s a shock. But there is a compatibility layer that allows you to use iptables even if filtering is done with nftables in the kernel.

There is very little documentation available for now. You can find my nftables quick howto, and there are some other initiatives that should be made public soon.

Some command line examples

Multiple targets on one line

Suppose you want to log and drop a packet. With iptables, you had to write two rules, one for logging and one for drop:

iptables -A FORWARD -p tcp --dport 22 -j LOG
iptables -A FORWARD -p tcp --dport 22 -j DROP

With nft, you can combine both targets:

nft add rule filter forward tcp dport 22 log drop

Easy set creation

Suppose you want to allow packets for different ports and allow different icmpv6 types. With iptables, you need to use something like:

ip6tables -A INPUT -p tcp -m multiport --dports 23,80,443 -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbor-solicitation -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type echo-request -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type router-advertisement -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbor-advertisement -j ACCEPT

With nft, sets can be used on any element in a rule:

nft add rule ip6 filter input tcp dport {telnet, http, https} accept
nft add rule ip6 filter input icmpv6 type { nd-neighbor-solicit, echo-request, nd-router-advert, nd-neighbor-advert } accept

It is easier to write and it is more efficient on the filtering side, as there is only one rule added for each protocol.

You can also use named sets so that they can evolve over time:

# nft -i # use interactive mode
nft> add set global ipv4_ad { type ipv4_address;}
nft> add element global ipv4_ad {, }
nft> add rule ip global filter ip saddr @ipv4_ad drop

And later when a new bad boy is detected:

# nft -i
nft> add element global ipv4_ad { }

One advanced feature of nftables is mapping. It is possible to take two different types of data and link them. For example, we can associate an interface with a dedicated rule set (stored in a chain created beforehand). In the example, the chains are named low_sec and high_sec:

# nft -i
nft> add map filter jump_map { type ifindex : verdict; }
nft> add element filter jump_map { eth0 : jump low_sec; }
nft> add element filter jump_map { eth1 : jump high_sec; }
nft> add rule filter input iif vmap @jump_map

Now, let’s say you have a new dynamic interface ppp1; it is easy to set up filtering for it. Simply add it to the jump_map mapping:

nft> add element filter jump_map { ppp1 : jump low_sec; }

On administration and kernel side

More speed at update

Adding a rule in iptables got dramatically slower as the number of rules grew, which explains why scripts making iptables calls take a long time to complete. This is no longer the case with nftables, which uses atomic and fast operations to update rule sets.

Less kernel update

With iptables, each match or target required a kernel module. So you had to recompile the kernel if you forgot something or wanted to use something new. This is no longer the case with nftables. In nftables, most of the work is done in userspace and the kernel only knows some basic instructions (filtering is implemented in a pseudo-state machine). For example, icmpv6 support has been achieved via a simple patch of the nft tool. This type of modification in iptables would have required a kernel and iptables upgrade.

January 10, 2014

Eric Leblond: A bit of logstash cooking


I’m running a dedicated server to host some internet services. The server runs Debian. I’ve installed logstash on it to do a bit of monitoring of my system logs and suricata. I’ve built a set of dashboards. The screenshot below shows part of the one dedicated to suricata: Suricata dashboard


My data sources were the following:
  • System logs
  • Apache logs
  • Suricata full JSON logs (should be available in suricata 2.0)

System logs

The setup was mostly really easy. I’ve just added a grok pattern to detect successful and unsuccessful connections on the ssh server.

input {
  file {
    type => "linux-syslog"
    path => [ "/var/log/daemon.log", "/var/log/auth.log", "/var/log/" ]
  }
}

filter {
  if [type] == "linux-syslog" {
      grok {
        match => { "message" => "Accepted %{WORD:auth_method} for %{USER:username} from %{IP:src_ip} port %{INT:src_port} ssh2" }
      }
      grok {
        match => { "message" => "Invalid user %{USER:username} from %{IP:src_ip}" }
      }
  }
}

Apache logs
Extract of Apache Dashboard

For apache, it was even easier for access.log:

input {
  file {
    path => [ "/var/log/apache2/*access.log" ]
    type => "apache-access"
  }

  file {
    type => "apache-error"
    path => "/var/log/apache2/error.log"
  }
}

filter {
  if [type] == "apache-access" {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
  }

  if [type] == "apache-error" {
      grok {
        match => { "message" => "%{APACHEERRORLOG}" }
        patterns_dir => ["/var/lib/logstash/etc/grok"]
      }
  }
}

For the error log, I created a grok pattern to get the client IP. So I created a file in the grok directory with:

APACHEERRORLOG \[%{HTTPERRORDATE:timestamp}\] \[%{WORD:severity}\] \[client %{IPORHOST:clientip}\] %{GREEDYDATA:message_remainder}
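
To see what this pattern extracts, here is a rough Python regular expression equivalent. It is an approximation written for illustration: the character classes are simpler than the real grok definitions, and the sample log line is made up.

```python
import re

# Approximate equivalent of the APACHEERRORLOG grok pattern above.
APACHE_ERROR_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] "          # [Fri Jan 10 10:15:32 2014]
    r"\[(?P<severity>\w+)\] "              # [error]
    r"\[client (?P<clientip>[^\]\s]+)\] "  # [client 192.0.2.10]
    r"(?P<message_remainder>.*)"           # rest of the line
)


def parse_apache_error(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = APACHE_ERROR_RE.match(line)
    return match.groupdict() if match else None


sample_line = (
    "[Fri Jan 10 10:15:32 2014] [error] [client 192.0.2.10] "
    "File does not exist: /var/www/favicon.ico"
)
fields = parse_apache_error(sample_line)
```

The clientip field is the one later passed to geoip in the full configuration.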

Netfilter logs
Extract of firewall Dashboard

For Netfilter logs, I’ve decided to play it the old way and to parse kernel log instead of using ulogd:

input {
  file {
    type => "kern-log"
    path => "/var/log/kern.log"
  }
}

filter {
 if [type] == "kern-log" {
        grok {
                match => { "message" => "%{IPTABLES}"}
                patterns_dir => ["/var/lib/logstash/etc/grok"]
        }
 }
}

with IPTABLES being defined in a file placed in the grok directory and containing:

 NETFILTERMAC %{COMMONMAC:dst_mac}:%{COMMONMAC:src_mac}:%{ETHTYPE:ethtype}
 ETHTYPE (?:(?:[A-Fa-f0-9]{2}):(?:[A-Fa-f0-9]{2}))
 IPTABLES1 (?:IN=%{WORD:in_device} OUT=(%{WORD:out_device})? MAC=%{NETFILTERMAC} SRC=%{IP:src_ip} DST=%{IP:dst_ip}.*(TTL=%{INT:ttl})?.*PROTO=%{WORD:proto}?.*SPT=%{INT:src_port}?.*DPT=%{INT:dst_port}?.*)
 IPTABLES2 (?:IN=%{WORD:in_device} OUT=(%{WORD:out_device})? MAC=%{NETFILTERMAC} SRC=%{IP:src_ip} DST=%{IP:dst_ip}.*(TTL=%{INT:ttl})?.*PROTO=%{INT:proto}?.*)

Exim logs
Extract of SMTP dashboard

This part was complicated because exim logs are multiline. So I found a page explaining how to match, at least, the logs for delivered mail. It uses multiline in the filter. Then I added a series of matches to get more information. Each match only gets a part of a message, so I’ve set break_on_match to false so as not to exit when one of the matches succeeds.

input {
  file {
    type => "exim-log"
    path => "/var/log/exim4/mainlog"
  }
}

filter {
  if [type] == "exim-log" {
      multiline {
        pattern => "%{DATE} %{TIME} %{HOSTNAME:msgid} (=>|Completed)"
        what => "previous"
      }
      grok {
        break_on_match => false
        match => [
          "message", "<= %{NOTSPACE:from} H=%{NOTSPACE:server} \[%{IP:src_ip}\]"
        ]
      }
      grok {
        break_on_match => false
        match => [
          "message", "=> %{USERNAME:username} <%{NOTSPACE:dest}> R=%{WORD:transport}"
        ]
      }
      grok {
        break_on_match => false
        match => [
          "message", "=> %{NOTSPACE:dest} R=%{WORD:transport}"
        ]
      }
      grok {
        break_on_match => false
        match => [
          "message", "%{DATE} %{TIME} H=%{NOTSPACE:server}%{GREEDYDATA} \[%{IP:src_ip}\] F=<%{NOTSPACE:mail_to}> temporarily rejected RCPT <%{NOTSPACE:dest}>: greylisted"
        ]
      }
  }
}

Pie with file types

Suricata full JSON output is JSON so the configuration in logstash is trivial:

input {
   file {
      path => ["/var/log/suricata/eve.json" ]
      codec =>   json
   }
}

You can download a sample Suricata Dashboard to use in your logstash installation.

The full configuration

Below is the full configuration. There is only one thing which I did not mention: for most source IPs, I use geoip to get an idea of the location of the IP.

input {
  file {
    type => "linux-syslog"
    path => [ "/var/log/daemon.log", "/var/log/auth.log", "/var/log/" ]
  }

  file {
    path => [ "/var/log/apache2/*access.log" ]
    type => "apache-access"
  }

  file {
    type => "apache-error"
    path => "/var/log/apache2/error.log"
  }

  file {
    type => "exim-log"
    path => "/var/log/exim4/mainlog"
  }

  file {
    type => "kern-log"
    path => "/var/log/kern.log"
  }

  file {
    path => ["/var/log/suricata/eve.json" ]
    codec =>   json
  }
}

filter {
  if [type] == "apache-access" {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
  }

  if [type] == "linux-syslog" {
      grok {
        match => { "message" => "Accepted %{WORD:auth_method} for %{USER:username} from %{IP:src_ip} port %{INT:src_port} ssh2" }
      }
  }

  if [type] == "apache-error" {
      grok {
        match => { "message" => "%{APACHEERRORLOG}" }
        patterns_dir => ["/var/lib/logstash/etc/grok"]
      }
  }

  if [type] == "exim-log" {
      multiline {
        pattern => "%{DATE} %{TIME} %{HOSTNAME:msgid} (=>|Completed)"
        what => "previous"
      }
      grok {
        break_on_match => false
        match => [
          "message", "<= %{NOTSPACE:from} H=%{NOTSPACE:server} \[%{IP:src_ip}\]"
        ]
      }
      grok {
        break_on_match => false
        match => [
          "message", "=> %{USERNAME:username} <%{NOTSPACE:dest}> R=%{WORD:transport}"
        ]
      }
      grok {
        break_on_match => false
        match => [
          "message", "=> %{NOTSPACE:dest} R=%{WORD:transport}"
        ]
      }
      grok {
        break_on_match => false
        match => [
          "message", "%{DATE} %{TIME} H=%{NOTSPACE:server}%{GREEDYDATA} \[%{IP:src_ip}\] F=<%{NOTSPACE:mail_to}> temporarily rejected RCPT <%{NOTSPACE:dest}>: greylisted"
        ]
      }
  }

  if [type] == "kern-log" {
      grok {
        match => { "message" => "%{IPTABLES}"}
        patterns_dir => ["/var/lib/logstash/etc/grok"]
      }
  }

  if [src_ip]  {
    geoip {
      source => "src_ip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }

  if [clientip]  {
    geoip {
      source => "clientip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }

  if [srcip]  {
    geoip {
      source => "srcip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }
}

January 04, 2014

Patrick McHardy: test


November 27, 2013

Eric Leblond: What’s new in ulogd 2.0.3

New features in ulogd 2.0.3 release

Database framework update

ulogd 2.0.3 implements two new optional modes for database connections:

  • backlog system to avoid event loss in case of database downtime
  • running mode where acquisition is made in one thread and queries to databases are made in separate threads to reduce latency in the treatment of kernel messages

These two modes are described below.

Postgresql update

The Postgresql output plugin was only offering a small subset of Postgresql connection-related options. It is now possible to use the connstring variable to pass any of the libpq parameter keywords. If set, this variable takes precedence over the other variables.

One interest of connstring is to be able to use a SSL-encrypted connection to the database by using the sslmode keyword:

connstring="host=localhost port=4321 dbname=nulog user=nupik password=changeme sslmode=verify-full sslcert=/etc/ssl/pgsql-cert.pem sslkey=/etc/ssl/pgsql-key.pem sslrootcert=/etc/ssl/pgsql-rootcert.pem"

Event loss prevention

ulogd 2.0.3 implements a backlog system for all database output plugins using the abstraction framework for database connection. At the time of writing, these are MySQL, PostgreSQL and DBI. Memory is dedicated to storing the queries that cannot be run because the database is unavailable. Once the database is back, the queries are replayed in order.

To activate this mode, you need to set the backlog_memcap value in the database definition.


Set backlog_memcap to the amount of memory that will be allocated to store events if the database is temporarily down. The variable backlog_oneshot_requests stores the number of queries to process at once before reading a kernel message.
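
The behaviour described above can be pictured with a small Python model. This is not ulogd's C implementation: the class and method names are hypothetical, and dropping queries once the cap is reached is an assumption made for the sketch.

```python
from collections import deque


class QueryBacklog:
    """Memory-capped FIFO of queries waiting for the database."""

    def __init__(self, memcap_bytes):
        self.memcap = memcap_bytes  # plays the role of backlog_memcap
        self.used = 0
        self.queue = deque()

    def store(self, query):
        """Keep a query for later; return False if the cap is reached."""
        size = len(query.encode())
        if self.used + size > self.memcap:
            return False  # assumption: the event is lost
        self.queue.append(query)
        self.used += size
        return True

    def replay(self, execute, max_requests):
        """Run up to max_requests queued queries in order.

        max_requests plays the role of backlog_oneshot_requests;
        execute(query) must return True on success.
        """
        done = 0
        while self.queue and done < max_requests:
            query = self.queue[0]
            if not execute(query):
                break  # database still down: keep the query queued
            self.queue.popleft()
            self.used -= len(query.encode())
            done += 1
        return done
```

Replaying only a bounded number of queries per call leaves time to read new kernel messages between two batches.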

Multithreaded database output

If the ring buffer mode is active, a thread will be created for each stack involving the configured database. It will connect to the database and execute the queries. The idea is to avoid buffer overruns by minimizing the time needed to process a kernel message. Doing synchronous SQL requests, as was done before, caused a delay which could cause some messages to be lost in case of a burst from the kernel side. With this new mode, the time to process a kernel message is equal to the time needed to format the query.

To activate this mode, you need to set ring_buffer_size to a value greater than 1. The value is the number of SQL requests to keep in the ring buffer.


The ring_buffer_size takes precedence over the backlog_memcap value, and the backlog is disabled if the ring buffer is active, as the ring buffer also provides packet loss prevention. ring_buffer_size is the maximum number of queries to keep in memory.
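
The split between acquisition and database threads can be modelled in Python with a bounded queue. This is a conceptual sketch, not ulogd's code: the function name, the SQL text and the choice to drop messages when the buffer is full are all assumptions.

```python
import queue
import threading


def run_pipeline(messages, execute, ring_buffer_size):
    """Format queries in the calling thread, execute them in a writer thread.

    Returns the messages that could not be queued because the buffer was full.
    """
    ring = queue.Queue(maxsize=ring_buffer_size)
    dropped = []

    def writer():
        # Database side: slow queries no longer block acquisition.
        while True:
            item = ring.get()
            if item is None:  # sentinel: no more queries
                break
            execute(item)

    thread = threading.Thread(target=writer)
    thread.start()
    for msg in messages:
        # Acquisition side: only the formatting cost is paid here.
        item = ("INSERT INTO log (msg) VALUES (%s)", (msg,))
        try:
            ring.put_nowait(item)
        except queue.Full:
            dropped.append(msg)  # assumption: overflow means loss
    ring.put(None)
    thread.join()
    return dropped
```

Calling `run_pipeline(["a", "b"], print, ring_buffer_size=8)` would print the two formatted queries from the writer thread while the caller is already free to handle the next message.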

November 18, 2013

Eric Leblond: Using linux perf tools for Suricata performance analysis


Perf is a great tool to analyse performance on Linux boxes. For example, perf top will give you this type of output on a box running Suricata on a high speed network:

Events: 32K cycles
 28.41%  suricata            [.] SCACSearch
 19.86%        [.] tolower
 17.83%  suricata            [.] SigMatchSignaturesBuildMatchArray
  6.11%  suricata            [.] SigMatchSignaturesBuildMatchArrayAddSignature
  2.06%  suricata            [.] tolower@plt
  1.70%  [.] pthread_mutex_trylock
  1.17%  suricata            [.] StreamTcpGetFlowState
  1.10%        [.] __memcpy_ssse3_back
  0.90%  [.] pthread_mutex_lock

The functions are sorted by CPU consumption. Using the arrow keys, it is possible to jump into the annotated code to see where most CPU cycles are used.

This is really useful but in the case of a function like pthread_mutex_trylock, the interesting part is to be able to find where this function is called.

Getting function call graph in perf

This stack overflow question led me to the solution.

I started by building suricata with the -fno-omit-frame-pointer option:

./configure --enable-pfring --enable-luajit CFLAGS="-fno-omit-frame-pointer"
make install

Once suricata was restarted (with pid being 9366), I was then able to record the data:

sudo perf record -a --call-graph -p 9366

Extracting the call graph was then possible by running:

sudo perf report --call-graph --stdio

The result is a huge detailed report. For example, here’s the part on pthread_mutex_lock:

     0.94%  Suricata-Main     [.] pthread_mutex_lock
            --- pthread_mutex_lock
               |--48.69%-- FlowHandlePacket
               |          |
               |          |--53.04%-- DecodeUDP
               |          |          |
               |          |          |--95.84%-- DecodeIPV4
               |          |          |          |
               |          |          |          |--99.97%-- DecodeVLAN
               |          |          |          |          DecodeEthernet
               |          |          |          |          DecodePfring
               |          |          |          |          TmThreadsSlotVarRun
               |          |          |          |          TmThreadsSlotProcessPkt
               |          |          |          |          ReceivePfringLoop
               |          |          |          |          TmThreadsSlotPktAcqLoop
               |          |          |          |          start_thread
               |          |          |           --0.03%-- [...]
               |          |          |
               |          |           --4.16%-- DecodeIPV6
               |          |                     |
               |          |                     |--97.59%-- DecodeTunnel
               |          |                     |          |
               |          |                     |          |--99.18%-- DecodeTeredo
               |          |                     |          |          DecodeUDP
               |          |                     |          |          DecodeIPV4
               |          |                     |          |          DecodeVLAN
               |          |                     |          |          DecodeEthernet
               |          |                     |          |          DecodePfring
               |          |                     |          |          TmThreadsSlotVarRun
               |          |                     |          |          TmThreadsSlotProcessPkt
               |          |                     |          |          ReceivePfringLoop
               |          |                     |          |          TmThreadsSlotPktAcqLoop
               |          |                     |          |          start_thread
               |          |                     |          |
               |          |                     |           --0.82%-- DecodeIPV4
               |          |                     |                     DecodeVLAN
               |          |                     |                     DecodeEthernet
               |          |                     |                     DecodePfring
               |          |                     |                     TmThreadsSlotVarRun
               |          |                     |                     TmThreadsSlotProcessPkt
               |          |                     |                     ReceivePfringLoop
               |          |                     |                     TmThreadsSlotPktAcqLoop
               |          |                     |                     start_thread
               |          |                     |
               |          |                      --2.41%-- DecodeIPV6
               |          |                                DecodeTunnel
               |          |                                DecodeTeredo
               |          |                                DecodeUDP
               |          |                                DecodeIPV4
               |          |                                DecodeVLAN
               |          |                                DecodeEthernet
               |          |                                DecodePfring
               |          |                                TmThreadsSlotVarRun
               |          |                                TmThreadsSlotProcessPkt
               |          |                                ReceivePfringLoop
               |          |                                TmThreadsSlotPktAcqLoop
               |          |                                start_thread
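
A note on reading these numbers: sibling entries at each level add up to 100% (for example 95.84% + 4.16% under DecodeUDP), which suggests that each percentage is relative to its parent entry rather than to the whole profile. Assuming that reading is right, a small sketch like the following turns one branch into a share of all samples. The helper name is made up for illustration; the figures are copied from the report above.

```python
# Sketch: convert the relative percentages of one call-graph branch into a
# share of all samples. Assumes each percentage is relative to its parent
# entry, as the sibling sums in the report suggest.

def absolute_share(symbol_percent, branch_percents):
    """Multiply a symbol's overall share by each relative branch share."""
    share = symbol_percent
    for percent in branch_percents:
        share *= percent / 100.0
    return share

# pthread_mutex_lock (0.94%) called via
# FlowHandlePacket <- DecodeUDP <- DecodeIPV4 <- DecodeVLAN
chain = [48.69, 53.04, 95.84, 99.97]
ipv4_udp_share = absolute_share(0.94, chain)
print(f"{ipv4_udp_share:.4f}% of all samples")
```

Under that assumption, the plain UDP over IPv4 path accounts for roughly a quarter of the time spent in pthread_mutex_lock.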

October 28, 2013

Eric Leblond: Logstash and Suricata for the old guys


Logstash is an open source tool for managing events and logs. It uses elasticsearch for storage and has a really nice interface named Kibana. One of the easiest input formats to use is JSON.

Suricata is an IDS/IPS which has some interesting logging features. Version 2.0 will feature a JSON export for all logging subsystems. It will then be possible to output in JSON format:

  • HTTP log
  • DNS log
  • TLS log
  • File log
  • IDS Alerts

For now, only File log is available in JSON format. It extracts metadata from files transferred over HTTP.

Peter Manev has described how to connect Logstash Kibana and Suricata JSON output. Installation is really simple, just download logstash from logstash website, write your configuration file and start the thing.

Kibana interface is really impressive: Kibana Screenshot

But at the time I started to look at the document, a few things were missing:

  • Geoip is not supported
  • All fields containing space appear as multiple entries

Geoip support

This one was easy. You simply have to edit the logstash.conf file to add a section about geoip:

input {
  file {
    path => "/home/eric/builds/suricata/var/log/suricata/files-json.log"
    codec => json
    # This format tells logstash to expect 'logstash' json events from the file.
    #format => json_event
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }
}

#geoip part
filter {
  if [srcip] {
    geoip {
      source => "srcip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

It adds a filter that checks for the presence of srcip and adds geoip information to the entry. The tricky thing is the add_field part, which creates an array that has to be used when adding a map to a Kibana dashboard. See the following screenshot for explanation: Creating new map in Kibana
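
To make the add_field trick more concrete, here is a rough Python sketch of what happens to a single event. It is not Logstash code: the event layout, the sample address and the sample coordinates are hypothetical, and only the field names come from the configuration above. It assumes that adding the same field twice accumulates the values into an array, which is the behavior the filter relies on.

```python
# Sketch of what the geoip filter's add_field/mutate pair does to one event.
# Sample values are hypothetical; field names follow the configuration.

def add_field(event, path, value):
    """Mimic add_field: a repeated key accumulates its values into a list."""
    target = event
    for key in path[:-1]:
        target = target.setdefault(key, {})
    last = path[-1]
    if last not in target:
        target[last] = value
    elif isinstance(target[last], list):
        target[last].append(value)
    else:
        target[last] = [target[last], value]

event = {"srcip": "192.0.2.10",
         "geoip": {"longitude": "2.35", "latitude": "48.85"}}

if "srcip" in event:
    # Longitude first, then latitude, as in the two add_field lines.
    add_field(event, ["geoip", "coordinates"], event["geoip"]["longitude"])
    add_field(event, ["geoip", "coordinates"], event["geoip"]["latitude"])
    # Equivalent of: mutate { convert => [ "[geoip][coordinates]", "float" ] }
    event["geoip"]["coordinates"] = [float(v) for v in event["geoip"]["coordinates"]]

print(event["geoip"]["coordinates"])
```

The result is a two-element float array in longitude, latitude order, which is the field to select when creating the map panel.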

You may have the following error:

You must specify 'database => ...' in your geoip filter

In this case, you need to specify the path to the geoip database by adding the database keyword to geoip configuration:

#geoip part
filter {
  if [srcip] {
    geoip {
      source => "srcip"
      target => "geoip"
      database => "/path/to/GeoLiteCity.dat"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

Once the file is written, you can start logstash:

java -jar /home/eric/builds/logstash/logstash-1.2.2-flatjar.jar agent -f /home/eric/builds/logstash/logstash.conf --log /home/eric/builds/logstash/log/logstash-indexer.out -- web

See Logstash Kibana and Suricata JSON output for detailed information on setup.

Logstash indexing and mapping

Before logstash 1.3.1, fixing the space issue was really complex. Since that version, all indexed fields are provided with a .raw field that can be used to avoid the problem with spaces in values. So now, you can simply use something like geoip.country_name.raw in the definition of a graph in Kibana instead of geoip.country_name. With that change, United States no longer appears as United and States.

Fixing the space issue for logstash versions prior to 1.3.1 was far more complicated for an old guy like me used to configuration files. While finding the origin of the behavior was easy, fixing it was more painful. A simple googling showed me that, by default, elasticsearch storage splits strings at spaces when indexing. To fix this, you have to specify that the field should not be analyzed during indexing: "index":"not_analyzed"
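
The effect is easy to picture with a toy example. The sketch below is only an illustration in Python, not what elasticsearch really does internally: the analyzed case is reduced to a plain split on spaces, and the country list is made up.

```python
from collections import Counter

def analyzed_terms(value):
    """Very rough stand-in for an analyzed string field: split on spaces."""
    return value.split()

def not_analyzed_terms(value):
    """A not_analyzed field is indexed as one single term."""
    return [value]

# Hypothetical values of geoip.country_name in four events
countries = ["United States", "United States", "France", "United Kingdom"]

analyzed = Counter(t for c in countries for t in analyzed_terms(c))
raw = Counter(t for c in countries for t in not_analyzed_terms(c))

print(analyzed.most_common())  # "United" shows up as its own entry
print(raw.most_common())       # whole country names are kept
```

With the analyzed field, a terms panel counts United, States and Kingdom separately, which is exactly the symptom seen in Kibana.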

That looked easy at first, but logstash is not using a configuration file for indexing and mapping. In fact, you need to interact with elasticsearch via HTTP requests. The second problem is that the indexes are dynamically generated, so there is a template system that you can use to have indexes created the way you want.

Creating a template is easy. You simply do something like:

curl -XPUT http://localhost:9200/_template/logstash_per_index -d '
{
    "template" : "logstash*",
    MAGIC HERE
}'

This will create a template that will be applied to all newly created indexes with a name matching “logstash*”. The difficult part is to know what to put in MAGIC HERE and to check whether “logstash*” will match the created index. To check this, you can retrieve all current mappings:

curl -XGET 'http://localhost:9200/_all/_mapping'

You then get a list of mappings and you can check the name. But the best part is that you get a base text to update the mapping definition part. With Suricata file log and geoip activated, the following configuration is working well:

curl -XPUT http://localhost:9200/_template/logstash_per_index -d '
    "template" : "logstash*",
    "mappings" : {
      "logs" : {
         "properties": {
                  "city_name":{"type":"string", "index":"not_analyzed"},
                  "country_name":{"type":"string", "index":"not_analyzed"},
                  "real_region_name":{"type":"string", "index":"not_analyzed"},
                  "region_name":{"type":"string", "index":"not_analyzed"},
            "http_user_agent":{"type":"string", "index":"not_analyzed", "omit_norms":true, "index_options":"docs"},
            "magic":{"type":"string", "index":"not_analyzed", "omit_norms":true, "index_options":"docs"},

I’ve added some "index":"not_analyzed" and improved the type for some of the fields. For example, srcip has been defined as an IP address. This allows range searching in Kibana like

["" TO ""]
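
For those who prefer to generate the template body rather than hand-edit JSON, here is a minimal Python sketch. Several things in it are assumptions rather than the exact template used here: the geoip fields are placed under a "geoip" object (because the filter uses target => "geoip"), srcip is given the "ip" type as described above, and the helper name is made up. The wildcard check with fnmatch only approximates how elasticsearch matches template patterns, and logstash-2013.10.27 is simply an example of a daily index name.

```python
import fnmatch
import json

def not_analyzed(**extra):
    """String field stored as a single term, not split at indexing time."""
    field = {"type": "string", "index": "not_analyzed"}
    field.update(extra)
    return field

template = {
    "template": "logstash*",
    "mappings": {
        "logs": {
            "properties": {
                # Assumption: geoip fields sit under the "geoip" target object.
                "geoip": {
                    "properties": {
                        "city_name": not_analyzed(),
                        "country_name": not_analyzed(),
                        "real_region_name": not_analyzed(),
                        "region_name": not_analyzed(),
                    }
                },
                "http_user_agent": not_analyzed(omit_norms=True, index_options="docs"),
                "magic": not_analyzed(omit_norms=True, index_options="docs"),
                # Assumption based on the text: srcip is typed as an IP address.
                "srcip": {"type": "ip"},
            }
        }
    },
}

# Does the template pattern cover a daily logstash index name?
index_name = "logstash-2013.10.27"
print(fnmatch.fnmatch(index_name, template["template"]))

# JSON body that could be passed to the _template PUT request
print(json.dumps(template, indent=2))
```

Generating the body this way at least guarantees balanced braces, which is the most common way to get the curl command wrong.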

The next point is to update the index format. To do so, you can get the name of the current index, delete it and recreate it. To get the name, you can use the mapping listing:

curl -XGET 'http://localhost:9200/_all/_mapping'

The return is something like:


So now, we can destroy this index named “logstash-2013.10.27” and have it recreated with the correct settings:

curl -XDELETE 'http://localhost:9200/logstash-2013.10.27'
curl -XPUT 'http://localhost:9200/logstash-2013.10.27'

We need data to be reindexed so:

curl -XGET 'http://localhost:9200/logstash-2013.10.27/_refresh'

It may also be a good idea to wait for new data, as it seems to trigger an update of what elasticsearch is sending.

October 17, 2013

Rusty Russell: 2014: Rusty’s Must See List

Delightedly finished reading through the program.  Some nasty clashes have me still arguing with myself, but here are my personal compulsory attendance talks.  Your preferences will no-doubt differ, so I’ve tried to explain my reasons:


  • Tridgell: Open Hardware Differential GPS – I spoke to Tridge about this, and the abstract completely undersells it.
  • Corbet: Kernel – Jon’s kernel talks are great for non-kernel people, but for me it’s about seeing the forest through the trees.
  • McKenney: Parallel Verification – Paul’s spoken with me about this, but I want to hear the practical side to see how I can apply it.
  • Heo: Kernel per-cpu  – I persuaded Tejun to submit this; his per-cpu work was elegant (mine, on which this was built, was merely functional).
  • Airlie: Virtual GPU – Sorry Bdale (with whom it clashes): I have wanted a virtio GPU for so long, I need to see it.
  • Packard: Zero-copy Compositing  – Keith is always good, and graphics performance is fascinating.
  • Suehle: Raspberry Pi – Generally O’Reilly books are well researched, so I expect great content here.
  • Isaacs: CTDB Bugs – I was around when he was finding some of these, and there are some fascinating surprises here.

September 05, 2013

Harald Welte: Problems with OpenVPN on high-latency satellite links

So far I never had a need to look in detail at how the OpenVPN protocol actually looks on the wire. It seems like not many people have had that close a look, as the wireshark plugin is fairly recent (from 2012 I think) while OpenVPN has been around for ten years longer than that. If I was an OpenVPN developer, the wireshark plugin would be the first thing I'd write to help debugging and development. At least that's what I've been doing from OpenPCD to SIMtrace and through the various GSM and other protocols I encounter...

The reason for my current investigation is some quite strange and yet-unexplained problems when running OpenVPN on high-latency satellite links. I'm not talking about high-bandwidth VSAT or systems with dedicated / guaranteed bandwidth. The links I'm seeing often have RTT (as seen by ICMP echo) of 2 seconds, sometimes even 5. This is of course not only the satellite link, but includes queuing on the ground, possibly the space segment and of course the terminal, including (possibly) access arbitration.

What struck me as _very_ odd is that OpenVPN is sending tons of UDP messages with ridiculously small size during the TLS handshake when bringing up the tunnel. Further investigation shows that they actually internally configure an MTU of '0' for the link, which seems to be capped at 100 bytes control payload, plus HMAC and OpenVPN header resulting in 124 to 138 bytes UDP payload.

Now you have to consider that the server certificate (possibly including even a CA certificate) can be quite large, plus all the gazillions of TLS handshaking options in ServerHello, the first message from server to client. This means that OpenVPN transmits that ServerHello in something like 40 to 60 fragments of 100 bytes each! And each of the fragments will have to be acknowledged by the remote end, leading to 80 to 120 UDP/IP packets _only_ for the delivery of the TLS ServerHello.
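
The arithmetic behind those numbers is simple enough to sketch in a few lines of Python. The message sizes of 4000 and 6000 bytes are just what 40 to 60 fragments of 100 bytes imply, and the 1000-byte payload in the second loop is a hypothetical value used for comparison, not something taken from OpenVPN.

```python
import math

def handshake_packets(message_bytes, payload_bytes=100):
    """Return (fragments, total packets) when every fragment is ACKed."""
    fragments = math.ceil(message_bytes / payload_bytes)
    return fragments, 2 * fragments

# ServerHello sizes implied by 40 to 60 fragments of 100 bytes each
for size in (4000, 6000):
    fragments, packets = handshake_packets(size)
    print(size, fragments, packets)

# The same messages with a hypothetical 1000-byte control payload
for size in (4000, 6000):
    print(size, handshake_packets(size, payload_bytes=1000))
```

On a link with an RTT of several seconds, every one of those extra round trips is painfully visible.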

Then you start reviewing the hundreds of OpenVPN configuration options, many of them related to MTU, MSS, fragmentation, etc. There is none for that insanely small default of 100 bytes for control packets during the handshake. I even read through the related source code, only to find that indeed this behavior seems hard-coded. Some time later I had written a patch to add this option, thanks to Free Software. It seems to work on client and server and brings the ServerHello down to a much smaller 4-6 messages.

The fun continues when you see that the timeout for re-transmitting fragments that have not been ACKed yet is 2 seconds. At my satellite RTT times this of course leads to lots of unneeded re-transmissions, simply because the ACK hasn't made its way back to the sender of the original message yet. Luckily there's a configuration option for that.
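
As a back-of-the-envelope illustration, the following Python sketch counts how many retransmissions of one fragment are fired before its ACK can possibly be back. It is a simplified model, not OpenVPN's actual timer logic: the function name is made up, the retransmission interval is assumed to be fixed by default, and the backoff parameter is only there to try an exponentially growing interval.

```python
def spurious_retransmissions(rtt, timeout=2.0, backoff=1.0):
    """Count retransmissions fired before the ACK can possibly arrive.

    rtt and timeout are in seconds; backoff multiplies the interval after
    each retransmission (1.0 = fixed interval, 2.0 = doubling).
    """
    if timeout <= 0 or backoff < 1.0:
        raise ValueError("timeout must be positive and backoff >= 1.0")
    count = 0
    elapsed = timeout       # time of the first retransmission
    interval = timeout
    while elapsed < rtt:
        count += 1
        interval *= backoff
        elapsed += interval
    return count

for rtt in (0.5, 2.0, 5.0):
    print(rtt, spurious_retransmissions(rtt))
```

Under these assumptions a 5 second RTT wastes two transmissions per fragment with a fixed 2 second timer, while a terrestrial link with sub-second RTT never retransmits needlessly.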

After the patch and changing that option, the protocol trace looks much more sane. However, I still have problems establishing a tunnel in a number of cases. For some odd reason, the last fragment of the ServerHello is not acknowledged by the client, no matter whether patched or unpatched OpenVPN is being used. I get acknowledgements always only up to fragment N-1 after having transmitted N. That last fragment is then re-transmitted by the server with exponential back-off, and finally some 60 seconds later the server gives up as the TLS handshake didn't finish within that time. Extending the TLS handshake timeout to 120 seconds also doesn't help.

I'm not quite sure why something like 39 out of 39 fragments all get delivered reliably and acknowledged, but always the last fragment (40) doesn't make it to the remote side. That's certainly not random packet loss, but a very deterministic one. Let's see if I can still manage to find out what that might be...

July 27, 2013

Rusty Russell: Git prompt for bash

I don’t know who wrote this originally, but this is from my .bashrc.  Tridge’s is simpler, but has colour!

Before this, I avoided git branches in favour of multiple copies of repositories because I use my prompt to provide location.  This provided the missing piece…

# Git me harder!
__git_ps1 ()
{
    local g="$(git rev-parse --git-dir 2>/dev/null)"
    if [ -n "$g" ]; then
        local r
        local b
        # The r="|..." state suffixes follow git's contrib completion script.
        if [ -d "$g/../.dotest" ]; then
            r="|AM/REBASE"
            b="$(git symbolic-ref HEAD 2>/dev/null)"
        elif [ -d "$g/.dotest-merge" ]; then
            r="|REBASE-m"
            b="$(cat $g/.dotest-merge/head-name)"
        elif [ -f "$g/MERGE_HEAD" ]; then
            r="|MERGING"
            b="$(git symbolic-ref HEAD 2>/dev/null)"
        else
            if [ -f $g/BISECT_LOG ]; then
                r="|BISECTING"
            fi
            if ! b="$(git symbolic-ref HEAD 2>/dev/null)"; then
                b="$(cut -c1-7 $g/HEAD)..."
            fi
        fi
        if [ -n "$1" ]; then
            printf "$1" "${b##refs/heads/}$r"
        else
            printf " (%s)" "${b##refs/heads/}$r"
        fi
    fi
}


July 23, 2013

Rusty Russell: On Linux-Kernel Mailing List Behavior

As raised recently by Sarah Sharp, the Linux Kernel mailing list (lkml) has a reputation as an intimidating place.  The context (covered so well by LWN) was that Greg Kroah-Hartman, the stable maintainer, is seen as a soft touch who accepts patches Linus wouldn’t.

There’s been much uninformed discussion from those outside lkml, so let’s start with a common basis:

  1. Sarah Sharp is an established and respected kernel maintainer.  She’s made it.
  2. Linus (and other developers) are human, and sometimes write in anger.

Now my opinions, as someone who cares about this issue and has been working on the kernel for about 16 years.

The kernel mailing list is much friendlier than it used to be: some of its reputation is now undeserved.  Linus is unreserved in criticising code or actions, but rarely crosses into ad-hominem.  His absolutist statements reduce RTT by telling you what is required; geeks love to argue, but it’s pointless because it’s his git tree.

That said, imitating Linus on lkml causes problems; without his authority, loudly claiming absolutes is simply ranting.  This escalates until it’s remarkably hard to avoid crossing into personal attacks; most of us inevitably double-down when we’re criticized, and train-wreck ensues.

I plan to follow Sarah’s example and respond when someone’s abusive.  Making it clear what’s expected should make things more pleasant eventually.  It’s been about ten years since I decided to reduce my flames to a single post every year; I’m now going to aim for zero (aka. “What Would Sarah Sharp Do?”)

July 09, 2013

Rusty Russell: 6 Technical Things I Learned About Bitcoin

I’ve been collecting these as I research the bitcoin protocol, so I thought it was worth posting about.  None of these are groundbreaking, but these are what surprised me as I deepened my understanding.

10 Minute Blocks.  Currently 9 minutes.  But usually 7 minutes.

Everyone talks about a block every 10 minutes, but that’s the long-term mean.  Spikes in exchange rates are followed fairly closely by spikes in network hashrate, and ASIC miners are ramping up to meet demand.  As difficulty adjustment happens every 2016 blocks (ideally 2 weeks), there’s a lag. Over the life of bitcoin, and over the last year the average is almost exactly 600 seconds, but over the last 3 months it’s been 520 seconds.  The last month is 542 seconds, so hashrate acceleration is slowing.

But a subtler effect is shown when we look at the median, rather than the mean: it’s just under 7 minutes.  This is because the time to hit the target hash is not a normal distribution at all.  There’s probably a fancy name for this spike with an exponential tail, but I’ve graphed here a recent set of 2016 blocks (fortnight 115) showing the distribution of block times in minute-wide buckets.

Now, these stats were using timestamps in the blocks, rather than the actual observed times, but I’m assuming on average that they’re correct.
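
If block discovery is modelled as a memoryless process (each hash attempt has the same tiny chance of success), the time between blocks follows an exponential distribution, and its median is the mean times ln 2. The Python sketch below checks that against the "just under 7 minutes" figure; the simulation part uses an arbitrary seed and is only a sanity check, not real block data.

```python
import math
import random

MEAN_BLOCK_SECONDS = 600.0

# Closed form: the median of an exponential distribution is mean * ln(2).
median_seconds = MEAN_BLOCK_SECONDS * math.log(2)
print(f"median: {median_seconds / 60:.2f} minutes")

# Cross-check by simulating one retarget period of 2016 block intervals.
rng = random.Random(115)  # arbitrary seed, for repeatability only
samples = sorted(rng.expovariate(1.0 / MEAN_BLOCK_SECONDS) for _ in range(2016))
simulated_median = (samples[1007] + samples[1008]) / 2
print(f"simulated median: {simulated_median / 60:.2f} minutes")
```

So a 10 minute mean and a median a little under 7 minutes are two views of the same distribution, not a contradiction.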

Actually, 10.005 Minute Blocks

The bitcoin client calculates how long an interval took by subtracting the timestamp at the beginning of the interval from the one at the end of the interval of 2016 blocks.  There are 2015 spaces between 2016 blocks, but the code divides by 2016.  But I’m sure no one else cares about that 0.3 second mistake, since block times are never that precise anyway.
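
The off-by-one is easy to check numerically. This is just the arithmetic from the paragraph above written out in Python, with made-up constant names:

```python
BLOCKS_PER_PERIOD = 2016
TARGET_SECONDS = 600

# The retarget aims for 2016 * 600 seconds, but only 2015 intervals are
# measured, so each real interval ends up slightly longer than 600 seconds.
intervals = BLOCKS_PER_PERIOD - 1
effective_seconds = BLOCKS_PER_PERIOD * TARGET_SECONDS / intervals

print(f"{effective_seconds:.3f} s per block")
print(f"{effective_seconds / 60:.3f} minutes per block")
print(f"{effective_seconds - TARGET_SECONDS:.2f} s longer than intended")
```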

Politics In The Genesis Block.  Or Not.

It’s common to point to the text in the very first block “The Times 03/Jan/2009 Chancellor on brink of second bailout for banks” as a political statement by Satoshi.  While I’m sure the headline amused the author, we need look no further than the initial Bitcoin Paper, section 3:

A timestamp server works by taking a hash of a block of items to be timestamped and widely publishing the hash, such as in a newspaper or Usenet post [2-5]. The timestamp proves that the data must have existed at the time, obviously, in order to get into the hash.

In other words, it simply proves that there was no pre-mining going on.  It would be interesting to get an accurate timestamp of the initial release of bitcoin and examine London Times headlines around that date to see if it was cherry-picked, or happy coincidence.

Crazy Address Encoding

Bitcoin addresses are a 25-byte number.  It’s usually encoded using 58 characters (numbers and letters, omitting zero, capital I and O, lower-case l to avoid confusion). Dividing by 58 is a bit of a pain, but doing crypto means we have big number libraries lying around which we can use.

But it’s not the straight encoding one might expect, which would result in 37 character addresses.  You might expect that leading zeroes can be omitted for compactness, but in fact, leading whole zero bytes are encoded separately. This gives variable-length addresses of between 27 and 34 characters and a second loop to encode and decode them.
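
Here is a minimal Python sketch of that two-step encoding, as I understand it: leading zero bytes become one '1' character each, and the rest is treated as a big number that is repeatedly divided by 58. It only covers the base-58 step; the version byte and checksum that make up a real address are left out, and the function names are my own.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(data: bytes) -> str:
    # First loop: count leading zero bytes, each encoded as one '1'.
    zeros = len(data) - len(data.lstrip(b"\x00"))
    # Second loop: treat the bytes as one big number and divide by 58.
    number = int.from_bytes(data, "big")
    digits = []
    while number > 0:
        number, remainder = divmod(number, 58)
        digits.append(ALPHABET[remainder])
    return "1" * zeros + "".join(reversed(digits))

def b58decode(text: str) -> bytes:
    # Leading '1' characters map back to leading zero bytes.
    zeros = len(text) - len(text.lstrip("1"))
    number = 0
    for char in text:
        number = number * 58 + ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * zeros + body

sample = bytes(range(25))  # hypothetical 25-byte value starting with a zero byte
encoded = b58encode(sample)
print(encoded, len(encoded))
print(b58decode(encoded) == sample)
```

Python's built-in big integers play the role of the big number library here.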

Anonymity Off By Default

Anonymity is hard, but I was surprised to see’s page about my donation to Unfilter correctly geolocated to my home town!  Perhaps it’s a fluke, but I was taken aback by how clear it was.

CVEs in Bitcoin

Like any software, there have been flaws in the bitcoin reference client: obviously there has been a great deal of scrutiny and concern.  Unlike most projects, there is a superb wiki page which details each vulnerability, with consequences and deployment status across the network:

Corrections welcome!

July 02, 2013

Rusty Russell: VIRTIO Growing Up: OASIS Standard Technical Committee

Over the last few years, interest in virtio has begun compounding.  FreeBSD have their bhyve implementation, there’s an MMIO bus and SCSI endpoint implementation, and I’ve been fielding more queries about various alternate implementations.  While it’s taken longer than I’d hoped, the effort hasn’t waned as I feared.

So I have carved out some time this year to turn this draft into a real, consensual standard with the trappings expected by those outside the normal Linux/KVM sphere (such as an IP policy). I know I said I’d never get involved in a standard process again after the FHS, but OASIS seems like the right umbrella to cleanly and efficiently run this effort.

There are limitations and workarounds in the current draft and implementations.  None are fatal, but they make a case for a flag day change for 1.0 (with backwards compatibility possible for implementations which want that).  More compelling, to me, is the chance for other vendors to get involved now and have their voices heard: after the standard is finalized, they’ll just have to follow along.

I look forward to polishing what we have, and making sure we can implement even more awesome things in future.

June 05, 2013

Harald Welte: Attending HITCON and COSCUP in Taipei

It is my pleasure to attend the HITCON 2013 and COSCUP 2013 conferences in July/August this year. They are both in Taipei. HITCON is a hacker/security event, while COSCUP is a pure Free/Open Source Software conference.

At both events I will be speaking about the growing list of GSM related tools that are available these days, like OpenBSC, OsmocomBB, SIMtrace, OsmoSGSN, OsmoBTS, OsmoSDR, etc. As they are both FOSS projects and useful in a security context, this fits well within the scope of both events.

Given that I'm going to be back in Taiwan, I'm looking very much forward to meeting old friends and former colleagues from my Openmoko days in Taipei. God, do I miss those days. While terribly stressful, they still are the most exciting days of my career so far.

And yes, I'm also going to use the opportunity for a continuation of my motorbike riding in this beautiful country.

June 03, 2013

Harald Welte: Rest In Peace, Atul Chitnis

Today, very sad news has reached me: Atul Chitnis has passed away. Most people outside of India will most likely not recognize the name: He has been instrumental in pioneering the BBS community in India, and the founder and leader of the Linux Bangalore and later conferences, held annually in Bangalore.

I myself first met Atul about ten years ago, and had the honor of being invited to speak at many of the conferences he was involved in. Besides that professional connection, we became friends. The warmth and affection with which I was accepted by him and his family during my many trips to Bangalore is without comparison. I was treated and accepted like a family member, despite just being this random free software hacker from Germany who is always way too busy to return the amount of kindness.

Despite the 17 year age difference, there was a connection between the two of us. Not just the mutual respect for each others' work, but something else. It might have been partially due to his German roots. It might have been the similarities in our journey through technology. We both started out in the BBS community with analog modems, we both started to write DOS software in the past, before turning to Linux. We both became heavily involved in mobile technology around the same time: He during his work at Geodesic, I working for Openmoko. Only in recent years his indulgence in Apple products was slightly irritating ;)

Only five weeks ago I had visited Atul. Given the state of his health, it was clear that this might very well be the last time that we meet each other. I'm sad that this now actually turned out to be the truth. It would have been great to meet again at the end of the year (the typical schedule).

My heartfelt condolences to his family. Particularly to his wonderful wife Shubha, his daughter Anjali, his mother and brother. [who I'm only not calling by their name in this post as they deserve some privacy and their identities are not listed on Atul's wikipedia page].

Atul was 51 years old. Way too young to die. Yet, he has managed to create a legacy that will extend long beyond his life. He profoundly influenced generations of technology enthusiasts in India and beyond.

April 01, 2013

Rusty Russell: Thanks for the Bitcoin donation!

Last week I used 2 BTC to support Jupiter Broadcasting’s Unfilter show (and their other shows, but only Unfilter takes BTC so far).  Just now I noticed that someone made a 0.5BTC donation to my blog (I’ve had a BTC donation address in the sidebar of my blog for a few years now).  Thanks!

As I promised to pass donations onwards, I googled for bitcoin donations, and chose the following places to give 0.05 BTC each:

  1. Juice Rap News for making high-baud political commentary (Unfilter in rap form)
  2. Freedom Box for actually doing something about Internet freedom.
  3. (as recommended by for the same.
  4. for keeping a healthy Open alternative.
  5. Bitcoin Foundation to support and strengthen the infrastructure that made this possible.
  6. The Free Software Foundation even though I don’t always agree with them.
  7. Wikileaks for recognizing something society needs, even if they stumble at delivery.
  8. The Internet Archive for something that only gets more useful over time.

There are two left to go, so I’ll keep an eye out for more opportunities to donate in the next few weeks…


March 29, 2013

Harald Welte: Hardware outage affecting,,

As usual, Murphy's law dictates that problems will occur at the worst possible moment. One of my servers in the data center died on March 20, and it was the machine which hosts the majority of the free software projects that I've created or am involved in. From to OpenPCD and OpenEZX to and virtually all sites and services.

Recovery was slow as there is no hot spare and none of my other machines in the data center have backplanes for the old SCA-80 hard disks that are in use by that particular machine. So we had to send the disks to Berlin, wait until I'm back there, and then manually rsync everything over to a different box in the data center.

To my big surprise, not many complaints reached me (and yes, my personal and/or business e-mail was not affected in any way)

Recovery is complete now, and I'm looking forward to things getting back to normal soon.

Harald Welte: OsmoDevCon 2013 preparation update

OsmoDevCon 2013 is getting closer every day, and I'm very much looking forward to meeting the fellow developers of the various Osmocom sub-projects. Organization-wise, the catering has now been sorted out, and Holger has managed to get a test license for two ARFCN from the regulatory body without any trouble.

This means that we're more or less all set. The key needs to be picked up from IN-Berlin, and we need to bring some extra extension cords, ethernet switch, power cords and other gear, but that's really only very minor tasks.

There's not as much formal schedule as we used to have last year, which is good as I hope it means we can focus on getting actual work done, as opposed to spending most of the time updating one another about our respective work and progress.

March 20, 2013

Rusty Russell: GCC and C vs C++ Speed, Measured.

With the imminent release of gcc 4.8, GCC has finally switched to C++ as the implementation language.  As usual, LWN has excellent coverage.  Those with long memories will remember Linux trying to use g++ back in 1992 and retreating in horror at the larger, slower code.  The main benefit was stricter typechecking, particularly for enums (a great idea: I had -Wstrict-enum patches for gcc about 12 years ago, which was a superset of the -Wenum-compare we have now, but never got it merged).

With this in mind, and Ian Taylor’s bold assertion that “The C subset of C++ is as efficient as C”, I wanted to test what had changed with some actual measurements.  So I grabbed gcc 4.7.2 (the last release which could do this), and built it with C and C++ compilers:

  1. ../gcc-4.7.2/configure --prefix=/usr/local/gcc-c --disable-bootstrap --enable-languages=c,c++ --disable-multiarch --disable-multilib
  2. ../gcc-4.7.2/configure --prefix=/usr/local/gcc-cxx --disable-bootstrap --enable-languages=c,c++ --disable-multiarch --disable-multilib --enable-build-with-cxx

The C++-compiled binaries are slightly larger, though that’s mostly debug info:

  1. -rwxr-xr-x 3 rusty rusty 1886551 Mar 18 17:13 /usr/local/gcc-c/bin/gcc
    text       data        bss        dec        hex    filename
    552530       3752       6888     563170      897e2    /usr/local/gcc-c/bin/gcc
  2. -rwxr-xr-x 3 rusty rusty 1956593 Mar 18 17:13 /usr/local/gcc-cxx/bin/gcc
    text       data        bss        dec        hex    filename
    552731       3760       7176     563667      899d3    /usr/local/gcc-cxx/bin/gcc

Then I used them both to compile a clean Linux kernel 10 times:

  1. for i in `seq 10`; do time make -s CC=/usr/local/gcc-c/bin/gcc 2>/dev/null; make -s clean; done
  2. for i in `seq 10`; do time make -s CC=/usr/local/gcc-cxx/bin/gcc 2>/dev/null; make -s clean; done

Using stats --trim-outliers, which throws away the best and worst, we have the times for the remaining 8:

  1. real    14m24.359000-35.107000(25.1521+/-0.62)s
    user    12m50.468000-52.576000(50.912+/-0.23)s
    sys    1m24.921000-27.465000(25.795+/-0.31)s
  2. real    14m27.148000-29.635000(27.8895+/-0.78)s
    user    12m50.428000-52.852000(51.956+/-0.7)s
    sys    1m26.597000-29.274000(27.863+/-0.66)s

So the C++-compiled binaries are measurably slower, though not noticeably: it’s about 865 seconds vs 868 seconds, or about .3%.  Even if a kernel compile spends half its time linking, statting, etc, that’s under 1% slowdown.

And it’s perfectly explicable by the larger executable size.  If we strip all the gcc binaries, and do another 10 runs of each (… flash forward to the next day.. oops, powerfail, make that 2 days later):

  1. real    14m24.659000-33.435000(26.1196+/-0.65)s
    user    12m50.032000-57.701000(50.9755+/-0.36)s
    sys    1m26.057000-28.406000(26.863+/-0.36)s
  2. real    14m26.811000-29.284000(27.1308+/-0.17)s
    user    12m51.428000-52.696000(52.156+/-0.39)s
    sys    1m26.157000-27.973000(26.869+/-0.41)s

Now the difference is 0.1%, pretty much in the noise.
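
For anyone who wants to redo the sums, here is the calculation in Python. It takes the figure in parentheses on each "real" line as the mean number of seconds past 14 minutes, which is my reading of the stats output; the helper names are made up.

```python
def seconds(minutes, secs):
    """Convert a minutes + seconds pair to seconds."""
    return minutes * 60 + secs

def slowdown_percent(c_build, cxx_build):
    """Relative slowdown of the C++-compiled gcc, in percent."""
    return (cxx_build - c_build) / c_build * 100

# Mean "real" times reported above (C-compiled gcc, C++-compiled gcc)
unstripped = slowdown_percent(seconds(14, 25.1521), seconds(14, 27.8895))
stripped = slowdown_percent(seconds(14, 26.1196), seconds(14, 27.1308))

print(f"unstripped: {unstripped:.2f}%")
print(f"stripped:   {stripped:.2f}%")
```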

Summary: so whether you like C++ or not, the performance argument is moot.

February 08, 2013

Harald Welte: Update on what I've been doing

For the better part of a year, this blog has failed to provide you with a lot of updates about what I've been doing. This is somewhat related to a shift away from doing freelance work on mainline / FOSS projects like the Linux kernel.

In April 2011, Holger and I started a new company here in Berlin (sysmocom - systems for mobile communications GmbH). This company, among other things, attempts to provide products and services surrounding the various mobile communications related FOSS projects, particularly OpenBSC, OsmoSGSN, OpenGGSN, but also OsmocomBB, and now also OsmoBTS + OsmoPCU, two integral components of our own BTS product called sysmoBTS.

Aside from the usual software development, this entails a variety of other tasks, technical and non-technical. First of all, I did more electrical engineering than I did in the years since Openmoko. And even there, I was only leading the hardware architecture, and didn't actually have to capture schematics or route PCBs myself. So now there are some general-purpose and some customer-specific circuits that had to be done. I really enjoy that work, sometimes even more than software development. Particularly the early/initial design phase can be quite exciting. Selecting components, figuring out how to interconnect them, whether you can fit all of them together in the given amount of GPIOs and other resource of your main CPU, etc. But then even the hand-soldering the first couple of boards is fun, too.

Of all these things, the one I have had the least exposure to so far is casing and mechanical issues. Luckily we have a contractor working on that for us, but still there are all kinds of things that can go wrong, such as unpopulated PCB footprints suddenly making contact with a case, or issues related to manufacturing tolerances. Another topic is packaging. After all, you want the products to end up in the hands of the customer in a neat, proper and form-fitting package.

On the other hand, there is a lot of administrative work. Sourcing components can sometimes be a PITA, particularly if even distributors like Digikey conspire against you and don't even carry those low quantities of a component that we need for our 100-board low-quantity runs. EMC and other measurements for CE approval are a fun topic, too. I had never been involved personally in those, and it has been an interesting venture. Luckily, at least for sysmoBTS, things are looking quite promising now. Customs paperwork and import/export related bureaucracy (both in Germany and in other countries) always hold new surprises, despite me having dealt with customs for more than 10 years now.

Also, a significant amount of time is spent on evaluating suppliers and their products, e.g. items like SIM/USIM cards, cavity duplexers, antennas, cables, adapters, power amplifiers and other RF related accessories for our products.

The thing that really caught me off-guard is the German law on inventory accounting. Basically there is no threshold for low-quantity goods, so as an incorporated company (GmbH/AG) you have to account for each and every fscking SMD resistor or capacitor. And then you don't only have to count all those parts, but also put a value on them. Depending on the type of item, you have to use either the purchasing price, or the current market price if you were to buy it again, or the price you expect to sell the item for. Furthermore, the trade law requirements on inventory accounting are different from those of the tax laws, often with contradictory aims ;)

In the end it seems the best possible strategy is to put a lot of the low-value inventory into the garbage bin before the end of the financial year, as the value of the product (e.g. 130 SMD resistors in 0402 worth fractions of cents) is so much lower than the cost of counting it. Now that's of course an environmental sin, especially if you consider lots and lots of small and medium-sized companies ending up at that conclusion :(

So all in all, this should give you somewhat of an explanation why there might have been less activity on this blog about exciting technical things. On the one hand, they might relate to customer-related projects which are of a confidential nature. On the other hand, they might simply be boring things like dealing with transport damage of cavity duplexers from China, or with FedEx billing customs/import fees to the wrong address...

Overall I still have the feeling that I wrote a decent amount of code in 2012 - although there can never be enough :) Most of it was probably either related to OsmoBTS, OpenBSC/OsmoNITB or the various Erlang SS7/TCAP/MAP related projects. The list of more community-oriented projects with long TODO lists is growing, though. I'd like to work on SIMtrace MITM / card emulation support, the CC32RS512 based smartcard OS, and libosmosim (there's a first branch in libosmocore.git). Let's hope I can find a bit more time for that kind of stuff this year. You should never give up hope, they say ;)

February 04, 2013

Harald Welte: Back from FOSDEM 2013

As (almost) every year, I attended the annual incarnation of FOSDEM. It is undoubtedly (one of?) the most remarkable events about Free Software in existence. No registration, no fees, 24 tracks in parallel, an estimated 5000 attendees. I also like that it brings together people from so many different communities, not _just_ the Linux or Gnome or KDE or Telephony or Legal people, but a good mixture of everything.

I have to congratulate the organizers, who manage to pull this off year after year. And as opposed to many other events, they do so quietly and without much recognition, I feel. I'd also like to thank the many volunteers working tirelessly before, at and after the event. Last, but not least, I'd like to thank the local university (ULB Solbosch) for hosting the event.

What made me truly sad, though, is the amount of littering that surprisingly many of the attendees did. This was particularly visible in the cafeteria. Imagine an event run by volunteers, who put in a lot of time and effort. Imagine an event where food and drinks are sold by volunteers at such low prices that there can barely be any profit at all. And then imagine people eating there and leaving all their rubbish around, as if they were in some kind of restaurant where they are being served and where somebody is cleaning up after them. It really makes me feel very bitter to see this. Don't people realize that those very volunteers who are creating the event will then have to put in _their_ spare time just because those who enjoyed their coffee or lunch couldn't spare the extra 30 seconds to bring their trash to the trashcan? I feel ashamed for members of our community who behave this way. Please think next time before acting and show your respect to the people behind FOSDEM.

Copyright (C) 2001-2010 by the respective authors.