Planet Netfilter

August 19, 2014

Rusty Russell: POLLOUT doesn’t mean write(2) won’t block: Part II

My previous discovery that poll() indicating an fd was writable didn’t mean write() wouldn’t block led to some interesting discussion on Google+.

It became clear that there is much confusion over read and write; eg. Linus thought read() was like write() whereas I thought (prior to my last post) that write() was like read(). Both wrong…

Both Linux and v6 UNIX always returned from read() once data was available (v6 didn’t have sockets, but they had pipes). POSIX even suggests this:

The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.

But write() is different. Presumably so simple UNIX filters didn’t have to check the return and loop (they’d just die with EPIPE anyway), write() tries hard to write all the data before returning. And that leads to a simple rule.  Quoting Linus:

Sure, you can try to play games by knowing socket buffer sizes and look at pending buffers with SIOCOUTQ etc, and say “ok, I can probably do a write of size X without blocking” even on a blocking file descriptor, but it’s hacky, fragile and wrong.

I’m travelling, so I built an Ubuntu-compatible kernel with a printk() into select() and poll() to see who else was making this mistake on my laptop:

cups-browsed: (1262): fd 5 poll() for write without nonblock
cups-browsed: (1262): fd 6 poll() for write without nonblock
Xorg: (1377): fd 1 select() for write without nonblock
Xorg: (1377): fd 3 select() for write without nonblock
Xorg: (1377): fd 11 select() for write without nonblock

This first one is actually OK; fd 5 is an eventfd (which should never block). But the rest seem to be sockets, and thus probably bugs.
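
The practical rule from that discussion is simple: if you are going to poll() or select() a socket for write, set O_NONBLOCK on the descriptor and be prepared for short writes and EAGAIN.  A minimal sketch of that pattern (my own illustration, not code from any of the programs above):

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Write a buffer to a socket, trusting POLLOUT only because the fd is
 * non-blocking: a short write or EAGAIN just means "poll again". */
ssize_t write_when_ready(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    /* Without O_NONBLOCK, write() may still block even after POLLOUT. */
    if (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK) < 0)
        return -1;

    while (done < len) {
        if (poll(&pfd, 1, -1) < 0)
            return -1;
        ssize_t r = write(fd, buf + done, len - done);
        if (r < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                continue;   /* POLLOUT was optimistic: try again */
            return -1;
        }
        done += r;
    }
    return done;
}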

What’s worse is the Linux select() man page:

       A file descriptor is considered ready if it is possible to
       perform the corresponding I/O operation (e.g., read(2)) without
       blocking.
       ... those in writefds will be watched to see if a write will
       not block...

And poll():

	POLLOUT
		Writing now will not block.

Man page patches have been submitted…

August 02, 2014

Rusty Russell: ccan/io: revisited

There are numerous C async I/O libraries; tevent being the one I’m most familiar with.  Yet, tevent has a very wide API, and programs using it inevitably descend into “callback hell”.  So I wrote ccan/io.

The idea is that each I/O callback returns a “struct io_plan” which says what I/O to do next, and what callback to call.  Examples are “io_read(buf, len, next, next_arg)” to read a fixed number of bytes, and “io_read_partial(buf, lenp, next, next_arg)” to perform a single read.  You could also write your own, such as pettycoin’s “io_read_packet()” which read a length then allocated and read in the rest of the packet.
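
To make that concrete, here is a toy, self-contained sketch of the “callback returns a plan” idea.  This is not the real ccan/io API — the struct, the enum and the driver loop below are invented for illustration — but it shows how a read-then-write chain reads as a flat sequence of small functions instead of nested callbacks:

#include <stdio.h>
#include <stddef.h>

struct conn;

enum io_op { IO_READ, IO_WRITE, IO_CLOSE };

/* Each callback returns one of these: what I/O to do next, and whom to
 * call when that I/O completes. */
struct io_plan {
    enum io_op op;
    char *buf;
    size_t len;
    struct io_plan (*next)(struct conn *);
};

struct conn { char inbuf[64]; };

static struct io_plan conn_close(struct conn *c)
{
    (void)c;
    return (struct io_plan){ .op = IO_CLOSE };
}

static struct io_plan got_request(struct conn *c)
{
    /* After reading a request, plan to echo it back, then close. */
    return (struct io_plan){ IO_WRITE, c->inbuf, sizeof(c->inbuf), conn_close };
}

int main(void)
{
    struct conn c;
    /* Initial plan: read a fixed-size request, then call got_request(). */
    struct io_plan plan = { IO_READ, c.inbuf, sizeof(c.inbuf), got_request };

    /* A real event loop would poll() and do partial I/O; here we just walk
     * the chain to show the shape of the control flow. */
    while (plan.op != IO_CLOSE) {
        printf("%s %zu bytes\n",
               plan.op == IO_READ ? "read" : "write", plan.len);
        plan = plan.next(&c);
    }
    return 0;
}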

This should enable a convenient debug mode: you turn each io_read() etc. into synchronous operations and now you have a nice callchain showing what happened to a file descriptor.  In practice, however, debug was painful to use and a frequent source of bugs inside ccan/io, so I never used it for debugging.

And I became less happy when I used it in anger for pettycoin, but at some point you’ve got to stop procrastinating and start producing code, so I left it alone.

Now I’ve revisited it: 820 insertions(+), 1042 deletions(-), the code is significantly less hairy, and the API a little simpler.  In particular, writing the normal “read-then-write” loops is still very nice, while doing full duplex I/O is possible, but more complex.  Let’s see if I’m still happy once I’ve merged it into pettycoin…

July 29, 2014

Rusty Russell: Pettycoin Alpha01 Tagged

As with all software, it took longer than I expected, but today I tagged the first version of pettycoin.  Now, lots more polish and features, but at least there’s something more than the git repo for others to look at!

July 17, 2014

Rusty Russell: API Bug of the Week: getsockname().

A “non-blocking” IPv6 connect() call was, in fact, blocking.  Tracking that down made me realize the IPv6 address was mostly random garbage, which was caused by this function:

bool get_fd_addr(int fd, struct protocol_net_address *addr)
{
   union {
      struct sockaddr sa;
      struct sockaddr_in in;
      struct sockaddr_in6 in6;
   } u;
   socklen_t len = sizeof(len);
   if (getsockname(fd, &u.sa, &len) != 0)
      return false;
   ...
}

The bug: “sizeof(len)” should be “sizeof(u)”.  But when presented with a too-short length, getsockname() truncates, and otherwise “succeeds”; you have to check the resulting len value to see what you should have passed.

Obviously an error return would be better here, but the writable len arg is pretty useless: I don’t know of any callers who check the length return and do anything useful with it.  Provide getsocklen() for those who do care, and have getsockname() take a size_t as its third arg.
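
For reference, a corrected sketch of the call (my own version, not the pettycoin fix itself): pass the size of the buffer, and check the length the kernel hands back to detect truncation.

#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

bool get_fd_sockaddr(int fd, struct sockaddr_storage *ss)
{
    socklen_t len = sizeof(*ss);   /* size of the buffer, not of "len" */

    memset(ss, 0, sizeof(*ss));
    if (getsockname(fd, (struct sockaddr *)ss, &len) != 0)
        return false;
    /* If the kernel wanted more than we offered, it truncated silently. */
    if (len > sizeof(*ss))
        return false;
    return true;
}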

Oh, and the blocking?  That was because I was calling “fcntl(fd, F_SETFD, …)” instead of “F_SETFL”!

July 02, 2014

Jesper Dangaard Brouer: The calculations: 10Gbit/s wirespeed

In this blogpost, I'll try to make you understand the engineering challenge behind processing 10Gbit/s wirespeed, at the smallest Ethernet packet size.

The peak packet rate is 14.88 Mpps (million packets per sec) uni-directional on 10Gbit/s with the smallest frame size.

Details: What is the smallest Ethernet frame
Ethernet frame overhead:

  • Inter-frame gap: 12 bytes
  • Preamble + start-of-frame delimiter: 8 bytes
  • Ethernet header (MACs + EtherType): 14 bytes
  • Minimum payload: 46 bytes
  • Frame check sequence (CRC): 4 bytes

Thus, the minimum size Ethernet frame is: 84 bytes (20 bytes of wire overhead + 64 bytes minimum frame)

Max 1500 bytes MTU Ethernet frame size is: 1538 bytes (calc: (12+8) + (14) + 1500 + (4) = 1538 bytes)

Packet rate calculations

Peak packet rate calculated as:  (10*10^9) bits/sec / (84 bytes * 8) = 14,880,952 pps
1500 MTU packet rate calculated as: (10*10^9) bits/sec / (1538 bytes * 8) = 812,744 pps

Time budget
This is the important part to wrap your head around.

With 14.88 Mpps the time budget for processing a single packet is:

  • 67.2 ns (nanosecond) (calc as: 1/14880952*10^9 ns)

This corresponds to approx 201 CPU cycles on a 3GHz CPU (assuming only one instruction per cycle, disregarding superscalar/pipelined CPUs). Only having 201 clock-cycles of processing time per packet is very little.
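
For the record, the arithmetic above in a few lines of C (the 3GHz clock is the same assumption as in the text):

#include <stdio.h>

int main(void)
{
    double link_bps  = 10e9;        /* 10 Gbit/s                  */
    double min_frame = 84 * 8;      /* smallest frame on the wire */
    double mtu_frame = 1538 * 8;    /* full 1500 byte MTU frame   */
    double cpu_hz    = 3e9;         /* assumed 3GHz CPU           */

    double pps_min   = link_bps / min_frame;  /* ~14.88 Mpps */
    double pps_mtu   = link_bps / mtu_frame;  /* ~812 Kpps   */
    double ns_budget = 1e9 / pps_min;         /* ~67.2 ns    */

    printf("min-frame rate: %.0f pps\n", pps_min);
    printf("MTU-frame rate: %.0f pps\n", pps_mtu);
    printf("time budget   : %.1f ns (~%.0f cycles at 3GHz)\n",
           ns_budget, ns_budget * cpu_hz / 1e9);
    return 0;
}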

Relate these numbers to something
This 67.2ns number is hard to use for anything unless we can relate it to some other time measurements.

cache-misses
A single cache-miss takes: 32 ns (measured on an E5-2650 CPU). Thus, with just two cache-misses (2x32=64ns), almost the total 67.2 ns budget is gone. The Linux skb (sk_buff) is 4 cache-lines (on 64-bit), and the kernel e.g. insists on writing zeros to these cache-lines during allocation of an skb.

cache-references
We might not "suffer" a full cache-miss; sometimes the memory is available in the L2 or L3 cache.  Thus, it is useful to know these time measurements.  Measured on my E5-2630 CPU (with the lmbench command "lat_mem_rd 1024 128"), an L2 access costs 4.3ns, and an L3 access costs 7.9ns.
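
If you want to reproduce this kind of number without lmbench, a minimal pointer-chasing sketch works (the same idea as lat_mem_rd, not the actual tool; the 64 MB working set is an arbitrary size chosen to defeat the caches):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64UL * 1024 * 1024 / sizeof(void *))  /* 64 MB working set */

int main(void)
{
    void **chain = malloc(N * sizeof(void *));
    size_t *order = malloc(N * sizeof(size_t));
    size_t i;

    for (i = 0; i < N; i++)
        order[i] = i;
    /* Shuffle, so the hardware prefetcher cannot follow the chain. */
    srandom(1);
    for (i = N - 1; i > 0; i--) {
        size_t j = (size_t)random() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (i = 0; i < N; i++)
        chain[order[i]] = &chain[order[(i + 1) % N]];

    struct timespec t0, t1;
    void **p = &chain[order[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++)
        p = *p;                     /* each load depends on the previous */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per dependent load (p=%p)\n", ns / N, (void *)p);
    free(chain);
    free(order);
    return 0;
}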

The "LOCK" operation
Assembler instructions can be prefixed with a "LOCK" operation, which means that they perform an atomic operation. This is used every time e.g. a spinlock is locked or unlocked, and by cmpxchg and atomic_inc (some operations, like xchg, are even implicitly LOCK prefixed).

I've measured the cost of this atomic "LOCK" operation to be 8.25ns on my CPU (with this program). Even in the completely optimal situation of a spinlock only being touched by one CPU, we have two LOCK calls, which cost 16.5ns.
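
A minimal sketch of such a measurement (this is not the program linked above, just the same idea; it needs GCC or Clang for the __atomic builtin):

#include <stdio.h>
#include <time.h>

#define LOOPS 100000000UL

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    volatile unsigned long plain = 0;
    unsigned long atomic_val = 0;
    unsigned long i;
    double t0, t1;

    t0 = now_ns();
    for (i = 0; i < LOOPS; i++)
        plain += 1;                 /* ordinary add: no LOCK prefix */
    t1 = now_ns();
    printf("plain add : %.2f ns/op\n", (t1 - t0) / LOOPS);

    t0 = now_ns();
    for (i = 0; i < LOOPS; i++)     /* emits a LOCK-prefixed add on x86 */
        __atomic_fetch_add(&atomic_val, 1, __ATOMIC_SEQ_CST);
    t1 = now_ns();
    printf("atomic add: %.2f ns/op\n", (t1 - t0) / LOOPS);
    return 0;
}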

System call overhead
A FreeBSD case study of sendto(), in Luigi Rizzo's netmap paper, shows that the cost of the system call alone is 96ns, which is above the 67.2 ns budget.  The total overhead of sendto() was 950 ns.  These 950ns correspond to 1,052,631 pps (calc as 1/(950/10^9)).
On Linux I measured the system call getuid(2) to take 87.77 ns and 201 CPU-cycles (TSC measurement) (the CPU efficiency was 1.42 insns per cycle, measured with perf stat). Thus, the syscall itself eats up the entire budget.

  • Update: Most of the syscall overhead comes from the kernel option CONFIG_AUDITSYSCALL; without it, the syscall overhead drops to 41.85 ns.


How to overcome this syscall problem?  We can amortize the cost by sending several packets in a single syscall.  It is not very well known, but we actually already have a syscall to send several packets with a single syscall, called "sendmmsg(2)". Notice the extra "m" (and the corresponding receive version "recvmmsg(2)"). Not many examples exist on the Internet for using these syscalls. Thus, I've provided some example code here for sendmmsg and recvmmsg.
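
For illustration, a minimal self-contained sketch of batching UDP sends with sendmmsg(2); the destination address, port, batch size and payload below are placeholders, not taken from the example code mentioned above:

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port = htons(9) };   /* discard port */
    char payload[BATCH][64];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    int i, sent;

    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);       /* example addr */
    memset(payload, 'x', sizeof(payload));
    memset(msgs, 0, sizeof(msgs));

    for (i = 0; i < BATCH; i++) {
        iov[i].iov_base = payload[i];
        iov[i].iov_len = sizeof(payload[i]);
        msgs[i].msg_hdr.msg_name = &dst;
        msgs[i].msg_hdr.msg_namelen = sizeof(dst);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* One syscall queues up to BATCH packets; returns how many were sent. */
    sent = sendmmsg(fd, msgs, BATCH, 0);
    printf("sent %d packets in one syscall\n", sent);
    return 0;
}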

RAW socket speeds
Daniel Borkmann and I recently optimized AF_PACKET to scale to several CPUs (trafgen, kernel qdisc bypass, and trafgen's use of qdisc bypass). But let us look at the performance numbers for only a single CPU:

  • Qdisc path = 1,226,776 pps => 815 ns per packet (calc: 1/pps*10^9)
  • Qdisc bypass = 1,382,075 pps => 723 ns per packet (calc: 1/pps*10^9)

This is also interesting because it shows us the cost of the qdisc code path: 92 ns.  In this 10Gbit/s context that is fairly large, corresponding to almost 3 cache-line misses (92/32 = 2.9).

June 26, 2014

Eric Leblond: pshitt: collect passwords used in SSH bruteforce

Introduction

I’ve been playing lately with analyzing and characterizing SSH bruteforce attacks. I was a bit frustrated by only getting partial information:

  • ulogd can give information about scanner settings
  • suricata can give me information about software version
  • sshd server logs show the username
But having the username without the password is really frustrating.

So I decided to try to get them. Looking for an SSH server honeypot, I did find kippo, but it was going too far for me by providing fake shell access. So I decided to build my own based on paramiko.

pshitt, Passwords of SSH Intruders Transferred to Text, was born. It is a lightweight fake SSH server that collects authentication data sent by intruders. It basically collects the username and password and writes the extracted data to a file in JSON format. For each authentication attempt, pshitt dumps a JSON-formatted entry:

{"username": "admin", "src_ip": "116.10.191.236", "password": "passw0rd", "src_port": 36221, "timestamp": "2014-06-26T10:48:05.799316"}
The data can then be easily imported in Logstash (see pshitt README) or Splunk.

The setup

As I want to really connect to the box running ssh with a regular client, I needed a setup to automatically redirect the offenders, and only them, to the pshitt server. A simple solution was to use DOM. DOM parses the Suricata EVE JSON log file, in which Suricata gives us the software version of clients connecting to the SSH server. If DOM sees a software version containing libssh, it adds the originating IP to an ipset set. So, the idea of our honeypot setup is simple:
  • Suricata outputs SSH software version to EVE
  • DOM adds IPs using libssh to the ipset set
  • Netfilter NAT redirects all IPs of the set to pshitt when they try to connect to our SSH server
Getting the setup in place is really easy. We first create the set:
ipset create libssh hash:ip
then we start DOM so it adds all offenders to the set named libssh:
cd DOM
./dom -f /usr/local/var/log/suricata/eve.json -s libssh
A more accurate setup for dom can be the following. If you know that your legitimate clients are only based on OpenSSH, then you can run dom to put in the list all IPs that do not (-i) use an OpenSSH client (-m OpenSSH):
./dom -f /usr/local/var/log/suricata/eve.json -s libssh -vvv -i -m OpenSSH
If we want to list the elements of the set, we can use:
ipset list libssh
Now, we can start pshitt:
cd pshitt
./pshitt
And finally we redirect connections coming from IPs of the libssh set to port 2200:
iptables -A PREROUTING -m set --match-set libssh src -t nat -i eth0 -p tcp -m tcp --dport 22 -j REDIRECT --to-ports 2200

Some results

Here’s an extract of the most used passwords when trying to get access to the root account (graph: “real root passwords”), and here’s the same thing for attempts on the admin account (graph: “Root passwords”). Both datasets show around 24 hours of attempts on an anonymous box.

Conclusion

Thanks to paramiko, it was really fast to code pshitt. I’m now collecting data and I think it will help to improve the categorization of SSH bruteforce tools.

June 21, 2014

Rusty Russell: Alternate Blog for my Pettycoin Work

I decided to use github for pettycoin, and tested out their blogging integration (summary: it’s not very integrated, but once set up, Jekyll is nice).  I’m keeping a blow-by-blow development blog over there.

June 16, 2014

Rusty Russell: Rusty Goes on Sabbatical, June to December

At linux.conf.au I spoke about my pre-alpha implementation of Pettycoin, but progress since then has been slow.  That’s partially due to yak shaving (like rewriting the ccan/io library), partially due to reimplementing parts I didn’t like, and partially due to the birth of my son, but mainly because I have a day job which involves working on Power 8 KVM issues for IBM.  So Alex convinced me to take 6 months off from the day job, and work 4 days a week on pettycoin.

I’m going to be blogging my progress, so expect several updates a week.  The first few alpha releases will be useless for doing any actual transactions, but by the first beta the major pieces should be in place…

June 11, 2014

Eric Leblond: Let’s talk about SELKS

The slides of my lightning talk at SSTIC are available: Let’s talk about SELKS. The slides are in French and are intended to be humorous.

The presentation is about defensive security that needs to get sexier. And Suricata 2.0 with EVE logging combined with Elasticsearch and Kibana can really help to reach that target. If you want to try Suricata and Elasticsearch, you can download and test SELKS.

The talk also presents a small tool named Deny On Monitoring, which demonstrates how easy it is to extract information from Suricata EVE JSON logging.

June 07, 2014

Rusty Russell: Donation to Jupiter Broadcasting

Chris Fisher’s Jupiter Broadcasting pod/vodcasting started 8 years ago with the Linux Action Show: still their flagship show, and how I discovered them 3 years ago.  Shows like this give access to FOSS to those outside the LWN-reading crowd; community building can be a thankless task, and as a small shop Chris has had ups and downs along the way.  After listening to them for a few years, I feel a weird bond with this bunch of people I’ve never met.

I regularly listen to Techsnap for security news, Scibyte for science with my daughter, and Unfilter to get an insight into the NSA and what the US looks like from the inside.  I bugged Chris a while back to accept bitcoin donations, and when they did I subscribed to Unfilter for a year at 2 BTC.  To congratulate them on reaching the 100th Unfilter episode, I repeated that donation.

They’ve started doing new and ambitious things, like Linux HOWTO, so I know they’ll put the funds to good use!

June 04, 2014

Jesper Dangaard Brouer: Pktgen for network overload testing

Want to get maximum performance out of the kernel level packet generator (pktgen)?
Then read this blogpost:

  • Simple tuning will increase performance from 4Mpps to 5.5Mpps (per CPU)


You might see pktgen as a fast packet generator, which it is, but I (as a kernel developer) also see it as a network stack testing tool for the TX code path.

Pktgen has a parameter "clone_skb", which specifies how many times to send the same packet before freeing and allocating a new packet for transmission.  This affects performance significantly, as it can remove a lot of memory allocation and access overhead.

I have two distinctly different use-cases for stack testing:

  1. clone_skb=1      tests the stack alloc/free overhead (related to the SKB)
  2. clone_skb=100000 tests the NIC driver layer
Let's focus on case 2, the driver layer.


Tuning NIC driver layer for max performance:
The default NIC settings are not tuned for pktgen's artificial overload type of benchmarking, as this could hurt the normal use-case.

Specifically, increase the TX ring buffer in the NIC:
 # ethtool -G ethX tx 1024

A larger TX ring can improve pktgen's performance, while it can hurt in the general case: 1) because the TX ring buffer might get larger than the CPU's L1/L2 cache, and 2) because it allows more queueing in the NIC HW layer (which is bad for bufferbloat).

One should be careful about concluding that packets/descriptors in the HW TX ring cause delay.  Drivers usually delay cleaning up the ring buffers (for various performance reasons), thus packets stalling in the TX ring might just be waiting for cleanup.

This "slow" cleanup issues is specifically the case, for the driver ixgbe (Intel 82599 chip).  This driver (ixgbe) combine TX+RX ring cleanups, and the cleanup interval is affected by the ethtool --coalesce setting of parameter "rx-usecs".

For ixgbe, use e.g. "30", resulting in approx 33K interrupts/sec (1/30 * 10^6):
 # ethtool -C ethX rx-usecs 30

Performance data:
Packets Per Sec (pps) performance tests using a single pktgen CPU thread, on a CPU E5-2630, with the 10Gbit/s ixgbe driver (using net-next development kernel v3.15-rc1-2680-g6623b41).

Adjusting the "ethtool -C ethX rx-usecs" value affect how often we cleanup the ring.  Keeping the default TX ring size at 512, and adjusting "rx-usecs":
  • 3,935,002 pps - rx-usecs:  1 (irqs:  9346)
  • 5,132,350 pps - rx-usecs: 10 (irqs: 99157)
  • 5,375,111 pps - rx-usecs: 20 (irqs: 50154)
  • 5,454,050 pps - rx-usecs: 30 (irqs: 33872)
  • 5,496,320 pps - rx-usecs: 40 (irqs: 26197)
  • 5,502,510 pps - rx-usecs: 50 (irqs: 21527)
Performance when adjusting the TX ring buffer size. Keeping "rx-usecs==1" (default) while adjusting TX ring size (ethtool -G):
  • 3,935,002 pps - tx-size:  512
  • 5,354,401 pps - tx-size:  768
  • 5,356,847 pps - tx-size: 1024
  • 5,327,595 pps - tx-size: 1536
  • 5,356,779 pps - tx-size: 2048
  • 5,353,438 pps - tx-size: 4096
Adjusting the cleanup interval (rx-usecs) seems to win over simply increasing the TX ring buffer size. This also supports the theory that the TX queue is filled with old packets/descriptors that need cleaning.
(Edit: updated numbers to be clean upstream, previously included some patches)

Tools: if you want an easy-to-use script for pktgen, look here.
More details on pktgen advanced topics by Daniel Turull.

May 27, 2014

Rusty Russell: Effects of packet/data sizes on various networks

I was thinking about peer-to-peer networking (in the context of Pettycoin, of course) and I wondered if sending ~1420 bytes of data is really any slower than sending 1 byte on real networks.  Similarly, is it worth going to extremes to avoid crossing over into two TCP packets?

So I wrote a simple Linux TCP ping pong client and server: the client connects to the server then loops: reads until it gets a '1' byte, then it responds with a single byte ack.  The server sends data ending in a 1 byte, then reads the response byte, printing out how long it took.  First 1 byte of data, then 101 bytes, all the way to 9901 bytes.  It does this 20 times, then closes the socket.
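
A rough sketch of the client side of that protocol, for anyone who doesn't want to read the source (the address, port and ack byte here are placeholders, not the values used in the actual test):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port = htons(12345) };
    char buf[65536], ack = 'A';

    inet_pton(AF_INET, "192.0.2.10", &srv.sin_addr);
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) != 0) {
        perror("connect");
        return 1;
    }

    for (;;) {
        ssize_t r = read(fd, buf, sizeof(buf));
        if (r <= 0)
            break;                  /* server closed the socket, or error */
        /* Each block of data ends in a '1' byte: ack the complete block. */
        if (buf[r - 1] == '1')
            write(fd, &ack, 1);
    }
    close(fd);
    return 0;
}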

Here are the results on various networks (or download the source and result files for your own analysis):

On Our Gigabit Lan

Interestingly, we do win for tiny packets, but there’s no real penalty once we’re over a packet (until we get to three packets worth):

Over the Gigabit Lan

Over Gigabit LAN (closeup)

On Our Wireless Lan

Here we do see a significant decline as we enter the second packet, though extra bytes in the first packet aren’t completely free:

Wireless LAN (all results)

Wireless LAN (closeup)

Via ADSL2 Over The Internet (Same Country)

Ignoring the occasional congestion from other uses of my home net connection, we see a big jump after the first packet, then another as we go from 3 to 4 packets:

ADSL over internet in same country

ADSL over internet in same country (closeup)

Via ADSL2 Over The Internet (Australia <-> USA)

Here, packet size is completely lost in the noise; the carrier pigeons don’t even notice the extra weight:

Wifi + ADSL2 from Adelaide to US

Wifi + ADSL2 from Adelaide to US (closeup)

Via 3G Cellular Network (HSPA)

I initially did this with Wifi tethering, but the results were weird enough that Joel wrote a little Java wrapper so I could run the test natively on the phone.  It didn’t change the resulting pattern much, but I don’t know if this regularity of delay is a 3G or an Android thing.  Here every packet costs, but you don’t win a prize for having a short packet:

3G network

3G network (closeup)

Via 2G Network (EDGE)

This one actually gives you a penalty for short packets!  800 bytes to 2100 bytes is the sweet-spot:

2G (EDGE) network

2G (EDGE) network (closeup)

Summary

So if you’re going to send one byte, what’s the penalty for sending more?  Eyeballing the minimum times from the graphs above:

                             Wired LAN   Wireless   ADSL    3G     2G
Penalty for filling packet      30%        15%       5%     0%     0%*
Penalty for second packet       30%        40%      15%    20%     0%
Penalty for fourth packet       60%        80%      25%    40%    25%

* Average for EDGE actually improves by about 35% if you fill packet

May 19, 2014

Eric Leblond: Playing with python-git

Introduction

I’m currently working on Scirius, the web management interface for Suricata developed by Stamus Networks. Scirius is able to fetch IDS signatures from external places, and the backend stores these elements in a git tree. As Scirius is a Django application, this means we need to interact with git in Python.

Usually the documentation of Python modules is good enough to develop with. This is sadly not the case for GitPython. There is documentation, but the overall quality is not excellent, at least for a non-native Python developer, and some big parts are missing.

Doing a commit

Doing a commit is really simple once you have understood what to do. You need to open the repository and work on its index, which is the object you add the files to commit to. In the following example, I want to add everything under the rules directory:

    repo = git.Repo(source_git_dir)
    index = repo.index
    index.add(["rules"])
    message =  'source version at %s' % (self.updated_date)
    index.commit(message)

Set value in the configuration of a repository

It is possible to edit the configuration of a git repository with GitPython. To do that you need to get the config and use the set_value function. For example, the following code snippet creates a repository and sets user.email and user.name for that repository:

    repo = git.Repo.init(source_git_dir)
    config = repo.config_writer()
    config.set_value("user", "email", "scirius@stamus-networks.com")
    config.set_value("user", "name", "Scirius")

OSError 25: Inappropriate ioctl for device

I’ve encountered this fabulous exception when trying to do a commit in Scirius. The problem only shows up when running the application in wsfcgi mode. It is documented in Issue 39 on GitHub, but there is no workaround proposed.

The error comes from the fact that the function used to guess the identity of the user running the application is called even if the values are set in the config. This function fails when it is called outside of a real session: it tries to get things from the environment, but these values are not set when the application is started by init. To fix this, it is possible to force the USERNAME environment variable.

Here’s how it is implemented in Scirius:

+    os.environ['USERNAME'] = 'scirius'
    index.add(["rules"])
    message =  'source version at %s' % (self.updated_date)
    index.commit(message)

You can see the diff on GitHub

May 08, 2014

Rusty Russell: BTC->BPAY gateway (for Australians)

I tested out livingroomofsatoshi.com, which lets you pay any BPAY bill (see explanation from reddit).  Since I’d never heard of the developer, I wasn’t going to send anything large through it, but it worked flawlessly.  At least the exposure is limited to the time between sending the BTC and seeing the BPAY receipt.  Exchange rate was fair, and it was a simple process.

Now I need to convince my wife we should buy some BTC for paying bills…

April 30, 2014

Jesper Dangaard Brouer: trafgen a fast packet generator

The netsniff-ng toolkit version 0.5.8 has been released.

One of the tools included in the netsniff-ng toolkit is "trafgen", a multi-threaded low-level zero-copy network packet generator.  The recent release contains some significant performance improvements to that traffic generator.

Single-CPU generator performance on an E5-2630 CPU, with an Intel ixgbe/82599 chip, reaches 1.4 million packets per sec (Mpps) when using the recent kernel (>= v3.14) feature of qdisc bypass for RAW sockets, and around 1.2 Mpps without bypassing the qdisc system in the kernel. (The default is to use the qdisc bypass if available; for testing purposes the qdisc path can be enabled via the command line option "--qdisc-path".)

In this release, I've also made "trafgen" scale to more CPUs:


The hard part of using trafgen is specifying and creating the packet description input file.  I really enjoy the flexibility when defining the packet contents, but without good examples as a starting point, it can be a daunting task.

For that reason, I've made some examples available at github here:


I've used the SYN attack example while developing the SYNPROXY module, see my other blogpost. I'm releasing this example now because solutions for mitigating this attack are now available.

Jon Schipp also has a solution and has created a script "gencfg" for generating trafgen packet description input files, available on github: https://github.com/jonschipp/gencfg


Notice: to get these performance numbers you need to tune your packet generator machine for network overload testing.

Jesper Dangaard Brouer: Mitigating DDoS SYN flood attacks with iptables/netfilter

Hey, I'm also blogging on the Red Hat Enterprise Linux Blog

I recently did a very practical post on Mitigating TCP SYN Flood Attacks with iptables/netfilter, with the hope of providing the world with a practical solution to these annoying SYN-flood DDoS attacks that we have been seeing for the last 20 years.

I've also been touring with a technical talk on the subject, and the most recent version of the slides is here.

There is also a YouTube video of my presentation at DevConf 2013.

April 29, 2014

Jesper Dangaard Brouer: Basic tuning for network overload testing

I'm doing a lot of network testing, where I'm constantly trying to push the limits of the hardware and the network stack (in order to improve performance and fix scalability issues in the Linux kernel).

Some basic tuning of the NICs (Network Interface Cards) and IRQs is required before we can start this "overload" testing mode.

1. First thing I do is kill "irqbalance", to avoid it interfering with my manual IRQ assignments.

 # killall irqbalance

2. Next, I align/bind the NIC's IRQs to CPUs (one-to-one).

I have a script for aligning the IRQs that I copied from the Intel ixgbe driver tarball:
 # set_irq_affinity $DEV

The easiest way to view the current IRQ assignment is to use this "grep ." trick:
  # grep . /proc/irq/*/eth4{,-*}/../smp_affinity_list

3. Then I disable Ethernet Flow-Control

 # ethtool -A $DEV rx off tx off autoneg off

I'm disabling Ethernet Flow Control (PAUSE frames) because I want to create an overload situation. When my transmitter/generator machine is overloading the target machine, I don't want the target machine to send "backoff" PAUSE frames, especially if I'm testing the limits of the transmitter's network stack.

4. Unload all netfilter and iptables modules

I have a simple script for flushing iptables and unloading all the modules:
 # netfilter_unload_modules.sh

I usually also perform benchmarking and tuning of iptables/Netfilter modules, but for overload testing I'm unloading all modules, as these do introduce measurable overhead.


Extra: A word of caution regarding CPU sleep or idle states:
I've experienced issues when doing low-latency measurements with Sandy Bridge-E CPUs' C-states, because they too aggressively tried to go into a sleep state, even under a high network load. The latency cost of coming out of a sleep state can be significant. Jeremy Eder has described these issues in detail on his blog:
  http://www.breakage.org/2013/08/oh-did-you-expect-the-cpu/

Simply use the tool "turbostat" to measure the different C-states.

And use the tool "tuned-adm" to adjust what profile you want to enable e.g.:
 # tuned-adm profile throughput-performance
 # tuned-adm profile latency-performance

April 27, 2014

Eric Leblond: Slides of my coccigrep lightning talk at HES2014

I gave a lightning talk about coccigrep at Hackito Ergo Sum to show how it can be used to search in code during an audit or hacking party. Here are the slides: coccigrep: a semantic grep for the C language.

The slides of my talk Suricata 2.0, Netfilter and the PRC will soon be available on the Stamus Networks website.

April 17, 2014

Eric Leblond: Speeding up scapy packets sending

Sending packets with scapy

I’m currently writing some code based on scapy. This code reads data from a possibly huge file and sends a packet for each line in the file, using the contained information. So the code contains a simple loop and uses sendp() because the frame must be sent at layer 2.

     def run(self):
         filedesc = open(self.filename, 'r')
         # loop on read line
         for line in filedesc:
             # Build and send packet
             sendp(pkt, iface = self.iface, verbose = verbose)
             # Inter packet treatment

Doing that, the performance is a bit disappointing. For 18 packets, we get:

    real    0m2.437s
    user    0m0.056s
    sys     0m0.012s

If we strace the code, the explanation is quite obvious:

socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0S}0@\0*\6\265\373\307;\224\24\300\250"..., 97, 0, NULL, 0) = 97
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0
socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0004}1@\0*\6\266\31\307;\224\24\300\250"..., 66, 0, NULL, 0) = 66
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0

For each packet, a new socket is opened and this takes ages.

Speeding up the sending

To speed up the sending, one solution is to build a list of packets and to send that list via a sendp() call.

     def run(self):
         filedesc = open(self.filename, 'r')
         pkt_list = []
         # loop on read line
         for line in filedesc:
             # Build and send packet
             pkt_list.append(pkt)
         sendp(pkt_list, iface = self.iface, verbose = verbose)

This is not possible in our case due to the inter packet treatment we have to do. So the best way is to reuse the socket. This can be done easily when you’ve read the documentation^W code:

@@ -27,6 +27,7 @@ class replay:
     def run(self):
         # open filename
         filedesc = open(self.filename, 'r')
+        s = conf.L2socket(iface=self.iface)
         # loop on read line
         for line in filedesc:
             # Build and send packet
-            sendp(pkt, iface = self.iface, verbose = verbose)
+            s.send(pkt)

The idea is to create a socket via the function used in sendp() and to use the send() function of the object to send packets.

With that modification, the performance is far better:

    real    0m0.108s
    user    0m0.064s
    sys     0m0.004s

I’m not a scapy expert so ping me if there is a better way to do this.

April 16, 2014

Jesper Dangaard Brouer: Full scalability for Netfilter conntracks

My scalability fixes for Netfilter connection tracking have reached Linus's tree and will appear in kernel release v3.15.

Netfilter’s conntrack has had a bad reputation for being slow. While this was true in the "early days", it has been offering excellent scalability for established conntracks for a long time now.  Matching against existing conntrack entries is very fast and completely scalable. (The conntrack system actually does lockless RCU (Read-Copy Update) lookups for existing connections.)

The conntrack system has, however, had a scalability problem for a long time now when it comes to creating (or deleting) connections (a single central spinlock).  This scalability issue is now fixed.

This work relates to my recent efforts of using conntrack for DDoS protection, as e.g. SYN-floods would hit this "new" connection scalability problem with Netfilter conntracks.

Finally, version 3 of the patchset was accepted on March 7th 2014 (note that Eric Dumazet worked on the first attempts back on May 9th 2013). The most important commit is 93bb0ceb75 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock").

Jesper Dangaard Brouer: Announcing: The IPTV-Analyzer

I'm happy to announce the first official release of the IPTV-Analyzer, as an Open Source project.



The IPTV-Analyzer is a continuous/real-time tool for analyzing the contents of MPEG2 Transport Stream (TS) packets, which is commonly used for IPTV multicast signals. The main purpose is continuous quality measurement, with a focus on detecting MPEG2 TS/CC packet drops.

The core component is an iptables (Linux) kernel module, named "mpeg2ts". This kernel module performs the real-time Deep Packet Inspection of the MPEG2-TS packets. It is highly performance optimized, written for parallel processing across CPU cores (via RCU locking), and hash tables are used for handling a large number of streams. Statistics are exported via the proc filesystem (scalability is achieved via use of the seq_file proc API). It scales to hundreds of IPTV channels, even on small ATOM based CPUs.

Please send bugreports, patches, improvement, comments or insults to: netoptimizer@brouer.com

March 24, 2014

Rusty Russell: Legal Questions About Localbitcoins.com and Australia

As my previous post documented, I’ve experimented with localbitcoins.com.  Following the arrest of two Miami men for trading on localbitcoins, I decided to seek legal advice on the situation in Australia.

Online research led me to Nick Karagiannis of Kelly and Co, who was already familiar with Bitcoin: I guess it’s a rare opportunity for excitement in financial regulatory circles!  This set me back several thousand dollars (in fiat, unfortunately), but the result was reassuring.

They’ve released an excellent summary of the situation, derived from their research.  I hope that helps other bitcoin users in Australia, and I’ll post more in future should the legal situation change.

March 19, 2014

Rusty Russell: Bitcoin Trading In Australia

I bought 10 BTC to play with back in 2011, and have been slowly spending them to support bitcoin adoption.  One thing which I couldn’t get reliable information on was how to buy and sell bitcoin within Australia, so over the last few months I decided to sell a few via different methods and report the results here (this also helps my budget, since I’m headed off on paternity leave imminently!).

All options listed here use two-factor authentication, otherwise I wouldn’t trust them with more than cents.  And obviously you shouldn’t leave your bitcoins in an exchange for any longer than necessary, since most exchanges over time have gone bankrupt.

Option 1: MtGox AUD

Yes, I transferred some BTC into MtGox and sold them.  This gave the best price, but after over two months of waiting the bank transfer to get my money hadn’t been completed.  So I gave up, bought back into bitcoins (fewer, since the price had jumped) and thus discovered that MtGox was issuing invalid BTC transactions so I couldn’t even get those out.  Then they halted transactions altogether blaming TX malleability.  Then they went bankrupt.  Then they leaked my personal data just for good measure.  The only way their failure could be more complete is if my MtGox Yubikey catches on fire and burns my home to the ground.

Volume: Great (5M AUD/month)
Price Premium: $25 – $50 / BTC
Charge: 0.65%
Hassle: Infinite
Summary: 0/10

Option 2: localbitcoins.com

According to bitcoincharts.com, localbitcoins is the largest volume method for AUD exchange.  It’s not an exchange, so much as a matching and escrow service, though there are a number of professional traders active on the site.  The bulk of AUD trades are online, though I sold face to face (and I’ll be blogging about the range of people I met doing that).

localbitcoins.com is a great place for online BTC buyers, since they have been around for quite a while and have an excellent reputation with no previous security issues, and they hold bitcoins in escrow as soon as you hit “buy”.  It’s a bit more work than an exchange, since you have to choose the counter-party yourself.

For online sellers, transfers from stolen bank accounts are a real issue.  Electronic Funds Transfer (aka “Pay Anyone”) is reversible, so when the real bank account owner realizes their money is missing, the bank tends to freeze the receiving (ie. BTC seller’s) bank account to make sure they can’t remove the disputed funds.  This process can take weeks or months, and banks’ anti-fraud departments generally treat bitcoin sellers who get defrauded with hostility (ANZ is reported to be the exception here).  A less common scam is fraudsters impersonating the Australian Tax Office and telling the victim to EFT to the localbitcoins seller.

Mitigations for sellers include any combination of:

  1. Only accepting old-fashioned cash deposits via a branch (though I’m aware of one US report where a fraudster convinced the teller to reverse the deposit, I haven’t heard of that in Australia)
  2. Insisting on “localbitcoins.com” in the transfer message (to avoid the ATO fraud problem)
  3. Only dealing with buyers with significant reputation (100+ trades with over 150 BTC is the Gold Standard)
  4. Insisting on real ID checking (eg. Skype chat of buyer with drivers’ license)
  5. Only dealing with buyers whose accounts are older than two weeks (most fraudsters are in and out before then, though their reputation can be very good until they get caught)
  6. Only allowing internal transfers between the same bank (eg. Commonwealth), relying on the bank’s use of two factor authentication to reduce fraud.

Many buyers on localbitcoins.com are newcomers, so anticipate honest mistakes for the most part.  The golden rule always applies: if someone is offering an unrealistic price, it’s because they’re trying to cheat you.

Volume: Good (1M AUD/month)
Price Premium: $5 – $20 / BTC
Charge: 1% (selling), 0% (buying)
Hassle: Medium
Summary: 7/10

Option 3: btcmarkets.net

You’ll need to get your bank account checked to use this fairly low-volume exchange, but it’s reasonably painless.  Their issues are their lack of exposure (I found out about them through bitcoincharts.com) and lack of volume (about a quarter of the localbitcoins.com volume), but they also trade litecoin if you’re into that.  You can leave standing orders, or just manually place one which is going to be matched instantly.

They seem like a small operation, based in Sydney, but my interactions with them have been friendly and fast.

Volume: Low (300k AUD/month)
Price Premium: $0 / BTC
Charge: 1%
Hassle: Low
Summary: 7/10

Option 4: coinjar.io

I heard about this site from a well-circulated blog post on Commonwealth Bank closing their bank account last year.  I didn’t originally consider them since they don’t promote themselves as an exchange, but you can use their filler to sell them bitcoins at a spot rate.  It’s limited to $4000 per day according to their FAQ.

They have an online ID check, using the usual sources, which didn’t quite work for me due to out-of-date electoral information, but they cleared that manually within a day.  They deposit 1c into your bank account to verify it, but that hasn’t worked for me, so I’ve no way to withdraw my money, and they haven’t responded to my query from 5 days ago, leaving me feeling nervous.  A search of reddit points to common delays, and the founder’s links to the hacked-and-failed Bitcoinica give me a distinct “magical gathering” feel. [Edit: they apparently tried and failed four times to transfer the 1c verification to my ING account; with 1-2 business day support response, this took quite a while.  They never explained why this was failing.  Using my wife's CBA account worked however, and I got my funds the next day.  Upgraded their score from 4/10 to 5/10.]

Volume: Unknown (self-reports indicate ~250k/month?)
Price Premium: $0 / BTC
Charge: 1.1% (selling) 2% (buying)
Hassle: Medium
Summary: 5/10

If you trade, I’d love to hear corrections, comments etc. or email me on rusty@rustcorp.com.au.

March 07, 2014

Eric Leblond: Suricata and Ulogd meet Logstash and Splunk

Some progress on the JSON side

Suricata 2.0-rc2 is out and it brings some progress on the JSON side. Logging of the SSH protocol has been added, and the format of the timestamp has been updated to be ISO 8601 compliant; it is now named timestamp instead of time.

Ulogd, the Netfilter logging daemon, has seen a similar change, as it is now also using an ISO 8601 compliant timestamp. This feature is available in git and will be part of ulogd 2.0.4.

Thanks to this format change, the integration with logstash or splunk is easier and more accurate. This fixes one problem regarding the timestamp of an event inside the event and logging manager. At least in logstash, the date used was the time of parsing, which was not really accurate. It could even be a problem when logstash was parsing a file with old entries, because the difference in timestamps could be huge.

It is now possible to update the logstash configuration to have correct parsing of the timestamp. After doing this, the internal @timestamp and the timestamp of the event are synchronized, as shown in the following screenshot:

timestamp

Logstash configuration

To configure logstash, you simply need to tell it that the timestamp field in the JSON message is a date. To do so, you need to add a filter:

      date {
        match => [ "timestamp", "ISO8601" ]
      }
A complete logstash.conf would then look like:
input {
   file {
      path => [ "/usr/local/var/log/suricata/eve.json", "/var/log/ulogd.json" ]
      codec =>   json
      type => "json-log"
   }
}

filter {
   if [type] == "json-log" {
      date {
        match => [ "timestamp", "ISO8601" ]
      }
   }
}

output {
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }
}

Splunk configuration

In splunk, auto-detection of the file format fails, and it seems you need to define a type to parse JSON in $SPLUNK_DIR/etc/system/local/props.conf:

[suricata]
KV_MODE = json
NO_BINARY_CHECK = 1
TRUNCATE = 0

Then you can simply declare the log file in $SPLUNK_DIR/etc/system/local/inputs.conf:

[monitor:///usr/local/var/log/suricata/eve.json]
sourcetype = suricata

[monitor:///var/log/ulogd.json]
sourcetype = suricata

You can now search events and build dashboards based on Suricata or Netfilter packet logging.

February 24, 2014

Eric Leblond: Nftables and the Netfilter logging framework

Nftables logging

Nftables brings a lot of changes on the user side, and this is also true in the logging area. There is now only one single keyword for logging, log, and this target uses the Netfilter logging framework. A corollary of that is that you may not see any log messages, even if a rule with log is matching, because the Netfilter logging framework has to be configured.

Netfilter logging framework

The Netfilter logging framework is a generic way of logging used in Netfilter components. This framework is implemented in two different kernel modules:

  • xt_LOG: printk based logging, outputting everything to syslog (the same module as the one used for the iptables LOG target). It can only log packets for IPv4 and IPv6.
  • nfnetlink_log: netlink based logging, which requires setting up ulogd2 to get the events (the same module as the one used for the iptables NFLOG target). It can log packets for any family.

To use one of the two modules, you need to load them with modprobe. It is possible to have both modules loaded, and in this case you can then set up logging on a per-protocol basis. The active configuration is available for reading in /proc:

# cat /proc/net/netfilter/nf_log 
 0 NONE (nfnetlink_log)
 1 NONE (nfnetlink_log)
 2 nfnetlink_log (nfnetlink_log,ipt_LOG)
 3 NONE (nfnetlink_log)
 4 NONE (nfnetlink_log)
 5 NONE (nfnetlink_log)
 6 NONE (nfnetlink_log)
 7 nfnetlink_log (nfnetlink_log)
 8 NONE (nfnetlink_log)
 9 NONE (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
11 NONE (nfnetlink_log)
12 NONE (nfnetlink_log)
The syntax is the following: FAMILY ACTIVE_MODULE (AVAILABLE_MODULES). Here nfnetlink_log was loaded first and xt_LOG was loaded afterward (xt_LOG is aliased to ipt_LOG and ip6t_LOG).

Protocol family numbers can look a bit strange. They are in fact mapped to the socket family numbers used in the underlying code. The list is the following:

#define AF_UNSPEC	0
#define AF_UNIX		1	/* Unix domain sockets 		*/
#define AF_INET		2	/* Internet IP Protocol 	*/
#define AF_AX25		3	/* Amateur Radio AX.25 		*/
#define AF_IPX		4	/* Novell IPX 			*/
#define AF_APPLETALK	5	/* Appletalk DDP 		*/
#define	AF_NETROM	6	/* Amateur radio NetROM 	*/
#define AF_BRIDGE	7	/* Multiprotocol bridge 	*/
#define AF_AAL5		8	/* Reserved for Werner's ATM 	*/
#define AF_X25		9	/* Reserved for X.25 project 	*/
#define AF_INET6	10	/* IP version 6			*/
#define AF_MAX		12	/* For now.. */

To update the configuration, you need to write to the file corresponding to the family in the /proc/sys/net/netfilter/nf_log/ directory. For example, if you want to use ipt_LOG for IPv4 (2 in the list), you can do:

echo "ipt_LOG" >/proc/sys/net/netfilter/nf_log/2 
This will activate ipt_LOG for IPv4 logging:
# cat /proc/net/netfilter/nf_log 
 0 NONE (nfnetlink_log)
 1 NONE (nfnetlink_log)
 2 ipt_LOG (nfnetlink_log,ipt_LOG)
 3 NONE (nfnetlink_log)
 4 NONE (nfnetlink_log)
 5 NONE (nfnetlink_log)
 6 NONE (nfnetlink_log)
 7 nfnetlink_log (nfnetlink_log)
 8 NONE (nfnetlink_log)
 9 NONE (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
11 NONE (nfnetlink_log)
12 NONE (nfnetlink_log)

The Netfilter logging framework is used internally by Netfilter for some logging. For example, connection tracking uses it to send messages when invalid packets are seen. These messages are useful because they contain the reason for the reject. For example, one of the messages is “nf_ct_tcp: ACK is under the lower bound (possible overly delayed ACK)”. These logging messages are only sent if logging of invalid packets is requested. This is done by doing:

echo "255"> /proc/sys/net/netfilter/nf_conntrack_log_invalid
More information on the magical 255 value is available in the kernel documentation of the nf_conntrack sysctl. If the nfnetlink_log module is used for the protocol, then the group used is 0. So if you want to activate these messages, it could be a good idea to use a non-zero nfnetlink group in the log rules. This way you will be able to differentiate the log sources in software like ulogd.

Logging with Nftables

As mentioned before, logging is done via the log keyword. A typical log-and-accept rule will look like:

nft add rule filter input tcp dport 22 ct state new log prefix \"SSH for ever\" group 2 accept
This rule accepts packets to port 22 in the NEW state and logs them with the prefix SSH for ever on group 2. Here the group is only used when the active logging kernel module is nfnetlink_log. The option has no effect if xt_LOG is used. In fact, when used with xt_LOG, the only available option is prefix (at least for nftables 0.099).

The available options when using nfnetlink_log module are the following (at least for nftables 0.099):

  • prefix: A prefix string to include in the log message, up to 64 characters long, useful for distinguishing messages in the logs.
  • group: The netlink group (0 – 2^16-1) to which packets are sent (only applicable for nfnetlink_log). The default value is 0.
  • snaplen: The number of bytes to be copied to userspace (only applicable for nfnetlink_log). nfnetlink_log instances may specify their own range, this option overrides it.
  • queue-threshold: Number of packets to queue inside the kernel before sending them to userspace (only applicable for nfnetlink_log). Higher values result in less overhead per packet, but increase delay until the packets reach userspace. The default value is 1.
Note: the descriptions are extracted from the iptables man pages.

If you want to do some easy testing with nftables, simply load the xt_LOG module before nfnetlink_log. It will bind to the IPv4 and IPv6 protocols and provide you with logging. For more fancy stuff involving nfnetlink_log, you can have a look at Using ulogd and JSON output.

Happy logging to all!

February 23, 2014

Eric Leblond: Logging connection tracking event with ulogd

Motivation

I’ve recently met @aurelsec and we discussed the interest of logging connection tracking entries. This is indeed an undervalued information source in a network.

Quoting Wikipedia: “Connection tracking allows the kernel to keep track of all logical network connections or sessions, and thereby relate all of the packets which may make up that connection. NAT relies on this information to translate all related packets in the same way, and iptables can use this information to act as a stateful firewall.”

Connection tracking being linked with Network Address Translation has a direct impact: it stores both sides of each connection. If we use the conntrack tool from conntrack-tools to list connections:

# conntrack  -L
tcp      6 431999 ESTABLISHED src=192.168.1.129 dst=19.1.16.7 sport=53400 dport=443 src=19.1.16.7 dst=1.2.3.4 sport=443 dport=53500 [ASSURED] mark=0 use=1
...
We have the two sides of a connection:
  • Orig: here 192.168.1.129:53400 to 19.1.16.7:443. This is the packet information as seen by the firewall when the packet reaches it. There is no translation at all.
  • Reply: here 19.1.16.7:443 to 1.2.3.4:53500. This is how an answer coming from the server will look. The destination has been changed to the public IP of the firewall (here 1.2.3.4), and the destination port has also been changed to the one used by the firewall when doing the initial mapping. In fact, as multiple clients could use the same port at the same time, the firewall may have to rewrite the initial source port.

So the connection tracking stores all NAT transformations. This information is important because it is the only way to know which IP in a private network is responsible for something in the outside world. For example, let’s suppose that 19.1.16.7 has been attacked by our internal client (here 192.168.1.129). If the admin of this server sees the attack, he will only see the 1.2.3.4 IP address and source port 53500. If an authority asks you for the IP address responsible in your internal network, you have no instrument but the conntrack to know that this was in fact 192.168.1.129.

That’s why logging connection tracking events is one of the only effective ways to store the information necessary to get back to the internal IP address in case of an external query. Let’s now do this with ulogd 2.

Ulogd setup

Ulogd installation

Ulogd 2 is able to get information from the connection tracking and to log it to files or a database. If your distribution is not providing ulogd and you don’t know how to install it, you can check this post: Using ulogd and JSON output. To be sure that you will be able to log connection tracking events, you need to have the NFCT plugin set to yes at the end of the configure output.

Ulogd configuration:
  Input plugins:
    NFLOG plugin:			yes
    NFCT plugin:			yes

Kernel setup

All functionalities are standard since kernel 2.6.14. You only need to load the following module:

modprobe nf_conntrack_netlink
It is the one in charge of the kernel and userspace information exchange regarding connection tracking. It provides features to dump the conntrack table or modify entries in the conntrack. For example, the conntrack tool mentioned before uses that communication method to get the listing of connection tracking entries. But the feature that interests us in ulogd is the event mode. For each event in the life of a connection, a message is sent to userspace. Ulogd is able to listen to these messages, and this gives it the ability to store all information on the life of the connection in connection tracking.

Depending on the protocols you have on your network, you may need to run one of the following:

modprobe nf_conntrack_ipv4
modprobe nf_conntrack_ipv6

Ulogd setup

Our first objective will simply be to log all NAT decisions to a syslog-like file on disk. In terms of connection tracking, this means we will log all connections in the NEW state. This way we will get information about any packet going through the firewall, with the associated NAT transformation.

If you install from sources, copy the ulogd.conf at the root of the ulogd sources to your config directory (usually /usr/local/etc/), and start your favorite editor on it.

Ulogd does logging based on stack definitions. A stack is a chain of plugins, starting with an input plugin, finishing with an output one, and with filters in the middle. In our case, we want to get packets from the Netfilter conntrack, and the corresponding plugin is NFCT. The first example of a stack containing NFCT in the ulogd.conf file is the one we are interested in, so we uncomment it:

stack=ct1:NFCT,ip2str1:IP2STR,print1:PRINTFLOW,emu1:LOGEMU
We are not sure that the setup of the input and output plugins will be correct. For now, let’s just check the output:
[emu1]
file="/var/log/ulogd_syslogemu.log"
sync=1
As you may have seen, emu1 is also used by packet logging. So it may be a good idea to have our own output file for connection tracking events. To do that, we update the stack:
stack=ct1:NFCT,ip2str1:IP2STR,print1:PRINTFLOW,emunfct1:LOGEMU
and create a new config below emu1:
[emunfct1]
file="/var/log/ulogd_nfct.log"
sync=1
We have changed the file name and kept the sync option, which avoids a delay in writes due to buffering; such a delay can be very annoying when debugging a setup.

Now, we can test:

ulogd -v
In /var/log/ulogd_nfct.log, we see things like
Feb 22 10:50:36 ice-age2 [DESTROY] ORIG: SRC=61.174.51.209 DST=192.168.1.129 PROTO=TCP SPT=6000 DPT=22 PKTS=0 BYTES=0 , REPLY: SRC=192.168.1.129 DST=61.174.51.209 PROTO=TCP SPT=22 DPT=6000 PKTS=0 BYTES=0
So we only have destruction messages. This is not exactly what we wanted. We are interested in NEW messages, which will give us correct timing of the events. Reading the ulogd.conf file, it seems there is no information about choosing the event types. But let’s ask the NFCT input plugin for its capabilities. To do that, we use the -i option of ulogd:
# ulogd -v -i /usr/local/lib/ulogd/ulogd_inpflow_NFCT.so 
Name: NFCT
Config options:
        Var: pollinterval (Integer, Default: 0)
        Var: hash_enable (Integer, Default: 1)
        Var: hash_buckets (Integer, Default: 8192)
        Var: hash_max_entries (Integer, Default: 32768)
        Var: event_mask (Integer, Default: 5)
        Var: netlink_socket_buffer_size (Integer, Default: 0)
        Var: netlink_socket_buffer_maxsize (Integer, Default: 0)
        Var: netlink_resync_timeout (Integer, Default: 60)
        Var: reliable (Integer, Default: 0)
        Var: accept_src_filter (String, Default: )
        Var: accept_dst_filter (String, Default: )
        Var: accept_proto_filter (String, Default: )
...
The listing starts with the configuration keys. One of them is event_mask. This is the one controlling which events are sent from kernel to userspace. The value is a mask combining some of the following values:
  • NF_NETLINK_CONNTRACK_NEW: 0x00000001
  • NF_NETLINK_CONNTRACK_UPDATE: 0x00000002
  • NF_NETLINK_CONNTRACK_DESTROY: 0x00000004
So the default value of 5 means listening to NEW and DESTROY events. The clever reader will then ask: why did we only see DESTROY messages in that case? This is because the ulogd NFCT plugin runs by default in hash_enable mode. In this mode, one single message is output for each connection (at its end), and a hash table is maintained by ulogd to store the info (here the initial timestamp of the connection). Our setup doesn’t need this feature because we only want to get the NAT transformation, so we switch the hash feature off and limit the events to NEW:
[ct1]
event_mask=0x00000001
hash_enable=0
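
If you only want to log a subset of the traffic, the accept_src_filter, accept_dst_filter and accept_proto_filter options listed above can be used for that. As a rough sketch (the exact value syntax is an assumption based on the sample ulogd.conf, so double-check it against your version), something like this would restrict logging to TCP connections coming from the LAN:

[ct1]
event_mask=0x00000001
hash_enable=0
# assumed syntax: comma-separated networks / protocol names
accept_src_filter=192.168.1.0/24
accept_proto_filter=tcp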

We can now restart ulogd and check the log file:

Feb 22 11:59:34 ice-age2 [NEW] ORIG: SRC=2a01:e35:1394:5bd0:da50:b6ff:fe3c:4250 DST=2001:41d0:1:9598::1 PROTO=TCP SPT=51162 DPT=22 PKTS=0 BYTES=0 , REPLY: SRC=2001:41d0:1:9598::1 DST=2a01:e35:1394:5bd0:da50:b6ff:fe3c:4250 PROTO=TCP SPT=22 DPT=51162 PKTS=0 BYTES=0
Feb 22 11:59:43 ice-age2 [NEW] ORIG: SRC=192.168.1.129 DST=68.232.35.139 PROTO=TCP SPT=60846 DPT=443 PKTS=0 BYTES=0 , REPLY: SRC=68.232.35.139 DST=1.2.3.4 PROTO=TCP SPT=443 DPT=60946 PKTS=0 BYTES=0
This is exactly what we wanted: we have a trace of all NAT transformations.

Maintain a history of connection tracking

Objective

We want to log all the information describing a connection so we have a trace of what is going on on the firewall. This means we need at least:

  • IP information for orig and reply way
  • Timestamp of start and end of connection
  • Bandwidth used by the connection

Kernel setup

By default, recent kernels have limited handling of connection tracking. Some useful fields are not stored for performance reasons. This is the case for accounting (number of packets and bytes) and for the timestamp of the connection creation. The advantage of getting accounting information is obvious, as you get data on bandwidth usage. Regarding the timestamp, the interest is on the implementation side: it allows ulogd to get all the information needed to describe a connection in one single message (the DESTROY one), so ulogd no longer needs to maintain a hash table to keep the info and propagate it at exit.

To activate both features, you have to do:

 echo "1"> /proc/sys/net/netfilter/nf_conntrack_acct
 echo "1"> /proc/sys/net/netfilter/nf_conntrack_timestamp

Ulogd setup

For the following setup, you will need ulogd built from git, or ulogd version 2.0.4 or later.

Let’s first use JSON output to get the information in a readable format. We need to define a stack:

stack=ct2:NFCT,ip2str1:IP2STR,jsonnfct1:JSON

On the ct2 side, we don’t want to use the hash table and we only want to get DESTROY messages, so our configuration looks like:

[ct2]
hash_enable=0
event_mask=0x00000004

Regarding jsonnfct1, we could have reused the default JSON configuration, but for ease of testing we dedicate a file to the NFCT logging:

[jsonnfct1]
sync=1
file="/var/log/ulogd_nfct.json"

After a ulogd restart, we get entries like this:

{"reply.ip.daddr.str": "2a01:e35:1394:5ad0:da50:e6ff:fe3c:1250", "oob.protocol": 0, "dvc": "Netfilter", "timestamp": "Sat Feb 22 12:27:04 2014", "orig.ip.protocol": 6, "reply.raw.pktcount": 20, "flow.end.sec": 1393068424, "orig.l4.sport": 51384, "orig.l4.dport": 22, "orig.raw.pktlen": 5600, "ct.id": 1384991512, "orig.raw.pktcount": 23, "reply.raw.pktlen": 4328, "reply.ip.protocol": 6, "reply.l4.sport": 22, "reply.l4.dport": 51384, "ct.mark": 0, "ct.event": 4, "flow.start.sec": 1393068302, "flow.start.usec": 637516, "flow.end.usec": 403240, "reply.ip.saddr.str": "2001:41d0:1:9598::1", "oob.family": 10, "src_ip": "2a01:e35:1394:5ad0:da50:e6ff:fe3c:1250", "dest_ip": "2001:41d0:1:9598::1"}
The fields we wanted are here:
  • flow.start.* keys store the timestamp of flow start
  • flow.end.* keys store the end of the connection
  • *.raw.pkt* keys store the accounting information
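
Since each line of the output file is a standalone JSON object, a quick way to inspect the last logged connection in a readable form (just a convenience, using Python's standard json.tool module) is:

tail -n 1 /var/log/ulogd_nfct.json | python -m json.tool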

You can then add this file to the files parsed by logstash. For that, you can reuse the information from Using ulogd and JSON output and modify the input section:

input {
   file { 
      path => [ "/var/log/ulogd.json", "/var/log/ulogd_nfct.json"]
      codec =>   json 
   }
}
One interesting piece of information about a connection tracking entry is its duration. But this field is not available in the ulogd JSON output, and it is not possible to do mathematical operations in Kibana. A solution to get the information is to add a filter in logstash.conf to compute the duration:
filter {
  if [type] == "json-log" {
    ruby {
      code => "if event['ct.id']; event['flow.duration.sec']=(event['flow.end.sec'].to_i - event['flow.start.sec'].to_i); end"
    }
  }
}

Screenshot from 2014-02-23 18:00:23

One thing to keep in mind when interpreting the obtained duration is that a connection only dies after a protocol-dependent timeout. For example, in the case of a TCP connection, a timeout is applied even after a FIN packet, so a short connection will last at least as long as that timeout.

Another logging method is PostgreSQL. The stack to use is almost the same as the JSON one but uses, as you may have guessed, the PGSQL plugin:

stack=ct2:NFCT,ip2str1:IP2STR,pgsql2:PGSQL
The configuration of the PostgreSQL plugin is straightforward, based on the sample setup available in the configuration file:
[pgsql2]
db="nulog"
host="localhost"
user="nupik"
table="ulog2_ct"
#schema="public"
pass="changeme"
procedure="INSERT_CT"
I’m not the one who will explain how to connect to a PostgreSQL database and create a ulogd2 database. See Pollux’s post for that: ulogd2: the new userspace logging daemon for netfilter/iptables (part 2)

Other setups are possible. For example, you can maintain a copy of the connection tracking table in the database and also keep the history. To do that you need to use the INSERT_OR_REPLACE_CT procedure and a connection tracking input plugin that does not use the hash table but gets both NEW and DESTROY events:

stack=ct2:NFCT,ip2str1:IP2STR,pgsql2:PGSQL

[ct2]
hash_enable=0

[pgsql2]
db="nulog"
host="localhost"
user="nupik"
table="ulog2_ct"
#schema="public"
pass="changeme"
procedure="INSERT_OR_REPLACE_CT"
A connection will be inserted into the table when the NEW event is received, and the corresponding entry in the database will be updated when the DESTROY message arrives.

February 05, 2014

Eric Leblond: Suricata and Nftables

Iptables and suricata as IPS

Building a Suricata ruleset with iptables has always been a complicated task when trying to combine the rules that are necessary for the IPS with the firewall rules. Suricata has always used advanced Netfilter features, allowing some more or less tricky methods to be used.

For those not familiar with IPS using Netfilter, here are a few starting points:

  1. IPS receives the packet coming from kernel via rules using the NFQUEUE target
  2. The IPS must receive all packets of a given flow to be able to handle detection cleanly
  3. The NFQUEUE target is a terminal target: when the IPS issues a verdict on a packet, it is either accepted (and leaves the current chain) or dropped

So the ruleset needs to send all packets to the IPS. A basic ruleset for an IPS could thus look like:

iptables -A FORWARD -j NFQUEUE
With such a ruleset, all packets going through the box are sent to the IPS.

If now you want to combine this with your ruleset, then usually your first try is to add rules to the filter chain:

iptables -A FORWARD -j NFQUEUE
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED -j ACCEPT
# your firewall rules here
But this will not work because of point 2: all packets sent via NFQUEUE to the IPS are either blocked or, if accepted, leave the FORWARD chain directly and go on for evaluation to the next chain (mangle POSTROUTING in our case). With such a ruleset, the result is that there is no firewall, just an IPS in place.

As mentioned before, there are some existing solutions (see Building a Suricata ruleset for extensive information). The simplest one is to dedicate another chain, such as mangle, to the IPS:

iptables -A FORWARD -t mangle -j NFQUEUE
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED -j ACCEPT
# your firewall rules here
No conflict here, but you have to be sure nothing in your system will use the mangle table, or you will have the same problem as the one seen previously in the filter chain. So there was no universal and simple solution to implement an IPS and a firewall ruleset with iptables.

IPS the easy way with Nftables

In Nftables, chains are defined by the user using the nft command line tool. The user can specify:

  • The hook: the place in packet life where the chain will be set. See this diagram for more info.
    • prerouting: the chain will see packets before they are routed
    • input: the chain will receive packets going to the box
    • forward: the chain will receive packets routed by the box
    • postrouting: the chain will receive packets after routing and before they are sent out
    • output: the chain will receive packets sent by the host
  • The chain type: defines the objective of the chain
    • filter: the chain will filter packets
    • nat: the chain will only contain NAT rules
    • route: the chain contains rules that may change the route (previously known as mangle)
  • The priority: defines the evaluation order of the different chains of a given hook. It is an integer that can be freely specified. It also permits placing a chain before or after some internal operations such as connection tracking.

In our case, we want to act on forwarded packets, and we want a chain for filtering followed by a chain for the IPS. So the chain setup is simple:

nft -i
nft> add table filter
nft> add chain filter firewall { type filter hook forward priority 0;}
nft> add chain filter IPS { type filter hook forward priority 10;}
With this setup, a packet will reach the firewall chain first, where it will be filtered. If the packet is blocked, it will be destroyed inside the kernel. If the packet is accepted, it will then jump to the next chain, following the order of increasing priority. In our case, the packet reaches the IPS chain.

Now that we’ve got our chains, we can add filtering rules, for example:
nft add rule filter firewall ct state established accept
nft add rule filter firewall tcp dport ssh counter accept
nft add rule filter firewall tcp dport 443 accept
nft add rule filter firewall counter log drop
And for our Suricata IPS, that’s just trivial:
nft add rule filter IPS queue
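
On the Suricata side, you then just run it in NFQUEUE mode on the queue used by that rule (queue 0 here, since no queue number was given; the configuration path below is only an example and depends on your installation):

suricata -c /etc/suricata/suricata.yaml -q 0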

A bit more details

The queue target in nftables

The complete support for the queue target will be available in Linux 3.14. The syntax looks as follows:

nft add rule filter output queue num 3 total 2 options fanout
This rule sends matching packets to 2 load-balanced queues (total 2) starting at queue 3 (num 3). fanout is one of the two queue options:
  • fanout: When used together with total load balancing, this will use the CPU ID as an index to map packets to the queues. The idea is that you can improve performance if there’s a queue per CPU. This requires total to be set to a value greater than 1.
  • bypass: By default, if no userspace program is listening on a Netfilter queue, then all packets that are to be queued are dropped. When this option is used, the queue rule behaves like ACCEPT instead, and the packet will move on to the next rule.
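
For example, one could combine this with Suricata’s ability to listen on several queues at once to spread the load across workers. This is only a sketch (queue numbers and count are arbitrary, and the nft syntax is the one shown above):

nft add rule filter IPS queue num 0 total 4 options fanout
suricata -c /etc/suricata/suricata.yaml -q 0 -q 1 -q 2 -q 3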

For a complete description of queueing mechanism in Netfilter see Using NFQUEUE and libnetfilter_queue.

If you want to test this before the Linux 3.14 release, you can get the nft sources from the nftables git and use the next-3.14 branch.

Chain priority

For reference, here are the priority values of some important internal operations and of iptables static chains:

  • NF_IP_PRI_CONNTRACK_DEFRAG (-400): priority of defragmentation
  • NF_IP_PRI_RAW (-300): traditional priority of the raw table placed before connection tracking operation
  • NF_IP_PRI_SELINUX_FIRST (-225): SELinux operations
  • NF_IP_PRI_CONNTRACK (-200): Connection tracking operations
  • NF_IP_PRI_MANGLE (-150): mangle operation
  • NF_IP_PRI_NAT_DST (-100): destination NAT
  • NF_IP_PRI_FILTER (0): filtering operation, the filter table
  • NF_IP_PRI_SECURITY (50): Place of security table where secmark can be set for example
  • NF_IP_PRI_NAT_SRC (100): source NAT
  • NF_IP_PRI_SELINUX_LAST (225): SELInux at packet exit
  • NF_IP_PRI_CONNTRACK_HELPER (300): connection tracking at exit
For example, one can create in nftables an equivalent of the raw PREROUTING chain of iptables by doing:
# nft -i
nft> add chain filter pre_raw { type filter hook prerouting priority -300;}

Rusty Russell: Pettycoin and working with limited visibility.

At linux.conf.au I gave a last-minute talk previewing my work on pettycoin (video, slides), an experiment to shard a bitcoin-like network.  The idea is to trade off some security and robustness in return for scale, but use it only for small amounts where fraud is less worthwhile.  On the bitcoin network today this is already seen with zero-confirmation transactions, and this is the niche pettycoin seeks to fill.

There are numerous problems to be overcome (one being the time taken by my day job, of course).  But segmenting the network and the blockchain is an interesting challenge: bitcoin’s blockchain is already designed so that you can have partial knowledge (mainly so you can prune used outputs).  But there’s a clear divide between full nodes, and second-class partial nodes.  I want a system where no one need know everything, and I’m getting closer to that goal.

Consider the simplest useful transaction in the bitcoin network, with one input (ie. a previous transaction’s output) and one output.  To verify this is a fairly simple process:

  1. Is the transaction well-formed?
  2. Find the transaction whose output this is spending.
  3. Does the signature match the address of that output?
  4. Has that output already been spent?

With bitcoin, you’re expected to know every transaction with unspent outputs, so if you can’t find the transaction at step 2, the verification fails. Even better, you can verify that previous transaction, too, all the way back to the creation of the coins involved.  Your only worry is that the blockchain you have is the same as everyone else’s, so they’ll accept your transaction later.

If you don’t expect to know everything, it’s more difficult.  You can use a merkle proof to show that a transaction was present in a block; it takes just log(N) hashes for an N-transaction block (about 12 hashes for a 4096-transaction block).  So you could prove that all those previous transactions are in the blockchain (though that might be thousands of transactions) by providing me with each transaction and its proof.

But this can’t prove that there are not duplicate transactions in the blockchain itself.  Only knowing the entire contents would do that.  So we’re relying on the rest of the network, each with a partial view, to check that didn’t happen.

This leads to the two requirements for any aspect of the pettycoin system which a node can’t verify by itself:

  1. The information to verify must be seen by some honest nodes.
  2. Each node must have an efficient way of reporting if it sees a problem.

The former is a bit tricky.  Consensus is formed by the blockchain, but you don’t want to see all of it.  You might expect to see some fraction of it, but if you don’t, how would you alert the network in a way that can’t be faked?  Imagine a miner holds back 5 transactions in a block: the miner might wait for your complaint message about one of them, then release that transaction, making you look like the dishonest one.  By making you cry wolf, they can ensure you are ignored.

The solution used in pettycoin is that miners have to prove that they know the transactions in the 10 previous blocks.  They do this by hashing the transactions from the previous block into a merkle tree like normal, only they prefix each transaction with their payout address (this is called prev_merkle in the code).  The only way to generate this hash is to know the contents of each transaction, and you can’t make a valid block without it.  Unfortunately, the only way to demonstrate that this hash is wrong (thus the block is invalid) is to also know the contents of each transaction in the block.  Thus transactions are batched into groups of 4096; you only need to send 4096 transactions to prove that one of the hashes in a block is wrong.  Miners will insist on knowing the transactions for those blocks, knowing that if they fake it they’ll likely be caught.

Reporting most other problems in a block is fairly simple:

  1. You can prove a duplicate spend in the block chain by showing both transactions and the merkle proofs that they are in each block.  The second block is invalid.
  2. You can prove a malformed transaction by showing the transactions and the merkle proof it is in the block.  That block is invalid.
  3. You can prove an overspend by showing the transactions used as inputs.  That block is invalid.

But if a transaction in a block relies on an output of a transaction which never existed, you can’t prove it.  Even if you know every transaction which ever happened, you can’t prove that to me (without sending me the whole blockchain).  The initial design lived with such warts in the blockchain, instead insisting that you would have to show all the predecessors when you paid me (via a payment protocol).  That predecessor tree quickly becomes unwieldy, however.

The new approach is that for each input of a transaction in the blockchain, the miner has to include the block and transaction number where it appeared.  Now anyone who knows that previous transaction can check it, and if there’s a problem it’s easy for any node to prove by showing the transaction which is in that previous block (with merkle proof that it is).

This means that the blockchain can be trusted if half the mining power can be trusted.  This is a weaker guarantee than bitcoin’s, but sufficiently strong for pettycoin.  If you send me a new transaction along with the transactions it uses as inputs and merkle proofs that they are in the blockchain, I only need to ensure that the new transaction isn’t a double-spend.  That’s the same as the bitcoin network with zero-confirmation transactions (though pettycoin has a special double-spend report message to expedite it a little).

Next post, I’ll respond to the best criticism of pettycoin yet, the problem of gateways (by Jason Gi)…

February 02, 2014

Eric Leblond: Using ulogd and JSON output

Ulogd and JSON output

In February 2014, I committed a new output plugin to ulogd, the userspace logging daemon for Netfilter. It is a JSON output plugin which writes logs to a file in JSON format. The interest of the JSON format is that it is easily parsed by software such as logstash. And once the data is understood by logstash, you can get some nice and useful dashboards in Kibana:

Screenshot from 2014-02-02 13:22:34

This post explains how to configure ulogd and iptables to do packet logging and differentiate accepted and blocked packets. If you want to see how cool the result is, just check my post: Investigation on an attack tool used in China.

Installation

At the time of this writing, the JSON output plugin for ulogd is only available in the git tree. Ulogd 2.0.4 will contain the feature.

If you need to get the source, you can do:

git clone git://git.netfilter.org/ulogd2

Then the build is standard:

./autogen.sh
./configure
make
sudo make install

Please note that at the end of the configure output, you must see:

Ulogd configuration:
  Input plugins:
    NFLOG plugin:			yes
...
    NFACCT plugin:			yes
  Output plugins:
    PCAP plugin:			yes
...
    JSON plugin:			yes
If the JSON plugin is not built, you need to install the libjansson development files on your system and rerun configure.

Configuration

Ulogd configuration

All the edits are made in the ulogd.conf file. With the default configure options, the file is in /usr/local/etc/.

First, you need to activate the JSON plugin:

plugin="/home/eric/builds/ulogd/lib/ulogd/ulogd_output_JSON.so"

Then we define two stacks for logging. It will be used to differentiate accepted packets from dropped packets:

stack=log2:NFLOG,base1:BASE,ifi1:IFINDEX,ip2str1:IP2STR,mac2str1:HWHDR,json1:JSON
stack=log3:NFLOG,base1:BASE,ifi1:IFINDEX,ip2str1:IP2STR,mac2str1:HWHDR,json1:JSON

The first stack will be used to log accepted packets, so we set the numeric_label to 1 in [log2]. In [log3], we use a numeric_label of 0.
[log2]
group=1 # Group has to be different from the one use in log1
numeric_label=1

[log3]
group=2 # Group has to be different from the one use in log1/log2
numeric_label=0 # you can label the log info based on the packet verdict

The last thing to edit is the configuration of the JSON instance:

[json1]
sync=1
device="My awesome FW"
boolean_label=1
Here we ask for logs to be written to disk as they come (via sync) and we name our device My awesome FW. The last value, boolean_label, is the trickiest. If this configuration variable is set to 1, the numeric_label of the stack is used to decide whether a packet has been accepted or blocked: if numeric_label is non-zero, the packet is seen as allowed; if not, it is seen as blocked.

Sample Iptables rules

In this example, packets to port 22 are logged and accepted, and thus are logged in nflog-group 1. Packets hitting the default drop rule are sent to group 2 because they are about to be dropped.

iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A INPUT ! -i lo -p tcp -m tcp --dport 22 --tcp-flags FIN,SYN,RST,ACK SYN -m state --state NEW -j NFLOG --nflog-prefix  "SSH Attempt" --nflog-group 1
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -j ACCEPT
iptables -A INPUT -j NFLOG --nflog-prefix  "Input IPv4 Default DROP" --nflog-group 2

There is no difference in IPv6; we just use nflog-group 1 and 2 for the same purpose:

ip6tables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
ip6tables -A INPUT ! -i lo -p tcp -m tcp --dport 22 --tcp-flags FIN,SYN,RST,ACK SYN -m state --state NEW -j NFLOG --nflog-prefix  "SSH Attempt" --nflog-group 1
ip6tables -A INPUT ! -i lo -p ipv6-icmp -m icmp6 --icmpv6-type 128 -m state --state NEW -j NFLOG --nflog-prefix  "Input ICMPv6" --nflog-group 1
ip6tables -A INPUT -p ipv6-icmp -j ACCEPT
ip6tables -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -j ACCEPT
ip6tables -A INPUT -i lo -j ACCEPT
ip6tables -A INPUT -j NFLOG --nflog-prefix  "Input IPv6 Default DROP" --nflog-group 2
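
To check that packets really hit the NFLOG rules before blaming ulogd, you can look at the rule counters (a simple sanity check; the output will obviously depend on your traffic and chain layout):

iptables -nvL INPUT
ip6tables -nvL INPUT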

Logstash configuration

Logstash configuration is simple. You simply declare the ulogd.json file as input, and optionally you can activate geoip on the src_ip key:

input {
   file { 
      path => [ "/var/log/ulogd.json"]
      codec =>   json 
   }
}

filter {
  if [src_ip]  {
    geoip {
      source => "src_ip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

output { 
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }
}

Usage

To start ulogd in daemon mode, simply run:

ulogd -d

You can download logstash from their website and start it with the following command line:

java -jar logstash-1.3.3-flatjar.jar agent -f etc/logstash.conf --log log/logstash-indexer.out -- web

Once done, just point your browser to localhost:9292 and enjoy nice and interesting graphs.

Screenshot from 2014-02-02 13:57:19

Eric Leblond: Investigation on an attack tool used in China

Log analysis experiment

I’ve been playing lately with logstash using data from the ulogd JSON output plugin and the Suricata full JSON output as well as standard system logs.

Screenshot from 2014-02-02 13:22:34

Ulogd is getting Netfilter firewall logs from the Linux kernel and writing them in JSON format. Suricata is doing the same with alerts and other traces. Logstash is ingesting both logs as well as the system logs. This makes it possible to create dashboards with information coming from multiple sources. If you want to know how to configure ulogd for JSON output, check this post. For Suricata, you can have a look at this one.

Ulogd output is really new and I was experimenting with it in Kibana. When adding some custom graphs, I’ve observed some strange things and decided to investigate.

Displaying TCP window

The TCP window size at the start of a connection is not defined in the RFC, so every OS has chosen its own default value. It thus looked interesting to display the TCP window size to be able to spot strange behavior. With the new ulogd JSON plugin, the window size information is available in the tcp.window key. So, after doing a query on tcp.syn:1 to only get TCP SYN packets, I was able to graph the TCP window size of SYN packets.

Screenshot from 2014-02-02 13:22:58

Most of the TCP window sizes are well-known and correspond to standard operating systems:

  • 65535 is either Mac OS X or some MS Windows OS.
  • 14600 is used by some Linux versions.

The first uncommon value is 16384. Graphs are clickable in Kibana, so I was one click away from some interesting information.

The first thing I noticed when looking at the dashboard, after selecting TCP SYN packets with a window size of 16384, was that it was only ssh scanning:

Screenshot from 2014-02-02 13:58:15

The second thing is that, according to geoip, all IPs are Chinese:

Screenshot from 2014-02-02 13:57:19

A SSH scanning software

When looking at the details of the attempt made on my IP, there was something interesting: Screenshot from 2014-02-02 14:04:32

For all hosts, all requests are done with the same source port (6000). This is not possible with a standard ssh client, where the source port is by default chosen by the operating system. So either we have custom software that performs a bind operation to port 6000 at socket creation (this is possible, and one advantage would be to be easily authorized through a firewall if the country had one), or we have software developed with low-level (RAW) sockets for performance reasons. This would allow faster scanning of the internet by skipping the OS TCP connection handling. There are a lot of posts regarding the usage of port 6000 as a source port for scanning, but I did not find any really interesting information in them.

On suricata side, most of the source IPs are referenced in ET compromised rules: Screenshot from 2014-02-02 13:25:03

Analysing my SSH logs, I did not see any trace of ssh bruteforce coming from source port 6000. But when selecting an IP, I got traces of brute force from at least one of the IPs: Screenshot from 2014-02-02 14:31:02

These attackers seem to really love the root account. In fact, I did not manage to find any trace of attempts for a user other than root from the IP addresses using port 6000.

Getting back to my ulogd dashboard, I displayed more info about the scanning sequence used: Screenshot from 2014-02-02 14:34:05 The host scans the box with a raw-socket scanner, then attacks a few minutes later with an SSH bruteforce tool. The bruteforce tool has a TCP window size of 65535 at start. This indicates that a separate piece of software is used for scanning, so there must be some queueing mechanism between the scanner and the bruteforce tool. This may explain the delay between the scan and the bruteforce. Regarding the TCP window size value, 65535 seems to indicate a Windows server (which is coherent with the TTL value).

Looking at the scanner traffic

Capturing some sample traffic did not give too much information. This is a scanner sending a SYN and cleanly sending a reset when it gets the SYN, ACK:

14:27:54.982273 IP (tos 0x0, ttl 103, id 256, offset 0, flags [none], proto TCP (6), length 40)
    218.2.22.118.6000 > 192.168.1.19.22: Flags [S], cksum 0xa525 (correct), seq 9764864, win 16384, length 0
14:27:54.982314 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 44)
    192.168.1.19.22 > 218.2.22.118.6000: Flags [S.], cksum 0xeee2 (correct), seq 2707606274, ack 9764865, win 29200, options [mss 1460], length 0
14:27:55.340992 IP (tos 0x0, ttl 111, id 14032, offset 0, flags [none], proto TCP (6), length 40)
    218.2.22.118.6000 > 192.168.1.19.22: Flags [R], cksum 0xe48c (correct), seq 9764865, win 0, length 0

But it seems the RST packet after the SYN, ACK is not well crafted: Screenshot from 2014-02-02 16:07:26

More info on SSH bruteforce tool

Knowing that the behavior was scanning from port 6000 followed by normal connections, I focused the Suricata dashboard on one IP to see if I had some more information: Screenshot from 2014-02-02 15:21:58

A single IP in the list of scanning hosts is triggering multiple alerts. The event table confirmed this: Screenshot from 2014-02-02 15:16:41

Studying the geographical distribution of the Libssh alerts, it appears the library is used in countries other than China: Screenshot from 2014-02-02 15:24:59 So libssh is not a discriminating characteristic of these attacks.

Conclusion

A custom attack tool has been deployed on some Chinese IPs. It is a combination of an SSH scanner based on RAW sockets and an SSH bruteforce tool. It tries to gain access to the root account of systems via the ssh service. On an organisational level, it is possible there is a Chinese initiative trying to get the low-hanging fruit (systems with an ssh root account protected by a password), or maybe it is just some organization using compromised Chinese IPs to try to get control over more boxes.

January 20, 2014

Eric Leblond: Why you will love nftables

Linux 3.13 is out

Linux 3.13 is out, bringing among other things the first official release of nftables. nftables is the project that aims to replace the existing {ip,ip6,arp,eb}tables framework, aka iptables. The nftables version in Linux 3.13 is not yet complete: some important features are missing and will be introduced in the following Linux versions. It is already usable in most cases, but complete support (read: nftables at a better level than iptables) should be available in Linux 3.15.

nftables comes with a new command line tool named nft. nft is the successor of iptables and its derivatives (ip6tables, arptables). And it has a completely different syntax. Yes, if you are used to iptables, that’s a shock. But there is a compatibility layer that allows you to use the iptables syntax even if filtering is done with nftables in the kernel.

There is only very little documentation available for now. You can find my nftables quick howto, and there are some other initiatives that should be made public soon.

Some command line examples

Multiple targets on one line

Suppose you want to log and drop a packet with iptables: you had to write two rules, one for drop and one for logging:

iptables -A FORWARD -p tcp --dport 22 -j LOG
iptables -A FORWARD -p tcp --dport 22 -j DROP

With nft, you can combine both targets:

nft add rule filter forward tcp dport 22 log drop
Easy set creation

Suppose you want to allow packets for different ports and allow different icmpv6 types. With iptables, you need to use something like:

ip6tables -A INPUT -p tcp -m multiport --dports 23,80,443 -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbor-solicitation -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type echo-request -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type router-advertisement -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbor-advertisement -j ACCEPT

With nft, sets can be used on any element in a rule:

nft add rule ip6 filter input tcp dport {telnet, http, https} accept
nft add rule ip6 filter input icmpv6 type { nd-neighbor-solicit, echo-request, nd-router-advert, nd-neighbor-advert } accept

It is easier to write and it is more efficient on the filtering side, as there is only one rule added for each protocol.

You can also use named sets to be able to make them evolve over time:

# nft -i # use interactive mode
nft> add set global ipv4_ad { type ipv4_address;}
nft> add element global ipv4_ad { 192.168.1.4, 192.168.1.5 }
nft> add rule ip global filter ip saddr @ipv4_ad drop
And later when a new bad boy is detected:
# nft -i
nft> add element global ipv4_ad { 192.168.3.4 }
Mapping

One advanced feature of nftables is mapping. It is possible to use two different types of data and to link them. For example, we can associate an interface with a dedicated rule set (stored in a chain created beforehand). In the example, the chains are named low_sec and high_sec:

# nft -i
nft> add map filter jump_map { type ifindex : verdict; }
nft> add element filter jump_map { eth0 : jump low_sec; }
nft> add element filter jump_map { eth1 : jump high_sec; }
nft> add rule filter input iif vmap @jump_map

Now, let’s say you have a new dynamic interface ppp1; it is easy to set up filtering for it. Simply add it to the jump_map mapping:

nft> add element filter jump_map { ppp1 : jump low_sec; }

On administration and kernel side

More speed at update

Adding a rule with iptables got dramatically slower as the number of rules grew, which explains why scripts using many iptables calls take a long time to complete. This is no longer the case with nftables, which uses atomic and fast operations to update rule sets.

Less kernel update

With iptables, each match or target required a kernel module. So you had to recompile the kernel in case you forgot something or wanted to use something new. This is no longer the case with nftables. In nftables, most of the work is done in userspace, and the kernel only knows some basic instructions (filtering is implemented in a pseudo-state machine). For example, icmpv6 support has been achieved via a simple patch of the nft tool. This type of modification in iptables would have required kernel and iptables upgrades.

January 10, 2014

Eric Leblond: A bit of logstash cooking

Introduction

I’m running a dedicated server to host some internet services. The server runs Debian. I’ve installed logstash on it to do a bit of monitoring of my system logs and Suricata, and I’ve built a set of dashboards. The screenshot below shows part of the one dedicated to Suricata: Suricata dashboard

Setup

My data sources were the following:
  • System logs
  • Apache logs
  • Suricata full JSON logs (should be available in suricata 2.0)
System logs

The setup was mostly really easy. I’ve just added a grok pattern to detect successful and unsuccessful connections on the ssh server.

input {
  file {
    type => "linux-syslog"
    path => [ "/var/log/daemon.log", "/var/log/auth.log", "/var/log/mail.info" ]
  }
}

filter {
  if [type] == "linux-syslog" {
      grok {
        match => { "message" => "Accepted %{WORD:auth_method} for %{USER:username} from %{IP:src_ip} port %{INT:src_port} ssh2" }
      }
      grok {
        match => { "message" => "Invalid user %{USER:username} from %{IP:src_ip}" }
      }
  }
}
Apache logs
Extract of Apache Dashboard

For apache, it was even easier for access.log:

input {
  file {
    path => [ "/var/log/apache2/*access.log" ]
    type => "apache-access"
  }

  file {
    type => "apache-error"
    path => "/var/log/apache2/error.log"
  }
}
filter {
  if [type] == "apache-access" {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
  }

  if [type] == "apache-error" {
      grok {
        match => { "message" => "%{APACHEERRORLOG}" }
        patterns_dir => ["/var/lib/logstash/etc/grok"]
      }
  }
}

For the error log, I’ve created a grok pattern to get the client IP. So I’ve created a file in the grok directory with:

HTTPERRORDATE %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{YEAR}
APACHEERRORLOG \[%{HTTPERRORDATE:timestamp}\] \[%{WORD:severity}\] \[client %{IPORHOST:clientip}\] %{GREEDYDATA:message_remainder}
Netfilter logs
Extract of firewall Dashboard

For Netfilter logs, I’ve decided to play it the old way and parse the kernel log instead of using ulogd:

input {
  file {
    type => "kern-log"
    path => "/var/log/kern.log"
  }
}

filter {
 if [type] == "kern-log" {
        grok {
                match => { "message" => "%{IPTABLES}"}
                patterns_dir => ["/var/lib/logstash/etc/grok"]
        }
 }
}
with IPTABLES being defined in a file placed in the grok directory and containing:
 NETFILTERMAC %{COMMONMAC:dst_mac}:%{COMMONMAC:src_mac}:%{ETHTYPE:ethtype}
 ETHTYPE (?:(?:[A-Fa-f0-9]{2}):(?:[A-Fa-f0-9]{2}))
 IPTABLES1 (?:IN=%{WORD:in_device} OUT=(%{WORD:out_device})? MAC=%{NETFILTERMAC} SRC=%{IP:src_ip} DST=%{IP:dst_ip}.*(TTL=%{INT:ttl})?.*PROTO=%{WORD:proto}?.*SPT=%{INT:src_port}?.*DPT=%{INT:dst_port}?.*)
 IPTABLES2 (?:IN=%{WORD:in_device} OUT=(%{WORD:out_device})? MAC=%{NETFILTERMAC} SRC=%{IP:src_ip} DST=%{IP:dst_ip}.*(TTL=%{INT:ttl})?.*PROTO=%{INT:proto}?.*)
 IPTABLES (?:%{IPTABLES1}|%{IPTABLES2})

Exim logs
Extract of SMTP dashboard

This part was complicated because exim logs are multiline. I found a page explaining how to match at least the logs for delivered mail; it uses the multiline filter. Then I added a series of matches to get more information. Each match only gets a part of a message, so I’ve used break_on_match so processing does not stop when one of the matches succeeds.

input {
  file {
    type => "exim-log"
    path => "/var/log/exim4/mainlog"
  }
}
filter {
  if [type] == "exim-log" {
      multiline {
        pattern => "%{DATE} %{TIME} %{HOSTNAME:msgid} (=>|Completed)"
        what => "previous"
      }
      grok {
        break_on_match => false
        match => [
          "message", "= %{NOTSPACE:from} H=%{NOTSPACE:server} \[%{IP:src_ip}\]"
        ]
      }
      grok {
        break_on_match => false
        match => [
          "message", "=> %{USERNAME:username} %{NOTSPACE:dest}> R=%{WORD:transport}"
        ]
      }

      grok {
        break_on_match => false
        match => [
          "message", "=> %{NOTSPACE:dest} R=%{WORD:transport}"
        ]
     }
      grok {
        break_on_match => false
        match => [
          "message", "%{DATE} %{TIME} H=%{NOTSPACE:server}%{GREEDYDATA} \[%{IP:src_ip}\] F=%{NOTSPACE:mail_to}> temporarily rejected RCPT %{NOTSPACE:dest}>: greylisted"
        ]
      }
   }
}
Suricata
Pie with file types

Suricata full JSON output is JSON so the configuration in logstash is trivial:

input {
   file {
      path => ["/var/log/suricata/eve.json" ]
      codec =>   json
   }
}
You can download a sample Suricata Dashboard to use in your logstash installation.

The full configuration

Below is the full configuration. There is only one thing which I did not mention: for most source IPs, I use geoip to get an idea of the location of the IP.

input {
  file {
    type => "linux-syslog"
    path => [ "/var/log/daemon.log", "/var/log/auth.log", "/var/log/mail.info" ]
  }

  file {
    path => [ "/var/log/apache2/*access.log" ]
    type => "apache-access"
  }

  file {
    type => "apache-error"
    path => "/var/log/apache2/error.log"
  }

  file {
    type => "exim-log"
    path => "/var/log/exim4/mainlog"
  }

  file {
    type => "kern-log"
    path => "/var/log/kern.log"
  }

   file {
      path => ["/var/log/suricata/eve.json" ]
      codec =>   json
   }

}

filter {
  if [type] == "apache-access" {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
  }
  if [type] == "linux-syslog" {
      grok {
        match => { "message" => "Accepted %{WORD:auth_method} for %{USER:username} from %{IP:src_ip} port %{INT:src_port} ssh2" }
      }
  }

  if [type] == "apache-error" {
      grok {
        match => { "message" => "%{APACHEERRORLOG}" }
        patterns_dir => ["/var/lib/logstash/etc/grok"]
      }
  }

  if [type] == "exim-log" {
      multiline {
        pattern => "%{DATE} %{TIME} %{HOSTNAME:msgid} (=>|Completed)"
        what => "previous"
      }
      grok {
        break_on_match => false
        match => [
          "message", "= %{NOTSPACE:from} H=%{NOTSPACE:server} \[%{IP:src_ip}\]"
        ]
      }
      grok {
        break_on_match => false
        match => [
          "message", "=> %{USERNAME:username} %{NOTSPACE:dest}> R=%{WORD:transport}"
        ]
      }

      grok {
        break_on_match => false
        match => [
          "message", "=> %{NOTSPACE:dest} R=%{WORD:transport}"
        ]
     }
      grok {
        break_on_match => false
        match => [
          "message", "%{DATE} %{TIME} H=%{NOTSPACE:server}%{GREEDYDATA} \[%{IP:src_ip}\] F=%{NOTSPACE:mail_to}> temporarily rejected RCPT %{NOTSPACE:dest}>: greylisted"
        ]
      }
   }

 if [type] == "kern-log" {
        grok {
                match => { "message" => "%{IPTABLES}"}
                patterns_dir => ["/var/lib/logstash/etc/grok"]
        }
 }

  if [src_ip]  {
    geoip {
      source => "src_ip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }

  if [clientip]  {
    geoip {
      source => "clientip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }

  if [srcip]  {
    geoip {
      source => "srcip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }
}

January 04, 2014

Patrick McHardy: test

test

November 27, 2013

Eric Leblond: What’s new in ulogd 2.0.3

New features in ulogd 2.0.3 release

Database framework update

ulogd 2.0.3 implements two new optional modes for database connections:

  • backlog system to avoid event loss in case of database downtime
  • running mode where acquisition is made in one thread and queries to databases are made in separate threads to reduce latency in the treatment of kernel messages
These two modes are described below.

Postgresql update

The Postgresql output plugin was only offering a small subset of Postgresql connection-related options. It is now possible to use connstring to pass any of the libpq param keywords. If set, this variable takes precedence over the other variables.

One interest of connstring is the ability to use an SSL-encrypted connection to the database by using the sslmode keyword:

connstring="host=localhost port=4321 dbname=nulog user=nupik password=changeme sslmode=verify-full sslcert=/etc/ssl/pgsql-cert.pem sslkey=/etc/ssl/pgsql-key.pem sslrootcert==/etc/ssl/pgsql-rootcert.pem"

Event loss prevention

ulogd 2.0.3 implements a backlog system for all database output plugins using the abstraction framework for database connections. At the time of writing, this means MySQL, PostgreSQL and DBI. Memory is dedicated to storing the queries that cannot be run because the database is unavailable. Once the database is back, the queries are replayed in order.

To activate this mode, you need to set the backlog_memcap value in the database definition.

[mysql1]
db="nulog"
...
procedure="INSERT_PACKET_FULL"
backlog_memcap=1000000
backlog_oneshot_requests=10

Set backlog_memcap to the amount of memory that will be allocated to store events if the database is temporarily down. The backlog_oneshot_requests variable stores the number of queries to process at once before reading a new kernel message.

Multithreaded database output

If the ring buffer mode is active, a thread is created for each stack involving the configured database. It connects to the database and executes the queries. The idea is to avoid buffer overruns by minimizing the time needed to treat a kernel message. Doing synchronous SQL requests, as was done before, caused a delay which could lead to some messages being lost in case of a burst on the kernel side. With this new mode, the time to process a kernel message is equal to the time needed to format the query.

To activate this mode, you need to set ring_buffer_size to a value greater than 1. The value is the number of SQL requests to keep in the ring buffer.

[pgsql1]
db="nulog"
...
procedure="INSERT_PACKET_FULL"
ring_buffer_size=1000

The ring_buffer_size value has precedence over the backlog_memcap value, and the backlog will be disabled if the ring buffer is active, as the ring buffer also provides packet loss prevention. ring_buffer_size is the maximum number of queries to keep in memory.

November 18, 2013

Eric Leblond: Using linux perf tools for Suricata performance analysis

Introduction

Perf is a great tool to analyse performance on Linux boxes. For example, perf top will give you this type of output on a box running Suricata on a high speed network:

Events: 32K cycles                                                                                                                                                                                                                            
 28.41%  suricata            [.] SCACSearch
 19.86%  libc-2.15.so        [.] tolower
 17.83%  suricata            [.] SigMatchSignaturesBuildMatchArray
  6.11%  suricata            [.] SigMatchSignaturesBuildMatchArrayAddSignature
  2.06%  suricata            [.] tolower@plt
  1.70%  libpthread-2.15.so  [.] pthread_mutex_trylock
  1.17%  suricata            [.] StreamTcpGetFlowState
  1.10%  libc-2.15.so        [.] __memcpy_ssse3_back
  0.90%  libpthread-2.15.so  [.] pthread_mutex_lock

The functions are sorted by CPU consumption. Using the arrow keys, it is possible to jump into the annotated code to see where most CPU cycles are used.

This is really useful, but in the case of a function like pthread_mutex_trylock, the interesting part is being able to find out where this function is called from.

Getting function call graph in perf

This stack overflow question led me to the solution.

I’ve started to build suricata with the -fno-omit-frame-pointer option:

./configure --enable-pfring --enable-luajit CFLAGS="-fno-omit-frame-pointer"
make
make install

Once suricata was restarted (with pid being 9366), I was then able to record the data:

sudo perf record -a --call-graph -p 9366
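
If you do not want to record until you interrupt perf, the recording can be bounded by a dummy command; recording stops when it exits (a common idiom, the 30-second duration is arbitrary):

sudo perf record -a --call-graph -p 9366 sleep 30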

Extracting the call graph was then possible by running:

sudo perf report --call-graph --stdio
The result is a huge detailed report. For example, here’s the part on pthread_mutex_lock:
     0.94%  Suricata-Main  libpthread-2.15.so     [.] pthread_mutex_lock
            |
            --- pthread_mutex_lock
               |
               |--48.69%-- FlowHandlePacket
               |          |
               |          |--53.04%-- DecodeUDP
               |          |          |
               |          |          |--95.84%-- DecodeIPV4
               |          |          |          |
               |          |          |          |--99.97%-- DecodeVLAN
               |          |          |          |          DecodeEthernet
               |          |          |          |          DecodePfring
               |          |          |          |          TmThreadsSlotVarRun
               |          |          |          |          TmThreadsSlotProcessPkt
               |          |          |          |          ReceivePfringLoop
               |          |          |          |          TmThreadsSlotPktAcqLoop
               |          |          |          |          start_thread
               |          |          |           --0.03%-- [...]
               |          |          |
               |          |           --4.16%-- DecodeIPV6
               |          |                     |
               |          |                     |--97.59%-- DecodeTunnel
               |          |                     |          |
               |          |                     |          |--99.18%-- DecodeTeredo
               |          |                     |          |          DecodeUDP
               |          |                     |          |          DecodeIPV4
               |          |                     |          |          DecodeVLAN
               |          |                     |          |          DecodeEthernet
               |          |                     |          |          DecodePfring
               |          |                     |          |          TmThreadsSlotVarRun
               |          |                     |          |          TmThreadsSlotProcessPkt
               |          |                     |          |          ReceivePfringLoop
               |          |                     |          |          TmThreadsSlotPktAcqLoop
               |          |                     |          |          start_thread
               |          |                     |          |
               |          |                     |           --0.82%-- DecodeIPV4
               |          |                     |                     DecodeVLAN
               |          |                     |                     DecodeEthernet
               |          |                     |                     DecodePfring
               |          |                     |                     TmThreadsSlotVarRun
               |          |                     |                     TmThreadsSlotProcessPkt
               |          |                     |                     ReceivePfringLoop
               |          |                     |                     TmThreadsSlotPktAcqLoop
               |          |                     |                     start_thread
               |          |                     |
               |          |                      --2.41%-- DecodeIPV6
               |          |                                DecodeTunnel
               |          |                                DecodeTeredo
               |          |                                DecodeUDP
               |          |                                DecodeIPV4
               |          |                                DecodeVLAN
               |          |                                DecodeEthernet
               |          |                                DecodePfring
               |          |                                TmThreadsSlotVarRun
               |          |                                TmThreadsSlotProcessPkt
               |          |                                ReceivePfringLoop
               |          |                                TmThreadsSlotPktAcqLoop
               |          |                                start_thread

October 28, 2013

Eric Leblond: Logstash and Suricata for the old guys

Introduction

logstash is an open source tool for managing events and logs. It uses elasticsearch for storage and has a really nice interface named Kibana. One of the easiest entry formats to use is JSON.

Suricata is an IDS/IPS which has some interesting logging features. Version 2.0 will feature a JSON export for all logging subsystems. It will then be possible to output in JSON format:

  • HTTP log
  • DNS log
  • TLS log
  • File log
  • IDS Alerts
For now, only the file log is available in JSON format. It extracts metadata from files transferred over HTTP.

Peter Manev has described how to connect Logstash, Kibana and the Suricata JSON output. Installation is really simple: just download logstash from the logstash website, write your configuration file and start the thing.

Kibana interface is really impressive: Kibana Screenshot

But at the time I started to look at the documentation, a few things were missing:

  • Geoip is not supported
  • All fields containing spaces appear as multiple entries

Geoip support

This one was easy. You simply have to edit the logstash.conf file to add a section about geoip:

input {
  file { 
    path => "/home/eric/builds/suricata/var/log/suricata/files-json.log" 
    codec =>   json 
    # This format tells logstash to expect 'logstash' json events from the file.
    #format => json_event 
  }
}

output { 
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }
}

#geoip part
filter {
  if [srcip] {
    geoip {
      source => "srcip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

It adds a filter that checks for the presence of srcip and adds geoip information to the entry. The tricky thing is the add_field part, which creates an array that has to be used when adding a map to a kibana dashboard. See the following screenshot for an explanation: Creating new map in Kibana

You may have the following error:

You must specify 'database => ...' in your geoip filter

In this case, you need to specify the path to the geoip database by adding the database keyword to geoip configuration:

#geoip part
filter {
  if [srcip] {
    geoip {
      source => "srcip"
      target => "geoip"
      database => "/path/to/GeoLiteCity.dat"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}

Once the file is written, you can start logstash

java -jar /home/eric/builds/logstash/logstash-1.2.2-flatjar.jar agent -f /home/eric/builds/logstash/logstash.conf --log /home/eric/builds/logstash/log/logstash-indexer.out -- web

See Logstash Kibana and Suricata JSON output for detailed information on setup.

Logstash indexing and mapping

Before logstash 1.3.1, fixing the space issue was really complex. Since that version, all indexed fields are provided with a .raw field that can be used to avoid the problem with spaces in names. So now, you can simply use something like geoip.country_name.raw in the definition of a Kibana graph instead of geoip.country_name. That way, United States no longer appears as United and States.

Fixing the space issue for logstash prior to 1.3.1 was far more complicated for an old guy like me, used to configuration files. While finding the origin of the behavior was easy, fixing it was more painful. A simple googling shows that by default elasticsearch splits strings at spaces when indexing. To fix this, you have to specify that the field should not be analyzed during indexing: "index":"not_analyzed"

That looked easy at first, but logstash does not use a configuration file for indexing and mapping. In fact, you need to interact with elasticsearch via HTTP requests. The second problem is that the indexes are dynamically generated, so there is a template system that you can use to have indexes created the way you want.

Creating a template is easy. You simply do something like:

curl -XPUT http://localhost:9200/_template/logstash_per_index -d '
{
    "template" : "logstash*",
    MAGIC HERE
}'

This will create a template that will be applied to all newly created indexes with a name matching “logstash*”. The difficult part is to know what to put in MAGIC HERE and to check that “logstash*” will match the created indexes. To check this, you can retrieve all current mappings:

curl -XGET 'http://localhost:9200/_all/_mapping'

You then get a list of mappings and you can check the names. But the best part is that you can use the output as a base text to update the mapping definition part. With the Suricata file log and geoip activated, the following configuration works well:

curl -XPUT http://localhost:9200/_template/logstash_per_index -d '
{
    "template" : "logstash*",
    "mappings" : {
      "logs" : {
         "properties": {
            "@timestamp":{"type":"date",
            "format":"dateOptionalTime"},
            "@version":{"type":"string"},
            "dp":{"type":"long"},
            "dstip":{"type":"ip"},
            "filename":{"type":"string"},
            "geoip":{
               "properties":{
                  "area_code":{"type":"long"},
                  "city_name":{"type":"string", "index":"not_analyzed"},
                  "continent_code":{"type":"string"},
                  "coordinates":{"type":"string"},
                  "country_code2":{"type":"string"},
                  "country_code3":{"type":"string"},
                  "country_name":{"type":"string", "index":"not_analyzed"},
                  "dma_code":{"type":"long"},
                  "ip":{"type":"string"},
                  "latitude":{"type":"double"},
                  "longitude":{"type":"double"},
                  "postal_code":{"type":"string"},
                  "real_region_name":{"type":"string", "index":"not_analyzed"},
                  "region_name":{"type":"string", "index":"not_analyzed"},
                  "timezone":{"type":"string"}
               }
            },
            "host":{"type":"string"},
            "http_host":{"type":"string"},
            "http_referer":{"type":"string"},
            "http_uri":{"type":"string"},
            "http_user_agent":{"type":"string", "index":"not_analyzed", "omit_norms":true, "index_options":"docs"},
            "ipver":{"type":"long"},
            "magic":{"type":"string", "index":"not_analyzed", "omit_norms":true, "index_options":"docs"},
            "md5":{"type":"string"},
            "path":{"type":"string"},
            "protocol":{"type":"long"},
            "size":{"type":"long"},
            "sp":{"type":"long"},
            "srcip":{"type":"ip"},
            "state":{"type":"string"},
            "stored":{"type":"boolean"},
            "tags":{"type":"string"},
            "timestamp":{"type":"string"}
      }
    }
  }
}'

I've added some "index":"not_analyzed" settings and improved the type of some of the fields. For example, srcip has been defined as an IP address. This allows range searches in Kibana such as:

["192.168.42.24" TO "192.168.42.45"]

The next step is to update the index format. To do so, you can get the name of the current index, delete it and recreate it. To get the name, you can use the mapping listing:

curl -XGET 'http://localhost:9200/_all/_mapping'

The return is something like:

{"logstash-2013.10.27":{"logs":{"properties":

So now, we can destroy this index named "logstash-2013.10.27" and have it recreated with the correct settings:

curl -XDELETE 'http://localhost:9200/logstash-2013.10.27'
curl -XPUT 'http://localhost:9200/logstash-2013.10.27'
We need the data to be reindexed, so:
curl -XGET 'http://localhost:9200/logstash-2013.10.27/_refresh'

It may also be a good idea to wait for new data, as that seems to trigger an update in what elasticsearch is sending.

October 17, 2013

Rusty Russell: linux.conf.au 2014: Rusty’s Must See List

Delightedly finished reading through the linux.conf.au program.  Some nasty clashes have me still arguing with myself, but here are my personal compulsory-attendance talks.  Your preferences will no doubt differ, so I've tried to explain my reasons:

 

  • Tridgell: Open Hardware Differential GPS – I spoke to Tridge about this, and the abstract completely undersells it.
  • Corbet: Kernel – Jon’s kernel talks are great for non-kernel people, but for me it’s about seeing the forest through the trees.
  • McKenney: Parallel Verification – Paul’s spoken with me about this, but I want to hear the practical side to see how I can apply it.
  • Heo: Kernel per-cpu  – I persuaded Tejun to submit this; his per-cpu work was elegant (mine, on which this was built, was merely functional).
  • Airlie: Virtual GPU – Sorry Bdale (with whom it clashes): I have wanted a virtio GPU for so long, I need to see it.
  • Packard: Zero-copy Compositing  – Keith is always good, and graphics performance is fascinating.
  • Suehle: Raspberry Pi – Generally O’Reilly books are well researched, so I expect great content here.
  • Isaacs: CTDB Bugs – I was around when he was finding some of these, and there are some fascinating surprises here.

September 05, 2013

Harald Welte: Problems with OpenVPN on high-latency satellite links

So far I never had a need to look in detail at how the OpenVPN protocol actually looks on the wire. It seems like not many people have had that close a look, as the wireshark plugin is fairly recent (from 2012, I think) while OpenVPN has been around for about ten years longer than that. If I were an OpenVPN developer, the wireshark plugin would be the first thing I'd write to help debugging and development. At least that's what I've been doing, from OpenPCD to SIMtrace and through the various GSM and other protocols I encounter...

The reason for my current investigation is some quite strange and yet-unexplained problems when running OpenVPN on high-latency satellite links. I'm not talking about high-bandwidth VSAT or systems with dedicated / guaranteed bandwidth. The links I'm seeing often have RTT (as seen by ICMP echo) of 2 seconds, sometimes even 5. This is of course not only the satellite link, but includes queuing on the ground, possibly the space segment and of course the terminal, including (possibly) access arbitration.

What struck me as _very_ odd is that OpenVPN sends tons of UDP messages of ridiculously small size during the TLS handshake when bringing up the tunnel. Further investigation shows that it actually internally configures an MTU of '0' for the link, which seems to be capped at 100 bytes of control payload; adding the HMAC and OpenVPN header results in 124 to 138 bytes of UDP payload.

Now you have to consider that the server certificate (possibly including even a CA certificate) can be quite large, plus all the gazillions of TLS handshaking options in ServerHello, the first message from server to client. This means that OpenVPN transmits that ServerHello in something like 40 to 60 fragments of 100 bytes each! And each of the fragments has to be acknowledged by the remote end, leading to 80 to 120 UDP/IP packets _only_ for the delivery of the TLS ServerHello.

Then you start reviewing the hundreds of OpenVPN configuration options, many of them related to MTU, MSS, fragmentation, etc. There is none for that insanely small default of 100 bytes for control packets during the handshake. I even read through the related source code, only to find that this behavior is indeed hard-coded. Some time later I had written a patch to add such an option, thanks to Free Software. It seems to work on client and server and brings the ClientHello down to a much smaller 4-6 messages.

The fun continues when you see that the timeout for re-transmitting fragments that have not been ACKed yet is 2 seconds. At my satellite RTT times this of course leads to lots of unneeded re-transmissions, simply because the ACK hasn't made its way back to the sender of the original message yet. Luckily there's a configuration option for that.

After the patch and changing that option, the protocol trace looks much more sane. However, I still have problems establishing a tunnel in a number of cases. For some odd reason, the last fragment of the ServerHello is not acknowledged by the client, no matter whether patched or unpatched OpenVPN is used. I always get acknowledgements only up to fragment N-1 after having transmitted N. That last fragment is then re-transmitted by the server with exponential back-off, and finally some 60 seconds later the server gives up as the TLS handshake didn't finish within that time. Extending the TLS handshake timeout to 120 seconds doesn't help either.
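
For what it's worth, stock OpenVPN 2.x seems to already expose knobs for both timeouts mentioned above: --tls-timeout for the control-channel retransmit interval (default 2 seconds) and --hand-window for the overall TLS handshake window (default 60 seconds). A minimal sketch of raising them for a 2-5 second RTT link, assuming a hypothetical config file satellite.conf; the option names should be double-checked against the man page of your build:

# raise the control-channel retransmit timeout and the TLS handshake window
# for a high-latency link (the values are examples, not recommendations)
openvpn --config satellite.conf --tls-timeout 10 --hand-window 120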

I'm not quite sure why something like 39 out of 40 fragments all get delivered reliably and acknowledged, but always the last fragment (40) doesn't make it to the remote side. That's certainly not random packet loss, but a very deterministic one. Let's see if I can still manage to find out what that might be...

July 27, 2013

Rusty Russell: Git prompt for bash

I don’t know who wrote this originally, but this is from my .bashrc.  Tridge’s is simpler, but has colour!

Before this, I avoided git branches in favour of multiple copies of repositories because I use my prompt to provide location.  This provided the missing piece…

# Git me harder!
__git_ps1 ()
{
    local g="$(git rev-parse --git-dir 2>/dev/null)"
    if [ -n "$g" ]; then
        local r
        local b
        if [ -d "$g/../.dotest" ]
        then
            local b="$(git symbolic-ref HEAD 2>/dev/null)"
            r="|REBASING"
        elif [ -d "$g/.dotest-merge" ]
        then
            r="|REBASING"
            b="$(cat $g/.dotest-merge/head-name)"
        elif [ -f "$g/MERGE_HEAD" ]
        then
            r="|MERGING"
            b="$(git symbolic-ref HEAD 2>/dev/null)"
        else
            if [ -f $g/BISECT_LOG ]
            then
                r="|BISECTING"
            fi
            if ! b="$(git symbolic-ref HEAD 2>/dev/null)"
            then
                b="$(cut -c1-7 $g/HEAD)..."
            fi
        fi
        if [ -n "$1" ]; then
            printf "$1" "${b##refs/heads/}$r"
        else
            printf " (%s)" "${b##refs/heads/}$r"
        fi
    fi
}

PS1="${PS1//\\w/\\w\$(__git_ps1)}"

July 23, 2013

Rusty Russell: On Linux-Kernel Mailing List Behavior

As raised recently by Sarah Sharp, the Linux Kernel mailing list (lkml) has a reputation as an intimidating place.  The context (covered so well by LWN) was that Greg Kroah-Hartman, the stable maintainer, is seen as a soft touch who accepts patches Linus wouldn’t.

There’s been much uninformed discussion from those outside lkml, so let’s start with a common basis:

  1. Sarah Sharp is an established and respected kernel maintainer.  She’s made it.
  2. Linus (and other developers) are human, and sometimes write in anger.

Now my opinions, as someone who cares about this issue and has been working on the kernel for about 16 years.

The kernel mailing list is much friendlier than it used to be: some of its reputation is now undeserved.  Linus is unreserved in criticising code or actions, but rarely crosses into ad-hominem.  His absolutist statements reduce RTT by telling you what is required; geeks love to argue, but it’s pointless because it’s his git tree.

That said, imitating Linus on lkml causes problems; without his authority, loudly claiming absolutes is simply ranting.  This escalates until it’s remarkably hard to avoid crossing into personal attacks; most of us inevitably double-down when we’re criticized, and train-wreck ensues.

I plan to follow Sarah’s example and respond when someone’s abusive.  Making it clear what’s expected should make things more pleasant eventually.  It’s been about ten years since I decided to reduce my flames to a single post every year; I’m now going to aim for zero (aka. “What Would Sarah Sharp Do?”)

July 09, 2013

Rusty Russell: 6 Technical Things I Learned About Bitcoin

I’ve been collecting these as I research the bitcoin protocol, so I thought it was worth posting about.  None of these are groundbreaking, but these are what surprised me as I deepened my understanding.

10 Minute Blocks.  Currently 9 minutes.  But usually 7 minutes.

Everyone talks about a block every 10 minutes, but that’s the long-term mean.  Spikes in exchange rates are followed fairly closely by spikes in network hashrate, and ASIC miners are ramping up to meet demand.  As difficulty adjustment happens every 2016 blocks (ideally 2 weeks), there’s a lag. Over the life of bitcoin, and over the last year the average is almost exactly 600 seconds, but over the last 3 months it’s been 520 seconds.  The last month is 542 seconds, so hashrate acceleration is slowing.

But a subtler effect is shown when we look at the median, rather than the mean: it’s just under 7 minutes.  This is because the time to hit the target hash is not a normal distribution at all.  There’s probably a fancy name for this spike with an exponential tail, but I’ve graphed here a recent set of 2016 blocks (fortnight 115) showing the distribution of block times in minute-wide buckets.

Now, these stats were using timestamps in the blocks, rather than the actual observed times, but I’m assuming on average that they’re correct.
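
The "spike with an exponential tail" is simply the exponential distribution of waiting times for a (roughly) constant-rate Poisson process, whose median is ln(2) times the mean; that lines up with the observed just-under-7-minutes figure. A quick sanity check from the shell:

# median of an exponential distribution = ln(2) * mean; with a 600 second mean:
echo 'l(2) * 600 / 60' | bc -l    # ~6.93 minutes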

Actually, 10.005 Minute Blocks

The bitcoin client calculates how long an interval took by subtracting the timestamp at the beginning of the 2016-block interval from the timestamp at the end. There are 2015 spaces between 2016 blocks, but the code divides by 2016. I'm sure no one else cares about that 0.3 second mistake, since block times are never that precise anyway.
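
A quick check of that off-by-one from the shell:

# 2016 blocks span only 2015 intervals, so the target gets stretched by 2016/2015
echo '600 * 2016 / 2015 / 60' | bc -l    # ~10.005 minutes instead of 10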

Politics In The Genesis Block.  Or Not.

It’s common to point to the text in the very first block “The Times 03/Jan/2009 Chancellor on brink of second bailout for banks” as a political statement by Satoshi.  While I’m sure the headline amused the author, we need look no further than the initial Bitcoin Paper, section 3:

A timestamp server works by taking a hash of a block of items to be timestamped and widely publishing the hash, such as in a newspaper or Usenet post [2-5]. The timestamp proves that the data must have existed at the time, obviously, in order to get into the hash.

In other words, it simply proves that there was no pre-mining going on.  It would be interesting to get an accurate timestamp of the initial release of bitcoin and examine London Times headlines around that date to see if it was cherry-picked, or happy coincidence.

Crazy Address Encoding

Bitcoin addresses are a 25-byte number.  It's usually encoded using 58 characters (numbers and letters, omitting zero, capital I and O, and lower-case l to avoid confusion). Dividing by 58 is a bit of a pain, but doing crypto means we have big number libraries lying around which we can use.

But it’s not the straight encoding one might expect, which would result in 37 character addresses.  You might expect that leading zeroes can be omitted for compactness, but in fact, leading whole zero bytes are encoded separately. This gives variable-length addresses of between 27 and 34 characters and a second loop to encode and decode them. https://en.bitcoin.it/wiki/Base58Check_encoding

Anonymity Off By Default

Anonymity is hard, but I was surprised to see blockchain.info’s page about my donation to Unfilter correctly geolocated to my home town!  Perhaps it’s a fluke, but I was taken aback by how clear it was.

CVEs in Bitcoin

Like any software, there have been flaws in the bitcoin reference client: obviously there has been a great deal of scrutiny and concern.  Unlike most projects, there is a superb wiki page which details each vulnerability, with consequences and deployment status across the network: https://en.bitcoin.it/wiki/Common_Vulnerabilities_and_Exposures.

Corrections welcome!
Rusty.

July 02, 2013

Rusty Russell: VIRTIO Growing Up: OASIS Standard Technical Committee

http://www.oasis-open.org/committees/virtio

Over the last few years, interest in virtio has begun compounding.  FreeBSD have their bhyve implementation, there’s an MMIO bus and SCSI endpoint implementation, and I’ve been fielding more queries about various alternate implementations.  While it’s taken longer than I’d hoped, the effort hasn’t waned as I feared.

So I have carved out some time this year to turn this draft into a real, consensual standard with the trappings expected by those outside the normal Linux/KVM sphere (such as an IP policy). I know I said I’d never get involved in a standard process again after the FHS, but OASIS seems like the right umbrella to cleanly and efficiently run this effort.

There are limitations and workarounds in the current draft and implementations.  None are fatal, but they make a case for a flag day change for 1.0 (with backwards compatibility possible for implementations which want that).  More compelling, to me, is the chance for other vendors to get involved now and have their voices heard: after the standard is finalized, they’ll just have to follow along.

I look forward to polishing what we have, and making sure we can implement even more awesome things in future.

June 05, 2013

Harald Welte: Attending HITCON and COSCUP in Taipei

It is my pleasure to attend the HITCON 2013 and COSCUP 2013 conferences in July/August this year. They are both in Taipei. HITCON is a hacker/security event, while COSCUP is a pure Free/Open Source Software conference.

At both events I will be speaking about the growing list of GSM-related tools that are available these days, like OpenBSC, OsmocomBB, SIMtrace, OsmoSGSN, OsmoBTS, OsmoSDR, etc. As they are all FOSS projects and useful in a security context, this fits well within the scope of both events.

Given that I'm going to be back in Taiwan, I'm looking forward very much to meeting old friends and former colleagues from my Openmoko days in Taipei. God, do I miss those days. While terribly stressful, they are still the most exciting days of my career so far.

And yes, I'm also going to use the opportunity for a continuation of my motorbike riding in this beautiful country.

June 03, 2013

Harald Welte: Rest In Peace, Atul Chitnis

Today, very sad news has reached me: Atul Chitnis has passed away. Most people outside of India will most likely not recognize the name: He was instrumental in pioneering the BBS community in India, and was the founder and leader of the Linux Bangalore and later FOSS.in conferences, held annually in Bangalore.

I myself first met Atul about ten years ago, and had the honor of being invited to speak at many of the conferences he was involved in. Besides that professional connection, we became friends. The warmth and affection with which I was accepted by him and his family during my many trips to Bangalore is without comparison. I was treated and accepted like a family member, despite just being this random free software hacker from Germany who is always way too busy to return the amount of kindness.

Despite the 17 year age difference, there was a connection between the two of us. Not just the mutual respect for each others' work, but something else. It might have been partially due to his German roots. It might have been the similarities in our journey through technology. We both started out in the BBS community with analog modems, we both started to write DOS software in the past, before turning to Linux. We both became heavily involved in mobile technology around the same time: He during his work at Geodesic, I working for Openmoko. Only in recent years his indulgence in Apple products was slightly irritating ;)

Only five weeks ago I visited Atul. Given the state of his health, it was clear that this might very well be the last time we would meet. I'm sad that this has now actually turned out to be the truth. It would have been great to meet again at the end of the year (the typical FOSS.in schedule).

My heartfelt condolences to his family, particularly to his wonderful wife Shubha, his daughter Anjali, his mother and his brother [whom I'm not calling by name in this post as they deserve some privacy and their identities are not listed on Atul's Wikipedia page].

Atul was 51 years old. Way too young to die. Yet he has managed to create a legacy that will extend long beyond his life. He profoundly influenced generations of technology enthusiasts in India and beyond.

April 01, 2013

Rusty Russell: Thanks for the Bitcoin donation!

Last week I used 2 BTC to support Jupiter Broadcasting’s Unfilter show (and their other shows, but only Unfilter takes BTC so far).  Just now I noticed that someone made a 0.5BTC donation to my blog (I’ve had a BTC donation address in the sidebar of my blog for a few years now).  Thanks!

As I promised to pass donations onwards, I googled for bitcoin donations, and chose the following places to give 0.05 BTC each:

  1. Juice Rap News for making high-baud political commentary (Unfilter in rap form)
  2. Freedom Box for actually doing something about Internet freedom.
  3. Torservers.net (as recommended by torproject.org) for the same.
  4. f-droid.org for keeping a healthy Open alternative.
  5. Bitcoin Foundation to support and strengthen the infrastructure that made this possible.
  6. The Free Software Foundation even though I don’t always agree with them.
  7. Wikileaks for recognizing something society needs, even if they stumble at delivery.
  8. The Internet Archive for something that only gets more useful over time.

There are two left to go, so I’ll keep an eye out for more opportunities to donate in the next few weeks…

-0.05

March 29, 2013

Harald Welte: Hardware outage affecting osmocom.org, deDECTed.org, gpl-violations.org

As usual, Murphy's law dictates that problems will occur at the worst possible moment. One of my servers in the data center died on March 20, and it was the machine which hosts the majority of the free software projects that I've created or am involved in, from people.netfilter.org to OpenPCD and OpenEZX to gpl-violations.org and virtually all osmocom.org sites and services.

Recovery was slow as there is no hot spare and none of my other machines in the data center have backplanes for the old SCA-80 hard disks that are in use by that particular machine. So we had to send the disks to Berlin, wait until I'm back there, and then manually rsync everything over to a different box in the data center.

To my big surprise, not many complaints reached me (and yes, my personal and/or business e-mail was not affected in any way).

Recovery is complete now, and I'm looking forward to things getting back to normal soon.

Harald Welte: OsmoDevCon 2013 preparation update

OsmoDevCon 2013 is getting closer every day, and I'm very much looking forward to meeting the fellow developers of the various Osmocom sub-projects. Organization-wise, the catering has now been sorted out, and Holger has managed to get a test license for two ARFCNs from the regulatory body without any trouble.

This means that we're more or less all set. The key needs to be picked up from IN-Berlin, and we need to bring some extra extension cords, an ethernet switch, power cords and other gear, but those are really only very minor tasks.

There's not as much of a formal schedule as we had last year, which is good, as I hope it means we can focus on getting actual work done, as opposed to spending most of the time updating one another about our respective work and progress.

March 20, 2013

Rusty Russell: GCC and C vs C++ Speed, Measured.

With the imminent release of gcc 4.8, GCC has finally switched to C++ as the implementation language.  As usual, LWN has excellent coverage.  Those with long memories will remember Linux trying to use g++ back in 1992 and retreating in horror at the larger, slower code.  The main benefit was stricter typechecking, particularly for enums (a great idea: I had -Wstrict-enum patches for gcc about 12 years ago, which was a superset of the -Wenum-compare we have now, but never got it merged).

With this in mind, and Ian Taylor’s bold assertion that “The C subset of C++ is as efficient as C”, I wanted to test what had changed with some actual measurements.  So I grabbed gcc 4.7.2 (the last release which could do this), and built it with C and C++ compilers:

  1. ../gcc-4.7.2/configure --prefix=/usr/local/gcc-c --disable-bootstrap --enable-languages=c,c++ --disable-multiarch --disable-multilib
  2. ../gcc-4.7.2/configure --prefix=/usr/local/gcc-cxx --disable-bootstrap --enable-languages=c,c++ --disable-multiarch --disable-multilib --enable-build-with-cxx

The C++-compiled binaries are slightly larger, though that’s mostly debug info:

  1. -rwxr-xr-x 3 rusty rusty 1886551 Mar 18 17:13 /usr/local/gcc-c/bin/gcc
    text       data        bss        dec        hex    filename
    552530       3752       6888     563170      897e2    /usr/local/gcc-c/bin/gcc
  2. -rwxr-xr-x 3 rusty rusty 1956593 Mar 18 17:13 /usr/local/gcc-cxx/bin/gcc
    text       data        bss        dec        hex    filename
    552731       3760       7176     563667      899d3    /usr/local/gcc-cxx/bin/gcc

Then I used them both to compile a clean Linux kernel 10 times:

  1. for i in `seq 10`; do time make -s CC=/usr/local/gcc-c/bin/gcc 2>/dev/null; make -s clean; done
  2. for i in `seq 10`; do time make -s CC=/usr/local/gcc-cxx/bin/gcc 2>/dev/null; make -s clean; done

Using stats --trim-outliers, which throws away the best and worst, we have the times for the remaining 8:

  1. real    14m24.359000-35.107000(25.1521+/-0.62)s
    user    12m50.468000-52.576000(50.912+/-0.23)s
    sys    1m24.921000-27.465000(25.795+/-0.31)s
  2. real    14m27.148000-29.635000(27.8895+/-0.78)s
    user    12m50.428000-52.852000(51.956+/-0.7)s
    sys    1m26.597000-29.274000(27.863+/-0.66)s

So the C++-compiled binaries are measurably slower, though not noticeably: it's about 865 seconds vs 868 seconds, or about 0.3%.  Even if a kernel compile spends half its time linking, statting, etc, that's under a 1% slowdown.

And it’s perfectly explicable by the larger executable size.  If we strip all the gcc binaries, and do another 10 runs of each (… flash forward to the next day.. oops, powerfail, make that 2 days later):

  1. real    14m24.659000-33.435000(26.1196+/-0.65)s
    user    12m50.032000-57.701000(50.9755+/-0.36)s
    sys    1m26.057000-28.406000(26.863+/-0.36)s
  2. real    14m26.811000-29.284000(27.1308+/-0.17)s
    user    12m51.428000-52.696000(52.156+/-0.39)s
    sys    1m26.157000-27.973000(26.869+/-0.41)s

Now the difference is 0.1%, pretty much in the noise.

Summary: so whether you like C++ or not, the performance argument is moot.

February 08, 2013

Harald Welte: Update on what I've been doing

For the better part of a year, this blog has failed to provide you with many updates on what I've been doing. This is somewhat related to a shift away from doing freelance work on mainline / FOSS projects like the Linux kernel.

In April 2011, Holger and I started a new company here in Berlin (sysmocom - systems for mobile communications GmbH). This company, among other things, attempts to provide products and services surrounding the various mobile communications related FOSS projects, particularly OpenBSC, OsmoSGSN, OpenGGSN, but also OsmocomBB, and now also OsmoBTS + OsmoPCU, two integral components of our own BTS product called sysmoBTS.

Aside from the usual software development, this entails a variety of other tasks, technical and non-technical. First of all, I did more electrical engineering than I had in all the years since Openmoko. And even there, I was only leading the hardware architecture, and didn't actually have to capture schematics or route PCBs myself. So now there were some general-purpose and some customer-specific circuits that had to be done. I really enjoy that work, sometimes even more than software development. The early/initial design phase in particular can be quite exciting: selecting components, figuring out how to interconnect them, whether you can fit all of them together in the given amount of GPIOs and other resources of your main CPU, etc. But then even hand-soldering the first couple of boards is fun, too.

The area I have had the least exposure to so far is casing and mechanical issues. Luckily we have a contractor working on that for us, but there are still all kinds of things that can go wrong, where unpopulated PCB footprints suddenly make contact with a case, or all kinds of problems related to manufacturing tolerances. Another topic is packaging. After all, you want the products to end up in the hands of the customer in a neat, proper and form-fitting package.

On the other hand, there is a lot of administrative work. Sourcing components can sometimes be a PITA, particularly if even distributors like Digikey conspire against you and don't carry the low quantities of a component that we need for our 100-board low-quantity runs. EMC and other measurements for CE approval are a fun topic, too. I've never been involved in those personally, and it has been an interesting venture. Luckily, at least for sysmoBTS, things are looking quite promising now. Customs paperwork and import/export-related bureaucracy (both in Germany as well as in other countries) always hold new surprises, despite me having dealt with customs for more than 10 years now.

A significant amount of time is also spent evaluating suppliers and their products, e.g. items like SIM/USIM cards, cavity duplexers, antennas, cables, adapters, power amplifiers and other RF-related accessories for our products.

The thing that really caught me off-guard is the German law on inventory accounting. Basically there is no threshold for low-quantity goods, so as a capital company (GmbH/AG) you have to account for each and every fscking SMD resistor or capacitor. And then you not only have to count all those parts, but also put a value on them. Depending on the type of item, you have to use either the purchasing price, or the current market price if you were to buy it again, or the price you expect to sell the item for. Furthermore, the trade-law requirements on inventory accounting are different from the tax laws, not infrequently with contradictory aims ;)

In the end it seems the best possible strategy is to put a lot of the low-value inventory into the garbage bin before the end of the financial year, as the value of the product (e.g. 130 SMD resistors in 0402 worth fractions of cents) is so much lower than the cost of counting it. Now that's of course an environmental sin, especially if you consider lots and lots of small and medium-sized companies ending up at that conclusion :(

So all in all, this should give you something of an explanation of why there has been less activity on this blog about exciting technical things. On the one hand, those things might relate to customer projects which are of a confidential nature. On the other hand, they might simply be boring things like dealing with transport damage to cavity duplexers from China, or with FedEx billing customs/import fees to the wrong address...

Overall I still have the feeling that I was writing a decent amount of code in 2012 - although there can never be enough :) Most of it was probably either related to OsmoBTS, OpenBSC/OsmoNITB or the various Erlang SS7/TCAP/MAP related projects. The list of more community-oriented projects with long TODO lists is growing, though. I'd like to work on SIMtrace MITM / card emulation support, the CC32RS512 based smartcard OS, libosmosim (there's a first branch in libosmocore.git). Let's hope I can find a bit more time for that kind of stuff this year. You should never give up hope, they say ;)

February 04, 2013

Harald Welte: Back from FOSDEM 2013

As (almost) every year, I attended the annual incarnation of FOSDEM. It is undoubtedly (one of?) the most remarkable events about Free Software in existence. No registration, no fees, 24 tracks in parallel, an estimated 5000 attendees. I also like that it brings together people from so many different communities, not _just_ the Linux or Gnome or KDE or Telephony or Legal people, but a good mixture of everything.

I have to congratulate the organizers, who manage to pull this off year after year. And as opposed to many other events, they do so quietly and without much recognition, I feel. I'd also like to thank the many volunteers working tirelessly before, at and after the event. Last, but not least, I'd like to thank the local university (ULB Solbosch) for hosting the event.

What made me truly sad, though, is the amount of littering that surprisingly many of the attendees did. This was particularly visible in the Cafeteria. Imagine an event run by volunteers, who put in a lot of time and effort. Imagine an event where food and drinks are sold by volunteers at such low prices that there can barely be any profit at all. And then imagine people eating there and leaving all their rubbish around, as if they were in some kind of restaurant where they are being served and where somebody is cleaning up after them. It really makes me feel very bitter to see this. Don't people realize that those very volunteers who are creating the event will then have to put in _their_ spare time just because those who just enjoyed their coffee or lunch didn't take the extra 30 seconds to bring their trash to the trashcan? I feel ashamed for members of our community who behave this way. Please think next time before acting and show your respect to the people behind FOSDEM.

Harald Welte: Talk Idea: How to write code to make later enforcement easy

During FOSDEM 2013, I spoke with some fellow Free Software developers about how my knowledge of copyright, and specifically the legal aspects of software copyright, has influenced the way I write code, and particularly how I design the architecture of programs.

This made me realize that this would probably make a quite interesting talk at Free Software conferences: How to architect and write code in order to make later [GPL] enforcement easy.

Of course there are all the general and mostly well-known rules like keeping track of who owns which part of the copyright, having proper copyright claims and license headers, etc.

But I'm thinking more along the lines of: how do I write code in a way that ensures that people extending it with their own code will be forced to create a derivative work? If that is the case, they will have absolutely no choice but to also license that code under the GPL.

This is particularly important in the case of GPL licensed libraries. The common understanding in the community is that writing an executable program against a GPL licensed library will constitute a derivative work and thus the main program must be licensed under the GPL, if it is ever distributed.

However, in reality there is of course no precedent, and in some particular cases, the legal framework, depending on the jurisdiction, might come to different conclusions if it ever ended up in court. The claim of a 'derivative work' would be particularly weak if the main program is only using a set of standard function calls whose function declarations are the same in many versions of the GPL licensed library you link against. So let's assume there was a GPL licensed standard C library for stuff like open(), close(), printf() and the like. I think it would be very difficult to argue in court that a program written against those functions and linked against such a library would constitute a derivative work of the library. As in fact, there are many other implementations providing the exact same interface, under different licenses, and the API was not even drafted by the author of the GPL licensed implementation.

So I think there are some things that an author of an (intentionally) GPL licensed library can do while writing the code, which will later help him to establish that an executable program is a derived work.

The same is true to some extent for executable programs, too. I very intentionally did not introduce a plug-in interface for BTS drivers in OpenBSC, even though technically it would have been possible. I _want_ somebody who adds code for a different BTS to touch the main code of the program instead of just writing an external plugin. The mere fact that he has to edit the main program in order to add a new BTS driver indicates that he is creating a derivative work.

So I'll probably try to submit a talk on this topic to some upcoming conference[s]. If you think this is an interesting topic and want me to talk about it at a FOSS related event, please feel free to send me an e-mail.

January 16, 2013

Rusty Russell: Looking forward to linux.conf.au 2013

This year’s organizers took specific pains to attract deep content, and the schedule reflects that: there are very few slots where I’m not torn between two topics.  This will be great fun!

After a little introspection, I did not submit a talk this year.  My work in 2012 was with Linaro helping with KVM on ARM: that topic is better addressed by Christoffer Dall, so I convinced him to submit (unfortunately, he withdrew as January became an untenable time for him to travel).  My other coding work was incremental, not revolutionary: neither module signatures, CCAN nor ntdb shook the ground this year.  There just wasn't anything I was excited about: a reliable litmus test.

See you at LCA!

Harald Welte: Why I hate phone calls so much

The fact that I have more than 20 missed phone calls on my land line telephone after only half a day has passed triggers me to write this blog post.

It is simply impossible to get any productive work done if there are synchronous interruptions. If I'm doing any even remotely complex task such as analyzing code, designing electronics or whatever else, then the interruption of the flow of thoughts, and the context switch to whatever the phone call might be about is costing me an insurmountable amount of my productive efficiency. I doubt that I am the only one having that feeling / experience.

So why on earth does everybody think they are entitled to interrupt my work at any given point in time they desire? Why do they think whatever issue they have justifies an immediate interruption in what I am doing? To me, an unscheduled phone call almost always feels like an insult. It is a severe intrusion into my work-flow, and has a very high cost to me in terms of lost productivity.

Sure, there are exceptional absolute emergencies (like, a medical emergency of a family member). But just about anything else can be put in an e-mail, which I can respond to at a time of my choosing, i.e. at a time I am not deeply buried into some other task that requires expensive context switching and the associated loss of productivity. And yes, a response might be the same day, some days later, or even a week or more later. There are literally hundreds of mails of dozens of people that need to be responded to. I can never even remotely answer all of them in a timely manner, even if I'm working 12-14 hours a day up to 7 days a week.

Right now I'm doing the only reasonable thing that is left: Switch off all phones. And to anyone out there intending to contact me: Please think twice before calling me on the phone. Almost anything can be put in an e-mail. And if you really want to have a phone call, please request a scheduled phone call in an e-mail containing a very detailed agenda and explanation of the topic.

January 02, 2013

Harald Welte: Strain of bad luck

From roughly September to December 2012 I seem to have had a quite unusual strain of bad luck and set-backs. I don't want to go into the details here, as most of the issues are of quite private nature.

This has kept me quite distracted from a lot of my other activity. Projects like the various Osmocom sub-projects, gpl-violations.org are in desperate need of attention, and I have severely neglected my responsibilities in the Chaos Computer Club Berlin e.V. :(

I don't even want to talk about actual paid work, where customers also had to put up with repeated schedule slips and lack of availability.

I let down friends and colleagues at a number of occasions, as I was unable to keep up with anything that remotely resembles my typical work schedule.

Last but not least, I regrettably have also not felt much of an urge to write many blog posts here.

My sincere hope and expectation is that things are going to improve quickly in 2013. At least most of issues from the last half year have been resolved. Now I need to work through a considerable back-log of work and find more time for my volunteer projects in the FOSS and hacker worlds. However, this will need some time and I would like to ask for some patience. I do intend to be up to speed with things just like before.

In this spirit, I am looking forward to a productive and exciting 2013. Happy hacking und viel Spass am Gerät (and have fun with your gear)

December 24, 2012

Rusty Russell: Fixed-length semi-lockless queues revisited

There were some great comments on my previous post, both in comments here and on the Google Plus post which points to it.  I'd like to address the points here, now that I've had a few moments to do follow-up work.

One anonymous commenter, as well as Stephen Hemminger via email, point to the existing lockless queue code in liburcu.  I had actually waded through this before (I say waded, because it's wrapped in a few layers which I find annoying; there's a reason I write little CCAN modules).  It's clever and genuinely lockless; my hat off to its authors, but it only works in conjunction with RCU.  In particular, it's an unlimited-length queue which uses a dummy element to avoid ever being empty, and relies on the fact that it can safely traverse the '->next' entry even as an element is being dequeued, because the rules say you can't alter that field or free the object until later.

Stephen also pointed me to Kip Macy's buf_ring from FreeBSD; it uses two producer counters, prod_head and prod_tail.  The consumer looks at prod_tail as usual, the producers compare-and-swap increment prod_head, then place their element, then wait for prod_tail to catch up with prod_head before incrementing prod_tail.  Reimplementing this in my code showed it to be slower than the lower-bit-to-lock case for my benchmarks, though not much (the main difference is in the single-producer-using-multiple-producer-safe-routines, which are the first three benchmarks).  I ignored the buf_ring consumer, which uses a similar two-counter scheme for consumers, which is only useful for debugging, and used the same consumer code as before.

Arjan van de Ven makes several excellent points.  Firstly, that transaction-style features may allow efficient lock elision in upcoming Intel CPUs (and, of course, PowerPC has announced transaction support for Power8), so we'll have to revisit this in a few years when that reaches me.

His more immediate point is that uncontended locks are really cheap on recent CPUs; cheaper than cache-hot compare-and-swap operations.  All the benchmarks I did involve everyone banging on the queue all the time, so I'm only measuring the contended cases.  So I hacked my benchmarks to allow for "0 consumers" by having the producer discard all the queue contents every time it filled.  Similarly, filling the queue with junk when it's empty gives a "0 producers" benchmark.

Here we can see that the dumb, single lock comes into its own, being twice as fast as my optimal-when-contended version.  If we just consider the common case of a single writer and a single reader, the lockless implementation takes 24ns in the contended case, and 14ns in the uncontended cases, whereas the naive locked implementation takes 107ns in the contended case and 7ns in the uncontended case.  In other words, you’d have to be uncontended over 90% of the time to win.  That can’t happen in a naive implementation which wakes the consumer as soon as the first item has been inserted into the queue (and if you implement a batch version of queue_insert, the atomic exchange gets amortized, so it gets harder to beat).
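
Working the break-even point out of those figures (just back-of-the-envelope arithmetic on the numbers quoted above): the naive lock wins when 7p + 107(1-p) < 14p + 24(1-p), where p is the fraction of uncontended operations, i.e. when p exceeds 83/90:

# fraction of uncontended operations needed for the naive lock to come out ahead
echo 'scale=3; 83/90' | bc    # ~0.922, i.e. a bit over 92% uncontended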

For the moment, I’m sticking with the previous winner; there’s still much to do to turn it into a usable API.

November 06, 2012

Jeremy Kerr: I've moved!

All of the stuff from this site is now at jk.ozlabs.org.

If you're looking for a new RSS feed, you can find that at jk.ozlabs.org/feeds/rss/.

See you on the other side!

September 24, 2012

Stephen Hemminger: VXLAN for Linux



Just published a Linux kernel implementation of VXLAN for possible inclusion in the 3.7 kernel (patches).
For those unfamiliar with VXLAN, here are some common questions.

Q: What is VXLAN?

It is a standard protocol to transfer layer 2 Ethernet packets over UDP.

Q: What is the VXLAN protocol?

The standard is under development, the current draft RFC is at version 2.

Q: Why do we need yet another tunnel protocol? Why not just use GRE?

Existing tunnel protocols depend on properties of the backbone which may not be available. Generic Routing Encapsulation works by tunneling over IP and may be blocked at routers by firewalls that only accept TCP and UDP.

Q: Does Openvswitch already do VXLAN?

The development version of Openvswitch does have VXLAN support, but OVS is fundamentally different from normal Linux networking. Many people may not want to take the jump into OVS, and there are many cases where existing Linux networking technologies are easier to configure and use.

Q: What could VXLAN in Linux be used for?

It could be used to terminate VXLAN in a Linux router, to link Linux bridges across hypervisors, or to talk to legacy, expensive virtualization products.
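
As a rough illustration of the bridge-linking case, something like the following should work once the patches and a matching iproute2 are in place (the device names, VNI and multicast group here are made-up examples, and the last step assumes an existing bridge br0):

# create a VXLAN device with VNI 42 using multicast group 239.1.1.1 over eth0,
# bring it up and attach it to an existing bridge
ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth0
ip link set vxlan0 up
brctl addif br0 vxlan0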

Q: Why is VXLAN cool?

Read the blogosphere; here are some good starting points:


Q: That's too technical; what can I show my manager?

There is a short introductory video on the fundamentals of VXLAN




June 22, 2012

Jesper Dangaard Brouer: Upload local git repo, to remote site


When working on something, you should always start a git repository.
It's so much easier to track and undo your changes.

(Simply run "git init" in the directory, and add all the files with "git add".)

But what happens when you decide you want to publish your work (or just want a backup copy on a remote server)?

Here is how you upload your local git repo to a remote git server (and start using the remote as the master).

Log into your git server, and create a "bare" git repo, via commands:
ssh user@git.example.dk
mkdir -p ~/git-repos/next-big-thing.git
cd !$
git init --bare
echo "My next-big-thing git repo" > description
On your machine with your existing/local git repo, "upload" the repository via a force push:

git push --force user@git.example.dk:git-repos/next-big-thing.git master

Your local git repo is not yet using the remote git repo as the master; let's change that
(view your current remote via cmd: "git config remote.origin.url")
git config remote.origin.url user@git.example.dk:git-repos/next-big-thing.git
Congratulations, you are done.

You have now moved your local git repo to a remote git host, and are using it as the master.
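
For reference, the same result can be had with a named remote instead of editing the config entry directly; an alternative sketch using the same example host and path:

git remote add origin user@git.example.dk:git-repos/next-big-thing.git
git push -u origin master    # -u marks the remote branch as upstream for master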

February 28, 2011

Stephen Hemminger: net-snmp ip-forward table performance problems

Some performance problems are hard and complex, but others seem to be
due to just plain stupidity. This is the saga of the SNMP daemon and a
full BGP route table. Way back in 2006, Vyatta discovered that if an
SNMP walk was done on a server with a full BGP route table, it would
peg the CPU and never complete. A full BGP route table is 500K entries
or more so it does a good job of exposing scalability nightmares. The
initial fix was to disable the caching of the route table in SNMP
which made it return no entries. Hardly a good fix, but returning
nothing is better than crashing.

I began investigating with the simple tools of packet capture with wireshark
and syscall capturing with strace. The first discovery was that each
request caused the TCP wrappers library to open and read
/etc/hosts.allow and /etc/hosts.deny. Bogus on two counts:

  1. Debian is shipping the 2 files with no real entries, only comments.
    Each packet caused the file to be read but there was really no data.
    It would have been better to have the file not exist and have the open fail.
  2. But for our distribution, there was no point in enabling
    TCP wrappers anyway.

The fix was simple: disable tcp-wrappers.

The net-snmp daemon retrieves the ipv4 and ipv6 routing tables the old
school way, through /proc. This isn't a total disaster, but since the
route entries in /proc start with an interface name and net-snmp wants
an ifindex, it looks up each entry. That is 300K extra ioctl
calls. The short-term hack was to just cache the last ifname -> ifindex
translation; later I replaced it with a netlink route dump, which gives
the ifindex directly (surprisingly, a netlink route dump is already used
in another MIB).

The next observation was that it is stupid to use snmpwalk to walk
the whole system; snmpbulkwalk should be used instead. This helps,
but the walk still would not complete.
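
For comparison, the client side of that observation looks roughly like this (host and community string are placeholders, and the table name assumes the IP-FORWARD-MIB is available; snmpbulkwalk uses GETBULK and therefore needs SNMPv2c or later):

# one GETNEXT round-trip per row versus batched GETBULK requests
snmpwalk     -v2c -c public router.example.org ipCidrRouteTable
snmpbulkwalk -v2c -c public router.example.org ipCidrRouteTable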

The real discovery came when looking at the net-snmp container
code. Internally, net-snmp uses an object-ish abstraction to store
data, and the main ones are a flat table and a linked list. The table
is stored in sorted order for fast lookup and sequential access. New
entries are placed at the end of the table and a dirty bit is set for
the next lookup. The problem is that each insert also does a lookup for
duplicates, which causes a sort. This makes inserts do a quicksort for
each entry -- there is the scalability problem.

To make it more interesting, net-snmp creates the route table
twice. First it reads the table from /proc and puts entries in one table,
then walks that table to create the cache table used for lookup.

Loading the cache with non-scalable insertion takes several minutes on
a really fast machine, and the cache timeout is 30 seconds. This
ends up causing the CPU load because each request finds a dirty cache
and does a full reload.

Now for the good news: fixing the insert wasn't that hard. The first
step was realizing that the temporary table doesn't have to be a table
container; instead it can be changed to a FIFO (linked list). The FIFO
container is O(1) on insert. The actual cache container requires a
different approach. The table container has an unused flag to allow
duplicates in the table. Turning on the ALLOW_DUPLICATES flag makes
inserts much faster because the table is not sorted until the first
request. These changes get the table load down to less than 5 seconds
on a fast machine.

Lastly, a couple of other improvements help as well. When the
binary_table is expanded, the code would calloc a new area, copy the
old data and then free the original. This is much worse than just
using realloc, which can usually do in-place expansion when the table is
getting large. The sort function can be optimized to avoid calling the
comparison function, and by using a faster insertion sort for small
subsections. These get the load down to less than a second.

Extra credit to the first developer who implements a new net-snmp
container using something better suited for big tables, like an AVL tree or a B-tree.

January 10, 2011

Jesper Dangaard Brouer: Bufferbloat: Wireless is worse than expected

Jim Gettys also pointed out that bufferbloat also exists on Wifi wireless connections[1].

I didn't take his word for it, but tested it myself. And the result was far worse than I expected! In optimal conditions, sitting next to the Wifi AP, I can easily introduce a 600 ms delay (0.6 sec), and moving further away I quickly see latency approaching 1 sec. See Gettys' tests[1] for more info.

This bufferbloat issue is going to get worse as we get more bandwidth on our broadband connections. This means that we have not seen the problem at its full extent yet.
Thus, let's fix it before it gets out of hand!

An interesting property of Wifi bufferbloat is that the queue/bloat happens on your own machine (when uploading).
Linux is to blame, big time!
At some point we, the Linux kernel developers, increased the default transmit queue length (txqueuelen) from 100 packets to 1000 packets (this happened when the netcards went from 100Mbit/s to 1000Mbit/s).

The major problem here is that wireless devices also inherited this default txqueuelen setting of 1000 packets.

This is fortunately easily fixed/adjusted via e.g. the command:
ifconfig wlan0 txqueuelen 10
But unfortunately, this does not remove all the TX bufferbloat in the system. The Wifi driver and hardware also have significant TX buffering, as Gettys also describes[2].
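
For those who have already dropped net-tools, the iproute2 equivalent of the ifconfig command above should be:

# same adjustment via iproute2
ip link set dev wlan0 txqueuelen 10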

Setting the txqueuelen to 1, I still experienced an average delay of 96 ms and a maximum of 116 ms, at a link rate of 24 Mbit/s.

24 Mbit/s * 116 ms = 348000 bytes
= 232 packets (with 1500 bytes packets)
Thus, a hardware buffer of at least 232 packets. Gettys reports that his wifi hardware has 255 packets of buffering.

Fixing the hardware Wifi bufferbloat is harder, as we need to fix each individual driver, and most of them do not export the functionality to tools like ethtool.
We need to at least mitigate the wifi bufferbloat to some sane level.
Next we can talk about choosing a queueing strategy (AQM) for wifi networks.

Links:
[1] http://gettys.wordpress.com/2010/12/02/home-router-puzzle-piece-two-fun-with-wireless/
[2] http://gettys.wordpress.com/2011/01/03/aggregate-bufferbloat-802-11-and-3g-networks/
Copyright (C) 2001-2010 by the respective authors.