Planet Netfilter

November 15, 2015

Harald Welte: GSM test network at 32C3, after all

Contrary to my blog post yesterday, it looks like we will have a private GSM network at the CCC congress again, after all.

It appears that Vodafone Germany (who was awarded the former DECT guard band in the 2015 spectrum auctions) is not yet using it in December, and they agreed that we can use it at the 32C3.

With this approval from Vodafone Germany we can now go to the regulator (BNetzA) and obtain the usual test license. Given that we have been granted such licenses in the past, and that Vodafone has agreed, this should be a mere formality.

For the German language readers who appreciate the language of the administration, it will be a Frequenzzuteilung für Versuchszwecke im nichtöffentlichen mobilen Landfunk (a frequency allocation for experimental purposes in non-public land mobile radio).

So thanks to Vodafone Germany, who at least this time enabled us to run a network again. By the end of 2016 you can be sure they will have put their new spectrum to use, so I'm not that optimistic that this will be possible again.

Harald Welte: No GSM test network at 32C3

I currently don't assume that there will be a GSM network at the 32C3.

Ever since OpenBSC was created in 2008, the annual CCC congress was a great opportunity to test OpenBSC and related software with thousands of willing participants. In order to do so, we obtained a test licence from the German regulatory authority. This was never any problem, as there was a chunk of spectrum in the 1800 MHz GSM band that was not allocated to any commercial operator, the so-called DECT guard band. It's called that way as it was kept free in order to ensure there is no interference between 1800 MHz GSM and the neighboring DECT cordless telephones.

Over the decades, it was determined at the EU level that this guard band might not be necessary, or at least not if certain precautions are taken for BTSs deployed in that band.

When the German regulatory authority re-auctioned the GSM spectrum earlier this year, they decided to also auction the frequencies of the former DECT guard band. The DECT guard band was awarded to Vodafone.

This is a pity, as it means that people involved in cellular research or the development of cellular technology now find it significantly harder to actually test their systems.

In some other EU member states, such as the Netherlands or the UK, it is easier: there, the DECT guard band was not treated like just another chunk of the GSM bands, but put under special rules. Not so in Germany.

To make a long story short: Without the explicit permission of one of the commercial mobile operators, it is not possible to run a test/experimental network like the ones we used to run at the annual CCC congress.

Given that

  • the event is held in the city center (where frequencies are typically used and re-used quite densely), and
  • an operator has nothing to gain from permitting us to test our open source GSM/GPRS implementations,

I think there is little chance that this will become a reality.

If anyone has really good contacts to the radio network planning team of a German mobile operator and wants to prove me wrong: Feel free to contact me by e-mail.

Thanks to everyone involved with the GSM team at the CCC events, particularly Holger Freyther, Daniel Willmann, Stefan Schmidt, Jan Luebbe, Peter Stuge, Sylvain Munaut, Kevin Redon, Andreas Eversberg, Ulli (and everyone else whom I may have forgotten, my apologies). It's been a pleasure!

Thanks also to our friends at the POC (Phone Operation Center) who have provided interfacing to the DECT, ISDN, analog and VoIP network at the events. Thanks to roh for helping with our special patch requests. Thanks also to those entities and people who borrowed equipment (like BTSs) in the pre-sysmocom years.

So long, and thanks for all the fish!

Harald Welte: Progress on the Linux kernel GTP code

It is always sad if you start to develop some project and then never get around to finishing it, as there are too many things to take care of in parallel. But then, days only have 24 hours...

Back in 2012 I started to write some generic Linux kernel GTP tunneling code. GTP is the GPRS Tunneling Protocol, a protocol between core network elements in GPRS networks, later extended to be used in UMTS and even LTE networks.

GTP is split into a control plane for management and a user plane carrying the actual user IP traffic of a mobile subscriber. So if you're reading this blog via a cellular internet connection, your data is carried in GTP-U within the cellular core network.

To me as a former Linux kernel networking developer, the user plane of GTP (GTP-U) had always belonged in kernel space. It is a tunneling protocol not too different from many other tunneling protocols that already exist (GRE, IPIP, L2TP, PPP, ...), and for the user plane, all it does is basically add a header in one direction and remove the header in the other direction. That per-packet work is best done in the kernel, particularly in networks with many subscribers and/or high bandwidth use.
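
For illustration, here is a minimal sketch of the mandatory GTPv1-U header as specified in 3GPP TS 29.281; the struct name and comments are mine, not taken from the kernel patches. Encapsulation is prepending these 8 bytes (plus optional extensions), decapsulation is stripping them:

#include <stdint.h>

struct gtp1u_hdr {
    uint8_t  flags;   /* version (3 bits) = 1, PT = 1, plus E/S/PN flags */
    uint8_t  type;    /* message type; 0xff = G-PDU, i.e. a user IP packet */
    uint16_t length;  /* length of the payload, network byte order */
    uint32_t teid;    /* Tunnel Endpoint Identifier, network byte order */
} __attribute__((packed));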

Also, unlike many other telecom / cellular protocols, GTP is an IP-only protocol with no E1, Frame Relay or ATM legacy. It also has nothing to do with SS7, nor does it use ASN.1 syntax and/or some exotic encoding rules. In summary, it is nothing like any other GSM/3GPP protocol, and looks much more like what you're used to from the IETF/Internet world.

Unfortunately I didn't get very far with my code back in 2012, but luckily Pablo Neira (one of my colleagues from netfilter/iptables days) picked it up and carried it forward. It then stalled for some time, until it was thankfully picked up by Andreas Schultz; it now receives some attention and discussion, with the clear intention to finish it and submit it for mainline inclusion.

The code is now kept in a git repository at

Thanks to Pablo and Andreas for picking this up, let's hope this is the last coding sprint before it goes mainline and gets actually used in production.

Harald Welte: Osmocom Berlin meetings

Back in 2012, I had the idea of holding a regular, bi-weekly meeting of people interested in mobile communications technology, not strictly related to the Osmocom projects and software. This was initially called the Osmocom User Group Berlin. The meetings were held twice per month in the rooms of the Chaos Computer Club Berlin.

There are plenty of people that were or still are involved with Osmocom one way or another in Berlin. Think of zecke, alphaone, 2b-as, kevin, nion, max, prom, dexter, myself - just to name a few.

Over the years, I got "too busy" and was no longer able to attend regularly. Some people kept it alive (thanks to dexter!), but eventually the meetings were discontinued in 2013.

In October 2015 I revived the meetings; two have been held already, and the third is coming up next week, on November 11.

I'm happy that I had the idea of re-starting the meeting. It's good to meet old friends and new people alike. Both times there actually were some new faces around, most of which even had a classic professional telecom background.

In order to emphasize that the focus is explicitly not on Osmocom alone (and particularly not only on its users), I decided to rename the event to the Osmocom Meeting Berlin.

If you're in Berlin and are interested in mobile communications technology on the protocol and radio side of things, feel free to join us next Wednesday.

Harald Welte: Germany's excessive additional requirements for VAT-free intra-EU shipments


At my company sysmocom we are operating a small web-shop providing small tools and accessories for people interested in mobile research. This includes programmable SIM cards, SIM card protocol tracers, adapter cables, duplexers for cellular systems, GPS disciplined clock units, and other things we consider useful to people in and around the various Osmocom projects.

We of course ship domestically, inside the EU, and world-wide. And that's where the trouble starts, at least since 2014.

What are VAT-free intra-EU shipments?

As many readers of this blog (at least the European ones) know, inside the EU there is a system by which intra-EU sales between businesses in EU member countries are performed without charging VAT.

This is the result of different countries having different VAT rates, and of the fact that a business can always deduct the VAT it pays on its purchases from the VAT it has to charge on its sales. In order to avoid having to file VAT returns in each of the countries of your suppliers, the suppliers simply ship their goods without charging VAT in the first place.

In order to have checks and balances, both the supplier and the recipient have to file declarations to their tax authorities, indicating the sales volume and the EU VAT ID of the respective business partners.

So far so good. This concept was reasonably simple to implement and it makes the life easier for all involved businesses, so everyone participates in this scheme.

Of course there have always been some obstacles, particularly here in Germany. For example, you are legally required to confirm the EU VAT ID of the buyer before issuing a VAT-free invoice. This confirmation request can be done online.

However, the German tax authorities invented something unbelievable: a Web API for the confirmation of EU VAT IDs that has opening hours. Despite having rightfully been at the center of ridicule by the German internet community for many years, it still remains in place. So there are certain times of day when you cannot verify EU VAT IDs, and thus cannot sell products VAT-free ;)

But even that, one has gotten used to living with.


Now in recent years (since January 1st, 2014), the German authorities came up with the concept of the Gelangensbescheinigung. To the German reader, this newly invented word already sounds ugly enough. A literal translation is difficult, as it sounds really clumsy: think of something like a "reaching-its-destination certificate".

So now it is no longer sufficient to simply verify the EU VAT ID of the buyer, issue the invoice and ship the goods; you also have to produce such a Gelangensbescheinigung for each and every VAT-free intra-EU shipment. This document needs to include:

  • the name and address of the recipient
  • the quantity and designation of the goods sold
  • the place and month when the goods were received
  • the date of when the document was signed
  • the signature of the recipient (not required in the case of an e-mail whose headers show that the message was transmitted from a server under the control of the recipient)

How can you produce such a statement? Well, in the ideal / legal / formal case, you provide a form to your buyer, which he then signs, certifying that he has received the goods in the destination country.

First of all, I find it offensive that I have to ask my customers to make such declarations in the first place. And even if I accept this and go ahead with it, it is my legal responsibility to ensure that the customer actually fills it in.

What if the customer doesn't want to fill it in or forgets about it?

Then I as the seller am liable to pay 19% VAT on the purchase he made, despite me never having charged those 19%.

So not only do I have to generate such forms and send them with my goods, but I also need a business process for checking for their return and reminding customers whose forms have not yet come back; and in the end they can simply not return the form, and I lose money. Great.

Track+Trace / Courier Services

Now there are some alternate ways in which a Gelangensbescheinigung can be generated, for example by the track+trace protocol of the delivery company. However, the requirements on this track+trace protocol are so high that, at least when I checked in late 2013, the track+trace protocol of UPS did not fulfill them. For example, a track+trace protocol usually doesn't show the quantity and designation of goods. Why would it? UPS just moves a package from A to B, and there are no customs involved that would require knowing what's in the package.

Postal Packages

Now let's say you'd like to send your goods by postal service. For low-priced non-urgent goods, that's actually what you generally want to do, as everything else is simply way too expensive compared to the value of the goods.

However, this is only permitted if the postal service you use provides you with a receipt of having accepted your package, containing the following mandatory information:

  • name and address of the entity issuing the receipt
  • name and address of the sender
  • name and address of the recipient
  • quantity and type of goods
  • date of having received the goods

Now I don't know how this works in other countries, but in Germany you will not be able to get such a receipt from the post office.

In fact, I inquired several times with the legal department of Deutsche Post, up to the point of sending a registered letter (by Deutsche Post) to Deutsche Post. They never responded to any of those letters!

So we have the German tax authorities claiming yes, of course you can still do intra-EU shipments to other countries by postal service, you just need to provide a receipt; but at the same time they ask for a receipt indicating details that no postal receipt would ever show.

In particular, a postal receipt would never confirm what kind of goods you are sending. How would the postal service know? You hand them a package, and they transfer it. It is, rightfully, none of their business what its contents may be. So how can you ask them to confirm that certain goods were received for transport?!


So in summary:

Since January 1st, 2014, we have German tax regulations in force that make VAT-free intra-EU shipments extremely difficult, if not impossible:

  • The type of receipt they require from postal services is not provided by Deutsche Post, thereby making it impossible to use Deutsche Post for VAT-free intra-EU shipments
  • The type of track+trace protocol issued by UPS does not fulfill the requirements, making it impossible to use them for VAT-free intra-EU shipments
  • The only other option is to get an actual receipt from the customer. If the customer doesn't want to provide one, the German seller is liable to pay the 19% German VAT, despite never having charged it to his customer


To me, the conclusion of all of this can only be one:

German tax authorities do not want German sellers to sell VAT-free goods to businesses in other EU countries. They are actively trying to undermine the VAT principles of the EU. And nobody seems to complain about it or even realize there is a problem.

What a brave new world we live in.

Harald Welte: small tools: rtl8168-eeprom

Some time ago I wrote a small Linux command line utility that can be used to (re)program the Ethernet (MAC) address stored in the EEPROM attached to an RTL8168 Ethernet chip.

This is useful if, for example, you are a system integrator that has its own IEEE OUI range and you would like to put your own MAC address into devices that contain said Realtek Ethernet chips (which come pre-programmed with some other MAC address).

The source code can be obtained from:

Harald Welte: small tools: gpsdate

In 2013 I wrote a small Linux program that can be used to set the system clock based on the time received from a GPS receiver (via gpsd), particularly when a system is first booted. It is similar in purpose to ntpdate, but of course obtains time not from NTP but from the GPS receiver.

This is particularly useful for RTC-less systems without network connectivity, which come up with a completely wrong system clock that needs to be properly set as soon as the GPS receiver finally has acquired a signal.
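
For a flavor of what such a tool does, here is a minimal sketch of the idea (not the actual gpsdate source): wait for a fix from gpsd, then set the system clock. It assumes the older libgps API, where gps_read() takes a single argument and fix.time is a double UNIX timestamp; it also needs root privileges for settimeofday() and linking with -lgps:

#include <gps.h>
#include <math.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct gps_data_t gpsdata;

    if (gps_open("localhost", DEFAULT_GPSD_PORT, &gpsdata) != 0)
        return 1;
    gps_stream(&gpsdata, WATCH_ENABLE, NULL);

    /* Wait (up to 5s per poll) until gpsd reports a usable fix. */
    while (gps_waiting(&gpsdata, 5000000)) {
        if (gps_read(&gpsdata) == -1)
            break;
        if (gpsdata.fix.mode >= MODE_2D && !isnan(gpsdata.fix.time)) {
            struct timeval tv = { .tv_sec = (time_t)gpsdata.fix.time };
            if (settimeofday(&tv, NULL) == 0)
                printf("system clock set from GPS\n");
            break;
        }
    }
    gps_close(&gpsdata);
    return 0;
}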

I asked the ntp hackers if they were interested in merging it into the official code base, and their response was (summarized) that with a then-future release of ntpd this would no longer be needed. So the gpsdate program remains an external utility.

So in case anyone else might find the tool interesting: The source code can be obtained from

Harald Welte: Deutsche Bank / unstable interfaces

Deutsche Bank is a large, international bank. They offer services world-wide and are undoubtedly proud of their massive corporate IT department.

Yet, at the same time, they get the most fundamental principles of user/customer-visible interfaces wrong: don't change them. And if you need to change them, manage the change carefully.

In many software projects, keeping the API or other interface stable is paramount. Think of the Linux kernel, where breaking a userspace-visible interface is not permitted. The reasons are simple: If you break that interface, _everyone_ using that interface will need to change their implementation, and will have to synchronize that with the change on the other side of the interface.

The internet online banking system of Deutsche Bank in Germany permits the upload of transactions by their customers in a CSV file format.

And guess what? They changed the file format from one day to the next:

  • without informing their users in advance, giving them time to adapt their implementations of that interface
  • without documenting the exact nature of the change
  • adding new fields to the CSV in the middle of the line, rather than at the end of the line, to make sure things break even more (see the sketch below)
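
The last point matters because ad-hoc CSV consumers typically pick columns by position. A small illustration (the column layout and separator here are invented, not Deutsche Bank's actual format): appending a field keeps an old parser working, while inserting one mid-line silently shifts every later column.

#include <stdio.h>
#include <string.h>

/* Return the nth semicolon-separated field of line (modifies line). */
static char *field(char *line, int n)
{
    char *tok = strtok(line, ";");
    while (n-- > 0 && tok)
        tok = strtok(NULL, ";");
    return tok;
}

int main(void)
{
    char old_fmt[] = "2015-11-02;Invoice 1234;-100.00";
    char new_fmt[] = "2015-11-02;DE12345;Invoice 1234;-100.00";

    /* A parser hard-coded to "amount is field 2" for the old format: */
    printf("old: %s\n", field(old_fmt, 2));   /* "-100.00": correct      */
    printf("new: %s\n", field(new_fmt, 2));   /* "Invoice 1234": garbage */
    return 0;
}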

Now if you're running a business and depend on automating your payments using the interface provided by Deutsche Bank, this means you suddenly fail to pay your suppliers on time, and you hastily drop/delay other (paid!) work in order to figure out what exactly Deutsche Bank decided to change, completely unannounced, from one day to the next.

I might have expected this from a hobbyist kind of project. But seriously, from one of the world's leading banks? For an interface that is probably used by thousands and thousands of users? WTF?!

Harald Welte: The VMware GPL case

My absence from blogging meant that I didn't really publicly comment on the continued GPL violations by VMware, and the 2015 legal case that well-known kernel developer Christoph Hellwig has brought forward against VMware.

The most recent update by the Software Freedom Conservancy on the VMware GPL case can be found at

In case anyone ever doubted: I of course join the ranks of the long list of Linux developers and other stakeholders that consider VMware's behavior completely unacceptable, if not outrageous.

For many years they have been linking modified Linux kernel device drivers and entire kernel subsystems into their proprietary vmkernel software (part of ESXi). As an excuse, they have added a thin shim layer under GPLv2 which they call vmklinux. And to make all of this work, they had to add lots of vmklinux-specific APIs to the proprietary vmkernel. All the code runs as one program, in one address space, in the same thread of execution. So basically, it is the closest possible form of integration between two pieces of code: function calls within the same thread/process.

In order to make all this work, they had to modify their vmkernel, implement vmklinux and also heavily modify the code they took from Linux in the first place. So the drivers are not usable with mainline linux anymore, and vmklinux is not usable without vmkernel either.

If all of the above is not a clear indication that the multiple pieces of code form one work/program (and consequently must be licensed under GNU GPLv2), then what would be?

To me, it is probably one of the strongest cases one can find on the question of derivative works and the GPL(v2). Of course, all my ramblings have no significance in a court, and the judge may rule based on reports of questionable technical experts. But I'm convinced that if the court were well-informed and understood the actual situation here, it would have to rule in favor of Christoph Hellwig and the GPL.

What I really don't get is why VMware puts up the strongest possible defense one can imagine. Not only did they not back down in lengthy out-of-court negotiations with the Software Freedom Conservancy, but they also defend themselves vigorously against the claims in court.

In my many years of doing GPL enforcement, I've rarely seen such dedication and strong opposition. This shows the true nature of VMware as a malicious, unfair entity that doesn't give a damn about other people's copyright, about the Free Software community and its code of conduct as a whole, or about the Linux kernel developers in particular.

So let's hope they waste a lot of money in their legal defense, get a sufficient amount of negative PR out of this to the point of tainting their image, and finally obtain a ruling upholding the GPL.

All the best to Christoph and the Conservancy in fighting this fight. For those readers that want to help their cause, I believe they are looking for more supporter donations.

Harald Welte: What I've been busy with

Those who don't know me personally and/or don't stay in touch more closely might be wondering what on earth happened to Harald in the last year or more.

The answer would be long, but I can summarize it to I disappeared into sysmocom. You know, the company that Holger and I founded four years ago, in order to commercially support OpenBSC and related projects, and to build products around it.

In recent years the team has grown, to the point where in 2015 we suddenly had 9 employees and a handful of freelancers working for us.

But then, that's still a small company, and based on the projects we're involved in, that team has to cover a variety of topics (next to the actual GSM/GPRS-related work), including:

  • mechanical engineering (enclosure design)
  • all types of electrical engineering
    • AC/electrical wiring/fusing on DIN rails
    • AC/DC and isolated DC/DC power supplies (based on modules)
    • digital design
    • analog design
    • RF design
  • prototype manufacturing and testing
  • software development
    • bare-iron bootloader/os/application on Cortex-M0
    • NuttX on Cortex-M3
    • OpenAT applications on Sierra Wireless
    • custom flavors of Linux on several different ARM architectures (TI DaVinci, TI Sitara)
    • drivers for various peripherals including Ethernet Switches, PoE PSE controller
    • lots of system-level software for management, maintenance, control

I've been involved in literally all of those topics, with more of my time spent on the electronics side than on the software side. And within software, more on the bootloader/RTOS side than on applications.

So what did we actually build? It's unfortunately still not possible to disclose fully at this point, but it was all related to marine communications technology. GSM being one part of it, but only one of many in the overall picture.

Given the quite challenging breadth of the tasks at hand and the problems to solve, I'm actually surprised how much we could achieve with such a small team in a limited amount of time. But then, it left virtually no time for anything else: no blogging, no progress on the various Osmocom Erlang projects for core network protocols, and last but not least no Taiwan holidays this year.

Lately I see light at the end of the tunnel, and there is again a bit more time to get back to old habits, and thus I

  • resurrected this blog from the dead
  • resurrected various project homepages that have disappeared
  • started some more work on actual telecom stuff (osmo-iuh, for example)
  • restarted the Osmocom Berlin Meeting

Harald Welte: Weblog + homepage online again

On October 31st, 2014, I rebooted my main server for a kernel upgrade, and could not mount the LUKS crypto volume ever again. While the technical cause for this remains a mystery to this day (it has spawned some conspiracy theories), I finally took some time to recover some bits and pieces from elsewhere. I didn't want this situation to drag on for more than a year...

Rather than bringing online the old content using sub-optimal and clumsy tools to generate static content (web sites generated by docbook-xml, blog by blosxom), I decided to give it a fresh start and try nikola, a more modern and actively maintained tool to generate static web pages and blogs.

The blog is now available at (a redirect from the old /weblog is in place, for those who keep broken links for more than 12 months). The RSS feed URLs are different from before, but there are again per-category feeds so people (and planets) can subscribe to the respective category they're interested in.

And yes, I do plan to blog again more regularly, to make this place not just an archive of a decade of blogging, but a place that is alive and thrives with new content.

My personal web site is available at while my (similarly re-vamped) freelancing business web site is also available again at

I still need to decide what to do about the old site. It still has its old manual web 1.0 structure from the late 1990s.

I've also resurrected and as well as (old content). Next in line is, which I also intend to convert to nikola for maintenance reasons.

October 20, 2015

Rusty Russell: ccan/mem’s memeqzero iteration

On Thursday I was writing some code, and I wanted to test if an array was all zero.  First I checked if ccan/mem had anything, in case I missed it, then jumped on IRC to ask the author (and overall CCAN co-maintainer) David Gibson about it.

We bikeshedded around names: memallzero? memiszero? memeqz? memeqzero() won by analogy with the already-extant memeq and memeqstr. Then I asked:

rusty: dwg: now, how much time do I waste optimizing?
dwg: rusty, in the first commit, none

Exactly five minutes later I had it implemented and tested.

The Naive Approach: Times: 1/7/310/37064 Bytes: 50

bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;

    while (length) {
        if (*p)
            return false;
        p++;
        length--;
    }
    return true;
}

As a summary, I've given the nanoseconds for searching through 1, 8, 512 and 65536 bytes only; those are the "Times" in each heading.

Another 20 minutes, and I had written that benchmark, and an optimized version.

128-byte Static Buffer: Times: 6/8/48/5872 Bytes: 108

Here’s my first attempt at optimization; using a static array of 128 bytes of zeroes and assuming memcmp is well-optimized for fixed-length comparisons.  Worse for small sizes, much better for big.

bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    static unsigned long zeroes[16];

    while (length > sizeof(zeroes)) {
        if (memcmp(zeroes, p, sizeof(zeroes)))
            return false;
        p += sizeof(zeroes);
        length -= sizeof(zeroes);
    }
    return memcmp(zeroes, p, length) == 0;
}

Using a 64-bit Constant: Times: 12/12/84/6418 Bytes: 169

dwg: but blowing a cacheline (more or less) on zeroes for comparison, which isn’t necessarily a win

Using a single zero uint64_t for comparison is pretty messy:

bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    const unsigned long zero = 0;
    size_t pre;

    pre = (size_t)p % sizeof(unsigned long);
    if (pre) {
        size_t n = sizeof(unsigned long) - pre;
        if (n > length)
            n = length;
        if (memcmp(p, &zero, n) != 0)
            return false;
        p += n;
        length -= n;
    }
    while (length > sizeof(zero)) {
        if (*(unsigned long *)p != zero)
            return false;
        p += sizeof(zero);
        length -= sizeof(zero);
    }
    return memcmp(&zero, p, length) == 0;
}

And, worse in every way!

Using a 64-bit Constant With Open-coded Ends: Times: 4/9/68/6444 Bytes: 165

dwg: rusty, what colour is the bikeshed if you have an explicit char * loop for the pre and post?

That’s slightly better, but memcmp still wins over large distances, perhaps due to prefetching or other tricks.

Epiphany #1: We Already Have Zeroes: Times 3/5/92/5801 Bytes: 422

Then I realized that we don't need a static buffer: we know everything we've already tested is zero!  So I open-coded the first 16-byte compare, then memcmp()ed against the previous bytes, doubling each time.  Then a final memcmp for the tail.  Clever, huh?

But it was no faster than the static buffer case on the high end, and much bigger.

dwg: rusty, that is brilliant. but being brilliant isn’t enough to make things work, necessarily :p

Epiphany #2: memcmp can overlap: Times 3/5/37/2823 Bytes: 307

My doubling logic above was because my brain wasn’t completely in phase: unlike memcpy, memcmp arguments can happily overlap!  It’s still worth doing an open-coded loop to start (gcc unrolls it here with -O3), but after 16 bytes it’s worth memcmping with the previous 16 bytes.  This is as fast as the naive code with as little as 2 bytes, and the fastest solution by far with larger numbers:

bool memeqzero(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t len;

    /* Check first 16 bytes manually */
    for (len = 0; len < 16; len++) {
        if (!length)
            return true;
        if (*p)
            return false;
        p++;
        length--;
    }

    /* Now we know that's zero, memcmp with self. */
    return memcmp(data, p, length) == 0;
}

You can find the final code in CCAN (or on Github) including the benchmark code.

Finally, after about 4 hours of random yak shaving, it turns out lightning doesn’t even want to use memeqzero() any more!  Hopefully someone else will benefit.

October 01, 2015

Eric Leblond: My “Kernel packet capture technologies” talk at KR2015

I’ve just finished my talk on Linux kernel packet capture technologies at Kernel Recipes 2015. I would like to thank the organizers for their great work. I also thank Frank Tizzoni for the drawing.


In that talk, I tried to give an overview of the history of packet capture technologies in the Linux kernel, all of that seen from userspace and from a Suricata developer’s perspective.

You can download the slides here: 2015_kernel_recipes_capture.

August 15, 2015

Rusty Russell: Broadband Speeds, New Data

Thanks to edmundedgar on reddit I have some more accurate data to update my previous bandwidth growth estimation post: OFCOM UK, who released their November 2014 report on average broadband speeds.  Whereas Akamai numbers could be lowered by the increase in mobile connections, this directly measures actual broadband speeds.

Extracting the figures gives:

  1. Average download speed in November 2008 was 3.6Mbit
  2. Average download speed in November 2014 was 22.8Mbit
  3. Average upload speed in November 2014 was 2.9Mbit
  4. Average upload speed in November 2008 to April 2009 was 0.43Mbit/s

So in 6 years, downloads went up by 6.333 times, and uploads went up by 6.75 times.  That’s an annual increase of 36% for downloads and 37% for uploads; that’s good, as it implies we can use download speed factor increases as a proxy for upload speed increases (as upload speed is just as important for a peer-to-peer network).
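
For anyone who wants to check that arithmetic: the per-annum figures are compound annual growth rates, i.e. the n-th root of the total growth factor. A quick sketch (compile with -lm):

#include <math.h>
#include <stdio.h>

/* Compound annual growth rate from a total growth factor over n years. */
static double cagr(double factor, double years)
{
    return pow(factor, 1.0 / years) - 1.0;
}

int main(void)
{
    printf("downloads: %.0f%%\n", 100 * cagr(22.8 / 3.6, 6));  /* ~36% */
    printf("uploads:   %.0f%%\n", 100 * cagr(2.9 / 0.43, 6));  /* ~37% */
    return 0;
}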

This compares with my previous post’s Akamai’s UK numbers of 3.526Mbit in Q4 2008 and 10.874Mbit in Q4 2014: only a factor of 3.08 (26% per annum).  Given how close Akamai’s numbers were to OFCOM’s in November 2008 (a year after the iPhone UK release, but probably too early for mobile to have significant effect), it’s reasonable to assume that mobile plays a large part of this difference.

If we assume Akamai’s numbers reflected real broadband rates prior to November 2008, we can also use them to extend the OFCOM data back a year: this is important since there was almost no bandwidth growth according to Akamai from Q4 2007 to Q4 2008: ignoring that period gives a rosier picture than my last post, and smells of cherry-picking data.

So, let’s say the UK went from 3.265Mbit in Q4 2007 (Akamai numbers) to 22.8Mbit in Q4 2014 (OFCOM numbers).  That’s a factor of 6.98, or 32% increase per annum for the UK. If we assume that the US Akamai data is under-representing Q4 2014 speeds by the same factor (6.333 / 3.08 = 2.056) as the UK data, that implies the US went from 3.644Mbit in Q4 2007 to 11.061 * 2.056 = 22.74Mbit in Q4 2014, giving a factor of 6.24, or 30% increase per annum for the US.

As stated previously, China is now where the US and UK were 7 years ago, suggesting they’re a reasonable model for future growth for that region.  Thus I revise my bandwidth estimates; instead of 17% per annum this suggests 30% per annum as a reasonable growth rate.

August 04, 2015

Rusty Russell: The Bitcoin Blocksize: A Summary

There’s a significant debate going on at the moment in the Bitcoin world; there’s a great deal of information and misinformation, and it’s hard to find a cogent summary in one place.  This post is my attempt, though I already know that it will cause me even more trouble than that time I foolishly entitled a post “If you didn’t run code written by assholes, your machine wouldn’t boot”.

The Technical Background: 1MB Block Limit

The bitcoin protocol is powered by miners, who gather transactions into blocks, producing a block every 10 minutes (but it varies a lot).  They get a 25 bitcoin subsidy for this, plus whatever fees are paid by those transactions.  This subsidy halves every 4 years: in about 12 months it will drop to 12.5.

Full nodes on the network check transactions and blocks, and relay them to others.  There are also lightweight nodes which simply listen for transactions which affect them, and trust that blocks from miners are generally OK.

A normal transaction is 250 bytes, and there’s a hard-coded 1 megabyte limit on the block size.  This limit was introduced years ago as a quick way of preventing a miner from flooding the young network, though the original code could only produce 200kb blocks, and the reference code still defaults to a 750kb limit.
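
As a back-of-envelope check of what those numbers imply (the 250-byte transaction, 1MB limit and 10-minute block interval are from above; nothing else is assumed):

#include <stdio.h>

int main(void)
{
    double block_bytes = 1e6, tx_bytes = 250, block_secs = 600;
    double tx_per_block = block_bytes / tx_bytes;

    printf("%.0f transactions/block, %.1f transactions/sec\n",
           tx_per_block, tx_per_block / block_secs);  /* 4000, ~6.7 */
    return 0;
}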

In the last few months there have been increasing runs of full blocks, causing backlogs for a few hours.  More recently, someone deliberately flooded the network with normal-fee transactions for several days; any transactions paying less fees than those had to wait for hours to be processed.

There are 5 people who have commit access to the bitcoin reference implementation (aka. “bitcoin-core”), and they vary significantly in their concerns on the issue.

The Bitcoin Users’ Perspective

From the bitcoin users perspective, blocks should be infinite, and fees zero or minimal.  This is the basic position of respected (but non-bitcoin-core) developer Mike Hearn, and has support from bitcoin-core ex-lead Gavin Andresen.  They work on the wallet and end-user side of bitcoin, and they see the issue as the most urgent.  In an excellent post arguing why growth is so important, Mike raises the following points, which I’ve paraphrased:

  1. Currencies have network effects. A currency that has few users is simply not competitive with currencies that have many.
  2. A decentralised currency that the vast majority can’t use doesn’t change the amount of centralisation in the world. Most people will still end up using banks, with all the normal problems.
  3. Growth is a part of the social contract. It always has been.
  4. Businesses will only continue to invest in bitcoin and build infrastructure if they are assured that the market will grow significantly.
  5. Bitcoin needs users, lots of them, for its political survival. There are many people out there who would like to see digital cash disappear, or be regulated out of existence.

At this point, it’s worth mentioning another bitcoin-core developer: Jeff Garzik.  He believes that the bitcoin userbase has been promised that transactions will continue to be almost free.  When a request to change the default mining limit from 750kb to 1M was closed by the bitcoin lead developer Wladimir van der Laan as unimportant, Jeff saw this as a symbolic moment.

What Happens If We Don’t Increase Soon?

Mike Hearn has a fairly apocalyptic view of what would happen if blocks fill.  That was certainly looking likely when his post was written, but due to episodes where the blocks were full for days, wallet designers are (finally) starting to estimate fees for timely processing (miners process higher-fee transactions first).  Some wallets and services didn’t even have a way to change the fee setting, leaving users stranded during high-volume events.

It now seems that the bursts of full blocks will arrive with increasing frequency; proposals are fairly mature now to allow users to post-increase fees if required, which (if all goes well) could make for a fairly smooth transition from the current “fees are tiny and optional” mode of operation to a “there will be a small fee” mode.

But even if this rosy scenario is true, this avoids the bigger question of how high fees can become before bitcoin becomes useless.  1c?  5c?  20c? $1?

So What Are The Problems With Increasing The Blocksize?

In a word, the problem is miners.  As mining has transitioned from a geek pastime, through semi-hobbyist operations, to large operations with cheap access to power, it has become more concentrated.

The only difference between bitcoin and previous cryptocurrencies is that instead of a centralized “broker” to ensure honesty, bitcoin uses an open competition of miners. Given bitcoin’s endurance, it’s fair to count this a vital property of bitcoin.  Mining centralization is the long-term concern of another bitcoin-core developer (and my coworker at Blockstream), Gregory Maxwell.

Control over half the block-producing power, and you control who can use bitcoin, and can cheat anyone not using a full node themselves.  Control over 2/3, and you can force a rule change on the rest of the network by stalling it until enough people give in.  Central control is also a single point at which the network can be shut down; that lets others apply legal or extra-legal pressure to restrict the network.

What Drives Centralization?

Bitcoin mining is more efficient at scale. That was to be expected[7]. However, the concentration has come much faster than expected because of the invention of mining pools.  These pools tell miners what to mine, in return for a small (or in some cases, zero) share of profits.  They save setup costs, they’re easy to use, and miners get more regular payouts.  This has caused bitcoin to reel from one centralization crisis to another over the last few years; the number of full nodes has declined precipitously by some measures[5] and continues to fall[6].

Consider the plight of a miner whose network is further away from most other miners.  They find out about new blocks later, and their blocks get built on later.  Both these effects cause them to create blocks which the network ignores, called orphans.  Some orphans are the inevitable consequence of miners racing for the same prize, but the orphan problem is not symmetrical.  Being well connected to the other miners helps, but there’s a second effect: if you discover the previous block, you’ve a head-start on the next one.  This means a pool which has 20% of the hashing power doesn’t have to worry about delays at all 20% of the time.

If the orphan rate is very low (say, 0.1%), the effect can be ignored.  But as it climbs, the pressure to join a pool (the largest pool) becomes economically irresistible, until only one pool remains.

Larger Blocks Are Driving Up Orphan Rates

Large blocks take longer to propagate, increasing the rate of orphans.  This has been happening as blocks increase.  Blocks with no transactions at all are smallest, and so propagate fastest: they still get a 25 bitcoin subsidy, though they don’t help bitcoin users much.

Many people assumed that miners wouldn’t overly centralize, lest they cause a clear decentralization failure and drive the bitcoin price into the ground.  That assumption has proven weak in the face of climbing orphan rates.

And miners have been behaving very badly.  Mining pools orchestrate attacks on each other with surprising regularity; DDOS and block withholding attacks are both well documented[1][2].  A large mining pool used their power to double spend and steal thousands of bitcoin from a gambling service[3].  When it was noticed, they blamed a rogue employee.  No money was returned, nor any legal action taken.  It was hoped that miners would leave for another pool as they approached majority share, but that didn’t happen.

If large blocks can be used as a weapon by larger miners against small ones[8], it’s expected that they will be.

More recently (and quite by accident) it was discovered that over half the mining power aren’t verifying transactions in blocks they build upon[4].  They did this in order to reduce orphans, and one large pool is still doing so.  This is a problem because lightweight bitcoin clients work by assuming anything in the longest chain of blocks is good; this was how the original bitcoin paper anticipated that most users would interact with the system.

The Third Side Of The Debate: Long Term Network Funding

Before I summarize, it’s worth mentioning the debate beyond the current debate: long term network support.  The minting of new coins decreases with time; the plan of record (as suggested in the original paper) is that total transaction fees will rise to replace the current mining subsidy.  The schedule of this is unknown and generally this transition has not happened: free transactions still work.

The block subsidy as I write this is about $7000.  If nothing else changes, miners would want $3500 in fees in 12 months when the block subsidy halves, or about $2 per transaction.  That won’t happen; miners will simply lose half their income.  (Perhaps eventually they form a cartel to enforce a minimum fee, causing another centralization crisis? I don’t know.)

It’s natural for users to try to defer the transition as long as possible, and the practice in bitcoin-core has been to aggressively reduce the default fees as the bitcoin price rises.  Core developers Gregory Maxwell and Pieter Wuille feel that signal was a mistake; that fees will have to rise eventually and users should not be lulled into thinking otherwise.

Mike Hearn in particular has been holding out the promise that it may not be necessary.  On this he is not widely supported: the idea being that some users would offer to pay more so other users can continue to pay less.

It’s worth noting that some bitcoin businesses rely on the current very low fees and don’t want to change; I suspect this adds bitterness and vitriol to many online debates.


The bitcoin-core developers who deal with users most feel that bitcoin needs to expand quickly or die, that letting fees emerge now will kill expansion, and that the infrastructure will improve over time if it has to.

Other bitcoin-core developers feel that bitcoin’s infrastructure is dangerously creaking, that fees need to emerge anyway, and that if there is a real emergency a blocksize change could be rolled out within a few weeks.

At least until this is resolved, don’t count on future bitcoin fees being insignificant, nor promise others that bitcoin has “free transactions”.


[1] “Bitcoin Mining Pools Targeted in Wave of DDOS Attacks” Coinbase 2015

[2] “Block Withholding Attacks – Recent Research” N T Courtois 2014

[3] “GHash.IO and double-spending against BetCoin Dice” mmtech et al. 2013

[4] “Questions about the July 4th BIP66 fork”

[5] “350,000 full nodes to 6,000 in two years…” P Todd 2015

[6] “Reachable nodes during the last 365 days.”

[7] “Re: Scalability and transaction rate” Satoshi 2010

[8] “[Bitcoin-development] Mining centralization pressure from non-uniform propagation speed” Pieter Wuille 2015

July 08, 2015

Rusty Russell: The Megatransaction: Why Does It Take 25 Seconds?

Last night f2pool mined a 1MB block containing a single 1MB transaction.  This scooped up some of the spam which has been going to various weakly-passworded “brainwallets”, gaining them 0.5569 bitcoins (on top of the normal 25 BTC subsidy).  You can see the megatransaction on

It was widely reported to take about 25 seconds for bitcoin core to process this block: this is far worse than my “2 seconds per MB” result in my last post, which was considered a pretty bad case.  Let’s look at why.

How Signatures Are Verified

The algorithm to check a transaction input (of this form) looks like this:

  1. Strip the other inputs from the transaction.
  2. Replace the input script we’re checking with the script of the output it’s trying to spend.
  3. Hash the resulting transaction with SHA256, then hash the result with SHA256 again.
  4. Check the signature correctly signed that hash result.

Now, for a transaction with 5570 inputs, we have to do this 5570 times.  And the bitcoin core code does this by making a copy of the transaction each time, and using the marshalling code to hash that; it’s not a huge surprise that we end up spending 20 seconds on it.

How Fast Could Bitcoin Core Be If Optimized?

Once we strip the inputs, the result is only about 6k long; hashing 6k 5570 times takes about 265 milliseconds (on my modern i3 laptop).  We have to do some work to change the transaction each time, but we should end up under half a second without any major backflips.
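
Here’s a minimal benchmark sketch of that estimate, using OpenSSL’s one-shot SHA256() as a stand-in for bitcoin core’s own implementation (compile with -lcrypto; absolute numbers will of course vary by machine):

#include <openssl/sha.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    static unsigned char tx[6000];  /* stand-in for a stripped transaction */
    unsigned char h1[SHA256_DIGEST_LENGTH], h2[SHA256_DIGEST_LENGTH];
    struct timespec start, end;

    memset(tx, 0xaa, sizeof(tx));
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 5570; i++) {
        SHA256(tx, sizeof(tx), h1);   /* hash the transaction... */
        SHA256(h1, sizeof(h1), h2);   /* ...then hash the hash */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%.0f ms\n", (end.tv_sec - start.tv_sec) * 1000.0
                        + (end.tv_nsec - start.tv_nsec) / 1e6);
    return 0;
}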

Problem solved?  Not quite….

This Block Isn’t The Worst Case (For An Optimized Implementation)

As I said above, the amount we have to hash is about 6k; if a transaction has larger outputs, that number changes.  We can fit in fewer inputs though.  A simple simulation shows the worst case for a 1MB transaction has 3300 inputs and 406,000 bytes of outputs: simply doing the hashing for the input signatures takes about 10.9 seconds.  That’s only about two or three times faster than the bitcoind naive implementation.

This problem is far worse if blocks were 8MB: an 8MB transaction with 22,500 inputs and 3.95MB of outputs takes over 11 minutes to hash.  If you can mine one of those, you can keep competitors off your heels forever, and own the bitcoin network… Well, probably not.  But there’d be a lot of emergency patching, forking and screaming…

Short Term Steps

An optimized implementation in bitcoind is a good idea anyway, and there are three obvious paths:

  1. Optimize the signature hash path to avoid the copy, and hash in place as much as possible.
  2. Use the Intel and ARM optimized SHA256 routines, which increase SHA256 speed by about 80%.
  3. Parallelize the input checking for large numbers of inputs.

Longer Term Steps

A soft fork could introduce an OP_CHECKSIG2, which hashes the transaction in a different order.  In particular, it should hash the input script replacement at the end, so the “midstate” of the hash can be trivially reused.  This doesn’t entirely eliminate the problem, since the sighash flags can require other permutations of the transaction; these would have to be carefully explored (or only allowed with OP_CHECKSIG).

This soft fork could also place limits on how big an OP_CHECKSIG-using transaction could be.

Such a change will take a while: there are other things which would be nice to change for OP_CHECKSIG2, such as new sighash flags for the Lightning Network, and removing the silly DER encoding of signatures.

July 06, 2015

Rusty Russell: Bitcoin Core CPU Usage With Larger Blocks

Since I was creating large blocks (41662 transactions), I added a little code to time how long they take once received (on my laptop, which is only an i3).

The obvious place to look is CheckBlock: a simple 1MB block takes a consistent 10 milliseconds to validate, and an 8MB block took 79 to 80 milliseconds, which is nice and linear.  (A 17MB block took 171 milliseconds).

Weirdly, that’s not the slow part: promoting the block to the best block (ActivateBestChain) takes 1.9-2.0 seconds for a 1MB block, and 15.3-15.7 seconds for an 8MB block.  At least it’s scaling linearly, but it’s just slow.

So, 16 Seconds Per 8MB Block?

I did some digging.  Just invalidating and revalidating the 8MB block only took 1 second, so something about receiving a fresh block makes it worse. I spent a day or so wrestling with benchmarking[1]…

Indeed, ConnectTip does the actual script evaluation: CheckBlock() only does a cursory examination of each transaction.  I’m guessing bitcoin core is not smart enough to parallelize a chain of transactions like mine, hence the 2 seconds per MB.  On normal transaction patterns even my laptop should be about 4 times faster than that (but I haven’t actually tested it yet!).

So, 4 Seconds Per 8MB Block?

But things are going to get better: I hacked in the currently-disabled libsecp256k1, and the time for the 8MB ConnectTip dropped from 18.6 seconds to 6.5 seconds.

So, 1.6 Seconds Per 8MB Block?

I re-enabled optimization after my benchmarking, and the result was 4.4 seconds; that’s with libsecp256k1, for an 8MB block.

Let’s Say 1.1 Seconds for an 8MB Block

This is with some assumptions about parallelism; and remember this is on my laptop which has a fairly low-end CPU.  While you may not be able to run a competitive mining operation on a Raspberry Pi, you can pretty much ignore normal verification times in the blocksize debate.


[1] I turned on -debug=bench, which produced impenetrable and seemingly useless results in the log.

So I added a print with a sleep, so I could run perf.  Then I disabled optimization, so I’d get understandable backtraces with perf.  Then I rebuilt perf (it’s part of the kernel source package) because Ubuntu’s perf doesn’t demangle C++ symbols.  (Are we having fun yet?)  I even hacked up a small program to help run perf on just that part of bitcoind.   Finally, after perf failed me (it doesn’t show 100% CPU, no idea why; I’d expect to see main in there somewhere…) I added stderr prints and ran strace on the thing to get timings.

July 03, 2015

Rusty Russell: Wrapper for running perf on part of a program.

Linux’s perf competes with early git for the title of least-friendly Linux tool.  Because it’s tied to kernel versions, and the interfaces change fairly randomly, you can never figure out how to use the version you need to use (hint: always use -g).

But when it works, it’s very useful.  Recently I wanted to figure out where bitcoind was spending its time processing a block; because I’m a cool kid, I didn’t use gprof, I used perf.  The problem is that I only want information on that part of bitcoind.  To start with, I put a sleep(30) and a big printf in the source, but that got old fast.

Thus, I wrote “perfme.c“.  Compile it (requires some trivial CCAN headers) and link perfme-start and perfme-stop to the binary.  By default it runs/stops perf record on its parent, but an optional pid arg can be used for other things (eg. if your program is calling it via system(), the shell will be the parent).
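
For anyone curious about the mechanism rather than the details, here’s a minimal sketch of the perfme-start idea as I understand it; the real perfme.c in CCAN differs (it also handles the stop signalling), so treat this as an illustration only:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char pid[32];

    /* Default target: whoever spawned us (e.g. the shell, via system()). */
    snprintf(pid, sizeof(pid), "%d",
             argc > 1 ? atoi(argv[1]) : (int)getppid());

    /* Attach perf record to the target; perfme-stop would later end it. */
    execlp("perf", "perf", "record", "-g", "-p", pid, (char *)NULL);
    perror("exec perf");  /* only reached if the exec itself fails */
    return 1;
}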

June 25, 2015

Rusty Russell: Hashing Speed: SHA256 vs Murmur3

So I did some IBLT research (as posted to bitcoin-dev) and I lazily used SHA256 to create both the temporary 48-bit txids, and from them the 16-bit index offsets.  Each node has to produce these for every bitcoin transaction ID it knows about (ie. its entire mempool), which is normally less than 10,000 transactions, but we’d better plan for 1M given the coming blockpocalypse.

For txid48, we hash an 8-byte seed with the 32-byte txid; I ignored the 8-byte seed for the moment, and measured various implementations of SHA256 hashing 32 bytes on my Intel Core i3-5010U CPU @ 2.10GHz laptop (though note we’d be hashing 8 extra bytes for IBLT; the implementations are in CCAN):

  1. Bitcoin’s SHA256: 527.7+/-0.9 nsec
  2. Optimizing the block ending on bitcoin’s SHA256: 500.4+/-0.66 nsec
  3. Intel’s asm rorx: 314.1+/-0.3 nsec
  4. Intel’s asm SSE4: 337.5+/-0.5 nsec
  5. Intel’s asm RORx-x8ms: 458.6+/-2.2 nsec
  6. Intel’s asm AVX: 336.1+/-0.3 nsec

So, if you have 1M transactions in your mempool, expect it to take about 0.62 seconds of hashing to calculate the IBLT.  This is too slow (though it’s fairly trivially parallelizable).  However, we just need a universal hash, not a cryptographic one, so I benchmarked murmur3_x64_128:

  1. Murmur3-128: 23 nsec

That’s more like 0.046 seconds of hashing, which seems like enough of a win to add a new hash to the mix.

June 19, 2015

Rusty Russell: Mining on a Home DSL connection: latency for 1MB and 8MB blocks

I like data.  So when Patrick Strateman handed me a hacky patch for a new testnet with a 100MB block limit, I went to get some.  I added 7 digital ocean nodes, another hacky patch to prevent sendrawtransaction from broadcasting, and a quick utility to create massive chains of transactions.

My home DSL connection is 11Mbit down, and 1Mbit up; that’s the fastest I can get here.  I was CPU mining on my laptop for this test, while running tcpdump to capture network traffic for analysis.  I didn’t measure the time taken to process the blocks on the receiving nodes, just the first propagation step.

1 Megabyte Block

Naively, it should take about 10 seconds to send a 1MB block up my DSL line from first packet to last.  Here’s what actually happens, in seconds for each node:

  1. 66.8
  2. 70.4
  3. 71.8
  4. 71.9
  5. 73.8
  6. 75.1
  7. 75.9
  8. 76.4

The packet dump shows they’re all pretty much sprayed out simultaneously (bitcoind may do the writes in order, but the network stack interleaves them pretty well).  That’s why it’s 67 seconds at best before the first node receives my block (a bit longer, since that’s when the packet left my laptop).
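
The arithmetic behind that: my 1Mbit/s uplink takes about 8 seconds to push one copy of a 1MB block, but since the writes to all 8 peers are interleaved, each peer only has the complete block after nearly all 8 copies have gone out:

#include <stdio.h>

int main(void)
{
    double block_bits = 1e6 * 8;  /* 1MB block */
    double uplink = 1e6;          /* 1Mbit/s uplink */
    int peers = 8;

    printf("one peer:  %.0f s\n", block_bits / uplink);          /* ~8  */
    printf("all peers: %.0f s\n", peers * block_bits / uplink);  /* ~64 */
    return 0;
}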

8 Megabyte Block

I increased my block size, and one node dropped out, so this isn’t quite the same, but the times to send to each node are about 8 times worse, as expected:

  1. 501.7
  2. 524.1
  3. 536.9
  4. 537.6
  5. 538.6
  6. 544.4
  7. 546.7


Using the rough formula of 1-exp(-t/600), I would expect orphan rates of 10.5% generating 1MB blocks, and 56.6% with 8MB blocks; that’s a huge cut in expected profits.
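
That formula is just the Poisson probability that someone else finds a block during the t seconds your block spends propagating, given the 600-second average block interval. Plugging in the first-node times from above (compile with -lm):

#include <math.h>
#include <stdio.h>

/* P(another block is found while ours takes t seconds to propagate) */
static double orphan_rate(double t)
{
    return 1.0 - exp(-t / 600.0);
}

int main(void)
{
    printf("1MB (t=66.8s):  %.1f%%\n", 100 * orphan_rate(66.8));   /* 10.5 */
    printf("8MB (t=501.7s): %.1f%%\n", 100 * orphan_rate(501.7));  /* 56.6 */
    return 0;
}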


So what can a miner on a slow home connection do about it?

  • Get a faster DSL connection.  Though even an uplink 10 times faster would mean a 1.1% orphan rate with 1MB blocks, or 8% with 8MB blocks.
  • Only connect to a single well-connected peer (-maxconnections=1), and hope they propagate your block.
  • Refuse to mine any transactions, and just collect the block reward.  Doesn’t help the bitcoin network at all though.
  • Join a large pool.  This is what happens in practice, but raises a significant centralization problem.


And what could be done on the bitcoin side?

  • We need bitcoind to be smarter about ratelimiting in these situations, and stream serially.  Done correctly (which is hard), it could also help the bufferbloat which makes running a full node at home so painful when it propagates blocks.
  • Some kind of block compression, along the lines of Gavin’s IBLT idea. I’ve done some preliminary work on this, and it’s promising, but far from trivial.


June 03, 2015

Rusty Russell: What Transactions Get Crowded Out If Blocks Fill?

What happens if bitcoin blocks fill?  Miners choose transactions with the highest fees, so low fee transactions get left behind.  Let’s look at what makes up blocks today, to try to figure out which transactions will get “crowded out” at various thresholds.

Some assumptions need to be made here: we can’t automatically tell the difference between me taking a $1000 output and paying you 1c, and me paying you $999.99 and sending myself the 1c change.  So my first attempt was very conservative: only look at transactions with two or more outputs which were under the given thresholds (I used a nice round $200 / BTC price throughout, for simplicity).

(Note: I used bitcoin-iterate to pull out transaction data, and rebuild blocks without certain transactions; you can reproduce the csv files in the blocksize-stats directory if you want).

Paying More Than 1 Person Under $1 (< 500000 Satoshi)

Here’s the result (against the current blocksize):

Sending 2 Or More Sub-$1 Outputs

Let’s zoom in to the interesting part, first, since there’s very little difference before 220,000 (February 2013).  You can see that only about 18% of transactions are sending less than $1 and getting less than $1 in change:

Since March 2013…

Paying Anyone Under 1c, 10c, $1

The above graph doesn’t capture the case where I have $100 and send you 1c.   If we eliminate any transaction which has any output less than various thresholds, we’ll catch that. The downside is that we capture the “sending myself tiny change” case, but I’d expect that to be rarer:

Blocksizes Without Small Output Transactions

This eliminates far more transactions.  We can see only 2.5% of the block size is taken by transactions with 1c outputs (the dark red line following the block “current blocks” line), but the green line shows about 20% of the block used for 10c transactions.  And about 45% of the block is transactions moving $1 or less.

Interpretation: Hard Landing Unlikely, But Microtransactions Lose

If the block size doesn’t increase (or doesn’t increase in time): we’ll see transactions get slower, and fees become the significant factor in whether your transaction gets processed quickly.  People will change behaviour: I’m not going to spend 20c to send you 50c!

Because block finding is highly variable and many miners are capping blocks at 750k, we see backlogs at times already; these bursts will happen with increasing frequency from now on.  This will put pressure on SatoshiDice and similar services, who will be highly incentivized to use StrawPay or roll their own channel mechanism for off-blockchain microtransactions.

I’d like to know what timescale this happens on, but the graph shows that we grow (and occasionally shrink) in bursts.  A logarithmic graph prepared by Peter R suggests that we hit 1M mid-2016 or so; expect fee pressure to bend that graph downwards soon.

The bad news is that even if fees hit (say) 25c and that prevents all the sub-$1 transactions, we only double our capacity, giving us perhaps another 18 months. (At that point miners are earning $1000 from transaction fees as well as $5000 (@ $200/BTC) from block reward, which is nice for them I guess.)
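As a quick check of those revenue numbers (my own back-of-envelope; the ~250 bytes per transaction is an assumption, not from the data above):

block_bytes = 1000000
tx_bytes = 250                            # assumed average transaction size
txs_per_block = block_bytes / tx_bytes    # ~4000 transactions
fee_revenue = txs_per_block * 0.25        # at a 25c fee -> ~$1000
block_reward = 25 * 200                   # 25 BTC at $200/BTC -> $5000
print(fee_revenue, block_reward)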

My Best Guess: Larger Blocks Desirable Within 2 Years, Needed by 3

Personally I think 5c is a reasonable transaction fee, but I’d prefer not to see it until we have decentralized off-chain alternatives.  I’d be pretty uncomfortable with a 25c fee unless the Lightning Network was so ubiquitous that I only needed to pay it twice a year.  Higher than that would have me reaching for my credit card to charge my Lightning Network account :)

Disclaimer: I Work For BlockStream, on Lightning Networks

Lightning Networks are a marathon, not a sprint.  The development timeframes in my head are even vaguer than the guesses above.  I hope it’s part of the eventual answer, but it’s not the bandaid we’re looking for.  I wish it were different, but we’re going to need other things in the meantime.

I hope this provided useful facts, whatever your opinions.

Rusty Russell: Current Blocksize, by graphs.

I used bitcoin-iterate and gnumeric to render the current bitcoin blocksizes, and here are the results.

My First Graph: A Moment of Panic

This is block sizes up to yesterday; I’ve asked gnumeric to derive an exponential trend line from the data (in black; the red one is linear):

Woah! We hit 1M blocks in a month! PAAAANIC!

That trend line hits 1000000 at block 363845.5, which we’d expect in about 32 days time!  This is what is freaking out so many denizens of the Bitcoin Subreddit. I also just saw a similar inaccurate [correction: misleading] graph reshared by Mike Hearn on G+ :(

But Wait A Minute

That trend line says we’re on 800k blocks today, and we’re clearly not.  Let’s add a 6 hour moving average:

Oh, we’re only halfway there….

In fact, if we cluster into 36 blocks (ie. 6 hours worth), we can see how misleading the terrible exponential fit is:

What! We’re already over 1M blocks?? Maths, you lied to me!

Clearer Graphs: 1 week Moving Average

Actual Weekly Running Average Blocksize

So, not time to panic just yet, though we’re clearly growing, and in unpredictable bursts.
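For anyone wanting to reproduce the smoothing, here's a minimal sketch (my own code; it assumes you've already extracted per-block sizes, e.g. with bitcoin-iterate):

def rolling_mean(block_sizes, window):
    # Simple moving average: 36 blocks ~ 6 hours, 1008 blocks ~ 1 week.
    out, total = [], 0
    for i, size in enumerate(block_sizes):
        total += size
        if i >= window:
            total -= block_sizes[i - window]
        out.append(total / min(i + 1, window))
    return out

block_sizes = [450000, 720000, 998000, 310000, 650000]   # toy data, in bytes
print(rolling_mean(block_sizes, 3))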

June 01, 2015

Rusty Russell: Block size: rate of internet speed growth since 2008?

I’ve been trying not to follow the Great Blocksize Debate raging on reddit.  However, the lack of any concrete numbers has kind of irked me, so let me add one for now.

If we assume bandwidth is the main problem with running nodes, let’s look at average connection growth rates since 2008.  Google led me to NetMetrics (who seem to charge), and Akamai’s State Of The Internet (who don’t).  So I used the latter, of course:

Akamai’s Average Connection Speed Chart Q4/07 to Q4/14

I tried to pick a range of countries, and here are the results:

Country       % Growth over 7 years   Per annum
Australia     348                     19.5%
Brazil        349                     19.5%
China         481                     25.2%
Philippines   258                     14.5%
UK            333                     18.8%
US            304                     17.2%


The countries which had the best bandwidth grew about 17% a year, so I think that’s the best model for future growth patterns (China is now where the US was 7 years ago, for example).

If bandwidth is the main centralization concern, you’ll want block growth below 15%. That implies we could jump the cap to 3MB next year, and grow it 15% thereafter. Or if you’re less conservative, 3.5MB next year, and 17% thereafter.
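For the record, here's the arithmetic behind the table and that projection (my own sketch; the table's "% growth" is the 7-year ratio times 100):

def per_annum(total_growth_pct, years=7):
    # E.g. US: 304 over 7 years -> about 17.2% per annum.
    return ((total_growth_pct / 100.0) ** (1.0 / years) - 1) * 100

print("%.1f%%" % per_annum(304))   # ~17.2%

cap_mb = 3.0                       # jump the cap to 3MB next year...
for year in range(2016, 2021):
    print(year, "%.2fMB" % cap_mb)
    cap_mb *= 1.15                 # ...then grow it 15% per annum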

April 30, 2015

Eric Leblond: Elasticsearch, systemd and Debian Jessie

Now that Debian Jessie is out, it was time to do an upgrade of my Elasticsearch servers. I’ve got two of them running in LXC containers on my main hardware system. Upgrading to Jessie was straightforward via apt-get dist-upgrade.

But the Elasticsearch server processes were not running after the reboot. I’m using the Elasticsearch 1.5 packages provided by Elastic on their website.

Running /etc/init.d/elasticsearch start or service elasticsearch start gave no output, and systemd, which now starts the service, was not kind enough to provide any debugging information.

After some research, I discovered that /usr/lib/systemd/system/elasticsearch.service was starting elasticsearch with:
ExecStart=/usr/share/elasticsearch/bin/elasticsearch            \
                            -Des.default.config=$CONF_FILE      \
                            -Des.default.path.home=$ES_HOME     \
                            -Des.default.path.logs=$LOG_DIR     \
                   \
                   \
And the variables like CONF_FILE defined in /etc/default/elasticsearch were commented out, which seems to have prevented Elasticsearch from starting. So uncommenting the variables (shown here with the package defaults) to have:
# Elasticsearch log directory
LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
DATA_DIR=/var/lib/elasticsearch

# Elasticsearch work directory
WORK_DIR=/tmp/elasticsearch

# Elasticsearch configuration directory
CONF_DIR=/etc/elasticsearch

# Elasticsearch configuration file (elasticsearch.yml)
CONF_FILE=/etc/elasticsearch/elasticsearch.yml
was enough to have a working Elasticsearch.

As pointed out by Peter Manev, to enable the service at reboot, you need to run as root:

systemctl enable elasticsearch.service

I hope this will help some of you and let me know if you find better solutions.

Rusty Russell: Some bitcoin mempool data: first look

Previously I discussed the use of IBLTs (on the pettycoin blog).  Kalle and I got some interesting, but slightly different results; before I revisited them I wanted some real data to play with.

Finally, a few weeks ago I ran 4 nodes for a week, logging incoming transactions and the contents of the mempools when we saw a block.  This gives us some data to chew on when tuning any fast block sync mechanism; here are my first impressions looking at the data (which is available on github).

These graphs are my first look; in blue is the number of txs in the block, and in purple stacked on top is the number of txs which were left in the mempool after we took those away.

The good news is that all four sites are very similar; there’s small variance across these nodes (three are in Digital Ocean data centres and one is behind two NATs and a wireless network at my local coworking space).

The bad news is that there are spikes of very large mempools around block 352,800; a series of 731kb blocks which I’m guessing is some kind of soft limit for some mining software [EDIT: 750k is the default soft block limit; reported in 1024-byte quantities as does, this is 732k.  Thanks sipa!].  Our ability to handle this case will depend very much on heuristics for guessing which transactions are likely candidates to be in the block at all (I’m hoping it’s as simple as first-seen transactions are most likely, but I haven’t tested yet).

Transactions in Mempool and in Blocks: Australia (poor connection)

Transactions in Mempool and in Blocks: Singapore

Transactions in Mempool and in Blocks: San Francisco

Transactions in Mempool and in Blocks: San Francisco (using Relay Network)

April 08, 2015

Rusty Russell: Lightning Networks Part IV: Summary

This is the fourth part of my series of posts explaining the bitcoin Lightning Networks 0.5 draft paper.  See Part I, Part II and Part III.

The key revelation of the paper is that we can have a network of arbitrarily complicated transactions, such that they aren’t on the blockchain (and thus are fast, cheap and extremely scalable), but at every point are ready to be dropped onto the blockchain for resolution if there’s a problem.  This is genuinely revolutionary.

It also vindicates Satoshi’s insistence on the generality of the Bitcoin scripting system.  And though it’s long been suggested that bitcoin would become a clearing system on which genuine microtransactions would be layered, it was unclear that we were so close to having such a system in bitcoin already.

Note that the scheme requires some solution to malleability to allow chains of transactions to be built (this is a common theme, so likely to be mitigated in a future soft fork), but Gregory Maxwell points out that it also wants selective malleability, so transactions can be replaced without invalidating the HTLCs which are spending their outputs.  Thus it proposes new signature flags, which will require active debate, analysis and another soft fork.

There is much more to discover in the paper itself: recommendations for lightning network routing, the node charging model, a risk summary, the specifics of the softfork changes, and more.

I’ll leave you with a brief list of requirements to make Lightning Networks a reality:

  1. A soft-fork is required, to protect against malleability and to allow new signature modes.
  2. A new peer-to-peer protocol needs to be designed for the lightning network, including routing.
  3. Blame and rating systems are needed for lightning network nodes.  You don’t have to trust them, but it sucks if they go down as your money is probably stuck until the timeout.
  4. More refinements (eg. relative OP_CHECKLOCKTIMEVERIFY) to simplify and tighten timeout times.
  5. Wallets need to learn to use this, with UI handling of things like timeouts and fallbacks to the bitcoin network (sorry, your transaction failed, you’ll get your money back in N days).
  6. You need to be online every 40 days to check that an old HTLC hasn’t leaked, which will require some alternate solution for occasional users (shut down channel, have some third party, etc).
  7. A server implementation needs to be written.

That’s a lot of work!  But it’s all simply engineering from here, just as bitcoin was once the paper was released.  I look forward to seeing it happen (and I’m confident it will).

April 06, 2015

Rusty Russell: Lightning Networks Part III: Channeling Contracts

This is the third part of my series of posts explaining the bitcoin Lightning Networks 0.5 draft paper.

In Part I I described how a Poon-Dryja channel uses a single in-blockchain transaction to create off-blockchain transactions which can be safely updated by either party (as long as both agree), with fallback to publishing the latest versions to the blockchain if something goes wrong.

In Part II I described how Hashed Timelocked Contracts allow you to safely make one payment conditional upon another, so payments can be routed across untrusted parties using a series of transactions with decrementing timeout values.

Now we’ll join the two together: encapsulate Hashed Timelocked Contracts inside a channel, so they don’t have to be placed in the blockchain (unless something goes wrong).

Revision: Why Poon-Dryja Channels Work

Here’s half of a channel setup between me and you where I’m paying you 1c: (there’s always a mirror setup between you and me, so it’s symmetrical)

Half a channel: we will invalidate transaction 1 (in favour of a new transaction 2) to send funds.

The system works because after we agree on a new transaction (eg. to pay you another 1c), you revoke the old one by handing me your private key to unlock that 1c output.  Now if you ever released Transaction 1, I could spend both the outputs.  If we want to add a new output to Transaction 1, we need to be able to make it similarly stealable.

Adding a 1c HTLC Output To Transaction 1 In The Channel

I’m going to send you 1c now via a HTLC (which means you’ll only get it if the riddle is answered; if it times out, I get the 1c back).  So we replace transaction 1 with transaction 2, which has three outputs: $9.98 to me, 1c to you, and 1c to the HTLC: (once we agree on the new transactions, we invalidate transaction 1 as detailed in Part I)

Our Channel With an Output for an HTLC

Note that you supply another separate signature (sig3) for this output, so you can reveal that private key later without giving away any other output.

We modify our previous HTLC design so that you revealing sig3 would allow me to steal this output. We do this the same way we did for that 1c going to you: send the output via a timelocked mutually signed transaction.  But there are two transaction paths in an HTLC: the got-the-riddle path and the timeout path, so we need to insert those timelocked mutually signed transactions in both of them.  First let’s append a 1 day delay to the timeout path:

Timeout path of HTLC, with locktime so it can be stolen once you give me your sig3.

Similarly, we need to append a timelocked transaction on the “got the riddle solution” path, which now needs my signature as well (otherwise you could create a replacement transaction and bypass the timelocked transaction):

Full HTLC: If you reveal Transaction 2 after we agree it’s been revoked, and I have your sig3 private key, I can spend that output before you can, down either the settlement or timeout paths.

Remember The Other Side?

Poon-Dryja channels are symmetrical, so the full version has a matching HTLC on the other side (except with my temporary keys, so you can catch me out if I use a revoked transaction).  Here’s the full diagram, just to be complete:

A complete lightning network channel with an HTLC, containing a glorious 13 transactions.

Closing The HTLC

When an HTLC is completed, we just update transaction 2, and don’t include the HTLC output.  The funds either get added to your output (R value revealed before timeout) or my output (timeout).

Note that we can have an arbitrary number of independent HTLCs in progress at once, and open and/or close as many in each transaction update as both parties agree to.

Keys, Keys Everywhere!

Each output for a revocable transaction needs to use a separate address, so we can hand the private key to the other party.  We use two disposable keys for each HTLC[1], and every new HTLC will change one of the other outputs (either mine, if I’m paying you, or yours if you’re paying me), so that needs a new key too.  That’s 3 keys, doubled for the symmetry, to give 6 keys per HTLC.

Adam Back pointed out that we can actually implement this scheme without the private key handover, and instead sign a transaction for the other side which gives them the money immediately.  This would permit more key reuse, but means we’d have to store these transactions somewhere on the off chance we needed them.

Storing just the keys is smaller, but more importantly, Section 6.2 of the paper describes using BIP 32 key hierarchies so the disposable keys are derived: after a while, you only need to store one key for all the keys the other side has given you.  This is vastly more efficient than storing a transaction for every HTLC, and indicates the scale (thousands of HTLCs per second) that the authors are thinking of.

Next: Conclusion

My next post will be a TL;DR summary, and some more references to the implementation details and possibilities provided by the paper.


[1] The new sighash types are fairly loose, and thus allow you to attach a transaction to a different parent if it uses the same output addresses.  I think we could re-use the same keys in both paths if we ensure that the order of keys required is reversed for one, but we’d still need 4 keys, so it seems a bit too tricky.

April 01, 2015

Rusty Russell: Lightning Networks Part II: Hashed Timelock Contracts (HTLCs)

In Part I, we demonstrated Poon-Dryja channels; a generalized channel structure which used revocable transactions to ensure that old transactions wouldn’t be reused.

A channel from me<->you would allow me to efficiently send you 1c, but that doesn’t scale since it takes at least one on-blockchain transaction to set up each channel. The solution to this is to route funds via intermediaries;  in this example we’ll use the fictitious “MtBox”.

If I already have a channel with MtBox’s Payment Node, and so do you, that lets me reliably send 1c to MtBox without (usually) needing the blockchain, and it lets MtBox send you 1c with similar efficiency.

But it doesn’t give me a way to force them to send it to you; I have to trust them.  We can do better.

Bonding Unrelated Transactions using Riddles

For simplicity, let’s ignore channels for the moment.  Here’s the “trust MtBox” solution:

I send you 1c via MtBox; simplest possible version, using two independent transactions. I trust MtBox to generate its transaction after I send it mine.

What if we could bond these transactions together somehow, so that when you spend the output from the MtBox transaction, that automatically allows MtBox to spend the output from my transaction?

Here’s one way. You send me a riddle question to which nobody else knows the answer: eg. “What’s brown and sticky?”.  I then promise MtBox the 1c if they answer that riddle correctly, and tell MtBox that you know.

MtBox doesn’t know the answer, so it turns around and promises to pay you 1c if you answer “What’s brown and sticky?”. When you answer “A stick”, MtBox can pay you 1c knowing that it can collect the 1c off me.

The bitcoin blockchain is really good at riddles; in particular “what value hashes to this one?” is easy to express in the scripting language. So you pick a random secret value R, then hash it to get H, then send me H.  My transaction’s 1c output requires MtBox’s signature, and a value which hashes to H (ie. R).  MtBox adds the same requirement to its transaction output, so if you spend it, it can get its money back from me:

Two Independent Transactions, Connected by A Hash Riddle.
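For the curious, the riddle mechanics in a few lines of Python (my own illustration; the real condition is expressed in bitcoin script, and the paper's contract wording uses a 20-byte R):

import hashlib, os

R = os.urandom(20)               # your secret answer
H = hashlib.sha256(R).digest()   # the riddle you send me

def answers_riddle(r):
    # The condition the 1c outputs enforce: a value which hashes to H.
    return hashlib.sha256(r).digest() == H

assert answers_riddle(R)         # revealing R lets you claim the output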

Handling Failure Using Timeouts

This example is too simplistic; when MtBox’s PHP script stops processing transactions, I won’t be able to get my 1c back if I’ve already published my transaction.  So we use a familiar trick from Part I, a timeout transaction which after (say) 2 days, returns the funds to me.  This output needs both my and MtBox’s signatures, and MtBox supplies me with the refund transaction containing the timeout:

Hash Riddle Transaction, With Timeout

MtBox similarly needs a timeout in case you disappear.  And it needs to make sure it gets the answer to the riddle from you within those 2 days, otherwise I might use my timeout transaction and it can’t get its money back.  To give plenty of margin, it uses a 1 day timeout:

MtBox Needs Your Riddle Answer Before It Can Answer Mine

Chaining Together

It’s fairly clear to see that longer paths are possible, using the same “timelocked” transactions.  The paper uses 1 day per hop, so if you were 5 hops away (say, me <-> MtBox <-> Carol <-> David <-> Evie <-> you) I would use a 5 day timeout to MtBox, MtBox a 4 day to Carol, etc.  A routing protocol is required, but if some routing doesn’t work, two nodes can always cancel by mutual agreement (by creating a timeout transaction with no locktime).

The paper refers to each set of transactions as contracts, with the following terms:

  • If you can produce to MtBox an unknown 20-byte random input data R from a known H, within two days, then MtBox will settle the contract by paying you 1c.
  • If two days have elapsed, then the above clause is null and void and the clearing process is invalidated.
  • Either party may (and should) pay out according to the terms of this contract in any method of the participants choosing and close out this contract early so long as both participants in this contract agree.

The hashing and timelock properties of the transactions are what allow them to be chained across a network, hence the term Hashed Timelock Contracts.

Next: Using Channels With Hashed Timelock Contracts.

The hashed riddle construct is cute, but as detailed above every transaction would need to be published on the blockchain, which makes it pretty pointless.  So the next step is to embed them into a Poon-Dryja channel, so that (in the normal, cooperative case) they don’t need to reach the blockchain at all.

March 30, 2015

Rusty Russell: Lightning Networks Part I: Revocable Transactions

I finally took a second swing at understanding the Lightning Network paper.  The promise of this work is exceptional: instant reliable transactions across the bitcoin network. The implementation is complex and the draft paper reads like a grab bag of ideas, but it truly rewards close reading!  It doesn’t involve novel crypto, nor fancy bitcoin scripting tricks.

There are several techniques which are used in the paper, so I plan to concentrate on one per post and wrap up at the end.

Revision: Payment Channels

I open a payment channel to you for up to $10

A Payment Channel is a method for sending microtransactions to a single recipient, such as me paying you 1c a minute for internet access.  I create an opening transaction which has a $10 output, which can only be redeemed by a transaction input signed by you and me (or me alone, after a timeout, just in case you vanish).  That opening transaction goes into the blockchain, and we’re sure it’s bedded down.

I pay you 1c in the payment channel. Claim it any time!

Then I send you a signed transaction which spends that opening transaction output, and has two outputs: one for $9.99 to me, and one for 1c to you.  If you want, you could sign that transaction too, and publish it immediately to get your 1c.

Update: now I pay you 2c via the payment channel.

Then a minute later, I send you a signed transaction which spends that same opening transaction output, and has a $9.98 output for me, and a 2c output for you. Each minute, I send you another transaction, increasing the amount you get every time.

This works because:

  1.  Each transaction I send spends the same output; so only one of them can ever be included in the blockchain.
  2. I can’t publish them, since they need your signature and I don’t have it.
  3. At the end, you will presumably publish the last one, which is best for you.  You could publish an earlier one, and cheat yourself of money, but that’s not my problem.

Undoing A Promise: Revoking Transactions?

In the simple channel case above, we don’t have to revoke or cancel old transactions, as the only person who can spend them is the person who would be cheated.  This makes the payment channel one way: if the amount I was paying you ever went down, you could simply broadcast one of the older, more profitable transactions.

So if we wanted to revoke an old transaction, how would we do it?

There’s no native way in bitcoin to have a transaction which expires.  You can have a transaction which is valid after 5 days (using locktime), but you can’t have one which is valid until 5 days has passed.

So the only way to invalidate a transaction is to spend one of its inputs, and get that input-stealing transaction into the blockchain before the transaction you’re trying to invalidate.  That’s no good if we’re trying to update a transaction continuously (a-la payment channels) without most of them reaching the blockchain.

The Transaction Revocation Trick

But there’s a trick, as described in the paper.  We build our transaction as before (I sign, and you hold), which spends our opening transaction output, and has two outputs.  The first is a $9.99 output for me.  The second is a bit weird–it’s 1c, but needs two signatures to spend: mine and a temporary one of yours.  Indeed, I create and sign such a transaction which spends this output, and send it to you, but that transaction has a locktime of 1 day:

The first payment in a lightning-style channel.

Now, if you sign and publish that transaction, I can spend my $9.99 straight away, and you can publish that timelocked transaction tomorrow and get your 1c.

But what if we want to update the transaction?  We create a new transaction, with a $9.98 output to me and a 2c output to a transaction signed by both me and another temporary address of yours.  I create and sign a transaction which spends that 2c output, has a locktime of 1 day and has an output going to you, and send it to you.

We can revoke the old transaction: you simply give me the temporary private key you used for that transaction.  Weird, I know (and that’s why you had to generate a temporary address for it).  Now, if you were ever to sign and publish that old transaction, I can spend my $9.99 straight away, and create a transaction using your key and my key to spend your 1c.  Your transaction (1a below) which could spend that 1c output is timelocked, so I’ll definitely get my 1c transaction into the blockchain first (and the paper uses a timelock of 40 days, not 1).

Updating the payment in a lightning-style channel: you sent me your private key for sig2, so I could spend both outputs of Transaction 1 if you were to publish it.

So the effect is that the old transaction is revoked: if you were to ever sign and release it, I could steal all the money.  Neat trick, right?

A Minor Variation To Avoid Timeout Fallback

In the original payment channel, the opening transaction had a fallback clause: after some time, it is all spendable by me.  If you stop responding, I have to wait for this to kick in to get my money back.  Instead, the paper uses a pair of these “revocable” transaction structures.  The second is a mirror image of the first, in effect.

A full symmetric, bi-directional payment channel.

So the first output is $9.99 which needs your signature and a temporary signature of mine.  The second is 1c for you.  You sign the transaction, and I hold it.  You create and sign a transaction which has that $9.99 as input and a 1 day locktime, and send it to me.

Since both your and my “revocable” transactions spend the same output, only one can reach the blockchain.  They’re basically equivalent: if you send yours you must wait 1 day for your money.  If I send mine, I have to wait 1 day for my money.  But it means either of us can finalize the payment at any time, so the opening transaction doesn’t need a timeout clause.


Now we have a generalized transaction channel, which can spend the opening transaction in any way we both agree on, without trust or requiring on-blockchain updates (unless things break down).

The next post will discuss Hashed Timelock Contracts (HTLCs) which can be used to create chains of payments…

Notes For Pedants:

In the payment channel open I assume OP_CHECKLOCKTIMEVERIFY, which isn’t yet in bitcoin.  It’s simpler.

I ignore transaction fees as an unnecessary distraction.

We need malleability fixes, so you can’t mutate a transaction and break the ones which follow.  But I also need the ability to sign Transaction 1a without a complete Transaction 1 (since you can’t expose the signed version to me).  The paper proposes new SIGHASH types to allow this.

[EDIT 2015-03-30 22:11:59+10:30: We also need to sign the other symmetric transactions before signing the opening transaction.  If we released a completed opening transaction before having the other transactions, we might be stuck with no way to get our funds back (as we don’t have a “return all to me” timeout on the opening transaction)]

February 18, 2015

Eric Leblond: Slides of my talks at Lecce

I’ve been invited by SaLUG to Lecce to give some talks during their Geek Evening. I’ve done a talk on nftables and one on Suricata.

Lecce by night

The nftables talk was about the motivation behind the change from iptables. Here are the slides: Nftables

The talk on Suricata explained the different features of Suricata and showed how I’ve used it to study SSH bruteforce attacks. Here are the slides: Suricata, Netfilter and the PRC.

Thanks a lot to Giuseppe Longo, Luca Greco and all the SaLUG team, you have been wonderful hosts!

February 13, 2015

Rusty Russell: lguest Learns PCI

With the 1.0 virtio standard finalized by the committee (though minor non-material corrections and clarifications are still trickling in), Michael Tsirkin did the heavy lifting of writing the Linux drivers (based partly on an early prototype of mine).

But I wanted an independent implementation to test: both because OASIS insists on multiple implementations before standard ratification, and because I wanted to make sure the code which is about to go into the merge window works well.

Thus, I began the task of making lguest understand PCI.  Fortunately, the osdev wiki has an excellent introduction on how to talk PCI on an x86 machine.  It didn’t take me too long to get a successful PCI bus scan from the guest, and to start implementing the virtio parts.

The final part (over which I procrastinated for a week) was to step through the spec and document all the requirements in the lguest comments.  I also added checks that the guest driver was behaving sufficiently, but now it’s finally done.

It also resulted in a few minor patches, and some clarification patches for the spec.  No red flags, however, so I’m reasonably confident that 3.20 will have compliant 1.0 virtio support!

November 09, 2014

Eric Leblond: Efficient search of string in a list of strings in Python


I’m currently working on a script that parses Suricata EVE log files and tries to detect if some fields in the log are present in a list of bad patterns. So the script has two parts: reading the log file and searching for the string in a list of strings. This list can be big, with a target of around 20000 strings.

Note: This post may seem trivial to real Python developers, but as I did not manage to find any documentation on this, here is this blog post.

The tests

For my test I have used a 653MB log file containing 896077 lines. Reading this JSON formatted file takes 5.0s. As my list of strings was around 3000 elements, far below the targeted size, a rule of thumb said I would be OK if the script stayed below 6 seconds with the matching added. The first test was a simple Python-style inclusion test with the hostname being put in a list:
if event['http']['hostname'] in hostname_list:
For that test, the result was 9.5s, so not awful but a bit over my expectations. Just to check, I ran a test with a C-like implementation:
for elt in hostname_list:
   if elt == target:
       # we have a winner
The result was a nice surprise, … for Python, with an execution time of 20.20s.

I was beginning to fear I would need some real development to reach the speed I needed, but I gave it one last try. As I only care about whether there is a match, I can transform my list of strings into a Python set, thus keeping only unique elements. So I ran the test using:

hostname_set = set(hostname_list)
for event in event_list:
    if event['http']['hostname'] in hostname_set:

The result was an amazing execution time of 5.20s: only 0.20s were used to check data against my set of strings.


Python sets require elements to be hashable. This is needed because the internal implementation uses a dictionary, so looking for an element in a set is equivalent to looking for an element in a hash table. And this is really faster than searching in a list, where there is no real magic possible.
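Here's a self-contained version of the comparison you can run yourself (toy sizes; my numbers above came from the real log file):

import timeit, random

hostname_list = ["host%d.example.com" % i for i in range(20000)]
hostname_set = set(hostname_list)
targets = [random.choice(hostname_list) for _ in range(1000)]

list_time = timeit.timeit(lambda: [t in hostname_list for t in targets], number=1)
set_time = timeit.timeit(lambda: [t in hostname_set for t in targets], number=1)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
# The list scan is O(n) per lookup; the set lookup is O(1) on average.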

So if you only care about matching, and if your elements are hashable, then use a Python set to test for the existence of an object in your set.

November 07, 2014

Jesper Dangaard Brouer: Announce github repo prototype-kernel

Kernel prototype development made easy

For easier and quicker kernel development cycles, I've created a git repo called "prototype-kernel". To also help your productivity I've shared it on github (isn't it great that work at +Red Hat defaults to being Open Source):

Getting started developing your first kernel code is really easy: follow this guide.

The basic idea is to compile modules outside the kernel tree, but use the kernel's kbuild infrastructure. Thus, this does require that you have a kernel source tree available on your system. Most distributions offer a kernel-devel package that installs only what you need for this to work.

 On +Fedora Project do: 'yum install kernel-devel'
 On +Debian do: 'aptitude install linux-headers'

Directory layout:

There are two main directories: "minimalist" and "kernel".

If you just want to test a small code change, go to the "minimalist" directory.  This contains the minimal Makefile needed to compile the kernel modules in this directory. Do your devel cycle of: modify, make, insmod, rmmod.

The "kernel" directory is more interesting to me. It maintains the same directory layout as the mainline kernel. The basic idea is that code developed should be in the same locations that you plan to submit upstream. Here you need to edit the files Kbuild files when adding new modules/code. Notice these Kbuild files are located in each sub-directory.

Detecting kernel and compiling

In the "kernel" directory you can simply run: make

The system will detect / assume-to-find the kernel source by looking in /lib/modules/`uname -r`/build/. If the kernel you are compiling for is not in this location, then specify it to make on the command line like:

  $ make kbuilddir=~/git/kernel/net-next/

Compiling for a remote host

If you are compiling locally, but want to upload the kernel modules for testing on a remote host, then the make system supports this. Compile and upload modules with a single make command do:

  $ make push_remote kbuilddir=~/git/kernel/net-next/ HOST=

Localhost development

Development on the same host is sometimes easier.  To help this process there is a script to set up symlinks from the running kernel's module directory into your git repo.  This means you just need to compile, with no need to install the modules again and again (perhaps run 'depmod -a' if needed).  Command line for this setup: be in the "kernel" directory and run:

  $ sudo ./scripts/

Development cycle made simple

After this setup, standard modprobe and modinfo work.  Thus, a oneliner work cycle for testing a given kernel module can be as simple as:

  $ make && (sudo rmmod time_bench_sample; sudo modprobe time_bench_sample) && sudo dmesg -c

Happy kernel-hacking :-)

October 09, 2014

Jesper Dangaard Brouer: Unlocked 10Gbps TX wirespeed smallest packet single core

Achievement unlocked: 10Gbit/s full TX wirespeed at the smallest packet size on a single CPU core (14.8Mpps). Recent changes to the Linux kernel's lower TX layers have unlocked the full (TX side) potential of the NIC drivers and hardware.

The single core 14.8Mpps performance number is an artificial benchmark performed with pktgen, which besides spinning the same packet (skb), now also notifies the NIC hardware after populating its TX ring buffer with a "burst" of packets. The new pktgen option is called "burst" (commit 38b2cf2982d ("net: pktgen: packet bursting via skb->xmit_more")).

Hint: for the ixgbe driver you need to adjust the cleanup interval as described in my earlier blogpost.

  • Via cmdline:  ethtool -C ethX rx-usecs 30

Fortunately, we have also found a way to take advantage of these new capabilities for normal/real use-cases, by adjusting GSO segmentation and modifying the qdisc (traffic control) layer, more details later.

As mentioned before, I personally only see pktgen as a tool for performance testing and tuning the NIC driver TX layer.  Thus, the pktgen test merely shows that we have finally unlocked the full potential of the NIC hardware and driver.  We started this journey seeing 4Mpps on a single core.

The SKB xmit_more API

The new bulking API for the driver layer requires some extra explanation, because we wanted to avoid API changes that would require updating every driver.

The new API for driver layer TX bulking is called "skb->xmit_more". To avoid changing the external driver API and to keep the new API lightweight (read: fast), the "skb" (struct sk_buff) has been extended with an "xmit_more" (bit) indication.

If the driver sees "skb->xmit_more" set, the stack is promising that another packet will be given immediately (when ->ndo_start_xmit() returns). This means that, unless the driver has a filled TX queue, it can simply add the packet to the TX queue and defer the expensive indication to the HW (updating the TX ring queue tail pointer).

Challenge: bulk without added latency

It is difficult to take advantage of bulking without introducing added latency.  Nevertheless, this has been our goal: don't speculatively or artificially introduce a delay betting on more packets arriving shortly. The principle is to only bulk packets when really needed, based on some solid indication from the stack.

Existing stack bulking: GSO

The network stack already has packet aggregation facilities like TSO (TCP Segmentation Offload), GSO (Generic Segmentation Offload) and GRO (Generic Receive Offload), which are primarily targeted at TCP flows. Let's take advantage of these packets already being aggregated.

+David Miller restructured central parts of the xmit layer, allowing it to work with skb lists, via merge commit 53fda7f7f9e8 ("Merge branch 'xmit_list'").

This restructure also allowed GRO/GSO segmented lists to take advantage of the performance boost of being bulk transmitted in the driver layer (via xmit_more indication).

Note that hardware supported TSO packets have in principle always benefited from the deferred tail pointer update, as the driver can (usually) put the entire TSO packet (up to 64k) into the TX ring as though it were a single packet, before updating the tail pointer.

After these optimizations, situations exist (for certain HW) where software GSO is faster than hardware based TSO.  To enable the software-only approach, use cmdline:

  • ethtool -K ethX gro on gso on tso off

GRO/GSO is focused on TCP and limited to bulking/aggregation on a per-flow basis.  What about UDP, or many simultaneous shorter-lived flows?

Qdisc layer bulk dequeue

Bulking from the qdisc layer is a good general solution that will benefit all kinds of packets.  A queue in the qdisc layer provides a very solid opportunity for bulking: packets have already been delayed, and it's easy to construct an skb list from the packets available in the queue.

In our solution for qdisc bulk dequeue (commit 5772e9a3463b) we have further extended the performance benefit of bulking.  We can amortize the cost of locking both at the qdisc layer and for the TXQ lock.
 Several packets are dequeued while holding the qdisc root_lock, amortizing the locking cost over several packets.  The dequeued SKB list is processed under the TXQ lock in dev_hard_start_xmit(), thus also amortizing the cost of the TXQ lock.

Details: based on earlier benchmarking of locking, the cost of a lock/unlock sequence is 16ns (in the optimal, non-contended case); with two locking sections this is a 32ns cost that will get amortized.
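To see how quickly that fixed 32ns fades with the bulk size, a trivial calculation (my own illustration):

lock_cost_ns = 2 * 16    # two lock/unlock sections at 16ns each
for burst in (1, 2, 4, 8):
    print("burst=%d: %.1fns locking cost per packet" % (burst, float(lock_cost_ns) / burst))
# burst=1: 32.0ns ... burst=8: 4.0ns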

Note: We benefit from this lock amortization saving regardless of whether the device driver takes advantage of the xmit_more API. Which is good, as most drivers have not yet implemented this xmit_more API.

Lower CPU usage when using the full bandwidth of a link is the most direct benefit most will experience from these changes.

Sending small packets "all-the-way" from userspace RAW sockets (but still through the qdisc layer), I can also demonstrate that it's now possible to send 14.8Mpps (10G wirespeed small frames); it just requires a minimum of 11 CPUs (using trafgen --qdisc-path, CPU E5-2695).  This is obviously an artificial benchmark, but it demonstrates that we are scaling with the number of CPUs.

For testing details see merge commit c2bf5ec20488f ('qdisc_bulk_dequeue').

Avoiding Head-of-Line blocking

We have explicitly chosen to only support qdisc bulking for BQL-enabled drivers.

As the bufferbloat fighters can testify, device drivers have a tendency to buffer too many bytes in the hardware.  The solution to this problem is BQL (Byte Queue Limits) by Tom Herbert.  Sadly, not many device drivers have actually implemented BQL.

The reason for needing BQL is that the qdisc layer needs an indication it can use to determine how many packets (actually bytes) it should bulk dequeue before sending them to the driver layer.

This is needed because we want to avoid overshooting the driver layer's capacity, as that would result in a need to requeue the excess packets back into the qdisc layer.  Currently the qdisc layer does not support a requeue facility for a scheduler; instead it stores requeued packets in the root qdisc.  These requeued packets cause Head-of-Line blocking, as they must be transmitted before dequeueing new packets, to avoid packet reordering.

Implementing BQL support in your driver also helps move any queueing to the qdisc layer, which gives your driver the opportunity to benefit from these performance improvements (and also allows us to fix bufferbloat by applying an appropriate qdisc like fq_codel).

Joint work

Getting to this level of performance has been the joint work and feedback of many people.

Thanks to,
 +David Miller (Red Hat),
 +Eric Dumazet (Google),
 Hannes Frederic Sowa (Red Hat),
 Tom Herbert (Google),
 Daniel Borkmann (Red Hat),
 Florian Westphal (Red Hat),
 +Alexei Starovoitov (Plumgrid),
 +Alexander Duyck (Intel / Red Hat),
 John Fastabend (Intel),
 Jamal Hadi Salim (Mojatatu Networks),
 +Neil Horman  (Red Hat),
 +Dave Taht (,
 +Toke Høiland-Jørgensen (Karlstads University)
 and reviewers on

September 29, 2014

Eric Leblond: Slides of my nftables talk at Kernel Recipes

I’ve been lucky enough to give a talk during the third edition of Kernel Recipes. I presented the evolution of nftables during the previous year.

You can get the slides from here: 2014_kernel_recipes_nftables.

Thanks to Hupstream for uploading the video of the talk:

Not much material beyond these slides and a video of the work done during the previous year on nftables and its components:

September 24, 2014

Eric Leblond: Using DOM with nftables

DOM and SSH honeypot

DOM is a solution comparable to fail2ban, but it uses Suricata’s SSH log instead of the SSH server logs. The goal of DOM is to redirect the attacker based on its SSH client version. This allows sending attackers to a honeypot like pshitt directly after the first attempt. And this can be done for a whole network, as Suricata does not need to be on the targeted box.

Using DOM with nftables

I’ve pushed basic nftables support to DOM. Instead of adding elements via ipset, it uses an nftables set. It is simple to use: you just need to add a -n flag to specify which table the set has been defined in:
./dom -f /usr/local/var/log/suricata/eve.json -n nat -s libssh -vvv -i -m OpenSSH
To activate the network address translation based on the set, you can use something like:
table ip nat {
        set libssh {
                type ipv4_addr
        }

        chain prerouting {
                 type nat hook prerouting priority -150;
                 ip saddr @libssh ip protocol tcp counter dnat
        }
}

A complete basic ruleset

Here’s the ruleset running on the box implementing pshitt and DOM:
table inet filter {
        chain input {
                 type filter hook input priority 0;
                 ct state established,related accept
                 iif lo accept
                 ct state new iif != lo tcp dport {ssh, 2200} tcp flags == syn counter log prefix "SSH attempt" group 1 accept
                 iif br0 ip daddr accept
                 ip saddr tcp dport {9300, 3142} counter accept
                 ip saddr counter accept
                 counter log prefix "Input Default DROP" group 2 drop
        }
}

table ip nat {
        set libssh {
                type ipv4_addr
        }

        chain prerouting {
                 type nat hook prerouting priority -150;
                 ip saddr @libssh ip protocol tcp counter dnat
        }

        chain postrouting {
                 type nat hook postrouting priority -150;
                 ip saddr snat
        }
}
There is an interesting rule in this ruleset:
ct state new iif != lo tcp dport {ssh, 2200} tcp flags == syn counter log prefix "SSH attempt" group 1 accept
It uses a negative construction to match on the interface: iif != lo means the interface is not lo. Note that it also uses an unnamed set to define the port list via tcp dport {ssh, 2200}. That way we have one single rule for normal and honeypot ssh. Finally, this rule is logging and accepting, and the logging is done via nfnetlink_log because of the group parameter. This allows ulogd to capture the log messages triggered by this rule.

September 12, 2014

Jesper Dangaard Brouer: Mini-tutorial for netperf-wrapper setup on RHEL6/CentOS6

The tool "netperf-wrapper" (by +Toke Høiland-Jørgensen <toke(at)>) is very useful for repeating network measurements that involve running multiple concurrent instances of testing tools (primarily netperf, iperf and ping, but also tools like d-itg and http-getter).

The tool is best known in the bufferbloat community for its test Realtime Response Under Load (RRUL), but netperf-wrapper has other tests that I find useful.
Core software dependencies are recent versions of netperf, python, matplotlib and fping (optional are d-itg and http_runner).

Dependency issues on RHEL6

The first dependencies are solved easily by installing "python-matplotlib":
 $ sudo yum install -y python-matplotlib python-pip

The software dependencies turned out to be a challenge on my RHEL6 box.

The "ping" program is too old to support option "-D" (prints timestamp before each-line).  Work-around is to install "fping", which I choose to do from "rpmforge":

Commands needed for install "fping":
 # rpm -Uvh
 # yum install -y fping

The "netperf" tool itself (on RHEL6) were not compiled with configure option "--enable-demo=yes" which is needed to get timestamp and continuous result output during a test-run.

Thus, I needed to recompile "netperf" manually:

Install netperf-wrapper

Installation is quite simple, once the dependencies have been met:

  • git clone
  • cd netperf-wrapper
  • sudo python2 install

GUI mode

There is a nice GUI mode for investigating and comparing results, started by:
 $ netperf-wrapper --gui

This depends on matplotlib with Qt4 (and PyQt4), which unfortunately was not available for RHEL6. Fortunately there is a software package for this on Fedora, named "python-matplotlib-qt4".

For GUI mode netperf-wrapper needs: matplotlib with Qt4
 $ sudo yum install -y python-matplotlib-qt4 PyQt4

Thus, the workflow is to run the tests on my RHEL6 machines, and analyze the result files on my Fedora laptop.

Using the tool

The same tool "netperf-wrapper" is used both for running the test and for later analyzing the result.

Listing the tests available:
 $ netperf-wrapper --list-tests

For listing which plots are available for a given test e.g. "rrul":
 $ netperf-wrapper --list-plots rrul

Before running a test towards a target system, remember to start the "netserver" daemon process on the target host (just run the command "netserver", nothing else).

Start a test-run towards e.g. IP with test rrul
 $ netperf-wrapper -H -t my_title rrul

It is recommended to use the option "-t" to give your test a title, which makes it easier to distinguish when comparing two or more test files in e.g. the GUI tool.

The results of the test-run will be stored in a compressed json formatted text file, with the naming convention: rrul-2014-MM-DDTHHMMSS.milisec.json.gz

To view the result, without the GUI, run:
 $ netperf-wrapper -i rrul_prio-2014-09-10T125650.993908.json.gz -f plot
Or e.g. selecting a specific plot like "ping_cdf"
 $ netperf-wrapper -i rrul_prio-2014-09-10T125650.993908.json.gz -f plot -p ping_cdf

netperf-wrapper can also output numeric data suitable for plotting in org-mode or .csv (spreadsheet) format, but I didn't play with those options.

Updates: A release 0.7.0 of netperf-wrapper is pending

Extra: On bufferbloat
Interested in more about bufferbloat?

Too few people are linking to the best talk explaining bufferbloat and how it's solved, by Van Jacobson (slides).  The video quality is unfortunately not very good.

I've used some of Van's points in my own talk about bufferbloat: Beyond the existence of Bufferbloat, have we found the cure? (slides)

September 11, 2014

Jesper Dangaard Brouer: Network setup for accurate nanosec measurements

As I described in my previous blogpost, I'm leveraging PPS measurements to deduce the nanosec improvements I'm making to the code.

One problem with using this on the nanosec scale is the accuracy of your measurements, which depends on the accuracy of the hardware you are using.

Modern systems have power-saving and turbo-boosting features built into the CPUs.  And Hyper-Threading technology allows one CPU core to appear as two CPUs, by sharing ALUs etc.

While establishing an accurate baseline for some upstream measurements (subj: Get rid of ndo_xmit_flush / commit 0b725a2ca61), I was starting to see too much variation in my trafgen measurements.

I created a rather large oneliner, which I have converted into a script here:
This allowed me to get a picture of the accuracy of my measurements, and they are not accurate enough. (For more real stats like std-dev, consider running these measurements through Rusty Russell's tool.)

My findings:

  1. Disable all C states and P states.
  2. Disabling Hyper-Threading and power-management in BIOS helped the accuracy
  3. 10Gbit/s ixgbe ring-buffer cleanup interval also influenced accuracy

Reading +Jeremy Eder's blog post, it seems the best method for disabling these C and P states, and keeping all CPUs in C0/C1 state, is doing:

 # tuned-adm profile latency-performance

I found the most stable ring-buffer cleanup interval for the ixgbe driver was 30 usecs, configured by:

 # ethtool -C eth5 rx-usecs 30

Besides these tunings, my blogpost on "Basic tuning for network overload testing" should still be followed.
Generally I've started using the "latency-performance" profile, but unless I need to measure some specific code change, I'm still using the ixgbe's default "dynamic" ring-buffer cleanup interval.

Details about the "ethtool -C" tuning are available in the blogpost "Pktgen for network overload testing".

Jesper Dangaard Brouer: Packet Per Sec measurements for improving the Linux Kernel network stack

Why am I using Packet Per Second (PPS) tests for measuring the improvement in performance (of the Linux kernel network stack)?  Many people (e.g. other kernel developers) do not understand why I'm using PPS measurements; this blogpost explains why.

The basic problem with using large MTU (usually 1500 bytes) packets is that the transmission delay itself is enough to hide any improvement I'm making (e.g. a faster lookup function).

Transmission delay 1514 bytes (+20 bytes for Ethernet overhead) at 10Gbit/s is 1227 nanosec:

  • ((bytes+wireoverhead)*8) / 10 Gbits = time-unit
  • ((1500+14+20)*8)/((10000*10^6))*10^9 = 1227.20 ns

This means that if the network stack can generate (alloc/fill/free) a 1500 byte packet faster than every 1227ns, then it can fully utilize the bandwidth of the 10Gbit/s link.  And yes, we can already do so. Thus, with 1500 byte frames, any stack performance improvement will only be measurable as lower CPU utilization.

Let's face it; the kernel has been optimized heavily for the last 20 years.  Thus, the improvements we are able to come up with are going to be on the nanosec scale.
For example I've found a faster way to clear the SKB, which saves 7 nanosec.  Being able to measure this performance improvement was essential while developing this faster clearing.

Let's assume the stack cost (alloc/fill/syscall/free) is 1200ns (thus faster than 1227ns); then a 7ns improvement is only 0.58%, which I can only measure as lower CPU utilization (as the bandwidth limit has been reached), and which in practice cannot be measured accurately enough.

By lowering the packet size, the transmission delay that the stack cost (alloc/fill/syscall/free) can "hide behind" is reduced. With the smallest packet size of 64 bytes, the budget is significantly reduced, to:

  • ((64+20)*8)/((10000*10^6))*10^9 = 67.2ns

This basically exposes the stack's cost, as its current cost is larger than 67.2ns.  This gives us measurements that allow us to actually quantify the improvement of the code changes we are making, even though this "only" translates into reduced CPU usage with big frames (which translates into more processor time for your application).

In packets per sec (pps) this corresponds to approx 14.8Mpps:

  • 1sec/67.2ns =>  1/(67.2/10^9) = 14,880,952 pps
  • or directly from the packet size as:
  • 10Gbit / (bytes*8) = (10000*10^6)/((64+20)*8) = 14,880,952 pps

Measuring packets per sec (PPS) instead of bandwidth has another advantage: instead of just comparing the PPS improvement, we can translate the PPS into nanosec (between packets).
Comparing the nanosec used before and after will show us the nanosec saved by the given code change.

See how I used it in this and this commit to document the actual improvement of the changes I made.

Update: For the nanosec saved by a given code change to be deduced validly, you usually need to isolate your test to a single CPU.

Let's use the 14.8Mpps as an example of how to translate PPS to nanosec:

  • 1sec / pps => (1/14880952*10^9) = 67.2ns

Extra: Just for the calculation exercise.

  • How many packets per sec does 1227.20 ns correspond to:
    • 1sec/1227.2ns =>  1/(1227.2/10^9) = 814,863 pps
  • Can also be calculated directly from the packet size as:
    • 10Gbit / (bytes*8) = (10000*10^6)/((1514+20)*8) = 814,863 pps
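All of the conversions above, collected as two small helpers (my own sketch of the post's formulas; 20 bytes is the per-frame wire overhead of preamble plus inter-frame gap):

LINK_BPS = 10000 * 10**6   # 10Gbit/s
WIRE_OVERHEAD = 20         # preamble + inter-frame gap, in bytes

def wire_time_ns(frame_bytes):
    return (frame_bytes + WIRE_OVERHEAD) * 8.0 / LINK_BPS * 10**9

def pps(frame_bytes):
    return LINK_BPS / ((frame_bytes + WIRE_OVERHEAD) * 8.0)

print(wire_time_ns(1514))   # ~1227.2 ns
print(pps(64))              # ~14,880,952 pps
print(wire_time_ns(64))     # ~67.2 ns between packets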

August 19, 2014

Rusty Russell: POLLOUT doesn’t mean write(2) won’t block: Part II

My previous discovery that poll() indicating an fd was writable didn’t mean write() wouldn’t block led to some interesting discussion on Google+.

It became clear that there is much confusion over read and write; eg. Linus thought read() was like write() whereas I thought (prior to my last post) that write() was like read(). Both wrong…

Both Linux and v6 UNIX always returned from read() once data was available (v6 didn’t have sockets, but they had pipes). POSIX even suggests this:

The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.

But write() is different. Presumably so simple UNIX filters didn’t have to check the return and loop (they’d just die with EPIPE anyway), write() tries hard to write all the data before returning. And that leads to a simple rule.  Quoting Linus:

Sure, you can try to play games by knowing socket buffer sizes and look at pending buffers with SIOCOUTQ etc, and say “ok, I can probably do a write of size X without blocking” even on a blocking file descriptor, but it’s hacky, fragile and wrong.

I’m travelling, so I built an Ubuntu-compatible kernel with a printk() into select() and poll() to see who else was making this mistake on my laptop:

cups-browsed: (1262): fd 5 poll() for write without nonblock
cups-browsed: (1262): fd 6 poll() for write without nonblock
Xorg: (1377): fd 1 select() for write without nonblock
Xorg: (1377): fd 3 select() for write without nonblock
Xorg: (1377): fd 11 select() for write without nonblock

This first one is actually OK; fd 5 is an eventfd (which should never block). But the rest seem to be sockets, and thus probably bugs.
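The robust pattern is to make the fd non-blocking and treat POLLOUT as a hint; here's a sketch in Python for brevity (the same idea applies to C code using poll(2)):

import select, socket

sock = socket.create_connection(("example.com", 80))
sock.setblocking(False)
poller = select.poll()
poller.register(sock, select.POLLOUT)

data = b"GET / HTTP/1.0\r\n\r\n"
while data:
    if poller.poll(1000):           # fd claims to be writable...
        try:
            sent = sock.send(data)  # ...but may accept only part,
            data = data[sent:]
        except BlockingIOError:     # ...or nothing at all.
            pass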

What’s worse is the Linux select() man page:

       A file descriptor is considered ready if it is possible to
       perform the corresponding I/O operation (e.g., read(2)) without
       ... those in writefds will be watched to see if a write will
       not block...

And poll():

		Writing now will not block.

Man page patches have been submitted…

August 02, 2014

Rusty Russell: ccan/io: revisited

There are numerous C async I/O libraries; tevent being the one I’m most familiar with.  Yet, tevent has a very wide API, and programs using it inevitably descend into “callback hell”.  So I wrote ccan/io.

The idea is that each I/O callback returns a “struct io_plan” which says what I/O to do next, and what callback to call.  Examples are “io_read(buf, len, next, next_arg)” to read a fixed number of bytes, and “io_read_partial(buf, lenp, next, next_arg)” to perform a single read.  You could also write your own, such as pettycoin’s “io_read_packet()” which read a length then allocated and read in the rest of the packet.

This should enable a convenient debug mode: you turn each io_read() etc. into synchronous operations and now you have a nice callchain showing what happened to a file descriptor.  In practice, however, debug was painful to use and a frequent source of bugs inside ccan/io, so I never used it for debugging.

And I became less happy when I used it in anger for pettycoin, but at some point you’ve got to stop procrastinating and start producing code, so I left it alone.

Now I’ve revisited it.   820 insertions(+), 1042 deletions(-) and the code is significantly less hairy, and the API a little simpler.  In particular, writing the normal “read-then-write” loops is still very nice, while doing full duplex I/O is possible, but more complex.  Let’s see if I’m still happy once I’ve merged it into pettycoin…

July 29, 2014

Rusty Russell: Pettycoin Alpha01 Tagged

As with all software, it took longer than I expected, but today I tagged the first version of pettycoin.  Now, lots more polish and features, but at least there’s something more than the git repo for others to look at!

July 17, 2014

Rusty Russell: API Bug of the Week: getsockname().

A “non-blocking” IPv6 connect() call was, in fact, blocking.  Tracking that down made me realize the IPv6 address was mostly random garbage, which was caused by this function:

bool get_fd_addr(int fd, struct protocol_net_address *addr)
{
   union {
      struct sockaddr sa;
      struct sockaddr_in in;
      struct sockaddr_in6 in6;
   } u;
   socklen_t len = sizeof(len);   /* the bug: should be sizeof(u) */
   if (getsockname(fd, &u.sa, &len) != 0)
      return false;

The bug: “sizeof(len)” should be “sizeof(u)”.  But when presented with a too-short length, getsockname() truncates, and otherwise “succeeds”; you have to check the resulting len value to see what you should have passed.
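
A corrected sketch of the same function, passing the real buffer size and checking that the kernel didn’t truncate (the conversion into *addr is elided, as in the original):

    bool get_fd_addr(int fd, struct protocol_net_address *addr)
    {
        union {
            struct sockaddr sa;
            struct sockaddr_in in;
            struct sockaddr_in6 in6;
        } u;
        socklen_t len = sizeof(u);              /* the whole union */
        if (getsockname(fd, &u.sa, &len) != 0)
            return false;
        if (len > sizeof(u))                    /* truncated: don't trust it */
            return false;
        /* ... fill *addr from u.in / u.in6 as before ... */
        return true;
    }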

Obviously an error return would be better here, but the writable len arg is pretty useless: I don’t know of any callers that check the length return and do anything useful with it.  Better to provide getsocklen() for those who do care, and have getsockname() take a size_t as its third arg.

Oh, and the blocking?  That was because I was calling “fcntl(fd, F_SETFD, …)” instead of “F_SETFL”!

July 02, 2014

Jesper Dangaard Brouer: The calculations: 10Gbit/s wirespeed

In this blogpost, I'll try to make you understand the engineering challenge behind processing 10Gbit/s wirespeed, at the smallest Ethernet packet size.

The peak packet rate is 14.88 Mpps (million packets per sec) uni-directional on 10Gbit/s with the smallest frame size.

Details: What is the smallest Ethernet frame?
Ethernet frame overhead (on the wire):

  • Inter-frame gap: 12 bytes
  • Preamble + start-of-frame delimiter: 8 bytes (7 + 1)
  • Minimum frame: 64 bytes (14 bytes header + 46 bytes payload + 4 bytes CRC)

Thus, the minimum size Ethernet frame is: 84 bytes (20 bytes wire overhead + 64 bytes frame)

Max 1500 bytes MTU Ethernet frame size is: 1538 bytes (calc: (12+8) + (14) + 1500 + (4) = 1538 bytes)

Packet rate calculations

Peak packet rate calculated as:  (10*10^9) bits/sec / (84 bytes * 8) = 14,880,952 pps
1500 MTU packet rate calculated as: (10*10^9) bits/sec / (1538 bytes * 8) = 812,744 pps

Time budget
This is the important part to wrap your head around.

With 14.88 Mpps the time budget for processing a single packet is:

  • 67.2 ns (nanosecond) (calc as: 1/14880952*10^9 ns)

This corresponds to approx 201 CPU cycles on a 3GHz CPU (assuming one instruction per cycle, disregarding superscalar/pipelined CPUs). Having only 201 clock cycles of processing time per packet is very little.
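
For the record, a trivial C program reproducing these numbers (frame sizes include the 20 bytes of preamble and inter-frame gap):

    #include <stdio.h>

    int main(void)
    {
        const double link_bps  = 10e9;   /* 10 Gbit/s */
        const double min_frame = 84;     /* bytes on the wire */
        const double mtu_frame = 1538;

        double min_pps = link_bps / (min_frame * 8);
        double mtu_pps = link_bps / (mtu_frame * 8);
        printf("min-size frames: %.0f pps\n", min_pps);    /* 14880952 */
        printf("MTU-size frames: %.0f pps\n", mtu_pps);    /*   812744 */
        printf("time budget:     %.1f ns/packet\n", 1e9 / min_pps); /* 67.2 */
        return 0;
    }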

Relate these numbers to something
This 67.2ns number is hard to use for anything, unless we relate it to some other time measurements.

A single cache-miss takes 32 ns (measured on an E5-2650 CPU). Thus, with just two cache-misses (2x32=64ns), almost the entire 67.2 ns budget is gone. The Linux skb (sk_buff) is 4 cache-lines (on 64-bit), and the kernel e.g. insists on writing zeros to these cache-lines during allocation of an skb.

We might not "suffer" a full cache-miss; sometimes the memory is available in the L2 or L3 cache, so it is useful to know those costs too.  Measured on my E5-2630 CPU (with the lmbench command "lat_mem_rd 1024 128"), an L2 access costs 4.3ns, and an L3 access costs 7.9ns.

The "LOCK" operation
Assembler instructions can be prefixed with a "LOCK" operation, which means they perform an atomic operation. This is used every time e.g. a spinlock is locked or unlocked, or cmpxchg and atomic_inc are used (some operations are even implicitly LOCK prefixed, like xchg).

I've measured the cost of this atomic "LOCK" operation to be 8.25ns on my CPU (with a small test program). Even in the completely optimal situation of a spinlock only being touched by one CPU, we have two LOCK calls, which cost 16.5ns.
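
The measurement program is not reproduced here, but the idea is simple: time a tight loop of LOCK-prefixed operations with the TSC. A rough sketch of how such a measurement might look, assuming x86-64 and GCC builtins (not the original program):

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>                      /* __rdtsc() */

    #define LOOPS 100000000ULL

    int main(void)
    {
        volatile long counter = 0;
        uint64_t start = __rdtsc();
        for (uint64_t i = 0; i < LOOPS; i++)
            __sync_fetch_and_add(&counter, 1);  /* emits "lock xadd" */
        uint64_t cycles = __rdtsc() - start;
        /* Includes loop overhead; divide by CPU GHz to get ns/op. */
        printf("%.2f cycles per atomic op\n", (double)cycles / LOOPS);
        return 0;
    }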

System call overhead
A FreeBSD case study of sendto(), in Luigi Rizzo's netmap paper, shows that the cost of the system call alone is 96ns, which is already above the 67.2 ns budget.  The total overhead of sendto() was 950 ns, which corresponds to 1,052,631 pps (calc as 1/(950/10^9)).
On Linux I measured the system call getuid(2) to take 87.77 ns and 201 CPU cycles (TSC measurement); the CPU efficiency was 1.42 insns per cycle, measured with perf stat. Thus, the syscall itself eats up the entire budget.

  • Update: Most of the syscall overhead comes from the kernel option CONFIG_AUDITSYSCALL; without it, the syscall overhead drops to 41.85 ns.

How to overcome this syscall problem?  We can amortize the cost by sending several packets in a single syscall.  It is not very well known, but we actually already have a syscall to send several packets at once: "sendmmsg(2)". Notice the extra "m" (and the corresponding receive version "recvmmsg(2)"). Not many examples exist on the Internet for using these syscalls, so I've provided some example code for sendmmsg and recvmmsg.
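
For illustration, a minimal UDP sender pushing a batch of packets through one sendmmsg() call (destination address and port are placeholders):

    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH 8

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = {
            .sin_family = AF_INET,
            .sin_port   = htons(9),          /* discard port (placeholder) */
        };
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        char payload[BATCH][16];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];
        memset(msgs, 0, sizeof(msgs));

        for (int i = 0; i < BATCH; i++) {
            snprintf(payload[i], sizeof(payload[i]), "pkt %d", i);
            iov[i].iov_base = payload[i];
            iov[i].iov_len  = strlen(payload[i]);
            msgs[i].msg_hdr.msg_name    = &dst;
            msgs[i].msg_hdr.msg_namelen = sizeof(dst);
            msgs[i].msg_hdr.msg_iov     = &iov[i];
            msgs[i].msg_hdr.msg_iovlen  = 1;
        }

        /* One syscall sends the whole batch; returns packets sent. */
        int sent = sendmmsg(fd, msgs, BATCH, 0);
        printf("sendmmsg() sent %d packets\n", sent);
        return 0;
    }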

RAW socket speeds
Daniel Borkmann and I recently optimized AF_PACKET to scale to several CPUs (see trafgen and the kernel qdisc bypass for RAW sockets). But let us look at the performance numbers for only a single CPU:

  • Qdisc path = 1,226,776 pps => 815 ns per packet (calc: 1/pps*10^9)
  • Qdisc bypass = 1,382,075 pps => 723 ns per packet (calc: 1/pps*10^9)

This is also interesting because it shows us the cost of the qdisc code path: 92 ns.  In this 10Gbit/s context that is fairly large, corresponding to almost 3 cache-line misses (92/32 = 2.9).

June 26, 2014

Eric Leblond: pshitt: collect passwords used in SSH bruteforce


I’ve been playing lately with characterizing SSH brute-force attacks. I was a bit frustrated at only getting partial information:

  • ulogd can give information about scanner settings
  • suricata can give me information about software version
  • sshd server logs show the username
But having the username without the password is really frustrating.

So I decided to try to get them. Looking for an SSH server honeypot, I did find kippo, but it was going too far for me by providing a fake shell access. So I decided to build my own based on paramiko.

pshitt, Passwords of SSH Intruders Transferred to Text, was born. It is a lightweight fake SSH server that collects authentication data sent by intruders. It basically collects the username and password and writes the extracted data to a file in JSON format. For each authentication attempt, pshitt dumps a JSON-formatted entry:

{"username": "admin", "src_ip": "", "password": "passw0rd", "src_port": 36221, "timestamp": "2014-06-26T10:48:05.799316"}
The data can then be easily imported into Logstash (see the pshitt README) or Splunk.

The setup

As I want to still be able to connect to the box running ssh with a regular client, I needed a setup that automatically redirects the offenders, and only them, to the pshitt server. A simple solution was to use DOM. DOM parses the Suricata EVE JSON log file, in which Suricata gives us the SSH software version of each IP connecting to the SSH server. If DOM sees a software version containing libssh, it adds the originating IP to an ipset set. So, the idea of our honeypot setup is simple:
  • Suricata outputs SSH software version to EVE
  • DOM adds IP using libssh to the ipset set
  • Netfilter NAT redirects all IPs of the set to pshitt when they try to connect to our ssh server
Getting the setup in place is really easy. We first create the set:
ipset create libssh hash:ip
then we start DOM so it adds all offenders to the set named libssh:
cd DOM
./dom -f /usr/local/var/log/suricata/eve.json -s libssh
A more accurate setup for dom can be the following. If you know that your legitimate clients are all based on OpenSSH, then you can run dom to put in the list all IPs that do not (-i) use an OpenSSH client (-m OpenSSH):
./dom -f /usr/local/var/log/suricata/eve.json -s libssh -vvv -i -m OpenSSH
If we want to list the elements of the set, we can use:
ipset list libssh
Now, we can start pshitt:
cd pshitt
And finally we redirect connections coming from IPs in the libssh set to port 2200:
iptables -A PREROUTING -m set --match-set libssh src -t nat -i eth0 -p tcp -m tcp --dport 22 -j REDIRECT --to-ports 2200

Some results

Here’s an extract of the most-used passwords in attempts on the root account: [chart: root account passwords]. And here’s the same thing for attempts on the admin account: [chart: admin account passwords]. Both data sets show around 24 hours of attempts on an anonymous box.


Thanks to paramiko, it was really fast to code pshitt. I’m now collecting data and I think it will help to improve the categorization of SSH brute-force tools.

June 21, 2014

Rusty Russell: Alternate Blog for my Pettycoin Work

I decided to use github for pettycoin, and tested out their blogging integration (summary: it’s not very integrated, but once set up, Jekyll is nice).  I’m keeping a blow-by-blow development blog over there.

June 16, 2014

Rusty Russell: Rusty Goes on Sabbatical, June to December

I spoke about my pre-alpha implementation of Pettycoin, but progress since then has been slow.  That’s partially due to yak shaving (like rewriting the ccan/io library), partially reimplementation of parts I didn’t like, and partially due to the birth of my son, but mainly because I have a day job which involves working on POWER8 KVM issues for IBM.  So Alex convinced me to take 6 months off from the day job, and work 4 days a week on pettycoin.

I’m going to be blogging my progress, so expect several updates a week.  The first few alpha releases will be useless for doing any actual transactions, but by the first beta the major pieces should be in place…

June 11, 2014

Eric Leblond: Let’s talk about SELKS

The slides of my lightning talk at SSTIC are available: Let’s talk about SELKS. The slides are in French and are intended to be humorous.

The presentation is about defensive security that needs to get sexier. And Suricata 2.0 with EVE logging combined with Elasticsearch and Kibana can really help to reach that target. If you want to try Suricata and Elasticsearch, you can download and test SELKS.


The talk also presents a small tool named Deny On Monitoring, which demonstrates how easy it is to extract information from Suricata EVE JSON logging.

June 07, 2014

Rusty Russell: Donation to Jupiter Broadcasting

Chris Fisher’s Jupiter Broadcasting pod/vodcasting started 8 years ago with the Linux Action Show: still their flagship show, and how I discovered them 3 years ago.  Shows like this give access to FOSS to those outside the LWN-reading crowd; community building can be a thankless task, and as a small shop Chris has had ups and downs along the way.  After listening to them for a few years, I feel a weird bond with this bunch of people I’ve never met.

I regularly listen to Techsnap for security news, Scibyte for science with my daughter, and Unfilter to get an insight into the NSA and what the US looks like from the inside.  I bugged Chris a while back to accept bitcoin donations, and when they did I subscribed to Unfilter for a year at 2 BTC.  To congratulate them on reaching the 100th Unfilter episode, I repeated that donation.

They’ve started doing new and ambitious things, like Linux HOWTO, so I know they’ll put the funds to good use!

June 04, 2014

Jesper Dangaard Brouer: Pktgen for network overload testing

Want to get maximum performance out of the kernel level packet generator (pktgen)?
Then read this blogpost:

  • Simple tuning will increase performance from 4Mpps to 5.5Mpps (per CPU)

You might see pktgen as a fast packet generator, which it is, but I (as a kernel developer) also see it as a tool for testing the network stack's TX code path.

Pktgen has a parameter "clone_skb", which specifies how many times to send the same packet before freeing and allocating a new packet for transmission.  This affects performance significantly, as it can remove a lot of memory allocation and access overhead.

I have two distinctly different use-cases for stack testing:

  1. clone_skb=1      tests the stack alloc/free overhead (related to the SKB)
  2. clone_skb=100000 tests the NIC driver layer
Let's focus on case 2, the driver layer.

Tuning NIC driver layer for max performance:
The default NIC settings are not tuned for pktgen's artificial overload type of benchmarking, as this could hurt the normal use-case.

Specifically increasing the TX ring buffer in the NIC:
 # ethtool -G ethX tx 1024

A larger TX ring can improve pktgen's performance, while it can hurt in the general case: 1) because the TX ring buffer might get larger than the CPU's L1/L2 cache, and 2) because it allows more queueing in the NIC HW layer (which is bad for bufferbloat).

One should be careful about concluding that packets/descriptors in the HW TX ring cause delay.  Drivers usually delay cleaning up the ring-buffers (for various performance reasons), thus packets stalling in the TX ring might just be waiting for cleanup.

This "slow" cleanup issues is specifically the case, for the driver ixgbe (Intel 82599 chip).  This driver (ixgbe) combine TX+RX ring cleanups, and the cleanup interval is affected by the ethtool --coalesce setting of parameter "rx-usecs".

For ixgbe use e.g. "30", resulting in approx 33K interrupts/sec (calc: 1/30*10^6):
 # ethtool -C ethX rx-usecs 30

Performance data:
Packet Per Sec (pps) performance tests using a single pktgen CPU thread, CPU E5-2630, 10Gbit/s driver ixgbe. (using net-next development kernel v3.15-rc1-2680-g6623b41)

Adjusting the "ethtool -C ethX rx-usecs" value affects how often we clean up the ring.  Keeping the default TX ring size at 512 and adjusting "rx-usecs":
  • 3,935,002 pps - rx-usecs:  1 (irqs:  9346)
  • 5,132,350 pps - rx-usecs: 10 (irqs: 99157)
  • 5,375,111 pps - rx-usecs: 20 (irqs: 50154)
  • 5,454,050 pps - rx-usecs: 30 (irqs: 33872)
  • 5,496,320 pps - rx-usecs: 40 (irqs: 26197)
  • 5,502,510 pps - rx-usecs: 50 (irqs: 21527)
Performance when adjusting the TX ring buffer size. Keeping "rx-usecs==1" (default) while adjusting TX ring size (ethtool -G):
  • 3,935,002 pps - tx-size:  512
  • 5,354,401 pps - tx-size:  768
  • 5,356,847 pps - tx-size: 1024
  • 5,327,595 pps - tx-size: 1536
  • 5,356,779 pps - tx-size: 2048
  • 5,353,438 pps - tx-size: 4096
Adjusting the cleanup interval (rx-usecs) seems to win over simply increasing the TX ring buffer size. This also supports the theory that the TX queue gets filled with old packets/descriptors that need cleaning.
(Edit: updated the numbers to be clean upstream; the previous numbers included some patches.)

Tools: if you want an easy-to-use script for pktgen, look here
More details on pktgen advanced topics by Daniel Turull.

May 27, 2014

Rusty Russell: Effects of packet/data sizes on various networks

I was thinking about peer-to-peer networking (in the context of Pettycoin, of course) and I wondered if sending ~1420 bytes of data is really any slower than sending 1 byte on real networks.  Similarly, is it worth going to extremes to avoid crossing over into two TCP packets?

So I wrote a simple Linux TCP ping pong client and server: the client connects to the server then loops: reads until it gets a ‘1’ byte, then it responds with a single byte ack.  The server sends data ending in a 1 byte, then reads the response byte, printing out how long it took.  First 1 byte of data, then 101 bytes, all the way to 9901 bytes.  It does this 20 times, then closes the socket.
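
For reference, a condensed sketch of the client side under this protocol (host and port are placeholders, error handling trimmed):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in srv = {
            .sin_family = AF_INET,
            .sin_port   = htons(6789),          /* placeholder port */
        };
        inet_pton(AF_INET, "192.168.1.1", &srv.sin_addr);  /* placeholder */
        if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) != 0)
            return 1;

        char buf[16384], ack = 'A';
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            if (memchr(buf, '1', n))    /* end-of-burst marker seen */
                write(fd, &ack, 1);     /* answer with a 1-byte ack */
        }
        close(fd);
        return 0;
    }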

Here are the results on various networks (or download the source and result files for your own analysis):

On Our Gigabit LAN

Interestingly, we do win for tiny packets, but there’s no real penalty once we’re over a packet (until we get to three packets worth):

[graph: Over the Gigabit LAN]

[graph: Over the Gigabit LAN (closeup)]

On Our Wireless LAN

Here we do see a significant decline as we enter the second packet, though extra bytes in the first packet aren’t completely free:

[graph: Wireless LAN (all results)]

[graph: Wireless LAN (closeup)]

Via ADSL2 Over The Internet (Same Country)

Ignoring the occasional congestion from other uses of my home net connection, we see a big jump after the first packet, then another as we go from 3 to 4 packets:

[graph: ADSL over internet in same country]

[graph: ADSL over internet in same country (closeup)]

Via ADSL2 Over The Internet (Australia <-> USA)

Here, packet size is completely lost in the noise; the carrier pigeons don’t even notice the extra weight:

[graph: Wifi + ADSL2 from Adelaide to US]

[graph: Wifi + ADSL2 from Adelaide to US (closeup)]

Via 3G Cellular Network (HSPA)

I initially did this with Wifi tethering, but the results were weird enough that Joel wrote a little Java wrapper so I could run the test natively on the phone.  It didn’t change the resulting pattern much, but I don’t know if this regularity of delay is a 3G or an Android thing.  Here every packet costs, but you don’t win a prize for having a short packet:

[graph: 3G network]

[graph: 3G network (closeup)]

Via 2G Network (EDGE)

This one actually gives you a penalty for short packets!  800 bytes to 2100 bytes is the sweet-spot:

[graph: 2G (EDGE) network]

[graph: 2G (EDGE) network (closeup)]


So if you’re going to send one byte, what’s the penalty for sending more?  Eyeballing the minimum times from the graphs above:

                            Wired LAN  Wireless  ADSL   3G    2G
Penalty for filling packet     30%       15%      5%    0%    0%*
Penalty for second packet      30%       40%     15%   20%    0%
Penalty for fourth packet      60%       80%     25%   40%   25%

* The average for EDGE actually improves by about 35% if you fill the packet

May 19, 2014

Eric Leblond: Playing with python-git


I’m currently working on Scirius, the web management interface for Suricata developed by Stamus Networks. Scirius is able to fetch IDS signatures from external sources, and the backend stores these elements in a git tree. As Scirius is a Django application, this means we need to interact with git from Python.

Usually the documentation of Python modules is good enough to develop with. This is sadly not the case for GitPython. There is documentation, but the overall quality is not excellent, at least for a non-native Python developer, and some big parts are missing.

Doing a commit

Doing a commit is really simple once you understand what to do. You need to open the repository and work on its index, which is the object you add files to before committing. In the following example, I want to add everything under the rules directory:

    repo = git.Repo(source_git_dir)
    index = repo.index
    message =  'source version at %s' % (self.updated_date)
    index.add(['rules'])      # stage everything under the rules directory
    index.commit(message)     # commit the staged changes

Set value in the configuration of a repository

It is possible to edit the configuration of a git repository with GitPython. To do that, you need to get the config writer and use the set_value function. For example, the following code snippet creates a repository and sets the user email and name for that repository:

    repo = git.Repo.init(source_git_dir)
    config = repo.config_writer()
    config.set_value("user", "email", "")
    config.set_value("user", "name", "Scirius")

OSError 25: Inappropriate ioctl for device

I’ve encountered this fabulous exception when trying to do a commit in Scirius. The problem only shows up when running the application in wsfcgi mode. It is documented in Issue 39 on GitHub, but no workaround is proposed there.

The error comes from the fact that the function used to guess the identity of the user running the application is called even if the values are set in the config. And this function fails when it is called outside of a real session: it tries to get things from the environment, but those values are not set when the application is started by init. To fix this, it is possible to force the USERNAME environment variable.

Here’s how it is implemented in Scirius:

+    os.environ['USERNAME'] = 'scirius'
    message =  'source version at %s' % (self.updated_date)

You can see the diff on GitHub

May 08, 2014

Rusty Russell: BTC->BPAY gateway (for Australians)

I tested out a BTC->BPAY gateway, which lets you pay any BPAY bill (see the explanation from reddit).  Since I’d never heard of the developer, I wasn’t going to send anything large through it, but it worked flawlessly.  At least the exposure is limited to the time between sending the BTC and seeing the BPAY receipt.  The exchange rate was fair, and it was a simple process.

Now I need to convince my wife we should buy some BTC for paying bills…

April 30, 2014

Jesper Dangaard Brouer: trafgen a fast packet generator

The netsniff-ng toolkit version 0.5.8 has been released.

One of the tools included in the netsniff-ng toolkit is "trafgen", a multi-threaded low-level zero-copy network packet generator.  The recent release contains some significant performance improvements to that traffic generator.

Single CPU generator performance on an E5-2630 CPU, with an Intel ixgbe/82599 chip, reaches 1.4 million packets per sec (Mpps) when using the recent kernel (>= v3.14) feature of qdisc bypass for RAW sockets, and around 1.2 Mpps without bypassing the qdisc system in the kernel. (The default is to use the qdisc bypass if available; for testing purposes the qdisc path can be enabled via the command line option "--qdisc-path".)

In this release, I've also made "trafgen" scale to more CPUs.

The hard part of using trafgen is specifying and creating the packet description input file.  I really enjoy the flexibility when defining the packet contents, but without good examples as a starting point, it can be a daunting task.

For that reason, I've made some examples available at github here:

I've used the SYN attack example while developing the SYNPROXY module; see my other blogpost. I'm releasing this example now because solutions for mitigating this attack are now available.

Jon Schipp also has a solution and has created a script "gencfg" for generating trafgen packet description input files, available on github:

Notice: to get these performance numbers you need to tune your packet generator machine for network overload testing.

Jesper Dangaard Brouer: Mitigating DDoS SYN flood attacks with iptables/netfilter

Hey, I'm also blogging on the Red Hat Enterprise Linux Blog

I recently did a very practical post on Mitigating TCP SYN Flood Attacks with iptables/netfilter, with the hope of providing the world with a practical solution to these annoying SYN-flood DDoS attacks that we have been seeing for the last 20 years.

I've also been touring with a technical talk on the subject; the most recent version of the slides is here.

There is also a YouTube video of my presentation at DevConf 2013.

April 29, 2014

Jesper Dangaard Brouer: Basic tuning for network overload testing

I'm doing a lot of network testing, where I'm constantly trying to push the limits of the hardware and network stack (in order to improve performance and fix scalability issues in the Linux kernel).

Some basic tuning of the NICs (Network Interface Cards) and IRQs is required before we can start this "overload" testing mode.

1. First thing I do is kill "irqbalance", to avoid it interfering with my manual IRQ assignments.

 # killall irqbalance

2. Next, I align/bind the NIC's IRQs to CPUs (one-to-one).

I have a script for aligning the IRQs that I copied from the Intel ixgbe driver tarball:
 # set_irq_affinity $DEV

The easiest way to view the current IRQ assignment is this "grep ." trick:
  # grep . /proc/irq/*/eth4{,-*}/../smp_affinity_list

3. Then I disable Ethernet Flow-Control

 # ethtool -A $DEV rx off tx off autoneg off

I'm disabling Ethernet Flow Control (PAUSE frames) because I want to create an overload situation. When my transmitter/generator machine is overloading the target machine, I don't want the target machine to send "backoff" PAUSE frames, especially if I'm testing the limits of the transmitter's network stack.

4. Unload all netfilter and iptables modules

I have a simple script for flushing iptables and unloading all the modules:

I usually also perform benchmarking and tuning of iptables/Netfilter modules, but for overload testing I'm unloading all modules, as these do introduce measurable overhead.

Extra: A word of caution regarding CPU sleep or idle states:
I've experienced issues when doing low-latency measurements with Sandy Bridge-E CPUs' C-states, because the CPU too aggressively tried to go into a sleep state, even under high network load. The latency cost of coming out of a sleep state can be significant. Jeremy Eder has described these issues in detail on his blog:

Simply use the tool "turbostat" to measure the different C-states.

And use the tool "tuned-adm" to adjust what profile you want to enable e.g.:
 # tuned-adm profile throughput-performance
 # tuned-adm profile latency-performance

April 27, 2014

Eric Leblond: Slides of my coccigrep lightning talk at HES2014

I gave a lightning talk about coccigrep at Hackito Ergo Sum to show how it can be used to search code during an audit or hacking party. Here are the slides: coccigrep: a semantic grep for the C language.

The slides of my talk Suricata 2.0, Netfilter and the PRC will soon be available on Stamus Networks website.

April 17, 2014

Eric Leblond: Speeding up scapy packets sending

Sending packets with scapy

I’m currently writing some code based on scapy. This code reads data from a possibly huge file and sends a packet for each line in the file using the contained information. So the code contains a simple loop and uses sendp() because the frame must be sent at layer 2.

     def run(self):
         filedesc = open(self.filename, 'r')
         # loop on read line
         for line in filedesc:
             # Build and send packet
             sendp(pkt, iface = self.iface, verbose = verbose)
             # Inter packet treatment

Doing that, the performance is a bit disappointing. For 18 packets, we get:

    real    0m2.437s
    user    0m0.056s
    sys     0m0.012s

If we strace the code, the explanation is quite obvious:

socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0S}0@\0*\6\265\373\307;\224\24\300\250"..., 97, 0, NULL, 0) = 97
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0
socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0004}1@\0*\6\266\31\307;\224\24\300\250"..., 66, 0, NULL, 0) = 66
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0

For each packet, a new socket is opened, and this takes ages.

Speeding up the sending

To speed up the sending, one solution is to build a list of packets and to send that list via a sendp() call.

     def run(self):
         filedesc = open(self.filename, 'r')
         pkt_list = []
         # loop on read line
         for line in filedesc:
             # Build packet and append it to the list
             pkt_list.append(pkt)
         sendp(pkt_list, iface = self.iface, verbose = verbose)

This is not possible in our case due to the inter-packet treatment we have to do. So the best way is to reuse the socket. This can be done easily once you’ve read the documentation^W code:

@@ -27,6 +27,7 @@ class replay:
     def run(self):
         # open filename
         filedesc = open(self.filename, 'r')
+        s = conf.L2socket(iface=self.iface)
         # loop on read line
         for line in filedesc:
             # Build and send packet
-            sendp(pkt, iface = self.iface, verbose = verbose)
+            s.send(pkt)

The idea is to create a socket via the function used in sendp() and to use the send() function of the object to send packets.

With that modification, the performance is far better:

    real    0m0.108s
    user    0m0.064s
    sys     0m0.004s

I’m not a scapy expert so ping me if there is a better way to do this.

April 16, 2014

Jesper Dangaard Brouer: Full scalability for Netfilter conntracks

My scalability fixes for Netfilter connection tracking have reached Linus' tree and will appear in kernel release v3.15.

Netfilter’s conntrack has had a bad reputation for being slow. While this was true in the early days, it has been offering excellent scalability for established conntracks for a long time now.  Matching against existing conntrack entries is very fast and completely scalable. (The conntrack system actually does lockless RCU (Read-Copy Update) lookups for existing connections.)

The conntrack system has, however, long had a scalability problem when it comes to creating (or deleting) connections: a single central spinlock.  This scalability issue is now fixed.

This work relates to my recent efforts of using conntrack for DDoS protection, as e.g. SYN-floods would hit this "new connection" scalability problem in Netfilter conntracks.

Version 3 of the patchset was finally accepted on March 7th 2014 (note that Eric Dumazet worked on the first attempts back on May 9th 2013). The most important commit is 93bb0ceb75 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock").

Jesper Dangaard Brouer: Announcing: The IPTV-Analyzer

I'm happy to announce the first official release of the IPTV-Analyzer as an Open Source project.

The IPTV-Analyzer is a continuous/real-time tool for analyzing the contents of MPEG2 Transport Stream (TS) packets, which is commonly used for IPTV multicast signals. The main purpose is continuous quality measurement, with a focus on detecting MPEG2 TS/CC packet drops.

The core component is an iptables (Linux) kernel module, named "mpeg2ts". This kernel module performs real-time Deep Packet Inspection of the MPEG2-TS packets. It is highly performance-optimized, written for parallel processing across CPU cores (via RCU locking), and hash tables are used for handling a large number of streams. Statistics are exported via the proc filesystem (scalability is achieved via use of the seq_file proc API). It scales to hundreds of IPTV channels, even on small ATOM-based CPUs.

Please send bug reports, patches, improvements, comments or insults to:
