netfilter project logo Planet Netfilter

October 01, 2015

Eric Leblond: My “Kernel packet capture technologies” talk at KR2015

I’ve just finished my talk on Linux kernel packet capture technologies at Kernel Recipes 2015. I would like to thanks the organizer for their great work. I also thank Frank Tizzoni for the drawing


In that talk, I’ve tried to do an overview of the history of packet capture technologies in the Linux kernel. All that seen from userspace and from a Suricata developer perspective.

You can download the slides here: 2015_kernel_recipes_capture.

August 15, 2015

Rusty Russell: Broadband Speeds, New Data

Thanks to edmundedgar on reddit I have some more accurate data to update my previous bandwidth growth estimation post: OFCOM UK, who released their November 2014 report on average broadband speeds.  Whereas Akamai numbers could be lowered by the increase in mobile connections, this directly measures actual broadband speeds.

Extracting the figures gives:

  1. Average download speed in November 2008 was 3.6Mbit
  2. Average download speed in November 2014 was 22.8Mbit
  3. Average upload speed in November 2014 was 2.9Mbit
  4. Average upload speed in November 2008 to April 2009 was 0.43Mbit/s

So in 6 years, downloads went up by 6.333 times, and uploads went up by 6.75 times.  That’s an annual increase of 36% for downloads and 37% for uploads; that’s good, as it implies we can use download speed factor increases as a proxy for upload speed increases (as upload speed is just as important for a peer-to-peer network).

This compares with my previous post’s Akamai’s UK numbers of 3.526Mbit in Q4 2008 and 10.874Mbit in Q4 2014: only a factor of 3.08 (26% per annum).  Given how close Akamai’s numbers were to OFCOM’s in November 2008 (a year after the iPhone UK release, but probably too early for mobile to have significant effect), it’s reasonable to assume that mobile plays a large part of this difference.

If we assume Akamai’s numbers reflected real broadband rates prior to November 2008, we can also use it to extend the OFCOM data back a year: this is important since there was almost no bandwidth growth according to Akamai from Q4 2007 to Q7 2008: ignoring that period gives a rosier picture than my last post, and smells of cherrypicking data.

So, let’s say the UK went from 3.265Mbit in Q4 2007 (Akamai numbers) to 22.8Mbit in Q4 2014 (OFCOM numbers).  That’s a factor of 6.98, or 32% increase per annum for the UK. If we assume that the US Akamai data is under-representing Q4 2014 speeds by the same factor (6.333 / 3.08 = 2.056) as the UK data, that implies the US went from 3.644Mbit in Q4 2007 to 11.061 * 2.056 = 22.74Mbit in Q4 2014, giving a factor of 6.24, or 30% increase per annum for the US.

As stated previously, China is now where the US and UK were 7 years ago, suggesting they’re a reasonable model for future growth for that region.  Thus I revise my bandwidth estimates; instead of 17% per annum this suggests 30% per annum as a reasonable growth rate.

August 04, 2015

Rusty Russell: The Bitcoin Blocksize: A Summary

There’s a significant debate going on at the moment in the Bitcoin world; there’s a great deal of information and misinformation, and it’s hard to find a cogent summary in one place.  This post is my attempt, though I already know that it will cause me even more trouble than that time I foolishly entitled a post “If you didn’t run code written by assholes, your machine wouldn’t boot”.

The Technical Background: 1MB Block Limit

The bitcoin protocol is powered by miners, who gather transactions into blocks, producing a block every 10 minutes (but it varies a lot).  They get a 25 bitcoin subsidy for this, plus whatever fees are paid by those transactions.  This subsidy halves every 4 years: in about 12 months it will drop to 12.5.

Full nodes on the network check transactions and blocks, and relay them to others.  There are also lightweight nodes which simply listen for transactions which affect them, and trust that blocks from miners are generally OK.

A normal transaction is 250 bytes, and there’s a hard-coded 1 megabyte limit on the block size.  This limit was introduced years ago as a quick way of avoiding a miner flooding the young network, though the original code could only produce 200kb blocks, and the default reference code still defaults to a 750kb limit.

In the last few months there have been increasing runs of full blocks, causing backlogs for a few hours.  More recently, someone deliberately flooded the network with normal-fee transactions for several days; any transactions paying less fees than those had to wait for hours to be processed.

There are 5 people who have commit access to the bitcoin reference implementation (aka. “bitcoin-core”), and they vary significantly in their concerns on the issue.

The Bitcoin Users’ Perspective

From the bitcoin users perspective, blocks should be infinite, and fees zero or minimal.  This is the basic position of respected (but non-bitcoin-core) developer Mike Hearn, and has support from bitcoin-core ex-lead Gavin Andresen.  They work on the wallet and end-user side of bitcoin, and they see the issue as the most urgent.  In an excellent post arguing why growth is so important, Mike raises the following points, which I’ve paraphrased:

  1. Currencies have network effects. A currency that has few users is simply not competitive with currencies that have many.
  2. A decentralised currency that the vast majority can’t use doesn’t change the amount of centralisation in the world. Most people will still end up using banks, with all the normal problems.
  3. Growth is a part of the social contract. It always has been.
  4. Businesses will only continue to invest in bitcoin and build infrastructure if they are assured that the market will grow significantly.
  5. Bitcoin needs users, lots of them, for its political survival. There are many people out there who would like to see digital cash disappear, or be regulated out of existence.

At this point, it’s worth mentioning another bitcoin-core developer: Jeff Garzik.  He believes that the bitcoin userbase has been promised that transactions will continue to be almost free.  When a request to change the default mining limit from 750kb to 1M was closed by the bitcoin lead developer Wladimir van der Laan as unimportant, Jeff saw this as a symbolic moment:

What Happens If We Don’t Increase Soon?

Mike Hearn has a fairly apocalyptic view of what would happen if blocks fill.  That was certainly looking likely when the post was written, but due to episodes where the blocks were full for days, wallet designers are (finally) starting to estimate fees for timely processing (miners process larger fee transactions first).  Some wallets and services didn’t even have a way to change the setting, leaving users stranded during high-volume events.

It now seems that the bursts of full blocks will arrive with increasing frequency; proposals are fairly mature now to allow users to post-increase fees if required, which (if all goes well) could make for a fairly smooth transition from the current “fees are tiny and optional” mode of operation to a “there will be a small fee”.

But even if this rosy scenario is true, this begsavoids the bigger question of how high fees can become before bitcoin becomes useless.  1c?  5c?  20c? $1?

So What Are The Problems With Increasing The Blocksize?

In a word, the problem is miners.  As mining has transitioned from a geek pastime, semi-hobbyist, then to large operations with cheap access to power, it has become more concentrated.

The only difference between bitcoin and previous cryptocurrencies is that instead of a centralized “broker” to ensure honesty, bitcoin uses an open competition of miners. Given bitcoin’s endurance, it’s fair to count this a vital property of bitcoin.  Mining centralization is the long-term concern of another bitcoin-core developer (and my coworker at Blockstream), Gregory Maxwell.

Control over half the block-producing power and you control who can use bitcoin and cheat anyone not using a full node themselves.  Control over 2/3, and you can force a rule change on the rest of the network by stalling it until enough people give in.  Central control is also a single point to shut the network down; that lets others apply legal or extra-legal pressure to restrict the network.

What Drives Centralization?

Bitcoin mining is more efficient at scale. That was to be expected[7]. However, the concentration has come much faster than expected because of the invention of mining pools.  These pools tell miners what to mine, in return for a small (or in some cases, zero) share of profits.  It saves setup costs, they’re easy to use, and miners get more regular payouts.  This has caused bitcoin to reel from one centralization crisis to another over the last few years; the decline in full nodes has been precipitous by some measures[5] and continues to decline[6].

Consider the plight of a miner whose network is further away from most other miners.  They find out about new blocks later, and their blocks get built on later.  Both these effects cause them to create blocks which the network ignores, called orphans.  Some orphans are the inevitable consequence of miners racing for the same prize, but the orphan problem is not symmetrical.  Being well connected to the other miners helps, but there’s a second effect: if you discover the previous block, you’ve a head-start on the next one.  This means a pool which has 20% of the hashing power doesn’t have to worry about delays at all 20% of the time.

If the orphan rate is very low (say, 0.1%), the effect can be ignored.  But as it climbs, the pressure to join a pool (the largest pool) becomes economically irresistible, until only one pool remains.

Larger Blocks Are Driving Up Orphan Rates

Large blocks take longer to propagate, increasing the rate of orphans.  This has been happening as blocks increase.  Blocks with no transactions at all are smallest, and so propagate fastest: they still get a 25 bitcoin subsidy, though they don’t help bitcoin users much.

Many people assumed that miners wouldn’t overly centralize, lest they cause a clear decentralization failure and drive the bitcoin price into the ground.  That assumption has proven weak in the face of climbing orphan rates.

And miners have been behaving very badly.  Mining pools orchestrate attacks on each other with surprising regularity; DDOS and block withholding attacks are both well documented[1][2].  A large mining pool used their power to double spend and steal thousands of bitcoin from a gambling service[3].  When it was noticed, they blamed a rogue employee.  No money was returned, nor any legal action taken.  It was hoped that miners would leave for another pool as they approached majority share, but that didn’t happen.

If large blocks can be used as a weapon by larger miners against small ones[8], it’s expected that they will be.

More recently (and quite by accident) it was discovered that over half the mining power aren’t verifying transactions in blocks they build upon[4].  They did this in order to reduce orphans, and one large pool is still doing so.  This is a problem because lightweight bitcoin clients work by assuming anything in the longest chain of blocks is good; this was how the original bitcoin paper anticipated that most users would interact with the system.

The Third Side Of The Debate: Long Term Network Funding

Before I summarize, it’s worth mentioning the debate beyond the current debate: long term network support.  The minting of new coins decreases with time; the plan of record (as suggested in the original paper) is that total transaction fees will rise to replace the current mining subsidy.  The schedule of this is unknown and generally this transition has not happened: free transactions still work.

The block subsidy as I write this is about $7000.  If nothing else changes, miners would want $3500 in fees in 12 months when the block subsidy halves, or about $2 per transaction.  That won’t happen; miners will simply lose half their income.  (Perhaps eventually they form a cartel to enforce a minimum fee, causing another centralization crisis? I don’t know.)

It’s natural for users to try to defer the transition as long as possible, and the practice in bitcoin-core has been to aggressively reduce the default fees as the bitcoin price rises.  Core developers Gregory Maxwell and Pieter Wuille feel that signal was a mistake; that fees will have to rise eventually and users should not be lulled into thinking otherwise.

Mike Hearn in particular has been holding out the promise that it may not be necessary.  On this he is not widely supported: that some users would offer to pay more so other users can continue to pay less.

It’s worth noting that some bitcoin businesses rely on the current very low fees and don’t want to change; I suspect this adds bitterness and vitriol to many online debates.


The bitcoin-core developers who deal with users most feel that bitcoin needs to expand quickly or die, that letting fees emerge now will kill expansion, and that the infrastructure will improve over time if it has to.

Other bitcoin-core developers feel that bitcoin’s infrastructure is dangerously creaking, that fees need to emerge anyway, and that if there is a real emergency a blocksize change could be rolled out within a few weeks.

At least until this is resolved, don’t count on future bitcoin fees being insignificant, nor promise others that bitcoin has “free transactions”.


“Bitcoin Mining Pools Targeted in Wave of DDOS Attacks” Coinbase 2015

[2] “Block Withholding Attacks – Recent Research” N T Courtois 2014

[3] “GHash.IO and double-spending against BetCoin Dice” mmtech et. al 2013

[4] “Questions about the July 4th BIP66 fork”

[5] “350,000 full nodes to 6,000 in two years…” P Todd 2015

[6] “Reachable nodes during the last 365 days.”

[7] “Re: Scalability and transaction rate” Satoshi 2010

[8] “[Bitcoin-development] Mining centralization pressure from non-uniform propagation speed” Pieter Wuille 2015

July 08, 2015

Rusty Russell: The Megatransaction: Why Does It Take 25 Seconds?

Last night f2pool mined a 1MB block containing a single 1MB transaction.  This scooped up some of the spam which has been going to various weakly-passworded “brainwallets”, gaining them 0.5569 bitcoins (on top of the normal 25 BTC subsidy).  You can see the megatransaction on

It was widely reported to take about 25 seconds for bitcoin core to process this block: this is far worse than my “2 seconds per MB” result in my last post, which was considered a pretty bad case.  Let’s look at why.

How Signatures Are Verified

The algorithm to check a transaction input (of this form) looks like this:

  1. Strip the other inputs from the transaction.
  2. Replace the input script we’re checking with the script of the output it’s trying to spend.
  3. Hash the resulting transaction with SHA256, then hash the result with SHA256 again.
  4. Check the signature correctly signed that hash result.

Now, for a transaction with 5570 inputs, we have to do this 5570 times.  And the bitcoin core code does this by making a copy of the transaction each time, and using the marshalling code to hash that; it’s not a huge surprise that we end up spending 20 seconds on it.

How Fast Could Bitcoin Core Be If Optimized?

Once we strip the inputs, the result is only about 6k long; hashing 6k 5570 times takes about 265 milliseconds (on my modern i3 laptop).  We have to do some work to change the transaction each time, but we should end up under half a second without any major backflips.

Problem solved?  Not quite….

This Block Isn’t The Worst Case (For An Optimized Implementation)

As I said above, the amount we have to hash is about 6k; if a transaction has larger outputs, that number changes.  We can fit in fewer inputs though.  A simple simulation shows the worst case for 1MB transaction has 3300 inputs, and 406000 byte output(s): simply doing the hashing for input signatures takes about 10.9 seconds.  That’s only about two or three times faster than the bitcoind naive implementation.

This problem is far worse if blocks were 8MB: an 8MB transaction with 22,500 inputs and 3.95MB of outputs takes over 11 minutes to hash.  If you can mine one of those, you can keep competitors off your heels forever, and own the bitcoin network… Well, probably not.  But there’d be a lot of emergency patching, forking and screaming…

Short Term Steps

An optimized implementation in bitcoind is a good idea anyway, and there are three obvious paths:

  1. Optimize the signature hash path to avoid the copy, and hash in place as much as possible.
  2. Use the Intel and ARM optimized SHA256 routines, which increase SHA256 speed by about 80%.
  3. Parallelize the input checking for large numbers of inputs.

Longer Term Steps

A soft fork could introduce an OP_CHECKSIG2, which hashes the transaction in a different order.  In particular, it should hash the input script replacement at the end, so the “midstate” of the hash can be trivially reused.  This doesn’t entirely eliminate the problem, since the sighash flags can require other permutations of the transaction; these would have to be carefully explored (or only allowed with OP_CHECKSIG).

This soft fork could also place limits on how big an OP_CHECKSIG-using transaction could be.

Such a change will take a while: there are other things which would be nice to change for OP_CHECKSIG2, such as new sighash flags for the Lightning Network, and removing the silly DER encoding of signatures.

July 06, 2015

Rusty Russell: Bitcoin Core CPU Usage With Larger Blocks

Since I was creating large blocks (41662 transactions), I added a little code to time how long they take once received (on my laptop, which is only an i3).

The obvious place to look is CheckBlock: a simple 1MB block takes a consistent 10 milliseconds to validate, and an 8MB block took 79 to 80 milliseconds, which is nice and linear.  (A 17MB block took 171 milliseconds).

Weirdly, that’s not the slow part: promoting the block to the best block (ActivateBestChain) takes 1.9-2.0 seconds for a 1MB block, and 15.3-15.7 seconds for an 8MB block.  At least it’s scaling linearly, but it’s just slow.

So, 16 Seconds Per 8MB Block?

I did some digging.  Just invalidating and revalidating the 8MB block only took 1 second, so something about receiving a fresh block makes it worse. I spent a day or so wrestling with benchmarking[1]…

Indeed, ConnectTip does the actual script evaluation: CheckBlock() only does a cursory examination of each transaction.  I’m guessing bitcoin core is not smart enough to parallelize a chain of transactions like mine, hence the 2 seconds per MB.  On normal transaction patterns even my laptop should be about 4 times faster than that (but I haven’t actually tested it yet!).

So, 4 Seconds Per 8MB Block?

But things are going to get better: I hacked in the currently-disabled libsecp256k1, and the time for the 8MB ConnectTip dropped from 18.6 seconds to 6.5 seconds.

So, 1.6 Seconds Per 8MB Block?

I re-enabled optimization after my benchmarking, and the result was 4.4 seconds; that’s libsecp256k1, and an 8MB block.

Let’s Say 1.1 Seconds for an 8MB Block

This is with some assumptions about parallelism; and remember this is on my laptop which has a fairly low-end CPU.  While you may not be able to run a competitive mining operation on a Raspberry Pi, you can pretty much ignore normal verification times in the blocksize debate.


[1] I turned on -debug=bench, which produced impenetrable and seemingly useless results in the log.

So I added a print with a sleep, so I could run perf.  Then I disabled optimization, so I’d get understandable backtraces with perf.  Then I rebuilt perf because Ubuntu’s perf doesn’t demangle C++ symbols, which is part of the kernel source package. (Are we having fun yet?).  I even hacked up a small program to help run perf on just that part of bitcoind.   Finally, after perf failed me (it doesn’t show 100% CPU, no idea why; I’d expect to see main in there somewhere…) I added stderr prints and ran strace on the thing to get timings.

July 03, 2015

Rusty Russell: Wrapper for running perf on part of a program.

Linux’s perf competes with early git for title of least-friendly Linux tool.  Because it’s tied to kernel versions, and the interfaces changes fairly randomly, you can never figure out how to use the version you need to use (hint: always use -g).

But when it works, it’s very useful.  Recently I wanted to figure out where bitcoind was spending its time processing a block; because I’m a cool kid, I didn’t use gprof, I used perf.  The problem is that I only want information on that part of bitcoind.  To start with, I put a sleep(30) and a big printf in the source, but that got old fast.

Thus, I wrote “perfme.c“.  Compile it (requires some trivial CCAN headers) and link perfme-start and perfme-stop to the binary.  By default it runs/stops perf record on its parent, but an optional pid arg can be used for other things (eg. if your program is calling it via system(), the shell will be the parent).

June 25, 2015

Rusty Russell: Hashing Speed: SHA256 vs Murmur3

So I did some IBLT research (as posted to bitcoin-dev ) and I lazily used SHA256 to create both the temporary 48-bit txids, and from them to create a 16-bit index offset.  Each node has to produce these for every bitcoin transaction ID it knows about (ie. its entire mempool), which is normally less than 10,000 transactions, but we’d better plan for 1M given the coming blopockalypse.

For txid48, we hash an 8 byte seed with the 32-byte txid; I ignored the 8 byte seed for the moment, and measured various implementations of SHA256 hashing 32 bytes on on my Intel Core i3-5010U CPU @ 2.10GHz laptop (though note we’d be hashing 8 extra bytes for IBLT): (implementation in CCAN)

  1. Bitcoin’s SHA256: 527.7+/-0.9 nsec
  2. Optimizing the block ending on bitcoin’s SHA256: 500.4+/-0.66 nsec
  3. Intel’s asm rorx: 314.1+/-0.3 nsec
  4. Intel’s asm SSE4 337.5+/-0.5 nsec
  5. Intel’s asm RORx-x8ms 458.6+/-2.2 nsec
  6. Intel’s asm AVX 336.1+/-0.3 nsec

So, if you have 1M transactions in your mempool, expect it to take about 0.62 seconds of hashing to calculate the IBLT.  This is too slow (though it’s fairly trivially parallelizable).  However, we just need a universal hash, not a cryptographic one, so I benchmarked murmur3_x64_128:

  1. Murmur3-128: 23 nsec

That’s more like 0.046 seconds of hashing, which seems like enough of a win to add a new hash to the mix.

June 19, 2015

Rusty Russell: Mining on a Home DSL connection: latency for 1MB and 8MB blocks

I like data.  So when Patrick Strateman handed me a hacky patch for a new testnet with a 100MB block limit, I went to get some.  I added 7 digital ocean nodes, another hacky patch to prevent sendrawtransaction from broadcasting, and a quick utility to create massive chains of transactions/

My home DSL connection is 11Mbit down, and 1Mbit up; that’s the fastest I can get here.  I was CPU mining on my laptop for this test, while running tcpdump to capture network traffic for analysis.  I didn’t measure the time taken to process the blocks on the receiving nodes, just the first propagation step.

1 Megabyte Block

Naively, it should take about 10 seconds to send a 1MB block up my DSL line from first packet to last.  Here’s what actually happens, in seconds for each node:

  1. 66.8
  2. 70.4
  3. 71.8
  4. 71.9
  5. 73.8
  6. 75.1
  7. 75.9
  8. 76.4

The packet dump shows they’re all pretty much sprayed out simultaneously (bitcoind may do the writes in order, but the network stack interleaves them pretty well).  That’s why it’s 67 seconds at best before the first node receives my block (a bit longer, since that’s when the packet left my laptop).

8 Megabyte Block

I increased my block size, and one node dropped out, so this isn’t quite the same, but the times to send to each node are about 8 times worse, as expected:

  1. 501.7
  2. 524.1
  3. 536.9
  4. 537.6
  5. 538.6
  6. 544.4
  7. 546.7


Using the rough formula of 1-exp(-t/600), I would expect orphan rates of 10.5% generating 1MB blocks, and 56.6% with 8MB blocks; that’s a huge cut in expected profits.


  • Get a faster DSL connection.  Though even an uplink 10 times faster would mean 1.1% orphan rate with 1MB blocks, or 8% with 8MB blocks.
  • Only connect to a single well-connected peer (-maxconnections=1), and hope they propagate your block.
  • Refuse to mine any transactions, and just collect the block reward.  Doesn’t help the bitcoin network at all though.
  • Join a large pool.  This is what happens in practice, but raises a significant centralization problem.


  • We need bitcoind to be smarter about ratelimiting in these situations, and stream serially.  Done correctly (which is hard), it could also help bufferbloat which makes running a full node at home so painful when it propagates blocks.
  • Some kind of block compression, along the lines of Gavin’s IBLT idea. I’ve done some preliminary work on this, and it’s promising, but far from trivial.


June 03, 2015

Rusty Russell: What Transactions Get Crowded Out If Blocks Fill?

What happens if bitcoin blocks fill?  Miners choose transactions with the highest fees, so low fee transactions get left behind.  Let’s look at what makes up blocks today, to try to figure out which transactions will get “crowded out” at various thresholds.

Some assumptions need to be made here: we can’t automatically tell the difference between me taking a $1000 output and paying you 1c, and me paying you $999.99 and sending myself the 1c change.  So my first attempt was very conservative: only look at transactions with two or more outputs which were under the given thresholds (I used a nice round $200 / BTC price throughout, for simplicity).

(Note: I used bitcoin-iterate to pull out transaction data, and rebuild blocks without certain transactions; you can reproduce the csv files in the blocksize-stats directory if you want).

Paying More Than 1 Person Under $1 (< 500000 Satoshi)

Here’s the result (against the current blocksize):

Sending 2 Or More Sub-$1 Outputs

Let’s zoom in to the interesting part, first, since there’s very little difference before 220,000 (February 2013).  You can see that only about 18% of transactions are sending less than $1 and getting less than $1 in change:

Since March 2013…

Paying Anyone Under 1c, 10c, $1

The above graph doesn’t capture the case where I have $100 and send you 1c.   If we eliminate any transaction which has any output less than various thresholds, we’ll catch that. The downside is that we capture the “sending myself tiny change” case, but I’d expect that to be rarer:

Blocksizes Without Small Output Transactions

This eliminates far more transactions.  We can see only 2.5% of the block size is taken by transactions with 1c outputs (the dark red line following the block “current blocks” line), but the green line shows about 20% of the block used for 10c transactions.  And about 45% of the block is transactions moving $1 or less.

Interpretation: Hard Landing Unlikely, But Microtransactions Lose

If the block size doesn’t increase (or doesn’t increase in time): we’ll see transactions get slower, and fees become the significant factor in whether your transaction gets processed quickly.  People will change behaviour: I’m not going to spend 20c to send you 50c!

Because block finding is highly variable and many miners are capping blocks at 750k, we see backlogs at times already; these bursts will happen with increasing frequency from now on.  This will put pressure on Satoshdice and similar services, who will be highly incentivized to use StrawPay or roll their own channel mechanism for off-blockchain microtransactions.

I’d like to know what timescale this happens on, but the graph shows that we grow (and occasionally shrink) in bursts.  A logarithmic graph prepared by Peter R of suggests that we hit 1M mid-2016 or so; expect fee pressure to bend that graph downwards soon.

The bad news is that even if fees hit (say) 25c and that prevents all the sub-$1 transactions, we only double our capacity, giving us perhaps another 18 months. (At that point miners are earning $1000 from transaction fees as well as $5000 (@ $200/BTC) from block reward, which is nice for them I guess.)

My Best Guess: Larger Blocks Desirable Within 2 Years, Needed by 3

Personally I think 5c is a reasonable transaction fee, but I’d prefer not to see it until we have decentralized off-chain alternatives.  I’d be pretty uncomfortable with a 25c fee unless the Lightning Network was so ubiquitous that I only needed to pay it twice a year.  Higher than that would have me reaching for my credit card to charge my Lightning Network account :)

Disclaimer: I Work For BlockStream, on Lightning Networks

Lightning Networks are a marathon, not a sprint.  The development timeframes in my head are even vaguer than the guesses above.  I hope it’s part of the eventual answer, but it’s not the bandaid we’re looking for.  I wish it were different, but we’re going to need other things in the mean time.

I hope this provided useful facts, whatever your opinions.

Rusty Russell: Current Blocksize, by graphs.

I used bitcoin-iterate and gnumeric to render the current bitcoin blocksizes, and here are the results.

My First Graph: A Moment of Panic

This is block sizes up to yesterday; I’ve asked gnumeric to derive an exponential trend line from the data (in black; the red one is linear)

Woah! We hit 1M blocks in a month! PAAAANIC!

That trend line hits 1000000 at block 363845.5, which we’d expect in about 32 days time!  This is what is freaking out so many denizens of the Bitcoin Subreddit. I also just saw a similar inaccurate [correction: misleading] graph reshared by Mike Hearn on G+ :(

But Wait A Minute

That trend line says we’re on 800k blocks today, and we’re clearly not.  Let’s add a 6 hour moving average:

Oh, we’re only halfway there….

In fact, if we cluster into 36 blocks (ie. 6 hours worth), we can see how misleading the terrible exponential fit is:

What! We’re already over 1M blocks?? Maths, you lied to me!

Clearer Graphs: 1 week Moving Average

Actual Weekly Running Average Blocksize

So, not time to panic just yet, though we’re clearly growing, and in unpredictable bursts.

June 01, 2015

Rusty Russell: Block size: rate of internet speed growth since 2008?

I’ve been trying not to follow the Great Blocksize Debate raging on reddit.  However, the lack of any concrete numbers has kind of irked me, so let me add one for now.

If we assume bandwidth is the main problem with running nodes, let’s look at average connection growth rates since 2008.  Google lead me to NetMetrics (who seem to charge), and Akamai’s State Of The Internet (who don’t).  So I used the latter, of course:

Akamai’s Average Connection Speed Chart Q4/07 to Q4/14

I tried to pick a range of countries, and here are the results:

Country % Growth Over 7 years Per Annum
Australia 348 19.5%
Brazil 349 19.5%
China 481 25.2%
Philippines 258 14.5%
UK 333 18.8%
US 304 17.2%


Countries which had best bandwidth grew about 17% a year, so I think that’s the best model for future growth patterns (China is now where the US was 7 years ago, for example).

If bandwidth is the main centralization concern, you’ll want block growth below 15%. That implies we could jump the cap to 3MB next year, and 15% thereafter. Or if you’re less conservative, 3.5MB next year, and 17% there after.

April 30, 2015

Eric Leblond: Elasticsearch, systemd and Debian Jessie

Now that Debian Jessie is out, it was the time to do an upgrade of my Elasticsearch servers. I’ve got two of them running in LXC containers on my main hardware system Upgrading to Jessie was straightforward via apt-get dist-upgrade.

But the Elasticsearch server processes were not here after reboot. I’m using the Elasticsearch 1.5 packages provided by Elastic on their website.

Running /etc/init.d/elasticsearch start or service elasticsearch start were not giving any output. Systemd which is now starting the service was not kind enough to provide any debugging information.

After some research, I’ve discovered that the /usr/lib/systemd/system/elasticsearch.service was starting elasticsearch with:
ExecStart=/usr/share/elasticsearch/bin/elasticsearch            \
                            -Des.default.config=$CONF_FILE      \
                            -Des.default.path.home=$ES_HOME     \
                            -Des.default.path.logs=$LOG_DIR     \
                  $DATA_DIR    \
                  $WORK_DIR    \
And the variables like CONF_FILE defined in /etc/default/elasticsearch were commented and it seems it was preventing Elasticsearch to start. So updating the file to have:
# Elasticsearch log directory

# Elasticsearch data directory

# Elasticsearch work directory

# Elasticsearch configuration directory

# Elasticsearch configuration file (elasticsearch.yml)
was enough to have a working Elasticsearch.

As pointed out by Peter Manev, to enable the service at reboot, you need to run as root:

systemctl enable elasticsearch.service

I hope this will help some of you and let me know if you find better solutions.

Rusty Russell: Some bitcoin mempool data: first look

Previously I discussed the use of IBLTs (on the pettycoin blog).  Kalle and I got some interesting, but slightly different results; before I revisited them I wanted some real data to play with.

Finally, a few weeks ago I ran 4 nodes for a week, logging incoming transactions and the contents of the mempools when we saw a block.  This gives us some data to chew on when tuning any fast block sync mechanism; here’s my first impressions looking a the data (which is available on github).

These graphs are my first look; in blue is the number of txs in the block, and in purple stacked on top is the number of txs which were left in the mempool after we took those away.

The good news is that all four sites are very similar; there’s small variance across these nodes (three are in Digital Ocean data centres and one is behind two NATs and a wireless network at my local coworking space).

The bad news is that there are spikes of very large mempools around block 352,800; a series of 731kb blocks which I’m guessing is some kind of soft limit for some mining software [EDIT: 750k is the default soft block limit; reported in 1024-byte quantities as does, this is 732k.  Thanks sipa!].  Our ability to handle this case will depend very much on heuristics for guessing which transactions are likely candidates to be in the block at all (I’m hoping it’s as simple as first-seen transactions are most likely, but I haven’t tested yet).

Transactions in Mempool and in Blocks: Australia (poor connection)

Transactions in Mempool and in Blocks: Singapore

Transactions in Mempool and in Blocks: San Francisco

Transactions in Mempool and in Blocks: San Francisco (using Relay Network)

April 08, 2015

Rusty Russell: Lightning Networks Part IV: Summary

This is the fourth part of my series of posts explaining the bitcoin Lightning Networks 0.5 draft paper.  See Part I, Part II and Part III.

The key revelation of the paper is that we can have a network of arbitrarily complicated transactions, such that they aren’t on the blockchain (and thus are fast, cheap and extremely scalable), but at every point are ready to be dropped onto the blockchain for resolution if there’s a problem.  This is genuinely revolutionary.

It also vindicates Satoshi’s insistence on the generality of the Bitcoin scripting system.  And though it’s long been suggested that bitcoin would become a clearing system on which genuine microtransactions would be layered, it was unclear that we were so close to having such a system in bitcoin already.

Note that the scheme requires some solution to malleability to allow chains of transactions to be built (this is a common theme, so likely to be mitigated in a future soft fork), but Gregory Maxwell points out that it also wants selective malleability, so transactions can be replaced without invalidating the HTLCs which are spending their outputs.  Thus it proposes new signature flags, which will require active debate, analysis and another soft fork.

There is much more to discover in the paper itself: recommendations for lightning network routing, the node charging model, a risk summary, the specifics of the softfork changes, and more.

I’ll leave you with a brief list of requirements to make Lightning Networks a reality:

  1. A soft-fork is required, to protect against malleability and to allow new signature modes.
  2. A new peer-to-peer protocol needs to be designed for the lightning network, including routing.
  3. Blame and rating systems are needed for lightning network nodes.  You don’t have to trust them, but it sucks if they go down as your money is probably stuck until the timeout.
  4. More refinements (eg. relative OP_CHECKLOCKTIMEVERIFY) to simplify and tighten timeout times.
  5. Wallets need to learn to use this, with UI handling of things like timeouts and fallbacks to the bitcoin network (sorry, your transaction failed, you’ll get your money back in N days).
  6. You need to be online every 40 days to check that an old HTLC hasn’t leaked, which will require some alternate solution for occasional users (shut down channel, have some third party, etc).
  7. A server implementation needs to be written.

That’s a lot of work!  But it’s all simply engineering from here, just as bitcoin was once the paper was released.  I look forward to seeing it happen (and I’m confident it will).

April 06, 2015

Rusty Russell: Lightning Networks Part III: Channeling Contracts

This is the third part of my series of posts explaining the bitcoin Lightning Networks 0.5 draft paper.

In Part I I described how a Poon-Dryja channel uses a single in-blockchain transaction to create off-blockchain transactions which can be safely updated by either party (as long as both agree), with fallback to publishing the latest versions to the blockchain if something goes wrong.

In Part II I described how Hashed Timelocked Contracts allow you to safely make one payment conditional upon another, so payments can be routed across untrusted parties using a series of transactions with decrementing timeout values.

Now we’ll join the two together: encapsulate Hashed Timelocked Contracts inside a channel, so they don’t have to be placed in the blockchain (unless something goes wrong).

Revision: Why Poon-Dryja Channels Work

Here’s half of a channel setup between me and you where I’m paying you 1c: (there’s always a mirror setup between you and me, so it’s symmetrical)

Half a channel: we will invalidate transaction 1 (in favour of a new transaction 2) to send funds.

The system works because after we agree on a new transaction (eg. to pay you another 1c), you revoke this by handing me your private keys to unlock that 1c output.  Now if you ever released Transaction 1, I can spend both the outputs.  If we want to add a new output to Transaction 1, we need to be able to make it similarly stealable.

Adding a 1c HTLC Output To Transaction 1 In The Channel

I’m going to send you 1c now via a HTLC (which means you’ll only get it if the riddle is answered; if it times out, I get the 1c back).  So we replace transaction 1 with transaction 2, which has three outputs: $9.98 to me, 1c to you, and 1c to the HTLC: (once we agree on the new transactions, we invalidate transaction 1 as detailed in Part I)

Our Channel With an Output for an HTLC

Note that you supply another separate signature (sig3) for this output, so you can reveal that private key later without giving away any other output.

We modify our previous HTLC design so you revealing the sig3 would allow me to steal this output. We do this the same way we did for that 1c going to you: send the output via a timelocked mutually signed transaction.  But there are two transaction paths in an HTLC: the got-the-riddle path and the timeout path, so we need to insert those timelocked mutually signed transactions in both of them.  First let’s append a 1 day delay to the timeout path:

Timeout path of HTLC, with locktime so it can be stolen once you give me your sig3.

Similarly, we need to append a timelocked transaction on the “got the riddle solution” path, which now needs my signature as well (otherwise you could create a replacement transaction and bypass the timelocked transaction):

Full HTLC: If you reveal Transaction 2 after we agree it’s been revoked, and I have your sig3 private key, I can spend that output before you can, down either the settlement or timeout paths.

Remember The Other Side?

Poon-Dryja channels are symmetrical, so the full version has a matching HTLC on the other side (except with my temporary keys, so you can catch me out if I use a revoked transaction).  Here’s the full diagram, just to be complete:

A complete lightning network channel with an HTLC, containing a glorious 13 transactions.

Closing The HTLC

When an HTLC is completed, we just update transaction 2, and don’t include the HTLC output.  The funds either get added to your output (R value revealed before timeout) or my output (timeout).

Note that we can have an arbitrary number of independent HTLCs in progress at once, and open and/or close as many in each transaction update as both parties agree to.

Keys, Keys Everywhere!

Each output for a revocable transaction needs to use a separate address, so we can hand the private key to the other party.  We use two disposable keys for each HTLC[1], and every new HTLC will change one of the other outputs (either mine, if I’m paying you, or yours if you’re paying me), so that needs a new key too.  That’s 3 keys, doubled for the symmetry, to give 6 keys per HTLC.

Adam Back pointed out that we can actually implement this scheme without the private key handover, and instead sign a transaction for the other side which gives them the money immediately.  This would permit more key reuse, but means we’d have to store these transactions somewhere on the off chance we needed them.

Storing just the keys is smaller, but more importantly, Section 6.2 of the paper describes using BIP 32 key hierarchies so the disposable keys are derived: after a while, you only need to store one key for all the keys the other side has given you.  This is vastly more efficient than storing a transaction for every HTLC, and indicates the scale (thousands of HTLCs per second) that the authors are thinking.

Next: Conclusion

My next post will be a TL;DR summary, and some more references to the implementation details and possibilities provided by the paper.


[1] The new sighash types are fairly loose, and thus allow you to attach a transaction to a different parent if it uses the same output addresses.  I think we could re-use the same keys in both paths if we ensure that the order of keys required is reversed for one, but we’d still need 4 keys, so it seems a bit too tricky.

April 01, 2015

Rusty Russell: Lightning Networks Part II: Hashed Timelock Contracts (HTLCs)

In Part I, we demonstrated Poon-Dryja channels; a generalized channel structure which used revocable transactions to ensure that old transactions wouldn’t be reused.

A channel from me<->you would allow me to efficiently send you 1c, but that doesn’t scale since it takes at least one on-blockchain transaction to set up each channel. The solution to this is to route funds via intermediaries;  in this example we’ll use the fictitious “MtBox”.

If I already have a channel with MtBox’s Payment Node, and so do you, that lets me reliably send 1c to MtBox without (usually) needing the blockchain, and it lets MtBox send you 1c with similar efficiency.

But it doesn’t give me a way to force them to send it to you; I have to trust them.  We can do better.

Bonding Unrelated Transactions using Riddles

For simplicity, let’s ignore channels for the moment.  Here’s the “trust MtBox” solution:

I send you 1c via MtBox; simplest possible version, using two independent transactions. I trust MtBox to generate its transaction after I send it mine.

What if we could bond these transactions together somehow, so that when you spend the output from the MtBox transaction, that automatically allows MtBox to spend the output from my transaction?

Here’s one way. You send me a riddle question to which nobody else knows the answer: eg. “What’s brown and sticky?”.  I then promise MtBox the 1c if they answer that riddle correctly, and tell MtBox that you know.

MtBox doesn’t know the answer, so it turns around and promises to pay you 1c if you answer “What’s brown and sticky?”. When you answer “A stick”, MtBox can pay you 1c knowing that it can collect the 1c off me.

The bitcoin blockchain is really good at riddles; in particular “what value hashes to this one?” is easy to express in the scripting language. So you pick a random secret value R, then hash it to get H, then send me H.  My transaction’s 1c output requires MtBox’s signature, and a value which hashes to H (ie. R).  MtBox adds the same requirement to its transaction output, so if you spend it, it can get its money back from me:

Two Independent Transactions, Connected by A Hash Riddle.

Handling Failure Using Timeouts

This example is too simplistic; when MtBox’s PHP script stops processing transactions, I won’t be able to get my 1c back if I’ve already published my transaction.  So we use a familiar trick from Part I, a timeout transaction which after (say) 2 days, returns the funds to me.  This output needs both my and MtBox’s signatures, and MtBox supplies me with the refund transaction containing the timeout:

Hash Riddle Transaction, With Timeout

MtBox similarly needs a timeout in case you disappear.  And it needs to make sure it gets the answer to the riddle from you within that 2 days, otherwise I might use my timeout transaction and it can’t get its money back.  To give plenty of margin, it uses a 1 day timeout:

MtBox Needs Your Riddle Answer Before It Can Answer Mine

Chaining Together

It’s fairly clear to see that longer paths are possible, using the same “timelocked” transactions.  The paper uses 1 day per hop, so if you were 5 hops away (say, me <-> MtBox <-> Carol <-> David <-> Evie <-> you) I would use a 5 day timeout to MtBox, MtBox a 4 day to Carol, etc.  A routing protocol is required, but if some routing doesn’t work two nodes can always cancel by mutual agreement (by creating timeout transaction with no locktime).

The paper refers to each set of transactions as contracts, with the following terms:

  • If you can produce to MtBox an unknown 20-byte random input data R from a known H, within two days, then MtBox will settle the contract by paying you 1c.
  • If two days have elapsed, then the above clause is null and void and the clearing process is invalidated.
  • Either party may (and should) pay out according to the terms of this contract in any method of the participants choosing and close out this contract early so long as both participants in this contract agree.

The hashing and timelock properties of the transactions are what allow them to be chained across a network, hence the term Hashed Timelock Contracts.

Next: Using Channels With Hashed Timelock Contracts.

The hashed riddle construct is cute, but as detailed above every transaction would need to be published on the blockchain, which makes it pretty pointless.  So the next step is to embed them into a Poon-Dryja channel, so that (in the normal, cooperative case) they don’t need to reach the blockchain at all.

March 30, 2015

Rusty Russell: Lightning Networks Part I: Revocable Transactions

I finally took a second swing at understanding the Lightning Network paper.  The promise of this work is exceptional: instant reliable transactions across the bitcoin network. But the implementation is complex and the draft paper reads like a grab bag of ideas; but it truly rewards close reading!  It doesn’t involve novel crypto, nor fancy bitcoin scripting tricks.

There are several techniques which are used in the paper, so I plan to concentrate on one per post and wrap up at the end.

Revision: Payment Channels

I open a payment channel to you for up to $10

A Payment Channel is a method for sending microtransactions to a single recipient, such as me paying you 1c a minute for internet access.  I create an opening transaction which has a $10 output, which can only be redeemed by a transaction input signed by you and me (or me alone, after a timeout, just in case you vanish).  That opening transaction goes into the blockchain, and we’re sure it’s bedded down.

I pay you 1c in the payment channel. Claim it any time!

Then I send you a signed transaction which spends that opening transaction output, and has two outputs: one for $9.99 to me, and one for 1c to you.  If you want, you could sign that transaction too, and publish it immediately to get your 1c.

Update: now I pay you 2c via the payment channel.

Then a minute later, I send you a signed transaction which spends that same opening transaction output, and has a $9.98 output for me, and a 2c output for you. Each minute, I send you another transaction, increasing the amount you get every time.

This works because:

  1.  Each transaction I send spends the same output; so only one of them can ever be included in the blockchain.
  2. I can’t publish them, since they need your signature and I don’t have it.
  3. At the end, you will presumably publish the last one, which is best for you.  You could publish an earlier one, and cheat yourself of money, but that’s not my problem.

Undoing A Promise: Revoking Transactions?

In the simple channel case above, we don’t have to revoke or cancel old transactions, as the only person who can spend them is the person who would be cheated.  This makes the payment channel one way: if the amount I was paying you ever went down, you could simply broadcast one of the older, more profitable transactions.

So if we wanted to revoke an old transaction, how would we do it?

There’s no native way in bitcoin to have a transaction which expires.  You can have a transaction which is valid after 5 days (using locktime), but you can’t have one which is valid until 5 days has passed.

So the only way to invalidate a transaction is to spend one of its inputs, and get that input-stealing transaction into the blockchain before the transaction you’re trying to invalidate.  That’s no good if we’re trying to update a transaction continuously (a-la payment channels) without most of them reaching the blockchain.

The Transaction Revocation Trick

But there’s a trick, as described in the paper.  We build our transaction as before (I sign, and you hold), which spends our opening transaction output, and has two outputs.  The first is a 9.99c output for me.  The second is a bit weird–it’s 1c, but needs two signatures to spend: mine and a temporary one of yours.  Indeed, I create and sign such a transaction which spends this output, and send it to you, but that transaction has a locktime of 1 day:

The first payment in a lightning-style channel.

Now, if you sign and publish that transaction, I can spend my $9.99 straight away, and you can publish that timelocked transaction tomorrow and get your 1c.

But what if we want to update the transaction?  We create a new transaction, with 9.98c output to me and 2c output to a transaction signed by both me and another temporary address of yours.  I create and sign a transaction which spends that 2c output, has a locktime of 1 day and has an output going to you, and send it to you.

We can revoke the old transaction: you simply give me the temporary private key you used for that transaction.  Weird, I know (and that’s why you had to generate a temporary address for it).  Now, if you were ever to sign and publish that old transaction, I can spend my $9.99 straight away, and create a transaction using your key and my key to spend your 1c.  Your transaction (1a below) which could spend that 1c output is timelocked, so I’ll definitely get my 1c transaction into the blockchain first (and the paper uses a timelock of 40 days, not 1).

Updating the payment in a lightning-style channel: you sent me your private key for sig2, so I could spend both outputs of Transaction 1 if you were to publish it.

So the effect is that the old transaction is revoked: if you were to ever sign and release it, I could steal all the money.  Neat trick, right?

A Minor Variation To Avoid Timeout Fallback

In the original payment channel, the opening transaction had a fallback clause: after some time, it is all spendable by me.  If you stop responding, I have to wait for this to kick in to get my money back.  Instead, the paper uses a pair of these “revocable” transaction structures.  The second is a mirror image of the first, in effect.

A full symmetric, bi-directional payment channel.

So the first output is $9.99 which needs your signature and a temporary signature of mine.  The second is  1c for meyou.  You sign the transaction, and I hold it.  You create and sign a transaction which has that $9.99 as input, a 1 day locktime, and send it to me.

Since both your and my “revocable” transactions spend the same output, only one can reach the blockchain.  They’re basically equivalent: if you send yours you must wait 1 day for your money.  If I send mine, I have to wait 1 day for my money.  But it means either of us can finalize the payment at any time, so the opening transaction doesn’t need a timeout clause.


Now we have a generalized transaction channel, which can spend the opening transaction in any way we both agree on, without trust or requiring on-blockchain updates (unless things break down).

The next post will discuss Hashed Timelock Contracts (HTLCs) which can be used to create chains of payments…

Notes For Pedants:

In the payment channel open I assume OP_CHECKLOCKTIMEVERIFY, which isn’t yet in bitcoin.  It’s simpler.

I ignore transaction fees as an unnecessary distraction.

We need malleability fixes, so you can’t mutate a transaction and break the ones which follow.  But I also need the ability to sign Transaction 1a without a complete Transaction 1 (since you can’t expose the signed version to me).  The paper proposes new SIGHASH types to allow this.

[EDIT 2015-03-30 22:11:59+10:30: We also need to sign the other symmetric transactions before signing the opening transaction.  If we released a completed opening transaction before having the other transactions, we might be stuck with no way to get our funds back (as we don’t have a “return all to me” timeout on the opening transaction)]

February 18, 2015

Eric Leblond: Slides of my talks at Lecce

I’ve been invited by SaLUG to Lecce to give some talks during their Geek Evening. I’ve done a talk on nftables and one of suricata.

Lecce by night

Lecce by night

The nftables talk was about the motivation behind the change from iptables. Here are the slides: Nftables

The talk on Suricata was explaining the different feature of Suricata and was showing how I’ve used it to make a study of SSH bruteforce. Here are the slides: Suricata, Netfilter and the PRC.

Thanks a lot to Giuseppe Longo, Luca Greco and all the SaLUG team, you have been wonderful hosts!

February 13, 2015

Rusty Russell: lguest Learns PCI

With the 1.0 virtio standard finalized by the committee (though minor non-material corrections and clarifications are still trickling in), Michael Tsirkin did the heavy lifting of writing the Linux drivers (based partly on an early prototype of mine).

But I wanted an independent implementation to test: both because OASIS insists on multiple implementations before standard ratification, but also because I wanted to make sure the code which is about to go into the merge window works well.

Thus, I began the task of making lguest understand PCI.  Fortunately, the osdev wiki has an excellent introduction on how to talk PCI on an x86 machine.  It didn’t take me too long to get a successful PCI bus scan from the guest, and start about implementing the virtio parts.

The final part (over which I procrastinated for a week) was to step through the spec and document all the requirements in the lguest comments.  I also added checks that the guest driver was behaving sufficiently, but now it’s finally done.

It also resulted in a few minor patches, and some clarification patches for the spec.  No red flags, however, so I’m reasonably confident that 3.20 will have compliant 1.0 virtio support!

November 09, 2014

Eric Leblond: Efficient search of string in a list of strings in Python


I’m currently working on a script that parses Suricata EVE log files and try to detect if some fields in the log are present in a list of bad patterns. So the script has two parts which are reading the log file and searching for the string in a list of strings. This list can be big with a target of around 20000 strings.

Note: This post may seem trivial for real Python developers but as I did not manage to find any documentation on this here is this blog post.

The tests

For my test I have used a 653Mo log file containing 896077 lines. Reading this JSON formatted file is taking 5.0s. As my list of strings was around 3000 elements so far below targeted size, a thumb rules was saying that I will be ok if script stayed below 6 seconds with the matching added. First test was a simple Python style inclusion test with the hostname being put in a list:
if event['http']['hostname'] in hostname_list:
For that test, the result was 9.5s so not awful but a bit over my expectation. Just to check I have run a test with a C-like implementation:
for elt in hostname_list:
   if elt == target:
       # we have a winner
Result was a nice surprise, … for Python, with a execution time of 20.20s.

I was beginning to fear some development to be able to reach the speed I needed and I gave a last try. As I was taking care of match, I can transform my list of strings in a Python set thus only getting unique elements. So I have run the test using:

hostname_set = set(hostname_list)
for event in event_list:
    if event['http']['hostname'] in hostname_set:

Result was an amazing execution time of 5.20s. Only 0.20s were used to check data against my set of strings.


Python set required elements to be hashable. It is needed because internal implementation is using dictionary. So looking for an element in a set is equivalent to look for an element in a hash table. And this is really faster than searching in a list where there is no real magic possible.

So if you only care about match and if your elements are hashable then use Python set to test for existence of a object in your set.

November 07, 2014

Jesper Dangaard Brouer: Announce github repo prototype-kernel

Kernel prototype development made easy

For easier and quicker kernel development cycles, I've created a git-repo called "prototype-kernel". To also help your productivity I've shared in on github (isn't it great that work at +Red Hat defaults to being Open Source):

Getting started developing your first kernel code is really easy: follow this guide.

The basic idea is to compile modules outside the kernel tree, but use the kernels kbuild infrastructure. Thus, this does require that you have a kernel source tree avail on your system. Most distributions offer a kernel-devel package, that only install what you need for this to work.

 On +Fedora Project do: 'yum install kernel-devel'
 On +Debian do: 'aptitude install linux-headers'

Directory layout:

There are two main directories: "minimalist" and "kernel".

If you just want to test a small code change go to the "minimalist" directory.  This contains the minimal Makefile needed to compile the kernel modules in this directory. Do you devel cycle of: Modify, make, insmod, rmmod.

The "kernel" directory is more interesting to me. It maintains the same directory layout as the mainline kernel. The basic idea is that code developed should be in the same locations that you plan to submit upstream. Here you need to edit the files Kbuild files when adding new modules/code. Notice these Kbuild files are located in each sub-directory.

Detecting kernel and compiling

In the "kernel" directory you can simply run: make

The system will detect / assume-to-find the kernel source by looking in /lib/modules/`uname -r`/build/. If the kernel you are compiling for are not in this location, then specify it to make on the command line like:

  $ make kbuilddir=~/git/kernel/net-next/

Compiling for a remote host

If you are compiling locally, but want to upload the kernel modules for testing on a remote host, then the make system supports this. Compile and upload modules with a single make command do:

  $ make push_remote kbuilddir=~/git/kernel/net-next/ HOST=

Localhost development

Development on the same host is sometimes easier.  To help this process there is a script to help setup symlinks to from the running kernels module directory into you git repo.  This means you just need to compile and no need to install the modules again and again (perhaps run 'depmod -a' if needed).  Command line for this setup, be in the "kernel" directory and run:

  $ sudo ./scripts/

Development cycle made simple

After this setup, standard modprobe and modinfo works.  Thus, oneliner work cycle for testing a given kernel module can be as simple as:

  $ make && (sudo rmmod time_bench_sample; sudo modprobe time_bench_sample) && sudo dmesg -c

Happy kernel-hacking :-)

October 09, 2014

Jesper Dangaard Brouer: Unlocked 10Gbps TX wirespeed smallest packet single core

Achievement unlocked, 10Gbit/s full TX wirespeed smallest packet size on a single CPU core (14.8Mpps). Recent changes to the Linux kernels lower TX layers have unlocked the full (TX side) potential of the NIC drivers and hardware.

The single core 14.8Mpps performance number is an artificial benchmark performed with pktgen, which besides spinning the same packet (skb), now also notifies the NIC hardware after populating it's TX ring buffer with a "burst" of packets. The new pktgen option is called "burst" (commit 38b2cf2982d ("net: pktgen: packet bursting via skb->xmit_more")).

Hint: for ixgbe driver you need to adjust cleanup interval as described in my earlier blogpost.

  • Via cmdline:  ethtool -C ethX rx-usecs 30

Fortunately, we have also found a way to take advantage of these new capabilities for normal/real use-cases, by adjusting GSO segmentation and modifying the qdisc (traffic control) layer, more details later.

As mentioned before, I personally only see pktgen as a tool for performance testing and tuning the NIC driver TX layer.   Thus, the pktgen test merely show, that we have finally unlocked the full potential of the NIC hardware and driver.  We started this journey with seeing 4Mpps single core.

The SKB xmit_more API

The new bulking API, for the driver layer, requires some extra explanation, because we wanted to avoid API changes, that would require updating every driver.

The new API for driver layer TX bulking is called "skb->xmit_more". To avoid changing the external driver API and to keep the new API light weight (read fast), the "skb" (struct sk_buff) have been extended with a "xmit_more" (bit) indication.

If the driver see "skb->xmit_more" set, then the stack is telling that it guarantees that another packet will be given immediately (when ->ndo_start_xmit() returns).
 This means that, unless the driver have a filled TX queue, it can simply add the packet to the TX queue and defer the expensive indication to the HW (of updating the TX ring queue tail pointer).

Challenge: bulk without added latency

It is difficult to taking advantage of bulking without introducing added latency.  Nevertheless, this have been our goal.  Don't speculative/artificially introducing a delay betting on another packets arriving shortly. Principle of only bulk packets when really needed, based on some solid indication from the stack.

Existing stack bulking: GSO

The network stack already have packet aggregation facilities like TSO (TCP Segmentation Offload), GSO (Generic Segmentation Offload) and GRO (Generic Receive Offload), which is primarily targeted for TCP flows. Let's take advantage of these packets already being aggregated.

+David Miller, restructured central parts of the xmit layer, allowing it to work with skb lists, via merge commit 53fda7f7f9e8 ("Merge branch 'xmit_list'").

This restructure also allowed GRO/GSO segmented lists to take advantage of the performance boost of being bulk transmitted in the driver layer (via xmit_more indication).

Note, that hardware supported TSO packets have in-principal always benefited from the tail pointer update, as the driver can (usually) put the entire TSO packet (upto 64k) as though it is a single packet in to the TX ring, before updating tail pointer.

After these optimization's some situations exists (for certain HW), where the software GSO is faster-than the hardware based TSO.  To enable the software only approach use cmdline:

  • ethtool -K ethX gro on gso on tso off

GRO/GSO is focused on TCP and limited to bulking/aggregation on a per flow basis.  What about UDP or many simultaneous shorter lived flows.

Qdisc layer bulk dequeue

Bulking from the qdisc layer is good general solution that will benefit all kinds of packets.  A queue in the qdisc layer, provide a very solid opportunity for bulking, packets have already been delayed and its easy to construct a skb list with packets avail in the queue.

In our solution for qdisc bulk dequeue (commit 5772e9a3463b) we have further extended the performance benefit of bulking.  We can amortize the cost of locking both at the qdisc layer and for the TXQ lock.
 Several packets are dequeued while holding the qdisc root_lock, amortizing locking cost over several packet.  The dequeued SKB list is processed under the TXQ lock in dev_hard_start_xmit(), thus also amortizing the cost of the TXQ lock.

 Details: based on earlier benchmarking for locking, the cost of the lock/unlock sequence is 16ns (in the optimal none contended case), with two locking sections this is a 32ns cost that will get amortized.

 Note: We benefit from this lock amortization saving, regardless weather the device driver take advantage of the xmit_more API. Which is good as most driver have not yet, implemented this xmit_more API.

Lower CPU usage when using the full bandwidth on a link, is the most direct benefit than most will experience, from these changes.

Sending small packets "all-the-way" from userspace RAW sockets (but still through the qdisc layer), I can also demonstrate that it's now possible to send 14.8Mpps (10G wirespeed small frames) it just required using minimum 11 CPUs (using trafgen --qdisc-path, CPU E5-2695).  This is obviously an artifical benchmark, but it demonstrates that we are scaling with the number of CPUs.

For testing details see merge commit c2bf5ec20488f ('qdisc_bulk_dequeue').

Avoiding Head-of-Line blocking

We have explicitly chosen to: only support qdisc bulking for BQL supported drivers.

As the bufferbloat fighters can testify, device drivers have a tendency buffer too many bytes in the hardware.  The solution to this problem is BQL (Byte Queue Limit) by Tom Herbert.  Sadly not many device drivers have actually implemented using BQL.

The reason for needing BQL is that the qdisc layer needs an indication it can use to determine how many packets (actually bytes) it should bulk dequeue, before sending it to the driver layer.

This is need because, we want to avoid overshooting the driver layers capacity, as it will result in a need to requeue the excessive packet back into the qdisc layer.  Currently the qdisc layer, does not support a requeue facility for a scheduler, instead it stores requeued packets in the root qdisc.  These requeued packets cause Head-of-Line blocking, as they must be transmitted before dequeueing new packets to
avoid packet reordering.

Implementing BQL support in your driver also helps moving any queue to the qdisc layer, which gives your driver the opportunity to benefit from these amazing performance improvements (and is also allow us to
fix bufferbloat by applying an appropriate qdisc like fq_codel).

Joint work

Getting to this level of performance have been the jointed work and feedback from many people.

Thanks to,
 +David Miller (Red Hat),
 +Eric Dumazet (Google),
 Hannes Frederic Sowa (Red Hat),
 Tom Herbert (Google),
 Daniel Borkmann (Red Hat),
 Florian Westphal (Red Hat),
 +Alexei Starovoitov (Plumgrid),
 +Alexander Duyck (Intel / Red Hat),
 John Fastabend (Intel),
 Jamal Hadi Salim (Mojatatu Networks),
 +Neil Horman  (Red Hat),
 +Dave Taht (,
 +Toke Høiland-Jørgensen (Karlstads University)
 and reviewers on

September 29, 2014

Eric Leblond: Slides of my nftables talk at Kernel Recipes

I’ve been lucky enough to do a talk during the third edition of Kernel Recipes. I’ve presented the evolution of nftables durig the previous year.

You can get the slides from here: 2014_kernel_recipes_nftables.

Thanks to Hupstream for uploading the video of the talk:

Not much material but this slides and a video of the work done during the previous year on nftables and its components:

September 24, 2014

Eric Leblond: Using DOM with nftables

DOM and SSH honeypot

DOM is a solution comparable to fail2ban but it uses Suricata SSH log instead of SSH server logs. The goal of DOM is to redirect the attacker based on its SSH client version. This allows to send attacker to a honeypot like pshitt directly after the first attempt. And this can be done for a whole network as Suricata does not need to be on the targeted box.

Using DOM with nftables

I’ve pushed a basic nftables support to DOM. Instead of adding element via ipset it uses a nftables set. It is simple to use it as you just need to add a -n flag to specify which table the set has been defined in:
./dom -f /usr/local/var/log/suricata/eve.json -n nat -s libssh -vvv -i -m OpenSSH
To activate the network address translation based on the set, you can use something like:
table ip nat {
        set libssh { 
                type ipv4_addr

        chain prerouting {
                 type nat hook prerouting priority -150;
                 ip saddr @libssh ip protocol tcp counter dnat

A complete basic ruleset

Here’s the ruleset running on the box implementing pshitt and DOM:
table inet filter {
        chain input {
                 type filter hook input priority 0;
                 ct state established,related accept
                 iif lo accept
                 ct state new iif != lo tcp dport {ssh, 2200} tcp flags == syn counter log prefix "SSH attempt" group 1 accept
                 iif br0 ip daddr accept
                 ip saddr tcp dport {9300, 3142} counter accept
                 ip saddr counter accept
                 counter log prefix "Input Default DROP" group 2 drop

table ip nat {
        set libssh { 
                type ipv4_addr

        chain prerouting {
                 type nat hook prerouting priority -150;
                 ip saddr @libssh ip protocol tcp counter dnat

        chain postrouting {
                 type nat hook postrouting priority -150;
                 ip saddr snat
There is a interesting rule in this ruleset. The first is:
ct state new iif != lo tcp dport {ssh, 2200} tcp flags == syn counter log prefix "SSH attempt" group 1 accept
It uses a negative construction to match on the interface iif != lo which means interface is not lo. Note that it also uses an unamed set to define the port list via tcp dport {ssh, 2200}. That way we have one single rule for normal and honeypot ssh. At least, this rule is logging and accepting and the logging is done via nfnetlink_log because of the group parameter. This allows to have ulogd to capture log message triggered by this rule.

September 12, 2014

Jesper Dangaard Brouer: Mini-tutorial for netperf-wrapper setup on RHEL6/CentOS6

The tool "netperf-wrapper" (by +Toke Høiland-Jørgensen <toke(at)>) is very useful for repeating network measurements, that involves running multiple concurrent instances of testing tools (primarily netperf, iperf and ping, but also tools like d-itg and http-getter).

The tools is best known in the bufferbloat community for it's test Realtime Response Under Load (RRUL), but the netperf-wrapper tool has other tests that I find useful.
Core software dependencies are recent versions of netperf, python, matplotlib and fping (optional are d-itg and http_runner).

Dependency issues on RHEL6

First dependencies are solved easily by installing "python-matplotlib":
 $ sudo yum install -y python-matplotlib python-pip

The software dependencies turned out to be a challenge on my RHEL6 box.

The "ping" program is too old to support option "-D" (prints timestamp before each-line).  Work-around is to install "fping", which I choose to do from "rpmforge":

Commands needed for install "fping":
 # rpm -Uvh
 # yum install -y fping

The "netperf" tool itself (on RHEL6) were not compiled with configure option "--enable-demo=yes" which is needed to get timestamp and continuous result output during a test-run.

Thus, I needed to recompile "netperf" manually:

Install netperf-wrapper

Installation is quite simple, once the dependencies have been meet:

  • git clone
  • cd netperf-wrapper
  • sudo python2 install

GUI mode

There is a nice GUI mode for investigating and comparing results, started by:
 $ netperf-wrapper --gui

This depend on matplotlib with Qt4 (and PyQt4), which unfortunately were not available for RHEL6. Fortunately there were a software package for this on Fedora, named "python-matplotlib-qt4".

For GUI mode netperf-wrapper needs: matplotlib with Qt4
 $ sudo yum install -y python-matplotlib-qt4 PyQt4

Thus, the workflow is to run the tests on my RHEL6 machines, and analyze the result files on my Fedora laptop.

Using the tool

The same tool "netperf-wrapper" is both used for running the test, and later analyzing the result.

Listing the tests available:
 $ netperf-wrapper --list-tests

For listing which plots are available for a given test e.g. "rrul":
 $ netperf-wrapper --list-plots rrul

Before running a test towards a target system, remember to start the "netserver" daemon process on the target host (just run command "netserver" nothing else).

Start a test-run towards e.g. IP with test rrul
 $ netperf-wrapper -H -t my_title rrul

It is recommend using the option "-t" to give your test a title, which makes is easier to distinguish when comparing two or more test files in e.g. the GUI tool.

The results of the test-run will be stored in a compressed json formatted text file, with the naming convention: rrul-2014-MM-DDTHHMMSS.milisec.json.gz

To view the result, without the GUI, run:
 $ netperf-wrapper -i rrul_prio-2014-09-10T125650.993908.json.gz -f plot
Or e.g. selecting a specific plot like "ping_cdf"
 $ netperf-wrapper -i rrul_prio-2014-09-10T125650.993908.json.gz -f plot -p ping_cdf

netperf-wrapper can also output numeric data suitable for plotting in org-mode or .csv (spreadsheet) format, but I didn't play with those options.

Updates: A release 0.7.0 of netperf-wrapper is pending

Extra: On bufferbloat
Interested in more about bufferbloat?

Too few people are linking to the best talk explaining bufferbloat and how it's solved by Van Jacobson (slides).  The video quality is unfortunately not very good.

I've used some of Van's point in my own talk about bufferbloat: Beyond the existences of Bufferbloat, have we found the cure? (slides)

September 11, 2014

Jesper Dangaard Brouer: Network setup for accurate nanosec measurements

As I described in my previous blogpost, I'm leveraging the PPS measurements to deduct the nanosec improvements I'm making to the code.

One problem with using this on the nanosec scale is the accuracy of your measurements, which depend on the accuracy of the hardware your are using.

Modern systems have power-saving and turbo-boosting features build into the CPUs.  And Hyper-Threading technology that allows one CPU core to appear as two CPUs, by sharing ALUs etc.

While establish an accurate baseline for some upstream measurements (subj: Get rid of ndo_xmit_flush / commit 0b725a2ca61). I was starting to see too much variation in my trafgen measurements.

I created a rather-large oneliner, that I have converted into a script here:
Which allowed me to get a picture of the accuracy of my measurements, and they are not accurate enough. (For more real stats like std-dev consider running these measurements through Rusty Russell's tool

My findings:

  1. Disable all C states and P states.
  2. Disabling Hyper-Threading and power-management in BIOS helped the accuracy
  3. 10Gbit/s ixgbe ring-buffer cleanup interval also influenced accuracy

Reading +Jeremy Eder's blog post. It seems the best method for disabling these C and P states, and
keeping all CPUs in C0/C1 state is doing:

 # tuned-adm profile latency-performance

I found the most stable ring-buffer cleanup interval for the ixgbe driver were 30 usecs. Configured by

 # ethtool -C eth5 rx-usecs 30

Besides these tunings: my blogpost on "Basic tuning for network overload testing" should still be followed.
Generally I've started using the "profile latency-performance", but unless I need to measure some specific code change, I'm still using the ixgbe's default "dynamic" ring-buffer cleanup interval.

Details about the "ethtool -C" tuning is avail in blogpost "Pktgen for network overload testing".

Jesper Dangaard Brouer: Packet Per Sec measurements for improving the Linux Kernel network stack

Why I'm using Packet Per Second (PPS) tests for measuring the improvement in performance (of the Linux Kernel network stack).  Many people (e.g. other kernel developer) does not understand why I'm using PPS measurements, this blogpost explains why.

The basic problem of using large MTU (usually 1500 bytes) size packets, is that the transmission delay itself, is enough to hide any improvement I'm making (e.g. a faster lookup function).

Transmission delay 1514 bytes (+20 bytes for Ethernet overhead) at 10Gbit/s is 1227 nanosec:

  • ((bytes+wireoverhead)*8) / 10 Gbits = time-unit
  • ((1500+14+20)*8)/((10000*10^6))*10^9 = 1227.20 ns

This means, if the network stack can generate (alloc/fill/free) a 1500 byte packet faster than every 1227ns, then it can utilize the bandwidth of the 10Gbit/s link fully.  And yes, we can already do so. Thus, with 1500 bytes frame any stack performance improvement, will only be measurable with by a lower CPU utilization.

Let face it; the kernel have been optimized heavily for the last 20 years.  Thus, the improvements we are able to come up with, is going to be on the nanosec scale.
For example I've found a faster way to clear the SKB, which saves 7 nanosec.  Being able to measure this performance improvement were essential while developing this faster clearing.

Lets assume, the stack cost (alloc/fill/syscall/free) is 1200ns (thus faster than 1227ns), then a 7ns improvement will only be 0.58%, which I can only measure as a lower CPU utilization (as bandwidth limit have been reached), which in practice cannot be measured accurately enough.

By lowering the packet size, the transmission delay the stack cost (alloc/fill/syscall/free) can "hide-behind" is reduced. With the smallest packet size of 64 bytes, this is significantly reduced, to:

  • ((64+20)*8)/((10000*10^6))*10^9 = 67.2ns

This basically exposes stacks cost, as its current cost is larger than 67.2ns.  This can be used for getting some measurements that allow us to actually measure the improvement of the code changes we are making, even-though this "only" translates into reduced CPU usage with big frames (which translates into more processer time for your application).

In packet per sec (pps) this correspond to approx 14.8Mpps:

  • 1sec/67.2ns =>  1/(67.2/10^9) = 14,880,952 pps
  • or directly from the packet size as:
  • 10Gbit / (bytes*8) = (10000*10^6)/((64+20)*8) = 14,880,952 pps

Measuring packet per sec (PPS) instead of bandwidth, have another advantage.  Instead of just comparing how many PPS improvement is seen, then instead translate the PPS into nanosec (between packets).
Comparing nanosec used before and after, will show us the nanosec saved by the given code change.

See, how I used it in this and  this commit to document the actually improvement of the changes I made.

Update: For deducting the nanosec saved by a given code change, to be valid, usually requires isolating your test to utilize a single CPU.

Lets use the 14.8Mpps as an example of howto translate PPS to nanosec:

  • 1sec / pps => (1/14880952*10^9) = 67.2ns

Extra: Just for the calculation exercise.

  • How many packets per sec does 1227.20 ns correspond to:
    • 1sec/1227.2ns =>  1/(1227.2/10^9) = 814,863 pps
  • Can also be calculated directly from the packet size as:
    • 10Gbit / (bytes*8) = (10000*10^6)/((1514+20)*8) = 814,863 pps

August 19, 2014

Rusty Russell: POLLOUT doesn’t mean write(2) won’t block: Part II

My previous discovery that poll() indicating an fd was writable didn’t mean write() wouldn’t block lead to some interesting discussion on Google+.

It became clear that there is much confusion over read and write; eg. Linus thought read() was like write() whereas I thought (prior to my last post) that write() was like read(). Both wrong…

Both Linux and v6 UNIX always returned from read() once data was available (v6 didn’t have sockets, but they had pipes). POSIX even suggests this:

The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.

But write() is different. Presumably so simple UNIX filters didn’t have to check the return and loop (they’d just die with EPIPE anyway), write() tries hard to write all the data before returning. And that leads to a simple rule.  Quoting Linus:

Sure, you can try to play games by knowing socket buffer sizes and look at pending buffers with SIOCOUTQ etc, and say “ok, I can probably do a write of size X without blocking” even on a blocking file descriptor, but it’s hacky, fragile and wrong.

I’m travelling, so I built an Ubuntu-compatible kernel with a printk() into select() and poll() to see who else was making this mistake on my laptop:

cups-browsed: (1262): fd 5 poll() for write without nonblock
cups-browsed: (1262): fd 6 poll() for write without nonblock
Xorg: (1377): fd 1 select() for write without nonblock
Xorg: (1377): fd 3 select() for write without nonblock
Xorg: (1377): fd 11 select() for write without nonblock

This first one is actually OK; fd 5 is an eventfd (which should never block). But the rest seem to be sockets, and thus probably bugs.

What’s worse, are the Linux select() man page:

       A file descriptor is considered ready if it is possible to
       perform the corresponding I/O operation (e.g., read(2)) without
       ... those in writefds will be watched to see if a write will
       not block...

And poll():

		Writing now will not block.

Man page patches have been submitted…

August 02, 2014

Rusty Russell: ccan/io: revisited

There are numerous C async I/O libraries; tevent being the one I’m most familiar with.  Yet, tevent has a very wide API, and programs using it inevitably descend into “callback hell”.  So I wrote ccan/io.

The idea is that each I/O callback returns a “struct io_plan” which says what I/O to do next, and what callback to call.  Examples are “io_read(buf, len, next, next_arg)” to read a fixed number of bytes, and “io_read_partial(buf, lenp, next, next_arg)” to perform a single read.  You could also write your own, such as pettycoin’s “io_read_packet()” which read a length then allocated and read in the rest of the packet.

This should enable a convenient debug mode: you turn each io_read() etc. into synchronous operations and now you have a nice callchain showing what happened to a file descriptor.  In practice, however, debug was painful to use and a frequent source of bugs inside ccan/io, so I never used it for debugging.

And I became less happy when I used it in anger for pettycoin, but at some point you’ve got to stop procrastinating and start producing code, so I left it alone.

Now I’ve revisited it.   820 insertions(+), 1042 deletions(-) and the code is significantly less hairy, and the API a little simpler.  In particular, writing the normal “read-then-write” loops is still very nice, while doing full duplex I/O is possible, but more complex.  Let’s see if I’m still happy once I’ve merged it into pettycoin…

July 29, 2014

Rusty Russell: Pettycoin Alpha01 Tagged

As all software, it took longer than I expected, but today I tagged the first version of pettycoin.  Now, lots more polish and features, but at least there’s something more than the git repo for others to look at!

July 17, 2014

Rusty Russell: API Bug of the Week: getsockname().

A “non-blocking” IPv6 connect() call was in fact, blocking.  Tracking that down made me realize the IPv6 address was mostly random garbage, which was caused by this function:

bool get_fd_addr(int fd, struct protocol_net_address *addr)
   union {
      struct sockaddr sa;
      struct sockaddr_in in;
      struct sockaddr_in6 in6;
   } u;
   socklen_t len = sizeof(len);
   if (getsockname(fd, &, &len) != 0)
      return false;

The bug: “sizeof(len)” should be “sizeof(u)”.  But when presented with a too-short length, getsockname() truncates, and otherwise “succeeds”; you have to check the resulting len value to see what you should have passed.

Obviously an error return would be better here, but the writable len arg is pretty useless: I don’t know of any callers who check the length return and do anything useful with it.  Provide getsocklen() for those who do care, and have getsockname() take a size_t as its third arg.

Oh, and the blocking?  That was because I was calling “fcntl(fd, F_SETFD, …)” instead of “F_SETFL”!

July 02, 2014

Jesper Dangaard Brouer: The calculations: 10Gbit/s wirespeed

In this blogpost, I'll try to make you understand the engineering challenge behind processing 10Gbit/s wirespeed, at the smallest Ethernet packet size.

The peak packet rate is 14.88 Mpps (million packets per sec) uni-directional on 10Gbit/s with the smallest frame size.

Details: What is the smalles Ethernet frame
Ethernet frame overhead:

Thus, the minimim size Ethernet frame is: 84 bytes (20 + 64)

Max 1500 bytes MTU Ethernetframe size is: 1538 bytes (calc: (12+8) + (14) + 1500 + (4) = 1538 bytes)

Packet rate calculations

Peak packet rate calculated as:  (10*10^9) bits/sec / (84 bytes * 8) = 14,880,952 pps
1500 MTU packet rate calculated as: (10*10^9) bits/sec / (1538 bytes * 8) = 812,744 pps

Time budget
This is the important part to wrap-your-head around.

With 14.88 Mpps the time budget for processing a single packet is:

  • 67.2 ns (nanosecond) (calc as: 1/14880952*10^9 ns)

This corrospond to approx: 201 CPU cycles on a 3GHz CPU (assuming only one instruction per cycle, disregarding superscalar/pipelined CPUs). Only having 201 clock-cycles processing time per packet is very little.

Relate these numbers to something
This 67.2ns number is hard to use for anything, if we cannot relate this to some other time measurements.

A single cache-miss takes: 32 ns (measured on a E5-2650 CPU). Thus, with just two cache-misses (2x32=64ns), almost the total 67.2 ns budget is gone. The Linux skb (sk_buff) is 4 cache-lines (on 64-bit), and the kernel e.g. insists on writing zeros to these cache-lines, during allocation of an skb.

We might not "suffer" a full cache-miss, sometimes the memory is available in L2 or L3 cache.  Thus, it is useful to know these time measurements.  Measured on my E5-2630 CPU (with lmbench command "lat_mem_rd 1024 128"), L2 access costs 4.3ns, and L3 access costs 7.9ns.

The "LOCK" operation
Assembler instructions can be prefixed with a "LOCK" operation, which means that they perform an atomic operation. This is uses every time e.g. a spinlock is locked or unlocked, cmpxchg and atomic_inc (some operations are even implicitly LOCK prefixed, like xchg).

I've measured the cost of this atomic "LOCK" operation to be 8.25ns on my CPU (with this program). Even for the completely optimal situation of a spinlock only being touch by one CPU, we have two LOCK calls which costs 16.5ns.

System call overhead
A FreeBSD case study of sendto(), in Luigi Rizzo netmap paper, shows that the cost of only the system call is 96ns, which is above the 67.2 ns budget.  The total overhead of sendto() were 950 ns.  These 950ns corrospond to 1,052,631 pps (calc as 1/(950/10^9)).
On Linux I measured the system call getuid(2), to take 87.77 ns and 201 CPU-cycles (TSC measurement) (the CPU efficiency were 1.42 insns per cycle, measured with perf stat). Thus, the syscall itself eats up the entire budget.

  • Update: Most of the syscall overhead comes from kernel option CONFIG_AUDITSYSCALL, without it, the syscall overhead drops to 41.85 ns.

How to overcome this syscall problem?  We can amortize the cost, by sending several packets in a single syscall.  It is not very well known, but we actually already have a syscall to send several packets with a single syscall, called "sendmmsg(2)". Notice the extra "m" (and the corresponding receive version "recvmmsg(2)"). Not many examples exists on the Internet for using these syscalls. Thus, I've provided some example code here for sendmmsg and recvmmsg.

RAW socket speeds
Daniel Borkmann and I recently optimized AF_PACKET, to scale to several CPUs (trafgen, kernel qdisc bypass and trafgen use qdisc bypass). But let us look at the performance numbers for only a single CPU:

  • Qdisc path = 1,226,776 pps => 815 ns per packet (calc: 1/pps*10^9)
  • Qdisc bypass = 1,382,075 pps => 723 ns per packet (calc: 1/pps*10^9)

This is also interesting, because this show us the cost of the qdisc code path, which costs 92 ns.  In this 10Gbit/s context it is fairly large, e.g. corresponding to almost 3 cache-line misses (92/32=2.9).

June 26, 2014

Eric Leblond: pshitt: collect passwords used in SSH bruteforce


I’ve been playing lately on analysis SSH bruteforce caracterization. I was a bit frustrated of just getting partial information:

  • ulogd can give information about scanner settings
  • suricata can give me information about software version
  • sshd server logs shows username
But having username without having the password is really frustrating.

So I decided to try to get them. Looking for a SSH server honeypot, I did find kippo but it was going too far for me by providing a fake shell access. So I’ve decided to build my own based on paramiko.

pshitt, Passwords of SSH Intruders Transferred to Text, was born. It is a lightweight fake SSH server that collect authentication data sent by intruders. It basically collects username and password and writes the extracted data to a file in JSON format. For each authentication attempt, pshitt is dumping a JSON formatted entry:

{"username": "admin", "src_ip": "", "password": "passw0rd", "src_port": 36221, "timestamp": "2014-06-26T10:48:05.799316"}
The data can then be easily imported in Logstash (see pshitt README) or Splunk.

The setup

As I want to really connect to the box running ssh with a regular client, I needed a setup to automatically redirect the offenders and only them to pshitt server. A simple solution was to used DOM. DOM parses Suricata EVE JSON log file in which Suricata gives us the software version of IP connecting to the SSH server. If DOM sees a software version containing libssh, it adds the originating IP to an ipset set. So, the idea of our honeypot setup is simple:
  • Suricata outputs SSH software version to EVE
  • DOM adds IP using libssh to the ipset set
  • Netfilter NAT redirects all IP off the set to pshitt when they try to connect to our ssh server
Getting the setup in place is really easy. We first create the set:
ipset create libssh hash:ip
then we start DOM so it adds all offenders to the set named libssh:
cd DOM
./dom -f /usr/local/var/log/suricata/eve.json -s libssh
A more accurate setup for dom can be the following. If you know that your legitimate client are only based on OpenSSH then you can run dom to put in the list all IP that do not (-i) use an OpenSSH client (-m OpenSSh):
./dom -f /usr/local/var/log/suricata/eve.json -s libssh -vvv -i -m OpenSSH
If we want to list the elements of the set, we can use:
ipset list libssh
Now, we can start pshitt:
cd pshitt
And finally we redirect the connection coming from IP of the libssh set to the port 2200:
iptables -A PREROUTING -m set --match-set libssh src -t nat -i eth0 -p tcp -m tcp --dport 22 -j REDIRECT --to-ports 2200

Some results

Here’s an extract of the most used passwords when trying to get access to the root account: real root passwords And here’s the same thing for the admin account attempt: Root passwords Both data show around 24 hours of attempts on an anonymous box.


Thanks to paramiko, it was really fast to code pshitt. I’m now collecting data and I think that they will help to improve the categorization of SSH bruteforce tools.

June 21, 2014

Rusty Russell: Alternate Blog for my Pettycoin Work

I decided to use github for pettycoin, and tested out their blogging integration (summary: it’s not very integrated, but once set up, Jekyll is nice).  I’m keeping a blow-by-blow development blog over there.

June 16, 2014

Rusty Russell: Rusty Goes on Sabbatical, June to December

At I spoke about my pre-alpha implementation of Pettycoin, but progress since then has been slow.  That’s partially due to yak shaving (like rewriting ccan/io library), partially reimplementation of parts I didn’t like, and partially due to the birth of my son, but mainly because I have a day job which involves working on Power 8 KVM issues for IBM.  So Alex convinced me to take 6 months off from the day job, and work 4 days a week on pettycoin.

I’m going to be blogging my progress, so expect several updates a week.  The first few alpha releases will be useless for doing any actual transactions, but by the first beta the major pieces should be in place…

June 11, 2014

Eric Leblond: Let’s talk about SELKS

The slides of my lightning talk at SSTIC are available: Let’s talk about SELKS. The slides are in French and are intended to be humorous.

The presentation is about defensive security that needs to get sexier. And Suricata 2.0 with EVE logging combined with Elasticsearch and Kibana can really help to reach that target. If you want to try Suricata and Elasticsearch, you can download and test SELKS.


The talk also present a small tool named Deny On Monitoring which demonstrate how easy it is to extract information from Suricata EVE JSON logging.

June 07, 2014

Rusty Russell: Donation to Jupiter Broadcasting

Chris Fisher’s Jupiter Broadcasting pod/vodcasting started 8 years ago with the Linux Action Show: still their flagship show, and how I discovered them 3 years ago.  Shows like this give access to FOSS to those outside the LWN-reading crowd; community building can be a thankless task, and as a small shop Chris has had ups and downs along the way.  After listening to them for a few years, I feel a weird bond with this bunch of people I’ve never met.

I regularly listen to Techsnap for security news, Scibyte for science with my daughter, and Unfilter to get an insight into the NSA and what the US looks like from the inside.  I bugged Chris a while back to accept bitcoin donations, and when they did I subscribed to Unfilter for a year at 2 BTC.  To congratulate them on reaching the 100th Unfilter episode, I repeated that donation.

They’ve started doing new and ambitious things, like Linux HOWTO, so I know they’ll put the funds to good use!

June 04, 2014

Jesper Dangaard Brouer: Pktgen for network overload testing

Want to get maximum performance out of the kernel level packet generator (pktgen)?
Then read this blogpost:

  • Simple tuning will increase performance from 4Mpps to 5.5Mpps (per CPU)

You might see pktgen as a fast packet generator, which it is, but I (as a kernel developer) also see it as network stack testing tool of the TX code path.

Pktgen have a parameter "clone_skb", which specifies how many time to send the same packet, before freeing and allocting a new packet for transmission.  This affects performance significantly, as it can remove a lot of memory allocation and access overhead.

I have two distinctly different use-cases for stack testing:

  1. clone_skb=1      tests the stack alloc/free overhead (related to the SKB)
  2. clone_skb=100000 tests the NIC driver layer
Lets focus on case 2, driver layer.

Tuning NIC driver layer for max performance:
The default NIC setting are not tuned for pktgen's artificial overload type of benchmarking, as this could hurt the normal use-case.

Specifically increasing the TX ring buffer in the NIC:
 # ethtool -G ethX tx 1024

A larger TX ring can improve pktgen's performance, while it can hurt in the general case, 1) because the TX ring buffer might get larger than the CPUs L1/L2 cache, 2) because it allow more queueing in the NIC HW layer (which is bad for bufferbloat).

One should be careful to conclude, that packets/descriptors in the HW TX ring cause delay.  Drivers usually delay cleaning up the ring-buffers (for various performance reasons), thus packets stalling the TX ring, might just be waiting for cleanup.

This "slow" cleanup issues is specifically the case, for the driver ixgbe (Intel 82599 chip).  This driver (ixgbe) combine TX+RX ring cleanups, and the cleanup interval is affected by the ethtool --coalesce setting of parameter "rx-usecs".

For ixgbe use e.g "30" resulting in approx 33K interrupts/sec (1/30*10^6):
 # ethtool -C ethX rx-usecs 30

Performance data:
Packet Per Sec (pps) performance tests using a single pktgen CPU thread, CPU E5-2630, 10Gbit/s driver ixgbe. (using net-next development kernel v3.15-rc1-2680-g6623b41)

Adjusting the "ethtool -C ethX rx-usecs" value affect how often we cleanup the ring.  Keeping the default TX ring size at 512, and adjusting "rx-usecs":
  • 3,935,002 pps - rx-usecs:  1 (irqs:  9346)
  • 5,132,350 pps - rx-usecs: 10 (irqs: 99157)
  • 5,375,111 pps - rx-usecs: 20 (irqs: 50154)
  • 5,454,050 pps - rx-usecs: 30 (irqs: 33872)
  • 5,496,320 pps - rx-usecs: 40 (irqs: 26197)
  • 5,502,510 pps - rx-usecs: 50 (irqs: 21527)
Performance when adjusting the TX ring buffer size. Keeping "rx-usecs==1" (default) while adjusting TX ring size (ethtool -G):
  • 3,935,002 pps - tx-size:  512
  • 5,354,401 pps - tx-size:  768
  • 5,356,847 pps - tx-size: 1024
  • 5,327,595 pps - tx-size: 1536
  • 5,356,779 pps - tx-size: 2048
  • 5,353,438 pps - tx-size: 4096
The performance of adjusting cleanup interval (rx-usecs), seems to win over simply increasing the TX ring buffer size. This also proves the theory of TX queue is filled with old packets/descriptors that needs cleaning.
(Edit: updated numbers to be clean upstream, previously included some patches)

Tools: Want easy to use script for pktgen look here
More details on pktgen advanced topics by Daniel Turull.

May 27, 2014

Rusty Russell: Effects of packet/data sizes on various networks

I was thinking about peer-to-peer networking (in the context of Pettycoin, of course) and I wondered if sending ~1420 bytes of data is really any slower than sending 1 byte on real networks.  Similarly, is it worth going to extremes to avoid crossing over into two TCP packets?

So I wrote a simple Linux TCP ping pong client and server: the client connects to the server then loops: reads until it gets a ‘1’ byte, then it responds with a single byte ack.  The server sends data ending in a 1 byte, then reads the response byte, printing out how long it took.  First 1 byte of data, then 101 bytes, all the way to 9901 bytes.  It does this 20 times, then closes the socket.

Here are the results on various networks (or download the source and result files for your own analysis):

On Our Gigabit Lan

Interestingly, we do win for tiny packets, but there’s no real penalty once we’re over a packet (until we get to three packets worth):

Over the Gigabit Lan

Over the Gigabit Lan


Over Gigabit LAN (closeup)

On Our Wireless Lan

Here we do see a significant decline as we enter the second packet, though extra bytes in the first packet aren’t completely free:

Wireless LAN (all results)

Wireless LAN (all results)

Wireless LAN (closeup)

Wireless LAN (closeup)

Via ADSL2 Over The Internet (Same Country)

Ignoring the occasional congestion from other uses of my home net connection, we see a big jump after the first packet, then another as we go from 3 to 4 packets:

ADSL over internet in same country

ADSL over internet in same country

ADSL over internet in same country (closeup)

ADSL over internet in same country (closeup)

Via ADSL2 Over The Internet (Australia <-> USA)

Here, packet size is completely lost in the noise; the carrier pidgins don’t even notice the extra weight:

Wifi + ADSL2 from Adelaide to US

Wifi + ADSL2 from Adelaide to US

Wifi + ADSL2 from Adelaide to US (closeup)

Wifi + ADSL2 from Adelaide to US (closeup)

Via 3G Cellular Network (HSPA)

I initially did this with Wifi tethering, but the results were weird enough that Joel wrote a little Java wrapper so I could run the test natively on the phone.  It didn’t change the resulting pattern much, but I don’t know if this regularity of delay is a 3G or an Android thing.  Here every packet costs, but you don’t win a prize for having a short packet:

3G network

3G network

3G network (closeup)

3G network (closeup)

Via 2G Network (EDGE)

This one actually gives you a penalty for short packets!  800 bytes to 2100 bytes is the sweet-spot:

2G (EDGE) network

2G (EDGE) network

2G (EDGE) network (closeup)

2G (EDGE) network (closeup)


So if you’re going to send one byte, what’s the penalty for sending more?  Eyeballing the minimum times from the graphs above:

Wired LAN Wireless ADSL 3G 2G
Penalty for filling packet 30%  15%  5%  0%  0%*
Penalty for second packet 30%  40%  15%  20%  0%
Penalty for fourth packet 60%  80%  25%  40%  25%

* Average for EDGE actually improves by about 35% if you fill packet

May 19, 2014

Eric Leblond: Playing with python-git


I’m currently working on Scirius, the web management interface for Suricata developed by Stamus Networks. Scirius is able to fetch IDS signatures from external place and the backend is storing this element in a git tree. As Scirius is a Django application, this means we need to interact with git in Python.

Usually the documentation of Python modules is good and enough to develop. This is sadly not the case for GitPython. There is documentation but the overall quality it not excellent, at least for a non genuine Python developer, and there is some big part missing.

Doing a commit

Doing a commit is really simple once you have understand what to do. You need to open the repository and work on his index which is the object you add file to commit to. In the following example, I want to add everything under the rules directory:

    repo = git.Repo(source_git_dir)
    index = repo.index
    message =  'source version at %s' % (self.updated_date)

Set value in the configuration of a repository

It is possible to edit the configuration of a git repository with GitPython. To do that you need to get the config and to use the set_value function. For example, the following code snippet create a repository and set and for that repository:

    repo = git.Repo.init(source_git_dir)
    config = repo.config_writer()
    config.set_value("user", "email", "")
    config.set_value("user", "name", "Scirius")

OSError 25: Inappropriate ioctl for device

I’ve encountered this fabulous exception when trying to do a commit in Scirius. The problem is only showing up when running the application in wsfcgi mode. It is documented in Issue 39 on GitHub but there is no workaround proposed.

The error comes from the fact the function used to guess the identity of the user running the application is called even if value are set in the config. And this function is failing when it is called outside of a real session. This function is in fact trying to get things from environment but these value are not set when the application is started by init. To fix this, it is possible to force the USERNAME environment variable.

Here’s how it is implemented in Scirius:

+    os.environ['USERNAME'] = 'scirius'
    message =  'source version at %s' % (self.updated_date)

You can see the diff on GitHub

May 08, 2014

Rusty Russell: BTC->BPAY gateway (for Australians)

I tested out, which lets you pay any BPAY bill (see explanation from reddit).  Since I’d never heard of the developer, I wasn’t going to send anything large through it, but it worked flawlessly.  At least the exposure is limited to the time between sending the BTC and seeing the BPAY receipt.  Exchange rate was fair, and it was a simple process.

Now I need to convince my wife we should buy some BTC for paying bills…

April 30, 2014

Jesper Dangaard Brouer: trafgen a fast packet generator

The netsniff-ng toolkit version 0.5.8 have been released.

One of the tools included in the netsniff-ng toolkit is: "trafgen" a multi-threaded low-level zero-copy network packet generator.  The recent release contains some significant performance improvements to that traffic generator.

Single CPU generator performance on a E5-2630 CPU, with Intel ixgbe/82599 chip, reach 1.4 Million Packet Per Sec (Mpps) when using the recent kernel (>= v3.14) feature of qdisc bypass for RAW sockets. And around 1.2 Mpps without bypassing the qdisc system in the kernel. (Default is to use the qdisc bypass if available, for testing purposes the qdisc path can be enabled via command line option "--qdisc-path")

In this release, I've also made "trafgen" scale to more CPUs:

The hard part of using trafgen is specifying and creating the packet description input file.  I really enjoy the flexibility when defining the packet contents, but without good examples as a starting point, it can be a daunting task.

For that reason, I've made some examples available at github here:

I've used the SYN attack example while developing the SYNPROXY module, see my other blogpost. I'm releasing this example now, because solutions for mitigating this attack is now available.

Jon Schipp also have a solution and have created a script "gencfg" for generating trafgen packet description input files, avail on github:

Notice: to get these performance numbers you need to tune your packet generator machine for network overload testing.

Jesper Dangaard Brouer: Mitigating DDoS SYN flood attacks with iptables/netfilter

Hey, I'm also blogging on the Red Hat Enterprise Linux Blog

I recently did very practical post on Mitigating TCP SYN Flood Attacks with iptables/netfilter, with the hope to provide the world with a practical solution to solve these annoying SYN-flood DDoS attacks, that we have been seeing for the last 20 years.

I've also been touring with a technical talk on the subject, and the most recent version of the slides are here.

There is also a YouTube video of my presentation at DevConf 2013.

April 29, 2014

Jesper Dangaard Brouer: Basic tuning for network overload testing

I'm doing a lot of network testing, where I'm constantly trying to push the limits of the hardware and network stack (in-order to improve performance and fix scalability issues in the Linux Kernel).

Some basic tuning of the NICs (Network Interface Cards) and IRQs are required, before we can start this "overload" testing mode.

1. First thing I do, is to kill "irqbalance", to avoid it mangling with my manual IRQ assignments.

 # killall irqbalance

2. Next I, align/bind the NICs IRQs to CPUs (one-to-one).

I have a script for aligning the IRQs, that I copied from the Intel ixgbe driver tarball:
 # set_irq_affinity $DEV

The easiest way to view, how the current IRQ assignment is to use this "grep ." trick:
  # grep . /proc/irq/*/eth4{,-*}/../smp_affinity_list

3. Then I, disable Ethernet Flow-Control

 # ethtool -A $DEV rx off tx off autoneg off

I'm disabling Ethernet Flow Control (PAUSE frames) because i want to create an overload situation. When my transmitter/generator machine is overloading the target machine, I don't want the target machine to send "backoff" PAUSE frames, especially if I'm testing the limits of the transmitters network stack.

4. Unload all netfilter and iptables module

I have a simple script for flushing iptables and unloading all the modules:

I usually also perform benchmarking and tuning of iptables/Netfilter modules, but for overload testing I'm unloading all module, as these do introduce measurable overhead.

Extra: A word of caution regarding CPU sleep or idle states:
I've experienced issues when doing low-latency measurements with Sandy-E Bridge CPUs C-states, because it too aggressively tried to go into a sleep state, even under a high network load. The latency cost of coming out of a sleep state can be significant. Jeremy Eder have described these issues in detail in his blog:

Simply use the tool "turbostat" to measure the different C-states.

And use the tool "tuned-adm" to adjust what profile you want to enable e.g.:
 # tuned-adm profile throughput-performance
 # tuned-adm profile latency-performance

April 27, 2014

Eric Leblond: Slides of my coccigrep lightning talk at HES2014

I’ve gave a lightning talk about coccigrep at Hackito Ergo Sum to show how it can be used to search in code during audit or hacking party. Here are the slides: coccigrep: a semantic grep for the C language.

The slides of my talk Suricata 2.0, Netfilter and the PRC will soon be available on Stamus Networks website.

April 17, 2014

Eric Leblond: Speeding up scapy packets sending

Sending packets with scapy

I’m currently doing some code based on scapy. This code reads data from a possibly huge file and send a packet for each line in the file using the contained information. So the code contains a simple loop and uses sendp because the frame must be sent at layer 2.

     def run(self):
         filedesc = open(self.filename, 'r')
         # loop on read line
         for line in filedesc:
             # Build and send packet
             sendp(pkt, iface = self.iface, verbose = verbose)
             # Inter packet treatment

Doing that the performance are a bit deceptive. For 18 packets, we’ve got:

    real    0m2.437s
    user    0m0.056s
    sys     0m0.012s

If we strace the code, the explanation is quite obvious:

socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0S}0@\0*\6\265\373\307;\224\24\300\250"..., 97, 0, NULL, 0) = 97
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0
socket(PF_PACKET, SOCK_RAW, 768)        = 4
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
select(5, [4], [], [], {0, 0})          = 0 (Timeout)
ioctl(4, SIOCGIFINDEX, {ifr_name="lo", ifr_index=1}) = 0
bind(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1073741824], 4) = 0
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1073741824], 4) = 0
getsockname(4, {sa_family=AF_PACKET, proto=0x03, if1, pkttype=PACKET_HOST, addr(6)={772, 000000000000}, [18]) = 0
ioctl(4, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
sendto(4, "\377\377\377\377\377\377\0\0\0\0\0\0\10\0E\0\0004}1@\0*\6\266\31\307;\224\24\300\250"..., 66, 0, NULL, 0) = 66
select(0, NULL, NULL, NULL, {0, 0})     = 0 (Timeout)
close(4)                                = 0

For each packet, a new socket is opened and this takes age.

Speeding up the sending

To speed up the sending, one solution is to build a list of packets and to send that list via a sendp() call.

     def run(self):
         filedesc = open(self.filename, 'r')
         pkt_list = []
         # loop on read line
         for line in filedesc:
             # Build and send packet
         sendp(pkt_list, iface = self.iface, verbose = verbose)

This is not possible in our case due to the inter packet treatment we have to do. So the best way is to reuse the socket. This can be done easily when you’ve read the documentation^W code:

@@ -27,6 +27,7 @@ class replay:
     def run(self):
         # open filename
         filedesc = open(self.filename, 'r')
+        s = conf.L2socket(iface=self.iface)
         # loop on read line
         for line in filedesc:
             # Build and send packet
-            sendp(pkt, iface = self.iface, verbose = verbose)
+            s.send(pkt)

The idea is to create a socket via the function used in sendp() and to use the send() function of the object to send packets.

With that modification, the performance are far better:

    real    0m0.108s
    user    0m0.064s
    sys     0m0.004s

I’m not a scapy expert so ping me if there is a better way to do this.

April 16, 2014

Jesper Dangaard Brouer: Full scalability for Netfilter conntracks

My scalability fixes for Netfilter connection tracking have reached Linus'es tree and will appear in kernel release v3.15.

Netfilter’s conntrack have had a bad reputation for being slow. While this was true in the "early-days", it have been offering excellent scalability for established conntracks for a long time now.  Matching against existing conntrack entries is very fast and completely scalable. (The conntrack system actually does lockless RCU (Read-Copy Update) lookups for existing connections).

The conntrack system have had a scalability problem when it comes to creating (or deleting) connections, for a long time now (single central spinlock).  This scalability issue is now fixed.

This work relates to my recent efforts of using conntrack for DDoS protection, as e.g. SYN-floods would hit this "new" connection scalability problem with Netfilter conntracks.

Finally version 3 of the patchset were accepted March 7th 2014 (note Eric Dumazet worked on the first attempts back in May 9th 2013). The most important commit is 93bb0ceb75 "netfilter: conntrack: remove central spinlock nf_conntrack_lock")

Jesper Dangaard Brouer: Announcing: The IPTV-Analyzer

I'm happy to announce the first official release of the IPTV-Analyzer project, as an Open Source project.

The IPTV-Analyzer is a continuous/real-time tool for analyzing the contents of MPEG2 Transport Stream (TS) packets, which is commonly used for IPTV multicast signals. The main purpose is continuous quality measurement, with a focus on detecting MPEG2 TS/CC packet drops.

The core component is an iptables (Linux) kernel module, named "mpeg2ts". This kernel module performs the real-time Deep Packet Inspection of the MPEG2-TS packets. Its highly performance optimized, written for parallel processing across CPU cores (via RCU locking) and hash tables are used for handling large number of streams. Statistics are exported via the proc filesystem (scalability is achieved via use of the seq_file proc API). It scales to hundreds of IPTV channels, even on small ATOM based CPUs.

Please send bugreports, patches, improvement, comments or insults to:

March 24, 2014

Rusty Russell: Legal Questions About and Australia

As my previous post documented, I’ve experimented with  Following the arrest of two Miami men for trading on localbitcoins, I decided to seek legal advice on the sitation in Australia.

Online research led me to Nick Karagiannis of Kelly and Co, who was already familiar with Bitcoin: I guess it’s a rare opportunity for excitement in financial regulatory circles!  This set me back several thousand dollars (in fiat, unfortunately), but the result was reassuring.

They’ve released an excellent summary of the situation, derived from their research.  I hope that helps other bitcoin users in Australia, and I’ll post more in future should the legal situation change.

March 19, 2014

Rusty Russell: Bitcoin Trading In Australia

I bought 10 BTC to play with back in 2011, and have been slowly spending them to support bitcoin adoption.  One thing which I couldn’t get reliable information on was how to buy and sell bitcoin within Australia, so over the last few months I decided to sell a few via different methods and report the results here (this also helps my budget, since I’m headed off on paternity leave imminently!).

All options listed here use two-factor authentication, otherwise I wouldn’t trust them with more than cents.  And obviously you shouldn’t leave your bitcoins in an exchange for any longer than necessary, since most exchanges over time have gone bankrupt.

Option 1: MtGox AUD

Yes, I transferred some BTC into MtGox and sold them.  This gave the best price, but after over two months of waiting the bank transfer to get my money hadn’t been completed.  So I gave up, bought back into bitcoins (fewer, since the price had jumped) and thus discovered that MtGox was issuing invalid BTC transactions so I couldn’t even get those out.  Then they halted transactions altogether blaming TX malleability.  Then they went bankrupt.  Then they leaked my personal data just for good measure.  The only way their failure could be more complete is if my MtGox Yubikey catches on fire and burns my home to the ground.

Volume: Great (5M AUD/month)
Price Premium: $25 – $50 / BTC
Charge: 0.65%
Hassle: Infinite
Summary: 0/10

Option 2:

According to, localbitcoins is the largest volume method for AUD exchange.  It’s not an exchange, so much as a matching and escrow service, though there are a number of professional traders active on the site.  The bulk of AUD trades are online, though I sold face to face (and I’ll be blogging about the range of people I met doing that). is a great place for online BTC buyers, since they have been around for quite a while and have an excellent reputation with no previous security issues, and they hold bitcoins in escrow as soon as you hit “buy”.  It’s a bit more work than an exchange, since you have to choose the counter-party yourself.

For online sellers, transfers from stolen bank accounts is a real issue.  Electronic Funds Transfer (aka “Pay Anyone”) is reversible, so when the real bank account owner realizes their money is missing, the bank tends to freeze the receiving (ie. BTC seller’s) bank account to make sure they can’t remove the disputed funds.  This process can take weeks or months, and banks’ anti-fraud departments generally treat bitcoin sellers who get defrauded with hostility (ANZ is reported to be the exception here).  A less common scam is fraudsters impersonating the Australian Tax Office and telling the victim to EFT to the localbitcoins seller.

Mitigations for sellers include any combination of:

  1. Only accepting old-fashioned cash deposits via a branch (though I’m aware of one US report where a fraudster convinced the teller to reverse the deposit, I haven’t heard of that in Australia)
  2. Insisting on “” in the transfer message (to avoid the ATO fraud problem)
  3. Only dealing with buyers with significant reputation (100+ trades with over 150 BTC is the Gold Standard)
  4. Insisting on real ID checking (eg. Skype chat of buyer with drivers’ license)
  5. Only dealing with buyers whose accounts are older than two weeks (most fraudsters are in and out before then, though their reputation can be very good until they get caught)
  6. Only allowing internal transfers between the same bank (eg. Commonwealth), relying on the bank’s use of two factor authentication to reduce fraud.

Many buyers on are newcomers, so anticipate honest mistakes for the most part.  The golden rule always applies: if someone is offering an unrealistic price, it’s because they’re trying to cheat you.

Volume: Good (1M AUD/month)
Price Premium: $5 – $20 / BTC
Charge: 1% (selling), 0% (buying)
Hassle: Medium
Summary: 7/10

Option 3:

You’ll need to get your bank account checked to use this fairly low-volume exchange, but it’s reasonably painless.  Their issues are their lack of exposure (I found out about them through and lack of volume (about a quarter of the volume), but they also trade litecoin if you’re into that.  You can leave standing orders, or just manually place one which is going to be matched instantly.

They seem like a small operation, based in Sydney, but my interactions with them have been friendly and fast.

Volume: Low (300k AUD/month)
Price Premium: $0 / BTC
Charge: 1%
Hassle: Low
Summary: 7/10

Option 4:

I heard about this site from a well-circulated blog post on Commonwealth Bank closing their bank account last year.  I didn’t originally consider them since they don’t promote themselves as an exchange, but you can use their filler to sell them bitcoins at a spot rate.  It’s limited to $4000 per day according to their FAQ.

They have an online ID check, using the usual sources which didn’t quite work for me due to out-of-date electoral information, but they cleared that manually within a day.  They deposit 1c into your bank account to verify it, but that hasn’t worked for me, so I’ve no way to withdraw my money and they haven’t responded to my query 5 days ago leaving me feeling nervous.  A search of reddit points to common delays, and founder’s links to the hacked-and-failed Bitcoinica give me a distinct “magical gathering” feel. [Edit: they apparently tried and failed four times to transfer the 1c verification to my ING account; with 1-2 business day support response, this took quite a while.  They never explained why this was failing.  Using my wife’s CBA account worked however, and I got my funds the next day.  Upgraded their score from 4/10 to 5/10.]

Volume: Unknown (self-reports indicate ~250k/month?)
Price Premium: $0 / BTC
Charge: 1.1% (selling) 2% (buying)
Hassle: Medium
Summary: 5/10

If you trade, I’d love to hear corrections, comments etc. or email me on

March 07, 2014

Eric Leblond: Suricata and Ulogd meet Logstash and Splunk

Some progress on the JSON side

Suricata 2.0-rc2 is out and it brings some progress on the JSON side. The logging of SSH protocol has been added: Screenshot from 2014-03-07 18:50:21 and the format of timestamp has been updated to be ISO 8601 compliant and it is now named timestamp instead of time.

Ulogd, the Netfilter logging daemon has seen similar change as it is now also using a ISO 8601 compliant timestamp for the . This feature is available in git and will be part of ulogd 2.0.4.

Thanks to this format change, the integration with logstash or splunk is easier and more accurate. This permit to fix one problem regarding the timestamp of an event inside of the event and logging manager. At least in logstash, the used date was the one of the parsing which was not really accurate. It could even be a problem when logstash was parsing a file with old entries because the difference in timestamp could be huge.

It is now possible to update logstash configuration to have a correct parsing of the timestamp. After doing this the internal @timestamp and the timestamp of the event are synchronized as show on the following screenshot:


Logstash configuration

Screenshot from 2014-02-02 13:22:34

To configure logstash, you simply needs to tell him that the timestamp field in JSON message is a date. To do so, you need to add a filter:

      date {
        match => [ "timestamp", "ISO8601" ]
A complete logstash.conf would then looks like:
input {
   file {
      path => [ "/usr/local/var/log/suricata/eve.json", "/var/log/ulogd.json" ]
      codec =>   json
      type => "json-log"

filter {
   if [type] == "json-log" {
      date {
        match => [ "timestamp", "ISO8601" ]

output {
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }

Splunk configuration

Screenshot from 2014-03-07 23:30:40

In splunk, auto detection of the file format is failing and it seems you need to define a type to parse JSON in $SPLUNK_DIR/etc/system/local/props.conf:

KV_MODE = json

Then you can simply declare the log file in $SPLUNK_DIR/etc/system/local/inputs.conf:

sourcetype = suricata

sourcetype = suricata

you can now build search events and build dashboard based on Suricata or Netfilter packet logging: Screenshot from 2014-03-05 23:17:12

February 24, 2014

Eric Leblond: Nftables and the Netfilter logging framework

Nftables logging

If nftables is bringing a lot of changes on user side, this is also true in the logging area. There is now only one single keyword for logging: log and this target is using the Netfilter logging framework. A corollary of that is that why you may not see any log messages even if a rule with log is matching because the Netfilter logging framework has to be configured.

Netfilter logging framework

The Netfilter logging framework is a generic way of logging used in Netfilter components. This framework is implemented in two different kernel modules:

  • xt_LOG: printk based logging, outputting everything to syslog (same module as the one used for iptables LOG target). It can only log packets for IPv4 and IPv6
  • nfnetlink_log: netlink based logging requiring to setup ulogd2 to get the events (same module as the one used for iptables NFLOG target). It can log packet for any family.

To use one of the two modules, you need to load them with modprobe. It is possible to have both modules loaded and in this case, you can then setup logging on a per-protocol basis. The active configuration is available for reading in /proc:

# cat /proc/net/netfilter/nf_log 
 0 NONE (nfnetlink_log)
 1 NONE (nfnetlink_log)
 2 nfnetlink_log (nfnetlink_log,ipt_LOG)
 3 NONE (nfnetlink_log)
 4 NONE (nfnetlink_log)
 5 NONE (nfnetlink_log)
 6 NONE (nfnetlink_log)
 7 nfnetlink_log (nfnetlink_log)
 8 NONE (nfnetlink_log)
 9 NONE (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
11 NONE (nfnetlink_log)
12 NONE (nfnetlink_log)
The syntax is the following FAMILY ACTIVE_MODULE (AVAILABLE_MODULES). Here nfnetlink_log was loaded first and xt_LOG was loaded afterward (xt_LOG is aliased to ipt_LOG and ip6t_LOG).

Protocol family numbers can look a bit strange. It is in fact mapped on the socket family name that is used in underlying code. The list is the following:

#define AF_UNSPEC	0
#define AF_UNIX		1	/* Unix domain sockets 		*/
#define AF_INET		2	/* Internet IP Protocol 	*/
#define AF_AX25		3	/* Amateur Radio AX.25 		*/
#define AF_IPX		4	/* Novell IPX 			*/
#define AF_APPLETALK	5	/* Appletalk DDP 		*/
#define	AF_NETROM	6	/* Amateur radio NetROM 	*/
#define AF_BRIDGE	7	/* Multiprotocol bridge 	*/
#define AF_AAL5		8	/* Reserved for Werner's ATM 	*/
#define AF_X25		9	/* Reserved for X.25 project 	*/
#define AF_INET6	10	/* IP version 6			*/
#define AF_MAX		12	/* For now.. */

To update the configuration, you need to write in the file corresponding to the family in /proc/sys/net/netfilter/nf_log/ directory. For example, if you want to use ipt_LOG for IPv4 (2 in the list), you can do:

echo "ipt_LOG" >/proc/sys/net/netfilter/nf_log/2 
This will active ipt_LOG for IPv4 logging:
# cat /proc/net/netfilter/nf_log 
 0 NONE (nfnetlink_log)
 1 NONE (nfnetlink_log)
 2 ipt_LOG (nfnetlink_log,ipt_LOG)
 3 NONE (nfnetlink_log)
 4 NONE (nfnetlink_log)
 5 NONE (nfnetlink_log)
 6 NONE (nfnetlink_log)
 7 nfnetlink_log (nfnetlink_log)
 8 NONE (nfnetlink_log)
 9 NONE (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
11 NONE (nfnetlink_log)
12 NONE (nfnetlink_log)

Netfilter framework is used internally by Netfilter for some logging. For example, the connection tracking is using it to send messages when invalid packets are seen. These messages are useful because they contain the reason of the reject. For example, one of the message is “nf_ct_tcp: ACK is under the lower bound (possible overly delayed ACK)”. This logging messages are only sent if the logging of invalid packet is asked. This is done by doing:

echo "255"> /proc/sys/net/netfilter/nf_conntrack_log_invalid
More information on the magical 255 value are available in kernel documentation of nf_conntrack sysctl. If nfnetlink_log module is used for the protocol, then the used group is 0. So if you want to activate these messages, it could be a good idea to use non 0 nfnetlink group in the log rules. This way you will be able to differentiate the log sources in a software like ulogd.

Logging with Nftables

As mentioned before, logging is made via a log keyword. A typical log and accept rule will look like:

nft add rule filter input tcp dport 22 ct state new log prefix \"SSH for ever\" group 2 accept
This rule is accepting packet to port 22 in the state NEW and it is logging them with prefix SSH for ever on group 2. Here the group is only used when the active logging kernel module is nfnetlink_log. The option has no effect if xt_LOG is used. In fact, when used with xt_LOG, the only available option is prefix (at least for nftables 0.099).

The available options when using nfnetlink_log module are the following (at least for nftables 0.099):

  • prefix: A prefix string to include in the log message, up to 64 characters long, useful for distinguishing messages in the logs.
  • group: The netlink group (0 – 2^16-1) to which packets are (only applicable for nfnetlink_log). The default value is 0.
  • snaplen: The number of bytes to be copied to userspace (only applicable for nfnetlink_log). nfnetlink_log instances may specify their own range, this option overrides it.
  • queue-threshold: Number of packets to queue inside the kernel before sending them to userspace (only applicable for nfnetlink_log). Higher values result in less overhead per packet, but increase delay until the packets reach userspace. The default value is 1.
Note: the description are extracted from iptables man pages.

If you want to do some easy testing with nftables, simply load xt_LOG module before nfnetlink_log. It will bind to IPv4 and IPv6 protocol and provide you logging. For more fancy stuff involving nfnetlink_log, you can have a look at Using ulogd and JSON output.

Happy logging to all!

February 23, 2014

Eric Leblond: Logging connection tracking event with ulogd


I’ve recently met @aurelsec and we’ve discussed about the interest of logging connection tracking entries. This is indeed a undervalued information source in a network.

Quoting Wikipedia: “Connection tracking allows the kernel to keep track of all logical network connections or sessions, and thereby relate all of the packets which may make up that connection. NAT relies on this information to translate all related packets in the same way, and iptables can use this information to act as a stateful firewall.”

Connection tracking being linked with Network Address Translation has a direct impact: it stores both side of each connection. If we use conntrack tool from conntrack-tools to list connections:

# conntrack  -L
tcp      6 431999 ESTABLISHED src= dst= sport=53400 dport=443 src= dst= sport=443 dport=53500 [ASSURED] mark=0 use=1
We have the two sides of a connection:
  • Orig: here to This is the packet information as seen by the firewall when it reaches him. There is no translation at all.
  • Reply: here to This is how will look like a answer coming from the server. The destination has been changed to the public IP of the firewall (here And there is also a change of the destination port to the one used by the firewall when doing the initial mapping. In fact, as multiple client could use the same port at the same time, the firewall may have to rewrite the initial source port.

So the connection tracking stores all NAT transformations. This information is important because this is the only way to know which IP in a private network is responsible of something in the outside world. For example, let’s suppose that has been attacked by our internal client (here If the admin of this server sees the attack, it will only see the IP address and port source 53500. If an authority asks you for the IP address responsible in your internal network you have no instrument but the conntrack to know that this was in fact

That’s why logging connection tracking event is one of the only effective way to store the information necessary to get back to the internal IP address in case of external query. Let’s now do this with ulogd 2.

Ulogd setup

Ulogd installation

Ulogd 2 is able to get information from the connection tracking and to log them in files or database. If your distribution is not providing ulogd and if you don’t know how to install it, you can check this post Using ulogd and JSON output. To be sure that you will be able to log connection tracking event, you need to have NFCT plugin to yes at the end of configure output.

Ulogd configuration:
  Input plugins:
    NFLOG plugin:			yes
    NFCT plugin:			yes

Kernel setup

All functionalities are standard since kernel 2.6.14. You only need to load the following module:

modprobe nf_conntrack_netlink
It is the one in charge of kernel and userspace information exchange regarding connection tracking. It provides features to dump the conntrack table or modify entries in the conntrack. For example the conntrack tool mentioned before is using that communication method to get the listing of connection tracking entries. But the feature that interest us in ulogd is the event mode. For each event in the life of a connection, a message is sent to the userspace. Ulogd is able to listen to these messages and this gives it the ability to store all information on the life of the connection in connection tracking.

Depending on the protocol you have on your network, you may need to run on of the following:

modprobe nf_conntrack_ipv4
modprobe nf_conntrack_ipv6

Ulogd setup

Our first objective will simply be to log all NAT decisions to a syslog-like file on disk. In term of connection tracking, this means we will log all connection in the NEW state. This way we will get information about any packet going through the firewall with the associated NAT transformation.

If you install from sources, copy ulogd.conf at the root of ulogd sources to your config directory (usually /usr/local/etc/. And start your favorite editor on it.

Ulogd is doing logging based on stack definition. A stack is one chain of plugins starting from a input plugin, finishing with an output one, and with filter in the middle. In our case, we want to get packet from Netfilter conntrack and the corresponding plugin is NFCT. The first example of stack containing NFCT in the ulogd.conf file is the one we are interested in, so we uncomment it:

We are not sure that the setup of input and output plugin will be correct. For now, let’s just check the output:
As you may have seen, emu1 is also used by packet logging. So it may be a good idea that we have our own output file for connection tracking event. To do that, we update the stack:
and create a new config below emu1:
We have changed file name and keep the sync option which permit to avoid the a delay in write due to buffering effect during write which can be very annoying when debugging a setup.

Now, we can test:

ulogd -v
In /var/log/ulogd_nfct.log, we see things like
Feb 22 10:50:36 ice-age2 [DESTROY] ORIG: SRC= DST= PROTO=TCP SPT=6000 DPT=22 PKTS=0 BYTES=0 , REPLY: SRC= DST= PROTO=TCP SPT=22 DPT=6000 PKTS=0 BYTES=0
So we only have destruction messages. This is not exactly what we wanted to have. We are interested in NEW message that will allow us to have a correct timing of the event. Reading ulogd.conf file, it seems there is no information about chossing the event types. But let’s ask to the NFCT input plugin its capabilities. To do that we use option -i of ulogd:
# ulogd -v -i /usr/local/lib/ulogd/ 
Name: NFCT
Config options:
        Var: pollinterval (Integer, Default: 0)
        Var: hash_enable (Integer, Default: 1)
        Var: hash_buckets (Integer, Default: 8192)
        Var: hash_max_entries (Integer, Default: 32768)
        Var: event_mask (Integer, Default: 5)
        Var: netlink_socket_buffer_size (Integer, Default: 0)
        Var: netlink_socket_buffer_maxsize (Integer, Default: 0)
        Var: netlink_resync_timeout (Integer, Default: 60)
        Var: reliable (Integer, Default: 0)
        Var: accept_src_filter (String, Default: )
        Var: accept_dst_filter (String, Default: )
        Var: accept_proto_filter (String, Default: )
The listing start with the configuration keys. One of them is event_mask. This is a the one controlling which events are sent from kernel to userspace. The value is a mask combining some of the following values:
So default value of 5 is to listen to NEW and DESTROY events. Clever reader will then ask: why did we only see DESTROY messages in that case. This is because ulogd NFCT plugin is running by default in hash_enable mode. In this mode, one single message is output for each connection (at end) and a hash is maintained in the kernel to store the info (here initial timestamp of the connection). Our setup don’t need this feature because we only want to get the NAT transformation so we switch the hash feature off and limit the events to NEW:

We can now restart ulogd and check the log file:

Feb 22 11:59:34 ice-age2 [NEW] ORIG: SRC=2a01:e35:1394:5bd0:da50:b6ff:fe3c:4250 DST=2001:41d0:1:9598::1 PROTO=TCP SPT=51162 DPT=22 PKTS=0 BYTES=0 , REPLY: SRC=2001:41d0:1:9598::1 DST=2a01:e35:1394:5bd0:da50:b6ff:fe3c:4250 PROTO=TCP SPT=22 DPT=51162 PKTS=0 BYTES=0
Feb 22 11:59:43 ice-age2 [NEW] ORIG: SRC= DST= PROTO=TCP SPT=60846 DPT=443 PKTS=0 BYTES=0 , REPLY: SRC= DST= PROTO=TCP SPT=443 DPT=60946 PKTS=0 BYTES=0
This is exactly what we wanted, we have a trace of all NAT transformation.

Maintain an history of connection tracking


We want to log all information describing a connection so we have a trace of what is going on the firewall. This means we need at least:

  • IP information for orig and reply way
  • Timestamp of start and end of connection
  • Bandwidth used by the connection

Kernel setup

By default, recent kernel have a limited handling of connection tracking. Some useful fields are not stored for performance reason. This is the case of the accounting (number of packets and bytes) and the case of the timestamp of the connection creation. The advantage of getting accounting information is trivial as you get information on bandwidth usage. Regarding timestamp, the interest is on implementation side. It allows ulogd to get all information needed for describing a connection in one single message (the DESTROY one). And ulogd does not need anymore to maintain a hash table to get the info and propagate it at exit.

To activate both features, you have to do:

 echo "1"> /proc/sys/net/netfilter/nf_conntrack_acct
 echo "1"> /proc/sys/net/netfilter/nf_conntrack_timestamp

Ulogd setup

For following setup, you will need ulogd build from git or a ulogd at a version superior or equal to 2.0.4.

Let’s first use JSON output to get the information in a readable format. We need to define a stack:


On ct2 side, we don’t want to use the hash and we only want to get DESTROY message, so our configuration looks like:


Regarding, jsonnfct1 we could have reused the default JSON configuration but for ease of testing we will dedicate a file to the NFCT logging:


After a ulogd restart, we’ve got this type of entries:

{"reply.ip.daddr.str": "2a01:e35:1394:5ad0:da50:e6ff:fe3c:1250", "oob.protocol": 0, "dvc": "Netfilter", "timestamp": "Sat Feb 22 12:27:04 2014", "orig.ip.protocol": 6, "reply.raw.pktcount": 20, "flow.end.sec": 1393068424, "": 51384, "orig.l4.dport": 22, "orig.raw.pktlen": 5600, "": 1384991512, "orig.raw.pktcount": 23, "reply.raw.pktlen": 4328, "reply.ip.protocol": 6, "": 22, "reply.l4.dport": 51384, "ct.mark": 0, "ct.event": 4, "flow.start.sec": 1393068302, "flow.start.usec": 637516, "flow.end.usec": 403240, "reply.ip.saddr.str": "2001:41d0:1:9598::1", "": 10, "src_ip": "2a01:e35:1394:5ad0:da50:e6ff:fe3c:1250", "dest_ip": "2001:41d0:1:9598::1"}
The fields we wanted are here:
  • flow.start.* keys store the timestamp of flow start
  • flow.end.* keys store the end of the connection
  • *.raw.pkt* keys store the accounting information

You can then add this file to the file parsed by logstash. For that if you can use information from Using ulogd and JSON output and modify the input section:

input {
   file { 
      path => [ "/var/log/ulogd.json", "/var/log/ulogd_nfct.json"]
      codec =>   json 
One interesting information in a connection tracking entry is the duration. But the field is not available in ulogd JSON output and it is not possible to do mathematical operations in Kibana. A solution to get the information is to add a filter in logstash.conf to compute the duration:
filter {
  if [type] == "json-log" {
    ruby {
      code => "if event['']; event['flow.duration.sec']=(event['flow.end.sec'].to_i - event['flow.start.sec'].to_i); end"

Screenshot from 2014-02-23 18:00:23 A thing to notice to understand the obtained duration is that a connection is dying following contextual timeout. For example, in the case of a TCP connection, even after a FIN packet there’s a timeout applied. So a short connection will at least be of the duration of the timeout.

An other logging method is PostgreSQL. The stack to use is almost the same as JSON one but use, as you may have guess, the PGSQL plugin:

The configuration of the PostgreSQL plugin is easy based on the setup available in the configuration:
I’m not the one who will explain how to connect to a PostgreSQL database and create a ulogd2 database. See Pollux post for that: ulogd2: the new userspace logging daemon for netfilter/iptables (part 2)

Other setup are possible. For example, you can maintain a copy of the connection tracking table in the database and also keep the history. To do that you need to use the INSERT_OR_REPLACE_CT procedure and a connection tracking INPUT plugin not using the hash table but getting NEW and DESTROY events:



Connection will be inserted in the table when getting the NEW event and the connection entry in the database will be updated when the DESTROY message will be received.

February 05, 2014

Eric Leblond: Suricata and Nftables

Iptables and suricata as IPS

Building a Suricata ruleset with iptables has always been a complicated task when trying to combined the rules that are necessary for the IPS with the firewall rules. Suricata has always used Netfilter advanced features allowing some more or less tricky methods to be used.

For the one not familiar with IPS using Netfilter, here’s a few starting points:

  1. IPS receives the packet coming from kernel via rules using the NFQUEUE target
  2. The IPS must received all packets of a given flow to be able to handle detection cleanly
  3. The NFQUEUE target is a terminal target: when the IPS verdicts a packet, it is or accepted (and leave current chain)

So the ruleset needs to send all packets to the IPS. A basic ruleset for an IPS could thus looks like:

iptables -A FORWARD -j NFQUEUE
With such a ruleset, all packets going through the box are sent to the IPS.

If now you want to combine this with your ruleset, then usually your first try is to add rules to the filter chain:

iptables -A FORWARD -j NFQUEUE
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED -j ACCEPT
# your firewall rules here
But this will not work because of point 2: All packets sent via NFQUEUE to the IPS are or blocked or if accepted, leave the FORWARD chain directly and are going for evaluation to the next chain (mangle POSTROUTING in our case). With such a ruleset, the result is that there is no firewall but an IPS in place.

As mentioned before there is some existing solutions (see Building a Suricata ruleset for extensive information). The simplest one is to dedicated one another chain such as mangle to IPS:

iptables -A FORWARD -t mangle -j NFQUEUE
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED -j ACCEPT
# your firewall rules here
No conflict here but you have to be sure nothing in your system will use the the mangle table or you will have the same problem as the one seen previously in the filter chain. So there was no universal and simple solution to implement an IPS and a firewall ruleset with iptables.

IPS the easy way with Nftables

In Nftables, chains are defined by the user using nft command line. The user can specify:

  • The hook: the place in packet life where the chain will be set. See this diagram for more info.
    • prerouting: chain will be placed before packet are routed
    • input: chain will receive packets going to the box
    • forward: chain will receive packets routed by the box
    • postrouting: chain will receive packets after routing and before sending packets
    • output: chain will receive packet sent by the host
  • The chain type: define the objective of the chain
    • filter: chain will filter packet
    • nat: chain will only contains NAT rules
    • route: chain is containing rule that may change the route (previously now as mangle)
  • The priority: define the evaluation order of the different chains of a given hook. It is an integer that can be freely specified. But it also permits to place chain before or after some internal operation such as connection tracking.

In our case, we want to act on forwarded packets. And we want to have a chain for filtering followed by a chain for IPS. So the setup is simple of chain is simple

nft -i
nft> add table filter
nft> add chain filter firewall { type filter hook forward priority 0;}
nft> add chain filter IPS { type filter hook forward priority 10;}
With this setup, a packet will reach the firewall chain first where it will be filtered. If the packet is blocked, it will be destroy inside of the kernel. It the packet is accepted it will then jump to the next chain following order of increasing priority. In our case, the packet reaches the IPS chain.

Now, that we’ve got our chains we can add filtering rules, for example:
nft add rule filter firewall ct state established accept
nft add rule filter firewall tcp dport ssh counter accept
nft add rule filter firewall tcp dport 443 accept
nft add rule filter firewall counter log drop
And for our Suricata IPS, that’s just trivial:
nft add rule filter IPS queue

A bit more details

The queue target in nftables

The complete support for the queue target will be available in Linux 3.14. The syntax looks as follow:

nft add rule filter output queue num 3 total 2 options fanout
This rule sends matching packets to 2 load-balanced queues (total 2) starting at 3 (num 3). fanout is one of the two queue options:
  • fanout: When used together with total load balancing, this will use the CPU ID as an index to map packets to the queues. The idea is that you can improve perfor mance if there’s a queue per CPU. This requires total with a value superior to 1 to be specified.
  • bypass: By default, if no userspace program is listening on an Netfilter queue,then all packets that are to be queued are dropped. When this option is used, the queue rule behaves like ACCEPT instead, and the packet will move on to the next table.

For a complete description of queueing mechanism in Netfilter see Using NFQUEUE and libnetfilter_queue.

If you want to test this before Linux 3.14 release, you can get nft sources from nftables git and use next-3.14 branch.

Chain priority

For reference, here are the priority values of some important internal operations and of iptables static chains:

  • NF_IP_PRI_CONNTRACK_DEFRAG (-400): priority of defragmentation
  • NF_IP_PRI_RAW (-300): traditional priority of the raw table placed before connection tracking operation
  • NF_IP_PRI_SELINUX_FIRST (-225): SELinux operations
  • NF_IP_PRI_CONNTRACK (-200): Connection tracking operations
  • NF_IP_PRI_MANGLE (-150): mangle operation
  • NF_IP_PRI_NAT_DST (-100): destination NAT
  • NF_IP_PRI_FILTER (0): filtering operation, the filter table
  • NF_IP_PRI_SECURITY (50): Place of security table where secmark can be set for example
  • NF_IP_PRI_NAT_SRC (100): source NAT
  • NF_IP_PRI_SELINUX_LAST (225): SELInux at packet exit
  • NF_IP_PRI_CONNTRACK_HELPER (300): connection tracking at exit
For example, one can create in nftables an equivalent of the raw PREROUTING chain of iptables by doing:
# nft -i
nft> add chain filter pre_raw { type filter hook prerouting priority -300;}

Rusty Russell: Pettycoin and working with limited visibility.

At I gave a last-minute talk previewing my work on pettycoin (video, slides), an experiment to shard a bitcoin-like network.  The idea is to trade off some security and robustness in return for scale, but use it only for small amounts where fraud is less worthwhile.  On the bitcoin network today this is already seen with zero-confirmation transactions, and this is the niche pettycoin seeks to fill.

There are numerous problems to be overcome (one being the time taken by my day job, of course).  But segmenting the network and the blockchain is an interesting challenge: bitcoin’s blockchain is already designed so that you can have partial knowledge (mainly so you can prune used outputs).  But there’s a clear divide between full nodes, and second-class partial nodes.  I want a system where no one need know everything, and I’m getting closer to that goal.

Consider the simplest useful transaction in the bitcoin network, with one input (ie. a previous transaction’s output) and one output.  To verify this is a fairly simple process:

  1. Is the transaction well-formed?
  2. Find the transaction whose output this is spending.
  3. Does the signature match the address of that output?
  4. Has that output already been spent?

With bitcoin, you’re expected to know every transaction with unspent outputs, so if you can’t find the transaction at step 2, the verification fails. Even better, you can verify that previous transaction, too, all the way back to the creating of the coins involved.  Your only worry is that the blockchain you have is the same as everyone else’s, so they’ll accept your transaction later.

If you don’t expect to know everything, it’s more difficult.  You can use a merkle proof to show that a transaction was present in a block; it takes just log(N) hashes for an N-transaction block.  So you could prove that all those previous transactions are in the blockchain (though that might be thousands of transactions) by providing me with each transaction and proof.

But this can’t prove that there are not duplicate transactions in the blockchain itself.  Only knowing the entire contents would do that.  So we’re relying on the rest of the network, each with a partial view, to check that didn’t happen.

This leads to the two requirements for any aspect of the pettycoin system which a node can’t verify by itself:

  1. The information to verify must be seen by some honest nodes.
  2. Each node must have an efficient way of reporting if it sees a problem.

The former is a bit tricky.  Consensus is formed by the blockchain, but you don’t want to see all of it.  You might expect to see some fraction of it, but if you don’t, how would you alert the network in a way that can’t be faked?   Imagine a miner holds back 5 transactions in the block, the miner might wait for your complaint message on one, then release that transaction making you look like the dishonest one.  By making you cry wolf, they can ensure you are ignored.

The solution used in pettycoin is that miners have to prove that they know the transactions in the 10 previous blocks.  They do this by hashing the transactions from the previous block into a merkle tree like normal, only they prefix each transaction with their payout address (this is called prev_merkle in the code).  The only way to generate this hash is to know the contents of each transaction, and you can’t make a valid block without it.  Unfortunately, the only way to demonstrate that this hash is wrong (thus the block is invalid) is to also know the contents of each transaction in the block.  Thus transactions are batched into groups of 4096; you only need send 4096 transactions to prove that one of the hashes in a block is wrong.  Miners will insist on knowing the transactions for those blocks, knowing that if they fake it they’ll likely be caught.

Reporting most other problems in a block is fairly:

  1. You can prove a duplicate spend in the block chain by showing both transactions and the merkle proofs that they are in each block.  The second block is invalid.
  2. You can prove a malformed transaction by showing the transactions and the merkle proof it is in the block.  That block is invalid.
  3. You can prove an overspend by showing the transactions used as inputs.  That block is invalid.

But if a transaction in a block relies on an output of a transaction which never existed, you can’t prove it.  Even if you know every transaction which ever happened, you can’t prove that to me (without sending me the whole blockchain).  The initial design lived with such warts in the blockchain, instead insisting that you would have to show all the predecessors when you paid me (via a payment protocol).  That predecessor tree quickly becomes unwieldy, however.

The new approach is that for each input of a transaction in the blockchain, the miner has to include the block and transaction number where it appeared.  Now anyone who knows that previous transaction can check it, and if there’s a problem it’s easy for any node to prove by showing the transaction which is in that previous block (with merkle proof that it is).

This means that the blockchain can be trusted, if half the mining power can be trusted.  This is a weaker guarantee that bitcoin, but sufficiently strong for pettycoin.  If you send me a new transaction along with transactions it uses as inputs  and merkle proofs that they are in the blockchain, I only need ensure that the new transaction isn’t a double-spend.  That’s the same as the bitcoin network, with zero-confirmation transactions (though pettycoin has a special double-spend report message to expedite it a little).

Next post, I’ll respond to the best criticism of pettycoin yet, the problem of gateways (by Jason Gi)…

February 02, 2014

Eric Leblond: Using ulogd and JSON output

Ulogd and JSON output

In February 2014, I’ve commited a new output plugin to ulogd, the userspace logging daemon for Netfilter. This is a JSON output plugin which output logs into a file in JSON format. The interest of the JSON format is that it is easily parsed by software just as logstash. And once data are understood by logstash, you can get some nice and useful dashboard in Kibana:

Screenshot from 2014-02-02 13:22:34

This post explains how to configure ulogd and iptables to do packet logging and differentiate accepted and blocked packets. If you want to see how cool is the result, just check my post: Investigation on an attack tool used in China.


At the time of this writing, the JSON output plugin for ulogd is only available in the git tree. Ulogd 2.0.4 will contain the feature.

If you need to get the source, you can do:

git clone git://

Then the build is standard:

sudo make install

Please note that at the end of the configure, you must see:

Ulogd configuration:
  Input plugins:
    NFLOG plugin:			yes
    NFACCT plugin:			yes
  Output plugins:
    PCAP plugin:			yes
    JSON plugin:			yes
If the JSON plugin is not build, you need to install libjansson devel files on your system and rerun configure.


Ulogd configuration

All the edits are made in the ulogd.conf file. With default configure option the file is in /usr/local/etc/.

First, you need to activate the JSON plugin:



Then we define two stacks for logging. It will be used to differentiate accepted packets from dropped packets:


pre> The first stack will be used to log accepted packet, so we the numeric_label to 1 in set in [log2]. In [log3], we use a numeric_label of 0.
group=1 # Group has to be different from the one use in log1

group=2 # Group has to be different from the one use in log1/log2
numeric_label=0 # you can label the log info based on the packet verdict

The last thing to edit is the configuration of the JSON instance:

device="My awesome FW"
Here we say we want log and write on disk configuration (via sync) and we named our device My awesome FW. Last value boolean_label is the most tricky. It this configuration variable is set to 1, the numeric_label will be used to decide if a packet has been accepted or blocked. It this variable is set non null, then the packet is seen as allowed. If not, then it is seen as blocked.

Sample Iptables rules

In this example, packets to port 22 are logged and accepted and thus are logged in nflog-group 1. Packet in the default drop rule are sent to group 2 because they are dropped.

iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A INPUT ! -i lo -p tcp -m tcp --dport 22 --tcp-flags FIN,SYN,RST,ACK SYN -m state --state NEW -j NFLOG --nflog-prefix  "SSH Attempt" --nflog-group 1
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -j ACCEPT
iptables -A INPUT -j NFLOG --nflog-prefix  "Input IPv4 Default DROP" --nflog-group 2

There is no difference in IPv6, we just use nflog-group 1 and 2 with the same purpose:

ip6tables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
ip6tables -A INPUT ! -i lo -p tcp -m tcp --dport 22 --tcp-flags FIN,SYN,RST,ACK SYN -m state --state NEW -j NFLOG --nflog-prefix  "SSH Attempt" --nflog-group 1
ip6tables -A INPUT ! -i lo -p ipv6-icmp -m icmp6 --icmpv6-type 128 -m state --state NEW -j NFLOG --nflog-prefix  "Input ICMPv6" --nflog-group 1
ip6tables -A INPUT -p ipv6-icmp -j ACCEPT
ip6tables -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -j ACCEPT
ip6tables -A INPUT -i lo -j ACCEPT
ip6tables -A INPUT -j NFLOG --nflog-prefix  "Input IPv6 Default DROP" --nflog-group 2

Logstash configuration

Logstash configuration is simple. You must simply declare the ulogd.json file as input and optionaly you can activate geoip on the src_ip key:

input {
   file { 
      path => [ "/var/log/ulogd.json"]
      codec =>   json 

filter {
  if [src_ip]  {
    geoip {
      source => "src_ip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]

output { 
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }


To start ulogd in daemon mode, simply run:

ulogd -d

You can download logstash from their website and start it with the following command line:

java -jar logstash-1.3.3-flatjar.jar agent -f etc/logstash.conf --log log/logstash-indexer.out -- web

Once done, just point your browser to localhost:9292 and enjoy nice and interesting graphs.

Screenshot from 2014-02-02 13:57:19

Eric Leblond: Investigation on an attack tool used in China

Log analysis experiment

I’ve been playing lately with logstash using data from the ulogd JSON output plugin and the Suricata full JSON output as well as standard system logs.

Screenshot from 2014-02-02 13:22:34

Ulogd is getting Netfilter firewall logs from Linux kernel and is writing them in JSON format. Suricata is doing the same with alert and other traces. Logstash is getting both log as well as sytem log. This allows to create some dashboard with information coming from multiple sources. If you want to know how to configure ulogd for JSON output check this post. For suricata, you can have a look at this one.

Ulogd output is really new and I was experimenting with it in Kibana. When adding some custom graphs, I’ve observed some strange things and decided to investigate.

Displaying TCP window

TCP window size at the start of the connection is not defined in the RFC. So every OSes have choozen their own default value. It was thus looking interesting to display TCP window to be able to find some strange behavior. With the new ulogd JSON plugin, the window size information is available in the tcp.window key. So, after doing a query on tcp.syn:1 to only get TCP syn packet, I was able to graph the TCP window size of SYN packets.

Screenshot from 2014-02-02 13:22:58

Most of the TCP window sizes are well-known and correspond to standard operating systems:

  • 65535 is or MacOSX or some MS Windows OS.
  • 14600 is used by some Linux.

The first uncommon value is 16384. Graph are clickable on Kibana, so I was at one click of some interesting information.

First information when looking at dashboard after selection TCP syn packet with a window size of 16384 was the fact, it was only ssh scanning:

Screenshot from 2014-02-02 13:58:15

Second information is the fact that, according to geoip, all IPs are chinese:

Screenshot from 2014-02-02 13:57:19

A SSH scanning software

When looking at the details of the attempt made on my IP, there was something interesting: Screenshot from 2014-02-02 14:04:32

For all hosts, all requests are done with the same source port (6000). This is not possible to do that with a standard ssh client where the source port is by default choosen by the operating system. So or we have a custom standard software that perform a bind operation to port 6000 at socket creation. This is possible and one advantage would be to be easily authorized through a firewall if the country had one. Or we could have a software developped with low level (RAW) sockets for performance reason. This would allow a faster scanning of the internet by skipping OS TCP connection handling. There is a lot of posts regarding the usage of port 6000 as source for some scanning but I did not find any really interesting information in them.

On suricata side, most of the source IPs are referenced in ET compromised rules: Screenshot from 2014-02-02 13:25:03

Analysing my SSH logs, I did not see any trace of ssh bruteforce coming from source port 6000. But when selecting an IP, I’ve got trace of brute force from at least one of the IP: Screenshot from 2014-02-02 14:31:02

These attackers seems to really love the root account. In fact, I did not manage to find any trace of attempts for user different than root for IP address that are using the port 6000.

Getting back to my ulogd dashboard, I’ve displayed more info about the used scanning sequence: Screenshot from 2014-02-02 14:34:05 The host scans the box using a scanner using raw socket, then it attacks with a few minutes later with SSH bruteforce tool. The bruteforce tool has a TCP window size at start of 65535. It indicates that a separated software is used for scanning. So we should have an queueing mechanism between the scanner and the bruteforce tool. This may explains the duration between the scan and the bruteforce. Regarding TCP window size value, 65535 seems to indicate a Windows server (which is coherent with TTL value).

Looking at the scanner traffic

Capturing a sample traffic did not give to much information. This is a scanner sending a SYN and cleanly sending a reset when it got the SYN, ACK:

14:27:54.982273 IP (tos 0x0, ttl 103, id 256, offset 0, flags [none], proto TCP (6), length 40) > Flags [S], cksum 0xa525 (correct), seq 9764864, win 16384, length 0
14:27:54.982314 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 44) > Flags [S.], cksum 0xeee2 (correct), seq 2707606274, ack 9764865, win 29200, options [mss 1460], length 0
14:27:55.340992 IP (tos 0x0, ttl 111, id 14032, offset 0, flags [none], proto TCP (6), length 40) > Flags [R], cksum 0xe48c (correct), seq 9764865, win 0, length 0

But it seems the RST packet after the SYN, ACK is not well crafted: Screenshot from 2014-02-02 16:07:26

More info on SSH bruteforce tool

Knowing the the behavior was scanning from 6000 and starting a normal scanning, I’ve focused the Suricata dashboard on one IP to see if I had some more information: Screenshot from 2014-02-02 15:21:58

One single IP in the list of the scanning host is triggering multiple alerts. The event table confirmed this: Screenshot from 2014-02-02 15:16:41

Studying the geographical repartition of the Libssh alert, it appears there is used in other countries than China: Screenshot from 2014-02-02 15:24:59 So, libssh is not a discriminatory element of the attacks.


A custom attack tool has been been deployed on some Chinese IPs. This is a combination of a SSH scanner based on RAW socket and a SSH bruteforce tool. It tries to gain access to the root account of system via the ssh service. On an organisational level, it is possible there is a Chinese initiative trying to get the low-hanging fruit (system with ssh root account protected by password) or maybe it is just a some organization using some compromised Chinese IPs to try to get control other more boxes.

January 20, 2014

Eric Leblond: Why you will love nftables

Linux 3.13 is out

Linux 3.13 is out bringing among other thing the first official release of nftables. nftables is the project that aims to replace the existing {ip,ip6,arp,eb}tables framework aka iptables. nftables version in Linux 3.13 is not yet complete. Some important features are missing and will be introduced in the following Linux versions. It is already usable in most cases but a complete support (read nftables at a better level than iptables) should be available in Linux 3.15.

nftables comes with a new command line tool named nft. nft is the successor of iptables and derivatives (ip6tables, arptables). And it has a completely different syntax. Yes, if you are used to iptables, that’s a shock. But there is a compatibility layer that allow you to use iptables even if filtering is done with nftables in kernel.

There is only really few documentation available for now. You can find my nftables quick howto and there is some other initiatives that should be made public soon.

Some command line examples

Multiple targets on one line

Suppose you want to log and drop a packet with iptables, you had to write two rules. One for drop and one for logging:

iptables -A FORWARD -p tcp --dport 22 -j LOG
iptables -A FORWARD -p tcp --dport 22 -j DROP

With nft, you can combined both targets:

nft add rule filter forward tcp dport 22 log drop
Easy set creation

Suppose you want to allow packets for different ports and allow different icmpv6 types. With iptables, you need to use something like:

ip6tables -A INPUT -p tcp -m multiport --dports 23,80,443 -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbor-solicitation -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type echo-request -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type router-advertisement -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type neighbor-advertisement -j ACCEPT

With nft, sets can be use on any element in a rule:

nft add rule ip6 filter input tcp dport {telnet, http, https} accept
nft add rule ip6 filter input icmpv6 type { nd-neighbor-solicit, echo-request, nd-router-advert, nd-neighbor-advert } accept

It is easier to write and it is more efficient on filtering side as there is only one rule added for each protocol.

You can also use named set to be able to make them evolve other time:

# nft -i # use interactive mode
nft> add set global ipv4_ad { type ipv4_address;}
nft> add element global ipv4_ad {, }
nft> add rule ip global filter ip saddr @ipv4_ad drop
And later when a new bad boy is detected:
# nft -i
nft> add element global ipv4_ad { }

One advanced feature of nftables is mapping. It is possible to use to different type of data and to link them. For example, we can associate iface and a dedicated rule set (stored in a chain and created before). In the example, the chains are named low_sec and high_sec:

# nft -i
nft> add map filter jump_map { type ifindex : verdict; }
nft> add element filter jump_map { eth0 : jump low_sec; }
nft> add element filter jump_map { eth1 : jump high_sec; }
nft> add rule filter input iif vmap @jump_map

Now, let’s say you have a new dynamic interface ppp1, it is easy to setup filtering for it. Simply add it in the jump_map mapping:

nft> add element filter jump_map { ppp1 : jump low_sec; }

On administration and kernel side

More speed at update

Adding a rule in iptables was getting dramatically slower with the number of rules and that’s explained why script using iptables call are taking a long time to complete. This is not anymore with nftables which is using atomic and fast operation to update rule sets.

Less kernel update

With iptables, each match or target was requiring a kernel module. So, you had to recompile kernel in case you forgot something or want to use something new. this is not anymore the case with nftables. In nftables, most work is done in userspace and kernel only knows some basic instruction (filtering is implemented in a pseudo-state machine). For example, icmpv6 support has been achieved via a simple patch of the nft tool. This type of modification in iptables would have required kernel and iptables upgrade.

January 10, 2014

Eric Leblond: A bit of logstash cooking


I’m running a dedicated server to host some internet services. The server runs Debian. I’ve installed logstash on it to do a bit of monitoring of my system logs and suricata. I’ve build a set of dashboards. The screenshot below shows a part of the one being dedicated to suricata: Suricata dashboard


My data sources were the following:
  • System logs
  • Apache logs
  • Suricata full JSON logs (should be available in suricata 2.0)
System logs

The setup was mostly really easy. I’ve just added a grok pattern to detect successful and unsuccessful connections on the ssh server.

input {
  file {
    type => "linux-syslog"
    path => [ "/var/log/daemon.log", "/var/log/auth.log", "/var/log/" ]
filter {
  if [type] == "linux-syslog" {
      grok {
        match => { "message" => "Accepted %{WORD:auth_method} for %{USER:username} from %{IP:src_ip} port %{INT:src_port} ssh2" }
      grok {
        match => { "message" => "Invalid user %{USER:username} from %{IP:src_ip}" }
Apache logs
Extract of Apache Dashboard

For apache, it was even easier for access.log:

  file {
    path => [ "/var/log/apache2/*access.log" ]
    type => "apache-access"

  file {
    type => "apache-error"
    path => "/var/log/apache2/error.log"
filter {
  if [type] == "apache-access" {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }

  if [type] == "apache-error" {
      grok {
        match => { "message" => "%{APACHEERRORLOG}" }
        patterns_dir => ["/var/lib/logstash/etc/grok"]

For error log, I’ve created a grok pattern to get client IP. So I’ve created a file in grok dir with:

APACHEERRORLOG \[%{HTTPERRORDATE:timestamp}\] \[%{WORD:severity}\] \[client %{IPORHOST:clientip}\] %{GREEDYDATA:message_remainder}
Netfilter logs
Extract of firewall Dashboard

For Netfilter logs, I’ve decided to play it the old way and to parse kernel log instead of using ulogd:

input {
  file {
    type => "kern-log"
    path => "/var/log/kern.log"

filter {
 if [type] == "kern-log" {
        grok {
                match => { "message" => "%{IPTABLES}"}
                patterns_dir => ["/var/lib/logstash/etc/grok"]
with IPTABLES being defined in a file placed in the grok directory and containing:
 NETFILTERMAC %{COMMONMAC:dst_mac}:%{COMMONMAC:src_mac}:%{ETHTYPE:ethtype}
 ETHTYPE (?:(?:[A-Fa-f0-9]{2}):(?:[A-Fa-f0-9]{2}))
 IPTABLES1 (?:IN=%{WORD:in_device} OUT=(%{WORD:out_device})? MAC=%{NETFILTERMAC} SRC=%{IP:src_ip} DST=%{IP:dst_ip}.*(TTL=%{INT:ttl})?.*PROTO=%{WORD:proto}?.*SPT=%{INT:src_port}?.*DPT=%{INT:dst_port}?.*)
 IPTABLES2 (?:IN=%{WORD:in_device} OUT=(%{WORD:out_device})? MAC=%{NETFILTERMAC} SRC=%{IP:src_ip} DST=%{IP:dst_ip}.*(TTL=%{INT:ttl})?.*PROTO=%{INT:proto}?.*)

Exim logs
Extract of SMTP dashboard

This part was complicated because exim logs are multiline. So I found a page explaining how to match at least, the logs for delivered mail. It is using multiline in filter. Then I added a series of matches to get more information. Each match do only get a part of a message so I’ve used break_on_match not to exit when one of the match succeed.

input {
  file {
    type => "exim-log"
    path => "/var/log/exim4/mainlog"
filter {
  if [type] == "exim-log" {
      multiline {
        pattern => "%{DATE} %{TIME} %{HOSTNAME:msgid} (=>|Completed)"
        what => "previous"
      grok {
        break_on_match => false
        match => [
          "message", "= %{NOTSPACE:from} H=%{NOTSPACE:server} \[%{IP:src_ip}\]"
      grok {
        break_on_match => false
        match => [
          "message", "=> %{USERNAME:username} %{NOTSPACE:dest}> R=%{WORD:transport}"

      grok {
        break_on_match => false
        match => [
          "message", "=> %{NOTSPACE:dest} R=%{WORD:transport}"
      grok {
        break_on_match => false
        match => [
          "message", "%{DATE} %{TIME} H=%{NOTSPACE:server}%{GREEDYDATA} \[%{IP:src_ip}\] F=%{NOTSPACE:mail_to}> temporarily rejected RCPT %{NOTSPACE:dest}>: greylisted"
Pie with file types

Suricata full JSON output is JSON so the configuration in logstash is trivial:

input {
   file {
      path => ["/var/log/suricata/eve.json" ]
      codec =>   json
You can download a sample Suricata Dashboard to use in in your logstash installation.

The full configuration

Below is the full configuration. There is only one thing which I did not mention. For most source IP, I use geoip to have an idea of the localisation of the IP.

input {
  file {
    type => "linux-syslog"
    path => [ "/var/log/daemon.log", "/var/log/auth.log", "/var/log/" ]

  file {
    path => [ "/var/log/apache2/*access.log" ]
    type => "apache-access"

  file {
    type => "apache-error"
    path => "/var/log/apache2/error.log"

  file {
    type => "exim-log"
    path => "/var/log/exim4/mainlog"

  file {
    type => "kern-log"
    path => "/var/log/kern.log"

   file {
      path => ["/var/log/suricata/eve.json" ]
      codec =>   json


filter {
  if [type] == "apache-access" {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
  if [type] == "linux-syslog" {
      grok {
        match => { "message" => "Accepted %{WORD:auth_method} for %{USER:username} from %{IP:src_ip} port %{INT:src_port} ssh2" }

  if [type] == "apache-error" {
      grok {
        match => { "message" => "%{APACHEERRORLOG}" }
        patterns_dir => ["/var/lib/logstash/etc/grok"]

  if [type] == "exim-log" {
      multiline {
        pattern => "%{DATE} %{TIME} %{HOSTNAME:msgid} (=>|Completed)"
        what => "previous"
      grok {
        break_on_match => false
        match => [
          "message", "= %{NOTSPACE:from} H=%{NOTSPACE:server} \[%{IP:src_ip}\]"
      grok {
        break_on_match => false
        match => [
          "message", "=> %{USERNAME:username} %{NOTSPACE:dest}> R=%{WORD:transport}"

      grok {
        break_on_match => false
        match => [
          "message", "=> %{NOTSPACE:dest} R=%{WORD:transport}"
      grok {
        break_on_match => false
        match => [
          "message", "%{DATE} %{TIME} H=%{NOTSPACE:server}%{GREEDYDATA} \[%{IP:src_ip}\] F=%{NOTSPACE:mail_to}> temporarily rejected RCPT %{NOTSPACE:dest}>: greylisted"

 if [type] == "kern-log" {
        grok {
                match => { "message" => "%{IPTABLES}"}
                patterns_dir => ["/var/lib/logstash/etc/grok"]

  if [src_ip]  {
    geoip {
      source => "src_ip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]

  if [clientip]  {
    geoip {
      source => "clientip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]

  if [srcip]  {
    geoip {
      source => "srcip"
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]

output {
  stdout { codec => rubydebug }
  elasticsearch { embedded => true }

January 04, 2014

Patrick McHardy: test


Copyright (C) 2001-2010 by the respective authors.