The new definition of scale - the Internet of Things

by Tobias Oberstein

In terms of scaling networks and messaging middleware, the segment of data created by humans is by and large a solved problem. The new frontier is the Internet of Things.


I admit - I'm a numbers junkie.

I've always been addicted to numbers - in math (for reasons other than sheer size) and in technology. With the latter, in particular the performance numbers of networking servers and message brokers (and databases, but those are not in this post). Like in: "Wooah, so many X per Y? Awesome!"

Now, tuning systems and code, fussing with benchmarks and performance numbers just for the sake of pushing the maximum out of certain hardware and software is interesting (for geeks), but probably questionable. Isn't that kinda benchmark porn? What's the practical point, anyway? How much performance do you need?

So I wanted to come up with some perspectives from an applications point of view to put messaging performance numbers into context. Hence I went out on the Web and collected the following list:

Global aggregate Event rate by Application

Application                                           Events per Second
---------------------------------------------------   -----------------
Internet-of-Things incoming (est. 2020)                     300,000,000 ?? (see below)
Financial Data Feed incoming (peak 2013)                      6,800,000
Financial Data Feed incoming (avg. 2013)                      1,800,000
EMail (global, incl. 80% spam, 2012)                          1,600,000
Text Msg. Apps (global, est. 2014)                              580,000
Twitter Tweetline Updates (global, est. 2013) [1]               390,000
Nasdaq Totalview                                                450,000
EMail (global, est. legit, 2012)                                320,000
SMS (global, 2012)                                              200,000
Twitter sent Tweets (peak, 2013)                                144,000
NYSE Euronext (peak, 2012)                                      130,000
Twitter sent Tweets (peak, 2011)                                120,000
ESPN (peak, 2013)                                               100,000
Eurex (peak, 2012)                                               61,000
Google Searches (2012)                                           60,000
Facebook Likes (2012)                                            32,000
London Stock Exch. (peak, 2012)                                  25,000
Twitter Queries (avg., 2011)                                     18,000
Twitter sent Tweets (avg., 2013)                                  5,700
Twitter sent Tweets (avg., 2011)                                  2,900
Visa Transactions (peak, 2007)                                    2,200
Pageviews (2012)                                                  1,170
Visa Transactions (avg., 2007)                                      900

An "event" in the above is just a collective term for any request, response, message sent, message received, and so on. I have collected these numbers from various sources on the net (see links at the end of this post). The precise values aren't important (you might disagree or have better numbers) - what I'd claim is that they are not completely off in terms of order of magnitude.

Looking at the above numbers, a couple of things are noticeable. If you take out Email spam and the first line (Internet of Things) - which we'll come to in a minute - there are essentially two groups of scale: financial data and human-generated data.

I don't know a lot about the finance business, but I would guess that a considerable chunk (most?) of that data originates from machines: algorithmic trading programs that talk to market places, place orders, hunt arbitrage and the like.

Within the other group, all data ultimately originates with humans doing something. We'll look at this in the next part of this post in more detail.

Another thing might spring to the eye: talking about machines, where are all the sensor and Internet of Things data points in that list? This is emerging technology, so reliable, publicly available historical data is hard to come by. One thing that seems clear, however, is that the Internet of Things is likely going to be huge. Here is what IDC has to say:

IDC expects the installed base of the Internet of Things will be approximately 212 billion "things" globally by the end of 2020. This will include 30.1 billion installed "connected (autonomous) things" in 2020.

If each of 30 billion connected things generates one data point every 100 seconds on average, well, that is huge. And this is my totally unscientific, ballpark number in the first row in the above table.
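Spelling that ballpark out - both inputs (30 billion connected things, one data point per 100 seconds) are the rough assumptions from above, not measurements:

```python
# Back-of-the-envelope for the IoT row in the table above.
connected_things = 30_000_000_000   # IDC-based estimate for 2020
seconds_per_event = 100             # one data point per thing per 100 s (assumed)

events_per_second = connected_things // seconds_per_event
print(f"{events_per_second:,} events/sec")  # 300,000,000 events/sec
```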

Human originated data

All data that originates in an explicit human action or trigger, such as Web surfing, text messaging or social interaction, has inherently limited volume and scale. There are only so many people on planet earth, and humans can only do so many things a day, like clicking something or entering some text message.

What is interesting is to actually do the math and relate that to current network technology available at your online retailer today.

According to InternetWorldStats, there were 2.4 billion Internet users in June 2012. If we - for a moment - assume each user generates 500 messages, tracked events or other user-related actions per day, this sums up to a total of

2.4 x 10^9 x 500 / 86,400 ≈ **13.9 million events per second (meps)**
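As a quick sanity check of the arithmetic (the 500 events per user per day is the assumption made above):

```python
internet_users = 2_400_000_000    # InternetWorldStats, June 2012
events_per_user_day = 500         # assumed
seconds_per_day = 86_400

meps = internet_users * events_per_user_day / seconds_per_day / 1_000_000
print(f"{meps:.1f} million events per second")  # 13.9 million events per second
```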

Note that I am not talking about media data streams like voice and video here. Those are produced and consumed by humans as well, but they are a different matter; the metadata for those streams (who watched what and when, or talked to whom) would fit into the above estimates, though.

Let's put the above 13.9 meps into perspective in terms of current network hardware.

10 Gigabit Ethernet technology (10GbE) has a theoretical limit of 14.88 million packets per second (mpps) at the minimum packet size of 64 bytes. If we assume each event fits into those 64 bytes, then

A single 10GbE Ethernet network adapter is enough to process the full data stream
of user events for the total world wide Internet user base.
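The 14.88 mpps figure can be derived from the wire format: besides the 64-byte frame itself, each packet carries an 8-byte preamble and a 12-byte interframe gap on the wire. A quick check:

```python
link_bps = 10_000_000_000   # 10GbE line rate in bits/sec
frame_bytes = 64            # minimum Ethernet frame size
overhead_bytes = 8 + 12     # preamble + interframe gap

pps = link_bps / ((frame_bytes + overhead_bytes) * 8)
print(f"{pps / 1e6:.2f} mpps")  # 14.88 mpps
```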

The fun fact: you can buy such a network card off the shelf for something like 350 Euro (as of 12/2013). So that hardware isn't out of reach for anyone - unlike 40GbE and 100GbE, which as of today only large enterprises can afford.

Now you can of course question the estimate for the fraction of the world population that is actually online, the 500 events per user per day, and the assumption that the tracked data per event fits into 64 bytes - a Tweet alone can run to 140+ bytes, and so on. When you do, you probably end up with a dozen or a hundred 10GbE NICs to digest the data stream. But that is basically the ceiling. And, by the way, not even Google has access to the full global data stream described above.
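To see how sensitive the result is to those assumptions, here is a tiny sketch that re-runs the estimate with different event sizes and rates (the numbers fed in are purely illustrative):

```python
def nics_needed(users, events_per_user_day, event_bytes, link_bps=10e9):
    """Rough count of 10GbE NICs needed to carry the aggregate event stream."""
    events_per_sec = users * events_per_user_day / 86_400
    wire_bits = events_per_sec * (event_bytes + 20) * 8  # +20 B preamble/gap
    return wire_bits / link_bps

# the original assumptions: a single NIC suffices
print(round(nics_needed(2.4e9, 500, 64), 1))      # 0.9
# pessimistic: more users, 10x the events, bigger payloads
print(round(nics_needed(3.0e9, 5_000, 512), 1))   # still only tens of NICs
```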

Personally I find such estimates - though unscientific and based on simple math - quite surprising. Modern network technology has reached incredible performance levels, and human-generated (meta-)data no longer poses a fundamental scaling challenge for such hardware.

If you can come up with efficient software that is really capable of fully exploiting modern hardware, a single server with a dozen Xeon cores and a single 10GbE network card will go a long, long way.

Machine originated data

The above estimates show that scaling the network for any human originated data (other than media streams) is not a fundamental challenge anymore. What is a challenge is the prospect of digesting (and processing) the data originating from machines and sensors, the Internet of Things.

Applications in this area might be characterized in having

  • huge number of clients
  • long-lived connections
  • small to limited message size
  • high message rates
  • soft real-time and low-latency
  • distributed, federated application architectures

The combination of these characteristics and requirements creates a perfect storm.

The Internet of Things will likely dwarf any human-generated data, and even the massive data feeds of today's automated trading systems in finance. In fact:

One could expect the Internet of Things to rise rapidly over the coming decade to a point where it simply defines scale for networking, message brokering, application infrastructure and databases.

The question is more: how to write (message broker) software that can fully exploit that hardware?

The hardware is here today, but the software is still a challenge. Current state-of-the-art networking frameworks like Twisted, NodeJS or Netty, and servers like Nginx, have solved the C10k problem we had with legacy stuff like Apache and its thread-per-connection design. But the next barrier lurks around the corner.

Those current servers - if done right - will scale to millions of long-lived (mostly idle) connections, but you will still have a hard time driving them at a million or more messages per second across millions of concurrently active connections while maintaining low latency and jitter.

To exploit the hardware at those levels might require (yet another) radical rethinking in software design and/or architecture. But that's another post.



[1] Regarding the Tweetline updates rate: this is difficult to estimate. The Twitter Web app polls a user's Tweetline via AJAX in 90-second intervals. I could not find numbers for average or peak concurrently active users (users who have the Twitter Web page open in their browser, or the mobile app active), which is what we would need. Hence the estimate: Twitter has 500 million user accounts world-wide; for the US it has 140 million accounts, of which roughly 50 million are monthly active. So for the US, monthly active is roughly 35% of all users. Applying that ratio to the 500 million users world-wide gives 175 million monthly active users, and 35 million concurrently active users (20% of monthly active - a total gut estimate). At a poll rate of 90 s, that means 390k poll requests per second. At 15k reqs/sec capacity for the Netty-based Twitter frontend servers doing HTTP/SPDY, that is something like 30 frontend servers. A rack full.
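The footnote's chain of estimates, spelled out (every ratio here is the gut estimate from the text, nothing measured):

```python
accounts = 500_000_000          # Twitter accounts world-wide
monthly_active = 0.35           # ratio extrapolated from US numbers
concurrently_active = 0.20      # of monthly active - a gut estimate
poll_interval_s = 90            # Tweetline AJAX poll interval
reqs_per_server = 15_000        # per Netty-based frontend server

polls_per_sec = accounts * monthly_active * concurrently_active / poll_interval_s
servers = polls_per_sec / reqs_per_server
print(f"{polls_per_sec:,.0f} polls/sec, ~{servers:.0f} frontend servers")
```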
