The new definition of scale - the Internet of Things
In terms of scaling networks and messaging middleware, the segment of data created by humans is by and large a solved problem. The new frontier is the Internet of Things.
I admit - I'm a numbers junkie.
I've always been addicted to numbers - in math (for reasons other than sheer size) and in technology. In technology, it's performance numbers in particular: those of networking servers and message brokers (and databases, but that's not in this post). As in: "Whoa, so many X per Y? Awesome!"
Now, tuning systems and code, and fussing with benchmarks and performance numbers just to push certain hardware and software to the maximum possible, is interesting (for geeks), but probably questionable. Isn't that kind of benchmark porn? What's the practical point anyway? How much performance do you need?
So I wanted to come up with some perspectives from an applications point of view to put messaging performance numbers into context. Hence I went out on the Web and collected the following list:
Global aggregate Event rate by Application
| Application | Events per Second |
|---|---|
| Internet-of-Things incoming (est. 2020) | 300,000,000 ?? (see below) |
| Financial Data Feed incoming (peak 2013) | 6,800,000 |
| Financial Data Feed incoming (avg. 2013) | 1,800,000 |
| EMail (global, incl. 80% spam, 2012) | 1,600,000 |
| Text Msg. Apps (global, est. 2014) | 580,000 |
| Twitter Tweetline Updates (global, est. 2013) [1] | 390,000 |
| EMail (global, est. legit, 2012) | 320,000 |
| SMS (global, 2012) | 200,000 |
| Twitter sent Tweets (peak, 2013) | 144,000 |
| NYSE Euronext (peak, 2012) | 130,000 |
| Twitter sent Tweets (peak, 2011) | 120,000 |
| ESPN (peak, 2013) | 100,000 |
| Eurex (peak, 2012) | 61,000 |
| Google Searches (2012) | 60,000 |
| Facebook Likes (2012) | 32,000 |
| London Stock Exch. (peak, 2012) | 25,000 |
| Twitter Queries (avg., 2011) | 18,000 |
| Twitter sent Tweets (avg., 2013) | 5,700 |
| Twitter sent Tweets (avg., 2011) | 2,900 |
| Visa Transactions (peak, 2007) | 2,200 |
| Reddit.com Pageviews (2012) | 1,170 |
| Visa Transactions (avg., 2007) | 900 |
An "event" in the table above is just a collective term for any request, response, message sent, message received, and so on. I collected the numbers from various sources on the net (see links at the end of this post). The precise values aren't important (you might disagree or have better numbers) - what I'd claim is that they are not completely off in terms of order of magnitude.
Looking at the above numbers, a couple of things are noticeable. If you set aside email spam and the first row (Internet of Things), which we'll come to in a minute, there are essentially two groups of scale: financial data and human-generated data.
I don't know a lot about the finance business, but I would guess that a considerable chunk (most?) of that data originates in machines, algorithmic trading programs that talk to market places, place orders, hunt arbitrage and the like.
Within the other group, all data ultimately originates with humans doing something. We'll look at this in the next part of this post in more detail.
Another thing might catch the eye: talking about machines, where are all the sensor and Internet of Things data points in that list? This is emerging technology, so coming up with reliable, publicly available historic data is a problem. One thing that seems clear, however, is that the Internet of Things is likely going to be huge. Here is what IDC has to say:
IDC expects the installed base of the Internet of Things will be approximately 212 billion "things" globally by the end of 2020. This will include 30.1 billion installed "connected (autonomous) things" in 2020.
If each of 30 billion connected things generates one data point every 100 seconds on average, that adds up to 300 million events per second. Well, that is huge. And this is my totally unscientific, ballpark number in the first row of the above table.
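The ballpark figure for the first table row can be reproduced in a few lines of Python. This is only a sketch of the arithmetic in the paragraph above: the device count is the rounded IDC figure, the 100-second interval is the post's own guess, and the function and variable names are made up for illustration.

```python
# Rounded IDC forecast for "connected (autonomous) things" in 2020.
CONNECTED_THINGS = 30_000_000_000

# Assumption from the post: one data point per device every 100 seconds.
SECONDS_BETWEEN_DATA_POINTS = 100


def iot_events_per_second(things: int, interval_s: float) -> float:
    """Aggregate event rate if every device reports once per interval."""
    return things / interval_s


iot_rate = iot_events_per_second(CONNECTED_THINGS, SECONDS_BETWEEN_DATA_POINTS)
print(f"{iot_rate:,.0f} events per second")
```

Changing the interval shows how sensitive the estimate is: at one data point every 10 seconds, the rate grows by another order of magnitude.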
Human originated data
All data that originates in explicit human action or trigger, such as Web surfing, text messaging or social interaction, has inherently limited volume and scale. There are only so many people on planet Earth, and humans can only do so many things a day, like clicking something or entering some text message.
What is interesting is to actually do the math and relate that to current network technology available at your online retailer today.
According to InternetWorldStats, there were 2.4 billion Internet users in June 2012. If we - for a moment - assume each user generates 500 messages, tracked events or whatever other user related actions per day, this sums up to a total of 2.4 billion users × 500 events per day / 86,400 seconds per day ≈ 13.8 million events per second (meps).
Note that I am not talking about media data streams like voice and video here. Those are produced or consumed by humans as well, but the metadata for those media streams (who watched what and when, or who talked to whom) is something different. Those metadata events would fit into the above estimates as well.
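The human-originated event rate estimated above is simple enough to check in Python. The sketch below only restates the post's inputs (2.4 billion users, an assumed 500 events per user per day); the names are illustrative.

```python
INTERNET_USERS = 2_400_000_000   # InternetWorldStats, June 2012
EVENTS_PER_USER_PER_DAY = 500    # the post's assumption, not a measured value
SECONDS_PER_DAY = 24 * 60 * 60


def human_events_per_second(users: int, events_per_user_per_day: int) -> float:
    """Average global event rate, spread evenly over the day."""
    return users * events_per_user_per_day / SECONDS_PER_DAY


human_rate = human_events_per_second(INTERNET_USERS, EVENTS_PER_USER_PER_DAY)
print(f"{human_rate / 1e6:.2f} million events per second")
```

The exact result is about 13.89 million events per second, which the post quotes as 13.8 meps. Note that this is an average; it says nothing about daily peaks.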
Let's put the above 13.8 meps into perspective in terms of current network hardware.
10 Gigabit Ethernet technology (10GbE) has a theoretical limit of 14.88 million packets per second (mpps) at the minimum packet size of 64 bytes. If we assume each event fits into those 64 bytes, then a single 10GbE link could, in theory, carry the complete stream of user events for the total worldwide Internet user base.
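For readers wondering where the 14.88 mpps limit comes from, here is a small sketch. It relies on standard Ethernet framing overhead, which the post does not spell out: each frame on the wire is preceded by 8 bytes of preamble and start-of-frame delimiter and followed by a 12-byte inter-frame gap. The 13.8 meps figure is the post's own estimate; the names are illustrative.

```python
LINK_BITS_PER_SECOND = 10_000_000_000  # 10GbE line rate
MIN_FRAME_BYTES = 64                   # minimum Ethernet frame size

# Per-frame overhead on the wire (standard Ethernet, not from the post).
PREAMBLE_AND_SFD_BYTES = 8
INTERFRAME_GAP_BYTES = 12

# The post's estimate of human-originated events per second.
HUMAN_EVENTS_PER_SECOND = 13_800_000


def max_frames_per_second(frame_bytes: int,
                          link_bps: int = LINK_BITS_PER_SECOND) -> float:
    """Theoretical frame rate of a link for a fixed frame size."""
    wire_bytes = frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES
    return link_bps / (wire_bytes * 8)


max_pps = max_frames_per_second(MIN_FRAME_BYTES)
print(f"10GbE limit at 64 bytes: {max_pps / 1e6:.2f} mpps")
print(f"Fits the human event stream: {max_pps > HUMAN_EVENTS_PER_SECOND}")
```

The margin is thin, and it assumes one event per minimum-size frame with no protocol headers beyond Ethernet, so treat it as an upper bound rather than a capacity plan.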
The fun fact: you can buy such a network card for something like 350 Euro (12/2013) off the shelf. So that hardware isn't something out of reach that only large enterprises can afford (which is the case for 40GbE and 100GbE as of today).
Now you can of course question the estimates for the fraction of world population that is actually online and the 500 events per user per day, and you can question the assumption that the tracked data per event would fit into 64 bytes. For example, a Tweet can take 140+ bytes, and so on. When you do so, you probably end up with a dozen or a hundred 10GbE NICs to digest the data stream. But that's basically the ceiling. And, by the way, not even Google has access to the full global data stream like above.
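To see how the NIC count moves when those assumptions are questioned, the estimate can be parameterized. The sketch below is hedged in several ways: it assumes one event per Ethernet frame, ignores IP/TCP and application protocol headers, and the alternative scenarios (frame sizes, events per day) are hypothetical values picked for illustration, not figures from the post.

```python
import math

LINK_BITS_PER_SECOND = 10_000_000_000  # one 10GbE NIC
WIRE_OVERHEAD_BYTES = 8 + 12           # preamble/SFD + inter-frame gap
SECONDS_PER_DAY = 86_400


def frames_per_second(frame_bytes: int) -> float:
    """Theoretical frame rate of one 10GbE link for a fixed frame size."""
    return LINK_BITS_PER_SECOND / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def nics_needed(users: int, events_per_user_per_day: int, frame_bytes: int) -> int:
    """Number of 10GbE NICs needed, assuming one event per frame."""
    events_per_second = users * events_per_user_per_day / SECONDS_PER_DAY
    return math.ceil(events_per_second / frames_per_second(frame_bytes))


# (users, events per user per day, frame size in bytes)
scenarios = [
    (2_400_000_000, 500, 64),     # the post's baseline
    (2_400_000_000, 500, 256),    # hypothetical: larger events
    (2_400_000_000, 5_000, 256),  # hypothetical: 10x more events per user
]

for users, events, frame in scenarios:
    print(users, events, frame, "->", nics_needed(users, events, frame), "NIC(s)")
```

Under these assumptions the baseline needs a single NIC, and even the most pessimistic scenario here stays within a few dozen, which is in line with the "dozen or a hundred" ceiling argued above.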
Personally I find such estimates - though unscientific and using simple math - quite surprising. Modern network technology has reached incredible performance levels, and any human generated (meta-)data is basically not a fundamental challenge anymore in terms of scale for such hardware.
If you can come up with efficient software that is really capable of fully exploiting modern hardware, a single server with a dozen Xeon cores and a single 10GbE network card will go a long, long way.
Machine originated data
The above estimates show that scaling the network for any human originated data (other than media streams) is not a fundamental challenge anymore. What is a challenge is the prospect of digesting (and processing) the data originating from machines and sensors, the Internet of Things.
Applications in this area might be characterized by having
- huge number of clients
- long-lived connections
- small to limited message size
- high message rates
- soft real-time and low-latency
- distributed, federated application architectures
The combination of these characteristics and requirements creates a perfect storm.
The Internet of Things will likely dwarf any human generated data and even the massive data feeds with automated trading systems in finance as we know them today. In fact:
The question is more: how to write (message broker) software that can fully exploit that hardware?
The hardware is here today, but the software is still a challenge. Current state-of-the-art networking frameworks such as Twisted, NodeJS or Netty, and servers like Nginx or Crossbar.io, have solved the C10k problem we had with legacy stuff like Apache and thread-per-connection designs. But the next barrier lurks around the corner.
Those current servers - if done right - will scale to millions of long-lived (mostly idle) connections, but you will still have a hard time driving them at 1 million or more messages per second across millions of concurrently active connections while maintaining low latency and jitter.
To exploit the hardware at those levels might require (yet another) radical rethinking in software design and/or architecture. But that's another post.
- New Tweets per second record, and how!
- DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second
- The Engineering Behind Twitters New Search Experience
- Analyst: Twitter Passed 500M Users In June 2012
- Twitter Has A Surprisingly Small Number Of US Users
- Pingdom - Internet 2012 in numbers
- Reddit - Top Posts of the Year and Best of 2012 Awards
- Statistic Brain - Google Annual Search Statistics
- Chat apps have overtaken SMS by message volume, but how big a disaster is that for carriers?
- Spam volume up big time: 64% increase in one month
- Spam Volumes: Past & Present, Global & Local
- AutomatedTrade - Market Data Maelstrom
- NuPont Market Data - 25 million messages per second in 2013?
- NYSE Euronext Tops Europe Peak Trading Charts
- Distributed Complex Event Processing with Query Rewriting
- How Big The Internet Of Things Could Become
- Here's Why 'The Internet Of Things' Will Be Huge, And Drive Tremendous Value For People And Businesses
- Internet of things: $8.9 trillion market in 2020, 212 billion connected things
- The Internet of Things Is Poised to Change Everything, Says IDC
- ESPN's Architecture at Scale - Operating at 100,000 Duh Nuh Nuhs Per Second
1. The Tweetline update rate is difficult to estimate. The Twitter Web app does AJAX polling of a user's Tweetline in 90s intervals. I could not find numbers for average or peak concurrently active users (users who have the Twitter Web page open in their browser, or the mobile app active), which we would need. Hence the estimate: while Twitter has 500 million user accounts worldwide, for the US it has 140 million user accounts, of which roughly 50 million are monthly active. So for the US, monthly active is roughly 35% of all users. Applied to 500 million users worldwide, that gives 175 million monthly active users worldwide (35% of all) and 35 million concurrently active users (20% of monthly active - a total gut estimate). At a poll interval of 90s, that means roughly 390k poll requests per second. At 15k reqs/sec capacity for the Netty based Twitter frontend servers doing HTTP/SPDY, that is something like 30 frontend servers. A rack full.
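The chain of estimates in this footnote can be written down as a short Python sketch. All inputs are the footnote's own numbers, including the 20% gut estimate for concurrently active users; the variable names are illustrative, and integer percentages are used to keep the arithmetic exact.

```python
import math

TOTAL_ACCOUNTS = 500_000_000      # Twitter accounts worldwide
MONTHLY_ACTIVE_PERCENT = 35       # extrapolated from the US ratio (50 of 140 million)
CONCURRENT_PERCENT = 20           # gut estimate: share of monthly actives online at once
POLL_INTERVAL_SECONDS = 90        # AJAX polling interval of the Web app
SERVER_CAPACITY_RPS = 15_000      # requests per second per frontend server

monthly_active = TOTAL_ACCOUNTS * MONTHLY_ACTIVE_PERCENT // 100
concurrent_active = monthly_active * CONCURRENT_PERCENT // 100
polls_per_second = concurrent_active / POLL_INTERVAL_SECONDS
frontend_servers = math.ceil(polls_per_second / SERVER_CAPACITY_RPS)

print(f"monthly active users:    {monthly_active:,}")
print(f"concurrently active:     {concurrent_active:,}")
print(f"poll requests per second: {polls_per_second:,.0f}")
print(f"frontend servers needed:  {frontend_servers}")
```

With these inputs the sketch yields about 389k polls per second and 26 servers, which matches the rounded 390k and "something like 30" above.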