Google, Facebook, Netflix, Tumblr, YouTube, Justin Bieber and Lady Gaga. What do all of these have in common? Well, the first five are websites and the last two are people, so the commonality might be difficult to ascertain. What these dissimilar things have in common is Internet popularity. Pop stars like Lady Gaga and Justin Bieber cause massive fluctuations in web traffic whenever they release new material. And the other websites? These are the behemoths of the web, responsible for the vast majority of web traffic. It is rumored that Netflix alone accounts for 25% of all Internet traffic. These pop stars and websites generate page views and video streams that number in the billions. Yes, that is billion with a “b.” The amount of traffic these sites and pop stars generate boggles the imagination. The most amazing thing is that there are web infrastructures out there that support this kind of traffic.

When I think of a company like Netflix, I am intrigued by just how they make a site like that work. I am not sure what type of systems you are building, but I bet they are not nearly the scale of the systems that the engineers at Netflix work on every day. So when opportunity knocks and presents the ability to learn how a site like Netflix is engineered, I tend to listen. At this year's QCon (Nov 14-15, 2011), the opportunity to hear Netflix engineers discuss their systems presented itself. The nice thing about QCon is that it attracts developers and engineers from some of the largest websites on the planet. I fondly recall watching Joe Stump talk about the complexities of writing code against a social website like Digg. I had a great time hearing about the scale of even a minor sub-portion of the Facebook site. And this year I was fascinated by two different talks on the Netflix website.

I will probably never work on a site of the magnitude of Netflix, but I can definitely learn a lot from how large websites like Netflix are engineered. I can also learn about tools and techniques that I can employ in my applications today. So it was with some excitement that I sat down to enjoy two distinct sessions on the Netflix infrastructure.

The first session was called “Keeping Movies Running Amid Thunderstorms!” It was all about the overall architecture of Netflix's Amazon EC2 infrastructure, and it gave a great overview of how Netflix went from a classic load-balanced system to a highly layered distributed system. We learned a lot about the problems that can occur in environments like this.

One concept we learned about was the “thundering herd.” An example of a thundering herd can occur when a server starts returning errors and the client machines that make those requests queue up and retry. The herd forms as new traffic keeps pouring in while the old queued-up requests retry. This type of problem can cause cascading server failures as machines time out.

I had never heard of this before, and I loved the solutions Netflix uses to combat issues like it. Netflix is hosted on Amazon's EC2 cloud, which provides the ability to stand up new server instances in rapid succession. To prevent thundering herds, Netflix monitors server utilization: should any server's CPU utilization rise above 50%, they auto-provision more servers. Servers are provisioned rapidly in order to accommodate the additional traffic quickly. Then, as traffic subsides, servers are taken down more slowly than they were provisioned. This gradual unprovisioning helps ensure that they don't cause further problems by taking resources away too quickly.

Now, I don't think the systems I work on will have this problem, but I did learn some important concepts: that good metrics, actively monitored, let you make intelligent decisions about your servers, and that those metrics can drive automated action to solve problems. Basically, I learned concepts that I might not have considered, or concepts I had only an inkling about and needed reinforced.
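To make the scale-up-fast, scale-down-slow idea concrete, here is a minimal sketch of that kind of asymmetric policy. This is not Netflix's code; the 50% threshold comes from the talk, while the doubling factor, the 30% scale-down floor, and all names are my own illustrative assumptions.

```python
def desired_capacity(current_servers, cpu_utilization,
                     scale_up_factor=2.0, scale_down_step=1):
    """Return the server count we want for the next monitoring cycle.

    Scale up aggressively (double capacity) the moment CPU crosses the
    50% threshold, so a thundering herd of retries meets fresh capacity.
    Scale down slowly (one server at a time, and only once utilization
    is comfortably low) so resources are never yanked away faster than
    the queued-up retries can drain.
    """
    if cpu_utilization > 0.50:          # threshold from the talk
        return int(current_servers * scale_up_factor)
    elif cpu_utilization < 0.30:        # assumed "comfortable" floor
        return max(1, current_servers - scale_down_step)
    return current_servers              # in between: hold steady
```

The asymmetry is the whole point: provisioning errors on the side of too much capacity, because under-provisioning during a retry storm is what cascades into failure.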

Next I attended a session on the Netflix API. The number of API calls the Netflix API serves is staggering: in the last 18 months, it has gone from a billion API calls a month to a billion API calls a day. The amount of growth in that API boggles my mind.

The Netflix API was originally created for everyday developers like you and me. The idea was to foster the creation of innovative solutions using the data aggregated and provided by Netflix, and there are dozens of useful sites out there that take advantage of this API. But something happened along the way. Devices! The number of devices capable of streaming Netflix movies is overwhelming. Every gaming console can stream movies, your mobile phone can probably do it, the iPad can do it, and even some Blu-ray players can stream Netflix. With this explosion in the consumer space, the number of hits the Netflix API supports has exploded as well.

This explosion has presented some distinct challenges, and one of them comes from the original design of the API itself. The original Netflix API was, and still is, significantly chatty, and the major downside of that chattiness is expense: serving up a billion API calls a day is costly in server and networking terms. The Netflix team determined that the majority of the traffic comes from devices, and it is through relationships with the device manufacturers that Netflix has created its next-generation API. They built a packaging architecture in which a device makes a single call; that call fans out into a bunch of API calls on the server and returns a single package of data. The most interesting thing is that the packages are actually bundles of code written in Groovy that run on the JVM on the server. I was intrigued by the idea of creating bundles of code to do work. The Roslyn project at Microsoft might provide some similar benefits to .NET developers considering scenarios like code bundles.
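The packaging idea is easier to see in code. Netflix's real bundles are Groovy scripts running on the JVM; the sketch below is a Python stand-in for the concept only, and every function and field name in it is hypothetical, not part of any Netflix API.

```python
# Stand-ins for the fine-grained, "chatty" internal API calls that a
# device would otherwise have to make one network round trip each.
def get_queue(user_id):
    return ["movie-1", "movie-2"]

def get_recommendations(user_id):
    return ["movie-9"]

def get_artwork(movie_id):
    return "https://example.invalid/art/" + movie_id + ".jpg"

def device_home_screen(user_id):
    """One device-facing endpoint that bundles several internal calls.

    The device makes a single request; the fan-out to the fine-grained
    API happens server-side, and one package of data comes back shaped
    for that device's screen.
    """
    queue = get_queue(user_id)
    recs = get_recommendations(user_id)
    return {
        "queue": [{"id": m, "art": get_artwork(m)} for m in queue],
        "recommendations": [{"id": m, "art": get_artwork(m)} for m in recs],
    }
```

The design win is that the chatty traffic moves from the expensive device-to-server network hop onto the server's local network, and each device type can get a package tailored to exactly the data it needs.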

One of the things that still excites me about being a software developer is the ability to analyze how other people solve difficult problems. While I may never engineer systems as large as Facebook or Netflix, I can take the knowledge they have gained and use it in my own projects where it is relevant. I love learning new concepts, and I highly recommend you seek out sessions on subjects that might not be 100% relevant to your environment. You will learn something useful if you look closely enough.