A CTO’s Thoughts about Amazon’s Cloud Crash
FreeWheel CTO, Diane Yu, Offers Her Perspective on Amazon's Cloud Woes
Posted 04.28.11

As CTO of FreeWheel, I am often challenged on my decision to have our global operations team run and maintain our own colocation data centers. Why not put everything on a cloud? This question has come up during all stages of FreeWheel’s development, and my answer has varied as we have grown.

During the early days of FreeWheel, we didn’t pick a cloud solution like Amazon EC2, because such services were not mature enough and weren’t yet proven in the market. We were very small when we first launched; if the Amazon cloud had been as big as it is today, I would have thought very hard before deciding to rent a colo. The cloud is actually a perfect choice for a small start-up. You can quickly launch your own website or service without much capital commitment, and speed and minimal capital commitment are critical factors when launching a business.

After FreeWheel's Monetization Rights Management® product launched, we quickly grew and signed up many brand-name customers. As part of those contracts, we sign Service Level Agreements (SLAs), and when you are talking about a few 9s in your up-time requirement, a cloud solution no longer works. To put everything in a cloud would mean competing for resources with small businesses or personal blogs that never worry about response speed or up-time; as long as the service doesn’t go down all the time, it is good enough for them. That is not acceptable for us. With a customer portfolio like ours, I would fire myself if I were to put everything on a cloud, because that would mean I am not taking the quality of service for our customers to heart. A cloud service is not good enough for mission-critical business. It can, however, be a perfect alternative for the non-mission-critical pieces of your business, and a perfect solution for handling spiky traffic overflow.
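To make "a few 9s" concrete, here is a minimal sketch of the arithmetic behind an up-time requirement. The availability targets are illustrative assumptions, not figures from any actual FreeWheel SLA, and the function name is hypothetical.

```python
# Illustrative arithmetic only: the targets below are assumed examples,
# not figures taken from any actual SLA.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an up-time target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% up-time allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.9% target leaves roughly 525.6 minutes a year, which is less than nine hours. An outage that lasts a few days would use up several years of that budget at once.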

Just recently, I was talking to our Chief Architect, asking him to investigate what it would take to expand our testing environment on the Amazon cloud. Even today, when everybody is talking about the Amazon cloud crash, my opinion has not changed. Choosing the cloud means accepting that it is no big deal if your data is lost permanently or if the service goes down for a few days. When you make the choice of going with the cloud, you know whole-heartedly that this is going to happen one day, and you have to be OK with it. Test and development environments fit these characteristics.

I find people often blame technology for problems when in fact it is the decision of where to apply the technology that should be blamed. People have come to me and asked, “Why did you choose Ruby on Rails (ROR)? We have had so many problems with it. It is crappy.” I would ask, “What do you use ROR for? We chose ROR for speed of development in UI, not for high-performance applications.” ROR should not be chosen for high-traffic website development or for high-performance servers. If I were to choose ROR for our ad server development to support billions of daily transactions, then my crappy choice would be to blame. ROR would not be at fault. The same is true if you put all of your mission-critical data or services on a cloud, and one day you find that you have lost all of your data or that the service is completely down. It is the decision to do so that is to blame.

Am I arguing that Amazon is not at fault? No. They claim perfect data back-ups and guarantee that no data will be lost in their state-of-the-art cloud, and the fact that they could not keep that promise is indeed Amazon’s fault. I find it hard to believe that they don’t have “near-real-time” off-site data back-ups. To prevent a complete data loss should anything happen in one data center, a redundant data center should be set up to receive a copy of that data. This is how we set up our technology at FreeWheel. If our main site goes down, we can use the data copied to another data center to recover.

What I suspect happened with Amazon’s data loss is that corrupted data at one site was replicated to a remote data center before anyone caught it, so the remote back-up was not good either. If this is true (Amazon has yet to offer an official explanation), then their data integrity monitoring is to blame.
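As a rough illustration of the kind of integrity check suspected to be missing, the sketch below has a back-up site accept a copy only when it matches a checksum recorded at write time, and otherwise keep its last known-good copy. This is an assumption-laden toy, not a description of Amazon's or FreeWheel's systems; `BackupSite` and `replicate` are hypothetical names.

```python
import hashlib
from typing import Optional


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the data."""
    return hashlib.sha256(data).hexdigest()


class BackupSite:
    """Hypothetical remote site that only accepts verified copies."""

    def __init__(self) -> None:
        self.last_good: Optional[bytes] = None

    def replicate(self, data: bytes, expected_checksum: str) -> bool:
        """Store the copy only if it matches the checksum recorded at write time."""
        if checksum(data) != expected_checksum:
            # Corrupted copy: refuse it and keep the previous good back-up.
            return False
        self.last_good = data
        return True


if __name__ == "__main__":
    site = BackupSite()
    original = b"ad delivery log, batch 1"
    digest = checksum(original)  # recorded when the data was first written

    print(site.replicate(original, digest))                     # intact copy
    print(site.replicate(b"ad delivery l0g, batch 1", digest))  # corrupted copy
    print(site.last_good == original)
```

A check like this only catches corruption that happens after the checksum was recorded. Data that was already wrong when it was written would still be replicated faithfully.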

UPDATE: Amazon has since responded to explain their cloud crash. You can read their comments here.

Diane Yu, Co-founder, Chief Technology Officer

