Insane hiring, young talent and marketing dreams
This is a cross post from our Slog Blog -
http://pagalguy.com/slog/2009/05/05/young-talent-insane-hiring-and-a-marketing-dream/
Backing up woes
You’ve got your website up and chugging along very well (on shared hosting, a dedicated server or a VPS) and life is good: the traffic is increasing and you are hiring. Then one fine day the site crashes. You panic, check with your hosting provider and are told that it was a server crash. The hard drives didn’t survive and your account will be restored from a backup.
The backup could be a day old, a week old or a month old. It also matters whether the backup was stored on the server itself, on a different server, on an off-network backup setup, or on some combination of these. Check with your provider about the backup setup, because if your data really matters to you, you should be careful and proactive enough to find the solutions that work best for you.
Why o why?
Some of the backup systems mentioned above may not be enough to save your data if the server is compromised. If you back up all your stuff onto the same server and a hacker finds a way into your machine, there is a chance he will destroy both your primary data and your backups. You’re left with nothing after this, unless you had the foresight to keep other forms of backups as well.
Let’s say you backed up your stuff to another server in the same data center (DC). You are still not scot-free. The DC could go down or get raided by the FBI (see the HA post). Or, more simply, the compromised server could be connected to the backup server through passwordless, key-based SSH login (kinda required for easy setup of regular rsync backups). Here again you could lose all your data. Kinda scary, isn’t it?
All this makes a case for off-network backups, which cover you when the DC goes down, the FBI raids it, an earthquake or flash flood hits, or err.. an errant truck crashes into a pole and takes the DC’s electricity offline.
But wait: if you still allow your primary machine to log in to your external backup system without a password, a hacker can take out your data and your backups as well. Whether your external backup lives inside the DC or outside it, work with solutions where nobody can log in to the backup systems without knowing the login/pass, and where the login/pass is never stored on the primary machine. Take a look at solutions like EVault or R1soft (we use this) to back up all your servers/accounts to an external provider.
We use R1soft because of a few advantages it gives us. First, it does sector-level incremental backups, so it doesn’t use too much outbound bandwidth: it transfers only the data that has changed (well, much like rsync). Second, it provides a control panel that lets us restore individual files and directories from any of our backups. We tend to maintain 30 snapshots of our servers at all times, and in some cases over 240 snapshots. Finally, the killer feature is bare-metal restore: say your box crashed, all you need to do is get a new box up and tell the R1soft setup to restore onto it. It will replicate everything as per the last snapshot, including the OS. Kinda life-saving if you ever need it. If you folks use any other backup setup, I’d love to hear about it.
You might get all of this right and still lose out: I’ve seen cases where all the hard work went down the drain because the integrity of the backups was not verified on a regular basis. You should also try restoring your backups onto a spare server from time to time to make sure you have gotten it right. There are various backup options available today, open source and commercial, but the above are some of the problems we take seriously with our data, and we work accordingly. It is never possible to have a 100% secure setup (someone just needs to find one loophole or exploit, while you have to continuously patch hundreds of them), but do spend the time and effort to build a backup system that appropriately reflects how much you value your data. There is no need to spend a bomb on a backup system when you may be fine with losing one day’s worth of data.
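As a rough idea of what regular verification could look like, here is a minimal sketch. It assumes your backups are plain files in a directory and that you record a checksum manifest when the backup is taken. The function names and the manifest format are made up for illustration; they are not part of R1soft or any other product.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir: Path) -> dict[str, str]:
    """Record a checksum for every file under backup_dir (run at backup time)."""
    return {
        str(p.relative_to(backup_dir)): sha256_of(p)
        for p in sorted(backup_dir.rglob("*"))
        if p.is_file()
    }


def verify_backup(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of files that are missing or whose contents changed."""
    problems = []
    for name, expected in manifest.items():
        candidate = backup_dir / name
        if not candidate.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(candidate) != expected:
            problems.append(f"corrupt: {name}")
    return problems
```

An empty list from `verify_backup` only tells you the files are intact. It does not replace an actual test restore on a spare server, and the manifest itself should live somewhere an attacker on the primary machine cannot reach.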
Just as you think critically about HA, your backup solutions too need to be thought through in a manner that reflects the importance you accord your data.
Why write code? Because I can test it :)
OK, I might have slightly exaggerated my motivation for writing code, but my fascination with test-first development came from learning lessons the hard way. Later in the post I shall explain test-first development in more detail, but before we get there let us come back to the question: why do we write code? We all have myriad reasons, ranging from “programming jobs are plentiful” to “writing code is just so liberating”, but I will repeat what Frederick P. Brooks said:
The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.
If the above paragraph seemed like Shakespearean language, it simply means: “When solving a problem, whatever solution you have in your mind, it is so easy (relative to other media) to put it into code. Since implementing your solution is trivial, the majority of your time is spent on finding a solution to the problem rather than grappling with implementation details. The sheer smoothness of the medium leads to the joy of coding (I added this).”
But code has a tendency to grow beyond initial expectations, quickly surpassing our capacity to fully understand it. We have all experienced what the end phase of a project looks like. Development slows down to a crawl. We are all busy taming many-headed bugs. We wrestle with somebody else’s code just to give up or fully re-write in despair. We sit in frustration as we see countless hours of work go up in smoke, while people from other departments can’t understand why programmers are taking so much time to finish off the last 1% of the project.
This doesn’t even remotely sound like the utopian land described in the paragraphs above. If we still have to spend an obscene amount of time grappling with implementation details, then maybe coding isn’t as utopian as some people suggest. OK, a very sad truth, but we have to move on. So give up coding, update your resume and try your hand at some other job. Wait, I am just kidding. There are ways to bring the fun back into coding, and one of them is test-first development. It is a simple process which might take some time to get used to. No, I am not kidding this time: once you are there, the process is very enjoyable, and as a side effect it will solve most of the problems mentioned in the previous paragraph.
Test-first development is a well-documented process. It consists of very rapid, short cycles. In each cycle you write some tests, write just enough code to pass those tests, and then improve the design. This is famously known as the red-green-refactor cycle. Let me explain each step in more detail.
1. Red :- Suppose you are about to write a class. The first thing you need is a very good understanding of what the class should do and what it should not do. Then you write a test to validate one of the behaviors. Don’t complicate anything. Take a simple behavior and write a small example as a test to validate that behavior. At this stage you have not yet written any code in the class itself. Now run the test and the test bar should turn red.
2. Green :- Now go to the class and write just enough code to make the test pass. Nothing more. Don’t worry about reusing or repeating any code. We will get there soon. Just concentrate on writing enough code to pass the test. Now run the test again and see the test bar turn green. Yahoo! Congratulations, you have just written well-tested code.
3. Refactor :- Now that you have a test, go back to the code and see if there is any duplication or anything that makes the code look ugly. Go ahead and change the code to your heart’s content, making reusable components and beautifying the code. Don’t worry about breaking anything. You have already written a test and it will tell you if you break anything. Now repeat the whole red-green-refactor cycle for the next behavior.
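To make the cycle concrete, here is a small sketch using Python’s built-in `unittest` module. The `ShoppingCart` class and its three behaviors are invented purely for illustration; what matters is the order in which things were written. Each test was written first and seen to fail, and only then was the matching bit of the class filled in.

```python
import unittest


# Written last: each method appeared only after a test below had failed (red),
# with just enough code to make that test pass (green).
class ShoppingCart:
    def __init__(self):
        self._prices = []

    def add(self, price):
        if price < 0:
            raise ValueError("price must not be negative")
        self._prices.append(price)

    def total(self):
        # Refactor step: an earlier hand-written loop was replaced by sum()
        # once the tests were in place to catch any breakage.
        return sum(self._prices)


class ShoppingCartTest(unittest.TestCase):
    # Cycle 1: the simplest behavior, written before ShoppingCart existed.
    def test_new_cart_totals_zero(self):
        self.assertEqual(ShoppingCart().total(), 0)

    # Cycle 2: the next behavior, one small step further.
    def test_total_is_sum_of_added_prices(self):
        cart = ShoppingCart()
        cart.add(10)
        cart.add(5)
        self.assertEqual(cart.total(), 15)

    # Cycle 3: what the class should NOT do.
    def test_negative_price_is_rejected(self):
        with self.assertRaises(ValueError):
            ShoppingCart().add(-1)


def run_tests():
    """Run the suite and return the result object (the 'test bar')."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ShoppingCartTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests().wasSuccessful()` tells you whether the bar is green. If you delete the body of `total()` and run it again, the bar goes red and points at exactly the behavior you broke.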
Each cycle will hardly take a couple of minutes, and when you are productive you can cover up to 20-40 such cycles per hour. Let’s see how this process solves a couple of the problems we discussed at the start of the article.
First, you are working in baby steps, constantly checking whatever you have written (“The bar should turn red now... now it should turn green... now it should still be green... now it should turn red again...”). If you make a mistake you will catch it right away, and since you know the mistake is within the couple of lines of code you have just written for the latest test, it is very easy to identify and correct. No more frustration while reading through your code trying to figure out where you made a mistake and what you were thinking when you wrote it. One major hassle gone.
Second, suppose that after you commit your code along with the tests, one of your teammates writes some code in his own class that breaks some of the functionality you have written. Don’t worry: the test will fail and the test bar will turn red. Time to stop whatever you are doing and huddle to fix the mistake then and there, until the bar turns green. This by itself will remove most of the frustrations that usually lead you to “wrestle with somebody else’s code just to give up or fully re-write in despair”. We all know that finding mistakes, not fixing them, is the most expensive part of programming, so it pays to have them found immediately.
This process solves most of the common problems that are the major culprits in making you a bad programmer. Next time I shall come up with one more process, which we are setting up now and which should improve your code quality further. Hola, and have a nice time trying out test-first development. If you want more material to read, just google “Test Driven Development” and you will find enough to quench your thirst. If you need any help, please put it in the comments and I will be glad to help.
High Availability
Every time we grow and traffic sets new records, we never fail to be happy at the need to add more servers to our rack. The happiness, however, very soon turns into a cringe, because I know it is going to cost more to add all these servers, and it is even tougher to keep this set of servers chugging along at a good speed, secure and easy to operate.
I’ll focus on the high availability (HA) challenges of such a setup and why the costs get very steep very soon in case you require HA. The challenge for a startup is to work on an architecture that allows you to start at the right scale and then extend it as painlessly as possible.
For an early stage startup with minimal traffic, you can get by with a single server handling files, databases, emails and the webserver as well. After you’ve grown for a while, the database has the biggest chance of becoming your bottleneck, unless you are serving tons of files really fast. Now you would need to put in a separate server which just runs your MySQL installation. Right after that you would realize you need more machines on the frontend to serve all the awesome goodies.
This process is vicious, and very soon you will have a couple of machines up front acting as the front end and a couple of database servers. To keep MySQL playing nice, you should shard your data across multiple machines and/or put up slave MySQL servers to which you direct your application’s reads. You can pat yourself on the back for generating this much traffic, but the next set of challenges is just starting.
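Sending reads to the slaves can be as simple as picking a host per statement. The sketch below is only an illustration of that idea: the `QueryRouter` class and the host names are hypothetical, it only looks at the first keyword of the statement, and it leaves the actual database connection out entirely.

```python
import itertools


class QueryRouter:
    """Pick a database host per statement: writes to the master, reads to slaves."""

    # Statements treated as reads; everything else goes to the master.
    READ_KEYWORDS = ("SELECT", "SHOW")

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; with no slaves, everything uses the master.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def host_for(self, sql):
        words = sql.split(None, 1)
        first_word = words[0].upper() if words else ""
        if self._slaves is not None and first_word in self.READ_KEYWORDS:
            return next(self._slaves)
        return self.master


router = QueryRouter("db-master", ["db-slave-1", "db-slave-2"])
read_host = router.host_for("SELECT * FROM posts")
write_host = router.host_for("INSERT INTO posts VALUES (1)")
```

One thing to keep in mind with any setup like this: slaves can lag behind the master, so a read that must see a write made a moment ago may still need to go to the master.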
This is the time you start worrying about single points of failure. HA simply means your system stays up and alive even if certain parts of the infrastructure come down on you. Hopefully you put up a hardware load balancer or a failover system with software load balancers (HAProxy comes to mind), because if you didn’t and used a single load balancer on one server, all it takes for your entire site to go down is for that one server to go down. All your front-end/back-end servers come to naught when you don’t load balance/fail over your load balancer.
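The monitoring half of a failover setup boils down to “check each load balancer, use the first one that answers”. Here is a minimal sketch of just that decision, assuming a plain TCP connect is a good enough health check. The function names and host names are made up, and the part that actually moves traffic to the surviving load balancer is left out.

```python
import socket


def is_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_active(balancers, check=is_alive):
    """Return the first (host, port) pair that passes the health check, or None."""
    for host, port in balancers:
        if check(host, port):
            return (host, port)
    return None


# Hypothetical pair of load balancers, in order of preference:
# choose_active([("lb-1.example.com", 80), ("lb-2.example.com", 80)])
```

The `check` parameter exists so the decision logic can be exercised without touching the network, for example `choose_active(pairs, check=lambda host, port: host == "lb-2")`. Of course, whatever runs this check must not itself sit on a single machine, or you have just moved the single point of failure.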
You were operating on a really tight budget, and here the major chinks in the architecture start showing up. What if the master MySQL server goes down? Usually these are fairly expensive, beefy machines, but the fact is that hardware goes down, and it usually does so when you are least prepared (hail, Murphy!). Now come the choices you need to make. Do you put up a similar beefy MySQL server that waits for the 1% of the time your server is down, and then fail over to it? Or do you err.. just chalk that 1% downtime up to.. err.. you know.. keeping it cheap? Or you could make MySQL HA by using a master + master configuration. However, if you need to take frequent backups of your data, you may need a master + master + slave configuration, so that you can shut down the slave for a brief period, take all those important backups and send them to your backup systems. While adding machines to the setup, do ensure all the servers are on a private VLAN connected with gigabit cards/networks. You get fast interlinks, plus you don’t have to pay for the internal bandwidth transfer.
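To put that 1% in perspective, here is a quick back-of-the-envelope calculation. The function name is ours and it assumes a 365-day year; plug in whatever availability figure you are weighing against the cost of a standby server.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


one_percent_down = downtime_hours_per_year(99)
```

At 99% availability that works out to about 87.6 hours a year, or a little over three and a half days. Whether that is an acceptable price for “keeping it cheap” depends on what an hour of downtime costs you.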
Let’s also talk about backups. You do have backups, dontcha!? Backing up your files/systems etc. on the front-end servers shouldn’t be too tough. You could rsync them over to another location and keep off-network backups, or use bare-metal restore solutions like R1soft and have the ability to restore entire servers very easily. The cost of additional backup servers and the bandwidth needed to take snapshots at short intervals will drive your decision as you continue to keep everything safe and secure. Explore multiple backup systems and do practice restoring from them; that will be instrumental in getting you back online fast. Keep a copy of your data backed up off network. This is critical: you don’t know when your DC could be raided by the FBI, and you might lose all your servers. All right, if it’s not the FBI, it could be an earthquake, bankruptcy, power surges, building fires, whatever. If your company means more to you than the monthly server costs, back up, relentlessly.
As you continue to grow and think HA, make sure you have no single point of failure that jeopardizes all your hard work. One site I knew had a fairly comprehensive HA setup, but they kept their DNS on one server. Sadly, when that went down, nothing remained accessible. A gentle reminder that when going HA, you should work hard to identify all such possibilities and remove such dependencies as soon as you can.
