Thursday, September 18, 2014

Migrating an RDS instance into VPC

I really like Amazon's Relational Database Service - I've used it for a few years now, and especially like that I don't have to worry about managing backups and maintenance for patches on the server.  I've done several restores to a point in time, mostly to debug something that happened at that moment but could not be replicated later (because the db state had changed) - and that has proven to be a great feature.

I have used RDS long enough now though that Amazon's push to use VPC has got me wanting to move everything into VPC.  They don't have an easy way to migrate your RDS server into VPC, though.  Well, they do, but it involves taking your server offline (or switching it to read-only) while you dump the database and upload it into your new server.  This, to me, is unacceptable - a large database could experience hours of downtime using this method.  I'm used to making these transitions with zero downtime, and I intended to keep that up with my own migration to VPC.  Here are the steps I've taken.

It is important that you are using a CNAME for your DB connections.  While you may be able to go through and update each connection script, using a CNAME will enable you to make a switch very quickly.  To this end, I will admit that I have a pretty good control on the write activities - only one process generally is writing to my database.  So I am not at a risk of having one server write to the old database while another writes to the new as the CNAME update is propagating.  Still, I set the TTL to a very low value (60 seconds) while the change is being made, to ensure that the old DB server stops getting writes as soon as possible.

So, on to the process.  What we'll do is:

  1. Update our master server's binlog retention time
  2. Create a replication user on our master server
  3. Create a slave server from our master
  4. Stop the slave on the server
  5. Snapshot the slave's database
  6. Restore the snapshot into VPC
  7. Set the snapshot to be a slave to the original master
  8. Update the CNAME to point to our new database
  9. Stop the new server's slave once the CNAME has propagated
  10. Shut down the original master
I'm not going into the steps needed to prepare your VPC for a database server, which involve setting up a subnet group in the RDS console. If you haven't already, you might just go through the process of launching a fresh RDS instance in your VPC to tackle what needs to be done there. AWS does a fairly good job of helping you through this process, and what they don't tell you can easily be found in their documentation.

So, here we go.

Update our master server's binlog retention time

If you are not familiar with how MySQL master/slave replication works, it is fairly simple; as queries that update the data on the master server are executed, they are written to a log file which is sent to the slave databases who then execute the same queries on their database. You can see how far out of sync your slave server is from its master by executing the SHOW SLAVE STATUS command. You can see the current location on your master by executing the SHOW MASTER STATUS command. What we are going to do is snapshot our main database and then slave the new server to our original master until it catches up to it, then switch to use the new server. However, the default binlog retention time is very short (the binlog - binary log file - is the file that the master writes to for the slave. You don't want to keep this file around forever because it would grow to be huge, but you need it long enough so that your new slave can start reading it when we bring it up).  I have set mine to 24 hours to make sure I can attach to my log file with plenty of time to spare.

This is not a configuration you can set in your MySQL parameters, but you can call a special command embedded in the RDS MySQL:

call mysql.rds_set_configuration('binlog retention hours', 24);

Create a replication user on our master server

RDS will create a replication user for the slave that it creates, but we need to create one that we know the password to so we can connect with that user.  You can create a user in phpMyAdmin if you have that installed, or you can do the following from the command line:

CREATE USER 'replication'@'%' IDENTIFIED BY '[password]';
GRANT REPLICATION SLAVE ON *.* TO 'rep'@'%' IDENTIFIED BY '[password]' WITH MAX_QUERIES_PER_HOUR 0 MAX_CONNECTIONS_PER_HOUR 0 MAX_UPDATES_PER_HOUR 0 MAX_USER_CONNECTIONS 0;

Be sure you keep the replication username and password handy, you'll need it in a little bit.

Create a slave server from our master

Now we're ready to create a read-only slave of our database server.  The reason we are going to do this is because we want to make sure we have a valid snapshot in a point in time and we want to know what that point is, which we'll be able to identify from our slave.

Name it something relative to your main server, but don't think too hard about it, we won't keep it around for long. This step may take a while depending on how big your database is, but we have to wait now until the slave server has been created and is available.

Stop the slave on the server

Now that we have a slave, we're going to stop it from propagating from the master. Normally we would use a "STOP SLAVE" command, but our RDS credentials won't allow us to do that. Fortunately, they have given us another RDS-specific command to stop the slave:

CALL mysql.rds_stop_replication;

This will stop our slave.  We want to check and see what the status of our slave is, and note some very important values that we'll use when we bring up our new server:

show slave status\G

The values we are looking for are Master_Log_File and Read_Master_Log_Pos.  Make a note of these values with your replication user info.

Snapshot the slave's database

Now we're ready to take our slave's database and prep it to become our new database. Right-click on the slave in the console and select Take DB Snapshot.

Name the snapshot something about moving your database into VPC, and then wait for it to be completed.

Restore the snapshot into VPC

Once you have a snapshot, it's time to bring up your new database server in the VPC. At the time of this writing, there appears to be some sort of a bug in the management console that is preventing a restore of a snapshot into a VPC, so I had to do this from the command line.  Perhaps you will be more fortunate. Here is the command line call I used:

rds-restore-db-instance-from-db-snapshot [new database name] -sn [subnet group] -s [snapshot name] -pub true

It is important to make this database publicly accessible, because otherwise it will not be able to access our master. To this end, be sure your master is publicly accessible as well (at least to the IP address of the new server - because otherwise we won't be able to attach our slave to it).   You might want to include additional options from the rds-restore-db-instance, such as the instance class or option group, but you can also change those via the GUI before switching to the new database.

Once again, wait a bit until this server completes its launch.  We're almost there!

Set the snapshot to be a slave to the original master

Now connect to your new database server that is in the VPC. We need to turn this server into a slave of our existing master server where our previous slave left off.  Once again, we don't have the rights to use the CHANGE MASTER command, but we do have a handy RDS command:

CALL mysql.rds_set_external_master ([host], 3306, [user], [password], [logfile], [log-pos], 0);

where "host" is the very original (still live) database, "user" and "password" are the credentials to your replication user, and "logfile" and "log-pos" are the values we got from our slave server.

Now we need to start our slave, again using an RDS command:

CALL mysql.rds_start_replication;


If we call SHOW SLAVE STATUS at this point, we should see information indicating that our slave is running ("Slave IO Running" and "Slave SQL Running" should both be "Yes").  If this is not the case, the most likely problem is that your two databases cannot see each other; make sure you have a network path between them.

Update the CNAME to point to our new database

Once our slave server has caught up to our master (see the "Seconds behind master" attribute in the SHOW SLAVE STATUS output) we can migrate our CNAME to point to our new database. If all your applications use the CNAME, they will all switch within the TTL.  If you have the potential for two systems to connect at different times, you may need to be more cautious.  Depending on how likely this is and how much trouble it could cause, you might need to treat this as a master-master setup by changing the auto_increment_increment and auto_increment_offset values for the two servers - but that is beyond the scope of this tutorial.

One step that you might want to do before actually changing the CNAME is to launch a copy of your site (if your site is on AWS then just make an Image of your site and launch it), and go into that server to set the database connection directly to your new server. This will give you a chance to verify that your new database server is in a security group that your web server(s) can access. If you are unable to connect from your web server, you will probably need to check your security groups across the board to figure out why.

At this time, you will want to wait until all your servers have propagated to your new database. If your TTL was 5 minutes, wait at least 5 minutes.  After a while, you should be able to check your slave status and see that the master log position is not moving even if there has been activity on your site. When you are convinced that your old database server is no longer being used, you are ready to go on to the next step.

Stop the new server's slave once the CNAME has propagated

Your new server no longer needs to be connected to the slave - so we can stop it:

CALL mysql.rds_stop_replication;

and reset our slave info:

CALL mysql.rds_reset_external_master;

Now your old server is done and your new server is active!

Shut down the original master

At this point your original master and slave are ready to be shut down. I always take a snapshot as AWS advises just in case something goes wrong with the new server that was missed, or if for some reason I accidentally shut down the wrong server ... 

So there you have it, you have successfully moved your RDS server from the classic network to your VPC!