Permission boundaries are hard, especially with databases. You need them hidden away in private subnets, yet you still want highly available access to them without hassle.

Traditionally, you would use a bastion host (AKA jump server) in a public subnet to reach your resources in private subnets, which works for its purpose. But managing these servers is cumbersome and annoying: they live with public DNS, and they typically enforce security over SSH, which means admins must manage SSH keys and keep them in sync with their IAM policies.

Enter AWS Session Manager, AKA SSM. This tool has been widely blogged about, as it grants access to servers through IAM policies instead of SSH keys, and a quick search turns up plenty of great resources.

Even with those awesome resources, it wasn't immediately clear how to get started, especially with modern infrastructure management practices like Terraform. And some of the tools, like ssh-over-ssm, require significant prerequisite knowledge to make use of them.

Just about everyone on the planet with RDS instances wants to access them from a local port, so the goal of this codelab is to explore how to get secure access from scratch. It will explore some older ways of getting access, which should help explain why the industry has converged on the current best-practices approach: combining EC2 Instance Connect with SSM.

At each step, I will create a fully working example with a basic security setup. The goal of this tutorial is to create near-production-ready examples.

Now for a haiku:

Start with the basics
Security can be hard
It takes time to learn

Making a VPC

To start with, let's whip up a quick RDS Server in a private subnet with no internet access. In a new terraform module, add the following code:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 2.18.0"

  name = "codelab-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["eu-west-1a", "eu-west-1b"]

  # For the bastion host
  public_subnets = ["10.0.101.0/24", "10.0.102.0/24"]

  # For the RDS Instance
  database_subnets = ["10.0.1.0/24", "10.0.2.0/24"]

  # Allow private DNS
  enable_dns_hostnames = true
  enable_dns_support   = true
}

This describes a VPC we want to put all of our resources from this codelab into. It creates two public subnets to hold the bastion host and two private subnets to hold the database. It also enables private DNS so that our bastion server will be able to reach the RDS endpoint.

To actually create these resources, run terraform init followed by terraform plan -out plan. If the plan looks good to you, you can apply it with terraform apply plan.

Making an RDS Instance in that VPC

Now, let's create an RDS instance. In the same terraform file, add:

module "db" {
  source  = "terraform-aws-modules/rds/aws"
  version = "2.5.0"

  # Put the DB in a private subnet of the VPC created above
  vpc_security_group_ids = [module.db_security_group.this_security_group_id]
  create_db_subnet_group = false
  db_subnet_group_name   = module.vpc.database_subnet_group

  # Make it postgres just as an example
  identifier     = "codelab-db"
  name           = "codelab_db"
  engine         = "postgres"
  engine_version = "10.6"
  username       = "codelab_user"
  password       = "codelab_password"
  port           = 5432

  # Disable stuff we don't care about
  create_db_option_group    = false
  create_db_parameter_group = false

  # Other random required variables that we don't care about in this codelab
  allocated_storage  = 5 # GB
  instance_class     = "db.t2.small"
  maintenance_window = "Tue:00:00-Tue:03:00"
  backup_window      = "03:00-06:00"
}

module "db_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "3.1.0"

  name   = "codelab-db-sg"
  vpc_id = module.vpc.vpc_id

  # Allow incoming Postgres traffic from anywhere in the VPC
  ingress_cidr_blocks = [module.vpc.vpc_cidr_block]
  ingress_rules       = ["postgresql-tcp"]

  # Allow all outgoing HTTP and HTTPS traffic for updates
  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["http-80-tcp", "https-443-tcp"]
}

You can use any other config you'd like for the RDS instance; my setup here is just meant to be as simple as possible, with a small Postgres instance. It is important to note that the security group allows incoming TCP on port 5432 from within the VPC, so that we can query the database through the bastion host.

Run another terraform init, then terraform plan -out plan. If the plan looks good, apply it with terraform apply plan. It can take up to 40 minutes to provision a new RDS instance (in my case it took 9 minutes), so please be patient. We only have to do this once.

Wrap Up

Woohoo! We now have a database we can use to test bastion configurations with :) It is not accessible from the public internet, and it has a strict security policy.

Here's a haiku about RDS Instances in VPCs:

Made a VPC
Also made an RDS
One in the other

Intro

The RDS instance isn't super interesting yet, because it doesn't have any tables, data, or access set up. Because our security group is so strict and the instance has no public endpoint, we can't directly query it in any way yet.

In this step, we will set up a standard bastion server that we can SSH to that will let us query the database.

Terraform Changes

In the same terraform file as before, add the following:

module "ssh_key_pair" {
  source  = "terraform-aws-modules/key-pair/aws"
  version = "0.2.0"

  # Feel free to change if you want to use a different public key
  public_key = file("~/.ssh/id_rsa.pub")
  key_name   = "bastion_public_key"
}

module "bastion_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "3.1.0"

  name   = "codelab-bastion-sg"
  vpc_id = module.vpc.vpc_id

  # Allow all incoming SSH traffic
  ingress_cidr_blocks = ["0.0.0.0/0"]
  ingress_rules       = ["ssh-tcp"]

  # Allow all outgoing HTTP and HTTPS traffic, as well as communication to db
  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["http-80-tcp", "https-443-tcp", "postgresql-tcp"]
}

module "bastion" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "2.12.0"

  # Ubuntu 18.04 LTS AMI
  ami                         = "ami-035966e8adab4aaad"
  name                        = "codelab-bastion"
  associate_public_ip_address = true
  instance_type               = "t2.small"
  key_name                    = module.ssh_key_pair.this_key_pair_key_name
  vpc_security_group_ids      = [module.bastion_security_group.this_security_group_id]
  subnet_ids                  = module.vpc.public_subnets
}

###########
# Outputs #
###########

output "bastion_ip" {
  value = module.bastion.public_ip[0]
}

output "rds_endpoint" {
  value = module.db.this_db_instance_endpoint
}

This creates an Ubuntu 18.04 EC2 server in a public subnet of the VPC we created earlier. Its security group is open to incoming SSH and to outgoing communication with the RDS instance. It is set up so that your local ~/.ssh/id_rsa SSH key can be used to authenticate to the instance.

After the outputs of running terraform init and terraform plan -out plan look good, run terraform apply plan to create the resources. Once that completes, we're ready to SSH onto the instance!

Establishing SSH Tunnel

To begin, run the command:

ssh -L 5432:`terraform output rds_endpoint` -Nf ubuntu@`terraform output bastion_ip`
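One subtlety worth noting: the rds_endpoint output from the RDS module already contains both the host and the port, so prefixing it with 5432: yields the full local_port:remote_host:remote_port spec that -L expects. A quick sketch, using a made-up endpoint value in place of the real terraform output:

```shell
# hypothetical endpoint, shaped like the terraform output (host:port)
endpoint="codelab-db.abc123.eu-west-1.rds.amazonaws.com:5432"

# ssh -L wants local_port:remote_host:remote_port; the endpoint already
# carries host:port, so we only prepend the local port
forward="5432:${endpoint}"
echo "$forward"
```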

Breaking this down, it says:

- -L 5432:<rds_endpoint>: forward local port 5432 to the RDS endpoint's host and port
- -N: don't run a remote command; we only want the port forwarding
- -f: go to the background once the tunnel is established
- ubuntu@<bastion_ip>: connect as the default ubuntu user at the bastion's public IP, with both values substituted in from our terraform outputs

Once complete, our tunnel has been established. Now let's create some test data!

Creating Test Data

We're going to add some test data to the RDS instance here just to make it easy to validate that we can still access the data later, once we use more complicated setups. Note that this step is entirely optional.

Install psql any way you know how, such as with sudo apt-get -y install postgresql postgresql-contrib on Ubuntu.

Then, let's create a table:

psql -d codelab_db -p 5432 \
  -h localhost \
  -U codelab_user \
  -c "CREATE TABLE codelab_table (Name varchar(255))"

and add some data:

psql -d codelab_db -p 5432 \
  -h localhost \
  -U codelab_user \
  -c "INSERT INTO codelab_table (Name) VALUES ('codelab_data')"

If you're prompted for a password, use the password we set in terraform earlier, codelab_password.

Finally, let's close the tunnel with kill $(lsof -t -i :5432), which says to kill the process that controls local port 5432.

Haiku

Do you hike? Cuz I want to haiku:

We did it. We're in!
Let the bastions grind you down?
Nah, I conquer them.

Cool AWS Services

Before September 2018, what we have now would have been considered standard, even cutting-edge among those adopting cloud services. But that September, AWS Systems Manager Session Manager (SSM) was announced, which provides secure, access-controlled, and audited EC2 management.

Source: https://aws.amazon.com/about-aws/whats-new/2018/09/introducing-aws-systems-manager-session-manager/

A related service, EC2 Instance Connect, was introduced in June 2019: https://aws.amazon.com/about-aws/whats-new/2019/06/introducing-amazon-ec2-instance-connect/. It enables IAM-based SSH access controls with CloudTrail auditing, as well as browser-based SSH from the AWS Console for those who like web GUIs.

Why they are so cool

These two products are superior to our current bastion setup for a few reasons:

- access is granted through IAM policies, so permissions live in one place and stay in sync with the rest of your AWS access controls
- connections are logged to CloudTrail, giving you an audit trail of who accessed what, and when
- there are no long-lived SSH keys to distribute, rotate, or leak

For the next step, we'll add support for SSH'ing to our instance over EC2 Instance Connect.

Haiku

Did you ever hear of the time when the villagers of a mountainous domain overthrew their government? It was called the high coup. And it went a little something like this:

AWS
Services out the wazzu
Gotta learn 'em all

Why EC2 Instance Connect is Awesome

One of the cool things about EC2 Instance Connect is that no long-term SSH keys need to live on the bastion instance. Instead, you use the aws ec2-instance-connect CLI to send a temporary public key to the instance, and you then have 60 seconds to authenticate using the matching private key. After the 60 seconds are up, the public key is forgotten.

This is more powerful than it may at first seem. What this really means is that:

- no permanent SSH keys live on the instance, so there is nothing to rotate and nothing to leak
- who can connect is controlled entirely by IAM policy, via the ec2-instance-connect:SendSSHPublicKey permission
- every key push is recorded in CloudTrail, so every connection is auditable

This requires a few quick updates to our terraform.

Updating Terraform Code

First, you can entirely remove the ssh_key_pair module, as we no longer need to manage SSH keys :)

Then, add an IAM Instance Profile that will allow your bastion to make use of the EC2InstanceConnect policy:

module "instance_profile_role" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-assumable-role"
  version = "~> 2.7.0"

  role_name               = "codelab-role"
  create_role             = true
  create_instance_profile = true
  role_requires_mfa       = false

  trusted_role_services   = ["ec2.amazonaws.com"]
  custom_role_policy_arns = ["arn:aws:iam::aws:policy/EC2InstanceConnect"]
}

Next, update your bastion module to remove the key_pair, make use of the new instance profile, and install ec2-instance-connect, by making it look like:

module "bastion" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "2.12.0"

  # Ubuntu 18.04 LTS AMI
  ami                         = "ami-035966e8adab4aaad"
  name                        = "codelab-bastion"
  associate_public_ip_address = true
  instance_type               = "t2.small"
  vpc_security_group_ids      = [module.bastion_security_group.this_security_group_id]
  subnet_ids                  = module.vpc.public_subnets
  iam_instance_profile        = module.instance_profile_role.this_iam_instance_profile_name

  # Install dependencies
  user_data = <<-USER_DATA
#!/bin/bash
sudo apt-get update
sudo apt-get -y install ec2-instance-connect
  USER_DATA
}

The very last update is to add two new outputs:

output "instance_id" {
  value = module.bastion.id[0]
}

output "az" {
  value = module.bastion.availability_zone[0]
}

As per usual, run terraform init, then terraform plan -out plan. If the output looks good (note that the bastion instance should be recreated), run terraform apply plan.

Hooray! You have now enabled EC2 Instance Connect on the instance. To connect, we need to generate a temporary SSH key, send the public key to the instance, and then create an SSH tunnel using the private key.

Establishing SSH Tunnel

This is as easy as:

echo -e 'y\n' | ssh-keygen -t rsa -f /tmp/temp -N '' >/dev/null 2>&1
aws ec2-instance-connect send-ssh-public-key \
  --instance-id `terraform output instance_id` \
  --availability-zone `terraform output az` \
  --instance-os-user ubuntu \
  --ssh-public-key file:///tmp/temp.pub
ssh -L 5432:`terraform output rds_endpoint` -Nf -i /tmp/temp ubuntu@`terraform output bastion_ip`

Verifying our Test Data is still there

Checking that it worked is as easy as executing a psql command to verify the data we added earlier is still there:

psql -d codelab_db -p 5432 \
  -h localhost \
  -U codelab_user \
  -c "SELECT * FROM codelab_table"

To cleanup, let's close the tunnel with kill $(lsof -t -i :5432).

Haiku

"Hey there," Kew said. "Hi Kew," I replied:

Short lived keys are good
To dust - the keys shall return
But the tunnel lives

There's still room for improvement

We already have a pretty darn secure and auditable system, but at Transcend we strive to minimize public DNS endpoints wherever possible to reduce our attack surface.

In the current setup, our bastion still has a public IP address, which is a requirement for EC2 Instance Connect (without using the SSM trick this codelab is leading up to). If we moved our bastion to a private subnet, where it could have no public DNS, we would lose EC2 Instance Connect access.

However, we could still access it with AWS Systems Manager, so let's talk about that.

About SSM

Systems Manager comes with a bunch of premade "Documents" that you can run on sets of EC2 instances, similar-ish to running Ansible playbooks on servers, if you're familiar. If you aren't familiar with Ansible or other automation tools, no worries - the idea is pretty simple: automate running the same commands on a bunch of machines at once.

In this code lab, we just need to worry about the AWS-StartSSHSession Document, which lets you create SSH sessions to EC2s, with the only requirements being that the EC2:

- is running the SSM agent (it ships preinstalled on official Ubuntu 18.04 AMIs)
- has an instance profile granting SSM access, such as the AmazonSSMManagedInstanceCore managed policy
- can reach the SSM service endpoints, whether over the internet or through VPC endpoints

Enabling SSM Access

To start, update the VPC module to have some private subnets, and to have NAT gateways that will allow those private subnets to access the internet:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 2.18.0"

  name = "codelab-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["eu-west-1a", "eu-west-1b"]

  # For the NAT gateways
  public_subnets = ["10.0.101.0/24", "10.0.102.0/24"]

  # For the bastion host
  private_subnets = ["10.0.201.0/24", "10.0.202.0/24"]

  # For the RDS Instance
  database_subnets = ["10.0.1.0/24", "10.0.2.0/24"]

  # Ensure the private gateways can talk to the internet for SSM
  enable_nat_gateway = true

  # Allow private DNS
  enable_dns_hostnames = true
  enable_dns_support   = true
}

Next, in the database security group, update the ingress_cidr_blocks param to be module.vpc.private_subnets_cidr_blocks (notice that it should no longer be wrapped in a list, since that output already is one). This is really cool, as it ensures that only instances in the private subnets can access your database, a huge security win!
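For reference, after this change the db_security_group module looks like:

```hcl
module "db_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "3.1.0"

  name   = "codelab-db-sg"
  vpc_id = module.vpc.vpc_id

  # Only the private subnets (where the bastion now lives) may reach Postgres
  ingress_cidr_blocks = module.vpc.private_subnets_cidr_blocks
  ingress_rules       = ["postgresql-tcp"]

  # Allow all outgoing HTTP and HTTPS traffic for updates
  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["http-80-tcp", "https-443-tcp"]
}
```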

On your bastion security group, you can entirely remove the ingress_cidr_blocks and ingress_rules lines, because we no longer need the SSH port open. SSM works by having the instance reach out to the SSM endpoints, so we can eliminate ingress entirely.
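With the ingress removed, the bastion security group shrinks to just its egress rules:

```hcl
module "bastion_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "3.1.0"

  name   = "codelab-bastion-sg"
  vpc_id = module.vpc.vpc_id

  # No ingress at all: SSM sessions are initiated outbound from the instance

  # Allow all outgoing HTTP and HTTPS traffic, as well as communication to db
  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["http-80-tcp", "https-443-tcp", "postgresql-tcp"]
}
```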

Think about how cool that is: our instance will have no public DNS and no security group ingress. It's like a dream! If you're like me and dream about highly restricted network access, anyways.

On the bastion module, there are two changes:

- remove the associate_public_ip_address line, since an instance in a private subnet has no public IP
- change subnet_ids from module.vpc.public_subnets to module.vpc.private_subnets

While you're at it, you can delete the bastion_ip output, as there is no longer a public IP to print.
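With those changes applied, the bastion module ends up looking like:

```hcl
module "bastion" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "2.12.0"

  # Ubuntu 18.04 LTS AMI
  ami                    = "ami-035966e8adab4aaad"
  name                   = "codelab-bastion"
  instance_type          = "t2.small"
  vpc_security_group_ids = [module.bastion_security_group.this_security_group_id]
  subnet_ids             = module.vpc.private_subnets # private, not public
  iam_instance_profile   = module.instance_profile_role.this_iam_instance_profile_name

  # Install dependencies
  user_data = <<-USER_DATA
#!/bin/bash
sudo apt-get update
sudo apt-get -y install ec2-instance-connect
  USER_DATA
}
```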

Lastly, we need to update the IAM permissions to allow for the bastion host to talk to the SSM service endpoints.

In your instance_profile_role module, add the "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore" ARN to the custom_role_policy_arns list.
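The full module, with both policies attached, now reads:

```hcl
module "instance_profile_role" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-assumable-role"
  version = "~> 2.7.0"

  role_name               = "codelab-role"
  create_role             = true
  create_instance_profile = true
  role_requires_mfa       = false

  trusted_role_services = ["ec2.amazonaws.com"]
  custom_role_policy_arns = [
    # Lets the instance register with and talk to SSM
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    # Lets us push temporary SSH keys via EC2 Instance Connect
    "arn:aws:iam::aws:policy/EC2InstanceConnect",
  ]
}
```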

And we're all done with our complete terraform setup!

Run terraform plan -out plan, verify the changes look good, and run terraform apply plan to finish off.

Haiku

Private subnets rock
But you need a NAT gateway
to get to the web

We're finally at the final step: tunneling through a bastion instance in a private subnet to reach an RDS instance in its own, even more private, subnet.

SSH'ing with ProxyCommand

It has been pretty common for a long time to have bastions or jump servers: you SSH onto the bastion, then SSH from it to a different instance that can't be reached without first going through the bastion.

To help with this, ssh has an option called ProxyCommand. There's a great blog at https://www.cyberciti.biz/faq/linux-unix-ssh-proxycommand-passing-through-one-host-gateway-server/ if you're unfamiliar and want to see examples of its usage in depth, but the main idea is that it allows you to make two SSH jumps in a single command.

A command like ssh -o ProxyCommand="ssh -W %h:%p user@bastion.com" other_user@private.server.com is roughly equivalent to:

ssh user@bastion.com
ssh other_user@private.server.com # run from the bastion.com server

How does that help us

We want to use our bastion instance as a jump server in a ProxyCommand to get to our RDS instance, but our bastion does not have any public DNS that we can put as the host in our ssh command.

That's where the SSM Document we talked about earlier, AWS-StartSSHSession, comes in. It uses some SSM TCP magic to create SSH sessions to any server with SSM enabled.
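If you find yourself doing this often, a handy trick (documented by AWS) is to wire the proxy into your SSH config so that any instance id resolves through SSM automatically. A sketch of such a ~/.ssh/config entry:

```
# Any host that looks like an EC2 instance id gets tunneled through SSM
Host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
```

With that in place, a plain ssh ubuntu@<instance-id> works from anywhere your AWS credentials do.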

The final command

Quick prereq: Install the SSM plugin for your cli: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html

Putting this all together, we can run the commands:

echo -e 'y\n' | ssh-keygen -t rsa -f /tmp/temp -N '' >/dev/null 2>&1
aws ec2-instance-connect send-ssh-public-key \
  --instance-id `terraform output instance_id` \
  --availability-zone `terraform output az` \
  --instance-os-user ubuntu \
  --ssh-public-key file:///tmp/temp.pub
ssh -i /tmp/temp \
  -Nf -M \
  -L 5432:`terraform output rds_endpoint` \
  -o "UserKnownHostsFile=/dev/null" \
  -o "StrictHostKeyChecking=no" \
  -o ProxyCommand="aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p --region eu-west-1" \
  ubuntu@`terraform output instance_id`

To test that this worked, let's run a query:

psql -d codelab_db -p 5432 \
  -h localhost \
  -U codelab_user \
  -c "SELECT * FROM codelab_table"

Yay. :)

Cleanup

Let's kill the tunnel with kill $(lsof -t -i :5432), and then remove all AWS resources with terraform destroy -auto-approve.

Gosh I love terraform :)

Haiku

Got to kill the port
Got to destroy resources
It's time to clean up

Here is the final code after completing the tutorial:

##################
# Create the VPC #
##################

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 2.18.0"

  name = "codelab-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["eu-west-1a", "eu-west-1b"]

  # For the bastion host
  private_subnets = ["10.0.101.0/24", "10.0.102.0/24"]

  # For the NAT gateways
  public_subnets = ["10.0.201.0/24", "10.0.202.0/24"]

  # For the RDS Instance
  database_subnets = ["10.0.1.0/24", "10.0.2.0/24"]

  # Ensure the private gateways can talk to the internet for SSM
  enable_nat_gateway = true

  # Allow private DNS
  enable_dns_hostnames = true
  enable_dns_support   = true
}

#######################
# Create the database #
#######################

module "db" {
  source  = "terraform-aws-modules/rds/aws"
  version = "2.5.0"

  # Put the DB in a private subnet of the VPC created above
  vpc_security_group_ids = [module.db_security_group.this_security_group_id]
  create_db_subnet_group = false
  db_subnet_group_name   = module.vpc.database_subnet_group

  # Make it postgres just as an example
  identifier     = "codelab-db"
  name           = "codelab_db"
  engine         = "postgres"
  engine_version = "10.6"
  username       = "codelab_user"
  password       = "codelab_password"
  port           = 5432

  # Disable stuff we don't care about
  create_db_option_group    = false
  create_db_parameter_group = false

  # Other random required variables that we don't care about in this codelab
  allocated_storage  = 5 # GB
  instance_class     = "db.t2.small"
  maintenance_window = "Tue:00:00-Tue:03:00"
  backup_window      = "03:00-06:00"
}

module "db_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "3.1.0"

  name   = "codelab-db-sg"
  vpc_id = module.vpc.vpc_id

  # Allow incoming Postgres traffic from the private subnets
  ingress_cidr_blocks = module.vpc.private_subnets_cidr_blocks
  ingress_rules       = ["postgresql-tcp"]

  # Allow all outgoing HTTP and HTTPS traffic for updates
  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["http-80-tcp", "https-443-tcp"]
}

###############################
# Create the bastion instance #
###############################

module "bastion_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "3.1.0"

  name   = "codelab-bastion-sg"
  vpc_id = module.vpc.vpc_id

  # Allow all outgoing HTTP and HTTPS traffic, as well as communication to db
  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["http-80-tcp", "https-443-tcp", "postgresql-tcp"]
}

module "bastion" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "2.12.0"

  # Ubuntu 18.04 LTS AMI
  ami                    = "ami-035966e8adab4aaad"
  name                   = "codelab-bastion"
  instance_type          = "t2.small"
  vpc_security_group_ids = [module.bastion_security_group.this_security_group_id]
  subnet_ids             = module.vpc.private_subnets
  iam_instance_profile   = module.instance_profile_role.this_iam_instance_profile_name

  # Install dependencies
  user_data = <<-USER_DATA
#!/bin/bash
sudo apt-get update
sudo apt-get -y install ec2-instance-connect
  USER_DATA
}

module "instance_profile_role" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-assumable-role"
  version = "~> 2.7.0"

  role_name               = "codelab-role"
  create_role             = true
  create_instance_profile = true
  role_requires_mfa       = false

  trusted_role_services = ["ec2.amazonaws.com"]
  custom_role_policy_arns = [
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    "arn:aws:iam::aws:policy/EC2InstanceConnect",
  ]
}

###########
# Outputs #
###########

output "instance_id" {
  value = module.bastion.id[0]
}

output "az" {
  value = module.bastion.availability_zone[0]
}

output "rds_endpoint" {
  value = module.db.this_db_instance_endpoint
}

145 lines for a VPC with solid security, a private database, and a private bastion host isn't too shabby.