Standing up Nautobot on AWS #1201
mathiaswegner started this conversation in Show and tell
Nautobot on AWS
This is a brief summary of setting up Nautobot on AWS. I used Nautobot 1.1.3, but the process should not vary too much with other versions. It assumes some familiarity with Nautobot and AWS. Pretty much all of the AWS CLI commands are the bare minimum to stand up some version of this deployment, and they only show creating one resource even where multiple identical resources are needed; you should add tags and consider your own needs around sizing, redundancy, etc. The end result is a basic Nautobot deployment with no scaling: just one HTTP server and one Celery worker.
Starting infrastructure and configurations
Nautobot config changes
In order to work with Redis SSL, you'll need to add a setting to your nautobot_config.py (a sketch follows below), where "YOUR SETTING" is one of 'required', 'optional', or 'none'. This is needed because Celery requires some value for ssl_cert_reqs when using SSL.
Starting with dependencies
VPC Security Groups
First, I went to VPC and created all of the security groups that I intended to use for the installation. For the internal services (EFS, RDS, Elasticache, ECS), I configured the security groups to allow inbound access to the service from the private subnets only; e.g., the postgres security group only allows inbound traffic to the postgres port from the CIDRs of the two private subnets. For the external services (ELB), I configured the security group to allow inbound HTTP/HTTPS access to the service from my end user networks.
The full list of security groups that I created is:
Creating a Security Group
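A sketch of what one of these looks like with the AWS CLI (the VPC ID, group ID, and CIDR are placeholders; repeat per security group and rule):

```bash
# Create the security group (shown here for the RDS group)
aws ec2 create-security-group \
  --group-name nautobot-rds-secgrp \
  --description "Postgres access from the private subnets" \
  --vpc-id vpc-0123456789abcdef0

# Allow inbound postgres traffic from one private subnet CIDR
# (repeat for the second subnet)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5432 \
  --cidr 10.0.1.0/27
```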
RDS
Second, I went to RDS and created a database subnet group, nautobot-postgres-subnetgroup, that included both private subnets. I then created a generic Postgres RDS database named nautobot-postgres-db and assigned it to the nautobot-postgres-subnetgroup subnet group and the nautobot-rds-secgrp security group. The database does not need public access. I created a superuser at this point. Sizing, redundancy, backup, and encryption settings will obviously vary based on your use case, so your database setup will look different. I have not run into any issues with all encryption enabled, and I started with limited resources intending to scale up as needed.
Creating RDS subnet group and postgres instance
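Roughly, with the AWS CLI (subnet IDs, security group ID, instance class, and credentials are placeholders):

```bash
aws rds create-db-subnet-group \
  --db-subnet-group-name nautobot-postgres-subnetgroup \
  --db-subnet-group-description "Nautobot Postgres subnets" \
  --subnet-ids subnet-0aaaaaaaaaaaaaaaa subnet-0bbbbbbbbbbbbbbbb

aws rds create-db-instance \
  --db-instance-identifier nautobot-postgres-db \
  --engine postgres \
  --db-instance-class db.t3.small \
  --allocated-storage 20 \
  --master-username nautobot_admin \
  --master-user-password 'CHANGE-ME' \
  --db-subnet-group-name nautobot-postgres-subnetgroup \
  --vpc-security-group-ids sg-0123456789abcdef0 \
  --no-publicly-accessible \
  --storage-encrypted
```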
Elasticache
The Elasticache instance also needs a subnet group, so I created one named nautobot-redis-subnetgroup that included both private subnets. I created an Elasticache redis instance, nautobot-redis, encrypted both at rest and in transit, assigned an auth string at creation, and assigned it to the nautobot-redis-subnetgroup subnet group and the nautobot-redis-secgrp security group. I used default settings for the number of shards and redundancy, but used the smallest node size that I could reasonably use, intending to scale up as needed, because Elasticache gets expensive fast. Note that I used a redis replication group in order to use transit encryption; there are additional options if you don't care about transit encryption.
Creating Elasticache subnet group and redis instance
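A sketch with the AWS CLI (subnet IDs, security group ID, node type, and auth token are placeholders):

```bash
aws elasticache create-cache-subnet-group \
  --cache-subnet-group-name nautobot-redis-subnetgroup \
  --cache-subnet-group-description "Nautobot Redis subnets" \
  --subnet-ids subnet-0aaaaaaaaaaaaaaaa subnet-0bbbbbbbbbbbbbbbb

# Replication group with encryption at rest and in transit plus an auth token
aws elasticache create-replication-group \
  --replication-group-id nautobot-redis \
  --replication-group-description "Nautobot Redis" \
  --engine redis \
  --cache-node-type cache.t3.micro \
  --num-cache-clusters 2 \
  --transit-encryption-enabled \
  --at-rest-encryption-enabled \
  --auth-token 'CHANGE-ME' \
  --cache-subnet-group-name nautobot-redis-subnetgroup \
  --security-group-ids sg-0123456789abcdef0
```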
EFS
I created an EFS file system, nautobot-efs-configfiles, and corresponding access point, nautobot-efs-accesspoint, and assigned them to the nautobot-efs-secgrp security group.
Creating the EFS file system and access point
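Roughly (file system ID, subnet IDs, security group ID, and the POSIX user are placeholders; the UID/GID should match whatever user your container runs as):

```bash
aws efs create-file-system \
  --encrypted \
  --tags Key=Name,Value=nautobot-efs-configfiles

# One mount target per private subnet, attached to the EFS security group
aws efs create-mount-target \
  --file-system-id fs-0123456789abcdef0 \
  --subnet-id subnet-0aaaaaaaaaaaaaaaa \
  --security-groups sg-0123456789abcdef0

aws efs create-access-point \
  --file-system-id fs-0123456789abcdef0 \
  --tags Key=Name,Value=nautobot-efs-accesspoint \
  --posix-user Uid=999,Gid=999 \
  --root-directory 'Path=/nautobot,CreationInfo={OwnerUid=999,OwnerGid=999,Permissions=755}'
```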
Systems Manager Parameter Store
For secrets, I used Systems Manager Parameter Store and created SecureStrings. Note that whichever KMS key you use needs to be accessible by the nautobot-ecs-task-runner-role role, whether you use the account default key or specify a key. There are three secrets that are used at every startup (the Django secret key, the redis auth string, and the postgres nautobot user password) and two that are only needed at initialization (the admin user password and the admin user API key).
Creating a parameter store secret
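A sketch of one SecureString (the parameter name is a placeholder; repeat per secret):

```bash
aws ssm put-parameter \
  --name /nautobot/NAUTOBOT_SECRET_KEY \
  --type SecureString \
  --value 'CHANGE-ME'
# add --key-id <KMS KEY ID> to use a specific key instead of the account default
```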
ECR
I chose to use ECR for container image storage, but any container image repo will work as long as it is accessible by ECS and the nautobot-ecs-task-runner-role role. I created a single repo for the nautobot container, nautobot, and uploaded my container image.
For my nautobot container, I started with a standard nautobot image as a base and installed my preferred plugins locally. The result was uploaded to ECR.
Creating the ECR repo and pushing your existing image
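Roughly (account ID and region are placeholders, and the local image tag assumes you have already built your customized image):

```bash
aws ecr create-repository --repository-name nautobot

# Authenticate docker to the registry, then tag and push the local image
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 111111111111.dkr.ecr.us-east-1.amazonaws.com

docker tag nautobot:latest 111111111111.dkr.ecr.us-east-1.amazonaws.com/nautobot:latest
docker push 111111111111.dkr.ecr.us-east-1.amazonaws.com/nautobot:latest
```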
Identity and Access Management
Next, I went to IAM and created a role for executing the ECS task, nautobot-ecs-task-runner-role. I assigned a custom policy to the role that gave it permission to mount the EFS access point, pull images from ECR, run tasks in ECS, and send logs to CloudWatch.
This policy is not as locked down as it could be: some of the resources didn't exist when I created the policy, and it needs to be revisited to specify ARNs. Still, it should be a good starting point and covers all of the permissions that my nautobot task needed.
IAM task runner policy json
Creating an IAM role and policy
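A sketch of the role creation, assuming the trust policy and permissions policy live in local JSON files (the policy name here is just a placeholder):

```bash
# trust-policy.json should allow ECS tasks to assume the role:
# { "Version": "2012-10-17",
#   "Statement": [ { "Effect": "Allow",
#                    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
#                    "Action": "sts:AssumeRole" } ] }
aws iam create-role \
  --role-name nautobot-ecs-task-runner-role \
  --assume-role-policy-document file://trust-policy.json

# task-runner-policy.json grants the EFS, ECR, ECS, and CloudWatch Logs
# permissions described above
aws iam put-role-policy \
  --role-name nautobot-ecs-task-runner-role \
  --policy-name nautobot-ecs-task-runner-policy \
  --policy-document file://task-runner-policy.json
```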
ELB
The load balancer is the only part of this deployment that should be reachable from outside the VPC. Whatever your connectivity to AWS looks like (Direct Connect, VPN, public internet), the load balancer needs to be deployed so that it is reachable from your network. First, the load balancer needs a target group to forward traffic to, so I created nautobot-elb-targetgroup. Next, I created the load balancer itself, nautobot-elb. Finally, I created a listener that forwards HTTPS connections to the ELB to port 8080 on our nautobot server.
Many of the load balancer configuration settings will depend on your use case, so consult the AWS documentation on load balancers for details. In my environment, I used an internal load balancer since my VPC is connected to a transit gateway that connects back to our premises via a VPN tunnel. To make it easier on users, I created a CNAME within my domain that points to the AWS FQDN of the load balancer and installed an SSL cert on the load balancer.
https://awscli.amazonaws.com/v2/documentation/api/latest/reference/elbv2/create-load-balancer.html
Listener default action json
Creating a target group and load balancer
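A sketch with the AWS CLI (the VPC ID, subnet IDs, security group ID, and ARNs are placeholders):

```bash
aws elbv2 create-target-group \
  --name nautobot-elb-targetgroup \
  --protocol HTTP \
  --port 8080 \
  --vpc-id vpc-0123456789abcdef0 \
  --target-type ip

aws elbv2 create-load-balancer \
  --name nautobot-elb \
  --scheme internal \
  --subnets subnet-0aaaaaaaaaaaaaaaa subnet-0bbbbbbbbbbbbbbbb \
  --security-groups sg-0123456789abcdef0

# Forward HTTPS on the load balancer to the target group (port 8080 on the task)
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/app/nautobot-elb/abc123 \
  --protocol HTTPS \
  --port 443 \
  --certificates CertificateArn=arn:aws:acm:us-east-1:111111111111:certificate/abc-123 \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/nautobot-elb-targetgroup/abc123
```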
Connect to the bastion host
There are a few tasks that are easier to do from the bastion host. From the bastion host, I connected to postgres and created a nautobot user and a nautobot database and granted the nautobot user access to the database. I then mounted the EFS volume from the bastion host and copied over my nautobot.env and nautobot_config.py files. I also created the media, static, and jobs directories and updated the configuration to point at them.
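A sketch of those steps, assuming psql and the amazon-efs-utils mount helper are installed on the bastion host (endpoints, IDs, and passwords are placeholders):

```bash
# Create the application database and user on the RDS instance
psql -h nautobot-postgres-db.abc123.us-east-1.rds.amazonaws.com -U nautobot_admin -d postgres <<'SQL'
CREATE DATABASE nautobot;
CREATE USER nautobot WITH PASSWORD 'CHANGE-ME';
GRANT ALL PRIVILEGES ON DATABASE nautobot TO nautobot;
SQL

# Mount the EFS file system, copy the config files, and create the directories
sudo mkdir -p /mnt/efs
sudo mount -t efs -o tls fs-0123456789abcdef0:/ /mnt/efs
sudo cp nautobot.env nautobot_config.py /mnt/efs/
sudo mkdir -p /mnt/efs/media /mnt/efs/static /mnt/efs/jobs
```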
Now that all of the dependencies are in place... ECS!
ECS Cluster
Since I don't want to maintain EC2 hosts, I decided to create a Fargate cluster and named it nautobot-fargate-cluster.
Creating the cluster
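This one is a single command (tags and capacity provider settings omitted):

```bash
aws ecs create-cluster --cluster-name nautobot-fargate-cluster
```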
ECS Task(s)
Since we are going to use environment variables to create the superuser on initial startup, we can either create two tasks, or we can start with the superuser environment variables and then edit the task to remove them after initialization is done. I'll describe the startup task here; the other task is nearly identical but does not include any of the SUPERUSER environment variables.
I created nautobot-ecs-task as a Fargate 1.4.0 task and assigned nautobot-ecs-task-runner-role to be the execution role and task role. I added nautobot-efs-configfiles as an EFS volume with the nautobot-efs-accesspoint access point, encrypted in transit and using IAM authorization.
I created the first container, nautobot, with the image I uploaded to the ECR repo, and mapped port 8080. I'm not using https on the container since it will only communicate with the load balancer. I specified a working directory of /opt/nautobot, added the EFS volume, and mounted it at /opt/nautobot/config. Finally, I added environment variables for runtime configuration. Most of them are plain text values, but the secrets stored in the Parameter Store use ValueFrom with the ARN of the parameter. I loaded some variables in the task definition to make them easier to change or secure; the rest are defined in the nautobot.env file.
Again, the SUPERUSER variables are only needed for the first startup. I kept them defined in my task and changed the value of NAUTOBOT_CREATE_SUPERUSER to false in a newer version of the task.
The second container is the celery worker, nautobot-celery. It does not need mapped ports, but it does need the same container image, EFS volume, and working directory as well as the database, redis, and config environment variables. In addition, we need to specify the entrypoint and command. I used an entrypoint of "/usr/local/bin/nautobot-server" and a command of "celery,worker,--loglevel,INFO,--pidfile,/opt/nautobot/nautobot-celery.pid,-n,nautobot-celery".
For both containers, I used default CloudWatch logging. How much memory and CPU you want to commit, how much you want to assign to each container, and what sort of scale up/scale out you want to add all depend on usage. My initial usage is low enough that I have not tinkered with scaling.
A warning about ALLOWED_HOSTS - this can be a giant pain with Django and ELB. The Host header on ELB health checks will be set to the IP address that the ELB is targeting. There are a variety of ways to work around this, from Django plugins designed for this issue to HTTP server configs that overwrite the Host header when the request comes from the CIDRs of the private subnets. Because I am using /27 CIDRs, I went with the fast but ugly approach of adding all of the IP addresses within the private subnets to ALLOWED_HOSTS (sketched below).
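A sketch of that workaround in nautobot_config.py (the hostname and CIDRs are placeholders for your own values):

```python
import ipaddress

# Allow the real hostname plus every address in the two private /27s, so that
# ELB health checks addressed to the task IP pass Django's host validation.
ALLOWED_HOSTS = ["nautobot.example.com"] + [
    str(ip)
    for cidr in ("10.0.1.0/27", "10.0.1.32/27")
    for ip in ipaddress.ip_network(cidr)
]
```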
Container definition json for nautobot server
Volume definition json
Create the task
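Since the container and volume definitions above end up in a task definition JSON, registering the task is one CLI call against that file (the file name here is a placeholder):

```bash
# nautobot-ecs-task.json holds the Fargate settings (family, awsvpc network
# mode, cpu/memory, executionRoleArn/taskRoleArn), both container definitions,
# and the EFS volume definition described above
aws ecs register-task-definition --cli-input-json file://nautobot-ecs-task.json
```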
Service
The service uses the latest version of the task defined above and starts just one instance of the task. It launches as Fargate 1.4.0 on Linux on the cluster that we already created. It's deployed into the two private subnets of our VPC, and I've assigned the ECS security group. This security group allows local access to 8080, so the load balancer can connect directly to the containers but nothing else can. The load balancer is also defined here, with an IP target group.
Load balancer list json
Network structure json
Note that the task-definition argument below is task-family:task-version; update as needed.
Create the service
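A sketch with the AWS CLI (the service name, subnet IDs, security group ID, and target group ARN are placeholders):

```bash
aws ecs create-service \
  --cluster nautobot-fargate-cluster \
  --service-name nautobot-ecs-service \
  --task-definition nautobot-ecs-task:1 \
  --desired-count 1 \
  --launch-type FARGATE \
  --platform-version 1.4.0 \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0aaaaaaaaaaaaaaaa,subnet-0bbbbbbbbbbbbbbbb],securityGroups=[sg-0123456789abcdef0],assignPublicIp=DISABLED}' \
  --load-balancers 'targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/nautobot-elb-targetgroup/abc123,containerName=nautobot,containerPort=8080'
```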
Test it out!
Once the service is defined, it should start to spin up a task. It will take several seconds to pull the image from ECR, assign a node, and start the task. Once the task is in a RUNNING state, the load balancer will need to register the IP of the nautobot server. At that point, you should be able to connect to the FQDN of the load balancer!
If you used the NAUTOBOT_CREATE_SUPERUSER=True environment variable to create a superuser on the first run, you should go back and create a new version of the task that has that environment variable set to False and update the service to use the new version of the task. Updating the task should launch a new task instance and drain the old instance.
Troubleshooting
Issues I ran into: