search.usa.gov is a free service provided by the United States General Services Administration (GSA) to more than 2,600 federal, state, and local government agencies. The service allows agencies to easily and intuitively configure a search engine results page (SERP) experience covering their domains, their documents, and their social media footprint, so that citizens can easily search government websites to find the information they need. Customers include the Internal Revenue Service, the Department of Defense, and the White House, among others.
Rapid River has supported operations and application development for search.usa.gov since 2015. When the search.usa.gov product management team approached us late last year about finding ways to reduce operational overhead, we were eager to see what we could do.
In this post we’ll explain how we re-architected and migrated the search.usa.gov infrastructure in Amazon Web Services (AWS) in order to:
In the prior search.usa.gov datacenters - one in Chicago and one in Virginia - we had pools of high-powered Dell “pizza box” servers running a mishmash of services in a composition that had been tuned to the observed traffic patterns of search.usa.gov:
The layout of services across the servers didn’t seem to have much rhyme or reason. Because these were physical, pizza-box servers that were expensive to add and difficult to push through the security update process, services had accumulated on them like barnacles over time.
We made it a primary goal of our new architecture to separate each of our services by role and to build flexible pools for each role that could be scaled up or down as demand increased or decreased for each service. This sounds great on the drawing board, but who has time or budget to build robust, role-specific deployment recipes for multiple applications and services?
The answer to the previous question is: certainly not us. Fortunately, the search.usa.gov infrastructure comprises applications with very well-understood deployment practices:
The first five applications - usasearch, search_consumer, i14y, asis, and jobs_api - could be deployed quite easily using AWS OpsWorks’ well-known deployment recipes. We simply pointed OpsWorks at the GitHub repos for each app, and it took care of the rest with robust Capistrano-style deployments of the Rails and Node.js apps.
That left us with just Tematres and Elasticsearch, so we reached into the bag of tricks and wrote Chef recipes that would fit into the OpsWorks deployment cycle for these two applications. (An enormous hat tip goes to Nathan Smith for his work on all of these recipes!)
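A custom recipe for a service like Elasticsearch can be surprisingly compact. The sketch below is illustrative rather than our production recipe - the attribute names, version, and template are hypothetical - but it shows the shape of a recipe that slots into an OpsWorks setup lifecycle event:

```ruby
# Illustrative Chef recipe for an Elasticsearch layer (hypothetical
# attributes and paths; not our production recipe).

# Elasticsearch needs a JVM.
package 'openjdk-8-jre-headless'

# Fetch and install a pinned version of the Debian package.
es_version = node['elasticsearch']['version'] # e.g. '2.4.6'
remote_file "/tmp/elasticsearch-#{es_version}.deb" do
  source "https://example.com/elasticsearch/elasticsearch-#{es_version}.deb"
  action :create_if_missing
end

dpkg_package 'elasticsearch' do
  source "/tmp/elasticsearch-#{es_version}.deb"
end

# Render cluster settings (cluster name, unicast hosts) from a template
# so every instance in the layer converges to the same configuration.
template '/etc/elasticsearch/elasticsearch.yml' do
  source 'elasticsearch.yml.erb'
  notifies :restart, 'service[elasticsearch]'
end

service 'elasticsearch' do
  action [:enable, :start]
end
```

Because the recipe is idempotent, OpsWorks can run it on every newly booted instance and get an identical server each time.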
We then enabled Auto Healing on our application layers to ensure that servers would be replaced automatically if they failed. With robust recipes to build servers in place, we knew that this replacement would be seamless if it occurred. (But, of course, we tested it to make sure.)
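For teams managing OpsWorks layers as code rather than through the console, auto healing is a one-line setting. A hypothetical CloudFormation fragment (the resource names and layer type here are illustrative, not our actual stack):

```yaml
# Hypothetical CloudFormation fragment: an OpsWorks application layer
# with auto healing enabled, so failed instances are replaced automatically.
AppLayer:
  Type: AWS::OpsWorks::Layer
  Properties:
    StackId: !Ref SearchStack
    Name: rails-app
    Shortname: rails-app
    Type: rails-app
    EnableAutoHealing: true
    AutoAssignElasticIps: false
    AutoAssignPublicIps: false
```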
To replace our very expensive CDN and web application firewall (WAF) provider, we implemented our own Apache proxy server layer using a modified version of the OWASP WAF rules for the modsecurity Apache module. (In fact, our expensive hosted WAF was itself based on a modified version of the OWASP rules.) This took a bit of iterative tuning that we’ll discuss later.
We also migrated our database services (MySQL and Redis) to the hosted AWS equivalents (RDS MySQL and ElastiCache Redis), in configurations designed to automatically withstand the loss of an AWS availability zone (AZ). This was an inexpensive way to take the hassle of database availability, backups, and upgrades out of our hands.
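In AWS terms, AZ resilience for both stores comes down to a couple of settings. A hypothetical CloudFormation sketch (identifiers, instance sizes, and parameter names are illustrative, not our production values):

```yaml
# Hypothetical CloudFormation sketch of Multi-AZ-resilient data stores.
SearchDatabase:
  Type: AWS::RDS::DBInstance
  Properties:
    Engine: mysql
    DBInstanceClass: db.m4.large
    AllocatedStorage: '100'
    MultiAZ: true            # synchronous standby in a second AZ
    MasterUsername: admin
    MasterUserPassword: !Ref DbPassword

SearchRedis:
  Type: AWS::ElastiCache::ReplicationGroup
  Properties:
    ReplicationGroupDescription: search.usa.gov Redis
    Engine: redis
    CacheNodeType: cache.m4.large
    NumCacheClusters: 2
    AutomaticFailoverEnabled: true   # promote the replica if the primary AZ fails
```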
With all of these pieces in place we were able to build out the following architecture in AWS:
The key thing to note about this architecture is that it has four new characteristics that our old environment lacked:
Also, by focusing our spending on the CPU capacity of the application server pool and the provisioned IOPS needed for Elasticsearch, we achieved a 40% reduction in monthly server costs in this new configuration. (Furthermore, our program manager can increase savings in the future by pre-purchasing server time through Reserved Instance pricing.)
One of the original drivers of this project was to get away from the very high cost of our CDN/WAF provider.
As you can see in our network diagram, we accomplished this by creating our own proxy servers that run a modified version of the OWASP WAF software. How we did this is probably worthy of its own blog post, but the basic recipe was this:
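At a high level, the proxy configuration looked something like the fragment below. The paths, rule ID, and backend hostname are illustrative placeholders - our real tuning disabled a different set of rules:

```apache
# Illustrative Apache + ModSecurity proxy configuration (paths and
# rule IDs are examples, not our production tuning).
<IfModule security2_module>
    SecRuleEngine On
    # Load the (modified) OWASP Core Rule Set.
    Include /etc/modsecurity/crs-setup.conf
    Include /etc/modsecurity/rules/*.conf
    # Tuning: disable rules that false-positive on legitimate search queries.
    SecRuleRemoveById 942100
</IfModule>

# Reverse-proxy everything else to the internal application pool.
ProxyPass        / http://internal-app-elb.example.com/
ProxyPassReverse / http://internal-app-elb.example.com/
```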
The CDN component was even more straightforward. With our proxy servers in place, we verified that we were setting correct expiration headers on our assets and then enabled mod_disk_cache on our proxy servers. Once we verified that assets were being served from our proxy servers without calls to our origin servers, we enabled a Rails asset host configuration on our production application to send all asset requests to a CloudFront Distribution whose origin server was our proxy server pool. This took all asset traffic off our expensive CDN provider without directing it to our origin servers.
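On the Rails side, pointing asset URLs at CloudFront is a one-line change in the production environment configuration. The distribution hostname below is a placeholder for the real one:

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Placeholder CloudFront hostname; ours pointed at a distribution
  # whose origin was the proxy server pool.
  config.action_controller.asset_host = 'https://d1234abcd.cloudfront.net'
end
```

With this in place, every asset URL the app renders points at CloudFront, which in turn is filled from the proxy-layer cache rather than the origin servers.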
The final savings calculation looked like this:
That’s a 97.5% savings we were delighted to return to the operating budget of the project for more important things like developing new features!
In other blog posts we discuss in more detail the complexities of supporting SSL certificates for our government customers’ hostnames and how we managed to comply with the DNSSEC requirement for government agencies. Those posts go into much more technical detail and are worth reading if you’re interested in how we solved the security challenges of providing SaaS services for thousands of government agencies.
The great thing about pools of servers built by hardened recipes is that individual server failures are a non-event: a new server is spun up automatically to replace one that failed. However, the Authority to Operate (ATO) process in particular and the government security auditing process in general become a bit tricky for an architecture like ours because these processes require (among many other things):
Since individual servers are identified by IP address, these tests get a bit more complicated when individual servers can be rebuilt without warning due to server failure, load spikes, or data center outages - because the new server will often come online with a different DHCP’d IP address than the one it replaced. Similarly, if a new server is spun up in response to increased load, it will have an IP address that is unknown to the security testing infrastructure and will raise unnecessary red flags.
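One way to keep auditors and autoscaling in sync - purely an illustrative sketch, not the GSA's process - is to diff the authorized-IP list against the live instance inventory (e.g. fetched from the EC2 API) on every scaling event, and report both the new, untracked addresses and the retired ones:

```ruby
# Hypothetical helper: reconcile an auditor's authorized-IP list with
# the IPs currently live in a dynamic server pool.
def reconcile_scan_targets(authorized_ips, live_ips)
  {
    new_untracked: live_ips - authorized_ips, # spun up since the last audit
    retired: authorized_ips - live_ips        # replaced or terminated
  }
end

# Example: one server was replaced, so its old IP retires and the
# replacement's new IP must be registered with the scanners.
diff = reconcile_scan_targets(['10.0.1.5', '10.0.1.6'], ['10.0.1.6', '10.0.1.9'])
# diff[:new_untracked] => ["10.0.1.9"]; diff[:retired] => ["10.0.1.5"]
```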
We have worked closely with GSA security personnel to understand their security auditing requirements and to help them understand the new terrain in which we are operating. We hope this two-way conversation will lead to updated security practices that can take the dynamic nature of cloud hosting environments into account without sacrificing completeness or quality.
By applying some commonly understood modern operations practices - role-based deployment, server redundancy, and pooling - to our application, we were able to achieve substantial cost savings while making the search.usa.gov service more resilient to failure. While some government security practices are still evolving to incorporate dynamic server environments, the success of our migration bodes well for the future of cost-effective and reliable cloud computing in government applications.