In an article series, Anuj Gupta and Neeraj Kumar provide a roadmap for beginning your Cloud Native journey on AWS, detailing a set of Well-Architected recommendations that address the foundations of designing your environment for scale and security.
They follow an example e-commerce company that experiences hyper-growth, such as a Black Friday sales surge, and adopts a Cloud Native strategy to adapt. So that others can emulate this approach, they provide the requisite architecture patterns, which rest on modern application development principles:
Microservices; purpose-built databases; automated software release pipelines; a serverless operational model; and automated, continuous security.
In the first article they explain the challenges a traditional architecture experiences when trying to cope with a hyper-growth scenario, such as:
- Tight coupling and dependency between system modules – The current application is designed to scale as a whole instead of scaling individual system modules.
- Ineffective automatic scaling – The application relies on Amazon EC2 Auto Scaling to add new instances when load increases. However, it takes at least 5 minutes for instances to become ready to serve user traffic.
- Limitation on system throughput – The database is running on r5.24xlarge Amazon RDS Multi-AZ, which has almost reached its vertical scale limit.
- Longer release cycles – Multiple team members contribute to the application code and commit code to the main branch of source version control, requiring coordination across multiple teams to ensure the right version of a dependency is included in the production release and is certified by testing teams.
- Increase in troubleshooting and issue resolution time – The current logging solution and tools provide insights into individual system components but lack a coherent view of the data flow across systems.
In the second article they explain the system improvements made to maximize the company’s system throughput and get consistent performance from its application stack:
- Introduce connection pooling – Implemented connection pooling via Amazon RDS Proxy to improve overall performance and reduce unnecessary resource utilization.
- Introduce read replicas – To address read scaling in the database cluster, they added multiple read replicas to handle read-only queries.
- Introduce caching – Deployed Amazon ElastiCache for Redis in a lazy-loading pattern as mentioned in the Database Caching Strategies Using Redis whitepaper. Caching not only improved scale and performance but also helped to optimize the database cost.
- Introduce purge and data archiving – Created tables partitioned by date, which allowed them to analyze partitions with records that can be archived or purged based on the retention policy. Table size was reduced by 60% by applying these techniques.
- Query optimization – As different bottlenecks in the database throughput were identified, they used Amazon RDS Performance Insights to find the bottlenecks in SQL queries.
- Introduce database sharding – They had to find a way to horizontally scale the database. “Sharding” allowed them to deal with having multiple writer nodes, as described in this Sharding with Amazon Relational Database Service blog post.
These architecture patterns not only maximize system throughput but also prepare your applications for incremental improvements to build highly scalable and resilient architecture.
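The lazy-loading pattern mentioned in the caching step can be sketched as follows. This is a minimal illustration, not the authors' implementation: an in-memory dictionary stands in for Amazon ElastiCache for Redis, and `LazyLoadingCache`, `load_product`, and the 60-second TTL are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LazyLoadingCache:
    """Lazy loading (cache-aside): check the cache first, hit the database only on a miss."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float = 300.0):
        self._load_from_db = load_from_db
        self._ttl_seconds = ttl_seconds
        # key -> (value, expiry time). Assumption: a dict stands in for a Redis client.
        self._store: Dict[str, Tuple[str, float]] = {}
        self.db_reads = 0  # counts how often the database was actually queried

    def get(self, key: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: no database work
        # Cache miss or expired entry: load from the database, then populate the cache.
        value = self._load_from_db(key)
        self.db_reads += 1
        self._store[key] = (value, now + self._ttl_seconds)
        return value


def load_product(key: str) -> str:
    """Hypothetical loader standing in for a SQL query against the database."""
    return f"row-for-{key}"


cache = LazyLoadingCache(load_product, ttl_seconds=60.0)
first = cache.get("product:42", now=0.0)    # miss: loads from the database
second = cache.get("product:42", now=10.0)  # hit: served from the cache
third = cache.get("product:42", now=61.0)   # expired: loads from the database again
```

Only requested keys are ever cached, which is why this pattern reduces database load and cost without filling the cache with data nobody reads. The trade-off is that the first read of each key, and the first read after expiry, still pays the database latency.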
In the third article they talk about architecture patterns to improve system resiliency, why observability matters, and how to build a holistic observability solution. This provides a comprehensive engineering guide for designing high availability systems, addressing elements such as:
- Disaster recovery – How to design a recovery system that mitigates business continuity risks, using cross-Region Amazon RDS read replicas, cross-Region Amazon S3 replication, AWS CloudFormation templates, and AWS Backup to simplify backup and cross-Region copying of Amazon EC2, Amazon Elastic Block Store (Amazon EBS), and Amazon RDS resources. They built a pipeline with EC2 Image Builder to create “golden AMIs,” Amazon Machine Images that contain the operating systems and packages needed to stand up consistent servers, and automated application code deployment using AWS CodePipeline.
- Improve load-balancing capability across servers – The monolith application handles all transactions, including long transactions that can take up to 3 minutes and short transactions that complete within 30 milliseconds. To manage the resulting unbalanced traffic, they used the least outstanding requests algorithm in Application Load Balancer, which distributes the load more uniformly.
- Decoupled integrations using event-driven design patterns – The application experienced a cascading failure when timeouts in the payment processing and order fulfillment APIs impacted business operations. They updated the application logic to use the design pattern shown in this FIFO topics example use case: the application sends requests to Amazon Simple Notification Service (Amazon SNS) topics, which fan out to multiple subscribed Amazon Simple Queue Service (Amazon SQS) queues. AWS Lambda functions poll messages from these SQS queues and interact with external systems, throttling the number of calls to external API operations.
- Predictive scaling for EC2 – Due to its monolithic architecture, the application didn’t scale quickly with sudden increases in traffic because of its high bootstrap time. To predict future demands, they optimized for application availability and allowed automatic scaling to use historical CPU utilization data, an approach highlighted in the New – Predictive Scaling for EC2, Powered by Machine Learning blog. This allows them to adjust capacity needs by forecasting usage patterns along with configurable warm-up time for application bootstrap.
- Standardize observability – When analyzing downtime events, they found that the time spent correlating different events was leading to a high mean time to resolve (MTTR). Additionally, the performance baseline was missing, and their analysis also uncovered a gap in security monitoring.
- To overcome these challenges they built a centralized logging solution using the guidance from the Visualizing AWS CloudTrail Events using Kibana blog. They created dashboards in Kibana to visualize log correlation insights and metrics, used Amazon Athena to run on-demand queries when they needed to troubleshoot performance issues and identify the root cause, and used Amazon Elasticsearch Service (Amazon ES) to maintain real-time log insights.
- They used this solution as a baseline and updated the configuration as shown in Amazon ES Service Best Practices to optimize for performance and scale. They took manual hourly snapshots of the ES cluster to Amazon S3 and used Amazon S3 Cross-Region Replication so that the cluster could be restored in case of a DR event.
- Manage service limits with Service Quotas and adopt a multi-account strategy – They observed that most workloads were running in a single account, leading to service limits, and so implemented two strategies to address this situation. First, they adopted Service Quotas to proactively identify service limits; it supports Amazon CloudWatch integration for some services, enabling alerts when usage is close to reaching a quota. Second, they adopted a multi-account strategy: production and non-production environments should be separated into a multi-account framework for many reasons, including service limits, as discussed in Best Practices for Organizational Units with AWS Organizations.
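To see why a least-outstanding-requests policy helps when 3-minute and 30-millisecond transactions share the same servers, consider the sketch below. It only illustrates the selection rule; it is not how Application Load Balancer is implemented, and `pick_target`, `route`, and the server names are hypothetical.

```python
from typing import Dict


def pick_target(outstanding: Dict[str, int]) -> str:
    """Return the target with the fewest in-flight requests (ties broken by name)."""
    if not outstanding:
        raise ValueError("no targets registered")
    return min(sorted(outstanding), key=lambda name: outstanding[name])


def route(outstanding: Dict[str, int]) -> str:
    """Pick a target and record the new in-flight request against it."""
    target = pick_target(outstanding)
    outstanding[target] += 1
    return target


# Two hypothetical servers, both idle at the start.
in_flight = {"server-a": 0, "server-b": 0}

first = route(in_flight)    # "server-a": tie broken by name; assume a long transaction
second = route(in_flight)   # "server-b": server-a is still busy
in_flight[second] -= 1      # the short transaction on server-b completes
third = route(in_flight)    # "server-b" again, since server-a is still busy
```

A round-robin policy would have sent the third request to the server still occupied by the long transaction. Routing on in-flight count instead lets short transactions keep flowing to whichever server is free.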
In the fourth article they examine the security and identity policies, noting that the site had experienced a growing number of DDoS attacks, leading to downtime and loss of revenue. To address this they implemented a multi-account strategy and AWS Identity and Access Management (IAM) best practices.
They used AWS Control Tower to deploy guardrails as service control policies (SCPs). These guardrails were then separated into production and non-production environments, creating the hierarchy shown here:
- Created a new Payer (or Management) Account with a Sandbox OU and a Transitional OU under the Root OU. They then moved existing AWS accounts under the Transitional OU and Sandbox OU, provisioned new accounts with Account Factory, and gradually migrated services from existing AWS accounts into the newly formed Log Archive Account, Security Account, Network Account, and Shared Services Account, applying appropriate guardrails.
- Registered the Sandbox OU with Control Tower. Additionally, they migrated the centralized logging solution from Part 3 of the series to the Security Account. They moved non-production applications into the Dev and Test Accounts, respectively, to isolate workloads. They then moved existing accounts that had production services from the Transitional OU to the Workload PROD OU.
- To strengthen new and existing employees’ credentials, they used the AWS Trusted Advisor IAM Access Key Rotation check, which identifies IAM users whose access keys have not been rotated for more than 90 days, and created an automated way to rotate them. They then generated an IAM credential report to identify IAM users that don’t need console access or access keys, and gradually moved these users to role-based access instead of IAM access keys.
- During a Well-Architected Security Pillar review, they identified some applications that used hardcoded passwords that hadn’t been updated for more than 90 days. They re-factored these applications to get passwords from AWS Secrets Manager and followed best practices for performance.
- They set up a system to automatically change passwords for RDS databases and wrote an AWS Lambda function to update passwords for third-party integration. Some applications on Amazon EC2 were using IAM access keys to access AWS services, so they re-factored them to get permissions from the EC2 instance role attached to the EC2 instances, which reduced operational burden of rotating access keys.
- Using IAM Access Analyzer, they analyzed AWS CloudTrail logs and generated policies for IAM roles. This helped them to determine the least privilege permissions required for the roles as mentioned in the IAM Access Analyzer makes it easier to implement least privilege permissions by generating IAM policies based on access activity blog.
- To streamline access for internal users, they migrated users to AWS Single Sign-On (AWS SSO) federated access. They enabled all features in AWS Organizations to use AWS SSO and created permission sets to define access boundaries for different functions. They assigned permission sets to user groups and assigned users to those groups based on their job function. This enabled them to reduce the number of IAM policies and use tag-based control when defining AWS SSO permissions policies.
- They followed the guidance in the Attribute-based Access Control with AWS SSO blog post to map user attributes and use tags to define permissions boundaries for user groups. This allowed them to provide access to users based on specific teams, projects, and departments. They enforced multi-factor authentication (MFA) for all AWS SSO users by configuring MFA settings to allow sign in only when an MFA device has been registered.
These improvements ensure that only the right people have access to the required resources for the right time, reducing the risk of compromised security credentials by using AWS Security Token Service (AWS STS) to generate temporary credentials when needed. System passwords are better protected from unwanted access and automatically rotated for improved security. AWS SSO enables them to enforce permissions at scale when people’s job functions change within or across teams.
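The 90-day access key check described above can be sketched against an IAM credential report. This is an illustration under stated assumptions rather than the authors' tooling: the sample rows are invented, only the first access key is inspected, and the three columns shown are a subset of the real report, which carries more columns and a second key slot. `stale_key_users` and `SAMPLE_REPORT` are hypothetical names.

```python
import csv
import io
from datetime import datetime, timedelta, timezone
from typing import List

MAX_KEY_AGE = timedelta(days=90)

# Invented sample data using a subset of credential report columns (assumption).
SAMPLE_REPORT = "\n".join([
    "user,access_key_1_active,access_key_1_last_rotated",
    "alice,true,2024-01-01T00:00:00+00:00",
    "bob,true,2024-05-01T00:00:00+00:00",
    "carol,false,N/A",
])


def stale_key_users(report_csv: str, now: datetime) -> List[str]:
    """Return users whose active first access key is older than MAX_KEY_AGE."""
    stale = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row["access_key_1_active"] != "true":
            continue  # no active key, so nothing to rotate
        rotated = datetime.fromisoformat(row["access_key_1_last_rotated"])
        if now - rotated > MAX_KEY_AGE:
            stale.append(row["user"])
    return stale


as_of = datetime(2024, 6, 1, tzinfo=timezone.utc)
needs_rotation = stale_key_users(SAMPLE_REPORT, as_of)  # ["alice"]
```

Users flagged this way are the candidates either for automated rotation or, as the authors preferred, for moving off long-lived access keys to role-based access with temporary credentials.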
In the fifth and final article they explain how they detect security misconfigurations, indicators of compromise, and other anomalous activity, and how they developed and iterated on their incident response processes.
With the pace of new infrastructure and software deployments, they had to ensure they maintained strong security, and so they established a dedicated security team and identified tools to simplify the management of their cloud security posture, allowing them to easily identify and prioritize security risks.
- They used Amazon GuardDuty to keep up with the newest threat actor tactics, techniques, and procedures (TTPs) and indicators of compromise (IOCs). This saves them time and reduces complexity, because they don’t have to continuously engineer detections for new TTPs and IOCs for static events and machine-learning-based detections. It also allows their security analysts to focus on building runbooks and quickly responding to security findings.
- Discovered sensitive data with Amazon Macie for Amazon S3. To host their external website, they use a few public Amazon Simple Storage Service (Amazon S3) buckets with static assets. They don’t want developers to accidentally put sensitive data in these buckets, and they wanted to understand which S3 buckets contain sensitive information, such as financial or personally identifiable information (PII), so they used Amazon Macie to continuously scan their S3 buckets for sensitive data.
- Scanned for vulnerabilities with Amazon Inspector. They must scan their Amazon Elastic Compute Cloud (Amazon EC2) instances for known software vulnerabilities, such as Log4J, so they used Amazon Inspector to run continuous vulnerability scans on their EC2 instances and Amazon Elastic Container Registry (Amazon ECR) container images.
- Aggregated security findings with AWS Security Hub – With AWS Security Hub, their analysts can prioritize findings from GuardDuty, Macie, Amazon Inspector, and many other AWS services. They also use Security Hub’s built-in security checks and insights to identify AWS resources and accounts that have a high number of findings and act on them. They set this up as follows:
- Assigned a security tooling account as the delegated administrator for these services. The delegated administrator configures the services and aggregates findings from other member accounts.
- Used the AWS Security Reference Architecture and the associated scripts to assist with set up, which helped ensure they set up and configured the security services according to best practices.
- Used Security Hub’s new multi-Region aggregation to aggregate all findings into their primary Region.
- Integrated Jira with the security tooling account, following the steps outlined in How to set up a two-way integration between AWS Security Hub and Jira Service Management, to track ownership and remediation status. Their security analysts use Security Hub-generated Jira tickets to view, prioritize, and respond to all security findings and misconfigurations across their AWS environment.
- They built incident response plans and processes to quickly address potential security incidents and minimize the impact and exposure, following the AWS Security Incident Response Guide and NIST framework, to adopt the following best practices:
- Develop incident response playbooks and runbooks for repeatable responses to security events. Playbooks cover more strategic scenarios and responses, and are based on existing sample playbooks. Runbooks provide step-by-step guidance for their security analysts to follow when an event occurs; they used Amazon SageMaker notebooks and AWS Systems Manager Incident Manager runbooks to develop repeatable responses for pre-identified incidents, such as suspected command and control activity on an EC2 instance.
- Identify areas where they could accelerate responses to security threats by automating the response, using the AWS Security Hub Automated Response and Remediation solution as a starting point, so they didn’t need to build their own automated response and remediation workflow. The code is also easy to read, repeat, and centrally deploy through AWS CloudFormation StackSets. They used some of the built-in remediations like disabling active keys that have not been rotated for more than 90 days, making all Amazon Elastic Block Store (Amazon EBS) snapshots private, and many more. With automatic remediation, their analysts can respond more quickly and in a more holistic and repeatable way.
- Implemented quarterly incident response simulations. These simulations test how well prepared their people, processes, and technologies are for an incident. They included some cloud-specific simulations like an S3 bucket exposure and an externally shared Amazon Relational Database Service (Amazon RDS) snapshot to ensure their security staff are prepared for an incident in the cloud, using the results of the simulations to iterate on their incident response processes.
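The idea of using aggregated findings to spot the accounts that need attention first can be sketched as below. This is a hedged illustration, not Security Hub's own insight logic: the findings are invented, carry only the `AwsAccountId` and `Severity.Label` fields of the finding format, and `rank_accounts` and `URGENT_LABELS` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumption: only HIGH and CRITICAL findings count toward the ranking.
URGENT_LABELS = {"HIGH", "CRITICAL"}


def rank_accounts(findings: List[Dict]) -> List[Tuple[str, int]]:
    """Rank accounts by their number of urgent findings, highest first."""
    counts = Counter(
        finding["AwsAccountId"]
        for finding in findings
        if finding["Severity"]["Label"] in URGENT_LABELS
    )
    # Sort by count descending, then by account ID so the order is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented findings with made-up account IDs.
sample_findings = [
    {"AwsAccountId": "111111111111", "Severity": {"Label": "CRITICAL"}},
    {"AwsAccountId": "222222222222", "Severity": {"Label": "HIGH"}},
    {"AwsAccountId": "111111111111", "Severity": {"Label": "HIGH"}},
    {"AwsAccountId": "333333333333", "Severity": {"Label": "LOW"}},
]

ranking = rank_accounts(sample_findings)
# [("111111111111", 2), ("222222222222", 1)]
```

A ranking like this is one simple way to decide which tickets analysts pick up first, or which accounts to target in the next incident response simulation.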