This post presents a problematic pattern that I encountered in the infrastructure of one of my clients. I am calling this pattern a "log storm", simply because I haven't come across a more appropriate name. In practice, this architectural anti-pattern leads to an unnecessary number of log entries in a security system, which in turn incurs unnecessary usage costs. It can probably be encountered on various platforms and in various setups, but in this post I am focusing on an AWS-based set-up.
The set-up that I encountered had two important components: a multi-account cloud set-up and CloudTrail logging enabled. Both are very common elements of infrastructure, even in small companies and startups.
A lot of businesses, even at an early stage, find it very useful to segregate various classes of their assets, and putting assets into different accounts is one way to do that. The segregation might be introduced between departments (e.g., marketing resources vs. product resources) or functions (e.g., application environments vs. shared services, or production vs. staging environments). The goal is to increase the overall security of the organization by limiting the impact of a potential data breach.
CloudTrail is one of the most important security services available in AWS. It allows you to log each and every interaction with your management plane. It provides forensic material for future investigations and allows your organization to enforce the accountability of its employees and contractors.
These two components help bring certain risks down to acceptable levels, and they are quite often regulatory and/or contractual requirements, especially in high-impact businesses such as finance or healthcare.
A good Information Security Management System (ISMS) is, unfortunately, a complex structure. It has many moving elements of its own and naturally needs to accommodate all the changes in the business processes as well. After all, it plays a supporting/enabling role for the business core of an organization, but it is very sensitive to the specifics and details of the latter.
For a person governing the system, a natural response to this complexity is an attempt at simplification, which, among other things, means centralization. It's convenient to centralize identity management, asset management, access management, risk management, etc. It's only natural to work on centralizing log management as well.
Centralized log management offers a wide range of benefits: easier retention management, faster investigation and incident response, the ability to cross-analyze forensic data, uniform enforcement of log protections, and so on.
In a multi-account AWS environment, it makes sense to create a separate security account that holds the logs delivered from the other environments. For example, development and production applications can log events to local CloudWatch log groups with a reduced retention time, and these in turn can be transported to the central log-archive account for further processing.
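As a small illustration of the "short local retention" half of that idea, here is a minimal boto3 sketch; the log group name is purely hypothetical and 14 days is just an example value.

```python
# A minimal sketch of the "short local retention" part of the pattern.
# The authoritative, long-lived copy of the logs lives in the central
# log-archive account, so the source account keeps only a short buffer.
import boto3

logs = boto3.client("logs")

logs.put_retention_policy(
    logGroupName="/app/orders-service/production",  # hypothetical log group
    retentionInDays=14,  # short local buffer; must be one of the values CloudWatch Logs accepts
)
```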
It's tempting to use Lambda functions for the log transformation and routing. After all, a lot of people are familiar with serverless solutions, and they will almost certainly feel comfortable in one of the several available programming languages. On top of that, CloudWatch Logs provides subscription functionality that can target Lambda.
What could be easier? We subscribe the log group to a Lambda function; the function transforms each log entry and writes it to the centralized log-archive destination.
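To make the naive approach concrete, here is a minimal sketch of such a forwarding function. It assumes boto3, and the archive bucket name is purely hypothetical; CloudWatch Logs delivers subscribed events as base64-encoded, gzipped JSON.

```python
# A minimal sketch of the naive per-entry forwarder described above.
# Assumes the function's role can write to a central archive bucket
# in the security account (bucket name below is hypothetical).
import base64
import gzip
import json
import uuid

import boto3

s3 = boto3.client("s3")
ARCHIVE_BUCKET = "central-log-archive-example"  # hypothetical bucket in the security account


def handler(event, context):
    # Decode the CloudWatch Logs subscription payload (base64 + gzip + JSON).
    payload = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    data = json.loads(payload)

    # "Transform" here is just re-wrapping each entry; a real implementation
    # would normalize fields, enrich with metadata, etc.
    records = [
        {"logGroup": data["logGroup"], "logStream": data["logStream"], "event": e}
        for e in data.get("logEvents", [])
    ]

    # One object per invocation -- and, crucially, one invocation per delivery,
    # which is exactly what will later feed the log storm.
    key = f"{data['logGroup'].strip('/')}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=ARCHIVE_BUCKET, Key=key, Body=json.dumps(records).encode())
```

Wiring it up is then a single put_subscription_filter call on the source log group, plus an add_permission grant allowing logs.amazonaws.com to invoke the function.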
The problem with this solution arises when we make use of CloudTrail's logging of Lambda executions.
In this solution, we are creating a feedback loop. The monitored application or service creates a log entry, the entry is transformed by a Lambda function, and the execution of that Lambda creates another log entry. Ad infinitum. Soon we may discover that 90% of our CloudTrail logs consist of Lambda functions being executed in separate infinite loops.
The problem is even more severe if we decide to use several Lambda functions, e.g., for additional processing and for feeding a SIEM of some kind.
All of these entries are CloudTrail entries, and CloudTrail can be relatively expensive once we cross the free-tier threshold. On the other hand, we don't want to turn off logging of Lambda function executions entirely, because those functions could contain potentially damaging code and we want to maintain the log trail and accountability.
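For context, the reason every invocation of the routing function shows up in the trail is that the trail has Lambda data events enabled. A sketch of that configuration (the trail name is hypothetical) looks roughly like this:

```python
# A sketch of the CloudTrail configuration that makes every Lambda invocation
# appear in the trail. With this selector in place, each execution of the
# log-routing function itself becomes another CloudTrail entry -- the raw
# material of the feedback loop.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.put_event_selectors(
    TrailName="org-audit-trail",  # hypothetical trail name
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            # Data events for all Lambda functions: every Invoke call is recorded.
            "DataResources": [
                {"Type": "AWS::Lambda::Function", "Values": ["arn:aws:lambda"]}
            ],
        }
    ],
)
```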
The problem with the anti-pattern presented in the first part on this topic was the feedback loop that fed the Lambda events back into the local CloudTrail audit trail. Registering these events in the audit trail is in many cases necessary to maintain the accountability of the code executed within our environment.
The general idea of routing logs for centralization is sound; we just need to look at one factor that determines how much logging 'overhead' a specific implementation is going to cost us.
The factor in question (let's call it factor S) is the number of second-order log entries needed to process one original entry. Its value varies depending on the specific design. If we use one Lambda function to put the log entry into an S3 bucket in another account, the S factor is going to be 1.
But, for the sake of example, let's analyze another implementation in which we:
1. execute a Lambda function that routes the entry to the second account
2. execute a second Lambda function that routes the entry to a specific log group (e.g., based on the application that produced it)
3. execute a third Lambda function that transforms the entry and feeds it to a SIEM
In this implementation the value of S is 3, meaning that for each original log entry we will have 3 second-order entries coming from Lambda function executions: one of them feeds back into the audit trail in the original account, and the remaining two into the audit trail in the second account.
The log storm begins to rise when these second-order log entries are in turn subjected to the same log processing and routing functions, generating another 9 third-order entries. And the process does not stop there; it goes on ad infinitum, accumulating unnecessary entries over weeks and months and driving ITSec infrastructure costs up.
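A quick back-of-the-envelope calculation shows why S is the number to watch. Treating each "generation" of entries as producing S times as many entries in the next one, the total volume per original event is the series 1 + S + S^2 + ..., which stays bounded only when S < 1. The numbers below are purely illustrative:

```python
# A back-of-the-envelope sketch of how the S factor compounds. Each generation
# of log entries produces S times as many entries in the next generation, so the
# total volume per original event is the series 1 + S + S^2 + ... It stays
# bounded (close to 1 / (1 - S)) only when S < 1; for S >= 1 it grows without limit.

def total_entries_per_event(s: float, generations: int = 10) -> float:
    """Sum the first `generations` + 1 terms of 1 + S + S^2 + ..."""
    return sum(s ** n for n in range(generations + 1))

for s in (3.0, 1.0, 0.005):
    print(f"S = {s}: ~{total_entries_per_event(s):,.3f} entries after 10 generations")

# S = 3     -> ~88,573 entries per original event after only 10 generations
# S = 1     -> 11 entries, still climbing linearly with every generation
# S = 0.005 -> ~1.005 entries, i.e. effectively no overhead (one batched
#              invocation per 200 buffered entries, as discussed below)
```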
The way out of this is of course reducing the S factor.
But it has to be done in a smart way. Remember that you need to account for all of the log entries that are going to be fed back into the CloudTrail audit trail. So, if you decide to create a queue that collects X log entries before executing the routing function, you might be thinking that S is now:
S = 1/X,
because the routing Lambda is executed only once per X entries (for example, S = 1/200 for X = 200). But remember to account for the function that puts each entry into the queue. If this function is executed for every entry (e.g., as a CloudWatch Logs subscription target), the actual value is:
S = 1/X + 1,
which is slightly worse than what we started with.
So far, the best solution I have found for reducing the second-order log entries is the one based on CloudWatch Logs destinations and Kinesis Data Streams.
The overall architecture of the Kinesis-based solution is the following:
The EC2 workload (1) is the source of the original log events. The events are fed to the Kinesis Data Stream (3) in the destination account with the help of a CloudWatch Logs destination.
Finally, the Kinesis Data Stream collects the entries and, at defined intervals, invokes the appropriate Lambda functions for routing and transformation.
The advantages of this implementation are the following.
First of all, CloudWatch Logs destinations do not create additional second-order entries. They don't need to, because their functionality is simple and well defined; it's impossible to, say, hide a backdoor in such a destination. They can also easily be subscribed to from the log-group level.
Furthermore, a Kinesis Data Stream in combination with a delivery stream allows you to define a buffer size or timeout interval for event processing, so that the additional transformation function is executed once for a whole batch (tens, hundreds, thousands) of entries currently contained in the buffer.
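To make the wiring more tangible, here is a rough sketch of the cross-account setup. All names, account IDs and ARNs are hypothetical, the IAM roles are assumed to already exist with the right trust policies, and in practice the two halves run with credentials for their respective accounts.

```python
# A rough sketch of the cross-account wiring: a CloudWatch Logs destination in
# the security account wraps the Kinesis data stream, and the application
# account subscribes its log group to that destination.
import json

import boto3

# --- In the security / log-archive account (222222222222) -------------------
logs_central = boto3.client("logs")

# The destination wrapping the Kinesis data stream. Writing to it produces
# no second-order CloudTrail entries of its own.
destination = logs_central.put_destination(
    destinationName="central-log-destination",
    targetArn="arn:aws:kinesis:eu-west-1:222222222222:stream/central-log-stream",
    roleArn="arn:aws:iam::222222222222:role/CWLtoKinesisRole",
)["destination"]

# Allow the source (application) account to subscribe to the destination.
logs_central.put_destination_policy(
    destinationName="central-log-destination",
    accessPolicy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "111111111111"},
            "Action": "logs:PutSubscriptionFilter",
            "Resource": destination["arn"],
        }],
    }),
)

# --- In the source / application account (111111111111) ---------------------
logs_source = boto3.client("logs")

# Subscribe the application's log group to the cross-account destination.
# No roleArn is needed here; access is granted by the destination policy above.
logs_source.put_subscription_filter(
    logGroupName="/app/orders-service/production",  # hypothetical log group
    filterName="to-central-archive",
    filterPattern="",  # forward everything
    destinationArn=destination["arn"],
)
```

The batching itself then happens on the consumer side, for example via a Lambda event source mapping on the stream with a large batch size, or via a Kinesis Data Firehose delivery stream whose buffering hints (size and interval) control how many entries each transformation invocation sees.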
This way, we can significantly reduce the S factor and eliminate the potential for log storms.
The generalized Kinesis-based solution can be found in the AWS Solutions Library.
In this post, we discussed a solution to the cloud security anti-pattern "log storm", which was introduced in the first part.
The problem with the anti-pattern is that it creates a feedback loop that leads to excessive logging, generating unnecessary entries over weeks and months, and driving up ITSec infrastructure costs.
The solution to this anti-pattern involves reducing the number of second-order log entries. A sample solution based on Amazon Kinesis has been presented.