No-Index Log Management at S3 Scale

Database indexes are invaluable for information systems with low-throughput, low-latency, and high-consistency requirements. However, creating and maintaining indexes consumes both compute and disk space, along with the associated operational overhead. Often, the resources, time, and cost required to maintain indexing far outweigh the performance objectives of the log management tool itself.

LOGIQ’s log analytics takes a unique no-index approach to log management, allowing infinite scale while ensuring search and query performance. To achieve this, we had to solve the problem of infinite scale for both our data and metadata stores.

LOGIQ maintains its metadata in Postgres. However, Postgres cannot scale infinitely without incurring significant cost. Our hybrid metadata layer manages the migration of metadata tables between Postgres and S3. Metadata that has aged is seamlessly tiered to S3 and fetched on demand when needed. The key/value nature of S3 allows us to fetch granular metadata on demand without maintaining additional indexes.

A similar approach is applied to data. Incoming data is broken into chunks and stored in a partitioned manner, so object lookups for, say, a namespace or an application do not need additional indexes. The object key implicitly encodes the index information. This makes lookups and retrievals efficient when data that is not in the local disk cache needs to be fetched from the S3 layer. A minimal sketch of the idea is shown below.
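To illustrate the idea, here is a minimal, hypothetical sketch of how a partition-encoding object key could work. The bucket name, key layout, and helper functions are our own illustration, not LOGIQ’s actual on-disk format; the point is that a plain S3 prefix listing stands in for an index lookup.

from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "logiq-data-example"  # hypothetical bucket name


def object_key(namespace: str, application: str, ts: datetime, chunk_id: str) -> str:
    """Encode the partition information (namespace, application, hour) in the key itself."""
    return (
        f"logs/{namespace}/{application}/"
        f"{ts:%Y/%m/%d/%H}/chunk-{chunk_id}.json.gz"
    )


def chunks_for(namespace: str, application: str, hour: datetime):
    """Find all chunks for an application in a given hour with a prefix listing --
    no separate index is consulted."""
    prefix = f"logs/{namespace}/{application}/{hour:%Y/%m/%d/%H}/"
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]


# Example: everything the "payments" app logged in the current hour.
now = datetime.now(timezone.utc)
print(object_key("prod", "payments", now, "000001"))
print(chunks_for("prod", "payments", now))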

LOGIQ’s architecture offers unique advantages by using S3 as its primary storage location. Yes! S3 is not a secondary storage tier in our architecture.

  • S3 storage for data and metadata

    Storing both data and metadata in S3 rather than on local storage significantly reduces the total cost of the solution. Most scaled-out self-service log analytics solutions require costly management of storage volumes at scale! LOGIQ abstracts all of it behind an S3 API.

  • No-Index log management

    Eliminates the costly compute and storage that would otherwise be consumed continuously by indexing, index rebuilds, and similar maintenance.

  • Eliminate data egress costs

    When running in public cloud environments, deploying LOGIQ with the S3 bucket in the same region eliminates the egress and data transfer charges, which can run into tens of thousands of dollars when sending data to an external cloud provider.

LOGIQ is the first real-time platform to bring together the benefits of an object store (scalability, one-hop lookup, better retrieval, ease of use, identity management, lifecycle policies, data archival, etc.) and distributed compute via Kubernetes, along with highly configurable dashboarding, query, alerting, and search. As a result, we provide significantly reduced cost, easy integration with other analytics tools, and operational agility.


Firelens demystified

AWS FireLens is a log routing agent for Amazon Elastic Container Service (ECS) containers. Applications on ECS run as Docker containers. Containers can run on serverless infrastructure managed by ECS using the Fargate launch type, or, for more control over the infrastructure, they can be hosted on a cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances. In both scenarios, AWS manages the networking, storage, security, IAM, and other services required to run the containers.

FireLens for Amazon ECS enables the administrator to use task definition attributes to route logs to external log aggregators. It unifies the data collection across the ECS cluster. Its pluggable architecture allows adding data sources, parsers, filter/buffering, and output plugins.

The biggest advantage of FireLens is that you can connect almost any service endpoint, as long as the data sink can process general-purpose JSON over HTTP, Fluent Forward, or TCP protocols. The FireLens magic is all about transforming the log output of ECS containers and generating the necessary routing configuration for sending logs to the logging service.

To use FireLens, define the log collector and the sink. The sink can be any log aggregation provider, such as LOGIQ Log Insights. Let us now see how to put this together. We need the following:

  • A log router container with FireLens configuration, marked as essential.
  • Application containers specifying the "awsfirelens" log driver.
  • A task IAM role ARN with the permissions needed to route logs.

Below you will find a few "logConfiguration" examples that can be used in your task definition. Note how the "logDriver" is set to "awsfirelens". The "options" contain additional attributes for the log sink where the log data will be sent.

"logConfiguration": {
        "logDriver": "awsfirelens",
       "options": {
                 "Name": "forward"
                 "Port": "24224",
                 "Host": "logiq.example.com"
}
}

The "awsfirelens" log driver allows you to specify a Fluentd or Fluent Bit output plugin configuration. Your application container logs are routed to a sidecar or independent FireLens container inside your cluster, which further routes your container logs to the destination defined in your task's "logConfiguration". Additionally, you can use the options field of the FireLensConfiguration object in the task definition to serve advanced use cases, for example loading a custom configuration file from S3:

"firelensConfiguration" : {
      "type" : "fluentbit",
      "essential":true,
      "options" : {
         "config-file-value" : "arn:aws:s3:::mybucket/myFile.conf",
         "config-file-type" : "s3"
      }
   }

Container logs are sent to the FireLens container using the Docker log driver. When the ECS Agent launches a task that uses FireLens, it constructs a Fluent configuration file containing:

  • A specification of the log source describing how to gather the logs from the container

  • An ECS metadata record transformer

  • Optional user-provided configuration. If you specify your own configuration file, FireLens uses the "include" directive to import it into the generated configuration file.

  • Log destinations, or sinks, derived from the task definition

The following snippet shows a configuration for including ECS metadata like container and cluster details.

{
   "containerDefinitions" : [
      {
         "image" : "906394416424.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:latest",
         "firelensConfiguration" : {
            "options" : {
               "enable-ecs-log-metadata" : "true"
            },
            "type" : "fluentbit"
         },
         "name" : "log_router",
         "essential" : true
      }
   ]
}

To demonstrate how FireLens works end to end, below is an example task definition containing an HTTP web server and a FireLens sidecar container that routes logs to the LOGIQ server. Replace the task execution and task IAM role ARNs if your roles are named differently from the defaults used below, and substitute your account ID for XXXXXXXXXXXX in the following example:

{
   "family" : "firelens-logiq",
   "executionRoleArn" : "arn:aws:iam::XXXXXXXXXXXX:role/ecs_task_execution_role",
   "taskRoleArn" : "arn:aws:iam::XXXXXXXXXXXX:role/ecs_task_iam_role",
   "containerDefinitions" : [
      {
         "memoryReservation" : 50,
         "essential" : true,
         "firelensConfiguration" : {
            "type" : "fluentbit",
            "options" : {
               "enable-ecs-log-metadata" : "true"
            }
         },
         "image" : "amazon/aws-for-fluent-bit:latest",
         "name" : "log_router_logiq",
         "logConfiguration" : {
            "logDriver" : "awsfirelens",
            "options" : {
                "Host" : "logiq.example.com", 
"Name" : "forward",
"Port" : "24224" } } }, { "essential" : true, "memoryReservation" : 100, "logConfiguration" : { "options" : { "Host" : "logiq.example.com", "Name" : "forward", "Port" : "24224" }, "logDriver" : "awsfirelens" }, "name" : "app", "image" : "httpd" } ], }

Understanding log formats and protocols – Part 1

Introduction

In this multi-part series, we are going to explore log formats and the protocols for sending them: two of the most fundamental pieces of any logging system. In Part 1, we will go over log formats and how metadata is encoded. Very little has changed as far as these core pieces are concerned, even as technology and computing have evolved.

We will start the journey with an application. It could be a system application, one written by a user, and so on. Applications write out human-readable messages to convey information. These messages help both the user and the developer understand, indirectly, the state of the application. Two additional critical pieces of information are needed: when the information was produced, and how good or bad it is for the reader looking at it. In essence, we have uncovered three mandatory parts of any log message in any logging system: the information, the when, and the how good/bad. These are typically referred to as the message, the timestamp, and the severity.

Applications run within an operating system, or nowadays what we call the cloud operating system. For an operating system dealing with many applications, it is natural to separate log messages from different applications and group them together. Here we see the fourth mandatory piece of any logging system: the name of the application itself, typically referred to as the application name.

As is universally known, applications are written by humans, and humans have a tendency to make mistakes. Like it or not, applications sometimes die horrible deaths or are forced to restart, through no real fault of their own. Because of this, the operating system has to deal with multiple incarnations of an application, and for the intended reader of the log message to understand this discontinuity, a fifth vital piece must be included: the process identifier, incarnation identifier, or instance identifier. This is typically referred to as the process ID or the instance ID.

So far we have a good set of mandatory fields that are essential for an operating system to organize the logs created by an application so that a human reading them can make sense of them (a minimal sketch of such a record follows the list below):

  • Message
  • Timestamp
  • Severity
  • Application Name
  • Process Identifier / Instance Identifier
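As a minimal sketch, these five mandatory fields map naturally onto a small record type. The class and field names are our own illustration, not a standard, and the record is populated with the sample Syslog message we will dissect in the next section:

from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogRecord:
    """The five mandatory fields of a log message."""
    timestamp: datetime  # the "when"
    severity: int        # how good or bad the information is
    app_name: str        # which application wrote it
    proc_id: str         # process / incarnation / instance identifier
    message: str         # the human-readable information


record = LogRecord(
    timestamp=datetime(2021, 3, 29, 6, 15, 28),
    severity=6,  # Informational (see the severity table further below)
    app_name="postfix/anvil",
    proc_id="17483",
    message="statistics: max cache size 1 at Mar 29 06:12:05",
)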

RFC3164 - BSD Syslog protocol

Armed with this information, I think we are ready to segue into the first real standardized log format, which came from the BSD folks: RFC 3164, the BSD Syslog protocol. Let us look at a raw RFC3164 message:

<14>Mar 29 06:15:28 customer-tooling postfix/anvil[17483]: statistics: max cache size 1 at Mar 29 06:12:05

Looking at the fields, most of the mandatory fields we just talked about are easy to guess. Let's see if you spotted the same ones:

  • Message – “statistics: max cache size 1 at Mar 29 06:12:05”
  • Timestamp – “Mar 29 06:15:28”
  • Severity – ???
  • Application Name – “postfix/anvil”
  • Process Identifier/ Instance Identifier – “17483”

It’s not quite obvious what the severity is, so let us look at that. RFC3164 encodes the severity in a priority field at the very beginning of the message, occupying at most the first 5 characters. Ref: https://tools.ietf.org/html/rfc3164#section-4.1.1

So let us decode what “<14>” means and what the severity of the message is. The BSD Syslog protocol calls this field the priority field.  The priority field encodes two pieces of information – the facility and the severity. So we have now encountered a new log message construct – the facility.

As you will recall, we discussed the application name in the message. Operating systems group application processes into subsystems, also referred to as facilities. For example, all kernel messages and processes get assigned the kernel facility, while messages coming from the mail subsystem get grouped under a separate facility, the mail system.

Facility and Severity are represented by integer codes in the BSD Syslog specification. Ref:  https://tools.ietf.org/html/rfc3164#section-4.1.1

Getting back to decoding the severity of the message: the priority field is calculated by taking the facility integer value, multiplying it by 8, and adding the severity integer value. So:

Priority_Code = Facility_Code * 8 + Severity_Code

So, to find the severity, we divide the Priority_Code by 8 and take the remainder. For example, with a priority value of 14, the facility is the quotient, 14 / 8 = 1, and the severity code is the remainder, 14 - 8 * 1 = 6.
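As a minimal sketch, here is the same decoding in Python (the helper name is ours, not part of the specification):

def decode_priority(pri: int):
    """Split an RFC3164 priority value into (facility, severity)."""
    facility = pri // 8  # quotient
    severity = pri % 8   # remainder
    return facility, severity


# "<14>" from the sample message above:
print(decode_priority(14))  # (1, 6) -> facility 1 (user-level), severity 6 (Informational)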

Now let’s decode what these two numbers mean. Let us look at the BSD protocol encodings for Severity and Facility below. The two tables below are reproduced from the section 4.1.1 referenced above in the BSD Syslog protocol document.

Looking at the two tables, we can now conclude that the message was sent with a severity of Informational, so nothing serious to report! The facility was user-level messages.


     Numerical Code         Severity
       
           0       Emergency: system is unusable
           1       Alert: action must be taken immediately
           2       Critical: critical conditions
           3       Error: error conditions
           4       Warning: warning conditions
           5       Notice: normal but significant condition
           6       Informational: informational messages
           7       Debug: debug-level messages
 
     Numerical Code         Facility

           0       kernel messages
           1       user-level messages
           2       mail system
           3       system daemons
           4       security/authorization messages (note 1)
           5       messages generated internally by syslogd
           6       line printer subsystem
           7       network news subsystem
           8       UUCP subsystem
           9       clock daemon (note 2)
          10       security/authorization messages (note 1)
          11       FTP daemon
          12       NTP subsystem
          13       log audit (note 1)
          14       log alert (note 1)
          15       clock daemon (note 2)
          16       local use 0  (local0)
          17       local use 1  (local1)
          18       local use 2  (local2)
          19       local use 3  (local3)
          20       local use 4  (local4)
          21       local use 5  (local5)
          22       local use 6  (local6)
          23       local use 7  (local7)

Summary

In summary, as we conclude part 1 of the “Understanding log formats and protocols” series, we have seen what the mandatory fields of a log message are and also looked at the BSD Syslog protocol encoding for a log message, also commonly referred to as an RFC3164 log message.

Kubernetes Entity Hierarchy

Need For Search

We are all accustomed to search. We search to find answers. In the log analytics world, those answers have improved customer retention and aided better decisions. But search is a tedious process.

Well, what if we didn't have to search? What if the information were arranged in such a way that we could discover it? Remember the old search engines like Excite, where the web was divided into directories, everything arranged in a hierarchy? What if we could discover our logs in a similar way?

In our infrastructure, there could be thousands of containers running. If we have to root-cause an issue specific to one container, the usual way is to go to a log search page and apply filters to drill down to the specific container we need to look into. The filters depend on how we have segregated the logs from our infrastructure; we need to apply multiple filters such as the namespace, the StatefulSet, or pod identifiers.

Now let's think for a moment. This need not be a haystack. In Kubernetes-based deployments, a hierarchy already exists: each Kubernetes cluster has namespaces, a namespace has multiple Kubernetes entities like StatefulSets or Deployments, and each of these has one or more pods.


Ultimately, what we need are the logs from these individual containers. Instead of searching, what if we could just discover them the way you discover files in directories? A rough sketch of the idea is shown below.
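To make this concrete, logs could be addressed by a directory-like path that mirrors the Kubernetes hierarchy. The path layout, names, and helper below are purely illustrative assumptions, not an actual LOGIQ API:

# Hypothetical: address logs by their place in the Kubernetes hierarchy,
# the same way you would walk a directory tree.
hierarchy = {
    "prod-cluster": {                      # cluster
        "payments": {                      # namespace
            "checkout-deployment": [       # Deployment / StatefulSet
                "checkout-7d9f5-abcde",    # pods
                "checkout-7d9f5-fghij",
            ],
        },
    },
}


def discover(cluster: str, namespace: str, workload: str):
    """Walk the hierarchy to the pods whose logs we want -- no search query needed."""
    return hierarchy[cluster][namespace][workload]


# "cd" straight to the container logs of interest:
for pod in discover("prod-cluster", "payments", "checkout-deployment"):
    print(f"prod-cluster/payments/checkout-deployment/{pod}")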

Log Visualization – Musings – Part 1

Searching through logs becomes ineffective when unknown unknowns abound and data volume grows. Log visualization is key to navigating large data volumes. On most modern screens, one can comfortably display 50-100 lines of text for viewing at a time; anything more gets hard to read. This is what we call the "50-100" rule.

My simple-minded laptop generates 4,000 Syslog lines in 15 hours. One would need to make 40-80 clicks to scroll through those 4,000 lines of logs when looking for something anomalous! The scale problem gets far worse in a cloud or corporate environment, due to the sheer number of machines continuously running applications and generating log data.

So, how do we make it easy for a user to go beyond 50-100 lines? We don't necessarily mean they can read all of the lines beyond the first 50-100, but can there be visual representations that make it easy to navigate large amounts of text for specific purposes?

Here's an example of presenting more lines on the display than the 50-100 lines rule allows. The Sublime Text editor has a zoomed-out code area, or minimap, on the right-hand side. A user can explore large amounts of code using this minified side view and jump to parts of the source code with ease. Notice that the visual representation here is not meant for the user to read all of the code; it acts as an aid to faster code navigation.

Sublime Text Editor with Minimap Example

While Sublime's minimap is a beautiful code navigation feature, it does not serve log text visualization well, for several reasons:

 

  • Log text doesn't have a fixed format or fixed color labeling.
  • A log text workspace is too big to be handled by an editor – 10,000 to 1,000,000 text lines.
  • Unlike a metrics plot, a minimap does not help the user visually spot logging anomalies.
 
Eyeballing log lines is analogous to examining metrics data points manually instead of using a visualization tool such as a simple X-Y plot. Using such a plot appropriately, without being an expert in the data, one can easily pick out unusual activity such as unexpected bursts or discontinued segments. What if a user could see logs the same way? In general, human beings do better when visual cues are present.

So here's an idea: we are going to plot each log line as a dot in a graph, just like you would plot a dot for a metric such as CPU utilization. An operator then uses it to isolate log abnormalities visually. What would this look like? How would such a system work? That's for a different article, I suppose, but a rough sketch of the plotting idea follows.
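To make the idea slightly more concrete, here is a minimal sketch that plots log events as dots over time, colored by severity. The sample data is fabricated and the plotting choices are assumptions for illustration only:

import random
from datetime import datetime, timedelta

import matplotlib.pyplot as plt

# Fabricated sample: one dot per log line, (timestamp, severity).
start = datetime(2021, 3, 29, 6, 0, 0)
times = [start + timedelta(seconds=random.randint(0, 3600)) for _ in range(500)]
severities = [random.choices(range(8), weights=[1, 1, 2, 5, 10, 20, 50, 30])[0] for _ in times]

# Plot each log line as a dot: x = when it happened, y = how severe it was.
plt.scatter(times, severities, s=8, c=severities, cmap="RdYlGn")
plt.yticks(range(8), ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"])
plt.xlabel("time")
plt.ylabel("severity")
plt.title("Log lines as dots: bursts and error clusters stand out visually")
plt.show()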