Getting Kubernetes logs into BigQuery cost-effectively

How we navigated those JSON logs safely to BigQuery harbour

At Songkick, we’re always looking for new ways to make the most of available technology so that we can work smarter and faster, and sustain growth as efficiently and effectively as possible.

This spirit inspired us to move away from a conventional datacenter to Google Compute Engine instances a couple of years ago, and has now sparked a project to move gradually towards Kubernetes clusters. This initiative requires us to first establish a firm foundation of some key pieces of technology before we start shifting and lifting applications into pods: monitoring, security, and a couple more, chief among which is logging.

In our current cloud setup, we import all logs from our applications into BigQuery because it makes searching for patterns and debugging things extremely easy. We wanted to keep that functionality for applications running in Kubernetes.

Kubernetes in Google Cloud

Google offers a managed Kubernetes product called Google Kubernetes Engine (GKE) that allows us to focus more on the application side of things, and takes the lower-level cluster management out of our hands. All logs generated from a GKE cluster are in JSON format and get ingested by default by the Cloud Logging product (formerly Stackdriver Logging).
Cloud Logging has many interesting features, like setting filters to avoid ingestion charges (you have to pay for log messages past a threshold) and the ability to route messages to destinations called “sinks”, one of which offers a direct-to-BigQuery option. This would have been optimal (and would have made this an awkwardly brief post), but this feature uses a streaming model which, given the volume of log messages our applications generate, would have been expensive in the long run.

So, the next logical step was to use a Google Cloud Storage (GCS) sink, which lets us save all the selected log messages into a bucket of our choosing, where they could be easily imported into BigQuery using the NEWLINE_DELIMITED_JSON format option.

Setting up our apps to log to k8s

The very first step was happily quite straightforward, thanks to the brilliant work of previous generations of Songkickers on the songkick-logging gem. This little gem handles the logging functionality of all our Ruby apps, adjusting its behaviour depending on which environment the code is running in. It was architected with a plug-n-play model in mind, which allows it to channel messages to syslog, a test file, standard output, or Sentry (an external error discovery service) very easily.

In order to get our apps Kubernetes-ready we needed to add a new component that would send log messages to stdout as JSON strings:
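The original snippet is not reproduced here. As a rough illustration only, a layout of this kind could look like the Ruby sketch below. The class name and the four fields (hostname, level, message, timestamp) come from this post; the constructor arguments, the method names `render` and `log`, and the timestamp format are assumptions, and the real songkick-logging plug-in interface may well differ.

```ruby
require "json"
require "socket"
require "time"

# Hypothetical sketch: NOT the actual songkick-logging implementation.
class StdoutJSONLayout
  # io and hostname are injectable (an assumption) so the class is easy to test.
  def initialize(io: $stdout, hostname: Socket.gethostname)
    @io = io
    @hostname = hostname
  end

  # Builds a single-line JSON string for one log event.
  def render(level, message, time: Time.now)
    event = {
      "hostname"  => @hostname,
      "level"     => level.to_s.upcase,
      "message"   => message.to_s,
      # ISO 8601 in UTC with millisecond precision (an assumed format).
      "timestamp" => time.getutc.iso8601(3)
    }
    JSON.generate(event)
  end

  # Writes the event to the configured IO, one JSON document per line.
  def log(level, message, time: Time.now)
    @io.puts(render(level, message, time: time))
  end
end
```

Calling `StdoutJSONLayout.new.log(:info, "Booted")` would print one JSON object per line to standard output, which is the shape the cluster's logging agent can pick up as structured data.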

Next, we just needed to tell songkick-logging to use the new StdoutJSONLayout class when our applications find themselves running inside a Kubernetes pod. Happy days! 💃🕺🏽

But let’s talk about that JSON, though

Here’s how a single Kubernetes log entry looks:

As can be seen, the message resulting from a logger invocation in one of our apps ends up in the jsonPayload field of the Kubernetes log entry and includes the fields we declared in our new logging class: hostname, level, message, and timestamp. The name of the pod, the environment, and other bits of information are part of the labels.

Brief diversion first, though: let’s talk about those labels (cue suspenseful track).

Kubernetes uses labels as a way to tag, and easily perform query-based operations on, the wide variety of objects that live in the cluster. These labels are part of the metadata definition of each resource. In the official documentation we find:

Labels are key/value pairs. Valid label keys have two segments: an optional prefix and name, separated by a slash (/).

This optional prefix differentiates between keys that are private to your cluster setup and keys that are provided by Kubernetes or some other third party.

This feature became a multiple-slash-shaped stone in our cool suede trainers, since while the JSON produced by Kubernetes is valid, BigQuery had a very distinct opinion about what constitutes a valid column name:

Invalid field name “k8s-pod/environment”. Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
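To make the rule in that error message concrete, here is a small Ruby sketch that checks a key against it. The regular expression is derived only from the wording of the message quoted above; the constant and method names are made up for this illustration.

```ruby
# Derived from the error message: letters, numbers and underscores only,
# starting with a letter or underscore, at most 128 characters long.
BIGQUERY_FIELD_NAME = /\A[A-Za-z_][A-Za-z0-9_]{0,127}\z/

# True when the given key would pass the rule stated in the error message.
def valid_bigquery_field_name?(name)
  BIGQUERY_FIELD_NAME.match?(name)
end

# Returns the label keys that would be rejected as column names.
def rejected_label_keys(labels)
  labels.keys.reject { |key| valid_bigquery_field_name?(key) }
end
```

Any label key that carries a Kubernetes-style prefix contains a slash (and often dots or hyphens too), so it fails this check even though it is perfectly valid JSON.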

This means we can’t use NEWLINE_DELIMITED_JSON to import the JSON directly. A bummer indeed, but not a show-stopper at all. BigQuery gives you ways to handle all sorts of format issues, and the option we went for was this: import all JSON messages as CSV into a temporary table with a single STRING column, so that each log entry lands untouched as one string value, then use the JSON_EXTRACT SQL function to extract the fields we actually care about and append them to the real logs table.
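The two-phase idea can be mimicked in memory with a few lines of Ruby. This is only an analogy for what happens inside BigQuery, not the import itself: `load_raw_rows` stands in for the single-column temporary table and `extract_log_row` for the JSON_EXTRACT step. Both method names, the output column names, and the assumption that the environment label sits under a top-level `labels` key are illustrative.

```ruby
require "json"

# Phase 1: keep every line as an opaque string, like a table with a
# single STRING column. Nothing is parsed or validated here.
def load_raw_rows(ndjson_text)
  ndjson_text.each_line.map(&:chomp).reject(&:empty?)
end

# Phase 2: parse one raw row and pull out only the fields we care about.
# Awkward keys such as "k8s-pod/environment" are read by path and never
# become column names themselves.
def extract_log_row(raw)
  entry = JSON.parse(raw)
  payload = entry.fetch("jsonPayload", {})
  {
    "timestamp"   => payload["timestamp"],
    "level"       => payload["level"],
    "message"     => payload["message"],
    "hostname"    => payload["hostname"],
    "environment" => entry.dig("labels", "k8s-pod/environment")
  }
end
```

The point of the detour is that the problematic keys only ever appear inside a string value, so the destination schema can use whatever column names we choose.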

Listing our ingredients

Now that we have a (hopefully) working plan, let’s outline the things we need to get this little boat merrily down the stream:

  • A GCS bucket
  • A sink and a filter definition in Cloud Logging to send selected messages to that bucket and discard the unwanted ones to avoid ingestion charges
  • A dataset and a table to host the logs in BigQuery
  • A service account with the right permissions to read from the GCS bucket and perform data operations in BigQuery
  • A shell script that handles the import operation
  • A cron job that invokes the script at a time of our choosing

Setting up the bucket, sink, and table (without mopping)

Whenever possible, at Songkick we aim to treat infrastructure as code. For this purpose, we rely on Terraform, which makes it very easy to work with Google Cloud resources.

We start by setting up a bucket with the low-cost storage class called NEARLINE and adding an expiration rule that deletes all files after 30 days.

Next we need to configure Cloud Logging to achieve two objectives: first, to save in the bucket we just created every entry that has the resource type k8s_container and belongs to a container named app; second, to exclude from ingestion any other log entries of type k8s_container.

This sets up the “source” side of things, so the next step is to prepare the “target”. For that purpose, we need a dataset and a table in BigQuery that will host all the logs fetched from the GCS bucket.

The above creates a dataset for our logs table. This next bit creates the table itself and defines the schema we want to work with (timestamp, level, message, application, etc.).

Since log messages can reach high volumes, we decided to partition the table by day using the timestamp column. We don’t need log messages older than 30 days, so they can be purged after that.

It’s time to kube!

Now that the source and the target have been created, it’s time to define the Kubernetes resources that will do the actual deed! Let’s work our way from the inside out (like that eager little alien in the 1979 classic family film).

  • First, we need access to the gcloud and bq commands.
  • Next, we need a shell script that will do a bit of Google Cloud setup, execute all the bq queries for the desired date, and perform some cleanup when done.
  • Next, we need to add that script to the Kubernetes cluster.
  • Finally, we need to set up a cron job that invokes the script regularly.

So, to start with, here’s the heart of what the shell script does (brace for slashes galore):

We resorted to replacing the characters BigQuery didn’t like in the keys with underscores, which get cleaned up and turned into something sensible at a later stage. Since the jsonPayload contents are escaped JSON strings, we also need to remove double quotes and slashes. It’s a bit of an eyeful, but it gets the work done.
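For readers who would rather see the key clean-up without the escaping noise, here is the same idea expressed in Ruby on already-parsed JSON. This is an alternative illustration, not the text substitution our shell script performs, and the method names are hypothetical.

```ruby
require "json"

# Replaces every character outside letters, digits and underscores with an
# underscore, and prefixes an underscore if the result does not start with
# a letter or underscore.
def sanitize_key(key)
  cleaned = key.gsub(/[^A-Za-z0-9_]/, "_")
  cleaned.match?(/\A[A-Za-z_]/) ? cleaned : "_#{cleaned}"
end

# Walks nested hashes and arrays, renaming keys and leaving values untouched.
def sanitize_keys(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, inner), out|
      out[sanitize_key(key.to_s)] = sanitize_keys(inner)
    end
  when Array
    value.map { |item| sanitize_keys(item) }
  else
    value
  end
end
```

One caveat of any underscore-replacement scheme: two distinct keys such as `a-b` and `a/b` collapse into the same name, so it only works when the incoming keys are known not to collide.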

After this, our next mission is to have this script available inside the cluster. For this, we can make use of the ConfigMap resource:

Now, our shell script can be mounted anywhere within the current cluster namespace via the import-logs-script ConfigMap.

Finally, we need to define a cron job that provides gcloud and bq, and runs our nifty shell script at specific times. We can accomplish this using the CronJob resource.

This defines the following key items:

  • The schedule: Our job will run every hour, at minute 1. This is because Cloud Logging writes the logs from our GKE cluster into the GCS bucket sink in hourly batches.
  • The container image: We want our job to run in a Google Cloud SDK container, since it will have both the gcloud and the bq commands ready for us to use.
  • The volume mount: The ConfigMap that contains our shell script is mounted into the /root/script/ path (and made executable).
  • The history limits: How many successful and failed job executions to keep.

And that is it! We left out the service account setup because there are tons of posts about that already and we wanted to keep this one focused. We now have a table in BigQuery that gets updated one minute past each hour with all application log messages from our clusters. Here’s a sample of what the imported data looks like:

Thanks for reading this far! We hope this short coming-of-age story of log message transportation and transformation managed to add a few insights (or at least a little smile).

Keep an eye on this space, as more stories of our Kubernetes migration are sure to appear every now and then!

Getting Kubernetes logs into BigQuery cost-effectively was originally published in Songkick Tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Songkick
