How to manage Elasticsearch data across multiple indices with Filebeat, ILM, and data streams

Indices are an important part of Elasticsearch. Each index keeps your data sets separated and organized, giving you the flexibility to treat each set differently and making it simple to manage data through its lifecycle. Elastic makes it easy to take full advantage of indices by offering ingest methods and management tools that simplify the process.

In this post, we’ll use Filebeat to ingest data from multiple sources into multiple indices, and then we’ll use index lifecycle management (ILM) and data streams to further control that data.

What is Filebeat?

Filebeat is a lightweight shipper for logs that comes with many built-in modules for collecting, parsing, and shipping data from multiple data sources to Elasticsearch. These modules, which come with prebuilt Kibana dashboards, are available for a wide variety of systems, services, and applications.

A single Filebeat instance is versatile and can collect data from multiple data sources. It helps users save hours (if not days) of parsing, index template configuration, and data management. By default, data from all different sources is shipped to the same Elasticsearch index, letting users filter data sources with the event.module field. While this behavior makes getting started with Filebeat easier, there are multiple use cases that would benefit from storing similar data in multiple indices, like:

  1. Applying different data retention policies to different data sources. For example, VPC and firewall data may need to be kept for a year, while IIS data only needs to be stored in the cluster for a month.
  2. Making only a subset of data available to an application or dashboard. For example, security analysts may care about firewall and network data, but probably not about Redis or Kafka data.
  3. Easily assigning data access to users who need to access a subset of data. For example, members of the support team may only need to access IIS but not Tomcat data sets. To achieve this, the Elastic admin could look at document-level security, but this approach is more complex and has more room for human error.

Separating data with index lifecycle management and data streams

All of the use cases above point to one solution: separating the data by data source or Filebeat module. Separating is easy to do, but avoiding sharding issues is a little more complex. Sharding issues can be either oversharding (too many small shards) or undersharding (too few shards that are too big). Each comes with its own problems: oversharding means every operation has to hit more shards, and undersharding can lead to availability issues.

This blog will help you separate data from Filebeat into smaller indices dedicated to each module, with data streams and ILM policies set up to help you avoid sharding issues. We’ll use the new Elasticsearch index templates and component templates to ensure that all of the new indices inherit the proper mappings from the default Filebeat template, and data streams to ensure that our data rolls over properly.

That seems like a lot, but once it’s configured, management becomes a breeze. A little work now and a lot less worry later.

Practical example: Managing multiple cloud data sources

In this example, we are going to have three types of data: Google Cloud, Microsoft Azure, and Amazon Web Services (AWS). The data sets will fall under two different data retention policies:

  • Google Cloud will be kept for 30 days
  • Azure and AWS will be kept for 365 days

If you don’t have a cluster to follow along with, you can spin up a free trial of Elastic Cloud before starting step 1.

Step 1: Create an index lifecycle policy

First, we’re going to create an index lifecycle policy for each subset of data. This will create different retention policies so each index can have its own rules:

# Create a policy that deletes data 30 days after rollover
PUT _ilm/policy/filebeat-30days 
{ 
    "policy" : { 
      "phases" : { 
        "hot" : { 
          "min_age" : "0ms", 
          "actions" : { 
            "rollover" : { 
              "max_size" : "50gb", 
              "max_age" : "1d" 
            }, 
            "set_priority" : { 
              "priority" : 100 
            } 
          } 
        }, 
        "delete" : { 
          "min_age" : "30d", 
          "actions" : { 
            "delete" : { 
              "delete_searchable_snapshot" : true 
            } 
          } 
        } 
      } 
    } 
} 
# Create a policy that deletes data 365 days after rollover
PUT _ilm/policy/filebeat-365days 
{ 
    "policy" : { 
      "phases" : { 
        "hot" : { 
          "min_age" : "0ms", 
          "actions" : { 
            "rollover" : { 
              "max_size" : "50gb", 
              "max_age" : "1d" 
            }, 
            "set_priority" : { 
              "priority" : 100 
            } 
          } 
        }, 
        "delete" : { 
          "min_age" : "365d", 
          "actions" : { 
            "delete" : { 
              "delete_searchable_snapshot" : true 
            } 
          } 
        } 
      } 
    } 
}
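
The two policies above differ only in the min_age of the delete phase, so they can be generated from a single function. The following is a minimal Python sketch, assuming you send the resulting body to PUT _ilm/policy/<name> yourself; the function names build_retention_policy and approx_backing_indices are hypothetical helpers, not part of any Elastic client.

```python
import json
import math


def build_retention_policy(retention_days: int) -> dict:
    """Return an ILM policy body shaped like the two policies above.

    Only the delete phase's min_age varies; everything else is copied
    from the filebeat-30days / filebeat-365days examples.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "policy": {
            "phases": {
                "hot": {
                    "min_age": "0ms",
                    "actions": {
                        "rollover": {"max_size": "50gb", "max_age": "1d"},
                        "set_priority": {"priority": 100},
                    },
                },
                "delete": {
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {"delete_searchable_snapshot": True}},
                },
            }
        }
    }


def approx_backing_indices(retention_days: int, rollover_days: int = 1) -> int:
    """Rough steady-state index count per data stream.

    Assumption: rollover is always triggered by max_age, never by max_size.
    Counts the rolled-over indices still inside the retention window plus
    the current write index.
    """
    return math.ceil(retention_days / rollover_days) + 1


if __name__ == "__main__":
    for days in (30, 365):
        print(f"PUT _ilm/policy/filebeat-{days}days")
        print(json.dumps(build_retention_policy(days), indent=2))
        print(f"# roughly {approx_backing_indices(days)} backing indices per data stream")
```

The index count is only an estimate, but it is a quick way to sanity-check a rollover and retention combination against the sharding concerns discussed earlier.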

Step 2: Create a component template

Now we need to create a component template, which is a building block for constructing index templates that specify index mappings, settings, and aliases. We’ll save time by creating it from the legacy template filebeat-%{agent.version}.

  1. Create a component template based on filebeat-7.10.1 by going to: Kibana > Stack Management > Index Management > Component templates > Create a component template.
  2. Copy the settings from the filebeat-7.10.1 legacy template (GET _template/filebeat-7.10.1).
# Remove "rollover_alias": "filebeat-7.10.1" from the output before pasting it
curl -XGET "https://<Elasticsearch>:<port>/_template/filebeat-7.10.1" -u elastic | jq '.[].settings'
  3. Copy the mappings from the filebeat-7.10.1 legacy template.
# rollover_alias is an index setting, so there is nothing to remove from the mappings
curl -XGET "https://<Elasticsearch>:<port>/_template/filebeat-7.10.1" -u elastic | jq '.[].mappings'
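
Removing rollover_alias from the copied settings by hand is easy to forget, so it can also be scripted. This is a minimal Python sketch, assuming the settings were saved from the curl command above as a nested dictionary; strip_rollover_alias is a hypothetical helper, and the sample settings are a trimmed illustration rather than the full legacy template.

```python
import copy


def strip_rollover_alias(settings: dict) -> dict:
    """Return a copy of legacy-template settings without rollover_alias.

    Handles the nested form ({"index": {"lifecycle": {...}}}) as well as
    a flat "index.lifecycle.rollover_alias" key.
    """
    cleaned = copy.deepcopy(settings)
    cleaned.pop("index.lifecycle.rollover_alias", None)
    lifecycle = cleaned.get("index", {}).get("lifecycle", {})
    lifecycle.pop("rollover_alias", None)
    return cleaned


if __name__ == "__main__":
    # Illustrative, trimmed sample; the real template has many more settings.
    sample = {
        "index": {
            "lifecycle": {"name": "filebeat", "rollover_alias": "filebeat-7.10.1"},
            "refresh_interval": "5s",
        }
    }
    print(strip_rollover_alias(sample))
```

The function works on a copy, so the original export stays intact if you need to compare the two.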

Step 3: Create index templates

Now we can use that component template to create a new index template for each data source. Let’s start with the Google Cloud template.

  1. Go to Kibana > Stack Management > Index Management > Index Templates > Create template.
  2. Make sure Data stream is enabled.
  3. Add the component template you created in step 2 to this new template.
  4. Override the default setting from the component template to use the custom index lifecycle policy:
{
  "index": {
    "lifecycle": {
      "name": "filebeat-30days"
    }
  }
}
  5. Leave Mappings blank, as the mappings are inherited from the component template.
  6. Leave Aliases blank.
  7. Save the template. Repeat these steps for all of the custom data sets with the correct ILM policies, either filebeat-30days or filebeat-365days.
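
If you prefer the API to the Kibana UI, the same templates can be expressed as request bodies for PUT _index_template/<name>. The sketch below builds one body per module in Python. Several values are assumptions that the steps above do not specify: the component template name filebeat-7.10.1-component, the index pattern filebeat-<version>-<module>*, and the priority of 200 are all hypothetical and should be adjusted to your own setup.

```python
import json

# Retention mapping taken from the example: Google Cloud 30 days, Azure and AWS 365 days.
MODULE_POLICIES = {
    "googlecloud": "filebeat-30days",
    "azure": "filebeat-365days",
    "aws": "filebeat-365days",
}


def build_index_template(
    module: str,
    ilm_policy: str,
    component_template: str = "filebeat-7.10.1-component",  # hypothetical name
    version: str = "7.10.1",
) -> dict:
    """Return a composable index template body for one Filebeat module."""
    return {
        # Assumed pattern; it has to match the index name used in step 4.
        "index_patterns": [f"filebeat-{version}-{module}*"],
        # An empty object enables data streams for matching indices.
        "data_stream": {},
        "composed_of": [component_template],
        "priority": 200,  # assumed value
        "template": {
            "settings": {"index": {"lifecycle": {"name": ilm_policy}}}
        },
    }


if __name__ == "__main__":
    for module, policy in MODULE_POLICIES.items():
        print(f"PUT _index_template/filebeat-{module}")
        print(json.dumps(build_index_template(module, policy), indent=2))
```

Mappings and aliases are deliberately left out, mirroring the steps above: the mappings come from the component template.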

Step 4: Configure output to multiple indices

Filebeat allows you to split the data into smaller indices based on the event.module value. There are two typical log flow setups, one with Logstash and one without:

Filebeat > Logstash > Elasticsearch

Data makes a stop at Logstash before being ingested into Elasticsearch. In this setup, Logstash only needs to do minimal parsing, provided you have already run the filebeat setup command.

input { 
  beats { 
    port => 5044 
  } 
} 
filter { 
  mutate { 
    add_tag => [ "coming_from_logstash"] 
  } 
} 
output {
  elasticsearch {
    index => "filebeat-%{[agent][version]}-%{[event][module]}"
    user => "elastic"
    password => "password"
    # Leverage the built-in ingest pipeline from Filebeat
    pipeline => "filebeat-%{[agent][version]}-%{[event][module]}-%{[fileset][name]}-pipeline"
    hosts => [ "https://<elasticsearch_URL>:<port>" ]
    # Data streams are append-only, so documents must be sent with the create action
    action => "create"
  }
}
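
To see which index and pipeline names the Logstash output above produces for a given event, the field references can be resolved by hand. The following Python sketch imitates that substitution for the %{[a][b]} syntax only; resolve is a hypothetical helper and the sample event is a trimmed illustration, not real Filebeat output.

```python
import re

# Matches references such as %{[agent][version]} and captures "[agent][version]".
FIELD_REF = re.compile(r"%\{((?:\[[^\]\[]+\])+)\}")


def resolve(template: str, event: dict) -> str:
    """Substitute %{[a][b]} references with values from a nested event."""

    def lookup(match: "re.Match") -> str:
        path = re.findall(r"\[([^\]\[]+)\]", match.group(1))
        value = event
        for key in path:
            if not isinstance(value, dict) or key not in value:
                # A missing field leaves the reference in place as literal text.
                return match.group(0)
            value = value[key]
        return str(value)

    return FIELD_REF.sub(lookup, template)


if __name__ == "__main__":
    event = {
        "agent": {"version": "7.10.1"},
        "event": {"module": "aws"},
        "fileset": {"name": "cloudtrail"},
    }
    print(resolve("filebeat-%{[agent][version]}-%{[event][module]}", event))
    print(resolve(
        "filebeat-%{[agent][version]}-%{[event][module]}-%{[fileset][name]}-pipeline",
        event,
    ))
```

Note the missing-field case: an event without event.module keeps the literal reference in its index name, so it may be worth guarding the output with a conditional if non-module events can reach this pipeline.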

Filebeat > Elasticsearch

In this setup, Filebeat sends data straight to Elasticsearch without using Logstash.

filebeat.inputs: 
- type: log 
  enabled: false 
filebeat.config.modules: 
  path: ${path.config}/modules.d/*.yml 
  reload.enabled: true 
setup.template.settings: 
  index.number_of_shards: 1 
# Set up Kibana 
setup.kibana.host: "https://Kibana_URL:443" 
setup.kibana.protocol: https 
# Set up Elasticsearch 
output.elasticsearch: 
  hosts: ["https://Elasticsearch_URL:443"]
  username: "elastic" 
  password: "${elastic_user_pd}" 
  indices: 
    - index: "filebeat-%{[agent.version]}-%{[event.module]}" 
      when.has_fields: ['event.module'] 
processors: 
  - add_host_metadata: 
      when.not.contains.tags: forwarded 
  - add_cloud_metadata: ~ 
  - add_docker_metadata: ~ 
  - add_kubernetes_metadata: ~ 
logging.level: debug 
logging.to_files: true 
logging.files: 
  path: /var/log/filebeat 
  name: filebeat 
  keepfiles: 7 
  permissions: 0644
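
The indices section in the configuration above is a list of rules: Filebeat uses the first rule whose condition matches, and falls back to the output’s index setting when none does. The Python sketch below models only that selection logic for has_fields conditions. The names select_index and get_field, and the fallback value filebeat-default, are hypothetical; the sketch also does not model what Filebeat does when a field referenced in the index string is missing.

```python
import re
from typing import Any, Optional

# Matches Filebeat-style references such as %{[event.module]}.
FIELD_REF = re.compile(r"%\{\[([^\]]+)\]\}")


def get_field(event: dict, dotted_key: str) -> Optional[Any]:
    """Look up a dotted key such as 'event.module' in a nested event."""
    value: Any = event
    for key in dotted_key.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def select_index(event: dict, rules: list, fallback_index: str) -> str:
    """Return the index of the first rule whose has_fields all exist."""
    for rule in rules:
        if all(get_field(event, field) is not None for field in rule["has_fields"]):
            return FIELD_REF.sub(
                lambda m: str(get_field(event, m.group(1))), rule["index"]
            )
    return fallback_index


# Mirrors the single rule from the configuration above.
RULES = [
    {
        "index": "filebeat-%{[agent.version]}-%{[event.module]}",
        "has_fields": ["event.module"],
    }
]

if __name__ == "__main__":
    module_event = {"agent": {"version": "7.10.1"}, "event": {"module": "azure"}}
    plain_event = {"agent": {"version": "7.10.1"}, "message": "no module here"}
    print(select_index(module_event, RULES, "filebeat-default"))
    print(select_index(plain_event, RULES, "filebeat-default"))
```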

Step 5: Start ingesting

If everything works correctly, you should see something similar to this:

GET _cat/indices/filebeat*?v 
health status index                                                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size 
green  open   .ds-filebeat-7.10.1-azure-2020.12.26-000001                     4xvOnCt2Sx6iVSy3WnUaKQ   3   1         24            0      1.7mb        876.1kb 
green  open   .ds-filebeat-7.10.1-aws-2020.12.26-000001                       4xvOnCt2Sx6iVSy3WnUaKQ   3   1         24            0      1.7mb        876.1kb 
green  open   .ds-filebeat-7.10.1-googlecloud-2020.12.26-000001               PYxA7WHwQf6dVzEJ07X2yQ   3   1     443538            0    294.2mb        146.8mb 
green  open   .ds-filebeat-7.10.1-system-2020.12.26-000001                    sZdFR68wSWCr_XOQasEkWA   3   1         72            0      1.1mb        607.2kb
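
With many modules, it can help to group this output by data stream to confirm that every module landed in its own stream. The Python sketch below parses the text of a _cat/indices response. It assumes the backing index naming shown above (.ds-<data stream>-<date>-<generation>); group_by_data_stream is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed backing index format: .ds-<data stream>-<yyyy.MM.dd>-<6-digit generation>
BACKING_INDEX = re.compile(
    r"^\.ds-(?P<stream>.+)-(?P<date>\d{4}\.\d{2}\.\d{2})-(?P<generation>\d{6})$"
)


def group_by_data_stream(cat_output: str) -> dict:
    """Map each data stream name to the generations of its backing indices."""
    groups = defaultdict(list)
    for line in cat_output.splitlines():
        columns = line.split()
        # Skip blank lines and the header row; the index name is the third column.
        if len(columns) < 3 or columns[0] == "health":
            continue
        match = BACKING_INDEX.match(columns[2])
        if match:
            groups[match.group("stream")].append(int(match.group("generation")))
    return dict(groups)


if __name__ == "__main__":
    sample = """\
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .ds-filebeat-7.10.1-azure-2020.12.26-000001 4xvOnCt2Sx6iVSy3WnUaKQ 3 1 24 0 1.7mb 876.1kb
green open .ds-filebeat-7.10.1-googlecloud-2020.12.26-000001 PYxA7WHwQf6dVzEJ07X2yQ 3 1 443538 0 294.2mb 146.8mb
"""
    for stream, generations in group_by_data_stream(sample).items():
        print(stream, generations)
```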

Each of your ILM policies should also show the number of indices it is currently applied to.
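
The same check can be made from an ILM explain response (for example, the JSON returned by GET <target>/_ilm/explain for your backing indices). This is a minimal Python sketch that counts managed indices per policy; count_indices_per_policy is a hypothetical helper, and the sample response is hand-written and heavily trimmed.

```python
from collections import Counter


def count_indices_per_policy(explain_response: dict) -> Counter:
    """Count managed indices per ILM policy in an explain response."""
    counts: Counter = Counter()
    for info in explain_response.get("indices", {}).values():
        # Indices without a lifecycle policy report "managed": false.
        if info.get("managed"):
            counts[info.get("policy", "<unknown>")] += 1
    return counts


if __name__ == "__main__":
    # Hand-written sample; a real response carries more fields per index.
    sample = {
        "indices": {
            ".ds-filebeat-7.10.1-aws-2020.12.26-000001": {
                "managed": True,
                "policy": "filebeat-365days",
            },
            ".ds-filebeat-7.10.1-azure-2020.12.26-000001": {
                "managed": True,
                "policy": "filebeat-365days",
            },
            ".ds-filebeat-7.10.1-googlecloud-2020.12.26-000001": {
                "managed": True,
                "policy": "filebeat-30days",
            },
        }
    }
    for policy, count in sorted(count_indices_per_policy(sample).items()):
        print(f"{policy}: {count} indices")
```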

Start streaming today

It’s that simple, and now you have fewer management tasks to worry about! If you’ve just been reading along and haven’t tried it yet, I’d encourage you to try it yourself and use Filebeat to start streaming in and analyzing logs from multiple data sources.

Want to see the health of your entire ecosystem beyond just log files? Check out Elastic Observability and bring your logs, metrics, and APM traces together into a single stack. Learn more about ILM in this webinar, and please reach out on the Discuss forums if you have any questions.

Source: Elastic
