Index Search while using alias

Hi Team,

We have the following situation, where we created an alias (emp) on top of five indices (employee_01..employee_05), which contain similar kinds of data.

  1. When searching for a particular key/value as shown below, we expect it to search only the two indices employee_01 and employee_02 and return the result. But when checking the same query with the profile option, it searches all the indices, i.e. all five indices that are part of the alias emp.
GET /emp/_search
{
  "query" : {
    "match" : { "EMPLOYEE_ID" : "256" }
  }
}

GET /emp/_search
{
"profile": true,
"query" : {
    "term" : { "EMPLOYEE_ID" : "256" }
}
}
  2. Is there any way we can restrict the search engine to search only the particular indices where the data is present?
    Is there a different approach available in Elasticsearch to achieve the above scenario? Please suggest.

Thanks,
Debasis

Elasticsearch identifies the indices that hold relevant data by searching them. How would Elasticsearch be able to determine which indices hold the relevant data without searching them?

Specify the indices you want to query instead of using the alias.
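
For example, using the index names from your first post, you can list the indices directly in the request path instead of the alias:

GET /employee_01,employee_02/_search
{
  "query" : {
    "match" : { "EMPLOYEE_ID" : "256" }
  }
}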

Thanks @Christian_Dahlqvist for the quick response. In an RDBMS we create partitioned tables using a range of values, so a search goes only to the relevant partition instead of scanning the entire table. Is any similar kind of feature available in Elasticsearch?

Thanks,
Debasis

Elasticsearch performance characteristics differ quite a lot from an RDBMS. Are you having a performance problem you are trying to solve? I would not expect querying a few indices that do not hold any relevant data to add any significant overhead, so I am wondering if you are trying to prematurely solve a problem that may not exist.

Hi @Christian_Dahlqvist ,

We have a situation where a customer wants to load around 15 billion records into an index and perform search operations according to their test cases. So before loading the data and handing it over to the customer, we have to be cautious regarding index performance during search. I have the points below and need suggestions on how to overcome them.

  1. Managing such a huge index will be painful.
  2. It will definitely impact performance if the customer's searches cover 10 or 20 percent of the 15 billion records.
  3. We cannot implement an ILM policy based on size, since we are ingesting logs into indices date-wise.
    We cannot implement an ILM policy based on timestamp either, as the ingested logs themselves contain a timestamp field.

Could you please advise on how we can proceed?

Thanks,
Debasis

Based on the type of data you are indexing and the mappings you are using, how much space do you estimate the primary shards of the indexed data will take up on disk?

If you do not know I would recommend indexing enough data into a single index with a single primary shard until you get to a size of at least 1GB. That should allow you to extrapolate the expected size on disk.
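
As a rough sketch (the index name here is just an example), you could create a single-shard test index, index a representative sample into it, and then read off the primary store size:

PUT employee_sizing_test
{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0
  }
}

GET _cat/indices/employee_sizing_test?v&h=index,docs.count,pri.store.size

Dividing pri.store.size by docs.count gives an approximate per-document footprint that you can extrapolate to the full data volume.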

You mentioned you want to partition the data. How many different partition keys are there? Is data reasonably evenly distributed across these? Do you always search based on partition key?

What is the average size of the documents you are indexing?

I am not sure I understand this point. Could you please clarify? Is your data immutable?

Thanks @Christian_Dahlqvist

I am from the same team as Debasis, so chiming in for him.

The primary shards are likely to total around 22TB. This will be data for 180 days.

We are new to Elasticsearch and may be unnecessarily apprehensive about maintaining such a huge single index. But a reason why we would like to split it into different indices is so that we can drop the indices that hold data older than 180 days.

It's a single partition key, and that is a date in epoch format. Data is very well distributed across different dates. These are not really logs, but for the sake of this discussion we can think of them as activity records which keep coming in throughout the day.

On the ILM policy: our initial thought was to deploy an ILM policy. The policy will provide an alias and internally create multiple indices. Since it also allows defining a retention policy, indices will be deleted after 180 days. The application will query only the alias.
The challenge, as we understand it, is that ILM can create a new index whenever the date changes on the clock (which is OK for something like logs), but it cannot switch to a new index based on an epoch date value which is a field in the document (data) itself. Maybe our understanding is wrong?

The data is immutable.

Thanks

If it is the timestamp you want to partition on, it is reasonably straightforward. The example provided above indicated that you wanted to partition on customer or something similar, which is quite different.

The approach I would recommend is to use time-based indices, which is what you alluded to. As your data is immutable you can do this in two different ways. The first is through the use of data streams. This is great if data arrives with little delay relative to its timestamp, as all data is written to only the latest index, which is then periodically switched for a new one behind the scenes through rollover.
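
As a minimal sketch (the template and stream names below are made up), a data stream only needs an index template with a data_stream section; the stream is then created automatically when the first document is indexed, or explicitly:

PUT _index_template/activity-template
{
  "index_patterns": ["activity-stream*"],
  "data_stream": {},
  "template": {
    "settings": {
      "number_of_shards": 1
    }
  }
}

PUT _data_stream/activity-stream

Note that documents written to a data stream must contain a @timestamp field.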

If data does not arrive in near real time, or you need to add data retrospectively while maintaining suitable partitioning, you can instead create traditional time-based indices. These share a common prefix and have a date in the index name. The date indicates which timestamps are held within that specific index. A common approach is to have daily indices, where each index holds all data with a timestamp within a 24-hour period. With this approach the process indexing the data needs to determine the index name based on the timestamp of each document to make sure it is sent to the correct one. This is something that Logstash and Filebeat support.

ILM can be used to delete indices once they are beyond a certain age in both scenarios.
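
For example, a minimal policy (the name is just an example) that deletes indices 180 days after creation could look like this:

PUT _ilm/policy/delete_after_180d
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "180d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}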

These approaches will result in a large number of indices to be queried, and you can do so using an index pattern or a single alias assigned to all indices.

In older versions of Elasticsearch this used to be a bit expensive, so old versions of e.g. Kibana used to determine exactly which indices to query based on the time range in the query and only hit that minimal set of indices. In the last few major versions this is no longer required, and it is recommended to just query all indices that match the pattern or alias. Running the query against an index that does not hold any data matching the timestamps in the range clause is very quick and adds very little overhead. I have seen this work very well with much larger data volumes than you are mentioning, so I think you are trying to solve a problem that most likely does not exist.
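
As an illustration (the index pattern and field name below are made up), querying across all daily indices with a range clause is as simple as:

GET /activity-*/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-7d/d",
        "lt": "now/d"
      }
    }
  }
}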

Thanks a ton. We have tried data streams and we'll look at time-based indices as well. In case you have a blog link handy, that would be a great help!

Hi @Christian_Dahlqvist, could you please share any reference blog link so that we can achieve the same? I am new to Elasticsearch.

Thanks,
Debasis

For the old-fashioned time-based indices I do not have any blog post, as the concept is as old as Elasticsearch itself. You can look at what the Logstash Elasticsearch output plugin does. The idea is basically that you define the index name based on a timestamp field in the data. If the timestamp field is e.g. 2024-03-11T12:09:04Z, the document might go into a daily index named logstash-2024-03-11. This determination is done within the indexing process and ensures that data always ends up in the correct index. If you have data coming in with incorrect or delayed timestamps this can cause problems though. It is also hard to know when an index will no longer be written to (it depends on how late new data may arrive), which can make index lifecycle optimisation more difficult.

Data streams and rollover get around this by always writing to the latest index. As long as data arrives in near real time and you do not do any backfilling, indices will hold largely non-overlapping time ranges. This is generally more efficient and preferable if your input data stream is indexed in near real time.

Hi @Christian_Dahlqvist, thanks. While we have a theoretical understanding, could you kindly share any relevant blogs or references that can guide us in implementing this? I apologize for the inconvenience, as I am still new to the Elastic world.

Thanks,
Debasis

How are you indexing data into Elasticsearch? Logstash? Filebeat? One of the language clients?

We are ingesting data into the Elasticsearch index through Filebeat.

Thanks,

Then look at how you define the index pattern in the Elasticsearch output. If you set index: "filebeat-%{+yyyy.MM.dd}" it will generate a time-based index (e.g. filebeat-2024.03.15) based on the value of the @timestamp field, so make sure you use a timestamp processor (or set this using a set processor) to populate this field based on the timestamp you want to use from the log entry.
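
A minimal sketch of such a configuration (the path and the event_ts field name are assumptions, adjust them to your data):

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /path/to/your/data/*.log

processors:
  # Assumes the event already carries an epoch-milliseconds field called event_ts;
  # this populates @timestamp from it so the daily index name follows the event time.
  - timestamp:
      field: event_ts
      layouts:
        - UNIX_MS

output.elasticsearch:
  hosts: ["https://localhost:9200"]
  index: "filebeat-%{+yyyy.MM.dd}"

Depending on the Filebeat version, you may also need to set setup.template.name and setup.template.pattern (and disable setup.ilm) when overriding the index name.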

Hi @Christian_Dahlqvist

I configured it as below, but I still do not see the index getting created.

  1. Created the below template:

PUT _template/csv_index_template
{
  "index_patterns": ["csv-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.lifecycle.name": "csv_rollover_policy",
    "index.lifecycle.rollover_alias": "csv_data"
  },
  "mappings": {
    "properties": {
      "sequence": {
        "type": "keyword"
      },
      "component": {
        "type": "keyword"
      },
      "tenant": {
        "type": "keyword"
      },
      "service_id": {
        "type": "keyword"
      },
      "session_id": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "epoch_millis"
      },
      "edr_version": {
        "type": "keyword"
      }
    }
  }
}

  2. filebeat.yml:

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /path/to/your/csv/file.csv
    fields_under_root: true
    fields:
      log_type: csv_data

output.elasticsearch:
  hosts: ["https://xx.xx.xx:9200"]
  index: "csv-%{+yyyy.MM.dd}"
  template.enabled: true
  template.name: "csv_index_template"
  template.pattern: "csv-*"

Thanks,
Debasis

Make sure that the indentation in your Filebeat config file is correct. It is hard to tell with the formatting you have on your post.

With this type of index pattern you are not using rollover, so remove the rollover alias setting. Also make sure your ILM policy does not use or assume rollover.
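
Concretely, the settings block of the template you posted would then look something like this (a sketch; the referenced policy should contain only a delete phase, similar to the one sketched earlier in the thread):

  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.lifecycle.name": "csv_rollover_policy"
  }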

Hi @Christian_Dahlqvist,

I have created the below pipeline for ingesting.

PUT _ingest/pipeline/parse_csv_data
{
  "processors": [
    {
      "csv": {
        "description": "Parse EDR data From CSV Files",
        "field": "message",
        "target_fields": ["sequence","component","tenant","service_id","session_id","timestamp","edr_version"],
        "separator": ",",
        "ignore_missing": true,
        "trim": true
      }
    },
    {
      "date": {
        "field": "timestamp",
        "target_field": "timestamp",
        "formats": ["UNIX_MS"]
      }
    }
  ]
}

Below are the changes in filebeat.yml:

output.elasticsearch:
  hosts: ["https://xx.xx.xx.xx:9200"]
  index: "filebeat-%{+yyyy.MM.dd}"

But we are still not able to get the index created automatically. Could you please help me figure out if I am missing anything?

To troubleshoot the issue, I changed the parameter below in filebeat.yml to 1GB so that errors are written to a single log file, but it automatically keeps rotating after 1.1 KB.

Is there any way we can disable log rotation in Filebeat?

rotateeverybytes: 1073741824
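
For context, this parameter sits under the logging.files section of filebeat.yml; the surrounding section is sketched below with assumed values:

logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  rotateeverybytes: 1073741824
  keepfiles: 7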

Thanks,
Debasis

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.