Do not use a lot of types per index?

agonzalez · October 4, 2016, 1:46am

In this presentation (https://www.elastic.co/elasticon/2015/sf/scaling-elasticsearch-for-production-at-verizon) they said to not use a lot of types per index.

I now have 41 types in my filebeat-xx index. Is this bad?

I am using document_type to record the name of the log file each event comes from. Should I put this in a field or somewhere else instead?

warkolm · October 4, 2016, 2:57am

Are they different sorts of logs?

agonzalez · October 4, 2016, 3:16am

Yes, they are all text logs, but they have different time formats, some are multiline and others are not, and they are spread over different files.

So what I am doing is: log1-* gets doc_type log1, log2-* gets doc_type log2, etc. Then I can search by log type.

dadoonet · October 4, 2016, 4:25am

In the future, types are probably going to be removed. See https://github.com/elastic/elasticsearch/issues/15613
It will take a long time but I believe it will happen.

Having one type per index seems to be a better option.

You can either put all content coming from type1 inside a log_type1 object (and so on for the other types), or use one index per type.

agonzalez · October 4, 2016, 4:34am

What is the first option, "put within a log_type1 object every content coming from type1"? And isn't having an index per filebeat log file too much?

In the end the content of the log files is different but the documents are not, so they should not need different mappings. Basically we are not parsing them into different fields; we are only interested in the full message line. We just use a date filter to update the timestamp.

In this case what would be the best option? Would it be better to just add a field indicating the log file name, so we can query for all log lines of type1, type2, etc., and leave the document type as "log" for all of them?

dadoonet · October 4, 2016, 7:09am

Yeah sorry. Was on a mobile and hard to explain. I meant:

{
  "type1": {
    // Whatever structure related to type1 data
  },
  "type2": {
    // Whatever structure related to type2 data
  },
  "type3": {
    // Whatever structure related to type3 data
  }
}

Whether one index per type is too much depends on your retention period, the total number of shards...
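
To put rough numbers on the shard question, here is a back-of-the-envelope sketch. It assumes daily indices (as in filebeat-2016.10.04), the pre-7.0 Elasticsearch defaults of 5 primary shards and 1 replica per index, and a hypothetical 30-day retention; the function name is made up for illustration.

```python
def total_shards(indices_per_day, retention_days, primaries=5, replicas=1):
    """Shards held in the cluster once the retention window is full."""
    # Each index has `primaries` primary shards, and every primary
    # has `replicas` additional copies.
    shards_per_index = primaries * (1 + replicas)
    return indices_per_day * retention_days * shards_per_index


# One daily index per log type (41 types) versus a single daily index.
# The 30-day retention is an assumption, not a figure from this thread.
index_per_type = total_shards(indices_per_day=41, retention_days=30)
single_index = total_shards(indices_per_day=1, retention_days=30)

print("one index per type:", index_per_type)
print("single shared index:", single_index)
```

With these assumed settings the per-type layout holds 41 times as many shards as the shared index, which is why the answer depends so much on retention and shard settings.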

In your case, since all the documents share the same mapping, I'd simply add a type field inside the document:

{
  "type": "type1",
  "message": "content"
}

It's the best solution IMO in your case.

agonzalez · October 4, 2016, 10:20am

How do I add a type field inside the document from filebeat? I know I can add fields.type, something like:

  paths:
    - /var/log/httpd/error_log
  fields:
   type: httpd_error_log

instead of using:

  paths:
    - /var/log/httpd/error_log
  document_type: httpd_error_log

Is this more efficient for querying just the httpd_error_log logs and retrieving all messages of that type?

These are two sample docs I have now:

tasker log type:

{
  "_index": "filebeat-2016.10.04",
  "_type": "tasker",
  "_id": "AVePp7jp1UM0xgwSbV9J",
  "_score": null,
  "_source": {
    "@timestamp": "2016-10-04T10:25:16.388Z",
    "beat": {
      "hostname": "rc02",
      "name": "rc02"
    },
    "fields": {
      "asset_tag": "822-101-xxxx"
    },
    "input_type": "log",
    "message": "Oct 04 12:25:15 : [LOG0] Deleting task #42508 which has expired",
    "offset": 825908,
    "source": "/usr/bp/logs.dir/tasker-9.log",
    "type": "tasker"
  },
  "fields": {
    "@timestamp": [
      1475576716388
    ]
  },
  "sort": [
    1475576716388
  ]
}

bpserver-backup type log:

{
  "_index": "filebeat-2016.10.04",
  "_type": "bpserver-backup",
  "_id": "AVePp2Em1UM0xgwSbV2N",
  "_score": null,
  "_source": {
    "@timestamp": "2016-10-04T10:24:54.050Z",
    "beat": {
      "hostname": "rc02",
      "name": "rc02"
    },
    "fields": {
      "asset_tag": "822-101-xxx"
    },
    "input_type": "log",
    "message": "Oct 04 12:24:48 : [LOG0]   Load average....: 0.6\n ",
    "offset": 10858,
    "source": "/usr/bp/logs.dir/bpserver-backup-9.log",
    "type": "bpserver-backup"
  },
  "fields": {
    "@timestamp": [
      1475576694050
    ]
  },
  "sort": [
    1475576694050
  ]
}

thanks

dadoonet · October 4, 2016, 11:29am

I believe you can't do that in beats in general.
Before 5.0, you have to process that with Logstash. So forward from filebeat to logstash which mutates your fields and then ships to elasticsearch.
From 5.0, it's easier as you can define an ingest pipeline directly in elasticsearch. The pipeline will process the original data and mutate it inside elasticsearch before it gets indexed.
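
As a rough illustration of the ingest option, the sketch below builds the JSON body for a pipeline that uses the `set` processor to stamp a fixed value on each document. The pipeline name `tag-log-type` and the field name `log_type` are hypothetical, and it only prints the request; it does not talk to a cluster.

```python
import json

# Hypothetical pipeline name, chosen for illustration only.
PIPELINE_NAME = "tag-log-type"


def build_pipeline(log_type, field="log_type"):
    """Return an ingest pipeline body that sets `field` to `log_type`."""
    return {
        "description": "Tag every document with " + field + "=" + log_type,
        "processors": [
            # The `set` processor writes a value into a field.
            {"set": {"field": field, "value": log_type}}
        ],
    }


body = build_pipeline("httpd_error_log")

# Print the request that would register the pipeline.
print("PUT _ingest/pipeline/" + PIPELINE_NAME)
print(json.dumps(body, indent=2))
```

Documents would then need to be indexed with that pipeline selected for the field to be added.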

agonzalez · October 4, 2016, 11:36am

I am currently doing it with beats alone: in filebeat 5.0 I add a custom field to identify the source, without Logstash or an ingest node pipeline. This is what I am using in the beats config:

"fields": {
  "asset_tag": "822-101-xxx"
}

So if I use this:

  paths:
    - /var/log/httpd/error_log
  fields:
    type: httpd_error_log
    asset_tag: ${ASSET_TAG}

I will get this inside my documents:

"fields": {
  "asset_tag": "822-101-xxx",
  "type": "httpd_error_log"
}

And then I can search by fields.type: httpd_error_log

I just wanted to know if this is better and more efficient than having a separate document_type/type per log, as all my log documents have the same format.
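
For reference, a search on that field could look roughly like the sketch below, which builds a `bool` query with a `term` filter and prints it as JSON. It assumes fields.type is mapped as an exact-value (keyword / not analyzed) field, which depends on the index template in use; the helper name is hypothetical.

```python
import json


def type_filter_query(log_type, field="fields.type"):
    """Build a search body that keeps only documents of one log type."""
    return {
        "query": {
            "bool": {
                # Filter context: exact match on the field, no scoring.
                "filter": [{"term": {field: log_type}}]
            }
        }
    }


query = type_filter_query("httpd_error_log")

# Body for a search against the filebeat indices, e.g. filebeat-*/_search
print(json.dumps(query, indent=2))
```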

dadoonet · October 4, 2016, 3:34pm

It looks very good to me. As long as the other logs share the same structure, that's the ideal way IMO.
