Should I avoid using a lot of types per index?


(Alberto Gonzalez) #1

In this presentation (https://www.elastic.co/elasticon/2015/sf/scaling-elasticsearch-for-production-at-verizon) they said not to use a lot of types per index.

I now have 41 types in my filebeat-xx index. Is this bad?

I am using document_type to record the name of the log file each event comes from. Should I put this in a field or somewhere else instead?


(Mark Walkom) #2

Are they different sorts of logs?


(Alberto Gonzalez) #3

Yes, they are all text logs, but they have different time formats, some are multiline and others are not, and they are spread over different files.

So what I am doing is: log1-* gets doc_type log1, log2-* gets doc_type log2, and so on. Then I can search by log type.


(David Pilato) #4

In the future, types are probably going to be removed. See https://github.com/elastic/elasticsearch/issues/15613
It will take a long time but I believe it will happen.

Having one type per index seems to be a better option.

You can either put all the content coming from type1 inside a log_type1 object (and so on for the other types), or use one index per type.


(Alberto Gonzalez) #5

What do you mean by the first option, "put within a log_type1 object every content coming from type1"? And isn't having one index per Filebeat log file too much?

In the end, the content of the log files differs but the documents do not, so they should not need different mappings. Basically, we are not parsing them into separate fields; we are only interested in the full message line. We only use a date filter to update the timestamp.

In this case, what would be the best option? Would it be better to just add a field indicating the log file name, so we can query for all log lines of type1, type2, etc., and leave the document type as "log" for all of them?


(David Pilato) #6

Yeah, sorry. I was on mobile and it was hard to explain. I meant:

{
  "type1": {
    // Whatever structure related to type1 data
  },
  "type2": {
    // Whatever structure related to type2 data
  },
  "type3": {
    // Whatever structure related to type3 data
  }
}

Whether one index per type is too much depends on your retention, the total number of shards, and so on.

Since your documents all share the same structure, I'd simply add a type field inside the document:

{
  "type": "type1",
  "message": "content"
}

It's the best solution IMO in your case.
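
To make the two layouts above concrete, here is a minimal Python sketch. The helper names `nest_under_type` and `tag_with_type` are hypothetical, invented for illustration; they only build plain dictionaries and do not talk to Elasticsearch.

```python
def nest_under_type(log_type, content):
    """First layout: wrap everything coming from one log type in its own object."""
    return {log_type: dict(content)}


def tag_with_type(log_type, content):
    """Second layout: keep a flat document and record the log type in a field."""
    doc = dict(content)
    doc["type"] = log_type
    return doc


if __name__ == "__main__":
    content = {"message": "content"}
    print(nest_under_type("type1", content))
    print(tag_with_type("type1", content))
```

The second form keeps one shared structure for every log, which fits the case where only the message line matters.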


(Alberto Gonzalez) #7

How do I add a type field inside the document from Filebeat? I know I can add fields.type, something like:

  paths:
    - /var/log/httpd/error_log
  fields:
    type: httpd_error_log

instead of using:

  paths:
    - /var/log/httpd/error_log
  document_type: httpd_error_log

Is this more efficient when I just want to query the httpd_error_log logs and retrieve all messages of that type?

These are two sample docs I have now:

tasker log type:

{
  "_index": "filebeat-2016.10.04",
  "_type": "tasker",
  "_id": "AVePp7jp1UM0xgwSbV9J",
  "_score": null,
  "_source": {
    "@timestamp": "2016-10-04T10:25:16.388Z",
    "beat": {
      "hostname": "rc02",
      "name": "rc02"
    },
    "fields": {
      "asset_tag": "822-101-xxxx"
    },
    "input_type": "log",
    "message": "Oct 04 12:25:15 : [LOG0] Deleting task #42508 which has expired",
    "offset": 825908,
    "source": "/usr/bp/logs.dir/tasker-9.log",
    "type": "tasker"
  },
  "fields": {
    "@timestamp": [
      1475576716388
    ]
  },
  "sort": [
    1475576716388
  ]
}

bpserver-backup log type:

{
  "_index": "filebeat-2016.10.04",
  "_type": "bpserver-backup",
  "_id": "AVePp2Em1UM0xgwSbV2N",
  "_score": null,
  "_source": {
    "@timestamp": "2016-10-04T10:24:54.050Z",
    "beat": {
      "hostname": "rc02",
      "name": "rc02"
    },
    "fields": {
      "asset_tag": "822-101-xxx"
    },
    "input_type": "log",
    "message": "Oct 04 12:24:48 : [LOG0]   Load average....: 0.6\n ",
    "offset": 10858,
    "source": "/usr/bp/logs.dir/bpserver-backup-9.log",
    "type": "bpserver-backup"
  },
  "fields": {
    "@timestamp": [
      1475576694050
    ]
  },
  "sort": [
    1475576694050
  ]
}

thanks


(David Pilato) #8

I believe you can't do that in Beats in general.
Before 5.0, you have to process that with Logstash: forward from Filebeat to Logstash, which mutates your fields and then ships them to Elasticsearch.
From 5.0, it's easier, as you can define an ingest pipeline directly in Elasticsearch. The pipeline processes the original data and mutates it inside Elasticsearch before it gets indexed.
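
As a rough sketch of the ingest route, the following Python snippet only builds and prints a pipeline definition; it does not contact a cluster. The pipeline name `add-log-type` and the target field `log_type` are made up for illustration, and copying the document type through the `{{_type}}` template is an assumption about the 5.x ingest node that should be checked against the documentation.

```python
import json

# Hypothetical pipeline id, chosen only for this example.
PIPELINE_ID = "add-log-type"


def build_pipeline(target_field="log_type"):
    """Return an ingest pipeline body with a single `set` processor.

    Assumption: the `set` processor resolves the `{{_type}}` template to the
    document's type, so the value ends up in a regular field.
    """
    return {
        "description": "Copy the document type into a regular field",
        "processors": [
            {"set": {"field": target_field, "value": "{{_type}}"}}
        ],
    }


if __name__ == "__main__":
    # The body would be sent with: PUT _ingest/pipeline/<PIPELINE_ID>
    print("PUT _ingest/pipeline/" + PIPELINE_ID)
    print(json.dumps(build_pipeline(), indent=2))
```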


(Alberto Gonzalez) #9

I am currently doing it with Beats, adding a custom field to identify the source in Filebeat 5.0, without Logstash or an ingest node pipeline. I am using this in the Beats config:

"fields": {
  "asset_tag": "822-101-xxx"
}

So if I use this:

  paths:
    - /var/log/httpd/error_log
  fields:
    type: httpd_error_log
    asset_tag: ${ASSET_TAG}

I will get this inside my documents:

"fields": {
  "asset_tag": "822-101-xxx",
  "type": "httpd_error_log"
}

And then I can search by fields.type: httpd_error_log
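
A search restricted to one log file can then be expressed as a term filter on `fields.type`. The snippet below is a minimal sketch that only builds the request body; the helper name `logs_of_type` is hypothetical, and it assumes `fields.type` is indexed as an exact (keyword) value rather than analyzed text.

```python
import json


def logs_of_type(log_type, size=10):
    """Build a search body matching only documents whose fields.type equals log_type."""
    return {
        "size": size,
        "query": {
            "bool": {
                # A filter clause skips scoring, which is enough for exact matching.
                "filter": [{"term": {"fields.type": log_type}}]
            }
        },
        # Newest log lines first.
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


if __name__ == "__main__":
    # The body would be sent to: GET filebeat-*/_search
    print(json.dumps(logs_of_type("httpd_error_log"), indent=2))
```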

I just wanted to know whether this is better and more efficient than having a separate document_type/type per log, since all my log documents have the same format.


(David Pilato) #10

It looks very good to me. As long as the other logs share the same structure, that's the ideal way IMO.

