Disable _source field


(Phaniraj Nagalamadugu) #1

Hello,

I harvested 100 CSV files using filebeat -> logstash (with the csv filter) -> Elasticsearch. The 100 CSV flat files take 609 MB on disk, which translated to 1 GB in Elasticsearch (pri.store.size).

I am aware of the advantages of the _source field. However, I would like to see how much space we could save by disabling it in Elasticsearch 6.2.4. I am using the command below to disable the _source field:

Index name: perflogs-2018.19

PUT perflogs-2018.19
{
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      }
    }
  }
}

I am getting the error below. Please let me know if I am doing anything wrong.

{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [perflogs-2018.19/CH-f47yTRqCAjfxZiEKI3A] already exists",
        "index_uuid": "CH-f47yTRqCAjfxZiEKI3A",
        "index": "perflogs-2018.19"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [perflogs-2018.19/CH-f47yTRqCAjfxZiEKI3A] already exists",
    "index_uuid": "CH-f47yTRqCAjfxZiEKI3A",
    "index": "perflogs-2018.19"
  },
  "status": 400
}

(Mark Walkom) #2

You cannot do it if the index already exists, so you will need to delete the index, add the mapping, and then reprocess the data.
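As a sketch in Kibana Dev Tools syntax (the index name is the one from this thread; note that the DELETE permanently removes the existing data, so make sure the CSV files can be reprocessed first):

```
DELETE perflogs-2018.19

PUT perflogs-2018.19
{
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      }
    }
  }
}
```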


(Phaniraj Nagalamadugu) #3

Hi Warkolm,

Thanks for the quick response.

I followed the steps below and am now getting a different error.

  1. Delete the perflogs-2018.19 index
  2. Add a mapping to disable the _source field:
PUT perflogs-2018.19
{
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      }
    }
  }
}

  3. Harvest logs to the perflogs-2018.19 index. I see the error below in the Elasticsearch logs:
[2018-05-16T06:44:41,919][DEBUG][o.e.a.b.TransportShardBulkAction] [perflogs-2018.19][0] failed to execute bulk item (index) BulkShardRequest [[perflogs-2018.19][0]] containing [12] requests
java.lang.IllegalArgumentException: Rejecting mapping update to [perflogs-2018.19] as the final mapping would have more than 1 type: [_doc, log]
        at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:501) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:353) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:285) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:313) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:230) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:643) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:273) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.2.4.jar:6.2.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]

(Mark Walkom) #4

That'll depend on what other data is going into Elasticsearch, but are you using Logstash?


(David Pilato) #5

Without reindexing and without removing _source, you can also enable better compression:

See index.codec in https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html

Then run a _forcemerge call to rewrite the segments.
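Put together, the suggestion looks roughly like this (best_compression is a static setting, so here the index is closed before changing it; max_num_segments=1 is an assumption that fits a read-only, time-based index):

```
POST perflogs-2018.19/_close

PUT perflogs-2018.19/_settings
{
  "index.codec": "best_compression"
}

POST perflogs-2018.19/_open

POST perflogs-2018.19/_forcemerge?max_num_segments=1
```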

But rather than removing the _source field, which I do not recommend as you will lose a lot of features, have a look at your mapping and see what else you can optimize instead.


(Phaniraj Nagalamadugu) #6

Hi Warkolm/dadoonet,

Yes, we use Logstash in the data pipeline: filebeat -> logstash -> elasticsearch. How do we fix the "multiple mapping types" issue?
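The type-conflict error above happens because the index was created with type _doc while the documents arrive with type log. One way to avoid it is to make the type name in the index mapping match whatever Logstash sends; another is to pin the type explicitly in the Logstash elasticsearch output via its document_type option. A sketch of the latter (the hosts and index pattern here are assumptions, not taken from the actual pipeline):

```
output {
  elasticsearch {
    hosts         => ["localhost:9200"]
    index         => "perflogs-%{+xxxx.ww}"
    document_type => "_doc"   # force a single type name matching the index mapping
  }
}
```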

With the default compression codec, the 608 MB of flat files translated to 1 GB of index store size.
With the best_compression codec, the same 608 MB translated to 740 MB of index store size. Even best_compression seems a little high on disk utilization, so I would like to know what the index store size is when the _source field is disabled.

Thanks
Phaniraj


(Phaniraj Nagalamadugu) #7


_forcemerge did help reduce the store size from 740 MB to 619 MB.

But I would still like to see the store size with the _source field disabled.

Thanks
Phaniraj


(David Pilato) #8

We can probably help, but to solve what? I mean, do you think it is worth it?

What are you going to do with your data? And again, what is your current mapping?


(Christian Dahlqvist) #9

Have a look at this documentation for guidance on how to optimise mappings.


(Phaniraj Nagalamadugu) #10

Hi All,

Thank you all for the help.

I was able to disable the _source field after fixing the mapping type name. It looks like Logstash uses "log" as the type name, so I used the same name when disabling the _source field. The 608 MB of logs translated to 301 MB with the best_compression codec.

PUT perflogs-2018.19
{
  "mappings": {
    "log": {
      "_source": {
        "enabled": false
      }
    }
  }
}
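To compare store sizes before and after such changes, the _cat API can report the primary store size directly, for example:

```
GET _cat/indices/perflogs-2018.19?v&h=index,pri.store.size
```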

Thanks
Phaniraj


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.