Which Mappings settings can help improve indexing speed?

Hello,

I'm trying to improve the indexing performance on my cluster, and I have already checked everything listed in the "Tune for indexing speed" documentation.

Currently, indexing is done by Logstash with the elasticsearch output and by a couple of Filebeat instances sending directly to Elasticsearch. The indexing happens on 4 hot nodes with 10 vCPUs, 64 GB of RAM (30 GB of Java heap), and SSD-backed disks.

Normally I do not have any indexing problems and the data is near real time, but sometimes a couple of pipelines fall behind. When this happens, the hot_threads endpoint shows the [write] action among the hot threads on one or more nodes, and the load on those nodes is also high.

I have not yet been able to track down which index or pipeline is causing this, as the hot_threads response does not include this information.

But since I'm in the process of optimizing the index mappings, I was wondering whether there is something I can change at the mapping level to improve indexing speed.

Almost all my logs are security logs: network devices, applications, SaaS logs, and things like that. I use Discover in Kibana and the SIEM/Detections interface to search, plus a couple of Python scripts to trigger alerts and actions.

I do not need any kind of scoring in the searches. Looking at the mapping parameters, I found two things I could change that I think would help my indexing speed.

One is to set index_options to docs, as I just need to know whether a string is in a text field; position and frequency don't matter.

The other is to set norms to false, as scoring also doesn't matter.
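As a rough illustration of the two changes described above, the sketch below builds a mapping body for a text field with index_options set to docs and norms disabled. The field name `message` is a hypothetical placeholder; only the two parameters come from this post. One caveat worth checking before applying it: without positions in the index, phrase queries against that field will not work.

```python
import json

# Hypothetical field name; substitute the text fields from your own mapping.
TEXT_FIELD = "message"


def build_text_mapping(field_name):
    """Return a mapping body for a text field that indexes only document
    membership (no term frequencies or positions) and stores no norms."""
    return {
        "mappings": {
            "properties": {
                field_name: {
                    "type": "text",
                    "index_options": "docs",  # skip frequencies and positions
                    "norms": False,           # no length normalization for scoring
                }
            }
        }
    }


mapping_body = build_text_mapping(TEXT_FIELD)
print(json.dumps(mapping_body, indent=2))
```

The printed JSON is what would go into the body of an index creation request or an index template, so that it applies to newly created indices.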

Does anyone have experience with whether changing those settings helps indexing speed? Are there any other settings that I could change to improve indexing speed?

Try using the new match_only_text field type for those fields; it is a configuration shortcut for what you already did. See Text type family | Elasticsearch Guide [7.14] | Elastic. This will reduce your index size, which might result in faster queries due to smaller indices, but it will probably not have a major effect on indexing, as you already noticed.
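For comparison, a minimal sketch of what a match_only_text mapping could look like, assuming a cluster on 7.14 or later (the version of the guide referenced above). The field name `message` is again a hypothetical placeholder.

```python
import json


def build_match_only_text_mapping(field_name):
    """Return a mapping body that maps one field as match_only_text."""
    return {
        "mappings": {
            "properties": {
                # No index_options or norms here: the field type itself
                # already drops scoring and positional data.
                field_name: {"type": "match_only_text"}
            }
        }
    }


match_only_body = build_match_only_text_mapping("message")  # hypothetical field
print(json.dumps(match_only_body, indent=2))
```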

In general, reducing the mapping either in terms of indexed fields or reducing the complexity of your analysis chain can have a huge impact. Maybe you need only keyword fields for some.

Are you using ingest processors? Try replacing grok processors with dissect where possible. The ingest processor stats also include the runtime of each processor.
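To make the per-processor runtimes easier to read, something like the sketch below could rank processors by total time. It assumes the response of `GET _nodes/stats/ingest` has already been loaded into a dict and that processors are listed under `nodes.<id>.ingest.pipelines.<name>.processors`; the sample data, node ID, and pipeline name are made up for illustration.

```python
def slowest_processors(node_stats, top_n=5):
    """Rank ingest processors by total time spent, highest first.

    Returns tuples of (node_id, pipeline, processor, count, total_ms, avg_ms).
    """
    rows = []
    for node_id, node in node_stats.get("nodes", {}).items():
        pipelines = node.get("ingest", {}).get("pipelines", {})
        for pipeline_name, pipeline in pipelines.items():
            for processor in pipeline.get("processors", []):
                # Each list entry is a single-key dict: {processor_name: {...}}
                for proc_name, proc in processor.items():
                    stats = proc.get("stats", {})
                    count = stats.get("count", 0)
                    total_ms = stats.get("time_in_millis", 0)
                    avg_ms = total_ms / count if count else 0.0
                    rows.append((node_id, pipeline_name, proc_name,
                                 count, total_ms, avg_ms))
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows[:top_n]


# Made-up sample in the assumed shape of the ingest stats response.
sample_stats = {
    "nodes": {
        "node-1": {
            "ingest": {
                "pipelines": {
                    "firewall-logs": {
                        "processors": [
                            {"geoip": {"type": "geoip", "stats": {
                                "count": 1000, "time_in_millis": 800}}},
                            {"set": {"type": "set", "stats": {
                                "count": 1000, "time_in_millis": 20}}},
                        ]
                    }
                }
            }
        }
    }
}

ranked = slowest_processors(sample_stats)
for row in ranked:
    print(row)
```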

In general, you already seem to be on the right track. Don't forget the refresh interval; increasing it may help significantly. Using the smallest numeric type might help as well (or going with scaled floats).
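A small sketch of the "smallest numeric type" idea: given the range of values a field can take, pick the narrowest integer type that holds it, and use scaled_float for decimals with a fixed precision. The field names and the scaling factor below are hypothetical examples, not taken from the cluster discussed here.

```python
# Inclusive value ranges of the Elasticsearch integer field types.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value, max_value):
    """Return the narrowest integer field type covering [min_value, max_value]."""
    for name, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a long")


def scaled_float_mapping(scaling_factor):
    """Return a scaled_float field mapping; values are stored as
    round(value * scaling_factor) in a long."""
    return {"type": "scaled_float", "scaling_factor": scaling_factor}


# Hypothetical fields for a security-log index.
properties = {
    "destination_port": {"type": smallest_integer_type(0, 65535)},
    "http_status": {"type": smallest_integer_type(100, 599)},
    "risk_score": scaled_float_mapping(100),  # two decimal places
}
print(properties)
```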

However, I don't have any idea how to get better per-field stats on indexing time.

In general, seeing write threads in the hot threads output is what you expect on indexing.

I suppose you have already played around with parallel indexing and bulk sizes? Keep in mind that increasing the number of cores might also increase your indexing speed, as thread pools like the write thread pool check the number of available CPUs for sizing.
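One way to see whether the write thread pool is the bottleneck is to look at its queue and rejection counters per node. The sketch below assumes the output of `GET _cat/thread_pool/write?format=json&h=node_name,name,active,queue,rejected` has been loaded into a list of dicts (the cat API returns the numbers as strings). The node names and numbers in the sample are made up.

```python
def nodes_under_pressure(rows):
    """Return (node_name, queue, rejected) for every node whose write
    thread pool has queued or rejected requests."""
    flagged = []
    for row in rows:
        if row.get("name") != "write":
            continue
        queue = int(row["queue"])
        rejected = int(row["rejected"])
        if queue > 0 or rejected > 0:
            flagged.append((row["node_name"], queue, rejected))
    # Worst offenders first: most rejections, then longest queue.
    flagged.sort(key=lambda item: (item[2], item[1]), reverse=True)
    return flagged


# Made-up sample rows in the assumed shape of the cat thread pool output.
sample_rows = [
    {"node_name": "hot-1", "name": "write", "active": "10", "queue": "250", "rejected": "0"},
    {"node_name": "hot-2", "name": "write", "active": "3", "queue": "0", "rejected": "0"},
    {"node_name": "hot-3", "name": "write", "active": "10", "queue": "40", "rejected": "12"},
]

pressure = nodes_under_pressure(sample_rows)
for node_name, queue, rejected in pressure:
    print(node_name, "queue:", queue, "rejected:", rejected)
```

A node that keeps all write threads active with a growing queue is a hint that either more cores or smaller or better balanced bulk requests are worth testing.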

Thanks Alexander,

I was already planning to use match_only_text, but I'm on 7.12.1 and can't upgrade right now; we need to plan the upgrade and follow an internal change request flow.

There is no grok in the ingest pipelines, mostly a couple of geoip filters, and the refresh interval is 30s or higher in almost all of my indices.

I'm trying to troubleshoot the slow ingest and find what is causing it. For this I'm using the following request to get the write tasks:

GET _tasks?nodes=nodeName&actions=*write*&detailed

I'm now trying to understand the response, for example:

 "WO0iLuRvRSWoyBbofVrrJA:332110693" : {
  "node" : "WO0iLuRvRSWoyBbofVrrJA",
  "id" : 332110693,
  "type" : "transport",
  "action" : "indices:data/write/bulk[s]",
  "status" : {
    "phase" : "waiting_on_primary"
  },
  "description" : "requests[144], index[indexName-2021.09.13][1]",
  "start_time_in_millis" : 1631546856828,
  "running_time_in_nanos" : 91101711,
  "cancellable" : false,
  "parent_task_id" : "oYwJy5O2Q_KvotXxDigvAw:1676634788",
  "headers" : { }
}

What does the phase waiting_on_primary mean? I also sometimes see the phase as rerouted or primary, and the type as direct or transport.

Is this information in any part of the documentation? I didn't find it in the task management section.

Sometimes I see tasks with a high running time, larger than 500 ms.
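Since hot_threads does not say which index is involved, one option is to aggregate the task list per index. The sketch below assumes the `_tasks` response is shaped as `nodes.<node id>.tasks.<task id>` and that shard-level bulk tasks carry a description like the one shown above; the 500 ms threshold is the figure from this post, and the sample reuses the task from the example.

```python
import re

# Matches descriptions such as "requests[144], index[indexName-2021.09.13][1]".
DESCRIPTION_RE = re.compile(r"requests\[(\d+)\], index\[(.+)\]\[(\d+)\]")


def summarize_write_tasks(response, slow_ms=500.0):
    """Group shard-level write tasks by index name.

    Returns {index: {"tasks", "requests", "max_ms", "slow_tasks"}}.
    Tasks whose description does not match the pattern are skipped.
    """
    summary = {}
    for node in response.get("nodes", {}).values():
        for task in node.get("tasks", {}).values():
            match = DESCRIPTION_RE.search(task.get("description", ""))
            if match is None:
                continue
            requests, index_name = int(match.group(1)), match.group(2)
            running_ms = task.get("running_time_in_nanos", 0) / 1_000_000
            entry = summary.setdefault(
                index_name,
                {"tasks": 0, "requests": 0, "max_ms": 0.0, "slow_tasks": 0},
            )
            entry["tasks"] += 1
            entry["requests"] += requests
            entry["max_ms"] = max(entry["max_ms"], running_ms)
            if running_ms > slow_ms:
                entry["slow_tasks"] += 1
    return summary


# The task from the example above, wrapped in the assumed response shape.
sample_response = {
    "nodes": {
        "WO0iLuRvRSWoyBbofVrrJA": {
            "tasks": {
                "WO0iLuRvRSWoyBbofVrrJA:332110693": {
                    "action": "indices:data/write/bulk[s]",
                    "description": "requests[144], index[indexName-2021.09.13][1]",
                    "running_time_in_nanos": 91101711,
                }
            }
        }
    }
}

task_summary = summarize_write_tasks(sample_response)
print(task_summary)
```

Polling the tasks endpoint at intervals and feeding each response through a function like this would show which indices most often have slow or numerous write tasks.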

This is the phase of a transport action, and it depends on which node your request is sent to and where the data is. Sometimes it's local, sometimes the coordinating node waits for the work to be finished on the primary node first.

If you enable trace logging for the org.elasticsearch.action.support.replication package, you will see more explanation of what those states mean. For example, rerouted means that the coordinating node thought a shard would be on another node, but it was not.
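A minimal sketch of the request bodies involved, assuming the log level is changed dynamically through the cluster settings API (`PUT _cluster/settings`) with a `logger.<package>` key. Choosing `transient` rather than `persistent` is a choice made here for a temporary debugging session; setting the value back to null restores the default level.

```python
import json

REPLICATION_PACKAGE = "org.elasticsearch.action.support.replication"


def logger_settings_body(package, level):
    """Build a cluster settings body that sets the log level for a package.

    Pass level=None to reset the logger to its default level.
    """
    return {"transient": {"logger." + package: level}}


enable_trace = logger_settings_body(REPLICATION_PACKAGE, "TRACE")
reset_level = logger_settings_body(REPLICATION_PACKAGE, None)

print(json.dumps(enable_trace, indent=2))  # body to turn trace logging on
print(json.dumps(reset_level, indent=2))   # body to turn it off again
```

Trace logging on a busy indexing cluster can be very verbose, so it is probably best enabled only briefly while a slow period is happening.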

I suppose your requests are balanced across all nodes already... bulk sizes are another lever to test and play around with.