Slow indexing speed in a 3-node cluster

Hello,

We have a problem with slow indexing speed on an ES cluster of 3 nodes (5.6.14) with the following roles:

  1. 70G RAM, SATA HDD, "mi" role
  2. 48G RAM, SSD, "mdi" role
  3. 8G RAM, SATA HDD, "mi" role

All ES nodes have their heap size set to half of the available RAM and indices.memory.index_buffer_size: 60%, and are tuned for maximum indexing performance according to the recommendations in the ES documentation.
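For reference, the relevant part of the configuration looks roughly like this (a sketch for node2 with 48G RAM; the exact heap values per node are derived from the RAM figures above):

```yaml
# jvm.options -- heap set to half of the available RAM (node2: 48G -> 24g)
# -Xms24g
# -Xmx24g

# elasticsearch.yml -- give more of the heap to the indexing buffer
# (the default is 10%)
indices.memory.index_buffer_size: 60%
```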

We also have an hourly dynamic index template for logging fairly heavy nginx traffic, with the settings below, and we keep only the last hour's index:

"settings": {
      "index": {
        "number_of_shards": "2",
        "number_of_replicas": "0",
        "refresh_interval": "5s" 
      }
    }

Nginx is installed and collecting logs on node1 and node2. Bulk requests (5-10MB) are sent with the rsyslog omelasticsearch output module, installed on node1 and node2, through node2. At moments of high nginx traffic, indexing can't get above 16-17k documents per second, so the rsyslog send queue grows and gets stuck. We noticed that the load average on data node2 is quite high at these peaks (20-22 with 11 cores) and is caused by the ES java process. We also tried to distribute the bulk requests across node1 and node2, but it did not help.
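For context, the rsyslog side looks roughly like this (a hedged sketch; the template name, queue size, and target host are illustrative, not our exact values):

```
# Load the Elasticsearch output module
module(load="omelasticsearch")

action(type="omelasticsearch"
       server="node2" serverport="9200"
       searchIndex="httplog-index"   # rsyslog template that yields the hourly index name
       dynSearchIndex="on"
       bulkmode="on"                 # batch documents into bulk requests
       maxbytes="10m"                # flush a bulk request at roughly 10MB
       queue.type="linkedlist"      # asynchronous action queue
       queue.size="500000"
       action.resumeretrycount="-1") # retry indefinitely instead of dropping logs
```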

Can you please help to investigate bottlenecks and speed up indexing?

Indexing involves doing a certain amount of work (analysing each document etc) and that work happens on data nodes. You only have one data node, so it's doing all the work. Redistributing the bulk requests across the two nodes doesn't really help because they just get rerouted back to the data node for indexing.

Can you add more data nodes to spread the load?

Yes, we tried adding the data role to node1 (SATA), but in that case iowait increases greatly (~30%) on node1 and nginx stops working well with its upstreams at high-traffic peaks. Do I understand correctly that only a separate data node with an SSD will help here? As far as I can tell, the cluster's indexing speed is limited by its slowest node?

Yes, it'd probably help to use SSDs on all your data nodes. Involving a spinning disk will probably slow things down.

One other thing to look out for: by default, Elasticsearch sets number_of_replicas: 1 on indices. With one data node you will have no replicas, but if you add a data node then Elasticsearch will add replicas of every shard on that second node. This means both nodes will be trying to index every document, since they both have a copy of every shard, so it won't solve your problem.

In the short term, this means you should check that number_of_replicas: 0 is set on every index and in your index templates. Running without replicas isn't a great idea, however, so it might be worth expanding your cluster further to support replicas.
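In console form, checking and fixing this on the indices in this thread might look like the following (the index pattern is an assumption based on the names used here):

```
GET httplog-*/_settings/index.number_of_replicas

PUT httplog-*/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
```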

We already have number_of_replicas: 0 in the dynamic template, so adding a new data node should help us. In the end, though, it would be good to have at least number_of_replicas: 1. How can we speed up indexing in that case? More data nodes? What about disabling some index properties, for example the not_analyzed fields? Our application uses near-realtime search with big aggregations: first a search on timestamp and a request regexp, then two aggregations on two fields.

UPD: Most of the fields are keywords, so it seems that disabling not_analyzed fields will not help.

Scaling out sounds like it'll reliably improve performance here, so if you're looking for a quick fix I'd try that. However, your ingest rate of 16k HTTP logs per second is quite a bit lower than what we see in internal benchmarks, so perhaps there are savings to be had from configuration changes such as a more streamlined mapping.

I don't have any concrete suggestions, I'm afraid, apart from doing some experiments to validate your ideas.

Very interesting and strange; our results really are much lower. We do have some short intervals in which 20k HTTP logs per second were ingested, but these are rare cases. Here is the template for the customized nginx logs we are using:

PUT _template/nginx_log
{
   "order": 10,
    "template": "httplog*",
    "settings": {
      "index": {
        "number_of_shards": "2",
        "number_of_replicas": "0",
        "refresh_interval": "5s" 
      }
    },
    "mappings": {
      "nginx": {
        "properties": {
          "stationid": {
            "type": "keyword" 
          },
          "request": {
            "type": "keyword" 
          },
          "cookie": {
            "type": "keyword" 
          },
          "response": {
            "type": "short" 
          },
          "bytes": {
            "type": "long" 
          },
          "clientip": {
            "type": "keyword" 
          },
          "verb": {
            "type": "keyword" 
          },
          "timestamp": {
            "format": "strict_date_optional_time||epoch_millis||dd/MMM/YYYY:HH:mm:ss Z",
            "type": "date" 
          }
        }
      },
      "_default_": {
        "dynamic_templates": [
          {
            "string_fields": {
              "mapping": {
                "norms": false,
                "type": "text",
                "fields": {
                  "raw": {
                    "ignore_above": 256,
                    "index": true,
                    "type": "keyword",
                    "doc_values": true
                  }
                }
              },
              "match_mapping_type": "string",
              "match": "*" 
            }
          },
          {
            "other_fields": {
              "mapping": {
                "doc_values": true
              },
              "match_mapping_type": "*",
              "match": "*" 
            }
          }
        ],
        "_all": {
          "norms": true,
          "enabled": true
        },
        "properties": {
          "timestamp": {
            "format": "strict_date_optional_time||epoch_millis||dd/MMM/YYYY:HH:mm:ss Z",
            "type": "date",
            "doc_values": true
          }
        }
      }
    },
    "aliases": {}
}

Could rsyslog, with the mmnormalize module we use to parse logs and omelasticsearch to send them to the ES cluster, be the bottleneck? What do you mean by a "more streamlined mapping"? Is it reasonable to lower the heap size on the data nodes from half of the available RAM to 8-16GB to reduce the load?

Next step: we enabled the data role on node1, so two data nodes are now configured with number_of_replicas: 0. We then ran esrally from a nearby 10G server with SATA disks to benchmark the ES cluster, with track=http_logs, pipeline=benchmark-only, and the other track parameters left at their defaults, at a time when traffic and load were quite low. Here are the results:

https://pastebin.com/QwWNeSsn
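For reference, the invocation was roughly as follows (a sketch; the target hosts are an assumption, so point them at your own nodes):

```shell
# Rally against an existing 5.6.14 cluster: the benchmark-only pipeline
# provisions nothing and just runs the http_logs track against the given nodes.
esrally --track=http_logs \
        --pipeline=benchmark-only \
        --target-hosts=node1:9200,node2:9200
```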

Strangely, indexing speed is still much slower than in the default elasticsearch-benchmarks, too.

The internal benchmarks run on fairly beefy machines with rather speedy storage and network, in a very carefully controlled environment, so you might not be able to reproduce their results. Also, they're running the latest master rather than 5.6.14. That said, you managed to achieve 50k+ docs per second, which sounds sufficient for your needs?

Your mapping seems to have _all enabled, with norms: true, and is perhaps also dynamically adding text fields (and therefore analysing them) on the fly. It'd be interesting to see the mapping from one of your indices to see what's really being used for indexing.

Yes, I see; I will plan an upgrade to ES 6 too. Adding the data role with number_of_replicas: 0 seems to have helped. As an example, here is the current mapping of one of the latest indices:

GET httplog-2019.03.01.08/_mapping

{
  "httplog-2019.03.01.08": {
    "mappings": {
      "_default_": {
        "_all": {
          "enabled": true
        },
        "dynamic_templates": [
          {
            "string_fields": {
              "match": "*",
              "match_mapping_type": "string",
              "mapping": {
                "fields": {
                  "raw": {
                    "ignore_above": 256,
                    "index": true,
                    "type": "keyword",
                    "doc_values": true
                  }
                },
                "norms": false,
                "type": "text"
              }
            }
          },
          {
            "other_fields": {
              "match": "*",
              "mapping": {
                "doc_values": true
              }
            }
          }
        ],
        "properties": {
          "timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis||dd/MMM/YYYY:HH:mm:ss Z"
          }
        }
      },
      "nginx": {
        "_all": {
          "enabled": true
        },
        "dynamic_templates": [
          {
            "string_fields": {
              "match": "*",
              "match_mapping_type": "string",
              "mapping": {
                "fields": {
                  "raw": {
                    "ignore_above": 256,
                    "index": true,
                    "type": "keyword",
                    "doc_values": true
                  }
                },
                "norms": false,
                "type": "text"
              }
            }
          },
          {
            "other_fields": {
              "match": "*",
              "mapping": {
                "doc_values": true
              }
            }
          }
        ],
        "properties": {
          "agent": {
            "type": "text",
            "norms": false,
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "auth": {
            "type": "text",
            "norms": false,
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "blob": {
            "type": "text",
            "norms": false,
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "bytes": {
            "type": "long"
          },
          "clientip": {
            "type": "keyword"
          },
          "cookie": {
            "type": "keyword"
          },
          "httpversion": {
            "type": "text",
            "norms": false,
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "ident": {
            "type": "text",
            "norms": false,
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "metadata": {
            "properties": {
              "filename": {
                "type": "text",
                "norms": false,
                "fields": {
                  "raw": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "fileoffset": {
                "type": "text",
                "norms": false,
                "fields": {
                  "raw": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "referrer": {
            "type": "text",
            "norms": false,
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "request": {
            "type": "keyword"
          },
          "response": {
            "type": "short"
          },
          "stationid": {
            "type": "keyword"
          },
          "timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis||dd/MMM/YYYY:HH:mm:ss Z"
          },
          "verb": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Thanks, this indicates that the following fields have been added dynamically as both text and keyword fields:

  • agent
  • auth
  • blob
  • httpversion
  • ident
  • metadata.filename
  • metadata.fileoffset
  • referrer

Do you need full-text search on all of these fields? If not, it might make sense to map them manually, or to set "dynamic": false, so that Elasticsearch does not index any fields it doesn't recognise.

Do you also need the _all field? If not, it might make sense to disable that.
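Sketching only the changed parts against the nginx_log template from earlier in the thread (the existing settings and properties would stay as they are; note that putting a template replaces it wholesale, so these lines would go into the full template body):

```
PUT _template/nginx_log
{
  "order": 10,
  "template": "httplog*",
  "mappings": {
    "nginx": {
      "_all": { "enabled": false },
      "dynamic": false
    }
  }
}
```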

Oh, thanks for the suggestion! We certainly don't need full-text search on all of these fields. I will add "dynamic": false and disable the _all field for the nginx type, and will post test results again.

