Elasticsearch - Refresh issue? Too many requests? Can't find documents randomly

Hi,

For some weeks now I've been encountering issues with the Elasticsearch "Search" API.
Sometimes a search doesn't return a document that was indexed some time before (sometimes quite a while before, like 30 min or 1 hour).

I did some research on this issue, and the only feature I found that could be related is the Elasticsearch "refresh", which makes indexed documents available for search.
We keep all default parameters on the Elasticsearch servers, so the refresh interval should be 1 second.

It seems that the refresh only targets an index if a search has been run against it in the last 30 seconds.
In our solution we run a search before indexing each document, to check for duplicates in the index, so I guess the index is changing a lot and the refresh should be running a lot too.
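For reference, that 30-second behavior is the shard "search idle" feature (the `index.search.idle.after` setting, which defaults to 30s in 7.x). One way to rule it out is to set an explicit refresh interval, since an index with an explicit `refresh_interval` never becomes search-idle. A sketch, assuming a hypothetical index named `my-index`:

```shell
# Hypothetical index name -- replace with yours.
# Setting refresh_interval explicitly opts the index out of
# search-idle throttling (1s is the default value anyway).
curl -X PUT "localhost:9200/my-index/_settings" \
     -H 'Content-Type: application/json' \
     -d '{ "index": { "refresh_interval": "1s" } }'
```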

Our servers have 8 GB of RAM and 4 vCPUs, and we're indexing around 50k documents a day for now.
We have 2 nodes in the cluster with this configuration.
It seems that we're within Elastic's guidelines for this indexing load.

It's hard to debug because it happens randomly, and when we run the search ourselves it just works...
So we're out of ideas now.

I just want to know if someone has encountered this behavior, or knows how I can investigate it.

N.B.: We're indexing documents one by one; we can't bulk them in our process. Sometimes we get a "rush" of documents and maybe the servers struggle (1k documents in a few seconds or minutes), but compared with other topics here, 50k documents a day seems like a tiny load, so I don't think that's the cause.

Thanks for your help.

Welcome to our community! :smiley:

Can you share more info - what version are you on? What is the output from the _cluster/stats?pretty&human API? Can you replicate this every time?

Hi,

Thanks for your answer.
Unfortunately, I can't reproduce this issue easily. That's why I called it "random".
We're on Elasticsearch 7.8.0.
It's a PRODUCTION case, so I've tried hard to reproduce it in my development environments, but I cannot.
The only difference between production and the other environments is the indexing and search load...
Sometimes I get 429 errors (Too Many Requests), but in this case I don't see one (or my logger is lying to me...).
I'm using the NEST library in .NET Core.

Cluster stats :

{
  "_nodes": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "cluster_name": "ELASPRDEDIQ0001",
  "cluster_uuid": "WQZk8Cu5TYeTaEnHMTxk5A",
  "timestamp": 1616749744345,
  "status": "green",
  "indices": {
    "count": 5,
    "shards": {
      "total": 10,
      "primaries": 5,
      "replication": 1.0,
      "index": {
        "shards": {
          "min": 2,
          "max": 2,
          "avg": 2.0
        },
        "primaries": {
          "min": 1,
          "max": 1,
          "avg": 1.0
        },
        "replication": {
          "min": 1.0,
          "max": 1.0,
          "avg": 1.0
        }
      }
    },
    "docs": {
      "count": 4786753,
      "deleted": 2338810
    },
    "store": {
      "size": "53.8gb",
      "size_in_bytes": 57872657511
    },
    "fielddata": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "21.3mb",
      "memory_size_in_bytes": 22366448,
      "total_count": 27970605,
      "hit_count": 765247,
      "miss_count": 27205358,
      "cache_size": 78914,
      "cache_count": 756568,
      "evictions": 677654
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 140,
      "memory": "1.5mb",
      "memory_in_bytes": 1634104,
      "terms_memory": "844.7kb",
      "terms_memory_in_bytes": 865040,
      "stored_fields_memory": "469.9kb",
      "stored_fields_memory_in_bytes": 481232,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "45.5kb",
      "norms_memory_in_bytes": 46592,
      "points_memory": "0b",
      "points_memory_in_bytes": 0,
      "doc_values_memory": "235.5kb",
      "doc_values_memory_in_bytes": 241240,
      "index_writer_memory": "248mb",
      "index_writer_memory_in_bytes": 260146882,
      "version_map_memory": "53.8mb",
      "version_map_memory_in_bytes": 56422050,
      "fixed_bit_set": "0b",
      "fixed_bit_set_memory_in_bytes": 0,
      "max_unsafe_auto_id_timestamp": -1,
      "file_sizes": {
        
      }
    },
    "mappings": {
      "field_types": [
        {
          "name": "boolean",
          "count": 11,
          "index_count": 3
        },
        {
          "name": "date",
          "count": 14,
          "index_count": 5
        },
        {
          "name": "float",
          "count": 6,
          "index_count": 1
        },
        {
          "name": "keyword",
          "count": 85,
          "index_count": 5
        },
        {
          "name": "object",
          "count": 15,
          "index_count": 3
        },
        {
          "name": "text",
          "count": 34,
          "index_count": 5
        }
      ]
    },
    "analysis": {
      "char_filter_types": [
        
      ],
      "tokenizer_types": [
        
      ],
      "filter_types": [
        {
          "name": "edge_ngram",
          "count": 3,
          "index_count": 3
        },
        {
          "name": "pattern_capture",
          "count": 3,
          "index_count": 3
        }
      ],
      "analyzer_types": [
        {
          "name": "custom",
          "count": 9,
          "index_count": 3
        }
      ],
      "built_in_char_filters": [
        
      ],
      "built_in_tokenizers": [
        {
          "name": "standard",
          "count": 3,
          "index_count": 3
        },
        {
          "name": "uax_url_email",
          "count": 6,
          "index_count": 3
        }
      ],
      "built_in_filters": [
        {
          "name": "lowercase",
          "count": 9,
          "index_count": 3
        },
        {
          "name": "stop",
          "count": 3,
          "index_count": 3
        },
        {
          "name": "trim",
          "count": 3,
          "index_count": 3
        },
        {
          "name": "unique",
          "count": 3,
          "index_count": 3
        }
      ],
      "built_in_analyzers": [
        {
          "name": "standard",
          "count": 4,
          "index_count": 2
        }
      ]
    }
  },
  "nodes": {
    "count": {
      "total": 2,
      "coordinating_only": 0,
      "data": 2,
      "ingest": 2,
      "master": 2,
      "ml": 2,
      "remote_cluster_client": 2,
      "transform": 2,
      "voting_only": 0
    },
    "versions": [
      "7.8.0"
    ],
    "os": {
      "available_processors": 8,
      "allocated_processors": 8,
      "names": [
        {
          "name": "Windows Server 2019",
          "count": 2
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Windows Server 2019",
          "count": 2
        }
      ],
      "mem": {
        "total": "31.9gb",
        "total_in_bytes": 34358525952,
        "free": "8.4gb",
        "free_in_bytes": 9037443072,
        "used": "23.5gb",
        "used_in_bytes": 25321082880,
        "free_percent": 26,
        "used_percent": 74
      }
    },
    "process": {
      "cpu": {
        "percent": 2
      },
      "open_file_descriptors": {
        "min": -1,
        "max": -1,
        "avg": 0
      }
    },
    "jvm": {
      "max_uptime": "20.9d",
      "max_uptime_in_millis": 1806973476,
      "versions": [
        {
          "version": "14.0.1",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "14.0.1+7",
          "vm_vendor": "AdoptOpenJDK",
          "bundled_jdk": false,
          "using_bundled_jdk": null,
          "count": 2
        }
      ],
      "mem": {
        "heap_used": "7.9gb",
        "heap_used_in_bytes": 8581577504,
        "heap_max": "16gb",
        "heap_max_in_bytes": 17179869184
      },
      "threads": 139
    },
    "fs": {
      "total": "119.9gb",
      "total_in_bytes": 128811261952,
      "free": "49.9gb",
      "free_in_bytes": 53653364736,
      "available": "49.9gb",
      "available_in_bytes": 53653364736
    },
    "plugins": [
      
    ],
    "network_types": {
      "transport_types": {
        "netty4": 2
      },
      "http_types": {
        "netty4": 2
      }
    },
    "discovery_types": {
      "zen": 2
    },
    "packaging_types": [
      {
        "flavor": "unknown",
        "type": "unknown",
        "count": 2
      }
    ],
    "ingest": {
      "number_of_pipelines": 2,
      "processor_stats": {
        "gsub": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time": "0s",
          "time_in_millis": 0
        },
        "script": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time": "0s",
          "time_in_millis": 0
        }
      }
    }
  }
}

Indexing individual documents can add a lot of overhead, especially with respect to disk I/O, and it also looks like your documents are quite large. How large is an average document? Are you by any chance using nested documents? If so, how many nested documents does an average document have?

For that, I really like the BulkProcessor pattern.
It's available in Java, but I think you can implement a similar feature in other languages if it's not available in the client.

Basically, instead of sending documents to Elasticsearch directly, you send them to a bulk processor.

The bulk processor then executes a bulk request when:

  • the processor contains more than X documents
  • the processor contains some documents and has not been executed for Y seconds
  • the processor contains more than Z bytes of data

BulkProcessor bulkProcessor = BulkProcessor.builder(client,
        new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) { }
        })
        .setBulkActions(10000)                              /* X */
        .setFlushInterval(TimeValue.timeValueSeconds(5))    /* Y */
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) /* Z */
        .build();

That way, if you suddenly get a spike of data, the bulk API is used behind the scenes anyway, and it's much more efficient than single index operations.

We're not actually using nested documents.
Documents come from an external system one by one; we cannot batch them, unfortunately.
The average document is "only" 6-7 kB (3 million documents / 20 GB on each node).

Seems like a good idea.
But I really need my documents to be indexed as quickly as possible, because each one is used in the process right after.
With this solution, my process might not find the indexed document (I fetch it back by its key) because it's still "delayed" by the bulk processor.

Do you have a tip for that?

Yes.

No. Sadly.

But maybe you can do something like this: start indexing documents one by one, but if you're getting more than X documents per second, switch to bulk mode, and then switch back to the single-index API...
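That switching policy can be sketched without any client dependency. This is only an illustration: `indexSingle` and `indexBulk` are hypothetical stand-ins for the real client calls, and the thresholds are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of rate-based switching between single and bulk
// indexing. indexSingle()/indexBulk() are hypothetical stand-ins for
// the real Elasticsearch client calls.
class AdaptiveIndexer {
    private final int maxDocsPerSecond;   // above this rate, switch to bulk
    private final int bulkSize;           // flush the buffer at this size
    private final List<String> buffer = new ArrayList<>();
    private long windowStart = System.currentTimeMillis();
    private int docsInWindow = 0;

    AdaptiveIndexer(int maxDocsPerSecond, int bulkSize) {
        this.maxDocsPerSecond = maxDocsPerSecond;
        this.bulkSize = bulkSize;
    }

    void index(String doc) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) {       // new one-second window
            windowStart = now;
            docsInWindow = 0;
        }
        docsInWindow++;
        // Once the buffer is non-empty we keep buffering so document
        // order is preserved until the next flush.
        if (docsInWindow <= maxDocsPerSecond && buffer.isEmpty()) {
            indexSingle(doc);                  // low rate: index one by one
        } else {
            buffer.add(doc);                   // high rate: accumulate
            if (buffer.size() >= bulkSize) {
                flush();
            }
        }
    }

    void flush() {
        if (!buffer.isEmpty()) {
            indexBulk(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    // Stand-ins for the real Elasticsearch calls.
    protected void indexSingle(String doc) { /* client.index(...) */ }
    protected void indexBulk(List<String> docs) { /* client.bulk(...) */ }
}
```

A real implementation would also flush on a timer, like the BulkProcessor's flush interval, so a buffered document is never stuck waiting for more traffic.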

I can imagine something like that; having this kind of policy seems pretty smart.
Is there any way to "wait" until the bulk processor has finished indexing before continuing the business process? (I'm indexing documents in parallel.)
I can wait 3-5 seconds if it limits the calls made to the Elastic server (through a policy or not) thanks to this batch pattern.
Another question: does the "Too Many Requests" error mix search calls and index calls in the same queue?
How can I avoid this error in my code? Should I use a semaphore to limit the threads calling Elastic on the same index?

Thanks for your help again

There's a listener with an afterBulk() method, which you can probably use to know when the bulk request has been processed by Elasticsearch.

@Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {  } 
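If the calling process must block until its document has been flushed, one generic way is for the afterBulk-style callback to release a latch that the producer waits on. A minimal sketch of the idea, with no Elasticsearch classes; `sendBulk`/`flush` are simplified stand-ins for the real BulkProcessor machinery, and the class name is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Sketch of "wait until my document has been flushed": each pending
// document carries a latch that the bulk-completion callback (the
// equivalent of BulkProcessor.Listener#afterBulk) releases.
class AwaitableBulkBuffer {
    private List<String> docs = new ArrayList<>();
    private List<CountDownLatch> latches = new ArrayList<>();

    // Enqueue a document; the caller can await() the returned latch.
    synchronized CountDownLatch add(String doc) {
        CountDownLatch latch = new CountDownLatch(1);
        docs.add(doc);
        latches.add(latch);
        return latch;
    }

    // Called when the bulk request completes (afterBulk equivalent).
    synchronized void flush() {
        sendBulk(new ArrayList<>(docs));      // real code: client.bulk(...)
        for (CountDownLatch l : latches) {
            l.countDown();                    // wake up waiting producers
        }
        docs = new ArrayList<>();
        latches = new ArrayList<>();
    }

    protected void sendBulk(List<String> batch) { /* stand-in */ }
}
```

The caller would then do something like `latch.await(5, TimeUnit.SECONDS)`, which fits the 3-5 second budget mentioned above.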

There are dedicated search and bulk thread pools, so normally no.

Do you have the logs for this error?
I know that some tools, Beats for example, have implemented a backoff policy for this: Configure the Elasticsearch output | Filebeat Reference [7.12] | Elastic

The number of seconds to wait before trying to reconnect to Elasticsearch after a network error. After waiting backoff.init seconds, Filebeat tries to reconnect. If the attempt fails, the backoff timer is increased exponentially up to backoff.max. After a successful connection, the backoff timer is reset. The default is 1s.

So after an error, you can add a delay in your code before sending the same request again... You can probably implement such a thing yourself if it's not available out of the box in NEST.

Hi,

I actually don't have a lower-level error.
I'm using NEST, and the only errors I get are of this kind:

outerType : System.Threading.Tasks.TaskCanceledException
outerMessage : The operation was canceled.
The I/O operation has been aborted because of either a thread exit or an application request.

It happens on a random search when I have many searches and many indexing operations running concurrently. My "many" is not a lot for an Elasticsearch server (maybe 1000 index operations and 1000 searches within a few seconds or minutes).
I really don't know how to handle it.
It seems to be request throttling...
Sorry, I don't know what kind of additional logs I can give you.

NB: I'm on a .NET Core app that can index 100 documents in parallel on each node, and I have 2 application nodes in production.

I'd recommend a few things:

But maybe it's a problem with the .NET server itself, which cannot hold the load? I mean, this message does not seem to come from Elasticsearch...

I'm not very comfortable with Elasticsearch logs, actually.
I just took a look at them but didn't see anything unusual, I'd say.
Our .NET servers were pretty "OK" when it happened, as were the Elastic ones.
When this issue happens, we just restart the Elasticsearch services, and everything seems to work fine after that.

What kind of information do I have to look for in the logs?

Thanks again for your help

Share them and we will see if something looks interesting...