Hi,
I would like to share my use case and the issues we are seeing, and hopefully get some good advice and recommendations.
####Use case:
My product is a search engine built on ES. It indexes constantly, every day, into the same index. Sadly we cannot avoid this, because we don’t know when documents expire or change until we receive the new ones. So we perform index, update and delete operations, which creates a lot of deleted documents in the index.
Our search performance is degraded by all the indexing operations. If we stop indexing, searches speed up.
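To make the flow above more concrete, here is a minimal sketch of how a new feed can be turned into index, update and delete operations. It is simplified and illustrative, not our actual code: the function name `plan_operations` and the dict-based inputs are assumptions made for this example.

```python
def plan_operations(known, incoming):
    """Diff the currently indexed documents against a new feed.

    known / incoming: dict mapping a custom _id to the document body.
    Returns a list of (action, _id) tuples. (Hypothetical helper.)
    """
    ops = []
    for doc_id, body in incoming.items():
        if doc_id not in known:
            ops.append(("index", doc_id))    # new document
        elif known[doc_id] != body:
            ops.append(("update", doc_id))   # changed document
    for doc_id in known:
        if doc_id not in incoming:
            ops.append(("delete", doc_id))   # expired document
    return ops
```

Every update and delete produced this way leaves a deleted document behind in the index until segments are merged, which is where our high deleted-document percentage comes from.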
####Current infrastructure and state:
Elasticsearch: 1.5.2
Cluster with 4 Nodes
Each Node:
Total memory: 60 GB
Heap size: 24.9 GB
1 Index
1 shard
3 replicas
~6 million documents
~25 GB store size
Disk space used: 15%
Documents deleted: 37-40%
Refresh interval: 120
HTTP Connection Rate: 7 /second
####Questions and doubts:
- We currently use a custom _id. We know it is not the best choice according to this post. We chose that _id to keep our infrastructure and code simpler (we have to know the _id to make update and delete operations later). If we change it to one of the recommended types, would the faster indexing improve search performance? By how much?
- If I have 5 nodes and I set up one client to index with a connection string listing 3 of the nodes, and another client to search with a connection string listing the other 2 nodes, are we going to see any improvement in search performance? Or is it the same, because the load is balanced across all the nodes?
- Is it possible and useful to have 5 nodes, 3 to write and 2 to read? Would only the read nodes be faster? Or is this nonsense?
- Is there any way to know which node answered a request?
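To make the 5-node question concrete, here is a minimal sketch of the client split I have in mind. The host names are hypothetical placeholders, and `HostPool` is just an illustration of round-robin over a connection string, not a real client library.

```python
import itertools

# Hypothetical host names; placeholders for the real cluster addresses.
INDEX_HOSTS = ["es-node1:9200", "es-node2:9200", "es-node3:9200"]
SEARCH_HOSTS = ["es-node4:9200", "es-node5:9200"]


class HostPool:
    """Round-robin over a fixed list of hosts, like a client connection string."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self._cycle = itertools.cycle(self.hosts)

    def next_host(self):
        # Return the host that should receive the next HTTP request.
        return next(self._cycle)


index_pool = HostPool(INDEX_HOSTS)    # used only by the indexing client
search_pool = HostPool(SEARCH_HOSTS)  # used only by the search client
```

This only controls which node receives each request. What I don't know is whether the receiving node also does the work, or whether the request is spread across the shard copies on all nodes anyway.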
Thank you in advance!