A cluster of 100 nodes was installed. We want it to serve as a search engine like Google.
Servers are located in a data center.
RAID 0 was chosen to meet the high-speed requirement.
Query traffic will be more intense than indexing traffic.
Searches will use boolean, wildcard, fuzzy, transposition, and function score queries.
The data in the index is updated periodically (about once every 1-2 weeks).
There will be 1 replica for each index.
There are 3 coordinator nodes that also carry the ingest role (coordinator + ingest).
As recommended in best practice, a separate cluster will be set up for stack monitoring, and the data will be sent to that cluster with Metricbeat.
Ingest pipelines (including for stack monitoring) are not actively used. We chose coordinator + ingest in case we need them in the future.
Does a node with the coordinator + ingest roles perform as well as a dedicated coordinator node when the ingest role is not actively used?
Would you recommend putting a load balancer in front of the 3 coordinator nodes for indexing or querying?
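
For reference, the query mix described above (boolean, wildcard, fuzzy with transpositions, function score) might look roughly like the sketch below. It only builds the request body as a Python dict. The field names `title`, `body`, and `popularity` and the helper `build_search_body` are hypothetical and are not taken from our real mapping.

```python
import json


def build_search_body(term, wildcard_pattern, fuzzy_term):
    """Build a search body combining bool, wildcard, fuzzy and function_score.

    Field names ("title", "body", "popularity") are hypothetical placeholders.
    """
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Must match the main term in the body text.
                        "must": [{"match": {"body": term}}],
                        "should": [
                            # Wildcard clause, e.g. "elast*".
                            {"wildcard": {"title": {"value": wildcard_pattern}}},
                            # Fuzzy clause; transpositions=True counts "ab" -> "ba"
                            # as a single edit.
                            {
                                "fuzzy": {
                                    "title": {
                                        "value": fuzzy_term,
                                        "fuzziness": "AUTO",
                                        "transpositions": True,
                                    }
                                }
                            },
                        ],
                    }
                },
                # Scale the text score by a numeric field.
                "functions": [
                    {
                        "field_value_factor": {
                            "field": "popularity",
                            "modifier": "log1p",
                            "missing": 1,
                        }
                    }
                ],
                "boost_mode": "multiply",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("search engine", "elast*", "serach"), indent=2))
```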
There are index sizes up to 60 TB. When calculated according to best practice:
`Aim for shard sizes between 10GB and 50GB`
index => 60 TB => shard count => must be between 1200 and 6000
Is it ok to use 1500 shards for an index in a system with 100 nodes?
Note: Indexing is _id-based. The index is constantly being updated, so it could not be split across multiple indices without risking duplicate records.
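
The shard arithmetic above can be checked with a short sketch. It assumes 1 TB = 1000 GB and, for the per-node figure, that all 100 nodes hold data, which may not match the real layout because some nodes are coordinators. The function names are hypothetical helpers.

```python
import math


def shard_count_range(index_size_tb, min_shard_gb=10, max_shard_gb=50, gb_per_tb=1000):
    """Return (fewest, most) primary shards that keep each shard within the size band."""
    size_gb = index_size_tb * gb_per_tb
    fewest = math.ceil(size_gb / max_shard_gb)   # largest allowed shards
    most = math.floor(size_gb / min_shard_gb)    # smallest allowed shards
    return fewest, most


def layout(index_size_tb, primaries, replicas=1, data_nodes=100, gb_per_tb=1000):
    """Return (GB per primary shard, total shard copies, shard copies per node)."""
    shard_gb = index_size_tb * gb_per_tb / primaries
    total_copies = primaries * (1 + replicas)
    # Assumption: shards spread evenly over `data_nodes` data-holding nodes.
    return shard_gb, total_copies, total_copies / data_nodes


if __name__ == "__main__":
    print(shard_count_range(60))   # (1200, 6000)
    print(layout(60, 1500))        # (40.0, 3000, 30.0)
```

Under these assumptions, 1500 primaries gives 40 GB per shard, inside the 10-50 GB band, and 3000 shard copies in total once the single replica is counted.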
We are considering using a (Turkish) dictionary stemmer for natural language processing, but we have performance concerns.
Do you have any suggestions?
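
To make the question concrete, here is a sketch of the two stemming variants being compared, written as index settings in a Python dict. The dictionary variant uses the `hunspell` token filter, which assumes a Turkish dictionary has been installed under `config/hunspell/tr_TR` (Elasticsearch does not ship one). The second variant is the algorithmic `stemmer` filter with `language: turkish`. The analyzer and filter names are hypothetical.

```python
import json

# Hypothetical analyzer/filter names; only the filter "type" values and their
# parameters are Elasticsearch built-ins.
ANALYSIS_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "tr_lowercase": {"type": "lowercase", "language": "turkish"},
                # Dictionary stemmer: needs config/hunspell/tr_TR/*.aff and *.dic.
                "tr_hunspell": {"type": "hunspell", "locale": "tr_TR", "dedup": True},
                # Algorithmic stemmer: no dictionary files required.
                "tr_algorithmic": {"type": "stemmer", "language": "turkish"},
            },
            "analyzer": {
                "tr_dictionary_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["apostrophe", "tr_lowercase", "tr_hunspell"],
                },
                "tr_algorithmic_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["apostrophe", "tr_lowercase", "tr_algorithmic"],
                },
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(ANALYSIS_SETTINGS, indent=2))
```

Defining both analyzers on a small test index would allow indexing throughput and query latency to be compared before committing to either one.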