Hi, we currently have a single index in production with around 50M
documents. They are mainly in three different languages, and each document
has a created_at field.
Our common search use case is to search documents in any language, but we
also allow users to restrict a search to one specific language. In terms of
time intervals, the common use case is to search the entire dataset, but we
also have a use case where we only need to search documents from the last
~1 year.
We currently have some queries paying the price of searching a 50M-document
index when they would only need to search < 10M documents. Filters can
help with this, but they only help by skipping the scoring of a document:
the query processing still needs to go through the whole gigantic posting
list, which is potentially on disk (a posting list for "the" in Spanish is
really tiny compared to the one in English, for example).
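To make the concern concrete, here is a toy sketch in Python. It is a
deliberately simplified model of an inverted index, not how Lucene actually
evaluates queries, and the five documents are made up for illustration. It
only counts how many postings are touched for one term when a language
filter is applied to a single shared index versus when the same term is
looked up in a Spanish-only index.

```python
from collections import defaultdict

# Hypothetical mini corpus: (doc_id, language, text).
docs = [
    (1, "en", "the cat sat on the mat"),
    (2, "en", "the dog barked"),
    (3, "en", "over the hill"),
    (4, "es", "the beatles en concierto"),
    (5, "es", "el gato duerme"),
]


def build_index(documents):
    """Map each term to its posting list of (doc_id, language) entries."""
    postings = defaultdict(list)
    for doc_id, lang, text in documents:
        for term in set(text.split()):
            postings[term].append((doc_id, lang))
    return postings


def search_with_filter(postings, term, lang):
    """Walk the term's whole posting list, keeping only docs in `lang`.

    Returns the matching doc ids and the number of postings visited.
    """
    visited = 0
    hits = []
    for doc_id, doc_lang in postings.get(term, []):
        visited += 1
        if doc_lang == lang:
            hits.append(doc_id)
    return hits, visited


# One shared index versus an index holding only the Spanish documents.
single_index = build_index(docs)
spanish_index = build_index([d for d in docs if d[1] == "es"])

hits_filtered, visited_filtered = search_with_filter(single_index, "the", "es")
hits_split, visited_split = search_with_filter(spanish_index, "the", "es")

print("single index + filter:", hits_filtered, "postings visited:", visited_filtered)
print("spanish-only index:   ", hits_split, "postings visited:", visited_split)
```

Both lookups return the same document, but in this toy model the filtered
search on the shared index walks every posting for "the", while the
Spanish-only index walks only its own short list.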
My questions are about performance. Our main concern right now is query
latency. Our system uses many queries with many OR clauses, which easily
makes query latency a pain: depending on the terms, a query can take up to
dozens of seconds.
- First, if I search multiple indices, will the search on them be done
in "parallel"? For example, I have "alias1" over two indices: "index1" and
"index2". When searching "alias1", will the search in "index1" run in
parallel with the search in "index2", or will they be executed in sequence?
- Second, what are the implications of "cutting" the dataset into 3
indices, one for each language? What will be the performance difference
between searching 3 indices and searching 1 index with all documents?
- Third, separating indices by language and then searching all indices
together would mess up the scoring, right? IDFs for words can vary greatly
between languages (would we need to change the search type?).
- Finally, would it be a good idea to "cut" the index into time intervals
(like an index for each year's worth of documents, for each language)?
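To illustrate the IDF concern in the third question, here is a small Python
sketch. It assumes Lucene's classic TF-IDF formula,
idf = 1 + ln(numDocs / (docFreq + 1)); the document counts and document
frequencies for the term "the" are made-up numbers, not measurements from
our index.

```python
import math


def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Classic Lucene TF-IDF inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-language statistics for the term "the".
indices = {
    "english": {"num_docs": 30_000_000, "doc_freq": 27_000_000},
    "spanish": {"num_docs": 10_000_000, "doc_freq": 50_000},
}

# IDF computed separately inside each per-language index.
per_index_idf = {
    name: classic_idf(stats["num_docs"], stats["doc_freq"])
    for name, stats in indices.items()
}

# IDF if the same documents lived in one combined index.
combined_docs = sum(stats["num_docs"] for stats in indices.values())
combined_freq = sum(stats["doc_freq"] for stats in indices.values())
combined_idf = classic_idf(combined_docs, combined_freq)

for name, idf in per_index_idf.items():
    print(f"{name:8s} idf = {idf:.3f}")
print(f"combined idf = {combined_idf:.3f}")
```

With these assumed numbers, "the" gets a much higher IDF in the Spanish
index than in the English one, and the combined value sits in between, so
scores computed per index would not be directly comparable when the results
are merged.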
This all assumes the same number of machines/shards/replicas. We currently
have 16 shards on 8 (m1.large) EC2 instances.
Thanks!
Felipe Hummel