Performance on indices for each language

Hi, we currently have a single index in production with around 50M
documents. They are mainly in three different languages, and each one has a
created_at field.

Our common search use case is to search documents in any language, but we
also allow users to restrict the search to one specific language. In terms of
time intervals, the common use case is to search the entire dataset, but we also
have a use case where we only need to search documents from the last ~1 year.

Some of our queries currently pay the price of searching a 50M-document
index when they only need to search fewer than 10M documents. Filters can
help with this, but they only help skip the scoring of a document; query
processing still has to go through the whole gigantic posting list, which is
potentially on disk. (A posting list for "the" in Spanish is tiny compared to
the one in English, for example.)

My questions are about performance. Our main concern right now is query
latency. Our system issues many queries with many OR clauses, which easily
makes query latency a pain: depending on the terms, it can reach dozens
of seconds.

  • First, if I search multiple indices, will the search on them be done
    in parallel? For example, I have "alias1" over two indices, "index1" and
    "index2". When searching "alias1", will the search in "index1" run in
    parallel with the search in "index2", or will they be executed in sequence?
  • Second, what are the implications of "cutting" the dataset into 3
    indices, one for each language? What would the performance difference be
    between searching 3 indices and searching 1 index with all documents?
  • Third, separating indices by language and then searching all indices
    together would mess up the scoring, right? IDFs for words can vary greatly
    between languages (would we need to change the search type?).
  • Finally, would it be a good idea to "cut" the index into time intervals
    (like an index for each year's worth of documents, for each language)?
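
For reference, the alias setup from the first question and the per-language split from the second could be sketched roughly as below. The code only builds request bodies and paths; it does not talk to a cluster. The index names ("docs-en", etc.) and the alias "docs-all" are hypothetical, and "pt" is a placeholder because the post does not name the third language. The dfs_query_then_fetch search type is shown only because it collects term statistics from the shards before scoring; whether it resolves the IDF concern for this data is not confirmed here.

```python
import json

# Hypothetical names: the post does not name its indices or its third language.
LANGUAGES = ["en", "es", "pt"]
ALIAS = "docs-all"


def index_name(lang):
    """Per-language index name, e.g. 'docs-en'."""
    return "docs-" + lang


def alias_actions(alias, languages):
    """Body for POST /_aliases that puts every per-language index behind one alias."""
    return {
        "actions": [
            {"add": {"index": index_name(lang), "alias": alias}}
            for lang in languages
        ]
    }


def search_path(target, search_type=None):
    """Path for a search against an index, a comma-separated index list, or an alias."""
    path = "/" + target + "/_search"
    if search_type is not None:
        path += "?search_type=" + search_type
    return path


if __name__ == "__main__":
    # Alias covering all languages (the "search everything" case).
    print(json.dumps(alias_actions(ALIAS, LANGUAGES), indent=2))
    # Single-language search only touches that language's index.
    print(search_path(index_name("es")))
    # All-language search through the alias, with cross-shard term statistics.
    print(search_path(ALIAS, search_type="dfs_query_then_fetch"))
```

With this layout a single-language query never reads the other languages' posting lists, while the alias keeps the "search everything" case to one request.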

This all assumes the same number of machines/shards/replicas. We currently
have 16 shards in 8 (m1.large) EC2 instances.

Thanks!

Felipe Hummel

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Do you use filtered queries?
http://www.elasticsearch.org/guide/reference/query-dsl/filtered-query/

Not really; we normally use request-level filters.
We only use filtered queries when we need a facet to be calculated taking
the filters into account.

I assume there's no performance difference.
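
To make the two options concrete, here is a rough sketch of both request bodies, using the query DSL of the Elasticsearch releases current at the time of this thread (later releases changed both forms). The field names "lang" and "body" are hypothetical. The sketch only shows the shape of each request and makes no claim about which one is faster.

```python
import json

# Hypothetical field names: "lang" holds the language code, "body" the text.


def request_level_filter(text, lang):
    """Query plus a top-level filter: hits are filtered, facets are not."""
    return {
        "query": {"match": {"body": text}},
        "filter": {"term": {"lang": lang}},
    }


def filtered_query(text, lang):
    """Filter wrapped inside the query, so facets also see only the filtered docs."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"body": text}},
                "filter": {"term": {"lang": lang}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(request_level_filter("casa", "es"), indent=2))
    print(json.dumps(filtered_query("casa", "es"), indent=2))
```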

On Sunday, June 2, 2013 5:18:30 AM UTC-4, Andrew Gaydenko wrote:

Do you use filtered queries?
http://www.elasticsearch.org/guide/reference/query-dsl/filtered-query/

Does anyone have an opinion?

On Sunday, June 2, 2013 12:27:17 PM UTC-4, Felipe Hummel wrote:

Not really; we normally use request-level filters.
We only use filtered queries when we need a facet to be calculated taking
the filters into account.

I assume there's no performance difference.

Hi,

One element that people often don't talk about when talking about searching
documents in multiple languages is the UI/UX. What does that need to look
like and how much flexibility do you have there? I understand some queries
need to search all languages, but that doesn't necessarily mean the UI
needs to show a single result set. If the presentation layer allows
separation by language, I would go with an index-per-language model, which is
cleaner and simpler.

The second part of your email is about searching all content vs. a subset
of content depending on the user's time range selection, or something along
those lines. For this you could consider having multiple indices for
different time frames, plus a search client that knows which index holds
documents from which time range and issues queries only to the relevant
indices. Alternatively, this might be doable with routing on a field that
contains a date or a part of it.
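
A minimal sketch of such a client-side lookup, assuming one index per language per calendar year. The naming scheme "docs-<lang>-<year>" is hypothetical and not something from this thread.

```python
from datetime import date


def index_name(lang, year):
    """Hypothetical naming scheme: one index per language per calendar year."""
    return "docs-%s-%d" % (lang, year)


def indices_for_range(languages, start, end):
    """Names of every index that may hold documents created in [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    return [
        index_name(lang, year)
        for lang in languages
        for year in range(start.year, end.year + 1)
    ]


def search_path(languages, start, end):
    """Search path that targets only the relevant indices."""
    return "/" + ",".join(indices_for_range(languages, start, end)) + "/_search"


if __name__ == "__main__":
    # "Last ~1 year" of Spanish documents spans two yearly indices.
    print(search_path(["es"], date(2012, 6, 3), date(2013, 6, 2)))
```

The query would still need a range filter on created_at, because a yearly index at either end of the range also holds documents outside it.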

Otis

Solr & Elasticsearch Support - http://sematext.com/
Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Sunday, June 2, 2013 5:06:16 AM UTC-4, Felipe Hummel wrote:
