Hi there. I've been digging deeper into Elasticsearch lately, and I'm wondering about the ramifications of multiple types in a single index.
For some background, I got into Elasticsearch because a number of Ruby on Rails projects I've worked on use it. It seems to be idiomatic for the various Ruby gems to split up each document type into its own index.
However, that limits the ability to do parent-child queries and so on.
What's the story behind indices and types? Should they stay split up unless one needs to reason about the relationships between documents? Should they all go in a single index just for fun? Does it matter tremendously one way or another?
My instinct tells me to keep them separated. I would suspect (without any data to back it up) that ES has an easier time of just about everything (searching, indexing, storage, partitioning) when document types are broken out into different indices.
I think your instincts are correct. As someone who has blown up a production site because I did not split two types into two indexes, my default thinking is 1:1 unless there is some good reason to do otherwise. This is especially true if it makes sense for fields to have the same name across types.
Calculations like IDF apply to a field across the entire index, regardless of type or other filtered fields that you may be using to logically partition your data.
Example: I had a reasonably sized index (~1M documents) with a single type, type1, that had a field called my_text. Queries, aggregations, and so on were super fast, and lots of end users hit pages that relied on these queries.
Then I added about a billion documents of type type2, also containing a field called my_text with the same mapping. I thought that was clever because I could then do some reasoning about my_text across the types.
What happened instead was that my original queries involving my_text on type1 got slow enough to back up the application server layer. Then I ran out of heap.
From a relevancy point of view, you can inadvertently change my_text search results on type1 by adding documents of type2, as the IDF across the index for that field changes. Ditto for suggestions.
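To make the IDF point concrete, here is a rough sketch in Python. It assumes the classic Lucene-style formula idf = 1 + ln(N / (df + 1)); the exact formula depends on the similarity and version in use, and the document counts below are made-up numbers for illustration, not measurements from my cluster.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classic TF-IDF style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical numbers: a term in my_text that is rare in type1 ...
type1_docs = 1_000_000
type1_doc_freq = 1_000

# ... but very common in the type2 documents added to the same index.
type2_docs = 1_000_000_000
type2_doc_freq = 500_000_000

# Statistics for my_text are pooled across the index, not kept per type.
idf_before = idf(type1_docs, type1_doc_freq)
idf_after = idf(type1_docs + type2_docs, type1_doc_freq + type2_doc_freq)

print(f"idf with type1 only:      {idf_before:.3f}")
print(f"idf with type1 and type2: {idf_after:.3f}")
```

The term's weight drops sharply once the type2 documents are in the same index, even though a query filtered to type1 sees exactly the same type1 documents as before.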
Finally, for time-based indexes like you'd get with Logstash, it's easier to maintain different storage/replica/retention policies when you have the types split into separate physical indexes.
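As a sketch of what per-type retention looks like once the types live in their own daily indexes: the naming scheme (`<type>-YYYY.MM.DD`), the retention values, and the helper name below are all hypothetical, and the function only works out which index names are past their retention window. It does not talk to a cluster.

```python
from datetime import date, datetime, timedelta

# Hypothetical retention policy, in days, keyed by type name.
RETENTION_DAYS = {"type1": 90, "type2": 7}


def expired_indices(index_names, today, retention=RETENTION_DAYS):
    """Return the index names that are older than their type's retention."""
    expired = []
    for name in index_names:
        # Assumed naming scheme: "<type>-YYYY.MM.DD"
        type_name, _, day = name.rpartition("-")
        if type_name not in retention:
            continue  # no policy for this index, leave it alone
        try:
            created = datetime.strptime(day, "%Y.%m.%d").date()
        except ValueError:
            continue  # name does not end in a date, skip it
        if today - created > timedelta(days=retention[type_name]):
            expired.append(name)
    return expired
```

With both types in one index you cannot drop old type2 data by deleting an index; you would have to delete documents individually, which is much more expensive.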
If an index is analogous to a DB, then a type is analogous to a table. Keeping things separate is good data hygiene.
However, if you have lots of different types, it may make sense to normalise them and group similar things together.
I thought that at some point (when field naming was refactored for multi-fields and other scenarios) Lucene field names were changed to include the type name, so that no collisions would occur across types. I guess I was wrong. Could anyone comment on this?