I am trying to understand and effectively use types in Elasticsearch.
However, I am still not clear on how the _type meta field differs from any
regular field of an index in terms of storage/implementation.
For example, say I have 1 million records (posts), each with a
creation_date. How will things play out if I make creation_date itself
the type (leading to ~1 million types)? I don't think it
affects the way Lucene stores documents, does it?
In what way will my Elasticsearch query performance be affected if I use
creation_date as the type instead of a single generic type, say 'post'?
While Elasticsearch is scalable in many dimensions, there is one where it is limited: the metadata about your indices, which includes the various indices, doc types, and the fields they contain.
These "mappings" exist in memory and are updated and shared around all nodes with every change. For this reason it does not make sense to endlessly grow the list of indices, types (and therefore fields) that exist in the cluster state. A type-per-document-creation-date registers a million on the one-to-ten scale of bad design decisions.
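As a sketch of the alternative to a type per date: keep a single type with an ordinary creation_date field, and narrow at query time with a range filter. Index, type, and field names below are hypothetical; syntax follows the 2.x query DSL:

```json
GET /posts/post/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "creation_date": {
            "gte": "2016-01-01",
            "lt": "2016-02-01"
          }
        }
      }
    }
  }
}
```

This gives the same "only look at one day/month" effect without adding anything to the cluster-state mappings.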
Thanks Mark. That was helpful. A colleague of mine proposed that idea and, despite my intuition that it was bad design, I didn't have any documentation on index_type to prove it.
Hi Mark, thanks for sharing some insights on the internal design of Elasticsearch. Let me clarify that we'll never come close to 1 million types. The retention policy is to keep 10 years of data, which accounts for 365 x 10 types at most.

To our understanding, types are designed to let us keep documents in corresponding partitions. At query time, Elasticsearch will only go through the documents in the type (date, in our specific case) range specified in the query. Since we have no other fields containing unchanging or fairly distributed values (i.e. not skewed like 90% on one value, 10% on the others), date, as the primary filter in our report app, seems to be our only option for partitioning the data.

Alternatively, we can keep all our data in one default type. Which do you think is more likely to be the bottleneck: having 3650 types in the metadata sitting in memory, or not leveraging the type feature to partition the data?
No, types in the same index share the same physical Lucene files. Behind the scenes, Elasticsearch applies a filter to the Lucene docs to return only the ones with your chosen type. This filter is no different from one you could implement with a custom field, e.g. by defining a filtered alias. The one difference with the many-types approach is that you would pollute your cluster state with thousands of near-identical mappings.
 Index Aliases | Elasticsearch Guide [2.3] | Elastic
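For illustration, a filtered alias that scopes searches to one month's worth of documents might look like this (index, alias, and field names are hypothetical):

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "posts",
        "alias": "posts-2016-06",
        "filter": {
          "range": {
            "creation_date": { "gte": "2016-06-01", "lt": "2016-07-01" }
          }
        }
      }
    }
  ]
}
```

Searching `posts-2016-06` then behaves like searching a "partition", but it is just the same filter mechanism types use internally.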
It sounds to me like you are trying to use types to solve a problem time-based indices are generally used for.
OK, if I understand the time-based-index approach correctly, I'll make several API calls (each with a specific time-based index specified) to cover the date range specified by the user. Is that correct? Since we can't really predict what date range the user would pick, we won't be able to come up with an alias to group indices together. And do you imply that separate indices are the only way we can get the data-partitioning effect?
You can specify multiple indices in a single call, so multiple requests are generally not needed. Kibana can efficiently determine which indices may hold data for the selected time window through a call to the field stats API, and is then able to target only the required indices when it creates the request. Before the field stats API was available, it instead used the naming convention of the indices to limit the indices it needed to query. You also have the option of querying indices using an index pattern, e.g. a common prefix, although this naturally will hit all matching indices. Indices with no data in the interval should, however, return quite quickly.
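A sketch of both addressing styles, assuming hypothetical monthly index names like posts-2016.05:

```json
GET /posts-2016.05,posts-2016.06/_search
{
  "query": { "match_all": {} }
}

GET /posts-*/_search
{
  "query": { "match_all": {} }
}
```

The first form targets exactly the indices covering the user's date range; the second hits every index matching the pattern, relying on empty ones returning quickly.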
Another benefit of time-based indices is that you can adapt the number of shards each index has over time, and that way adjust to increasing or decreasing daily volumes. It also makes it very easy and efficient to manage the retention period, as entire indices can be deleted.
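For example, retiring a whole month of data once it falls outside the 10-year retention window is a single index deletion (index name hypothetical):

```json
DELETE /posts-2006.05
```

Compare this with a delete-by-query over a shared index, which has to rewrite Lucene segments document by document.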
Great! This is very helpful. Thank you so much for putting that all together.
Good to learn about time-based indices. I am still somewhat unsure whether that's the best fit for our needs.
Our dashboard needs to display a date_histogram for a year by default. That means if we create an index per day, we are querying 365 indices every time the dashboard is accessed (it happens to be the landing page of our app). I am curious to understand how the aggregation (date_histogram) will perform over that many indices. Are date_histograms designed to perform better with a single index or a handful of indices?
Many thanks for your inputs!
When using time-based indices, each index does not necessarily have to correspond to exactly one day. The time period covered by a time-based index and the number of shards used often depend on the data volumes being indexed. As you have a very long retention period, it may make more sense to use monthly indices than daily ones. This does, however, depend on the amount of data you are indexing each month. Having large numbers of very small indices is inefficient, both with respect to querying and resource utilisation, so you want to make sure that your average shard size is typically between a few GB and a few tens of GB.
Kibana runs date histograms against time-based indices all the time, so they work just fine across many indices.
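A minimal sketch of such a dashboard query, running a date_histogram across all monthly indices via a wildcard pattern (index and field names hypothetical; interval syntax per the 2.x aggregations DSL):

```json
GET /posts-*/_search
{
  "size": 0,
  "aggs": {
    "posts_per_month": {
      "date_histogram": {
        "field": "creation_date",
        "interval": "month"
      }
    }
  }
}
```

Each index computes its buckets locally and the coordinating node merges them, so the aggregation parallelises naturally across indices and shards.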
We're considering the alternative of putting everything in one index. To better understand the trade-off (beyond what you've mentioned above), we would like to confirm that all the documents in that one index will always be scanned regardless of what kind of search query is performed. In other words, there's no other out-of-the-box mechanism available to limit the search effort to a subset of the documents.
Is that correct? Please confirm.
One of the problems with using a single index is that you cannot change the number of shards once the index has been created. This means that you ideally need to know your data volumes up front in order not to end up with too many small shards or too few very large shards. Each query/aggregation is executed across all shards in parallel, but the processing of each shard is single-threaded. Query performance therefore depends on the size as well as the number of shards.
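A sketch of fixing the shard count at index-creation time, which is the decision that cannot be revisited later (index name and values are hypothetical):

```json
PUT /posts-2016.06
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```

With time-based indices this setting can simply be changed for the next period's index if volumes grow or shrink.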