I'm trying to write some fairly straightforward documentation about Elasticsearch for my team, and I'm very new to it myself. I'm using this query as an example. I'm trying to filter in a few different ways (term, wildcard, range). I want everything in filter context rather than query context, because scoring doesn't matter. The first date range clause is there to take advantage of caching, because this would (theoretically) run every two minutes for monitoring purposes. Any feedback on anything weird I'm doing would be much appreciated. I don't want to lead anyone astray. Thanks!
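For anyone following along, here is a rough sketch of the kind of request described above. It is not the original query from this post: the timestamp and station_id fields, their values, and the ?AIR_TEMP? pattern are made up for illustration, and it assumes an Elasticsearch version where bool accepts a filter clause. It only builds and prints the request body, so it doesn't need a cluster.

```python
import json


def build_monitoring_query(station_id="station-42", pattern="?AIR_TEMP?"):
    """Build a search body in which every clause runs in filter context.

    Hypothetical names: "timestamp" and "station_id" are illustrative only;
    "sensor_type" is the one field name taken from the thread.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    # Coarse window, rounded down to the hour with date math,
                    # so the bound is identical for every run in that hour.
                    {"range": {"timestamp": {"gte": "now-1h/h"}}},
                    # Fine window: its bound moves on every run.
                    {"range": {"timestamp": {"gte": "now-2m"}}},
                    {"term": {"station_id": station_id}},
                    {"wildcard": {"sensor_type": pattern}},
                ]
            }
        }
    }


body = build_monitoring_query()
print(json.dumps(body, indent=2))
```

Because all four clauses sit under filter, none of them contributes to the score. Whether the coarse range actually gets cached depends on the version and cache settings, so treat that part as something to verify on your own cluster.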
The only potentially hairy bit is that wildcard. The leading ? forces the query to try every possible character in that first position, which is essentially a table scan over the terms. It's not as bad as a leading *, which has to consider essentially every term in the field, but it will still be relatively expensive.
Is the leading/trailing ? really needed? How is sensor_type analyzed?
If the leading ? is needed, there's a trick you can use to speed up the query (if it proves too slow). Add a multi-field to the mapping whose analyzer applies a reverse token filter followed by an edge n-gram token filter. This reverses _AIR_TEMP_ to _PMET_RIA_ and indexes it as ["_", "_P", "_PM", "_PME", "_PMET", ...]. Then when you search, include a query against both the forward and the reversed field, which gives you essentially prefix and suffix search.
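As a rough sketch of what that setup might look like, here is an index body written as a Python dict. The names reverse_prefix, reverse_only, reversed_edge and the reversed sub-field are hypothetical, the gram sizes are arbitrary, and the typeless mapping and keyword type assume a recent Elasticsearch version; older releases spell some of these differently, so check it against the docs for your version.

```python
import json

# Hypothetical names throughout; only sensor_type comes from the thread.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Edge n-grams: prefixes of the (already reversed) token.
                # Suffixes longer than max_gram will not be matchable.
                "reversed_edge": {"type": "edge_ngram", "min_gram": 1, "max_gram": 20}
            },
            "analyzer": {
                # Index time: whole value as one token, reversed, then prefixes.
                "reverse_prefix": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["reverse", "reversed_edge"],
                },
                # Search time: only reverse the input, no n-grams.
                "reverse_only": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["reverse"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "sensor_type": {
                "type": "keyword",
                "fields": {
                    "reversed": {
                        "type": "text",
                        "analyzer": "reverse_prefix",
                        "search_analyzer": "reverse_only",
                    }
                },
            }
        }
    },
}

# Suffix search: the search analyzer reverses "TEMP_" to "_PMET",
# which is one of the indexed prefixes of "_PMET_RIA_".
suffix_query = {"query": {"match": {"sensor_type.reversed": "TEMP_"}}}

print(json.dumps(index_body, indent=2))
print(json.dumps(suffix_query))
```

The order of the two token filters matters: reversing first and then taking edge n-grams yields prefixes of the reversed value, which are the suffixes of the original.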
It's faster because the prefix fragments are written directly into the index, rather than being worked out at query time. And because the reversed prefixes are indexed, a suffix search doesn't have to do a full table scan to find matching characters.
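To make the index-time versus query-time difference concrete, here is a toy simulation in plain Python. It is not how Lucene stores terms; it only mimics the tokens the reverse and edge n-gram filters would produce. The sensor values other than _AIR_TEMP_ are made up.

```python
def reversed_edge_ngrams(value, max_gram=20):
    """Mimic reverse + edge n-gram: every prefix of the reversed value."""
    rev = value[::-1]
    return [rev[:n] for n in range(1, min(len(rev), max_gram) + 1)]


def build_index(values):
    """Map each reversed prefix to the set of values that produced it."""
    index = {}
    for v in values:
        for gram in reversed_edge_ngrams(v):
            index.setdefault(gram, set()).add(v)
    return index


def suffix_lookup(index, suffix):
    """Index-time approach: a single exact lookup on the reversed suffix."""
    return index.get(suffix[::-1], set())


def suffix_scan(values, suffix):
    """Query-time approach: test every value, like a leading wildcard."""
    return {v for v in values if v.endswith(suffix)}


values = ["_AIR_TEMP_", "_SOIL_TEMP_", "_AIR_PRESSURE_"]
index = build_index(values)

print(reversed_edge_ngrams("_AIR_TEMP_")[:5])  # ['_', '_P', '_PM', '_PME', '_PMET']
print(sorted(suffix_lookup(index, "TEMP_")))   # ['_AIR_TEMP_', '_SOIL_TEMP_']
```

Both functions return the same matches, but suffix_lookup does one dictionary access while suffix_scan touches every value. The cost moves to index time and to disk space, since each value now produces many tokens.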
Feel free to ignore that tip if performance is fine. It may be something to file away for later when your data volume grows and you need to squeeze a bit more performance out of things.