I think I'm missing some vital basic information. And it doesn't help that I can't search for e.g. "_uid" and "@timestamp", only "uid" and "timestamp" in both Google and here in Discuss... And searching for "id" and "timestamp" gives overwhelmingly false positives
We have a process that would like to "subscribe" to an Elasticsearch query. That is, periodically poll Elasticsearch and get only the documents that were introduced ( == indexed? ) since the last poll.
Is that even possible? I don't mean just querying everything in [ now() - 5min, now() ], because if timestamps are set by the sender, clocks could easily be out of sync. And multiple documents can arrive in the same millisecond. I'd like to get every new document exactly once, in arrival order.
Will ordering by _uid do what I want? And can I combine that with something like where _id > $lastQueriedID? (Sorry, I'm not yet fluent in the Query DSL.) Does that guarantee sorting by arrival order, or is it just a sort by ASCII(concat(_type, _id)), which is unrelated to arrival order?
And what happens when we move from a single sandbox stand-alone machine to a cluster? Are there queries of this sort that will be deterministic on a single machine but that'll give misleading replies when moving to a cluster in the future?
In order to do what you described, you somehow have to create a field in your application that establishes this order. The problem with the '_id' or '_uid' field is that they are not really usable in range queries, which you probably want for the _id > $lastQueriedID part.
If you control the indexing of the documents, you can introduce your own kind of "autoincrement" id and store it along with your document in its own numeric field. Then you can use this to scope your query to return only the "new" ones, like you already suggested (provided you have a way of storing your lastQueriedID on your application's side).
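As a minimal sketch of that query side, assuming a numeric field named `seq` (a hypothetical name for your autoincrement field) and that your application persists the highest `seq` it has already processed, the request body would be a `range` filter plus an ascending sort:

```python
def build_poll_query(last_seq, page_size=500):
    """Build an Elasticsearch query body that fetches only documents
    whose seq is greater than last_seq, oldest first.
    ("seq" is an assumed field name maintained by your indexer.)"""
    return {
        "size": page_size,
        "query": {
            "range": {"seq": {"gt": last_seq}}
        },
        "sort": [{"seq": "asc"}],
    }

def advance_cursor(last_seq, hits):
    """Advance the application-side cursor from a page of hits,
    where each hit's _source contains the seq field."""
    for hit in hits:
        last_seq = max(last_seq, hit["_source"]["seq"])
    return last_seq
```

You would pass the returned body to a normal `_search` call, then feed the hits into `advance_cursor` and persist the result before the next poll, so a crash between polls can at worst re-deliver (never skip) documents.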
Another way you could do this is to add an additional timestamp field in your document ingestion process (e.g. Logstash or an ingest node) and use that as a monotonically increasing value, in the same way as an auto-increment id for ordering.
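For the ingest node variant, a sketch of such a pipeline, expressed here as the Python dict you would PUT to `_ingest/pipeline/<name>`: the `set` processor with the `{{_ingest.timestamp}}` template value records the time the ingest node processed the document. The field name `received_at` and the pipeline name are assumptions for illustration. Note this only has millisecond-ish resolution, so ties are still possible, as the original question points out.

```python
# Hypothetical ingest pipeline body: stamp each document with the
# time it passed through the ingest node.
arrival_pipeline = {
    "description": "record arrival time on the ingest node",
    "processors": [
        {
            "set": {
                "field": "received_at",
                "value": "{{_ingest.timestamp}}",
            }
        }
    ],
}
```

You would register it once (e.g. `PUT _ingest/pipeline/arrival-stamp`) and then index with `?pipeline=arrival-stamp` so every incoming document gets the field without sender cooperation.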
This all assumes that you have a single point (or process) where documents enter your system, so you can maintain an increasing counter. It should still work when you move from one node to a cluster, as long as you control this on your application's side.
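The counter at that single entry point can be as simple as the sketch below. The `seq` field name is an assumption, and a real deployment would persist the counter (e.g. to disk or a separate store) so it survives restarts; here it only lives in memory:

```python
import itertools

# Application-side auto-increment, assigned at the single point where
# documents enter the system before they are sent to Elasticsearch.
_counter = itertools.count(1)

def stamp(doc):
    """Attach the next sequence number to a document dict
    under the (assumed) field name "seq"."""
    doc["seq"] = next(_counter)
    return doc
```

Because one process hands out the numbers, arrival order and `seq` order coincide by construction, which is exactly the guarantee `_uid` cannot give you.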
I was afraid that was the direction this was heading. Needing a special autoincrement field to track arrival order ourselves is not ideal, especially because it means we can't, for example, run many independent syslog or netflow receivers in a cluster with just a standard Logstash setup.