I think I'm missing some vital basic information. And it doesn't help that I can't search for e.g. "_uid" and "@timestamp", only "uid" and "timestamp" in both Google and here in Discuss... And searching for "id" and "timestamp" gives overwhelmingly false positives
We have a process that would like to "subscribe" to an Elasticsearch query. That is, periodically poll Elasticsearch and get only the documents that were introduced ( == indexed? ) since the last poll.
Is that even possible? I don't mean just querying everything in [ now() - 5min, now() ], because if timestamps are set by the sender, clocks could easily be out of sync. And multiple documents can arrive in the same millisecond. I'd like to get every new document exactly once, in arrival order.
Will ordering by _uid do what I want? And can I combine that with something like where _id > $lastQueriedID? (Sorry, I'm not yet fluent in the Query DSL.) Does that guarantee sorting by arrival order, or is it just a sort by ASCII(concat(_type, _id)), which is unrelated to arrival order?
And what happens when we move from a single sandbox stand-alone machine to a cluster? Are there queries of this sort that will be deterministic on a single machine but that'll give misleading replies when moving to a cluster in the future?
In order to do what you described, you somehow have to create a field in your application that establishes this order. The problem with the '_id' or '_uid' field is that they are not really usable in range queries, which you probably want for the _id > $lastQueriedID part.
If you control the indexing of the documents, you can introduce your own kind of "autoincrement" id and store it along with your document in its own numeric field. Then you can use this to scope your query to return only the "new" ones, like you already suggested (provided you have a way of storing your lastQueriedID on your application's side).
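As a minimal sketch of that query side, assuming a numeric field named `seq` (a hypothetical name for your autoincrement field) and that your application persists the highest `seq` it has already processed, the request body would be a `range` filter plus an ascending sort:

```python
def build_poll_query(last_seq, page_size=500):
    """Build an Elasticsearch query body that fetches only documents
    whose seq is greater than last_seq, oldest first.
    ("seq" is an assumed field name maintained by your indexer.)"""
    return {
        "size": page_size,
        "query": {
            "range": {"seq": {"gt": last_seq}}
        },
        "sort": [{"seq": "asc"}],
    }

def advance_cursor(last_seq, hits):
    """Advance the application-side cursor from a page of hits,
    where each hit's _source contains the seq field."""
    for hit in hits:
        last_seq = max(last_seq, hit["_source"]["seq"])
    return last_seq
```

You would pass the returned body to a normal `_search` call, then feed the hits into `advance_cursor` and persist the result before the next poll, so a crash between polls can at worst re-deliver (never skip) documents.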
Another way you could do this is to add an additional timestamp field in your document ingestion process (e.g. Logstash or an ingest node) and use that as a monotonically increasing value, in the same way as an auto-increment id for ordering.
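For the ingest node variant, a sketch of such a pipeline, expressed here as the Python dict you would PUT to `_ingest/pipeline/<name>`: the `set` processor with the `{{_ingest.timestamp}}` template value records the time the ingest node processed the document. The field name `received_at` and the pipeline name are assumptions for illustration. Note this only has millisecond-ish resolution, so ties are still possible, as the original question points out.

```python
# Hypothetical ingest pipeline body: stamp each document with the
# time it passed through the ingest node.
arrival_pipeline = {
    "description": "record arrival time on the ingest node",
    "processors": [
        {
            "set": {
                "field": "received_at",
                "value": "{{_ingest.timestamp}}",
            }
        }
    ],
}
```

You would register it once (e.g. `PUT _ingest/pipeline/arrival-stamp`) and then index with `?pipeline=arrival-stamp` so every incoming document gets the field without sender cooperation.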
This all assumes that you have a single point (or process) where documents enter your system, so you can maintain an increasing counter. It should still work when you move from one node to a cluster, as long as you control this on your application's side.
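The counter at that single entry point can be as simple as the sketch below. The `seq` field name is an assumption, and a real deployment would persist the counter (e.g. to disk or a separate store) so it survives restarts; here it only lives in memory:

```python
import itertools

# Application-side auto-increment, assigned at the single point where
# documents enter the system before they are sent to Elasticsearch.
_counter = itertools.count(1)

def stamp(doc):
    """Attach the next sequence number to a document dict
    under the (assumed) field name "seq"."""
    doc["seq"] = next(_counter)
    return doc
```

Because one process hands out the numbers, arrival order and `seq` order coincide by construction, which is exactly the guarantee `_uid` cannot give you.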
I was afraid that was the direction this was heading. Needing a special autoincrement field to track arrival order ourselves is not ideal, especially because it means we can't, for example, run many independent syslog or netflow receivers in a cluster with just a standard Logstash setup.