Hi trying to build my first transform - a latest transform. Maybe I misunderstand, but I expect to create and start the transform and it runs continuously - so new documents in the source index are processed and reflected in the destination.
I have created the transform, and when I start it, I see the result index and documents in it for the source index as it existed when the transform started.
New documents in the source index, however are not being processed. When I search the destination index I see no new documents being created and no existing documents updated.
You configured timestamp as your field for synchronization. It is important that this field contains a real, valid timestamp. Ensure the timezones of your client and the Elasticsearch server do not mismatch.
You set delay to 60s. This means a data point can arrive up to 60s late. This compensates for all kinds of ingest delays: transfer time, queuing, data accumulation and, last but not least, the refresh interval of the Lucene index you push the data to. 60s is conservative; if you know you push data faster, or if timestamp is set by an ingest processor or by Logstash, you can lower the value. On the flip side of delay, the transform will not query for data from the last 60s. That means if you push a new record now, you have to wait 60s until the transform reads it.
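As an illustration, the sync settings described above sit in the transform definition roughly like this (the transform name, index names and unique key here are made up, only the timestamp field and delay come from your setup):

```
PUT _transform/my-latest-transform
{
  "source": { "index": "my-source-index" },
  "dest":   { "index": "my-dest-index" },
  "latest": {
    "unique_key": ["host.name"],
    "sort": "timestamp"
  },
  "frequency": "1m",
  "sync": {
    "time": {
      "field": "timestamp",   // the synchronization field
      "delay": "60s"          // documents newer than now-60s are not queried yet
    }
  }
}
```

With this config a document indexed right now becomes visible to the transform only once its timestamp falls behind now minus 60s, at the next checkpoint.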
Why is transform doing this?
The way it works is called checkpointing, and the reason for checkpointing is updating the index in an efficient way. Transform uses query heuristics to keep the amount of data to be rewritten as small as possible; it does not brute-force re-create everything. You can find a description of how it works in the docs.
It tracks the last log message from 1000s of Windows hosts.
It keeps the last message / document based on the host.name field.
It is continuous, runs every 10m, is based on @timestamp with a 60s delay, and keeps the data for 7 days. It works great.
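For reference, a transform like the one described might look roughly like this (the transform and index names are invented; only the settings mentioned above are taken from the description):

```
PUT _transform/latest-host-message
{
  "source": { "index": "windows-logs-*" },
  "dest":   { "index": "latest-host-message" },
  "latest": {
    "unique_key": ["host.name"],   // one document kept per host
    "sort": "@timestamp"           // the newest document wins
  },
  "frequency": "10m",              // check for new data every 10 minutes
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "retention_policy": {
    "time": { "field": "@timestamp", "max_age": "7d" }   // drop entries older than 7 days
  }
}
```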
I'm using the Dev Console and creating my own timestamps in UTC time. I don't have an @timestamp field in my document. Is this from Filebeat (which I have in production)?
How would I change this to compensate for inaccurate timestamps? Would you make the timestamps earlier in the day to update the index, or later in the day? Or would you change the transform settings?
Okay, understood: you're going to have to create reasonable data for the transform to work on, aligned with how your transform runs... which sounds like what you're trying to do...
(This may be harder than real data... )
So I would do something like this
Get rid of the query part of your transform (in fact use mine and just set the frequency to 1m to test)
Create a mapping and some docs.
Put in 3 or 4 docs with a recent timestamp like below (that is UTC)
Start the transform in continuous mode..
Then wait 3 mins or so then post those same docs with an updated timestamp and message...
The transform should pick them up .. and you should only have the latest docs.
ALSO you need to create the mapping for the transform destination index.
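A minimal test setup along those lines might look like this in the Dev Console (the index names, field names and values are invented for illustration; adjust them to your own transform):

```
PUT my-test-source
{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "host":       { "properties": { "name": { "type": "keyword" } } },
      "message":    { "type": "text" }
    }
  }
}

// create the destination index with the same mapping
// before starting the transform

POST my-test-source/_doc
{
  "@timestamp": "2023-05-01T14:00:00.000Z",   // a recent UTC timestamp
  "host": { "name": "host-a" },
  "message": "first message from host-a"
}
```

After waiting a few minutes, re-post the same doc with a newer @timestamp and a changed message; the destination should then contain only the latest document per host.name.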
It shows how to use a pipeline that adds an ingest timestamp to every document you push. That's the most accurate way you can use, and it allows you to lower the value for delay. Note that by default Lucene uses a refresh interval of 1s, so you should not set delay to a lower value than that; use e.g. 2s. Of course you can tweak this further, but I guess that should already meet the recency requirements of most users.
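A pipeline like the one described can be sketched as follows; the pipeline name, the target field event.ingested and the index name are just examples:

```
PUT _ingest/pipeline/add-ingest-timestamp
{
  "description": "Set an ingest timestamp on every document",
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "value": "{{_ingest.timestamp}}"   // the time the node processed the doc
      }
    }
  ]
}

// index through the pipeline
POST my-source-index/_doc?pipeline=add-ingest-timestamp
{
  "message": "hello"
}
```

The transform's sync.time.field would then point at event.ingested, and delay can be lowered to something like 2s, since the ingest timestamp is set by the cluster itself rather than by the client.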