At different random times throughout the day I am going to do a "crawl" of data which I am going to feed into elasticsearch. This bit is working just fine.
However, the index should reflect only what was found in my most recent crawl, and I currently have no way to remove content in the elasticsearch index which was left over from a previous crawl but wasn't found in the new crawl.
From what I can see I have a few options:
A) Delete items based on how old they are. Won't work because index times are random.
B) Delete the entire index and feed it with fresh data. Doesn't seem very efficient and will leave me for a time with an empty or partial index.
C) Do an insert-or-update (upsert) query: if not found, insert; if already in the index, update the timestamp; then do a second pass to delete any items with an older timestamp.
D) Something better.
I would really appreciate any feedback on a logical and efficient way to remove old content in a situation like this.
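For what it's worth, option C is essentially a mark-and-sweep. A minimal pure-Python sketch of the logic, with a plain dict standing in for the Elasticsearch index (in a real setup the first pass would be bulk upserts and the second a delete-by-query on the stamp field; the function and field names here are made up for illustration):

```python
import time

def sweep_index(index, crawled_docs):
    """Mark-and-sweep: upsert every doc from this crawl with the current
    run's stamp, then delete anything still carrying an older stamp.
    `index` is a dict standing in for the Elasticsearch index;
    `crawled_docs` maps doc id -> content."""
    run_stamp = time.time()
    # Pass 1: upsert -- insert new docs, refresh the stamp on existing ones.
    for doc_id, content in crawled_docs.items():
        index[doc_id] = {"content": content, "stamp": run_stamp}
    # Pass 2: delete everything this run did not touch.
    for doc_id in [d for d, v in index.items() if v["stamp"] < run_stamp]:
        del index[doc_id]
    return index

# Usage: "b" was in the old crawl but not the new one, so it is swept away.
index = {"a": {"content": "old", "stamp": 1.0},
         "b": {"content": "stale", "stamp": 1.0}}
sweep_index(index, {"a": "new", "c": "fresh"})
```

One nice property of using a per-run stamp rather than wall-clock document age is that it works no matter how irregular the crawl times are, which sidesteps the problem with option A.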
During upsert, check whether the content hash has changed. If there is no change you can stop processing, and if the content has changed, update both the content field and the content-hash field.
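The hash check above might look something like this. A sketch only, again with a dict standing in for the index, and with made-up field names (`content_hash` is an assumption, not a standard Elasticsearch field):

```python
import hashlib

def upsert_if_changed(index, doc_id, content):
    """Write the doc only when its content hash differs from what is
    stored; otherwise skip the write entirely. Returns True if a write
    happened. `index` is a dict standing in for the Elasticsearch index."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    existing = index.get(doc_id)
    if existing is not None and existing["content_hash"] == new_hash:
        return False  # unchanged: stop processing this doc
    # Changed or new: update both the content and the hash field.
    index[doc_id] = {"content": content, "content_hash": new_hash}
    return True
```

Skipping unchanged documents avoids needless reindexing work on every crawl; only genuinely new or modified content costs a write.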
On Sun, Apr 5, 2015 at 6:14 PM, Employ mail@employ.com wrote: