Delete old content from index

Hi,

At different random times throughout the day I am going to do a "crawl" of data which I am going to feed into elasticsearch. This bit is working just fine.

However, the index should reflect only what was found in my most recent crawl, and I currently have nothing in place to remove content in the elasticsearch index that was left over from the previous crawl but wasn't found in the new one.

From what I can see I have a few options:

A) Delete items based on how old they are. Won't work because index times are random.

B) Delete the entire index and feed it with fresh data. Doesn't seem very efficient and will leave me for a time with an empty or partial index.

C) Do an insert/modify query: if not found, insert; if already in the index, update the timestamp; then do a second pass to delete any items with an older timestamp (see the sketch after this list).

D) Something better.
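
To make C concrete, here is roughly what I had in mind, sketched with the Python client (the index, type, and field names are just placeholders):

from datetime import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch()

def index_crawl(docs):
    # docs: mapping of document ID -> source dict from the latest crawl.
    # Stamp every document touched by this crawl with the same timestamp.
    crawl_time = datetime.utcnow().isoformat()
    for doc_id, source in docs.items():
        source["last_seen"] = crawl_time
        es.index(index="crawl", doc_type="page", id=doc_id, body=source)

    # Second pass: delete everything this crawl did not touch.
    # (delete-by-query is part of the 1.x API.)
    es.delete_by_query(
        index="crawl",
        body={"query": {"range": {"last_seen": {"lt": crawl_time}}}},
    )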

I would really appreciate any feedback on a logical and efficient way to remove old content in a situation like this.

Thank you and happy Easter.

James


Hi,

I understand that you are looking to index incremental data.
In that case, the following approach is the best I can think of:

  1. Make a unique key per document. This key can be the URL or a SHA hash
    of some other field that makes sense as a unique key.
  2. Use the unique key as the doc ID.
  3. Set a field which is the hash of the content field. This hash field
    will change whenever the content changes.
  4. Whenever there is a new insert, do an upsert
    http://www.elastic.co/guide/en/elasticsearch/reference/1.4/docs-update.html#upserts
    on this document.
  5. During the upsert, check whether the content hash has changed. If there
    is no change you can stop there; if the content has changed, update
    both the content field and the content hash field (see the sketch below).
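
A rough sketch of the above with the Python client (field names and the use of SHA-1 are only for illustration; here the changed-content check is done with a get before the upsert, but a scripted update would also work):

import hashlib

from elasticsearch import Elasticsearch

es = Elasticsearch()

def upsert_page(url, content):
    # Steps 1-2: derive the doc ID from the URL.
    doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Step 3: hash of the content field.
    content_hash = hashlib.sha1(content.encode("utf-8")).hexdigest()

    # Step 5: if the stored hash matches, the content has not changed.
    existing = es.get(index="crawl", doc_type="page", id=doc_id, ignore=404)
    if existing.get("found") and existing["_source"].get("content_hash") == content_hash:
        return

    # Step 4: upsert - creates the document if missing, updates it otherwise.
    es.update(
        index="crawl",
        doc_type="page",
        id=doc_id,
        body={
            "doc": {"content": content, "content_hash": content_hash},
            "upsert": {"url": url, "content": content, "content_hash": content_hash},
        },
    )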

Thanks
Vineeth Mohan,
Elasticsearch consultant,
qbox.io (Elasticsearch service provider, http://qbox.io/)

