Delete old content from index

Hi,

At different random times throughout the day I am going to do a "crawl" of data which I am going to feed into elasticsearch. This bit is working just fine.

However, the index should reflect only what was found in my most recent crawl, and I currently have nothing in place to remove content in the elasticsearch index that was left over from the previous crawl but wasn't found in the new one.

From what I can see I have a few options:

A) Delete items based on how old they are. Won't work because index times are random.

B) Delete the entire index and feed it with fresh data. Doesn't seem very efficient and will leave me for a time with an empty or partial index.

C) Do an insert/modify query: if not found, insert; if already in the index, update the timestamp; then do a second pass to delete any items with an older timestamp (see the sketch after this list).

D) Something better.
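
To make C concrete, here is roughly what I had in mind, sketched with the Python client (the index, type, and field names are just placeholders):

from datetime import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch()

def index_crawl(docs):
    # docs: mapping of document ID -> source dict from the latest crawl.
    # Stamp every document touched by this crawl with the same timestamp.
    crawl_time = datetime.utcnow().isoformat()
    for doc_id, source in docs.items():
        source["last_seen"] = crawl_time
        es.index(index="crawl", doc_type="page", id=doc_id, body=source)

    # Second pass: delete everything this crawl did not touch.
    # (delete-by-query is part of the 1.x API.)
    es.delete_by_query(
        index="crawl",
        body={"query": {"range": {"last_seen": {"lt": crawl_time}}}},
    )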

I would really appreciate any feedback on a logical and efficient way to remove old content in a situation like this.

Thank you and happy Easter.

James


Hi,

I understand that you are looking to index incremental data.
In that case, the following approach is the best I can think of:

  1. Make a unique key per document. This key can be the URL or a SHA hash
    of some other field that makes sense as a unique key.
  2. Use the unique key as the doc ID.
  3. Set a field which is the hash of the content field. This hash field
    will change whenever the content changes.
  4. Whenever there is a new insert, do an upsert
    http://www.elastic.co/guide/en/elasticsearch/reference/1.4/docs-update.html#upserts
    on this document.
  5. During the upsert, check whether the content hash has changed. If there
    is no change you can stop there; if the content has changed, update
    both the content field and the content hash field (see the sketch below).
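
A rough sketch of the above with the Python client (field names and the use of SHA-1 are only for illustration; here the changed-content check is done with a get before the upsert, but a scripted update would also work):

import hashlib

from elasticsearch import Elasticsearch

es = Elasticsearch()

def upsert_page(url, content):
    # Steps 1-2: derive the doc ID from the URL.
    doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Step 3: hash of the content field.
    content_hash = hashlib.sha1(content.encode("utf-8")).hexdigest()

    # Step 5: if the stored hash matches, the content has not changed.
    existing = es.get(index="crawl", doc_type="page", id=doc_id, ignore=404)
    if existing.get("found") and existing["_source"].get("content_hash") == content_hash:
        return

    # Step 4: upsert - creates the document if missing, updates it otherwise.
    es.update(
        index="crawl",
        doc_type="page",
        id=doc_id,
        body={
            "doc": {"content": content, "content_hash": content_hash},
            "upsert": {"url": url, "content": content, "content_hash": content_hash},
        },
    )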

Thanks
Vineeth Mohan,
Elasticsearch consultant,
qbox.io (Elasticsearch service provider, http://qbox.io/)

