Deleting Documents


(Eugene Strokin) #1

Hello,
I have indexed about 1M documents. My backup ZIP file of the index was
about 1 GB.
After a while I needed to delete about 20% of the documents. The process
went fine, and I don't see the deleted documents in search results.
But the backup file is still almost the same size (actually even a little
bit bigger),
which makes me think that the deleted documents are not actually removed
but are somehow marked as deleted.
Am I right? And if so, is it possible to somehow clean the deleted documents
out of the index without a complete reindex? I expect extensive rotation of
the indexed documents, and I need to know whether I have to reindex from
time to time to keep the index at a reasonable size.

Thank you,
Eugene


(Clinton Gormley) #2

Hiya

> I have indexed about 1M documents. My backup ZIP file of the index was
> about 1 GB.
> After a while I needed to delete about 20% of the documents. The
> process went fine, and I don't see the deleted documents in search
> results.
> But the backup file is still almost the same size (actually even a
> little bit bigger),
> which makes me think that the deleted documents are not actually
> removed but are somehow marked as deleted.

Correct

> Am I right? And if so, is it possible to somehow clean the deleted
> documents out of the index without a complete reindex? I expect
> extensive rotation of the indexed documents, and I need to know
> whether I have to reindex from time to time to keep the index at a
> reasonable size.

This will happen automatically. As you index, ES creates new 'segments'.
These segments gradually get merged into new bigger segments (and the
deleted docs in the old segments are removed). So you don't need to
worry about it.
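
To make the "marked, then merged away" behaviour concrete, here is a toy model in Python. It is only a sketch of the idea, not how ES or Lucene actually store segments: the Segment class and the delete and merge helpers are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for an immutable index segment (hypothetical model)."""
    docs: dict                                  # doc_id -> document body
    deleted: set = field(default_factory=set)   # ids marked as deleted

    def live_docs(self):
        # Only docs that have not been marked as deleted are searchable.
        return {i: d for i, d in self.docs.items() if i not in self.deleted}

    def stored_count(self):
        # Deleted docs still take up space until the segment is merged away.
        return len(self.docs)


def delete(segments, doc_id):
    """Mark a document as deleted; nothing is physically removed."""
    for seg in segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)


def merge(segments):
    """Copy only the live docs into one new segment, dropping deleted ones."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return [Segment(docs=merged)]


segments = [
    Segment(docs={1: "a", 2: "b", 3: "c"}),
    Segment(docs={4: "d", 5: "e"}),
]
delete(segments, 2)
delete(segments, 5)

stored_before = sum(s.stored_count() for s in segments)  # deletes only marked
segments = merge(segments)
stored_after = sum(s.stored_count() for s in segments)   # deletes purged
```

In this model the stored count only drops once merge runs, which matches the observation that the backup did not shrink right after the deletes.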

If you want to force it, you can use the 'optimize' API, but generally I
wouldn't worry about it.
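
As a rough sketch, a call to the 'optimize' API could be built like this in Python. The host, port and index name are placeholders, and build_optimize_request is a hypothetical helper. The _optimize endpoint and its only_expunge_deletes parameter are what I would expect for ES versions of this era; as far as I know, later releases renamed the endpoint to _forcemerge, so check the docs for your version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_optimize_request(base_url, index, only_expunge_deletes=True):
    """Build (but do not send) a POST to the index's _optimize endpoint.

    base_url and index are placeholders for your own cluster and index.
    """
    params = {"only_expunge_deletes": str(only_expunge_deletes).lower()}
    url = f"{base_url.rstrip('/')}/{index}/_optimize?{urlencode(params)}"
    return Request(url, method="POST")


# Assumed local node and a made-up index name.
req = build_optimize_request("http://localhost:9200", "my_index")

# To actually send it against a running node:
#   from urllib.request import urlopen
#   urlopen(req)
```

With only_expunge_deletes set, the request asks ES to merge only the segments that contain deleted docs rather than merging everything.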

Also, depending on your data, you may want to consider other ways of
organising your index. For instance, if you have time based data (think
log messages), then you can have an index per day or month or whatever,
and then just delete the old indices which are no longer required. This
is more efficient than deleting individual docs.
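
A minimal sketch of the index-per-day idea in Python, assuming a made-up naming scheme of "logs-YYYY.MM.DD". The prefix, the date format and both helper functions are assumptions for illustration only.

```python
from datetime import date, datetime, timedelta


def index_name(day, prefix="logs-"):
    """Name of the daily index that documents for `day` would go into."""
    return f"{prefix}{day:%Y.%m.%d}"


def expired_indices(names, today, keep_days, prefix="logs-"):
    """Return the daily indices older than `keep_days` days, sorted by name."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # not one of our daily indices
        # Assumes every prefixed name ends in a valid YYYY.MM.DD date.
        day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


today = date(2011, 6, 10)
names = [
    index_name(date(2011, 6, 1)),
    index_name(date(2011, 6, 8)),
    index_name(today),
    "other-index",
]
to_delete = expired_indices(names, today, keep_days=7)
```

Each name in to_delete can then be dropped with a single DELETE request on that index, which removes the whole index at once instead of marking documents one by one.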

clint


