I have a bit of dirty data in my index.
While for all regular documents the 'id' field should be a domain name
(example.com, blog.example.com, ...), there was a bit of bad code which
inserted autogenerated values ('rM8CDN-aTC6lxaPIc858Rg',
'rYmT2kCvR9qoNaudg8_2Wg').
Now I'd like to write a little cleanup script that deletes these bad
documents.
My problem is that I have about 100 million documents, so just iterating
over all of them would take ages.
Something that I'd love to be able to do: Filter for id fields that don't
have a dot in them. Would that need a wildcard query with a trailing and
leading *?
Alternatively I could probably also filter for id fields that are 22
characters long.
My problem is that I haven't figured out how to do either of those two
things.
Hey Marc,
Unfortunately I don't see a way to do this efficiently and/or without
major effort. The only approach that looks workable to me is to write some
custom Lucene code that prunes your index, but that would require making
the index read-only and shutting down ES, which doesn't seem reasonable either.
I'd guess the easiest way is to reindex into another index, do the filtering
on the client side, and not index those docs in the first place.
simon
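
To make the client-side filtering concrete, here is a minimal Python sketch of the check such a reindexing script could apply. It assumes the bad ids all look like the two examples in the original post (22 characters from the URL-safe base64 alphabet, no dot). The names is_bad_id and keep_good_docs are made up for illustration and are not part of any Elasticsearch client.

```python
import re

# Assumption: autogenerated ids look like the two examples in the post,
# i.e. exactly 22 characters from the URL-safe base64 alphabet.
AUTOGENERATED_ID = re.compile(r"[A-Za-z0-9_-]{22}")


def is_bad_id(doc_id: str) -> bool:
    """Return True for ids that look autogenerated rather than like a domain name."""
    # The regular ids in this index are domain names such as example.com,
    # which contain a dot; the autogenerated examples do not.
    return "." not in doc_id and AUTOGENERATED_ID.fullmatch(doc_id) is not None


def keep_good_docs(docs):
    """Yield only the documents that should be copied into the new index."""
    for doc in docs:
        if not is_bad_id(doc["id"]):
            yield doc
```

Checking both the missing dot and the length is deliberately conservative: a document is only skipped when it matches every trait of the known bad ids.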
On Sunday, November 11, 2012 3:15:32 PM UTC+1, Marc Seeger wrote:
[original post snipped; see above]
In that case I'll probably just iterate over all of the IDs using
the "scan" search type with the scroll parameter.
Cheers,
Marc
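
A rough Python sketch of that iteration, with the HTTP side left out: fetch_page is a hypothetical callable standing in for whatever issues the scan/scroll requests and returns one page of ids plus the next scroll id. It is not a real client API. The dot check is the "no dot in the id" idea from the original post, and the batches are meant to be handed to whatever delete call the script ends up using.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical transport: given the previous scroll id (None for the first
# request), return one page of document ids plus the next scroll id.
FetchPage = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def scroll_ids(fetch_page: FetchPage) -> Iterator[str]:
    """Walk every page of the scroll and yield the document ids one by one."""
    scroll_id = None
    while True:
        ids, scroll_id = fetch_page(scroll_id)
        if not ids:
            # In this sketch an empty page signals the end of the scroll.
            return
        yield from ids


def bad_id_batches(ids: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Group ids without a dot into batches, ready for deletion."""
    batch: List[str] = []
    for doc_id in ids:
        if "." not in doc_id:
            batch.append(doc_id)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch
```

Batching keeps the number of delete requests small compared to deleting one document per request, which matters at around 100 million documents.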
On Monday, November 12, 2012 9:38:47 AM UTC+1, simonw wrote:
[reply and original post snipped; see above]