Query for fields that don't contain a certain character

I have a bit of dirty data in my index.
While for all regular documents, the 'id' field should be a domain name
(example.com, blog.example.com, ...), there was a bit of bad code which
inserted autogenerated values ('rM8CDN-aTC6lxaPIc858Rg',
'rYmT2kCvR9qoNaudg8_2Wg').

Now I'd like to write a little cleanup script that deletes these bad
documents.
My problem is that I have about 100 million documents, so just iterating
over all of them would take ages.

Something that I'd love to be able to do: Filter for id fields that don't
have a dot in them. Would that need a wildcard query with a trailing and
leading *?
Alternatively I could probably also filter for id fields that are 22
characters long.

My problem is that I haven't figured out how to do either of those two
things.

Any recommendations on how to clean this up?
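
For reference, the "no dot" filter asked about above could be written roughly
as a bool query that excludes everything matching the wildcard pattern *.* on
the field. This is only a sketch: it assumes 'id' is indexed as a single
un-analyzed term, the helper name no_dot_query is made up for illustration,
and it has not been checked against any particular Elasticsearch version. A
leading wildcard has to walk every term in the field, so on 100 million
documents it may well be slow.

```python
def no_dot_query(field="id"):
    """Build a search body matching documents whose `field` contains no '.'.

    Assumption: `field` holds the whole id as one un-analyzed term.
    In a wildcard pattern only '*' and '?' are special, so '.' is literal.
    """
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                # Exclude every id that has a dot anywhere in it.
                "must_not": {"wildcard": {field: "*.*"}},
            }
        }
    }
```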

--

Hey Marc,

Unfortunately I don't see a way to do this efficiently and/or without
major effort. The only reasonable way I can see is to write some custom
Lucene code that prunes your index, but that would require making your
index read-only and shutting down ES, which doesn't seem reasonable either.
I'd guess the easiest way is to reindex into another index, do the
filtering on the client side, and not index those docs in the first place.
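
The client-side part of such a reindex could look roughly like the sketch
below. It assumes each document arrives as a dict with an "id" key (a
hypothetical shape) and uses the "real ids contain a dot" heuristic from the
original question. Reading from the old index and writing to the new one is
left out on purpose.

```python
def looks_like_domain(doc_id):
    """Heuristic from the thread: real ids are domain names, so they contain a dot."""
    return "." in doc_id


def docs_to_reindex(docs):
    """Yield only the documents that should be copied into the new index.

    `docs` is any iterable of dicts with an "id" key (assumed shape).
    """
    for doc in docs:
        if looks_like_domain(doc["id"]):
            yield doc
```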

simon

On Sunday, November 11, 2012 3:15:32 PM UTC+1, Marc Seeger wrote:

--

In that case I'll probably just iterate over all of the IDs using
the "scan" search type with the scroll parameter.
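
A rough sketch of the filtering side of that loop, assuming the scroll hands
back plain id strings: it treats an id as autogenerated if it is exactly 22
characters from the URL-safe base64 alphabet (which matches the two examples
above, but is still a guess about the generator), and groups the bad ids into
batches for deletion. The names is_autogenerated and delete_batches are
hypothetical, and the actual scan/scroll and delete calls are not shown.

```python
import re

# Assumption: bad ids are exactly 22 chars of [A-Za-z0-9_-], with no dots.
AUTO_ID = re.compile(r"[A-Za-z0-9_-]{22}")


def is_autogenerated(doc_id):
    """True if the whole id looks like one of the autogenerated values."""
    return AUTO_ID.fullmatch(doc_id) is not None


def delete_batches(ids, batch_size=1000):
    """Group autogenerated ids from `ids` into lists of up to `batch_size`."""
    batch = []
    for doc_id in ids:
        if is_autogenerated(doc_id):
            batch.append(doc_id)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch
```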

Cheers,
Marc

On Monday, November 12, 2012 9:38:47 AM UTC+1, simonw wrote:

--