Query for fields that don't contain a certain character

Marc_Seeger_2 · November 11, 2012, 2:15pm

I have a bit of dirty data in my index.
While for all regular documents the 'id' field should be a domain name
(example.com, blog.example.com, ...), there was a bit of bad code which
inserted autogenerated values ('rM8CDN-aTC6lxaPIc858Rg',
'rYmT2kCvR9qoNaudg8_2Wg').

Now I'd like to write a little cleanup script that deletes these bad
documents.
My problem is that I have about 100 million documents, so just iterating
over all of them would take ages.

Something that I'd love to be able to do: Filter for id fields that don't
have a dot in them. Would that need a wildcard query with a trailing and
leading *?
Alternatively I could probably also filter for id fields that are 22
characters long.

My problem is that I haven't figured out how to do either of those two
things.

Any recommendations on how to clean this up?

--

simonw_2 · November 12, 2012, 8:38am

Hey Marc,

Unfortunately I don't see a way to do this efficiently and/or without
major effort. The only reasonable way I can see is to write some custom
Lucene code that prunes your index, but that would require making your
index read-only and shutting down ES, which doesn't seem reasonable either.
I'd guess the easiest way is to reindex into another index, do the work on
the client side, and not index those docs in the first place.

simon
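
For illustration, the client-side check during such a reindex could look roughly like the sketch below. It is not from the thread: the function names are hypothetical, and it assumes the bad IDs are always 22 characters from the URL-safe base64 alphabet (which matches the two examples given) and that each document carries its ID under an 'id' key.

```python
import re
from typing import Iterable, Iterator

# Assumption: autogenerated IDs are exactly 22 characters from the
# URL-safe base64 alphabet, like the two examples in the question.
AUTOGENERATED_ID = re.compile(r"[A-Za-z0-9_-]{22}")


def looks_autogenerated(doc_id: str) -> bool:
    """Return True for IDs that have no dot and match the 22-character pattern."""
    return "." not in doc_id and AUTOGENERATED_ID.fullmatch(doc_id) is not None


def keep_valid(docs: Iterable[dict]) -> Iterator[dict]:
    """Yield only the documents whose 'id' does not look autogenerated."""
    for doc in docs:
        if not looks_autogenerated(doc["id"]):
            yield doc
```

The documents that pass through keep_valid are the ones that would be written to the new index; everything else is simply never indexed again.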

--

Marc_Seeger_2 · November 12, 2012, 11:33am

In that case I'll probably just iterate over all of the IDs using
the "scan" search type with the scroll parameter.

Cheers,
Marc
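
A rough sketch of the delete side of that loop, for anyone finding this later. It assumes the IDs arrive from the scroll as a plain iterable of strings (the fetching itself is left out), uses the simple "no dot in the id" test from the original question, and batches the bad IDs into newline-delimited bodies for the bulk API. The index name "domains" and the function names are made up for the example, and older ES versions also expect a "_type" entry in each delete action.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def bad_ids(ids: Iterable[str]) -> Iterator[str]:
    """Yield the IDs that contain no dot, i.e. that cannot be domain names."""
    return (doc_id for doc_id in ids if "." not in doc_id)


def bulk_delete_bodies(ids: Iterable[str], index: str,
                       batch_size: int = 1000) -> Iterator[str]:
    """Group IDs into newline-delimited bulk request bodies of delete actions."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        lines = [json.dumps({"delete": {"_index": index, "_id": doc_id}})
                 for doc_id in batch]
        # The bulk API expects every action line, including the last,
        # to be terminated by a newline.
        yield "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in for the IDs that the scroll would return.
    scrolled = ["example.com", "rM8CDN-aTC6lxaPIc858Rg", "blog.example.com"]
    for body in bulk_delete_bodies(bad_ids(scrolled), index="domains"):
        print(body, end="")
```

Each yielded body would then be sent to the bulk endpoint in one request, so the 100 million documents are only read once and the deletes go out in batches instead of one at a time.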
