Q: Deleting documents by _id without index info?

Hi,

This may be a newbie question, but I'm struggling to solve a problem for my
company, in which a MySQL database is used to feed and maintain a cluster of
Elasticsearch servers used for front-end searches.

My problem is: given a set of document IDs from MySQL, I need to remove
those documents from all of our Elasticsearch indices. We typically have
one index per year, and though each document keeps the same ID in
Elasticsearch, I don't know which index it is stored in.

I started out with the Perl module Search::Elasticsearch::Bulk, whose
delete_ids() and add_action() methods both seemed to let me delete a large
number of documents, but both require an index name. I thought I could use
the same trick as with the _search endpoint and pass a wildcard meaning
"all indices", e.g. index => 'published-*', but that failed spectacularly:

InvalidIndexNameException[[pulse-] Invalid index name [published-], must
not contain the following characters [, /, *, ?, ", <, >, |, , ,]]; at
/usr/local/lib/site_perl/Cpan/share/perl/5.14.2/Search/Elasticsearch/Role/Bulk.pm
line 188.

So, I couldn't use a wildcard and I still didn't know the exact index for a
given document ID. I was back to the drawing board again.

My next attempt was to determine the index for each document dynamically,
by doing a search before building up the bulk delete. For this I thought I
could use the _mget endpoint, which seemed similar to _search and thus
would let me look up all the document IDs to learn their indices. But that
didn't work either. Here are the commands I tried running in Sense:

GET /published-*/_mget
{
  "docs" : [
    { "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
    { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
  ]
}

GET /_mget
{
  "docs" : [
    { "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
    { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
  ]
}

The first returned an error for each document, while the second returned
nothing beyond an "index is missing" error.
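(For what it's worth, my understanding is that _mget is a pure document lookup that resolves each doc against one concrete index, so it doesn't expand wildcards the way _search does. If the index were already known per document, each entry in the body would carry an explicit _index. A small sketch of building such a body; the index names here are made up for illustration:)

```python
import json

def build_mget_body(docs):
    """Build an _mget request body from (index, doc_id) pairs.

    Unlike _search, _mget needs a concrete index per document and
    does not expand wildcard patterns such as published-*.
    """
    return {
        "docs": [
            {"_index": index, "_id": doc_id}
            for index, doc_id in docs
        ]
    }

body = build_mget_body([
    ("published-2014", "019001201409294aa579ddb20348cbbf402116c91f6d15_811110"),
    ("published-2013", "0190012014092947e9fe351dc78763dabc45231301e9f9_811110"),
])
print(json.dumps(body, indent=2))
```

Of course, that only helps once the index is known, which is exactly the missing piece here.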

However, when calling the _search endpoint I can use either '/published-*'
or no index at all and still get a sensible result back, e.g.:

GET /_search
{
  "query": {
    "filtered": {
      "query": {
        "term" : { "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
      }
    }
  }
}

This has left me perplexed: Why can I query one document ID from the
_search endpoint and get back the index information but not from _mget?

This situation seems to force me to loop over each document ID, possibly
hundreds of thousands per night, calling the _search endpoint for each ID
to get the index information and then building up the bulk delete.
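(If it does come down to resolving indices via _search, one way to keep hundreds of thousands of nightly IDs manageable would be to batch them, so each round trip resolves a chunk of IDs in a single query rather than one search per ID. A rough sketch; the chunk size is arbitrary:)

```python
def chunked(ids, size):
    """Yield successive fixed-size batches from a list of document IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

ids = [f"doc-{n}" for n in range(10)]
batches = list(chunked(ids, 4))
# Three batches of up to 4 IDs each, instead of 10 separate searches.
```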

Can life really be this difficult?

Are there other mechanisms I can look at that will allow me, for a given
list of document IDs, to delete the associated documents from unspecified
Elasticsearch indices?

I'm sorry if this is a trivial question, but I've spent several days poring
over the Search::Elasticsearch documentation and googling Elasticsearch
examples without finding another way to get the job done.

Best wishes,
Bernt Rostad
Retriever Norge

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/719fd493-9eaa-4f22-8552-3f1c91245786%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Found the answer after another day of trial and error. For those
interested, the clue is the "ids" query in a regular _search request:

GET /published-*/_search
{
  "query" : {
    "ids" : {
      "values" : [
        "019001201409294aa579ddb20348cbbf402116c91f6d15_811110",
        "0190012014092947e9fe351dc78763dabc45231301e9f9_811110"
      ]
    }
  }
}

This allows me to fetch the index information for each of the document IDs :)
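(To sketch the whole round trip in code, untested against a live cluster and with helper names of my own: build the ids query for a batch, send it to /published-*/_search, then turn each hit's _index, _type and _id back into the newline-delimited delete actions the _bulk endpoint expects:)

```python
import json

def ids_query(doc_ids):
    """Search body matching documents by _id across whatever indices
    the request is sent to, e.g. GET /published-*/_search."""
    return {"query": {"ids": {"values": list(doc_ids)}}}

def bulk_delete_body(hits):
    """Turn search hits (each carrying _index, _type, _id) into the
    newline-delimited body expected by the _bulk endpoint."""
    lines = [
        json.dumps({"delete": {
            "_index": hit["_index"],
            "_type": hit["_type"],
            "_id": hit["_id"],
        }})
        for hit in hits
    ]
    return "\n".join(lines) + "\n"

# Fabricated hits, shaped like the ones a live search would return:
hits = [
    {"_index": "published-2014", "_type": "doc", "_id": "a_1"},
    {"_index": "published-2013", "_type": "doc", "_id": "b_2"},
]
body = bulk_delete_body(hits)
```

With Search::Elasticsearch::Bulk, the equivalent would be passing each hit's index along with its ID, which sidesteps the wildcard restriction entirely.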
bernt

On Wednesday, October 15, 2014 3:09:45 PM UTC+2, Bernt Rostad wrote: