Hi,
This may be a newbie question but I'm struggling to solve a problem for my
company in which a MySql database is used to feed and maintain a cluster of
Elasticsearch servers used for front-end searches.
My problem is: Given a set of document IDs from MySql, I need to remove
those documents from all of our Elasticsearch indices. We typically have
one index per year and though I know the document ID, which has the same ID
in Elasticsearch, I don't know the index it is stored in.
I started out trying the Perl module Search::Elasticsearch::Bulk which
offered the delete_ids() and add_action() methods, which both seemed to
allow me to delete a large number of documents, but requiring an index
name. I thought I could do the same trick as with the _search endpoint and
use a wildcard to indicate "all indices", e.g. index => 'published-*', but
that failed spectacularly:
InvalidIndexNameException[[pulse-] Invalid index name [published-], must
not contain the following characters [, /, *, ?, ", <, >, |, , ,]]; at
/usr/local/lib/site_perl/Cpan/share/perl/5.14.2/Search/Elasticsearch/Role/Bulk.pm
line 188.
So, I couldn't use a wildcard and I still didn't know the exact index for a
given document ID. I was back to the drawing board again.
My next attempt was to dynamically decide the index for each document, by
doing a search before building up the bulk delete. For this I thought I
could use the _mget endpoint, which seemed similar to _search and thus
would allow me to query all the document IDs to learn their indices. But
that didn't work either. Here I've copied the commands I tried running in
Sense:
GET /published-*/_mget
{
"docs" : [
{ "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
{ "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
]
}
GET /_mget
{
"docs" : [
{ "_id" : "019001201409294aa579ddb20348cbbf402116c91f6d15_811110" },
{ "_id" : "0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
]
}
The first only returned errors with each document while the second didn't
return anything, just issued an "index is missing" error.
However, when calling the _search endpoint I can either use '/published-*'
or no index info at all and still get a sensible result back, e.g.:
GET /_search
{
"query": {
"filtered": {
"query": {
"term" : { "_id" :
"0190012014092947e9fe351dc78763dabc45231301e9f9_811110" }
}
}
}
}
This has left me perplexed: Why can I query one document ID from the
_search endpoint and get back the index information but not from _mget?
This situation seems to force me to loop over each document ID, of possibly
hundreds of thousand per night, calling the _search endpoint for each ID to
get the index information and then build up the bulk delete.
Can life really be this difficult?
Are there other mechanisms I can look at that will allow me, for a given
list of document IDs, to delete the associated documents from unspecified
Elasticsearch indices?
I'm sorry if this was a trivial question but I've spent several days
pouring over the Search::Elasticsearch documentation and googling
Elasticsearch examples without finding other ways to get the job done.
Best wishes,
Bernt Rostad
Retriever Norge
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/719fd493-9eaa-4f22-8552-3f1c91245786%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.