Merging multi-search results from two Elasticsearch indexes

Hi,

I have been using ES for various scenarios and use cases, some of which
even include text mining. In almost all of these scenarios there is some
need for a data join or for merging responses.

In my current requirement I am processing the data through multiple
processing pipelines. For example: Index 1 has an email id; processing
pipeline 1 takes the email id and puts data into Index 2. Its structure is
somewhat like this:
{
"email":"abc@abc.com",
"isProcessedByPipeLine1":"false"
}

I know it's not advisable to update the index, but these processes are
running continuously, so I have to mark these records as processed. To do
this I am not updating a single field, i.e. "isProcessedByPipeLine1", but
overwriting the whole document with:
{
"email":"abc@abc.com",
"isProcessedByPipeLine1":"true"
}

I am keeping separate indexes, and not just overwriting data in the same
index, because I have multiple data processing pipelines running
continuously and in parallel.
I also want to export the data to CSV or another structured format. While
exporting, I want to merge the responses from multiple indexes on email id.

Here is what I am planning to do:

  1. A REST interface which accepts an array of indexes (all the indexes
    whose results need to be merged) and the field on which the join
    condition is to be performed
  2. Get the data from one index, extract the join field, and then search
    the remaining indexes
  3. Merge the JSON responses from all searches and deliver the result

I don't think this is a good/optimized solution, because it is not just a
simple search; I am trying to dump the whole data set from all the indexes.
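To make step 3 concrete, here is a minimal client-side sketch of the merge (function and variable names are hypothetical, not from any ES client library): given the `_source` documents already fetched from each index, group them by the join field and emit one merged record per key.

```python
from collections import defaultdict

def merge_by_field(responses, join_field):
    """Merge hit lists from several indexes into one record per join key.

    `responses` maps an index name to a list of `_source` dicts; the result
    maps each join-field value to a single dict of merged fields.
    """
    merged = defaultdict(dict)
    for index_name, hits in responses.items():
        for source in hits:
            key = source.get(join_field)
            if key is None:
                continue  # skip documents that lack the join field
            merged[key].update(source)
    return dict(merged)

# Two indexes sharing the "email" field, as in the pipeline example above
responses = {
    "index1": [{"email": "abc@abc.com", "name": "abc"}],
    "index2": [{"email": "abc@abc.com", "isProcessedByPipeLine1": "true"}],
}
rows = merge_by_field(responses, "email")
# rows["abc@abc.com"] now carries fields from both indexes
```

Note that on a field-name collision the last index wins; a real export utility would probably prefix fields with the index name. Each merged record can then be written out as one CSV row.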

Any suggestions about overall approach or about data merge will be really
helpful.

Thanks and regards
Jagdeep

--

With your strategy, your required search time will grow more than linearly.
If you don't have strict upper limits on the number of involved indexes and
fields and join ops, it will not scale.

Currently, there is a static doc ID based merge, called multi get, with a
strict upper limit. If your resources are organized by ID, you can perform
multi get like this:

curl -XDELETE 'localhost:9200/test'

curl -XPUT 'localhost:9200/test/data/1' -d '
{
"sentence" : "Hello World"
}
'

curl -XPUT 'localhost:9200/test/control/1' -d '
{
"done" : true
}
'

curl -XGET 'localhost:9200/test/_refresh'

echo
echo "mget"
curl -XPOST 'localhost:9200/test/_mget?pretty' -d '
{
"docs" : [
{ "_type" : "data", "_id" : "1"},
{ "_type" : "control", "_id" : "1"}
]
}
'

For more generic merging on arbitrary fields, I am working on graph-based
merging.

This graph-based approach should work by indexing graphs with resource IDs
(IRIs) known from the W3C Resource Description Framework.

In JSON-LD, you can define IRI-based IDs and graphs in JSON. In
Elasticsearch, graphs can be represented by document sets. It is up to you
how to organize index/type coordinates for your document sets.

A graph contains resources that are identifiable by IDs. Such IDs could
also serve as Elasticsearch doc IDs and could be indexed in Elasticsearch
beforehand.

By defining so-called graph layouts or "frames", JSON-LD allows a
rule-based expansion of graphs in JSON documents. In Elasticsearch, such a
framing algorithm could be performed at index time.

For example, indexing a library / book / chapter hierarchy, you first index
each entity for itself, and then you index a hierarchy document containing
only the IDs as the graph layout. See also

https://github.com/gkellogg/json-ld/#frame-a-document
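A rough sketch of that idea in code (the entity names and the `frame` helper are invented for illustration, not actual JSON-LD tooling): each entity is indexed as its own document, the hierarchy document carries only IDs, and framing splices the entities back in.

```python
# Each entity indexed for itself, keyed by an IRI-style ID.
entities = {
    "lib:1":  {"@id": "lib:1", "name": "City Library"},
    "book:1": {"@id": "book:1", "title": "Hello World"},
    "ch:1":   {"@id": "ch:1", "heading": "Intro"},
}

# The hierarchy document: the graph layout, containing only IDs.
layout = {
    "@id": "lib:1",
    "books": [{"@id": "book:1", "chapters": [{"@id": "ch:1"}]}],
}

def frame(node):
    """Expand a layout node by splicing in the referenced entity fields."""
    doc = dict(entities.get(node["@id"], {}))
    for field, children in node.items():
        if field != "@id":
            doc[field] = [frame(child) for child in children]
    return doc

framed = frame(layout)
```

Running the framing at index time means the fully expanded document is what gets searched, so no per-query merge is needed.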

If you lay out your JSON-LD docs in the Elasticsearch index already, and
not later at search time, the search stays scalable: no additional merge
queries and no additional joins, which would otherwise challenge you very
hard once you need to cope with a growing number of searches. Don't think
adding more ES nodes or more CPU/mem will help then, because it's the
algorithm that will get out of control.

Best regards,

Jörg

--

Hey Jörg, thanks a lot for your response. It's as helpful as many of your
ES plugins :slight_smile:

The solution of using multi get, with different type mappings and the same
ID for different docs, is interesting. However, it will only work when the
two types of documents have a 1:1 mapping between them. I am specifically
looking for a many:1 mapping here, so assigning the same ID to differently
typed docs does not help me.

Generic merging on arbitrary fields, as you described, is what I am looking
for. In the absence of pre-defined graph layouts/frames, what options do I
have? Please note that my current solution is needed for a data export
utility, not real-time search, so a minor delay is acceptable.

I am thinking of doing a scroll on Index1, extracting the required field
from each returned batch, and doing a terms query on Index2. I am thinking
of implementing the logic as a generic plugin. Any idea on how to keep a
scroll open on Index2? The search terms will keep changing; is there any
way to make a scroll with a match_all query and get the data corresponding
to the search terms from it?
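The scroll-batch-to-terms-query step might be sketched like this (field names and document shapes are placeholders); each generated body would then be POSTed to Index2's `_search` endpoint:

```python
def batch_to_terms_query(batch_hits, join_field):
    """Build a terms query on Index2 from one scroll batch of Index1 hits."""
    values = sorted({hit["_source"][join_field]
                     for hit in batch_hits
                     if join_field in hit["_source"]})
    return {"query": {"terms": {join_field: values}}}

# One scroll batch from Index1
batch = [
    {"_source": {"email": "abc@abc.com"}},
    {"_source": {"email": "xyz@xyz.com"}},
    {"_source": {"email": "abc@abc.com"}},  # duplicate collapses to one term
]
body = batch_to_terms_query(batch, "email")
```

This avoids keeping a scroll open on Index2 at all: Index2 is hit with a fresh terms query per batch, which sidesteps the changing-search-terms problem.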

Also, any hint on whether the parent/child doc concept could be helpful in
such a scenario? Index1 is the primary index; Index2 is a secondary index
on the unique values of some field of Index1. Index1:Index2 = many:1.

Thanks and Regards
Jagdeep

On Monday, December 24, 2012 6:14:50 PM UTC+5:30, Jörg Prante wrote:


--

Yes, of course you can cascade queries: the first one is a key lookup, and
the second one is a multi get query, or just another query. This is
application dependent, no special ES solution. Merging docs happens only on
the client side. Except for multi get, there is no server-side merge in ES.

You can also scroll. But be aware that the scroll approach is costly if you
scan through a result set with only a few hits.
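As a sketch of that cascade (pure client-side; the helper name is hypothetical): the first query returns the keys, which are turned into the body of the second, multi get, request, and merging the returned docs stays with the client.

```python
def build_mget_body(ids, doc_type):
    """Second step of the cascade: an _mget body from the looked-up keys."""
    return {"docs": [{"_type": doc_type, "_id": doc_id} for doc_id in ids]}

# Suppose the first (key lookup) query returned these IDs:
keys = ["1", "2"]
body = build_mget_body(keys, "control")
# body is ready to POST to /index/_mget
```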

There is no parent/child doc merge support on the ES side, and there is no
parent/child "join".

Parent/child is rather limited to modeling data in a strict hierarchy. I
found that a use case suited to parent/child should have the following
characteristics:

  • you have a number of "main documents" and want to assign "sub
    documents" to them later. It's a 1:N relationship; a child can have at
    most one parent doc

  • addressing: you want to retrieve the parent doc even when searching in
    children-only fields

  • you may need to run queries against parents and children independently

  • it is convenient for a child document to have only the few fields that
    carry a specific value for this child

  • the parent should carry all child fields which are not child-specific
    (for easy doc construction)

  • you mainly want to process children when the parent is involved
    ("top_children", for example), not as standalone documents
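The 1:N shape in the first point can be pictured client-side: each child carries at most one parent reference, so grouping children under parents is a plain dictionary (document shapes invented for illustration).

```python
from collections import defaultdict

# Each child references exactly one parent; a parent can have many children.
children = [
    {"_id": "c1", "_parent": "p1"},
    {"_id": "c2", "_parent": "p1"},
    {"_id": "c3", "_parent": "p2"},
]

by_parent = defaultdict(list)
for child in children:
    by_parent[child["_parent"]].append(child["_id"])
```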

If you want a generic parent/child doc solution, such as a tree, there is
overhead from the additional queries. But because ES keeps child docs in
the same shard as the parent, this overhead is kept as small as possible.
The standard ES operations on parent/child hide the complexity of the
queries from the user.

Jörg

--