Cross-index join


(plaflamme) #1

Hi,

When querying several indices, is there a way to get "unique" hits? By
"unique" I mean based on entity+id equality.

Say I have several CSV files that may contain data on common entities but
have varying columns and I've indexed them in separate indices in ES (using
the same "entity type" in all indices). The result is several indices where
some contain the same entity+id document. Can I ask ES to search through all
indices and return unique documents (entity+id)? I guess it's similar to an
SQL join where you query several tables and specify how you join them (in
this case, the join would always happen on "_id").

I know putting everything in one index would solve my problem, but for the
sake of management, it's simpler for me to manage several indices (one per
CSV so to speak).

Is this a valid feature for ES? Is there a way to do this already?

Thanks,
Philippe


(Shay Banon) #2

There isn't an option to do that, since shards of different indices exist on different nodes. You can ask for more hits than you actually want and filter on the client side, or use a single index (as you suggested).
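
For illustration, the client-side filtering could look roughly like the sketch below. It assumes each hit is a plain dict carrying the usual `_index`, `_type`, `_id` and `_score` keys, already sorted by score as in a multi-index search response. The function name `unique_hits` and the index names `csv_a` / `csv_b` are hypothetical, not part of any ES API.

```python
from typing import Any, Dict, Iterable, List

Hit = Dict[str, Any]


def unique_hits(hits: Iterable[Hit], size: int) -> List[Hit]:
    """Keep the first hit seen for each (_type, _id) pair, up to `size` hits.

    Hits are assumed to arrive in score order, so the first occurrence
    of a pair is also its best-scoring one.
    """
    unique: Dict[tuple, Hit] = {}
    for hit in hits:
        key = (hit.get("_type"), hit["_id"])
        if key in unique:
            # Same entity+id found in another index: only record where.
            unique[key]["_indices"].append(hit["_index"])
        elif len(unique) < size:
            kept = dict(hit)  # shallow copy, leave the input untouched
            kept["_indices"] = [hit["_index"]]  # hypothetical bookkeeping field
            unique[key] = kept
    # dicts preserve insertion order, so score order is kept
    return list(unique.values())


# Hand-written sample hits standing in for an over-fetched search response.
sample_hits = [
    {"_index": "csv_a", "_type": "person", "_id": "1", "_score": 2.0},
    {"_index": "csv_b", "_type": "person", "_id": "1", "_score": 1.5},
    {"_index": "csv_b", "_type": "person", "_id": "2", "_score": 1.0},
]
result = unique_hits(sample_hits, size=10)
```

How many extra hits to request is a guess: if the same document can appear in every index, asking for the wanted size multiplied by the number of indices is a safe upper bound for the first page.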
On Thursday, April 21, 2011 at 11:53 PM, Philippe Laflamme wrote:

Hi,

When querying several indices, is there a way to get "unique" hits? By "unique" I mean based on entity+id equality.

Say I have several CSV files that may contain data on common entities but have varying columns and I've indexed them in separate indices in ES (using the same "entity type" in all indices). The result is several indices where some contain the same entity+id document. Can I ask ES to search through all indices and return unique documents (entity+id)? I guess it's similar to an SQL join where you query several tables and specify how you join them (in this case, the join would always happen on "_id").

I know putting everything in one index would solve my problem, but for the sake of management, it's simpler for me to manage several indices (one per CSV so to speak).

Is this a valid feature for ES? Is there a way to do this already?

Thanks,
Philippe


(plaflamme) #3

Ok, but is it something that could be implemented in ES directly? I don't
mind writing a plugin for my own purposes, but I'd need to know whether it's
doable.

Seems like "has_child" has a Collector implementation to do its thing. Would
this be the place to start?

Thanks,
Philippe

On Thu, Apr 21, 2011 at 18:52, Shay Banon shay.banon@elasticsearch.com wrote:

There isn't an option to do that, since shards of different indices
exist on different nodes. You can ask for more than you might want, and
filter on the client side, or use a single index (as you suggested).


(Shay Banon) #4

No, you won't be able to do it since it will require talking to other shards over the wire, which will make this impractical.
On Friday, April 22, 2011 at 3:49 PM, Philippe Laflamme wrote:

Ok, but is it something that could be implemented in ES directly? I don't mind writing a plugin for my own purposes, but I'd need to know whether it's doable.

Seems like "has_child" has a Collector implementation to do its thing. Would this be the place to start?

Thanks,
Philippe

