Hi,
I am at a point where some of my search queries are close to 1 MB, for
example queries where the "terms" filter has around 50k ids. I assume this
adds a lot of network overhead when transferring such big queries to ES. I
am wondering if there is a way I can send a "macro" in a query which is
later translated by ES.
Let me describe how it would work. Query sent to ES:
filter: { terms : { ids: [macro_label_list_of_ids] }}
where "macro_label_list_of_ids" is just a label which uniquely identifies a
list of integer ids. Upon receiving this query, ES does a lookup for
"macro_label_list_of_ids" on another index (created by us) to extract the
id values, and then substitutes them wherever "macro_label_list_of_ids"
appears in the query.
So, can you tell me three things:
1. Is this possible in the current version of ES?
2. If not, are there any plans to implement it?
3. Is there any other workaround to shorten my queries? (I can't change
the way I am storing data because of business requirements.)
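To make the idea concrete, here is a rough client-side sketch of the substitution I have in mind. The dict stands in for the lookup index, and all names are made up; ES itself has no such mechanism:

```python
# Stand-in for the lookup index: maps a macro label to its id list.
MACRO_TABLE = {"macro_label_list_of_ids": [101, 102, 103]}

def expand_macros(query, table):
    """Recursively walk a query body and splice id lists in place of
    any macro label found inside a list."""
    if isinstance(query, dict):
        return {k: expand_macros(v, table) for k, v in query.items()}
    if isinstance(query, list):
        out = []
        for item in query:
            if isinstance(item, str) and item in table:
                out.extend(table[item])  # replace the label with its ids
            else:
                out.append(expand_macros(item, table))
        return out
    return query  # scalars pass through unchanged

query = {"filter": {"terms": {"ids": ["macro_label_list_of_ids"]}}}
expanded = expand_macros(query, MACRO_TABLE)
# expanded now carries the full id list in place of the label
```

The point of the feature request is that this expansion would happen on the ES side, so the label, not the 50k ids, crosses the network.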
On Friday, November 2, 2012 10:01:20 AM UTC-7, Igor Motov wrote:
If you have one set of ids per query, and it is applied as a filter to the
entire query, you can create a filtered alias for each id list. Besides
saving on network traffic, you will also save on parsing this huge list of
ids for every request.
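For illustration, a filtered alias might be set up like this (index and alias names are invented; this builds the request body for the `_aliases` endpoint of that era). The id list is sent once, when the alias is created or updated; searches then target the alias with no id list in the query at all:

```python
import json

# Request body for POST /_aliases: one alias per id list, with the
# terms filter baked in. Names "docs" and "idset_abc" are hypothetical.
alias_request = {
    "actions": [
        {
            "add": {
                "index": "docs",       # hypothetical source index
                "alias": "idset_abc",  # one alias per id list
                "filter": {"terms": {"ids": [101, 102, 103]}},
            }
        }
    ]
}
body = json.dumps(alias_request)
# A search against /idset_abc/_search is now implicitly filtered
# to those ids, so the query itself stays tiny.
```

When the batch of ids changes, the same endpoint accepts a `remove` plus `add` pair to swap the filter atomically.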
Hi,
I'm not sure I understand it completely, but 1 MB queries just because of
ID lists scare me.
Your challenge of ID substitution by value extraction looks a lot like
denormalization of data structures (an ID is replaced by the resource the
ID stands for).
Since automatic denormalization needs some mechanics that are currently
missing in ES, you will have to work around some issues. Filtered aliases
or index views are one alternative; another is maintaining ID resolution on
the client side (so the client sends other related values to ES instead of
IDs).
Yes, I am working on a denormalization implementation, a mapper plugin
called elasticsearch-mapper-iri. My plan is to expand documents
automatically if an IRI (Internationalized Resource Identifier) is detected
in a source document, by looking up the resource the IRI points to. Even an
ES GetRequest in the background will be possible, besides fetching non-ES
JSON over HTTP.
Yes, think of a better data model at index time, not at search time.
Your business requirements seem valid in a relational data model, where IDs
play a different role than in an inverted index. An ES data model is
document-centric, so you would have to rethink your data organization
completely. The challenge is constructing documents that do not rely on
"pointers" between them but on hierarchical JSON structures.
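To contrast the two shapes, a minimal sketch (all field names invented):

```python
# Relational style: the document points at another resource by id,
# so someone must resolve the id at query time (the "macro" problem).
normalized_doc = {"order_id": 7, "customer_id": 42}

# Document-centric style: the referenced resource is embedded at index
# time, so queries, filters, and facets can run on the document alone.
denormalized_doc = {
    "order_id": 7,
    "customer": {"id": 42, "name": "ACME Corp", "region": "EU"},
}
```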
Best regards,
Jörg
Thanks for the comments, Jörg. The reason for such big queries is that I am
requesting multiple facets, each with its own facet filter. The whole query
explodes because I am trying to query this for around 50k ids. I am looking
at filtered aliases, and hopefully they will solve my problem; the catch is
that I will have to continuously update the aliases as and when the batch
of ids changes. Your plugin sounds like exactly what I need. Any pointers
to it, and when are you planning to release it?
Thanks!
Vinay
I hack on Elasticsearch plugins in my spare time, just for fun, so sorry,
no ETA for the plugin. It will be announced here on the list, of course.
Here is the status of my activity in that area:
- Netty WebSocket transport plugin and client (done)
- ES client/server codebase modularization (pending)
- HTTP Netty client (done, thanks to asynchttpclient)
- ES HTTP Java RESTful client (in progress)
- IRI/Semantic Web content extraction plugin for the ES client (open)
- IRI/Semantic Web mapper plugin for the ES server (open)
The modularization task is just for better code dependency design and
smaller code units; it is pending because I am waiting for 0.21.
Because my day job is focused on indexing library catalogs with the help of
the Semantic Web, where denormalizing identifiers is a common task, I hope
to make use of ES for Semantic Web data indexing and search.
Each facet with its own filter? Why isn't there one "filtered" query whose
results feed all the facets? Does each facet really look at a different set
of ids?
And why isn't the "macro_label_list_of_ids" from your original
"filter: { terms : { ids: [macro_label_list_of_ids] }}" representable in
the index as a single ID, instead of all the values in the list?
-Paul
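For illustration, the shape Paul is suggesting might look like this (facet API of that era; field names invented): one "filtered" query constrains the result set, and every facet is computed over those same results by default, so the id list appears exactly once instead of once per facet_filter:

```python
# Search body sketch: a single filtered query feeding all facets.
# Field names ("ids", "status", "region") are hypothetical.
search_body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"terms": {"ids": [101, 102, 103]}},  # appears once
        }
    },
    "facets": {
        # Facets run over the filtered result set, no per-facet id filter.
        "by_status": {"terms": {"field": "status"}},
        "by_region": {"terms": {"field": "region"}},
    },
}
```

This only works when all facets really should see the same id set; revdev's answer below explains why that assumption does not hold in his case.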
Hi P. Hill,
Yeah, each facet has its own filter because each needs to act on its own
subset of the data, not on the whole data set filtered by the global query.
Secondly, I can't represent all the "ids" with a single "group id" that
groups them together, because these groups are ever-changing: if I added a
group id to each doc, I might have to update all the docs whenever an "id"
moves out of a group.
But I found a solution with the "nested" type. I am now storing the data as
a "nested" array instead of as an object. That works perfectly, but
requires more storage on disk and RAM. Hopefully it will scale smoothly :).
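A minimal sketch of that nested-type setup (field names hypothetical): the mapping declares the array as nested so each group entry is indexed as its own hidden document, and a nested filter matches on membership. Changing a doc's groups then means reindexing that one doc, not updating a group id across thousands of docs:

```python
# Mapping sketch: "groups" is a nested array of objects.
mapping = {
    "doc": {
        "properties": {
            "groups": {
                "type": "nested",
                "properties": {"group_id": {"type": "integer"}},
            }
        }
    }
}

# Filter sketch: match docs whose nested groups contain group_id 7.
query = {
    "filter": {
        "nested": {
            "path": "groups",
            "filter": {"term": {"groups.group_id": 7}},
        }
    }
}
```

The extra disk and RAM cost revdev mentions comes from those hidden per-entry documents that the nested type creates.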