A way to compress large queries

Hi,
I am at a point where some of my search queries are close to 1 MB, for
example queries where the "terms" filter contains around 50k ids. I assume
transferring such big queries to ES adds a lot of network overhead. I am
wondering if there is a way I can send a "macro" in a query which ES then
expands on its side.
Let me describe how it would work:

  1. Query sent to ES:
    filter: { terms : { ids: [macro_label_list_of_ids] }}
    where "macro_label_list_of_ids" is just a label which uniquely identifies a
    list of integer ids.

  2. Upon receiving this query, ES does a lookup for
    "macro_label_list_of_ids" on another index (created by us) to extract
    the list of ids, and then substitutes those ids wherever
    "macro_label_list_of_ids" appears in the query.

So, can you tell me three things:

  1. Is this possible in the current version of ES?
  2. If not, are there any plans to implement it?
  3. Is there any other workaround to shorten my queries? (I can't change
    the way I am storing data because of business requirements.)

Thanks!

--

If you have one set of ids per query and they are all in the filter that is
applied to the entire query, you can create a filtered alias for each id
list. Besides saving on network traffic, you will also save on parsing this
huge list of ids for every request.
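
For reference, a minimal sketch of what that could look like via the
aliases API (the index name, alias name, field name and ids below are just
placeholders, not taken from this thread):

    curl -XPOST 'http://localhost:9200/_aliases' -d '{
      "actions": [
        { "add": {
            "index": "my_index",
            "alias": "id_batch_42",
            "filter": { "terms": { "ids": [101, 102, 103] } }
        }}
      ]
    }'

Searches sent to the alias are then automatically restricted to that id
list, so the request body no longer needs to carry the 50k ids:

    curl -XGET 'http://localhost:9200/id_batch_42/_search' -d '{
      "query": { "match_all": {} }
    }'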

On Friday, November 2, 2012 12:32:43 PM UTC-4, revdev wrote:

--

Glad to know that there is a solution to this problem.
I am not able to find proper documentation here:
http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter.html
Can you point me to a page which describes how I can create aliases?
Thanks!

On Friday, November 2, 2012 10:01:20 AM UTC-7, Igor Motov wrote:

--

It's in the middle of the index aliases page of the reference guide.
Search for "Filtered Aliases".

On Friday, November 2, 2012 1:04:02 PM UTC-4, revdev wrote:

--

Thanks! You guys rock! :)

On Fri, Nov 2, 2012 at 10:07 AM, Igor Motov imotov@gmail.com wrote:

--

Hi,

not sure if I understand it completely - but 1 MB queries just because of
ID lists scare me.

Your challenge of ID substitution by value extraction looks a lot like
denormalization of data structures (an ID is replaced by the resource the
ID stands for).

  1. Since automatic denormalization needs mechanics that are currently
    missing in ES, you will have to work around some issues. Filtered aliases
    or index views are one alternative; another is maintaining ID resolution
    on the client side (so the client does not send IDs but other related
    values to ES).

  2. Yes, I am working on a denormalization implementation, a mapper plugin
    called elasticsearch-mapper-iri. My plan is to expand documents
    automatically if an IRI (Internationalized Resource Identifier) is
    detected in a source document, by looking up the resource the IRI points
    to. Even an ES GetRequest in the background will be possible, besides
    fetching non-ES JSON over HTTP.

  3. Yes, think of a better data model at index time, not at search time.
    Your business requirements seem valid in a relational data model, where
    IDs play a different role than in an inverted index. An ES data model is
    document-centric, so you would have to rethink your data organization
    completely. The challenge is constructing documents that do not rely on
    "pointers" between them but on hierarchical JSON structures (see the
    sketch below).

Best regards,

Jörg

On Friday, November 2, 2012 5:32:43 PM UTC+1, revdev wrote:

--

Thanks for the comments, Jörg. The reason for such big queries is that I am
requesting multiple facets, each with its own facet filter. The whole query
explodes because I am querying across around 50k ids. I am looking at
filtered aliases, and hopefully that will solve my problem, but I will have
to continuously update the aliases as the batches of ids change. Your plugin
sounds like exactly what I need. Any pointers to it, and when are you
planning to release it?
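
To give a rough idea of the shape of such a request (the field names and
ids are placeholders here, and each id list is really tens of thousands of
entries long), it is the per-facet filter that multiplies the payload:

    {
      "query": { "match_all": {} },
      "facets": {
        "by_category": {
          "terms": { "field": "category" },
          "facet_filter": { "terms": { "ids": [101, 102, 103] } }
        },
        "by_status": {
          "terms": { "field": "status" },
          "facet_filter": { "terms": { "ids": [101, 102, 103] } }
        }
      }
    }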

Thanks!
Vinay

On Friday, November 2, 2012 12:08:37 PM UTC-7, Jörg Prante wrote:

--

I hack Elasticsearch plugins in my spare time just for fun. Sorry, no ETA
for the plugin. It will be announced here on the list, of course.

Here is my status of activity in that area:

Netty websocket transport plugin and client (done)
ES client/server codebase modularization (pending)
HTTP Netty client (thanks to asynchttpclient done)
ES HTTP Java RESTful client (in progress)
IRI/Semantic Web content extraction plugin for ES client (open)
IRI/Semantic Web Mapper plugin for ES server (open)

The modularization task is just for better code dependency design and
smaller code units; it is pending because I am waiting for 0.21.

Because my day job is focused on indexing library catalogs with the help of
the semantic web, where denormalizing identifiers is a common task, I hope
to make use of ES for semantic web data indexing and search.

Some interesting things are going on in the Stanbol community: they connect
CMS to semantic web stores and would like to explore ES for indexing.
https://twitter.com/fhopf/status/250161439342489600

Jörg

On Saturday, November 3, 2012 1:50:02 AM UTC+1, revdev wrote:

--

Great! I will definitely check it out when the plugin is released. Best of
luck with the project, and thanks for sharing!

On Sun, Nov 4, 2012 at 4:20 AM, Jörg Prante joergprante@gmail.com wrote:

--

On 11/2/2012 5:50 PM, revdev wrote:

Thanks for the comments, Jörg. The reason for such big queries is that I am
requesting multiple facets, each with its own facet filter.

Each facet with its own filter? Why isn't there one "filtered" query whose
results feed all the facets? Does each facet really look at a different set
of ids?

Also, why isn't the "macro_label_list_of_ids" from your original
"filter: { terms : { ids: [macro_label_list_of_ids] }}" representable in the
index as a single ID, rather than as all the values in the list?

-Paul

--

Hi P. Hill,
Yeah, each facet has its own filter because each one needs to act on a
different subset of the data, not on the entire data set filtered by the
global query. Secondly, I can't represent all the "ids" as a single "group
id" that ties them together, because these groups are ever-changing: if I
added the group id to the docs, I might have to update all of them whenever
an "id" moves out of a group.
But I found a solution with the "nested" type. I am now storing the data as
a "nested" array instead of as an object. That works perfectly, but requires
more storage on disk and more RAM. Hopefully it will scale smoothly :).
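
Roughly, the approach looks like this (the type and field names below are
just placeholders, not the actual mapping): mark the array field as
"nested" in the mapping, then filter on it with a nested filter:

    curl -XPUT 'http://localhost:9200/my_index' -d '{
      "mappings": {
        "item": {
          "properties": {
            "memberships": {
              "type": "nested",
              "properties": {
                "group": { "type": "string", "index": "not_analyzed" },
                "value": { "type": "integer" }
              }
            }
          }
        }
      }
    }'

    curl -XGET 'http://localhost:9200/my_index/item/_search' -d '{
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": {
            "nested": {
              "path": "memberships",
              "filter": { "term": { "memberships.group": "batch_42" } }
            }
          }
        }
      }
    }'

Each element of a nested array is indexed as its own hidden document, which
is why it costs more disk space and RAM than a plain object field.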

On Wed, Nov 14, 2012 at 5:48 PM, P. Hill parehill1@gmail.com wrote:

--