(new at ES): can this be done with ES?

Hi guys,

I'm looking for a storage/search solution for the following scenario:

I have *50M* documents of this form:

{title: "ABC", tags: [{id: 1, weight: 10, type: 5}, {id: 4, weight: 50, type: 6}, ...]}

Each document can have *thousands* of tags.

I want to:

  1. search for documents that match all of a given array of tags [1,4,...]
  2. return top 30 tags based on the sum of their weights from all the
    matched documents, grouped by type.

So, if tag *4* appears in 50 documents, with a weight of 10 in each,
its total weight would be 50 * 10 = 500.
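
To make the requirement concrete, here is a minimal in-memory sketch in Python of the computation described above. It is only a reference for the expected result, not a suggestion for how to run it at 50M documents; the function name and the sample documents are made up for illustration.

```python
from collections import defaultdict


def top_tags_by_type(docs, required_ids, limit=30):
    """Sum tag weights over the documents that contain every id in
    required_ids, then return the heaviest `limit` tags per tag type."""
    required = set(required_ids)
    totals = defaultdict(lambda: defaultdict(int))  # type -> tag id -> summed weight
    for doc in docs:
        ids = {tag["id"] for tag in doc["tags"]}
        if not required <= ids:
            continue  # the document must carry all requested tags
        for tag in doc["tags"]:
            totals[tag["type"]][tag["id"]] += tag["weight"]
    return {
        tag_type: sorted(by_id.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
        for tag_type, by_id in totals.items()
    }


# Made-up sample data in the shape shown in the post.
docs = [
    {"title": "ABC", "tags": [{"id": 1, "weight": 10, "type": 5},
                              {"id": 4, "weight": 50, "type": 6}]},
    {"title": "DEF", "tags": [{"id": 1, "weight": 10, "type": 5},
                              {"id": 4, "weight": 50, "type": 6},
                              {"id": 7, "weight": 20, "type": 6}]},
    {"title": "GHI", "tags": [{"id": 4, "weight": 50, "type": 6}]},
]

result = top_tags_by_type(docs, [1, 4])
print(result)
```

Only the first two documents carry both tag 1 and tag 4, so the third one does not contribute to any total.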

I'm currently doing this with MongoDB's aggregation framework, but have
serious concerns about it running out of memory, since everything is done
in-memory.

Can this be achieved efficiently with ES?

Thank you for your input!

Matei

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I think this should be doable. You will need to use a Nested
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html) or
Parent/Child
(http://www.elasticsearch.org/guide/reference/mapping/parent-field.html)
mapping, since inner objects will collapse your 'tagged' fields into a
single, large array. IDs can be filtered using a simple Term filter
(http://www.elasticsearch.org/guide/reference/query-dsl/term-filter.html).
That will give you the list of matching documents.

To get the sum of weights for each type, you'd probably want to use a
filtered Statistical facet
(http://www.elasticsearch.org/guide/reference/api/search/facets/statistical-facet.html).

-Zach

Hi Zach, thank you for taking the time to reply. As I read in the docs,
Statistical facets are also performed in-memory, just like in MongoDB, so
the out-of-memory concerns would still be there, right?

Yep, that is correct - faceting (and sorting) are in-memory operations.

However, ES will only load into memory what is required to complete the
facet operation. If you filter the facet by ID value, you should be
operating on a subset of your 50M doc dataset. The memory requirements
will be lighter than if you were to facet over the entire dataset without a
filter. There is also the consideration that, as you add nodes to your
cluster, the memory burden of each node decreases since each node is
responsible for faceting a smaller portion of the data.

Unfortunately, this is not correct.

For efficiency's sake, Elasticsearch assumes that, if you need field
values for docs 1..10 in this query, the chances are good that you'll
need docs 11..1000 in a later query, so it loads the field values for ALL
docs in your index into memory.

On top of that, in versions of ES before 0.90, multi-value fields use a
lot of memory: num_of_docs x max_num_of_vals_per_field

If you have docs with thousands of tags, that is going to consume an
enormous amount of memory. Chances are good you will get OOM failures.
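
As a rough back-of-the-envelope illustration of the num_of_docs x max_num_of_vals_per_field figure, the sketch below plugs in the numbers from this thread. The 1,000 tags per document and the 4 bytes per slot are assumptions chosen only to show the order of magnitude; they are not measured values or documented Elasticsearch internals.

```python
def estimated_slots(num_docs, max_vals_per_field):
    """Number of value slots implied by num_of_docs x max_num_of_vals_per_field."""
    return num_docs * max_vals_per_field


NUM_DOCS = 50_000_000      # from the original question
MAX_TAGS_PER_DOC = 1_000   # assumption: "thousands" taken as at least 1,000
BYTES_PER_SLOT = 4         # assumption: 4 bytes per slot, purely for illustration

slots = estimated_slots(NUM_DOCS, MAX_TAGS_PER_DOC)
approx_gib = slots * BYTES_PER_SLOT / 2**30
print(f"{slots:,} slots, roughly {approx_gib:.0f} GiB at {BYTES_PER_SLOT} bytes each")
```

Even with these conservative assumptions the total is far beyond what a single node's heap could hold.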

The situation will be better when the next version of ES is released,
but you may want to reconsider this structure.

clint

Whoops, ignore what I said then. I assumed that a filtered facet would
operate on smaller datasets (similar to how queries operate on filtered
results).

Sorry about that! =/

Thanks Clint, Zach for your replies. It seems this is an increasingly
difficult problem to solve. Any suggestions on how I should restructure my
data for similar results, without risking OOM?

Hi Matei

Difficult to say. It kinda depends what you really want to achieve. I
know you explained your algorithm earlier, but it sounds like a
poor-man's ranking engine. I'm wondering if the native functionality
available in ES might do a better job of it.

Perhaps step back and restate the problem: what user experience do you
want to provide?

From that, the answer may be clearer.

clint

Hi Clint,

I have about 50M events, each being described by tags. There are 4 tag
types (places, speakers, topics, industries).

  1. Tags are hierarchical. For example, "Eric Schmidt" is filed under
    Google, which is filed under Tech companies. So, whenever Eric is at an
    event, all three tags are associated with the event.
  2. Different tags can have different popularity, meaning "Eric Schmidt"
    would have a popularity of 100, but "Eileen Naughton" would have a
    popularity of 10.
  3. The popularity does not apply hierarchically. That means that, if "Eric
    Schmidt" were to leave Google for Foursquare, his popularity would still be
    100 and Foursquare would still have popularity 50.

Now, imagine a left-hand menu with 4 sections:

Places

Paris
London
New York
[more]

Speakers

Google
Facebook
Mark Zuckerberg
[more]

and so on.

Whenever the user clicks on a tag, I want the menu to reflect the results
(faceted search). The twist is that when deciding whether to show "Google",
"Eric Schmidt" or "Foursquare" among the first three tags in each section, I
want to make sure the most popular tag is shown higher, based on [number of
matching events] * [tag popularity]. That means that if there are 3
matching events for "Foursquare" and only one for "Eric Schmidt", it should
show Foursquare first, with a score of 3 * 50 = 150 vs Schmidt's 1 * 100 = 100.

Also, ideally, if I select "Google" then, for the "speakers" section, the
system should not return people outside Google, even if the matching events
also have "Zuckerberg" listed, with a huge popularity of 200. So, the
returned tags should reside "beneath" the current selection in each
section, and their sorting should be based on the above scoring logic.
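
The scoring and the "beneath the current selection" rule can be sketched in plain Python as follows. This is only an illustration of the intended behaviour on a handful of made-up events. The popularities of Eric Schmidt, Eileen Naughton, Foursquare and Zuckerberg are the ones quoted above; every other number, the parent links and the function names are assumptions.

```python
from collections import Counter

# Hypothetical tag table: section, popularity and parent in the hierarchy.
TAGS = {
    "Tech companies":  {"section": "speakers", "popularity": 20,  "parent": None},
    "Google":          {"section": "speakers", "popularity": 50,  "parent": "Tech companies"},
    "Foursquare":      {"section": "speakers", "popularity": 50,  "parent": "Tech companies"},
    "Facebook":        {"section": "speakers", "popularity": 40,  "parent": "Tech companies"},
    "Eric Schmidt":    {"section": "speakers", "popularity": 100, "parent": "Google"},
    "Eileen Naughton": {"section": "speakers", "popularity": 10,  "parent": "Google"},
    "Mark Zuckerberg": {"section": "speakers", "popularity": 200, "parent": "Facebook"},
}

# Each event is the set of tags attached to it (ancestors included).
EVENTS = [
    {"Tech companies", "Foursquare"},
    {"Tech companies", "Foursquare"},
    {"Tech companies", "Foursquare"},
    {"Tech companies", "Google", "Eric Schmidt"},
    {"Tech companies", "Google", "Eileen Naughton", "Facebook", "Mark Zuckerberg"},
]


def is_beneath(tag, ancestor):
    """True if `ancestor` appears somewhere above `tag` in the hierarchy."""
    parent = TAGS[tag]["parent"]
    while parent is not None:
        if parent == ancestor:
            return True
        parent = TAGS[parent]["parent"]
    return False


def menu_section(events, section, selected=None, limit=3):
    """Rank the tags of one section by [matching events] * [popularity]."""
    matching = [tags for tags in events if selected is None or selected in tags]
    counts = Counter(tag for tags in matching for tag in tags)
    # Only restrict to descendants when the selection lives in this section.
    restrict = selected is not None and TAGS[selected]["section"] == section
    scores = {}
    for tag, count in counts.items():
        if TAGS[tag]["section"] != section:
            continue
        if restrict and not is_beneath(tag, selected):
            continue
        scores[tag] = count * TAGS[tag]["popularity"]
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


overall = dict(menu_section(EVENTS, "speakers", limit=10))
under_google = menu_section(EVENTS, "speakers", selected="Google")
print(overall["Foursquare"], overall["Eric Schmidt"])
print(under_google)
```

With "Google" selected, Zuckerberg is dropped even though he appears on a matching event, because he does not sit beneath Google in the hierarchy.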

I hope I managed to explain what I'm trying to achieve.

Any help would be greatly appreciated!

Thanks
Matei

Hi Matei

OK - that makes a lot of sense now. Much easier to understand with
"real" data.

I think this will be quite easy to do using version 0.90.0.RC1 (which
will be released as stable in the near future), which has support for
sorting on values in nested documents (min,max,sum,avg).

So, index your tags as nested documents:

curl -XPUT 'http://127.0.0.1:9200/events/?pretty=1'  -d '
{
   "mappings" : {
      "event" : {
         "properties" : {
            "title" : {
               "type" : "string"
            },
            "tags" : {
               "type" : "nested",
               "properties" : {
                  "value" : {
                     "index" : "not_analyzed",
                     "type" : "string"
                  },
                  "weight" : {
                     "type" : "integer"
                  },
                  "type" : {
                     "index" : "not_analyzed",
                     "type" : "string"
                  }
               }
            }
         }
      }
   }
}
'

Then, some data, eg:

curl -XPOST 'http://127.0.0.1:9200/events/event?pretty=1'  -d '
{
   "title" : "Paris in the springtime",
   "tags" : [
      {
         "value" : "Paris",
         "weight" : 10,
         "type" : "place"
      },
      {
         "value" : "Eric Schmidt",
         "weight" : 100,
         "type" : "speaker"
      },
      {
         "value" : "Google",
         "weight" : 50,
         "type" : "company"
      }
   ]
}
'

curl -XPOST 'http://127.0.0.1:9200/events/event?pretty=1'  -d '
{
   "title" : "Barcelona is the bollocks",
   "tags" : [
      {
         "value" : "Barcelona",
         "weight" : 30,
         "type" : "place"
      },
      {
         "value" : "Mark Zuckerberg",
         "weight" : 30,
         "type" : "speaker"
      },
      {
         "value" : "Facebook",
         "weight" : 40,
         "type" : "company"
      }
   ]
}
'

Now we can do our search, and sort on the sum of the weights in the
nested docs:

Look for the 'sort' value:

You can filter on (eg) just speakers from Google:

Or you can include all events, but score just on companies named
'Google':

Because we're sorting on a field value (ie 'tags.weight'), its values
need to be loaded into memory. But each nested doc has only a single value
for that field, so you shouldn't run into the memory problems that you
might have had with other designs.
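
The request bodies for the three searches mentioned above are not included in this copy of the thread. As a hedged sketch only, the Python below builds what such bodies might look like. It assumes the 0.90-era sort options `mode` and `nested_filter`, the `filtered` query and the `nested` filter; the helper names are made up, and none of this is the original example.

```python
import json


def term(field, value):
    """Shorthand for a term filter on one field."""
    return {"term": {field: value}}


def nested_weight_sort(query=None, nested_filter=None):
    """Build a search body that sorts events by the sum of tags.weight.

    Assumption: nested sorting accepts `mode` and `nested_filter`.
    """
    sort_options = {"order": "desc", "mode": "sum"}
    if nested_filter is not None:
        # Only nested tag docs passing this filter contribute to the sum.
        sort_options["nested_filter"] = nested_filter
    return {
        "query": query if query is not None else {"match_all": {}},
        "sort": [{"tags.weight": sort_options}],
    }


# 1. All events, sorted on the sum of all their tag weights.
all_events = nested_weight_sort()

# 2. Only events tagged "Google", sorted on the summed weights of their speakers.
google_events = nested_weight_sort(
    query={
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"nested": {"path": "tags",
                                  "filter": term("tags.value", "Google")}},
        }
    },
    nested_filter=term("tags.type", "speaker"),
)

# 3. All events, but only company tags named "Google" count towards the sort value.
google_company_sort = nested_weight_sort(
    nested_filter={"bool": {"must": [term("tags.type", "company"),
                                     term("tags.value", "Google")]}},
)

print(json.dumps(google_company_sort, indent=2))
```

Each printed body could be sent with curl -XGET 'http://127.0.0.1:9200/events/event/_search?pretty=1' -d '...' in the same way as the commands above; check the option names against the documentation for the version you run.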

clint

Hi Clint,

If I understand correctly, the examples you sent are for retrieving and
sorting *events* based on the weight sum of selected tags, which is great,
but not exactly the problem :) What I am looking for is a way to retrieve
the facets for the menu, for each tag type, so that I can efficiently allow
the users to drill down by showing only the most relevant tags, based on
the weight-sum logic.

Sorry for not being clear enough, but I will surely use the examples once I
get to the sorting of events (for now, the event searching is done, but I
haven't thought about sorting them yet).

Fortunately, this seems easy enough to do with a terms_stats facet.
Look at the 'total' value.

here are the facets for all docs:

and facets just for docs with company:Google
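
The facet requests themselves are not included in this copy of the thread. The sketch below shows, under stated assumptions, how one terms_stats facet per tag type might be built: it assumes the facet options `key_field`, `value_field`, `order`, `size`, `nested` and `facet_filter` as documented for the 0.90 line, uses the field names from the mapping above, and the helper name is made up. It is not the original example.

```python
import json


def tag_menu_facets(tag_types, query=None, size=30):
    """Build a search body with one terms_stats facet per tag type,
    ordered by the summed weight ('total') of each tag value."""
    facets = {}
    for tag_type in tag_types:
        facets[tag_type] = {
            "terms_stats": {
                "key_field": "tags.value",     # group by tag name
                "value_field": "tags.weight",  # sum this per tag name
                "order": "total",
                "size": size,
            },
            "nested": "tags",  # run the facet on the nested tag docs
            "facet_filter": {"term": {"tags.type": tag_type}},
        }
    return {
        "size": 0,  # only the facets are needed, not the hits
        "query": query if query is not None else {"match_all": {}},
        "facets": facets,
    }


TAG_TYPES = ["place", "speaker", "company"]

# Facets over all docs.
all_docs = tag_menu_facets(TAG_TYPES)

# Facets only for docs that have a company tag named "Google".
google_only = tag_menu_facets(
    TAG_TYPES,
    query={
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "nested": {
                    "path": "tags",
                    "filter": {"bool": {"must": [
                        {"term": {"tags.type": "company"}},
                        {"term": {"tags.value": "Google"}},
                    ]}},
                }
            },
        }
    },
)

print(json.dumps(all_docs, indent=2))
```

In the response, the 'total' of each facet entry would then be the weight sum used to order the menu for that tag type.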

clint

That looks great, I'm going to give everything you pointed out a test.
Hopefully, even though they are statistical facets, I won't run out of
memory :)

Thanks a lot! I owe you one. Let me know if you ever need front-end or PHP
assistance :)
