Social search

Hey folks,

I was reading the new features list in 0.90 and saw social search. The
terms lookup mechanism seems to have some promise, but I have a few
questions/issues:

  • It doesn't seem to work for the _id field (i.e. {"_id": {"terms": { ... }
    } })
  • The design means that you need to store the entire set of followers in a
    single doc array. Would that mean reindexing the entire list (which for
    us can be 300K+ longs) whenever the list changes?
  • If I wanted to denormalize the data instead and use a has_child filter to
    check the relationship, do you have any hints on how to create the minimal
    possible child doc so 100M+ of these don't kill the index size? I would be
    fine with losing the ability to do any other type of query (well, except
    for having a stable id for these docs). Here is what I have so far:

{"mapping": {"follower": {
  "_parent": {"type": "user"},
  "_source": {"enabled": false},
  "_all": {"enabled": false},
  "properties": {
    "followerId": {"type": "long", "precision_step": 0}
  }
} } }
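
For reference, a has_child check against a mapping like the one above might look roughly like the sketch below. It only builds the search body in Python and does not talk to a cluster. The helper name followed_by_filter and the example id 42 are made up for illustration, and the body assumes the 0.90-era filtered query and has_child filter syntax.

```python
import json


def followed_by_filter(follower_id, child_type="follower", field="followerId"):
    """Build a search body matching parent docs that have at least one
    child of `child_type` whose `field` equals `follower_id`.

    The helper name and defaults are illustrative, not part of any API.
    """
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "has_child": {
                        "type": child_type,
                        "query": {"term": {field: follower_id}},
                    }
                },
            }
        }
    }


body = followed_by_filter(42)
print(json.dumps(body, indent=2))
```

The printed body could be sent with curl to the parent type's _search endpoint in the same way as any other query.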

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Mike

> I was reading the new features list in 0.90 and saw social search. The
> terms lookup mechanism seems to have some promise, but I have a few
> questions/issues:
>
>   • It doesn't seem to work for the _id field (i.e. {"_id": {"terms": { ... }
>     } })

you want:

{ terms: { _id: { index... etc }}}

>   • The design means that you need to store the entire set of followers in a
>     single doc array. Would that mean reindexing the entire list (which for
>     us can be 300K+ longs) whenever the list changes?

Yes, although you could break them down into smaller chunks and use a bool
filter to combine them.
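
A minimal sketch of what the chunking suggestion could look like, assuming each user's followers are split across several lookup docs and OR'ed back together with a bool filter. Everything named here is hypothetical: the "users" index, the "followers" type, the "<user>_<n>" chunk id scheme, the "user_id" field being filtered on, and the chunk size of 10000. The code only builds the filter body.

```python
import json


def chunk_doc_ids(user_id, follower_count, chunk_size=10000):
    """Ids of the chunk docs holding one user's followers.

    The "<user_id>_<n>" naming scheme is an assumption for illustration.
    """
    n_chunks = max(1, (follower_count + chunk_size - 1) // chunk_size)
    return ["%s_%d" % (user_id, i) for i in range(n_chunks)]


def chunked_lookup_filter(field, doc_ids, index="users",
                          doc_type="followers", path="ids"):
    """Combine one terms-lookup filter per chunk doc in a bool/should."""
    lookups = [
        {"terms": {field: {"index": index, "type": doc_type,
                           "id": doc_id, "path": path}}}
        for doc_id in doc_ids
    ]
    return {"bool": {"should": lookups}}


doc_ids = chunk_doc_ids("42", 300000)
chunked_filter = chunked_lookup_filter("user_id", doc_ids)
print(len(doc_ids))
print(json.dumps(chunked_filter["bool"]["should"][0]))
```

With a layout like this, a change to the follower list should only require reindexing the chunk doc that contains the affected id, at the cost of one lookup per chunk at query time.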

>   • If I wanted to denormalize the data instead and use a has_child filter
>     to check the relationship, do you have any hints on how to create the
>     minimal possible child doc so 100M+ of these don't kill the index size? I
>     would be fine with losing the ability to do any other type of query (well,
>     except for having a stable id for these docs). Here is what I have so far:
>
> {"mapping": {"follower": {
>   "_parent": {"type": "user"},
>   "_source": {"enabled": false},
>   "_all": {"enabled": false},
>   "properties": {
>     "followerId": {"type": "long", "precision_step": 0}
>   }
> } } }

I wouldn't disable the _source field - you'll regret it later on, e.g. when
you want to rebuild your index, or debug why a particular query isn't
working as expected. And I wouldn't worry about the precision_step either.

Also, in master, there is a big memory improvement on parent/child queries.
Now only parent IDs are loaded into memory. Previously it used to load
child IDs too.

clint

Hey Clinton,

Thanks for the quick reply.

On Saturday, May 18, 2013 6:02:19 PM UTC-4, Clinton Gormley wrote:

> Hi Mike
>
> > I was reading the new features list in 0.90 and saw social search. The
> > terms lookup mechanism seems to have some promise, but I have a few
> > questions/issues:
> >
> >   • It doesn't seem to work for the _id field (i.e. {"_id": {"terms": { ...
> >     } } })
>
> you want:
>
> { terms: { _id: { index... etc }}}

Sorry, that wasn't a valid test case. Here's one that doesn't work:

$ curl -XPUT http://localhost:9200/index1/t1/123 -d '{ "name": "123" }'
{"ok":true,"_index":"index1","_type":"t1","_id":"123","_version":1}
$ curl -XPUT http://localhost:9200/index1/t1/456 -d '{ "name": "456" }'
{"ok":true,"_index":"index1","_type":"t1","_id":"456","_version":1}
$ curl -XPUT http://localhost:9200/index1/t2/1 -d '{ "ids": ["123", "456"]
}'
{"ok":true,"_index":"index1","_type":"t2","_id":"1","_version":1}
$ curl http://localhost:9200/index1/t1/_search -d '{ "query": { "filtered":
{ "filter": { "terms": { "_id": { "index": "index1", "type": "t2", "id":
"1", "path": "ids" } } } } } }'
{"took":48,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
$ curl http://localhost:9200/index1/t1/_search -d '{ "query": { "filtered":
{ "filter": { "terms": { "_id": ["123", "456"] } } } } }'
{"took":14,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"index1","_type":"t1","_id":"456","_score":1.0,
"_source" : { "name": "456"
}},{"_index":"index1","_type":"t1","_id":"123","_score":1.0, "_source" : {
"name": "123" }}]}}

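Since the inline form of the terms filter on _id does return both docs above, one possible stopgap is to resolve the lookup doc on the client and pass its values inline. The sketch below assumes the lookup doc's _source has already been fetched (for example with a plain GET on index1/t2/1). The helper names get_path and inline_ids_filter are made up for illustration and are not part of any Elasticsearch client.

```python
import json


def get_path(source, path):
    """Follow a dotted path through nested dicts and return a list of values."""
    value = source
    for key in path.split("."):
        value = value[key]
    return value if isinstance(value, list) else [value]


def inline_ids_filter(lookup_source, path="ids"):
    """Build an inline terms filter on _id from an already-fetched lookup doc."""
    return {"terms": {"_id": get_path(lookup_source, path)}}


# _source of index1/t2/1 from the test case above
lookup_source = {"ids": ["123", "456"]}
search_body = {"query": {"filtered": {"filter": inline_ids_filter(lookup_source)}}}
print(json.dumps(search_body))
```

This costs an extra round trip and sends the whole id list with every search, so it is probably only reasonable for short lists.
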
> >   • The design means that you need to store the entire set of followers in
> >     a single doc array. Would that mean reindexing the entire list (which for
> >     us can be 300K+ longs) whenever the list changes?
>
> Yes, although you could break them down into smaller chunks and use a bool
> filter to combine them.

Hmm, good point.

> >   • If I wanted to denormalize the data instead and use a has_child filter
> >     to check the relationship, do you have any hints on how to create the
> >     minimal possible child doc so 100M+ of these don't kill the index size? I
> >     would be fine with losing the ability to do any other type of query (well,
> >     except for having a stable id for these docs). Here is what I have so far:
> >
> > {"mapping": {"follower": {
> >   "_parent": {"type": "user"},
> >   "_source": {"enabled": false},
> >   "_all": {"enabled": false},
> >   "properties": {
> >     "followerId": {"type": "long", "precision_step": 0}
> >   }
> > } } }
>
> I wouldn't disable the _source field - you'll regret it later on, e.g. when
> you want to rebuild your index, or debug why a particular query isn't
> working as expected. And I wouldn't worry about the precision_step either.

ES isn't the main datastore here, so reindexing from the database isn't an
issue. I ran into a problem when doing this with the above mapping - the
index got too big for the FS cache and query & indexing performance went
through the floor. This was with 3 nodes with 15G RAM and an EBS RAID0.
Before adding the children the index was ~8G in size; afterwards it was
80G, which is ~680 bytes for a doc that's 2 ints.
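
As a quick sanity check on those figures, here is a back-of-the-envelope calculation. It assumes "G" means GiB and that the raw payload is two 8-byte longs (the parent id plus followerId, per the mapping), neither of which is stated exactly above.

```python
GIB = 2 ** 30  # assumption: "G" in the figures above means GiB

index_before = 8 * GIB    # index size before adding the child docs
index_after = 80 * GIB    # index size after adding the child docs
bytes_per_child = 680     # per-doc cost quoted above

added_bytes = index_after - index_before
# Number of child docs implied by the growth and the per-doc cost
implied_children = added_bytes / bytes_per_child

raw_payload = 2 * 8  # assumption: two 8-byte longs per child doc
overhead_factor = bytes_per_child / raw_payload

print("implied child docs: %.0f million" % (implied_children / 1e6))
print("on-disk size vs raw payload: %.1fx" % overhead_factor)
```

The implied doc count comes out a little above 100M, which lines up with the "100M+" child docs mentioned earlier in the thread.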

> Also, in master, there is a big memory improvement on parent/child
> queries. Now only parent IDs are loaded into memory. Previously it used to
> load child IDs too.

I saw that. I'm quite looking forward to 0.90.1 - mostly because of the bulk
update support. :)

> clint

On 20 May 2013 05:59, Mike Kaplinskiy <mike.kaplinskiy@gmail.com> wrote:

> $ curl -XPUT http://localhost:9200/index1/t1/123 -d '{ "name": "123" }'
> {"ok":true,"_index":"index1","_type":"t1","_id":"123","_version":1}
> $ curl -XPUT http://localhost:9200/index1/t1/456 -d '{ "name": "456" }'
> {"ok":true,"_index":"index1","_type":"t1","_id":"456","_version":1}
> $ curl -XPUT http://localhost:9200/index1/t2/1 -d '{ "ids": ["123", "456"] }'
> {"ok":true,"_index":"index1","_type":"t2","_id":"1","_version":1}
> $ curl http://localhost:9200/index1/t1/_search -d '{ "query": { "filtered":
> { "filter": { "terms": { "_id": { "index": "index1", "type": "t2", "id":
> "1", "path": "ids" } } } } } }'
> {"took":48,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
> $ curl http://localhost:9200/index1/t1/_search -d '{ "query": { "filtered":
> { "filter": { "terms": { "_id": ["123", "456"] } } } } }'
> {"took":14,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"index1","_type":"t1","_id":"456","_score":1.0,
> "_source" : { "name": "456"
> }},{"_index":"index1","_type":"t1","_id":"123","_score":1.0, "_source" : {
> "name": "123" }}]}}

You're right, this doesn't work. I've opened this issue:

clint
