Re: Digest for elasticsearch@googlegroups.com - 25 Messages in 13 Topics

Otis

Performance Monitoring - http://sematext.com/spm
On Nov 17, 2012 11:01 AM, elasticsearch@googlegroups.com wrote:

Today's Topic Summary

Group: http://groups.google.com/group/elasticsearch/topics

  • Storing table-like data in Elastic Search [3 Updates]
  • Issue with my template creation [1 Update]
  • Lost shards and cluster state stays red [1 Update]
  • how to one result when search nested? [2 Updates]
  • Control shard placement [2 Updates]
  • Multiple synonyms contribute to the score [1 Update]
  • org.elasticsearch.transport.TransportSerializationException: Failed
    to deserialize exception response from stream when one node is still
    starting [1 Update]
  • [ANN] elasticsearch-equilibrium plugin version 0.19.4 [2 Updates]
  • [ANN] geocluster-facet 0.0.1 [2 Updates]
  • carrot2 error on elasticsearch version 0.19.11 [2 Updates]
  • how to update cluster setting? [6 Updates]
  • [Autocomplete] Cleo or ElasticSearch with NGram [1 Update]
  • Too many open files but nofile set to 256000 [1 Update]

Storing table-like data in Elastic Search
http://groups.google.com/group/elasticsearch/t/e0ef5d7dfd923618

Clinton Gormley clint@traveljury.com Nov 17 01:23PM +0100

However I'm now struggling with how to give priority to matches from
the same row. I.e. currently a text search for "Irland Setter" gives the
second document a much higher score (0.21 and 0.13 respectively).

First, you're experimenting with very few documents (I assume) which
means that your terms are unevenly distributed across your shards. For
testing purposes, I would either add "search_type=dfs_query_then_fetch"
to your search query string, or I would create a test index with only 1
shard.

I need the first document to have a higher score because it has both
"Irland" and "Setter" in the same row.

Use the match_phrase query with a high "slop" value, eg:

{ "query": {
    "match_phrase": {
        "row": {
            "query": "irland setter",
            "slop": 100
        }
    }
}}

This will incorporate token distance into the relevance calculation.

Also, when you're indexing arrays of analyzed strings, it may be worth
setting the position_offset_gap in the mapping.

If you index ["quick brown", "fox"], by default it would be indexed as:

  • position 1 : quick
  • position 2 : brown
  • position 3 : fox

If you set the position_offset_gap, ie map the "row" field as:

{ type: "string", position_offset_gap: 100 }

it would be indexed as:

  • position 1 : quick
  • position 2 : brown
  • position 103 : fox

This of course depends on what you are trying to achieve with your
data.
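The interaction between the gap and a sloppy phrase query can be sketched with a small simulation (this is an illustration of the position arithmetic only, not Elasticsearch code; the whitespace analysis and distance check are simplified assumptions):

```python
# Illustration only: how a position gap keeps sloppy phrase matches
# from spanning array entries. Not Elasticsearch code.

def index_positions(values, gap=100):
    """Assign token positions, inserting `gap` between array entries."""
    positions = {}
    pos = 0
    for value in values:
        for token in value.lower().split():
            pos += 1
            positions.setdefault(token, []).append(pos)
        pos += gap  # jump ahead before the next array entry
    return positions

def within_slop(positions, t1, t2, slop):
    """Simplified check: some occurrence of t1 and t2 within `slop`."""
    return any(abs(p1 - p2) - 1 <= slop
               for p1 in positions[t1] for p2 in positions[t2])

pos = index_positions(["quick brown", "fox"], gap=100)
print(pos)  # {'quick': [1], 'brown': [2], 'fox': [103]}
# With slop=50, "brown fox" can no longer match across the gap:
print(within_slop(pos, "brown", "fox", 50))  # False
```

With the gap in place, a slop smaller than the gap cannot "reach" tokens in a different array entry, which is the point of the mapping option.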

clint

Zaar Hai haizaar@gmail.com Nov 17 06:51AM -0800

On Saturday, November 17, 2012 2:24:02 PM UTC+2, Clinton Gormley wrote:

testing purposes, I would either add "search_type=dfs_query_then_fetch"
to your search query string, or I would create a test index with only 1
shard.

Yes, I'm currently experimenting with just two documents to make sure
I'm on the right track. I've recreated them on a single shard following
your advice.


This does not help. The "wrong" (second) document still gets a much
higher score. I think it's because, after analysis, the first document
looks like:
"boxer", "good", "dog", "germany", "irish", "setter", "great", "dog",
"irland"
And the second:
"setter", "important", "irland", "green"

So in the second document the "setter" is actually closer to "irland"
than in the first one.

  • position 2 : brown
  • position 103 : fox

This of course depends on what you are trying to achieve with your
data.

This looks like an interesting approach. However, I need gaps between
rows, not between row members.
Strangely enough, changing the mapping for "row" as you've suggested
caused no results at all.
Also, running the analyzer shows that position_offset_gap is disregarded
completely.

Here is my query:
{
    "query": {
        "match_phrase": {
            "row": {
                "query": "setter ireland",
                "slop": 100
            }
        }
    }
}

And here is my mapping:
{
    "table" : {
        "properties" : {
            "title" : { "type" : "string" },
            "col_names" : { "type" : "string" },
            "rows" : {
                "properties" : {
                    "row" : { "type" : "string", "position_offset_gap" : 100 }
                }
            }
        }
    }
}

Thank you very much for your help and time!
Zaar

Clinton Gormley clint@traveljury.com Nov 17 04:11PM +0100

"dog", "irland"
And the second:
"setter", "important", "irland", "green"

Ah right, yes. And probably the fact that that row is shorter makes it
appear more relevant. You could try setting omit_norms to true, to
disable field-length normalization.

http://www.elasticsearch.org/guide/reference/mapping/core-types.html

This looks like an interesting approach. However I need gaps between
rows and not between row members.

True, sorry!

You may want to try an approach where you make "rows" type "nested", so
that each "row" is stored as a separate sub-document, which you can
query individually.

Then you can also add {include_in_root: true} to the "rows" mapping, so
that all the data is also indexed in the root document.

I've put together a demo here:

https://gist.github.com/4096675

clint

Issue with my template creation
http://groups.google.com/group/elasticsearch/t/d4b4438aec0749f0

Radu Gheorghe radu.gheorghe@sematext.com Nov 17 04:37PM +0200

Hello Praveen,

If you use curl from the command-line, you'll probably have to escape
the quotes, like:

"date_formats" : ["yyyy-MM-dd'"'T'"'HH:mm:ss"]

So instead of single-quote-T-single-quote, which will translate into
just the string T (since your whole payload is wrapped in single quotes,
the inner single quotes merely end the first quoted string, add a bare
T, then begin a new quoted string),

you can put:
single-quote-double-quote-single-quote-T-single-quote-double-quote-single-quote
-> which will translate to 'T', which is what you want. That's because
the first single quote ends the first part of your JSON, then you start
a double-quoted string which contains 'T', then you use single quotes
again to continue your JSON.

That's a lot of quotes :) Hope it helps, though.
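As a sanity check, the shell's view of both quoting styles can be reproduced with Python's shlex module (a sketch; the date-format string is taken from the example above):

```python
# Sketch: reproduce how a POSIX shell parses the two quoting styles.
import shlex

# single-quote-T-single-quote: the inner quotes just close and reopen
# the outer single-quoted string, so the T ends up unquoted.
naive = shlex.split("'yyyy-MM-dd'T'HH:mm:ss'")[0]
print(naive)  # yyyy-MM-ddTHH:mm:ss

# single-quote, double-quoted 'T', single-quote: the double-quoted
# section carries a literal 'T' through to the final argument.
escaped = shlex.split("'yyyy-MM-dd'\"'T'\"'HH:mm:ss'")[0]
print(escaped)  # yyyy-MM-dd'T'HH:mm:ss
```

The second form is what needs to reach Elasticsearch, since the Joda date format requires the literal quotes around T.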

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Fri, Nov 16, 2012 at 11:36 PM, Praveen Kariyanahalli

Lost shards and cluster state stays red
http://groups.google.com/group/elasticsearch/t/a9b93dd7c0f0f649

Radu Gheorghe radu.gheorghe@sematext.com Nov 17 04:08PM +0200

Hello,

I'm not sure if I understood your question correctly, but what you can
do is:

  • delete the indices which have missing shards. Something like:

curl -XDELETE localhost:9200/corrupted_index/

  • reindex data belonging to those "incomplete" indices

Then your cluster state should be back to yellow/green again. Until
then, if you have indices that have missing shards but also allocated
shards, ES will still run your searches on the data you have. If
that's important to you, then you might prefer to do things like this:

  • reindex data belonging to incomplete indices into new indices with
    different names
  • delete indices with missing shards
  • add aliases[0] to the new indices with the old index names, so that
    searches will run as before

[0]
http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html
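The alias step in the list above can be done in a single _aliases call; here is a sketch of the request body (the index names "logs_v2" and "logs" are hypothetical placeholders):

```python
import json

# Hypothetical names: data was reindexed into "logs_v2", and searches
# should keep using the old index name "logs".
actions = {
    "actions": [
        {"add": {"index": "logs_v2", "alias": "logs"}}
    ]
}
# This is the body for: curl -XPOST localhost:9200/_aliases -d '...'
print(json.dumps(actions))
```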

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

how to one result when search nested?
http://groups.google.com/group/elasticsearch/t/dc396ad9bb71e198

softtech Wonder wondersofttech@gmail.com Nov 17 01:43AM -0800

This is the data:

{
    "firstname": "Nicolas",
    "lastname": "Ippolito",
    "books": [
        { "name": "php", "rating": 3 },
        { "name": "nodejs", "rating": 5 },
        { "name": "guitar", "rating": 3 }
    ]
}

I want the result only where books.rating is the max, with only one
nested entry. Example of the result I want:

{
    "firstname": "Nicolas",
    "lastname": "Ippolito",
    "books": [
        { "name": "nodejs", "rating": 5 }
    ]
}

Radu Gheorghe radu.gheorghe@sematext.com Nov 17 03:46PM +0200

Hi,

You mean, when a search hits a document, you want ES only to return
parts of that document?

If so, I'm not sure how you can do this other than on the client side,
or by changing your data structure. For example, you might want to use
parent-child and search for what you want in the children, then use
the has_parent[0] query to search for what you want in the parent. In
this case Elasticsearch would return only the matching children, and
you can fetch their parents on the client side using the Multi Get API[1].

[0]
http://www.elasticsearch.org/guide/reference/query-dsl/has-parent-query.html
[1] http://www.elasticsearch.org/guide/reference/api/multi-get.html
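A sketch of the query shape Radu describes, with hypothetical type names (an "author" parent with "book" children; field names are borrowed from the example data):

```python
import json

# Hypothetical parent/child setup: "book" documents are children of
# "author" documents. Search books, constrained by their parent:
query = {
    "query": {
        "has_parent": {
            "parent_type": "author",
            "query": {"match": {"lastname": "Ippolito"}},
        }
    }
}
print(json.dumps(query, indent=2))
```

Only the matching "book" children come back as hits; the parents are then fetched separately with Multi Get.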

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Sat, Nov 17, 2012 at 11:43 AM, softtech Wonder

Control shard placement
http://groups.google.com/group/elasticsearch/t/c5e87efa6889a07f

elasticuser merik2004-elastic@yahoo.fr Nov 17 01:12AM -0800

My goal is to save space on the HDD. In my case, I have 5 TB on my
cluster, but with replica shards just 2.5 TB. So I would like to keep
5 TB for my primary shards and store the replica shards on a SAN.
I am aware that if a node with primary shards goes down, all
corresponding replicas will be promoted to primaries, but just for the
time it takes to repair the node.
Indeed, I would like to use nodes with HDD for requests, and nodes with
SAN just in case a node with primary shards has a problem.

On Saturday, November 17, 2012 12:11:15 AM UTC+1, Igor Motov wrote:

Igor Motov imotov@gmail.com Nov 17 05:23AM -0800

I see. That's an interesting idea. Unfortunately, I cannot think of a
mechanism that would allow you to do something like this. You can
configure a set of nodes to use HDD, and you can configure another set
of nodes to use SAN. You can use Allocation Awareness
<http://www.elasticsearch.org/guide/reference/modules/cluster.html> to
make sure that if one shard is allocated on an HDD node, its replica
would be allocated on SAN, and vice versa. You can start HDD nodes
first to make sure they get all the primary shards. But that's it.
There is really no way to reassign primaries back to HDD nodes if they
get moved to SAN nodes, or to limit searches to only HDD nodes when
primaries are no longer there.

On Saturday, November 17, 2012 4:12:31 AM UTC-5, elasticuser wrote:

Multiple synonyms contribute to the score
http://groups.google.com/group/elasticsearch/t/ef36599c6b76655c

Clinton Gormley clint@traveljury.com Nov 17 01:14PM +0100

Hi Kevin

document only has one mention of 'sutent' and none of its synonyms.
The net result is that words with more synonyms artificially get a
boost in the results.

There are various ways to approach this problem. Either you:

  • expand your synonym list at index time (ie you store all
    variations of the synonym in your index), but then you search on
    just one variation (by using a different analyzer at search or
    index time),
  • contract your synonym list at index and search time: eg foo, bar
    or baz all get indexed as just 'foo'. A search for 'bar'
    becomes a search for 'foo'

I have put together a gist demonstrating how this all works:
https://gist.github.com/4095280

The question remains: which should I prefer? expand: true or false?

I'm open to disagreement, but my vote would be for expand: false. ie
index just the first word in the synonym list, not all the words.

My reason for that is:

  1. fewer terms to index
  2. replacing synonyms with all variations or just one variation
    implies the same loss of original information (ie which synonym
    appeared in the original text).
  3. Synonyms can be of different lengths (eg "wi fi" vs "wifi"), which
    means that (with expand: true), the phrase "wifi router" would be
    indexed as:

Pos:  1     2     3
      wifi  router
      wi    fi    router

which can mess up eg phrase queries, which depend on token positions,
and can also mess up snippet highlighting.
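The spurious matches this produces can be sketched with a tiny simulation of the stacked token stream above (a simplified illustration, not Lucene's actual phrase matching):

```python
# Sketch: token positions when the multi-word synonym "wi fi" is
# expanded in place alongside "wifi" for the text "wifi router".
# Both variants start at position 1, so "router" sits at position 2
# for one variant and position 3 for the other.
stream = {
    1: ["wifi", "wi"],
    2: ["router", "fi"],
    3: ["router"],
}

def phrase_match(stream, words):
    # Simplified: each word must appear at consecutive positions.
    return any(all(w in stream.get(p + i, []) for i, w in enumerate(words))
               for p in stream)

# The original text was "wifi router", yet the phrase "fi router"
# now matches too (positions 2 -> 3):
print(phrase_match(stream, ["wifi", "router"]))  # True
print(phrase_match(stream, ["fi", "router"]))    # True
```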

hth

clint

org.elasticsearch.transport.TransportSerializationException: Failed to
deserialize exception response from stream when one node is still starting
http://groups.google.com/group/elasticsearch/t/a5d2557800cb33a

Chris Male gento0nz@gmail.com Nov 17 01:35AM -0800

Hi,

Are you able to share your logs around the Exception? Just so we can
see
what's going on leading up to it.

On Friday, November 16, 2012 11:45:03 PM UTC+13, Barbara Ferreira
wrote:

[ANN] elasticsearch-equilibrium plugin version 0.19.4
http://groups.google.com/group/elasticsearch/t/484bd10e8f3fab74

Otis Gospodnetic otis.gospodnetic@gmail.com Nov 16 06:39PM -0800

Hi Lee,

Thanks, this sounds nice.
Does this take into account shards and their replicas, to ensure that
no more than 1 copy of a shard is placed on any 1 server?

Otis

Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Friday, November 16, 2012 12:15:17 PM UTC-5, Lee Hinman wrote:

Lee Hinman matthew.hinman@gmail.com Nov 16 10:10PM -0800

On Friday, November 16, 2012 9:39:04 PM UTC-5, Otis Gospodnetic wrote:

Does this take into account shards and their replicas to ensure that
no
more than 1 copy of a shard is placed on any 1 server?

Otis

Hi Otis,

Yes, this plugin takes into account all the same Deciders that the
original shard allocator does, just with the additional check for
available disk space before giving the "thumbs up" for shard allocation
or relocation.

- Lee

[ANN] geocluster-facet 0.0.1
http://groups.google.com/group/elasticsearch/t/6c3f95687578e6db

Eric Jain eric.jain@gmail.com Nov 16 06:20PM -0800

Here's a (somewhat simplistic) facet that clusters geo_points:

https://github.com/zenobase/geocluster-facet

You can see this plugin in action here:

https://zenobase.com/#/buckets/u07qih0a27/

Hoping to get some feedback, suggestions for improvements etc!

Medcl Zen medcl2000@gmail.com Nov 17 10:37AM +0800

Nice plugin, thanks for sharing!

carrot2 error on elasticsearch version 0.19.11
http://groups.google.com/group/elasticsearch/t/c0d6b6e932f92d52

Jalal Mohammed jalalm@algotree.com Nov 16 01:40PM +0530

Thanks Chris,
The error was with elasticsearch version 0.19.11; the carrot plugin
used to work well with 0.19.8. I will raise this issue with the Carrot
plugin project.

Medcl Zen medcl2000@gmail.com Nov 17 10:34AM +0800

Hi, already fixed in 1.1.1.
Have fun :)

how to update cluster setting?
http://groups.google.com/group/elasticsearch/t/4645c6f083f09ef9

Igor Motov imotov@gmail.com Nov 16 04:56PM -0800

Did you run it like this?

curl -XPUT localhost:9200/_cluster/settings -d '{
    "persistent": {
        "indices.store.throttle.type": "merge",
        "indices.store.throttle.max_bytes_per_sec": "50mb"
    }
}'

Which version of elasticsearch are you using?

On Friday, November 16, 2012 7:48:14 PM UTC-5, Jae wrote:

Jae metacret@gmail.com Nov 16 04:59PM -0800

Every setting I tried is throwing the same error message. I think that
I should see the full list of cluster-wide settings with 'curl -XGET
http://localhost:7104/_cluster/settings', but I am seeing empty
persistent and transient settings, like:

{"persistent":{},"transient":{}}

What did I do wrong?

On Friday, November 16, 2012 4:48:14 PM UTC-8, Jae wrote:

Igor Motov imotov@gmail.com Nov 16 05:06PM -0800

When you run it with -XGET you only get back the settings that you set
there using -XPUT.

On Friday, November 16, 2012 7:59:58 PM UTC-5, Jae wrote:

Jae metacret@gmail.com Nov 16 05:40PM -0800

So, how can I add updatable setting using -XPUT?

On Friday, November 16, 2012 5:06:30 PM UTC-8, Igor Motov wrote:

Igor Motov imotov@gmail.com Nov 16 05:42PM -0800

Like this:

curl -XPUT localhost:7104/_cluster/settings -d '{
    "persistent": {
        "indices.store.throttle.type": "merge",
        "indices.store.throttle.max_bytes_per_sec": "50mb"
    }
}'

On Friday, November 16, 2012 8:06:30 PM UTC-5, Igor Motov wrote:

Jae metacret@gmail.com Nov 16 05:48PM -0800

What the heck... when I specify the settings as a file name, such as

curl -XPUT localhost:7104/_cluster/settings -d @filename

where filename contains the same settings, it didn't work! What's the
difference? :(

Anyway, thank you so much for your patience.

On Friday, November 16, 2012 5:42:20 PM UTC-8, Igor Motov wrote:

[Autocomplete] Cleo or ElasticSearch with NGram
http://groups.google.com/group/elasticsearch/t/657c9bd4c63477c2

kidkid zkidkid@gmail.com Nov 16 12:11AM -0800

Hi All,

Currently, I am running searches with ES.
We use 3 servers with 24 cores and 30GB RAM each.

I want to build an index with NGram for autocomplete, but my friend
tells me to use Cleo.

I tried to google Cleo, but I can't find any useful article about Cleo
vs ES (or Lucene).

The problem is that we have done autocomplete with Lucene and found
it's not good enough.

Could someone help me?

Thanks in advance.

Too many open files but nofile set to 256000
http://groups.google.com/group/elasticsearch/t/ae3dd2647a813a81

Derry O' Sullivan derryos@gmail.com Nov 15 11:54PM -0800

For anyone with a problem like this, it may be worth confirming the
numbers within elasticsearch as well, using the nodes info API:

/_nodes?process

gives the max_file_descriptors:

{
    "refresh_interval": 1000,
    "id": 13919,
    "max_file_descriptors": 25000
}

and

/_nodes/process/stats

gives:

open_file_descriptors: 516

in the output.
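To cross-check from the process side, the limit a running process actually inherited can be read directly (a Unix-only sketch; the point is that this may differ from the nofile value in limits.conf if the service was started from an environment that never applied it):

```python
# Sketch (Unix only): read the open-files limit the current process
# actually inherited, which is what matters for "too many open files".
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)
print("hard limit:", hard)
# If soft is far below the nofile value you configured, the service
# was probably launched before the limit change took effect.
```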

On Thursday, 15 November 2012 11:12:43 UTC, mohsin husen wrote:

You received this message because you are subscribed to the Google Group
elasticsearch.
You can post via email to elasticsearch@googlegroups.com.
To unsubscribe from this group, send an empty message to
elasticsearch+unsubscribe@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/elasticsearch/topics.
