Pattern tokenization


(K.B.) #1

Hello,

I need to tokenize on a special pattern, here is the code I use with
SOLR:

I tried to implement it in ES using the pattern analyzer, but it didn't
work out. I tried:

    index :
      analysis :
        analyzer :
          linieAnalzyer :
            type : pattern
            lowercase: false
            pattern: ", "

and realized I need a special tokenizer, so my hope was that during
analysis the break would happen at the pattern level and this would work:

    index :
      analysis :
        analyzer :
          linieAnalzyer :
            type : pattern
            lowercase: false
            pattern: ", "
            flags: ""
            tokenizer: keyword

but it ignores my keyword setting. Does anyone have an idea how I can
tell it to treat the string as one string, then split it on ", " and
treat the resulting pieces as keywords that mustn't be processed any further?

Best,

K.


(Shay Banon) #2

That's what the pattern analyzer should do. It's already a fully fledged analyzer, so specifying a tokenizer does not mean much (it only applies to custom analyzers). Can you gist a sample with your config, and using the analyze API show the output that you get, and where it differs from what you expect?


(K.B.) #3

Hi Shay,

Thanks for answering. I'll try my best to explain, but I don't know
exactly how to show the output as I'm quite new to ES (coming from SOLR).

My current config now is (in elasticsearch.yml):

    index :
      analysis :
        analyzer :
          linieAnalzyer :
            type : pattern
            lowercase: false
            pattern: ", "
            flags: ""

My config for the type is as follows (from Java):

    // JSON mapping for the type; inner quotes are escaped for Java.
    String mapping = "{\"produkt\": {" +
        "\"properties\" : {" +
        "\"linie_mv\" : " +
        "{ \"type\" : \"string\", " +
        "\"index\" : \"analyzed\", " +
        "\"index_analyzer\": \"linieAnalzyer\", " +
        "\"search_analyzer\": \"linieAnalzyer\" }" +
        "}}}";

    client.admin().indices().create(new
        CreateIndexRequest(index).mapping(type, mapping)).actionGet();

and that works as expected.

I then index docs containing in the linie_mv field values like:

"_0_Foo, _1_Foo_Bar, _2_Foo_Bar_Fii /Faa, _3_Foo_Bar_Fii /Faa_La"

The value is split at the ", " but then also tokenized at every
whitespace " ", instead of each resulting piece being kept as a single
term. After splitting on ", " the analyzer mustn't dig any deeper, as
that breaks my tree.
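To make the expected result concrete, here is a minimal plain-Java sketch (no Elasticsearch involved; the class and method names are made up for illustration) that splits the sample value on the literal ", " only. The four strings it prints are the terms the field is supposed to end up with:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CommaSplitSketch {
    // Splits on the literal separator ", " only; inner spaces and slashes are left alone.
    static List<String> splitOnCommaSpace(String value) {
        return Arrays.asList(Pattern.compile(", ").split(value));
    }

    public static void main(String[] args) {
        String value = "_0_Foo, _1_Foo_Bar, _2_Foo_Bar_Fii /Faa, _3_Foo_Bar_Fii /Faa_La";
        // Print each term in brackets so that inner whitespace stays visible.
        for (String term : splitOnCommaSpace(value)) {
            System.out.println("[" + term + "]");
        }
    }
}
```

Any term that contains a space, such as "_2_Foo_Bar_Fii /Faa", has to survive as one piece for the tree to stay intact.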

The reason is that this field is used later for a facet based on a
prefix query, e.g.:

prefix("_0_")
-> facets -> "_0_Foo" ; "_0_Foo2" etc.;

prefix("_1_Foo")
-> facets -> "_1_Foo_Bar" etc.;

So I can dynamically get the leaves of a tree and the corresponding
docs.
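The drill-down idea can be sketched in plain Java as a prefix filter over a list of terms. This is only an illustration of the intended selection, not Elasticsearch code; the class name, method name and the sample term list are assumptions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixDrillDownSketch {
    // Keeps only the terms that start with the given prefix.
    static List<String> termsWithPrefix(List<String> terms, String prefix) {
        List<String> result = new ArrayList<String>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical terms as they should look after a correct split on ", ".
        List<String> terms = Arrays.asList(
            "_0_Foo", "_0_Foo2", "_1_Foo_Bar", "_2_Foo_Bar_Fii /Faa");
        System.out.println(termsWithPrefix(terms, "_0_"));    // level-0 nodes
        System.out.println(termsWithPrefix(terms, "_1_Foo")); // children of Foo
    }
}
```

This only works if every term still carries its "_<level>_" prefix, which is why the extra tokenization at whitespace is a problem.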

Currently I get it only partly right: with a prefix("*") most of what I
get back also includes chunks from inside the values, like "/Faa_La" here.

I hope it is now a bit clearer what's going wrong in my case.

Best,

K.

PS: Shay, thanks for your work - as a long-term Compass user I really
appreciate it very much!



(K.B.) #4

Hi Shay,

I'm a bit further now. Instead of continuing to fight with the
pattern analyzer I switched to "keyword" and did a
((String) o[37]).split(", ")
in Java, so it now indexes as expected:

linie_mv: [
" _0_foo"
" _1_foo_bar"
" _2_foo_bar_xxxx"
" _3_foo_bar_yyyy"
]

When I now query using the Java API:

    SearchRequestBuilder builder = client.prepareSearch(index);
    XContentQueryBuilder qb = QueryBuilders.queryString("_1_foo*")
        .defaultOperator(QueryStringQueryBuilder.Operator.OR)
        .field("linie_mv")
        .field("id")
        .allowLeadingWildcard(true).useDisMax(true);

    builder.setQuery(qb);

    builder.addFacet(FacetBuilders.termsFacet("linie_mv").field("linie_mv"));

I get the expected result set. The facet, however, doesn't return
the limited facet one expects (" _1_foo_bar") but instead returns
all the facet values that any matching item has. I also tried to use a
FacetFilter like
.facetFilter(new TermsFilterBuilder("linie_mv", "_1_foo*"))
but it has no effect, as all facet values are still returned.

I managed to work around this for the moment by introducing special
fields for each level, e.g. linie_mv_0 ... linie_mv_5, but this won't
work when doing m:m tree mappings. Any idea what I can do, or how I can
limit the facets to the ones containing special characters? (I expected
facetFilter to do just this.)
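A toy model in plain Java may help to see why all values come back. It assumes that a terms facet counts every value of every document that matches the query, so a multi-valued field contributes all of its values, not just the one that made the document match. This is not Elasticsearch code; the class name, method name and sample documents are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class FacetCountSketch {
    // Counts every value of every matching document, optionally keeping
    // only the values that fully match termPattern (null keeps all values).
    static Map<String, Integer> countTerms(List<List<String>> matchingDocs, Pattern termPattern) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (List<String> values : matchingDocs) {
            for (String value : values) {
                if (termPattern != null && !termPattern.matcher(value).matches()) {
                    continue; // value belongs to a matching doc but is filtered out as a term
                }
                Integer old = counts.get(value);
                counts.put(value, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two hypothetical documents that both match the query "_1_foo*".
        List<List<String>> hits = Arrays.asList(
            Arrays.asList("_0_foo", "_1_foo_bar", "_2_foo_bar_xxxx"),
            Arrays.asList("_0_foo", "_1_foo_bar", "_2_foo_bar_yyyy"));
        System.out.println(countTerms(hits, null));                    // all values of all hits
        System.out.println(countTerms(hits, Pattern.compile("_1_.*"))); // level-1 terms only
    }
}
```

In this model, narrowing the set of documents never removes the other values of the documents that remain; only a restriction on the terms themselves does.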

Best,

Korbinian



(Shay Banon) #5

Facets (non-global ones) are restricted to the query you execute; if that query matches more than you expect, then you need to optimize it. I am not sure if it should or not, since it's hard to follow what's going on. In order for me to easily help, just gist a curl recreation and I can have a look (http://www.elasticsearch.org/help).


(K.B.) #6

Hello Shay,

Let me try to re-explain. Please note that my problem relates to the
underlying logic, not the query itself; as I'm using the Java API
there shouldn't be any simple syntax error in the query itself.

Now, re-explained:

Think about a retailer like Amazon: you have books that can be mapped to
different tree-based categories, say

-> Category
-> Themes

and that are sometimes restricted in where they may be delivered to,
for legal reasons.

Now imagine you have 3 books:

A: "Historic->Old" (category) that's in the themes: "Religion->Jews"
and mustn't be sent to people living in Pakistan;
B: "Historic->Old" (category) that's in the themes: "Religion->Jews"
and may be delivered everywhere;
C: "Historic->New" (category) that's in the themes:
"Religion->Christians" and mustn't be sent to people living in Pakistan.

Now, the categories as well as the themes have to be displayed using a
tree-like structure, as we have many more categories and themes. If
you now look at the restrictions, restricted items shouldn't be served
to our customers from e.g. Pakistan, so a pre-created tree won't fit
here: we only know at runtime, after a limiting search, what type and
kind of trees we can show them.

As we don't know how many of those links are going to exist at any
time, and we don't want to alter or browse schemas (which is hard
if you have no clue what they will be called), we come to what
distinguishes a good search solution from the simple ones we
have in databases (and no, tokenizing and stemming aren't the things
that RDBMSs lack...): faceting.

If we can facet over the result of a search (here: only items
appropriate for our customer) we can dynamically retrieve all
existing tree nodes that make sense to present to the user.

So to get there we need to do the following:

  1. put together all possible mappings in 1(!) place, which we call
    "line" (which means it will be a multi-valued field for Lucene)
  2. these mappings have to express the whole mapping route, as we can't
    query around too much without sacrificing too much performance, meaning
    mappings will be called "Religion", "Religion_Jews" for our Jews
    category
  3. we need to be able to tell the depth of each node/leaf so we can
    go for it and get all nodes from a certain level, resulting in our new
    mapping like "_0_Religion", "_1_Religion_Jews" etc.
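The three steps above can be sketched in plain Java as follows. This is a minimal illustration of how such level-prefixed terms could be built from a root-to-leaf path; the class and method names are assumptions and not taken from the gist:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TreePathTermsSketch {
    // Turns a root-to-leaf path into one term per level: "_<depth>_<route so far>".
    static List<String> levelTerms(List<String> path) {
        List<String> terms = new ArrayList<String>();
        StringBuilder route = new StringBuilder();
        for (int depth = 0; depth < path.size(); depth++) {
            if (depth > 0) {
                route.append('_'); // separator between path segments
            }
            route.append(path.get(depth));
            terms.add("_" + depth + "_" + route);
        }
        return terms;
    }

    public static void main(String[] args) {
        // All terms of all paths of one book would go into the single multi-valued field.
        System.out.println(levelTerms(Arrays.asList("Religion", "Jews")));
        System.out.println(levelTerms(Arrays.asList("Historic", "Old")));
    }
}
```

For the path Religion -> Jews this yields "_0_Religion" and "_1_Religion_Jews", matching the naming described in step 3.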

This then translates to the following JSON book documents for ES:
https://gist.github.com/933bc88c63d30e9f2754 (the part under
JavaScript - my mistake, sorry)

These are then indexed, and now we want to query for them from Java.
Here we want all books for all countries, but with the full tree
retrieved by a facet for level 0 (bear in mind that the sub-trees
all share 1 common root under 0), so we want to get facets with the
values:

"_0_Religion"
"_0_Historic"

So in Java we run the query "Java query level 0": https://gist.github.com/933bc88c63d30e9f2754

But instead of the expected result with facetFilter { new
TermsFilterBuilder("linie_mv", "_0_*") } we get all the values entered
in the multi-valued field, as all books match the query, so the filter
for the facet didn't work. However, there is another way that might
help, regex, so we do:
-> "Java" https://gist.github.com/933bc88c63d30e9f2754

But then this happens: if we enter regex("_0_") we get back an empty
facet, meaning => TermsFacet: []
If we enter regex("_0_*") we get a not so nice error from ES: ERROR -> https://gist.github.com/933bc88c63d30e9f2754

I hope this makes it clearer what I have to achieve and what didn't
work out as expected. Any help on solving this is really appreciated,
as I have to ditch ES if I can't fix this somehow very soon.

Best,

K.



(K.B.) #7

PS: Just to note: in this system one has to run one query per level
from 0 to n-1, where n is the node level one wants to reach. As n is
limited to a reasonably low value this is no problem: the number of
queries grows linearly, O(n), and not with the feared O(n^kn), i.e.
there is no exponential increase.



(K.B.) #8

Ok, problem solved thanks to Lukáš Vlček!

The problem was that the documentation did not make clear which kind of
regular expression is expected. Instead of:
.regex("_1_")
one has to use
.regex("^_1_.*")
as this is not a Lucene or JavaScript expression but a plain Java
regular expression!
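The difference can be reproduced with java.util.regex alone. The sketch below assumes that the facet applies the expression to the whole term (as Matcher.matches() does), which would explain the empty facet for the short pattern; the thread itself only confirms that a Java regular expression is expected. Class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

public class RegexMatchSketch {
    // True only if the pattern covers the whole term (Matcher.matches()).
    static boolean matchesWholeTerm(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    // True if the pattern occurs anywhere inside the term (Matcher.find()).
    static boolean foundInTerm(String regex, String term) {
        return Pattern.compile(regex).matcher(term).find();
    }

    public static void main(String[] args) {
        String term = "_1_Religion_Jews";
        System.out.println(matchesWholeTerm("_1_", term));    // false: pattern is shorter than the term
        System.out.println(foundInTerm("_1_", term));         // true: substring search
        System.out.println(matchesWholeTerm("^_1_.*", term)); // true: ".*" consumes the rest
    }
}
```

With whole-term matching the trailing ".*" is what makes a prefix pattern work; the leading "^" is harmless but not strictly needed.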

PS: A wish for all future features and additions to ES: please
add more verbose docs, with more telling examples and a note on which
flavour of expression is expected!

On 8 Mrz., 10:46, "K.B." korbinian.ba...@googlemail.com wrote:

PS: just to note: in this system one has to query for each level from
0 to n-1 where n is the node level he wants to reach; as n is limited
to a reasonable low value this is no problem as our problem will still
be solved in complexity O(n) and not the feared O(n^kn) (explained:
simple complexity and not an exponential increase)

On 8 Mrz., 10:41, "K.B." korbinian.ba...@googlemail.com wrote:

Hello Shay,

I try to re-explain. Please note that my problem relates to the
logic behind, not the query itself and as I'm using the java-api
there typically mustnt be any simple error within the query itself.

Now, re-explained:

Think about a dealer like amazon: you have books that can be mapped to
different tree-based categories, say

-> Category
-> Themes
and are sometimes limited to the area where they may be delivered to
due to legal reasons;

Now imagine you have 3 books:

A: "Historic->Old" (category) thats in the themes: "Religion->Jews"
and mustnt be sent to people living in pakistan
B: "Historic->Old" (category) thats in the themes: "Religion->Jews"
and it may be delivered everywhere;
C: "Historic->New" (category) thats in the themes: "Religion-

Christians" and it mustnt be sent to people living in pakistan

Now, the categories as well as the themes have to be displayed in a
tree-like structure, as we have many more categories and themes. Because
of the delivery restrictions, some of them shouldn't be shown to
our customers from e.g. Pakistan, so a pre-built tree won't fit here:
we only know at runtime, after a restricting search, which
trees we can show them.

As we don't know how many of those links are going to exist at any
time, and we don't want to alter or browse schemas (which is hard
if you have no clue what they will be called), we come to the point
that separates a good search solution from a simple one like we
have in databases (and no, tokenizing and stemming aren't the thing that
RDBMSs lack): faceting.

If we can now facet over the result of a search (here: only items
appropriate for our customer), we can dynamically retrieve all
existing tree nodes that make sense to present to the user.

So to solve this we need to do the following:

  1. put all possible mappings together in 1(!) place, which we call
    "line" (meaning it will be a multi-valued field for Lucene)
  2. these mappings have to express the whole mapping route, as we can't
    run many extra queries without sacrificing too much performance, meaning
    the mappings will be called "Religion", "Religion_Jews" for our Jews
    category
  3. we need to be able to tell the depth of each node/leaf so we can
    get all nodes from a certain level, resulting in our new
    mappings like "_0_Religion", "_1_Religion_Jews" etc.
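
The three steps above can be sketched in a few lines of Java. TreePathTokens and tokensFor are hypothetical names; only the token layout ("_0_Religion", "_1_Religion_Jews") comes from the thread.

```java
import java.util.ArrayList;
import java.util.List;

class TreePathTokens {
    // Turns a path such as ["Religion", "Jews"] into one token per depth level:
    // ["_0_Religion", "_1_Religion_Jews"]. Each token carries the full route
    // from the root plus its level, so one multi-valued field is enough.
    static List<String> tokensFor(List<String> path) {
        List<String> tokens = new ArrayList<>();
        StringBuilder route = new StringBuilder();
        for (int level = 0; level < path.size(); level++) {
            if (level > 0) {
                route.append('_');
            }
            route.append(path.get(level));
            tokens.add("_" + level + "_" + route);
        }
        return tokens;
    }
}
```

A book mapped to several trees would simply collect the tokens of all its paths into the same multi-valued field.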

This translates to the following JSON book documents for ES: https://gist.github.com/933bc88c63d30e9f2754 (the part under
JavaScript - my mistake, sorry)

These are then indexed, and now we want to query for them from Java.
Here we want all books for all countries, but we want the tree
to be retrieved by a facet for level 0 (assume that the
subtrees all have 1 common root at level 0), so we want to get facets with
the values:

"_0_Religion"
"_0_Historic"

so we run a query in Java: "Java query level 0" https://gist.github.com/933bc88c63d30e9f2754

but instead of the expected values, with facetFilter { new
TermsFilterBuilder("linie_mv", "_0_*") } we get all values entered in the
multi-valued field, as all books match the query, so the filter for the
facet didn't work. However, there is another way that might
help, regex, so we do:
-> "Java" https://gist.github.com/933bc88c63d30e9f2754

but then it happens: if we enter regex("_0_") we get back an empty
facet, meaning => TermsFacet: []
if we enter regex("_0_*") we get an ugly error from ES: ERROR -> https://gist.github.com/933bc88c63d30e9f2754
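
Both symptoms are typical of glob-style patterns being handed to a Java regex engine. The exact expression that failed is not recoverable from the thread, so the sketch below only illustrates the general point: a helper that converts a simple glob into a Java regex, and a check showing that a pattern starting with "*" does not compile in Java. GlobToRegex and its methods are hypothetical names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class GlobToRegex {
    // Converts a simple glob ('*' = any run of characters) into a Java regex,
    // quoting everything else so that it is matched literally.
    static String globToRegex(String glob) {
        StringBuilder regex = new StringBuilder("^");
        for (String literal : glob.split("\\*", -1)) {
            regex.append(Pattern.quote(literal)).append(".*");
        }
        regex.setLength(regex.length() - 2); // drop the ".*" added after the last part
        return regex.append('$').toString();
    }

    // Returns true if the string compiles as a Java regex.
    static boolean isValidRegex(String candidate) {
        try {
            Pattern.compile(candidate);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches(globToRegex("_0_*"), "_0_Religion")); // true
        System.out.println(isValidRegex("*_0_")); // false: '*' has nothing to repeat
    }
}
```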

I hope this makes clearer what I have to achieve and what didn't
work as expected. Any help on solving this is really appreciated,
because if I can't fix this very soon I'll have to ditch ES.

Best,

K.

On 8 Mar., 07:43, Shay Banon shay.ba...@elasticsearch.com wrote:

Facets (non-global ones) are restricted to the query you execute; if that query matches more than you expect, then you need to optimize it. I am not sure whether it should or not, since it's hard to follow what's going on. In order for me to help easily, just gist a curl recreation and I can have a look (http://www.elasticsearch.org/help).

On Monday, March 7, 2011 at 8:49 PM, K.B. wrote:

Hi Shay,

I'm a bit further now. Instead of continuing to fight with the
pattern analyzer, I switched to "keyword" and did a
((String) o[37]).split(", ")
in Java, so it now indexes as expected:

linie_mv: [
" _0_foo"
" _1_foo_bar"
" _2_foo_bar_xxxx"
" _3_foo_bar_yyyy"
]
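
The values listed above carry a leading space. If that space is really in the index (it may just be a formatting artifact of this post), a prefix such as "_1_foo" would not match any of them. A defensive variant of the split, as a sketch with hypothetical names, trims each piece:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LineSplitter {
    // Splits a comma-separated "linie" value into clean tokens: surrounding
    // whitespace is trimmed, whitespace inside a token is kept untouched.
    static List<String> splitLine(String raw) {
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }
}
```

For the sample value from this thread, "_0_Foo, _1_Foo_Bar, _2_Foo_Bar_Fii /Faa", this keeps "_2_Foo_Bar_Fii /Faa" as a single token.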

When I now query using the Java API:

SearchRequestBuilder builder = client.prepareSearch(index);
XContentQueryBuilder qb = QueryBuilders.queryString("_1_foo*")
    .defaultOperator(QueryStringQueryBuilder.Operator.OR)
    .field("linie_mv")
    .field("id")
    .allowLeadingWildcard(true).useDisMax(true);

builder.setQuery(qb);

builder.addFacet(FacetBuilders.termsFacet("linie_mv").field("linie_mv"));

I get the expected result set. The facet, however, doesn't return
only the restricted value one expects (" _1_foo_bar") but instead returns
all the values that any matching item has. I also tried to use a
FacetFilter like
.facetFilter(new TermsFilterBuilder("linie_mv", "_1_foo*"))
but it has no effect, as still all values are returned....

I managed to work around this for the moment by introducing separate fields
for each line, e.g. linie_mv_0 ... linie_mv_5, but this won't work when
doing m:m tree mappings. Any idea what I can do, or how I can limit
the facets to the ones containing special characters? (I expected
facetFilter to do just this.)

Best,

Korbinian

On 6 Mar., 11:36, "K.B." korbinian.ba...@googlemail.com wrote:

Hi Shay,

thanks for answering. I'll try my best to explain, but I don't know
exactly how to show the output as I'm quite new to ES (coming from Solr).

My current config is (in elasticsearch.yml):
index :
    analysis :
        analyzer :
            linieAnalzyer :
                type : pattern
                lowercase: false
                pattern: ", "
                flags: ""

My config for the type is as follows (from Java):

String mapping = "{\"produkt\": {" +
    "\"properties\" : {" +
    "\"linie_mv\" : " +
    "{ \"type\" : \"string\", " +
    "\"index\" : \"analyzed\", " +
    "\"index_analyzer\": \"linieAnalzyer\", " +
    "\"search_analyzer\": \"linieAnalzyer\" }" +
    "}}}";

client.admin().indices().create(new
    CreateIndexRequest(index).mapping(type, mapping)).actionGet();

and this works as expected.

I then index docs containing in the linie_mv field values like:

"_0_Foo, _1_Foo_Bar, _2_Foo_Bar_Fii /Faa, _3_Foo_Bar_Fii /Faa_La"

It is split at the ", " but then also tokenized at every whitespace " "
instead of yielding single terms. After breaking the value at ", " it
mustn't dig in any deeper, as that breaks my tree.

The reason is that this is later used as a facet based upon a prefix query,
e.g.:

prefix("_0_")
-> facets -> "_0_Foo" ; "_0_Foo2" etc.;

prefix("_1_Foo")
-> facets -> "_1_Foo_Bar" etc.;

So I can dynamically get the leaves of a tree and the corresponding
docs.
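
The intended behaviour can be simulated without ES. The sketch below is plain Java, not the ES facet API, and PrefixFacetSketch is a hypothetical name: it counts, over a set of matching documents, only those terms of the multi-valued field that start with the requested prefix.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PrefixFacetSketch {
    // Counts the terms that start with the given prefix across all documents.
    // Each inner list stands for the values of one document's linie_mv field.
    static Map<String, Integer> facet(List<List<String>> docs, String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> terms : docs) {
            for (String term : terms) {
                if (term.startsWith(prefix)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

The point of the simulation is that the prefix has to be applied to the individual terms, not only to the documents: a document that matches the query still carries all of its other terms, so a facet over the whole field reports every one of them.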

Currently I get it only partly right: with a prefix("*") I mostly also
get chunks of words from inside the values, like "/Faa_La" here.

I hope it is now a bit clearer what's going wrong in my case.

Best,

K.

PS: Shay, thanks for your work - as a long-term Compass user I really
appreciate it very much!



(Lukáš Vlček) #9

Hi,

I added a link to the Java Pattern API to the ES docs. Also feel free to report other
missing references, or fork the docs and send a pull request :)

Regards,
Lukas



(Shay Banon) #10

Right, it's a Java regex; I will update the docs. What is a Lucene regex? I don't think there is one. This whole thread would have been much simpler if you had just provided what was asked for: a simple curl recreation using the analyze API to see the output of the analyzer you created.


(K.B.) #11

Heya Shay,

yes, next time I will try to be more specific; however, this is
sometimes not easy when one is new to a piece of software.
Especially when it comes to curl it's still hard for me, as I just don't
use it (tm) but instead do everything through the Java API from within an
EJB.

Regarding docs: I'm already thinking about forking and adding some
of the knowledge I'm gaining in my current project, especially some
kind of "How to treat a dynamic tree - Guide" and "EJB + ES:
Implementation Suggestions".

Regarding ES from Java: would you suggest getting 1 Client
within a singleton and sharing it among requests, or should each
request get hold of its own client?

Best,

K.

On 9 Mrz., 10:18, Shay Banon shay.ba...@elasticsearch.com wrote:

Right, its a Java regex, I will update the docs. What is a Lucene regex, don't think there is one. All this thread would have been much simpler if you could have just provided what was asked for, a simple curl recreation using the analyze API to see what is the result of the analyzer you created.

On Tuesday, March 8, 2011 at 9:40 PM, K.B. wrote:

Ok, problem solved thanks to Lukáš Vlček!

Problem is/ was that the regex-origin was not made clear in doc,
instead of:
.regex("1")
one has to use
.regex("^1.
")
as this is not lucene or JavaScript expression but a regular JAVA
regular expression!

PS: Wish for future to all coming features and additions to ES: please
add more verbose doc with more telling examples and the origin of
expressions!

On 8 Mrz., 10:46, "K.B." korbinian.ba...@googlemail.com wrote:

PS: just to note: in this system one has to query for each level from
0 to n-1 where n is the node level he wants to reach; as n is limited
to a reasonable low value this is no problem as our problem will still
be solved in complexity O(n) and not the feared O(n^kn) (explained:
simple complexity and not an exponential increase)

On 8 Mrz., 10:41, "K.B." korbinian.ba...@googlemail.com wrote:

Hello Shay,

I try to re-explain. Please note that my problem relates to the
logic behind, not the query itself and as I'm using the java-api
there typically mustnt be any simple error within the query itself.

Now, re-explained:

Think about a dealer like amazon: you have books that can be mapped to
different tree-based categories, say

-> Category
-> Themes
and are sometimes limited to the area where they may be delivered to
due to legal reasons;

Now imagine you have 3 books:

A: "Historic->Old" (category) thats in the themes: "Religion->Jews"
and mustnt be sent to people living in pakistan
B: "Historic->Old" (category) thats in the themes: "Religion->Jews"
and it may be delivered everywhere;
C: "Historic->New" (category) thats in the themes: "Religion-

Christians" and it mustnt be sent to people living in pakistan

Now, the Categorys as well as the Themes have to be displayed using a
tree like structure as we have so many more categories and themes; If
you now are a looking at the restrictions these shouldn't be served to
our customers from e.g: pakistan, so a pre-created tree won't fit here
as we only know at runtime after a limiting search what type and kind
of trees we can show them.

As we don't know how many of those linkings are going to exist at any
time and we don't want to alter or browse for schemes (which is hard
if you have no clue how these will be called) we come to the point
what makes a good search solution vs. a simple, stupid ones like we
have in databases (and no, token and stemming isn't the thing that
RDBMs havent....): faceting

If we can now facet over the result of a search (here: only items
appropriate for our customer) we can dynamically retrieve all
existing tree-nodes that make sense for the user to be presented to.

So to overcome we need to do following:

  1. put together all possible mappings at 1(!) place and that we call
    "line" (which means it will be a multi-valued field for lucene)
  2. these mappings have to express the whole mapping route as we can't
    query too much around without sacrifying too much performance, meaning
    mappings will be called "Religion", "Religion_Jews" for our Jews
    category
  3. we need to be able to tell the depth of each node/ leaf so we can
    go for it and get all nodes from a certain level, resulting in our new
    mapping like "_0_Religion", "_1_Religion_Jews" etc.

This then translates to our following JSON-Books for ES:https://gist.github.com/933bc88c63d30e9f2754(thepartunder
JavaScript - my mistake, sorry)

These are then indexed and now we want to query for them from Java,
and here we want all books for all countries but see the full tree
then retrieved by a facet for the level 0 (think about a that the sub-
trees have all 1 common root under 0), so we want to get facets with
values:

"_0_Religion"
"_0_Historic"

so we do in java a query: "Java query level 0"https://gist.github.com/933bc88c63d30e9f2754

but instead of the expecting with facetFilter { new
TermsFilterBuilder("linie_mv", "0*") } we get all multi-value-field
entered values as all books match the query, so the filter for the
facet didn't work. However, we have another way: regex that might
help, so we do:
-> "Java"https://gist.github.com/933bc88c63d30e9f2754

but then it happens: if we enter regex("0") we get back an empty
facet, meaning => TermsFacet: []
if we enter regex("
0*") we get a not nice error from ES: ERROR ->https://gist.github.com/933bc88c63d30e9f2754

I hope this makes it more clear what I have to achive and what didn't
work out as expected. Any help on solving this is really appreciated
as if I can't fix this somehow very soon I have to ditch ES,

Best,

K.

On 8 Mrz., 07:43, Shay Banon shay.ba...@elasticsearch.com wrote:

Facet (non global ones) are restricted to the query you execute, if that query matches more than you expect, then you need to optimize it. I am not sure if it should or not, since its hard to follow whats going on. In order for me to easily help, just gist a curl recreation and I can have a look (http://www.elasticsearch.org/help).

On Monday, March 7, 2011 at 8:49 PM, K.B. wrote:

Hi Shay,

I'm a bit forther now. Instead of trying to continue arguing with the
pattern analyzer I instead went to "keyword" and did a
((String) o[37]).split(", ")
in java so it now indexes as it expected:

linie_mv: [
" _0_foo"
" _1_foo_bar"
" _2_foo_bar_xxxx"
" _3_foo_bar_yyyy"
]

When I now query using java-api:

SearchRequestBuilder builder = client.prepareSearch(index);
XContentQueryBuilder qb = QueryBuilders.queryString("_1_foo*")
.defaultOperator(QueryStringQueryBuilder.Operator.OR)
.field("linie_mv")
.field("id")
.allowLeadingWildcard(true).useDisMax(true);

builder.setQuery(qb);

builder.addFacet(FacetBuilders.termsFacet("linie_mv").field("linie_mv"));

I get the expected result set. The facet, however, doesn't return
the limited facet one expects (" _1_foo_bar") but instead returns
all the facet values that any matching item has. I also tried to use a
FacetFilter like
.facetFilter(new TermsFilterBuilder("linie_mv", "_1_foo*"))
but it has no effect, as all facets are still returned.

For the moment I have worked around this by introducing special fields
for each line, e.g. linie_mv_0 ... linie_mv_5, but this won't work when
doing m:m tree mappings. Any idea what I can do, or how I can limit
the facets to the ones containing special characters? (I expected
facetFilter to do just this.)

Best,

Korbinian

On 6 Mar, 11:36, "K.B." korbinian.ba...@googlemail.com wrote:

Hi Shay,

thanks for answering. I'll try my best to explain, but don't know how
exactly to show the output as I'm quite new to ES (coming from SOLR).

My current config now is (in elasticsearch.yml):
index :
  analysis :
    analyzer :
      linieAnalzyer :
        type : pattern
        lowercase: false
        pattern: ", "
        flags: ""

My config for the type is as follows (from java):

String mapping = "{\"produkt\": {" +
"\"properties\" : {" +
"\"linie_mv\" : " +
"{ \"type\" : \"string\", " +
"\"index\" : \"analyzed\", " +
" \"index_analyzer\": \"linieAnalzyer\", " +
" \"search_analyzer\": \"linieAnalzyer\" }" +
"}}}";

client.admin().indices().create(new
CreateIndexRequest(index).mapping(type, mapping)).actionGet();

and this works as expected.

I then index docs whose linie_mv field contains values like:

"_0_Foo, _1_Foo_Bar, _2_Foo_Bar_Fii /Faa, _3_Foo_Bar_Fii /Faa_La"

It is split at the ", " but then also tokenized at every whitespace " "
instead of keeping the resulting pieces as single terms (after splitting
on ", " the analyzer must not go any deeper, as that breaks my tree).
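
The desired tokenization can be shown with a minimal sketch in plain Java. It uses only java.util.regex and no Elasticsearch classes; the class and method names are made up for illustration. It splits the sample value on ", " and nothing else, so whitespace inside a term survives:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LinieSplitDemo {
    // Separator taken from the thread: a comma followed by one space.
    private static final Pattern SEPARATOR = Pattern.compile(", ");

    /** Splits the raw field value on ", " only; whitespace inside a term is left alone. */
    public static List<String> splitTerms(String raw) {
        return Arrays.asList(SEPARATOR.split(raw));
    }

    public static void main(String[] args) {
        String raw = "_0_Foo, _1_Foo_Bar, _2_Foo_Bar_Fii /Faa, _3_Foo_Bar_Fii /Faa_La";
        // Each term is printed in brackets so inner spaces stay visible.
        for (String term : splitTerms(raw)) {
            System.out.println("[" + term + "]");
        }
    }
}
```

If an analyzer yields more than these four terms for the same input, something after the pattern split is still breaking the terms apart.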

The reason is that this is later used as a facet based on a prefix
query, e.g.:

prefix("0")
-> facets -> "_0_Foo" ; "_0_Foo2" etc.;

prefix("_1_Foo")
-> facets -> "_1_Foo_Bar" etc.;

So I can dynamically get the leaves of a tree and the corresponding
docs.

Currently I get this only partly right: with a prefix("*") I mostly
also get chunked words from inside a term, like "/Faa_La" here.

I hope this makes it a bit clearer what is going wrong in my case.

Best,

K.

PS: Shay, thanks for your work - as a long-term Compass user I really
appreciate it very much!

On 6 Mar, 05:31, Shay Banon shay.ba...@elasticsearch.com wrote:

...



(Shay Banon) #12

On Wednesday, March 9, 2011 at 12:03 PM, K.B. wrote:
Heya Shay,

yes, next time I will try to be more specific; however, this is
sometimes not easy when one is new to a piece of software.
Especially when it comes to curl it's still hard for me, as I just don't
use it (tm) but instead do everything through the Java API from within an
EJB.

Regarding docs: I am already thinking about forking and adding some
of the knowledge I'm gaining from my current project, especially some
kind of "How to treat a dynamic tree" guide and "EJB + ES:
Implementation Suggestions".

Would very much welcome it.

Regarding ES from Java: would you suggest getting one Client
within a singleton and sharing it among all requests, or should each
request get hold of its own client?

Single Client, shared among all "requests".
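
One way to share a single client is the initialization-on-demand holder idiom, sketched here in plain Java. DemoClient is a hypothetical stand-in for the real Elasticsearch client type (which is not used here), and all names are illustrative assumptions rather than anything from the ES API:

```java
import java.util.concurrent.atomic.AtomicInteger;

public final class SharedClientHolder {
    /** Hypothetical stand-in for the real client; counts how often it is constructed. */
    public static final class DemoClient {
        private static final AtomicInteger CREATED = new AtomicInteger();

        private DemoClient() {
            CREATED.incrementAndGet();
        }

        public static int createdCount() {
            return CREATED.get();
        }
    }

    private SharedClientHolder() {
    }

    // The JVM initializes Lazy (and so INSTANCE) exactly once, on first access,
    // and guarantees that this initialization is thread-safe.
    private static final class Lazy {
        static final DemoClient INSTANCE = new DemoClient();
    }

    /** Returns the one shared client; every caller gets the same object. */
    public static DemoClient get() {
        return Lazy.INSTANCE;
    }

    public static void main(String[] args) {
        DemoClient a = get();
        DemoClient b = get();
        System.out.println(a == b);
        System.out.println(DemoClient.createdCount());
    }
}
```

In an EJB setting a container-managed singleton bean would be another option; the holder idiom is shown only because it needs nothing beyond the JDK.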

Best,

K.

On 9 Mar, 10:18, Shay Banon shay.ba...@elasticsearch.com wrote:

Right, it's a Java regex; I will update the docs. What is a Lucene regex? I don't think there is one. This whole thread would have been much simpler if you had just provided what was asked for: a simple curl recreation using the analyze API to show the result of the analyzer you created.

On Tuesday, March 8, 2011 at 9:40 PM, K.B. wrote:

OK, problem solved thanks to Lukáš Vlček!

The problem was that the docs did not make clear which regex flavor is
used. Instead of:
.regex("1")
one has to use
.regex("^1.
")
as this is not a Lucene or JavaScript expression but a standard Java
regular expression!
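
The difference can be reproduced with java.util.regex alone. This is a sketch under two assumptions: that the facet applies the pattern to the whole term (Matcher.matches() semantics), and that the patterns below are illustrative stand-ins, since the exact expressions in this thread were garbled in transit:

```java
import java.util.regex.Pattern;

public class RegexMatchDemo {
    /** True only if the pattern covers the entire term (Matcher.matches semantics). */
    public static boolean matchesWholeTerm(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    /** True if the pattern occurs anywhere inside the term (Matcher.find semantics). */
    public static boolean foundInTerm(String regex, String term) {
        return Pattern.compile(regex).matcher(term).find();
    }

    public static void main(String[] args) {
        String term = "_1_Foo_Bar";
        // A bare fragment does not cover the whole term, so matches() rejects it.
        System.out.println(matchesWholeTerm("_1_", term));
        // The same fragment is found when searching inside the term.
        System.out.println(foundInTerm("_1_", term));
        // Adding ".*" lets the pattern cover the rest of the term.
        System.out.println(matchesWholeTerm("^_1_.*", term));
    }
}
```

Under whole-term matching, a bare fragment selects nothing, which would be consistent with the empty TermsFacet reported earlier in the thread.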

PS: A wish for all future features and additions to ES: please
add more detailed docs with more telling examples, and state which
syntax the expressions use!

On 8 Mar, 10:46, "K.B." korbinian.ba...@googlemail.com wrote:

PS: just to note: in this system one has to query for each level from
0 to n-1, where n is the node level one wants to reach. As n is limited
to a reasonably low value, this is no problem: the cost is O(n) and not
the feared O(n^kn) (that is, it grows linearly rather than
exponentially).
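
The level-by-level walk can be sketched in plain Java. Note that this filters an in-memory list of facet terms on the client side, which is a stand-in for the facet regex and not what the thread's queries do; the class name, method name, and sample terms are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TreeLevelFilter {
    /**
     * Returns the terms at the given level below parentRoute.
     * Pass an empty parentRoute to get the root nodes of level 0.
     */
    public static List<String> nodesAt(List<String> terms, int level, String parentRoute) {
        String prefix = "_" + level + "_" + (parentRoute.isEmpty() ? "" : parentRoute + "_");
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "_0_Religion", "_0_Historic",
                "_1_Religion_Jews", "_1_Religion_Christians", "_1_Historic_Old");
        // One lookup per level: first the roots, then the children of one root.
        System.out.println(nodesAt(terms, 0, ""));
        System.out.println(nodesAt(terms, 1, "Religion"));
    }
}
```

Reaching a node at level n takes one such lookup per level on the way down, which is the linear cost described above.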

On 8 Mar, 10:41, "K.B." korbinian.ba...@googlemail.com wrote:

Hello Shay,

I'll try to re-explain. Please note that my problem relates to the
logic behind the query, not the query itself, and as I'm using the Java
API there should not be any simple error within the query itself.

Now, re-explained:

Think about a dealer like Amazon: you have books that can be mapped to
different tree-based categories, say

-> Category
-> Themes
and that are sometimes restricted, for legal reasons, in the areas they
may be delivered to.

Now imagine you have 3 books:

A: "Historic->Old" (category), in the theme "Religion->Jews",
and must not be sent to people living in Pakistan;
B: "Historic->Old" (category), in the theme "Religion->Jews",
and may be delivered everywhere;
C: "Historic->New" (category), in the theme "Religion->Christians",
and must not be sent to people living in Pakistan.

Now, the categories as well as the themes have to be displayed in a
tree-like structure, as we have many more categories and themes.
Because of the restrictions, some of these must not be shown to
our customers from e.g. Pakistan, so a pre-created tree won't fit here:
only at runtime, after a limiting search, do we know which trees we can
show them.

As we don't know how many of those links will exist at any
time, and we don't want to alter or browse schemas (which is hard
if you have no clue what they will be called), we come to the feature
that separates a good search solution from the simple, stupid ones we
have in databases (and no, tokenizing and stemming are not what
RDBMSs lack): faceting.

If we can now facet over the result of a search (here: only items
appropriate for our customer), we can dynamically retrieve all
existing tree nodes that make sense to present to the user.

So to get there we need to do the following:

  1. put all possible mappings in one (!) place, which we call
    "line" (meaning it will be a multi-valued field for Lucene)
  2. these mappings have to express the whole mapping route, as we can't
    query around too much without sacrificing too much performance, meaning
    mappings will be called "Religion", "Religion_Jews" for our Jews
    category
  3. we need to be able to tell the depth of each node/leaf so we can
    fetch all nodes of a certain level, resulting in our new
    mapping like "_0_Religion", "_1_Religion_Jews" etc.
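
The three steps above can be sketched in plain Java as a small encoder that turns one category path into depth-prefixed terms for the multi-valued field. The class and method names are hypothetical; only the "_<depth>_<route>" format comes from the list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TreePathEncoder {
    /**
     * Encodes a path such as [Religion, Jews] into one term per level.
     * Each term carries its depth and the full route from the root.
     */
    public static List<String> encode(List<String> path) {
        List<String> terms = new ArrayList<>();
        StringBuilder route = new StringBuilder();
        for (int depth = 0; depth < path.size(); depth++) {
            if (depth > 0) {
                route.append('_');
            }
            route.append(path.get(depth));
            // Step 3: prefix the route with its depth so a level can be selected later.
            terms.add("_" + depth + "_" + route);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [_0_Religion, _1_Religion_Jews]
        System.out.println(encode(Arrays.asList("Religion", "Jews")));
    }
}
```

The terms from all of a book's paths (category, themes) would then go into the single "line" field as separate values.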

This then translates to the following JSON books for ES: https://gist.github.com/933bc88c63d30e9f2754 (the part under
JavaScript - my mistake, sorry)

These are then indexed, and now we want to query for them from Java.
Here we want all books for all countries, but with the full tree
then retrieved by a facet for level 0 (think of all sub-trees
having one common root under 0), so we want to get facets with
the values:

"_0_Religion"
"_0_Historic"

so in Java we run a query: "Java query level 0" https://gist.github.com/933bc88c63d30e9f2754

but instead of the expected result with facetFilter { new
TermsFilterBuilder("linie_mv", "0*") } we get every value entered in the
multi-valued field, because all books match the query, so the facet
filter had no effect. However, there is another option that might help,
a regex, so we do:
-> "Java" https://gist.github.com/933bc88c63d30e9f2754

but then this happens: if we enter regex("0") we get back an empty
facet, i.e. TermsFacet: []
if we enter regex("
0*") we get an ugly error from ES: ERROR -> https://gist.github.com/933bc88c63d30e9f2754

I hope this makes clearer what I have to achieve and what didn't
work out as expected. Any help on solving this is really appreciated,
because if I can't fix this very soon I will have to ditch ES.

Best,

K.


(system) #13