Issue with pattern analyzer


(Scott Decker) #1

Hey All,
I am trying to set up a pattern analyzer for indexes, but something
just isn't working.

Here is what I have tried in our elasticsearch.json:

"pipeDelim": {
    "type": "pattern",
    "stopwords": "none",
    "flags": "DOTALL",
    "lowercase": true,
    "pattern": "\|"
},

and

"pipeDelimTest": {
    "type": "custom",
    "tokenizer": "pattern",
    "pattern": "\|",
    "filter": ["lowercase"]
},

and I index the following value into a doc field that uses one of these analyzers:
Kobe Bryant|Lamar Odom

Doing a search for
Kobe Bryant (term query)
returns nothing.

If I search for
bryant (term query)
then the document comes back.

If I search for
kobe bryant (term query)
nothing comes back.

So it seems like it is just doing the lowercasing, but not splitting on
the pipe delimiter to separate tokens before lowercasing.

Any ideas on how to set this up so it splits on the pipe
character and then lowercases the tokens?

Thanks,
Scott


(Scott Decker) #2

To give more clarity on this, we had been running on 0.17.6 and just
upgraded to 0.18.7.
We re-indexed everything, and the analyzers that we had set up no longer
worked.

We set some default ones in the elasticsearch.json setup file for ES
itself, so that when we create indexes, anyone can use the main default
ones we have, like this pipeDelim.

However, it doesn't seem like it is working anymore.
Here is our snippet from the JSON for the analyzers in
elasticsearch.json:
"index": {
    "analysis": {
        "analyzer": {
            "pipeDelim": {
                "type": "pattern",
                "stopwords": "none",
                "flags": "DOTALL",
                "lowercase": true,
                "pattern": "\|"
            },
            "pipeDelimTest": {
                "type": "custom",
                "tokenizer": "pattern",
                "pattern": "\|",
                "filter": ["lowercase"]
            }
        }
    }
}

Neither of those seems to work now.


(Scott Decker) #3

More info.

I try this:

curl -X GET "http://ec2loc:9200/index-to-test/_analyze?analyzer=pipeDelim" -d "Kobe Bryant|Lamar Odom"
and this is the output
{"tokens":[{"token":"kobe bryant","start_offset":0,"end_offset":11,"type":"word","position":1},{"token":"lamar odom","start_offset":12,"end_offset":22,"type":"word","position":2}]}

So, obviously the analyzer works, but the documents being put in do
not seem to be getting analyzed correctly, nor can we search against
those tokens.
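To make the expected tokens concrete, here is a minimal Python sketch of what a "split on the pipe, then lowercase" analyzer should emit for this input. It is only an illustration of the intended behaviour, not Elasticsearch's implementation; the function name `pipe_delim_analyze` is hypothetical.

```python
import re

def pipe_delim_analyze(text, pattern=r"\|", lowercase=True):
    """Split text on a regex separator and optionally lowercase each token.

    Returns (token, start_offset, end_offset) tuples. This only mimics the
    intended behaviour of the pipeDelim analyzer; it is not ES code.
    """
    tokens = []
    start = 0
    for match in re.finditer(pattern, text):
        if match.start() > start:  # skip empty tokens between separators
            tokens.append((text[start:match.start()], start, match.start()))
        start = match.end()
    if start < len(text):  # trailing token after the last separator
        tokens.append((text[start:], start, len(text)))
    if lowercase:
        tokens = [(tok.lower(), s, e) for tok, s, e in tokens]
    return tokens

tokens = pipe_delim_analyze("Kobe Bryant|Lamar Odom")
for tok, s, e in tokens:
    print(tok, s, e)
# kobe bryant 0 11
# lamar odom 12 22
```

The tokens and offsets agree with the _analyze output above, so the analyzer definition itself appears to do what was intended.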

Here is an example field mapping from one of the indexes:
"entities": {
    "type": "string",
    "analyzer": "pipeDelim"
}

Anyone have any thoughts, or what else to try and test?


(Clinton Gormley) #4

On Mon, 2012-02-13 at 11:58 -0800, Scott Decker wrote:

More info.

The problem, I think, is that you are providing snippets of info, rather
than gisting a complete example demonstrating the issue.

See http://www.elasticsearch.org/help

I'm guessing that the problem is not in the analysis, but how you are
searching.

clint


(Scott Decker) #5

Here ya go: https://gist.github.com/1827576

The first entry is the elasticsearch.json that we use to start ES.
The second is the mapping I have for my simple test index.
The third is the document I insert into the test index (yes, it
goes in as type "document").
The fourth is the query I run expecting the pipe delimiting to work; it
returns no results.
The fifth is the query I run that should not work, but does.

Let me know if you need more info
Thanks,
scott


(ppearcy) #6

Complete, fully reproducible curl commands are the way to go so that
others can easily reproduce the problem. At a glance, this looks correct, but
an end-to-end reproduction will shed further light on things. Here is
an example that reproduces a synonym bug (which is actually fixed in the
soon-to-be-released Lucene 3.6):
https://gist.github.com/1349777

Best Regards,
Paul


(Clinton Gormley) #7

On Tue, 2012-02-14 at 13:15 -0800, ppearcy wrote:

Complete, fully reproducible curl commands are the way to go for
others to easily reproduce.

++ exactly!

You're missing a bunch of information, like what 'type' your doc is.
You have two different types with different analysis for the same field
(which can cause clashes depending on how you search).

Also, your pipe-delim analyzer lowercases, but then you're searching for
"Kobe Bryant" with a term query, which does no analysis (i.e. it looks for
the exact term "Kobe Bryant" while your indexed term is "kobe bryant").

clint
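
The difference between an unanalyzed term lookup and an analyzed query can be sketched with a toy inverted index in Python. This is a rough model only, and it assumes the field really was indexed with the pipe-delimited, lowercasing analyzer. If the document was indexed through a different type or analyzer (the clash mentioned above), the stored tokens would differ. All function names here are hypothetical.

```python
def analyze(text):
    """Toy analyzer: split on '|' and lowercase, as pipeDelim is meant to do."""
    return [tok.lower() for tok in text.split("|") if tok]

# Toy inverted index: token -> set of document ids.
index = {}

def index_doc(doc_id, text):
    """Analyze the field value at index time and store the resulting tokens."""
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)

def term_lookup(term):
    """Like a term query: the input is compared verbatim, with no analysis."""
    return index.get(term, set())

def analyzed_lookup(text):
    """Like an analyzed query: the input goes through the analyzer first."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits

index_doc(1, "Kobe Bryant|Lamar Odom")

print(term_lookup("Kobe Bryant"))      # set(): case differs from the stored token
print(term_lookup("kobe bryant"))      # {1}: exact match on the stored token
print(analyzed_lookup("Kobe Bryant"))  # {1}: lowercased by the analyzer first
```

Under that assumption, only the capitalised term lookup misses, which is the behaviour described above.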


(Scott Decker) #8

Working on the curl info. We use the Java API rather than straight
curl, so it is a bit difficult to translate one to one.

In response to the last comment though, the term "Kobe Bryant" should
be correct; it is in Solr/Lucene.
If I search for "Kobe Bryant", the ES server should then run that
through the analyzer for the field and then search for it, yes? That
should turn it into "kobe bryant" because it will look for the
pipe character, find none, then lowercase the token.
Or are you saying this should be done via query_string instead?

Scott


(Clinton Gormley) #9

In response to the last comment though, the term "Kobe Bryant" should
be correct. it is in Solr/Lucene.
If I say search for "Kobe Bryant" the ES server should then run that
through the analyzer for the field, and then search for it, yes? That
should then turn it into "kobe bryant" because it will look for the
pipe character, find none, then lowercase the token.
Or are you saying this should be done via query_string instead?

The term query doesn't analyze. text/query_string/field queries do
analyze.

c


(Scott Decker) #10

Well jigger my timbers.
That was it.
Using the query string works fine, but the term query did not.

Hmm... I would recommend updating the docs on that, as that wasn't
quite clear.
Maybe something on this page:
http://www.elasticsearch.org/guide/reference/query-dsl/

that calls out that
query_string input is analyzed by ES when you search, while
term, prefix and other such queries must be tokenized by your
application before sending to ES.
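
As a rough illustration of "tokenized by your application", the Python sketch below normalizes user input on the client side before putting it into a term query body. It assumes the only normalization the field's analyzer applies to a single pipe-free value is lowercasing, and it reuses the `entities` field name from the mapping earlier in the thread. The helper names are hypothetical, and the sketch only builds the request body without sending it anywhere.

```python
import json

def normalize_term(user_input):
    """Mimic the index-time analysis for a single value: trim and lowercase.

    Assumption: the field's analyzer only splits on '|' and lowercases, so a
    pipe-free input maps to exactly one lowercased token.
    """
    return user_input.strip().lower()

def term_query_body(field, user_input):
    """Build a term query body whose term matches the stored (lowercased) token."""
    return {"query": {"term": {field: normalize_term(user_input)}}}

body = term_query_body("entities", "Kobe Bryant")
print(json.dumps(body))
# {"query": {"term": {"entities": "kobe bryant"}}}
```

The alternative discussed in this thread is to use a query type that analyzes its input, such as query_string, and let ES apply the field's analyzer itself.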

On Feb 15, 7:31 am, Clinton Gormley cl...@traveljury.com wrote:

In response to the last comment though, the term "Kobe Bryant" should
be correct. it is in Solr/Lucene.
If I say search for "Kobe Bryant" the ES server should then run that
through the analyzer for the field, and then search for it, yes? That
should then turn it into "kobe bryant" because it will look for the
pipe character, find none, then lowercase the token.
Or are you saying this should be done via query_string instead?

The term query doesn't analyze. text/query_string/field queries do
analyze.

c

Scott

On Feb 15, 2:19 am, Clinton Gormley cl...@traveljury.com wrote:

On Tue, 2012-02-14 at 13:15 -0800, ppearcy wrote:

Complete, fully reproducible curl commands are the way to go for
others to easily reproduce.

++ exactly!

You're missing a bunch of information, like what 'type' is your doc.
You have two different types with different analysis for the same field
(which can cause clashes depending on how you search).

Also, your pipe-delim analyzer lowercases, but then you're searching for
"Kobe Bryant" with a term query, which does no analysis (ie it looks for
the exact term "Kobe Bryant" while your term is "kobe bryant"

clint

Taking a glance, this looks correct, but
an end to end reproduction will shed further light on things. Here is
an example to reproduce a synonym bug (that is actually fixed in the
soon to be released lucene 3.6):
https://gist.github.com/1349777

Best Regards,
Paul

On Feb 14, 8:31 am, Scott Decker sc...@publishthis.com wrote:

Here ya gohttps://gist.github.com/1827576

first entry is the elasticsearch.json that is what we use to start es
the second is the mapping I have for my simple test index
the third is the document i insert against the test index (yes, it
goes in as type "document")
the fourth is the query I run expecting that pipe delim works, no
results
the fifth is the query I run that should not work, but does

Let me know if you need more info
Thanks,
scott

On Feb 14, 1:57 am, Clinton Gormley cl...@traveljury.com wrote:

On Mon, 2012-02-13 at 11:58 -0800, Scott Decker wrote:

More info.

The problem, I think, is that you are providing snippets of info, rather
than gisting a complete example demonstrating the issue.

Seehttp://www.elasticsearch.org/help

I'm guessing that the problem is not in the analysis, but how you are
searching.

clint

I try this:

curl -X GET "http://ec2loc:9200/index-to-test/_analyze?
analyzer=pipeDelim" -d "Kobe Bryant|Lamar Odom"
and this is the output
{"tokens":[{"token":"kobe bryant","start_offset":0,"end_offset":
11,"type":"word","position":1},{"token":"lamar odom","start_offset":
12,"end_offset":22,"type":"word","position":2}]}

So, obviously the analyzer works, but the documents being put in do
not seem to be getting analyzed correctly, nor can we search against
those tokens.

Here is an example index mapping from one of the indexes
entities: {

analyzer: pipeDelim
type: string

}

Anyone have any thoughts, or what else to try and test?



(Ted Hromadka) #11

agreed... the docs do mention what is not analyzed, but could be
louder about tokenizing

http://www.elasticsearch.org/guide/reference/query-dsl/term-filter.html

http://www.elasticsearch.org/guide/reference/query-dsl/prefix-filter.html

On Feb 15, 9:34 am, Scott Decker sc...@publishthis.com wrote:

Well jigger my timbers.
That was it.
using the query string works fine, but the term one did not.

Hmm.. I would recommend updating the docs on that, as that wasn't
quite clear.
Maybe something on this page here: http://www.elasticsearch.org/guide/reference/query-dsl/

that calls out
query_string is handled by ES when you search
all other term|prefix and other queries must be tokenized by your
application before sending to ES
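
If you do want to stick with a term query, the application-side step described above could look roughly like this in Python. It's only a sketch: `build_term_query` is a hypothetical helper, the field name `entities` is taken from the mapping quoted in this thread, and the normalization assumes the field is indexed by splitting on the pipe and lowercasing.

```python
import json

def build_term_query(field, raw_value):
    """Build a term query body, lowercasing the value client-side
    because a term query is not analyzed on the server."""
    return {"query": {"term": {field: raw_value.lower()}}}

body = build_term_query("entities", "Kobe Bryant")
print(json.dumps(body))
# -> {"query": {"term": {"entities": "kobe bryant"}}}
```

The value sent is one whole token ("kobe bryant"), matching what the analyzer stored, rather than separate words.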

On Feb 15, 7:31 am, Clinton Gormley cl...@traveljury.com wrote:

In response to the last comment though, the term "Kobe Bryant" should
be correct. it is in Solr/Lucene.
If I say search for "Kobe Bryant" the ES server should then run that
through the analyzer for the field, and then search for it, yes? That
should then turn it into "kobe bryant" because it will look for the
pipe character, find none, then lowercase the token.
Or are you saying this should be done via query_string instead?

The term query doesn't analyze. text/query_string/field queries do
analyze.

c
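
To make that difference concrete, here is a toy model in Python. It is not ES itself: the index, `analyze`, `term_query` and `analyzed_query` are hypothetical names for illustration, it assumes the field really is indexed with the pipe-delimited, lowercasing analyzer, and it ignores the query-syntax parsing that a real query_string does.

```python
def analyze(text):
    """Stand-in for the pipeDelim analyzer: split on '|' and lowercase."""
    return [piece.lower() for piece in text.split("|") if piece]

# Toy inverted index: term -> set of document ids.
index = {}
for term in analyze("Kobe Bryant|Lamar Odom"):
    index.setdefault(term, set()).add(1)

def term_query(value):
    """Exact lookup with no analysis: value must equal an indexed term."""
    return index.get(value, set())

def analyzed_query(text):
    """Run the text through the analyzer first, then look up each term."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits

print(term_query("Kobe Bryant"))      # set(): the indexed term is lowercased
print(term_query("kobe bryant"))      # {1}
print(analyzed_query("Kobe Bryant"))  # {1}
```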

Scott

On Feb 15, 2:19 am, Clinton Gormley cl...@traveljury.com wrote:

On Tue, 2012-02-14 at 13:15 -0800, ppearcy wrote:

Complete, fully reproducible curl commands are the way to go for
others to easily reproduce.

++ exactly!

You're missing a bunch of information, like what 'type' is your doc.
You have two different types with different analysis for the same field
(which can cause clashes depending on how you search).

Also, your pipe-delim analyzer lowercases, but then you're searching for
"Kobe Bryant" with a term query, which does no analysis (i.e. it looks for
the exact term "Kobe Bryant" while your indexed term is "kobe bryant").

clint

Taking a glance, this looks correct, but
an end to end reproduction will shed further light on things. Here is
an example to reproduce a synonym bug (that is actually fixed in the
soon to be released lucene 3.6):
https://gist.github.com/1349777

Best Regards,
Paul



(system) #12