I have a slightly odd requirement as far as analyzers are concerned.
Wondering if anyone has any tips/suggestions.
I have an item I am indexing (grade) that has a property (name) whose value
can be "0# (99.995%)".
I am doing a prefix search on _all.
I want users to be able to search using 99 or 99.9 or 99.995 or 99.995%.
I also want the user to be able to copy-paste "0# (99.995%)" and it should
work.
I am currently using the whitespace analyzer - which works for many of my
cases except the tricky one above.
99.995 doesn't work.
But "(99.995" does, because after whitespace tokenization the
token begins with "(".
I could filter out the "(" and ")" characters, but then "0# (99.995%)" won't
work.
Does anyone have some different suggestions?
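To make the failure concrete, here is a minimal sketch in plain Python (not Elasticsearch itself) that approximates whitespace tokenization followed by a per-token prefix match. The function names are hypothetical, and it assumes the query text is tokenized the same way as the indexed value.

```python
def whitespace_tokens(text):
    """Split on whitespace only, roughly what a whitespace analyzer emits."""
    return text.split()


def prefix_matches(query, indexed_text):
    """True if every query token is a prefix of some indexed token."""
    indexed = whitespace_tokens(indexed_text)
    return all(
        any(token.startswith(q) for token in indexed)
        for q in whitespace_tokens(query)
    )


name = "0# (99.995%)"
print(whitespace_tokens(name))  # ['0#', '(99.995%)']
for query in ["99.995", "(99.995", "0# (99.995%)"]:
    print(repr(query), prefix_matches(query, name))
```

The indexed token is "(99.995%)", so only queries that start with "(" can match it by prefix.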
I would start by creating an indexing/querying analyzer specifically for
the field you know has this format.
Beyond that, I think your likeliest path to success is somewhere in
the character filters domain.
Character filters are applied to the string before the tokenizer runs.
One possibility here is a pattern replace char filter.
If you can write a pattern that matches all of the allowed values of this
field and replaces them with just the number, and you
apply that pattern at both index and search time, then you are only
dealing with searching for the numbers.
You may need a different character filter for the search analyzer, though,
since you are allowing more formats in queries than
are found in the source document field.
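A rough sketch of that idea in plain Python, using `re.sub` as a stand-in for a pattern replace char filter. The two patterns and all function names are assumptions made up for this illustration; they only cover values shaped like "0# (99.995%)" and would need adapting to the real data.

```python
import re

# Hypothetical pattern: a whole value such as "0# (99.995%)" is replaced
# by just the number inside the parentheses.
FULL_VALUE = re.compile(r"^.*\((\d+(?:\.\d+)?)%\)\s*$")
# Hypothetical extra search-side cleanup for partial input like "(99.995" or "99.995%".
SEARCH_NOISE = re.compile(r"[()%]")


def index_char_filter(text):
    """Index side: keep only the number from a full value."""
    return FULL_VALUE.sub(r"\1", text)


def search_char_filter(text):
    """Search side: handle a pasted full value, then strip stray ( ) %."""
    return SEARCH_NOISE.sub("", FULL_VALUE.sub(r"\1", text))


def matches(query, indexed_value):
    """Prefix-match each filtered query token against the filtered indexed tokens."""
    indexed_tokens = index_char_filter(indexed_value).split()
    query_tokens = search_char_filter(query).split()
    return all(
        any(token.startswith(q) for token in indexed_tokens)
        for q in query_tokens
    )


for query in ["99", "99.9", "99.995", "99.995%", "0# (99.995%)"]:
    print(repr(query), matches(query, "0# (99.995%)"))
```

One trade-off of reducing the value to just the number: the "0#" part is no longer searchable on this field, so it would have to be indexed separately if users need it.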
(In reply to mooky's original question, posted Tuesday, July 15, 2014 10:40:30 AM UTC-4.)
It leads me to think that it would be very useful to use a series of
specialist (special-case) analyzers in conjunction with the standard
analyzer.
Back to my original example, "0# (99.995%)": what I really want is
something that will extract "99.995%".
The standard analyzer will extract "99.995" (and the rest of the text), while the
whitespace analyzer will extract "(99.995%)".
Does a financial/numeric/accounting analyzer already exist? I.e., something
that extracts "99.995%" or "$44.5665" or "-45bps"?
-M
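As an illustration of what such a special-case extractor might do, here is a small regex-based sketch in plain Python. The pattern is an assumption invented for this example (optional sign, optional "$", a number, optional "%" or "bps"); it is not an existing Elasticsearch analyzer and does not cover other currencies or thousands separators.

```python
import re

# Hypothetical pattern: optional sign, optional "$", digits with an optional
# decimal part, then an optional "%" or "bps" suffix.
FINANCIAL_TOKEN = re.compile(r"[-+]?\$?\d+(?:\.\d+)?(?:%|bps)?")


def extract_financial_tokens(text):
    """Return every substring that looks like a financial quantity."""
    return FINANCIAL_TOKEN.findall(text)


print(extract_financial_tokens("0# (99.995%)"))
print(extract_financial_tokens("spread -45bps at $44.5665"))
```

Note that the leading "0" of "0#" is also picked up as a number, so a real version would need a rule for which numbers count.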
On Tuesday, 15 July 2014 18:58:46 UTC+1, mooky wrote:
Thanks. That looks interesting!
A little late to the party, but I would have used a custom index analyzer with lowercase, pattern, and edge n-gram, and a search analyzer with lowercase and pattern (you may have to flip the order of lowercase and pattern).
With the pattern tokenizer you can specify a regex.
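The following plain-Python sketch imitates that setup so the matching behaviour can be checked without a cluster. It assumes a separator pattern of whitespace and parentheses and edge n-grams of 1 to 20 characters; both choices, and all function names, are assumptions for illustration rather than values from this thread.

```python
import re

# Hypothetical separator pattern: split on runs of whitespace and parentheses.
SEPARATORS = re.compile(r"[\s()]+")


def pattern_tokens(text):
    """Lowercase, then split on the separator pattern, dropping empty tokens."""
    return [token for token in SEPARATORS.split(text.lower()) if token]


def edge_ngrams(token, min_gram=1, max_gram=20):
    """All prefixes of the token between min_gram and max_gram characters."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def index_terms(text):
    """Index side: pattern tokens expanded into their edge n-grams."""
    terms = set()
    for token in pattern_tokens(text):
        terms.update(edge_ngrams(token))
    return terms


def search_matches(query, indexed_text):
    """Search side: pattern tokens only, looked up as exact terms."""
    terms = index_terms(indexed_text)
    return all(token in terms for token in pattern_tokens(query))


for query in ["99", "99.9", "99.995", "99.995%", "0# (99.995%)", "99.996"]:
    print(repr(query), search_matches(query, "0# (99.995%)"))
```

Because the prefixes are stored at index time, the search side needs no prefix query: each query token is compared as a whole term.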
Elasticsearch is not domain-specific, so it won't extract these
financial terms out of the box.
You will need to write your own analyzer to do this.
On Tuesday, 15 July 2014 16:15:23 UTC+1, vineeth mohan wrote:
Hello Mooky,
You can apply multiple analyzers to a field with the Yakaz
elasticsearch-analysis-combo plugin (on GitHub).
So you can add all your analyzers there and apply them.
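Conceptually, combining analyzers means running the same text through each one and indexing the merged tokens. The sketch below shows only that idea in plain Python; it does not use the plugin's API, both analyzer functions are hypothetical stand-ins, and dropping repeated tokens is a simplification chosen here.

```python
import re


def whitespace_analyzer(text):
    """Stand-in for a whitespace analyzer."""
    return text.split()


def numeric_analyzer(text):
    """Hypothetical special-case analyzer: numbers with an optional '%'."""
    return re.findall(r"\d+(?:\.\d+)?%?", text)


def combo_analyze(text, analyzers):
    """Run every analyzer over the text and merge tokens, keeping first occurrences."""
    seen = set()
    combined = []
    for analyzer in analyzers:
        for token in analyzer(text):
            if token not in seen:
                seen.add(token)
                combined.append(token)
    return combined


print(combo_analyze("0# (99.995%)", [whitespace_analyzer, numeric_analyzer]))
```

With both token sets indexed, a prefix search for "99.9" can hit "99.995%" while a pasted "(99.995%)" still matches the whitespace token.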