Elasticsearch search for words having '#' character

prince · July 10, 2013, 6:27am

For example, I am right now searching like this:

http://localhost:9200/posts/post/_search?q=content:%23sachin

But I am getting all the results containing 'sachin', not just those containing '#sachin'. I am also using a regular expression in a terms facet to get the count of terms. The facet looks like this:

"facets": {
  "content": {
    "terms": {
      "field": "content",
      "size": 1000,
      "all_terms": false,
      "regex": "#sachin",
      "regex_flags": [
        "DOTALL",
        "CASE_INSENSITIVE"
      ]
    }
  }
}

This is not returning any values. I think it has something to do with escaping the '#' inside the regular expression, but I am not sure how to do it. I have tried to escape it \ and \, but it did not work. Can anyone help me in this regard?

prince · July 10, 2013, 6:29am

I have posted the same question on Stack Overflow (http://stackoverflow.com/questions/17526736/elasticsearch-search-fo-words-having-character)

radu_gheorghe · July 10, 2013, 8:46am

Hello,

I think the standard analyzer
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-analyzer/)
will get rid of your #, and that's why it doesn't show up in searches.
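
To make the effect concrete, here is a rough Python sketch. It does not talk to Elasticsearch: `standard_like_tokens` is a hypothetical stand-in that only approximates the standard analyzer by keeping runs of letters and digits and lowercasing them. Under that assumption, it shows why a query for #sachin matches plain 'sachin' and why a facet regex of "#sachin" finds nothing: the indexed terms no longer contain the '#'.

```python
import re

def standard_like_tokens(text):
    """Hypothetical approximation of the standard analyzer (an assumption,
    not the real implementation): keep runs of word characters, lowercase
    them, and drop punctuation such as '#'."""
    return [t.lower() for t in re.findall(r"\w+", text)]

# Terms that would end up in the index for this document.
doc = "#sachin big news, retires from ODI"
indexed = standard_like_tokens(doc)
print(indexed)

# The query string goes through the same analysis, so the '#' is lost too.
query_terms = standard_like_tokens("#sachin")
print(query_terms)

# The facet regex is applied to indexed terms, none of which contain '#'.
facet_regex = re.compile("#sachin", re.DOTALL | re.IGNORECASE)
matching_terms = [t for t in indexed if facet_regex.search(t)]
print(matching_terms)
```

So escaping the '#' in the regex cannot help by itself; the character has to survive analysis first.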

If you want exact matches for that field, the easiest way is to index
it as "not_analyzed". Here's a curl example that should work:

https://gist.github.com/radu-gheorghe/5964537

sharp_search.sh

curl -XDELETE localhost:9200/test1
curl -XPUT localhost:9200/test1
curl -XPUT localhost:9200/test1/test/_mapping -d '{
  "test": {
    "properties": {
      "foo": {
        "type": "string",
        "index": "not_analyzed"
      }
    }

This file has been truncated.

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Elasticsearch-search-fo-words-having-character-tp4037822p4037823.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

spinscale · July 10, 2013, 9:39am

Hey,

You should not facet on analyzed fields if you don't want to run out of
memory pretty quickly (because every term of the inverted index gets loaded
into memory for this field, which may be a lot, depending on the size of
the index).

--Alex

prince · July 10, 2013, 11:02am

Hi Radu,

But what I need to search really has to be "analyzed": if the content is something like '#sachin big news, retires from ODI', a search for '#sachin' should return it, but if it is like 'sachin, you are awesome', it should not.

"not_analyzed" will get only exact matches and is not useful for my use case (here, it would match only content whose full text is "#sachin", not content containing "#sachin").

radu_gheorghe · July 10, 2013, 11:52am

Hello,

In that case, you need to change your analyzer
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/).
Maybe the Whitespace Analyzer
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/whitespace-analyzer/)
is more appropriate here?

Also, please note Alex's warning regarding memory usage. I'm not sure what
your hardware and data look like, but it might be worth paying some price at
index time for more performance and less memory usage during searches. For
example, you can parse your text for hashtags and store them in some "tags"
field (which can be an array), that you can store as not_analyzed and facet
separately.
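
A minimal sketch of that index-time step, in Python, could look like the following. The helper names `extract_hashtags` and `build_document` are hypothetical, and lowercasing the tags is an assumption made here so that '#Sachin' and '#sachin' land in the same facet bucket (a not_analyzed field is matched exactly as stored).

```python
import re

# A hashtag is taken to be '#' followed by word characters (an assumption;
# adjust the pattern to whatever the application treats as a hashtag).
HASHTAG_RE = re.compile(r"#\w+")

def extract_hashtags(text):
    """Return lowercased hashtags in order of first appearance, without duplicates."""
    seen = []
    for tag in HASHTAG_RE.findall(text):
        tag = tag.lower()
        if tag not in seen:
            seen.append(tag)
    return seen

def build_document(content):
    """Build the document to index: the original text plus a separate 'tags' array."""
    return {"content": content, "tags": extract_hashtags(content)}

doc = build_document("#Sachin big news, retires from ODI #cricket #sachin")
print(doc["tags"])  # ['#sachin', '#cricket']

plain = build_document("sachin, you are awesome")
print(plain["tags"])  # []
```

The "tags" field would then be mapped as not_analyzed and used for the terms facet, while "content" stays analyzed for full-text search.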

Best regards,
Radu

prince · July 10, 2013, 12:12pm

Hi Radu,

I have found this post http://webcache.googleusercontent.com/search?q=cache:http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html

My mapping, right now, is like this:

{
"content": {"type": "string", "index": "analyzed"}
}

As I am new to Elasticsearch, I am not sure how this mapping should change to match strings starting with "#" and "@".
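
One possible shape for such a change is sketched below as a Python dict that serializes to the JSON body for index creation. This is an untested assumption, not a confirmed answer from the thread: it pairs the whitespace tokenizer with a word_delimiter filter whose type_table treats '#' and '@' as letters, so that '#sachin' stays one token while other punctuation is still stripped. The names `tweet_analyzer` and `keep_hash_and_at` are made up for illustration; the type name `post` is taken from the search URL in the first message.

```python
import json

# Hypothetical index settings and mapping, in the "string"/"analyzed"
# mapping style used elsewhere in this thread.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Treat '#' and '@' as alphabetic so they are not split off.
                "keep_hash_and_at": {
                    "type": "word_delimiter",
                    "type_table": ["# => ALPHA", "@ => ALPHA"],
                }
            },
            "analyzer": {
                "tweet_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "keep_hash_and_at"],
                }
            },
        }
    },
    "mappings": {
        "post": {
            "properties": {
                "content": {"type": "string", "analyzer": "tweet_analyzer"}
            }
        }
    },
}

# Print the JSON that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

Documents that are already indexed would need to be reindexed for a new analyzer to take effect, and the memory warning about faceting on an analyzed field still applies.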

prince · July 10, 2013, 12:45pm

Hi Radu,

I am using Elasticsearch to search a table of selected Twitter posts. We have a UI that allows users to search for multiple terms using an OR condition, and we need to display the count of records for each term. We are using the facet to get the count of individual terms. So it is not limited to hashtags; it applies to anything the user searches for.

So, is there a better way than using facets to get this count?

radu_gheorghe · July 10, 2013, 1:57pm

Hello,

Cool! Then I think you should try that and see how it works for you. I
thought the whitespace analyzer alone would be enough, but maybe it isn't. I
didn't test it.

I think you should run a test, see how it fits your use-case and if you
need something more you can always come back here and ask more questions.

Best regards,
Radu

radu_gheorghe · July 10, 2013, 1:59pm

On Wed, Jul 10, 2013 at 3:45 PM, prince wrote:

> So, is there a better way than using facets to get this count?

Hmm... I don't see one. If you need counts for all the terms, then you have
to have memory for all the terms.

I would monitor the cluster with something like SPM
(http://sematext.com/spm/elasticsearch-performance-monitoring/).
You'll be able to see how your memory, field cache, etc. go up and down as
you use Elasticsearch. Then you can tell whether you have enough hardware
for the dataset and usage you're expecting.

Best regards,
Radu
