I just started experimenting with Elasticsearch and everything is still
very overwhelming, so I was hoping that somebody could point me in the
right direction with these questions.
My documents contain *Url* and *Content* fields. I have two lists, one
containing ~1000 domain names and another with ~5000 words/phrases, that I
would like to act as stop words. That is, when I run a search and a
document's Url or Content contains any of these excludes, I don't want that
document returned in the search results.
What is the best way to search for numeric values within text? For
example, I have the text "Facebook Now Has 1.15 Billion Monthly Active Users".
I would like to search for 1 500 000 000, and for ranges as well, like
1 500 000 000 - 2 000 000 000. Can this be done within a text field using
some kind of special number analyzer and tokenizer? Or should I extract all
numeric values first, store them as an array in a separate field, and then
use a numeric range filter?
Just found the OpenNLP plugin for Elasticsearch. It can detect entities like
money, location, date, etc. and store them in separate fields for filtering.
Janno
(Follow-up to the message Janno Järv wrote on Saturday, 3 August 2013, 16:16:41 UTC+3.)
Also, if you do not want to return certain documents in your search
results, it might make more sense not to index them at all...
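To illustrate the "do not index them at all" idea, here is a minimal Python sketch of a filter that runs before documents are sent to Elasticsearch. The list contents and the function names (`is_excluded`, `indexable`) are made up for the example. It assumes Url holds an absolute URL with a scheme, and it uses naive case-insensitive substring matching, so "casino" would also match "casinos".

```python
from urllib.parse import urlparse

# Hypothetical exclusion lists; the real ones would hold ~1000 domains
# and ~5000 words/phrases.
EXCLUDED_DOMAINS = {"spam.example", "ads.example"}
EXCLUDED_PHRASES = ["buy cheap", "casino"]


def is_excluded(doc):
    """Return True if the document's Url or Content matches an exclude."""
    # Assumes an absolute URL; hostname is None when the scheme is missing.
    host = (urlparse(doc["Url"]).hostname or "").lower()
    # Match the excluded domain itself and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS):
        return True
    # Naive, case-insensitive substring check over both fields.
    text = (doc["Url"] + " " + doc["Content"]).lower()
    return any(phrase in text for phrase in EXCLUDED_PHRASES)


def indexable(docs):
    """Return only the documents that should be sent to the index."""
    return [d for d in docs if not is_excluded(d)]
```

With a few thousand phrases, the linear scan per document is probably acceptable at indexing time; if it is not, a single precompiled regular expression or an Aho-Corasick matcher would be the usual next step.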
A short note about the OpenNLP plugin: I merely wrote it as a trial
balloon, to find out whether I could. There are several reasons why NLP is
better done as a pre-indexing step (at least the way I implemented it;
there are deeper Lucene integrations, like the UIMA one, where integrating
it into Elasticsearch might be useful):
a) The model takes up a lot of RAM, and that is duplicated on each node.
b) You have to shut down your cluster if you want to update your model.
c) You have to shut down your cluster if you want to update your OpenNLP
libraries.
Lastly, I don't think the OpenNLP plugin helps with your second
requirement (at least, I did not intend it to).
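For the numeric requirement, the "extract first, store in a separate field" option from the question could be done as a pre-indexing step, roughly as sketched below in Python. This is only an illustration under assumptions: the field name `numbers`, the helper names, and the English-only scale words are invented for the example, and it does not handle digit groups separated by spaces or commas.

```python
import re
from decimal import Decimal

# Scale words and their multipliers (illustrative, English-only assumption).
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

# A number, optionally followed by a scale word, e.g. "1.15 Billion".
NUMBER_RE = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s+(thousand|million|billion))?\b", re.IGNORECASE
)


def extract_numbers(text):
    """Return every numeric value found in text, scaled to a plain float."""
    values = []
    for digits, scale in NUMBER_RE.findall(text):
        # Decimal avoids float rounding noise when multiplying "1.15" by 10**9.
        values.append(float(Decimal(digits) * SCALES.get(scale.lower(), 1)))
    return values


def to_document(url, content):
    """Build the document to index, with extracted values in 'numbers'."""
    return {"Url": url, "Content": content, "numbers": extract_numbers(content)}


def range_query(low, high):
    """Build a search body with a range query on the hypothetical 'numbers' field."""
    return {"query": {"range": {"numbers": {"gte": low, "lte": high}}}}
```

If `numbers` is mapped as a numeric type, a range query on it should match a document when any value in the array falls inside the range, so `range_query(1000000000, 2000000000)` would be expected to find the Facebook example, while the 1 500 000 000 - 2 000 000 000 range from the question would not, because 1.15 billion lies below it.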