How to index a word efficiently?


(thinking23487) #1

Hi all,

I have a list of 120 million domain names (around 2 GB of plain text), and the
challenge is to be able to search those domains by wildcards. As I see it, the
problem boils down to doing a wildcard search on a single word. To be honest,
I'm fairly new to NoSQL/document-oriented databases, so my first attempt used
PostgreSQL and LIKE because it's easy (although I knew it wouldn't really work
performance-wise): a query can take around 10 minutes or longer. From my
research it seems that Elasticsearch provides wildcard search out of the box,
and scaling would also come in handy, so I'll give it a try.

My first test, just inserting the domain names and querying with wildcards,
gave nearly the same results as PostgreSQL. My next idea is to split every
domain into small chunks and drop the wildcards at query time. That means I
would insert an entry like 'domain: elasticsearch, chunks: el la as st ti ic
cs se ea ar rc ch ela las ast ...', treat the chunks as the document I search,
and a query for '*elastic*' would then strip the wildcards and just search
for 'elastic'.
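For what it's worth, this chunking idea amounts to indexing every contiguous substring of each name, so a wildcard query reduces to an exact term lookup. A minimal sketch in plain Python (the function name and minimum chunk length are my own choices, not anything Elasticsearch-specific):

```python
def chunks(word, min_len=2):
    """All contiguous substrings of `word` with at least `min_len` characters."""
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + min_len, len(word) + 1)}

# A query like '*elastic*' strips its wildcards and becomes an exact
# membership test against the indexed chunk set.
print("elastic" in chunks("elasticsearch"))  # True
```

One caveat: a name of length n produces on the order of n^2 chunks, so the index grows much faster than the raw 2 GB of input.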

What would you suggest? Could Elasticsearch be the right tool for this? If so,
any ideas on optimization/indexing would be great. Another point I'm not sure
about is how many servers would give good performance; as a reference, how
many entries could one server handle in this case? Replication is a topic I'll
face later, once I've gained some basic experience. To give some numbers, I
was thinking of getting a count(*) for a query like 'elastic' in about 100 to
200 ms, and retrieving, say, the first 100 entries in roughly the same time. I
know it's hard to state facts from the few numbers I've given, so I'm mainly
interested in experienced opinions, or in what I should expect not to be fast.

As you can see, I'm very inexperienced with full-text search and
document-oriented work. Thanks for your suggestions!


(Rich Kroll) #2

Elasticsearch allows configuring the number of shards and replicas,
which directly affects search performance. Given the map/reduce
fashion in which ES queries the index, I would think you could achieve good
performance against your dataset. ES already breaks your documents down in the
index, so I don't think you will gain anything by attempting to do it
yourself. Another option may be to write a custom analyzer that creates
tokens on characters; then you would not need the wildcards.
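To illustrate what such a character-based analyzer would emit, here is a rough plain-Python sketch of fixed-size character n-grams (Elasticsearch can do this server-side with its nGram tokenizer; the function below only models the idea):

```python
def char_ngrams(text, n=3):
    """Sliding character n-grams, the kind of tokens an ngram analyzer emits."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Index and query get the same analysis, so 'elastic' matches
# 'elasticsearch' whenever every query n-gram occurs in the document.
print(set(char_ngrams("elastic")) <= set(char_ngrams("elasticsearch")))  # True
```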
On Feb 21, 2011 6:47 PM, "JonathanD" prometheus__0@hotmail.com wrote:


(Barsk) #3
Hi,

As a newbie to Elasticsearch myself I should maybe refrain from answering, but
I do recognize the confusion of coming from a SQL environment to
Elasticsearch. There is a lot to grasp, and I have only just begun myself.

However, when it comes to indexing the domain names, I would suggest using the
standard analyzer, which breaks the string on the dots, lowercases, and
removes stopwords. E.g. www.YAHOO.com = www yahoo com.


The standard analyzer uses the default set of English stopwords, though
(stopwords = words NOT to index), and you don't want that here, since every
word is significant in this context. But you might want to skip the top-level
domains, so why not treat those as stopwords? Something like this (you would
have to fill in the complete list of top-level domains, and maybe you can skip
www too). This example is in YAML format:
index :
    analysis :
        analyzer :
            myDomainAnalyzer :
                type : standard
                stopwords : [www, com, net, biz, gov, org, se, no, dk, fi]


Another approach is to use a pattern analyzer to skip the leading hostname
part of the domain name, i.e. www.YAHOO.com = yahoo.com:
index :
    analysis :
        analyzer :
            myDomainAnalyzer :
                type : pattern
                pattern : '\w+\.\w+$'
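As a quick sanity check of that regex itself (plain Python):

```python
import re

# \w+\.\w+$ matches the last two dot-separated labels of the name
match = re.search(r"\w+\.\w+$", "www.YAHOO.com")
print(match.group())  # YAHOO.com
```

Note, though, that as far as I can tell the pattern analyzer uses its regex to split the input into tokens rather than to extract matches, so the configuration above may need a separator-style pattern in practice.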
A third approach is to use a keyword analyzer to get the whole
domain name as-is and then use wildcards to search within the string.
If you also want lowercasing, you can configure a custom analyzer
with a lowercase filter:

index :
    analysis :
        analyzer :
            myDomainAnalyzer :
                type : custom
                tokenizer : keyword
                filter : [lowercase]

There is also the choice of storing the index in a multi_field setup
(http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html),
where each field is analyzed according to a different analyzer setup. Then you
can search with different approaches all in one and maybe combine the results
afterwards (that specific area is beyond me at the moment).




Well, I am not sure this answers your questions, but this is what I can
provide as a getting-started guide.

/Kristian

(Karussell) #4

Yes, sharding is necessary (12 shards?).

Also keep in mind that you'll need enough memory for every shard, and
that your index should be optimized after indexing.

Once you have made this fast, you should try an NGramFilter for the
wildcard part.
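For reference, an NGramFilter expands each token into all grams between a minimum and maximum length; roughly like this plain-Python sketch (the parameter names follow Lucene's NGramTokenFilter, the function itself is only illustrative):

```python
def ngram_filter(token, min_gram=2, max_gram=4):
    """Expand one token into every n-gram with min_gram <= n <= max_gram."""
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

print(ngram_filter("elastic", 2, 3))
# ['el', 'la', 'as', 'st', 'ti', 'ic', 'ela', 'las', 'ast', 'sti', 'tic']
```

A smaller min_gram matches shorter wildcard fragments but inflates the index, so the min/max settings are the main tuning knob here.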

Regards,
Peter.



(Otis Gospodnetić) #5

Hi,

You likely want to do the following:

  • Make sure you "flip" hostname pieces (foo.bar.com becomes
    com.bar.foo) before tokenizing
  • Index these pieces as multiple shingled tokens (first token is
    "com", second one is "com.bar", third one is "com.bar.foo").

The above lets you get all "com" domains, all "com.bar" subdomains,
etc. without using wildcards (at the expense of a slightly larger
index).

  • Potentially also "edge-ngram" the pieces if you want to be able to
    make prefix queries with trailing wildcards (e.g. "com.ba*" if you
    need to find all ba*.com domains).
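The two indexing steps above can be sketched like this (plain Python; the helper names are mine, not an ES API):

```python
def flip(domain):
    """Reverse the labels: foo.bar.com -> com.bar.foo"""
    return ".".join(reversed(domain.split(".")))

def shingles(flipped):
    """Prefix shingles of a flipped name: com, com.bar, com.bar.foo"""
    labels = flipped.split(".")
    return [".".join(labels[:i + 1]) for i in range(len(labels))]

print(shingles(flip("foo.bar.com")))  # ['com', 'com.bar', 'com.bar.foo']
```

An exact term query on any shingle then replaces a trailing-wildcard search, and edge-ngramming each shingle would cover the prefix case mentioned above.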

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Elastic Search
Lucene ecosystem search :: http://search-lucene.com/



(system) #6