Hi,
As an Elasticsearch newbie myself I should perhaps refrain from
answering, but I do recognize the confusion of moving from a SQL
environment to Elasticsearch. There is a lot to grasp, and I have
only just begun myself.
However, when it comes to indexing the domain names I would suggest
using the standard analyzer which, as I understand it, breaks the
string down on the dots, lowercases it and removes stopwords.
E.g. www.YAHOO.com = www yahoo com
(I have not verified that the tokenizer really splits on every dot, so
it is worth checking the resulting tokens with the analyze API.)
The standard analyzer uses the default set of English stopwords
though (stopwords = words NOT to index), and you don't want that in
this case, as every word is significant in this context. But you might
want to skip the top level domains, so why not treat those as
stopwords? Something like this (you have to fill in the complete
list of top level domains though). And maybe you can skip www too.
This example is in YAML format.
index :
    analysis :
        analyzer :
            myDomainAnalyzer :
                type : standard
                stopwords : [www, com, net, biz, gov, org, se, no, dk, fi]
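
To make the intended effect of the stopword list concrete, here is a rough Python approximation of the analysis described above (split on dots, lowercase, drop stopwords). It is only a sketch and not Elasticsearch code; the function name analyze_domain and the short stopword set are my own assumptions for illustration.

```python
# Rough approximation of the analysis chain described above.
# This is NOT Elasticsearch code; the stopword set is an incomplete sample.
STOPWORDS = {"www", "com", "net", "biz", "gov", "org", "se", "no", "dk", "fi"}


def analyze_domain(domain, stopwords=STOPWORDS):
    """Split on dots, lowercase, and drop stopword tokens."""
    tokens = [part.lower() for part in domain.split(".") if part]
    return [token for token in tokens if token not in stopwords]


print(analyze_domain("www.YAHOO.com"))    # ['yahoo']
print(analyze_domain("mail.Example.se"))  # ['mail', 'example']
```

With the top level domains and www treated as stopwords, only the distinguishing parts of each name end up as searchable tokens.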
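Another approach is to use a pattern analyzer to skip the leading host
name part of the domain name, i.e. www.YAHOO.com = yahoo.com. As far as
I understand, the pattern is a regular expression that matches the
separators between tokens, not the tokens themselves, so it has to
match the part you want to throw away:
index :
    analysis :
        analyzer :
            myDomainAnalyzer :
                type : pattern
                # separator = everything before the last two labels
                pattern : '^.*\.(?=[^.]+\.[^.]+$)'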
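
Since a pattern analyzer (to my understanding) splits on whatever the regular expression matches, it can help to try the separator regex outside Elasticsearch first. The sketch below uses Python's re module as a stand-in for the Java regex engine; the regex and the helper name registered_part are my own assumptions, and the lowercasing mimics what I believe is the pattern analyzer's default.

```python
import re

# Separator regex: everything up to and including the dot that precedes
# the last two labels. The text it matches is thrown away.
SEPARATOR = re.compile(r"^.*\.(?=[^.]+\.[^.]+$)")


def registered_part(domain):
    """Lowercase, split on the separator, and drop empty tokens."""
    return [token for token in SEPARATOR.split(domain.lower()) if token]


print(registered_part("www.YAHOO.com"))  # ['yahoo.com']
print(registered_part("yahoo.com"))      # ['yahoo.com']
print(registered_part("www.bbc.co.uk"))  # ['co.uk']
```

The last line shows the limit of a simple "last two labels" rule: names under second-level suffixes such as co.uk would need a smarter pattern or a list of suffixes.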
A third approach is to use the keyword tokenizer to keep the whole
domain name as is, and then use wildcards to search within the string.
If you also want to lowercase it, you can configure a custom analyzer:
index :
    analysis :
        analyzer :
            myDomainAnalyzer :
                type : custom
                tokenizer : keyword
                filter : [lowercase]
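
To illustrate what the keyword approach amounts to, here is a small Python sketch: each domain is stored as one lowercased term, and a wildcard is matched against that whole term. It uses fnmatch from the standard library purely as a stand-in for a wildcard query; the helper names are hypothetical, and lowercasing the search pattern on the client side is my assumption (as far as I know, wildcard terms are not run through the analyzer).

```python
from fnmatch import fnmatchcase


def index_term(domain):
    """Keyword tokenizer + lowercase filter: the whole name is one term."""
    return domain.lower()


def wildcard_search(terms, pattern):
    """Return the terms matching a wildcard pattern (* and ?)."""
    lowered = pattern.lower()  # assumption: the client lowercases the pattern
    return [term for term in terms if fnmatchcase(term, lowered)]


terms = [index_term(d) for d in ["www.YAHOO.com", "Elasticsearch.org", "yahoo.se"]]
print(wildcard_search(terms, "*yahoo*"))   # ['www.yahoo.com', 'yahoo.se']
print(wildcard_search(terms, "Elastic*"))  # ['elasticsearch.org']
```

Note that this sketch scans every term, which is roughly why a pattern with a leading wildcard tends to be expensive on a large index.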
There is also the option of indexing the field as a multi_field
(http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html),
where each sub-field is analyzed with a different analyzer. Then you
can search with the different approaches at once and maybe combine the
results afterwards (that specific area is beyond me ATM).
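
As a starting point, a multi_field mapping might look roughly like the structure below, built here as a Python dict and printed as JSON. This is only a sketch based on my reading of the linked page: the type name domain_doc, the sub-field name raw and the two analyzer names are hypothetical, and each analyzer would have to be defined under its own name in the index settings.

```python
import json

# Hypothetical mapping: one "domain" field indexed two ways.
# "domainTokens" and "domainKeyword" stand for analyzers defined elsewhere.
mapping = {
    "domain_doc": {  # hypothetical document type name
        "properties": {
            "domain": {
                "type": "multi_field",
                "fields": {
                    # default sub-field: tokenized domain parts
                    "domain": {"type": "string", "analyzer": "domainTokens"},
                    # extra sub-field: whole lowercased name, for wildcards
                    "raw": {"type": "string", "analyzer": "domainKeyword"},
                },
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

If I read the documentation correctly, the extra sub-field would then be addressed as domain.raw in queries.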
Well, I am not sure this answers your questions, but this is what I
can provide as a getting-started guide.
/Kristian
JonathanD wrote on 2011-02-19 15:32:
hi all
i have a list of 120 mio. domain names (at around 2gb plain text) and the
challenge is to be able to search those domains
by wildcards
as i can see the problem boils down to being able to do a wildcard search on
a word
to be honest im fairly new to nosql/document oriented db's and thus my first
try
was using postgresql and like cause its easy (altough i knew it wouldnt
kinda work in sense of performance)
a query can take at around 10 minutes or longer
from my search it seems that elasticsearch provides wildcard search out of
the box and scaling would also come in handy - thats why ill give it a try
my first test with just inserting the domain names and query with wildcards
resulted in nearly the same result as postgresql
the next idea i would have is to split every domain into small chunks and
ignore wildcards
this means i insert an entry like 'domain:elasticsearch, chunks:el la as st
ti ic cs se ea ar rc ch ela las ast .....' and chunks would be like a
document i want to search and thus a search for 'elastic' would result in
striping the wildcards and only search for 'elastic'
what would you suggest?
could elasticsearch be the right tool for this? and if yes, any ideas oo
optimization/indexing would be great. another point im not sure of is, how
many servers would provide good performance and as a reference how many
entries could one server handle (in this case)? replication is another topic
im facing later after i gained the basic experience
but to give some numbers, i tought about being able to get a count(*) of a
query for 'elastic' in about 100 to 200 ms
retrieving lets say the first 100 entries should be in nearly the same time
i also know that providing facts on those few numbers i gave is hard and
thus im only interested in experienced opinions or maybe what i should
expect to not be fast
as you may see im very unexperienced in full text search and document
oriented work
thx for your suggestions
--
Kind regards
Kristian Jörg
Devo IT AB
Tel: 054 - 22 14 58, 0709 - 15 83 42
E-mail: kristian.jorg@devo.se
Web: http://www.devo.se