Hi all,
while searching for email addresses in my ElasticSearch index I found some
strange behaviour that I don't understand.
Short Info: I use
"version" : {
"number" : "0.90.0.RC2",
"snapshot_build" : false
},
My OS is Linux (Gentoo and Debian); both give the same result.
I use the default ElasticSearch settings - nothing changed.
I can reproduce it by using the following script:
#!/bin/bash
# the emails I want to check for in the ES
content1=(Michael Fabian Stefan Dominique Marcus Markus Noe Sebastian Diana
Daniela Alexandra Gabriela Jean Patrick Christoph Cedric Frederic Julien)
domains1=(ch biz com uk org de fr it eu)
domains2=(test check text)
index=0
# creating the combinations
for ((i=0;i<${#content1[@]};i++)); do
  for ((j=0;j<${#domains1[@]};j++)); do
    for ((k=0;k<${#domains2[@]};k++)); do
      # format 1: name@domain.rootdns
      index=$((index+1))
      testemail="${content1[$i]}@${domains2[$k]}.${domains1[$j]}"
      echo "${testemail}"
      cmdstr="curl -XPOST 'http://localhost:9200/test/${index}' -d '{\"user\" : \"${content1[$i]}\", \"email\" : \"${testemail}\" }'"
      echo "${cmdstr}"
      eval "${cmdstr}"
      # format 2: name12345domain.rootdns
      index=$((index+1))
      testemail="${content1[$i]}12345${domains2[$k]}.${domains1[$j]}"
      echo "${testemail}"
      cmdstr="curl -XPOST 'http://localhost:9200/test/${index}' -d '{\"user\" : \"${content1[$i]}\", \"email\" : \"${testemail}\" }'"
      echo "${cmdstr}"
      eval "${cmdstr}"
      # format 3: namedomainrootdns
      index=$((index+1))
      testemail="${content1[$i]}${domains2[$k]}${domains1[$j]}"
      echo "${testemail}"
      cmdstr="curl -XPOST 'http://localhost:9200/test/${index}' -d '{\"user\" : \"${content1[$i]}\", \"email\" : \"${testemail}\" }'"
      echo "${cmdstr}"
      eval "${cmdstr}"
    done
  done
done
Running it imports 1458 documents, i.e. 18*9*3 = 486 for each of the three
formats:
486 of the format name@domain.rootdns
486 of the format name12345domain.rootdns
486 of the format namedomainrootdns
Counting the entries in the test index with the following command gives 1459
(one more than 1458, as there is also a score line in the summary):
$ curl -XGET 'http://localhost:9200/test/_search?&pretty=true' -d '{ "size"
: 100000, "query": { "match_all": {}}}' | grep -i "_score" | wc -l
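The totals quoted above follow directly from the array sizes in the import script. A minimal sketch of that arithmetic, assuming the same three arrays (the variable names per_format and total are only illustrative and not part of the original script):

```shell
#!/bin/bash
# Same arrays as in the import script.
content1=(Michael Fabian Stefan Dominique Marcus Markus Noe Sebastian Diana
Daniela Alexandra Gabriela Jean Patrick Christoph Cedric Frederic Julien)
domains1=(ch biz com uk org de fr it eu)
domains2=(test check text)

# One address per (name, rootdns, domain) combination and per format.
per_format=$(( ${#content1[@]} * ${#domains1[@]} * ${#domains2[@]} ))
# Three formats are imported for every combination.
total=$(( per_format * 3 ))

echo "per format: ${per_format}"
echo "total:      ${total}"
# The grep on "_score" sees one extra line from the summary.
echo "grep lines: $(( total + 1 ))"
```

This only checks the expected numbers; it does not talk to ElasticSearch.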
If I run a wildcard query for michael*test* with the command
$ curl -XGET 'http://localhost:9200/test/_search?&pretty=true' -d '{ "size" : 100000, "query": { "wildcard": {"_all": "michael*test*"}}}' | grep -i "_score" | wc -l
I get 19 lines; one of them is the summary score, so 18 hits. BUT it should
be 27 (Michael * test * domains1 * variants = 1*1*9*3), which can easily be
verified using the command:
$ curl -XGET 'http://localhost:9200/test/_search?&pretty=true' -d '{ "size" : 100000, "query": { "match_all": {}}}' | grep -iP "michael.*test.*" | wc -l
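The expected 27 can also be cross-checked without ElasticSearch at all. The following is only a sketch under the assumption that the imported addresses are exactly the ones the script builds; the names emails and wildcard_hits are illustrative, and grep -E stands in for the wildcard here (michael.*test.* as the regex equivalent of michael*test*):

```shell
#!/bin/bash
# Rebuild the test addresses locally, in the same three formats as the import script.
content1=(Michael Fabian Stefan Dominique Marcus Markus Noe Sebastian Diana
Daniela Alexandra Gabriela Jean Patrick Christoph Cedric Frederic Julien)
domains1=(ch biz com uk org de fr it eu)
domains2=(test check text)

emails=()
for name in "${content1[@]}"; do
  for tld in "${domains1[@]}"; do
    for dom in "${domains2[@]}"; do
      emails+=("${name}@${dom}.${tld}" "${name}12345${dom}.${tld}" "${name}${dom}${tld}")
    done
  done
done

# Count the addresses matching michael*test*, ignoring case.
wildcard_hits=$(printf '%s\n' "${emails[@]}" | grep -ciE 'michael.*test.*')

echo "addresses generated:    ${#emails[@]}"
echo "expected wildcard hits: ${wildcard_hits}"
```

Dropping the -c option prints the matching addresses themselves, which makes it easy to compare them with what the wildcard query returns.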
So why are results missing? I also get strange results when using a regular
expression to find valid email addresses (yes, I know the regular expression
can be improved), like this:
$ curl -XGET 'http://localhost:9200/test/_search?&pretty=true' -d '{ "size" : 100000, "query": { "regexp": {"_all": "[A-Za-z0-9.%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}"}}}' | grep -i "_score" | wc -l
which gives 163 results, but it should be 486, as the following check
shows:
$ curl -XGET 'http://localhost:9200/test/_search?&pretty=true' -d '{ "size" : 100000, "query": { "match_all": {}}}' | grep -i "score" | grep -P "[A-Za-z0-9.%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}" | wc -l
Even more strange: if I filter the output of the regexp query again with a
different program, like
$ curl -XGET 'http://localhost:9200/test/_search?&pretty=true' -d '{ "size" : 100000, "query": { "regexp": {"_all": "[A-Za-z0-9.%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}"}}}' | grep -i "score" | grep -P "[A-Za-z0-9.%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}" | wc -l
I get only 54 results, because the former result also included entries
without the '@' and '.', which is not what I wanted from my regular
expression! I don't want to print the output here; just remove the | wc -l
from the end to see it.
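For the regular expression case, the expected 486 can be cross-checked the same way, purely locally. Again this is only a sketch that assumes the imported addresses are exactly the ones the script builds; the names emails, regex_hits and no_at are illustrative, and the pattern is the one used in the queries above, applied with grep -E:

```shell
#!/bin/bash
# Rebuild the test addresses locally, in the same three formats as the import script.
content1=(Michael Fabian Stefan Dominique Marcus Markus Noe Sebastian Diana
Daniela Alexandra Gabriela Jean Patrick Christoph Cedric Frederic Julien)
domains1=(ch biz com uk org de fr it eu)
domains2=(test check text)

emails=()
for name in "${content1[@]}"; do
  for tld in "${domains1[@]}"; do
    for dom in "${domains2[@]}"; do
      emails+=("${name}@${dom}.${tld}" "${name}12345${dom}.${tld}" "${name}${dom}${tld}")
    done
  done
done

# The same pattern as in the regexp query.
regex='[A-Za-z0-9.%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}'

# Addresses the pattern should accept (only the name@domain.rootdns format).
regex_hits=$(printf '%s\n' "${emails[@]}" | grep -cE "${regex}")
# Addresses without any '@', which the pattern should never accept.
no_at=$(printf '%s\n' "${emails[@]}" | grep -vc '@')

echo "expected regexp hits: ${regex_hits}"
echo "addresses without @:  ${no_at}"
```

Any hit of the regexp query that falls into the second group is one of the unexpected results described above.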
So my question is, where am I wrong? And more interesting for me: How can
it be done correctly?
Thanks and kind regards,
Michael