Query Results not correct?

Hi all,

while searching for email addresses in my Elasticsearch index, I ran into
strange behaviour that I don't understand.

Short Info: I use

"version" : {
"number" : "0.90.0.RC2",
"snapshot_build" : false
},

My OS is Linux (Gentoo and Debian); the result is the same on both.

I use the default Elasticsearch settings; nothing has been changed.

I can reproduce it by using the following script:

#!/bin/bash

# the emails I want to check for in ES
content1=(Michael Fabian Stefan Dominique Marcus Markus Noe Sebastian Diana
Daniela Alexandra Gabriela Jean Patrick Christoph Cedric Frederic Julien)
domains1=(ch biz com uk org de fr it eu)
domains2=(test check text)

index=0
# creating the combinations
for ((i=0;i<${#content1[*]};i++)); do
for ((j=0;j<${#domains1[*]};j++)); do
for ((k=0;k<${#domains2[*]};k++)); do
    # format 1: name@domain.rootdns
    index=$((index+1))
    testemail="${content1[$i]}@${domains2[$k]}.${domains1[$j]}"
    echo "${testemail}"
    cmdstr="curl -XPOST 'http://localhost:9200/test/${index}' -d '{\"user\" : \"${content1[$i]}\", \"email\" : \"${testemail}\" }'"
    echo "${cmdstr}"
    eval "${cmdstr}"

    # format 2: name12345domain.rootdns
    index=$((index+1))
    testemail="${content1[$i]}12345${domains2[$k]}.${domains1[$j]}"
    echo "${testemail}"
    cmdstr="curl -XPOST 'http://localhost:9200/test/${index}' -d '{\"user\" : \"${content1[$i]}\", \"email\" : \"${testemail}\" }'"
    echo "${cmdstr}"
    eval "${cmdstr}"

    # format 3: namedomainrootdns
    index=$((index+1))
    testemail="${content1[$i]}${domains2[$k]}${domains1[$j]}"
    echo "${testemail}"
    cmdstr="curl -XPOST 'http://localhost:9200/test/${index}' -d '{\"user\" : \"${content1[$i]}\", \"email\" : \"${testemail}\" }'"
    echo "${cmdstr}"
    eval "${cmdstr}"
done
done
done

Running it imports 1458 datasets, meaning there are 18*3*9 = 486 of each
format:
486 of the format name@domain.rootdns
486 of the format name12345domain.rootdns
486 of the format namedomainrootdns
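
As a quick sanity check of those numbers, here is a small sketch that derives them from the word lists in the script above. The names count_words, per_format and total are only illustrative and do not appear in the original script.

```shell
# Word lists copied from the script above (plain strings instead of
# arrays, so the sketch also runs in a POSIX sh).
content1="Michael Fabian Stefan Dominique Marcus Markus Noe Sebastian Diana Daniela Alexandra Gabriela Jean Patrick Christoph Cedric Frederic Julien"
domains1="ch biz com uk org de fr it eu"
domains2="test check text"
formats=3   # name@domain.rootdns, name12345domain.rootdns, namedomainrootdns

# Count the whitespace-separated words of a list (hypothetical helper).
count_words() {
  set -- $1
  echo "$#"
}

per_format=$(( $(count_words "$content1") * $(count_words "$domains1") * $(count_words "$domains2") ))
total=$(( per_format * formats ))
echo "per format: ${per_format}"
echo "total: ${total}"
```

This prints 486 per format and 1458 in total, which matches the number of documents the script creates.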

Counting the entries in the test index with the following command gives
1459 (one more than the number of datasets, because the max_score line in
the summary also matches "_score"):
$ curl -XGET 'http://localhost:9200/test/_search?pretty=true' -d '{ "size" : 100000, "query": { "match_all": {}}}' | grep -i "_score" | wc -l

If I run a wildcard query for michael*test* with the command
$ curl -XGET 'http://localhost:9200/test/_search?pretty=true' -d '{ "size" : 100000, "query": { "wildcard": {"_all": "michael*test*"}}}' | grep -i "_score" | wc -l

I receive 19 answers. One of them is the summary score, so 18. BUT it must
be 27 (Michael * test * domains1 * variants = 1*1*9*3), which can easily be
verified using the command:
$ curl -XGET 'http://localhost:9200/test/_search?pretty=true' -d '{ "size" : 100000, "query": { "match_all": {}}}' | grep -iP "michael.*test.*" | wc -l

So why are results missing? I also get strange results when I use a regular
expression to find valid email addresses (yes, I know the regular
expression can be improved), like this:
$ curl -XGET 'http://localhost:9200/test/_search?pretty=true' -d '{ "size" : 100000, "query": { "regexp": {"_all": "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}"}}}' | grep -i "_score" | wc -l

This gives 163 results, but it should be 486, which is what the following
check returns:
$ curl -XGET 'http://localhost:9200/test/_search?pretty=true' -d '{ "size" : 100000, "query": { "match_all": {}}}' | grep -i "_score" | grep -P "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}" | wc -l

Even stranger: if I additionally filter the output of that regexp query
with a different program (grep), like
$ curl -XGET 'http://localhost:9200/test/_search?pretty=true' -d '{ "size" : 100000, "query": { "regexp": {"_all": "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}"}}}' | grep -i "_score" | grep -P "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}" | wc -l
I get 54 answers, because the regexp query result includes entries without
the '@' and '.', which is not what I intended with my regular expression. I
won't paste the output here; just remove the | wc -l from the end to see it.

So my question is: where am I going wrong? And, more importantly for me,
how can it be done correctly?

Thanks and kind regards,

Michael

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

On Friday, 19 April 2013 22:24:20 UTC+2, Michael wrote:

[...]

Correction: my apologies. I have realized that the regular expression, which
allows digits and '+', of course matches those entries as well. So I made a
minor change in the script, from
testemail=${content1[$i]}"12345"${domains2[$k]}"."${domains1[$j]};
to
testemail=${content1[$i]}":"${domains2[$k]}"."${domains1[$j]};
since ':' is not in the character class of the regular expression. Oddly
enough, the results stay the same, so this does not change anything. The
original observation was correct.

Any help is appreciated.

Thanks a lot.

Michael


Your wildcard query "wildcard": {"_all": "michael*test*"} is expected to
match a single token. The _all field uses the standard analyzer, so emails
like Michael@test.ch are indexed as two tokens, "michael" and "test.ch". If
you want them to be indexed as a single token, you should switch to the
uax_url_email tokenizer
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/uaxurlemail-tokenizer/)
in the default analyzer. You can find an example
here: https://github.com/imotov/elasticsearch-test-scripts/blob/master/email_default_analyzer.sh
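
To make the point about tokens concrete, here is a rough sketch in two parts. Part 1 imitates the behaviour described above in plain shell: the tokenize function is only a crude stand-in for the standard analyzer (it lowercases and splits at '@'; the real tokenizer applies many more rules), and matches_wildcard checks whether any single token matches the whole pattern. Part 2 shows index settings that I would expect to make uax_url_email the tokenizer of the default analyzer. The function names, the settings_json variable and the choice of a lowercase filter are my assumptions and are not taken from the linked script.

```shell
# Part 1: crude stand-in for the standard analyzer (assumption: lowercase
# the value and split it at '@'; the real tokenizer has more rules).
tokenize() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr '@' '\n'
}

# A wildcard query matches a document only if one single token matches
# the whole pattern.
matches_wildcard() {
  pattern=$1
  for token in $(tokenize "$2"); do
    case $token in
      $pattern) return 0 ;;   # unquoted, so it is used as a glob pattern
    esac
  done
  return 1
}

for email in Michael@test.ch Michael12345test.ch Michaeltestch; do
  if matches_wildcard 'michael*test*' "$email"; then
    echo "${email}: match"
  else
    echo "${email}: no match"
  fi
done

# Part 2: settings sketch that keeps an email address as one token.
# Assumption: a custom analyzer named "default" is used for all fields,
# including _all, unless a mapping overrides it.
settings_json='{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["lowercase"]
          }
        }
      }
    }
  }
}'

# Not called here. The settings must be supplied when the index is created,
# so an existing test index would have to be deleted and refilled first.
create_index() {
  curl -XPUT 'http://localhost:9200/test' -d "$settings_json"
}
```

Part 1 reports a match for Michael12345test.ch and Michaeltestch but not for Michael@test.ch, which corresponds to the 18 hits out of 27 seen in the question.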

By the way, you don't have to retrieve all records each time just to count
them. The total number of results is reported at the top of the response:

{
  "took" : 4,
  "timed_out" : false,
  ......
  "hits" : {
    "total" : 27,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      .....
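
A minimal sketch of reading that value from the shell instead of counting "_score" lines. It assumes the response was requested with pretty=true, so that "total" sits on its own line below "hits"; extract_total is a hypothetical helper name.

```shell
# Print the "total" value that follows the "hits" object in a
# pretty-printed search response read from stdin. The earlier "total"
# inside "_shards" is skipped because it appears before "hits".
extract_total() {
  awk '/"hits" *:/ { in_hits = 1 }
       in_hits && /"total"/ { gsub(/[^0-9]/, ""); print; exit }'
}

# Example with a shortened, hand-written response:
sample='{
  "took" : 4,
  "_shards" : {
    "total" : 5
  },
  "hits" : {
    "total" : 27,
    "max_score" : 1.0
  }
}'
printf '%s\n' "$sample" | extract_total
```

Against a running node this could be used as `curl -s -XGET 'http://localhost:9200/test/_search?pretty=true' -d '{ "size" : 0, "query": { "match_all": {}}}' | extract_total`, where "size" : 0 avoids transferring the documents themselves.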

I am also not sure whether this is intentional, but you are creating 1458
different types in your test index, instead of 1458 records of the same type.
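
The URL in the script has the form /test/${index}, where the second path segment is read as the type. A sketch of the alternative, with one fixed type and the counter used as the document id, could look like this. The type name "email" and the helper names doc_url and index_doc are assumptions made for illustration.

```shell
# Build the document URL as index/type/id. The index "test" comes from the
# thread; the type name "email" is only an assumed example.
doc_url() {
  printf 'http://localhost:9200/test/email/%s' "$1"
}

# Index one record: $1 = id, $2 = user, $3 = email. Not called here.
index_doc() {
  curl -XPOST "$(doc_url "$1")" -d "{\"user\" : \"$2\", \"email\" : \"$3\"}"
}

doc_url 1
echo
```

Inside the loops of the original script, each curl call would then become something like `index_doc "${index}" "${content1[$i]}" "${testemail}"`.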

On Monday, April 22, 2013 9:23:07 AM UTC+2, Michael wrote:

[...]