Counting items in a list [array] via groovy returns (what we think are) incorrect counts

Jeff_Steinmetz · January 9, 2015, 2:09am

Is there a better way to do this?

Please see this gist (or, even better, run the script locally to see the
issue).

gist.github.com

https://gist.github.com/jeffsteinmetz/2ea8329c667386c80fae

gistfile1.sh

# tested against elasticsearch 1.4.1
# groovy script does not accurately count number of elements in list
# if you don't add a mapping to "not_analyze" it is even worse

INDEX_NAME='list-count-test'
NODE='localhost'

curl -XDELETE 'http://'$NODE':9200/'$INDEX_NAME

curl -XPUT 'http://'$NODE':9200/'$INDEX_NAME'/' -d '{

This file has been truncated.

You must have scripting enabled in your elasticsearch config for this to
work.

This was originally based on some comments I found here:
elasticsearch - Search by size of object type field elastic search - Stack Overflow

We would like to use a filtered query to only include documents that have a
small count of items in the list [aka array], filtering where
values.size() < 10

"script": "doc['titles'].values.size() < 10"

It turns out that values.size() does not count the elements in the list.
If analysis is not turned off for the field, it counts tokenized (analyzed)
words, not the number of elements.
If analysis is turned off for the field, it improves, but duplicates are
still missed.

For example, this comes back as size == 2, should be 3
"titles": ["one", "duplicate", "duplicate"]
This comes back as size == 3, should be 4
"titles": ["http://bit.ly/abc", "http://bit.ly/abc", "http://bit.ly/def",
"http://bit.ly/ghi"]

Is this a bug, is there a better way, or is this just something that we
don't understand about groovy and values.size()?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f5e88338-8c4f-4cb8-b6c4-d7f47b365175%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

nik9000 · January 9, 2015, 3:03am

I think that's just the way doc works. Try (but don't actually deploy)
_source['titles'].size() < 10. That should do what you expect. Don't
deploy that because it's too slow. Try indexing the size and filtering on
it. You can use a transform to add the size of the array as an integer
field and just filter on it using a range filter. That'd probably be the
fastest option.

Nik
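
A minimal sketch of the range-filter approach described above, assuming Elasticsearch 1.x "filtered" query syntax and a precomputed integer field. The field name title_count and the helper small_list_query are illustrative assumptions, not part of any Elasticsearch client API; the code only builds the JSON request body.

```python
import json


def small_list_query(count_field, limit):
    """Build a 1.x-style filtered query matching docs whose stored count is below limit."""
    return {
        "query": {
            "filtered": {
                "filter": {
                    # Range filter on the precomputed integer, no script involved.
                    "range": {count_field: {"lt": limit}}
                }
            }
        }
    }


# "title_count" is an assumed field name holding the array size.
body = small_list_query("title_count", 10)
print(json.dumps(body, indent=2))
```

The printed body could be sent with curl -d to the index's _search endpoint in place of the script filter.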

Jeff_Steinmetz · January 9, 2015, 5:03am

Thank you, that worked.

I was curious about the speed: is running a script using _source slower
than doc?

I understand that a dynamic script is slower regardless of _source vs
doc.

It makes sense that having a count transformed up front at index time to
create a materialized value would be much faster.

nik9000 · January 9, 2015, 5:15am

Source is going to be pretty slow, yeah. If it's a one-off then it's probably
fine, but if you do it a lot it's probably best to index the count.

Jeff_Steinmetz · January 9, 2015, 5:43am

Transform worked well. Nice.

Curious how to get it to save to source? I tried the mapping below, no go.
(I can, however, do range queries against title_count, so the transformed
value was indexed and works well.)

"transform" : {
  "script" : "ctx._source['\''title_count'\''] = ctx._source['\''titles'\''].size()",
  "lang": "groovy"
},
"properties": {
  "titles": { "type": "string", "index": "not_analyzed" },
  "title_count" : { "type": "integer", "store": "yes" }
}
}'

nik9000 · January 9, 2015, 5:59am

Transform never saves to source. You have to transform on the application
side for that. It was designed for times when you wanted to index something
like this that would just take up extra space in the source document. I
imagine you could use a script field on the query if you need the result to
contain the count. Or just count it on the result side.

Nik
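
As a rough illustration of transforming on the application side, the sketch below adds the count to the document before it is sent for indexing, so the value ends up in _source. The helper name with_title_count is hypothetical, and title_count is the field name used in the mapping earlier in this thread.

```python
import json


def with_title_count(doc):
    """Return a copy of doc with title_count set to the number of titles."""
    enriched = dict(doc)
    titles = doc.get("titles") or []
    # Elasticsearch accepts a bare scalar where an array is allowed,
    # so a single string is counted as one title.
    if isinstance(titles, str):
        titles = [titles]
    enriched["title_count"] = len(titles)
    return enriched


doc = {"titles": ["one", "duplicate", "duplicate"]}
print(json.dumps(with_title_count(doc)))
# {"titles": ["one", "duplicate", "duplicate"], "title_count": 3}
```

Because the count is taken from the original list, duplicates are counted, unlike the doc-based script.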

Jeff_Steinmetz · January 9, 2015, 7:19am

Now that I am into the real-world scenario, it gets a bit trickier: I have
nested objects (keys).
I have to test for the existence of the key in the Groovy script to avoid
parsing errors on insert.

How do you access a nested object in Groovy, and test for the existence of
a nested object key?
Such as in this example:

curl -XPOST 'http://'$NODE':9200/'$INDEX_NAME'/post' -d '{
"titles": ["title 1", "title 2", "title 3", "title 4"],
"raw" : {
"links" : ["Yahoo | Mail, Weather, Search, Politics, News, Finance, Sports & Videos", "Yahoo | Mail, Weather, Search, Politics, News, Finance, Sports & Videos",
"Warning! | There might be a problem with the requested link", "http://bit.ly/ghi"]
}
}'

This doesn't seem to work (from what I can tell, it never finds the key
raw.links, even when it does exist):

  "script" : "if (ctx._source.containsKey('raw.links')) { ctx._source.links_url_count = ctx._source['raw.links'].size() } else { ctx._source.links_url_count = 0 }"

Simple keys work though like ctx._source.containsKey('title')
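
One likely explanation, sketched here in Python rather than Groovy: the source document is a map of maps, so 'raw.links' is looked up as a single literal top-level key instead of as a path. The sketch below models that with plain dicts, assuming the document shape from the example above; the function name links_url_count is borrowed from the field in the script and is otherwise hypothetical.

```python
def links_url_count(source):
    """Count raw.links by walking the nested maps one level at a time."""
    raw = source.get("raw")
    if not isinstance(raw, dict):
        return 0
    links = raw.get("links")
    return len(links) if isinstance(links, list) else 0


source = {
    "titles": ["title 1", "title 2"],
    "raw": {"links": ["http://bit.ly/abc", "http://bit.ly/ghi"]},
}

print("raw.links" in source)            # False: no top-level key with a dot in it
print(links_url_count(source))          # 2
print(links_url_count({"titles": []}))  # 0
```

If the same reasoning holds for the Groovy script, checking each level separately (first that 'raw' exists, then that it contains 'links') would be the equivalent test.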

jprante · January 9, 2015, 10:37am

"titles" : ["one","duplicate","duplicate"] is a short form and becomes
"titles" : "one" and "titles" : "duplicate" in the index.

With 'doc', scripts access the document in its indexed form, which is of
course not 1:1 with the source document. It may work to use '_source'
to access the source field in the index and get the original form, but be
warned, this is slow, because the whole source field must be loaded and
decoded for each document.

Jörg
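
The difference between the indexed form and the source form can be modeled roughly as follows. This is a simplified sketch, not Elasticsearch's actual implementation: it assumes the index keeps each distinct term once per field, and it uses a crude regex split as a stand-in for a real analyzer. The function name indexed_terms is hypothetical.

```python
import re


def indexed_terms(values, analyzed):
    """Rough model of the distinct terms kept in the index for one field."""
    terms = set()
    for value in values:
        if analyzed:
            # Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
            terms.update(t for t in re.split(r"[^0-9a-z]+", value.lower()) if t)
        else:
            # not_analyzed: the whole value is a single term.
            terms.add(value)
    return terms


titles = ["one", "duplicate", "duplicate"]
urls = ["http://bit.ly/abc", "http://bit.ly/abc", "http://bit.ly/def", "http://bit.ly/ghi"]

print(len(titles), len(indexed_terms(titles, analyzed=False)))  # 3 2
print(len(urls), len(indexed_terms(urls, analyzed=False)))      # 4 3
print(len(indexed_terms(urls, analyzed=True)))                  # 6
```

The not_analyzed counts match the sizes reported in the question (2 and 3), while the length of the original list is what a source-based count returns. The analyzed count depends on the analyzer in use, so the 6 here only illustrates that tokens, not elements, are being counted.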
