Scan query that returns document values only is heavily accessing the *.FDT file

Tzahi · November 23, 2014, 10:52pm

Hi all,

I have a test index with 43 million documents. There is a string document
value for each document (about 5-10 characters per document).

Mapping is:

{
  "myindex" : {
    "mappings" : {
      "num_type" : {
        "_type" : {
          "store" : true
        },
        "properties" : {
          "doc_value" : {
            "type" : "string",
            "doc_values_format" : "default"
          },
          "int1" : {
            "type" : "integer",
            "index" : "analyzed",
            "store" : true
          },
          "int2" : {
...

I need to retrieve only the document values, for queries whose result set
may contain about 100,000 documents. I do not need ranking or anything else
that will slow this down.

My understanding is that if the query is only a filter, ranking is not
computed, and it is faster.

Here is a small python program to test it:

import elasticsearch

es = elasticsearch.Elasticsearch()

# Start a scan-type scroll; the first response carries only the scroll ID.
results = es.search("myindex", "num_type",
    {
        "fields": ["doc_value"],
        "size": 1000,
        "query": {"filtered": {
            "query": {"match_all": {}},
            "filter": {"term": {"r_int3": 929}}
        }}
    }, scroll="10s", search_type="scan")

# Keep scrolling until a page comes back with no hits.
while True:
    results = es.scroll(results["_scroll_id"], scroll="10s")
    if len(results["hits"]["hits"]) <= 0:
        break

The query runs quite slowly, and I see a huge number of accesses to the
*.fdt (field data) file.

But I am asking for a document value field, so why does ES access the *.fdt file?

Thanks a lot in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/89480f13-b00e-4e3f-a538-15fdbd18f073%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · November 24, 2014, 8:10am

Doc values are stored in the .fdt files.

Jörg

Tzahi · November 24, 2014, 9:04am

Thanks.
Sorry, I did not stress that these are document values and not field values.
Document values are stored in the DVD file, which is a small, compressed format.
I defined the field this way to avoid having to access and parse the Lucene document from
the huge FDT file (in my test, the FDT file is 1000 times bigger than the DVD file).
See
https://lucene.apache.org/core/4_3_1/core/org/apache/lucene/codecs/lucene42/Lucene42DocValuesFormat.html
.
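The size comparison between the FDT and DVD files can be reproduced with a short script. This is only a sketch: the function name `size_by_extension` is made up for illustration, and the demo runs on a throwaway directory with fake segment files rather than a real index, so the numbers are not measurements. On a real node you would point it at the shard's index directory, whose location depends on your installation.

```python
import os
import tempfile
from collections import defaultdict


def size_by_extension(index_dir):
    """Sum file sizes in bytes under index_dir, grouped by file extension."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


# Demo on a temporary directory with fake segment files (not a real index).
with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("_0.fdt", 4000), ("_0.dvd", 4), ("_1.fdt", 1000)]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"\0" * size)
    sizes = size_by_extension(tmp)

# Ratio of stored-field bytes to doc-value bytes in the fake data.
print(sizes[".fdt"] / sizes[".dvd"])
```

Grouping by extension keeps the script independent of segment names, which change as segments are merged.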

I am still trying to avoid accessing the FDT file; it makes my query too slow.

Thanks again.


jprante · November 24, 2014, 9:13am

Oh, sorry. Yes, doc values are in .dvd files.

I assume that ES still puts the hidden "type" and "uid" fields in .fdt. But I'm
also surprised; there should not be much disk access for that.
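Another possible source of the .fdt reads is the "fields" option itself: in Elasticsearch 1.x, a requested field that is not stored is extracted from _source, which lives in the stored-fields file. A minimal sketch of a request body that asks for the doc values instead is shown below. It assumes the cluster supports the 1.x "fielddata_fields" option and source filtering via "_source"; the helper name `build_scan_body` is made up for illustration. Whether this removes most of the .fdt access in this particular setup is untested, and the hidden fields mentioned above would still be read for each hit.

```python
def build_scan_body(doc_value_field, filter_field, filter_value, size=1000):
    """Build a filter-only search body that requests a field's doc values."""
    return {
        # Assumption: "fielddata_fields" is available (Elasticsearch 1.x).
        "fielddata_fields": [doc_value_field],
        # Assumption: source filtering is available; skip loading _source.
        "_source": False,
        "size": size,
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"term": {filter_field: filter_value}},
            }
        },
    }


# Same field names and filter value as in the original test program.
body = build_scan_body("doc_value", "r_int3", 929)
```

The resulting dict can be passed as the body argument of es.search in place of the one using "fields", with the same scroll and search_type arguments.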

Jörg

