Index randomly returns wrong results and wrong numeric type

Hi,

I have run into a situation I have not seen before. In my test index I have just 25 records.
I am noticing the following inconsistent behaviour with a simple search request against the alias (which points to this single index):

GET /private_95787e92-ea54-412f-b2aa-84f136597e13_companys_active/_search
{
  "_source": "score"
}

When all works as expected, I get a total of 25 hits and the integer field is returned correctly as an integer.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 25,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "8834d971-2275-46ea-85b7-665a17532473",
        "_score" : 1.0,
        "_source" : { }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "99322c97-d41c-411a-ae4e-1ac7f91ed1da",
        "_score" : 1.0,
        "_source" : {
          "score" : 522
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "45690436-2a6c-4edf-a79b-87c934565427",
        "_score" : 1.0,
        "_source" : {
          "score" : 100
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "ce562ba5-e472-4713-ad35-88eac62de2f1",
        "_score" : 1.0,
        "_source" : { }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "a2405408-f02b-4780-a7b6-7d1559577ff8",
        "_score" : 1.0,
        "_source" : {
          "score" : 0
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "607afd54-099b-4330-a871-542676236cff",
        "_score" : 1.0,
        "_source" : { }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "af5c053a-f344-418f-8650-cc2c60f5525b",
        "_score" : 1.0,
        "_source" : {
          "score" : 522
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "576e513b-4b1c-4eac-bbfe-6a94fea2cea4",
        "_score" : 1.0,
        "_source" : {
          "score" : 522
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "36bb9722-b891-4452-b2eb-904ee78e72ea",
        "_score" : 1.0,
        "_source" : { }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "058ba128-2c6f-4969-be29-946deab0165b",
        "_score" : 1.0,
        "_source" : { }
      }
    ]
  }
}
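
For completeness: only 10 hits are listed above because of the default page size of 10, but the total is still 25. Asking for an explicit size returns them all, for example:

GET /private_95787e92-ea54-412f-b2aa-84f136597e13_companys_active/_search
{
  "_source": "score",
  "size": 25
}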

However, at random, I get a total of just 7 hits and the integer field is returned as a double. Note that in this bad response only 2 of the 5 shards report as successful.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "91e446fb-10ad-48bc-a1e9-7fb11867758c",
        "_score" : 1.0,
        "_source" : {
          "score" : 380.0
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "0cb40954-606f-42fd-90b8-de296ee7a2ba",
        "_score" : 1.0,
        "_source" : {
          "score" : 110.0
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "a348a3ce-21f5-41a5-9154-100bfa38e8c9",
        "_score" : 1.0,
        "_source" : {
          "score" : 100.0
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "bd20a90a-d530-4827-9f1d-109e79251dcc",
        "_score" : 1.0,
        "_source" : {
          "score" : 750.0
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "625d7d75-25de-4252-b215-7da324d69cad",
        "_score" : 1.0,
        "_source" : {
          "score" : 550.0
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "652be0af-3307-4403-9e81-810faf1bbc82",
        "_score" : 1.0,
        "_source" : {
          "score" : 770.0
        }
      },
      {
        "_index" : "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001",
        "_type" : "_doc",
        "_id" : "addabf74-430b-423c-be53-02b4de6ec3c9",
        "_score" : 1.0,
        "_source" : {
          "score" : 750.0
        }
      }
    ]
  }
}
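
To see where the shards of this index are allocated, a check along these lines could help (a sketch, using the concrete index name from above):

GET _cat/shards/private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001?v&h=index,shard,prirep,state,docs,node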

Partial mapping:

{
  "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001" : {
    "mappings" : {
      // left out for clarity
      "properties" : {
        "score" : {
          "type" : "integer"
        }
      }
      // left out for clarity
    }
  }
}
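
To double-check what the live index reports for just that field, the field mapping API can be queried, for example:

GET /private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001/_mapping/field/score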

Some more info: at times the request fails with:
{"statusCode":502,"error":"Bad Gateway","message":"Client request timeout"}

I also know that one of the nodes is above the low disk watermark.
(Note that the index in the allocation explanation below is a different one.)

GET _cluster/allocation/explain

{
  "index" : "private_123eafdc-b1d0-4a38-8766-e9689d037efd_companys_new-000001",
  "shard" : 2,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2021-01-10T12:36:54.115Z",
    "details" : "node_left [TZb0mpk6Rk-o_ZGl1Ej1KQ]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "PMqJCXm6T-uZT0DtzFcqHg",
      "node_name" : "elasticsearch-2",
      "transport_address" : "<xxx>:9300",
      "node_attributes" : {
        "ml.machine_memory" : "2147483648",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [13.357897780780078%]"
        }
      ]
    },
    {
      "node_id" : "ZzV7yKD-ToSOqoApPn3-9A",
      "node_name" : "elasticsearch-1",
      "transport_address" : "<xxx>:9300",
      "node_attributes" : {
        "ml.machine_memory" : "2147483648",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "matching_size_in_bytes" : 5139037626
      },
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[private_123eafdc-b1d0-4a38-8766-e9689d037efd_companys_new-000001][2], node[ZzV7yKD-ToSOqoApPn3-9A], [P], s[STARTED], a[id=p3Fb2ewtSHaqwFpHjEQ44w]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [8.081981702764564%]"
        }
      ]
    }
  ]
}
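
For reference, per-node disk usage (relative to those watermarks) can be checked with the cat allocation API, for example:

GET _cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total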

Can you please help me understand what is causing this and how to solve it?

Some updates:
Comparing the results from when I received the 25 records with integer scores (good) against the 7 records with float scores (bad) showed that none of the 7 records exist among the 25. In other words, those 7 records are coming from somewhere else. The only possibility I can think of is that they belong to an index with the same name that I had deleted because the type of the score field had changed from float to int. I don't know whether this is true, but it is the only assumption I have at this point.
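
To rule out the alias resolving to (or also including) some leftover index, these are the kinds of checks I would run (a sketch, using the alias and index name prefix from above):

// which concrete indices does the alias point to?
GET _alias/private_95787e92-ea54-412f-b2aa-84f136597e13_companys_active

// which indices with this prefix still exist?
GET _cat/indices/private_95787e92-ea54-412f-b2aa-84f136597e13_companys_*?v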

Some of the steps I tried when troubleshooting:

  1. Tried to reindex (see the sketch right after this list), hoping that recreating the index would make those 7 float records go away - it did not help.
  2. Cleared a lot of space from other indices.
  3. Restarted 2 of the 3 pods, which seemed broken/nonresponsive.
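
For reference, the reindex in step 1 was essentially of this shape (a sketch; the destination index name is only illustrative):

POST _reindex
{
  "source": {
    "index": "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000001"
  },
  // the -000002 destination name below is illustrative
  "dest": {
    "index": "private_95787e92-ea54-412f-b2aa-84f136597e13_companys_new-000002"
  }
}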

Only after steps 2 and 3 did the index stop showing those 7 records. Hopefully the problem is gone.

But if I was indeed seeing results from the deleted index, I think that is a data integrity issue in Elasticsearch. I assume the deleted index could not actually be removed on one or two of the pods because of the lack of disk space, but I would never expect its documents to show up as results of the new index (if that is indeed what I was seeing).

Thoughts?

"Then the cluster would not have returned an ok response to the delete request."

Unfortunately I don't remember what DELETE returned, though I don't recall any errors.

"Seems like there is something else happening here that would be worth digging into."

What do you mean by this?

There are 3 pods. Observing their logs, I saw that one pod did not show any new log output, which seemed to me as if it were nonresponsive. Another pod kept restarting, logging 'No space left' exceptions. In between restarts I had to manually delete some other large indices to clear space. Only then did that second pod manage to restart properly without complaining about disk space. After this the index stopped showing me entries with a 'float' score (which might have been entries from the deleted index).
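
For context, this is roughly how the largest indices can be listed before manually deleting them (a sketch sorting by store size):

GET _cat/indices?v&h=index,docs.count,store.size&s=store.size:desc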
