Accessing index shard directories via pylucene

Ben_Timby · November 17, 2021, 2:39am

Hi, I have old index files from Elasticsearch 1.7.6, which uses Lucene 4.10.4. I was able to get Lucene to open a directory that represents one of the three shards of my index, and I can pull the documents from it. However, the index contains compound fields (meaning properties with properties). I am not sure whether these are nested, flattened, or stored some other way by ES, but they are absent from the documents when read using Lucene.

I understand that these properties are likely stored in sub-documents, but that they should exist in the same shard. How would I find these sub-documents? When I print all documents in the index, they are all "parent" documents, and have the properties defined below such as "action", "addr", but are missing "location", "user" etc.

My pylucene code is like this:

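import lucene
from java.io import File
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.store import FSDirectory

lucene.initVM()  # start the JVM before touching any Lucene class


def import_doc(doc):
    print(doc)


def import_index(path):
    # path is the Lucene directory of a single shard
    index = FSDirectory.open(File(path))
    reader = DirectoryReader.open(index)

    for i in range(reader.maxDoc()):
        doc = reader.document(i)
        # stringValue() returns None for binary stored fields
        doc = {
            f.name(): f.stringValue()
            for f in doc.getFields()
        }
        import_doc(doc)

    reader.close()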

My index definition is like this:

{
  "audit-01-31-2021" : {
    "aliases" : {
      "audit" : { }
    },
    "mappings" : {
      "activity" : {
        "_all" : {
          "enabled" : true
        },
        "_routing" : {
          "required" : true,
          "path" : "site_id"
        },
        "properties" : {
          "action" : {
            "type" : "string",
            "store" : true
          },
          "addr" : {
            "type" : "string"
          },
          "connection_type" : {
            "type" : "string",
            "store" : true
          },
          "isdir" : {
            "type" : "boolean",
            "store" : true
          },
          "location" : {
            "properties" : {
              "country" : {
                "type" : "string"
              },
              "ip" : {
                "type" : "string"
              }
            }
          },
          "message" : {
            "properties" : {
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "result" : {
            "type" : "string",
            "store" : true
          },
          "rule" : {
            "properties" : {
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "seq_id" : {
            "type" : "integer",
            "store" : true
          },
          "signature" : {
            "properties" : {
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "site_id" : {
            "type" : "long",
            "store" : true
          },
          "size" : {
            "type" : "long",
            "store" : true
          },
          "target" : {
            "properties" : {
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              },
              "type" : {
                "type" : "string"
              }
            }
          },
          "timestamp" : {
            "type" : "date",
            "store" : true,
            "format" : "dateOptionalTime"
          },
          "type" : {
            "type" : "string",
            "store" : true
          },
          "uid" : {
            "type" : "string",
            "store" : true
          },
          "user" : {
            "properties" : {
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              },
              "username" : {
                "type" : "string",
                "analyzer" : "username_analyzer"
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1612069201197",
        "analysis" : {
          "filter" : {
            "username_filter" : {
              "type" : "word_delimiter",
              "type_table" : [ "@ => ALPHA", ". => ALPHA", "# => ALPHA" ]
            }
          },
          "analyzer" : {
            "username_analyzer" : {
              "filter" : [ "lowercase", "username_filter" ],
              "type" : "custom",
              "tokenizer" : "whitespace"
            }
          }
        },
        "number_of_shards" : "3",
        "uuid" : "cSJzoUWgQFa-GLzb09hn3A",
        "version" : {
          "created" : "1070699"
        },
        "number_of_replicas" : "1"
      }
    },
    "warmers" : { }
  }
}

Christian_Dahlqvist · November 17, 2021, 5:18am

What are you looking to achieve? Why not use the Elasticsearch APIs? Why are you using such an ancient version?

Ben_Timby · November 17, 2021, 6:29am

Thanks for your reply.

What are you looking to achieve?

I need to export the data. My customer is abandoning Elasticsearch because it is not the right tool for this job. They have a large amount of data in ES and I need to get it out. The cluster is 30 nodes and several TB. I will NOT be using ES for this purpose (but will set up a new cluster for a different purpose), so I don't want to invest any resources in a cluster I am going to nuke.

Why not use the Elasticsearch APIs?

The ES cluster crashes when I try to export data using the API. I have tried elasticdump as well as curl; both are painfully slow and eventually start timing out as nodes crash.

Why are you using such an ancient version?

There is no upgrade path. Since _source is not enabled, I can't reindex from remote even after upgrading to 5.x.

At the end of the day I just need the data out of ES. I can detach the indexes individually, they are each around 1GB and I was hoping to just read the data from the files directly. It is 99% working, I just don't know how this old version of ES maps sub-documents.
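For anyone reading the files directly in the same way, here is a minimal sketch of locating the per-shard Lucene directories of one detached index. It assumes the index directory contains one numbered subdirectory per shard, each with an `index` subdirectory holding the Lucene files; that layout is an assumption to check against your own data path, and `find_shard_dirs` is a hypothetical helper name, not part of any library.

```python
import os


def find_shard_dirs(index_dir):
    """Return the Lucene directory of every numbered shard under index_dir.

    Assumes a layout of <index_dir>/<shard number>/index; verify this
    against the actual data directory before relying on it.
    """
    shard_names = [name for name in os.listdir(index_dir) if name.isdigit()]
    shard_dirs = []
    for name in sorted(shard_names, key=int):
        candidate = os.path.join(index_dir, name, "index")
        if os.path.isdir(candidate):
            shard_dirs.append(candidate)
    return shard_dirs
```

Each returned path could then be handed to a reader function such as the `import_index` shown in the first post. On a multi-node cluster a single node usually holds only some of an index's shards, so the result may be a subset.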

Christian_Dahlqvist · November 17, 2021, 7:29am

You should be able to get the mappings for the index using the Elasticsearch APIs, which will give you the structure of the indexed data. Apart from that, I cannot really help, as you have source disabled.
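To illustrate what the mappings give you, here is a small sketch that walks the `properties` section of a mapping like the one in the first post and lists every leaf field as a dotted path (for example `location.country`) together with its type and `store` flag. Elasticsearch indexes object fields under such dotted names, and fields without `"store" : true` are by default not kept as separate stored fields. `flatten_properties` is a hypothetical helper, and the sample dictionary is a trimmed copy of the posted mapping.

```python
def flatten_properties(properties, prefix=""):
    """Yield (dotted_path, type, stored) for every leaf field in a mapping."""
    for name, spec in sorted(properties.items()):
        path = prefix + name
        if "properties" in spec:
            # Object field: descend and extend the dotted prefix.
            yield from flatten_properties(spec["properties"], path + ".")
        else:
            yield path, spec.get("type"), bool(spec.get("store", False))


# Trimmed copy of the "properties" section from the posted mapping.
sample_properties = {
    "action": {"type": "string", "store": True},
    "location": {
        "properties": {
            "country": {"type": "string"},
            "ip": {"type": "string"},
        }
    },
}

for path, field_type, stored in flatten_properties(sample_properties):
    print(path, field_type, "stored" if stored else "not stored")
```

Comparing this list with the field names that come back from `doc.getFields()` shows which mapped fields are missing from the stored documents.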

Ben_Timby · November 17, 2021, 4:50pm

I believe I posted the mappings in my initial message (unless I am mistaken). In any case, I figured it out: the additional fields are encoded as binary data within the document, so you just have to read and decode them. You can't use getField() and friends.
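The thread does not show the decoding step, so the following is only a sketch of one way it might look. It reads a stored field's binary payload through `binaryValue()`, which in Lucene returns a `BytesRef` with `bytes`, `offset` and `length` members, and then tries to parse the payload as UTF-8 JSON. The JSON encoding is an assumption (the post does not say how the data is encoded), and `bytesref_to_bytes` and `field_value` are hypothetical helper names.

```python
import json


def bytesref_to_bytes(ref):
    """Copy the valid slice of a BytesRef-like object into Python bytes."""
    # Java bytes are signed (-128..127), so mask each one into 0..255.
    return bytes(
        ref.bytes[i] & 0xFF
        for i in range(ref.offset, ref.offset + ref.length)
    )


def field_value(field):
    """Return a stored field's value, decoding binary payloads if needed."""
    text = field.stringValue()
    if text is not None:
        return text
    ref = field.binaryValue()
    if ref is None:
        return None
    raw = bytesref_to_bytes(ref)
    try:
        # Assumption: the payload is UTF-8 encoded JSON.
        return json.loads(raw.decode("utf-8"))
    except ValueError:
        # Not UTF-8 JSON: hand back the raw bytes for manual inspection.
        return raw
```

In the loop from the first post, `f.stringValue()` could be swapped for `field_value(f)`. If the payload turns out to be compressed or in another serialization format, the fallback returns the raw bytes so they can be examined separately.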
