Mapping null values

Hey everyone!

I am quite new to Elasticsearch, so feel free to criticize my working method. I am currently working on a project in which multiple PDF documents (a series of court decisions) need to be searchable. The PDF documents are ingested into Elasticsearch via FSCrawler (thank you dadoonet for this amazing program).

I want to make sure that every newly created document has certain properties. In my case, each decision has e.g. a CourtInstance, a DecissionDate, etc. When the data is ingested via FSCrawler, I would like these fields to default to null, so that they can be updated later on based on a query. To test things out, I first copied the index with a reindex request:

POST _reindex
{
  "source": {
    "index": "courtdecissions"
  },
  "dest": {
    "index": "testindex"
  }
}

Then I added the following mapping:

PUT /testindex/_mapping
{
  "properties":{
    "CourtCategory":{
      "type": "keyword",
      "null_value": "null"
    },
    "CourtInstance": {
      "type": "keyword",
      "null_value": "NULL"
    },
    "DecissionDate": {
      "type": "date",
      "null_value": "NULL"
    },
    "DecissionNumber":{
      "type": "keyword",
      "null_value": "NULL"
    },
    "CaseNumber":{
      "type": "keyword",
      "null_value": "NULL"
    },
    "References":{
      "type": "text"
    },
    "Summaries":{
      "type": "text"
    },
    "Comments": {
      "type": "text"
    }
  }
}

This seems to work fine, since I can see the updated mapping via "GET /testindex/_mapping".

Afterward, I reindexed using the POST _reindex request mentioned above.

However, when I query for a null value, nothing is returned:

GET /testindex/_search
{
  "query": {
    "term": {
        "CaseNumber": "NULL"
    }
  }
}

Any suggestions or ideas? Thank you in advance.
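
A note on why this query most likely comes back empty: as the mapping documentation describes it, `null_value` only replaces an explicit `null` in the incoming JSON at index time. A field that is absent from the document is not indexed at all, and `_source` is never modified. Documents coming from FSCrawler simply do not contain these fields, so there is nothing for `null_value` to replace. A minimal sketch of the difference, using a hypothetical scratch index `nulltest` (the index name, document IDs, and sample text are made up for illustration):

```console
PUT /nulltest
{
  "mappings": {
    "properties": {
      "CaseNumber": { "type": "keyword", "null_value": "NULL" }
    }
  }
}

# Explicit null: indexed as the term "NULL"
PUT /nulltest/_doc/1?refresh
{ "CaseNumber": null }

# Field absent: nothing is indexed for CaseNumber
PUT /nulltest/_doc/2?refresh
{ "Comments": "no case number here" }

# Should match only document 1
GET /nulltest/_search
{
  "query": { "term": { "CaseNumber": "NULL" } }
}

# Finds documents with no indexed value for CaseNumber (document 2)
GET /nulltest/_search
{
  "query": {
    "bool": { "must_not": { "exists": { "field": "CaseNumber" } } }
  }
}
```

If the goal is only to find decisions whose metadata has not been filled in yet, the `must_not` / `exists` query may already be enough without any default values.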

OK, I solved it by creating the following ingest pipeline:

PUT _ingest/pipeline/courtdecissionspipeline
{
  "version": 1,
  "description": "provide default values for several fields",
  "processors": [
    {
      "set": {
        "field": "CourtCategory",
        "value": "undefined"
      }
    },
    {
      "set": {
        "field": "CourtInstance",
        "value": "undefined"
      }
    },
    {
      "set": {
        "field": "DecissionDate",
        "value": "0001-01-01"
      }
    },
    {
      "set": {
        "field": "DecissionNumber",
        "value": "undefined"
      }
    },
    {
      "set": {
        "field": "CaseNumber",
        "value": "undefined"
      }
    },
    {
      "set": {
        "field": "References",
        "value": "[]"
      }
    },
    {
      "set": {
        "field": "Summaries",
        "value": "[]"
      }
    },
    {
      "set": {
        "field": "Comments",
        "value": "[]"
      }
    }
  ]
}
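
One thing to watch with a pipeline like this: by default the `set` processor overwrites whatever is already in the field (`override` defaults to `true`), so a document that arrives with a real `CaseNumber` would be reset to `undefined`. A hedged sketch of the same idea with `override` set to `false`, tried out through the simulate API with an inline pipeline so that nothing is stored. Only two of the fields are shown, and the sample documents and the case number are made up:

```console
POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "defaults that do not clobber existing values",
    "processors": [
      { "set": { "field": "CaseNumber", "value": "undefined", "override": false } },
      { "set": { "field": "DecissionDate", "value": "0001-01-01", "override": false } }
    ]
  },
  "docs": [
    { "_source": { "content": "decision without metadata" } },
    { "_source": { "content": "decision with metadata", "CaseNumber": "C-123/19" } }
  ]
}
```

The first simulated document should come back with both defaults filled in; the second should keep its own `CaseNumber` and only receive the default date. Note also that `"value": "[]"` stores the two-character string `[]`, not an empty array.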

After deleting the index, I recreated it with this pipeline as its default pipeline:

PUT /testindex
{
  "mappings": {
    "properties": {
      "CourtCategory": {
        "type": "keyword",
        "null_value": "undefined"
      },
      "CourtInstance": {
        "type": "keyword",
        "null_value": "NULL"
      },
      "DecissionDate": {
        "type": "date",
        "null_value": "0001-01-01"
      },
      "DecissionNumber": {
        "type": "keyword",
        "null_value": "NULL"
      },
      "CaseNumber": {
        "type": "keyword",
        "null_value": "NULL"
      },
      "References": {
        "type": "text"
      },
      "Summaries": {
        "type": "text"
      },
      "Comments": {
        "type": "text"
      }
    }
  },
  "settings": {
    "index": {
      "default_pipeline": "courtdecissionspipeline"
    }
  }
}

Afterward, I restarted FSCrawler, and now the file has the necessary default values. If there are any other suggestions from a more 'expert' point of view, please let me know!
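
Since the original plan was to fill these defaults in later based on a query, here is a minimal sketch of how that step could look with `_update_by_query`. The phrase being matched and the replacement value are invented for illustration, and `content` is assumed to be the field in which FSCrawler stores the extracted text:

```console
# Documents still carrying the default value
GET /testindex/_search
{
  "query": { "term": { "CourtInstance": "undefined" } }
}

# Replace the default on documents whose text mentions the court
POST /testindex/_update_by_query
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "CourtInstance": "undefined" } },
        { "match_phrase": { "content": "Court of Appeal" } }
      ]
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.CourtInstance = params.instance",
    "params": { "instance": "Court of Appeal" }
  }
}
```

It is worth checking whether the index's default pipeline also runs on documents rewritten this way in your Elasticsearch version. If it does, a `set` processor that overwrites existing values would put `undefined` straight back.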
