Changing Elasticsearch + Fscrawler tokenizing methods to recognize Social Security Numbers/Passport Machine Readable Zones

Hello,

I have used FSCrawler to index some files that contain social security numbers in the format XXX-XX-XXXX, as well as passport machine readable zones such as <<<<<<Karen<Ouyang<<<<<. They appear in the content when I do a general search of my index.

(I used the default FSCrawler 2.7 settings to tokenize the index.)

But when I use Kibana and a query_string search to find the social security numbers and the passport MRZs, I don't get any results back. Here are the queries I used:

GET /test_10082020_v3/_search
{
  "query": {
    "query_string": {
      "query": "*<<<*",
      "default_field": "content"
    }
  },
  "_source": { "excludes": [ "content" ] },
  "highlight": { "fields": { "*": {} } },
  "size": 10000
}

GET /test_10082020_v3/_search
{
  "query": {
    "query_string": {
      "query": "/[0-9]{3}-[0-9]{2}-[0-9]{4}/",
      "default_field": "content"
    }
  },
  "_source": { "excludes": [ "content" ] },
  "highlight": { "fields": { "*": {} } },
  "size": 10000
}

I want to ask:

  1. Is there a better search query for finding the social security numbers and the passport machine readable zones?

  2. Do I need to change the way the source is tokenized in the FSCrawler settings so that I can search for SSNs and passport MRZs more easily? If so, what should I change it to? The tokenizer is currently fscrawler_path, with type path_hierarchy.

Thank you.

Karen

Hi

Some thoughts about this.

First, the fact that you are using FSCrawler to index the documents is not really the problem, so I'd take FSCrawler out of the picture for now.

Your main concern is that you would like to find which documents contain an SSN or a passport MRZ, right?
Or do you want to be able to find a document which contains Karen as a passport MRZ?

If this is one of the use cases of your project, I'd try to extract that information at index time instead of running a very slow search...
It could be done by generating boolean fields like hasSSN and hasMRZ:

{
  "content": "WHATEVER HERE <<<<<<Karen<Ouyang<<<<< AND XXX-XX-XXXX",
  "hasSSN": true,
  "hasMRZ": true
}

Or by generating fields like ssn and mrz:

{
  "content": "WHATEVER HERE <<<<<<Karen<Ouyang<<<<< AND XXX-XX-XXXX",
  "mrz": "Karen Ouyang",
  "ssn": "XXX-XX-XXXX"
}

To do that, I'd try to use an ingest pipeline that will generate those fields at index time.
FSCrawler supports this pipeline option.
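For example, here is a minimal sketch of such a pipeline for the boolean-flag version. The pipeline name pii_flags and the patterns are just illustrative (real MRZ detection needs more care than looking for a run of < characters):

# Sketch: grok extracts SSN/MRZ-like strings, set adds the boolean flags
PUT _ingest/pipeline/pii_flags
{
  "description": "Flag documents containing SSN- or MRZ-like patterns (illustrative)",
  "processors": [
    {
      "grok": {
        "field": "content",
        "patterns": [ "%{SSN:ssn}" ],
        "pattern_definitions": { "SSN": "[0-9]{3}-[0-9]{2}-[0-9]{4}" },
        "ignore_missing": true,
        "ignore_failure": true
      }
    },
    {
      "grok": {
        "field": "content",
        "patterns": [ "%{MRZ:mrz}" ],
        "pattern_definitions": { "MRZ": "<{3,}" },
        "ignore_missing": true,
        "ignore_failure": true
      }
    },
    { "set": { "field": "hasSSN", "value": true, "if": "ctx.ssn != null" } },
    { "set": { "field": "hasMRZ", "value": true, "if": "ctx.mrz != null" } }
  ]
}

Once the documents are flagged, finding them is a cheap term query instead of a wildcard scan:

GET /test_10082020_v3/_search
{
  "query": { "term": { "hasSSN": true } }
}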

Side note:

The tokenizer you mentioned is defined by FSCrawler only for some fields but not for the content field.
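You can see the problem with the _analyze API. With the standard analyzer, which is what the content field effectively gets, the < characters and the hyphens are stripped at tokenization time (using a made-up SSN here):

GET _analyze
{
  "analyzer": "standard",
  "text": "<<<<<<Karen<Ouyang<<<<< 123-45-6789"
}

This returns only the tokens karen, ouyang, 123, 45 and 6789, so there is no token left for *<<<* or for the SSN regexp to match.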

Hi David,

Thanks for letting me know about the pipeline option. I think it will become useful in one of our other use cases.

However, for the SSN/passport use case, I would like to stick with the slow search method.

The tokenizer that was successful in picking up whole SSNs and the <<< signs in the passport MRZs was the whitespace tokenizer. This is the code we used in our _settings.json:

{
  "settings": {
    "number_of_shards": 4,
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "fulltext_analyzer"
      }
    }
  }
}
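For reference, the _analyze API confirms what this analyzer indexes (again with a made-up SSN):

GET /test_10082020_v3/_analyze
{
  "analyzer": "fulltext_analyzer",
  "text": "<<<<<<Karen<Ouyang<<<<< 123-45-6789"
}

It keeps <<<<<<karen<ouyang<<<<< and 123-45-6789 as single (lowercased) tokens, which is why the wildcard and regexp searches work against this index.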

However, even though we were able to search for the << signs and the SSNs, some files were not indexed at all, so we missed documents this way.

When we used the default FSCrawler settings, we were able to index all the files, but we could not search for the << signs and the SSNs using query_string.

So my question is: is there a way to use the whitespace tokenizer and also index all the files? I see that the difference between our modified settings and the default settings is that the modified version defines fewer mappings. Is our modified settings file incomplete in a way that caused some files not to be indexed?

Thanks!

Best,

Karen

Yes.

Yes.

If you need to apply two different analysis strategies, you need to create a sub-field in the mapping.
Something like:

{
  "settings": {
    "number_of_shards": 4,
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "fulltext_analyzer",
        "fields": {
          "foo": {
            "type": "text"
          }
        }
      }
    }
  }
}

That way, searching on content.foo will use the default standard analyzer, while content keeps the whitespace analyzer.
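With that mapping, each field serves one kind of search. A couple of illustrative queries (index name taken from this thread, sub-field name foo kept from the sketch above):

# Pattern searches go against the whitespace-analyzed "content" field
GET /test_10082020_v3/_search
{
  "query": {
    "regexp": {
      "content": "[0-9]{3}-[0-9]{2}-[0-9]{4}"
    }
  }
}

# Regular full-text search goes against the standard-analyzed sub-field
GET /test_10082020_v3/_search
{
  "query": {
    "match": {
      "content.foo": "karen ouyang"
    }
  }
}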

Hi David,

That was really helpful! We were able to change the settings and crawl and search through everything.

Thank you.

Karen
