Need help with searching text with special characters

Hi,

I have created an index for my files on a server using FSCrawler, and I am using an elasticsearch_dsl & Python program to run match queries. It works fine for any text / numbers.
But I need to search for text with special characters, such as email addresses, dates of birth, etc. To find anything that looks like an email, I am using a regex pattern.

        client = Elasticsearch('127.0.0.1', port=9200)

        s = Search(using=client, index=["index_1","index_2"]).query("regexp",  content="[a-zA-Z0-9]+\@[a-zA-Z]+\.[a-zA-Z]+")
        s = s[0:9999]
        s = s.highlight('content')
        response = s.execute()

But the special characters are not reflected in the results.

I recently learned that they are being ignored because of the standard tokenizer. But I don't know which tokenizer to use for regex patterns. I have created regex patterns for email, date of birth, phone number, SSN, etc.
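For illustration, here is a rough pure-Python sketch of what is going on (the real standard tokenizer follows Unicode word-boundary rules, so the token split below is only an approximation): a `regexp` query is matched against individual tokens and must match a whole token, so once the tokenizer has split an email on `@`, no single token can match an email pattern.

```python
import re

# Hypothetical sample text containing an email address.
text = "contact auroraglez78@gmail.com for details"

# Very rough approximation of the standard tokenizer: it splits on "@"
# (among other things), so the email is broken into separate tokens.
standard_like_tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9.]*", text)
# -> ['contact', 'auroraglez78', 'gmail.com', 'for', 'details']

# The whitespace tokenizer keeps the email address as a single token.
whitespace_tokens = text.split()
# -> ['contact', 'auroraglez78@gmail.com', 'for', 'details']

# Elasticsearch regexp queries are anchored: the pattern must match a
# whole token, similar to re.fullmatch in Python.
email_pattern = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]+")

print(any(email_pattern.fullmatch(t) for t in standard_like_tokens))  # False
print(any(email_pattern.fullmatch(t) for t in whitespace_tokens))     # True
```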

I saw grok patterns too, but I would have to create custom patterns for the above-mentioned types, and I am also lost on how to incorporate them into elasticsearch_dsl and Python.

Could anybody please help me with the best way to achieve search with regex pattern matching?

Thanks,
Lisa.

What is the goal of this? What is the use case?

If you want to identify documents that contain email, credit cards, ... then I think you should do that at index time instead of trying to do that at search time.

Could you explain if you really want to do that at search time?

If so, could you share an example of a piece of text which should match the regex but does not? Make the example totally unconnected to FSCrawler; FSCrawler is just a way to collect text here. So in your example, just use one single field, like:

{
  "content": "foo bar"
}

So could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

Hi David,

Thanks for quick reply!
So the idea is to show all the emails in a file repository when you click a button, not to search for an individual email id.
For this I am using Flask, elasticsearch_dsl, and Python.
To find all emails, I am using the regex pattern [a-zA-Z0-9]+@[a-zA-Z]+.[a-zA-Z]+ (I know this is not the right one, since @ and . have other meanings. When I try [a-zA-Z0-9]+\@[a-zA-Z]+\.[a-zA-Z]+, my app returns 0 results, but just to show you the sample output I am using this pattern.)
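As a side note on the two patterns (a pure-Python sketch; Elasticsearch actually uses Lucene's regexp syntax, which differs in some details, e.g. patterns are implicitly anchored to the whole token): an unescaped `.` matches any character, so the loose pattern also accepts strings that are not emails.

```python
import re

loose = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+.[a-zA-Z]+")    # "." matches any character
strict = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]+")  # "\." matches a literal dot only

print(bool(loose.fullmatch("user@gmail.com")))   # True
print(bool(loose.fullmatch("user@gmail-com")))   # True  -- not an email, still matches
print(bool(strict.fullmatch("user@gmail-com")))  # False
```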

Here is the code:
from flask import Flask, render_template, request
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import os

# setting up the flask app and elastic search api connection
app = Flask(__name__)
client = Elasticsearch('127.0.0.1', port=9200)

# render the homepage when the URL is hit
@app.route('/')
def home():
    return render_template('home.html')

# defining the search response
@app.route('/search/email-results', methods=['GET', 'POST'])
def searchEmail():
    s = Search(using=client, index=["index_1","index_2"]).query("regexp",  content="[a-zA-Z0-9]+@[a-zA-Z]+.[a-zA-Z]+")
    response = s.execute()
    return render_template('Results.html', response=response)

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000,debug=True)

When you run this app, you see a button on a page to find emails in the index. When you click it, you see the file name, location, etc., something like below.

I did not add the found content to the grid yet, but this is the whole idea behind it: anything that matches the pattern is shown in the grid.

Suppose auroraglez78@ gmail.com is in a file; when I run the app and click the button to find the emails that match the regex pattern ([a-zA-Z0-9]+\@[a-zA-Z]+\.[a-zA-Z]+), it does not return any results.

Could you please explain how to identify the emails, etc., at index time?

Thanks,
Lisa

I was wondering if you could write an ingest script processor which tries to apply the regex, or maybe a grok processor, or even better a dissect processor.

You can use one of those in an ingest pipeline.
FSCrawler lets you call an ingest pipeline before indexing the actual documents.
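For reference, a sketch of what such a pipeline could look like (the pipeline id, field name, and grok pattern here are illustrative assumptions, not tested against a cluster); `%{EMAILADDRESS}` is one of the grok patterns bundled with Elasticsearch:

```python
# Hypothetical ingest pipeline that copies the first email address found in
# "content" into a dedicated "email" field at index time.
pipeline_body = {
    "description": "Extract an email address from content at index time",
    "processors": [
        {
            "grok": {
                "field": "content",
                # DATA/GREEDYDATA absorb the surrounding text; the named
                # capture lands in the "email" field.
                "patterns": ["%{DATA}%{EMAILADDRESS:email}%{GREEDYDATA}"],
                "ignore_failure": True,
            }
        }
    ],
}

# With a running cluster, the pipeline could be created with the low-level
# Python client:
# from elasticsearch import Elasticsearch
# client = Elasticsearch("127.0.0.1", port=9200)
# client.ingest.put_pipeline(id="extract-emails", body=pipeline_body)
```

FSCrawler's job settings would then point at the pipeline (the `elasticsearch.pipeline` option, if I remember the FSCrawler docs correctly) so documents pass through it before being indexed.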

Hi David,
Thank you so much for your reply.
I created a custom analyzer (whitespace tokenizer) and added a mapping to the index, then re-indexed according to your old post (Conflicts with existing mapping).
Then I used the same search query with the regex pattern.
I don't know if this is the best process or not, but it is working as expected.

Thanks for the great support!
-Lisa

Hi David,

I have added more files (around 15,000+) to the same resource and re-indexed using FSCrawler. While re-indexing (debug mode) I am getting the warning below.

    02:43:25,514 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/var/www/html/file-scanner/ESFiles],[test01741.doc]
    02:43:25,514 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/var/www/html/file-scanner/ESFiles, /var/www/html/file-scanner/ESFiles/test01741.doc) = /test01741.doc
    02:43:25,515 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_66.175.209.189/ff9b8a1fc3d994d22f8d920f0b897?pipeline=null
    02:43:25,516 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/var/www/html/file-scanner/ESFiles]...
    02:43:28,043 WARN  [f.p.e.c.f.FsParserAbstract] Can't find stored field name to check existing filenames in path [/var/www/html/file-scanner/ESFiles]. Please set store: true on field [file.filename]
    02:43:28,043 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /var/www/html/file-scanner/ESFiles: Mapping is incorrect: please set stored: true on field [file.filename].
    02:43:28,043 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
    java.lang.RuntimeException: Mapping is incorrect: please set stored: true on field [file.filename].
            at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.getFileDirectory(FsParserAbstract.java:374) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
            at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:309) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
            at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
            at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
    02:43:28,043 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
    02:43:28,130 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [index_66.175.209.189]
    02:43:28,132 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
    02:43:28,132 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
    02:43:28,134 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
    02:43:28,134 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_66.175.209.189] stopped
    02:43:28,135 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [index_66.175.209.189]
    02:43:28,136 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
    02:43:28,136 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
    02:43:28,136 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
    02:43:28,136 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_66.175.209.189] stopped

I am using FSCrawler 2.7 and Elasticsearch 7.5.1.

The reason I am asking this question in this thread is that I created the custom analyzer and it worked fine for around 30-50 files (I am not sure whether this error appeared on the first try). But when I added more files to the resource location and re-indexed, the warning above appeared.
I thought this might be because of dynamic mapping, so I added dynamic: true. Still no change!
I added a new field filename, but there is no change.
Could you please tell me whether my mapping is correct or not?

This is the index I created before re-indexing.

   PUT /index_66.175.209.189/
   {
      "settings": {
        "index": {
          "number_of_shards": 1,
          "number_of_replicas": 1
         },
         "analysis": {
            "analyzer": {
               "my_analyzer": {
                  "type": "custom",
                  "filter": [
                     "lowercase"
                  ],
                  "tokenizer": "whitespace"
               }
            }
         }
      },
      "mappings": {
        "dynamic": true,
         "properties": {
             "file_name": {
               "type": "text",
               "store": true
           },
           "content": {
             "type": "text",
             "analyzer": "my_analyzer"
           }
         }
       }
   }

I deleted the index first, created the index with the settings above, then re-indexed using FSCrawler.
Please help me!

-Lisa

You need to reuse the provided mapping and modify it (see https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#creating-your-own-mapping-analyzers). If you don't do that, you will need to fix some of the fields which are used by FSCrawler behind the scenes.

Here, you specifically need to add to the mapping a field named file.filename, like:

    "file": {
      "properties": {
        "filename": {
          "type": "keyword",
          "store": true
        }
    }

Thank you so much! It is working and I no longer see the warning.
But I see some odd behavior. For example:

GET /index_66.175.209.189/_search
{
   "query": {
      "match": {
         "content": {
            "query": "13#abc",
            "analyzer": "my_analyzer"
         }}
      },
    "highlight" : {
        "require_field_match": false,
        "fields": {
                "content" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] }
        }
    }  
}

In the response, it should show all the files with the hit as <em>13#abc</em>, but only the first file shows the correct hit. The other files are not showing the exact hit value.

Hit response for the file test00001.doc

 "meta" : { },
          "file" : {
            "extension" : "doc",
            "content_type" : "text/plain; charset=ISO-8859-1",
            "created" : "2020-03-18T23:26:02.269+0000",
            "last_modified" : "2020-03-18T23:26:02.269+0000",
            "last_accessed" : "2020-03-24T02:43:19.709+0000",
            "indexing_date" : "2020-03-24T16:43:10.348+0000",
            "filesize" : 144,
            "filename" : "test00001.doc",
            "url" : "file:///var/www/html/file-scanner/ESFiles/test00001.doc"
          },
          "path" : {
            "root" : "ca5776bfff1151c16ccddbc7a154d40",
            "virtual" : "/test00001.doc",
            "real" : "/var/www/html/file-scanner/ESFiles/test00001.doc"
          }
        },
        "highlight" : {
          "content" : [
            "phil-22    pete-34    john-34\n swati@gmail.com \n 123-35-5252 \n 12-32-3525 \n <em>13#abc</em> abc!"
          ]
        }
      },

The other 15,000+ files' responses are as below.

        "meta" : { },
          "file" : {
            "extension" : "doc",
            "content_type" : "text/plain; charset=ISO-8859-1",
            "created" : "2020-03-18T23:34:28.593+0000",
            "last_modified" : "2020-03-18T23:34:28.593+0000",
            "last_accessed" : "2020-03-24T02:43:11.799+0000",
            "indexing_date" : "2020-03-24T16:43:02.220+0000",
            "filesize" : 189,
            "filename" : "test04552.doc",
            "url" : "file:///var/www/html/file-scanner/ESFiles/test04552.doc"
          },
          "path" : {
            "root" : "ca5776bfff1151c16ccddbc7a154d40",
            "virtual" : "/test04552.doc",
            "real" : "/var/www/html/file-scanner/ESFiles/test04552.doc"
          }
        },
        "highlight" : {
          "content" : [
            "john-34\n swati@gmail.com \n bcd01942@gmail.co.in\nbfajk903141@outlook.gov\n123-35-5252 \n 12-32-3525 \n <em>13</em>"
          ]
        }

The same is happening when I try the pattern query:

    s = Search(using=client, index=["index_66.175.209.189"]).query("regexp", content="[0-9]{2}\#[a-zA-Z]+")
    response = s.execute()

Clearly the special characters are getting recognized, but not for all files.

Could you please tell me whether it is because of a mapping issue?

-Lisa

Below is how I created the index.

PUT /index_66.175.209.189/
{
   "settings": {
     "index": {
       "number_of_shards": 1,
       "number_of_replicas": 1
      },
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "whitespace"
            }
         }
      }
   },
   "mappings": {
     "dynamic": true,
      "properties": {
          "file": {
            "properties": {
              "filename": {
                "type": "keyword",
                "store": true
              }
            }
          },
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
}

-Lisa

Could anybody please tell me why this is happening?

-Lisa

Could you open a new question as this one has been solved?

Ideally, could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible and not connected to FSCrawler, as it's not related.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.