Need help with searching text with special characters

Hi,

I have created an index for my files on a server using FSCrawler, and I am using an elasticsearch_dsl & Python program to run match queries. It works fine for any text / numbers.
But I need to search for text with special characters, such as email addresses, dates of birth, etc. To find anything that looks like an email, I am using a regex pattern.

        client = Elasticsearch('127.0.0.1', port=9200)

        s = Search(using=client, index=["index_1","index_2"]).query("regexp",  content="[a-zA-Z0-9]+\@[a-zA-Z]+\.[a-zA-Z]+")
        s = s[0:9999]
        s = s.highlight('content')
        response = s.execute()

But the special characters are not reflected in the results.

I recently learned that they are being ignored because of the standard tokenizer. But I don't know which tokenizer to use for regex patterns. I have created regex patterns for email, date of birth, phone number, SSN, etc.
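For illustration, here is a rough pure-Python sketch of what is going on (the real standard tokenizer follows Unicode word-boundary rules, so the token split below is only an approximation): a `regexp` query is matched against individual tokens and must match a whole token, so once the tokenizer has split an email on `@`, no single token can match an email pattern.

```python
import re

# Hypothetical sample text containing an email address.
text = "contact auroraglez78@gmail.com for details"

# Very rough approximation of the standard tokenizer: it splits on "@"
# (among other things), so the email is broken into separate tokens.
standard_like_tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9.]*", text)
# -> ['contact', 'auroraglez78', 'gmail.com', 'for', 'details']

# The whitespace tokenizer keeps the email address as a single token.
whitespace_tokens = text.split()
# -> ['contact', 'auroraglez78@gmail.com', 'for', 'details']

# Elasticsearch regexp queries are anchored: the pattern must match a
# whole token, similar to re.fullmatch in Python.
email_pattern = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]+")

print(any(email_pattern.fullmatch(t) for t in standard_like_tokens))  # False
print(any(email_pattern.fullmatch(t) for t in whitespace_tokens))     # True
```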

I saw grok patterns too, but I would have to create custom patterns for the above-mentioned types, and I am also lost on how to incorporate them into elasticsearch_dsl and Python.

Could anybody please help me with the best way to achieve search with regex pattern matching?

Thanks,
Lisa.

What is the goal of this? What is the use case?

If you want to identify documents that contain email, credit cards, ... then I think you should do that at index time instead of trying to do that at search time.

Could you explain if you really want to do that at search time?

If so, could you share an example of a piece of text which should match the regex but does not? Make the example totally unconnected to FSCrawler; FSCrawler is just a way to collect text here. So in your example, just use one single field, like:

{
  "content": "foo bar"
}

So could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

Hi David,

Thanks for quick reply!
So the idea is to show all the emails in a file repository when you click a button, not to search for an individual email id.
For this I am using Flask, elasticsearch_dsl, and Python.
To find all emails, I am using the regex pattern [a-zA-Z0-9]+@[a-zA-Z]+.[a-zA-Z]+ (I know this is not the right one, since @ and . have other meanings. When I try [a-zA-Z0-9]+\@[a-zA-Z]+\.[a-zA-Z]+, my app returns 0 results, but just to show you the sample output I am using this pattern.)
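As a side note on the two patterns (a pure-Python sketch; Elasticsearch actually uses Lucene's regexp syntax, which differs in some details, e.g. patterns are implicitly anchored to the whole token): an unescaped `.` matches any character, so the loose pattern also accepts strings that are not emails.

```python
import re

loose = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+.[a-zA-Z]+")    # "." matches any character
strict = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]+")  # "\." matches a literal dot only

print(bool(loose.fullmatch("user@gmail.com")))   # True
print(bool(loose.fullmatch("user@gmail-com")))   # True  -- not an email, still matches
print(bool(strict.fullmatch("user@gmail-com")))  # False
```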

Here is the code:
from flask import Flask, render_template, request
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import os

# setting up the flask app and elastic search api connection
app = Flask(__name__)
client = Elasticsearch('127.0.0.1', port=9200)

# render the homepage when the URL is hit
@app.route('/')
def home():
    return render_template('home.html')

# defining the search response
@app.route('/search/email-results', methods=['GET', 'POST'])
def searchEmail():
    s = Search(using=client, index=["index_1","index_2"]).query("regexp",  content="[a-zA-Z0-9]+@[a-zA-Z]+.[a-zA-Z]+")
    response = s.execute()
    return render_template('Results.html', response=response)

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000,debug=True)

When you run this app, you see a button on a page to find emails in the index. When you click it, you see the file name, location, etc., something like below.

I did not add the found content to the grid yet, but this is the whole idea behind it: anything that matches the pattern is shown in the grid.

Suppose auroraglez78@ gmail.com is in a file; when I run the app and click the button to find the emails that match the regex pattern ([a-zA-Z0-9]+\@[a-zA-Z]+\.[a-zA-Z]+), it does not return any results.

Could you please explain how to identify the emails, etc., at index time?

Thanks,
Lisa

I was wondering if you could write an ingest script processor which tries to apply the regex, or maybe a grok processor, or even better a dissect processor.

You can use one of those in an ingest pipeline.
FSCrawler lets you call an ingest pipeline before indexing the actual documents.
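For reference, a sketch of what such a pipeline could look like (the pipeline id, field name, and grok pattern here are illustrative assumptions, not tested against a cluster); `%{EMAILADDRESS}` is one of the grok patterns bundled with Elasticsearch:

```python
# Hypothetical ingest pipeline that copies the first email address found in
# "content" into a dedicated "email" field at index time.
pipeline_body = {
    "description": "Extract an email address from content at index time",
    "processors": [
        {
            "grok": {
                "field": "content",
                # DATA/GREEDYDATA absorb the surrounding text; the named
                # capture lands in the "email" field.
                "patterns": ["%{DATA}%{EMAILADDRESS:email}%{GREEDYDATA}"],
                "ignore_failure": True,
            }
        }
    ],
}

# With a running cluster, the pipeline could be created with the low-level
# Python client:
# from elasticsearch import Elasticsearch
# client = Elasticsearch("127.0.0.1", port=9200)
# client.ingest.put_pipeline(id="extract-emails", body=pipeline_body)
```

FSCrawler's job settings would then point at the pipeline (the `elasticsearch.pipeline` option, if I remember the FSCrawler docs correctly) so documents pass through it before being indexed.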

Hi David,
Thank you so much for your reply.
I created a custom analyzer (whitespace tokenizer) and added a mapping to the index, then re-indexed according to your old post (Conflicts with existing mapping).
Then I used the same search query with the regex pattern.
I don't know if this is the best process or not, but it is working as expected.

Thanks for the great support!
-Lisa

Hi David,

I have added more files (around 15,000+) to the same resource and re-indexed using FSCrawler. While re-indexing (debug mode) I am getting the warning below.

    02:43:25,514 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/var/www/html/file-scanner/ESFiles],[test01741.doc]
    02:43:25,514 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/var/www/html/file-scanner/ESFiles, /var/www/html/file-scanner/ESFiles/test01741.doc) = /test01741.doc
    02:43:25,515 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_66.175.209.189/ff9b8a1fc3d994d22f8d920f0b897?pipeline=null
    02:43:25,516 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/var/www/html/file-scanner/ESFiles]...
    02:43:28,043 WARN  [f.p.e.c.f.FsParserAbstract] Can't find stored field name to check existing filenames in path [/var/www/html/file-scanner/ESFiles]. Please set store: true on field [file.filename]
    02:43:28,043 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /var/www/html/file-scanner/ESFiles: Mapping is incorrect: please set stored: true on field [file.filename].
    02:43:28,043 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
    java.lang.RuntimeException: Mapping is incorrect: please set stored: true on field [file.filename].
            at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.getFileDirectory(FsParserAbstract.java:374) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
            at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:309) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
            at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
            at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
    02:43:28,043 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
    02:43:28,130 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [index_66.175.209.189]
    02:43:28,132 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
    02:43:28,132 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
    02:43:28,134 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
    02:43:28,134 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_66.175.209.189] stopped
    02:43:28,135 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [index_66.175.209.189]
    02:43:28,136 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
    02:43:28,136 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
    02:43:28,136 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
    02:43:28,136 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_66.175.209.189] stopped

I am using FSCrawler 2.7 and Elasticsearch 7.5.1.

The reason I am asking this question in this thread is that I created the custom analyzer and it worked fine for around 30-50 files (I am not sure whether this error appeared on the first try). But when I added more files to the resource location and re-indexed, the warning above appeared.
I thought this might be because of dynamic mapping, so I added dynamic: true. Still no change!
I added a new field filename, but there is no change.
Could you please tell me whether my mapping is correct or not?

This is the index I created before re-indexing.

   PUT /index_66.175.209.189/
   {
      "settings": {
        "index": {
          "number_of_shards": 1,
          "number_of_replicas": 1
         },
         "analysis": {
            "analyzer": {
               "my_analyzer": {
                  "type": "custom",
                  "filter": [
                     "lowercase"
                  ],
                  "tokenizer": "whitespace"
               }
            }
         }
      },
      "mappings": {
        "dynamic": true,
         "properties": {
             "file_name": {
               "type": "text",
               "store": true
           },
           "content": {
             "type": "text",
             "analyzer": "my_analyzer"
           }
         }
       }
   }

I deleted the index first, created the index with the settings above, then re-indexed using FSCrawler.
Please help me!

-Lisa

You need to reuse the provided mapping and modify it (see https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#creating-your-own-mapping-analyzers). If you don't do that, you will need to fix some of the fields which are used by FSCrawler behind the scenes.

Here, you specifically need to add to the mapping a field named file.filename, like:

    "file": {
      "properties": {
        "filename": {
          "type": "keyword",
          "store": true
        }
    }

Thank you so much! It is working and I no longer see the warning.
But I see some odd behavior. For example:

GET /index_66.175.209.189/_search
{
   "query": {
      "match": {
         "content": {
            "query": "13#abc",
            "analyzer": "my_analyzer"
         }}
      },
    "highlight" : {
        "require_field_match": false,
        "fields": {
                "content" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] }
        }
    }  
}

In the response, it should show all the files with the hit as <em>13#abc</em>, but only the first file shows the correct hit. The other files are not showing the exact hit value.

Hit response for the file test00001.doc

 "meta" : { },
          "file" : {
            "extension" : "doc",
            "content_type" : "text/plain; charset=ISO-8859-1",
            "created" : "2020-03-18T23:26:02.269+0000",
            "last_modified" : "2020-03-18T23:26:02.269+0000",
            "last_accessed" : "2020-03-24T02:43:19.709+0000",
            "indexing_date" : "2020-03-24T16:43:10.348+0000",
            "filesize" : 144,
            "filename" : "test00001.doc",
            "url" : "file:///var/www/html/file-scanner/ESFiles/test00001.doc"
          },
          "path" : {
            "root" : "ca5776bfff1151c16ccddbc7a154d40",
            "virtual" : "/test00001.doc",
            "real" : "/var/www/html/file-scanner/ESFiles/test00001.doc"
          }
        },
        "highlight" : {
          "content" : [
            "phil-22    pete-34    john-34\n swati@gmail.com \n 123-35-5252 \n 12-32-3525 \n <em>13#abc</em> abc!"
          ]
        }
      },

The other 15,000+ files' responses are as below.

        "meta" : { },
          "file" : {
            "extension" : "doc",
            "content_type" : "text/plain; charset=ISO-8859-1",
            "created" : "2020-03-18T23:34:28.593+0000",
            "last_modified" : "2020-03-18T23:34:28.593+0000",
            "last_accessed" : "2020-03-24T02:43:11.799+0000",
            "indexing_date" : "2020-03-24T16:43:02.220+0000",
            "filesize" : 189,
            "filename" : "test04552.doc",
            "url" : "file:///var/www/html/file-scanner/ESFiles/test04552.doc"
          },
          "path" : {
            "root" : "ca5776bfff1151c16ccddbc7a154d40",
            "virtual" : "/test04552.doc",
            "real" : "/var/www/html/file-scanner/ESFiles/test04552.doc"
          }
        },
        "highlight" : {
          "content" : [
            "john-34\n swati@gmail.com \n bcd01942@gmail.co.in\nbfajk903141@outlook.gov\n123-35-5252 \n 12-32-3525 \n <em>13</em>"
          ]
        }

The same is happening when I try the pattern query:

    s = Search(using=client, index=["index_66.175.209.189"]).query("regexp", content="[0-9]{2}\#[a-zA-Z]+")
    response = s.execute()

Clearly the special characters are getting recognized, but not for all files.

Could you please tell me whether it is because of a mapping issue?

-Lisa

Below is how I created the index.

PUT /index_66.175.209.189/
{
   "settings": {
     "index": {
       "number_of_shards": 1,
       "number_of_replicas": 1
      },
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "whitespace"
            }
         }
      }
   },
   "mappings": {
     "dynamic": true,
      "properties": {
          "file": {
            "properties": {
              "filename": {
                "type": "keyword",
                "store": true
              }
            }
          },
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
}

-Lisa

Could anybody please tell me why this is happening?

-Lisa

Could you open a new question as this one has been solved?

Ideally, could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible and not connected to FSCrawler, as it's not related.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.