Use mapping character filter from file

Hello!

I'm looking to use the mapping character filter over a large number of terms. The basic idea is that I'm trying to standardize street addresses based on the USPS list seen here:
https://pe.usps.com/text/pub28/28apc_002.htm.

I'm fairly sure the best way to do this is with the mapping character filter. Because this is such a large list of values being mapped, I would like to put them in a separate file that the analyzer/character filter reads, rather than mapping the values directly in the filter definition itself as seen here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html.

What are some resources/examples I could use/mimic in order to create this type of file, and how would I use the mapping character filter to read from this file?

Thanks for any/all help!

BUMP

I might have something to help you, but I am a few weeks from releasing it.
I will try to remember to update you once it is done.

Thanks @Ivan, there is no real rush as this is a side project I'm working on. I remember seeing documentation about this in 2.x but I have not been able to find it. Thanks again.

Judging by the code, a path should be supported. Try using mappings_path.
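
Something along these lines might work (untested; the char_filter name, analyzer name, and file name are placeholders, and a relative path should resolve against the config directory, as your error below suggests):

    "analysis": {
        "char_filter": {
            "usps_mappings": {
                "type": "mapping",
                "mappings_path": "analysis/usps_mappings.txt"
            }
        },
        "analyzer": {
            "address_analyzer": {
                "tokenizer": "standard",
                "char_filter": ["usps_mappings"]
            }
        }
    }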


Thanks @Ivan, I'm planning on getting this to work this week. I'll post my results when done.

Hey @Ivan, I have a follow-up question I hope you may be able to answer. After reading your previous response, I created the key => value pairs for my mapping file, which is located on my D: drive. I also changed the char_filter (type "mapping") to use "mappings_path" and pointed it at this file. The settings apply while the index is closed, but when I try to reopen the index I get the following error message:

    {
        "error": {
            "root_cause": [
                {
                    "type": "exception",
                    "reason": "Failed to verify index [wf_loan_sample/wrse5BUyRs-wtVek3vkV_g]"
                }
            ],
            "type": "exception",
            "reason": "Failed to verify index [wf_loan_sample/wrse5BUyRs-wtVek3vkV_g]",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "IOException while reading mappings_path: /etc/elasticsearch/D:\usps_mappings.txt",
                "caused_by": {
                    "type": "file_not_found_exception",
                    "reason": "/etc/elasticsearch/D:\usps_mappings.txt (No such file or directory)"
                }
            }
        },
        "status": 500
    }

My thinking is that this means the file has to be kept server side in /etc/elasticsearch. My hope is that I could keep the file on my ETL server instead. Is this thinking correct, or do you think it's possible to keep the mappings_path file on my ETL server? I'd much prefer to only maintain the file in my ETL environment, rather than having more to worry about server side. Thanks again.

Yes, the file needs to exist on every node that uses it when the index is (re)opened. How were you expecting to connect Elasticsearch to your ETL environment?

@Ivan I connect to my Elasticsearch environments from a Windows or Ubuntu machine, using Python and the 'elasticsearch' package. I keep my mappings and analyzers in separate JSON files, usually in a shared location, or otherwise on my Windows/Ubuntu machine within the project directory. These are then read during ETL and PUT on the index via the elasticsearch package before the bulk indexing begins. Using this method, all the settings and mappings are kept on my machine rather than on the cluster itself.
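
Roughly, that flow looks like the sketch below; the file names and connection details are placeholders, and it assumes an older 'elasticsearch' client where indices.create accepts a body:

    import json
    from elasticsearch import Elasticsearch

    # Placeholder connection; adjust hosts/auth for the real cluster.
    es = Elasticsearch(["localhost:9200"])

    # Settings (analyzers, char_filters) and mappings live in JSON files on the ETL machine.
    with open("settings.json") as f:
        settings = json.load(f)
    with open("mappings.json") as f:
        mappings = json.load(f)

    # Create the index with those settings/mappings before bulk indexing begins.
    es.indices.create(index="wf_loan_sample", body={"settings": settings, "mappings": mappings})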

In relation to my previous question, I could in fact have a separate JSON file with a char_filter that uses the 'mappings' field and fill out all the mappings inline, something like the example below. But it would be a tremendously large char_filter, and it would likely cause some headaches for anyone else who takes a look at the settings on that particular index. This is why I want to use 'mappings_path' instead of 'mappings'. As you just described, it seems the only way I'll be able to use this feature is if I keep the 'mappings_path' file on each node in the cluster. I will do that if necessary, but I'm hesitant, as all the other code has been kept on my ETL machines (or a shared location they read from), and I would rather not start putting index-related code directly on the cluster.

            "cfPunctuation" : {
                "type" : "mapping",
                "description": "Company mapping list to remove punctuation from input string",
                "mappings" : [
                    "\\u0027=>",
                    "\\u0022=>",
                    "\\u002E=>\\u0020"
                    ]
                }
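
For comparison, my understanding is that the mappings_path file would just hold the same kind of entries, one key => value mapping per line. A few example lines of what I have in mind for the USPS file (abbreviations taken from the USPS list linked above):

    STREET => ST
    AVENUE => AVE
    BOULEVARD => BLVD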

Your alternative to using a path is to simply provide the complete list of terms, which is what you are looking to avoid in the first place! Can you set up an external mount that is a shared drive on the cluster nodes? That way you only need to mount a drive once. Elasticsearch snapshots require a shared file system, so perhaps you already have something in place.

I do have a plugin that pulls in the terms from a database, but I only converted the synonym, stop, and keepword filters, not any of the char filters. I'll add that to my TODO list.

@Ivan I do have a shared file store (on AWS S3) that all clusters have access to, as it's where I store snapshots. If I put the .txt file in this shared space, would I just change the mappings_path to access it there, or would I have to have each node in the cluster copy the file from this shared space into /etc/elasticsearch? Thanks again for all your help!

There are utilities to mount an S3 bucket as a local drive. It all depends on your OS and your level of sysadmin abilities.
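
For instance, something along the lines of s3fs-fuse; the bucket name, mount point, and credentials file below are placeholders, and your nodes would then read the mappings file from under the mount:

    # Rough sketch using s3fs-fuse; names are placeholders.
    s3fs my-snapshot-bucket /mnt/s3 -o passwd_file=/etc/passwd-s3fs
    # Then point mappings_path at something like /mnt/s3/usps_mappings.txt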

I was thinking of adding S3 support to my plugin, but I need to figure out conditional dependencies in Gradle first, since I do not want unneeded libs polluting the code.
