Use mapping character filter from file

Hello!

I'm looking to use the mapping character filter over a large number of terms. The basic idea is that I'm trying to standardize street addresses based on the USPS list seen here:
https://pe.usps.com/text/pub28/28apc_002.htm.

I'm fairly sure the best way to do this is with the mapping character filter. Because this is such a large list of values being mapped, I would like to put them in a separate file that the analyzer/character filter reads, rather than mapping the values directly in the filter definition itself as seen here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html.

What are some resources/examples I could use/mimic in order to create this type of file, and how would I use the mapping character filter to read from this file?

Thanks for any/all help!

BUMP

I might have something to help you, but I am a few weeks from releasing it.
I will try to remember to update you once it is done.

Thanks @Ivan, there is no real rush as this is a side project I'm working on. I remember seeing documentation about this in 2.x but I have not been able to find it. Thanks again.

Judging by the code, a path should be supported. Try using mappings_path.
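
Something along these lines might work (untested; the char_filter name, analyzer name, and file name are placeholders, and a relative path should resolve against the config directory, as your error below suggests):

    "analysis": {
        "char_filter": {
            "usps_mappings": {
                "type": "mapping",
                "mappings_path": "analysis/usps_mappings.txt"
            }
        },
        "analyzer": {
            "address_analyzer": {
                "tokenizer": "standard",
                "char_filter": ["usps_mappings"]
            }
        }
    }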


Thanks @Ivan, I'm planning on getting this to work this week. I'll post my results when done.

Hey @Ivan, I have a follow-up question I hope you may be able to answer. After reading your previous response, I created the key => value pairs for my mapping file, which is located on my D: drive. I also changed the char_filter (type "mapping") to use "mappings_path" and pointed it at this file. The settings apply while the index is closed, but when I try to reopen the index I get the following error message:

    {
        "error": {
            "root_cause": [
                {
                    "type": "exception",
                    "reason": "Failed to verify index [wf_loan_sample/wrse5BUyRs-wtVek3vkV_g]"
                }
            ],
            "type": "exception",
            "reason": "Failed to verify index [wf_loan_sample/wrse5BUyRs-wtVek3vkV_g]",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "IOException while reading mappings_path: /etc/elasticsearch/D:\usps_mappings.txt",
                "caused_by": {
                    "type": "file_not_found_exception",
                    "reason": "/etc/elasticsearch/D:\usps_mappings.txt (No such file or directory)"
                }
            }
        },
        "status": 500
    }

My thinking is that this means the file has to be kept server side in /etc/elasticsearch. My hope is that I could keep the file on my ETL server instead. Is this thinking correct, or do you think it's possible to keep the mappings_path file on my ETL server? I'd much prefer to only maintain the file in my ETL environment, rather than having more to worry about server side. Thanks again.

Yes, the file needs to exist on every node that uses it when the index is (re)opened. How were you expecting to connect Elasticsearch to your ETL environment?

@Ivan I connect to my Elasticsearch environments from a Windows or Ubuntu machine, using Python and the 'elasticsearch' package. I keep my mappings and analyzers in separate JSON files, usually in a shared location, or otherwise on my Windows/Ubuntu machine within the project directory. These are then read during ETL and PUT on the index via the elasticsearch package before the bulk indexing begins. Using this method, all the settings and mappings are kept on my machine rather than on the cluster itself.
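
Roughly, that flow looks like the sketch below; the file names and connection details are placeholders, and it assumes an older 'elasticsearch' client where indices.create accepts a body:

    import json
    from elasticsearch import Elasticsearch

    # Placeholder connection; adjust hosts/auth for the real cluster.
    es = Elasticsearch(["localhost:9200"])

    # Settings (analyzers, char_filters) and mappings live in JSON files on the ETL machine.
    with open("settings.json") as f:
        settings = json.load(f)
    with open("mappings.json") as f:
        mappings = json.load(f)

    # Create the index with those settings/mappings before bulk indexing begins.
    es.indices.create(index="wf_loan_sample", body={"settings": settings, "mappings": mappings})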

In relation to my previous question, I could in fact have a separate JSON file with a char_filter that uses the 'mappings' field and fill out all the mappings inline, something like the example below. But it would be a tremendously large char_filter, and it would likely cause some headaches for anyone else who takes a look at the settings on that particular index. This is why I want to use 'mappings_path' instead of 'mappings'. As you just described, it seems the only way I'll be able to use this feature is if I keep the 'mappings_path' file on each node in the cluster. I will do that if necessary, but I'm hesitant, as all the other code has been kept on my ETL machines (or a shared location they read from), and I would rather not start putting index-related code directly on the cluster.

            "cfPunctuation" : {
                "type" : "mapping",
                "description": "Company mapping list to remove punctuation from input string",
                "mappings" : [
                    "\\u0027=>",
                    "\\u0022=>",
                    "\\u002E=>\\u0020"
                    ]
                }
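
For comparison, my understanding is that the mappings_path file would just hold the same kind of entries, one key => value mapping per line. A few example lines of what I have in mind for the USPS file (abbreviations taken from the USPS list linked above):

    STREET => ST
    AVENUE => AVE
    BOULEVARD => BLVD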

Your alternative to using a path is to simply provide the complete list of terms, which is what you are looking to avoid in the first place! Can you set up an external mount that is a shared drive on the cluster nodes? That way you only need to mount a drive once. Elasticsearch snapshots require a shared file system, so perhaps you already have something in place.

I do have a plugin that pulls in the terms from a database, but I only converted the synonym, stop, and keepword filters, not any of the char filters. I'll add that to my TODO list.

@Ivan I do have a shared file store (on AWS S3) that all clusters have access to, as it's where I store snapshots. If I put the .txt file in this shared space, would I just change the mappings_path to access it there, or would I have to have each node in the cluster copy the file from this shared space into /etc/elasticsearch? Thanks again for all your help!

There are utilities to mount an S3 bucket as a local drive. It all depends on your OS and your level of sysadmin abilities.
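
For instance, something along the lines of s3fs-fuse; the bucket name, mount point, and credentials file below are placeholders, and your nodes would then read the mappings file from under the mount:

    # Rough sketch using s3fs-fuse; names are placeholders.
    s3fs my-snapshot-bucket /mnt/s3 -o passwd_file=/etc/passwd-s3fs
    # Then point mappings_path at something like /mnt/s3/usps_mappings.txt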

I was thinking of adding S3 support to my plugin, but I need to figure out conditional dependencies in Gradle first, since I do not want unneeded libs polluting the code.
