Forcing Specific Mapping

I'm trying to link my Elasticsearch index to a text annotation tool called INCEpTION.

It expects a specific mapping:

In order to set up a connection to an Elasticsearch repository, the following information needs to be provided:

  • Remote URL : the URL where the Elasticsearch instance is running (e.g. http://localhost:9200/ )
  • Index Name : the name of the index within the instance (e.g. mycorpus )
  • Search path : the suffix used to access the searching endpoint (usually _search )
  • Object type : the endpoint used to download the full document text (usually texts )
  • Field : the field of the documents in the Elasticsearch repository that is used for matching the search query (default doc.text )

From this information, two URLs are constructed:

  • the search URL: <URL>/<index name>/<search path>
  • the document retrieval URL as: <URL>/<index name>/<object type>/<document id>

From the remote URL field, only the protocol, hostname and port information is used. Any path information appearing after the port number is discarded and replaced by the index name and search path as outlined above.
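
To make the URL construction above concrete, here is a minimal sketch in Python. The function name `build_urls` and its parameter names are my own, not part of INCEpTION; the defaults follow the "usually" values quoted above.

```python
from urllib.parse import urlsplit


def build_urls(remote_url, index_name, search_path="_search", object_type="texts"):
    """Return (search_url, document_url_template) following the rules quoted above.

    Hypothetical helper for illustration; INCEpTION does this internally.
    """
    parts = urlsplit(remote_url)
    # Only protocol, hostname and port are kept; any path after the port is dropped.
    base = f"{parts.scheme}://{parts.netloc}"
    search_url = f"{base}/{index_name}/{search_path}"
    # The document id is left as a placeholder to be filled in per document.
    document_url_template = f"{base}/{index_name}/{object_type}/{{document_id}}"
    return search_url, document_url_template


search_url, document_url = build_urls("http://localhost:9200/some/path", "mycorpus")
print(search_url)
print(document_url.format(document_id="1"))
```

Note that the extra path (`/some/path`) in the remote URL does not survive, which matches the description above.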

The individual documents should contain the following two fields as their source:

  • doc : should contain the subfield text which is the full text of the document
  • metadata : should contain subfields like language , source , timestamp and uri to provide further information about the document

I'm using FSCrawler to build my index from a network drive.

I've tried to adjust the settings inside _default/7/_settings.json and the _settings.yaml without success.

This is the mapping after I changed the settings.
(As you can see, even though I replaced "content" with "doc", it still creates the "content" field):

{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "raw_as_text": {
            "path_match": "meta.raw.*",
            "mapping": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "type": "text"
            }
          }
        }
      ],
      "properties": {
        "attachment": {
          "type": "binary"
        },
        "attributes": {
          "properties": {
            "group": {
              "type": "keyword"
            },
            "owner": {
              "type": "keyword"
            },
            "permissions": {
              "type": "long"
            }
          }
        },
        "content": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "doc": {
          "properties": {
            "text": {
              "type": "text"
            }
          }
        },
        "file": {
          "properties": {
            "checksum": {
              "type": "keyword"
            },
            "content_type": {
              "type": "keyword"
            },
            "created": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "extension": {
              "type": "keyword"
            },
            "filename": {
              "type": "keyword",
              "store": true
            },
            "filesize": {
              "type": "long"
            },
            "indexed_chars": {
              "type": "long"
            },
            "indexing_date": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "last_accessed": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "last_modified": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "url": {
              "type": "keyword",
              "index": false
            }
          }
        },
        "meta": {
          "properties": {
            "author": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "created": {
              "type": "date"
            },
            "creator_tool": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "date": {
              "type": "date"
            },
            "format": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "keywords": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "language": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "metadata_date": {
              "type": "date"
            },
            "modifier": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "print_date": {
              "type": "date"
            },
            "publisher": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "raw": {
              "properties": {
                "Comments": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "Company": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "Content-Type": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "Content-Type-Parser-Override": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "PTEX:Fullbanner": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "SourceModified": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "X-TIKA:EXCEPTION:write_limit_reached": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "X-TIKA:Parsed-By": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "access_permission:assemble_document": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "access_permission:can_modify": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "access_permission:can_print": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "access_permission:can_print_degraded": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "access_permission:extract_content": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "access_permission:extract_for_accessibility": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "access_permission:fill_in_form": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "access_permission:modify_annotations": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "cp:revision": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "custom:ContentTypeId": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "dc:creator": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "dc:format": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "dc:language": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "dc:publisher": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "dc:subject": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "dc:title": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "dcterms:created": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "dcterms:modified": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "extended-properties:AppVersion": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "extended-properties:Application": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "extended-properties:Company": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "extended-properties:DocSecurityString": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "extended-properties:Template": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "extended-properties:TotalTime": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:character-count": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:character-count-with-spaces": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:keyword": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:last-author": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:line-count": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:page-count": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:paragraph-count": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:print-date": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "meta:word-count": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:PDFVersion": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:charsPerPage": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:created": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:creator": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:creator_tool": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:custom:Comments": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:custom:Company": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:custom:PTEX:Fullbanner": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:custom:SourceModified": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:keywords": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:modified": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:producer": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:title": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:docinfo:trapped": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:encrypted": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:hasCollection": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:hasMarkedContent": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:hasXFA": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:hasXMP": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:producer": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "pdf:unmappedUnicodeCharsPerPage": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "resourceName": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmp:About": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmp:CreateDate": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmp:CreatorTool": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmp:MetadataDate": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmp:ModifyDate": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmpMM:DerivedFrom:DocumentID": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmpMM:DerivedFrom:InstanceID": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmpMM:DocumentID": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmpMM:History:Action": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmpMM:History:InstanceID": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmpMM:History:SoftwareAgent": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmpMM:History:When": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "xmpTPg:NPages": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            },
            "title": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "metadata": {
          "properties": {
            "altitude": {
              "type": "text"
            },
            "author": {
              "type": "text"
            },
            "comments": {
              "type": "text"
            },
            "contributor": {
              "type": "text"
            },
            "coverage": {
              "type": "text"
            },
            "created": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "creator_tool": {
              "type": "keyword"
            },
            "date": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "description": {
              "type": "text"
            },
            "format": {
              "type": "text"
            },
            "identifier": {
              "type": "text"
            },
            "keywords": {
              "type": "text"
            },
            "language": {
              "type": "keyword"
            },
            "latitude": {
              "type": "text"
            },
            "longitude": {
              "type": "text"
            },
            "metadata_date": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "modifier": {
              "type": "text"
            },
            "print_date": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "publisher": {
              "type": "text"
            },
            "rating": {
              "type": "byte"
            },
            "relation": {
              "type": "text"
            },
            "rights": {
              "type": "text"
            },
            "source": {
              "type": "text"
            },
            "timestamp": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "title": {
              "type": "text"
            },
            "type": {
              "type": "text"
            },
            "uri": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "path": {
          "properties": {
            "real": {
              "type": "keyword",
              "fields": {
                "fulltext": {
                  "type": "text"
                },
                "tree": {
                  "type": "text",
                  "analyzer": "fscrawler_path",
                  "fielddata": true
                }
              }
            },
            "root": {
              "type": "keyword"
            },
            "virtual": {
              "type": "keyword",
              "fields": {
                "fulltext": {
                  "type": "text"
                },
                "tree": {
                  "type": "text",
                  "analyzer": "fscrawler_path",
                  "fielddata": true
                }
              }
            }
          }
        }
      }
    }
  }
}

The example given by INCEpTION creates a document manually with curl and indexes it like this:
(I can move doc->text up a level and change the field name to content, and it works fine.)

{
  "metadata": {
    "language": "en",
    "source": "My favourite document collection",
    "timestamp": "2011/11/11 11:11",
    "uri": "http://the.internet.com/my/document/collection/document1.txt",
    "title": "Cool Document Title"
  },
  "doc": {
    "text": "This is a test document"
  }
}

Resulting mapping below:

{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "doc": {
          "properties": {
            "text": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "metadata": {
          "properties": {
            "language": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "source": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "timestamp": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "title": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "uri": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      }
    }
  }
}

I've tried to replicate the mapping from the example in the _settings.json file, in the same place where "content" sits, but I know there is something I fundamentally misunderstand about how the mapping works. Why did it put the document's content in the "content" field?

How does it decide where any part of a document goes?

Is there a way to recreate the specific mapping of objects and fields given in the INCEpTION example, so that the content of the document, both text and images, is mapped to doc->text?

FSCrawler writes the content to the content field.
There is no way to change that in FSCrawler.

I'd recommend changing the Field setting in INCEpTION to content.

Otherwise, you can always create an ingest pipeline and set it in FSCrawler.
This ingest pipeline could rename the content field to doc.text.
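
As a rough sketch of that second option, the Python below builds a pipeline body using Elasticsearch's `rename` processor and prints the request you would send. The pipeline name `fscrawler_to_inception` is made up for illustration, and `apply_rename` is only a local stand-in to show the resulting document shape, not how Elasticsearch implements the processor. I believe the pipeline is then referenced from the `pipeline` option under `elasticsearch` in FSCrawler's _settings.yaml, but check that against the documentation for your FSCrawler version.

```python
import copy
import json

PIPELINE_NAME = "fscrawler_to_inception"  # hypothetical name, pick your own

# Body for: PUT _ingest/pipeline/<PIPELINE_NAME>
pipeline = {
    "description": "Move FSCrawler's content field to doc.text",
    "processors": [
        {
            "rename": {
                "field": "content",
                "target_field": "doc.text",
                # Skip documents that have no extracted content instead of failing.
                "ignore_missing": True,
            }
        }
    ],
}


def apply_rename(source, field="content", target="doc.text"):
    """Local illustration of what the rename above does to one document."""
    result = copy.deepcopy(source)
    if field not in result:
        return result
    value = result.pop(field)
    *parents, leaf = target.split(".")
    node = result
    for key in parents:
        # Create intermediate objects (here: "doc") as needed.
        node = node.setdefault(key, {})
    node[leaf] = value
    return result


print(f"PUT _ingest/pipeline/{PIPELINE_NAME}")
print(json.dumps(pipeline, indent=2))
print(json.dumps(apply_rename({"content": "This is a test document"})))
```

This only moves the text. INCEpTION also expects a metadata object, so further processors would be needed if the fields FSCrawler writes under meta should appear there.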

Thanks David, appreciate you.
