What should fscrawler mapping look like to index each pdf document as a single unit of text?

neergttocsdivad · September 9, 2019, 10:50am

ref.: https://fscrawler.readthedocs.io/en/fscrawler-2.5/admin/fs/elasticsearch.html#creating-your-own-mapping-analyzers

I have read through the above page but still don't know what to do.

Below is the auto-generated mapping.
I guess most of it can be removed, but I don't know which parts can be dropped without breaking the indexing.

{
  "mapping": {
    "dynamic_templates": [
      {
        "raw_as_text": {
          "path_match": "meta.raw.*",
          "mapping": {
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            },
            "type": "text"
          }
        }
      }
    ],
    "properties": {
      "attachment": {
        "type": "binary"
      },
      "attributes": {
        "properties": {
          "group": {
            "type": "keyword"
          },
          "owner": {
            "type": "keyword"
          }
        }
      },
      "content": {
        "type": "text"
      },
      "file": {
        "properties": {
          "checksum": {
            "type": "keyword"
          },
          "content_type": {
            "type": "keyword"
          },
          "created": {
            "type": "date",
            "format": "date_optional_time"
          },
          "extension": {
            "type": "keyword"
          },
          "filename": {
            "type": "keyword",
            "store": true
          },
          "filesize": {
            "type": "long"
          },
          "indexed_chars": {
            "type": "long"
          },
          "indexing_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "last_accessed": {
            "type": "date",
            "format": "date_optional_time"
          },
          "last_modified": {
            "type": "date",
            "format": "date_optional_time"
          },
          "url": {
            "type": "keyword",
            "index": false
          }
        }
      },
      "meta": {
        "properties": {
          "altitude": {
            "type": "text"
          },
          "author": {
            "type": "text"
          },
          "comments": {
            "type": "text"
          },
          "contributor": {
            "type": "text"
          },
          "coverage": {
            "type": "text"
          },
          "created": {
            "type": "date",
            "format": "date_optional_time"
          },
          "creator_tool": {
            "type": "keyword"
          },
          "date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "description": {
            "type": "text"
          },
          "format": {
            "type": "text"
          },
          "identifier": {
            "type": "text"
          },
          "keywords": {
            "type": "text"
          },
          "language": {
            "type": "keyword"
          },
          "latitude": {
            "type": "text"
          },
          "longitude": {
            "type": "text"
          },
          "metadata_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "modifier": {
            "type": "text"
          },
          "print_date": {
            "type": "date",
            "format": "date_optional_time"
          },
          "publisher": {
            "type": "text"
          },
          "rating": {
            "type": "byte"
          },
          "relation": {
            "type": "text"
          },
          "rights": {
            "type": "text"
          },
          "source": {
            "type": "text"
          },
          "title": {
            "type": "text"
          },
          "type": {
            "type": "text"
          }
        }
      },
      "path": {
        "properties": {
          "real": {
            "type": "keyword",
            "fields": {
              "fulltext": {
                "type": "text"
              },
              "tree": {
                "type": "text",
                "analyzer": "fscrawler_path",
                "fielddata": true
              }
            }
          },
          "root": {
            "type": "keyword"
          },
          "virtual": {
            "type": "keyword",
            "fields": {
              "fulltext": {
                "type": "text"
              },
              "tree": {
                "type": "text",
                "analyzer": "fscrawler_path",
                "fielddata": true
              }
            }
          }
        }
      }
    }
  }
}

dadoonet · September 10, 2019, 11:27am

The question is:

What should fscrawler mapping look like to index each pdf document as a single unit of text?

That's the default behavior of FSCrawler. It's not related to the mapping.
What is the exact problem you're facing?

neergttocsdivad · September 11, 2019, 7:03pm

Hi,
It isn't really a problem.
Since I am interested in performing corpus-linguistic analyses of the text, I don't need all the metadata, and I imagined that I could save indexing time and disk space by removing lines from the _mappings.
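
For readers with the same goal, here is a minimal sketch of how such a trimmed mapping could be derived. It is not an FSCrawler feature: `prune_mapping` is a hypothetical helper, the `full` dictionary is a shortened stand-in for the auto-generated mapping above, and the choice of `content` and `file.filename` as the fields to keep is an assumption. Removing a field from the mapping does not by itself stop Elasticsearch from mapping it again dynamically when a document arrives, so the sketch also sets `"dynamic": false`, which tells Elasticsearch not to index unmapped fields.

```python
import json


def prune_mapping(mapping, keep):
    """Return a reduced mapping holding only the dotted field paths in `keep`.

    `mapping` is the object that contains the top-level "properties" key.
    Hypothetical helper for illustration; not part of FSCrawler.
    """
    # "dynamic": False asks Elasticsearch not to index fields missing from the mapping.
    pruned = {"dynamic": False, "properties": {}}
    for path in keep:
        src = mapping["properties"]
        dst = pruned["properties"]
        parts = path.split(".")
        for i, part in enumerate(parts):
            node = src[part]
            if i == len(parts) - 1:
                # Leaf of the requested path: copy its definition as-is.
                dst[part] = node
            else:
                # Intermediate object: descend, creating the parent if needed.
                dst = dst.setdefault(part, {"properties": {}})["properties"]
                src = node["properties"]
    return pruned


# Shortened stand-in for the auto-generated mapping posted above.
full = {
    "properties": {
        "attachment": {"type": "binary"},
        "content": {"type": "text"},
        "file": {
            "properties": {
                "filename": {"type": "keyword", "store": True},
                "filesize": {"type": "long"},
            }
        },
        "meta": {"properties": {"author": {"type": "text"}}},
    }
}

# Assumed selection: the extracted text plus the file name to identify each document.
slim = prune_mapping(full, ["content", "file.filename"])
print(json.dumps(slim, indent=2))
```

Two caveats apply. With `"dynamic": false` the unmapped fields are not indexed, but they still remain in each document's `_source`, so the disk savings may be smaller than expected. The sketch also drops the `path` fields, which are the only ones in the posted mapping that use the `fscrawler_path` analyzer. How to supply a custom mapping file to FSCrawler is described on the documentation page linked at the top of the thread.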

system · October 9, 2019, 7:03pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
