How to execute multi search on Elasticsearch

Hi,
I have indexed metadata from documents with FSCrawler, so in ES I have an index, and a search on it returns hits that look like this:

 {
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 19,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "BigData.pptx",
        "_score" : 1.0,
        "_source" : {
          "meta" : {
            "author" : "Jin",
            "title" : "PowerPoint Presentation",
            "date" : "2012-01-12T20:47:47.000+0000",
            "modifier" : "Jin",
            "created" : "2012-01-12T19:50:20.000+0000",
            "raw" : {
              "date" : "2012-01-12T21:47:47Z",
              "cp:revision" : "20",
              "Total-Time" : "57",
              "extended-properties:AppVersion" : "14.0000",
              "meta:paragraph-count" : "273",
              "meta:word-count" : "1319",
              "extended-properties:PresentationFormat" : "On-screen Show (4:3)",
              "dc:creator" : "Jin",
              "Word-Count" : "1319",
              "dcterms:created" : "2012-01-12T20:50:20Z",
              "dcterms:modified" : "2012-01-12T21:47:47Z",
              "Last-Modified" : "2012-01-12T21:47:47Z",
              "title" : "PowerPoint Presentation",
              "Last-Save-Date" : "2012-01-12T21:47:47Z",
              "Paragraph-Count" : "273",
              "meta:save-date" : "2012-01-12T21:47:47Z",
              "dc:title" : "PowerPoint Presentation",
              "Application-Name" : "Microsoft Office PowerPoint",
              "extended-properties:TotalTime" : "57",
              "modified" : "2012-01-12T21:47:47Z",
              "Notes" : "17",
              "Content-Type" : "application/vnd.openxmlformats-officedocument.presentationml.presentation",
              "Slide-Count" : "47",
              "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
              "creator" : "Jin",
              "extended-properties:Notes" : "17",
              "meta:author" : "Jin",
              "meta:creation-date" : "2012-01-12T20:50:20Z",
              "extended-properties:Application" : "Microsoft Office PowerPoint",
              "meta:last-author" : "Jin",
              "meta:slide-count" : "47",
              "Creation-Date" : "2012-01-12T20:50:20Z",
              "xmpTPg:NPages" : "47",
              "resourceName" : "BigData.pptx",
              "Last-Author" : "Jin",
              "Revision-Number" : "20",
              "Application-Version" : "14.0000",
              "Author" : "Jin",
              "Presentation-Format" : "On-screen Show (4:3)"
            }
          },
          "file" : {
            "extension" : "pptx",
            "content_type" : "application/vnd.openxmlformats-officedocument.presentationml.presentation",
            "created" : "2019-07-08T10:45:34.000+0000",
            "last_modified" : "2019-07-08T10:45:34.000+0000",
            "last_accessed" : "2019-07-17T09:50:04.000+0000",
            "indexing_date" : "2019-07-17T13:32:13.807+0000",
            "filesize" : 2496305,
            "filename" : "BigData.pptx",
            "url" : "file:///home/ubuntu/Downloads/FSCrawler/BigData.pptx",
            "indexed_chars" : 0
          },
          "path" : {
            "root" : "4d1f91a687e6d7c4e1dd3e1cbb4bd2",
            "virtual" : "/BigData.pptx",
            "real" : "/home/ubuntu/Downloads/FSCrawler/BigData.pptx"
          }
        }
      },

and so on; that is just one hit from the hits array.

I want to get the matching documents when I give one or more words, searching across the metadata.
For example, the metadata here has many fields, such as meta.raw.date, meta.raw.title, etc.
If I search for the words 'Big Data' across the whole 'meta' object, I should get a result, because the field meta.raw.resourceName contains "BigData.pptx".
I couldn't find a way to run such a search. I've tried 'more like this' and 'multi_match', but the problem is that I have to put the exact field in the query (I have to put "meta.raw.resourceName": "BigData.pptx") and the exact value 'BigData.pptx' to get the result; if I search for just 'Big', I get nothing.
Can anyone help me?

Unless you really need all the raw metadata, I'd recommend disabling it and relying only on the non-"raw" fields. In FSCrawler 2.7, raw will be disabled by default. See the "Local FS settings" page of the FSCrawler 2.10-SNAPSHOT documentation.

"the problem that is I have to put the exact field in the query"

It's because of the analyzers that are used by default by FSCrawler.

For some fields, a keyword datatype is used. If you need to run full-text search on them, you will probably need to change it to text.

I think I will need the info in raw.
Can I change the default analyzer used by FSCrawler?

Have a look at https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#mappings. That said, if you are using all the default settings, I think a match query for bigdata on the field meta.raw.resourceName should work.

I'm using the default settings, and yes, I get a match when I search:

GET /test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "big data"
    }
  }
}

But when I use "bigdata" I get nothing.
Is there also a way to run the search across the whole 'meta' object?
These are my settings:
{
  "test" : {
    "settings" : {
      "index" : {
        "mapping" : {
          "total_fields" : {
            "limit" : "2000"
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "test",
        "creation_date" : "1563369775998",
        "analysis" : {
          "analyzer" : {
            "fscrawler_path" : {
              "tokenizer" : "fscrawler_path"
            }
          },
          "tokenizer" : {
            "fscrawler_path" : {
              "type" : "path_hierarchy"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "GLQhf9KoSaOQHheHvqrx9Q",
        "version" : {
          "created" : "7020099"
        }
      }
    }
  }
}

I'm very surprised that big data works and bigdata does not. I'd have assumed the opposite.

Could you run this?

GET test/_mapping
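
When the full mapping is too large to share, the get field mapping API returns the mapping of a single field. A minimal sketch, assuming the index is still named test and the field of interest is meta.raw.resourceName:

```
# Return only the mapping of one field instead of the whole index mapping.
# Index and field names are taken from the examples in this thread.
GET test/_mapping/field/meta.raw.resourceName
```

The response shows whether the field is mapped as text or keyword and which analyzer, if any, is set on it.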

OK, but I get this error when I try to reply; I guess the mapping is too long:
Body is limited to 7000 characters; you entered 32923.

Share it on gist.github.com

OK, here it is: https://gist.github.com/FeizNouri/0d960b9a0cc2e26fd46b6048347f81ac

(By the way, I only changed the name of the index to "docs", nothing else.)

I can't reproduce what you are saying with a simple example:

DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "meta": {
        "properties": {
          "raw": {
            "properties": {
              "resourceName": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
PUT test/_doc/1
{
  "meta": {
    "raw": {
      "resourceName": "BigData.pptx"
    }
  }
}

Neither of these queries, bigdata or big data, matches:

GET test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "bigdata"
    }
  }
}
GET test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "big data"
    }
  }
}

This one, which searches for bigdata.pptx, does match:

GET test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "bigdata.pptx"
    }
  }
}

So it's different from what you are describing.
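
One way to see why the real index behaves differently is to ask Elasticsearch which tokens the mapped field actually produces. The sketch below assumes the index is named test (docs after the rename) and uses the field parameter of the analyze API, so the analyzer configured in the mapping is applied rather than one named by hand:

```
# Analyze the text with whatever analyzer the mapping assigns to this field.
# Replace test with the real index name; the field name comes from this thread.
GET test/_analyze
{
  "field": "meta.raw.resourceName",
  "text": "BigData.pptx"
}
```

A match query finds the document only when the analyzed query text shares at least one token with the list in the response.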

Anyway, only bigdata.pptx matches because of the analyzer used in that case. The standard analyzer works like this:

POST /_analyze
{
  "analyzer": "standard", 
  "text": ["BigData.pptx"]
}

Gives:

{
  "tokens" : [
    {
      "token" : "bigdata.pptx",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

If you want to be able to search for bigdata, then the analyzer needs to produce that token.

For example:

POST /_analyze
{
  "analyzer": "simple", 
  "text": ["BigData.pptx"]
}

gives:

{
  "tokens" : [
    {
      "token" : "bigdata",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "pptx",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    }
  ]
}

So if you run:

DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "meta": {
        "properties": {
          "raw": {
            "properties": {
              "resourceName": {
                "type": "text",
                "analyzer": "simple"
              }
            }
          }
        }
      }
    }
  }
}
PUT test/_doc/1
{
  "meta": {
    "raw": {
      "resourceName": "BigData.pptx"
    }
  }
}
GET test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "bigdata"
    }
  }
}

This is now matching.
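
The earlier question about searching the whole meta object was left open. A possible approach, sketched here under the assumption that the fields of interest are text fields analyzed as above, is a multi_match query with a wildcard field pattern:

```
# Search every field whose path starts with "meta." in one query.
# "lenient": true skips format errors on non-text fields such as dates.
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "bigdata",
      "fields": ["meta.*"],
      "lenient": true
    }
  }
}
```

Each field is still searched with its own analyzer, so keyword fields only match exact values, and an index with very many meta.raw fields may run into the limit on the number of query clauses.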

Thanks, this worked.
But what if I want FSCrawler to index the info in a simpler way, so that the _source just contains something like:

"_source" : {
  "lang" : "en",
  "url" : "http://.........",
  "title" : "Data mining...........",
  "meta" : "Article about........................."
}

I think that would make the search easier for me: rather than splitting the meta into many fields, just return it as it is in one field. Can you help me with that? I think it can be done in the mapping, but I couldn't really understand how mappings work.

It does that out of the box. That's why I said you normally don't need the raw metadata.

But if that does not fit your use case, you can always add an ingest pipeline that transforms the data generated by FSCrawler into something else.
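
As a rough illustration of that idea, the pipeline below moves the title to the top level and drops the raw metadata. The pipeline name fscrawler-simplify and the choice of fields are assumptions made for this sketch, not something FSCrawler ships:

```
# Hypothetical pipeline: keep a flat title and discard meta.raw.
PUT _ingest/pipeline/fscrawler-simplify
{
  "description": "Flatten FSCrawler metadata (illustrative sketch)",
  "processors": [
    { "rename": { "field": "meta.title", "target_field": "title", "ignore_missing": true } },
    { "remove": { "field": "meta.raw", "ignore_missing": true } }
  ]
}

# Try the pipeline on a sample document without indexing anything.
POST _ingest/pipeline/fscrawler-simplify/_simulate
{
  "docs": [
    {
      "_source": {
        "meta": {
          "title": "PowerPoint Presentation",
          "raw": { "resourceName": "BigData.pptx" }
        }
      }
    }
  ]
}
```

FSCrawler would then need to be told to send documents through this pipeline; its documentation describes a pipeline setting in the elasticsearch section of the job settings.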

