How to execute multi search on Elasticsearch

FeizNouri · July 17, 2019, 2:40pm

Hi,
I have indexed metadata from documents with FSCrawler, so in Elasticsearch I have an index whose search results look like this:

 {
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 19,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "BigData.pptx",
        "_score" : 1.0,
        "_source" : {
          "meta" : {
            "author" : "Jin",
            "title" : "PowerPoint Presentation",
            "date" : "2012-01-12T20:47:47.000+0000",
            "modifier" : "Jin",
            "created" : "2012-01-12T19:50:20.000+0000",
            "raw" : {
              "date" : "2012-01-12T21:47:47Z",
              "cp:revision" : "20",
              "Total-Time" : "57",
              "extended-properties:AppVersion" : "14.0000",
              "meta:paragraph-count" : "273",
              "meta:word-count" : "1319",
              "extended-properties:PresentationFormat" : "On-screen Show (4:3)",
              "dc:creator" : "Jin",
              "Word-Count" : "1319",
              "dcterms:created" : "2012-01-12T20:50:20Z",
              "dcterms:modified" : "2012-01-12T21:47:47Z",
              "Last-Modified" : "2012-01-12T21:47:47Z",
              "title" : "PowerPoint Presentation",
              "Last-Save-Date" : "2012-01-12T21:47:47Z",
              "Paragraph-Count" : "273",
              "meta:save-date" : "2012-01-12T21:47:47Z",
              "dc:title" : "PowerPoint Presentation",
              "Application-Name" : "Microsoft Office PowerPoint",
              "extended-properties:TotalTime" : "57",
              "modified" : "2012-01-12T21:47:47Z",
              "Notes" : "17",
              "Content-Type" : "application/vnd.openxmlformats-officedocument.presentationml.presentation",
              "Slide-Count" : "47",
              "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
              "creator" : "Jin",
              "extended-properties:Notes" : "17",
              "meta:author" : "Jin",
              "meta:creation-date" : "2012-01-12T20:50:20Z",
              "extended-properties:Application" : "Microsoft Office PowerPoint",
              "meta:last-author" : "Jin",
              "meta:slide-count" : "47",
              "Creation-Date" : "2012-01-12T20:50:20Z",
              "xmpTPg:NPages" : "47",
              "resourceName" : "BigData.pptx",
              "Last-Author" : "Jin",
              "Revision-Number" : "20",
              "Application-Version" : "14.0000",
              "Author" : "Jin",
              "Presentation-Format" : "On-screen Show (4:3)"
            }
          },
          "file" : {
            "extension" : "pptx",
            "content_type" : "application/vnd.openxmlformats-officedocument.presentationml.presentation",
            "created" : "2019-07-08T10:45:34.000+0000",
            "last_modified" : "2019-07-08T10:45:34.000+0000",
            "last_accessed" : "2019-07-17T09:50:04.000+0000",
            "indexing_date" : "2019-07-17T13:32:13.807+0000",
            "filesize" : 2496305,
            "filename" : "BigData.pptx",
            "url" : "file:///home/ubuntu/Downloads/FSCrawler/BigData.pptx",
            "indexed_chars" : 0
          },
          "path" : {
            "root" : "4d1f91a687e6d7c4e1dd3e1cbb4bd2",
            "virtual" : "/BigData.pptx",
            "real" : "/home/ubuntu/Downloads/FSCrawler/BigData.pptx"
          }
        }
      },

And so on; that is just one hit from the hits array.

Given one or more words, I want to find the matching documents by searching their metadata.
The metadata has many fields, for example meta.raw.date, meta.raw.title, etc.
For example, searching for the words 'Big Data' across the whole 'meta' object should return a result, because the field meta.raw.resourceName contains "BigData.pptx".
I couldn't find a way to run such a search. I've tried 'more like this' and 'multi_match', but the problem is that I have to put the exact field in the query (I have to put "meta.raw.resourceName": "BigData.pptx"), and I have to use the exact word 'BigData.pptx'; if I search for 'Big' I get nothing.
Can anyone help me?

dadoonet · July 17, 2019, 3:07pm

Unless you really need all the raw metadata, I'd recommend disabling it and relying on the non-"raw" fields. In FSCrawler 2.7, raw will be disabled by default. See "Local FS settings" in the FSCrawler 2.10-SNAPSHOT documentation.

the problem is that I have to put the exact field in the query

It's because of the analyzers that are used by default by FSCrawler.

For some fields, a keyword datatype is used. If you need to run full text search, you will probably need to change it to text.

FeizNouri · July 17, 2019, 3:50pm

I think I will need the info in raw.
Can I change the default analyzer used by FSCrawler?

dadoonet · July 17, 2019, 4:03pm

Have a look at https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#mappings. But if you are using all the default settings, I think that searching for bigdata with a match query on the field meta.raw.resourceName should work.

FeizNouri · July 17, 2019, 4:36pm

I'm using the default settings, and yes, I get a match when I search:

GET /test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "big data"
    }
  }
}

But when I use "bigdata" I get nothing.
Also, is there a way to run the search on the whole 'meta' field?
These are my settings:
{
  "test" : {
    "settings" : {
      "index" : {
        "mapping" : {
          "total_fields" : {
            "limit" : "2000"
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "test",
        "creation_date" : "1563369775998",
        "analysis" : {
          "analyzer" : {
            "fscrawler_path" : {
              "tokenizer" : "fscrawler_path"
            }
          },
          "tokenizer" : {
            "fscrawler_path" : {
              "type" : "path_hierarchy"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "GLQhf9KoSaOQHheHvqrx9Q",
        "version" : {
          "created" : "7020099"
        }
      }
    }
  }
}

dadoonet · July 18, 2019, 12:44pm

I'm very surprised that big data works and bigdata does not. I'd have assumed the opposite.

Could you run this?

GET test/_mapping

FeizNouri · July 18, 2019, 1:55pm

OK, but I get this error when I try to reply; I guess the mapping is too long:
Body is limited to 7000 characters; you entered 32923.

dadoonet · July 18, 2019, 2:06pm

Share it on gist.github.com

FeizNouri · July 18, 2019, 2:09pm

OK, here it is: https://gist.github.com/FeizNouri/0d960b9a0cc2e26fd46b6048347f81ac

(By the way, I just changed the name of the index to "docs"; I'm sure nothing else changed.)

dadoonet · July 18, 2019, 2:34pm

I can't reproduce what you are saying with a simple example:

DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "meta": {
        "properties": {
          "raw": {
            "properties": {
              "resourceName": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
PUT test/_doc/1
{
  "meta": {
    "raw": {
      "resourceName": "BigData.pptx"
    }
  }
}

Neither of these queries, bigdata or big data, matches:

GET test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "bigdata"
    }
  }
}
GET test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "big data"
    }
  }
}

This one, which searches for bigdata.pptx, matches:

GET test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "bigdata.pptx"
    }
  }
}

So it's different from what you are describing.

Anyway, only bigdata.pptx matches because of the analyzer used in that case. The standard analyzer works like this:

POST /_analyze
{
  "analyzer": "standard", 
  "text": ["BigData.pptx"]
}

Gives:

{
  "tokens" : [
    {
      "token" : "bigdata.pptx",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

If you want to be able to search for bigdata, then the analyzer needs to produce that token.

For example:

POST /_analyze
{
  "analyzer": "simple", 
  "text": ["BigData.pptx"]
}

gives:

{
  "tokens" : [
    {
      "token" : "bigdata",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "pptx",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    }
  ]
}

So if you run:

DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "meta": {
        "properties": {
          "raw": {
            "properties": {
              "resourceName": {
                "type": "text",
                "analyzer": "simple"
              }
            }
          }
        }
      }
    }
  }
}
PUT test/_doc/1
{
  "meta": {
    "raw": {
      "resourceName": "BigData.pptx"
    }
  }
}
GET test/_search
{
  "query": {
    "match": {
      "meta.raw.resourceName": "bigdata"
    }
  }
}

This is now matching.
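
As a rough illustration of why the analyzer matters, here is a Python sketch that only approximates the simple analyzer by splitting on non-letters and lowercasing. The function names are made up for this example, and it is not how Elasticsearch is implemented; it just mimics the tokens shown by _analyze above.

```python
from itertools import groupby


def simple_analyze(text):
    """Approximate the `simple` analyzer: split on non-letters, lowercase."""
    return [
        "".join(chars).lower()
        for is_letter, chars in groupby(text, key=str.isalpha)
        if is_letter
    ]


def match_query(query, field_value):
    """Approximate a `match` query with the default OR operator:
    a hit needs at least one query token among the indexed tokens."""
    indexed = set(simple_analyze(field_value))
    return any(token in indexed for token in simple_analyze(query))


# Show the indexed tokens, then try a few query strings against them.
print(simple_analyze("BigData.pptx"))
for query in ("bigdata", "pptx", "big data", "big"):
    print(query, "->", match_query(query, "BigData.pptx"))
```

Under this approximation bigdata and pptx match, while big data and big still do not, because no indexed token equals big or data. Matching partial words would need a different analysis chain, which was not tested in this thread.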

FeizNouri · July 19, 2019, 1:11pm

Thanks, this worked.
But what if I want FSCrawler to index the info in a simpler way, so that the _source just contains something like:

"_source" : {
          "lang" : "en",
          "url" : "http://.........",
          "title" : "Data mining...........",
          "meta" : "Article about........................."
        }

I think this would make the search easier for me: rather than splitting the meta into many fields, just return it as it is in one field. Can you help me with that? I think it can be done in the mapping, but I couldn't really understand how mappings work.

dadoonet · July 19, 2019, 1:30pm

It does that out of the box. That's why I said you normally don't need the raw metadata.

But if that does not fit your use case, you can always add an ingest pipeline that transforms the data generated by FSCrawler into something else.
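
On the earlier question about searching the whole meta object without naming each field: multi_match accepts wildcards in fields, so a pattern like meta.* may be enough. Below is a Python sketch that only builds and prints the request body. The helper name is hypothetical, and whether lenient is needed, and whether the number of expanded fields stays under the cluster's clause limit, are assumptions to check against your own mapping.

```python
import json


def build_meta_search(text):
    """Build a search body that queries every field under `meta`."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # Wildcard expands to all fields whose name starts with "meta."
                "fields": ["meta.*"],
                # Assumption: skips fields (dates, numbers) that cannot parse the text
                "lenient": True,
            }
        }
    }


# Print a body that can be pasted after `GET test/_search` in Console.
print(json.dumps(build_meta_search("bigdata"), indent=2))
```

The analyzer of each field still decides which tokens exist, so bigdata only matches fields that are analyzed in a way that produces that token.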
