Ingest Attachment and Wildcard Queries


(colaci) #1

Hello, I'm using the Ingest Attachment plugin to import PDF/DOC files into Elasticsearch.

The "match" query works as expected, but I can't get a "wildcard" query to work: I always get an empty result set. I just read (https://stackoverflow.com/questions/45515044/elasticsearch-query-with-wildcards) that the wildcard query requires a "not_analyzed" text field, but my "attachment.content" field is, of course, analyzed.

What are my options? Thank you.


(David Pilato) #2

Why would you use wildcards in the first place?

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html says:

Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?.
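
As a rough illustration of that advice, a client could check a pattern before sending it. This is only a sketch: `starts_with_wildcard` is a hypothetical helper name, not part of Elasticsearch or of this thread, and the check simply looks at the first character of the pattern.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not an Elasticsearch feature):
# succeeds when the pattern begins with one of the wildcards * or ?.
starts_with_wildcard() {
  case "$1" in
    [*?]*) return 0 ;;   # inside brackets, * and ? are literal characters
    *)     return 1 ;;
  esac
}

# Classify a few sample patterns.
for p in 'bon*' '*jour' '?onjour'; do
  if starts_with_wildcard "$p"; then
    printf '%s: leading wildcard, likely slow\n' "$p"
  else
    printf '%s: ok\n' "$p"
  fi
done
```

A prefix-style pattern such as a ticket-number prefix followed by * passes this check, since the wildcard comes last.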


(colaci) #3

I know the risks, but it is required for a small number of searches where we want to find a document mentioning a specific ticket number (a hash-like 32-character string) by its prefix.


(David Pilato) #4

Try with r*d instead of R*d
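
The reason case matters: a wildcard query is a term-level query, so its pattern is not run through the field's analyzer, while the terms indexed for an analyzed text field are typically lowercased. A minimal sketch of lowercasing the pattern on the client side follows. `wildcard_body` is a hypothetical helper name, the field name is taken from this thread, and the pattern is assumed to contain no double quotes or backslashes. Lowercasing only mirrors the analyzer's lowercase step; stemming or elision may still change the indexed terms.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): lowercases a wildcard pattern
# and wraps it in a wildcard query body for the attachment.content field.
# Assumption: the pattern contains no double quotes or backslashes.
wildcard_body() {
  pattern=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf '{"_source":["filename"],"query":{"wildcard":{"attachment.content":{"value":"%s"}}}}\n' "$pattern"
}

# Print the body that would be sent for the pattern Bon*
wildcard_body 'Bon*'
```

The output can be passed to curl with -d "$(wildcard_body 'Bon*')" against the same _search URL used elsewhere in this thread.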


(colaci) #5

The Stack Overflow post is not mine. Here is what I'm doing:

# initialize the attachment pipeline

curl -XPUT 'http://localhost:9200/_ingest/pipeline/attachment?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Extract attachment information encoded in Base64 with UTF-8 charset",
 "processors" : [
   {
     "attachment": {
       "field": "data"
     }
   }
 ]
}'

# initialize a new db

curl -XDELETE 'http://localhost:9200/newdb'

curl -XPUT 'http://localhost:9200/newdb' -H 'Content-Type: application/json' -d '{
    "mappings" : {
        "files" : {
            "properties" : {
                "attachment.content" : {
                    "type": "text",
                    "fields" : {
                        "keyword" : {
                            "type" : "keyword",
                            "ignore_above" : 256
                        }
                    },
                    "analyzer" : "french"
                }
            }
        }
    }
}'

# prepare some test files

echo '{"data": "'$(base64 -w0 1.pdf)'", "filename": "1.pdf"}' >1.pdf.json
echo '{"data": "'$(base64 -w0 2.pdf)'", "filename": "2.pdf"}' >2.pdf.json
echo '{"data": "'$(base64 -w0 3.pdf)'", "filename": "3.pdf"}' >3.pdf.json

# insert the test files into elasticsearch

curl -XPUT 'http://localhost:9200/newdb/files/1?pipeline=attachment&pretty' -H 'Content-Type: application/json' --data-binary @1.pdf.json
curl -XPUT 'http://localhost:9200/newdb/files/2?pipeline=attachment&pretty' -H 'Content-Type: application/json' --data-binary @2.pdf.json
curl -XPUT 'http://localhost:9200/newdb/files/3?pipeline=attachment&pretty' -H 'Content-Type: application/json' --data-binary @3.pdf.json


# testing match

curl -XGET 'http://localhost:9200/newdb/files/_search?pretty' -H 'Content-Type: application/json' -d '{
  "_source": ["filename"],
  "query": {
    "match": {
      "attachment.content" : {
        "query" : "Bonjour"
      }
    }
  }
}'

# returns results, OK!

# testing wildcard

curl -XGET 'http://localhost:9200/newdb/files/_search?pretty' -H 'Content-Type: application/json' -d '{
  "_source": ["filename"],
  "query": {
    "wildcard": {
      "attachment.content" : {
        "value" : "Bon*"
      }
    }
  }
}'

# no result, BAD!

# testing wildcard (lowercase)

curl -XGET 'http://localhost:9200/newdb/files/_search?pretty' -H 'Content-Type: application/json' -d '{
  "_source": ["filename"],
  "query": {
    "wildcard": {
      "attachment.content" : {
        "value" : "bon*"
      }
    }
  }
}'

# no result, BAD!

(David Pilato) #6

I tried to simplify your use case, as it's not specifically related to ingest-attachment here:

DELETE test
PUT test
{
    "mappings" : {
        "doc" : {
            "properties" : {
                "attachment.content" : {
                    "type": "text",
                    "fields" : {
                        "keyword" : {
                            "type" : "keyword",
                            "ignore_above" : 256
                        }
                    },
                    "analyzer" : "french"
                }
            }
        }
    }
}
PUT test/doc/1?refresh
{
  "attachment.content": "Bonjour"
}

GET test/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "Bonjour"
      }
    }
  }
}
GET test/_search
{
  "query": {
    "wildcard": {
      "attachment.content": {
        "value" : "Bon*"
      }
    }
  }
}
GET test/_search
{
  "query": {
    "wildcard": {
      "attachment.content": {
        "value" : "bon*"
      }
    }
  }
}

This gives, in the same order as the three search requests above:

# GET test/_search
{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "attachment.content": "Bonjour"
        }
      }
    ]
  }
}

# GET test/_search
{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

# GET test/_search
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "attachment.content": "Bonjour"
        }
      }
    ]
  }
}

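So this is what I expect here: the wildcard pattern is not analyzed, while the indexed terms are lowercased by the analyzer, so "bon*" matches and "Bon*" does not.
What problem are you seeing, then?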
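
One way to narrow this down is to look at the terms the analyzer actually produces for the field, since those are what a wildcard pattern has to match. A possible sketch using the _analyze API is shown here. `analyze_body` is a hypothetical helper name, and the index name and host are the ones used earlier in this thread. The sample text is assumed to contain no double quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): builds the request body for the
# _analyze API so the terms produced for attachment.content can be inspected.
# Assumption: the sample text contains no double quotes or backslashes.
analyze_body() {
  printf '{"field":"attachment.content","text":"%s"}\n' "$1"
}

# Print the body for the sample word used in this thread.
analyze_body 'Bonjour'

# To send it, assuming the index and host used earlier in the thread:
# curl -XGET 'http://localhost:9200/newdb/_analyze?pretty' \
#   -H 'Content-Type: application/json' -d "$(analyze_body 'Bonjour')"
```

The tokens in the response show the exact form of each indexed term, which may differ from the original word once lowercasing and stemming have been applied.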

