Ingest-attachment: case-insensitive search

Hello. I need to search within file content, and that part works. Is it possible to make the search in file content case-insensitive?
I've read a lot and didn't find an answer.

It’s a question of which analyzer you apply to the field that ingest-attachment extracts the data into.
It is not related to ingest-attachment itself.

By default, text fields don’t care about case: the default standard analyzer lowercases tokens both at index time and at search time.
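
One way to check this, as a sketch: send the body below to the _analyze API (for example with GET _analyze). The sample text is made up for illustration.

```json
{
    "analyzer": "standard",
    "text": "Quick FOX"
}
```

The standard analyzer combines the standard tokenizer with a lowercase token filter, so the tokens that come back should be lowercased (quick and fox). A match query on a text field analyzes the query string the same way, which is why the case of the search term does not matter.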

What is your problem?

OK, I have the following index settings:

 {
	"index": {
		"analysis": {
			"filter": {
				"swedish_stop": {
					"type": "stop",
					"stopwords": "_none_"
				},
				"swedish_stemmer": {
					"type": "stemmer",
					"language": "swedish"
				}
			},
			"analyzer": {
				"any": {
					"type": "custom",
					"tokenizer": "standard"
				},
				"any_lowercase": {
					"type": "custom",
					"tokenizer": "standard",
					"filter": [
						"lowercase"
					]
				},
				"swedish": {
					"type": "custom",
					"tokenizer": "standard",
					"filter": [
						"swedish_stop",
						"swedish_stemmer"
					]
				},
				"swedish_lowercase": {
					"type": "custom",
					"tokenizer": "standard",
					"filter": [
						"lowercase",
						"swedish_stop",
						"swedish_stemmer"
					]
				}
			},
			"normalizer": {
				"lowercase_normalizer": {
					"type": "custom",
					"char_filter": [],
					"filter": [
						"lowercase"
					]
				}
			}
		}
	}
}

Ingest pipeline with the attachment processor:

 {
    "attachment": {
        "description": "Extract attachment information",
        "processors": [
            {
                "attachment": {
                    "field": "payload",
                    "indexed_chars": "-1",
                    "properties": [
                        "content",
                        "content_type",
                        "content_length",
                        "title",
                        "language"
                    ]
                }
            }
        ]
    }
}

For my index I am using the following mapping:

 {
	"test": {
		"mappings": {
			"_type": {
				"properties": {
					"attachment": {
						"properties": {
							"content": {
								"type": "text",
								"fields": {
									"keyword": {
										"type": "keyword",
										"ignore_above": 256
									}
								}
							},
							"content_length": {
								"type": "long"
							},
							"content_type": {
								"type": "text",
								"fields": {
									"keyword": {
										"type": "keyword",
										"ignore_above": 256
									}
								}
							},
							"language": {
								"type": "text",
								"fields": {
									"keyword": {
										"type": "keyword",
										"ignore_above": 256
									}
								}
							},
							"title": {
								"type": "text",
								"fields": {
									"keyword": {
										"type": "keyword",
										"ignore_above": 256
									}
								}
							}
						}
					},
					"payload": {
						"type": "text",
						"fields": {
							"any_lowercase": {
								"type": "text",
								"analyzer": "any_lowercase"
							}
						},
						"analyzer": "any"
					}
				}
			}
		}
	}
}

'payload' is a base64-encoded binary. Is it possible to store only this encoded binary without content? Or vice versa?

And what is the maximum file size for indexing?

I also have one more question: if I don't want to keep the encoded file content and remove my 'payload' field, how can I apply the lowercase analyzer? What should I send in the request?

Is it possible to store only this encoded binary without content?

Yes, but you won't be able to search the content then.
Also, storing blobs in Elasticsearch is not recommended.

Or vice versa?

Yes. You can add a remove processor in your ingest pipeline.
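
A minimal sketch of such a pipeline, assuming it keeps the pipeline id (attachment) and the field names from the pipeline posted above; it would be sent as the body of PUT _ingest/pipeline/attachment. The remove processor is listed after the attachment processor, so the text is extracted before the base64 source is dropped.

```json
{
    "description": "Extract attachment information, then drop the base64 source",
    "processors": [
        {
            "attachment": {
                "field": "payload",
                "indexed_chars": -1,
                "properties": [
                    "content",
                    "content_type",
                    "content_length",
                    "title",
                    "language"
                ]
            }
        },
        {
            "remove": {
                "field": "payload"
            }
        }
    ]
}
```

Documents indexed through this pipeline keep the attachment.* fields but no longer carry the payload field in _source.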

And what is the maximum file size for indexing?

By default, the attachment processor extracts only the first 100000 characters (the indexed_chars setting). Your pipeline sets indexed_chars to -1, which removes that limit.
That said, it is a bad idea to send a very large file as a JSON document, such as a 10 GB mp4 video, if you only want to index some metadata like the filename. It will overload your node's memory.

I also have one more question: if I don't want to keep the encoded file content and remove my 'payload' field, how can I apply the lowercase analyzer? What should I send in the request?

You apply the analyzer not to the payload field but to the attachment.content field, which is where the text is actually extracted to.
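
As a sketch, the relevant part of the mapping could look like the fragment below, assuming the any_lowercase analyzer from the index settings posted above. The analyzer of an existing field cannot be changed in place, so this has to be set when the index is created (or the data reindexed into a new index).

```json
{
    "properties": {
        "attachment": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "any_lowercase"
                }
            }
        }
    }
}
```

Note that in the mapping posted above, attachment.content has no explicit analyzer, so it already uses the default standard analyzer and should match regardless of case. Either way, a match query like the following works (the search term is only an example), because the query text is analyzed with the same analyzer as the field:

```json
{
    "query": {
        "match": {
            "attachment.content": "HELLO"
        }
    }
}
```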
