Elasticsearch NEST query on a list of attachments


(Alina Frey) #1
  • Using ingest-attachment plugin.

  • Elasticsearch version: 5.5.1

  • NEST: 5.5

  • My goal is to search the Content of a list of documents called PersonDocuments attached to IndexablePersonModel, and return the persons whose documents contain the query term.

  • PersonDocuments is a list of IndexablePersonDocument on the IndexablePersonModel

public IEnumerable<IndexablePersonDocument> PersonDocuments { get; set; }
  • One of the properties of IndexablePersonDocument is the string Content.

  • How do I query on the PersonDocuments?

  • Here is the query that I have so far:

QueryContainer Query(QueryContainerDescriptor<IndexablePersonModel> q) {
	var returnQuery = q
		 .Match(m => m
			.Field(a => a.PersonDocuments.FirstOrDefault().Content)
			.Boost(SearchConstants.Boosts.XXXLarge)
			.Query(Form.Query))
		 || q.FunctionScore(fs => fs
			.MaxBoost(SearchConstants.Boosts.Large)
			.Functions(ff => ff
				.FieldValueFactor(fvf => fvf
					.Field(p => p...)
					.Factor(0.0001)))
			.Query(query => query
				.MultiMatch(m => m
					.Fields(f => f...)
					.Operator(Operator.And)
					.Query(Form.Query))));

	return returnQuery;
}
  • The line with the problem is:
.Field(a => a.PersonDocuments.FirstOrDefault().Content)
  • I don't know how to loop through the list PersonDocuments.

  • The following is the mapping of the index that I created for persons:

{
	"person-index": {		
		"mappings": {
			"person": {
				"properties": {
					"name": {
						"type": "text",
						"fields": {
							"keyword": {
								"type": "text",
								"analyzer": "person-name-keyword"
							},
							"raw": {
								"type": "keyword"
							}
						},
						"analyzer": "person-name-analyzer"
					},
					"personDocuments": {
						"type": "nested",
						"properties": {
							"attachment": {
								"properties": {
									"author": {
										"type": "text",
										"fields": {
											"keyword": {
												"type": "keyword",
												"ignore_above": 256
											}
										}
									},
									"content": {
										"type": "text",
										"fields": {
											"keyword": {
												"type": "keyword",
												"ignore_above": 256
											}
										}
									},
									"content_length": {
										"type": "long"
									},
									"content_type": {
										"type": "text",
										"fields": {
											"keyword": {
												"type": "keyword",
												"ignore_above": 256
											}
										}
									},
									"date": {
										"type": "date"
									},
									"detect_language": {
										"type": "boolean"
									},
									"indexed_chars": {
										"type": "long"
									},
									"keywords": {
										"type": "text",
										"fields": {
											"keyword": {
												"type": "keyword",
												"ignore_above": 256
											}
										}
									},
									"language": {
										"type": "text",
										"fields": {
											"keyword": {
												"type": "keyword",
												"ignore_above": 256
											}
										}
									},
									"name": {
										"type": "text",
										"fields": {
											"keyword": {
												"type": "keyword",
												"ignore_above": 256
											}
										}
									},
									"title": {
										"type": "text",
										"fields": {
											"keyword": {
												"type": "keyword",
												"ignore_above": 256
											}
										}
									}
								}
							},
							"content": {
								"type": "text",
								"fields": {
									"keyword": {
										"type": "keyword",
										"ignore_above": 256
									}
								}
							},
							"id": {
								"type": "integer"
							},
							"name": {
								"type": "text",
								"analyzer": "document-path-analyzer"
							},
							"path": {
								"type": "text",
								"fields": {
									"keyword": {
										"type": "keyword",
										"ignore_above": 256
									}
								}
							}
						}
					}
				}
			}
		},
		"settings": {
			...
		}
	}
}

(Russ Cam) #2

The lambda expression here is used to construct the path to the correct field, in a strongly typed fashion; the resulting serialized output for that expression would be

"field": "personDocuments.content"

Where the indexed field is a collection of values, such as in the case of

public IEnumerable<IndexablePersonDocument> PersonDocuments { get; set; }

The query will be executed across the values in the collection for each document.

You'll notice that the mapping for personDocuments includes the properties of only a single personDocument. This is because Elasticsearch has no concept of an "array" datatype. As far as it is concerned, you can index a single personDocument, multiple personDocument instances, or none into the personDocuments field. What will be mapped back to your POCO, however, will be the original JSON that you indexed. Your POCO models the personDocuments field as a collection, so your model/application always expects a collection (including null or empty) back from Elasticsearch.
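
To illustrate the point about arrays, here is a minimal sketch of two document bodies that would both be accepted under the same personDocuments mapping. The person and file names are hypothetical placeholders, not values from the index above. The first carries a single personDocument object:

```json
{
  "name": "Person A",
  "personDocuments": { "id": 0, "name": "first-file.pdf" }
}
```

The second carries several in a JSON array:

```json
{
  "name": "Person B",
  "personDocuments": [
    { "id": 0, "name": "first-file.pdf" },
    { "id": 1, "name": "second-file.pdf" }
  ]
}
```

The mapping does not distinguish the two shapes. Only the _source that comes back differs, which is why a POCO that declares a collection should always be indexed with a collection.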

A couple of points, though:

  1. personDocuments is mapped as a nested data type, so a query on a field of personDocuments needs to be a nested query; that is, you need to put the Match query on the Content field inside a Nested query.

  2. A collection of the top level person documents will be returned, and not personDocuments. If you wish to get the matching personDocuments, take a look at Inner hits.
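
As a rough sketch of both points, the request body below shows approximately what the serialized query would need to look like: a match query wrapped in a nested query whose path is personDocuments, with inner_hits requested so that the matching personDocuments come back alongside each person. The query text "query-term" is a placeholder. The field path personDocuments.content follows the serialized output shown earlier. If the text to be searched is instead the content extracted by the attachment processor, the mapping suggests the path would be personDocuments.attachment.content, but that is an assumption about how the ingest pipeline is set up.

```json
{
  "query": {
    "nested": {
      "path": "personDocuments",
      "query": {
        "match": {
          "personDocuments.content": {
            "query": "query-term"
          }
        }
      },
      "inner_hits": {}
    }
  }
}
```

In NEST this would correspond to calling Nested on the query descriptor, setting Path to PersonDocuments, and placing the Match inside the nested Query. Comparing the JSON that NEST actually sends against a body like this one is a quick way to check that the field path and nesting are what you expect.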


(Alina Frey) #3
  • I modified the query to the following, but I am still getting zero hits. Please advise.
QueryContainer Query(QueryContainerDescriptor<IndexablePersonModel> q) {
	var returnQuery = q
		.FunctionScore(fs => fs...)
		.Nested(n => n
			.Boost(SearchConstants.Boosts.XXXLarge)
			.InnerHits(i => i.Explain())
			.Path(p => p.PersonDocuments)
			.Query(nq => +nq
				.Bool(b => b
					.Should(
						s => s.Term(p => p.PersonDocuments.First().Attachment.Content.Suffix(SearchConstants.Keyword), <query-term>)
					).MinimumShouldMatch(MinimumShouldMatch.Fixed(1))
				)
			)
			.IgnoreUnmapped()
		);
					
	return returnQuery;
}
  • When I do a search from Postman with "query-term":
POST: http://localhost:9200/index-name/_search
BODY:
{
    "query": {
        "query_string": {
            "query": "query-term"
        }
    }
}
  • I get the following output:
{
    "took": 61,
    "timed_out": false,
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 1,
        "hits": [
            {
                "_index": "<index-name>",
                "_type": "person",
                "_id": "72802",
                "_score": 1,
                "_source": {                    
                    "attribute1": {
                        "input": [
                            ...
                        ],
                        "weight": 0
                    },
                    "attribute2": false,
					...
                    
                    "personDocuments": [
                        {
                            "attachment": {
                                "content_type": "text/plain; charset=ISO-8859-1",
                                "language": "en",
                                "content": "<query-term>",
                                "content_length": 7
                            },
                            "name": "<name-of-the-file>.pdf",
                            "id": 0,
                            "content": "<base64-content>"
                        }
                    ]
                }
            },
            ...
            ...
            ...
        ]    
    }
}
  • So I know the documents can be retrieved from Postman, but I don't know how to write the equivalent query in C# with NEST.
