I am implementing a "chat with Elasticsearch index" feature, where a large language model (LLM), such as one from OpenAI, converts natural language queries (NLQs) into Elasticsearch queries. For accurate query conversion, the LLM needs to understand the Elasticsearch index mappings, including the structure and constraints of the fields. However, Elasticsearch mappings typically expose only field types (e.g., keyword, text, date) with no further context.
For example, consider an index with a "status" field of type keyword that should only ever contain the values "Approved", "Denied", and "Draft". These valid values cannot be expressed in the standard mapping JSON, so the LLM cannot infer them. For a user query like "get approved documents," how can the LLM know to generate an Elasticsearch query that matches the case-sensitive value "Approved"?
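For concreteness, here is a minimal sketch of such a mapping (the documents index name and the other fields are made up for illustration):

```json
PUT /documents
{
  "mappings": {
    "properties": {
      "status":     { "type": "keyword" },
      "title":      { "type": "text" },
      "created_at": { "type": "date" }
    }
  }
}
```

Nothing here tells the LLM that status accepts only "Approved", "Denied", and "Draft", or that matching is case-sensitive, so a term query for "approved" would silently return no hits.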
One potential solution is to add detailed metadata to the _meta field in the Elasticsearch mappings. However, including such detailed metadata for all properties could significantly increase the size of the mappings JSON, potentially doubling it, which may impact performance or maintainability.
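The _meta field does accept arbitrary JSON, so something like the following is possible (the description/allowed_values schema here is my own invention, not an Elasticsearch convention):

```json
PUT /documents/_mapping
{
  "_meta": {
    "fields": {
      "status": {
        "description": "Review state of the document; matching is case-sensitive",
        "allowed_values": ["Approved", "Denied", "Draft"]
      },
      "created_at": {
        "description": "UTC timestamp when the document was created"
      }
    }
  }
}
```

Repeating this for every property in a large index is exactly what I expect to bloat the mappings.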
My questions are:

- Is updating the _meta field in the mappings the only standard way to provide the LLM with sufficient context for accurate NLQ-to-query conversion?
- Alternatively, should field descriptions and constraints (e.g., valid values) be provided as separate context outside the mappings (see the sketch after this list)? Is that considered good practice?
- If using GitHub - elastic/mcp-server-elasticsearch (which provides a get_mappings tool for retrieving mappings), how can additional field descriptions be incorporated? Should they be included in the LLM prompt separately, or should the _meta of the mappings be updated? Which approach is better?
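As a sketch of the second option, the constraints could live in a separate context document that is concatenated into the system prompt alongside the get_mappings output; the file layout below is hypothetical:

```json
{
  "index": "documents",
  "fields": {
    "status": {
      "type": "keyword",
      "description": "Review state of the document; exact, case-sensitive match required",
      "allowed_values": ["Approved", "Denied", "Draft"]
    }
  }
}
```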
I’d appreciate insights or best practices for providing LLMs with the necessary context to generate accurate Elasticsearch queries.