PII security with Elastic.Clients.Elasticsearch 9.2.1

Good day

I’m not sure if I’m posting this in the right place, but here goes.

I’m trying to implement an ES index with an object that has mainly numeric fields, and some text fields. I want to be able to search by any field, and only return the numeric ones as the others are personally identifying data that I neither need in my results, nor want to expose to anyone else.

Trying to find documentation on this for C# has been a bit of a wild goose chase: I often end up at documentation that may have worked with older clients but doesn’t seem to be supported any more.

What is the correct way to set up an index such that some fields can be searched on but will never be returned? Or, if that’s not possible, how do I set up masking per field with Elastic.Clients.Elasticsearch 9?

Welcome @Davyd_McColl

I’m not sure if I’m posting this in the right place, but here goes.

That's definitely the right place to ask. :)

I want to be able to search by any field,

So you need to index the full document as JSON, with all the fields.

and only return the numeric ones as the others are personally identifying data that I neither need in my results, nor want to expose to anyone else.

There are multiple ways for solving this.

At search time

One way, I think, is to control that at search time with source filtering, but maybe you want something else.
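As a rough illustration of the search-time option, here is a minimal sketch that only builds the JSON body of a _search request with System.Text.Json; sending it is left out. The class name SearchBodySketch is hypothetical, and the field names are borrowed from the mapping posted further down the thread purely as an example. Note that this restricts only what this particular request returns: a client that asks for all fields still gets them.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper, not part of any Elastic library.
public static class SearchBodySketch
{
    // Builds a _search request body that matches on first_name but asks
    // Elasticsearch to return only the listed non-PII fields from _source.
    public static string Build(string firstName)
    {
        var body = new
        {
            query = new { match = new { first_name = firstName } },
            // Source filtering: only these fields are returned for each hit.
            _source = new { includes = new[] { "id", "order_id", "customer_id" } }
        };
        return JsonSerializer.Serialize(body);
    }
}
```

Calling SearchBodySketch.Build("Jane") gives a body that can be posted to the index's _search endpoint with any HTTP client.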

Exclude from source (mapping)

See the _source field reference documentation.

Using mapping (stored fields)

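Another solution could be to disable the _source field and rely on stored fields: set store: true on the fields you want to be able to return, and leave the "non-viewable" fields at store: false (the default). That way only the stored fields can be fetched, while the others remain searchable.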

There are probably other ways...

Hope this helps.

Hi

Thanks for that - it does help.

When I attempted to use .Store(false) on a mapping, like so:

client.Indices.Create<Document>(index =>
  index.Index("the-index")
    .Mappings(m =>
       m.Properties(p =>
         p.Text(o => o.FirstName, k => k.Store(false))
...

Then I still see all the fields when browsing data via Elastron, and they all come back via a query. So I’m not sure if I’m just doing this wrong, or if there’s an issue here. In tests, the index is recreated from scratch for each test, so whatever my index setup is at the time the test runs is what applies - there’s no old index hanging about with all fields enabled, for example.

However, when I use:

client.Indices.Create<CallCenterCustomerIndexItem>(index =>
  index.Index("the-index")
    .Mappings(m =>
       m.Source(s =>
         s.Includes("id", "orderId", "callCenterId"...)

Then I find that the fields are restricted as I would expect (name fields come back null).

The only problem I have here is that I’d like to use nameof for the properties, but that gives back PascalCase, whereas the document is stored with camelCase properties. When I used nameof originally, all fields came back with default values; looking at the stored documents is what gave me the idea to camelCase the property names for Includes.

Ideally, I’d like something more robust than lower-casing the first letter of the PascalCased property names. I’d rather specify the field name somewhere, and I’m quite sure it’s possible, but again I’m hitting a wall trying to find documentation on the subject for v9+ of the library. I’d like to be deterministic here, especially to cover the case where someone renames a property on the POCO without realising that it would affect the mapping.

What should I be doing to set names for fields?

OK, so I can set [JsonPropertyName] on the properties and direct the field names that way. If there’s a better way, please let me know, e.g. if there’s a way via the fluent index creation syntax.
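To illustrate the [JsonPropertyName] approach in isolation, here is a minimal sketch using only System.Text.Json, without the Elasticsearch client. The OrderDocument class and the NamingSketch helper are hypothetical; the snake_case names mirror the mapping posted later in the thread. It shows that the serialized name is pinned by the attribute, so renaming the C# property does not change the stored field name. The CamelCase helper is included only for comparison with converting nameof output.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical model, for illustration only.
public class OrderDocument
{
    [JsonPropertyName("order_id")]
    public long OrderId { get; set; }

    [JsonPropertyName("first_name")]
    public string FirstName { get; set; } = "";
}

public static class NamingSketch
{
    // Serializes with the names fixed by [JsonPropertyName].
    public static string Serialize(OrderDocument doc) => JsonSerializer.Serialize(doc);

    // Converts a CLR name such as nameof(OrderDocument.FirstName) to camelCase
    // using the built-in System.Text.Json policy instead of hand-rolled casing.
    public static string CamelCase(string clrName) =>
        JsonNamingPolicy.CamelCase.ConvertName(clrName);
}
```

With this model, NamingSketch.Serialize(new OrderDocument { OrderId = 42, FirstName = "Jane" }) produces JSON keyed by order_id and first_name, while NamingSketch.CamelCase(nameof(OrderDocument.FirstName)) yields firstName.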

Did the-index already exist at this point?

I had to look Elastron up. It is not an Elastic product, so most of us will have little idea how it works or what it might show. IMO it is better to look at the indexed documents via Kibana, which is very tightly integrated with Elasticsearch.

Hi

No, the index is new every time in a test, as mentioned above.

It must absolutely not matter what client I use to browse results. If Elastron can “see” the field via REST calls, any other client can. The aim here is to be able to search by a field without that field’s value coming back in results, even if the client didn’t filter fields (i.e. requested all fields) or explicitly requested that field.

I don't think this is possible with Elasticsearch alone: if you search by a value for a field, that field will be returned unless you filter it out.

Can you provide some examples of real usage, instead of just some code? It is not clear what exactly you want to do or what your documents look like.


Can you please retrieve the applied mappings from the index and post them here? I do not use the .NET client so am not sure whether your code creates the index correctly or not.

If you are trying to exclude fields from the source or not store them, be aware that this can affect the ability to reindex the data based on the data in the index and apply partial updates.


Hi, this is what I get when I curl /test-callcenter-orders/_mapping:

{
    "test-callcenter-orders": {
        "mappings": {
            "_source": {
                "includes": [
                    "id",
                    "order_id",
                    "date",
                    "store_id",
                    "call_center_id",
                    "group_id",
                    "brand_id",
                    "source",
                    "customer_id"
                ]
            },
            "properties": {
                "brand_id": {
                    "type": "long"
                },
                "call_center_id": {
                    "type": "long"
                },
                "customer_id": {
                    "type": "long"
                },
                "date": {
                    "type": "date"
                },
                "first_name": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "group_id": {
                    "type": "long"
                },
                "id": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "last_name": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "order_id": {
                    "type": "long"
                },
                "source": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "store_id": {
                    "type": "long"
                }
            }
        }
    }
}

And this is doing what I want - querying against that with the same model, FirstName and LastName always come back null, but I can still find customers by name with filters. As far as I understand, the bit I needed was the “includes”.

Is this data immutable or do you need to update it? If you need to update it, will you be overwriting it with complete documents from some external data store when you update?

The data is “immutable enough”. We’re using it to speed up searches for customers by name (along with a few other filters, all in the _source includes), as MySQL is taking a hammering. The final result is distilled down to customer ids. We plan on only adding documents, so even if a customer’s name changes, a new document appears and their id matches on both the old and new names - so no problem (the agent using this will confirm their details, since this is for a food order, and I don’t think customers update their details that regularly, to be honest). A document is added for each order the customer places, and there’s a backfill workflow to index historical orders.

However, I have a new problem now: I’ve been developing against elasticsearch:9.2.1 (docker), and stuff has been working nicely. But when testing on our staging deployment, which points to an ES serverless cloud instance, the index cannot be created. I see a message like:

Parameter [includes] is not allowed in source

My mapping according to the local index is:

{
    "test-shouldreturnfindtheexactmatch-20251126122212": {
        "aliases": {
            "moo": {}
        },
        "mappings": {
            "properties": {
                "brand_id": {
                    "type": "long"
                },
                "call_center_id": {
                    "type": "long"
                },
                "customer_id": {
                    "type": "long"
                },
                "date": {
                    "type": "date"
                },
                "first_name": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "group_id": {
                    "type": "long"
                },
                "id": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "last_name": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "order_id": {
                    "type": "long"
                },
                "source": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "store_id": {
                    "type": "long"
                }
            }
        },
        "settings": {
            "index": {
                "routing": {
                    "allocation": {
                        "include": {
                            "_tier_preference": "data_content"
                        }
                    }
                },
                "number_of_shards": "1",
                "provided_name": "test-shouldreturnfindtheexactmatch-20251126122212",
                "creation_date": "1764152532620",
                "number_of_replicas": "1",
                "uuid": "g4bOzoLTR-CsKV_O2YdiZw",
                "version": {
                    "created": "9039001"
                }
            }
        }
    }
}

and I’m using the official dotnet client (Elastic.Clients.Elasticsearch) with my index setup as follows:

        var createIndexResponse = client.Indices.Create<OrderIndexItem>(index =>
            index.Index(IndexName)
                .Mappings(m =>
                    m.Source(s =>
                            s.Includes(
                                GenerateIncludedPropertyNames()
                            )
                        )
                        .Properties(p =>
                            p.Keyword(o => o.Id)
                        ).Properties(p =>
                            p.IntegerNumber(o => o.OrderId)
                        ).Properties(p =>
                            p.IntegerNumber(o => o.CallCenterId)
                        ).Properties(p =>
                            p.IntegerNumber(o => o.StoreId)
                        ).Properties(p =>
                            p.IntegerNumber(o => o.GroupId)
                        ).Properties(p =>
                            p.Keyword(s => s.FirstName)
                        ).Properties(p =>
                            p.Keyword(o => o.LastName)
                        ).Properties(p =>
                            p.Date(o => o.Date)
                        )
                )
        );

where GenerateIncludedPropertyNames generates property names from the [JsonPropertyName(…)] annotations on the model, and drops properties marked with [PersonalIdentifier]. This code generated the above index locally, but fails to generate an index at ES cloud, which has been set up with a serverless instance.
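For reference, a helper of the kind described above could look roughly like the following. This is a minimal sketch and not the actual implementation: the PersonalIdentifierAttribute definition, the trimmed-down OrderIndexItemSketch model and the IncludedNames class are all stand-ins written for illustration, using only reflection and System.Text.Json attributes.

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Text.Json.Serialization;

// Stand-in for the [PersonalIdentifier] marker attribute mentioned above.
[AttributeUsage(AttributeTargets.Property)]
public sealed class PersonalIdentifierAttribute : Attribute { }

// Trimmed-down, hypothetical version of the indexed model.
public class OrderIndexItemSketch
{
    [JsonPropertyName("id")] public string Id { get; set; } = "";
    [JsonPropertyName("order_id")] public long OrderId { get; set; }
    [JsonPropertyName("customer_id")] public long CustomerId { get; set; }
    [PersonalIdentifier, JsonPropertyName("first_name")] public string FirstName { get; set; } = "";
    [PersonalIdentifier, JsonPropertyName("last_name")] public string LastName { get; set; } = "";
}

public static class IncludedNames
{
    // Returns the JSON field names of all public instance properties
    // that are not marked [PersonalIdentifier].
    public static string[] For<T>() =>
        typeof(T)
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.GetCustomAttribute<PersonalIdentifierAttribute>() is null)
            // Fail loudly if a property has no explicit name, so a new or renamed
            // property cannot silently fall back to a guessed field name.
            .Select(p => p.GetCustomAttribute<JsonPropertyNameAttribute>()?.Name
                         ?? throw new InvalidOperationException(
                             $"{typeof(T).Name}.{p.Name} has no [JsonPropertyName]"))
            .ToArray();
}
```

In this sketch, IncludedNames.For<OrderIndexItemSketch>() returns id, order_id and customer_id, and leaves out the two name fields.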

Any help appreciated.