Hi @Hugh_Dancy, welcome to the community! This is a great question (and a BIG topic). It is what Elasticsearch does best: full-text search at speed and scale.
So perhaps regex is not the best approach (it might be, depending on your exact requirements, but I suspect not).
Full-text search (or even semantic search, AKA vector search) might be a better fit.
Let's leave vector out for now. As you learn, you can build up queries with boolean must or should clauses, and if need be you can adjust the text analyzers, apply boosts, pre-filter, and so on.
But here is a simple example using the match query type. Take a look:
# Create an index with a single full-text field
PUT discuss-test-search
{
  "mappings": {
    "properties": {
      "paragraph": {
        "type": "text"
      }
    }
  }
}

# Index three sample paragraphs
POST discuss-test-search/_doc
{
  "paragraph": "The midnight sky cracked opened in thin shrouds."
}

POST discuss-test-search/_doc
{
  "paragraph": "The night, with its perilous storm-threatened sky, was black as obsidian."
}

POST discuss-test-search/_doc
{
  "paragraph": "The night seemed to last until dawn"
}

# Full-text search for the terms "night" and "sky"
GET discuss-test-search/_search
{
  "query": {
    "match": {
      "paragraph": {
        "query": "night sky"
      }
    }
  }
}
# results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8272065,
    "hits": [
      {
        "_index": "discuss-test-search",
        "_id": "iAqJq48Bq5nVW7SApKmG",
        "_score": 0.8272065,
        "_source": {
          "paragraph": "The night, with its perilous storm-threatened sky, was black as obsidian."
        }
      },
      {
        "_index": "discuss-test-search",
        "_id": "iQqJq48Bq5nVW7SApKmP",
        "_score": 0.517004,
        "_source": {
          "paragraph": "The night seemed to last until dawn"
        }
      },
      {
        "_index": "discuss-test-search",
        "_id": "hwqJq48Bq5nVW7SApKl8",
        "_score": 0.4923848,
        "_source": {
          "paragraph": "The midnight sky cracked opened in thin shrouds."
        }
      }
    ]
  }
}
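Note that the default operator is or, which is why all three paragraphs match: each contains at least one of night or sky. Try and to see the difference; only paragraphs containing both terms should match: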
GET discuss-test-search/_search
{
  "query": {
    "match": {
      "paragraph": {
        "query": "night sky",
        "operator": "and"
      }
    }
  }
}
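The boolean must and should clauses mentioned earlier can be combined in a bool query. Here is a minimal sketch against the same discuss-test-search index; the choice of which term is required and which is optional is just an assumption for illustration:

```console
# "night" is required; "sky" is optional but raises the score when present
GET discuss-test-search/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "paragraph": "night" } }
      ],
      "should": [
        { "match": { "paragraph": "sky" } }
      ]
    }
  }
}
```

With the three sample paragraphs, I would expect the midnight paragraph to drop out (it has no night token), and the paragraph with both night and sky to rank first.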
Take a look at this and perhaps come back with more questions.
Also note that each result has a score: the higher the score, the better the match.
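If you are curious where a score comes from, you can ask Elasticsearch to explain it. This is a small sketch that reuses the same match query and adds the explain flag to the search body; each hit should then carry an _explanation section breaking down the score:

```console
# Same query as before, with a per-hit scoring explanation
GET discuss-test-search/_search
{
  "explain": true,
  "query": {
    "match": {
      "paragraph": {
        "query": "night sky"
      }
    }
  }
}
```

The output is verbose, so it is best used on a handful of documents while you are learning how the ranking behaves.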
Now I will say, you are already borderline semantic search, because it seems like you may want midnight and night to be treated as the same. Lexically (character by character, left to right) they are actually fairly far apart, but semantically (in meaning) they are closer.
Btw, you can run a regex against the keyword type, but that would be incredibly inefficient at scale. Also, I don't think your regex would find midnight, etc.
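For completeness, here is roughly what the regex route looks like. This is only a sketch: the index name discuss-test-regex, the raw sub-field, and the pattern are all made up for illustration and are not taken from your question.

```console
# Hypothetical index: the same text field plus a keyword sub-field named "raw"
PUT discuss-test-regex
{
  "mappings": {
    "properties": {
      "paragraph": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

POST discuss-test-regex/_doc
{
  "paragraph": "The midnight sky cracked opened in thin shrouds."
}

# On a keyword field the pattern has to match the whole stored value,
# hence the leading and trailing .*
GET discuss-test-regex/_search
{
  "query": {
    "regexp": {
      "paragraph.raw": {
        "value": ".*night.*"
      }
    }
  }
}
```

A pattern that starts with .* gives Elasticsearch nothing to narrow the candidate values with, which is a big part of why this approach gets expensive as the index grows. It is also case-sensitive by default and gives you no meaningful relevance ranking.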
So do a little "searching" and come back with more. I think you will want to search, not regex.