Regex for a large text (book paragraphs)

I am taking individual paragraphs from a book and indexing them as text fields.

I want to be able to run regular expressions across multiple words, like "night.*sky", to find sentences like:

The midnight sky cracked opened in thin shrouds.

The night, with its perilous storm-threatened sky, was black as obsidian.

From my research, I see that text fields are tokenized by spaces so each word would be its own separate token and thus the regex would not work across words. What is the best way to search through these paragraphs for my use case?

Hi @Hugh_Dancy, welcome to the community! This is a great question (and a BIG topic). This is what Elasticsearch does best: full-text search at speed and scale.

So perhaps regex is not the best approach. (It might be, depending on your exact requirements, but I suspect not.)

Full-text search (or even semantic search, a.k.a. vector search) might be a better fit.

Let's leave vector search out for now. Take a look at this simple example. As you learn, you can build up queries with bool queries and must or should clauses, and, if need be, you can adjust the text analyzers, add boosts, pre-filter, and so on.

But here is a simple example using the match query type. Take a look:

PUT discuss-test-search
{
  "mappings": {
    "properties": {
      "paragraph": {
        "type": "text"
      }
    }
  }
}



POST discuss-test-search/_doc
{
  "paragraph": "The midnight sky cracked opened in thin shrouds."
}

POST discuss-test-search/_doc
{
  "paragraph": "The night, with its perilous storm-threatened sky, was black as obsidian."
}

POST discuss-test-search/_doc
{
  "paragraph": "The night seemed to last until dawn"
}

GET discuss-test-search/_search
{
  "query": {
    "match": {
      "paragraph": {
        "query": "night sky"
      }
    }
  }
}

# results 
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8272065,
    "hits": [
      {
        "_index": "discuss-test-search",
        "_id": "iAqJq48Bq5nVW7SApKmG",
        "_score": 0.8272065,
        "_source": {
          "paragraph": "The night, with its perilous storm-threatened sky, was black as obsidian."
        }
      },
      {
        "_index": "discuss-test-search",
        "_id": "iQqJq48Bq5nVW7SApKmP",
        "_score": 0.517004,
        "_source": {
          "paragraph": "The night seemed to last until dawn"
        }
      },
      {
        "_index": "discuss-test-search",
        "_id": "hwqJq48Bq5nVW7SApKl8",
        "_score": 0.4923848,
        "_source": {
          "paragraph": "The midnight sky cracked opened in thin shrouds."
        }
      }
    ]
  }
}
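
To see why all three paragraphs come back, it can help to look at the tokens the text field actually produces. A minimal sketch using the _analyze API against the same index and field, assuming the default standard analyzer (the mapping above does not set one):

```console
GET discuss-test-search/_analyze
{
  "field": "paragraph",
  "text": "The midnight sky cracked opened in thin shrouds."
}
```

With the standard analyzer this should return the lowercased tokens the, midnight, sky, cracked, opened, in, thin, shrouds. Here "midnight" is a single token, distinct from "night", so this paragraph matches the query only on "sky".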

Note that the default operator is "or"; you can try "and" and see the difference.

GET discuss-test-search/_search
{
  "query": {
    "match": {
      "paragraph": {
        "query": "night sky",
        "operator": "and"
      }
    }
  }
}
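
If the order of the words matters, as it does in "night.*sky", the intervals query is one option worth a look. The following is a sketch rather than the only way to do it: it asks for the term night followed later by sky in the same field. The max_gaps value of 10 is an arbitrary number chosen for illustration, not something from the original question.

```console
GET discuss-test-search/_search
{
  "query": {
    "intervals": {
      "paragraph": {
        "all_of": {
          "ordered": true,
          "max_gaps": 10,
          "intervals": [
            { "match": { "query": "night" } },
            { "match": { "query": "sky" } }
          ]
        }
      }
    }
  }
}
```

Against the three sample paragraphs, this should match only the "The night, with its perilous storm-threatened sky" sentence, since "midnight" is indexed as a different token from "night".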

I would say take a look at this and perhaps come back with more questions.

Also note that each result has a score: the higher the score, the better the match.

Now I will say, you are already borderline semantic search, because it seems like you may want "midnight" and "night" to be treated as the same. Lexically (read left to right) they are actually fairly far apart, but semantically (in meaning) they are closer.
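
Short of semantic search, one purely lexical workaround is a wildcard on the term itself. A hedged sketch, assuming the index above: a bool query whose must clauses require some token ending in "night" (which covers both night and midnight) plus the token sky. The pattern is kept lowercase to line up with the lowercased tokens the standard analyzer indexes.

```console
GET discuss-test-search/_search
{
  "query": {
    "bool": {
      "must": [
        { "wildcard": { "paragraph": { "value": "*night" } } },
        { "match": { "paragraph": "sky" } }
      ]
    }
  }
}
```

This should return both sky paragraphs and leave out "The night seemed to last until dawn". Keep in mind that patterns with a leading wildcard are expensive, because they may have to be checked against many terms in the index.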

By the way, you can run a regexp query on a keyword field, but that would be incredibly inefficient at scale. Also, I don't think your regex as written would find "midnight", etc., because a Lucene regex has to match the entire field value.
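
For completeness, here is roughly what the regex route could look like. This is a sketch under assumptions: the index name discuss-test-regex and the sub-field name raw are made up for illustration, and the same three paragraphs would need to be indexed into it. Because Lucene regular expressions are anchored to the whole value, the pattern needs a leading and trailing ".*" to behave like a "contains" search.

```console
PUT discuss-test-regex
{
  "mappings": {
    "properties": {
      "paragraph": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

GET discuss-test-regex/_search
{
  "query": {
    "regexp": {
      "paragraph.raw": {
        "value": ".*night.*sky.*"
      }
    }
  }
}
```

The keyword sub-field stores each paragraph as one unanalyzed term, so the pattern can span words, and it is case-sensitive unless configured otherwise. Elasticsearch also has a wildcard field type aimed at this kind of pattern matching on long strings, which may be worth evaluating if regex turns out to be a hard requirement.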

So do a little "searching" and come back with more. I think you will want search, not regex.
