Regex for a large text (book paragraphs)

Hugh_Dancy · May 24, 2024, 4:26pm

I am taking individual paragraphs from a book and indexing them as text fields.

I want to be able to run regular expressions across multiple words, like "night.*sky", to find sentences like:

The midnight sky cracked opened in thin shrouds.

The night, with its perilous storm-threatened sky, was black as obsidian.

From my research, I see that text fields are tokenized by spaces so each word would be its own separate token and thus the regex would not work across words. What is the best way to search through these paragraphs for my use case?

stephenb · May 24, 2024, 5:05pm

Hi @Hugh_Dancy, welcome to the community! This is a great question (and a BIG topic). This is what Elasticsearch does best: full-text search at speed and scale.

So perhaps regex is not the best approach (it might be, depending on your exact requirements, but I suspect not).

Full-text search (or even semantic, a.k.a. vector, search) might be a better fit.

Let's leave vector search out for now. Take a look at this simple example; as you learn, you can build up queries with a bool query and its must or should clauses. If need be, you can also adjust the text analyzers, apply boosts, pre-filter, and so on.

Here is a simple example using the match query type:

PUT discuss-test-search
{
  "mappings": {
    "properties": {
      "paragraph": {
        "type": "text"
      }
    }
  }
}



POST discuss-test-search/_doc
{
  "paragraph": "The midnight sky cracked opened in thin shrouds."
}

POST discuss-test-search/_doc
{
  "paragraph": "The night, with its perilous storm-threatened sky, was black as obsidian."
}

POST discuss-test-search/_doc
{
  "paragraph": "The night seemed to last until dawn"
}

GET discuss-test-search/_search
{
  "query": {
    "match": {
      "paragraph": {
        "query": "night sky"
      }
    }
  }
}

# results 
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8272065,
    "hits": [
      {
        "_index": "discuss-test-search",
        "_id": "iAqJq48Bq5nVW7SApKmG",
        "_score": 0.8272065,
        "_source": {
          "paragraph": "The night, with its perilous storm-threatened sky, was black as obsidian."
        }
      },
      {
        "_index": "discuss-test-search",
        "_id": "iQqJq48Bq5nVW7SApKmP",
        "_score": 0.517004,
        "_source": {
          "paragraph": "The night seemed to last until dawn"
        }
      },
      {
        "_index": "discuss-test-search",
        "_id": "hwqJq48Bq5nVW7SApKl8",
        "_score": 0.4923848,
        "_source": {
          "paragraph": "The midnight sky cracked opened in thin shrouds."
        }
      }
    ]
  }
}

Note that the operator is "or" by default; you can try "and" and see the difference.

GET discuss-test-search/_search
{
  "query": {
    "match": {
      "paragraph": {
        "query": "night sky",
        "operator": "and"
      }
    }
  }
}
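
As a rough sketch of the bool query with must and should clauses mentioned earlier, you could require one word and treat the other as a scoring bonus. Which word is required and which is optional here is only an assumption for illustration:

```console
GET discuss-test-search/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "paragraph": "sky" } }
      ],
      "should": [
        { "match": { "paragraph": "night" } }
      ]
    }
  }
}
```

Against the three sample paragraphs, this should return only the two that contain sky, with the one that also contains night ranked first.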

I would say take a look at this and perhaps come back with more.

Also note that there is a score for each result: the higher the score, the better the match.

Now I will say: you are already borderline semantic search, because it seems like you may want "midnight" and "night" to be treated the same. Lexically, read from left to right, they are fairly far apart, but semantically (in meaning) they are closer.
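
If it helps, a quick way to see this is the _analyze API. This is just a sketch against the discuss-test-search index from the example, and it assumes the paragraph field still uses the default standard analyzer:

```console
GET discuss-test-search/_analyze
{
  "field": "paragraph",
  "text": "The midnight sky cracked opened in thin shrouds."
}
```

This should come back with the tokens the, midnight, sky, cracked, opened, in, thin, shrouds. Since midnight is indexed as its own term, the query term night never matches it; that paragraph only shows up in the "or" results because of sky.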

By the way, you can run a regex on the keyword type, but that would be incredibly inefficient at scale. Also, I don't think your regex would find "midnight", etc.
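
For completeness, here is a minimal sketch of what regex on a keyword field could look like. The index name discuss-test-regex and the paragraph.raw sub-field are hypothetical (the mapping in the example does not define a keyword sub-field), and you would need to index the paragraphs into this index first:

```console
PUT discuss-test-regex
{
  "mappings": {
    "properties": {
      "paragraph": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

GET discuss-test-regex/_search
{
  "query": {
    "regexp": {
      "paragraph.raw": {
        "value": ".*night.*sky.*"
      }
    }
  }
}
```

Lucene regular expressions are anchored to the whole term, so "night.*sky" as written would only match a value that starts with night and ends with sky. That is why the pattern is wrapped in ".*" here. The leading ".*" forces a scan across the terms of the field, which is a big part of why this gets expensive at scale.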

So do a little "searching" and come back with more. I think you will want search, not regex.
