I want to search for words that occur in the same sentence or paragraph.
Example: if I search for "Bill" and "Steve", then only document 2 should be returned, because both words occur in the same sentence. But if I search for "Bill" and "computer", then no document should be returned, because these words appear in two different sentences.
{
  "_index": "test",
  "_type": "_doc",
  "_id": "1",
  "_score": 1.0,
  "_source": {
    "content": "Bill Gates founded Microsoft. Steve Jobs founded Apple. They were both influential in the tech industry."
  }
},
{
  "_index": "test",
  "_type": "_doc",
  "_id": "2",
  "_score": 1.0,
  "_source": {
    "content": "Bill Gates and Steve Jobs were both pioneers in technology. They revolutionized the personal computer industry."
  }
}
You could use a proximity search, but it is not restricted to a single sentence. For example, a text like "I'm Bill. Steve is here." would match as well, even though the two names are in different sentences.
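To make that concrete, here is a toy Python model of a position-based proximity check. It is only a sketch: the `tokenize` and `within_distance` helpers are hypothetical names, and the distance rule is a simplification rather than Lucene's actual slop computation. The point is that a check based purely on token positions knows nothing about sentence boundaries.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within_distance(text, first, second, max_gap):
    """Return True if `first` and `second` occur with at most `max_gap` tokens between them."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    # abs(i - j) - 1 is the number of tokens sitting between the two words.
    return any(abs(i - j) - 1 <= max_gap
               for i in first_positions
               for j in second_positions)

# The period between "Bill" and "Steve" is invisible to the check.
print(within_distance("I'm Bill. Steve is here.", "Bill", "Steve", max_gap=2))  # True
```

With a large enough gap, "Bill" and "computer" in document 2 would match too, although they are in different sentences.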
If I store each sentence in a nested object under a unique key, like this:
{
  "_index": "test",
  "_type": "_doc",
  "_id": "1",
  "_score": 1.0,
  "_source": {
    "content": {
      "sentence1": "Bill Gates founded Microsoft.",
      "sentence2": "Steve Jobs founded Apple.",
      "sentence3": "They were both influential in the tech industry."
    }
  }
},
{
  "_index": "test",
  "_type": "_doc",
  "_id": "2",
  "_score": 1.0,
  "_source": {
    "content": {
      "sentence1": "Bill Gates and Steve Jobs were both pioneers in technology.",
      "sentence2": "They revolutionized the personal computer industry."
    }
  }
}
Can we now search within a single sentence? Or would it be better to store each document as an array of sentences?
Would a custom sentence tokenizer help to achieve this?
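One possible way to approach the array idea, sketched in Python as plain request bodies (no client library involved): split the text into sentences before indexing, store them as an array in a single `text` field, and rely on the `position_increment_gap` mapping parameter. Elasticsearch puts a gap (100 positions by default) between consecutive values of an array, so a `match_phrase` query whose `slop` is smaller than that gap should not match across two sentences. The `split_sentences`, `build_document` and `same_sentence_query` helpers are hypothetical names, the regex splitter is deliberately naive (it will break on abbreviations such as "Mr."), and the `slop` of 50 is an assumed value, not a recommendation.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Index mapping body: one text field; array values are kept 100 positions apart.
mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "text", "position_increment_gap": 100}
        }
    }
}

def build_document(text):
    """Build the document source with one array entry per sentence."""
    return {"content": split_sentences(text)}

def same_sentence_query(words, slop=50):
    """Build a match_phrase body; slop must stay below position_increment_gap."""
    return {
        "query": {
            "match_phrase": {
                "content": {"query": " ".join(words), "slop": slop}
            }
        }
    }

doc2 = build_document(
    "Bill Gates and Steve Jobs were both pioneers in technology. "
    "They revolutionized the personal computer industry."
)
query = same_sentence_query(["Bill", "Steve"])
```

Under these assumptions, the query for "Bill" and "Steve" should match document 2 only, and a query for "Bill" and "computer" should match nothing. If sentences can be longer than the chosen `slop`, both `slop` and `position_increment_gap` would need to be raised together.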