Hi folks, I am currently working on a project to build full-text search over scientific research data. I need guidance on the feasibility of implementing the following functionality:
-
Basic Search Operators:
- AND: Search for documents containing all specified terms. For example, searching for "genome AND sequencing" should return documents that include both terms.
- OR: Search for documents containing any of the specified terms. For example, "cancer OR tumor" should return documents that include either term.
- NOT: Exclude documents containing the specified term. For example, "vaccine NOT influenza" should return documents that mention vaccines but not influenza.
Based on my initial research, these operators can be implemented with a bool query, using its must, should, and must_not clauses. Is there a better way to achieve this?
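A minimal sketch of how the three basic operators might map onto a bool query body. The field name `content` and the helper `bool_query` are hypothetical, and the sketch only builds the request body as a Python dict; it does not talk to a cluster.

```python
def bool_query(must=(), should=(), must_not=(), field="content"):
    """Build a bool query body from plain terms.

    `field` is a hypothetical text field name; adjust it to the real mapping.
    """
    def match(term):
        return {"match": {field: term}}

    body = {}
    if must:
        body["must"] = [match(t) for t in must]
    if should:
        body["should"] = [match(t) for t in should]
        # Require at least one OR-term even when must clauses are present.
        body["minimum_should_match"] = 1
    if must_not:
        body["must_not"] = [match(t) for t in must_not]
    return {"bool": body}


# "genome AND sequencing"
and_query = bool_query(must=["genome", "sequencing"])
# "cancer OR tumor"
or_query = bool_query(should=["cancer", "tumor"])
# "vaccine NOT influenza"
not_query = bool_query(must=["vaccine"], must_not=["influenza"])
```

Each result can be passed as the `query` part of a search request body.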
-
Advanced Search Operators:
- Same Sentence Search: Search for documents where specified terms appear within the same sentence. For example, "CRISPR same_sentence Cas9" should return documents where CRISPR and Cas9 are mentioned within the same sentence.
- Same Paragraph Search: Search for documents where specified terms appear within the same paragraph. For example, "immunotherapy same_paragraph checkpoint inhibitors" should return documents where these terms are in the same paragraph.
- Proximity Search (Ordered): Search for documents where specified terms appear near each other in a specific order. For example, "stem cell PROXIMITY/5 therapy" should return documents where "stem cell" appears within 5 words before "therapy".
- Proximity Search (Unordered): Search for documents where specified terms appear near each other in any order. For example, "therapy ADJACENT/5 stem cell" should return documents where "therapy" and "stem cell" appear within 5 words of each other in any order.
In my research I found a way to implement proximity search using span queries. However, I could not find any resource on implementing same-sentence and same-paragraph search natively within Elasticsearch. Can you please tell me whether there is a way to implement this functionality with Elasticsearch, and what best practices I should follow?
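For the two proximity operators, a rough sketch of the span-query approach mentioned above, built from `span_near` and `span_term`. Several things here are assumptions: the field name `content`, the helper names, that indexed tokens are lowercased (span_term is not analyzed, so the word must match the indexed token), and that "within 5 words" corresponds to `slop: 5`, which may need adjusting by one depending on how the distance is counted.

```python
def span_term(word, field="content"):
    # span_term is not analyzed: `word` must equal the indexed token.
    return {"span_term": {field: word}}


def span_phrase(phrase, field="content"):
    """Treat a multi-word phrase such as "stem cell" as adjacent, ordered terms."""
    words = phrase.lower().split()  # assumes a lowercasing analyzer at index time
    if len(words) == 1:
        return span_term(words[0], field)
    return {
        "span_near": {
            "clauses": [span_term(w, field) for w in words],
            "slop": 0,
            "in_order": True,
        }
    }


def proximity(left, right, distance, ordered, field="content"):
    """PROXIMITY/n (ordered=True) or ADJACENT/n (ordered=False)."""
    return {
        "span_near": {
            "clauses": [span_phrase(left, field), span_phrase(right, field)],
            "slop": distance,
            "in_order": ordered,
        }
    }


# "stem cell PROXIMITY/5 therapy"
ordered_query = proximity("stem cell", "therapy", 5, ordered=True)
# "therapy ADJACENT/5 stem cell"
unordered_query = proximity("therapy", "stem cell", 5, ordered=False)
```

This covers only the proximity operators; it does not address the same-sentence or same-paragraph cases.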
-
Complex Query Capabilities:
- Ability to mix and match the above two types of operators to create complex queries. For example, a query like "(genome AND sequencing) AND (CRISPR same_paragraph Cas9) AND (therapy PROXIMITY/5 stem cell)".
- Ensuring that these complex queries can be executed within a fixed time duration.
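A sketch of how such a combined query might be assembled, assuming span queries can sit alongside match queries inside a bool `must` list and using the search request's `timeout` field as the time bound. The field name `content` and the helper names are hypothetical. The same_paragraph clause from the example is left out because no native operator for it is known here. Note that `timeout` is applied per shard on a best-effort basis, so the response's `timed_out` flag would still need checking.

```python
FIELD = "content"  # hypothetical text field name


def span_term(word):
    return {"span_term": {FIELD: word}}


# (genome AND sequencing)
genome_and_sequencing = {
    "bool": {
        "must": [
            {"match": {FIELD: "genome"}},
            {"match": {FIELD: "sequencing"}},
        ]
    }
}

# "stem cell" as two adjacent, ordered terms
stem_cell = {
    "span_near": {
        "clauses": [span_term("stem"), span_term("cell")],
        "slop": 0,
        "in_order": True,
    }
}

# (therapy PROXIMITY/5 stem cell): "therapy" first, then "stem cell"
therapy_near_stem_cell = {
    "span_near": {
        "clauses": [span_term("therapy"), stem_cell],
        "slop": 5,
        "in_order": True,
    }
}


def build_search_body(clauses, timeout="120s"):
    """AND the sub-queries together and attach a best-effort time limit."""
    return {"timeout": timeout, "query": {"bool": {"must": list(clauses)}}}


body = build_search_body([genome_and_sequencing, therapy_near_stem_cell])
```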
-
Performance Requirements:
- Results should be returned in under 120 seconds of search execution.
Data Details:
- Total Data Size: Terabytes
- Number of Documents: Around 1 billion
- Average Document Size: Approximately 20,000 words per document
- Document Content: Scientific research data
- Language: English
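To make the scale concrete, a back-of-envelope calculation from the figures above. The 6 bytes per word (including the separating space) is an assumption for English text, not a measured value, and index size on disk will differ from raw text size.

```python
documents = 1_000_000_000   # around 1 billion documents
words_per_document = 20_000  # approximate average

total_words = documents * words_per_document

ASSUMED_BYTES_PER_WORD = 6  # assumption: average English word plus a space
raw_text_bytes = total_words * ASSUMED_BYTES_PER_WORD
raw_text_terabytes = raw_text_bytes / 1e12

print(f"{total_words:,} words, roughly {raw_text_terabytes:.0f} TB of raw text")
```

Under that assumption the corpus is about 2 x 10^13 words, or on the order of a hundred terabytes of raw text.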
Questions:
- Is it feasible to implement the above functionalities in Elasticsearch given the data size and performance requirements (specifically same sentence and same paragraph search)?
- Are there specific configurations or best practices I should follow to achieve this?
- Would it be necessary to use any additional tools or plugins to support advanced search operators and complex queries?
- Is there any resource or guide to decide what infrastructure and architecture would be required to achieve the performance requirements mentioned above?
Thanks in advance!