Search combined/JOIN indexes

Johanna12221 · November 14, 2022, 10:26pm

Hello,
We are setting up a search over document contents (pdf/docx etc.) where the permissions come from a database.

I've gotten all the data into Elastic with the help of Logstash and FSCrawler.
However, the data ends up in two different indexes and I'm out of ideas for how to "merge" them. Coming from a lot of SQL, I would JOIN those indexes into a new view, so that I can search on content (from the index created by FSCrawler) with the permissions coming from SQL Server via Logstash.

The Logstash index has a column "filename" which matches "path.virtual" from FSCrawler.

This is how I search a document contents:

GET /view-doc/_search
{
    "query": {
        "query_string" : {
            "query" : "Exam test",
            "default_field": "content"
        }
    }
}

This is how I search the index containing the permission array:

GET /db-view/_search
{
  "query": {
    "bool" : {
      "must": [
        {
          "query_string": {
            "query": "search term db view"
          }
        }
      ],
      "should" : [
        { "term" : { "role_ids": "305" } },
        { "term" : { "role_ids" : "306" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}

How can I query both indexes, joined on the filename column and filtered on the permissions array?

warkolm · November 14, 2022, 10:29pm

TL;DR: you cannot do this with a single query.

The better approach would be to reindex the data and merge it, so that each document's contents also have all the permissions attached.
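
To illustrate what a search could look like once content and permissions live in the same document, here is a minimal sketch that builds the request body as a Python dict. It assumes the merged documents keep the `content` field from FSCrawler and carry the `role_ids` array from the database index; the function name and the `roles_field` parameter are hypothetical, and the body would be sent to `GET /<merged-index>/_search`.

```python
import json


def build_search_body(text, role_ids, roles_field="role_ids"):
    """Build a _search body: full-text match on content, restricted to
    documents that carry at least one of the given role ids."""
    return {
        "query": {
            "bool": {
                # Scored full-text part, same as the original content query.
                "must": [
                    {"query_string": {"query": text, "default_field": "content"}}
                ],
                # Non-scoring permission check; "terms" matches any listed value.
                "filter": [{"terms": {roles_field: role_ids}}],
            }
        }
    }


body = build_search_body("Exam test", ["305", "306"])
print(json.dumps(body, indent=2))
```

Putting the permission check in `filter` rather than `should` keeps it out of the relevance score, so a document is not ranked higher just because it matches more roles.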

Johanna12221 · November 14, 2022, 10:37pm

Thanks for your fast reply! I've tried to look into how to do that as well, and read about "denormalization".
What would be a good approach for this with the current setup of Logstash and FSCrawler?

warkolm · November 15, 2022, 12:12am

You could store the permissions data as you do now, then ingest the documents via FSCrawler through an ingest pipeline that uses an enrich policy. See "Create enrich policy API" in the Elasticsearch Guide [8.5].
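
A rough sketch of the pieces this suggestion involves, written as Python dicts that mirror the JSON request bodies. The index name `db-view` and the fields `filename`, `role_ids` and `path.virtual` come from this thread; the policy name, pipeline name and the `permissions` target field are made up for illustration, and the whole thing assumes `filename` is mapped so that it can be matched exactly (for example as `keyword`).

```python
import json

POLICY_NAME = "doc-permissions"        # hypothetical name
PIPELINE_NAME = "attach-permissions"   # hypothetical name

# Body for: PUT /_enrich/policy/doc-permissions
# A "match" policy looks up source documents whose match_field equals
# the incoming document's value and copies the enrich_fields over.
enrich_policy = {
    "match": {
        "indices": "db-view",
        "match_field": "filename",
        "enrich_fields": ["role_ids"],
    }
}

# Body for: PUT /_ingest/pipeline/attach-permissions
# The enrich processor compares path.virtual against the policy's
# match_field and stores the matching entry under "permissions".
ingest_pipeline = {
    "description": "Attach role_ids from db-view to crawled documents",
    "processors": [
        {
            "enrich": {
                "policy_name": POLICY_NAME,
                "field": "path.virtual",
                "target_field": "permissions",
                "max_matches": 1,
            }
        }
    ],
}

# The policy has to be executed before the pipeline can use it.
requests_in_order = [
    ("PUT", f"/_enrich/policy/{POLICY_NAME}", enrich_policy),
    ("POST", f"/_enrich/policy/{POLICY_NAME}/_execute", None),
    ("PUT", f"/_ingest/pipeline/{PIPELINE_NAME}", ingest_pipeline),
]

for method, path, body in requests_in_order:
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

With this layout the roles would end up under `permissions.role_ids` in each crawled document, so a permission filter would target that field. Executing a policy builds a snapshot of `db-view`, so the policy would need to be executed again whenever the permissions in the database change.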

dadoonet · November 16, 2022, 12:39pm

That looks like a great idea, as you can define an ingest pipeline in FSCrawler.

@Johanna12221 If you make it work, could you please update this thread and ideally send a PR to the FSCrawler project so this trick is documented?

This should go in the "Tips and tricks" page of the FSCrawler 2.10-SNAPSHOT documentation.

Johanna12221 · November 17, 2022, 9:19am

Thanks for your reply. If I manage to solve it I will post the solution, but I'm stuck because FSCrawler and Logstash store the path differently. I made another post about it: "Logstash escaping characters, want to disable" (Elastic Stack / Logstash, Discuss the Elastic Stack).

As a result, the enrich processor in the ingest pipeline cannot match the two values.
This is the difference:
Logstash JDBC plugin: "path": "\"\\publicerat\\IN0010.pdf\"",
FSCrawler: "virtual": """\publicerat\IN0010.pdf""",
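
One possible reading of the two values above, offered as an assumption rather than a diagnosis: if the Logstash line is an ordinary JSON string and the FSCrawler line uses the Kibana console's triple-quote notation (where the text between the triple quotes is taken literally), then the decoded values differ only by a pair of double quotes wrapped around the Logstash path. The sketch below decodes both and applies a hypothetical `normalize` helper to show that they would then compare equal.

```python
import json

# Values copied from the post. The first is the JSON string literal,
# the second is the text found between the triple quotes.
logstash_json = r'"\"\\publicerat\\IN0010.pdf\""'
fscrawler_raw = r'\publicerat\IN0010.pdf'

# Decoding the JSON escapes leaves single backslashes and literal quotes.
logstash_value = json.loads(logstash_json)


def normalize(path):
    """Trim whitespace and drop one pair of wrapping double quotes, if any."""
    path = path.strip()
    if len(path) >= 2 and path[0] == '"' and path[-1] == '"':
        path = path[1:-1]
    return path


print(repr(logstash_value))
print(normalize(logstash_value) == normalize(fscrawler_raw))
```

If that reading holds, removing the wrapping quotes on the database side before indexing (in the SQL statement or in a Logstash filter, for example) would let an exact match on the path succeed.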

system · December 15, 2022, 9:19am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
