Search combined/JOIN indexes

Hello,
We are working on setting up a search with document contents (pdf/docx etc.) where permissions comes from a database.

I've gotten all the data into Elastic with the help of Logstash and FSCrawler.
However they are two different indexes and I'm out of ideas how to "merge" them. From someone who comes from doing lots of SQL, I would use JOIN on those indexes to a new view to search on content (from the index created by FSCrawler) with the permissions coming from SQL server Logstash.

The Logstash index has a column "filename" which matches "path.virtual" from FSCrawler.

This is how I search a document contents:

GET /view-doc/_search
{
    "query": {
        "query_string" : {
            "query" : "Exam test",
            "default_field": "content"
        }
    }
}

This is how I search the index containing the permission array:

GET /db-view/_search
{
  "query": {
    "bool" : {
      "must": [
        {
          "query_string": {
            "query": "search term db view"
          }
        }
      ],
      "should" : [
        { "term" : { "role_ids": "305" } },
        { "term" : { "role_ids" : "306" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}

How can I query both indexes having them joined on the filename column, and filtered on the permissions array?

TLDR you cannot do this with a single query.

The better approach would be to reindex the data and merge it so that each doc contents also have all the permissions attached to it.

Thanks for your fast reply! I've tried to look into how to do that as well and read about "denormalization".
What could be a good approach for this with the current setup of Logstash with FSCrawler?

You could store the permissions data like you have, then ingest the documents via fscrawler and an enrich policy - Create enrich policy API | Elasticsearch Guide [8.5] | Elastic

1 Like

It looks like a great idea as you can define an ingest pipeline in FScrawler.

@Johanna12221 If you make it work, could you update this thread please and ideally send a PR to the FSCrawler project so this trick is documented?

This should go in Tips and tricks — FSCrawler 2.10-SNAPSHOT documentation

Thanks for your reply, if I manage to solve it I will post the solution, but I'm stuck on that FSCrawler and Logstash stores the data differently. I made another post about it: Logstash escaping characters, want to disable - Elastic Stack / Logstash - Discuss the Elastic Stack

The result is I can't match with the enrich/ingest/pipeline.
This is the difference:
Logstash JDBC plugin "path": "\"\\publicerat\\IN0010.pdf\"",
FSCrawler: "virtual": """\publicerat\IN0010.pdf""",

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.