Help Designing Index for PDF Documents


(John Coggeshall) #1

Hello,

I'm hoping someone can help a relative newbie with the best way to design a search index for a project I'm working on. Basically, what I have is fewer than 1000 PDF documents sorted into a bit of a parent-child relationship. Specifically:

Organization -> Document -> Section of Document

Where 'Section of Document' is an individual PDF. So for any given organization they could have 10 "documents", each document consisting of 50 Sections, and each section mapping to a PDF that is indexed.

My search queries need to return full text searches (along with various filters to return only matches from specific organizations, etc) grouped by Organization and Document... i.e. if I was rendering the results it would be:

Organization #1
    Document #1
        Matching Section #1
        Matching Section #2
    Document #2
        ...
Organization #2
    Document #1
        Matching Section #1
        ...

So it's important that we display all the documents for a specific organization together in a single set, and then the same is true for a document within that organization that has a matching section.

I'm getting a little lost in the documentation between field collapsing, entity-centric indexing, and aggregations. I'd really appreciate it if someone could explain the best approach here, and maybe give an example of the index structure and query structure. This is for ES 5.

Regards,

John


(Michael Sander) #2

You can index each section separately, with a field each for organization, document, and section. If you leave _source enabled (the default setting), you'll be able to collate the results into whatever hierarchy you want.

The only issue I see with the above is getting nice ranking and relevance, and you'll need to think about that a bit more. How do you want to order the organizations/documents/sections when displayed? Do you want the most relevant organizations, documents, or sections on top? However, given that the document set is pretty small, you'll be able to grab the full result set in a single call and order it however you like.
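
A minimal sketch of the client-side collation described above, assuming each hit is one section and carries hypothetical `organization`, `document`, and `section` fields in its `_source` (these names are illustrative, not taken from a real index):

```python
from collections import defaultdict


def collate(hits):
    """Group flat section hits into {organization: {document: [(score, section)]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for hit in hits:
        src = hit["_source"]
        # Field names below are assumptions for illustration only.
        tree[src["organization"]][src["document"]].append(
            (hit["_score"], src["section"])
        )
    # Within each document, put the highest-scoring section first.
    return {
        org: {doc: sorted(secs, reverse=True) for doc, secs in docs.items()}
        for org, docs in tree.items()
    }
```

The input is the `hits.hits` array of a normal search response, so this only works when the whole result set fits in one response, as suggested above.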


(John Coggeshall) #3

I'd prefer not to do any organization if I can avoid it on the app side as there's a possibility the limited document set size could significantly grow in the future.

Can you give me an example of what a collate would look like? I believe I'm already storing the fields we're discussing, but I haven't had any success getting results back in the shape I'm looking for. Here's a sample indexed document. I noticed that I'm not actually storing a document ID right now, but I can easily add that.

{
  "_index": "myindex",
  "_type": "my-document",
  "_id": "AVndHOxU8lLrpGXmBZnK",
  "_score": 6.2197213,
  "_source": {
    "institution_name_en": "My Institution",
    "is_active": 1,
    "flags": "graduate",
    "region_id": 1,
    "created_at": "2017-01-26",
    "title": "Article 3 - Things You Need to Know",
    "institution_id": 9,
    "filename": "238a77972159ad99bafdf9087a1d5771.pdf",
    "updated_at": "2017-01-26",
    "attachment": {
      "date": "2008-02-22T15:23:16Z",
      "content_type": "application/pdf",
      "language": "en",
      "content": "omitted content",
      "content_length": 5336
    },
    "agreement_expires": "2010-08-31",
    "agreement_name": "TA / RA Agreement",
    "region_name_en": "Ontario",
    "is_current": 1,
    "md5": "03d282d9ddca72a04835a81e8cdc4084"
  },
  "highlight": {
    "attachment.content": [
      "omitted"
    ]
  }
}

Here's an example query of how I'm pulling documents down now:

GET myindex/_search
{
    "_source" : {
      "excludes": [
        "data"
      ]
    },
    "from" : 0,
    "size" : 50,
    "query": {
      "bool": {
        "must" : {
          "match" : {
            "attachment.content" : "check off"
          }
        },
        "filter": {
          "bool": {
            "should" : [
              {
                "terms" : {
                  "institution_id": [63,1]
                }
              },
              {
                "terms" : {
                  "flags" : ["graduate", "part-time"]
                }
              }
            ],
            "must" : [
              {
                "terms" : {
                  "region_id" : [1]
                }
              },
              {
                "term" : {
                  "is_current" : 1
                }
              },
              {
                "term" : {
                  "is_active" : 1
                }
              }
            ]
          }
        }
      }
    },
    "highlight" : {
      "fields" : {
        "attachment.content" : {},
        "title" : {}
      }
    }
}

(Michael Sander) #4

What's your mapping? Also, how do you want the results to be displayed if one organization has 10 sections that match the query with low relevance, but another organization has 2 sections that match the query with high relevance? Which organization gets listed first?


(John Coggeshall) #5

My mapping is below.

As for how results are displayed: ideally the most relevant document comes first, with the most relevant sections of that document, grouped by organization. So in your specific case, the organization with 2 highly relevant sections ranks above the one with 10 less relevant sections. It's a little messy, because Organization A would be shown first for having the document/section with the highest relevancy, but since we show all of an organization's documents together, less relevant results from Organization A will also appear before we move on to the next organization, which had the second most relevant result in one of its documents, and so on.

Ideally we would rank documents/sections purely by relevancy, then within each organization order the documents by their relevancy and limit them to 3-4 results, then return the results grouped by organization.

{
  "myindex": {
    "mappings": {
      "mydocument": {
        "properties": {
          "agreement_expires": {
            "type": "date"
          },
          "agreement_name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "agreement_start": {
            "type": "date"
          },
          "attachment": {
            "properties": {
              "author": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "content": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "content_length": {
                "type": "long"
              },
              "content_type": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "date": {
                "type": "date"
              },
              "keywords": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "language": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "title": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "created_at": {
            "type": "date"
          },
          "data": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "filename": {
            "type": "text"
          },
          "flags": {
            "type": "text"
          },
          "institution_id": {
            "type": "keyword"
          },
          "institution_name_en": {
            "type": "text"
          },
          "institution_name_fr": {
            "type": "text"
          },
          "is_active": {
            "type": "boolean"
          },
          "is_current": {
            "type": "long"
          },
          "md5": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "region_id": {
            "type": "keyword"
          },
          "region_name_en": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "region_name_fr": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "title": {
            "type": "text"
          },
          "updated_at": {
            "type": "date"
          }
        }
      }
    }
  }
}

(John Coggeshall) #6

To be a bit more clear...

Organization A
    Highest Matched Document Overall, based on sections
        Highest Matched Section #1
        Highest Matched Section #2
    2nd Highest Matched Document in Organization based on sections
    3rd Highest Matched Document in Organization
Organization B
    2nd Highest Matched Document Overall, based on sections
        Highest Matched Section #1
        Highest Matched Section #2
    2nd Highest Matched Document in Organization based on sections

That's the ideal situation: organizations are ordered by overall relevancy, and the documents returned for each organization are ordered by their relevancy within that organization. Since this would be very difficult to implement, I'm willing to compromise as long as the results still make sense.
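
The ordering described above can be sketched client-side over flat section hits. This is only an illustration, assuming the `institution_id`, `agreement_name`, and `title` fields from the sample document stand in for organization, document, and section, and assuming a document's relevancy is its best section score (that choice is an assumption, as are the function name and the limits):

```python
from collections import defaultdict


def rank_groups(hits, max_docs=3, max_sections=2):
    """Return [(org, [(doc, [(score, section), ...]), ...]), ...] in display order."""
    by_org = defaultdict(lambda: defaultdict(list))
    for hit in hits:
        src = hit["_source"]
        by_org[src["institution_id"]][src["agreement_name"]].append(
            (hit["_score"], src["title"])
        )

    ranked = []
    for org, docs in by_org.items():
        doc_list = []
        for doc, sections in docs.items():
            # Best-scoring sections first, trimmed to the per-document limit.
            sections.sort(key=lambda s: s[0], reverse=True)
            doc_list.append((doc, sections[:max_sections]))
        # Assumption: a document is as relevant as its best section.
        doc_list.sort(key=lambda d: d[1][0][0], reverse=True)
        ranked.append((org, doc_list[:max_docs]))

    # An organization is ordered by the score of its best document.
    ranked.sort(key=lambda o: o[1][0][1][0][0], reverse=True)
    return ranked
```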


(Michael Sander) #7

Yes, that's a pretty unique way of sorting results. However, if you only have 1000 documents, you could very easily sort them on the client. 1000 docs is very little. Even 100,000 docs could probably be done with no noticeable delay.

A more "correct" way to do it is to change your mapping to look more like your results, and make them nested. Maybe something like this:

{
  "myindex": {
    "mappings": {
      "myorganization": {
        "properties": {
          "mydocument": {
            "type": "nested",
            "properties": {
              "mysection": {
                "type": "nested",
                "properties": {
...

Nested types are a bit more on the advanced side of Elasticsearch, so I'd read the docs carefully.
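
As a rough sketch of how such a mapping might be queried, the request body below wraps one `nested` query inside another and asks for `inner_hits` at both levels, so each matching organization reports which documents and sections matched. The paths follow the mapping sketch above; the `content` field under `mysection` and the function name are assumptions, and multi-level inner hits should be verified against the exact ES 5 release in use:

```python
def nested_section_query(text, size=10):
    """Build a search body for a two-level nested mapping (organization > document > section)."""
    section_query = {
        "nested": {
            "path": "mydocument.mysection",
            "score_mode": "max",   # a document scores as its best section
            "inner_hits": {},      # report the matching sections
            # Hypothetical field: the section's extracted PDF text.
            "query": {"match": {"mydocument.mysection.content": text}},
        }
    }
    return {
        "size": size,
        "query": {
            "nested": {
                "path": "mydocument",
                "score_mode": "max",  # an organization scores as its best document
                "inner_hits": {},     # report the matching documents
                "query": section_query,
            }
        },
    }
```

The returned dict can be sent as the JSON body of a `_search` request.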


(John Coggeshall) #8

Thanks for the help so far! I understand I could probably do this more easily client-side, but I'm really interested in learning here too, if you will bear with me.

So I've re-created my index as you described. My mapping is below, but I'm not sure how best to get ES to return results organized as described. Can I return results collapsed on institution, which in turn contain collapsed sections of documents?

{
  "myindex": {
    "mappings": {
      "institution": {
        "properties": {
          "document": {
            "type": "nested",
            "properties": {
              "expiration_date": {
                "type": "date"
              },
              "flags": {
                "type": "text"
              },
              "is_active": {
                "type": "boolean"
              },
              "is_current": {
                "type": "boolean"
              },
              "name": {
                "type": "text"
              },
              "section": {
                "type": "nested",
                "properties": {
                  "created_at": {
                    "type": "date"
                  },
                  "filename": {
                    "type": "text"
                  },
                  "fingerprint": {
                    "type": "text"
                  },
                  "flags": {
                    "type": "text"
                  },
                  "is_active": {
                    "type": "boolean"
                  },
                  "name": {
                    "type": "text"
                  },
                  "updated_at": {
                    "type": "date"
                  }
                }
              },
              "start_date": {
                "type": "date"
              }
            }
          },
          "id": {
            "type": "keyword"
          },
          "institution": {
            "properties": {
              "document": {
                "properties": {
                  "expiration_date": {
                    "type": "date"
                  },
                  "flags": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "is_active": {
                    "type": "long"
                  },
                  "is_current": {
                    "type": "long"
                  },
                  "name": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "section": {
                    "properties": {
                      "created_at": {
                        "type": "date"
                      },
                      "data": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "file": {
                        "properties": {
                          "content": {
                            "type": "text",
                            "fields": {
                              "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                              }
                            }
                          }
                        }
                      },
                      "filename": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "fingerprint": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "flags": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "is_active": {
                        "type": "long"
                      },
                      "name": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "updated_at": {
                        "type": "date"
                      }
                    }
                  }
                }
              },
              "id": {
                "type": "long"
              },
              "name_en": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "name_fr": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "region_id": {
                "type": "long"
              }
            }
          },
          "name_en": {
            "type": "text"
          },
          "name_fr": {
            "type": "text"
          },
          "region_id": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
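
One thing worth checking in a mapping like the one above is which object paths are actually typed as nested. The mapping lists `document` and `document.section` as nested, but it also contains a second `institution.document.section` tree of plain objects, and that is where `file.content` lives. One possible explanation (an assumption, since the indexed JSON isn't shown) is that the documents were indexed under an extra `institution` wrapper key, so dynamic mapping created the second tree. A small sketch, with a hypothetical helper name, that lists object paths and whether each is nested:

```python
def nested_paths(properties, prefix=""):
    """Yield (path, is_nested) for every object field under a mapping's "properties"."""
    for name, spec in properties.items():
        if "properties" in spec:
            path = prefix + name
            # Plain objects have no "type" key (or "object"); only "nested" counts.
            yield path, spec.get("type") == "nested"
            yield from nested_paths(spec["properties"], path + ".")
```

Calling `dict(nested_paths(mapping["myindex"]["mappings"]["institution"]["properties"]))` on the response of `GET myindex/_mapping` would show `institution.document` and `institution.document.section` as not nested, which nested queries cannot target.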

(Michael Sander) #9

Hmmm... this is a tricky problem. Abstractly, it seems like you want to search for Organizations, but have the organizations ranked by documents, which in turn are ranked by best sections. Off the top of my head, I don't have a good answer for you, but take a look at the scoring functions to see if they can help. My guess is that you'll only be able to go two levels deep, i.e., get the best organizations, ranked by documents. Unclear whether you'll be able to get ES to rank the sections too.

I feel that, due to the nested nature of the data, aggregations may also help. Perhaps you can run a search on Organizations, ranked solely on the sections, then use an aggregation to determine the nesting of the sections under the documents... but I'm mostly just thinking aloud here; this may be a rabbit hole.

Apologies for not giving a conclusive answer for you, but this is a pretty atypical problem. Good luck!
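
For the aggregation idea, one commonly used pattern is a `terms` aggregation per grouping level, ordered by a `max` of `_score`, with a `top_hits` sub-aggregation for the sections. The sketch below builds such a request body against the flat one-section-per-hit index from earlier in the thread, using its `institution_id` and `agreement_name.keyword` fields. The function name and aggregation names are made up, and the script key is an assumption: early ES 5 releases use `inline`, later ones `source`.

```python
def grouped_search_body(text, orgs=10, docs_per_org=3, sections_per_doc=2):
    """Build a search body that groups section hits by organization, then document."""
    # Metric used only to order buckets by their best-matching section.
    best = {"max": {"script": {"inline": "_score"}}}  # "source" on newer versions
    return {
        "size": 0,  # only the aggregation tree is needed
        "query": {"match": {"attachment.content": text}},
        "aggs": {
            "by_org": {
                "terms": {
                    "field": "institution_id",
                    "size": orgs,
                    "order": {"best": "desc"},
                },
                "aggs": {
                    "best": best,
                    "by_doc": {
                        "terms": {
                            "field": "agreement_name.keyword",
                            "size": docs_per_org,
                            "order": {"best": "desc"},
                        },
                        "aggs": {
                            "best": best,
                            "sections": {
                                "top_hits": {
                                    "size": sections_per_doc,
                                    "_source": {"includes": ["title", "filename"]},
                                }
                            },
                        },
                    },
                },
            }
        },
    }
```

The response buckets already come back in the Organization -> Document -> Section shape, so no regrouping is needed on the app side, at the cost of a heavier query than a plain search.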


(John Coggeshall) #10

Let me make sure I've been clear because your answer confused me a bit.

I don't care about ranking organizations, only ranking documents. Think of it this way: I want to search two different organizations' documents for the same subject matter and I want to compare the two. What I'm interested in here is just making sure I pull the relevant documents out of each organization (and can identify one organization's documents from another) -- I don't care which one I see first.

Does that help at all?
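
If only the documents need ranking, field collapsing (available from ES 5.3) may be enough: collapse the section hits on a document identifier, keep the top sections as inner hits, and label each result with its organization on display. A sketch of such a request body, assuming a hypothetical `document_id` keyword field is added to each section (the thread notes no document ID is stored yet) and using a made-up function name:

```python
def collapsed_search_body(text, institution_ids, sections_per_doc=3, size=20):
    """Build a search body returning one hit per document, best section first."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {"match": {"attachment.content": text}},
                # Restrict to the organizations being compared.
                "filter": {"terms": {"institution_id": institution_ids}},
            }
        },
        "collapse": {
            "field": "document_id",  # hypothetical keyword field, one value per document
            "inner_hits": {"name": "top_sections", "size": sections_per_doc},
        },
    }
```

Each returned hit is the best section of one document, with its sibling sections under `inner_hits.top_sections`, and its `_source` still carries `institution_id` for telling one organization's documents from another's.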

