Need advice on how to organize data / schema

I need to index book texts (mainly extracted from page images via OCR) and run full-text search over those pages.

TL;DR: Should I

  • index each part (page) as its own document,
  • or index each record (book) as one document with a dynamic field per page?

Records are books, and parts are pages that contain text.

Record A
   page1
   page2
   ...
   pageN
Record B
   page1
   page2
   ...
   pageN

I need to search the text field (value), aggregate the hits by record_id, and display the best N hits, sorted by score descending.

My approach so far is to index each page as a separate document and then aggregate by record_id.

Current mapping
{
  "mappings": {
    "properties": {
      "part_id": {
        "type": "integer"
      },
      "record_id": {
        "type": "integer"
      },
      "ri_published": {
        "type": "boolean"
      },
      "rp_visible": {
        "type": "boolean"
      },
      "value": {
        "type": "text"
      }
    }
  }
}
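
For reference, a single page document under this mapping would look roughly like the following. The field names come from the mapping; the values are made up purely for illustration.

```json
{
  "part_id": 1,
  "record_id": 100,
  "ri_published": true,
  "rp_visible": true,
  "value": "OCR text of page 1 of record 100"
}
```

Every page of the same book repeats the same record_id, which is what the aggregations group on.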

Requirements

  • sort results by relevance/score, descending (phrase, Lucene dismax, ...)
  • exclude parts (part_id) that are not visible (rp_visible=false)
  • page through the results
  • (later) filter by some metadata (not yet in the index); this filter will only be applied sometimes
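
As a sketch of the visibility requirement: I assume it can be expressed as a filter clause next to the phrase match, so that hidden pages are dropped without affecting the score. This assumes rp_visible is set on every page document; the aggregations are left out here to keep the example short.

```json
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "value": {
              "query": "john wayne"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "rp_visible": true
          }
        }
      ]
    }
  }
}
```

The later metadata filters would presumably go into the same filter array as additional clauses.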

Queries I have used so far

Query 1 (older)
{
  "_source": {
    "excludes": [
      "value"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "value": {
              "query": "john wayne"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "by_record": {
      "terms": {
        "field": "record_id",
        "order": {
          "by_score_max": "desc"
        }
      },
      "aggs": {
        "by_top_records": {
          "top_hits": {
            "size": 3,
            "highlight": {
              "pre_tags": [
                "[ii]"
              ],
              "post_tags": [
                "[/ii]"
              ],
              "fields": {
                "value": {
                  "number_of_fragments": 5,
                  "fragment_size": 100,
                  "type": "unified",
                  "order": "none",
                  "no_match_size": 100
                }
              }
            },
            "_source": {
              "excludes": [
                "value"
              ]
            }
          }
        },
        "by_score_max": {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        }
      }
    }
  }
}
Query 2 (composite), currently in use
{
  "_source": {
    "excludes": [
      "value"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "value": {
              "query": "john wayne"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "by_record": {
      "composite": {
        "size": 200,
        "sources": {
          "record_id": {
            "terms": {
              "field": "record_id"
            }
          }
        }
      },
      "aggs": {
        "by_record_top": {
          "top_hits": {
            "size": 3,
            "highlight": {
              "pre_tags": [
                "[ii]"
              ],
              "post_tags": [
                "[/ii]"
              ],
              "fields": {
                "value": {
                  "number_of_fragments": 2,
                  "fragment_size": 100,
                  "type": "unified",
                  "order": "none",
                  "no_match_size": 200
                }
              }
            },
            "_source": {
              "excludes": [
                "value"
              ]
            }
          }
        },
        "by_record_max": {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        },
        "agg_bucket_limit": {
          "bucket_sort": {
            "from": 0,
            "size": 20,
            "sort": {
              "by_record_max": {
                "order": "desc"
              }
            }
          }
        }
      }
    }
  }
}
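
For paging, my assumption is that I only change from in the bucket_sort part of Query 2. Below is a trimmed sketch of the request for the second page of 20 records (top_hits and highlighting omitted). As far as I understand, bucket_sort only sorts the buckets returned by the current composite page (200 here), so this is probably not a true global ordering once there are more matching records than that.

```json
{
  "query": {
    "match_phrase": {
      "value": {
        "query": "john wayne"
      }
    }
  },
  "size": 0,
  "aggs": {
    "by_record": {
      "composite": {
        "size": 200,
        "sources": [
          {
            "record_id": {
              "terms": {
                "field": "record_id"
              }
            }
          }
        ]
      },
      "aggs": {
        "by_record_max": {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        },
        "agg_bucket_limit": {
          "bucket_sort": {
            "from": 20,
            "size": 20,
            "sort": [
              {
                "by_record_max": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
```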
