Hi,
With that much structure in the title (even if there are variants) describing properties of the actual document, I'd probably first try to extract as much information as possible in a pre-processing step during data preparation and cleaning, and index those pieces of information in separate fields (e.g. "year_of_decision", "type_of_decision", "page" etc.). That way this meta information can be queried much more precisely later.
If this isn't possible for some reason during data preparation, maybe an ingest processor like the Grok Processor could be used to extract certain fields from the title.
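As a rough sketch of the ingest-side option, the pipeline body below uses a single grok processor on the "title" field. It is written as a Python dict only so it can be printed as JSON; the title layout (category-year-number-type) and the target field names are assumptions taken from the example in this post, and the pattern would need more alternatives to cover real title variants.

```python
import json

# Assumed title layout: <category>-<year>-<number>-<type>, e.g. "Hr-2006-12314-B".
# The target field names (category, year, page, decision_type) are illustrative.
pipeline = {
    "description": "Extract metadata fields from the title (illustrative sketch)",
    "processors": [
        {
            "grok": {
                "field": "title",
                "patterns": [
                    # ":int" asks grok to store the captured value as an integer
                    "%{WORD:category}-%{INT:year:int}-%{INT:page:int}-%{WORD:decision_type}"
                ],
            }
        }
    ],
}

# Print the JSON body that would be sent when creating the ingest pipeline.
print(json.dumps(pipeline, indent=2))
```

The printed JSON is what you would send as the body when creating the pipeline (PUT _ingest/pipeline/ followed by a name of your choice).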
Later, the search application logic (I assume there is going to be some website for searching with its own application layer) can try to extract the same kind of information from the user input (e.g. using regular expressions or some more sophisticated approach) and generate a more complex query targeting the metadata fields.
As an example: if you can pre-process a document with title "Hr-2006-12314-B" to something like:
{
  "title" : "Hr-2006-12314-B",
  "year": 2006,
  "category" : "Some_Hr_decision_whatever_that_means",
  "page" : 12314,
  "decision_type" : "B",
  "body" : "The full document text goes somewhere as well of course"
}
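If the extraction happens during data preparation, a small regular expression may already be enough. The following is a minimal sketch, assuming every title follows the category-year-number-type layout of the example; the function name and the regex are hypothetical, and real titles with variants would need extra patterns.

```python
import re

# Assumed title layout: <category>-<year>-<number>-<type>, e.g. "Hr-2006-12314-B".
TITLE_RE = re.compile(
    r"^(?P<category>[A-Za-z]+)-(?P<year>\d{4})-(?P<page>\d+)-(?P<decision_type>[A-Za-z])$"
)

def title_to_fields(title):
    """Return metadata fields parsed from the title, or None if it does not match."""
    m = TITLE_RE.match(title.strip())
    if m is None:
        return None
    return {
        "title": title,
        "year": int(m.group("year")),
        "category": m.group("category"),
        "page": int(m.group("page")),
        "decision_type": m.group("decision_type").upper(),
    }

print(title_to_fields("Hr-2006-12314-B"))
```

Titles that do not match return None, so they can still be indexed with just "title" and "body" and handled separately.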
And if you later get a user input like "Hr. 2006 23748", you can probably write some simple code that extracts the different parts and maps them to the correct fields. You could then send a query like:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "year": 2006
          }
        },
        {
          "match": {
            "category": "HR-decision"
          }
        }
      ],
      "should": {
        "match": {
          "page": 23748
        }
      }
    }
  }
}
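The "simple code" on the application side could look roughly like the sketch below. It assumes user input starts with a category token, followed by a four-digit year and optionally a number, separated by spaces, dots or hyphens. The regex, the CATEGORY_MAP lookup and the fallback to a plain match on "body" are all illustrative assumptions, not something the index requires.

```python
import json
import re

# Hypothetical pattern: category token, four-digit year, optional number,
# separated by any mix of whitespace, dots and hyphens.
USER_INPUT_RE = re.compile(
    r"^\s*(?P<category>[A-Za-z]+)[\s.\-]+(?P<year>\d{4})(?!\d)(?:[\s.\-]+(?P<page>\d+))?"
)

# Hypothetical mapping from what users type to the indexed category value.
CATEGORY_MAP = {"hr": "HR-decision"}

def build_query(user_input):
    """Build a query body targeting the metadata fields, if the input can be parsed."""
    m = USER_INPUT_RE.match(user_input)
    if m is None:
        # Nothing recognizable: fall back to a plain full-text query on the body.
        return {"query": {"match": {"body": user_input}}}
    token = m.group("category")
    bool_query = {
        "must": [
            {"match": {"year": int(m.group("year"))}},
            {"match": {"category": CATEGORY_MAP.get(token.lower(), token)}},
        ]
    }
    if m.group("page"):
        # The number is optional, so it only boosts matching documents.
        bool_query["should"] = [{"match": {"page": int(m.group("page"))}}]
    return {"query": {"bool": bool_query}}

print(json.dumps(build_query("Hr. 2006 23748"), indent=2))
```

Keeping the number in "should" rather than "must" means a mistyped number still returns the other decisions from that year and category, just ranked lower.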
This approach gives you much more flexibility, allows for easier grouping and sorting, and enables queries like "give me all decisions of type A from year 2005". The whole thing can of course be combined with a full-text search component, e.g. on a body text field.
I would investigate these possibilities first before starting to write complicated tokenizers, synonyms etc., because I think what you can really do here is extract some meta-information from the titles, and later from the user queries, with little effort.
Hope this helps as a rough sketch.