How would you suggest creating these synonyms?


I'm configuring Elasticsearch for a website that gives free access to court decisions, and I need some help creating synonyms for the different ways of writing the same page title (court decision name).

E.g. Rt-2007-1051 = Rt 2007 1051 = Rt. 2007 s. 1051 (the first form is preferred)

In reality, many of them have two names combined, like this: "HR-2007-1178-U - Rt-2007-1051"

Since the website has more than 40,000 court decisions with similar but different page titles, I hope it's possible to make a rule.

I'm grateful for any suggestions to improve the search!

I wouldn't try to tackle this problem with synonyms first: the number of expansions looks really large, and because of the numbers involved I wouldn't expect it to be easy to come up with an exhaustive list.
Instead, I would think about how different analysis strategies might help. For this there is no way around taking a close look at the data and working out whether the input can be tokenized in different ways in different fields. Maybe even a custom pattern tokenizer would help.
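
To make the idea of pattern-based normalization a bit more concrete, here is a minimal Python sketch that rewrites the Rt variants mentioned in the question to the preferred form. It runs outside Elasticsearch, and the regular expression and the function name `normalize_rt_references` are my own assumptions based only on the three variants given so far, so treat it as a starting point for looking at the real data rather than a finished rule.

```python
import re

# Assumed pattern, covering only the variants mentioned in this thread:
# "Rt-2007-1051", "Rt 2007 1051", "Rt. 2007 s. 1051", "Rt. 2007 side 1051".
RT_REFERENCE = re.compile(
    r"\bRt\.?[\s-]*(\d{4})[\s-]+(?:(?:side|s\.?)\s*)?(\d+)\b",
    re.IGNORECASE,
)


def normalize_rt_references(text: str) -> str:
    """Rewrite every recognised Rt reference in `text` to Rt-YEAR-PAGE."""
    return RT_REFERENCE.sub(lambda m: f"Rt-{m.group(1)}-{m.group(2)}", text)


if __name__ == "__main__":
    for sample in ("Rt 2007 1051", "Rt. 2007 s. 1051", "Rt. 2007 side 1051"):
        print(sample, "->", normalize_rt_references(sample))
```

The same normalization could be applied both to titles before indexing and to user input before querying, so that both sides end up in one canonical spelling.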

Can you give a few more examples of the patterns in the title? Are there at least a few rules that most of these court decision names follow? How do you expect people to search for them (e.g. are users likely to enter full titles like "Rt-2007-1051" or do they also have to be able to find partial matches?)

Thank you so much for engaging with the question!

Basically we have two variants of title names, and sometimes both are combined in one title.

For Rt names, the pattern is four digits representing the year (e.g. Rt-2007-), followed by digits representing the page number in a book (e.g. Rt-2007-1051).

For HR names, the pattern is four digits representing the year (e.g. HR-2018-), then digits representing the case number that year (e.g. HR-2018-884-), and finally a letter representing the type of decision (e.g. HR-2018-884-A).

One of the challenges is that there are several ways to type Rt names: Rt-2007-1051 is the same as Rt. 2007 side 1051, which is the same as Rt. 2007 s. 1051 (side = page).

Sometimes people know the Rt and HR names and use them in the search. Other times they just search freely to find out how the Supreme Court interprets a legal question.

The project is to create a database with all supreme court decisions for our country and make it searchable. The aim is that the law should be available for the people for free.

To give you further insight into the structure/pattern, I need to give you access to the website. I cannot reveal it here, but I will send you a message with credentials.



With that much structure in the title (even if there are variants) describing properties of the actual document, I'd probably first try to extract as much information as possible in a pre-processing step during data preparation and cleaning, and index those pieces of information in separate fields (e.g. "year_of_decision", "type_of_decision", "page" etc...). That way this meta information can be queried much more precisely later.
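
Such a pre-processing step could be sketched in Python roughly as follows. This assumes titles follow the two patterns described above (HR-YEAR-CASE-LETTER and Rt-YEAR-PAGE, possibly combined in one title); the function name and the field names (`hr_year`, `hr_case_number`, `decision_type`, `rt_year`, `rt_page`) are made up for illustration and not taken from any existing mapping.

```python
import re
from typing import Any

# Assumed title patterns, based on the description earlier in the thread.
HR_NAME = re.compile(r"\bHR-(\d{4})-(\d+)-([A-Z])\b", re.IGNORECASE)
RT_NAME = re.compile(r"\bRt-(\d{4})-(\d+)\b", re.IGNORECASE)


def extract_title_fields(title: str) -> dict[str, Any]:
    """Split a decision title into separate metadata fields (hypothetical names)."""
    fields: dict[str, Any] = {"title": title}
    hr = HR_NAME.search(title)
    if hr:
        fields["hr_year"] = int(hr.group(1))
        fields["hr_case_number"] = int(hr.group(2))
        fields["decision_type"] = hr.group(3).upper()
    rt = RT_NAME.search(title)
    if rt:
        fields["rt_year"] = int(rt.group(1))
        fields["rt_page"] = int(rt.group(2))
    return fields


if __name__ == "__main__":
    print(extract_title_fields("HR-2007-1178-U - Rt-2007-1051"))
```

Titles that match neither pattern simply keep only the `title` field, so they can be reviewed separately.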

If this isn't possible for some reason during data preparation, maybe an ingest processor like the Grok Processor could be used to extract certain fields from the title.
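
As a rough idea of what that could look like, here is an ingest pipeline body with a single grok processor, written as a Python dict so it can be printed as JSON. The target field names (`year`, `case_number`, `decision_type`) and the pattern are assumptions covering only the HR form described above; I have not tested this against real titles, so check it with the simulate pipeline API before relying on it.

```python
import json

# Assumed pipeline body: one grok processor that pulls year, case number and
# decision type out of an HR-style title. Field names are illustrative only.
hr_title_pipeline = {
    "description": "Extract metadata from HR-style decision titles",
    "processors": [
        {
            "grok": {
                "field": "title",
                "patterns": [
                    "HR-%{INT:year:int}-%{INT:case_number:int}-%{WORD:decision_type}"
                ],
                # Leave documents untouched when the title does not match.
                "ignore_failure": True,
            }
        }
    ],
}

if __name__ == "__main__":
    # The printed JSON is what would be sent as the pipeline definition.
    print(json.dumps(hr_title_pipeline, indent=2))
```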

Later, the search application logic (I assume there is going to be some website for searching with its own application layer) can try to extract the same kind of information from the user input (e.g. using regular expressions or some more sophisticated approach) and generate a more complex query targeting the metadata fields.
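
A minimal sketch of that application-side step, in Python: it tries to recognise a decision reference in the user input and, if it finds one, builds a bool query against metadata fields. The field names `year`, `category` and `page`, the category values and the function name are assumptions for illustration; when nothing is recognised it returns `None` so the caller can fall back to an ordinary full-text query.

```python
import re
from typing import Any, Optional

# Assumed input pattern: "HR"/"Rt" prefix, a four-digit year, then a number,
# with optional dots, dashes, spaces and "s."/"side" in between.
USER_REFERENCE = re.compile(
    r"^\s*(HR|Rt)\.?[\s-]*(\d{4})[\s-]+(?:(?:side|s\.?)\s*)?(\d+)",
    re.IGNORECASE,
)


def build_reference_query(user_input: str) -> Optional[dict[str, Any]]:
    """Return a bool query for a recognised reference, or None otherwise."""
    m = USER_REFERENCE.match(user_input)
    if m is None:
        return None
    prefix, year, number = m.groups()
    # Hypothetical category values; use whatever the index actually stores.
    category = "HR-decision" if prefix.lower() == "hr" else "Rt-decision"
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"year": int(year)}},
                    {"match": {"category": category}},
                ],
                "should": [{"match": {"page": int(number)}}],
            }
        }
    }
```

Keeping the number in a `should` clause means a mistyped page or case number lowers the score instead of excluding every decision from that year.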

As an example, suppose you can pre-process a document with the title "Hr-2006-12314-B" into something like:

    {
      "title": "Hr-2006-12314-B",
      "year": 2006,
      "category": "Some_Hr_decision_whatever_that_means",
      "page": 12314,
      "decision_type": "B",
      "body": "The full document text goes somewhere as well of course"
    }

If you later get a user input like "Hr. 2006 23748", you can probably write some simple code that extracts the different parts and maps each one to the correct field. You could then send a query like:

    {
      "query": {
        "bool": {
          "must": [
            { "match": { "year": 2006 } },
            { "match": { "category": "HR-decision" } }
          ],
          "should": {
            "match": { "page": 23748 }
          }
        }
      }
    }

This gives you much more flexibility. It also allows for easier grouping and sorting, and for queries like "give me all decisions of type A from year 2005". The whole thing can of course be combined with a full-text search component, e.g. on a body text field.
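
For instance, the "type A from year 2005" request could be expressed as below. This is only a sketch: it assumes `decision_type` is mapped as a keyword field and `year` as a number, and that the document text lives in a field called `body`, none of which I know about your actual index.

```python
from typing import Any, Optional


def decisions_by_type_and_year(
    decision_type: str, year: int, text: Optional[str] = None
) -> dict[str, Any]:
    """Filter on metadata fields and optionally add a full-text clause."""
    bool_query: dict[str, Any] = {
        # Filter clauses restrict the result set without affecting scoring.
        "filter": [
            {"term": {"decision_type": decision_type}},
            {"term": {"year": year}},
        ]
    }
    if text:
        # The full-text part decides the ranking within the filtered set.
        bool_query["must"] = [{"match": {"body": text}}]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    print(decisions_by_type_and_year("A", 2005))
    print(decisions_by_type_and_year("A", 2005, text="negligence"))
```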

I would investigate these possibilities first before starting to write complicated tokenizers, synonyms etc... because I think what you really can do here is extract some meta-information from the titles and later from the user queries with little effort.

Hope this helps as a rough sketch.
