Elasticsearch as a Search Database for the HTS Database

We are looking to build a search tool for the HS codes database at https://hts.usitc.gov/current

Ideally, we want to search for queries like:

"men's blue shirt"

and find the HS code from the database. As of now, we only care about the code and description.

Considering that there are up to 9 indent levels for the description, I designed the mappings like so:

{
"demo_htsdata": {
"mappings": {
"doc": {
"properties": {
"hts_code": {
"type": "text"
},
"hts_description_0..9": {
"type": "text"
}
}
}
}
}
}
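The `hts_description_0..9` entry above is shorthand for ten separate text fields. A minimal sketch of how the full mapping body could be generated, assuming the 6.x-style `doc` type shown above and that the range 0..9 is inclusive; `build_mapping` and `MAX_INDENT_LEVELS` are hypothetical names used only for illustration:

```python
import json

# Assumption: "0..9" means ten fields, hts_description_0 through hts_description_9.
MAX_INDENT_LEVELS = 10


def build_mapping(levels: int = MAX_INDENT_LEVELS) -> dict:
    """Build a mapping body with hts_code plus one text field per indent level."""
    properties = {"hts_code": {"type": "text"}}
    for level in range(levels):
        properties[f"hts_description_{level}"] = {"type": "text"}
    # "doc" is the mapping type name used in the mapping shown above.
    return {"mappings": {"doc": {"properties": properties}}}


mapping = build_mapping()
print(json.dumps(mapping, indent=2))
```

The resulting dict could then be sent as the body when creating the index.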

I then populated the cluster with this sample data:

{
"hts_code": "6205.30.10.00",
"hts_description_0": "Men's or boys' shirts",
"hts_description_1": "Of man-made fibers:",
"hts_description_2": "Certified hand-loomed and folklore products (640)"
},
{
"hts_code": "6205.30.15.10",
"hts_description_0": "Men's or boys' shirts",
"hts_description_1": "Containing 36 percent or more by weight of wool or fine animal hair",
"hts_description_2": "Mens's"
},
{
"hts_code": "6205.30.15.20",
"hts_description_0": "Men's or boys' shirts",
"hts_description_1": "Containing 36 percent or more by weight of wool or fine animal hair",
"hts_description_2": "Boys'"
},
{
"hts_code": "6205.20",
"hts_description_0": "Men's or boys' shirts",
"hts_description_1": "Of cotton"
},
{
"hts_code": "0102.29.20.11",
"hts_description_0": "Live bovine animals",
"hts_description_1": "Cows imported specially for dairy purposes.",
"hts_description_2": "Weighing less than 90 kg each"
},
{
"hts_code": "6205.90.10.00",
"hts_description_0": "Men's or boys' shirts",
"hts_description_1": "Of other textile materials:",
"hts_description_2": "Containing 70 percent or more by weight of silk or silk waste"
},
{
"hts_code": "0102.29.20.12",
"hts_description_0": "Live bovine animals",
"hts_description_1": "Cows imported specially for dairy purposes.",
"hts_description_2": "Weighing 90 kg or more each"
},
{
"hts_code": "6205",
"hts_description_0": "Men's or boys' shirts"
},
{
"hts_code": "6205.90.10.00",
"hts_description_0": "Men's or boys' shirts",
"hts_description_1": "Of other textile materials:",
"hts_description_2": "Containing 70 percent or more by weight of silk or silk waste"
}

Now searching for

{
"query": {
"query_string": {
"query": "men's blue cotton shirt"
}
}
}

Yields a result of:

"hits": {
    "total": 7,
    "max_score": 1.7368788,
    "hits": [
        {
            "_index": "demo_htsdata_test_cow",
            "_type": "doc",
            "_id": "h91oPmUBRmE13omutgb1",
            "_score": 1.7368788,
            "_source": {
                "hts_code": "6205.20",
                "hts_description_0": "Men's or boys' shirts",
                "hts_description_1": "Of cotton"
            }
        },
        {
            "_index": "demo_htsdata_test_cow",
            "_type": "doc",
            "_id": "id1wPmUBRmE13omuXwZ_",
            "_score": 0.6548752,
            "_source": {
                "hts_code": "6205.90.10.00",
                "hts_description_0": "Men's or boys' shirts",
                "hts_description_1": "Of other textile materials:",
                "hts_description_2": "Containing 70 percent or more by weight of silk or silk waste"
            }
        },

This is a perfect result. We can take the first returned item and get the 6-digit HS code from it.
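A minimal sketch of that last step, assuming the search response has already been parsed into a Python dict shaped like the output above. The function names `six_digit_code` and `top_hit_code` are hypothetical, and the `response` literal is a trimmed copy of the first hit:

```python
import re
from typing import Optional


def six_digit_code(hts_code: str) -> Optional[str]:
    """Return the first six digits as 'NNNN.NN', or None if the code is shorter."""
    digits = re.sub(r"\D", "", hts_code)  # drop the dots, keep only digits
    if len(digits) < 6:
        return None  # e.g. a 4-digit heading such as "6205"
    return f"{digits[:4]}.{digits[4:6]}"


def top_hit_code(response: dict) -> Optional[str]:
    """Take the highest-scoring hit and reduce its hts_code to six digits."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return six_digit_code(hits[0]["_source"]["hts_code"])


# Trimmed copy of the first hit from the response above.
response = {
    "hits": {
        "total": 7,
        "max_score": 1.7368788,
        "hits": [
            {
                "_score": 1.7368788,
                "_source": {
                    "hts_code": "6205.20",
                    "hts_description_0": "Men's or boys' shirts",
                    "hts_description_1": "Of cotton",
                },
            }
        ],
    }
}

print(top_hit_code(response))
```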

However, when searching with:

{
  "query": {
    "query_string": {
      "query": "men's blue folklore shirt"
    }
  }
}

Yields:

 "hits": {
        "total": 7,
        "max_score": 0.6548752,
        "hits": [
            {
                "_index": "demo_htsdata_test_cow",
                "_type": "doc",
                "_id": "id1wPmUBRmE13omuXwZ_",
                "_score": 0.6548752,
                "_source": {
                    "hts_code": "6205.90.10.00",
                    "hts_description_0": "Men's or boys' shirts",
                    "hts_description_1": "Of other textile materials:",
                    "hts_description_2": "Containing 70 percent or more by weight of silk or silk waste"
                }
            },
            {
                "_index": "demo_htsdata_test_cow",
                "_type": "doc",
                "_id": "hN1RPmUBRmE13omunwZ7",
                "_score": 0.63465416,
                "_source": {
                    "hts_code": "6205.30.10.00",
                    "hts_description_0": "Men's or boys' shirts",
                    "hts_description_1": "Of man-made fibers:",
                    "hts_description_2": "Certified hand-loomed and folklore products (640)"
                }
            },

The database will be searched by non-technical users, who prefer free-form, Google-style searches and should not need to know the indentation level.

I am trying to understand:

  1. What is a better way to organise the data? Should I manually extract the keywords of each indent level and put them into a single field? Is there a better mapping strategy?
  2. How can the search yield better results? In the case of "men's blue folklore shirt", the folklore product had a lower _score than the first item in the list. More specific queries with more keywords do yield better results, but since the search terms come from the end users themselves, I do not think they will be verbose or specific.
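To make the first question concrete, here is a rough sketch of one way the "put it into a field" idea could look before indexing: the per-level descriptions are joined into one combined field. This is only an illustration of the option being asked about, not a recommendation; `flatten_descriptions` and the field name `hts_description_all` are hypothetical.

```python
def flatten_descriptions(doc: dict, max_levels: int = 10) -> dict:
    """Return a copy of doc with all indent-level descriptions joined into one field."""
    parts = []
    for level in range(max_levels):
        text = doc.get(f"hts_description_{level}")
        if text:
            # Trailing colons come from the HTS layout and carry no search value.
            parts.append(text.rstrip(":").strip())
    flat = dict(doc)  # shallow copy so the original document is left untouched
    flat["hts_description_all"] = " ".join(parts)  # hypothetical combined field
    return flat


# One of the sample documents from above.
sample = {
    "hts_code": "6205.30.10.00",
    "hts_description_0": "Men's or boys' shirts",
    "hts_description_1": "Of man-made fibers:",
    "hts_description_2": "Certified hand-loomed and folklore products (640)",
}

print(flatten_descriptions(sample)["hts_description_all"])
```

With a combined field like this, a user query would not need to name an indent level, though whether it improves ranking would have to be tested against the real data.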

I want to allow for end users who are not verbose with their search terms. How can I make full use of Elasticsearch's capabilities for this use case?
