Classification of long text

Hey,

I am using the ES machine learning classification feature to try to predict the category of a certain text. I have based my approach on this blog post, but I believe I'm doing something wrong when preparing the data.

I'm getting a really bad overall accuracy of ~0.125 for 9 different categories, so it's probably only hitting by chance (1/9 ≈ 0.11).

My suspicion is that my tokenization and normalization aren't working properly, or not at all. Is there a way to test this against existing index data?

My data looks like the following:

[
    {
        "name": "Tomato",
        "category": "fruit",
        "description": "The tomato is the edible berry of the plant Solanum lycopersicum, commonly known as the tomato plant. The species originated in western South America, Mexico, and Central America. The Nahuatl word tomatl gave rise to the Spanish word tomate, from which the English word tomato derives. Its domestication and use as a cultivated food may have originated with the indigenous peoples of Mexico"
    },
    {
        "name": "Potato",
        "category": "vegetable",
        "description": "The potato is a starchy root vegetable native to the Americas that is consumed as a staple food in many parts of the world. Potatoes are tubers of the plant Solanum tuberosum, a perennial in the nightshade family Solanaceae."
    },
    {
        "name": "Baguette",
        "category": "bread",
        "description": "A baguette is a long, thin type of bread of French origin that is commonly made from basic lean dough (the dough, not the shape, is defined by French law). It is distinguishable by its length and crisp crust. "
    }
]

A short name, a category (which I want to predict), and a long description of up to 5,000 characters (the above is just an example to illustrate the structure of the data). My test dataset contains around 3,500 documents.

The mapping of my index is this:

{
	"mappings": {
		"properties": {
			"description": {
				"type": "text",
				"analyzer":"english",
				"fielddata": true
			},
			"category": {
				"type": "text",
				"analyzer":"english",
				"fields": {
					"keyword": {
						"type": "keyword",
						"ignore_above": 512
					}
				}
			},
			"name": {
				"type": "text",
				"analyzer":"english",
				"fielddata": true
			}
		}
	}
}

I have also tried using the .keyword subfields for description and name, with no luck.

Any advice is appreciated.

Thanks,
Michel

Hi!

Could you post the rest of your code please?
How much data are you using / how many examples are you trying this with?

Happy to take a look for you!

P.S. You can also test the way the term vectorizer is applied to your data with the _termvectors API. Something like:

GET /my-index-000001/_termvectors/1
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

That can show you whether the analyzers are behaving as expected.
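
P.P.S. Another option is the _analyze API, which shows directly which tokens an analyzer produces for a given text. A minimal sketch, reusing the placeholder index name from above and assuming your text field is called description:

GET /my-index-000001/_analyze
{
  "field": "description",
  "text": "The tomato is the edible berry of the plant Solanum lycopersicum."
}

If the english analyzer is applied, you should see stemmed tokens (e.g. berry becomes berri) and stopwords like "the" and "is" dropped.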

Here is the JSON representation of the classification job; I hope that's what you meant by "the rest of my code".

{
    "description": "",
    "source": {
      "index": "entry_filtered*",
      "query": {
        "match_all": {}
      }
    },
    "dest": {
      "index": "approach5"
    },
    "analyzed_fields": {
      "includes": [
        "description",
        "category.keyword",
        "name"
      ]
    },
    "analysis": {
      "classification": {
        "dependent_variable": "category.keyword",
        "num_top_feature_importance_values": 0,
        "training_percent": 80,
        "randomize_seed": 42,
        "num_top_classes": -1
      }
    },
    "model_memory_limit": "42mb",
    "max_num_threads": 1
  }
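
For reference, I believe this config corresponds to the data frame analytics API roughly like so (the job ID is made up):

PUT _ml/data_frame/analytics/my-classification-job
{
  "source": { "index": "entry_filtered*" },
  "dest": { "index": "approach5" },
  "analyzed_fields": {
    "includes": ["description", "category.keyword", "name"]
  },
  "analysis": {
    "classification": {
      "dependent_variable": "category.keyword",
      "training_percent": 80
    }
  }
}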

It's a total of 10,897 documents (the number in my original post was wrong; I have since increased the size of the dataset).

I created the index with PUT entry_filtered (with the mapping above) and populated it via the _reindex API, querying data from a different index.
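
The reindex call was essentially this (the source index name here is a placeholder):

POST _reindex
{
  "source": {
    "index": "my-source-index",
    "query": { "match_all": {} }
  },
  "dest": { "index": "entry_filtered" }
}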

The _termvectors response looks good to me:

	"term_vectors": {
		"description": {
			"field_statistics": {
				"sum_doc_freq": 542797,
				"doc_count": 10897,
				"sum_ttf": 647117
			},
			"terms": {
				"5": {
					"doc_freq": 162,
					"ttf": 176,
					"term_freq": 1,
					"tokens": [
						{
							"position": 1,
							"start_offset": 5,
							"end_offset": 6
						}
					]
				},
                ...

Thank you!


Okay, so I've dived into this blog to see how it works.

First of all, can you confirm you are running the same queries and Python code as in that example? To be honest, I am not sure how the classification job fits in; could you elaborate on where it is generated?

I've run the blog's example with the 20 Newsgroups dataset they recommend for benchmarking, and it seems to work pretty well, with the same mapping you posted here, so there shouldn't be any issues with that. I'd like to understand better how (or whether) you deviated from the example, to see where the issues come from.

The only thing I noticed is that the Python code at the end is a little off; it has one more for loop than needed, so I changed it to the following (also returning None explicitly when there are no hits):

from operator import itemgetter

def get_best_category(response):
    # Accumulate the scores of the returned hits per category
    categories = {}
    for hit in response['hits']['hits']:
        score = hit['_score']
        category = hit['_source']["label_text"]
        categories[category] = categories.get(category, 0) + score
    # No hits means no prediction to make
    if not categories:
        return None
    # Return the category with the highest accumulated score
    sorted_categories = sorted(categories.items(), key=itemgetter(1), reverse=True)
    return sorted_categories[0][0]

I've just posted the full example on a GitHub page here for you: https://github.com/iuliaferoli/elasticsearch-python/blob/main/classifier.ipynb

I picked the first element in the dataset to run the similarity query, and you can see that among the top hits it gets back it's mostly the same category (as confirmed by the summarization code at the end too).

Hope the example helps! Otherwise, please add some details on exactly what query you ran, how you created the job, etc. Thanks!

Thank you, Julia. As I said, I only used that blog post for orientation.

So here is what I'm doing step by step:

  1. Create the source index named entry_filtered with the mapping mentioned above
  2. Populate entry_filtered via the _reindex API from a different index (to have some isolated data)
  3. Create a data view for the entry_filtered index
  4. In Kibana: Analytics > Machine Learning > Data Frame Analytics > Jobs > Create Job
  5. Select the data view created for entry_filtered
  6. Select "Classification"
  7. Set "Dependent variable" to category.keyword
  8. Have category.keyword, name, and description selected as "Included fields"
  9. Leave everything else as is; set a job ID
  10. Create & run the job (I assume this maps to the API calls sketched below)
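
For what it's worth, I believe the last step corresponds to something like these API calls under the hood (the job ID is made up; I only used the UI):

POST _ml/data_frame/analytics/my-classification-job/_start

GET _ml/data_frame/analytics/my-classification-job/_stats

The second call is how you would check the job's progress outside of Kibana.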

After some time the job finishes and gives the bad overall accuracy mentioned in the original post.

There is no Python script or anything like that involved in my approach. The blog post uses a more_like_this query and processes the scores of the results in a script; that is not what I am doing.

I'm using the built-in features for machine learning.

Ahhh okay, that is quite a different question then.

If you are just running a classification job with the ML component, you do not have enough features to get a very accurate model, so it makes sense that your scores are quite low in this case. I just tried that in the UI too, and my accuracy was 50%, so not much better than random.

What the blog describes is that when you build a classifier for text, there are a lot of steps you need to take to turn the raw data into enough features/insights for the ML model to detect trends and make predictions.

To quote the blog:

Most NLP tasks start with a standard preprocessing pipeline:

  1. Gathering the data
  2. Extracting raw text
  3. Sentence splitting
  4. Tokenization
  5. Normalizing (stemming, lemmatization)
  6. Stopword removal
  7. Part of Speech tagging

Now, the cool thing the blog offered as an alternative solution was the more_like_this query, because a lot of those steps are natively implemented in the analyzers and the query's logic. That's why that kind of classification works "out-of-the-box" on an unprocessed text field.
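
To make that concrete, here is a minimal sketch of such a query against your index (index and field names taken from your mapping; the like text is just a sample):

GET entry_filtered/_search
{
  "query": {
    "more_like_this": {
      "fields": ["description"],
      "like": "The tomato is the edible berry of the plant Solanum lycopersicum.",
      "min_term_freq": 1,
      "max_query_terms": 25
    }
  }
}

The english analyzer on description handles tokenization, stemming, and stopword removal before the terms are matched, and you then aggregate the categories of the top hits, which is what the get_best_category function above does.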

If you instead want to use the ML job (and get higher accuracy), you need to make sure you create those features yourself: as the blog mentions, either with NLP libraries or with other kinds of Elasticsearch transforms and pipelines.

The Machine Learning section of the docs also mentions data preprocessing. Unlike with the more_like_this query, this is not done automatically within the job.

So to summarize:

  • you can follow the blog example, in which case your mapping and the other details you provided are fine, and you'd get a pretty accurate model.
  • or you can find some other way to pre-process your data before you use the ML classifier in Kibana (a rough sketch of that follows below).
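
As a very rough illustration of the second option, an ingest pipeline can at least do some normalization of the raw text before indexing. A minimal sketch (the pipeline name and processors are just an example; real feature engineering would take more than this):

PUT _ingest/pipeline/clean_description
{
  "processors": [
    { "lowercase": { "field": "description" } },
    { "gsub": { "field": "description", "pattern": "[^a-z0-9 ]", "replacement": " " } }
  ]
}

You would then reference it when indexing (?pipeline=clean_description) or in the _reindex dest; heavier steps like lemmatization or part-of-speech tagging would still have to happen outside Elasticsearch.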

Ok. I will have a deeper look into NLP, and alternatively play around with more_like_this a bit.

I misunderstood the capabilities of ES here.

Thanks, Julia.
