Classification of String Data

hey there,

i have a task to classify download stats (basically a URL plus some minor but still valuable fields like referring host, request country, etc.) of the open-source products our company provides into metadata like site, product family, name & component, major, minor & patch version, OS type & version, etc., plus the number of downloads of course.

at the moment, i have a script which analyses the URL and applies regular expressions to do that job. but the number of different files grows and naming conventions change over time (yes, i need to process historical data as well), so it's getting really hard to maintain that script by adding new regexes.
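
to give an idea, here is a minimal sketch of what the script does for one naming convention. the URL layout, the `FILENAME_RE` pattern and the `parse_download_url` helper are made up for illustration; the real script has many more patterns than this.

```python
import re
from urllib.parse import urlparse

# Hypothetical file-name convention (an assumption, not the real one):
#   <product>-<component>-<major>.<minor>.<patch>-<os>.<ext>
FILENAME_RE = re.compile(
    r"(?P<product>[a-z]+)-(?P<component>[a-z]+)-"
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)-"
    r"(?P<os>[a-z0-9]+)\.(?P<ext>[a-z.]+)"
)


def parse_download_url(url):
    """Return a dict of metadata fields, or None if the file name does not match."""
    parsed = urlparse(url)
    # The file name is whatever follows the last slash of the path.
    filename = parsed.path.rsplit("/", 1)[-1]
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["site"] = parsed.netloc
    # Version parts are stored as integers so they sort numerically.
    for key in ("major", "minor", "patch"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    print(parse_download_url(
        "https://downloads.example.com/files/acme-server-2.10.3-linux.tar.gz"
    ))
```

every new naming convention means another pattern like `FILENAME_RE`, which is exactly the maintenance problem.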

to me that task looks like a great job for machine learning: classify the string into a set of definite keywords (names) and numbers (versions). luckily, Elastic offers such functionality as an experimental feature.

at the moment i have some kind of "ground truth": the indices prepared by the script, which i could use to train a model initially. since Elastic mentions that it's a supervised job, i would have a way to train the model further or let the task stakeholders do so.

but i am unsure if Elasticsearch's experimental ML functionality is the right tool for that task.

  1. i've tried to use the feature, but when i specify a dependent variable it shows
    Invalid. Field [parsed.component.keyword] must have at most [30] distinct values but there were at least [34]
    according to the docs, dependent values "must contain no more than 30 classes", while i definitely have many more variations of the files to classify.
  2. i did not find a way to specify other dependent variables, but the task requires a whole set of different parameters in the output.

does the above mean Elasticsearch ML does not fit my needs? am i doing or getting anything wrong?

Hi,

but i am unsure if Elasticsearch's experimental ML functionality is the right tool for that task.

The purpose of the supervised ML classification job is to classify each document as belonging to one particular class. In other words, as the documentation puts it: "Classification is a machine learning process that enables you to predict the class or category of a data point in your data set."

With that in mind, I'd say your problem does not really fit into this category.

  1. i've tried to use the feature, but when i specify a dependent variable it shows
    Invalid. Field [parsed.component.keyword] must have at most [30] distinct values but there were at least [34]
    according to the docs, dependent values "must contain no more than 30 classes", while i definitely have many more variations of the files to classify.

As you have noticed, Elasticsearch's supervised ML restricts the number of distinct values of the dependent variable. Currently the limit is 30. The actual number is less important. What matters more is that the number of categories is (and will be) bounded, whereas in your case it is unbounded (e.g. there can be many unique sites or downloads).
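
To see which fields run into that limit, you could count distinct values per field before creating a job. This is only a rough sketch in plain Python: it assumes the documents are already loaded as flat dicts, and the field names and the `fields_over_limit` helper are hypothetical.

```python
MAX_CLASSES = 30  # the limit quoted in the error message above


def distinct_value_counts(docs, fields):
    """Count distinct non-missing values per field across an iterable of dicts."""
    seen = {field: set() for field in fields}
    for doc in docs:
        for field in fields:
            value = doc.get(field)
            if value is not None:
                seen[field].add(value)
    return {field: len(values) for field, values in seen.items()}


def fields_over_limit(docs, fields, limit=MAX_CLASSES):
    """Return only the fields whose distinct-value count exceeds the limit."""
    counts = distinct_value_counts(docs, fields)
    return {field: n for field, n in counts.items() if n > limit}


if __name__ == "__main__":
    # Made-up sample: 34 distinct components, 2 distinct OS values.
    sample = [
        {"component": f"component-{i}", "os": "linux" if i % 2 else "windows"}
        for i in range(34)
    ]
    print(fields_over_limit(sample, ["component", "os"]))
```

Fields such as version strings or file names keep gaining new values over time, so they would exceed any fixed limit sooner or later.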

  2. i did not find a way to specify other dependent variables, but the task requires a whole set of different parameters in the output.

There can only be one dependent variable, i.e.: the class to which a document belongs.
So with ML classification, you cannot get all those different parameters in the output.
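
As a sketch of what that looks like in practice, the `analysis.classification` section of a data frame analytics job body takes a single `dependent_variable` field name. The index names below are placeholders, and the body is only built as a Python dict here, not sent anywhere.

```python
import json


def classification_job_body(source_index, dest_index, dependent_variable):
    """Build a request body for a classification data frame analytics job."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "analysis": {
            "classification": {
                # Exactly one field name goes here, not a list of fields.
                "dependent_variable": dependent_variable,
            }
        },
    }


# Placeholder index names; the field name is the one from the error message.
body = classification_job_body(
    "downloads-parsed", "downloads-classified", "parsed.component.keyword"
)

if __name__ == "__main__":
    print(json.dumps(body, indent=2))
```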

does the above mean Elasticsearch ML does not fit my needs? am i doing or getting anything wrong?

To sum up, I don't think the ML classification job can be used for a data extraction problem. The main issue is the potentially unbounded number of "classes", whereas in ML classification the set of classes needs to be fixed.


thank you very much for that detailed explanation, Przemysław. it looks like i need to find a better way. btw, thank you for the great term "data extraction", which explains my task better than the tons of words i used.

