hey there,
I have a task to classify download stats (basically a URL plus a few minor but still valuable fields, like referring host, request country, etc.) for the open-source products our company provides into metadata: site, product family, name and component, major/minor/patch version, OS type and version, and so on, plus the number of downloads, of course.
At the moment I have a script which analyzes the URL and applies regular expressions to do that job. But the number of different files keeps growing and naming conventions change over time (yes, I need to process historical data as well), so it's getting really hard to maintain that script by adding ever more regexes.
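To give an idea of what I mean, here is a stripped-down sketch of the kind of rule the script applies (patterns, paths, and field names are invented for illustration; the real script has many more patterns):

```python
import re

# Each new naming convention means appending yet another pattern here.
# Example URL shape: /downloads/acme-db/acme-db-server-2.14.3-linux-x86_64.tar.gz
PATTERNS = [
    re.compile(
        r"/downloads/(?P<family>[a-z-]+)/"
        r"(?P<component>[a-z-]+)-"
        r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)-"
        r"(?P<os>linux|windows|macos)"
    ),
]

def classify(url):
    """Return parsed metadata fields for a download URL, or None if no rule matches."""
    for pattern in PATTERNS:
        m = pattern.search(url)
        if m:
            return m.groupdict()
    return None  # an unrecognized naming convention -> time to write another regex
```

Every URL that falls through to `None` is a file the script cannot classify until someone adds a new pattern, which is exactly the maintenance burden I want to get rid of.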
To me this looks like a great job for machine learning: classify the string into a set of definite keywords (names) and numbers (versions). Luckily, Elastic offers such functionality as an experimental feature.
By now I have some kind of "ground truth": the indices prepared by the script, which I could use to train a model initially. Since Elastic says this is a supervised job, I would also have a way to train the model further, or to let the task's stakeholders do so.
But I am unsure whether Elasticsearch's experimental ML functionality is the right tool for this task.
- I've tried the feature, but when I specify a dependent variable, it shows:
Invalid. Field [parsed.component.keyword] must have at most [30] distinct values but there were at least [34]
  According to the docs, the dependent variable "must contain no more than 30 classes", while I definitely have many more variations of files to classify.
- I did not find a way to specify additional dependent variables, but the task requires a whole set of different parameters in the output.
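For reference, the job body I submitted looks roughly like this (a minimal sketch of a data frame analytics classification config; the index and field names are from my setup, not anything standard):

```python
# Sketch of the classification job body I send to
# PUT _ml/data_frame/analytics/<job_id>.
# Index and field names ("downloads-parsed", "parsed.component.keyword", etc.)
# are specific to my data and shown only for illustration.
job_config = {
    "source": {"index": "downloads-parsed"},
    "dest": {"index": "downloads-classified"},
    "analysis": {
        "classification": {
            # Only ONE dependent variable can be set here, and this is the
            # field that triggers the "at most [30] distinct values" error.
            "dependent_variable": "parsed.component.keyword",
            "training_percent": 80,
        }
    },
    "analyzed_fields": {
        "includes": ["url", "referrer_host", "country"],
    },
}
```

As far as I can tell from the API, `analysis.classification` accepts exactly one `dependent_variable`, so to predict component, family, OS, and versions I would apparently need a separate job per output field.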
Does the above mean Elasticsearch ML does not fit my needs? Am I doing or understanding anything wrong?