Hey,
I am using the ES machine learning classification feature to predict the category
of a given text. I followed this blog post, but I believe I'm doing something wrong when preparing the data.
I'm getting a really poor overall accuracy of ~0.125 across 9 different categories, so the hits are probably just chance (1/9 ≈ 0.11).
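To put that number in context, here is a rough sketch of the two baselines I am comparing against: guessing uniformly at random, and always predicting the most frequent category. The function names and the label list are made up for illustration; they are not my real class distribution.

```python
from collections import Counter


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy when guessing uniformly at random."""
    return 1.0 / num_classes


def majority_accuracy(labels: list[str]) -> float:
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical labels, only to illustrate the calculation.
example_labels = ["fruit"] * 5 + ["vegetable"] * 3 + ["bread"] * 2

print(round(chance_accuracy(9), 3))                 # 9 categories -> 0.111
print(round(majority_accuracy(example_labels), 3))  # 5 of 10 -> 0.5
```

If the categories are imbalanced, the majority baseline is the more meaningful number to beat.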
My suspicion is that my tokenization and normalization aren't working properly, or not at all. Is there a way to test this against existing index data?
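The kind of check I have in mind is something like the sketch below, which builds a request for the `_analyze` API so that a sample text is run through whatever analyzer is mapped to a field. The cluster URL, the index name `food`, and the helper names are assumptions for illustration only; the `analyze` function is not called here because it needs a running cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without security
INDEX = "food"                    # hypothetical index name


def build_analyze_request(index: str, field: str, text: str) -> tuple[str, dict]:
    """Return (path, body) for _analyze, using the analyzer mapped to `field`."""
    return f"/{index}/_analyze", {"field": field, "text": text}


def extract_tokens(response: dict) -> list[str]:
    """Pull the token strings out of an _analyze response body."""
    return [t["token"] for t in response.get("tokens", [])]


def analyze(index: str, field: str, text: str) -> list[str]:
    """Send the request to the cluster and return the produced tokens."""
    path, body = build_analyze_request(index, field, text)
    req = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return extract_tokens(json.load(resp))


# Only build and print the request; nothing is sent.
path, body = build_analyze_request(
    INDEX, "description", "The potato is a starchy root vegetable"
)
print(path, json.dumps(body))
```

As far as I can tell, the `_termvectors` API would be the equivalent for documents that are already indexed, but I am not sure whether that is the right way to verify what the classification job actually sees.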
My data looks like the following:
[
  {
    "name": "Tomato",
    "category": "fruit",
    "description": "The tomato is the edible berry of the plant Solanum lycopersicum, commonly known as the tomato plant. The species originated in western South America, Mexico, and Central America. The Nahuatl word tomatl gave rise to the Spanish word tomate, from which the English word tomato derives. Its domestication and use as a cultivated food may have originated with the indigenous peoples of Mexico"
  },
  {
    "name": "Potato",
    "category": "vegetable",
    "description": "The potato is a starchy root vegetable native to the Americas that is consumed as a staple food in many parts of the world. Potatoes are tubers of the plant Solanum tuberosum, a perennial in the nightshade family Solanaceae."
  },
  {
    "name": "Baguette",
    "category": "bread",
    "description": "A baguette is a long, thin type of bread of French origin that is commonly made from basic lean dough (the dough, not the shape, is defined by French law). It is distinguishable by its length and crisp crust. "
  }
]
Each document has a short name, a category (which I want to predict), and a long description of up to 5,000 characters (the above is just an example to show the structure of the data). My test dataset contains around 3,500 documents.
The mapping of my index is this:
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "english",
        "fielddata": true
      },
      "category": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 512
          }
        }
      },
      "name": {
        "type": "text",
        "analyzer": "english",
        "fielddata": true
      }
    }
  }
}
I have also tried the .keyword subfield for description and name, with no luck.
Any advice is appreciated.
Thanks,
Michel