Hello,
I am an Italian developer, new to Elasticsearch and currently in training.
As an exercise, I would like to build a parser for articles from some French newspapers, in order to analyze the text using Natural Language Processing (NLP), and to use Elasticsearch to store the data and perform all searches, aggregations, etc.
I am using:
- Elasticsearch 7.1 on Ubuntu 20 on an Amazon EC2 instance
- spaCy Python NLP library on Ubuntu 20 on a DigitalOcean instance
- Microsoft Azure Text Analytics for sentiment analysis (called by Node.js functions on AWS Lambda)
- PHP to crawl news sites, prepare the data and store it in Elasticsearch
For each article I obtain a piece of JSON like this:
{
"nwsp": "Le Figaro",
"arg": "Actualit\u00e9",
"src": "A la Une",
"xml_url": "http:\/\/www.lefigaro.fr\/rss\/figaro_actualites.xml",
"link": "https:\/\/www.lefigaro.fr\/actualite-france\/hostile-a-une-suppression-de-l-igpn-gerald-darmamin-est-pret-a-un-toilettage-20201130",
"title": "Hostile \u00e0 une suppression de l\u2019IGPN, G\u00e9rald Darmanin est pr\u00eat \u00e0 un \u00abtoilettage\u00bb",
"description": "Devant la commission des lois de l\u2019Assembl\u00e9e nationale lundi soir, il a d\u00e9fendu le maintien de cette entit\u00e9 au sein du minist\u00e8re de l\u2019Int\u00e9rieur.",
"pubdate": "2020-11-30 10:11:08",
"author": [
"Christophe Cornevin",
"Jean-Marc Leclerc"
],
"sentiment": {
"sentiment": "negative",
"confidenceScores": {
"positive": 0.11,
"neutral": 0.13,
"negative": 0.76
}
},
"entities": [
{
"text": "IGPN",
"type": "ORG"
},
{
"text": "G\u00e9rald Darmanin",
"type": "PER"
},
{
"text": "Assembl\u00e9e nationale lundi",
"type": "ORG"
},
{
"text": "minist\u00e8re de l Int\u00e9rieur",
"type": "ORG"
}
],
"tags": [
"hostile",
"suppression",
"igpn",
"g\u00e9rald",
"darmanin",
"pr\u00eat",
"toilettage",
"devant",
"commission",
"lois",
"assembl\u00e9e",
"nationale",
"lundi",
"soir",
"d\u00e9fendu",
"maintien",
"entit\u00e9",
"minist\u00e8re",
"int\u00e9rieur"
],
"tag_to_txt": " hostile suppression igpn g\u00e9rald darmanin pr\u00eat toilettage devant commission lois assembl\u00e9e nationale lundi soir d\u00e9fendu maintien entit\u00e9 minist\u00e8re int\u00e9rieur"
},
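A note on the `\u00e9`-style sequences in the JSON above: these are standard JSON Unicode escapes, so any JSON parser should decode them back to the accented characters. The short Python check below is only a sketch using the standard library; the `raw` string is a fragment copied from the document above, and the variable names are illustrative.

```python
import json

# Fragment copied from the article JSON above, with its \uXXXX escapes intact.
raw = r'{"arg": "Actualit\u00e9", "title": "G\u00e9rald Darmanin est pr\u00eat \u00e0 un \u00abtoilettage\u00bb"}'

# Parsing turns each escape back into the real character.
doc = json.loads(raw)
print(doc["arg"])
print(doc["title"])

# Serializing without ASCII escaping is just another spelling of the same data.
unescaped = json.dumps(doc, ensure_ascii=False)
assert json.loads(unescaped) == doc
```

In other words, the escaped and unescaped forms describe the same string once parsed.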
My questions:
- What is the best way to map these fields, given that I would like to:
- run aggregation queries on 'author', 'sentiment', 'entities' and 'tags'? It is not clear to me whether I can run aggregations on fields containing JSON. Should I put everything in the same index, or do I have to split it into several indexes, as I do with MySQL (I don't think so...)?
- How do I handle French characters? Here they are converted to ASCII escape sequences, but I don't know whether a search query would then still find them.
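To make the mapping question concrete, here is one possible shape for a single-index mapping, written as Python dicts purely for illustration. It is an untested sketch, not a recommendation: the choice of `keyword` for the fields to aggregate on, the `nested` type for 'entities', the built-in `french` analyzer for the text fields and the date format are all assumptions, and the aggregation names (`by_author`, `by_entity`) are made up.

```python
import json

# Hypothetical mapping for one index holding whole articles.
# Every field type below is an assumption to be checked, not a tested setup.
mapping = {
    "mappings": {
        "properties": {
            "nwsp": {"type": "keyword"},
            "title": {"type": "text", "analyzer": "french"},
            "description": {"type": "text", "analyzer": "french"},
            "pubdate": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
            # A JSON array of strings needs no special array type.
            "author": {"type": "keyword"},
            "sentiment": {
                "properties": {
                    "sentiment": {"type": "keyword"},
                    "confidenceScores": {
                        "properties": {
                            "positive": {"type": "float"},
                            "neutral": {"type": "float"},
                            "negative": {"type": "float"},
                        }
                    },
                }
            },
            # "nested" keeps each entity's text and type paired together.
            "entities": {
                "type": "nested",
                "properties": {
                    "text": {"type": "keyword"},
                    "type": {"type": "keyword"},
                },
            },
            "tags": {"type": "keyword"},
        }
    }
}

# Terms aggregation on a plain keyword field.
author_agg = {"size": 0, "aggs": {"by_author": {"terms": {"field": "author"}}}}

# Aggregation on a nested field has to go through a "nested" aggregation.
entity_agg = {
    "size": 0,
    "aggs": {
        "entities": {
            "nested": {"path": "entities"},
            "aggs": {"by_entity": {"terms": {"field": "entities.text"}}},
        }
    },
}

# Print the request bodies as they would be sent to Elasticsearch.
print(json.dumps(mapping, indent=2))
print(json.dumps(author_agg, indent=2))
print(json.dumps(entity_agg, indent=2))
```

The sketch keeps everything in one index; whether that is the right choice is exactly what I am asking.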
Many thanks to everybody who can point me in the right direction with suggestions or links to documentation.
Max