Looking for suggestions about mapping

MassimoIvaldi · December 1, 2020, 12:54am

Hello,
I am an Italian developer, new to Elasticsearch and currently learning it.

As an exercise, I would like to build a parser for articles from some French newspapers, analyze the text with Natural Language Processing (NLP), and use Elasticsearch to store the data and run all searches, aggregations, etc.

I am using:

Elasticsearch 7.1 on Ubuntu 20 on an Amazon EC2 instance
spaCy Python NLP library on Ubuntu 20 on a DigitalOcean instance
Microsoft Azure Text Analytics for sentiment analysis (called from Node.js functions on AWS Lambda)
PHP to crawl news sites, prepare the data, and store it in Elasticsearch

For each article I obtain a piece of JSON like this:


    {
            "nwsp": "Le Figaro",
            "arg": "Actualit\u00e9",
            "src": "A la Une",
            "xml_url": "http:\/\/www.lefigaro.fr\/rss\/figaro_actualites.xml",
            "link": "https:\/\/www.lefigaro.fr\/actualite-france\/hostile-a-une-suppression-de-l-igpn-gerald-darmamin-est-pret-a-un-toilettage-20201130",
            "title": "Hostile \u00e0 une suppression de l\u2019IGPN, G\u00e9rald Darmanin est pr\u00eat \u00e0 un \u00abtoilettage\u00bb",
            "description": "Devant la commission des lois de l\u2019Assembl\u00e9e nationale lundi soir, il a d\u00e9fendu le maintien de cette entit\u00e9 au sein du minist\u00e8re de l\u2019Int\u00e9rieur.",
            "pubdate": "2020-11-30 10:11:08",
            "author": [
                "Christophe Cornevin",
                "Jean-Marc Leclerc"
            ],
            "sentiment": {
                "sentiment": "negative",
                "confidenceScores": {
                    "positive": 0.11,
                    "neutral": 0.13,
                    "negative": 0.76
                }
            },
            "entities": [
                {
                    "text": "IGPN",
                    "type": "ORG"
                },
                {
                    "text": "G\u00e9rald Darmanin",
                    "type": "PER"
                },
                {
                    "text": "Assembl\u00e9e nationale lundi",
                    "type": "ORG"
                },
                {
                    "text": "minist\u00e8re de l Int\u00e9rieur",
                    "type": "ORG"
                }
            ],
            "tags": [
                "hostile",
                "suppression",
                "igpn",
                "g\u00e9rald",
                "darmanin",
                "pr\u00eat",
                "toilettage",
                "devant",
                "commission",
                "lois",
                "assembl\u00e9e",
                "nationale",
                "lundi",
                "soir",
                "d\u00e9fendu",
                "maintien",
                "entit\u00e9",
                "minist\u00e8re",
                "int\u00e9rieur"
            ],
            "tag_to_txt": " hostile suppression igpn g\u00e9rald darmanin pr\u00eat toilettage devant commission lois assembl\u00e9e nationale lundi soir d\u00e9fendu maintien entit\u00e9 minist\u00e8re int\u00e9rieur"
    }

My questions:

What is the best way to map these fields, given that I want to:

Run aggregation queries on 'author', 'sentiment', 'entities' and 'tags'. It is not clear to me whether I can aggregate on fields that contain nested JSON. Should I put everything in the same index, or do I have to split it across several indexes, as I would with tables in MySQL (I don't think so...)?
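To make the question concrete, here is a rough sketch in Python of what a single-index mapping for the document above might look like, together with an aggregation request over the four fields. It is only a starting point under several assumptions: the index name `articles` is hypothetical, `entities` is mapped as `nested` on the assumption that each text/type pair should stay together, and the built-in `french` analyzer is assumed to be good enough for the free-text fields. The script only builds and prints the request bodies; it does not talk to a cluster.

```python
import json

# Hypothetical index name; nothing in the post fixes one.
INDEX_NAME = "articles"

# Body for: PUT /articles  (typeless mapping, as used by Elasticsearch 7.x)
mapping = {
    "mappings": {
        "properties": {
            # Exact-value fields: filterable and aggregatable as-is.
            "nwsp": {"type": "keyword"},
            "arg": {"type": "keyword"},
            "src": {"type": "keyword"},
            "xml_url": {"type": "keyword"},
            "link": {"type": "keyword"},
            # Free text, analyzed with the built-in French analyzer.
            "title": {"type": "text", "analyzer": "french"},
            "description": {"type": "text", "analyzer": "french"},
            "tag_to_txt": {"type": "text", "analyzer": "french"},
            "pubdate": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
            # Arrays need no special mapping: a keyword field holds 1..n values.
            "author": {"type": "keyword"},
            "tags": {"type": "keyword"},
            # Plain JSON object: sub-fields are addressed with dots.
            "sentiment": {
                "properties": {
                    "sentiment": {"type": "keyword"},
                    "confidenceScores": {
                        "properties": {
                            "positive": {"type": "float"},
                            "neutral": {"type": "float"},
                            "negative": {"type": "float"},
                        }
                    },
                }
            },
            # Assumption: nested, so each entity's text and type stay paired.
            "entities": {
                "type": "nested",
                "properties": {
                    "text": {"type": "keyword"},
                    "type": {"type": "keyword"},
                },
            },
        }
    }
}

# Body for: POST /articles/_search
aggs_query = {
    "size": 0,  # only the aggregation buckets are wanted, not the hits
    "aggs": {
        "by_author": {"terms": {"field": "author"}},
        "by_sentiment": {"terms": {"field": "sentiment.sentiment"}},
        "by_tag": {"terms": {"field": "tags", "size": 20}},
        "entities": {
            "nested": {"path": "entities"},
            "aggs": {
                "by_type": {
                    "terms": {"field": "entities.type"},
                    "aggs": {"top_names": {"terms": {"field": "entities.text"}}},
                }
            },
        },
    },
}

print(json.dumps(mapping, indent=2))
print(json.dumps(aggs_query, indent=2))
```

With a layout like this, everything lives in one index and one document per article; the object and nested fields take the place of the join tables a MySQL schema would use.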

How do I handle French characters? Here they are escaped to ASCII (as \uXXXX sequences), but I don't know whether a search query will still be able to find them.
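As a small check of what the escaping actually means, the Python sketch below shows that a `\u00e9` sequence is only a JSON serialization detail: any JSON parser decodes it back to `é`. It also sketches index settings with a custom analyzer that uses the `asciifolding` token filter, one option for letting an unaccented query such as "interieur" match "intérieur". The analyzer name `french_folded` and the `fold` helper are hypothetical; `fold` is only a rough local approximation of accent folding for illustration, not the filter Elasticsearch runs.

```python
import json
import unicodedata

# The escaped form from the article JSON decodes to real Unicode text.
raw = '{"arg": "Actualit\\u00e9"}'
decoded = json.loads(raw)["arg"]
print(decoded)

# Serializers that escape non-ASCII by default produce the same bytes again.
print(json.dumps(decoded))

# Hypothetical analyzer: standard tokenizer, then lowercase and accent folding.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "french_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}


def fold(text: str) -> str:
    """Rough local stand-in for accent folding: strip combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


print(fold("Intérieur"), fold("interieur"))
print(json.dumps(settings, indent=2))
```

On the PHP side, `json_encode` accepts the `JSON_UNESCAPED_UNICODE` flag if the output should keep the accented characters literally; either form describes the same string.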

Many thanks to anyone who can point me in the right direction with suggestions or links to documentation.

Max
