How can I extract the plain text of an attachment file (PDF)?

Hi all,

I want to copy the text that Tika extracts from my PDFs (I use the
mapper-attachment plugin for Elasticsearch) into a CSV or a txt file, if that
is possible. I am trying to get the text by executing this command:

curl -XGET 'http://localhost:9200/test/attachment/_search?pretty' -d '
{
  "fields": [
    "file"
  ]
}
' | jq '.fields["file"]| [.file] | @csv'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  155k  100  155k  100    42  8154k   2209 --:--:-- --:--:-- --:--:-- 8611k
""

But I get nothing. What did I do wrong?

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6b7bb873-c7ba-4d47-8dc1-93e1c13b9110%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Did you store the file content in the mapping?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs


Hi David,

This is what I did to get the text of the attachment:

curl -X DELETE "localhost:9200/test"

curl -X PUT "localhost:9200/test" -d '{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "default" : {
            "type" : "custom",
            "tokenizer" : "uax_url_email",
            "filter" : ["standard", "lowercase", "stop"]
          }
        }
      }
    }
  }
}'

curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "index" : "yes",
        "path" : "full",
        "fields" : {
          "title" : { "store" : "yes" },
          "file" : { "term_vector" : "with_positions_offsets", "type" : "string", "store" : "yes" }
        }
      }
    }
  }
}'

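#!/bin/sh

# Slurp the whole PDF (-0777) and base64-encode it on a single line
# (the empty second argument to encode_base64 suppresses line breaks).
coded=$(perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' tonfichier.pdf)

# Wrap the encoded content in a JSON document; the inner quotes must be escaped.
json="{\"file\":\"${coded}\"}"

echo "$json" > json.file

# Index the document into the test/attachment type.
curl -X POST "localhost:9200/test/attachment/" -d @json.file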

To get the readable text, I execute:

curl -XGET 'http://localhost:9200/test/attachment/_search?pretty' -d '
{
  "fields": [
    "file"
  ]
}
'

Now I need to export the text to another file (.txt or CSV, for example),
but I can't manage it.

I am trying another query now, but nothing works:

curl -XGET 'http://localhost:9200/test/attachment/_search?pretty' -d '
{
  "fields": [
    "file"
  ]
}
' | jq '.[] | .fields.file | @csv'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  155k  100  155k  100    42  8945k   2423 --:--:-- --:--:-- --:--:-- 9118k
jq: error: Cannot index number with string
jq: error: Cannot index boolean with string
jq: error: null cannot be csv-formatted, only array
jq: error: null cannot be csv-formatted, only array

I think my jq syntax is wrong.

Thank you for helping me every time :)
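
A note on the errors above: they appear to come from the jq filter path rather than from Elasticsearch. `.[]` iterates over the top-level values of the response (`took`, `timed_out`, `_shards`, `hits`), so `.fields` ends up being applied to a number and a boolean, while the documents themselves sit under `hits.hits[]`. The sketch below illustrates this on a hand-written sample; the file name `sample.json`, the ids `doc1`/`doc2` and the shortened text are assumptions for illustration, not the real response.

```shell
#!/bin/sh
# Shortened, made-up sample with the same shape as the response in this thread.
cat > sample.json <<'EOF'
{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 4, "failed": 0 },
  "hits": {
    "total": 2,
    "hits": [
      { "_id": "doc1", "fields": { "file": ["PROJET INNOVANT 2013 Cahier OSEO"] } },
      { "_id": "doc2" }
    ]
  }
}
EOF

# `.[]` yields the top-level values; printing their types shows why indexing
# them with a string key such as "fields" fails for the number and the boolean.
top_level_types=$(jq -r '.[] | type' sample.json)
echo "$top_level_types"

# The documents live under hits.hits[]; select() drops hits without a "file" field.
first_text=$(jq -r '.hits.hits[] | select(.fields.file) | .fields.file[0]' sample.json)
echo "$first_text"
```

The `-r` option makes jq print raw strings instead of JSON-quoted ones, which is usually what is wanted when the text goes to a file.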


What does this return?

curl -XGET 'http://localhost:9200/test/attachment/_search?pretty' -d '
{
"fields": [
"file"
]
}
'

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs


It returns the text extracted from my PDFs (I need readable text rather than
the base64 content). Here is just a small part of the query result:

{

"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 4,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "attachment",
"_id": "AUumL_QfgcE6sosOU9XD",
"_score": 1,
"fields": {
"file": [
"PROJET INNOVANT 2013 Cahier OSEO\n1\n\n�INFORMATIONS
SUR LE DOCUMENT\n\nEntreprise Nom du projet Date de rendu du document Titre
du document Version Nombre de pages Chef du projet
Contributeurs\nResponsable Projet Innovant Encadrant Technique Encadrant
Humanités Mots clés\n\nAKH PARNTERS SOKARIS 15 avril 2013 Projet Innovant
2013 ­ Cahier OSEO 1.0 30 Arthur Magnien Arthur Magnien, Clément Garnier,
Grégory D'Angelo, Minwei Chen, Joel Pestana, Fujia Hou Stéphane Frénot
Claire Goursaud Claude Guedat Consulting, Traitement d'images, Publicité,
Optimisation, Analyse, Etude comportementale, influence, impact,
regard\n\nHISTORIQUE DU DOCUMENT\n\nVersion 0.1 0.2 0.3 0.4
0.8\n1.0\n2.0\n\nDate 19 Mars 2013 20 Mars 2013 28 Mars 2013 29 Mars 2013 8
Avril 2013\n15 Avril 2013\n18 Avril 2013\n\nCommentaires\nCréation du
document Ajout de l'introduction et la partie Management Ajout de la partie
Juridique et Santé Ajout de la partie Marketing et Technique Ajout partie
AF, Financière, analyse des risques et Bibliographie - Version provisoire
du cahier Révision avec Tuteurs. Modification des parties et ajout du
glossaire Deuxième version provisoire du cahier Mise en page finale du
document\n\nStatus Draft Draft Draft Draft Draft\n1.0\n2.0\n\nTABLE DES
MATIÈRES\nINFORMATIONS SUR LE DOCUMENT
...............................................................................................................
2 HISTORIQUE DU
DOCUMENT............................................................................................................................
2 TABLE DES
MATIÈRES.........................................................................................................................................
2 ABRÉVIATIONS
..................................................................................................................................................
3 RÉFÉRENCES
......................................................................................................................................................
3 CONTEXTE DU PROJET
......................................................................................................................................
4 NOTRE
ÉQUIPE...................................................................................................................................................
5 ANALYSE FONCTIONNELLE
..............................................................................................................................
6 ÉTUDE DE MARCHÉ
...........................................................................................................................................
8 ANALYSE DU BESOIN
........................................................................................................................................
9 ÉTUDE DE LA CONCURRENCE
..........................................................................................................................
9 POSITIONNEMENT ET OFFRES PROPOSÉES - AKH PARTNERS
....................................................................... 11
CARACTÉRISTIQUES TECHNIQUES
.................................................................................................................
14 PLAN FINANCIER
............................................................................................................................................
18\n2\n\n�CADRE JURIDIQUE
..........................................................................................................................................
22 ANALYSE DES RISQUES
...................................................................................................................................
23 MANAGEMENT DE
PROJET.............................................................................................................................
26
ANNEXES.........................................................................................................................................................
28\n\nABRÉVIATIONS\n\nCFE : Centre de formalité des entreprises CNIL :
Commission nationale de l'informatique et des libertés GIMP : GNU Image
Manipulation Program Git : Gestionnaire de versions décentralisé IFA :
Impôt forfaitaire annuel IFOP: Institut français d'opinion publique INPI :
Institut national de la propriété intellectuelle IREP : Institut de
recherches et d'études publicitaires JEI : Jeune entreprise innovante
OpenCV : Open Computer Vision\n\nOpenGL : Open Graphics Library, librairies
PI : Projet innovant du département TC SAS : Société par actions
simplifiées SOKARIS : Solution logicielle développée par AKH Partners SVN :
Subversion, gestionnaire de version SWOT :
Strengths-Weaknesses-Opportunities-Threats, analyse des forces, faiblesse,
opportunités et menaces TC : Département Télécommunications, Services &
Usages à l'INSA de Lyon\n\nRÉFÉRENCES\n\n[1] Journal du net.
Martech : actualités et tendances sur JDN
/dossier/marketing-visuel/2.shtml [2] Recherche IREP.
http://www.irep.asso.fr/marche-publicitaire-chiffresannuels.php [3] Etude
IFOP sur les critères de visualisation d'une publicité [4] Coût du plan
premium dans le métro parisien.
Page introuvable - MédiaTransports
_cgv2012_mbs_14022012.pdf [5] Junwei Technologies website. [6] Cliris
website. http://www.clirisgroup.com [7] Anaxa vida website
http://anaxa-vida.fr/ [8] Shopperception website.
http://www.shopperception.com/ [9]Google Analytics
Analytics Tools & Solutions for Your Business - Google Analytics [10] Tobii website. http://www.tobii.com/
[11] Etude 2011 microsfot advertsing [12] Etude Context Matters 2007 [13]
Caractéristiques en vision par ordinateur
http://fr.wikipedia.org/wiki/Extraction_de_caract�%A
9ristique_en_vision_par_ordinateur [14] Algorithme de Haar (2010) Suivi du
regard non intrusif en tête libre à partir d'image vidéo.
http://hal.archivesouvertes.fr/docs/00/80/82/99/PDF/RI080408.pdf\n\n[15]
Méthode de Viola et Jones
http://fr.wikipedia.org/wiki/Méthode_de_Viola_e t_Jones [16] Détection
in computer vision https://www.youtube.com/watch?v=NCtYdUEMotg [17] Open CV
Library http://opencv.org/ [18]"
]
}
},

{
"_index": "test",
"_type": "attachment",
"_id": "AUumKYuHgcE6sosOU9XC",
"_score": 1,
"fields": {
"file": [
"Dossier\n\n« Be everywhere. At once. »\n\n Référence
Ubiquity ­ Dossier Oseo.pdf\n Version 1.0 17 avril 2013\n\n Auteur Equipe
1\n Chef de projet Pierre Curis\n\n�« Be everywhere. At once. »\n\n
Informations sur le document\n\nNom du projet Date de rendu du document
Titre du document Type Version Responsable du projet Responsable PI
Encadrant technique Encadrant humanités Mots-clés\n\nUbiquity Jeudi 11
avril 2013 Ubiquity ­ Dossier Oséo Document PDF 1.0 Pierre Curis Stéphane
Frénot Isabelle Augé-Blum Patrice Heyde Livestreaming, web, cartographie,
reportage, informations\n\n Historique du document\n\nVersion v0.1 v0.2
v1.0\n\nDate 20/02/2012 12/04/2013 17/04/2013\n\nModification Création du
document "
]
}
},
{
"_index": "test",
"_type": "attachment",
"_id": "AUvamLbSi6bRbhYPLka0",
"_score": 1
}
]
}
}

The other files I indexed follow in the same way. I copied only small parts
of the text, not all of it.


Everything is correct. What do you expect?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs


I want to copy the extracted text from Elasticsearch to a separate txt or CSV
file, but I am getting nothing with jq.

Look at this error:

jq: error: Cannot index number with string
jq: error: Cannot index boolean with string
null
null

Can you show me another way to export the text from Elasticsearch to a txt
file?

Thanks a lot :)
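
One possible way to do the export, sketched under assumptions: the search response is first saved to a file, then jq writes the text to a txt file and an (id, text) table to a CSV file. The file names `response.json`, `extracted.txt` and `extracted.csv`, the ids and the sample text are hypothetical; against a live cluster, the response would come from the curl request shown in this thread.

```shell
#!/bin/sh
# In real use the response would come from the search request in this thread, e.g.
#   curl -s -XGET 'http://localhost:9200/test/attachment/_search?pretty' \
#        -d '{"fields":["file"]}' > response.json
# (-s hides curl's progress meter). A shortened, made-up sample stands in for it here.
cat > response.json <<'EOF'
{"hits":{"hits":[
  {"_id":"doc1","fields":{"file":["First line\nSecond line"]}},
  {"_id":"doc2","fields":{"file":["Title \"quoted\", with comma"]}},
  {"_id":"doc3"}
]}}
EOF

# Plain text: raw output (-r), so each \n in the extracted text becomes a real line break.
# select() skips hits that carry no "file" field (doc3 here).
jq -r '.hits.hits[] | select(.fields.file) | .fields.file[]' response.json > extracted.txt

# CSV: one row per document (id, text). @csv only accepts an array, which is why
# the values are wrapped in [...]; it quotes strings and doubles embedded quotes.
jq -r '.hits.hits[] | select(.fields.file) | [._id, .fields.file[0]] | @csv' response.json > extracted.csv
```

Line breaks inside the extracted text are kept inside the quoted CSV field, so a row can span several physical lines. Note also that a search returns only the first 10 hits unless a larger `size` is requested.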

Le lundi 2 mars 2015 18:26:30 UTC+1, David Pilato a écrit :

Everything is correct. What do you expect?

--
David Pilato | Technical Advocate | Elasticsearch.com
http://Elasticsearch.com

@dadoonet https://twitter.com/dadoonet | @elasticsearchfr
https://twitter.com/elasticsearchfr | @scrutmydocs
https://twitter.com/scrutmydocs

Le 2 mars 2015 à 18:17, Marria <m_be...@esi.dz <javascript:>> a écrit :

It gives me the text extracted of my PDFs (instead of getting base64
content, I need it in readable text). I'll show you just a small part of
the query result:

{

"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 4,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "attachment",
"_id": "AUumL_QfgcE6sosOU9XD",
"_score": 1,
"fields": {
"file": [
"PROJET INNOVANT 2013 Cahier OSEO\n1\n\n�INFORMATIONS
SUR LE DOCUMENT\n\nEntreprise Nom du projet Date de rendu du document Titre
du document Version Nombre de pages Chef du projet
Contributeurs\nResponsable Projet Innovant Encadrant Technique Encadrant
Humanités Mots clés\n\nAKH PARNTERS SOKARIS 15 avril 2013 Projet Innovant
2013 ­ Cahier OSEO 1.0 30 Arthur Magnien Arthur Magnien, Clément Garnier,
Grégory D'Angelo, Minwei Chen, Joel Pestana, Fujia Hou Stéphane Frénot
Claire Goursaud Claude Guedat Consulting, Traitement d'images, Publicité,
Optimisation, Analyse, Etude comportementale, influence, impact,
regard\n\nHISTORIQUE DU DOCUMENT\n\nVersion 0.1 0.2 0.3 0.4
0.8\n1.0\n2.0\n\nDate 19 Mars 2013 20 Mars 2013 28 Mars 2013 29 Mars 2013 8
Avril 2013\n15 Avril 2013\n18 Avril 2013\n\nCommentaires\nCréation du
document Ajout de l'introduction et la partie Management Ajout de la partie
Juridique et Santé Ajout de la partie Marketing et Technique Ajout partie
AF, Financière, analyse des risques et Bibliographie - Version provisoire
du cahier Révision avec Tuteurs. Modification des parties et ajout du
glossaire Deuxième version provisoire du cahier Mise en page finale du
document\n\nStatus Draft Draft Draft Draft Draft\n1.0\n2.0\n\nTABLE DES
MATIÈRES\nINFORMATIONS SUR LE DOCUMENT
...............................................................................................................
2 HISTORIQUE DU
DOCUMENT............................................................................................................................
2 TABLE DES
MATIÈRES.........................................................................................................................................
2 ABRÉVIATIONS
..................................................................................................................................................
3 RÉFÉRENCES
......................................................................................................................................................
3 CONTEXTE DU PROJET
......................................................................................................................................
4 NOTRE
ÉQUIPE...................................................................................................................................................
5 ANALYSE FONCTIONNELLE
..............................................................................................................................
6 ÉTUDE DE MARCHÉ
...........................................................................................................................................
8 ANALYSE DU BESOIN
........................................................................................................................................
9 ÉTUDE DE LA CONCURRENCE
..........................................................................................................................
9 POSITIONNEMENT ET OFFRES PROPOSÉES - AKH PARTNERS
....................................................................... 11
CARACTÉRISTIQUES TECHNIQUES
.................................................................................................................
14 PLAN FINANCIER
............................................................................................................................................
18\n2\n\n�CADRE JURIDIQUE
..........................................................................................................................................
22 ANALYSE DES RISQUES
...................................................................................................................................
23 MANAGEMENT DE
PROJET.............................................................................................................................
26
ANNEXES.........................................................................................................................................................
28\n\nABRÉVIATIONS\n\nCFE : Centre de formalité des entreprises CNIL :
Commission nationale de l'informatique et des libertés GIMP : GNU Image
Manipulation Program Git : Gestionnaire de versions décentralisé IFA :
Impôt forfaitaire annuel IFOP: Institut français d'opinion publique INPI :
Institut national de la propriété intellectuelle IREP : Institut de
recherches et d'études publicitaires JEI : Jeune entreprise innovante
OpenCV : Open Computer Vision\n\nOpenGL : Open Graphics Library, librairies
PI : Projet innovant du département TC SAS : Société par actions
simplifiées SOKARIS : Solution logicielle développée par AKH Partners SVN :
Subversion, gestionnaire de version SWOT :
Strengths-Weaknesses-Opportunities-Threats, analyse des forces, faiblesse,
opportunités et menaces TC : Département Télécommunications, Services &
Usages à l'INSA de Lyon\n\nRÉFÉRENCES\n\n[1] Journal du net.
Martech : actualités et tendances sur JDN
/dossier/marketing-visuel/2.shtml [2] Recherche IREP.
http://www.irep.asso.fr/marche-publicitaire-chiffresannuels.php [3]
Etude IFOP sur les critères de visualisation d'une publicité [4] Coût du
plan premium dans le métro parisien.
Page introuvable - MédiaTransports
_cgv2012_mbs_14022012.pdf [5] Junwei Technologies website. [6] Cliris
website. http://www.clirisgroup.com [7] Anaxa vida website
http://anaxa-vida.fr/ [8] Shopperception website.
http://www.shopperception.com/ [9]Google Analytics
Analytics Tools & Solutions for Your Business - Google Analytics [10] Tobii website.
http://www.tobii.com/ [11] Etude 2011 microsfot advertsing [12] Etude
Context Matters 2007 [13] Caractéristiques en vision par ordinateur
http://fr.wikipedia.org/wiki/Extraction_de_caract�%A
9ristique_en_vision_par_ordinateur [14] Algorithme de Haar (2010) Suivi du
regard non intrusif en tête libre à partir d'image vidéo.
http://hal.archivesouvertes.fr/docs/00/80/82/99/PDF/RI080408.pdf\n\n[15]
Méthode de Viola et Jones
http://fr.wikipedia.org/wiki/Méthode_de_Viola_e t_Jones [16]
Détection in computer vision https://www.youtube.com/watch?v=NCtYdUEMotg
[17] Open CV Library http://opencv.org/ [18]"
]
}
},

{
"_index": "test",
"_type": "attachment",
"_id": "AUumKYuHgcE6sosOU9XC",
"_score": 1,
"fields": {
"file": [
"Dossier\n\n« Be everywhere. At once. »\n\n Référence
Ubiquity ­ Dossier Oseo.pdf\n Version 1.0 17 avril 2013\n\n Auteur Equipe
1\n Chef de projet Pierre Curis\n\n�« Be everywhere. At once. »\n\n
Informations sur le document\n\nNom du projet Date de rendu du document
Titre du document Type Version Responsable du projet Responsable PI
Encadrant technique Encadrant humanités Mots-clés\n\nUbiquity Jeudi 11
avril 2013 Ubiquity ­ Dossier Oséo Document PDF 1.0 Pierre Curis Stéphane
Frénot Isabelle Augé-Blum Patrice Heyde Livestreaming, web, cartographie,
reportage, informations\n\n Historique du document\n\nVersion v0.1 v0.2
v1.0\n\nDate 20/02/2012 12/04/2013 17/04/2013\n\nModification Création du
document "
]
}
},
{
"_index": "test",
"_type": "attachment",
"_id": "AUvamLbSi6bRbhYPLka0",
"_score": 1
}
]
}
}

And so on for the other files I indexed. I copied only small parts of the
text here, not all of it.


I can't copy the text from Elasticsearch to another file. Here is the jq
output:

jq: error: Cannot index number with string

jq: error: Cannot index boolean with string

null

null
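
For readers who hit the same errors: in the response posted above, the extracted text is not at the top level but nested under hits.hits[].fields.file, and the last hit has no "fields" key at all. Two "Cannot index" errors followed by two nulls are consistent with a filter being applied to the top-level values of a standard search response (a number, a boolean, then two objects) rather than to each hit. A minimal Python sketch of walking that structure, assuming a response shaped like the one posted; the "took" value and the shortened text are placeholders, and extract_texts is a hypothetical helper name:

```python
import json

# A trimmed response shaped like the one posted above. The two _id values
# come from the thread; "took" and the shortened text are placeholders.
response = json.loads("""
{
  "took": 3,
  "timed_out": false,
  "hits": {
    "total": 2,
    "hits": [
      {"_id": "AUumKYuHgcE6sosOU9XC",
       "fields": {"file": ["Dossier\\n\\nBe everywhere. At once."]}},
      {"_id": "AUvamLbSi6bRbhYPLka0"}
    ]
  }
}
""")


def extract_texts(resp):
    """Return (id, text) pairs for every hit that carries a "file" field."""
    pairs = []
    for hit in resp["hits"]["hits"]:
        # "fields" maps each field name to a list of values; it may be absent.
        values = hit.get("fields", {}).get("file")
        if values:  # hits without extracted text are skipped
            pairs.append((hit["_id"], " ".join(values)))
    return pairs


texts = extract_texts(response)
for doc_id, text in texts:
    print(doc_id, repr(text))
```

The same path (top-level "hits", then the inner "hits" list, then "fields" and "file" on each hit) is what any filter, in jq or elsewhere, would need to follow.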


It’s clearly a jq problem. (I don’t know what jq is TBH)
Sorry I can’t help.

Maybe others can?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs


It's OK :)

Thank you very much David.



Why not just use Tika standalone to extract the text, rather than the ES
plugin?

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk


Hi Charlie,

Actually, this is just a first step: getting the text extracted from the
PDFs into a separate txt file. I also need to be able to copy the results of
any query executed on Elasticsearch into external files. I used jq to select
the field I want to copy into the external file, but I didn't succeed.

I used pdftotext to extract the text, but my supervisor rejected this
solution. He wants the Elasticsearch results.

As for your solution, could you please give me a direct link to the command
lines? I found this:

http://www.tutorialspoint.com/tika/tika_extracting_pdf.htm

Is that what you meant?

Thanks a lot Charlie.



OK...although I don't really understand why your supervisor wants you to
do this, the mapper attachment plugin seems to store both the source and
the extracted text:
https://groups.google.com/forum/#!topic/elasticsearch/lbK1-uHCthc

Another possibility would be to run Tika separately, index the text and
store the original location of the file in an ES field, so you can refer
back to it at query time. This would mean you don't have to store the
whole PDF file in ES.
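
A rough sketch of that second approach, assuming a locally downloaded tika-app jar (the TIKA_JAR path, the helper names, and the "path" and "content" field names are all hypothetical choices, not something from this thread). It runs Tika as a separate process with its --text option and builds a document holding the extracted text plus the file's original location:

```python
import subprocess

# Assumption: path to a locally downloaded tika-app jar.
TIKA_JAR = "tika-app.jar"


def tika_command(pdf_path, jar=TIKA_JAR):
    """Build the command line that asks tika-app for plain text."""
    return ["java", "-jar", str(jar), "--text", str(pdf_path)]


def extract_text(pdf_path, jar=TIKA_JAR):
    """Run Tika as a separate process and return the extracted text."""
    result = subprocess.run(tika_command(pdf_path, jar),
                            capture_output=True, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def make_document(pdf_path, text):
    """Document to index: the text plus the file's original location."""
    return {"path": str(pdf_path), "content": text}
```

The dict returned by make_document can then be passed as the body of an index call on the Python client, so only the text and a pointer to the PDF are stored in Elasticsearch, not the PDF itself.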

You asked for a direct link to the command lines and mentioned the page
"TIKA - Extracting PDF". "Getting Started with Apache Tika" on the Apache
Tika site is probably the best place to start, as it's Tika's actual
documentation.

Cheers

Charlie


Hi Charlie,

Thank you very much for your help.

I used only the extracted text, as in the link you gave me.

I used the Python client to extract the text:

import csv
import random

import elasticsearch

# Replace with the IP address of your Elasticsearch node
es = elasticsearch.Elasticsearch(["127.0.0.1:9200"])

# Replace the following query with your own Elasticsearch query
res = es.search(index="fichier", body={"fields": ["file"]}, size=10)

random.seed(1)
sample = res['hits']['hits']
# Comment out the previous line and uncomment the next one for a random
# sample instead; change the int to the number of rows to sample.
# sample = random.sample(res['hits']['hits'], 5)

# In Elasticsearch 1.x, hits.total is a plain integer
print("Got %d Hits:" % res['hits']['total'])

# Set the name of the output file here. The output is TAB-delimited to
# handle cases where free-form text may contain a comma.
with open('mytest.tsv', 'w', newline='', encoding='utf-8') as csvfile:
    filewriter = csv.writer(csvfile, delimiter='\t',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)

    # Create the header row; change the column labels here
    filewriter.writerow(["id", "fields"])

    for hit in sample:
        # .get() handles unstructured data, where a field may not exist
        # for a given hit
        col1 = hit.get("_id", "")
        # "fields" maps each field name to a list of values, so join the
        # "file" values and flatten newlines to keep one row per hit
        values = hit.get("fields", {}).get("file", [])
        col2 = " ".join(values).replace('\n', ' ')
        filewriter.writerow([col1, col2])

And it works! I get all the text from the file.

Really, Charlie, thank you :)
