Problem with using uax_url_email

Hi everybody,

I want to extract URLs from my PDF files. I use the mapper-attachments
plugin to index the PDF files.

In order to be able to run regexp queries and extract all the URLs
present in a PDF file, I used the uax_url_email tokenizer:

curl -X PUT "localhost:9200/test" -d '{

"settings" : {

"index": {
  "analysis" :{
    "analyzer": {
      "default": {
        "type" : "custom",
        "tokenizer" : "uax_url_email",
        "filter" : ["standard", "lowercase", "stop"]
      }
    }
  }
}

}

}'
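
To check that the tokenizer really keeps URLs and e-mail addresses as single tokens, I first run the analyzer on a sample string (a minimal sketch; I'm assuming the _analyze API of this version accepts the text as the request body):

curl -X GET "localhost:9200/test/_analyze?analyzer=default" -d 'See http://example.com/some/page?x=1 or write to me@example.com'

With uax_url_email, the URL and the address should each come back as one token instead of being split on / and @.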

and the mapping:

curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{

"attachment" : {

"properties" : {
  "file" : {
    "type" : "attachment",
    "fields" : {
      "title" : { "store" : "yes" },
      "file" : { "term_vector":"with_positions_offsets", 

"store":"yes" }

    }
  }
}

}
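
For reference, this is the kind of regexp query I want to run later to pull out URL tokens (just a sketch: the pattern uses Lucene's regexp syntax, it is lowercase because of the lowercase filter, and depending on the plugin version the content field may have to be addressed as file.file instead of file):

curl -X POST "localhost:9200/test/attachment/_search" -d '{
  "query" : {
    "regexp" : {
      "file" : "https?://.*"
    }
  }
}'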

I indexed some PDF files. The problem is that for one file I get this (even
though the URLs in this file start with http://):

https://lh3.googleusercontent.com/-6uzhp-v0qFs/VOSfMU95byI/AAAAAAAAAUc/H4c6xvb54kg/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.17.19.png
For another file, I get this (it keeps the http:// this time):

https://lh3.googleusercontent.com/-1rYIYWJJEbU/VOSfweFpgbI/AAAAAAAAAUk/bWzfst_uZUE/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.19.43.png
But the real problem is that the URLs are not recognized completely; look at this:

https://lh3.googleusercontent.com/-vsKUj5I9MiA/VOSgtyS3yWI/AAAAAAAAAUw/64lgO4gYSdI/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.22.32.png

Is it caused by the two-column layout of the PDF file?

https://lh4.googleusercontent.com/-c7n5-oMygRM/VOShm4hwnWI/AAAAAAAAAU4/CQNjTTctMnY/s1600/Capture%2Bd’écran%2B2015-02-18%2Bà%2B15.26.46.png

So, what did I do wrong? How can I fix this and use regexp queries
successfully to extract all the URLs?

Thank you


Hi,

For people having the same problem as me, here is the answer I received from
Pablo on the PT group:

About your problem: I believe this is a constraint of Apache Tika [1],
which is used by the mapper-attachments plugin.
I believe that a search on Tika's PDF limitations, or a question on their
list, will help you more than we can.
Anyway, maybe you want to ask on the main Elasticsearch list [2], which is
bigger than ours and where the Elasticsearch engineers are.

I am sorry for not being able to help you that much.

Cheers,
Pablo

[1] http://tika.apache.org/
[2] elasti...@googlegroups.com
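
To check whether the URLs are already broken at extraction time (before Elasticsearch sees them), one can run Tika directly on the PDF and inspect the plain-text output (a sketch; the jar and file names below are placeholders for whatever tika-app release and document you have):

# Dump the plain text Tika extracts from the PDF and look for URL fragments
java -jar tika-app-1.7.jar --text mydocument.pdf | grep -i http

If the URLs already come out split across lines here, the issue is in the Tika extraction (for example the two-column layout), not in the uax_url_email tokenizer.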
