Can I parse text in a PDF document before sending it to Elasticsearch using FSCrawler?

Please help.

Yes.

How can I do it?

Can you show me some examples? For instance, if I want to extract a phone number from the PDF.

Now I understand the question.
So it's not related to FSCrawler but more a general question on how I can extract a phone number from a text, right?

I mean that FSCrawler is responsible for extracting the text from a PDF.
Once that's done, you can do whatever you want with the extracted text.

Here I'd probably try to use an ingest pipeline (which you can later reference in FSCrawler; see "Elasticsearch settings" in the FSCrawler 2.10-SNAPSHOT documentation) to apply some regex to your text.

Maybe you can try the Grok processor: see "Grok processor" in the Elasticsearch Guide [8.11].

If you have further questions, please provide an example of what you have tried so far, without using FSCrawler. As I said, doing that is not FSCrawler's responsibility. Something like this (but for another use case):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "parse multiple patterns",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"],
          "pattern_definitions": {
            "FAVORITE_DOG": "beagle",
            "FAVORITE_CAT": "burmese"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "I love burmese cats!"
      }
    }
  ]
}

Thank you for your reply. Actually, I want to parse the text of a resume. For example, I want to parse the mobile number and create a "mobile no." field in the Elasticsearch index using FSCrawler. Thank you for your time.

Is https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#ingest-node used to create a field in an index?

    "docs":[
      {
        "_source": {
          "message": "I love burmese cats!"
        }
      }

What is this used for?

That's a sample document to test an ingest pipeline.

How can I create a field in the index while using Elasticsearch?

I don't understand the question. Maybe with an example?

Now I am able to create the foo field, as you suggested, using this pipeline:
PUT _ingest/pipeline/demo1
{
  "description": "fscrawler demo",
  "processors": [
    {
      "set": {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}
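
One way to check that a stored pipeline really adds the field, before pointing FSCrawler at it, is to run the simulate API against the pipeline by its id. A minimal sketch, assuming the demo1 pipeline has already been created; the content value is a made-up sample document, not real FSCrawler output:

```json
POST _ingest/pipeline/demo1/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "Some text extracted from a PDF (hypothetical sample)"
      }
    }
  ]
}
```

The simulated document in the response should carry foo set to bar next to the original content. Once that looks right, FSCrawler can be told to use the pipeline through the ingest option described in its Elasticsearch settings documentation.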

Can you give me some examples of regex used in pipelines, so that I can create the mobile number field?

No I can't.

But you can start with https://www.elastic.co/guide/en/elasticsearch/reference/current/grok-processor.html and try to make it work.
If you don't succeed, then share what you have done so far, as I already explained in detail earlier in this thread ("Can I parse text in pdf document before sending it to elasticsearch using FSCrawler").

I am able to parse the name using the following pipeline:

PUT _ingest/pipeline/demo3
{
  "description": "fscrawler demo3",
  "processors": [
    {
      "grok": {
        "field": "path.virtual",
        "patterns": ["\/%{DATA:Name} %{GREEDYDATA:remaining}"]
      }
    }
  ]
}

In this pipeline I use grok, but I need a regex for parsing the mobile number.

KAMALIKA ROY BARMAN

EDUCATION
Indian School of Business – PGP in Management (Intended majors –Marketing & Strategy)
Apr ’18- Present
· Received a merit scholarship of INR 1 Lac (~Top 7% of class) for academic excellence & leadership drive-Young Leaders Programme
· Represented ISB (Top 20/250+ applicants) at NUS, Singapore & Tsinghua University, China in the 1st Asia Innovation Programme
· Developed a business model of an IoT based app for fitness centers in Asia by collaborating with NUS & Tsinghua peers
· National Finalist (Top 30/1800+teams), Mahindra War Room: Devised a go-to-market strategy to resurrect the retro bike- JAWA
· Events Coordinator, Alumni Affairs Council (Selected 2/60 applicants):
· Driving all editions of ‘Shadow an Alum’ initiative across both campuses connecting over 700+ students to alums (YoY growth-20%)
Veermata Jijabai Technological Institute(VJTI), Mumbai| 99.9 percentile in Maharashtra Engineering Entrance Test (~3.2 Lac candidates)
B.Tech in Electronics & Telecommunication ( Top 10 % in class| CGPA 8.06/10)
Jul ‘12-Apr ‘16
· 1st runner-up (2/90+ top international teams), BAJA SAE South Africa 2015: Crafted a business plan to sell ATVs in BRICS nations
· National winner (1/100+ top engineering colleges) of the Maruti Suzuki sponsored go-to-market strategy case-SUPRA SAE India ‘14 - Identified new target segments & proposed a business model to enter the Indian motorsport sector; used primary & secondary research
· Robotics Club: Pioneered 3D printing workshops for 450+ students by building the first ever student-made 3D printer on campus

Internship: Schlumberger Asia Services Ltd.| Wireline Segment
Mumbai|May’15-Jul’15
· Improved divisional efficiency by 30% by designing & developing an analytical tool (Excel-based VBA) to track warehouse inventory Higher Secondary Certificate Exam: Top 1% in state; Received scholarship for higher education (INR 3.75 Lac) from Maharashtra Govt

WORK EXPERIENCE
DELOITTE CONSULTING USI| Business Technology Analyst| System Integration, Technology Consulting
Mumbai|Aug ‘16- Mar’18
Designed, implemented & assessed B2C digital transformation strategies of businesses in US Financial Services & Insurance Sector by optimizing all stages of a product lifecycle – Requirement Analysis, Product Development & Quality Assurance
Key Achievements
· Top 1 % of the 1000+ countrywide analysts as per the latest performance appraisal snapshot released (March 2018)
· Awarded (1/300+) ‘Employee of the year’ title for exemplary contribution to client deliverables & internal firm initiatives
· Youngest team member (1.2years Vs average of 4 years-experience) to win ‘Applause Award’ for accelerating client and business growth
· Won ‘Spot Award’ within 4 months of joining the firm (average ~1 year) for timely detection & analysis of critical project bottlenecks
· 1st project team member (team size ~ 30) to be awarded ‘Performer of the month’ for analytical excellence & digital innovation
Key Engagements (Client-US Financial Services Major)
Digital Strategy Implementation:
· Enabled the 1st on-time successful B2C digital platform release (past success rate- 0/3) of client’s 2 major LOBs worth ~$150M
· Averted potential revenue loss by identifying & eliminating defects causing system failures by designing 75+ business process tests
· Reduced time-to-market by 30% by redesigning the quality assurance modules & executing the user acceptance tests efficiently
· Reduced man-hours by 60% and the scope of human error by creating an automation framework for end-to-end process testing
Process Re-engineering & Account Management:
· Handpicked by senior leaders to expand the scope of work & enhance client relation by creating new risk management frameworks
· Generated new business opportunity for Deloitte ($400K) by applying the P&C insurance rate testing model to client’s other LOBs
· Uncovered potential business operation losses (~$35M) for client by building a model to detect incorrect insurance premium rules
· Accelerated defect detection (~70%) & minimized processing time by redesigning the revenue test model of business worth ~$220M
· Enabled client to build in-house quality assurance capability ($500K/year saving) by training them on automation design & testing
Global Multi-Stakeholder Management & Analytical Excellence (India & USA):
· Modeled digital platform tests of business worth $380M by liaising with multi-cultural teams across 3 external stakeholder entities
· Streamlined & led daily meetings with client & their vendors across 3 geographies improving defect-fix turnaround time by 25%
· Improved project bottleneck time by 30% by identifying gaps in implementation of 300+ business requirements
· Mentored 25+ cross-functional professionals across Deloitte & client’s external vendor team on US govt.’s insurance rating methods
Revenue Management & Project Planning:
· Youngest analyst (1.5 vs Avg. of 4 yrs. exp.) to prepare ‘project hours & resource utilization’ estimate report for senior management - Formulated accurate estimates of client billable hours for 3 sub-engagements (~$500K) factoring external vendor’s deliverables
· Co-authored a White Paper on application of Robotic Process Automation (RPA) testing models in financial services sector
· Led a team of 250+ professionals in the 1st year of joining Deloitte in inter-dept. events; increased employee engagement by 30%

EXTRA-CURRICULARS & PERSONAL INTERESTS
Leadership (Marketing & Outreach)
· Campaign Manager, Teach For India (TFI): Pioneered VJTI & TFI’s collaboration; 1st TFI campaign manager on VJTI college campus
· Increased TFI fellowship applications by 80% by identifying & investing the right students/organizational heads in Teach For India
· Sponsorship Head, SAE VJTI: 1st female member to be elected to the Automotive Club Senate (elected by 350+ members)
· Led a team of 12 to enable auto club’s participation in its first international venture- BAJA SAE South Africa 2015
· Achieved the entire projected budget goal of INR 15L by partnering with 3 PSU ‘Maharatnas’ (ONGC, BHEL and GAIL)
· Trekking -Recognized by the Indian Mountaineering Association & Dept. of Tourism, India for completing 7 Himalayan treks
· Sky-dived from 13,000 ft in Dubai, hiked an active volcano in Bali, scuba-diving & marathon enthusiast (~10km distance runner) Community Service: Taught & mentored 15 Govt. school kids for 1.5 years in collaboration with Deloitte Mumbai office & Teach For India

This is my resume, and I have many resumes like it. I need to extract the mobile number using a regex in the pipeline instead of grok. It would be helpful if you could show me a regex pattern used in a pipeline.

Grok is based on regex. Maybe this could help you: "Custom pattern - Telephone number and others"

Thanks, it works. But now the problem is that the phone number is in a different position for each candidate.

For example:
1st candidate: My phone no. is 8976635405
2nd candidate: 9874517542 is phone no.

How can I use grok to search for a particular pattern anywhere in the text?

No idea.
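
For later readers: grok patterns are not anchored unless the pattern itself uses ^ or $, so a custom pattern can match a number wherever it sits in the text. A minimal sketch, assuming mobile numbers are plain 10-digit runs as in the two samples above. The MOBILE pattern name and the mobile target field are made up for illustration, and with FSCrawler the grok field would normally be content (where the extracted text usually lands) rather than message:

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "sketch (assumption: 10-digit numbers only): find a mobile number anywhere in the text",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{MOBILE:mobile}"],
          "pattern_definitions": {
            "MOBILE": "\\b\\d{10}\\b"
          },
          "ignore_failure": true
        }
      }
    ]
  },
  "docs": [
    { "_source": { "message": "My phone no. is 8976635405" } },
    { "_source": { "message": "9874517542 is phone no." } }
  ]
}
```

Here ignore_failure is meant to keep resumes without a matching number from being rejected at index time. Numbers written with spaces, dashes or a country prefix would need a looser pattern than this one.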
