extracting text

mikey2k · January 27, 2021

Hello guys. I am in need of some help. I have a PDF file that contains 100 tests (they are divided into 3 parts: A, B and C). How can I create another PDF that contains only the A part from each test? I found some tutorials with Python, but the thing is I don't really know how to use it.
Thank you in advance :) I'm sorry for the grammatical mistakes :) #stillLearning

iNow · January 27, 2021

Can you copy/paste the A parts from each PDF into a single Word document, then Save As and change file type to PDF from Word?

Sensei · January 27, 2021

A PDF can contain both text and images. Start by making sure your text is really text and not an image: some people scan paper documents, and the scanner output (images) is put as-is inside the PDF. Handling images requires OCR, which is a completely different procedure.
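A minimal sketch of that first check, assuming the third-party pypdf package (`pip install pypdf`) is used for reading. The helper names and the `min_chars` threshold are made up for illustration, and the threshold is only a guess at what separates a real page of text from a bare scan:

```python
def page_has_text(page_text, min_chars=20):
    """Return True if the extracted text looks like real text, not a bare scan."""
    # Image-only pages usually give an empty (or nearly empty) string.
    return len((page_text or "").strip()) >= min_chars


def scanned_pages(page_texts, min_chars=20):
    """Return the 0-based indices of pages that probably need OCR."""
    return [i for i, text in enumerate(page_texts)
            if not page_has_text(text, min_chars)]


def read_page_texts(pdf_path):
    """Extract the text of every page of a PDF (needs pypdf installed)."""
    # Imported here so the two helpers above also work without pypdf.
    from pypdf import PdfReader
    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


# Usage (the file name is a placeholder):
# texts = read_page_texts("tests.pdf")
# print("Pages that probably need OCR:", scanned_pages(texts))
```

If the list comes back empty, the PDF most likely holds real text and a plain extraction script has a chance of working. If most pages are listed, OCR is needed first.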

Also, the text can be laid out in several columns.

An attempt to OCR it can then mix a couple of words from each column into every row!

Find an example here and copy and paste it for a start:

https://www.google.com/search?q=python+extract+text+from+pdf
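For the original question (keep only part A of every test), a script could look roughly like the sketch below. It rests on assumptions that are not stated in the thread: the PDF holds real text, every part starts on a new page, and every part begins with a heading such as "Part A". It uses the third-party pypdf package, and the function names and file names are placeholders:

```python
import re

# Assumption: each part begins with a heading like "Part A", "Part B" or "Part C".
PART_HEADING = re.compile(r"\bPart\s+([ABC])\b", re.IGNORECASE)


def pages_for_part(page_texts, wanted="A"):
    """Return the 0-based indices of the pages that belong to the wanted part."""
    wanted = wanted.upper()
    selected = []
    current = None  # part announced by the most recent heading seen so far
    for index, text in enumerate(page_texts):
        found = [h.upper() for h in PART_HEADING.findall(text or "")]
        # Keep a page that announces the wanted part, or a page without any
        # heading that simply continues the wanted part.
        if wanted in found or (current == wanted and not found):
            selected.append(index)
        if found:
            current = found[-1]
    return selected


def write_part(src_path, dst_path, wanted="A"):
    """Copy the pages of one part into a new PDF (needs pypdf installed)."""
    from pypdf import PdfReader, PdfWriter
    reader = PdfReader(src_path)
    texts = [page.extract_text() for page in reader.pages]
    writer = PdfWriter()
    for index in pages_for_part(texts, wanted):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)


# Usage (file names are placeholders):
# write_part("all_tests.pdf", "only_part_a.pdf", wanted="A")
```

This works page by page, so if part A ends and part B begins on the same page, that page is skipped and the pattern or the selection rule has to be adjusted to the real layout.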

On 1/27/2021 at 3:00 AM, iNow said:

Can you copy/paste the A parts from each PDF into a single Word document, then Save As and change file type to PDF from Word?

This is what an ordinary layman would do. Programmers write scripts that automatically extract the needed data. Manually extracting data from thousands of files would take months or years of work. In some less computerised countries and companies, people still work that way in offices. That's bizarre, and it wastes human resources and makes the company, office or government ineffective and unproductive, unable to compete with the real world, where such jobs are done by programmers.

Programmers who want to extract data from documents face different problems than the sheer amount of information: damaged character encodings (which doesn't bother programmers in the UK, US, Australia and Canada much, but does bother the rest of the world), text in scanned images, text in columns, letters incorrectly recognised by OCR, etc.

Edited January 27, 2021 by Sensei

A_curious_Homosapien · August 31, 2021

Well, the script for this is pretty simple. Try reaching out to someone who has some experience with this (it can be in any language: Python, C, C++, C# or any other). He/she would do the job in no time.

Edited August 31, 2021 by A_curious_Homosapien

iNow · August 31, 2021

I’m sure our OP who made this post over 7 months ago and who hasn’t posted a single time since is grateful for your reply.
