Posted

Hello guys,

I am looking for a console program or a Software Development Kit to do the following:

I want to write a program (Java or C++) that fetches document scans from my scanner, optimizes and crops them, and runs OCR (optical character recognition) on them. The first thing I want is a searchable PDF file and a text file I can send to my database. The second challenge is to analyze the file and sort it into categories like invoice, receipt, letter... I could do that by barcode recognition or by interpreting the page size (or anything similar). After that I want to run specific tasks, for example: if this is document type C, look at the bottom left corner, find the [] [] [] and use handwriting recognition to output the 3 letters that were written in the []'s.

 

I already solved the first task with "unpaper", "ImageMagick", "hocr2pdf" and Tesseract for the OCR, but the results were... let's say sophisticated...

I tried the same operation with tools like Adobe Acrobat X Pro and the results were so much better! So I realized that Tesseract would not do the trick, and I searched for alternatives. I can't use Adobe from the command line, but there are tools like ABBYY for Linux which seem appropriate for the problem. Unfortunately the Linux version of ABBYY was not made for the second task, and when I have to pay something like $140 for a piece of software, I want it to do everything I need and not only a part of it.

So I searched and searched and I found some interesting SDKs like the OmniPage Capture SDK and ABBYY's SDK... They seem to fit my needs, but there is one problem. I don't have $5,000 for THAT! :D I am not planning to sell this software or to scan thousands of papers. I just want to have a little bit of fun with my accounting and simplify some regular tasks...

 

Any ideas for a program or an SDK that could help me here?

  • 1 year later...
Posted

Er... this seems a little complicated...

Posted

As far as OCR software goes, multiple tests show that Tesseract and ABBYY are the best you can get to work with. You can use a tool like gscan2pdf for a more consistent test than trying to do it yourself. You should understand that this is an area where many people with doctorates on the topic work in highly paid think tanks, building proprietary tools to do this well. Unfortunately, OCR is not really a weekend project that one can release as open source. Or rather it is, but the results will be vastly different from those of large companies that have dozens of people working on the problem. The same applies to voice recognition, pattern recognition, cutting-edge graphics, etc. The point is that, unfortunately, you will get better results with proprietary technology here than you can in the open-source world.

 

To your original set of questions: without going fully proprietary, and even with going fully proprietary, you just won't get 100% results all the time, so don't expect it. But most of the time you should get enough information to categorize your documents, extract information from them, and cover most searches. As with all recognition technology, you may want to build some process around the cases where the program is not very sure: it just holds the document in a temporary location until you figure out where it belongs. There is a framework that fits what you are describing, though: https://wiki.gnome.org/action/show/Apps/OCRFeeder?action=show&redirect=OCRFeeder
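
To make the "hold it until you are sure" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `OcrResult` type, the keyword lists, the 0.80 threshold and the `"review"` bucket are all made up, and the confidence value is assumed to come from whatever OCR engine you end up using.

```python
from dataclasses import dataclass

# Hypothetical keyword rules; a real setup would tune these per document type.
KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "receipt": ("receipt", "total paid"),
    "letter": ("dear", "sincerely"),
}


@dataclass
class OcrResult:
    path: str          # where the scan lives
    text: str          # recognized text
    confidence: float  # 0.0 to 1.0, assumed to be reported by the OCR engine


def categorize(text):
    """Return a category name, or None when zero or several categories match."""
    lowered = text.lower()
    hits = [cat for cat, words in KEYWORDS.items()
            if any(word in lowered for word in words)]
    return hits[0] if len(hits) == 1 else None


def route(result, threshold=0.80):
    """Send unsure or ambiguous documents to a manual 'review' bucket."""
    category = categorize(result.text)
    if result.confidence < threshold or category is None:
        return "review"
    return category
```

Anything that lands in `"review"` stays in the temporary location until you sort it by hand; the rest can go straight to the database.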

  • 4 weeks later...
Posted (edited)

OK, this is not a cheap endeavor. Few have succeeded, however.

I would take the "Scrabble" approach. Some letters of the alphabet are used very commonly, like T, A, L, O, C and R. abcd... Now j is a hanger. Never mind l and q. t is important.

So you have seen the alphabet? The rules of Scrabble state that a word must be in the dictionary. Going for speed, I'd produce my own little dictionary. There are suggestions available.
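
A rough sketch of that dictionary check in Python, using the standard library's `difflib.get_close_matches`. The word list and the 0.8 cutoff are assumptions picked for the example, not something tuned on real scans.

```python
import difflib

# A tiny hand-made dictionary (assumption); in practice you would build it
# from the words that actually show up on your own documents.
DICTIONARY = ["invoice", "receipt", "letter", "total", "amount", "date"]


def correct(word, cutoff=0.8):
    """Return the closest dictionary word, or None if nothing is close enough."""
    matches = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this list, `correct("lnvoice")` (an "l" misread for an "i") comes back as `"invoice"`, while a word with no near neighbour comes back as `None` and can be flagged for a second look.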

Once you have defined the font face value, you must qualify this. This is standard OCR-type software, right? On the standard keyboard there are 105 keys. On a typewriter there are 54 letters, 10 numerals and maybe 15 or 16 special keys. So how many glyphs is that: 81, or even 128? 256? 512?

Today's processors have multiple cores, and this is clearly an opportunity to use every single one of them. Find the center of a glyph. Compare it to all known glyphs.
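
A minimal sketch of the "compare it to all known glyphs" step in Python. It assumes each glyph has already been cut out, centered and scaled to a 5x5 black-and-white grid, and the three templates are made up for the example. The comparison used here is a plain count of differing pixels, which is only one of many possible measures. Each comparison is independent of the others, so the loop could be spread across cores; that part is not shown.

```python
# Hypothetical 5x5 templates: '#' is ink, '.' is paper.
TEMPLATES = {
    "T": ("#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."),
    "L": ("#....",
          "#....",
          "#....",
          "#....",
          "#####"),
    "O": (".###.",
          "#...#",
          "#...#",
          "#...#",
          ".###."),
}


def pixel_distance(a, b):
    """Count the pixels that differ between two equally sized grids."""
    return sum(ca != cb for row_a, row_b in zip(a, b)
               for ca, cb in zip(row_a, row_b))


def best_match(glyph):
    """Return (name, distance) of the template closest to the glyph."""
    scores = {name: pixel_distance(glyph, tpl) for name, tpl in TEMPLATES.items()}
    name = min(scores, key=scores.get)
    return name, scores[name]
```

A slightly smudged T, for example one stray pixel next to the stem, still ends up closest to the `"T"` template.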

The period (.) is a problem because it could de-glyph a glyph or it could just be a spot. So we should look for capital letters following that spot. Why waste your time?

Now we want particulars. If there are two sets of glyphs, how do you tell one from the other? Line thickness is one way. Straightness is another. T and T and T are all different, but these are all T's and will be used as T's unless you are OCRing Benjamin Franklin. Font faces will follow.

 

KOffice in KDE 3.5 used a unique style of placing information on a page. It allowed free-hand placement of text areas, which could land anywhere on the page.

 

The same occurs with a typewriter for many typewritten documents. Slide paper in, scroll, move type head ->, type something, carriage return, move type head ->, type something. And so on.

 

Finally, if you do not know what something is, why not ask? It is only a real headache if someone has already saved the document or it is non-editable.

 

there is tis and there is ?

Edited by vampares
  • 1 month later...
