
Optical Character Recognition (OCR): converting phone number and email address images to text

Sometimes we want to crawl data from a website. First, crawling data is illegal (in most cases?). But I tried it anyway, for technical study, practice first.
It turned out that there are many good tools for this, so I started using them for my current project.
The stack mostly consists of NodeJS with Cheerio for reading HTML (like jQuery in the browser).
Tesseract OCR reads text from images. It can be installed locally or used through a Node package.
And obviously a database; I am using MySQL.


http://www.leptonica.org/

https://realpython.com/setting-up-a-simple-ocr-server/

https://nanonets.com/blog/ocr-with-tesseract/


Test (training?) data:

https://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-3.02.eng.tar.gz/


OCR + OpenCV (a lot of examples to use; English only?)

https://nanonets.com/blog/ocr-with-tesseract/


https://stackoverflow.com/questions/14800730/tesseract-running-error


Some Android apps I have tried work really well with Vietnamese (I tested with a printed Shopee invoice).
I will try other apps and different kinds of input later.
Some apps with 100M downloads seem very good but require payment.

https://www.scraperapi.com/blog/5-tips-for-web-scraping/


https://stackoverflow.com/questions/14347581/mysql-second-or-third-index-of-in-string


https://stackoverflow.com/questions/18581483/how-to-do-repeated-requests-until-one-succeeds-without-blocking-in-node

In PHP I can simply set a timeout for a while and then call the API again, doing this three times before ignoring the error or handling it as an exception. I have worked with a PHP Shopify library like this.
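
In Node the same "wait, then try again, up to three times" idea can be written without blocking by awaiting a timer between attempts. This is only a minimal sketch: `sleep`, `retry`, `attempts` and `delayMs` are names made up for illustration, not part of any library.

```javascript
// Resolve after `ms` milliseconds without blocking the event loop.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `task` up to `attempts` times, waiting `delayMs` after each failed try.
// `task` is any function returning a promise (hypothetical helper, not a library API).
async function retry(task, attempts = 3, delayMs = 1000) {
  let lastError;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await task(i); // success: stop retrying
    } catch (err) {
      lastError = err;
      if (i < attempts) await sleep(delayMs); // pause before the next try
    }
  }
  throw lastError; // every attempt failed; the caller decides to ignore or handle it
}
```

Assuming axios is used for the requests, a call could look like `retry(() => axios.get(url), 3, 2000)`, with a `catch` around it for the "ignore / exception handling" case.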


https://github.com/ericchiang/pup

https://scotch.io/tutorials/asynchronous-javascript-using-async-await

https://blog.bitsrc.io/the-power-of-axios-cf45e085d924


https://github.com/watson/cheerio-eq


https://www.tabnine.com/code/javascript/functions/cheerio/Cheerio/first

It seems Tesseract is not good at recognizing the email (@) sign.

https://lazyadmin.nl/office-365/ocr-email-attachments-and-store-them-on-sharepoint/


https://miai.vn/2019/08/28/ocr-dao-tao-tesseract-ocr-de-nhan-dang-tieng-viet-voi-cac-font-chu-khu-khoam/
https://miai.vn/2020/08/03/phan-loai-bien-bao-giao-thong-bang-deep-learning-cnn/


https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/


https://groups.google.com/g/tesseract-ocr/c/XTxlAqFbOr8?pli=1

https://github.com/tesseract-ocr/langdata/issues/63


https://research.aimultiple.com/ocr-accuracy/


https://stackoverflow.com/questions/9237250/imagemagick-for-captcha

http://www.fmwconcepts.com/imagemagick/captcha/index.php


I ended up using simple email regex (correction) rules. There is no need for ML techniques that are expensive and require much study.

? => 7

egmail => @gmail

uahoo => yahoo

euahuoo => @yahoo

0gmail => @gmail

D => p 
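
The rules above can be sketched as an ordered list of replacements. This is a rough illustration, not the exact code I ran: the names `EMAIL_RULES`, `fixEmail` and `fixPhone` are made up, and splitting the rules into "email" and "phone" groups (with `?` => `7` on the phone side and `D` => `p` on the email side) is an assumption.

```javascript
// Ordered [pattern, replacement] pairs for OCR output of email images.
// Assumption: these rules apply to emails only; "?" => "7" is treated as a phone rule.
const EMAIL_RULES = [
  [/euahuoo/g, "@yahoo"],
  [/uahoo/g, "yahoo"],
  [/egmail/g, "@gmail"],
  [/0gmail/g, "@gmail"],
  [/D/g, "p"],
];

// Apply every email rule in order to the raw OCR text.
function fixEmail(text) {
  return EMAIL_RULES.reduce((out, [pattern, replacement]) => out.replace(pattern, replacement), text);
}

// Phone numbers: a "?" is assumed to be a misread "7".
function fixPhone(text) {
  return text.replace(/\?/g, "7");
}

console.log(fixEmail("john0gmail.com")); // john@gmail.com
console.log(fixPhone("090?123456"));     // 0907123456
```

Rules like these are tied to one font and one source site, so they would need to be re-checked for any other kind of image.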


https://tesseract.projectnaptha.com/

https://www.twilio.com/blog/2016/11/a-simple-way-to-ocr-images-from-a-url-with-tesseract-js.html

https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html

https://www.npmjs.com/package/node-tesseract-ocr

Update: After using node-tesseract-ocr, I got very good results, nearly 100% correct, so I used it and no longer needed the custom error-correction logic above.

...
Use a random User-Agent to avoid being banned by the source website. I think the admin of the source site could easily ban my IP, but they do not. If they banned my IP, I could use a cloud VPS, but so far they have not.
Note that crawling data is illegal (in most cases?), so think about the legal side a bit.
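
A minimal sketch of the random User-Agent idea, without any package: pick one string from a small list for each request. The list and the names `USER_AGENTS`, `randomUserAgent` and `buildHeaders` are illustrative assumptions; a package such as random-useragent could supply the strings instead.

```javascript
// Illustrative User-Agent strings; a real crawler would keep a larger, current list.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
];

// Pick one entry at random; `random` is injectable so the choice can be tested.
function randomUserAgent(random = Math.random) {
  const index = Math.floor(random() * USER_AGENTS.length);
  return USER_AGENTS[index];
}

// Build request headers with a fresh User-Agent plus any extra headers.
function buildHeaders(extra = {}) {
  return { "User-Agent": randomUserAgent(), ...extra };
}
```

With axios this would be passed per request, for example `axios.get(url, { headers: buildHeaders() })`.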

https://github.com/axios/axios/issues/2560

https://www.npmjs.com/package/random-useragent


https://levelup.gitconnected.com/bulk-operation-into-mysql-with-nodejs-478c8fc30917


Update:
After double-checking, I found that node-tesseract-ocr seems to be only a wrapper for the installed (standalone) Tesseract. So you have to install Tesseract on your machine/server first for it to work. I will verify this further, because why would running through the Node wrapper give better results (in the email recognition case)?
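
To make the "wrapper" idea concrete, here is a rough sketch of what such a wrapper might do: build command-line arguments and run the installed `tesseract` binary. It is not the actual source of node-tesseract-ocr; `buildTesseractArgs` and `recognize` are hypothetical names, and the defaults (`eng`, page segmentation mode 7 for a single line of text) are assumptions.

```javascript
const { execFile } = require("node:child_process");

// Build arguments for the `tesseract` CLI: image, output to stdout, language, PSM.
// Whether a character whitelist is honored depends on the Tesseract version and engine.
function buildTesseractArgs(imagePath, { lang = "eng", psm = 7, whitelist } = {}) {
  const args = [imagePath, "stdout", "-l", lang, "--psm", String(psm)];
  if (whitelist) args.push("-c", `tessedit_char_whitelist=${whitelist}`);
  return args;
}

// Run the locally installed binary and resolve with the recognized text.
// Rejects if `tesseract` is not installed or exits with an error.
function recognize(imagePath, options) {
  return new Promise((resolve, reject) => {
    execFile("tesseract", buildTesseractArgs(imagePath, options), (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout.trim());
    });
  });
}
```

Comparing arguments like these with the ones I typed on the command line might be one way to investigate the difference in results; that is a guess, not a finding.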

https://superuser.com/questions/758876/tesseract-3-03-english-language-data


https://kbravh.dev/writing/cracking-a-captcha-with-tesseract-js/

https://stackoverflow.com/questions/59126144/how-to-improve-tesseract-js-accuracy

https://stackoverflow.com/questions/65687240/how-to-increase-ocr-accuracy-in-node-js-and-tesseract-js
