
Optical Character Recognition (OCR): converting images of phone numbers and email addresses to text

Sometimes we want to crawl data from a website. First, note that crawling data is illegal (in most cases?). But I tried it anyway as a technical study, learning by doing first.
It turned out that there are many good tools for this, so I started using them for my current project.
The stack mostly consists of NodeJS with Cheerio for reading HTML (like jQuery in the browser),
Tesseract OCR for reading text from images (installed locally or used through a node package),
and, obviously, a database; I am using MySQL.


http://www.leptonica.org/

https://realpython.com/setting-up-a-simple-ocr-server/

https://nanonets.com/blog/ocr-with-tesseract/


Test (training?) data.

https://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-3.02.eng.tar.gz/


OCR + OpenCV (many usage examples; English only?)

https://nanonets.com/blog/ocr-with-tesseract/


https://stackoverflow.com/questions/14800730/tesseract-running-error


Some Android apps I have tried work really well with Vietnamese (tested with a printed Shopee invoice).
I will try other apps and different kinds of input later.
Some apps with 100M downloads seem very good but require payment.

https://www.scraperapi.com/blog/5-tips-for-web-scraping/


https://stackoverflow.com/questions/14347581/mysql-second-or-third-index-of-in-string


https://stackoverflow.com/questions/18581483/how-to-do-repeated-requests-until-one-succeeds-without-blocking-in-node

In PHP I can simply wait for a while and then call the API again, doing this up to three times before giving up and handling the failure as an exception. I have worked with a PHP Shopify library this way.
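The same "wait, then call again, up to three times" idea can be sketched in Node with async/await. This is only a minimal sketch: the names `sleep` and `retry`, and the default values for `maxAttempts` and `delayMs`, are my own assumptions and do not come from any library.

```javascript
// Resolve after `ms` milliseconds without blocking the event loop.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `task` until it succeeds or `maxAttempts` is reached.
// `task` is any function that returns a promise (for example an HTTP request).
// `maxAttempts` and `delayMs` are illustrative defaults (assumptions).
async function retry(task, maxAttempts = 3, delayMs = 1000) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts) await sleep(delayMs);
    }
  }
  // Every attempt failed: rethrow the last error so the caller can handle it.
  throw lastError;
}
```

With axios, a call might look like `retry(() => axios.get(url))`, wrapped in a try/catch for the case where every attempt fails.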


https://github.com/ericchiang/pup

https://scotch.io/tutorials/asynchronous-javascript-using-async-await

https://blog.bitsrc.io/the-power-of-axios-cf45e085d924


https://github.com/watson/cheerio-eq


https://www.tabnine.com/code/javascript/functions/cheerio/Cheerio/first

It seems Tesseract is not good at recognizing the at sign (@) in email addresses.

https://lazyadmin.nl/office-365/ocr-email-attachments-and-store-them-on-sharepoint/


https://miai.vn/2019/08/28/ocr-dao-tao-tesseract-ocr-de-nhan-dang-tieng-viet-voi-cac-font-chu-khu-khoam/
https://miai.vn/2020/08/03/phan-loai-bien-bao-giao-thong-bang-deep-learning-cnn/


https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/


https://groups.google.com/g/tesseract-ocr/c/XTxlAqFbOr8?pli=1

https://github.com/tesseract-ocr/langdata/issues/63


https://research.aimultiple.com/ocr-accuracy/


https://stackoverflow.com/questions/9237250/imagemagick-for-captcha

http://www.fmwconcepts.com/imagemagick/captcha/index.php


I ended up using simple regex correction rules for emails. There is no need for an ML technique that is expensive and requires much study.

? => 7

egmail => @gmail

uahoo => yahoo

euahuoo => @yahoo

0gmail => @gmail

D => p 
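The rules above could be applied as an ordered list of replacements. This is a minimal sketch, not my exact code: the names `CORRECTION_RULES` and `correctOcrText` are hypothetical, and applying every rule globally (including `?` and `D`) is an assumption that may be too aggressive for some inputs.

```javascript
// Ordered OCR correction rules, taken from the list above.
// Each entry is [pattern, replacement]; all are applied globally (an assumption).
const CORRECTION_RULES = [
  [/\?/g, "7"],
  [/euahuoo/g, "@yahoo"],
  [/egmail/g, "@gmail"],
  [/0gmail/g, "@gmail"],
  [/uahoo/g, "yahoo"],
  [/D/g, "p"],
];

// Apply every rule in order to the raw OCR output.
function correctOcrText(text) {
  return CORRECTION_RULES.reduce(
    (out, [pattern, replacement]) => out.replace(pattern, replacement),
    text
  );
}

console.log(correctOcrText("johnegmail.com")); // john@gmail.com
console.log(correctOcrText("bob0gmail.com"));  // bob@gmail.com
```

Keeping the rules in one table makes it easy to add a new misread pattern when it shows up in the OCR output.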


https://tesseract.projectnaptha.com/

https://www.twilio.com/blog/2016/11/a-simple-way-to-ocr-images-from-a-url-with-tesseract-js.html

https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html

https://www.npmjs.com/package/node-tesseract-ocr

Update: After switching to node-tesseract-ocr, I got very good results, nearly 100% correct, so I use it and no longer need the custom error-correction logic above.

...
I use a random User-Agent to avoid being banned by the source website. I think the admin of the source site could easily ban my IP, but they have not. If they did ban my IP, I could use a cloud VPS.
Note that crawling data is illegal (in most cases?), so think about the legal side a bit.
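Picking a random User-Agent per request can be sketched without any extra package. The pool below and the names `USER_AGENTS`, `randomUserAgent` and `buildRequestHeaders` are hypothetical, for illustration only; the random-useragent package linked below could supply a larger pool.

```javascript
// A small, hypothetical pool of User-Agent strings; extend or replace as needed.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
];

// Pick one entry from the pool uniformly at random.
function randomUserAgent(pool = USER_AGENTS) {
  return pool[Math.floor(Math.random() * pool.length)];
}

// Build the headers object for a single request.
function buildRequestHeaders() {
  return { "User-Agent": randomUserAgent() };
}
```

With axios, the headers can be passed per request, for example `axios.get(url, { headers: buildRequestHeaders() })`.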

https://github.com/axios/axios/issues/2560

https://www.npmjs.com/package/random-useragent


https://levelup.gitconnected.com/bulk-operation-into-mysql-with-nodejs-478c8fc30917


Update:
After double-checking, I found that node-tesseract-ocr seems to be only a wrapper for an installed (standalone) Tesseract. So you have to install Tesseract on your machine/server first for it to work. I will verify this further, because it is unclear why running through the Node wrapper gives better results (in the case of email recognition).
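One way to investigate the difference is to call the standalone binary directly from Node with explicit options and compare the output with what the wrapper returns. This is a minimal sketch, assuming Tesseract 4 or later is installed and on the PATH. The names `buildTesseractArgs` and `runTesseract` and the default option values are hypothetical; I have not confirmed which options the wrapper actually passes.

```javascript
const { execFile } = require("child_process");

// Build the argument list for the standalone `tesseract` CLI.
// Using `stdout` as the output base makes Tesseract print the text
// instead of writing it to a file.
// Defaults are assumptions: LSTM engine (--oem 1), single text line (--psm 7).
function buildTesseractArgs(imagePath, { lang = "eng", oem = 1, psm = 7 } = {}) {
  return [imagePath, "stdout", "-l", lang, "--oem", String(oem), "--psm", String(psm)];
}

// Run the installed binary and resolve with the recognized text.
function runTesseract(imagePath, options) {
  return new Promise((resolve, reject) => {
    execFile("tesseract", buildTesseractArgs(imagePath, options), (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout.trim());
    });
  });
}
```

Running the same image through different `psm` values this way may show whether the page segmentation mode, rather than the wrapper itself, explains the better email results.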

https://superuser.com/questions/758876/tesseract-3-03-english-language-data


https://kbravh.dev/writing/cracking-a-captcha-with-tesseract-js/

https://stackoverflow.com/questions/59126144/how-to-improve-tesseract-js-accuracy

https://stackoverflow.com/questions/65687240/how-to-increase-ocr-accuracy-in-node-js-and-tesseract-js
