Sometimes we want to crawl data from a website. First, a caveat: crawling data is illegal (in most cases?). But I tried it anyway as a technical study, learning by applying it first.
It turned out that there are many good tools for this, so I started using them for my current project.
The stack mostly consists of Node.js with Cheerio for reading HTML (it works like jQuery does in the browser),
Tesseract OCR for reading text from images (install it locally or use a Node package),
and, obviously, a database; I am using MySQL.
https://realpython.com/setting-up-a-simple-ocr-server/
https://nanonets.com/blog/ocr-with-tesseract/
Test (training?) data:
https://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-3.02.eng.tar.gz/
OCR + OpenCV (there are a lot of usage examples; English only?)
https://nanonets.com/blog/ocr-with-tesseract/
https://stackoverflow.com/questions/14800730/tesseract-running-error
Some Android apps I tried work really well with Vietnamese (tested with a printed Shopee invoice).
I will try other apps and different kinds of input later.
Some apps with 100M downloads seem very good but require payment.
https://www.scraperapi.com/blog/5-tips-for-web-scraping/
https://stackoverflow.com/questions/14347581/mysql-second-or-third-index-of-in-string
In PHP I can simply wait for a while and then call the API again, retrying three times before giving up and handling the failure (ignoring it or raising an exception). I have used a PHP Shopify library this way.
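The same wait-and-retry idea carries over to Node.js. The snippet below is only a minimal sketch: `withRetry`, `sleep`, and the default of three attempts with a one-second pause are names and values I made up for illustration, not part of any library.

```javascript
// Hypothetical helper: pause for the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run an async task, retrying after a fixed delay.
// `attempts` is the total number of tries (3 by default, as in the PHP approach).
async function withRetry(task, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Wait before the next try, but not after the final failure.
      if (attempt < attempts) await sleep(delayMs);
    }
  }
  // All attempts failed: let the caller decide whether to ignore or handle it.
  throw lastError;
}
```

A caller would wrap the request, for example `withRetry(() => fetchPage(url))`, where `fetchPage` stands for whatever function performs the HTTP call, and catch the final error to skip that page.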
https://github.com/ericchiang/pup
https://scotch.io/tutorials/asynchronous-javascript-using-async-await
https://blog.bitsrc.io/the-power-of-axios-cf45e085d924
https://github.com/watson/cheerio-eq
https://www.tabnine.com/code/javascript/functions/cheerio/Cheerio/first
It seems Tesseract is not good at recognizing the email at sign (@).
https://lazyadmin.nl/office-365/ocr-email-attachments-and-store-them-on-sharepoint/
https://miai.vn/2019/08/28/ocr-dao-tao-tesseract-ocr-de-nhan-dang-tieng-viet-voi-cac-font-chu-khu-khoam/
https://miai.vn/2020/08/03/phan-loai-bien-bao-giao-thong-bang-deep-learning-cnn/
https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/
https://groups.google.com/g/tesseract-ocr/c/XTxlAqFbOr8?pli=1
https://github.com/tesseract-ocr/langdata/issues/63
https://research.aimultiple.com/ocr-accuracy/
https://stackoverflow.com/questions/9237250/imagemagick-for-captcha
http://www.fmwconcepts.com/imagemagick/captcha/index.php
I ended up using simple correction rules for emails. There was no need for an expensive ML technique that requires a lot of study.
? => 7
egmail => @gmail
uahoo => yahoo
euahuoo => @yahoo
0gmail => @gmail
D => p
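The rules above can be applied with plain string replacement. This is a rough sketch, not the exact code I ran: `RULES` and `correctEmail` are illustrative names, and the ordering (longer patterns first) is my assumption so that a short rule does not interfere with a longer one. The rules are naive and global, so `D => p` will also change a legitimate capital D.

```javascript
// Correction table taken from the rules listed above; longer patterns come first.
const RULES = [
  ["euahuoo", "@yahoo"],
  ["egmail", "@gmail"],
  ["0gmail", "@gmail"],
  ["uahoo", "yahoo"],
  ["?", "7"],
  ["D", "p"],
];

// Hypothetical helper: apply every rule to the raw OCR text, in order.
function correctEmail(raw) {
  let text = raw.trim();
  for (const [wrong, right] of RULES) {
    // split/join replaces every occurrence without regex escaping issues.
    text = text.split(wrong).join(right);
  }
  return text;
}

console.log(correctEmail("abc?0gmail.com")); // prints "abc7@gmail.com"
```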
https://tesseract.projectnaptha.com/
https://www.twilio.com/blog/2016/11/a-simple-way-to-ocr-images-from-a-url-with-tesseract-js.html
https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html
https://www.npmjs.com/package/node-tesseract-ocr
Update: after switching to node-tesseract-ocr, I got very good results, nearly 100% correct, so I use it and no longer need the custom error-correction rules above.
...
I send a random User-Agent with each request to avoid being banned by the source website. I think the admin of the source site could easily ban my IP, but so far they have not. If they did, I could switch to a cloud VPS.
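Picking a random User-Agent can be sketched without any package. The list below is a tiny illustrative pool I wrote by hand, and `pickUserAgent` and `buildRequestHeaders` are hypothetical names; a package such as random-useragent would supply a much larger list.

```javascript
// Illustrative pool only; a real crawler would want a larger, current list.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
];

// Hypothetical helper: `random` is injectable so the choice can be tested.
function pickUserAgent(random = Math.random) {
  const index = Math.floor(random() * USER_AGENTS.length);
  return USER_AGENTS[index];
}

// Hypothetical helper: build the headers object for one request.
function buildRequestHeaders(random = Math.random) {
  return { "User-Agent": pickUserAgent(random) };
}
```

With axios, the result would be passed as the `headers` option of the request config, for example `axios.get(url, { headers: buildRequestHeaders() })`.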
Note that crawling data is illegal (in most cases?), so give the legal side some thought.
https://github.com/axios/axios/issues/2560
https://www.npmjs.com/package/random-useragent
https://levelup.gitconnected.com/bulk-operation-into-mysql-with-nodejs-478c8fc30917
Update:
After double-checking, I found that node-tesseract-ocr seems to be only a wrapper around a standalone Tesseract installation, so you have to install Tesseract on your machine or server first for it to work. I will verify this further, because I do not yet understand why running through the Node wrapper gives better results (for email recognition).
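To make the "wrapper" idea concrete, here is a rough sketch of what calling the installed `tesseract` binary from Node might look like. It is not the source of node-tesseract-ocr; `buildTesseractArgs`, `recognize`, and the default option values are my own assumptions. The flags `-l`, `--oem`, and `--psm`, and the `stdout` output target, are standard Tesseract command-line options.

```javascript
const { execFile } = require("child_process");

// Hypothetical helper: turn options into tesseract CLI arguments.
// "stdout" tells tesseract to print the text instead of writing a file.
function buildTesseractArgs(imagePath, { lang = "eng", oem = 1, psm = 3 } = {}) {
  return [imagePath, "stdout", "-l", lang, "--oem", String(oem), "--psm", String(psm)];
}

// Hypothetical helper: run the locally installed tesseract binary.
// This rejects if Tesseract is not installed or not on the PATH.
function recognize(imagePath, options) {
  return new Promise((resolve, reject) => {
    execFile("tesseract", buildTesseractArgs(imagePath, options), (error, stdout) => {
      if (error) reject(error);
      else resolve(stdout.trim());
    });
  });
}
```

If a wrapper passes different engine or page segmentation modes than the ones I used by hand, that alone might explain a difference in results, but I have not confirmed this.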
https://superuser.com/questions/758876/tesseract-3-03-english-language-data
https://kbravh.dev/writing/cracking-a-captcha-with-tesseract-js/
https://stackoverflow.com/questions/59126144/how-to-improve-tesseract-js-accuracy