Vanna Iced Tea: How to train Tesseract OCR

Thursday, June 25, 2009

Tesseract Training for Khmer Language

Please also see the training instructions on the Tesseract wiki: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

10 comments:

Md. Abul Hasnat, June 26, 2009 at 7:16 PM
Similar post: http://crblpocr.blogspot.com/2008/07/how-to-train-bangla-and-devanagari.html
Unknown, June 26, 2009 at 10:42 PM
Hi Vanna,

Thanks for that.

Ms. Sochenda from PAN Localization Cambodia is also testing Khmer Kep font with Tesseract.

Please keep in touch,

ING LengIeng
PAN Localization Cambodia
Vanna, June 28, 2009 at 11:35 AM
Dear Lengleng,

You're welcome.
I am doing Khmer OCR research for my master's thesis.
Chenda, July 2, 2009 at 6:46 PM
Thanks for posting here.
Anonymous, August 11, 2009 at 11:04 PM
Thanks for this good tutorial, it helped me a lot!

On my machine, cnTraining.exe and mfTraining.exe from v2.04 did not work; I had to use the ones from v2.00.
Anonymous, March 8, 2010 at 1:17 PM
Simple recap of the instructions (worked great, thanks).
You need to create 8 files:

1, freq-dawg 2, word-dawg 3, user-words (can be empty file)
4, inttemp 5, normproto 6, pffmtable 7, unicharset
8, DangAmbigs (can be empty file)

steps to get them
start with a file.tif ready to train for ocr
-----------------
1: `tesseract file.tif file batch.nochop makebox`
2: `mv file.txt file.box`
3: edit file.box to match the appropriate text
-or you can use a helper tool-
windows tool: sites.google.com/site/spilkaondrej
linux tool: tesseractTrainer.py (download section)
4: `tesseract file.tif junk nobatch box.train`
5: `mftraining file.tr`
6: `cntraining file.tr`
7: `unicharset_extractor file.box`
8: create 'frequent_words_list' & 'words_list' with at least 1 word in each text file
(words separated by new lines)
9: `wordlist2dawg frequent_words_list freq-dawg`
10: `wordlist2dawg words_list word-dawg`
11: `touch DangAmbigs user-words`
12: `mv DangAmbigs xxx.DangAmbigs; mv freq-dawg xxx.freq-dawg; ...`
# rename all 8 files listed above the same way
13: `mv xxx.* /usr/share/tesseract/tessdata/`

14: `tesseract file.tif output -l xxx`
You should now have a correct output.txt file.
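
Steps 12 and 13 above could be wrapped in a small shell helper, roughly as follows. This is only a sketch and not part of Tesseract: the function name prefix_and_install and its two arguments are made up for illustration, and it assumes it is run from the directory that holds the eight files.

```shell
#!/bin/sh
# Sketch only. "prefix_and_install" is a hypothetical helper name,
# not a Tesseract tool.

# The eight file names listed in the recap above.
TRAINING_FILES="freq-dawg word-dawg user-words inttemp normproto pffmtable unicharset DangAmbigs"

prefix_and_install() {
    lang="$1"       # language code, e.g. xxx
    tessdata="$2"   # destination directory, e.g. /usr/share/tesseract/tessdata

    # Stop before moving anything if one of the eight files is missing.
    for name in $TRAINING_FILES; do
        if [ ! -e "$name" ]; then
            echo "missing training file: $name" >&2
            return 1
        fi
    done

    # Rename each file to <lang>.<name> while moving it into tessdata.
    for name in $TRAINING_FILES; do
        mv "$name" "$tessdata/$lang.$name" || return 1
    done
}
```

For example: `prefix_and_install xxx /usr/share/tesseract/tessdata` (writing to that directory will probably need root).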
Anonymous, August 23, 2010 at 11:25 PM
Great posting; your instructions work and are clearer than the wiki.

Now I just need to know whether I can add my own files, or additional files, to an existing language...
Vikas Kumar, June 6, 2011 at 1:14 PM
Excellent tutorial. Works just perfect. Thanks.
Shurikus, January 1, 2012 at 8:08 PM
Very good tutorial!