Thursday, June 25, 2009

How to train Tesseract OCR

Tesseract Training_for Khmer Language_For Posting

Please also see the training instructions on the Tesseract wiki: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

10 comments:

  1. Similar Post :: http://crblpocr.blogspot.com/2008/07/how-to-train-bangla-and-devanagari.html

  2. Hi Vanna,

    Thanks for that.

    Ms. Sochenda from PAN Localization Cambodia is also testing Khmer Kep font with Tesseract.

    Please keep in touch,

    ING LengIeng
    PAN Localization Cambodia

  3. Dear Lengleng,

    You're welcome.
    I am doing Khmer OCR research for my master's thesis.

  4. Thanks for this good tutorial, it helped me a lot!

    On my machine, cnTraining.exe and mfTraining.exe from v2.04 did not work; I had to use the ones from v2.00.

  5. simple recap of instructions (worked great thanks)
    you need to create 8 files

    1, freq-dawg 2, word-dawg 3, user-words (can be empty file)
    4, inttemp 5, normproto 6, pffmtable 7, unicharset
    8, DangAmbigs (can be empty file)

    steps to get them
    start with a file.tif ready to train for ocr
    -----------------
    1: `tesseract file.tif file batch.nochop makebox`
    2: `mv file.txt file.box`
    3: edit file.box to match the appropriate text
    -or you can use a helper tool-
    windows tool: sites.google.com/site/spilkaondrej
    linux tool: tesseractTrainer.py (download section)
    4: `tesseract file.tif junk nobatch box.train`
    5: `mftraining file.tr`
    6: `cntraining file.tr`
    7: `unicharset_extractor file.box`
    8: create 'frequent_words_list' & 'words_list' with at least 1 word in each text file
    (words separated by new line)
    9: `wordlist2dawg frequent_words_list freq-dawg`
    10: `wordlist2dawg words_list word-dawg`
    11: `touch DangAmbigs user-words`
    12: `mv DangAmbigs xxx.DangAmbigs; mv freq-dawg xxx.freq-dawg; ...`
    # repeat for all 8 files listed above (xxx is the language code you pass to -l in step 14)
    13: `mv xxx.* /usr/share/tesseract/tessdata/`

    14: `tesseract file.tif output -l xxx`
    should now have a correct output.txt file
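
    Steps 12 and 13 can be sketched as one small shell function. This is only a rough sketch, assuming a POSIX shell and that all eight files sit in the current directory. The name install_training_files and its two arguments are made up for illustration and are not part of Tesseract.

    ```shell
    # Sketch of steps 12 and 13: prefix the eight training files with a
    # language code and move them into a tessdata directory.
    # install_training_files and its arguments are hypothetical names,
    # not part of Tesseract.
    install_training_files() {
        lang_code=$1      # language code, e.g. xxx
        tessdata_dir=$2   # target directory, e.g. /usr/share/tesseract/tessdata
        files="freq-dawg word-dawg user-words inttemp normproto pffmtable unicharset DangAmbigs"

        # Refuse to move anything unless all eight files are present.
        for name in $files; do
            if [ ! -f "$name" ]; then
                echo "missing training file: $name" >&2
                return 1
            fi
        done

        # Rename each file to <lang_code>.<name> while moving it.
        for name in $files; do
            mv "$name" "$tessdata_dir/$lang_code.$name" || return 1
        done
    }
    ```

    Running `install_training_files xxx /usr/share/tesseract/tessdata` from the training directory (root rights may be needed for that path) puts the files where step 14 expects them.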

  6. Great posting; your instructions work and are clearer than the wiki.


    Now I just need to know whether I can add my own files, or additional files, to an existing language...

  7. Excellent tutorial. Works just perfect. Thanks.
