TesseractOCREngine (SAFS API DOCUMENT)

java.lang.Object
- org.safs.tools.ocr.OCREngine
- - org.safs.tools.ocr.tesseract.TesseractOCREngine

```
public class TesseractOCREngine
extends OCREngine
```
Extends OCREngine to support the Tesseract OCR engine.
http://code.google.com/p/tesseract-ocr/
http://groups.google.com/group/tesseract-ocr

Tesseract 2.04 can be used in two ways:
1) Running tesseract.exe directly from the command line. Usage: tesseract.exe [-l lang] [configfile]
2) Calling tessdll.dll, which Tesseract releases for developers. dlltest.exe is released for testing tessdll.dll.

The files tesseract.exe, tessdll.dll and dlltest.exe can be downloaded, or built from the released code.

SAFS originally took the second approach: SAFS called tessdllWrapper.dll (SAFS defined), which talks with tessdll.dll. Experiments showed that tesseract.exe was much better than tessdll.dll at detecting text in images. It is not clear whether the two use the same parts of the code to do the work. SAFS therefore now takes the first approach and runs tesseract.exe directly.

Two files are output in the current user directory when imageToText() is called:
1. ~temp.tif  scaled image for Tesseract to recognize.
2. ~temp.txt  text file storing the text detected in the image.

Three files are required in the search path:
tesseract.exe --- Command: tesseract imagefile outfile -l eng
SafsTessdll.exe --- Command: SafsTessdll imagefile outfile eng
tessdll.dll --- needed by SafsTessdll.exe

SafsTessdll.exe was built by SAFS and added for findTextRectFromImage(). It outputs a UTF-8 file that contains the detected characters, their Unicode values and their coordinates.
We have added direct tesseract.exe support for findTextRectFromImage(). The output file produced by tesseract.exe has a different format and uses a different coordinate system than the one produced by the SAFS DLL, but it provides greater text recognition accuracy for better matching and locating of text.
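
The command line quoted in the description (tesseract imagefile outfile -l eng) could be assembled and launched roughly as follows. This is a minimal sketch, not the SAFS implementation: the class TessCommandSketch and its methods are hypothetical names, and it assumes that tesseract is available on the search path.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the SAFS API.
final class TessCommandSketch {

    // Builds the argument list for: tesseract <imageFile> <outputBase> -l <lang>
    static List<String> buildCommand(String imageFile, String outputBase, String lang) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(imageFile);
        command.add(outputBase);
        command.add("-l");
        command.add(lang);
        return command;
    }

    // Runs the command and returns the process exit code.
    // Assumption: tesseract can be found on the search path.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) to actually invoke tesseract.
        System.out.println(String.join(" ", buildCommand("~temp.tif", "~temp", "eng")));
    }
}
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces. Tesseract normally appends the .txt extension to the output base name itself, which would be consistent with the ~temp.txt file mentioned in the class description.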
If a version other than Tesseract 2.04 is installed, the environment variable TESSDATA_VERSION should be set. Ex: TESSDATA_VERSION=3.4.4
We don't know of any other means to deduce the version of tesseract installed.
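
As an illustration of the version lookup described above, the following sketch reads TESSDATA_VERSION and falls back to "2.04" when the variable is unset or blank. The class and method names are hypothetical, and the trimming and blank handling are assumptions; the actual static initializer in TesseractOCREngine may behave differently.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the SAFS API.
final class TessVersionSketch {

    // Documented coded default of TESSERACT_VERSION.
    static final String DEFAULT_VERSION = "2.04";

    // Returns the trimmed TESSDATA_VERSION value from the given environment,
    // or the default when the variable is missing or blank (an assumption).
    static String resolveVersion(Map<String, String> env) {
        String value = env.get("TESSDATA_VERSION");
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT_VERSION;
        }
        return value.trim();
    }

    public static void main(String[] args) {
        System.out.println(resolveVersion(System.getenv()));
    }
}
```

Taking the environment as a Map parameter keeps the lookup testable without changing the real process environment.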

Author:

Junwu Ma
DEC 14, 2009 Original Release
JAN 27, 2010 (JunwuMa) Modified imageToText() to call tesseract.exe directly for recognition accuracy.
FEB 25, 2010 (JunwuMa) Add method getSelfDefinedLangId().
MAR 19, 2010 (JunwuMa) Added method findTextRectFromImage() to support mode "ImageText=" in ImageUtils.java.
MAR 25, 2010 (JunwuMa) Modified imageToText() to open the temporary file in proper format, UTF-8.
APR 20, 2010 (Lei Wang) Add a map languages to contain pairs (javaLangCode, OCRLangCode). Modify method getSelfDefinedLangId(): get OCRLangCode from map languages.
MAY 27, 2010 (Carl Nagle) changed temp output directory to System property "java.io.tmpdir"
OCT 22, 2010 (Carl Nagle) Added support for tesseract.exe text coordinate extraction. Added support to detect System Environment Variable TESSDATA_VERSION to detect versions of Tesseract > 2.04.

See Also:

tessFileParser, ReverseRectangle

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`LANG_CHN` "chi"
`static java.lang.String`	`LANG_ENG` "eng"
`static java.lang.String`	`LANG_FRA` "fra"
`static java.lang.String`	`LANG_JPN` "jpn"
`static java.lang.String`	`LANG_KOR` "kor"
`static java.util.HashMap<java.lang.String,java.lang.String>`	`languages`
`static java.lang.String`	`TESSERACT_VERSION` Coded default to "2.04" Static initializers look for System Environment variable "TESSDATA_VERSION" to change this.
`static int`	`TEXT_FIND_MODE` Set to desired mode for locating coordinates of text. TEXT_FIND_TESSDLL_MODE was the original mechanism used.
`static int`	`TEXT_FIND_TESSDLL_MODE` Mode value to use SAFSTESSDLL wrapper to locate coordinates of image text.
`static int`	`TEXT_FIND_TESSEXE_MODE` Mode value to use tesseract.exe directly to locate coordinates of image text.
`static java.lang.String`	`TMP_DIR_PROPERTY` "java.io.tmpdir"
`static java.lang.String`	`TMP_TEXT_COOR_OUTPUT` "~tempcoor.txt"
`static java.lang.String`	`TMP_TEXT_COOR_ROOT` "~tempcoor"
`static java.lang.String`	`TMP_TEXT_OUTPUT` "~temp"
`static java.lang.String`	`TMP_TIF_SCALEDED` "~temp.tif"

Fields inherited from class org.safs.tools.ocr.OCREngine
defaultZoomScale, OCR_DEFAULT_ENGINE_CLAZZ, OCR_DEFAULT_ENGINE_KEY, OCR_G_ENGINE_CLAZZ, OCR_G_ENGINE_KEY, OCR_T_ENGINE_CLAZZ, OCR_T_ENGINE_KEY, STAF_OCR_ENGINE_VAR_NAME, STAF_OCR_LANGUAGE_ID_VAR_NAME

Constructor Summary

Constructors
Constructor and Description
`TesseractOCREngine()`

Method Summary

Modifier and Type	Method and Description
`java.awt.Rectangle`	`findTextRectFromImage(java.lang.String searchtext, int index, java.awt.image.BufferedImage image, java.lang.String stdlangId, java.awt.Rectangle subarea, float zoom)` Find text from BufferedImage, and return its area in the BufferedImage.
`protected java.lang.String`	`getSelfDefinedLangId(java.lang.String StdlangId)` Overrides the superclass method.
`java.lang.String`	`imageToText(java.awt.image.BufferedImage image, java.lang.String langId, java.awt.Rectangle subarea, float zoom)` Convert buffered image to text using OCR technology.
`static void`	`main(java.lang.String[] args)` Can be used to unit test.

Methods inherited from class org.safs.tools.ocr.OCREngine
getdefaultZoomScale, getOCREngine, getOCREngineKey, getOCRLanguageCode, imageToText, runCommandLine, setdefaultZoomScale, setOCREngineKey, setOCRLanguageCode, storedImageToText, storedImageToText, zoomImageWithType

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - LANG_ENG
```
public static final java.lang.String LANG_ENG
```
    "eng"
    
    See Also:
    
    Constant Field Values
  - LANG_CHN
```
public static final java.lang.String LANG_CHN
```
    "chi"
    
    See Also:
    
    Constant Field Values
  - LANG_JPN
```
public static final java.lang.String LANG_JPN
```
    "jpn"
    
    See Also:
    
    Constant Field Values
  - LANG_KOR
```
public static final java.lang.String LANG_KOR
```
    "kor"
    
    See Also:
    
    Constant Field Values
  - LANG_FRA
```
public static final java.lang.String LANG_FRA
```
    "fra"
    
    See Also:
    
    Constant Field Values
  - TMP_DIR_PROPERTY
```
public static java.lang.String TMP_DIR_PROPERTY
```
    "java.io.tmpdir"
  - TMP_TIF_SCALEDED
```
public static java.lang.String TMP_TIF_SCALEDED
```
    "~temp.tif"
  - TMP_TEXT_OUTPUT
```
public static java.lang.String TMP_TEXT_OUTPUT
```
    "~temp"
  - TMP_TEXT_COOR_ROOT
```
public static java.lang.String TMP_TEXT_COOR_ROOT
```
    "~tempcoor"
  - TMP_TEXT_COOR_OUTPUT
```
public static java.lang.String TMP_TEXT_COOR_OUTPUT
```
    "~tempcoor.txt"
  - TESSERACT_VERSION
```
public static java.lang.String TESSERACT_VERSION
```
    Coded default to "2.04" Static initializers look for System Environment variable "TESSDATA_VERSION" to change this. We have not found another way to deduce the installed version of Tesseract.
  - languages
```
public static java.util.HashMap<java.lang.String,java.lang.String> languages
```
  - TEXT_FIND_TESSDLL_MODE
```
public static final int TEXT_FIND_TESSDLL_MODE
```
    Mode value to use SAFSTESSDLL wrapper to locate coordinates of image text.
    
    See Also:
    
    Constant Field Values
  - TEXT_FIND_TESSEXE_MODE
```
public static final int TEXT_FIND_TESSEXE_MODE
```
    Mode value to use tesseract.exe directly to locate coordinates of image text.
    
    See Also:
    
    Constant Field Values
  - TEXT_FIND_MODE
```
public static int TEXT_FIND_MODE
```
    Set to desired mode for locating coordinates of text.
    TEXT_FIND_TESSDLL_MODE was the original mechanism used. However, this mechanism was found to lack the accuracy of text recognition that tesseract.exe has.
    Possible values: TEXT_FIND_TESSDLL_MODE, and the default TEXT_FIND_TESSEXE_MODE.
- Constructor Detail
  - TesseractOCREngine
```
public TesseractOCREngine()
```
- Method Detail
  - imageToText
```
public java.lang.String imageToText(java.awt.image.BufferedImage image,
                                    java.lang.String langId,
                                    java.awt.Rectangle subarea,
                                    float zoom)
                             throws SAFSException
```
    Description copied from class: OCREngine
    
    Convert buffered image to text using OCR technology. It needs to be implemented in its derived class.
    
    Overrides:
    
    imageToText in class OCREngine
    
    Returns:
    
    String converted from the input image, or null if the conversion fails.
    
    Throws:
    
    SAFSException - if any Exception is encountered
  - getSelfDefinedLangId
```
protected java.lang.String getSelfDefinedLangId(java.lang.String StdlangId)
```
    Overrides the superclass method. Translates the standard language code to the OCR-specific language code. It should be overridden in derived classes. Refer to Locale.ENGLISH.getLanguage() for the input language id. For example: 'en' for English.
    
    Overrides:
    
    getSelfDefinedLangId in class OCREngine
    
    Parameters:
    
    StdlangId - standard language code
    
    Returns:
    
    the OCR-specific language code
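
The translation from a standard Java language code to an OCR language code can be pictured as a simple map lookup, in the spirit of the languages field. The sketch below is an assumption-laden illustration: the class name, the exact map contents and the fallback to "eng" for unknown codes are hypothetical and are not taken from the SAFS source. Only the OCR code values ("eng", "chi", "jpn", "kor", "fra") come from the LANG_* constants documented here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of the SAFS API.
final class LangMapSketch {

    // Pairs of (javaLangCode, OCRLangCode); the pairing is an assumption.
    static final Map<String, String> LANGUAGES = new HashMap<>();

    static {
        LANGUAGES.put(Locale.ENGLISH.getLanguage(), "eng");   // "en"
        LANGUAGES.put(Locale.CHINESE.getLanguage(), "chi");   // "zh"
        LANGUAGES.put(Locale.JAPANESE.getLanguage(), "jpn");  // "ja"
        LANGUAGES.put(Locale.KOREAN.getLanguage(), "kor");    // "ko"
        LANGUAGES.put(Locale.FRENCH.getLanguage(), "fra");    // "fr"
    }

    // Returns the OCR language code for a standard language code.
    // Falling back to "eng" for unknown codes is an assumption.
    static String toOcrLangId(String stdLangId) {
        String ocrCode = LANGUAGES.get(stdLangId);
        return ocrCode != null ? ocrCode : "eng";
    }

    public static void main(String[] args) {
        System.out.println(toOcrLangId(Locale.JAPANESE.getLanguage()));
    }
}
```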
  - findTextRectFromImage
```
public java.awt.Rectangle findTextRectFromImage(java.lang.String searchtext,
                                                int index,
                                                java.awt.image.BufferedImage image,
                                                java.lang.String stdlangId,
                                                java.awt.Rectangle subarea,
                                                float zoom)
                                         throws SAFSException
```
    Find text from BufferedImage, and return its area in the BufferedImage. Two modes of operation are possible based on the TEXT_FIND_MODE setting.
```
 TESSDLL Mode:
 Two files SafsTessdll.exe and tessdll.dll will be used.
 Two files will be output in current user's Temp directory.
 1. ~temp.tif  scaled image
 2. ~tempcoor.txt text file storing detected text and their coordinates.
    Rectangle coordinates 0,0 relative to TOP-LEFT corner of search area.
 
```
```
 TESSEXE Mode:
 tesseract.exe will be used.
 Two files will be output in current user's Temp directory.
 1. ~temp.tif  scaled image
 2. ~tempcoor.txt text file storing detected text and their coordinates.
    ReverseRectangle coordinates 0,0 relative to BOTTOM-LEFT corner of search area.
 
```
    Overrides:
    
    findTextRectFromImage in class OCREngine
    
    Parameters:
    
    searchtext - text for which to search
    
    index - starts from 1; specifies which instance (the Nth) of searchtext to find.
    
    image - source BufferedImage in which to detect text
    
    stdlangId - standard language id with which TOCR intends to detect. Refer to Locale.ENGLISH.getLanguage().
    
    subarea -
    
    zoom -
    
    Returns:
    
    Rectangle or ReverseRectangle or null.
    
    Throws:
    
    SAFSException
    
    See Also:
    
    TEXT_FIND_MODE, ReverseRectangle
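
Because the two modes use different origins (TOP-LEFT for TESSDLL, BOTTOM-LEFT for TESSEXE), a caller may need to flip a bottom-left based box into the usual java.awt.Rectangle convention. The sketch below shows only the arithmetic. It assumes a box expressed as left, bottom, right and top edges measured from the bottom-left corner of an area of known height. How ReverseRectangle stores its values is not described here, so the class and method names are hypothetical.

```java
import java.awt.Rectangle;

// Hypothetical helper for illustration; not part of the SAFS API.
final class BoxFlipSketch {

    // Converts a box measured from the BOTTOM-LEFT corner of an area into a
    // java.awt.Rectangle measured from the TOP-LEFT corner of the same area.
    // left/right are x edges; bottom/top are distances up from the bottom edge.
    static Rectangle toTopLeft(int left, int bottom, int right, int top, int areaHeight) {
        int x = left;
        int y = areaHeight - top;   // the top edge becomes the distance down from the top
        int width = right - left;
        int height = top - bottom;
        return new Rectangle(x, y, width, height);
    }

    public static void main(String[] args) {
        System.out.println(toTopLeft(10, 20, 50, 40, 100));
    }
}
```

Only the vertical coordinate changes in the conversion; the x position, width and height are the same in both coordinate systems.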
  - main
```
public static void main(java.lang.String[] args)
```
    Can be used to unit test.
    java org.safs.tools.ocr.tesseract.TesseractOCREngine imageFile [-z zoom] [-l lang] [-t text]
    imageFile - path to screen captured image to process.
    zoom - zoom level to use for OCR. Defaults to 1.9.
    lang - locale to use for OCR. Ex: "en"
    text - text to locate in image.
    
    Parameters:
    
    args - up to 7 array items: imageFile [-z zoom] [-l lang] [-t text]
    imageFile - (required) path to screen captured image to process.
    zoom - zoom level to use for OCR. Defaults to 1.9.
    lang - locale to use for OCR. Ex: "en". Defaults to "en".
    text - text to locate in image.
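
The argument layout documented for main (imageFile [-z zoom] [-l lang] [-t text]) could be parsed along these lines. This is a sketch under stated assumptions, not the actual main implementation: the class MainArgsSketch is hypothetical, and rejecting unknown options or options without a value is an assumption. The defaults of 1.9 for zoom and "en" for lang are taken from the description above.

```java
// Hypothetical helper for illustration; not part of the SAFS API.
final class MainArgsSketch {

    String imageFile;      // required first argument
    float zoom = 1.9f;     // documented default
    String lang = "en";    // documented default
    String text;           // null when -t is not supplied

    // Parses: imageFile [-z zoom] [-l lang] [-t text]
    static MainArgsSketch parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("imageFile is required");
        }
        MainArgsSketch options = new MainArgsSketch();
        options.imageFile = args[0];
        for (int i = 1; i < args.length; i += 2) {
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + args[i]);
            }
            String value = args[i + 1];
            switch (args[i]) {
                case "-z": options.zoom = Float.parseFloat(value); break;
                case "-l": options.lang = value; break;
                case "-t": options.text = value; break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        MainArgsSketch o = parse(new String[] {"screen.png", "-z", "2.5", "-t", "OK"});
        System.out.println(o.imageFile + " zoom=" + o.zoom + " lang=" + o.lang + " text=" + o.text);
    }
}
```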
