Friday, June 19, 2015

Running the Text Extraction Process Manually: textexport

The text extraction process is performed by the "textexport" program, which is bundled with the ContentAccess component. This component contains the Oracle Outside In functionality. The textexport program reads the PDF file and writes all of the extracted text to the active collection folder (ots1 or ots2), for example under
<ucm-install>/search/ots1/bulkload/~export

To preserve this file, open the Repository Manager applet and, on the Indexing tab, click the Configuration button. In the popup that appears, set the debug level to trace.

If you want to see the text extraction process for yourself, run textexport manually. First create an HDA file named testfile.hda, with the InputFilePath parameter set to a valid path:
<?hda version="10.1.3.5.1 (111229)" jcharset=UTF8 encoding=utf-8?>
@Properties LocalData
OutputCharacterSet=utf8
blFieldTypes=
FallbackFormat=fi_unicode
InputFilePath=C:\Users\sonal\Downloads\pdf.pdf
blDateFormat=M/d/yy {h:mm[:ss] {aa}[zzz]}!mAM,PM!tAmerica/Chicago
@end

Run the following command from a command prompt:
C:\Oracle\Oracle_ECM1\oit\win32\lib\contentaccess\textexport.exe -c C:\testfile.hda -f C:\finaltextfile.txt

finaltextfile.txt will contain the text extracted from the PDF file named in the HDA file.
The FallbackFormat value fi_unicode means: display as text and assume the Unicode character set.
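
If you need to repeat this for several files, the two manual steps (writing the HDA file, then calling textexport) can be scripted. The following is only a rough sketch in Python, not part of the product: the function names build_hda, build_command and run_textexport are made up for illustration, the HDA settings are copied from the sample file above, and only the -c and -f options shown in the command above are used. What textexport returns as an exit code is not covered here, so the sketch simply passes it back to the caller.

```python
import subprocess
from pathlib import Path


def build_hda(input_path: str) -> str:
    """Return HDA text matching the sample in this post, with InputFilePath filled in."""
    lines = [
        '<?hda version="10.1.3.5.1 (111229)" jcharset=UTF8 encoding=utf-8?>',
        "@Properties LocalData",
        "OutputCharacterSet=utf8",
        "blFieldTypes=",
        "FallbackFormat=fi_unicode",
        "InputFilePath=" + input_path,
        "blDateFormat=M/d/yy {h:mm[:ss] {aa}[zzz]}!mAM,PM!tAmerica/Chicago",
        "@end",
    ]
    return "\n".join(lines) + "\n"


def build_command(textexport_exe: str, hda_path: str, output_path: str) -> list:
    """Build the same argument list as the manual command: -c <hda> -f <output>."""
    return [textexport_exe, "-c", hda_path, "-f", output_path]


def run_textexport(textexport_exe: str, input_path: str,
                   hda_path: str, output_path: str) -> int:
    """Write the HDA file, run textexport, and return its exit code (hypothetical helper)."""
    Path(hda_path).write_text(build_hda(input_path), encoding="utf-8")
    result = subprocess.run(build_command(textexport_exe, hda_path, output_path))
    return result.returncode


# Example call using the paths from this post (adjust to your own install):
# run_textexport(
#     r"C:\Oracle\Oracle_ECM1\oit\win32\lib\contentaccess\textexport.exe",
#     r"C:\Users\sonal\Downloads\pdf.pdf",
#     r"C:\testfile.hda",
#     r"C:\finaltextfile.txt",
# )
```

Keeping the HDA text and the command line in separate functions makes it easy to print both and compare them with the manual version before anything is actually run.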
