Text extraction is performed by the textexport program, which is bundled with the ContentAccess component. This component provides the Oracle Outside In functionality. The textexport program reads the PDF file and places all of the extracted text into the active collection folder (ots1 or ots2) under:
<ucm-install>/search/ots1/bulkload/~export
To preserve the extracted text file, open the Repository Manager applet and, on the Indexing tab, click the Configuration button. In the popup that appears, set the debug level to trace.
If you want to see the text extraction process yourself, you can run textexport manually. Create an HDA file, testfile.hda, with InputFilePath set to a valid path to the file you want to extract:
<?hda version="10.1.3.5.1 (111229)" jcharset=UTF8 encoding=utf-8?>
@Properties LocalData
OutputCharacterSet=utf8
blFieldTypes=
FallbackFormat=fi_unicode
InputFilePath=C:\Users\sonal\Downloads\pdf.pdf
blDateFormat=M/d/yy {h:mm[:ss] {aa}[zzz]}!mAM,PM!tAmerica/Chicago
@end
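If you need to test several files, typing the HDA by hand gets tedious. Here is a minimal Python sketch that generates the same testfile.hda content for a given input path. The function names build_hda and write_hda are my own (hypothetical), and the property values are simply copied from the example above, so adjust them if your system uses different settings.

```python
from pathlib import Path


def build_hda(input_file_path: str) -> str:
    """Return HDA text modeled on the testfile.hda example above.

    Only InputFilePath varies; every other value is copied verbatim
    from the example and is an assumption about what your setup needs.
    """
    lines = [
        '<?hda version="10.1.3.5.1 (111229)" jcharset=UTF8 encoding=utf-8?>',
        "@Properties LocalData",
        "OutputCharacterSet=utf8",
        "blFieldTypes=",
        "FallbackFormat=fi_unicode",
        f"InputFilePath={input_file_path}",
        "blDateFormat=M/d/yy {h:mm[:ss] {aa}[zzz]}!mAM,PM!tAmerica/Chicago",
        "@end",
    ]
    return "\n".join(lines) + "\n"


def write_hda(hda_path: str, input_file_path: str) -> None:
    """Write the generated HDA text to hda_path as UTF-8."""
    Path(hda_path).write_text(build_hda(input_file_path), encoding="utf-8")
```

For example, write_hda(r"C:\testfile.hda", r"C:\Users\sonal\Downloads\pdf.pdf") would recreate the file shown above.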
Run the following command from the command prompt:
C:\Oracle\Oracle_ECM1\oit\win32\lib\contentaccess\textexport.exe -c C:\testfile.hda -f C:\finaltextfile.txt
finaltextfile.txt will contain the text extracted from the PDF file named in the HDA file.
Note on the FallbackFormat value used in the HDA file: fi_unicode means display as text and assume the Unicode character set.
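The manual textexport run above can also be scripted, which is handy when you want to check quickly whether any text came out at all. This is only a sketch: the executable path is the one from my install, the -c and -f options are the ones used in the command above, and the helper names build_command and run_textexport are hypothetical. I have not checked what exit codes textexport returns, so the script reports the code and the output file size and leaves the interpretation to you.

```python
import subprocess
from pathlib import Path

# Path taken from the command above; change it to match your installation.
TEXTEXPORT = r"C:\Oracle\Oracle_ECM1\oit\win32\lib\contentaccess\textexport.exe"


def build_command(hda_path: str, output_path: str, exe: str = TEXTEXPORT) -> list:
    """Build the same argument list as the manual command: -c <hda> -f <output>."""
    return [exe, "-c", hda_path, "-f", output_path]


def run_textexport(hda_path: str, output_path: str, exe: str = TEXTEXPORT) -> tuple:
    """Run textexport and return (exit code, size of the output file in bytes).

    A size of 0 means the output file is missing or empty, i.e. no text
    was extracted. The meaning of the exit code is not verified here.
    """
    result = subprocess.run(
        build_command(hda_path, output_path, exe),
        capture_output=True,
        text=True,
    )
    out = Path(output_path)
    size = out.stat().st_size if out.is_file() else 0
    return result.returncode, size
```

Calling run_textexport(r"C:\testfile.hda", r"C:\finaltextfile.txt") mirrors the command shown earlier; a size of 0 suggests that nothing was extracted from the PDF.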