PDF to text extraction for non-English language PDF - C#

I am using the Datalogics library (Datalogics.PDFL) to manipulate PDFs, and I am having trouble with the following scenario: extracting text from a PDF that contains non-English text produces garbled output.
Sample input file: (screenshot)
For that file I get the garbled output with the following WordFinder configuration:
WordFinderConfig wordConfig = new WordFinderConfig();
wordConfig.IgnoreCharGaps = false;
wordConfig.IgnoreLineGaps = false;
wordConfig.NoAnnots = false;
wordConfig.NoEncodingGuess = false;
// Std Roman treatment for custom encoding; overrides the noEncodingGuess option
wordConfig.UnknownToStdEnc = true;
wordConfig.DisableTaggedPDF = false; // legacy mode WordFinder creation
wordConfig.NoXYSort = true;
wordConfig.PreserveSpaces = false;
wordConfig.NoLigatureExp = false;
wordConfig.NoHyphenDetection = false;
wordConfig.TrustNBSpace = false;
wordConfig.NoExtCharOffset = false; // text extraction efficiency
wordConfig.NoStyleInfo = false; // text extraction efficiency
WordFinder wordFinder = new WordFinder(doc, WordFinderVersion.Latest, wordConfig);

I'd encourage you to upgrade to the most current release (e.g. via NuGet). If the text extraction results are still problematic, contact our (Datalogics) Support Department and provide the input document and a runnable sample so the issue can be reproduced.

Related

How to dynamically add watermark to report in Stimulsoft

I would like to dynamically add a watermark to a report that is generated in Stimulsoft. The watermark cannot be hard-coded and should only appear if the report was generated in the TEST environment.
I have a variable that indicates whether the report was created in the test environment:
isTestEnv
If the watermark had been added to the page at design time, I would simply use:
if(isTestEnv == true) {
Page1.Watermark.Enabled = true;
} else {
Page1.Watermark.Enabled = false;
}
But this is not the case: I have to add the watermark while generating the report. Does anyone know how to do that?
The text is the same on all pages; it simply says "TEST". How to push it into the report is the mystery.

You can use this code to set a watermark image in your report:
Stimulsoft.Base.StiLicense.loadFromFile("../license.key");
var options = new Stimulsoft.Viewer.StiViewerOptions({showTooltips:false});
var viewer = new Stimulsoft.Viewer.StiViewer(options, "StiViewer", false);
var report = new Stimulsoft.Report.StiReport({isAsyncMode: true});
report.loadFile("Backgroundimg.mrt");
var page = report.pages.getByIndex(0);
page.watermark.image = Stimulsoft.System.Drawing.Image.fromFile('test.jpg');
page.watermark.aspectRatio = true;
page.watermark.imageStretch = true;
page.watermark.imageShowBehind = true;
report.renderAsync(function () {
viewer.report = report;
viewer.renderHtml("viewerContent");
});

At design time you can bind the page watermark text to a report variable, and then set that variable's value from your code.
Something like this:
StiReport report = new StiReport();
report.Load("REPORT_TEMPLATE_PATH");
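// Optionally check first that the variable exists in the dictionary.
// Variables[...] returns a variable object, so assign to its Value rather than to the indexer.
report.Dictionary.Variables["WATERMARK_VARIABLE_NAME"].Value = "YOUR_TEXT";
report.Show(); // or report.ShowWithWpf();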
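If binding to a report variable is not convenient, another possibility is to set the watermark on every page from code before the report is rendered. The following is only a minimal sketch: it assumes the Stimulsoft.Report .NET package is referenced, that isTestEnv is the flag from the question, and that the page watermark exposes Text and Enabled as the question's Page1.Watermark.Enabled suggests. WatermarkHelper and ApplyTestWatermark are hypothetical names, and the property names should be checked against your Stimulsoft version.

```csharp
using Stimulsoft.Report;
using Stimulsoft.Report.Components;

// Hypothetical helper, not part of Stimulsoft.
public static class WatermarkHelper
{
    // Sets a "TEST" text watermark on every page of the report.
    // Call this after loading the template and before rendering or showing it.
    public static void ApplyTestWatermark(StiReport report, bool isTestEnv)
    {
        foreach (StiPage page in report.Pages)
        {
            page.Watermark.Text = "TEST";
            // The watermark is only drawn when the report runs in the test environment.
            page.Watermark.Enabled = isTestEnv;
        }
    }
}
```

Usage would be along the lines of calling WatermarkHelper.ApplyTestWatermark(report, isTestEnv) right after report.Load(...).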

ClosedXML and C#: How to collapse rows by Default?

I am trying to write code that produces an Excel report with a pivot table, using the ClosedXML library. The output looks like this:
The problem is that all groups of data have to be collapsed by default, i.e. in the output I should see the following:
In other words, the output should contain collapsed rows with only the summary displayed. How can I achieve this in code? Which method should I use?
pt.ShowRowStripes = true;
secondWorksheet.FirstRow().Hide();
secondWorksheet.TabActive = true;
secondWorksheet.CollapseRows(1);
secondWorksheet.Rows().Collapse();
pt.EnableShowDetails = false;
pt.ShowValuesRow = false;
secondWorksheet.PageSetup.ShowGridlines = true;
secondWorksheet.ShowGridLines = true;
workbook.PageOptions.ShowGridlines = true;
secondWorksheet.PivotTables.First().EnableShowDetails = false;

This is not currently supported by ClosedXML. Pivot tables are still very much a work in progress.

Using ClosedXML.Signed version 0.94.2, this worked for me:
IXLPivotTable pivotTable = workbook.Worksheet("SheetContainingPivotTable").PivotTables.First();
pivotTable.ColumnLabels.ToList().ForEach(x => x.SetCollapsed(true));
pivotTable.RowLabels.ToList().ForEach(x => x.SetCollapsed(true));

How to use C# to export QTP results to PDF automatically

I'm writing a C# program to run QTP.
My program can already trigger QTP automatically and send the result to my mailbox. But that result is HTML, and I found that QTP can also export a PDF result.
So, here is my code:
qtpAutoReport = qtpApp.Options.Run.AutoExportReportConfig;
qtpAutoReport.AutoExportResults = true;
qtpAutoReport.StepDetailsReport = true;
qtpAutoReport.DataTableReport = false;
qtpAutoReport.LogTrackingReport = false;
qtpAutoReport.ScreenRecorderReport = false;
qtpAutoReport.SystemMonitorReport = false;
qtpAutoReport.StepDetailsReportFormat = "Short";
qtpAutoReport.ExportLocation = AutoExportPath;
qtpAutoReport.ExportForFailedRunsOnly = false;
qtpAutoReport.StepDetailsReportType = "PDF";
When I set qtpAutoReport.StepDetailsReportType = "HTML", my program runs successfully and I can find the HTML file on my disk.
But when I set qtpAutoReport.StepDetailsReportType = "PDF", I can't find any file on my disk after the QTP test is over.
So my question is: why can't QTP export the result when I set StepDetailsReportType to "PDF"?

There does seem to be an issue with UFT. I found a method that works for GUI tests (VBScript); give it a try with Service Test (C#).
All options are the same as in your example, with one addition:
uftObject.Options.Run.ViewResults = True
This tells UFT that you want to view the results after completion. Without this flag I get no PDF result; with it, the file is waiting at the export path.
Option Explicit
Dim uftObject, qtResultsOpt
Set uftObject=CreateObject("Quicktest.application")
uftObject.Launch
uftObject.Visible = True
Set qtResultsOpt = uftObject.Options.Run.AutoExportReportConfig
Dim AutoExportPath
AutoExportPath = "C:\Users\paxic\Desktop\stackoverflow\results"
qtResultsOpt.AutoExportResults = true
qtResultsOpt.StepDetailsReport = true
qtResultsOpt.DataTableReport = false
qtResultsOpt.LogTrackingReport = false
qtResultsOpt.ScreenRecorderReport = false
qtResultsOpt.SystemMonitorReport = false
qtResultsOpt.StepDetailsReportFormat = "Short"
qtResultsOpt.ExportLocation = AutoExportPath
qtResultsOpt.ExportForFailedRunsOnly = false
qtResultsOpt.StepDetailsReportType = "PDF"
uftObject.Open "C:\Users\JMorley\Desktop\stackoverflow\ExampleOne"
qtResultsOpt.AutoExportResults = True
uftObject.Options.Run.ViewResults = True
uftObject.Test.Run

Image pre-processing for text recognition with Tesseract or Puma.NET

How can I pre-process an image with OpenCVdotnet for better text recognition?
I tried the Tesseract wrapper and Puma.NET, but the results are poor. How can I improve them?
#region Tesseract
Bitmap pictureInfoArea = src.ToBitmap();
TesseractEngine engine = new TesseractEngine("tessdata/", "rus", EngineMode.Default);
//engine.SetVariable("tessedit_char_whitelist", "0123456789");
var page = engine.Process(pictureInfoArea, PageSegMode.Auto);
string sTesseract = page.GetText();
#endregion
#region Puma.NET
PumaPage pumaInfoArea = new PumaPage(pictureInfoArea);
using (pumaInfoArea)
{
// Changing default settings
pumaInfoArea.FileFormat = PumaFileFormat.TxtAnsi;
pumaInfoArea.EnableSpeller = true;
pumaInfoArea.Language = PumaLanguage.Russian;
// Recognizing and saving results to a file
string sPuma = pumaInfoArea.RecognizeToString();
//MessageBox.Show(s);
}
#endregion

Here is a tutorial explaining how to train your own language. I suggest that you install jTessBoxEditor, which helps a lot with training your patterns after you have applied the letter separation algorithm. jTessBoxEditor has a GUI that lets you train your own dataset.
Or
here you have another tutorial on training Tesseract 3 for a new language.
Have a look at this one too (I did not test it): sunnypage.ge/en
http://lib.psnc.pl/Content/358/PSNC_Tesseract-FineReader-report.pdf
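On the pre-processing side of the question, a common first step before OCR is to binarize the image so that text is pure black on a pure white background. The following is a minimal sketch of Otsu's thresholding in plain C#, not something taken from the answers above. It assumes the image has already been converted to an 8-bit grayscale pixel buffer (for example with OpenCV or by reading the bitmap data); the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper: Otsu binarization on an 8-bit grayscale buffer.
public static class OcrPreprocessing
{
    // Returns the threshold that maximizes the between-class variance
    // of the "dark" and "light" pixel groups (Otsu's method).
    public static int OtsuThreshold(byte[] gray)
    {
        if (gray == null || gray.Length == 0)
            throw new ArgumentException("Pixel buffer must not be empty.", nameof(gray));

        // Histogram of the 256 possible gray levels.
        var histogram = new long[256];
        foreach (byte value in gray) histogram[value]++;

        long total = gray.Length;
        double sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += i * (double)histogram[i];

        double sumDark = 0;
        long weightDark = 0;
        double bestVariance = -1;
        int bestThreshold = 0;

        for (int t = 0; t < 256; t++)
        {
            weightDark += histogram[t];
            if (weightDark == 0) continue;      // no dark pixels yet
            long weightLight = total - weightDark;
            if (weightLight == 0) break;        // all pixels are already in the dark class

            sumDark += t * (double)histogram[t];
            double meanDark = sumDark / weightDark;
            double meanLight = (sumAll - sumDark) / weightLight;
            double diff = meanDark - meanLight;
            double variance = (double)weightDark * weightLight * diff * diff;

            if (variance > bestVariance)
            {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    // Pixels above the threshold become white (255), all others black (0).
    public static byte[] Binarize(byte[] gray, int threshold)
    {
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
            result[i] = gray[i] > threshold ? (byte)255 : (byte)0;
        return result;
    }
}
```

The binarized buffer would then be written back into a bitmap and passed to engine.Process or PumaPage as before. Whether this helps depends on the input; scans with uneven lighting usually need an adaptive threshold instead of a single global one.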

Problems with OpenOffice Writer using C#

I am creating an OpenOffice Writer document with C#.
Any help would be appreciated. I no longer know whether I am coming or going, I have tried so many variations.
Using C#, has anybody successfully got the following to work? I just have a simple table of 2 columns and want to set the column widths to different values (the actual values are immaterial at this stage, as long as the widths are not identical).
This code is adapted from various web sources that were given as examples of how to set column widths. I cannot get it to work.
//For OpenOffice....
using unoidl.com.sun.star.lang;
using unoidl.com.sun.star.uno;
using unoidl.com.sun.star.bridge;
using unoidl.com.sun.star.frame;
using unoidl.com.sun.star.text;
using unoidl.com.sun.star.beans;
..............................
XTextTable odtTbl = (XTextTable) ((XMultiServiceFactory)oodt).createInstance("com.sun.star.text.TextTable");
odtTbl.initialize(10, 2);
XPropertySet xPS = (XPropertySet)odtTbl;
Object xObj = xPS.getPropertyValue("TableColumnSeparators"); // << Runtime ERROR
TableColumnSeparator[] xSeparators = (TableColumnSeparator[])xObj;
xSeparators[0].Position = 500;
xSeparators[1].Position = 5000;
xPS.setPropertyValue("TableColumnSeparators", new uno.Any(typeof(unoidl.com.sun.star.text.XTextTable),xSeparators));
// The runtime error points at the ; at the end of the Object line, with the message IllegalArgumentException
This is only one type of error out of all the combinations I have attempted. Not many of them executed at all, but the code above did actually run until the error.
What is the correct code for doing this in C#, please?
In addition, what is the correct C# code to set an OpenOffice Writer heading to a particular style (such as "Heading 1") so that it looks and prints like that style in the document?
Thank you.

unoidl.com.sun.star.uno.XComponentContext localContext = uno.util.Bootstrap.bootstrap();
unoidl.com.sun.star.lang.XMultiServiceFactory multiServiceFactory = (unoidl.com.sun.star.lang.XMultiServiceFactory)localContext.getServiceManager();
XComponentLoader componentLoader =(XComponentLoader)multiServiceFactory.createInstance("com.sun.star.frame.Desktop");
XComponent xComponent = componentLoader.loadComponentFromURL(
"private:factory/swriter", //a blank writer document
"_blank", 0, //into a blank frame use no searchflag
new unoidl.com.sun.star.beans.PropertyValue[0]);//use no additional arguments.
//object odtTbl = null;
//odtTbl = ((XMultiServiceFactory)xComponent).createInstance("com.sun.star.text.TextTable");
XTextDocument xTextDocument = (unoidl.com.sun.star.text.XTextDocument)xComponent;
XText xText = xTextDocument.getText();
XTextCursor xTextCursor = xText.createTextCursor();
XPropertySet xTextCursorProps = (unoidl.com.sun.star.beans.XPropertySet) xTextCursor;
XSimpleText xSimpleText = (XSimpleText)xText;
XTextCursor xCursor = xSimpleText.createTextCursor();
object objTextTable = null;
objTextTable = ((XMultiServiceFactory)xComponent).createInstance("com.sun.star.text.TextTable");
XTextTable xTextTable = (XTextTable)objTextTable;
xTextTable.initialize(2,3);
xText.insertTextContent(xCursor, xTextTable, false);
XPropertySet xPS = (XPropertySet)objTextTable;
uno.Any xObj = xPS.getPropertyValue("TableColumnSeparators");
// getPropertyValue returns a uno.Any; the separator array is in its Value property.
TableColumnSeparator[] xSeparators = (TableColumnSeparator[])xObj.Value;
xSeparators[0].Position = 2000;
xSeparators[1].Position = 3000;
// Wrap the array with its own type, TableColumnSeparator[], not XTextTable.
xPS.setPropertyValue("TableColumnSeparators", new uno.Any(typeof(TableColumnSeparator[]), xSeparators));
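The Position values are not absolute lengths: they are measured against the table's TableColumnRelativeSum property, so a position of 2000 only means something once that sum is known. The following is a minimal sketch of turning desired width proportions into separator positions. ColumnSeparators and FromWidths are hypothetical names, and the default of 10000 for the relative sum is an assumption; the actual value should be read from the table's TableColumnRelativeSum property.

```csharp
using System;

// Hypothetical helper: converts column width proportions into separator positions.
public static class ColumnSeparators
{
    // widths: relative width of each column, e.g. { 1, 2, 1 }.
    // relativeSum: the table's TableColumnRelativeSum (10000 is assumed here).
    // Returns one cumulative position per separator (column count minus one).
    public static short[] FromWidths(double[] widths, short relativeSum = 10000)
    {
        if (widths == null || widths.Length < 2)
            throw new ArgumentException("At least two columns are required.", nameof(widths));

        double total = 0;
        foreach (double w in widths)
        {
            if (w <= 0) throw new ArgumentException("Widths must be positive.", nameof(widths));
            total += w;
        }

        var positions = new short[widths.Length - 1];
        double running = 0;
        for (int i = 0; i < positions.Length; i++)
        {
            // Each separator sits at the right edge of column i.
            running += widths[i];
            positions[i] = (short)Math.Round(running / total * relativeSum);
        }
        return positions;
    }
}
```

For the three-column table above, FromWidths(new double[] { 1, 2, 1 }) yields the two positions to assign to xSeparators[0].Position and xSeparators[1].Position, giving a middle column twice as wide as the outer ones.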
