I'm using Aspose to replace a set of words in an existing PDF file via a WPF application.
I took a look at this https://docs.aspose.com/display/pdfnet/Replace+Text+in+a+PDF+Document
But the section 'Text Replacement should automatically re-arrange Page Contents' didn't solve my problem, even though the problem it describes is exactly the one I have:
"However recently some customers encountered issues during text
replace when particular TextFragment is replaced with smaller contents
and some extra spaces are displayed in resultant PDF or in case the
TextFragment is replaced with some longer string, then words overlap
existing page contents."
The problem is that if I replace a word with a shorter or longer one, there's either a blank gap or an overlap.
I tried some options such as
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction =
TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
or AdjustSpaceWidth
but it has no effect at all.
My code is the same as in the article linked above, except that I only replace the text and leave everything else untouched.
Also my TextFragmentAbsorber looks like this
var textFragmentAbsorber = new TextFragmentAbsorber("(?i)("+ text.OriginalText +")", new TextSearchOptions(true));
where text.OriginalText is the text I want to replace, matched case-insensitively via the regex.
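For reference, here is a minimal sketch of the full flow, based on the linked article; "input.pdf", "output.pdf" and text.ReplacementText are placeholders for my actual values:

var document = new Aspose.Pdf.Document("input.pdf");
var textFragmentAbsorber = new TextFragmentAbsorber("(?i)(" + text.OriginalText + ")", new TextSearchOptions(true));
// the adjustment option that, per the article, should re-arrange page contents
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction = TextReplaceOptions.ReplaceAdjustment.AdjustSpaceWidth;
document.Pages.Accept(textFragmentAbsorber);
foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    textFragment.Text = text.ReplacementText; // replace the text, leave everything else untouched
}
document.Save("output.pdf");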
I am using the following code to extract text from the first page of PDF files with iTextSharp :
public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
    }
    return text;
}
It works quite well for many PDFs, but not for some others.
Working PDF : http://data.hexagosoft.com/LFBO.pdf
Not working PDF : http://data.hexagosoft.com/LFBP.pdf
These two PDFs seem quite similar, but one works and the other doesn't.
I guess the fact that their Producer tags differ is a clue here.
Another clue is that the function works for any other page of the PDF that doesn't contain a chart.
I also tried with Ghostscript, without success.
The Encoding line seems to be useless as well.
How can I extract the text of the first page of the non-working PDF using iTextSharp?
Thanks
Both documents use fonts with unofficial glyph names in their Encoding/Differences arrays, and neither provides a ToUnicode map. The glyph naming seems to be fairly straightforward: the number following the MT prefix is the ASCII code of the glyph used.
The first document works because the mapping is not actually changed, so iText will (I guess) fall back to the default encoding:
/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]
The other document really changes the mapping:
/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]
This means that, for example, character code 2 maps to the glyph named MT76. That is an unofficial/private glyph name that iText doesn't know, so it has no information beyond the character code 2 and will (I guess) use that code in the final result.
Without implementing custom logic for the MT-prefixed glyph names, it's impossible to get the correct text out of this document. In any case, it is nowhere defined that a glyph name consisting of MT followed by an integer maps to that ASCII value; that's simply an accident, or a convention of whatever font designer or creation tool produced it.
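If you do want to special-case such files, a hypothetical helper along these lines could exploit the observed convention; note that the MT-to-ASCII mapping is an assumption based on these two files, not a documented rule:

static char? GlyphNameToChar(string glyphName)
{
    // Assumed convention: "MT" followed by the ASCII code of the glyph.
    if (glyphName.StartsWith("MT")
        && int.TryParse(glyphName.Substring(2), out int ascii)
        && ascii >= 32 && ascii <= 126)
    {
        return (char)ascii;
    }
    return null; // unknown glyph name, no mapping
}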
The 2nd PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs, but the text representation was not correctly encoded for some reason when this PDF was generated. If you have a lot of files like this, a working approach could be to:
- detect broken pages while extracting text, by searching for some phrase that should appear on every page, e.g. "service" (sketched below);
- process these pages separately using OCR, with a tool like Tesseract with a .NET wrapper.
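A minimal sketch of that screening step, assuming iTextSharp 5 and using "service" (the example phrase above) as the per-page sanity check; pages that fail the check would then be rendered to images and sent to OCR:

var brokenPages = new List<int>();
using (var pdfReader = new PdfReader(fileName))
{
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        string pageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy());
        if (pageText.IndexOf("service", StringComparison.OrdinalIgnoreCase) < 0)
        {
            brokenPages.Add(page); // extraction looks broken: route this page to OCR
        }
    }
}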
I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word to add headers and footers, it stopped working. I wondered why, and some debugging showed me that the placeholder text is split across several Text elements in the texts variable in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
    var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the placeholder appears as one contiguous piece of text.
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working at the paragraph level, concatenating the content and then replacing the content of the paragraph, is that I fear losing the formatting that should be kept, as in:
Do not use the name admin
Various things can fragment text runs. Most frequently it's proofing markup (as is apparently the case here, where there are "squigglies") or rsids (used to compare documents and track who edited what, and when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go one element level "higher". In this case, get the list of Paragraph descendants, and from each paragraph get all the Text descendants and concatenate their InnerText, as in the sketch below.
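A minimal sketch of that approach, assuming the DocumentFormat.OpenXml SDK and the {var:Template1} placeholder from the question; the actual replacement logic is left out:

using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
    foreach (Paragraph paragraph in wordDoc.MainDocumentPart.Document.Body.Descendants<Paragraph>())
    {
        // InnerText concatenates the paragraph's (possibly fragmented) Text elements
        if (paragraph.InnerText.Contains("{var:Template1}"))
        {
            // rebuild the runs / replace the placeholder here
        }
    }
}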
OpenXML is indeed fragmenting your text.
I created a library that does exactly this : render a word template with the values from a JSON.
From the documentation of docxtemplater:
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag with <w:t xml:space="preserve"> to ensure that spaces are not stripped out if there are any in your variables.
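For a single paragraph, a stripped-down version of that merge could look like the following sketch (DocumentFormat.OpenXml; ReplaceInParagraph is a hypothetical helper, and unlike a real templating engine it collapses all runs into the first one, losing per-run formatting):

static void ReplaceInParagraph(Paragraph paragraph, string tag, string value)
{
    var texts = paragraph.Descendants<Text>().ToList();
    if (texts.Count == 0) return;

    // join the fragmented pieces so a split "{tag}" becomes visible
    string combined = string.Concat(texts.Select(t => t.Text));
    if (!combined.Contains(tag)) return;

    texts[0].Text = combined.Replace(tag, value);
    texts[0].Space = SpaceProcessingModeValues.Preserve; // keep any spaces in the value
    foreach (var extra in texts.Skip(1))
    {
        extra.Text = string.Empty;
    }
}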
I have a PDF file which I need to read and validate for correctness; if any wrong data comes up, the corresponding line should be marked in red. So far I am able to read and validate the contents of the PDF file by extracting them into a string, but I am not getting how to color a line, i.e. mark it red whenever a line with wrong data comes up. So my question is: "How do I search for particular line contents in a PDF and mark that line in color?"
Here is my code in C#:
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
if (currentText.Contains("1 . 1 To Airtel Mobile") && currentText.Contains("Total"))
{
    int startPosition = currentText.IndexOf("1 . 1 To Airtel Mobile");
    int endPosition = currentText.IndexOf("Total");
    string result = currentText.Substring(startPosition, endPosition - startPosition);
    // result will contain everything from and up to the Total line
    using (StringReader reader = new StringReader(result))
    {
        // Loop over the lines in the string.
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] split = line.Split(new Char[] { ' ' });
        }
    }
}
If the line contents validate as correct, it's OK; otherwise, mark that line in red in the PDF file.
Please read the documentation before posting semi-duplicate questions, such as:
Edit an existing PDF file using iTextSharp
How to Read and Mark(Highlight) a pdf file using C#
You have received some very good feedback, such as the answer from Nenotlep that was initially deleted (I asked the moderators to have it restored). The comment by mkl especially should have been very useful to you. It refers to Retrieve the respective coordinates of all words on the page with itextsharp, and that's exactly what you're asking now, making your question a duplicate (a possible reason to have it removed from Stack Overflow).
In his answer, mkl explains that you're taking your assignment too lightly. Instead of extracting pure text, you should extract TextRenderInfo objects. These objects contain information about the content (the actual text) as well as the position on the page. See for instance the ParsingHelloWorld example from chapter 15 of my book.
The method you're using returns the content of the PDF as a string, similar to result1.txt, which is the output of said example:
Hello World
In the same example, we parse a different PDF that has the exact same content when looked at by the human eye. However, when you parse the document, the content looks like this (see result2.txt):
ld
Wor
llo
He
The reason for this difference is inherent to the nature of PDF: the concept of lines doesn't really exist; you can add characters to a page in whatever order you want. You don't even need to add complete words!
When you use the GetTextFromPage() method, you tell iText you don't want any info about the position of the text. mkl has tried explaining this to you, but I'll try explaining it once more. In the example from my book, I have extended RenderListener in a class named MyTextRenderListener. Now the output looks like this (see result3.txt):
<>
<<ld><Wor><llo><He>>
<<Hello People>>
This is the output of the same PDF we parsed when getting result2.txt. As you can see, we missed the words Hello People in the previous attempt.
The example is really simple: it just shows how the text snippets are stored in the PDF. We get all the TextRenderInfo objects and we use the GetText() method to get the text. The order in which we get the text is the order in which it appears in the PDF's content stream.
When using a specific strategy, such as the LocationTextExtractionStrategy, iText retrieves all these objects and uses the GetBaseline() method to sort the text snippets.
<<ld><Wor><llo><He>>
results in:
<<He><llo><Wor><ld>>
Then iText looks at the distance between the different snippets. In this case, iText adds a space between the <llo> and <Wor> snippets.
You are now looking to do the same thing: you are going to write a system that retrieves all the text snippets, orders them, examines them, and, based on the composed content, adds a background at those locations.
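A rough sketch of the listener part, assuming iTextSharp 5; MySnippetListener is a hypothetical name, and the sorting, validation and drawing (e.g. a red rectangle added via PdfStamper) are left out:

class MySnippetListener : IRenderListener
{
    public void RenderText(TextRenderInfo renderInfo)
    {
        string snippet = renderInfo.GetText();
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        // collect snippet + coordinates here; later sort by baseline,
        // compose lines, validate them, and draw a red background
        // at the coordinates of any line that fails validation
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}

// usage: new PdfReaderContentParser(pdfReader).ProcessContent(page, new MySnippetListener());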
I want to create an application which will be able to parse doc/docx files. The structure of these files is shown below:
par-000.01 - some content
par-000.21 - some content
par-000.31 - some content
par-001.32 - some content
The content could be multi-line and irregular. What I want to do is put this content into a database: for the first record, par-000.01 goes into the code column and some content into the text column. The reason why I cannot do this manually is that I have about 15 docs, each containing about 10 pages of paragraphs, that I want to put into my database. I can't find any article on how to parse a whole doc file, but I believe it should be possible if I write a proper regular expression. Can anyone point me to an article on how to do what I want? I can't find anything that suits me; I'm probably using the wrong keywords.
Since you say you have a reasonable amount of data (15 docs * 10 pages/doc * ~100 lines/page = 15,000 lines, which is manageable in a Word document), and you did not say this is a repeating data feed (i.e. it's a one-time conversion), I would do it with an editor that supports global find-and-replace, converting to Comma-Separated Values (CSV) format. Most databases I know can load a CSV file.
I know you asked for a C# app, but that is overkill in time and effort for your problem.
So:

1. Convert '<start of line>' to '<start of line>"'.
   In MS Word, with Find and Replace:
   find: ^p
   replace: ^&"

2. Convert ' - ' to '","'.
   In MS Word, with Find and Replace:
   find: ' - ' (note: don't include the tick marks)
   replace: ","

3. Convert '<end of line>' to '"<end of line>'.
   In MS Word, with Find and Replace:
   find: ^p
   replace: "^&

4. Manually fix up the start of the first line and the end of the last line.
You should get:
"par-000.01","some content"
"par-000.21","some content"
Now just load that into a DB using its CSV load.
Also, if you insist on doing this with C#, then realize that you can probably save the text as a *.txt file without all of the Word tags, and it will be much easier to take apart with a C# app. Don't get fixated on the Word tags; just sidestep the problem with creative thinking. A sketch of that route follows below.
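If you do go the C# route, here is a hedged sketch under these assumptions: the document was saved as paragraphs.txt (the name is illustrative), every record starts with a par-XXX.XX code, and continuation lines belong to the previous record:

// requires System.IO, System.Collections.Generic and System.Text.RegularExpressions
var codes = new List<string>();
var texts = new List<string>();
var pattern = new Regex(@"^(par-\d{3}\.\d{2})\s*-\s*(.*)$");

foreach (string line in File.ReadLines("paragraphs.txt"))
{
    Match match = pattern.Match(line);
    if (match.Success)
    {
        codes.Add(match.Groups[1].Value); // e.g. "par-000.01" -> code column
        texts.Add(match.Groups[2].Value); // "some content"    -> text column
    }
    else if (texts.Count > 0)
    {
        // multi-line content: append the continuation line to the previous record
        texts[texts.Count - 1] += " " + line.Trim();
    }
}
// insert codes[i] / texts[i] into the database here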
You can automate the parsing of Word documents (.doc or .docx) in C# using the GroupDocs.Parser for .NET API. The text can be extracted from the documents either line by line or as a whole. This is how you can do it:
// Extract all the text
WordsTextExtractor extractor = new WordsTextExtractor("sample.docx");
Console.Write(extractor.ExtractAll());

// OR extract the text line by line
string line = extractor.ExtractLine();
// If the line is null, the end of the file is reached
while (line != null)
{
    // Print the line to the console
    Console.Write(line);
    // Extract another line
    line = extractor.ExtractLine();
}
Disclosure: I work as Developer Evangelist at GroupDocs.
I'm creating a program to transfer text from a Word document to a database. During some testing I came across some unexpected text inside a textbox after setting its text to a table cell range as follows:
textBox1.Text = oDoc.Tables[1].Cell(1, 3).Range.Text;
What appeared in the form was the expected text plus a dot at the end, and I have no idea what it is supposed to represent. The dot can be highlighted, but if you try to copy and paste it, nothing appears. You can delete the dot manually. Can anyone help me identify what this is?
The identification bit shouldn't be too hard:
string text = oDoc.Tables[1].Cell(1, 3).Range.Text;
textBox1.Text = ((int) text[4]).ToString("x4");
That will give you the Unicode UTF-16 code unit for that character... you can then find out what it is on the Unicode web site. (I usually look at the Charts page or the directory of PDFs and guess which chart it will be in based on the numbering - it's not ideal, and there are probably better ways, but it's always worked well enough for me...)
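If the dot turns out not to sit at index 4, a variant that dumps every UTF-16 code unit in the cell text may be more convenient; this is just an illustrative extension of the snippet above:

string text = oDoc.Tables[1].Cell(1, 3).Range.Text;
foreach (char c in text)
{
    // print each character next to its code unit, e.g. "H -> U+0048"
    Console.WriteLine("{0} -> U+{1:X4}", char.IsControl(c) ? '?' : c, (int)c);
}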
Of course when you've identified it you'll still need to work out what the heck it's doing there... does the original Word document just have "HOLD"?