I have a pdf file which I am processing by converting it into text using the following coding..
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
During processing if I am seeing any type of ambiguity in the content means error in the data of the PDF file, I have to mark the entire line of the pdf(Color that line with Red) file but I am not able to analyze how to achieve that. Please help me.
As already mentioned in comments: What you essentially need is a SimpleTextExtractionStrategy replacement which not only returns text but instead text with positions. The LocationTextExtractionStrategy would be a good starting point for that as it collects the text with positions (to put it in the right order).
If you look into the source of LocationTextExtractionStrategy you'll see that it keeps its text pieces in a member List<TextChunk> locationalResult. A TextChunk (inner class in LocationTextExtractionStrategy) represents a text piece (originally drawn by a single text drawing operation) with location information. In GetResultantText this list is sorted (top-to-bottom, left-to-right, all relative to the text base line) and reduced to a string.
What you need, is something like this LocationTextExtractionStrategy with the difference that you retrieve the (sorted) text pieces including their positions.
Unfortunately the locationalResult member is private. If it was at least protected, you could simply have derived your new strategy from LocationTextExtractionStrategy. Instead you now have to copy its source to add to it (or do some introspection/reflection magic).
Your addition would be a new method similar to GetResultantText. This method might recognize all the text on the same line (just like GetResultantText does) and either
do the analysis / search for ambiguities itself and return a list of the locations (start and end) of any found ambiguities; or
put the text found for the current line into a single TextChunk instance together with the effective start and end locations of that line and eventually return a List<TextChunk> each of which represents a text line; if you do this, the calling code would do the analysis to find ambiguities, and if it finds one, it has the start and end location of the line the ambiguity is on. Beware, TextChunk in the original strategy is protected but you need to make it public for this approach to work.
Either way, you eventually have the start and end location of the ambiguities or at least of the lines the ambiguities are on. Now you have to highlight the line in question (as you say, you have to mark the entire line of the pdf(Color that line with Red)).
To manipulate a given PDF you use a PdfStamper. You can mark a line on a page by either
getting the UnderContent for that page from the PdfStamper and fill a rectangle in red there using your position data; this disadvantage of this approach is that if the original PDF already has underlayed the line with filled areas, your mark will be hidden thereunder; or by
getting the OverContent for that page from the PdfStamper and fill a somewhat transparent rectangle in red; or by
adding a highlight annotation to the page.
To make things even smoother, you might want to extend your copy of TextChunk (inner class in your copy of LocationTextExtractionStrategy) to not only keep the base line coordinates but also maximal ascent and descent of the glyphs used. Obviously you'd have to fill-in those information in RenderText...
Doing so you know exactly the height required for your marking rectangle.
Too long to be a comment; added as answer.
My good fellow and peer Adi, It depends a lot on your PDF contents. It's kind of hard to do a generic solution to something like this. What does currentText contain? Can you give an example of it? Also, if you have a lot of these PDFs to check, you need to get currentText of a few of them, just to make sure that your current PDF to string conversion produces the same result every time. If it is same every time from different PDFs; then you can start to automate.
The automation also depends a lot on your content, for example if current Text is something like this: Value: 10\nValue: 11\nValue: 9Value\n15 then what I recommend is going through every line, extracting the value and checking it against what you need it to be. This is untested semi-pseudo code that gives you an idea of what I mean:
var lines = new List<string>(currentText.Split('\n'));
var newlines = new List<string>();
foreach (var line in lines) {
if (line != "Value: 10") {
newLines.Add(line); // This line is correct, no marking needed
} else {
newlines.Add("THIS IS WRONG: " + line); // Mark as incorrect; use whatever you need here
}
}
// Next, return newlines to the user showing them which lines are bad so they can edit the PDF
If you need to automatically edit the existing PDF, this will be very, very, very hard. I think it's beyond the scope of my answer - I was answering how to identify the wrong lines and not how to mark them - sorry! Someone else please add that answer.
By the way; PDF is NOT a good format for doing something like this. If you have access to any other source of information, most likely the other one will be better.
Related
I want to be able to place 2 paragraphs after text in a MSWord document I am working on. It is all written in a custom compiler setup as an addin in word. Unfortunately I am a C# beginner and I cannot make heads or tails of it anymore.
Will Microsoft.Office.Interop.Word.Paragraph.Next(ref object count); be able to add even 1 paragraph if added in some way after text? E.g.
Somehow declaring Paragraph2 as 2 x Microsoft.Office.Interop.Word.Paragraph.Next(ref object count); and would be used as follows
string text = "Hello, World!";
BC.SetThisField(text + Paragraph2);
Resulting in (Pilcrows to represent what it would look like with paragraph characters showing):
Hello, World!¶
¶
The link(below) to Microsoft's documentation on this has simply led me down a road of researching ref, object and their definition of count which is apparently 1. I have been unsuccessful in using even the default of what is supposed to be.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.office.interop.word.paragraph.next?view=word-pia#Microsoft_Office_Interop_Word_Paragraph_Next_System_Object__
I have thought, according to similar questions asked on stackoverflow that I should use Paragraphs.Add(Object) Method instead, although I would still be stuck either way.
If I understand the question correctly, you want to insert text, followed by two paragraph marks? The simplest C# for that would be:
string text = "Hello, World!\n\n";
BC.SetThisField(text);
\n or \r generates a paragraph mark (ANSI 13) in Word.
Another possibility is to concatenate char(13).
There's also InsertParagraphAfter, but that requires an additional
step and a target Range object.
The Paragraphs.Add() will insert a single paragraph, but in order
to insert more than one it would have to be run in a loop and would
also require a Range object as the target.
Microsoft.Office.Interop.Word.Paragraph.Next(ref object count); will not insert paragraphs and will return an error if the code tries to perform an action with a paragraph that does not exist. (If code tried to assign text to the paragraph, for example.)
I want to read the some content pdf files. I just started before getting into the stuff I just want to know what the right approach to do so.
ItextSharp reader may be helpful in that case, so I converted the pdf into text using:
public static string pdfText(string path)
{
PdfReader reader = new PdfReader(path);
string text = string.Empty;
for(int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader,page);
}
reader.Close();
return text;
}
I'm still wondering if this approach seems OK, or if I should convert this pdf into excel and then read the content which I want instead.
Professionals thoughts will be appreciated.
With iText, you can also choose a specific strategy for extracting the text. But keep in mind that this is always a heuristic process.
Pdf documents essentially contain only the instructions needed to render the document for a viewer. So there is no concept of "text". More something like "draw character A at position 420, 890".
In order for any text-extraction to work, it needs to make some guesses on when two characters are close enough together that they should be concatenated, and when they should be apart.
Coincidentally, iText does this based on the width of a single space character in the font that is being used.
Keep in mind there could also be ActualText (this is a sort of text that gets hidden in the document, and is only used in extraction. It makes it possible to have the document render a character like "œ" (ligature version), which gets extracted as "oe" (non ligature version).
Depending on your input documents, you might want to look into the different implementations of ITextExtractionStrategy.
I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word adding headers and footers it stopped working. I wondered why and some debugging showed me this:
Which is the content of texts in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working on paragraph level, concatenating the content and then replacing the content of the paragraph, I fear I'm losing the formatting that should be kept as in
Do not use the name admin
Various things can fragment text runs. Most frequently proofing markup (as apparently is the case here, where there are "squigglies") or rsid (used to compare documents and track who edited what, when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go an element level "higher". In this case, get the list of Paragraph descendants and from there get all the Text descendants and concatenate their InnerText.
OpenXML is indeed fragmenting your text:
I created a library that does exactly this : render a word template with the values from a JSON.
From the documenation of docxtemplater :
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag by <w:t xml:space=\"preserve\"> to ensure that the space is not stripped out if they is any in your variables.
I'm doing a little program where the data saved on some users are stored in a text file. I'm using Sytem.IO with the Streamwriter to write new information to my text file.
The text in the file is formatted like so :
name1, 1000, 387
name2, 2500, 144
... and so on. I'm using infos = line.Split(',') to return the different values into an array that is more useful for searching purposes. What I'm doing is using a While loop to search for the correct line (where the name match) and I return the number of points by using infos[1].
I'd like to modify this infos[1] value and set it to something else. I'm trying to find a way to replace a word in C# but I can't find a good way to do it. From what I've read there is no way to replace a single word, you have to rewrite the complete file.
Is there a way to delete a line completely, so that I could rewrite it at the end of the text file and not have to worried about it being duplicated?
I tried using the Replace keyword, but it didn't work. I'm a bit lost by looking at the answers proposed for similar problems, so I would really appreciate if someone could explain me what my options are.
If I understand you correctly, you can use File.ReadLines method and LINQ to accomplish this.First, get the line you want:
var line = File.ReadLines("path")
.FirstOrDefault(x => x.StartsWith("name1 or whatever"));
if(line != null)
{
/* change the line */
}
Then write the new line to your file excluding the old line:
var lines = File.ReadLines("path")
.Where(x => !x.StartsWith("name1 or whatever"));
var newLines = lines.Concat(new [] { line });
File.WriteAllLines("path", newLines);
The concept you are looking for is called 'RandomAccess' for file reading/writing. Most of the easy-to-use I/O methods in C# are 'SequentialAccess', meaning you read a chunk or a line and move forward to the next.
However, what you want to do is possible, but you need to read some tutorials on file streams. Here is a related SO question. .NET C# - Random access in text files - no easy way?
You are probably either reading the whole file, or reading it line-for-line as part of your search. If your fields are fixed length, you can read a fixed number of bytes, keep track of the Stream.Position as you read, know how many characters you are going to read and need to replace, and then open the file for writing, move to that exact position in the stream, and write the new value.
It's a bit complex if you are new to streams. If your file is not huge, copying a file line for line can be done pretty efficiently by the System.IO library if coded correctly, so you might just follow your second suggestion which is read the file line-for-line, write it to a new Stream (memory, temp file, whatever), replace the line in question when you get to that value, and when done, replace the original.
It is most likely you are new to C# and don't realize the strings are immutable (a fancy way of saying you can't change them). You can only get new strings from modifying the old:
String MyString = "abc 123 xyz";
MyString.Replace("123", "999"); // does not work
MyString = MyString.Replace("123", "999"); // works
[Edit:]
If I understand your follow-up question, you could do this:
infos[1] = infos[1].Replace("1000", "1500");
I have a pdf file which i need to read and validate for its Correctness and if any wrong data Comes it should mark that Line with Red Color.Till Now i am able to read and Validate the Contents of the Pdf file by taking that into string but i am not getting how to make that line Colored,suppose Mark Red color in case any wrong data line comes.So my question is this that "How to search for Particular Line Contents in PDF and Make that Line Marked In Color".
Here is my code in c#..
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
if (currentText.Contains("1 . 1 To Airtel Mobile") && currentText.Contains("Total"))
{
int startPosition = currentText.IndexOf("1 . 1 To Airtel Mobile");
int endPosition = currentText.IndexOf("Total");
string result = currentText.Substring(startPosition, endPosition - startPosition);
// result will contain everything from and up to the Total line
using (StringReader reader = new StringReader(result))
{
// Loop over the lines in the string.
string[] split = line.Split(new Char[] { ' ' });
}
}
If the line Contents gets Validated Correct its Ok else Mark that Line with Red Color in PDF file
Please read the documentation before posting semi-duplicate questions, such as:
Edit an existing PDF file using iTextSharp
How to Read and Mark(Highlight) a pdf file using C#
You have received some very good feedback, such as the answer from Nenotlep that was initially deleted (I asked the moderators to have it restored). Especially the comment by mkl should have been very useful to you. It refers to Retrieve the respective coordinates of all words on the page with itextsharp and that's exactly what you're asking now, making your question a duplicate (a possible reason to have it removed from StackOverflow).
In his answer, mkl explains that you're taking your assignment too lightly. Instead of extracting pure text, you should extract TextRenderInfo objects. These objects contain information about the content (the actual text) as well as the position on the page. See for instance the ParsingHelloWorld example from chapter 15 of my book.
The method you're using returns the content of the PDF as a string. Similar to result1.txt which is the output of said example:
Hello World
In the same example, we parse a different PDF that has the exact same content when looked at by the human eye. However, when you parse the document, the content looks like this (see result2.txt):
ld
Wor
llo
He
The reason for this difference is inherent to the nature of PDF: the concept of lines doesn't really exist: you can add characters to a page in any which order you want. You don't even need to add complete words!
When you use the GetTextFromPage() method, you tell iText you don't want to get any info about the position of the text. Mlk has tried explaining this to you, but I'll try explaining it once more. In the example from my book, I have extended the RenderListener in a class named MyTextRenderListener. Now the output looks like this (see result3.txt).
<>
<<ld><Wor><llo><He>>
<<Hello People>>
This is the output of the same PDF we parsed when getting result2.txt. As you can see, we missed the words Hello People in the previous attempt.
The example is really simple: it just shows you have to text snippets are stored in the PDF. We get all the TextRenderInfo objects and we use the GetText() method to get the text. The order in which we get the text is the order that is used in the PDF's content stream.
When using a specific strategy, such as the LocationTextExtractionStrategy, iText retrieves all these objects and it used the GetBaseline() method to sort all the text snippets.
<<ld><Wor><llo><He>>
results in:
<<He><llo><Wor><ld>>
Then iText looks at the distance between the different snippets. In this case, iText adds a space between the <llo> and <Wor> snippet.
You are now looking to do the same thing: you are going to write a system that is going to retrieve all the text snippets, that is going to order them, examine them, and based on the composed content, you are going to add a background at those locations.