Using the Aspose DLL, I want to find a particular paragraph by a keyword and repeat that paragraph in the document, directly below the original paragraph.
Here is an example:
---------------------------------
Document Content..................
.................................
.................................
..................................
Dated This [Date of Report]
[Name of all existing directors]
Director
Here we need to create this block for each director; the directors come dynamically from the database.
The document content is the same for all the directors.
Your example is not very clear. However, using IReplacingCallback, you can find the paragraph that contains the key string. The following is a simple example:
using System.Text.RegularExpressions;
using Aspose.Words;

static void Main(string[] args)
{
    // Load the document
    Document doc = new Document("C:\\data\\Testing.doc");
    // Regular expression for finding the "Full Name" key string
    Regex regex = new Regex("Full Name", RegexOptions.IgnoreCase);
    // Find the text and insert a copy of its paragraph
    doc.Range.Replace(regex, new ReplaceEvaluatorFindandHighlight(), true);
    doc.Save("C:\\data\\document_new.doc");
}

// Class that handles each match of the key string
private class ReplaceEvaluatorFindandHighlight : IReplacingCallback
{
    /// <summary>
    /// This method is called by the Aspose.Words find and replace engine for each match.
    /// It writes the text of the matched paragraph again as a new paragraph below it.
    /// </summary>
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        // This is a Run node that contains either the beginning or the complete match.
        Node currentNode = e.MatchNode;
        // Use DocumentBuilder to navigate to the end of the paragraph that holds the match
        DocumentBuilder builder = new DocumentBuilder((Document)e.MatchNode.Document);
        builder.MoveTo(currentNode.ParentNode);
        // Insert a paragraph break so the copy starts on its own paragraph
        builder.InsertParagraph();
        // Write the text of the matched paragraph, followed by a paragraph break
        builder.Writeln(currentNode.ParentNode.ToString(SaveFormat.Text));
        // Signal to the replace engine to do nothing because we have already done everything we wanted.
        return ReplaceAction.Skip;
    }
}
Refer to the Aspose.Words documentation for in-depth details on finding text or extracting paragraphs as per your requirement.
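If the copy should keep its formatting rather than be rewritten as plain text, cloning the paragraph node may be a better fit. The following is a minimal sketch, not a definitive implementation: the class name ParagraphRepeater and the method RepeatParagraphs are made up for illustration, and it assumes the key string sits inside a single paragraph.

```csharp
using System;
using Aspose.Words;

static class ParagraphRepeater
{
    // Duplicates every paragraph whose text contains `key`, placing the copy
    // directly below the original. Returns the number of copies made.
    // (Hypothetical helper, not part of Aspose.Words.)
    public static int RepeatParagraphs(Document doc, string key)
    {
        int copies = 0;
        // Snapshot the paragraphs first so that inserting clones does not disturb the loop.
        Node[] paragraphs = doc.GetChildNodes(NodeType.Paragraph, true).ToArray();
        foreach (Node node in paragraphs)
        {
            if (node.GetText().IndexOf(key, StringComparison.OrdinalIgnoreCase) < 0)
                continue;
            // Clone(true) deep-copies the paragraph together with its runs and formatting.
            Node copy = node.Clone(true);
            node.ParentNode.InsertAfter(copy, node);
            copies++;
        }
        return copies;
    }
}
```

Calling ParagraphRepeater.RepeatParagraphs(doc, "[Name of all existing directors]") once adds one extra copy of the signature line; for N directors you would repeat the clone step N - 1 times and then fill in each name.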
I have an ASP.NET Core 2.0 C# application that reads/parses a PDF file and extracts the text. From this text I want to read a specific value that has a specific label name. As you can see in the image below, I want to get the value 171857, which is the invoice number, and store it in the database.
I have tried the code below to read the PDF using iTextSharp.
using (PdfReader reader = new PdfReader(fileName))
{
    StringBuilder sb = new StringBuilder();
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
        if (!string.IsNullOrWhiteSpace(text))
        {
            sb.Append(Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
        }
    }
    var pdfText = sb.ToString();
}
In the pdfText variable I get all the text content from the PDF, but this does not seem to be the proper way to get the invoice number. Is there any other way to read specific content from a PDF by its label name, for example by providing the label name Invoice and getting back the value 171857, perhaps with another third-party PDF reader library?
Any help or suggestions would be highly appreciated.
Thanks
I have helped a friend extract a similar value from a PDF invoice generated by Excel. For this answer I created an Excel invoice, printed it as a PDF file, and zipped it for download for testing purposes.
Next, I am using an open-source and free library called PDFClown. Here is the NuGet package for it.
What I did is scan the whole PDF document (an invoice can be one page or multiple pages) and add each piece of content to a list of strings.
The next step is to find the index of the invoice value, relative to what I will call the tag or label (the invoice number could be the 10th element in the list; in our case it is at index 1).
Since I do not have your PDF file, I improvised and added a unique tag called "INVOICE" (any other name would do). The invoice number in this case comes right after the invoice tag. So I find the index of the "INVOICE" tag and add 1 to it, because the invoice number follows the invoice tag. This way I pick up the invoice text 0005 and return it as the value 5. In the same way you can fetch whatever text/value follows any tag in the scanned list and return it in the form that you need.
So you need to play with it a bit to fit it 100% to your PDF file.
Here are my test files (Excel and PDF), zipped. Download them for your test.
Here is the code:
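using System;
using System.Collections.Generic;
using org.pdfclown.documents.contents;
using org.pdfclown.documents.contents.objects;
using File = org.pdfclown.files.File;

public class InvoiceTextExtraction
{
    private List<string> _contentList;

    public void GetValueFromPdf()
    {
        _contentList = new List<string>();
        CreatePdfContent(@"C:\temp\Invoice1.pdf");
        var tagIndex = _contentList.FindIndex(e => e == "INVOICE");
        // FindIndex returns -1 when the tag is missing; also make sure a value follows the tag.
        if (tagIndex < 0 || tagIndex + 1 >= _contentList.Count)
        {
            Console.WriteLine("INVOICE tag not found");
            return;
        }
        // The invoice number is the element right after the tag.
        int.TryParse(_contentList[tagIndex + 1], out var value);
        Console.WriteLine(value);
    }

    // Scans every page of the PDF and collects the text fragments in _contentList.
    public void CreatePdfContent(string filePath)
    {
        using (var file = new File(filePath))
        {
            var document = file.Document;
            foreach (var page in document.Pages)
            {
                Extract(new ContentScanner(page));
            }
        }
    }

    // Walks the content stream recursively and decodes each shown text fragment.
    private void Extract(ContentScanner level)
    {
        if (level == null)
            return;
        while (level.MoveNext())
        {
            var content = level.Current;
            switch (content)
            {
                case ShowText text:
                {
                    var font = level.State.Font;
                    _contentList.Add(font.Decode(text.Text));
                    break;
                }
                case Text _:
                case ContainerObject _:
                    Extract(level.ChildLevel);
                    break;
            }
        }
    }
}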
This is the content extracted from the PDF file. The scan returns the following elements:
INVOICE
0005
PAYMENT DUE BY:
4/19/2019
.etc
.
.
.
Tax
USD TOTAL
171857
18 september 2019
and here is the result
5
The code is inspired by this link.
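The "find the tag, then take the next element" step can be pulled out into a small helper so that it works for any tag and does not fail when the tag is missing or is the last element. This is only a sketch; the names TagLookup and TryGetValueAfter are made up for illustration, and it assumes the scanned list keeps each label and its value as separate, adjacent elements.

```csharp
using System;
using System.Collections.Generic;

static class TagLookup
{
    // Looks for `tag` in `items` and returns the element right after it.
    // Returns false when the tag is missing or is the last element.
    public static bool TryGetValueAfter(IReadOnlyList<string> items, string tag, out string value)
    {
        for (int i = 0; i < items.Count - 1; i++)
        {
            if (string.Equals(items[i].Trim(), tag, StringComparison.OrdinalIgnoreCase))
            {
                value = items[i + 1].Trim();
                return true;
            }
        }
        value = null;
        return false;
    }
}
```

With the list shown above, TagLookup.TryGetValueAfter(list, "INVOICE", out var invoiceText) yields "0005", and the same call with "PAYMENT DUE BY:" yields "4/19/2019".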
This assumes that the invoice label and invoice number are embedded as text in the PDF and not as a bitmap.
One way I can think of doing this is to use Spire.PDF to extract the location of the label and then find the number written right below that location. This will be relatively simple if all the PDFs you want to process share the same template.
It isn't immediately clear from the question whether pdfText will contain the invoice number along with the rest of the text, but I'll assume it does. If it doesn't, then you will need OCR, which is a different beast entirely.
My first instinct would be to build a regex (^\d{6}$ in this case) and try to apply it to all the text on the page. If there is only one match (the invoice number), then great! Otherwise, if it matches more things, you could find all occurrences and look for a pattern. For example, if customers had an ID that also matched that regex, you could extract all lines which contain a matching number and discard all lines that contain some other info (maybe all lines with a customer number would also have a date in a specific format, for instance). Basically, find all occurrences where the regex could match, and try to find rules to exclude all the occurrences you don't care about.
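One possible exclusion rule is to anchor the regex to the label itself, so that only digits directly following the label are accepted. The sketch below uses only System.Text.RegularExpressions; the names LabelValueFinder and FindNumberAfterLabel are made up for illustration, and it assumes the extracted text keeps the label immediately before its value (separated only by whitespace, a colon, or a hash sign).

```csharp
using System;
using System.Text.RegularExpressions;

static class LabelValueFinder
{
    // Returns the first run of digits that follows `label`, optionally separated
    // by whitespace (including line breaks), ':' or '#'. Returns null when
    // nothing matches.
    public static string FindNumberAfterLabel(string text, string label)
    {
        // Regex.Escape keeps characters such as '.' or '(' in the label literal.
        string pattern = Regex.Escape(label) + @"[\s:#]*(\d+)";
        Match match = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

For example, LabelValueFinder.FindNumberAfterLabel(pdfText, "Invoice") would return the digits that come right after the first "Invoice" label that is followed by a number, and null if the PDF text does not place the two next to each other.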
I am using Microsoft.Office.Interop.Word:
range.Find.MatchWildcards = true;
range.Find.Text = "#sg*";
range.Find.ClearFormatting();
while (range.Find.Execute())
{
    // create a local Range containing only a single found string
    var tagname = new Tags
    {
        TagName = range.Text
    };
}
I can't retrieve the original text that the wildcard matched. For example, #sgdate is the whole word that matches, but I only get back the wildcard portion when I ask for range.Text. How do I get the full text?
After further study of the code: only the replaced items have a #, so just do the wildcard search on the hash; you can add the hash back when you are constructing the text.
I need to search a document for strings enclosed in <>. So if the application finds the variable within the document, it replaces that variable with DateTime.Today.ToShortDateString(). For instance:
string filename = "C:\\Temp\\" + appNum + "_ReceiptOfApplicationLtr.docx";
if (File.Exists((string)filename))
{
    using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filename, true))
    {
        var body = wordDoc.MainDocumentPart.Document.Body;
        foreach (var text in body.Descendants<Text>())
        {
            if (text.Text == "<TodaysDate>")
            {
                text.Text = text.Text.Replace("<TodaysDate>", DateTime.Today.ToShortDateString());
            }
        }
        using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
        {
            sw.Write(filename);
        }
    }
}
When it searches the Descendants<Text>, it finds the first <, then TodaysDate, and finally >. The issue is that it never finds the whole string <TodaysDate>. Can anyone help me out?
Open XML can store text in different text tags inside the same run. What I would do if I were you is just find the Run where your string is stored and use the InnerText property to find all the text inside that run.
For example:
Run runToFind = body.Descendants<Run>()
    .FirstOrDefault(r => r.InnerText.Contains("<TodaysDate>"));
Then, if a run was found, you can replace it with another one:
if (runToFind != null)
{
    runToFind.Parent.ReplaceChild(new Run(new Text(DateTime.Now.ToShortDateString())), runToFind);
}
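Word can also split a placeholder across several runs, in which case no single Run contains it. A paragraph-level variant of the same idea is sketched below, assuming the Open XML SDK is referenced; the names PlaceholderReplacer and ReplaceInParagraphs are made up for illustration. Note the trade-off: all text of an affected paragraph ends up in its first text element, so formatting that differed between runs is lost.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

static class PlaceholderReplacer
{
    // Replaces `placeholder` in every paragraph of `body`, even when the
    // placeholder is split across several runs. Returns the number of
    // paragraphs that were changed.
    public static int ReplaceInParagraphs(Body body, string placeholder, string value)
    {
        int changed = 0;
        foreach (Paragraph paragraph in body.Descendants<Paragraph>())
        {
            var texts = paragraph.Descendants<Text>().ToList();
            if (texts.Count == 0) continue;

            // Join the text fragments so a split placeholder becomes visible.
            string joined = string.Concat(texts.Select(t => t.Text));
            if (!joined.Contains(placeholder)) continue;

            // Put the whole replaced string in the first text element and blank the rest.
            texts[0].Text = joined.Replace(placeholder, value);
            texts[0].Space = SpaceProcessingModeValues.Preserve;
            foreach (Text t in texts.Skip(1)) t.Text = string.Empty;
            changed++;
        }
        return changed;
    }
}
```

A call such as PlaceholderReplacer.ReplaceInParagraphs(body, "<TodaysDate>", DateTime.Today.ToShortDateString()) would go where the foreach loop over Descendants<Text>() sits in the question.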
For anyone still struggling with this - you can check out this library
https://github.com/antonmihaylov/OpenXmlTemplates
With it, instead of searching for special tags in the text (because of the problems described in Thomas Barnekow's comment), you add a content control to the document and specify the name of the variable you want to replace in the content control's tag name.
You can then feed it JSON data or a regular C# dictionary object and the text will get replaced.
Note - I am the maker of that library, but I have no financial gain from it - it is open source and under active development (and always looking for contributors!)
I'm trying to add a hyperlink to some XML being inserted in a field of a Word document (using Microsoft.Office.Interop.Word).
The XML being inserted contains multiple paragraphs, each containing some text that should be converted to a hyperlink. The text that contains the hyperlink is extracted from the end of the paragraph after the "Available at " substring is found.
The following code is able to create the hyperlink, but the first hyperlink is always applied to all paragraphs. I was expecting the code to create a hyperlink for each of the paragraphs being iterated.
My guess is that the paragraph.Range object is pointing to text that is in fact the whole XML inserted as opposed to the text contained within the paragraph. I've also confirmed that the paragraph.Range.Text property returns the correct text for each paragraph so I am completely confused as to what should be expected for the Range property.
Any ideas? Thanks in advance.
if (!string.IsNullOrWhiteSpace(bibliography))
{
    const string linkToken = "Available at ";
    field.Result.InsertXML(bibliography);
    foreach (Paragraph paragraph in field.Result.Paragraphs)
    {
        var paragraphText = paragraph.Range.Text;
        var indexOfLink = paragraphText.IndexOf(linkToken, StringComparison.OrdinalIgnoreCase);
        if (indexOfLink >= 0)
        {
            var linkStart = indexOfLink + linkToken.Length;
            var linkPart = paragraphText.Substring(linkStart);
            Uri uriFound;
            if (Uri.TryCreate(linkPart, UriKind.Absolute, out uriFound))
            {
                object linkAddress = uriFound.ToString();
                paragraph.Range.Hyperlinks.Add(paragraph.Range, ref linkAddress);
            }
        }
    }
}
I need to retain paragraph breaks in a .docx file, but get rid of linebreaks which are often in the wrong place when copying from one file to another (due to different page sizes, and when the font is changed).
Using the DocX Library, I'm trying this:
private void ReplaceLineBreaksWithBoo(string filename)
{
    List<string> lineBreaks;
    using (DocX document = DocX.Load(filename))
    {
        lineBreaks = document.FindUniqueByPattern("\n", System.Text.RegularExpressions.RegexOptions.None);
        if (lineBreaks.Count > 0)
        {
            foreach (string s in lineBreaks)
            {
                document.ReplaceText(s, string.Empty); // <-- or a space?
            }
        }
        document.Save();
    }
}
...but it doesn't work - "\n" is not the right thing to pass, I reckon; I don't know what I need for that first arg to the FindUniqueByPattern() method. Documentation is nil and the discussion forum there resembles Bodie, California:
I guess you can't do it using FindUniqueByPattern or FindAll. A newline is not represented by any symbol but is stored as a paragraph with empty text. You can peek at the document's XML representation through the document.Xml property; there you'll see an empty line stored as a single <w:p> element.
Therefore you can search for paragraphs with empty text instead of searching for a newline character:
using (DocX document = DocX.Load(filename))
{
    // ToList() (from System.Linq) takes a snapshot so paragraphs can be removed while looping.
    var emptyLines = document.Paragraphs.Where(o => string.IsNullOrEmpty(o.Text)).ToList();
    foreach (var paragraph in emptyLines)
    {
        paragraph.Remove(false);
    }
    document.Save();
}