I am using the Microsoft.Office.Interop.Word;
range.Find.MatchWildcards = true;
range.Find.Text = "#sg*";
range.Find.ClearFormatting();
while (range.Find.Execute())
{
// create a local Range containing only a single found string
var tagname = new Tags
{
TagName = range.Text
};
}
I can't retrieve the original text that the wildcard matched. For example, #sgdate is the whole-word match, but I only get back the wildcard portion when I ask for range.Text. How do I get the full text?
After further study of the code: only the replaced items contain a #, so you can just run the wildcard search on the hash and add the hash back when you construct the text.
I have a list of paragraphs. Each paragraph can contain text. I'm trying to search for a string that may sit as a whole within a single paragraph, or be spread across multiple paragraphs, with the worst case being that each letter is in a different paragraph.
public List<WordParagraph> FindText(string text) {
List<WordParagraph> list = new List<WordParagraph>();
var found = false;
Paragraph currentParagraph = null;
foreach (var paragraph in this.Paragraphs) {
//if (currentParagraph == null) {
// currentParagraph = paragraph._paragraph;
//} else {
// if (currentParagraph != paragraph._paragraph) {
// found = false;
// }
//}
// paragraph.Text
// logic missing to find text that can start within some paragraph.Text, but
// can span across multiple paragraphs
// for example searching for text "This Is MyTest" within 4 paragraphs that
// may be written like
// paragraph.Text = "Thi"
// paragraph.Text = "s Is"
// paragraph.Text = " MyTes"
// paragraph.Text = "t"
}
return list;
}
I've tried some logic around foreach char in text, and nested loop over text from the paragraph.text but the logic was failing me.
To give you a bit of background: consider a Word document that has a single sentence, one long sentence in which each word, or even each letter, is formatted differently (different font size, bold, underline or whatever). It looks like this:
Now what Word actually saves in the file is a single paragraph, but that paragraph has multiple "runs". Each run contains a Text element. Each Text element contains the text that you see in Word, but because the formatting may differ for every word, the sentence can be split into many small Text properties.
In my example I've simplified the logic, and for me each "run" is a paragraph with a text. So the list of WordParagraphs is the list of runs within the screenshot you see.
Now I need to find the string "I have that" in the whole sentence you see in Word. That means I need to go through all paragraphs, find the first letter that matches, and then check whether the next letter matches as well; if not, I need to start again.
My brain is having a hard time grasping this logic in code.
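One way to avoid the letter-by-letter bookkeeping is to join the texts of all runs into one string, remember where each run starts, search once with IndexOf, and then map the match back to the runs it overlaps. The following is only a minimal sketch under that assumption: RunSearch and FindSegments are hypothetical names, and the runs are represented as plain strings rather than WordParagraph objects.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RunSearch
{
    // Returns the indexes of the segments (runs) that contain any part of the
    // first occurrence of `text`, or an empty list when there is no match.
    public static List<int> FindSegments(IReadOnlyList<string> segments, string text)
    {
        var result = new List<int>();
        if (string.IsNullOrEmpty(text)) return result;

        // Join all segments and remember where each one starts in the joined string.
        var joined = new StringBuilder();
        var starts = new int[segments.Count];
        for (int i = 0; i < segments.Count; i++)
        {
            starts[i] = joined.Length;
            joined.Append(segments[i]);
        }

        int matchStart = joined.ToString().IndexOf(text, StringComparison.Ordinal);
        if (matchStart < 0) return result;
        int matchEnd = matchStart + text.Length; // exclusive

        // A segment is part of the match when its character range overlaps the match range.
        for (int i = 0; i < segments.Count; i++)
        {
            int segStart = starts[i];
            int segEnd = segStart + segments[i].Length;
            if (segStart < matchEnd && segEnd > matchStart) result.Add(i);
        }
        return result;
    }
}
```

In the real method you would build the segment list from paragraph.Text and use the returned indexes to pick the matching WordParagraph objects from this.Paragraphs.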
I have an ASP.NET Core 2.0 C# application which reads and parses a PDF file to get its text. From that text I want to read a specific value that belongs to a specific label. In the image below, I want to get the value 171857, which is the invoice number, and store it in a database.
I have tried the code below to read the PDF using iTextSharp.
using (PdfReader reader = new PdfReader(fileName))
{
StringBuilder sb = new StringBuilder();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
if (!string.IsNullOrWhiteSpace(text))
{
sb.Append(Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
}
}
var pdfText = sb.ToString();
}
The pdfText variable ends up holding all the text content of the PDF, but this does not seem to be the proper way to get the invoice number. Is there another way to read specific content from a PDF by its label name, for example by providing the label name Invoice and getting back the value 171857, perhaps with another third-party PDF reader library?
Any help or suggestions would be highly appreciated.
Thanks
I once helped a friend extract a similar value from a PDF invoice generated from Excel. For this answer I created an Excel invoice, printed it as a PDF file, and zipped both for download so you can test with them.
Next, I am using an open source and free library called PDFClown. Here is the nuget package for it.
What I do is scan the whole PDF document (an invoice can be one page or multiple pages) and add each piece of content to a list of strings.
In the next step I find the index of the element that identifies the invoice value, which I will call the Tag or Label. (The invoice number could be the 10th element in the list; in our case it is at index 1.)
Since I do not have your PDF file, I improvised and added a unique tag called "INVOICE" (it could be any other name). The invoice number in this case comes right after the invoice tag, so I find the index of the "INVOICE" tag and add 1 to it. This way I pick up the invoice text 0005 and return it as the value 5. In the same way you can fetch whatever text or value follows any tag in the scanned list and return it in the form that you need.
So you need to play with it a bit to fit it 100% to your PDF file.
Here are my test files, Excel and PDF, zipped. Download them for your test.
Here is the code:
public class InvoiceTextExtraction
{
private List<string> _contentList;
public void GetValueFromPdf()
{
_contentList = new List<string>();
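CreatePdfContent(@"C:\temp\Invoice1.pdf");
// The invoice number is the element that follows the "INVOICE" tag
var tagIndex = _contentList.FindIndex(e => e == "INVOICE");
if (tagIndex < 0 || tagIndex + 1 >= _contentList.Count)
return; // tag not found, or nothing follows it
int.TryParse(_contentList[tagIndex + 1], out var value);
Console.WriteLine(value);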
}
public void CreatePdfContent(string filePath)
{
using (var file = new File(filePath))
{
var document = file.Document;
foreach (var page in document.Pages)
{
Extract(new ContentScanner(page));
}
}
}
private void Extract(ContentScanner level)
{
if (level == null)
return;
while (level.MoveNext())
{
var content = level.Current;
switch (content)
{
case ShowText text:
{
var font = level.State.Font;
_contentList.Add(font.Decode(text.Text));
break;
}
case Text _:
case ContainerObject _:
Extract(level.ChildLevel);
break;
}
}
}
}
Input extracted from the PDF file. The scan returns the following elements:
INVOICE
0005
PAYMENT DUE BY:
4/19/2019
.etc
.
.
.
Tax
USD TOTAL
171857
18 september 2019
and here is the result
5
The code is inspired from this link.
This assumes that the invoice label and invoice number are embedded as text in the PDF and not as a bitmap.
One way that I can think of doing this is to use Spire.PDF to extract the location of the label, and then find the number written right below that location. This will be relatively simple if all the PDFs you want to process use the same template.
It isn't immediately clear from the question whether pdfText will contain the invoice number along with the rest of the text, but I'll assume it does. If it doesn't, then you will need OCR, which is a different beast entirely.
My first instinct would be to build a regex (^\d{6}$ in this case, with multiline matching so that ^ and $ apply to each line) and try to apply it to all text on the page. If there is only one match (the invoice #), then great! Otherwise, if it matches more things, you could find all occurrences and look for a pattern. For example, if customers had an ID that also matched that regex, you could extract all lines which contain a matching number, and discard all lines that contain some other info (maybe all lines with a customer # would also have a date in a specific format, for instance). Basically, find all occurrences where the regex could match, and try to find rules to exclude all the occurrences you don't care about.
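To make the regex idea concrete, here is a rough sketch. It assumes the extracted text puts the label and a six-digit number next to each other, separated only by whitespace, ':' or '#'; your PDF may extract differently. InvoiceParser and FindInvoiceNumber are hypothetical names, and the fallback is the bare ^\d{6}$ search described above, trusted only when exactly one line matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvoiceParser
{
    // Assumption: label and number are separated only by whitespace, ':' or '#'.
    private static readonly Regex LabelledNumber = new Regex(
        @"invoice\b[\s:#]*(?<number>\d{6})\b",
        RegexOptions.IgnoreCase);

    // Fallback: a line consisting of exactly six digits.
    // The optional \r tolerates Windows line endings before the end-of-line anchor.
    private static readonly Regex BareNumber = new Regex(
        @"^\d{6}\r?$",
        RegexOptions.Multiline);

    // Returns the invoice number, or null when nothing unambiguous is found.
    public static string FindInvoiceNumber(string pdfText)
    {
        Match labelled = LabelledNumber.Match(pdfText);
        if (labelled.Success) return labelled.Groups["number"].Value;

        MatchCollection bare = BareNumber.Matches(pdfText);
        // Only trust the fallback when exactly one line qualifies.
        return bare.Count == 1 ? bare[0].Value.Trim() : null;
    }
}
```

If several lines match the fallback, this returns null on purpose; that is the point where you would add your own exclusion rules.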
I need to search a document for strings enclosed in <>. So if the application finds the variable within the document, it replaces that variable with DateTime.Today.ToShortDateString(). For instance:
string filename = "C:\\Temp\\" + appNum + "_ReceiptOfApplicationLtr.docx";
if (File.Exists((string)filename))
{
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filename, true))
{
var body = wordDoc.MainDocumentPart.Document.Body;
foreach (var text in body.Descendants<Text>())
{
if (text.Text == "<TodaysDate>")
{
text.Text = text.Text.Replace("<TodaysDate>", DateTime.Today.ToShortDateString());
}
}
// Save the modified document part; writing the file name into the part's
// stream would overwrite the document XML.
wordDoc.MainDocumentPart.Document.Save();
}
}
When it searches the Text descendants, it finds the first <, then TodaysDate, and finally >, each as a separate element. The issue is that it never finds the string <TodaysDate> as a whole. Can anyone help me out?
Open XML can store text in different text tags inside the same run. What I would do if I were you is just find the Run where your string is stored and use the InnerText property to find all the text inside that run.
For example:
Run runToFind = body.Descendants<Run>()
.FirstOrDefault(r => r.InnerText.Contains("<TodaysDate>"));
Then you can replace the Run with another one:
runToFind.Parent.ReplaceChild(new Run(new Text(DateTime.Now.ToShortDateString())), runToFind);
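If the placeholder ends up spread over several text elements, a lookup on a single element will not find it. Below is a rough sketch of one way to handle that case on plain strings: join the pieces, locate the placeholder, then write the replacement into the first piece it touches and cut the matched characters out of the others. PlaceholderReplacer and ReplaceAcross are hypothetical names, and wiring the result back into the Text elements of the document is left out.

```csharp
using System;
using System.Collections.Generic;

public static class PlaceholderReplacer
{
    // Replaces the first occurrence of `placeholder`, even when it is spread over
    // several consecutive entries of `texts`. Returns true when a replacement was made.
    public static bool ReplaceAcross(IList<string> texts, string placeholder, string replacement)
    {
        if (string.IsNullOrEmpty(placeholder)) return false;

        string joined = string.Join("", texts);
        int start = joined.IndexOf(placeholder, StringComparison.Ordinal);
        if (start < 0) return false;
        int end = start + placeholder.Length; // exclusive

        int offset = 0;
        bool inserted = false;
        for (int i = 0; i < texts.Count; i++)
        {
            string original = texts[i];
            int segStart = offset;
            int segEnd = offset + original.Length;
            offset = segEnd;

            if (segEnd <= start || segStart >= end) continue; // no overlap with the match

            // Keep whatever lies before and after the matched range in this piece.
            string before = segStart < start ? original.Substring(0, start - segStart) : "";
            string after = segEnd > end ? original.Substring(end - segStart) : "";
            texts[i] = before + (inserted ? "" : replacement) + after;
            inserted = true;
        }
        return true;
    }
}
```

With the pieces "Date: <", "TodaysDate" and "> end", the first piece receives the date, the second becomes empty, and the third keeps " end".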
For anyone still struggling with this - you can check out this library
https://github.com/antonmihaylov/OpenXmlTemplates
With it instead of searching for special tags in the text (because of the problems specified in the comment of Thomas Barnekow), you add a Content control in the document and in the tag name of the content control you specify the name of the variable you want to replace.
You can then feed JSON data or a regular C# dictionary object and the text will get replaced.
Note: I am the maker of that library, but I have no financial gain from it. It is open source and under active development (and always looking for contributors!).
I am currently trying to extract the ID of a YouTube video from the embed url YouTube supplies.
I am currently using this as an example:
<iframe width="560" height="315" src="http://www.youtube.com/embed/aSVpBqOsC7o" frameborder="0" allowfullscreen></iframe>
So far my code currently looks like this,
else if (TB_VideoLink.Text.Trim().Contains("http://www.youtube.com/embed/"))
{
youtube_url = TB_VideoLink.Text.Trim();
int Count = youtube_url.IndexOf("/embed/", 7);
string cutid = youtube_url.Substring(Count,youtube_url.IndexOf("\" frameborder"));
LB_VideoCodeLink.Text = cutid;
}
I seem to be getting there, however the code falls over on cutid and I am not sure why.
Cheers
I always find it much easier to use regular expressions for this sort of thing. Substring and IndexOf always seem dated to me, but that's just my personal opinion.
Here is how I would solve this problem.
Regex regexPattern = new Regex(@"src=\""\S+/embed/(?<videoId>\w+)");
Match videoIdMatch = regexPattern.Match(TB_VideoLink.Text);
if (videoIdMatch.Success)
{
LB_VideoCodeLink.Text = videoIdMatch.Groups["videoId"].Value;
}
This will perform a regular expression match, locating src=", ignoring all characters up until /embed/ then extracting all the word characters after it as a named group.
You can then get the value of this named group. The advantage is, this will work even if frameborder does not occur directly after the src.
Hope this is useful,
Luke
The second parameter of the Substring method is a length, not an end index. Subtract the start index from the index of the second search to get the required length.
else if (TB_VideoLink.Text.Trim().Contains("http://www.youtube.com/embed/"))
{
youtube_url = TB_VideoLink.Text.Trim();
// Find the first character after "/embed/", which is where the ID starts
int Count = youtube_url.IndexOf("/embed/", 7) + "/embed/".Length;
// From the start of the ID, search for the next "
int endIndex = youtube_url.IndexOf("\"", Count);
// The ID starts at 'Count' and runs for the next (endIndex - Count) characters
string cutid = youtube_url.Substring(Count, endIndex - Count);
LB_VideoCodeLink.Text = cutid;
}
You should probably add some more error handling for the case where either of the two search strings does not exist.
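As a sketch of what that extra handling could look like, the helper below checks every IndexOf result before calling Substring and reports failure instead of throwing. YouTubeEmbed and TryGetVideoId are hypothetical names, and the list of characters that end the ID is an assumption about what can follow it in an embed snippet.

```csharp
using System;

public static class YouTubeEmbed
{
    // Returns true and sets videoId when "/embed/<id>" is present; otherwise returns false.
    public static bool TryGetVideoId(string embedHtml, out string videoId)
    {
        videoId = null;
        const string marker = "/embed/";
        if (string.IsNullOrEmpty(embedHtml)) return false;

        int markerIndex = embedHtml.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (markerIndex < 0) return false; // marker missing

        int idStart = markerIndex + marker.Length;
        // Assumption: the ID ends at a quote, a query separator, a slash, or a space.
        int idEnd = embedHtml.IndexOfAny(new[] { '"', '?', '&', '/', ' ' }, idStart);
        if (idEnd < 0) idEnd = embedHtml.Length; // ID runs to the end of the input
        if (idEnd == idStart) return false;      // nothing after the marker

        videoId = embedHtml.Substring(idStart, idEnd - idStart);
        return true;
    }
}
```

In the button handler you would call TryGetVideoId(TB_VideoLink.Text.Trim(), out var id) and only assign LB_VideoCodeLink.Text when it returns true.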
Similar to answer above, but was beaten to it.. doh
// Regex with the YouTube URL and a named group: \w+ means one or more word characters (letters, digits, underscore)
var youTubeIdRegex = new Regex(@"http://www\.youtube\.com/embed/(?<videoId>\w+)", RegexOptions.IgnoreCase | RegexOptions.Compiled);
var youTubeUrl = TB_VideoLink.Text.Trim();
var match = youTubeIdRegex.Match(youTubeUrl);
var youTubeId = match.Groups["videoId"].Value; // the named group (?<videoId>\w+); empty when there is no match
LB_VideoCodeLink.Text = youTubeId;
I have a text file with names such as balamurugan,chendurpandian,......
Suppose I enter a value such as ba in the textbox.
When I click the submit button, I have to search the text file for the value ba and display a message that the pattern matched.
I have read the text file using
string FilePath = txtBoxInput.Text;
and displayed it in a textbox using
textBoxContents.Text = File.ReadAllText(FilePath);
But I don't know how to search for a word in a text file using C#. Can anyone give a suggestion?
You can simply use:
textBoxContents.Text.Contains(keyword)
This will return true if your text contains your chosen keyword.
It depends on the kind of pattern matching that you need. You can use something as simple as the String.Contains method, or you can try regular expressions, which give you more control over how you search and can return all matches at once. Here are a couple of links to get you started quickly on regular expressions:
http://www.codeproject.com/KB/dotnet/regextutorial.aspx
http://www.developer.com/open/article.php/3330231/Regular-Expressions-Primer.htm
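As a small illustration of the regular expression route, the sketch below collects every word in the file contents that starts with the text typed by the user. NameSearch and FindWordsStartingWith are hypothetical names, and it assumes that a "word" is a run of letters, digits or underscores, which fits the comma-separated names in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameSearch
{
    // Returns every word in `contents` that begins with `prefix` (case-insensitive).
    public static List<string> FindWordsStartingWith(string contents, string prefix)
    {
        // Regex.Escape keeps characters such as '.' or '*' in the user input literal.
        string pattern = @"\b" + Regex.Escape(prefix) + @"\w*";

        var found = new List<string>();
        foreach (Match m in Regex.Matches(contents, pattern, RegexOptions.IgnoreCase))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

For the question's form, you could pass textBoxContents.Text as contents and the search box value as prefix, then report "pattern matched" when the returned list is not empty.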
First, you should split up the input string, after which you can call Contains on each value:
// On file read (the names are comma-separated):
String[] values = File.ReadAllText(FilePath).Split(',');
// On search:
List<String> results = new List<String>();
for(int i = 0; i < values.Length; i++) {
if(values[i].Contains(search)) results.Add(values[i]);
}
Alternatively, if you only want it to search at the beginning or the end of the string, you can use StartsWith or EndsWith, respectively:
// Only match beginning
values[i].StartsWith(search);
// Only match end
values[i].EndsWith(search);