I have a plain text file as below,
<body labelR={Right} LabelL={Left}> </body/> Video provides a powerful way to help you prove your point. When you click Online Video, you can paste in the embed code for the video you want to add. You can also type a keyword to search online for the video that best fits your document. <body TestR={TestRight} TestL={TestLeft}> </body/>
It is read into the file system as,
var plainText = File.ReadAllText(#"D:\TestTxt.txt");
I'm trying to figure out a way if there is a way to filer out and get a list of a particular set of elements which are in XML syntax. Desired outcome is as below,
A list of 2 items in this case, with,
<body labelR={Right} LabelL={Left}>
</body/>
<body TestR={TestRight} TestL={TestLeft}>
</body/>
Basically the XML elements with <body> </body>
I cannot use LINQ to XML here since this plain text content is not valid XML syntax, I have read that RegEx might be possible but I'm not sure the proper way to use it here.
Any advise is greatly appreciated here
I think the best way to implement this situation is to change your txt file to an XML file by adding a little piece of code, Then you can easily read it
this and this will help you to do that.
using (XmlReader reader = XmlReader.Create(#"YOUFILEPATH.xml"))
{
while (reader.Read())
{
if (reader.IsStartElement())
{
//return only when you have START tag
switch (reader.Name.ToString())
{
case "Key":
Console.WriteLine("Element tag name is: " + reader.ReadString());
break;
case "Element value is: "
Console.WriteLine("Your Location is : " + reader.ReadString());
break;
}
}
Console.WriteLine("");
}
A plain string-based solution could be:
var s = "<body labelR={Right} LabelL={Left}> </body/> Video provides ... your document. <body TestR={TestRight} TestL={TestLeft}> </body/>";
int start = 0;
while ((start = s.IndexOf("<body", start )) >= 0)
{
var end = s.IndexOf("</body/>", start + "<body".Length) + "</body/>".Length;
Console.WriteLine(s[start..end]);
start = end;
}
This finds the next <body starting from the previous "node" (if any). Then it finds the (end of the) next </body/>.
Finally it prints the substring.
Repeat until no start marker was found, so it prints:
<body labelR={Right} LabelL={Left}> </body/>
<body TestR={TestRight} TestL={TestLeft}> </body/>
You may want to add some checks - what if the end marker is missing?
Related
I have an asp.net Core 2.0 C# application which read/parse the PDF file and get the text. In this I want to read specific value which have specific label name. You can see the below image I want to get the value 171857 which is Invoice number and store it in database.
I have tried below code to read the pdf using iTextSharp.
using (PdfReader reader = new PdfReader(fileName))
{
StringBuilder sb = new StringBuilder();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
if (!string.IsNullOrWhiteSpace(text))
{
sb.Append(Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
}
}
var pdfText = sb.ToString();
}
In pdfText variable I will get all text content from pdf but It seems that this is not the proper way to get the Invoice number. Is there any other way to read the specific content from pdf by it's label name like we will provide label name Invoice and it will return the value 171857 as example with other 3rd party pdf reader libraries?
Any help or suggestions would be highly appreciated.
Thanks
I have helped a friend extracting similar value from pdf invoice generated by Excel arc. I have for this answer created an Excel invoice and print it as PDF file and zipped for download for testing purpose.
The next thing I do, I am using an Open Source and Free Library called PDFClown. Here is the nuget package for it.
So far so good, what I did is I scan all pdf document (for example invoice can be one page or multiple pages) add each content to a list of string.
The next step I find the index (the invoice number index could be in 10th element in list, in our case it is index 1) that refer to invoice value which I will call Tag or Label.
Hence I do not have your pdf file, I improvised and added a unique Tag called (or any other name) "INVOICE". The invoice number in this case comes after invoice tag tag. So I find the index of "INVOICE" tag and add 1 to index this is because the invoice number follow the invoice tag. This way I will pick the invoice text 0005 in this case and return it as value 5. This way you can fetch what every text/value followed by any tag scanned in our list and return it the way that you need.
So you need to play with it a bit to fit it 100% to your pdf file.
So here is my test files Excel and Pdf zipped down. Download it for your test.
Here is the code:
public class InvoiceTextExtraction
{
private List<string> _contentList;
public void GetValueFromPdf()
{
_contentList = new List<string>();
CreatePdfContent(#"C:\temp\Invoice1.pdf");
var index = _contentList.FindIndex(e => e == "INVOICE") + 1;
int.TryParse(_contentList[index], out var value);
Console.WriteLine(value);
}
public void CreatePdfContent(string filePath)
{
using (var file = new File(filePath))
{
var document = file.Document;
foreach (var page in document.Pages)
{
Extract(new ContentScanner(page));
}
}
}
private void Extract(ContentScanner level)
{
if (level == null)
return;
while (level.MoveNext())
{
var content = level.Current;
switch (content)
{
case ShowText text:
{
var font = level.State.Font;
_contentList.Add(font.Decode(text.Text));
break;
}
case Text _:
case ContainerObject _:
Extract(level.ChildLevel);
break;
}
}
}
}
Input extracted from pdf file. The code scan return following elements:
INVOICE
0005
PAYMENT DUE BY:
4/19/2019
.etc
.
.
.
Tax
USD TOTAL
171857
18 september 2019
and here is the result
5
The code is inspired from this link.
Assuming that the invoice label and invoice number is embedded as text in PDF and not as Bitmap.
One way that I can think of doing this is by using Spire.PDF and extract location of the label, and then find the number written right below that location. This will be relatively simple if you have same template of all the PDFs you want to process.
It isn't immediately clear from the answer whether pdfText will contain the Invoice number along with the rest of the text, but I'll assume it does. If it doesn't, then you will need OCR, which is a different beast entirely.
My first instinct would be to build a regex (^\d{6}$) in this case and try to apply it on all text on the page. If there is only one match (the invoice #), then great! Otherwise if it matches more things, you could find all occurences and look for a pattern. For example, if customers had an ID that also matched that regex, you could extract all lines which contain a matching number, and discard all lines that contain some other info (maybe all lines with a customer # would also have a date in a specific format for instance). Basically find all occurences where the regex could match, and try to find rules to exclude all the occurences you don't care about.
I made a program that modifies a local html page and replaces a certain value in it as you can see here:
string j = Microsoft.VisualBasic.Interaction.InputBox("Replace " + listView2.SelectedItems[0].Text + "/" + intlstv2.ToString() + " With?");
var links = webBrowser1.Document.GetElementsByTagName("td");
foreach(HtmlElement lnk in links)
{
if (lnk.GetAttribute("className") == "point" && lnk.InnerText == listView2.SelectedItems[0].Text || lnk.GetAttribute("className") == "point tekort" && lnk.InnerText == listView2.SelectedItems[0].Text)
{
MessageBox.Show(lnk.InnerText);
MessageBox.Show("Replacing with: " + j.ToString());
System.IO.File.WriteAllText("Fin.html", this.webBrowser1.DocumentText);
System.IO.File.WriteAllText("Fin.html", System.IO.File.ReadAllText("Fin.html").Replace(lnk.InnerText, j));
}
}
And in the html file:
<td class="point">14,5</td> <---- Value that I want replaced
<td class="average" title="Med">14,5</td> <---- Value that I want to keep
The value selected in listview 2 = 14,5 but the problem I'm having is that in the html, 14,5 exists twice (once for the class name point and the second for the class name med) I would only like to replace the innertext of classname point without changing the med's innertext.
How would I do this?
You could find better success by traversing the HTML document as an object tree rather than a "blob" of string.
That being said, using HtmlDocument to do this will be painful as it doesn't offer a way to introspect easily by class name, attribute values, etc. That being said, you could call GetElementByTagName and fetch all the td elements, and filter these by the class attribute. A bit of complexity, but I guess manageable.
I usually use the HtmlAgilityPack library, which provides many, many more methods and objects which will allow you to find your html elements with greater ease. Strongly recommend you use it!
I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special html characters like ” “ are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = #"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems or should I be using a different approach all together?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World<\p> is parsed into a node to represent the p tag, with a child text node to hold the "Hello World". So what you want to do is find all the text nodes in your document, and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
class Program
{
static void Main(string[] args)
{
var doc = new HtmlDocument();
doc.Load(args[0]);
var wordCount = 0;
var nodes = doc.DocumentNode
.SelectNodes("/html/body//*[not(self::script)]/text()");
foreach (var node in nodes)
{
var words = node.InnerHtml.Split(' ');
var surroundedWords = words.Select(word =>
{
if (String.IsNullOrWhiteSpace(word))
{
return word;
}
else
{
return $"<span data-wordno={wordCount++}>{word}</span>";
}
});
var newInnerHtml = String.Join("", surroundedWords);
node.InnerHtml = newInnerHtml;
}
WriteLine(doc.DocumentNode.InnerHtml);
}
}
Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+)' at the end of pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).
so I have an XML document I'm trying to import using XmlTextReader in C#, and my code works well except for one part, that's where the tag line is not on the same line as the actually text/content, for example with product_name:
<product>
<sku>27939</sku>
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<supplier_number>ALNN1064</supplier_number>
</product>
My code to try to sort the XML document is as such:
while (reader.Read())
{
switch (reader.Name)
{
case "sku":
newEle = new XMLElement();
newEle.SKU = reader.ReadString();
break;
case "product_name":
newEle.ProductName = reader.ReadString();
break;
case "supplier_number":
newEle.SupplierNumber = reader.ReadString();
products.Add(newEle);
break;
}
}
I have tried almost everything I found in the XmlTextReader documentation
reader.MoveToElement();
reader.MoveToContent();
reader.MoveToNextAttribute();
and a couple others that made less sense, but none of them seem to be able to consistently deal with this issue. Obviously I could fix this one case, but then it would break the regular cases. So my question is, would there be a way to have it after I find the "product_name" tag to go to the next line that contains text and extract it?
I should have mentioned, I am outputting it to an HTML table after and the element is coming up blank so I'm fairly certain it is not reading it correctly.
Thanks in advanced!
I think you will find Linq To Xml easier to use
var xDoc = XDocument.Parse(xmlstring); //or XDocument.Load(filename);
int sku = (int)xDoc.Root.Element("sku");
string name = (string)xDoc.Root.Element("product_name");
string supplier = (string)xDoc.Root.Element("supplier_number");
You can also convert your xml to dictionary
var dict = xDoc.Root.Elements()
.ToDictionary(e => e.Name.LocalName, e => (string)e);
Console.WriteLine(dict["sku"]);
It looks like you may need to remove the carriage returns, line feeds, tabs, and spaces before and after the text in the XML element. In your example, you have
<!-- 1. Original example -->
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 2. It should probably be. If possible correct the XML generator. -->
<product_name>Sof-Therm Warm-Up Jacket</product_name>
<!-- 3a. If white space is important, then preserve it -->
<product_name xml:space='preserve'>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 3b. If White space is important, use CDATA -->
<product_name>!<[CDATA[
Sof-Therm Warm-Up Jacket
]]></product_name>
The XmlTextReader has a WhitespaceHandling property, but when I tested it, it still including the returns and indentation:
reader.WhitespaceHandling = WhitespaceHandling.None;
An option is to use a method to remove the extra characters while you are parsing the document. This method removes the normal white space at the beginning and end of a string:
string TrimCrLf(string value)
{
return Regex.Replace(value, #"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
// Then in your loop...
case "product_name":
// Trim the contents of the 'product_name' element to remove extra returns
newEle.ProductName = TrimCrLf(reader.ReadString());
break;
You can also use this method, TrimCrLf(), with Linq to Xml and the traditional XmlDocument. You can even make it an extension method:
public static class StringExtensions
{
public static string TrimCrLf(this string value)
{
return Regex.Replace(value, #"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
}
// Use it like:
newEle.ProductName = reader.ReadString().TrimCrLf();
Regular expression explanation:
^ = Beginning of field
$ = End of field
[]+= Match 1 or more of any of the contained characters
\n = carriage return (0x0D / 13)
\r = line feed (0x0A / 10)
\t = tab (0x09 / 9)
' '= space (0x20 / 32)
I have run into a similar problem before when dealing with text that originated on a Mac platform due to reversed \r\n in newlines. Suggest you try Ryan's regex solution, but with the following regex:
"^[\r\n]+|[\r\n]+$"
I am scraping a database of products and I am able to get all the HTML and retrieve most values as they have some unique items. However I am stuck on some areas that have common tags.
Example:
<div class="label">Name:</div><div class="value">John</div>
<div class="label">Age:</div><div class="value">24</div>
Any ideas on how I could get those labels and associated values?
I am using HTMLAgilityPack for the rest if there is something in there that may help.
Please use the xpath to get div's with class as label and class as value
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtml);
Dictionary<string, string> dict = new Dictionary<string, string>();
//This will get all div's with class as label & class value in dictionary
int cnt = 1;
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[#class='label']"))
{
var val = doc.DocumentNode.SelectSingleNode("//div[#class='value'][" + cnt + "]").InnerText;
if(!dict.ContainsKey(node.InnerText))//dictionary takes unique keys only
{
dict.Add(node.InnerText, val);
cnt++;
}
}
You could try this:
Int32 endingIndex;
var Name1 = GetTextBetween(yourHtml, "<div class=\"label\">", "</div><div class=\"value\">", out endingIndex);
var Value1 = GetTextBetween(yourHtml.SubString(endingIndex), "<div class=\"value\">", "</div>", out endingIndex);
var Name2 = GetTextBetween(yourHtml.SubString(endingIndex), "<div class=\"label\">", "</div><div class=\"value\">", out endingIndex);
var Value2 = GetTextBetween(yourHtml.SubString(endingIndex), "<div class=\"value\">", "</div>", out endingIndex);
public static String GetTextBetween(String allDataToParse, String startText, String endText, out Int32 indexOfEndText)
{
var indexOfStartText = allDataToParse.IndexOf(startText);
indexOfEndText = allDataToParse.IndexOf(endText);
return allDataToParse.Substring(indexOfStartText, indexOfEndText - indexOfStartText).Replace(startText, String.Empty) ;
}
Although XPath always sounds like a great idea, when you're scraping data you can't rely on the HTML to be well formed. Many webpages break their HTML regularly to make scraping harder. Even though Mark's code looks awkward, it's actually more robust in some cases.
As sad as it sounds, you can only rely on consistency in the target document when the provider has proven reliable over a long length of time. Ideally, I'd use a regular expression to search for the tags I want specifically. Here's a good starting point:
Regular expression for extracting tag attributes
Unfortunately, only you know the exact quirks of the document you're working on. A simple solution, like the one Mark proposes, will likely work if the page you're viewing is reliable. And frankly, it's less likely to be fragile and crash unexpectedly.
If you use the HTML document parsing code that HatSoft suggests, your program may work great on most documents, but in my experience websites will throw errors randomly, change their layout unexpectedly, or sometimes your network code will only receive a partial string. Perhaps this is okay, but I'd suggest you try both approaches and see what is more reliable for you.