I've a list of paragraphs. Each paragagraph can contain Text. I'm trying to search for a string that may be as whole within a single paragraph, or spread across multiple paragraphs with as bad case where each letter is different paragraph.
public List<WordParagraph> FindText(string text) {
List<WordParagraph> list = new List<WordParagraph>();
var found = false;
Paragraph currentParagraph = null;
foreach (var paragraph in this.Paragraphs) {
//if (currentParagraph == null) {
// currentParagraph = paragraph._paragraph;
//} else {
// if (currentParagraph != paragraph._paragraph) {
// found = false;
// }
//}
// paragraph.Text
// logic missing to find text that can start within some paragraph.Text, but
// can span across multiple paragraphs
// for example searching for text "This Is MyTest" within 4 paragraphs that
// may be written like
// paragraph.Text = "Thi"
// paragraph.Text = "s Is"
// paragraph.Text = " MyTes"
// paragraph.Text = "t"
}
return list;
}
I've tried some logic around foreach char in text, and nested loop over text from the paragraph.text but the logic was failing me.
To give you a bit of background. Consider a Word Document that has a single sentence - one long sentence but each word, or even letter is formatted differently - different font size, bold, underline or whatever. It looks like this:
Now what Word actually saved in the file is a single paragraph, but each paragraph has multiple "runs". The run contains a Text element. Each text element contains the text that you see in Word, but due to formatting of possibly even each word it can be split into many many small Text properties.
Now in my example, I've simplified the logic and for me, each "run" is a paragraph with a text. So List of WordParagraphs is a list of runs within Screenshot you see.
Now I need to find a string "I have that" from the whole sentence you see in word. That means I need to go thru all paragraphs, find the first letter that matches and then check if next letter matches as well, if not I need to start again.
My brain is having hard time to grasp this logic in code.
I know that the data should be correct. I have no control over the data and my boss is just going to tell me that I need to figure out a way to deal with someone else's mistake. So please don't tell me it's not my problem that the data is bad, because it is.
Anywho, this is what I'm looking at:
"Words","email#email.com","","4253","57574","FirstName","","LastName, MD","","","576JFJD","","1971","","Words","Address","SUITE "A"","City","State","Zip","Phone","",""
Data has been scrubbed for confidentiality reasons.
So as you see, the data contains quotation marks and there are commas inside some of these quoted fields. So I cannot remove them. But the "Suite A""" is throwing off the parser. There are too many quotation marks. >.<
I'm using the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace with these settings:
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
parser.TextFieldType = FieldType.Delimited;
The error is
MalformedLineException: Line 9871 cannot be parsed using the current
delimiters.
I would like to scrub the data somehow to account for this but I'm not sure how to do it. Or maybe there's a way to just skip this line? Although I suspect my higher ups will not approve of me just skipping data that we might need.
If you are only trying to get rid of the stray " marks in your csv, you can use the following regex to find them and replace them with '
String sourcestring = "source string to match with pattern";
String matchpattern = #"(?<!^|,)""(?!(,|$))";
String replacementpattern = #"$1'";
Console.WriteLine(Regex.Replace(sourcestring,matchpattern,replacementpattern,RegexOptions.Multiline));
Explanation:
#"(?<!^|,)""(?!(,|$))"; will find will find any " that is not preceded by the beginning of the string, or a , and that is not followed by the end of the string or a ,
I am not familiar with TextFieldParser. However with CsvHelper, you can add a custom handler for invalid data:
var config = new CsvConfiguration();
config.IgnoreReadingExceptions = true;
config.ReadingExceptionCallback += (e, row) =>
{
// you can add some custom patching here if possible
// or, save the line numbers and add/edit them manually later.
};
using(var file = File.OpenRead(".csv"))
using(var reader = new CsvReader(reader, config))
{
reader.GetRecords<YourDtoClass>();
}
My only addition to what everyone is saying (because we've all been there) is to try to attempt to rectify each new issue you encounter with code. There are some decent REGEX strings out there https://www.google.com/?ion=1&espv=2#q=c-sharp+regex+csv+clean or you could manually fix things using String.Replace (String.Replace("\"\"\"","").Replace("\"\","").Replace("\",,","\",") or such). Eventually, as you detect and find ways of correcting more and more mistakes, your manual recovery rate will be minimized substantially (most of your bad data will likely come from similar mistakes). Cheers!
PS - Idea-ish (it's been a while - the logic may neeed some tweaking as I'm writing from memory), but you'll get the gist:
public string[] parseCSVWithQuotes(string csvLine,int expectedNumberOfDataPoints)
{
string ret = "";
string thisChar = "";
string lastChar = "";
bool needleDown = true;
for(int i = 0; i < csvLine.Length; i++)
{
thisChar = csvLine.Substring(i, 1);
if (thisChar == "'"&&lastChar!="'")
needleDown = needleDown == true ? false : true;//when needleDown = true, characters are treated literally
if (thisChar == ","&&lastChar!=",") {
if (needleDown)
{
ret += "|";//convert literal comma to pipe so it doesn't cause another break on split
}else
{
ret += ",";//break on split is intended because the comma is outside the single quote
}
}
if (!needleDown && (thisChar == "\"" || thisChar == "*")) {//repeat for any undesired character or use RegEx
//do not add -- this eliminates any undesired characters outside single quotes
}
else
{
if ((lastChar == "'" || lastChar == "\"" || lastChar == ",") && thisChar == lastChar)
{
//do not add - this eliminates double characters
}else
{
ret += thisChar;
lastChar = thisChar;
//this character is not an undesired character, is no a double, is valid.
}
}
}
//we've cleaned as best we can
string[] parts = ret.Split(',');
if(parts.Length==expectedNumberOfDataPoints){
for(int i = 0; i < parts.Length; i++)
{
//go back and replace the temporary pipe with the literal comma AFTER split
parts[i] = parts[i].Replace("|", ",");
}
return parts;
}else{
//save ret to bad CSV log
return null;
}
}
I've had to do this before,
The first step is to parse the data using string.split(',')
The next step is to combine the segments that belong together.
What I essentially did was
make a new list representing the combined strings
if a string begins with a quote, push it onto your new list
if it does not begin with a quote, append it to the last string in your list
Bonus: throw exceptions when a string ends with a quote but the next one does not begin with a quote
Depending on what the rules are regarding what can actually appear in your data, you might have to change your code to account for that.
At the core of CSV's file format, each line is a row, each cell in that row is separated by a comma. In your case, your format also contains the (very unfortunate) stipulation that commas inside a pair of quotation marks do not count as separators and are instead part of the data. I say very unfortunate because a misplaced quotation mark affects the entire rest of the line, and since quotation marks in standard ASCII do not distinguish between open and closed, there really is nothing you can do to recover from this without knowing the original intent.
That is when you log a message in a way that the person who does know the original intent (the person that provided the data) can look at the file and correct the error:
if (parse_line(line, &data)) {
// save the data
} else {
// log the error
fprintf(&stderr, "Bad line: %s", line);
}
And since your quotation marks aren't escaping newlines, you can keep on going with the next line after running into this error.
ADDENDUM: And if your company has a choice (i.e. your data is being serialized by a company tool) don't use CSV. Use something like XML or JSON with a much more clearly defined parsing mechanism.
I had to do this once aswell. My approach was to go through a line and keep track on what I was reading.
Basicly, I coded my own scanner chopping off tokens from the input line which gave me full control over my faulty .csv data.
This is what I did:
For each character on a line of input.
1. when outside of a string meeting a comma => all of the previous string (which can be empty) is a valid token.
2. when outside of a sting meeting anything but a comma or a quote => now you have a real problem, unquoted tekst => handle as you see fit.
3. when outside of a string meeing a quote => found a start of string.
4. when inside of a string meeting a comma => accept the comma as part of the string.
5. when inside of the string meeting a qoute => trouble starts here, mark this point.
6. continue and when meeting a comma (skipping white space if desired) close the string, 'unread' the comma and continue. (than will bring you to point 1.)
7. or continue and when meeting a quote -> obviously, what was read must be part of the string, add it to the string, 'unread' the quote and continue. (that will you bring to point 5)
8. or continue and find an whitespace, then End Of Line ('\n') -> the last qoute must be the closing quote. accept the string as a value.
9. or continue and fine non-whitespace, then End Of Line. -> now you have a real problem, you have the start of a string but it is not closed -> handle the error as you see fit.
If the number of fields in your .csv file is fixed you can count the comma's you recognise as field seperators and when you see a End Of Line you know you have another problem or not.
With the stream of strings received from the input line you can build a 'clean' .csv line and this way build a buffer of accepted and cleaned input that you can use in your already existing code.
I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special html characters like ” “ are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = #"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems or should I be using a different approach all together?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World<\p> is parsed into a node to represent the p tag, with a child text node to hold the "Hello World". So what you want to do is find all the text nodes in your document, and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
class Program
{
static void Main(string[] args)
{
var doc = new HtmlDocument();
doc.Load(args[0]);
var wordCount = 0;
var nodes = doc.DocumentNode
.SelectNodes("/html/body//*[not(self::script)]/text()");
foreach (var node in nodes)
{
var words = node.InnerHtml.Split(' ');
var surroundedWords = words.Select(word =>
{
if (String.IsNullOrWhiteSpace(word))
{
return word;
}
else
{
return $"<span data-wordno={wordCount++}>{word}</span>";
}
});
var newInnerHtml = String.Join("", surroundedWords);
node.InnerHtml = newInnerHtml;
}
WriteLine(doc.DocumentNode.InnerHtml);
}
}
Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+)' at the end of pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).
I need to retain paragraph breaks in a .docx file, but get rid of linebreaks which are often in the wrong place when copying from one file to another (due to different page sizes, and when the font is changed).
Using the DocX Library, I'm trying this:
private void ReplaceLineBreaksWithBoo(string filename)
{
List<string> lineBreaks;
using (DocX document = DocX.Load(filename))
{
lineBreaks = document.FindUniqueByPattern("\n", System.Text.RegularExpressions.RegexOptions.None);
if (lineBreaks.Count > 0)
{
foreach (string s in lineBreaks)
{
document.ReplaceText(s, string.empty); // <-- or a space?
}
}
document.Save();
}
}
...but it doesn't work - "\n" is not the right thing to pass, I reckon; I don't know what I need for that first arg to the FindUniqueByPattern() method. Documentation is nil and the discussion forum there resembles Bodie, California:
I guess you can't do it using FindUniqueByPattern or FindAll. Newline is not represented by any symbol but stored as a paragraph with empty text. You can peek document representation in xml format from document.Xml property, there you'll see empty line stored as single <w:p> element.
Therefore you can search for Paragraphs with empty text instead of searching for newline character :
using (DocX document = DocX.Load(filename))
{
var emptyLines = document.Paragraphs.Where(o => string.IsNullOrEmpty(o.Text));
foreach (var paragraph in emptyLines)
{
paragraph.Remove(false);
}
document.Save();
}
I am appending some text containing '\r\n' into a word document at run-time.
But when I see the word document, they are replaced with small square boxes :-(
I tried replacing them with System.Environment.NewLine but still I see these small boxes.
Any idea?
the answer is to use \v - it's a paragraph break.
Have you not tried one or the other in isolation i.e.\r or \n as Word will interpret a carriage return and line feed respectively. The only time you would use the Environment.Newline is in a pure ASCII text file. Word would handle those characters differently! Or even a Ctrl+M sequence. Try that and if it does not work, please post the code.
Word uses the <w:br/> XML element for line breaks.
After much trial and error, here is a function that sets the text for a Word XML node, and takes care of multiple lines:
//Sets the text for a Word XML <w:t> node
//If the text is multi-line, it replaces the single <w:t> node for multiple nodes
//Resulting in multiple Word XML lines
private static void SetWordXmlNodeText(XmlDocument xmlDocument, XmlNode node, string newText)
{
//Is the text a single line or multiple lines?>
if (newText.Contains(System.Environment.NewLine))
{
//The new text is a multi-line string, split it to individual lines
var lines = newText.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
//And add XML nodes for each line so that Word XML will accept the new lines
var xmlBuilder = new StringBuilder();
for (int count = 0; count < lines.Length; count++)
{
//Ensure the "w" prefix is set correctly, otherwise docFrag.InnerXml will fail with exception
xmlBuilder.Append("<w:t xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\">");
xmlBuilder.Append(lines[count]);
xmlBuilder.Append("</w:t>");
//Not the last line? add line break
if (count != lines.Length - 1)
{
xmlBuilder.Append("<w:br xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\" />");
}
}
//Create the XML fragment with the new multiline structure
var docFrag = xmlDocument.CreateDocumentFragment();
docFrag.InnerXml = xmlBuilder.ToString();
node.ParentNode.AppendChild(docFrag);
//Remove the single line child node that was originally holding the single line text, only required if there was a node there to start with
node.ParentNode.RemoveChild(node);
}
else
{
//Text is not multi-line, let the existing node have the text
node.InnerText = newText;
}
}