I'm trying to parse a CSV file in C# by splitting on commas (,). I got it to work with this regular expression:
[\t,](?=(?:[^\"]|\"[^\"]*\")*$)
Splitting this string:
2012-01-06,"Some text with, comma",,"300,00","143,52"
Gives me:
2012-01-06
"Some text with, comma"
"300,00"
"143,52"
But I can't figure out how to lose the "" from the output so I get this instead:
2012-01-06
Some text with, comma
300,00
143,52
Any suggestions?
If you are trying to parse a CSV and using .NET, don't use regular expressions. Use a component that was created for this purpose. See the question CSV File Imports in .Net.
I know the CSV specification looks simple enough, but trust me, you are in for heartache and destruction if you continue down this path.
Why are you using regular expressions for this? Ensuring the file is well-formed?
You can use String.Replace()
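String s = "\"Some text with, comma\"";
s = s.Replace("\"", "");
// After the line has been split into fields (note: a plain Split(',') also splits inside quoted values)
String line = "2012-01-06,\"Some text with, comma\",,\"300,00\",\"143,52\"";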
String []fields = line.Split(',');
for (int i = 0; i < fields.Length; i++)
{
// Call a function to remove quotes
fields[i] = removeQuotes(fields[i]);
}
String removeQuotes(String s)
{
return s.Replace("\"", "");
}
So, something like this. Again, I wouldn't use RegEx for this purpose, but YMMV.
var sp = Regex.Split(a, "[\t,](?=(?:[^\"]|\"[^\"]*\")*$)")
.Select(s => Regex.Replace(s.Replace("\"\"","\""),"^\"|\"$","")).ToArray();
So, the idea here is that first of all, you want to replace double double quotes with a single double quote. And then that string is fed to the second regex which simply removes double quotes at the beginning and end of the string.
The reason for the first replace is because of strings like this:
var a = "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\" Dude\",\"\",\"5000.00\"";
So, this would give you a string containing ""Extended Edition, Very Large"", and each pair of double quotes needs to be collapsed into a single double quote.
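The split-then-unquote idea above can be sketched as a small self-contained helper. This is only a sketch: the class and method names (CsvRegexSplit, SplitLine) are made up for illustration, it assumes each record sits on a single line, and it strips the enclosing quotes before collapsing doubled quotes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvRegexSplit
{
    // Hypothetical helper: splits one CSV line on commas or tabs that sit
    // outside double quotes, then cleans up each field.
    public static string[] SplitLine(string line)
    {
        // The lookahead only accepts a separator when the rest of the line
        // consists of non-quote characters and complete "..." pairs.
        return Regex.Split(line, "[\t,](?=(?:[^\"]|\"[^\"]*\")*$)")
            .Select(field => Regex.Replace(field, "^\"|\"$", "") // drop enclosing quotes
                                  .Replace("\"\"", "\""))         // collapse "" into "
            .ToArray();
    }
}
```

Empty fields are kept as empty strings, so the sample line from the question yields five entries rather than four.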
Related
I have an issue with a string containing the plus sign (+).
I want to split that string (or if there is some other way to solve my problem)
string ColumnPlusLevel = "+-J10+-J10+-J10+-J10+-J10";
string strpluslevel = "";
strpluslevel = ColumnPlusLevel;
string[] strpluslevel_lines = Regex.Split(strpluslevel, "+");
foreach (string line in strpluslevel_lines)
{
MessageBox.Show(line);
strpluslevel_summa = strpluslevel_summa + line;
}
MessageBox.Show(strpluslevel_summa, "summa sumarum");
The MessageBox is for my testing purpose.
The ColumnPlusLevel string can vary a lot, but it is always a repeated pattern starting with the plus sign,
e.g. "+MJ+MJ+MJ" or "+PPL14.1+PPL14.1+PPL14.1".
(It comes from another piece of software, and I can't edit that software's output.)
How can I find out what that pattern is that is being repeated?
In these examples that is +-J10, +MJ, or +PPL14.1.
In my case above I have tested it using only a MessageBox to show the result, but later on I want the repeated pattern stored in a string.
Maybe I'm doing it wrong by using Split; maybe there is another solution.
Maybe I use Split in the wrong way.
Hope you understand my problem and the result I want.
Thanks for any advice.
/Tomas
How can I find out what that pattern is that is being repeated?
Maybe I didn't understand the requirement fully, but isn't it as easy as:
string[] tokens = ColumnPlusLevel.Split(new[]{'+'}, StringSplitOptions.RemoveEmptyEntries);
string first = tokens[0];
bool repeatingPattern = tokens.Skip(1).All(s => s == first);
If repeatingPattern is true you know that the pattern itself is first.
Can you maybe explain how the logic works?
The line that contains tokens.Skip(1) is a LINQ query, so you need to add using System.Linq at the top of your code file. Since tokens is a string[], which implements IEnumerable<string>, you can use any LINQ extension method on it. Enumerable.Skip(1) skips the first token, because I have already stored it in a variable and want to know whether all the others are the same. For that I use All, which returns false as soon as one item doesn't match the condition (that is, as soon as one string differs from the first). If all are the same, you know there is a repeating pattern, and it is already stored in the variable first.
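To make the logic above concrete, here is a minimal sketch wrapped in a helper. The names RepeatedPattern and Find are hypothetical, and returning null for "no single repeated pattern" is just one possible convention.

```csharp
using System;
using System.Linq;

public static class RepeatedPattern
{
    // Returns the repeated unit including its leading '+',
    // or null when the tokens are not all identical.
    public static string Find(string input)
    {
        // "+-J10+-J10" -> ["-J10", "-J10"]; the empty entry before the first '+' is dropped.
        string[] tokens = input.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length == 0)
            return null;

        string first = tokens[0];
        // All stops at the first token that differs from the first one.
        bool repeating = tokens.Skip(1).All(t => t == first);
        return repeating ? "+" + first : null;
    }
}
```

With this, RepeatedPattern.Find("+-J10+-J10+-J10") gives "+-J10", while a mixed input such as "+MJ+XX" gives null.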
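You should use the String.Split function:
string pattern = "+" + ColumnPlusLevel.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries)[0]; // without RemoveEmptyEntries, index 0 is the empty string before the first '+'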
...but it is always a repeated pattern starting with the plus sign.
Why do you even need String.Split() here if the pattern always only repeats itself?
string input = @"+MJ+MJ+MJ";
int indexOfSecondPlus = input.IndexOf('+', 1);
string pattern = input.Remove(indexOfSecondPlus, input.Length - indexOfSecondPlus);
//pattern is now "+MJ"
No need for String.Split, and no need for LINQ.
String has a method called Split, which lets you split the string on a given character or set of characters:
string givenString = "+-J10+-J10+-J10+-J10+-J10";
string SplittedString = givenString.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries)[0]; // '+' is the separator; 0 is the index of the first non-empty piece
string result = SplittedString.Replace("-", ""); // Replace swaps every occurrence of the first string for the second; I added this to strip the hyphen, leaving "J10"
I have a bunch of strings, some of which have one of the following formats:
"TestA (3/12/10)"
"TestB (10/12/10)"
The DateTime portion of the strings will always be in mm/dd/yy format.
What I want to do is remove the whole DateTime part, including the parentheses. If it were always the same length, I would just take the index of / and subtract the number of characters up to and including the (. But since the mm portion of the string could be one or two characters, I can't do that.
So is there a way to do a .Contains or something to see if the string contains the specified DateTime format?
You could use a regular expression to strip out the possible date portions, if you can be sure they will consistently be in a certain format, using the Regex.Replace() method:
var updatedDate = Regex.Replace(yourDate, @"\(\d{1,2}\/\d{1,2}\/\d{1,2}\)", "");
You can see a working example of it here, which yields the following output :
TestA (3/12/10) > TestA
TestB (10/12/10) > TestB
TestC (4/5/15) > TestC
TestD (4/6/15) > TestD
You could always use a regular expression to replace the strings
Here is an example
var regEx = new Regex(@"\(\d{1,2}\/\d{1,2}\/\d{1,2}\)");
var text = regEx.Replace("TestA (3/12/10)", "");
Use a RegEx for this. I recommend:
\(\d{1,2}\/\d{1,2}\/\d{1,2}\)
See RegExr for it working.
Regex could be used for this, something such as:
string replace = "TestA (3/12/10) as well as TestB (10/12/10)";
string replaced = Regex.Replace(replace, "\\(\\d+/\\d+/\\d+\\)", "");
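The regex answers above can be combined into one small helper that covers both parts of the question: checking whether a string contains the date and removing it. This is only a sketch; the names DateSuffix, HasDate, and Strip are invented here, and the pattern assumes a two-digit year plus optional whitespace before the opening parenthesis (so no trailing space is left behind).

```csharp
using System.Text.RegularExpressions;

public static class DateSuffix
{
    // Optional whitespace, then "(m/d/yy)" with a 1-2 digit month and day
    // and a 2-digit year. This checks the shape only, not calendar validity.
    private static readonly Regex Pattern =
        new Regex(@"\s*\(\d{1,2}/\d{1,2}/\d{2}\)");

    // The ".Contains"-style check asked about in the question.
    public static bool HasDate(string s) => Pattern.IsMatch(s);

    // Removes every "(m/d/yy)" occurrence together with the whitespace before it.
    public static string Strip(string s) => Pattern.Replace(s, "");
}
```

For instance, DateSuffix.Strip("TestB (10/12/10)") returns "TestB", and a string without a date comes back unchanged.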
If I'm understanding this correctly you want to just acquire the test name of each string. Copy this code to a button click event.
string Old_Date = "Test SomeName(3/12/10)";
string No_Date = "";
int Date_Pos = 0;
Date_Pos = Old_Date.IndexOf("(");
No_Date = Old_Date.Remove(Date_Pos).Trim();
MessageBox.Show(No_Date, "Your Updated String", MessageBoxButton.OK);
To sum it up in one line of code
No_Date = Old_Date.Remove(Old_Date.IndexOf("(")).Trim();
Hi all, I have a question about fixed strings in regular expressions.
How do I represent a fixed string in C#, regardless of whether it contains special characters or alphanumerics?
For example, have a look at the following string:
infinity.world.uk/Members/namelist.aspx?ID=-1&fid=X
Everything before X is fixed (that is, it always appears exactly the same); only X is a decimal variable.
What I want is to append the decimal number X to the fixed string. How do I express that as a C# regular expression?
I'd appreciate your help.
string fulltext = "infinity.world.uk/Members/namelist.aspx?ID=-1&fid=" + 10;
If you need to modify an existing URL, don't use regex, string.Format, or string.Replace; you will run into problems with the encoding of arguments.
Use Uri and HttpUtility instead:
var url = new Uri("http://infinity.world.uk/Members/namelist.aspx?ID=-1&fid=X");
var query = HttpUtility.ParseQueryString(url.Query);
query["fid"] = 10.ToString();
var newUrl = url.GetLeftPart(UriPartial.Path) + "?" + query;
result: http://infinity.world.uk/Members/namelist.aspx?ID=-1&fid=10
for example, using query["fid"] = "%".ToString(); you correctly generate http://infinity.world.uk/Members/namelist.aspx?ID=-1&fid=%25
demo: https://dotnetfiddle.net/zZ9Y1h
String.Format is one way of replacing token values in a string, if that's what you want. In the example below, the {0} is a token, and String.Format takes the fixedString and replaces the token with the value of myDecimal.
string fixedString = "infinity.world.uk/Members/namelist.aspx?ID=-1&fid={0}";
decimal myDecimal = 1.5m;
string myResultString = string.Format(fixedString, myDecimal.ToString());
I am currently using the following line:
w.Write(DateTime.Now.ToString("MM/dd/yyyy,HH:mm:ss"));
and it gives an output like:
05/23/2011,14:24:54
What I need is quotations around the date and time, the output should look like this:
"05/23/2011","14:24:54"
any thoughts on how to "break up" datetime, and get quotes around each piece?
Try String.Format:
w.Write(String.Format("\"{0:MM/dd/yyyy}\",\"{0:HH:mm:ss}\"", DateTime.Now));
DateTime.Now.ToString("\\\"MM/dd/yyyy\\\",\\\"HH:mm:ss\\\"")
This will do the trick, too.
string format = @"{0:\""MM/dd/yyyy\"",\""HH:mm:ss\""}" ;
string s = string.Format(format,DateTime.Now) ;
as will this:
string format = @"{0:'\""'MM/dd/yyyy'\""','\""'HH:mm:ss'\""'}" ;
string s = string.Format(format,DateTime.Now) ;
and this
string format = @"{0:""\""""MM/dd/yyyy""\"""",""\""""HH:mm:ss""\""""}" ;
string s = string.Format(format,DateTime.Now) ;
The introduction of a literal double quote (") or apostrophe (') in a DateTime or Numeric format strings introduces literal text. The embedded literal quote/apostrophe must be balanced — they act as an embedded quoted string literal in the format string. To get a double quote or apostrophe it needs to be preceded with a backslash.
John Sheehan's formatting cheatsheets make note of this...feature, but insofar as I can tell, the CLR documentation is (and always has been) incorrect WRT this: the docs on custom date/time and numeric format strings just say that "[any other character] is copied to the result string unchanged.".
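The quoting rule described above can be seen side by side in a short sketch. The class and method names (QuotedTimestamp, Escaped, Unescaped) are made up for illustration, and CultureInfo.InvariantCulture is an assumption added here so that the / in the format is not replaced by a culture-specific date separator.

```csharp
using System;
using System.Globalization;

public static class QuotedTimestamp
{
    // Backslash + double quote: inside a custom date format this emits a literal ".
    private const string Q = "\\\"";

    // Each quote is escaped, so it reaches the output as a literal character.
    public static string Escaped(DateTime value)
    {
        string format = Q + "MM/dd/yyyy" + Q + "," + Q + "HH:mm:ss" + Q;
        return value.ToString(format, CultureInfo.InvariantCulture);
    }

    // Unescaped quotes delimit literal text, so the specifiers between them
    // are copied verbatim instead of being replaced by date parts.
    public static string Unescaped(DateTime value)
    {
        return value.ToString("\"MM/dd/yyyy\",\"HH:mm:ss\"", CultureInfo.InvariantCulture);
    }
}
```

For a DateTime of 23 May 2011, 14:24:54, Escaped produces "05/23/2011","14:24:54" (with the quotes), whereas Unescaped produces the text MM/dd/yyyy,HH:mm:ss.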
string part1 = DateTime.Now.ToString("MM/dd/yyyy");
string part2 = DateTime.Now.ToString("HH:mm:ss");
Console.WriteLine("\""+part1+"\",\""+part2+"\"");
Works just fine. May not be the best way though
I'm not sure about the type of w but if it supports the standard set of Write overloads the following should work.
w.Write(@"""{0}""", DateTime.Now.ToString(@"MM/dd/yyyy\"",\""HH:mm:ss"));
If not then you can do the following
var msg = String.Format(@"""{0}""", DateTime.Now.ToString(@"MM/dd/yyyy\"",\""HH:mm:ss"));
w.Write(msg);
The following version, though obvious, will not work:
w.Write(DateTime.Now.ToString("\"MM/dd/yyyy\",\"HH:mm:ss\""));
This will output:
MM/dd/yyyy,HH:mm:ss
So don't do that.
I am trying to use C# to parse CSV. I used regular expressions to find "," and read string if my header counts were equal to my match count.
Now this will not work if I have a value like:
"a",""b","x","y"","c"
then my output is:
'a'
'"b'
'x'
'y"'
'c'
but what I want is:
'a'
'"b","x","y"'
'c'
Is there any regex or any other logic I can use for this ?
CSV, when dealing with things like multi-line, quoted, different delimiters* etc - can get trickier than you might think... perhaps consider a pre-rolled answer? I use this, and it works very well.
*=remember that some locales use [tab] as the C in CSV...
CSV is a great example for code reuse - No matter which one of the csv parsers you choose, don't choose your own. Stop Rolling your own CSV parser
I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.
Just for sake of exercising my mind, quick & dirty working C# procedure:
public static List<string> SplitCSV(string line)
{
if (string.IsNullOrEmpty(line))
throw new ArgumentException();
List<string> result = new List<string>();
bool inQuote = false;
StringBuilder val = new StringBuilder();
// parse line
foreach (var t in line.Split(','))
{
int count = t.Count(c => c == '"');
if (count > 2 && !inQuote)
{
inQuote = true;
val.Append(t);
val.Append(',');
continue;
}
if (count > 2 && inQuote)
{
inQuote = false;
val.Append(t);
result.Add(val.ToString());
val.Clear(); // reset the buffer so a later quoted value starts empty
continue;
}
if (count == 2 && !inQuote)
{
result.Add(t);
continue;
}
if (count == 2 && inQuote)
{
val.Append(t);
val.Append(',');
continue;
}
}
// remove quotation
for (int i = 0; i < result.Count; i++)
{
string t = result[i];
result[i] = t.Substring(1, t.Length - 2);
}
return result;
}
There's an oft quoted saying:
Some people, when confronted with a
problem, think "I know, I'll use
regular expressions." Now they have
two problems. (Jamie Zawinski)
Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.
Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:
public IEnumerable<string> SplitCSV(string line)
{
int index = 0;
int start = 0;
bool inString = false;
foreach (char c in line)
{
switch (c)
{
case '"':
inString = !inString;
break;
case ',':
if (!inString)
{
yield return line.Substring(start, index - start);
start = index + 1;
}
break;
}
index++;
}
if (start < index)
yield return line.Substring(start, index - start);
}
Standard caveat - untested code, there may be off-by-one errors.
Limitations
The quotes around a value aren't removed automatically.
To do this, add a check just before the yield return statement near the end.
Single quotes aren't supported in the same way as double quotes
You could add a separate boolean inSingleQuotedString, renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)
Whitespace isn't automatically removed
Some tools introduce whitespace around the commas in CSV files to "pretty" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.
In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:
""
\"
In the second form your initial string would look like this:
"a","\"b\",\"x\",\"y\"","c"
If your input string is not formatted against some rigorous format like this then you have very little chance of successfully parsing it in an automated environment.
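Assuming the first escaping convention (a double quote inside a value written as two double quotes), a character-by-character parser for a single line could look like the sketch below. The names CsvLine and Parse are hypothetical, and multi-line fields, backslash escapes, and whitespace trimming are deliberately left out.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Parses one CSV line in which quotes inside a quoted value are doubled ("").
    public static List<string> Parse(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);      // commas are plain text inside quotes
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');    // "" inside quotes is one literal quote
                    i++;                    // skip the second quote of the pair
                }
                else
                {
                    inQuotes = false;       // a lone quote closes the value
                }
            }
            else if (c == '"')
            {
                inQuotes = true;            // opening quote, not part of the value
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());     // last field, possibly empty
        return fields;
    }
}
```

With the sample from the question rewritten in the doubled-quote form, "a","""b"",""x"",""y""","c", this yields three fields: a, then "b","x","y" (inner quotes kept), then c.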
If all your values are guaranteed to be in quotes, look for values, not for commas:
("".*?""|"[^"]*")
This takes advantage of the fact that "the earliest longest match wins" - it looks for double quoted values first, and with a lower priority for normal quoted values.
If you don't want the enclosing quote to be part of the match, use:
"(".*?"|[^"]*)"
and go for the value in match group 1.
As I said: Prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side-effect is that it does not care for the separator char. Commas, TABs, semi-colons, spaces, you name it. All will work.
FileHelpers supports multiline fields.
You could parse files like these:
a,"line 1
line 2
line 3"
b,"line 1
line 2
line 3"
Here is the datatype declaration:
[DelimitedRecord(",")]
public class MyRecord
{
public string field1;
[FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
public string field2;
}
Here is the usage:
static void Main()
{
var engine = new FileHelperEngine<MyRecord>();
MyRecord[] res = engine.ReadFile("file.csv");
}
Try CsvHelper (a library I maintain) or FastCsvReader. Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P
FileHelpers for .Net is your friend.
See the link "Regex fun with CSV" at:
http://snippets.dzone.com/posts/show/4430
The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.
Well, I'm no regex wiz, but I'm certain they have an answer for this.
Procedurally, it's a matter of going through the line character by character. Set a variable, say dontMatch, to FALSE.
Each time you run into a quote, toggle dontMatch.
Each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.
This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.
For instance,
"a", ""b", ""c", "d"", "e""
will yield bad results.
This can be fixed with another patch. Rather than simply keeping a true/false flag, you have to match quotes.
To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.
-Adam
I have just tried your regular expression in my code. It works fine for well-formed quoted text,
but I'm wondering whether a regex can parse the value below:
"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"@sib.com"
I am looking for this result:
'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"@sib.com'
Thanks