c# regular expression - c#

I have an output like -
Col.A Col.B Col.C Col.D
--------------------------------------------------------------
* 1 S60-01-GE-44T-AC SGFM115001195 7520051202 A
1 S60-PWR-AC APFM115101302 7520047802 A
1 S60-PWR-AC APFM115101245 7520047802 A
or
Col.A Col.B Col.C Col.D
--------------------------------------------------------------
* 0 S50-01-GE-48T-AC DL252040175 7590005605 B
0 S50-PWR-AC N/A N/A N/A
0 S50-FAN N/A N/A N/A
For these outputs the regular expression -
(?:\*)?\s+(?<unitno>\d+)\s+\S+-\d+-(?:GE|TE)?-?(?:\d+(?:F|T))-?(?:(?:AC)|V)?\s+(?<serial>\S+)\s+\S+\s+\S+\s+\n
works fine to capture Column A and Column B. But recently I got a new kind of output -
Col.A Col.B Col.C Col.D
---------------------------------------------------------
* 0 S4810-01-64F HADL120620060 7590009602 A
0 S4810-PWR-AC H6DL120620060 7590008502 A
0 S4810-FAN N/A N/A N/A
0 S4810-FAN N/A N/A N/A
As you can see the patterns "GE|TE" and the "AC|V" are missing from these outputs. How do I change my regular expression accordingly maintaining backward compatibility.
EDIT:
The output that you see comes in a complete string and due to some operational limits I cannot use any other concept other than regex here to get my desired values. I know using split would be ideal here but I cannot.

You are probably better off using String.Split() to break the column values out into separate strings and then processing them, rather than using a huge, unreadable regular expression.
foreach (string line in lines) {
string[] columnValues = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
...
}

A regular expression seems not to be the right approach here. Use a positional approach
string s = "* 0 S4810-01-64F HADL120620060 7590009602 A";
bool withStar = s[0] == '*';
string nr = s.Substring(2, 2).Trim();
string colA = s.Substring(5, 18).TrimEnd();
string colB = s.Substring(24, 14).TrimEnd();
...
UPDATE
If you want to (or must) stick with Regex, test for the spaces instead of the values. Of course, this works only if the values never include spaces.
string[] result = Regex.Split(s, @"\s+");
Of course, you can also search for non-spaces (\S) instead of spaces (\s).
MatchCollection matches = Regex.Matches(s, @"\S+");
or excluding the star
(?:\*)?[^*\s]+

Your regular expression doesn't even require GE or TE. Note the ? after (?:GE|TE):
it means that the preceding group or symbol is optional.
The same is true of the AC and V section.

I would not use regular expressions to parse these reports.
Instead, treat them as fixed column width reports after the headers are stripped off.
I would do something like (this is typed cold as an example, not tested even for syntax):
// Leaving off all error detection stuff
class ColumnDef
{
    public string Name { get; set; }
    public int FirstCol { get; set; } // zero-based index of the first character
    public int LastCol { get; set; }  // zero-based index of the last character (assumed inclusive)
}
ColumnDef[] report = new ColumnDef[]
{
    new ColumnDef { Name = "ColA",
        FirstCol = 0,
        LastCol = 2
    },
    // ... and so on for each column
};
IDictionary<string, string> ParseDataLine(string line)
{
    var values = new Dictionary<string, string>();
    foreach (var c in report)
    {
        // Substring takes a start index and a length, not an end index
        values[c.Name] = line.Substring(c.FirstCol, c.LastCol - c.FirstCol + 1).Trim();
    }
    return values;
}
This is an example of a generic ETL (Extract, Transform, and Load) problem--specifically the Extract stage.
You will have to strip out header and footer lines before using ParseDataLine, and I am not sure there is enough information shown to do that. Based on what your post says, any line that is blank, or doesn't start with a space or a * is a header/footer line to be ignored.
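The header/footer rule just described could be sketched roughly like this. The names ReportLines, IsDataLine and DataLines are made up for illustration, and the sketch assumes data lines really do begin with a space or a * in the raw output (leading whitespace may have been lost in the samples as posted).

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ReportLines
{
    // A data line is non-blank and begins with a space or '*';
    // anything else is treated as a header or footer line.
    public static bool IsDataLine(string line) =>
        !string.IsNullOrWhiteSpace(line) && (line[0] == ' ' || line[0] == '*');

    // Splits the raw report into lines and keeps only the data lines.
    public static List<string> DataLines(string report) =>
        report.Split('\n')
              .Select(l => l.TrimEnd('\r'))
              .Where(IsDataLine)
              .ToList();
}
```

Each line returned by DataLines can then be handed to ParseDataLine.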

Why not try something like this (?:\*)?\s+(?<unitno>\d+)\s+\S+\s+(?<serial>\S+)\s+\S+\s+\S+(?:\s+)?\n
This is built off your provided regular expression. Because of the trailing \n, the provided input will need to end with a newline (line feed).
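Here is a minimal sketch of how that relaxed pattern could be applied to the whole output string. UnitParser and Parse are illustrative names only, and the sketch assumes every row ends with a line feed and that rows without a * start with whitespace.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnitParser
{
    // Pattern from the answer above: the model-name column is matched by a
    // plain \S+ instead of spelling out GE/TE and AC/V.
    private static readonly Regex Row = new Regex(
        @"(?:\*)?\s+(?<unitno>\d+)\s+\S+\s+(?<serial>\S+)\s+\S+\s+\S+(?:\s+)?\n");

    // Returns one (unit, serial) pair per matching row.
    public static List<(string Unit, string Serial)> Parse(string output)
    {
        var rows = new List<(string Unit, string Serial)>();
        foreach (Match m in Row.Matches(output))
            rows.Add((m.Groups["unitno"].Value, m.Groups["serial"].Value));
        return rows;
    }
}
```

Note that, unlike the original expression, this one also matches the PWR and FAN rows (with N/A as the serial), so filter those out afterwards if only the chassis rows are wanted.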

Related

c# Read/ Write CSV - excluding Comma in field Value [duplicate]

I am looking for suggestions on how to handle a csv file that is being created, then uploaded by our customers, and that may have a comma in a value, like a company name.
Some of the ideas we are looking at are: quoted Identifiers (value "," values ","etc) or using a | instead of a comma. The biggest problem is that we have to make it easy, or the customer won't do it.
There's actually a spec for CSV format, RFC 4180 and how to handle commas:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
http://tools.ietf.org/html/rfc4180
So, to have values foo and bar,baz, you do this:
foo,"bar,baz"
Another important requirement to consider (also from the spec):
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
As others have said, you need to escape values that include quotes. Here’s a little CSV reader in C♯ that supports quoted values, including embedded quotes and carriage returns.
By the way, this is unit-tested code. I’m posting it now because this question seems to come up a lot and others may not want an entire library when simple CSV support will do.
You can use it as follows:
using System;
public class test
{
public static void Main()
{
using ( CsvReader reader = new CsvReader( "data.csv" ) )
{
foreach( string[] values in reader.RowEnumerator )
{
Console.WriteLine( "Row {0} has {1} values.", reader.RowIndex, values.Length );
}
}
Console.ReadLine();
}
}
Here are the classes. Note that you can use the Csv.Escape function to write valid CSV as well.
using System.IO;
using System.Text.RegularExpressions;
public sealed class CsvReader : System.IDisposable
{
public CsvReader( string fileName ) : this( new FileStream( fileName, FileMode.Open, FileAccess.Read ) )
{
}
public CsvReader( Stream stream )
{
__reader = new StreamReader( stream );
}
public System.Collections.IEnumerable RowEnumerator
{
get {
if ( null == __reader )
throw new System.ApplicationException( "I can't start reading without CSV input." );
__rowno = 0;
string sLine;
string sNextLine;
while ( null != ( sLine = __reader.ReadLine() ) )
{
while ( rexRunOnLine.IsMatch( sLine ) && null != ( sNextLine = __reader.ReadLine() ) )
sLine += "\n" + sNextLine;
__rowno++;
string[] values = rexCsvSplitter.Split( sLine );
for ( int i = 0; i < values.Length; i++ )
values[i] = Csv.Unescape( values[i] );
yield return values;
}
__reader.Close();
}
}
public long RowIndex { get { return __rowno; } }
public void Dispose()
{
if ( null != __reader ) __reader.Dispose();
}
//============================================
private long __rowno = 0;
private TextReader __reader;
private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
private static Regex rexRunOnLine = new Regex( @"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$" );
}
public static class Csv
{
public static string Escape( string s )
{
if ( s.Contains( QUOTE ) )
s = s.Replace( QUOTE, ESCAPED_QUOTE );
if ( s.IndexOfAny( CHARACTERS_THAT_MUST_BE_QUOTED ) > -1 )
s = QUOTE + s + QUOTE;
return s;
}
public static string Unescape( string s )
{
if ( s.StartsWith( QUOTE ) && s.EndsWith( QUOTE ) )
{
s = s.Substring( 1, s.Length - 2 );
if ( s.Contains( ESCAPED_QUOTE ) )
s = s.Replace( ESCAPED_QUOTE, QUOTE );
}
return s;
}
private const string QUOTE = "\"";
private const string ESCAPED_QUOTE = "\"\"";
private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
}
The CSV format uses commas to separate values. Values that contain carriage returns, line feeds, commas, or double quotes are surrounded by double quotes, and each literal double quote inside such a value is escaped by an immediately preceding double quote. For example, the 3 values:
test
list, of, items
"go" he said
would be encoded as:
test
"list, of, items"
"""go"" he said"
Any field can be quoted but only fields that contain commas, CR/NL, or quotes must be quoted.
There is no real standard for the CSV format, but almost all applications follow the conventions documented here. The RFC that was mentioned elsewhere is not a standard for CSV, it is an RFC for using CSV within MIME and contains some unconventional and unnecessary limitations that make it useless outside of MIME.
A gotcha that many CSV modules I have seen don't accommodate is the fact that multiple lines can be encoded in a single field which means you can't assume that each line is a separate record, you either need to not allow newlines in your data or be prepared to handle this.
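To make that gotcha concrete, here is a minimal sketch of a reader that walks the text character by character instead of line by line, so a quoted field may span several lines. MiniCsv and Parse are illustrative names, not part of any library, and the sketch assumes the quoting conventions described above (fields wrapped in double quotes, literal quotes doubled).

```csharp
using System.Collections.Generic;
using System.Text;

public static class MiniCsv
{
    // Parses CSV text into records; quoted fields may contain commas,
    // doubled quotes ("") and line breaks.
    public static List<List<string>> Parse(string text)
    {
        var records = new List<List<string>>();
        var record = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < text.Length && text[i + 1] == '"')
                {
                    field.Append('"'); // "" inside quotes is one literal quote
                    i++;
                }
                else if (c == '"') inQuotes = false;
                else field.Append(c);  // commas and newlines are plain data here
            }
            else if (c == '"') inQuotes = true;
            else if (c == ',')
            {
                record.Add(field.ToString());
                field.Clear();
            }
            else if (c == '\n')
            {
                // An unquoted line feed ends the current record.
                record.Add(field.ToString());
                field.Clear();
                records.Add(record);
                record = new List<string>();
            }
            else if (c != '\r') field.Append(c);
        }

        // Flush a final record that has no trailing newline.
        if (field.Length > 0 || record.Count > 0)
        {
            record.Add(field.ToString());
            records.Add(record);
        }
        return records;
    }
}
```

The record count returned by Parse can be lower than the number of physical lines in the file, which is exactly the situation a line-by-line splitter gets wrong.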
Put double quotes around strings. That is generally what Excel does.
À la Eli, you escape a double quote as two double quotes. E.g.
"test1","foo""bar","test2"
You can put double quotes around the fields. I don't like this approach, as it adds another special character (the double quote). Just define an escape character (usually backslash) and use it wherever you need to escape something:
data,more data,more data\, even,yet more
You don't have to try to match quotes, and you have fewer exceptions to parse. This simplifies your code, too.
There is a library available through nuget for dealing with pretty much any well formed CSV (.net) - CsvHelper
Example to map to a class:
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>();
Example to read individual fields:
var csv = new CsvReader( textReader );
while( csv.Read() )
{
var intField = csv.GetField<int>( 0 );
var stringField = csv.GetField<string>( 1 );
var boolField = csv.GetField<bool>( "HeaderName" );
}
Letting the client drive the file format:
, is the standard field delimiter, " is the standard value used to escape fields that contain a delimiter, quote, or line ending.
To use (for example) # for fields and ' for escaping:
var csv = new CsvReader( textReader );
csv.Configuration.Delimiter = "#";
csv.Configuration.Quote = '\'';
// read the file however meets your needs
In case you're on a *nix-system, have access to sed and there can be one or more unwanted commas only in a specific field of your CSV, you can use the following one-liner in order to enclose them in " as RFC4180 Section 2 proposes:
sed -r 's/([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)/\1"\2"\3/' inputfile
Depending on which field the unwanted comma(s) may be in you have to alter/extend the capturing groups of the regex (and the substitution).
The example above will enclose the fourth field (out of six) in quotation marks.
In combination with the --in-place-option you can apply these changes directly to the file.
In order to "build" the right regex, there's a simple principle to follow:
For every field in your CSV that comes before the field with the unwanted comma(s) you write one [^,]*, and put them all together in a capturing group.
For the field that contains the unwanted comma(s) you write (.*).
For every field after the field with the unwanted comma(s) you write one ,.* and put them all together in a capturing group.
Here is a short overview of different possible regexes/substitutions depending on the specific field. If not given, the substitution is \1"\2"\3.
([^,]*)(,.*) #first field, regex
"\1"\2 #first field, substitution
(.*,)([^,]*) #last field, regex
\1"\2" #last field, substitution
([^,]*,)(.*)(,.*,.*,.*) #second field (out of five fields)
([^,]*,[^,]*,)(.*)(,.*) #third field (out of four fields)
([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*) #fourth field (out of six fields)
If you want to remove the unwanted comma(s) with sed instead of enclosing them with quotation marks refer to this answer.
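For anyone without sed at hand, the fourth-of-six case can be sketched with .NET's Regex.Replace using the same three capturing groups. FieldQuoter and QuoteFourthField are made-up names, and the sketch assumes one record per string with no line breaks inside fields.

```csharp
using System.Text.RegularExpressions;

public static class FieldQuoter
{
    // Same three-group pattern as the sed example: fields 1-3, the field
    // with the stray commas, then the last two fields.
    private static readonly Regex FourthOfSix =
        new Regex(@"^([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)$");

    // Wraps the fourth field in double quotes; lines that do not match
    // the pattern are returned unchanged.
    public static string QuoteFourthField(string line) =>
        FourthOfSix.Replace(line, "$1\"$2\"$3");
}
```

In .NET the replacement string uses $1, $2 and $3 where sed uses \1, \2 and \3.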
As mentioned in my comment on harpo's answer, his solution is good and works in most cases; however, in some scenarios where commas are directly adjacent to each other, it fails to split on the commas.
This is because the regex string behaves unexpectedly as a verbatim string.
To get this to behave correctly, all " characters in the regex string need to be escaped manually, without using the verbatim escape.
I.e., the regex should be written with manual escapes:
",(?=(?:[^\"\"]*\"\"[^\"\"]*\"\")*(?![^\"\"]*\"\"))"
which translates into ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
When using a verbatim string @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" it behaves as the following, as you can see if you debug the regex:
",(?=(?:[^"]*"[^"]*")*(?![^"]*"))"
So in summary, I recommend harpo's solution, but watch out for this little gotcha!
I've included into the CsvReader a little optional failsafe to notify you if this error occurs (if you have a pre-known number of columns):
if (_expectedDataLength > 0 && values.Length != _expectedDataLength)
throw new DataLengthException(string.Format("Expected {0} columns when splitting csv, got {1}", _expectedDataLength, values.Length));
This can be injected via the constructor:
public CsvReader(string fileName, int expectedDataLength = 0) : this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
_expectedDataLength = expectedDataLength;
}
Add a reference to Microsoft.VisualBasic (yes, it says VisualBasic, but it works in C# just as well; remember that in the end it is all just IL).
Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file. Here is the sample code:
Dim parser As TextFieldParser = New TextFieldParser("C:\mar0112.csv")
parser.TextFieldType = FieldType.Delimited
parser.SetDelimiters(",")
While Not parser.EndOfData
    'Processing row
    Dim fields() As String = parser.ReadFields
    For Each field As String In fields
        'TODO: Process field
    Next
End While
'Close the parser only after every row has been read
parser.Close()
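The sample above is VB.NET; a rough C# equivalent might look like the sketch below. VbCsv and ReadAll are illustrative names, and the sketch reads from a TextReader rather than a file path so it can be tried without a file on disk.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class VbCsv
{
    // Reads every row from the given reader using TextFieldParser.
    public static List<string[]> ReadAll(TextReader input)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Already the default; set explicitly to show that quoted
            // fields containing commas are handled by the parser.
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }
}
```

For example, VbCsv.ReadAll(new StringReader("foo,\"bar,baz\"\n")) should yield a single row whose second field is bar,baz.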
You can use alternative "delimiters" like ";" or "|" but simplest might just be quoting which is supported by most (decent) CSV libraries and most decent spreadsheets.
For more on CSV delimiters and a spec for a standard format for describing delimiters and quoting see this webpage
If you're interested in a more educational exercise on how to parse files in general (using CSV as an example), you may check out this article by Julian Bucknall. I like the article because it breaks things down into much smaller problems that are much less insurmountable. You first create a grammar, and once you have a good grammar, it's a relatively easy and methodical process to convert the grammar into code.
The article uses C# and has a link at the bottom to download the code.
If you feel like reinventing the wheel, the following may work for you:
public static IEnumerable<string> SplitCSV(string line)
{
var s = new StringBuilder();
bool escaped = false, inQuotes = false;
foreach (char c in line)
{
if (c == ',' && !inQuotes)
{
yield return s.ToString();
s.Clear();
}
else if (c == '\\' && !escaped)
{
escaped = true;
}
else if (c == '"' && !escaped)
{
inQuotes = !inQuotes;
}
else
{
escaped = false;
s.Append(c);
}
}
yield return s.ToString();
}
In Europe we ran into this problem long before this question came up, because we use a comma as the decimal separator. See the numbers below:
| American | Europe |
| ------------- | ------------- |
| 0.5 | 0,5 |
| 3.14159265359 | 3,14159265359 |
| 17.54 | 17,54 |
| 175,186.15 | 175.186,15 |
So it isn't possible to use the comma as the separator for CSV files. For that reason, CSV files in Europe are separated by a semicolon (;).
Programs like Microsoft Excel can read files with a semicolon, and it's possible to switch the separator. You could even use a tab (\t) as the separator. See this answer from Super User.
Here's a neat little workaround:
You can use a Greek Lower Numeral Sign instead (U+0375)
It looks like this ͵
Using this method saves you a lot of resources too...
I know it's almost 13 years later, but we came across a similar situation where the client sends us a CSV that has values with commas. There are 2 use cases:
If the client uses Excel on Windows to write the CSV (usually the case in a Windows environment), then double quotes are automatically added around the value.
The actual text value of the CSV:
3786962,1st Meridian Care Services,John,"Person A,Person B, Person C, Person D",Voyager
If the client is generating the file programmatically, then they should adhere to RFC 4180 and enclose the value in double quotes. Example:
Col1, Col2, "a, b, c", Col4
Just use SoftCircuits.CsvParser on NuGet. It will handle all those details for you and efficiently handles very large files. And, if needed, it can even import/export objects by mapping columns to object properties. In addition, my testing showed it averages nearly 4 times faster than the popular CsvHelper.
You can read the csv file like this.
this makes use of splits and takes care of spaces.
ArrayList List = new ArrayList();
static ServerSocket Server;
static Socket socket;
static ArrayList<Object> list = new ArrayList<Object>();
public static void ReadFromXcel() throws FileNotFoundException
{
File f = new File("Book.csv");
Scanner in = new Scanner(f);
int count =0;
String[] date;
String[] name;
String[] Temp = new String[10];
String[] Temp2 = new String[10];
String[] numbers;
ArrayList<String[]> List = new ArrayList<String[]>();
HashMap m = new HashMap();
in.nextLine();
date = in.nextLine().split(",");
name = in.nextLine().split(",");
numbers = in.nextLine().split(",");
while(in.hasNext())
{
String[] one = in.nextLine().split(",");
List.add(one);
}
int xount = 0;
//Making sure the lines don't start with a blank
for(int y = 0; y<= date.length-1; y++)
{
if(!date[y].equals(""))
{
Temp[xount] = date[y];
Temp2[xount] = name[y];
xount++;
}
}
date = Temp;
name =Temp2;
int counter = 0;
while(counter < List.size())
{
String[] list = List.get(counter);
String sNo = list[0];
String Surname = list[1];
String Name = list[2];
for(int x = 3; x < list.length; x++)
{
m.put(numbers[x], list[x]);
}
Object newOne = new newOne(sNo, Name, Surname, m, false);
StudentList.add(s);
System.out.println(s.sNo);
counter++;
}
I generally URL-encode the fields which can have any commas or any special chars. And then decode it when it is being used/displayed in any visual medium.
(commas becomes %2C)
Every language should have methods to URL-encode and decode strings.
e.g., in java
URLEncoder.encode(myString,"UTF-8"); //to encode
URLDecoder.decode(myEncodedstring, "UTF-8"); //to decode
I know this is a very general solution and it might not be ideal for situation where user wants to view content of csv file, manually.
I usually do this in my CSV files parsing routines. Assume that 'line' variable is one line within a CSV file and all of the columns' values are enclosed in double quotes. After the below two lines execute, you will get CSV columns in the 'values' collection.
// The below two lines will split the columns as well as trim the DOUBLE QUOTES around values but NOT within them
string trimmedLine = line.Trim(new char[] { '\"' });
List<string> values = trimmedLine.Split(new string[] { "\",\"" }, StringSplitOptions.None).ToList();
The simplest solution I've found is the one LibreOffice uses:
Replace all literal " by ”
Put double quotes around your string
You can also use the one that Excel uses:
Replace all literal " by ""
Put double quotes around your string
Notice other people recommended to do only step 2 above, but that doesn't work with lines where a " is followed by a ,, like in a CSV where you want to have a single column with the string hello",world, as the CSV would read:
"hello",world"
Which is interpreted as a row with two columns: hello and world"
public static IEnumerable<string> LineSplitter(this string line, char separator, char skip = '"')
{
    var fieldStart = 0;
    for (var i = 0; i < line.Length; i++)
    {
        if (line[i] == skip)
        {
            // Jump to the closing quote so separators inside quotes are ignored
            for (i++; i < line.Length && line[i] != skip; i++) { }
        }
        else if (line[i] == separator)
        {
            yield return line.Substring(fieldStart, i - fieldStart);
            fieldStart = i + 1;
        }
    }
    // The last field runs to the end of the line (empty if the line ends with a separator)
    yield return line.Substring(fieldStart);
}
I used the Csvreader library, but with it the data was split at the commas (,) inside column values.
So if you want to insert CSV file data that contains commas (,) in most of the column values, you can use the function below.
Author link => https://gist.github.com/jaywilliams/385876
function csv_to_array($filename='', $delimiter=',')
{
if(!file_exists($filename) || !is_readable($filename))
return FALSE;
$header = NULL;
$data = array();
if (($handle = fopen($filename, 'r')) !== FALSE)
{
while (($row = fgetcsv($handle, 1000, $delimiter)) !== FALSE)
{
if(!$header)
$header = $row;
else
$data[] = array_combine($header, $row);
}
fclose($handle);
}
return $data;
}
I used the Papa Parse library to parse the CSV file into key-value pairs (the keys come from the header, i.e. the first row of the CSV file).
Here is an example that I use:
https://codesandbox.io/embed/llqmrp96pm
It has a dummy.csv file in there for the CSV parsing demo.
I've used it within ReactJS, though it is easy and simple to replicate in an app written in any language.
An example might help to show how commas can be displayed in a .csv file. Create a simple text file as follows:
Save this text file as a text file with suffix ".csv" and open it with Excel 2000 from Windows 10.
aa,bb,cc,d;d
"In the spreadsheet presentation, the below line should look like the above line except the below shows a displayed comma instead of a semicolon between the d's."
aa,bb,cc,"d,d", This works even in Excel
aa,bb,cc,"d,d", This works even in Excel 2000
aa,bb,cc,"d ,d", This works even in Excel 2000
aa,bb,cc,"d , d", This works even in Excel 2000
aa,bb,cc, " d,d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc, " d ,d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc, " d , d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc,"d,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
aa,bb,cc,"d ,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
aa,bb,cc,"d , d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
Rule: If you want to display a comma in a cell (field) of a .csv file:
"Start and end the field with a double quotes, but avoid white space before the 1st quote"
As this is about general practices, let's start with some rules of thumb:
1. Don't use CSV; use XML with a library to read and write the XML file instead.
2. If you must use CSV, do it properly and use a free library to parse and store the CSV files.
To justify 1), most CSV parsers aren't encoding-aware, so if you aren't dealing with US-ASCII you are asking for trouble.
For example, Excel 2002 stores the CSV in the local encoding without any note about the encoding. The CSV standard isn't widely adopted :(.
On the other hand, the XML standard is well adopted and it handles encodings pretty well.
To justify 2), there are tons of CSV parsers around for almost every language, so there is no need to reinvent the wheel, even if the solution looks pretty simple.
To name a few:
for Python, use the built-in csv module
for Perl, check CPAN and Text::CSV
for PHP, use the built-in fgetcsv/fputcsv functions
for Java, check the SuperCSV library
Really, there is no need to implement this by hand unless you are going to parse it on an embedded device.
First, let's ask ourselves, "Why do we feel the need to handle commas differently for CSV files?"
For me, the answer is, "Because when I export data into a CSV file, the commas in a field disappear and my field gets separated into multiple fields where the commas appear in the original data." (That is because the comma is the CSV field separator character.)
Depending on your situation, semi colons may also be used as CSV field separators.
Given my requirements, I can use a character, e.g., single low-9 quotation mark, that looks like a comma.
So, here's how you can do it in Go:
// Replace special CSV characters with single low-9 quotation mark
func Scrub(a interface{}) string {
s := fmt.Sprint(a)
s = strings.Replace(s, ",", "‚", -1)
s = strings.Replace(s, ";", "‚", -1)
return s
}
The second comma looking character in the Replace function is decimal 8218.
Be aware that if you have clients that may have ASCII-only text readers, this decimal 8218 character will not look like a comma. If this is your case, then I'd recommend surrounding the field containing the comma (or semicolon) with double quotes per RFC 4180: https://www.rfc-editor.org/rfc/rfc4180
Thank you others in this post.
I used the information here to create a function in JavaScript that will get csv output for an array of objects which may have property values containing commas.
like
rowsArray = [{obj1prop1: "foo", obj1prop2: "bar,baz"}, {obj2prop1: "qux", obj2prop2: "quux,corge,thud"}]
into
csvRowsArray = [{obj1prop1: "foo", obj1prop2: "\"bar,baz\""}, {...} ]
To use the commas in the values in a csv, the value needs to be wrapped in double quotes. And in order to have double quotes in the value in the json object, they just need to be escaped, i.e., \", backslash double quote. The escape is made here by subbing in a template literal and including the necessary quotes `"${row[key]}"`. The quotes are escaped when put in the object.
Here is my function:
const calculateTheCSVExport = (props) => {
if (props.rows === undefined) return;
let jsonRowsArray = props.rows;
// console.log(jsonRowsArray);
let csvRowsArrayNoCommasInObjectValues = [];
let csvCurrRowObject = {}
jsonRowsArray.forEach(row => {
Object.keys(row).forEach(key => {
// console.log(key, row[key])
if (row[key].indexOf(',') > -1) {
csvCurrRowObject = {...csvCurrRowObject, [key]: `"${row[key]}"`} // enclose value in escaped double quotes in JSON in order to export commas to csv correctly. see more: https://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file
} else {
csvCurrRowObject = {...csvCurrRowObject, [key]: row[key]}
}
});
csvRowsArrayNoCommasInObjectValues.push(csvCurrRowObject);
csvCurrRowObject = {};
})
// console.log(csvRowsArrayNoCommasInObjectValues)
return csvRowsArrayNoCommasInObjectValues;
}
I think the easiest solution to this problem is to have the customer open the CSV in Excel and then press Ctrl+H to replace every comma with whatever identifier you want. This is very easy for the customer and requires only one change in your code to read the delimiter of your choice.
Use a tab character (\t) to separate the fields.

Format unstructured string

I have tried several methods (by position, by white space, regex) but cannot figure how to best parse the following lines as a table. For e.g. let's say the two lines I want to parse are:
Bonds Bid Offer (mm) (mm) Chng
STACR 2015-HQA1 M1 125 120 5 x 1.5 0
STACR 2015-HQA12 2M2 265 5 x -2
I want that it should parse as follows for [BondName] [Bid] [Offer]:
[STACR 2015-HQA1 M1] [125] [120]
[STACR 2015-HQA12 2M2] [265] [null]
Notice the null which is an actual value and also the spaces should be retained in the bond name. FYI, the number of spaces in the Bond Name will be 2 as in the above examples.
Edit: Since many of you have asked for code, here it is. The spaces between the points can range from 1-5, so I cannot rely on spaces (it would have been straightforward then).
string bondName = quoteLine.Substring(0, 19);
string bid = quoteLine.Substring(19, 5).Trim();
string offer = quoteLine.Substring(24, 6).Trim();
The only way I can see this working is that:
1st data point is STACR (Type)
2nd data point is the year and Series
(e.g. 2015-HQA1)
3rd data point is Tranche (M1)
4th data point is bid
(e.g. 125 ** bid is always available **)
5th data point is offer (e.g. 120 but can be blank
or whitespace which introduces complexity)
With the current set of requirements, I'm assuming the following
1. String starts with 3 part bond name
2. Followed by bid
3. Followed by offer (optional)
4. After that, we'll have something like ... x ... ... (we'll use x as reference)
Given they are valid, you can use the following code
var str = "STACR 2015-HQA1 M1 125 120 5 x 1.5 0"; //your data
var parts = str.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();
//we'll use this pattern : <3 part bond name> <bid> <offer/null> <something x ....>
var xIsAt = parts.IndexOf("x"); //we'll use x as reference
if (xIsAt > 2) //first three are BondName
parts.RemoveRange(xIsAt - 1, parts.Count - xIsAt + 1); //remove "5 x 1.5 ..."
var bond = string.Join(" ", parts.Take(3)); //first 3 parts are bond
var bid = parts.Count > 3 ? parts.ElementAt(3) : null; //4th is bid
var offer = parts.Count > 4 ? parts.ElementAt(4) : null; //5th is offer
[EDIT]
I did not account for the blank 'Offer', so this method will fail on a blank 'Offer'. Looks like someone already has a working answer, but I'll leave the LINQ example for anyone that finds it useful.
[END EDIT]
Linq based option.
Split the string by spaces, and remove empty spaces. Then reverse the order so you can start from the back and work your way forward. The data appears more normalized at the end of the string.
For each successive part of the line, you skip the previous options and only take what you need. For the last part which is the long string, you skip what you don't need, then reverse the order back to normal, and join the segments together with spaces.
string test = "STACR 2015-HQA1 M1 125 120 5 x 1.5 0";
var split_string_remove_empty = test.Split(new char[]{ ' ' }, StringSplitOptions.RemoveEmptyEntries).Reverse();
var change = split_string_remove_empty.Take(1)
.SingleOrDefault();
var mm2 = split_string_remove_empty.Skip(1)
.Take(1)
.SingleOrDefault();
var mm3 = split_string_remove_empty.Skip(3)
.Take(1)
.SingleOrDefault();
var offer = split_string_remove_empty.Skip(4)
.Take(1)
.SingleOrDefault();
var bid = split_string_remove_empty.Skip(5)
.Take(1)
.SingleOrDefault();
var bonds = string.Join(" ", split_string_remove_empty.Skip(6)
.Reverse());

String formatting in C# to get identical spacing

I've been looking up string formatting and frankly I'm getting confused. This is what I want to do.
I have a "character stats" page (this is a console app), and I want it formatted like this:
=----------------------------------=
= Strength: 24 | Agility: 30 =
= Dexterity: 30 | Stamina: 28 =
= Magic: 12 | Luck: 18 =
=----------------------------------=
I guess basically I'm trying to find out how to make that middle '|' divider be in the same place regardless of how many letters the stat is or how many points the stat is.
Thanks for the input.
Edit: I also want the ending '=' to also be in the same spot.
I learned something new, it seems! As some of the others have mentioned, you can accomplish the same thing using String.Format.
The format items used in String.Format can also include an optional alignment component.
// index alignment
// v v
String.Format("Hello {0,-10}!", "World");
When this is negative, then the string is left-aligned. When positive, it is right aligned. In both cases, the string is padded correspondingly with whitespace if it is shorter than the specified width (otherwise, the string is just inserted fully).
I believe this is an easier and more readable technique than having to fiddle with String.PadRight.
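Applied to the stats box from the question, the alignment component could be used roughly like this. StatsBox, Row and Render are illustrative names, and the column widths 16 and 13 are an assumption chosen so each row comes out at 36 characters, the same width as the border.

```csharp
using System.Text;

public static class StatsBox
{
    // "= " + 16 + " | " + 13 + " =" gives 36 characters per row.
    // Negative alignment values left-align and pad on the right.
    public static string Row(string leftName, int leftValue, string rightName, int rightValue) =>
        string.Format("= {0,-16} | {1,-13} =",
            leftName + ": " + leftValue,
            rightName + ": " + rightValue);

    public static string Render()
    {
        string border = "=" + new string('-', 34) + "=";
        var sb = new StringBuilder();
        sb.AppendLine(border);
        sb.AppendLine(Row("Strength", 24, "Agility", 30));
        sb.AppendLine(Row("Dexterity", 30, "Stamina", 28));
        sb.AppendLine(Row("Magic", 12, "Luck", 18));
        sb.Append(border);
        return sb.ToString();
    }
}
```

Because both columns are padded to a fixed width, the | and the closing = stay in the same place as long as no label plus value exceeds its column width.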
You can also use String.PadRight (or String.PadLeft). Example:
class Stats {
// Contains properties as you defined ...
}
var stats = new Stats(...);
int leftColWidth = 16;
int rightColWidth = 13;
var sb = new StringBuilder();
sb.AppendLine("=----------------------------------=");
sb.Append("= ");
sb.Append(("Strength: " + stats.Strength.ToString()).PadRight(leftColWidth));
sb.Append(" | ");
sb.Append(("Agility: " + stats.Agility.ToString()).PadRight(rightColWidth));
// And so on.
I used to use this technique a lot back in the 80's doing text based games. Obviously we didn't have string.Format back in those days; but it allows you to visualize the layout in the code.
Pre-format the text as you want it to be laid out, then just use the string.Format() function like so...
string formattedText = @"
=----------------------------------=
= Strength: {0,2} | Agility: {3,2} =
= Dexterity: {1,2} | Stamina: {4,2} =
= Magic: {2,2} | Luck: {5,2} =
=----------------------------------=".Trim();
string output = string.Format(formattedText, 12, 13, 14, 15, 16, 1);
Console.WriteLine(output);
Console.ReadLine();
String.Format("{0,-20}|","Dexterity: 30")
would align the value to the left and pad it to 20 characters. The only problem is that if the parameter is longer than 20 it would not be truncated.
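If over-long values must not push the divider out of place, one option is to cut the text before padding it. This is only a sketch; ColumnText and Fit are hypothetical names, not framework methods.

```csharp
using System;

static class ColumnText
{
    // Hypothetical helper: returns text that is exactly `width` characters long.
    public static string Fit(string text, int width)
    {
        // Cut overly long input first, then pad short input on the right.
        string cut = text.Length > width ? text.Substring(0, width) : text;
        return cut.PadRight(width);
    }
}
```

ColumnText.Fit("Dexterity: 30", 20) + "|" would then always put the '|' in column 21, at the cost of silently dropping characters beyond the width.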
You will need to use a String.PadRight or a String.PadLeft. Do something like this:
Trip_Name1 = Trip_Name1.PadRight(20,' ');
This is what you are looking for I think.

Extracting data from plain text string

I am trying to process a report from a system which gives me the following code
000=[GEN] OK {Q=1 M=1 B=002 I=3e5e65656-e5dd-45678-b785-a05656569e}
I need to extract the values between the curly brackets {} and save them into variables. I assume I will need to do this using regex or similar? I've really no idea where to start! I'm using C# with ASP.NET 4.
I need the following variables
param1 = 000
param2 = GEN
param3 = OK
param4 = 1 //Q
param5 = 1 //M
param6 = 002 //B
param7 = 3e5e65656-e5dd-45678-b785-a05656569e //I
I will name the params based on what they actually mean. Can anyone please help me here? I have tried to split based on spaces, but I get the other garbage with it!
Thanks for any pointers/help!
If the format is pretty constant, you can use .NET string processing methods to pull out the values, something along the lines of
string line =
"000=[GEN] OK {Q=1 M=1 B=002 I=3e5e65656-e5dd-45678-b785-a05656569e}";
int start = line.IndexOf('{');
int end = line.IndexOf('}');
string variablePart = line.Substring(start + 1, end - start - 1); // text between { and }, braces excluded
string[] variables = variablePart.Split(' ');
foreach (string variable in variables)
{
string[] parts = variable.Split('=');
// parts[0] holds the variable name, parts[1] holds the value
}
Wrote this off the top of my head, so there may be an off-by-one error somewhere. Also, it would be advisable to add error checking e.g. to make sure the input string has both a { and a }.
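For what it's worth, a version with the error checking added might look like the sketch below. ReportLine and ParseBraces are names I made up, and throwing FormatException is just one possible way to report bad input.

```csharp
using System;
using System.Collections.Generic;

static class ReportLine
{
    // Hypothetical helper: returns the name/value pairs found between { and }.
    public static Dictionary<string, string> ParseBraces(string line)
    {
        int start = line.IndexOf('{');
        int end = start < 0 ? -1 : line.IndexOf('}', start);
        if (start < 0 || end < 0)
            throw new FormatException("Expected a {...} section in: " + line);

        var values = new Dictionary<string, string>();
        string inner = line.Substring(start + 1, end - start - 1);
        foreach (string pair in inner.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split on the first '=' only, so a value may itself contain '='.
            string[] parts = pair.Split(new[] { '=' }, 2);
            if (parts.Length != 2)
                throw new FormatException("Expected name=value but got: " + pair);
            values[parts[0]] = parts[1];
        }
        return values;
    }
}
```

With the sample line, ParseBraces(line)["B"] gives "002" and ParseBraces(line)["I"] gives the long identifier.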
I would suggest a regular expression for this type of work.
var objRegex = new System.Text.RegularExpressions.Regex(@"^(\d+)=\[([A-Z]+)\] ([A-Z]+) \{Q=(\d+) M=(\d+) B=(\d+) I=([a-z0-9\-]+)\}$");
var objMatch = objRegex.Match("000=[GEN] OK {Q=1 M=1 B=002 I=3e5e65656-e5dd-45678-b785-a05656569e}");
if (objMatch.Success)
{
Console.WriteLine(objMatch.Groups[1].ToString());
Console.WriteLine(objMatch.Groups[2].ToString());
Console.WriteLine(objMatch.Groups[3].ToString());
Console.WriteLine(objMatch.Groups[4].ToString());
Console.WriteLine(objMatch.Groups[5].ToString());
Console.WriteLine(objMatch.Groups[6].ToString());
Console.WriteLine(objMatch.Groups[7].ToString());
}
I've just tested this out and it works well for me.
Use a regular expression.
Quick and dirty attempt:
(?<ID1>[0-9]*)=\[(?<GEN>[a-zA-Z]*)\] OK {Q=(?<Q>[0-9]*) M=(?<M>[0-9]*) B=(?<B>[0-9]*) I=(?<I>[a-zA-Z0-9\-]*)}
This will generate named groups called ID1, GEN, Q, M, B and I.
Check out the MSDN docs for details on using Regular Expressions in C#.
You can use Regex Hero for quick C# regex testing.
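Reading the named groups from C# could look like the sketch below. NamedGroupDemo and Get are hypothetical names; the pattern is the one above, except that I escaped the literal braces for readability.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // Same pattern as above, with the literal braces escaped.
    static readonly Regex Pattern = new Regex(
        @"(?<ID1>[0-9]*)=\[(?<GEN>[a-zA-Z]*)\] OK \{Q=(?<Q>[0-9]*) M=(?<M>[0-9]*) B=(?<B>[0-9]*) I=(?<I>[a-zA-Z0-9\-]*)\}");

    // Hypothetical helper: returns the value of one named group, or null if the line does not match.
    public static string Get(string line, string groupName)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Groups[groupName].Value : null;
    }
}
```

Note that "OK" is hard-coded in the pattern, so a line with a different status would not match at all; capture it with another named group if you need param3.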
You can use String.Split
string[] parts = s.Split(new string[] {"=[", "] ", " {Q=", " M=", " B=", " I=", "}"},
StringSplitOptions.None);
This solution breaks up your report code into segments and stores the desired values into an array.
The regular expression matches one report code segment at a time and stores the appropriate values in the "Parsed Report Code Array".
As your example implied, the first two code segments are treated differently than the ones after that. I made the assumption that it is always the first two segments that are processed differently.
private static string[] ParseReportCode(string reportCode) {
const int FIRST_VALUE_ONLY_SEGMENT = 3;
const int GRP_SEGMENT_NAME = 1;
const int GRP_SEGMENT_VALUE = 2;
Regex reportCodeSegmentPattern = new Regex(@"\s*([^\}\{=\s]+)(?:=\[?([^\s\]\}]+)\]?)?");
Match matchReportCodeSegment = reportCodeSegmentPattern.Match(reportCode);
List<string> parsedCodeSegmentElements = new List<string>();
int segmentCount = 0;
while (matchReportCodeSegment.Success) {
if (++segmentCount < FIRST_VALUE_ONLY_SEGMENT) {
string segmentName = matchReportCodeSegment.Groups[GRP_SEGMENT_NAME].Value;
parsedCodeSegmentElements.Add(segmentName);
}
string segmentValue = matchReportCodeSegment.Groups[GRP_SEGMENT_VALUE].Value;
if (segmentValue.Length > 0) parsedCodeSegmentElements.Add(segmentValue);
matchReportCodeSegment = matchReportCodeSegment.NextMatch();
}
return parsedCodeSegmentElements.ToArray();
}

Regular expression that returns a constant value as part of a match

I have a regular expression to match 2 different number formats: \=(?<Value>[0-9]+)\?|\+(?<Value>[0-9]+)\?
This should return 9876543 as its Value for ;1234567890123456?+1234567890123456789012345123=9876543? and ;1234567890123456?+9876543?
What I would like is to be able to return another value along with the matched 'Value'.
So, for example, if the first string was matched, I'd like it to return:
Value:
9876543
Format:
LongFormat
And if matched in the second string:
Value:
9876543
Format:
ShortFormat
Is this possible?
Another option, which is not quite the solution you wanted, but saves you using two separate regexes, is to use named groups, if your implementation supports it.
Here is some C#:
var regex = new Regex(@"\=(?<Long>[0-9]+)\?|\+(?<Short>[0-9]+)\?");
string test1 = ";1234567890123456?+1234567890123456789012345123=9876543?";
string test2 = ";1234567890123456?+9876543?";
var match = regex.Match(test1);
Console.WriteLine("Long: {0}", match.Groups["Long"]); // 9876543
Console.WriteLine("Short: {0}", match.Groups["Short"]); // blank
match = regex.Match(test2);
Console.WriteLine("Long: {0}", match.Groups["Long"]); // blank
Console.WriteLine("Short: {0}", match.Groups["Short"]); // 9876543
Basically, just modify your regex to include the names, and then match.Groups[groupName] will either have a value or won't. You could even just use the Success property of the group to know which one matched (match.Groups["Long"].Success).
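Putting that together, a small helper that returns both the value and the format name might look like this. It is a sketch: FormatDetector and Describe are names I invented, and the output string simply mirrors the Value/Format layout from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class FormatDetector
{
    static readonly Regex Pattern = new Regex(@"\=(?<Long>[0-9]+)\?|\+(?<Short>[0-9]+)\?");

    // Hypothetical helper: returns "Value: ... Format: ..." or null when nothing matches.
    public static string Describe(string input)
    {
        Match match = Pattern.Match(input);
        if (!match.Success) return null;

        // Exactly one of the two named groups takes part in a successful match.
        bool isLong = match.Groups["Long"].Success;
        string value = isLong ? match.Groups["Long"].Value : match.Groups["Short"].Value;
        return "Value: " + value + " Format: " + (isLong ? "LongFormat" : "ShortFormat");
    }
}
```

The regex itself still only returns text that is in the input; the constant "LongFormat" or "ShortFormat" is chosen in code, based on which group succeeded.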
UPDATE:
You can get the group name out of the match, with the following code:
static void Main(string[] args)
{
var regex = new Regex(@"\=(?<Long>[0-9]+)\?|\+(?<Short>[0-9]+)\?");
string test1 = ";1234567890123456?+1234567890123456789012345123=9876543?";
string test2 = ";1234567890123456?+9876543?";
ShowGroupMatches(regex, test1);
ShowGroupMatches(regex, test2);
Console.ReadLine();
}
private static void ShowGroupMatches(Regex regex, string testCase)
{
int i = 0;
foreach (Group grp in regex.Match(testCase).Groups)
{
if (grp.Success && i != 0)
{
Console.WriteLine(regex.GroupNameFromNumber(i) + " : " + grp.Value);
}
i++;
}
}
I'm ignoring the 0th group, because that is always the entire match in .NET
No, you can't match text that isn't there. The match can only return a substring of the target.
You essentially want to match against two patterns and take different actions in each case. See if you can separate them in your code:
if match(\=(?<Value>[0-9]+)\?) then
return 'Value: ' + match + ' Format: LongFormat'
else if match(\+(?<Value>[0-9]+)\?) then
return 'Value: ' + match + ' Format: ShortFormat'
(Excuse the dodgy pseudocode, but you get the idea.)
You can't match text that isn't there - but, depending on what language you're using, you can process what you match, and conditionally add text based on what is there.
With some implementations of regex, you can specify a "callback function" which allows you to run logic against each result.
Here's a pseudo-code example:
Input.replaceAll( /[+=][0-9]+(?=\?)/ , formatValue );
formatValue : function(match,groups)
{
switch( left(match,1) )
{
case '+' : Format = 'Short'; break;
case '=' : Format = 'Long'; break;
default : Format = 'Unknown'; break;
}
Value = match.replace( /[+=]/ , '' );
return 'Value: '+Value+' Format: ' + Format;
}
What that will do, in a language that supports regex callbacks, is execute the formatValue function every time it finds a match, and use the result of the function as the replacement text.
You haven't specified which implementation you're using, so this may or may not be possible for you, but it is definitely worth checking out.
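If the implementation happens to be .NET, the callback is a MatchEvaluator passed to Regex.Replace. Here is a rough C# equivalent of the pseudo-code; CallbackDemo, FormatValue and Rewrite are hypothetical names, and I dropped the 'Unknown' case because the pattern can only ever match '+' or '='.

```csharp
using System;
using System.Text.RegularExpressions;

static class CallbackDemo
{
    // Matches '+' or '=' followed by digits, but only when a '?' comes next.
    static readonly Regex Pattern = new Regex(@"[+=][0-9]+(?=\?)");

    // Callback: invoked once per match; its return value replaces the matched text.
    static string FormatValue(Match match)
    {
        string format = match.Value[0] == '+' ? "Short" : "Long";
        string value = match.Value.Substring(1);
        return "Value: " + value + " Format: " + format;
    }

    public static string Rewrite(string input)
    {
        return Pattern.Replace(input, FormatValue);
    }
}
```

Keep in mind that Replace returns the whole input with each match substituted, so the text around the match (such as the leading ";1234567890123456?") stays in the result.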
