Parsing this special format file - c#

I have a file that is formatted this way --
{2000}000000012199{3100}123456789*{3320}110009558*{3400}9876
54321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX
78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLAS
TX 73920**
Basically, the number in curly brackets denotes field, followed by the value for that field. For example, {2000} is the field for "Amount", and the value for it is 121.99 (implied decimal). {3100} is the field for "AccountNumber" and the value for it is 123456789*.
I am trying to figure out a way to split the file into "records" and each record would contain the record type (the value in the curly brackets) and record value, but I don't see how.
How do I do this without a loop going through each character in the input?

A different way to look at it.... The { character is a record delimiter, and the } character is a field delimiter. You can just use Split().
var input = @"{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var rows = input.Split( new [] {"{"} , StringSplitOptions.RemoveEmptyEntries);
foreach (var row in rows)
{
var fields = row.Split(new [] { "}"}, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine("{0} = {1}", fields[0], fields[1]);
}
Output:
2000 = 000000012199
3100 = 123456789*
3320 = 110009558*
3400 = 987654321*
3600 = CTR
4200 = D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**
5000 = D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**
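If you want the records as data rather than console output, the same two Split calls can be wrapped up as below. This is only a sketch: the class and method names are made up for illustration, and the two-decimal rule for amounts is taken from the "implied decimal" remark in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TaggedRecordParser
{
    // Splits "{code}value{code}value..." into (code, value) pairs, in file order.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var records = new List<KeyValuePair<string, string>>();
        foreach (var row in input.Split(new[] { '{' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Limit the split to 2 parts so a '}' inside the value stays in the value.
            var parts = row.Split(new[] { '}' }, 2);
            // A field with an empty value yields an empty string instead of throwing.
            var value = parts.Length > 1 ? parts[1] : string.Empty;
            records.Add(new KeyValuePair<string, string>(parts[0], value));
        }
        return records;
    }

    // Hypothetical helper: reads a digits-only amount with two implied decimals.
    public static decimal ParseImpliedDecimal(string digits)
    {
        return decimal.Parse(digits, NumberStyles.None, CultureInfo.InvariantCulture) / 100m;
    }
}
```

For the sample input, Parse(input)[0] is the pair ("2000", "000000012199"), and ParseImpliedDecimal on that value gives 121.99.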

This regular expression should get you going:
Match a literal {
Match 1 or more digits ("a number")
Match a literal }
Match all characters that are not an opening {
\{\d+\}[^{]+
It assumes that the values themselves cannot contain an opening curly brace. If they can, you need to be more clever, e.g. @"\{\d+\}(?:\\{|[^{])+" (there are likely better ways).
Create a Regex instance and have it match against the text. Each "field" will be a separate match
var text = @"{123}abc{456}xyz";
var regex = new Regex(@"\{\d+\}[^{]+", RegexOptions.Compiled);
// Type the loop variable as Match; with "var" it is inferred as object on older frameworks.
foreach (Match match in regex.Matches(text)) {
Console.WriteLine(match.Groups[0].Value);
}
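If you also want the field number and the value separately, named groups are one way to get them. A minimal sketch, assuming field codes are purely numeric and that a repeated code should simply overwrite the earlier value; the class name and that overwrite rule are my own choices, not something the format dictates.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TaggedRecordRegex
{
    // "code" is the number inside the braces, "value" is everything up to the next '{'.
    private static readonly Regex FieldPattern =
        new Regex(@"\{(?<code>\d+)\}(?<value>[^{]*)", RegexOptions.Compiled);

    public static Dictionary<string, string> ToDictionary(string text)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in FieldPattern.Matches(text))
        {
            // Assumption: if a code repeats, the last occurrence wins.
            result[m.Groups["code"].Value] = m.Groups["value"].Value;
        }
        return result;
    }
}
```

Note the value group uses * rather than +, so a field with an empty value still shows up in the result.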

This doesn't fully answer the question, but it was getting too long to be a comment, so I'm leaving it here in Community Wiki mode. It does, at least, present a better strategy that may lead to a solution:
The main thing to understand here is it's rare — like, REALLY rare — to genuinely encounter a whole new kind of a file format for which an existing parser doesn't already exist. Even custom applications with custom file types will still typically build the basic structure of their file around a generic format like JSON or XML, or sometimes an industry-specific format like HL7 or MARC.
The strategy you should follow, then, is to first determine exactly what you're dealing with. Look at the software that generates the file; is there an existing SDK, reference, or package for the format? Or look at the industry surrounding this data; is there a special set of formats related to that industry?
Once you know this, you will almost always find an existing parser ready and waiting, and it's usually as easy as adding a NuGet package. These parsers are genuinely faster, need less code, and will be less susceptible to bugs (because most will have already been found by someone else). It's just an all-around better way to address the issue.
Now what I see in the question isn't something I recognize, so it's just possible you genuinely do have a custom format for which you'll need to write a parser from scratch... but even so, it doesn't seem like we're to that point yet.

Here is how to do it with LINQ, without a regex:
string x = "{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var result =
x.Split('{',StringSplitOptions.RemoveEmptyEntries)
.Aggregate(new List<Tuple<string, string>>(),
(l, z) => { var az = z.Split('}');
l.Add(new Tuple<string, string>(az[0], az[1]));
return l;});
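The Aggregate call above is really a loop that mutates a list. A Select-based version may read more naturally; this is a sketch with a made-up class name, using the char[] overloads of Split so it also compiles on older frameworks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TaggedRecordLinq
{
    public static List<(string Code, string Value)> Parse(string input)
    {
        return input
            .Split(new[] { '{' }, StringSplitOptions.RemoveEmptyEntries)
            // At most 2 parts per row: the code, then everything after the first '}'.
            .Select(row => row.Split(new[] { '}' }, 2))
            .Select(parts => (Code: parts[0], Value: parts.Length > 1 ? parts[1] : string.Empty))
            .ToList();
    }
}
```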

Related

how to get a value from json with just the index?

I'm making an app which needs to loop through Steam games.
Reading libraryfolders.vdf, I need to loop through the entries, find the first value of each pair, and save it as a string.
"libraryfolders"
{
"0"
{
"path" "D:\\Steam"
"label" ""
"contentid" "-1387328137801257092942"
"totalsize" "0"
"update_clean_bytes_tally" "42563526469"
"time_last_update_corruption" "1663765126"
"apps"
{
"730" "31892201109"
"4560" "9665045969"
"9200" "22815860246"
"11020" "776953234"
"34010" "11967809445"
"34270" "1583765638"
for example, it would record:
730
4560
9200
11020
34010
34270
I'm already using System.Text.Json in the program. Is there any way I could loop through and just get the first value using System.Text.Json, or would I need to do something different, since VDF doesn't separate the values with colons or commas?
That is not JSON, that is the KeyValues format developed by Valve. You can read more about the format here:
https://developer.valvesoftware.com/wiki/KeyValues
There are existing Stack Overflow questions about converting a VDF file to JSON, and they mention libraries that already exist for reading VDF, which can help you out.
VDF to JSON in C#
If you want a very quick and dirty way to read the file without needing any external library I would probably use REGEX and do something like this:
string pattern = "\"apps\"\\s+{\\s+(\"(\\d+)\"\\s+\"\\d+\"\\s+)+\\s+}";
string libraryPath = @"C:\Program Files (x86)\Steam\steamapps\libraryfolders.vdf";
string input = File.ReadAllText(libraryPath);
List<string> indexes = Regex.Matches(input, pattern, RegexOptions.Singleline)
.Cast<Match>().ToList()
.Select(m => m.Groups[2].Captures).ToList()
.SelectMany(c => c.Cast<Capture>())
.Select(c => c.Value).ToList();
foreach(string s in indexes)
{
Debug.WriteLine(s);
}
See the regular expression explanation here:
https://regex101.com/r/bQSt79/1
It basically captures all occurrences of "apps" { } in group 0, does a repeating capture of the pairs of numbers between the curly brackets in group 1, and also captures the leftmost number of each pair in group 2. In most regex engines a repeated capture group only keeps its last occurrence, but .NET keeps every capture, so we can still access all the values.
The rest of the code takes each match, the 2nd group of each match, the captures of each group, and the values of those captures, and puts them in a list of strings. Then a foreach will print the value of those strings to log.
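If the repeated-capture trick feels too dense, the same idea can be done in two simpler steps: first find each "apps" block, then pick the app ids out of its body. A quick sketch, assuming (as in the sample file) that an "apps" block contains no nested braces; the class and method names are made up.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SteamLibraryApps
{
    // An "apps" { ... } block whose body contains no nested braces.
    private static readonly Regex AppsBlock =
        new Regex("\"apps\"\\s*\\{(?<body>[^{}]*)\\}");

    // One  "appid"  "size"  pair; only the app id is captured.
    private static readonly Regex AppPair =
        new Regex("\"(?<id>\\d+)\"\\s+\"\\d+\"");

    public static List<string> GetAppIds(string vdfText)
    {
        var ids = new List<string>();
        foreach (Match block in AppsBlock.Matches(vdfText))
        {
            foreach (Match pair in AppPair.Matches(block.Groups["body"].Value))
            {
                ids.Add(pair.Groups["id"].Value);
            }
        }
        return ids;
    }
}
```

You would call it as SteamLibraryApps.GetAppIds(File.ReadAllText(libraryPath)). For anything beyond a quick script, a real KeyValues parser is still the safer choice.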

c# remove (null) from XML tags

I need to figure out a good way using C# to parse an XML file for (NULL) and remove it from the tags and replace it with the word BAD.
For example:
<GC5_(NULL) DIRTY="False"></GC5_(NULL)>
should be replaced with
<GC5_BAD DIRTY="False"></GC5_BAD>
Part of the problem is I have no control over the original XML, I just need to fix it once I receive it. The second problem is that the (NULL) can appear in zero, one, or many tags. It appears to be an issue with users filling in additional fields or not. So I might get
<GC5_(NULL) DIRTY="False"></GC5_(NULL)>
or
<MH_OTHSECTION_TXT_(NULL) DIRTY="False"></MH_OTHSECTION_TXT_(NULL)>
or
<LCDATA_(NULL) DIRTY="False"></LCDATA_(NULL)>
I am a newbie to C# and programming.
EDIT:
So I have come up with the following function that, while not pretty, works so far.
public static string CleanInvalidXmlChars(string fileText)
{
List<char> charsToSubstitute = new List<char>();
charsToSubstitute.Add((char)0x19);
charsToSubstitute.Add((char)0x1C);
charsToSubstitute.Add((char)0x1D);
foreach (char c in charsToSubstitute)
fileText = fileText.Replace(Convert.ToString(c), string.Empty);
StringBuilder b = new StringBuilder(fileText);
b.Replace("", string.Empty);
b.Replace("", string.Empty);
b.Replace("<(null)", "<BAD");
b.Replace("(null)>", "BAD>");
Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
String result = nullMatch.Replace(b.ToString(), "<$1_BAD$2>");
result = result.Replace("(NULL)", "BAD");
return result;
}
I have only been able to find 6 or 7 bad XML files to test this code on, but it has worked on each of them and not removed good data. I appreciate the feedback and your time.
In general, regular expressions are not the right way of handling XML files. There's a range of solutions to handle XML files correctly - you can read up on System.Xml.Linq for a good start. If you're a newbie, it's certainly something you should learn at some point. As Ed Plunkett pointed out in the comments, though, your XML is not actually XML: ( and ) characters are not allowed in XML element names.
Since you will have to do it as an operation on a string, Corak's comment to use
contentOfXml.Replace("(NULL)", "BAD");
may be a good idea, but will break if any elements can contain the string (NULL) as anything other than their name.
If you want a regex approach, this might work decently, but I'm not sure if it's not missing any edge cases:
var regex = new Regex(@"(<\/?[^_]*_)\(NULL\)([^>]*>)");
var result = regex.Replace(contentOfXml, "$1BAD$2");
Will it be suitable for you to read this XML as a string and perform a regex replacement? Like:
Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
String processedXmlString = nullMatch.Replace(originalXmlString, "<$1_BAD$2>");
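One edge case worth testing: a pattern that stops at the first underscore (the [^_]*_ part above) will not match names with several underscores, such as MH_OTHSECTION_TXT_(NULL). Here is a sketch that only looks at the tag name itself, so (NULL) inside text or attribute values is left alone. It assumes at most one (NULL) per tag name; the class name is made up, and XDocument.Parse is used only to confirm the result is now well-formed.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class NullTagFixer
{
    // '<' or '</', then the start of a tag name (no whitespace or angle brackets),
    // then the literal text "(NULL)".
    private static readonly Regex NullInTagName =
        new Regex(@"(</?[^\s<>]*?)\(NULL\)");

    // Replaces "(NULL)" with "BAD" in element names only.
    public static string Fix(string rawXml)
    {
        return NullInTagName.Replace(rawXml, "$1BAD");
    }

    // Throws XmlException if the repaired text is still not well-formed XML.
    public static XDocument FixAndParse(string rawXml)
    {
        return XDocument.Parse(Fix(rawXml));
    }
}
```

Because the name part cannot contain whitespace, the match can never run past the tag name into an attribute value.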

Delete character out of string

I am having some problems with a fairly easy task; I feel like I'm missing something very obvious here.
I have a .csv file which is semicolon-separated. In this file are several numbers that contain dots, like "1.300", but there are also dates, like "2015.12.01". The task is to find and delete all dots, but only those that are in numbers and not those in dates. The dates and numbers are completely variable and never at the same position in the file.
My question now: What is the 'best' way to handle this problem?
From a programmer's point of view: is it a good solution to just split at every semicolon, count the dots, and if there is only one dot, delete it? This is the only way to solve the problem I could think of so far.
Example source file:
2015.12.01;
13.100;
500;
1.200;
100;
Example result:
2015.12.01;
13100;
500;
1200;
100;
If you can rely on the fact that dates have two dots and numbers just one, you can use that as a filter:
string s = "123.45";
if (s.Count(x => x == '.') == 1)
{
s = s.Replace(".", null);
}
The source file looks like a valid file generated by a program running on a machine whose locale uses . as the thousand separator (most of Europe does) and date separator (German locales only I think). Such locales also use ; as the list separator.
If the question was only how to parse such dates, numbers, the answer would be to pass the proper culture to the parse function, eg: decimal.Parse("13.500",new CultureInfo("de-at")) would return 13500. The actual issue though is that the data must be fed to another program that uses . as the decimal separator.
The safest option would be to change the locale used by the exporting program (e.g. change the thread CultureInfo if the exporter is a .NET program, or the locale in an SSIS package) to a locale like en-gb, so that it exports with . as the decimal separator and avoids the weird date format. This assumes that the next program in the pipeline doesn't use German for dates and English for numbers.
Another option would be to load the text, parse the fields using the proper locale then export them in the format required by the next program.
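A rough sketch of that parse-and-re-export option follows. Rather than depend on a particular culture's data, it spells out the separators in a NumberFormatInfo, and it checks for a date first, because .NET's number parsing does not validate where group separators sit and could otherwise read 2015.12.01 as a number. The yyyy.MM.dd date format and the invariant output format are assumptions based on the sample data; the class name is made up.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class DottedCsvNormalizer
{
    // Source number format: '.' groups thousands, ',' is the decimal separator.
    private static readonly NumberFormatInfo SourceNumbers = new NumberFormatInfo
    {
        NumberGroupSeparator = ".",
        NumberDecimalSeparator = ","
    };

    public static string NormalizeField(string field)
    {
        // Dates are checked first and passed through unchanged.
        if (DateTime.TryParseExact(field, "yyyy.MM.dd", CultureInfo.InvariantCulture,
                                   DateTimeStyles.None, out _))
        {
            return field;
        }

        const NumberStyles styles = NumberStyles.AllowLeadingSign
                                  | NumberStyles.AllowThousands
                                  | NumberStyles.AllowDecimalPoint;
        if (decimal.TryParse(field, styles, SourceNumbers, out decimal number))
        {
            // Re-export without group separators and with '.' as the decimal point.
            return number.ToString(CultureInfo.InvariantCulture);
        }

        return field; // not a date or a number: leave as-is
    }

    public static string NormalizeLine(string line)
    {
        return string.Join(";", line.Split(';').Select(NormalizeField));
    }
}
```

With this, NormalizeLine("2015.12.01;13.100;500;1.200;100;") returns "2015.12.01;13100;500;1200;100;".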
Finally, a regular expression could be used to match only the numeric fields and remove the dot. This can be a bit tricky and depends on the actual contents.
For example (\d+)\.(\d{3}) can be used to match numbers if there is only one thousand separator. This can fail if some text field contains similar values. Or ;(\d+)\.(\d{3}); could match only a full field, except the first and last fields, eg:
Regex.Replace("1.457;2016.12.30;13.000;1,50;2015.12.04;13.456",@";(\d+)\.(\d{3});",@"$1$2;")
produces :
1.457;2016.12.3013000;1,50;2015.12.04;13.456
A regular expression that would match either numbers between ; or the first/last field could be
(^|;)(\d+)\.(\d{3})(;|$)
This would produce 1457;2016.12.30;13000;1,50;2015.12.04;13456, eg:
var data="1.457;2016.12.30;13.000;1,50;2015.12.04;13.456";
var pattern=@"(^|;)(\d+)\.(\d{3})(;|$)";
var replacement=@"$1$2$3$4";
var result= Regex.Replace(data,pattern,replacement);
The advantage of a regex over splitting and replacing strings is that it's a lot faster and more memory efficient. Instead of generating temporary strings for each split, manipulation, a Regex only calculates indexes in the source. A string object is generated only when you request the final text result. This results in far fewer allocations and garbage collections.
Even in medium-sized files this can result in up to 10x better performance, though the gain depends on the data and is worth measuring.
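One thing to watch with the (^|;)...(;|$) pattern: each match consumes the semicolons around it, so in "1.000;2.000" the second field is skipped because its leading ; already belongs to the first match. Lookarounds avoid that. A sketch, under the assumption that a number is 1 to 3 digits followed by one or more .ddd groups (which also rules out dates, since they start with a four-digit year); the class name is made up.

```csharp
using System.Text.RegularExpressions;

public static class ThousandsDotRemover
{
    // Lookbehind/lookahead check the field boundaries without consuming them.
    // Multiline lets ^ and $ also match at line breaks inside the text.
    private static readonly Regex GroupedNumber = new Regex(
        @"(?<=^|;)\d{1,3}(?:\.\d{3})+(?=;|$)", RegexOptions.Multiline);

    public static string RemoveDots(string text)
    {
        // The evaluator strips every dot inside each matched number.
        return GroupedNumber.Replace(text, m => m.Value.Replace(".", string.Empty));
    }
}
```

For example, RemoveDots("1.000;2.000;1.234.567") gives "1000;2000;1234567".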
I wouldn't rely on the number of dots as mistakes can be made.
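You can use double.TryParse to safely test whether the string is a number. Pass an explicit culture and NumberStyles.Float so that the dot is only ever read as a decimal point; under a culture that treats the dot as a thousands separator, a date such as 2015.12.01 may be accepted as a number too.
// requires: using System.Globalization;
var data = "2015.12.01;13.100;500;1.200;100;";
var dataArray = data.Split(';');
foreach (var s in dataArray)
{
double result;
if (double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out result))
{
// implement your logic here
Console.WriteLine(s.Replace(".", string.Empty));
}
}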

Returning the regular expression match as part of a split (or equivalent functionality)

I am trying to parse through some log files and put them into a database for analysis. A single line looks something like this:
2012-09-30 17:16:27,213 [39] (boxes) ERROR Assembly.Places [(null)] - Error while displaying a thing
I have made a regular expression that works well for pulling out the date in front and breaking up the lines that way, but I lose the date itself. This is a pretty important bit of data, and I don't want to lose it!
I cannot just do this by \r\n, because some logs are fatal errors that include stack traces for the developers. Those, obviously, use \r\n to make them readable.
My current code looks like this for reference:
var logpath = Directory.GetFiles(@"C:\a\directory", "*.log");
foreach (var log in logpath)
{
var fileStream = new StreamReader(log);
var fileString = fileStream.ReadToEnd();
var records = Regex.Split(fileString, "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}");
...
}
Split() will always remove the matched delimiter. The trick is not to match any actual text, but rather a position in the string.
This is done through zero-width look-ahead:
var datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
var datePositions = new Regex(datePattern, RegexOptions.Multiline);
// ...
datePositions.Split(fileString);
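Put together, the look-ahead split might look like the sketch below. The class name is made up. Splitting at a position at the very start of the text produces an empty first entry, so empty entries are filtered out.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LogSplitter
{
    // Matches the empty position at the start of any line that begins with a timestamp.
    private static readonly Regex RecordStart = new Regex(
        @"^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})",
        RegexOptions.Multiline);

    public static string[] SplitRecords(string logText)
    {
        return RecordStart.Split(logText)
            .Where(record => record.Length > 0) // drop the empty entry before the first record
            .ToArray();
    }
}
```

Each returned record still starts with its own date, and stack trace lines stay attached to the record they belong to because they do not begin with a timestamp.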
You should match instead of splitting.
This is the regex. Use Singleline mode:
([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})(.*?)((?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}|$))
Group 1 contains the date
Group 2 contains the required data
NOTE
The regex is conceptually like this.
(yourDate)(.*?yourdata)(?=till the other date|$)
Don't forget to use Singleline mode.
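In C# the matching approach could look roughly like this. It is a sketch: the class name, the named groups, and the timestamp format string "yyyy-MM-dd HH:mm:ss,fff" are my own choices, derived from the sample log line. Note that, unlike a pattern anchored with ^, this one also ends a record at a timestamp that appears in the middle of a line.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LogMatcher
{
    private const string Stamp =
        @"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}";

    // Singleline makes '.' match newlines, so the data group can span stack traces.
    private static readonly Regex Record = new Regex(
        "(?<date>" + Stamp + ")(?<data>.*?)(?=" + Stamp + "|$)",
        RegexOptions.Singleline);

    public static List<(DateTime Timestamp, string Message)> Parse(string logText)
    {
        var records = new List<(DateTime Timestamp, string Message)>();
        foreach (Match m in Record.Matches(logText))
        {
            var timestamp = DateTime.ParseExact(
                m.Groups["date"].Value, "yyyy-MM-dd HH:mm:ss,fff", CultureInfo.InvariantCulture);
            // Trim removes the space after the timestamp and the trailing line break.
            records.Add((timestamp, m.Groups["data"].Value.Trim()));
        }
        return records;
    }
}
```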
Well, I'm not an expert on the subject, but I did find this: Regex.Match.
From what I see, you can get the first match of the date format as a Match object,
which has all kinds of useful properties that, put together, should let you cut out the parts you want.
P.S. There is also Regex.Matches, which returns all matches in the file and might be easier to use.
Sorry, I don't have time to find a complete code example.
Good day.

Using RegEx to read through a CSV file

I have a CSV file, with the following type of data:
0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,
and I desire the following output
0
VT,C
0
0
C,VT
0
0
VT,H
0
In other words, I want to split the string on commas while ignoring any comma within quote marks. At the moment I'm using the following RegEx:
("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"
however this gives me the result of:
0
VT
C
0
0
C
VT
0
0
VT
H
0
This shows the RegEx is not reading the quote marks properly. Can anyone suggest some alterations that might help?
Usually when it comes to CSV parsing, people use specific libraries well suited for the programming language they are using to code their application.
Anyway if you are going to use a regular expression to make a really loose(!) parsing you may try using something like this:
'(?<value>[^']*?)'
It will match anything in between single quotes, and assuming the csv file is well formed, it will not miss a field. Of course it doesn't accept embedded quotes but it easily gets the job done. That's what I use when I need to get the job done really quickly. Please don't consider it a complete solution to your problem...it just works in special conditions when the requirements are what you described and the input is well formed.
[EDIT]
I was checking your question again and noticed you also want to include non-quoted fields. In that case my expression will not work at all. If you think hard about your problem, you'll find it is quite difficult to solve without ambiguity: you need fixed rules, and if you allow both quoted and unquoted fields, the parser will have a hard time telling separator commas from commas inside quoted values.
Another expression to model such a solution may be:
('[^']+'|[^,]+),?
It will match both quoted and unquoted fields, though I'm not sure whether it needs the CSV to adhere to strict conditions. As far as I can tell it will be much safer than a split strategy; you just need to collect all matches and append each matched value + \r\n to your target string.
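Collecting the matches could be sketched as follows, using a slight variant of the expression with named groups so the surrounding quotes are dropped. It assumes single-quoted fields with no embedded quotes, and it silently skips empty unquoted fields; the class and group names are made up.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedCsv
{
    // Either a single-quoted field (quotes excluded from the group) or a run of non-comma characters.
    private static readonly Regex Field =
        new Regex(@"'(?<quoted>[^']*)'|(?<plain>[^,]+)");

    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            fields.Add(m.Groups["quoted"].Success
                ? m.Groups["quoted"].Value
                : m.Groups["plain"].Value);
        }
        return fields;
    }
}
```

For the first sample line, SplitLine("0,'VT,C',0,") yields the three fields 0, VT,C and 0.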
This regex is based on the fact that you have 1 digit before and after your 'value':
Regex.Replace(input, @"(?:(?<=\d),|,(?=\d))", "\n");
You can test it out on RegexStorm
foreach(var m in Regex.Matches(s,"(('.*?')|[0-9])"))
I have managed to get the following method to read the file as required:
public List<string> SplitCSV(string input, List<string> line)
{
Regex csvSplit = new Regex("(([^,^\'])*(\'.*\')*([^,^\'])*)(,|$)", RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
line.Add(match.Value.TrimStart(','));
}
return line;
}
Thanks for everyone's help, though.
