Read fixed width record from text file - c#

I've got a text file full of records where each field in each record is a fixed width. My first approach would be to parse each record simply using string.Substring(). Is there a better way?
For example, the format could be described as:
<Field1(8)><Field2(16)><Field3(12)>
And an example file with two records could look like:
SomeData0000000000123456SomeMoreData
Data2   0000000000555555MoreData    
I just want to make sure I'm not overlooking a more elegant way than Substring().
Update: I ultimately went with a regex like Killersponge suggested:
private readonly Regex reLot = new Regex(REGEX_LOT, RegexOptions.Compiled);
const string REGEX_LOT = "^(?<Field1>.{8})" +
"(?<Field2>.{16})" +
"(?<Field3>.{12})";
I then use the following to access the fields:
Match match = reLot.Match(record);
string field1 = match.Groups["Field1"].Value;

Use FileHelpers.
Example:
[FixedLengthRecord()]
public class MyData
{
[FieldFixedLength(8)]
public string someData;
[FieldFixedLength(16)]
public int SomeNumber;
[FieldFixedLength(12)]
[FieldTrim(TrimMode.Right)]
public string someMoreData;
}
Then, it's as simple as this:
var engine = new FileHelperEngine<MyData>();
// To Read Use:
var res = engine.ReadFile("FileIn.txt");
// To Write Use:
engine.WriteFile("FileOut.txt", res);

Substring sounds good to me. The only downside I can immediately think of is that it means copying the data each time, but I wouldn't worry about that until you prove it's a bottleneck. Substring is simple :)
You could use a regex to match a whole record at a time and capture the fields, but I think that would be overkill.
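For reference, a minimal sketch of the Substring approach, assuming the 8/16/12 layout from the question. The `FixedWidth` class and its `Split` method are names made up for this illustration, and the sketch assumes every record is fully padded.

```csharp
using System;

public static class FixedWidth
{
    // Splits one record into fields of the given widths using Substring.
    // Assumes the record is at least as long as the sum of the widths.
    public static string[] Split(string record, params int[] widths)
    {
        var fields = new string[widths.Length];
        int offset = 0;
        for (int i = 0; i < widths.Length; i++)
        {
            fields[i] = record.Substring(offset, widths[i]);
            offset += widths[i];
        }
        return fields;
    }
}
```

Calling `FixedWidth.Split("SomeData0000000000123456SomeMoreData", 8, 16, 12)` yields the three fields `SomeData`, `0000000000123456` and `SomeMoreData`.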

Why reinvent the wheel? Use .NET's TextFieldParser class per this how-to for Visual Basic: How to read from fixed-width text files.
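Although the how-to is written for Visual Basic, the class can be called from C# as well. A rough sketch, assuming the 8/16/12 layout from the question; `FixedWidthReader` and `ReadAll` are hypothetical names, and on .NET Framework the project needs a reference to the Microsoft.VisualBasic assembly.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class FixedWidthReader
{
    // Reads every record from the reader as an array of fixed-width fields.
    public static List<string[]> ReadAll(TextReader reader, params int[] widths)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.FixedWidth;
            parser.SetFieldWidths(widths);
            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

For a file, pass `new StreamReader("FileIn.txt")` and the widths `8, 16, 12`. By default the parser trims whitespace from each field, so a padded `Data2   ` comes back as `Data2`.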

You may have to watch out: if the ends of the lines aren't padded with spaces to fill the last field, Substring will throw unless you do a bit of fiddling to work out how much of the line is left to read. This of course only applies to the last field :)
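One way to do that fiddling is to clamp the requested width to whatever is left of the line. This is only a sketch; `SafeSlice` and `Field` are hypothetical names.

```csharp
using System;

public static class SafeSlice
{
    // Returns up to `width` characters starting at `start`,
    // or whatever is left of the line if it ends early.
    public static string Field(string line, int start, int width)
    {
        if (start >= line.Length) return string.Empty;
        int available = Math.Min(width, line.Length - start);
        return line.Substring(start, available);
    }
}
```

With the unpadded record `Data2   0000000000555555MoreData`, `SafeSlice.Field(line, 24, 12)` returns `MoreData` instead of throwing an ArgumentOutOfRangeException.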

Unfortunately out of the box the CLR only provides Substring for this.
Someone over at CodeProject made a custom parser using attributes to define fields, you might wanna look at that.

Nope, Substring is fine. That's what it's for.

You could set up an ODBC data source for the fixed format file, and then access it as any other database table.
This has the added advantage that specific knowledge of the file format is not compiled into your code for that fateful day that someone decides to stick an extra field in the middle.

Related

character to use when splitting strings in visual c#?

Ok, I'm racking my brains over this one. It's pretty simple though (I think).
I'm currently creating a text file as a comma separated string of values.
Later, I read in that file data and then use the .split function to split the data by commas.
I discovered that sometimes one of the description fields in the data contains an embedded comma, which ends up throwing the split command off.
Is there any special character I could use that could pretty much guarantee wouldn't be in the data, or is there a better way to accomplish this? Thanks!
// Initial Load
fullString = fileName + "," + String.Join(",", fieldValues);
// Access later
String[] valuesArray = myString.Split(',');
Short answer: there's no "simple" way to do it using Split. The best you can hope for is to set the delimiter to something kooky that would never get used (but even that's not a guarantee).
The simple method would be to use something like CsvHelper (get it through NuGet) or any of the other dozen or so packages that are designed for parsing CSV.
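If adding a package is not an option, the usual CSV convention is to wrap any field that contains a comma in double quotes when writing, and to read it back with a parser that understands quotes. The sketch below assumes that convention and uses .NET's TextFieldParser for reading. `CsvSketch`, `Escape`, `ToLine` and `ParseLine` are made-up names, and this is not how CsvHelper itself works.

```csharp
using System.IO;
using System.Linq;
using Microsoft.VisualBasic.FileIO;

public static class CsvSketch
{
    // Quotes a field if it contains a comma, quote, or line break,
    // doubling any embedded quotes (the usual CSV convention).
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins the fields into one CSV line, escaping each as needed.
    public static string ToLine(params string[] fields)
    {
        return string.Join(",", fields.Select(f => Escape(f)));
    }

    // Splits one CSV line back into fields, honoring quoted commas.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }
}
```

Here `CsvSketch.ToLine("file.txt", "Widget, large", "42")` produces `file.txt,"Widget, large",42`, and `ParseLine` turns that line back into the original three values.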

Getting FileHelpers 2.0 to handle CSV files with excess commas

I am currently using the FileHelpers library (v2.0.0.0) to parse a CSV file. The CSV file is mapped to a class that has a handful of public properties, let's say there are N. The problem is that, by default, FileHelpers doesn't seem to correctly handle cases where the user specifies a CSV file that has more than N-1 commas. The remaining commas just get appended to the last property value.
I figured this must be configurable via FileHelpers' attributes, but I didn't see anything that would ignore fields that don't have a matching property in the record.
I looked into the RecordConditions, but using something like ExcludeIfEnds(",") looks like it will skip the line entirely if it ends with a comma, but I just want them stripped.
It's possible that my only recourse is to pre-process the file and strip any trailing commas, which is totally fine, but I wanted to know if FileHelpers can do this as well, and perhaps I'm just not seeing it in the docs.
Just an idea for a hack / workaround: you could create a property called "ExtraCommas" and add it to your class, so that extra commas are serialized there and not in the real properties of your object...
If the number of commas varies, I think you are out of luck and would have to do post-processing. However, if there is a fixed number of extra commas, you can add blank fields to your class to absorb them.
[FieldOrder(5)]
public string Blank1;
[FieldOrder(6)]
public string Blank2;
This doesn't really ever bite me because I don't use a FileHelpers class as a business class, I use it as an object to build the business class from. I store it for auditing. I think at one point I played around with making the fields for the Blanks private, not sure how that turned out.
Here is a custom method that you can use. It might not be the best solution, but it will solve the trailing-comma problem. The code could certainly be optimized further; this is just to give you an idea of how to get around this kind of problem.
// Requires: using System; using System.IO;
static void Main()
{
    // Example path from the answer; substitute your own file.
    string path = @"C:\Users\musab.shaheed\Desktop\csv.csv";
    foreach (string line in File.ReadLines(path))
    {
        // TrimEnd removes trailing commas only; other lines pass through unchanged.
        string fileText = line.TrimEnd(',');
        // store your data in here
        Console.WriteLine(fileText);
    }
}

c# convert string that has ctrl+z to regular string

I have a string like this:
some_string = "A simple demo of SMS text messaging." + Convert.ToChar(26);
What is the SIMPLEST way of getting rid of the char 26?
Please keep in mind that sometimes some_string has char 26 and sometimes it does not, and it can be in different positions too, so I need to know the most versatile and easiest way to get rid of char 26.
If it can be in different positions (not just the end):
someString = someString.Replace("\u001A", "");
Note that you have to use the return value of Replace - strings are immutable, so any methods which look like they're changing the contents actually return a new string with the appropriate changes.
If it's only at the end:
some_string = some_string.TrimEnd((char)26);
If it can be anywhere then forget this and use Jon Skeet's answer.

Conditional Regex Replace in C# without MatchEvaluator

So, I'm trying to make a program to rename some files. For the most part, I want them to look like this:
[Testing]StupidName - 2[720p].mkv
But I would like to be able to change the format if so desired. If I use MatchEvaluators, I would have to recompile every time the format changes. That's why I don't want to use a MatchEvaluator.
The problem I have is that I don't know how, or whether it's even possible, to tell Replace to include a string only if a group was found. The only syntax for this I have ever seen was something like (?<group>:data), but I can't get this to work. If anyone has an idea, I'm all for it.
EDIT:
Current Capture Regexes =
^(\[(?<FanSub>[^\]\)\}]+)\])?[. _]*(?<SeriesTitle>[\w. ]*?)[. _]*\-[. _]*(?<EpisodeNumber>\d+)[. _]*(\-[. _]*(?<EpisodeName>[\w. ]*?)[. _]*)?([\[\(\{](?<MiscInfo>[^\]\)\}]*)[\]\)\}][. _]*)*[\w. ]*(?<Extension>\.[a-zA-Z]+)$
^(?<SeriesTitle>[\w. ]*?)[. _]*[Ss](?<SeasonNumber>\d+)[Ee](?<EpisodeNumber>\d+).*?(?<Extension>\.[a-zA-Z]+)$
^(?<SeriesTitle>[\w. ]*?)[. _]*(?<SeasonNumber>\d)(?<EpisodeNumber>\d{2}).*?(?<Extension>\.[a-zA-Z]+)$
Current Replace Regex = [${FanSub}]${SeriesTitle} - ${EpisodeNumber} [${MiscInfo}]${Extension}
Using Regex.Replace, the file TestFile 101.mkv, I get []TestFile - 1[].mkv. What I want to do is make it so that [] is only included if the group FanSub or MiscInfo was found.
I can solve this with a MatchEvaluator because I actually get to compile a function. But this would not be an easy solution for users of the program. The only other idea I have is to write my own Regex.Replace function that accepts special syntax.
It sounds like you want to be able to specify an arbitrary format dynamically rather than hard-code it into your code.
Perhaps one solution is to break your filename parts into specific groups then pass in a replacement pattern that takes advantage of those group names. This would give you the ability to pass in different replacement patterns which return the desired filename structure using the Regex.Replace method.
Since you didn't explain the categories of your filename I came up with some random groups to demonstrate. Here's a quick example:
string input = "Testing StupidName Number2 720p.mkv";
string pattern = @"^(?<Category>\w+)\s+(?<Name>.+?)\s+Number(?<Number>\d+)\s+(?<Resolution>\d+p)(?<Extension>\.mkv)$";
string[] replacePatterns =
{
"[${Category}]${Name} - ${Number}[${Resolution}]${Extension}",
"${Category} - ${Name} - ${Number} - ${Resolution}${Extension}",
"(${Number}) - [${Resolution}] ${Name} [${Category}]${Extension}"
};
foreach (string replacePattern in replacePatterns)
{
Console.WriteLine(Regex.Replace(input, pattern, replacePattern));
}
As shown in the sample, named groups in the pattern, specified as (?<Name>pattern), are referred to in the replacement pattern by ${Name}.
With this approach you would need to know the group names beforehand and pass these in to rearrange the pattern as needed.
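.NET replacement patterns have no conditional construct, and an optional group that did not take part in the match expands to an empty string, which is where the stray [] comes from. One possible workaround, which is an assumption of this sketch rather than part of the answer above, is a second pass that removes bracket pairs left empty. The capture pattern here is a simplified, hypothetical stand-in for the ones in the question, and `Renamer` is a made-up name.

```csharp
using System.Text.RegularExpressions;

public static class Renamer
{
    // Simplified, hypothetical capture pattern:
    // optional [FanSub], then title, episode number, and extension.
    const string Pattern =
        @"^(\[(?<FanSub>[^\]]+)\])?(?<SeriesTitle>\w+) (?<EpisodeNumber>\d+)(?<Extension>\.\w+)$";

    public static string Rename(string fileName, string replacement)
    {
        string result = Regex.Replace(fileName, Pattern, replacement);
        // Second pass: drop bracket pairs left empty by unmatched groups.
        return Regex.Replace(result, @"\[\]|\(\)|\{\}", "").Trim();
    }
}
```

With the user-editable format `[${FanSub}]${SeriesTitle} - ${EpisodeNumber}${Extension}`, the input `TestFile 101.mkv` becomes `TestFile - 101.mkv`, while `[Testing]TestFile 101.mkv` keeps its `[Testing]` prefix. The trade-off is that a literal empty bracket pair in a file name would be removed too.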

Advice for extracting word text and handling cellbreak characters

Looking for advice (perhaps best practice).
We have an MS Word document (Office 2007) from which we are extracting the text of a table cell.
We can use the following:
string text = wordTable.cell(tablerow.index, 1).Range.Text;
The text is extracted; however we seem to get extra characters trailing, for example \r\a.
Now we could add the following:
.... wordTable.cell(tablerow.index, 1).Range.Text.Replace("\r\a", "");
But this seems a little too lazy, and pretty much a waste of time that would most likely lead to problems down the road.
We could also have a method that receives the string to clean:
private string cleanTextWordCellBreak(string wordTextToClean)
{
// Clean the text here
return cleanstring;
}
then we could use it:
cleanTextWordCellBreak(wordTable.cell(tablerow.index, 1).Range.Text);
This seems closer to a better way of handling the issue. What would you do?
I would break it out into a separate method but use the Replace implementation, since it's the simplest solution. You can always change the implementation later if you run into a problem (for example, if the text contains more than one \r\a and the earlier ones need to be preserved).
So:
private string stripCellText(string text)
{
return text.Replace("\r\a", "");
}
string text = stripCellText(wordTable.cell(tablerow.index, 1).Range.Text);
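If earlier occurrences of \r\a ever do need to be preserved, the method body could be swapped for something that strips only the trailing marker. A small sketch of that variant; `WordCellText` and `StripEndOfCellMarker` are hypothetical names, and it assumes the marker is exactly "\r\a" at the end of the string.

```csharp
using System;

public static class WordCellText
{
    // Removes only the end-of-cell marker ("\r\a") at the very end of the text,
    // leaving any earlier occurrences untouched.
    public static string StripEndOfCellMarker(string text)
    {
        const string marker = "\r\a";
        return text.EndsWith(marker, StringComparison.Ordinal)
            ? text.Substring(0, text.Length - marker.Length)
            : text;
    }
}
```

Because callers go through one method either way, switching between this and the Replace version does not touch the calling code.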
I would definitely opt for breaking it out into a separate method, personally. It helps with code readability and makes it a lot easier to change if needed in the future.
Another way of getting it would be to get the number of Characters in the range and extract the text up to that length:
Dim cellRange As Range
Dim text As String
Dim length As Integer
' Object references must be assigned with Set in VBA
Set cellRange = ActiveDocument.Tables(1).Cell(1, 1).Range
text = cellRange.Text
length = cellRange.Characters.Count
Debug.Print Mid(text, 1, length - 1)
