C# - Read, Edit & Save Fixed-Length File

I need to read a fixed-length file, edit some data inside it, and then save that file to some location. This little app, which should do all this, should run every 2 hours.
This is the example of the file:
14000 US A111 78900
14000 US A222 78900
14000 US A222 78900
I need to look for data like A111 and A222, and replace all A111 with, for example, A555. I have tried using TextFieldParser but without any luck... This is my code. I am able to get the elements of the array, but I am not sure what to do next...
using (TextFieldParser parser = FileSystem.OpenTextFieldParser(sourceFile))
{
    parser.TextFieldType = FieldType.FixedWidth;
    parser.FieldWidths = new int[] { 6, 3, 5, 5 };
    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields();
            foreach (var f in fields)
            {
                Console.WriteLine(f);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
This is the solution by Berkouz, but I am still having issues: the items of the array are not replaced in the output when saved to a file. The code:
string[] rows = File.ReadAllLines(sourceFile);
foreach (var row in rows)
{
    string[] elements = row.Split(' ');
    for (int i = 0; i < elements.Length; i++)
    {
        if (elements.GetValue(i).ToString() == "A111") {
            elements.SetValue("A555", i);
        }
    }
    // note: the modified elements array is never written back into rows,
    // which is why the replacement never shows up in the output file
}
var destFile = targetPath.FullName + "\\" + "output.txt";
File.WriteAllLines(destFile, rows);

Note the line where rows[rowIndex] is assigned to. That's because of string immutability, which forces Replace and similar functions to have an output value (as opposed to modifying their input) that you have to assign back to your data storage (whatever it may be, an array in this case).
var rows = File.ReadAllLines(sourceFile);
for (int rowIndex = 0; rowIndex != rows.Length; rowIndex++)
    rows[rowIndex] = rows[rowIndex].Replace("A111", "A555");
File.WriteAllLines(destFile, rows);

This looks like an XY problem. If this is a one-time thing, I suggest you use sed instead.
Invocation is simple: sed -e 's/A111/A555/g'
In case your file contents are more complex, you can use awk or Perl's PCRE regex features.
If this is in fact not a one-time thing and you want it written in C#, you can:
A) use System.IO.File.ReadAllLines(), split the text using string.Split(), replace the item you want using string.Replace() and write it back using WriteAllLines()
B) use a MemoryMappedFile. This way, you don't have to worry about writing anything. But it tends to get a little bit pointery and you should be careful with BOMs.
There are a LOT of other ways, these are the two ends of the spectrum for easy/slow/clean and fast/efficient/ugly code.
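To make option B concrete, here is a minimal sketch of patching the file in place through a memory-mapped view. It assumes ASCII content and a replacement exactly as long as the target (true for A111 to A555), so the file size never changes; the file name is a placeholder.
// A minimal sketch of option B, under the assumptions above.
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class InPlacePatch
{
    static void Main()
    {
        byte[] target = Encoding.ASCII.GetBytes("A111");
        byte[] replacement = Encoding.ASCII.GetBytes("A555");
        long length = new FileInfo("source.txt").Length; // hypothetical file name

        using (var mmf = MemoryMappedFile.CreateFromFile("source.txt"))
        using (var accessor = mmf.CreateViewAccessor(0, length))
        {
            for (long pos = 0; pos <= length - target.Length; pos++)
            {
                // compare the bytes at this offset against the target token
                bool match = true;
                for (int i = 0; i < target.Length; i++)
                {
                    if (accessor.ReadByte(pos + i) != target[i]) { match = false; break; }
                }
                // overwrite in place; same length, so no shifting is needed
                if (match)
                    accessor.WriteArray(pos, replacement, 0, replacement.Length);
            }
        }
    }
}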

Related

Is it possible to be able to read this seemingly disorganized CSV file in C#?

I'm given a large CSV file with very odd formatting and field names and that sort of thing. Say for example we have these two records:
Text18;Text30;Text5;Text6;Text7;Text27;Text14;Text9;Text11;Text19;Text12;Text13;Text24;Text32;Text4;Text34
Supervisor:;Tom Stringer;;;;;;;;;;;;;;
Ethan Whitehouse;;;;;;;;;;;;;;;
;;Date In;;Time In;Date Out;;Time Out;Break Time;;;Total Hrs.;;WageRate;;DLC
Monday;;10/31/2016;8:42 AM;;10/31/2016;;5:41 PM;0.00;Hrs.;8.98;;Hrs.;;33.40;$300.04
;;;;;;Total:;;;;;;;;;
;;;;;;;;0.00;Hrs.;8.98;;Hrs.;;33.40;$300.04
Mark Smalley;;;;;;;;;;;;;;;
;;Date In;;Time In;Date Out;;Time Out;Break Time;;;Total Hrs.;;WageRate;;DLC
Monday;;10/31/2016;8:48 AM;;10/31/2016;;4:10 PM;0.00;Hrs.;7.37;;Hrs.;;29.00;$213.63
;;;;;;Total:;;;;;;;;;
;;;;;;;;0.00;Hrs.;7.37;;Hrs.;;29.00;$213.63
I need to be able to find (for this example) Mark Smalley, and his total DLC. So basically I need Mark Smalley = $213.63. I need to be able to add these dollar amounts to an array. Is there a good way to do this? I have very little control over how the data is formatted/delimited.
You can parse this CSV file into a jagged array:
var myCsv = File.ReadAllText(@"C:\MyStuff\CSV.txt");
var myCsvArr = myCsv.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
var myData = new string[myCsvArr.Length][];
for (int i = 0; i < myCsvArr.Length; i++)
{
    myData[i] = myCsvArr[i].Split(';');
}
Then I guess accessing the elements will be easy.
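For the concrete question of pulling out Mark Smalley's total DLC, here is a hedged sketch over that jagged array. It assumes the layout in the sample: a row whose first cell is the employee name, a later row containing the "Total:" marker, and the row after that carrying the DLC amount in its last cell. FindTotalDlc is a hypothetical helper name.
string FindTotalDlc(string[][] data, string employee)
{
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i][0] != employee) continue;
        // scan forward for the "Total:" marker row
        for (int j = i + 1; j < data.Length; j++)
        {
            if (Array.IndexOf(data[j], "Total:") >= 0)
                return data[j + 1][data[j + 1].Length - 1]; // e.g. "$213.63"
        }
    }
    return null; // employee not found
}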

Merging CSV lines in huge file

I have a CSV that looks like this
783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
although there are 5 billion records. If you notice the first column and part of the 2nd column (the day), three of the records are all 'grouped' together and are just a breakdown of 15 minute intervals for the first 30 minutes of that day.
I want the output to look like
783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
Where the first 4 columns of the repeating rows are omitted and the rest of the columns are combined with the first record of its kind. Basically I am converting the file so that instead of each line being 15 minutes, each line is 1 day.
Since I will be processing 5 billion records, I think the best thing is to use regular expressions (and EmEditor) or some tool that is made for this (multithreading, optimized), rather than a custom programmed solution. Although I am open to ideas in Node.js or C# that are relatively simple and super quick.
How can this be done?
If there's always a set number of records and they're in order, it'd be fairly easy to just read a few lines at a time, parse them, and output them. Trying to do regex on billions of records would take forever. Using StreamReader and StreamWriter should make it possible to read and write these large files, since they read and write one line at a time.
using (StreamReader sr = new StreamReader("inputFile.txt"))
using (StreamWriter sw = new StreamWriter("outputFile.txt"))
{
    string line1;
    int counter = 0;
    var lineCountToGroup = 3; //change to 96
    while ((line1 = sr.ReadLine()) != null)
    {
        var lines = new List<string>();
        lines.Add(line1);
        for (int i = 0; i < lineCountToGroup - 1; i++) //less 1 because we already added line1
            lines.Add(sr.ReadLine());
        var groupedLine = lines.SomeLinqIfNecessary(); //whatever your grouping logic is
        sw.WriteLine(groupedLine);
    }
}
Disclaimer- untested code with no error handling and assuming that there are indeed the correct number of lines repeated, etc. You'd obviously need to do some tweaks for your exact scenario.
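As one hedged possibility for the SomeLinqIfNecessary() placeholder above, matching the desired output (keep the first line whole, append columns 5 onward of each following line); GroupLines is a hypothetical helper name:
// requires: using System.Collections.Generic; using System.Linq; using System.Text;
static string GroupLines(List<string> lines)
{
    var sb = new StringBuilder(lines[0]); // first line stays complete
    foreach (var extra in lines.Skip(1))
        sb.Append(',').Append(string.Join(",", extra.Split(',').Skip(4))); // drop the first 4 columns
    return sb.ToString();
}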
You could do something like this (untested code without any error handling - but should give you the general gist of it):
// requires: using System.IO; using System.Linq; using System.Text;
using (var sin = new StreamReader("yourfile.csv"))
using (var sout = new StreamWriter("outfile.csv"))
{
    var line = sin.ReadLine(); // note: should add error handling for empty files
    var cells = line.Split(','); // note: you should probably check the length too!
    var key = cells[0]; // use this to match other rows
    StringBuilder output = new StringBuilder(line); // this is the output line we build
    while ((line = sin.ReadLine()) != null) // if we have more lines
    {
        cells = line.Split(','); // split so we can get the first column
        if (cells[0] == key) // if the first column matches the current key
        {
            output.Append(',').Append(String.Join(",", cells.Skip(4))); // add this row to our output line
        }
        else // once the key changes
        {
            sout.WriteLine(output.ToString()); // write out the line we've built up
            output.Clear();
            output.Append(line); // update the new line to build
            key = cells[0]; // and update the key
        }
    }
    // once all lines have been processed
    sout.WriteLine(output.ToString()); // we'll have just the last line to write out
}
The idea is to loop through each line in turn and keep track of the current value of the first column. When that value changes, you write out the output line you've been building up and update the key. This way you don't have to worry about exactly how many matches you have or if you might be missing a few points.
One note: it might be more efficient to use a StringBuilder for output rather than a String if you are going to concatenate 96 rows.
Define ProcessOutputLine to store merged lines.
Call ProcessInputLine after each ReadLine and at end of file.
string curKey = "";
int keyLength = ...; // set to the total length of the first 4 columns
string outputLine = "";

private void ProcessInputLine(string line)
{
    string newKey = line.Substring(0, keyLength);
    if (newKey == curKey)
    {
        outputLine += line.Substring(keyLength);
    }
    else
    {
        if (outputLine != "") ProcessOutputLine(outputLine);
        curKey = newKey;
        outputLine = line;
    }
}
EDIT: this solution is very similar to that of Matt Burland; the only noticeable difference is that I don't use the Split function.
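For completeness, a minimal driver sketch for the method above; it assumes ProcessOutputLine is defined elsewhere, and the final call flushes the last merged line at end of file:
using (var reader = new StreamReader("inputFile.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        ProcessInputLine(line);
    if (outputLine != "") ProcessOutputLine(outputLine); // end-of-file flush
}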

Read last 30,000 lines of a file [duplicate]

This question already has answers here:
How to read last "n" lines of log file [duplicate]
(9 answers)
Closed 9 years ago.
I have a csv file whose data will increase from time to time. Now what I need to do is read the last 30,000 lines.
Code:
string[] lines = File.ReadAllLines(Filename).Where(r => r.ToString() != "").ToArray();
int count = lines.Count();
int loopCount = count > 30000 ? count - 30000 : 0;
for (int i = loopCount; i < lines.Count(); i++)
{
    string[] columns = lines[i].Split(',');
    orderList.Add(columns[2]);
}
It is working fine, but the problem is that File.ReadAllLines(Filename) reads the complete file, which hurts performance. I want something that reads only the last 30,000 lines without iterating through the complete file.
PS: I am using .NET 3.5. File.ReadLines() does not exist in .NET 3.5.
You can use the File.ReadLines() method instead of File.ReadAllLines().
From MSDN: File.ReadLines()
The ReadLines and ReadAllLines methods differ as follows: when you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
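For example, with lazy enumeration only as many lines as you consume are actually read from disk; a quick sketch (hypothetical file name, System.Linq in scope):
// only the first 10 lines of big.csv are read here
foreach (var l in File.ReadLines("big.csv").Take(10))
    Console.WriteLine(l);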
Solution 1:
string[] lines = File.ReadAllLines(FileName).Where(r => r.ToString() != "").ToArray();
int count = lines.Count();
List<String> orderList = new List<String>();
// walk backwards over at most the last 30,000 lines
int loopCount = count > 30000 ? count - 30000 : 0;
for (int i = count - 1; i >= loopCount; i--)
{
    string[] columns = lines[i].Split(',');
    orderList.Add(columns[2]);
}
Solution 2: if you are using .NET Framework 3.5, as you said in the comments below, you cannot use the File.ReadLines() method, as it is available only since .NET 4.0.
You can use StreamReader as below:
List<string> lines = new List<string>();
List<String> orderList = new List<String>();
String line;
int count = 0;
using (StreamReader reader = new StreamReader("c:\\Bethlehem-Deployment.txt"))
{
    while ((line = reader.ReadLine()) != null)
    {
        lines.Add(line);
        count++;
    }
}
// walk backwards over at most the last 30,000 lines
int loopCount = (count > 30000) ? count - 30000 : 0;
for (int i = count - 1; i >= loopCount; i--)
{
    string[] columns = lines[i].Split(',');
    orderList.Add(columns[0]);
}
You can use File.ReadLines, which lets you start enumerating the collection of strings before the whole collection is returned.
After that you can use LINQ to make things a lot easier: Reverse will reverse the order of the collection and Take will take the first n items. Now apply Reverse again to get the last n lines in their original order.
var lines = File.ReadLines(Filename).Reverse().Take(30000).Reverse();
If you are using .NET 3.5 or earlier, you can create your own method which works the same as File.ReadLines, like this. Here is the code for the method, originally written by @Jon:
public IEnumerable<string> ReadLines(string file)
{
    using (TextReader reader = File.OpenText(file))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
Now you can use LINQ over this function as well, just like in the statement above.
var lines = ReadLines(Filename).Reverse().Take(30000).Reverse();
The problem is that you do not know where to start reading the file to get the last 30,000 lines. Unless you want to maintain a separate index of line offsets, you can either read the file from the start, counting lines and retaining only the last 30,000, or start from the end, counting lines backwards. The last approach can be efficient if the file is very large and you only want a few lines. However, 30,000 does not seem like "a few lines", so here is an approach that reads the file from the start and uses a queue to keep the last 30,000 lines:
var fileName = @" ... ";
var linesToRead = 30000;
var queue = new Queue<String>();
using (var streamReader = File.OpenText(fileName)) {
    while (!streamReader.EndOfStream) {
        queue.Enqueue(streamReader.ReadLine());
        if (queue.Count > linesToRead)
            queue.Dequeue();
    }
}
Now you can access the lines that are stored in queue. This class implements IEnumerable<String>, allowing you to use foreach to iterate the lines. However, if you want random access you will have to use the ToArray method to convert the queue into an array, which adds some overhead to the computation.
This solution is efficient in terms of memory because at most 30,000 lines have to be kept in memory, and the garbage collector can free any extra lines when required. Using File.ReadAllLines will pull all the lines into memory at once, possibly increasing the memory required by the process.
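For instance, the ToArray conversion mentioned above is a one-liner:
string[] last30k = queue.ToArray(); // random access: last30k[0] is the oldest kept line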
Or, I have a different idea for this.
Try splitting the csv into category files like A-D, E-G ....
and access whichever first character you need.
Or you can split the data by entity count. Every file will contain 15,000 entities, for example, plus a text file which contains tiny data about the entities and their location, like:
Txt File:
entityID | inWhich.Csv
....

Using C#, how do I read a text file into a matrix of characters and then query that matrix? Is this even possible?

Example
If I had a text file with these lines:
The cat meowed.
The dog barked.
The cat ran up a tree.
I would want to end up with a matrix of rows and columns like this:
0 1 2 3 4 5 6 7 8 9
0| t-h-e- -c-a-t- -m-e-o-w-e-d-.- - - - - - - -
1| t-h-e- -d-o-g- -b-a-r-k-e-d-.- - - - - - - -
2| t-h-e- -c-a-t- -r-a-n- -u-p- -a- -t-r-e-e-.-
Then I would like to query this matrix to quickly determine information about the text file itself. For example, I would quickly be able to tell if everything in column "0" is a "t" (it is).
I realize that this might seem like a strange thing to do. I am trying to ultimately (among other things) determine if various text files are fixed-width delimited without any prior knowledge about the file. I also want to use this matrix to detect patterns.
The actual files that will go through this are quite large.
Thanks!
For example, I would quickly be able to tell if everything in column "0" is a "t" (it is).
int column = 0;
char charToCheck = 't';
bool b = File.ReadLines(filename)
    .All(s => (s.Length > column ? s[column] : '\0') == charToCheck);
What you can do is read the first line of your text file and use it as a mask. Compare every next line to the mask and remove every character from the mask that is not the same as the character at the same position. After processing all lines you'll have a list of delimiters.
Btw, the code is not very clean, but it is a good starting point, I think.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace DynamicallyDetectFixedWithDelimiter
{
    class Program
    {
        static void Main(string[] args)
        {
            var sr = new StreamReader(@"C:\Temp\test.txt");

            // Get initial list of delimiters
            char[] firstLine = sr.ReadLine().ToCharArray();
            Dictionary<int, char> delimiters = new Dictionary<int, char>();
            for (int i = 0; i < firstLine.Count(); i++)
            {
                delimiters.Add(i, firstLine[i]);
            }

            // Read subsequent lines, remove delimiters from
            // the dictionary that are not present in subsequent lines
            string line;
            while ((line = sr.ReadLine()) != null && delimiters.Count() != 0)
            {
                var subsequentLine = line.ToCharArray();
                var invalidDelimiters = new List<int>();

                // Compare all chars in first and subsequent line
                foreach (var delimiter in delimiters)
                {
                    if (delimiter.Key >= subsequentLine.Count())
                    {
                        invalidDelimiters.Add(delimiter.Key);
                        continue;
                    }
                    // Remove delimiter when it differs from the
                    // character at the same position in a subsequent line
                    if (subsequentLine[delimiter.Key] != delimiter.Value)
                    {
                        invalidDelimiters.Add(delimiter.Key);
                    }
                }
                foreach (var invalidDelimiter in invalidDelimiters)
                {
                    delimiters.Remove(invalidDelimiter);
                }
            }

            foreach (var delimiter in delimiters)
            {
                Console.WriteLine(String.Format("Delimiter at {0} = {1}", delimiter.Key, delimiter.Value));
            }
            sr.Close();
        }
    }
}
"I am trying to ultimately (among other things) determine if various text files are fixed-width (...)"
If that's so, you could try this:
public bool isFixedWidth(string fileName)
{
    string[] lines = File.ReadAllLines(fileName);
    int length = lines[0].Length;
    foreach (string s in lines)
    {
        if (s.Length != length)
        {
            return false;
        }
    }
    return true;
}
Once you have that lines variable, you can access any character as though it were in a matrix, like char c = lines[3][1];. However, there is no hard guarantee that all lines are the same length. You could pad them to be the same length as the longest one, if you so wanted, as sketched below.
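A small sketch of that padding idea, assuming lines is already loaded and System.Linq is in scope:
int max = lines.Max(s => s.Length);                               // longest line
string[] padded = lines.Select(s => s.PadRight(max)).ToArray();   // pad shorter lines with spaces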
Also,
"how would I query to get a list of all columns that contain a space character for ALL rows (for example)"
You could try this:
public bool CheckIfAllCharactersInAColumnAreTheSame(string[] lines, int colIndex)
{
    char c = lines[0][colIndex];
    try
    {
        foreach (string s in lines)
        {
            if (s[colIndex] != c)
            {
                return false;
            }
        }
        return true;
    }
    catch (IndexOutOfRangeException)
    {
        return false;
    }
}
Since it's not clear where exactly you're having difficulty, here are a few pointers.
Reading the file as strings, one per line:
string[] lines = File.ReadAllLines("filename.txt");
Obtaining a jagged array (a matrix) of characters from the lines (this step seems unnecessary since strings can be indexed just like character arrays):
char[][] charMatrix = lines.Select(l => l.ToCharArray()).ToArray();
Example query: whether every character in column 0 is a 't':
bool allTs = charMatrix.All(row => row[0] == 't');

Create text files of every combination of specific lines within a base text file

Ok, so hopefully I can explain this in enough detail for somebody to be able to help me. I am writing a program in C# that is supposed to take a text file, replace specific text (which happens to be names of files), and print a new text file for every single combination of the given filenames. Each place where a filename can change has its own set of possible filenames, listed as an array described below. The program should run regardless of how many filenames are available for each location, as well as how many total locations there are for the filenames. If you really wanted to make it awesome, it can be slightly optimized, knowing that no filenames should be duplicated throughout any single text file.
text is an array of lines that make up the base of the total file.
lineNum holds an array of the line locations of the filename entries.
previousFiles is an array of previously used filenames, starting with what is already in the file.
files is a jagged 2-dimensional array of possible filenames where files[1] would be an array of all the possible filenames for the 2nd location
Here is an example of how it would work with 3 separate filename locations, the first one given 3 possible filenames, the second given 8 possible filenames, and the third given 3 possible filenames.
Oh and assume buildNewFile works.
int iterator = 0;
for (int a = 0; a < 3; a++)
{
    for (int b = 0; b < 8; b++)
    {
        for (int c = 0; c < 3; c++)
        {
            iterator++;
            text[lineNums[0]] = text[lineNums[0]].Replace(previousFiles[0], files[0][a]);
            text[lineNums[1]] = text[lineNums[1]].Replace(previousFiles[0], files[0][a]);
            text[lineNums[2]] = text[lineNums[2]].Replace(previousFiles[1], files[1][b]);
            text[lineNums[3]] = text[lineNums[3]].Replace(previousFiles[1], files[1][b]);
            text[lineNums[4]] = text[lineNums[4]].Replace(previousFiles[2], files[2][c]);
            text[lineNums[5]] = text[lineNums[5]].Replace(previousFiles[2], files[2][c]);
            previousFiles = new string[] { files[0][a], files[1][b], files[2][c] };
            buildNewFile(text, Info.baseFolder + "networks\\" + Info.dsnFilename + iterator + ".dsn");
        }
    }
}
If you guys can help me, thank you so much, I just can't figure out how to do it recursively or anything. If you have any questions I'll answer them and edit up here to reflect that.
It took me a little while to figure out what you really wanted to do. This problem can be solved without recursion, the trick is to look at the data you have and get it into a more usable format.
Your "files" array is the one that is the most inconvenient. The trick is to transform the data into usable permutations. To do that, I suggest taking advantage of yield and using a method that returns IEnumerable. The code for it is here:
public IEnumerable<string[]> GenerateFileNameStream(string[][] files)
{
    int[] current_indices = new int[files.Length];
    current_indices.Initialize();
    List<string> file_names = new List<string>();
    while (current_indices[0] < files[0].Length)
    {
        file_names.Clear();
        for (var index_index = 0; index_index < current_indices.Length; index_index++)
        {
            file_names.Add(files[index_index][current_indices[index_index]]);
        }
        yield return file_names.ToArray();
        // increment the indices, trickle down as needed
        for (var check_index = 0; check_index < current_indices.Length; check_index++)
        {
            current_indices[check_index]++;
            // if the index hasn't rolled over, we're done here
            if (current_indices[check_index] < files[check_index].Length) break;
            // if the last location rolls over, then we are totally done
            if (check_index == current_indices.Length - 1) yield break;
            // reset this index, increment the next one in the next iteration
            current_indices[check_index] = 0;
        }
    }
}
Basically, it keeps track of the current index for each row of the files 2D array and returns the file name at each current index. Then it increments the first index. If the first index rolls over, then it resets to 0 and increments the next index instead. This way we can iterate through every permutation of the file names.
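As a hypothetical usage sketch, the three locations from the question (3, 8, and 3 candidate names) would yield 3 * 8 * 3 = 72 permutations; the file names here are made up:
var files = new[]
{
    new[] { "a1.dsn", "a2.dsn", "a3.dsn" },
    new[] { "b1.dsn", "b2.dsn", "b3.dsn", "b4.dsn", "b5.dsn", "b6.dsn", "b7.dsn", "b8.dsn" },
    new[] { "c1.dsn", "c2.dsn", "c3.dsn" }
};
foreach (var nameSet in GenerateFileNameStream(files))
    Console.WriteLine(string.Join(", ", nameSet)); // one line per permutation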
Now, looking at the relationship between lineNum and files, I assume that each location in the file is copied to two lines. The rest of the code is here:
public void MakeItWork(string[][] files, int[] lineNum, string[] text, string[] previousFiles)
{
    var iterator = 0;
    var filenames = GenerateFileNameStream(files);
    // work on a copy of the text; assume the "previousFiles" are in this text
    var text_copy = new string[text.Length];
    foreach (var filenameset in filenames)
    {
        iterator++;
        Array.Copy(text, text_copy, text.Length);
        for (var line_index = 0; line_index < lineNum.Length; line_index++)
        {
            var line_number = lineNum[line_index];
            // replace in the copy so the original text keeps the previousFiles tokens
            text_copy[line_number] = text_copy[line_number].Replace(previousFiles[line_index], filenameset[line_index / 2]);
        }
        buildNewFile(text_copy, Info.baseFolder + "networks\\" + Info.dsnFilename + iterator + ".dsn");
    }
}
This code just takes the results from the enumerator and generates the files for you. The assumption, based on your sample code, is that each filename location is used twice per file (since the lineNum array was twice as long as the files location count).
I haven't fully tested all the code, but the crux of the algorithm is there. The key is to transform your data into a more usable form, then process it. The other suggestion I have when asking a question here is to describe the problem more as a "problem" and not in the terms of your current solution. If you detailed the goal you are trying to achieve instead of showing code, you can get more insights into the problem.