I have a text file containing lines like the following, around 500k lines in total.
ADD GTRX:TRXID=0, TRXNAME="M_RAK_JeerExch_G_1879_18791_A-0", FREQ=81, TRXNO=0, CELLID=639, IDTYPE=BYID, ISMAINBCCH=YES, ISTMPTRX=NO, GTRXGROUPID=2556;
ADD GTRX:TRXID=1, TRXNAME="M_RAK_JeerExch_G_1879_18791_A-1", FREQ=24, TRXNO=1, CELLID=639, IDTYPE=BYID, ISMAINBCCH=NO, ISTMPTRX=NO, GTRXGROUPID=2556;
ADD GTRX:TRXID=5, TRXNAME="M_RAK_JeerExch_G_1879_18791_A-2", FREQ=28, TRXNO=2, CELLID=639, IDTYPE=BYID, ISMAINBCCH=NO, ISTMPTRX=NO, GTRXGROUPID=2556;
ADD GTRX:TRXID=6, TRXNAME="M_RAK_JeerExch_G_1879_18791_A-3", FREQ=67, TRXNO=3, CELLID=639, IDTYPE=BYID, ISMAINBCCH=NO, ISTMPTRX=NO, GTRXGROUPID=2556;
My intention is first to get the FREQ value where ISMAINBCCH=YES, which I did easily. Where ISMAINBCCH=NO, I need to concatenate the FREQ values; I have done that using File.ReadLines, but it is taking a long time. Is there a better way to do this? The ISMAINBCCH=NO lines whose values I need to concatenate come within a range of about 10 lines above and below the ISMAINBCCH=YES line, but I don't know how to exploit that; probably I should track the current line where ISMAINBCCH=YES for FREQ. Following is the code I have done so far:
using (StreamReader sr = File.OpenText(filename))
{
    while ((s = sr.ReadLine()) != null)
    {
        if (s.Contains("ADD GTRX:"))
        {
            try
            {
                var gtrx = new Gtrx
                {
                    CellId = int.Parse(PullValue(s, "CELLID")),
                    Freq = int.Parse(PullValue(s, "FREQ")),
                    //TrxNo = int.Parse(PullValue(s, "TRXNO")),
                    IsMainBcch = PullValue(s, "ISMAINBCCH").ToUpper() == "YES",
                    Commabcch = new List<string> { PullValue(s, "ISMAINBCCH") },
                    DEFINED_TCH_FRQ = null,
                    TrxName = PullValue(s, "TRXNAME"),
                };
                var result = String.Join(",",
                    from ss in File.ReadLines(filename)
                    where ss.Contains("ADD GTRX:")
                    where int.Parse(PullValue(ss, "CELLID")) == gtrx.CellId
                    where PullValue(ss, "ISMAINBCCH").ToUpper() != "YES"
                    select int.Parse(PullValue(ss, "FREQ")));
                gtrx.DEFINED_TCH_FRQ = result;
            }
            catch (FormatException)
            {
                // skip lines that fail to parse
            }
        }
    }
}
from ss in File.ReadLines(filename)
This reads the entire file and produces a sequence of lines, which you then consume inside a loop (itself driven by reading the same file), so that sequence gets thrown away and re-read from disk again and again. You're reading the same file number_of_lines + 1 times when it hasn't changed in the meantime.
An obvious boost would therefore be to read the file just once (e.g. with File.ReadAllLines(filename)), store the resulting array, and then use that array both for the outer loop, instead of while ((s = sr.ReadLine()) != null), and inside the loop, instead of the repeated calls to ReadLines().
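For illustration, a minimal sketch of that single-read version (reusing the variable names from the code above):

// Read the file from disk exactly once and keep the lines in memory.
string[] lines = File.ReadAllLines(filename);
foreach (string s in lines)
{
    if (!s.Contains("ADD GTRX:")) continue;
    // build the Gtrx object from s as before, and when concatenating FREQ
    // values, scan the in-memory 'lines' array here instead of calling
    // File.ReadLines(filename) again
}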
But there's a flaw in your logic in even looking at ReadLines() repeatedly; you're already scanning through the file so you're going to come across all the lines relevant to the same CELLID later anyway:
var gtrxDict = new Dictionary<int, Gtrx>();
using (StreamReader sr = File.OpenText(filename))
{
    while ((s = sr.ReadLine()) != null)
    {
        if (s.Contains("ADD GTRX:"))
        {
            int cellID = int.Parse(PullValue(s, "CELLID"));
            Gtrx gtrx;
            if (gtrxDict.TryGetValue(cellID, out gtrx)) // Found previous one
                gtrx.DEFINED_TCH_FRQ += "," + int.Parse(PullValue(s, "FREQ"));
            else // First one for this ID, so create a new object
                gtrxDict[cellID] = new Gtrx
                {
                    CellId = cellID,
                    Freq = int.Parse(PullValue(s, "FREQ")),
                    IsMainBcch = PullValue(s, "ISMAINBCCH").ToUpper() == "YES",
                    Commabcch = new List<string> { PullValue(s, "ISMAINBCCH") },
                    DEFINED_TCH_FRQ = int.Parse(PullValue(s, "FREQ")).ToString(),
                    TrxName = PullValue(s, "TRXNAME"),
                };
        }
    }
}
This way we don't need to keep more than one line from the file in memory at all, never mind doing so repeatedly. After this has run gtrxDict will contain a Gtrx object for each distinct CELLID in the file, with DEFINED_TCH_FRQ as a comma-separated list of the values from each matching line.
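For example, a quick way to inspect the result afterwards (assuming the same Gtrx class as above):

foreach (KeyValuePair<int, Gtrx> pair in gtrxDict)
{
    // One entry per distinct CELLID, with all FREQ values joined by commas
    Console.WriteLine("CELLID {0}: DEFINED_TCH_FRQ = {1}",
        pair.Key, pair.Value.DEFINED_TCH_FRQ);
}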
The following code snippet can be used to read the entire text file:
using System.IO;

/// Read the text document specified by full path
private string ReadTextDocument(string textFilePath)
{
    // open the file only if it exists
    if (!File.Exists(textFilePath))
    {
        throw new FileNotFoundException();
    }

    using (StreamReader reader = new StreamReader(textFilePath))
    {
        // the using block disposes the reader, so no explicit Close() is needed
        return reader.ReadToEnd();
    }
}
Get the in-memory string, then apply the Split() function to create a string[] and process the array elements in the same way as lines in the original text file. When processing a very large file, this method also provides the option of reading it in chunks of data, processing each chunk, and then disposing of it upon completion (re: https://msdn.microsoft.com/en-us/library/system.io.streamreader%28v=vs.110%29.aspx).
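A minimal sketch of that approach, using the ReadTextDocument() method defined above (the per-line processing is just a placeholder):

string text = ReadTextDocument(filePath);
// Split the in-memory string into lines; handles both \r\n and \n endings.
string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
foreach (string line in lines)
{
    // process each line here exactly as if it had been read from the file
}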
As mentioned in the comments by @Michael Liu, there is another option of using File.ReadAllText(), which provides an even more compact solution and can be used instead of reader.ReadToEnd(). Other useful methods of the File class are detailed in: https://msdn.microsoft.com/en-us/library/system.io.file%28v=vs.110%29.aspx
And, finally, the FileStream class can be used for both file read and write operations with various levels of granularity (re: https://msdn.microsoft.com/en-us/library/system.io.filestream%28v=vs.110%29.aspx).
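For completeness, a hedged sketch of the chunked-read option mentioned above (the buffer size is an arbitrary assumption):

char[] buffer = new char[64 * 1024]; // 64K characters per chunk (arbitrary choice)
using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
using (var reader = new StreamReader(stream))
{
    int charsRead;
    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // process buffer[0..charsRead) here; the buffer is reused for the next chunk
    }
}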
SUMMARY
In response to the interesting comments thread, here is a brief summary.
The biggest bottleneck in the procedure described in the OP's question is disk I/O. Here are some numbers: the average seek time of a good-quality HDD is about 5 ms, plus the actual read time (per line). The entire in-memory processing of the file data may well take less time than a single HDD read (sometimes significantly less; an SSD does better, but is still no match for DDR3 RAM). The RAM of a modern PC is rather sizable (typically 4...8 GB, which is more than enough to handle most text files). Thus, the core idea of my solution is to minimize disk read operations and do the entire file-data processing in memory. Implementations can differ, apparently.
Hope this helps. Best regards,
I think that this more-or-less gets you what you want.
First read in all the data:
var data =
(
    from s in File.ReadLines(filename)
    where s != null
    where s.Contains("ADD GTRX:")
    select new Gtrx
    {
        CellId = int.Parse(PullValue(s, "CELLID")),
        Freq = int.Parse(PullValue(s, "FREQ")),
        //TrxNo = int.Parse(PullValue(s, "TRXNO")),
        IsMainBcch = PullValue(s, "ISMAINBCCH").ToUpper() == "YES",
        Commabcch = new List<string> { PullValue(s, "ISMAINBCCH") },
        DEFINED_TCH_FRQ = null,
        TrxName = PullValue(s, "TRXNAME"),
    }
).ToArray();
Based on the loaded data, create a lookup that returns the frequencies for each cell id:
var lookup =
    data
    .Where(d => !d.IsMainBcch)
    .ToLookup(d => d.CellId, d => d.Freq);
Now update the DEFINED_TCH_FRQ based on the lookup:
foreach (var d in data)
{
    d.DEFINED_TCH_FRQ = String.Join(",", lookup[d.CellId]);
}
My task is to check which of the elements of a column in one CSV are not included in the elements of a column in the other CSV. There is a country column in both CSVs, and the task is to check which countries are in the first CSV but not in the second one.
I guess I have to solve it with lists after I read the strings from the two CSVs, but I don't know how to check which items in the first list are not in the other list and then put them into a third list.
There are many ways to achieve this. For many real-world CSV applications it is helpful to read the CSV input into a typed in-memory store; there are standard libraries that can assist with this, like CsvHelper, as explained in this canonical post: Parsing CSV files in C#, with header.
However, for this simple requirement we only need to parse the Country values from the master list, in this case the second CSV. We don't need to manage, validate or parse any of the other fields in the CSVs:
Build a list of unique Country values from the second csv
Iterate the first csv
Get the Country value
Check against the list of countries from the second csv
Write to the third csv if the country was not found
You can test the following code on .NET Fiddle
NOTE: this code uses StringWriter and StringReader, as their interfaces are the same as the file writers and readers in the System.IO namespace, but it removes the complexity associated with file access for this simple requirement.
string inputcsv = #"Id,Field1,Field2,Country,Field3
1,one,two,Australia,three
2,one,two,New Zealand,three
3,one,two,Indonesia,three
4,one,two,China,three
5,one,two,Japan,three";
string masterCsv = #"Field1,Country,Field2
one,Indonesia,...
one,China,...
one,Japan,...";
string errorCsv = "";
// For all in inputCsv where the country value is not listed in the masterCsv
// Write to errorCsv
// Step 1: Build a list of unique Country values
bool csvHasHeader = true;
int countryIndexInMaster = 1;
char delimiter = ',';
List<string> countries = new List<string>();
using (var masterReader = new System.IO.StringReader(masterCsv))
{
    string line = null;
    if (csvHasHeader)
    {
        line = masterReader.ReadLine();
        // an example of how to find the column index from first principles
        if (line != null)
            countryIndexInMaster = line.Split(delimiter).ToList().FindIndex(x => x.Trim('"').Equals("Country", StringComparison.OrdinalIgnoreCase));
    }
    while ((line = masterReader.ReadLine()) != null)
    {
        string country = line.Split(delimiter)[countryIndexInMaster].Trim('"');
        if (!countries.Contains(country))
            countries.Add(country);
    }
}
// Read the input CSV, if the country is not in the master list "countries", write it to the errorCsv
int countryIndexInInput = 3;
csvHasHeader = true;
var outputStringBuilder = new System.Text.StringBuilder();
using (var outputWriter = new System.IO.StringWriter(outputStringBuilder))
using (var inputReader = new System.IO.StringReader(inputcsv))
{
    string line = null;
    if (csvHasHeader)
    {
        line = inputReader.ReadLine();
        if (line != null)
        {
            countryIndexInInput = line.Split(delimiter).ToList().FindIndex(x => x.Trim('"').Equals("Country", StringComparison.OrdinalIgnoreCase));
            outputWriter.WriteLine(line);
        }
    }
    while ((line = inputReader.ReadLine()) != null)
    {
        string country = line.Split(delimiter)[countryIndexInInput].Trim('"');
        if (!countries.Contains(country))
        {
            outputWriter.WriteLine(line);
        }
    }
    outputWriter.Flush();
    errorCsv = outputWriter.ToString();
}
// dump output to the console
Console.WriteLine(errorCsv);
Since you write about solving it with lists, I assume you can load those values from the CSV to the lists, so let's start with:
List<string> countriesIn1st = LoadDataFrom1stCsv();
List<string> countriesIn2nd = LoadDataFrom2ndCsv();
Then you can easily solve it with LINQ:
List<string> countriesNotIn2nd = countriesIn1st.Where(country => !countriesIn2nd.Contains(country)).ToList();
Now you have your third list with countries that are in first, but not in the second list. You can save it.
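As a side note, LINQ's Except() expresses the same set difference more directly (note that it also removes duplicates from the result, which may or may not matter for your data):

List<string> countriesNotIn2nd = countriesIn1st.Except(countriesIn2nd).ToList();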
I know there are similar questions, but I was not able to find the answer to mine. I have two CSV files. Both contain image metadata for the same images, but the image IDs in the first file are outdated. So I need to take the IDs from the second file and replace the outdated IDs with the new ones. I was thinking of comparing the Longitude, Latitude and Altitude values of each image, and where they match in both files, taking the image ID from the second file. The IDs would be used in a new object. Note that the sequence of lines differs between the files, and the first file contains more lines than the second one.
The files structure looks as follows:
First file:
ImgID,Longitude,Latitude,Altitude
01,44.7282372307,27.5786807185,14.1536407471
02,44.7287939869,27.5777060219,13.2340240479
03,44.7254687824,27.582636255,16.5887145996
04,44.7254294913,27.5826908925,16.5794525146
05,44.728785278,27.5777185252,13.2553100586
06,44.7282279311,27.5786933339,14.1576690674
07,44.7253847039,27.5827526969,16.6026000977
08,44.7287777782,27.5777295052,13.2788238525
09,44.7282196988,27.5787045314,14.1649169922
10,44.7253397041,27.5828151049,16.6300048828
11,44.728769439,27.5777417846,13.3072509766
Second file:
ImgID,Longitude,Latitude,Altitude
5702,44.7282372307,27.5786807185,14.1536407471
5703,44.7287939869,27.5777060219,13.2340240479
5704,44.7254687824,27.582636255,16.5887145996
5705,44.7254294913,27.5826908925,16.5794525146
5706,44.728785278,27.5777185252,13.2553100586
5707,44.7282279311,27.5786933339,14.1576690674
How can this be done in C#? Is there some handy library to work with?
I would use the CsvHelper library for CSV read/write, as it is a nice, complete library. For this, you should declare a class to hold your data, and its property names must match your CSV file's column names.
public class ImageData
{
    public int ImgID { get; set; }
    public double Longitude { get; set; }
    public double Latitude { get; set; }
    public double Altitude { get; set; }
}
Then to see if two lines are equal, what you need to do is see if each property in each line in one file matches the other. You could do this by simply comparing properties, but I'd rather write a comparer for this, like so:
public class ImageDataComparer : IEqualityComparer<ImageData>
{
    public bool Equals(ImageData x, ImageData y)
    {
        return (x.Altitude == y.Altitude && x.Latitude == y.Latitude && x.Longitude == y.Longitude);
    }

    public int GetHashCode(ImageData obj)
    {
        unchecked
        {
            int hash = (int)2166136261;
            hash = (hash * 16777619) ^ obj.Altitude.GetHashCode();
            hash = (hash * 16777619) ^ obj.Latitude.GetHashCode();
            hash = (hash * 16777619) ^ obj.Longitude.GetHashCode();
            return hash;
        }
    }
}
Simple explanation is that we override the Equals() method and dictate that two instances of ImageData class are equal if the three property values are matching. I will show the usage in a bit.
The CSV read/write part is pretty easy (the library's help page has some good examples and tips, please read it). I can write two methods for reading and writing like so:
public static List<ImageData> ReadCSVData(string filePath)
{
    List<ImageData> records;
    using (var reader = new StreamReader(filePath))
    {
        using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
        {
            csv.Configuration.HasHeaderRecord = true;
            records = csv.GetRecords<ImageData>().ToList();
        }
    }
    return records;
}

public static void WriteCSVData(string filePath, List<ImageData> records)
{
    using (var writer = new StreamWriter(filePath))
    {
        using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
        {
            csv.WriteRecords(records);
        }
    }
}
You can actually write generic <T> read/write methods so the two methods are usable with different classes, if that's something useful for you.
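For example, a generic version might look like this (just a sketch; the CsvHelper calls are the same as above):

public static List<T> ReadCSVData<T>(string filePath)
{
    using (var reader = new StreamReader(filePath))
    using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
    {
        return csv.GetRecords<T>().ToList();
    }
}

public static void WriteCSVData<T>(string filePath, List<T> records)
{
    using (var writer = new StreamWriter(filePath))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
        csv.WriteRecords(records);
    }
}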
Next is the crucial part. First, read the two files to memory using the methods we just defined.
var oldData = ReadCSVData(Path.Combine(Directory.GetCurrentDirectory(), "OldFile.csv"));
var newData = ReadCSVData(Path.Combine(Directory.GetCurrentDirectory(), "NewFile.csv"));
Now, I can go through each line in the 'old' data, and see if there's a corresponding record in 'new' data. If so, I grab the ID from the new data and replace the ID of old data with it. Notice the usage of the comparer we wrote.
foreach (var line in oldData)
{
    var replace = newData.FirstOrDefault(x => new ImageDataComparer().Equals(x, line));
    if (replace != null && replace.ImgID != line.ImgID)
    {
        line.ImgID = replace.ImgID;
    }
}
Next, simply overwrite the old data file.
WriteCSVData(Path.Combine(Directory.GetCurrentDirectory(), "OldFile.csv"), oldData);
Results
I'm using a simplified version of your data to easily verify our results.
Old Data
ImgID,Longitude,Latitude,Altitude
1,1,2,3
2,2,3,4
3,3,4,5
4,4,5,6
5,5,6,7
6,6,7,8
7,7,8,9
8,8,9,10
9,9,10,11
10,10,11,12
11,11,12,13
New Data
ImgID,Longitude,Latitude,Altitude
5702,1,2,3
5703,2,3,4
5704,3,4,5
5705,4,5,6
5706,5,6,7
5707,6,7,8
Now our expected result should be that the first 6 lines of the old file have their IDs updated, and that's what we get:
Updated Old Data
ImgID,Longitude,Latitude,Altitude
5702,1,2,3
5703,2,3,4
5704,3,4,5
5705,4,5,6
5706,5,6,7
5707,6,7,8
7,7,8,9
8,8,9,10
9,9,10,11
10,10,11,12
11,11,12,13
An alternate way to do it, if for some reason you didn't want to use CsvHelper, is to write a method that compares two lines of data and determines whether they're equal (ignoring the first column's data):
public static bool DataLinesAreEqual(string first, string second)
{
    if (first == null || second == null) return false;
    var xParts = first.Split(',');
    var yParts = second.Split(',');
    if (xParts.Length != 4 || yParts.Length != 4) return false;
    return xParts.Skip(1).SequenceEqual(yParts.Skip(1));
}
Then we can read all the lines from both files into arrays, and then we can update our first file lines with those from the second file if our method says they're equal:
var csvPath1 = #"c:\temp\csvData1.csv";
var csvPath2 = #"c:\temp\csvData2.csv";
// Read lines from both files
var first = File.ReadAllLines(csvPath1);
var second = File.ReadAllLines(csvPath2);
// Select the updated line where necessary
var updated = first.Select(f => second.FirstOrDefault(s => DataLinesAreEqual(f, s)) ?? f);
// Write the updated result back to the first file
File.WriteAllLines(csvPath1, updated);
I have one List<string> whose length is undefined, and for some purpose I'm converting the entire List<string> to a string. Before the conversion I want to check whether it is possible (is it going to throw an out-of-memory exception?), so I can process that much data and continue in another batch.
Sample
int drc = ImportConfiguration.Data.Count;
List<string> queries = new List<string>() { };
// iterate over data rows to generate queries and execute them
for (int drn = 0; drn < drc; drn++) // drn stands for Data Row Number
{
    queries.Add(Generate(ImportConfiguration.Data[drn], drn));
    // SO HERE I WANT TO CHECK FOR THE SIZE:
    // IF IT'S NOT GOING TO BE POSSIBLE IN THE NEXT ITERATION THEN I'LL EXECUTE IT RIGHT NOW
    // AND EMPTY THE LIST AGAIN FOR THE NEXT BATCH
    if (drn == drc - 1 || drn % 5000 == 0)
    {
        SqlHelper.ExecuteNonQuery(connection, System.Data.CommandType.Text, String.Join(Environment.NewLine, queries));
        queries = new List<string>() { };
    }
}
Since you are trying to send a large amount of text to a SQL Server instance, you could use SQL Server's streaming support to write the string to the stream as you go, minimizing the amount of memory needed to construct the data to send.
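A hedged sketch of that idea: since .NET 4.5, SqlClient lets a SqlParameter value be a TextReader for an NVARCHAR(MAX) column, so the text is streamed to the server rather than held as one giant string (the table and column names here are hypothetical):

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "INSERT INTO ImportBatch (Payload) VALUES (@payload)", connection)) // hypothetical table
{
    connection.Open();
    // A StringReader still needs the joined string in memory; to stream for real,
    // supply a custom TextReader that yields the queries one at a time.
    command.Parameters.Add(new SqlParameter("@payload", SqlDbType.NVarChar, -1)
    {
        Value = new StringReader(string.Join(Environment.NewLine, queries))
    });
    command.ExecuteNonQuery();
}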
I can't say it is not possible but I think a better way would be to do the join and catch any exceptions:
try
{
    var joined = string.Join(",", list);
}
catch (OutOfMemoryException)
{
    // join failed, take action (log, notify user, etc.)
}
Note: if the exception is happening, then you need to consider a different approach than using a list and joining.
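For instance, one such approach is to accumulate into a StringBuilder and flush whenever it crosses a size threshold, instead of joining one huge list at the end (the threshold value here is an arbitrary assumption):

var batch = new StringBuilder();
const int maxBatchChars = 4000000; // arbitrary cutoff, roughly 8 MB of UTF-16 text
foreach (string query in queries)
{
    batch.AppendLine(query);
    if (batch.Length >= maxBatchChars)
    {
        ExecuteBatch(batch.ToString()); // hypothetical: sends the batch to the server
        batch.Clear();
    }
}
if (batch.Length > 0)
    ExecuteBatch(batch.ToString());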
You could try:
List<string> theList;
try
{
    String allString = String.Join(",", theList.ToArray());
}
catch (OutOfMemoryException e)
{
    // ... handle OutOfMemoryException exception (e)
}
EDIT
Based on your comment.
You could give an estimation in the following way.
Get available memory: Take a look at this post
Get the total size of your list's strings: theList.Sum(s => s.Length);
List<string> theList = new List<string> { "AAA", "BBB" };
// number of characters
var allSize = theList.Sum(s => s.Length);
// available memory
Process proc = Process.GetCurrentProcess();
var availableMemory = proc.PrivateMemorySize64;
if (availableMemory > allSize)
{
    // you can try
    try
    {
        String allString = String.Join(",", theList.ToArray());
    }
    catch (OutOfMemoryException e)
    {
        // ... handle OutOfMemoryException exception (e)
    }
}
else
{
    // it is not going to work...
}
Hi everyone.
I want to parse a 300+ MB text file with 2,000,000+ lines in it and perform some operations (split every line, make comparisons, save data in a dictionary) on the stored data.
It takes the program ~50+ minutes to get the expected result (for files with 80,000 lines it takes about 15-20 seconds).
Is there any way to make it work faster?
Code sample below:
using (FileStream cut_file = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(cut_file))
using (StreamReader s_reader = new StreamReader(bs))
{
    string line;
    while ((line = s_reader.ReadLine()) != null)
    {
        string[] every_item = line.Split('|'); // line sample: jdsga237 | 3332, 3223, 121 |
        string car = every_item[0];
        string[] cameras = every_item[1].Split(',');
        if (!cars.Contains(car)) // cars is a List<string> defined at the beginning of the program
        {
            for (int camera = 0; camera < cameras.Count(); camera++)
            {
                if (cams_input.Contains(cameras[camera])) // cams_input is a List<string> defined at the beginning of the program
                {
                    cars.Add(car);
                    result[myfile]++; // result is a Dictionary<string, int>; used a dictionary for parsing several files
                }
            }
        }
    }
}
Well, it is quite possible you have a problem linked to memory use.
However, you have some blatant inefficiencies in useless LINQ usage:
when you call Contains() on a List, you basically do a foreach on the List.
So, an improvement over your code is to use HashSet instead of List in order to speed up the Contains().
Same for calling Count() on the array in the for loop: it's an array, so just use Array.Length.
Anyway, you should profile the code on your machine (I use the JetBrains profiler and find it invaluable for this kind of performance profiling).
My take on this:
string myfile = "";
var cars = new HashSet<string>();
var cams_input = new HashSet<string>();
var result = new Dictionary<string, int>();
foreach (var line in System.IO.File.ReadLines(myfile, System.Text.Encoding.UTF8))
{
var everyItem = line.Split('|'); //line sample: jdsga237 | 3332, 3223, 121 |
var car = everyItem[0];
if (cars.Contains(car)) continue;
var cameras = everyItem[1].Split(',');
for (int camera = 0; camera < cameras.Length; camera++)
{
if (cams_input.Contains(cameras[camera]))
{
cars.Add(car);
// I really don't get who is inserting value zero.
result[myfile]++;
}
}
}
Edit: As per your comment, the performance issue seemed to be related to the use of lists. You should read a guide about the collections available in the .NET framework, like this one: http://www.codethinked.com/an-overview-of-system_collections_generic
Every type is best suited to a particular kind of task: the HashSet, for example, is meant to store a set (doh!) of unique values, and the really shiny feature it gives you is O(1) Contains operations.
What you pay for is the storage of the hashes and the cost of computing them.
You also lose ordering, etc.
A Dictionary is basically a HashSet with a value attached to each key.
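A tiny illustration of the API difference (the speedup only becomes visible at scale):

var list = new List<string> { "cam1", "cam2", "cam3" };
var set = new HashSet<string>(list);

// Both return true, but List.Contains scans elements one by one (O(n)),
// while HashSet.Contains hashes the key and probes a single bucket (O(1)).
bool inList = list.Contains("cam3");
bool inSet = set.Contains("cam3");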
Good study!
Ps: if the problem is solved, please close the question.
As of now, I am using this code to open a file, read it into a list, and parse that list into a string[]:
string CP4DataBase =
    "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
CP4DataBaseRTB.LoadFile(CP4DataBase, RichTextBoxStreamType.PlainText);
string[] splitCP4DataBaseLines = CP4DataBaseRTB.Text.Split('\n');
List<string> tempCP4List = new List<string>();
string[] line1CP4Components;

foreach (var line in splitCP4DataBaseLines)
    tempCP4List.Add(line + Environment.NewLine);

string concattedUnitPart = "";
foreach (var line in tempCP4List)
{
    concattedUnitPart = concattedUnitPart + line;
    line1CP4PartLines++;
}

line1CP4Components = new Regex("\"UNIT\",\"PARTS\"", RegexOptions.Multiline)
    .Split(concattedUnitPart)
    .Where(c => !string.IsNullOrEmpty(c)).ToArray();
I am wondering if there is a quicker way to do this. This is just one of the files I am opening, so this is repeated a minimum of 5 times to open and properly load the lists.
The minimum file size being imported right now is 257 KB. The largest file is 1,803 KB. These files will only get larger as time goes on as they are being used to simulate a database and the user will continually add to them.
So my question is, is there a quicker way to do all of the above code?
EDIT:
***CP4***
"UNIT","PARTS"
"BLOCK","HEADER-"
"NAME","106536"
"REVISION","0000"
"DATE","11/09/03"
"TIME","11:10:11"
"PMABAR",""
"COMMENT",""
"PTPNAME","R160805"
"CMPNAME","R160805"
"BLOCK","PRTIDDT-"
"PMAPP",1
"PMADC",0
"ComponentQty",180
"BLOCK","PRTFORM-"
"PTPSZBX",1.60
"PTPSZBY",0.80
"PTPMNH",0.25
"NeedGlue",0
"BLOCK","TOLEINF-"
"PTPTLBX",0.50
"PTPTLBY",0.40
"PTPTLCL",10
"PTPTLPX",0.30
"PTPTLPY",0.30
"PTPTLPQ",30
"BLOCK","ELDT+" "PGDELSN","PGDELX","PGDELY","PGDELPP","PGDELQ","PGDELP","PGDELW","PGDELL","PGDELWT","PGDELLT","PGDELCT","PGDELR"
0,0.000,0.000,0,0,0.000,0.000,0.000,0.000,0.000,0.000,0
"BLOCK","VISION-"
"PTPVIPL",0
"PTPVILCA",0
"PTPVILB",0
"PTPVICVT",10
"PENVILIT",0
"BLOCK","ENVDT"
"ELEMENT","CP43ENVDT-"
"PENNMI",1.0
"PENNMA",1.0
"PENNZN",""
"PENNZT",1.0
"PENBLM",12
"PENCRTS",0
"PENSPD1",100
"PTPCRDCT",0
"PENVICT",1
"PCCCRFT",1
"BLOCK","CARRING-"
"PTPCRAPO",0
"PTPCRPCK",0
"PTPCRPUX",0.00
"PTPCRPUY",0.00
"PTPCRRCV",0
"BLOCK","PACKCLS-"
"FDRTYPE","Emboss"
"TAPEWIDTH","8mm"
"FEEDPITCH",4
"REELDIAMETER",0
"TAPEDEPTH",0.0
"DOADVVACUUM",0
"CHKBEFOREFEED",0
"TAPEARMLENGTH",0
"PPCFDPP",0
"PPCFDEC",4
"PPCMNPT",30
"UNIT","PARTS"
"BLOCK","HEADER-"
"NAME","106653"
"REVISION","0000"
"DATE","11/09/03"
"TIME","11:10:42"
"PMABAR",""
"COMMENT",""
"PTPNAME","0603R"
"CMPNAME","0603R"
"BLOCK","PRTIDDT-"
"PMAPP",1
"PMADC",0
"ComponentQty",18
"BLOCK","PRTFORM-"
"PTPSZBX",1.60
"PTPSZBY",0.80
"PTPMNH",0.23
"NeedGlue",0
"BLOCK","TOLEINF-"
"PTPTLBX",0.50
"PTPTLBY",0.34
"PTPTLCL",0
"PTPTLPX",0.60
"PTPTLPY",0.40
"PTPTLPQ",30
"BLOCK","ELDT+" "PGDELSN","PGDELX","PGDELY","PGDELPP","PGDELQ","PGDELP","PGDELW","PGDELL","PGDELWT","PGDELLT","PGDELCT","PGDELR"
0,0.000,0.000,0,0,0.000,0.000,0.000,0.000,0.000,0.000,0
"BLOCK","VISION-"
"PTPVIPL",0
"PTPVILCA",0
"PTPVILB",0
"PTPVICVT",10
"PENVILIT",0
"BLOCK","ENVDT"
"ELEMENT","CP43ENVDT-"
"PENNMI",1.0
"PENNMA",1.0
"PENNZN",""
"PENNZT",1.0
"PENBLM",12
"PENCRTS",0
"PENSPD1",80
"PTPCRDCT",0
"PENVICT",1
"PCCCRFT",1
"BLOCK","CARRING-"
"PTPCRAPO",0
"PTPCRPCK",0
"PTPCRPUX",0.00
"PTPCRPUY",0.00
"PTPCRRCV",0
"BLOCK","PACKCLS-"
"FDRTYPE","Emboss"
"TAPEWIDTH","8mm"
"FEEDPITCH",4
"REELDIAMETER",0
"TAPEDEPTH",0.0
"DOADVVACUUM",0
"CHKBEFOREFEED",0
"TAPEARMLENGTH",0
"PPCFDPP",0
"PPCFDEC",4
"PPCMNPT",30
... the file goes on and on and on... and will only get larger.
The regex is placing each "UNIT","PARTS" marker, together with the content that follows it until the NEXT "UNIT","PARTS", into a string[].
After this, I am checking each string[] to see if the "NAME" section exists in a different list. If it does exist, I am outputting that "UNIT","PARTS" section at the end of a text file.
This bit is a potential performance killer:
string concattedUnitPart = "";
foreach (var line in tempCP4List)
{
concattedUnitPart = concattedUnitPart + line;
line1CP4PartLines++;
}
(See this article for why.) Use a StringBuilder for repeated concatenation:
// No need to use tempCP4List at all
StringBuilder builder = new StringBuilder();
foreach (var line in splitCP4DataBaseLines)
{
    builder.AppendLine(line);
    line1CP4PartLines++;
}
Or even just:
string concattedUnitPart = string.Join(Environment.NewLine,
splitCP4DataBaseLines);
Now the regex part may well also be slow - I'm not sure. It's not obvious what you're trying to achieve, whether you need regular expressions at all, or whether you really need to do the whole thing in one go. Can you definitely not just process it line by line?
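For example, here is a sketch of a line-by-line version that collects each "UNIT","PARTS" section without building one big string first (it assumes every record starts with that exact line, as in your sample, and reads the file directly rather than via the RichTextBox):

var line1CP4Components = new List<string>(); // one entry per "UNIT","PARTS" record
StringBuilder current = null;
foreach (string line in File.ReadLines(CP4DataBase))
{
    if (line.StartsWith("\"UNIT\",\"PARTS\""))
    {
        if (current != null)
            line1CP4Components.Add(current.ToString());
        current = new StringBuilder(); // start a new record
    }
    else if (current != null)
    {
        current.AppendLine(line);
    }
}
if (current != null)
    line1CP4Components.Add(current.ToString()); // add the final record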
You could achieve the same output list 'line1CP4Components' using the following:
Regex StripEmptyLines = new Regex(@"^\s*$", RegexOptions.Multiline);
Regex UnitPartsMatch = new Regex(@"(?<=\n)""UNIT"",""PARTS"".*?(?=(?:\n""UNIT"",""PARTS"")|$)", RegexOptions.Singleline);
string CP4DataBase =
    "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
CP4DataBaseRTB.LoadFile(CP4DataBase, RichTextBoxStreamType.PlainText);
List<string> line1CP4Components = new List<string>(
    UnitPartsMatch.Matches(StripEmptyLines.Replace(CP4DataBaseRTB.Text, ""))
    .OfType<Match>()
    .Select(m => m.Value)
);
return line1CP4Components.ToArray();
You may be able to ignore the use of StripEmptyLines, but your original code is doing this via the Where(c => !string.IsNullOrEmpty(c)). Also, your original code is causing the '\r' part of the "\r\n" newline/linefeed pair to be duplicated; I assumed this was an accident and not intentional.
Also, you don't seem to be using the value in 'line1CP4PartLines', so I omitted its creation. It was seemingly inconsistent with the omission of empty lines later, so I guess you're not depending on it. If you need this value, a simple regex can tell you how many lines are in the string:
int linecount = new Regex("^", RegexOptions.Multiline).Matches(CP4DataBaseRTB.Text).Count;
// example of what your code will look like
string CP4DataBase = "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
List<string> Cp4DataList = new List<string>(File.ReadAllLines(CP4DataBase));
// or create a Dictionary<int, string[]> object

string strData = string.Empty; // holds the line-item data, which is read in line by line
string[] strStockListRecord = null; // string array that holds information from the TFE_Stock.txt file
Dictionary<int, string[]> dctStockListRecords = null; // dictionary that will hold the KeyValuePairs of text file contents
List<string> lstStockListRecord = null; // generic list that will store all the lines from the .prn file being processed

if (File.Exists(strExtraLoadFileLoc + strFileName))
{
    try
    {
        lstStockListRecord = new List<string>();
        List<string> lstStrLinesStockRecord = new List<string>(File.ReadAllLines(strExtraLoadFileLoc + strFileName));
        dctStockListRecords = new Dictionary<int, string[]>(lstStrLinesStockRecord.Count());
        int intLineCount = 0;
        foreach (string strLineSplit in lstStrLinesStockRecord)
        {
            lstStockListRecord.Add(strLineSplit);
            dctStockListRecords.Add(intLineCount, lstStockListRecord.ToArray());
            lstStockListRecord.Clear();
            intLineCount++;
        } // foreach (string strLineSplit in lstStrLinesStockRecord)
        lstStrLinesStockRecord.Clear();
        lstStrLinesStockRecord = null;
        lstStockListRecord.Clear();
        lstStockListRecord = null;
        // Alter the code to fit what you are doing...