I need to know whether a certain string appears at an exact location in a .txt file.
I know how to find a specific string with the Contains method, but since I do not need to search the whole file (the string will always be at the same location), I'm trying to find the quickest solution.
if (searchedText.Contains(item))
{
    Console.WriteLine("Found {0}", item);
    break;
}
Thanks
If it's in UTF-8 and isn't guaranteed to be ASCII, then you'll just have to read the relevant number of characters. Something like:
using (var reader = File.OpenText("test.txt"))
{
    // First skip 'location' characters from the start of the file.
    char[] buffer = new char[16 * 1024];
    int charsLeft = location;
    while (charsLeft > 0)
    {
        int charsRead = reader.Read(buffer, 0, Math.Min(buffer.Length, charsLeft));
        if (charsRead <= 0)
        {
            throw new IOException("Incomplete data"); // Or whatever
        }
        charsLeft -= charsRead;
    }
    // Now positioned at the target location: read to the end of the line and compare.
    string line = reader.ReadLine();
    bool found = line.StartsWith(targetText);
    ...
}
Notes:
This is inefficient in that it reads the complete line starting from the target location. That's simpler than looping to make sure the right amount of data is read, but if you have files with really long lines, you may want to tweak this; a sketch of one such tweak follows these notes.
This code doesn't cope with characters outside the BMP (Basic Multilingual Plane). It would count them as two characters, as they're read as two UTF-16 code units. This is unlikely to affect you.
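For instance, a minimal sketch of that tweak (my addition, reusing reader and targetText from the snippet above): read exactly targetText.Length characters and compare them directly instead of reading the whole line.

char[] check = new char[targetText.Length];
int totalRead = 0;
while (totalRead < check.Length)
{
    int read = reader.Read(check, totalRead, check.Length - totalRead);
    if (read <= 0)
    {
        throw new IOException("Incomplete data");
    }
    totalRead += read;
}
bool found = new string(check).Equals(targetText, StringComparison.Ordinal);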
if (searchedText.Substring(i, l).Contains(item))
where i is the starting index and l is the length of the string you're searching for.
Since you're using Contains, you have some margin in l.
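For instance, a hypothetical usage (a sketch assuming the relevant line has already been read into searchedText, keeping the asker's i, l, and item):

string searchedText = File.ReadAllLines("test.txt")[0]; // assuming the target sits on the first line
if (searchedText.Substring(i, l).Contains(item))
{
    Console.WriteLine("Found {0}", item);
}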
How to sort a large csv file with 10 columns?
The sorting should be based on the column's data type, for example string, Date, integer, etc.
Assume we need to sort based on the 5th column (the Period column).
As it is a large CSV file, we have to do this without loading it into memory.
I tried using logparser, but beyond a certain size it throws an error saying
"log parser tool has stopped working"
So please suggest an algorithm which I can implement in C#, or any other component or code which can help me.
Thanks in advance
Do know that running a program without using memory is hard, especially if you have an algorithm that by its nature requires memory allocation.
I've looked at the External sort method mentioned by Jim Menschel and this is my implementation.
I didn't implement sorting on the fifth field but left some hints in the code so you can add that yourself.
This code reads a file line by line and creates, in a temporary directory, a new file for each line. Then we open two of those files and create a new target file. After reading a line from each of the two open files, we can compare them (or their fields). Based on the comparison we write the smaller one to the target file and read the next line from the file it came from.
Although this doesn't keep many strings in memory, it is hard on the disk drive. I checked the NTFS limits and 50,000,000 files is within the specs.
Here are the main methods of the class:
Main entry point
This takes the file to be sorted:
public void Sort(string file)
{
    Directory.CreateDirectory(sortdir);
    Split(file);
    var sortedFile = SortAndCombine();
    // if you feel confident you can overwrite the original file
    File.Move(sortedFile, file + ".sorted");
    Directory.Delete(sortdir);
}
Split file
Split the file into a new file for each line.
Yes, that will be a lot of files, but it guarantees the least amount of memory used. It is easy to optimize, though: read a couple of lines, sort those, and write them to a single file; a sketch of that optimization follows the method below.
void Split(string file)
{
    using (var sr = new StreamReader(file, Encoding.UTF8))
    {
        var line = sr.ReadLine();
        while (!String.IsNullOrEmpty(line))
        {
            // whatever you do, make sure the file you write
            // is ordered; just writing a single line is the easiest
            using (var sw = new StreamWriter(CreateUniqueFilename()))
            {
                sw.WriteLine(line);
            }
            line = sr.ReadLine();
        }
    }
}
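A minimal sketch of that batching optimization (my addition, not part of the original answer; it keeps at most batchSize lines in memory and writes each batch pre-sorted, so MergeSort still receives ordered files):

void SplitBatched(string file, int batchSize)
{
    using (var sr = new StreamReader(file, Encoding.UTF8))
    {
        var batch = new List<string>(batchSize);
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                WriteSortedBatch(batch);
            }
        }
        if (batch.Count > 0)
        {
            WriteSortedBatch(batch);
        }
    }
}

void WriteSortedBatch(List<string> batch)
{
    batch.Sort(StringComparer.Ordinal); // or compare the parsed 5th field here
    File.WriteAllLines(CreateUniqueFilename(), batch);
    batch.Clear();
}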
Combine the files
Iterate over all the files, taking them two at a time, and merge each pair:
string SortAndCombine()
{
    long processed; // keep track of how much we processed
    do
    {
        // iterate the folder
        var files = Directory.EnumerateFiles(sortdir).GetEnumerator();
        bool hasnext = files.MoveNext();
        processed = 0;
        while (hasnext)
        {
            processed++;
            // we have one file
            string fileOne = files.Current;
            hasnext = files.MoveNext();
            if (hasnext)
            {
                // we have a second file
                string fileTwo = files.Current;
                // do the work
                MergeSort(fileOne, fileTwo);
                hasnext = files.MoveNext();
            }
        }
    } while (processed > 1);
    var lastfile = Directory.EnumerateFiles(sortdir).GetEnumerator();
    lastfile.MoveNext();
    return lastfile.Current; // the single remaining file is the fully merged result
}
Merge and Sort
Open two files and create one target file. Read a line from each of them and write the smaller of the two to the target file.
Keep doing that until both lines are null.
void MergeSort(string fileOne, string fileTwo)
{
    string result = CreateUniqueFilename();
    using (var srOne = new StreamReader(fileOne, Encoding.UTF8))
    {
        using (var srTwo = new StreamReader(fileTwo, Encoding.UTF8))
        {
            // I left the actual field parsing as an exercise for the reader
            string lineOne, lineTwo; // fieldOne, fieldTwo;
            using (var target = new StreamWriter(result))
            {
                lineOne = srOne.ReadLine();
                lineTwo = srTwo.ReadLine();
                // naive field parsing
                // fieldOne = lineOne.Split(';')[4];
                // fieldTwo = lineTwo.Split(';')[4];
                while (
                    !String.IsNullOrEmpty(lineOne) ||
                    !String.IsNullOrEmpty(lineTwo))
                {
                    // use your parsed field values here
                    if (lineOne != null && (lineTwo == null || lineOne.CompareTo(lineTwo) < 0))
                    {
                        target.WriteLine(lineOne);
                        lineOne = srOne.ReadLine();
                        // fieldOne = lineOne.Split(';')[4];
                    }
                    else
                    {
                        if (lineTwo != null)
                        {
                            target.WriteLine(lineTwo);
                            lineTwo = srTwo.ReadLine();
                            // fieldTwo = lineTwo.Split(';')[4];
                        }
                    }
                }
            }
        }
    }
    // all is processed, remove the input files.
    File.Delete(fileOne);
    File.Delete(fileTwo);
}
Helper variable and method
There is one shared member for the temporary directory and a method for generating temporary unique filenames.
private string sortdir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
string CreateUniqueFilename()
{
    return Path.Combine(sortdir, Guid.NewGuid().ToString("N"));
}
Memory analysis
I've created a small file with 5000 lines in it with the following code:
using (var sw = new StreamWriter("c:\\temp\\test1.txt"))
{
    for (int line = 0; line < 5000; line++)
    {
        sw.WriteLine(Guid.NewGuid().ToString());
    }
}
I then ran the sorting code with a memory profiler. This is what the summary looked like on my box with Windows 10, 4 GB RAM, and a spinning disk.
The object lifetime view shows, as expected, a lot of String, char[], and byte[] allocations, but none of them survived a Gen 0 collection, which means they are all short-lived, and I don't expect this to be a problem if the number of lines to sort increases.
This is the simplest solution that works for me. From here, easy alterations and improvements are possible, leading to even less memory consumption, fewer allocations, or higher speed. Make sure to measure, select the area where you can make the biggest impact, and compare successive results. That should give you the optimum between memory usage and performance.
Instead of reading the CSV completely, you can simply index it:
Read the unsorted CSV line by line and remember the 5th element (column) value and something to identify the line later: the line number, or the offset of the line from the beginning of the file and its size.
You will have some kind of List<Tuple<string, ...>>. Sort it:
var sortedList = unsortedList.OrderBy(item => item.Item1);
Now you can create the sorted CSV by enumerating the sorted list, reading the corresponding line from the source file, and appending it to the new CSV:
using (var sortedCSV = File.AppendText(newCSVFileName))
    foreach (var item in sortedList)
    {
        ... // read line from unsorted csv using item.Item2, etc.
        sortedCSV.WriteLine(...);
    }
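A fuller sketch of that approach (my own filling-in, untested; it assumes UTF-8 without a BOM, "\r\n" line endings, comma-separated fields, lines shorter than the buffer, and the hypothetical names unsortedCSVFileName and newCSVFileName):

// Pass 1: build the index: (sort key, byte offset, byte length) per line.
var index = new List<Tuple<string, long, int>>();
long offset = 0;
foreach (var line in File.ReadLines(unsortedCSVFileName))
{
    int byteLength = Encoding.UTF8.GetByteCount(line);
    index.Add(Tuple.Create(line.Split(',')[4], offset, byteLength));
    offset += byteLength + 2; // + 2 bytes for the "\r\n" line ending
}

// Sort the small index instead of the file.
var sortedList = index.OrderBy(item => item.Item1).ToList();

// Pass 2: seek to each line in sorted order and copy it to the new CSV.
var buffer = new byte[1024 * 1024]; // must hold the longest line
using (var source = File.OpenRead(unsortedCSVFileName))
using (var sortedCSV = File.AppendText(newCSVFileName))
    foreach (var item in sortedList)
    {
        source.Seek(item.Item2, SeekOrigin.Begin);
        int total = 0;
        while (total < item.Item3)
        {
            int read = source.Read(buffer, total, item.Item3 - total);
            if (read <= 0) throw new IOException("Unexpected end of file");
            total += read;
        }
        sortedCSV.WriteLine(Encoding.UTF8.GetString(buffer, 0, item.Item3));
    }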
I have used the below code to split the string, but it takes a lot of time.
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
    string strSegmentData = "";
    string line = srSegmentData.ReadToEnd();
    int startPos = 0;
    ArrayList alSegments = new ArrayList();
    while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
    {
        strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine;
        alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine);
        startPos = startPos + segmentSize;
    }
}
Please suggest an alternative way to split the string into smaller chunks of a fixed size.
First of all you should define what you mean by chunk size. If you mean chunks with a fixed number of code units, then your current algorithm may be slow, but it works. If that's not what you intend and you actually mean chunks with a fixed number of characters, then it's broken. I discussed a similar issue in the Code Review post Split a string into chunks of the same length, so I will repeat only the relevant parts here.
You're partitioning over Char, but String is UTF-16 encoded, so you may produce broken strings in at least three cases:
One character is encoded with more than one code unit. The Unicode code point for that character is encoded as two UTF-16 code units (a surrogate pair), and each code unit may end up in a different slice, leaving both slices invalid.
One character is composed of more than one code point. You may be dealing with a character made of two separate Unicode code points (for example the Han character 𠀑).
One character has combining characters or modifiers. This is more common than you may think: for example, a Unicode combining character like U+0300 COMBINING GRAVE ACCENT used to build à, or a Unicode modifier such as U+02BC MODIFIER LETTER APOSTROPHE.
The definition of a character for a programming language and for a human being can be pretty different. For example, in Slovak dž is a single character; however, it's made of 2/3 Unicode code points, which are in this case also 2/3 UTF-16 code units, so "dž".Length > 1. More about this and other cultural issues in How can I perform a Unicode aware character by character comparison?.
Ligatures exist. Assuming one ligature is one code point (and also assuming it's encoded as one code unit), you will treat it as a single glyph even though it represents two characters. What to do in this case? In general the definition of character can be pretty vague, because it has a different meaning according to the discipline where the word is used. You (probably) can't handle everything correctly, but you should set some constraints and document the code's behavior.
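A quick illustration of the gap between code units and user-perceived characters (a sketch of mine; StringInfo lives in System.Globalization):

Console.WriteLine("\U00020011".Length);  // 2: one code point outside the BMP is two UTF-16 code units
Console.WriteLine("a\u0300".Length);     // 2: 'a' plus U+0300 COMBINING GRAVE ACCENT
Console.WriteLine(new StringInfo("a\u0300").LengthInTextElements); // 1: one user-perceived character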
One proposed (and untested) implementation may be this:
public static IEnumerable<string> Split(this string value, int desiredLength)
{
    // TextElementEnumerator (System.Globalization) walks user-perceived characters;
    // tracking hasNext ourselves ensures no text element is skipped between chunks.
    var characters = StringInfo.GetTextElementEnumerator(value);
    bool hasNext = characters.MoveNext();
    while (hasNext)
    {
        var chunk = new StringBuilder();
        for (int i = 0; i < desiredLength && hasNext; ++i)
        {
            chunk.Append((string)characters.Current);
            hasNext = characters.MoveNext();
        }
        yield return chunk.ToString();
    }
}
It's not optimized for speed (as you can see I tried to keep the code short and clear using enumerations) but, for big files, it still performs better than your implementation (see the next paragraph for the reason).
About your code, note that:
You're building a huge ArrayList (?!) to hold the result. Also note that this way you resize the ArrayList multiple times (even though, given the input size and chunk size, its final size is known).
strSegmentData is rebuilt multiple times; if you need to accumulate characters you must use StringBuilder, otherwise each operation will allocate a new string and copy the old value (it's slow and it also adds pressure on the Garbage Collector). See the short sketch after these notes.
There are faster implementations (see the linked Code Review post, especially Heslacher's implementation, for a much faster version), and if you do not need to handle Unicode correctly (you're sure you manage only US-ASCII characters) then there is also a pretty readable implementation from Jon Skeet (note that, after profiling your code, you may still improve its performance for big files by pre-allocating an output list of the right size). I won't repeat their code here, so please refer to the linked posts.
In your specific case you do not need to read the entire huge file into memory; you can read/parse n characters at a time (don't worry too much about disk access, I/O is buffered). It will slightly degrade performance but it will greatly improve memory usage. Alternatively you can read line by line (managing to handle cross-line chunks).
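To make the StringBuilder point concrete (a trivial sketch of mine; chunks stands for any sequence of strings):

// Quadratic: each += copies everything accumulated so far into a new string.
string s = "";
foreach (var chunk in chunks)
    s += chunk + Environment.NewLine;

// Linear: StringBuilder appends into a growing buffer instead.
var sb = new StringBuilder();
foreach (var chunk in chunks)
    sb.AppendLine(chunk);
string result = sb.ToString();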
Below is my analysis of your question and code (read the comments)
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
    string strSegmentData = "";
    string line = srSegmentData.ReadToEnd(); // Why are you reading this till the end if it is such a long string?
    int startPos = 0;
    ArrayList alSegments = new ArrayList(); // A better choice would be List<string>
    while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
    {
        strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine; // Seems like you are inserting line breaks at a specified interval in your original string. Is that what you want?
        alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine); // Why are you recalculating the Substring? Why are you appending the newline if the aim is to just "split"?
        startPos = startPos + segmentSize;
    }
}
Making all kinds of assumptions, below is the code I would recommend for splitting a long string. It is just a clean way of doing what you are doing in the sample. You can optimize this further, but I am not sure how fast you need it to be.
static void Main(string[] args) {
    string fileNamePath = "ConsoleApplication1.pdb";
    var segmentSize = 32;
    var op = ReadSplit(fileNamePath, segmentSize);
    var joinedString = string.Join(Environment.NewLine, op);
}

static List<string> ReadSplit(string filePath, int segmentSize) {
    var splitOutput = new List<string>();
    using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024)) {
        char[] buffer = new char[segmentSize];
        while (!file.EndOfStream) {
            int n = file.ReadBlock(buffer, 0, segmentSize);
            splitOutput.Add(new string(buffer, 0, n));
        }
    }
    return splitOutput;
}
I haven't done any performance tests on my version, but my guess is that it is faster than your version.
Also, I am not sure how you plan to consume the output, but a good optimization when doing I/O is to use async calls (a sketch follows the notes below). And a good optimization (at the cost of readability and complexity) when handling large strings is to stick with char[].
Note that:
You might have to deal with character-encoding issues while reading the file.
If you already have the long string in memory and file reading was just included in the demo, then you should use the StringReader class instead of the StreamReader class.
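An async variant might look like this (a sketch of mine, untested; same assumptions as ReadSplit above, plus System.Threading.Tasks):

static async Task<List<string>> ReadSplitAsync(string filePath, int segmentSize) {
    var splitOutput = new List<string>();
    using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024)) {
        char[] buffer = new char[segmentSize];
        while (!file.EndOfStream) {
            // ReadBlockAsync frees the calling thread while the next chunk loads
            int n = await file.ReadBlockAsync(buffer, 0, segmentSize);
            splitOutput.Add(new string(buffer, 0, n));
        }
    }
    return splitOutput;
}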
I am working on a parser that is intended to read in data in fixed-width format (8 chars x 10 cols). However, sometimes this isn't the case, and there is sometimes valid data in the areas that do not meet this layout. It is not safe to assume that there is an escape character (such as the + in the figure below), as that is only one of several formats.
I had attempted using TextFieldParser.FixedWidth and giving it an 8x10 input, but anything that does not meet this layout is sent to ErrorLine instead.
It doesn't seem like it would be good practice to parse from my exception-catching block, is it?
Since only discrepant lines require additional work, is a brute-force submethod the best approach (predicated on #1 being OK to do)? All of my data always comes in 8-char blocks. The final block in a line can be tricky in that it may be shorter if it was manually entered.
Is there a better tool to be using? I feel like I'm trying to fit a square peg in a round hole with a fixed-width TextFieldParser.
Note: delimited parsing is not an option; see the 2nd figure.
Edit for clarification: the text below is a pair of excerpts of input decks for NASTRAN, a finite element code. I am aiming to have a generalized parsing method that will read the text files in and then hand off the split-up string[]s to other methods that process each card into a specific mapped object (e.g. in the image below, the two object types are RBE3 and SET1).
Extracted Method:
public static IEnumerable<string[]> ParseFixed(string fileName, int width, int colCount)
{
    var fieldArrayList = new List<string[]>();
    using (var tfp = new TextFieldParser(fileName))
    {
        tfp.TextFieldType = FieldType.FixedWidth;
        var fieldWidths = new int[colCount];
        for (int i = 0; i < fieldWidths.Length; i++)
        {
            fieldWidths[i] = width;
        }
        tfp.CommentTokens = new string[] { "$" };
        tfp.FieldWidths = fieldWidths;
        tfp.TrimWhiteSpace = true;
        while (!tfp.EndOfData)
        {
            try
            {
                fieldArrayList.Add(tfp.ReadFields());
            }
            catch (Microsoft.VisualBasic.FileIO.MalformedLineException ex)
            {
                Debug.WriteLine(ex.ToString());
                // parse atypical lines here...?
                continue;
            }
        }
    }
    return fieldArrayList;
}
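One possible shape for the brute-force fallback mentioned in the question (a hypothetical sketch of mine, not an established TextFieldParser pattern): inside the catch block, slice the raw tfp.ErrorLine into width-sized fields by hand, allowing the final field to be shorter.

// Hypothetical helper: splits a raw line into fixed-width fields,
// tolerating a short final field (e.g. from manual entry).
static string[] SplitFixedWidth(string line, int width)
{
    var fields = new List<string>();
    for (int pos = 0; pos < line.Length; pos += width)
    {
        int len = Math.Min(width, line.Length - pos);
        fields.Add(line.Substring(pos, len).Trim());
    }
    return fields.ToArray();
}

// In the catch block above, instead of only logging:
// fieldArrayList.Add(SplitFixedWidth(tfp.ErrorLine, width));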
I am reading in multiple files with millions of lines, and I am creating a list of all line numbers that have a specific issue, for example if a specific field is left blank or contains an invalid value.
So my question is: what would be the most efficient data type to keep track of a list of numbers that could run upwards of a million rows? Would using StringBuilder, List, or something else be more efficient?
My end goal is to output a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51", etc. So in the case of a StringBuilder, I would check the previous value, and if it is only 1 more I would change it from 1 to 1-2; if it was more than one more, I would separate it with a comma. With the List, I would just add each number to the list and then combine them once the file has been completely read. However, in this case I could have multiple lists containing millions of numbers.
Here is the current code I am using to combine a list of numbers using StringBuilder:
string currentLine = sbCurrentLineNumbers.ToString();
string currentLineSub;
StringBuilder subCurrentLine = new StringBuilder();
StringBuilder subCurrentLineSub = new StringBuilder();
int indexLastSpace = currentLine.LastIndexOf(' ');
int indexLastDash = currentLine.LastIndexOf('-');
int currentStringInt = 0;
if (sbCurrentLineNumbers.Length == 0)
{
    sbCurrentLineNumbers.Append(lineCount);
}
else if (indexLastSpace == -1 && indexLastDash == -1)
{
    currentStringInt = Convert.ToInt32(currentLine);
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Append("-" + lineCount);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
else if (indexLastSpace > indexLastDash)
{
    currentLineSub = currentLine.Substring(indexLastSpace);
    currentStringInt = Convert.ToInt32(currentLineSub);
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Append("-" + lineCount);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
else if (indexLastSpace < indexLastDash)
{
    currentLineSub = currentLine.Substring(indexLastDash + 1);
    currentStringInt = Convert.ToInt32(currentLineSub);
    string charOld = currentLineSub;
    string charNew = lineCount.ToString();
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Replace(charOld, charNew);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
My end goal is to output a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51".
If that's the end goal, no point in going through an intermediary representation such as a List<int> - just go with a StringBuilder. You will save on memory and CPU that way.
StringBuilder serves your purpose, so stick with that; if you ever need the line numbers you can easily change the code later.
It depends on how you can / want to break the code up.
Given you are reading it in line order, I'm not sure you need a list at all.
Your current desired output implies that you can't output anything until the file is completely scanned. The size of the file suggests a one-pass analysis phase would be a good idea as well, given you are going to use buffered input as opposed to reading the entire thing into memory.
I'd be tempted to use an enum to describe the issue, e.g. Field??? is blank, and then use that as the key to a dictionary of StringBuilders, as sketched below.
As a first thought, anyway.
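A sketch of that shape (my own illustration; the enum values are made up):

enum Issue { FieldBlank, InvalidValue } // made-up issue kinds

class IssueCollector
{
    private readonly Dictionary<Issue, StringBuilder> issues = new Dictionary<Issue, StringBuilder>();

    public void Report(Issue issue, int lineNumber)
    {
        StringBuilder sb;
        if (!issues.TryGetValue(issue, out sb))
            issues[issue] = sb = new StringBuilder();
        if (sb.Length > 0) sb.Append(", ");
        sb.Append(lineNumber); // range compression ("1-32") could be folded in here
    }
}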
Is your output supposed to be human readable? If so, you'll hit the limit of what is reasonable to read, long before you have any performance/memory issues from your data structure. Use whatever is easiest for you to work with.
If the output is supposed to be machine readable, then that output might suggest an appropriate data structure.
As others have pointed out, I would probably use StringBuilder. The List may have to resize many times; the new implementation of StringBuilder does not have to resize.
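For the range-compressed output itself, a simpler tactic than re-parsing the builder's tail with LastIndexOf (as the question's code does) is to track the current run in two ints. A minimal sketch of mine, assuming line numbers arrive in ascending order:

class RangeBuilder
{
    private readonly StringBuilder sb = new StringBuilder();
    private int runStart = -1, runEnd = -1;

    public void Add(int line)
    {
        if (runStart == -1) { runStart = runEnd = line; return; }
        if (line == runEnd + 1) { runEnd = line; return; }
        Flush();
        runStart = runEnd = line;
    }

    public override string ToString()
    {
        if (runStart != -1) { Flush(); runStart = -1; } // finalize the pending run
        return sb.ToString();
    }

    private void Flush()
    {
        if (sb.Length > 0) sb.Append(", ");
        sb.Append(runStart);
        if (runEnd > runStart) sb.Append('-').Append(runEnd);
    }
}

// Usage: rb.Add(1); rb.Add(2); ... then: "Specific field is blank on " + rb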
I'm reading in a text file using BinaryReader, then doing what I want with it (stripping characters, etc.), then writing it out using BinaryWriter.
Nice and simple.
One of the things I need to do before I strip anything is to:
Check that the number of characters in the file is even (obviously file.Length % 2), and
If the length is even, check that the first character of every two-character pair is a zero.
For example:
0, 10, 0, 20, 0, 30, 0, 40.
I need to verify that every second character is a zero.
Any ideas? Some sort of clever for loop?
OKAY!
I need to be a lot clearer about what I'm doing. I have file.txt that contains 'records'. Let's just say it's a comma-delimited file. Now, what my program needs to do is read through this file, byte by byte, and strip all of the characters we don't want. I have done that. But some of the files that will be going through this program will be single-byte, and some will be double-byte. I need to deal with both of these possibilities, but first I need to figure out whether the file is single- or double-byte.
Now, obviously if the file is double-byte:
The file length will be divisible by 2, and
The first byte of every character pair will be a zero.
and THAT'S why I need to do this.
I hope this clears some stuff up.
UPDATE!
I'm just going to have a boolean in the arguments: is16Bit. Thanks for your help, guys! I would have rather deleted the question, but it won't let me.
Something like this in a static class:
public static IEnumerable<T> EveryOther<T>(this IEnumerable<T> list)
{
    bool send = true;
    foreach (var item in list)
    {
        if (send) yield return item;
        send = !send;
    }
}
and then (using the namespace of the previous class)
bool everyOtherIsZero = theBytes.EveryOther().All(c => c == 0);
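For example (a sketch of mine; it assumes the file is small enough to read fully into memory and that System.Linq is in scope):

byte[] theBytes = File.ReadAllBytes("file.txt");
bool isDoubleByte = theBytes.Length % 2 == 0 && theBytes.EveryOther().All(b => b == 0);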
string[] foo = fileText.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries); // fileText: the file's contents
for (int i = 0; i < foo.Length; i += 2)
{
    if (foo[i] != "0")
        return false;
}
How about this:
string content = File.ReadAllText(@"c:\test.txt");
if (content.Length % 2 != 0)
    throw new Exception("not even");
for (int i = 0; i < content.Length; i += 2)
    if (content[i] != '0')
        throw new Exception("no zero found");