Search a string from 500k entries in txt - c#

I have a .txt file which has about 500k entries, each separated by new line. The file size is about 13MB and the format of each line is the following:
SomeText<tab>Value<tab>AnotherValue<tab>
My problem is to find a certain "string" with the input from the program, from the first column in the file, and get the corresponding Value and AnotherValue from the two columns.
The first column is not sorted, but the second and third column values in the file are actually sorted. But, this sorting is of no good use to me.
The file is static and does not change. I was thinking to use the Regex.IsMatch() here but I am not sure if that's the best approach here to go line by line.
If the lookup time would increase drastically, I could probably go for rearranging the first column (and hence un-sorting the second & third column). Any suggestions on how to implement this approach or the above approach if required?
After locating the string, how should I fetch those two column values?
EDIT
I realized that there will be quite a bit of searches in the file for atleast oe request by the user. If I have an array of values to be found, how can I return some kind of dictionary having a corresponding values of found matches?

Maybe with this code:
var myLine = File.ReadAllLines()
.Select(line => line.Split(new [] {' ', '\t'}, SplitStringOptions.RemoveEmptyEntries)
.Single(s => s[0] == "string to find");
myLine is an array of strings that represents a row. You may also use .AsParallel() extension method for better performance.

How many times do you need to do this search?
Is the cost of some pre-processing on startup worth it if you save time on each search?
Is loading all the data into memory at startup feasible?
Parse the file into objects and stick the results into a hashtable?
I don't think Regex will help you more than any of the standard string options. You are looking for a fixed string value, not a pattern, but I stand to be corrected on that.
Update
Presuming that the "SomeText" is unique, you can use a dictionary like this
Data represents the values coming in from the file.
MyData is a class to hold them in memory.
public IEnumerable<string> Data = new List<string>() {
"Text1\tValue1\tAnotherValue1\t",
"Text2\tValue2\tAnotherValue2\t",
"Text3\tValue3\tAnotherValue3\t",
"Text4\tValue4\tAnotherValue4\t",
"Text5\tValue5\tAnotherValue5\t",
"Text6\tValue6\tAnotherValue6\t",
"Text7\tValue7\tAnotherValue7\t",
"Text8\tValue8\tAnotherValue8\t"
};
public class MyData {
public String SomeText { get; set; }
public String Value { get; set; }
public String AnotherValue { get; set; }
}
[TestMethod]
public void ParseAndFind() {
var dictionary = Data.Select(line =>
{
var pieces = line.Split(new char[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
return new MyData {
SomeText = pieces[0],
Value = pieces[1],
AnotherValue = pieces[2],
};
}).ToDictionary<MyData, string>(dat =>dat.SomeText);
Assert.AreEqual("AnotherValue3", dictionary["Text3"].AnotherValue);
Assert.AreEqual("Value7", dictionary["Text7"].Value);
}
hth,
Alan

var firstFoundLine = File.ReadLines("filename").FirstOrDefault(s => s.StartsWith("string"));
if (firstFoundLine != "")
{
char yourColumnDelimiter = '\t';
var columnValues = firstFoundLine.Split(new []{yourColumnDelimiter});
var secondColumn = columnValues[1];
var thirdColumns = columnValues[2];
}
File.ReadLines is better than File.RealAllLines because you won't need to read the whole file -- only until matching string is found http://msdn.microsoft.com/en-us/library/dd383503.aspx

Parse this monstrosity into some sort of database.
SQL Server/MySQL would be preferable, but if you can't use them for various reasons, SQLite or even Access or Excel could work.
Doing that a single time is not hard.
After you are done with that, searching will become easy and fast.

GetLines(inputPath).FirstOrDefault(p=>p.Split(",")[0]=="SearchText")
private static IEnumerable<string> GetLines(string inputFile)
{
string filePath = Path.Combine(Directory.GetCurrentDirectory(),inputFile);
return File.ReadLines(filePath);
}

Related

Custom Class to CSV

I have a requirement to output some of our ERP data to a very specific csv format with an exact number of fields per record. Most of which we won't be providing at this time (Or have default values). To support future changes, I decided to write out the CSV format into a custom class of strings (All are strings) and readonly each of the strings we are not currently utilizing and default in the values that should go into those, most are String.Empty. So the Class looks something like this:
private class CustomClass
{
public string field1 = String.Empty;
public readonly string field2 = String.Empty; //Not going to be used
public string field3 = String.Empty;
public readonly string field4 = "N/A"; //Not going to be used
...
}
Now, after I populate the used fields, I need to take this data and export a specifically formatted comma delimited string. So using other posts on StackOverflow I came up with the following function to add to the class:
public string ToCsvFields()
{
StringBuilder sb = new StringBuilder();
foreach (var f in typeof(CustomClass).GetFields())
{
if (sb.Length > 0)
sb.Append(",");
var x = f.GetValue(this);
if (x != null)
sb.Append("\"" + x.ToString() + "\"");
}
return sb.ToString();
}
This works and gives me the exact CSV output I need for each Line when I call CustomClass.ToCsvFields(), and makes it pretty easy to maintain if the consumer of the CSV changes their column definition. But this line in-particular makes me feel like something could go wrong with Production code: var x = f.GetValue(this);
I understand what it is doing, but I generally shy away from "this" in my code; am I just being paranoid and this is totally acceptable code for this purpose?

Read Specific Strings from Text File

I'm trying to get certain strings out of a text file and put it in a variable.
This is what the structure of the text file looks like keep in mind this is just one line and each line looks like this and is separated by a blank line:
Date: 8/12/2013 12:00:00 AM Source Path: \\build\PM\11.0.64.1\build.11.0.64.1.FileServerOutput.zip Destination Path: C:\Users\Documents\.NET Development\testing\11.0.64.1\build.11.0.55.5.FileServerOutput.zip Folder Updated: 11.0.64.1 File Copied: build.11.0.55.5.FileServerOutput.zip
I wasn't entirely too sure of what to use for a delimiter for this text file or even if I should be using a delimiter so it could be subjected to change.
So just a quick example of what I want to happen with this, is I want to go through and grab the Destination Path and store it in a variable such as strDestPath.
Overall the code I came up with so far is this:
//find the variables from the text file
string[] lines = File.ReadAllLines(GlobalVars.strLogPath);
Yeah not much, but I thought perhaps if I just read one line at at a time and tried to search for what I was looking for through that line but honestly I'm not 100% sure if I should stick with that way or not...
If you are skeptical about how large your file is, you should come up using ReadLines which is deferred execution instead of ReadAllLines:
var lines = File.ReadLines(GlobalVars.strLogPath);
The ReadLines and ReadAllLines methods differ as follows:
When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
As weird as it might sound, you should take a look to log parser. If you are free to set the file format you could use one that fits with log parser and, believe me, it will make your life a lot more easy.
Once you load the file with log parse you can user queries to get the information you want. If you don't care about using interop in your project you can even add a com reference and use it from any .net project.
This sample reads a HUGE csv file a makes a bulkcopy to the DB to perform there the final steps. This is not really your case, but shows you how easy is to do this with logparser
COMTSVInputContextClass logParserTsv = new COMTSVInputContextClass();
COMSQLOutputContextClass logParserSql = new COMSQLOutputContextClass();
logParserTsv.separator = ";";
logParserTsv.fixedSep = true;
logParserSql.database = _sqlDatabaseName;
logParserSql.server = _sqlServerName;
logParserSql.username = _sqlUser;
logParserSql.password = _sqlPass;
logParserSql.createTable = false;
logParserSql.ignoreIdCols = true;
// query shortened for clarity purposes
string SelectPattern = #"Select TO_STRING(UserName),TO_STRING(UserID) INTO {0} From {1}";
string query = string.Format(SelectPattern, _sqlTable, _csvPath);
logParser.ExecuteBatch(query, logParserTsv, logParserSql);
LogParser in one of those hidden gems Microsoft has and most people don't know about. I have use to read iis logs, CSV files, txt files, etc. You can even generate graphics!!!
Just check it here http://support.microsoft.com/kb/910447/en
Looks like you need to create a Tokenizer. Try something like this:
Define a list of token values:
List<string> gTkList = new List<string>() {"Date:","Source Path:" }; //...etc.
Create a Token class:
public class Token
{
private readonly string _tokenText;
private string _val;
private int _begin, _end;
public Token(string tk, int beg, int end)
{
this._tokenText = tk;
this._begin = beg;
this._end = end;
this._val = String.Empty;
}
public string TokenText
{
get{ return _tokenText; }
}
public string Value
{
get { return _val; }
set { _val = value; }
}
public int IdxBegin
{
get { return _begin; }
}
public int IdxEnd
{
get { return _end; }
}
}
Create a method to Find your Tokens:
List<Token> FindTokens(string str)
{
List<Token> retVal = new List<Token>();
if (!String.IsNullOrWhitespace(str))
{
foreach(string cd in gTkList)
{
int fIdx = str.IndexOf(cd);
if(fIdx > -1)
retVal.Add(cd,fIdx,fIdx + cd.Length);
}
}
return retVal;
}
Then just do something like this:
foreach(string ln in lines)
{
//returns ordered list of tokens
var tkns = FindTokens(ln);
for(int i=0; i < tkns.Length; i++)
{
int len = (i == tkns.Length - 1) ? ln.Length - tkns[i].IdxEnd : tkns[i+1].IdxBegin - tkns[i].IdxEnd;
tkns[i].value = ln.Substring(tkns[i].IdxEnd+1,len).Trim();
}
//Do something with the gathered values
foreach(Token tk in tkns)
{
//stuff
}
}

how to validate and return data if we have stored data in .txt file?

I have a notepad (.txt) file which has three fields viz ID Name and Location. I want to read this data in C# using streamreader and check for the condition whether ID is there or not in a file. If yes i should get that row as output else error.
suppose i have following txt file with fields as
00125 JAMES LONDON
00127 STARK USA
00128 ARNOLD AUSTRALIA
NOW, i should ask the user to enter there ID. If Id matches then i should get that particular row as output. E.g. if user enters 00127 then i should get output as
00127 JAMES LONDON
I know this would have been very simple had the data was stored in database. But what if data is stored in .txt file.
Thanks in Advance
Simplest solution (assume you have fixed format of ID - five digits):
var users = File.ReadAllLines("data.txt")
.ToDictionary(line => line.Substring(0, 5));
That creates dictionary with lines as values and ids as keys. Usage:
string line = users["00125"]; // 00125 JAMES LONDON
That was simplest solution. But actually, I'd introduce some class like:
public class User
{
public int Id { get; set; }
public string Name { get; set; }
public string Location { get; set; }
public static User Parse(string s)
{
var parts = s.Split(new []{' '}, StringSplitOptions.RemoveEmptyEntries);
return new User {
Id = Int32.Parse(parts[0]),
Name = parts[1],
Location = parts[2]
};
}
public override string ToString()
{
return String.Format("{0:00000} {1} {2}", Id, Name, Location);
}
}
Then parse each line and put users to dictionary of type Dictionary<int, User>. That will make your code strongly typed and easier to maintain:
var users = File.ReadAllLines("data.txt")
.Select(line => User.Parse(line))
.ToDictionary(u => u.Id);
var user = users[127];
string name = user.Name; // STARK
Console.WriteLine(user); // 00127 STARK USA
You can use the StreamReader to read the contents from a file.
Then, you can split the String using Split.
With that, you'd be able to add all the items in a two dimensional array. Then the array would look like {{00125, JAMES, LONDON}, {00127, STARK, USA}, {00128, ARNOLD, AUSTRALIA}}. Using that, you can compare the input to the first entry of every array, and return the data accordingly.
That should give you a clear view on which way to go from now.
There are several options:
Read the bloody file every time you get a request. This will work fine if the file is small and requests aren't frequent. You read line by line, and compare id to the user's id. Really fast to implement, but dead dumb in resource utilization. Run time complexity is O(N) - you need to scan the whole file in worst case.
You can modify first option and add a layer of caching. If you get frequent queries for the same id, you will get a nice performance boost. Run time complexity is still O(N) - you need to scan the whole file in worst case. And this is still far away from being fast.
You can also try and sort the file ids once. With best algorithms this will take O(N x Log(N) time, the time you spend once preprocessing the data. However, any read query could be executed in O(Log(N)) using binary search based on id.
If you are looking for code snippets, I would search for StreamReader, TextReader, StreamWriter, binary search, quick sort. There are plenty samples in the wild...

How to format and read CSV file?

Here is just an example of the data I need to format.
The first column is simple, the problem the second column.
What would be the best approach to format multiple data fields in one column?
How to parse this data?
Important*: The second column needs to contain multiple values, like in an example below
Name Details
Alex Age:25
Height:6
Hair:Brown
Eyes:Hazel
A csv should probably look like this:
Name,Age,Height,Hair,Eyes
Alex,25,6,Brown,Hazel
Each cell should be separated by exactly one comma from its neighbor.
You can reformat it as such by using a simple regex which replaces certain newline and non-newline whitespace with commas (you can easily find each block because it has values in both columns).
A CSV file is normally defined using commas as field separators and CR for a row separator. You are using CR within your second column, this will cause problems. You'll need to reformat your second column to use some other form of separator between multiple values. A common alternate separator is the | (pipe) character.
Your format would then look like:
Alex,Age:25|Height:6|Hair:Brown|Eyes:Hazel
In your parsing, you would first parse the comma separated fields (which would return two values), and then parse the second field as pipe separated.
This is an interesting one - it can be quite difficult to parse specific format files which is why people often write specific classes to deal with them. More conventional file formats like CSV, or other delimited formats are [more] easy to read because they are formatted in a similar way.
A problem like the above can be addressed in the following way:
1) What should the output look like?
In your instance, and this is just a guess, but I believe you are aiming for the following:
Name, Age, Height, Hair, Eyes
Alex, 25, 6, Brown, Hazel
In which case, you have to parse out this information based on the structure above. If it's repeated blocks of text like the above then we can say the following:
a. Every person is in a block starting with Name Details
b. The name value is the first text after Details, with the other columns being delimited in the format Column:Value
However, you might also have sections with addtional attributes, or attributes that are missing if the original input was optional, so tracking the column and ordinal would be useful too.
So one approach might look like the following:
public void ParseFile(){
String currentLine;
bool newSection = false;
//Store the column names and ordinal position here.
List<String> nameOrdinals = new List<String>();
nameOrdinals.Add("Name"); //IndexOf == 0
Dictionary<Int32, List<String>> nameValues = new Dictionary<Int32 ,List<string>>(); //Use this to store each person's details
Int32 rowNumber = 0;
using (TextReader reader = File.OpenText("D:\\temp\\test.txt"))
{
while ((currentLine = reader.ReadLine()) != null) //This will read the file one row at a time until there are no more rows to read
{
string[] lineSegments = currentLine.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
if (lineSegments.Length == 2 && String.Compare(lineSegments[0], "Name", StringComparison.InvariantCultureIgnoreCase) == 0
&& String.Compare(lineSegments[1], "Details", StringComparison.InvariantCultureIgnoreCase) == 0) //Looking for a Name Details Line - Start of a new section
{
rowNumber++;
newSection = true;
continue;
}
if (newSection && lineSegments.Length > 1) //We can start adding a new person's details - we know that
{
nameValues.Add(rowNumber, new List<String>());
nameValues[rowNumber].Insert(nameOrdinals.IndexOf("Name"), lineSegments[0]);
//Get the first column:value item
ParseColonSeparatedItem(lineSegments[1], nameOrdinals, nameValues, rowNumber);
newSection = false;
continue;
}
if (lineSegments.Length > 0 && lineSegments[0] != String.Empty) //Ignore empty lines
{
ParseColonSeparatedItem(lineSegments[0], nameOrdinals, nameValues, rowNumber);
}
}
}
//At this point we should have collected a big list of items. We can then write out the CSV. We can use a StringBuilder for now, although your requirements will
//be dependent upon how big the source files are.
//Write out the columns
StringBuilder builder = new StringBuilder();
for (int i = 0; i < nameOrdinals.Count; i++)
{
if(i == nameOrdinals.Count - 1)
{
builder.Append(nameOrdinals[i]);
}
else
{
builder.AppendFormat("{0},", nameOrdinals[i]);
}
}
builder.Append(Environment.NewLine);
foreach (int key in nameValues.Keys)
{
List<String> values = nameValues[key];
for (int i = 0; i < values.Count; i++)
{
if (i == values.Count - 1)
{
builder.Append(values[i]);
}
else
{
builder.AppendFormat("{0},", values[i]);
}
}
builder.Append(Environment.NewLine);
}
//At this point you now have a StringBuilder containing the CSV data you can write to a file or similar
}
private void ParseColonSeparatedItem(string textToSeparate, List<String> columns, Dictionary<Int32, List<String>> outputStorage, int outputKey)
{
if (String.IsNullOrWhiteSpace(textToSeparate)) { return; }
string[] colVals = textToSeparate.Split(new[] { ":" }, StringSplitOptions.RemoveEmptyEntries);
List<String> outputValues = outputStorage[outputKey];
if (!columns.Contains(colVals[0]))
{
//Add the column to the list of expected columns. The index of the column determines it's index in the output
columns.Add(colVals[0]);
}
if (outputValues.Count < columns.Count)
{
outputValues.Add(colVals[1]);
}
else
{
outputStorage[outputKey].Insert(columns.IndexOf(colVals[0]), colVals[1]); //We append the value to the list at the place where the column index expects it to be. That way we can miss values in certain sections yet still have the expected output
}
}
After running this against your file, the string builder contains:
"Name,Age,Height,Hair,Eyes\r\nAlex,25,6,Brown,Hazel\r\n"
Which matches the above (\r\n is effectively the Windows new line marker)
This approach demonstrates how a custom parser might work - it's purposefully over verbose as there is plenty of refactoring that could take place here, and is just an example.
Improvements would include:
1) This function assumes there are no spaces in the actual text items themselves. This is a pretty big assumption and, if wrong, would require a different approach to parsing out the line segments. However, this only needs to change in one place - as you read a line at a time, you could apply a reg ex, or just read in characters and assume that everything after the first "column:" section is a value, for example.
2) No exception handling
3) Text output is not quoted. You could test each value to see if it's a date or number - if not, wrap it in quotes as then other programs (like Excel) will attempt to preserve the underlying datatypes more effectively.
4) Assumes no column names are repeated. If they are, then you have to check if a column item has already been added, and then create an ColName2 column in the parsing section.

Searching a String for a certain thing, then removing up to a certain point to a list

Im working on an Automatic Downloader of sorts for personal use, and so far I have managed to set up the program to store the source of the link provided into a string, the links to the downloads are written in plain text in the source, So what I need to be able to do, is search a string for say "http://media.website.com/folder/" and have it return all occurences to a list? the problem is though, I also need the unique id given for each file after the /folder/" to be stored with each occurence of the above, Any ideas? Im using Visual C#.
Thanks!!!
Steven
Maybe something like this?
Dictionary<string, string> dictionary = new Dictionary<string, string>();
string searchText = "Text to search here";
string textToFind = "Text to find here";
string fileID = "";
bool finished = false;
int foundIndex = 0;
while (!finished)
{
foundIndex = searchText.IndexOf(textToFind, foundIndex);
if (foundIndex == -1)
{
finished = true;
}
else
{
//get fieID, change to whatever logic makes sense, in this example
//it assumes a 2 character identifier following the search text
fileID = searchText.Substring(foundIndex + searchText.Length, 2);
dictionary.Add(fileID, textToFind);
}
}
use Regex to get the matches, that will give you a list of all the matches. Use wildcards for the numeric value that will differ between matches, so you can parse for it.
I'm not great with Regex, but it'd be something like,
Regex.Match(<your string>,#"(http://media.website.com/folder/)(d+)")
Or
var textToFind = "http://media.website.com/folder/";
var ids = from l in listOfUrls where l.StartsWith(textToFind) select new { RawUrl = l, ID=l.Substring(textToFind.Length)}

Categories