Convert the result of File.ReadAllBytes to a File object without actually writing it - C#

One Module reads some files using
File.ReadAllBytes("path")
and stores the result in a database table.
Later I take the result from the table and normally use
File.WriteAllBytes("outputPath", binary.Data)
to write the file back.
Now I have to change the content of the file. Of course I can write the file to a temp folder, read it back in as a File object, change it, write it back to the destination folder.
But is there a smarter way to do that? Can I create the File object directly out of the binary data?

Got a solution. It's the Encoding class:
var binary = GetBinaryFileFromDatabase();
// decode the stored bytes into text and split into lines
var inputLines = Encoding.Default.GetString(binary).Split('\n');
var outputLines = new List<string>();
foreach (var line in inputLines)
{
    var columns = line.Split(';');
    // replace the file name in column 28 when it is present
    if (columns.Length > 28 && columns[28] != null && columns[28].Length > 0)
    {
        columns[28] = columns[28].Replace(Path.GetFileName(columns[28]), "SomeValue");
    }
    outputLines.Add(string.Join(";", columns));
}
File.WriteAllLines(outputPath, outputLines);
Thanks everybody!
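For completeness, the same round trip also works entirely in memory when the modified content needs to end up as bytes again rather than as a file on disk. This is a minimal sketch, assuming the data is text in the default encoding; SaveBinaryToDatabase is a hypothetical helper, and GetBinaryFileFromDatabase is the method from the snippet above:
// Decode the stored bytes, edit the text, and re-encode it without touching the file system.
var binary = GetBinaryFileFromDatabase();        // byte[] loaded from the table
var text = Encoding.Default.GetString(binary);   // bytes -> string
text = text.Replace("OldValue", "NewValue");     // any in-memory edit
var modified = Encoding.Default.GetBytes(text);  // string -> bytes
SaveBinaryToDatabase(modified);                  // hypothetical; or File.WriteAllBytes(outputPath, modified)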

Related

Sylvan CSV Reader C# Check for Missing Column in CSV

@MarkPflug I have a requirement to read 12 columns out of 45 - 85 total columns. This is from multiple CSV files (in the hundreds). But here is the problem: a lot of the time a column or two will be missing from some of the CSV data files. How do I check in C# for a missing column in a CSV file, given that I use the NuGet package Sylvan CSV reader? Here is some code:
// Create a reader
CsvDataReader reader = CsvDataReader.Create(file, new CsvDataReaderOptions { ResultSetMode = ResultSetMode.MultiResult });
// Get column by name from csv. This is where the error occurs only in the files that have missing columns. I store these and then use them in a GetString(Ordinal).
reader.GetOrdinal("HomeTeam");
reader.GetOrdinal("AwayTeam");
reader.GetOrdinal("Referee");
reader.GetOrdinal("FTHG");
reader.GetOrdinal("FTAG");
reader.GetOrdinal("Division");
// There is more data here, but anyway you get the point.
// Here I run the reader and for each piece of data I run my database write method.
while (await reader.ReadAsync())
{
    await AddEntry(idCounter.ToString(), idCounter.ToString(), attendance, referee, division, date, home_team, away_team, fthg, ftag, hthg, htag, ftr, htr);
}
I tried the following:
// This still causes it to go out of bounds.
if(reader.GetOrdinal("Division") < reader.FieldCount)
// only if the ordinal exists then assign it in a temp variable
else
// skip this column (set the data in add entry method to "")
Looking at the source, it appears that GetOrdinal throws if the column name isn't found or is ambiguous. As such I expect you could do:
int blah1Ord = -1;
try { blah1Ord = reader.GetOrdinal("blah1"); } catch { }
int blah2Ord = -1;
try { blah2Ord = reader.GetOrdinal("blah2"); } catch { }
while (await reader.ReadAsync())
{
    var x = new Whatever();
    if (blah1Ord > -1) x.Blah1 = reader.GetString(blah1Ord);
    if (blah2Ord > -1) x.Blah2 = reader.GetString(blah2Ord);
}
And so on: you effectively sound out whether a column exists (the ordinal remains -1 if it doesn't) and then use that to decide whether to read the column or not.
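If a lot of columns need this treatment, the try/catch can be wrapped in a small helper. This is just a sketch of the same idea (the name TryGetOrdinal is made up here); it relies on the reader deriving from DbDataReader, as Sylvan's CsvDataReader does:
// Returns the column ordinal, or -1 if GetOrdinal throws because the column is missing.
static int TryGetOrdinal(System.Data.Common.DbDataReader reader, string columnName)
{
    try { return reader.GetOrdinal(columnName); }
    catch { return -1; }
}
// Usage:
// int refereeOrd = TryGetOrdinal(reader, "Referee");
// string referee = refereeOrd > -1 ? reader.GetString(refereeOrd) : "";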
Incidentally, I've been dealing with CSVs with poor/misspelled/partial header names, and I've found myself getting the column schema and searching it for partials, like:
using var cdr = CsvDataReader.Create(sr);
var cs = await cdr.GetColumnSchemaAsync();
var sc = StringComparison.OrdinalIgnoreCase;
var blah1Ord = cs.FirstOrDefault(c => c.ColumnName.Contains("blah1", sc))?.ColumnOrdinal ?? -1;
I started using the Sylvan library and it is really powerful.
Not sure if this could help you, but if you use the DataBinder.Create<T> generic method from an entity, you can do the following to get columns in your CSV file that do not map to any of the entity properties:
var dataBinderOptions = new DataBinderOptions()
{
    // AllColumns is required to throw UnboundMemberException
    BindingMode = DataBindingMode.AllColumns,
};
IDataBinder<TEntity> binder;
try
{
    binder = DataBinder.Create<TEntity>(dataReader, dataBinderOptions);
}
catch (UnboundMemberException ex)
{
    // Use ex.UnboundColumns to get the unmapped columns
    readResult.ValidationProblems.Add($"Unmapped columns: {String.Join(", ", ex.UnboundColumns)}");
    return;
}

C# - check which elements in a csv are not in another csv and then write the elements to a third csv

My task is to check which of the elements of a column in one CSV are not included in the elements of a column in the other CSV. There is a country column in both CSVs, and the task is to check which countries are in the first CSV but not in the second.
I guess I have to solve it with lists after I read the strings from the two CSVs, but I don't know how to check which items in the first list are not in the other list and then put them into a third list.
There are many ways to achieve this. For many real-world CSV applications it is helpful to read the CSV input into a typed in-memory store; there are standard libraries that can assist with this, like CsvHelper, as explained in this canonical post: Parsing CSV files in C#, with header
However, for this simple requirement we only need to parse the values for Country from the master list, in this case the second CSV. We don't need to manage, validate or parse any of the other fields in the CSVs:
Build a list of unique Country values from the second csv
Iterate the first csv
Get the Country value
Check against the list of countries from the second csv
Write to the third csv if the country was not found
You can test the following code on .NET Fiddle
NOTE: this code uses StringWriter and StringReader as their interfaces are the same as the file readers and writers in the System.IO namespace, but we can remove the complexity associated with file access for this simple requirement.
string inputcsv = #"Id,Field1,Field2,Country,Field3
1,one,two,Australia,three
2,one,two,New Zealand,three
3,one,two,Indonesia,three
4,one,two,China,three
5,one,two,Japan,three";
string masterCsv = #"Field1,Country,Field2
one,Indonesia,...
one,China,...
one,Japan,...";
string errorCsv = "";
// For all in inputCsv where the country value is not listed in the masterCsv
// Write to errorCsv
// Step 1: Build a list of unique Country values
bool csvHasHeader = true;
int countryIndexInMaster = 1;
char delimiter = ',';
List<string> countries = new List<string>();
using (var masterReader = new System.IO.StringReader(masterCsv))
{
    string line = null;
    if (csvHasHeader)
    {
        line = masterReader.ReadLine();
        // an example of how to find the column index from first principles
        if (line != null)
            countryIndexInMaster = line.Split(delimiter).ToList().FindIndex(x => x.Trim('"').Equals("Country", StringComparison.OrdinalIgnoreCase));
    }
    while ((line = masterReader.ReadLine()) != null)
    {
        string country = line.Split(delimiter)[countryIndexInMaster].Trim('"');
        if (!countries.Contains(country))
            countries.Add(country);
    }
}
// Read the input CSV, if the country is not in the master list "countries", write it to the errorCsv
int countryIndexInInput = 3;
csvHasHeader = true;
var outputStringBuilder = new System.Text.StringBuilder();
using (var outputWriter = new System.IO.StringWriter(outputStringBuilder))
using (var inputReader = new System.IO.StringReader(inputcsv))
{
    string line = null;
    if (csvHasHeader)
    {
        line = inputReader.ReadLine();
        if (line != null)
        {
            countryIndexInInput = line.Split(delimiter).ToList().FindIndex(x => x.Trim('"').Equals("Country", StringComparison.OrdinalIgnoreCase));
            outputWriter.WriteLine(line);
        }
    }
    while ((line = inputReader.ReadLine()) != null)
    {
        string country = line.Split(delimiter)[countryIndexInInput].Trim('"');
        if (!countries.Contains(country))
        {
            outputWriter.WriteLine(line);
        }
    }
    outputWriter.Flush();
    errorCsv = outputWriter.ToString();
}
// dump output to the console
Console.WriteLine(errorCsv);
Since you write about solving it with lists, I assume you can load those values from the CSV to the lists, so let's start with:
List<string> countriesIn1st = LoadDataFrom1stCsv();
List<string> countriesIn2nd = LoadDataFrom2ndCsv();
Then you can easily solve it with LINQ:
List<string> countriesNotIn2nd = countriesIn1st.Where(country => !countriesIn2nd.Contains(country)).ToList();
Now you have your third list with countries that are in first, but not in the second list. You can save it.
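Equivalently, Except does the set difference in one call (a small variation, not from the original answer); note that it also removes duplicates from the result:
// Requires using System.Linq; duplicates in the first list are collapsed.
List<string> countriesNotIn2nd = countriesIn1st.Except(countriesIn2nd).ToList();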

Export varbinary to JSON File

A database table has a document stored as varbinary, so I can get it as a byte[] in C# code.
Now, how can I export this byte[] to a JSON file field?
if (item.IS_VIDEO == 0)
{
    var content = ctx.DOCUMENT_TABLE.First(a => a.document_id == item.document_id).DOCUMENT_CONTENT;
    if (content != null)
    {
        publicationClass.document_content = System.Text.Encoding.Default.GetString(content); // for export to JSON field
    }
}
Is this a valid way to export a byte[] file to JSON?
Have you considered letting JSON serializer deal with the problem?
byte[] file = File.ReadAllBytes("FilePath"); // replace with how you get your array of bytes
string str = JsonConvert.SerializeObject(file);
This can be then deserialized on the receiving end like this:
var xyz = JsonConvert.DeserializeObject<byte[]>(str);
This appears to work without any issues; however, there might be some size limitations worth investigating before committing to this method.
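For context (an observation, not part of the original answer): Json.NET serializes a byte[] as a Base64 string, so the same result can be produced explicitly and stored as an ordinary string field of a larger object, which keeps the question's publicationClass approach intact:
// Encode the bytes as Base64 so they survive as a plain JSON string field.
publicationClass.document_content = Convert.ToBase64String(content);
// ...and on the consuming side:
byte[] restored = Convert.FromBase64String(publicationClass.document_content);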

File.ReadLines taking a long time to process a text file

I have a text file that contains lines similar to the following, for example 500k of them:
ADD GTRX:TRXID=0, TRXNAME="M_RAK_JeerExch_G_1879_18791_A-0", FREQ=81, TRXNO=0, CELLID=639, IDTYPE=BYID, ISMAINBCCH=YES, ISTMPTRX=NO, GTRXGROUPID=2556;
ADD GTRX:TRXID=1, TRXNAME="M_RAK_JeerExch_G_1879_18791_A-1", FREQ=24, TRXNO=1, CELLID=639, IDTYPE=BYID, ISMAINBCCH=NO, ISTMPTRX=NO, GTRXGROUPID=2556;
ADD GTRX:TRXID=5, TRXNAME="M_RAK_JeerExch_G_1879_18791_A-2", FREQ=28, TRXNO=2, CELLID=639, IDTYPE=BYID, ISMAINBCCH=NO, ISTMPTRX=NO, GTRXGROUPID=2556;
ADD GTRX:TRXID=6, TRXNAME="M_RAK_JeerExch_G_1879_18791_A-3", FREQ=67, TRXNO=3, CELLID=639, IDTYPE=BYID, ISMAINBCCH=NO, ISTMPTRX=NO, GTRXGROUPID=2556;
My intention is first to get the FREQ value where ISMAINBCCH=YES, which I did easily. But where ISMAINBCCH=NO I need to concatenate the FREQ values, and I have done that using File.ReadLines, but it is taking a long time. Is there any better way to do this? When I take the FREQ value for ISMAINBCCH=YES, the ISMAINBCCH=NO values to concatenate come within a range of about 10 lines above and below, but I don't know how to implement that; probably I should use the current line where ISMAINBCCH=YES for FREQ. The following is the code I have so far:
using (StreamReader sr = File.OpenText(filename))
{
    while ((s = sr.ReadLine()) != null)
    {
        if (s.Contains("ADD GTRX:"))
        {
            try
            {
                var gtrx = new Gtrx
                {
                    CellId = int.Parse(PullValue(s, "CELLID")),
                    Freq = int.Parse(PullValue(s, "FREQ")),
                    //TrxNo = int.Parse(PullValue(s, "TRXNO")),
                    IsMainBcch = PullValue(s, "ISMAINBCCH").ToUpper() == "YES",
                    Commabcch = new List<string> { PullValue(s, "ISMAINBCCH") },
                    DEFINED_TCH_FRQ = null,
                    TrxName = PullValue(s, "TRXNAME"),
                };
                var result = String.Join(",",
                    from ss in File.ReadLines(filename)
                    where ss.Contains("ADD GTRX:")
                    where int.Parse(PullValue(ss, "CELLID")) == gtrx.CellId
                    where PullValue(ss, "ISMAINBCCH").ToUpper() != "YES"
                    select int.Parse(PullValue(ss, "FREQ")));
                gtrx.DEFINED_TCH_FRQ = result;
            }
            catch { /* catch body omitted in the question */ }
        }
    }
}
from ss in File.ReadLines(filename)
This enumerates the entire file inside a loop that is itself reading the same file, so the work is done and thrown away again for every matching line. You're reading the same file number_of_lines + 1 times when it hasn't changed in the meantime.
An obvious boost would therefore be to call File.ReadLines(filename) once, store the result (for example with ToList()), and then use it both for the outer loop instead of while ((s = sr.ReadLine()) != null) and inside the loop instead of the repeated call to ReadLines().
But there's a flaw in your logic in even looking at ReadLines() repeatedly; you're already scanning through the file so you're going to come across all the lines relevant to the same CELLID later anyway:
var gtrxDict = new Dictionary<int, Gtrx>();
using (StreamReader sr = File.OpenText(filename))
{
    while ((s = sr.ReadLine()) != null)
    {
        if (s.Contains("ADD GTRX:"))
        {
            int cellID = int.Parse(PullValue(s, "CELLID"));
            Gtrx gtrx;
            if (gtrxDict.TryGetValue(cellID, out gtrx)) // Found a previous one
                gtrx.DEFINED_TCH_FRQ += "," + int.Parse(PullValue(s, "FREQ"));
            else // First one for this ID, so create a new object
                gtrxDict[cellID] = new Gtrx
                {
                    CellId = cellID,
                    Freq = int.Parse(PullValue(s, "FREQ")),
                    IsMainBcch = PullValue(s, "ISMAINBCCH").ToUpper() == "YES",
                    Commabcch = new List<string> { PullValue(s, "ISMAINBCCH") },
                    DEFINED_TCH_FRQ = int.Parse(PullValue(s, "FREQ")).ToString(),
                    TrxName = PullValue(s, "TRXNAME"),
                };
        }
    }
}
This way we don't need to keep more than one line from the file in memory at all, never mind doing so repeatedly. After this has run gtrxDict will contain a Gtrx object for each distinct CELLID in the file, with DEFINED_TCH_FRQ as a comma-separated list of the values from each matching line.
The following code snippet can be used to read the entire text file:
using System.IO;

/// Read Text Document specified by full path
private string ReadTextDocument(string TextFilePath)
{
    string _text = String.Empty;
    try
    {
        // open file if exists
        if (File.Exists(TextFilePath))
        {
            using (StreamReader reader = new StreamReader(TextFilePath))
            {
                _text = reader.ReadToEnd();
                reader.Close();
            }
        }
        else
        {
            throw new FileNotFoundException();
        }
        return _text;
    }
    catch { throw; }
}
Get the in-memory string, then apply the Split() function to create a string[] and process the array elements in the same way as lines in the original text file. In case of very large files, this method also provides the option of reading the file in chunks of data, processing them and then disposing of them upon completion (re: https://msdn.microsoft.com/en-us/library/system.io.streamreader%28v=vs.110%29.aspx).
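For example, splitting the returned string into lines might look like this (a minimal illustration; textFilePath is just a placeholder):
string text = ReadTextDocument(textFilePath);
// Split on both Windows and Unix line endings.
string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);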
As mentioned in the comments by @Michael Liu, there is another option of using File.ReadAllText(), which provides an even more compact solution and can be used instead of reader.ReadToEnd(). Other useful methods of the File class are detailed in: https://msdn.microsoft.com/en-us/library/system.io.file%28v=vs.110%29.aspx
And, finally, FileStream class can be used for both file read/write operations with various levels of granularity (re: https://msdn.microsoft.com/en-us/library/system.io.filestream%28v=vs.110%29.aspx).
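As a rough illustration of reading a file in chunks with FileStream (a sketch, with an arbitrary buffer size):
using (var fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
    var buffer = new byte[64 * 1024];   // 64 KB chunks; adjust to taste
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        // process buffer[0 .. bytesRead) here
    }
}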
SUMMARY
In response to the interesting comments thread, here is a brief summary.
The biggest bottleneck pertinent to the procedure described in the OP's question is disk IO. Here are some numbers: the average seek time of a good quality HDD is about 5 ms, plus the actual read time (per line). It could well be that the entire in-memory processing of the file data takes less time than a single HDD IO read (sometimes significantly less; an SSD does better, but is still not a match for DDR3 RAM). The RAM size of a modern PC is rather significant (typically 4...8 GB of RAM is more than enough to handle most text files). Thus, the core idea of my solution is to minimize disk IO read operations and do the entire file data processing in memory. The implementation can vary, apparently.
Hope this may help. Best regards,
I think that this more-or-less gets you what you want.
First read in all the data:
var data =
(
    from s in File.ReadLines(filename)
    where s != null
    where s.Contains("ADD GTRX:")
    select new Gtrx
    {
        CellId = int.Parse(PullValue(s, "CELLID")),
        Freq = int.Parse(PullValue(s, "FREQ")),
        //TrxNo = int.Parse(PullValue(s, "TRXNO")),
        IsMainBcch = PullValue(s, "ISMAINBCCH").ToUpper() == "YES",
        Commabcch = new List<string> { PullValue(s, "ISMAINBCCH") },
        DEFINED_TCH_FRQ = null,
        TrxName = PullValue(s, "TRXNAME"),
    }
).ToArray();
Based on the loaded data create a lookup to return the frequencies based on each cell id:
var lookup =
    data
        .Where(d => !d.IsMainBcch)
        .ToLookup(d => d.CellId, d => d.Freq);
Now update the DEFINED_TCH_FRQ based on the lookup:
foreach (var d in data)
{
d.DEFINED_TCH_FRQ = String.Join(",", lookup[d.CellId]);
}

Writing a collection of collections to a CSV file using CsvHelper

I am using CsvHelper to write data to a CSV file. I am using C# and VS2010. The object I wish to write is a complex data type of type Dictionary<long, List<HistoricalProfile>>.
Below is the code I am using to write the data to the csv file:
Dictionary<long, List<HistoricalProfile>> profiles
    = historicalDataGateway.GetAllProfiles();
var fileName = CSVFileName + ".csv";
CsvWriter writer = new CsvWriter(new StreamWriter(fileName));
foreach (var items in profiles.Values)
{
    writer.WriteRecords(items);
}
writer.Dispose();
When it loops the second time I get an error:
The header record has already been written. You can't write it more than once.
Can you tell me what I am doing wrong here? My final goal is to have a single CSV file with a huge list of records.
Thanks in advance!
Have you seen this library: http://www.filehelpers.net/? It makes it very easy to read and write CSV files.
Then your code would just be
var profiles = historicalDataGateway.GetAllProfiles(); // should return a list of HistoricalProfile
var engine = new FileHelperEngine<HistoricalProfile>();
// To write, use:
engine.WriteFile("FileOut.txt", profiles);
I would go more low-level and iterate through the collections myself:
var fileName = CSVFileName + ".csv";
var writer = new StreamWriter(fileName);
foreach (var items in profiles.Values)
{
    writer.WriteLine(/* header goes here, if needed */);
    foreach (var item in items)
    {
        writer.WriteLine(item.property1 + "," + item.property2 ...);
    }
}
writer.Close();
If you want to make the routine more useful, you could use reflection to get the list of properties you wish to write out and construct your record from there.
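Alternatively, staying with CsvHelper, the original error can be avoided by flattening the dictionary values and writing them with a single WriteRecords call, so the header is only emitted once. A sketch (not from the answers above), reusing the variables from the question:
// Flatten Dictionary<long, List<HistoricalProfile>> into one sequence of records.
var allRecords = profiles.Values.SelectMany(list => list);
using (var writer = new CsvWriter(new StreamWriter(fileName)))
{
    writer.WriteRecords(allRecords);   // one call -> one header
}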
