I formatted this data using C# (StreamReader and StreamWriter) and created this CSV file. Here is the sample data:
A------B-C---D----------E------------F------G------H-------I
NEW, C,A123 ,08/24/2011,08/24/2011 ,100.00,100.00,X123456,276135
NEW, C,A125 ,08/24/2011,08/24/2011 ,200.00,100.00,X123456,276135
NEW, C,A127 ,08/24/2011,08/24/2011 , 50.00,100.00,X123456,276135
NEW, T,A122 ,08/24/2011,08/24/2011 , 5.00,100.00,X225511,276136
NEW, T,A124 ,08/24/2011,08/24/2011 , 10.00,100.00,X225511,276136
NEW, T,A133 ,08/24/2011,08/24/2011 ,500.00,100.00,X444556,276137
I would like the following output:
NEW, C,A123 ,08/24/2011,08/24/2011 ,100.00,100.00,X123456,276135
NEW, C,A125 ,08/24/2011,08/24/2011 ,200.00,100.00,X123456,276135
NEW, C,A127 ,08/24/2011,08/24/2011 , 50.00,100.00,X123456,276135
NEW, C,A001 ,08/24/2011,08/24/2011 ,350.00,100.00,X123456,276135
NEW, T,A122 ,08/24/2011,08/24/2011 , 5.00,100.00,X225511,276136
NEW, T,A124 ,08/24/2011,08/24/2011 , 10.00,100.00,X225511,276136
NEW, T,A001 ,08/24/2011,08/24/2011 , 15.00,100.00,X225511,276136
NEW, T,A133 ,08/24/2011,08/24/2011 ,500.00,100.00,X444556,276137
NEW, T,A001 ,08/24/2011,08/24/2011 ,500.00,100.00,X225511,276137
With each change in field "I", I would like to add a line that sums column F, puts "A001" in column C, and copies the contents of the other fields into that newly added line.
The letters on the columns are for illustrative purposes only. There are no headers.
What should I do first? How do I sum column F, copy the contents of the other fields, and put "A001" in column C? How do I add a line and copy the fields with each change in column I?
From your questions it doesn't sound like your data fits a flat-file format very well, or at least not a CSV (which is analogous to a single DB table). Wanting to 'copy the contents of the other fields into that newly added line' implies to me that there is possibly a relationship that might be better expressed referentially rather than by copying the data to a new row.
Also keep in mind the requirement to sum according to column 'F' suggests that you will need to iterate over every row in order to calculate the sum.
If you decide to go a route other than a CSV, you might try a lightweight database solution such as SQLite. An alternative might be to look at XmlSerializer or DataContractSerializer and just work with objects in your code. The objects can then be serialized to disk when you're done with them.
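As a minimal sketch of that serialization idea (the Row class shape, member names, and paths below are placeholders I made up, not anything from the original post):
// requires System.Collections.Generic, System.IO, System.Xml.Serialization
public class Row                      // one CSV line
{
    public string Status;             // column A, e.g. "NEW"
    public string Type;               // column B
    public string Code;               // column C
    public decimal Amount;            // column F
    public string Reference;          // column I
}

public static class RowStore
{
    public static void Save(List<Row> rows, string path)
    {
        var serializer = new XmlSerializer(typeof(List<Row>));
        using (var stream = File.Create(path))
            serializer.Serialize(stream, rows);
    }

    public static List<Row> Load(string path)
    {
        var serializer = new XmlSerializer(typeof(List<Row>));
        using (var stream = File.OpenRead(path))
            return (List<Row>)serializer.Deserialize(stream);
    }
}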
You could use a custom iterator block. As an example, the code below shows how to add the new line that contains "A001" in column C whenever column I changes; you can just as easily add the summed column F as well (a fuller sketch of that follows the code). Keep in mind, though, that you should keep the number of columns in each row the same.
public IEnumerable<string> GetUpdatedLines(string fileName)
{
    int? lastValue = null;
    foreach (string line in File.ReadLines(fileName))
    {
        var values = line.Split(',');
        int myIValue = Convert.ToInt32(values[8]);   // column I is the ninth field
        if (lastValue.HasValue && myIValue != lastValue)
        {
            // column I changed: emit the extra line (the summed column F would be set here too)
            values[2] = "A001";                      // column C
            yield return string.Join(",", values);
        }
        lastValue = myIValue;
        yield return line;
    }
}
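Building on that idea, here is a rough sketch (my own extension, not the original poster's code; the BuildSummaryLine helper name and the use of InvariantCulture are assumptions) that accumulates the sum of column F per group, emits the "A001" summary line after each group, and handles the final group:
// requires System.Collections.Generic, System.Globalization, System.IO
public IEnumerable<string> GetLinesWithGroupTotals(string fileName)
{
    string[] lastValues = null;
    decimal groupSum = 0m;

    foreach (string line in File.ReadLines(fileName))
    {
        var values = line.Split(',');
        if (lastValues != null && values[8] != lastValues[8])
        {
            // column I changed: emit the summary line for the group that just ended
            yield return BuildSummaryLine(lastValues, groupSum);
            groupSum = 0m;
        }
        groupSum += decimal.Parse(values[5], CultureInfo.InvariantCulture);   // column F
        lastValues = values;
        yield return line;
    }

    if (lastValues != null)
        yield return BuildSummaryLine(lastValues, groupSum);                  // summary for the last group
}

private static string BuildSummaryLine(string[] values, decimal sum)
{
    var summary = (string[])values.Clone();
    summary[2] = "A001";                                                      // column C
    summary[5] = sum.ToString("0.00", CultureInfo.InvariantCulture);          // column F holds the group total
    return string.Join(",", summary);
}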
Related
I want to put all files of the same size into buckets, with the size as the key.
The default behaviour of a dictionary is to overwrite the value whenever you assign to an existing key. Instead, I want to push the file name onto the string[] array whenever the same size is met.
1556 - "1.txt" - entry added to the dictionary, 1.txt put into the string[]
1556 - "7.txt" - 7.txt pushed into the string[] associated with 1556
My current thought is to enumerate once through all files and create entries with keys and empty arrays in the dictionary:
foreach (var file in directory)
{
    map[file.Length] = new List<string>();
}
then enumerate a second time, retrieving the list associated with the current key:
foreach (var file in directory)
{
    map[file.Length].Add(file.Name);
}
Are there any better ways to do this?
As far as I understand, you want the dictionary entries to consist of the file size as the key
and the list of file names as the value.
In that case, I'd suggest using just one loop, like so:
foreach (var file in directory)
{
    if (!map.ContainsKey(file.Length))
    {
        map.Add(file.Length, new List<string>());
    }
    map[file.Length].Add(file.Name);
}
Edit: removed space for each file name.
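If LINQ is an option, the same grouping can also be written as a single expression (a sketch, assuming directory is a sequence of FileInfo objects):
// requires using System.Linq;
var map = directory
    .GroupBy(f => f.Length)
    .ToDictionary(g => g.Key, g => g.Select(f => f.Name).ToList());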
I am using Parquet.Net to read parquet files, but the only option I can find to read from the parquet file is:
//get the first group
Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0);
//gets the first column
Parquet.Data.DataColumn col1 = rowGroup.ReadColumn(myParquet.Schema.GetDataFields()[0]);
This allows me to get the first column from the first row group, but the problem is that the first row group can be something like 4 million rows and ReadColumn will read all 4 million values.
How do I tell ReadColumn that I only want it to read, say, the first 100 rows? Reading all 4 million rows wastes memory and file read time.
I actually got an out-of-memory error until I changed my code to resize that 4-million-value array down to my 100 after reading each column.
I don't necessarily need row-based access; I can work with columns, I just don't need a whole row group's worth of values in each column. Is this possible? If row-based access is better, how does one use it? The Parquet.Net project site doesn't give any examples and just talks about tables.
According to the source code this capability exists in DataColumnReader but this is an internal class and thus not directly usable.
ParquetRowGroupReader uses it inside its ReadColumn method, but exposes no such options.
What can be done in practice is copying the whole DataColumnReader class and using it directly, but this could breed future compatibility issues.
If the problem can wait for some time, I'd recommend copying the class and then opening an issue + pull request to the library with the enhanced class, so the copied class can eventually be removed.
If you look at the parquet-dotnet documentation, they do not recommend writing more than 5,000 records into one row group for performance reasons, though at the bottom of the page they say row groups are designed to hold 50,000 rows on average:
It's not recommended to have more than 5'000 rows in a single row
group for performance reasons
My team works with 100,000 rows per row group; overall it may depend on what you are storing, but 4,000,000 records in one row group does sound like too much.
So, to answer your question: to read only part of a column, make your row groups smaller and then read only as many row groups as you need. If you want to read only 100 records, read in the first row group and take the first 100 from it; reasonably sized row groups are very fast to read.
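As a rough sketch reusing the calls from the question (on the assumption that myParquet is an open ParquetReader and the row groups have been written small):
// first row group only, mirroring the question's code
Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0);
Parquet.Data.DataColumn col = rowGroup.ReadColumn(myParquet.Schema.GetDataFields()[0]);
// take just the first 100 values of that (small) row group; requires using System.Linq;
var first100 = col.Data.Cast<object>().Take(100).ToArray();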
ParquetSharp should be able to do that. It’s a wrapper around the Apache Arrow C++ library but it supports Windows, Linux and macOS.
using System;
using ParquetSharp;

using (var reader = new ParquetFileReader(path))
using (var group = reader.RowGroup(0))
{
    // You can use the logical reader for automatic conversion to a fitting CLR type,
    // here `double` as an example
    // (unfortunately this does not work well with complex schemas IME)
    const int batchSize = 4000;
    Span<double> buffer = new double[batchSize];
    var howManyRead = group.Column(0).LogicalReader<double>().ReadBatch(buffer);

    // or if you want raw Parquet (with Dremel data and physical type)
    var resultObject = group.Column(0).Apply(new Visitor());
}

class Visitor : IColumnReaderVisitor<object>
{
    public object OnColumnReader<TValue>(ColumnReader<TValue> columnReader)
        where TValue : unmanaged
    {
        // TValue will be the physical Parquet type
        const int batchSize = 200000;
        var buffer = new TValue[batchSize];
        var definitionLevels = new short[batchSize];
        var repetitionLevels = new short[batchSize];
        long valuesRead;
        var levelsRead = columnReader.ReadBatch(batchSize,
            definitionLevels, repetitionLevels,
            buffer, out valuesRead);
        // Return stuff you are interested in here; it will be `resultObject` above
        return new object();
    }
}
I am wondering if I have to ditch FileHelpers and do this myself, as I think I might be going beyond what it was designed for.
I want a user to be able to upload any CSV file (and maybe in the future an Excel file). The first row would have the header:
C1 C2 C3 C4 C5 C6
Once uploaded it would look like:
C1,C2,C3,C4,C5,C6
a,b,c,d,e,f
Now I want to look at the header and basically take certain columns. For instance I want C2, C3, C4. The rest is extra information I don't care about.
Now someone might upload a file that has this header
C1 C2 C3 C4
Again I am looking only for C2, C3, C4.
I know I can have multiple formats, but what I am getting at is that I want them to be able to upload any file with any number of headers (could be 1,000 for all I care) and then have my application try to find the information I care about (so in the case of the 1,000 headers, maybe I only want 3).
Is this possible?
Edit
(based on shamp00's comments)
My goal is to fill in as much data as I can determine, but cases like this might happen: I want C1, C2, C3, and they give me a file with C1, C3, C4. I get two of the columns of data I need, but I don't have C2.
I had two ideas. One was to display the data in two tables. Table 1 would have C1, C2, C3 and table 2 would have C1, C3, C4, and they would basically take the data they have in table 2 and move the appropriate data into my expected columns.
With this approach I am basically saying "you did not give me 100% of what I expected, so now you have to reformat every single row into my format".
The second approach would be one table, trying to fill in as much data as possible.
For example, the user uploads the file that has C1, C3, C4. I determine that there are two columns that are known, but I don't have the full amount of expected data yet.
So I would display all the rows back to the user in an HTML table with headers of
C1, C2, C3, C4
C1 would be filled in, C2 cells would be blank (as this is the data I am missing from them), C3 would be filled in, and C4 would be filled in as well (this data was unexpected, but who knows, it might actually be the data C2 should hold; since they misspelled the header name, my program could not figure that out).
Then essentially they would just fill in C2 with data they got from somewhere else, or maybe from C4.
Now they only have to fill in one column instead of all the columns that were expected. So in a sense I need a concrete class like MyClass with C1, C2, C3, but at the same time I need it to be dynamic so it can hold C4, C5, ..., Cn.
I would always display C1, C2, C3 first, and the rest of these unexpected ones would come after; through the magic of JavaScript and such they could edit the missing info. If nothing is missing, then nothing would show up to be edited.
Based on shamp00's comments, I am now wondering if I need to return the data as a DataTable (fortunately this seems to be a system class; right now my code is in a service layer and I was returning a domain transfer class, as I want to keep my code independent of web-specific classes, which is why I was trying to figure out how to return the dynamic class FileHelpers generated).
Then somehow (not 100% sure yet) keep track of where those three columns I am really interested in are, so I know which data is what.
You can use FileHelpers with a technique like the one described in my answer to your other question.
You read the header line to determine which columns are relevant and then traverse the resulting DataTable processing only those columns.
Something like
public class MyClass
{
    public string SomeImportantField { get; set; }
    public string SomeOtherField { get; set; }
    public string AnotherField { get; set; }
}

public IList<MyClass> GetObjectsFromStream(Stream stream)
{
    var cb = new DelimitedClassBuilder("temp", ",") { IgnoreEmptyLines = true, Delimiter = "," };
    var sr = new StreamReader(stream);
    var headerArray = sr.ReadLine().Split(','); // the header line is consumed here
    foreach (var header in headerArray)
    {
        var fieldName = header.Replace("\"", "").Replace(" ", "");
        cb.AddField(fieldName, typeof(string));
    }
    var engine = new FileHelperEngine(cb.CreateRecordClass());
    List<MyClass> objects = new List<MyClass>();
    DataTable dt = engine.ReadStreamAsDT(sr);
    foreach (DataRow row in dt.Rows) // Loop over the rows.
    {
        MyClass myClass = new MyClass();
        for (int i = 0; i < row.ItemArray.Length; i++) // Loop over the items.
        {
            if (headerArray[i] == "ImportantField")
                myClass.SomeImportantField = row.ItemArray[i].ToString();
            if (headerArray[i] == "OtherField")
                myClass.SomeOtherField = row.ItemArray[i].ToString();
            if (headerArray[i] == "AnotherField")
                myClass.AnotherField = row.ItemArray[i].ToString();
        }
        objects.Add(myClass); // add once per row, not once per column
    }
    return objects;
}
I am not familiar with FileHelpers, but I have done something very similar to what you describe by using a tool called LogParser (http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=24659) in conjunction with my own "DelimitedTextFileData" class. If you decide FileHelpers isn't going to do what you need, I would recommend you look into LogParser next. Even if it's overkill for your current project, it IS an excellent tool to be aware of for future projects.
LogParser is a tool that allows "SQL like" queries against a wide variety of sources - including CSV text files. It is a command line based .exe, but also comes with an API you can reference in your .NET project. In my situation, I was dealing with text files that could be delimited by any character, so I developed my own class to let me specify the delimiter on class instantiation and then use a simple API to tap the larger LogParser API. I also had to parse files with an unknown number of (and name for) columns, so my custom class has a function to retrieve a list of columns found in the file. You may not need to take these extra steps if you're always dealing with a CSV, and you know exactly what columns you want. Nevertheless, I'd be happy to share my custom class if you'd like; just let me know the best way to send it.
LogParser is meant to let you "query anything using SQL-like syntax", and it occurred to me that one purpose of Linq is to do the same. Have you searched online for any "Linq to Text File" libraries? If there is a good one out there, it may solve your problem as well.
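As an illustration of that idea, a plain LINQ pass over the uploaded file can pick out the wanted columns by header name without any library (a sketch; the file path and wanted column names are placeholders):
// requires using System; using System.IO; using System.Linq;
var lines = File.ReadLines("upload.csv").ToList();
var headers = lines[0].Split(',');
string[] wanted = { "C2", "C3", "C4" };                    // the columns we care about
var indexes = wanted.Select(w => Array.IndexOf(headers, w)).ToArray();

var rows =
    from line in lines.Skip(1)
    let cells = line.Split(',')
    select indexes.Select(i => i >= 0 ? cells[i] : "").ToArray();   // "" where a wanted column is missing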
I am using the LinqToExcel library to get access to one of my Excel sheets. The problem I ran into is that my call can't find a column with a specific name.
public IQueryable<Cell> getExcel()
{
    var excel = new ExcelQueryFactory();
    excel.FileName = @"C:\Users\Timsen\Desktop\QUOTATION.CSV";
    var indianaCompanies = from c in excel.Worksheet() select c["ARTICLE TEXT"];
    return indianaCompanies;
}
Error :
base {System.SystemException} = {"'ARTICLE TEXT' column name does not exist. Valid column names are 'QUOT NO;DEBTOR;ITEM;ART NO;HWS NO#;BRANCH PRICE;QTY;PR;ARTICLE T', 'F2', 'F3', 'F4', 'F5'"}
Names of the columns in Excel:
QUOT NO
DEBTOR
ITEM
ART NO
HWS NO.
BRANCH PRICE
QTY
PR
ARTICLE TEXT
TYPE NAME
SALES PRICE
QT%
DIS
AMOUNT
UNI
B
ARTG
SUPPL
DUTY
UPDATE: sample of the Excel file (image not reproduced here).
Can you show us the first line or two of the csv file?
If I'm interpreting the error message correctly, the header line has semicolons instead of commas for separators.
Specifically, the error message appears to list these as the column names (note that it's using single quotes and commas to try and make it clear, which seems useful).
'QUOT NO;DEBTOR;ITEM;ART NO;HWS NO#;BRANCH PRICE;QTY;PR;ARTICLE T'
'F2'
'F3'
'F4'
'F5'
Since that first column name is 64 characters, I'm assuming it cut off at that point and the rest of the columns would be in there as well (still semicolon-delimited) if that limit wasn't in place.
Not sure off-hand if you can specify a different delimiter with the linq-to-excel project or not, since it appears to use Jet for csv files, as per https://github.com/paulyoder/LinqToExcel/blob/792e0807b2cf2cb6b74f55565ad700d2fcf31e19/src/LinqToExcel/Query/ExcelUtilities.cs
If making it a 'real' CSV isn't an option and the library doesn't support specifying an alternate delimiter, you might just be able to get the article text by going through the lines in the file (except the first) and pulling out the ninth field (index 8), since that appears to be the article text based on the column order in the error message.
So, something like:
var articleTextValues =
    // Skip(1) since we don't want the header
    from line in File.ReadAllLines(@"C:\Users\Timsen\Desktop\QUOTATION.CSV").Skip(1)
    select line.Split(';')[8];
Change your code to this:
public IQueryable<Cell> getExcel()
{
    var excel = new ExcelQueryFactory();
    excel.FileName = @"C:\Users\Timsen\Desktop\QUOTATION.CSV";
    var indianaCompanies = from c in excel.Worksheet() select c["ARTICLE T"];
    return indianaCompanies;
}
The error lists the valid column names. It is having problems with some of the headers: the fact that 'QTY;PR' appears inside a single column name means several columns were parsed as one, and 'F2' indicates it does not know what that header should actually be called.
The simplest check is to verify that the data imported by the following query matches your Excel document.
var indianaCompanies = from c in excel.Worksheet() select c;
I believe that will work.
I am saving each line in a list to the end of a file...
But what I would like to do is check whether the file already contains that line, so it does not save the same line twice.
So before using StreamWriter to write the file I want to check each item in the list to see if it exists in the file. If it does, I want to remove it from the list before using StreamWriter.
..... Unless of course there is a better way to go about doing this?
Assuming your files are small, you are limited to flat files, and a database table is not an option, then you could just read the existing items into a list and make the write operation conditional on examining that list... Again, I would try another method if at all possible (a DB table, etc.), but this is the most direct answer to your question:
string line = "your line to append";

// Read existing lines into a list
List<string> existItems = new List<string>();
using (StreamReader sr = new StreamReader(path))
    while (!sr.EndOfStream)
        existItems.Add(sr.ReadLine());

// Conditionally write the new line to the file
if (!existItems.Contains(line))                               // only write lines we have not seen yet
    using (StreamWriter sw = new StreamWriter(path, true))    // true = append to the end of the file
        sw.WriteLine(line);
I guess what you could do is initialize the list from the file, adding each line as a new entry to the list.
Then, as you add to the list, check to see if it contains the line already.
List<string> l = new List<string> { "A", "B", "C" }; // This would be initialized from the file.
string s = "D"; // the line you want to add
if (!l.Contains(s))
    l.Add(s);
When you are ready to save the file, just write out what is in the list.
This will be slow, especially if you have a lot of data going into the list.
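If the Contains check on a List<string> turns out to be too slow, a HashSet<string> (my suggestion, not part of the answer above) gives constant-time duplicate checks:
// requires using System; using System.Collections.Generic; using System.IO;
var seen = new HashSet<string>(File.ReadAllLines(path));    // initialize from the file
if (seen.Add(line))                                         // Add returns false if the line is already present
    File.AppendAllText(path, line + Environment.NewLine);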
If possible, can you store all the lines in a database table with a primary key on the text column? Then add if the column value does not exist, and when you're done, dump the table to a text file? I think that's what I'd do.
I'd like to point out that I don't think this is ideal, but it should be fairly performant (using MS SQL syntax):
create table foo (
    rowdata varchar(1000) primary key
);

-- for insertion (where @rowdata is the new text line):
insert into foo (rowdata)
select @rowdata
where not exists (select 1 from foo where rowdata = @rowdata)

-- for output
select rowdata from foo;
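A rough sketch of calling that insert from C# (the connection string is a placeholder, and this assumes the foo table above exists):
// requires using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "insert into foo (rowdata) select @rowdata " +
    "where not exists (select 1 from foo where rowdata = @rowdata)", conn))
{
    cmd.Parameters.AddWithValue("@rowdata", line);   // the text line to append
    conn.Open();
    cmd.ExecuteNonQuery();                           // inserts only if the line is not already present
}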
If you can sort the file every time you save it, it would be much faster to determine whether a particular entry exists.
Also, a database table would be a good idea, as mentioned earlier: you can search the table for the entry to be added and, if it does not exist, add it.
It depends on whether you are after speed (DB), fast implementation (file access), or don't care (use in-memory lists until the file gets too big and burns and crashes).
A similar case can be found here