Parsing PostgreSQL CSV Log - c#

I am working on a section of an application that needs to parse CSV logs generated by a PostgreSQL server.
The logs are stored in C:\Program Files\PostgreSQL\9.0\data\pg_log
The server version is 9.0.4
The application is developed in C#
The main use after parsing the log is to show its contents in a DataGridView.
There are other filter options, such as viewing log contents for a particular time range within a day.
However, the main problem is that the log format is not readable.
It was first tested with A Fast CSV Reader
Parsing CSV files in C#, with header
http://www.codeproject.com/KB/database/CsvReader.aspx
Then we made a custom utility using the String.Split method with the usual foreach loop going through the array.
A sample log data line:
2012-03-21 11:59:20.640 IST,"postgres","stock_apals",3276,"localhost:1639",4f697540.ccc,10,"idle",2012-03-21 11:59:20 IST,2/163,0,LOG,00000,"statement: SELECT id,pdate,itemname,qty from stock_apals order by pdate,id",,,,,,,,"exec_simple_query, .\src\backend\tcop\postgres.c:900",""
As you can see, the columns in the log are comma separated, but not every value is enclosed in quotes.
For instance the 1st, 4th, 6th ... columns.
Is there a utility or a regex that can find the unquoted columns and place quotes around them?
This matters especially with respect to performance, because these logs are very long and new ones are made almost every hour.
I just want to update the columns and use the Fast CSV Reader to parse it.
Thanks for any advice and help.

I've updated my csv parser, so it's now able to parse your data (at least the sample provided). Below is an example console app that parses your data saved in a multiline_quotes.txt file. Project source can be found here (you can download a ZIP). You need either Gorgon.Parsing or Gorgon.Parsing.Net35 (in case you can't use .NET 4.0).
Actually, I was able to achieve the same result using Fast CSV Reader. You were just using it the wrong way in the first place.
namespace So9817628
{
    using System.Data;
    using System.Text;
    using Gorgon.Parsing.Csv;

    class Program
    {
        static void Main(string[] args)
        {
            // prepare
            CsvParserSettings s = new CsvParserSettings();
            s.CodePage = Encoding.Default;
            s.ContainsHeader = false;
            s.SplitString = ",";
            s.EscapeString = "\"\"";
            s.ContainsQuotes = true;
            s.ContainsMultilineValues = true;
            // uncomment below if you don't want escape quotes ("") to be replaced with single quote
            //s.ReplaceEscapeString = false;
            CsvParser parser = new CsvParser(s);
            DataTable dt = parser.ParseToDataTableSequential("multiline_quotes.txt");
            dt.WriteXml("parsed.xml");
        }
    }
}
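For readers who want to see why the sample line is already valid CSV: quotes are only needed around values that contain a comma, a quote, or a line break, so unquoted columns such as the timestamp or the process id need no fixing. The following is a minimal sketch of a quote-aware splitter, not the code of either library mentioned above. The class and method names are made up for illustration, and it assumes one record per string, so multi-line values (which the log can contain) are not handled here.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvLineSketch
{
    // Splits one CSV record into fields. Quoted fields may contain commas,
    // and a doubled quote ("") inside a quoted field stands for one quote.
    // Unquoted fields are taken verbatim up to the next comma.
    public static List<string> SplitRecord(string record)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < record.Length; i++)
        {
            char c = record[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < record.Length && record[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote inside a quoted field
                    i++;                 // skip the second quote of the pair
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);   // commas are ordinary text here
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Length = 0;      // reset the buffer (works on .NET 3.5 too)
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());  // last field has no trailing comma
        return fields;
    }
}
```

Calling CsvLineSketch.SplitRecord on the sample log line should give 23 fields, with the whole "statement: SELECT id,pdate,..." text kept together as one field despite the commas inside it.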

Related

Reading from CSV with value numbers formatted

What I want to achieve is to read from a CSV file and have the system automatically create a new CSV file in a different format.
I am able to read and format the CSV file, however I have issues with number formatting because the values are formatted with thousands separators (1,000). For example, when I read from the CSV and split each line on ',' my values change.
Ex Line 1: Test Name, Test Desc, Test Currency, 12,500
var line1 = line.Split(',');
This splits the value into 12 and 500 because of the , delimiter. How can I get the number as a whole amount please?
using (var reader = new StreamReader(openFileDialog1.FileName))
{
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        var values = line.Split(',');
    }
}
You can't. When a CSV file contains numbers (or any text) with a , in them, those fields need to be quoted. It is impossible for simple code (i.e. not AI) to differentiate in the way your human eye can.
Ex Line 1: Test Name, Test Desc, Test Currency, 12,500
Should be:
Ex Line 1: "Test Name", "Test Desc", "Test Currency", "12,500"
Common CSV parsers/libraries will know how to handle this (e.g. CsvHelper).
If you have control over the CSV file generation, then you should make this change. If it's from a 3rd party, then see if you can get them to make the change.
There may be an edge case in your example if there is always a space after the field separators and never inside the number fields. Your delimiter then becomes ", " instead of just ','
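A minimal sketch of that edge case, assuming every field separator really is a comma followed by one space and that no number contains ", ". The class and method names are made up for illustration.

```csharp
using System;

public static class SpaceDelimiterSketch
{
    // Splits on the two-character sequence ", " rather than on every comma,
    // so a thousands separator such as the one in 12,500 is left alone.
    public static string[] SplitOnCommaSpace(string line)
    {
        return line.Split(new[] { ", " }, StringSplitOptions.None);
    }
}
```

With the example line from the question this returns four values, the last one being 12,500. It breaks as soon as a text field itself contains ", ", so quoting the fields remains the more robust fix.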
Side note:
Consider not using culture-specific separators in a .csv file, because it always leads to headaches when the data is exported/imported with different regional settings.
Possible solutions:
I suggest dumping and parsing numbers (dates, etc.) with the invariant culture:
myNumber.ToString(CultureInfo.InvariantCulture)
If you really need to dump numbers with a comma as the decimal sign, enclose the field in quotes. This does not turn the numbers into strings, since .csv has no type information.
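A short sketch of the invariant-culture round trip suggested above. The helper names are hypothetical; the calls themselves are the standard decimal.ToString and decimal.Parse overloads.

```csharp
using System.Globalization;

public static class InvariantNumberSketch
{
    // Writes the number with '.' as decimal sign and no thousands separator,
    // regardless of the regional settings of the machine.
    public static string Dump(decimal value)
    {
        return value.ToString(CultureInfo.InvariantCulture);
    }

    // Reads a number that was written with the invariant culture.
    public static decimal Parse(string text)
    {
        return decimal.Parse(text, NumberStyles.Number, CultureInfo.InvariantCulture);
    }
}
```

Because the output never contains a comma, the value can be written to a comma-separated file without quoting.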
Excel vs. the .csv format
Another side note for Excel: Microsoft's .csv handling is somewhat confusing and contradicts the RFC standard (RFC 4180). When you export a .csv from Excel, the numbers are always dumped using the regional settings. To avoid confusion with delimiters, Excel uses a different character (usually a semicolon) as the delimiter if the decimal separator is a comma.
The delimiter used is the one set as the list separator in the operating system's regional settings; in .NET it can be retrieved via the CultureInfo.TextInfo.ListSeparator property.
I find this solution from Microsoft quite unfortunate, as .csv files dumped with one set of regional settings cannot always be read on another computer, and this has been causing trouble for decades.
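To see which delimiter applies on a given machine, the list separator can be read as follows. This is a small sketch with made-up helper names; the values returned depend on the operating system and its regional data, so none are shown here.

```csharp
using System.Globalization;

public static class ListSeparatorSketch
{
    // List separator of the culture the current thread runs under.
    public static string CurrentSeparator()
    {
        return CultureInfo.CurrentCulture.TextInfo.ListSeparator;
    }

    // List separator of a named culture, e.g. "de-DE" or "en-US".
    public static string SeparatorFor(string cultureName)
    {
        return CultureInfo.GetCultureInfo(cultureName).TextInfo.ListSeparator;
    }
}
```

Comparing the result for the exporting and the importing machine is a quick way to check whether a file written on one will split correctly on the other.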

c#.net regex to remove certain non ascii chars does not work

I'm a newbie to .NET and I use a script task in SSIS. I am trying to load a file into a database, and the file has some characters like the ones below. This looks like data copied from Word, where - has turned into –
Sample text:
Correction – Spring Promo 2016
Notepad++ shows:
I used the regex [^\x00-\x7F] in the .NET script, but even though the character seems to fall within the range, it gets replaced. I do not want these characters to be altered. What am I missing here?
If I don't replace them I get a truncation error, as I believe these characters take more than one byte.
Edit: I added sample rows. First two rows have problem and last two are okay.
123|NA|0|-.10000|Correction – Spring Promo 2016|.000000|gift|2013-06-29
345|NA|1|-.50000|Correction–Spring Promo 2011|.000000|makr|2012-06-29
117|ER|0|12.000000|EDR - (WR) US STATE|.000000|TEST MARGIN|2016-02-30
232|TV|0|.100000|UFT / MGT v8|.000000|test. second|2006-06-09
After good long weekend :) I am beginning to think that this is due to code page error. The exact error message when loading the flat file is as below.
Error: Data conversion failed. The data conversion for column "NAME" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
This is what I do in my ssis package.
Script task that validates the flat files.
The only validation that affects the contents of the file is checking that the number of delimited columns matches what it should be for that file. I need to read each line and, if there is an extra pipe delimiter (user entry), remove that line from the file and log it to a custom table.
Using the StreamWriter class, I write all the valid lines to a temp file and rename/move the file at the end.
Apologies, but I have just noticed that this process changes all such lines above to something like this.
Notepad: Correction � Spring Promo 2016
How do I stop my script task from doing this? (That would be the real solution.)
If that's not easy, option 2 is:
My connection managers are a flat file source and an OLEDB destination. The OLEDB destination uses the default code page, which is 1252. If these characters have no match in code page 1252, what should I be using? Are there any other workarounds without changing the code page?
Script task:
foreach (string file in files) // ... some other checks
{
    var tFile = Path.GetTempFileName();
    using (StreamReader rFile = new StreamReader(file))
    using (var swriter = new StreamWriter(tFile))
    {
        string line;
        while ((line = rFile.ReadLine()) != null)
        {
            NrDelimtrInLine = line.Count(x => x == '|') + 1;
            if (columnCount == NrDelimtrInLine)
            {
                swriter.WriteLine(line);
            }
        }
    }
}
Thank you so much.
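One likely cause of the � in the rewritten file, assuming the source file is saved in an ANSI code page such as Windows-1252: StreamReader and StreamWriter default to UTF-8 when no encoding is passed, so a single byte that is not valid UTF-8 is decoded to the replacement character and then written out as such. The sketch below passes the same explicit encoding to both sides. The class, method and parameter names are made up for illustration, and the code page of the real files is an assumption that should be verified.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class EncodingPreservingFilter
{
    // Copies only the lines that have the expected number of pipe-delimited
    // columns, reading and writing with the same explicit encoding so that
    // non-ASCII characters are not replaced on the way through.
    public static void CopyValidLines(string source, string target,
                                      int columnCount, Encoding encoding)
    {
        using (var reader = new StreamReader(source, encoding))
        using (var writer = new StreamWriter(target, false, encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int columns = line.Count(ch => ch == '|') + 1;
                if (columns == columnCount)
                {
                    writer.WriteLine(line);
                }
            }
        }
    }
}
```

In a script task this might be called with Encoding.GetEncoding(1252) as the last argument. On .NET Core and later, that code page is only available after registering CodePagesEncodingProvider.Instance through Encoding.RegisterProvider; on the classic .NET Framework it should be available out of the box.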
It's not clear to me what you intend since "I do not want these characters to be altered" seems mutually exclusive with "they must be replaced to avoid truncation". I would need to see the code to give you further advice.
In general I recommend always testing your regex patterns outside of code first. I usually use http://regexr.com
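The pattern from the question can also be checked in a few lines of C#. This is only a sketch with made-up helper names; it shows that [^\x00-\x7F] matches any character outside the ASCII range, which includes the en dash (U+2013) in the sample rows but not the plain hyphen.

```csharp
using System.Text.RegularExpressions;

public static class NonAsciiSketch
{
    // Matches a single character outside the 7-bit ASCII range.
    private static readonly Regex NonAscii = new Regex(@"[^\x00-\x7F]");

    public static bool HasNonAscii(string text)
    {
        return NonAscii.IsMatch(text);
    }

    // Replaces every non-ASCII character with the given replacement text.
    public static string ReplaceNonAscii(string text, string replacement)
    {
        return NonAscii.Replace(text, replacement);
    }
}
```

HasNonAscii is true for the first two sample rows and false for the last two, so a replace based on this pattern will alter exactly the rows that contain the en dash.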
If you want to match your special characters:
If you want to match anything except your special characters:

How To Parse a Scientific Notation Value Back To Original Value When Reading CSV In C#

I am working on a weather map application that imports csv files via a url. The date fields inside the csv are stored as a string like 201601280330, which would be today's date 1/28/2016 3:30. However, when reading the csv via a streamreader the value is coming back in scientific notation like 2.01601E+11. Nothing I have tried seems to return the whole value 201601280330.
For example:
var date = "2.01601E+11";
var d = Decimal.Parse(date, System.Globalization.NumberStyles.Float);
returns 201601000000, chopping off the remaining part of the string. Does anyone know a way to return the full value? When I save the csv locally as a text file, the correct value 201601280330 is saved rather than 2.01601E+11. Is there any way to get this in the code without having to save first? I am using the code below to read the csv file.
public static DataTable GetDataTableFromCSVFile(string path)
{
    DataTable dt = new DataTable();
    try
    {
        using (StreamReader sr = new StreamReader(path))
        {
            // First line holds the column names
            string[] headers = sr.ReadLine().Split(',');
            foreach (string header in headers)
            {
                if (header.Contains("ERROR:"))
                {
                    return null;
                }
                dt.Columns.Add(header);
            }
            // Remaining lines are data rows
            while (!sr.EndOfStream)
            {
                string[] rows = sr.ReadLine().Split(',');
                DataRow dr = dt.NewRow();
                for (int i = 0; i < headers.Length; i++)
                {
                    dr[i] = rows[i];
                }
                dt.Rows.Add(dr);
            }
        }
    }
    catch (Exception)
    {
        // The posted snippet is cut off after the try block;
        // returning null on failure is an assumption.
        return null;
    }
    return dt;
}
EDIT
Although this does not provide a technical answer to the problem I was experiencing, it might provide a better explanation and help others if they run into this. Some of the example csv files sent to me for testing had been edited in Excel to make the csv smaller and then saved. After further testing, it looks like the issue occurs only for the csv files edited in Excel. Tentatively, this solves the problem for me since we won't be downloading the csv files and editing them in Excel, but rather pulling them straight from a url. The code I posted above will read the correct value without notation, while for the csv edited in Excel it will only read the value with scientific notation. Unless I am wrong, I assume Excel must add something to edited csv files that prevents the value from being read correctly.
My original answer questioned whether you were using Excel, but as you hadn't mentioned that in your question I was rightly told that I was off topic, so I changed it. Now that you have provided a follow-up answer that does mention Excel I have changed it back, here is what I wrote originally:
Once the value has been converted into scientific notation it cannot be converted back. It is a limitation of Excel that is performing this conversion. If, when importing the data you choose a column type of Text (rather than General), then the data will be imported verbatim and Excel won't convert it into scientific notation.
As I suspected, it is Excel (not your code) that is changing the numerical data into scientific notation. I have seen this problem many times and suggest people DO NOT open CSV files using Excel. If you have to, then import rather than open so you can specify the data types of your numerical columns.
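A small sketch of what this means on the C# side, with made-up helper names. As long as the field still holds the original text 201601280330, it can be parsed directly as a timestamp. Once the file contains 2.01601E+11, the missing digits are not in the file any more, so the most the code can do is detect the damaged value.

```csharp
using System;
using System.Globalization;

public static class TimestampFieldSketch
{
    // Parses an intact field such as 201601280330 (yyyyMMddHHmm).
    public static DateTime ParseStamp(string field)
    {
        return DateTime.ParseExact(field, "yyyyMMddHHmm", CultureInfo.InvariantCulture);
    }

    // True if the field looks like it was rewritten in scientific notation.
    public static bool LooksScientific(string field)
    {
        return field.IndexOfAny(new[] { 'E', 'e' }) >= 0;
    }

    // Expands scientific notation; only the digits present in the text
    // come back, the rest are zeros.
    public static decimal Expand(string field)
    {
        return decimal.Parse(field, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```

Checking LooksScientific while loading and rejecting such files early is one way to catch a csv that has been through Excel before it produces wrong dates.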

creating a difference file from .csv files

I am creating an application which converts a MS Access table and an Excel sheet to .csv files and then differences the access table with the excel sheet. The .csv files are fine but the resulting difference file has errors in fields that contain html (the access table has fields with the html). I'm not sure if this is a special character issue because the special characters were not an issue in creating the .csv file in the first place, or if it is an issue with the way I am differencing the two files.
Part of the problem I suppose could be that in the access .csv file, the fields that contain the html are formatted so that some of the information is on separate lines instead of all on one line, which could be throwing off the reader, but I don't know how to correct this issue.
This is the code for creating the difference file:
string destination = Form2.destination;
string path = Path.Combine(destination, "en-US-diff.csv");
string difFile = path;
if (File.Exists(difFile))
{
    File.Delete(difFile);
}
using (var wtr = new StreamWriter(difFile))
{
    // Create the IEnumerable data sources
    string[] access = System.IO.File.ReadAllLines(csvOutputFile);
    string[] excel = System.IO.File.ReadAllLines(csvOutputFile2);
    // Create the query
    IEnumerable<string> differenceQuery = access.Except(excel);
    // Execute the query
    foreach (string s in differenceQuery)
    {
        wtr.WriteLine(s);
    }
}
The issue is physical line versus logical line. One solution is to use a sentinel, which is simply an arbitrary string token selected in such a way that it cannot confound the parsing process, for example "##||##".
When the input files are created, add the sentinel to the end of each line...
1,1,1,1,1,1,##||##
Going back to your code, System.IO.File.ReadAllLines(csvOutputFile); splits on line breaks, in effect using the newline as its sentinel. This means that you need to replace this statement with the following (pseudo code)...
const string sentinel = "##||##";
string myString = File.ReadAllText("myFileName.csv");
string[] access = myString.Split(new string[]{sentinel},
StringSplitOptions.RemoveEmptyEntries);
At that point you will have the CSV lines in your 'access' array the way you wanted as a collection of 'logical' lines.
To make things further conformant, you would also need to execute this statement on each line of your array...
line = line.Replace(Environment.NewLine, String.Empty).Trim();
That will remove the culprits and allow you to parse the CSV using the methods you have already developed. Of course this statement could be combined with the IO statements in a LINQ expression if desired.
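The combined LINQ form mentioned above could look roughly like this. It is a sketch under the assumption that both files were written with the "##||##" sentinel after every record; the class and method names are made up, and it strips both "\r\n" and "\n" rather than only Environment.NewLine so that files written on another platform compare the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LogicalLineSketch
{
    public const string Sentinel = "##||##";

    // Splits the whole file text into logical records and removes the
    // line breaks that were embedded inside a record.
    public static List<string> SplitLogicalLines(string text)
    {
        return text
            .Split(new[] { Sentinel }, StringSplitOptions.RemoveEmptyEntries)
            .Select(l => l.Replace("\r\n", string.Empty)
                          .Replace("\n", string.Empty)
                          .Trim())
            .Where(l => l.Length > 0)
            .ToList();
    }

    // Records that appear in the first text but not in the second.
    public static IEnumerable<string> Difference(string accessText, string excelText)
    {
        return SplitLogicalLines(accessText).Except(SplitLogicalLines(excelText));
    }
}
```

The two arguments would come from File.ReadAllText on the two .csv files, and the result can be written out with the existing StreamWriter loop.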

Using FileHelpers without a type

I have a CSV file that is being exported from another system whereby the column orders and definitions may change. I have found that FileHelpers is perfect for reading csv files, but it seems you cannot use it unless you know the ordering of the columns before compiling the application. I want to know if it's at all possible to use FileHelpers in a non-typed way. Currently I am using it to read the file but then everything else I am doing by hand, so I have a class:
[DelimitedRecord(",")]
public class CSVRow
{
public string Content { get; set; }
}
Which means each row is within Content, which is fine, as I have then split the row etc, however I am now having issues with this method because of commas inherent within the file, so a line might be:
"something",,,,0,,1,,"something else","","",,,"something, else"
My simple split on commas doesn't work on this string, as there is a comma in "something, else" which gets split. Obviously this is where something like FileHelpers comes in really handy, parsing these values and taking the quote marks into consideration. So is it possible to use FileHelpers in this way, without having a known column definition, or at least being able to pass it a csv string and get a list of values back, or is there any good library that does this?
You can use FileHelpers' RunTime records if you know (or can deduce) the order and definitions of the columns at runtime.
Otherwise, there are lots of questions about CSV libraries, eg Reading CSV files in C#
Edit: updated link. Original is archived here
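As one example of the "pass it a csv string and get a list of values back" approach, here is a sketch that uses TextFieldParser from the Microsoft.VisualBasic assembly instead of FileHelpers. This is a swapped-in technique, not the FileHelpers API, and the class and method names are made up. On the classic .NET Framework the project needs a reference to Microsoft.VisualBasic.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class UntypedCsvSketch
{
    // Parses a single CSV line into its values without any record class.
    // Quoted values may contain commas; the quotes themselves are removed.
    public static string[] ParseLine(string csvLine)
    {
        using (var parser = new TextFieldParser(new StringReader(csvLine)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing left to read
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

For the example line from the question this should return 14 values, with "something, else" kept as a single value. The column meaning can then be decided at runtime, for instance from a header row.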
