C# .NET regex to remove certain non-ASCII chars does not work

I'm a newbie to .NET, and I use a Script Task in SSIS. I am trying to load a file into a database that has some characters like the ones below. It looks like data copied from Word, where - has turned into –.
Sample text:
Correction – Spring Promo 2016
Notepad++ shows the dash as an en dash (U+2013).
I used the regex [^\x00-\x7F] in the .NET script, but the character gets replaced even though I thought it fell within the range. I do not want these characters to be altered. What am I missing here?
If I don't replace them, I get a truncation error, as I believe these characters take more than one byte.
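For reference, a minimal sketch (my own example, built from the sample text above) of why the en dash disappears: U+2013 is outside \x00-\x7F, so the negated character class matches it and Replace strips it.

using System;
using System.Text.RegularExpressions;

string s = "Correction – Spring Promo 2016";
// [^\x00-\x7F] matches every character OUTSIDE the ASCII range, including –
string stripped = Regex.Replace(s, @"[^\x00-\x7F]", "");
Console.WriteLine(stripped); // prints: Correction  Spring Promo 2016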
Edit: I added sample rows. First two rows have problem and last two are okay.
123|NA|0|-.10000|Correction – Spring Promo 2016|.000000|gift|2013-06-29
345|NA|1|-.50000|Correction–Spring Promo 2011|.000000|makr|2012-06-29
117|ER|0|12.000000|EDR - (WR) US STATE|.000000|TEST MARGIN|2016-02-30
232|TV|0|.100000|UFT / MGT v8|.000000|test. second|2006-06-09
After a good long weekend :) I am beginning to think that this is due to a code page error. The exact error message when loading the flat file is below:
Error: Data conversion failed. The data conversion for column "NAME" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
This is what I do in my SSIS package:
1. A Script Task validates the flat files. The only validation that affects the contents of the file is a check that the number of delimited columns in each line matches what it should be for that file. I need to read each line; if a line has an extra pipe delimiter (user entry), I remove that line from the file and log it to a custom table.
2. Using the StreamWriter class, I write all the valid lines to a temp file and rename/move the file at the end.
Apologies, but I have just noticed that this process changes all such lines above to something like this.
Notepad: Correction � Spring Promo 2016
How do I stop my script task from doing this? (Which would be the proper solution.)
If that's not easy, option 2:
My connection managers are a Flat File source and an OLE DB destination. The OLE DB destination uses the default code page, which is 1252. If these characters have no match in code page 1252, what should I be using? Are there any other workarounds that don't involve changing the code page?
Script task:
foreach (string file in files) // ... plus some other checks
{
    var tFile = Path.GetTempFileName();
    using (StreamReader rFile = new StreamReader(file))
    using (var swriter = new StreamWriter(tFile))
    {
        string line;
        while ((line = rFile.ReadLine()) != null)
        {
            // Count the pipe-delimited columns on this line.
            NrDelimtrInLine = line.Count(x => x == '|') + 1;
            if (columnCount == NrDelimtrInLine)
            {
                swriter.WriteLine(line);
            }
        }
    }
}
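For what it's worth, here is a minimal sketch of the same loop with the encoding pinned on both streams, using the same variables as above. The choice of Windows-1252 is an assumption on my part (match it to the file's real encoding); by default StreamReader assumes UTF-8, which turns 1252-encoded characters like – into the � replacement character.

var enc = System.Text.Encoding.GetEncoding(1252); // assumption: the file really is Windows-1252
using (var rFile = new StreamReader(file, enc))
using (var swriter = new StreamWriter(tFile, false, enc))
{
    string line;
    while ((line = rFile.ReadLine()) != null)
    {
        // Same column-count check as above, but characters now round-trip unchanged.
        if (columnCount == line.Count(x => x == '|') + 1)
            swriter.WriteLine(line);
    }
}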
Thank you so much.

It's not clear to me what you intend, since "I do not want these characters to be altered" seems mutually exclusive with "they must be replaced to avoid truncation". I would need to see the code to give you further advice.
In general, I recommend always testing your regex patterns outside of code first. I usually use http://regexr.com
If you want to match your special characters: [^\x00-\x7F] (anything outside the 7-bit ASCII range).
If you want to match anything except your special characters: [\x00-\x7F].
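As a quick sketch (my own example string, not from the question):

using System.Text.RegularExpressions;

string text = "Correction – Spring Promo 2016";
var special = Regex.Matches(text, @"[^\x00-\x7F]"); // matches only the special characters (here: –)
var ascii = Regex.Matches(text, @"[\x00-\x7F]");    // matches everything except the special characters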

Related

Field and text delimiters within cells in csv files

This is likely a very basic question that I could not, despite trying, find a satisfying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my Unity projects. After some initial research I concluded it would be best to use a .csv file read by a StreamReader, so that translators would only ever have to interact with the CSV table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for line breaks and special characters inside the actual fields. As such, I could not use the normal ReadLine() method.
I worked around this by using Read() and checking whether a line break falls within a text-delimiter bracket. But since I check for the text delimiter, I am afraid it might run into an unescaped delimiter that is part of the normal in-cell text (since the normal text delimiter is a quotation mark).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as the text delimiter in OpenOffice Calc, probably due to encoding differences. Which is annoying, but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried looking at the files in Notepad++, revealing a difference in line breaks (\r instead of \r\n), and obviously it's within a text-delimiter bracket, but when it comes to how it separates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.
From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
      A         B
   ---------   ---
1 | 1,",2",  |  3
   ---------   ---
2 | a        |  c
  | b        |
Calc correctly reads and saves the file as shown below. The settings when saving are Field delimiter , (comma) and String delimiter " (double quote), which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]

Replacing a word in a text file

I'm doing a little program where the data saved on some users is stored in a text file. I'm using System.IO with the StreamWriter to write new information to my text file.
The text in the file is formatted like so :
name1, 1000, 387
name2, 2500, 144
... and so on. I'm using infos = line.Split(',') to split the different values into an array that is more useful for searching purposes. I'm using a while loop to search for the correct line (where the name matches), and I return the number of points using infos[1].
I'd like to modify this infos[1] value and set it to something else. I'm trying to find a way to replace a word in C#, but I can't find a good way to do it. From what I've read, there is no way to replace a single word; you have to rewrite the complete file.
Is there a way to delete a line completely, so that I could rewrite it at the end of the text file and not have to worry about it being duplicated?
I tried using the Replace method, but it didn't work. I'm a bit lost looking at the answers proposed for similar problems, so I would really appreciate it if someone could explain what my options are.
If I understand you correctly, you can use the File.ReadLines method and LINQ to accomplish this. First, get the line you want:
var line = File.ReadLines("path")
    .FirstOrDefault(x => x.StartsWith("name1 or whatever"));
if (line != null)
{
    /* change the line */
}
Then write the new lines to your file, excluding the old line. Note that File.ReadLines is lazy, so materialize the result with ToList() before overwriting the same file, or the write will fail because the file is still open for reading:
var lines = File.ReadLines("path")
    .Where(x => !x.StartsWith("name1 or whatever"))
    .ToList();
var newLines = lines.Concat(new[] { line });
File.WriteAllLines("path", newLines);
The concept you are looking for is called "random access" for file reading/writing. Most of the easy-to-use I/O methods in C# are sequential access, meaning you read a chunk or a line and move forward to the next.
However, what you want to do is possible; you just need to read some tutorials on file streams. Here is a related SO question: .NET C# - Random access in text files - no easy way?
You are probably either reading the whole file, or reading it line-for-line as part of your search. If your fields are fixed length, you can read a fixed number of bytes, keep track of the Stream.Position as you read, know how many characters you are going to read and need to replace, and then open the file for writing, move to that exact position in the stream, and write the new value.
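A minimal sketch of that idea (the offset and values here are hypothetical, and it only works if the replacement has exactly the same byte length as the original):

using System.IO;
using System.Text;

using (var fs = new FileStream("data.txt", FileMode.Open, FileAccess.ReadWrite))
{
    long fieldPosition = 42; // byte offset of the field, tracked while reading (hypothetical)
    byte[] newValue = Encoding.ASCII.GetBytes("1500"); // must be the same length as the old value
    fs.Seek(fieldPosition, SeekOrigin.Begin);
    fs.Write(newValue, 0, newValue.Length);
}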
It's a bit complex if you are new to streams. If your file is not huge, copying a file line for line can be done pretty efficiently by the System.IO library if coded correctly, so you might just follow your second suggestion: read the file line for line, write it to a new stream (memory, temp file, whatever), replace the line in question when you get to it, and when done, replace the original.
It is most likely that you are new to C# and don't realize that strings are immutable (a fancy way of saying you can't change them). You can only get new strings by modifying the old:
String MyString = "abc 123 xyz";
MyString.Replace("123", "999"); // does not work: the result is discarded
MyString = MyString.Replace("123", "999"); // works
[Edit:]
If I understand your follow-up question, you could do this:
infos[1] = infos[1].Replace("1000", "1500");

String processing / CSV challenge

Having used SQL Server Bulk insert of CSV file with inconsistent quotes (the CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last char from quoted strings that contained a comma!]. So I rewrote that bit (maybe a mistake?).
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume that if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the regex... but it's WAY beyond me. What's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.
The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your CSV to XML? Then you would be able to validate your data against an XSD before storing it in a database.
To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this:
1. Scan until a quote (go to 2) or a comma (go to 3).
2. If the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
3. Emit the field, go to 1.
4. Scan until a quote (go to 5).
5. If the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a separator between the first quoted string "" and the second field 17.5179C)
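Here is one way that state machine might look in C# for a single line. This is my own rendering of the steps above, not the poster's code (it skips whitespace before an opening quote, and throws on stray text after a closing quote, which is what makes ""17.5179C,"" an error):

using System;
using System.Collections.Generic;
using System.Text;

static List<string> ParseCsvLine(string line)
{
    var fields = new List<string>();
    var field = new StringBuilder();
    bool inQuotes = false;
    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"')
            {
                if (i + 1 < line.Length && line[i + 1] == '"') { field.Append('"'); i++; } // "" -> literal "
                else
                {
                    inQuotes = false; // closing quote: only a comma or end of line may follow
                    if (i + 1 < line.Length && line[i + 1] != ',')
                        throw new FormatException("Separator expected after closing quote at position " + (i + 1));
                }
            }
            else field.Append(c); // commas inside quotes are literal
        }
        else if (c == '"')
        {
            if (field.ToString().Trim().Length > 0)
                throw new FormatException("Quote is not the first character of the field at position " + i);
            field.Clear(); // drop any leading whitespace before the quote
            inQuotes = true;
        }
        else if (c == ',') { fields.Add(field.ToString()); field.Clear(); }
        else field.Append(c);
    }
    if (inQuotes) throw new FormatException("Unterminated quoted field.");
    fields.Add(field.ToString());
    return fields;
}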
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
Edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.
I ended up using the CSV parser that I didn't know we already had (it comes as part of our code generation tool), and noting that ""17.5179C,"" is not valid and will cause errors.

.net program to parse .doc file

I want to create an application that can parse .doc/.docx files. The structure of the file is shown below:
par-000.01 - some content
par-000.21 - some content
par-000.31 - some content
par-001.32 - some content
The content can be multi-line and irregular. What I want to do is put this content into a database: for the first record, par-000.01 goes into the code column and some content goes into the text column. The reason I cannot do this manually is that I have about 15 docs, each containing about 10 pages of paragraphs, that I want to put into my database. I can't find any article on parsing a whole .doc file, but I believe it should be possible if I write a proper regular expression. Can anyone point me to an article on how to do this? I can't find anything that suits me; I'm probably using the wrong keywords.
Since you say you have a reasonable amount of data (15 docs * 10 pages/doc * ~100 lines/page = 15,000 lines, which is manageable in a Word document), and you did not say this is a repeating data feed (i.e., it's a one-time conversion), I would do it using an editor that supports global find-and-replace, and convert to Comma-Separated Values (CSV) format. Most databases I know can load a CSV file.
I know you asked for a C# app, but that is overkill in time and effort for your problem.
So:
1. Convert '<start of line>' to '<start of line>"'. In MS Word, with Find and Replace:
find: ^p
replace: ^&"
2. Convert ' - ' to '","'. In MS Word, with Find and Replace:
find: ' - ' (note: don't add the tick marks)
replace: ","
3. Convert '<end of line>' to '"<end of line>'. In MS Word, with Find and Replace:
find: ^p
replace: "^&
4. Manually fix up the start of the first line and the end of the last line.
You should get:
"par-000.01","some content"
"par-000.21","some content"
Now just load that into a DB using its CSV load.
Also if you insist on doing this with C#, then realize that you can probably save the text as a *.txt file without all of the Word tags and it will be much easier to take apart with a C# app. Don't get fixated on the Word tags, just side step the problem with creative thinking.
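If you do go the C#-on-plain-text route, here is a small sketch of the regex approach. It is my own illustration; the par-\d{3}\.\d{2} pattern and the file name are assumptions based on the samples above:

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Split a plain-text export into (code, text) pairs. A new record starts
// whenever a line matches "par-NNN.NN - ..."; any other line is treated
// as a continuation of the current record's text.
var records = new List<(string Code, string Text)>();
var header = new Regex(@"^(par-\d{3}\.\d{2})\s*-\s*(.*)$");
foreach (string line in File.ReadLines("export.txt"))
{
    var m = header.Match(line);
    if (m.Success)
        records.Add((m.Groups[1].Value, m.Groups[2].Value));
    else if (records.Count > 0)
    {
        var last = records[records.Count - 1];
        records[records.Count - 1] = (last.Code, last.Text + "\n" + line);
    }
}
// Each tuple can now be inserted into the code and text columns.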
You can automate parsing of Word documents (.doc or .docx) in C# using the GroupDocs.Parser for .NET API. The text can be extracted from the documents either line by line or as a whole. This is how you can do it:
// Extract all the text
WordsTextExtractor extractor = new WordsTextExtractor("sample.docx");
Console.Write(extractor.ExtractAll());

// OR extract the text line by line
string line = extractor.ExtractLine();
// If the line is null, the end of the file has been reached
while (line != null)
{
    // Print the line to the console
    Console.Write(line);
    // Extract another line
    line = extractor.ExtractLine();
}
Disclosure: I work as Developer Evangelist at GroupDocs.

Parsing a CSV file

We have an integration with another system that relies on passing CSV files back and forth (really old school).
The structure is generally:
ID, Name, PhoneNumber, comments, fathersname
1, tom, 555-1234, just some random text, bill
2, jill smith, 555-4234, other random text, richard
Every so often we see this:
3, jacked up, 999-1231, here
be dragons
amongst us, ted
The primary problem I care about is detecting that a line break (\n) occurs in the middle of a record, when that same character is also the record terminator.
Is there any way I can preprocess this to reliably fix it?
Note that we have zero control over what the other system emits.
So you should be able to do something more or less like this:
for (int i = 0; i < lines.Count; i++)
{
    var fields = lines[i].Split(',').ToList();
    while (fields.Count < numFields) // here be dragons amongst us
    {
        i++; // include the next line in this record
        if (i >= lines.Count) // make sure we haven't run out of lines
            throw new FormatException("Last record is missing fields.");
        // Combine the end of the previous field with the start of the next one,
        // and add the line break back in.
        var innerFields = lines[i].Split(',');
        fields[fields.Count - 1] += "\n" + innerFields[0];
        fields.AddRange(innerFields.Skip(1));
    }
    // We now know we have a "real" full record.
    processFields(fields);
}
(For simplicity I assumed all lines were read in at the start; I assume you could alter it to lazily fetch each line easily enough.)
Let me start by saying that the CSV file in your example is invalid: if a line break occurs inside a string, the string should be wrapped in double-quote characters.
Now for the answer. In order to parse this invalid CSV format you must make several assumptions. In this case I made two: 1) the ID column must be numeric; 2) the comment field cannot contain digits.
Based on these assumptions, you can check the first character after each line break. If it is a digit, assume it starts a new record; if not, treat it as a continuation of the comment field.
I don't know whether the second assumption is valid; if not, you can enhance the logic to cover the business rules of the system.
Good luck!
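A minimal sketch of that repair pass under those two assumptions (the file names are mine; the header line is safe because only lines after the first can be joined):

using System.Collections.Generic;
using System.IO;

// Join any line that does not start with a digit onto the previous line,
// per the assumptions above (numeric IDs, and no digits at the start of a
// wrapped comment fragment).
var fixedLines = new List<string>();
foreach (string raw in File.ReadLines("feed.csv"))
{
    if (fixedLines.Count > 0 && (raw.Length == 0 || !char.IsDigit(raw[0])))
        fixedLines[fixedLines.Count - 1] += " " + raw; // continuation of the previous record
    else
        fixedLines.Add(raw);
}
File.WriteAllLines("feed-fixed.csv", fixedLines);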
Firstly, I would recommend using a tool to manage reading and writing your CSV files; I use the FileHelpers library, which is great.
You can essentially type your records and it will do all the validation and such for you. Worth the effort.
To your question: perhaps you can do some preprocessing on the file and use a regex to replace any line breaks with a space?
I do something similar (though not with files); try:
line = line.Replace(Environment.NewLine, " ");
Note that Replace returns a new string, so assign the result back.
With FileHelpers you could write a custom converter to do this during processing, or hook into the BeforeRead event.
