Weird string behaviour, replaces parts of itself - c#

I can't seem to wrap my head around this. How can this become:
string.Format(@"https://www.dropbox.com/s/{0}/{1}?dl=1", dl[rndIndex], rndIndex);
this:
/3?dl=1/www.dropbox.com/s/s8ghw2mvld2jg0l
It's as if the part after {0} is shifted to the front and overwrites the beginning of the existing string...
Does anyone know what's going on here?
This is the entire code (pseudo):
string[] dl = new string[] { "...", "...", "..." };
int rndIndex = rnd.Next(0, dl.Length);
Console.WriteLine(string.Format(@"https://www.dropbox.com/s/{0}/{1}?dl=1", dl[rndIndex], rndIndex));
There's nothing wrong with dl[] and rndIndex, checked both of them.
This fixed the problem:
string s = dl[rndIndex];
s = s.Replace(((char)13).ToString(), "");
Which is what you suggested.

From the way it replaces the beginning of the URL, it seems that dl[rndIndex] contains a carriage return, which places the cursor back at the beginning of the line so that the rest of the output overwrites the https:/ part of the URL (which fits, as /3?dl=1 has the same length).
So your formatted string actually looks like this:
"https://www.dropbox.com/s/s8ghw2mvld2jg0l\r/3?dl=1"
                                          ^^
                                   carriage return
Now when that is printed to a console which supports carriage returns, it will print the first part https://www.dropbox.com/s/s8ghw2mvld2jg0l then set the cursor back to the beginning and print the rest /3?dl=1.
So you should basically strip all carriage returns from the string first. In any case, it seems that your dl array does not contain what you expect it to.
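A minimal sketch of that clean-up, assuming the entries picked up stray line breaks when they were loaded (for example from a file with Windows line endings). The class and method names are made up for illustration:

```csharp
using System;

public static class DropboxLinks
{
    // Removes carriage returns and line feeds left over in an entry.
    public static string StripLineBreaks(string value)
    {
        return value.Replace("\r", "").Replace("\n", "");
    }

    // Builds the URL from a cleaned entry and its index.
    public static string BuildUrl(string id, int index)
    {
        return string.Format("https://www.dropbox.com/s/{0}/{1}?dl=1", StripLineBreaks(id), index);
    }

    public static void Main()
    {
        // The trailing "\r" simulates the bad array element from the question.
        Console.WriteLine(BuildUrl("s8ghw2mvld2jg0l\r", 3));
    }
}
```

Cleaning the entries once, right after they are loaded, is probably simpler than cleaning them at every use.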

I can reproduce the exact problem by following @poke's comment:
string.Format("https://www.dropbox.com/s/{0}/{1}?dl=1", "hfjdhfjdh\r", 30)
will output:
/30?dl=1www.dropbox.com/s/hfjdhfjdh
The problem is a carriage return or new line character in your array elements.

Modifying your fragment as follows:
string[] dl = new string[] { "...", "...", "..." };
int rndIndex = 1; // rnd.Next(0, dl.Length);
Console.WriteLine(string.Format(@"https://www.dropbox.com/s/{0}/{1}?dl=1", dl[rndIndex], rndIndex));
Gives a correct answer.
https://www.dropbox.com/s/.../1?dl=1
So, two thoughts:
It's a data dependent problem
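Should the random call not read rnd.Next(0, dl.Length-1)? (No: the upper bound of Random.Next is exclusive, so rnd.Next(0, dl.Length) already stays within the array.)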

Related

How can I deal with parsing bad csv data?

I know that the data should be correct. I have no control over the data and my boss is just going to tell me that I need to figure out a way to deal with someone else's mistake. So please don't tell me it's not my problem that the data is bad, because it is.
Anywho, this is what I'm looking at:
"Words","email@email.com","","4253","57574","FirstName","","LastName, MD","","","576JFJD","","1971","","Words","Address","SUITE "A"","City","State","Zip","Phone","",""
Data has been scrubbed for confidentiality reasons.
So as you see, the data contains quotation marks, and there are commas inside some of the quoted fields, so I cannot simply remove the quotes. But the "SUITE "A"" field is throwing off the parser. There are too many quotation marks. >.<
I'm using the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace with these settings:
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
parser.TextFieldType = FieldType.Delimited;
The error is
MalformedLineException: Line 9871 cannot be parsed using the current
delimiters.
I would like to scrub the data somehow to account for this but I'm not sure how to do it. Or maybe there's a way to just skip this line? Although I suspect my higher ups will not approve of me just skipping data that we might need.
If you are only trying to get rid of the stray " marks in your csv, you can use the following regex to find them and replace them with ':
String sourcestring = "source string to match with pattern";
String matchpattern = @"(?<!^|,)""(?!(,|$))";
String replacementpattern = "'";
Console.WriteLine(Regex.Replace(sourcestring,matchpattern,replacementpattern,RegexOptions.Multiline));
Explanation:
@"(?<!^|,)""(?!(,|$))" will find any " that is not preceded by the beginning of the line or a , and that is not followed by the end of the line or a ,
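As a rough check, here is the same pattern applied to a shortened version of the sample line from the question. The class and method names are only for illustration, and the input is assumed to be a single line without a trailing carriage return:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteScrubber
{
    // Replaces every double quote that is neither at a field boundary
    // (start of line, end of line, next to a comma) with a single quote.
    public static string ReplaceStrayQuotes(string csvText)
    {
        return Regex.Replace(csvText, @"(?<!^|,)""(?!,|$)", "'", RegexOptions.Multiline);
    }

    public static void Main()
    {
        string line = "\"Address\",\"SUITE \"A\"\",\"City\"";
        // The quotes around A are the only ones that get replaced.
        Console.WriteLine(ReplaceStrayQuotes(line));
    }
}
```

In .NET, $ with RegexOptions.Multiline only matches before "\n", so with Windows line endings it may be worth stripping "\r" first; otherwise the closing quote of the last field on each line would count as stray.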
I am not familiar with TextFieldParser. However, with CsvHelper, you can add a custom handler for invalid data:
var config = new CsvConfiguration();
config.IgnoreReadingExceptions = true;
config.ReadingExceptionCallback += (e, row) =>
{
    // you can add some custom patching here if possible
    // or, save the line numbers and add/edit them manually later.
};
using(var file = File.OpenText(".csv"))
using(var reader = new CsvReader(file, config))
{
    // GetRecords is lazy; ToList (System.Linq) forces the rows to be read
    var records = reader.GetRecords<YourDtoClass>().ToList();
}
My only addition to what everyone is saying (because we've all been there) is to try to fix each new issue you encounter in code. There are some decent regex strings out there (https://www.google.com/?ion=1&espv=2#q=c-sharp+regex+csv+clean), or you could manually fix things using String.Replace (String.Replace("\"\"\"","").Replace("\"\","").Replace("\",,","\",") or such). Eventually, as you detect and find ways of correcting more and more mistakes, the number of lines you have to recover manually will shrink substantially (most of your bad data will likely come from similar mistakes). Cheers!
PS - Idea-ish (it's been a while, and the logic may need some tweaking as I'm writing from memory), but you'll get the gist:
public string[] parseCSVWithQuotes(string csvLine,int expectedNumberOfDataPoints)
{
    string ret = "";
    string thisChar = "";
    string lastChar = "";
    bool needleDown = true;
    for(int i = 0; i < csvLine.Length; i++)
    {
        thisChar = csvLine.Substring(i, 1);
        if (thisChar == "'"&&lastChar!="'")
            needleDown = needleDown == true ? false : true;//when needleDown = true, characters are treated literally
        if (thisChar == ","&&lastChar!=",") {
            if (needleDown)
            {
                ret += "|";//convert literal comma to pipe so it doesn't cause another break on split
            }else
            {
                ret += ",";//break on split is intended because the comma is outside the single quote
            }
        }
        if (!needleDown && (thisChar == "\"" || thisChar == "*")) {//repeat for any undesired character or use RegEx
            //do not add -- this eliminates any undesired characters outside single quotes
        }
        else
        {
            if ((lastChar == "'" || lastChar == "\"" || lastChar == ",") && thisChar == lastChar)
            {
                //do not add - this eliminates double characters
            }else
            {
                ret += thisChar;
                lastChar = thisChar;
                //this character is not an undesired character, is not a double, is valid.
            }
        }
    }
    //we've cleaned as best we can
    string[] parts = ret.Split(',');
    if(parts.Length==expectedNumberOfDataPoints){
        for(int i = 0; i < parts.Length; i++)
        {
            //go back and replace the temporary pipe with the literal comma AFTER split
            parts[i] = parts[i].Replace("|", ",");
        }
        return parts;
    }else{
        //save ret to bad CSV log
        return null;
    }
}
I've had to do this before.
The first step is to parse the data using string.Split(',').
The next step is to combine the segments that belong together.
What I essentially did was
make a new list representing the combined strings
if a string begins with a quote, push it onto your new list
if it does not begin with a quote, append it to the last string in your list
Bonus: throw exceptions when a string ends with a quote but the next one does not begin with a quote
Depending on what the rules are regarding what can actually appear in your data, you might have to change your code to account for that.
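The split-then-merge idea above could look roughly like this. It is only a sketch: it assumes every field is wrapped in quotes, as in the sample line, and the class and method names are invented for the example:

```csharp
using System;
using System.Collections.Generic;

public static class CsvMerge
{
    // Splits on every comma, then glues back together the pieces that
    // do not start with a quote, since those belong to the previous field.
    public static List<string> SplitAndMerge(string line)
    {
        var fields = new List<string>();
        foreach (string piece in line.Split(','))
        {
            if (piece.StartsWith("\"") || fields.Count == 0)
            {
                fields.Add(piece);
            }
            else
            {
                // Restore the comma that Split removed.
                fields[fields.Count - 1] += "," + piece;
            }
        }
        return fields;
    }

    public static void Main()
    {
        string line = "\"LastName, MD\",\"\",\"SUITE \"A\"\",\"City\"";
        foreach (string field in SplitAndMerge(line))
        {
            Console.WriteLine(field);
        }
    }
}
```

A comma inside a field that is immediately followed by a quote would still be split incorrectly, so this only covers data shaped like the sample.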
At the core of CSV's file format, each line is a row, each cell in that row is separated by a comma. In your case, your format also contains the (very unfortunate) stipulation that commas inside a pair of quotation marks do not count as separators and are instead part of the data. I say very unfortunate because a misplaced quotation mark affects the entire rest of the line, and since quotation marks in standard ASCII do not distinguish between open and closed, there really is nothing you can do to recover from this without knowing the original intent.
That is when you log a message in a way that the person who does know the original intent (the person that provided the data) can look at the file and correct the error:
if (parse_line(line, &data)) {
    // save the data
} else {
    // log the error
    fprintf(stderr, "Bad line: %s", line);
}
And since your quotation marks aren't escaping newlines, you can keep on going with the next line after running into this error.
ADDENDUM: And if your company has a choice (i.e. your data is being serialized by a company tool) don't use CSV. Use something like XML or JSON with a much more clearly defined parsing mechanism.
I had to do this once as well. My approach was to go through a line and keep track of what I was reading.
Basically, I coded my own scanner, chopping off tokens from the input line, which gave me full control over my faulty .csv data.
This is what I did:
For each character on a line of input:
1. when outside of a string, meeting a comma => all of the previous string (which can be empty) is a valid token.
2. when outside of a string, meeting anything but a comma or a quote => now you have a real problem, unquoted text => handle as you see fit.
3. when outside of a string, meeting a quote => found the start of a string.
4. when inside of a string, meeting a comma => accept the comma as part of the string.
5. when inside of a string, meeting a quote => trouble starts here, mark this point.
6. continue, and when meeting a comma (skipping white space if desired), close the string, 'unread' the comma and continue. (That will bring you to point 1.)
7. or continue, and when meeting a quote => obviously, what was read must be part of the string; add it to the string, 'unread' the quote and continue. (That will bring you to point 5.)
8. or continue and find whitespace, then End Of Line ('\n') => the last quote must be the closing quote. Accept the string as a value.
9. or continue and find non-whitespace, then End Of Line => now you have a real problem: you have the start of a string but it is not closed => handle the error as you see fit.
If the number of fields in your .csv file is fixed, you can count the commas you recognise as field separators, and when you see an End Of Line you know whether you have another problem or not.
With the stream of strings received from the input line you can build a 'clean' .csv line and this way build a buffer of accepted and cleaned input that you can use in your already existing code.
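A simplified sketch of such a scanner is below. It is not the exact procedure listed above: it skips the whitespace handling and treats a quote inside a string as the closing quote only when a comma or the end of the line follows directly. The names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LenientCsv
{
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (!inQuotes)
            {
                if (c == ',')
                {
                    // A comma outside a string ends the current field.
                    fields.Add(current.ToString());
                    current.Clear();
                }
                else if (c == '"' && current.Length == 0)
                {
                    inQuotes = true; // start of a quoted string
                }
                else
                {
                    current.Append(c); // unquoted text, kept as-is in this sketch
                }
            }
            else if (c == '"')
            {
                bool atEnd = i + 1 == line.Length;
                bool beforeComma = !atEnd && line[i + 1] == ',';
                if (atEnd || beforeComma)
                {
                    inQuotes = false; // real closing quote
                }
                else
                {
                    current.Append(c); // stray quote, treated as data
                }
            }
            else
            {
                current.Append(c); // includes commas inside the string
            }
        }

        if (inQuotes)
        {
            throw new FormatException("Unterminated quoted field: " + line);
        }
        fields.Add(current.ToString());
        return fields;
    }

    public static void Main()
    {
        foreach (string field in ParseLine("\"LastName, MD\",\"SUITE \"A\"\",\"\""))
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```

If the field count is fixed, comparing the length of the returned list against the expected count gives the extra sanity check mentioned above.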

How to strip a string from the point a hyphen is found within the string C#

I'm currently trying to strip a string of data that may contain the hyphen symbol.
E.g. Basic logic:
string stringin = "test - 9894"; OR Data could be == "test";
if (string contains a hyphen "-"){
Strip stringin;
output would be "test" deleting from the hyphen.
}
Console.WriteLine(stringin);
The current C# code I'm trying to get to work is shown below:
string Details = "hsh4a - 8989";
var regexItem = new Regex("^[^-]*-?[^-]*$");
string stringin;
stringin = Details.ToString();
if (regexItem.IsMatch(stringin)) {
    stringin = stringin.Substring(0, stringin.IndexOf("-") - 1); //Strip from the ending chars and - once - is hit.
}
Details = stringin;
Console.WriteLine(Details);
But it throws an error when the string does not contain any hyphens.
How about just doing this?
stringin.Split('-')[0].Trim();
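You could even limit the maximum number of substrings using an overload of the Split method (a count of 2 stops splitting after the first hyphen; a count of 1 would return the whole string).
stringin.Split(new[] { '-' }, 2)[0].Trim();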
Your regex is asking for "zero or one repetition of -", which means that it matches even if your input does NOT contain a hyphen. Thereafter you do this
stringin.Substring(0, stringin.IndexOf("-") - 1)
Which gives an index out of range exception (There is no hyphen to find).
Make a simple change to your regex and it works with or without - ask for "one or more hyphens":
var regexItem = new Regex("^[^-]*-+[^-]*$");
here -----------------------------^
It seems that you want the substring after the dash ('-') if the original string contains one, or the original string if it doesn't have a dash.
If that's the case:
String Details = "hsh4a - 8989";
Details = Details.Substring(Details.IndexOf('-') + 1);
I wouldn't use regex for this case if I were you, it makes the solution much more complex than it can be.
For a string that I am sure will have no more than a couple of dashes, I would use this code, because it is a one-liner and very simple:
string str= entryString.Split(new [] {'-'}, StringSplitOptions.RemoveEmptyEntries)[0];
If you know that a string might contain a large number of dashes, this approach is not recommended: it will create a large number of strings, although you are looking for just the first one. In that case, the solution would look something like this:
int firstDashIndex = entryString.IndexOf("-");
string str = firstDashIndex > -1? entryString.Substring(0, firstDashIndex) : entryString;
You don't need a regex for this. A simple IndexOf call will give you the index of the hyphen, and you can clean it up from there.
This is also a great place to start writing unit tests. They are very good for stuff like this.
Here's what the code could look like:
string inputString = "ho-something";
string outPutString = inputString;
var hyphenIndex = inputString.IndexOf('-');
if (hyphenIndex > -1)
{
    outPutString = inputString.Substring(0, hyphenIndex);
}
return outPutString;

Find and replace file lines

I have a text file with over 12,000 lines. In that file I need to replace certain lines.
Some lines begin with a ;, some have random words, some start with space. However, I am only concerned with the two types of lines I describe below.
I have a line like
SET avariable:0 ;Comments
and I need to replace it to look like
set aDIFFvariable:0 :Integer // comments
The only case requirement is that the I in the word Integer needs to be capitalized.
I also have
String aSTRING(7) ;Comment
that needs to look like
STRING aSTRING(7) :array [0..7] of AnsiChar; // Comments
I need to keep all the spacing the same.
Here is what I have so far
static void Main(string[] args)
{
    string text = File.ReadAllText("C:\\old.txt");
    text = text.Replace("old text", "new text");
    File.WriteAllText("C:\\new.txt", text);
}
I think I need to use REGEX, which I have tried to make for my first example:
\s\s[set]\s*{4}.*[:0]\s*[;].* <-- I now know this is invalid - please advise
I need help with properly setting up my program to find and replace those lines. Should I read one line at a time and if it matches then do something? I am confused really as to where to start.
BRIEF pseudo code of what I want to do
//open file
//step through file
//if line == [regex] then add/replace as needed
//else, go to next line
//if EOF, close file
Taking a stab at this separately because each line is so radically different that capturing both in the same expression will be a nightmare.
To match your first example and replace it:
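String input = "SET avariable:0 ;Comments";
// IgnoreCase is needed because the pattern is lowercase while the sample line uses "SET"
if (Regex.IsMatch(input, @"\s?(set)\s*(\w+):?(\d)\s+;?(.*)?", RegexOptions.IgnoreCase))
{
    input = Regex.Replace(input, @"\s?(set)\s*(\w+):?(\d)\s+;?(.*)?", "$1 $2:$3 :Integer // $4", RegexOptions.IgnoreCase);
}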
Give that a shot (Play with it here: http://regex101.com/r/zY7hV2)
To match your second example and replace it:
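String input = "String aSTRING(7) ;Comments";
// IgnoreCase is needed because the pattern is lowercase while the sample line uses "String"
if (Regex.IsMatch(input, @"\s?(string)\s*(\w+)\((\d)\)\s*;(.*)", RegexOptions.IgnoreCase))
{
    input = Regex.Replace(input, @"\s?(string)\s*(\w+)\((\d)\)\s*;(.*)", "$1 $2($3) :array [0..$3] of AnsiChar; // $4", RegexOptions.IgnoreCase);
}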
And play around with this one here: http://regex101.com/r/jO5wP5
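To tie this back to the pseudo code in the question, here is one possible way to run both replacements over a file line by line. It is a sketch that reuses the two patterns above unchanged (so it does not yet preserve the original spacing); the class name and the command-line arguments are assumptions:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineConverter
{
    const string SetPattern = @"\s?(set)\s*(\w+):?(\d)\s+;?(.*)?";
    const string StringPattern = @"\s?(string)\s*(\w+)\((\d)\)\s*;(.*)";

    // Rewrites a line if it matches one of the two patterns; otherwise returns it unchanged.
    public static string ConvertLine(string line)
    {
        if (Regex.IsMatch(line, SetPattern, RegexOptions.IgnoreCase))
        {
            return Regex.Replace(line, SetPattern, "$1 $2:$3 :Integer // $4", RegexOptions.IgnoreCase);
        }
        if (Regex.IsMatch(line, StringPattern, RegexOptions.IgnoreCase))
        {
            return Regex.Replace(line, StringPattern, "$1 $2($3) :array [0..$3] of AnsiChar; // $4", RegexOptions.IgnoreCase);
        }
        return line;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(ConvertLine("SET avariable:0 ;Comments"));

        // Usage (hypothetical): LineConverter.exe C:\old.txt C:\new.txt
        if (args.Length == 2)
        {
            // ReadLines streams the input, so the whole file is never held in memory.
            File.WriteAllLines(args[1], File.ReadLines(args[0]).Select(line => ConvertLine(line)));
        }
    }
}
```

Because the patterns are not anchored, they can also match inside unrelated lines (for example a word that merely contains "set"), so anchoring them with ^ is worth considering before running this on all 12,000 lines.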

find string using c#?

I am trying to find a string in the string below.
http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779
using the string http://example.com/TIGS/SIM/Lists. How can I get the words Team Discussion from it?
Sometimes the strings will be
http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779
I need `Team Discussion`
http://example.com/TIGS/ALIF/Lists/Artifical Lift Discussion Forum 2/DispForm.aspx?ID=8
I need `Artifical Lift Discussion Forum 2`
If you're always following that pattern, I recommend @Justin's answer. However, if you want a more robust method, you can always combine System.Uri and Path.GetDirectoryName, then perform a String.Split. Like this example:
String url = @"http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779";
System.Uri uri = new System.Uri(url);
String dir = Path.GetDirectoryName(uri.AbsolutePath);
String[] parts = dir.Split(new[]{ Path.DirectorySeparatorChar });
Console.WriteLine(parts[parts.Length - 1]);
The only major problem, however, is you're going to wind up with a path that's been "encoded" (i.e. your space is now going to be represented by a %20)
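One way around the encoding issue, as a sketch: Uri.UnescapeDataString turns %20 back into a space. The version below uses Uri.Segments instead of Path.GetDirectoryName, which is a swapped-in variation and not the code from this answer, and it assumes the URL ends in a file name such as DispForm.aspx. The class and method names are made up:

```csharp
using System;

public static class UrlFolder
{
    // Returns the decoded name of the directory that contains the last path segment.
    public static string LastDirectory(string url)
    {
        var uri = new Uri(url);
        string[] segments = uri.Segments; // "/", "TIGS/", "SIM/", "Lists/", ...
        if (segments.Length < 2)
        {
            return string.Empty;
        }
        string dir = segments[segments.Length - 2].TrimEnd('/');
        return Uri.UnescapeDataString(dir);
    }

    public static void Main()
    {
        Console.WriteLine(LastDirectory("http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779"));
    }
}
```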
This solution will get you the last directory of your URL regardless of how many directories are in your URL.
string[] arr = s.Split('/');
string lastPart = arr[arr.Length - 2];
You could combine this solution into one line, however it would require splitting the string twice, once for the values, the second for the length.
If you wanted to see a regular expression example:
string input = "http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779";
string given = "http://example.com/TIGS/SIM/Lists";
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(given + @"\/(.+)\/");
System.Text.RegularExpressions.Match match = regex.Match(input);
Console.WriteLine(match.Groups[1]); // Team Discussion
Here's a simple approach, assuming that your URL always has the same number of slashes before the area you want:
var value = url.Split(new[]{'/'}, StringSplitOptions.RemoveEmptyEntries)[5];
Here is another solution that provides the following advantages:
Does not require the use of regular expressions.
Does not require a certain 'count' of slashes to be present (indexing based on a specific number). I consider this a key benefit because it makes the code less likely to fail if some part of the URL changes. Ultimately it is best to base your parsing logic on whichever part of the text's structure you consider least likely to change.
This method, however, DOES rely on the following assumptions, which I consider to be the least likely to change:
URL must have "/Lists/" right before target text.
URL must have "/" right after target text.
Basically, I just split the string twice, using text that I expect to be surrounding the area I am interested in.
String urlToSearch = "http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx";
String result = "";
// First, get everything after "/Lists/"
string[] temp1 = urlToSearch.Split(new String[] { "/Lists/" }, StringSplitOptions.RemoveEmptyEntries);
if (temp1.Length > 1)
{
    // Next, get everything before the first "/"
    string[] temp2 = temp1[1].Split(new String[] { "/" }, StringSplitOptions.RemoveEmptyEntries);
    result = temp2[0];
}
Your answer will then be stored in the 'result' variable.

Multiline TextBox with custom word wrapping

I am beginning to program in .NET and C# and currently I am stuck. I have a problem very similar to the one in this Stack Overflow question: C#: Multiline TextBox with TextBox.WordWrap Displaying Long Base64 String.
The response to that question was this block of code:
public IEnumerable<string> SimpleWrap(string line, int length)
{
    var s = line;
    while (s.Length > length)
    {
        var result = s.Substring(0, length);
        s = s.Substring(length);
        yield return result;
    }
    yield return s;
}
I don't know how to make use of that piece of code. Can someone please provide me with a code snippet that uses this method to write text and also automatically inserts a new line?
My code currently looks like this:
var length = GetMaximumCharacters(txtBxResults);
var txtWrap = SimpleWrap(stringValue, length);
foreach (string s in txtWrap)
{
    txtBxResults.AppendText(s);
}
If I use the AppendText method, it simply writes all the text on one single line, which I do not want.
Any replies will be greatly appreciated.
Thanks,
KK
You almost have it right, you just need to insert the newline character as well. Try
foreach (string s in txtWrap)
{
    txtBxResults.AppendText(s + Environment.NewLine);
}
Well, I can't give you the exact code right now (I'll come back and post it later), but in general, what you should do is find the index of the next comma and, if the number of characters on the current line plus that index exceeds the line length, append a new line before that compound. If you do that in a loop, the text should be formatted correctly when it's done. Also take into account that the last compound won't have (I think) a comma at the end.
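Until the exact code shows up, the idea might be sketched like this. WrapAtCommas is a hypothetical helper, not code from this thread; it assumes the text is a comma-separated list and lets a single item that is longer than the limit stay on its own overlong line:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class CommaWrap
{
    // Yields lines of at most 'length' characters where possible,
    // breaking only directly after a comma.
    public static IEnumerable<string> WrapAtCommas(string line, int length)
    {
        var current = new StringBuilder();
        int start = 0;
        while (start < line.Length)
        {
            int comma = line.IndexOf(',', start);
            // The last item has no trailing comma, so take the rest of the text.
            int end = comma < 0 ? line.Length : comma + 1;
            string item = line.Substring(start, end - start);

            if (current.Length > 0 && current.Length + item.Length > length)
            {
                yield return current.ToString();
                current.Clear();
            }
            current.Append(item);
            start = end;
        }
        if (current.Length > 0)
        {
            yield return current.ToString();
        }
    }

    public static void Main()
    {
        foreach (string s in WrapAtCommas("aa,bb,cc,dd", 6))
        {
            Console.WriteLine(s);
        }
    }
}
```

With the TextBox from the question, each returned line could be written with txtBxResults.AppendText(s + Environment.NewLine), as in the answer above.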
