Same string content but different lengths - c#

I'm having a hard time trying to figure this out. I'm writing a Unit test that verifies that the MD5 that a site displays matches the actual MD5 of the file. I do this by simply grabbing what the page displays and then calculating my own MD5 of the file. I get the text on the page by using Selenium WebDriver.
As expected, the strings show up as the same...or it appears to be
When I try to test the two strings using Assert.AreEqual or Assert.IsTrue, it fails no matter how I try to compare them
I've tried the following ways:
Assert.AreEqual(md5, md5Text); //Fails
Assert.IsTrue(md5 == md5Text); //Fails
Assert.IsTrue(String.Equals(md5, md5Text)); //Fails
Assert.IsTrue(md5.Normalize() == md5Text.Normalize()); //Fails
Assert.AreEqul(md5.Normalize(), md5Text.Normalize()); //Fails
At first, I thought the strings were actual different, but looking at them in the debugger shows that both strings are exactly the same
So I tried looking at their lengths, that's when I saw why
The strings are different lengths..so I tried to substring the md5 variable to match the size of the md5Text variable. My thinking here was maybe md5 had a bunch of 0 width characters. However doing this got rid of the last half of md5
SOO, this must mean their in different encodings correct? But wouldn't Normalize() fix that?
This is how the variable md5 is created
string md5;
using (var stream = file.Open()) //file is a custom class with an Open() method that returns a Stream
{
using (var generator = MD5.Create())
{
md5 = BitConverter.ToString(generator.ComputeHash(stream)).Replace("-", "‌​").ToLower().Trim();
}
}
and this is how the md5Text variable is created
//I'm using Selenium WebDrvier to grab the text from the page
var md5Element = row.FindElements(By.XPath("//span[#data-bind='text: MD5Hash']")).Where(e => e.Visible()).First();
var md5Text = md5Element.Text;
How can I make this test pass? as it should be passing (since they are the same)
UPDATE:
The comments suggested I turn the strings into a char[] and iterate over it. Here are the results of that (http://pastebin.com/DX335wU8) and the code I added to do it
char[] md5Characters = md5.ToCharArray();
char[] md5TextCharacters = md5Text.ToCharArray();
//Use md5 length since it's bigger
for (int i = 0; i < md5Characters.Length; i++)
{
System.Diagnostics.Debug.Write("md5: " + md5Characters[i]);
if (i >= md5TextCharacters.Length)
{
System.Diagnostics.Debug.Write(" | Exhausted md5Text characters..");
}
else
{
System.Diagnostics.Debug.Write(" | md5Text: " + md5TextCharacters[i]);
}
System.Diagnostics.Debug.WriteLine("");
}
One thing I found interesting is that the md5 char array has a bunch of random characters inside of it every 2 letters

.Replace("-", "‌​")
Your "" is not empty, there is actually a "‌​ then unicode zero width non-joiner + zero width space then " so you are not replacing "-" with an empty string rather you are inserting additional characters.
Delete and retype "" or use String.Empty.

Related

Remove control characters sequence from string EOT comma ETX

I have some xml files where some control sequences are included in the text: EOT,ETX(anotherchar)
The other char following EOT comma ETX is not always present and not always the same.
Actual example:
<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>
Where <EOT> is the 04 char and <ETX> is 03. As I have to parse the xml this is actually a big issue.
Is this some kind of encoding I never heard about?
I have tried to remove all the control characters from my string but it will leave the comma that is still unwanted.
If I use Encoding.ASCII.GetString(file); the unwanted characters will be replaced with a '?' that is easy to remove but it will still leave some unwanted characters causing parse issues:
<BIC></WBIC> something like this.
string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());
I hence need to remove all this kind of control character sequences to be able to parse this kind of files and I'm unsure about how to programmatically check if a character is part of a control sequence or not.
I have find out that there are 2 wrong patterns in my files: the first is the one in the title and the second is EOT<.
In order to make it work I looked at this thread: Remove substring that starts with SOT and ends EOT, from string
and modified the code a little
private static string RemoveInvalidCharacters(string input)
{
while (true)
{
var start = input.IndexOf('\u0004');
if (start == -1) break;
if (input[start + 1] == '<')
{
input = input.Remove(start, 2);
continue;
}
if (input[start + 2] == '\u0003')
{
input = input.Remove(start, 4);
}
}
return input;
}
A further cleanup with this code:
static string StripExtended(string arg)
{
StringBuilder buffer = new StringBuilder(arg.Length); //Max length
foreach (char ch in arg)
{
UInt16 num = Convert.ToUInt16(ch);//In .NET, chars are UTF-16
//The basic characters have the same code points as ASCII, and the extended characters are bigger
if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
}
return buffer.ToString();
}
And now everything looks fine to parse.
sorry for the delay in responding,
but in my opinion the root of the problem might be an incorrect decoding of a p7m file.
I think originally the xml file you are trying to sanitize was a .xml.p7m file.
I believe the correct way to sanitize the file is by using a library such as Buoncycastle in java or dotnet and the class CmsSignedData.
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
using (var stream = new MemoryStream())
{
cmsObj.SignedContent.Write(stream);
content = stream.ToArray();
}
}

Regex.Split and string.Split not working as expected

I am attempting to split strings using '?' as the delimiter. My code reads data from a CSV file, and certain symbols (like fractions) are not recognized by C#, so I am trying to replace them with a relevant piece of data (bond coupon in this case). I have print statements in the following code (which is embedded in a loop with index variable i) to test the output:
string[] l = lines[i][1].Split('?');
//string[] l = Regex.Split(lines[i][1], #"\?");
System.Console.WriteLine("L IS " + l.Length.ToString() + " LONG");
for (int j = 0; j < l.Length; j++)
System.Console.WriteLine("L["+ j.ToString() + "] IS " + l[j]);
if (l.Length > 1)
{
double cpn = Convert.ToDouble(lines[i][12]);
string couponFrac = (cpn - Math.Floor(cpn)).ToString().Remove(0,1);
lines[i][1] = l[0].Remove(l[0].Length-1) + couponFrac + l[1]; // Recombine, replacing '?' with CPN
}
The issue is that both split methods (string.Split() and Regex.Split() ) produce inconsistent results with some of the string elements in lines splitting correctly and the others not splitting at all (and thus the question mark is still in the string).
Any thoughts? I've looked at similar posts on split methods and they haven't been too helpful.
I had no problem using String.Split. Could you post your input and output?
If at all you could probably use String.Replace to replace your desired '?' with a character that does not occur in the string and then use String.Split on that character to split the resultant string for the same effect. (just a try)
I didn't have any trouble parsing the following.
var qsv = "now?is?the?time";
var keywords = qsv.Split('?');
keywords.Dump();
screenshot of code and output...
UPDATE:
There doesn't appear to be any problem with Split. There is a problem somewhere else because in this small scale test it works just fine. I would suggest you use LinqPad to test out these kinds of scenarios small scale.
var qsv = "TII 0 ? 04/15/15";
var keywords = qsv.Split('?');
keywords.Dump();
qsv = "TII 0 ? 01/15/22";
keywords = qsv.Split('?');
keywords.Dump();
New updated output:

how to find indexof substring in a text file

I have converted an asp.net c# project to framework 3.5 using VS 2008. Purpose of app is to parse a text file containing many rows of like information then inserting the data into a database.
I didn't write original app but developer used substring() to fetch individual fields because they always begin at the same position.
My question is:
What is best way to find the index of substring in text file without having to manually count the position? Does someone have preferred method they use to find position of characters in a text file?
I would say IndexOf() / IndexOfAny() together with Substring(). Alternatively, regular expressions. It the file has an XML-like structure, this.
If the files are delimited eg with commas you can use string.Split
If data is: string[] text = { "1, apple", "2, orange", "3, lemon" };
private void button1_Click(object sender, EventArgs e)
{
string[] lines = this.textBoxIn.Lines;
List<Fruit> fields = new List<Fruit>();
foreach(string s in lines)
{
char[] delim = {','};
string[] fruitData = s.Split(delim);
Fruit f = new Fruit();
int tmpid = 0;
Int32.TryParse(fruitData[0], out tmpid);
f.id = tmpid;
f.name = fruitData[1];
fields.Add(f);
}
this.textBoxOut.Clear();
string text=string.Empty;
foreach(Fruit item in fields)
{
text += item.ToString() + " \n";
}
this.textBoxOut.Text = text;
}
}
The text file I'm reading does not contain delimiters - sometimes there spaces between fields and sometimes they run together. In either case, every line is formatted the same. When I asked the question I was looking at the file in notepad.
Question was: how do you find the position in a file so that position (a number) could be specified as the startIndex of my substring function?
Answer: I've found that opening the text file in notepad++ will display the column # and line count of any position where the curser is in the file and makes this job easier.
You can use indexOf() and then use Length() as the second substring parameter
substr = str.substring(str.IndexOf("."), str.Length - str.IndexOf("."));

Streamreader isn't returning the correct values from my text file, can't figure out how to properly read my text files C#

I'm running three counters, one to return the total amount of chars, one to return the number of '|' chars in my .txt file (total). And one to read how many separate lines are in my text file. I'm assuming my counters are wrong, I'm not sure. In my text file there are some extra '|' chars, but that is a bug I need to fix later...
The Message Boxes show
"Lines = 8"
"Entries = 8"
"Total Chars = 0"
Not sure if it helps but the .txt file is compiled using a streamwriter, and I have a datagridview saved to a string to create the output. Everything seems okay with those functions.
Here is a copy of the text file I'm reading
Matthew|Walker|MXW320|114282353|True|True|True
Audrey|Walker|AXW420|114282354|True|True|True
John|Doe|JXD020|111222333|True|True|False
||||||
And here is the code.
private void btnLoadList_Click(object sender, EventArgs e)
{
var loadDialog = new OpenFileDialog
{
InitialDirectory = Convert.ToString(Environment.SpecialFolder.MyDocuments),
Filter = "Text (*.txt)|*.txt",
FilterIndex = 1
};
if (loadDialog.ShowDialog() != DialogResult.OK) return;
using (new StreamReader(loadDialog.FileName))
{
var lines = File.ReadAllLines(loadDialog.FileName);//Array of all the lines in the text file
foreach (var assocStringer in lines)//For each assocStringer in lines (Runs 1 cycle for each line in the text file loaded)
{
var entries = assocStringer.Split('|'); // split the line into pieces (e.g. an array of "Matthew", "Walker", etc.)
var obj = (Associate) _bindingSource.AddNew();
if (obj == null) continue;
obj.FirstName = entries[0];
obj.LastName = entries[1];
obj.AssocId = entries[2];
obj.AssocRfid = entries[3];
obj.CanDoDiverts = entries[4];
obj.CanDoMhe = entries[5];
obj.CanDoLoading = entries[6];
}
}
}
Hope you guys find the bug(s) here. Sorry if the formatting is sloppy I'm self-taught, no classes. Any extra advice is welcomed, be as honest and harsh as need be, no feelings will be hurt.
In summary
Why is this program not reading the correct values from the text file I'm using?
Not totally sure I get exactly what you're trying to do, so correct me if I'm off, but if you're just trying to get the line count, pipe (|) count and character count for the file the following should get you that.
var lines = File.ReadAllLines(load_dialog.FileName);
int lineCount = lines.Count();
int totalChars = 0;
int totalPipes = 0; // number of "|" chars
foreach (var s in lines)
{
var entries = s.Split('|'); // split the line into pieces (e.g. an array of "Matthew", "Walker", etc.)
totalChars += s.Length; // add the number of chars on this line to the total
totalPipes = totalPipes + entries.Count() - 1; // there is always one more entry than pipes
}
All the Split() is doing is breaking the full line into an array of the individual fields in the string. Since you only seem to care about the number of pipes and not the fields, I'm not doing much with it other than determining the number of pipes by taking the number of fields and subtracting one (since you don't have a trailing pipe on each line).

How can I put text from a multi-line text box into one string?

I'm faced with a bit of an issue. The scenario is that I have a multi-line text box and I want to put all that text into one single string, without any new lines in it. This is what I have at the moment:
string[] values = tbxValueList.Text.Split('\n');
foreach (string value in values)
{
if (value != "" && value != " " && value != null && value != "|")
{
valueList += value;
}
}
The problem is that no matter what I try and what I do, there is always a new line (at least I think?) in my string, so instead of getting:
"valuevaluevalue"
I get:
"value
value
value".
I've even tried to replace with string.Replace and regex.Replace, but alas to no avail. Please advise.
Yours sincerely,
Kevin van Zanten
The new line needs to be "\r\n". Better still - use Environment.NewLine.
The code is inefficient though, you are creating numerous unnecessary strings and an unnecessary array. Simply use:
tbxValueList.Text.Replace(Environment.NewLine, String.Empty);
On another note, if you ever see yourself using the += operator on a string more than a couple of times then you should probably be using a StringBuilder. This is because strings are Immutable.
Note that new lines can be up to two characters depending on the platform.
You should replace both CR/carriage-return (ASCII 13) and LF/linefeed (ASCII 10).
I wouldn't rely on localized data as David suggests (unless that was your intention); what if you're getting the text string from a different environment, such as from a DB which came from a Windows client?
I'd use:
tbxValueList.Text.Replace((Char)13,"").Replace((Char)10,"");
That replaces all occurrences of both characters independent of order.
Try this
tbxValueList.Text.Replace(System.Environment.NewLine, "");
try this one as well
string[] values = tbxValueList.Text.Replace("\r\n", " ").Split(' ');

Categories