Parsing string C#


Here is my problem: I'm trying to read the content of a text file into a string and then parse it. What I want is an array containing each word and only words (no blanks, no backspaces, no \n, ...). I use a function, LireFichier, that returns a string containing the text of the file (this works, because the text is displayed correctly). But when I try to parse it, the parsing fails and seems to concatenate parts of my string at random, and I don't understand why.
Here is the content of the text file I'm using:
truc,
ohoh,
toto, tata, titi, tutu,
tete,
and here's my final string:
;tete;;titi;;tata;;titi;;tutu;
which should be:
truc;ohoh;toto;tata;titi;tutu;tete;
Here is the code I wrote (all the using directives are fine):
namespace ConsoleApplication1{
class Program
{
static void Main(string[] args)
{
string chemin = "MYPATH";
string res = LireFichier(chemin);
Console.WriteLine("End of reading...");
Console.WriteLine("{0}",res);// The result at this point is good
Console.WriteLine("...starting parsing");
res = parseString(res);
Console.WriteLine("Chaine finale : {0}", res);// The result here is awful
Console.ReadLine();//pause
}
public static string LireFichier(string FilePath) //Read the file, send back a string with the text
{
StreamReader streamReader = new StreamReader(FilePath);
string text = streamReader.ReadToEnd();
streamReader.Close();
return text;
}
public static string parseString(string phrase)// Is supposed to parse the string
{
string fin="\n";
char[] delimiterChars = { ' ','\n',',','\0'};
string[] words = phrase.Split(delimiterChars);
TabToString(words);// I check the content of my array
for(int i=0;i<words.Length;i++)
{
if (words[i] != null)
{
fin += words[i] +";";
Console.WriteLine(fin);//help for debug
}
}
return fin;
}
public static void TabToString(string[] montab)// Display the content of my array
{
foreach(string s in montab)
{
Console.WriteLine(s);
}
}
}// End of class Program
}

I think your main issue is that Split keeps the empty entries. Try:
string[] words = phrase.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);

You could try using the string splitting option to remove empty entries for you:
string[] words = phrase.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);
See the documentation for StringSplitOptions.

Try this:
class Program
{
static void Main(string[] args)
{
var inString = LireFichier(@"C:\temp\file.txt");
Console.WriteLine(ParseString(inString));
Console.ReadKey();
}
public static string LireFichier(string FilePath) //Read the file, send back a string with the text
{
using (StreamReader streamReader = new StreamReader(FilePath))
{
// The using block disposes (and closes) the reader, so no explicit Close() is needed
return streamReader.ReadToEnd();
}
}
public static string ParseString(string input)
{
input = input.Replace(Environment.NewLine,string.Empty);
input = input.Replace(" ", string.Empty);
string[] chunks = input.Split(',');
StringBuilder sb = new StringBuilder();
foreach (string s in chunks)
{
sb.Append(s);
sb.Append(";");
}
return sb.ToString(0, sb.Length - 1); // drop the trailing ';'
}
}
Or this:
public static string ParseFile(string FilePath)
{
using (var streamReader = new StreamReader(FilePath))
{
return streamReader.ReadToEnd().Replace(Environment.NewLine, string.Empty).Replace(" ", string.Empty).Replace(',', ';');
}
}

Your main problem is that you are splitting on \n, but the linebreaks read from your file are \r\n.
Your output string does contain all of your items, but the \r characters left in it cause later "lines" to overwrite earlier "lines" on the console.
(\r is a "return to start of line" instruction; without the \n "move to the next line" instruction, the words from line 1 are overwritten by those from line 2, then line 3 and line 4.)
In addition to splitting on \r as well as \n, you need to check that a string is not null or empty before adding it to your output (or, preferably, use StringSplitOptions.RemoveEmptyEntries, as others have mentioned).
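The fix described above can be sketched as follows. This is a minimal illustration, not the asker's final code: the class and method names (SplitSketch, ToSemicolonList) are made up for the example, and the sample text assumes the file uses Windows line endings (\r\n).

```csharp
using System;

public static class SplitSketch
{
    // Splits on spaces, tabs, commas and both line-break characters,
    // drops the empty entries, then joins the words with ';'
    // (keeping the trailing ';' that the question expects).
    public static string ToSemicolonList(string text)
    {
        char[] delimiters = { ' ', '\t', ',', '\r', '\n' };
        string[] words = text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(";", words) + ";";
    }

    public static void Main()
    {
        // Same content as the file in the question, with \r\n line breaks.
        string sample = "truc,\r\nohoh,\r\ntoto, tata, titi, tutu,\r\ntete,\r\n";
        Console.WriteLine(ToSemicolonList(sample)); // truc;ohoh;toto;tata;titi;tutu;tete;
    }
}
```

Because \r is now a delimiter, no carriage return survives in the output, so the console no longer overwrites earlier words.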

string ParseString(string filename) {
return string.Join(";", System.IO.File.ReadAllLines(filename).Where(x => x.Length > 0).Select(x => string.Join(";", x.Split(",".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Select(y => y.Trim()))).Select(z => z.Trim())) + ";";
}

Related

Delete all but {x} C# string

I'm trying to cycle through a .txt to build a test function for another application I'm building.
I've got a list of UK based lat/long values that are formatted like this:
Latitude: 57°39′55″N  57.665198
Longitude: 6°57′27″W  -6.95739395
Distance: 184.8338 mi Bearing: 329.815°
with the intended result of this small application being just the lat/long values:
57.665198
-6.95739395
So far I've got a StreamReader working with a myString.StartsWith("Latitude") {} check, but I'm stuck.
How do I detect a separator of two spaces ("  ") inside a string and delete everything before it? My code so far is this:
static void Main(string[] args)
{
string text = "";
using (var streamReader = new StreamReader(@"c:\mb\latlong.txt", Encoding.UTF8))
{
text = streamReader.ReadToEnd();
if (text.Trim().StartsWith("Latitude: "))
{
text.Split()
} else if (text.StartsWith("Distance: "))
{
} else if (text.StartsWith(""))
{
}
streamReader.ReadLine();
}
Console.ReadKey();
}
Thanks in advance

You can try using regular expressions:
var result = File
.ReadLines(@"C:\MyFile.txt")
.SelectMany(line => Regex
.Matches(line, @"(?<=\s)-?[0-9]+(\.[0-9]+)*$")
.OfType<Match>()
.Select(match => match.Value));
Test
// 57.665198
// -6.95739395
Console.Write(String.Join(Environment.NewLine, result));

Use string.IndexOf("  ") to find the position of the two spaces in the string. Then you can use string.Substring(position) to get the string after that point.
In your code:
if (text.Trim().StartsWith("Latitude: "))
{
var positionOfTwoSpaces = text.IndexOf("  ");
var latString = text.Substring(positionOfTwoSpaces);
// double keeps more digits than float; the invariant culture makes '.' the decimal separator
var latValue = double.Parse(latString, System.Globalization.CultureInfo.InvariantCulture);
}
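As a slightly fuller sketch of the same idea, the helper below works line by line and returns null when a line has no double space or no parsable number. The names LatLongSketch and ExtractDecimal are hypothetical, and the sample lines assume the decimal value is separated from the rest by exactly the double space the question describes.

```csharp
using System;
using System.Globalization;

public static class LatLongSketch
{
    // Returns the number that follows the last double space in the line,
    // or null if there is no double space or the tail is not a number.
    public static double? ExtractDecimal(string line)
    {
        int pos = line.LastIndexOf("  ", StringComparison.Ordinal);
        if (pos < 0) return null;

        string tail = line.Substring(pos + 2).Trim();
        double value;
        bool ok = double.TryParse(tail, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
        return ok ? value : (double?)null;
    }

    public static void Main()
    {
        string[] lines =
        {
            "Latitude: 57°39′55″N  57.665198",
            "Longitude: 6°57′27″W  -6.95739395",
            "Distance: 184.8338 mi Bearing: 329.815°"
        };

        foreach (string line in lines)
        {
            // Only the latitude and longitude lines are of interest.
            if (!line.StartsWith("Latitude:") && !line.StartsWith("Longitude:")) continue;

            double? value = ExtractDecimal(line);
            if (value.HasValue)
                Console.WriteLine(value.Value.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

Using TryParse instead of Parse avoids an exception on lines that do not end in a number.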

You can try the regular expression solution. (You might need to adjust the number of spaces in the regex definitions.)
static void Main(string[] args)
{
    Regex lat = new Regex("Latitude: .+? (.+)");
    Regex lon = new Regex("Longitude: .+? (.+)");
    using (var streamReader = new StreamReader(@"c:\mb\latlong.txt", Encoding.UTF8))
    {
        string line;
        // Read line by line until the end of the file
        while ((line = streamReader.ReadLine()) != null)
        {
            if (lat.IsMatch(line))
                Console.WriteLine(lat.Match(line).Groups[1].Value); // latitude
            else if (lon.IsMatch(line))
                Console.WriteLine(lon.Match(line).Groups[1].Value); // longitude
        }
    }
    Console.ReadKey();
}

A simple solution would be:
string[] fileLines = System.IO.File.ReadAllLines("input file path");
List<string> resultLines = new List<string>();
foreach (string line in fileLines) {
    // Split on a double space; a string separator has to be passed as a string[]
    string[] parts = line.Split(new[] { "  " }, StringSplitOptions.None);
    if (parts.Length > 1) {
        string lastPart = parts[parts.Length - 1];
        if (!string.IsNullOrEmpty(lastPart)) {
            resultLines.Add(lastPart);
        }
    }
}
System.IO.File.WriteAllLines("output file path", resultLines.ToArray());

As I already suggested in my comment, you can look for the last occurrence of a space and take the substring from there.
using System;
using System.IO;
using System.Text;
public class Test
{
    public static void Main()
    {
        // Path and encoding taken from the question
        using (var streamReader = new StreamReader(@"c:\mb\latlong.txt", Encoding.UTF8))
        {
            string line;
            // Stops at the end of the file or at the first empty line
            while (!String.IsNullOrEmpty(line = streamReader.ReadLine()))
            {
                if (line.StartsWith("Latitude:"))
                {
                    line = line.Substring(line.LastIndexOf(' ') + 1);
                    Console.WriteLine(line);
                }
            }
        }
        Console.ReadKey();
    }
}
I didn't provide all the code because the longitude case is just a copy and paste of this. I think you can do that on your own. :)

Finding ® in a string of text

Let me rephrase my question:
I am reading in text where one of the characters is the registered symbol, ®, from a text file that has no problem displaying the symbol. When I try to print the string after reading it from the file, the symbol is an unprintable character. When I read in the string and split the string to characters and convert the character to an Int16 and print out the hex, I get 0xFFFD. I specify Encoding.UTF8 when I open the StreamReader.
Here is what I have
using (System.IO.StreamReader sr = new System.IO.StreamReader(HttpContext.Current.Server.MapPath("~/App_Code/Hormel") + "/nutrition_data.txt", System.Text.Encoding.UTF8))
{
string line;
while((line = sr.ReadLine()) != null)
{
//after splitting the file on '~'
items[i] = scrubData(utf8.GetString(utf8.GetBytes(items[i].ToCharArray())));
//items[i] = scrubData(items[i]); //original
}
}
Here is the scrubData function
private String scrubData(string data)
{
string newStr = String.Empty;
try
{
if (data.Contains("HORMEL"))
{
string[] s = data.Split(' ');
foreach(string str in s)
{
if (str.Contains("HORMEL"))
{
char[] ch = str.ToCharArray();
for(int i=0; i<ch.Length; i++)
{
EventLogProvider.LogInformation("LoadNutritionInfoTask", "Test", ch[i] + " = " + String.Format("{0:X}", Convert.ToInt16(ch[i])));
}
}
}
}
return String.Empty;
}
catch (Exception ex)
{
EventLogProvider.LogInformation("LoadNutritionInfoTask", "ScrubData", ex.Message);
return data;
}
}
I'm not concerned with what is being returned right now, I am printing out the characters and the hex codes that correspond to them.

First, you need to make sure you're reading the text with the correct encoding. It appears to me that you are using UTF-8, since you say ® (Unicode code point U+00AE) is 0xC2AE, which is its UTF-8 encoding. You can use that like:
Encoding.UTF8.GetString(new byte[] { 0xc2, 0xae }) // "®", the registered symbol
// or
using (var streamReader = new StreamReader(file, Encoding.UTF8))
Once you've got it as a string in C#, you should use HttpUtility.HtmlEncode to encode it as HTML. E.g.
HttpUtility.HtmlEncode("SomeStuff®") // result is "SomeStuff&#174;"

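Check which encoding you are decoding the bytes with. 0xFFFD is the Unicode replacement character, which a decoder emits when it meets bytes that are invalid in the chosen encoding.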
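To illustrate how a wrong encoding produces 0xFFFD, here is a small sketch. It assumes, without having seen the file, that the file stores ® as the single byte 0xAE (as ISO-8859-1 and Windows-1252 do) rather than as the UTF-8 sequence 0xC2 0xAE. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class EncodingSketch
{
    // Decodes the bytes with the named encoding (for example "utf-8" or "iso-8859-1").
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    public static void Main()
    {
        // A lone 0xAE byte is '®' in ISO-8859-1, but it is not valid UTF-8.
        byte[] singleByte = { 0xAE };

        string asUtf8 = Decode(singleByte, "utf-8");
        string asLatin1 = Decode(singleByte, "iso-8859-1");

        Console.WriteLine("{0:X4}", (int)asUtf8[0]);   // FFFD
        Console.WriteLine("{0:X4}", (int)asLatin1[0]); // 00AE
    }
}
```

If this matches what you see, passing the file's real encoding to the StreamReader constructor instead of Encoding.UTF8 is the likely fix.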

Try this:
string txt = "textwithsymbol";
string html = "<html></html>";
txt = txt.Replace("\u00ae", html);
Obviously you would replace the txt variable with the text you have read in and "\u00ae" is the symbol you are looking for.

Remove words from string c#

I am working on an ASP.NET 4.0 web application. Its main goal is to go to the URL in the MyURL variable, read the page from top to bottom, search for all lines that start with "description", and keep only those while removing all HTML tags. What I want to do next is remove the "description" text from the results afterwards, so that only my device names are left. How would I do this?
protected void parseButton_Click(object sender, EventArgs e)
{
MyURL = deviceCombo.Text;
WebRequest objRequest = HttpWebRequest.Create(MyURL);
objRequest.Credentials = CredentialCache.DefaultCredentials;
using (StreamReader objReader = new StreamReader(objRequest.GetResponse().GetResponseStream()))
{
originalText.Text = objReader.ReadToEnd();
}
//Read all lines of file
String[] crString = { "<BR> " };
String[] aLines = originalText.Text.Split(crString, StringSplitOptions.RemoveEmptyEntries);
String noHtml = String.Empty;
for (int x = 0; x < aLines.Length; x++)
{
if (aLines[x].Contains(filterCombo.SelectedValue))
{
noHtml += (RemoveHTML(aLines[x]) + "\r\n");
}
}
//Print results to textbox
resultsBox.Text = String.Join(Environment.NewLine, noHtml);
}
public static string RemoveHTML(string text)
{
text = text.Replace("&nbsp;", " ").Replace("<br>", "\n");
var oRegEx = new System.Text.RegularExpressions.Regex("<[^>]+>");
return oRegEx.Replace(text, string.Empty);
}

Ok so I figured out how to remove the words through one of my existing functions:
public static string RemoveHTML(string text)
{
text = text.Replace("&nbsp;", " ").Replace("<br>", "\n").Replace("description", "").Replace("INFRA:CORE:", "")
.Replace("RESERVED", "")
.Replace(":", "")
.Replace(";", "")
.Replace("-0/3/0", "");
var oRegEx = new System.Text.RegularExpressions.Regex("<[^>]+>");
return oRegEx.Replace(text, string.Empty);
}

public static void Main(String[] args)
{
string str = "He is driving a red car.";
Console.WriteLine(str.Replace("red", "").Replace("  ", " "));
}
Output:
He is driving a car.
Note: In the second Replace, the first argument is a double space.
Link: https://i.stack.imgur.com/rbluf.png
Try this. It will remove all occurrences of the word you want to remove.

Try something like this, using LINQ:
List<string> lines = new List<string>{
"Hello world",
"Description: foo",
"Garbage:baz",
"description purple"};
//now add all your lines from your html doc.
if (aLines[x].Contains(filterCombo.SelectedValue))
{
lines.Add(RemoveHTML(aLines[x]) + "\r\n");
}
var myDescriptions = lines.Where(x=>x.ToLower().StartsWith("description"))
.Select(x=> x.ToLower().Replace("description",string.Empty)
.Trim());
// you now have "foo" and "purple", and anything else.
You may have to adjust for colons, etc.
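One way to handle the colons while keeping the original casing of the device names is sketched below. The DescriptionSketch class and DeviceNames method are hypothetical names, and the sample lines are the ones from this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DescriptionSketch
{
    // Keeps the lines that start with "description" (any casing) and returns
    // what follows the prefix, without a leading colon or surrounding spaces.
    public static List<string> DeviceNames(IEnumerable<string> lines)
    {
        const string prefix = "description";
        return lines
            .Select(l => l.Trim())
            .Where(l => l.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            .Select(l => l.Substring(prefix.Length).TrimStart(':', ' ').Trim())
            .ToList();
    }

    public static void Main()
    {
        var lines = new List<string>
        {
            "Hello world",
            "Description: foo",
            "Garbage:baz",
            "description purple"
        };

        foreach (string name in DeviceNames(lines))
            Console.WriteLine(name); // prints foo, then purple
    }
}
```

Comparing with StringComparison.OrdinalIgnoreCase avoids lowercasing the whole line, so a device name such as "Router-A" would keep its capital letters.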

void Main()
{
string test = "<html>wowzers description: none <div>description:a1fj391</div></html>";
IEnumerable<string> results = getDescriptions(test);
foreach (string result in results)
{
Console.WriteLine(result);
}
//result: none
// a1fj391
}
static Regex MyRegex = new Regex(
"description:\\s*(?<value>[\\d\\w]+)",
RegexOptions.Compiled);
IEnumerable<string> getDescriptions(string html)
{
foreach(Match match in MyRegex.Matches(html))
{
yield return match.Groups["value"].Value;
}
}

Adapted From Code Project
string value = "ABC - UPDATED";
int index = value.IndexOf(" - UPDATED");
if (index != -1)
{
value = value.Remove(index);
}
It will print ABC without - UPDATED

C# Remove Invalid Characters from Filename

I have data coming from an nvarchar field of a SQL Server database via EF 3.5. The string is used to create a file name, so I need to remove invalid characters. I tried the following options, but none of them works. Can you suggest why? Am I doing anything wrong?
I went through almost all of the related questions on this site and am now posting a consolidated question built from the suggestions and answers to other similar questions.
UPD: The issue was unrelated. All of these options do work, so I am posting this as community wiki.
public static string CleanFileName1(string filename)
{
string file = filename;
file = string.Concat(file.Split(System.IO.Path.GetInvalidFileNameChars(), StringSplitOptions.RemoveEmptyEntries));
if (file.Length > 250)
{
file = file.Substring(0, 250);
}
return file;
}
public static string CleanFileName2(string filename)
{
var builder = new StringBuilder();
var invalid = System.IO.Path.GetInvalidFileNameChars();
foreach (var cur in filename)
{
if (!invalid.Contains(cur))
{
builder.Append(cur);
}
}
return builder.ToString();
}
public static string CleanFileName3(string filename)
{
string regexSearch = string.Format("{0}{1}",
new string(System.IO.Path.GetInvalidFileNameChars()),
new string(System.IO.Path.GetInvalidPathChars()));
Regex r = new Regex(string.Format("[{0}]", Regex.Escape(regexSearch)));
string file = r.Replace(filename, "");
return file;
}
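public static string CleanFileName4(string filename)
{
// Where keeps repeated characters; Except is a set operation and would also drop duplicates
var invalid = System.IO.Path.GetInvalidFileNameChars();
return new String(filename.Where(c => !invalid.Contains(c)).ToArray());
}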
public static string CleanFileName5(string filename)
{
string file = filename;
foreach (char c in System.IO.Path.GetInvalidFileNameChars())
{
file = file.Replace(c, '_');
}
return file;
}

Here is a function I use in a static common class:
public static string RemoveInvalidFilePathCharacters(string filename, string replaceChar)
{
string regexSearch = new string(Path.GetInvalidFileNameChars()) + new string(Path.GetInvalidPathChars());
Regex r = new Regex(string.Format("[{0}]", Regex.Escape(regexSearch)));
return r.Replace(filename, replaceChar);
}

Try this
filename = Regex.Replace(filename, @"[\/?:*""><|]+", "", RegexOptions.Compiled);

no invalid chars returned by System.IO.Path.GetInvalidFileNameChars() being removed. – Bhuvan 5 mins ago
The first method you posted works OK for the characters in Path.GetInvalidFileNameChars(), here it is at work:
static void Main(string[] args)
{
string input = "abc<def>ghi\\1234/5678|?9:*0";
string output = CleanFileName1(input);
Console.WriteLine(output); // this prints: abcdefghi1234567890
Console.Read();
}
I suppose though that your problem is with some language-specific special characters. You can try to troubleshoot this problem by printing out the ASCII codes of the characters in your string:
string stringFromDatabase = "/5678|?9:*0"; // here you get it from the database
foreach (char c in stringFromDatabase.ToCharArray())
Console.WriteLine((int)c);
and consulting the ASCII table: http://www.asciitable.com/
I again suspect that you'll see characters with codes larger than 128, and you should exclude those from your string.
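A slightly more detailed diagnostic is sketched below: it prints each character's code point together with whether Path.GetInvalidFileNameChars() lists it on the current platform. The class name, method names and sample string are made up for illustration. Note that the invalid set is platform-dependent and that characters above 127, such as é, are not necessarily in it.

```csharp
using System;
using System.IO;

public static class FileNameDiagnostics
{
    // Formats a character's UTF-16 code unit as U+XXXX.
    public static string CodePoint(char c)
    {
        return string.Format("U+{0:X4}", (int)c);
    }

    // True if the current platform reports the character as invalid in file names.
    public static bool IsInvalidFileNameChar(char c)
    {
        return Array.IndexOf(Path.GetInvalidFileNameChars(), c) >= 0;
    }

    public static void Main()
    {
        // Made-up sample; in practice this would be the value read from the database.
        string stringFromDatabase = "caf\u00E9/report\u00AE.txt";

        foreach (char c in stringFromDatabase)
        {
            Console.WriteLine("{0}  {1}  invalid: {2}", c, CodePoint(c), IsInvalidFileNameChar(c));
        }
    }
}
```

Seeing the actual code points makes it easier to tell a genuinely invalid character from one that merely looks unusual.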

C# Find if a word is in a document

I am looking for a way to check if the "foo" word is present in a text file using C#.
I could use a regular expression, but I'm not sure that would work if the word is split across two lines. I have the same issue with a StreamReader that enumerates the lines.
Any comments?

What's wrong with a simple search?
If the file is not large and memory is not a problem, simply read the entire file into a string (ReadToEnd() method) and use string.Contains().
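A minimal sketch of this approach follows. It assumes that a word broken across two lines is simply continued on the next line with nothing but a line break in between, so the line breaks are removed before searching. That assumption can also create false matches when one word ends a line and another begins the next. The names SimpleSearch and ContainsIgnoringLineBreaks are hypothetical.

```csharp
using System;
using System.IO;

public static class SimpleSearch
{
    // Removes line breaks so that a word split across two lines becomes contiguous,
    // then performs a plain, case-sensitive substring search.
    public static bool ContainsIgnoringLineBreaks(string text, string word)
    {
        string joined = text.Replace("\r", "").Replace("\n", "");
        return joined.IndexOf(word, StringComparison.Ordinal) >= 0;
    }

    public static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("usage: SimpleSearch <file>");
            return;
        }

        // Reads the whole file into memory, which is fine for small files.
        string text = File.ReadAllText(args[0]);
        Console.WriteLine(ContainsIgnoringLineBreaks(text, "foo"));
    }
}
```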

Here you go. We examine each line as we read the file, combine the last word of the previous line with the first word of the current line, and check whether either that combination or the line itself matches the pattern.
string pattern = "foo";
string input = null;
string lastword = string.Empty;
string firstword = string.Empty;
bool result = false;
FileStream FS = new FileStream("File name and path", FileMode.Open, FileAccess.Read, FileShare.Read);
StreamReader SR = new StreamReader(FS);
while ((input = SR.ReadLine()) != null)
{
int firstSpace = input.IndexOf(" ");
firstword = firstSpace >= 0 ? input.Substring(0, firstSpace) : input; // a line without spaces is a single word
if(lastword.Trim() != string.Empty) { firstword = lastword.Trim() + firstword.Trim(); }
Regex RegPattern = new Regex(pattern);
Match Match1 = RegPattern.Match(input);
string value1 = Match1.ToString();
if (pattern.Trim() == firstword.Trim() || value1 != string.Empty) { result = true; }
string trimmed = input.Trim();
lastword = trimmed.Substring(trimmed.LastIndexOf(" ") + 1); // with no space, LastIndexOf returns -1 and the whole line is kept
}

Here is a quick example using LINQ:
static void Main(string[] args)
{
{ //LINQ version
bool hasFoo = "file.txt".AsLines()
.Any(l => l.Contains("foo"));
}
{ // No LINQ or Extension Methods needed
bool hasFoo = false;
foreach (var line in Tools.AsLines("file.txt"))
if (line.Contains("foo"))
{
hasFoo = true;
break;
}
}
}
}
public static class Tools
{
public static IEnumerable<string> AsLines(this string filename)
{
using (var reader = new StreamReader(filename))
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
while (line.EndsWith("-") && !reader.EndOfStream)
line = line.Substring(0, line.Length - 1)
+ reader.ReadLine();
yield return line;
}
}
}

What about if the line contains football? Or fool? If you are going to go down the regular expression route you need to look for word boundaries.
Regex r = new Regex(@"\bfoo\b");
Also ensure you are taking into consideration case insensitivity if you need to.
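Both points (word boundaries and case insensitivity) can be combined in a small helper. This is only a sketch, and the names WordSearch and ContainsWord are made up for the example. Note the verbatim string (@"..."): without it, \b in a regular C# string literal is a backspace character rather than a word boundary.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordSearch
{
    // True if 'word' occurs in 'text' as a whole word, ignoring case.
    public static bool ContainsWord(string text, string word)
    {
        // Regex.Escape guards against special characters in the searched word.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsWord("I like football", "foo")); // False
        Console.WriteLine(ContainsWord("What a fool", "foo"));     // False
        Console.WriteLine(ContainsWord("Is Foo here?", "foo"));    // True
    }
}
```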

You don't need regular expressions in a case this simple. Simply loop over the lines and check if it contains foo.
using (StreamReader sr = File.OpenText("filename")) // File.Open returns a FileStream, not a StreamReader
{
string line = null;
while (!sr.EndOfStream) {
line = sr.ReadLine();
if (line.Contains("foo"))
{
// foo was found in the file
}
}
}

You could construct a regex which allows for newlines to be placed between every character.
private static bool IsSubstring(string input, string substring)
{
string[] letters = new string[substring.Length];
for (int i = 0; i < substring.Length; i += 1)
{
letters[i] = substring[i].ToString();
}
string regex = @"\b" + string.Join(@"(\r?\n?)", letters) + @"\b";
return Regex.IsMatch(input, regex, RegexOptions.ExplicitCapture);
}
