Let me rephrase my question:
I am reading text from a file in which one of the characters is the registered symbol, ®. The text file itself displays the symbol without any problem. When I print the string after reading it from the file, the symbol comes out as an unprintable character. When I split the string into characters, convert each character to an Int16, and print the hex value, I get 0xFFFD. I specify Encoding.UTF8 when I open the StreamReader.
Here is what I have:
using (System.IO.StreamReader sr = new System.IO.StreamReader(HttpContext.Current.Server.MapPath("~/App_Code/Hormel") + "/nutrition_data.txt", System.Text.Encoding.UTF8))
{
string line;
while((line = sr.ReadLine()) != null)
{
// after splitting the file on '~'
items[i] = scrubData(utf8.GetString(utf8.GetBytes(items[i].ToCharArray())));
//items[i] = scrubData(items[i]); //original
}
}
Here is the scrubData function
private String scrubData(string data)
{
string newStr = String.Empty;
try
{
if (data.Contains("HORMEL"))
{
string[] s = data.Split(' ');
foreach(string str in s)
{
if (str.Contains("HORMEL"))
{
char[] ch = str.ToCharArray();
for(int i=0; i<ch.Length; i++)
{
EventLogProvider.LogInformation("LoadNutritionInfoTask", "Test", ch[i] + " = " + String.Format("{0:X}", Convert.ToInt16(ch[i])));
}
}
}
}
return String.Empty;
}
catch (Exception ex)
{
EventLogProvider.LogInformation("LoadNutritionInfoTask", "ScrubData", ex.Message);
return data;
}
}
I'm not concerned with what is being returned right now; I am only printing out the characters and the hex codes that correspond to them.
First, you need to make sure you're reading the text with the correct encoding. It appears that the file is UTF-8, since you say ® (Unicode code point U+00AE) is 0xC2AE, which is how UTF-8 encodes it (the two bytes 0xC2 0xAE). You can decode it like this:
Encoding.UTF8.GetString(new byte[] { 0xc2, 0xae }) // "®", the registered symbol
// or
using (var streamReader = new StreamReader(file, Encoding.UTF8))
Once you've got it as a string in C#, you should use HttpUtility.HtmlEncode to encode it as HTML. E.g.
HttpUtility.HtmlEncode("SomeStuff®") // result is "SomeStuff&#174;", a numeric character reference for ®
Check which encoding you are decoding the bytes with.
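The value 0xFFFD is the Unicode replacement character, which the UTF-8 decoder emits when it meets bytes that are not valid UTF-8. One possible cause (an assumption, since the file's real encoding is not shown in the question) is a file saved in a single-byte encoding such as ISO-8859-1, where ® is the lone byte 0xAE. The sketch below uses a hypothetical helper, DecodeToHex, to show the same byte decoded both ways:

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Decodes the bytes with the given encoding and returns each
    // resulting UTF-16 code unit as four hex digits, space-separated.
    public static string DecodeToHex(byte[] bytes, Encoding encoding)
    {
        string text = encoding.GetString(bytes);
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

If the lone byte 0xAE decodes to FFFD under Encoding.UTF8 but to 00AE under Encoding.GetEncoding("iso-8859-1"), the file is probably not UTF-8, and the StreamReader should be opened with the encoding the file was actually saved in.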
Try this:
string txt = "textwithsymbol";
string html = "<html></html>";
txt = txt.Replace("\u00ae", html);
Replace the txt variable with the text you have read in; "\u00ae" is the symbol you are looking for.
Related
I have a text document that has two alphanumeric words in it. I would like to read the text file and display only the first one in my richTextBox
This is what I have so far but does not seem to work:
RichTextBox.CheckForIllegalCrossThreadCalls = false;
try
{
string filename = @"C:\Test\event.txt";
if (File.Exists(filename))
{
var last = File.ReadLines(filename).Last();
string[] words = last.Split(' ');
Console.WriteLine(words[0]);
richTextBox1.Text = File.ReadAllText(filename);
}
else
{
Debug.WriteLine("File does not exist.");
}
}
catch (Exception f)
{
Console.WriteLine(f);
}
At the moment it reads the entire text document.
Thanks
If I understand correctly, you can use FirstOrDefault to get the first line as a string, then use the Split method to get the first word.
if (File.Exists(filename))
{
var firstLine = File.ReadLines(filename).FirstOrDefault();
richTextBox1.Text = firstLine.Split(' ')[0];
}
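Note that FirstOrDefault returns null for an empty file, so calling Split on the result would then throw. A minimal null-safe sketch follows; the FirstWord helper and its class are hypothetical names, and it takes the lines as a sequence so it can be tried without a file:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FirstWordDemo
{
    // Returns the first whitespace-separated word of the first line,
    // or an empty string when there are no lines or the line is blank.
    public static string FirstWord(IEnumerable<string> lines)
    {
        string firstLine = lines.FirstOrDefault();
        if (string.IsNullOrWhiteSpace(firstLine))
        {
            return string.Empty;
        }
        // A null separator array makes Split use any whitespace as the delimiter.
        string[] words = firstLine.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return words[0];
    }
}
```

With the code from the question this could be used as richTextBox1.Text = FirstWordDemo.FirstWord(File.ReadLines(filename));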
You are writing the answer to the console instead of the text box. Just set:
richTextBox1.Text = File.ReadAllText(filename).Split(' ')[0];
There are probably more esoteric and performant ways to do this through binary or char reading; but this should solve your issue.
I'm having trouble reading a local file into a string in C#.
Here's what I have come up with so far:
string file = @"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
using (StreamReader reader = new StreamReader(file))
{
string line = "";
while ((line = reader.ReadLine()) != null)
{
textBox1.Text += line.ToString();
}
}
And it's the only solution that seems to work.
I've tried some other suggested methods for reading a file, such as:
string file = @"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
string html = File.ReadAllText(file).ToString();
textBox1.Text += html;
Yet it does not work as expected.
Here are the first few lines of the file I'm trying to read:
As you can see, it has some funky characters; honestly, I don't know whether that's the cause of this weird behavior.
But in the first case, the code seems to skip those lines, printing only "Document generated by Office Communicator..."
Your task would be easier if you could use an API or an SDK, or at least had a description of the format you are trying to read. However, the binary format does not look that complicated, and with a hex viewer installed I got far enough to extract the HTML from the example you provided.
To parse non-text files you fall back to the BinaryReader and use one of its Read methods to read the correct type from the byte stream. I used ReadByte and ReadInt32. Notice that the description of each method explains how many bytes it reads. That comes in handy when you try to decipher your file.
private string ParseHist(string file)
{
using (var f = File.Open(file, FileMode.Open))
{
using (var br = new BinaryReader(f))
{
// read 4 bytes as an int
var first = br.ReadInt32();
// read integer / zero ended byte arrays as string
var lead = br.ReadInt32();
// until we have 4 zero bytes
while (lead != 0)
{
var user = ParseString(br);
Trace.Write(lead);
Trace.Write(":");
Trace.Write(user.Length);
Trace.Write(":");
Trace.WriteLine(user);
lead = br.ReadInt32();
// weird special case
if (lead == 2)
{
lead = br.ReadInt32();
}
}
// at the start of the html block
var htmllen = br.ReadInt32();
Trace.WriteLine(htmllen);
// parse the html
var html = ParseString(br);
Trace.Write(htmllen);
Trace.Write(":");
Trace.Write(html.Length);
Trace.Write(":");
Trace.WriteLine(html);
// other structures follow, left unparsed
return html.ToString();
}
}
}
// a string seems to be ascii encoded and ends with a zero byte.
private static string ParseString(BinaryReader br)
{
var ch = br.ReadByte();
var sb = new StringBuilder();
while (ch != 0)
{
sb.Append((char)ch);
ch = br.ReadByte();
}
return sb.ToString();
}
You could use the simple parsing logic in a winform application as follows:
private void button1_Click(object sender, EventArgs e)
{
webBrowser1.DocumentText = ParseHist(@"5461EC8C-89E6-40D1-8525-774340083829-Copia.html");
}
Keep in mind that this is neither bulletproof nor the recommended way, but it should get you started. For files that don't parse well you'll need to go back to the hex viewer and work out which byte structures are new or different from what you already had. That is not something I intend to help you with; it is left as an exercise for you.
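One way to make the string reading a little more robust is sketched below. As written above, ParseString throws an EndOfStreamException if the file ends before a zero byte is found. The variant here, with the hypothetical name ReadZeroTerminated, stops at the end of the stream instead; it assumes a seekable stream (such as a FileStream or MemoryStream), because it relies on Position and Length:

```csharp
using System.IO;
using System.Text;

static class HistParsingDemo
{
    // Reads bytes up to (not including) the next zero byte and returns them as text.
    // If the stream ends before a zero byte is found, returns what was read so far.
    public static string ReadZeroTerminated(BinaryReader reader)
    {
        var sb = new StringBuilder();
        Stream stream = reader.BaseStream;
        while (stream.Position < stream.Length)
        {
            byte b = reader.ReadByte();
            if (b == 0)
            {
                break;
            }
            // Each byte is treated as one character, as in the original ParseString.
            sb.Append((char)b);
        }
        return sb.ToString();
    }
}
```

Feeding it a MemoryStream with a few hand-written bytes is a quick way to check assumptions about the format before pointing it at a real file.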
I don't know if it's the right way to answer this, but here's what I've managed to do so far:
string file = @"C:\script_test\{1C0365BC-54C6-4D31-A1C1-586C4575F9EA}.hist";
string outText = "";
//Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
StreamReader reader = new StreamReader(file, utf8);
char[] text = reader.ReadToEnd().ToCharArray();
//skip first n chars
/*
for (int i = 250; i < text.Length; i++)
{
outText += text[i];
}
*/
for (int i = 0; i < text.Length; i++)
{
//skips non printable characters
if (!Char.IsControl(text[i]))
{
outText += text[i];
}
}
string source = "";
source = WebUtility.HtmlDecode(outText);
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(source);
string html = "<html><style>";
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//style"))
{
html += node.InnerHtml+ Environment.NewLine;
}
html += "</style><body>";
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//body"))
{
html += node.InnerHtml + Environment.NewLine;
}
html += "</body></html>";
richTextBox1.Text += html+Environment.NewLine;
webBrowser1.DocumentText = html;
The conversation displays correctly, both style and encoding.
So it's a start for me.
Thank you all for the support!
EDIT
Char.IsControl(char)
skips non printable characters :)
I read the content of a CSV file from a zip file in memory (the requirement is not to write to disk) into a MemoryStream, and use the following code to get the human-readable string:
string result = Encoding.ASCII.GetString(memoryStream.ToArray());
However, we would like the result to be a string[] to map each row in the CSV file.
Is there a way to handle this automatically?
Thanks
Firstly, there's no need to call ToArray on the memory stream. Just use a StreamReader, and call ReadLine() repeatedly:
memoryStream.Position = 0; // Rewind!
List<string> rows = new List<string>();
// Are you *sure* you want ASCII?
using (var reader = new StreamReader(memoryStream, Encoding.ASCII))
{
string line;
while ((line = reader.ReadLine()) != null)
{
rows.Add(line);
}
}
You can use the Split method to split the string on newlines:
string[] result = Encoding.
ASCII.
GetString(memoryStream.ToArray()).
Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
Depending on the contents of your CSV file, this can be a much harder problem than you're giving it credit for.
Assume this is your CSV:
id, data1, data2
1, some data, more data
2, "This element has a new line
right in the middle of the field", and that can create problems if you're reading line by line
If you simply read this in line by line with reader.ReadLine(), you're not going to get what you want if you happen to have quoted fields with newlines in the middle (which is generally allowed in CSVs). You need something more like this:
List<String> results = new List<string>();
StringBuilder nextRow = new StringBuilder();
bool inQuote = false;
char nextChar;
while(reader.ReadChar(out nextChar)){ // pretend ReadChar reads a char into nextChar and returns false when it hits EOF
if(nextChar == '"'){
inQuote = !inQuote;
} else if(!inQuote && nextChar == '\n'){
results.Add(nextRow.ToString());
nextRow.Length = 0;
} else{ nextRow.Append(nextChar); }
}
Note that this handles double quotes. Missing quotes will be a problem, but they always are in .csv files.
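The pseudocode above relies on a pretend ReadChar method. A runnable sketch of the same idea using the real TextReader.Read() is shown below. The names CsvRowDemo and SplitRows are hypothetical, and unlike the pseudocode this version keeps the quote characters in each row (so a later field parser can still see them) and drops carriage returns that sit outside quotes:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class CsvRowDemo
{
    // Splits CSV text into rows, treating line breaks inside double quotes as field content.
    public static List<string> SplitRows(TextReader reader)
    {
        var rows = new List<string>();
        var row = new StringBuilder();
        bool inQuote = false;
        int next;
        while ((next = reader.Read()) != -1)   // Read() returns -1 at end of input
        {
            char c = (char)next;
            if (c == '"')
            {
                inQuote = !inQuote;   // an escaped quote ("") toggles twice, leaving the state unchanged
                row.Append(c);
            }
            else if (!inQuote && c == '\n')
            {
                rows.Add(row.ToString());
                row.Length = 0;
            }
            else if (!inQuote && c == '\r')
            {
                // skip carriage returns outside quotes (the CR of a CRLF row ending)
            }
            else
            {
                row.Append(c);
            }
        }
        if (row.Length > 0)
        {
            rows.Add(row.ToString());   // last row when the input has no trailing newline
        }
        return rows;
    }
}
```

With the MemoryStream from the question, this could be called as CsvRowDemo.SplitRows(new StreamReader(memoryStream)) and the result converted with ToArray() if a string[] is required. For anything beyond simple files, a dedicated CSV library is likely the safer choice.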
So here is my problem: I'm trying to get the content of a text file as a string and then parse it. What I want is an array containing each word and only words (no blanks, no backspaces, no \n ...). I use a function, LireFichier, that returns a string containing the text from the file (this works, because the text is displayed correctly). But when I try to parse it, the parsing fails and seems to concatenate parts of my string at random, and I don't understand why.
Here is the content of the text file I'm using :
truc,
ohoh,
toto, tata, titi, tutu,
tete,
and here's my final string :
;tete;;titi;;tata;;titi;;tutu;
which should be:
truc;ohoh;toto;tata;titi;tutu;tete;
Here is the code I wrote (all using are ok):
namespace ConsoleApplication1{
class Program
{
static void Main(string[] args)
{
string chemin = "MYPATH";
string res = LireFichier(chemin);
Console.WriteLine("End of reading...");
Console.WriteLine("{0}",res);// The result at this point is good
Console.WriteLine("...starting parsing");
res = parseString(res);
Console.WriteLine("Chaine finale : {0}", res);// The result here is awful
Console.ReadLine();//pause
}
public static string LireFichier(string FilePath) //Read the file, send back a string with the text
{
StreamReader streamReader = new StreamReader(FilePath);
string text = streamReader.ReadToEnd();
streamReader.Close();
return text;
}
public static string parseString(string phrase)// is supposed to parse the string
{
string fin="\n";
char[] delimiterChars = { ' ','\n',',','\0'};
string[] words = phrase.Split(delimiterChars);
TabToString(words);//I check the content of my tab
for(int i=0;i<words.Length;i++)
{
if (words[i] != null)
{
fin += words[i] +";";
Console.WriteLine(fin);//help for debug
}
}
return fin;
}
public static void TabToString(string[] montab)//display the content of my tab
{
foreach(string s in montab)
{
Console.WriteLine(s);
}
}
}// End of class Program
}
I think your main issue is
string[] words = phrase.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);
You could try using the string splitting option to remove empty entries for you:
string[] words = phrase.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);
See the documentation here.
Try this:
class Program
{
static void Main(string[] args)
{
var inString = LireFichier(#"C:\temp\file.txt");
Console.WriteLine(ParseString(inString));
Console.ReadKey();
}
public static string LireFichier(string FilePath) //Read the file, send back a string with the text
{
using (StreamReader streamReader = new StreamReader(FilePath))
{
string text = streamReader.ReadToEnd();
streamReader.Close();
return text;
}
}
public static string ParseString(string input)
{
input = input.Replace(Environment.NewLine,string.Empty);
input = input.Replace(" ", string.Empty);
string[] chunks = input.Split(',');
StringBuilder sb = new StringBuilder();
foreach (string s in chunks)
{
sb.Append(s);
sb.Append(";");
}
return sb.ToString(0, sb.ToString().Length - 1);
}
}
Or this:
public static string ParseFile(string FilePath)
{
using (var streamReader = new StreamReader(FilePath))
{
return streamReader.ReadToEnd().Replace(Environment.NewLine, string.Empty).Replace(" ", string.Empty).Replace(',', ';');
}
}
Your main problem is that you are splitting on \n, but the linebreaks read from your file are \r\n.
Your output string does contain all of your items, but the \r characters left in it cause later "lines" to overwrite earlier "lines" on the console.
(\r is a "return to start of line" instruction; without the \n "move to the next line" instruction, the words from line 1 are overwritten by those from line 2, then line 3 and line 4.)
You need to split on \r as well as \n, and to check that a string is not null or empty before adding it to your output (or, preferably, use StringSplitOptions.RemoveEmptyEntries as others have mentioned).
string ParseString(string filename) {
return string.Join(";", System.IO.File.ReadAllLines(filename).Where(x => x.Length > 0).Select(x => string.Join(";", x.Split(",".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Select(y => y.Trim()))).Select(z => z.Trim())) + ";";
}
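Putting the two points from this thread together (split on \r as well as \n, and drop empty entries), a minimal sketch could look like this. JoinWords and its class are hypothetical names, and the tab character is added to the delimiters as an assumption about what counts as a blank:

```csharp
using System;

static class WordListDemo
{
    // Splits the text on blanks, commas and both line-break characters,
    // discards empty entries, and returns the words each followed by ';'.
    public static string JoinWords(string text)
    {
        char[] delimiters = { ' ', ',', '\r', '\n', '\t' };
        string[] words = text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
        return words.Length == 0 ? string.Empty : string.Join(";", words) + ";";
    }
}
```

Because \r is now a delimiter, no carriage return survives into the output, so the console no longer overwrites earlier words.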
I'm trying to obtain the correct unicode characters represented by this string:
string originalString = "\u0605\u04c3\u5000\u0000\u5000\ufd00\u4400\ud500\u7600\ud300\u4f00\ubc00\u0c00\u2d00\u4000\ue400\u0e00\u7400\u4800\ub700\u1d00\u1300\ue900\u6000\u4c00\ufb00\u9900\u3900\ud900\u6700\uae00\ueb00\u8f00\u2800\u0200\ub300\u5c00\ufe00\u0100\u3d00\u9100\u3000\u0300\u1600\u0100\u7000\u6200\u8e00\u1d00\u8e00\u6200\ua900\u6300\uc800\u0900\ub700\ub000\u6000\ue400\u9200\u3f00\u9100\u8d00\uef00\u3600\u0100\u9e00\u0081";
If I hard-code it in the cs file, I can see in debug mode that it shows the correct characters, but if I have the exact string written in a file and I try to read it, it shows the string as it is in the file.
TextReader tr = new StreamReader("c:\\test.txt");
string tmpString = tr.ReadLine();
tr.Close();
byte[] array = Encoding.Unicode.GetBytes(tmpString );
string finalResult = Encoding.Unicode.GetString(array);
How can I make the finalResult string have the correct unicode characters?
Thanks in advance
Gonçalo
EDIT: Already tried placing
TextReader tr = new StreamReader("c:\\test.txt",Encoding.Unicode);
but the characters are different from the correct ones.
Does your file actually contain the content:
\u0605\u04c3\u5000\u0000\u5000\ufd00\u4400\ud500\u7600\ud300\u4f00
\ubc00\u0c00\u2d00\u4000\ue400\u0e00\u7400\u4800\ub700\u1d00\u1300
\ue900\u6000\u4c00\ufb00\u9900\u3900\ud900\u6700\uae00\ueb00\u8f00
\u2800\u0200\ub300\u5c00\ufe00\u0100\u3d00\u9100\u3000\u0300\u1600
\u0100\u7000\u6200\u8e00\u1d00\u8e00\u6200\ua900\u6300\uc800\u0900
\ub700\ub000\u6000\ue400\u9200\u3f00\u9100\u8d00\uef00\u3600\u0100\u9e00\u0081
If so, you need to convert each escape sequence to its corresponding Unicode character:
string originalString = "\u0605\u04c3\u5000\u0000\u5000\ufd00\u4400\ud500\u7600\ud300\u4f00\ubc00\u0c00\u2d00\u4000\ue400\u0e00\u7400\u4800\ub700\u1d00\u1300\ue900\u6000\u4c00\ufb00\u9900\u3900\ud900\u6700\uae00\ueb00\u8f00\u2800\u0200\ub300\u5c00\ufe00\u0100\u3d00\u9100\u3000\u0300\u1600\u0100\u7000\u6200\u8e00\u1d00\u8e00\u6200\ua900\u6300\uc800\u0900\ub700\ub000\u6000\ue400\u9200\u3f00\u9100\u8d00\uef00\u3600\u0100\u9e00\u0081";
string tmpString = "\\u0605\\u04c3\\u5000\\u0000\\u5000\\ufd00\\u4400\\ud500\\u7600\\ud300\\u4f00\\ubc00\\u0c00\\u2d00\\u4000\\ue400\\u0e00\\u7400\\u4800\\ub700\\u1d00\\u1300\\ue900\\u6000\\u4c00\\ufb00\\u9900\\u3900\\ud900\\u6700\\uae00\\ueb00\\u8f00\\u2800\\u0200\\ub300\\u5c00\\ufe00\\u0100\\u3d00\\u9100\\u3000\\u0300\\u1600\\u0100\\u7000\\u6200\\u8e00\\u1d00\\u8e00\\u6200\\ua900\\u6300\\uc800\\u0900\\ub700\\ub000\\u6000\\ue400\\u9200\\u3f00\\u9100\\u8d00\\uef00\\u3600\\u0100\\u9e00\\u0081";
string finalResult = Regex.Replace(tmpString, @"\\u(....)", match => ((char)int.Parse(match.Groups[1].Value, System.Globalization.NumberStyles.HexNumber)).ToString());
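The same replacement can be wrapped in a small helper. The sketch below is a slightly stricter variant: it only matches exactly four hex digits after \u, so text that merely looks like an escape is left alone instead of making int.Parse throw. The names UnicodeEscapeDemo and Unescape are hypothetical:

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

static class UnicodeEscapeDemo
{
    // Replaces each literal \uXXXX sequence (backslash, 'u', four hex digits)
    // with the UTF-16 code unit it names; everything else is left untouched.
    public static string Unescape(string text)
    {
        return Regex.Replace(
            text,
            @"\\u([0-9a-fA-F]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }
}
```

Applied to the line read from the file, this would be string finalResult = UnicodeEscapeDemo.Unescape(tmpString);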
You can pass the Encoding as a parameter when opening the file:
TextReader tr = new StreamReader("c:\\test.txt",Encoding.Unicode);
string unicode_string = tr.ReadLine();
Try something like:
TextReader streamReader = new StreamReader("c:\\test.txt");
string input = streamReader.ReadLine();
string[] chars = input.Split(new char[] { '\\', 'u' },
StringSplitOptions.RemoveEmptyEntries);
streamReader.Close();
string answer = string.Empty;
foreach (string charachter in chars)
{
byte byte1 = byte.Parse(string.Format("{0}{1}",
charachter[0], charachter[1]), NumberStyles.AllowHexSpecifier);
byte byte2 = byte.Parse(string.Format("{0}{1}",
charachter[2], charachter[3]), NumberStyles.AllowHexSpecifier);
answer += Encoding.Unicode.GetString(new byte[] { byte2, byte1 });
}