C# Parse MemoryStream text from RichTextBox with special characters

I need your help finding the best/fastest way to parse (with regular expressions) the text in a RichTextBox.
I have already tried several methods, and the fastest one so far seems to be saving the text into a MemoryStream and reading it line by line while performing the validation.
I have no problem doing that and it actually seems to work pretty well... except when I have special chars, Latin chars to be more specific. Let's say, for example, that I have the name "João" (John in English, BTW); the text coming from the StreamReader appears as "Jo\'e3o", resulting in a failure to find the text.
I'm not sure if this is because of encoding. I have tried setting the encoding to UTF-8 when creating the StreamReader, but it doesn't work; I always see the text with those codes.
I am starting to think that my only option is to parse the text or lines from the RichTextBox object directly, but that is so much slower...
UPDATE
Adding some example code on how I'm reading the RichTextBox text.
(This seems to be the fastest way to read large amounts of text.)
var rtb = new RichTextBox();
var rtbMemStream = new MemoryStream();
rtb.SaveFile(rtbMemStream, RichTextBoxStreamType.RichText);
rtbMemStream.Seek(0, SeekOrigin.Begin); // rewind the stream before reading
using (StreamReader sr = new StreamReader(rtbMemStream, Encoding.UTF8))
{
    while (!sr.EndOfStream)
    {
        var streamLine = sr.ReadLine();
        ParseLine(streamLine);
    }
}
Any help or suggestions are appreciated.
Thank you in advance.
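For reference: the \'e3 in "Jo\'e3o" is an RTF hex escape (0xE3 is "ã" in the Windows-1252 code page), not a StreamReader encoding problem, because RichTextBoxStreamType.RichText streams the control's contents as RTF markup, escapes included. Below is a minimal sketch of streaming the plain text instead; it assumes a WinForms RichTextBox named rtb and that ParseLine is the existing validation routine from the question.
// Sketch only: UnicodePlainText writes the visible text (no RTF codes) as
// UTF-16, so "João" comes through intact.
using (var stream = new MemoryStream())
{
    rtb.SaveFile(stream, RichTextBoxStreamType.UnicodePlainText);
    stream.Seek(0, SeekOrigin.Begin); // rewind before reading

    using (var sr = new StreamReader(stream, Encoding.Unicode))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            ParseLine(line); // existing validation routine
        }
    }
}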

Related

How to correctly encode Arabic subtitles in C#?

Hello there, I am creating a video player with subtitle support using the MediaElement class and the SubtitlesParser library. I ran into an issue with 7 Arabic subtitle files (.srt) being displayed as ???? or similarly garbled text.
I tried multiple different encodings, but with no luck:
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream);
subLine = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(subLine));
or
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream,Encoding.UTF8);
Then I found this, and based on the answer I used Encoding.Default ("ANSI") to parse the subtitles and then re-interpret the encoded text:
// Parse with the system ANSI code page, then fix each line: convert the
// mis-decoded string back to Windows-1252 bytes and re-decode those bytes
// as Arabic (Windows-1256).
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream, Encoding.Default);
var arabic = Encoding.GetEncoding(1256);
var latin = Encoding.GetEncoding(1252);
foreach (var item in SubtitlesList)
{
    List<string> lines = new List<string>();
    lines.AddRange(item.Lines.Select(line => arabic.GetString(latin.GetBytes(line))));
    item.Lines = lines;
}
This worked on only 4 of the files; the rest still show ?????? and nothing I have tried so far works on them. This is what I found so far:
exoplayer weird arabic persian subtitles format (this gave me a hint about the real problem).
C# Converting encoded string IÜÜæØÜÜ?E? to readable arabic (Same answer).
convert string from Windows 1256 to UTF-8 (Same answer).
How can I transform string to UTF-8 in C#? (It works for Spanish language but not arabic).
I am also hoping to find a single solution that correctly displays all the files. Is this possible?
Please forgive my simple language; English is not my native language.
I think I found the answer to my question. As a beginner I only had a basic knowledge of encodings until I found this article:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Your text editor, browser, word processor or whatever else that's trying to read the document is assuming the wrong encoding. That's all. The document is not broken, there's no magic you need to perform, you simply need to select the right encoding to display the document.
I hope this helps anyone else who is confused about the correct way to handle this: there is no way for a program to know a file's correct encoding; only the user can.
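In practice that means letting the user pick the encoding and re-parsing the file with it. A minimal sketch, reusing the ParseStream(stream, encoding) overload and the SubtitlesList variable from the code above; the path variable and the list of candidate code pages are just placeholders for whatever the UI offers.
// Sketch: re-parse the same .srt file with an encoding chosen by the user.
Encoding chosen = Encoding.GetEncoding(1256); // e.g. Arabic (Windows-1256), picked by the user
using (var fileStream = File.OpenRead(path))
{
    SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream, chosen);
}

// Likely candidates to offer in a menu: UTF-8, Arabic (Windows-1256), Latin (Windows-1252).
var candidates = new[] { Encoding.UTF8, Encoding.GetEncoding(1256), Encoding.GetEncoding(1252) };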

C# Reading files and encoding issue

I've searched everywhere for this answer, so hopefully it's not a duplicate. I decided I'm finally just going to ask it here.
I have a file named Program1.exe. When I drag that file into Notepad or Notepad++, I get all kinds of random symbols and then some readable text. However, when I try to read this file in C#, I either get inaccurate results or just a big "MZ". I've tried all the supported encodings in C#. How can the Notepad programs read a file like this when I simply can't? I've tried converting the bytes to a string and it doesn't work. I've tried reading it directly line by line and it doesn't work. I've even tried binary and it doesn't work.
Thanks for the help! :)
Reading a binary file as text is a peculiar thing to do, but it is possible. Any of the 8-bit encodings will do it just fine. For example, the code below opens and reads an executable and outputs it to the console.
const string fname = @"C:\mystuff\program.exe";
using (var sr = new StreamReader(fname, Encoding.GetEncoding("windows-1252")))
{
    var s = sr.ReadToEnd();
    s = s.Replace('\x0', ' '); // replace NUL bytes with spaces
    Console.WriteLine(s);
}
The result is very similar to what you'll see in Notepad or Notepad++. The "funny symbols" will differ based on how your console is configured, but you get the idea.
By the way, if you examine the string in the debugger, you're going to see something quite different. Those funny symbols are encoded as C# character escapes. For example, nul bytes (value 0) will display as \0 in the debugger, as NUL in Notepad++, and as spaces on the console or in Notepad. Newlines show up as \r in the debugger, etc.
As I said, reading a binary file as text is pretty peculiar. Unless you're just looking to see if there's human-readable data in the file, I can't imagine why you'd want to do this.
Update
I suspect the reason all you see in the Windows Forms TextBox is "MZ" is that the underlying Windows textbox control (which is what TextBox ultimately uses) treats the NUL character as a string terminator, so it won't display anything after the first NUL. And the first thing after the "MZ" is a NUL (shown as \0 in the debugger). You'll have to replace the NULs in the string with spaces. I edited the code example above to show how you'd do that.
The exe is a binary file and if you try to read it as a text file you'll get the effect that you are describing. Try using something like a FileStream instead that does not care about the structure of the file but treats it just as a series of bytes.
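For completeness, a minimal sketch of the raw-bytes route suggested above (the file path is just an example): read the executable as bytes, check the "MZ" signature, and only then decode whatever fragments you want to look at.
// Sketch: treat the file as raw bytes instead of text.
byte[] bytes = File.ReadAllBytes(@"C:\mystuff\program.exe");

// A Windows executable starts with the two-byte "MZ" DOS header signature.
bool looksLikeExe = bytes.Length > 1 && bytes[0] == (byte)'M' && bytes[1] == (byte)'Z';
Console.WriteLine("MZ header present: {0}, size: {1} bytes", looksLikeExe, bytes.Length);

// To eyeball any human-readable fragments, decode with an 8-bit encoding
// and swap NUL bytes for spaces, as in the answer above.
string text = Encoding.GetEncoding("windows-1252").GetString(bytes).Replace('\0', ' ');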

c# encoding problems (question marks) while reading file from StreamReader

I have a problem reading a .txt file in my Windows Phone app.
I've made a simple app that reads a stream from a .txt file and prints it.
Unfortunately I'm from Italy and we have many letters with accents, and here's the problem: all the accented letters are printed as question marks.
Here's the sample code:
var resourceStream = Application.GetResourceStream(new Uri("frasi.txt", UriKind.RelativeOrAbsolute));
if (resourceStream != null)
{
    //System.Text.Encoding.Default, true
    using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.UTF8))
    {
        string line = reader.ReadLine();
        while (line != null)
        {
            frasi.Add(line);
            line = reader.ReadLine();
        }
    }
}
So I'm asking how to avoid this issue.
All the best.
[EDIT] Solution: I hadn't made sure the file was encoded in UTF-8. I saved it with the correct encoding and it worked like a charm. Thank you, Oscar.
You need to use Encoding.Default. Change:
using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.UTF8))
to
using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.Default))
What you have commented out is what you should be using if you do not know the exact encoding of your source data. System.Text.Encoding.Default uses the encoding for the operating system's current ANSI code page and provides the best chance of a correct decoding; it should pick up the current region settings.
However, MSDN gives this warning:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
Despite this, in my experience with data coming from a number of different sources and cultures, this is the option that provides the most consistent results out of the box, especially for diacritical marks, which get turned into question marks when ANSI data is read as UTF-8.
I hope this helps.
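If you control the file (as the OP did), the simplest route is the one both the MSDN note and the OP's edit point at: save the file as UTF-8 with a preamble (BOM) and let StreamReader detect it. A minimal sketch follows; the file name and sample text are placeholders, the write step uses desktop .NET file APIs for illustration, and on Windows Phone you would apply the same StreamReader constructor to the resource stream instead of a path.
// Write the file as UTF-8 with a byte-order mark (the "preamble").
File.WriteAllText("frasi.txt", "perché, città, più",
    new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));

// StreamReader detects the BOM by default and decodes the accents correctly.
var frasi = new List<string>(); // same list as in the question
using (var reader = new StreamReader("frasi.txt", detectEncodingFromByteOrderMarks: true))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        frasi.Add(line); // accented letters such as "perché" arrive intact
    }
}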

Apostrophe character in an html email not showing

I am using System.Net.Mail and I am reading HTML into the body of an email.
Unfortunately the apostrophe character ' is shown as a question mark with a black background.
I have tried replacing the apostrophe with the HTML entity &apos;, but this still displays the question mark with a black background. Other HTML tags (h1, p, etc.) are working fine.
I know there must be a really obvious answer, but I cannot seem to find it. Thanks for your help.
UPDATE
It appears that it is System.IO.StreamReader that is causing my problem.
using (StreamReader reader = new StreamReader("/Email/Welcome.htm"))
{
body = reader.ReadToEnd();
//body string now has odd question mark character instead of apostrophe.
}
If you know the encoding of your file you will want to pass that to your StreamReader initialization:
using (StreamReader reader = new StreamReader("/Email/Welcome.htm", "Windows-1252"))
{
body = reader.ReadToEnd();
// If the encoding is correct you'll now see ´ rather than �
// Which, by the way is the unicode replacement character
// See: http://www.fileformat.info/info/unicode/char/fffd/index.htm
}
You need to save the file in Unicode (UTF-8) format to get it right.
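Since the body ends up in a System.Net.Mail message, it is also worth making the mail encoding explicit once the file itself reads correctly. A minimal sketch; the addresses and subject are placeholders, and it assumes the template has been re-saved as UTF-8 as suggested above.
// Read the template with the encoding it was actually saved in (UTF-8 here).
string body = File.ReadAllText("/Email/Welcome.htm", Encoding.UTF8);

var message = new MailMessage("from@example.com", "to@example.com")
{
    Subject = "Welcome",
    Body = body,
    IsBodyHtml = true,
    BodyEncoding = Encoding.UTF8 // keeps ' and other non-ASCII characters intact
};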

Search and Replace of text in a memorystream in C# .NET

I have loaded a MemoryStream with a Word document and I want to be able to alter specific text within the MemoryStream and save it back to the Word document, like search-and-replace functionality. Can anyone help me with this? I don't want to use the Word Interop libraries. I already have the code to load and save the document; please see below. The problem is that if I convert the MemoryStream to a string and use the string Replace method, all the formatting within the Word document is lost when I save the string, and when I open the document all it shows is black boxes all over the place.
private void ReplaceInFile(string filePath, string searchText, string replaceText)
{
    // Load the document into a MemoryStream...
    byte[] inputFile = File.ReadAllBytes(filePath);
    MemoryStream memory = new MemoryStream(inputFile);

    // ...and write the (still unmodified) bytes back out to a new .doc file.
    // The actual search/replace step is what is missing here.
    byte[] data = memory.ToArray();
    string pathStr = Request.PhysicalApplicationPath + "\\Docs\\OutputDocument.doc";
    FileInfo wordFile = new FileInfo(pathStr);
    FileStream fileStream = wordFile.Open(FileMode.Create, FileAccess.Write, FileShare.None);
    fileStream.Write(data, 0, data.Length);
    fileStream.Close();
    memory.Close();
}
I copied this from sample code on the internet, which is why a MemoryStream was used; I had no idea how to do it otherwise. My issue is that the company I work for doesn't want to use the Word Interop, because they have found that Word can occasionally display popup dialog boxes that prevent the coded functionality from executing. That is why I want to achieve mail-merge-like functionality, but programmatically. I did a very similar thing many years ago, but in Delphi, not C#, and I have since lost the code. So if anyone can shed any light on this I would be grateful.
You will have to use the Word interop libraries - or at least something similar. It's not like Word documents are just plain text documents - they're binary files. Converting the bytes into a string and doing a replace that way is going to break the document completely.
With the new open formats you may be able to write your own code to parse them, but it's going to be significantly harder than using a library.
Your best bet is to convert the file to OOXML; then it's an XML file which you can update programmatically using a string find/replace, System.Xml, or LINQ to XML.
(See http://blogs.msdn.com/b/ericwhite/archive/2008/09/19/bulk-convert-doc-to-docx.aspx for more info on the server side conversion process.)
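To make the OOXML suggestion concrete: a .docx package is a zip archive whose body text lives in word/document.xml, so a plain-text replace can be done without Interop using ZipArchive (.NET 4.5+). This is a minimal sketch, not the only way to do it; it assumes the document has already been converted to .docx (it will not work on the old binary .doc format) and that the search text is not split across XML runs by Word's formatting.
using System.IO;
using System.IO.Compression; // reference System.IO.Compression and System.IO.Compression.FileSystem
using System.Text;

private void ReplaceInDocx(string filePath, string searchText, string replaceText)
{
    using (ZipArchive archive = ZipFile.Open(filePath, ZipArchiveMode.Update))
    {
        ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
        if (entry == null)
            throw new InvalidDataException("Not a .docx package.");

        // Read the body XML as text.
        string xml;
        using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
        {
            xml = reader.ReadToEnd();
        }

        // Naive replace; fails if Word has split the search text across runs.
        xml = xml.Replace(searchText, replaceText);

        // Rewrite the entry with the modified XML.
        entry.Delete();
        ZipArchiveEntry updated = archive.CreateEntry("word/document.xml");
        using (var writer = new StreamWriter(updated.Open(), new UTF8Encoding(false)))
        {
            writer.Write(xml);
        }
    }
}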
