Reading multi language text file in c#

I have to read a text file which can contain characters from the following languages: English, Japanese, Chinese, French, Spanish, German, Italian.
My task is simply to read the data and write it to a new text file, inserting a newline character (\n) after every 100 characters.
I cannot use File.ReadAllText or File.ReadAllLines because the file can be larger than 500 MB, so I have written the following code:
using (var streamReader = new StreamReader(inputFilePath, Encoding.ASCII))
{
using (var streamWriter = new StreamWriter(outputFilePath,false))
{
char[] bytes = new char[100];
while (streamReader.Read(bytes, 0, 100) > 0)
{
var data = new string(bytes);
streamWriter.WriteLine(data);
}
MessageBox.Show("Compleated");
}
}
Besides ASCII, I have tried UTF-7, UTF-8, UTF-32 and IBM500, but had no luck reading and writing the multi-language characters.
Please help me to achieve this.

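You will have to take a look at the first 4 bytes of the file you are parsing.
If the file starts with a byte-order mark (BOM), these bytes tell you which encoding to use.
Here is a helper method I have written to do the task:
public static class EncodingExtensions
{
    // Picks an encoding from the BOM (if any), skips the BOM, and decodes the rest.
    public static string GetStringFromEncodedBytes(this byte[] bytes)
    {
        var encoding = Encoding.Default; // fallback when no BOM is present
        var skipBytes = 0;
        // Test the 4-byte UTF-32 BOMs first: FF FE 00 00 also begins with the UTF-16 LE BOM.
        if (bytes.Length >= 4 && bytes[0] == 0xff && bytes[1] == 0xfe && bytes[2] == 0 && bytes[3] == 0) {
            encoding = Encoding.UTF32; // UTF-32 little-endian
            skipBytes = 4;
        } else if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 0xfe && bytes[3] == 0xff) {
            encoding = new UTF32Encoding(true, true); // UTF-32 big-endian
            skipBytes = 4;
        } else if (bytes.Length >= 3 && bytes[0] == 0xef && bytes[1] == 0xbb && bytes[2] == 0xbf) {
            encoding = Encoding.UTF8;
            skipBytes = 3;
        } else if (bytes.Length >= 3 && bytes[0] == 0x2b && bytes[1] == 0x2f && bytes[2] == 0x76) {
            encoding = Encoding.UTF7;
            skipBytes = 3;
        } else if (bytes.Length >= 2 && bytes[0] == 0xff && bytes[1] == 0xfe) {
            encoding = Encoding.Unicode; // UTF-16 little-endian
            skipBytes = 2;
        } else if (bytes.Length >= 2 && bytes[0] == 0xfe && bytes[1] == 0xff) {
            encoding = Encoding.BigEndianUnicode; // UTF-16 big-endian
            skipBytes = 2;
        }
        // Decode from the offset directly; no intermediate array copy is needed.
        return encoding.GetString(bytes, skipBytes, bytes.Length - skipBytes);
    }
}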

This is a good enough start to get to the answer. Read can return fewer characters than requested, so if i is not equal to 100 you need to read more chars before writing the line. French chars like é are no trouble: they all fit in the C# char type.
char[] soFlow = new char[100];
int filled = 0; // number of chars currently held in the buffer
using (StreamReader sr = new StreamReader("a.txt"))
using (StreamWriter sw = new StreamWriter("b.txt", false))
{
    int i;
    // Keep reading into the unused part of the buffer until it holds 100 chars.
    while ((i = sr.Read(soFlow, filled, soFlow.Length - filled)) > 0)
    {
        filled += i;
        if (filled == soFlow.Length)
        {
            sw.WriteLine(new string(soFlow));
            filled = 0;
        }
    }
    // Write the final, shorter chunk without stale buffer contents.
    if (filled > 0)
        sw.WriteLine(new string(soFlow, 0, filled));
}
Spec: Read(Char[], Int32, Int32) Reads a specified maximum of characters from the current stream into a buffer, beginning at the specified index.
Certainly worked for me anyway :)
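Putting both answers together, here is a minimal sketch of the streaming approach. The class and method names (TextRewrapper, Rewrap, RewrapFile) are made up for illustration, and the UTF-8 fallback for files without a BOM is an assumption, not something the question confirms. StreamReader is asked to detect the encoding from a BOM, and TextReader.ReadBlock is used so that every chunk except the last is full.

```csharp
using System.IO;
using System.Text;

public static class TextRewrapper
{
    // Copies text from reader to writer, emitting '\n' after every `width` chars.
    // Only `width` chars are held in memory at a time, so file size does not matter.
    public static void Rewrap(TextReader reader, TextWriter writer, int width)
    {
        var buffer = new char[width];
        int read;
        // ReadBlock blocks until `width` chars are read or the input ends.
        while ((read = reader.ReadBlock(buffer, 0, width)) > 0)
        {
            writer.Write(buffer, 0, read); // write only what was actually read
            writer.Write('\n');
        }
    }

    // Hypothetical file wrapper: assumes UTF-8 unless a BOM says otherwise.
    public static void RewrapFile(string inputPath, string outputPath, int width)
    {
        using (var reader = new StreamReader(inputPath, Encoding.UTF8, true))
        using (var writer = new StreamWriter(outputPath, false, new UTF8Encoding(false)))
        {
            Rewrap(reader, writer, width);
        }
    }
}
```

Note that a chunk boundary counted in chars can fall between the two halves of a surrogate pair (characters outside the Basic Multilingual Plane), so a stricter version would need to check for that before inserting the newline.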

Related

How to convert file to unix or windows in c# when I don't know the original encoding

In C#, I have a file which has Unix line endings (\n) and I need to replace them with Windows line endings (\r\n). But:
1 - I don't know the original file encoding (UTF-8, Unicode, ISO-8859-1, etc.) and
2 - I don't know how big the original file may be.
The first point is important - I cannot simply read and write each line using a StreamWriter because I don't know the original encoding.
How can I achieve this?

private void Unix2Dos(string fileName)
{
const byte CR = 0x0D;
const byte LF = 0x0A;
byte[] DOS_LINE_ENDING = new byte[] { CR, LF };
byte[] data = File.ReadAllBytes(fileName);
using (FileStream fileStream = File.OpenWrite(fileName))
{
BinaryWriter bw = new BinaryWriter(fileStream);
int position = 0;
int index = 0;
do
{
index = Array.IndexOf<byte>(data, LF, position);
if (index >= 0)
{
if ( ( index > 0 ) && (data[index - 1] == CR ))
{
// already dos ending
bw.Write(data, position, index - position + 1);
}
else
{
bw.Write(data, position, index - position);
bw.Write(DOS_LINE_ENDING);
}
position = index + 1;
}
}
while (index >= 0); // index 0 is a valid match (LF as the very first byte)
bw.Write(data, position, data.Length - position);
fileStream.SetLength(fileStream.Position);
}
}
Reference: http://csharp-goodies.blogspot.com/2011/02/convert-files-from-dos-to-unix-and-back.html
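Because File.ReadAllBytes loads the whole file, the method above does not address the "I don't know how big the file may be" part of the question. The following is a rough streaming sketch of the same byte-level idea; the names (LineEndings, UnixToDos) are hypothetical. It assumes an ASCII-compatible encoding such as ASCII, UTF-8 or ISO-8859-1, where the byte 0x0A can only mean LF; it would corrupt UTF-16 or UTF-32 text.

```csharp
using System.IO;

public static class LineEndings
{
    // Copies input to output, turning every bare LF into CR LF.
    // Existing CR LF pairs are left untouched, even across buffer boundaries.
    public static void UnixToDos(Stream input, Stream output)
    {
        const byte CR = 0x0D;
        const byte LF = 0x0A;
        var inBuf = new byte[64 * 1024];
        var outBuf = new byte[inBuf.Length * 2]; // worst case: every byte is a bare LF
        byte previous = 0;                       // last byte seen, kept across reads
        int read;
        while ((read = input.Read(inBuf, 0, inBuf.Length)) > 0)
        {
            int n = 0;
            for (int i = 0; i < read; i++)
            {
                byte b = inBuf[i];
                if (b == LF && previous != CR)
                    outBuf[n++] = CR; // insert the missing CR
                outBuf[n++] = b;
                previous = b;
            }
            output.Write(outBuf, 0, n);
        }
    }
}
```

Writing to a temporary file and replacing the original afterwards avoids overwriting data that has not been read yet.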

Decode cyrillic quoted-printable content

I'm using this sample for getting mail from a server. The problem is that the response contains Cyrillic symbols that I cannot decode.
Here is a header:
Content-type: text/html; charset="koi8-r"
Content-Transfer-Encoding: quoted-printable
And receive response function:
static void receiveResponse(string command)
{
try
{
if (command != "")
{
if (tcpc.Connected)
{
dummy = Encoding.ASCII.GetBytes(command);
ssl.Write(dummy, 0, dummy.Length);
}
else
{
throw new ApplicationException("TCP CONNECTION DISCONNECTED");
}
}
ssl.Flush();
byte[] bigBuffer = new byte[1024*16];
int bites = ssl.Read(bigBuffer, 0, bigBuffer.Length);
byte[] buffer = new byte[bites];
Array.Copy(bigBuffer, 0, buffer, 0, bites);
sb.Append(Encoding.ASCII.GetString(buffer));
string result = sb.ToString();
// here is an unsuccessful attempt at decoding
result = Regex.Replace(result, @"=([0-9a-fA-F]{2})",
m => m.Groups[1].Success
? Convert.ToChar(Convert.ToInt32(m.Groups[1].Value, 16)).ToString()
: "");
byte[] bytes = Encoding.Default.GetBytes(result);
result = Encoding.GetEncoding("koi8r").GetString(bytes);
}
catch (Exception ex)
{
throw new ApplicationException(ex.ToString());
}
}
How do I decode the stream correctly? In the result string I get <p>=F0=D2=C9=D7=C5=D4 =D1 =F7=C1=CE=D1</p> instead of <p>Привет я Ваня</p>.

As @Max pointed out, you will need to decode the content using the encoding algorithm declared in the Content-Transfer-Encoding header.
In your case, it is the quoted-printable encoding.
You will need to decode the text of the message into an array of bytes and then convert that array of bytes into a string using the appropriate System.Text.Encoding. The name of the encoding to use will typically be specified in the Content-Type header as the charset parameter (in your case, koi8-r).
Since you already have the text as bytes in the bigBuffer variable, simply perform the decoding on that:
byte[] buffer = new byte[bites];
int decodedLength = 0;
for (int i = 0; i < bites; i++) {
if (bigBuffer[i] == (byte) '=') {
if (bites > i + 2) {
// possible hex sequence (two more bytes must follow the '=')
byte b1 = bigBuffer[i + 1];
byte b2 = bigBuffer[i + 2];
if (IsXDigit (b1) && IsXDigit (b2)) {
// decode; the cast is needed because | on two bytes yields an int
buffer[decodedLength++] = (byte) ((ToXDigit (b1) << 4) | ToXDigit (b2));
i += 2;
} else if (b1 == (byte) '\r' && b2 == (byte) '\n') {
// folded line, drop the '=\r\n' sequence
i += 2;
} else {
// error condition, just pass it through
buffer[decodedLength++] = bigBuffer[i];
}
} else {
// truncated? just pass it through
buffer[decodedLength++] = bigBuffer[i];
}
} else {
buffer[decodedLength++] = bigBuffer[i];
}
}
string result = Encoding.GetEncoding ("koi8-r").GetString (buffer, 0, decodedLength);
Custom functions:
static byte ToXDigit (byte c)
{
if (c >= 0x41) {
if (c >= 0x61)
return (byte) (c - (0x61 - 0x0a));
return (byte) (c - (0x41 - 0x0A));
}
return (byte) (c - 0x30);
}
static bool IsXDigit (byte c)
{
return (c >= (byte) 'A' && c <= (byte) 'F') || (c >= (byte) 'a' && c <= (byte) 'f') || (c >= (byte) '0' && c <= (byte) '9');
}
Of course, instead of writing your own hodge podge IMAP library, you could just use MimeKit and MailKit ;-)
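For reference, the decoding steps above can be condensed into one self-contained helper. This is only a sketch: the names QuotedPrintable, Decode and HexValue are invented here, and it handles just the two cases discussed above (=XX escapes and the =\r\n soft line break), passing everything else through unchanged.

```csharp
using System.Collections.Generic;

public static class QuotedPrintable
{
    // Returns 0-15 for an ASCII hex digit, or -1 for anything else.
    private static int HexValue(byte c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    // Decodes quoted-printable bytes into the raw bytes they represent.
    public static byte[] Decode(byte[] input)
    {
        var output = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == (byte)'=' && i + 2 < input.Length)
            {
                int hi = HexValue(input[i + 1]);
                int lo = HexValue(input[i + 2]);
                if (hi >= 0 && lo >= 0)
                {
                    output.Add((byte)((hi << 4) | lo)); // =XX escape
                    i += 2;
                    continue;
                }
                if (input[i + 1] == (byte)'\r' && input[i + 2] == (byte)'\n')
                {
                    i += 2; // soft line break: drop "=\r\n"
                    continue;
                }
            }
            output.Add(input[i]); // literal byte, or a malformed/truncated escape
        }
        return output.ToArray();
    }
}
```

The resulting bytes would then go through Encoding.GetEncoding("koi8-r").GetString(...). On .NET Core and .NET 5+, that call only succeeds after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called; on .NET Framework the code page is available out of the box.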

Encoding errors in embedded Json file

I have run into an issue and can't quite get my head around it.
I have this code:
public List<NavigationModul> LoadNavigation()
{
byte[] navBytes = NavigationResources.Navigation;
var encoding = GetEncoding(navBytes);
string json = encoding.GetString(navBytes);
List<NavigationModul> navigation = JsonConvert.DeserializeObject<List<NavigationModul>>(json);
return navigation;
}
public static Encoding GetEncoding(byte [] textBytes)
{
if (textBytes[0] == 0x2b && textBytes[1] == 0x2f && textBytes[2] == 0x76) return Encoding.UTF7;
if (textBytes[0] == 0xef && textBytes[1] == 0xbb && textBytes[2] == 0xbf) return Encoding.UTF8;
if (textBytes[0] == 0xff && textBytes[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (textBytes[0] == 0xfe && textBytes[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (textBytes[0] == 0 && textBytes[1] == 0 && textBytes[2] == 0xfe && textBytes[3] == 0xff) return Encoding.UTF32;
return Encoding.ASCII;
}
The goal is to load an embedded JSON file (NavigationResources.Navigation) from a resource file. The navigation file is an embedded file. We are just using the ResourceManager to avoid magic strings.
After loading the bytes of the embedded file and checking its encoding, I read the string from the bytes and pass it to the JsonConvert.DeserializeObject function.
Unfortunately, this fails due to invalid JSON. Long story short: the loaded JSON string still contains the encoding identification bytes (the byte-order mark), and I can't figure out how to get rid of them.
I also tried converting the UTF-8 byte array to the default encoding before loading the string, but this only turns the encoding bytes into visible characters.
I talked to my peers and they told me that they have run into the same problem when reading embedded batch files, leading to broken batch files. They didn't know how to fix the problem either, but came up with a workaround for the batch files themselves (adding a blank line at the start of the batch file makes it work).
Any suggestions on how to fix this?

Thanks to Alex K., I have a solution:
Cutting off the identification bytes before calling Encoding.GetString did the trick.
Here is the function I now use for the task:
public static string GetStringFromEncodedBytes(byte[] bytes) {
Encoding encoding = Encoding.Default;
int skipBytes = 0;
if (bytes[0] == 0x2b && bytes[1] == 0x2f && bytes[2] == 0x76) {
encoding = Encoding.UTF7;
skipBytes = 3;
}
if (bytes[0] == 0xef && bytes[1] == 0xbb && bytes[2] == 0xbf)
{
encoding = Encoding.UTF8;
skipBytes = 3;
}
if (bytes[0] == 0xff && bytes[1] == 0xfe)
{
encoding = Encoding.Unicode;
skipBytes = 2;
}
if (bytes[0] == 0xfe && bytes[1] == 0xff)
{
encoding = Encoding.BigEndianUnicode;
skipBytes = 2;
}
if (bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 0xfe && bytes[3] == 0xff)
{
// 00 00 FE FF is the big-endian UTF-32 BOM; Encoding.UTF32 is little-endian
encoding = new UTF32Encoding(true, true);
skipBytes = 4;
}
return encoding.GetString(bytes.Skip(skipBytes).ToArray());
}

Here's a simpler approach, removing the BOM after decoding:
// Your data is always in UTF-8 apparently, so just rely on that.
string text = Encoding.UTF8.GetString(data);
if (text.StartsWith("\ufeff"))
{
text = text.Substring(1);
}
This has the downside of copying the string, of course.
Or if you do want to skip the bytes:
// Again, we're assuming UTF-8
int start = (data.Length >= 3 && data[0] == 0xef &&
data[1] == 0xbb && data[2] == 0xbf)
? 3 : 0;
string text = Encoding.UTF8.GetString(data, start, data.Length - start);
That way you don't need to use Skip and ToArray, and it avoids doing any extraneous copying.
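Another option, sketched here with hypothetical names (BomStripping, DecodeWithoutBom), is to let StreamReader deal with the BOM. When reading from a stream, StreamReader consumes a leading byte-order mark instead of returning it as a character, so the decoded string starts directly with the JSON text. UTF-8 as the fallback encoding is an assumption carried over from the answer above.

```csharp
using System.IO;
using System.Text;

public static class BomStripping
{
    // Decodes the bytes as text; a leading BOM (if any) selects the encoding
    // and is not included in the returned string.
    public static string DecodeWithoutBom(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For a small embedded resource, the extra stream objects cost little, and the same call also copes with UTF-16 resources that carry a BOM.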

How to replace extended ASCII characters in C#?

I am trying to replace non-printable characters, i.e. extended ASCII characters, in a HUGE string.
foreach (string line in File.ReadLines(txtfileName.Text))
{
MessageBox.Show( Regex.Replace(line,
@"\p{Cc}",
a => string.Format("[{0:X2}]", " ")
)); ;
}
This doesn't seem to be working.
Example:
AAÂAA should be converted to AA AA

Assuming the Encoding to be UTF8 try this:
string strReplacedVal = Encoding.ASCII.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.GetEncoding(
Encoding.ASCII.EncodingName,
new EncoderReplacementFallback(" "),
new DecoderExceptionFallback()
),
Encoding.UTF8.GetBytes(line)
)
);

Since you are opening the file as UTF-8, it must be UTF-8. So, its code units are one byte and UTF-8 has the very nice feature of encoding characters above ␡ with bytes exclusively above 0x7f and characters at or below ␡ with bytes exclusively at or below 0x7f.
For efficiency, you can rewrite the file in place a few KB at a time.
Note that some characters might be replaced by more than one space, though, because a character above ␡ occupies two to four bytes.
// Operates on a UTF-8 encoded text file
using (var stream = File.Open(path, FileMode.Open, FileAccess.ReadWrite))
{
const int size = 4096;
var buffer = new byte[size];
int count;
while ((count = stream.Read(buffer, 0, size)) > 0)
{
var changed = false;
for (int i = 0; i < count; i++)
{
// obliterate all bytes that are not encoded characters between ␠ and ␡
if (buffer[i] < ' ' | buffer[i] > '\x7f')
{
buffer[i] = (byte)' ';
changed = true;
}
}
if (changed)
{
stream.Seek(-count, SeekOrigin.Current);
stream.Write(buffer, 0, count);
}
}
}
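If the text is already in memory as a string, a character-class regex is a shorter route. This is a sketch with made-up names (AsciiCleaner, ReplaceNonPrintable); it treats "printable" as the ASCII range from space (0x20) to tilde (0x7E), which is an assumption about what the question wants. The original pattern \p{Cc} matches only control characters, which is why Â (a letter) was never replaced.

```csharp
using System.Text.RegularExpressions;

public static class AsciiCleaner
{
    // Matches any UTF-16 code unit outside printable ASCII (space through tilde).
    private static readonly Regex NonPrintableAscii = new Regex(@"[^\x20-\x7E]");

    // Replaces each non-printable or non-ASCII char with a single space.
    public static string ReplaceNonPrintable(string line)
    {
        return NonPrintableAscii.Replace(line, " ");
    }
}
```

Called on each line from File.ReadLines, this turns AAÂAA into AA AA. Characters outside the Basic Multilingual Plane are stored as two chars in a .NET string, so they become two spaces.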

FileStream and Encoding

I have a program that saves a text file using the stdio interface. It swaps the 4 most significant bits with the 4 least significant bits of every character, except for CR and/or LF.
I'm trying to "decode" this stream using a C# program, but I'm unable to get the original bytes.
StringBuilder sb = new StringBuilder();
StreamReader sr = new StreamReader("XXX.dat", Encoding.ASCII);
string sLine;
while ((sLine = sr.ReadLine()) != null) {
string s = "";
byte[] bytes = Encoding.ASCII.GetBytes(sLine);
for (int i = 0; i < sLine.Length; i++) {
byte c = bytes[i];
byte lb = (byte)((c & 0x0F) << 4), hb = (byte)((c & 0xF0) >> 4);
byte ascii = (byte)((lb) | (hb));
s += Encoding.ASCII.GetString(new byte[] { ascii });
}
sb.AppendLine(s);
}
sr.Close();
return (sb);
I've tried changing the encoding to UTF8, but it didn't work. I've also used a BinaryReader created from the 'sr' StreamReader, but nothing good happened.
StringBuilder sb = new StringBuilder();
StreamReader sr = new StreamReader("XXX.shb", Encoding.ASCII);
BinaryReader br = new BinaryReader(sr.BaseStream);
string sLine;
string s = "";
while (sr.EndOfStream == false) {
byte[] buffer = br.ReadBytes(1);
byte c = buffer[0];
byte lb = (byte)((c & 0x0F) << 4), hb = (byte)((c & 0xF0) >> 4);
byte ascii = (byte)((lb) | (hb));
s += Encoding.ASCII.GetString(new byte[] { ascii });
}
sr.Close();
return (sb);
If the file starts with 0xF2 0xF2 ..., I read everything except the expected value. Where is the error? (i.e.: 0xF6 0xF6).
This C code actually does the job:
...
while (fgets(line, 2048, bfd) != NULL) {
int cLen = strlen(xxx), lLen = strlen(line), i;
// Decode line
for (i = 0; i < lLen-1; i++) {
unsigned char c = (unsigned char)line[i];
line[i] = ((c & 0xF0) >> 4) | ((c & 0x0F) << 4);
}
xxx = realloc(xxx , cLen + lLen + 2);
xxx = strcat(xxx , line);
xxx = strcat(xxx , "\n");
}
fclose(bfd);
What is wrong with the C# code?

Got it.
The problem is the BinaryReader construction:
StreamReader sr = new StreamReader("XXX.shb", Encoding.ASCII);
BinaryReader br = new BinaryReader(sr.BaseStream);
I think this constructs a BinaryReader based on the StreamReader, which "translates" characters coming from the file.
This code actually works well:
FileInfo fi = new FileInfo("XXX.shb");
BinaryReader br = new BinaryReader(fi.OpenRead());
I wonder whether it is possible to read this kind of data line by line with a text stream reader, since line endings are preserved during the "encoding" phase.

I guess you should use a BinaryReader and ReadBytes(), then only use Encoding.ASCII.GetString() on the byte sequence after you have swapped the bits.
In your example, you read the file as ASCII (meaning the bytes are converted to .NET's internal two-byte UTF-16 chars on read, under the assumption that they are ASCII), then convert it BACK to bytes again, as ASCII bytes. Any byte above 0x7F is lost in that round trip.
That is unnecessary for you.
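A minimal sketch of that suggestion, with hypothetical names (NibbleSwap, Decode): read raw bytes, swap the nibbles of everything except CR and LF, and only then decode as ASCII. It assumes the original text was plain ASCII, so that no encoded byte can collide with 0x0D or 0x0A.

```csharp
using System.IO;
using System.Text;

public static class NibbleSwap
{
    // Reads raw bytes, swaps the high and low 4 bits of each byte
    // (leaving CR and LF alone), then decodes the result as ASCII.
    public static string Decode(Stream input)
    {
        var sb = new StringBuilder();
        var buffer = new byte[4096];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte c = buffer[i];
                if (c != 0x0D && c != 0x0A)
                    buffer[i] = (byte)(((c & 0x0F) << 4) | ((c & 0xF0) >> 4));
            }
            // Decode only after the bytes have been restored.
            sb.Append(Encoding.ASCII.GetString(buffer, 0, read));
        }
        return sb.ToString();
    }
}
```

With a file, this would be called as NibbleSwap.Decode(File.OpenRead("XXX.shb")), with no StreamReader involved at any point.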
