C# StreamReader: Handling of special characters (" ' —)

I'm reading and then writing a text file. Before and after the data of interest, the file contains many lines that should remain unaltered. But StreamReader seems to convert the special characters ( " ' — ) into other characters that appear as funky diamonds both in C# textboxes and in Notepad. How can text be passed through file read/write operations completely unaltered? Thanks.
using (StreamWriter sw = new StreamWriter(sOutputFileName))
using (StreamReader sr = new StreamReader(sTempFileName))
{
    while (sr.Peek() >= 0)
    {
        rdBuffer = sr.ReadLine();
        txtProgressDisplay.Text += rdBuffer + "\r\n";
        // parse and process some lines here
        wrBuffer = rdBuffer;
        sw.WriteLine(wrBuffer);
        txtProgressDisplay.Text += wrBuffer + "\r\n";
    }
}

I am almost certain the issue is related to character encoding, i.e. UTF-8, ASCII, Windows-1252, etc. Try creating your StreamReader passing in the encoding the file was actually saved in:
StreamReader sr = new StreamReader(sTempFileName, System.Text.Encoding.UTF8);
You can use Encoding.UTF8, Encoding.Default, etc. Note that Encoding.ASCII is 7-bit only and cannot represent characters such as " ' — at all; decoding with it turns them into question marks.

Your problem seems to be encoding-related.
1) Check that your text viewer is using the same encoding as your .NET application (maybe UTF-8?).
2) Check whether the file itself was created using the same encoding as your .NET application (are you mixing characters in different encodings?).
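Putting both suggestions together: read with BOM detection enabled, remember the encoding the reader actually settled on, and write the output with that same encoding so the untouched lines round-trip unaltered. A minimal sketch (file names and sample text are made up for illustration):

```csharp
using System.IO;
using System.Text;

class RoundTrip
{
    public static void Main()
    {
        // Hypothetical file names, just for the sketch.
        string input = "in.txt", output = "out.txt";
        File.WriteAllText(input, "quotes \u201Cand\u201D dash \u2014", Encoding.UTF8);

        Encoding detected;
        string text;
        using (var sr = new StreamReader(input, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            text = sr.ReadToEnd();          // detection happens on the first read
            detected = sr.CurrentEncoding;  // only reliable after that read
        }

        // Write back with the same encoding so no characters are altered.
        using (var sw = new StreamWriter(output, false, detected))
        {
            sw.Write(text);
        }
    }
}
```

Note that CurrentEncoding only reflects the detected encoding after the first read, which is why it is captured after ReadToEnd.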

read encoding identifier with StreamReader

I am reading a C# book and in the chapter about streams it says:
If you explicitly specify an encoding, StreamWriter will, by default,
write a prefix to the start of the stream to identify the encoding.
This is usually undesirable and you can prevent it by constructing the
encoding as follows:
var encoding = new UTF8Encoding (encoderShouldEmitUTF8Identifier:false, throwOnInvalidBytes:true);
I'd like to actually see how the identifier looks so I came up with this code:
using (FileStream fs = File.Create ("test.txt"))
using (TextWriter writer = new StreamWriter (fs,new UTF8Encoding(true,false)))
{
writer.WriteLine ("Line1");
}
using (FileStream fs = File.OpenRead ("test.txt"))
using (TextReader reader = new StreamReader (fs))
{
for (int b; (b = reader.Read()) > -1;)
Console.WriteLine (b + " " + (char)b); // identifier not printed
}
To my dissatisfaction, no identifier was printed. How do I read the identifier? Am I missing something?
By default, .NET will try very hard to insulate you from encoding errors. If you want to see the byte-order-mark, aka "preamble" or "BOM", you need to be very explicit with the objects to disable the automatic behavior. This means that you need to use an encoding that does not include the preamble, and you need to tell StreamReader to not try to detect the encoding.
Here is a variation of your original code that will display the BOM:
using (MemoryStream stream = new MemoryStream())
{
Encoding encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
using (TextWriter writer = new StreamWriter(stream, encoding, bufferSize: 8192, leaveOpen: true))
{
writer.WriteLine("Line1");
}
stream.Position = 0;
encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
using (TextReader reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: false))
{
for (int b; (b = reader.Read()) > -1;)
Console.WriteLine(b + " " + (char)b); // the BOM (65279) is printed first this time
}
}
Here, encoderShouldEmitUTF8Identifier: true is passed to the encoder used to create the stream, so that the BOM is written when the stream is created, but encoderShouldEmitUTF8Identifier: false is passed to the encoder used to read the stream, so that the BOM will be treated as a normal character when the stream is being read back. The detectEncodingFromByteOrderMarks: false parameter is passed to the StreamReader constructor as well, so that it won't consume the BOM itself.
This produces this output, just like you wanted:
65279 ?
76 L
105 i
110 n
101 e
49 1
13
10
It is worth mentioning that using the BOM to identify UTF-8 encoding is generally discouraged. The BOM mainly exists so that the two variations of UTF-16 can be distinguished (UTF-16LE and UTF-16BE, "little endian" and "big endian" respectively). It has been co-opted as a means of identifying UTF-8 as well, but really it's better to just know what the encoding is (which is why things like XML and HTML declare the encoding in an ASCII-compatible prologue at the start of the file, and why MIME's charset property exists). A single character isn't nearly as reliable as other, more explicit means.
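If you want to see the raw preamble bytes themselves rather than the decoded U+FEFF character, you can sidestep TextReader entirely and dump the file as bytes; a short sketch:

```csharp
using System;
using System.IO;
using System.Text;

class BomDump
{
    public static void Main()
    {
        // Write with an encoding that emits the UTF-8 identifier.
        var enc = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        File.WriteAllText("test.txt", "Line1\r\n", enc);

        // Read the raw bytes; the first three are the preamble: EF BB BF.
        foreach (byte b in File.ReadAllBytes("test.txt"))
            Console.Write(b.ToString("X2") + " ");
        Console.WriteLine();
    }
}
```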

.NET StreamReader encoding behaviour

I am trying to understand Unicode encoding behaviour and came across the following. I write a string to a file using Encoding.Unicode:
new StreamWriter(fileName, false, Encoding.Unicode);
Then I read from the same file, but intentionally use ASCII:
new StreamReader(fileName, Encoding.ASCII);
When I read the string using ReadLine, to my surprise it gives back the same Unicode string. I expected the string to contain ? or other characters, with double the length of the original string.
What is happening here?
Code Snippet
string test = "سشصضطظع"; // some random Arabic characters
StreamWriter sw = new StreamWriter(fileName, false, Encoding.Unicode);
sw.Write(test);
sw.Flush();
sw.Close();
StreamReader sr = new StreamReader(fileName, Encoding.ASCII);
string ss = sr.ReadLine();
sr.Close();
// In ss I expect ASCII text with double the length of test
If I call StreamReader sr = new StreamReader(fileName, Encoding.ASCII, false);
then it gives the expected result.
Thanks
The detectEncodingFromByteOrderMarks parameter should be set to false when creating the StreamReader. It defaults to true, so when the reader finds a byte order mark at the start of the file it silently switches to the encoding the BOM indicates (here UTF-16), ignoring the Encoding.ASCII you passed in.
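A small self-contained sketch of the difference (the file name is arbitrary):

```csharp
using System;
using System.IO;
using System.Text;

class BomDetect
{
    public static void Main()
    {
        File.WriteAllText("uni.txt", "abc", Encoding.Unicode); // UTF-16LE with BOM

        // Detection on (the default): the BOM wins and the ASCII argument is ignored.
        using (var sr = new StreamReader("uni.txt", Encoding.ASCII, detectEncodingFromByteOrderMarks: true))
            Console.WriteLine(sr.ReadToEnd()); // "abc"

        // Detection off: every byte is decoded strictly as ASCII, so the
        // two BOM bytes (FF FE) become '?' and the NUL bytes come through as '\0'.
        using (var sr = new StreamReader("uni.txt", Encoding.ASCII, detectEncodingFromByteOrderMarks: false))
            Console.WriteLine(sr.ReadToEnd());
    }
}
```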

How to avoid adding double quotes to the metadata keywords in PDF file using iTextSharp in C#?

Using the iTextSharp Library I'm able to insert metadata in a PDF file using the various Schemas.
The keywords in the keywords metadata are for my purposes delimited by a comma and enclosed in double quotes. Once the script I've written runs, the keywords are enclosed in triple quotes.
Any ideas on how to avoid this or any advice on working with XMP?
Example of required metadata : "keyword1","keyword2","keyword3"
Example of current metadata : """keyword1"",""keyword2"",""keyword3"""
Coding:
string[] parts = meta_line.Split(',');
string _keywords = string.Join(",", parts, 1, 7);
_keywords = _keywords.Replace('~', ',');
Console.WriteLine(metaFile);
foreach (string inputFile in Directory.GetFiles(source, "*.pdf", SearchOption.TopDirectoryOnly))
{
if (Path.GetFileName(metaFile) == Path.GetFileName(inputFile))
{
string outputFile = source + @"\output\" + Path.GetFileName(inputFile);
PdfReader reader = new PdfReader(inputFile);
using (FileStream fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
PdfStamper stamper = new PdfStamper(reader, fs);
Dictionary<String, String> info = reader.Info;
stamper.MoreInfo = info;
PdfWriter writer = stamper.Writer;
byte[] buffer = new byte[65536];
System.IO.MemoryStream ms = new System.IO.MemoryStream(buffer, true);
try
{
iTextSharp.text.xml.xmp.XmpSchema dc = new iTextSharp.text.xml.xmp.DublinCoreSchema();
dc.SetProperty(iTextSharp.text.xml.xmp.DublinCoreSchema.TITLE, new iTextSharp.text.xml.xmp.LangAlt(_title));
iTextSharp.text.xml.xmp.XmpArray subject = new iTextSharp.text.xml.xmp.XmpArray(iTextSharp.text.xml.xmp.XmpArray.ORDERED);
subject.Add(_subject);
dc.SetProperty(iTextSharp.text.xml.xmp.DublinCoreSchema.SUBJECT, subject);
iTextSharp.text.xml.xmp.XmpArray author = new iTextSharp.text.xml.xmp.XmpArray(iTextSharp.text.xml.xmp.XmpArray.ORDERED);
author.Add(_author);
dc.SetProperty(iTextSharp.text.xml.xmp.DublinCoreSchema.CREATOR, author);
PdfSchemaAdvanced pdf = new PdfSchemaAdvanced();
pdf.AddKeywords(_keywords);
iTextSharp.text.xml.xmp.XmpWriter xmp = new iTextSharp.text.xml.xmp.XmpWriter(ms);
xmp.AddRdfDescription(dc);
xmp.AddRdfDescription(pdf);
xmp.Close();
int bufsize = buffer.Length;
int bufcount = 0;
foreach (byte b in buffer)
{
if (b == 0) break;
bufcount++;
}
System.IO.MemoryStream ms2 = new System.IO.MemoryStream(buffer, 0, bufcount);
buffer = ms2.ToArray();
foreach (char buff in buffer)
{
Console.Write(buff);
}
writer.XmpMetadata = buffer;
}
catch (Exception)
{
throw; // rethrow without resetting the stack trace
}
finally
{
ms.Close();
ms.Dispose();
}
stamper.Close();
// writer.Close();
}
reader.Close();
}
}
The below method didn't add any metadata - not sure why (point 3 in the comments):
iTextSharp.text.xml.xmp.XmpArray keywords = new iTextSharp.text.xml.xmp.XmpArray(iTextSharp.text.xml.xmp.XmpArray.ORDERED);
keywords.Add("keyword1");
keywords.Add("keyword2");
keywords.Add("keyword3");
pdf.SetProperty(iTextSharp.text.xml.xmp.PdfSchema.KEYWORDS, keywords);
I don't currently have the newest iTextSharp version; I have iTextSharp 5.1.1.0. It does not contain the PdfSchemaAdvanced class, but it has PdfSchema and its base class XmpSchema. I bet the PdfSchemaAdvanced in your lib also derives from XmpSchema.
The PdfSchema.AddKeywords method only does one thing:
base["pdf:Keywords"] = keywords;
and the XmpSchema indexer setter in turn does:
base[key] = XmpSchema.Escape(value);
so it's very clear that the value is being, well, 'escaped', to ensure that special characters don't interfere with the storage format.
Now, the Escape function, from what I see, performs a simple character-by-character scan and substitution:
" -> &quot;
& -> &amp;
' -> &apos;
< -> &lt;
> -> &gt;
and that's all. It looks like typical HTML-entity processing, at least in my version of the library. So it would not duplicate the quotes, just change their encoding.
Then, AddRdfDescription seems to simply iterate over the stored keys and wrap each value in tags, with no further processing. So it'd emit something like:
Escaped"Contents&OfThis"Key
as:
<pdf:Keywords>Escaped&quot;Contents&amp;OfThis&quot;Key</pdf:Keywords>
Aside from the AddKeywords method, you should also see an AddProperty method. It acts similarly to AddKeywords, except that it receives the key explicitly and does not Escape() its input value.
So, if you are perfectly sure that your _keywords are formatted properly, you might try:
AddProperty("pdf:Keywords", _keywords)
but I discourage you from doing that. At least in my version of itextsharp, the library seems to properly process the 'keywords' and format it safely as RDF.
Heh, you may also try using the PdfSchema class that I just checked instead of the Advanced one. I bet it still is present in the library.
But, in general, I think the problem lies elsewhere.
Double- or triple-check the contents of the _keywords variable, and then also check the binary contents of the generated PDF. Open it with a hex editor or a plain-text editor like Notepad and look for the <pdf:Keywords> tag. Check what it actually contains. It might all be OK, and it might be your PDF metadata reader that adds those quotes.
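For illustration, the substitution described above amounts to something like the following sketch. This is an assumed reconstruction based on the decompiled behavior described, not the actual iTextSharp source:

```csharp
using System;
using System.Text;

// Hypothetical sketch of what XmpSchema.Escape does -- assumed, not iTextSharp's code.
static class XmpEscapeSketch
{
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '"':  sb.Append("&quot;"); break;
                case '&':  sb.Append("&amp;");  break;
                case '\'': sb.Append("&apos;"); break;
                case '<':  sb.Append("&lt;");   break;
                case '>':  sb.Append("&gt;");   break;
                default:   sb.Append(c);        break;
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Quotes are encoded as entities, not duplicated.
        Console.WriteLine(Escape("\"keyword1\",\"keyword2\"")); // &quot;keyword1&quot;,&quot;keyword2&quot;
    }
}
```

This is why escaping alone cannot explain the tripled quotes the question describes; the extra quotes most likely come from elsewhere.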

Cannot write to rtf file after replacing inside string with utf8 characters

I have an RTF file in which I have to make some text replacements with language-specific (UTF-8) characters. After the replacements I try to save to a new RTF file, but either the characters are not written correctly (strange characters) or the file is saved with all the raw RTF code and formatting.
Here is my code:
var fs = new FileStream(@"F:\projects\projects\RtfEditor\Test.rtf", FileMode.Open, FileAccess.Read);
//reads the file in a byte[]
var sb = FileWorker.ReadToEnd(fs);
var enc = Encoding.GetEncoding(1250);
//var enc = Encoding.UTF8;
var sbs = enc.GetString(sb);
var sbsNew = sbs.Replace("#test/#", "ă î â șșțț");
//first writing approach
var fsw = new FileStream(@"F:\projects\projects\RtfEditor\diac.rtf", FileMode.Create, FileAccess.Write);
fsw.Write(enc.GetBytes(sbsNew), 0, enc.GetBytes(sbsNew).Length);
fsw.Flush();
fsw.Close();
In this approach, the resulting file is a proper RTF file, but the characters "șșțț" are shown as "????".
//second writing approach
using (StreamWriter sw = new StreamWriter(fsw, Encoding.UTF8))
{
sw.Write(sbsNew);
sw.Flush();
}
In this approach, the resulting file is an RTF file, but it shows all the raw RTF code and formatting, while the special characters are saved correctly ("șșțț" appears correctly, no more "????").
An RTF file can directly contain only 7-bit ASCII characters. Everything else needs to be encoded as escape sequences. More detailed information can be found in e.g. this Wikipedia article.
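The replacement text therefore has to be escaped before it is spliced into the RTF source. RTF uses \uN escapes for such characters, where N is the signed decimal 16-bit code unit, followed by a fallback character for readers that don't understand Unicode. A minimal sketch of such an encoder (an assumed helper, not part of the question's code):

```csharp
using System;
using System.Text;

static class RtfEscape
{
    // Encode anything outside 7-bit ASCII as an RTF \uN escape, where N is
    // the signed 16-bit code unit and '?' is the fallback for old readers.
    public static string Escape(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (c < 0x80)
                sb.Append(c);
            else
                sb.Append(@"\u").Append((short)c).Append('?');
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Escape("ă î â")); // \u259? \u238? \u226?
    }
}
```

Escaping the replacement string this way, and then writing the file with the same single-byte encoding it was read with, keeps the rest of the RTF intact.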

C# letters like å not correctly shown in output file

I work in C# and this is my code:
Encoding encoding;
StringBuilder output = new StringBuilder();
//somePath is string
using (StreamReader sr = new StreamReader(somePath))
{
string line;
encoding = sr.CurrentEncoding;
while ((line = sr.ReadLine()) != null)
{
//make some changes to line
output.AppendLine(line);
}
}
using (StreamWriter writer = new StreamWriter(someOtherPath, false))//encoding
{
writer.Write(output);
}
In the file at somePath I have Norwegian characters like å. But in the file at someOtherPath I get question marks instead. I think it's an encoding problem, so I tried reading the input file's encoding and applying it to the output file; it made no difference. I also tried opening the file in Google Chrome and switching through every possible encoding, but the letters never matched the input file.
StreamReader can only make guesses with regard to certain encodings. Ideally, you should find out what the encoding of the file really is, then use that to read the file. What created the file, and what allows you to read it correctly? Does the latter program expose which encoding it's using? (For example, it may be using something like Windows-1252.)
I would personally recommend using UTF-8 as your output encoding if you can, but it depends on whether you're in control over whatever's then reading the output.
EDIT: Okay, now I've seen the file, I can confirm it's not UTF-8. The word "direktør" is represented as these bytes:
64 69 72 65 6b 74 f8 72
So the non-ASCII character is a single byte (F8) which is not a valid UTF-8 representation of a character.
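You can verify this yourself with a strict UTF-8 decoder; a quick sketch, using the byte values from the dump above:

```csharp
using System;
using System.Text;

class Utf8Check
{
    public static void Main()
    {
        // "direktør" as single-byte ISO-8859-1 bytes, as seen in the file.
        byte[] bytes = { 0x64, 0x69, 0x72, 0x65, 0x6b, 0x74, 0xf8, 0x72 };

        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            Console.WriteLine("valid UTF-8");
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("not valid UTF-8"); // 0xF8 cannot start a UTF-8 sequence
        }

        // Decoding as ISO-8859-1 (code page 28591) recovers the text.
        Console.WriteLine(Encoding.GetEncoding(28591).GetString(bytes)); // direktør
    }
}
```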
It could be ISO-Latin-1 - it's not clear (there are multiple encodings which would match). If it is, you can use:
Encoding encoding = Encoding.GetEncoding(28591);
using (TextReader reader = new StreamReader(filename, encoding))
{
...
}
(Alternatively, use File.ReadAllLines to make life simpler.)
You'll need to separately work out what output encoding you want.
EDIT: Here's a short but complete program which I've run against the file you provided, and which has correctly converted the character to UTF-8:
using System;
using System.IO;
using System.Text;
class Test
{
static void Main()
{
Encoding encoding = Encoding.GetEncoding(28591);
StringBuilder output = new StringBuilder();
using (TextReader reader = new StreamReader("file.html", encoding))
{
string line;
while ((line = reader.ReadLine()) != null)
{
output.AppendLine("Read line: " + line);
}
}
using (StreamWriter writer = new StreamWriter("output.html", false))
{
writer.Write(output);
}
}
}
Try this case to save your text:
using (StreamWriter writer = new StreamWriter(someOtherPath, false, Encoding.UTF8)) { ... }
