Ruby ASCII-8BIT encoding equivalent in c# - c#

I am trying to convert the following code from ruby to C#:
open(file_path, ’rb’) do |doc|
response = RestClient::Request.execute :method => :post,
:url => "http://something.com",
:payload => {
:document => doc
}
my understanding is this will read a file in binary in ASCII-8BIT as per the following documentation:
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
So trying to replicate this example code in C# I have written:
string myString;
Encoding enc = Encoding.GetEncoding(????);
using (FileStream fs = new FileStream("D:\\sample1page.pdf", FileMode.Open))
using (BinaryReader br = new BinaryReader(fs, enc))
{
byte[] bin = br.ReadBytes(System.Convert.ToInt32(fs.Length));
myString = enc.GetString(bin);
}
So the question is what encoding number do I need to do markerd by ???? is to do the equivalent ASCII-8BIT that works in ruby

Related

How to remove BOM from an encoded base64 UTF string?

I have a file encoded in base64 using openssl base64 -in en -out en1 in a command line in MacOS and I am reading this file using the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The string I am getting has a ? before the actual file content. I am not sure what's causing this, any help will be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json
Normally, when you read UTF (with BOM) from a text file, the decoding is handled for you behind the scene. For example, both of the following lines will read UTF text correctly regardless of whether or not the text file has a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
The problem is that you're dealing with UTF text that has been encoded to a Base64 string. So, ReadAllText() can no longer handle the BOM for you. You can either do it yourself by (checking and) removing the first 3 bytes from the byte array or delegate that job to a StreamReader, which is exactly what ReadAllText() does:
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
finalString = reader.ReadToEnd();
}
// Proceed to using finalString.

How to convert txt to binary files using DOCSIS with c#

I am converting a text file into binary file and pushing them on ftp server where DOCSIS is used to understand that binary file but unfortunately I am unable to do and getting the following error.
My code is following:
string binaryfileName = #"C:\TELNETAPP_Bilal\" + str+".bin";
if (File.Exists(binaryfileName))
{
File.Delete(binaryfileName);
}
//BinaryWriter bwStream = new BinaryWriter(new FileStream(binaryfileName, FileMode.Create));
Encoding ascii = Encoding.ASCII;
//BinaryWriter bwEncoder = new BinaryWriter(new FileStream(binaryfileName, FileMode.Create), ascii);
using (BinaryWriter binWriter =
new BinaryWriter(File.Open(binaryfileName, FileMode.Create), ascii))
{
for (int i = 0; i < blines.Count; i++)
{
binWriter.Write(blines[i] + Environment.NewLine);
}
}
This answer is a bit more complicated. You are trying to convert a human-readable text to the binary that is not understood by the cable modem. This is equivalent to writing C code to machine code.
I wrote a Java version of what you want and goes into detail on the process. You can piece together the Java code to C#
OSCAR

C# - Read bytes from file from a specific string

I'm trying to parse a crg-file in C#. The file is mixed with plain text and binary data. The first section of the file contains plain text while the rest of the file is binary (lots of floats), here's an example:
$
$ROAD_CRG
reference_line_start_u = 100
reference_line_end_u = 120
$
$KD_DEFINITION
#:KRBI
U:reference line u,m,730.000,0.010
D:reference line phi,rad
D:long section 1,m
D:long section 2,m
D:long section 3,m
...
$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
�#z����RA����\�l
...
I know I can read bytes starting at a specific offset but how do I find out which byte to start from? The last row before the binary section will always contain at least four dollar signs "$$$$". Here's what I've got so far:
using var fs = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read);
var startByte = ??; // How to find out where to start?
using (BinaryReader reader = new BinaryReader(fs))
{
reader.BaseStream.Seek(startByte, SeekOrigin.Begin);
var f = reader.ReadSingle();
Debug.WriteLine(f);
}
When you have a mixture of text data and binary data, you need to treat everything as binary. This means you should be using raw Stream access, or something similar, and using binary APIs to look through the text data (often looking for cr/lf/crlf at bytes as sentinels, although it sounds like in your case you could just look for the $$$$ using binary APIs, then decode the entire block before, and scan forwards). When you think you have an entire line, then you can use Encoding to parse each line - the most convenient API being encoding.GetString(). When you've finished looking through the text data as binary, then you can continue parsing the binary data, again using the binary API. I would usually recommend against BinaryReader here too, because frankly it doesn't gain you much over more direct API. The other problem you might want to think about is CPU endianness, but assuming that isn't a problem: BitConverter.ToSingle() may be your friend.
If the data is modest in size, you may find it easiest to use byte[] for the data; either via File.ReadAllBytes, or by renting an oversized byte[] from the array-pool, and loading it from a FileStream. The Stream API is awkward for this kind of scenario, because once you've looked at data: it has gone - so you need to maintain your own back-buffers. The pipelines API is ideal for this, when dealing with large data, but is an advanced topic.
UPDATE: This code may not work as expected. Please review the valuable information in the comments.
using (var fs = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read))
{
using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, true, 1, true))
{
var line = sr.ReadLine();
while (!string.IsNullOrWhiteSpace(line) && !line.Contains("$$$$"))
{
line = sr.ReadLine();
}
}
using (BinaryReader reader = new BinaryReader(fs))
{
// TODO: Start reading the binary data
}
}
Solution
I know this is far from the most optimized solution but in my case it did the trick and since the plain text section of the file was known to be fairly small this didn't cause any noticable performance issues. Here's the code:
using var fileStream = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read);
using var reader = new BinaryReader(fileStream);
var newLine = '\n';
var markerString = "$$$$";
var currentString = "";
var foundMarker = false;
var foundNewLine = false;
while (!foundNewLine)
{
var c = reader.ReadChar();
if (!foundMarker)
{
currentString += c;
if (currentString.Length > markerString.Length)
currentString = currentString.Substring(1);
if (currentString == markerString)
foundMarker = true;
}
else
{
if (c == newLine)
foundNewLine = true;
}
}
if (foundNewLine)
{
// Read binary
}
Note: If you're dealing with larger or more complex files you should probably take a look at Mark Gravell's answer and the comment sections.

read encoding identifier with StreamReader

I am reading a C# book and in the chapter about streams it says:
If you explicitly specify an encoding, StreamWriter will, by default,
write a prefix to the start of the stream to identify the encoding.
This is usually undesirable and you can prevent it by constructing the
encoding as follows:
var encoding = new UTF8Encoding (encoderShouldEmitUTF8Identifier:false, throwOnInvalidBytes:true);
I'd like to actually see how the identifier looks so I came up with this code:
using (FileStream fs = File.Create ("test.txt"))
using (TextWriter writer = new StreamWriter (fs,new UTF8Encoding(true,false)))
{
writer.WriteLine ("Line1");
}
using (FileStream fs = File.OpenRead ("test.txt"))
using (TextReader reader = new StreamReader (fs))
{
for (int b; (b = reader.Read()) > -1;)
Console.WriteLine (b + " " + (char)b); // identifier not printed
}
To my dissatisfaction, no identifier was printed. How do I read the identifier? Am I missing something?
By default, .NET will try very hard to insulate you from encoding errors. If you want to see the byte-order-mark, aka "preamble" or "BOM", you need to be very explicit with the objects to disable the automatic behavior. This means that you need to use an encoding that does not include the preamble, and you need to tell StreamReader to not try to detect the encoding.
Here is a variation of your original code that will display the BOM:
using (MemoryStream stream = new MemoryStream())
{
Encoding encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
using (TextWriter writer = new StreamWriter(stream, encoding, bufferSize: 8192, leaveOpen: true))
{
writer.WriteLine("Line1");
}
stream.Position = 0;
encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
using (TextReader reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: false))
{
for (int b; (b = reader.Read()) > -1;)
Console.WriteLine(b + " " + (char)b); // identifier not printed
}
}
Here, encoderShouldEmitUTF8Identifier: true is passed to the encoder used to create the stream, so that the BOM is written when the stream is created, but encoderShouldEmitUTF8Identifier: false is passed to the encoder used to read the stream, so that the BOM will be treated as a normal character when the stream is being read back. The detectEncodingFromByteOrderMarks: false parameter is passed to the StreamReader constructor as well, so that it won't consume the BOM itself.
This produces this output, just like you wanted:
65279 ?
76 L
105 i
110 n
101 e
49 1
13
10
It is worth mentioning that use of the BOM as a form of identifying UTF8 encoding is generally discouraged. The BOM mainly exists so that the two variations of UTF16 can be distinguished (i.e. UTF16LE and UTF16BE, "little endian" and "big endian", respectively). It's been co-opted as a means of identifying UTF8 as well, but really it's better to just know what the encoding is (which is why things like XML and HTML explicitly state the encoding as ASCII in the first part of the file, and MIME's charset property exists). A single character isn't nearly as reliable as other more explicit means.

compressing a string in C# and uncompressing in python

I am trying to compress a large string on a client program in C# (.net 4) and send it to a server (django, python 2.7) using a PUT request.
Ideally I want to use the standard library at both ends, so I am trying to use gzip.
My C# code is:
public static string Compress(string s) {
var bytes = Encoding.Unicode.GetBytes(s);
using (var msi = new MemoryStream(bytes))
using (var mso = new MemoryStream()) {
using (var gs = new GZipStream(mso, CompressionMode.Compress)) {
msi.CopyTo(gs);
}
return Convert.ToBase64String(mso.ToArray());
}
}
The python code is:
s = base64.standard_b64decode(request)
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
decompressed_data = gz.read()
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}, i.e. every other letter is something weird.
If I take out every other character by doing decompressed_data[::2], then it works, but it's a bit of a hack, and clearly there is something else wrong.
I'm wondering if I need to base64 encode it at all for a PUT request? Is this only necessary for POST?
I think the main problem might be C# uses UTF-16 encoded strings. This may yield a problem similar to yours. As any other encoding problem, we might need a little luck here but I guess you can solve this by doing:
decompressed_data = gz.read().decode('utf-16')
There, decompressed_data should be Unicode and you can treat it as such for further work.
UPDATE: This worked for me:
C Sharp
static void Main(string[] args)
{
FileStream f = new FileStream("test", FileMode.CreateNew);
using (StreamWriter w = new StreamWriter(f))
{
w.Write(Compress("hello"));
}
}
public static string Compress(string s)
{
var bytes = Encoding.Unicode.GetBytes(s);
using (var msi = new MemoryStream(bytes))
using (var mso = new MemoryStream())
{
using (var gs = new GZipStream(mso, CompressionMode.Compress))
{
msi.CopyTo(gs);
}
return Convert.ToBase64String(mso.ToArray());
}
}
Python
import base64
import cStringIO
import gzip
f = open('test','rb')
s = base64.standard_b64decode(f.read())
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
decompressed_data = gz.read()
print decompressed_data.decode('utf-16')
Without decode('utf-16) it printed in the console:
>>>h e l l o
with it it did well:
>>>hello
Good luck, hope this helps!
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}
That's because you're using Encoding.Unicode to convert the string to bytes to start with.
If you can tell Python which encoding to use, you could do that - otherwise you need to use an encoding on the C# side which matches what Python expects.
If you can specify it on both sides, I'd suggest using UTF-8 rather than UTF-16. Even though you're compressing, it wouldn't hurt to make the data half the size (in many cases) to start with :)
I'm also somewhat suspicious of this line:
buff = cStringIO.StringIO(s)
s really isn't text data - it's compressed binary data, and should be treated as such. It may be okay - it's just worth checking whether there's a better way.

Categories