Smart ASCII string representation - c#

I have an app that converts a binary file into an ASCII file. With a profiler I found that I spend 25% of the time in Encoding.GetBytes(), which is called from BinaryWriter.Write(char[]). That makes sense, since I have many constructs similar to this one:
m_writer.Write("some fancy long text".ToCharArray());
Do you have any smart idea how to avoid this encoding conversion?
I know that one idea would be to do something similar to this:
static readonly byte[] SOME_FANCY_LONG_TEXT = Encoding.ASCII.GetBytes("some fancy ...");
// ... and later
m_writer.Write(SOME_FANCY_LONG_TEXT);
but I have too many such entries to do it manually.

If you're creating a text file, why are you using BinaryWriter at all? Just use a TextWriter. BinaryWriter is meant for binary streams where you want to write primitives, strings etc in a simple way.
(Is all your text definitely going to be ASCII, by the way? You might want to consider using UTF-8 instead.)
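A minimal sketch of what the TextWriter suggestion might look like, assuming the output really is plain ASCII text. The names AsciiTextOutput and WriteLines are illustrative and not from the question; the StreamWriter does the encoding through its own internal buffer.
```csharp
using System.IO;
using System.Text;

public static class AsciiTextOutput
{
    // Hypothetical helper: writes each string followed by '\n' as ASCII text.
    public static void WriteLines(Stream output, params string[] lines)
    {
        // leaveOpen: true keeps the caller's stream usable after the writer is disposed.
        using (var writer = new StreamWriter(output, Encoding.ASCII, 4096, leaveOpen: true))
        {
            foreach (string line in lines)
            {
                writer.Write(line);   // strings are written directly, no ToCharArray() needed
                writer.Write('\n');
            }
        }
    }
}
```
Swapping Encoding.ASCII for Encoding.UTF8 is the only change needed if the text turns out not to be pure ASCII.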

Related

Advantage in using SerialPort.ReadByte over ReadChar?

All of the example code I have read online regarding SerialPort uses ReadByte and then converts the result to a character, instead of using ReadChar in the first place.
Is there an advantage in doing this?
The SerialPort.Encoding property is often misunderstood. The default is ASCIIEncoding, which produces ? for byte values 0x80..0xFF. People don't like getting these question marks, so they read bytes instead. If you see such code converting the byte to a char directly, then they are getting it really wrong: Unicode has lots of unprintable code points in that byte range, and the odds that the device actually meant to send these characters are zero. A string tends to be regarded as easier to handle than a byte[], and it is.
When you use ReadChar, the result depends on the encoding you are using, as @Preston Guillot said. According to the documentation of ReadChar:
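The difference can be seen without a serial port. This is a small sketch, with SerialByteDecoding as a made-up class name, that decodes the same bytes once with ASCII and once by casting each byte to a char:
```csharp
using System.Text;

public static class SerialByteDecoding
{
    // Decodes raw bytes with ASCII, the default SerialPort encoding.
    public static string DecodeAscii(byte[] data)
    {
        return Encoding.ASCII.GetString(data);
    }

    // Casts each byte straight to a char, i.e. treats it as code point U+0000..U+00FF.
    public static string CastEachByte(byte[] data)
    {
        var sb = new StringBuilder(data.Length);
        foreach (byte b in data)
        {
            sb.Append((char)b);
        }
        return sb.ToString();
    }
}
```
For the bytes 0x41 0x80, DecodeAscii gives "A?" while CastEachByte gives "A" followed by U+0080, which is a control character rather than anything a device is likely to have meant.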
This method reads one complete character based on the encoding.
Use caution when using ReadByte and ReadChar together. Switching
between reading bytes and reading characters can cause extra data to
be read and/or other unintended behavior. If it is necessary to switch
between reading text and reading binary data from the stream, select a
protocol that carefully defines the boundary between text and binary
data, such as manually reading bytes and decoding the data.

Really simple short string compression

Is there a really simple compression technique for strings up to about 255 characters in length (yes, I'm compressing URLs)?
I am not concerned with the strength of compression - I am looking for something that performs very well and is quick to implement. I would like something simpler than SharpZipLib: something that can be implemented with a couple of short methods.
I think the key question here is "Why do you want to compress URLs?"
Trying to shorten long urls for the address bar?
You're better off storing the original URL somewhere (database, text file ...) alongside a hashcode of the non-domain part (MD5 is fine). You can then have a simple page (or some HTTPModule if you're feeling flashy) to read the MD5 and look up the real URL. This is how TinyURL and others work.
For example:
http://mydomain.com/folder1/folder2/page1.aspx
Could be shortened to:
http://mydomain.com/2d4f1c8a
Using a compression library for this will not work. The string will be compressed into a shorter binary representation, but converting this back to a string which needs to be valid as part of a URL (e.g. Base64) will negate any benefit you gained from the compression.
Storing lots of URLs in memory or on disk?
Use the built-in compression library within System.IO.Compression or the ZLib library, which is simple and incredibly good. Since you will be storing binary data, the compressed output will be fine as-is. You'll need to uncompress it to use it as a URL.
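A minimal sketch of the System.IO.Compression route, assuming the URLs are encoded as UTF-8 before compression. UrlStorageCompression is an illustrative name; the output is a raw DEFLATE stream intended for binary storage, not for use inside a URL.
```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class UrlStorageCompression
{
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            // The DeflateStream must be disposed before the buffer is read,
            // otherwise the final compressed block may not have been flushed.
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return buffer.ToArray();
        }
    }

    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var inflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflate.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```
For strings as short as a single URL path the savings may be small or even negative, so it is worth measuring on real data first.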
As suggested in the accepted answer, using data compression does not work to shorten URL paths that are already fairly short.
DotNetZip has a DeflateStream class that exposes a static (Shared in VB) CompressString method. It's a one-line way to compress a string using DEFLATE (RFC 1951). The DEFLATE implementation is fully compatible with System.IO.Compression.DeflateStream, but DotNetZip compresses better. Here's how you might use it:
string[] orig = {
    "folder1/folder2/page1.aspx",
    "folderBB/folderAA/page2.aspx",
};
public void Run()
{
    foreach (string s in orig)
    {
        System.Console.WriteLine("original : {0}", s);
        byte[] compressed = DeflateStream.CompressString(s);
        // ByteArrayToHexString is a helper that is not shown here.
        System.Console.WriteLine("compressed : {0}", ByteArrayToHexString(compressed));
        string uncompressed = DeflateStream.UncompressString(compressed);
        System.Console.WriteLine("uncompressed: {0}\n", uncompressed);
    }
}
Using that code, here are my test results:
original : folder1/folder2/page1.aspx
compressed : 4bcbcf49492d32d44f03d346fa0589e9a9867a89c5051500
uncompressed: folder1/folder2/page1.aspx
original : folderBB/folderAA/page2.aspx
compressed : 4bcbcf49492d7272d24f03331c1df50b12d3538df4128b0b2a00
uncompressed: folderBB/folderAA/page2.aspx
So you can see the "compressed" byte array, when represented in hex, is longer than the original, about 2x as long. The reason is that a hex byte is actually 2 ASCII chars.
You could compensate somewhat for that by using base-62 instead of base-16 (hex) to represent the number. In that case a-z and A-Z are also digits, giving you 0-9 (10) + a-z (+26) + A-Z (+26) = 62 total digits. That would shorten the output significantly. I haven't tried that yet.
EDIT
OK, I tested the base-62 encoder. It shortens the hex string by about half. I figured it would cut it to 25% (62/16 =~ 4), but I think I am losing something with the discretization. In my tests, the resulting base-62 encoded string is about the same length as the original URL. So, no, using compression and then base-62 encoding is still not a good approach. You really want a hash value.
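The encoder used in that test is not shown, so the following is only a sketch of one common way to do base-62: treat the bytes as a single big unsigned integer and write it out in base 62. The class name and the digit order (0-9, a-z, A-Z) are assumptions. A base-62 digit carries about 5.95 bits against 4 bits for a hex digit, so by that estimate the result should be roughly two thirds the length of the hex string rather than a quarter.
```csharp
using System.Numerics;
using System.Text;

public static class Base62
{
    private const string Alphabet =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Treats the bytes as one big-endian unsigned integer and writes it in base 62.
    // Leading zero bytes are not preserved by this simple scheme.
    public static string Encode(byte[] data)
    {
        // BigInteger expects little-endian input; the extra trailing 0 byte
        // keeps the value from being interpreted as negative.
        byte[] littleEndian = new byte[data.Length + 1];
        for (int i = 0; i < data.Length; i++)
        {
            littleEndian[i] = data[data.Length - 1 - i];
        }
        BigInteger value = new BigInteger(littleEndian);

        if (value.IsZero)
        {
            return "0";
        }

        var digits = new StringBuilder();
        while (value > BigInteger.Zero)
        {
            BigInteger remainder;
            value = BigInteger.DivRem(value, 62, out remainder);
            digits.Insert(0, Alphabet[(int)remainder]);
        }
        return digits.ToString();
    }
}
```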
I'd suggest looking in the System.IO.Compression Namespace. There's an article on CodeProject that may help.
I have just created a compression scheme that targets URLs and achieves around 50% compression (compared to base64 representation of the original URL text).
See http://blog.alivate.com.au/packed-url/
It would be great if someone from a big tech company built this out properly and published it for all to use. Google championed Protocol buffers. This tool can save a lot of disk space for someone like Google, while still being scannable. Or perhaps the great captain himself? https://twitter.com/capnproto
Technically, I would call this a binary (bitwise) serialisation scheme for the data that underlies a URL. Treat the URL as text-representation of conceptual data, then serialize that conceptual data model with a specialised serializer. The outcome is a more compressed version of the original of course. This is very different to how a general-purpose compression algorithm works.
What's your goal?
A shorter URL? Try URL shorteners like http://tinyurl.com/ or http://is.gd/
Storage space? Check out System.IO.Compression. (Or SharpZipLib)
You can use the deflate algorithm directly, without any headers, checksums, or footers, as described in this question: Python: Inflate and Deflate implementations
This cuts down a 4100 character URL to 1270 base64 characters, in my test, allowing it to fit inside IE's 2000 limit.
And here's an example of a 4000-character URL, which can't be solved with a hashtable since the applet can exist on any server.
I would start by trying one of the existing (free or open source) zip libraries, e.g. http://www.icsharpcode.net/OpenSource/SharpZipLib/
Zip should work well for text strings, and I am not sure it is worth implementing a compression algorithm yourself.
Have you tried just using gzip?
No idea if it would work effectively with such short strings, but I'd say it's probably your best bet.
The open source library SharpZipLib is easy to use and will provide you with compression tools.

C#: String -> MD5 -> Hex

In languages like PHP or Python there are convenient functions that turn an input string into an output string containing the hex representation of its hash.
I find it a very common and useful task (password storing and checking, checksum of file content...), but in .NET, as far as I know, you can only work on byte streams.
A function to do the work is easy to put together (e.g. http://blog.stevex.net/index.php/c-code-snippet-creating-an-md5-hash-string/), but I'd like to know if I'm missing something, using the wrong pattern, or there is simply no such thing in .NET.
Thanks
The method you linked to seems right; a slightly different method is shown on the MSDN C# FAQ.
A comment suggests you can use:
System.Web.Security.FormsAuthentication.HashPasswordForStoringInConfigFile(string, "MD5");
Yes, you can only work with bytes (as far as I know). But you can easily turn those bytes into their hex representation by looping through them and doing something like:
myByte.ToString("x2");
And you can get the bytes that make up the string using:
System.Text.Encoding.UTF8.GetBytes(myString);
So it could be done in a couple lines.
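Put together, those couple of lines might look like the sketch below. HexHash and Md5Hex are illustrative names, and UTF-8 is an assumed encoding choice; a different encoding gives a different hash for non-ASCII input.
```csharp
using System.Security.Cryptography;
using System.Text;

public static class HexHash
{
    // Encodes the string as UTF-8, hashes the bytes with MD5,
    // and formats each hash byte as two lowercase hex digits.
    public static string Md5Hex(string input)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(input);
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(bytes);
            var hex = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                hex.Append(b.ToString("x2"));
            }
            return hex.ToString();
        }
    }
}
```
HexHash.Md5Hex("hello") returns "5d41402abc4b2a76b9719d911017c592", the same value PHP's md5("hello") produces.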
One problem is with the very concept of "the HEXed representation of [a string]".
A string is a sequence of characters. How those characters are represented as individual bits depends on the encoding. The "native" encoding to .NET is UTF-16, but usually a more compact representation is achieved (while preserving the ability to encode any string) using UTF-8.
You can use Encoding.GetBytes to get the encoded version of a string once you've chosen an appropriate encoding, but the fact that there is a choice to make is the reason that there aren't many APIs which go straight from string to base64/hex or which perform encryption/hashing directly on strings. Any such APIs which do exist will almost certainly be doing the "encode to a byte array, perform appropriate binary operation, convert the opaque binary data to hex/base64" sequence internally.
(That makes me wonder whether it wouldn't be worth writing a utility class which could take an encoding, a Func<byte[], byte[]> and an output format such as hex/base64 - that could represent an arbitrary binary operation applied to a string.)

What's the best way to read mixed (i.e. text and binary) data?

I need to be able to read a file format that mixes binary and non-binary data. Assuming I know the input is good, what's the best way to do this? As an example, let's take a file that has a double as the first line, a newline (0x0D 0x0A) and then ten bytes of binary data afterward. I could, of course, calculate the position of the newline, then make a BinaryReader and seek to that position, but I keep thinking that there has to be a better way.
You can use System.IO.BinaryReader. The problem with this though is you must know what type of data you are going to be reading before you call any of the Read methods.
Read(byte[], int, int)
Read(char[], int, int)
Read()
Read7BitEncodedInt()
ReadBoolean()
ReadByte()
ReadBytes(int)
ReadChar()
ReadChars(int)
ReadDecimal()
ReadDouble()
ReadInt16()
ReadInt32()
ReadInt64()
ReadSByte()
ReadSingle()
ReadString()
ReadUInt16()
ReadUInt32()
ReadUInt64()
And of course the same methods exist for writing in System.IO.BinaryWriter.
Is this file format already fixed? If it's not, it's a really good idea to change to use a length-prefixed format for the strings. Then you can read just the right amount and convert it to a string.
Otherwise, you'll need to read chunks from the file, scan for the newline, and decode the right amount of data or (if you don't find the newline) either buffer it somewhere else (e.g. a MemoryStream) or just remember the starting point and rewind the stream appropriately. It will be ugly, but that's just because of the deficiency of the file format.
I would suggest you don't "over-decode" (i.e. decode the arbitrary binary data after the string) - while it may well not do any harm, in some encodings you could be reading an impossible sequence of binary data, which then starts getting into the realms of DecoderFallbacks and the like.
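For the example format in the question (a double as text, CR LF, then ten binary bytes), a sketch that avoids over-decoding could read raw bytes up to the newline, decode only those, and then read the binary part. MixedFormatReader is a made-up name. The sketch reads one byte at a time for simplicity instead of scanning chunks, and assumes the text part is ASCII.
```csharp
using System.Globalization;
using System.IO;
using System.Text;

public static class MixedFormatReader
{
    // Reads bytes up to and including the LF and returns the line decoded as ASCII.
    // Nothing after the newline is consumed, so binary reads can follow safely.
    public static string ReadAsciiLine(Stream stream)
    {
        var line = new MemoryStream();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == 0x0A) break;                    // LF ends the line
            if (b != 0x0D) line.WriteByte((byte)b);  // CR bytes are dropped
        }
        return Encoding.ASCII.GetString(line.ToArray());
    }

    // Reads the text header as a double, then exactly payloadLength raw bytes.
    public static byte[] ReadRecord(Stream stream, int payloadLength, out double header)
    {
        header = double.Parse(ReadAsciiLine(stream), CultureInfo.InvariantCulture);
        using (var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true))
        {
            return reader.ReadBytes(payloadLength);
        }
    }
}
```
On a FileStream the per-byte reads are served from the stream's own buffer; for an unbuffered source, wrapping it in a BufferedStream is one option.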
I've had to deal with that when reading HTTP requests coming in over the wire on Compact Framework. My solution was to roll my own non-buffering ASCII-only StreamReader, so that it was safe to interleave calls to both the StreamReader and the underlying Stream.

What is the fastest way to parse text with custom delimiters and some very, very large field values in C#?

I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I haven't found any solution in .NET that suits my needs, and the custom libraries that people have written for this seem to have some flaws when it comes to gigantic input (a 4GB file with some field values easily reaching several million characters).
While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the File Helpers API. It's .NET and open source. It's extremely high performance using compiled IL code to set fields on strongly typed objects, and supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
If for some reason that doesn't do it for you, try just reading line by line with string.Split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even Linq into ;) Remember however that the IEnumerable is lazy loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or such - given your filesize however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };
    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from the file where the 4th field contains the string "something". It will do this without loading the entire file into memory.
On Windows, high-performance I/O means using I/O completion ports (IOCP). You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy:
18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed
code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of Memory Mapped Files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back to an IEnumerable list of your record / text line (or whatever).
You mention that some fields are very, very big; if you try to read them into memory in their entirety you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parse the current buffer, and keep track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
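The EddReader above is hypothetical and its implementation is not shown. As a rough sketch of the state-machine idea only, the following reads one character at a time from a TextReader and toggles a single "inside quotes" flag. DelimitedRecordReader is a made-up name, the quote and delimiter characters are passed in by the caller, and each field is buffered as a string rather than exposed as a Stream.
```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DelimitedRecordReader
{
    // Lazily yields one record (a list of field values) at a time.
    public static IEnumerable<List<string>> ReadRecords(TextReader input, char quote, char delimiter)
    {
        var record = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;
        int c;
        while ((c = input.Read()) != -1)
        {
            char ch = (char)c;
            if (ch == quote)
            {
                inQuotes = !inQuotes;   // quote characters toggle the state and are not stored
            }
            else if (inQuotes)
            {
                field.Append(ch);       // delimiters and newlines inside quotes are data
            }
            else if (ch == delimiter)
            {
                record.Add(field.ToString());
                field.Clear();
            }
            else if (ch == '\n')
            {
                record.Add(field.ToString());
                field.Clear();
                yield return record;
                record = new List<string>();
            }
            else if (ch != '\r')
            {
                field.Append(ch);       // unquoted data; CR outside quotes is dropped
            }
        }
        // Flush a final record that is not terminated by a newline.
        if (field.Length > 0 || record.Count > 0)
        {
            record.Add(field.ToString());
            yield return record;
        }
    }
}
```
Because only one record is held at a time, memory use is bounded by the largest record rather than by the file size. A field too large to hold as a string would need the Stream-returning design from the usage example instead.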
While this doesn't help address the large input issue, a possible solution to the parsing issue might include a custom parser that uses the strategy pattern to supply a delimiter.
