Reading alphanumeric data from text file - c#

I am using the code below to read binary data from a text file and divide it into small chunks. I want to do the same with a text file containing alphanumeric data, which obviously does not work with the BinaryReader. Which reader would be best for that (stream, string or text), and how would I implement it in the following code?
public static IEnumerable<IEnumerable<byte>> ReadByChunk(int chunkSize)
{
IEnumerable<byte> result;
int startingByte = 0;
do
{
result = ReadBytes(startingByte, chunkSize);
startingByte += chunkSize;
yield return result;
} while (result.Any());
}
public static IEnumerable<byte> ReadBytes(int startingByte, int byteToRead)
{
byte[] result;
using (FileStream stream = File.Open(@"C:\Users\file.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
using (BinaryReader reader = new BinaryReader(stream))
{
int bytesToRead = Math.Max(Math.Min(byteToRead, (int)reader.BaseStream.Length - startingByte), 0);
reader.BaseStream.Seek(startingByte, SeekOrigin.Begin);
result = reader.ReadBytes(bytesToRead);
}
return result;
}

I can only help you get the general process figured out:
String/Text is the 2nd worst data format to read, write or process. It should be reserved exclusively for output towards and input from the user. It has some serious issues as a storage and retrieval format.
If you have to transmit, store or retrieve something as text, make sure you use a fixed Encoding and Culture Format (usually invariant) at all endpoints. You do not want to run into issues with those two.
The worst data format is raw binary. But there is a special 0th place for raw binary that you have to interpret into text in order to process it further. To quote the most important parts of what I linked on encodings:
It does not make sense to have a string without knowing what encoding it uses. [...]
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.
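As a rough sketch of the general process, assuming the goal is chunks of characters rather than bytes: a StreamReader opened with an explicit encoding can fill a char buffer repeatedly. The names TextChunkReader, ReadByChunk and ReadFileByChunk are made up for this illustration, and the encoding you pass in has to be the one the file was actually written in.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TextChunkReader
{
    // Lazily yields the text in pieces of at most chunkSize characters.
    public static IEnumerable<string> ReadByChunk(TextReader reader, int chunkSize)
    {
        var buffer = new char[chunkSize];
        int read;
        // ReadBlock keeps reading until the buffer is full or the input ends.
        while ((read = reader.ReadBlock(buffer, 0, chunkSize)) > 0)
        {
            yield return new string(buffer, 0, read);
        }
    }

    // Opens the file once and streams it, instead of reopening it per chunk.
    // The encoding is an assumption the caller must supply (for example Encoding.UTF8).
    public static IEnumerable<string> ReadFileByChunk(string path, int chunkSize, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            foreach (var chunk in ReadByChunk(reader, chunkSize))
            {
                yield return chunk;
            }
        }
    }
}
```

Because the chunks are counted in characters after decoding, a multi-byte character is never split in the middle, which can happen when cutting a text file at fixed byte offsets.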

Related

How to import and read large binary file data in c#?

I have a large binary file that contains different data types. I can access single records in the file, but I am not sure how to loop over the binary values and load them into a memory stream byte by byte.
I have been using a BinaryReader:
BinaryReader binReader = new BinaryReader(File.Open(fileName, FileMode.Open));
Encoding ascii = Encoding.ASCII;
string authorName = binReader.ReadString();
Console.WriteLine(authorName);
Console.ReadLine();
But this won't work, since I have a large file with different data types.
Simply put, I need to read the file byte by byte and then interpret the data, whether it is a string or something else.
I would appreciate any thoughts that can help.
This will very much depend on what format the file is in. Each byte in the file might represent different things, or it might just represent values from a large array, or some mix of the two.
You need to know what the format looks like to be able to read it, since binary files are not self-descriptive. Reading a simple object might look like
var authorName = binReader.ReadString();
var publishDate = DateTime.FromBinary(binReader.ReadInt64());
...
If you have a list of items it is common to use a length prefix. Something like
var numItems = binReader.ReadInt32();
for(int i = 0; i < numItems; i++){
var title = binReader.ReadString();
...
}
You would then typically create one or more objects from the data that can be used in the rest of the application. I.e.
new Bibliography(authorName, publishDate , books);
If this is a format you do not control, I hope you have a detailed specification. Otherwise this is kind of a lost cause for anything but the kludgiest solutions.
If there is more data than can fit in memory you need some kind of streaming mechanism. I.e. read one item, do some processing of the item, save the result, read the next item, etc.
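One possible shape for such a streaming mechanism is an iterator that yields one item at a time. This is only a sketch under an assumed layout (an Int32 item count followed by that many BinaryWriter-style length-prefixed strings); RecordStreaming and ReadTitles are hypothetical names, and a real file format will differ.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class RecordStreaming
{
    // Assumed layout: [Int32 count][string 1][string 2]...
    // where each string was written with BinaryWriter.Write(string).
    public static IEnumerable<string> ReadTitles(Stream input)
    {
        // leaveOpen: true so the caller stays responsible for the stream.
        using (var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
        {
            int numItems = reader.ReadInt32();
            for (int i = 0; i < numItems; i++)
            {
                // Only the current item is materialized; the caller processes it
                // before the next one is read from the stream.
                yield return reader.ReadString();
            }
        }
    }
}
```

A caller can then process and save each item inside a foreach loop without ever holding the whole file in memory.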
If you do control the format I would suggest alternatives that are easier to manage. I have used protobuf.Net, and I find it quite easy to use, but there are other alternatives. The common way to use these kinds of libraries is to create a class for the data, and add attributes for the fields that should be stored. The library can manage serialization/deserialization automatically, and usually handle things like inheritance and changes to the format in an easy way.
Here's a simple bit of code that shows the most basic way of doing it.
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
namespace binary_read
{
class Program
{
private static readonly int bufferSize = 1024;
static async Task Main(string[] args)
{
var bytesRead = 0;
var totalBytes = 0;
using (var stream = File.OpenRead(args.First()))
{
do
{
var buffer = new byte[bufferSize];
bytesRead = await stream.ReadAsync(buffer, 0, bufferSize);
totalBytes += bytesRead;
// Process buffer
} while (bytesRead > 0);
Console.WriteLine($"Processed {totalBytes} bytes.");
}
}
}
}
The main bit to take note of is within the using block.
Firstly, when working with files/streams/sockets it's best to use using if possible to deterministically clean up after yourself.
Then it's really just a matter of calling Read/ReadAsync on the stream if you're just after the raw data. However there are various 'readers' that provide an abstraction to make working with certain formats easier.
So if you know that you're going to be reading ints and doubles and strings, then you can use the BinaryReader and its ReadIntxx/ReadDouble/ReadString methods.
If you're reading into a struct, then you can read the properties in a loop as suggested by @JonasH above. Or use the method in this answer.

How to Interpret a Binary File

I have a binary file that I'd like to open, read and understand; but I've never tried to work with binary information before.
Various questions (including Using structs in C# to read data and How to read a binary file using c#?) helped me to open and read the file, but I have no idea how to interpret the information I've so far extracted.
One approach I got some hopeful data out of was this:
using (BinaryReader reader = new BinaryReader(File.Open(filename, FileMode.Open, FileAccess.Read)))
{
for (int i = 0; i < 100; i++)
{
iValue = reader.ReadInt32();
sb.AppendFormat("{1}={2}{0}", Environment.NewLine, i, iValue);
}
}
Returns something like this:
0=374014592
1=671183229
2=558694987
3=-1018526206
4=1414798970
5=650
6=4718677
7=44
8=0
9=7077888
10=7864460
But this isn't what I was expecting, nor do I even know what it means. Have I successfully determined that the file contains a bunch of numbers, or am I looking at just one interpretation of the data (similar to how using the wrong/different encodings will return different characters for the same input)?
Do I have any hope or should I stop entirely?
You have to already know how the binary file is structured in order to be able to read and interpret the file properly.
For example, if you write to a binary file an int, a double, a boolean and a string, like this:
int i = 25;
double d = 3.14157;
bool b = true;
string s = "I am happy";
using (var bw = new BinaryWriter(new FileStream("mydata", FileMode.Create)))
{
bw.Write(i);
bw.Write(d);
bw.Write(b);
bw.Write(s);
}
then you must later read back the data values using the same types, in exactly the same order:
using (var br = new BinaryReader(new FileStream("mydata", FileMode.Open)))
{
i = br.ReadInt32();
Console.WriteLine("Integer data: {0}", i);
d = br.ReadDouble();
Console.WriteLine("Double data: {0}", d);
b = br.ReadBoolean();
Console.WriteLine("Boolean data: {0}", b);
s = br.ReadString();
Console.WriteLine("String data: {0}", s);
}
http://www.tutorialspoint.com/csharp/csharp_binary_files.htm
Here is what you would need to know to be able to successfully read a .WAV file (a binary file format that holds sound information). WAV files are one of the simpler binary formats:
http://soundfile.sapp.org/doc/WaveFormat/
By definition a binary file is just a series of bits. Whether you interpret those bits as numbers, characters or something else depends entirely upon what was written into the file in the first place.
In general there's no way to tell what was written into the file by looking at the file contents. Of course if you interpret the bits as characters and get readable text then there's a good chance that text is what was written into the file. But a file containing only text typically wouldn't be described as a binary file.
By calling ReadInt32 you are assuming that the contents of your file are a series of four-byte integers. But what if eight-byte integers or floats or an enumeration or something else was written to your file? What if your file doesn't contain a multiple of four bytes?
You might consider changing your loop to use ReadByte rather than ReadInt32 so it might look something like this...
bValue = reader.ReadByte();
sb.AppendFormat("{1}=0x{2:X}{0}", Environment.NewLine, i, bValue);
so you treat the file as a sequence of bytes and write the data out in hex rather than as a decimal number.
Another approach might be to find a good hex editor and use that to inspect the file contents rather than writing your own code (at least to start with).
There is a simple hex editor built into Visual Studio (assuming that's what you are using). Go to File | Open | Open File. Then in the Open File dialog select your binary file and then click on the drop down to the right of the Open Button and select Open With and then select Binary Editor.
What you'll see is the contents of the file shown as hex and characters. Not great but quick.
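If you would rather stay in code than open an editor, the ReadByte idea above can be extended into a small hex dump. The following is a minimal sketch, not a polished tool; HexDump and Dump are names invented here, and the 16-bytes-per-line layout simply mimics what most hex editors show.

```csharp
using System;
using System.IO;
using System.Text;

public static class HexDump
{
    // Formats up to maxBytes bytes as lines of up to 16 bytes:
    // offset, hex values, then printable ASCII (other bytes shown as '.').
    public static string Dump(Stream input, int maxBytes)
    {
        var sb = new StringBuilder();
        var line = new byte[16];
        int offset = 0;
        while (offset < maxBytes)
        {
            int want = Math.Min(line.Length, maxBytes - offset);
            int read = input.Read(line, 0, want);
            if (read <= 0) break; // end of stream

            sb.AppendFormat("{0:X8}  ", offset);
            for (int i = 0; i < line.Length; i++)
            {
                // Pad short lines so the ASCII column stays aligned.
                sb.Append(i < read ? line[i].ToString("X2") + " " : "   ");
            }
            sb.Append(' ');
            for (int i = 0; i < read; i++)
            {
                sb.Append(line[i] >= 32 && line[i] < 127 ? (char)line[i] : '.');
            }
            sb.AppendLine();
            offset += read;
        }
        return sb.ToString();
    }
}
```

Calling something like Console.Write(HexDump.Dump(File.OpenRead(filename), 256)) on the first few hundred bytes is often enough to spot a recognizable header or embedded text.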

Data from byte array

I'm trying to read the bytes in the stream at each frame.
I want to be able to read the position and the timestamp information that is stored on a file I have created.
The stream is a stream of recorded skeleton data and it is in encoded binary format
Stream recordStream;
byte[] results;
using (FileStream SourceStream = File.Open(@".....\Stream01.recorded", FileMode.Open))
{
if (SourceStream.CanRead)
{
results = new byte[recordStream.Length];
SourceStream.Read(results, 0, (int)recordStream.Length);
}
}
The file should be read, and the Read method should read the current sequence of bytes before it advances the position in the stream.
Is there a way to pull out the data (position and timestamp) I want from the bytes read, and save it in separate variables before it advances?
Could using the binary reader give me the capabilities to do this?
BinaryReader br1 = new BinaryReader(recordStream);
I have saved the file as .recorded. I have also saved it as .txt to see what is contained in the file, but since it is encoded, it is not understandable.
Update:
I tried running the code with breakpoints to see if it enters the function with my binaryreader and it crashes with an error: ArgumentException was unhandled. Stream was not readable, on the BinaryReader initialization and declaration
BinaryReader br1 = new BinaryReader(recordStream);
The file type was .recorded.
You did not provide any information about the format of the data you are trying to read.
However, using the BinaryReader is exactly what you need to do.
It exposes methods to read data from the stream and convert them to various types.
Consider the following example:
var filename = "pathtoyourfile";
using (var stream = File.Open(filename, FileMode.Open))
using(var reader = new BinaryReader(stream))
{
var x = reader.ReadByte();
var y = reader.ReadInt16();
var z = reader.ReadBytes(10);
}
It really depends on the format of your data though.
Update
Even though I feel I've already provided all the information you need,
let's use your data.
You say each record in your data starts with
[long: timestamp][int: framenumber]
using (var stream = File.Open(filename, FileMode.Open))
using(var reader = new BinaryReader(stream))
{
var timestamp = reader.ReadInt64();
var frameNumber = reader.ReadInt32();
//At this point you have the timestamp and the frame number
//you can now do whatever you want with it and decide whether or not
//to continue, after that you just continue reading
}
How you continue reading depends on the format of the remaining part of the records
If all fields in a record have a specific length, then you either (depending on the
choice you made knowing the values of the timestamp and the frame number) continue
reading all the fields for that record OR you simply advance to a position in the stream
that contains the next record. For example if each record is 100 bytes long, if you want to skip this record after you got the first two fields:
stream.Seek(88, SeekOrigin.Current);
//88 here because the first two fields take 12 bytes -> (100 - (8 + 4))
If the records have a variable length the solution is similar, but you'll have to
take into account the length of the various fields (which should be defined by
length fields preceding the variable length fields)
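To make the variable-length case concrete, here is a sketch under an invented layout: [long: timestamp][int: framenumber][int: payloadLength][payload bytes]. The payloadLength field, and the names FrameIndex and ReadTimestamps, are assumptions for illustration only; the real recorded format may look nothing like this. It also assumes a seekable stream such as a FileStream.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class FrameIndex
{
    // Collects the timestamp of every record and skips each payload
    // by using the length field that precedes it.
    public static List<long> ReadTimestamps(Stream input)
    {
        var timestamps = new List<long>();
        using (var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
        {
            while (input.Position < input.Length)
            {
                long timestamp = reader.ReadInt64();
                reader.ReadInt32();                     // frame number, not needed here
                int payloadLength = reader.ReadInt32(); // hypothetical length prefix
                timestamps.Add(timestamp);

                // Jump over the variable-length part to the start of the next record.
                input.Seek(payloadLength, SeekOrigin.Current);
            }
        }
        return timestamps;
    }
}
```

The same loop works for fixed-length records if payloadLength is replaced by a constant, as in the 88-byte example above.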
As for knowing if the first 8 bytes really do represent a timestamp,
there's no real way of knowing for sure... remember in the end the stream just contains
a series of individual bytes that have no meaning whatsoever except for the meaning
given to them by your file format. Either you have to revise the file format or you could
try checking if the value of 'timestamp' in the example above even makes sense.
Is this a file format you have defined yourself? If so, perhaps you are making it too complicated and might want to look at solutions such as Google Protocol Buffers or Apache Thrift.
If this is still not what you are looking for, you will have to redefine your question.
Based on your comments:
You need to know the exact definition of the entire file. You create a struct based on this file format:
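[StructLayout(LayoutKind.Explicit)] // required for FieldOffset; lives in System.Runtime.InteropServices
struct YourFileFormat {
[FieldOffset(0)]
public long Timestamp;
[FieldOffset(8)]
public int FrameNumber;
// [FieldOffset(12)]
//.. etc..
}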
Then, using a BinaryReader, you can either read each field individually for each frame:
// assume br is an instantiated BinaryReader..
YourFileFormat file = new YourFileFormat();
file.Timestamp = br.ReadInt64();
file.FrameNumber = br.ReadInt32();
// etc..
Or, you can read the entire file in and have the Marshalling classes copy everything into the struct for you..
byte[] fileContent = br.ReadBytes(Marshal.SizeOf(typeof(YourFileFormat))); // sizeof(YourFileFormat) would need an unsafe context
GCHandle gcHandle = GCHandle.Alloc(fileContent, GCHandleType.Pinned); // or pinning it via the "fixed" keyword in an unsafe context
file = (YourFileFormat)Marshal.PtrToStructure(gcHandle.AddrOfPinnedObject(), typeof(YourFileFormat));
gcHandle.Free();
However, this assumes you'll know the exact size of the file. With this method though.. each frame (assuming you know how many there are) can be a fixed size array within this struct for that to work.
Bottom line: Unless you know the size of what you want to skip.. you can't hope to get the data from the file you require.

Text file encoding issue

I found some questions on encoding issues before asking, however they are not what I want. Currently I have two methods, I'd better not modify them.
//FileManager.cs
public byte[] LoadFile(string id);
public FileStream LoadFileStream(string id);
They are working correctly for all kind of files. Now I have an ID of a text file(it's guaranteed to be a .txt file) and I want to get its content. I tried the following:
byte[] data = manager.LoadFile(id);
string content = Encoding.UTF8.GetString(data);
But obviously it's not working for other non-UTF8 encodings. To resolve the encoding issue I tried to get its FileStream first and then use a StreamReader.
public StreamReader(Stream stream, bool detectEncodingFromByteOrderMarks);
I hoped this overload would resolve the encoding, but I still get strange contents.
using(var stream = manager.LoadFileStream(id))
using(var reader = new StreamReader(stream, true))
{
content = reader.ReadToEnd(); //still incorrect
}
Maybe I misunderstood the usage of detectEncodingFromByteOrderMarks? And how to resolve the encoding issue?
ByteOrderMarks are sometimes added to files encoded in one of the unicode formats, to indicate whether characters made up from multiple bytes are stored in big or little endian format (is byte 1 stored first, and then byte 0? Or byte 0 first, and then byte 1?). This is particularly relevant when files are read both by for instance windows and unix machines, because they write these multibyte characters in opposite directions.
If you read a file and the first few bytes equal that of a ByteOrderMark, chances are quite high the file is encoded in the unicode format that matches that ByteOrderMark. You never know for sure, though, as Shadow Wizard mentioned. Since it's always a guess, the option is provided as a parameter.
If there is no ByteOrderMark in the first bytes of the file, it'll be hard to guess the file's encoding.
More info: http://en.wikipedia.org/wiki/Byte_order_mark
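As a small sketch of how this plays out with StreamReader: the reader can be given a fallback encoding that is used when no ByteOrderMark is found, and CurrentEncoding reports what it actually decided on. BomAwareReader and ReadAll are names made up for this example, and the right fallback is an assumption you have to make about your own files.

```csharp
using System.IO;
using System.Text;

public static class BomAwareReader
{
    // Decodes the whole stream, honouring a BOM if one is present and
    // otherwise using the caller-supplied fallback encoding.
    public static string ReadAll(Stream input, Encoding fallback, out Encoding used)
    {
        using (var reader = new StreamReader(input, fallback, detectEncodingFromByteOrderMarks: true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read,
            // because detection happens when the reader first pulls bytes.
            used = reader.CurrentEncoding;
            return text;
        }
    }
}
```

If the file has no BOM and is not in the fallback encoding either, the result will still be wrong; in that case the encoding has to come from somewhere else, such as metadata stored next to the file.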

Importing HTML using creates odd characters from non-standard characters?

We're attempting to read in an HTML file that contains certain MS Word characters (such as that long hyphen). The problem is these characters, for example, are showing up as garbage in SQL 2008. The data column is varbinary, and am viewing this data by casting to varchar. Here is the code, verbatim:
EDIT: Corrected definition of bad characters
var file = new FileInfo(/*file info*/);
using (var fs = file.OpenRead())
{
var buffer = new byte[16 * 1024];
using (var ms = new MemoryStream())
{
int read;
while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
{
ms.Write(buffer, 0, read);
}
item.Data = ms.ToArray();
}
}
The "item" object is outside the scope of the code.
If it makes any difference, we are using EF 4. The data type for this data column in question is binary. Please let me know what code or details I can provide. Thanks.
Casting arbitrary bytes into some arbitrary code page shows up as funky characters. Nothing new here; this was always the case and always will be. You need to properly manage your text encoding end-to-end, from the file being read to the final data being shown. Start by reading this: International Features in Microsoft SQL Server 2005. This old KB is also helpful (in some way at least): Description of storing UTF-8 data in SQL Server. Once you figure out what encoding your HTML files are in and what encoding you wish to display the data in, we can discuss available options.
Oh, and I forgot the obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
As a temporary solution: if I am not wrong, the characters are shown as squares, no? You can always substitute the annoying characters when you display them.
Look up the character's numeric code (to find it, you only have to pass the character to Convert.ToInt32) and replace it with the character you prefer.
