Ok, so I have done a little searching and found that the answer to my question about how to properly frame my incoming XML data over my TCP socket is to either send the XML with a header indicating its size beforehand, or use delimiting escape sequences. Since I am basically writing a client for a very old C server, which also communicates with a number of other clients that I'm not interested in having to tinker with, I chose the latter. Previously I was actually stripping the '\n' out of the stream because I was ignorant of what its purpose was. Now when my asynchronous OnDataReceived method is called I do get a nice chunk of data with the delimiters; here is a sample picked from the stream and stored in a string variable:
<?xml version=\"1.0\"?><message><type>SERVER</type><user>TestDeleteOrKillMe</user> <cmd>PRIVATE_MSG</cmd><host>65.255.81.81</host><msg>4111|3.16C (UNIX)|</msg></message>\n<?xml version=\"1.0\"?><message><type>SERVER</type><user>TestDeleteOrKillMe</user> <cmd>PRIVATE_MSG</cmd><host>65.255.81.81</host><msg>4362|copyright 1993 by James D. Bennett|</msg></message>\n
My question now is: what is the procedure for pulling the string values out of the individual XML "statements" so that I can send them to my ParseMessage(string) method and start processing them? At first I was thinking of possibly using some regex to scan from the beginning until it finds the first \n, then select and strip out all of the preceding text. I'd like to hear some input on what the common (or uncommon) practices for this problem are before I begin some possibly overly convoluted approach. Much thanks.
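One common approach is to skip regex entirely: append each received chunk to a buffer and split on the '\n' delimiter, keeping back any trailing partial message for the next receive. A minimal sketch, assuming ParseMessage(string) is the method mentioned above and HandleIncoming is a hypothetical helper called from OnDataReceived with the newly received text:

private string _buffer = string.Empty;

private void HandleIncoming(string chunk)
{
    _buffer += chunk;

    int delimiterIndex;
    // Hand off every complete, '\n'-terminated XML document.
    while ((delimiterIndex = _buffer.IndexOf('\n')) >= 0)
    {
        string xml = _buffer.Substring(0, delimiterIndex);
        if (xml.Length > 0)
            ParseMessage(xml);

        // Keep whatever follows; it may be the start of the next message.
        _buffer = _buffer.Substring(delimiterIndex + 1);
    }
}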
So I have a problem: I'm reading a string from a memory address that is different at different times. For example:
Axe?ca Ocarina?tar??ing?ing????????????
I only need Axe.
Ball of Green Yarn??ing?ing????????????
I only need Ball of Green Yarn.
I'm reading 80 bytes of text (40 characters) because that's the maximum length the string should ever reach. But how can I know how long the string actually is?
It really depends on what's writing the string.
Generally, strings are NUL-terminated, i.e. a '\0' character immediately follows the string. Old-style (non-_s-variant) C functions like strlen and strcat use that to determine the end of existing strings and mark the end of modified strings.
Most string data types tend to work this way, but not all. In Turbo Pascal, strings were length-prefixed. BSTRs used in COM (including pre-.NET VB) are both length-prefixed and NUL-terminated.
Based on the samples you've shown, there's a good chance that the ? characters you're seeing after the part you want are NUL characters. It looks like the buffer is being reused and re-terminated each time, e.g. a shorter string like "Axe" was written over a longer string like a certain kind of ocarina.
Examine the buffer in the debugger and you'll probably find a '\0' character immediately following what you want.
Probably. Again, it depends on what's writing the string. Until you look for yourself, it could be anything, and even then, it could just be a coincidence that it's NUL-terminated this time. Don't rely on observation alone. Without documentation, it could be different and still just as valid. Whatever you do, do not read past the 40-character buffer you know you have, NUL terminated or not.
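If it does turn out to be NUL-terminated, trimming the 80-byte read down to the actual string is only a couple of lines. A rough sketch, assuming the 80 bytes / 40 chars ratio means UTF-16 text in buffer (swap in whatever encoding actually applies):

// buffer: the 80 bytes (40 UTF-16 characters) already read from memory.
string text = System.Text.Encoding.Unicode.GetString(buffer);

// Cut at the first NUL, if any; otherwise keep the whole 40 characters.
int nul = text.IndexOf('\0');
if (nul >= 0)
    text = text.Substring(0, nul);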
I am trying to read the data stored in an ICMT tag on a WAV file generated by a noise monitoring device.
The RIFF parsing code all seems to work fine, except for the fact that the ICMT tag seems to have data after the declared size. As luck would have it, it's the timestamp, which is the one absolutely critical piece of info for my application.
SYN is hex 16, which gives a size of 22, covering everything up to and including the NUL before the timestamp. The monitor documentation is no help; it says that the tag includes the time, but their example has the same issue.
It is the last tag in the enclosing list, and the size of the list does include it - does that mean it doesn't need a chunk ID? I'm struggling to find decent RIFF docs, but I can't find anything that suggests that's the case; also I can't see how it'd be possible to determine that it was the last chunk and so know to read it with no chunk ID.
Alternatively, the ICMT comment chunk is the last thing in the file - is that a special case? Can I just get the time by reading everything from the end of the declared length ICMT to the end of the file and assume that will always work?
The current parser behaviour is that the extra data gets read, after the channel / dB information, as a chunk ID + size, and the parser then complains that there is not enough data left in the file to fulfil the request.
No, it would still need its own ID. No, being the last thing in the file is no special case either. What you're showing here is malformed.
Your current parser errors correctly, as the next thing to be expected is again a 4-byte ID followed by 4 bytes for the length. The potential ID _10: is unknown and would be skipped, but interpreting 51:4 as a DWORD for the length of course asks for trouble.
The device is the culprit. Do you have other INFO fields which use NUL bytes? If not, then I assume the device is naive enough to consider a NUL the end of a string, despite itself producing strings with multiple NULs.
Since I have encountered countless files that don't stick to the standard, I can only say your parser is too naive as well: it knows how long the encapsulating list is and thus could easily detect field lengths that no longer fit. It could then ignore garbage like that or, in your case, offer the very specific option of appending the extra bytes to the last field.
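A rough sketch of that kind of defensive check, assuming the parser already tracks listEnd (the absolute offset at which the enclosing LIST is declared to end) and reads through a BinaryReader; AppendToLastField is a hypothetical helper:

// reader: a BinaryReader positioned at the next chunk header inside the LIST.
long remaining = listEnd - reader.BaseStream.Position;

if (remaining < 8)
{
    // Not even room for another ID + size: treat the leftover bytes as
    // trailing data belonging to the previous field (the timestamp here).
    byte[] trailing = reader.ReadBytes((int)remaining);
    AppendToLastField(trailing);
}
else
{
    string id = new string(reader.ReadChars(4));
    uint size = reader.ReadUInt32();

    if (reader.BaseStream.Position + size > listEnd)
    {
        // Declared size doesn't fit inside the list: malformed chunk.
        // Skip it or salvage it rather than trusting the length blindly.
    }
}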
In my application I have a serial port object and a listbox. In the DataReceived event, I send serialPort.ReadLine() to the listbox. If I write a "n" character to the serial port, nothing will get added to the listbox because what gets received doesn't end in "\r" or "\n".
What is the correct way to read information from a serial port? (Keep in mind that I need to keep the full string/char[] of the last thing received.)
The 'correct' way depends heavily on implementation.
The SerialPort.ReadLine() method expects a CR/LF as a means to define a payload unit. And by "thing" I imagine that you mean exactly that: a message, payload or package (as in one meaningful, functional unit of information).
What SerialPort.ReadLine() does is wrap the whole 'receive everything coming from the buffer and wait for an end-of-payload mark before continuing' mechanism for you.
If you'd rather have the raw incoming content as soon as it arrives, then you may consider changing your code to use SerialPort.Read() instead.
If your message consists of an exact number of bytes (sometimes the case with sensor data protocols) you can define the bytes you expect - but you should set a timeout in this case.
serialPort.ReadTimeout = timeOut;
int bytesRead = serialPort.Read(responseBytes, 0, bytesExpected);
// Read may return fewer bytes than requested; check bytesRead and loop if needed.
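If you go the raw route, a minimal sketch of a DataReceived handler that drains whatever is currently available might look like this (serialPort is your SerialPort field and HandleChunk is a hypothetical method that accumulates and parses the bytes according to your protocol):

private void serialPort_DataReceived(object sender, SerialDataReceivedEventArgs e)
{
    // Drain everything currently sitting in the driver's input buffer.
    int available = serialPort.BytesToRead;
    byte[] chunk = new byte[available];
    int read = serialPort.Read(chunk, 0, available);

    // Framing (newline, length prefix, fixed size...) is up to your protocol.
    HandleChunk(chunk, read);
}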
I'm making a simple client application in C#, and have reached a problem.
The server application sends a string in the format of "<number> <param> <param>" etc. In other words, the first symbol is an integer, and the rest are whatever, all are separated by one space each.
The problem I get, when reading this string, is that my program first reads a string containing only the number, and then the next time I read I get the rest of the message.
For example, if I were to do a writeline on what I receive, it would look like this:
(if he sends "1 0 0 0")
1
0 0 0
(EDIT: The formatting doesn't seem to permit this. The 1 is on a row of its own, the rest are supposed to be on the row below, including the space preceding the first 0)
I've run out of ideas how to fix this. Here's the method (I commented out some stuff I tried):
http://pastebin.com/0bXC9J2f
EDIT (again): I forgot, it seems to work just fine when I'm in debug and just go through everything step by step, so I can't find any source of the problem that way.
TCP is stream-based, not message-based. One Read can contain any of the following:
A teeny weeny part of a message
Half a message
Exactly one message
One and a half messages
Two messages
Thus you need some kind of method to tell whether a complete message has arrived. The most common methods are:
Add a footer (for instance an empty line) which indicates the end of a message
Add a fixed-length header containing the length of the message (a sketch of this approach follows)
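For example, a length-prefixed protocol could look roughly like this - a minimal sketch assuming a 4-byte little-endian length header, UTF-8 payloads, and a connected NetworkStream on both ends:

// Sender: write a 4-byte length header, then the payload.
static void SendMessage(NetworkStream stream, string message)
{
    byte[] payload = Encoding.UTF8.GetBytes(message);
    byte[] header = BitConverter.GetBytes(payload.Length);
    stream.Write(header, 0, header.Length);
    stream.Write(payload, 0, payload.Length);
}

// Receiver: read exactly 4 bytes, then exactly that many payload bytes,
// no matter how many Read calls that takes.
static string ReceiveMessage(NetworkStream stream)
{
    byte[] header = ReadExactly(stream, 4);
    int length = BitConverter.ToInt32(header, 0);
    byte[] payload = ReadExactly(stream, length);
    return Encoding.UTF8.GetString(payload);
}

static byte[] ReadExactly(NetworkStream stream, int count)
{
    byte[] buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int read = stream.Read(buffer, offset, count - offset);
        if (read == 0)
            throw new IOException("Connection closed mid-message.");
        offset += read;
    }
    return buffer;
}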
If your protocol is straight TCP, then you cannot send messages, strings or anything else except octet (byte) streams. Does your 'string' have a null at the end? If so, you need to append received data until the null arrives, and then you have your message.
If this is your problem, then you should code your protocol so that it works no matter how many read calls are made on the socket, e.g. if a null-terminated string of [99 data bytes+#0] is sent by the server, your protocol should be able to assemble the correct string whether 100 bytes are returned in one call, 1 byte is received in 100 calls, or anything in between.
Rgds,
Martin
I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are arbitrary ASCII characters that don't show up often in the data between them. After searching around, I seem to have found that no solution in .NET suits my needs, and the custom libraries that people have written for this seem to have flaws when it comes to gigantic input (a 4GB file with some field values very easily having several million characters).
While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the FileHelpers API. It's .NET and open source. It's extremely high performance, using compiled IL code to set fields on strongly typed objects, and it supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
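As a rough illustration of the attribute-based approach (a sketch only: the field names, delimiter and quote characters below are placeholders, and the exact attribute/engine signatures vary a little between FileHelpers versions):

[DelimitedRecord("|")]                        // substitute the real delimiter
public class EddRecord
{
    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field1;
    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field2;
    // ...one member per column...
}

// The async engine streams records one at a time instead of loading the file.
var engine = new FileHelperAsyncEngine<EddRecord>();
engine.BeginReadFile(@"c:\data\export.dat");
foreach (EddRecord record in engine)
{
    // process one record at a time
}
engine.Close();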
If for some reason that doesn't do it for you, try just reading line by line with a string.split:
// Lazily read the input line by line, splitting each line on the 'þ' character.
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even Linq into ;) Remember however that the IEnumerable is lazy loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or such - given your filesize however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };

    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from the file where the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O completion ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy
18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
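In managed code the usual way to get completion-port-backed file I/O, without APCs or native plumbing, is to open the FileStream with the asynchronous option; the CLR thread pool then services the overlapped completions through an I/O completion port. A minimal sketch (the file name, chunk size and parser hand-off are placeholders):

// FileOptions.Asynchronous requests overlapped I/O on the handle.
var fs = new FileStream(@"c:\data\export.dat", FileMode.Open, FileAccess.Read,
                        FileShare.Read, 8192, FileOptions.Asynchronous);
byte[] buffer = new byte[8192];

fs.BeginRead(buffer, 0, buffer.Length, ar =>
{
    int read = fs.EndRead(ar);   // runs on an I/O completion port thread
    // hand 'read' bytes of 'buffer' to the parser, then issue the next BeginRead
}, null);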
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding each record / text line (or whatever) back through an IEnumerable.
You mention that some fields are very, very big; if you try to read them into memory in their entirety you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parse the current buffer, and keep track of state.
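If you're on .NET 4 or later you don't even need a third-party wrapper; System.IO.MemoryMappedFiles is built in. A rough sketch of walking the file through a view stream in 8K chunks (the actual field parsing and state tracking are left to your own code):

using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (MemoryMappedViewStream view = mmf.CreateViewStream())
{
    byte[] buffer = new byte[8192];
    int read;
    while ((read = view.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Feed 'read' bytes of 'buffer' to the incremental parser here,
        // carrying any in-progress field / quote state between iterations.
    }
}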
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and using a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
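For what it's worth, a very rough sketch of what the core of such a byte-at-a-time reader might look like. EddReader is the name used above; the default quote/delimiter bytes and the single-byte (Latin-1) encoding are assumptions, and record boundaries, ReadFieldAsStream and error handling are all omitted:

using System;
using System.IO;
using System.Text;

public class EddReader : IDisposable
{
    private readonly Stream _stream;
    private readonly byte _quote;       // e.g. þ is 0xFE in Latin-1
    private readonly byte _delimiter;   // placeholder default only

    public EddReader(Stream stream, byte quote = 0xFE, byte delimiter = 0x14)
    {
        _stream = stream;
        _quote = quote;
        _delimiter = delimiter;
    }

    // Returns the next field as text, or null at end of file.
    public string ReadFieldAsText()
    {
        var field = new MemoryStream();
        bool insideQuotes = false;
        int b;
        while ((b = _stream.ReadByte()) != -1)
        {
            if (b == _quote)
                insideQuotes = !insideQuotes;     // enter/leave a quoted value
            else if (b == _delimiter && !insideQuotes)
                break;                            // field complete
            else
                field.WriteByte((byte)b);
        }
        if (b == -1 && field.Length == 0)
            return null;                          // nothing left to read
        return Encoding.GetEncoding("ISO-8859-1").GetString(field.ToArray());
    }

    public void Dispose()
    {
        _stream.Dispose();
    }
}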
While this doesn't help address the large input issue, a possible solution to the parsing issue might include a custom parser that uses the strategy pattern to supply a delimiter.
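A minimal sketch of that idea (the interface and class names here are made up for illustration):

// The strategy decides whether a character is a field boundary, so the
// scanning code doesn't care which delimiter convention is in use.
public interface IDelimiterStrategy
{
    bool IsDelimiter(char c);
}

public class SingleCharDelimiter : IDelimiterStrategy
{
    private readonly char _delimiter;
    public SingleCharDelimiter(char delimiter) { _delimiter = delimiter; }
    public bool IsDelimiter(char c) { return c == _delimiter; }
}

// The parser is constructed with whichever strategy applies to the file.
public class DelimitedParser
{
    private readonly IDelimiterStrategy _strategy;
    public DelimitedParser(IDelimiterStrategy strategy) { _strategy = strategy; }
    // ...scanning code consults _strategy.IsDelimiter(c) as it walks the input...
}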