I have a set of files, each of which contains the full text of a series of HTTP POST responses. A number of these contain binary objects (e.g. images or PDFs). I've been trying to use regexes to extract the binary objects, but I can't seem to get it right. The HttpListener class (and associated classes) all seem to require an active connection, i.e. parsing a real-time request/response pair, which I don't have. Is there a good library out there which can parse a file (or a string) as an HTTP response? If not, can anyone think of a better method for doing this than regex?
Thanks,
Rik
You can easily write your own parser which does the following:
Reads the response file line by line
until it reaches the Content-Length header, which specifies the number of bytes in the payload, and then the blank line that ends the headers
Reads that many bytes of the payload as binary
The Image class has a FromStream method which creates an image from a Stream. This way you can verify whether your resulting image matches the original image.
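A minimal sketch of such a parser, assuming a file holds a plain HTTP/1.1 response with a Content-Length header and no chunked transfer encoding (error handling is kept to a minimum, and the file name is illustrative only):

using System;
using System.IO;
using System.Text;

static byte[] ExtractPayload(string path)
{
    using (var stream = File.OpenRead(path))
    {
        int contentLength = -1;
        string line;
        // Read header lines until the blank line that separates headers from the body.
        while ((line = ReadAsciiLine(stream)) != null && line.Length > 0)
        {
            if (line.StartsWith("Content-Length:", StringComparison.OrdinalIgnoreCase))
                contentLength = int.Parse(line.Substring("Content-Length:".Length).Trim());
        }
        if (contentLength < 0)
            throw new InvalidDataException("No Content-Length header found.");

        // Read exactly Content-Length bytes of binary payload.
        var payload = new byte[contentLength];
        int offset = 0;
        while (offset < contentLength)
        {
            int read = stream.Read(payload, offset, contentLength - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
        return payload;
    }
}

// Reads one CRLF-terminated header line as ASCII, one byte at a time,
// so the stream position ends up exactly at the start of the payload.
static string ReadAsciiLine(Stream stream)
{
    var sb = new StringBuilder();
    int b;
    while ((b = stream.ReadByte()) != -1)
    {
        if (b == '\n') break;
        if (b != '\r') sb.Append((char)b);
    }
    return (b == -1 && sb.Length == 0) ? null : sb.ToString();
}

To verify an extracted image against the original, something like Image.FromStream(new MemoryStream(ExtractPayload("response1.bin"))) can then be used.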
Regards
I'm trying to read the content body of a message in an Azure Logic App, but I'm not having much success. I have seen a lot of suggestions which say that the body is base64 encoded, and suggest using the following to decode:
#{json(base64ToString(triggerBody()?['ContentData']))}
The base64ToString(...) part is decoding the content into a string correctly, but the string appears to contain a prefix with some extra serialization information at the start:
#string3http://schemas.microsoft.com/2003/10/Serialization/�3{"Foo":"Bar"}
There are also some extra characters in that string that are not being displayed in my browser. So the json(...) function doesn't accept the input, and gives an error instead.
InvalidTemplate. Unable to process template language expressions in
action 'HTTP' inputs at line '1' and column '2451': 'The template
language function 'json' parameter is not valid. The provided value
#string3http://schemas.microsoft.com/2003/10/Serialization/�3{"Foo":"bar" }
cannot be parsed: Unexpected character encountered while parsing value: #. Path '', line 0, position 0.. Please see https://aka.ms/logicexpressions#json for usage details.'.
For reference, the messages are added to the topic using the .NET service bus client (the client shouldn't matter, but this looks rather C#-ish):
await TopicClient.SendAsync(new BrokeredMessage(JsonConvert.SerializeObject(item)));
How can I read this correctly as a JSON object in my Logic App?
This is caused by how the message is placed on the ServiceBus, specifically in the C# code. I was using the following code to add a new message:
var json = JsonConvert.SerializeObject(item);
var message = new BrokeredMessage(json);
await TopicClient.SendAsync(message);
This code looks fine, and works between different C# services without a problem. The problem is caused by the way the BrokeredMessage(Object) constructor serializes the payload given to it:
Initializes a new instance of the BrokeredMessage class from a given object by using DataContractSerializer with a binary XmlDictionaryWriter.
That means the content is serialized as binary XML, which explains the prefix and the unrecognizable characters. This is hidden by the C# implementation when deserializing, and it returns the object you were expecting, but it becomes apparent when using a different library (such as the one used by Azure Logic Apps).
There are two alternatives to handle this problem:
Make sure the receiver can handle messages in binary XML format
Make sure the sender actually uses the format we want, e.g. JSON.
Paco de la Cruz's answer handles the first case, using substring, indexOf and lastIndexOf:
#json(substring(base64ToString(triggerBody()?['ContentData']), indexof(base64ToString(triggerBody()?['ContentData']), '{'), add(1, sub(lastindexof(base64ToString(triggerBody()?['ContentData']), '}'), indexof(base64ToString(triggerBody()?['ContentData']), '}')))))
As for the second case, fixing the problem at the source simply involves using the BrokeredMessage(Stream) constructor instead. That way, we have direct control over the content:
// Serialize the payload ourselves and hand BrokeredMessage a raw stream,
// so no DataContract/binary XML wrapping is applied.
var json = JsonConvert.SerializeObject(item);
var bytes = Encoding.UTF8.GetBytes(json);
var stream = new MemoryStream(bytes);
// The second argument (ownsStream) lets the message dispose the stream for us.
var message = new BrokeredMessage(stream, true);
await TopicClient.SendAsync(message);
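With the body now stored as plain UTF-8 JSON, the expression the question originally tried should work on the receiving side (assuming the Logic App still receives the content base64 encoded in ContentData):
#{json(base64ToString(triggerBody()?['ContentData']))}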
You can use the substring function together with indexOf and lastIndexOf to get only the JSON substring.
Unfortunately, it's rather complex, but it should look something like this:
#json(substring(base64ToString(triggerBody()?['ContentData']), indexof(base64ToString(triggerBody()?['ContentData']), '{'), add(1, sub(lastindexof(base64ToString(triggerBody()?['ContentData']), '}'), indexof(base64ToString(triggerBody()?['ContentData']), '}')))))
More info on how to use these functions here.
HTH
Paco de la Cruz's solution worked for me, though I had to swap out the last '}' in the expression for a '{', otherwise it finds the wrong end of the data segment.
I also split it into two steps to make it a little more manageable.
First I get the decoded string out of the message into a variable (that I've called MC) using:
#{base64ToString(triggerBody()?['ContentData'])}
then in another logic app action do the substring extraction:
#{substring(variables('MC'),indexof(variables('MC'),'{'),add(1,sub(lastindexof(variables('MC'),'}'),indexof(variables('MC'),'{'))))}
Note that the last string literal '{' is reversed from Paco's solution.
This is working for my test cases, but I'm not sure how robust this is.
Also, I've left it as a string; I do the conversion to JSON later in my logic app.
UPDATE
We have found that just occasionally (2 in several hundred runs) the text that we want to discard can contain the '{' character.
I have modified our expression to explicitly locate the start of the data segment, which for me is:
'{"IntegrationRequest"'
so the substitution becomes:
#{substring(variables('MC'),indexof(variables('MC'),'{"IntegrationRequest"'),add(1,sub(lastindexof(variables('MC'),'}'),indexof(variables('MC'),'{"IntegrationRequest"'))))}
I'm trying to write a C# utility to consume the results returned from the Export API by MailChimp.
The documentation states that the results will be returned as "streamed JSON."
"This means that a call to this API will not return a single valid JSON
object but, rather, a series of valid JSON objects separated by
newline characters."
The results that I'm seeing don't look like normal JSON to me, and aren't what I was expecting to be working with. It looks to me like CSV data wrapped in square brackets, with row headers in the first line.
A snip of the results can be viewed here. I'll paste them below as well.
["Email Address","First Name","Last Name","Company","FirstOrder","LastOrder","CustomerID","SalesRep","ScreenName","PlayerPage","PlayerPDF","Services Purchased","Contests","EMAIL_TYPE","MEMBER_RATING","OPTIN_TIME","OPTIN_IP","CONFIRM_TIME","CONFIRM_IP","LATITUDE","LONGITUDE","GMTOFF","DSTOFF","TIMEZONE","CC","REGION","LAST_CHANGED","LEID","EUID"]
["john#domain.com","John","Doe","ACME Inc","2010-09-07","2010-09-07","ABC123","sally","","","","Service1","","html",2,"",null,"2011-12-23 15:58:44","10.0.1.1","34.0257000","-84.1418000","-5","-4","America\/Kentucky\/Monticello","US","GA","2014-04-11 18:38:39","40830325","82c81e14a"]
["jane#domain2.com","Jane","Doe","XYZ Inc","2011-05-02","2011-05-02","XYZ001","jack","","","","Service2","","html",2,"",null,"2011-12-23 15:58:44","10.0.1.1","34.0257000","-84.1418000","-5","-4","America\/Kentucky\/Monticello","US","GA","2014-04-11 18:38:40","40205835","6c23329a"]
Can you help me understand what is being returned, as it doesn't appear to be normal JSON? And what would be my best approach to parse this stream of data into a C# object?
EDIT: I've confirmed that the data stream is valid JSON using http://www.freeformatter.com/json-validator.html and pasting in the sample lines above. So what I'm hoping for is a way to dynamically create an object based on the first line, then create a list of these objects with the values contained in the subsequent lines.
You are correct, this is not in typical JSON form. What you could do is create a collection of Dictionary<string, string> objects. Use the first line of the response as the keys of the dictionaries, and then use the values found in each subsequent line as the values of that line's dictionary.
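A rough sketch of that approach, assuming Newtonsoft.Json is available and the export has been saved to a local file (the file name is illustrative only). Each line is parsed as a JSON array, with the first line supplying the keys:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

var records = new List<Dictionary<string, string>>();
string[] keys = null;

foreach (var line in File.ReadLines("export.json"))
{
    if (string.IsNullOrWhiteSpace(line)) continue;
    var values = JArray.Parse(line);
    if (keys == null)
    {
        // Header row: column names become the dictionary keys.
        keys = values.Select(v => (string)v).ToArray();
        continue;
    }
    var record = new Dictionary<string, string>();
    for (int i = 0; i < keys.Length && i < values.Count; i++)
        record[keys[i]] = values[i].Type == JTokenType.Null ? null : values[i].ToString();
    records.Add(record);
}

When consuming the API directly, the same loop can read lines from the response stream instead of a file.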
In my app I want to compress the data that get stored in redis string keys.
I don't want to compress all of them though because small data values don't compress well and I want to avoid the cpu overhead on them.
My question is how to detect that a value is compressed when I read the string key in order to perform decompression?
I tried some code to append a custom header to the zip stream, but I didn't have any luck.
A common pattern is to use a payload prefix combined with a delimiter.
For example, you could use a format like this:
[key];[encoding];[metatype];[version]\t[payload]
I use the delimiters ; and \t here. Choose other delimiters if you like them better. Of course you must prevent these delimiters from occurring in your prefix tags themselves. [payload] contains for example binary data, string data, whatever. [encoding] can for example be zip, msgpack, utf8, base64, json (just some ideas).
The benefit of using a payload prefix is that you don't have to deserialize or uncompress the payload itself to use it as an entity. In Redis-Lua, for example, you can't unzip. But you can do a simple read of the payload prefix and respond to client requests. Even if you can deserialize inside Redis-Lua, as with JSON or MsgPack formats, you might not want to for performance reasons.
There are other options, of course. If you don't like prefixes with delimiters, you could also put the payload and encoding tag in an array, and serialize it as MsgPack. Or use JSON for the prefix, then a null character, then the payload. Or even (a bit more memory efficient): use 4 or 8 bytes for the prefix size, MsgPack for the prefix, and use the prefix size to determine where the payload starts (which might even be MsgPack as well).
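A rough sketch of the delimiter variant in C#, where small values are stored as utf8\t + bytes and larger ones are gzipped and stored as zip\t + bytes; the tag names, threshold and delimiter are arbitrary choices for illustration:

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

static byte[] Wrap(string value, int compressThreshold = 256)
{
    byte[] raw = Encoding.UTF8.GetBytes(value);
    if (raw.Length < compressThreshold)
        return Encoding.ASCII.GetBytes("utf8\t").Concat(raw).ToArray();

    using (var buffer = new MemoryStream())
    {
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            gzip.Write(raw, 0, raw.Length);
        return Encoding.ASCII.GetBytes("zip\t").Concat(buffer.ToArray()).ToArray();
    }
}

static string Unwrap(byte[] stored)
{
    // The bytes before the first tab tell us how the payload was encoded.
    int tab = Array.IndexOf(stored, (byte)'\t');
    string tag = Encoding.ASCII.GetString(stored, 0, tab);
    byte[] payload = stored.Skip(tab + 1).ToArray();

    if (tag == "utf8")
        return Encoding.UTF8.GetString(payload);

    using (var gzip = new GZipStream(new MemoryStream(payload), CompressionMode.Decompress))
    using (var reader = new StreamReader(gzip, Encoding.UTF8))
        return reader.ReadToEnd();
}

The byte arrays returned by Wrap can be stored and read back with any Redis client that supports binary string values.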
Final word of advice: don't mess with the payload itself (like altering the zip header), that's bound to get you in a whole lot of unnecessary trouble.
Hope this helps, TW
I am accepting a POST request like so:
Socket connection = m_connection;
Byte[] receive = new Byte[1024];
int received = connection.Receive(receive);
Console.WriteLine(received.ToString());
string request = Encoding.ASCII.GetString(receive);
Console.WriteLine(request);
The POST values end up being weird: if I post text values, a lot of the time they end up with a lot of +'s behind them. If I post C:\Users\John Doe\wwwroot, it ends up being: C%3A%5CUsers%5John+Doe%5Cwwwroot
index.html becomes index.html++++++++++++++++++++++++++++++++
It seems I am getting the encoding wrong somehow; however, I tried multiple encodings and they all have the same weirdness. What is the best way to correctly read an HTTP POST request from a socket byte stream?
You need to trim the byte array receive that you are passing to the GetString method. Right now you are passing all 1024 bytes, so the GetString method is trying to decode all of them as best it can.
You need to use the received variable to indicate the bounds for the string you are encoding.
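For example, using the variables from the question's snippet:

string request = Encoding.ASCII.GetString(receive, 0, received);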
You should use System.Web.HttpUtility.UrlDecode, not Encoding.ASCII, to perform the decoding.
You will probably get away with passing Encoding.Default as the second parameter to this static method.
You are seeing the result of an HTML form POST, which encodes the values as if they were being appended to a URL as a query string. Hence it is an &-delimited set of name=value pairs. Any out-of-band characters are encoded to their hex value %xx.
The UrlDecode method will decode all this for you.
As others have stated, you really need to chunk the stream in; it may be bigger than 1 KB.
Strictly speaking, you should check the Content-Type header for any ;CharSet= attribute. If present, you need to ensure the character encoding you pass to UrlDecode is appropriate to that CharSet (e.g., if CharSet=UTF-8 then use Encoding.UTF8).
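For illustration, decoding a form-encoded value similar to the one in the question (this assumes a reference to System.Web):

string encoded = "C%3A%5CUsers%5CJohn+Doe%5Cwwwroot";
string decoded = System.Web.HttpUtility.UrlDecode(encoded, Encoding.UTF8);
// decoded is now: C:\Users\John Doe\wwwroot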
First off, you don't need to decode the input; the HTTP protocol itself is ASCII, and it will be faster to work with just bytes. Now, what you'll want to do is define a maximum HTTP request header size, say 4 KB, and then keep reading bytes until you hit \r\n\r\n, which signals the end of the HTTP request headers. You need to enforce this maximum header size limit, otherwise a single malicious user could send an endless HTTP request and your server would run out of memory.
You should read the HTTP specification.
Depending on your HTTP request, the HTTP content can be many things, and you need to act accordingly. The HTTP protocol itself is always ASCII, so you can treat it as just bytes, but the content can be encoded very differently. This is generally indicated by the Content-Type header. But again, read the HTTP specification.
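A rough sketch of the header-reading loop described above, reusing the connection socket from the question and reading one byte at a time for simplicity (a real server would read in larger chunks):

const int MaxHeaderSize = 4096;
var headerBytes = new MemoryStream();
var one = new byte[1];
bool headersComplete = false;

while (!headersComplete && headerBytes.Length < MaxHeaderSize)
{
    if (connection.Receive(one, 0, 1, SocketFlags.None) == 0)
        break; // connection closed before the headers finished
    headerBytes.WriteByte(one[0]);
    byte[] buf = headerBytes.GetBuffer();
    int n = (int)headerBytes.Length;
    headersComplete = n >= 4 && buf[n - 4] == '\r' && buf[n - 3] == '\n'
                             && buf[n - 2] == '\r' && buf[n - 1] == '\n';
}

if (!headersComplete)
    throw new InvalidOperationException("Header section too large or connection closed early.");

string headers = Encoding.ASCII.GetString(headerBytes.ToArray());
// The Content-Length header (if any) then tells you how many body bytes follow.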
Given a Stream as input, how do I safely create an XPathNavigator against an XML data source?
The XML data source:
May possibly contain invalid hexadecimal characters that need to be removed.
May contain characters that do not match the declared encoding of the document.
As an example, some XML data sources in the cloud will have a declared encoding of utf-8, but the actual encoding is windows-1252 or ISO 8859-1, which can cause an invalid character exception to be thrown when creating an XmlReader against the Stream.
From the StreamReader.CurrentEncoding property documentation: "The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method." This seems to indicate that CurrentEncoding can be checked after the first read, but are we stuck storing this encoding when we need to write out the XML data to a Stream?
I am hoping to find a best practice for safely creating an XPathNavigator/IXPathNavigable instance against an XML data source that will gracefully handle encoding and invalid-character issues (in C#, preferably).
I had a similar issue when some XML fragments were imported into a CRM system using the wrong encoding (there was no encoding stored along with the XML fragments).
In a loop, I created a wrapper stream using the current encoding from a list. The encoding was constructed using the DecoderExceptionFallback and EncoderExceptionFallback options (as mentioned by #Doug). If a DecoderFallbackException was thrown during processing, the original stream was reset and the next most likely encoding was used.
Our encoding list was something like UTF-8, Windows-1252, GB-2312 and US-ASCII. If you fell off the end of the list then the stream was really bad and was rejected/ignored/etc.
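A loose sketch of that fallback loop, assuming a seekable stream; the candidate encoding list is just an example and would be tuned per data source:

using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.XPath;

static XPathDocument LoadWithFallback(Stream stream)
{
    string[] candidates = { "utf-8", "windows-1252", "gb2312", "us-ascii" };
    foreach (var name in candidates)
    {
        // Exception fallbacks make bad byte sequences throw instead of silently becoming '?'.
        var encoding = Encoding.GetEncoding(name,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        stream.Position = 0;
        try
        {
            using (var reader = new StreamReader(stream, encoding,
                detectEncodingFromByteOrderMarks: true, bufferSize: 4096, leaveOpen: true))
            {
                return new XPathDocument(reader);
            }
        }
        catch (DecoderFallbackException)
        {
            // The bytes don't fit this encoding; rewind and try the next candidate.
        }
        catch (XmlException)
        {
            // Invalid XML (e.g. bad characters); rewind and try the next candidate.
        }
    }
    throw new InvalidDataException("Stream could not be decoded with any of the known encodings.");
}

The returned XPathDocument then provides the XPathNavigator via CreateNavigator().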
EDIT:
I whipped up a quick sample and basic test files (source here). The code doesn't have any heuristics to choose between code pages that both match the same set of bytes, so a Windows-1252 file may be detected as GB2312, and vice-versa, depending on file content, and encoding preference ordering.
It's possible to use the DecoderFallback class (and a few related classes) to deal with bad characters, either by skipping them or by doing something else (restarting with a new encoding?).
When using an XmlTextReader or something similar, the reader itself will figure out the encoding declared in the XML file.