I'm using the Socket class for my web client. I can't use HttpWebRequest since it doesn't support SOCKS proxies, so I have to parse headers and handle chunked encoding myself. The hardest part for me is determining the length of the content, so I have to read it byte by byte. First I use ReadByte() to find the end of the headers (the "\r\n\r\n" combination), then check whether the body has a Transfer-Encoding header. If it does, I have to read each chunk's size, and so on:
public void ParseHeaders(Stream stream)
{
    while (true)
    {
        var lineBuffer = new List<byte>();
        while (true)
        {
            int b = stream.ReadByte();
            if (b == -1) return;
            if (b == 10) break;                   // LF ends the line
            if (b != 13) lineBuffer.Add((byte)b); // skip CR
        }
        string line = Encoding.ASCII.GetString(lineBuffer.ToArray());
        if (line.Length == 0) break;              // empty line ends the headers
        int pos = line.IndexOf(": ");
        if (pos == -1) throw new VkException("Incorrect header format");
        string key = line.Substring(0, pos);
        string value = line.Substring(pos + 2);
        Headers[key] = value;
    }
}
But this approach has very poor performance. Can you suggest a better solution? Maybe some open-source examples or libraries that handle HTTP requests through sockets (nothing too big and complicated, though - I'm a noob).
Best of all would be a link to an example that reads a message body and correctly handles the cases where the content is chunked, is gzip- or deflate-encoded, or the Content-Length header is omitted (the message ends when the connection is closed). Something like the source code of the HttpWebRequest class.
Upd:
My new function looks like this:
int bytesRead = 0;
byte[] buffer = new byte[0x8000];
do
{
    try
    {
        bytesRead = this.socket.Receive(buffer);
        if (bytesRead <= 0) break;
        this.m_responseData.Write(buffer, 0, bytesRead);
        if (this.m_inHeaders == null) this.GetHeaders();
    }
    catch (Exception exception)
    {
        throw new Exception("Read response failed", exception);
    }
}
while ((this.m_inHeaders == null) || !this.isResponseBodyComplete());
where GetHeaders() and isResponseBodyComplete() use m_responseData (a MemoryStream) containing the already-received data.
I suggest that you don't implement this yourself - the HTTP 1.1 protocol is sufficiently complex to make this a project of several man-months.
The question is, is there an HTTP protocol parser for .NET? This question has been asked on SO, and in the answers you'll see several suggestions, including source code for handling HTTP streams.
Converting Raw HTTP Request into HTTPWebRequest Object
EDIT: The Rotor code is reasonably complex, and difficult to read/navigate as web pages. But still, the implementation effort to add SOCKS support is much lower than implementing the entire HTTP protocol yourself. You will have something working within a few days at most that you can depend upon, based on a tried and tested implementation.
The request and response are read from/written to a NetworkStream, m_Transport, in the Connection class. This stream is used in these methods:
internal int Read(byte[] buffer, int offset, int size)
//and
private static void ReadCallback(IAsyncResult asyncResult)
both in http://www.123aspx.com/Rotor/RotorSrc.aspx?rot=42903
The socket is created in
private void StartConnectionCallback(object state, bool wasSignalled)
So you could modify this method to create a Socket to your socks server, and do the necessary handshake to obtain the external connection. The rest of the code can remain the same.
I gathered this info in about 30 minutes looking at the pages on the web. It should go much faster if you load these files into an IDE. It may seem like a burden to have to read through this code - after all, reading code is far harder than writing it - but you are making just small changes to an already established, working system.
To be sure the changes work in all cases, it would be wise to also test what happens when the connection is broken, to ensure that the client reconnects using the same method, and so re-establishes the SOCKS connection and sends the SOCKS request.
If the problem is a bottleneck in terms of ReadByte being too slow, I suggest you wrap your input stream in a BufferedStream. If the performance issue comes from the cost of many small reads, that will solve the problem for you.
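For example, a minimal sketch (assuming 'socket' is your already-connected Socket and ParseHeaders is the method from the question):

using System.IO;
using System.Net.Sockets;

Stream raw = new NetworkStream(socket, false);   // false: the stream does not own the socket
Stream buffered = new BufferedStream(raw, 8192); // the 8 KB buffer size is an arbitrary choice
ParseHeaders(buffered); // ReadByte() now hits the in-memory buffer instead of making a syscall per byte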
Also, you don't need this:
string line = Encoding.ASCII.GetString(lineBuffer.ToArray());
HTTP by design requires that the header is made up of ASCII characters only. You don't really want to - or need to - turn it into actual .NET strings (which are Unicode).
If you want to find the end of the HTTP headers, you can do this for good performance:
int k = 0;
while (k != 0x0d0a0d0a)
{
    int ch = stream.ReadByte();
    if (ch == -1) break; // end of stream reached before the blank line was found
    k = (k << 8) | ch;
}
When the sequence \r\n\r\n is encountered, k will equal 0x0d0a0d0a.
In most (though not all) HTTP responses there is a Content-Length header that tells you how many bytes there are in the body. Then it is simply a matter of allocating the appropriate number of bytes and reading them all at once.
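Something like this sketch (ReadBody is a made-up helper; note that Read() may return fewer bytes than requested, so it has to loop):

using System;
using System.IO;

static byte[] ReadBody(Stream stream, int contentLength)
{
    var body = new byte[contentLength];
    int offset = 0;
    while (offset < contentLength)
    {
        int n = stream.Read(body, offset, contentLength - offset);
        if (n <= 0)
            throw new IOException("Connection closed before the body was complete");
        offset += n;
    }
    return body;
}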
While I would tend to agree with mdma about trying as hard as possible to avoid implementing your own HTTP stack, one trick you might consider is reading moderate-sized chunks from the stream. If you do a read and give it a buffer that's larger than what's available, it will return the number of bytes it did read. That should reduce the number of system calls and speed up your performance significantly. You'll still have to scan the buffers much as you do now, though.
Taking a look at another client's code is helpful (if not confusing):
http://src.chromium.org/viewvc/chrome/trunk/src/net/http/
I'm currently doing something like this too. I find the best way to increase the efficiency of the client is to use the asynchronous socket functions provided. They're quite low-level and get rid of busy waiting and dealing with threads yourself. All of these have Begin and End in their method names. But first, I would try it using blocking calls, just so you get the semantics of HTTP out of the way; then you can work on efficiency. Remember: premature optimization is evil - so get it working, then optimize all of the stuff!
Also: some of your inefficiency might be tied up in your use of ToArray(). It's known to be a bit expensive computationally. A better solution might be to store your intermediate results in a byte[] buffer and append them to a StringBuilder with the correct encoding.
For gzipped or deflated data, read in all of the data (keep in mind that you might not get all of the data the first time you ask; keep track of how much you have read, and keep appending to the same buffer). Then you can decode the data using GZipStream(..., CompressionMode.Decompress).
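A sketch of that decode step, assuming compressedBody already holds the complete entity body (for Content-Encoding: deflate, substitute DeflateStream):

using System.IO;
using System.IO.Compression;

static byte[] Decompress(byte[] compressedBody)
{
    using (var input = new MemoryStream(compressedBody))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (var output = new MemoryStream())
    {
        var buffer = new byte[8192];
        int n;
        while ((n = gzip.Read(buffer, 0, buffer.Length)) > 0)
            output.Write(buffer, 0, n);
        return output.ToArray(); // the decompressed bytes
    }
}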
I would say that doing this is not as difficult as some might imply, you just have to be a bit adventurous!
All the answers here about extending Socket and/or TcpClient seem to miss something really obvious: HttpWebRequest is also a class, and can therefore be extended. You don't need to write your own HTTP/socket class. You simply need to extend HttpWebRequest with a custom connection method. After connecting, all data is standard HTTP and can be handled as normal by the base class.
public class SocksHttpWebRequest : HttpWebRequest
{
    public static WebRequest Create(string url, string proxy_url)
    {
        // ... set up the SOCKS connection ...

        // then hand off to the base WebRequest.Create() with the proxy url
        return WebRequest.Create(proxy_url);
    }
}
The SOCKS handshake is not particularly complex, so if you have a basic understanding of socket programming it shouldn't take very long to implement. After that, HttpWebRequest can do the HTTP heavy lifting.
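For reference, a rough sketch of the SOCKS5 handshake from RFC 1928 (no authentication; partial-read handling on Receive is omitted for brevity):

using System;
using System.Net.Sockets;
using System.Text;

static void SocksConnect(Socket socket, string host, ushort port)
{
    // Greeting: version 5, one method offered, method 0 = no authentication.
    socket.Send(new byte[] { 0x05, 0x01, 0x00 });
    var reply = new byte[2];
    socket.Receive(reply); // expect { 0x05, 0x00 }
    if (reply[1] != 0x00) throw new Exception("SOCKS5 authentication refused");

    // CONNECT request: version 5, command 1 (connect), reserved, ATYP 3 (domain name).
    byte[] hostBytes = Encoding.ASCII.GetBytes(host);
    var request = new byte[7 + hostBytes.Length];
    request[0] = 0x05; request[1] = 0x01; request[2] = 0x00; request[3] = 0x03;
    request[4] = (byte)hostBytes.Length;
    Array.Copy(hostBytes, 0, request, 5, hostBytes.Length);
    request[5 + hostBytes.Length] = (byte)(port >> 8);   // port in network (big-endian) order
    request[6 + hostBytes.Length] = (byte)(port & 0xFF);
    socket.Send(request);

    var response = new byte[10]; // minimal reply size for an IPv4 bound address
    socket.Receive(response);
    if (response[1] != 0x00) throw new Exception("SOCKS5 connect failed, code " + response[1]);
}

Once SocksConnect succeeds, plain HTTP can be written to the same socket.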
Why don't you read up to the two newlines and then just grab the headers from the string? Performance might be worse, but it should still be reasonable:
Dim Headers As String = GetHeadersFromRawRequest(ResponseBinary)
If Headers.IndexOf("Content-Encoding: gzip") > 0 Then
    ' The offset skips the headers plus the blank-line separator; the count
    ' must exclude both, or the MemoryStream constructor will throw.
    Dim SepLen As Integer = (vbNewLine & vbNewLine).Length
    Dim GzStream As New GZipStream(New MemoryStream(ResponseBinary, Headers.Length + SepLen, ReadByteSize - Headers.Length - SepLen), CompressionMode.Decompress)
    ClearTextHtml = New StreamReader(GzStream).ReadToEnd()
End If

Private Function GetHeadersFromRawRequest(ByVal request() As Byte) As String
    Dim Req As String = Text.Encoding.ASCII.GetString(request)
    Dim ContentPos As Integer = Req.IndexOf(vbNewLine & vbNewLine)
    If ContentPos = -1 Then Return String.Empty
    Return Req.Substring(0, ContentPos)
End Function
You may want to look at the TcpClient class in System.Net, it's a wrapper for a Socket that simplifies the basic operations.
From there you're going to have to read up on the HTTP protocol. Also be prepared to do some zip operations: HTTP 1.1 supports GZip of its content, as well as partial blocks. You're going to have to learn quite a bit to parse them out by hand.
Basic HTTP 1.0 is simple; the protocol is well documented online, and our friendly neighborhood Google can help you with that one.
I would create a SOCKS proxy that can tunnel HTTP and then have it accept the requests from HttpWebRequest and forward them. I think that would be far easier than recreating everything that HttpWebRequest does. You could start with Privoxy, or just roll your own. The protocol is simple and documented here:
http://en.wikipedia.org/wiki/SOCKS
And on the RFC's that they link to.
You mentioned that you have to have many different proxies -- you could set up a local port for each one.
Related
This is my first question posted on this forum, and I'm a beginner in the C# world, so this is kind of exciting for me, but I'm facing some issues with sending a large amount of data through sockets. Here are more details about my problem:
I'm sending a binary image of 5 MB through a TCP socket. At the receiving end I'm saving the received data and getting only 1.5 MB, i.e. data has been lost (I compared the original and resulting files, and the comparison showed me the missing parts).
this is the code I use:
private void senduimage_Click(object sender, EventArgs e)
{
    if (!user.clientSocket_NewSocket.Connected)
    {
        Socket clientSocket_NewSocket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        user.clientSocket_NewSocket = clientSocket_NewSocket;
        System.IAsyncResult _NewSocket = user.clientSocket_NewSocket.BeginConnect(ip_address, NewSocket.Transceiver_TCP_Port, null, null);
        bool successNewSocket = _NewSocket.AsyncWaitHandle.WaitOne(2000, true);
    }
    byte[] outStream = System.Text.Encoding.ASCII.GetBytes(Uimage_Data);
    user.clientSocket_NewSocket.Send(outStream);
}
In forums they say to divide the data into chunks. Is this the solution, and if so, how can I do it? I've tried, but it didn't work!
There are lots of different solutions, but chunking is usually a good one. You can either do it blindly, where you keep filling your temporary buffer and appending it to some stateful buffer until you hit an arbitrary token or the buffer is not completely full, or you can adhere to some sort of contract per TCP message (a message being the overall data to receive).
If you were to go with some sort of contract, then make the first N bytes of a message the descriptor. You can make it as big or as small as you want, but your temp buffer will ONLY read this size up front from the stream.
A typical header could be something like:
public struct StreamHeader // 5 bytes on the wire
{
    public byte MessageType { get; set; } // 1 byte
    public int MessageSize { get; set; }  // 4 bytes
}
So you would read that in first; then, if the message is small enough, allocate its full size for the temp buffer and read it all in, or, if you deem it too big, read it in sections and keep going until the total bytes you have received match the MessageSize portion of your header structure, as in the sketch below.
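A sketch of reading that contract off the stream (ReadExact is a made-up helper; it loops because Read() may return less than requested, and the sender is assumed to write MessageSize little-endian):

using System;
using System.IO;

static byte[] ReadMessage(Stream stream, out byte messageType)
{
    byte[] header = ReadExact(stream, 5);              // the fixed 5-byte descriptor
    messageType = header[0];
    int messageSize = BitConverter.ToInt32(header, 1); // assumes a little-endian sender
    return ReadExact(stream, messageSize);             // then exactly that many payload bytes
}

static byte[] ReadExact(Stream stream, int count)
{
    var buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int n = stream.Read(buffer, offset, count - offset);
        if (n <= 0) throw new IOException("Connection closed mid-message");
        offset += n;
    }
    return buffer;
}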
Probably you haven't read the documentation on socket usage in C# (http://msdn.microsoft.com/en-us/library/ms145160.aspx). The internal buffer cannot store all the data you passed to the send method. A possible solution to your problem is, as you said, to divide your data into chunks:
int totalBytesToSend = outStream.Length;
int bytesSent = 0;
while (bytesSent < totalBytesToSend)
{
    // Send() returns how many bytes were actually accepted on this call.
    bytesSent += user.clientSocket_NewSocket.Send(outStream, bytesSent, totalBytesToSend - bytesSent, SocketFlags.None);
}
I suspect that one of your problems is that you are not calling EndConnect. From the MSDN documentation:
The asynchronous BeginConnect operation must be completed by calling the EndConnect method.
Also, the wait:-
bool successNewSocket = _NewSocket.AsyncWaitHandle.WaitOne(2000, true);
is probably always false, as there is nothing setting the event to the signaled state. Usually you would specify a callback function to the BeginConnect call, and in the callback you'd call EndConnect and set the event to the signaled state. See the example code on this MSDN page.
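Something along these lines (a sketch only; ConnectWithTimeout is a made-up name, and error handling inside the callback is omitted):

using System;
using System.Net.Sockets;
using System.Threading;

static void ConnectWithTimeout(Socket socket, string host, int port, int timeoutMs)
{
    var connected = new ManualResetEvent(false);
    socket.BeginConnect(host, port, ar =>
    {
        socket.EndConnect(ar); // required to complete the BeginConnect operation
        connected.Set();       // now the WaitOne below has something to wait on
    }, null);
    if (!connected.WaitOne(timeoutMs, true))
        throw new TimeoutException("Connect timed out");
}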
UPDATE
I think I see another problem:-
byte[] outStream = System.Text.Encoding.ASCII.GetBytes(Uimage_Data);
I don't know what type Uimage_Data is, but I really don't think you want to convert it to ASCII. A zero in the data may signal an end-of-data byte (or maybe a 26 or some other ASCII code). The point is, the encoding process is likely to be changing the data.
Can you provide the type for the Uimage_Data object?
Most likely the problem is that you are closing the client-side socket before all the data has been transmitted to the server, and it is therefore getting discarded.
By default when you close a socket, all untransmitted data (sitting in the operating system buffers) is discarded. There are a few solutions:
[1] Set SO_LINGER (see http://developerweb.net/viewtopic.php?id=2982)
[2] Get the server to send an acknowledgement to the client, and don't close the client-side socket until you receive it.
[3] Wait until the output buffer is empty on the client side before closing the socket (test using getsockopt SO_SND_BUF - I'm not sure of the syntax in C#).
Also, you really should be testing the return value of Send(). Although in theory it should block until it sends all the data, I would want to actually verify that, and at least print an error message if there is a mismatch.
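For [1], the managed equivalent of SO_LINGER should be Socket.LingerState; a sketch:

using System.Net.Sockets;

// Linger for up to 10 seconds on Close() so data still sitting in the
// OS send buffer is transmitted before the socket is torn down.
socket.LingerState = new LingerOption(true, 10);
socket.Close();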
I have a WPF app that processes a lot of URLs (thousands); each is sent off to its own thread, does some processing, and stores a result in the database.
The URLs can be anything, but some seem to be massively big pages. This shoots the memory usage up a lot and makes performance really bad. I set a timeout on the web request, so that if it takes longer than, say, 20 seconds it doesn't bother with that URL, but it seems to make little difference.
Here's the code section:
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(urlAddress.Address);
req.Timeout = 20000;
req.ReadWriteTimeout = 20000;
req.Method = "GET";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    pageSource = reader.ReadToEnd();
    req = null;
}
It also seems to stall/ramp up memory on reader.ReadToEnd();
I would have thought having a cut-off of 20 seconds would help; is there a better method? I assume there's not much advantage to using the async web methods, as each URL download is on its own thread anyway...
Thanks
In general, it's recommended that you use asynchronous HttpWebRequests instead of creating your own threads. The article I've linked above also includes some benchmarking results.
I don't know what you're doing with the page source after you read the stream to the end, but using strings can be an issue:
The System.String type is used in any .NET application. We have strings as: names, addresses, descriptions, error messages, warnings or even application settings. Each application has to create, compare or format string data. Considering the immutability and the fact that any object can be converted to a string, all the available memory can be swallowed by a huge amount of unwanted string duplicates or unclaimed string objects.
Some other suggestions:
Do you have any firewall restrictions? I've seen a lot of issues at work where the firewall enables rate limiting and fetching pages grinds to a halt (happens to me all the time)!
I presume that you're going to use the string to parse HTML, so I would recommend that you initialize your parser with the Stream instead of passing in a string containing the page source (if that's an option).
If you're storing the page source in the database, then there isn't much you can do.
Try to eliminate the reading of the page source as a potential contributor to the memory/performance problem by commenting it out.
Use a streaming HTML parser such as Majestic 12 - it avoids the need to load the entire page source into memory (again, if you need to parse)!
Limit the size of the pages you're going to download, say, to 150 KB only. The average page size is about 100-130 KB.
Additionally, can you tell us what's your initial rate of fetching pages and what does it go down to? Are you seeing any errors/exceptions from the web request as you're fetching pages?
Update
In the comment section I noticed that you're creating thousands of threads, and I would say that you don't need to do that. Start with a small number of threads and keep increasing them until you see peak performance on your system. Once you start adding threads and the performance looks like it has tapered off, stop adding threads. I can't imagine that you will need more than 128 threads (even that seems high). Create a fixed number of threads, e.g. 64, let each thread take a URL from your queue, fetch the page, process it, and then go back to getting pages from the queue again.
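A rough sketch of that fixed-thread-count approach (the queue, lock and ProcessUrl below are hypothetical names, not from your code; Queue<T> plus a lock keeps it .NET 3.5 friendly):

using System.Collections.Generic;
using System.Threading;

static readonly Queue<string> UrlQueue = new Queue<string>();
static readonly object QueueLock = new object();

static void StartWorkers(int workerCount)
{
    for (int i = 0; i < workerCount; i++)
        new Thread(Worker).Start();
}

static void Worker()
{
    while (true)
    {
        string url;
        lock (QueueLock)
        {
            if (UrlQueue.Count == 0) return; // queue drained, this thread exits
            url = UrlQueue.Dequeue();
        }
        ProcessUrl(url); // fetch + parse + store; hypothetical per-page work
    }
}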
You could enumerate with a buffer instead of calling ReadToEnd, and if it is taking too long, then you could log and abandon - something like:
static void Main(string[] args)
{
    Uri largeUri = new Uri("http://www.rfkbau.de/index.php?option=com_easybook&Itemid=22&startpage=7096");
    DateTime start = DateTime.Now;
    int timeoutSeconds = 10;
    foreach (var s in ReadLargePage(largeUri))
    {
        if ((DateTime.Now - start).TotalSeconds > timeoutSeconds)
        {
            Console.WriteLine("Stopping - this is taking too long.");
            break;
        }
    }
}

static IEnumerable<string> ReadLargePage(Uri uri)
{
    int bufferSize = 8192;
    int readCount;
    Char[] readBuffer = new Char[bufferSize];
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader stream = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
    {
        readCount = stream.Read(readBuffer, 0, bufferSize);
        while (readCount > 0)
        {
            // Use readCount, not bufferSize: the last read is usually partial.
            yield return new string(readBuffer, 0, readCount);
            readCount = stream.Read(readBuffer, 0, bufferSize);
        }
    }
}
Lirik has a really good summary.
I would add that if I were implementing this, I would make a separate process that reads the pages. So, it would be a pipeline. First stage would download the URL and write it to a disk location. And then queue that file to the next stage. Next stage reads from the disk and does the parsing & DB updates. That way you will get max throughput on the download and parsing as well. You can also tune your threadpools so that you have more workers parsing, etc. This architecture also lends very well to distributed processing where you can have one machine downloading, and another host parsing/etc.
Another thing to note is that if you are hitting the same server from multiple threads (even if you are using async) you will run into the maximum outgoing connection limit. You can throttle yourself to stay below it, or increase the connection limit on the ServicePointManager class.
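For example (64 is an arbitrary value, not a recommendation):

using System.Net;

// Raise the per-host outgoing connection limit; the default for
// desktop apps is 2 connections per host.
ServicePointManager.DefaultConnectionLimit = 64;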
This is code I'm using to test a webserver on an embedded product that hasn't been behaving well when an HTTP request comes in fragmented across multiple TCP packets:
/* This is all within a loop that cycles size_chunk up to the size of the whole
 * test request, in order to test all possible fragment sizes. */
TcpClient client_sensor = new TcpClient(NAME_MODULE, 80);
client_sensor.Client.NoDelay = true; /* SHOULD force the TCP socket to send the packets in exactly the chunks we tell it to, rather than buffering the output. */
/* I have also tried just client_sensor.NoDelay = true, with no luck. */
client_sensor.Client.SendBufferSize = size_chunk; /* Added in a desperate attempt to fix the problem before posting my shameful ignorance on stackoverflow. */
for (int j = 0; j < TEST_HEADERS.Length; j += size_chunk)
{
    String request_fragment = TEST_HEADERS.Substring(j, (TEST_HEADERS.Length < j + size_chunk) ? (TEST_HEADERS.Length - j) : size_chunk);
    client_sensor.Client.Send(Encoding.ASCII.GetBytes(request_fragment));
    client_sensor.GetStream().Flush();
}
/* Test stuff goes here, check that the embedded web server responded correctly, etc. */
Looking at Wireshark, I see only one TCP packet go out, which contains the entire test header, not the approximately header length / chunk size packets I expect. I have used NoDelay to turn off the Nagle algorithm before, and it usually works just like I expect it to. The online documentation for NoDelay at http://msdn.microsoft.com/en-us/library/system.net.sockets.tcpclient.nodelay%28v=vs.90%29.aspx definitely states "Sends data immediately upon calling NetworkStream.Write" in its associated code sample, so I think I've been using it correctly all this time.
This happens whether or not I step through the code. Is the .NET runtime optimizing away my packet fragmentation?
I'm running x64 Windows 7, .NET Framework 3.5, Visual Studio 2010.
TcpClient.NoDelay does not mean that blocks of bytes will not be aggregated into a single packet. It means that blocks of bytes will not be delayed in order to aggregate into a single packet.
If you want to force a packet boundary, use Stream.Flush.
Grr. It was my antivirus getting in the way. A recent update caused it to start interfering with the sending of HTTP requests to port 80 by buffering all output until the final "\r\n\r\n" marker was seen, regardless of how the OS was trying to handle the outbound TCP traffic. I should have checked that first, but I've been using this same antivirus program for years and never had this problem before, so I didn't even think of it. Everything works just the way it used to when I disable the antivirus.
The MSDN docs show setting the TcpClient.NoDelay = true, not the TcpClient.Client.NoDelay property. Did you try that?
Your test code is just fine (I assume that you send valid HTTP). What you should check is why the TCP server is not behaving well when reading from the TCP connection. TCP is a stream protocol - that means you cannot make any assumptions about the size of data packets unless you explicitly specify those sizes in your data protocol. For instance, you can prefix all your data packets with a fixed-size (2-byte) prefix that contains the size of the data to be received.
When reading HTTP, the read is usually made in several phases: read the HTTP request line, read the HTTP headers, then read the HTTP content (if applicable). The first two parts do not have any size specifications, but they do have a special delimiter (CRLF).
Here is some info on how HTTP can be read and parsed.
I'm creating a client/server socket app, and I'm not able to solve this problem, probably due to lack of knowledge on the matter.
The client must send an answer in order to proceed communication:
while (comunicate)
{
    if (chkCorrectAnswer.Checked)
        answer = encoder.GetBytes('\x02' + "82SP|" + '\x03');
    else
        answer = encoder.GetBytes("bla");
    ServerStream.Write(answer, 0, answer.Length);
    //or ??
    //tcp.Client.Send(answer);
}
And the server recieves it:
while (comunicate)
{
    var validanswer = encoder.GetBytes("myvalidanswer");
    answer = new byte[validanswer.Length];
    stream.Read(answer, 0, validanswer.Length);
    //or ??
    //tcp.Client.Receive(answer);
    if (answer.SequenceEqual(validanswer))
    {
        // continue communication
    }
}
Each snippet is in a different app, in a looped "communication thread".
The answer seems to be sent correctly, but the server doesn't seem to be receiving it properly. Sometimes it receives blablab or lablabl and variations of 7 chars. I thought the receive would fill the buffer only with the incoming data, but somehow it is filling the buffer with repeated data.
Two questions here:
What should I use, stream.Write/Read or client.Send/Receive?
How do I ensure this answer verification works?
0x02 and 0x03 are called start of text (STX) and end of text (ETX) and are separators used to identify where your messages start and end. There is really no need to use both; it was a common practice when doing serial communication.
You need to continue building a message until ETX is received.
Something like this (the easiest solution, but not very efficient if you have lots of clients):
string buffer = "";
var readBuffer = new byte[1];
int readBytes = 0;
while ((readBytes = stream.Read(readBuffer, 0, 1)) == 1 && readBuffer[0] != 3)
{
    buffer += (char)readBuffer[0]; // cast to char; otherwise the numeric value is appended
}
You can of course read larger chunks, but then you need to check whether more than one message has arrived and process the buffer accordingly. That's how I would do it, though.
I thought the receive would fill the buffer only with the incoming data, but somehow it is filling the buffer with repeated data.
Well, you are repeatedly sending the data in a loop, so this is to be expected.
If you want to read only a certain number of bytes off the stream, you need to also send the size of the logical packet ahead of it, so that the receiving end can first read the size (say, as a fixed int value) and then the actual response.
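A minimal sketch of the sending side of that scheme (SendPacket is a made-up helper; it writes a 4-byte little-endian length prefix, then the payload):

using System;
using System.Net.Sockets;

static void SendPacket(Socket socket, byte[] payload)
{
    byte[] prefix = BitConverter.GetBytes(payload.Length); // 4-byte length, little-endian
    socket.Send(prefix);
    socket.Send(payload);
}

The receiving end then reads exactly 4 bytes, converts them back with BitConverter.ToInt32, and reads exactly that many payload bytes.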
When you do the read you'll get everything you wrote, including anything you've written previously.
Implement a length header or some kind of separator so you know what's what:
length + message
or
message + separator
Then parse it out when you do the read.
I am working on building a simple proxy which will log certain requests that pass through it. The proxy does not need to interfere with the traffic (at this point in the project), so I am trying to do as little parsing of the raw request/response as possible during the process (the request and response are pushed onto a queue to be logged outside of the proxy).
My sample works fine, except that I cannot reliably tell when the "response" is complete, so I have connections left open for longer than needed. The relevant code is below:
var request = getRequest(url);
byte[] buffer;
int bytesRead = 1;
var dataSent = false;
var timeoutTicks = DateTime.Now.AddMinutes(1).Ticks;

Console.WriteLine(" Sending data to address: {0}", url);
Console.WriteLine(" Waiting for response from host...");

using (var outboundStream = request.GetStream()) {
    while (request.Connected && (DateTime.Now.Ticks < timeoutTicks)) {
        while (outboundStream.DataAvailable) {
            dataSent = true;
            buffer = new byte[OUTPUT_BUFFER_SIZE];
            bytesRead = outboundStream.Read(buffer, 0, OUTPUT_BUFFER_SIZE);
            if (bytesRead > 0) { _clientSocket.Send(buffer, bytesRead, SocketFlags.None); }
            Console.WriteLine(" pushed {0} bytes to requesting host...", bytesRead);
        }
        if (request.Connected) { Thread.Sleep(0); }
    }
}

Console.WriteLine(" Finished with response from host...");
Console.WriteLine(" Disconnecting socket");
_clientSocket.Shutdown(SocketShutdown.Both);
My question is whether there is an easy way to tell that the response is complete without parsing the headers. Given that this response could be anything (encoded, encrypted, gzipped, etc.), I don't want to have to decode the actual response to get the length and determine whether I can disconnect my socket.
As David pointed out, connections should remain open for a period of time; you should not close a connection unless the client side does (or the keep-alive interval expires).
Changing to HTTP/1.0 will not work, since you are a server and it's the client that specifies HTTP/1.1 in the request. Sure, you can send an error message with HTTP/1.0 as the version and hope that the client changes to 1.0, but it seems inefficient.
An HTTP message looks like this:
REQUEST LINE
HEADERS
(empty line)
BODY
The only way to know when a response is done is to look at the Content-Length header. Simply search for "Content-Length:" in the request buffer and extract everything up to the line feed (but trim the found value before converting it to an int).
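A rough sketch of that search, assuming the headers have already been buffered into a string (GetContentLength is a made-up name):

using System;

static int GetContentLength(string headers)
{
    int start = headers.IndexOf("Content-Length:", StringComparison.OrdinalIgnoreCase);
    if (start == -1) return -1; // header absent
    start += "Content-Length:".Length;
    int end = headers.IndexOf('\n', start);
    if (end == -1) end = headers.Length;
    string value = headers.Substring(start, end - start);
    return int.Parse(value.Trim()); // trim before converting, as noted above
}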
The other alternative is to use the parser in my webserver to get all headers. It should be quite easy to use just the parser and nothing more from the library.
Update: There is a better parser here: HttpParser.cs
If you make an HTTP/1.0 request instead of 1.1, the server should close the connection as soon as it's through, since it doesn't need to keep the connection open for another request.
Other than that, you really need to parse the content length header in the response to get the best value.
Using blocking IO and multiple threads might be your answer. Specifically:
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
using (var reader = new StreamReader(stream))
{
    data = reader.ReadToEnd();
}
This is for textual data; binary handling is similar.