I am writing a crawler for a website.
Its response is gzip encoded.
I am not able to parse correctly a particular field, though the decompression is successful.
I am also using htmlagilitypack to parse it,
the parsed value of the field is only a part of the original value
as an example :
I am getting only /wEWAwKc04vTCQKb86mzBwKln/PuCg==
whereas the firebug shows the actual value as much longer:
/wEWBgKj7IuJCgKb86mzBwKln/PuCgLT250qAtC0+8cMAvimiNYD
what does the '==' at the end means?
I am assuming it that its a error on decompressors behalf?
The character = is added by the Base64 encoding.
Encoding the following sentence
Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.
you would get
TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz
IHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2Yg
dGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGlu
dWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRo
ZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4=
The = character can only be present at the end of the Base64 string. If you obtain it, it means you are probably getting all the characters; vice versa is not true, as that character is used as padding character, and it is not always mandatory in all the Base64 implementations.
You don't have a problem with decompression. The page has obviously been correctly decompressed. Otherwise your software would likely throw an error or you'd see just a bunch of strange characters.
However, what you get is an ASCII string that's obviously in Base 64 encoding. The equal signs at the end appear if the original binary data is not a multiple of 3 bytes. So that's all perfect Base 64 data.
As to why your crawler gets different data than Firefox with Firebug: I don't know but can image many reasons. These are two separate browsing sessions and the web site might just assign them different session IDs or somehow record some history of the session.
Anyhow, at the end of the day I don't understand your problem. What exactly are you unable to parse? Do you get some kind of error? What do you mean by field? Are you talking about a field of an HTML form?
Related
Somehow I'm getting a weird result from a GetString(). So, in my project I got this code:
byte[] arrayBytes = System.Convert.FromBase64String(n["spo_fdat"].InnerText);
string str = System.Text.Encoding.UTF8.GetString(arrayBytes);
The InnerText Value and the code is in: https://dotnetfiddle.net/mMUlti
So, my problem is that somehow I'm getting this result on my Visual Studio:
While in the online compiler that I post above the output is as expected.
This output is an output for a printer and this \0 are destroying the format.
Anyone have a clue of what is going on and what should I do/try?
It looks like for some reason every other byte in your input is null. If you strip those out you get something that looks much more plausible as printer commands (though I am no expert). Hopefully you can verify things...
To do this all I did was added this line in:
arrayBytes = arrayBytes.Where((x,i)=>i%2==0).ToArray();
The where command takes the value (x), and index (i) and if the index mode 2 is 0 (ie its even) then the where clause allows it - if its odd it throws it away.
The output I get from this starts:
CT~~CD,~CC^~CT~
^XA~TA000~JSN^LT0^MNW^MTT^PON^PMN^LH0,0^JMA^PR2,2~SD15^JUS^LRN^CI0^XZ
^XA
^MMT
^PW607
^LL0406
There are some non-printing character in there too that look like possible printing commands (eg 16 is the first character that is "data link escape" character.
Edited afterthought:
The problem you have here is obviously a problem with the specification. It seems to be that your input is wrong. You need to talk to whoever generated it find out the specification they are using to generate it, make sure their ode matches that spec and then right your code to accept that spec. With a solid specification you should both be writing compatible code.
Try inspecting the bytes instead. You'll see that what you have encoded in the base-64 string is much closer to what Visual Studio shows to you in comparison to the output from dotnetfiddle. Consoles usually don't escape non-printables (such as \0 - the null character) whereas Visual Studio string inspector does so in attempt to provide as much value to its user as possible.
Looking at your base-64 encoded data, it looks way more like UTF-16 than UTF-8. If you decode it like so, you'll perhaps get rid of the null characters in Visual Studio inspector as well.
Regardless of that, the base-64 data don't make much sense. More semantical context is required to figure out what the issue is.
According to inspection by Chris, it looks like the data is UTF-8 encoded in UTF-16.
You should be able to get proper results with the following:
var xml = //your base-64 input...
var arrayBytes = Convert.FromBase64String(xml);
var utf16 = Encoding.Unicode.GetString(arrayBytes);
var utf8Bytes = utf16.Select(c => (byte)c).ToArray();
var utf8 = Encoding.UTF8.GetString(utf8Bytes);
Console.WriteLine(utf8);
The opposite is probably how your input was created. However, you could also go for Chris' solution of ignoring every odd byte as it is basically the same with less weird encoding things going on (although this may be more explicit to what really goes on: UTF-8 inside UTF-16).
I'm generating an encoded value to get passed within my URL, the issue is, our SEO manager configure the application, to pass lowercase URL, and he says he won't change the configuration. now i have to somehow encode my url, that uppercase, or whole string get encoded by their character code, so i can pass it without ruin the main value,
for example, my resulting base64 string is as following:
aHR0cDovL2xvY2FsaG9zdDoxMzUwL2hvdGVscy9nMy8xMzk1LTA1LTEwLzEvOTI3MjIyZmY
but it turn to be like this, when is passed to controller:
ahr0cdovl2xvy2fsag9zddoxmzuwl2hvdgvscy9nmy8xmzk1lta1ltewlzevoti3mjiyzmy
which can't be read... the case cause issue while decode.
You cannot encode it using base64 if it will be transformed to lowercase out of your control, base64 relies upon using uppercase characters.
If the configuration your manager is insisting on is that incoming or outgoing query string parameters be incorrectly lower cased, however, you should inform him that he is in violation of the URI specification, specifically the query string section. Of course it is ultimately up to your own internal company choices whether you want only lower case in your internal URIs, but you should not assume that other applications handling URIs will operate like this.
As #sachin stated above, if you can make this a POST request (instead of a GET like I assume it is now), and provided that your manager is not lower casing those upon sending them as well :/ You can send this data via POST.
Alternatively, you could use Base32 instead to get around this, it does rely on uppercase characters only, but you can simply transform the recieved value to upper case upon recieveing it prior to decoding the (now Base32) string. This is a pretty ridiculous solution though...
Just to be clear: "lol" would encode in Base32 to "NRXWY===" which would then be lower cased to "nrxwy===" which you could then uppercase back to "NRXWY===" prior to decoding.
These are two NuGet packages that do Base32 encoding:
Base32 as per RFC4648 here and the author claims it's tested and working correctly.
Another package, which looks appealing because it supports zBase32 here, the advantage with zBase32 is that it already uses lowercase characters only, so you won't have to worry about changing the case. The porter/author has included instructions on how to get zBase32 encoding
Both of the these (Base32 and zBase32) use a subset of Base64 characters, so they'll both work fine with URIs, all of the charcaters used are valid in URIs (the utf-8 content is irrelevant since you're just encoding bytes, so you'll get the same bytes back when you decode from Base32)
I received a crash report from an application which was trying to read XML from a file it had previously written. After requesting the user send me the file, I compared it with what should have been written and found a really odd problem I haven't come across before.
Some (but not all) of the i characters had been replaced with ı - a dotless i. For example, a node named "title" was fine, but a node named "initialdirectory" had the first i replaced, the second was left alone, i.e. ınitialdirectory.
Until today I wasn't even aware there was such a character, but now I do and I just don't know how it was written like that - the XML was written using an XmlWriter with UTF8 encoding. Just a normal everyday write, nothing complicated.
I normally (well, since getting Resharper and it yells at me for skipping the parameter) use StringComparison.OrdinalIgnoreCase when doing IndexOf etc, but I'm at a loss on how I'm supposed to do this when writing data, unless I'm supposed to start changing thread cultures.
Has anyone experienced a similar issue before, and if so, what's the best way to deal with it?
In Turkish there are two i's: one with a dot, i, and one without a dot, ı. In upper case the first one has a dot, İ, and the second one hasn't, I.
At some point your program is converting InitialDirectory to lower case according to the default locale, which is known to be Turkish in some parts of the world. To fix the problem you can convert cases using a fixed, known locale, such as American English.
Update: Even better, use the ToLowerInvariant() method which converts a string to lower case in the "invariant culture".
I have a large url that I am encoding using System.Web.HttpUtility.UrlEncode. When I encode it its not encoding it like the working example I have. I am not sure what the problem is, maybe different character type or something, I put an example of what suppose to be created and what actually being created. thanks for any help, i am lost on this one.
Working exmaple (look how this one has Did%252Citag%252 and the other doesnt)
22%7Chttp%3A%2F%2Fv17.nonxt1.googlevideo.com%2Fvideoplayback%3Fid%3D0b608733ae5257c3%26itag%3D22%26source%3Dpicasa%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1333533157%26sparams%3Did%252Citag%252Csource%252Cip%252Cipbits%252Cexpire%26signature%3D8AD67D74F34FBAFBBA87616C0AED4A336DF0982A.129E2B5E648F8A2F35A34F312AC5C3C957A1C40A%26key%3Dlh1%2C35%7Chttp%3A%2F%2Fv18.nonxt3.googlevideo.com%2Fvideoplayback%3Fid%3D0b608733ae5257c3%26itag%3D35%26source%3Dpicasa%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1333533157%26sparams%3Did%252Citag%252Csource%252Cip%252Cipbits%252Cexpire%26signature%3D7A58A11994C710872E945D0EAA6E43B6BFB8A648.B9C1D9FB377E1A49EBF3DC6C166C0B6E3E94EC24%26key%3Dlh1%2C34%7Chttp%3A%2F%2Fv6.nonxt1.googlevideo.com%2Fvideoplayback%3Fid%3D0b608733ae5257c3%26itag%3D34%26source%3Dpicasa%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1333533157%26sparams%3Did%252Citag%252Csource%252Cip%252Cipbits%252Cexpire%26signature%3D260B10850A3448C849B8B8F1F2AF5E31244E71BC.6D7420FD66B85D40982BFB2C847EDB46021C63AE%26key%3Dlh1%2C5%7Chttp%3A%2F%2Fv23.nonxt7.googlevideo.com%2Fvideoplayback%3Fid%3D0b608733ae5257c3%26itag%3D5%26source%3Dpicasa%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1333533157%26sparams%3Did%252Citag%252Csource%252Cip%252Cipbits%252Cexpire%26signature%3D9894DCDA7D2634EE0006CE0F6E0E29ABF7A8F253.18765D7CD7BDE80ED1A47DC8EC559C3E05C92F56%26key%3Dlh1
Here is an example of the one I am creating (see this one encodes as did%2citag%2)
5%7chttp%3a%2f%2fv23.nonxt7.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d5%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3dC0E2993011931D9F5FCAFAF54E821415F6042DDD.477CD23B021563A6DE30E858E35C21046E0B0BA6%26key%3dlh1%2c18%7chttp%3a%2f%2fv11.nonxt4.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d18%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3d696501A8ACBA0E1246173B040E0FB81DA8EBCDC7.944BA6C08C630EFFC2456D66BAD12376D7E377B2%26key%3dlh1%2c34%7chttp%3a%2f%2fv6.nonxt1.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d34%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3dDDD3D9081F7F2FF462D17CFAE6CAB72AEB86DEA9.3275E0EE8921EF728132035FC94BEF5926A0B7C1%26key%3dlh1%2c35%7chttp%3a%2f%2fv18.nonxt3.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d35%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3d7826E7470450F9F473BC7A845967EF3AC655CFB.3850F952F5D68151D325CD754C581CD66B0BC4D7%26key%3dlh1%2c22%7chttp%3a%2f%2fv17.nonxt1.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d22%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3d32FAAE6AE74B22BFB3DBD4300CEEDBC1A12A9ED4.8014678ABB1AEE93FB4B1C36E2C74C89102DC112%26key%3dlh1
Looks like in the first example the URL is double encoded. Meaning if you look at decoded sparams parameter it is represented as
sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire
In your second example
sparams=id,itag,source,ip,ipbits,expire
So, what is happening in the first example is that, they are doing a UrlEncode on the value first. Using this value Construct the URL and then do UrlEncode on the constructed URL.
UPDATE : This is a general practice to be followed if the value of your querystring contains values which needs to be UrlEncoded (eg. , & space ? etc)
According to w3c standards, your example is fine. There is no %252 symbol.
I'm not sure exactly what you are expecting, but when you fire these strings into a URL Decoder, this is what you get:
String 1
22|http://v17.nonxt1.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=22&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333533157&sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire&signature=8AD67D74F34FBAFBBA87616C0AED4A336DF0982A.129E2B5E648F8A2F35A34F312AC5C3C957A1C40A&key=lh1,35|http://v18.nonxt3.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=35&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333533157&sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire&signature=7A58A11994C710872E945D0EAA6E43B6BFB8A648.B9C1D9FB377E1A49EBF3DC6C166C0B6E3E94EC24&key=lh1,34|http://v6.nonxt1.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=34&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333533157&sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire&signature=260B10850A3448C849B8B8F1F2AF5E31244E71BC.6D7420FD66B85D40982BFB2C847EDB46021C63AE&key=lh1,5|http://v23.nonxt7.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=5&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333533157&sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire&signature=9894DCDA7D2634EE0006CE0F6E0E29ABF7A8F253.18765D7CD7BDE80ED1A47DC8EC559C3E05C92F56&key=lh1
String 2
5|http://v23.nonxt7.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=5&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=C0E2993011931D9F5FCAFAF54E821415F6042DDD.477CD23B021563A6DE30E858E35C21046E0B0BA6&key=lh1,18|http://v11.nonxt4.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=18&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=696501A8ACBA0E1246173B040E0FB81DA8EBCDC7.944BA6C08C630EFFC2456D66BAD12376D7E377B2&key=lh1,34|http://v6.nonxt1.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=34&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=DDD3D9081F7F2FF462D17CFAE6CAB72AEB86DEA9.3275E0EE8921EF728132035FC94BEF5926A0B7C1&key=lh1,35|http://v18.nonxt3.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=35&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=7826E7470450F9F473BC7A845967EF3AC655CFB.3850F952F5D68151D325CD754C581CD66B0BC4D7&key=lh1,22|http://v17.nonxt1.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=22&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=32FAAE6AE74B22BFB3DBD4300CEEDBC1A12A9ED4.8014678ABB1AEE93FB4B1C36E2C74C89102DC112&key=lh1
You URLS are quite different, and they also have a leading chars that I'm not sure you are wanting.
I am using silverlight / ASP .NET and C#. What if I want to do this from silverlight for instance,
// I have left out the quotes to show you literally what the characters
// are that I want to use
string password = vtakyoj#"5
string encodedPassword = HttpUtility.UrlEncode(encryptedPassword, Encoding.UTF8);
// encoded password now = vtakyoj%23%225
URI uri = new URI("http://www.url.com/page.aspx#password=vtakyoj%23%225");
HttpPage.Window.Navigate(uri);
If I debug and look at the value of uri it shows up as this (we are still inside the silverlight app),
http://www.url.com?password=vtakyoj%23"5
So the %22 has become a quote for some reason.
If I then debug inside the page.aspx code (which of course is ASP .NET) the value of Request["password"] is actually this,
vtakyoj#"5
Which is the original value. How does that work? I would have thought that I would have to go,
HttpUtility.UrlDecode(Request["password"], Encoding.UTF8)
To get the original value.
Hope this makes sense?
Thanks.
First lets start with the UTF8 business. Esentially in this case there isn't any. When a string contains characters with in the standard ASCII character range (as your password does) a UTF8 encoding of that string is identical to a single byte ASCII string.
You start with this:-
vtakyoj#"5
The HttpUtility.UrlEncode somewhat aggressively encodes it to:-
vtakyoj%23%225
Its encoded the # and " however only # has special meaning in a URL. Hence when you view string value of the Uri object in Silverlight you see:-
vtakyoj%23"5
Edit (answering supplementary questions)
How does it know to decode it?
All data in a url must be properly encoded thats part of its being valid Url. Hence the webserver can rightly assume that all data in the query string has been appropriately encoded.
What if I had a real string which had %23 in it?
The correct encoding for "%23" would be "%3723" where %37 is %
Is that a documented feature of Request["Password"] that it decodes it?
Well I dunno, you'd have check the documentation I guess. BTW use Request.QueryString["Password"] the presence of this same indexer directly on Request was for the convenience of porting classic ASP to .NET. It doesn't make any real difference but its better for clarity since its easier to make the distinction between QueryString values and Form values.
if I don't use UFT8 the characters are being filtered out.
Aare you sure that non-ASCII characters may be present in the password? Can you provide an example you current example does not need encoding with UTF-8?
If Request["password"] is to work, you need "http://url.com?password=" + HttpUtility.UrlEncode("abc%$^##"). I.e. you need ? to separate the hostname.
Also the # syntax is username:password#hostname, but it has been disabled in IE7 and above IIRC.