For some reason iTextSharp is now reading a PDF which contains numbers such as 4123 as 4*23, where the * is actually an arrow pointing up. Not sure why this is happening. Please help.
Thanks.
Sample file is located here: https://dl.dropboxusercontent.com/u/116833/SAMPLE%20PDF.pdf
The reason for the arrows is that the file deliberately tries to mislead text extractors which follow the guidelines of section 9.10.2 "Mapping Character Codes to Unicode Values" of the PDF specification ISO 32000-1, while not confusing those which prefer ActualText marked-content sequence entries: the former method is led to believe the '3's are arrows, while the latter is told the '3's are threes.
Most likely this is done to prevent automated text extraction while still allowing manual copy & paste, because Adobe Reader prefers the ActualText marked-content sequence entries (so manual extraction works all right) while many programmatic extractors prefer the former method.
As far as I read the relevant sections of the specification, it prefers neither way over the other.
Details
E.g. look at the first part number:
BT
/T1_1 1 Tf
10 0 0 10 69.1456 750.2834 Tm
(1 )Tj
ET
EMC
/Span <</MCID 14 >>BDC
BT
/T1_1 1 Tf
10 0 0 10 89.5488 750.2834 Tm
(2)Tj
/Span<</ActualText<FEFF0033>>> BDC
(3)Tj
EMC
(412109 )Tj
ET
EMC
As you see the '3' is marked with an ActualText entry indicating that it is a three indeed (<FEFF0033> is a long way to indicate the Unicode digit three).
The font T1_1, on the other hand, offers a ToUnicode stream containing the mapping
...
<30> <0030>
<31> <0031>
<32> <0032>
<33> <0018>
<34> <0034>
<35> <0035>
...
As you see, while the other digits (0x30 is '0', 0x31 is '1', ..., 0x39 is '9') are mapped identically, the '3', i.e. 0x33, is mapped to the Unicode code point 0x0018, and
U+0018 is the Unicode hex value of the character <control>, which is categorized as "control character" in the Unicode 6.0 character table.
"<control>" was previously named "CANCEL" in older versions of Unicode.
(cf. http://www.marathon-studios.com/unicode/U0018/Control)
In some context this control character is displayed as an upwards arrow.
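If you only need a quick workaround for files that play exactly this trick (rather than a general, ActualText-aware extractor), one option is to post-process the extracted text and map the bogus U+0018 back to '3'. Below is a minimal sketch using the standard iTextSharp 5 text extraction API; the file name is a placeholder and this is not the proper general fix.

// Rough workaround sketch (iTextSharp 5.x): extract the page text and undo the
// misleading ToUnicode mapping by replacing U+0018 with the digit '3'.
// This only helps for files that use exactly this trick.
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Program
{
    static void Main()
    {
        var reader = new PdfReader("SAMPLE PDF.pdf");   // placeholder path
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                text = text.Replace('\u0018', '3');     // U+0018 is what the ToUnicode CMap maps '3' to
                Console.WriteLine(text);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}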
I can easily convert a character string into a Huffman tree and then encode it into a binary sequence.
How should I save these so that I can actually compress the original data and later recover it?
I searched the web but could only find guides and answers covering what I have already done. How can I take the Huffman algorithm further to actually achieve lossless compression?
I am using C# for this project.
EDIT: This is what I've achieved so far; it might need rethinking.
I am attempting to compress a text file. I use the Huffman algorithm, but there are some key points I couldn't figure out:
"aaaabbbccdef" when compressed gives this encoding
Key = a, Value = 11
Key = b, Value = 01
Key = c, Value = 101
Key = d, Value = 000
Key = e, Value = 001
Key = f, Value = 100
11111111010101101101000001100 is the encoded version. It normally needs 12*8 bits, but we've compressed it down to 29 bits. This example might be a little unnecessary for a file this small, but let me explain what I tried to do.
We have 29 bits here but we need 8*n bits, so I fill the encodedString with zeros until its length becomes a multiple of eight. Since I can add 1 to 7 zeros, one byte is more than enough to record how many were added. In this case I've added 3 zeros:
11111111010101101101000001100000
Then I prepend, as a single binary byte, the number of extra bits I've added, and split the whole thing into 8-bit pieces:
00000011-11111111-01010110-11010000-01100000
Turn these into ASCII characters
ÿVÐ`
Now if I have the encoding table, I can look at the first 8 bits, convert that to an integer ignoreBits, and by ignoring the last ignoreBits bits turn it back into the original form.
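For reference, here is a minimal sketch of that packing/unpacking step (the method names are illustrative, not from my actual code):

// Pack the '0'/'1' bit string into bytes; the first output byte records how many
// padding zeros were appended (3 in the example above).
static byte[] PackBits(string bits)
{
    int padding = (8 - bits.Length % 8) % 8;
    bits = bits.PadRight(bits.Length + padding, '0');

    var result = new byte[1 + bits.Length / 8];
    result[0] = (byte)padding;
    for (int i = 0; i < bits.Length; i += 8)
        result[1 + i / 8] = Convert.ToByte(bits.Substring(i, 8), 2);
    return result;
}

// Reverse the process: read the padding count, rebuild the bit string, drop the padding.
static string UnpackBits(byte[] data)
{
    int padding = data[0];
    var sb = new System.Text.StringBuilder();
    for (int i = 1; i < data.Length; i++)
        sb.Append(Convert.ToString(data[i], 2).PadLeft(8, '0'));
    sb.Length -= padding;
    return sb.ToString();
}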
The problem is that I also want to include an uncompressed version of the encoding table in this file, to have a fully functional ZIP/UNZIP program, but I am having trouble deciding where my ignoreBits ends, where my encodingTable starts/ends, and where the encoded bits start/end.
I thought about using a null character as a separator, but there is no assurance that the Values cannot produce a null character: "ddd" in this situation produces 00000000-0...
Your representation of the code needs to be self-terminating. Then you know that the next bit is the start of the Huffman codes. One way is to traverse the tree that resulted from the Huffman algorithm, writing a 0 bit for each branch, or a 1 bit followed by the symbol for each leaf. When the traversal is done, you know the next bit must be the start of the codes.
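A minimal sketch of that layout, written as a bit string to match the approach above (the Node type here is an assumption for illustration, not code from the question):

class Node
{
    public Node Left, Right;   // both null for a leaf
    public char Symbol;
}

// Pre-order: '0' for a branch, '1' followed by 8 bits of the symbol for a leaf.
static void WriteTree(Node node, System.Text.StringBuilder bits)
{
    if (node.Left == null && node.Right == null)
    {
        bits.Append('1');
        bits.Append(Convert.ToString((int)node.Symbol, 2).PadLeft(8, '0'));
    }
    else
    {
        bits.Append('0');
        WriteTree(node.Left, bits);
        WriteTree(node.Right, bits);
    }
}

// Reading consumes exactly the bits the tree occupies, so afterwards pos points at the
// first bit of the Huffman-coded data - no separator is needed.
static Node ReadTree(string bits, ref int pos)
{
    if (bits[pos++] == '1')
    {
        char symbol = (char)Convert.ToByte(bits.Substring(pos, 8), 2);
        pos += 8;
        return new Node { Symbol = symbol };
    }
    var node = new Node();
    node.Left = ReadTree(bits, ref pos);
    node.Right = ReadTree(bits, ref pos);
    return node;
}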
You also need to make your data self-terminating. Note that in the example you give, the added three zero bits will be decoded as another 'd', so you will incorrectly get 'aaaabbbccdefd' as the result. You need to either precede the encoded data with a count of the symbols expected, or add a symbol to your encoded set, with frequency 1, that marks the end of the data.
I need to better understand user text input into Windows / ASP.NET applications, and what the appropriate field sizes should be for the data stored in SQL Server.
If everything were ASCII, it seems like this would be simple (1 byte for each char), but I guess I really don't understand what is going on when a user puts text into an input field. If the input is in Unicode then there are (generally) 2 bytes per character (?), and if I know a text input cannot be any longer than 5 characters, should the SQL column be varchar(10)? How do I know whether an input should be in ANSI or Unicode?
Hopefully this makes sense. This is something that I have never fully understood in terms of how a web page or a win app determines how the data is encoded.
When you create a column, you specify the number of characters you need to store, regardless of whether it is Unicode or not. Need up to 5 characters? Then it's either VARCHAR(5) or NVARCHAR(5), depending on whether you actually need Unicode or not - that's a business discussion, not a technical one. The 2 bytes has nothing to do with the column definition - that's about storage size. So a VARCHAR(5) will take 5 bytes if fully populated, and an NVARCHAR(5) will take 10 bytes if fully populated. You don't have to worry about those implementation details when defining the column; however you should be sure that Unicode is required before making the choice, because doubling the space requirement for no reason is wasteful.
(Ignoring arguments about whether such a column should be CHAR/NCHAR, null byte overhead, etc.)
SQL columns are sized by character count, not byte size. A column of varchar(10) will accept 10 characters. If you are going to be taking Unicode input it is best to use nvarchar(10); this will still take 10 characters, but it will allow all Unicode input to that column. The same goes for ntext/text and nchar/char. A good MSDN page for understanding SQL data types can be found at http://technet.microsoft.com/en-us/library/ms187752.aspx. As for what goes into your text boxes on the ASP.NET site, anything can be entered into a text box; it is up to you, via code, to enforce the rules of what you want entered.
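As an illustration of the application side (a sketch only: the table, column, text box, and connection string are made up for the example, and it assumes System.Data / System.Data.SqlClient), you can enforce the length in code and pass the value as a typed parameter sized to match an NVARCHAR(10) column:

// Hypothetical names: Customers/Name, nameTextBox, connectionString.
string input = nameTextBox.Text;
if (input.Length > 10)
    throw new ArgumentException("Name must be at most 10 characters.");

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("INSERT INTO Customers (Name) VALUES (@name)", conn))
{
    // Type and size match the NVARCHAR(10) column definition.
    cmd.Parameters.Add("@name", SqlDbType.NVarChar, 10).Value = input;
    conn.Open();
    cmd.ExecuteNonQuery();
}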
OK, I'll try to explain the problem, although it's going to be a bit hard.
I'm trying to parse some information from a certain page containing coordinates,
and copy-pasting gives you something like this:
Distance Position
5.8 (77|-2)
6.3 (76|-1)
7.8 (76|6)
9.2 (91|3)
9.5 (79|10)
12.2 (80|13)
15 (82|-14)
15 (81|16)
Now the problem I have is that between the "(" and the number there is an unidentified char: if you press the right arrow key the cursor won't move, but if you press it a few times then it does move.
I haven't encountered this thing anywhere else, and the website is in PHP, if that helps.
Also, if that helps, when I copy-paste the information in here the char disappears and I can move freely through the text.
Please help me with this problem: it's causing my software to malfunction, because I'm trying to parse the coordinates into an int and that char won't let me; it gives me a format exception.
While viewing in UTF-8 I see nothing unusual, but after changing the encoding to ANSI I am left with:
5.8 ‎â€(â€â€77‬‬|â€-â€2‬‬)‬‎
6.3 ‎â€(â€â€76‬‬|â€-â€1‬‬)‬‎
7.8 ‎â€(â€â€76‬‬|â€â€6‬‬)‬‎
9.2 ‎â€(â€â€91‬‬|â€â€3‬‬)‬‎
9.5 ‎â€(â€â€79‬‬|â€â€10‬‬)‬‎
12.2 ‎â€(â€â€80‬‬|â€â€13‬‬)‬‎
15 ‎â€(â€â€82‬‬|â€-â€14‬‬)‬‎
15 ‎â€(â€â€81‬‬|â€â€16‬‬)‬‎
You seem to have used the left-to-right mark (U+200E), and the encoding was swapped once or twice.
Since it comes from a website, my first guess would be that your browser settings are not correct (the wrong encoding is set). You can still try cleaning it, though.
Code:
// Requires: using System.Text.RegularExpressions;
// Note the verbatim string (@"...") - escapes like \[ are not valid in a normal C# string literal.
Regex rgx = new Regex(@"[^a-zA-Z0-9_\n %\[\]\.\(\)%&-]");
data = rgx.Replace(data, "");
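Alternatively, a sketch of a more targeted cleanup: instead of whitelisting characters, strip Unicode "format" characters (category Cf), which includes the left-to-right mark U+200E and the bidi embedding/pop controls U+202A..U+202C, and then parse as usual.

// Requires: using System.Globalization; using System.Linq;
static string StripFormatChars(string s)
{
    return new string(s.Where(c =>
        char.GetUnicodeCategory(c) != UnicodeCategory.Format).ToArray());
}

// e.g. int x = int.Parse(StripFormatChars(rawNumberText));   // rawNumberText is illustrative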
I've been reading about this topic and didn't get the specific info for my question:
(Maybe the following is incorrect - but please do correct me.)
Every file (text/binary) stores BYTES.
A byte is 8 bits, hence the max value is 2^8 - 1 = 255.
Those codes divide into 2 groups:
0..127 : textual chars
128..255 : special chars.
So a binary file contains char codes from the whole range: 0..255 (ASCII chars + special chars).
1) Correct?
2) Now, let's say I'm saving one INT in a binary file (4 bytes on a 32-bit system).
How does the file tell the program reading it that it's not 4 single unrelated bytes but an int which is 4 bytes?
Underneath, all files are stored as bytes, so in a sense what you're saying is correct. However, if you open a file that's intended to be read as binary and try to read it in a text editor, it will look like gibberish.
How does a program know whether to read a file as text or as binary (i.e. as special sets of ASCII or other encoded bytes, or just as the underlying bytes with a different representation)?
Well, it doesn't know - it just does what it's told.
In Windows, you open .txt files in notepad - notepad expects to be reading text. Try opening a binary file in notepad. It will open, you will see stuff, but it will be rubbish.
If you're writing your own program you can write using BinaryWriter and read using BinaryReader if you want to store everything as binary. What would happen if you wrote using BinaryWriter and read using StringReader?
To answer your specific example:
using (var test = new BinaryWriter(new FileStream(@"c:\test.bin", FileMode.Create)))
{
    test.Write(10);
    test.Write("hello world");
}

using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
    var out1 = test.ReadInt32();
    var out2 = test.ReadString();
    Console.WriteLine("{0} {1}", out1, out2);
}
See how you have to read in the same order that's written? The file doesn't tell you anything.
Now switch the second part around:
using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
    var out1 = test.ReadString();
    var out2 = test.ReadInt32();
    Console.WriteLine("{0} {1}", out1, out2);
}
You'll get gibberish out (if it works at all). Yet there is nothing you can read in the file that will tell you that beforehand. There is no special information there. The program must know what to do based on some out of band information (a specification of some sort).
So a binary file contains char codes from the whole range: 0..255 (ASCII chars + special chars).
No, a binary file just contains bytes - values between 0 and 255. They should only be considered as characters at all if you decide to ascribe that meaning to them. If it's a binary file (e.g. a JPEG) then you shouldn't do that - a byte 65 in image data isn't logically an 'A' - it's whatever byte 65 means at that point in the file.
(Note that even text files aren't divided into "ASCII characters" and "special characters" - it depends on the encoding. In UTF-16, each code unit takes two bytes regardless of its value. In UTF-8 the number of bytes depends on the character you're trying to represent.)
How does the file tell the program reading it that it's not 4 single unrelated bytes but an int which is 4 bytes?
The file doesn't tell the program. The program has to know how to read the file. If you ask Notepad to open a JPEG file, it won't show you an image - it will show you gibberish. Likewise if you try to force an image viewer to open a text file as if it were a JPEG, it will complain that it's broken.
Programs reading data need to understand the structure of the data they're going to read - they have to know what to expect. In some cases the format is quite flexible, like XML: there are well-specified layers, but then the program reads the values with higher-level meaning - elements, attributes etc. In other cases, the format is absolutely precise: first you'll start with a 4 byte integer, then two 2-byte integers or whatever. It depends on the format.
EDIT: To answer your specific (repeated) comment:
I'm in a cmd shell... you've written your binary file. I have no clue what you did there. How am I supposed to know whether to read 4 single bytes or 4 bytes at once?
Either the program reading the data needs to know the meaning of the data or it doesn't. If it's just copying the file from one place to another, it doesn't need to know the meaning of the data. It doesn't matter whether it copies it one byte at a time or all four bytes at once.
If it does need to know the meaning of the data, then just knowing that it's a four byte integer doesn't really help much - it would need to know what that integer meant to do anything useful with it. So your file written from the command shell... what does it mean? If I don't know what it means, what does it matter whether I know to read one byte at a time or four bytes as an integer?
(As I mentioned above, there's an intermediate option where code can understand structure without meaning, and expose that structure to other code which then imposes meaning - XML is a classic example of that.)
It's all a matter of interpretation. Neither the file nor the system know what's going on in your file, they just see your storage as a sequence of bytes that has absolutely no meaning in itself. The same thing happens in your brain when you read a word (you attempt to choose a language to interpret it in, to give the sequence of characters a meaning).
It is the responsibility of your program to interpret the data the way you want it, as there is no single valid interpretation. For example, the sequence of bytes 48 65 6C 6C 6F 20 53 6F 6F 68 6A 75 6E can be interpreted as:
A string (Hello Soohjun)
A sequence of 13 one-byte characters (H, e, l, l, o, space, S, o, o, h, j, u, n)
A sequence of 3 unsigned ints followed by a character (1214606444, 1864389487, 1869113973, 110)
A character followed by a float followed by an unsigned int followed by a float (72, 6.977992E22, 542338927, 4.4287998E24), and so on...
You are the one choosing the meaning of those bytes; another program would make a different interpretation of the very same data, much the same as a combination of letters has a different interpretation in, say, English and French.
PS: By the way, that's the goal of reverse engineering file formats: find the meaning of each byte.
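To make the point concrete, here is a small C# sketch reading the same bytes a few different ways. Note that BitConverter uses the machine's byte order (little-endian on most PCs), so its integer differs from the big-endian readings in the list above.

// The same 13 bytes, interpreted three different ways.
byte[] data = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x53, 0x6F, 0x6F, 0x68, 0x6A, 0x75, 0x6E };

string asText = System.Text.Encoding.ASCII.GetString(data);          // "Hello Soohjun"
uint firstLittleEndian = BitConverter.ToUInt32(data, 0);              // 1819043144 (machine byte order)
uint firstBigEndian = (uint)((data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]); // 1214606444, as listed above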
Is there any way to compress small strings (86 chars) into something smaller?
#a#1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs
The only way I see is to replace recurring characters with a unique character.
But I can't find anything about that on Google.
Thanks for any reply.
http://en.wikipedia.org/wiki/Huffman_coding
Huffman coding would probably be a pretty good start. In general the idea is to replace individual characters with the smallest bit patterns needed to replicate the original string or dataset.
You'll want to run statistical analysis on a variety of 'small strings' to find the most common characters, so that the more common characters will be represented with the smallest unique bit patterns. And possibly make up an 'example' small string with every character that will need to be represented (like a-z0-9#.0-).
I took your example string of 85 bytes (not 83 since it was copied verbatim from the post, perhaps with some intended escapes not processed). I compressed it using raw deflate, i.e. no zlib or gzip headers and trailers, and it compressed to 69 bytes. This was done mostly by Huffman coding, though also with four three-byte backward string references.
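For reference, here is a rough way to try this from C#: a sketch using .NET's DeflateStream, which writes raw deflate with no zlib/gzip header. The exact compressed size depends on the deflate implementation and settings, so it may differ from the 69 bytes quoted above.

// Requires: using System.IO; using System.IO.Compression;
static byte[] DeflateString(string s)
{
    byte[] input = System.Text.Encoding.UTF8.GetBytes(s);
    using (var output = new MemoryStream())
    {
        using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            deflate.Write(input, 0, input.Length);
        return output.ToArray();   // raw deflate bytes, no header/trailer
    }
}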
The best way to compress this sort of thing is to use everything you know about the data. There appears to be some structure to it and there are numbers coded in it. You could develop a representation of the expected data that is shorter. You can encode it as a stream of bits, and the first bit could indicate that what follows is straight bytes in the case that the data you got was not what was expected.
Another approach would be to take advantage of previous messages. If this message is one of a stream of messages, and they all look similar to each other, then you can make a dictionary of previous messages to use as a basis for compression, which can be reconstructed at the other end from the previous messages received. That may offer dramatically improved compression if the messages really are similar.
You should look up RUN-LENGTH ENCODING. Here is a demonstration
rrrrrunnnnnn BECOMES 5r1u6n. What? Truncate repetitions: for x consecutive occurrences of r, write xr.
Now what if some of the characters are digits? Then instead of using x, use the character whose ASCII value is x. For example,
if you have 43 consecutive P's, write +P because '+' has ASCII code 43. If you have 49 consecutive y's, write 1y because '1' has ASCII code 49.
Now the catch is a string with few or no repetitions: in that case your encoding may be longer than the original. But that's true for all compression algorithms.
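A quick sketch of that scheme in C# (illustrative only; very long runs would need splitting if the counts are later written as single bytes):

// Each run becomes the character whose code is the run length, followed by the repeated character.
static string RunLengthEncode(string input)
{
    var sb = new System.Text.StringBuilder();
    int i = 0;
    while (i < input.Length)
    {
        int runLength = 1;
        while (i + runLength < input.Length && input[i + runLength] == input[i])
            runLength++;
        sb.Append((char)runLength);   // e.g. 43 consecutive 'P' -> '+' then 'P'
        sb.Append(input[i]);
        i += runLength;
    }
    return sb.ToString();
}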
NOTE:
I don't encourage using Huffman coding because even if you use the Ziv-Lempel implementation, it's still a lot of work to get it right.