I have been given a large quantity of XML files from which I need to pull out parts of the text elements and reuse them for other purposes. (I am using XDocument to pull the XML data.)
But how do I decode the text contained in the elements? What format is even being used here? A few examples:
"What is the meaning of this® asks Sonny."
"The big centre cost 1¾ million pounds"
"... lost it. ® The next ..."
I have tried HttpUtility.HtmlDecode but that did not do the trick. If I decode twice, the "&amp;#174;" turns into a ®, which is obviously not right.
Looks like some of the &amp;#174; are line breaks, and others are probably question marks. The 190 one, I don't even know. Perhaps a dot or comma?
Any ideas would be welcome.
It does appear that the strings you show have been HTML encoded, and then XML encoded (or HTML again).
It is correct that &amp;#174; -> &#174; -> ® (the registered trademark symbol) per the ISO Latin-1 entities - &amp;reg; should behave the same way.
Similarly, &amp;#190; would turn into ¾, a fraction representing three quarters.
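If that is the case, decoding twice with HttpUtility.HtmlDecode should give back the real characters. A minimal sketch, assuming System.Web is referenced:

using System;
using System.Web; // add a reference to System.Web for HttpUtility

class DoubleDecodeDemo
{
    static void Main()
    {
        // Example input assumed to be encoded twice, as in the question
        string raw = "The big centre cost 1&amp;#190; million pounds";

        string once = HttpUtility.HtmlDecode(raw);   // "... 1&#190; ..."
        string twice = HttpUtility.HtmlDecode(once); // "... 1¾ ..."

        Console.WriteLine(twice);
    }
}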
This is likely a very basic question that I could not, despite trying, find a satisfying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my Unity projects. After some initial research I concluded it would be best to use a .csv file read by a StreamReader, so that translators would only ever have to interact with the CSV table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for line breaks and special characters in the actual fields. As such I could not use the normal ReadLine() method.
I worked around this by using Read() and checking whether a line break is inside a text-delimiter bracket. But since I check for the text delimiter, I am afraid it might run into an unescaped delimiter that is part of the normal in-cell text (since the normal text delimiter is quotation marks).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as the text delimiter in OpenOffice Calc, probably due to encoding differences. That is annoying, but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried looking at the files with Notepad++, which revealed a difference in line breaks (\r instead of \r\n), and obviously the break is within a text-delimiter bracket, but when it comes to how it separates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.
From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
  |    A     | B
--+----------+---
1 | 1,",2",  | 3
--+----------+---
2 | a        | c
  | b        |
Calc correctly reads and saves the file as shown below. The settings when saving are a Field delimiter of , (comma) and a String delimiter of " (double quote), which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]
Huffman Coding task.
What I am doing:
Read a string from a file, prepare the Huffman structure, encode the string to bits and save those bits to a binary file.
What I need:
Decode the string from the binary file, but encoding and decoding must be independent, e.g. they must still work after closing and reopening the app.
I am saving to the binary file like this:
A:000;l:001;a:10; :110;m:010;k:011;o:1110;t:1111;
00000110110010101100111110111110;
And I need to read that back and decode it. So I think I need to build the Huffman structure again from it, but how?
I see these options:
1. Encoder and decoder always use the same tree; it never changes. So the decoder already knows that 000 means A.
2. The tree is appended before the message in binary format. Encoder and decoder have to agree on the exact format for storing the tree; there are many possibilities for how to do this. In the simplest case there would be the number of encoded characters and, for every character, its ASCII code, the length of its Huffman code and the code itself.
3. The tree is built on the fly using adaptive Huffman coding, but that does not seem to be your case.
Since you know A:000;l:001;a:10; :110;m:010;k:011;o:1110;t:1111; you can try to traverse the string 00000110110010101100111110111110 a character at a time. Also have a switch statement for each of the characters and their codes. Whenever you come across a case, e.g. 000, you can output A. This is one way I can see of getting back to the original string; I am sure there is a better way out there.
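A rough C# sketch of that idea, using a dictionary of codes instead of a long switch statement (building the dictionary from the A:000;l:001;... header stored in the file is left out):

using System.Collections.Generic;
using System.Text;

// Decode a bit string using the stored code table. Because Huffman codes
// are prefix-free, the first match in the table is always the right one.
static string DecodeWithTable(string bits, Dictionary<string, char> codes)
{
    var result = new StringBuilder();
    var current = new StringBuilder();

    foreach (char bit in bits)
    {
        current.Append(bit);
        char symbol;
        if (codes.TryGetValue(current.ToString(), out symbol))
        {
            result.Append(symbol);
            current.Length = 0;   // start collecting the next code
        }
    }
    return result.ToString();
}

// With the codes from the question, decoding
// "00000110110010101100111110111110" yields "Ala ma kota".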
hope this helps.
Assuming "Adaptive Huffman", it's not usual to decide yourself what code to use for each character.
The usual sequence is:
1. Analyze the text to be encoded. That means counting the occurrences of each character. In the English language 'e' would be more frequent than 'x', 'y' or 'z', for example.
2. Sort the array of char/occurrence pairs in ascending order.
3. Build a binary tree - that means combining the two lowest counts, adding their counts and making a new tree node. Ignore those two and look for the next pair of lowest occurrences (which might include the node you just made). This continues until you end up with a tree that has a single root. (There are lots of helpful images of this.) I can explain this in more detailed steps if necessary.
4. From the root of the tree you "walk" to each leaf. For each "left" add a '0' and for each "right" add a '1'. When you reach the leaf, you have the code for that letter. If your text has many e's, 'e' will have the shortest code and no other code will start with the same sequence of bits. This is the idea: the most frequent characters get the shortest codes, and thus the biggest memory savings.
5. Now, by walking the tree, you have the code (of varying length) for each character.
6. Encode your text to a string of bits.
To decode you use the same tree. You say it must work "after closing app" so you will have to store the tree in some form with the encoded data.
In your comment you mention the problem with having varying length codes. There is no ambiguity. In an extreme case, if you had more e's than all other characters combined, the tree would be very lopsided. 'e' would be encoded as '1' and all other letters would have codes of varying lengths, beginning with 0.
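For completeness, a condensed C# sketch of steps 1 to 4 above (illustrative names only; the edge case of a text with a single distinct character is not handled):

using System.Collections.Generic;
using System.Linq;

class HuffmanSketch
{
    public class Node
    {
        public char Symbol;
        public int Count;
        public Node Left, Right;
    }

    // Steps 1-3: count occurrences, then repeatedly combine the two nodes
    // with the lowest counts until a single root remains.
    public static Node BuildTree(string text)
    {
        List<Node> nodes = text.GroupBy(c => c)
                               .Select(g => new Node { Symbol = g.Key, Count = g.Count() })
                               .ToList();

        while (nodes.Count > 1)
        {
            nodes = nodes.OrderBy(n => n.Count).ToList();
            var parent = new Node
            {
                Count = nodes[0].Count + nodes[1].Count,
                Left = nodes[0],
                Right = nodes[1]
            };
            nodes.RemoveRange(0, 2);
            nodes.Add(parent);
        }
        return nodes[0];
    }

    // Step 4: walk the tree; every left turn adds '0', every right turn adds '1'.
    public static void CollectCodes(Node node, string prefix, Dictionary<char, string> codes)
    {
        if (node.Left == null && node.Right == null)
        {
            codes[node.Symbol] = prefix;
            return;
        }
        CollectCodes(node.Left, prefix + "0", codes);
        CollectCodes(node.Right, prefix + "1", codes);
    }
}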
Having used SQL Server Bulk insert of CSV file with inconsistent quotes (the CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last char from quoted strings that contained a comma!]. So... I rewrote that bit (maybe a mistake?).
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the RegExp...but it's WAY beyond me... what's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.
The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your CSV to XML? Then you would be able to verify your data against an XSD before storing it in a database.
To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this (a rough sketch in C# follows the examples below):
1. scan until quote (go to 2) or comma (go to 3)
2. if the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
3. emit the field, go to 1
4. scan until quote (go to 5)
5. if the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a
separator between the first quoted string "" and the second field
17.5179C).
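A condensed C# sketch of the in-quotes/out-of-quotes idea (it folds the five numbered states into a single flag, assumes fields fit on one pre-split line, and leaves out the error reporting for input like ""17.5179C,""):

using System.Collections.Generic;
using System.Text;

// Condensed sketch: one flag replaces the five states above; error reporting
// for malformed input is left out.
static List<string> ParseCsvLine(string line)
{
    var fields = new List<string>();
    var field = new StringBuilder();
    bool inQuotes = false;

    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
            {
                field.Append('"');            // doubled quote -> one literal quote
                i++;
            }
            else if (c == '"')
            {
                inQuotes = false;             // closing quote
            }
            else
            {
                field.Append(c);
            }
        }
        else if (c == '"')
        {
            inQuotes = true;                  // opening quote
        }
        else if (c == ',')
        {
            fields.Add(field.ToString());     // field separator
            field.Length = 0;
        }
        else
        {
            field.Append(c);
        }
    }
    fields.Add(field.ToString());             // last field on the line
    return fields;
}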
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.
I ended up using the CSV parser that I didn't know we already had (it comes as part of our code generation tool) - and noting that ""17.5179C,"" is not valid and will cause errors.
While loading an XML file in a C# application, I am getting:
Name cannot begin with the '1' character, hexadecimal value 0x31.
Line 2, position 2.
The XML file begins like this:
<?xml version="1.0" encoding="us-ascii" standalone="yes"?>
<1212041205115912>
I am not supposed to change this tag at any cost.
How can I resolve this?
You are supposed to change the tag name, since the one you wrote violates the XML standard.
Just to recall the interesting portion of it here:
XML Naming Rules
XML elements MUST follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used, no words are reserved.
As a suggestion to solve your problem while maintaining the standard:
Use an attribute, i.e. <Number value="1212041205115912"/>
Add a prefix to the tag, i.e. <_1212041205115912/>
Of course you can maintain the structure you propose by writing your own format parser, but I can state it would be a really bad idea, because in the future someone would probably extend the format and would not be happy to see that a file that looks like XML is actually not, and he/she could get angry about that. Furthermore, if you want a custom format, use something simpler: messing up a text file with some '<' and '>' does not add any value if it is not an officially recognized format; it is better to use something like a simple plain text file instead.
If you absolutely can't change it, e.g. because the format is already out in the wild and used by other systems/customers/whatever:
Since it is an invalid XML document, try to clean it up before parsing it.
E.g. make a regex that replaces all <number> tags with <IMessedUp>number</IMessedUp> and then parse it.
It's a sort of iffy way to do it, but it will solve your problem.
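A rough sketch of that clean-up in C# (it assumes the numeric tags appear as standalone <number> tokens; matching opening/closing pairs would need more work, and IMessedUp is just the placeholder name from above):

using System.Text.RegularExpressions;

// <1212041205115912>  ->  <IMessedUp>1212041205115912</IMessedUp>
static string CleanUpNumericTags(string rawXml)
{
    return Regex.Replace(rawXml, @"<(\d+)\s*/?>", "<IMessedUp>$1</IMessedUp>");
}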
If you need to process this document, then stop thinking of it as XML, and cast aside any thoughts of using XML tools to process it. You're dealing with a proprietary format and you will need to write your own tools to handle it. If you want the benefits of using XML technology, you will have to redesign your documents so they are valid XML.
I need to encode some data (text) so that it can easily be passed by the user over phone.
The text contains random characters and is normally not longer than 100 chars. Example:
"37-b,kA.sZ:Bb9--10.y<§"
I'd like to encode this text into more human readable form so that it can easily be passed over phone.
Base36 produces a text that can easily be passed over phone, but I don't see how to encode/decode this correctly.
Any ideas or alternatives?
(Platform is .net 3.5 SP1)
Base36 sounds like a good choice (when using the symbols a-z and 0-9, it is the largest space of characters that can easily be passed over the phone). I would suggest you make the output contain blocks of 6 or 8 characters, to make it easier to read. Also, consider adding a checksum at the end, so you can verify there are no errors in the data.
100 characters in this encoding will still not be easy to read over the phone and get right the first time. Have you considered another delivery mechanism? A text message (SMS)?
On Wikipedia, there is an example of encoding Base36 in Python - shouldn't be too hard to convert to C#.
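For the digit conversion itself, a rough C# equivalent of that example might look like the sketch below. Note that encoding an arbitrary 100-character string this way would also require treating its bytes as one large number, and .NET 3.5 has no built-in BigInteger, so this only shows the integer-to-Base36 part:

using System;
using System.Text;

static class Base36
{
    private const string Digits = "0123456789abcdefghijklmnopqrstuvwxyz";

    public static string Encode(ulong value)
    {
        if (value == 0)
            return "0";

        var sb = new StringBuilder();
        while (value > 0)
        {
            sb.Insert(0, Digits[(int)(value % 36)]);  // prepend the next digit
            value /= 36;
        }
        return sb.ToString();
    }

    public static ulong Decode(string text)
    {
        ulong value = 0;
        foreach (char c in text.ToLowerInvariant())
        {
            value = value * 36 + (ulong)Digits.IndexOf(c);
        }
        return value;
    }
}

// Example: Base36.Encode(123456789) == "21i3v9" and Base36.Decode("21i3v9") == 123456789.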