I'm doing a basic CSV import/export in C#. Most of it is really simple and basic; we just have one special case.
Among the values we import/export, there are some special values which are not ASCII. To ease the work of our end users, the customer decided to convert some values into other values on export, and to do the opposite when importing.
Some examples:

Value in our application | Values that must be accepted when parsing
-------------------------|-------------------------------------------
³                        | 3, ^3, **3
μ                        | u
₃                        | 3
⁹                        | 9
°                        | deg
φ                        | phi
To export, it's easy: we replace each matching character with the first value in the second column.
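For illustration, a minimal sketch of that export direction (the mapping table here is abbreviated and the Export name is mine):

using System.Collections.Generic;
using System.Text;

static readonly Dictionary<char, string> ExportMap = new Dictionary<char, string>
{
    { '³', "3" }, { 'μ', "u" }, { '₃', "3" },
    { '⁹', "9" }, { '°', "deg" }, { 'φ', "phi" },
};

static string Export(string value)
{
    var sb = new StringBuilder(value.Length);
    foreach (char c in value)
        sb.Append(ExportMap.TryGetValue(c, out var replacement) ? replacement : c.ToString());
    return sb.ToString();
}

// Export("H³ 3° (asd)₃") -> "H3 3deg (asd)3"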
But parsing is more complicated, and I don't see an easy way to enumerate all the possible values that might come in.
One example:
H³ 3° (asd)₃
Would be exported as
H3 3deg (asd)3
So to parse this correctly, I have to consider all the possibilities:
H3 3deg (asd)3 //This may be a real value
H³ 3deg (asd)3
H₃ 3deg (asd)3
H3 ³deg (asd)3
....
What would be the good way of doing this?
I doubt it's possible with such an encoding. All H3 values are equally likely unless there is a rule that differentiates them. This makes parsing more difficult, not less.
What you are trying to do, though, looks a lot like what has already been done in tools like LaTeX or even Word. You should probably use the encodings used by LaTeX, since they've already done the work of encoding symbols as human-readable, editable keywords that can be parsed easily, e.g.: use ^ for powers, _ for indices, \degree for degrees, etc.
In fact, even Word accepts these same keywords nowadays in its math editor, allowing you to type \sum and get ∑, or \oint for ∮.
You should probably tag the fields that contain substitutions, e.g. by surrounding them with multiple braces, so that users can still use the keywords in their own text.
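A minimal sketch of that idea in C#, assuming a LaTeX-like keyword table (the table contents and the Encode/Decode names are illustrative, not an established API):

using System.Collections.Generic;
using System.Linq;

// One unambiguous keyword per symbol, so decoding is a straight
// reverse lookup with no guessing.
static readonly Dictionary<string, string> Keywords = new Dictionary<string, string>
{
    { "³", "^3" },          // superscript
    { "₃", "_3" },          // subscript
    { "⁹", "^9" },
    { "°", @"\degree" },
    { "μ", @"\mu" },
    { "φ", @"\phi" },
};

static string Encode(string s) =>
    Keywords.Aggregate(s, (acc, kv) => acc.Replace(kv.Key, kv.Value));

static string Decode(string s) =>
    // Replace longer keywords first so \degree is not clipped by a shorter match.
    Keywords.OrderByDescending(kv => kv.Value.Length)
            .Aggregate(s, (acc, kv) => acc.Replace(kv.Value, kv.Key));

// Encode("H³ 3° (asd)₃") -> "H^3 3\degree (asd)_3", and Decode round-trips it.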
I think you need to exclude ambiguous mappings. E.g.:
³ | ^3, **3
₃ | 3
⁹ | ^9, **9
or
³ | 3, ^3, **3
₃ | _3
⁹ | 9
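A minimal sketch of how that exclusion could be enforced in C#: validate the table at startup and reject any encoding that maps to more than one symbol (the names and the abbreviated table are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

// Each symbol with its full list of accepted encodings.
static readonly Dictionary<char, string[]> Accepted = new Dictionary<char, string[]>
{
    { '³', new[] { "^3", "**3" } },
    { '₃', new[] { "3" } },
    { '⁹', new[] { "^9", "**9" } },
};

static void ValidateMappings()
{
    var duplicates = Accepted
        .SelectMany(kv => kv.Value, (kv, enc) => new { Symbol = kv.Key, Encoding = enc })
        .GroupBy(x => x.Encoding)
        .Where(g => g.Count() > 1)
        .ToList();

    foreach (var g in duplicates)
        Console.WriteLine($"Ambiguous encoding '{g.Key}' maps to: {string.Join(", ", g.Select(x => x.Symbol))}");

    if (duplicates.Any())
        throw new InvalidOperationException("Mapping table is ambiguous.");
}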
ASCII has 7 bits per character. Now you want to use characters that live in a larger space (UTF-8, for example).
You lose information by converting a UTF-8 character to ASCII, but you want to get the full information back.
To manage this, you need a mask that makes the original character recognizable.
You could use special character sequences as your mask; that way you don't reinvent the wheel, and others can find documentation for your interface all over the internet.
But if you map ³ => 3, you lose information (superscript 3 => 3: where was the superscript, and how should you guess the right choice?).
I am using the CNTK example LSTMSequenceClassifier via the console application CSTrainingCPUOnlyExamples, using the default data file Train.ctf, which looks like this:
The input layer has dimension 2000 (a one-hot vector); the output is 5 classes (softmax).
The file is loaded via:
MinibatchSource minibatchSource = MinibatchSource.TextFormatMinibatchSource(Path.Combine(DataFolder, "Train.ctf"), streamConfigurations, MinibatchSource.InfinitelyRepeat, true);
StreamInformation featureStreamInfo = minibatchSource.StreamInfo(featuresName);
StreamInformation labelStreamInfo = minibatchSource.StreamInfo(labelsName);
I would really appreciate an explanation of how the data file is generated and how 2000 inputs map to a 5-class output.
My goal, of course, is to write an application to format and save data to a file that can be read as an input data file, and I need to understand the structure to make that work.
Thanks!
I see the Y dimension, and that part makes sense, but I am having trouble with the input layer.
Edit: @Frank Seide MSFT
I wonder if you can verify and give best practices:
// Builds one CTF line of the form:
// "<id> |<feature> <shape> |# <comment> |<label> <shape> |# <comment>"
private string Format(int sequenceId, string featureName, string featureShape, string labelName, string featureComment, string labelShape, string labelComment)
{
    // Input names must not contain spaces, so replace them with dashes.
    return $"{sequenceId} |{featureName.Replace(" ", "-")} {featureShape} |# {featureComment} |{labelName.Replace(" ", "-")} {labelShape} |# {labelComment}\r\n";
}
which might return something like:
0 |x 560:1 |# I am a comment |y 1 0 0 0 0 |# I am a comment
Where:
sequenceId = 0;
featureName = "x";
featureShape = "560:1";
featureComment = "I am a comment";
labelName = "y";
labelShape = "1 0 0 0 0";
labelComment = "I am a comment";
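With those values, the call reproducing the sample line above would be (note the parameter order: labelName comes before featureComment):

string line = Format(0, "x", "560:1", "y", "I am a comment", "1 0 0 0 0", "I am a comment");
// -> "0 |x 560:1 |# I am a comment |y 1 0 0 0 0 |# I am a comment\r\n"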
On GPU, Frank did suggest around 20 sequences per minibatch; see https://www.youtube.com/watch?v=TK671HxrufE at 26:25.
This is for custom C# Dataset formatting.
End edit...
By accidental discovery, I found an answer with some documentation:
BrainScript CNTK Text Format Reader using CNTKTextFormatReader
The document goes on to explain:
CNTKTextFormatReader (later simply CTF Reader) is designed to consume input text data formatted according to the specification below. It supports the following main features:
Multiple input streams (inputs) per file
Both sparse and dense inputs
Variable length sequences
CNTK Text Format (CTF)
Each line in the input file contains one sample for one or more inputs. Since every line is also (explicitly or implicitly) attached to a sequence, it defines one or more sequence, input, sample relations. Each input line must be formatted as follows:
[Sequence_Id](Sample or Comment)+

where

Sample=|Input_Name (Value )*
Comment=|# some content
Each line starts with a sequence id and contains one or more samples (in other words, each line is an unordered collection of samples).
Sequence id is a number. It can be omitted, in which case the line number will be used as the sequence id.
Each sample is effectively a key-value pair consisting of an input name and the corresponding value vector (mapping to higher dimensions is done as part of the network itself).
Each sample begins with a pipe symbol (|) followed by the input name (no spaces), followed by a whitespace delimiter and then a list of values.
Each value is either a number or an index-prefixed number for sparse inputs.
Both tabs and spaces can be used interchangeably as delimiters.
A comment starts with a pipe immediately followed by a hash symbol: |#, followed by the actual content (body) of the comment. The body can contain any characters; however, a pipe symbol inside the body needs to be escaped by appending the hash symbol to it (see the example below). The body of a comment continues until the end of the line or until the next un-escaped pipe, whichever comes first.
Handy, and gives an answer.
The input data corresponding to the reader configuration above should look something like this:
|B 100:3 123:4 |C 8 |A 0 1 2 3 4 |# a CTF comment
|# another comment |A 0 1.1 22 0.3 54 |C 123917 |B 1134:1.911 13331:0.014
|C -0.001 |# a comment with an escaped pipe: '|#' |A 3.9 1.11 121.2 99.13 0.04 |B 999:0.001 918918:-9.19
Note the following about the input format:
|Input_Name identifies the beginning of each input sample. This element is mandatory and is followed by the corresponding value vector.
A dense vector is just a list of floating-point values; a sparse vector is a list of index:value tuples.
Both tabs and spaces are allowed as value delimiters (within input vectors) as well as input delimiters (between inputs).
Each separate line constitutes a "sequence" of length 1 ("Real" variable-length sequences are explained in the extended example below).
Each input identifier can only appear once on a single line (which translates into one sample per input per line requirement).
The order of input samples within a line is NOT important (conceptually, each line is an unordered collection of key-value pairs)
Each well-formed line must end with either a "Line Feed" (\n) or a "Carriage Return, Line Feed" (\r\n).
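Putting those rules together, a minimal C# sketch that writes a two-line CTF file with a dense input x and a sparse input y (the values and stream names are made up for illustration):

using System.IO;

static void WriteSampleCtf(string path)
{
    using (var writer = new StreamWriter(path))
    {
        // One dense sample (4 values) and one sparse sample (index:value)
        // per line; sequence ids are omitted, so the line number is used.
        writer.Write("|x 0.1 0.2 0.3 0.4 |y 0:1 |# first sample\n");
        writer.Write("|x 1.0 2.0 3.0 4.0 |y 3:1 |# second sample\n");
    }
}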
Some awesome content on the input and label data in these videos:
https://youtu.be/hMRrqkl77rI at 30:23
https://youtu.be/Vi05nEzAS8Y at 25:20
Also, helpful but not directly related: Read and feed data to CNTK Trainer
I have a table in which I save an ID and a rule like:
| ID | Rule |
|------|--------------------------------------|
| 1 | firstname[0]+'.'+lastname+'#'+domain |
| 2 | firstname+'_'+lastname+'#'+domain |
| 3 | lastname[0]+firstname+'#'+domain |
My problem is: how can I read and analyze/execute that rule in my code? The cell comes back as a string, and I don't know how to apply that rule to my variables.
I was thinking about String.Format, but I don't know how to take just the first character of a string with it.
If you could give me advice or suggest a better way to do this, I'd appreciate it, because I'm completely lost.
If that is C#, you could construct a LINQ Expression out of the parse tree from, for example, ANTLR, or, if the format is very simple, a regex.
You have to take these steps (a simpler hand-rolled alternative is sketched after the list):
Evaluate the incoming string using ANTLR; you could start off with the C# grammar;
Build an expression from it;
Run the expression, passing the firstname, domain, etc. parameters.
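If ANTLR is overkill for rules this simple, a hand-rolled interpreter may be enough. A minimal sketch, assuming rules only ever contain quoted literals, plain names, and name[index] tokens joined by +:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static string Evaluate(string rule, IDictionary<string, string> variables)
{
    var result = new StringBuilder();
    foreach (var token in rule.Split('+').Select(t => t.Trim()))
    {
        if (token.StartsWith("'") && token.EndsWith("'"))
            result.Append(token.Trim('\''));                 // quoted literal, e.g. '.'
        else if (token.Contains("["))
        {
            var name = token.Substring(0, token.IndexOf('['));
            var index = int.Parse(token.Substring(token.IndexOf('[') + 1).TrimEnd(']'));
            result.Append(variables[name][index]);           // indexed character, e.g. firstname[0]
        }
        else
            result.Append(variables[token]);                 // plain variable
    }
    return result.ToString();
}

// var vars = new Dictionary<string, string>
// { ["firstname"] = "john", ["lastname"] = "doe", ["domain"] = "example.com" };
// Evaluate("firstname[0]+'.'+lastname+'#'+domain", vars) -> "j.doe#example.com"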
Not sure that would do the trick, but you might want to look at CSharpCodeProvider. I've never used it, but according to the examples it seems to be capable of compiling code entered in a textbox.
The catch is that this solution generates an exe file that will be stored in your project folder. Even if you delete those files after successful compilation, this might not be the best option.
I am trying to parse a file using a regex split. It works well with the '\t' character, but some lines have '\t' inside a field instead of acting as the delimiter.
Like :
G2226 TEST 1 C 29 Internal Head Office D Head Office ZZZ Unassigned 10910 10/10/2011 11/10/2011 10/10/2011 11/10/2011 "Test call Sort the customer out some data. See the customer again tomorrow to talk about Prod " Mr ABC Mr ABC Mr ABC Mr ABC Credit Requested BDM Call Internal Note 10
This part has 2 tabs that I wish were ignored:
"Test call Sort the customer out some data. See the customer again tomorrow to talk about Prod\t\t"
The good thing is, they are enclosed in double quotes, but I cannot work out how to ignore them. Any ideas?
Edit:
My goal is to get 36 columns. Some rows produce more columns after a Regex.Split(lineString, "\t") because they include '\t' characters inside some of the fields; I would like to ignore those. The line above comes out as 38 columns, which is rejected by my DataTable because the header has only 36 columns. I would like to solve this problem.
If you have a simple CSV file, then regex split is usually the easiest way to process it.
However, if your CSV file contains more complex elements, such as quoted fields that contain separator characters or newlines, then this approach will no longer work. It is not a trivial matter to correctly parse these types of files, so you should use a library when possible.
The answers to this question give several options for C# libraries that can read a CSV file.
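One option that ships with the framework is Microsoft.VisualBasic.FileIO.TextFieldParser, which understands quoted fields out of the box. A minimal sketch, assuming a tab-delimited file path:

using Microsoft.VisualBasic.FileIO;

using (var parser = new TextFieldParser(path))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters("\t");
    parser.HasFieldsEnclosedInQuotes = true;   // tabs inside quoted fields stay put

    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields(); // should yield the expected 36 columns
        // ... load fields into the DataTable ...
    }
}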
Regex is not the right tool for this.
You basically have a CSV format; it is "tab separated" rather than "comma separated", but it works exactly the same way. So find a CSV parser and use that; the separator character is usually configurable.
If you really need a regular expression, you can try something like this:
(?!\t")\t(?!\t")
I'm working with a database that has content in which the angle brackets have been replaced with the character ^.
e.g.
^b^some text^/b^
Can anyone please recommend a C# solution to convert the ^ characters back to the appropriate brackets, so the content can be displayed as HTML? I'm guessing some kind of regex will do the job...?
Thanks in advance
You can replace every nth ^ character with > where n is even and with < where n is odd.
using System.Text.RegularExpressions;

var html = "^b^some text^/b^";
var n = 0;
// Alternate the replacement: 1st, 3rd, 5th, ... occurrence becomes '<'; 2nd, 4th, ... becomes '>'.
var result = Regex.Replace(html, "\\^", m => ((n++ % 2) == 0) ? "<" : ">");
// result == "<b>some text</b>"
Note that this works only as long as the original HTML code contains a closing > character for every < character (<p<b>... is bad) and that there were no ^ characters in the original HTML code (<b>2^5</b> is bad).
A more complicated, but possibly safer, solution would be to search for specific sets of characters, such as ^p, ^img, ^div, etc., and their counterparts ^/p^, ^/div^, ^/img^, etc., and replace each of them specifically.
Whether this is feasible though, depends on what tags exist in the data, and how big an effort you are willing to put in to do this securely. Do you know if there is a finite set of tags that have been used? Was the HTML generated, or is there a chance that someone has edited them manually, necessarily making the pattern-searching more complicated?
Maybe you could first do some analysis, for instance searching and listing the various instances where the character ^ occurs? How much data are we talking about, and is it static, or will it continue to grow (including the ^-problem)?
Tricky, to the point of being impossible to do perfectly automatically, unless you can make some very convenient assumptions about the original HTML (that it uses a small subset of all possible HTML, or that it is known to conform to certain predictable patterns). I think in the end there will have to be some hand editing.
Having said that, and apologies for not including any actual C# code, here's how I'd consider approaching it.
Let's go after the problem incrementally, where we convert common patterns first. The goal being after every step to reduce the number of remaining ^ characters.
So first, regex-replace lots of very common literal patterns
^p^ -> <p>
^div^ -> <div>
^/div^ -> </div>
etc.
Next, replace patterns that contain optional text, like
^link[anything-except-^]^ -> <link[original-text]>
and on and on. My approach is to replace only expected patterns, and by doing that, avoid false matches. Then iterate with other patterns until there are no ^ characters left. This takes lots of inspection of the data, and lots of patterns. It's brute force, not smart, but there you go.
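A minimal C# sketch of that incremental, whitelist-driven approach (the tag list is illustrative and would be built up from an analysis of the real data):

using System;
using System.Text.RegularExpressions;

static string ConvertKnownTags(string input)
{
    // Pass 1: exact literal patterns for simple tags.
    foreach (var tag in new[] { "b", "i", "p", "em", "strong", "div", "span" })
    {
        input = input.Replace($"^{tag}^", $"<{tag}>");
        input = input.Replace($"^/{tag}^", $"</{tag}>");
    }

    // Pass 2: expected opening tags that carry attributes,
    // e.g. ^a href="x"^ -> <a href="x">.
    input = Regex.Replace(input, @"\^(a|img|div|span)( [^^]*)\^", "<$1$2>");

    // Report what is left for manual inspection.
    int leftover = Regex.Matches(input, @"\^").Count;
    if (leftover > 0)
        Console.WriteLine($"{leftover} unconverted '^' characters remain.");
    return input;
}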
I use C#, XPath, and the HTML Agility Pack. I use XPath strings such as:
"//table[3]/td[1]/span[2]/text() | //table[6]/td[1]/span[2]/text()"
"//table[8]/td[1]/span[2]/text() | //table[10]/td[1]/span[2]/text()"
The difference is only in the table numbers. Is it possible to use some other XPath construct to replace the repeated XPath strings or the | operator?
What I actually do: with the first XPath string (where I have table numbers 3 & 6) I extract one value. With the second XPath string (where I have table numbers 8 & 10) I extract another value.
And an additional question about performance: is the XPath string //table[8]/td[1]/span[2]/text() faster than the version with the union, //table[8]/td[1]/span[2]/text() | //table[10]/td[1]/span[2]/text()? I ask because I have many, many XPath strings for many, many values, and if there is a real difference I need to try something else. I can't do the measurement right now, which is why I'm asking you to share your experience.
Firstly, //table[6] looks odd. Are you sure you don't mean (//table)[6]? (The first selects every table that is the 6th child of its parent; the second selects the sixth table in the document.) I will assume the latter.
In XPath 2.0 you can write
(//table)[position()=(3,6,8,10)]/td[1]/span[2]/text()
In 1.0 that would have to be
(//table)[position()=3 or position()=6 or position()=8 or position()=10]
/td[1]/span[2]/text()
Or (in either release) you could write
((//table)[3] | (//table)[6] | (//table)[8] | (//table)[10])/td[1]/span[2]/text()
Your question about performance can't be answered without knowing what XPath implementation you are using.
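For what it's worth, the HTML Agility Pack mentioned in the question evaluates XPath via System.Xml.XPath, which is XPath 1.0 only, so the 1.0 form is the one to use there. A minimal sketch, assuming an html string variable:

using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html);

// XPath 1.0 form: select the 3rd, 6th, 8th and 10th table in the document.
var nodes = doc.DocumentNode.SelectNodes(
    "(//table)[position()=3 or position()=6 or position()=8 or position()=10]" +
    "/td[1]/span[2]/text()");

if (nodes != null)                 // SelectNodes returns null when nothing matches
    foreach (var node in nodes)
        Console.WriteLine(node.InnerText.Trim());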