I am using the CNTK example LSTMSequenceClassifier via the console application CSTrainingCPUOnlyExamples, with the default data file Train.ctf.
The input layer has dimension 2000 (a one-hot vector) and the output is 5 classes (softmax).
The file is loaded via:
MinibatchSource minibatchSource = MinibatchSource.TextFormatMinibatchSource(Path.Combine(DataFolder, "Train.ctf"), streamConfigurations, MinibatchSource.InfinitelyRepeat, true);
StreamInformation featureStreamInfo = minibatchSource.StreamInfo(featuresName);
StreamInformation labelStreamInfo = minibatchSource.StreamInfo(labelsName);
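For reference, the streamConfigurations passed in above is what maps the CTF stream aliases to the network dimensions. In the sample it is defined roughly like this (variable names are taken from the example; treat the details as assumptions):
// using CNTK; using System.Collections.Generic;
// "x" is the sparse 2000-dimensional one-hot feature stream, "y" the dense 5-class label stream.
IList<StreamConfiguration> streamConfigurations = new StreamConfiguration[]
{
    new StreamConfiguration(featuresName, inputDim, true, "x"),        // inputDim = 2000, sparse
    new StreamConfiguration(labelsName, numOutputClasses, false, "y")  // numOutputClasses = 5, dense
};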
I would really appreciate an explanation of how the data file is generated and how the 2000 inputs map to the 5-class output.
My goal is to write an application that formats and saves data to a file that can be read as an input data file, so of course I need to understand the structure to make this work.
Thanks!
I see the Y dimension; that part makes sense, but I am having trouble with the input layer.
Edit: @Frank Seide MSFT
I wonder if you can verify and give best practices:
private string Format(int sequenceId, string featureName, string featureShape, string labelName, string featureComment, string labelShape, string labelComment)
{
    // Emits one CTF line: sequence id, feature sample, comment, label sample, comment.
    return $"{sequenceId} |{featureName.Replace(" ", "-")} {featureShape} |# {featureComment} |{labelName.Replace(" ", "-")} {labelShape} |# {labelComment}\r\n";
}
which might return something like:
0 |x 560:1 |# I am a comment |y 1 0 0 0 0 |# I am a comment
Where:
sequenceId = 0;
featureName = "x";
featureShape = "560:1";
featureComment = "I am a comment";
labelName = "y";
labelShape = "1 0 0 0 0";
labelComment = "I am a comment";
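For illustration, calling the helper with exactly those values reproduces the sample line (note that labelName comes before featureComment in the parameter list):
string line = Format(0, "x", "560:1", "y", "I am a comment", "1 0 0 0 0", "I am a comment");
// -> "0 |x 560:1 |# I am a comment |y 1 0 0 0 0 |# I am a comment\r\n"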
On GPU, Frank did suggest around 20 sequences per minibatch; see https://www.youtube.com/watch?v=TK671HxrufE at 26:25.
This is for custom C# Dataset formatting.
End edit...
By accidental discovery, I found an answer with some documentation:
BrainScript CNTK Text Format Reader using CNTKTextFormatReader
The document goes on to explain:
CNTKTextFormatReader (later simply CTF Reader) is designed to consume input text data formatted according to the specification below. It supports the following main features:
Multiple input streams (inputs) per file
Both sparse and dense inputs
Variable length sequences
CNTK Text Format (CTF)
Each line in the input file contains one sample for one or more inputs. Since (explicitly or implicitly) every line is also attached to a sequence, it defines one or more sequence, input, sample relations. Each input line must be formatted as follows:
[Sequence_Id](Sample or Comment)+
where
Sample=|Input_Name (Value )*
Comment=|# some content
Each line starts with a sequence id and contains one or more samples (in other words, each line is an unordered collection of samples).
Sequence id is a number. It can be omitted, in which case the line number will be used as the sequence id.
Each sample is effectively a key-value pair consisting of an input name and the corresponding value vector (mapping to higher dimensions is done as part of the network itself).
Each sample begins with a pipe symbol (|) followed by the input name (no spaces), followed by a whitespace delimiter and then a list of values.
Each value is either a number or an index-prefixed number for sparse inputs.
Both tabs and spaces can be used interchangeably as delimiters.
A comment starts with a pipe immediately followed by a hash symbol: |#, then followed by the actual content (body) of the comment. The body can contain any characters, however a pipe symbol inside the body needs to be escaped by appending the hash symbol to it (see the example below). The body of a comment continues until the end of line or the next un-escaped pipe, whichever comes first.
Handy, and gives an answer.
The input data corresponding to the reader configuration above should look something like this:
|B 100:3 123:4 |C 8 |A 0 1 2 3 4 |# a CTF comment
|# another comment |A 0 1.1 22 0.3 54 |C 123917 |B 1134:1.911 13331:0.014
|C -0.001 |# a comment with an escaped pipe: '|#' |A 3.9 1.11 121.2 99.13 0.04 |B 999:0.001 918918:-9.19
Note the following about the input format:
|Input_Name identifies the beginning of each input sample. This element is mandatory and is followed by the corresponding value vector.
Dense vector is just a list of floating point values; sparse vector is a list of index:value tuples.
Both tabs and spaces are allowed as value delimiters (within input vectors) as well as input delimiters (between inputs).
Each separate line constitutes a "sequence" of length 1 ("Real" variable-length sequences are explained in the extended example below).
Each input identifier can only appear once on a single line (which translates into one sample per input per line requirement).
The order of input samples within a line is NOT important (conceptually, each line is an unordered collection of key-value pairs)
Each well-formed line must end with either a line feed \n or a carriage return/line feed \r\n.
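Putting those rules together, here is a minimal C# sketch for writing such a file (the stream names "x"/"y" and the values are assumptions matching the example above):
using System.IO;

static void WriteCtf(string path)
{
    using (var writer = new StreamWriter(path))
    {
        // A one-line sequence: a sparse one-hot feature (index:value) plus a dense 5-class label.
        writer.WriteLine("0 |x 560:1 |y 1 0 0 0 0");
        // A three-sample sequence: repeating the sequence id groups the lines together;
        // for sequence classification the label only needs to appear once per sequence.
        writer.WriteLine("1 |x 23:1 |y 0 1 0 0 0");
        writer.WriteLine("1 |x 12:1");
        writer.WriteLine("1 |x 99:1");
    }
}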
Some awesome content on the input and label data in these videos:
https://youtu.be/hMRrqkl77rI - #30:23
https://youtu.be/Vi05nEzAS8Y - #25:20
Also, helpful but not directly related: Read and feed data to CNTK Trainer
I am trying to match the following string for an interface to a security system:
*3824 04:57:04 24/02/16 ALARM(DC4) Input 1 (SI)Main Door Opened(DC2)
Please note that (DC4) / (SI) / (DC2) are just visual representations of the ASCII control characters, so on the serial port each would arrive as a single byte, not 4 or 5 bytes.
The system will be continuously sending messages in a similar format to the above and I will need to check each one and see if it requires further processing.
The word ALARM is my keyword, so if a message without ALARM in it comes through then I will be ignoring it (match failed).
If the word ALARM appears in the message then I need to get the location of the event and pass onto other layers within my application.
Sample 1 *3824 04:57:04 24/02/16 ALARM(DC4) Input 1 (SI)Main Door Opened(DC2)
Sample 2 *3824 04:57:04 24/02/16 ALARM(DC4) Input 2 (SI)Back Door Opened(DC2)
So I need to extract everything between the (SI) and (DC2) ASCII characters as a string for further processing.
So Message 1 would match "Main Door Opened" and Message 2 would match "Back Door Opened".
The other layers in the application will then extract this string from the appropriate Group # field if the match is a success.
Thanks,
Daniel.
Try this:
([A-Z]+)(?:[^\)]+.){2}([^\(]+)
Regex101:
Input:
*3824 04:57:04 24/02/16 ALARM(DC4) Input 1 (SI)Main Door Opened(DC2)
Output:
MATCH 1
1. [24-29] `ALARM`
2. [47-63] `Main Door Opened`
This is an exact match in group 1:
ALARM\(DC4\).*\(SI\)(.*)(?=\(DC2\))
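Either pattern can be used from C# along these lines (a minimal sketch using the literal "(DC4)"/"(SI)"/"(DC2)" stand-ins; on the real serial stream each would be a single control byte, so the pattern would need adjusting):
using System;
using System.Text.RegularExpressions;

class AlarmDemo
{
    static void Main()
    {
        string message = "*3824 04:57:04 24/02/16 ALARM(DC4) Input 1 (SI)Main Door Opened(DC2)";
        Match m = Regex.Match(message, @"([A-Z]+)(?:[^\)]+.){2}([^\(]+)");
        if (m.Success && m.Groups[1].Value == "ALARM")
        {
            Console.WriteLine(m.Groups[2].Value); // Main Door Opened
        }
    }
}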
Just looking to see what the best way to approach the following situation would be.
I am trying to make a small job that reads in a txt file which has a thousand or so lines;
Each line is about 40 characters long (mostly numbers, some letter identifiers).
I have used
DataTable txtCache = new DataTable();
txtCache.Columns.Add(new DataColumn("Column1"));
string[] lines = System.IO.File.ReadAllLines(FILEcheck.Properties.Settings.Default.filePath);
foreach (string line in lines)
{
    txtCache.Rows.Add(line);
}
However, what I really want to do is a bit confusing and hard to explain, so I'll do my best. An example of a line is below:
5498494000584454684840}eD44448774V6468465 Z
In the beginning of that long string is a "84", and then a "58" a little bit later. I need to do a comparison on these two numbers. They could be anything, but only a few combinations are acceptable in the file. They will always be in the same spot and same amount of characters (so it will always be 2 numbers and always in the 4-5 location). So I want to have 3 columns. I want the full string in 1 column, and then the 2 individual smaller numbers in columns of themselves. I can then compare them later on, and if there is an issue, I can return the full string which caused the issue.
Is this possible? I am just not sure how to parse out a substring based on character location and then loading it into a datatable.
Any advice would be appreciated. Thank you,
You could create the columns for each of items you are looking to store (whole string, first number, second number), and then add a row for each of the lines in the input file. You could just use the substring method to parse out the two digit numbers and store them. To do your analysis, you could parse the numbers out from the strings, or whatever else you need to do.
lines[0].Substring(3,2) will give you "84" in your above example. If you want the int, you could use Int32.Parse(lines[0].Substring(3,2))
Substring reference: http://msdn.microsoft.com/en-us/library/aka44szs%28v=vs.110%29.aspx
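A minimal sketch putting that together (the column names are mine, and the position of the second number is a guess from your example):
using System.Data;
using System.IO;

DataTable txtCache = new DataTable();
txtCache.Columns.Add("FullLine");
txtCache.Columns.Add("FirstNumber");
txtCache.Columns.Add("SecondNumber");

foreach (string line in File.ReadAllLines(FILEcheck.Properties.Settings.Default.filePath))
{
    // "84" at zero-based index 3 (the "4-5 location"); "58" assumed at index 10 per the example.
    txtCache.Rows.Add(line, line.Substring(3, 2), line.Substring(10, 2));
}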
I have a text file that reads
1 "601 Cross Street College Station TX 71234"
2 "(another address)"
3 ...
.
.
I wanted to know how to parse this text file into an integer and a string using C#. The integer would hold the S.No and the string the address without the quotes.
I need to do this because later on I have a function that takes these two values from the text file as input and spits out some data. This function has to be executed on each entry in the text file.
If a is an integer and add is the string, the output should be
a=1; add=601 Cross Street College Station TX 71234 //for the first line and so on
As one can observe, the address needs to be one string.
This is not a homework question. And what I have been able to accomplish so far is to read out all the lines using
string[] lines = System.IO.File.ReadAllLines(@"C:\Users\KS\Documents\input.txt");
Any help is appreciated.
I would need to see more of your input data to determine the most reliable method.
But one approach would be to split each address into words. You can then loop through the words and find each word that contains only digits. This will be your street number. You could look after the street number and look for S, So, or South but as your example illustrates, there might be no such indicator.
Also, you haven't provided what you want to happen if more than one number is found.
As far as removing the quotes, just remove the first and last characters. I'd recommend checking that they are in fact quotes before removing them.
From your description, every entry has this format:
[space][number][space][quote][address][quote]
Here is some quick and dirty code that will parse this format into an int/string tuple:
using System;
using System.Linq;

static Tuple<int, string> ParseLine(string line)
{
    var tokens = line.Split();                          // Split on whitespace; the leading space yields an empty first token
    var number = int.Parse(tokens[1]);                  // The number is the 2nd token
    var address = string.Join(" ", tokens.Skip(2));     // The address is every subsequent token
    address = address.Substring(1, address.Length - 2); // ... minus the surrounding quotes
    return Tuple.Create(number, address);
}
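For example, assuming the leading space shown in the format above:
var entry = ParseLine(" 1 \"601 Cross Street College Station TX 71234\"");
Console.WriteLine($"a={entry.Item1}; add={entry.Item2}");
// a=1; add=601 Cross Street College Station TX 71234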
Having used SQL Server Bulk insert of CSV file with inconsistent quotes (CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last character from quoted strings that contained a comma!]. So... I rewrote that bit (maybe a mistake?)
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the RegExp...but it's WAY beyond me... what's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.
The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your CSV to XML? Then you would be able to verify your data against an XSD before storing it in a database.
To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this:
1. Scan until a quote (go to 2) or a comma (go to 3).
2. If the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
3. Emit the field, go to 1.
4. Scan until a quote (go to 5).
5. If the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a
separator between the first quoted string "" and the second field
17.5179C).
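A minimal C# sketch of that state machine, for illustration only (real-world CSV has more corner cases, so test against plenty of data):
using System;
using System.Collections.Generic;
using System.Text;

static List<string> ParseCsvLine(string line)
{
    var fields = new List<string>();
    var field = new StringBuilder();
    bool inQuotes = false;   // currently inside a quoted field
    bool wasQuoted = false;  // a quoted field just closed; only a comma may follow

    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
            {
                field.Append('"');  // doubled quote inside quotes = escaped quote, keep one
                i++;
            }
            else if (c == '"') { inQuotes = false; wasQuoted = true; }  // closing quote
            else field.Append(c);
        }
        else if (c == ',')
        {
            fields.Add(field.ToString());
            field.Clear();
            wasQuoted = false;
        }
        else if (wasQuoted)
        {
            throw new FormatException($"Expected a separator after the quoted field, at position {i}");
        }
        else if (c == '"' && field.ToString().Trim().Length == 0)
        {
            field.Clear();  // tolerate leading whitespace before an opening quote
            inQuotes = true;
        }
        else if (c == '"')
        {
            throw new FormatException($"Unexpected quote inside an unquoted field, at position {i}");
        }
        else field.Append(c);
    }
    if (inQuotes) throw new FormatException("Unterminated quoted field");
    fields.Add(field.ToString());
    return fields;
}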
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.
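For instance (hypothetical, reusing the ParseCsvLine sketch above to produce the pipe-delimited output):
using System.IO;
using System.Threading.Tasks;

var lines = File.ReadAllLines("input.csv");   // assumed input path
var converted = new string[lines.Length];
Parallel.For(0, lines.Length, i =>
{
    // Each line is independent, so the conversion parallelizes trivially.
    converted[i] = string.Join("|", ParseCsvLine(lines[i]));
});
File.WriteAllLines("output.psv", converted);  // pipe-delimited result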
I ended up using the CSV parser that I didn't know we already had (it comes as part of our code generation tool), and noting that ""17.5179C,"" is not valid and will cause errors.
I have a mixed Hebrew/english string to parse.
The string is built like this:
[3 hebrew] [2 english] [1 hebrew],
So, it can be read as 1 2 3, but it is stored as 3 2 1 (the exact byte sequence in the file, double-checked in a hex editor; RTL is only a display attribute anyway). The .NET regex parser has an RTL option, which (when given plain LTR text) starts processing from the right side of the string.
I am wondering whether this option should be applied to extract the [3 hebrew] and [2 english] parts from the string, or to check whether [1 hebrew] matches the end of the string. Are there any hidden specifics, or is there nothing to worry about (as when processing any LTR string with special Unicode characters)?
Also, can anyone recommend a good RTL+LTR text editor? (I'm afraid that VS Express sometimes displays the text wrong, and if it ever starts messing up the saved strings, I would like to re-check the files without resorting to hex editors.)
The RightToLeft option refers to the order in which the regular expression moves through the character sequence, and should really be called LastToFirst: in the case of Hebrew and Arabic it is actually left-to-right, and with mixed RTL and LTR text such as you describe the expression "right to left" is even less appropriate.
This has a minor effect on speed (it will only matter if the searched text is massive) and on regular expressions that are run with a startAt index (they search earlier in the string than startAt rather than later).
Examples; let's hope the browsers don't mess this up too much:
string saying = "למכות is in כתר"; //Just because it amuses me that this is a saying whatever way round the browser puts malkuth and kether.
string kether = "כתר";
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying));//True
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying));//True, perhaps minutely faster but so little that noise would hide it.
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying, 2));//False
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying, 2));//True
//And to show that the ordering is codepoint rather than physical display ordering:
Console.WriteLine(new Regex("" + kether[0] + ".*" + kether[2]).IsMatch(saying));//True
Console.WriteLine(new Regex("" + kether[2] + ".*" + kether[0]).IsMatch(saying));//False