I've been trying to figure out the mystical realm of MIDI parsing, and I'm having no luck. All I'm trying to do is get the note values (60 = C4, 72 = C5, etc.) in the order they occur.
My code is as follows. All it does is very simply open a file as a byte array and read everything out as hex:
byte[] MIDI = File.ReadAllBytes("TestMIDI.mid");
foreach (var element in MIDI) {
    string b = Convert.ToString(element, 16);
    Debug.WriteLine(b);
}
All TestMIDI.mid contains is one note on C5. Here's a hex dump of it. Using this info, I'm trying to find the simple hex value for Note On (0x9, or just 9 in the dump), but there aren't any. I can find a few 72's, but there are 3, which doesn't make any sense to me (note on, note off, then what?).
This is my first attempt at parsing MIDI as a file and using hex dumps (are they even called that?), so I'm sorry if I'm heading in the completely wrong direction. All I need is to get the notes that play, and in what order. I don't need timing or anything fancy at all. The reason behind this, if it matters, is to then generate code in a different language to be played out of a speaker, very similar to the beep command on *nix. Because of this, I don't want to use any frameworks that 1) I didn't write myself, so I wouldn't really learn anything, and 2) do far more than I need, making the framework heavier than my own code.
The accepted answer is not a solution to the problem; it will not work in the general case. I'll list several cases where its code either won't work or will fail, ordered by probability, most probable first.
False positives. MIDI files contain a lot of data structures where you can find a byte with the value 144, and these structures are not Note On events. For real MIDI files you'll get a bunch of "notes" that are not notes at all, just random values within the file.
Channels other than 0. Most modern MIDI files contain several track chunks, each typically holding events for a specific MIDI channel (0 to 15). 144 (90 in hex) represents a Note On event for channel 0 only, so you are going to miss a lot of Note On events on the other channels.
Running status. MIDI files make heavy use of running status. This technique lets the writer omit the status byte of consecutive events of the same type, which means the status byte 144 may be written only once, for the first Note On event, and you will not find it again in the file.
144 is the last byte in the file. A MIDI file can end with this value, for example if a custom chunk is the last chunk in the file, or if a track chunk doesn't end with an End of Track event (which is corruption according to the MIDI file specification, but a possible scenario in the real world). In this case you'll get an IndexOutOfRangeException on MIDI[i+1].
Thus, you should never search for a specific byte value to find a semantic data structure in a MIDI file. Use one of the .NET libraries available on the Internet instead. For example, with DryWetMIDI you can use this code:
IEnumerable<Note> notes = MidiFile.Read(filePath)
.GetNotes();
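To get just the note numbers in the order they occur, you can iterate that collection. A minimal sketch, assuming the NoteNumber and Time properties and the namespaces of a recent DryWetMIDI version (they have moved between versions, so check the library docs):

using System;
using System.Linq;
using Melanchall.DryWetMidi.Core;
using Melanchall.DryWetMidi.Interaction;

var notes = MidiFile.Read("TestMIDI.mid").GetNotes();
foreach (var note in notes.OrderBy(n => n.Time))
    Console.WriteLine(note.NoteNumber);   // prints 72 once for the single C5 in the test file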
To do this right, you'll need at least some semblance of a MIDI parser. Searching for 0x9 events is a good start, but a Note On (0x9) with a velocity of 0 is effectively a Note Off. The byte can also appear inside other events (meta events, MPQN events, delta times, etc.), so you'll get false positives. You need something that actually understands the MIDI file format to do this accurately.
Look for a library, write your own, or port an open-source one. Mine is in Java if you want to look.
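If you do write your own, the outer structure is simple: a Standard MIDI File is a sequence of chunks, each a 4-byte ASCII id followed by a 4-byte big-endian length. A rough sketch of walking that structure in C# is below; extracting the actual Note On events additionally needs variable-length delta times, running status, and meta/sysex handling, which is exactly where naive byte scanning goes wrong.

using System;
using System.IO;
using System.Text;

byte[] midi = File.ReadAllBytes("TestMIDI.mid");
int pos = 0;
while (pos + 8 <= midi.Length)
{
    string id = Encoding.ASCII.GetString(midi, pos, 4);           // "MThd", "MTrk", ...
    int length = (midi[pos + 4] << 24) | (midi[pos + 5] << 16)
               | (midi[pos + 6] << 8)  |  midi[pos + 7];          // chunk lengths are big-endian
    Console.WriteLine($"{id}: {length} bytes");
    pos += 8 + length;                                            // jump to the next chunk header
}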
I'm testing out the new unified speech engine on Azure, and I'm working on a piece where I'm trying to transcribe a 10 minute audio file. I've created a recognizer with CreateSpeechRecognizerWithFileInput, and I've kicked off continuous recognition with StartContinuousRecognitionAsync. I created the recognizer with detailed results enabled.
In the FinalResultsReceived event, there doesn't seem to be a way to access the audio offset in the SpeechRecognitionResult. If I do this though:
string rawResult = ea.Result.ToString(); // can get access to the raw value this way
Regex r = new Regex(@".*Offset"":(\d*),.*");
int offset = Convert.ToInt32(r.Match(rawResult)?.Groups[1]?.Value);
Then I can extract the offset. The raw result looks something like this:
ResultId:4116b361141446a98f306fdc11c3a5bd Status:Recognized Recognized text:<OK, so what's your think it went well, let's look at number number is 104-828-1198.>. Json:{"Duration":129500000,"NBest":[{"Confidence":0.887861133,"Display":"OK, so what's your think it went well, let's look at number number is 104-828-1198.","ITN":"OK so what's your think it went well let's look at number number is 104-828-1198","Lexical":"OK so what's your think it went well let's look at number number is one zero four eight two eight one one nine eight","MaskedITN":"OK so what's your think it went well let's look at number number is 104-828-1198"}],"Offset":6900000,"RecognitionStatus":"Success"}
The challenge there is that the Offset is sometimes zero, even for cases where it's a nonzero file index, so I'll get zeroes in the middle of a recognition stream.
I also tried submitting the same file through the batch transcription API, which gives me a different result entirely:
{
"RecognitionStatus": "Success",
"Offset": 531700000,
"Duration": 91300000,
"NBest": [{
"Confidence": 0.87579143,
"Lexical": "OK so what's your think it went well let's look at number number is one zero four eight two eight one",
"ITN": "OK so what's your think it went well let's look at number number is 1048281",
"MaskedITN": "OK so what's your think it went well let's look at number number is 1048281",
"Display": "OK, so what's your think it went well, let's look at number number is 1048281."
}
]
},
So I have three questions on this:
Is there a supported method to get the offset of a recognized section of a file in the recognizer API? The SpeechRecognitionResult doesn't expose this, nor does the Best() extension.
Why is the offset coming back as 0 for a segment part way through the file?
What are the units for the offsets in the bulk recognition and file recognition APIs, and why are they different? They don't appear to be ms or frames, at least from what I've found in Audacity. The result I posted was from roughly 59s into the file, which is roughly 800k samples.
Chris,
Thanks for your feedback. To your questions,
1) The offset as well as the duration have been added to the API. The next release (coming very soon) will allow you to access both properties. Please stay tuned.
2) This is probably due to a different recognition mode being used. We will also fix that in the next release.
3) The time unit for both APIs is 100 ns (one tick). Please also note that batch transcription uses a different model than online recognition, so the recognition results might differ slightly.
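(For reference, 100 ns is the same unit as a .NET tick, so TimeSpan converts an offset directly; the value below is the Offset from the JSON in the question.)

TimeSpan offset = TimeSpan.FromTicks(6900000);   // 100 ns units
Console.WriteLine(offset.TotalSeconds);          // 0.69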
Sorry for the inconvenience!
Thanks,
I am trying to read the data stored in an ICMT tag on a WAV file generated by a noise monitoring device.
The RIFF parsing code all seems to work fine, except for the fact that the ICMT tag seems to have data after the declared size. As luck would have it, it's the timestamp, which is the one absolutely critical piece of info for my application.
SYN is hex 16, i.e. a size of 22, which runs up to and including the NUL before the timestamp. The monitor documentation is no help; it says that the tag includes the time, but their example has the same issue.
It is the last tag in the enclosing list, and the size of the list does include it - does that mean it doesn't need a chunk ID? I'm struggling to find decent RIFF docs, but I can't find anything that suggests that's the case; also I can't see how it'd be possible to determine that it was the last chunk and so know to read it with no chunk ID.
Alternatively, the ICMT comment chunk is the last thing in the file - is that a special case? Can I just get the time by reading everything from the end of the declared length ICMT to the end of the file and assume that will always work?
The current parser behaviour is that the data is read, after the channel / dB information, as a chunk ID + size, and the parser then complains that there is not enough data left in the file to fulfil the request.
No, it would still need its own ID. No, being the last thing in the file is no special case either. What you're showing here is malformed.
Your current parser errors out correctly, as the next thing to be expected is again a 4-byte ID followed by 4 bytes for the length. The potential ID _10: is unknown and would be skipped, but interpreting 51:4 as the DWORD length of course asks for trouble.
The device is the culprit. Do you have other INFO fields which use NULL bytes? If not, then I assume the device is naive enough to consider a NULL the end of a string, despite itself producing strings with multiple NULLs.
Since I have encountered countless files not sticking to standards, I can only say your parser is too naive as well: it knows how long the encapsulating list is and thus could easily detect field lengths that no longer fit. It could then ignore garbage like that, or, in your case, offer the very specific option "add to last field".
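A rough sketch of that "add to last field" fallback, assuming you already have the bytes of the INFO LIST body in memory; the names and the little-endian size read via BitConverter are illustrative, not a reference implementation:

using System;
using System.Collections.Generic;
using System.Text;

static Dictionary<string, string> ReadInfoList(byte[] data, int pos, int listEnd)
{
    var fields = new Dictionary<string, string>();
    string lastId = null;
    while (pos + 8 <= listEnd)
    {
        string id = Encoding.ASCII.GetString(data, pos, 4);
        int size = BitConverter.ToInt32(data, pos + 4);        // RIFF sizes are little-endian
        if (size < 0 || pos + 8 + size > listEnd)
            break;                                             // declared size no longer fits: stop parsing chunks
        fields[id] = Encoding.ASCII.GetString(data, pos + 8, size).TrimEnd('\0');
        lastId = id;
        pos += 8 + size + (size & 1);                          // chunks are word-aligned
    }
    if (lastId != null && pos < listEnd)                       // leftover bytes: offer them to the last field
        fields[lastId] += Encoding.ASCII.GetString(data, pos, listEnd - pos).TrimEnd('\0');
    return fields;
}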
Given a sample of hexadecimal data, I would like to identify UNKNOWN sequences of bytes that are repeated throughout the sample. (Not searching for a known string or value) I am attempting to reverse engineer a network protocol, and I am working on determining data structures within the packet. As an example of what I'm trying to do (albeit on a smaller scale):
(af:b6:ea:3d:83:02:00:00):{21:03:00:00}:[b3:49:96:23:01]
{21:03:00:00}:(af:b6:ea:3d:83:02:00:00):01:42:00:00:00:00:01:57
And
(38:64:88:6e:83:02:00:00):{26:03:00:00}:[b3:49:96:23:01]
{26:03:00:00}:(38:64:88:6e:83:02:00:00):01:42:00:00:00:00:00:01
Obviously, these are easy to spot by eye, but patterns that are hundreds of chars into the data are not. I'm not expecting a magic bullet for the solution, just a nudge in the right direction, or even better, a premade tool.
I currently need this for a C# project, but I am open to any and all tools.
If you have no idea what you are looking for, you could get an idea of the layout of the data by performing a negative entropy analysis on a reasonably large sample of conversations, to find the length of the records/sub-records.
If the data is structured with repeated sequences of roughly the same length and content type you should see clusters of values with nearly the same negative entropy around the length of the record and sub records.
For example if you put a basic file with a lot of the same data through that, you should see values around the average record length with comparable negentropies (ex: if you use a CSV file with an average line length of 117 bytes, you might see 115, 116, 117 & 119 with the highest negentropy), and values around the most common field lengths with the same negentropy.
You might do a byte occurrence scan, to see which byte values are likely separators.
There is a free hex editor with sources which does that for you (hexplorer, in the Crypto/Find Pattern menu). You may have to change the default font through Options to actually see anything in the UI.
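If you would rather script it yourself, a brute-force n-gram count is often enough to surface repeated structures like the IDs in your example. A minimal sketch; the window size and threshold are arbitrary, so tune them to your captures:

using System;
using System.Collections.Generic;
using System.Linq;

static void FindRepeatedSequences(byte[] data, int windowSize, int minCount)
{
    var counts = new Dictionary<string, int>();
    for (int i = 0; i + windowSize <= data.Length; i++)
    {
        string key = BitConverter.ToString(data, i, windowSize);   // e.g. "AF-B6-EA-3D"
        counts[key] = counts.TryGetValue(key, out int c) ? c + 1 : 1;
    }
    foreach (var pair in counts.Where(p => p.Value >= minCount).OrderByDescending(p => p.Value))
        Console.WriteLine($"{pair.Key}  x{pair.Value}");
}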
Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.
For example
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be somewhere within the entire hex of the program. How could I go about taking my base signature and looking for partial matches, so that anything with, say, a 90% match gets flagged?
I would use a wildcard, but that wouldn't work for slightly more complex cases where the code is written a bit differently but the majority is the same. So is there a way I can do a percentage match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
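A minimal sketch of that idea: a classic Levenshtein implementation turned into a similarity percentage. In practice you would slide a window of roughly the signature's length across the file's hex (or raw bytes) rather than compare whole files, since the distance computation is O(n*m):

using System;

static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,      // deletion
                                        d[i, j - 1] + 1),     // insertion
                               d[i - 1, j - 1] + cost);       // substitution
        }
    return d[a.Length, b.Length];
}

// Flag anything that is at least 90% similar to the signature.
static bool Matches(string signatureHex, string candidateHex, double threshold = 0.90)
{
    int distance = Levenshtein(signatureHex, candidateHex);
    int maxLen = Math.Max(signatureHex.Length, candidateHex.Length);
    return 1.0 - (double)distance / maxLen >= threshold;
}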
First, your program bits should match EXACTLY or else it has been modified or is corrupt. Generally, you will store an MD5 hash of the original binary and check the MD5 against new versions to see if they are 'the same enough' (MD5 can't guarantee a 100% match).
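A small sketch of that hash check using System.Security.Cryptography; the file name and the stored hash are placeholders:

using System;
using System.IO;
using System.Security.Cryptography;

static string Md5Of(string path)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
        return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
}

// bool unchanged = Md5Of(@"C:\scan\target.exe") == knownGoodHash;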
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing: each time one runs, it modifies itself, meaning the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/substitute operations are needed to turn one string into another; the latter additionally counts transposing two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples in both articles.
Is there a library that I can use to perform a binary search in a very big text file (it can be 10 GB)?
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
In summary - I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library :)
As the lines are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter, e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
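A rough sketch of that scheme, assuming the sortable timestamp is the first 19 characters of each line (adjust KeyOf to your actual format). It returns the byte offset of the first line whose key is at or after the target, so a range query is just two of these lookups:

using System;
using System.IO;
using System.Text;

static long StartOfLine(FileStream fs, long pos)
{
    while (pos > 0)                         // walk back to the byte after the previous '\n'
    {
        fs.Position = pos - 1;
        if (fs.ReadByte() == '\n') break;
        pos--;
    }
    return pos;
}

static string ReadLineAt(FileStream fs, long lineStart)
{
    fs.Position = lineStart;
    var sb = new StringBuilder();
    int b;
    while ((b = fs.ReadByte()) != -1 && b != '\n')
        sb.Append((char)b);
    return sb.ToString().TrimEnd('\r');
}

static string KeyOf(string line) => line.Length >= 19 ? line.Substring(0, 19) : line;

static long FirstLineAtOrAfter(FileStream fs, string targetKey)
{
    long lo = 0, hi = fs.Length;
    while (lo < hi)
    {
        long lineStart = StartOfLine(fs, lo + (hi - lo) / 2);
        string key = KeyOf(ReadLineAt(fs, lineStart));
        if (string.CompareOrdinal(key, targetKey) < 0)
            lo = fs.Position;               // just past this line's '\n': too early, skip the line
        else
            hi = lineStart;                 // this line (or an earlier one) is the answer
    }
    return lo;                              // offset of the first matching line, or fs.Length
}

// long pos = FirstLineAtOrAfter(fs, "2001-02-10 10:35:02");  // then read forward from pos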
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. How feasible that is really depends on how long a line of text is on average: given 1000 bytes per line and 8 bytes per Int64, you'd be looking at around (10,000,000,000 / 1000 * 8) = 80 MB. Big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that scans to the file offset and reads the data.
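A rough sketch of those two steps. The 19-character leading timestamp used as the sort key, and the file name, are assumptions about the log format; the "item" passed to BinarySearch is a dummy because the real target is captured by the comparer:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// 1. Scan once, remembering the byte offset at which every line starts.
var lineOffsets = new List<long> { 0 };
var fs = File.OpenRead("big.log");
int b;
while ((b = fs.ReadByte()) != -1)
    if (b == '\n' && fs.Position < fs.Length)
        lineOffsets.Add(fs.Position);

// 2. Binary search the offsets with a comparer that seeks to each probed offset and reads the line.
int index = lineOffsets.BinarySearch(0, new OffsetComparer(fs, "2001-02-10 10:35:02"));
long lineStart = index >= 0
    ? lineOffsets[index]                                   // exact hit
    : lineOffsets[Math.Min(~index, lineOffsets.Count - 1)]; // otherwise the first later line

class OffsetComparer : IComparer<long>
{
    private readonly FileStream _fs;
    private readonly string _targetKey;
    public OffsetComparer(FileStream fs, string targetKey) { _fs = fs; _targetKey = targetKey; }

    public int Compare(long offset, long ignored)
    {
        _fs.Position = offset;
        var sb = new StringBuilder();
        int c;
        while ((c = _fs.ReadByte()) != -1 && c != '\n') sb.Append((char)c);
        string key = sb.ToString();
        if (key.Length > 19) key = key.Substring(0, 19);   // leading timestamp as the sort key
        return string.CompareOrdinal(key, _targetKey);
    }
}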
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating an "index" file:
Scan the initial file and record the datetime part of each line plus its position in the original file (this is why the file has to be pretty static), encoding them somehow, for example: unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have an index file with consistent fixed-width "lines".
Perform a binary search on that file (you may need to be a bit creative in order to achieve a range search) and get the relevant location(s) in the original file.
Read directly from the original file starting at the given location / read the given range.
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time creating the index file than you save on queries.
Btw, working with this index file doesn't require the data file to be sorted. Since log files tend to be append-only, and already sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file. This way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply append their EOL positions to the index file.
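A small sketch of that fixed-width EOL index: 10 digits plus '\n' per record, so the start of data line i sits at byte i * 11 of the index file. The file names are placeholders:

using System;
using System.IO;
using System.Text;

// Build the index: one zero-padded line-start offset per 11-byte record.
using (var data = File.OpenRead("big.log"))
using (var index = File.Create("big.log.idx"))
{
    void WriteRecord(long offset)
    {
        byte[] rec = Encoding.ASCII.GetBytes(offset.ToString("D10") + "\n");   // exactly 11 bytes
        index.Write(rec, 0, rec.Length);
    }
    WriteRecord(0);                                   // the first line starts at offset 0
    int b;
    while ((b = data.ReadByte()) != -1)
        if (b == '\n' && data.Position < data.Length)
            WriteRecord(data.Position);               // the next line starts right after '\n'
}

// Random access into the index while binary searching the data file:
static long LineStart(FileStream index, long lineNumber)
{
    var buf = new byte[10];
    index.Position = lineNumber * 11;
    index.Read(buf, 0, 10);
    return long.Parse(Encoding.ASCII.GetString(buf));
}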
The List<T> class has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx