I'm gathering light statistics of EXIF data for a large collection of photos and I'm trying to find the simplest way (i.e. performance doesn't matter) of translating Exposure Time values to/from usable data. There is (as far as I can find) no standard for what values camera manufacturers might use i.e. I can't just scan the web for random images and hard-code a map.
Here is a sample set of values I've encountered (" indicates seconds):
279", 30", 5", 3.2", 1.6", 1.3", 1", 1/1.3, 1/1.6, 1/2, 1/2.5, 1/3, 1/4, 1/5, 1/8, 1/13, 1/8000, 1/16000
Also take into consideration that I'd also like to find the average (mean) ... but it should be one of the actual shutter speeds collected and not just some arbitrary number.
EDIT:
By usable data I mean some sort of creative? numbering system that I can convert to/from for performing calculations. I thought about just multiplying everything by 1,000,000 except some fractions when divided are quite exotic.
EDIT #2:
To clarify, I'm using ExposureTime instead of ShutterSpeed because it contains photographer friendly values e.g. 1/50. ShutterSpeed is more of an approximation (which varies between camera manufacturers) and leads to values such as 1/49.
You want to parse them into some kind of time-duration object.
A simple way, looking at that data, would be to check whether " or / occurs: if " occurs, parse the value as seconds; if / occurs, parse it as a fraction of a second. I don't really understand what else you could mean. For an example you'd need to specify a language. Also, there might be a parser out there already.
Shutter speed is encoded in the EXIF metadata as an SRATIONAL, 32-bits for the numerator and 32-bits for the denominator. Sample code that retrieves it, using System.Drawing:
// Requires: using System; using System.Drawing; using System.Linq;
// Tag 37377 (0x9201) is ShutterSpeedValue, which is stored in APEX units.
var bmp = new Bitmap(@"c:\temp\canon-ixus.jpg");
if (bmp.PropertyIdList.Contains(37377)) {
    byte[] spencoded = bmp.GetPropertyItem(37377).Value;
    int numerator = BitConverter.ToInt32(spencoded, 0);
    int denominator = BitConverter.ToInt32(spencoded, 4);
    Console.WriteLine("Shutter speed = {0}/{1}", numerator, denominator);
}
Output: Shutter speed = 553859/65536, sample image retrieved here.
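The value stored under tag 37377 (ShutterSpeedValue) is in APEX units rather than seconds, so it needs one more step before it can be compared with exposure times. A minimal sketch, assuming the usual APEX relation Tv = -log2(seconds); the class and method names are made up for illustration:

```csharp
using System;

static class Apex
{
    // Converts an APEX shutter speed value (numerator/denominator, as read
    // from the SRATIONAL) to an exposure time in seconds.
    // Assumption: Tv = -log2(seconds), so seconds = 2^(-Tv).
    public static double ToSeconds(int numerator, int denominator)
    {
        if (denominator == 0)
            throw new ArgumentException("Denominator must not be zero.", nameof(denominator));

        double tv = (double)numerator / denominator;
        return Math.Pow(2.0, -tv);
    }
}
```

For the sample value, Apex.ToSeconds(553859, 65536) comes out at roughly 1/350 of a second.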
It seems there are three types of string you will encounter:
String with double quotes " for seconds
String with leading 1/
String with no special characters
I propose you simply test for these conditions and parse the value accordingly using floats:
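// Requires: using System; using System.Collections.Generic; using System.Globalization;
string[] InputSpeeds = new[] { "279\"", "30\"", "5\"", "3.2\"", "1.6\"", "1.3\"", "1\"", "1/1.3", "1/1.6", "1/2", "1/2.5", "1/3", "1/4", "1/5", "1/8", "1/13", "1/8000", "1/16000" };
List<float> OutputSpeeds = new List<float>();
foreach (string s in InputSpeeds)
{
    float ConvertedSpeed;
    if (s.Contains("\""))
    {
        // Whole or decimal seconds, e.g. 3.2"
        float.TryParse(s.Replace("\"", String.Empty), NumberStyles.Float, CultureInfo.InvariantCulture, out ConvertedSpeed);
        OutputSpeeds.Add(ConvertedSpeed);
    }
    else if (s.StartsWith("1/"))
    {
        // Fraction of a second, e.g. 1/2.5: parse the denominator
        float.TryParse(s.Remove(0, 2), NumberStyles.Float, CultureInfo.InvariantCulture, out ConvertedSpeed);
        // TryParse leaves 0 on failure, so guard against dividing by zero
        OutputSpeeds.Add(ConvertedSpeed == 0 ? 0F : 1 / ConvertedSpeed);
    }
    else
    {
        // No special characters: treat the value as plain seconds
        float.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out ConvertedSpeed);
        OutputSpeeds.Add(ConvertedSpeed);
    }
}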
If you want TimeSpans simply change the List<float> to List<TimeSpan> and instead of adding the float to the list, use TimeSpan.FromSeconds(ConvertedSpeed);
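Once the strings are converted to seconds, the mean asked about in the question can be snapped back to a real shutter speed by picking the collected value closest to the arithmetic mean. A minimal sketch; the class and method names are hypothetical, and "closest" here means the smallest absolute difference in seconds, which is only one possible choice:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ExposureStats
{
    // Returns the collected exposure time (in seconds) that lies nearest
    // to the arithmetic mean of all collected values.
    public static double NearestToMean(IReadOnlyList<double> seconds)
    {
        if (seconds.Count == 0)
            throw new ArgumentException("At least one value is required.", nameof(seconds));

        double mean = seconds.Average();
        double best = seconds[0];
        foreach (double s in seconds)
        {
            if (Math.Abs(s - mean) < Math.Abs(best - mean))
                best = s;
        }
        return best;
    }
}
```

With the values 1, 2 and 10 seconds the mean is about 4.33, so the method returns 2.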
So I'm a complete newb to Unity and C# and I'm trying to make my first mobile incremental game. I know how to format a variable from (e.g.) 1000 >>> 1k; however, I have several variables that can go up to decillion+, so I imagine having to check every variable's value separately up to decillion+ will be quite inefficient. Being a newb I'm not sure how to go about it, maybe a for loop or something?
EDIT: I'm checking if x is greater than a certain value. For example if it's greater than 1,000, display 1k. If it's greater than 1,000,000, display 1m...etc etc
This is my current code for checking if x is greater than 1000 however I don't think copy pasting this against other values would be very efficient;
if (totalCash > 1000)
{
totalCashk = totalCash / 1000;
totalCashTxt.text = "$" + totalCashk.ToString("F1") + "k";
}
So, I agree that copying code is not efficient. That's why people invented functions!
How about simply wrapping your formatting in a function, e.g. one named prettyCurrency?
Then you can simply write:
totalCashTxt.text = prettyCurrency(totalCash);
Also, instead of writing a ton of ifs, you can use the base-10 logarithm to determine the number of digits. Example in pure C# below:
using System;
class Program
{
    // Very simple example: throws an exception for numbers of 10^12 and above,
    // because there is no suffix beyond "G"
    static readonly string[] suffixes = {"", "k", "M", "G"};
    static string prettyCurrency(long cash, string prefix="$")
    {
        int k;
        if(cash == 0)
            k = 0; // log10 of 0 is not valid
        else
            k = (int)(Math.Log10(cash) / 3); // number of complete groups of three digits
        var divisor = Math.Pow(10,k*3); // scale factor for the chosen suffix
        var text = prefix + (cash/divisor).ToString("F1") + suffixes[k];
        return text;
    }
    static void Main()
    {
        Console.WriteLine(prettyCurrency(0));
        Console.WriteLine(prettyCurrency(333));
        Console.WriteLine(prettyCurrency(3145));
        Console.WriteLine(prettyCurrency(314512455));
        Console.WriteLine(prettyCurrency(31451242545));
    }
}
OUTPUT:
$0.0
$333.0
$3.1k
$314.5M
$31.5G
Also, you might think about introducing a new type that implements this function as its ToString() override.
EDIT:
I forgot about 0 in the input; that is now fixed. And indeed, as @Draco18s said in his comment, neither int nor long will handle really big numbers, so you can either use an arbitrary-precision type like BigInteger or switch to double, which loses precision as the numbers get bigger (e.g. 10000000000000000.0 + 1 is equal to 10000000000000000.0). If you choose the latter, you should change my function to handle numbers in the range (0.0, 1.0), for which log10 is negative.
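A double-based variant along those lines might look like the sketch below. It treats anything smaller than 1 in magnitude as having no suffix (which sidesteps the negative logarithm) and clamps to the last suffix instead of throwing. The class name and the suffixes beyond "G" are assumptions for illustration, not something taken from the game:

```csharp
using System;
using System.Globalization;

static class CurrencyFormat
{
    // Hypothetical suffix table; extend it as far as the game needs.
    static readonly string[] Suffixes = { "", "k", "M", "G", "T", "P", "E" };

    public static string Pretty(double cash, string prefix = "$")
    {
        int k = 0;
        double magnitude = Math.Abs(cash);
        if (magnitude >= 1.0)
        {
            // Number of complete groups of three digits, clamped to the table size.
            k = Math.Min((int)(Math.Log10(magnitude) / 3), Suffixes.Length - 1);
        }
        double scaled = cash / Math.Pow(10, k * 3);
        return prefix + scaled.ToString("F1", CultureInfo.InvariantCulture) + Suffixes[k];
    }
}
```

Values between 0 and 1 simply print without a suffix, for example CurrencyFormat.Pretty(0.5) gives "$0.5".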
The last couple of days have been full with making calculations and formulas and I'm beginning to lose my mind (a little bit). So now I'm turning to you guys for some insight/help.
Here's the problem: I'm working with Bluetooth beacons that are placed all over an entire floor of a building to make an indoor GPS showcase. You can use your phone to connect with these beacons, which results in receiving your longitude and latitude from them. These numbers are large float/double variables, looking like this:
lat: 52.501288451787076
lng: 6.079107635606511
The actual changes happen at the 4th and 5th position after the point. I'm converting these numbers to the Cartesian coordinate system using;
x = R * cos(lat) * cos(lon)
z = R *sin(lat)
Now the coordinates from this conversion are kind of solid. They are numbers I can work with. I use them in a 3D engine (Unity3d) to make a real-time map where you can see where someone is walking.
Now for the actual problem! These beacons are not entirely accurate. These numbers 'jump' up and down even when you lay your phone down. Ranging from, let's assume the same latitude as mentioned above, 52.501280 to 52.501296. If we convert this and use it as coordinates in a 3d engine, the 'avatar' for a user jumps from one position to another (more small jumps than large jumps).
What is a good way to cope with these jumping numbers? I've tried to check for big jumps and ignore those, but the jumps are still too big. A broader check will result in almost no movement, even when a phone is moving. Or is there a better way to convert the lat and long variables for use in a 3d engine?
If there is someone who has had the same problem as me, some mathematical wonder who can give a good conversion/formula to start with or someone who knows what I'm possibly doing wrong then please, help a fellow programmer out.
Moving Average
You could use this: (Taken here: https://stackoverflow.com/a/1305/5089204)
Attention: Please read the comments to this class as this implementation has some flaws... It's just for quick test and show...
public class LimitedQueue<T> : Queue<T> {
private int limit = -1;
public int Limit {
get { return limit; }
set { limit = value; }
}
public LimitedQueue(int limit)
: base(limit) {
this.Limit = limit;
}
public new void Enqueue(T item) {
if (this.Count >= this.Limit) {
this.Dequeue();
}
base.Enqueue(item);
}
}
Just test it like this:
var queue = new LimitedQueue<float>(4);
queue.Enqueue(52.501280f);
var avg1 = queue.Average(); //52.50128
queue.Enqueue(52.501350f);
var avg2 = queue.Average(); //52.5013161
queue.Enqueue(52.501140f);
var avg3 = queue.Average(); //52.50126
queue.Enqueue(52.501022f);
var avg4 = queue.Average(); //52.5011978
queue.Enqueue(52.501635f);
var avg5 = queue.Average(); //52.50129
queue.Enqueue(52.501500f);
var avg6 = queue.Average(); //52.5013237
queue.Enqueue(52.501505f);
var avg7 = queue.Average(); //52.5014153
queue.Enqueue(52.501230f);
var avg8 = queue.Average(); //52.50147
The limited queue will not grow... You just define the count of elements you want to use (in this case I specified 4). The 5th element pushes the first out and so on...
The average will always be a smooth sliding :-)
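A variant that avoids hiding Enqueue with "new" is to wrap a queue instead of inheriting from it and to keep a running sum. This is a minimal sketch, not the class from the linked answer; the name MovingAverage is made up, and it uses double rather than float because a float only carries about seven significant digits, which is close to the size of the jitter in these coordinates:

```csharp
using System;
using System.Collections.Generic;

sealed class MovingAverage
{
    private readonly Queue<double> _window = new Queue<double>();
    private readonly int _size;
    private double _sum;

    public MovingAverage(int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException(nameof(size));
        _size = size;
    }

    // Adds a sample, drops the oldest one once the window is full,
    // and returns the average of the samples currently in the window.
    public double Add(double sample)
    {
        _window.Enqueue(sample);
        _sum += sample;
        if (_window.Count > _size)
            _sum -= _window.Dequeue();
        return _sum / _window.Count;
    }
}
```

One instance per coordinate (latitude and longitude) would be fed every new reading, and the returned value used for the avatar position. A larger window smooths more but lags further behind real movement.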
I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
Console.WriteLine("Performing KNN with K = "+K);
var testResults = new int[testSamples.Count()];
var testNumber = testSamples.Count();
var trainNumber = trainSamples.Count();
// Declaring these here so that I don't have to 'new' them over and over again in the main loop,
// just to save some overhead
var distances = new double[trainNumber][];
for (var i = 0; i < trainNumber; i++)
{
distances[i] = new double[2]; // Will store both distance and index in here
}
// Performing KNN ...
for (var tst = 0; tst < testNumber; tst++)
{
// For every test sample, calculate distance from every training sample
Parallel.For(0, trainNumber, trn =>
{
var dist = GetDistance(testSamples[tst], trainSamples[trn]);
// Storing distance as well as index
distances[trn][0] = dist;
distances[trn][1] = trn;
});
// Sort distances and take top K (?What happens in case of multiple points at the same distance?)
var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);
// Do a 'majority vote' to classify test sample
var yea = 0.0;
var nay = 0.0;
foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}
if (yea > nay)
testResults[tst] = 1;
else
testResults[tst] = 0;
}
return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
var distance = 0.0;
// assume sample1 and sample2 are valid i.e. same length
for (var i = 0; i < sample1.Count; i++)
{
var temp = sample1[i] - sample2[i];
distance += temp * temp;
}
return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it would also scale to larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time method-by-method or even line-by-line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated to a delegate call (see here or here) which may be hitting you in the "Parallel.For" callback.
Similarly, there is a small performance penalty for indexing through an array using its IList interface. You might consider declaring the arguments to "GetDistance()" explicitly as double[].
How large is K as compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by using a SortedSet and removing the largest element when the set size exceeds K.
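The SortedSet idea from the last point could be sketched as follows. The class and method names are hypothetical, and the code assumes C# 7 or later for value tuples. It keeps at most k entries, so each test sample costs O(n log k) instead of a full O(n log n) sort:

```csharp
using System;
using System.Collections.Generic;

static class TopK
{
    // Returns the indices of the k smallest distances, nearest first.
    public static List<int> KSmallest(IReadOnlyList<double> distances, int k)
    {
        var result = new List<int>();
        if (k <= 0) return result;

        // Ordered by distance, then by index, so equal distances stay distinct.
        var best = new SortedSet<(double Dist, int Index)>();
        for (int i = 0; i < distances.Count; i++)
        {
            if (best.Count < k)
            {
                best.Add((distances[i], i));
            }
            else if (distances[i] < best.Max.Dist)
            {
                // The new point is closer than the current worst neighbour.
                best.Remove(best.Max);
                best.Add((distances[i], i));
            }
        }

        foreach (var entry in best) result.Add(entry.Index);
        return result;
    }
}
```

Ties at the same distance are resolved in favour of the lower training index here, which is an arbitrary choice.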
I have the following code that I ran in LinqPad:
void Main()
{
List<VariableData> outputVariableData =
new List<VariableData>();
for(int i = 1 ; i< 100; i ++)
{
outputVariableData.Add(new VariableData
{
Id = i,
VariableValue = .33
});
}
double result = outputVariableData.Average(dd=> dd.VariableValue);
double add = outputVariableData.Sum(dd=> dd.VariableValue)/99;
add.Dump();
result.Dump();
}
public class VariableData
{
public int Id { get; set; }
public double VariableValue{ get; set; }
}
The result is:
0.329999999999999
0.329999999999999
When I check the average of the same numbers in an Excel sheet with the formula =AVERAGE(A1:A101), it returns .33 as is.
I am actually drawing a chart with this data and the average value is shown on the chart, which makes the chart drawing absurd; the chart is not able to handle this kind of value.
I am a little confused about the difference between these two outputs. I suppose Excel automatically rounds the value. So I have a simple and slightly silly question: is the output of my extension method correct?
As noted, excel gives you a nice rounded version. For displaying of numbers this may be a very useful read: http://msdn.microsoft.com/en-us/library/dwhawy9k.aspx
There is also http://msdn.microsoft.com/en-us/library/f5898377 available.
What's the relevance of Standard Numeric Format Strings, you ask?
Easy: the problem isn't that the number is wrong. It's perfectly fine given the inaccuracies of floating point numbers. Displaying these numbers can be an issue, though, because the raw value is neither what we expect nor useful. However, if we spend a few seconds defining how we want our numbers presented...
void Main()
{
List<VariableData> outputVariableData =
new List<VariableData>();
for(int i = 1 ; i< 100; i ++)
{
outputVariableData.Add(new VariableData
{
Id = i,
VariableValue = .33
});
}
double result = outputVariableData.Average(dd=> dd.VariableValue);
double add = outputVariableData.Sum(dd=> dd.VariableValue)/99;
add.Dump();
add.ToString("P0").Dump();
add.ToString("N2").Dump();
result.Dump();
result.ToString("P0").Dump();
result.ToString("N2").Dump();
}
public class VariableData
{
public int Id { get; set; }
public double VariableValue{ get; set; }
}
Suddenly we get the output we want:
0.329999999999999
33 %
0.33
0.329999999999999
33 %
0.33
The problem now isn't trying to massage the numbers to perfection, or using an unusual datatype, but spending a few minutes figuring out what options are available and making use of them! It's easy to do and very useful in the long run :)
Of course, this should only be used for displaying data - if you need to round data for computational reasons then I suggest looking into Math.Round etc.
Hope this answers your Q.
Supplemental:
This is another option that works in a very similar vein: http://msdn.microsoft.com/en-us/library/0c899ak8.aspx It explains how to do things like:
void Main()
{
0.31929.ToString("0.##%").Dump();
0.ToString("0.##%").Dump();
1.ToString("0.##%").Dump();
}
Which results in:
31.93%
0%
100%
The computer stores floating point numbers in binary format and there is no exact representation of 0.33
Decimal   Binary
2         10
0.5       0.1
0.25      0.01
0.33      0.01010100011110101110000101000111101011100001010001111010111000...
http://www.exploringbinary.com/binary-converter/
This is simply how a computer represents the result of the computation. From a mathematical point of view, the numbers 0.33 and 0.32(9) are equal; Excel probably applies its own rounding.
http://www.cut-the-knot.org/arithmetic/999999.shtml
Floating points will give you rounding differences. It happens.
If you want the numbers to be exact, try using decimal instead of double for VariableValue.
This gives 0.33 for both add and result.
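To compare the two types side by side, here is a small sketch with made-up names that averages the same value n times as double and as decimal:

```csharp
using System;
using System.Linq;

static class AverageDemo
{
    // Binary floating point: 0.33 has no exact representation,
    // so the average can drift in the last digits.
    public static double AverageAsDouble(int n, double value)
    {
        return Enumerable.Repeat(value, n).Average();
    }

    // Decimal floating point: 0.33 is stored exactly.
    public static decimal AverageAsDecimal(int n, decimal value)
    {
        return Enumerable.Repeat(value, n).Average();
    }
}
```

AverageDemo.AverageAsDecimal(99, 0.33m) is exactly 0.33, while the double version may need Math.Round(value, 2) or a format string before it displays as 0.33. Decimal arithmetic is slower than double arithmetic, so for a chart label, formatting the double is usually enough.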
The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probability() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
var _map = new List<Tuple<char, char, char, char>>();
for (int i = 0; i < src.Length - 3; i++)
{
int j = i + 1;
int k = i + 2;
int l = i + 3;
_map.Add
(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
}
return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
var subset_x0 = map.Where(x => x.Item1 == x0);
var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);
int count_of_x0 = subset_x0.Count();
int count_of_x1_following = subset_x0_x1_following.Count();
int count_of_x2_following = subset_x0_x2_following.Count();
int count_of_x3_following = subset_x0_x3_following.Count();
decimal p1;
decimal p2;
decimal p3;
if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
{
p1 = e;
p2 = e;
p3 = e;
}
else
{
p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
p1 = (p1 * 100) + e;
p2 = (p2 * 100) + e;
p3 = (p3 * 100) + e;
}
//more calculations omitted
return _final;
}
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but its true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to whats described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total number of 4-tuples in the text.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasizing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple and then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
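The tally described here could look roughly like the sketch below. It is not the code from the project: the names are made up, the keys are plain strings rather than tuples, and the transition probability is estimated as count(gram) / count(gram without its last character), which is an assumption about the omitted calculation:

```csharp
using System;
using System.Collections.Generic;

static class NGramTally
{
    // Counts every substring of length 1..maxN in a single pass over src.
    public static Dictionary<string, int> Count(string src, int maxN = 4)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i < src.Length; i++)
        {
            for (int n = 1; n <= maxN && i + n <= src.Length; n++)
            {
                string key = src.Substring(i, n);
                counts.TryGetValue(key, out int current); // current is 0 when the key is missing
                counts[key] = current + 1;
            }
        }
        return counts;
    }

    // Estimates P(last character | preceding characters) from the tally.
    public static double Conditional(Dictionary<string, int> counts, string gram)
    {
        if (gram.Length < 2)
            throw new ArgumentException("Need at least two characters.", nameof(gram));

        string prefix = gram.Substring(0, gram.Length - 1);
        if (!counts.TryGetValue(prefix, out int prefixCount) || prefixCount == 0)
            return 0.0;

        counts.TryGetValue(gram, out int gramCount);
        return (double)gramCount / prefixCount;
    }
}
```

Each lookup is then a couple of hash probes rather than a scan of the text. Note that a prefix at the very end of the text is counted even though nothing follows it, which slightly lowers the estimates for short texts.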
In your probability method you are iterating the map several times. The Where calls are lazily evaluated, so each Count() re-enumerates the entire list through its chain of filters. Adding a .ToList() at the end of each Where would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited to the purpose of the probability method. You could create a one-pass version where the structure you store your data in calculates the tentative distribution on insert. That way, when you're done with the insert (which shouldn't be slowed down too much), you're done. Or you could do as the code below does and have a cheap calculation of the probability when you need it.
As an aside, you might want to take punctuation and whitespace into account. The first letter or word of a sentence and the first letter of a word give a clear indication of what language a given text is written in; by treating punctuation characters and whitespace as part of your distribution, you include those characteristics of the sample data. We did that some years back. Doing so, we showed that using just three characters was almost as exact as using more (we tested up to 7). We had no failures with three on our test data; "almost as exact" is an assumption, given that there must be some weird text where the lack of information would yield an incorrect result. The speed of three letters made that the best case.
EDIT
Here's an example of how I think I would do it in C#
class TextParser{
private Node Parse(string src){
var top = new Node(null);
for (int i = 0; i < src.Length - 3; i++){
var first = src[i];
var second = src[i+1];
var third = src[i+2];
var fourth = src[i+3];
var firstLevelNode = top.AddChild(first);
var secondLevelNode = firstLevelNode.AddChild(second);
var thirdLevelNode = secondLevelNode.AddChild(third);
thirdLevelNode.AddChild(fourth);
}
return top;
}
}
public class Node{
private readonly Node _parent;
private readonly Dictionary<char,Node> _children
= new Dictionary<char, Node>();
private int _count;
public Node(Node parent){
_parent = parent;
}
public Node AddChild(char value){
if (!_children.ContainsKey(value))
{
_children.Add(value, new Node(this));
}
var levelNode = _children[value];
levelNode._count++;
return levelNode;
}
public decimal Probability(string substring){
var node = this;
foreach (var c in substring){
if(!node.Contains(c))
return 0m;
node = node[c];
}
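// Relative frequency: this node's count over the total count of all children of its parent
decimal siblingTotal = 0m;
foreach (var sibling in node._parent._children.Values) siblingTotal += sibling._count;
return node._count / siblingTotal;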
}
public Node this[char value]{
get { return _children[value]; }
}
private bool Contains(char c){
return _children.ContainsKey(c);
}
}
the usage would then be:
var top = Parse(src);
top.Probability("test");
I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient since you would be accessing each "level" (Item1...Item4) when calculating... and you would cache the result in the innermost Dictionary so next time you don't have to calculate at all..
Ok, I don't have time to work out details, but this really calls for
neural classifier nets (Just take any off the shelf, even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force
you could use tries (Patricia tries, a.k.a. radix trees) to make a space-optimized version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)
There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values. Your tuple based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the linq statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the linq statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
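The key layout in those comments can be sketched as below. Several details are assumptions made for the example: letters are mapped to 1..26 so that each fits in 5 bits (hence the factor 32), everything else maps to 0, and a plain Dictionary is used instead of SortedDictionary because no ordered traversal is needed here. One dictionary per n-gram length keeps keys of different lengths from colliding:

```csharp
using System;
using System.Collections.Generic;

static class PackedNGrams
{
    // Maps 'a'..'z' (case-insensitive) to 1..26; any other character becomes 0.
    static int Code(char c)
    {
        c = char.ToLowerInvariant(c);
        return (c >= 'a' && c <= 'z') ? c - 'a' + 1 : 0;
    }

    // key = char1 + char2 * 32 + char3 * 32 * 32 + ...
    public static int Pack(string gram)
    {
        if (gram.Length > 6)
            throw new ArgumentException("At most 6 characters fit in an int.", nameof(gram));

        int key = 0;
        for (int i = gram.Length - 1; i >= 0; i--)
            key = key * 32 + Code(gram[i]);
        return key;
    }

    // Tallies all n-grams of one fixed length n.
    public static Dictionary<int, int> Count(string src, int n)
    {
        var counts = new Dictionary<int, int>();
        for (int i = 0; i + n <= src.Length; i++)
        {
            int key = Pack(src.Substring(i, n));
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

For example, PackedNGrams.Pack("ab") is 1 + 2 * 32 = 65.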
Finally, a note on data types. You're using the decimal type. This is not an efficient type as there is no direct mapping to CPU native type and there is overhead in processing the data. Use a double instead, I think the precision will be sufficient. The most precise way will be to store the probability as two integers, the numerator and denominator and then do the division as late as possible.
The best approach here is to use sparse storage and to prune after, for example, every 10000 characters. The best storage structure in this case is a prefix tree: it allows fast calculation of probability, updating, and sparse storage. You can find more theory in this javadoc http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html