Replace values located close to themselves with the mean value - c#

I'm not sure if SO is the proper place for asking this, if it's not, I will remove the question and try it in some other place. Said that I'm trying to do the following:
I have a List<double> and want to replace the block of values whose values are situated very close (say 0.75 in this example) to a single value, representing the mean of the replaced values.
The values that are isolated, or alone, should not be modified.
Also the replaced block can't be longer than 5.
Computing the mean value for each interval from 0, 5, 10.. would not provide the expected results.
It happened many times that LINQ power surprised me gladly and I would be happy if someone could guide me in the creation of this little method.
What I've thought is to first find the closest values for each one, calculate the distance, if the distance is less than minimum (0.75 in this example) then assign those values to the same block.
When all values are assigned to their blocks run a second loop that replaces each block (with one value, or many values) to its mean.
The problem that I have in this approach is to assign the "block": if several values are together, I need to check if the evaluating value is contained in another block, and if it so, the new value should be in that block too.
I don't know if this is the right way of doing this or I'm over complicating it.
EDIT: the expected result:
Although you see two axes only one is used, the List is 1D, I should have drawn only the X axis.
The length of the lines that are represented is irrelevant. It's just to mark on the axis where the value is situated.

It turns out that MSDN has already done this, and provided an in-depth example application with code:
Data Clustering - Detecting Abnormal Data Using k-Means Clustering

Related

Printing Different Patterns in Single Program

Are there solutions for the following problem?
A user enters an X coordinate, Y coordinate, length, and (optional)
number. If a number was entered, print a straight line with the
specified length, followed by the (x,y) coordinates. If n=2, print
bisecting lines with the specified length. If n=3, print a triangle
where the lines are the specified length.
Assuming you want ASCII drawings, here's how this would work. Note that I'm just providing an outline of how this would work, since I don't want to answer the interview question for you. Also, this is not production-quality since there's no validation (or differentiation between lack of input and invalid input).
First, let's ask the user for their input. Console.ReadLine is how you do this. Since we'll be getting input four times, let's make it a method. The return type can be int since everything we're getting from the user is a number. We'll print the prompt (passed in as an argument) with Console.WriteLine and then return the result of Console.ReadLine after converting it to an int (do you need validation)? Since n is optional, maybe return something like -1 if the user doesn't enter anything.
Store the results so that we can use them later in our calculations. Use an if... else if (or switch) statement to determine if n was supplied by the user and take action accordingly. We can call different methods depending on whether we want to draw a line, two bisecting lines, or a triangle.
Is the actual drawing the problem? I'm having trouble understand exactly where you need help, and the drawing of the shapes would be more complicated. For now, though, this should get you on the right track.

Data Structure for storing X days of history of a variable

I have a variable that has a current value but when I change the value it first needs to store the past value in some data structure that will show me the past X many values.
This is to do all kinds of calculations on past values like an average of the most recent values and such.
My only idea was to use a queue for this and since I only need the past X values then I implemented a FixedSizedQueue that would automatically dequeue older values.
Since then I've found out I can't really access a random value in it at least in default implementations of queues it seems. But additionally that if one would make that work they would be slow and need to iterate over all values.
So I'm left wondering is there any way at all to do this efficiently? The only other way I can think off would be to have an array and simply implement some pushing feature that would move all elements by one index position. But that seems overly wasteful. If these are the only two options, which one would be better if I need to access each value in the data structure 20 times each time I change it, and the size would be 50 values stored?
This is a place where performance will matter a great deal since each variable being "recorded" will change at least a million times when iterating over the data I have so don't worry about me doing premature optimization. Thank you, I appreciate it!
You are looking for a ring buffer / circular buffer.
You can find a c# implementation here.

Non-Repetitive Random Alphanumeric Code

One of my clients wants to use a unique code for his items (long story..) and he asked me for a solution. The code will consist in 4 parts in which the first one is the zip code where the item is sent from, the second one is the supplier registration number, the third number is the year when the item is sent and the last part is a three division alphanumeric unique character.
As you can see the first three parts are static fields which will never change for the same sender in the same year. So we can say that the last part is the identifier part for that year. This part is 3-division alpahnumeric which means starting from 000 and ending with ZZZ.
The problem is that my client, for some reasonable reasons, wants this part to be not sequential. For example this is not what he wants:
06450-05-2012-000
06450-05-2012-001
06450-05-2012-002
...
06450-05-2012-ZZY
06450-05-2012-ZZZ
The last part should produced randomly like:
06450-05-2012-A17
06450-05-2012-0BF
06450-05-2012-002
...
06450-05-2012-T7W
06450-05-2012-22C
But it should also non-repetitive. So once a possible id is generated the possibility should be discarded from the selection pool.
I am looking for an effective way to do this.
If I only record selected possibilities and check a newly created one against them there is always a worst case possibility that it keeps producing already selected ones, especially near the end.
If I create all possibilities at once and record them in a table or a file it may take a while after every item creation because it will lookup for a non-selected record. By the way 26 letters + 10 digits means 46.656 possible combinations, and there is a chance that there may be a 4th divison added which means 1.679.616 possible combinations.
Is there a more effective way you can suggest? I will use C# for coding and MS SQL for databese..
If it doesn't have to be random, you could maybe simply choose a fixed but "unpredictable" addend which is relatively prime to 26 + 10 == 36 == 2²·3². This means, just choose a fixed addend divisible by neither 2 nor 3.
Then keep adding this fixed number to your previous serial number every time you need a new serial number. This is to be done modulo 46656 (or 1679616) of course.
Mathematics guarantees you won't get the same number twice (before no more "free" numbers are left).
As the addend, you could use const int addend = 26075 since it's 5 modulo 6.
If you expect to create far less than 36^3 entries for each zip-supplier-year tuple, you should probably just pick a random value for the last field and then check to see if it exists, repeating if it does.
Even if you create half of the maximum number of possible entries, new entries still have an expected value of only one failure. Assuming your database is indexed on the overall identifier, this isn't too great a price to pay.
That said, if you expect to use all but a few possible identifiers, then you should probably create all the possible records in advance. It may sounds like a high cost, but each space in memory storing an unused record will eventually store a real record.
I'd expect the first situation is more likely, but if not, or if there's some other combination of the two, please add a comment with some more information and I'll revise my answer.
I think options depend on the amount of the codes that are going to be used:
If you expect to use most of them within a year, then it is better to pre-generate. If done right, lookup should be really fast. And you are going to have 1.679.616 items per year in your DB anyway, so you will have to do such things right.
On the other hand, is it good that you are expecting to use most of them? It may leave you without codes if there are suddenly more items than expected.
If you expect to use only a small amount, then random+existence check might be a way to go, however it is unclear what amount it should be for that to be best (I am pretty sure it is possible to calculate that though).

Searching for partial substring within string in C#

Okay so I'm trying to make a basic malware scanner in C# my question is say I have the Hex signature for a particular bit of code
For example
{
System.IO.File.Delete(#"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
System.IO.File.Delete(#"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be within the entire Hex of the program - How could I go about taking my base signature and looking for partial matches that say have a 90% match therefore gets flagged.
I would do a wildcard but that wouldn't work for slightly more complex things where it might be coded slightly different but the majority would be the same. So is there a way I can do a percent match for a substring? I was looking into the Levenshtein Distance but I don't see how I'd apply it into this given scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threashold.
First, your program bits should match EXACTLY or else it has been modified or is corrupt. Generally, you will store an MD5 hash on the original binary and check the MD5 against new versions to see if they are 'the same enough' (MD5 can't guarantee a 100% match).
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chuncks. What is more interesting is that some viruses are self-morphing. This means that each time it runs, it modifies itself, meaning the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many add/delete operations are needed to turn one string into another; and the latter tells you how many add/delete/replace operations are needed to turn one string into another.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.

What is the fast way of getting an index of an element in an array? [duplicate]

This question already has answers here:
How to find the index of an element in an array in Java?
(15 answers)
Closed 6 years ago.
I was asked this question in an interview. Although the interview was for dot net position, he asked me this question in context to java, because I had mentioned java also in my resume.
How to find the index of an element having value X in an array ?
I said iterating from the first element till last and checking whether the value is X would give the result. He asked about a method involving less number of iterations, I said using binary search but that is only possible for sorted array. I tried saying using IndexOf function in the Array class. But nothing from my side answered that question.
Is there any fast way of getting the index of an element having value X in an array ?
As long as there is no knowledge about the array (is it sorted? ascending or descending? etc etc), there is no way of finding an element without inspecting each one.
Also, that is exactly what indexOf does (when using lists).
How to find the index of an element having value X in an array ?
This would be fast:
int getXIndex(int x){
myArray[0] = x;
return 0;
}
A practical way of finding it faster is by parallel processing.
Just divide the array in N parts and assign every part to a thread that iterates through the elements of its part until value is found. N should preferably be the processor's number of cores.
If a binary search isn't possible (beacuse the array isn't sorted) and you don't have some kind of advanced search index, the only way I could think of that isn't O(n) is if the item's position in the array is a function of the item itself (like, if the array is [10, 20, 30, 40], the position of an element n is (n / 10) - 1).
Maybe he wants to test your knowledge about Java.
There is Utility Class called Arrays, this class contains various methods for manipulating arrays (such as sorting and searching)
http://download.oracle.com/javase/6/docs/api/java/util/Arrays.html
In 2 lines you can have a O(n * log n) result:
Arrays.sort(list); //O(n * log n)
Arrays.binarySearch(list, 88)); //O(log n)
Puneet - in .net its:
string[] testArray = {"fred", "bill"};
var indexOffset = Array.IndexOf(testArray, "fred");
[edit] - having read the question properly now, :) an alternative in linq would be:
string[] testArray = { "cat", "dog", "banana", "orange" };
int firstItem = testArray.Select((item, index) => new
{
ItemName = item,
Position = index
}).Where(i => i.ItemName == "banana")
.First()
.Position;
this of course would find the FIRST occurence of the string. subsequent duplicates would require additional logic. but then so would a looped approach.
jim
It's a question about data structures and algorithms (altough a very simple data structure). It goes beyond the language you are using.
If the array is ordered you can get O(log n) using binary search and a modified version of it for border cases (not using always (a+b)/2 as the pivot point, but it's a pretty sophisticated quirk).
If the array is not ordered then... good luck.
He can be asking you about what methods you have in order to find an item in Java. But anyway they're not faster. They can be olny simpler to use (than a for-each - compare - return).
There's another solution that's creating an auxiliary structure to do a faster search (like a hashmap) but, OF COURSE, it's more expensive to create it and use it once than to do a simple linear search.
Take a perfectly unsorted array, just a list of numbers in memory. All the machine can do is look at individual numbers in memory, and check if they are the right number. This is the "password cracker problem". There is no faster way than to search from the beginning until the correct value is hit.
Are you sure about the question? I have got a questions somewhat similar to your question.
Given a sorted array, there is one element "x" whose value is same as its index find the index of that element.
For example:
//0,1,2,3,4,5,6,7,8,9, 10
int a[10]={1,3,5,5,6,6,6,8,9,10,11};
at index 6 that value and index are same.
for this array a, answer should be 6.
This is not an answer, in case there was something missed in the original question this would clarify that.
If the only information you have is the fact that it's an unsorted array, with no reletionship between the index and value, and with no auxiliary data structures, then you have to potentially examine every element to see if it holds the information you want.
However, interviews are meant to separate the wheat from the chaff so it's important to realise that they want to see how you approach problems. Hence the idea is to ask questions to see if any more information is (or could be made) available, information that can make your search more efficient.
Questions like:
1/ Does the data change very often?
If not, then you can use an extra data structure.
For example, maintain a dirty flag which is initially true. When you want to find an item and it's true, build that extra structure (sorted array, tree, hash or whatever) which will greatly speed up searches, then set the dirty flag to false, then use that structure to find the item.
If you want to find an item and the dirty flag is false, just use the structure, no need to rebuild it.
Of course, any changes to the data should set the dirty flag to true so that the next search rebuilds the structure.
This will greatly speed up (through amortisation) queries for data that's read far more often than written.
In other words, the first search after a change will be relatively slow but subsequent searches can be much faster.
You'll probably want to wrap the array inside a class so that you can control the dirty flag correctly.
2/ Are we allowed to use a different data structure than a raw array?
This will be similar to the first point given above. If we modify the data structure from an array into an arbitrary class containing the array, you can still get all the advantages such as quick random access to each element.
But we gain the ability to update extra information within the data structure whenever the data changes.
So, rather than using a dirty flag and doing a large update on the next search, we can make small changes to the extra information whenever the array is changed.
This gets rid of the slow response of the first search after a change by amortising the cost across all changes (each change having a small cost).
3. How many items will typically be in the list?
This is actually more important than most people realise.
All talk of optimisation tends to be useless unless your data sets are relatively large and performance is actually important.
For example, if you have a 100-item array, it's quite acceptable to use even the brain-dead bubble sort since the difference in timings between that and the fastest sort you can find tend to be irrelevant (unless you need to do it thousands of times per second of course).
For this case, finding the first index for a given value, it's probably perfectly acceptable to do a sequential search as long as your array stays under a certain size.
The bottom line is that you're there to prove your worth, and the interviewer is (usually) there to guide you. Unless they're sadistic, they're quite happy for you to ask them questions to try an narrow down the scope of the problem.
Ask the questions (as you have for the possibility the data may be sorted. They should be impressed with your approach even if you can't come up with a solution.
In fact (and I've done this in the past), they may reject all your possibile approaches (no, it's not sorted, no, no other data structures are allowed, and so on) just to see how far you get.
And maybe, just maybe, like the Kobayashi Maru, it may not be about winning, it may be how you deal with failure :-)

Categories