Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 7 years ago.
Improve this question
Basically I want to check if a String is a Sentence ("Hello, I am Me!") or Symbol Spam ("HH,,,{''{"), without using the number of symbols as a factor as much as possible. Right now it just detects based on a counter of symbols, but when someone says something with lots of punctuation, they get kicked.
Help?
If the number of symbols in the text is not sufficient, and you don't want to use something too fancy (or bought) could I suggest implementing one or more of these further steps (of increasing difficulty):
Make a count of all A-Za-z and space characters in the string and make a ratio of this to the count of symbols - so if they write a sentence then !!!!!!!!!!!!! at the end it still doesn't snag as the ratio is high enough.
If this still isn't discerning enough, add a further check if you pass item 1...
Count numbers of consecutive A-Za-z characters in the string - work out the average length of these 'words' - if the average is too short then it is probably spam.
These can be done in RegEx reasonably easily - If you want more sophistication then you have to use something written by someone else that has much more developed statistical methods (or start reading lexographical university papers that are beyond me!)
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I'm returning a string from an API that has a length of 45 characters. There is one word that is unique for one condition that doesn't appear in the other condition.
I'm wondering if using string.contains() is faster performance-wise than comparing the whole string with string.equals() or string == "blah blah".
I don't know the inner workings of any of these methods, but logically, it seems like contains() should be faster because it can stop traversing the string after it finds the match. Is this accurate? Incidentally, the word I want to check is the first word in the string.
I agree with D Stanley (comment). You should use String.StartsWith()
That said, I also don't know the inner working of each method either, but I can see your logic. However "String.Contains()" may still load the entire string before processing it, in which case the performance difference would be very small.
As a final point, with a string length of only 45 characters, the performance difference should me extremely minute. I was shocked when I wrote a junky method to substitute characters and found that is processes ~10kb of text in a fraction of a blink of the eye. So unless you're doing some crazy handling else wise in your app, it shouldn't matter much.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I am looking for a way to take phone numbers Input from an end user, given the various ways that can be formatted, and reformat it in a more standardized way. My end users work with only US phone numbers. Looking at the C# port of google's LibPhoneNumber, I assume it was made for exactly this task, but I can't figure out its usage.
My goals for output is to standardize on 999-999-9999 format.
So if the user inputs (516)666-1234, I want to output 516-666-1234.
If the users inputs 5166661234, I want to output 516-666-1234.
If possible, if the user inputs (516)666-1234x102, I'd like the output to be 516-666-1234x102
If the library can't handle extensions, I'll deal with that problem externally.
What calls do I have to make to produce the desired output?
Alternatively, can you provide a link that provides the answer to my question?
You have to combine RegEx (to string out non-numeric fields) and string formatting. Note that string.Format of a number is tricky, so we format the 10-digit phone number and then append the extension.
public string getPhoneNumber(string phone) {
string result = System.Text.RegularExpressions.Regex.Replace(phone, #"[^0-9]+", string.Empty);
if (result.Length <= 10) {
return Double.Parse(result).ToString("###-###-####");
} else {
return Double.Parse(result.Substring(0,10)).ToString("###-###-####") +
"x " + result.Substring(10, result.Length-10);
}
}
To do this right, I'd want to check for "1" at the start of my digits and ditch it. I'd want to check for < 10 characters and make some assumptions about area code, etc.
But - this is a good start.
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Improve this question
I have a string "Building1Floor2" and it's always in that format, how do I cleanly get the building number (e.g. 1) and floor number. I'm thinking I need a regex, but not entirely sure that's the best way. I could just use the index if the format stays the same, but if I have have a high floor number e.g. 100 it will break.
P.S. I'm using C#.
Use a regex like this:
Building(\d+)Floor(\d+)
Regex would be an ok option here if "Building" and "Floor" could change. e.g.: "Floor1Room23"
You could use "[A-Za-z]+([0-9]{1,})[A-Za-z]+([0-9]{1,})"
With those groupings, $1 would now be the Building number, and $2 would be Floor.
If "Building" and "Floor" never changed, however, then regex might be overkill.. you could use a string split
Find the index of the "F" and substring on that.
int first = str.IndexOf("F") ;
String building = str.substring(1, first);
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I need to create simple search engine for my application. Let's simplify it to the following: we have some texts (a lot) and i need to search and show relevant results.
I've based on this great article extend some things and it works pretty well for me.
But i have problem with stemming words to terms. For example words "annotation", "annotations" etc. will be stemmed to "annot", but imagine you try search something, and you will see unexpected results:
"anno" - nothing
"annota" - nothing
etc.
Only word "annot" will give relevant result. So, how should i improve my search to give expected results? Because "annot" contains "anno" and "annota" is slightly more than "annot". Using contains all the time obviously isn't the solution
If in first case i can use some Ternary search tree, in second case i don't know what to do.
Any ideas would be very helpful.
UPDATE
oleksii has pointed me to n-grams here, which may works for me, but i don't know how to properly index n-grams.
So the Question:
Which data structure would be the best for my needs
How properly index my n-grams
Stemming perhaps isn't much relevant here. Stemming will convert a plural to a singular form.
Given you have a tokeniser, a stemmer and a cleaner (to remove stop words, perhaps punctuation and numbers, short words etc) what you are looking at is a full-text search. I would advice you to get an off-the-shelf solution (like Elasticsearch, Lucene, Solr), but if you fancy a DIY approach I can suggest the following naive implementation.
Step 1
Create a search-orientated tokeniser. One example would be an n-gram tokeniser. It will take your word and split into the following sequences:
annotation
1 - [a, n, o, t, a, i]
2 - [an, nn, no, ot, ...]
3 - [ann, nno, not, ota, ...]
4 - [anno, nnot, nota, otat, ...]
....
Step 2
Sort n-grams for more efficient look-up
Step 3
Search n-grams for exact match using binary search
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I have a situation on my C programming here and just wondering whether my solution is the correct way:
I have a LED display with particle count sensor and will show 6 digit of seven segment numbers as the count value. The sensor will give voltage input value. The input is from 0V to 10V. So the range of 0V-10V need to be shown in the display as 000000 to 999999 count.
My solution is:
Display number = Input voltage * 99999.9
For example:
Display number = 10.000*99999.9=999999
Display number = 5.500*99999.9=549999
Display number = 2.300*99999.9=229999
Is this the correct solution? I notice that I will get a lot of 9 on the display value.
The most usable and user friendly solution is to ignore the fact that your most significant digit is capable of displaying up to 9 and simply multiply by 10000 unless you desperately need the maxim resolution in which case simply use a scale factor of 100000 and document that your range is 0-9.99999.
My reasoning is that it is better to either loose one digit in the accuracy across the whole range or clip just the maximum value than to have an error across the entire range.