Making translations based on a database - c#

This is completely hypothetical at this point, but since I am new to C#, I wanted to ask for opinions on the better ways of approaching this. I have a program that looks for tags and compares them to a master list of tags. At the moment, each tag is read and registered as a 24-character string. The strings are fine for the program, but I would like the output to reference a database that translates each of these strings, so that when the final program lists the tags that were found and the ones that are missing, each tag has an appropriate name alongside it, and not just a complicated string of characters.
Since I am new, I would just like to see if anyone can give me ideas on how to handle this and possibly point me in the right direction to get started.
Thanks.
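One possible starting point, as a minimal sketch: load the translations into a dictionary once and look each tag up when producing the output. The class name, the tag values, and the friendly names below are all made up for illustration; in practice the dictionary would be filled from a database table rather than hard-coded.

```csharp
using System;
using System.Collections.Generic;

public static class TagNames
{
    // Hypothetical translations; a real program would load these from the database once at startup.
    private static readonly Dictionary<string, string> Names =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "E20034120123456789ABCDEF", "Forklift 3" },
            { "E2003412FFFF000011112222", "Pallet jack 7" },
        };

    // Returns the friendly name, or the raw tag string if no translation is known.
    public static string Translate(string tag)
    {
        string name;
        return Names.TryGetValue(tag, out name) ? name : tag;
    }
}
```

Falling back to the raw string means an untranslated tag still shows up in the output instead of being lost.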


Resolving a poor search on run-on tokens in a query

A little personal project of mine is to blindly produce a search engine from scratch without using any outside sources. This is mostly for a learning experience and I haven't had much trouble up until now, where I have both a dilemma and a tough problem.
Observe this case:
Suzy wants to search for "fuzzy bears". This is fine and functions as well as it can. However, Suzy screws up and types "fuzzybears". Right now, my search algorithm breaks down, since this is interpreted as a single token and not multiple tokens. Any combination of words that has even one occurrence of such a run-on term (glued tokens) causes a poor search result.
For scope, this is something I am writing using a combination of C# and T-SQL.
I've tried multiple solutions, but nothing has really come of them. First, I used a List to take the terms and create variations, but this was much too slow for my liking and required a lot more memory than I feel it should need.
I wanted to save search queries to a database for statistics and maybe to learn more about organically growing the algorithm, so maybe a way to handle these glued tokens in SQL could be a solution, but I have no clue how to start with something like that unless I used a cursor or some other slow solution.
I could take searches, save them to my database, create different combinations where some tokens are glued, and then have those glued tokens as terms to hit on? The issue with this solution is it takes up quite a bit of space and I won't always need these strings since spelling errors like this aren't all too common.
Mainly, what I need is speed. It doesn't really have to be pretty, but if it's fast and accurate then I'm happy even if it takes up a lot of disk space.
I'm not asking for solutions here, but if anyone can point me in a direction I can go, it would be greatly appreciated.
Consider this approach: since spaces, punctuation, and anything similar would screw up a search like this, remove all of those, convert to a common case (I prefer lowercase, but pick what you prefer), and then tokenize based on syllables, using roughly the same set of division rules as for hyphenating English words.
So, to search for answers that contain "Consider this approach:", you reduce the phrase to "considerthisapproach" and then tokenize as "con","sid","er","this","ap","proach". If con and sid and er appear next to each other, and in that order, you've found the word "consider".
This approach can be adapted for statistical matching too, so e.g. if at least 85% of syllables are found in the correct order, you consider it a close match, and maybe order the results by match % so more meaningful matches are at the top.
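A rough sketch of the normalization and the in-order matching described above. It assumes the syllable splitting has already been done elsewhere (the hyphenation rules are not shown), and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SyllableMatch
{
    // Strips everything except letters and digits, then lowercases the result.
    public static string Normalize(string text)
    {
        return new string(text.Where(c => char.IsLetterOrDigit(c)).ToArray()).ToLowerInvariant();
    }

    // Fraction of query syllables that occur in the document's syllables in the same order.
    // Assumption: both lists come from the same (not shown) syllable splitter.
    public static double MatchRatio(IList<string> query, IList<string> document)
    {
        if (query.Count == 0) return 0.0;
        int found = 0, pos = 0;
        foreach (string syllable in query)
        {
            int i = pos;
            while (i < document.Count && document[i] != syllable) i++;
            if (i < document.Count)
            {
                found++;
                pos = i + 1; // later syllables must appear after this one
            }
        }
        return (double)found / query.Count;
    }
}
```

A result could then be treated as a close match when MatchRatio returns 0.85 or more, and the results sorted by that ratio.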

Need to parse C# textbox for objectionable words

I'm part of a small "message board" type project being built in a C# Web Form. I need to parse the user-entered text for objectionable words. This is my first C# project and I'm not sure how to split the words in the textbox.
It's been requested that I make an XML config file to contain the words to be screened for. Ideally, I would like to do a fark.com style replace. I've never made an XML config file and I really just need a place to start. All the config file information I've found has not been particularly applicable to this scenario.
Edit:
I ended up using a .txt file and splitting it on whitespace, then parsing the textbox on whitespace and comparing words. The project leader wanted a config file, but I pitched him on the simple solution and we went for it. Thanks for the replies.
An XML file won't scale well, especially if accessed concurrently. You would be better off using a database engine for such a task.
Making an XML config file just to filter a bunch of words probably isn't the best way to go there, considering it's most-likely just going to be a giant list of strings...
If it's not, have a look at the XmlDocument class and the System.Xml namespace. I assume you're aware of the format for XML documents but, if not, here is a simple example. The format is pretty much open to whatever XML tags you want, but the XmlDocument class I linked you to does have some fairly annoying catches that you'll come across while implementing it.
In terms of splitting the user text, it's fairly easy to hide "bad" words in another string so I'm not sure String.Split() is even what you want either. You will probably want to Regex it.
With that said, I came across this blog post a while ago that offers a simple profanity filter for .NET using Regex. Perhaps it will suit your needs.
Depends on how large this "bad words list" will be, and whether you expect it to change.
If it's pretty static, I would load the list from your XML file into some kind of in-memory collection. Then for each line of text you receive, parse the line into words, and then check each word for its existence in the collection.
If it's going to change frequently, and you need to pick up on those changes quickly, then you want more random access... that means a database. Hitting an XML file repeatedly would be a performance drag.
Either way, split the string and react to each hit.
The string can be split up using something like:
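// Split on spaces; RemoveEmptyEntries drops the empty strings produced by repeated spaces.
string[] words = myLineOfText.Split(new String[] { " " }, StringSplitOptions.RemoveEmptyEntries);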
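Putting the in-memory collection and the per-word check together might look like the sketch below. It matches whole words with a regex instead of String.Split, so punctuation next to a word doesn't hide it; that is a swapped-in choice, not something the answers above prescribe. The class name, the word list, and the replacement text are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Replaces every whole word found in blockedWords with the replacement text.
    // Assumption: blockedWords was loaded once (from a file or database) into a set.
    public static string Censor(string text, ISet<string> blockedWords, string replacement)
    {
        return Regex.Replace(text, @"\w+",
            m => blockedWords.Contains(m.Value) ? replacement : m.Value);
    }
}
```

For example, with var blocked = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "darn" }, the call WordFilter.Censor("Well, DARN it.", blocked, "***") gives "Well, *** it.". This only catches whole words; bad words embedded in longer strings would need a different pattern.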

Read structure from file c#

I'm trying to read a structure of a text file in a certain way. The text file is kind of a user-friendly configuration file.
Current structure of file (structure can be changed if necessary):
info1=exampleinfo
info2=exampleinfo2
info3="example","example2","example3"
info4="example","example2","example3"
There is no real difficulty in getting the first two lines, but the latter two are more difficult. I need to put each of them into a separate string array that I can use. I could split the string, but the problem is that in the info4 array, the values can contain commas (this is all user input).
How to go about solving this?
The reason you're having trouble writing a parser is that you're not starting with a good definition of the file format. Instead of asking how you should parse it if there are commas, you should be deciding how to properly encode values with commas. Then parsing is simple.
If this file is written by non-technical users who can't be trusted with a complex format (like json), consider a format like:
info1=exampleinfo
info2=exampleinfo2
info3=example
example2
example3
info4=example
example2
example3
That is, don't mess around with quotes and commas. Users understand line breaks and spaces pretty well.
I'm 100% in favor of @DavidHeffernan's solution; JSON would be great. And @ScottMermelstein's solution of a program that builds the output is probably your best bet if possible, since it doesn't allow the user to make a mistake even if they wanted to.
However, if you need them to build the text file, and you're working with users who can't be trusted to put together valid JSON (it is a picky format), maybe try separating the values with a delimiter that the users won't type.
For example, pipes are always good, since practically nobody uses them:
info1=exampleinfo
info2=exampleinfo2
info3=example|example2|example3
info4=example|exam,ple2|example3
All you'd need is a rule that says their data cannot contain pipes. More than likely, the users would be ok with that.
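A minimal sketch of a parser for that pipe-delimited layout, assuming one key=value pair per line and no pipes inside values. The class name is hypothetical, and every value is returned as an array, so single values come back as one-element arrays.

```csharp
using System;
using System.Collections.Generic;

public static class PipeConfig
{
    // Parses lines of the form "key=value1|value2|..." into key -> values.
    public static Dictionary<string, string[]> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string[]>();
        foreach (string line in lines)
        {
            int eq = line.IndexOf('=');
            if (eq <= 0) continue; // skip blank lines and lines without a key
            string key = line.Substring(0, eq).Trim();
            // Only the first '=' separates key and value, so values may contain '=' and ','.
            result[key] = line.Substring(eq + 1).Split('|');
        }
        return result;
    }
}
```

It could be fed with File.ReadLines on the configuration file (the path is up to you); the info4 line above would then yield "example", "exam,ple2", and "example3".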

Parse numbers from large text, possibly without regex (performance critical)

I'm extremely familiar with regex, so before you all start answering with variations of \d+:
I want to know if there are alternatives to regex for parsing numbers out of a large text file.
I'm parsing through tons of huge files and need to do some group/location analysis on the positions of keywords. I'm now at the point where I need to start finding groups of numbers nested closely to my content of interest as well. I want to avoid regex if at all possible because this needs to be a speedy process.
It is possible to take chunks of a file and inspect them for the numbers of interest. That, however, would require more work and add hard-coded limits for searching (I'd like to avoid this).
I'm open to any suggestions.
UPDATE
Sorry for the lack of sample data. For HIPAA reasons I'd rather not even consider scrambling the text and posting it.
A great substitute would be the HTML source of any stackoverflow.com question page. Imagine I needed to grab the reputation (score) of all people that posted an answer to a question. This also means that the comma (,) is needed as well. I can't remove the html to simplify the content because I'm using some density analysis to weed out unrelated content. Removing the HTML would mix content too close together.
Unless the file is some sort of SGML, I don't know of any existing method (which is not to say there isn't one; I just don't know of it).
However, it's not to say that you can't create your own parser; you could eliminate some of the overheads of the .Net regex library by writing something that only finds ranges of numbers.
Fundamentally, I guess that that's all any library would do, at the most basic level.
Might help if you can post a sample of the sort of data you'll be processing?
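As a sketch of what such a hand-written scanner could look like: a single pass over the text that records the position and text of each run of digits, keeping a comma only when it sits between two digits (as in 12,345). The class name is hypothetical, and whether this actually beats a compiled Regex on your files is something to measure rather than assume.

```csharp
using System;
using System.Collections.Generic;

public static class NumberScanner
{
    // Returns (start index, text) for each number found in a single left-to-right pass.
    public static List<KeyValuePair<int, string>> FindNumbers(string text)
    {
        var results = new List<KeyValuePair<int, string>>();
        int i = 0;
        while (i < text.Length)
        {
            if (!IsDigit(text[i])) { i++; continue; }
            int start = i;
            while (i < text.Length)
            {
                if (IsDigit(text[i]))
                {
                    i++;
                }
                else if (text[i] == ',' && i + 1 < text.Length && IsDigit(text[i + 1]))
                {
                    i += 2; // consume the comma and the digit that follows it
                }
                else
                {
                    break;
                }
            }
            results.Add(new KeyValuePair<int, string>(start, text.Substring(start, i - start)));
        }
        return results;
    }

    // ASCII digits only; char.IsDigit would also accept other Unicode digits.
    private static bool IsDigit(char c)
    {
        return c >= '0' && c <= '9';
    }
}
```

The start index is kept so the numbers can be fed into the same position/density analysis as the keywords.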

C# - Show the differences when comparing strings

In my ASP.NET project, I have two strings (actually, they are stored in a Session object, and I call .ToString() on them).
This project is part of the free Japanese language exercises on my website (Italian only for now, so I won't link/spam).
For now I do an if (original == inputted.ToLower()), but I would like to compare the strings and highlight the differences on the screen,
like this:
original: hiroyashi
written by user: hiroyoshi
I was thinking of comparing the two strings and saving the differences in another variable, with HTML tags, and then showing it in a Literal control... but if there are many differences, or the input is shorter, how do I do that?
It looks like this needs a huge amount of coding... or not?
I seem to remember someone asking this not too long ago, and essentially they were pointed at difference engines.
A quick search on codeplex brings up:
http://www.codeplex.com/site/search?projectSearchText=diff
May be worth a hunt through some of those that come up - you may be able to plug something into your existing code?
Cheers,
Terry
John Resig wrote a JavaScript diff algorithm, but he's removed the page explaining what it does from his site. It's still available through the Google cache, though. Apologies if linking that is bad, John. It should do what you want; someone else took it, tweaked it, and put up an article about it here, complete with a test page.
I am not sure if this will be helpful, but this is how I would do it:
I would use a hash map and store all the words, separated by spaces, in it.
Then I would compare the input against the original using that map.
You can add HTML tags or whatever if they are different.
There is bound to be a performance issue with a large dictionary of words.
The coding itself would not be long, though.
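For short strings like the ones in the question, a character-level diff may be enough. The sketch below is not the hash map idea above; it uses a longest-common-subsequence table (the basic technique behind many diff tools) and wraps the characters of the user's input that don't line up with the original in <b> tags. The class name and the choice of <b> are assumptions, and characters missing from the input are skipped rather than shown.

```csharp
using System;
using System.Net;
using System.Text;

public static class DiffHighlighter
{
    // Returns the input as HTML, with characters outside the longest common
    // subsequence of original and input wrapped in <b>...</b>.
    public static string Highlight(string original, string input)
    {
        int n = original.Length, m = input.Length;
        // lcs[i, j] = length of the LCS of original[i..] and input[j..]
        var lcs = new int[n + 1, m + 1];
        for (int i = n - 1; i >= 0; i--)
            for (int j = m - 1; j >= 0; j--)
                lcs[i, j] = original[i] == input[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        var html = new StringBuilder();
        int a = 0, b = 0;
        while (b < m)
        {
            if (a < n && original[a] == input[b])
            {
                html.Append(WebUtility.HtmlEncode(input[b].ToString()));
                a++; b++;
            }
            else if (a < n && lcs[a + 1, b] >= lcs[a, b + 1])
            {
                a++; // character of the original that the user left out
            }
            else
            {
                html.Append("<b>").Append(WebUtility.HtmlEncode(input[b].ToString())).Append("</b>");
                b++;
            }
        }
        return html.ToString();
    }
}
```

With the example from the question, DiffHighlighter.Highlight("hiroyashi", "hiroyoshi") produces hiroy<b>o</b>shi, which can be assigned to the Literal control's Text property. The table needs (n+1) x (m+1) integers, which is fine for single words but not for long texts.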
