I'm part of a small "message board" type project being built in a C# Web Form. I need to parse the user-entered text for objectionable words. This is my first C# project and I'm not sure how to split the words in the textbox.
It's been requested that I make an XML config file to contain the words to be screened for. Ideally, I would like to do a fark.com style replace. I've never made an XML config file and I really just need a place to start. All the config file information I've found has not been particularly applicable to this scenario.
Edit:
I ended up using a .txt file and splitting it on whitespace, then parsing the textbox on whitespace and comparing words. The project leader wanted a config file, but I pitched him on the simple solution and we went for it. Thanks for the replies.
An XML file won't scale well, especially if accessed concurrently. You'd better be using a database engine for such a task.
Making an XML config file just to filter a bunch of words probably isn't the best way to go there, considering it's most-likely just going to be a giant list of strings...
If it's not, have a look at the XmlDocument Class and the System.Xml namespace I assume you're aware of the format for XML documents but, if not, here is a simple example. The format is pretty much open to whatever XML tags you want, but the XmlDocument class I linked you to does have some fairly annoying catches that you'll come across while implementing it.
In terms of splitting the user text, it's fairly easy to hide "bad" words in another string so I'm not sure String.Split() is even what you want either. You will probably want to Regex it.
With that said, I came across this blog post a while ago that offers a simple profanity filter for .NET using Regex. Perhaps it will suit your needs.
Depends on how large this "bad words list" will be, and whether you expect it to change.
If it's pretty static, I would load the list from your XML file into some kind of in-memory collection. Then for each line of text you receive, parse the line into words, and then check each word for its existence in the collection.
If it's going to change frequently, and you need to pick up on those changes quickly, then you want more random access...that means a database. Hitting an XML repeatedly would be a performance drag.
Either way, split the string and react to each hit.
The string can be split up using something like:
myLineOfText.Split(new String[] { " " }, StringSplitOptions.RemoveEmptyEntries);
Related
I am translating a string like this and I am wondering if there is a built in functionality so you can highlight a word, so you later can surround a specific word around an element like in the example below
What is shown when users translate
Read our Terms of Service here.
What is shown on the website
Read our Terms of Service here.
What you describe doesn't make sense in general, because any given word or phrase in the source language may get translated into a different number of words that may not even be contiguous.
The way this is typically done is that you simply include the markup in your string, and your translators deal with it. Any competent translation service can handle html markup.
Another option is to make the entire sentence the clickable text. The basic principle is to provide complete segments for translation. Attempting to deconstruct or reconstruct from parts is doomed to fail under localization.
I'm trying to read a structure of a text file in a certain way. The text file is kind of a user-friendly configuration file.
Current structure of file (structure can be changed if necessary):
info1=exampleinfo
info2=exampleinfo2
info3="example","example2","example3"
info4="example","example2","example3"
There is no real difficulty in getting the first two lines, but the latter two are more difficult. I need to put both in two seperate string arrays that I can use. I could use a split string, but the problem is in that in the info4 array, the values can contain comma's (this is all user input).
How to go about solving this?
The reason you're having trouble writing parser is that you're not starting with a good definition of the file format. Instead of asking how you should parse it if there are commas, you should be deciding how to properly encode values with commas. Then parsing is simple.
If this file is written by non-technical users who can't be trusted with a complex format (like json), consider a format like:
info1=exampleinfo
info2=exampleinfo2
info3=example
example2
example3
info4=example
example2
example3
That is, don't mess around with quotes and commas. Users understand line breaks and spaces pretty well.
I'm 100% in favor of #DavidHeffernan's solutions, JSON would be great. And #ScottMermelstein's solution of a program that builds the output - that's probably your best bet if possible, not allowing the user to make a mistake even if they wanted to.
However, if you need them to build the textfile, and you're working with users who can't be trusted to put together valid JSON, since it is a picky format, maybe try a delimiter that won't be used by the user, to separate values.
For example, pipes are always good, since practically nobody uses them:
info1=exampleinfo
info2=exampleinfo2
info3=example|example2|example3
info4=example|exam,ple2|example3
All you'd need is a rule that says their data cannot contain pipes. More than likely, the users would be ok with that.
I'm currently following a tutorial series for a Tile Engine which uses XML files to store conversations between NPCs. A topic it doesn't appear to cover (I have only quickly glanced through the subsequent videos) is how to prevent the user from either altering or knowing in advance what the NPC is going to say by opening the XML file easily with a generic text editor.
The 2nd point of being able to read future conversations is not a real issue but something I wanted to think about, so if that's hard to implement I am not too fussed at this point.
How would I go about making the XML uneditable? I know vaguely about CRC32's which can check file integrity which may be useful and I also think there might be better ways to go about that (i.e. not with a CRC32).
The most extreme action I can think of would be to create my own arbitrary encoding for the conversation data, but the usefulness of XML files deters me from that slightly, and with the tutorials I'm following teaching me a lot things I don't know, I would prefer not to defer too far away from them!
Just looking for a direction really, thanks!
Xml is in its fundamentals an open format, so I mean there is not way how to make xml uneditable.
But you can have a copy of xml document (or some of fingerprint of xml) on your server (or on endpoints of NPC conversation) and then you can compare if xml document was edited or no.
If document was edited, you cas replace it with backup version or say to endpoints, that xml document was corrupted...
Historically, many games wrap multiple resources into a single binary file.
You might put it in a ZIP file (and maybe change the file extension). That would allow you to avoid having an XML file with an obvious name as a temptation for your users :).
Ultimately, you're asking something similar to the DRM question. I don't know whether your platform has an answer to that. (E.g., "using RSA encryption" is not secure as such; your program still has to decrypt the data at some point using the appropriate key, etc).
I'm extremely familiar with regex before you all start answering with variations of: /d+
I want to know if there are alternatives to regex for parsing numbers out of a large text file.
I'm parsing through tons of huge files and need to do some group/location analysis on the positions of keywords. I'm now at the point where i need to start finding groups of numbers as well nested closely to my content of interest. I want to avoid regex if at all possible because this needs to be a speedy process.
It is possible to take chunks of a file to inspect for the numbers of interest. That however would require more work and add hard coded limits for searching. (i'd like to avoid this)
I'm open to any suggestions.
UPDATE
Sorry for the lack of sample data. For HIPAA reasons I'd rather not even consider scrambling the text and posting it.
A great substitute would be the HTML source of any stackoverflow.com question page. Imagine I needed to grab the reputation (score) of all people that posted an answer to a question. This also means that the comma (,) is needed as well. I can't remove the html to simplify the content because I'm using some density analysis to weed out unrelated content. Removing the HTML would mix content too close together.
Unless the file is some sort of SGML, then I don't know of any method (which is not to say there isn't, I just don't know of one)
However, it's not to say that you can't create your own parser; you could eliminate some of the overheads of the .Net regex library by writing something that only finds ranges of numbers.
Fundamentally, I guess that that's all any library would do, at the most basic level.
Might help if you can post a sample of the sort of data you'll be processing?
I've not done much with linq to xml, but all the examples I've seen load the entire XML document into memory.
What if the XML file is, say, 8GB, and you really don't have the option?
My first thought is to use the XElement.Load Method (TextReader) in combination with an instance of the FileStream Class.
QUESTION: will this work, and is this the right way to approach the problem of searching a very large XML file?
Note: high performance isn't required.. i'm trying to get linq to xml to basically do the work of the program i could write that loops through every line of my big file and gathers up, but since linq is "loop centric" I'd expect this to be possible....
Using XElement.Load will load the whole file into the memory. Instead, use XmlReader with the XNode.ReadFrom function, where you can selectively load notes found by XmlReader with XElement for further processing, if you need to. MSDN has a very good example doing just that: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx
If you just need to search the xml document, XmlReader alone will suffice and will not load the whole document into the memory.
Gabriel,
Dude, this isn't exactly answering your ACTUAL question (How to read big xml docs using linq) but you might want to checkout my old question What's the best way to parse big XML documents in C-Sharp. The last "answer" (timewise) was a "note to self" on what ACTUALLY WORKED. It turns out that a hybrid document-XmlReader & doclet-XmlSerializer is fast (enough) AND flexible.
BUT note that I was dealing with docs upto only 150MB. If you REALLY have to handle docs as big as 8GB? then I guess you're likely to encounter all sorts of problems; including issues with the O/S's LARGE_FILE (>2GB) handling... in which case I strongly suggest you keep things as-primitive-as-possible... and XmlReader is as primitive as possible (and THE fastest according to my testing) XML-parser available in the Microsoft namespace.
Also: I've just noticed a belated comment in my old thread suggesting that I check out VTD-XML... I had a quick look at it just now... It "looks promising", even if the author seems to have contracted a terminal case of FIGJAM. He claims it'll handle docs of upto 256GB; to which I reply "Yeah, have you TESTED it? In WHAT environment?" It sounds like it should work though... I've used this same technique to implement "hyperlinks" in a textual help-system; back before HTML.
Anyway good luck with this, and your overall project. Cheers. Keith.
I realize that this answer might be considered non-responsive and possibly annoying, but I would say that if you have an XML file which is 8GB, then at least some of what you are trying to do in XML should be done by the file system or database.
If you have huge chunks of text in that file, you could store them as individual files and store the metadata and the filenames separately. If you don't, you must have many levels of structured data, probably with a lot of repetition of the structures. If you can decide what is considered an individual 'record' which can be stored as a smaller XML file or in a column of a database, then you can structure your database based on the levels of nesting above that. XML is great for small and dirty, it's also good for quite unstructured data since it is self-structuring. But if you have 8GB of data which you are going to do something meaningful with, you must (usually) be able to count on some predictable structure somewhere in it.
Storing XML (or JSON) in a database, and querying and searching both for XML records, and within the XML is well supported nowadays both by SQL stuff and by the NoSQL paradigm.
Of course you might not have the choice of not using XML files this big, or you might have some situation where they are really the best solution. But for some people reading this it could be helpful to look at this alternative.