Artificial Intelligence, Text Classifier [closed] - c#

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I am new to AI. I am working on an application that does text classification via machine learning. The application needs to classify the different parts of an HTML document. For example, most web pages have a head, menu, sidebar, footer, main content, etc. I want to use a text classifier to classify these parts of an HTML document and to identify the different types of forms on the page.
It would be very helpful if anyone could provide detailed guidance on this subject.
Examples of similar applications would also be very helpful.
I am looking for more technical suggestions, relating to code & implementation.
I can assign labels to HTML tag attributes such as class or id:
<div class="menu-1">
<div id="entry">
<div id="content">
<div id="footer">
<div id="comment-12">
<div id="comment-title">
For example, for the first item:
TrainClassifier(label: "Menu", value: "menu-1", attribute: "class", position-in-string: "21%", tag: "div");
Inputs:
"menu-1" (attribute value)
"class" (attribute name)
"21" (tag position in string)
"div" (tag name)
Output
"Menu" (classified as label)
What neural network library can take the above inputs and classify them into labels (e.g. Menu)?
Not all users can write regexes or XPath; they need an easier approach, so it is important to make the software intelligent. The user should be able to highlight the part of the HTML document he or she needs using a WebBrowser control, and keep training the software until it can work on its own.
But I don't know how to make the software train using AI.
The AI I am looking for should be able to accept various inputs and classify on the basis of them. As I have already said, I am new to AI and don't know much about it.
It would be helpful if I got an answer to the question I actually asked: which library I should use and how to implement it. Please don't post answers suggesting XPath, regex, or other methods; it often happens that you get every suggestion but the one you need.

I suggest looking into simpler algorithms first, which are easy to understand. I can give pointers to some:
Naive Bayes (you will find many implementations, but you can write it yourself; the algorithm is simple to implement yet quite powerful).
Maximum Entropy (e.g. SharpMaxEnt, which is open source).
SVM (e.g. a C# port of LibSVM).
If you want to get a taste of how these work, download the WEKA toolkit:
http://sourceforge.net/projects/weka/
The commonly followed steps are:
Identify as many attributes/features as you can (and a set of labels).
Collect data as a set of records { Label, Attribute1, Attribute2, Attribute3, ... }.
Select a minimal set of important attributes using feature selection algorithms (also available in the WEKA toolkit).
Train the classifier using a standard algorithm.
Test the system, and repeat until you reach the desired accuracy, recall, or other metrics.
Good Luck!
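To make the steps above concrete, here is a minimal sketch of a Naive Bayes classifier over the kind of inputs described in the question (tag name, attribute name, attribute value). The class names NaiveBayesClassifier and HtmlFeatures, and the way the attribute value is split into tokens, are assumptions made for illustration; they do not come from any particular library, and a real system would need far more training data and proper feature selection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal multinomial Naive Bayes over string features, with add-one (Laplace) smoothing.
public sealed class NaiveBayesClassifier
{
    private readonly Dictionary<string, int> _labelCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, Dictionary<string, int>> _featureCounts =
        new Dictionary<string, Dictionary<string, int>>();
    private readonly HashSet<string> _vocabulary = new HashSet<string>();
    private int _totalExamples;

    // Records one labelled example, e.g. Train("Menu", features of <div class="menu-1">).
    public void Train(string label, IEnumerable<string> features)
    {
        _totalExamples++;
        _labelCounts[label] = _labelCounts.TryGetValue(label, out int n) ? n + 1 : 1;
        if (!_featureCounts.TryGetValue(label, out var counts))
            _featureCounts[label] = counts = new Dictionary<string, int>();
        foreach (string feature in features)
        {
            counts[feature] = counts.TryGetValue(feature, out int c) ? c + 1 : 1;
            _vocabulary.Add(feature);
        }
    }

    // Returns the label with the highest log-probability for the given features.
    public string Classify(IEnumerable<string> features)
    {
        if (_totalExamples == 0)
            throw new InvalidOperationException("Train the classifier first.");

        List<string> featureList = features.ToList();
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (string label in _labelCounts.Keys)
        {
            Dictionary<string, int> counts = _featureCounts[label];
            int totalInLabel = counts.Values.Sum();
            // log P(label) + sum over features of log P(feature | label)
            double score = Math.Log((double)_labelCounts[label] / _totalExamples);
            foreach (string feature in featureList)
            {
                counts.TryGetValue(feature, out int c);   // c stays 0 for unseen features
                score += Math.Log((c + 1.0) / (totalInLabel + _vocabulary.Count));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}

public static class HtmlFeatures
{
    // Turns one tag attribute into string features, e.g. ("div", "class", "menu-1")
    // becomes "tag=div", "attr=class", "tok=menu", "tok=1".
    public static IEnumerable<string> Extract(string tag, string attribute, string value)
    {
        yield return "tag=" + tag.ToLowerInvariant();
        yield return "attr=" + attribute.ToLowerInvariant();
        foreach (string token in value.ToLowerInvariant()
                     .Split(new[] { '-', '_', ' ' }, StringSplitOptions.RemoveEmptyEntries))
            yield return "tok=" + token;
    }
}
```

Training then amounts to calling Train("Menu", HtmlFeatures.Extract("div", "class", "menu-1")) for each part the user highlights, and Classify(...) on the features of an unseen tag. Splitting the attribute value into tokens is a design choice that lets "menu-2" share evidence with "menu-1".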

This is a very broad topic. There are a few neural network libraries out there for C#; just search for them on Stack Overflow.
You will need to perform supervised training before you can do any type of classification. In order for the ANN to understand what you are throwing at it, you will need to figure out how you will parse the HTML to get the results you are looking for.
As an example, most websites will use CSS to render content on a browser. Other sites may use tables. You will need to train for both.
Your problem is not an easy one.

Classification could help you, if you had pieces of data that you had to assign labels to. This is not the case. You would be better off manually writing out XPath rules for taking apart your documents.

Related

information on gotchas for multi lingual application [closed]

Closed 10 years ago.
I am currently working on a .NET 4.5 application that contains multilingual data.
I am new to this, so I am looking for resources that explain concepts such as encoding for different languages, globalization, localization, etc.
Any tips as to where I should look for such information?

MSDN - as always - is the best resource: http://msdn.microsoft.com/en-us/library/h6270d0z.aspx .
Some gotchas from my own experience:
Use Unicode types in your database. For SQL Server, make your text columns nvarchar or ntext instead of varchar or text so that they are stored as Unicode. Otherwise you will lose information in languages such as Chinese.
Make your design flexible. A phrase that is 10 characters in English could easily be 3-4 times as long in German or French, so make your buttons flexible (for example, the sliding door technique in HTML), and make your widths and heights percentages and as responsive as possible.
In your resource files, keep plural and singular forms of strings, with placeholders for numbers. For example, if you have a phrase such as "within 2 km of this place", you will probably need a resource entry for the unit separate from the whole sentence to handle singular and plural (kilometer, kilometers). Don't assume that you can just add an "s" to pluralize; that won't work in all languages. Some languages (e.g. Arabic) even have a dual form for exactly two objects, which is treated differently from both singular and plural. (Look at Dwayne's comment for an interesting take on this point.)
If you're going to localize for a language such as Arabic or Hebrew, remember that these are written right to left, so your whole design (including pictures) will need to change orientation. In HTML, that is mostly as easy as adding a dir="rtl" attribute, but sometimes it can be tricky.
It's not just about translation. Things that will change include number formats (commas or periods as decimal and thousands separators), currency symbols coming before or after the amount, currency formatting, date formatting, etc. Make sure that all of these are formatted by the .NET Framework using the culture of the current user.
Be disciplined about not hardcoding any strings in your UI. A handy trick is to pick a language that doesn't use Latin characters (Chinese, Russian, Arabic, whatever), create a resource file for that language, and fill all entries with random strings in that language from Google. Run your application, and you will easily spot the parts of the UI that are not coming from the resource file (they will be the English characters in the middle of the Chinese ones).
It is not just about the UI. If you are sending messages from the backend, like a response from a service or so on, that also needs to be localized. In some cases, even error messages logged in the Event log are required to be localized. Make sure you think about that.
JavaScript. If you're building an Ajax-heavy site with a lot of JavaScript, you might need a library such as jQuery localization to help. You will have to serve your resource file in a JS key-value kind of structure. Since this is less standard than ASP.NET, it could require some improvisation on your side depending on your needs (decisions such as how to load the resource files, all at once or with AMD, or maybe creating a service that returns the localized strings, or just letting ASP.NET bind the values from the actual resource file at compile time, etc.).
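As a small illustration of the point above about letting the framework format numbers, currency, and dates for the user's culture, here is a hedged sketch. The CultureFormatting class and its method names are made up for this example; the calls it wraps (ToString with a standard format string and a CultureInfo) are standard .NET.

```csharp
using System;
using System.Globalization;

public static class CultureFormatting
{
    // "N2": number with group separators and two decimals, as the culture writes them.
    public static string Number(decimal value, CultureInfo culture) =>
        value.ToString("N2", culture);

    // "C": currency, with the culture's symbol, symbol position, and separators.
    public static string Currency(decimal value, CultureInfo culture) =>
        value.ToString("C", culture);

    // "d": short date pattern, so day/month order follows the culture.
    public static string ShortDate(DateTime value, CultureInfo culture) =>
        value.ToString("d", culture);

    // In application code you would normally pass the current user's culture.
    public static string NumberForCurrentUser(decimal value) =>
        Number(value, CultureInfo.CurrentCulture);
}
```

For example, CultureFormatting.Number(3899.99m, CultureInfo.GetCultureInfo("de-DE")) should give 3.899,99, while the en-US culture gives 3,899.99. The exact currency and date output depends on the culture data available on the machine, so it is better not to hardcode expectations for those.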

Difficulty of switching from HTML only e-mail to Text only e-mail in .net C#? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am working with a consulting group on a program which currently uses a .NET C# script to send e-mails in HTML format at regular intervals.
Although the e-mail is sent in HTML format, its content is just text with some tags, and it contains less than a page of text.
I would like the consultant to change this to text format, replacing the tags with line feeds/carriage returns. I have been told that this is a four-hour job, but that seems excessive to me.
When I look online at a page such as this http://www.mattvanandel.com/771/c-sending-an-email/ it would seem the change could be completed in less than 4 hours including recompiling the .net code into a DLL, testing and uploading the code to a server.
Not all developers are created equal, but assuming that the .NET developer is experienced enough to warrant a rate of $250 per hour, does this seem reasonable? If it is something less than 4 hours (i.e. more like 4 minutes), can someone tell me what might have to be done to make the modification? From what I can see, it's likely 2 lines of code that need to be modified (i.e. the body string and the IsBodyHtml statement). What else might I be missing?

Depending on what kind of testing is required to verify that the system is stable after the change, 4 hours may or may not be excessive.
A simple-looking change in a tightly coupled system may have massive implications and risk. In a loosely coupled system, on the other hand, the risk should be minimal.
So the question is: why 4 hours? If it were me, I'd request a breakdown of what the 4 hours represents. You are, after all, the customer, and if you need a cost breakdown I'd suggest you're within your purview to request it.
However, I'd suggest that you ask in a non-confrontational way (i.e. don't jump in with all guns blazing), as there may well be serious implications that the developer knows about but you don't. Maybe just ask a simple 'What is involved in implementing this change?'
And don't feel you have to accept the first answer given; if you are dissatisfied, you should request further clarification from the developer.

It all depends on how the code is written - and on that we can only speculate currently. It may be that they use a really complex 3rd party tool - in which case it might take four hours.
However, if it is done using System.Net.Mail, then it could be as simple as setting the IsBodyHtml property on a MailMessage to false, which is a four-second job.

Changing that 'IsBodyHtml' property would make it send text, but you would also need to modify the text to insert the line feeds. On static text this is not very difficult, but you need to consider where a line feed is proper (what in the HTML has "block" layout and what is simply inline styled). Also, you do not mention whether the text is dynamic or static; it adds complications if it IS dynamically generated.
You pay for time, but also for knowledge. I get someone else to fix things on my car, not because I can't, but because they are better at it and have tools I might not have.
Just from a time-spent perspective:
Get knowledge/use knowledge already present
Estimate time to communicate with you
Design the change
Code the change
Deploy the change
Test the change/functional test
Solicit feedback on the change/acceptance test (from you?)

There is only one property of the MailMessage class in .NET, "IsBodyHtml", that switches between the HTML and text message types.
So you can check for yourself how big the job is, apart from removing the HTML tags and publishing the updated DLL to the server.

The mechanics of switching the code itself is as simple as you say above, replacing the HTML body string with the new string and changing the IsBodyHtml property. (Assuming the code uses the built in .NET Framework mailing components).
Remember though, that text based emails will remove all formatting, so you won't be able to have font colours, images, hyperlinks or anything else in the content except as plain text.
If you really want to cut the estimate down, get someone internal to edit the text and all the developer will have to do is switch 2 lines of code and then test/deploy.
I can't comment on the time required to test/deploy as that's entirely dependent on your system.
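For reference, a rough sketch of what the code change might look like, assuming the existing code uses System.Net.Mail. The PlainTextMail class, its method names, and the list of tags treated as block-level are assumptions for illustration; the regex-based tag stripping is only adequate for simple, well-formed markup like the "text with some tags" described in the question.

```csharp
using System;
using System.Net;
using System.Net.Mail;
using System.Text.RegularExpressions;

public static class PlainTextMail
{
    // Very rough HTML-to-text conversion: a line break for <br> and for the end of
    // common block-level elements, every other tag dropped, entities decoded.
    public static string HtmlToText(string html)
    {
        string text = Regex.Replace(
            html, @"<br\s*/?>|</(p|div|h[1-6]|li|tr)>", "\r\n", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);   // strip remaining tags
        return WebUtility.HtmlDecode(text).Trim();
    }

    // Builds the message as text/plain instead of text/html.
    public static MailMessage Create(string from, string to, string subject, string htmlBody)
    {
        return new MailMessage(from, to)
        {
            Subject = subject,
            Body = HtmlToText(htmlBody),
            IsBodyHtml = false
        };
    }
}
```

If the body is static, editing the text by hand and flipping IsBodyHtml is simpler than converting at run time; a conversion step like HtmlToText only earns its keep when the HTML is generated dynamically.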

isn't number localization just unnecessary? [closed]

Closed 11 years ago.
I've just read this page http://weblogs.asp.net/scottgu/archive/2010/06/10/jquery-globalization-plugin-from-microsoft.aspx
One of the things they did was to convert the Arabic date to the Arabic calendar. I'm wondering if it is a good idea at all to do so. Will it actually be annoying/confusing for the user (even if the user is Arabic)?
My second question is: do we really need to change 3,899.99 to 3.899,99 for some cultures, such as German? I mean, it doesn't hurt to do so since the library already does it for us, but wouldn't this actually cause more confusion for the user (even if he is German)?
I'm sure that whatever culture these people come from, if I give them the number 3,899.99 there's no way they'd get it wrong, right? (They have probably learned the universal format anyway.)

Your problem here seems to be a bad assumption. There is no "universal format" for numbers. 3,899.99 is valid in some places, and confusing in others. Same for the converse. People can often figure out what they need to (especially if it's in software that is clearly doing a shoddy job of localization otherwise. :) ), but that's not the point.
Except in certain scientific and technical domains that general software doesn't usually address, there's no universal format for any of these things. If you want your software to be accepted on native terms anywhere but your own place, you'll need to work for it.

To me it seems like it would be much less confusing to see dates and numbers in the format you're used to (in your country or language) - why do you think it would be the other way around?

The point of localization is to make your application look more natural for the user. It is definitely advisable to do this in your application if you use it internationally. While you can use US standards, that is not a very customer-friendly way of doing things.
How would it be more confusing to a person to see the format they are familiar with? Meet people where they are with your application. If their standard is 10.000,00 and you are showing them 10,000.00, it is a bit disconcerting even if they understand it. Reverse the situation and think about what you would like. Would you like a developer using 10.000,00 for their application because you can understand it just fine?

Depends. 3.899,99 to me looks like two numbers. 3.899 and 99. I imagine our number formatting looks similarly funny to foreigners. Sure, I could guess what it means here, but what if you had a whole bunch of numbers like this clustered together? The winning lotto numbers are 45,26,21,56,94,13. Is that one big number, or 6 2-digit numbers?
Date formatting is especially important. 01/02/03. Is that Jan 2 2003, Feb 1 2003, Feb 3 2001 or what? Different cultures specify the d/m/y in different orders. Also, when spelled out, they obviously have different names for the months.
If you have the time and resources to internationalize it, I think you should.

As a foreigner myself, I can assure you that localization helps a lot in terms of user satisfaction. Commas or dots in numbers may induce big mistakes. Another one is the relative position of days and months.
To improve even further, create translations and add an option to choose the locale. That way you will get close to 100% customer satisfaction.

Another important thing is input. If you don't have localization and take the user input "1.234", what does the user mean: 1.234 or 1234? There may be users who don't like their values to be off by a factor of 1000... who knows? ;)
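The input problem can be shown with a few lines of C#. This is only a sketch: the NumberInput class and TryParseAmount name are invented for the example, while decimal.TryParse with NumberStyles and a format provider is the standard .NET API.

```csharp
using System;
using System.Globalization;

public static class NumberInput
{
    // Parses user input using the separators of the given culture (or any other
    // IFormatProvider), instead of whatever culture the server happens to run under.
    public static bool TryParseAmount(string input, IFormatProvider culture, out decimal value) =>
        decimal.TryParse(input, NumberStyles.Number, culture, out value);
}
```

Called with CultureInfo.GetCultureInfo("en-US"), the input "1.234" parses as 1.234; called with CultureInfo.GetCultureInfo("de-DE"), where the dot is the group separator, the same input parses as 1234. Parsing with the user's culture is what keeps values from being off by a factor of 1000.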

Who Writes Microsoft Support Articles? Can They Always Be Trusted? [closed]

Closed 11 years ago.
Here is an example of the type of article I'm talking about:
http://support.microsoft.com/kb/319401
I assume these articles are written by people who work for Microsoft and that the code in the articles will always be rock solid and never contain any malicious code. I just want to make sure I can explain to my boss that this is an ok place to copy code from (I've been told never to copy code from the internet, but this seems like a safe source).

I would trust them not to be malicious, but they're not always good code. (MSDN samples are sometimes pretty awful.)
For example, here's some code in the sample you gave:
    compareResult = ObjectCompare.Compare
        (listviewX.SubItems[ColumnToSort].Text,
         listviewY.SubItems[ColumnToSort].Text);
    // Calculate correct return value based on object comparison
    if (OrderOfSort == SortOrder.Ascending)
    {
        // Ascending sort is selected, return normal result of compare operation
        return compareResult;
    }
    else if (OrderOfSort == SortOrder.Descending)
    {
        // Descending sort is selected, return negative result of compare operation
        return (-compareResult);
    }
    else
    {
        // Return '0' to indicate they are equal
        return 0;
    }
Now, there are two issues here:
Why is it deemed valid to have a comparer with no sort order? This should be a constructor parameter, validated at the point of construction IMO.
You should not just negate the result of one comparison to perform a "reverse comparison". That breaks if the result of the first comparison is int.MinValue - because -int.MinValue == int.MinValue. It's better to reverse the arguments used to perform the original comparison.
There are other things I'd take issue with in this code, but these two should be enough to make my point.
I heartily agree with the other answers too, in terms of:
- Check the copyright / licence etc of any code you want to use
- Make sure you understand anything you want to use
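One way to address the two issues raised above is sketched here. It is not the code from the KB article; DirectionalComparer and SortDirection are hypothetical names (a separate enum is used so the sketch does not depend on Windows Forms). The direction is fixed and validated in the constructor, and the descending case swaps the arguments instead of negating the result.

```csharp
using System;
using System.Collections.Generic;

public enum SortDirection { Ascending, Descending }

public sealed class DirectionalComparer<T> : IComparer<T>
{
    private readonly IComparer<T> _inner;
    private readonly SortDirection _direction;

    public DirectionalComparer(IComparer<T> inner, SortDirection direction)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        // Reject undefined enum values up front, so there is no "no sort order" state.
        if (!Enum.IsDefined(typeof(SortDirection), direction))
            throw new ArgumentOutOfRangeException(nameof(direction));
        _direction = direction;
    }

    public int Compare(T x, T y) =>
        _direction == SortDirection.Ascending
            ? _inner.Compare(x, y)
            : _inner.Compare(y, x);   // swap arguments; safe even if inner returns int.MinValue
}
```

Usage would be along the lines of list.Sort(new DirectionalComparer<string>(StringComparer.Ordinal, SortDirection.Descending)).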

Your boss probably wouldn't mind if you only copied the code into a test project that you use to test and understand the code. You can then use what you've learned to write the production code.
And while I don't think anyone outside of Microsoft knows the names of the people who write those support articles, they come from the same vendor that your toolchain does, so if you don't trust the support articles, then you can't trust the tools you've bought either.

Microsoft Knowledgebase articles show safe (as in non-malicious but not necessarily secure) code, but usually the example provides the most basic use case possible. There's a good chance that you'll have to tweak the code a bit for it to work the way you want.
You should also pay attention to the date of the articles. For example, the article you link to is almost three years old. There's definitely a better way to handle that situation now.

Be aware that most code in articles is there to help you understand the concepts. It is not "production ready". Learn the concepts instead and implement your own.

Have you been told not to copy code from the internet because of rights issues? If so then you don't have to worry about this Microsoft code.
I would advise you not to use any code you don't understand. If you can't say if the code is malicious or not don't use it.

MSDN and kb support articles are written by MS employees that are part of the given product's UX team (user experience). These are people who typically have a background in technical writing, but are not necessarily developers themselves (although some are). It's very common for the UX team to collaborate with developers on the product to ensure their code samples are correct. However this collaboration in my experience is one of the lowest priorities a typical MS developer has and can go ignored, and so it can at times lead to poor code getting out.
With that said, I completely agree with Carl Norum's comment. Copying code you do not understand is done at your own risk. Make sure you understand any code you place in your product!

I've always found the Microsoft articles to be of the highest quality (sadly not their products).
However, there's always the danger of a spoofing site.

Explain that you carefully read the article to understand the information in there, and only copy code that you understand.
If you don't understand the code, then even if the code is correct it may not be doing what you actually need done, thus your program will be incorrect.
You also will have a hard time debugging and maintaining code if there are parts that you don't understand.

Automated letter generation [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I work for a doctor and am looking for a solution to speed up his process of composing medical reports. Most of the text in the medical reports is redundant and should be able to be generated by a selection process.
What I would like is to present him with a form with various options and checkboxes; the selections he makes will drive the generation of the report and create a Word document that he can then fine-tune or just save/print/whatever.
For instance he will be prompted with:
Age: ______
Gender : () Male, () Female
Length of Condition : () Week () Month () Year
Pain involving: [] Neck, [] Shoulder, [] Chest, [] Hip, [] Leg, etc....
This subset of the form would generate the following sentence:
"This is a 38 year old woman with a month history of pain in the leg"
I'd like the process to be data driven (as much as possible), so changes to the selection choices don't require reprogramming.
I would suspect that he is not the first person to ask for a system like this. So my first question is has anybody come across any existing software that we can purchase that would meet our needs?
In the event that no pre-packaged software is out there, I'd like some input as to general design strategies. What kind of data structure would you use to store the choices? How do I interface with word to create the document?
If I were to write this myself my language of choice would be C#.
EDIT:
A number of suggestions were made assuming that I'm looking for a medical records package.
I don't think that is a solution to the problem I'm addressing.
The doctor is simply looking for a tool to automate his report writing. His reports are usually submitted as part of a workers' comp or no-fault case. They are for external consumption and are not usually referred back to internally after the fact.
ANOTHER POINT:
The functionality I'm looking for isn't specific to the medical community. I'm looking for a tool where a given checkbox/radio button generates a specific sentence, and the mapping is configured by the user. Sort of a form letter on steroids.
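To illustrate the kind of data-driven mapping I have in mind, here is a minimal sketch. Everything in it (the ReportBuilder class, the {Field} placeholder syntax, the "Field=Choice" keys) is hypothetical; in a real tool the template and the phrase table would be loaded from a user-editable file or database rather than hardcoded, and the Word output is not covered here.

```csharp
using System;
using System.Collections.Generic;

public static class ReportBuilder
{
    // Replaces each {Field} placeholder in the template with the phrase configured
    // for the selected choice. Fields without a configured phrase (e.g. a typed-in
    // Age) are inserted exactly as entered.
    public static string Build(
        string template,
        IDictionary<string, string> selections,
        IDictionary<string, string> phrases)
    {
        string result = template;
        foreach (KeyValuePair<string, string> selection in selections)
        {
            string key = selection.Key + "=" + selection.Value;   // e.g. "Gender=Female"
            string text = phrases.TryGetValue(key, out var phrase) ? phrase : selection.Value;
            result = result.Replace("{" + selection.Key + "}", text);
        }
        return result;
    }

    // Reproduces the example sentence from the form subset above.
    public static string Example()
    {
        var phrases = new Dictionary<string, string>
        {
            { "Gender=Male", "man" }, { "Gender=Female", "woman" },
            { "Length=Week", "week" }, { "Length=Month", "month" }, { "Length=Year", "year" },
            { "Site=Neck", "neck" }, { "Site=Leg", "leg" }
        };
        var selections = new Dictionary<string, string>
        {
            { "Age", "38" }, { "Gender", "Female" }, { "Length", "Month" }, { "Site", "Leg" }
        };
        return Build(
            "This is a {Age} year old {Gender} with a {Length} history of pain in the {Site}",
            selections, phrases);
    }
}
```

Because the template and the phrase table are plain data, adding a new checkbox option would only mean adding a row, not reprogramming.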

Yes, there is definitely software out there that does this. You're looking for medical records software. The specifics of the software really depend on where the doctor is located, however. Because your profile indicates a New York location, I assume that you're in the United States. In that case, I know of exactly one offering in that domain. Perhaps they will or won't fit your needs (I've never worked with it myself), but NexTech certainly has a commercial product offering in that general market segment.
If you choose to build your own (which is always a possibility), be aware of the fact that there are legal requirements that surround such software. Once again, I'm not aware of specifics, but you may need to talk with the owner of the practice to ensure that your software doesn't violate any relevant privacy laws.

We automate creation of sales tax returns using open source PDF libraries. We're on Java, but here are some options for PDF generation on .Net.
In our case we work with a specific form template that the states provide and fill in the amounts programmatically. It sounds like you're looking to accomplish something very similar.

You may want to have a look at medical, an electronic health record module for Open ERP. There are also a variety of commercial packages out there; e-MDs, for example, provides this specific feature.

I know Epic is a big player in this area, but they may be out of your price range.

I can think of these options:
XSL-FO
Mail-Merge with MS Word

At the risk of being shot: that sounds like something Microsoft Access was created for. You can easily generate the form and the report. If you really need it as a Word document (as opposed to simply using Access's report function), you can link Word documents to Access databases.
Just an idea.

I think you can just set this up in Word using fields. And the Doctor would just tab from field to field.
