How to remove a suffix from a word in C#?

I am trying to strip the suffix or verb tense from the words I receive and return them to their base form.
For example:
playing - play
watches - watch
stopped - stop
I searched for information on how to do this but couldn't find any.
I tried Humanizer and OpenNLP, but I don't understand how they work and couldn't find a method in either that does what I need.
public List<string> changeWord(List<string> wordss,string baseUrl)
{
string[] wordEnd = {"ing","es", "ies"};
List<string> tags = getH1AndTitleTags(baseUrl);
foreach(string tag in tags)
{
if (tag.Contains(wordEnd[0]))
{
tag.Replace("ing", "");
tags.Add(tag);
}
}
return tags;
}

I found this package: Porter2StemmerStandard. Here is the sample code:
using System;
using Porter2StemmerStandard;
class Program
{
static void Main(string[] args)
{
// Create a new stemmer
var stemmer = new EnglishPorter2Stemmer();
// Stem a word
string word = "playing";
var stemmedWord = stemmer.Stem(word);
Console.WriteLine(stemmedWord.Value); // Output: play
// Stem another word
word = "watches";
stemmedWord = stemmer.Stem(word);
Console.WriteLine(stemmedWord.Value); // Output: watch
// Stem a third word
word = "stopped";
stemmedWord = stemmer.Stem(word);
Console.WriteLine(stemmedWord.Value); // Output: stop
}
}
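For comparison, here is a minimal sketch of the hand-rolled approach the question was attempting, written without any library. The class name NaiveStemmer, the suffix list, and the minimum-length rule are all assumptions made for illustration. Note that strings are immutable in C#, so the stripped value has to be returned (or assigned) rather than relying on Replace to change the word in place, and the results go into a new list rather than the list being enumerated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class NaiveStemmer
{
    // Hypothetical suffix list; "ies" is listed before "es" so the longer match wins.
    static readonly string[] Suffixes = { "ing", "ies", "es", "ed", "s" };

    // Removes the first matching suffix, leaving at least three characters.
    public static string Strip(string word)
    {
        foreach (string suffix in Suffixes)
        {
            if (word.Length > suffix.Length + 2 &&
                word.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }
        return word;
    }

    // Builds a new list instead of modifying the input while iterating over it.
    public static List<string> StripAll(IEnumerable<string> words)
    {
        return words.Select(Strip).ToList();
    }
}

class NaiveStemmerDemo
{
    static void Main()
    {
        foreach (string w in NaiveStemmer.StripAll(new[] { "playing", "watches", "stopped" }))
        {
            Console.WriteLine(w); // play, watch, stopp
        }
    }
}
```

The last result ("stopp" instead of "stop") shows why plain suffix removal is not enough: handling doubled consonants and similar rules is what a real stemmer such as the Porter2 package above is for.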

Related

Adding JPG comment with TagLibSharp showing up as Chinese in Windows File Explorer

Newbie here trying to figure this out. Any help, please, would be greatly appreciated!
I have a .net V4.8 console app utilizing the TagLibSharp V2.2.0 library which updates a JPG with a comment string. If I update the JPG comment with a string with an even number of characters and then look at the comment string in Windows File Explorer, I see the comment in English. If I update the JPG comment with a string with an odd number of characters and then look at the comment string in Windows File Explorer, I see what looks like Chinese characters. The number of Chinese characters is half the number of characters in the string. In both cases, odd or even number of characters in the string, when I retrieve the comment string using TagLibSharp, it appears in English.
Is this some kind of encoding problem? And if so, how do I solve it so that Windows File Explorer will always display the comment as English text regardless of whether the string has an even or odd number of characters in it?
Thanks for any kind and gentle guidance.
using System;
using System.IO;
namespace CommentApic
{
class Program
{
public static readonly string UserDocuments = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
public static readonly string originalPic = Path.Combine(UserDocuments, "OriginalPhoto.jpg");
public static readonly string newPic = Path.Combine(UserDocuments, "NewPhoto.jpg");
static void Main(string[] args)
{
if (args.Length == 1)
{
// Display starting tags
Console.WriteLine("\n\tOriginal comments:");
Console.WriteLine(string.Format("\t\toriginalPic: \"{0}\"", getJPGComment(originalPic)));
// Copy to working copy
Console.WriteLine("\n\tCopying originalPic to newPic");
File.Copy(originalPic, newPic, true);
// Remove all tags
Console.WriteLine("\n\tRemoving tags on newPic");
removeAllTags(newPic);
// Display tags
Console.WriteLine("\n\tThere should be no comments now:");
Console.WriteLine(string.Format("\t\tnewPic: \"{0}\"", getJPGComment(newPic)));
// Set the comment tag
Console.WriteLine("\n\tSetting comments on newPic");
setJPGComment(newPic, args[0]);
// Show the comments
Console.WriteLine("\n\tComments Now:");
Console.WriteLine(string.Format("\t\tnewPic: \"{0}\"", getJPGComment(newPic)));
}
else
{
Console.WriteLine("Please supply a comment to add to the JPG");
}
}
///////////////////////////////////////////////////////////////////////
public static string getJPGComment(string file)
{
// Local variables
string comment = "";
using (TagLib.File tagFile = TagLib.File.Create(file))
{
// Get the image tags
TagLib.Image.File image = tagFile as TagLib.Image.File;
// Ensure all tags are available
image.EnsureAvailableTags();
// Get the tag
comment = image.ImageTag.Comment;
}
// Return comment to caller
return comment;
}
///////////////////////////////////////////////////////////////////////
public static void removeAllTags(string file)
{
using (var taglibFile = TagLib.File.Create(file))
{
taglibFile.RemoveTags(TagLib.TagTypes.AllTags);
taglibFile.Save();
}
}
///////////////////////////////////////////////////////////////////////
public static void setJPGComment(string file, string comment)
{
using (TagLib.File tagFile = TagLib.File.Create(file))
{
// Get the image tags
TagLib.Image.File image = tagFile as TagLib.Image.File;
// Ensure all tags are available
image.EnsureAvailableTags();
// Set the tag value
image.ImageTag.Comment = comment;
// Save the tag
tagFile.Save();
}
}
}
}
Sample Output:
CommentApic.exe A
Original comments:
originalPic: "New photo comment"
Copying originalPic to newPic
Removing tags on newPic
There should be no comments now:
newPic: ""
Setting comments on newPic
Comments Now:
newPic: "A"
Windows File Explorer shows this:
CommentApic.exe AA
Original comments:
originalPic: "New photo comment"
Copying originalPic to newPic
Removing tags on newPic
There should be no comments now:
newPic: ""
Setting comments on newPic
Comments Now:
newPic: "AA"
Windows File Explorer shows this:
I managed to get this working with this code:
public static void setPhotoComment(string file, string comment)
{
// Open the file
using (TagLib.File tfile = TagLib.File.Create(file, "taglib/jpg", TagLib.ReadStyle.None))
{
// Remove all tags
tfile.RemoveTags(TagLib.TagTypes.AllTags);
// Create an empty tag
TagLib.IFD.IFDTag tag = (TagLib.IFD.IFDTag)tfile.GetTag(TagLib.TagTypes.TiffIFD, true);
// Was something returned
if (tag != null)
{
// Get the tag structure
if (tag.Structure != null)
{
// Save a pointer to the structure
TagLib.IFD.IFDStructure tagStructure = tag.Structure;
// Create our byte vector
TagLib.ByteVector commentBytes = TagLib.ByteVector.FromString(comment + "\0", TagLib.StringType.UTF16);
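// Create our byte vector entry.
// Note: JPGtag is not defined in this snippet; it must hold the numeric ID of the IFD tag being written.
TagLib.IFD.Entries.ByteVectorIFDEntry commentEntry = new TagLib.IFD.Entries.ByteVectorIFDEntry(JPGtag,
commentBytes);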
// Add comment entry
tagStructure.SetEntry(0, commentEntry);
// Save the comment
tfile.Save();
}
}
}
}

Aspose PDF - get text from page that has a matching string

I'm working with an existing library - the goal of the library is to pull text out of PDFs to verify against expected values to quality check recorded data vs data in pdf.
I'm looking for a way to succinctly pull a specific page worth of text given a string that should only fall on that specific page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber{
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach(var page in pdfDocument.Pages) portion... or is that the right area to be looking?
Answer: recreate the TextAbsorber for each page, inside the foreach loop.
If the absorber isn't recreated, it keeps text from previous loops.
public List<string> ProcessPage(MyInfoClass file, string find)
{
var pdfDocument = new Document(file.PdfFilePath);
foreach (Page page in pdfDocument.Pages)
{
var textAbsorber = new TextAbsorber {
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
page.Accept(textAbsorber);
var ext = textAbsorber.Text;
var exts = ext.Replace("\n", "").Split('\r').ToList();
if (ext.Contains(find))
return exts;
}
return null;
}

How can I simulate user input from a console?

I'm doing some challenges on HackerRank. I usually debug in a Windows Forms project in Visual Studio, but I realized I lose a lot of time typing in the test cases. So I'd like suggestions for an easy way to simulate Console.ReadLine().
Usually the challenges have the cases describe with something like this:
5
1 2 1 3 2
3 2
And then it is read with three ReadLine calls:
static void Main(String[] args) {
int n = Convert.ToInt32(Console.ReadLine());
string[] squares_temp = Console.ReadLine().Split(' ');
int[] squares = Array.ConvertAll(squares_temp,Int32.Parse);
string[] tokens_d = Console.ReadLine().Split(' ');
int d = Convert.ToInt32(tokens_d[0]);
int m = Convert.ToInt32(tokens_d[1]);
// your code goes here
}
Right now I'm thinking of creating a file testCase.txt and using StreamReader.
using (StreamReader sr = new StreamReader("testCase.txt"))
{
string line;
// Read and display lines from the file until the end of
// the file is reached.
while ((line = sr.ReadLine()) != null)
{
Console.WriteLine(line);
}
}
This way I can replace Console.ReadLine() with sr.ReadLine(), but I'd need to keep a text editor open, delete the old case, paste the new one, and save the file each time.
So is there a way to use a TextBox, so that I only need to paste into it and then use a StreamReader or something similar to read from the TextBox?
You can use the StringReader class to read from a string rather than a file.
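As a minimal sketch of the StringReader idea: Console.SetIn accepts any TextReader, so the existing Console.ReadLine() calls can stay unchanged while the input comes from a string (which could just as well be the Text of a TextBox). The class name StdinDemo and the Run helper are assumptions made for illustration; the parsing lines mirror the question's code.

```csharp
using System;
using System.IO;

class StdinDemo
{
    // Feeds testCase to standard input, then parses it exactly as the question does.
    public static string Run(string testCase)
    {
        // From here on, Console.ReadLine() reads from the string instead of the keyboard.
        Console.SetIn(new StringReader(testCase));

        int n = Convert.ToInt32(Console.ReadLine());
        int[] squares = Array.ConvertAll(Console.ReadLine().Split(' '), int.Parse);
        string[] tokens = Console.ReadLine().Split(' ');
        int d = Convert.ToInt32(tokens[0]);
        int m = Convert.ToInt32(tokens[1]);

        return "n=" + n + ", squares=" + string.Join(",", squares) + ", d=" + d + ", m=" + m;
    }

    static void Main()
    {
        // Sample input from the question; in a Forms app this could be textBox.Text.
        Console.WriteLine(Run("5\n1 2 1 3 2\n3 2\n"));
        // prints: n=5, squares=1,2,1,3,2, d=3, m=2
    }
}
```

Because only the input source is swapped, the body of the solution can be pasted into HackerRank as-is once the SetIn line is removed.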
The accepted solution doesn't really emulate Console.ReadLine(), so you can't paste the code directly into HackerRank.
I solved it this way:
Just paste this class above the static Main method or anywhere inside the main class to hide the original System.Console
class Console
{
public static Queue<string> TestData = new Queue<string>();
public static void SetTestData(string testData)
{
TestData = new Queue<string>(testData.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries).Select(x=>x.TrimStart()));
}
public static void SetTestDataFromFile(string path)
{
TestData = new Queue<string>(File.ReadAllLines(path));
}
public static string ReadLine()
{
return TestData.Dequeue();
}
public static void WriteLine(object value = null)
{
System.Console.WriteLine(value);
}
public static void Write(object value = null)
{
System.Console.Write(value);
}
}
and use it this way.
//Paste the Console class here.
static void HackersRankProblem(String[] args)
{
Console.SetTestData(@"
6
6 12 8 10 20 16
");
int n = int.Parse(Console.ReadLine());
string arrStr = Console.ReadLine();
.
.
.
}
Now your code looks the same, and you can test with as much data as you want without changing it.
Note: If you need more complex Write or WriteLine overloads, just add them and forward the arguments to the original System.Console.
Just set Application Arguments: <input.txt
and provide in input.txt your input text.
Be careful to save the file with ANSI encoding.

When using MergeField FieldCodes in OpenXml SDK in C# why do field codes disappear or fragment?

I have been working successfully with the C# OpenXml SDK (Unofficial Microsoft Package 2.5 from NuGet) for some time now, but have recently noticed that the following line of code returns different results depending on what mood Microsoft Word appears to be in when the file gets saved:
var fields = document.Descendants<FieldCode>();
From what I can tell, when creating the document in the first place (using Word 2013 on Windows 8.1) if you use the Insert->QuickParts->Field and choose MergeField from the Field names left hand pane, and then provide a Field name in the field properties and click OK then the field code is correctly saved in the document as I would expect.
Then when using the aforementioned line of code I will receive a field code count of 1 field. If I subsequently edit this document (and even leave this field well alone) the subsequent saving could mean that this field code no longer is returned in my query.
Another case of the same curiousness is when I see the FieldCode nodes split across multiple items. So rather than seeing say:
" MERGEFIELD Author \\* MERGEFORMAT "
As the node name, I will see:
" MERGEFIELD Aut"
"hor \\* MERGEFORMAT"
Split as two FieldCode node values. I have no idea why this would be the case, but it certainly makes my ability to match nodes that much more exciting. Is this expected behaviour? A known bug? I don't really want to have to crack open the raw xml and edit this document to work until I understand what is going on. Many thanks all.
I came across this very problem myself, and found a solution that exists within OpenXML: a utility class called MarkupSimplifier which is part of the PowerTools for Open XML project. Using this class solved all the problems I was having that you describe.
The full article is located here.
Here are some pertinent excerpts:
Perhaps the most useful simplification that this performs is to merge adjacent runs with identical formatting.
It goes on to say:
Open XML applications, including Word, can arbitrarily split runs as necessary. If you, for instance, add a comment to a document, runs will be split at the location of the start and end of the comment. After MarkupSimplifier removes comments, it can merge runs, resulting in simpler markup.
An example of the utility class in use is:
SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
RemoveComments = true,
RemoveContentControls = true,
RemoveEndAndFootNotes = true,
RemoveFieldCodes = false,
RemoveLastRenderedPageBreak = true,
RemovePermissions = true,
RemoveProof = true,
RemoveRsidInfo = true,
RemoveSmartTags = true,
RemoveSoftHyphens = true,
ReplaceTabsWithSpaces = true,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
I have used this many times with Word 2010 documents using VS2015 .Net Framework 4.5.2 and it has made my life much, much easier.
Update:
I have revisited this code and found that it cleans up runs in MERGEFIELDs, but not in IF fields that reference merge fields, e.g.
{if {MERGEFIELD When39} = "Y???" "Y" "N" }
I have no idea why this might be so, and examination of the underlying XML offers no hints.
Word will often split a text run into multiple text runs for no reason I've ever understood. Before searching, comparing, tidying, etc., we preprocess the body with a method that combines adjacent runs with identical formatting into a single text run.
/// <summary>
/// Combines the identical runs.
/// </summary>
/// <param name="body">The body.</param>
public static void CombineIdenticalRuns(W.Body body)
{
List<W.Run> runsToRemove = new List<W.Run>();
foreach (W.Paragraph para in body.Descendants<W.Paragraph>())
{
List<W.Run> runs = para.Elements<W.Run>().ToList();
for (int i = runs.Count - 2; i >= 0; i--)
{
W.Text text1 = runs[i].GetFirstChild<W.Text>();
W.Text text2 = runs[i + 1].GetFirstChild<W.Text>();
if (text1 != null && text2 != null)
{
string rPr1 = "";
string rPr2 = "";
if (runs[i].RunProperties != null) rPr1 = runs[i].RunProperties.OuterXml;
if (runs[i + 1].RunProperties != null) rPr2 = runs[i + 1].RunProperties.OuterXml;
if (rPr1 == rPr2)
{
text1.Text += text2.Text;
runsToRemove.Add(runs[i + 1]);
}
}
}
}
foreach (W.Run run in runsToRemove)
{
run.Remove();
}
}
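If the goal is only to read a fragmented field code rather than rewrite the document, another option is to concatenate the pieces between the field's begin marker and its separate or end marker. The sketch below is an assumption-laden illustration, not the SDK's own API: it uses System.Xml.Linq on raw WordprocessingML instead of the Open XML SDK so that it runs standalone, the names FieldCodeReader and ReadFieldCodes are hypothetical, and it does not handle nested fields such as an IF field containing a MERGEFIELD.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

static class FieldCodeReader
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per complex field, joining every w:instrText fragment
    // found after a fldChar "begin" and before the next "separate" or "end".
    public static List<string> ReadFieldCodes(XElement root)
    {
        var codes = new List<string>();
        StringBuilder current = null;

        foreach (XElement node in root.Descendants())
        {
            if (node.Name == W + "fldChar")
            {
                string type = (string)node.Attribute(W + "fldCharType");
                if (type == "begin")
                {
                    current = new StringBuilder();
                }
                else if (current != null)
                {
                    // "separate" or "end" closes the instruction part of the field.
                    codes.Add(current.ToString());
                    current = null;
                }
            }
            else if (node.Name == W + "instrText" && current != null)
            {
                current.Append(node.Value);
            }
        }
        return codes;
    }
}

class FieldCodeDemo
{
    static void Main()
    {
        // A field whose code Word has split across two runs, as in the question.
        string xml = @"<w:p xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
<w:r><w:fldChar w:fldCharType=""begin""/></w:r>
<w:r><w:instrText xml:space=""preserve""> MERGEFIELD Aut</w:instrText></w:r>
<w:r><w:instrText xml:space=""preserve"">hor \* MERGEFORMAT </w:instrText></w:r>
<w:r><w:fldChar w:fldCharType=""separate""/></w:r>
<w:r><w:t>Author</w:t></w:r>
<w:r><w:fldChar w:fldCharType=""end""/></w:r>
</w:p>";

        foreach (string code in FieldCodeReader.ReadFieldCodes(XElement.Parse(xml)))
        {
            Console.WriteLine("[" + code + "]"); // [ MERGEFIELD Author \* MERGEFORMAT ]
        }
    }
}
```

The same begin/separate/end walk can be done over the SDK's FieldChar and FieldCode elements; the XML names used here are the ones those classes serialize to.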
I tried to simplify the document with PowerTools, but the result was a corrupted Word file. I wrote this routine to simplify only field codes with specific names; it works in all parts of the document (main document part, headers, and footers):
internal static void SimplifyFieldCodes(WordprocessingDocument document)
{
var masks = new string[] { Constants.VAR_MASK, Constants.INP_MASK, Constants.TBL_MASK, Constants.IMG_MASK, Constants.GRF_MASK };
SimplifyFieldCodesInElement(document.MainDocumentPart.RootElement, masks);
foreach (var headerPart in document.MainDocumentPart.HeaderParts)
{
SimplifyFieldCodesInElement(headerPart.Header, masks);
}
foreach (var footerPart in document.MainDocumentPart.FooterParts)
{
SimplifyFieldCodesInElement(footerPart.Footer, masks);
}
}
internal static void SimplifyFieldCodesInElement(OpenXmlElement element, string[] regexpMasks)
{
foreach (var run in element.Descendants<Run>()
.Select(item => (Run)item)
.ToList())
{
var fieldChar = run.Descendants<FieldChar>().FirstOrDefault();
if (fieldChar != null && fieldChar.FieldCharType == FieldCharValues.Begin)
{
string fieldContent = "";
List<Run> runsInFieldCode = new List<Run>();
var currentRun = run.NextSibling();
while ((currentRun is Run) && currentRun.Descendants<FieldCode>().FirstOrDefault() != null)
{
var currentRunFieldCode = currentRun.Descendants<FieldCode>().FirstOrDefault();
fieldContent += currentRunFieldCode.InnerText;
runsInFieldCode.Add((Run)currentRun);
currentRun = currentRun.NextSibling();
}
// If there is more than one Run for the FieldCode, and is one we must change, set the complete text in the first Run and remove the rest
if (runsInFieldCode.Count > 1)
{
// Check the field code to confirm it's one we must simplify (so that TOC, PAGEREF, etc. are left unchanged)
bool applyTransform = false;
foreach (string regexpMask in regexpMasks)
{
Regex regex = new Regex(regexpMask);
Match match = regex.Match(fieldContent);
if (match.Success)
{
applyTransform = true;
break;
}
}
if (applyTransform)
{
var currentRunFieldCode = runsInFieldCode[0].Descendants<FieldCode>().FirstOrDefault();
currentRunFieldCode.Text = fieldContent;
runsInFieldCode.RemoveAt(0);
foreach (Run runToRemove in runsInFieldCode)
{
runToRemove.Remove();
}
}
}
}
}
}
Hope this helps!!!

How can I parse this HTML to get the content I want?

I am currently trying to parse an HTML document to retrieve all of the footnotes inside of it; the document contains dozens and dozens of them. I can't really figure out the expressions to use to extract all of the content I want. The thing is, the classes (e.g. "calibre34") are all randomized in every document. The only way to see where the footnotes are located is to search for "hide": the footnote is always the text that follows, and it is closed with a </td> tag. Below is an example of one of the footnotes in the HTML document; all I want is the text. Any ideas? Thanks guys!
<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>
Use HtmlAgilityPack to load the HTML document and then extract the footnotes with this XPath:
//td[contains(., '[hide]')]/following-sibling::td[1]
It first selects every td node whose text contains [hide] (in the sample, the marker sits inside a nested a element, so an exact text()='[hide]' comparison on the td would not match), then moves to the td that immediately follows it. Once you have this collection of nodes, you can extract their inner text (in C#, with the support provided by HtmlAgilityPack).
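The selection can be tried out without any third-party package, as a rough sketch: the sample snippet from the question happens to be well-formed, so the built-in System.Xml.Linq and System.Xml.XPath classes can evaluate an XPath against it. This is an assumption that will not hold for most real-world HTML, which is where HtmlAgilityPack comes in. The sketch uses a contains() test because the [hide] marker sits inside a nested a element; the names FootnoteDemo and ExtractFootnotes, the wrapping table/tr elements, and the shortened footnote text are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

static class FootnoteDemo
{
    // Finds each td that contains the "[hide]" marker and returns the text
    // of the td immediately after it, with runs of whitespace collapsed.
    public static List<string> ExtractFootnotes(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc
            .XPathSelectElements("//td[contains(., '[hide]')]/following-sibling::td[1]")
            .Select(td => string.Join(" ",
                td.Value.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)))
            .ToList();
    }

    static void Main()
    {
        string sample = @"<table><tr>
<td class=""calibre33"">1.<span><a class=""x-xref"" href=""javascript:void(0);"">
[hide]</a></span></td>
<td class=""calibre34"">
Among the other factors on which the premium would be based are the
average size of the losses experienced.</td>
</tr></table>";

        foreach (string note in ExtractFootnotes(sample))
        {
            Console.WriteLine(note);
        }
    }
}
```

With HtmlAgilityPack, the same XPath string would be passed to SelectNodes on the loaded document's DocumentNode, and InnerText read from each result.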
How about using MSHTML to parse the HTML source?
Here is the demo code. Enjoy.
public class CHtmlPraseDemo
{
private string strHtmlSource;
public mshtml.IHTMLDocument2 oHtmlDoc;
public CHtmlPraseDemo(string url)
{
GetWebContent(url);
oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
oHtmlDoc.write(strHtmlSource);
}
public List<String> GetTdNodes(string TdClassName)
{
List<String> listOut = new List<string>();
IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
IHTMLElementCollection iec = (IHTMLElementCollection)ie.getElementsByTagName("td");
foreach (IHTMLElement item in iec)
{
if (item.className == TdClassName)
{
listOut.Add(item.innerHTML);
}
}
return listOut;
}
void GetWebContent(string strUrl)
{
WebClient wc = new WebClient();
strHtmlSource = wc.DownloadString(strUrl);
}
}
class Program
{
static void Main(string[] args)
{
CHtmlPraseDemo oH = new CHtmlPraseDemo("http://stackoverflow.com/faq");
Console.Write(oH.oHtmlDoc.title);
List<string> l = oH.GetTdNodes("x");
foreach (string n in l)
{
Console.WriteLine("new td");
Console.WriteLine(n.ToString());
}
Console.Read();
}
}
