I am trying to write code (in C#) that can search for any plain-text word or phrase in a markdown file. Currently I'm doing this by a long-winded method: convert the markdown to HTML, strip the HTML tags out of the resulting text, and then use a simple regular expression to search that for the word or phrase in question. Needless to say, this can be pretty slow.
A concrete example might show the problem. Say the markdown file contains
Something ***significant***
I would like to be able to find that by providing the search phrase something significant (i.e. ignoring the ***'s).
Is there an efficient way of doing this (i.e. that avoids the conversion to HTML) and doesn't involve me writing my own markdown parser?
Edit:
I want a generic way to search for any text or phrase in markdown text that contains any valid markdown formatting. The first answers were ways to match the specific text example I gave.
Edit:
I should have made it clear: this is required for a simple user-facing search, and the markdown files could contain any valid markdown formatting. For this reason I need to be able to ignore anything in the markdown that the user wouldn't see as text if they converted the markdown to HTML. E.g. markdown that specifies an image (like ![Valid XHTML](http://w3.org/Icons/valid-xhtml10)) should be skipped during the search. Converting to HTML produces decent results for the user because it reasonably accurately reflects what a user sees, but it's just a slow solution, especially when there's a lot of markdown text to look through.
Use a regexp
var str = "Something ***significant***";
var regexp = new Regex("Something.+significant.+");
Console.WriteLine(regexp.Match(str).Success);
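A more general variant (a sketch I'm adding, not part of the original answer) builds the pattern from the user's phrase and lets any run of non-word characters sit between the words, so the *** formatting is skipped over:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class PhraseSearch
{
    // Builds a case-insensitive pattern from the search phrase, allowing
    // markdown punctuation (*, _, `, spaces, ...) between the words.
    public static bool ContainsPhrase(string markdown, string phrase)
    {
        var words = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                          .Select(Regex.Escape);
        string pattern = string.Join(@"[\W_]+", words);
        return Regex.IsMatch(markdown, pattern, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(ContainsPhrase("Something ***significant***", "something significant")); // True
    }
}
```

Note this still matches formatting characters *inside* a word, so it is only an approximation of "what the user would see".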
I want to do the same thing, and I can think of one way to achieve it.
Your method has two steps.
Get the plain text out of the markdown source (which itself has two steps: Markdown → HTML, then HTML stripped to plain text)
Search within the plain text
Now, if the markdown source is persisted in a data store, then you may be able to also persist the plain text for search purposes only. So the step to extract the plain text from the markdown may be executed only once when persisting the markdown source (or every time the markdown source is updated), but the code that actually searches in the markdown could be executed immediately on the already persisted plain text data as many times as you want.
For example, if you have a relational DB with a column like markdown_text, you could also create a plain_text column and recreate its value every time the markdown_text column is changed.
Users won't bother if saving their markdown takes a few milliseconds (or even seconds) more than before. Users tend to feel safe when something that alters the system's state takes some time (they feel that something is actually happening in the system), rather than happen immediately (they feel that something went wrong and their command did not execute). But they will feel frustrated if searching took more than a few ms to complete. In general users want queries to complete immediately but commands to take some time (not more than a few seconds though).
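A minimal sketch of that split (the `StripMarkdown` regexes are a naive stand-in for a real Markdown-to-plain-text conversion; the point is that the expensive step runs once per save, not once per search):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class DocumentStore
{
    private readonly Dictionary<int, (string Markdown, string PlainText)> _rows = new();

    // Command: the slow-ish conversion happens once, when the markdown is saved.
    public void Save(int id, string markdown)
    {
        _rows[id] = (markdown, StripMarkdown(markdown));
    }

    // Query: searches run against the pre-computed plain text only.
    public bool Contains(int id, string phrase) =>
        _rows[id].PlainText.IndexOf(phrase, StringComparison.OrdinalIgnoreCase) >= 0;

    // Naive stand-in for a real markdown-to-plain-text conversion.
    private static string StripMarkdown(string md)
    {
        md = Regex.Replace(md, @"!\[[^\]]*\]\([^)]*\)", "");    // drop images entirely
        md = Regex.Replace(md, @"\[([^\]]*)\]\([^)]*\)", "$1"); // keep link text only
        return Regex.Replace(md, @"[*_`#>~]", "");              // drop formatting chars
    }

    static void Main()
    {
        var store = new DocumentStore();
        store.Save(1, "Something ***significant***");
        Console.WriteLine(store.Contains(1, "something significant")); // True
    }
}
```

In a relational DB the `_rows` dictionary becomes your `markdown_text` and `plain_text` columns, with `Save` run by the update path.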
Try this:
string input = "Something ***significant***";
string v = input.Replace("***", "");
Console.WriteLine(v);
I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word (adding headers and footers) it stopped working. I wondered why, and some debugging showed me this:
Which is the content of texts in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working at the paragraph level (concatenating the content and then replacing the content of the paragraph) is that I fear losing the formatting that should be kept, as in
Do not use the name admin
Various things can fragment text runs. Most frequently it's proofing markup (as is apparently the case here, where there are "squigglies") or rsids (used to compare documents and track who edited what, and when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go an element level "higher". In this case, get the list of Paragraph descendants and from there get all the Text descendants and concatenate their InnerText.
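In code that might look like the following sketch; I'm using plain strings to stand in for the `InnerText` of each `Text` run the SDK would return, so the idea is shown without the OpenXML types:

```csharp
using System;
using System.Text.RegularExpressions;

class RunSearch
{
    // Word may split "{var:Template1}" across several <w:t> runs; joining the
    // runs' text first lets one regex find the whole placeholder again.
    public static string FindPlaceholder(string[] runTexts)
    {
        string paragraphText = string.Concat(runTexts);
        Match m = Regex.Match(paragraphText, @"\{var:[^}]+\}");
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // Word fragmented the placeholder across three runs.
        var fragmentedRuns = new[] { "{var:", "Temp", "late1}" };
        Console.WriteLine(FindPlaceholder(fragmentedRuns)); // {var:Template1}
    }
}
```

With the real SDK you would get `runTexts` from each `Paragraph`'s `Descendants<Text>()` and work per paragraph, as the answer describes.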
OpenXML is indeed fragmenting your text.
I created a library that does exactly this: render a Word template with the values from a JSON object.
From the documentation of docxtemplater:
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag by <w:t xml:space=\"preserve\"> to ensure that spaces are not stripped out, if there are any in your variables.
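The merge described above could be sketched roughly like this (purely illustrative; the real library operates on the XML nodes and re-splits the result so per-run formatting survives, which this string-level version does not attempt):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class TagMerger
{
    // Joins the fragments, substitutes {tag} occurrences from the value map,
    // and returns the rendered text as a single fragment.
    public static List<string> Render(List<string> fragments, Dictionary<string, string> values)
    {
        string joined = string.Concat(fragments);
        string rendered = Regex.Replace(joined, @"\{(\w+)\}",
            m => values.TryGetValue(m.Groups[1].Value, out var v) ? v : m.Value);
        return new List<string> { rendered };
    }

    static void Main()
    {
        var fragments = new List<string> { "Hello", "{name", "} !", "How are you ?" };
        var values = new Dictionary<string, string> { ["name"] = "John" };
        Console.WriteLine(TagMerger.Render(fragments, values)[0]); // HelloJohn !How are you ?
    }
}
```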
I'm coding an app for Windows Phone in C#.
The program creates an HTML file, and over the course of the program's run I add a lot of HTML tags.
Now I need to strip those from a string when needed.
All my searches show me how to take a string, turn it into an array, then put it back together minus any words I don't want. That's handy, but it won't work for my needs, and I have no idea where to start or even if this is possible.
Here is an example of the strings I need to clean:
testString = "AnotherTest<br>";
And this is a list of the parts I need to remove:
List<string> partsToRemove ={"</a>","\">","<br>","<a","href=\"#"};
So how do I take "AnotherTest<br>" and remove all the parts included in partsToRemove?
To clarify:
I will only be removing HTML from small strings as needed, not from a whole HTML file.
To give a working concept:
My program creates a background for a roleplay character. Part of that process uses a "gang" generator, which provides the strings with HTML tags ready for placement (adding them on the fly is not possible without radical alteration to my whole program). This is fine for the end result, but I also give users access to the generator itself, so if they just want a gang they can use what I have created. The result is displayed in a textbox (I could easily change that to another web box), and if enabled the phone reads it out. Here I would take the string created for the gang and feed it through a method that strips the HTML code and returns a "clean" string.
Before posting I searched for a solution, but all I came across was how to remove whole words.
You can try to use regex to do this:
Remove all html tags:
string result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
For the case that you've shown, you can use a regex like (<a|href="#|">|</a>|<br>).
But since you might have many different types, the best is to keep a list of patterns, or try to figure out a pattern that matches all the different combinations that you have. It might be more suitable to split the document, and execute a regex multiple times, to keep the regex as simple as possible.
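For the specific `partsToRemove` list, a plain loop of `String.Replace` calls avoids regex altogether (a sketch; for strings this small either approach is fast enough):

```csharp
using System;
using System.Collections.Generic;

class TagStripper
{
    // Removes every substring in partsToRemove from the input, in order.
    public static string Strip(string input, List<string> partsToRemove)
    {
        foreach (var part in partsToRemove)
            input = input.Replace(part, string.Empty);
        return input;
    }

    static void Main()
    {
        var partsToRemove = new List<string> { "</a>", "\">", "<br>", "<a", "href=\"#" };
        Console.WriteLine(Strip("AnotherTest<br>", partsToRemove)); // AnotherTest
    }
}
```

Order matters if one part is a substring of another, so put the longer patterns first.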
Hope I've answered your question.
I have a string which has Markdown tags embedded inside it. I do not want to encode the Markdown as anything else, I just want to rip out all of the tags.
How can I do this quickly? I need to do this as part of a batch processing job which processes around 5 million pieces of text, so speed is very important.
I looked at MarkdownSharp, and using Transform, but I'm not sure it's the best way of doing this. I just want plaintext output, with no tags inside. I'm even considering a regex removal, but I'm not sure what the most performant option would be.
You could probably use MarkdownSharp or any other similar library (I recommend Strike, since it is surprisingly fast!) to convert the Markdown to Html and then use HtmlAgilityPack to extract the text.
A faster option, but more work for you, would be to modify an existing Markdown parser to produce plain text instead.
The solution was a bit hard to get from the comments, but this works for .NET 6:
Install Markdown Deep from NuGet. I needed something for .NET 6 so I used the Core version https://www.nuget.org/packages/MarkdownDeep.NET.Core/
Create a Markdown object:
using MarkdownDeep;
var markdownRemover = new Markdown()
{
SummaryLength = -1
};
Strip the markdown from the text:
var plainText = markdownRemover.Transform(mdText);
I've got this project where we are implementing Examine / Lucene.net, and I'm looking for some guidance from you guys.
As far as I have been able to find out from Google, if I want to boost the weight, I need to boost the weight on the field, right?
But could I get something like this: is it possible to give a boost to a term if the term is inside an h1 tag (or the title, for that matter), when given a complete site's HTML and doing a frequent-term search?
The thing I would like to do is make a service which receives an HTML document and from that can work out which terms the document is optimised for, depending on which terms are used in the text and whether they appear in the important places, like a title tag or h2 tag and so on.
Is this possible to achieve? It's so that editors can know, live, which search words will best find what they are writing.
Big thanks in advance.
I don't think it quite works that way. Yes, you can boost a field, but you cannot boost a term dependent on its location in some markup, because you don't know that at the time of the search.
I think what you could do is create an Umbraco event handler that fires when a page is published. This event could:
Utilise the GatheringNodeData event of an Index
Take the contents of the rich text editor-based field and, using regex or something like HtmlUtility, extract specific text based upon its markup location, e.g. H1, H2 and H3 text.
For each piece of text in a heading found, add it into a string variable
Add the whole string into the Lucene index as a new field, e.g. "Headings"
You can now boost on the "Headings" field separately from the field containing the HTML.
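The heading-extraction step could be a small regex pass over the published HTML (a sketch; the `GatheringNodeData` wiring and the actual index write are Examine-specific and omitted here):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class HeadingExtractor
{
    // Pulls the inner text of h1-h3 tags into one space-separated string,
    // suitable for storing in a separate "Headings" index field.
    public static string ExtractHeadings(string html)
    {
        var matches = Regex.Matches(html, @"<h[1-3][^>]*>(.*?)</h[1-3]>",
                                    RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return string.Join(" ", matches.Cast<Match>().Select(m => m.Groups[1].Value.Trim()));
    }

    static void Main()
    {
        string html = "<h1>Main title</h1><p>Body text</p><h2>Sub heading</h2>";
        Console.WriteLine(ExtractHeadings(html)); // Main title Sub heading
    }
}
```

For real-world HTML (nested tags inside headings, malformed markup) a proper parser such as HtmlAgilityPack would be more robust than a regex.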
Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may be also be present so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there are a couple of agreed markings that indicate something you may wish to strip. You can look for these lines using regular expressions. I doubt you can "sanitize" your emails really well, but some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart message boundaries start with --; beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64-encoded images)
As for an actual C# implementation, I leave that for you or other SOers.
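As a starting point, here is a minimal C# sketch of the two line markers listed above (quote prefix and signature separator); real mail also needs the multipart handling:

```csharp
using System;
using System.Collections.Generic;

class EmailCleaner
{
    // Drops "> "-quoted lines and everything from the "-- " signature
    // marker onward (note: some clients drop the trailing space).
    public static string Clean(string body)
    {
        var kept = new List<string>();
        foreach (var rawLine in body.Split('\n'))
        {
            string line = rawLine.TrimEnd('\r');
            if (line == "-- ") break;            // signature block begins
            if (line.StartsWith("> ")) continue; // quoted text
            kept.Add(line);
        }
        return string.Join("\n", kept);
    }

    static void Main()
    {
        string mail = "Hello\n> quoted reply\nSee you\n-- \nJohn\n555-1234";
        Console.WriteLine(EmailCleaner.Clean(mail)); // prints "Hello" then "See you"
    }
}
```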
A few obvious things to look at:
if the mail is anything but pure plain text, the message will be multi-part MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood any part whose type is not "text/*" can go.
An HTML message will probably have a part of type "multipart/alternative" (I think), which will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do an HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you're creating your own application, I'd look into regex to find text and replace it. To make the application a little nicer, I'd create a class called Email, and in that class have a property called Raw and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
FromEmailAddress = "john.smith@example.com",
FromName = "John Smith",
TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);