Heading identification with Regex

I'm wondering how I can identify headings that use differing numerical marking styles with one or more regular expressions, given that styles sometimes overlap between documents. The goal is to extract all the subheadings and data for a specific heading in each file, but these files aren't standardized. Are regular expressions even the right approach here?
I'm working on a program that parses a .pdf file and looks for a specific section. Once it finds the section, it finds all subsections of that section and their content and stores them in a Dictionary<string, string>. I start by reading the entire PDF into a string, and then use this function to locate the "marking" section.
private string GetMarkingSection(string text)
{
int startIndex = 0;
int endIndex = 0;
bool startIndexFound = false;
Regex rx = new Regex(HEADINGREGEX);
foreach (Match match in rx.Matches(text))
{
if (startIndexFound)
{
endIndex = match.Index;
break;
}
if (match.ToString().ToLower().Contains("marking"))
{
startIndex = match.Index;
startIndexFound = true;
}
}
return text.Substring(startIndex, (endIndex - startIndex));
}
Once the marking section is found, I use this to find subsections.
private Dictionary<string, string> GetSubsections(string text)
{
Dictionary<string, string> subsections = new Dictionary<string, string>();
string[] unprocessedSubSecs = Regex.Split(text, SUBSECTIONREGEX);
string title = "";
string content = "";
foreach(string s in unprocessedSubSecs)
{
if(s != "") //sometimes it pulls in empty strings
{
Match m = Regex.Match(s, SUBSECTIONREGEX);
if (m.Success)
{
title = s;
}
else
{
content = s;
if (!String.IsNullOrWhiteSpace(content) && !String.IsNullOrWhiteSpace(title))
{
subsections.Add(title, content);
}
}
}
}
return subsections;
}
Getting these methods to work the way I want them to isn't an issue; the problem is getting them to work with each of the documents. I'm working on a commercial application, so any API that requires a license isn't going to work for me.
These documents are anywhere from 1-16 years old, so the formatting varies quite a bit. Here is a link to some sample headings and subheadings from various documents. But to make it easy, here are the regex patterns I'm using:
Heading: (?m)^(\d+\.\d+\s[ \w,\-]+)\r?$
Subheading: (?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$
Master Key: (?m)^(\d\.?[\d.]*? ?[ \-,:\w]+) ?\r?$
Since some headings use the subheading format in other documents I am unable to use the same heading regex for each file, and the same goes for my subheading regex.
My alternative to this was that I was going to write a master key (listed in the regex link) to identify all types of headings and then locate the last instance of a numeric character in each heading (5.1.X) and then look for 5.1.X+1 to find the end of that section.
That's when I ran into another problem. Some of these files have absolutely no proper structure. Most of them go from 5.2->7.1.5 (5.2->5.3/6.0 would be expected)
I'm trying to wrap my head around a solution for something like this, but I've got nothing... I am open to ideas not involving regex as well.
Here is my updated GetMarkingSection method:
private Dictionary<string, string> GetMarkingSection(string text)
{
var headingRegex = HEADING1REGEX;
var subheadingRegex = HEADING2REGEX;
Dictionary<string, string> markingSection = new Dictionary<string, string>();
if (Regex.Matches(text, HEADING1REGEX, RegexOptions.Multiline | RegexOptions.Singleline).Count > 0)
{
foreach (Match m in Regex.Matches(text, headingRegex, RegexOptions.Multiline | RegexOptions.Singleline))
{
if (Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
{
if (m.Groups[2].Value.ToLower().Contains("marking"))
{
var subheadings = Regex.Matches(m.ToString(), subheadingRegex, RegexOptions.Multiline | RegexOptions.Singleline);
foreach (Match s in subheadings)
{
markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
}
return markingSection;
}
}
}
}
else
{
headingRegex = HEADING2REGEX;
subheadingRegex = HEADING3REGEX;
foreach(Match m in Regex.Matches(text, headingRegex, RegexOptions.Multiline | RegexOptions.Singleline))
{
if(Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
{
if (m.Groups[2].Value.ToLower().Contains("marking"))
{
var subheadings = Regex.Matches(m.ToString(), subheadingRegex, RegexOptions.Multiline | RegexOptions.Singleline);
foreach (Match s in subheadings)
{
markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
}
return markingSection;
}
}
}
}
return null;
}
Here are some example PDF files:

See if this approach works:
var heading1Regex = @"^(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\s|\Z)";
var heading2Regex = @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";
var heading3Regex = @"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)";
For each PDF file:
var headingRegex = heading1Regex;
var subHeadingRegex = heading2Regex;
if there are any matches for headingRegex
{
    for each match, find matches for subHeadingRegex
}
else
{
    headingRegex = heading2Regex;
    subHeadingRegex = heading3Regex;
    // repeat the same steps
}
Edge case 1: after 5.2 comes 7.1.3
1. Get the main section match using heading2Regex.
2. Convert group 1 of the match to an integer:
int.TryParse(match.Groups[1].Value, out var headingIndex);
3. Get the subsection matches for heading3Regex.
4. For each subsection match, convert group 1 to an integer:
int.TryParse(subMatch.Groups[1].Value, out var subHeadingIndex);
5. Check whether headingIndex equals subHeadingIndex. If not, handle it accordingly.
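The check described above can be sketched as follows. This is a minimal illustration, not the answerer's code: the class name HeadingScanner, the method FindStraySubheadings and the heading pattern are all assumptions, and any body line that happens to start with a number (for example "10 units shipped") would also be treated as a heading.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeadingScanner
{
    // Assumed heading shape: a dotted number ("5", "5.2", "7.1.3"),
    // spaces or tabs, then a title on the same line.
    private static readonly Regex HeadingLine = new Regex(
        @"^(?<num>\d+(?:\.\d+)*)[ \t]+(?<title>.+?)\r?$",
        RegexOptions.Multiline);

    // Returns the numbers of all headings in sectionText whose first
    // component differs from parentIndex (e.g. 7.1.3 inside section 5).
    public static List<string> FindStraySubheadings(string sectionText, int parentIndex)
    {
        var strays = new List<string>();
        foreach (Match m in HeadingLine.Matches(sectionText))
        {
            string number = m.Groups["num"].Value;
            string first = number.Split('.')[0];
            if (int.TryParse(first, out int index) && index != parentIndex)
            {
                strays.Add(number);
            }
        }
        return strays;
    }
}
```

Calling FindStraySubheadings(sectionText, 5) on a section that contains 5.2, 5.2.1 and 7.1.3 would report only 7.1.3, which the caller can then treat as either a numbering mistake or the start of a new section.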

Related

How to Extract Domain name from string with Regex in C#?

I want to extract top-level domain names and country-code top-level domain names from a string with Regex. I tested many regexes, like this one:
var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match m = linkParser.Match(Url);
Console.WriteLine(m.Value);
But none of them worked properly.
The text string entered by the user can take any of the following forms:
jonasjohn.com
http://www.jonasjohn.de/snippets/csharp/
jonasjohn.de
www.jonasjohn.de/snippets/csharp/
http://www.answers.com/article/1194427/8-habits-of-extraordinarily-likeable-people
http://www.apple.com
https://www.cnn.com.au
http://www.downloads.news.com.au
https://ftp.android.co.nz
http://global.news.ca
https://www.apple.com/
https://ftp.android.co.nz/
http://global.news.ca/
https://www.apple.com/
https://johnsmith.eu
ftp://johnsmith.eu
johnsmith.gov.ae
johnsmith.eu
www.jonasjohn.de
www.jonasjohn.ac.ir/snippets/csharp
http://www.jonasjohn.de/
ftp://www.jonasjohn.de/
https://subdomain.abc.def.jonasjohn.de/test.htm
The Regex I tested:
^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)
\b(?:https?://|www\.)\S+\b
://(?<host>([a-z\\d][-a-z\\d]*[a-z\\d]\\.)*[a-z][-a-z\\d]+[a-z])
and many others.
I just need the domain name and I don't need a protocol or a subdomain.
Like:
Domainname.gTLD or DomainName.ccTLD or DomainName.xyz.ccTLD
I got the list of them from the Public Suffix List.
Of course, I've seen a lot of posts on stackoverflow.com, but none of them answered my question.

You don't need a Regex to parse a URL. If you have a valid URL, you can use one of the Uri constructors or Uri.TryCreate to parse it:
if(Uri.TryCreate("http://google.com/asdfs",UriKind.RelativeOrAbsolute,out var uri))
{
Console.WriteLine(uri.Host);
}
www.jonasjohn.de/snippets/csharp/ and jonasjohn.de/snippets/csharp/ aren't valid URLs though. TryCreate can still parse them as relative URLs, but reading Host throws System.InvalidOperationException: This operation is not supported for a relative URI.
In that case you can use the UriBuilder class to parse and modify the URL, e.g.:
var bld=new UriBuilder("jonasjohn.com");
Console.WriteLine(bld.Host);
This prints
jonasjohn.com
Setting the Scheme property produces a valid, complete URL:
bld.Scheme="https";
Console.WriteLine(bld.Uri);
This produces:
https://jonasjohn.com:80/
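The two techniques above can be combined into one helper. The following is a minimal sketch under stated assumptions: the class name HostExtractor and the method GetHost are hypothetical, and the result is still the full host (including www. and any subdomains), not the registrable domain the question asks for.

```csharp
using System;

public static class HostExtractor
{
    // Returns the host part of the input, or null when it cannot be parsed.
    public static string GetHost(string input)
    {
        if (string.IsNullOrWhiteSpace(input))
        {
            return null;
        }

        // Inputs that carry a scheme ("http://...", "ftp://...") parse as absolute URIs.
        if (Uri.TryCreate(input, UriKind.Absolute, out Uri uri) && uri.Host.Length > 0)
        {
            return uri.Host;
        }

        // Scheme-less inputs such as "jonasjohn.com" go through UriBuilder.
        try
        {
            return new UriBuilder(input).Host;
        }
        catch (UriFormatException)
        {
            return null;
        }
    }
}
```

For example, GetHost("http://www.jonasjohn.de/snippets/csharp/") yields www.jonasjohn.de, so a further step is still needed to strip subdomains.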

Based on Lidqy's answer, I wrote this function. I think it covers most of the likely inputs; for anything it does not handle, you can throw an exception instead of returning an empty string.
// Requires: using System.Linq; using System.Text.RegularExpressions;
public static string ExtractDomainName(string Url)
{
    var regex = new Regex(@"^((https?|ftp)://)?(www\.)?(?<domain>[^/]+)(/|$)");
    Match match = regex.Match(Url);
    if (!match.Success)
    {
        return String.Empty;
    }
    string domain = match.Groups["domain"].Value;
    // Drop leading labels until at most two dots remain (e.g. news.com.au).
    while (domain.Count(x => x == '.') > 2)
    {
        domain = domain.Split('.', 2)[1];
    }
    return domain;
}
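Counting dots cannot tell global.news.ca (one-label suffix) apart from johnsmith.gov.ae (two-label suffix), which is what the Public Suffix List mentioned in the question is for. Here is a rough sketch of a suffix lookup, assuming a tiny hand-picked suffix set. The class name RegistrableDomain, the method FromHost and the four suffixes are illustrative assumptions; a real implementation would load the full list.

```csharp
using System;
using System.Collections.Generic;

public static class RegistrableDomain
{
    // Hypothetical, hand-picked subset of two-label public suffixes.
    // A real implementation would load the full Public Suffix List.
    private static readonly HashSet<string> TwoLabelSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com.au", "co.nz", "gov.ae", "ac.ir"
        };

    // Takes a host name (no scheme, no path) and keeps the public suffix
    // plus exactly one label in front of it.
    public static string FromHost(string host)
    {
        string[] labels = host.Trim('.').Split('.');
        if (labels.Length <= 2)
        {
            return string.Join(".", labels);
        }

        string lastTwo = labels[labels.Length - 2] + "." + labels[labels.Length - 1];
        int keep = TwoLabelSuffixes.Contains(lastTwo) ? 3 : 2;
        return string.Join(".", labels, labels.Length - keep, keep);
    }
}
```

With this subset, www.downloads.news.com.au becomes news.com.au and global.news.ca becomes news.ca, while johnsmith.gov.ae is returned unchanged.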

var rx = new Regex(@"^((https?|ftp)://)?(www\.)?(?<domain>[^/]+)(/|$)");
var data = new[] { "jonasjohn.com",
"http://www.jonasjohn.de/snippets/csharp/",
"jonasjohn.de",
"www.jonasjohn.de/snippets/csharp/",
"http://www.answers.com/article/1194427/8-habits-of-extraordinarily-likeable-people",
"http://www.apple.com",
"https://www.cnn.com.au",
"http://www.downloads.news.com.au",
"https://ftp.android.co.nz",
"http://global.news.ca",
"https://www.apple.com/",
"https://ftp.android.co.nz/",
"http://global.news.ca/",
"https://www.apple.com/",
"https://johnsmith.eu",
"ftp://johnsmith.eu",
"johnsmith.gov.ae",
"johnsmith.eu",
"www.jonasjohn.de",
"www.jonasjohn.ac.ir/snippets/csharp",
"http://www.jonasjohn.de/",
"ftp://www.jonasjohn.de/",
"https://subdomain.abc.def.jonasjohn.de/test.htm"
};
foreach (var dat in data) {
var match = rx.Match(dat);
if (match.Success)
Console.WriteLine("{0} => {1}", dat, match.Groups["domain"].Value);
else {
Console.WriteLine("{0} => NO MATCH", dat);
}
}

Extract ID and replace everything in `Example HTML`

I'm new to regular expressions. I have the following text in my HTML and would like to replace it with something else.
Example HTML:
{{Object id='foo'}}
Extract the id into a variable like this:
string strId = "foo";
So far I have the following Regular Expression code that will capture the Example HTML:
string strStart = "Object";
string strFind = "{{(" + strStart + ".*?)}}";
Regex regExp = new Regex(strFind, RegexOptions.IgnoreCase);
Match matchRegExp = regExp.Match(html);
while (matchRegExp.Success)
{
//At this point, I have this variable:
//{{Object id='foo'}}
//I can find the id='foo' (see below)
//but not sure how to extract 'foo' and use it
string strFindInner = "id='(.*?)'";
Regex regExpInner = new Regex(strFindInner, RegexOptions.IgnoreCase);
Match matchRegExpInner = regExpInner.Match(matchRegExp.Value.ToString());
//Do something with 'foo'
matchRegExp = matchRegExp.NextMatch();
}
I understand this might be a simple solution, I am hoping to gain more knowledge about Regular Expressions but more importantly, I am hoping to receive a suggestion on how to approach this cleaner and more efficiently.
Thank you
Edit:
Is this an example that I could potentially use: c# regex replace

I did not end up solving my initial question with regular expressions. For the time being I moved to a simpler solution using Substring, IndexOf and string.Split. I understand that my code needs to be cleaned up, but I thought I would post the answer that I have so far.
string html = "<p>Start of Example</p>{{Object id='foo'}}<p>End of example</p>";
string strObject = "Object"; //Example
//When found, this will contain "{{Object id='foo'}}"
string strCode = "";
//ie: "id='foo'"
string strCodeInner = "";
//Tags will be a list, but in this example, only "id='foo'"
string[] tags = { };
//Looking for the following "{{Object "
string strFindStart = "{{" + strObject + " ";
int intFindStart = html.IndexOf(strFindStart);
//Then ending in the following
string strFindEnd = "}}";
//Search for the end marker only after the start marker
int intFindEndMarker = intFindStart == -1 ? -1 : html.IndexOf(strFindEnd, intFindStart);
//Must find both Start and End conditions
if (intFindStart != -1 && intFindEndMarker != -1)
{
    int intFindEnd = intFindEndMarker + strFindEnd.Length;
    strCode = html.Substring(intFindStart, intFindEnd - intFindStart);
    //Remove Start and End
    strCodeInner = strCode.Replace(strFindStart, "").Replace(strFindEnd, "");
    //Split by spaces, this needs to be improved if more than IDs are to be used
    //but for proof of concept this is perfect
    tags = strCodeInner.Split(new char[] { ' ' });
}
Dictionary<string, string> dictTags = new Dictionary<string, string>();
foreach (string tag in tags)
{
string[] tagSplit = tag.Split(new char[] { '=' });
dictTags.Add(tagSplit[0], tagSplit[1].Replace("'", "").Replace("\"", ""));
}
//At this point, I can replace "{{Object id='foo'}}" with anything I'd like
//What I don't show is that I go into the website's database,
//get the object (ie: Slider) and return the html for slider with the ID of foo
html = html.Replace(strCode, strView);
/*
"html" variable may contain:
<p>Start of Example</p>
<p id="foo">This is the replacement text</p>
<p>End of example</p>
*/
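For comparison, the regex route from the question can do the extraction and the replacement in one pass by using a named group and a match evaluator. This is a sketch under assumptions: the class name PlaceholderReplacer and the render callback are hypothetical (the callback stands in for the database lookup that is not shown above), and the pattern only handles a single id='...' attribute.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderReplacer
{
    // Matches {{Object id='foo'}} and captures foo in the named group "id".
    private static readonly Regex Placeholder = new Regex(
        @"\{\{Object\s+id='(?<id>[^']*)'\s*\}\}",
        RegexOptions.IgnoreCase);

    // Replaces every placeholder with whatever render returns for its id.
    public static string Replace(string html, Func<string, string> render)
    {
        return Placeholder.Replace(html, m => render(m.Groups["id"].Value));
    }
}
```

A call such as PlaceholderReplacer.Replace(html, id => "<p id=\"" + id + "\">This is the replacement text</p>") swaps each placeholder for markup built from its id.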

C# Regex to replace specific hashtags with certain block of text

I am a new C# developer and I am struggling to write a method that replaces a few specific hashtags in a sample of tweets with specific blocks of text. For example, if the tweet has a hashtag like #StPaulSchool, I want to replace this hashtag with the text "St. Paul School", without the '#' character.
I have a very small list of the words that I need to replace. If there is no match, I would like to remove the hashtag (replace it with an empty string).
I am using the following method to parse the tweet and convert it into a formatted tweet but I don't know how to enhance it in order to handle the specific hashtags. Could you please tell me how to do that?
Here's the code:
public string ParseTweet(string rawTweet)
{
Regex link = new Regex(@"http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?");
Regex screenName = new Regex(@"@\w+");
Regex hashTag = new Regex(@"#\w+");
var words_to_replace = new string[] { "StPaulSchool", "AzharSchool", "WarwiSchool", "ManMet_School", "BrumSchool"};
var inputWords = new string[] { "St. Paul School", "Azhar School", "Warwick School", "Man Metapolian School", "Brummie School"};
string formattedTweet = link.Replace(rawTweet, delegate (Match m)
{
string val = m.Value;
//return string.Format("URL");
return string.Empty;
});
formattedTweet = screenName.Replace(formattedTweet, delegate (Match m)
{
string val = m.Value.Trim('@');
//return string.Format("USERNAME");
return string.Empty;
});
formattedTweet = hashTag.Replace(formattedTweet, delegate (Match m)
{
string val = m.Value;
//return string.Format("HASHTAG");
return string.Empty;
});
return formattedTweet;
}

The following code works for the hashtags:
static void Main(string[] args)
{
string longTweet = @"Long sentence #With #Some schools like #AzharSchool and spread out
over two #StPaulSchool lines ";
string result = Regex.Replace(longTweet, @"\#\w+", match => ReplaceHashTag(match.Value), RegexOptions.Multiline);
Console.WriteLine(result);
}
private static string ReplaceHashTag(string input)
{
switch (input)
{
case "#StPaulSchool": return "St. Paul School";
case "#AzharSchool": return "Azhar School";
default:
return input; // hashtag not recognized
}
}
If the list of hashtags to convert becomes very long, it would be more succinct to use a Dictionary, e.g.:
private static Dictionary<string, string> _hashtags
= new Dictionary<string, string>
{
{ "#StPaulSchool", "St. Paul School" },
{ "#AzharSchool", "Azhar School" },
};
and rewrite the body of the ReplaceHashTag method with this:
if (!_hashtags.ContainsKey(input))
{
    return input; // hashtag not recognized
}
return _hashtags[input];
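The question also asks for unrecognized hashtags to be removed rather than kept. A minimal sketch of that variant, using a single TryGetValue lookup instead of ContainsKey plus the indexer, could look like this. The class name HashtagReplacer is an assumption, and the case-insensitive comparer is a design choice that the original answer does not make.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashtagReplacer
{
    // Keys are stored without the leading '#'; the comparer is an assumption.
    private static readonly Dictionary<string, string> Known =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "StPaulSchool", "St. Paul School" },
            { "AzharSchool", "Azhar School" },
        };

    // Replaces known hashtags with their text and removes all other hashtags.
    public static string Replace(string tweet)
    {
        return Regex.Replace(tweet, @"#(\w+)", m =>
            Known.TryGetValue(m.Groups[1].Value, out string text) ? text : string.Empty);
    }
}
```

Removing a hashtag leaves the surrounding whitespace in place, so the caller may want to collapse repeated spaces afterwards.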

I believe that using regular expressions makes this code unreadable and difficult to maintain. Moreover, you are using a regular expression to find a very simple pattern: strings that start with the hashtag (#) character.
I suggest a different approach: Break the sentence into words, transform each word according to your business rules, then join the words back together. Although this sounds like a lot of work, and it may be the case in another language, the C# String class makes this quite easy to implement.
Here is a basic example of a console application that does the requested functionality, the business rules are hard-coded, but this should be enough so you could continue:
static void Main(string[] args)
{
string text = "Example #First #Second #NoMatch not a word ! \nSecond row #Second";
string[] wordsInText = text.Split(' ');
IEnumerable<string> transformedWords = wordsInText.Select(selector: word => ReplaceHashTag(word: word));
string transformedText = string.Join(separator: " ", values: transformedWords);
Console.WriteLine(value: transformedText);
}
private static string ReplaceHashTag(string word)
{
if (!word.StartsWith(value: "#"))
{
return word;
}
string wordWithoutHashTag = word.Substring(startIndex: 1);
if (wordWithoutHashTag == "First")
{
return "FirstTransformed";
}
if (wordWithoutHashTag == "Second")
{
return "SecondTransformed";
}
return string.Empty;
}
Note that this approach gives you much more flexibility in chaining your logic, and with small modifications you can make this code a lot more testable and incremental than the regular expression approach.

C# Parse text qualified file into separate strings

I'm trying to parse and split the following sample text file using .NET C# in order to break each data point into a separate string.
§Id§|§Name§|§UpdateDate§|§Description§
1|§AAA/FE-45§|2000-02-02 00:00:00|§§
2|§BBB-123§|2000-02-03 00:00:00|§§
3|§CC|45§|2000-02-07 00:00:00|§The following,
is a multiline description
please check Name:
CC|45 as soon as possible§
File Properties:
CodePage: ANSI
Column Headers: Yes
Row Delimiter: {CR}{LF}
Column Delimiter: | (Vertical Bar)
Text Qualifier: §
The trouble I have is that the text type columns are qualified with a non standard symbol and the given text could be a block of multi-line text that may contain various symbols such as {CRLF}, {LF} or even | (Vertical Bar).
From what I have read, I cannot use TextFieldParser because it only handles the double-quote qualifier, and Bulk Insert does not support text qualifiers.
I'm no c# expert at all; I wouldn't want to reinvent the wheel and ideally would like to use the best practices. But I also like to understand and "own" what I produce so I would prefer to avoid Libraries such as Filehelpers.
Thank you for your guidance!

A typical approach would be to use a finite automaton (a state machine) for this. In your case, you can try the following code:
public static List<string[]> split(string s)
{
bool ins = false;
int no = 3;
var L = new List<string>();
var Res = new List<string[]>();
var B = new StringBuilder();
foreach (var c in s)
{
switch (c)
{
case '§':
if (ins)
{
ins = false;
L.Add(B.ToString());
if (no == 0)
{
Res.Add(L.ToArray<string>());
L.Clear();
no = 3;
}
}
else
{
ins = true;
B.Clear();
}
break;
case '|':
if (!ins) { no--; }
else B.Append(c);
break;
default:
if (ins) B.Append(c);
break;
}
}
return Res;
}
}
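The code above only collects the qualified (§...§) fields, so the unqualified Id and UpdateDate columns are dropped. A variant of the same state-machine idea that keeps every column might look like the sketch below. The names QualifiedTextParser and Parse are hypothetical, and the sketch assumes the qualifier character never appears inside a field (there is no escaping).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QualifiedTextParser
{
    // Splits text into rows and fields. Delimiters and line breaks that
    // appear between two qualifier characters are kept as field content.
    public static List<string[]> Parse(string text, char qualifier = '§', char delimiter = '|')
    {
        var rows = new List<string[]>();
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQualified = false;

        foreach (char c in text)
        {
            if (c == qualifier)
            {
                inQualified = !inQualified;      // enter or leave a qualified field
            }
            else if (inQualified)
            {
                current.Append(c);               // everything inside is literal
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());  // end of field
                current.Clear();
            }
            else if (c == '\n')
            {
                fields.Add(current.ToString());  // end of field and of row
                current.Clear();
                rows.Add(fields.ToArray());
                fields.Clear();
            }
            else if (c != '\r')
            {
                current.Append(c);               // unqualified content, CR skipped
            }
        }

        // Flush the last row when the text does not end with a line break.
        if (current.Length > 0 || fields.Count > 0)
        {
            fields.Add(current.ToString());
            rows.Add(fields.ToArray());
        }
        return rows;
    }
}
```

The header line comes back as the first row, so callers that only want data can skip rows[0].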

Try this code:
string pattern = @"(?<id>\d+) \| (?<name>§.+?§) \| (?<date>\d{4}-\d\d-\d\d \s \d\d:\d\d:\d\d) \| (?<desc>§.*?§)";
Regex regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);
string text = File.ReadAllText("test.txt", Encoding.GetEncoding(1251));
text = text.Split(new string[] { Environment.NewLine }, 2, StringSplitOptions.None)[1];
var matches = regex.Matches(text);
foreach (Match match in matches)
{
Console.WriteLine(match.Groups["id"].Value);
Console.WriteLine(match.Groups["name"].Value.Trim('§'));
Console.WriteLine(match.Groups["date"].Value);
Console.WriteLine(match.Groups["desc"].Value.Trim('§'));
Console.WriteLine();
}

Search and replace values in text file with C#

I have a text file with a certain format. First comes an identifier followed by three spaces and a colon. Then comes the value for this identifier.
ID1 :Value1
ID2 :Value2
ID3 :Value3
What I need to do is search for e.g. ID2 : and replace Value2 with a new value, NewValue2. What would be a way to do this? The files I need to parse won't get very large. The largest will be around 150 lines.

If the file isn't that big you can do a File.ReadAllLines to get a collection of all the lines and then replace the line you're looking for like this
using System.IO;
using System.Linq;
using System.Collections.Generic;
List<string> lines = new List<string>(File.ReadAllLines("file"));
int lineIndex = lines.FindIndex(line => line.StartsWith("ID2 :"));
if (lineIndex != -1)
{
lines[lineIndex] = "ID2 :NewValue2";
File.WriteAllLines("file", lines);
}

Here's a simple solution which also creates a backup of the source file automatically.
The replacements are stored in a Dictionary object. They are keyed on the line's ID, e.g. 'ID2' and the value is the string replacement required. Just use Add() to add more as required.
StreamWriter writer = null;
Dictionary<string, string> replacements = new Dictionary<string, string>();
replacements.Add("ID2", "NewValue2");
// ... further replacement entries ...
using (writer = File.CreateText("output.txt"))
{
foreach (string line in File.ReadLines("input.txt"))
{
bool replacementMade = false;
foreach (var replacement in replacements)
{
if (line.StartsWith(replacement.Key))
{
writer.WriteLine(string.Format("{0} :{1}",
replacement.Key, replacement.Value));
replacementMade = true;
break;
}
}
if (!replacementMade)
{
writer.WriteLine(line);
}
}
}
File.Replace("output.txt", "input.txt", "input.bak");
You'll just have to replace input.txt, output.txt and input.bak with the paths to your source, destination and backup files.

Ordinarily, for any text searching and replacement, I'd suggest some sort of regular expression work, but if this is all you're doing, that's really overkill.
I would just open the original file and a temporary file; read the original a line at a time, and just check each line for "ID2 :"; if you find it, write your replacement string to the temporary file, otherwise, just write what you read. When you've run out of source, close both, delete the original, and rename the temporary file to that of the original.
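The steps described above could be sketched like this. The class name LineValueUpdater, the method SetValue and the ".tmp" suffix are assumptions; the key is compared after trimming the spaces before the colon, so the original spacing of the line is preserved.

```csharp
using System;
using System.IO;

public static class LineValueUpdater
{
    // Copies the file line by line into a temporary file, rewriting the
    // value of the line whose identifier equals id, then swaps the files.
    public static void SetValue(string path, string id, string newValue)
    {
        string tempPath = path + ".tmp"; // hypothetical temp-file name

        using (var writer = new StreamWriter(tempPath))
        {
            foreach (string line in File.ReadLines(path))
            {
                int colon = line.IndexOf(':');
                bool isTarget = colon >= 0 && line.Substring(0, colon).TrimEnd() == id;

                // Keep everything up to and including the colon, replace the rest.
                writer.WriteLine(isTarget ? line.Substring(0, colon + 1) + newValue : line);
            }
        }

        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

Writing to a separate file first means the original is only removed after the new content has been written out completely.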

Something like this should work. It's very simple, not the most efficient thing, but for small files, it would be just fine:
private void setValue(string filePath, string key, string value)
{
string[] lines= File.ReadAllLines(filePath);
for(int x = 0; x < lines.Length; x++)
{
string[] fields = lines[x].Split(':');
if (fields[0].TrimEnd() == key)
{
lines[x] = fields[0] + ':' + value;
File.WriteAllLines(filePath, lines);
break;
}
}
}

You can use regex and do it in 3 lines of code
string text = File.ReadAllText("sourcefile.txt");
text = Regex.Replace(text, @"(?i)(?<=^id2\s*?:\s*?)\w*?(?=\s*?$)", "NewValue2",
RegexOptions.Multiline);
File.WriteAllText("outputfile.txt", text);
In the regex, (?i)(?<=^id2\s*?:\s*?)\w*?(?=\s*?$) means: find a line that starts with id2 (case-insensitive), followed by a colon with any amount of whitespace before and after it, and replace the word that follows (alphanumeric characters only, no punctuation) up to the end of the line. If you want to include punctuation, replace \w*? with .*?

You can use regexes to achieve this.
Regex re = new Regex(@"^ID\d+ :Value(\d+)\s*$", RegexOptions.IgnoreCase | RegexOptions.Compiled);
string[] lines = File.ReadAllLines("mytextfile");
foreach (string line in lines) {
    string replaced = re.Replace(line, processMatch);
    //Now do whatever you need to do with the value
}
string processMatch(Match m)
{
    // Groups[1] holds the digits captured after "Value"
    string number = m.Groups[1].Value;
    return String.Format("ID{0} :NewValue{0}", number);
}
