C# Extract JSON object from mixed data text/js file

I need to parse a ReactJS file, main.451e57c9.js, to retrieve the version number with C#.
This file contains mixed data; here is a small part of it:
.....inally{if(s)throw i}}return a}}(e,t)||xe(e,t)||we()}var Se=
JSON.parse('{"shortVersion":"v3.1.56"}')
,Ne="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgA
AASAAAAAqCAYAAAATb4ZSAAAACXBIWXMAAAsTAAALEw.....
I need to extract the JSON data {"shortVersion":"v3.1.56"}.
Last time I simply searched for the string shortVersion and returned a certain number of characters after it, but it feels like I'm reinventing the wheel. Is there a proper way to identify and extract JSON from mixed text?
public static void findVersion()
{
var partialName = "main.*.js";
string[] filesInDir = Directory.GetFiles(@pathToFile, partialName);
foreach (var line in File.ReadLines(filesInDir[0]))
{
string keyword = "shortVersion";
int indx = line.IndexOf(keyword);
if (indx != -1)
{
string code = line.Substring(indx + keyword.Length);
Console.WriteLine(code);
}
}
}
RESULT
":"v3.1.56"}'),Ne="data:image/png;base64,iVBORw0KGgoAA.....

string findJson(string input, string keyword) {
int startIndex = input.IndexOf(keyword) - 2; //Find the starting point of shortversion then subtract 2 to start at the { bracket
input = input.Substring(startIndex); //Grab everything after the start index
int endIndex = 0;
for (int i = 0; i < input.Length; i++) {
char letter = input[i];
if (letter == '}') {
endIndex = i; //Capture the first instance of the closing bracket in the new trimmed input string.
break;
}
}
return input.Remove(endIndex+1);
}
Console.WriteLine(findJson("fwekjfwkejwe{'shortVersion':'v3.1.56'}wekjrlklkj23klj23jkl234kjlk", "shortVersion"));
You will receive {'shortVersion':'v3.1.56'} as output.
Note that you may have to use line.Replace('"', '\''); first, so the input uses single quotes as in the example above.

Try the method below:
public static object ExtractJsonFromText(string mixedStrng)
{
for (var i = mixedStrng.IndexOf('{'); i > -1; i = mixedStrng.IndexOf('{', i + 1))
{
for (var j = mixedStrng.LastIndexOf('}'); j > -1; j = mixedStrng.LastIndexOf("}", j -1))
{
var jsonProbe = mixedStrng.Substring(i, j - i + 1);
try
{
return JsonConvert.DeserializeObject(jsonProbe);
}
catch
{
}
}
}
return null;
}
Fiddle
https://dotnetfiddle.net/N1jiWH

You should not use GetFiles(), since you only need one file and it returns all matches before you can do anything. The code below should give you something to work with, and it should be about as fast as it can be with big files and/or lots of files in a folder (to be fair, I have not tested this on such a large file system or file).
using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json; // needed for JsonConvert
public class Program
{
public static void Main()
{
Console.WriteLine("Hello World");
var path = $@"c:\SomePath";
var jsonString = GetFileVersion(path);
if (!string.IsNullOrWhiteSpace(jsonString))
{
// do something with string; deserialize or whatever.
// the fragment is a single JSON object, so deserialize to Version rather than a list
var result = JsonConvert.DeserializeObject<Version>(jsonString);
var vers = result.shortVersion;
}
}
private static string GetFileVersion(string path)
{
var partialName = "main.*.js";
// JSON string fragment to find: doubled-up braces and quotes for the $@ string
string matchString = $@"{{""shortVersion"":";
// the closing quote and brace that end the JSON object
string matchEndString = $@"""}}";
// we can later stop on the first match
DirectoryInfo dir = new DirectoryInfo(path);
if (!dir.Exists)
{
throw new DirectoryNotFoundException("The directory does not exist.");
}
// Call the GetFileSystemInfos method and grab the first one
FileSystemInfo info = dir.GetFileSystemInfos(partialName).FirstOrDefault();
// FirstOrDefault returns null when no file matches the pattern
if (info != null && info.Exists)
{
// walk the file contents looking for a match (assumes there IS a match and it contains the string noted)
var line = File.ReadLines(info.FullName).SkipWhile(l => !l.Contains(matchString)).Take(1).First();
var indexStart = line.IndexOf(matchString);
var indexEnd = line.IndexOf(matchEndString, indexStart);
// Substring takes a length, not an end index
var jsonString = line.Substring(indexStart, indexEnd - indexStart + matchEndString.Length);
return jsonString;
}
return string.Empty;
}
public class Version
{
public string shortVersion { get; set; }
}
}

Use this; it should be faster: https://dotnetfiddle.net/sYFvYj
public static object ExtractJsonFromText(string mixedStrng)
{
string pattern = @"\(\'\{.*}\'\)";
string str = null;
foreach (Match match in Regex.Matches(mixedStrng, pattern, RegexOptions.Multiline))
{
if (match.Success)
{
str = str + Environment.NewLine + match;
}
}
return str;
}
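The regex idea above can be combined with a real JSON parser so that only the argument of JSON.parse('...') is captured and then validated. This is a minimal sketch, not a definitive implementation: the names VersionExtractor and FindShortVersion are hypothetical, it uses System.Text.Json instead of the Newtonsoft JsonConvert seen in the other answers, and it assumes the quoted argument contains no single quotes or JavaScript escape sequences.

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

public static class VersionExtractor
{
    // Matches JSON.parse('{...}') and captures the single-quoted argument.
    // Assumption: the argument contains no single quotes or JS escapes.
    private static readonly Regex JsonParseCall =
        new Regex(@"JSON\.parse\('(?<json>\{[^']*\})'\)");

    // Returns the shortVersion value, or null when no candidate has one.
    public static string FindShortVersion(string bundleText)
    {
        foreach (Match match in JsonParseCall.Matches(bundleText))
        {
            try
            {
                using (JsonDocument doc = JsonDocument.Parse(match.Groups["json"].Value))
                {
                    JsonElement root = doc.RootElement;
                    if (root.ValueKind == JsonValueKind.Object &&
                        root.TryGetProperty("shortVersion", out JsonElement version) &&
                        version.ValueKind == JsonValueKind.String)
                    {
                        return version.GetString();
                    }
                }
            }
            catch (JsonException)
            {
                // Not valid JSON after all; try the next candidate.
            }
        }
        return null;
    }
}
```

Calling VersionExtractor.FindShortVersion(File.ReadAllText(path)) on the bundle shown in the question would be the intended usage; for very large files, the same method could be applied line by line with File.ReadLines.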

Related

Count line by searching crlf (0d0a)

I would like to count the number of lines in a file based on the CRLF (0D0A) count. My current code counts lines ending in a bare CR (0D) or LF as well. Can anybody give a suggestion?
public static int Countline(string file)
{
var lineCount = 0;
using (var reader = File.OpenText(file))
{
while (reader.ReadLine() != null)
{
lineCount++;
}
}
return lineCount;
}
Usage:
Countline("text.txt", "\r\n");
Method:
public static int Countline(string file, string lineSeperator)
{
string text = File.ReadAllText(file);
return System.Text.RegularExpressions.Regex.Matches(text, lineSeperator).Count;
}
string content = System.IO.File.ReadAllText( fileName );
int numMatches = content.Select((c, i) => content.Substring(i)).Count(sub => sub.StartsWith(Environment.NewLine));
Note I'm using Environment.NewLine for line endings but you can replace with the whole string if you prefer.
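The Substring/StartsWith version above allocates a new string for every character position. A sketch of the same count using repeated IndexOf calls is shown here; the names CrlfCounter and CountOccurrences are hypothetical, and the ordinal comparison is an assumption chosen so that "\r\n" is matched byte for byte rather than by culture rules.

```csharp
using System;

public static class CrlfCounter
{
    // Counts non-overlapping occurrences of separator in text.
    public static int CountOccurrences(string text, string separator)
    {
        if (string.IsNullOrEmpty(separator))
        {
            throw new ArgumentException("Separator must not be empty.", nameof(separator));
        }

        int count = 0;
        int index = text.IndexOf(separator, StringComparison.Ordinal);
        while (index != -1)
        {
            count++;
            // Continue searching just past the match that was found.
            index = text.IndexOf(separator, index + separator.Length, StringComparison.Ordinal);
        }
        return count;
    }
}
```

Usage would look like CrlfCounter.CountOccurrences(File.ReadAllText("text.txt"), "\r\n"). A bare "\r" or "\n" is not counted.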
public int CountLines(string Text)
{
int count = 0;
foreach (ReadOnlySpan<char> _ in Text.AsSpan().EnumerateLines())
{
count++;
}
return count;
}

String Utilities in C#

I'm learning about string utilities in C#, and I have a method that replaces parts of a string.
Using the replace method I need to get an output such as
"Old file name: file00"
"New file name: file01"
Depending on what the user wants to change it to.
I am looking for help on making the method (NextImageName) replace only the digits, but not the file name.
class BuildingBlock
{
public static string ReplaceOnce(string word, string characters, int position)
{
word = word.Remove(position, characters.Length);
word = word.Insert(position, characters);
return word;
}
public static string GetLastName(string name)
{
string result = "";
int posn = name.LastIndexOf(' ');
if (posn >= 0) result = name.Substring(posn + 1);
return result;
}
public static string NextImageName(string filename, int newNumber)
{
if (newNumber > 9)
{
return ReplaceOnce(filename, newNumber, (filename.Length - 2))
}
if (newNumber < 10)
{
}
if (newNumber == 0)
{
}
}
The other "if" statements are empty for now until I find out how to do the first one.
The correct way to do this would be to use Regular Expressions.
Ideally you would separate "file" from "00" in "file00". Then take "00", convert it to an Int32 (using Int32.Parse()) and then rebuild your string with String.Format().
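A sketch of that idea, under some assumptions: the class name ImageNames is hypothetical, the number is taken to be the run of digits at the very end of the name, and the new number is zero-padded to the width of the old one using the "D" format specifier rather than String.Format.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ImageNames
{
    public static string NextImageName(string filename, int newNumber)
    {
        // Split "file00" into the name part ("file") and the trailing digits ("00").
        Match match = Regex.Match(filename, @"^(?<name>.*?)(?<digits>[0-9]+)$");
        if (!match.Success)
        {
            // No trailing digits: leave the name unchanged.
            return filename;
        }

        string name = match.Groups["name"].Value;
        int width = match.Groups["digits"].Length;

        // "D2" pads to at least two digits, so 1 becomes "01".
        return name + newNumber.ToString("D" + width, CultureInfo.InvariantCulture);
    }
}
```

Because the padding width is only a minimum, a number with more digits than the original (for example 123 for "file00") is written out in full.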
public static string NextImageName(string filename, int newNumber)
{
string oldnumber = "";
foreach (var item in filename.ToCharArray().Reverse())
if (char.IsDigit(item))
oldnumber = item + oldnumber ;
else
break;
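// cut off the trailing digits and append the new number
// (Replace would touch every occurrence and throws when oldnumber is empty)
return filename.Substring(0, filename.Length - oldnumber.Length) + newNumber.ToString();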
}
public static string NextImageName(string filename, int newNumber)
{
int i = 0;
foreach (char c in filename) // get index of first number
{
if (char.IsNumber(c))
break;
else
i++;
}
string s = filename.Substring(0,i); // remove original number
s = s + newNumber.ToString(); // add new number
return s;
}

C# How can I compare two word strings and indicate which parts are different

For example if I have...
string a = "personil";
string b = "personal";
I would like to get...
string c = "person[i]l";
However, the difference is not necessarily a single character. It could be like this too...
string a = "disfuncshunal";
string b = "dysfunctional";
For this case I would want to get...
string c = "d[isfuncshu]nal";
Another example would be... (Notice that the two words have different lengths.)
string a = "parralele";
string b = "parallel";
string c = "par[ralele]";
Another example would be...
string a = "ato";
string b = "auto";
string c = "a[]to";
How would I go about doing this?
Edit: The length of the two strings can be different.
Edit: Added additional examples. Credit goes to user Nenad for asking.
I must be very bored today, but I actually wrote a unit test that passes all 4 cases (if you did not add some more in the meantime).
Edit: Added 2 edge cases and fix for them.
Edit2: letters that repeat multiple times (and error on those letters)
[Test]
[TestCase("parralele", "parallel", "par[ralele]")]
[TestCase("personil", "personal", "person[i]l")]
[TestCase("disfuncshunal", "dysfunctional", "d[isfuncshu]nal")]
[TestCase("ato", "auto", "a[]to")]
[TestCase("inactioned", "inaction", "inaction[ed]")]
[TestCase("refraction", "fraction", "[re]fraction")]
[TestCase("adiction", "ad[]diction", "ad[]iction")]
public void CompareStringsTest(string attempted, string correct, string expectedResult)
{
int first = -1, last = -1;
string result = null;
int shorterLength = (attempted.Length < correct.Length ? attempted.Length : correct.Length);
// First - [
for (int i = 0; i < shorterLength; i++)
{
if (correct[i] != attempted[i])
{
first = i;
break;
}
}
// Last - ]
var a = correct.Reverse().ToArray();
var b = attempted.Reverse().ToArray();
for (int i = 0; i < shorterLength; i++)
{
if (a[i] != b[i])
{
last = i;
break;
}
}
if (first == -1 && last == -1)
result = attempted;
else
{
var sb = new StringBuilder();
if (first == -1)
first = shorterLength;
if (last == -1)
last = shorterLength;
// If same letter repeats multiple times (ex: addition)
// and error is on that letter, we have to trim trail.
if (first + last > shorterLength)
last = shorterLength - first;
if (first > 0)
sb.Append(attempted.Substring(0, first));
sb.Append("[");
if (last > -1 && last + first < attempted.Length)
sb.Append(attempted.Substring(first, attempted.Length - last - first));
sb.Append("]");
if (last > 0)
sb.Append(attempted.Substring(attempted.Length - last, last));
result = sb.ToString();
}
Assert.AreEqual(expectedResult, result);
}
Have you tried my DiffLib?
With that library, and the following code (running in LINQPad):
void Main()
{
string a = "disfuncshunal";
string b = "dysfunctional";
var diff = new Diff<char>(a, b);
var result = new StringBuilder();
int index1 = 0;
int index2 = 0;
foreach (var part in diff)
{
if (part.Equal)
result.Append(a.Substring(index1, part.Length1));
else
result.Append("[" + a.Substring(index1, part.Length1) + "]");
index1 += part.Length1;
index2 += part.Length2;
}
result.ToString().Dump();
}
You get this output:
d[i]sfunc[shu]nal
To be honest I don't understand what this gives you, as you seem to completely ignore the changed parts in the b string, only dumping the relevant portions of the a string.
Here is a complete and working console application that will work for both examples you gave:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApplication2
{
class Program
{
static void Main(string[] args)
{
string a = "disfuncshunal";
string b = "dysfunctional";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < a.Length; i++)
{
if (a[i] != b[i])
{
sb.Append("[");
sb.Append(a[i]);
sb.Append("]");
continue;
}
sb.Append(a[i]);
}
var str = sb.ToString();
var startIndex = str.IndexOf("[");
var endIndex = str.LastIndexOf("]");
var start = str.Substring(0, startIndex + 1);
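// the second argument is a length: everything strictly between the first "[" and the last "]"
var mid = str.Substring(startIndex + 1, endIndex - startIndex - 1);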
var end = str.Substring(endIndex);
Console.WriteLine(start + mid.Replace("[", "").Replace("]", "") + end);
}
}
}
It will not work if you want to display more than one mismatched section of the word, and it assumes both strings have the same length.
You did not specify what to do if the strings were of different lengths, but here is a solution to the problem when the strings are of equal length:
private string Compare(string string1, string string2) {
//This only works if the two strings are the same length..
string output = "";
bool mismatch = false;
for (int i = 0; i < string1.Length; i++) {
char c1 = string1[i];
char c2 = string2[i];
if (c1 == c2) {
if (mismatch) {
output += "]" + c1;
mismatch = false;
} else {
output += c1;
}
} else {
if (mismatch) {
output += c1;
} else {
output += "[" + c1;
mismatch = true;
}
}
}
return output;
}
Not a really good approach, but as an exercise in using LINQ: the task seems to be to find the matching prefix and suffix of the 2 strings and return prefix + "[" + middle of the first string + "]" + suffix.
So you can match the prefix (Zip + TakeWhile(a==b)), then repeat the same for the suffix by reversing both strings and reversing the result.
var first = "disfuncshunal";
var second = "dysfunctional";
// Prefix
var zipped = first.ToCharArray().Zip(second.ToCharArray(), (f,s)=> new {f,s});
var prefix = string.Join("",
zipped.TakeWhile(c => c.f==c.s).Select(c => c.f));
// Suffix
var zippedReverse = first.ToCharArray().Reverse()
.Zip(second.ToCharArray().Reverse(), (f,s)=> new {f,s});
var suffix = string.Join("",
zippedReverse.TakeWhile(c => c.f==c.s).Reverse().Select(c => c.f));
// Cut and combine.
var middle = first.Substring(prefix.Length,
first.Length - prefix.Length - suffix.Length);
var result = prefix + "[" + middle + "]" + suffix;
A much easier and faster approach is to use 2 for loops (from start to end, and from end to start).
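The two-loop version could look roughly like this. It is a sketch under assumptions: WordDiff and MarkDifference are hypothetical names, and the suffix scan is capped so that it never overlaps the prefix, which is what keeps cases like "ato" vs "auto" from producing a negative length.

```csharp
using System;

public static class WordDiff
{
    public static string MarkDifference(string attempted, string correct)
    {
        if (attempted == correct)
        {
            return attempted;
        }

        int shorter = Math.Min(attempted.Length, correct.Length);

        // Loop 1: length of the common prefix, scanning from the start.
        int prefix = 0;
        for (; prefix < shorter; prefix++)
        {
            if (attempted[prefix] != correct[prefix]) break;
        }

        // Loop 2: length of the common suffix, scanning from the end,
        // limited to the characters not already used by the prefix.
        int suffix = 0;
        for (; suffix < shorter - prefix; suffix++)
        {
            if (attempted[attempted.Length - 1 - suffix] != correct[correct.Length - 1 - suffix]) break;
        }

        string middle = attempted.Substring(prefix, attempted.Length - prefix - suffix);
        return attempted.Substring(0, prefix)
            + "[" + middle + "]"
            + attempted.Substring(attempted.Length - suffix);
    }
}
```

For example, WordDiff.MarkDifference("disfuncshunal", "dysfunctional") gives d[isfuncshu]nal, and WordDiff.MarkDifference("ato", "auto") gives a[]to.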

Multiple string replace in c#

I am dynamically editing a regex for matching text in a pdf, which can contain hyphenation at the end of some lines.
Example:
Source string:
"consecuti?vely"
Replace rules:
.Replace("cuti?",@"cuti?(-\s+)?")
.Replace("con",@"con(-\s+)?")
.Replace("consecu",@"consecu(-\s+)?")
Desired output:
"con(-\s+)?secu(-\s+)?ti?(-\s+)?vely"
The replace rules are built dynamically, this is just an example which causes problems.
What's the best way to perform such a multiple replace so that it produces the desired output?
So far I thought about using Regex.Replace and zipping the word to replace with optional (-\s+)? between each character, but that would not work, because the word to replace already contains special-meaning characters in regex context.
EDIT: My current code, which doesn't work when the replace rules overlap as in the example above:
private string ModifyRegexToAcceptHyphensOfCurrentPage(string regex, int searchedPage)
{
var originalTextOfThePage = mPagesNotModified[searchedPage];
var hyphenatedParts = Regex.Matches(originalTextOfThePage, @"\w+\-\s");
for (int i = 0; i < hyphenatedParts.Count; i++)
{
var partBeforeHyphen = String.Concat(hyphenatedParts[i].Value.TakeWhile(c => c != '-'));
regex = regex.Replace(partBeforeHyphen, partBeforeHyphen + @"(-\s+)?");
}
return regex;
}
The output of this program is "con(-\s+)?secu(-\s+)?ti?(-\s+)?vely",
so, as I understand your problem, this code should solve it.
class Program
{
class somefields
{
public string first;
public string secound;
public string Add;
public int index;
public somefields(string F, string S)
{
first = F;
secound = S;
}
}
static void Main(string[] args)
{
//declaring input
string input = "consecuti?vely";
List<somefields> rules=new List<somefields>();
//declaring rules
rules.Add(new somefields("cuti?",@"cuti?(-\s+)?"));
rules.Add(new somefields("con",@"con(-\s+)?"));
rules.Add(new somefields("consecu",@"consecu(-\s+)?"));
// finding the string which must be added to output string and index of that
foreach (var rul in rules)
{
var index=input.IndexOf(rul.first);
if (index != -1)
{
var add = rul.secound.Remove(0,rul.first.Count());
rul.Add = add;
rul.index = index+rul.first.Count();
}
}
// sort rules by index
for (int i = 0; i < rules.Count(); i++)
{
for (int j = i + 1; j < rules.Count(); j++)
{
if (rules[i].index > rules[j].index)
{
somefields temp;
temp = rules[i];
rules[i] = rules[j];
rules[j] = temp;
}
}
}
string output = input.ToString();
int k=0;
foreach(var rul in rules)
{
if (rul.index != -1)
{
output = output.Insert(k + rul.index, rul.Add);
k += rul.Add.Length;
}
}
System.Console.WriteLine(output);
System.Console.ReadLine();
}
}
You should probably write your own parser; it's probably easier to maintain :).
Maybe you could add "special characters" around the patterns in order to protect them, such as "##", provided the strings do not already contain them.
Try this one:
var final = Regex.Replace(originalTextOfThePage, @"(\w+)(?:\-[\s\r\n]*)?", "$1");
I had to give up on an easy solution and did the editing of the regex myself. As a side effect, the new approach goes through the string only twice.
private string ModifyRegexToAcceptHyphensOfCurrentPage(string regex, int searchedPage)
{
var indexesToInsertPossibleHyphenation = GetPossibleHyphenPositions(regex, searchedPage);
var hyphenationToken = @"(-\s+)?";
return InsertStringTokenInAllPositions(regex, indexesToInsertPossibleHyphenation, hyphenationToken);
}
private static string InsertStringTokenInAllPositions(string sourceString, List<int> insertionIndexes, string insertionToken)
{
if (insertionIndexes == null || string.IsNullOrEmpty(insertionToken)) return sourceString;
var sb = new StringBuilder(sourceString.Length + insertionIndexes.Count * insertionToken.Length);
var linkedInsertionPositions = new LinkedList<int>(insertionIndexes.Distinct().OrderBy(x => x));
for (int i = 0; i < sourceString.Length; i++)
{
if (!linkedInsertionPositions.Any())
{
sb.Append(sourceString.Substring(i));
break;
}
if (i == linkedInsertionPositions.First.Value)
{
sb.Append(insertionToken);
}
if (i >= linkedInsertionPositions.First.Value)
{
linkedInsertionPositions.RemoveFirst();
}
sb.Append(sourceString[i]);
}
return sb.ToString();
}
private List<int> GetPossibleHyphenPositions(string regex, int searchedPage)
{
var originalTextOfThePage = mPagesNotModified[searchedPage];
var hyphenatedParts = Regex.Matches(originalTextOfThePage, @"\w+\-\s");
var indexesToInsertPossibleHyphenation = new List<int>();
//....
// Aho-Corasick to find all occurrences of all
//strings in "hyphenatedParts" in the "regex" string
// ....
return indexesToInsertPossibleHyphenation;
}

Remove HTML tags and comments from a string in C#?

How do I remove everything beginning with '<' and ending with '>' from a string in C#? I know it can be done with regex, but I'm not very good with it.
The tag pattern I quickly wrote for a recent small project is this one.
string tagPattern = @"<[!--\W*?]*?[/]*?\w+.*?>";
I used it like this
MatchCollection matches = Regex.Matches(input, tagPattern);
foreach (Match match in matches)
{
input = input.Replace(match.Value, string.Empty);
}
It would likely need to be modified to correctly handle script or style tags.
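Since the question also mentions comments, here is a hedged two-pass sketch using Regex.Replace: comments are removed first, then anything between '<' and '>'. The class name HtmlStripper and both patterns are illustrative assumptions, not the pattern from the answer above, and like that pattern they do not treat script or style content specially.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Singleline lets '.' cross line breaks, so multi-line comments are matched.
    private static readonly Regex Comment =
        new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // Anything from '<' up to the next '>'.
    private static readonly Regex Tag = new Regex(@"<[^>]*>");

    public static string Strip(string input)
    {
        // Remove comments first, because a comment may itself contain '>'.
        string withoutComments = Comment.Replace(input, string.Empty);
        return Tag.Replace(withoutComments, string.Empty);
    }
}
```

For instance, HtmlStripper.Strip("<p>Hi <!-- a > b --><b>there</b></p>") yields "Hi there". For anything beyond simple markup, an HTML parser is the safer choice.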
Non regex option: But it still won't parse nested tags!
public static string StripHTML(string line)
{
int finished = 0;
int beginStrip;
int endStrip;
finished = line.IndexOf('<');
while (finished != -1)
{
beginStrip = line.IndexOf('<');
endStrip = line.IndexOf('>', beginStrip + 1);
line = line.Remove(beginStrip, (endStrip + 1) - beginStrip);
finished = line.IndexOf('<');
}
return line;
}
Another non-regex approach, which works 8x faster than regex:
public static string StripTagsCharArray(string source)
{
char[] array = new char[source.Length];
int arrayIndex = 0;
bool inside = false;
for (int i = 0; i < source.Length; i++)
{
char let = source[i];
if (let == '<')
{
inside = true;
continue;
}
if (let == '>')
{
inside = false;
continue;
}
if (!inside)
{
array[arrayIndex] = let;
arrayIndex++;
}
}
return new string(array, 0, arrayIndex);
}
