I'm using LocationTextExtractionStrategy combined with a custom ITextExtractionStrategy class to read a PDF. With this code I can read portions of documents based on coordinates without problems.
Now I have a PDF that looks like the others, but if I try to read it I get text like this:
2 D 80 D 8 1 M 13M2 R V / 8 3B 3 3 710 022/F//0 R8 8 1 0 / 3
This is the code I'm using:
private static string ReadFilePart(string fileName, int pageNumber, int fromLeft, int fromBottom, int width, int height)
{
var rect = new System.util.RectangleJ(fromLeft, fromBottom, width, height);
var pdfReader = new PdfReader(fileName);
var filters = new RenderFilter[1];
filters[0] = new RegionTextRenderFilter(rect);
var strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filters);
var pageText = PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, new LimitedTextStrategy(strategy));
pdfReader.Close();
return pageText;
}
private class LimitedTextStrategy : ITextExtractionStrategy
{
public readonly ITextExtractionStrategy textextractionstrategy;
public LimitedTextStrategy(ITextExtractionStrategy strategy)
{
textextractionstrategy = strategy;
}
public void RenderText(TextRenderInfo renderInfo)
{
foreach (TextRenderInfo info in renderInfo.GetCharacterRenderInfos())
{
textextractionstrategy.RenderText(info);
}
}
public string GetResultantText()
{
return textextractionstrategy.GetResultantText();
}
public void BeginTextBlock()
{
textextractionstrategy.BeginTextBlock();
}
public void EndTextBlock()
{
textextractionstrategy.EndTextBlock();
}
public void RenderImage(ImageRenderInfo renderInfo)
{
textextractionstrategy.RenderImage(renderInfo);
}
}
I cannot share the PDF file due to sensitive data.
Update
If I replace LocationTextExtractionStrategy with SimpleTextExtractionStrategy, it recognizes the full row without strange characters (a PDF structure issue?).
Update 2
I can now share the file! The problematic pages are the 2nd and 3rd.
PDF file
Test solution to read the file
Update 3
mkl pointed me in the right direction and I fixed it by adding default FirstChar, LastChar and Widths values to all fonts with missing properties.
private static PdfReader FontFix(PdfReader pdfReader)
{
for (var p = 1; p <= pdfReader.NumberOfPages; p++)
{
var dic = pdfReader.GetPageN(p);
var resources = dic.GetAsDict(PdfName.RESOURCES);
var fonts = resources?.GetAsDict(PdfName.FONT);
if (fonts == null) continue;
foreach (var key in fonts.Keys)
{
var font = fonts.GetAsDict(key);
var firstChar = font.Get(PdfName.FIRSTCHAR);
if (firstChar == null)
font.Put(PdfName.FIRSTCHAR, new PdfNumber(0));
var lastChar = font.Get(PdfName.LASTCHAR);
if (lastChar == null)
font.Put(PdfName.LASTCHAR, new PdfNumber(255));
var widths = font.GetAsArray(PdfName.WIDTHS);
if (widths == null)
{
var array = Enumerable.Repeat(600, 256).ToArray();
font.Put(PdfName.WIDTHS, new PdfArray(array));
}
}
}
return pdfReader;
}
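For reference, a minimal sketch of wiring FontFix into the extraction shown earlier; the only addition is the FontFix call, everything else mirrors ReadFilePart:

// Sketch: repair the font dictionaries first, then extract as before.
var pdfReader = FontFix(new PdfReader(fileName));
var rect = new System.util.RectangleJ(fromLeft, fromBottom, width, height);
var filters = new RenderFilter[] { new RegionTextRenderFilter(rect) };
var strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filters);
var pageText = PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, new LimitedTextStrategy(strategy));
pdfReader.Close();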
The error in the PDF
The cause of this issue is that the PDF contains one incomplete font dictionary. Most font dictionaries in the PDF are complete, but there is one exception: the dictionary in object 28, used for the font Fo0 in the shared resources, which is used to "fill in" the fields on pages two and three:
<<
/Name /Fo0
/Subtype /TrueType
/BaseFont /CourierNew
/Type /Font
/Encoding /WinAnsiEncoding
>>
In particular this font dictionary does not contain the required Widths entry whose value would be an array of the widths of the font glyphs.
Thus, iTextSharp has no idea how wide the glyphs actually are and uses 0 as the default value.
As an aside, such incomplete font dictionaries are allowed (albeit deprecated) for a very limited set of Type 1 fonts, the so-called standard 14 fonts. The TrueType font "CourierNew" obviously is not among them. But the developer who created the software responsible for the incomplete structure above probably did not care to look into the PDF specification and simply followed the example of those special Type 1 fonts.
The effect on your code
In your LimitedTextStrategy.RenderText implementation
public void RenderText(TextRenderInfo renderInfo)
{
foreach (TextRenderInfo info in renderInfo.GetCharacterRenderInfos())
{
textextractionstrategy.RenderText(info);
}
}
you split the renderInfo (describing a longer string) into multiple TextRenderInfo instances (describing one glyph each). If the font of renderInfo is the critical Fo0, all those TextRenderInfo instances have the same position because iTextSharp assumed the glyph widths to be 0.
...using the LocationTextExtractionStrategy
Those TextRenderInfo instances then are filtered and forwarded to the LocationTextExtractionStrategy which later on sorts them by position. As the positions coincide and the sorting algorithm used does not keep elements with the same position in their original order, this sorting effectively shuffles them. Eventually you get all the corresponding characters in a chaotic order.
...using the SimpleTextExtractionStrategy
In this case those TextRenderInfo instances then are filtered and forwarded to the SimpleTextExtractionStrategy which does not sort them but instead adds the respectively corresponding characters to the result string. If in the content stream the text showing operations occur in reading order, the result returned by the strategies is in proper reading order, too.
Why does Adobe Reader display the text in proper order?
If confronted with a broken PDF, different programs can attempt different strategies to cope with the situation.
Adobe Reader in the case at hand most likely searches for a CourierNew TrueType font program in the operating system and uses the width information from there. This is most likely what the creator of that broken font structure hoped for.
Related
Goal: extract a value from a specific location inside a PDF page. In GemBox.Pdf, I can extract text elements including their bounds and content, but:
Problem: a text element can have a complex structure, with each glyph being positioned using individual settings.
Consider this common example of a page header:
Billing Info Date: 02/02/2022
Company Ltd. Order Number: 0123456789
123 Main Street Name: Smith, John
Let's say, I want to get the order number (0123456789) from the document, knowing its precise position on the page. But in practice, often enough the entire line would be one single text element, with the content SO CompanyOrder Number:0123456789, and all positioning and spacing done via offsets and indices only. I can get the bounds and text of the entire line, but I need the bounds (and value) of each character/glyph, so I can combine them into "words" (= character sequences, separated by whitespace or large offsets).
I know this is definitely possible in other libraries. But this question is specific to GemBox. It seems to me all the necessary implementation is already there, just not much of it is exposed in the API.
In itextsharp I can get the bounds for each single glyph, like this:
// itextsharp 5.2.1.0
public class GlyphExtractionStrategy : LocationTextExtractionStrategy
{
public override void RenderText(TextRenderInfo renderInfo)
{
var segment = renderInfo.GetBaseline();
var chunk = new TextChunk(
renderInfo.GetText(),
segment.GetStartPoint(),
segment.GetEndPoint(),
renderInfo.GetSingleSpaceWidth(),
renderInfo.GetAscentLine(),
renderInfo.GetDescentLine()
);
// glyph infos
var glyph = chunk.Text;
var left = chunk.StartLocation[0];
var top = chunk.StartLocation[1];
var right = chunk.EndLocation[0];
var bottom = chunk.EndLocation[1];
}
}
var reader = new PdfReader(bytes);
var strategy = new GlyphExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, pageNumber: 1, strategy);
reader.Close();
Is this possible in GemBox? If so, that would be helpful, because we already have the code to combine the glyphs into "words".
Currently, I can somewhat work around this using regex, but this is not always possible and also way too technical for end users to configure.
Try using this latest NuGet package; we added the PdfTextContent.GetGlyphOffsets method:
Install-Package GemBox.Pdf -Version 17.0.1128-hotfix
Here is how you can use it:
using (var document = PdfDocument.Load("input.pdf"))
{
var page = document.Pages[0];
var enumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
while (enumerator.MoveNext())
{
if (enumerator.Current.ElementType != PdfContentElementType.Text)
continue;
var textElement = (PdfTextContent)enumerator.Current;
var text = textElement.ToString();
int index = text.IndexOf("Number:");
if (index < 0)
continue;
index += "Number:".Length;
for (int i = index; i < text.Length; i++)
{
if (text[i] == ' ')
index++;
else
break;
}
var bounds = textElement.Bounds;
enumerator.Transform.Transform(ref bounds);
string orderNumber = text.Substring(index);
double position = bounds.Left + textElement.GetGlyphOffsets().Skip(index - 1).First();
// TODO ...
}
}
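As a follow-up, here is a rough, library-agnostic sketch of combining per-glyph boxes into "words" separated by whitespace or large horizontal gaps. The GlyphBox type, the reading-order assumption and the gap threshold are all illustrative, not part of the GemBox or iTextSharp APIs; in practice the threshold would be derived from something like the font's single-space width.

class GlyphBox { public string Text; public double Left; public double Right; }

static List<string> GroupIntoWords(IEnumerable<GlyphBox> glyphs, double gapThreshold)
{
    var words = new List<string>();
    var current = new StringBuilder();
    double? lastRight = null;

    foreach (var g in glyphs) // glyphs are assumed to be in reading order
    {
        bool isWhitespace = string.IsNullOrWhiteSpace(g.Text);
        bool largeGap = lastRight.HasValue && g.Left - lastRight.Value > gapThreshold;

        // a whitespace glyph or a large jump in X closes the current word
        if ((isWhitespace || largeGap) && current.Length > 0)
        {
            words.Add(current.ToString());
            current.Clear();
        }
        if (!isWhitespace)
            current.Append(g.Text);

        lastRight = g.Right;
    }
    if (current.Length > 0)
        words.Add(current.ToString());
    return words;
}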
How to sort a large CSV file with 10 columns?
The sorting should be based on data type, for example string, Date, integer, etc.
Assume we need to sort based on the 5th column (the Period column).
As it is a large CSV file, we have to do this without loading it into memory.
I tried using logparser, but beyond a certain size it throws an error saying
"log parser tool has stopped working"
So please suggest any algorithm I can implement in C#, or any other component or code that can help me.
Thanks in advance
Be aware that running a program without using memory is hard, especially if you have an algorithm that by its nature requires memory allocation.
I've looked at the External sort method mentioned by Jim Menschel and this is my implementation.
I didn't implement sorting on the fifth field but left some hints in the code so you can add that yourself.
This code reads a file line by line and, in a temporary directory, creates a new file for each line. Then we open two of those files and create a new target file. After reading a line from each of the two open files, we can compare them (or their fields). Based on the comparison we write the smaller one to the target file and read the next line from the file it came from.
Although this doesn't keep many strings in memory, it is hard on the disk drive. I checked the NTFS limits and 50,000,000 files is within the specs.
Here are the main methods of the class:
Main entry point
This takes the file to be sorted:
public void Sort(string file)
{
Directory.CreateDirectory(sortdir);
Split(file);
var sortedFile = SortAndCombine();
// if you feel confident, you can overwrite the original file
File.Move(sortedFile, file + ".sorted");
Directory.Delete(sortdir);
}
Split file
Split the file into a new file for each line.
Yes, that will be a lot of files, but it guarantees the least amount of memory used. It is easy to optimize, though: read a couple of lines, sort those and write them to a single file; a sketch of that variant follows the method below.
void Split(string file)
{
using (var sr = new StreamReader(file, Encoding.UTF8))
{
var line = sr.ReadLine();
while (!String.IsNullOrEmpty(line))
{
// whatever you do, make sure the file you write
// is ordered; writing a single line is the easiest way
using (var sw = new StreamWriter(CreateUniqueFilename()))
{
sw.WriteLine(line);
}
line = sr.ReadLine();
}
}
}
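As mentioned above, the split step is easy to optimize by batching. A possible variant (a sketch, not part of the original implementation; the batch size is arbitrary) that sorts small batches in memory and writes one file per batch:

void SplitInBatches(string file, int batchSize = 10000)
{
    using (var sr = new StreamReader(file, Encoding.UTF8))
    {
        var batch = new List<string>(batchSize);
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                WriteSortedBatch(batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0)
            WriteSortedBatch(batch);
    }
}

void WriteSortedBatch(List<string> batch)
{
    // sort on the whole line here; adapt this to compare the 5th field instead
    batch.Sort(StringComparer.Ordinal);
    using (var sw = new StreamWriter(CreateUniqueFilename()))
    {
        foreach (var l in batch)
            sw.WriteLine(l);
    }
}

This drastically reduces the number of temporary files at a modest memory cost.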
Combine the files
Iterate over the files, take one and the next one, and merge those two files.
string SortAndCombine()
{
long processed; // keep track of how much we processed
do
{
// iterate the folder
var files = Directory.EnumerateFiles(sortdir).GetEnumerator();
bool hasnext = files.MoveNext();
processed = 0;
while (hasnext)
{
processed++;
// have one
string fileOne = files.Current;
hasnext = files.MoveNext();
if (hasnext)
{
// we have number two
string fileTwo = files.Current;
// do the work
MergeSort(fileOne, fileTwo);
hasnext = files.MoveNext();
}
}
} while (processed > 1);
var lastfile = Directory.EnumerateFiles(sortdir).GetEnumerator();
lastfile.MoveNext();
return lastfile.Current; // the one remaining file is the fully sorted result
}
Merge and Sort
Open two files and create one target file. Read a line from each of them and write the smaller of the two to the target file.
Keep doing that until both lines are null.
void MergeSort(string fileOne, string fileTwo)
{
string result = CreateUniqueFilename();
using(var srOne = new StreamReader(fileOne, Encoding.UTF8))
{
using(var srTwo = new StreamReader(fileTwo, Encoding.UTF8))
{
// I left the actual field parsing as an exercise for the reader
string lineOne, lineTwo; // fieldOne, fieldTwo;
using(var target = new StreamWriter(result))
{
lineOne = srOne.ReadLine();
lineTwo = srTwo.ReadLine();
// naive field parsing
// fieldOne = lineOne.Split(';')[4];
// fieldTwo = lineTwo.Split(';')[4];
while(
!String.IsNullOrEmpty(lineOne) ||
!String.IsNullOrEmpty(lineTwo))
{
// use your parsed fieldValues here
if (lineOne != null && (lineOne.CompareTo(lineTwo) < 0 || lineTwo==null))
{
target.WriteLine(lineOne);
lineOne = srOne.ReadLine();
// fieldOne = lineOne.Split(';')[4];
}
else
{
if (lineTwo!=null)
{
target.WriteLine(lineTwo);
lineTwo = srTwo.ReadLine();
// fieldTwo = lineTwo.Split(';')[4];
}
}
}
}
}
}
// all is processed, remove the input files.
File.Delete(fileOne);
File.Delete(fileTwo);
}
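If you want to sort on the 5th field rather than on the whole line, a comparison helper along these lines could replace the lineOne.CompareTo(lineTwo) call above (a sketch; the ';' separator matches the commented hints, and plain string comparison is an assumption):

static int CompareByFifthField(string lineOne, string lineTwo)
{
    // treat a missing (null) line as "greater" so the remaining file drains last
    if (lineOne == null) return lineTwo == null ? 0 : 1;
    if (lineTwo == null) return -1;

    string fieldOne = lineOne.Split(';')[4];
    string fieldTwo = lineTwo.Split(';')[4];

    // if the Period column holds dates, compare as dates instead:
    // return DateTime.Parse(fieldOne).CompareTo(DateTime.Parse(fieldTwo));
    return string.Compare(fieldOne, fieldTwo, StringComparison.Ordinal);
}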
Helper variable and method
There is one shared member for the temporary directory and a method for generating temporary unique filenames.
private string sortdir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
string CreateUniqueFilename()
{
return Path.Combine(sortdir, Guid.NewGuid().ToString("N"));
}
Memory analysis
I've created a small file with 5000 lines in it with the following code:
using(var sw= new StreamWriter("c:\\temp\\test1.txt"))
{
for(int line=0; line<5000; line++)
{
sw.WriteLine(Guid.NewGuid().ToString());
}
}
I then ran the sorting code with the memory profiler on my box with Windows 10, 4 GB RAM and a spinning disk.
The object lifetime shows as expected a lot of String, char[] and byte[] allocations, but none of those have survived a Gen 0 collection, which means they are all short lived and I don't expect this to be a problem if the number of lines to sort increases.
This is the simplest solution that works for me. From here, easy alterations and improvements are possible, leading to even less memory consumption, fewer allocations or higher speed. Make sure to measure, select the area where you can make the biggest impact and compare successive results. That should give you the optimum between memory usage and performance.
Instead of reading the CSV completely, you can simply index it:
Read the unsorted CSV line by line, remembering the 5th element (column) value and something that identifies the line later: its line number, or its offset from the beginning of the file and its size.
You will have some kind of List<Tuple<string, ...>>. Sort that:
var sortedList = unsortedList.OrderBy(item => item.Item1);
Now you can create the sorted CSV by enumerating the sorted list, reading the corresponding line from the source file, and appending it to the new CSV:
using (var sortedCSV = File.AppendText(newCSVFileName))
foreach(var item in sortedList)
{
... // read line from unsorted csv using item.Item2, etc.
sortedCSV.WriteLine(...);
}
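For completeness, here is one way this indexing idea could be fleshed out. It is a sketch with several assumptions: UTF-8 encoding with no byte-order mark, ';' as separator, line endings matching Environment.NewLine, and a plain string comparison on the key.

static void SortCsvByFifthColumn(string inputPath, string outputPath)
{
    // 1. index: remember the 5th column value and the byte offset of each line
    var index = new List<Tuple<string, long>>();
    long offset = 0;
    foreach (var line in File.ReadLines(inputPath))
    {
        index.Add(Tuple.Create(line.Split(';')[4], offset));
        // assumes no BOM and platform line endings; adjust if needed
        offset += Encoding.UTF8.GetByteCount(line) + Environment.NewLine.Length;
    }

    // 2. sort only the (small) index
    var sorted = index.OrderBy(item => item.Item1);

    // 3. write the lines out in sorted order by seeking to each remembered offset
    using (var fs = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
    using (var target = new StreamWriter(outputPath))
    {
        foreach (var item in sorted)
        {
            fs.Seek(item.Item2, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8, false, 1024, leaveOpen: true))
            {
                target.WriteLine(reader.ReadLine());
            }
        }
    }
}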
I am working on a parser that is intended to read data in a fixed-width format (8 char x 10 col). However, sometimes this isn't the case, and there is sometimes valid data in areas that do not meet this layout. It is not safe to assume that there is an escape character (such as the + in the figure below), as that is one of several formats.
I had attempted using TextFieldParser.FixedWidth and giving it an 8x10 input, but anything that does not meet this quantity is sent to the ErrorLine instead.
It doesn't seem like it would be good practice to parse from my exception catching block, is it?
Since only discrepant lines require additional work, is a brute-force submethod the best approach? All of my data always comes in 8-char blocks. The final block in a line can be tricky in that it may be shorter if it was manually entered. (Predicated on #1 being OK to do.)
Is there a better tool to be using? I feel like I'm trying to fit a square peg into a round hole with a fixed-width TextFieldParser.
Note: Delimited parsing is not an option, see the 2nd figure.
Edit for clarification: the text below is a pair of excerpts from input decks for NASTRAN, a finite element code. I am aiming to have a generalized parsing method that reads the text files in and then hands off the split-up string[]s to other methods that actually process each card into a specific mapped object (e.g. in the image below, the two object types are RBE3 and SET1).
Extracted Method:
public static IEnumerable<string[]> ParseFixed(string fileName, int width, int colCount)
{
var fieldArrayList = new List<string[]>();
using (var tfp = new TextFieldParser(fileName))
{
tfp.TextFieldType = FieldType.FixedWidth;
var fieldWidths = new int[colCount];
for (int i = 0; i < fieldWidths.Length; i++)
{
fieldWidths[i] = width;
}
tfp.CommentTokens = new string[] { "$" };
tfp.FieldWidths = fieldWidths;
tfp.TrimWhiteSpace = true;
while (!tfp.EndOfData)
{
try
{
fieldArrayList.Add(tfp.ReadFields());
}
catch (Microsoft.VisualBasic.FileIO.MalformedLineException ex)
{
Debug.WriteLine(ex.ToString());
// parse atypical lines here...?
continue;
}
}
}
return fieldArrayList;
}
When outputting to the console, you can set the specific location of the cursor and write there (or use other nifty tricks like printing backspaces that take you back).
Is there a similar thing that can be done with a stream of text?
Scenario: I need to build a string with n pieces of text, where each might be on a different line and start position (or top and left padding).
Two strings might appear on the same line.
I could build a simple Dictionary<int, StringBuilder> and fidget with that, but I'm wondering if there's something like the console functionality for streams of text, where you can write to a specific place (row and column).
Edit:
This is for text only, no console/cursor control.
The result might be a string with several new lines, and text appearing at different locations.
Example (where . will be white spaces):
..... txt3....... txt2
......................
................ txt1.
This will be the result of having txt1 at row 3, column (whatever), and txt2 and txt3 at row 1 with different column values (where txt3's column < txt2's column).
While waiting for a better answer, here's my solution. It seems to work, has been lightly tested, and can simply be pasted into LINQPad and run.
void Main()
{
m_dict = new SortedDictionary<int, StringBuilder>();
AddTextAt(1,40, "first");
AddTextAt(2,40, "xx");
AddTextAt(0,10, "second");
AddTextAt(4,5, "third");
AddTextAt(1,15, "four");
GetStringFromDictionary().Dump();
}
// "global" variable
SortedDictionary<int, StringBuilder> m_dict;
/// <summary>
/// This will emulate writing to the console, where you can set the row/column and put your text there.
/// It's done by having a Dictionary(int, StringBuilder) that we use to store our data and, eventually,
/// when we need the string, iterating over it to build the final representation.
/// </summary>
private void AddTextAt(int row, int column, string text)
{
StringBuilder sb;
// NB: The following will initialize the string builder !!
// Dictionary doesn't have an entry for this row, add it and all the ones before it
if (!m_dict.TryGetValue(row, out sb))
{
int start = m_dict.Keys.Any() ? m_dict.Keys.Last() +1 : 0;
for (int i = start ; i <= row; i++)
{
m_dict.Add(i, null);
}
}
int leftPad = column + text.Length;
// If dictionary doesn't have a value for this row, just create a StringBuilder with as many
// columns as left padding, and then the text
if (sb == null)
{
sb = new StringBuilder(text.PadLeft(leftPad));
m_dict[row] = sb;
}
// If it does have a value:
else
{
// If the new string is to be to the "right" of the current text, append with proper padding
// (column - current string builder length) and the text
int currrentSbLength = sb.ToString().Length;
if (column >= currrentSbLength)
{
leftPad = column - currrentSbLength + text.Length;
sb.Append(text.PadLeft(leftPad));
}
// otherwise, text goes on the "left", create a new string builder with padding and text, and
// append the older one at the end (with proper padding?)
else
{
m_dict[row] = new StringBuilder( text.PadLeft(leftPad)
+ sb.ToString().Substring(leftPad) );
}
}
}
/// <summary>
/// Concatenates all the strings from the private dictionary, to get a representation of the final string.
/// </summary>
private string GetStringFromDictionary()
{
var sb = new StringBuilder();
foreach (var k in m_dict.Keys)
{
if (m_dict[k]!=null)
sb.AppendLine(m_dict[k].ToString());
else
sb.AppendLine();
}
return sb.ToString();
}
Output:
second
four first
xx
third
No. Text files don't really have a concept of horizontal/vertical position, so you'd need to build some sort of positioning yourself.
For basic positioning, tabs ("\t") may be enough; for anything more advanced you'd need to fill the empty space with spaces.
It sounds like you have some sort of table layout - it may be easier to build the data in cells first (List<List<string>> - a list of rows consisting of columns of strings) and then format it with either String.Format("{0}\t{1}\t...", table[row][0], table[row][1], ...) or by manually adding the necessary amount of spaces for each "cell"; a brief sketch of that follows.
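A minimal sketch of that table idea, using the example from the question (the fixed column width of 10 is an assumption):

// collect the cells first, then pad each one to a fixed width when rendering
var table = new List<List<string>>
{
    new List<string> { "", "txt3", "txt2" },
    new List<string> { "", "", "" },
    new List<string> { "", "", "txt1" },
};

var sb = new StringBuilder();
foreach (var row in table)
{
    foreach (var cell in row)
        sb.Append(cell.PadRight(10)); // assumed fixed column width of 10
    sb.AppendLine();
}
Console.WriteLine(sb.ToString());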
I need a fast method to work with a big text file.
I have 2 files:
a big text file (~20 GB)
and another text file that contains a list of ~12 million combo words.
I want to find all combo words in the first text file and replace each with another combo word (the same combo word with an underscore),
for example "Computer Information" >Replace With> "Computer_Information".
I use this code, but the performance is very poor (I tested on an HP G7 server with 16 GB RAM and 16 cores).
public partial class Form1 : Form
{
HashSet<string> wordlist = new HashSet<string>();
private void loadComboWords()
{
using (StreamReader ff = new StreamReader(txtComboWords.Text))
{
string line;
while ((line = ff.ReadLine()) != null)
{
wordlist.Add(line);
}
}
}
private void replacewords(ref string str)
{
foreach (string wd in wordlist)
{
// ReplaceEx(ref str, wd, wd.Replace(" ", "_"));
if (str.IndexOf(wd) > -1)
str = str.Replace(wd, wd.Replace(" ", "_")); // assign the result: strings are immutable
}
}
private void button3_Click(object sender, EventArgs e)
{
string line;
using (StreamReader fread = new StreamReader(txtFirstFile.Text))
{
string writefile = Path.Combine(Path.GetDirectoryName(txtFirstFile.Text), Path.GetFileNameWithoutExtension(txtFirstFile.Text) + "_ReplaceComboWords.txt");
StreamWriter sw = new StreamWriter(writefile);
long intPercent;
label3.Text = "initialing";
loadComboWords();
while ((line = fread.ReadLine()) != null)
{
replacewords(ref line);
sw.WriteLine(line);
intPercent = (fread.BaseStream.Position * 100) / fread.BaseStream.Length;
Application.DoEvents();
label3.Text = intPercent.ToString();
}
sw.Close();
fread.Close();
label3.Text = "Finished";
}
}
}
Any idea how to do this job in a reasonable time?
Thanks
At first glance the approach you've taken looks fine - it should work OK, and there's nothing obvious that will cause e.g. lots of garbage collection.
The main thing I think is that you'll only be using one of those sixteen cores: there's nothing in place to share the load across the other fifteen.
I think the easiest way to do this is to split the large 20 GB file into sixteen chunks, then analyse the chunks in parallel, then merge the chunks back together again. The extra time taken splitting and reassembling the file should be minimal compared to the roughly sixteen-fold gain from scanning those sixteen chunks in parallel.
In outline, one way to do this might be:
private List<string> SplitFileIntoChunks(string baseFile)
{
// Split the file into chunks, and return a list of the filenames.
}
private void AnalyseChunk(string filename)
{
// Analyses the file and performs replacements,
// perhaps writing to the same filename with a different
// file extension
}
private void CreateOutputFileFromChunks(string outputFile, List<string> splitFileNames)
{
// Combines the rewritten chunks created by AnalyseChunk back into
// one large file, outputFile.
}
public void AnalyseFile(string inputFile, string outputFile)
{
List<string> splitFileNames = SplitFileIntoChunks(inputFile);
var tasks = new List<Task>();
foreach (string chunkName in splitFileNames)
{
var task = Task.Factory.StartNew(() => AnalyseChunk(chunkName));
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
CreateOutputFileFromChunks(outputFile, splitFileNames);
}
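For reference, SplitFileIntoChunks could be sketched like this. It is my own rough version, splitting on line boundaries so that no combo word is cut in half; the chunk naming, the extra chunkCount parameter and the size calculation are assumptions:

private List<string> SplitFileIntoChunks(string baseFile, int chunkCount = 16)
{
    var chunkNames = new List<string>();
    // approximate target size per chunk, measured in characters
    long targetSize = new FileInfo(baseFile).Length / chunkCount;

    using (var reader = new StreamReader(baseFile))
    {
        for (int i = 0; i < chunkCount && !reader.EndOfStream; i++)
        {
            string chunkName = baseFile + ".chunk" + i;
            chunkNames.Add(chunkName);

            long written = 0;
            using (var writer = new StreamWriter(chunkName))
            {
                string line;
                // write whole lines until this chunk reaches its share;
                // the last chunk takes whatever remains
                while ((written < targetSize || i == chunkCount - 1)
                       && (line = reader.ReadLine()) != null)
                {
                    writer.WriteLine(line);
                    written += line.Length + Environment.NewLine.Length;
                }
            }
        }
    }
    return chunkNames;
}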
One tiny nit: move the calculation of the length of the stream out of the loop; you only need to get it once.
EDIT: also, include @Pavel Gatilov's idea to invert the logic of the inner loop and search for each word of the line in the 12-million-entry list.
Several ideas:
I think it will be more efficient to split each line into words and check whether each of those words appears in your word list. 10 lookups in a hashset are better than millions of substring searches. If you have composite keywords, make appropriate indexes: one that contains all the single words that occur in the real keywords, and another that contains all the real keywords.
Perhaps loading the strings into a StringBuilder is better for replacing.
Update the progress after, say, every 10000 processed lines, not after each one.
Process in background threads. It won't make it much faster, but the app will stay responsive.
Parallelize the code, as Jeremy has suggested.
UPDATE
Here is a sample code that demonstrates the by-word index idea:
static void ReplaceWords()
{
string inputFileName = null;
string outputFileName = null;
// this dictionary maps each single word that can be found
// in any keyphrase to a list of the keyphrases that contain it.
IDictionary<string, IList<string>> singleWordMap = null;
using (var source = new StreamReader(inputFileName))
{
using (var target = new StreamWriter(outputFileName))
{
string line;
while ((line = source.ReadLine()) != null)
{
// first, we split each line into a single word - a unit of search
var singleWords = SplitIntoWords(line);
var result = new StringBuilder(line);
// for each single word in the line
foreach (var singleWord in singleWords)
{
// check if the word exists in any keyphrase we should replace
// and if so, get the list of the related original keyphrases
IList<string> interestingKeyPhrases;
if (!singleWordMap.TryGetValue(singleWord, out interestingKeyPhrases))
continue;
Debug.Assert(interestingKeyPhrases != null && interestingKeyPhrases.Count > 0);
// then process each of the keyphrases
foreach (var interestingKeyphrase in interestingKeyPhrases)
{
// and replace it in the processed line if it exists
result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase));
}
}
// now, save the processed line
target.WriteLine(result);
}
}
}
}
private static string GetTargetValue(string interestingKeyword)
{
throw new NotImplementedException();
}
static IEnumerable<string> SplitIntoWords(string keyphrase)
{
throw new NotImplementedException();
}
The code shows the basic ideas:
We split both keyphrases and processed lines into equivalent units which may be efficiently compared: the words.
We store a dictionary that for any word quickly gives us references to all keyphrases that contain the word.
Then we apply your original logic. However, we do not do it for all 12 mln keyphrases, but rather for a very small subset of keyphrases that have at least a single-word intersection with the processed line.
I'll leave the rest of the implementation to you.
The code however has several issues:
SplitIntoWords must actually normalize the words to some canonical form. How exactly depends on the required logic. In the simplest case you'll probably be fine with whitespace splitting and lowercasing (a minimal example implementation is sketched at the end of this answer). But it may happen that you'll need morphological matching - that would be harder (it's very close to full-text search tasks).
For the sake of speed, it's likely better if the GetTargetValue method is called once for each keyphrase before processing the input.
If a lot of your keyphrases have coinciding words, you'll still have a significant amount of extra work. In that case you'll need to keep the positions of the keywords within the keyphrases in order to use word-distance calculations to exclude irrelevant keyphrases while processing an input line.
Also, I'm not sure if StringBuilder is actually faster in this particular case. You should experiment with both StringBuilder and string to find out the truth.
It's a sample after all. The design is not very good. I'd consider extracting some classes with consistent interfaces (e.g. KeywordsIndex).
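For completeness, minimal implementations of the two stubs, plus a builder for singleWordMap, could look like this. The assumptions follow the discussion above: whitespace splitting plus lowercasing is enough for normalization, and the target value is simply the keyphrase with spaces turned into underscores, as in the question.

private static string GetTargetValue(string interestingKeyword)
{
    // per the question: "Computer Information" -> "Computer_Information"
    return interestingKeyword.Replace(' ', '_');
}

static IEnumerable<string> SplitIntoWords(string keyphrase)
{
    // simplest normalization: lowercase and split on whitespace
    return keyphrase.ToLowerInvariant()
                    .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
}

static IDictionary<string, IList<string>> BuildSingleWordMap(string comboWordsFile)
{
    // maps each single word to the list of keyphrases that contain it
    var map = new Dictionary<string, IList<string>>();
    foreach (var keyphrase in File.ReadLines(comboWordsFile))
    {
        foreach (var word in SplitIntoWords(keyphrase))
        {
            IList<string> keyphrases;
            if (!map.TryGetValue(word, out keyphrases))
                map[word] = keyphrases = new List<string>();
            keyphrases.Add(keyphrase);
        }
    }
    return map;
}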