I use Microsoft.Office.Interop.Word to create a new doc based on another doc. There are multiple iterations with search/replace operations using Range.Text, and they all work pretty fast. However, in one case I need to copy an entire chapter with all its formatting, so instead of Range.Text (which resets all formatting) I used Range.Copy and Range.Paste. They work, but for a test chapter of about 450 words they take up to 40 seconds (vs. less than 1 second when I change the same code to use Range.Text).
Question: is there any way to make Range.Copy/Range.Paste faster? All I need is to find a particular piece of text and copy it with all tables, formatting, etc. to another file.
If you want to copy text with formatting in Word, you can use the FormattedText property of Range, like:
targetRange.FormattedText = sourceRange.FormattedText;
Avoid using Range.Copy() and Range.Paste(): this approach goes through the clipboard internally, which may cause security problems or produce unpredictable results in some cases.
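For example, a minimal sketch of this (sourceDoc/targetDoc and the chapter offsets are placeholders; you would locate the chapter with Range.Find, as in your other search operations):

using Word = Microsoft.Office.Interop.Word;

// sourceDoc and targetDoc are assumed to be open Word.Document instances;
// chapterStart/chapterEnd are hypothetical offsets found earlier via Range.Find.
Word.Range sourceRange = sourceDoc.Range(chapterStart, chapterEnd);
Word.Range targetRange = targetDoc.Content;
targetRange.Collapse(Word.WdCollapseDirection.wdCollapseEnd);

// Transfers text, tables, and formatting in one step, without the clipboard.
targetRange.FormattedText = sourceRange.FormattedText;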
I am working on an NLP-based application that uses a global keyboard hook to read key presses. Here is its working interface:
BIEngine.Hook.KeyboardListener KListener = new BIEngine.Hook.KeyboardListener();

private void Application_Startup(object sender, StartupEventArgs e)
{
    KListener.KeyDown += new BIEngine.Hook.RawKeyEventHandler(KListener_KeyDown);
}

void KListener_KeyDown(object sender, BIEngine.Hook.RawKeyEventArgs args)
{
    Trace.WriteLine(args.ToString());
}
Right now I am getting the user's words as a stream of letters delimited by spaces, carriage returns, tabs, periods, etc. So if the user types got today in his software window, I would be getting:
g
o
t
t..
So what would be the most efficient way (as this application will be running constantly in the background) to concatenate these letters into words, dropping the spaces and other delimiters, and to react to a certain set of words? For example, if the user types today, it would be passed to the NLP library and the user would be presented with some sort of feedback.
Thanks for any suggestions, code, etc.
I strongly recommend that you use the simplest approach that does what you want, and stop worrying about performance. Premature optimization, as it's known, can cost lots of time with very little benefit.
If you never let the string get particularly long (like, ~2000 characters) then I suggest you simply append to a normal string, trimming it whenever it grows longer than, say, 100 chars. I highly doubt you will be able to observe any performance impact from this. Only if you ever run into measurable performance problems (say, you notice the program taking more than 0.1% CPU time while the user is typing) should you consider optimizing this. And I bet you'll find that it's not your string concatenation that is using the CPU, but something else altogether.
Why? Because if you try to optimize everything before it is a problem, you will never get much actual work done. Most of the time optimization is unnecessary.
Having said all this, the most efficient way to match a string character by character would be to use a finite state machine, but I feel that explaining how to go about that is outside the scope of this question.
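That said, a minimal sketch of the simple approach could look like the following. It assumes args.ToString() yields the typed character (as in your Trace output) and that OnWordCompleted is a hypothetical hook into your NLP library:

// These members live in the same class as Application_Startup above;
// needs System and System.Text.
private readonly StringBuilder _buffer = new StringBuilder();
private static readonly char[] Delimiters = { ' ', '\r', '\n', '\t', '.', ',' };

void KListener_KeyDown(object sender, BIEngine.Hook.RawKeyEventArgs args)
{
    string key = args.ToString();
    if (key.Length != 1)
        return; // ignore non-character keys in this sketch

    if (Array.IndexOf(Delimiters, key[0]) >= 0)
    {
        if (_buffer.Length > 0)
        {
            OnWordCompleted(_buffer.ToString()); // e.g. hand "today" to the NLP library
            _buffer.Length = 0;
        }
    }
    else
    {
        _buffer.Append(key[0]);
        if (_buffer.Length > 100)
            _buffer.Length = 0; // trim, per the advice above
    }
}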
Looking at your post, my first thought was to use Trace.Write instead of Trace.WriteLine, but I don't know what implications that might have for your actual configuration.
That would keep it on the "same line", but to what end?
You could also insert a code block into your app to perform the visual or logical transformations, then display or process the result.
This way, you don't add any additional workload to your app's startup procedure.
I'm looking for some suggestions on better approaches to a file-reading scenario in C#; the specific scenario is something most people wouldn't be familiar with unless they're involved in health care, so I'm going to give a quick explanation first.
I work for a health plan, and we receive claims from doctors in several ways (EDI, paper, etc.). The paper form for standard medical claims is the "HCFA" or "CMS 1500" form. Some of our contracted doctors use software that allows their claims to be generated and saved in a HCFA "layout", but in a text file (so, you could think of it like being the paper form, but without the background/boxes/etc). I've attached an image of a dummy claim file that shows what this would look like.
The claim information is currently extracted from the text files and converted to XML. The whole process works ok, but I'd like to make it better and easier to maintain. There is one major challenge that applies to the scenario: each doctor's office may submit these text files to us in slightly different layouts. Meaning, Doctor A might have the patient's name on line 10, starting at character 3, while Doctor B might send a file where the name starts on line 11 at character 4, and so on. Yes, what we should be doing is enforcing a standard layout that must be adhered to by any doctors that wish to submit in this manner. However, management said that we (the developers) had to handle the different possibilities ourselves and that we may not ask them to do anything special, as they want to maintain good relationships.
Currently, there is a "mapping table" set up with one row for each different doctor's office. The table has columns for each field (e.g. patient name, Member ID number, date of birth etc). Each of these gets a value based on the first file that we received from the doctor (we manually set up the map). So, the column PATIENT_NAME might be defined in the mapping table as "10,3,25" meaning that the name starts on line 10, at character 3, and can be up to 25 characters long. This has been a painful process, both in terms of (a) creating the map for each doctor - it is tedious, and (b) maintainability, as they sometimes suddenly change their layout and then we have to remap the whole thing for that doctor.
The file is read in, line by line, and each line added to a List<string>. Once this is done, we do the following, where we get the map data and read through the list of file lines to get the field values (recall that each mapped field is a value like "10,3,25", without the quotes):
ClaimMap M = ClaimMap.GetMapForDoctor(17);
List<HCFA_Claim> ClaimSet = new List<HCFA_Claim>();

// Claims is a List<List<string>>: one List<string> per claim in the text file
// (a file can contain more than one claim and is split into separate claims
// earlier in the process).
foreach (List<string> cl in Claims)
{
    HCFA_Claim c = new HCFA_Claim();
    c.Patient = new Patient();
    c.Patient.FullName = cl[Int32.Parse(M.Name.Split(',')[0]) - 1]
        .Substring(Int32.Parse(M.Name.Split(',')[1]) - 1, Int32.Parse(M.Name.Split(',')[2]))
        .Trim();
    //...and so on...
    ClaimSet.Add(c);
}
Sorry this is so long...but I felt that some background/explanation was necessary. Are there any better/more creative ways of doing something like this?
Given the lack of standardization, I think your current solution, although not ideal, may be the best you can do. In this situation, I would at least isolate concerns (file reading, file parsing, conversion to standard XML, mapping-table access, etc.) into simple components, employing obvious patterns (DI, strategies, factories, repositories) where needed to decouple the system from its underlying dependency on the mapping table and the current parsing algorithms.
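For illustration only, the seams might look something like this (all names are hypothetical):

// Hypothetical seams: each concern sits behind its own interface, so the
// mapping table and the parsing algorithm can change independently.
// Needs System.Collections.Generic.
public interface IClaimFileReader
{
    List<List<string>> ReadClaims(string path); // one List<string> per claim
}

public interface IClaimMapRepository
{
    ClaimMap GetMapForDoctor(int doctorId);
}

public interface IClaimParser
{
    HCFA_Claim Parse(List<string> claimLines, ClaimMap map);
}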
You need to work on the DRY (Don't Repeat Yourself) principle by separating concerns.
For example, the code you posted appears to have an explicit knowledge of:
how to parse the claim map, and
how to use the claim map to parse a list of claims.
So there are at least two responsibilities directly relegated to this one method. I'd recommend changing your ClaimMap class to be more representative of what it's actually supposed to represent:
public class ClaimMap
{
    public ClaimMapField Name { get; set; }
    ...
}

public class ClaimMapField
{
    // I would have the parser subtract one when creating these, to make them 0-based.
    public int StartingLine { get; set; }
    public int StartingCharacter { get; set; }
    public int MaxLength { get; set; }
}
Note that the ClaimMapField represents in code what you spent considerable time explaining in English. This reduces the need for lengthy documentation. Now all the M.Name.Split calls can actually be consolidated into a single method that knows how to create ClaimMapFields out of the original text file. If you ever need to change the way your ClaimMaps are represented in the text file, you only have to change one point in code.
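For example, such a method might look like this (a sketch; the "10,3,25" format comes straight from your question):

// A possible home for this would be a static factory on ClaimMapField.
public static ClaimMapField Parse(string raw) // raw is e.g. "10,3,25"
{
    string[] parts = raw.Split(',');
    return new ClaimMapField
    {
        StartingLine = int.Parse(parts[0]) - 1,      // make 0-based
        StartingCharacter = int.Parse(parts[1]) - 1, // make 0-based
        MaxLength = int.Parse(parts[2])
    };
}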
Now your code could look more like this:
c.Patient.FullName = cl[map.Name.StartingLine].Substring(map.Name.StartingCharacter, map.Name.MaxLength).Trim();
c.Patient.Address = cl[map.Address.StartingLine].Substring(map.Address.StartingCharacter, map.Address.MaxLength).Trim();
...
But wait, there's more! Any time you see repetition in your code, that's a code smell. Why not extract out a method here:
public string ParseMapField(ClaimMapField field, List<string> claim)
{
    return claim[field.StartingLine].Substring(field.StartingCharacter, field.MaxLength).Trim();
}
Now your code can look more like this:
HCFA_Claim c = new HCFA_Claim
{
    Patient = new Patient
    {
        FullName = ParseMapField(map.Name, cl),
        Address = ParseMapField(map.Address, cl),
    }
};
By breaking the code up into smaller logical pieces, you can see how each piece becomes very easy to understand and validate visually. You greatly reduce the risk of copy/paste errors, and when there is a bug or a new requirement, you typically only have to change one place in code instead of every line.
If you are only getting unstructured text, you have to parse it. If the text content changes, you have to fix your parser. There's no way around this. You could probably find a third-party application to do some kind of visual parsing, where you highlight the string of text you want and it does all the substring'ing for you, but still: unstructured text == parsing == fragile. A visual parser would at least make it easier to spot mistakes/changed layouts and fix them.
As for parsing it yourself, I'm not sure about the line-by-line approach. What if something you're looking for spans multiple lines? You could read the whole thing into a single string and use IndexOf to substring it, with different indices for each piece of data you're looking for.
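A rough sketch of that idea, assuming the layout contains some fixed anchor text (the label here is purely hypothetical):

// content holds the entire file; needs System and System.IO.
string content = File.ReadAllText(path);

// Find a (hypothetical) fixed label and take up to 25 characters after it.
int anchor = content.IndexOf("PATIENT NAME", StringComparison.Ordinal);
if (anchor >= 0)
{
    string name = content.Substring(anchor + "PATIENT NAME".Length, 25).Trim();
}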
You could always use RegEx instead of Substring if you know how to do that.
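For example, to pull a date of birth by pattern rather than by position (a sketch; the date format is an assumption):

// Needs System.Text.RegularExpressions. Matches the first MM/DD/YYYY-style
// date anywhere in the claim text (the format is assumed).
Match dob = Regex.Match(content, @"\b\d{2}/\d{2}/\d{4}\b");
if (dob.Success)
{
    string dateOfBirth = dob.Value;
}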
While the basic approach you're taking seems appropriate for your situation, there are definitely ways you could clean up the code to make it easier to read and maintain. By separating out the functionality you're currently doing all within your main loop, you could change this:
c.Patient.FullName = cl[Int32.Parse(M.Name.Split(',')[0]) - 1]
    .Substring(Int32.Parse(M.Name.Split(',')[1]) - 1, Int32.Parse(M.Name.Split(',')[2]))
    .Trim();
to something like this:
var parser = new FormParser(cl, M);
c.PatientFullName = parser.GetName();
c.PatientAddress = parser.GetAddress();
// etc.
So, in your new class, FormParser, you pass the List<string> that represents your form, along with the claim map for the provider, into the constructor. You then have a getter for each property on the form. Inside each getter, you perform your parsing/substring logic as you do now. Like I said, you're not really changing the method by which you're doing it, but it certainly would be easier to read and maintain, and it might reduce your overall stress level.
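A sketch of what FormParser might look like (it reuses the "line,char,length" map format from your question; clamping to the line length is my addition):

using System;
using System.Collections.Generic;

public class FormParser
{
    private readonly List<string> _lines;
    private readonly ClaimMap _map;

    public FormParser(List<string> lines, ClaimMap map)
    {
        _lines = lines;
        _map = map;
    }

    public string GetName() { return GetField(_map.Name); }
    public string GetAddress() { return GetField(_map.Address); }

    // mapValue is e.g. "10,3,25": 1-based line, 1-based start character, max length.
    private string GetField(string mapValue)
    {
        string[] parts = mapValue.Split(',');
        int line = int.Parse(parts[0]) - 1;
        int start = int.Parse(parts[1]) - 1;
        string text = _lines[line];
        // The field can be "up to" N characters, so clamp to the actual line length.
        int length = Math.Min(int.Parse(parts[2]), text.Length - start);
        return text.Substring(start, length).Trim();
    }
}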
So I've been working on cobbling together a game and decided I'd like a little program that shows a file with each character replaced by its byte equivalent, for working with coding saves and whatnot. Figured it'd be a layup. Three hours later, I've been wracking my brain trying to figure this out.
When I load a small (or perhaps short is the better term) file it looks like the window on top. When I load a larger file, it looks like the window on the bottom.
http://dl.dropbox.com/u/16985121/Images/ViewAsBytes.PNG
That's 10pt Courier New, but it seems to happen with any font I try. There's always that extra column, and if there isn't enough room for the column, it just squeezes in whatever it can into space it previously didn't use. I've tried tweaking all kinds of variables, as well as comparing the textbox before and after it adds the text from the file (which is read in as bytes from a FileStream and then fed into a StringBuilder), but nothing seems to change, even though something is clearly different.
I can think of a bunch of different workarounds for this, but now I'm just more interested in what TextBox thinks it's doing exactly than getting my program done. Anyone got any idea?
Here's the code that reads in the data and puts it into the textbox:
FileStream stream = new FileStream(files[0], FileMode.Open);
StringBuilder sb = new StringBuilder();
int byteIn = stream.ReadByte();
while (byteIn != -1)
{
    sb.Append('[');
    if (byteIn < 100)
        sb.Append('0');
    if (byteIn < 10)
        sb.Append('0');
    sb.Append(byteIn.ToString());
    sb.Append(']');
    byteIn = stream.ReadByte();
}
txtView.Text = sb.ToString();
stream.Close();
This is because you set the WordWrap property to True. Set it to False, set Multiline to True, and ScrollBars to Both. Append Environment.NewLine to the string you generate; every 16 bytes per line is the norm for hex viewers. Use byte.ToString("X2") to generate a hex string instead of a decimal string.
You now have a full scrollable view of the data, any amount is supported. Allow the user to resize the window so she won't have to scroll horizontally. Or just make it big enough.
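Putting that together, a sketch of the revised loop (the property assignments could also be done in the designer):

txtView.WordWrap = false;
txtView.Multiline = true;
txtView.ScrollBars = ScrollBars.Both;

StringBuilder sb = new StringBuilder();
int count = 0;
using (FileStream stream = new FileStream(files[0], FileMode.Open))
{
    int byteIn;
    while ((byteIn = stream.ReadByte()) != -1)
    {
        sb.Append(((byte)byteIn).ToString("X2")).Append(' ');
        if (++count % 16 == 0)
            sb.Append(Environment.NewLine); // 16 bytes per line, hex-viewer style
    }
}
txtView.Text = sb.ToString();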
This is going to be a long post; I would like suggestions, if any, on the procedure I am following. I want the best method to print line numbers next to each CRLF-terminated line in a RichTextBox. I am using C# with .NET. I have tried using a ListView, but it is inefficient as the number of lines grows. I have been successful using Graphics in a custom control to paint the line numbers, and so far I am happy with the performance.
But as the number of lines grows to 50K-100K, scrolling is affected badly. I have overridden the WndProc method and handle all the messages so that the line-number painting is called only when required. (Overriding OnContentsResized and OnVScroll makes redundant calls to the painting method.)
Now the line-number painting is fine when the number of lines is small, say up to 10K (which I could live with, as it is rare to edit a file with 10,000 lines), but I want to remove the limitation.
A few observations:
The number of lines displayed in the RichTextBox is constant, +-1. So the performance difference should be due to the large text, not to the Graphics painting.
Painting line numbers for large text is slower compared to small files.
Now the pseudocode:
FIRST_LINE_NUMBER = _textBox.GetFirstVisibleLineNumber();
LAST_LINE_NUMBER = _textBox.GetLastVisibleLineNumber();

for (loop_from_first_to_last_line_number)
{
    Y = _textBox.GetYPositionOfLineNumber(current_line_number);
    graphics_paint_line_number(current_line_number, Y);
}
I am using GetCharIndexFromPosition and looping through RichTextBox.Lines to find the line number in both of the functions that get line numbers. To get the Y position, I am using GetPositionFromCharIndex to get a Point struct.
All the above RichTextBox methods seem to be O(n), which eats up the performance. (Correct me if I am wrong.)
I have decided to use a binary tree to store the line numbers to improve search performance when looking up a line number by character index. I have in mind a data structure that takes O(n) construction time, O(n lg n) worst-case update, and O(lg n) search.
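Concretely, the cache could start as a sorted array of line-start character indices, rebuilt (or patched) whenever the text changes; a balanced tree only pays off if rebuilding proves too slow. A sketch of the idea:

// lineStarts[i] = character index where line i begins; sorted, so binary-searchable.
// Needs System and System.Collections.Generic.
private int[] _lineStarts;

private void RebuildLineIndex(string text)
{
    List<int> starts = new List<int> { 0 };
    for (int i = 0; i < text.Length; i++)
        if (text[i] == '\n')
            starts.Add(i + 1); // the next line begins right after the newline
    _lineStarts = starts.ToArray();
}

private int LineFromCharIndex(int charIndex)
{
    int pos = Array.BinarySearch(_lineStarts, charIndex);
    // Exact hit: charIndex is the first character of that line. Otherwise
    // BinarySearch returns the complement of the next larger start, so back up one.
    return pos >= 0 ? pos : ~pos - 1;
}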
Is this approach worth the effort?
Is there any other approach to solve the problem? If required I am ready to write the control from scratch, I just want it to be light-weight and fast.
Before deciding on the best way forward, we need to make sure we understand the bottleneck.
First of all, it is important to know how the RichTextBox (which I assume you are using, as you mentioned it) handles large files. So I would recommend removing all the line-painting code and seeing how it performs with large text alone. If performance is poor, there is your problem.
The second step would be to add some profiling statements, or just use a profiler (one comes with VS 2010), to find the bottleneck. It might turn out to be the method for finding the line number, or something else entirely.
At this point, I would only suggest more investigation. If you have finished the investigation and have more info, update your question and I will get back to you accordingly.