Advice for extracting word text and handling cellbreak characters

I'm looking for advice (perhaps best practice).
We have an MS Word document (Office 2007) and are extracting text from a table cell.
We can use the following:
string text = wordTable.Cell(tablerow.Index, 1).Range.Text;
The text is extracted; however, we get extra trailing characters, for example \r\a.
Now we could add the following:
.... wordTable.Cell(tablerow.Index, 1).Range.Text.Replace("\r\a", "");
But this seems a little too lazy, pretty much a waste of time, and likely to lead to problems down the road.
We could also have a method that receives the string to clean:
private string cleanTextWordCellBreak(string wordTextToClean)
{
string cleanString = wordTextToClean;
// Clean the text here
return cleanString;
}
Then we could use it:
string text = cleanTextWordCellBreak(wordTable.Cell(tablerow.Index, 1).Range.Text);
This seems closer to a better way of handling the issue. What would you do?

I would break it out into a separate method but use the Replace implementation, since it's the simplest solution. You can always change the implementation later if you run into a problem (for example, the text contains more than one \r\a and the inner ones need to be preserved).
So:
private string stripCellText(string text)
{
return text.Replace("\r\a", "");
}
string text = stripCellText(wordTable.Cell(tablerow.Index, 1).Range.Text);
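
If inner \r\a sequences ever need to be preserved, one option is to strip only the marker at the very end of the cell text. This is a minimal sketch, not the poster's code: the class and method names (WordCellText, StripCellEndMarker) are made up for illustration, and it assumes the end-of-cell marker is exactly "\r\a" as described in the question.

```csharp
using System;

public static class WordCellText
{
    // Assumed end-of-cell marker, as reported in the question: CR followed by BEL.
    private const string CellEndMarker = "\r\a";

    // Removes a single trailing marker and leaves any earlier occurrences untouched.
    public static string StripCellEndMarker(string cellText)
    {
        if (string.IsNullOrEmpty(cellText))
        {
            return string.Empty;
        }

        return cellText.EndsWith(CellEndMarker, StringComparison.Ordinal)
            ? cellText.Substring(0, cellText.Length - CellEndMarker.Length)
            : cellText;
    }
}
```

It would be called the same way as stripCellText above, passing in the cell's Range.Text.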

Personally, I would definitely opt for breaking it out into a separate method. It helps with code readability and makes it a lot easier to change if needed in the future.

Another way would be to get the Characters count of the cell's range and extract the text up to that length (VBA):
Dim cellRange As Range
Dim text As String
Dim length As Integer
Set cellRange = ActiveDocument.Tables(1).Cell(1, 1).Range
text = cellRange.Text
length = cellRange.Characters.Count
Debug.Print Mid(text, 1, length - 1)

Related

Most efficient way of adding/removing a character to beginning of string?

I was doing a small 'scalable' C# MVC project, with quite a bit of read/write to a database.
From this, I would need to add/remove the first letter of the input string.
'Removing' the first character is quite easy (using a Substring method) - using something like:
String test = "HHello world";
test = test.Substring(1,test.Length-1);
'Adding' a character efficiently seems to be messy/awkward:
String test = "ello World";
test = "H" + test;
Seeing as this will be done for a lot of records, would this be the most efficient way of doing these operations?
I am also testing whether a string starts with the letter 'T', and adding 'T' if it doesn't, using:
String test = "Hello World";
if(test[0]!='T')
{
test = "T" + test;
}
and would like to know whether this approach is suitable.

If you have several records and need to prepend a character to a field in each of them, you can use String.Insert with an index of 0: http://msdn.microsoft.com/it-it/library/system.string.insert(v=vs.110).aspx
yourString = yourString.Insert(0, "C");
This does pretty much the same as what you wrote in your original post, in case you prefer a method to an operator.
If you have to add characters several times to a single string, then you're better off using a StringBuilder: http://msdn.microsoft.com/it-it/library/system.text.stringbuilder(v=vs.110).aspx

I think both are equally efficient: both have to allocate a new string, since string is immutable.
When doing this on the same string multiple times, a StringBuilder might come in handy; it should perform better than repeated concatenation.
You could also move this operation to the database side if possible. That might increase performance too.

For removing, I would use Remove, since it doesn't require knowing the length of the string:
test = test.Remove(0, 1);
For adding, you could use Insert:
test = test.Insert(0, "H");
If you are always removing and then adding a character, you can treat the string as a char array and just replace the first character:
char[] chars = test.ToCharArray(); chars[0] = 'H';
test = new string(chars);
When doing lots of operations on the same string I would use a StringBuilder, though: it is more expensive to create, but its operations are faster.
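
Pulling the suggestions above together, here is a rough sketch of two small helpers. The names (PrefixHelper, EnsurePrefix, RemoveFirst) are hypothetical, and the empty-string handling is an assumption about what the caller wants; note that test[0] in the question throws on an empty string, which the length check here avoids.

```csharp
using System;

public static class PrefixHelper
{
    // Returns value unchanged if it already starts with prefix; otherwise prepends it.
    public static string EnsurePrefix(string value, char prefix)
    {
        if (value == null)
        {
            throw new ArgumentNullException("value");
        }

        bool hasPrefix = value.Length > 0 && value[0] == prefix;
        return hasPrefix ? value : prefix + value;
    }

    // Drops the first character; null and empty strings are returned as-is.
    public static string RemoveFirst(string value)
    {
        return string.IsNullOrEmpty(value) ? value : value.Substring(1);
    }
}
```

Each call still allocates one new string, so this is about safety and readability rather than speed.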

Parsing a String for Special characters in C#

I am getting a string in the following format in the query string:
Arnstung%20Chew(20)
I want to convert it to just Arnstung Chew.
How do I do it?
Also how do I make sure that the user is not passing a script or anything harmful in the query string?

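string str = "Arnstung Chew (20)";
// No "- 1" here: Trim() removes a space before "(", and subtracting 1 cuts off a letter when there is no space.
string replacedString = str.Substring(0, str.IndexOf("(")).Trim();
string safeString = System.Web.HttpUtility.HtmlEncode(replacedString);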

It's impossible to provide a comprehensive answer without knowing what variations might appear on your input text. For example, will there always be two words separated by a space followed by a number in parentheses? Or might there be other variations as well?
I have a lot of parsing code on my Black Belt Coder site, including a sscanf() replacement for .NET that may potentially be useful in your case.
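
For the specific format in the question, a slightly more defensive version could look like the sketch below. It assumes the value is always a name optionally followed by a number in parentheses; the names (NameParser, ExtractName) are invented for illustration. It uses System.Net.WebUtility, which, as far as I know, is available from .NET 4.5 onwards without a reference to System.Web.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class NameParser
{
    // Matches an optional trailing "(digits)" group, with or without a space before it.
    private static readonly Regex TrailingNumber = new Regex(@"\s*\(\d+\)\s*$");

    public static string ExtractName(string rawQueryValue)
    {
        // Decode %20 and similar escapes; harmless if the framework has already decoded the value.
        string decoded = WebUtility.UrlDecode(rawQueryValue) ?? string.Empty;

        // Unlike Substring/IndexOf, this does not throw when there is no "(" in the input.
        string name = TrailingNumber.Replace(decoded, string.Empty).Trim();

        // Encode before writing the value into HTML so that markup is not interpreted.
        return WebUtility.HtmlEncode(name);
    }
}
```

HTML-encoding only protects output rendered into a page; anything passed on to SQL or the file system needs its own handling.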

c# convert string that has ctrl+z to regular string

I have a string like this:
some_string = "A simple demo of SMS text messaging." + Convert.ToChar(26);
What is the simplest way to get rid of char 26?
Please keep in mind that some_string sometimes contains char 26 and sometimes does not, and it can be in different positions too, so I need the most versatile and easiest way to get rid of it.

If it can be in different positions (not just the end):
someString = someString.Replace("\u001A", "");
Note that you have to use the return value of Replace - strings are immutable, so any methods which look like they're changing the contents actually return a new string with the appropriate changes.

If it's only at the end:
some_string = some_string.TrimEnd((char)26);
If it can be anywhere then forget this and use Jon Skeet's answer.

Declaring a looooong single line string in C#

Is there a decent way to declare a long single line string in C#, such that it isn't impossible to declare and/or view the string in an editor?
The options I'm aware of are:
1: Let it run. This is bad because your string trails way off to the right of the screen, forcing a developer reading the message to scroll annoyingly.
string s = "this is my really long string. this is my really long string. this is my really long string. this is my really long string. this is my really long string. this is my really long string. this is my really long string. this is my really long string. ";
2: @ + newlines. This looks nice in code, but introduces newlines to the string. Furthermore, if you want it to look nice in code, not only do you get newlines, but you also get awkward spaces at the beginning of each line of the string.
string s = @"this is my really long string. this is my long string.
this line will be indented way too much in the UI.
This line looks silly in code. All of them suffer from newlines in the UI.";
3: "" + ... This works fine, but is super frustrating to type. If I need to add half a line's worth of text somewhere I have to update all kinds of +'s and move text all around.
string s = "this is my really long string. this is my long string. " +
"this will actually show up properly in the UI and looks " +
"pretty good in the editor, but is just a pain to type out " +
"and maintain";
4: string.Format or string.Concat. Basically the same as above, but without the plus signs. Has the same benefits and downsides.
Is there really no way to do this well?

There is a way. Put your very long string in resources. You can even put long pieces of text there, because that is where texts belong. Having them directly in code is really bad practice.

If you really want this long string in the code, and you really don't want to type the end-quote-plus-begin-quote, then you can try something like this.
string longString = @"Some long string,
with multiple whitespace characters
(including newlines and carriage returns)
converted to a single space
by a regular expression replace.";
longString = Regex.Replace(longString, @"\s+", " ");

If using Visual Studio
Tools > Options > Text Editor > All Languages > Word Wrap
I'm sure any other text editor (including notepad) will be able to do this!

It depends on how the string is going to wind up being used. All the answers here are valid, but context is important. If the long string s is going to be logged, it should be surrounded with a logging guard test, such as this log4net example:
if (log.IsDebugEnabled) {
string s = "blah blah blah" +
// whatever concatenation you think looks the best can be used here,
// since it's guarded...
"more blah";
log.Debug(s);
}
If the long string s is going to be displayed to a user, then Developer Art's answer is the best choice: those should be in a resource file.
For other uses (generating SQL query strings, writing to files [but consider resources again for these], etc...), where you are concatenating more than just literals, consider StringBuilder as Wael Dalloul suggests, especially if your string might possibly wind up in a function that just may, at some date in the distant future, be called many many times in a time-critical application (All those invocations add up). I do this, for example, when building a SQL query where I have parameters that are variables.
Other than that, no, I don't know of anything that both looks pretty and is easy to type (though the word wrap suggestion is a nice idea, it may not translate well to diff tools, code print outs, or code review tools). Those are the breaks. (I personally use the plus-sign approach to make the line-wraps neat for our print outs and code reviews).

You can use StringBuilder like this:
StringBuilder str = new StringBuilder();
str.Append("this is my really long string. this is my long string. ");
str.Append("this is my really long string. this is my long string. ");
str.Append("this is my really long string. this is my long string. ");
str.Append("this is my really long string. this is my long string. ");
string s = str.ToString();
You can also use: Text files, resource file, Database and registry.

Does it have to be defined in the source file? Otherwise, define it in a resource or config file.

Personally, I would read a string that big from a file, perhaps an XML document.

You could use StringBuilder

For really long strings, I'd store it in XML (or a resource). For occasions where it makes sense to have it in the code, I use the multiline string concatenation with the + operator. The only place I can think of where I do this, though, is in my unit tests for code that reads and parses XML where I'm actually trying to avoid using an XML file for testing. Since it's a unit test I almost always want to have the string right there to refer to as well. In those cases I might segregate them all into a #region directive so I can show/hide it as needed.

I either just let it run, or use string.Format and write the format string on one line (the let-it-run method) but put each of the arguments on a new line. That makes it easier to read, or at least gives the reader some idea of what to expect in the long string without reading it in detail.
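
As a rough illustration of that layout, with one long format string and one argument per line: the message text, class name and parameter names below are all invented for the example.

```csharp
using System;
using System.Globalization;

public static class LoginMessage
{
    // Hypothetical message; the point is only the one-argument-per-line layout.
    public static string Build(string userName, string ipAddress, DateTime when)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "User {0} logged in from {1} at {2:yyyy-MM-dd HH:mm}. If this was not you, contact support.",
            userName,
            ipAddress,
            when);
    }
}
```

A reader can skim the argument list to see what goes into the message without scrolling through the whole format string.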

Use the Project / Properties / Settings from the top menu of Visual Studio. Make the scope = "Application".
In the Value box you can enter very long strings and as a bonus line feeds are preserved. Then your code can refer to that string like this:
string sql = Properties.Settings.Default.xxxxxxxxxxxxx;

Read fixed width record from text file

I've got a text file full of records where each field in each record is a fixed width. My first approach would be to parse each record simply using string.Substring(). Is there a better way?
For example, the format could be described as:
<Field1(8)><Field2(16)><Field3(12)>
And an example file with two records could look like:
SomeData0000000000123456SomeMoreData
Data2   0000000000555555MoreData    
I just want to make sure I'm not overlooking a more elegant way than Substring().
Update: I ultimately went with a regex like Killersponge suggested:
private readonly Regex reLot = new Regex(REGEX_LOT, RegexOptions.Compiled);
// Widths follow the format described above: 8, 16 and 12 characters.
const string REGEX_LOT = "^(?<Field1>.{8})" +
"(?<Field2>.{16})" +
"(?<Field3>.{12})";
I then use the following to access the fields:
Match match = reLot.Match(record);
string field1 = match.Groups["Field1"].Value;

Use FileHelpers.
Example:
[FixedLengthRecord()]
public class MyData
{
[FieldFixedLength(8)]
public string someData;
[FieldFixedLength(16)]
public int SomeNumber;
[FieldFixedLength(12)]
[FieldTrim(TrimMode.Right)]
public string someMoreData;
}
Then, it's as simple as this:
var engine = new FileHelperEngine<MyData>();
// To Read Use:
var res = engine.ReadFile("FileIn.txt");
// To Write Use:
engine.WriteFile("FileOut.txt", res);

Substring sounds good to me. The only downside I can immediately think of is that it means copying the data each time, but I wouldn't worry about that until you prove it's a bottleneck. Substring is simple :)
You could use a regex to match a whole record at a time and capture the fields, but I think that would be overkill.

Why reinvent the wheel? Use .NET's TextFieldParser class (in the Microsoft.VisualBasic.FileIO namespace), per this how-to for Visual Basic: How to read from fixed-width text files.
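
TextFieldParser can be used from C# as well, provided the project references the Microsoft.VisualBasic assembly. The sketch below is a minimal example under that assumption; the wrapper class and method names (FixedWidthReader, ReadRecords) are made up, and the widths are the ones from the question.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class FixedWidthReader
{
    // Reads every line from reader and splits it into fields of the given widths.
    public static List<string[]> ReadRecords(TextReader reader, params int[] widths)
    {
        var records = new List<string[]>();

        // Disposing the parser also disposes the underlying reader.
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.FixedWidth;
            parser.SetFieldWidths(widths);

            while (!parser.EndOfData)
            {
                records.Add(parser.ReadFields());
            }
        }

        return records;
    }
}
```

For the format in the question this would be called as FixedWidthReader.ReadRecords(new StreamReader("FileIn.txt"), 8, 16, 12), where the file name is only a placeholder.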

You may have to watch out: if the ends of the lines aren't padded with spaces to fill the field, Substring won't work without a bit of fiddling to work out how much of the line is left to read. This of course only applies to the last field :)
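
One way to do that fiddling is to clamp each Substring call to what is actually left on the line. This is only a sketch with hypothetical names (FixedWidth, Split); trimming trailing padding from each field is an assumption about what the caller wants.

```csharp
using System;

public static class FixedWidth
{
    // Splits record into fields of the given widths, tolerating lines that are too short.
    public static string[] Split(string record, params int[] widths)
    {
        var fields = new string[widths.Length];
        int pos = 0;

        for (int i = 0; i < widths.Length; i++)
        {
            // Never read past the end of the line, even if padding is missing.
            int available = Math.Max(0, Math.Min(widths[i], record.Length - pos));

            fields[i] = available > 0
                ? record.Substring(pos, available).TrimEnd()
                : string.Empty;

            pos += widths[i];
        }

        return fields;
    }
}
```

With the second sample record stripped of its trailing padding, FixedWidth.Split(line, 8, 16, 12) still returns the last field instead of throwing.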

Unfortunately, out of the box the CLR only provides Substring for this.
Someone over at CodeProject made a custom parser using attributes to define fields; you might want to look at that.

Nope, Substring is fine. That's what it's for.

You could set up an ODBC data source for the fixed format file, and then access it as any other database table.
This has the added advantage that specific knowledge of the file format is not compiled into your code for that fateful day that someone decides to stick an extra field in the middle.
