TextReader.ReadLine() Fails to Read Entire Line - c#

I've got a Comma Delimited Text file that I am trying to read in.
I read in 1 line at a time, and process that information.
Using the code snippet and file fragment below, my error comes when I get to the line that starts with 841 - it only pulls in 147 characters.
Question: What is causing the TextReader to stop pulling in this line? Is there some special sequence in it?
Code Snippet:
int lastNum = -1;
int num = 1;
using (TextReader reader = File.OpenText(filename)) {
    do {
        string line = reader.ReadLine();
        if (!String.IsNullOrEmpty(line)) {
            string[] split = line.Split(',');
            int indexer = Convert.ToInt32(split[0]);
            Console.WriteLine("#{0}: ID '{1}' Line Length = {2}", num++, split[0], line.Length);
        }
    } while (-1 < reader.Peek());
}
File Fragment (from line 0 to ProblemLine + 1):
ID,Line,[Date],WO,Module,DSO,Integer,Unit,,Contact,Category,Problem,Solution,Action,Actor,Acted
824,,1/4/2011,589259,,170966,JC,V3A,,Tom Read,WO.3,"The unit is stainless steel, but the coil connection plates that were on the work order were not stainless steel",MTF # 264698 to take off CC500 AND CC875 and added XCC500 AND XCC875,,,
825,,1/4/2011,588779,,171102,JC,V3A,,,W.4,Changing from a 310AJ motor to a 310AX,MTF # 46746 to fan assembly and motor,,,
826,,1/4/2011,588948,,170941,JC,V3B,,,W.4,Changing from a 310AJ motor to a 310AX,MTF # 241092 and 241093 to change fan assemly and EBM motor,,,
827,,1/4/2011,588206,,171143,JC,H3A,,,WO.2,Potentiometer was missing from the work order,MTF # 264851 to add 29278,,,
828,,1/4/2011,584741 584742 584748 584747 584749,,171009,BF,V2B,,"Carlos, Laura",,Johnson units. Motors would not fit correctly using the motor mounts already installed.,MTF# S264510 to remove 006-300 motor mounts from work orders. MTF# S264699 to add 006-033 motor mounts to work orders.,,,
829,,1/4/2011,586519,,170891-1-2,DB,H3B,,"Carlos, Laura",WO.2,"1"" bushing not on BOM.",MTF# 264769 added 28614,,,
830,,1/4/2011,583814,,170804-1-3,DB,V3B,,"Carlos, Laura",WO.3,Wrong pulley (26710) and wrong Belt A-41 (29725) appear on WO.,MTF# 264570 removed those and put on an A-33 (26768) and pulley 27005. Two units so Qty 2 for each item.,,,
831,,1/5/2011,584742,,171009,JC,V2B,,,,there was an extra overload relay on the work order because it had been changed and the original was never taken off.,MTF # 241926 to take off 7- 27167 overload relay,,,
832,,1/5/2011,591742,,170965,JC,H3C,,"Carlos, Laura",WO.3,Belt was too short,MTF # 241729 to take off 30737 (BX42) and put on 28589 (BX52). Center to center distance was 19 3/8 in,,,
833,,1/5/2011,584749,,171009,JC,H2A,,Joe ,E.3,Did a motor change in order for the motor to work on the unit,MTF # 264854 to add 28918 and take off 28095 motor and SP01204 pulley,,,
834,,1/5/2011,588945,,171157,JC,V3B,,Alex,D,Stainless steel unit needed a stainless steel power entering cover plate.,Spoke with Alex and he designed X302-905 and MTF # 241094 was done to add to this work order.,,,
835,,1/5/2011,589259,,170966,JC,V3A,,Alex,D,Stainless steel unit needed a stainless steel power entering cover plate.,Spoke with Alex and he designed X302-905 and MTF # 241094 was done to add to this work order.,,,
836,,1/5/2011,584749,,171009,JC,H2A,,,,Changed overload relay because changed motor,MTF # 264857 to change overload relay. Took off 27169 and added 26736,,,
837,,1/6/2011,583815,,170804,JC,V3B,,"Carlos, Laura",WO.3,bore hole on the pulley was too big ,MTF # 241096 to take off 26710 7/8 pull and put on 27005 5/8 pulley,,,
838,,1/6/2011,583816,,170804,JC,V3B,,"Carlos, Laura",WO.3,bore hole on the pulley was too big ,MTF # 241096 to take off 26710 7/8 pull and put on 27005 5/8 pulley,,,
839,,1/6/2011,587632,,171143,BF,M2,,"Carlos, Laura",WO.2,H302-850 blank off #3 not on WO.,MTF# S242648 to add (1) H302-850,,,
840,,1/6/2011,583816,,170804,BF,M2,,"Carlos, Laura",WO.3,A41 Belt too large,"MTF# S241706 to remove A41 (29725) and add A33 (26780). C-C distance 12.5",,,
841,,1/7/2011,588945,,171157,JC,V3B,,Tom Read ,D,"Assembly drawing AD-V3B-162C-EPSSTLDR had a 7/8 distributor connecting to a 5/8 opening on a tee.
",MTF # 264653 to to add bushing 27256 and 28997 tee in order to use a tee that would fit into the distributor.,,,
842,,1/7/2011,589257,,170966,JC,V3C,,Everyone ,WO.2,heat exchanger was missing from the work order ,MTF # 264858 to add the heat exchanger on work order and one was ordered.,,,
LOOK! ^^^ S.O.'s reader did it too!
Here is the exact text of the line that starts with 841:
841,,1/7/2011,588945,,171157,JC,V3B,,Tom Read ,D,"Assembly drawing AD-V3B-162C-EPSSTLDR had a 7/8 distributor connecting to a 5/8 opening on a tee.
",MTF # 264653 to to add bushing 27256 and 28997 tee in order to use a tee that would fit into the distributor.,,,
FYI: I am developing in C# against .NET Framework 4.
[Solved] I was able to figure this out using Rob Parker's comment and using a raw Stream instead of the prettier TextReader class. It turns out my rogue character was an inserted carriage return (\r).
using (Stream fs = File.Open(filename, FileMode.Open, FileAccess.Read)) {
    byte[] data = new byte[1024];
    int len;
    do {
        len = fs.Read(data, 0, data.Length);
        for (int n = 0; n < len; n++) {
            if ((n + 3) < len) {
                // Peek at the next three characters to spot the "841" record ID.
                string strId = string.Format("{0}{1}{2}", (char)data[n + 1], (char)data[n + 2], (char)data[n + 3]);
                int numeric;
                if (int.TryParse(strId, out numeric) && numeric == 841) {
                    // Capture the raw characters around the spot where ReadLine() stopped.
                    char[] suspects = new char[50];
                    int n2 = n;
                    int n3 = 0;
                    while (n2 < len) {
                        if ((n + 130 < n2) && (n2 < n + 160)) {
                            suspects[n3++] = (char)data[n2];
                        }
                        n2++;
                    }
                    Console.WriteLine("Wait Here!"); // set a breakpoint here and inspect 'suspects'
                    break;
                }
            }
        }
    } while (0 < len);
}
Thanks everyone for your help!

TextReader treats any of the following characters as an end-of-line delimiter (it tries to play nice with the various end-of-line conventions out there):
CR. The old Mac OS (pre-OS X) end-of-line convention: "\r".
CR+LF. The Microsoft Windows/DOS end-of-line convention: "\r\n".
LF. The *nix end-of-line convention: "\n".
My suspicion is that you've got a spurious \r (CR) floating around in there somewhere.
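You can see this behavior with a StringReader, which follows the same line-ending rules as any other TextReader (a minimal sketch; the sample text is invented):

```csharp
using System;
using System.IO;

class LineEndingDemo
{
    static void Main()
    {
        // A lone \r in the middle of what looks like one line
        // terminates ReadLine(), just like \n or \r\n would.
        var reader = new StringReader("841,data with a stray\rcarriage return");
        Console.WriteLine(reader.ReadLine()); // prints: 841,data with a stray
        Console.WriteLine(reader.ReadLine()); // prints: carriage return
    }
}
```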

Since it turned out to be particularly helpful...
Have you checked what character(s) there are between the period and doublequote character at the point where it's splitting the line?
If ReadLine() doesn't include the line-break characters in what it returns you might have to do a little work to get to it/them. But if you can get the FileStream object used by the TextReader (not sure if it's exposed) you could add code to detect the problem line (starting "841,") and hit a breakpoint (or Debugger.Break()) and then use the underlying FileStream to back up the Position and read the raw bytes to see what's there.
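For reference, a sketch of that raw-byte idea (the "841," marker comes from the question; the helper name and the <0xNN> output format are just for illustration):

```csharp
using System;
using System.Text;

static class RawByteDump
{
    // Renders a window of raw bytes starting at a marker, making control
    // characters such as \r (0x0D) and \n (0x0A) visible as <0xNN>.
    public static string DumpWindow(byte[] data, string marker, int length)
    {
        string text = Encoding.ASCII.GetString(data);
        int start = text.IndexOf(marker, StringComparison.Ordinal);
        if (start < 0) return "";
        var sb = new StringBuilder();
        int end = Math.Min(start + length, data.Length);
        for (int i = start; i < end; i++)
        {
            byte b = data[i];
            if (b < 0x20) sb.AppendFormat("<0x{0:X2}>", b);
            else sb.Append((char)b);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // For the real file, replace this sample with File.ReadAllBytes(filename).
        byte[] sample = Encoding.ASCII.GetBytes("840,ok\r\n841,bad\rline\r\n");
        Console.WriteLine(DumpWindow(sample, "841,", 160));
        // prints: 841,bad<0x0D>line<0x0D><0x0A>
    }
}
```

A stray lone \r in the dump immediately explains why ReadLine() stops early.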

Related

C# - break out large string into multiple smaller strings for export to a database

C# newb here - I have a script written in C# which takes the contents of several fields of the internal database of an application (Contoso Application, in this case) and exports them to a SQL Server Database table.
Here is the code:
using System;
using System.IO;
using System.Data.SqlClient;
using Contoso.Application.Api;
using Contoso.Application.Commands;
using System.Linq;
public class Script
{
    public static bool ExportData(DataExportArguments args)
    {
        try
        {
            var sqlStringTest = new SqlConnectionStringBuilder();
            sqlStringTest.DataSource = "SQLserverName";
            sqlStringTest.InitialCatalog = "TableName";
            sqlStringTest.IntegratedSecurity = true;
            sqlStringTest.UserID = "userid";
            sqlStringTest.Password = "password";
            using (var sqlConnection = new SqlConnection(sqlStringTest.ConnectionString))
            {
                sqlConnection.Open();
                using (IExportReader dataReader = args.Data.GetTable())
                {
                    while (dataReader.Read())
                    {
                        using (var sqlCommand = new SqlCommand())
                        {
                            sqlCommand.Connection = sqlConnection;
                            sqlCommand.CommandText =
                                @"INSERT INTO [dbo].[Table] (
                                    Id,
                                    Url,
                                    articleText)
                                  VALUES (
                                    @Id,
                                    @Url,
                                    @articleText)";
                            sqlCommand.Parameters.AddWithValue("@Id", dataReader.GetStringValue("Id"));
                            sqlCommand.Parameters.AddWithValue("@Url", dataReader.GetStringValue("Url"));
                            sqlCommand.Parameters.AddWithValue("@articleText",
                                dataReader.Columns.Any(x => x.Name == "articleText")
                                    ? dataReader.GetStringValue("articleText")
                                    : (object)DBNull.Value);
                            sqlCommand.ExecuteNonQuery(); // run the INSERT for this row
                        }
                    }
                }
            }
        }
        catch (Exception exp)
        {
            args.WriteDebug(exp.ToString(), DebugMessageType.Error);
            return false;
        }
        return true;
    }
}
FYI - articleText is of type nvarchar(max)
What I'm trying to accomplish: sometimes the data in the articleText field is short, sometimes it is very long. What I need to do is break out a record into multiple records when the string in a given articleText field is greater than 10,000 characters. So if a given articleText field is 25,000 characters, there would be 3 records exported: first one would have an articleText field of 10,000 characters, 2nd, 10,000 characters, 3rd, 5,000 characters.
Further to this requirement, I need to ensure that if the character cutoff for each record falls in the middle of a word (which will likely happen most of the time) that I account for that.
Therefore, as an example, if we have a record in the application's internal database with Id of 1, Url of www.contoso.com, and articleText of 28,000 characters, I would want to export 3 records to SQL Server as such:
Record 1:
Id: 1
Url: www.contoso.com
articleText: if articleText greater than 10,000 characters, export characters 1-10,000, else export entirety of articleText.
Record 2:
Id: 1
Url: www.contoso.com
articleText: assuming Record 2 only exists if Record 1 was greater than 10k characters, export characters 9,990-20,000 (start at character 9,990 in case Record 1 cuts off in the middle of a word).
Record 3:
Id: 1
Url: www.contoso.com
articleText: export characters 19,900-28,000 (or alternatively, 19,900 through end of string).
For any given export session, there are thousands of records in the internal database to be exported (hence the while loop). Approximately 20% of the records will meet the criteria of articleText exceeding 10k characters, so for any that don't, we absolutely only want to export one record. Further, although my example above only goes to 28k characters, this script needs to be able to accommodate any size.
I'm a bit stumped at how one would go about accomplishing something like this. I believe the first step is to get a character count for articleText to determine how many records need to be exported. From there, I feel I've gone down a rabbit hole. Any suggestions on how to go about this would be greatly appreciated.
EDIT #1: to clarify on the cutoff requirement - the reason the above is the approach I'm suggesting to handle the cutoff is because the article may have a person's name in it. Simply finding a space and cutting it off there wouldn't work because it's possible you would split between a first and last name. The approach I mention above would meet our requirements because the word or name only needs to exist in its entirety in one of the records.
Further, reassembly of the separated records in SQL Server is not a requirement and therefore not necessary.
This might be a start: it's not very efficient, admittedly, but just to illustrate how it might be done:
void Main()
{
    string text = "012345 6789012 3456789012 34567890 1234567" +
                  "0123 456789 01234567 8901234567 8901234567" +
                  "012345 67890123456 78901234567890123456" +
                  "0123456 7890123456 789012345 6789012345" +
                  "012345 678901234 5678901234 5678901234" +
                  "01234567 89012345678 901234567890123" +
                  "ABCDEFGHI JLMNOPQES TUVWXYZ";
    int startingPoint = 0;
    int chunkSize = 50;
    int padding = 10;
    List<string> chunks = new List<string>();
    do
    {
        if (startingPoint == 0)
        {
            chunks.Add(new string(text.Take(chunkSize).ToArray()));
        }
        else
        {
            chunks.Add(new string(text.Skip(startingPoint).Take(chunkSize).ToArray()));
        }
        startingPoint = startingPoint + chunkSize - padding;
    }
    while (startingPoint < text.Length);
    Console.WriteLine("Original length: {0}", text.Length);
    Console.WriteLine("Chunk count: {0}", chunks.Count);
    Console.WriteLine("Expected new length: {0}", text.Length + (chunks.Count - 1) * padding);
    Console.WriteLine("Actual new length: {0}", chunks.Sum(c => c.Length));
    Console.WriteLine();
    Console.WriteLine("Chunks:");
    foreach (var chunk in chunks)
    {
        Console.WriteLine(chunk);
    }
}
Output:
Original length: 263
Chunk count: 7
Expected new length: 323
Actual new length: 323
Chunks:
012345 6789012 3456789012 34567890 12345670123 456
670123 456789 01234567 8901234567 8901234567012345
4567012345 67890123456 789012345678901234560123456
4560123456 7890123456 789012345 6789012345012345 6
45012345 678901234 5678901234 567890123401234567 8
01234567 89012345678 901234567890123ABCDEFGHI JLMN
EFGHI JLMNOPQES TUVWXYZ
You are going to have to tokenize the input to be able to split it sensibly. In order to do that, you have to be able to make some assumptions about the input.
For example, you could split the input on the last end-of-sentence that occurs prior to the 10K character boundary. But you have to be able to make concrete assumptions about what constitutes an end-of-sentence in the input. If you can assume that the input is well-punctuated and grammatically correct, then a simple regex like [^.!?]+[.!?] {1,2}[A-Z] can be used to detect the end of a sentence, where the sentence ends with ".", "!", or "?", is followed by at least one but no more than two spaces, and the next character is a capital letter. Since the following capital letter is included in the match, you just drop back one character position and split.
The exact process will depend on the specific assumptions you can make about the input.
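A sketch of that idea (the 10,000-character limit comes from the question; the class and method names are just for illustration, and the regex uses a lookahead for the capital letter instead of consuming it, which avoids the drop-back-one-character step):

```csharp
using System;
using System.Text.RegularExpressions;

static class SentenceSplitter
{
    // Finds the last sentence boundary at or before maxLen.
    // Assumes well-punctuated text: '.', '!' or '?' followed by
    // 1-2 spaces, with a capital letter starting the next sentence.
    public static int FindSplitPoint(string text, int maxLen)
    {
        var boundary = new Regex(@"[.!?] {1,2}(?=[A-Z])");
        int best = -1;
        foreach (Match m in boundary.Matches(text))
        {
            int cut = m.Index + m.Length; // position just before the capital
            if (cut <= maxLen) best = cut;
            else break;
        }
        return best; // -1 if no boundary exists within the limit
    }
}
```

Used with `text.Substring(0, cut)` and `text.Substring(cut)`, this walks a long article into sentence-aligned chunks.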

Fastest way to split a huge text into smaller chunks

I have used the below code to split the string, but it takes a lot of time.
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
string strSegmentData = "";
string line = srSegmentData.ReadToEnd();
int startPos = 0;
ArrayList alSegments = new ArrayList();
while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
{
strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine;
alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine);
startPos = startPos + segmentSize;
}
}
Please suggest me an alternative way to split the string into smaller chunks of fixed size
First of all you should define what you mean by chunk size. If you mean chunks with a fixed number of code units, then your actual algorithm may be slow but it works. If that's not what you intend and you actually mean chunks with a fixed number of characters, then it's broken. I discussed a similar issue in this Code Review post: Split a string into chunks of the same length, so I will repeat only the relevant parts here.
You're partitioning over Char, but String is UTF-16 encoded, so you may produce broken strings in at least three cases:
One character is encoded with more than one code unit. The Unicode code point for such a character is encoded as two UTF-16 code units, and each code unit may end up in a different slice (and both resulting strings will be invalid).
One character is composed of more than one code point. You may be dealing with a character made of two separate Unicode code points (for example the Han character 𠀑).
One character has combining characters or modifiers. This is more common than you may think: for example a Unicode combining character like U+0300 COMBINING GRAVE ACCENT used to build à, or a Unicode modifier such as U+02BC MODIFIER LETTER APOSTROPHE.
The definitions of character for a programming language and for a human being are pretty different. For example, in Slovak dž is a single character, but it's made of 2/3 Unicode code points, which in this case are also 2/3 UTF-16 code units, so "dž".Length > 1. More about this and other cultural issues in How can I perform a Unicode aware character by character comparison?.
Ligatures exist. Assuming one ligature is one code point (and also assuming it's encoded as one code unit), you will treat it as a single glyph even though it represents two characters. What should you do in this case? In general the definition of character can be pretty vague, because it has a different meaning depending on the discipline where the word is used. You can't (probably) handle everything correctly, but you should set some constraints and document the code's behavior.
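The surrogate-pair case (the first bullet) is easy to demonstrate with the Han character mentioned above (a minimal sketch):

```csharp
using System;
using System.Globalization;

class SurrogateDemo
{
    static void Main()
    {
        string s = "a\U00020011b"; // a + 𠀑 (outside the BMP) + b
        Console.WriteLine(s.Length); // prints: 4 (code units, not 3 "characters")
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // prints: 3
        // Splitting blindly at index 2 cuts the surrogate pair in half,
        // leaving an orphaned high surrogate at the end of the slice:
        string broken = s.Substring(0, 2);
        Console.WriteLine(char.IsHighSurrogate(broken[1])); // prints: True
    }
}
```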
One proposed (and untested) implementation may be this:
public static IEnumerable<string> Split(this string value, int desiredLength)
{
    // Iterate by text element (grapheme) instead of by char so that
    // surrogate pairs and combining sequences are never cut in half.
    var characters = StringInfo.GetTextElementEnumerator(value);
    var chunk = new StringBuilder();
    int elements = 0;
    while (characters.MoveNext())
    {
        chunk.Append((string)characters.Current);
        if (++elements == desiredLength)
        {
            yield return chunk.ToString();
            chunk.Clear();
            elements = 0;
        }
    }
    if (elements > 0)
        yield return chunk.ToString();
}
It's not optimized for speed (as you can see I tried to keep the code short and clear using enumerations) but, for big files, it still performs better than your implementation (see the next paragraph for the reason).
About your code, note that:
You're building a huge ArrayList (?!) to hold the result. Also note that this way you resize the ArrayList multiple times (even though, given the input size and chunk size, its final size is known).
strSegmentData is rebuilt multiple times; if you need to accumulate characters you must use StringBuilder, otherwise each operation will allocate a new string and copy the old value (which is slow and also adds pressure on the garbage collector).
There are faster implementations (see the linked Code Review post, especially Heslacher's implementation for a much faster version), and if you do not need to handle Unicode correctly (you're sure you manage only US-ASCII characters) then there is also a pretty readable implementation from Jon Skeet (note that, after profiling your code, you may still improve its performance for big files by pre-allocating an output list of the right size). I will not repeat their code here, so please refer to the linked posts.
In your specific case you do not need to read the entire huge file into memory; you can read/parse n characters at a time (don't worry too much about disk access, I/O is buffered). It will slightly degrade performance but it will greatly improve memory usage. Alternatively you can read line by line (managing to handle cross-line chunks).
Below is my analysis of your question and code (read the comments)
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
    string strSegmentData = "";
    string line = srSegmentData.ReadToEnd(); // Why are you reading this till the end if it is such a long string?
    int startPos = 0;
    ArrayList alSegments = new ArrayList(); // Better choice would be to use List<string>
    while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
    {
        strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine; // Seems like you are inserting linebreaks at a specified interval in your original string. Is that what you want?
        alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine); // Why are you recalculating the Substring? Why are you appending the newline if the aim is to just "split"?
        startPos = startPos + segmentSize;
    }
}
Making all kinds of assumptions, below is the code I would recommend for splitting a long string. It is just a clean way of doing what you are doing in the sample. You can optimize this, but I'm not sure how fast you are looking for.
static void Main(string[] args) {
    string fileNamePath = "ConsoleApplication1.pdb";
    var segmentSize = 32;
    var op = ReadSplit(fileNamePath, segmentSize);
    var joinedString = string.Join(Environment.NewLine, op);
}

static List<string> ReadSplit(string filePath, int segmentSize) {
    var splitOutput = new List<string>();
    using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024)) {
        char[] buffer = new char[segmentSize];
        while (!file.EndOfStream) {
            int n = file.ReadBlock(buffer, 0, segmentSize);
            splitOutput.Add(new string(buffer, 0, n));
        }
    }
    return splitOutput;
}
I haven't done any performance tests on my version, but my guess is that it is faster than your version.
Also, I am not sure how you plan to consume the output, but a good optimization when doing I/O is to use async calls. And a good optimization (at the cost of readability and complexity) when handling large strings is to stick with char[].
Note that
You might have to deal with character-encoding issues while reading the file
If you already have the long string in memory and file reading was just included in the demo, then you should use the StringReader class instead of the StreamReader class

Matlab code to C# code conversion

function [ samples,y, energies] = energy( speech, fs )
window_ms = 200;
threshold = 0.75;
window = window_ms*fs/1000;
speech = speech(1:(length(speech) - mod(length(speech),window)),1);
samples = reshape(speech,window,length(speech)/window);
energies = sqrt(sum(samples.*samples))';
vuv = energies > threshold;
y=vuv;
I have this Matlab code and I need to write it in C#. However, I couldn't understand the last part of the code. Also, I think speech corresponds to a data List or array, according to the first part of the code. If it doesn't, please can someone explain what this code is doing? I just want to know the logic. fs = 1600 or 3200;
The code takes an array representing a signal. It then breaks it into pieces according to a window of specified length, computes the energy in each segment, and finds out which segments have energy above a certain threshold.
Let's go over the code in more detail:
speech = speech(1:(length(speech) - mod(length(speech),window)),1);
The above line is basically making sure that the input signal's length is a multiple of the window length. So say speech was an array of 11 values and the window length was 5: the code would keep only the first 10 values (from 1 to 5*2), removing the last remaining value.
The next line is:
samples = reshape(speech,window,length(speech)/window);
perhaps it is best explained with a quick example:
>> x = 1:20;
>> reshape(x,4,[])
ans =
1 5 9 13 17
2 6 10 14 18
3 7 11 15 19
4 8 12 16 20
So it reshapes the array into a matrix of k rows (k being the window length), and as many columns as needed to complete the array. The first k values form the first segment, the next k values the second segment, and so on.
Finally, the next line computes the signal energy in each segment (in a vectorized manner):
energies = sqrt(sum(samples.*samples))';
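Putting the pieces together, a hedged C# port of the whole function might look like this (a sketch assuming speech arrives as a double array; the class and method names are just for illustration, and the defaults mirror the Matlab code's window_ms = 200 and threshold = 0.75):

```csharp
using System;

static class SpeechEnergy
{
    // Port of the Matlab energy() function: trims the signal to a whole
    // number of windows, computes sqrt(sum(x.^2)) per window, and flags
    // the windows whose energy exceeds the threshold (the vuv output).
    public static bool[] Energy(double[] speech, int fs, out double[] energies,
                                int windowMs = 200, double threshold = 0.75)
    {
        int window = windowMs * fs / 1000;
        int usable = speech.Length - speech.Length % window; // drop the remainder
        int segments = usable / window;
        energies = new double[segments];
        var vuv = new bool[segments];
        for (int s = 0; s < segments; s++)
        {
            double sum = 0;
            for (int i = 0; i < window; i++)
            {
                double v = speech[s * window + i];
                sum += v * v;
            }
            energies[s] = Math.Sqrt(sum);
            vuv[s] = energies[s] > threshold;
        }
        return vuv;
    }
}
```

Each column of the Matlab samples matrix becomes one pass of the inner loop here, which avoids building the reshaped matrix at all.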
List<int> speech = new List<int>();
int window = 320; // window_ms * fs / 1000; must be non-zero to avoid a divide-by-zero below
int length = speech.Count();
int result = length % window;
int r = length - result;
// speech = speech(1: r, 1)  ==>  speech = speech.Take(r).ToList();
This:
(length(speech) - mod(length(speech),window))
is a formula:
([length of speech] - [remainder of (length of speech / window)])
so try
(length(speech) - (length(speech) % window))
% is the C# operator equivalent to mod(..)
EDIT I should say that I assume that is what mod(..) is in your code :)

Trying to make a text parser that inserts newlines where needed, but it's not quite working

I'm making a game, and I read dialogue text from an XML file. I'm trying to make a routine to add in newlines automatically as needed, so that it fits in the text box. It just won't work right, though. Here's the code as it currently is:
SpriteFont dialogueFont = font31Adelon;
int lastSpace = 0;
string builder = "";
string newestLine = "";
float maxWidth = 921.6f;
float stringLength = 0;
for (int i = 0; i <= speech.Length - 1; i++) //each char in the string
{
    if (speech[i] == ' ') //record the index of the most recent space
    {
        lastSpace = i;
    }
    builder += speech[i];
    newestLine += speech[i];
    stringLength = dialogueFont.MeasureString(newestLine).X;
    if (stringLength > maxWidth) //longer than allowed
    {
        builder = builder.Remove(lastSpace); //cut off from last space
        builder += Environment.NewLine;
        i = lastSpace; //start back from the cutoff
        newestLine = "";
    }
}
speech = builder;
My test string is "This is an example of a long speech that has to be broken up into multiple lines correctly. It is several lines long and doesn't really say anything of importance because it's just example text."
This is how speech ends up looking:
http://www.iaza.com/work/120627C/iaza11394935036400.png
The first line works because it happens to be a space that brings it over the limit, I think.
i = 81 and lastSpace = 80 is where the second line ends. builder looks like this before the .Remove command:
"This is an example of a long speech that\r\nhas to be broken up into multiple lines c"
and after it is run it looks like this:
"This is an example of a long speech that\r\nhas to be broken up into multiple line"
The third line goes over the size limit at i = 123 and lastSpace = 120. It looks like this before the .Remove:
"This is an example of a long speech that\r\nhas to be broken up into multiple line\r\ncorrectly. It is several lines long and doe"
and after:
"This is an example of a long speech that\r\nhas to be broken up into multiple line\r\ncorrectly. It is several lines long an"
As you can see, it cuts off an extra character, even though character 80, that space, is where it's supposed to start removing. From what I've read, .Remove, when called with a single parameter, cuts out everything including and after the given index. It's cutting out i = 79 too, though!
It seems like it should be easy enough to add or subtract from lastSpace to make up for this, but I either get "index out of bounds" errors, or I cut off even more characters. I've tried doing .Remove(lastSpace, i - lastSpace), and that doesn't work either. I've tried handling "ends with a space" cases differently than others, by adding or subtracting from lastSpace. I've tried breaking things up in different ways, and none of it has worked.
I'm so tired of looking at this, any help would be appreciated.
You add Environment.NewLine to your builder string; you need to consider that when you specify the index where to start removing.
Environment.NewLine's value is system dependent. It can be "\r\n", "\n" or "\r" (or, according to MSDN, only one of the first two).
In your case it's "\r\n", which means that for every space you removed, you added two other characters.
First, you need to declare a new variable:
int additionalChars = 0;
Then, when adding a line of text, you should change your code to something like this:
builder = builder.Remove(lastSpace + additionalChars); //cut off from last space
builder += Environment.NewLine;
additionalChars += Environment.NewLine.Length - 1;
The -1 is because you already removed a space (should make this code independent of the system's definition of Environment.NewLine).
UPDATE: You should also account for words that are longer than the line limit. You should break them anyway (couldn't help but have a try):
if (stringLength > maxWidth) //longer than allowed
{
    // Cut off only if there was a space to cut off at
    if (lastSpace >= 0) {
        builder = builder.Remove(lastSpace + additionalChars); //cut off from last space
        i = lastSpace; //start back from the cutoff
    }
    builder += Environment.NewLine;
    // If there was no space that we cut off, there is also no need to subtract it here
    additionalChars += Environment.NewLine.Length - (lastSpace >= 0 ? 1 : 0);
    lastSpace = -1; // Means: No space found yet
    newestLine = "";
}
As an alternative approach, you could break your sentence up into an array using .Split and then fill your box until there isn't space for the next word, then add the newline and start on the next line.
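That split-based idea might look something like this (a sketch; the class and method names are just for illustration, and the measuring function is passed in as a delegate because the real measurement would come from your SpriteFont's MeasureString):

```csharp
using System;
using System.Text;

static class WordWrap
{
    // Breaks text into lines no wider than maxWidth, splitting on spaces.
    // 'measure' stands in for s => dialogueFont.MeasureString(s).X.
    public static string Wrap(string text, float maxWidth, Func<string, float> measure)
    {
        var result = new StringBuilder();
        string line = "";
        foreach (string word in text.Split(' '))
        {
            string candidate = line.Length == 0 ? word : line + " " + word;
            if (measure(candidate) > maxWidth && line.Length > 0)
            {
                result.AppendLine(line); // current line is full; start a new one
                line = word;
            }
            else
            {
                line = candidate;
            }
        }
        result.Append(line); // don't lose the final line
        return result.ToString();
    }
}
```

Because each candidate line is rebuilt from whole words, there is no index bookkeeping for the newline characters at all.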
Can you do something like the following code? The advantage is two-fold. Firstly, it skips to the next space to measure and decide whether to add the whole word or not, rather than going letter by letter. Secondly, it only calls MeasureString once for each word in the string, rather than for every letter added to the string. It uses StringBuilder with Append when the word will fit, or AppendLine when it won't fit and a new line needs to be added.
var sb = new StringBuilder(); // accumulates the wrapped output; s is the text to wrap
int lastSpace = 0;
int nextSpace = s.IndexOf(' ', lastSpace + 1);
float width = 0;
float totalWidth = 0;
float maxWidth = 200;
while (nextSpace >= 0)
{
    string piece = s.Substring(lastSpace, nextSpace - lastSpace);
    width = g.MeasureString(piece, this.Font).Width; // in XNA: dialogueFont.MeasureString(piece).X
    if (totalWidth + width < maxWidth)
    {
        sb.Append(piece);
        totalWidth += width;
    }
    else
    {
        sb.AppendLine(piece);
        totalWidth = 0;
    }
    lastSpace = nextSpace;
    nextSpace = s.IndexOf(' ', lastSpace + 1);
}
sb.Append(s.Substring(lastSpace)); // append the final word after the last space
MessageBox.Show(sb.ToString());

Decrypting text using frequency analysis in C#

I've been tasked with decrypting a text file using frequency analysis. This isn't a do-it-for-me question, but I have absolutely no idea what to do next. What I have so far reads in the text from the file and counts the frequency of each letter. If someone could point me in the right direction as to swapping letters depending on their frequency, it would be much appreciated.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace freqanaly
{
    class Program
    {
        static void Main()
        {
            string text = File.ReadAllText("c:\\task_2.txt");
            char[,] message = new char[2, 26];
            Console.Write(text);
            int count = 0;
            for (int x = 'A'; x <= 'Z'; x++)
            {
                message[0, count] = (char)x;
                Console.WriteLine(message[0, count]);
                count++;
            }
            foreach (char c in text)
            {
                count = 0;
                for (int x = 'A'; x <= 'Z'; x++)
                {
                    if (c == x)
                    {
                        message[1, count]++;
                    }
                    count++;
                }
            }
            Console.ReadKey();
            for (int x = 0; x <= 25; x++)
            {
                Console.Write(message[0, x]);
                Console.Write(" = ");
                Console.WriteLine((int)message[1, x]);
            }
            Console.ReadKey();
        }
    }
}
This IS encrypted data, just using a simple substitution cipher (I assume). See the definition of encoding/encrypting.
http://www.perlmonks.org/index.pl?node_id=66249
Regardless, as Sergey suggested, get a letter frequency table and match frequencies. You will have to take some deviation into account, since there is no guarantee that exactly 8.167% of the document is 'A's (in this document the percentage of 'A's might be 8.78% or 7.65%). Also, be sure to count every occurrence of A, and not differentiate 'a' from 'A'. This can be handled with a simple ToUpper or ToLower transform on the character; just be consistent.
Also, when you start getting into less common, but still popular letters, you will need to handle that. C, F, G, W, and M are all around the 2% +/- mark, so you will need to play with the decrypted text until the letters fit in the word, and in other words within the document where this character substitution will also happen. This concept is similar to fitting numbers in a Sudoku matrix. Luckily, once you find where a letter should go, it cascades throughout the document and you can start to see the decrypted plain text emerge. As an example, '(F)it' and '(W)it' are both valid words, but if you see '(F)hen' in the document when you substitute an 'F', you can make a good guess that you should substitute this character with a 'W' instead. (T)here and (W)here is another example, and a word ()hen won't provide any guidance by itself, since both (W)hen and (T)hen are valid words. It is here you have to incorporate contextual clues as to which word makes sense. "Then is a good time to start our attack?" doesn't make as much sense as "When is a good time to start our attack?".
All of this assumes you are using a monoalphabetic substitution. A polyalphabetic substitution is more difficult, and you may need to look into examples of cracking the Vigenère cipher to figure out a way around this problem.
I suggest reading "The Code Book" by S. Singh. It is a very interesting read and easy to digest the historical ciphers used and how they were cracked.
http://www.google.com/products/catalog?q=the+code+book&rls=com.microsoft:en-us:IE-SearchBox&oe=&um=1&ie=UTF-8&tbm=shop&cid=5361323398438876518&sa=X&ei=hpR0T-HyObSK2QWvgvH-Dg&ved=0CFoQ8wIwBQ#
Next you should grab one of the publicly available English letter-frequency lists (from Wikipedia, for example) and compare it with the actual frequency table you got - in order to find the replacements for letters.
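A minimal sketch of that comparison step (the "ETAOIN..." ordering is the standard English frequency ranking; building a first-guess map purely from rank order is an assumption about your approach, and as discussed above you will still need to hand-tune the result):

```csharp
using System;
using System.Linq;

static class FrequencyMap
{
    // English letters ordered from most to least frequent.
    const string EnglishByFrequency = "ETAOINSHRDLCUMWFGYPBVKJXQZ";

    // Builds a first-guess substitution map: the most common cipher
    // letter maps to 'E', the next most common to 'T', and so on.
    public static char[] BuildGuess(string cipherText)
    {
        var counts = new int[26];
        foreach (char c in cipherText.ToUpper())
            if (c >= 'A' && c <= 'Z') counts[c - 'A']++;
        // Rank cipher letters by descending frequency (stable for ties).
        int[] ranked = Enumerable.Range(0, 26)
                                 .OrderByDescending(i => counts[i])
                                 .ToArray();
        var map = new char[26];
        for (int rank = 0; rank < 26; rank++)
            map[ranked[rank]] = EnglishByFrequency[rank];
        return map; // map[c - 'A'] is the guessed plaintext for cipher letter c
    }
}
```

From here, decryption is a lookup per letter, and swapping two entries of the map is how you apply the manual corrections described above.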
