Where can I find an efficient algorithm for breaking lines of text for formatted display?
One approach to this very problem is addressed in the book Introduction to Algorithms (Cormen, Leiserson, Rivest, Stein) as problem 15-2.
It takes the approach that a nicely broken block of text leaves the spacing at the end of each line as even as possible, penalizing large differences.
This problem is solvable using dynamic programming.
Naturally this is only one approach to the problem, but in my opinion it at least looks better than the greedy algorithm.
I'm not much for putting my solutions to textbook problems on the Internet, so I'll leave it to you to either solve it or Google for a solution, in order to get the exact algorithm needed.
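That said, here is the general shape of such a dynamic program in C#, as a generic minimum-raggedness wrap using a common cube-of-slack cost. This is a sketch of the technique, not the book's exact problem or its solution:

    using System.Collections.Generic;

    static class LineWrapper
    {
        // Minimum-raggedness wrap: penalize every line except the last by the
        // cube of its trailing slack. Assumes no single word exceeds the width.
        public static List<string> Wrap(string[] words, int width)
        {
            int n = words.Length;
            var best = new long[n + 1];   // best[i] = min cost of wrapping words[i..]
            var next = new int[n + 1];    // next[i] = index of first word of the following line
            best[n] = 0;

            for (int i = n - 1; i >= 0; i--)
            {
                best[i] = long.MaxValue;
                int len = 0;
                for (int j = i; j < n; j++)
                {
                    len += (j > i ? 1 : 0) + words[j].Length;  // words plus single spaces
                    if (len > width) break;
                    if (best[j + 1] == long.MaxValue) continue;
                    long slack = width - len;
                    long cost = (j == n - 1 ? 0 : slack * slack * slack) + best[j + 1];
                    if (cost < best[i]) { best[i] = cost; next[i] = j + 1; }
                }
            }

            var lines = new List<string>();
            for (int i = 0; i < n; i = next[i])
                lines.Add(string.Join(" ", words, i, next[i] - i));
            return lines;
        }
    }

The key idea is that best[i] only depends on best of later indices, so a single backward pass suffices, giving O(n * width) time.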
A little personal project of mine is to blindly build a search engine from scratch, without using any outside sources. It's mostly for the learning experience, and I hadn't had much trouble until now; at this point I have both a dilemma and a tough problem.
Observe this case:
Suzy wants to search for "fuzzy bears". This is fine, and it works as well as it can. However, Suzy screws up and types "fuzzybears". Now my search algorithm breaks down, since this is interpreted as a single token rather than multiple tokens. Any case or combination of words with even one occurrence of such a run-on term, or glued tokens, produces a poor search result.
For scope, this is something I am writing using a combination of C# and T-SQL.
I've tried multiple solutions, but nothing has really come of them. First, I used a List to take the terms and create variations, but this was much too slow for my liking and required far more memory than I feel it should.
I wanted to save search queries to a database for statistics, and maybe to learn more about organically growing the algorithm, so perhaps a way to handle these glued tokens in SQL could work; but I have no clue how to even start with something like that without resorting to a cursor or some other slow construct.
I could take searches, save them to my database, create different combinations where some tokens are glued, and then use those glued tokens as terms to hit on. The issue with this approach is that it takes up quite a bit of space, and I won't always need these strings, since spelling errors like this aren't all that common.
Mainly, what I need is speed. It doesn't really have to be pretty, but if it's fast and accurate then I'm happy even if it takes up a lot of disk space.
I'm not asking for complete solutions here, but if anyone can point me in a direction I can go, it would be greatly appreciated.
Consider this approach: since spaces, punctuation, and anything similar would screw up a search like this, remove all of those, convert to a common case (I prefer lowercase, but pick what you prefer), and then tokenize based on syllables, using roughly the same set of division rules as for hyphenating English words.
So, to search for answers that contain "Consider this approach:", you reduce the phrase to "considerthisapproach" and then tokenize as "con","sid","er","this","ap","proach". If con and sid and er appear next to each other, and in that order, you've found the word "consider".
This approach can be adapted for statistical matching too, so e.g. if at least 85% of syllables are found in the correct order, you consider it a close match, and maybe order the results by match % so more meaningful matches are at the top.
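Here's a rough sketch of that idea in C#. NaiveSyllables below is just a toy vowel-group splitter standing in for real hyphenation rules, which you'd want to swap in:

    using System.Collections.Generic;
    using System.Linq;

    static class SyllableSearch
    {
        // Strip everything but letters and digits, then lowercase.
        public static string Normalize(string s) =>
            new string(s.Where(char.IsLetterOrDigit).ToArray()).ToLowerInvariant();

        // Toy syllable splitter: break after each vowel group. A real
        // implementation would apply English hyphenation rules instead.
        public static List<string> NaiveSyllables(string s)
        {
            var result = new List<string>();
            int start = 0;
            for (int i = 0; i < s.Length - 1; i++)
            {
                bool vowel = "aeiouy".IndexOf(s[i]) >= 0;
                bool nextVowel = "aeiouy".IndexOf(s[i + 1]) >= 0;
                if (vowel && !nextVowel)   // end of a vowel group
                {
                    result.Add(s.Substring(start, i - start + 1));
                    start = i + 1;
                }
            }
            result.Add(s.Substring(start));
            return result;
        }

        // Fraction of query syllables found, in order, within the document.
        public static double MatchPercent(string query, string document)
        {
            var q = NaiveSyllables(Normalize(query));
            var d = NaiveSyllables(Normalize(document));
            int qi = 0;
            foreach (var syl in d)
                if (qi < q.Count && syl == q[qi]) qi++;
            return q.Count == 0 ? 0 : (double)qi / q.Count;
        }
    }

Because both the query and the document are normalized the same way, a glued query like "fuzzybears" reduces to the same syllable stream as "fuzzy bears", so the missing space stops mattering.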
I am working on a little computational geometry library that uses Mathematica's NETLink to allow polytopes to be modeled in C# and viewed and controlled via Mathematica. I hope to allow easy and exact manipulation of geometry, with a focus on geometry unfolding problems.
Currently I am looking to implement an exact shortest-path algorithm on a convex polyhedron. It's been suggested to me that I use Chen and Han's algorithm, and specifically that I look at O'Rourke's implementation. However, this is a pretty big task. Given that I'm starting with quick-and-dirty techniques for the rest of the functions, I'm looking for something simpler, even if it has significantly worse performance.
There is an algorithm by Sharir and Schorr that finds the shortest path in O(n^3) time (with n, I assume, being the number of vertices), but I can't seem to find the paper anywhere. I'm wondering whether this algorithm is indeed simpler, whether any implementations of it already exist, and whether anyone has some general advice.
I'm using the backtracking algorithm described in this youtube video.
Now, I should be able to get ALL possible solutions. Can I do this with the backtracking algorithm, and if so, how? If not, which other (simple) algorithm should I use?
This question is not a great fit for this site since it does not appear to be about actual code.
But I'll take a shot at it anyways.
Of course you can get all possible solutions with a backtracking algorithm. Remember how a backtracking algorithm works:
while (there are still guesses available)
    make a guess
    solve the puzzle with the guess
    if there was a solution then record the solution and quit the loop
    cross the guess off the list of possible guesses
if you recorded a solution then the puzzle is solvable
If you want all solutions then just modify the algorithm to:
while (there are still guesses available)
    make a guess
    solve the puzzle with the guess
    if there was a solution then record the solution. Don't quit.
    cross the guess off the list of possible guesses
if you recorded any solution then the puzzle is solvable
Incidentally, I wrote a series of blog articles on solving sudoku in C# using a graph-colouring backtracking algorithm; it might be of interest to you:
https://learn.microsoft.com/en-us/archive/blogs/ericlippert/graph-colouring-with-simple-backtracking-part-one
In this code you'll see the line:
return solutions.FirstOrDefault();
"solutions" contains a query that enumerates all solutions. I only want the first such solution, so that's what I ask it for. If you want every solution, just rewrite the program so that it does not call FirstOrDefault. See the comments below for some notes.
Greetings each and all!
I'm currently looking into procedural generation of a road network and stumbled upon the L-system algorithm. From what I understand from various scientific papers on the subject, and from further papers building on those, the algorithm is changed to use "global goals and local constraints", in which the generated path is modified to fit input values such as terrain and population density. Now that part I understand, or at least the overall concept, but how am I supposed to modify the algorithm?
Right now I have a string which is modified over time steps according to a set of rules. I then analyze the string, moving and turning as I go through the chars, render the result, and get beautiful patterns on screen.
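In simplified form, the rewriting core looks roughly like this:

    using System.Collections.Generic;
    using System.Text;

    static class LSystem
    {
        // One rewriting pass: replace every char that has a rule,
        // keep everything else (including "[" and "]") as-is.
        public static string Step(string axiom, Dictionary<char, string> rules)
        {
            var sb = new StringBuilder();
            foreach (char c in axiom)
                sb.Append(rules.TryGetValue(c, out string r) ? r : c.ToString());
            return sb.ToString();
        }

        public static string Grow(string axiom, Dictionary<char, string> rules, int steps)
        {
            for (int i = 0; i < steps; i++)
                axiom = Step(axiom, rules);
            return axiom;   // then interpret F as "move", +/- as "turn", [ ] as push/pop
        }
    }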
Now, to create a network of major roads, should I still use a base axiom with a ruleset and then apply the constraints? And if so, what would a good set of start values and rules be?
Or should I rather replace the basic ruleset with the constraints and global goals? And if so, what remains of the original L-system algorithm?
Any help is greatly appreciated, and for the record I'm doing this in C# and XNA, although I reckon this is more on a theoretical plane.
Thanks for your time,
Karl
So, I've been googling, reading and understanding more the last week and I've found a solution to the problem which I thought I might share.
I found this brilliant blog post which basically straightened everything out for me:
http://www.newton64.ca/blog/?p=747#7472
That post is based on another blog post, found here:
http://mollyrocket.com/forums/viewtopic.php?t=730&sid=a9a2628b059a727cbde67309757ed178
Now, as far as the L-system goes, I'm not quite sure whether this approach really is an L-system anymore. Sure, there are similarities in the iterative process of building the network: in L-systems the string grows over iterations and branches are created using "[" and "]" (at least in the cases I've seen), and in the approach I'm taking now a while loop and a priority queue do pretty much the same thing.
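In (very) simplified form, the loop looks something like the sketch below. LocalConstraints and GlobalGoals are placeholders for the application-specific parts, and PriorityQueue is the .NET 6+ class (on an XNA-era framework you'd roll your own):

    using System.Collections.Generic;

    class RoadSegment
    {
        public float X, Y, Angle;
        public int Delay;   // lower = expanded sooner
    }

    static class RoadNetwork
    {
        public static List<RoadSegment> Generate(RoadSegment seed, int maxSegments)
        {
            var accepted = new List<RoadSegment>();
            var queue = new PriorityQueue<RoadSegment, int>();
            queue.Enqueue(seed, seed.Delay);

            while (queue.Count > 0 && accepted.Count < maxSegments)
            {
                var segment = queue.Dequeue();

                // Local constraints: snap to nearby intersections, clip against
                // water or steep terrain, reject the segment if nothing remains.
                if (!LocalConstraints(segment)) continue;

                accepted.Add(segment);

                // Global goals: propose successors (continue straight, branch
                // left/right) steered by population density, terrain, etc.
                foreach (var next in GlobalGoals(segment))
                    queue.Enqueue(next, next.Delay);
            }
            return accepted;
        }

        // Placeholders: this is where all the interesting work happens.
        static bool LocalConstraints(RoadSegment s) => true;
        static IEnumerable<RoadSegment> GlobalGoals(RoadSegment s) { yield break; }
    }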
I would also like to point out that I did not fully understand the papers "describing" how to use an L-system for generating a road network, so my reasoning might be a bit off here. But algorithm naming and boundaries aside, I found a solution that works for me, which is good for now.
Happy coding!
Karl
I'm the author of the above blog post -- glad you found it useful! I never did get around to finishing Spare Parts -- and if nothing else, I'd have to change the name -- but you've got me thinking about it again.
Certainly, the algorithm I described probably isn't much of an L-system anymore; importantly, though, I think it's pretty much functionally equivalent. I'm a positivist when it comes to programming, so if it works, compile it!
EDIT: I've since taken down my old website, but I've recreated the post here. Hope it's still helpful!
Before you all start answering with variations of \d+: I'm extremely familiar with regex.
I want to know if there are alternatives to regex for parsing numbers out of a large text file.
I'm parsing through tons of huge files and need to do some group/location analysis on the positions of keywords. I'm now at the point where I need to start finding groups of numbers nested close to my content of interest as well. I want to avoid regex if at all possible, because this needs to be a speedy process.
It is possible to take chunks of a file and inspect them for the numbers of interest. That, however, would require more work and add hard-coded limits to the search (I'd like to avoid this).
I'm open to any suggestions.
UPDATE
Sorry for the lack of sample data. For HIPAA reasons I'd rather not even consider scrambling the text and posting it.
A great substitute would be the HTML source of any stackoverflow.com question page. Imagine I needed to grab the reputation (score) of all people that posted an answer to a question. This also means that the comma (,) is needed as well. I can't remove the html to simplify the content because I'm using some density analysis to weed out unrelated content. Removing the HTML would mix content too close together.
Unless the file is some sort of SGML, I don't know of an existing method (which is not to say there isn't one; I just don't know of it).
However, that's not to say you can't create your own parser; you could eliminate some of the overhead of the .NET regex library by writing something that only finds runs of numbers.
Fundamentally, I'd guess that's all any library would do at the most basic level.
Might help if you can post a sample of the sort of data you'll be processing?
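In the meantime, here's a minimal sketch of the kind of scanner I mean: a single forward pass that pulls out runs of digits (allowing embedded commas, per the reputation example) along with their positions:

    using System.Collections.Generic;

    struct NumberHit
    {
        public int Position;   // index of the first digit
        public string Text;    // e.g. "12,345"
    }

    static class NumberScanner
    {
        public static List<NumberHit> Scan(string text)
        {
            var hits = new List<NumberHit>();
            int i = 0;
            while (i < text.Length)
            {
                if (char.IsDigit(text[i]))
                {
                    int start = i;
                    while (i < text.Length &&
                           (char.IsDigit(text[i]) ||
                            // keep a comma only if a digit follows it
                            (text[i] == ',' && i + 1 < text.Length && char.IsDigit(text[i + 1]))))
                        i++;
                    hits.Add(new NumberHit { Position = start, Text = text.Substring(start, i - start) });
                }
                else i++;
            }
            return hits;
        }
    }

Since it's a single forward pass with no backtracking, it should comfortably beat a general regex engine for this one narrow job.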