500 error when querying yahoo placefinder with a particular character? - c#

I am using the Yahoo Placefinder service to find some latitude/longitude positions for a list of addresses I have in a csv file.
I am using the following code:
String reqURL = "http://where.yahooapis.com/geocode?location=" + HttpUtility.UrlEncode(location) + "&appid=KGe6P34c";
XmlDocument xml = new XmlDocument();
xml.Load(reqURL);
XPathNavigator nav = xml.CreateNavigator();
// process xml here...
I just found a very stubborn error, that I thought (incorrectly) for several days was due to Yahoo forbidding further requests from me.
It is for this URL:
http://where.yahooapis.com/geocode?location=31+Front+Street%2c+Sedgefield%2c+Stockton%06on-Tees%2c+England%2c+TS21+3AT&appid=KGe6P34c
My browser complains about a parsing error for that url. My c# program says it has a 500 error.
The location string here comes from this address:
Agape Business Consortium Ltd.,michael.cutbill#agapesolutions.co.uk,Michael A Cutbill,Director,,,9 Jenner Drive,Victoria Gardens,,Stockton-on-Tee,,TS19 8RE,,England,85111,Hospitals,www.agapesolutions.co.uk
I think the error comes from the first hyphen in Stockton-on-Tee, but I can't explain why. If I replace this hyphen with a 'normal' hyphen, the query goes through successfully.
Is this error due to a fault on my end (is the HttpUtility.UrlEncode function incorrect?) or a fault on Yahoo's end?
Even though I can see what is causing this problem, I don't understand why. Could someone explain?
EDIT:
Further investigation on my part indicates that the character this hyphen is being encoded to, "%06", is the ASCII control character "Acknowledge" (ACK). I have no idea why this character would turn up here. It seems that different places render Stockton-on-Tee in different ways: it appears normal when opened in a text editor, but by the time it appears in Visual Studio, before being encoded, it is Stocktonon-Tees. Note that when I copied the previous text into this text box in Firefox, the hyphen rendered as a weird square box character, but on this subsequent edit the SO software appears to have sanitized the character.
I include below the function & holder class I am using to parse the csv file - as you can see, I am doing nothing strange that might introduce unexpected characters. The dangerous character appears in the "Town" field.
public List<PaidBusiness> parseCSV(string path)
{
List<PaidBusiness> parsedBusiness = new List<PaidBusiness>();
List<string> parsedBusinessNames = new List<string>();
try
{
using (StreamReader readFile = new StreamReader(path))
{
string line;
string[] row;
bool first = true;
while ((line = readFile.ReadLine()) != null)
{
if (first)
first = false;
else
{
row = line.Split(',');
PaidBusiness business = new PaidBusiness(row);
if (!business.bad) // no problems with the formatting of the business (no missing fields, etc)
{
if (!parsedBusinessNames.Contains(business.CompanyName))
{
parsedBusinessNames.Add(business.CompanyName);
parsedBusiness.Add(business);
}
}
}
}
}
}
catch (Exception e)
{ }
return parsedBusiness;
}
public class PaidBusiness
{
public String CompanyName, EmailAddress, ContactFullName, Address, Address2, Address3, Town, County, Postcode, Region, Country, BusinessCategory, WebAddress;
public String latitude, longitude;
public bool bad;
public static int noCategoryCount = 0;
public static int badCount = 0;
public PaidBusiness(String[] parts)
{
bad = false;
for (int i = 0; i < parts.Length; i++)
{
parts[i] = parts[i].Replace("pithawala", ",");
parts[i] = parts[i].Replace("''", "'");
}
CompanyName = parts[0].Trim();
EmailAddress = parts[1].Trim();
ContactFullName = parts[2].Trim();
Address = parts[6].Trim();
Address2 = parts[7].Trim();
Address3 = parts[8].Trim();
Town = parts[9].Trim();
County = parts[10].Trim();
Postcode = parts[11].Trim();
Region = parts[12].Trim();
Country = parts[13].Trim();
BusinessCategory = parts[15].Trim();
WebAddress = parts[16].Trim();
// data testing
if (CompanyName == "")
bad = true;
if (EmailAddress == "")
bad = true;
if (Postcode == "")
bad = true;
if (Country == "")
bad = true;
if (BusinessCategory == "")
bad = true;
if (Address.ToLower().StartsWith("po box"))
bad = true;
// its ok if there is no contact name.
if (ContactFullName == "")
ContactFullName = CompanyName;
//problem if there is no business category.
if (BusinessCategory == "")
noCategoryCount++;
if (bad)
badCount++;
}
}

Welcome to real world data. It's likely that the problem is in the CSV file. To verify, read the line and inspect each character:
foreach (char c in line)
{
Console.WriteLine("{0}, {1}", c, (int)c);
}
A "normal" hyphen will give you a value of 45.
The other possibility is that you're reading the file with the wrong encoding. StreamReader assumes UTF-8 when no encoding is given, so if the file was saved in a different encoding (a Windows code page, for example), pass that encoding explicitly. The overload takes the encoding as a second argument, shown here with UTF-8:
using (StreamReader readFile = new StreamReader(path, Encoding.UTF8))
Do that, and then output each character on the line again (as above), and see what character you get for the hyphen.
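If the inspection turns up a control character such as code 6 (the "%06" seen in the failing URL), one option is to clean each field before it is passed to HttpUtility.UrlEncode. The following is only a minimal sketch: the AddressCleaner class and ReplaceControlChars method are made-up names, and replacing the stray character with a plain hyphen is an assumption that happens to suit this address.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original code.
public static class AddressCleaner
{
    // Replaces every control character (for example ACK, code 6) with
    // the given replacement so the value is safe to URL-encode.
    // Note: tabs and line breaks count as control characters too.
    public static string ReplaceControlChars(string input, char replacement = '-')
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            sb.Append(char.IsControl(c) ? replacement : c);
        }
        return sb.ToString();
    }
}
```

For example, AddressCleaner.ReplaceControlChars("Stockton\u0006on-Tees") returns "Stockton-on-Tees". Fixing the source file or its encoding is still the better long-term cure, since a blanket replacement can hide other data problems.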

C# ConsoleRead() input not saved in variable

I am working with C# for the first time and I am facing a strange issue.
I am building my own class for a plugin, but copied parts of the code from an existing class. Basically it's
var sanInput = Console.ReadLine();
alternativeNames = sanInput.Split(',');
sanList = new List<string>(alternativeNames);
but somehow this doesn't work. The debug console says System.Console.ReadLine returned jogi,philipp string, but sanInput stays null.
Even stranger, the next step works "a bit": string.Split returned {string[2]} string[], so it returns an array of [jogi, philipp], but sanInput, alternativeNames and sanList all stay null.
How is it possible that the second line works if sanInput has no value, and how can I fix this problem? When I use the existing class with the same code, everything works as expected.
//EDIT:
Looks like a quite complicated issue. Here is the complete method:
public override void HandleMenuResponse(string response, List<Target> targets)
{
if (response == "r")
{
Console.WriteLine("Which hosts do you want to configure? Enter numbers separated by a comma.");
var hostsInput = Console.ReadLine();
int[] hosts = null;
string[] alternativeNames = null;
List<string> sanList = null;
hosts = hostsInput.Split(',').Select(int.Parse).ToArray();
Console.Write("Generating certificates for ");
foreach (int entry in hosts)
{
Console.Write(targets[entry - 1].Host + ", ");
}
Console.Write("\n \n");
foreach (int entry in hosts)
{
int entry2 = entry - 1;
if (Program.Options.San)
{
Console.WriteLine("Enter all Alternative Names for " + targets[entry2].Host + " seperated by a comma:");
// Copied from http://stackoverflow.com/a/16638000
int BufferSize = 16384;
Stream inputStream = Console.OpenStandardInput(BufferSize);
Console.SetIn(new StreamReader(inputStream, Console.InputEncoding, false, BufferSize));
var sanInput = Console.ReadLine();
alternativeNames = sanInput.Split(',');
sanList = new List<string>(alternativeNames);
targets[entry2].AlternativeNames.AddRange(sanList);
}
Auto(targets[entry - 1]);
}
}
if (response == "e")
{
string[] alternativeNames = null;
List<string> sanList = new List<string>();
if (Program.Options.San)
{
Console.WriteLine("Enter all Alternative Names seperated by a comma:");
// Copied from http://stackoverflow.com/a/16638000
int BufferSize = 16384;
Stream inputStream = Console.OpenStandardInput(BufferSize);
Console.SetIn(new StreamReader(inputStream, Console.InputEncoding, false, BufferSize));
var sanInput = Console.ReadLine();
alternativeNames = sanInput.Split(',');
}
if (alternativeNames != null)
{
sanList = new List<string>(alternativeNames);
}
foreach (var entry in targets)
{
Auto(entry);
}
}
}
I know the code isn't pretty or efficient. All in all, it lets the user decide whether to use all detected hosts (response e) or only selected ones (response r). But the problem occurs only in the second if block. If I switch them, it's again the latter one. So maybe the reason lies in the main program or in this BufferSize stuff? I don't know.
//EDIT 2: I think I found the problem: somehow the integer BufferSize (shortly before the Console.ReadLine()) is set to 0, so of course without any buffer it can't read the input. So the question remains: why?
//EDIT 3: Okay, I'm done. It looks like I can't use the same names for the variables although they are in two different if blocks. I just named them sanInput2, alternativeNames2 etc. and now everything works.
Try this; all of the variables get values (you can also use var for all of them):
var sanInput = Console.ReadLine();
string[] alternativeNames = sanInput.Split(',');
List<string> sanList = new List<string>(alternativeNames);
What you are seeing is that the Visual Studio debugger performs an assignment in two steps. First it executes Console.ReadLine() (which is why you see the "Console.ReadLine returned" message), and only AFTER that is the result assigned to sanInput. The same happens with Split: the function has been called, but its result has not been assigned yet.
My recommendation: use Step Over rather than Step Into. In time you will get used to this behavior and appreciate it.

C# parsing part of a string

We have an application that prints out log lines. Within the log lines we also print out the full SyncML payload in XML. I need to parse out just the SyncML payloads: the actual XML, with everything else stripped out.
Log line looks like this.
`2016-01-06T15:13:45.188-0500 [DEBUG] {} Logger
[{{Correlation,(longID)}{Uri,POST (post
URL)}{host,(HOST)}{userID,(userID)}}] - request class SyncML: <?xml
version="1.0" encoding="UTF-8" standalone="yes"?></ns3:SyncML>`
My regex for the request class is as follows.
Regex request = new Regex(@"request class SyncML");
String line;
while ((line = sr.ReadLine()) != null)
{
if(req.Success)
{
Match req = request.Match(line);
string s = line.Substring(line.IndexOf("<?xml "));
}
}
After the request.Match(line) call, VS shows the full line, so I know the match truly succeeds.
However, when I call line.Substring(line.IndexOf(...)) I get a System.ArgumentOutOfRangeException. When I print the IndexOf result, it's -1.
Perhaps I am using this wrong. I guess my question is what do I need to do to just strip out everything before
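If the "<?xml" string begins on the next line, use this:
Regex request = new Regex(@"request class winmo.SyncML");
string line;
while ((line = sr.ReadLine()) != null)
{
// Match first, then test the result
Match req = request.Match(line);
if (req.Success)
{
// The XML starts on the line after the marker
string xmlLine = sr.ReadLine();
if (xmlLine == null) break;
// IndexOf returns -1 when the text is missing, so check before calling Substring
int start = xmlLine.IndexOf("<?xml ");
if (start >= 0)
{
string s = xmlLine.Substring(start);
}
}
}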
Or, you can improve your Regex for the newly edited example:
Regex request = new Regex(@"^.+request class winmo.SyncML[^\<]+(\<\?xml [^`]+)`");
string line;
while ((line = sr.ReadLine()) != null)
{
Match req = request.Match(line);
if (req.Success)
{
// Group 1 holds the XML payload
string s = req.Groups[1].Value;
}
}
Additionally, you can search more than one line at a time with the improved Regex:
Regex request = new Regex(@"^.+request class winmo.SyncML[^\<]+(\<\?xml [^`]+)");
var lines = new List<String>(5);
string line;
while ((line = sr.ReadLine()) != null)
{
//NOTE: You'll need to make sure this window holds enough of your log file to get what you want
lines.Add(line);
while (lines.Count > 4)
lines.RemoveAt(0);
Match req = request.Match(string.Join("\r\n", lines));
if (req.Success)
{
// Group 1 holds the XML payload
string s = req.Groups[1].Value;
}
}
Maybe you want something like this:
String line;
while ((line = sr.ReadLine()) != null)
{
if(line.Contains("<?xml "))
{
string s = line.Substring(line.IndexOf("<?xml "));
// do something useful with s
}
}
Your Regex looks wrong; it should be Regex request = new Regex(@"request class SyncML");
Try using "<?xml" instead of "<?xml "; I don't see that space after xml.
This question has been edited. So, if the string is spread over several lines, you should do:
while ((line = sr.ReadLine()) != null) {
Match req = request.Match(line);
if (req.Success) {
if (line.Contains("<?xml")) {
string s = line.Substring(line.IndexOf("<?xml"));
}
}
}
If you have the entire log as one long string, you can use Substring(x) with IndexOf(string) to strip everything before the part you are interested in. I'm assuming from your last line that everything after the initial log info is part of the wanted XML.
string sFullLog = ReadFullLogAsASingleString();//Could be taxing in large logs
string sXML = sFullLog.Substring(sFullLog.IndexOf("<?xml"));
I see that the provided sample is a single log entry, and that the log entry contains the XML of interest.
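When the payload wraps over several lines, as it does in the sample entry, another way to pull it out is a single regular expression over the whole entry. This is a rough sketch rather than a tested solution for your logs: the SyncMlExtractor name is invented here, and it assumes every payload ends with the literal closing tag </ns3:SyncML> shown in the sample (the namespace prefix may differ in real data).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class SyncMlExtractor
{
    // Matches from the XML declaration up to and including the closing
    // SyncML tag. RegexOptions.Singleline lets '.' span line breaks.
    // Assumption: the closing tag is literally </ns3:SyncML>.
    private static readonly Regex Payload = new Regex(
        @"<\?xml.*?</ns3:SyncML>", RegexOptions.Singleline);

    // Returns the first payload found in the entry, or null if there is none.
    public static string Extract(string logEntry)
    {
        if (logEntry == null) return null;
        Match m = Payload.Match(logEntry);
        return m.Success ? m.Value : null;
    }
}
```

Returning null instead of calling Substring with an index of -1 avoids the ArgumentOutOfRangeException from the question; the caller simply skips entries without a payload.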

How to read and create new user by every 4th line

So, I read a text file. It looks like this:
TEACHER - TEACHER/STUDENT
adamsmith - ID
Adam Smith - Name
B1u2d3a4 - Password
STUDENT
marywilson
Mary Wilson
s1Zeged
TEACHER
sz12gee3
George Johnson
George1234
STUDENT
sophieb
Sophie Black
SophieB12
And so on, there are all the users.
The user class:
class User
{
private string myID;
private string myName;
private string myPW;
private bool isTeacher;
public string ID
{
get
{
return myID;
}
set
{
myID = value;
}
}
public string Name
{
get
{
return myName;
}
set
{
myName = value;
}
}
public string PW
{
get
{
return myPW;
}
set
{
PW = value;
}
}
public bool teacher
{
get
{
return teacher;
}
set
{
isTeacher = value;
}
}
public override string ToString()
{
return myName;
}
}
The Form1_Load method:
private void Form1_Load(object sender, EventArgs e)
{
List<User> users = new List<User>();
string line;
using (StreamReader sr = new StreamReader("danet.txt"))
{
while ((line=sr.ReadLine())!=null)
{
User user = new User();
user.ID=line;
user.Name=sr.ReadLine();
user.PW=sr.ReadLine();
if(sr.ReadLine=="TEACHER")
{
teacher=true;
}
else
{
teacher=false;
}
users.Add(user);
}
}
}
I want to read the text and store the information. With this method I get four times as many users as I should. I was thinking of using a for loop and a few other things, but I didn't arrive at a solution.
New answer
Your reader assumes that every fourth line, starting from the first, is the user ID. It is not: the very first line is a STUDENT/TEACHER line. Either this is a typo, or you have to change your format.
Your PW property will cause a StackOverflowException:
public string PW
{
get
{
return myPW;
}
set
{
PW = value;
}
}
Change the setter to myPW = value;, or just convert the properties to auto-properties.
Your teacher property has the same error, but in the getter.
You have also missed the () on one of your ReadLine calls, but let's just assume this is a typo.
I'm not using a text file here, just a string, so I use a StringReader instead, but the concept is the same.
string stuff =
@"adamsmith
Adam Smith
B1u2d3a4
STUDENT
marywilson
Mary Wilson
s1Zeged
TEACHER
sz12gee3
George Johnson
George1234
STUDENT
sophieb
Sophie Black
SophieB12
STUDENT";
public void Main(string[] args)
{
string line;
var users = new List<User>();
using (var sr = new StringReader(stuff))
{
while ((line = sr.ReadLine()) != null)
{
User user = new User();
user.ID = line;
user.Name = sr.ReadLine();
user.PW = sr.ReadLine();
user.teacher = sr.ReadLine() == "TEACHER";
users.Add(user);
}
}
}
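If the file really does start each record with the TEACHER/STUDENT line, as the listing in the question suggests, the read order can be flipped instead of changing the file. Here is a small sketch of that idea with a basic sanity check on the role line. UserRecord and UserFileParser are illustrative names, and the sketch assumes the " - ID" style notes in the question's listing are explanations rather than part of the real file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the User class, using auto-properties.
public class UserRecord
{
    public string ID { get; set; }
    public string Name { get; set; }
    public string PW { get; set; }
    public bool IsTeacher { get; set; }
}

public static class UserFileParser
{
    // Reads records of four lines each: role, ID, name, password.
    public static List<UserRecord> Parse(TextReader reader)
    {
        var users = new List<UserRecord>();
        string role;
        while ((role = reader.ReadLine()) != null)
        {
            role = role.Trim();
            if (role.Length == 0) continue; // tolerate blank lines between records

            // Fail early if the records have drifted out of step.
            if (role != "TEACHER" && role != "STUDENT")
                throw new FormatException("Expected TEACHER or STUDENT but found: " + role);

            string id = reader.ReadLine();
            string name = reader.ReadLine();
            string pw = reader.ReadLine();
            if (id == null || name == null || pw == null)
                throw new FormatException("Incomplete record after role line: " + role);

            users.Add(new UserRecord
            {
                ID = id,
                Name = name,
                PW = pw,
                IsTeacher = role == "TEACHER"
            });
        }
        return users;
    }
}
```

With a file this could be called as UserFileParser.Parse(new StreamReader("danet.txt")), ideally inside a using block. Throwing on an unexpected role line means a missing or extra line is reported immediately instead of silently shifting every later record.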
Old answer
There is nothing inherently erroneous with your code. But since you have not provided an actual example of what your "danet.txt" looks like, one must assume the error lies within the data itself.
Your "parser" (if you want to call it that) is not forgiving, i.e. if there is an empty line in your source file or if you just mess up one line (say forget putting in a password or ID) then everything would get offset – but as far as your "parser" is concerned, nothing is wrong.
By default formats which depend on "line positions" or "line offset" are prone to break, especially if the file itself is created by hand versus being auto-generated.
Why not use a denoted format instead, such as XML, JSON or even just INI? C# can handle any of these, either built in or through external libraries (see the links).
There will never be any way for your "line-by-line" parser to not break if the user makes a faulty input, that is unless you have very strict formats for IDs, names, passwords and "student/teachers". and then validate them, using regular expressions (or similar). But that would defeat the purpose of a simple "line-by-line" format. And by then, you might as well go with a more "complex" format.
while ((line=sr.ReadLine())!=null)
{
User user = new User();
for (int i = 0; i < 4; i++)
{
switch (i)
{
case 1:
user.ID = line;
break;
case 2:
user.Name=sr.ReadLine();
break;
....
}
}
}

When using MergeField FieldCodes in OpenXml SDK in C# why do field codes disappear or fragment?

I have been working successfully with the C# OpenXml SDK (Unofficial Microsoft Package 2.5 from NuGet) for some time now, but have recently noticed that the following line of code returns different results depending on what mood Microsoft Word appears to be in when the file gets saved:
var fields = document.Descendants<FieldCode>();
From what I can tell, when creating the document in the first place (using Word 2013 on Windows 8.1) if you use the Insert->QuickParts->Field and choose MergeField from the Field names left hand pane, and then provide a Field name in the field properties and click OK then the field code is correctly saved in the document as I would expect.
Then when using the aforementioned line of code I will receive a field code count of 1 field. If I subsequently edit this document (and even leave this field well alone) the subsequent saving could mean that this field code no longer is returned in my query.
Another case of the same curiousness is when I see the FieldCode nodes split across multiple items. So rather than seeing say:
" MERGEFIELD Author \\* MERGEFORMAT "
As the node name, I will see:
" MERGEFIELD Aut"
"hor \\* MERGEFORMAT"
Split as two FieldCode node values. I have no idea why this would be the case, but it certainly makes my ability to match nodes that much more exciting. Is this expected behaviour? A known bug? I don't really want to have to crack open the raw xml and edit this document to work until I understand what is going on. Many thanks all.
I came across this very problem myself, and found a solution that exists within OpenXML: a utility class called MarkupSimplifier which is part of the PowerTools for Open XML project. Using this class solved all the problems I was having that you describe.
The full article is located here.
Here are some pertinent excerpts:
Perhaps the most useful simplification that this performs is to merge adjacent runs with identical formatting.
It goes on to say:
Open XML applications, including Word, can arbitrarily split runs as necessary. If you, for instance, add a comment to a document, runs will be split at the location of the start and end of the comment. After MarkupSimplifier removes comments, it can merge runs, resulting in simpler markup.
An example of the utility class in use is:
SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
RemoveComments = true,
RemoveContentControls = true,
RemoveEndAndFootNotes = true,
RemoveFieldCodes = false,
RemoveLastRenderedPageBreak = true,
RemovePermissions = true,
RemoveProof = true,
RemoveRsidInfo = true,
RemoveSmartTags = true,
RemoveSoftHyphens = true,
ReplaceTabsWithSpaces = true,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
I have used this many times with Word 2010 documents using VS2015 .Net Framework 4.5.2 and it has made my life much, much easier.
Update:
I have revisited this code and found that it cleans up runs in MERGEFIELDs, but not in IF fields that reference merge fields, e.g.
{if {MERGEFIELD When39} = "Y???" "Y" "N" }
I have no idea why this might be so, and examination of the underlying XML offers no hints.
Word will often split a text run into multiple runs for no reason I've ever understood. Before searching, comparing, tidying, etc., we preprocess the body with a method that combines adjacent runs with identical formatting into a single run.
/// <summary>
/// Combines the identical runs.
/// </summary>
/// <param name="body">The body.</param>
public static void CombineIdenticalRuns(W.Body body)
{
List<W.Run> runsToRemove = new List<W.Run>();
foreach (W.Paragraph para in body.Descendants<W.Paragraph>())
{
List<W.Run> runs = para.Elements<W.Run>().ToList();
for (int i = runs.Count - 2; i >= 0; i--)
{
W.Text text1 = runs[i].GetFirstChild<W.Text>();
W.Text text2 = runs[i + 1].GetFirstChild<W.Text>();
if (text1 != null && text2 != null)
{
string rPr1 = "";
string rPr2 = "";
if (runs[i].RunProperties != null) rPr1 = runs[i].RunProperties.OuterXml;
if (runs[i + 1].RunProperties != null) rPr2 = runs[i + 1].RunProperties.OuterXml;
if (rPr1 == rPr2)
{
text1.Text += text2.Text;
runsToRemove.Add(runs[i + 1]);
}
}
}
}
foreach (W.Run run in runsToRemove)
{
run.Remove();
}
}
I tried to simplify the document with PowerTools, but the result was a corrupted Word file. I wrote this routine to simplify only field codes that have specific names; it works in all parts of the document (main document part, headers and footers):
internal static void SimplifyFieldCodes(WordprocessingDocument document)
{
var masks = new string[] { Constants.VAR_MASK, Constants.INP_MASK, Constants.TBL_MASK, Constants.IMG_MASK, Constants.GRF_MASK };
SimplifyFieldCodesInElement(document.MainDocumentPart.RootElement, masks);
foreach (var headerPart in document.MainDocumentPart.HeaderParts)
{
SimplifyFieldCodesInElement(headerPart.Header, masks);
}
foreach (var footerPart in document.MainDocumentPart.FooterParts)
{
SimplifyFieldCodesInElement(footerPart.Footer, masks);
}
}
internal static void SimplifyFieldCodesInElement(OpenXmlElement element, string[] regexpMasks)
{
foreach (var run in element.Descendants<Run>()
.Select(item => (Run)item)
.ToList())
{
var fieldChar = run.Descendants<FieldChar>().FirstOrDefault();
if (fieldChar != null && fieldChar.FieldCharType == FieldCharValues.Begin)
{
string fieldContent = "";
List<Run> runsInFieldCode = new List<Run>();
var currentRun = run.NextSibling();
while ((currentRun is Run) && currentRun.Descendants<FieldCode>().FirstOrDefault() != null)
{
var currentRunFieldCode = currentRun.Descendants<FieldCode>().FirstOrDefault();
fieldContent += currentRunFieldCode.InnerText;
runsInFieldCode.Add((Run)currentRun);
currentRun = currentRun.NextSibling();
}
// If there is more than one Run for the FieldCode, and is one we must change, set the complete text in the first Run and remove the rest
if (runsInFieldCode.Count > 1)
{
// Check the field code to see whether it is one we must simplify (so as not to change TOC, PAGEREF, etc.)
bool applyTransform = false;
foreach (string regexpMask in regexpMasks)
{
Regex regex = new Regex(regexpMask);
Match match = regex.Match(fieldContent);
if (match.Success)
{
applyTransform = true;
break;
}
}
if (applyTransform)
{
var currentRunFieldCode = runsInFieldCode[0].Descendants<FieldCode>().FirstOrDefault();
currentRunFieldCode.Text = fieldContent;
runsInFieldCode.RemoveAt(0);
foreach (Run runToRemove in runsInFieldCode)
{
runToRemove.Remove();
}
}
}
}
}
}
Hope this helps!!!

Getting bibliographic data from text in a PDF and exporting to a window form

I use iText5 for .NET to extract text from a PDF, using the code below.
private void button1_Click(object sender, EventArgs e)
{
PdfReader reader2 = new PdfReader("Scharfetter1969.pdf");
int pagen = reader2.NumberOfPages;
reader2.Close();
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
for (int i = 1; i < 2; i++)
{
textBox1.Text = "";
PdfReader reader = new PdfReader("Scharfetter1969.pdf");
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
textBox1.Text = s;
reader.Close();
}
}
But I want to get bibliographic data from a research paper PDF.
Here is an example of the data extracted from this PDF (in EndNote format). Here's a link!
%0 Journal Article
%T Repeated temperature modulation epitaxy for p-type doping and light-emitting diode based on ZnO
%A Tsukazaki, A.
%A Ohtomo, A.
%A Onuma, T.
%A Ohtani, M.
%A Makino, T.
%A Sumiya, M.
%A Ohtani, K.
%A Chichibu, S.F.
%A Fuke, S.
%A Segawa, Y.
%J Nature Materials
%V 4
%N 1
%P 42-46
%# 1476-1122
%D 2004
%I Nature Publishing Group
But remember that this bibliographic information is not available in the metadata of this PDF. I want to access the article type (%0), title (%T), authors (%A), date (%D) and publisher (%I), and show each in its own textbox on a Windows form.
I am using C#. If anyone has code for this, or can guide me on how to do it, I would appreciate it.
PDF is a one-way format. You put data in so that it renders consistently on all devices (monitors, printers, etc) but the format was never intended to pull data back out. Any and all attempts to do that will be pure guess work. iText's PdfTextExtractor works but you are going to have to piece things together based on your own arbitrary set of rules, and these rules will probably change from PDF to PDF. The supplied PDF was created by InDesign which does such a great job of making text look good that it actually makes it even harder to parse the data back out.
That said, if your PDFs are all visually consistent, you could try to pull the data out while retaining formatting and use the formatting rules to guess what is what. That post will get you some HTML formatting that you could guess at. (If this actually works I'd recommend returning something more specific than HTML but I'll leave that up to you.)
Running it against your supplied PDF shows that the title is using the font HelveticaNeue-LightExt at about 17pts so you could write a rule to look for all lines that use that font at that size and combine them together. Authors are done in HelveticaNeue-Condensed at about 10pts so that's another rule.
The code below is a modified version of the one linked to above. It's a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It pulls out the title and authors for the supplied PDF, but you'll need to tweak it for other PDFs and metadata. See the comments in the code for specific implementation details.
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "nmat4-42.pdf"));
TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
//Buffers to hold various parts from the PDF
List<string> titles = new List<string>();
List<string> authors = new List<string>();
//Array of lines of text
string[] lines = F.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
//Temporary string
string t;
//Loop through each line in the array
foreach (string line in lines)
{
//See if the line looks like a "title"
if (line.Contains("HelveticaNeue-LightExt") && line.Contains("font-size:17.28003"))
{
//Remove the HTML tags
titles.Add(System.Text.RegularExpressions.Regex.Replace(line, "</?span.*?>", "").Trim());
}
//See if the line looks like an "author"
else if (line.Contains("HelveticaNeue-Condensed") && line.Contains("font-size:9.995972"))
{
//Remove the HTML tags and trim extra characters
t = System.Text.RegularExpressions.Regex.Replace(line, "</?span.*?>", "").Trim(new char[] { ' ', ',', '*' });
//Make sure we have a valid name, probably need some more exceptions here, too
if (!string.IsNullOrWhiteSpace(t) && t != "AND")
{
authors.Add(t);
}
}
}
//Write out the title to the console
Console.WriteLine("Title : {0}", string.Join(" ", titles.ToArray()));
//Write out each author
foreach (string author in authors)
{
Console.WriteLine("Author : {0}", author);
}
Console.WriteLine(F);
this.Close();
}
public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
//HTML buffer
private StringBuilder result = new StringBuilder();
//Store last used properties
private Vector lastBaseLine;
private string lastFont;
private float lastFontSize;
//http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
private enum TextRenderMode
{
FillText = 0,
StrokeText = 1,
FillThenStrokeText = 2,
Invisible = 3,
FillTextAndAddToPathForClipping = 4,
StrokeTextAndAddToPathForClipping = 5,
FillThenStrokeTextAndAddToPathForClipping = 6,
AddTextToPaddForClipping = 7
}
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
{
string curFont = renderInfo.GetFont().PostscriptFontName;
//Check if faux bold is used
if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
{
curFont += "-Bold";
}
//This code assumes that if the baseline changes then we're on a newline
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;
//See if something has changed, either the baseline, the font or the font size
if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
{
//if we've put down at least one span tag close it
if ((this.lastBaseLine != null))
{
this.result.AppendLine("</span>");
}
//If the baseline has changed then insert a line break
if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
{
this.result.AppendLine("<br />");
}
//Create an HTML tag with appropriate styles
this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
}
//Append the current text
this.result.Append(renderInfo.GetText());
//Set currently used properties
this.lastBaseLine = curBaseline;
this.lastFontSize = curFontSize;
this.lastFont = curFont;
}
public string GetResultantText()
{
//If we wrote anything then we'll always have a missing closing tag so close it here
if (result.Length > 0)
{
result.Append("</span>");
}
return result.ToString();
}
//Not needed
public void BeginTextBlock() { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
}
