I have a task where i need to create a conversion program from OmniPage XML ocr output into ALTO XML.
The output of OmniPage XML is really different from ALTO XML.
I tried to find an ALTO XML file and trying to figure out where those values came from.
I need to get the formula of SP Tag WIDTH.
Below is the sample XML i'm trying to figure out
<TextLine ID="P1_TL00002" HPOS="26.00" VPOS="98.00" WIDTH="1667.00" HEIGHT="130.00">
<String ID="P1_ST00002" HPOS="26.00" VPOS="106.00" WIDTH="387.00" HEIGHT="95.00" CONTENT="Twenties" WC="0.99" CC="06370005"/>
<SP ID="P1_SP00001" HPOS="413.00" VPOS="201.00" WIDTH="29.00"/>
<String ID="P1_ST00003" HPOS="442.00" VPOS="98.00" WIDTH="246.00" HEIGHT="103.00" CONTENT="Glrls" WC="0.78" CC="00045"/>
<SP ID="P1_SP00002" HPOS="688.00" VPOS="201.00" WIDTH="26.00"/>
<String ID="P1_ST00004" HPOS="714.00" VPOS="98.00" WIDTH="178.00" HEIGHT="103.00" CONTENT="ancl" WC="0.54" CC="5660"/>
<SP ID="P1_SP00003" HPOS="892.00" VPOS="201.00" WIDTH="39.00"/>
<String ID="P1_ST00005" HPOS="931.00" VPOS="98.00" WIDTH="368.00" HEIGHT="130.00" CONTENT="FUppER" WC="0.83" CC="090000"/>
<SP ID="P1_SP00004" HPOS="1299.00" VPOS="228.00" WIDTH="32.00"/>
<String ID="P1_ST00006" HPOS="1331.00" VPOS="98.00" WIDTH="362.00" HEIGHT="106.00" CONTENT="PAshiON" WC="0.76" CC="0008206"/>
</TextLine>
I already figured out the values of HPOS and VPOS.
I used c# Rect class
Rect r = new Rect(26, 106, 387, 95);
Debug.WriteLine("BottomRight: " + r.BottomRight);
BottomRight: 413,201
But I can't figure where the SP tag's WIDTH value comes from.
Please help me.
Looks like it’s just the difference between the SP tags HPOS and the following String tags HPOS, e.g. 413.00 + 29.00 = 442.00, 688.00 + 26.00 = 714.00, and so on.
Related
So I have two files: a mot file and an xml file. What I need to do with these files is to read data from the xml file and compare it to the mot file if it exists. That's the general idea.
Before anything else, for those who are unfamiliar with what a mot
file is (I don't also have much knowledge about it, just the basics)...
(From Wikipedia) A mot file (or a Motorola S-Record
file) is a file format that conveys binary information in ASCII Hex text form.
(from another source) An S-record file consists of a
sequence of specially formatted ASCII character strings. An S-record
will be less than or equal to 78 bytes in length.
The format of a S-Record is:
S | Type | Record Length | Address (starting address) | Data | Checksum
(e.g. S21404200047524D5354524D0000801410AA5AA555F9)
([parsed] S2 14 042000 47524D5354524D0000801410AA5AA555 F9)
The specific idea is that I have data AA BB CC DD and so on allocated in addresses 0x042000 ~ 0x04200F. What’s written in the xml would be:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<record>
<File name="Test.mot">
<Address id="042000">
<Data>AA</Data>
</Address>
</File>
</record>
<record>
<File name="Test.mot">
<Address id="042001">
<Data>BB CC DD</Data>
</Address>
</File>
</record>
<record>
<File name="Test.mot">
<Address id="042004">
<Data>EE FF</Data>
</Address>
</File>
</record>
Then the program would get the data and address from he XML and search the .mot file for any hits. So if a mot file has a record S214042000AABBCCDDEEFF01234567891A2B3C4D5EF9, then this is supposed to bring a match with what's in the xml. Result to true, or 1. If anything in the xml doesn't have a match, then it would return with false or 0.
The problem now would be I’m not well-versed with C# much less with XML although I did have a tiny bit of experience with both. I initially thought it would be something like this:
using (StreamReader sr = new StreamReader("Test.mot"))
{
String line =String.Empty;
while ((line = sr.ReadLine()) != null)
{
if (line.Contains("042004") & line.Contains("EE FF"))
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Failure");
}
}
}
But obviously, it didn't result with what I expected. And Failure keeps popping up. Am I right to use StreamReader to read the .mot file? And with regards to the XML file, will XMLDocument work? How do I get data from the xml and compare it with the .mot file? Could someone walk me through how to get this done or provide guides how to properly start with this.
Let me know if I'm not clear on anything.
EDIT:
I thought of an idea. I'm not sure if it's doable, though. Let's say the program will read the mot S-Record file, and it will identify the type of the record. From there every record line listed in the file would be broken down as shown in the sample below:
sample record line: "S214042000AABBCCDDEEFF01234567891A2B3C4D5EF9"
S2 - type w/c means there would be a 3-byte address
14 - record length
F9 - checksum
042000 - AA
042001 - BB
042002 - CC
042003 - DD
...
04200F - 5E
With this new list, I think or I hope it would be easier for the program to use the data in the XML to locate it in the mot file.
Tell me if this will work, or if there are any alternatives.
Correct me when i'm wrong as it is full of assumptions:
the XML only gives the starting values of the data package under the mot file:
||||||||||||
S214042000AABBCCDDEEFF01234567891A2B3C4D5EF9
AABBCCDDEEFF
You could read out the xml and place each record in a record class
public class Record
{
string FileName{get;set;}
string Id {get;set;}
string Data {get;set;}
public Record(){} //default constructor
}
with the XmlDocument class you could read out the xml.
something like:
var document = new XmlDocument();
document.LoadXml("your.xml");
var records = document.SelectNodes("record");
var recordList = new List<Record>();
foreach(var r in records)
{
var file = r.SelectSingleNode("file");
var fileName = file.Attributes["name"].Value;
var address = file.SelectSingleNode("Address");
var id = address.Attributes["id"].Value;
var data = address.SelectSingleNode("Data").InnerText.Replace(" ", "");
recordList.Add(new Record{FileName = fileName, Id = id, Data = data});
}
Afterwards you can then readout everyline of the mot file by position:
since the location of the 042000 always be the 5 - 10 character
var fn = "Test.mot";
using (StreamReader sr = new StreamReader(fn))
{
var record = recordList.Single(r=> r.FileName);
String line =String.Empty;
while ((line = sr.ReadLine()) != null)
{
if (line.SubString(4,6) == record.Id && line.SubString(10, record.Data.Length) == record.Data)
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Failure");
}
}
}
Let me know if it helped you out a bit
var xml = #"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?>
<metadata created=""2014-11-03T18:13:02.769Z"" xmlns=""http://example.com/ns/mmd-2.0#"" xmlns:ext=""http://example.com/ns/ext#-2.0"">
<customer-list count=""112"" offset=""0"">
<customer id=""5f6ab597-f57a-40da-be9e-adad48708203"" type=""Person"" ext:score=""100"">
<name>Bobby Smith</name>
<gender>male</gender>
<country>US</country>
<date-span>
<begin>1965-02-18</begin>
<end>false</end>
</date-span>
</customer>
<customer id=""22"" type=""Person"" ext:score=""100"">
<name>Tina Smith</name>
<gender>Female</gender>
<country>US</country>
<date-span>
<end>false</end>
</date-span>
</customer>
<customer id=""30"" type=""Person"" ext:score=""500"">
<name>George</name>
<gender>Male</gender>
<country>US</country>
<date-span>
<begin>1965</begin>
<end>false</end>
</date-span>
</customer>
</customer-list>
</metadata>";
Im using the above XML. The problem i have is the date (im referring to the <date-span> <begin> element) can be in any format. So i am trying to use the below code in order to take care of the date format
GetCustomers = from c in XDoc.Descendants(ns + "customer")
select
new Customer
{
Name = c.Element(ns + "name").Value,
DateOfBirth = Convert.ToDateTime(c.Element(ns + "date-span").Elements(ns + "begin").Any() ? c.Element(ns + "date-span").Element(ns + "begin").Value : DateTime.Now.ToString())
};
The above works but crashed as soon as the XML contained 1965 - unfortunately i have no control over the XML. So i tried to use TryParse in order to convert 1965 to dd/mm/1965 where dd and mm could be todays date and current month, but i cant seem to get it working:
BeginDate = Convert.ToDateTime(c.Element(ns + "life-span").Elements(ns + "begin").Any() ? DateTime.TryParse( c.Element(ns + "life-span").Element(ns + "begin").Value, culture, styles, out dateResult) : DateTime.Now).ToString())
Could anyone guide me here how to resolve the issue?
Edit 1
var ModifyBeginDate = XDoc.Descendants(ns + "artist").Elements(ns + "date-span").Elements(ns + "begin");
The above retrieves all the dates but how do i assign the values after i have changed them back to the XML (i dont think i can use this variable in my code as when i iterate through the XML it would go directly back to the original XML)
If the data can be in any format then you will have to preprocess the data before trying to parse it into a DateTime.
If i were going to implement this the first thing i would do is break the input into an array of integers, if there is only one item in the array i would check the length, if it was 4 long then I would assume that it is a year and instantiate a new DateTime of January 1, with the year. If i found an array of length 4,2,2 or 2,2,4 i would parse them accordingly - obviously there will be some of it left to guessing but if you can't control the format of the xml there will always be something left to chance
You could use something like this (but modified to only return the integer types and skip the splitting type which could be /,-, etc) to split the date time into an array containing the integer values:
https://stackoverflow.com/a/13548184/184746
I have a small problem which I'm hoping to get some help resolving. So far I am at a dead end.
This is an example input:
<example some="" random="" attributes="" here="">
<something>
[01/01/1993 10:10:10] name:
</something>important text.
</example>
I need to get the 'important text' which is positioned where shown. I cannot modify the XML due to it being produced by another application.
Thanks,
Thomas.
PS. My current thoughts are to read all the elements and elements' content and replace it with nothing - this obviously isn't a very good way.
This is probably what you are looking for:
var xdoc = XDocument.Load("1.xml");
var text = xdoc.Root.Element("something").NextNode as XText;
if (text != null)
{
Console.WriteLine(text.Value);
}
This code checks if your next node is XText and not null, which is a good practice in your case.
var xText = XDocument.Parse(xmlstr).Root.Nodes().Last() as XText;
var text = xText.Value;
OR
var text = XDocument.Parse(DATA).Root.Nodes().Last().ToString();
I am copying a value 8.03 from excel sheet and reading that value in C# from the clipboard using the code below. The value that is coming out is 8.0299999999999994 and not 8.03. When I look at the xml as in below the value is 8.0299999999999994. The sheet1.xml which is in the zip file of xlsx also has the value 8.02999999999999994. But when I copy the value to notepad the value gets copied at 8.03. What can I do to copy the value as 8.03 like in notepad?
DataObject retrievedData = Clipboard.GetDataObject() as DataObject;
if (retrievedData.GetDataPresent("XML Spreadsheet"))
{
var sourceDataReader = new XmlTextReader(retrievedData.GetData("XML Spreadsheet") as System.IO.Stream);
var doc = XDocument.Load(reader);
var excelWorksheeet = doc.Element("Worksheet");
XNamespace ns = "urn:schemas-microsoft-com:office:spreadsheet";
XName worksheetName = ns + "Worksheet";
var worksheetElement = doc.Root.Element(worksheetName);
}
The worksheetElement has the value:
<Worksheet ss:Name="Sheet1" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns="urn:schemas-microsoft-com:office:spreadsheet">
<Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="1" ss:DefaultRowHeight="15">
<Row>
<Cell>
<Data ss:Type="Number">8.0299999999999994</Data>
</Cell>
</Row>
</Table>
</Worksheet>
The clipboard has support for several formats. Obviously the XML Spreadsheet format does a different rounding of the underlying float value than excel does. If you want to get the data exactly as notepad does, try to get the text representation instead.
Of course, copying as text will not get all the detailed information that XML Spreadsheet gives you, but it all depends on what you want to do with the data.
An alternative is to get both representations. Get the XML Spreadsheet format to get the detailed data and use the text format to get a display string.
I have this text in a rich text box named richTextBox:
<notification_counts>
<unseen>0</unseen>
</notification_counts>
<friend_requests_counts>
<unread>1</unread>
<unseen>**0**</unseen>
</friend_requests_counts>
I would like to extract the value from the unseen tag (0 in this example) and place it in the text box named textbox1. How should I go about doing this?
Full code
<?xml version="1.0" encoding="UTF-8"?>
<notifications_get_response xmlns="http://api.facebook.com/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://api.facebook.com/1.0/ http://api.facebook.com/1.0/facebook.xsd">
<messages>
<unread>0</unread>
<unseen>0</unseen>
<most_recent>****</most_recent>
</messages>
<pokes>
<unread>0</unread>
<most_recent>0</most_recent>
</pokes>
<shares>
<unread>0</unread>
<most_recent>0</most_recent>
</shares>
<notification_counts>
<unseen>0</unseen>
</notification_counts>
<friend_requests_counts>
<unread>1</unread>
<unseen>0</unseen>
</friend_requests_counts>
<friend_requests list="true">
<uid>***</uid>
</friend_requests>
<group_invites list="true"/>
<event_invites list="true"/>
</notifications_get_response>
If your rich text box only contains XML markup, you can parse it to extract the value you're interested in. For instance, using LINQ to XML:
using System.Xml.Linq;
textBox1.Text = XElement.Parse(richTextBox.Text)
.Descendant("friend_requests_counts")
.Element("unseen").Value;
EDIT: Since your XML markup contains namespaces, you have to take them into account when selecting the elements:
XNamespace fb = "http://api.facebook.com/1.0/";
textBox1.Text = XDocument.Parse(richTextBox.Text).Root
.Element(fb + "friend_requests_counts")
.Element(fb + "unseen").Value;