I have an application on the phone that takes in about 50 pages of XML, and each page has about 100 nodes in it. So if you do the math, that is about 5,000 nodes I am parsing. Sometimes these nodes are not set up the same way. For example, maybe 75% follow a different schema than the other 25%, so there is code to handle this and parse them differently.
I can't optimize the HTTP calls any more than I have, as the web service only serves up 100 "items" at a time, so I basically have to hit the web service 50 times to get all the pages of data. Here is the high-level process.
Call web service (webclient)
Parse XML (take note of the total pages in the XML; it will say "Page 1 of 100")
Add results to collection
Call web service again for page 2
Parse
Add results to collection
....rinse and repeat 100 times.
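Roughly, that loop looks like the sketch below (serviceUrl, the totalPages attribute, and ParsePage are placeholders; the real feed reports the page count as "Page 1 of N"):
// Minimal sketch of the page-by-page flow using WebClient's async pattern.
// serviceUrl, the totalPages attribute and ParsePage are placeholders.
int currentPage = 1;
int totalPages = 1;
var client = new WebClient();
client.DownloadStringCompleted += (s, e) =>
{
    XElement page = XElement.Parse(e.Result);
    totalPages = (int)page.Attribute("totalPages");   // the "Page 1 of N" info from the feed
    ParsePage(page);                                   // the LINQ parsing shown further down
    if (++currentPage <= totalPages)
        client.DownloadStringAsync(new Uri(serviceUrl + "&page=" + currentPage));
};
client.DownloadStringAsync(new Uri(serviceUrl + "&page=" + currentPage));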
The parsing code is really the only place I can optimize. All I am doing is using LINQ to parse the XML, separate out the nodes into an IEnumerable, and then parse them and place them in a custom object I created. I'm looking for some high-level ideas on how to optimize this entire process. Maybe I'm missing something.
Some code... just imagine the below, but 1,000 times or more, and with more attributes; this is a small example. Most have around 30 attributes that need parsing. Also, I have no access to a real schema and no control over schema changes.
XElement eventData = XElement.Parse(e.Result);
IEnumerable<XElement> feed =
    eventData.Element("results").Elements("event").Distinct();
foreach (XElement el in feed)
{
    // thisFeeditem is an instance of the custom object mentioned above, created and
    // added to the result collection elsewhere in the full code
    _brokenItem = el.ToString();
    thisFeeditem.InternalGuid = Guid.NewGuid().ToString();
    thisFeeditem.ServiceIcon = GetServiceIcon(thisFeeditem.ServiceType);
    thisFeeditem.Description = el.Attribute("displayName").Value;
    thisFeeditem.EventURL = el.Attribute("uri").Value;
    thisFeeditem.Guid = el.Attribute("id").Value;
    thisFeeditem.Latitude = el.Element("venue").Attribute("lat").Value;
    thisFeeditem.Longitude = el.Element("venue").Attribute("lng").Value;
}
Without seeing your code, it is not easy to optimise it. However, there is one general point you should consider:
Linq-to-XML is a DOM-based parser, in that it reads the entire XML document into a model which resides in memory. All queries are executed against the DOM. For large documents, constructing the DOM can be memory and CPU intensive. Also, if written inefficiently, your Linq-to-XML queries can navigate the same tree nodes multiple times.
As an alternative, consider using a serial parser like XmlReader. Parsers of this type do not create a memory-based model of your document, and operate in a forward-only manner, forcing you to read each element just once.
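For example, a forward-only pass over the feed shown in the question might look something like this (a minimal sketch; the element and attribute names come from your snippet, and FeedItem/results stand in for your custom object and collection):
// Stream the same structure with XmlReader instead of building a DOM.
// Requires System.Xml and System.IO; FeedItem and results are stand-ins
// for the custom object and collection mentioned in the question.
using (XmlReader reader = XmlReader.Create(new StringReader(e.Result)))
{
    FeedItem current = null;
    while (reader.Read())
    {
        if (reader.NodeType != XmlNodeType.Element)
            continue;
        if (reader.Name == "event")
        {
            current = new FeedItem
            {
                InternalGuid = Guid.NewGuid().ToString(),
                Description = reader.GetAttribute("displayName"),
                EventURL = reader.GetAttribute("uri"),
                Guid = reader.GetAttribute("id")
            };
            results.Add(current);
        }
        else if (reader.Name == "venue" && current != null)
        {
            current.Latitude = reader.GetAttribute("lat");
            current.Longitude = reader.GetAttribute("lng");
        }
    }
}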
You could change the architecture.
Create a web service that does the collection and filtering of the XML data and on the phone retrieve the data from that web service.
This way you move the heavy processing to a (scale-able?!) server and you only have to modify the service when the XML sources change instead of having to update all clients.
You can also cache results and prevent duplicates.
Now you are in full control of what happens on the phone.
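A rough sketch of that idea, assuming an ASP.NET Web API style aggregating endpoint (the upstream URL, the fixed page count, and the response format are illustrative assumptions only):
// Hypothetical server-side aggregator: fetches every page from the upstream
// service, merges the <event> elements and returns one XML document to the phone.
public class EventsController : ApiController
{
    public async Task<HttpResponseMessage> Get()
    {
        var merged = new XElement("results");
        using (var http = new HttpClient())
        {
            for (int page = 1; page <= 50; page++)   // in reality, read the page count from the first response
            {
                string xml = await http.GetStringAsync(
                    "http://upstream.example.com/events?page=" + page);   // made-up upstream URL
                merged.Add(XElement.Parse(xml).Element("results").Elements("event"));
            }
        }
        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(merged.ToString(), Encoding.UTF8, "text/xml")
        };
    }
}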
Essentially, my program needs to consume about 100 (and this number will expand) WebServices, pull a piece of data from each, store it, parse it, and then display it. I've written the code for storing, parsing, and displaying.
My problem is this: I can't find any tutorial online about how to loop through a list of WebReferences and query each one. Am I doomed to writing 100 WebReferences and manually writing queries for each one, or is it possible to store a List or Array of the URLs (or something) and loop through it? Or is there another, better way of doing this?
I've specifically done research on this and I haven't found anything, I've done my due diligence. I'm not asking about how to consume a WebService, there's plenty of information on that and it's not that hard.
Current foreach loop (not sufficient, as I need to pass login credentials and get a response):
//Retrieve the XMLString from the server
//The ServerURLList is just a giant list of URLS, I didn't include it
var client = new WebClient { Credentials = new NetworkCredential("LoginCredentials", "LoginCredentialsPass") };
var XMLStringFromServer = client.DownloadString((String)(dr[0]));
//Notice it takes the string URL from the DataTable provided, so that it can do all 100 customers while parsing the response
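One way to flesh that out into a full loop (a sketch only; ServerURLList is assumed to be the DataTable of URLs, and ParseAndStore stands in for the existing parse/store/display code):
// Loop over a DataTable of service URLs, passing credentials on every request.
// ServerURLList and ParseAndStore are placeholders for your own table and code.
foreach (DataRow dr in ServerURLList.Rows)
{
    using (var client = new WebClient
    {
        Credentials = new NetworkCredential("LoginCredentials", "LoginCredentialsPass")
    })
    {
        string xmlStringFromServer = client.DownloadString((string)dr[0]);
        ParseAndStore(xmlStringFromServer);
    }
}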
I'm working on a c# application that extracts metadata from all the documents in a Lotus Notes database (.nsf file, no Domino server) thusly:
NotesDocumentCollection documents = _notesDatabase.AllDocuments;
if (documents.Count > 0)
{
    NotesDocument document = documents.GetFirstDocument();
    while (document != null)
    {
        //processing
        document = documents.GetNextDocument(document);   // advance, otherwise the loop never ends
    }
}
This works well, except we also need to record all the views that a document appears in. We iterate over all the views like this:
foreach (var viewName in _notesDatabase.Views)
{
    NotesView view = _notesDatabase.GetView(viewName);
    if (view != null)
    {
        if (view.AllEntries.Count > 0)
        {
            folderCount = view.AllEntries.Count;
            NotesDocument document = view.GetFirstDocument();
            while (document != null)
            {
                //record the document/view cross reference
                document = view.GetNextDocument(document);
            }
        }
        Marshal.ReleaseComObject(view);
        view = null;
    }
}
Here are my problems and questions:
We fairly regularly encounter documents in a view that were not found in NotesDatabase.AllDocuments collection. How is that possible? Is there a better way to get all the documents in a notes database?
Is there a way to find out all the views a document is in without looping through all the views and documents? This part of the process can be very slow, especially on large nsf files (35 GB!). I'd love to find a way to get just a list of view name and Document.UniversalID.
If there is not a more efficient way to find all the document + view information, is it possible to do this in parallel, with a separate thread/worker/whatever processing each view?
Thanks!
Answering questions in the same order:
I'm not sure how this is possible either unless perhaps there's a special type of document that doesn't get returned by that AllDocuments property. Maybe replication conflicts are excluded?
Unfortunately there's no better way. Views are really just a saved query into the database that return a list of matching documents. There's no list of views directly associated with a document.
You may be able to do this in parallel by processing each view on its own thread, but the bottleneck may be the Domino server that needs to refresh the views and thus it might not gain much.
One other note, the "AllEntries" in a view is different than all the documents in the view. Entries can include things like the category row, which is just a grouping and isn't backed by an actual document. In other words, the count of AllEntries might be more than the count of all documents.
Well, first of all, it's possible that documents are being created while your job runs. It takes time to cycle through AllDocuments, and then it takes time to cycle through all the views. Unless you are working on a copy or replica of the database that is isolated from all other possible users, you can easily run into a case where a document was created after you loaded AllDocuments but before you accessed one of the views.
Also, it may be possible that some of the objects returned by the view.getXXXDocument() methods are deleted documents. You should probably be checking document.isValid() to avoid trying to process them.
I'm going to suggest using the NotesNoteCollection as a check on AllDocuments. If AllDocuments were returning the full set of documents, or if NotesNoteCollection does (after selecting documents and building the collection), then there is a way to do this that is going to be faster than iterating each view.
(1) Read all the selection formulas from the views, removing the word 'SELECT' and saving them in a list of pairs of {view name, formula}.
(2) Iterate through the documents (from the NotesNoteCollection or AllDocuments) and for each doc you can use foreach to iterate through the list of view/formula pairs. Use the NotesSession.Evaluate method on each formula, passing the current document in for the context. A return of True from any evaluated formula tells you the document is in the view corresponding to the formula.
It's still brute force, but it's got to be faster than iterating all views.
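A rough sketch of step (2), assuming the same Domino COM interop classes the question uses (_notesSession and viewFormulas are placeholders, and the cast of Evaluate's return value may need adjusting for your interop version):
// viewFormulas: the {view name, formula} pairs from step (1), with SELECT already removed.
// document: the current NotesDocument from the AllDocuments/NotesNoteCollection loop.
foreach (KeyValuePair<string, string> pair in viewFormulas)
{
    object[] result = (object[])_notesSession.Evaluate(pair.Value, document);
    if (result != null && result.Length > 0 && Convert.ToDouble(result[0]) != 0)
    {
        // formula evaluated to true: record the document/view cross reference
        // (pair.Key is the view name, document.UniversalID identifies the document)
    }
}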
I'm writing a simple function that updates/creates nodes from an XML data-source (about 400 nodes) and I'm wondering what the best way to save and publish all the nodes is. I've noticed that you can Save a list of nodes but there's no SaveAndPublish equivalent.
Should I just iterate over the list and call SaveAndPublish for each node or is there a better way? If there is an alternative, is there any difference in terms of performance?
Any answers would be greatly appreciated!
You are correct: there is no Publish or SaveAndPublish overload that takes an IEnumerable like the Save method does. It would be handy, as it could save a few lines of code.
The most straightforward option currently available to achieve what you want is the following:
var cs = ApplicationContext.Current.Services.ContentService;
foreach (var content in yourListOfContentItems)
{
    cs.SaveAndPublish(content);
}
Calling Save on your list before publishing isn't really going to make any difference, because if Umbraco detects new content in your list it processes each item individually anyway. And from what I can tell, calling Save and then Publish afterwards is not going to save you any cycles either, because the Publish method calls the same internal SaveAndPublishDo method that SaveAndPublish calls. So you might as well go straight for the end result.
I'm creating an app that works with ServiceNow (a custom reporting tool).
It's configured to use demo12 and the XML service described here.
When I make this request:
https://demo12.service-now.com/incident_list.do?XML&sysparm_query=opened_at%3E2012-04-17%2000:00:00%5Eopened_at%3C2012-04-18%2000:00:00%5E&sysparm_view=
in the response XML I see not only <incident> nodes but also <u_zprototype_incidents> nodes.
XPath to get node names is
distinct-values(/xml/*/name(.))
and the result is (formatted for readability):
<XdmValue>
<XdmAtomicValue>u_zprototype_incidents</XdmAtomicValue>
<XdmAtomicValue>incident</XdmAtomicValue>
</XdmValue>
I'm not sure if this is how it should be displayed.
Is there any other way (an extra URI parameter, etc.) to get valid XML (only <incident> nodes)?
I know that I can use /xml/*[contains(name(.),'incident')][sys_id='my GUID'] to get the nodes I need, but I think it consumes more CPU time than just /xml/incident[sys_id='my GUID'].
Any ideas?
For what it's worth, there's something atypical about that demo12 site. There are not supposed to be parent elements named "u_zprototype_incidents" by default; a custom table named "u_zprototype_incidents" was created that extends the "incident" table.
If you want to limit yourself ONLY to records in the base "incident" table, I would suggest that you simply add a new filter for "sys_class_name=incident", giving you this URL:
https://demo12.service-now.com/incident_list.do?XML&sysparm_query=opened_at%3E2012-04-17%2000:00:00%5Eopened_at%3C2012-04-18%2000:00:00%5E^sys_class_name=incident&sysparm_view=
...With that you can use /xml/incident[sys_id='my GUID']
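A small sketch of consuming that filtered URL from C# (credentials and the sys_id value are placeholders; this uses the built-in System.Xml.XPath support rather than Saxon, since this simple path only needs XPath 1.0):
// Fetch the filtered incident list and pick out a single incident by sys_id.
var client = new WebClient { Credentials = new NetworkCredential("user", "password") };
string url = "https://demo12.service-now.com/incident_list.do?XML"
           + "&sysparm_query=opened_at%3E2012-04-17%2000:00:00%5Eopened_at%3C2012-04-18%2000:00:00"
           + "%5Esys_class_name=incident&sysparm_view=";
XDocument doc = XDocument.Parse(client.DownloadString(url));
XElement incident = doc.XPathSelectElement("/xml/incident[sys_id='my GUID']");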
ASP.NET / C#
I need advice regarding the design problem below:
I'll receive XML files every day, and the quantity varies: e.g. yesterday 10 XML files were received, today 56, and maybe tomorrow 161, etc.
There are 12 types (12 XSDs), and at the top there is an attribute called FormType, e.g. FormType="1", FormType="2", FormType="12", etc., up to 12 form types.
All of them have common fields like Name, Address, Phone.
But, for example, FormType=1 is for Construction, FormType=2 is for IT, FormType=3 is for Hospital, FormType=4 is for Advertisement, etc.
As I said, all of them have common attributes.
Requirements:
I need a search screen so the user can search the contents of these XMLs, e.g. search the text of some attributes across the XMLs received between Date_From and Date_To. But I don't have any clue how to approach this.
Problem:
I've heard about putting the XMLs in a binary field and doing XPath queries or whatever, but I don't know the right terms to search for on Google.
I was thinking of creating one big database table, reading all the XMLs, and putting them in that table. But the issue is that some XML attributes are very large, like 2-3 pages of text, while the same attributes in other XML files are empty.
So if I create an NVARCHAR(MAX) column for every XML attribute and put everything in table fields, after some time my database will be a big, big monster...
Can someone advise on the best approach to handle this issue?
I'm not 100% sure I understand your problem. I'm guessing that the query's supposed to return individual XML documents that meet some kind of user-specified criteria.
In that event, my starting point would probably be to implement a method for querying a single XML document, i.e. one that returns true if the document's a hit and false otherwise. In all likelihood, I'd make the query parameter an XPath query, but who knows? Here's a simple example:
public bool TestXml(XDocument d, string query)
{
    // requires System.Xml.Linq, System.Xml.XPath and System.Linq
    return d.XPathSelectElements(query).Any();
}
Next, I need a store of XML documents to query. Where does that store live, and what form does it take? At a certain level, those are implementation details that my application doesn't care about. They could live in a database, or the file system. They could be cached in memory. I'd start by keeping it simple, something like:
public IEnumerable<XDocument> XmlDocuments()
{
    DirectoryInfo di = new DirectoryInfo(XmlDirectoryPath);
    foreach (FileInfo fi in di.GetFiles())
    {
        yield return XDocument.Load(fi.FullName);
    }
}
Now I can get all of the documents that fulfill a request like this:
public IEnumerable<XDocument> GetDocuments(string query)
{
    return XmlDocuments().Where(x => TestXml(x, query));
}
The thing that jumps out at me when I look at this problem: I have to parse my documents into XDocument objects to query them. That's going to happen whether they live in a database or the file system. (If I stick them in a database and write a stored procedure that does XPath queries, as someone suggested, I'm still parsing all of the XML every time I execute a query; I've just moved all that work to the database server.)
That's a lot of I/O and CPU time that gets spent doing the exact same thing over and over again. If the volume of queries is anything other than tiny, I'd consider building a List<XDocument> the first time GetDocuments() is called and come up with a scheme of keeping that list in memory until new XML documents are received (or possibly updating it when new XML documents are received).
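A minimal sketch of that caching idea, as a cached variant of the GetDocuments() method above (when and how the cache gets invalidated depends on whatever mechanism delivers the new files):
// Cache the parsed documents in memory; rebuild only when new XML files arrive.
private List<XDocument> _cache;
private readonly object _cacheLock = new object();

public IEnumerable<XDocument> GetDocuments(string query)
{
    lock (_cacheLock)
    {
        if (_cache == null)
            _cache = XmlDocuments().ToList();   // parse every file once, reuse for later queries
    }
    return _cache.Where(d => TestXml(d, query));
}

public void InvalidateCache()   // call this when new XML documents are received
{
    lock (_cacheLock)
    {
        _cache = null;
    }
}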