In my application we have multilingual language strings which are stored in custom tables, as the user can edit, delete, and import new languages via a UI.
Currently, at the beginning of each request I go off and get all the language strings (from our database) for the currently selected language and stick them in a dictionary.
I then have an HTML helper extension method, which I use in the Razor views (see below); it fishes in the dictionary built at the beginning of the request to pull out the correct string based on the key supplied to the helper.
Html.LanguageString("MyLanguage.KeyHere")
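For illustration, such a helper might look roughly like this (a simplified sketch; it assumes the per-request dictionary is stashed in HttpContext.Items under a made-up "LanguageStrings" key, which isn't shown in the question):

using System.Collections.Generic;
using System.Web.Mvc;

public static class LanguageHelpers
{
    public static MvcHtmlString LanguageString(this HtmlHelper html, string key)
    {
        // The dictionary loaded once at the beginning of the request.
        var strings = (IDictionary<string, string>)
            html.ViewContext.HttpContext.Items["LanguageStrings"];

        string value;
        // Fall back to the key itself when no translation exists.
        return MvcHtmlString.Create(strings.TryGetValue(key, out value) ? value : key);
    }
}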
Now this works fine. However, as the application gets bigger, we are accumulating more and more language strings. It's not an issue right now, as it's still very fast with only around 200 strings to fetch.
But it also means I'm fetching all of them, even if a page only uses, say, one. Ideally I'd like a way of processing the LanguageString("") calls beforehand and issuing a query at the beginning of the request for just those strings that are needed. Or maybe my own LINQ-based language that can be processed to produce a more efficient call.
I'm looking for advice on how to do this, as I'd like the application to be as efficient as possible. Any advice, help, or tips are gratefully received. Thanks.
I'd suggest caching language strings at the application level rather than fetching them for every request. For example, this can be done by maintaining a static dictionary and invalidating the cache only when the user makes changes to these strings. This will make your application more responsive, as well as save you from implementing a (IMHO) rather more complex and not necessarily efficient technique of loading this data on demand.
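A minimal sketch of that idea (LoadFromDatabase is a hypothetical stand-in for your existing per-request query):

using System.Collections.Generic;

public static class LanguageCache
{
    private static readonly object _lock = new object();
    private static readonly Dictionary<string, Dictionary<string, string>> _cache =
        new Dictionary<string, Dictionary<string, string>>();

    public static Dictionary<string, string> Get(string language)
    {
        lock (_lock)
        {
            Dictionary<string, string> strings;
            if (!_cache.TryGetValue(language, out strings))
            {
                strings = LoadFromDatabase(language); // hits the DB only on a cache miss
                _cache[language] = strings;
            }
            return strings;
        }
    }

    // Call this from the UI whenever a user edits, deletes, or imports strings.
    public static void Invalidate(string language)
    {
        lock (_lock) { _cache.Remove(language); }
    }

    private static Dictionary<string, string> LoadFromDatabase(string language)
    {
        // Hypothetical: placeholder for the existing database query.
        return new Dictionary<string, string>();
    }
}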
As a side note I'd add the following: it's usually a good practice to address these kinds of problems when they arise (rather than fixing something that is not broken) and focus on more important things. I totally agree that performance implications of a given solution must always be taken into consideration, I'm just saying that premature optimizations are not always a good idea.
I have the following question:
What is the preferred way to represent a status in code: an enum or a singleton?
I have the status values stored in a DB with their IDs. If a status changes in the DB, it would also need some changes in the code.
Does anyone know which is preferred, based on conventions?
I've been looking on the internet but couldn't find a clear answer.
It depends in part on whether the ids for your statuses have guaranteed values, or whether the ids could change per-database (via an IDENTITY). Personally, for statuses I prefer fixed - which gives you the most flexibility, and least overhead - you can choose to use an enum (or maybe some consts if more convenient), and you never have to add an indirection, i.e. "get the id that is open".
This isn't always possible, though, and when it isn't it is still definitely useful to cache and re-use them (to avoid hitting the DB for that lookup). However, I would avoid a singleton, not least because it won't play nicely if you ever need to talk to more than one database - the ids in each could well be different. However, any suitable cache implementation (or maybe IoC/DI) should allow you to store the appropriate data (probably some kind of dictionary). Singletons are also just a bit of a pain generally if you like testing etc.
But: an enum and fixed id values is a lot simpler.
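A minimal sketch of the fixed-value approach (the numeric values here are assumptions; use whatever ids your status table guarantees):

public enum Status
{
    Open = 1,
    Pending = 2,
    Closed = 3,
    Deferred = 4
}

// Reading:  var status = (Status)reader.GetInt32(statusOrdinal);
// Writing:  command.Parameters.AddWithValue("@StatusId", (int)status);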
Note that under any implementation, changing the status list is a non-trivial operation, not least because it will be a big UPDATE (or several if you are denormalized).
If you intend to use the status across the application and it is standardised, then it is a best fit for an enum:

enum Status
{
    Open, Pending, Closed, Deferred
}

This also makes the code more readable.
Below is something I read, and I was wondering whether the statement is true.
Serialization is the process of converting a data structure or object into a sequence of bits so that it can be stored in a file or memory buffer, or transmitted across a network connection link to be "resurrected" later in the same or another computer environment.[1] When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward.
Serialization is just a fancy way of describing what you do when you want a certain data structure, class, etc. to be transmitted.
For example, say I have a structure:
struct Color
{
    int R, G, B;
};
When you transmit this over a network, you don't just say "send Color". You create a stream of bits and send that. For example, I could create an unsigned char buffer, concatenate R, G, and B into it, and then send those bytes. I just did serialization.
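For illustration, a minimal C# translation of that hand-rolled approach (the 12-byte layout is an arbitrary choice for this sketch, not a standard):

using System;

struct Color { public int R, G, B; }

static class ColorWire
{
    static byte[] Serialize(Color c)
    {
        var buffer = new byte[12]; // 3 ints, 4 bytes each
        BitConverter.GetBytes(c.R).CopyTo(buffer, 0);
        BitConverter.GetBytes(c.G).CopyTo(buffer, 4);
        BitConverter.GetBytes(c.B).CopyTo(buffer, 8);
        return buffer; // this byte array is what actually goes over the socket
    }
}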
Serialization of some kind is required, but this can take many forms. It can be something like dotNET serialization, that is handled by the language, or it can be a custom built format. Maybe a series of bytes where each byte represents some "magic value" that only you and your application understand.
For example, in dotNET I can create a class with a single string property, mark it as serializable, and the dotNET framework takes care of almost everything else.
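A minimal sketch of that framework-handled route (Message is a made-up type; BinaryFormatter is just one of the formatters available):

using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class Message
{
    public string Text { get; set; }
}

static class BinaryDemo
{
    static byte[] ToBytes(Message m)
    {
        var formatter = new BinaryFormatter();
        using (var stream = new MemoryStream())
        {
            formatter.Serialize(stream, m); // the framework walks the object graph for us
            return stream.ToArray();
        }
    }
}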
I can also build my own custom format where the first 4 bytes represent the length of the data being sent and all subsequent bytes are characters in a string. But then, of course, you need to worry about byte ordering, Unicode vs. ANSI encoding, and so on.
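A minimal sketch of that custom format (UTF-8 and BitConverter's native byte order are assumptions here; they are exactly the decisions warned about above):

using System;
using System.Text;

static class LengthPrefixed
{
    // 4 length bytes followed by the string bytes.
    static byte[] Pack(string s)
    {
        byte[] payload = Encoding.UTF8.GetBytes(s);
        byte[] message = new byte[4 + payload.Length];
        BitConverter.GetBytes(payload.Length).CopyTo(message, 0);
        payload.CopyTo(message, 4);
        return message;
    }
}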
Typically it is easier to make use of whatever framework your language/OS/dev framework uses, but it is not required.
Yes, serialization is the only way to transmit data over the wire. Consider what the purpose of serialization is: you define the way that the class is stored. In memory, though, you have no way to know exactly where each portion of the class is. Especially if you have, for instance, a list: if it's been allocated early but then reallocated, it's likely to be fragmented all over the place rather than one contiguous block of memory. How do you send that fragmented class over the line?
For that matter, if you send a List<ComplexType> over the wire, how does the receiver know where each ComplexType begins and ends?
The real problem here is not getting the data over the wire; the problem is ending up with the same semantic object on the other side of the wire. For properly transporting data between dissimilar systems -- whether via TCP/IP, floppy, or punch card -- the data must be encoded (serialized) into a platform-independent representation.
Because of alignment and type-size issues, if you attempted to do a straight binary transfer of your object it would cause Undefined Behavior (to borrow the definition from the C/C++ standards).
For example the size and alignment of the long datatype can differ between architectures, platforms, languages, and even different builds of the same compiler.
Is serialization a must in order to transfer data across the wire?
Literally no.
It is conceivable that you can move data from one address space to another without serializing it. For example, a hypothetical system using distributed virtual memory could move data / objects from one machine to another by sending pages ... without any specific serialization step.
And within a machine, objects could be transferred by switching pages from one virtual address space to another.
But in practice, the answer is yes. I'm not aware of any mainstream technology that works that way.
For anything more complex than a primitive or a homogeneous run of primitives, yes.
Binary serialization is not the only option. You can also serialize an object as an XML file, for example. Or as JSON.
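For example, a minimal sketch using XmlSerializer (reusing the Color type from the earlier answer; a JSON serializer would be used analogously):

using System.IO;
using System.Xml.Serialization;

public class Color { public int R, G, B; }

static class XmlDemo
{
    static string ToXml(Color c)
    {
        var serializer = new XmlSerializer(typeof(Color));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, c);
            return writer.ToString(); // e.g. <Color><R>255</R><G>0</G><B>0</B></Color>
        }
    }
}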
I think you're asking the wrong question. Serialization is a concept in computer programming and there are certain requirements which must be satisfied for something to be considered a serialization mechanism.
Any means of preparing data such that it can be transmitted or stored in such a way that another program (including but not limited to another instance of the same program on another system or at another time) can read the data and re-instantiate whatever objects the data represents.
Note I slipped the term "objects" in there. If I write a program that stores a bunch of text in a file, and I later use some other program (or another instance of that first program) to read that data ... I haven't really used a "serialization" mechanism. If I write it in such a way that the text is also stored with some state about how it was being manipulated ... that might entail serialization.
The term is used mostly to convey the concept that active combinations of behavior and state are being rendered into a form which can be read by another program/instance and instantiated. Most serialization mechanisms are bound to a particular programming language or virtual machine system (in the sense of a Java VM, a C# VM, etc.; not in the sense of "VMware" virtual machines). JSON (and YAML) are a notable exception to this. They represent data for which there are reasonably close object classes with reasonably similar semantics, such that they can be instantiated in multiple different programming languages in a meaningful way.
It's not that all data transmission or storage entails "serialization" ... it's that certain ways of storing and transmitting data can be used for serialization. At the very least it must be possible to disambiguate among the types of data that the programming language supports. If it reads 1, it has to know whether that's text, an integer, a real (equivalent to 1.0), or a bit.
Strictly speaking it isn't the only option; you could make an argument that "remoting" meets the meaning in the text: here a fake object is created at the receiver that contains no state. All calls (methods, properties, etc.) are intercepted, and only the call and result are transferred. This avoids the need to transfer the object itself, but it can get very expensive with overly "chatty" usage (i.e. lots of calls), as each call pays the speed-of-light latency, which adds up.
However, "remoting" is now rather out of fashion. Most often, yes: the object will need to be serialized and deserialized in some way (there are lots of options here). The paragraph is then pretty much correct.
Having messages as objects and serializing them into bytes is a better way of understanding and managing what is transmitted over the wire. In the old days protocols and data were much simpler; often, programmers just put bytes into the output stream, and common understanding was shared via well-known, simple specifications.
I would say serialization is needed to store objects in a file for persistence, though dynamically allocated pointers inside objects need to be rebuilt when we deserialize. Serialization for transfer, however, depends on the physical protocol and mechanism used: for example, if I use a UART to transfer data it is serialized bit by bit, but if I use a parallel port then 8 bits are transferred together, which is not serialized.
What is the simplest way to identify if a given method is reading or writing a member variable or property?
I am writing a tool to assist in an RPC system, in which access to remote objects is expensive. Being able to detect if a given object is not used in a method could allow us to avoid serializing its state. Doing it on source code is perfectly reasonable (but being able to do it on compiled code would be amazing)
I think I can either write my own simple parser, or try to use one of the existing C# parsers and work with the AST. I am not sure if it is possible to do this with assemblies using reflection. Are there any other ways? What would be the simplest?
EDIT: Thanks for all the quick replies. Let me give some more information to make the question clearer. I definitely prefer correct, but it shouldn't be extremely complex either. What I mean is that we can't go too far checking for extremes or impossibilities (such as the passed-in delegates that were mentioned, which is a great point). It would be enough to detect those cases, assume everything could be used, and not optimize there. I would assume that those cases would be relatively uncommon.
The idea is for this tool to be handed to developers outside of our team, who should not be concerned with this optimization. The tool takes their code and generates proxies for our own RPC protocol. (We are using protobuf-net for serialization only, but neither WCF nor .NET Remoting.) For this reason, anything we use has to be free, or we wouldn't be able to deploy the tool because of licensing issues.
You can have simple or you can have correct - which do you prefer?
The simplest way would be to parse the class and the method body. Then identify the set of tokens which are properties and field names of the class. The subset of those tokens which appears in the method body are the properties and field names you care about.
This trivial analysis of course is not correct. If you had
class C
{
    int Length;
    void M() { int x = "".Length; }
}
Then you would incorrectly conclude that M references C.Length. That's a false positive.
The correct way to do it is to write a full C# compiler, and use the output of its semantic analyzer to answer your question. That's how the IDE implements features like "go to definition".
Before attempting to write this kind of logic yourself, I would check to see if you can leverage NDepend to meet your needs.
NDepend is a code dependency analysis tool ... and much more. It implements a sophisticated analyzer for examining relationships between code constructs and should be able to answer that question. It also operates on both source and IL, if I'm not mistaken.
NDepend exposes CQL - Code Query Language - which allows you to write SQL-like queries against the relationships between structures in your code. NDepend has some support for scripting and is capable of being integrated with your build process.
To complete LBushkin's answer on NDepend (disclaimer: I am one of the developers of this tool), NDepend can indeed help you with that. The Code LINQ Query (CQLinq) below actually matches methods that...
shouldn't provoke any RPC calls, but
that are reading/writing any fields of any RPC types,
or that are reading/writing any properties of any RPC types,
Notice how we first define the four sets: typesRPC, fieldsRPC, propertiesRPC, methodsThatShouldntUseRPC - and then we match methods that violate the rule. Of course, this CQLinq rule needs to be adapted to match your own typesRPC and methodsThatShouldntUseRPC:
warnif count > 0
// First define the types whose calls are RPC
let typesRPC = Types.WithNameIn("MyRpcClass1", "MyRpcClass2")
// Define instance fields of RPC types
let fieldsRPC = typesRPC.ChildFields()
.Where(f => !f.IsStatic).ToHashSet()
// Define instance properties getters and setters of RPC types
let propertiesRPC = typesRPC.ChildMethods()
.Where(m => !m.IsStatic && (m.IsPropertyGetter || m.IsPropertySetter))
.ToHashSet()
// Define methods that shouldn't provoke RPC calls
let methodsThatShouldntUseRPC =
Application.Methods.Where(m => m.NameLike("XYZ"))
// Filter methods that shouldn't do any RPC call
// but that are using any RPC fields (reading or writing) or properties
from m in methodsThatShouldntUseRPC.UsingAny(fieldsRPC).Union(
methodsThatShouldntUseRPC.UsingAny(propertiesRPC))
let fieldsRPCUsed = m.FieldsUsed.Intersect(fieldsRPC)
let propertiesRPCUsed = m.MethodsCalled.Intersect(propertiesRPC)
select new { m, fieldsRPCUsed, propertiesRPCUsed }
My intuition is that detecting which member variables will be accessed is the wrong approach. My first guess at a way to do this would be to just request serialized objects on an as-needed basis (preferably at the beginning of whatever function needs them, not piecemeal). Note that TCP/IP (via Nagle's algorithm) should bundle these requests together if they are made in rapid succession and are small.
Eric has it right: to do this well, you need what amounts to a compiler front end. What he didn't emphasize enough is the need for strong flow-analysis capabilities (or a willingness to accept very conservative answers, possibly alleviated by user annotations). Maybe he meant that by the phrase "semantic analysis", although his example of "go to definition" needs just a symbol table, not flow analysis.
A plain C# parser could only be used to get very conservative answers (e.g., if method A in class C contains identifier X, assume it reads class member X; if A contains no calls then you know it can't read member X).
The first step beyond this is having a compiler's symbol table and type information (if method A refers to class member X directly, then assume it reads member X; if A contains *no* calls and mentions identifier X only in the context of accesses to objects which are not of this class type, then you know it can't read member X). You have to worry about qualified references, too; Q.X may read member X if Q is compatible with C.
The sticky point is calls, which can hide arbitrary actions. An analysis based on just parsing and symbol tables could determine that, if there are calls, the arguments refer only to constants or to objects which are not of the class that A might represent (possibly inherited).
If you find an argument that has a C-compatible class type, now you have to determine whether that argument can be bound to this, requiring control- and data-flow analysis:
void A()
{
    object q = this;
    ...
    q = that;
    ...
    foo(q);
}
foo might hide an access to X. So you need two things: flow analysis to determine whether the initial assignment to q can reach the call foo (it might not; q=that may dominate all calls to foo), and call graph analysis to determine what methods foo might actually invoke, so that you can analyze those for accesses to member X.
You can decide how far you want to go with this, simply making the conservative assumption "A reads X" any time you don't have enough information to prove otherwise. This will give you a "safe" answer (if not a "correct" one, or what I'd prefer to call "precise").
Of frameworks that might be helpful, you might consider Mono, which surely parses and builds symbol tables. I don't know what support it provides for flow analysis or call graph extraction; I would not expect the Mono-to-IL front-end compiler to do a lot of that, as people usually hide that machinery in the JIT part of JIT-based systems. A downside is that Mono may be behind the "modern C#" curve; last time I heard, it handled only C# 2.0 but my information may be stale.
An alternative is our DMS Software Reengineering Toolkit and its C# Front End.
(Not an open source product).
DMS provides general source code parsing, tree building/inspection/analysis, general symbol table support, and built-in machinery for implementing control-flow analysis, data-flow analysis, points-to analysis (needed for "what does object O point to?"), and call graph construction. This machinery has all been tested by fire with DMS's Java and C front ends, and the symbol table support has been used to implement full C++ name and type resolution, so it's pretty effective. (You don't want to underestimate the work it takes to build all that machinery; we've been working on DMS since 1995.)
The C# Front End provides for full C# 4.0 parsing and full tree building. It presently does not build symbol tables for C# (we're working on this) and that's a shortcoming compared to Mono. With such a symbol table, however, you would have access to all that flow analysis machinery (which has been tested with DMS's Java and C front ends) and that might be a big step up from Mono if it doesn't provide that.
If you want to do this well, you have a considerable amount of work in front of you. If you want to stick with "simple", you'll have to make do with just parsing the tree and being OK with being very conservative.
You didn't say much about knowing if a method wrote to a member. If you are going to minimize traffic the way you describe, you want to distinguish "read", "write" and "update" cases and optimize messages in both directions. The analysis is obviously pretty similar for the various cases.
Finally, you might consider processing MSIL directly to get the information you need; you'll still have the flow-analysis and conservative-analysis issues. You might find the following technical paper interesting; it describes a fully distributed Java object system that has to do the same basic analysis you want to do, and does so, IIRC, by analyzing class files and doing massive bytecode rewriting:
Java Orchestra System
By RPC do you mean .NET Remoting? Or DCOM? Or WCF?
All of these offer the opportunity to monitor cross process communication and serialization via sinks and other constructs, but they are all platform specific, so you'll need to specify the platform...
You could listen for the event that a property is being read/written to with an interface similar to INotifyPropertyChanged (although you obviously won't know which method effected the read/write.)
I think the best you can do is explicitly maintain a dirty flag.
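A minimal sketch of the explicit dirty-flag idea (the type and members are invented for illustration):

public class RemoteState
{
    private string _name;
    public bool IsDirty { get; private set; }

    public string Name
    {
        get { return _name; }
        set { _name = value; IsDirty = true; } // every write marks the object dirty
    }

    // Serialize and resend state only when IsDirty; call this after a successful send.
    public void MarkClean() { IsDirty = false; }
}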
Is it bad practice to have a string like "name=Gina;position=HouseMatriarch;id=1234" to hold state data in an application? I know that I could just as well have a struct, class, or hashtable to hold this info.
Is it acceptable practice to hold delimited key/value pairs in a database field, just for use where the data type is not known at design time?
Thanks... I am just trying to ensure that I maintain good design practices.
Yes, holding your data in a string like "name=Gina;position=HouseMatriarch;id=1234" is very bad practice. This data should be stored in structs or objects, because it is hard to access, validate, and process data held in a string. It will take you much more time to write the code to parse the string to get at your data than to just use the language structures of C#.
I would also advise against storing key/value pairs in database fields if the other option is just adding columns for those fields. If you don't know the type of your data at design time, you are probably not doing the design right. How will you be able to build an application when you don't know what data types your fields will have to hold? Or perhaps you should elaborate on the context of the application to make the intent clearer. It is not all black and white :-)
Well, if your application only passes this data between other systems, I don't see any problems in treating it as a string. However, if you need to use the data, I would definitely introduce the necessary types to handle that.
I think you will find your application easier to maintain if you make a struct or class to hold the data and then add a custom property to return (and set) the string you have been using. The getter takes the fields and formats them into the string you are already using, and the setter does the reverse (takes the string and fills the fields). This way you maintain maximum compatibility with your old algorithms.
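A minimal sketch of that idea, using the three fields from the question (the parsing is deliberately simple and ignores the escaping issues raised in another answer):

public class PersonState
{
    public string Name { get; set; }
    public string Position { get; set; }
    public int Id { get; set; }

    // Round-trips the legacy "name=...;position=...;id=..." format.
    public string StateString
    {
        get { return string.Format("name={0};position={1};id={2}", Name, Position, Id); }
        set
        {
            foreach (var pair in value.Split(';'))
            {
                var parts = pair.Split('=');
                if (parts.Length != 2) continue; // skip malformed segments
                switch (parts[0].Trim())
                {
                    case "name": Name = parts[1]; break;
                    case "position": Position = parts[1]; break;
                    case "id": Id = int.Parse(parts[1]); break;
                }
            }
        }
    }
}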
Well one immediate problem with that approach is embedded escape chars. Given your example what would happen if the user entered their name as follows:
Pet;er
or
Pe=;ter
or
pe;Name=Yeoi;
I am not sure what state data it is you are trying to hold, and without any context it's hard to make valid suggestions. Perhaps a first step would be to replace this with key/value pairs (e.g. a dictionary); at least that negates the problem mentioned above and means you don't have to parse strings regularly.
I try not to keep data in any string-based format. But I have encountered several situations in which it was not possible to know the structure of the data in advance (e.g. it was possible for the customer/end-user to dynamically add fields).
In contrast to your approach, we decided to store the data in XML; in your case this would be something like this:
<user id="1234">
    <name>Gina</name>
    <position>HouseMatriarch</position>
</user>
This gives you the following advantages:
The classes to work with the data (read/write) are already available in the framework (e.g. XmlDocument or XML serialization); see the sketch after this list
You can easily exchange the data with other systems (if/when required)
You can store the data in a file
You can store the data in a database column (xml data type). You can even query that column when using SQL Server (although I'd try to avoid storing XML data that has to be queried)
Using XML allows you to add additional fields to your data at any time
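For instance, a minimal sketch of reading the fragment above with XmlDocument:

using System.Xml;

class XmlReadDemo
{
    static void Read()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<user id=\"1234\"><name>Gina</name><position>HouseMatriarch</position></user>");

        string id = doc.DocumentElement.GetAttribute("id");                  // "1234"
        string name = doc.SelectSingleNode("/user/name").InnerText;         // "Gina"
        string position = doc.SelectSingleNode("/user/position").InnerText; // "HouseMatriarch"
    }
}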
Update: I'm not sure why my answer was downvoted that much - maybe it is because of the bad example. Therefore I'd like to make it clear: I would not use XML for properties such as an ID/primary key of a user, or for standard properties like "name", "email", etc. But for "extended/dynamic" properties (as described above) I still think this is an easy and elegant solution.
If you want to store structured data in a string I think you should use a standard notation such as JSON.
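For the example from the question, the JSON form would be:

{ "name": "Gina", "position": "HouseMatriarch", "id": 1234 }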
It's bad practice because of the amount of effort you have to go to in order to construct the strings and parse them later. There are other, more robust ways of serialising data for passing between systems.
For core business data, suitably designed classes will be far simpler to maintain, and with all the properties strongly typed, you'll know early on when you mis-type a property name.
As for key-value pairs, I'd say they're sometimes Ok, sometimes not. If there are a lot of possible values, but not a lot of actually owned values, then it can be perfectly all right to use KVPs. Sebastian Dietz's alternative of having a separate column for each field would result in a lot of empty fields in that case. It would also mean extra work altering the table every time you needed a new one.
None of the answers has mentioned normalization yet, so I thought I would. When database fields are involved, one of the key principles of normalization is that each field in a table only represents one thing. Delimited fields violate that principle.
One of the guys at Red Gate Software posted this article along those lines that you may find useful.
Well, it just means that it is less searchable or indexable than a hashtable would be. It would also require a bit of processing to get into a state where it could be easily used by other bits of code. For example, a bit of code that queries the id in that data would be something horrible such as:
if(dataStringThing.Substring(26, 4) == SomeIdInStringFormat)
So yes, in most cases it is bad practice. However, there are other cases where this might be a default format that you need to retain the data in, or where performance means you should only parse it as and when required. So it may not always be a bad thing.
If you have reasons to keep it in that format, I would suggest transforming it into a class that separates out the fields, but also creating a ToString() implementation on that class that restores the original format if you need it. If performance is a concern, modify this object to parse the source into the fields only the first time those fields are accessed.
To reiterate: nothing in isolation is necessarily a bad practice. Bad practices are context dependent.
It (hopefully) obviously shouldn't be a normal choice. But there are cases where it's useful or necessary.
I can't think of any cases that wouldn't involve it being part of a communications protocol with some external service (e.g. a database connection string), so you're probably stuck with the format.
If you have a choice in the format (perhaps you are writing both sides of a system which can only communicate using strings), then at least choose something structured and well known. Examples of such have been given elsewhere, but the prime ones are naturally going to be XML or JSON. CSV, or some other delimited format may be useful in very simple cases (such as the database connection string) - but pay special attention to escaping delimiter characters (as the "Bobby Tables" joke (already referenced in another comment) nicely illustrated - google for him if you are not familiar with that one).
Your mention of a database suggests that this may be where the focus is. Are you trying to serialise application objects? (there are other ways of doing that). As another poster said, this may be a sign of a design that needs rethinking. But if you do need to store unknown datatypes in a DB, then XML may be an appropriate choice - especially if your DB supports XML fields. It's a bit of a minefield, though, so make sure you are familiar with how they work first.
I think it is not that bad if you are using a string list for managing your strings, especially when the structure of, e.g., a configuration string (or configuration database field) must be flexible.
But normally you should not do this, because of the disadvantages already mentioned.
It all depends on what you're trying to accomplish.
If you need a hierarchical format of data, or lots of fields that preserve data type, then no... a parsed string is a bad idea.
However, if you just need to transmit a string across a service and byte-conservation is important, then a Tag-Data pair may be exactly what you need.
If you do use a parsed string, it's important to be able to get at the data inside and quickly manage it. If you want an example TDP class, I posted one today to my website:
http://www.jerryandcheryl.net/jspot/2009/01/tag-data-pairs.html
I hope that helps.
I suggest considering these usage factors.
If you are processing the data within your own code, then you can use whatever data structures you wish. However, you may have issues developing your own implementation of a complex data structure, so consider using a pre-built one instead. Many come with whatever programming platform you may be using, while many more are documented in various books, articles, and discussions both printed and online. If you properly isolate your work from others, then you can safely do whatever you want.
On the other hand, if you need to share that data with others, then most careful consideration should be given. If you must share the data with an API, or via a storage mechanism (database, file, etc.), or via some transport (sockets, HTTP, etc.), then you should be thinking of others first and foremost. If you wish success and respect from your efforts, then you need to pay attention to standards and conventions and cost. Thankfully, practically any such use that you can imagine has been done before, so you can leverage others' efforts.
In a database, consider how others (and yourself) will be inserting, updating, deleting, and selecting the data. For example, using XML in a database makes all these steps unnecessarily hard and expensive compared to the alternatives. Pay attention to database normalization--learn it if you are not familiar already.
If you are dealing with text, pay attention to character encodings and make them explicit.
If there is an existing standard or convention for what you are doing, honor it. If there is a compelling reason to deviate, then accept the burden of justifying it, explaining it, and making it easy for others to accommodate your choices.
If you control both sides of a communication/transport medium, feel free to optimize. If you don't, err on the side of interoperability. Remember that a primary difference between the two scenarios is the level of self-description embedded with the data: interoperability has lots, optimization drops it based on shared assumptions. Text-rich data is more understandable, but binary is faster.
Think about your audience.