Have you ever heard that "premature optimization is the root of all evil"? Probably yes. It's quite well known Donald Knuth's phrase. However, the whole cite is much less known:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
Why Am I writing about that? Because recently I had on occasion to fix an application which was written according to the fist part of this cite. Or even worse it was written according to the rule "any optimization is the root of all evil". Here are some examples what not to do and some tips what to do.
Reuse data
If you already retrieved some data from DB use it. It may sound obvious but apparently it isn't. I found the code that was working in the following way:
foreach(var obj in GetDataToProcess()) { // Let's do a lot of stuff Process(someObject.Id); } // ... public void Process(int id) { var obj = GetById(id); //... }In this example, someone firstly reads some data from DB and then each found object is retrieved again from DB based on its identifier. In this particular example, data returned from GetDataToProcess were exactly same as data returned from GetById. It's crazy and to make things worse GetById method was implemented in blasphemous way.
Do not read more data than needed
What do you think about such an approach to reading just one object from DB:
public IEnumerable<SomeObject> GetAll() { // Read all data from DB and return } public SomeObject GetById(int id) { return GetAll().SingleOrDefault(o => o.Id == id); }It's more than horrible. In order to read just one object from DB we read all of them (in this there were thousands of them) and then we filter them in the code.
Use dictionaries
Let's assume that GetById method from the above is implemented correctly i.e. it reads just one particular object from DB. It's much more better. But what if we must read thousands of objects in this way. Communication with DB costs a lot. It's much more better to read all data at once (if we have enough memory). This is one of the possible solutions:
private IDictionary<int, SomeObject> _cache; public void Begin() { _cache= GetAll().ToDictionary(o => o.Id,o => o); } public void End() { _cache = null; } public SomeObject GetById(int id) { if(_cache != null) return _cache[id]; // Read object from DB }If we know that GetById will be called many times, we should call Begin to cache data first. Thanks to that each call to GetById will read data from the cache instead of from DB.
Test your code with real data
All these bugs are not a problem if we work with small amount of data. But we should never assume that the same will be in the production. So always remember to test you code with real data. The simple rule of thumb may be.
Run your application and wait. If you start becoming irritated or bored, it may mean that something is wrong ;)
Do not use repositories
It's quite a common pattern to hide all the communication with DB in so called repositories. However, it's controversial to use this pattern together with ORMs. Why? Because ORMs are actually the implementation of the repository pattern so why to hide them behind another layer of repositories. Besides if ORM code is hidden in the repository it's easier to overlook the crappy implementation of some method of this repository. For example, see the initial implementation of GetById method.
By using these relatively simple techniques I was able to speed up the application considerably and decrease the processing time from around few hours to a few minutes.
Remember, the fact that premature optimization is the root of all evil doesn't mean that you can write crappy code.
*The picture at the beginning of the post comes from own resources and shows a toucan in Warsaw Zoo.
6 comments:
First one is based on bad data layer separation. It is not only optimization problem.
@Piotr Błaszyński - The code showed by me is really, really simplified. In the real application the data access layer was clearly separated from business logic. For example GetById method was not a part of class responsible for processing data but of a repository and was called in this way _repository.GetById(id)
True, true, true. Sometimes the only way is to use ninja optimization pattern.
I agree with your all your points but last. IMHO we should hide complex queries behind repository methods, this not only makes business code that uses them more readable but also promotes code reuse. Alternatively someone may use specification pattern to wrap complex query into object. I believe at hoc queries let to query logic duplication and what's more important they made code difficult to understand.
Of course if you want to improve performance it will be hindrance to you since you will not be able to see queries code immediately but if your goal is to change business logic you will appreciate good names of repository methods (e.g. getOpenOrdersBelongingToCustomer(someCustomer) instead of some orders.Where(o => o.State < 'xxx').Where(o => o.Customer = someCustomer).Single()).
"... GetById method was implemented in blasphemous way."
I LOLed at that.
@0xMarcin As to the repositories. You made many good points and sometime ago I would agree with you in 100%. However, the more I think about that the more doubts I have. In many cases all these repositories lead to the over engineering and clumsy code i.e. we have classes with dozens or more of methods like: getOpenOrdersBelongingToCustomer, getClosedOrdersBelongingToCustomer, getOpenOrdersBelongingToCustomerOrderedAsc, getOpenOrdersBelongingToCustomerOrderedDesc etc. Of course we may use the specification pattern or something similar to avoid that. However, it's an additional effort. Besides it means that we actually write something that we have for free if we directly use ORM mapper.
Post a Comment