Skip List

Binary search on a sorted array is amazingly fast and easy to implement, but that speed is limited to searching. When there are many insertions into the sorted array, performance drops because each insertion costs O(n). A linked list removes the O(n) insertion cost, but it doesn't support the random access that binary search needs.
This is a write-up on a popular data structure called the skip list, which solves this insertion problem.

Introduction:
Skip lists use linked lists to search for elements. Linked list traversal is O(n) because we must visit every preceding element to reach a given one. What if we could reach an element without visiting every preceding element? That would reduce the complexity by a factor of the number of elements we skip. But how do we implement this?

Basic model of Skip list:
Look at the diagram below, which shows two layers of linked lists, one above the other. There is an original linked list containing all the elements, and above it an “express highway” list containing only some of them.

[Figure: a skip list with an express list layered above the original linked list]

When we search for an element, we start from the left end of the express list and compare each node's element with the one we are looking for. There are three things that can happen (a short sketch of this search follows the list):

1. The element is smaller than the one we are searching for. We move to the next element in the express list.

2. The element is equal to the one we are searching for. In this case we report that the element has been found.

3. The element is greater than the one we are searching for. We step back to the previous element of the express list, move one step down to the layer below (the original list), and continue the search towards the right from there.
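
Here is a minimal sketch of this search in C. The node layout (express nodes duplicated, each with a down pointer to its copy in the original list) and the assumption that the head key is no larger than any key we search for are choices made for this illustration; they are not taken from the code linked later in the post.

#include <stddef.h>

struct node {
    int key;
    struct node *next;   /* next node in the same layer */
    struct node *down;   /* same key one layer below, or NULL in the original list */
};

/* Returns the node containing `key`, or NULL if the key is absent. */
struct node *search(struct node *express_head, int key)
{
    struct node *cur = express_head;
    while (cur != NULL) {
        if (cur->key == key)
            return cur;                                   /* case 2: found */
        if (cur->next != NULL && cur->next->key <= key)
            cur = cur->next;                              /* case 1: keep moving right */
        else
            cur = cur->down;                              /* case 3: drop to the layer below */
    }
    return NULL;                                          /* fell off the bottom layer: not present */
}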


Complexity of skip lists:
In the worst case a search visits all the nodes of the express list plus all the nodes of one segment of the layer below, where a segment is the run of “normal list” nodes between two consecutive “express list” nodes.
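
As a rough worked example, assuming a single express layer with k evenly spaced nodes over n elements: a search costs about k + n/k steps, which is minimized when k = sqrt(n), giving roughly 2*sqrt(n) steps, i.e. O(sqrt(n)). Stacking more express layers on top of one another, each skipping over the layer below, brings the search cost down to the O(log n) that skip lists are known for.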


Why is this a parallel data structure?
At this point you might be wondering how this is better than a binary search tree (e.g. a red-black tree). The reason is that concurrent insertions and deletions on a binary search tree can corrupt the structure if no care is taken, because rebalancing can touch large parts of the tree. This typically forces the algorithm to lock the whole structure whenever an insertion or deletion is in progress.

In skip lists this is not necessary: the structure is simple enough that only local locks, on the few nodes being rewired, are needed to insert and delete.

Disadvantages:
Every structure has its own drawbacks. In skip lists there is a trade-off between space complexity and time complexity: the extra speed is paid for by storing the extra express lists (pointers).

This is enough to get a good idea of what skip lists are. Continue if you want to see how skip lists can be parallelized.

Parallelizing skip-lists:

We will parallelize the insertion of elements into the skip list. The serial code is at the following link:
Serial code
The parallelized version of the code is here:
Parallel code

How is the serial code implemented?
The serial code is a probabilistic algorithm (i.e. it assigns levels to nodes randomly). I have changed that to a balanced skip list for a fairer comparison between the serial and parallel code.
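
For reference, the usual probabilistic level choice looks roughly like this; the name random_level and the constant MAX_LEVEL are illustrative, not taken from the linked code.

#include <stdlib.h>

#define MAX_LEVEL 16

/* Flip a fair coin and keep growing the level while it comes up heads:
   about half the nodes reach level 2, a quarter reach level 3, and so on. */
static int random_level(void)
{
    int level = 1;
    while (level < MAX_LEVEL && (rand() & 1))
        level++;
    return level;
}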

The serial code is implemented very smartly to reduce storage. It uses two structures, “skiplist” and “snode”. “skiplist” holds the level and size and a pointer to the header of the skip list; the skip list itself consists of snodes. “snode” stores the key-value pair and an extra forward array. This forward array is the important part: it holds the next node at each level. snode_name->forward[1] gives you the next node in the bottom list, and snode_name->forward[2], if it exists, gives you the next node in the second list. Notice that this implementation is different from how I explained skip lists in the figure above: in my explanation the express list duplicates nodes, whereas here it does not. The figure above would be implemented like this: node1->forward[1] is a pointer to node2 and node1->forward[2] is a pointer to node5.
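
A sketch of those two structures in C; the field names follow the description above, but the exact declarations in the linked code may differ.

#define MAX_LEVEL 16          /* illustrative cap on the number of levels */

typedef struct snode {
    int key;
    int value;
    struct snode **forward;   /* forward[i] = next snode at level i (levels counted from 1) */
} snode;

typedef struct skiplist {
    int level;                /* highest level currently in use */
    int size;                 /* number of key-value pairs stored */
    snode *header;            /* sentinel node whose forward[] spans all MAX_LEVEL levels */
} skiplist;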

How are insertions done?
The function starts from the head node of the top layer. It iterates through the top layer until it reaches a node whose key is greater than the key to be inserted; at that point it drops to the layer below and continues moving right in that layer. It keeps dropping this way until it reaches the bottom-most layer, always stopping one node before the position where the new node will be inserted. It stores all these “one step back” nodes in an array called update. Then the actual insertions happen at the different levels, and each one is just like an insertion into an ordinary linked list.
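
A sketch of that walk in C, reusing the snode/skiplist declarations sketched above. node_level is assumed to be chosen by the caller (and to be at most MAX_LEVEL), duplicate keys and allocation failures are ignored, and the real code at the link may differ in its details.

#include <stdlib.h>

void skiplist_insert(skiplist *list, int key, int value, int node_level)
{
    snode *update[MAX_LEVEL + 1];   /* the "one step back" node at each level */
    snode *x = list->header;

    /* Walk down from the top layer, always staying just to the left of `key`. */
    for (int i = list->level; i >= 1; i--) {
        while (x->forward[i] != NULL && x->forward[i]->key < key)
            x = x->forward[i];
        update[i] = x;
    }

    /* If the new node is taller than the current list, the header is its
       predecessor at those extra levels. */
    if (node_level > list->level) {
        for (int i = list->level + 1; i <= node_level; i++)
            update[i] = list->header;
        list->level = node_level;
    }

    /* Splice the new node in at every level it participates in,
       exactly like an ordinary linked-list insertion. */
    snode *n = malloc(sizeof *n);
    n->key = key;
    n->value = value;
    n->forward = calloc(node_level + 1, sizeof(snode *));
    for (int i = 1; i <= node_level; i++) {
        n->forward[i] = update[i]->forward[i];
        update[i]->forward[i] = n;
    }
    list->size++;
}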

How can insertions be parallelized? We can insert elements using multiple threads, but we have to be careful because there are race conditions. Multiple threads can be used as long as the changes made by one thread don't affect the other.

How can we be sure that there won't be race conditions between the threads?
“If thread 1 is inserting a node with key value x, then thread 2 can update anything with a key smaller than the key value of the node just before x's position, without any race situations.”
For example, if you are inserting “7.5” in the above figure, then you can insert “3.5” in parallel without any race conditions.
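
As a toy sketch of that rule with two pthreads, reusing the skiplist_insert sketch above: the integer keys stand in for the figure's 7.5 and 3.5, and the memory-ordering details that the real parallel code linked above has to handle are glossed over here.

#include <pthread.h>

struct insert_arg {
    skiplist *list;
    int key;
    int value;
};

static void *insert_worker(void *p)
{
    struct insert_arg *a = p;
    skiplist_insert(a->list, a->key, a->value, 1);   /* level fixed at 1 to keep the sketch short */
    return NULL;
}

/* Thread 1 inserts key 8; thread 2 inserts key 3, which lies entirely to the
   left of the node preceding 8's position, so by the rule quoted above the
   two insertions should not interfere with each other. */
void parallel_insert_demo(skiplist *list)
{
    pthread_t t1, t2;
    struct insert_arg a1 = { list, 8, 80 };
    struct insert_arg a2 = { list, 3, 30 };

    pthread_create(&t1, NULL, insert_worker, &a1);
    pthread_create(&t2, NULL, insert_worker, &a2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}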

After parallelizing, the stats are:

In the original code, inserting the elements when the list is already huge took 25 ms.
After parallelizing, the same insertions took 17-18 ms in 4 out of 5 runs and 28 ms in 1 out of 5 runs. This variation is due to the fact that thread scheduling is nondeterministic, so the exact time can't be predicted.

Serial code execution time:

[Screenshot: serial code execution time]

Parallel code execution time:

[Screenshot: parallel code execution time]

Hardware details of the computer on which the code was executed:

[Screenshot: lscpu output]


Understanding False Sharing

A snippet from Wikipedia: “In computer science, false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that will never be altered by another party, but that data shares a cache block with data that is altered, the caching protocol may force the first participant to reload the whole unit despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource.”

Too abstract to understand easily, isn't it?
To understand this better, let's take a hypothetical situation. It goes like this:

There are three painters, and each has his own copy of the same wooden board to paint on. Each board has three divisions, say division 1, division 2 and division 3, and each painter may paint only one of these divisions. Whenever a painter paints his division on his own board, the other two boards must also be changed to reflect what he has done.


Here the wooden boards are analogous to cache blocks,
painters are analogous to parallel threads
and painting is analogous to write activity.

Remember that this update is logically unnecessary, because the divisions used by each painter do not intersect with the divisions used by the other painters. The wooden boards could simply have been reconciled once, after all the painters had finished. But this is not how our computer architecture works. The component that manages the caching mechanism does not know which part of the cache block was actually updated; it marks the whole block dirty. This forces a memory update to maintain cache coherency, which is very expensive compared to a write that stays within the cache block.

This problem only arises when there are writes and two parallel threads work on data that shares a cache block. The way to avoid it is to make sure that the data written by different threads lives in different cache blocks.
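
A common way to do that in C is to pad each thread's data out to a full cache line, so no two threads ever write within the same line. The 64-byte line size and the struct below are only an illustration; the real line size depends on your hardware.

#define CACHE_LINE 64
#define NUM_THREADS 4

/* Each thread gets its own counter, padded so that two counters
   can never share a cache line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_counter counters[NUM_THREADS];   /* counters[i] belongs to thread i */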

Performance hit due to false sharing:

This article would be incomplete without stats on the time difference between running with and without false sharing.

Refer to the code here:
https://github.com/MJjainam/falseSharing/blob/master/parallelComputing.c

Steps to run the code on Linux:

You need gcc to compile the code:
$gcc -pthread -o parallelComputing parallelComputing.c
Run the compiled binary:
$./parallelComputing

The above code has three parts (a rough sketch of the two parallel variants follows this list):
1. Serial computation.
2. Parallel computation with false sharing: false sharing in this part of the code is ensured by repeatedly updating two adjacent memory locations of an array. As the memory locations are adjacent, they will lie in the same cache block.
3. Parallel computation without false sharing: two elements with a considerable memory gap between them are used, to make sure they don't lie in the same cache block.
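
The sketch below has roughly the shape of parts 2 and 3; the exact code, including the timing, is at the GitHub link above.

#include <pthread.h>

#define ITERATIONS 100000000L
#define GAP 64                        /* elements, comfortably more than one cache line apart */

static long results[2 * GAP];

/* Each thread repeatedly bumps one slot of `results`; the choice of slots
   decides whether the two threads share a cache line. */
static void *worker(void *p)
{
    long *slot = p;
    for (long i = 0; i < ITERATIONS; i++)
        (*slot)++;
    return NULL;
}

static void run_pair(long *slot_a, long *slot_b)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, slot_a);
    pthread_create(&t2, NULL, worker, slot_b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}

int main(void)
{
    run_pair(&results[0], &results[1]);    /* adjacent slots: false sharing */
    run_pair(&results[0], &results[GAP]);  /* far-apart slots: no false sharing */
    return 0;
}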

Output stats:

[Graph: execution time vs input size for serial computation, parallel computation with false sharing, and parallel computation without false sharing]

The input size in the above graph is the number of times the loop in the expensive function is executed.
As you can see, parallel computation with false sharing is even worse than serial computation; that is how expensive it is to maintain cache coherency between two threads whose data shares a cache block. So it's very important to be aware of what's going on in your parallel program, right down to the cache level.

With this I end my article. If you have any doubts, please feel free to comment; I will get back to them as soon as possible.