Number crunching: Why you should never, ever, EVER use linked-list in your code again


Updates:

  1. 23rd April, 2014: I noticed that a few of the ideone examples lost their code, whether from someone hacked my account or not is unknown. For that reason I put up the ideone examples at GitHub. You can find them here.
    https://github.com/KjellKod/blog_linkedlist_vs_vector

  2. 21th August, 2012: A mirror of this blog-article but heavily debated can be found, with many reader comments, at  CodeProject

Mea Culpa – Please read before continuing

Let us look at the title again. It contains a very important prefix. The title says:

Number crunching: Why you should never, ever, EVER use linked-list in your code again.

The number crunching prefix is key to this whole article. If the “Number crunching:” part of the title is ignored it might be understood as a general advice to always stay away from linked-list, that is not the purpose of this article.

The point of the article is to explain why linked-list is so badly suited for number crunching tasks. What number crunching is will soon be explained. Even so there are of course always exceptions to the rule and so the title cannot be 100 percent true.

The title above is a hyperbole. It strongly emphasizes a purpose but being deliberately over the top. Since I do have mixed feelings about it, and obviously so do some readers then a possible title change could be:

Why you should never, ever, EVER use linked-list for number crunching  

However I decided to keep it since it could be more confusing to change it. A few readers have also commented that they think the title is just right. So let’s just leave it with that.

Now, why on earth did I decide in the first place for such an  unusual act as to put a hyperbole title on this blog-article? Honestly, because the message needed to get out. My impression is that the performance impact of using a linked-list for number crunching is not commonly known and that is why I decided to write this article.

I hope with this Mea Culpa  you can see beyond any initial confusion which is all my fault and see to the core issue:the cons of linked-list for a set of number crunching problems.

Table of Contents

Introduction

This article will explain why the linked-list is a less then ideal choice for dealing with number crunching.

Number crunching  is the term used here for describing the crunching, sorting, inserting, removing and simply juggling of raw numbers or POD data. This article is about why linked-list is not well suited for these tasks and should in fact be avoided. It is written with an attitude but backed up with facts and proofs. You are welcome, even encouraged, to verify these claims and all code and proofs are either accessible online or are available to download (at CodeProject).

The intention is to clearly show some obvious examples to prove a point. The article is not written to cover every special case and exception to its recommendation.

These obvious examples are for some readers edge cases, for other readers it is common programming tasks. Some readers think the point of the article is common sense, others that it is a joke and are shocked when they realize that it is in fact reality. I especially recommend the latter readers to not easily dismiss this article but in fact take time to study it. You might just get a new insight.

In Real Life

Number crunching tasks are sometimes similar to my examples below, sometimes not. They can be found in a variety of areas such as avionics technology, heart surgery equipment, surveillance technology and [...] well you name it. What to remember as you continue is that number crunching as used in this article is referring to numbers and POD data that is stored, retrieved, removed and sorted. Nothing weird, and very common and are always types that are cheap to copy

A few times in my professional career I have encountered performance issues where some time critical areas got bad performance just because of number crunching by linked-list. I have witnessed these linked-list related performance issues in a range of areas from real-time, embedded, small memory constrained PowerPC systems to x86 PC applications.

These times it came as a complete surprise to those involved. It was a surprise because to them it was proven by mathematical complexity models that it was a very efficient data structure for this or that use. That it was
slow was considered bad, but OK, because other data structures would be slower according to the same mathematical line of reasoning.

Unfortunately for a given problem set these mathematical models are likely to lead you astray. Following blindly the Big-O complexity of the data structures and algorithms might lead you to a conclusion that is completely and utterly false.

Big-O notations tells nothing about how one (data structure with algorithm) fare against another. Big-O will only tell you how performance will degrade as n increases. So comparing one a data structure that is RAM intensive to another data structure that is cache friendly from an abstract Big-O point-of-view is just pointless.  This is nothing new, but since the books and online resources rarely, or ever, mention this not many know about it, or at least tend to forget.

Disclaimer

Just in case my Mea Culpa and Introduction did not cover it. Let me be clear: This article is not disqualifying the linked-list for other types of usages, such as when it contains types that have pointers to other types, large and expensive-to-copy types (and yes, POD can also be too large).

This article is not about specific usage patterns where linked-list makes extra sense – because there sure are several of those. Take a peek later in the comments section and you can find several more, or less, good examples of such cases. This article is an opinionated article but backed up with hard facts. It does not discredit linked-list overall except in one specific area: number crunching.

Other types of containers are used to compare with the linked-list. The purpose is not to hype these containers but to show the areas where linked-list is lacking. It even goes as far as using these containers in clearly suboptimal ways which still by far outperforms the linked-list.

The code examples and container names in this article are from C++. std::vector would in Java be called ArrayList and the Linked-list in C++ is called std::list, which in Java it would be called LinkedList

OK, enough with the disclaimer, lets get going.

Scope

What data structure should you use? What data structure should you avoid?

Imagine that you have to use a data structure. It is nothing fancy, only a storage for raw numbers, or POD data carriers that you will use now and then and some later on.

These items will be inserted, then removed intermittently. They will have some internal order, or significance, to them and be arranged accordingly. Sometimes insertions and removal will be at the beginning or the end but most of the times you will have to find the location and then do an insertion
or removal somewhere in between. An absolute requirement is that the insertion and removal of an element is efficient, hopefully even O(1).

Now, what data structure should you use?

On-line resources

At a time like this it is good advice to double check with books and/or on-line resources to verify that your initial hunch was correct. In this situation maybe you recollect that vectors are generally fast for accessing an item but can be slow in modifying operators since items not at the end that are inserted/removed will cause part of the contents in the array to be shifted. Of course you also know that an insertion when the vector is full will trigger a re-size. A new array storage will be created and all the items in the vector will be copied or moved to the new storage. This seems intuitively slow.

A linked list on the other hand might have slow O(n) access to the item but the insertion/removal is basically just switching of pointers so this O(1) operations are very much appealing. I double check with a few different online resources and make my decision. Linked-list it is

BEEEEEEEEP. ERROR. A RED GNOME JUMPS DOWN IN FRONT OF ME AND FLAGS ME DOWN. HE TELLS ME IN NO UNCERTAIN WAYS THAT I AM WRONG. DEAD WRONG.

Wrong? How can this be wrong? This is what the online resources say:

CplusPlus.com: List
[ Regarding  std::list ] … Compared to other base standard sequence containers (vector and deque), lists perform generally better in inserting, extracting and moving elements in any position within the container, and therefore also in algorithms that make intensive use of these, like sorting algorithms.

Cplusplus.com : Vector
[ Regarding std::vector]  … vectors are generally the most efficient in time for accessing elements and to add or remove elements from the end of the sequence. [...] For operations that involve inserting or removing elements at positions other than the end, they perform worse than deques and lists.

Wikipedia on Sequence Container: List
Vectors are inefficient at removing or inserting elements other than at the end. Such operations have O(n) (see Big-O notation) complexity compared with O(1) for linked-lists. This is offset by the speed of access — access to a random element in a vector is of complexity O(1) compared with O(n) for general linked-lists and O(log n) for link-trees.

In this example most insertions/removals will definitely not be at the end, this is already established. So that should mean that the linked-list would be more effective than the vector, right? I decide to double check with Wikipedia, Wiki.Answers and Wikibooks on Algorithms. They all seem to agree and I cannot understand what the RED GNOME is complaining about.

I take a break to watch Bjarne Stroustrup’s Going Native 2012 keynote. At time 01:44 in the video-cast and slide 43, something interesting happens. I recognize what the RED GNOME has tried to to tell me. It all falls into place. Of course. I should have known. Duh. Locality of Reference.

Important Stuff comes now

  1. It does not matter that the linked-list is faster than vector for inserting/removing an element. In the slightly larger scope that is completely irrelevant. Before the element can be inserted/removed the right location must be found. And finding that location is extremely slow compared to a vector. In fact, if both linear search is done for vector and for linked-list, then the linear search for vector completely, utterly and with no-contest beats the list.
  2. The extra shuffling, copying overhead on the vector is time-wise cheap. It is dirt cheap and can be completely ignored if compared to the huge overhead for traversing a linked-list.

Why? you may ask? Why is the linear search so extremely efficient for vector compared to the supposedly oh-so-slow linked-list linear search? Is not O(n) for linked-list comparable to O(n) for vector?

In a perfect world perhaps, but in reality it is NOT! It is here that Wiki.Answers and the other online resources are wrong! as the advice from them seem to suggest to use the linked-list whenever non-end insertion/removals are common. This is bad advice as they completely seem to disregard the impact of locality of reference.

Locality of Reference I

The linked-list have items in disjoint areas of memory. To traverse the list the cache lines cannot be utilized effectively. One could say that the linked-list is cache line hostile, or that the linked-list maximizes cache line misses. The disjoint memory will make traversing of the linked-list slow because RAM fetching will be used extensively.

A vector on the other hand is all about having data in adjacent memory. An insertion or removal of an item might mean that data must be shuffled, but this is cheap for the vector. Dirt cheap (yes I like that term). The vector, with its adjacent memory layout maximizes cache utilization and minimizes cache line misses. This makes the whole difference, and that difference is huge as I will show you soon.

Let us test this by doing insertions of random values. We will keep the data structure sorted and to get to the right position we will use linear search for both vector and the linked-list. Of course this is silly way to use the vector but I want to show how effective the adjacent memory vector is compared
to the disjoint memory linked-list.

The Code

/* NumbersInVector &randoms is pre created random values that are stored in a std::vector. This way the same random values can be used for testing and comparing the vector to the linked-list.

    Example:
    0, 1, 8, 4, 1 will get sorted to:
    0, 1, 1, 4, 8   */

template
void linearInsertion(const NumbersInVector &randoms,
                     Container &container)
{
 std::for_each(randoms.begin(), randoms.end(),
 [&](const Number &n)
 {
    auto itr = container.begin();
    for (; itr!= container.end(); ++itr)
    {
       if ((*itr) >= n) { break; }
    }
    container.insert(itr, n); // sorted insert
 });
}

/* Measure time in milliseconds for linear insert
   in a std container */
template
TimeValue linearInsertPerformance(const NumbersInVector &randoms,                                   Container &container)
{
 g2::StopWatch watch;
 linearInsertion(std::cref(randoms), container);
 auto time = watch.elapsedMs().count();
 return time;
}

The time tracking (StopWatch that I wrote about here) is easy with C++11… Now all that is needed is the creation of the random values and keeping track of the measured time for a few sets of random numbers. We measure this from short sets of 10 numbers all the way up to 500,000. This will give a nice perspective.

void listVsVectorLinearPerformance(size_t nbr_of_randoms)
{
std::uniform_int_distribution distribution(0, nbr_of_randoms);
std::mt19937 engine((unsigned int)time(0)); // Mersenne twister MT19937
auto generator = std::bind(distribution, engine);
NumbersInVector     values(nbr_of_randoms);

// Generate n random values and store them
std::for_each(values.begin(), values.end(), [&](Number &n)
{ n = generator();});

NumbersInList      list;
TimeValue list_time = linearInsertPerformance(values, list);
NumbersInVector    vector;
TimeValue vector_time = linearInsertPerformance(values, vector);

std::cout << list_time << ", " << vector_time << std::endl;
}

Calling this for values going all the way up to 500,000 gives us enough values to plot them in a graph and which shows exactly what Bjarne Stroustrup told us at the Going Native 2012. For removal of items we get a similar curve but that is not as time intensive. I have google doc collected some timing results that I made on IdeOne.com and on a x64 Windows7 and Ubuntu Linux laptop for you to see.

Can you see the red fuzz at the bottom? The graph is not displayed incorrectly. That red fuzz is the time measurements for vector. 500,000 sorted insertions in a linked-list took some 1 hour and 47 minutes. The same number of elements for the vector takes 1 minute and 18 seconds.

You can test this yourself. The code is attached with this article. For a quick test you can try out the smaller (max 15 seconds) test of up to 40.000 elements at the pastebin and online compiler IdeOne: http://ideone.com/62Emz

Locality of Reference II

The processor and cache architecture has a lot of influence on the result obviously. On a Linux OS the cache level information can be retrieved by

grep . /sys/devices/system/cpu/cpu*/index*/cache

On my x64 quad-core laptop that gives each core-pair 2x L1 cache a 32 KBytes, 2x L2 cache a 256 KBytes. All cores share a 3 MBytes L3 cache.

Conceptual drawing of quad core Intel i5

I thought it would be interesting to compare my quad-core against a simpler and older computer. Using the CPU-Z freeware software I got from my old dual-core Windows7 PC that it has 2xL1 caches a 32KBytes and 1xL2 cache a 2048 KBytes.

Testing both with a smaller sets of values shows the cache significance clearly (disregarding different bus bandwidth and CPU Speed )

x86 old dual-core PC

x64 quad-core Intel i5

Devils Advocate

Of course there are always exceptions, special cases, and objections to either using vector or the advice of not using linked-list. If we can ignore the obvious non-related cases  and focus on just the POD or number crunching then there are still some valid concerns that should be addressed.

Devil’s Advocate I  – Cost of copying

For vector the traditionally perceived overhead for copying, shuffling and resizing (with memory allocations and copying) is sufficient reasons for some (ref: [2] to [7]) for choosing the linked-list over the vector.

Above was shown how the perceived inefficiency of the vector was in fact dirt cheap compared to the cost of traversing a linked-list. The linked-list item insertion is cheap but to get to the insertion place a very expansive traversal in disjoint memory must be done.

The examples above are somewhat extreme. The integer examples brings out another bad quality of linked-list. The linked-list need 3 * size of integer to represent one item when the vector only need 1. So for small data types it makes even less sense to use a linked-list.

However, what would happen if the cost of copying was not so cheap? If the cost for modifying the vector wouldincrease?

For larger types you might see the same behavior as shown above but obviously the times would differ. As the types grow in size and becoming more expensive then finally the linked-list would outperform the vector which has to shuffle and copy a lot of objects. Using the C++ move operator would perhaps change this. I have not tested with move operator so that is up to you to try out for yourself. 

As PODs grow larger in size, the copying cost will also grow. The overhead in copying and shuffling will take its toll. The cache architecture and the cache’s line size will also make an impact. If the POD size is larger than the cache-line size then the cache usage will be suboptimal compared to when used for smaller POD size.

At some POD size this extra overhead will even be more time consuming then the linked-list slow traversal. At this point choosing to use a vector is a worse choice then choosing the linked-list, even for the scenario used earlier in this article.

This is shown in the three graphs below. The tests were done for 200, 4000 and 10.000 elements. Each graph represents a fixed number of elements, where the byte size of the PODs is increasing.

Observe the intersection between the linked-list and the vector performance. It is at the intersection that the performance goes from being in favor of vector to being in favor of linked-list for that number of elements

From the graphs it is obvious that for the same number of elements the POD size has virtually no impact on the linked-list performance. For vector the POD size is of massive importance. The cost of copying increase proportionally with the size.

Efficiency when using the linked-list is limited by the cost of RAM access and thus almost POD size ignorant. However as the number of items increase the total cost for RAM access will be more expensive and substantially larger PODs (bytes) are needed before linked-list will outperform the vector.

The data is collected in a google document and also in the CodeProject download area in excel format. A quick look using ideone.com can be done at ideone.com/W9vpT.

Devil’s Advocate II : Linear, sorted insert with smart linked-list

A common objection to the linear, sorted insert comparison shown above is that since the linked-list linear is known to be slow the iterator to the last inserted location should be kept. That way on average the linear search could benefit from O(n/2) instead of O(n).

When testing this it showed what huge impact the cache architecture has on the performance on adjacent memory structures. The first test run was made at ideon.com/XprUU. The ideone online compiler has unknown cache architecture and since it is serving multiple runs simultaneously it is likely that the caches utilization is unpredictible and RAM usage is heavy.

With this likely cache under utilization the smart linked-list was clearly the winner. At least up to a certain number of elements. It took up to 6000 elements before the vector had caught up to the smart linked-list. This was a huge surprise to me! but it made sense when I tested it with a known cache architecture.

The same test was run on a known cache architecture, my x64 laptop, with L1, L2 and L3 cache. It immediately showed the benefits of cache utilization for adjacent memory structures. The smart linked-list and the vector kept even pace up to about 100 elements, at 500 elements the difference was around 20 percent and steadily increasing.

If you take a peek at its data sheet you can see the performance data both when running at ideone.com/XprUU and on the x64 laptop.

On other computers with different cache architectures and when using other languages (Java, C#, etc) this will give different results..

Devil’s Advocate III : Fragmented memory makes vector a big no-no

For fragmented memory, or systems where the memory can be suspected to be fragmented, or for huge vectors, then it might be beneficial to use list-of-vectors instead. There are a few different types to choose from: unrolled linked-lists [12] or the std::deque are two.

I have not tested the fragmented memory scenario as I simply could not come up with a way to force the memory to be easily fragmented. On 32-bit dual-core PC and 64-bit quad-core PC I never once encountered allocation exception, even for multiples of the 500.000 items comparison testing.

Devil’s Advocate IV : Algorithms with merge or sort

Yes, what about it? The online resources pointed out earlier are stating that merge or sort of the containers is where the linked-list will outperform the vector. I am sorry to say but this is not the case for a computer with a modern cache architecture. At least not in the area of number crunching, and is it not there that sort are most commonly used?

Once again these resources are giving you bad advice.

From a mathematical perspective YES it does makes sense when calculating big-O complexity. The problem is that the mathematical models (at least the ones I have seen) are flawed. At least flawed in that aspect that making a decision for what to use from the big-O complexity value is just not good enough.

Computer cache, RAM and memory architecture are completely disregarded and only the mathematical, SIMPLE, complexity is regarded.

To show this some simple sort testing is laid out below.

void listVsVectorSort(size_t nbr_of_randoms)
{
std::uniform_int_distribution distribution(0, nbr_of_randoms);
std::mt19937 engine((unsigned int)time(0)); // Mersenne twister MT19937
auto generator = std::bind(distribution, engine);
NumbersInVector  vector(nbr_of_randoms);
NumbersInList list;

std::for_each(vector.begin(), vector.end(), [&](Number& n)
{ n = generator(); list.push_back(n); }    );

TimeValue list_time;
{  // list measure sort
g2::StopWatch watch;
list.sort();
list_time = watch.elapsedUs().count();
}

TimeValue vector_time;
{  // vector measure sort
g2::StopWatch watch;
std::sort(vector.begin(), vector.end());
vector_time = watch.elapsedUs().count();
}

std::cout <<  nbr_of_randoms << "\t\t, " << list_time << "\t\t, " << vector_time << std::endl;
}

The sort testing I made is available on the google spreadsheet on tab “sort comparison“. I used std::sort(vector) vs std::list::sort.

Sure, it is two different algorithms who likely use two different algorithms so this test might be a leap of faith. Either way it can be interesting since both algorithms are at their best and this is where the linked-list supposedly should be the winner. No big surprise that it was not.

The std::sort (vector) beats the std::list::sort hands-down. The complete code is attached and available here http://ideone.com/tLUeK.

The graph below show the sort performance from 10 up to 1.000 elements. The cost is always higher for the linked-list. Both show a proportional cost increase as the number of elements increase but the linked-list has a steeper incline. The linked-list sorting will compared to the vector be obviously more expensive per item as the elements increase.

Testing this on a modern x64 computer with better CPU-cache architecture would increase the difference between std::list and vector even more.

As the number of elements continue to grow this initial performance perspective does not change. The graph below shows from 10 to 100.000 elements but has virtually the same proportional properties as when compared 10 to 1.000.

The corresponding sort test but for merge was not tested. However,  for merge of two sorted structures, that after the merge should still be sorted, then that involves traversal again and once again the linked-list is no good.

Devil’s Advocate V : Insert at first position

   A CodeProject reader gave the response: “6 good reasons to always use linkedlist
“[If] Your insertion is always done [at] the index 0 or somwhere in the middle. No matter how large list grows the operation perfomed in constant time O(1). With vectors it is O(log(n)). The larger the array the slower the insertion.” 

Another reader wrote in my blog :
“[A] very common example when adding elements to the beginning a list. Using a LinkedList this would be a fairly simple operation (just swapping some pointers) while the ArrayList  [Java version of dynamic array] would struggle to make such insertion (moving the entire array)

Copy and pasting from my own article comment answer to the first reader:

“What you wrote above is completely accurate. You are not the first to mention this and while it is so right it is also so wrong. Let me explain.

The scenario you outlined is very believable. In fact putting one piece of data in front of an older piece of data and so forth is perhaps the most common task for data collecting?

In this scenario it makes sense to do insert at “index 0” for the linked-list. This does not makes sense for the vector.

The default insertion on vector should be done by push_back. Insertion in the direction of growth. An obvious choice to avoid shifting the whole previous section. This is made abundantly clear in C++ where thestd::vector[^] have push_back, and insert but lacks completely the function push_front.

So, let us not be silly [naive] here, right? Doing push_back on the vector instead and comparing that to insert at index 0 on the linked-list that is probably the valid options a programmer would choose from. Don’t you agree?

push_back on the vector does not need to shuffle everything for every new element. The overhead when the vector is full and need to be re-sized still applies.“

A code and performance comparison between these three insertions can be seen at http://ideone.com/DDEJF. It compares between the common linked-list operation : insert at “index 0” vs std::vector push_back vs a naivestd::vector “ push_front “. 

From: http://ideone.com/DDEJF. Times are given in microseconds [us]
elements, list, vector, vector (naive)
10,       3,     6,     1 
100,      7,     3,     13 
500,      43,    7,     83 
1000,     70,    10,    285 
2000,     149,   27,    986 
10000,    928,   159,   22613 
20000,    1578,  284,   97330 
40000,    3138,  582,   553636 
100000,   7802,  1179,  3555756  
Exiting test,. the whole measuring took 4266 milliseconds (4seconds or 0 minutes)

What to Do Now?

Donald Knuth made the following statement regarding optimization:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”.

Meaning that “Premature Optimization” is when the software developer thinks he is doing performance improvements to the code without knowing for a fact that it is needed or that it will greatly improve performance. The changes made leads to a design that is not as clean as before, incorrect or hard to read. The code is overly complicated or obfuscated. This is clearly an anti-pattern.

On the other hand we have the inverse of the extreme Premature Optimization anti-pattern, this is called Premature Pessimization. Premature pessimization is when the programmer chose techniques that are known to have worse performance than other set of techniques. Usually this is done out of habit or just because.

Example 1: Always copy arguments instead of using const reference.

Example 2: Using the post increment ++ operator instead of pre increment ++. I.e value++ instead of ++value.

The cost for incrementing native types pre- or post-increment is inconsequential,but for class types the performance difference can have an impact. So using value++ always is a bad habit that when used for another set of types actually is degrading performance. For this reason it is always better to just do pre increments. The code is just as clear as post increments and performance will never be worse

Using linked-list in a scenario when it has little performance benefit (say 10 elements only touched rarely) is still worse compared to using the vector. It is a bad habit that you should just stay away from.

Why use a bad performance container when other options are available? Options that are just as clean and easy to work with and have no impact on code clarity.

Stay away from Premature Optimization but be sure to not fall into the trap of Premature Pessimization! You should follow the best (optimal) solution, and, of course, common sense without trying to optimize and obfuscate the code and design. Sub-optimal techniques such as linked-list should be avoided since there are clearly better alternatives out there.

Java and C#

One comment got when this article was published was something in the line of

This might be true for C++ but for Java or C# this is not so important. Those languages are not chosen for speed but for faster development time

This might be true but actually since the speed aspect will likely hit Java and C# just as fast and hard as C++ code it might still be important.

I got help from a friendly reader for a Java try. After some changes proposed by other readers (ref  comments) I had a similar linear search, then insert example at  http://ideone.com/JOJ05.

In Java the C++ std::list equivalent is called LinkedList. The C++ dynamic array std::vector is in Java called ArrayList.  There is one big difference between these libraries, both the Java LinkedList andArrayList can only contain objects. This means that the boost of adjacent memory will be somewhat less for an ArrayList containing Integer objects.  To access and compare values for the search there is one extra indirection, from reference to value.

To see what difference that made for ArrayList I threw in a DynamicIntArray to get a truly cache friendly structure to compare with. The DynamicIntArray use int, not Integer objects.

For Java the LinkedList showed immediately that it was performing worse than the ArrayList. For the test on ideone.com (http://ideone.com/JOJ05) the LinkedList was performing from a factor of 1.5 to a factor of 3 slower then the best array container.

On my x64 ultrabook the cache architecture is different and the cache friendly structures got a nice boost. The LinkedList was then a factor  1.5 to factor 15 slower then the DynamicIntArray for the same range of elements.

Integer object slow down for ArrayList
On ideone.com the extra indirection overhead of using Integer objects instead of int was roughly a 30% slow down for ArrayList in comparison to the faster DynamicIntArray.  On my x64 laptop the ArrayList took roughly twice (200%) as long time to finish as the DynamicIntArray (still faster than at ideone.com)

Much more Java specific testing can be read in another blog entry, so if you would like to read more about it (filtering, sorting, big-O discussion) you can read about it in Java Battle Royal: LinkedList vs ArrayList vs DynamicIntArray 

Conclusion

This article is not to discriminate the linked-list for all intents and purposes. It does show however the shortcomings of the linked-list in the area of number crunching.

Traditionally, yes even today, books and online resources almost always completely disregard computer architecture and how this impacts with the concept of locality of reference. These resources calculate algorithm big-O efficiency and makes recommendations that are just plain wrong as there are huge difference between RAM intense data structures and cache friendly structures. In reality the cache architecture has an enormous impact.

Above I showed some silly usage of C++ std::vector. Instead of using the fast [] operator or some kind of tree structure the scenarious played out equal to not discredit the std::list unnecessarily, in fact some were made to favor the linked-list. It showed that for small, cheap to copy data, raw numbers, or POD an adjacent memory structure is to prefer to a disjoint memory structure. The perceived overhead for new allocations, copying, shuffling that goes with a dynamic array like the Vector is next-to-nothing compared to the very, very slow RAM fetching inolved whenever accessing elements in a linked-list.

When it comes to number crunching then the vector or other adjacent memory structure should be the default and not the linked-list. Of course every new situation needs to be analyzed and data structure and algorithm chosen appropriately. Apply the knowledge you got reading this wisely, only use the linked-list if it will give benefits not easily retrieved from an array-based structure.

Remember that the performance issues with linked-list is usually not the main performance issues for your software. But why using a clearly sub-optimal data structure when the dynamic vector can just as easily be used? Code readability is the same and using linked-list for number crunching is to me an obvious case of premature pessimization.

When we are in the business of number crunching why not use your understanding of the linked-list disadvantages and stay away from the linked-list for the time being? At least until you are sure you need it. Your code will be better from it and performance will be the same or better, sometimes even much better.

Feedback
I recommend to check out the comments below or at the comments section at CodeProject, maybe leaving one yourself. There are lots of opinions and even some interesting exceptions to my advice and in general very smart comments by equal smart coders. Thanks to them this article got a makeover when I addressed their concerns.

Last, a little plea. Please, before I get flamed to pieces for this obstinate article, why not run some sample code on ideone.com or similar, prove me wrong and educate me. I live to learn =)

Oh, and yes. I know, thank you for pointing it out. There are other containers out there, but this time I focused on linked-list vs vector 

References

  1. Keynote by Bjarne Stroustrup, GoingNative 2012
  2. cplusplus.com std::list reference
  3. cplusplus.com std::vector reference
  4. Wikipedia on sequence containers
  5. Wiki.Answers: Difference between [Linked-] List and Vector
  6. WikiBooks: Advantages / Disadvantages [List vs Vector]
  7. Wikipedia: Linked-List vs Dynamic Array
  8. Wikipedia: On optimization and Donald Knuth
  9. CodeProject: Pre-allocated Arrayed List, by Vinod Vijayan
  10. Blog response by George Aristy Performance comparison: Linked-List vs. Vector [Java]
  11. Future Chips: Should you ever use Linked-Lists?
  12. MSDN blogs: Unrolled linked lists
  13. The CodeProject version of this blog-entry and all its readers’ comments
  14. KjellKod.Wordpress Java follow up after improvement suggestions from readers
About these ads

About kjellkod

Software Engineer by trade, (former?) Martial Artist and Champion by strong will and small skill... and a Swede by nationality :)
This entry was posted in C++, Software Engineering. Bookmark the permalink.

33 Responses to Number crunching: Why you should never, ever, EVER use linked-list in your code again

  1. ldonoso says:

    I get the point but Maybe the program is not adequate to test the cost of inserting/removing:
    – The cost of traversing supersedes the cost of insertion. And an insertion/removing doesn’t invalidate the iterators so you can may save traversals in a real program.
    – The constructor of an int is free, what about an expensive constructor? (a move constructor would solve this point)

    Also lists have advantages over vectors:
    – excepcion save
    – algorithms splice/merge, sort

    • kjellkod says:

      [ 2012, 3rd of March: Updated my response. It was initially befuddled by beer ;) and now I did do some sort testing as well. I hope this is more clear ]

      Hi Idonoso and thank your for your input. It is much appreciated.
      Very good objections to my blog-article. Unfortunately they are only true according to the textbooks and the references I pointed out earlier. You know, the same online references that obviously are wrong for modern processor-cache architectures.

      Especially “sort” is pointed out (in the references) as more effective for linked-list, I show below that is obviously also false. I am happy to be proven wrong,. so please go over my code (below). No fault on you though Idonoso. Who would not trust these resources? At least I have for years. From now on, no more.

      [...] an insertion/removing doesn’t invalidate the iterators so you can may save traversals in a real program.

      If traversal the linked-list can be avoided, or, minimized then of course the O(1) insert would be appealing. However, even then you need the traversal from these temporarily saved pointers,. and like described above traversal on linked-list is extremely slow (compared to vector). However I do think this would be fun to test, I would not be surprised if this is normally voiced objection over the argumentation I make.

      Also lists have advantages over vectors:
      – excepcion save
      – algorithms splice/merge, sort

      1. I am not sure what you mean with “exception safe”,. could you explain more?

      2. splice/merge sort. Yes, and NO. From a mathematical perspective YES. The problem is that the mathematical model (at least the ones I’ve seen) are flawed. Computer cache, RAM, memory architecture are completely disregarded and only the mathematical, SIMPLE, complexity is regarded.

      Not to go overboard here I still did some testing for sort . I have not done it for merge but if with merge you mean merging two sorted structures, and after the merge they are still sorted, then that involves traversal again and once again linked-list is no good.

      The sort testing I made is available on the google spreadsheeton tab “sort comparison“. I used std::sort (vector) vs std::list::sort. The std::sort (vector) beats the std::list::sort hands-down. The code is available here http://ideone.com/tLUeK

      • dacwe says:

        “However, even then you need the traversal from these temporarily saved pointers,. and like described above traversal on linked-list is extremely slow (compared to vector).”

        No, no, no! Check the “filtering” example I provided in my follow-up comment below (or direct link: http://ideone.com/KM5Ev – O(n^2) for the array list, O(n) for the linked list).

        2. splice/merge sort. Yes, and NO. From a mathematical perspective YES. The problem is that the mathematical model (at least the ones I’ve seen) are flawed. Computer cache, RAM, memory architecture are completely disregarded and only the mathematical, SIMPLE, complexity is regarded.

        You are comparing the wrong stuff… std::sort needs random access std::list::sort doesn’t. Different algorithms give different performance, why is that so impressive? And looking at your data the list::sort version is 5 times slower. But depending on the other code you have the linked list could be the much better choice (for example de-queing sorted elements by deleting the first element in the list)…

        The mathematical perspective is not flawed. Not at all.

  2. Llorllale says:

    Hi KjellKod

    This article is very interesting. I did the same test in Java and came up with similar results. I posted the results at http://llorllale.blogspot.com/2012/03/performance-comparison-linked-list-vs.html.

    I do have a couple of queries for you. Can you be a bit more specific on the hardware you were using to test? In the 10-4000 element range, which is where I successfully tested, my implementation in Java actually seems a lot *faster* (gasp) than yours in C++11. Or perhaps my implementation didn’t follow your logic closely.

    Another thing: you determined that it took almost two hours to sort 500K elements into a linked-list, yet how long was your program running by the time it got to that point? Initially I left mine running overnight for ~9 hours and it got NOWHERE CLOSE to half a million elements.

    • dacwe says:

      That’s because your LinkedList sorting algorithm is O(n^3) and the ArrayList is O(n^2). You are looking up elements by index – that is really fast using array list (constant time) but using a linked list it’s not (linear time).

      Change to use an Iterator and both will be in O(n^2). Then you should only see a factor, not a exponential difference.

      • kjellkod says:

        Thank you David. I corrected that and also an ArrayList mishap. The ArrayList was treaded (by API) as a List which made the random access .get(index) too slow

        The result can be seen here: http://ideone.com/JOJ05

        For good measure I also added a DynamicIntArray as the ArrayList looses some of the cache boost by having one object indirection.
        Here is what the updated chart would look like:
        Slow linear search, then insert of random values

  3. Codinpsycho says:

    Hey man…great article….and just to let u knw….today i was talking to a senior programmer at my company …with the same argument about cache line misses and supporting vectors…and when at night i came home..by chance i got to ur article and it just confirmed my doubts…well really appreciate ur work….world ill be a better place without linked lists … ;)

    • dacwe says:

      world ill be a better place without linked lists …
      Nope, I can give you infinite number of examples where linked lists outperforms array lists…

  4. kjellkod says:

    @Codinpsycho,
    Heh, well said ;)

    In one way I hope that the senior programmer knew about this,.. but I am kind of reading between the lines that he did not. In which case I feel better that this little over-the-top text got squeezed out.

  5. kjellkod says:

    @Llorale very cool that you implemented the java test and thanks for juggling java-C++ comparison over e-mail. For the benefit of others I can mention that testing the java/c++ for “linear-sorted-insert” gave the following result.

    ********* C++ Times in microseconds from http://ideone.com/pi2Od **********
    Elements ADD (List, Vector) ERASE(List, Vector)
    100, 14, 20, 21, 8
    200, 29, 36, 33, 19
    500, 136, 122, 132, 73
    1.000, 415, 413, 446, 185
    4.000, 5865, 5573, 5883, 2081
    10.000, 96754, 33850, 109992, 11808
    20.000, 583703, 138520, 606003, 46546
    Exiting test,. the whole measuring took 1650 milliseconds (1seconds or 0 minutes)

    Java in milliseconds from shamelessly hacked from your beautiful code http://ideone.com/u5wbd
    #Elements,LinkedList,ArrayList,Vector
    100, 2, 2, 0
    200, 1, 1, 0
    500, 11, 3, 4
    1.000, 78, 11, 16
    2.000, 576, 38, 60
    3.000, 1958, 85, 136
    4.000, 5731, 156, 248

    Apart from the speed variances I think it was really interesting how fast the linked-list just “peaked” eh, I mean performance-crashed. I think it is an even more valid point in Java to be careful here for even small numbers such as 4000 integers the linear-search become very cumbersome. Almost 6 seconds! And who said “linked-list” and cache-aware did not matter?

    Great that you brought this to my attention!

  6. rc says:

    As for Java: I understand how an array of ints would minimize cache misses. But what about an an ArrayList of Integers? Such an ArrayList would contain references to Integer objects. The references would be next to each other in memory, but not the Integer objects? Don’t you lose locality when you dereference the Integers? What am I missing?

    • rc says:

      Response to my own reply: Perhaps there was a “scalar replacement” optimization that allocates the Integers on the stack, instead of the heap. This would improve cache locality. But you need this optimization for ArrayList to win?

      • kjellkod says:

        Hi rc,

        First off, Java is not my expertise but to my understanding the primitive types in Java are strictly-speaking not normal objects and reasoning about them like normal objects can lead you astray. This can be easily confusing, I was not sure myself so I had to look it up.

        In Java the primitive types (called simple types in C#) are copied by value and not by reference. So in this aspect it is no difference between my Java and C++ examples.
        I.e. the value can easily (by default) take advantage of the boosed performance thanks to the natural locality-by-reference, in fact the primitive types were probably designed with that purpose.

        You can look up “Unified type system” and “Java primitive types” for more information. Here’s a start http://en.m.wikipedia.org/wiki/Comparison_of_C_Sharp_and_Java#section_1 browse down to “Simple/primitive types”

      • rc says:

        Yes, but your example (http://ideone.com/u5wbd) did not use a primitive type, that is, the example did not use an array of ints, it used ArrayList , which is an ArrayList of (references to) Integer objects. This would be similar to a C++ vector of int*. The Java Integer objects are stored on the heap, unless an “escape analysis” determines that a “scalar replacement” optimization can be applied to store the Objects on the stack instead. Without this optimization, I’m not sure that ArrayList would be much faster than LinkedList – since the Integer objects would be on the heap, lcache locality would be lost. (While array of int will probably be faster than both ArrayList of Integer and LinkedList of Integer.)

    • Llorllale says:

      The answer is that with containers based on an array that are holding references to other objects, like ArrayList and Vector, the add and remove operations manipulate the ordering of the _references_ only, and these are located in contiguous space in memory like you said. The objects to which said references are pointing to are irrelevant and are always in the heap.

      In case you missed it: the reason why these data structures are faster in practice at adding and removing items from positions other than the beginning or end is because modern processors are geared towards optimizing access to _continuous_ memory space. Ie, when the cpu goes to memory to fetch some data it will fetch all the data found on the same “line” into its cache. If you need to access some other data, and this data is found right next to the first, then you save yourself the more expensive operation of the cpu going all the way to RAM and just work with its own internal cache.

      Most textbooks and search results say otherwise because they talk purely from a theoretical point of view and don’t include these optimizations into their analysis.

      • rc says:

        kjellkod said “In Java the primitive types (called simple types in C#) are copied by value and not by reference. So in this aspect it is no difference between my Java and C++ examples.I.e. the value can easily (by default) take advantage of the boosted performance thanks to the natural locality-by-reference, in fact the primitive types were probably designed with that purpose. ”

        Yes, but kjellkod’s example (http://ideone.com/u5wbd) did not use a primitive type, that is, the example did not use an array of ints, it used ArrayList, which is an ArrayList of (references to) Integer objects. This would be similar to a C++ vector of int*. The Java Integer objects are stored on the heap, unless an “escape analysis” determines that a “scalar replacement” optimization can be applied to store the Objects on the stack instead. Without this optimization, I’m not sure that ArrayList would be much faster than LinkedList. (While array of int will probably be faster than ArrayList and LinkedList.)

        Llorllale says “containers based on an array that are holding references to other objects, like ArrayList and Vector, the add and remove operations manipulate the ordering of the _references_ only, and these are located in contiguous space in memory like you said. The objects to which said references are pointing to are irrelevant and are always in the heap.”

        But how do you determine the proper order of the Integer references without accessing the Integer objects? That is, you have to compare the intValues() of the Integer objects in order to determine the proper position for adding the Integer objects in sorted order. (Otherwise, you are literally comparing references, which is meaningless.) If the Integer objects are allocated on the heap, then you lose locality. The “scalar replacement” optimization puts the Integer objects on the stack, improving locality.

        Correct?

        Thanks for your replies kjellkod and Llorllale.

  7. kjellkod says:

    @rc I think you are on to something here. Thanks to your and @Llorllale juggling of this I now understand more of Java. This thread and my search to resolve this after you brought it up just upgraded my Java knowledge from 0.1 to 1.0 =)

    At first I thought “heck just print out” the values and the object references to see I.R.L that the ArrayList contained references and not raw values.
    System.out.println( arrayList.get(i) );
    System.out.println( arrayList.get(i).intValue() );
    Then I felt silly when they printed the same. Obviously the type or print library retrieves the actual value. I am a little disappointed with the reference documentation http://docs.oracle.com/javase/6/docs/api/java/util/ArrayList.html that led my Java newbie knowledge astray. Either way your conclusion seems to have a point but it would be nice to glimpse behind the curtain and get some final clarity in it.

    Thank you @rc for being persistent and following up on this. I appreciate it a lot.

    • Llorllale says:

      I’m still giving this some thought.. you do have a point, rc.

      • rc says:

        I’m playing with this too, I noticed that an array of ints was 8 – 10 times faster than the ArrayList of Integers. I’m trying to measure the cache misses. I’ll let you know what else I find …

  8. anand says:

    Hi there,
    what you said is very correct. I have to sort data points as large as 800,000. Since my arrarys (in C) could not accomodate such huge numbers, i used both singly and doubly linked list. It took enormous time with linked list even for few thousands of points whereas it took no time with arrays. The reason being that sorting involves more accessing operations at various locations in between than swapping (inserting and removing) those values.

  9. David W says:

    Hello Kjell! (http://xkcd.com/386/) :-D

    When using LinkedList you should be using it the way it’s suppoed to be used, e.g. _not_ using indexes. It is, as the name suggests, a _linked_list_ and behave really bad using indexes to find elements.

    Just think about a very common example when adding elements to the beginning a list. Using a LinkedList this would be a fairly simple operation (just swapping some pointers) while the ArrayList would struggle to make such insertion (moving the entire array).

    Rewriting your java-version example as it’s supposed to be written you can really see the benefits using LinkedList. The LinkedList wins at every size in your example and the other implementation are a factor two to three times slower! Check this url yourself: http://ideone.com/lmNDj

    I would encourage you to put a big disclaimer at the top warning people to use indexes to get elements in a LinkedList and how you were fooled by this! ;) Also link the the very good SO-post handling this subject: http://stackoverflow.com/questions/322715/when-to-use-linkedlist-over-arraylist

    Best Regards,
    –David W (swedish: Vi kanske kan ta en öl med gamla gänget nån gång i höst, vore trevligt! :) )

    • kjellkod says:

      Ooops. Good thing to have a Java expert on the line. Even better to share some education between us :)

      Thank you so much for commenting. Your are so right that I should have used iterators instead of indexes. Huge mistake that mattered a lot in Java. The code was borrowed and code re-use is my only lame excuse, apart from being Java rusty.

      Your code work pointed out a huge time stealer that I had removed, then for some unknown reason inserted and forgot about. list.size() was used which heavily penalized all testing. Removing that puts everything in better comparison.

      You can see the my correction of this here, http://ideone.com/1wnF1 (based on your code). As you see there really IS NOT a huge difference between ArrayList and LinkedList. In fact they are suspiciously alike (*).

      This of course is not what I showed in my article so you are correct that this is a fault. I will update the article + blog and point this out as soon as I am back from vacation.

      (*)It is interesting to see though. As all common senses would tell anyone that RAM LinkedList traversal should be much slower than cache access. But here they perform almost similar. It is almost as if the cache factor is reduced to nothing for ArrayList in Java. It has a very good explanation that I think @rc got it right in his earlier comment.

      @rc comments copied
      ——————————
      I understand how an array of ints would minimize cache misses. But what about an an ArrayList of Integers? Such an ArrayList would contain references to Integer objects. The references would be next to each other in memory, but not the Integer objects? Don’t you lose locality when you dereference the Integers?
      ——————————–

      • David W says:

        Cache is a faster memory, period, nothing else and it certainly doesn’t make O(n) go O(1).

        It’s also not common sense that a linked list should be much slower just because you have a cache. It’s just because your example is extremely cache friendly.

        Quoting your ideone c++ example: “Linked-list shows close to exponential time usage in contrast to the vectors”, pause right there. Exponential? Then you have to do something wrong in the code! But nothing is fishy in your example (both are using iterators)… then the numbers must be wrong? Sure it looks good in the graphs but is it really exponential? No, it just isn’t, sorry but you are wrong, sure the linked list is getting slower, but it actually stops getting slower when reaching 250000 elements as the cache couldn’t help you there. And then you say, “But it’s a constant factor, 80 to be exact, less efficient. Surely that’s a reason to use array list over linked list”, then I would reply “Only a constant factor? Write a better algorithm!”.

        I think you got it the O-notation wrong, you think moving an array is harder work for the processor then finding the index, it isn’t. To find and insert the element is O(n), using both data structures. The linked list has to do more work finding the index but will insert it faster [O(1 + n / 2) => O(n)] and the array list will find the index faster but will have to move the rest of the list one step [O(n/2 + n/2) => O(n)]. All the constants are hidden in O(n)!

        You can always find an example that makes you go “WOW” I will never ever use linked list again but this is only for the set of problems that doesn’t really need a linked list to be fast. The example I wrote about, removing and inserting stuff at the beginning of an array, would be really really slow using an array list. That truly is an O(1) instead of an O(n) operation and how fast your memories are they won’t be faster than O(1) (if you don’t have some funky FPGA that can shift the whole memory that is). Another problem that I can think of is to filter an array of elements, removing the values that are above or below a threshold, perfect for an linked list (http://ideone.com/KM5Ev)! And wouldn’t those examples be “Number crunching problems”?

        What you should do is to look at your problem and create the most suitable solution for it. In this example the array list version wins but in others it’s hopelessly slow. The output from the example is a sorted list (descending). Using a normal sort algorithm that is an O(n log(n)) operation. Sorting 500000 elements the “correct” way is no problem at all and wouldn’t any time at all, doing it using your example version is on the other hand really slow O(n^2). Which data structure you use is just irrelevant (check it out: http://ideone.com/GFrjj).

        I wouldn’t recommend anyone to choose array list over linked list. I would try to encourage them choose the correct one for the specific problem! E.g. “Better to be a factor than a power wrong since that really would impact the performance!”

  10. kjellkod says:

    Well said David. I could not agree more that you have to look at every individual task at hand and choose the appropriate tool for it.

    My article was written because I felt online resources were disqualifying vector in areas where it clearly bested the linked-list.

    In one area we might have different opinions. In the area of number crunching the contiguous memory structures rule. There are exceptions, as always, but starting with an array based structure as your default choice and then maybe choosing something else is so much better then using a linked-list as your default choice. And yes,. people tend to use “defaults” – knowing that a tool more often then not is right for the task area.

    Several of your objections were voiced by other people at CodeProject. I decided to do a blog reply here to bring them out for easy, public display. It is kind of a repetition of some answers to you as well as ones previously given to other people.

    After correcting the Java section I will also update this blog-article to mirror the CodeProject one.
    This should avoid any further confusion as both of them will be up-to-date.

    Thank you for being attentive.

    • kjellkod says:

      There. Now the blog-article “Why you should never, ever…” is updated and mirrors fairly well the CodeProject article (which was written to mirror the blog-article ;)

  11. Sen says:

    I like your tests, and it certainly seems to be valid for std::list (I’ve done some extensive testing on comparing std::vector vs std::list, and came to the same conclusion as you did. However, I’m still not convinced that the LINKED LIST part is the culprit completely. The thing is, std::list uses std::allocator FOR EVERY ELEMENT INSERTED/REMOVED. This means a new/delete call for every element. This is bad in every way, because memory allocations are usually extremely slow compared to other operations. If the entire linked list fits in the L1/L2 cache, performance should be a lot better. I’ve rewritten your insertion sort to use a custom singly linked list, with a extremely simple memory allocator, not supporting deallocations. The list supports begin(), end(), insert() and iterators. You can see the difference immediately, the linked list tests are faster than the vector tests for the smaller sizes. After a while the cache thrashing sets in though, at 20000/40000 elements, due to more and more random jumps in memory, completely negating cache benefits. That’s when the vector starts to shine since it will reorder its elements on every insert, to be cache friendly when iterating linearly. http://ideone.com/dYe4f7#view_edit_box

    • kjellkod says:

      Thank you @Sen for reading my blog-article and commenting. It is much appreciated with feedback.

      Your result is very interesting. Have you tried it on a machine with better cache structure then ideone.com? (ideone has worse cache performance then my very, very old PC)
      Either way it should show the same result, probably with better result for even larger fast-lists. The reason is that as you point out is that the list is within the L1/L2 cache i.e. the speedup is likely due to where it is located rather then how it is made.

      Another reader made the same comment earlier on CodeProject. The reasoning was that it would be the new allocations that made the list slower. To see how that affected things I made a list-pool and pre-allocated all the list nodes. This way the new allocations could be *completely removed* from the timing.

      Then later on, for each insert I just retrieved a list element from the pool. From my test this showed that the new allocations had very little impact on the overall performance. What made a difference was the linear search, not the allocations. Here you can see my CodeProject reply to his question: msg=4255152#xx4255152xx. Here is the code: http://ideone.com/4hetC

      Thank you so much for sharing your code. It was cool to see how the list could get significant speedup. For smaller smart-lists it is all work inside the L1/L2 cache. Since it is all within L1/L2 cache it is arguable almost contiguous and will also get the speedup from the cache and is therefore very, very fast. With the memory pool creating nodes in contiguous memory it is only when the random insert creates too much fragmentation that the cache misses start to destroy performance.

      The code does a splendid job of showing that structures being powered from contiguous memory is faster then structures with disjoint memory. Which is the point of the article. :)

  12. Lelala says:

    Huff, thanks for that post – seems like i did everything wrong the whole days when programming :-)
    I think most people just use linked lists because it is easy and they do not have to think about what they are doing :-)

  13. David Fox says:

    Great article! Regarding Java collections, there is a library called Trove which stores primitives directly:

    http://trove.starlight-systems.com/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s