How fast is C++ compared to Python?
An example for data scientists who believe they don’t need to know C++
There are millions of reasons to love Python, especially if you are a data scientist. But how does Python differ from compiled, lower-level programming languages like C or C++? I suspect this is a question that many data scientists and Python users have asked themselves, or will ask one day. There are many differences between Python and languages like C++. In this article, I am going to show you how fast C++ is compared to Python with a very simple example.
Photo by author.
To show the difference, I decided to go with a simple and practical task instead of an imaginary one. The task is to generate all possible DNA k-mers for a fixed value of “k”. If you don’t know about DNA k-mers, I explain them in plain language in the next section. I chose this example because many genomics-related data processing and analysis tasks (e.g. k-mer generation) are considered computationally intensive. That is one reason why many data scientists in the field of bioinformatics are interested in C++ (in addition to Python).
A Short Introduction to DNA K-mers
DNA is a long chain of units called nucleotides. In DNA, there are 4 types of nucleotides, denoted by the letters A, C, G, and T. The human (or more precisely, Homo sapiens) genome has about 3 billion nucleotide pairs. For example, a small portion of human DNA could be something like:
In this example, if you choose any 4 consecutive nucleotides (i.e. letters) from this string, it will be a k-mer with a length of 4 (we call it a 4-mer). Here are some examples of 4-mers derived from the example.
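As a rough sketch of the idea, the snippet below slides a window of length k over a DNA string and collects every k-mer it sees. The sequence and the function name extract_kmers are made up for illustration and are not taken from this article.

```python
def extract_kmers(sequence: str, k: int) -> list[str]:
    """Return every length-k substring (k-mer) of sequence, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers;
    # if the sequence is shorter than k, the range is empty.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical DNA fragment, invented only for this illustration.
example = "ATCGATTGAGCTCTAGCG"
four_mers = extract_kmers(example, 4)
print(four_mers[:3])   # ['ATCG', 'TCGA', 'CGAT']
print(len(four_mers))  # 15
```

Note that consecutive k-mers overlap by k - 1 letters, which is why an 18-letter string yields 15 different starting positions for a 4-mer.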
For this article, let’s generate all possible 13-mers. Mathematically, this is a permutation-with-replacement problem: each of the 13 positions can hold any of the 4 nucleotides. Therefore, we have 4¹³ (= 67,108,864) possible 13-mers. I use a simple algorithm to generate the results in both C++ and Python. Let’s take a look at the solutions and compare them.
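To check the arithmetic, here is a small sketch that computes the count for any k and enumerates the k-mers for a small k. It uses itertools.product from the Python standard library as a stand-in, so it is not necessarily the algorithm used for the benchmark in this article. The names NUCLEOTIDES, count_kmers, and all_kmers are assumptions made for this example.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # the 4 DNA letters


def count_kmers(k: int) -> int:
    """Number of distinct k-mers: 4 choices at each of k positions."""
    return len(NUCLEOTIDES) ** k


def all_kmers(k: int):
    """Lazily yield every possible k-mer in lexicographic order."""
    for letters in product(NUCLEOTIDES, repeat=k):
        yield "".join(letters)


print(count_kmers(13))  # 67108864

# Enumerate a small case to see the pattern without using much memory.
two_mers = list(all_kmers(2))
print(len(two_mers), two_mers[:5])  # 16 ['AA', 'AC', 'AG', 'AT', 'CA']
```

The generator keeps memory use flat, which matters here: holding all 67,108,864 13-mers in a Python list at once would take several gigabytes.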
To make C++ and Python easy to compare on this specific task, I used exactly the same algorithm in both languages. Both programs are intentionally simple and similar. I avoided complex data structures and third-party packages or libraries. The first program is written in Python.