Python Vs. R for Data Science
So many opinions are out there about which is better for Data Science. Python has been a strong contributor to Data Science, especially with regard to Machine Learning and AI.
However, when looking up the differences between the two languages (R and Python), there’s an obvious bias. Often people prefer Python and tend to minimize the problems of the language, while maximizing the problems of R.
I find much of the dialogue on this subject neglects some of the more impressive aspects of R in comparison to Python. I hope to correct that.
Strengths of R
I recently started using R as most statisticians teach courses with R and RStudio.
*RStudio is the classic IDE for the R language.
As an example… let’s say we load a CSV file into memory and it has the following columns:
In RStudio, these are all auto-completed. Simply type the name of the loaded dataset followed by a $ and all the column headers pop up in a contextual menu:
RStudio also gives contextual menus for values of different tables:
Above we have a data set with multiple tables. By loading / connecting to the data set, I can append :: and up pops all the tables available.
This can chain from table to column like so:
It’s pretty amazing to get an idea of what’s in the data frame via the IDE, much like code completion.
In fact it’s very much like code completion.
Look at the syntax below. As I typed plot() to do a scatter plot, RStudio auto-completed the variables (column headers) for me:
Unlike a standard (or even an exceptional) code IDE, RStudio excels at providing a UI for exporting and importing data. You don’t need to issue an R command/method to load a data set; you can use the IDE itself (File > Open).
You can also export any diagrams generated, using the export button:
The IDE also lets you re-render any plot at any given size, all through a UI:
RStudio has a great tabbed window called Sources. Sources are all data sets loaded into memory. Take the screenshot below for example. In the screenshot we have three sources loaded. I can tab between the data sets to get a tabular visual very quickly.
No Zero Indexing
I started from a programming background where the idea of 0 index was just an accepted reality.
Why the Zero Indexing?
We accept 0 indexing as normal, but it really isn’t normal. Why start with 0 in an indexed list or array?
The answer goes back to computer efficiencies.
At the low level, for a compiler to deal with a 1-based array (meaning that the first element in an array or list is index 1), it has to compute each element’s location as (array + index) - 1: the base address, plus the index, minus one.
0-indexed arrays simply drop the -1, so there’s one less operation per access. This is important if you’re dealing with performance issues.
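As a rough illustration of that arithmetic, here is a minimal Python sketch. The function names, the base address and the element size are all made up for the example; it models the idea only and is not what any particular compiler emits.

```python
# Hypothetical model of how an element's address is computed.
# base_address and element_size are illustrative numbers, not real memory.

def address_zero_based(base_address: int, index: int, element_size: int) -> int:
    # 0-based: the index is used directly as the offset multiplier.
    return base_address + index * element_size


def address_one_based(base_address: int, index: int, element_size: int) -> int:
    # 1-based: the index has to be shifted down by one first (the extra step).
    return base_address + (index - 1) * element_size


base = 1000  # assumed start of the array
size = 8     # assumed bytes per element

# The first element sits at the same address under both schemes;
# only the index used to reach it differs.
first_zero = address_zero_based(base, 0, size)
first_one = address_one_based(base, 1, size)
print(first_zero, first_one)
```

Both calls land on the same address. The 1-based version simply performs one more subtraction to get there.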
On a modern CPU or GPU, the cost of calculating a 1-based index is not much of a hit, especially for Statistical Modeling. If you feel performance here is important, keep in mind that interpreted scripting languages (like Ruby, Python or JS) are already much slower than compiled languages.
Because of the larger performance issues of scripting languages, there’s no real point in comparing 0 Index performance.
However, 0 indexes work against human thinking.
The first element in the list or array is thought of logically as index 1 then 2, then 3 and so on. Stating that we want data from index 0 is counter-intuitive.
R handles data in 1-based arrays, which makes sense to me when dealing with data. At first I resisted this concept, holding on to other languages as a template of proper practices.
When focusing on usability in Data Science, I find 1-based indexing makes common sense.
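To make the difference concrete, here is a small Python sketch. The `scores` list and the `nth` helper are hypothetical, written only to show the off-by-one translation a 0-based language asks of you; the helper is not a reimplementation of R’s indexing rules.

```python
scores = [88, 92, 79]

# Python is 0-based: the first element lives at index 0.
first_python = scores[0]


def nth(items, position):
    """Hypothetical helper: fetch by 1-based position (first = 1, second = 2, ...)."""
    if position < 1 or position > len(items):
        raise IndexError("position must be between 1 and len(items)")
    # Translate the human-style position into Python's 0-based index.
    return items[position - 1]


first_one_based = nth(scores, 1)  # the "first" score
third_one_based = nth(scores, 3)  # the "third" score
print(first_python, first_one_based, third_one_based)
```

With 1-based positions, "the third score" is simply position 3. In plain Python the same element is `scores[2]`.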
Some biased writers are spouting nonsense that R can’t do things like text mining… I don’t think such people have even used the language. Check out this pretty amazing tutorial on Udemy about Text Mining with R and visualizing with Tableau.
Training in Stats
Almost every course on statistics will utilize R. When I started looking to learn Statistics, I didn’t want to just read a dry book of formulas…
I wanted to get the theory, but practice the theory in a language/utility that did the heavy calculations for me.
While there is a book on doing this in Python (“Think Stats”), it’s confusing, as the author uses their own statistical implementation (rather than NumPy and Pandas).
There is virtually nothing substantial for training in Statistics and Statistical Modeling from a Python point of view. R is still the de facto learning tool for Statistics and Statistical Modeling courses.
Strengths of Python
Unlike R, Python is a general purpose language and can be used in testing security, building applications, data science and much more.
While both R and Python are open source languages, Python has a greater set of free options when it comes to deploying code. True, R can be deployed on systems like Heroku (for free), but those deployments lack the R Server running under the hood.
In order to really get R to shine, you have to use something like shinyapps.io (a host provider that offers a free tier, but only gives you 5 free R applications).
With Heroku and many other options out there, Python can run effectively without any deployment constraint. Python notebooks can be deployed in the Anaconda cloud for free.
Google’s “colab” lets you host Python Jupyter notebooks for free and even run them in the cloud.
To get an unlimited number of runnable apps on an R Server, you’ll most likely be paying $440/year ($39/mo.). In the Python world you can do this for almost free. Sure, there are limitations (such as running out of free services on Google or Heroku), but the cost of additional tiers is quite affordable by comparison.
Collaboration / Community
The community commitment behind Python has been enormous. Almost all A.I. and Machine Learning courses and certification programs use Python as the base language.
On top of that Python has a vast scientific community creating libraries beyond the scope of data science.
Application development with Python (using PyQt or Tkinter) is another community-driven aspect of Python.
Looking for answers to code issues with Python is no problem. Most likely someone else has already asked and answered the question you’re running into.
As a general purpose language, Python can be used as a Desktop Application development solution, Web Application development solution and in Penetration Testing.
There are so many options for using Python. It is also the scripting language inside various 3D design tools (like Blender).
Knowing Python goes a long way beyond the scope of Data Science.
Personally, I find both R and Python lacking when it comes to symbolic math. However, both can produce some display aspects of symbolic math… and I think Python has a slight edge on this.
By symbolic math, I’m referring to seeing your equations written in mathematical notation. Python and R use a typesetting system called LaTeX to create this symbolic math representation.
I found setting up LaTeX in Python quite easy if you’re using Anaconda.
Where both of these fall short is that LaTeX doesn’t compute the symbolic formula. It’s for display purposes only.
Anaconda is a data science distribution. Once installed, it has a variety of tools that can be launched and configured. Primarily this is for Python and data science (although it does come with RStudio).
You get Jupyter, JupyterLab, some visualization IDEs and tons of data science libraries. Maintaining the libraries is a snap using the “conda” command-line tool.
Training in Machine Learning & AI
While R holds down the fort on training in Statistical Analysis and Statistical Modeling, Python is the de facto standard for Machine Learning and AI.
Udacity has invested considerable time into making Python the language of choice in their Machine Learning, AI and Deep Learning programs.
This is perhaps one of the reasons Python is getting so much emphasis in these fields. The hyper-acceptance of Python by teachers in a variety of online schools and physical training centers is really pushing the language acceptance in Data Science.
While Data Science has aspects of both Statistics and Machine Learning, there’s a teaching division on these respective topics.
Although, as I wrote this article, I noticed that Udacity just pushed a course on Machine Learning with R. Still, the majority of ML/AI/DL focused training is with Python.
I’ve been using Python for the last 6 years or so. I’ve had my share of ups and downs with the language. There are things I love and things I’m not so hot about.
While R is not a general purpose language, for Statistical Modeling I prefer it. In fact for all things data, I prefer it. I see comments from many that disrespect the R language, citing it as “old fogies use it, time to upgrade to Python,” but such comments lack any real depth of experience.
You simply can’t plug into a data-driven IDE with Python like you can in RStudio. Yes, you can do all the same functions… similarly, one could write Python code in vi vs. PyCharm, but the latter offers a variety of code completion and visual aids to really make writing code more professional.
That’s how I feel about R. R has the professional feel of computer-driven Statistical Analysis, where Python feels like it’s playing catch-up or is perhaps a “knock-off.”
Don’t get me wrong, Python has some amazing libraries (NumPy and Pandas) which seem to replicate a lot of the work done in R. Yet Python IDEs lack the data-driven object approach you’ll get with R (completion of tables, variables, etc.). By comparison, Python feels awkward to me.
Putting Data Science aside, I feel that Python is really a better language for learning. If you’re thinking of using a language in different capacities… building desktop apps, web apps, using it to test said applications and security testing – then Python would be a great language of focus.
For symbolic math – neither of these languages really shines at all. If you’re interested in a utility to run symbolic math then Mathematica or Matlab would be more appropriate.
For Data Science I think R has a strong lead over Python in usability. Learning wise… well Python has the hyper extreme edge in learning Machine Learning and AI from colleges and MOOCs. Statistics wise, R seems to really hold its own on learning resources.
So far I prefer RStudio for doing Data work. It just makes more sense. It’s usable and less goofy. In the end, if this is a job path you’re breaking into, you need to learn both, as job opportunities may require one or the other.
As I’m embedded in a team I love (and not looking for placement), I prefer R over Python to deliver any analysis I need.