The first spreadsheet application I used was Lotus-123. I remember using the
=SUM() function on a column of cells and blowing my freaking mind! Spreadsheets, namely Excel have been an integral part of my career since that time more than 2 decades ago. Always wanting to learn as much about it as I could, knowing that it paid dividends if I could use it better than everyone around me in terms of career advancement.
Wanting to be better at Excel got me into VBA as my first programming language. I could iterate over rows and columns and automate complex operations, even write my own functions if I could think of one that didn't exist already. But the learning bug is strong with this one which pushed me towards learning Python. My first exposure into Pandas was during the Coursera Computational Investing course. It was one of the libraries used for analysis. It had a lot to take in so I dipped in and out of it over the course of many years.
As Pandas got more and more popular, there were more references to it as well as more applications for its use. I went from using Excel exclusively to using Python and Pandas side by side. Learning the best use cases for each application and trying to use the best tools for each application.
Comparing Excel to Pandas is not an apples to apples comparison. One is a fully fledged application while the other is a library within a programming language but we can look at how good they are at how good they are at doing the same tasks.
I have tried to do the comparison between the two by looking at the pros and cons of both. This is a start and I will be dipping back into this post if I think of or come across any other items after publishing this post. I can start with the things they have in common.
Both have excellent online communities. If ever you had a problem, there is a good chance someone also had this problem before you and was kind enough to ask the experts online. There is a plethora of training courses online as well. Though I have to admit, because knowledge of Excel is now a requirement for even Entry level jobs, there seem to be more forums and training courses out there.
The ubiquitous application of the finance industry. You sometimes hear people who even don't use computers all that much talk about having some spreadsheet or other. It is probably the most financially successful application in the world or at least one of the top 3.
The pull here is that you can get up and running really quickly with very little past knowledge. It is a spreadsheet with cells which is similar to a piece of paper. This representation on a PC is useful even without using a single formula.
There is no difference between your data and how it is presented. A cell can just be on its own or be one of many used to calculate another cell's value. You are free to create your spreadsheet any way you want without much rules. You are just good to go.
The Excel Pros
You interact with Excel visually. Either with your mouse or keyboard ninja skills. Just like an MS word document, some of what your data looks like is immediately visible to you. This is great for small data sets where you can meaningfully scroll over them. The spreadsheet is a canvas on which you can do what you like.
Pivot tables are powerful visual representations of your data. Its a great way of compressing your data down to how you want to see it. Pivot tables have been in Excel from the beginning and because of its usefulness it is still an integral feature. It is easy to master a few simple tricks that can help you to figure out how a data set is structured.
Excel has a wide variety of graphs to work with that are always available out of the box. Combining data and presentation means, drawing graphs on any visible set of numbers is just a couple of casual clicks away.
Relatively flat learning curve
Once you grasp certain fundamentals that can be picked up quite easily, you can use Excel for most of what you want. Excel also does a good job of hiding its more advanced features until you are good enough to start using it.
Everyone has it
In the commercial world, most people use Windows and from those people, most have Excel. Now some corporations give you the option of having Macs but there is an MS Office option for OS X. With a large part of Office 365 being browser based.
The Excel team have really listened to the users and included almost any kind of formula that you could possibly need. The reality is that you probably will only need a few formulas that you will use most of the time. And there is no way you will use all but a fraction of the formulas available to you. VBA also give you the option to write your own formulas or functions which is popular in some finance organisations.
All applications within the MS Office suite speak to each other.
For instance you could have an Excel spreadsheet as a data source in an MS Access database, or a MS Access database as a data source to a spreadsheet. This is available out of the box. There is also integration with other standards such as CSV and PDF that are very popular outside the MS Office ecosystem.
Visual Basic for Applications is a scripting language based on Microsoft's Visual Basic programming language. Though this is not a fully fledged programming language, it makes MS Office elements available to you in code. This makes it easy for you to automate processes and functions. VBA is most commonly used as macros where you can record and replay the activities you do on a spreadsheet automating processes.
At a more advanced level, enterprise level applications have been built using this functionality. Like most of the advanced functionality in Excel, the average user doesn't need to know this functionality you can get very far with Excel without writing a single line of VBA code.
VBA is great for data processing. Being able to write your own functions. Automate data processing tasks and even automate interfacing with external files and databases.
The Excel Cons
Not free. Though Microsoft has gone a long way with making the MS Office Suite reasonably affordable. It used to be that an MS Office license was a high one off payment but has since implemented other SaaS models since rolling out Office 365 integration.
Anything goes in a workbook. Cells can contain values or formulas. A single cell could contain the result of millions of calculations or just some text. A tab could be a well behaved two dimensional dataset or a summary document with only text and no formulas at all. Because of this, you can't make any assumptions about what is on a worksheet until you have had a look at it. Even for a reasonably complex spreadsheet, you could spend an inordinate amount of time trying to figure out why you did something in a particular way in the past. Working out why someone else did something in a spreadsheet is harder.
Not for big data
Excel made a massive jump from 65 thousand rows pre version 2007 to over a million since. Although current versions of Excel can handle this number of rows, any reasonably complex application starts to hang when it gets over 10 thousand rows. This gets worse when there are formulas that can only be calculated after other formulas are and worksheets depending on other worksheets. Excel is written at an application level. This means if you have multiple workbooks open, all of them will pause just because there are calculations on one worksheet that causes the whole system to stall.
Excel is a GUI, which means your computer gives up resources just processing graphics. Which means less graphically intensive data processing tools (depending on how they are created) will have more resources to play with. See what happens in Pandas.
Not ideal for data science projects
Real world data is often messy and unstructured. It is unlikely that it will come to you in a neat format (tidy data) that you can then do analysis on. Processing a dataset from its original source involves changing its shape and also making subjective judgements on how best you think it would be to analyse data. Excel doesn't lend its self very well to recording these changes.
If you inherit a spreadsheet someone else has created, it is hard to figure out why things were done in a particular way.
Analysis needs to be well documented outside Excel to be taken over by other parties.
Formulas are at a row/column level
Melting and casting data involves changing columns to rows and vice versa. For example, if you had a wide dataset you wanted to make longer by incorporating column data into lines, or do the opposite, it would be a tedious process that would require a degree of skill. In a programming environment, this can be done instantly with a single command. Split, apply, combine functions are available in most data processing languages.
Advanced Data Analysis
Although there is a great collection of mathematical, statistical and financial formulas to solve a wide range of domain specific problems, and even going to the level of basic machine learning functions like regression. You can write even write formulas to do classification, though these are like hacks. There are no specific libraries for these use cases. In short you are hamstrung when it comes to this breed of analysis.
No Support for Time Series
The later versions of Pivot tables have had a stab of supporting time series by grouping dates into date ranges. But it doesn't get it right and is a bit of an annoyance because it is on by default and you need to turn it off. There is no intuitive support for grouping dates by period.
Pandas was created by Wes McKinney who wanted to build Python support for his work with financial data. Pandas gives you programming objects and data structures you can use within the Python programming language. Python has had incredible popularity and adoption since its release. The fact that it is a general purpose programming language means there are other great libraries available to you which you can integrate into your solution with a simple import.
Pandas is actually built on top of other libraries in Python. The author took clever advantage of other excellent libraries available to him. Pandas provides a comprehensive framework for data manipulation and analysis. There are other libraries for further analysis and presentation.
The Pandas Pros
Separates Data from Presentation
You create or import data into programmable objects. Any analysis is then done by code. Output can be either a table or a graph. This gives you a clear delineation between the data the analysis and the presentation.
A properly organized data science project, separates these three elements out. Pandas provides a framework for doing this well.
Pandas is built on top of Python's NumPy library which give you matrix like structures for data manipulation. A lot of this is written in low level code making it very fast. There is also the fact that there is no GUI to display and interact with your data, meaning that Pandas is overall less resource intensive than Excel so it has more to utilize comparing like with like.
Rich Eco System
As mentioned, Pandas is built on top of NumPy which provides support for working with large multidimensional arrays. NumPy its self is a complete library for that use case and is used as a basis for other excellent libraries like SciPy or Scientific Python for science and engineering. It provides all round functionality with how you want to use the programming objects Pandas offers. There are a rich array of visualization libraries providing both interactive and 2d and even 3d visuals. All just a library import away.
Jupyter Notebooks are a great way to interact with your data and share the results. Everything comes together here to give you a complete solution for data analysis.
Analysis can be done on an iterative level. Processor intensive calculations can be done and saved then reloaded with the calculations intact. This is a better solution than getting the entire workbook to recalculate every time there is a change made to a single cell within the entire workbook.
Support for Timeseries
Time series are data structures that understand time. Or a series of timestamps chained together. Time series understands the properties of these timestamps so you can do things like show a group of timestamps that share a year, month, hour, etc. This is great for financial analysis, looking for patterns of how values have moved over time.
Bringing Together Multiple Data Sources
There is a comprehensive list of programming functions that allow you to join different sets on data. This includes joining by columns or merging together. You can treat datasets like relational tables and join them together with SQL commands. Along with this, you are able to write scripts to bring together data or even create new datasets.
Python has libraries to read other applications data. This could be in the form of connectors to query databases or libraries into a variety of other applications. This includes big data solutions. There are Python libraries to read and process Excel documents.
The Pandas Cons
High Barriers to Entry
The elephant in the room. At the end of the day you need to be able to code. This takes time and patience. Learning to code well takes many years. Even internalizing some concepts that experienced coders take for granted could take a long time even for the brightest of us. There is also no urgency. You just can go your entire career not writing a single line of code.
But think of the things you can do if you did.
I hope this gives you a comprehensive view on the pros and the cons of both. I would like the Excel community to dip their toes in Python and Pandas and hopefully fall in love like I did.