The fundamental objects in pandas are the Series and the DataFrame. A Series is a one-dimensional representation of data and a DataFrame is a two-dimensional one. Older versions of pandas also provided objects for higher dimensions, such as Panel and Panel4D, but these have since been removed, and the same functionality can be implemented in Series and DataFrames using hierarchical indexing. Hierarchical indexing is implemented by the MultiIndex object.
The Index Object
Series and DataFrame objects always have an index. You can specify one explicitly when you create the object. If you don't, pandas defines a default index of sequential integers starting at 0.
    import numpy as np
    import pandas as pd

    # Don't specify an index
    pd.Series(np.random.randn(9))

    0    0.093129
    1   -0.207627
    2    1.207354
    3    0.642533
    4    0.320934
    5   -0.397743
    6   -0.382841
    7   -0.516137
    8    0.527837
    dtype: float64

    # Specify an index
    pd.Series(np.random.randn(9), index=['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])

    a    1.026805
    a    0.266297
    a    1.331264
    b   -0.371552
    b   -1.380878
    c   -0.211467
    c    0.513514
    d   -0.370247
    d    0.827373
    dtype: float64
The index is a property of the Series.
    s = pd.Series(np.random.randn(9), index=['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])
    s.index

    Index(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], dtype='object')
The index object allows you to subset your data.
    s['a']

    a   -0.944438
    a    1.030934
    a   -1.902332
    dtype: float64
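Label-based subsetting can also be written with .loc, which makes it explicit that the lookup goes through the index. The following is a minimal sketch; the fixed values (used instead of random numbers so the result is reproducible) and the variable names are my own, for illustration only.

```python
import pandas as pd

# Fixed values instead of random numbers so the results are reproducible.
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'a', 'b', 'c', 'c'])

# A repeated label returns every matching row as a Series.
a_rows = s.loc['a']

# A label that occurs only once returns a single value.
b_value = s.loc['b']

# A list of labels selects several groups at once.
a_and_c = s.loc[['a', 'c']]

print(a_rows.tolist())    # [10, 20]
print(b_value)            # 30
print(a_and_c.tolist())   # [10, 20, 40, 50]
```

Note that the index labels do not have to be unique: a duplicated label simply selects all of its rows.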
pandas also has a hierarchical index object, the MultiIndex. This lets you represent more dimensions in the index.
    index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
             [1, 2, 3, 1, 3, 1, 2, 2, 3]]
    s = pd.Series(np.random.randn(9), index=index)
    s

    a  1   -1.327063
       2    0.057254
       3   -0.390980
    b  1    0.154127
       3    0.338492
    c  1   -1.620070
       2   -0.195065
    d  2    2.647152
       3   -0.702559
    dtype: float64
You can also attach properties such as names to the index levels so they can be referenced more easily.
    index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
             [1, 2, 3, 1, 3, 1, 2, 2, 3]]
    s = pd.Series(np.random.randn(9), index=index)
    s.index.names = ['Letters', 'Numbers']
    s

    Letters  Numbers
    a        1         -0.404769
             2          1.029862
             3         -0.697415
    b        1          0.399055
             3          2.006614
    c        1         -0.278010
             2         -0.905441
    d        2          1.653139
             3          1.098323
    dtype: float64
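Once the levels have names, those names can be used to look things up. Here is a small sketch using the xs (cross-section) method and get_level_values; the shortened index and the fixed values are assumptions chosen so the output is predictable.

```python
import pandas as pd

# A shorter version of the index above, with fixed values for reproducibility.
index = [['a', 'a', 'a', 'b', 'b'],
         [1, 2, 3, 1, 3]]
s = pd.Series([10, 20, 30, 40, 50], index=index)
s.index.names = ['Letters', 'Numbers']

# Cross-section: every row whose 'Numbers' level equals 1.
# The selected level is dropped, leaving only 'Letters' in the index.
ones = s.xs(1, level='Numbers')

# Pull out the labels of a single level by name.
letters = s.index.get_level_values('Letters')

print(ones.to_dict())    # {'a': 10, 'b': 40}
print(list(letters))     # ['a', 'a', 'a', 'b', 'b']
```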
With a DataFrame, you can turn existing columns into an index with the set_index method. This is an easy way of creating a multi-level index from data you already have.
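As a rough illustration of set_index, the sketch below builds a small DataFrame and promotes two of its columns to a MultiIndex. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical data: two label columns and one value column.
df = pd.DataFrame({
    'Letters': ['a', 'a', 'b', 'b'],
    'Numbers': [1, 2, 1, 2],
    'Value': [10, 20, 30, 40],
})

# Passing a list of column names creates a two-level MultiIndex.
indexed = df.set_index(['Letters', 'Numbers'])
print(indexed.index.names)     # ['Letters', 'Numbers']

# Rows can now be looked up with a tuple of labels, one per level.
value = indexed.loc[('b', 1), 'Value']
print(value)                   # 30

# reset_index moves the index levels back into ordinary columns.
restored = indexed.reset_index()
print(list(restored.columns))  # ['Letters', 'Numbers', 'Value']
```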
A Series with a MultiIndex can be turned into a DataFrame using the unstack method, and a DataFrame can be turned back into a MultiIndex Series using the stack method.
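The round trip between the two shapes might look like the sketch below. It assumes a small Series in which every letter and number combination is present; when combinations are missing, as in the random examples earlier, unstack fills the gaps with NaN.

```python
import pandas as pd

# Every (letter, number) pair is present, so no NaN values appear.
index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                   names=['Letters', 'Numbers'])
s = pd.Series([10, 20, 30, 40], index=index)

# unstack moves the innermost level ('Numbers') into the columns.
wide = s.unstack()
print(wide.shape)            # (2, 2)
print(wide.loc['b', 2])      # 40

# stack moves the columns back into the index, giving a Series again.
long_again = wide.stack()
print(long_again.tolist())   # [10, 20, 30, 40]
```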
Once an index is set up, you can run statistical analysis on the levels within it.
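    # Older pandas versions accepted s.sum(level='Letters');
    # that keyword was removed in pandas 2.0, so group by the level instead.
    s.groupby(level='Letters').sum()

    Letters
    a   -0.072323
    b    2.405669
    c   -1.183451
    d    2.751462
    dtype: float64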
Thinking in terms of hierarchical indexes gives you a more powerful way of dealing with high-dimensional data. Organising data with indexes keeps it compact: you get the benefit of viewing it in two dimensions while still representing multiple dimensions. This is a great tool to have in your data analysis toolkit.
This article is part of my 21 day challenge where I will try to write a blog post every day for 21 days.
What can be improved
This post got across details about some functionality. It could be put out in a more interesting and engaging manner. It could also be put in a more practical way that will help the reader.
Subsequent posts should talk about how this can help the reader.