Hierarchical Indexing in Pandas

The fundamental objects in Pandas are the Series and the DataFrame. A Series is a one-dimensional representation of data and a DataFrame is a two-dimensional one. Older versions of Pandas also provided higher-dimensional objects such as Panel and Panel4D, which have since been removed from the library; the same functionality can be implemented in a Series or DataFrame using hierarchical indexing. Hierarchical indexing is implemented by the MultiIndex object.

The Index Object

Every Series and DataFrame has an index. You can supply one explicitly when you create the object. If you don't, Pandas assigns a default index of consecutive integers starting at 0.

import numpy as np
import pandas as pd
# Don't specify an index
pd.Series(np.random.randn(9))

0    0.093129
1   -0.207627
2    1.207354
3    0.642533
4    0.320934
5   -0.397743
6   -0.382841
7   -0.516137
8    0.527837
dtype: float64

# Specify an Index
pd.Series(np.random.randn(9), 
     index=['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])

a    1.026805
a    0.266297
a    1.331264
b   -0.371552
b   -1.380878
c   -0.211467
c    0.513514
d   -0.370247
d    0.827373
dtype: float64

The index is available through the index attribute of the Series.

s = pd.Series(np.random.randn(9), 
     index=['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])
s.index

Index(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], dtype='object')

The index lets you subset your data by label. Selecting a label that appears more than once returns every row carrying that label.

s['a']

a   -0.944438
a    1.030934
a   -1.902332
dtype: float64

Hierarchical Indexing

Pandas also has a hierarchical index object, the MultiIndex. It lets you represent more than one dimension in the index by giving every row a label at each level.

index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], 
         [1, 2, 3, 1, 3, 1, 2, 2, 3]]

s = pd.Series(np.random.randn(9), 
              index=index)
s

a  1   -1.327063
   2    0.057254
   3   -0.390980
b  1    0.154127
   3    0.338492
c  1   -1.620070
   2   -0.195065
d  2    2.647152
   3   -0.702559
dtype: float64
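Selecting from a hierarchical index works level by level. The following is a minimal sketch that reuses the index from the example above, with fixed values instead of random ones (an assumption made only so the results are easy to follow).

```python
import numpy as np
import pandas as pd

# Same two-level index as in the example above; the values 0.0 .. 8.0
# are illustrative stand-ins for the random data.
index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
         [1, 2, 3, 1, 3, 1, 2, 2, 3]]
s = pd.Series(np.arange(9.0), index=index)

# Outer label only: returns the sub-Series for 'a', indexed by the inner level
outer = s['a']

# Full (outer, inner) tuple: returns a single value
single = s.loc[('a', 2)]

# Inner label only: every row whose second level equals 1
inner = s.loc[:, 1]
```

Selecting on the outer level alone is often called partial indexing; it is one reason to put the coarsest grouping in the first level.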

You can also attach names to the index levels, so that each level can be referred to by name rather than by position.

index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], 
         [1, 2, 3, 1, 3, 1, 2, 2, 3]]

s = pd.Series(np.random.randn(9), 
              index=index)
s.index.names = ['Letters', 'Numbers']
s

Letters  Numbers
a        1         -0.404769
         2          1.029862
         3         -0.697415
b        1          0.399055
         3          2.006614
c        1         -0.278010
         2         -0.905441
d        2          1.653139
         3          1.098323
dtype: float64
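Once the levels are named, the names can be used wherever Pandas asks for a level. Here is a small sketch under the same assumptions as before (fixed illustrative values, and the 'Letters' and 'Numbers' names from the example above).

```python
import numpy as np
import pandas as pd

# Build the named two-level index in one step
index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
     [1, 2, 3, 1, 3, 1, 2, 2, 3]],
    names=['Letters', 'Numbers'])
s = pd.Series(np.arange(9.0), index=index)

# Pull out the labels of one level by name
letters = s.index.get_level_values('Letters')

# Cross-section: all rows where the 'Numbers' level equals 2
twos = s.xs(2, level='Numbers')
```

Referring to levels by name keeps the code readable and means it does not break if the order of the levels changes later.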

With a DataFrame, you can turn existing columns into index levels using the set_index method. Passing a list of column names is an easy way to build a MultiIndex from data you already have.
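As a rough illustration, the sketch below builds a small DataFrame and promotes two of its columns to index levels. The column names and values are made up for this example.

```python
import pandas as pd

# Hypothetical data: names and values are invented for illustration
df = pd.DataFrame({
    'Letters': ['a', 'a', 'b', 'b'],
    'Numbers': [1, 2, 1, 2],
    'Value': [10.0, 20.0, 30.0, 40.0],
})

# Passing a list of columns produces a two-level MultiIndex
indexed = df.set_index(['Letters', 'Numbers'])

# reset_index moves the index levels back into ordinary columns
restored = indexed.reset_index()
```

After set_index, a row can be looked up with a tuple of labels, for example indexed.loc[('b', 1), 'Value'].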

A Series with a MultiIndex can be converted into a DataFrame with the unstack method, and a DataFrame can be converted back into a MultiIndex Series with the stack method.
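The round trip can be sketched as follows, again with fixed illustrative values in place of random data. The trailing dropna() is an assumption added so the result is the same across Pandas versions, since they differ in whether stack keeps missing values.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
     [1, 2, 3, 1, 3, 1, 2, 2, 3]],
    names=['Letters', 'Numbers'])
s = pd.Series(np.arange(9.0), index=index)

# unstack pivots the innermost level ('Numbers') into columns;
# combinations that do not exist, such as ('b', 2), become NaN
wide = s.unstack()

# stack moves the columns back into an inner index level;
# dropna() removes the NaN placeholders if this version kept them
long = wide.stack().dropna()
```

Here wide has one row per letter and one column per number, which is the two-dimensional view of the same data.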

Once an index is set up, you can compute statistics for each label within a level of the index.

# Series.sum(level=...) was removed in pandas 2.0; group by the level instead
s.groupby(level='Letters').sum()

Letters
a   -0.072323
b    2.405669
c   -1.183451
d    2.751462
dtype: float64

Thinking in terms of hierarchical indexes gives you a more powerful way of dealing with high-dimensional data in Pandas. You keep working with the familiar Series and DataFrame, which display in two dimensions, while the index carries the additional dimensions. This is a great tool to have in your data analysis toolkit.

This article is part of my 21-day challenge, in which I will try to write a blog post every day for 21 days.

What can be improved

This post covers the details of some functionality. It could be presented in a more interesting and engaging manner, and in a more practical way that helps the reader.

Subsequent posts should talk about how this can help the reader.
