Learn how to Manipulate Data using Pandas

The foundation of pandas is its two core data structures, the Series and the DataFrame. Although these objects do not solve every data problem, they are an excellent fit for the problems encountered most often. A Series is a one-dimensional object that holds an array of values (a NumPy array by default) together with an index. A Series can be created in several ways. One approach is to pass a Python list to the Series constructor. When the Series is printed, the first column shows the index and the second column shows the data.
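
As a minimal sketch of creating a Series from a list (the values are made up for illustration, and the name `prices` is hypothetical):

```python
import pandas as pd

# Illustrative values; with no index given, pandas uses 0, 1, 2, ...
prices = pd.Series([4.5, 7.25, 3.0, 9.75])

print(prices)          # index on the left, data on the right
print(prices.index)    # the index object
print(prices.values)   # the underlying NumPy array
```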

Another approach is to pass a Python dictionary to the Series constructor, in which case the dictionary keys become the Series index.
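
A short sketch of the dictionary approach, using made-up keys and values:

```python
import pandas as pd

# Illustrative data: the keys become the index labels
stock = pd.Series({"a": 10, "b": 20, "c": 30})

print(stock)
print(stock["b"])  # look up a value by its label
```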

Just as with NumPy arrays, a Series supports selecting values via the index, Boolean filtering, scalar multiplication and mathematical functions. These operations preserve the link between each value and its index label.
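
The operations described above could look like the following sketch (the Series contents are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0], index=["a", "b", "c", "d"])

by_label = s["b"]     # select one value via its index label
large = s[s > 4]      # filter with a Boolean condition
doubled = s * 2       # scalar multiplication
roots = np.sqrt(s)    # NumPy function applied to every value

print(large)
print(roots)          # the index labels are carried over
```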

 

After a Series has been created, you can replace its index in place by assigning a new sequence of labels, of the same length, to the index attribute.
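
A minimal sketch of replacing the index, with made-up labels:

```python
import pandas as pd

s = pd.Series([10, 20, 30])      # default index 0, 1, 2
s.index = ["x", "y", "z"]        # the new index must have the same length

print(s)
```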

A DataFrame has a tabular structure similar to a spreadsheet. It holds multiple named columns, and each column can have a different data type. When the values within a single column have different types, pandas selects a type that can accommodate all of them. There are different ways of creating a DataFrame. A frequently used approach is to pass a dictionary of equal-length lists or NumPy arrays to the DataFrame constructor, with the dictionary keys becoming the column names.
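
One way to sketch this, assuming a dictionary of equal-length lists and arrays (all names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

data = {
    "name": ["Ann", "Ben", "Cy"],
    "age": np.array([34, 28, 45]),
    "score": [1, 2.5, 3],   # mixed ints and floats are stored as floats
}
df = pd.DataFrame(data)

print(df)
print(df.dtypes)  # one data type per column
```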

A column in a DataFrame can be accessed either by using the column name as a key in square brackets or as an attribute of the DataFrame. Both return the column as a Series.
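
Both access styles are sketched below on a made-up DataFrame. Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ben"], "age": [34, 28]})

ages_by_key = df["age"]   # column name used as a key
ages_by_attr = df.age     # same column accessed as an attribute

print(ages_by_key)
```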

After creating a pandas object, there are different manipulations that can be done on it. When you need your data to conform to a new index, you call the reindex method. This method returns a new object with the data rearranged to match the new index and introduces missing values (NaN) for any index labels that were not present before.
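
A small sketch of reindexing a Series, using made-up labels; the label "d" does not exist in the original, so it comes back as NaN:

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=["c", "a", "b"])

# Returns a new Series; s itself is left unchanged
r = s.reindex(["a", "b", "c", "d"])

print(r)
```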

To avoid introducing missing values when reindexing, you specify a replacement value with the fill_value argument.
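
For instance, a sketch that fills new labels with 0 (the choice of 0 is only an illustration; the right fill value depends on the data):

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=["c", "a", "b"])

# New labels receive fill_value instead of NaN
r = s.reindex(["a", "b", "c", "d"], fill_value=0)

print(r)
```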

When working with a DataFrame, reindex lets you alter both the rows and the columns. Altering rows works just as it does for a Series, while columns are altered by passing the new labels to the columns argument.
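
A sketch of reindexing columns on a made-up DataFrame; the column "d" is new, so it is filled with NaN:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4], "c": [5, 6]},
    index=["r1", "r2"],
)

# Keep "c" and "a" in a new order, drop "b", and add an empty column "d"
reordered = df.reindex(columns=["c", "a", "d"])

print(reordered)
```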

When you need to remove an entry from a Series or DataFrame, you pass its index label to the drop method, which returns a new object without that entry. In a DataFrame you can drop a row (the default) or a column (by specifying the axis).
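
The following sketch uses the drop method on made-up data to remove a Series entry, a DataFrame row and a DataFrame column:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
s_without_b = s.drop("b")

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["r1", "r2", "r3"])
without_row = df.drop("r2")          # rows are dropped by default (axis=0)
without_col = df.drop("y", axis=1)   # axis=1 drops a column instead

print(without_row)
print(without_col)
```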

A pandas DataFrame offers different ways of indexing and selecting rows and columns. To select one column you pass its name in square brackets, and to select several columns you pass a list of names.
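
A sketch of column selection on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cy"],
    "age": [34, 28, 45],
    "city": ["Oslo", "Lima", "Cairo"],
})

one_column = df["age"]             # a single name returns a Series
two_columns = df[["name", "age"]]  # a list of names returns a DataFrame

print(two_columns)
```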

Rows in a DataFrame can be selected by slicing or by Boolean comparison.
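
Both kinds of row selection are sketched here, with made-up names and ages:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cy", "Di"],
    "age": [34, 28, 45, 19],
})

first_two = df[:2]            # slice of rows by position
over_30 = df[df["age"] > 30]  # rows where the comparison is True

print(first_two)
print(over_30)
```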

To select a subset of rows and columns, loc and iloc indexing are used. To select by labels you use loc, while to select by integer position you use iloc. Both take a row selector followed by an optional column selector. Selecting a single row or a single column with a scalar selector returns a Series, selecting several rows or columns (with a slice or a list) returns a DataFrame, and selecting a single row and a single column returns the scalar value in that cell.
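
The return types can be checked with a sketch like this one (index labels and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [34, 28, 45], "city": ["Oslo", "Lima", "Cairo"]},
    index=["ann", "ben", "cy"],
)

row = df.iloc[0]                         # single row by position: Series
rows = df.iloc[0:2]                      # several rows: DataFrame
col = df.iloc[:, 0]                      # single column by position: Series
cell = df.loc["ben", "city"]             # single row and column: scalar
block = df.loc[["ann", "cy"], ["age"]]   # lists of labels: DataFrame
ages = df.loc["ann":"ben", "age"]        # label slices include both endpoints

print(type(row).__name__, type(rows).__name__, type(col).__name__)
```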

To apply a function to each column or row of a DataFrame, the apply method is used. By default the function receives each column in turn; passing axis=1 applies it to each row instead. A typical use is returning the lowest value in each column.
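
A sketch of apply on made-up data. A lambda is used here as one possible way to pass the function; for a simple minimum, calling df.min() directly gives the same result.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [9, 8, 7]})

col_min = df.apply(lambda col: col.min())          # one result per column
row_min = df.apply(lambda row: row.min(), axis=1)  # one result per row

print(col_min)
print(row_min)
```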

To sort by index labels, the sort_index method can be used on a Series or DataFrame object. DataFrame objects can be sorted on either axis. To sort a Series by its values, the sort_values method is available (older pandas releases called this method order, which has since been removed). When you need to sort a DataFrame by one or more columns, you pass the column names to the by option of sort_values.
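
A sketch of these sorting methods as they exist in current pandas releases, using made-up data:

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["c", "a", "b"])
by_label = s.sort_index()    # order by index label
by_value = s.sort_values()   # order by the data

df = pd.DataFrame({"x": [2, 1, 2], "y": [5, 6, 4]}, index=["r3", "r1", "r2"])
rows_sorted = df.sort_index()                         # sort rows by label
cols_sorted = df.sort_index(axis=1, ascending=False)  # sort columns by name
by_columns = df.sort_values(by=["x", "y"])            # sort by x, then by y

print(by_columns)
```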

Methods to produce descriptive statistics are built into Pandas objects and this provides a simple and efficient way of summarizing data. Some commonly used methods are discussed below.

The count method returns the number of non-missing values, computed column-wise by default or row-wise when axis=1 is passed.
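
A sketch of count on a made-up DataFrame that contains some missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

per_column = df.count()      # non-missing values in each column
per_row = df.count(axis=1)   # non-missing values in each row

print(per_column)
print(per_row)
```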

To get the minimum and maximum values, the methods min and max are used. To get the index labels at which the minimum and maximum occur, the methods idxmin and idxmax are used. To get a sample quantile, the method quantile is used. To get the mean, median or sum of values, the methods mean, median and sum are used. Variance is obtained using the method var and standard deviation using the method std, while skewness and kurtosis are obtained using the methods skew and kurt.

The correlation and covariance are pairwise computations and they are obtained using the methods corr and cov.
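
The summary methods above can be tried on a small made-up DataFrame. In this sketch the column y is exactly twice x, so the two columns are perfectly correlated.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]},
    index=["a", "b", "c", "d"],
)

print(df.min(), df.max())
print(df.idxmin(), df.idxmax())      # index labels, not the values themselves
print(df["x"].quantile(0.5))         # sample quantile (here the median)
print(df.mean(), df.median(), df.sum())
print(df.var(), df.std())            # sample variance and standard deviation
print(df.skew(), df.kurt())

corr = df.corr()   # pairwise correlation between columns
cov = df.cov()     # pairwise covariance between columns
print(corr)
print(cov)
```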

To list the distinct values in a column, the unique method is used. For example, it can be used to check that the sex column contains only M and F values. To flag repeated values, the duplicated method is available.
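
A sketch of that check on a made-up sex column. The stray lowercase "f" is included deliberately so the check has something to find.

```python
import pandas as pd

df = pd.DataFrame({"sex": ["M", "F", "F", "M", "f"]})

values = df["sex"].unique()            # distinct values in the column
unexpected = set(values) - {"M", "F"}  # anything other than M and F
has_repeats = df["sex"].duplicated().any()

print(values)
print(unexpected)
```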

In practice, most data sets will have missing values. It is prudent to investigate the reason for missing values before taking any action. There are different ways of handling missing values built into pandas objects. To remove missing values, the dropna method is used. Dropping missing values is a bit tricky in DataFrames. By default, dropna drops every row that contains at least one missing value. To drop columns instead, you need to specify the axis.
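
The difference between dropping rows and dropping columns is sketched below on made-up data. The how="all" option is an additional setting of dropna that only drops rows in which every value is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, 9.0],
})

rows_kept = df.dropna()              # drop rows with any missing value
cols_kept = df.dropna(axis=1)        # drop columns with any missing value
all_missing = df.dropna(how="all")   # drop only rows that are entirely missing

print(rows_kept)
print(cols_kept)
```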

This tutorial covered frequently used data manipulation techniques in pandas. Concepts covered were creating pandas objects, reindexing, selecting rows and columns, applying functions, sorting data, summarizing data and handling missing values.
