What is Data Analysis?
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
In this article, I use pandas to explore the basics of data analysis.
pandas has three main data structures: Series, DataFrame, and Panel.
Installation
The easiest way to install pandas is to use pip:
pip install pandas
Alternatively, download it from here.
A pandas Series is a one-dimensional labeled array.
import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = pd.Series([100, 98.7, 98.4, 97.7], index=index_list)
print(a)
output:
test1    100.0
test2     98.7
test3     98.4
test4     97.7
dtype: float64
Labels can be accessed using the index attribute.
print(a.index)
Index(['test1', 'test2', 'test3', 'test4'], dtype='object')
You can use positional (array-style) indexing or labels to access data in the series.
print(a.iloc[1])
print(a['test4'])
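Beyond single elements, label slices and boolean masks also select from a series. A short sketch reusing the series from above; the variable names are illustrative and not part of the original article:

```python
import pandas as pd

index_list = ['test1', 'test2', 'test3', 'test4']
a = pd.Series([100, 98.7, 98.4, 97.7], index=index_list)

by_position = a.iloc[1]               # second element, regardless of labels
label_slice = a.loc['test2':'test3']  # label slices include both endpoints
above_98 = a[a > 98]                  # boolean mask keeps matching entries

print(by_position)
print(label_slice)
print(above_98)
```

Using .iloc for positions and .loc for labels keeps the intent explicit when the labels themselves could be mistaken for positions.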
You can also apply mathematical operations on pandas series.
b = a * 2
c = a ** 1.5
print(b)
print(c)
test1    200.0
test2    197.4
test3    196.8
test4    195.4
dtype: float64
test1    1000.000000
test2     980.563513
test3     976.096258
test4     965.699142
dtype: float64
You can even create a series of heterogeneous data.
s = pd.Series(['test1', 1.2, 3, 'test2'], index=['test3', 'test4', 2, '4.3'])
print(s)
test3    test1
test4      1.2
2            3
4.3      test2
dtype: object
A pandas DataFrame is a two-dimensional array with heterogeneous data, i.e., data is aligned in a tabular fashion in rows and columns.
Structure
Let us assume that we are creating a data frame with data about a sales team.
Name | Age | Gender | Rating
Steve | 32 | Male | 3.45
Lia | 28 | Female | 4.6
Vin | 45 | Male | 3.9
Katie | 38 | Female | 2
You can think of it as an SQL table or a spreadsheet data representation.
The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.
The data types of the four columns are as follows −
Column | Type
Name | String
Age | Integer
Gender | String
Rating | Float
Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
• data
data takes various forms such as ndarray, Series, map, lists, dict, constants, and also another DataFrame.
• index
Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
• columns
Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
• dtype
The data type of each column.
• copy
Whether to copy data from the inputs. The default described here is False.
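The parameters above can be seen together in a small sketch that builds the sales-team table from earlier. The row labels r1 to r4 and the variable names are made up for illustration:

```python
import pandas as pd

# Rows taken from the sales-team table shown earlier.
rows = [
    ['Steve', 32, 'Male', 3.45],
    ['Lia', 28, 'Female', 4.6],
    ['Vin', 45, 'Male', 3.9],
    ['Katie', 38, 'Female', 2.0],
]

team = pd.DataFrame(
    data=rows,
    index=['r1', 'r2', 'r3', 'r4'],                 # hypothetical row labels
    columns=['Name', 'Age', 'Gender', 'Rating'],
)

print(team)
print(team.dtypes)  # Age is inferred as integer, Rating as float
```

Leaving out index or columns falls back to 0, 1, 2, ... labels, as described in the parameter list.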
There are many methods to create DataFrames.
• Lists
• dict
• Series
• Numpy ndarrays
• Another DataFrame
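Two of the inputs listed above, a NumPy ndarray and another DataFrame, can be sketched as follows. The column names x and y are placeholders chosen for this example:

```python
import numpy as np
import pandas as pd

# From a 2D ndarray: rows get the default 0..n-1 index.
arr = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(arr, columns=['x', 'y'])

# From another DataFrame: passing columns selects a subset of it.
from_frame = pd.DataFrame(from_array, columns=['y'])

print(from_array)
print(from_frame)
```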
Creating DataFrame from the dictionary of Series
The following method can be used to create DataFrames from a dictionary of pandas series.
import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = {"column1": pd.Series([100, 98.7, 98.4, 97.7], index=index_list),
     "column2": pd.Series([100, 100, 100, 85.4], index=index_list)}
df = pd.DataFrame(a)
print(df)
       column1  column2
test1    100.0    100.0
test2     98.7    100.0
test3     98.4    100.0
test4     97.7     85.4
print(df.index)
Index(['test1', 'test2', 'test3', 'test4'], dtype='object')
print(df.columns)
Index(['column1', 'column2'], dtype='object')
Creating DataFrame from list of dictionaries
l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['test1', 'test2'])
print(df)
       apple  banana  carrot  orange
test1     42     NaN     NaN    32.0
test2     34    25.0    44.0     NaN
You might have noticed that we got a DataFrame with NaN values in it. This is because we didn’t provide the data for that particular row and column.
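To see where those gaps are, or to replace them, something like the following sketch can be used. Filling with 0 is only an assumption made for illustration; whether 0 is a sensible stand-in depends on the data:

```python
import pandas as pd

l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['test1', 'test2'])

missing_per_column = df.isna().sum()  # number of NaN values in each column
filled = df.fillna(0)                 # new frame with NaN replaced by 0

print(missing_per_column)
print(filled)
```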
Creating DataFrame from Text/CSV files
pandas comes in handy when you want to load data from a CSV or a text file. It has built-in functions that do this for us.
df = pd.read_csv('happiness.csv')
Yes, we created a DataFrame from a CSV file. This dataset contains the outcome of the European quality of life survey. This dataset is available here. Now that we have stored the DataFrame in df, we want to see what’s inside. First, we will see the size of the DataFrame.
print(df.shape)
It has 105 Rows and 4 Columns. Instead of printing out all the data, we will see the first 10 rows.
df.head(10)
  Country  Gender  Mean    N=
0      AT    Male   7.3   471
1     NaN  Female   7.3   570
2     NaN    Both   7.3  1041
3      BE    Male   7.8   468
4     NaN  Female   7.8   542
5     NaN    Both   7.8  1010
6      BG    Male   5.8   416
7     NaN  Female   5.8   555
8     NaN    Both   5.8   971
9      CY    Male   7.8   433
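In the rows above, the country code appears only on the first row of each group and is NaN on the rest. One way to repair this is a forward fill. The sketch below does not read the real happiness.csv; it inlines a few of the rows shown above as comma-separated text, which is an assumption about the file layout:

```python
import io
import pandas as pd

# Stand-in for happiness.csv: the first six rows shown above, inlined.
csv_text = """Country,Gender,Mean,N=
AT,Male,7.3,471
,Female,7.3,570
,Both,7.3,1041
BE,Male,7.8,468
,Female,7.8,542
,Both,7.8,1010
"""

df = pd.read_csv(io.StringIO(csv_text))

# Carry the last seen country code down into the empty cells.
df['Country'] = df['Country'].ffill()

print(df)
```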
There are many more ways to create a DataFrame, but now we will look at the basic operations on DataFrames.
Operations on DataFrame
We’ll recall the DataFrame we made earlier.
import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = {"column1": pd.Series([100, 98.7, 98.4, 97.7], index=index_list),
     "column2": pd.Series([100, 100, 100, 85.4], index=index_list)}
df = pd.DataFrame(a)
print(df)
       column1  column2
test1    100.0    100.0
test2     98.7    100.0
test3     98.4    100.0
test4     97.7     85.4
Now we want to create a new column from the current columns. Let’s see how it is done.
df['column3'] = (2 * df['column1'] + 3 * df['column2']) / 5
       column1  column2  column3
test1    100.0    100.0   100.00
test2     98.7    100.0    99.48
test3     98.4    100.0    99.36
test4     97.7     85.4    90.32
We have created a new column, column3, from column1 and column2. We’ll create one more using a boolean expression.
df['flag'] = df['column1'] > 99.5
We can also remove columns.
column3 = df.pop('column3')
print(column3)
test1    100.00
test2     99.48
test3     99.36
test4     90.32
Name: column3, dtype: float64
print(df)
       column1  column2   flag
test1    100.0    100.0   True
test2     98.7    100.0  False
test3     98.4    100.0  False
test4     97.7     85.4  False
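pop removes the column from the DataFrame in place. When the original frame should stay untouched, drop returns a new frame instead. A minimal sketch with the same data; the name trimmed is made up for this example:

```python
import pandas as pd

index_list = ['test1', 'test2', 'test3', 'test4']
df = pd.DataFrame({
    'column1': pd.Series([100, 98.7, 98.4, 97.7], index=index_list),
    'column2': pd.Series([100, 100, 100, 85.4], index=index_list),
})
df['flag'] = df['column1'] > 99.5

# drop returns a copy without the column; df itself keeps 'flag'.
trimmed = df.drop(columns=['flag'])

print(trimmed.columns.tolist())
print(df.columns.tolist())
```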
Descriptive Statistics using pandas
It’s very easy to view descriptive statistics of a dataset using pandas. We are going to use biomass data collected from this source. Let’s load the data first.
url = 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/biomass.csv'
df = pd.read_csv(url)
df.head()
   Unnamed: 0  dbh     wood  bark    root  rootsk  branch       species fac26
0           1   90   5528.0   NaN   460.0     NaN     NaN   E. maculata     z
1           2  106  13650.0   NaN  1500.0   665.0     NaN  E. Pilularis     2
2           3  112  11200.0   NaN  1100.0   680.0     NaN  E. Pilularis     2
3           4   34   1000.0   NaN   430.0    40.0     NaN  E. Pilularis     2
4           5  130      NaN   NaN  3000.0  1030.0     NaN   E. maculata     z
We are not interested in the unnamed column. So, let’s delete that first. Then we’ll see the statistics with one line of code.
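The code for this step does not appear in the text, so here is a sketch of what it might look like. It assumes the unnamed column is called 'Unnamed: 0' (the name pandas gives to an unlabeled first CSV column) and uses a few stand-in rows copied from the head() output rather than the downloaded file:

```python
import pandas as pd

# Stand-in rows copied from the head() output above, not the full dataset.
df = pd.DataFrame({
    'Unnamed: 0': [1, 2, 3, 4],
    'dbh': [90, 106, 112, 34],
    'wood': [5528.0, 13650.0, 11200.0, 1000.0],
    'root': [460.0, 1500.0, 1100.0, 430.0],
})

# Remove the leftover index column, then summarize the numeric columns.
df = df.drop(columns=['Unnamed: 0'])
summary = df.describe()

print(summary)
```

On the full dataset, the same describe() call would produce a summary table with one column per numeric field.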
              dbh          wood         bark         root       rootsk      branch
count  153.000000    133.000000    17.000000    54.000000    53.000000   76.000000
mean    26.352941   1569.045113   513.235294   334.383333   113.802264   54.065789
std     28.273679   4071.380720   632.467542   654.641245   247.224118   65.606369
min      3.000000      3.000000     7.000000     0.300000     0.050000    4.000000
25%      8.000000     29.000000    59.000000    11.500000     2.000000   10.750000
50%     15.000000    162.000000   328.000000    41.000000    11.000000   35.000000
75%     36.000000   1000.000000   667.000000   235.000000    45.000000   77.750000
max    145.000000  25116.000000  1808.000000  3000.000000  1030.000000  371.000000
It’s as simple as that. We can see the count, mean, standard deviation, and other statistics at a glance. Each metric can also be computed on its own, and some, such as pairwise correlation, are not part of the describe() summary.
Mean
print(df.mean(numeric_only=True))
dbh         26.352941
wood      1569.045113
bark       513.235294
root       334.383333
rootsk     113.802264
branch      54.065789
dtype: float64
Min and Max
print(df.min())
dbh                      3
wood                     3
bark                     7
root                   0.3
rootsk                0.05
branch                   4
species    Acacia mabellae
dtype: object
print(df.max())
dbh          145
wood       25116
bark        1808
root        3000
rootsk      1030
branch       371
species    Other
dtype: object
Pairwise Correlation
df.corr(numeric_only=True)
             dbh      wood      bark      root    rootsk    branch
dbh     1.000000  0.905175  0.965413  0.899301  0.934982  0.861660
wood    0.905175  1.000000  0.971700  0.988752  0.967082  0.821731
bark    0.965413  0.971700  1.000000  0.961038  0.971341  0.943383
root    0.899301  0.988752  0.961038  1.000000  0.936935  0.679760
rootsk  0.934982  0.967082  0.971341  0.936935  1.000000  0.621550
branch  0.861660  0.821731  0.943383  0.679760  0.621550  1.000000
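A couple of other summaries that the numeric describe() table does not include are the variance and counts of distinct labels. The sketch below runs on stand-in rows copied from the head() output earlier, not on the full dataset:

```python
import pandas as pd

# Stand-in rows copied from the head() output, not the full dataset.
df = pd.DataFrame({
    'dbh': [90, 106, 112, 34],
    'wood': [5528.0, 13650.0, 11200.0, 1000.0],
    'species': ['E. maculata', 'E. Pilularis', 'E. Pilularis', 'E. Pilularis'],
})

variance = df.var(numeric_only=True)     # per-column variance, numeric only
n_species = df['species'].nunique()      # number of distinct species labels
counts = df['species'].value_counts()    # rows per species

print(variance)
print(n_species)
print(counts)
```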
Data Cleaning
We need to clean our data. Our data might contain missing values, NaN values, outliers, etc. We may need to remove or replace that data. Otherwise, our data might not make any sense.
We can find null values using the following method.
print(df.isnull().any())
dbh        False
wood        True
bark        True
root        True
rootsk      True
branch      True
species    False
fac26       True
dtype: bool
One way to handle these null values is to drop every row that contains one. This can be done by the method shown below.
newdf = df.dropna()
print(newdf)
     dbh   wood   bark  root  rootsk  branch        species fac26
123   27  550.0  105.0  44.0     9.0    59.0  B. myrtifolia     z
124   26  414.0   78.0  38.0    13.0    44.0  B. myrtifolia     z
125    9   42.0    8.0   5.0     1.3     7.0  B. myrtifolia     z
126   12   85.0   13.0  17.0     2.2    16.0  B. myrtifolia     z
print(newdf.shape)
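Dropping every row with any null value can discard most of a dataset when one column is sparse. Two gentler options are to require values only in selected columns or to fill the gaps. The sketch below uses hypothetical values in the shape of the biomass data, and filling with the median is just one possible choice:

```python
import numpy as np
import pandas as pd

# Hypothetical values shaped like the biomass data.
df = pd.DataFrame({
    'dbh': [90, 106, 112, 130],
    'wood': [5528.0, 13650.0, 11200.0, np.nan],
    'bark': [np.nan, np.nan, 105.0, np.nan],
})

strict = df.dropna()                                # rows with no nulls at all
by_wood = df.dropna(subset=['wood'])                # only require a wood value
filled = df.fillna({'bark': df['bark'].median()})   # fill bark, leave the rest

print(strict.shape)
print(by_wood.shape)
print(filled)
```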
Pandas .Panel()
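A Panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s. Panel was deprecated in pandas 0.20 and removed in 0.25, so the examples in this section require an older release.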
The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −
• items − axis 0, each item corresponds to a DataFrame contained inside.
• major_axis − axis 1, it is the index (rows) of each of the DataFrames.
• minor_axis − axis 2, it is the columns of each of the DataFrames.
A Panel can be created using the following constructor −
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
The parameters of the constructor are as follows −
• data – Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame
• items – axis=0
• major_axis – axis=1
• minor_axis – axis=2
• dtype – the data type of each column
• copy – whether to copy the data; the default is False
A Panel can be created using multiple ways like −
• From ndarrays
• From dict of DataFrames
• From 3D ndarray
# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np
data = np.random.rand(2, 4, 5)
p = pd.Panel(data)
print(p)
output:
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
Note − Observe the dimensions of the empty panel and the above panel, all the objects are different.
From dict of DataFrame Objects
# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)
output:
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
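On pandas releases that no longer ship Panel, a similar 3D layout can be approximated by stacking the DataFrames with pd.concat, so that the item name becomes the outer level of a row MultiIndex. This is a sketch of one possible substitute, not a drop-in replacement; the level names item and major are chosen here for illustration:

```python
import numpy as np
import pandas as pd

data = {
    'Item1': pd.DataFrame(np.random.randn(4, 3)),
    'Item2': pd.DataFrame(np.random.randn(4, 2)),
}

# Dict keys become the outer index level; missing columns are filled with NaN.
stacked = pd.concat(data, names=['item', 'major'])

# Selecting one item returns an ordinary 4 x 3 DataFrame.
item1 = stacked.xs('Item1', level='item')

print(stacked.shape)
print(item1)
```

As with the Panel above, Item2 has only two columns, so its third column is all NaN in the stacked frame.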
Selecting the Data from Panel
Select the data from the panel using −
• Items
• Major_axis
• Minor_axis
Using Items
# selecting one item from a panel built from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])
output:
          0         1         2
0 -0.006795 -1.156193 -0.524367
1  0.025610  1.533741  0.331956
2  1.067671  1.309666  1.304710
3  0.615196  1.348469 -0.410289
We have two items, and we retrieved Item1. The result is a DataFrame with 4 rows and 3 columns, which are the Major_axis and Minor_axis dimensions.
Using major_axis
Data can be accessed using the method panel.major_xs(index).
      Item1     Item2
0  0.027133 -1.078773
1  0.115686 -0.253315
2 -0.473201       NaN
Using minor_axis
Data can be accessed using the method panel.minor_xs(index).
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.minor_xs(1))
      Item1     Item2
0  0.092727 -1.633860
1  0.333863 -0.568101
2  0.388890 -0.338230
3 -0.618997  -1.01808
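Where Panel is unavailable, comparable cross-sections can be taken from a dict of DataFrames stacked into a single MultiIndex frame. The following is a sketch under that assumption; the level names item and major are illustrative, and the results mirror the shape of the major_xs and minor_xs outputs rather than reproducing the Panel API:

```python
import numpy as np
import pandas as pd

data = {
    'Item1': pd.DataFrame(np.random.randn(4, 3)),
    'Item2': pd.DataFrame(np.random.randn(4, 2)),
}
stacked = pd.concat(data, names=['item', 'major'])

# Similar to major_xs(1): row 1 of every item, one column per item.
major_1 = stacked.xs(1, level='major').T

# Similar to minor_xs(1): column 1 of every item, one column per item.
minor_1 = stacked[1].unstack('item')

print(major_1)
print(minor_1)
```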