What is Data Analysis?
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
In this article, I use pandas to walk through the basics of data analysis.
pandas has three main data structures: Series, DataFrame, and Panel.
Installation
The easiest way to install pandas is to use pip:
pip install pandas
or download it from here.
A pandas Series is a one-dimensional labeled array.
import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = pd.Series([100, 98.7, 98.4, 97.7],index=index_list)
print(a)
output:
test1 100.0
test2 98.7
test3 98.4
test4 97.7
dtype: float64
Labels can be accessed using the index attribute.
print(a.index)
Index(['test1', 'test2', 'test3', 'test4'], dtype='object')
You can use array indexing or labels to access data in the series.
print(a.iloc[1])
print(a['test4'])
98.7
97.7
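To make the difference between positional and label-based access explicit, here is a minimal sketch using the `.iloc` and `.loc` accessors. The variable names (`scores`, `by_position`, `by_label`) are chosen for illustration only.

```python
import pandas as pd

scores = pd.Series([100, 98.7, 98.4, 97.7],
                   index=['test1', 'test2', 'test3', 'test4'])

# .iloc selects by integer position (0-based)
second = scores.iloc[1]

# .loc selects by index label
last = scores.loc['test4']

# Slices work with both accessors; label slices include the end label
by_position = scores.iloc[0:2]          # test1, test2
by_label = scores.loc['test1':'test3']  # test1, test2, test3

print(second, last)
print(by_position)
print(by_label)
```

Using the accessors avoids ambiguity when the index itself contains integers.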
You can also apply mathematical operations on pandas series.
b = a * 2
c = a ** 1.5
print(b)
print(c)
test1 200.0
test2 197.4
test3 196.8
test4 195.4
dtype: float64
test1 1000.000000
test2 980.563513
test3 976.096258
test4 965.699142
dtype: float64
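Operations between two Series are matched by label rather than by position. The following sketch, with made-up values and labels, shows what happens when the indexes only partly overlap.

```python
import pandas as pd

first = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
second = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Values are added label by label; labels present in only one
# Series produce NaN in the result
total = first + second
print(total)

# The add method with fill_value treats a missing label as 0 instead
total_filled = first.add(second, fill_value=0)
print(total_filled)
```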
You can even create a series of heterogeneous data.
s = pd.Series(['test1', 1.2, 3, 'test2'], index=['test3', 'test4', 2, '4.3'])
print(s)
test3 test1
test4 1.2
2 3
4.3 test2
dtype: object
A pandas DataFrame is a two-dimensional data structure that can hold heterogeneous data, i.e., data is aligned in a tabular fashion in rows and columns.
Structure
Let us assume that we are creating a DataFrame with data about a sales team.
Name | Age | Gender | Rating
Steve | 32 | Male | 3.45
Lia | 28 | Female | 4.6
Vin | 45 | Male | 3.9
Katie | 38 | Female | 2
You can think of it as an SQL table or a spreadsheet data representation.
The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.
The data types of the four columns are as follows −
Column | Type
Name | String
Age | Integer
Gender | String
Rating | Float
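As a rough sketch, the sales team table can be built from a dictionary of lists, and the column types inspected with the dtypes attribute. The variable name `team` is an assumption made for this example.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
team = pd.DataFrame({
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.0],
})

print(team)

# Age is stored as an integer type and Rating as a float type;
# how string columns are reported depends on the pandas version
print(team.dtypes)
```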
Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
• data
data takes various forms like ndarray, Series, map, lists, dict, constants, and also another DataFrame.
• index
The row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
• columns
The column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
• dtype
The data type of each column.
• copy
Whether to copy the data from the input. The default was False in older pandas releases; newer releases decide based on the input type.
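The effect of the index, columns, and dtype parameters can be illustrated with a small sketch. The array contents and the labels 'r1', 'x', and so on are made up for this example.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# No index or columns passed: both default to 0..n-1
default_df = pd.DataFrame(values)
print(default_df)

# Explicit row labels, column labels, and dtype
labeled_df = pd.DataFrame(values,
                          index=['r1', 'r2', 'r3'],
                          columns=['x', 'y'],
                          dtype=float)
print(labeled_df)
```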
There are many methods to create DataFrames.
• Lists
• dict
• Series
• Numpy ndarrays
• Another DataFrame
Creating DataFrame from the dictionary of Series
The following method can be used to create DataFrames from a dictionary of pandas series.
import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = {"column1": pd.Series([100, 98.7, 98.4, 97.7],index=index_list), "column2": pd.Series([100, 100, 100, 85.4], index=index_list)}
df = pd.DataFrame(a)
print(df)
column1 column2
test1 100.0 100.0
test2 98.7 100.0
test3 98.4 100.0
test4 97.7 85.4
print(df.index)
Index(['test1', 'test2', 'test3', 'test4'], dtype='object')
print(df.columns)
Index(['column1', 'column2'], dtype='object')
Creating DataFrame from list of dictionaries
l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['test1', 'test2'])
print(df)
apple banana carrot orange
test1 42 NaN NaN 32.0
test2 34 25.0 44.0 NaN
You might have noticed that we got a DataFrame with NaN values in it. This is because we didn’t provide data for that particular row and column.
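A short sketch of how those gaps can be located and, if appropriate, filled. Replacing a missing value with 0 is an assumption that only makes sense when an absent key really means zero.

```python
import pandas as pd

records = [{'orange': 32, 'apple': 42},
           {'banana': 25, 'carrot': 44, 'apple': 34}]
fruit = pd.DataFrame(records, index=['test1', 'test2'])

# True wherever a key was absent from the source dictionary
print(fruit.isna())

# Replace the gaps with 0 (returns a new DataFrame)
filled = fruit.fillna(0)
print(filled)
```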
Creating DataFrame from Text/CSV files
pandas comes in handy when you want to load data from a CSV or a text file. It has built-in functions to do this for us.
df = pd.read_csv('happiness.csv')
Yes, we created a DataFrame from a CSV file. This dataset contains the outcome of the European Quality of Life Survey. This dataset is available here. Now that we have stored the DataFrame in df, we want to see what’s inside. First, we will see the size of the DataFrame.
print(df.shape)
(105, 4)
It has 105 Rows and 4 Columns. Instead of printing out all the data, we will see the first 10 rows.
df.head(10)
Country Gender Mean N=
0 AT Male 7.3 471
1 NaN Female 7.3 570
2 NaN Both 7.3 1041
3 BE Male 7.8 468
4 NaN Female 7.8 542
5 NaN Both 7.8 1010
6 BG Male 5.8 416
7 NaN Female 5.8 555
8 NaN Both 5.8 971
9 CY Male 7.8 433
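The preview shows NaN in the Country column for every row after the first of each group. The sketch below reproduces this with the first four rows of the preview typed in as inline text (so it runs without the file) and forward-fills the country code. Whether forward-filling suits the full dataset is an assumption based only on the rows shown.

```python
import io
import pandas as pd

# Stand-in for happiness.csv: the first four rows of the preview above
csv_text = """Country,Gender,Mean,N=
AT,Male,7.3,471
,Female,7.3,570
,Both,7.3,1041
BE,Male,7.8,468
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)

# Empty fields are read as NaN; ffill copies the last seen value downward
filled = sample.assign(Country=sample['Country'].ffill())
print(filled)
```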
There are many more ways to create a DataFrame, but now we will look at basic operations on DataFrames.
Operations on DataFrame
We’ll recall the DataFrame we made earlier.
import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = {"column1": pd.Series([100, 98.7, 98.4, 97.7],index=index_list), "column2": pd.Series([100, 100, 100, 85.4], index=index_list)}
df = pd.DataFrame(a)
print(df)
column1 column2
test1 100.0 100.0
test2 98.7 100.0
test3 98.4 100.0
test4 97.7 85.4
Now we want to create a new column from the existing columns. Let’s see how it is done.
df['column3'] = (2 * df['column1'] + 3 * df['column2']) / 5
print(df)
column1 column2 column3
test1 100.0 100.0 100.00
test2 98.7 100.0 99.48
test3 98.4 100.0 99.36
test4 97.7 85.4 90.32
We have created a new column, column3, from column1 and column2. We’ll create one more using a Boolean condition.
df['flag'] = df['column1'] > 99.5
We can also remove columns.
column3 = df.pop('column3')
print(column3)
test1 100.00
test2 99.48
test3 99.36
test4 90.32
Name: column3, dtype: float64
print(df)
column1 column2 flag
test1 100.0 100.0 True
test2 98.7 100.0 False
test3 98.4 100.0 False
test4 97.7 85.4 False
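The operations above modify df in place. As an alternative sketch, assign and drop return new DataFrames and leave the original untouched, and a Boolean condition can also be used directly to filter rows. The names `with_avg`, `without_col2`, and `high` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'column1': [100, 98.7, 98.4, 97.7],
                   'column2': [100, 100, 100, 85.4]},
                  index=['test1', 'test2', 'test3', 'test4'])

# assign returns a new DataFrame with the extra column; df is unchanged
with_avg = df.assign(column3=(2 * df['column1'] + 3 * df['column2']) / 5)

# drop also returns a new DataFrame rather than modifying in place
without_col2 = with_avg.drop(columns=['column2'])

# A Boolean Series used as a row filter keeps only the matching rows
high = with_avg[with_avg['column1'] > 99.5]

print(with_avg)
print(without_col2)
print(high)
```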
Descriptive Statistics using pandas
It’s very easy to view descriptive statistics of a dataset using pandas. We are going to use biomass data collected from this source. Let’s load the data first.
url = 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/biomass.csv'
df = pd.read_csv(url)
df.head()
Unnamed:0 dbh wood bark root rootsk branch species fac26
0 1 90 5528.0 NaN 460.0 NaN NaN E. maculata z
1 2 106 13650.0 NaN 1500.0 665.0 NaN E. Pilularis 2
2 3 112 11200.0 NaN 1100.0 680.0 NaN E. Pilularis 2
3 4 34 1000.0 NaN 430.0 40.0 NaN E. Pilularis 2
4 5 130 NaN NaN 3000.0 1030.0 NaN E. maculata z
We are not interested in the unnamed column, so let’s delete that first. Then we’ll see the statistics with one line of code.
df = df.drop(columns=['Unnamed: 0'])
df.describe()
dbh wood bark root rootsk branch
count 153.000000 133.000000 17.000000 54.000000 53.000000 76.000000
mean 26.352941 1569.045113 513.235294 334.383333 113.802264 54.065789
std 28.273679 4071.380720 632.467542 654.641245 247.224118 65.606369
min 3.000000 3.000000 7.000000 0.300000 0.050000 4.000000
25% 8.000000 29.000000 59.000000 11.500000 2.000000 10.750000
50% 15.000000 162.000000 328.000000 41.000000 11.000000 35.000000
75% 36.000000 1000.000000 667.000000 235.000000 45.000000 77.750000
max 145.000000 25116.000000 1808.000000 3000.000000 1030.000000 371.000000
It’s as simple as that. We can see the count, mean, standard deviation, and other statistics. Each of these metrics can also be computed on its own, along with others that are not in the describe() summary, such as pairwise correlation.
Mean
print(df.mean(numeric_only=True))
dbh 26.352941
wood 1569.045113
bark 513.235294
root 334.383333
rootsk 113.802264
branch 54.065789
dtype: float64
Min and Max
print(df.min())
dbh 3
wood 3
bark 7
root 0.3
rootsk 0.05
branch 4
species Acacia mabellae
dtype: object
print(df.max())
dbh 145
wood 25116
bark 1808
root 3000
rootsk 1030
branch 371
species Other
dtype: object
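Several statistics can be requested in one call with agg. The sketch below uses only the dbh and wood values from the first four rows of the preview, so the numbers it prints describe that small sample and not the full dataset.

```python
import pandas as pd

# dbh and wood values taken from the first four rows shown by df.head()
sample = pd.DataFrame({'dbh': [90, 106, 112, 34],
                       'wood': [5528.0, 13650.0, 11200.0, 1000.0]})

# Each name in the list becomes one row of the result
summary = sample.agg(['median', 'var', 'sum'])
print(summary)
```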
Pairwise Correlation
df.corr(numeric_only=True)
dbh wood bark root rootsk branch
dbh 1.000000 0.905175 0.965413 0.899301 0.934982 0.861660
wood 0.905175 1.000000 0.971700 0.988752 0.967082 0.821731
bark 0.965413 0.971700 1.000000 0.961038 0.971341 0.943383
root 0.899301 0.988752 0.961038 1.000000 0.936935 0.679760
rootsk 0.934982 0.967082 0.971341 0.936935 1.000000 0.621550
branch 0.861660 0.821731 0.943383 0.679760 0.621550 1.000000
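To help read a correlation matrix, here is a small sketch with made-up columns whose relationships are known in advance: y rises exactly with x, and z falls exactly as x rises.

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3, 4],
                    'y': [2, 4, 6, 8],   # y = 2 * x
                    'z': [4, 3, 2, 1]})  # z = 5 - x

# Pearson correlation is the default method
corr = toy.corr()
print(corr)
```

A value near 1 means two columns move together, a value near -1 means they move in opposite directions, and the diagonal is always 1 because each column is compared with itself.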
Data Cleaning
We need to clean our data. Our data might contain missing values, NaN values, outliers, etc. We may need to remove or replace that data. Otherwise, our data might not make any sense.
We can find null values using the following method.
print(df.isnull().any())
dbh False
wood True
bark True
root True
rootsk True
branch True
species False
fac26 True
dtype: bool
We have to remove these null values. This can be done by the method shown below.
newdf = df.dropna()
print(newdf)
dbh wood bark root rootsk branch species fac26
123 27 550.0 105.0 44.0 9.0 59.0 B. myrtifolia z
124 26 414.0 78.0 38.0 13.0 44.0 B. myrtifolia z
125 9 42.0 8.0 5.0 1.3 7.0 B. myrtifolia z
126 12 85.0 13.0 17.0 2.2 16.0 B. myrtifolia z
print(newdf.shape)
(4, 8)
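Dropping every row that has any NaN left only 4 of the original rows. The sketch below, on a small made-up frame, shows two less drastic options: requiring values only in selected columns, or filling gaps instead of dropping rows. The values and the choice of 0.0 as a fill value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the biomass data
trees = pd.DataFrame({'wood': [5528.0, np.nan, 1000.0],
                      'bark': [np.nan, np.nan, 105.0],
                      'species': ['E. maculata', 'E. pilularis', 'B. myrtifolia']})

# Count missing values per column before deciding what to do
missing_per_column = trees.isna().sum()
print(missing_per_column)

all_complete = trees.dropna()                # rows with no NaN at all
wood_known = trees.dropna(subset=['wood'])   # only require a wood value
filled = trees.fillna({'bark': 0.0})         # fill bark instead of dropping

print(all_complete.shape, wood_known.shape, filled.shape)
```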
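pandas Panel
A Panel is a 3D container of data. The term panel data comes from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so the Panel examples below only run on older releases.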
The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −
• items − axis 0, each item corresponds to a DataFrame contained inside.
• major_axis − axis 1, it is the index (rows) of each of the DataFrames.
• minor_axis − axis 2, it is the columns of each of the DataFrames.
A Panel can be created using the following constructor −
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
The parameters of the constructor are as follows −
• data – Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame
• items – axis=0
• major_axis – axis=1
• minor_axis – axis=2
• dtype – the Data type of each column
• copy – Copy data. Default, false
A Panel can be created using multiple ways like −
• From ndarrays
• From dict of DataFrames
• From 3D ndarray
# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np
data = np.random.rand(2,4,5)
p = pd.Panel(data)
print(p)
output:
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
Note − Observe the dimensions of the empty panel and the above panel, all the objects are different.
From dict of DataFrame Objects
# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)
output:
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
Selecting the Data from Panel
Select the data from the panel using −
• Items
• Major_axis
• Minor_axis
Using Items
# creating a panel from a dict of DataFrames and selecting one item
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])
output:
0 1 2
0 -0.006795 -1.156193 -0.524367
1 0.025610 1.533741 0.331956
2 1.067671 1.309666 1.304710
3 0.615196 1.348469 -0.410289
We have two items, and we retrieved Item1. The result is a DataFrame with 4 rows and 3 columns, which are the Major_axis and Minor_axis dimensions.
Using major_axis
Data can be accessed using the method panel.major_xs(index).
Item1 Item2
0 0.027133 -1.078773
1 0.115686 -0.253315
2 -0.473201 NaN
Using minor_axis
Data can be accessed using the method panel.minor_xs(index).
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.minor_xs(1))
Item1 Item2
0 0.092727 -1.633860
1 0.333863 -0.568101
2 0.388890 -0.338230
3 -0.618997 -1.01808
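Since pd.Panel is not available in pandas 0.25 and later, the same three kinds of selection can be approximated on current releases with a dict of DataFrames stacked into one MultiIndex DataFrame. This is a sketch of one possible substitute, not an official replacement API; the level names 'item' and 'major' are chosen here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frames = {'Item1': pd.DataFrame(rng.standard_normal((4, 3))),
          'Item2': pd.DataFrame(rng.standard_normal((4, 2)))}

# Stack the frames; the dict keys become the outer index level.
# Item2 has no column 2, so that column is NaN for its rows.
stacked = pd.concat(frames).rename_axis(['item', 'major'])

# Roughly p['Item1']: all rows belonging to one item
item1 = stacked.loc['Item1']

# Roughly p.major_xs(1): row 1 of every item, transposed so that
# items are columns
major1 = stacked.xs(1, level='major').T

# Roughly p.minor_xs(1): column 1 of every item, with items as columns
minor1 = stacked[1].unstack('item')

print(item1)
print(major1)
print(minor1)
```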