Data Analysis with Pandas & Python

What is Data Analysis?
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
In this article, I use pandas to walk through the basics of data analysis.
pandas has three main data structures: Series, DataFrame, and Panel. (Panel is deprecated and has been removed from recent pandas releases.)

Installation
The easiest way to install pandas is to use pip:

pip install pandas

Or download it from here.

  • pandas Series

A pandas Series is a one-dimensional labeled array.

import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = pd.Series([100, 98.7, 98.4, 97.7],index=index_list)
print(a)
output:
test1    100.0
test2     98.7
test3     98.4
test4     97.7
dtype: float64

Labels can be accessed using the index attribute.
print(a.index)

Index(['test1', 'test2', 'test3', 'test4'], dtype='object')

You can access data in the series by position or by label.
print(a.iloc[1])
print(a['test4'])

98.7
97.7
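
A bare integer in square brackets can be ambiguous when the index holds labels, so it may help to see the explicit accessors side by side. This is a minimal sketch that rebuilds the same series; the names second, last, and middle are illustrative and not from the original.

```python
import pandas as pd

index_list = ['test1', 'test2', 'test3', 'test4']
a = pd.Series([100, 98.7, 98.4, 97.7], index=index_list)

# Position-based access: the second element, whatever its label is
second = a.iloc[1]

# Label-based access: the element whose label is 'test4'
last = a.loc['test4']

# Label slices include both endpoints
middle = a.loc['test2':'test3']

print(second, last)
print(middle)
```

Using .iloc for positions and .loc for labels keeps the intent clear even when the labels themselves are integers.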

You can also apply mathematical operations on pandas series.
b = a * 2
c = a ** 1.5
print(b)
print(c)

test1    200.0
test2    197.4
test3    196.8
test4    195.4
dtype: float64

test1    1000.000000
test2     980.563513
test3     976.096258
test4     965.699142
dtype: float64

You can even create a series of heterogeneous data.
s = pd.Series(['test1', 1.2, 3, 'test2'], index=['test3', 'test4', 2, '4.3'])

print(s)

test3   test1
test4   1.2
2       3
4.3     test2
dtype: object
  • pandas DataFrame

A pandas DataFrame is a two-dimensional data structure that can hold heterogeneous data, i.e., data is aligned in a tabular fashion in rows and columns.
Structure
Let us assume that we are creating a data frame with a sales team’s data.

Name   Age  Gender  Rating
Steve  32   Male    3.45
Lia    28   Female  4.6
Vin    45   Male    3.9
Katie  38   Female  2

You can think of it as an SQL table or a spreadsheet data representation.
The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.
The data types of the four columns are as follows −

Column  Type
Name    String
Age     Integer
Gender  String
Rating  Float

Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable

A pandas DataFrame can be created using the following constructor −
pandas.DataFrame(data, index, columns, dtype, copy)

•  data
The input data. It can take various forms, such as an ndarray, Series, map, list, dict, constant, or another DataFrame.
•  index
The row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
•  columns
The column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
•  dtype
The data type to apply to the columns. Optional; pandas infers the types if it is omitted.
•  copy
Whether to copy data from the input. The default is False.
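
As an illustration of these parameters, the following sketch builds the Name/Age/Gender/Rating table shown earlier. The variable names rows and team are assumptions made for this example.

```python
import pandas as pd

# Rows taken from the Name/Age/Gender/Rating table shown earlier
rows = [
    ['Steve', 32, 'Male', 3.45],
    ['Lia', 28, 'Female', 4.6],
    ['Vin', 45, 'Male', 3.9],
    ['Katie', 38, 'Female', 2.0],
]

# data: a list of rows; columns: the column labels.
# index is omitted, so pandas assigns the default 0..n-1 row labels.
team = pd.DataFrame(data=rows, columns=['Name', 'Age', 'Gender', 'Rating'])

print(team)
print(team.dtypes)
```

Printing team.dtypes is a quick way to check that Age was inferred as an integer column and Rating as a float column.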

A DataFrame can be created from several kinds of input:
• Lists
• dict
• Series
• Numpy ndarrays
• Another DataFrame
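
To make a few of these options concrete, here is a small sketch that builds the same two-column frame from an ndarray, from a dict of lists, and from another DataFrame. The column names x and y are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# From a 2D NumPy ndarray: each inner row becomes one DataFrame row
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['x', 'y'])

# From a dict of lists: each key becomes a column
from_dict = pd.DataFrame({'x': [1, 3], 'y': [2, 4]})

# From another DataFrame: copy=True gives an independent copy
from_frame = pd.DataFrame(from_dict, copy=True)

# Changing the copy leaves the original frame untouched
from_frame.loc[0, 'x'] = 99

print(from_array)
print(from_dict)
print(from_frame)
```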

Creating DataFrame from the dictionary of Series
The following method can be used to create DataFrames from a dictionary of pandas series.

import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = {"column1": pd.Series([100, 98.7, 98.4, 97.7],index=index_list), "column2": pd.Series([100, 100, 100, 85.4], index=index_list)}
df = pd.DataFrame(a)

print(df)

      column1  column2
test1 100.0    100.0
test2 98.7     100.0
test3 98.4     100.0
test4 97.7     85.4

print(df.index)

Index(['test1', 'test2', 'test3', 'test4'], dtype='object')

print(df.columns)

Index(['column1', 'column2'], dtype='object')

Creating DataFrame from list of dictionaries
l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['test1', 'test2'])

print(df)

        apple  banana  carrot  orange
test1     42     NaN     NaN    32.0
test2     34    25.0    44.0     NaN

You might have noticed that we got a DataFrame with NaN values in it. This is because we didn’t provide the data for that particular row and column.
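
A short sketch of how those gaps can be located and, if it suits the data, filled in. Treating a missing count as zero is only an assumption made for this example; whether it is appropriate depends on what the numbers mean.

```python
import pandas as pd

l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['test1', 'test2'])

# True wherever a dictionary had no entry for that column
missing = df.isna()
print(missing)

# Count the missing cells per column
print(missing.sum())

# One option among several: treat a missing count as zero
filled = df.fillna(0)
print(filled)
```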

Creating DataFrame from Text/CSV files
pandas comes in handy when you want to load data from a CSV or a text file. It has built-in functions to do this for us.

df = pd.read_csv('happiness.csv')

We have now created a DataFrame from a CSV file. This dataset contains the outcome of the European quality of life survey. This dataset is available here. Now that the DataFrame is stored in df, we want to see what’s inside. First, we will check the size of the DataFrame.

print(df.shape)

(105, 4)

It has 105 rows and 4 columns. Instead of printing out all the data, we will look at the first 10 rows.
print(df.head(10))

   Country  Gender  Mean    N=
0      AT    Male   7.3   471
1     NaN  Female   7.3   570
2     NaN    Both   7.3  1041
3      BE    Male   7.8   468
4     NaN  Female   7.8   542
5     NaN    Both   7.8  1010
6      BG    Male   5.8   416
7     NaN  Female   5.8   555
8     NaN    Both   5.8   971
9      CY    Male   7.8   433
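
In the output above, the Country column holds a value only on the first row of each group. Assuming the empty cells mean "same country as the row above", a forward fill restores them. The sketch below uses a few rows copied from that output as inline CSV text, since the real happiness.csv may differ in delimiter or header details.

```python
import io
import pandas as pd

# A few rows copied from the output above, laid out as CSV text
csv_text = """Country,Gender,Mean,N=
AT,Male,7.3,471
,Female,7.3,570
,Both,7.3,1041
BE,Male,7.8,468
,Female,7.8,542
,Both,7.8,1010
"""

# read_csv accepts any file-like object, not just a path
sample = pd.read_csv(io.StringIO(csv_text))

# Carry the last seen country code down into the empty cells
sample['Country'] = sample['Country'].ffill()

print(sample)
```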

There are many more ways to create a DataFrame, but now we will look at the basic operations on DataFrames.

Operations on DataFrame
We’ll recall the DataFrame we made earlier.

import pandas as pd
index_list = ['test1', 'test2', 'test3', 'test4']
a = {"column1": pd.Series([100, 98.7, 98.4, 97.7],index=index_list), "column2": pd.Series([100, 100, 100, 85.4], index=index_list)}
df = pd.DataFrame(a)

print(df)

      column1 column2
test1 100.0   100.0
test2 98.7    100.0
test3 98.4    100.0
test4 97.7    85.4

Now we want to create a new column from the current columns. Let’s see how it is done.
df['column3'] = (2 * df['column1'] + 3 * df['column2'])/5

print(df)

        column1  column2  column3
test1    100.0    100.0   100.00
test2     98.7    100.0    99.48
test3     98.4    100.0    99.36
test4     97.7     85.4    90.32

We have created a new column column3 from column1 and column2. We’ll create one more using a boolean condition.
df['flag'] = df['column1'] > 99.5

We can also remove columns.
column3 = df.pop('column3')

print(column3)

test1    100.00
test2     99.48
test3     99.36
test4     90.32
Name: column3, dtype: float64

print(df)

       column1  column2   flag
test1    100.0    100.0   True
test2     98.7    100.0  False
test3     98.4    100.0  False
test4     97.7     85.4  False
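
A boolean column like flag can also be used to select rows. The sketch below rebuilds the frame and shows this, together with drop, which returns a new frame instead of modifying df in place the way pop does. The names high and without_flag are illustrative.

```python
import pandas as pd

index_list = ['test1', 'test2', 'test3', 'test4']
df = pd.DataFrame({
    'column1': pd.Series([100, 98.7, 98.4, 97.7], index=index_list),
    'column2': pd.Series([100, 100, 100, 85.4], index=index_list),
})
df['flag'] = df['column1'] > 99.5

# Keep only the rows where the flag is True
high = df[df['flag']]

# Remove a column without mutating df; drop returns a new frame
without_flag = df.drop(columns=['flag'])

print(high)
print(list(without_flag.columns))
```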

Descriptive Statistics using pandas
It’s very easy to view descriptive statistics of a dataset using pandas. We are going to use biomass data collected from this source. Let’s load the data first.

url = 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/biomass.csv'
df = pd.read_csv(url)
df.head()

   Unnamed: 0  dbh     wood  bark    root  rootsk  branch       species  fac26
0           1   90   5528.0   NaN   460.0     NaN     NaN   E. maculata      z
1           2  106  13650.0   NaN  1500.0   665.0     NaN  E. Pilularis      2
2           3  112  11200.0   NaN  1100.0   680.0     NaN  E. Pilularis      2
3           4   34   1000.0   NaN   430.0    40.0     NaN  E. Pilularis      2
4           5  130      NaN   NaN  3000.0  1030.0     NaN   E. maculata      z

We are not interested in the unnamed column. So, let’s delete that first. Then we’ll see the statistics with one line of code.
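
The code for this step does not appear in the text, so here is a minimal sketch of what it might look like. It assumes the leftover column is named 'Unnamed: 0', which is how pandas labels a header-less first column. To keep the example self-contained, it uses a stand-in frame built from a few columns of the five rows printed above; on the full dataset the same two calls produce the summary table that follows.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded data: five rows and a subset of the columns above
df = pd.DataFrame({
    'Unnamed: 0': [1, 2, 3, 4, 5],
    'dbh': [90, 106, 112, 34, 130],
    'wood': [5528.0, 13650.0, 11200.0, 1000.0, np.nan],
    'root': [460.0, 1500.0, 1100.0, 430.0, 3000.0],
    'species': ['E. maculata', 'E. Pilularis', 'E. Pilularis',
                'E. Pilularis', 'E. maculata'],
})

# Drop the leftover row-number column
df = df.drop(columns=['Unnamed: 0'])

# Summary statistics for the numeric columns, in one line
print(df.describe())
```

By default describe() summarizes only the numeric columns, and its count row excludes missing values, which is why the counts differ from column to column.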

          dbh        wood      bark        root        rootsk        branch
count 153.000000 133.000000   17.000000   54.000000   53.000000   76.000000
mean  26.352941  1569.045113  513.235294  334.383333  113.802264  54.065789
std   28.273679  4071.380720  632.467542  654.641245  247.224118  65.606369
min   3.000000   3.000000     7.000000    0.300000    0.050000    4.000000
25%   8.000000   29.000000    59.000000   11.500000   2.000000    10.750000
50%   15.000000  162.000000   328.000000  41.000000   11.000000   35.000000
75%   36.000000  1000.000000  667.000000  235.000000  45.000000   77.750000
max   145.000000 25116.000000 1808.000000 3000.000000 1030.000000 371.000000

It’s as simple as that. We can see the count, mean, standard deviation, and other statistics at a glance. Now we are going to look at some individual metrics, including some that are not available in the describe() summary.

Mean
print(df.mean(numeric_only=True))

dbh         26.352941
wood      1569.045113
bark       513.235294
root       334.383333
rootsk     113.802264
branch      54.065789
dtype: float64

Min and Max
print(df.min())

dbh                      3
wood                     3
bark                     7
root                   0.3
rootsk                0.05
branch                   4
species    Acacia mabellae
dtype: object

print(df.max())

dbh          145
wood       25116
bark        1808
root        3000
rootsk      1030
branch       371
species    Other
dtype: object

Pairwise Correlation
df.corr(numeric_only=True)

             dbh       wood      bark      root    rootsk    branch
dbh     1.000000   0.905175  0.965413  0.899301  0.934982  0.861660
wood    0.905175   1.000000  0.971700  0.988752  0.967082  0.821731
bark    0.965413   0.971700  1.000000  0.961038  0.971341  0.943383
root    0.899301   0.988752  0.961038  1.000000  0.936935  0.679760
rootsk  0.934982   0.967082  0.971341  0.936935  1.000000  0.621550
branch  0.861660   0.821731  0.943383  0.679760  0.621550  1.000000

Data Cleaning
We need to clean our data. Our data might contain missing values, NaN values, outliers, etc. We may need to remove or replace that data. Otherwise, our data might not make any sense.
We can find null values using the following method.

print(df.isnull().any())

dbh        False
wood        True
bark        True
root        True
rootsk      True
branch      True
species    False
fac26       True
dtype: bool

We can remove the rows that contain null values with the method shown below.

newdf = df.dropna()

print(newdf)

     dbh   wood   bark  root  rootsk   branch        species  fac26
123   27  550.0  105.0  44.0     9.0    59.0   B. myrtifolia     z
124   26  414.0   78.0  38.0    13.0    44.0   B. myrtifolia     z
125    9   42.0    8.0   5.0     1.3     7.0   B. myrtifolia     z
126   12   85.0   13.0  17.0     2.2    16.0   B. myrtifolia     z

print(newdf.shape)

(4, 8)
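
Dropping every row that has any null value leaves only 4 rows here, which may be too aggressive. The following sketch shows two gentler options: dropping rows only when a chosen column is missing, and replacing gaps instead of removing rows. It uses a stand-in frame with a few columns from the df.head() output earlier, and filling with the column median is just one possible choice.

```python
import numpy as np
import pandas as pd

# Stand-in rows taken from the df.head() output earlier (subset of columns)
df = pd.DataFrame({
    'dbh': [90, 106, 112, 34, 130],
    'wood': [5528.0, 13650.0, 11200.0, 1000.0, np.nan],
    'rootsk': [np.nan, 665.0, 680.0, 40.0, 1030.0],
})

# Drop a row only when 'wood' is missing, ignoring gaps elsewhere
wood_known = df.dropna(subset=['wood'])

# Replace gaps with each column's median instead of dropping rows
filled = df.fillna(df.median())

print(wood_known.shape)
print(filled)
```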

Pandas .Panel()
A panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s.
The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −
• items − axis 0, each item corresponds to a DataFrame contained inside.
• major_axis − axis 1, it is the index (rows) of each of the DataFrames.
• minor_axis − axis 2, it is the columns of each of the DataFrames.

A Panel can be created using the following constructor −
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

The parameters of the constructor are as follows −
• data – Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame
• items – axis=0
• major_axis – axis=1
• minor_axis – axis=2
• dtype – the data type of each column
• copy – whether to copy the data; the default is False

A Panel can be created in multiple ways, for example −
• From ndarrays
• From dict of DataFrames
• From 3D ndarray

# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np
data = np.random.rand(2,4,5)
p = pd.Panel(data)

print(p)

output:
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4

Note − The dimensions printed above match the shape of the ndarray that was passed in, (2, 4, 5).

From dict of DataFrame Objects

# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)

print(p)

output:
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
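
Panel was deprecated in pandas 0.20 and removed in 0.25, so pd.Panel is not available on current releases. One possible substitute, sketched here as an assumption and not as the article's own method, is to stack the same dict of DataFrames with pd.concat, which yields a single DataFrame whose outer row-index level holds the item names.

```python
import numpy as np
import pandas as pd

data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}

# Stack the frames; the dict keys become the outer level of the row index
stacked = pd.concat(data)

# Rough equivalent of p['Item1']: select one item by its key
item1 = stacked.loc['Item1']

# Rough equivalent of p.minor_xs(1): column 1 of every item, side by side
col1 = stacked[1].unstack(level=0)

print(item1)
print(col1)
```

As with the Panel, Item2 has no column 2, so that column is NaN for every Item2 row in the stacked frame.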

Selecting the Data from Panel
Select the data from the panel using −
• Items
• Major_axis
• Minor_axis

Using Items

# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)

print(p['Item1'])

output:
          0         1         2
0 -0.006795 -1.156193 -0.524367
1  0.025610  1.533741  0.331956
2  1.067671  1.309666  1.304710
3  0.615196  1.348469 -0.410289

We have two items, and we retrieved Item1. The result is a DataFrame with 4 rows and 3 columns, which are the major_axis and minor_axis dimensions.

Using major_axis
Data can be accessed using the method panel.major_xs(index).

print(p.major_xs(1))

      Item1     Item2
0  0.027133 -1.078773
1  0.115686 -0.253315
2 -0.473201       NaN

Using minor_axis
Data can be accessed using the method panel.minor_xs(index).

import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)

print(p.minor_xs(1))

      Item1     Item2
0  0.092727 -1.633860
1  0.333863 -0.568101
2  0.388890 -0.338230
3 -0.618997 -1.01808