NumPy for Data Science Beginners in Python

The NumPy library for Python is an essential tool for data scientists who work with numerical data, especially multi-dimensional arrays that need memory-efficient storage and fast indexing.

However, it also helps to know about the other packages that are useful for solving data science problems. So, let’s see which packages are available in the Python programming language for data-related work.

Python Packages and Why They Are Needed

Python packages were introduced to avoid adding everything to the Python core, which would create a mess.

A package is a collection of script files that together solve a specific problem.

Developers have published thousands of packages. Here, we will only talk about the packages used for data science. These are as follows:

  1. NumPy
  2. Matplotlib
  3. pandas
  4. scikit-learn

In this article, we will be talking about NumPy. We will also cover the scikit-learn package later on.

To use the above packages, we need to install them on our system. To install them, we need pip, the package manager for Python. Pip is generally installed along with Python.

Installing NumPy

Once pip is installed, we need to run the command below to install a package:

pip3 install <package-name>

E.g., pip3 install numpy

The above command installs the NumPy package on the system.

Once the packages are installed, we need to import them before we can use them in our Python scripts.

The command to import packages in our script is simple and is as follows: 

import numpy

Now, to call a NumPy function, we can write the following code:

numpy.array([1,2])

Another way is: 

import numpy as np
np.array([1,2])

If we wish to use the array() function directly, we can import it by name:

from numpy import array
array([1,2])

Now that we have clarity about the packages and how to use them in our code, let’s explore the NumPy package’s functionality.

NumPy Array

The NumPy package provides an alternative to lists: the NumPy array. The NumPy array has more powerful features than a Python list.

Suppose we have two NumPy arrays and multiply them. The multiplication is applied element by element:

import numpy as np
a1 = np.array([1,2,3])
a2 = np.array([4,5,6])
print(a1 * a2)
# Output:
[ 4 10 18]

But if you try to do the same with plain Python lists:

a1 = [1,2,3]
a2 = [4,5,6]
print(a1 * a2)
# Output:
TypeError: can't multiply sequence by non-int of type 'list'
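
To make the contrast concrete, here is a minimal sketch of what element-wise multiplication looks like with plain lists compared with NumPy arrays. The variable names are illustrative only:

```python
import numpy as np

list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Plain lists need an explicit loop (here a comprehension over zip).
list_product = [x * y for x, y in zip(list1, list2)]
print(list_product)  # [4, 10, 18]

# Multiplying a list by an int repeats the list; it does not scale the values.
repeated = list1 * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]

# NumPy arrays apply the operator to each pair of elements.
array_product = np.array(list1) * np.array(list2)
print(array_product)  # [ 4 10 18]
```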

Now you can see the power of the NumPy package. Let’s explore what else the library offers.

Creating an Array

We have now seen how to create a NumPy array using the “array()” function. Apart from array(), NumPy provides the following functions to create arrays:

  1. np.ones(shape): Create an array of 1s
  2. np.zeros(shape): Create an array of 0s
  3. np.random.random(size): Create an array of random numbers from 0 (inclusive) to 1 (exclusive)
  4. np.arange(start, stop, step): Create an array with increments of a fixed step size; stop itself is excluded
  5. np.linspace(start, stop, num): Create an array of num evenly spaced values between start and stop
  6. np.full(shape, fill_value): Create a constant array filled with any number n
  7. np.tile(A, reps): Create a new array by repeating an existing array a given number of times
  8. np.eye(N): Create an identity matrix with N rows and N columns
  9. np.random.randint(low, high, size): Create a random array of integers within a particular range

The example below demonstrates some of the functions discussed above:

import numpy as np
print(np.ones([2,3]))
# Output
[[1. 1. 1.]
 [1. 1. 1.]]
# np.int was removed from recent NumPy releases; the built-in int works everywhere
print(np.zeros([2,3], dtype=int))
# Output
[[0 0 0]
 [0 0 0]]
print(np.random.random([2,3]))
# Output (the values differ on every run)
[[0.24902248 0.80429347 0.58541468]
 [0.97517227 0.3406624  0.30174668]]
print(np.arange(0,10,2))
# Output
[0 2 4 6 8]
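
The remaining functions from the list can be sketched in the same way. The argument values below are arbitrary choices for illustration, and the random integers will differ on every run:

```python
import numpy as np

# Five evenly spaced values from 0 to 1, both endpoints included.
evenly_spaced = np.linspace(0, 1, 5)
print(evenly_spaced)  # [0.   0.25 0.5  0.75 1.  ]

# A 2x2 array in which every element is 7.
constant = np.full((2, 2), 7)
print(constant)

# Repeat the array [1, 2] three times.
tiled = np.tile(np.array([1, 2]), 3)
print(tiled)  # [1 2 1 2 1 2]

# A 3x3 identity matrix: ones on the diagonal, zeros elsewhere.
identity = np.eye(3)
print(identity)

# A 2x3 array of random integers from 0 to 9; the upper bound 10 is excluded.
random_ints = np.random.randint(0, 10, size=(2, 3))
print(random_ints)
```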

So far, so good. We have seen several ways to create arrays. NumPy arrays can have any number of dimensions. The example below creates a 2D array directly from nested lists:

import numpy as np
a1 = np.array([[1,2,4], [2,3,5]])

Structure of a NumPy Array

We can see the number of rows and columns in our array using the “shape” attribute. Refer to the code shown below:

a1.shape
Output:
(2, 3) # where 2 represent rows and 3 represent columns

Like shape, other attributes help us inspect a NumPy array. Refer to the example shown below:

print(a1.dtype) # Data type of the array's elements
print(a1.ndim) # Number of dimensions (axes) of the array
print(a1.itemsize) # Memory used by each element of the array, in bytes

Subsetting a NumPy Array

We can access individual elements of a NumPy array by index. Refer to the example shown below:

import numpy as np 
a1 = np.array([1,2,3]) 
print(a1[0]) 
#Output:
1

NOTE: Subsetting a 1D NumPy array works like indexing a Python list.

We can do subsetting of the array in n-dimensional spaces as well. Consider an example shown below: 

import numpy as np
a1 = np.array([[1,2,4], [2,3,5]])
print(a1[0][2]) # where 0 is the row index and 2 is the column index
#Output:
4

Another syntax that we can use is:

print(a1[0, 2])

The above code will also result in the same output.

If we want to check the first row of the 2D array, we can do it as:

print(a1[0])
Output:
[1 2 4]
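
The same comma syntax also accepts slices, which is how whole columns or sub-blocks can be selected. The following is a small sketch using the array from above:

```python
import numpy as np

a1 = np.array([[1, 2, 4], [2, 3, 5]])

# All rows, column 1: the second column.
second_column = a1[:, 1]
print(second_column)  # [2 3]

# Row 0, columns 1 onward.
row_tail = a1[0, 1:]
print(row_tail)  # [2 4]

# All rows, first two columns: a 2x2 block.
block = a1[:, :2]
print(block)

# Negative indices count from the end, as with lists.
last_element = a1[-1, -1]
print(last_element)  # 5
```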

Operations on NumPy Arrays

NumPy provides many operations that we can perform on arrays. A few of them are listed below:

  1. Reshaping
  2. Transposing
  3. Stacking
  4. Mathematical Operations

Let’s discuss them in detail with the help of an example.

Reshaping

We can reshape an array to any shape that holds the same number of elements. Consider an example where we have a one-dimensional array of size 10 and want to reshape it into a two-dimensional array:

import numpy as np 
arr = np.arange(10) 
print(arr.reshape(2,-1))

You might have noticed that we pass 2 and -1 as arguments to the reshape function. Here, -1 tells NumPy to work out that dimension from the number of elements: 10 elements in 2 rows gives 5 columns.
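
A short sketch of how -1 is resolved, including a case where no valid shape exists. The `reshape_failed` flag is only there to make the outcome visible:

```python
import numpy as np

arr = np.arange(10)

# -1 is resolved to 5, because 10 elements / 2 rows = 5 columns.
two_rows = arr.reshape(2, -1)
print(two_rows.shape)  # (2, 5)

# -1 can stand in for the row count as well.
two_cols = arr.reshape(-1, 2)
print(two_cols.shape)  # (5, 2)

# 10 elements cannot be split evenly into 3 rows, so NumPy raises ValueError.
try:
    arr.reshape(3, -1)
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("Cannot reshape:", error)
```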

Transposing 

Transposing means swapping the rows of an array with its columns. Refer to the code shown below to transpose an array:

np.arange(10).reshape(2,-1).T

In the example above, the code snippet creates a 2D array and transposes it.
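
To see what the transpose does to the shape and to individual elements, consider this minimal sketch:

```python
import numpy as np

matrix = np.arange(10).reshape(2, -1)
transposed = matrix.T

print(matrix.shape)      # (2, 5)
print(transposed.shape)  # (5, 2)

# The element at row 1, column 3 ends up at row 3, column 1.
print(matrix[1, 3], transposed[3, 1])  # 8 8
```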

Stacking

Stacking means combining two or more arrays either vertically or horizontally. NumPy provides the “hstack()” and “vstack()” functions to stack arrays horizontally and vertically, respectively. Refer to the example shown below:

import numpy as np
arr1 = np.arange(10).reshape(5,2)
arr2 = np.arange(6).reshape(3,2) 
arr3 = np.arange(20).reshape(5,4) 
print(np.vstack((arr1, arr2))) 
print(np.hstack((arr1, arr3)))

NOTE: To stack arrays vertically, the number of columns should be the same, and to stack arrays horizontally, the number of rows should be the same.
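
The shape rule in the note can be checked directly. The sketch below reuses the three arrays from the example and also shows what happens when the rule is violated; the `mismatch_rejected` flag is just for illustration:

```python
import numpy as np

arr1 = np.arange(10).reshape(5, 2)
arr2 = np.arange(6).reshape(3, 2)
arr3 = np.arange(20).reshape(5, 4)

# Same number of columns (2): rows are added, 5 + 3 = 8.
vertical = np.vstack((arr1, arr2))
print(vertical.shape)  # (8, 2)

# Same number of rows (5): columns are added, 2 + 4 = 6.
horizontal = np.hstack((arr1, arr3))
print(horizontal.shape)  # (5, 6)

# arr1 has 5 rows and arr2 has 3, so horizontal stacking is rejected.
try:
    np.hstack((arr1, arr2))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```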

Mathematical Operations

We can perform operations like “sin”, “cos”, “log”, addition, and multiplication on a NumPy array. Each operation is applied to every element.

import numpy as np
a1 = np.array([[1,2,4], [2,3,5]]) 
a2 = a1 + a1
print(np.sin(a1)) 
print(np.cos(a1)) 
print(np.log(a1))
print(a2)

We have seen a few built-in functions in the example shown above. We can also apply custom functions to an array. Refer to the code shown below:

import numpy as np
a1 = np.array([[1,2,4], [2,3,5]])
func = np.vectorize(lambda x: x**2) 
print(func(a1))

We use the “np.vectorize” function to apply a custom function element by element. We could have used a list comprehension instead, which returns the same values as a Python list of row arrays:

print([x**2 for x in a1])

But it is NOT RECOMMENDED, as the looping then happens in Python and we lose NumPy's computational speed.
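
For a simple function such as squaring, NumPy's own operators already work element by element, so no wrapper is needed at all. The NumPy documentation also notes that np.vectorize is provided mainly for convenience rather than performance. A minimal comparison, with illustrative variable names:

```python
import numpy as np

a1 = np.array([[1, 2, 4], [2, 3, 5]])

# Custom function wrapped with np.vectorize, as in the example above.
square = np.vectorize(lambda x: x**2)
via_vectorize = square(a1)

# The same computation using NumPy's built-in element-wise operations.
via_operator = a1 ** 2
via_ufunc = np.square(a1)

print(via_operator)
print(np.array_equal(via_vectorize, via_operator))  # True
print(np.array_equal(via_ufunc, via_operator))      # True
```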

Basic Statistics using NumPy

We can efficiently perform basic statistics operations like mean, median, standard deviation, and more.

As before, the first step is to import the NumPy package:

import numpy as np
person = np.array([20,50,60,40,50]) # weights of five people
# To find the mean
person_mean = np.mean(person)
# To find the median
person_median = np.median(person)
# To find the standard deviation
person_sd = np.std(person)
# To find the correlation between weight and height
person = np.array([[20,2.6], [50,6], [60, 5.5], [40, 4], [50, 5]]) # each row holds weight and height respectively
print(np.corrcoef(person[:,0], person[:,1]))
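
To show what these calls return for the sample data, here is a sketch that prints each statistic. The weights and heights are split into separate arrays for readability, and the `ddof=1` variant is included as an aside because np.std computes the population standard deviation by default:

```python
import numpy as np

weights = np.array([20, 50, 60, 40, 50])
heights = np.array([2.6, 6, 5.5, 4, 5])

print(np.mean(weights))    # 44.0
print(np.median(weights))  # 50.0

population_sd = np.std(weights)      # divides by n (the default, ddof=0)
sample_sd = np.std(weights, ddof=1)  # divides by n - 1
print(round(float(population_sd), 2), round(float(sample_sd), 2))  # 13.56 15.17

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two inputs (roughly 0.92 for this data).
corr_matrix = np.corrcoef(weights, heights)
correlation = corr_matrix[0, 1]
print(corr_matrix.shape)  # (2, 2)
print(correlation)
```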

Many more operations like “sum()” and “sort()” are available in NumPy.

import numpy as np
person = np.array([20,50,60,40,50]) # weights of five people
person_sum = np.sum(person)
person_sort = np.sort(person)

Logical Operators

NumPy offers logical operators like logical_or and logical_and. Refer to the example shown below:

import numpy as np
science_score = np.array([18.0, 20.0, 11.75, 19.50])
math_score = np.array([14.6, 14.0, 18.25, 19.0])
# Science scores greater than 13 or less than 10
print(np.logical_or(science_score > 13, science_score < 10))
# Output:
[ True  True False  True]
# Math scores between 13 and 15
print(np.logical_and(math_score > 13, math_score < 15))
# Output:
[ True  True False False]
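
The boolean arrays produced by these operators are most useful as masks for selecting values. A short sketch, assuming the same math scores as above:

```python
import numpy as np

math_score = np.array([14.6, 14.0, 18.25, 19.0])

# Build the mask, then use it to keep only the matching scores.
mask = np.logical_and(math_score > 13, math_score < 15)
selected = math_score[mask]
print(selected)  # [14.6 14. ]

# The & operator gives the same mask; the parentheses are required.
same_mask = (math_score > 13) & (math_score < 15)
print(np.array_equal(mask, same_mask))  # True

# Summing a boolean mask counts the True values.
match_count = int(mask.sum())
print(match_count)  # 2
```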

Looping over a NumPy Array

We can loop over 1D or 2D arrays. Looping over a 1D NumPy array is as simple as looping over a list. A plain for loop over a 2D array yields one row at a time, so to visit every element individually we can use the “nditer” function provided by NumPy. Consider the example shown below:

# Looping over 1D array
import numpy as np
population = np.array(['500','600','550','700'])
for x in population:
      print(str(x) + ' billion')
# output
500 billion
600 billion
550 billion
700 billion
# Looping over 2D array, one element at a time
population = np.array([['India','500'], ['China','600'], ['Australia','550'], ['France','700']])
for x in np.nditer(population):
      print(x)
# output
India
500
China
600
Australia
550
France
700
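
If the goal is to process a 2D array row by row, or to know the position of each element, two alternatives are a plain for loop with unpacking and np.ndenumerate. The sketch below uses a shortened version of the population array; the list names are illustrative:

```python
import numpy as np

population = np.array([["India", "500"], ["China", "600"]])

# A plain for loop yields one row at a time, which can be unpacked.
row_lines = []
for country, count in population:
    row_lines.append(f"{country} {count}")
print(row_lines)  # ['India 500', 'China 600']

# np.ndenumerate yields the (row, column) index together with each element.
indexed = []
for (row, col), value in np.ndenumerate(population):
    indexed.append((row, col, str(value)))
print(indexed)
```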

There is a lot more that NumPy can do. The demonstration above is meant to get you started with NumPy operations. Check out the official documentation to learn more.

Limitations of NumPy

NumPy is a powerful Python library that performs well on numerical data, particularly when the number of rows is 50K or less. However, when the data is tabular, not purely numerical, and runs to 500K or more rows, NumPy is not the best option.
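
One way to see why mixed data is awkward in NumPy: an ordinary array holds a single data type, so mixing text and numbers turns everything into strings. A small sketch with made-up values:

```python
import numpy as np

# Text and integers in the same array: NumPy stores every element as a string.
mixed = np.array([["India", 500], ["China", 600]])

dtype_kind = mixed.dtype.kind
print(dtype_kind)   # U (Unicode string)
print(mixed[0, 1])  # 500, but as text rather than a number

# The numeric column must be converted back before doing arithmetic on it.
total = mixed[:, 1].astype(int).sum()
print(total)  # 1100
```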

Industry applications normally involve massive amounts of data that are not necessarily numeric, and in such cases, we need to opt for pandas for data science.

Tavish lives in Hyderabad, India, and works as a result-oriented data scientist specializing in improving the major key performance business indicators. 

He understands how data can be used for business excellence and is a focused learner who enjoys sharing knowledge.
