Using Python to Simplify Data Operations in Data Science

In data science, we primarily use Python as the programming language to perform operations on the available datasets. This article discusses concepts and details for using Python to simplify data operations in data science.

Pros and Cons of Python for Data Operations

Even though the pros outweigh the cons, it is crucial to look at both aspects. So, let’s have a look at the advantages and limitations of using the Python programming language for data operations in data science:

Advantages of Using Python for Data Operations

  1. Ease of Learning and Readability:
    • Python’s syntax is clean and readable, making it an excellent choice for beginners and teams looking for a language that can help minimize the potential for misinterpretation.
  2. Extensive Libraries:
    • Python boasts a rich set of libraries like Pandas, NumPy, and SciPy specifically tailored for data analysis and manipulation. This vast ecosystem means many data-related tasks can easily be performed with pre-existing tools.
  3. Community and Documentation:
    • The Python community is widespread and active. There are ample resources, tutorials, and documentation available, making problem-solving relatively more accessible than in other programming languages.
  4. Integration Capabilities:
    • Python can be easily integrated with other languages and technologies (like C/C++), and also offers great support for various data formats and databases.
  5. Machine Learning and Data Science Support:
    • Python is a preferred language in data science and machine learning owing to libraries like TensorFlow and Scikit-learn. Therefore, it makes it easy to integrate machine learning models into data pipelines.
  6. Open-Source:
    • Being open-source, Python allows for customization and use without the burden of licensing costs.
  7. Versatility:
    • From web development (using frameworks like Django or Flask) to scientific computing, Python can be used in many domains, making skills transferable across different applications in various industries.

Drawbacks of Using Python for Data Operations

  1. Speed Limitations:
    • While Python is incredibly versatile, it is often slower than compiled languages like C++ or Java. This might be a constraint when dealing with real-time data processing or large datasets.
  2. Global Interpreter Lock (GIL):
    • Python’s GIL can be a bottleneck in CPU-bound and multithreaded code, limiting the execution speed and making parallel computation less efficient compared to languages that support true multi-threading.
  3. Memory Consumption:
    • Python can consume more memory than languages like C, which might be a limitation when working with large datasets or in systems with memory constraints.
  4. Limited in Mobile Computing:
    • Python is not as popular in mobile computing due to certain limitations, which might be a downside for developers looking to integrate data operations with mobile apps.
  5. Runtime Errors:
    • Python can sometimes be prone to runtime errors due to its dynamic typing. Proper testing and validation are crucial to ensure stable and reliable data operations.
  6. Optimization Challenges:
    • While the rich set of libraries is a major strength, achieving optimal performance often requires a deep understanding of these libraries and how they handle data under the hood.
  7. Integration Issues with Enterprise Systems:
    • Though it offers good integration capabilities, Python may face challenges in some enterprise settings, where technologies might be predominantly based on .NET or Java.

Random Generators using NumPy

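With NumPy, we can generate random numbers using the “rand()” function from the “np.random” module. Called without arguments, it returns a single float drawn uniformly from the range [0, 1). For example, the code shown below generates a random number: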

import numpy as np

# Prints a different random float in [0, 1) on each run
print(np.random.rand())

But what if we want to generate a random number within a specific range? Consider the scenario of rolling a die. When we roll a die, we get one of the numbers from 1 to 6. The code below demonstrates this scenario:

import numpy as np

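np.random.seed(123)  # fix the seed so the result is reproducible
dice = np.random.randint(1, 7)  # the upper bound 7 is exclusive, so values are 1 to 6
print(dice)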

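We have used “randint(1, 7)” above to constrain the output to the range 1 to 6; the upper bound 7 is exclusive. The “seed” function does not affect the range. It initializes the random number generator so that the code produces the same sequence of numbers on every run. We can pass an integer of our choice to the seed function, and the same seed always gives the same sequence.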

Python Functions for Data Operations

We have discussed functions in Object-oriented programming in Python. Now it’s time to go further and explore more about functions in Python programming for data operations. Here, I will be talking about the following:

  1. The scope of variables in functions
  2. Default and flexible arguments in functions
  3. Passing key-value pair to function
  4. Lambda functions

Let’s start with the scope of variables inside and outside a function. There are two levels of variable scope. These are:

  1. Global scope
  2. Local scope

Python provides a “global” keyword that lets a function assign to a variable in the global scope. Let’s see this in action. Consider we have a variable defined outside of the function:

number = 1

def fun1():
    number = 10  # creates a new local variable; the global number is untouched

def fun2():
    global number
    number = 30  # with the global keyword, the assignment updates the global variable

print('Number value in global scope: ' + str(number))  # 1
fun1()
print('Number value in global scope after executing fun1: ' + str(number))  # 1
fun2()
print('Number value in global scope after executing fun2: ' + str(number))  # 30

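Hmmm! Strange, how come the number value changes to 30 after executing “fun2”? This happens because, with the “global” keyword, the assignment inside “fun2” does not create a new local variable. Instead, it rebinds the existing global variable to the value 30. In “fun1”, there is no such declaration, so the assignment creates a local variable that disappears when the function returns.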
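The list of topics above also mentions default and flexible arguments, key-value pairs, and lambda functions. The following is a minimal sketch of those features, not a complete treatment; the function names “scale”, “total”, “describe”, and “by_length” are hypothetical and chosen only for illustration.

```python
# Hypothetical helper names, used for illustration only.

def scale(value, factor=2):
    """Default argument: factor is 2 unless the caller overrides it."""
    return value * factor

def total(*numbers):
    """Flexible positional arguments: numbers arrives as a tuple."""
    return sum(numbers)

def describe(**fields):
    """Key-value pairs: fields arrives as a dict of keyword arguments."""
    return ', '.join(f'{key}={value}' for key, value in fields.items())

# Lambda: a small anonymous function, used here as a sort key.
by_length = lambda text: len(text)

print(scale(5))                       # 10
print(scale(5, factor=3))             # 15
print(total(1, 2, 3))                 # 6
print(describe(name='Jack', age=30))  # name=Jack, age=30
print(sorted(['English', 'Maths', 'Computer science'], key=by_length))
# ['Maths', 'English', 'Computer science']
```

Default values keep common calls short, while “*numbers” and “**fields” let one function accept a varying number of inputs.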

Error Handling in Python

Sometimes, we don’t want code execution to break at runtime. So, it’s essential to catch any errors if they occur during code execution. In Python, we handle exceptions using the “try-except” block. Let’s explore it with the help of the example shown below:

try:
    word = 'test' * '23'  # raises TypeError: a string cannot be multiplied by a string
    print(word)
except TypeError:
    print('Error Occurred.')

As we know, we cannot multiply a string by a string, so Python raises a TypeError, the flow goes to the except block, and ‘Error Occurred.’ is printed.

We can also raise an exception ourselves using the “raise” keyword, and then handle it in an except block:

try:
    raise Exception('Exception handled using raise keyword')
except Exception as e:
    print(e)

The code above uses the “raise” keyword to trigger an exception, and the except block handles it by printing its message.

NOTE: Catching general exceptions is usually not a good idea. We should catch the specific type of exception we expect, e.g., “ValueError”.
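To illustrate the note above, here is a minimal sketch of catching one specific exception type. The helper name “to_int” is hypothetical; it assumes we want None back when the text is not a valid integer.

```python
def to_int(text):
    """Convert text to an int, returning None when the text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        # Only the expected failure is handled; any other error still propagates.
        return None

print(to_int('42'))   # 42
print(to_int('4x2'))  # None
```

Because only “ValueError” is caught, an unrelated problem, such as passing None instead of a string, still surfaces as an error instead of being silently hidden.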

Iterators in Python

We can iterate over the data structures using the “for” loop as shown below:

names = ['Jack', 'Alley', 'Wann', 'Haley']

for name in names:
    print(name)

The above is the more common way to iterate over the list. But we can also iterate over the list using iterators as shown below:

names = ['Jack', 'Alley', 'Wann', 'Haley'] 

person = iter(names)

# Iterating over the list when needed 
print(next(person)) 
print(next(person))

#Output 
Jack 
Alley

The advantage of using Python iterators is that we can fetch items from the list on demand, one at a time, instead of processing everything at once. This plays a major role in data science when dealing with huge datasets.
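As a small sketch of on-demand iteration, the example below shows what happens when an iterator runs out and how to take a few items at a time. The use of “itertools.islice” and the variable names are additions for illustration, not part of the original example.

```python
from itertools import islice

names = ['Jack', 'Alley']
person = iter(names)

print(next(person))  # Jack
print(next(person))  # Alley

# A second argument to next() is returned instead of raising StopIteration.
print(next(person, 'no more names'))  # no more names

try:
    next(person)
except StopIteration:
    print('Iterator is exhausted')

# islice takes items on demand without building the whole sequence first.
numbers = iter(range(1_000_000))
first_three = list(islice(numbers, 3))
next_two = list(islice(numbers, 2))
print(first_three)  # [0, 1, 2]
print(next_two)     # [3, 4]
```

The iterator remembers its position, so the second “islice” call continues where the first one stopped.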

zip function in Python

The “zip()” function takes one or more “iterables” and returns an iterator of tuples, where each tuple contains the elements found at the same position in every input. Objects like sets, strings, lists, tuples, dictionaries, etc., are called “iterables”. In simpler terms, an iterable is anything that you can loop over. To know how the zip function works, look at the example shown below:

names = ['Rock', 'Bob', 'Rony']
classes = ['10', '1', 'A1']
subjects = ['Maths', 'English', 'Computer science']

zipped_data = zip(names, classes, subjects) 
print(zipped_data)

zipped_to_list = list(zipped_data) 
print('Zip to list: ' + str(zipped_to_list))

# unpack zip object (it can be consumed only once, so we create it again):
zipped_data = zip(names, classes, subjects)

print('Unpacking zip object')
for value1, value2, value3 in zipped_data:
  print(value1, value2, value3)

#output:
<zip object at 0x10339cd88>
Zip to list: [('Rock', '10', 'Maths'), ('Bob', '1', 'English'), ('Rony', 'A1', 'Computer science')]

Unpacking zip object
Rock 10 Maths
Bob 1 English
Rony A1 Computer science

We have created a zip object from three lists in the above code. Then, we converted a zipped object to a list. Also, we have written code to unpack the zip object using a “for” loop. There is another way to unpack the zip object:


names = ['Rock', 'Bob', 'Rony']
classes = ['10', '1', 'A1']
subjects = ['Maths', 'English', 'Computer science']

zipped_data = zip(names, classes, subjects)
print(*zipped_data)

zipped_data = zip(names, classes, subjects)
# zip(*...) regroups the tuples back into the three original sequences
result1, result2, result3 = zip(*zipped_data)
print(result1, result2, result3)

# output
('Rock', '10', 'Maths') ('Bob', '1', 'English') ('Rony', 'A1', 'Computer science') 
('Rock', 'Bob', 'Rony') ('10', '1', 'A1') ('Maths', 'English', 'Computer science')
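Two further behaviours of zip come up often in data work: pairing two lists into a dictionary, and what happens when the inputs differ in length. The following is a minimal sketch; the “scores” list is a hypothetical addition for illustration.

```python
names = ['Rock', 'Bob', 'Rony']
subjects = ['Maths', 'English', 'Computer science']

# Pair two lists into a dictionary of name -> subject.
subject_by_name = dict(zip(names, subjects))
print(subject_by_name)
# {'Rock': 'Maths', 'Bob': 'English', 'Rony': 'Computer science'}

# zip stops at the shortest input, so the unmatched name is dropped silently.
scores = [90, 85]  # hypothetical data, one value short on purpose
paired = list(zip(names, scores))
print(paired)  # [('Rock', 90), ('Bob', 85)]
```

If dropping unmatched items is not acceptable, it is worth checking that the inputs have the same length before zipping them.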

List Comprehension

List Comprehension is an effective way of creating a list while we deal with data. It also reduces the number of lines of code that we would have written with for loop. Let’s see the syntax to create our very first list comprehension:

squares = [i * i for i in range(10)]
print(squares)

# output
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

The syntax seems to be confusing at first sight. But as you practice, you will become a pro at writing list comprehensions. The above code is equivalent to the loop shown below, which generates the same output without using a list comprehension:

square = []
for value in range(10):
  square.append(value * value)
print(square)

#output
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

We can even generate a nested list comprehension. To demonstrate, we will be creating a 3 x 3 matrix using a list comprehension:

matrix = [[col for col in range(0, 3)] for row in range(0, 3)]
print(matrix)

# output
[[0, 1, 2], [0, 1, 2], [0, 1, 2]]

Here, the inner comprehension builds one row of three columns, and the outer comprehension repeats it once for each of the three rows.

We can also create a conditional list comprehension.


subjects = ['Maths', 'English', 'Computer science']
math_subject = [subject for subject in subjects if subject == 'Maths'] 
print(math_subject)

#output 
['Maths']

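We can also use an if-else expression in a list comprehension. In this form, the condition goes before the “for” and decides which value is produced for each element, rather than filtering elements out: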

subjects = ['Maths', 'English', 'Computer science']
math_subject = [subject if subject == 'Maths' else subject + '!' for subject in subjects ] 
print(math_subject)

#output
['Maths', 'English!', 'Computer science!']
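The nested and conditional forms above can also be combined. As a minimal sketch, the example below flattens a matrix and then filters and transforms its values in one pass; the sample matrix is made up for illustration.

```python
matrix = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Flatten: the outer loop (rows) is written first, then the inner loop (values).
flat = [value for row in matrix for value in row]
print(flat)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Filter and transform together: squares of the even values only.
even_squares = [value * value for value in flat if value % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]
```

Note that the “for” clauses in a flattening comprehension read in the same order as the equivalent nested for loops.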

Dictionary Comprehension

Dictionary Comprehension in Python is a concise and expressive way to create dictionaries using a single line of code.

It is particularly useful in data operations for creating new dictionaries by applying an expression to each element of an iterable. Dictionary comprehension can be a great tool for performing operations like filtering data, transforming keys and/or values, and constructing new dictionaries in a very readable and Pythonic way.

The syntax of a dictionary comprehension is similar to that of a list comprehension. The differences are that we use “{}” instead of “[]” and write a “key: value” pair as the expression.

Consider the example shown below to generate dictionary comprehension:

subjects = ['Maths', 'English', 'Computer science']
subject_dictionary = {member: len(member) for member in subjects} 
print(subject_dictionary)

# output
{'Maths': 5, 'English': 7, 'Computer science': 16}

Above, you can see that we have created keys as subject names and values as the length of characters in the subject name.
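The filtering and key transformation mentioned earlier in this section can be sketched as follows. This is an illustrative example only; the names “long_subjects” and “upper_subjects” and the length threshold are assumptions.

```python
subjects = ['Maths', 'English', 'Computer science']
subject_dictionary = {member: len(member) for member in subjects}

# Filter: keep only subjects whose name is longer than five characters.
long_subjects = {
    name: length
    for name, length in subject_dictionary.items()
    if length > 5
}
print(long_subjects)  # {'English': 7, 'Computer science': 16}

# Transform keys: upper-case every subject name, keeping the values.
upper_subjects = {
    name.upper(): length
    for name, length in subject_dictionary.items()
}
print(upper_subjects)  # {'MATHS': 5, 'ENGLISH': 7, 'COMPUTER SCIENCE': 16}
```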

Generators

List comprehensions and generator expressions look very similar in their syntax, except for using parentheses “()” in generator expressions and brackets “[]” in list comprehensions.

Difference between generators and list comprehension

A generator expression produces the same values as the corresponding list comprehension. The difference is that a list comprehension builds and returns the whole list at once, whereas a generator expression returns a generator object that produces its values one at a time as we iterate over it, and it can be iterated over only once.

Let’s see how to create a generator expression:

subjects = ['Maths', 'English', 'Computer science'] 

subject_generators = (member for member in subjects)

print(subject_generators) 
print(next(subject_generators))

# output
<generator object <genexpr> at 0x10b754a40> 
Maths

In the example above, we use the “next” function to get the value from the generator object.
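To make the difference from a list comprehension more concrete, here is a small sketch comparing the two. The exact sizes reported by “sys.getsizeof” depend on the Python version, so only the comparison is shown, not the numbers.

```python
import sys

squares_list = [n * n for n in range(1000)]
squares_gen = (n * n for n in range(1000))

# The list stores a reference to every value; the generator stores only its state.
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # True

# A generator can be consumed only once.
first_total = sum(squares_gen)
second_total = sum(squares_gen)
print(first_total)   # 332833500
print(second_total)  # 0, because the generator is already exhausted
```

This is why generator expressions suit large datasets: values are produced as they are needed rather than held in memory all at once.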

yield keyword

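A function that contains the “yield” keyword is a generator function. Each “yield” hands one value back to the caller and pauses the function until the next value is requested. Example showing the use of the yield keyword: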

subjects = ['Maths', 'English', 'Computer science']

def toUpperCase(inputList):
    for value in inputList:
        yield value.upper()

for subject in toUpperCase(subjects):
    print(subject)

Difference between “yield” and “return” keywords

To understand the difference between yield and return keywords, consider the example shown below:

subjects = ['Maths', 'English', 'Computer science']

def toUpperCase(inputList):
 for value in inputList:
  yield value.upper()

subject = toUpperCase(subjects)
print(next(subject))
print(next(subject))

#Output
MATHS
ENGLISH

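If we use the “return” keyword in the function instead, the function ends at the first “return” and hands back a single value, so we can’t use “next” as we do above. So, to simplify, a function with “yield” returns a generator, which we can iterate over as needed, while a function with “return” runs to completion and returns whatever value it is given, such as a fully built list.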
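The contrast can be sketched side by side. The function names “upper_with_return” and “upper_with_yield” are hypothetical variants of the example above, written only to show the difference.

```python
subjects = ['Maths', 'English', 'Computer science']

def upper_with_return(input_list):
    # return hands back the finished list in one go.
    return [value.upper() for value in input_list]

def upper_with_yield(input_list):
    # yield hands back one value at a time and pauses in between.
    for value in input_list:
        yield value.upper()

print(upper_with_return(subjects))       # ['MATHS', 'ENGLISH', 'COMPUTER SCIENCE']
print(next(upper_with_yield(subjects)))  # MATHS

try:
    next(upper_with_return(subjects))
except TypeError:
    # A list is iterable, but it is not itself an iterator.
    print('next() does not work on a plain list')
```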

‘map’, ‘reduce’, and ‘filter’

Python provides “map,” “reduce,” and “filter” functions to ease day-to-day data operations. The “map” function is called with a function (often a “lambda”) and an iterable such as a list. In Python 3, it returns an iterator that yields the function applied to each item, so we wrap it in “list()” to see all the results.

# Syntax: map(lambda function, list)
print(list(map(lambda x: x * 2, [1, 2, 3, 4, 5])))

# Output
[2, 4, 6, 8, 10]

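The “reduce” function is called with a two-argument function (often a “lambda”) and a list. It applies the function cumulatively to the items, combining the running result with the next item each time, until a single reduced result is returned. The example below keeps the larger value of each pair, so it returns the maximum.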


# Syntax: reduce(lambda function, list)

from functools import reduce
print(reduce(lambda x,y: x if x>y else y, [1,2,3,4,5]))

# Output 
5

NOTE: In Python 3, “reduce” is not a built-in function. Instead, we need to import it from the ‘functools’ module.

The “filter” function is called with a function (often a “lambda”) and a list. It keeps only the elements for which the function returns True and, like “map”, returns an iterator in Python 3.

# Syntax: filter(lambda function, list) 

print(list(filter(lambda x: x % 2 == 0 , [1,2,3,4,5])))

# Output 
[2, 4]
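The three functions can be chained into a small pipeline. The following is a minimal sketch with made-up numbers, followed by an equivalent generator expression for comparison.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Keep the even numbers, double them, then add the results together.
evens = filter(lambda x: x % 2 == 0, numbers)
doubled = map(lambda x: x * 2, evens)
pipeline_total = reduce(lambda x, y: x + y, doubled)
print(pipeline_total)  # 12

# The same result with a generator expression and the built-in sum.
comprehension_total = sum(x * 2 for x in numbers if x % 2 == 0)
print(comprehension_total)  # 12
```

Because “filter” and “map” return iterators, no intermediate list is built between the steps; which style reads better is largely a matter of taste.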

Above, we have explored some concepts to simplify data operations in Python for data science, and I hope that it will be helpful for the curious.

Tavish lives in Hyderabad, India, and works as a result-oriented data scientist specializing in improving the major key performance business indicators. 

He understands how data can be used for business excellence and is a focused learner who enjoys sharing knowledge.
