What is NumPy? The Backbone of Data Science in Python


If you want to do data science, machine learning, or AI in Python, you will use NumPy constantly. Much of the ecosystem, including pandas and scikit-learn, is built on top of it, and frameworks such as TensorFlow interoperate closely with NumPy arrays. Here is what it is and why it matters.

What is NumPy?

NumPy stands for Numerical Python. It is a library that gives Python the ability to work with large arrays of numbers extremely fast.

Python lists can store numbers, but they are slow for mathematical operations. NumPy arrays perform the same element-wise operations much faster, often 10 to 100 times faster depending on the operation and the size of the data.

This matters when you are working with thousands of rows of data or training a machine learning model on millions of examples.

Your first NumPy array

import numpy as np

sales = np.array([45000, 52000, 38000, 61000, 55000])

print(np.mean(sales))   # average
print(np.sum(sales))    # total
print(np.max(sales))    # highest
print(np.std(sales))    # how spread out the numbers are

Six lines of code. You get the average, total, maximum, and standard deviation of the array instantly. No loops. No manual calculation.
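One detail about the last call is worth a quick sketch: np.std computes the population standard deviation by default (it divides by N). If your numbers are a sample from a larger dataset, the ddof=1 argument switches to the sample version (divides by N - 1). The variable names below are only illustrative.

```python
import numpy as np

sales = np.array([45000, 52000, 38000, 61000, 55000])

# Default behaviour: population standard deviation (divides by N)
population_std = np.std(sales)

# Sample standard deviation (divides by N - 1), selected with ddof=1
sample_std = np.std(sales, ddof=1)

print(f"Population std: {population_std:.2f}")
print(f"Sample std:     {sample_std:.2f}")
```

The sample value is always a little larger. For five data points the difference is noticeable; for thousands it becomes negligible.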

Why it is faster than Python lists

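To double every item in a Python list, Python has to loop through the items one by one, for example with a list comprehension (writing data * 2 on a list repeats the list instead of doubling its values). When you multiply a NumPy array by 2, the whole operation runs in optimised, compiled C code under the hood, with no per-item interpreter overhead.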

import numpy as np
import time

data = list(range(1_000_000))
np_data = np.array(data)

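# Python loop (a list comprehension, run item by item in the interpreter)
start = time.perf_counter()
result = [x * 2 for x in data]
print(f"Python loop: {time.perf_counter() - start:.4f} seconds")

# NumPy (one vectorised operation on the whole array)
start = time.perf_counter()
result = np_data * 2
print(f"NumPy: {time.perf_counter() - start:.4f} seconds")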

On my machine, NumPy is 50 times faster for this operation at a million items. The exact figure depends on your hardware and NumPy version, but the gap stays large as the data gets bigger.
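Speed is not the only difference. The * operator also means something different for lists and arrays, which is an easy mistake to make when moving between the two. A minimal sketch, using a made-up prices list:

```python
import numpy as np

prices = [10, 20, 30]  # hypothetical example data

# On a list, * repeats the list; it does not touch the values
repeated = prices * 2
print(repeated)        # [10, 20, 30, 10, 20, 30]

# Doubling the values of a list needs an explicit Python-level loop
doubled_list = [p * 2 for p in prices]
print(doubled_list)    # [20, 40, 60]

# On a NumPy array, * is applied element by element in compiled code
doubled_array = np.array(prices) * 2
print(doubled_array.tolist())  # [20, 40, 60]
```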

The operations that make it powerful

Filter without loops

sales = np.array([45000, 52000, 38000, 61000, 55000])

# Get only months above 50,000
high_months = sales[sales > 50000]
print(high_months)  # [52000 61000 55000]
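
The filter above works because sales > 50000 is itself an array of True/False values (a boolean mask), and indexing with that mask keeps only the True positions. A short sketch of the same idea, with the mask made explicit and a second condition added for illustration:

```python
import numpy as np

sales = np.array([45000, 52000, 38000, 61000, 55000])

# The comparison produces one boolean per element
mask = sales > 50000
print(mask.tolist())         # [False, True, False, True, True]

# Indexing with the mask keeps only the True positions
print(sales[mask].tolist())  # [52000, 61000, 55000]

# Combine conditions with & (and) or | (or); the parentheses are required
mid_range = sales[(sales > 40000) & (sales < 60000)]
print(mid_range.tolist())    # [45000, 52000, 55000]

# True counts as 1, so summing a mask counts the matches
print(int(mask.sum()))       # 3
```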

Apply conditions across all values

avg = np.mean(sales)
performance = np.where(sales >= avg, "Good", "Below average")
print(performance)
# ['Below average' 'Good' 'Below average' 'Good' 'Good']
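
np.where is not limited to text labels. The two outcomes can also be numbers or whole arrays, which makes it handy for conditional calculations. The sketch below assumes a made-up rule (a 5% bonus on months at or above average) purely for illustration:

```python
import numpy as np

sales = np.array([45000, 52000, 38000, 61000, 55000])
avg = np.mean(sales)  # 50200.0

# Hypothetical rule: 5% bonus for months at or above average, otherwise 0
bonus = np.where(sales >= avg, sales * 0.05, 0.0)
print(bonus)
```

Note that both outcomes are computed for every element before np.where picks between them, so it selects values rather than skipping work.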

Work with 2D data like a spreadsheet

# 3 months of data: revenue, orders, avg order value
monthly = np.array([
    [45000, 120, 375],
    [52000, 138, 377],
    [38000, 102, 373],
])

print(monthly[:, 0].sum())   # total revenue across all months
print(monthly[:, 1].mean())  # average orders per month

The [:, 0] means "all rows, column 0". This is how you slice 2D data without writing nested loops.
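
The same indexing works in the other direction, and most aggregate functions accept an axis argument so you can summarise every column or every row in one call. A brief sketch using the same three months of data:

```python
import numpy as np

# Rows are months; columns are revenue, orders, avg order value
monthly = np.array([
    [45000, 120, 375],
    [52000, 138, 377],
    [38000, 102, 373],
])

print(monthly.shape)           # (3, 3): 3 rows, 3 columns

# Row 0, all columns: everything about the first month
first_month = monthly[0, :]
print(first_month.tolist())    # [45000, 120, 375]

# axis=0 collapses the rows, giving one average per column
column_means = monthly.mean(axis=0)
print(column_means.tolist())   # [45000.0, 120.0, 375.0]

# axis=1 collapses the columns, giving one value per row
row_max = monthly.max(axis=1)
print(row_max.tolist())        # [45000, 52000, 38000]
```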

Why every data scientist needs to know this

pandas DataFrames are built largely on NumPy arrays. When you call df['Revenue'].mean() on a numeric column, pandas relies on NumPy-backed data and routines internally. When you train a machine learning model in scikit-learn, it typically converts your data into NumPy arrays before processing.

Understanding NumPy means you understand what is happening under the hood in every data science tool you will ever use.

Install and get started

pip install numpy

Then try this right now, either locally or in Google Colab (where NumPy is already installed):

import numpy as np

data = np.array([10, 20, 30, 40, 50])
print("Mean:", np.mean(data))
print("Doubled:", data * 2)
print("Above 25:", data[data > 25])

Three lines and you have filtered, transformed, and analysed data without writing a single loop.

The one-line summary

NumPy makes Python fast enough for real data science — it is the foundation everything else is built on.


Written by Raaga Priya Madhan — CSE student, Bangalore. I write about Python, data science, and ML. See my code on GitHub and connect on LinkedIn

Source: dev.to
