Python Packages for Science

Ricardo M. Ferraz Leal

rhf@ornl.gov

Standard Data formats

Pandas DataFrames
- Tabular data, i.e. spreadsheet style.
- Panel: 3-dimensional array:
  - labels, major_axis, minor_axis
- Panel4D (Experimental)
  - labels, items, major_axis, minor_axis
- Lots of functionality: statistics, interpolation, masks, etc.
- Intuitive indexing and selecting of data
- New: releases the Global Interpreter Lock (GIL) on some Cython operations!
NumPy Arrays
- NumPy code often releases the GIL while it is calculating.
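
The pandas features listed above (statistics, interpolation, masks, label-based indexing) can be sketched in a few lines. This is only an illustration: the column names, row labels and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: a temperature scan with one missing reading.
df = pd.DataFrame(
    {
        "temperature": [10.0, 20.0, 30.0, 40.0],
        "intensity": [1.0, np.nan, 3.0, 8.0],
    },
    index=["run1", "run2", "run3", "run4"],
)

# Interpolation: fill the missing intensity linearly from its neighbours.
filled = df.interpolate()

# Statistics: column-wise mean of the filled table.
means = filled.mean()

# Mask: boolean selection of the rows with intensity above 2.
bright = filled[filled["intensity"] > 2.0]

# Label-based indexing: one cell addressed by row and column label.
value = filled.loc["run2", "intensity"]
```

Here `filled`, `means`, `bright` and `value` are ordinary pandas objects, so they can be chained into further selections or statistics.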

New Standards

xarray (formerly xray):

Extension to pandas for labeled multi-dimensional arrays.
xarray uses NetCDF4 (and hence HDF5) for persistent storage.
References:
- Notebook here.
- Presentation of Xray at SciPy 2015 here.
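
A minimal sketch of what "labeled multi-dimensional arrays" means in practice, assuming xarray is installed. The dimension names, coordinate values and data are invented for the illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical data: 3 time steps on a 2-point latitude grid.
data = np.arange(6.0).reshape(3, 2)
temps = xr.DataArray(
    data,
    dims=("time", "lat"),
    coords={"time": [0, 1, 2], "lat": [10.0, 20.0]},
    name="temperature",
)

# Select by coordinate label rather than by integer position.
at_lat20 = temps.sel(lat=20.0)

# Reduce over a named dimension instead of an axis number.
time_mean = temps.mean(dim="time")
```

Naming the dimensions means the code says `dim="time"` instead of `axis=0`, which stays readable as the number of dimensions grows.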

Dask:

Parallel computing: threading, multiprocessing, etc.
- dask.array = numpy + threading
- dask.dataframe = pandas + threading
- dask.bag = map, filter, itertools, toolz + multiprocessing
References:
- My test here
- Talk from SciPy here.
- Dask releasing the GIL with Numba here.
- Dask.array: Calculations with arrays bigger than your memory here.
- Article.
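
The "numpy + threading" idea behind dask.array can be illustrated without Dask itself: split an array into blocks, reduce each block in its own thread, then combine the partial results. The sketch below uses only NumPy and the standard library; `blocked_sum` is a hypothetical helper written for this example, not part of Dask's API, and Dask's real scheduler is considerably more general.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def blocked_sum(arr, block_size, workers=4):
    """Sum a 1-D array block by block, one thread task per block."""
    # Slicing creates views, so the blocks do not copy the data.
    blocks = [arr[i:i + block_size] for i in range(0, len(arr), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block is reduced by np.sum in a worker thread.
        partial_sums = list(pool.map(np.sum, blocks))
    # Combine the per-block results into the final answer.
    return sum(partial_sums)


x = np.arange(1_000_000, dtype=np.float64)
total = blocked_sum(x, block_size=100_000)
```

Because only one block needs to be processed per task, the same pattern extends to data that is read from disk block by block rather than held in memory at once.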

xarray + Dask:

xarray provides labeled, multi-dimensional arrays.
Dask provides a system for parallel computing.
Together, they allow for easy analysis of scientific datasets that don’t fit into memory.
References:
- Example here

DistArray provides general multidimensional NumPy-like distributed arrays to Python. It intends to bring the strengths of NumPy to data-parallel high-performance computing. DistArray has a similar API to NumPy.

- Roughly Enthought's counterpart to Dask (?)
- Built on MPI
- Uses the Distributed Array Protocol.
- Notebook here.

Cython

Python with types...

Can invoke C/C++ routines.
Declares static types for function parameters and results, local variables, and class attributes.
In other words, a Python-to-C source code translator that integrates with the CPython interpreter at a low level.
My tests here.

Numba

Numba works by generating optimized machine code using the LLVM compiler infrastructure.

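from numba import jit

# The jit decorator tells Numba to compile this function.
# Numba infers the argument types when the function is first called.
@jit
def sum2d(arr):
    # Body missing in the original; a sum over all elements is assumed here.
    return arr.sum()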

A function can be compiled into a NumPy ufunc using:

from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def f(x, y):
  return x + y

Castra

Castra is an on-disk, partitioned, compressed, column store. Castra provides efficient columnar range queries.

Efficient on-disk
Partitioned
Compressed
Column-store
Tabular data

ODO

Converts data between file and storage formats.

Formats:

AWS
CSV
JSON
HDF5
Hadoop File System
Hive Metastore
Mongo
Spark/SparkSQL
SAS
SQL
SSH

Blaze

The Blaze ecosystem is a set of libraries that help users store, describe, query and process data. It is composed of the following core projects:

Blaze: An interface to query data on different storage systems
Dask: Parallel computing through task scheduling and blocked algorithms
Datashape: A data description language
DyND: A C++ library for dynamic, multidimensional arrays
Odo: Data migration between different storage systems

Seaborn: statistical data visualization

import seaborn as sns
# df is an existing pandas DataFrame; "x" and "y" are placeholder column names.
sns.jointplot(data=df, x="x", y="y", kind="kde");