Python Packages for Science

Ricardo M. Ferraz Leal

rhf@ornl.gov

Standard Data formats

  • Pandas DataFrames
    • Tabular data: i.e. spreadsheet style.
    • Panel: 3-dimensional array:
      • items, major_axis, minor_axis
    • Panel4D (Experimental)
      • labels, items, major_axis, minor_axis
    • Lots of functionality: statistics, interpolation, masks, etc.
    • Intuitive indexing and selecting of data
    • New: some Cython operations now release the Global Interpreter Lock (GIL)!
  • NumPy Arrays
    • NumPy code often releases the GIL while it is computing.
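
A minimal sketch of the DataFrame features listed above (label-based indexing, masks, interpolation, statistics). The column names, row labels, and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical tabular data: three runs, one missing temperature.
df = pd.DataFrame(
    {"temperature": [290.0, np.nan, 310.0], "counts": [12, 15, 9]},
    index=["run1", "run2", "run3"],
)

# Label-based selection of a single cell.
run3_counts = df.loc["run3", "counts"]

# Boolean mask: keep only the rows with more than 10 counts.
high = df[df["counts"] > 10]

# Linear interpolation fills the missing temperature.
filled = df["temperature"].interpolate()

# Column statistics.
mean_counts = df["counts"].mean()
```

The same mask syntax works on a plain NumPy array, which is what each DataFrame column wraps.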

New Standards

xarray (formerly xray):

  • Extends pandas to labeled multi-dimensional arrays.
  • xarray uses NetCDF4 (and hence HDF5) for persistent storage.

  • References:

    • Notebook here.
    • Presentation of Xray at SciPy 2015 here.
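
To make "labeled multi-dimensional arrays" concrete, here is a small sketch using the current package name, xarray. The dimension names, coordinates, and values are invented for the example:

```python
import numpy as np
import xarray as xr

# A 2 x 3 array whose axes carry names and coordinate labels.
data = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("time", "x"),
    coords={"time": [0, 1], "x": [10, 20, 30]},
    name="signal",
)

# Select by coordinate label rather than by integer position.
row = data.sel(time=1)

# Reduce over a named dimension instead of an axis number.
mean_over_time = data.mean(dim="time")
```

Persisting such an array is done with data.to_netcdf(path), which needs a NetCDF backend installed; it is left out here to keep the sketch free of file I/O.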

Dask:

  • Parallel computing: threading, multiprocessing, etc.

    • dask.array = numpy + threading
    • dask.dataframe = pandas + threading
    • dask.bag = map, filter, itertools, toolz + multiprocessing
  • References:

    • My test here
    • Talk from SciPy here.
    • Dask releasing the GIL with Numba here.
    • Dask.array: Calculations with arrays bigger than your memory here.
    • Article.
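
As a rough illustration of "dask.array = numpy + threading", the sketch below builds a chunked array and evaluates an expression lazily. The array shape and chunk size are arbitrary choices for the example:

```python
import dask.array as da

# A 1000 x 1000 array split into sixteen 250 x 250 blocks.
x = da.ones((1000, 1000), chunks=(250, 250))

# Building the expression only records a task graph; nothing is computed yet.
total = (x + x.T).sum()

# compute() runs the graph, by default on a thread pool, one block at a time.
result = total.compute()
```

Because each block is processed separately, the same pattern applies to arrays that are larger than memory, provided the individual chunks fit.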

Xray + Dask:

  • Xray provides labeled, multi-dimensional arrays.
  • Dask provides a system for parallel computing.
  • Together, they allow for easy analysis of scientific datasets that don’t fit into memory.
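
A small sketch of the combination, assuming both xarray and dask are installed. The array contents, dimension names, and chunk size are illustrative only:

```python
import numpy as np
import xarray as xr

# A labeled 4 x 3 array.
arr = xr.DataArray(np.arange(12.0).reshape(4, 3), dims=("time", "x"))

# chunk() swaps the in-memory NumPy data for a dask array split along "time".
chunked = arr.chunk({"time": 2})

# Operations stay lazy and keep their labels ...
lazy_mean = chunked.mean(dim="time")

# ... until compute() asks dask to evaluate the blocks.
result = lazy_mean.compute()
```

For data on disk, the usual entry point is to pass a chunks argument when opening a dataset, so that the file is never loaded whole.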


DistArray

DistArray provides general multidimensional NumPy-like distributed arrays to Python. It intends to bring the strengths of NumPy to data-parallel high-performance computing. DistArray has a similar API to NumPy.

Cython

Python with types...

  • Can invoke C/C++ routines
  • Lets you declare static types for function parameters and results, local variables, and class attributes.
  • In other words, a Python-to-C source code translator that integrates with the CPython interpreter at a low level.
  • My tests here.

Numba

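  • Numba works by generating optimized machine code using the LLVM compiler infrastructure.
    from numba import jit, vectorize, float64
    # jit decorator tells Numba to compile this function.
    # The argument types will be inferred by Numba when the function is called.
    @jit
    def sum2d(arr):
        # Body is missing in the source; a plain double loop over a 2D array is assumed.
        result = 0.0
        for i in range(arr.shape[0]):
            for j in range(arr.shape[1]):
                result += arr[i, j]
        return result
  • A function can be compiled into a NumPy ufunc using:
    @vectorize([float64(float64, float64)])
    def f(x, y):
        return x + y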

Castra

Castra is an on-disk, partitioned, compressed, column store. Castra provides efficient columnar range queries.

  • Efficient on-disk
  • Partitioned
  • Compressed
  • Column-store
  • Tabular data

ODO

Converts data between file and storage formats.

Formats:

  • AWS
  • CSV
  • JSON
  • HDF5
  • Hadoop File System
  • Hive Metastore
  • Mongo
  • Spark/SparkSQL
  • SAS
  • SQL
  • SSH

Blaze

The Blaze ecosystem is a set of libraries that help users store, describe, query and process data. It is composed of the following core projects:

  • Blaze: An interface to query data on different storage systems
  • Dask: Parallel computing through task scheduling and blocked algorithms
  • Datashape: A data description language
  • DyND: A C++ library for dynamic, multidimensional arrays
  • Odo: Data migration between different storage systems

Seaborn: statistical data visualization

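import seaborn as sns
# df is assumed to be a pandas DataFrame with numeric columns "x" and "y" (hypothetical names).
sns.jointplot(data=df, x="x", y="y", kind="kde")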