Python as a statistics workbench

  • Lots of people use a main tool like Excel or another spreadsheet, SPSS, Stata, or R for their statistics needs. They might turn to some specific package for very special needs, but a lot of things can be done with a simple spreadsheet or a general stats package or stats programming environment.

    I've always liked Python as a programming language, and for simple needs, it's easy to write a short program that calculates what I need. Matplotlib allows me to plot it.

    Has anyone switched completely from, say, R to Python? R (or any other statistics package) has a lot of functionality specific to statistics, and it has data structures that let you think about the statistics you want to perform rather than the internal representation of your data. Python (or some other dynamic language) has the benefit of allowing me to program in a familiar, high-level language, and it lets me programmatically interact with real-world systems in which the data resides or from which I can take measurements. But I haven't found any Python package that would allow me to express things with "statistical terminology" – from simple descriptive statistics to more complicated multivariate methods.

    What can you recommend if I wanted to use Python as a "statistics workbench" to replace R, SPSS, etc.?

    What would I gain and lose, based on your experience?

    FYI, there is a new python stats subreddit that is going off: http://www.reddit.com/r/pystats

    When you need to move things around on the command line, pythonpy (https://github.com/Russell91/pythonpy) is a nice tool.

  • ars

    ars Correct answer

    10 years ago

    It's hard to ignore the wealth of statistical packages available in R/CRAN. That said, I spend a lot of time in Python land and would never dissuade anyone from having as much fun as I do. :) Here are some libraries/links you might find useful for statistical work.

    • NumPy/SciPy: You probably know about these already. But let me point out the Cookbook, where you can read about many statistical facilities already available, and the Example List, which is a great reference for functions (including data manipulation and other operations). Another handy reference is John Cook's Distributions in SciPy.

    • pandas: This is a really nice library for working with statistical data -- tabular data, time series, panel data. Includes many built-in functions for data summaries, grouping/aggregation, pivoting. Also has a statistics/econometrics library.

    • larry: Labeled array that plays nice with NumPy. Provides statistical functions not present in NumPy and is good for data manipulation.

    • python-statlib: A fairly recent effort which combined a number of scattered statistics libraries. Useful for basic and descriptive statistics if you're not using NumPy or pandas.

    • statsmodels: Statistical modeling: linear models, GLMs, among others.

    • scikits: Statistical and scientific computing packages -- notably smoothing, optimization and machine learning.

    • PyMC: For your Bayesian/MCMC/hierarchical modeling needs. Highly recommended.

    • PyMix: Mixture models.

    • Biopython: Useful for loading your biological data into Python, and provides some rudimentary statistical/machine learning tools for analysis.

    If speed becomes a problem, consider Theano -- used with good success by the deep learning people.

    There's plenty of other stuff out there, but this is what I find the most useful along the lines you mentioned.
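
    To give a feel for how far the basics get you, here is a minimal sketch of descriptive statistics and a simple test using pandas and SciPy. The toy data and the column names ("group", "value") are made up for illustration; the calls themselves (describe, groupby/agg, scipy.stats.ttest_ind) are standard, but treat this as a starting point rather than a recommended workflow.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: one measurement for two groups.
df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0],
})

# Descriptive statistics for the whole column (count, mean, std, quartiles).
summary = df["value"].describe()

# Per-group summaries, similar in spirit to aggregate()/tapply() in R.
by_group = df.groupby("group")["value"].agg(["mean", "std", "count"])

# Two-sample t-test; ttest_ind assumes equal variances by default.
a = df.loc[df["group"] == "a", "value"]
b = df.loc[df["group"] == "b", "value"]
t_stat, p_value = stats.ttest_ind(a, b)

print(summary)
print(by_group)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

    For model fitting beyond this (linear models, GLMs), statsmodels from the list above is the place to look.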

    All answers were both helpful and useful, and would all deserve to be accepted. This one, however, does a very good job at answering the question: with Python, you have to put together lots of pieces to do what you want. These pointers will no doubt be very useful for anyone wanting to do statistics/modeling/etc. with Python. Thanks to everyone!

    @ars Do you know the best way to use Python on Windows?

    @StéphaneLaurent I usually install the various pieces myself, but for a quick start/install, you might consider: pythonxy.

    This script installs many of the libraries cited above: http://fonnesbeck.github.com/ScipySuperpack/

    Pythonxy is nice, but it can get annoying if you want to do large computations, as it is only available in 32-bit. Here are unofficial binaries for installing many Python packages. They can be quite useful if you decide to work under Windows. http://www.lfd.uci.edu/~gohlke/pythonlibs/ @StéphaneLaurent

    Somebody needs to create a Kickstarter for a Python-like GUI app for doing statistics with all of these tools built in. If I have to use Stata for another minute, I might just kill someone...

    Is "rpy2" hidden somewhere in there? It feels essential if you want to run R from Python.

    Yeah. You can run R from Python relatively easily, natively or through other libraries. It seems the main argument for R is that most of the necessary functions are already packaged in R or available on CRAN. Python also has Spyder, Anaconda, Enthought Python, Jupyter Notebooks... and these days, given Python's popularity, I would expect most functions available in R to be available in Python as well. The previous answers seem to be from quite a while back. I wonder whether R is still better than Python, or whether they are now on more equal ground?

    Also, for those of you strongly recommending R, have you tried Python's OO programming capabilities? Doesn't using the OO capabilities in Python basically give it similar capabilities to R?

    How does larry compare to xarray?

    Theano is pretty outdated. Most people use TensorFlow now.

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM