I had the opportunity to give a talk on Machine Learning on 6/28/2017. For the quick and dirty version, here is a link to the slides. I'll try to post more of the material directly here soon.
ML Lunch: note that if you hit the little package symbol on the upper right, it will display as a slide show.
It is also available on Github for download. Note that some links point to files on my machine, so they won't work or show up; they use some files I cannot release (sorry).
At long last, something worth blogging about. Note that I have not (yet) created a Jupyter notebook to reflect all this; what I have has too much proprietary data and is currently a bad mess of code from all my experimentation.
With some trepidation I decided to see what I could get going with the fairly new Google Cloud DataLab.
In short it looks quite promising. What follows are a few notes where I found the documentation a bit tricky. So I hope this might help a few who are trying out the product.
Getting started: the first thing is that you must use the Google Cloud console to create a project and enable the APIs (the Google Compute Engine and Google BigQuery APIs must be enabled for the project before using the Cloud DataLab deployment tool). The deployment tool is accessed at https://datalab.cloud.google.com/ and will only display active, pre-existing projects that you have set up using the Google Cloud console. This page has some clickable links for setting up a project and enabling the APIs: https://cloud.google.com/datalab/getting-started/ The Google Cloud console is accessed here: https://console.cloud.google.com/home/dashboard/ On the upper right there is a “select a project” dropdown; the “create a project” entry is on that menu. So do that if you did not use the getting-started link, and enable the APIs mentioned above.
Once you have completed the deployment wizard you should have a notebook “tree view” open in your browser. It has a number of Notebook tutorials embedded. The tutorials are the primary documentation at this time, so pay attention!
So, after presumably running through the tutorials, you might like to do some work of your own. First, understand that the Git repo that gets installed with your project is the ONLY one you get; you cannot switch or use one of your own (as far as I can tell). This is not too bad: you just have to add your notebooks by uploading them using the upload icon on the tree view. You can then commit the uploaded file using the git interface provided on the upper right of the tree view. Very important: be sure to commit any changes that you want to persist. The notebook server may crash, and you may want to start and stop it (which is another issue); only committed notebooks will be available on the next start.
Note that you can find out what packages and versions are installed by executing !pip list in a cell. You can generally install and upgrade packages using !pip commands. These installs or upgrades DO NOT persist, but they run very quickly (Google bandwidth ftw), so just put what you need in the first cell and execute it on starting a notebook.
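As a rough sketch of that first-cell idea, the snippet below checks which packages are importable and only builds a pip command for the missing ones. The helper names `missing_packages` and `pip_install_command` and the package list are hypothetical, and it assumes a Python 3 kernel where `importlib.util` is available. It only builds the command; in a notebook you would still run the install itself with !pip.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this kernel."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_install_command(names, pre=False):
    """Build (but do not run) a pip command line for the given packages."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if pre:
        cmd.append("--pre")  # allow pre-release versions
    return cmd + list(names)


# Hypothetical first-cell list: one stdlib module and one made-up name.
wanted = ["json", "surely_not_an_installed_package_xyz"]
to_install = missing_packages(wanted)
if to_install:
    print(" ".join(pip_install_command(to_install)))
```

Keep in mind that the import name and the pip name of a package can differ, so a check like this is only a convenience, not a guarantee.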
Note that starting and stopping specific notebooks is controlled by the “sessions” item on the top of the tree view (this is different from the current Jupyter implementation).
So we have covered getting a notebook uploaded. (You may want to read more on cloning the repo containing the notebooks to your local machine, which is doable.)
The next item is that you probably want to analyze some of your own data. So, time to learn about files! Google has its own take here, so it is a departure from what you are used to on your local machine and is probably the biggest change to your code (it should not be too bad, though, once you “get” it).
Google DataLab supports two methods for file/data access: 1) Google Big Query (GBQ) and 2) Google Cloud Storage (GCS). From Google: ”Those are the only two storage options currently supported in Datalab (which should not be used in any event for large scale data transfers; these are for small scale transfers that can fit in memory in the Datalab VM).” See On StackOverflow.
In short, that answer did not quite do it for me. The answer is to use the Google Cloud Storage console to upload the datasets you want to access in your project, but where and how? Following this On Google's repo tutorial, I created a “bucket” in GCS by running:
    import gcp
    project = gcp.Context.default().project_id
    sample_bucket_name = project + '-datalab-samples'
    sample_bucket_path = 'gs://' + sample_bucket_name
    sample_bucket_object = sample_bucket_path + '/Hello.txt'
    print 'Bucket: ' + sample_bucket_path
    print 'Object: ' + sample_bucket_object

Note that %% commands must be in separate cells. A single % should allow the command to be in the same cell as other Python code.

    %%storage write --variable text --object $sample_bucket_object
Once I did this, I went to the Storage console (with the project name). You can click on the storage section in the main console: https://console.cloud.google.com/storage/browser I found that a new bucket, tomdlabb-datalab-samples, existed, and under that was the “Hello.txt” file. So now I knew where to load my own data!
The Cloud Storage browser has an Upload button. I loaded up two csv files, one with 2,000,000 records, just to see. Next issue: how to read it in and use it in pandas.
After much messing around, it seems that doing this allowed me to get data from my csv into a Pandas DataFrame:
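For reference, the naming convention used above can be sketched as a small plain-Python function. The helper name `sample_object_path` is hypothetical, and the project id "tomdlabb" is inferred from the bucket name shown in the console; substitute your own project id.

```python
def sample_object_path(project_id, object_name, suffix="-datalab-samples"):
    """Build the gs:// path for an object in the project's sample bucket."""
    bucket_name = project_id + suffix
    return "gs://{}/{}".format(bucket_name, object_name)


# "tomdlabb" is the project id implied by the bucket name seen in the console.
print(sample_object_path("tomdlabb", "Hello.txt"))
```

Any file uploaded to that bucket through the Storage browser should then be addressable with the same pattern, just with a different object name.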
    import pandas as pd
    from io import BytesIO

    sample_bucket_object = sample_bucket_path + '/icd_to_ccs.csv'

In a separate cell:

    %%storage read --object $sample_bucket_object --variable text

And again in a separate cell (the big deal here is that you need to use the BytesIO class from the io module to read the buffer):

    df_icd_to_css = pd.read_csv(BytesIO(text))
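Before handing a large buffer to pandas, it can help to peek at the first few rows with only the standard library. This is a minimal sketch, assuming the `text` variable holds UTF-8 encoded CSV bytes; the sample bytes, column names, and the `peek_rows` helper are made up for illustration.

```python
import csv
import io

# Stand-in for the `text` variable filled by `%%storage read`; these bytes are made up.
text = b"icd,ccs\nA00,1\nB01,2\n"


def peek_rows(raw_bytes, n=2, encoding="utf-8"):
    """Return the header and the first n data rows from CSV bytes."""
    # Wrap the bytes in a file-like object, then decode it as text for csv.reader.
    stream = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding, newline="")
    reader = csv.reader(stream)
    header = next(reader)
    rows = [row for _, row in zip(range(n), reader)]
    return header, rows


header, rows = peek_rows(text)
print(header, rows)
```

The same idea is what makes `pd.read_csv(BytesIO(text))` work: the bytes are wrapped in a file-like object so the reader can treat them as a file.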
Because I was reading at least one very large file, I decided to get rid of the “extra” copy (not real sure how much this helps):

    import gc
    # Remove text variable
    del text
    # Force gc collection - this is not actually necessary, but may be useful.
    gc.collect()

So then you have all the normal Pandas stuff going:

    df_icd_to_css.head(2)
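To get a feel for how much dropping the raw buffer helps, here is a small standalone sketch using the standard library's `tracemalloc`. It assumes a CPython 3 kernel, and the 10 MB buffer is only a stand-in for the real file contents, so treat the numbers as illustrative.

```python
import gc
import tracemalloc

tracemalloc.start()

# Stand-in for the raw file contents held in `text` (about 10 MB of zero bytes).
text = bytes(10_000_000)
with_buffer, _ = tracemalloc.get_traced_memory()

del text        # drop the only reference to the buffer
gc.collect()    # not required here; reference counting already freed the buffer
without_buffer, _ = tracemalloc.get_traced_memory()

tracemalloc.stop()
print("freed roughly", (with_buffer - without_buffer) // 1_000_000, "MB")
```

In this simple case the `del` does the work; `gc.collect()` mainly matters when objects are tied up in reference cycles.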
I wanted to use pandas-profiling but could not get it to install (apparently because it was self-designated as alpha), so I had to use !pip install --pre pandas-profiling
It fails and crashes the kernel, I’m assuming due to the 2,000,000 records. I have seen in passing that I can start the notebook in a larger VM, so I think that is a next step.
Another useful note on reading and writing data is here: On StackOverflow
Note that all my efforts were just to get simple flat csv files into the system and into Pandas DataFrames. Another thing to explore, particularly for the big data sets, is Big Query, which looks quite straightforward. But since my end goal is building Machine Learning models with Scikit Learn, which generally requires having all data in memory, I have not yet explored Big Query. I’m more interested in getting bigger VMs for the time being, but I expect to need to explore out-of-core solutions as well.
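As a taste of what an out-of-core approach looks like, the sketch below sums one column of a CSV while holding only a small chunk of rows in memory at a time. It uses only the standard library; the helper names and the tiny sample data are made up for illustration. pandas offers a similar pattern through the `chunksize` parameter of `read_csv`.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def column_total(csv_stream, column, chunk_size=1000):
    """Sum a numeric column while holding only one chunk in memory."""
    reader = csv.DictReader(csv_stream)
    total = 0.0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
    return total


# Tiny made-up dataset standing in for a large file.
sample = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
print(column_total(sample, "amount", chunk_size=2))
```

This works for running totals and counts, but many Scikit Learn models still need the full dataset in memory, which is why a bigger VM remains the simpler first step.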
Update: how to reboot the kernel when the main one dies; see On StackOverflow
Found it here: https://cloud.google.com/datalab/getting-started The very bottom of that page also has instructions for customizing your machine resources, which I have yet to try but will need to for larger datasets.
FWIW, these commands can be used in the new command line shell on the Cloud console page (see https://cloud.google.com/shell/docs/) without the SDK on your machine. You need to modify the commands slightly since you will be logged into your project already.
Stopping/starting VM instances
You may want to stop a Cloud Datalab managed VM instance to avoid incurring ongoing charges. To stop a Cloud Datalab managed machine instance, go to a command prompt, and run:
    $ gcloud auth login
    $ gcloud config set project <YOUR PROJECT ID>
    $ gcloud preview app versions stop main
After confirming that you want to continue, wait for the command to complete, and make sure that the output indicates that the version has stopped. If you used a non-default instance name when deploying, please use that name instead of "main" in the stop command, above (and in the start command, below). For restarting a stopped instance, run:
    $ gcloud auth login
    $ gcloud config set project <YOUR PROJECT ID>
    $ gcloud preview app versions start main
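If you want to keep the stop and start sequences handy, a small helper can assemble them as argument lists. This is only a dry-run sketch: the function name `datalab_vm_commands` is hypothetical, it uses nothing beyond the commands quoted above, and the `already_authenticated` flag encodes an assumption that in Cloud Shell the login and project steps are the ones that can be skipped.

```python
def datalab_vm_commands(project_id, action, version="main", already_authenticated=False):
    """Return the gcloud command lines quoted above as lists of arguments."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    commands = []
    if not already_authenticated:
        # Assumption: these two steps are only needed outside Cloud Shell.
        commands.append(["gcloud", "auth", "login"])
        commands.append(["gcloud", "config", "set", "project", project_id])
    commands.append(["gcloud", "preview", "app", "versions", action, version])
    return commands


# Print the commands instead of running them; the project id is a placeholder.
for cmd in datalab_vm_commands("<YOUR PROJECT ID>", "stop"):
    print(" ".join(cmd))
```

Printing rather than executing keeps this safe to run anywhere; passing each list to `subprocess.run` would be the next step once you have checked the commands against the current documentation.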
Python for Finance
Author: Yuxing Yan
Published by Packt Publishing Ltd.
35 Livery Place
Birmingham, B3 2PB, UK
I received a copy of the e-book for review from the publisher.
This book is aimed at Finance professionals and/or students with an interest in options trading and portfolio composition and evaluation.
- Ch1 Introduction and Installation of Python
- Ch2 Using Python as an Ordinary Calculator
- Ch3 Using Python as a Financial Calculator
- Ch4 13 Lines of Python to Price a Call Option
- Ch5 Introduction to Modules
- Ch6 Introduction to Numpy Scipy
- Ch7 Visual Finance via Matplotlib
- Ch8 Statistical Analysis of Time Series
- Ch9 The Black-Scholes-Merton Option Model
- Ch10 Python Loops and Implied Volatility
- Ch11 Monte Carlo Simulation and Options
- Ch12 Volatility Measures and GARCH
The book takes a somewhat unique approach, interweaving Python concepts on an as-needed basis with the introduction of progressively more advanced mathematics specific to the Finance field, primarily Options oriented.
I found the approach quite useful. In addition, the book includes examples of retrieving financial data from a number of sources. Specific code for retrieving data from Yahoo, Google, the Federal Reserve and Prof. French's library is provided.
The introductions to NumPy, SciPy, and Matplotlib (graphing library) are quite well done, with well-chosen examples.
The book is a strong addition to the growing body of work for finance professionals who want to learn Python.
The book falls short in some of its loftier goals, including exploring Statsmodels and Pandas. It also has almost no mention of the IPython notebook, which has rapidly become the default environment for many in the Finance field. Critically, in my opinion, the book lacks a good explanation of the differences between floating point and integer arithmetic, with its critical implications for Finance professionals.
To some degree this is to be expected since each of those topic areas is in fact subject to books of their own. That is, they are very large libraries and have numerous features that can and do interact with each other.
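On the floating point versus integer arithmetic point raised above, a few lines of Python 3 show why it matters for money. This is a minimal illustration using only the standard library; the amounts are made up.

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)

# Decimal values built from strings keep exact decimal cents.
decimal_sum = Decimal("0.10") + Decimal("0.20")
print(decimal_sum == Decimal("0.30"))

# Division: `/` always yields a float in Python 3, `//` floors to an integer.
shares = 7
print(shares / 2, shares // 2)

# Keeping money in integer cents avoids rounding drift when summing.
cents = [10, 20, 30]
total_cents = sum(cents)
print(total_cents)
```

Note that in Python 2, `/` applied to two integers floors the result, which is another easy way to lose value silently in a financial calculation.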
While the book does cover Pandas, its usage examples are somewhat limited. The text does not point out how Pandas is built on top of Numpy/Scipy. It does not cover the use of resample to change frequency or offset to offset frequencies. Nor does it cover in much depth the extensive IO libraries built into Pandas, which provide cleaner access to many external data sources like Yahoo, Google and the Federal Reserve, as well as SQL databases. The coverage is limited to a subset of the CSV IO and some use of the Pickle module. Further, there is no mention of the Pandas DataFrame plotting capabilities.
Statsmodels, like Pandas, is a newer Python library with rapidly expanding capabilities. Basic capabilities are nicely explained, but more advanced capabilities are not. I was hoping to see some usage examples of some of the more advanced capabilities with explanations, but I admit that this is due to my own current frustration with figuring out some of those things myself.
Advanced financial concepts such as time series with the efficient frontier, Monte Carlo simulation for options, and GARCH are each well covered with good examples.
While some of the above sounds somewhat critical, I enjoyed the book and will keep it for quick reference, particularly for the Options valuation material. For those new to Python with a Finance interest, I would recommend the book highly, but I would augment it with additional Python material. The book is an excellent introduction and can provide a solid foundation for exploring the more advanced facilities of the various Python Financially oriented modules.
This was presented to two statistics and data mining classes at the University of Alabama on March 14, 2013. The purpose was to provide a high level overview of the open source tools and community with an emphasis on the Python world. As I went through preparing, I was impressed with how much activity there is and how it is accelerating. Those who already participate in the Python community will know many if not most of these resources; for new people, there will be some useful links. Again, those who know Python will recognize how much I left out, but this is just the intro, so I don't want to overwhelm anyone.
Python For Data Analysis. Author: Wes McKinney, Publisher: O’Reilly Media, Inc, Sebastopol, CA 95472 isbn: 978-1-449-31979-3 copyright 2013, 472 pages, cover price: $39.99
No matter what your skill level, you need this book.
Once you get the basics of the Python language down you need to lift your skills to the next level to do useful work. For me, I wanted to put up a few web sites so that encouraged me to learn about Django, a web framework written in Python. This leap left me with some major gaps in my conceptual understanding of many Python idioms, which with time I guess I’ll fill in.
Another area where I spend time is doing various forms of Data Analysis. The Python tools and skills needed for this work are quite different from building web sites. I was very interested when I first saw the Pandas data analysis library which recommended the use of I-Python, another tool that I had not really begun to explore.
In short, the Pandas library and I-Python tool set make for a very powerful data manipulation and analysis toolset. There is a fascinating confluence of activity in the Python world with packages such as Numpy, scipy and I-Python and now Pandas, stats models, scikit learn and Numba increasingly supporting the Scientific and Data analysis communities. While a lot of the tools emphasize Matrix (array) operations, which initially put me off, Pandas makes it way more approachable since it more closely resembles spreadsheet structures which in fact resemble matrices once you wrap your head around the concepts.
Another major data manipulation capability introduced by Pandas, as explained in the text, is a set of SQL- like operators for Array operations enabling joining, summarizing and other SQL like operations on in-memory datasets.
I bought an early access copy of Python for Data Analysis and I have since kept it up to date which is a great feature of O’Reilly early access publications.
The book covers basic prerequisite information on the following:
The book is excellent, taking one from the conceptual issues through to the execution of sophisticated analysis of data sets from a variety of sources. The problems are well documented and the code can be executed with available data (something I have yet to do). Examples include getting and using data from:
- Federal Election Commission
- Yahoo finance
Throughout the book examples and code are presented with thorough explanations, way beyond simple code commentary in a teacher-like style. Due to the nature of the code being explained there was for me a constant set of aha moments as I began to understand not just the syntax of the code but why the code should be used to achieve the desired result, and also how to use some of the less obvious Python and Pandas language elements to better effect, in short helping me to be a more fluent coder.
Wes is truly a polymath, who understands analysis, advanced math and statistics as well as being an awesome coder of the Pandas package itself, and, to boot, he can write clearly in a way to be understood by mere mortals attempting to get up to speed on the tools and concepts embraced and enabled by the tools he has built and assembled. I also find the appendix summary of the Python language to be compact and useful in its own right.
Can’t recommend the book highly enough.
The Signal And The Noise, By Nate Silver, The Penguin Press, New York, copyright 2012 isbn 978-1-59420-411-1, 534 pages, Cover price: $27.95
I’m no math whiz but I love analysis. This book has little math but much useful insight on how to view and analyze data and information.
Nate Silver became famous for his predictions about the most recent presidential election in his 538 blog and columns for the New York Times.
The book is highly readable and one almost does not realize that they are in the hands of a master teacher. You just learn a ton of useful ways to think about analytical problems.
People looking for a more detailed explanation of how Nate built his election models will be disappointed. Conversely, if statistics, formulas, and detailed mathematical explanations scare or bore you, have no fear; those items are not present.
This places the book in a somewhat awkward spot: not technical and yet analytical. Once you “get it”, it becomes quite enjoyable. As I told my wife when I got to the 8th chapter, I was getting to the “good part”, where Nate goes into what is perhaps the most technical part of the book with an explanation of Bayesian statistics and how it differs from the conventional frequentist statistics that most of us learned in school.
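For readers who want a concrete feel for the Bayesian idea, here is a minimal sketch of Bayes' rule applied twice in a row. It is not taken from the book; the function name and all the probabilities are made-up illustration values.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule: P(H | E) from P(H), P(E | H) and P(E | not H)."""
    numerator = prior * p_evidence_if_true
    evidence = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / evidence


# Start with a 1% prior belief and update it on two independent pieces of
# evidence, each seen 90% of the time if the hypothesis is true and 10% if not.
belief = 0.01
for _ in range(2):
    belief = posterior(belief, 0.9, 0.1)
    print(round(belief, 4))
```

The point is the one the chapter makes in words: you start with a prior estimate and revise it as evidence arrives, rather than judging each observation in isolation.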
While, in truth, I was hoping for a “bit more” of math, and or computer code, to help me learn how to apply the techniques which are so well explored in the book, I was very satisfied with what I got. My frustration is due to my desire to translate theory and examples into code and personal utility on various projects I’m working on. So, in short, a personal issue.
What is very well done is to explore a wider variety of “problem spaces” such as baseball prediction, political prediction, the financial meltdown, Texas Holdem betting strategy, predicting terrorist activity, climate change and picking correct data vs noisy or biased data, and how to tell the difference. The examples are explained in sufficient detail so you learn a great deal about how to approach similar problems and how to evaluate similar data and conclusions. You will certainly view your investment process in a different and more informed manner.
I strongly recommend this book for anyone interested in day to day evaluation of life choices, which should be everyone. The book would also be a good college text for learning about “statistical thinking”.
Dave Girouard, the former head of Google's Apps business, says it very well in this piece on the GIGAOM site, which I strongly recommend reading.
Insane excuses for not moving to the cloud.
- Insanity #1: These big outages mean we should keep things in house
- Insanity #2: I need somebody to talk to when a service interruption occurs
- Insanity #3: Cloud is OK for non-critical applications with non-sensitive data
See the full article Here.
And this followup post about legal risk makes good points as well Here.
And even more from Google on how they handle legal requests/warrants Here.
I would also add that I see many people just staying put because the default decision to do nothing seems so appealing. After all, who wants to leave Outlook for something new that they have to learn? Just remember how difficult it was to learn Outlook in the first place, not to mention the confusion and trauma that a new release presents to the end users.
The Google product has been hugely successful due to its ease of use and continuous, non-traumatic update cycle; cost savings are simply a bonus. In addition, the Google products are built from the ground up to be completely mobile friendly. It is amazing how many individuals choose to use Gmail for personal use and feel they step back in time every time they have to use company-provided Exchange mail.
In short not moving is simply falling further behind and spending more than is needed for the comfort of not learning a bit of new technology.
Google's latest quarterly report included the following on Google Apps acceptance.
"Our enterprise business continued to grow at an impressive pace, gaining traction among some of the largest companies in the world. New customers include Nintendo, the Canadian Broadcast Company, Shaw Industries, POSCO, Randstad and Hyundai, to name a few. And after signing in May, the U.S. Department of the Interior moved more than 70,000 employees to the cloud during Q4 making it the largest federal agency to date using Google Apps."
Here is a brief introduction to Google Apps for business. By: Tom Brander