A Few Hints for Google Cloud Datalab
At long last, something worth blogging about. Note that I have not (yet) created a Jupyter notebook to accompany this post; what I have contains too much proprietary data and is currently a mess of code from all my experimentation.
With some trepidation, I decided to see what I could get going with the fairly new Google Cloud Datalab.
In short, it looks quite promising. What follows are a few notes on spots where I found the documentation tricky, in the hope that they help a few people trying out the product.
Getting started: the first thing to know is that you must use the Google Cloud console to create a project and enable the required APIs (the Google Compute Engine and Google BigQuery APIs must be enabled for the project before using the Cloud Datalab deployment tool, which is accessed at https://datalab.cloud.google.com/). The deployment tool will only display active, pre-existing projects that you have set up using the Google Cloud console. This page has clickable links for setting up a project and enabling the APIs: https://cloud.google.com/datalab/getting-started/ The Google Cloud console itself is at https://console.cloud.google.com/home/dashboard/ On the upper right there is a “Select a project” dropdown; the “Create a project” entry is on that menu. Use it if you did not follow the getting-started link, and enable the APIs mentioned above.
Once you have completed the deployment wizard, you should have a notebook “tree view” open in your browser, with a number of tutorial notebooks embedded. The tutorials are the primary documentation at this time, so pay attention!
So, after presumably running through the tutorials, you might like to do some work of your own. First, understand that the Git repo that gets installed with your project is the ONLY one you get; as far as I can tell, you cannot switch to or use one of your own. This is not too bad: you just add your notebooks by uploading them with the upload icon on the tree view, then commit the uploaded files using the git interface provided on the upper right of the tree view. Very important: be sure to commit any changes that you want to persist. The notebook server may crash, or you may want to stop and start it (which is another issue), and only committed notebooks will be available on the next start.
Note that you can find out what packages and versions are installed by executing !pip list in a cell. You can generally install and upgrade packages using !pip commands. These installs and upgrades DO NOT persist, but they run very quickly (Google bandwidth ftw), so just put what you need in the first cell and execute it when you start the notebook.
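For example, a first cell might look something like this (the package names here are just illustrations, not things this setup requires; remember they will need re-running after every restart):

```
!pip install --upgrade seaborn
!pip install --pre pandas-profiling
!pip list
```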
Note that starting and stopping specific notebooks is controlled by the “Sessions” item at the top of the tree view (this is different from the current Jupyter implementation).
So we have covered getting a notebook uploaded. (You may also want to read up on cloning the repo containing the notebooks to your local machine, which is doable.)
Next, maybe you want to actually analyze some of your own data, so it is time to learn about files! Google has its own take here, a departure from what you are used to on your local machine, and it is probably the biggest change to your code (it should not be too bad once you “get” it).
Google Datalab supports two methods for file/data access: 1) Google BigQuery (GBQ) and 2) Google Cloud Storage (GCS). From Google: “Those are the only two storage options currently supported in Datalab (which should not be used in any event for large scale data transfers; these are for small scale transfers that can fit in memory in the Datalab VM).” See On StackOverflow
In short, that answer did not quite do it for me. The answer is to use the Google Cloud Storage console to upload the datasets you want to access in your project, but where and how? Following this On Google's repo tutorial, I created a “bucket” in GCS by running:
```python
import gcp

project = gcp.Context.default().project_id
sample_bucket_name = project + '-datalab-samples'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
print 'Bucket: ' + sample_bucket_path
print 'Object: ' + sample_bucket_object
```

Note that %% commands must be in separate cells; a single % should allow the command to share a cell with other Python code.

```
%%storage write --variable text --object $sample_bucket_object
```
Once I did this, I went to the Storage console for the project; you can click on the Storage section in the main console, or go directly to https://console.cloud.google.com/storage/browser There I found that a new bucket existed, tomdlabb-datalab-samples, and under it the “Hello.txt” file. So now I knew where to load my own data!
The Cloud Storage browser has an Upload button, so I uploaded two csv files, one of 2,000,000 records, just to see. Next issue: how to read a file in and use it in pandas.
After much messing around, it turns out that the following got the data from my csv into a pandas DataFrame.
```python
sample_bucket_object = sample_bucket_path + '/icd_to_ccs.csv'
```

In a separate cell:

```
%%storage read --object $sample_bucket_object --variable text
```

And again in a separate cell (the big deal here is that you need BytesIO to read the buffer):

```python
from io import BytesIO
import pandas as pd

df_icd_to_css = pd.read_csv(BytesIO(text))
```
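If you want to see the BytesIO trick in isolation, here is a self-contained sketch that fakes the `text` buffer instead of pulling it from GCS (the column names and values are made up for illustration):

```python
from io import BytesIO

import pandas as pd

# Stand-in for the bytes that %%storage read would leave in `text`
text = b"icd,ccs\nA00,1\nA01,2\n"

# pandas can read from any file-like object, so wrap the raw buffer
df = pd.read_csv(BytesIO(text))
print(df.shape)  # (2, 2)
```

The point is simply that `read_csv` wants a file-like object, and `BytesIO` turns the in-memory byte buffer into one.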
Because I was reading at least one very large file, I decided to get rid of the “extra” copy (not really sure how much this helps):

```python
import gc

# Remove the raw text buffer
del text
# Force a gc collection - not strictly necessary, but may be useful.
gc.collect()
```

Then you have all the normal pandas stuff going:

```python
df_icd_to_css.head(2)
```
I wanted to use pandas-profiling but could not get it to install (apparently because it designates itself as alpha), so I had to use:

```
!pip install --pre pandas-profiling
```
It fails and crashes the kernel, I'm assuming due to the 2,000,000 records. I have seen in passing that I can start the notebook in a larger VM, so I think that is a next step.
Another useful note on reading and writing data is here: On StackOverflow
Note that all my efforts so far were just to get simple flat csv files into the system and into pandas DataFrames. Another thing to explore, particularly for the big data sets, is BigQuery, which looks quite straightforward. But since my end goal is building machine learning models with scikit-learn, which generally requires having all data in memory, I have not yet explored BigQuery; I'm more interested in getting bigger VMs for the time being. I expect to need to explore out-of-core solutions as well.
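One cheap out-of-core trick that fits what is already here: pandas can read a buffer in chunks, so the whole table never has to sit in memory at once. A minimal sketch, again faking the buffer rather than reading from GCS (the chunk size and columns are made up):

```python
from io import BytesIO

import pandas as pd

# Stand-in for a large CSV buffer read from GCS: header + 10 data rows
text = b"icd,ccs\n" + b"".join(b"A%02d,%d\n" % (i, i) for i in range(10))

# chunksize makes read_csv return an iterator of small DataFrames
total_rows = 0
for chunk in pd.read_csv(BytesIO(text), chunksize=4):
    total_rows += len(chunk)  # process each piece, then let it go

print(total_rows)  # 10
```

The same `chunksize` argument works when reading from a file path, so the pattern carries over once the data is somewhere readable.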
Update: how to reboot the kernel when the main one dies... see On StackOverflow
Found it here: https://cloud.google.com/datalab/getting-started The very bottom of that page also has instructions for customizing your machine resources, which I have yet to try but will need to for larger datasets.
FWIW, these commands can also be used in the new command line shell on the Cloud console page (see https://cloud.google.com/shell/docs/), without the SDK on your machine. You need to modify the commands slightly, since you will already be logged into your project.
Stopping/starting VM instances
You may want to stop a Cloud Datalab managed VM instance to avoid incurring ongoing charges. To stop a Cloud Datalab managed machine instance, go to a command prompt, and run:
```
$ gcloud auth login
$ gcloud config set project <YOUR PROJECT ID>
$ gcloud preview app versions stop main
```
After confirming that you want to continue, wait for the command to complete, and make sure the output indicates that the version has stopped. If you used a non-default instance name when deploying, use that name instead of "main" in the stop command above (and in the start command below). To restart a stopped instance, run:
```
$ gcloud auth login
$ gcloud config set project <YOUR PROJECT ID>
$ gcloud preview app versions start main
```