A hypothetical data analysis platform

My definition of a statistical platform is glue that ties orthogonal data analysis functions together. Take R, for instance: it is a platform-as-application. You fire up R and everything is accessible to you. However, all the packages work only on top of R.

Python, on the other hand, takes a platform-as-libraries approach. A basic data analysis setup is to pip install NumPy, SciPy, and Matplotlib. High-level libraries, such as scikit-learn and pandas, are built on top of these. It is somewhat more flexible for picking and choosing, but the dependencies still form a tree-like structure between packages.

Then there's Incanter.

You don't like using Parallel Colt for your matrices? Here, try this BLAS drop-in replacement and everything will just work, at 10x speed.

Much of this flexibility is due to earlier design choices by Liebke et al. to leverage Clojure's idiom that "it is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures."

The thing is, I think we're only scratching the surface. Excuse me while I dream for a minute.

Say that instead of jBLAS, you want to use a CPU/GPU hybrid. Suppose you could just do a (use 'incanter-magma) and your Incanter code would run with MAGMA (via Matthew Rocklin) under the hood without any other change.

Taking this idea of interfacing libraries into a hypothetical use case: imagine that you have cleaned and structured your data on Hadoop using Cascalog and are looking to analyse this dataset. You start your Incanter session and pull in your data with (use 'incanter-cascalog). You write some Incanter script to interrogate the dataset, but find the data is still too big for your laptop. So you (use 'incanter-storm) to switch to distributed processing instead. Incanter would then flow data directly from Cascalog to Storm inside your cluster.

For your results, you find JFreeChart limiting, so you (use 'incanter-c2) to spiff up your visualisations with C2, all without changing a single line of your Incanter script.

Instead of the star-like dependency of R and its packages, or the tree-like structure of Python and its packages, Incanter could be an interface to stand-alone libraries, encapsulated by an application for the user.

Incanter, the library, could be a set of modules that transform data to and from external libraries into standard Incanter-compatible data structures. Incanter, the application, could be a domain-specific language, a client, and an in-REPL package manager.
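The mechanics behind this kind of swap can be sketched as a registry of interchangeable backends behind one stable user-facing interface. Here is a minimal, hypothetical Python sketch of the idea; names like `MatrixBackend`, `use`, and `mmult` are purely illustrative, not Incanter's actual API:

```python
class MatrixBackend(object):
    """Interface that every interchangeable backend implements."""
    def multiply(self, a, b):
        raise NotImplementedError

class NaiveBackend(MatrixBackend):
    """Pure-Python stand-in for a default backend such as jBLAS."""
    def multiply(self, a, b):
        # textbook row-by-column multiply; zip(*b) yields b's columns
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]

_backend = NaiveBackend()

def use(backend):
    """Swap the active backend, analogous to (use 'incanter-magma)."""
    global _backend
    _backend = backend

def mmult(a, b):
    """User-facing function; callers never see which backend runs."""
    return _backend.multiply(a, b)
```

User code written against `mmult` keeps working unchanged when `use` installs, say, a GPU-backed implementation; that is the whole point of the platform-as-interface idea.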

Another benefit is that leaning on external, stand-alone libraries would also help mitigate Incanter's developer-shortage problem.

I call this platform-as-interface.

Quantisan.com is now compiled and statically served

I just migrated this blog from a self-hosted WordPress site to a statically generated Pelican blog. It took me a whole weekend because I have almost 600 posts. Much is still broken at the moment, but the site seems presentable, so I pushed it through. The fixes will just have to wait.

I considered using Jekyll, as it appears to be the most popular static site generator. A look at its GitHub page, though, shows no patch has been merged for months; development on Jekyll appears to have paused for some reason. Furthermore, I didn't have much luck getting exitwp, a WordPress-XML-to-Jekyll importer, to gracefully parse my many custom posting tags. I also gave Ruhoh a quick trial, but it too relies on exitwp to import WordPress posts.

Then I found that Pelican has its own importer tool built on pandoc, which, in my testing, handled the non-standard tags in my posts much better.

In any case, here are the steps I took to migrate to Pelican.

  1. Export WordPress posts into an XML file
  2. Transfer comments to Disqus
  3. Install the Pelican 3.0.0 dev build
  4. Use pelican-import to convert the XML file into Markdown posts
  5. Move some posts into their directories/categories
  6. Use TextMate to do a bunch of batch fixes
  7. Configure Pelican and write a Makefile
  8. Roll my own gh-pages script into the Makefile (instead of installing ghp-import)
  9. Customise a Pelican theme
  10. Set up GitHub Pages for a custom domain
  11. Update DNS
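For the curious, the hand-rolled gh-pages step (8) can be as simple as force-pushing the built output directory to the publishing branch. A rough sketch, where the repository URL is a placeholder and the paths assume Pelican's default `output/` directory:

```shell
#!/bin/sh
# Sketch of a hand-rolled gh-pages publish script.  USER and the
# remote URL are placeholders; a user page deploys from master.
set -e
make html                             # Pelican builds the site into output/
cd output
git init -q                           # throwaway repo just for the built site
git add -A
git commit -q -m "Publish $(date +%Y-%m-%d)"
git push -f git@github.com:USER/USER.github.com.git master
cd ..
```

The throwaway repo means the generated HTML never clutters the source repository's history.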

After all of that, this site is now hosted and served on GitHub.

It's an open buffet in a small business

One of the few benefits of working for myself is that I don't need to worry about compatibility with legacy systems. I am free to use whatever open source tools get the job done well. The downside is that there are so many technologies out there that it's hard to choose the right ones for the job. To give you a sense of what I mean, here are some of the technologies that I have either tried or seriously considered in the past year.

  • Programming languages: Java, Python, R, C#, F#, Scala, Clojure, Haskell.
  • Data storage: HDF5, CSV, Binary, MySQL, PostgreSQL, MongoDB, MonetDB, Redis, HBase.
  • Cloud server: Amazon EC2, Microsoft Azure, Google App Engine, Rackspace Cloud, plain old VPSs.

Programming language choices are important because languages sit at the bottom of a technology stack; most of the work that I do is built on them. For a long while, I settled on a combination of Java, Python, and R: I prototype small ideas in Python, implement production code in Java, and perform analysis in R. I discussed using the right tool for the right task a year ago. By the end of my previous project, though, I was finding that this popular development triplet of Java, Python, and R is not ideal for a solo operation. Seeing that I have more time on my hands because I am using QTD to trade for me now, I am taking a break this summer to expand my knowledge and learn new technologies. Some of the technologies that I am experimenting with include:

  • an in-memory data store for instantaneous and persistent tick data
  • parallel programming for concurrent processing with no locks
  • mathematically intuitive algorithm implementations using higher-order functions
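As a toy illustration of that last point, higher-order functions let an algorithm read close to its mathematical definition. A hypothetical Python sketch, where all function names are mine for illustration rather than from any library I use:

```python
from functools import reduce

def compose(*fns):
    """compose(f, g)(x) == f(g(x)): chain functions right to left."""
    return reduce(lambda f, g: lambda x: f(g(x)), fns)

def returns(prices):
    """Period-over-period returns of a price series."""
    return [b / a - 1 for a, b in zip(prices, prices[1:])]

def mean(xs):
    return sum(xs) / float(len(xs))

# "average return" falls out as a composition rather than a hand-written loop
avg_return = compose(mean, returns)
```

Reading `avg_return` as "the mean of the returns" matches the mathematics directly, which is the appeal over an imperative loop with accumulators.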

Don't mind me as I help myself in this open buffet of technologies.

Data Scraping the Toronto Stock Exchange: Extracting 3,660 companies' data

One of the tasks that I've always wanted to make more efficient in my stock trading is scanning for stocks to trade. One look at my trading strategy posts and you'll see that I have devised many stock scanning systems over the past few years. The most recent one uses options data to filter stocks. However, it is not automated, so it takes a lot of time to gather and analyze the data. Furthermore, the set of tools I use is limited to U.S. stocks. Now that I have taken an interest in the Canadian stock market, I can't seem to find any public tool that I like. Thus, I am biting the bullet and taking my time to develop a custom system once and for all.

Before we can analyze stock data, we need to extract it first. Where better to go for that than straight to the source at TMX.com, the parent company of the Toronto Stock Exchange (TSX) and TSX Venture Exchange (TSXV)? TMX.com provides a list of publicly traded companies in ten Excel files, divided by sector. Each contains a number of fundamental company data points, such as market capitalization and outstanding shares. So step 1 is to extract those data, and that is where I am at now.

I attached the source code for an alpha/developmental release below for anyone interested. It is a working program to scrape the data from TMX.com's files, but it's still a work in progress. That's why I am calling it version 0.1.

The next milestone is to program a Stocks class to hold, organize, and manage all three thousand, six hundred, and sixty companies' data. This should be an easy task by extending the built-in dictionary class in Python. However, I haven't gotten to that chapter yet in my scientific programming with Python book; I stopped at chapter 8 to work on this project, and chapter 9 covers the inheritance and hierarchical material.

The goal of this project is to build an automated program that scrapes TSX and TSXV data from various sources onto my computer.
Once I have my data, that's when the real fun starts.

Regarding the code below, I know that source code is useless for most people. Once the project is complete, I will compile the code into a standalone application and post it on this site. Subscribe to my RSS feed so that you can keep up to date with the progress of this project and my other ramblings on trading.

[python]
# extractTMX.py
# version: 0.1 alpha release
# revision date: March, 2010
# by Paul, Quantisan.com
"""A data scraping module to extract company listing excel files from TMX.COM"""

import xlrd  # to read Excel file
#import sys
from finClasses import Stock  # custom Stock class

def _verify():
    """Verification function for a rundown of the module"""
    pass  # copy test block here when finished

def findCol(sheet, key):
    """Find the column corresponding to header string 'key'"""
    firstRow = sheet.row_values(0)
    for col in range(len(firstRow)):
        if key in firstRow[col]:
            return col  # return first sighting
    else:  # not found
        raise ValueError("%s is not found!" % key)

def scrapeXLS(book):
    """Data scraping function for TMX Excel file"""
    listingDict = {}  # dict of ('ticker': market cap)
    for index in range(book.nsheets):
        sh = book.sheet_by_index(index)
        mcCol = findCol(sh, "Market Value")
        assert type(mcCol) is int, "mcCol is a %s" % type(mcCol)
        osCol = findCol(sh, "O/S Shares")
        assert type(osCol) is int, "osCol is a %s" % type(osCol)
        secCol = findCol(sh, "Sector")  # multiple matches but taking first
        assert type(secCol) is int, "secCol is a %s" % type(secCol)
        hqCol = findCol(sh, "HQ\nRegion")
        assert type(hqCol) is int, "hqCol is a %s" % type(hqCol)
        for rx in range(1, sh.nrows):
            sym = str(sh.cell_value(rowx=rx, colx=4))  # symbol
            s = sh.cell_value(rowx=rx, colx=2)  # exchange col.
            if s == "TSX":
                exch = "T"
            elif s == "TSXV":
                exch = "V"
            else:
                raise TypeError("Unknown exchange value")
            mc = sh.cell_value(rowx=rx, colx=mcCol)  # market cap
            # check for empty market cap cell
            mc = int(mc) if type(mc) is float else 0
            os = int(sh.cell_value(rowx=rx, colx=osCol))  # O/S shares
            sec = str(sh.cell_value(rowx=rx, colx=secCol))  # sector
            hq = str(sh.cell_value(rowx=rx, colx=hqCol))  # HQ region
            listingDict[sym] = Stock(symbol=sym, exchange=exch,
                                     mktCap=mc, osShares=os,
                                     sector=sec, hqRegion=hq)
    return listingDict

def fetchFiles(fname):
    infile = open(fname, 'r')  # text file of XLS file names
    listing = {}
    for line in infile:  # 1 file name per line
        if line[0] == '#':
            continue  # skip commented lines
        line = line.strip()  # strip trailing \n
        print "Reading '%s' ..." % line
        xlsFile = "TMX/" + line  # in TMX directory
        book = xlrd.open_workbook(xlsFile)  # import Excel file
        listing.update(scrapeXLS(book))  # append scraped data to dict
    return listing

#if __name__ == '__main__':  # verify block
#    if len(sys.argv) == 2 and sys.argv[1] == 'verify':
#        _verify()

if __name__ == '__main__':  # test block
    listing = fetchFiles('TMX/TMXfiles.txt')
[/python]
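To give a taste of where the next milestone could go, here is a hypothetical sketch of a Stocks class extending the built-in dict, as described above. The method and attribute names are my guesses for illustration, not the final design, and a minimal Stock stand-in is included so the sketch is self-contained:

```python
class Stock(object):
    """Minimal stand-in for the custom Stock class in finClasses."""
    def __init__(self, symbol, exchange, mktCap, osShares, sector, hqRegion):
        self.symbol = symbol
        self.exchange = exchange
        self.mktCap = mktCap
        self.osShares = osShares
        self.sector = sector
        self.hqRegion = hqRegion

class Stocks(dict):
    """A dict of symbol -> Stock, extended with a few query helpers."""
    def by_sector(self, sector):
        """Return a new Stocks holding only companies in the given sector."""
        return Stocks((sym, s) for sym, s in self.items()
                      if s.sector == sector)

    def total_mkt_cap(self):
        """Sum market capitalization over all held companies."""
        return sum(s.mktCap for s in self.values())
```

Because Stocks inherits everything from dict, `fetchFiles` could return one directly and all the usual dict indexing and iteration would keep working, with the domain-specific queries layered on top.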
