A weather station data scraper in R
This R code scrapes publicly available weather station data for a given weather station ID and date range. I patched existing source code from UC Davis, so it’s not my original code, but I thought I’d share it here anyway. It doesn’t fail safely, though: a null return value breaks the fetching loop.
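One way to keep a single bad day from breaking the loop is to wrap each fetch in tryCatch and drop the empty results before combining. The following is only a sketch under stated assumptions, not the UC Davis code: fetch_day(), the station ID "KXYZ", and the temp_c column are hypothetical stand-ins for the real download-and-parse step.

```r
# Hypothetical stand-in for the real download step: returns NULL for one
# day to simulate a missing page, and a one-row data frame otherwise.
fetch_day <- function(station, day) {
  if (format(day, "%d") == "02") return(NULL)
  data.frame(station = station, date = day, temp_c = 20)
}

# Fetch every day in the range, skipping days that error out or return NULL.
fetch_range <- function(station, from, to, fetch = fetch_day) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  results <- lapply(seq_along(days), function(i) {
    tryCatch(fetch(station, days[i]), error = function(e) NULL)
  })
  ok <- !vapply(results, is.null, logical(1))
  do.call(rbind, results[ok])  # NULL if nothing was fetched
}

obs <- fetch_range("KXYZ", "2010-10-01", "2010-10-03")
print(obs)
```

Days that fail are silently dropped here; a real scraper would probably want to log them so they can be retried.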
It’s an open buffet in a small business
One of the few benefits of working for myself is that I don’t need to worry about compatibility with legacy systems. I am free to use whatever open source tools get the job done well. The downside is that there are so many technologies out there that it’s hard to choose the right ones for the job. To give you a sense of what I mean, here are some of the options that I have either tried or seriously considered in the past year.
- Programming languages: Java, Python, R, C#, F#, Scala, Clojure, Haskell.
- Data storage: HDF5, CSV, Binary, MySQL, PostgreSQL, MongoDB, MonetDB, Redis, HBase.
- Cloud server: Amazon EC2, Microsoft Azure, Google App Engine, Rackspace Cloud, plain old VPSs.
Programming language choices are important because they sit at the bottom of a technology stack. Most of the work that I do is built on them. For a long while, I settled on a combination of Java, Python, and R: I prototype small ideas in Python, implement production code in Java, and perform analysis in R. I discussed why to use the right tool for the right task a year ago.
By the end of my previous project, I found that the popular development triplet of Java, Python, and R is not ideal for a solo operation. Since I have more time on my hands now that QTD is trading for me, I am taking a break this summer to expand my knowledge and learn new technologies.
Some of the technologies that I am experimenting with include:
- an in-memory data store for instantaneous and persistent tick data
- parallel programming for concurrent processing with no locks
- mathematically intuitive algorithm implementations using higher-order functions
Don’t mind me as I help myself in this open buffet of technologies.
A failed experiment with spectral density estimation in R
Spectrum analysis (see wiki entry) is a technique for viewing time series data in the frequency domain to find hidden periodicities. As I continue my struggle with data mining in R, I thought I might try looking at a year’s worth of USDJPY 1-min exchange rate data in the frequency domain.
This is what I got.
Certainly not one of those nice looking graphs in a textbook. Notice that:
- there are no distinct frequencies
- bandwidth is ridiculously small
What this means is that the USDJPY data is practically white noise. And that there’s much work for me ahead before I can produce something useful in R.
The source code to produce the graph is printed below. The USDJPY data isn’t published as it’s not small and you can get it at various places for free on your own.
Update: Failed is a harsh word. I should call this a trivial experiment.
op <- options(digits.secs=3) # set option to show milliseconds
cat("loading CSV file to 'data'\n")
print(
  system.time(
    data <- read.csv(
      file="data/USDJPY_20090630-13-27to20101130-22-19_1Min.csv", header=TRUE, sep=",")
  )
)
# convert long (epoch millisecond) timestamps in the row names to POSIX
rownames(data) <- as.POSIXct(as.numeric(rownames(data))/1000, origin="1970-01-01", tz="GMT")
data <- data.matrix(data)
library(xts) # load xts library
cat("converting to xts 'data.x'\n")
print(system.time(data.x <- as.xts(data)))
rm(data) # remove data object
cat("data.x object size: ")
print(object.size(data.x), units = "auto")
cat("spectral analysis\n")
spectrum(data.x$close) # plot the estimated spectral density of the close prices
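For contrast, here is a small sketch of what a distinct frequency looks like when one is actually present. It uses synthetic data, not the USDJPY series: the sine period of 20 samples, the noise level, and the variable names are all assumptions chosen for illustration.

```r
set.seed(42)
n <- 1000
t <- seq_len(n)
# synthetic series: a sine wave with a period of 20 samples plus Gaussian noise
x <- sin(2 * pi * t / 20) + rnorm(n, sd = 0.5)
# raw periodogram without plotting; frequencies are in cycles per sample
sp <- spec.pgram(x, taper = 0, plot = FALSE)
# locate the frequency with the most power and convert it to a period
peak_freq <- sp$freq[which.max(sp$spec)]
peak_period <- 1 / peak_freq
cat("peak frequency:", peak_freq, "cycles/sample, period:", peak_period, "samples\n")
```

Calling spec.pgram(x, taper = 0) with plotting left on shows a single sharp spike at the sine frequency, which is the kind of feature missing from the USDJPY graph.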
Maximal frustration in printing and parsing milliseconds in R
I’ve just spent an absurd amount of time figuring out how to parse millisecond times in R. (It’s standard practice to timestamp tick data with millisecond precision in a flat data file.) It turns out that I faced two problems. The first is that there is an almost-hidden, footnote-level option in the strptime( ) function (for converting characters to POSIX) that describes using format = “%H:%M:%OS” instead of “%H:%M:%S” to parse fractional seconds. The second, and the one that actually wasted much of my time, is that R ignores milliseconds when printing by default!
For example, here’s an output for the POSIXct representation of the integer value 1286564400.
“2010-10-08 15:00:00 EDT”
So even though I figured out how to represent millisecond times in R a few hours ago, I never knew I had done it, simply because R did not print the output that I wanted on screen every time I tried!
After a couple of hours of searching, trying, and frustration, here’s the one-line command that solved my problem.
> options(digits.secs=3)
This changes the display options so that I can see the following output from the same input as above.
“2010-10-08 15:00:00.344 EDT”
I guess that’s why they say R has a steep learning curve! Yes, it is very powerful and flexible. But there are few standards, and the documentation is all over the place because R is a mishmash of user-contributed packages!
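Both steps can be put together in a few lines. This is a sketch only: the timestamp string and the GMT time zone are made up for the example. One caveat to keep in mind is that R truncates rather than rounds fractional seconds on output, so the last printed digit can come out one lower than the value in the file.

```r
# show up to 3 fractional digits when printing times; keep the old settings
op <- options(digits.secs = 3)

# %OS (instead of %S) tells strptime to read fractional seconds
ts <- as.POSIXct(strptime("2010-10-08 15:00:00.344",
                          format = "%Y-%m-%d %H:%M:%OS", tz = "GMT"))

# the fractional part survives parsing, whatever the print settings are
frac <- as.numeric(ts) %% 1

print(ts)                                   # printed with milliseconds
print(format(ts, "%Y-%m-%d %H:%M:%OS3"))    # explicit 3-digit format
print(frac)

options(op)                                 # restore the previous options
```

Using format(ts, "%OS3") asks for the milliseconds explicitly, which avoids depending on a global option at all.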
Data analysis with R: Using the right tool for the right task
I ported Engineering Return’s Trend Strength Index to JForex as my first exercise in writing custom indicators in JForex. Frank’s code in TradeSignal consists of 9 lines. My JForex code? 191. This is my first custom indicator in JForex. It will also be the last.
A compiled programming language such as Java is not the most convenient for data analysis. First, the compilation step takes away interactivity, which is an obstacle to fluent data exploration. Second, the language itself is more cumbersome because of its generality: Java is used to build all sorts of applications. In contrast, something like MATLAB is purpose-built to work with numbers.
I miss the days when I had my MATLAB license.
However, being forced away from MATLAB might not be a bad thing. As much as I’ve become accustomed to it, MATLAB isn’t without its flaws. But I won’t bash it in this post. What I actually want to talk about is R.
All the rage these days in statistical data analysis is R (see Wiki entry on R). I’ve done some research and testing, and R really is as good as they say it is (see past publications from the R/Finance conference). You will get to see some real examples of financial time series analysis in R later on this blog once I have some publishable results. This will no doubt take some time, as I am learning yet another new programming language (thus the lack of posts this week).
The plan from now on is to use R to play with the data. Then once the algorithm is finalized, I will either port it to Java (JForex) for deployment or embed the R environment in Java for live trading.