
The R Learning Lasso



25th November 2009, 10:39 AM   #1
News Bot




I got an email the last week in January from the R help list announcing the release of the newest version of glmnet, an R package that fits lasso and elastic net regularization paths for squared error, binomial and multinomial models via coordinate descent. Don’t be ashamed if you find that description a bit abstruse: you’re not alone! Suffice it to say that glmnet is a state-of-the-art modeling package that efficiently predicts both interval and categorical dependent variables.
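
For readers who want to see what "lasso and elastic net via coordinate descent" means in practice, here is a minimal sketch in plain Python. It is not glmnet's code. It assumes the predictors and response are already centered (so there is no intercept), it handles only the squared-error case, and the function names and toy data are made up for illustration. Each pass updates one coefficient at a time by soft-thresholding its correlation with the partial residual.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within gamma of zero become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha=1.0, n_iter=200):
    """Coordinate descent for squared-error loss with an elastic net penalty.

    Minimizes (1/2n)*||y - X b||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2).
    alpha=1.0 is the lasso. Assumes centered data and no all-zero columns.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, with beta starting at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            # Correlation of column j with the residual that excludes column j.
            rho = sum(col[i] * (residual[i] + col[i] * beta[j]) for i in range(n)) / n
            denom = sum(v * v for v in col) / n + lam * (1.0 - alpha)
            new_bj = soft_threshold(rho, lam * alpha) / denom
            delta = new_bj - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_bj
    return beta


# Toy data (hypothetical): y depends only on the first column.
X_demo = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y_demo = [2.0, -2.0, 2.0, -2.0]
print(elastic_net_cd(X_demo, y_demo, lam=0.5))  # [1.5, 0.0]
```

On this toy data the unpenalized coefficient would be 2; the lasso penalty shrinks it to 1.5 and leaves the irrelevant second coefficient at exactly zero. A regularization path is just this fit repeated over a decreasing sequence of `lam` values.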

The package’s creator is Trevor Hastie, co-author with Jerome Friedman and Rob Tibshirani of the accompanying arcane-sounding paper, Regularization Paths for Generalized Linear Models via Coordinate Descent, published last summer. Hastie, Friedman and Tibshirani are also eminent professors of statistics at Stanford University, the top-rated statistics department in the country. Last fall, I attended a statistical learning seminar with Hastie and Tibshirani where similar models were presented at a dizzying pace.

So the R user community had just been given access to the latest learning algorithm, hot off the development presses from three world-renowned practitioners, for free. And glmnet is readily available from the internet and installs painlessly on existing R platforms. No commercial stats package that I know of, and certainly not the market leader, is even close to releasing a competitive offering. I’d say that’s a pretty good deal for stats types like me, and a benefit of working with a fertile, worldwide open source initiative like R.

After installing glmnet on my PC, I tested it against a 1988 Current Population Survey (CPS) data set that consists of 25,631 cases. My objective was to predict the log of weekly wages from experience and education, both measured in years. I first divided the base data set into two subsets: a training set with two-thirds of the cases, randomly selected, and a test set with the remaining records. I then developed two separate models with the training data: one a straight linear model with an interaction term, the other using cubic spline mappings of experience and education. Once the model parameters were estimated from the training data set, I evaluated and graphed the results using the separate test data.
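
The split and the two feature mappings described above might look roughly like the following. This is a sketch in Python rather than R, and the helper names, the seed, and the truncated-power spline basis with hand-picked knots are assumptions for illustration only; the post does not say which spline basis or knots were used.

```python
import random


def train_test_split(rows, train_fraction=2 / 3, seed=0):
    """Randomly assign about train_fraction of the rows to training, the rest to test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def linear_features(experience, education):
    """Main effects plus the experience-by-education interaction term."""
    return [experience, education, experience * education]


def cubic_spline_features(x, knots):
    """Truncated-power cubic spline basis (one common choice, assumed here)."""
    return [x, x ** 2, x ** 3] + [max(x - k, 0.0) ** 3 for k in knots]


def mean_squared_error(actual, predicted):
    """Average squared gap between actual and predicted values on held-out data."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Row indices stand in for the 25,631 CPS cases; the data itself is not shown here.
train_rows, test_rows = train_test_split(range(25631))
print(len(train_rows), len(test_rows))  # 17087 8544
```

Fitting on `train_rows` only and scoring with `mean_squared_error` on `test_rows` is what keeps the comparison between the two models honest: neither model gets to see the cases it is judged on.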

The plot on the left shows the flat plane generated by the linear glmnet model; the one on the right depicts the curved surface from the cubic spline mapping. The linear model seems naïve in contrast to the cubic spline alternative, which provides a much closer fit between actual and predicted wages. Indeed, preliminary exploration of the training data set confirmed the curvilinear nature of the relationships between education, experience and wages, with wages actually declining at the high end of experience. The linear model incorrectly depicts uniformly increasing wages across the ranges of both education and experience.
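
To make the mis-specification concrete, here is a small Python illustration using synthetic numbers, not the CPS data. The hump-shaped wage profile and its coefficients are invented for the example. A least-squares line fitted to such a profile keeps rising with experience even where the underlying values fall, and its residuals show the tell-tale pattern: negative at both ends, positive in the middle.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic, hypothetical profile: rises, peaks near 27 years, then declines.
experience = list(range(0, 41))
log_wage = [0.08 * x - 0.0015 * x * x for x in experience]

intercept, slope = fit_line(experience, log_wage)
residuals = [y - (intercept + slope * x) for x, y in zip(experience, log_wage)]

print("slope of fitted line:", round(slope, 4))
print("residual at 0, 20, 40 years:",
      [round(residuals[i], 2) for i in (0, 20, 40)])
```

The fitted slope is positive, so the line predicts ever-higher wages, while the synthetic profile itself is lower at 40 years than at 27. A systematic residual pattern like this is the usual sign that a curved term or a spline is needed.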

The relationship on the left is thus mis-specified and produces predictions out of sync with actual outcomes. A naïve linear specification like this is, unfortunately, more the rule than the exception for BI analysts who build their models in Excel or other standard BI tools. Prudent analysts will turn to the sophisticated packages available on platforms like R for predictions that closely reflect the subtlety of their data.

For those interested, Hastie and Tibshirani are offering a new two-day seminar, Statistical Learning and Data Mining III (http://www-stat.stanford.edu/~hastie/sldm.html), March 16-17, 2009 in Palo Alto, CA.

More from the Stats Man’s Corner Blog...