
Thursday, November 18, 2010

Google Refine lets you fix and handle huge, messy sets of data

This app from Google, called Google Refine, looks like it could help a lot in all kinds of research. The thing that interested me was the web site the presenter was sifting through. It is from a site called ProPublica and the article is called Dollars for Docs. This is what interests me...

"Drug companies have long kept secret details of the payments they make to doctors for promoting their drugs. But seven companies have begun posting names and compensation on the Web, some as the result of legal settlements. ProPublica compiled these disclosures, totaling $258 million, into a single database that allows patients to search for their doctor. Receiving payments isn’t necessarily wrong, but it does raise ethical issues.

By Dan Nguyen, Charles Ornstein, and Tracy Weber, ProPublica
Updated Oct. 19, 2010. Read more about the data »"

I'm definitely going to check this site out! But, back to Google Refine and how he used this great App to sift through huge amounts of Data...



Google Refine lets you fix and handle huge, messy sets of data

Google has just introduced a new product, and this time it's a PC application (with a browser-based UI). It's called Google Refine, and it solves a problem that is enormous for some people: it lets you take massive sets of "messy data" and massage them into shape so that they're uniform, make sense, and can be statistically analyzed.

The video after the jump shows a very good example, which is based on a CSV file exported from a publicly available data source (a government contract system, in this case). The data is very realistic – descriptions are inconsistent (Firm Fixed Price on some rows and FFP on other rows), and even the number formats are inconsistent (you get 0.78 on one row and a number in the millions on another row).

Google Refine lets you very easily hone in on those inconsistencies and fix them in a myriad of ways. This is an important data tool because those heaps of messy data are often public records, which are available but not transparent; being able to quickly analyze them could expose some very interesting patterns and anomalies in the way that public institutions and governments behave.
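
To make the "messy data" idea concrete, here is a small Python sketch of the kind of cleanup described above. It is not how Google Refine works internally, just an illustration. The column names (type, amount), the sample rows, and the CANONICAL mapping are all made up for this example.

```python
import csv
import io

# Hypothetical mapping: every spelling we want to collapse into one label.
CANONICAL = {
    "ffp": "Firm Fixed Price",
    "firm fixed price": "Firm Fixed Price",
}

def clean_description(text):
    """Collapse extra whitespace, ignore case, and map known variants to one label."""
    key = " ".join(text.split()).lower()
    return CANONICAL.get(key, text.strip())

def clean_amount(text):
    """Turn '1,250,000', '$42' or '0.78' into a float; return None if it is not a number."""
    stripped = text.replace(",", "").replace("$", "").strip()
    try:
        return float(stripped)
    except ValueError:
        return None

def clean_rows(csv_text):
    """Read CSV text and return a list of rows with uniform labels and numeric amounts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"type": clean_description(row["type"]), "amount": clean_amount(row["amount"])}
        for row in reader
    ]

# Made-up sample in the spirit of the video: mixed labels and mixed number formats.
SAMPLE = """type,amount
Firm Fixed Price,0.78
FFP,"1,250,000"
 ffp ,$42
"""

if __name__ == "__main__":
    for row in clean_rows(SAMPLE):
        print(row)
```

Refine does this sort of thing interactively and at a much larger scale, so a script like this only makes sense for small, one-off files.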

[Thanks, Yanksy, for the tip!]

Go there Read more Articles...

See Video - Google Refine 2.0 - Introduction (1 of 3)

Go there see Full Screen Video...

Go here to see all 3 Videos...

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

Download page for Linux, Windows and Mac, with instructions on how to run it on Linux...


Download, extract, then type ./refine to start.
Go there...
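
For anyone who would rather script "download, extract, then run ./refine" than click through a file manager, here is a minimal Python sketch. The function names are mine, and the paths in the usage comment are assumptions: the archive name is written the way it appears in this post, so check it against the file you actually downloaded, because Linux file names are case sensitive.

```python
import os
import stat
import subprocess
import tarfile

def extract_archive(archive_path, dest_dir):
    """Really unpack the .tar.gz into dest_dir (not just browse inside it)."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)

def make_executable(script_path):
    """Add the 'owner may execute' permission bit to a file, keeping the other bits."""
    mode = os.stat(script_path).st_mode
    os.chmod(script_path, mode | stat.S_IXUSR)

def start_refine(refine_dir):
    """Start the 'refine' launcher script from inside its own folder."""
    script = os.path.abspath(os.path.join(refine_dir, "refine"))
    make_executable(script)
    return subprocess.Popen([script], cwd=refine_dir)

# Example usage (assumed paths, adjust to your own system):
#   home = os.path.expanduser("~")
#   extract_archive(os.path.join(home, "Google-refine-2.0-r1836.tar.gz"), home)
#   start_refine(os.path.join(home, "Google-refine-2.0"))
```

Extracting into a folder your own user owns, such as your home folder, should avoid permission problems. Refine starts its own small web server, so it should not need to sit inside another web server's folder.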

OK, I downloaded the Linux version and couldn't get it to start up. I kept getting errors saying "file or directory not found" in my Terminal. So I finally went back to the Google site and found the help page. What I had done was just click on the Google-refine-2.0-r1836.tar.gz file in Krusader, which shows the files inside the archive. Then I copied the Google-refine-2.0 folder over to /var/www/html on my Fedora 13 system, since that is the default folder for web sites in Fedora 13. I already have Apache up and running on my system, so I thought this should work. But no luck.

I noticed that there was a file named refine.ini and I thought that was odd, because that is usually a Windows file extension. When I opened it, Wine wanted to install something for Gecko, and it did. After that the file opened in a text editing app and I read the contents, but that was no help. I could have done that straight from Krusader if it wasn't a .ini file, and I did, by right clicking and selecting another app, just to see.

Then I finally remembered a similar problem I had run into before when installing or running apps from .tar.gz files. Some of these archives have to actually be extracted by an archive handling app, not just opened and copied from. So I went back to Krusader, right clicked on the Google-refine-2.0-r1836.tar.gz file and selected Open with Archive Manager. Then I extracted it with Archive Manager and sure enough, there was a new file in the main folder named "refine", and it had the icon that is given to .sh files (but with no file extension).

But wait, I wasn't done yet! :O When I opened up a Terminal window there, from within Krusader, and tried to run it, I got a Permission Denied error. Since I already had a Krusader root mode window open from copying the Google-refine-2.0 folder to /var/www/html, I just right clicked on the Google-refine-2.0 folder, changed the owner to my user name, and made the user and group able to open and change all the files in the folder. Finally! We're up and running with Google-refine-2.0! :)

Now all I have to do is figure out how to use this very powerful app. First thing I learned: OpenOffice .ods files don't read right, but .csv files do. Oh, and you don't have to muck around in the Terminal every time you want to run Google-refine-2.0. I can just double click on the refine (.sh) file and it starts right up from within Krusader, or right click and select "Run".

One thing, though. Since it is a web server app that you access through your web browser, if you are like me and have trouble remembering Terminal commands, you will have to just leave Refine running until your next reboot, because you have to stop it from within the Terminal or kill it with an app like System Monitor. Caution: I used the Stop command in System Monitor to stop Refine and my whole system froze up, and I had to do a hard shutdown and reboot :( I usually use the Kill command and it works fine, but since Refine was not in an unresponsive state, I thought Stop would work better. Boy, was I wrong. It might have been a one time thing; I didn't try it again. It's not really a big deal to just leave Refine running in the background, though, unless you are low on system resources. Now all I have to do is figure out how to use this very powerful app to sift through mounds of data...
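
If remembering Terminal commands is the sticking point, a tiny Python helper can do the stopping. This is only a sketch, and the function names are my own. As far as I know, "Stop" in System Monitor pauses a process (the SIGSTOP signal) rather than shutting it down, while a polite shutdown request is SIGTERM and a forced kill is SIGKILL. You still need the process ID, which System Monitor shows in its ID column.

```python
import os
import signal

def ask_to_exit(pid):
    """Politely ask a process to shut down (SIGTERM). Try this first."""
    os.kill(pid, signal.SIGTERM)

def force_kill(pid):
    """Forcefully end a process (SIGKILL). Use only if the polite request is ignored."""
    os.kill(pid, signal.SIGKILL)

def is_running(pid):
    """Return True if a process with this ID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks that the process exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

# Example usage (1234 is a placeholder process ID, not a real one):
#   ask_to_exit(1234)
#   print(is_running(1234))
```

If Refine was started from a Terminal window, pressing Ctrl-C in that window should also shut it down.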


Google Refine lets you fix and handle huge, messy sets of data
Google-refine - Project Hosting on Google Code
YouTube - Google Refine 2.0 - Introduction (1 of 3)
Downloads - Google-refine - Project Hosting on Google Code
FAQ - Google-refine - Frequently Asked Questions - Project Hosting on Google Code
DocumentationForUsers - Google-refine - Documentation hub for users - Project Hosting on Google Code
InstallationInstructions - Google-refine - How to install and run Google Refine - Project Hosting on Google Code
