Introduction to Data Science

Syllabus

Syllabus For Course

Tools

Download Base R for your Operating System
RStudio - A Slightly Friendlier Way to Interact with R
LightSIDE - Machine Learning Made Easier
Gephi for Visualization

Setting up Python

python
download python from python.org

ipython
http://ipython.org/ipython-doc/stable/install/install.html
easy_install ipython[zmq,qtconsole,notebook,test]

#r library connector
easy_install rpy2

Python Package Installer
sudo easy_install pip

sudo pip install MySQL-python
http://mysql-python.blogspot.com/2012/11/is-mysqldb-hard-to-install.html

python
BIGMAC:~ sgoggins$ sudo python setup.py build
BIGMAC:~ sgoggins$ sudo python setup.py install

BIGMAC:~ sgoggins$ sudo pip install requests requests-oauthlib mysql-connector-python

Starting IPython

ipython notebook --notebook-dir=/Volumes/sean-drive/new_dropbox/Dropbox/09.\ \ Teaching/Intro\ to\ Data\ Science/info480/week1
** Set notebook-dir to the directory where you downloaded or set up your project.

Week One

Week 1: Introduction to Data Science
General introduction and explanation of syllabus. Example data science activity using GitHub data. Demonstration of the data collection and management, analysis and visualization lifecycle. Discussion of the privacy implications of analyzing public data. The social value of analyzing “big data”: How we learn about social and organizational performance and structure using “big data”.
Readings Due:
1. Mascaro, C., Magee, R., and Goggins, S. 2012. Not Just a Wink and a Smile: An Analysis of User-Defined Success in Online Dating. iConference 2012.
2. Howison, J., Wiggins, A., and Crowston, K. 2012. Validity Issues in the Use of Social Network Analysis with Digital Trace Data. Journal of the Association of Information Systems. 12(2), 2.
3. Goggins, S. P., Mascaro, C., and Valetto, G. 2013. Group Informatics: A Methodological Approach and Ontology for Understanding Socio-Technical Groups. JASIS&T. Accepted.

GitHub Repository: CLICK ME!

Missing an R Library?

install.packages("package-name");

Python MySQL Library Installation Issues

Installing mysql-python in Mac OS X 10.6.7 Snow Leopard

Installing mysql-python in my Mac OS X 10.6.7 Snow Leopard was a painful experience. After some hours of trial and error and googling I got it working. Here are the steps:
1. Make sure you have gcc installed. I installed Xcode 3.2.6, Apple’s development environment, which includes gcc. I had lost my DVD, so I had to download it from Apple’s website; it’s around 4.4 GB. During installation it might ask you to close iTunes, and might keep asking even if iTunes is closed. In that case you will have to quit iTunesHelper to continue: open Activity Monitor (located in Applications/Utilities) and quit it from the list of processes.
2. Download the mysql-python tar.gz file from here. Untar it and then build and install it. To do this (assuming that you have uncompressed the tar.gz file and pointed your terminal to the uncompressed directory):

$ sudo su
# python setup.py build
# python setup.py install

3. Finally, add your mysql lib directory in DYLD_LIBRARY_PATH. You can add the following line in your .bash_profile file modifying your mysql directory location.
export DYLD_LIBRARY_PATH='/usr/local/mysql-5.5.25-osx10.6-x86_64/lib/'

4. Now, in your Python interactive console, try to import MySQLdb. If there is no error, you are done! Otherwise, please google the error message. These steps worked for me.
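
The import check in the last step can also be scripted. This is only a minimal sketch: the helper name can_import is made up for illustration, and it assumes that a failed install shows up as an ImportError when the module is loaded.

```python
import importlib


def can_import(module_name):
    """Return True if module_name imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # MySQLdb is the module name provided by the mysql-python package.
    if can_import("MySQLdb"):
        print("MySQLdb is installed and importable.")
    else:
        print("MySQLdb could not be imported; recheck the build and DYLD_LIBRARY_PATH.")
```

The same helper works for any other library from the setup list, for example can_import("requests").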

R and "Windows" Tips

Running Scripts using "ls" in Sean's Sample Scripts

As we learned in class, this does not work! :( However, it's a common problem, and there is a simple workaround. Go ahead and re-download the scripts from my GitHub Repository as a "Zip File" here:

https://github.com/The-Art-of-Big-Social-Data/info480

Then go to the "GitHub Networks" folder and open the file "github-v3b-WINDOWS.R"

The difference is one line, where directories are listed:
Windows version of file directory listing
gitLists = shell("dir /B project*edgelist.csv", intern=TRUE)

Mac version of file directory listing
gitLists = system("ls project*edgelist.csv", intern=T)

Note - I replaced the arrow assignment operator (a less-than sign followed by a dash) with "=" because the arrow messes with HTML. "=" serves the same function in R for top-level assignments like these.
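
For those working in Python instead, the same directory listing can be written once for both Windows and Mac, because the glob module does the pattern matching without calling "dir" or "ls". This is a sketch, not part of the course scripts; the function name list_edgelists is made up for illustration, and the pattern is the one used in the R lines above.

```python
import glob
import os


def list_edgelists(directory="."):
    """Return the sorted file names matching project*edgelist.csv in directory."""
    pattern = os.path.join(directory, "project*edgelist.csv")
    # glob returns full paths; keep only the file names, as "ls" and "dir /B" do.
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


if __name__ == "__main__":
    for name in list_edgelists():
        print(name)
```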

Installing Libraries

Found this information HERE

4.2 I don't have permission to write to the R-2.15.3\library directory.
You can install packages anywhere and use the environment variable R_LIBS (see How do I set environment variables?) to point to the library location(s).

Suppose your packages are installed in p:\myRlib. Then you can EITHER

set the environment variable R_LIBS to p:/myRlib before starting R

OR use a package by, e.g.

library(mypkg, lib.loc="p:/myRlib")

You can also have a personal library, which defaults to the directory R\win-library\x.y of your home directory for versions x.y.z of R. This location can be changed by setting the environment variable R_LIBS_USER, and can be found from inside R by running Sys.getenv("R_LIBS_USER"). This will only be used if it exists so you may need to create it: you can use

dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE)

to do so. If you use install.packages and do not have permission to write to the main or site library, it should offer to create a personal library for you and install the packages there. This will also happen if update.packages offers to update packages for you in a library where you do not have write permission.

There can be additional security issues under Windows Vista and later: See Does R run under Windows Vista?. In particular, the detection that a standard user has suitable permissions appears to be unreliable under Vista, so we recommend that you do create a personal directory yourself.

Week Two

Week 2: Data Preparation

Preparing Data: Finding raw, public or organizational data and defining what questions can be answered using the organization’s data. I will provide you with sets of data and the specific analyses desired. You will need to analyze the data and decide how it needs to be “reshaped” in order to perform the desired analysis. We will then discuss these analyses on GitHub; I will provide similar examples and ask you to develop and prepare the data further. You can use any software tool you are familiar with, but I will be providing examples and help with R (the original syllabus said Python, but we didn't get to the Python lecture in week one :) ). If you choose your own tools, you are “on your own” to some extent ☺, but tools are covered more fully in week 3. I do encourage you to go through some Python tutorials during weeks 1 & 2: http://docs.python.org/2/tutorial/ (we will be using Python 2.7.3 in this course; Python 3 is another discussion).
Readings Due:
1. Part One of “Data Analysis with Open Source Tools”.
2. Chapters 1-3 of “The Anarchist in the Library: How the Clash Between Freedom and Control is Hacking the Real World and Crashing the System”.

Assignment

This is first and foremost an analysis assignment, focused on familiarizing yourself with what R can help you with. A full, working sample is provided on GitHub. You can click this link to download the Full Zip File. Then you will have access to the data under the “Week2” directory.
1. Set your working directory to “Week2”
2. Run “Complete.R”. Examine the comments and the resulting files to familiarize yourself with the data.
Analysis Questions: Write up a short essay, with tables or graphs if needed, describing how you would:

Build a network from the mention connections in this sample data, using the scripts from week 1. How about from the Reply-To connections? What transformations are required? How would you filter the data? Use the actual data to ground your thinking. Feel free to write or modify the R code samples from the first two weeks to experiment. Some of you will be more comfortable doing this; some will be more comfortable addressing the question conceptually. This is OK.
Submit any issues you encounter to GitHub under this repository. You can also use the Blackboard Discussion area; I will check both. One advantage of GitHub issues for things related to the repository is that the context of your question is clearly preserved, whereas it disappears after the class if you post it to Blackboard. THIS IS A DIRECT LINK TO GITHUB ISSUES FOR OUR REPOSITORY
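
If you want to experiment with the mention-network question in Python rather than R, the core transformation is turning messages into a weighted edge list. The sketch below is only a starting point, not the course solution. It assumes each message is an (author, text) pair and that a mention looks like "@name"; the actual Week2 columns and mention format may differ. The sample messages are made up for illustration.

```python
import re
from collections import Counter

# Assumed mention format: "@" followed by letters, digits, or underscores.
MENTION = re.compile(r"@(\w+)")


def mention_edges(messages):
    """Count (author, mentioned_user) pairs across (author, text) messages."""
    edges = Counter()
    for author, text in messages:
        for target in MENTION.findall(text):
            if target != author:  # filter step: drop self-mentions
                edges[(author, target)] += 1
    return edges


# Made-up messages for illustration only; this is not the course data.
sample = [
    ("ann", "thanks @bob and @cara"),
    ("bob", "@ann sure thing"),
    ("ann", "@bob one more question"),
]

# Print a source,target,weight edge list.
for (source, target), weight in sorted(mention_edges(sample).items()):
    print("%s,%s,%d" % (source, target, weight))
```

Reply-To connections would follow the same shape, with the target taken from a reply-to field instead of being parsed out of the message text.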

I will open a discussion board under our Blackboard Shell regarding the three papers you were assigned to read last week. I expect you to answer the questions and respond to your classmates. Your participation does not need to be long, just thoughtful. Here is the setup for the question that's under the "Discussion" Link in Blackboard Learn:

During week one you were assigned three readings. They are available here: http://seangoggins.net/DS-WeekOne

One of the readings focused on measurements of "success" in online dating. This paper is interesting, in a way, because instead of quantifying measures of success from trace data, it looks at what people say about their experiences, and describes how different online sites have different ideas about what "success" is. The other two papers are longer, and complement each other. Howison et al. focus on the limitations of using electronic trace data to draw conclusions about social groups. Goggins et al. focus the discussion on *how* to ground analysis of trace data (logs) in social science, so that we are answering SOCIAL questions instead of merely TECHNICAL questions about participation.

Reflecting on your own use of social media, online forums, email and other electronic communications, discuss where you think Goggins and Howison make a point you can relate to. Then, reflect on one point about your participation where the authors perhaps "miss" a key point, idea or behavior about how you act online.

create a thread titled with your firstname+pointIRelateTo

create a thread titled with your firstname+pointMissed

respond to at least two of your classmates' threads.

Week Three

Week 3: Data Preparation Tools

Preparing data using open source tools. Both traditional and more advanced statistical data analysis tools will be used.
Readings and Assignments Due:
1. Prepared Data #1 (Transform an instructor supplied data set into an analyzable form using open source tools)
2. Sawyer, S. and Crowston, K. 2004. Information systems in organizations and society: Speculating on the next 25 years of research. Information systems research. 35-52.
3. Chapter 4 of “The Anarchist in the Library: How the Clash Between Freedom and Control is Hacking the Real World and Crashing the System”.

Classroom Activities
New Python Tools with easier path to library installation

Week Four

Week 4: Sharing Data Preparation Tools

Students will become familiar with the use of open source software configuration management tools for sharing software developed to prepare open data. The specific environments used may change as open source software evolves. GitHub, Google Code and SourceForge are prominent SCM exemplars.
Readings and Assignments Due:
1. Software Sharing #1 (Share scripts produced in week 3 using an open source software configuration management tool).
2. Read chapters 1-3 of “Pro Git” http://git-scm.com/book (Or other, repository specific orientation as selected by the instructor and course coordinator)
3. Part Two of “Data Analysis with Open Source Tools”.

Week Five

Week 5: Data Presentation Tools

Presentation involves sharing data with other people in a way that is visually insightful. Students will be asked to bring an example of a visualization of data from a website or news organization, and make a short presentation about what makes the visualization insightful.
Readings and Assignments Due:
1. Software Sharing #1 (Share scripts produced in week 3 using an open source software configuration management tool).
2. Part Three of “Data Analysis with Open Source Tools”.

To ReFork a Repository after a Pull Request

1. Switch your command line to the working directory for your fork of my repository
2. Issue this command: git pull git://github.com/The-Art-of-Big-Social-Data/info480.git master

Slides for week five

Week Six