MSBA BAN 6020 Data Management

MSBA BAN 6020

Model.
Scrape.
Query.
RDBS.
HPC.
Communicate.
Understand.
MySQL.

Instructor

Carolina Salge MW, 3F Farrell Hall A48 9:30 - 10:45am 11:00 - 12:15pm By Appointment

Final Exam

Dec 9, 2 - 5pm

Assignments

See schedule for the due date. The due time is 11:55pm on the due date. Submit all assignments on Sakai. For work completed in R and RStudio, submit two files: rmarkdown and html. For work completed in MySQL, submit either a png of the data model or a pdf of SQL code including results.

Individual


id topic description points
A1 linear model use the Amazon data to build a linear model of your choice in R 8
A2 web scraping pick a website or API and scrape data using R 8
A3 rmarkdown integrate your work from the Amazon case into one Rmd report 8
A4 data modeling sketch a data model for the Amazon case study 8
A5 sql write 8 queries to answer questions from the Amazon database 8

Extra credit opportunity: Use PUTTY to access the DEAC cluster (either pegasus or gemini) and successfully submit a job with R (by Dec 9 at 11:55pm) for 1 extra credit point added to your final grade. Once you receive the completion e-mail, check your output to make sure the output is not unexpected. Upload three files to Sakai: your R script, the output file, and the completion e-mail from HPC as a pdf. Note. You can use the tweets_2014-01-23_02-11_156285.csv data but you are not allowed to use the example R script in the “HPC on the DEAC Cluster” video.

Group

A state-of-the-art presentation is required from each group on a R package not covered in class, with a particular concentration on data science.

  • You will give a 10-15 minute class presentation to your fellow class members
  • Points will be deducted for exceeding 15 minutes
  • Focus on the applications of the package and the opportunities it provides
  • E-mail the instructor your code and slides 2 hours before your presentation
Some suggested packages follow (already with team bids), and you can propose others by contacting the instructor. You should submit your bid for a package via e-mail. When submitting a bid, also specify your team's name/number. Those who bid early present early.
  • neuralnet
  • sparklyr (Teams 8 & 17)
  • janitor (Teams 10 & 15)
  • mice (Team 16)
  • arules (Team 9)
  • shiny (Teams 1 & 14)
  • SentimentAnalysis (Teams 3 & 19)
  • randomForest (Teams 6 & 12)
  • caret (Teams 7 & 18)
  • tensorflow (Team 13)
  • CORElearn
  • esquisse (Team 2)
  • mlr (Team 20)
  • quanteda (Teams 4 & 11)
  • Rcrawler
A database project is also required from each group. You will design a data model to support a grocery list. Review existing grocery list apps and exercise your creativity to determine the data model's entities. Convert the data model to a relational database, populate the tables with data, and write 10 queries to access the database. The queries should demonstrate your breadth of understanding of SQL (i.e., 10 simple queries will not score as well as say 4 simple queries and 6 nontrivial queries). Data modeling is difficult and to facilitate learning a preliminary model must be submitted on the date specified in the schedule via e-mail.


When submitting the database project, you should provide the following:

  1. Your team name and the names of its members
  2. A one paragraph description of the database
  3. The data model in png format (File > Export > Export as PNG…)
  4. 10 queries
    • A natural language description of the query
    • The SQL and the results (Query -> Execute (All or Selection) to Text)
    • Copy and paste the results into Word so that everything is in one document
  5. The 10 queries should cover the following SQL features:
    • multiple table join
    • subquery
    • correlated subquery
    • GROUP BY and GROUP BY with HAVING
    • ORDER BY
    • divide
    • IN or NOT IN
    • A built-in function (e.g., AVG) or a calculated field
    • REGEXP
    • EXISTS or NOT EXISTS, other than in divide
You may combine some of the preceding list of features into a single query. Indicate in your report which features are covered in a query. A sample matrix is shown below:
query 1 query 2 ...
multiple table join X
subquery X X
... X

  • Submit a PDF report to the assignment dropbox on Sakai as the attachment.