DSC 360 – Big Data Analytics
Instructor: Adam Hartley
Email: hartlead@mountunion.edu
Office: KHIC 041
Office Hours: 2:00-3:00 MWF; 12:30-1:30 TR; by appointment; or whenever my door is open!
Textbook: Big Data Analysis with Python by I. Marin, A. Shukla, and S. VK
Course Motivation & Description
“Every company has big data in its future and every company will eventually be in the data business.”
— Thomas H. Davenport, co-founder, International Institute for Analytics
Training in data science generally begins with small data sets that can easily be read and manipulated by personal computers. This is a very valuable skill; many data analysts who work with small or mid-sized companies work exclusively with small data or relatively small data sets (e.g., those that might be stored on a central company server and queried with SQL). However, many companies – especially larger companies in healthcare, IT, marketing, finance, etc. – rely on the analysis of truly massive data sets (e.g., measured in terabytes). Doing so requires specialized tools; this is the focus of this course. We shall see that many of the tools and methods familiar from introductory courses in data science are relevant here, too, and we shall begin the course with a refresher on some fundamentals. From there we will investigate big data frameworks, with an emphasis on Hadoop and Spark. As usual, we will emphasize both the fundamental theories/principles of the techniques we will explore and practical use cases with real data.
Specific learning objectives include:
- Combine elements of algorithm design, statistics, optimization, and computer science to make decisions with data.
- Work with computing tasks distributed over a cluster, especially with Hadoop and Spark.
- Convert data from various sources into storage or querying formats.
- Prepare big data for statistical analysis, visualization, and machine learning.
- Effectively utilize the Ohio Supercomputer Center interface.
- Articulate the workflow and outcomes of big data analysis for technical and non-technical audiences.
Course Outline
A rough outline of the pace of the course and topics covered is below. Each activity is due by the start of class one week after it is introduced.
Week # | Date | Monday | Date | Wednesday | Date | Friday |
---|---|---|---|---|---|---|
1 | 8-26 | Intro | 8-28 | Libraries | 8-30 | Dataframes |
2 | 9-2 | Labor Day (No Class) | 9-4 | Data Type Conversion; Aggregation and Grouping | 9-6 | Exporting Data / Visualization |
3 | 9-9 | Types of Graph; Components of Graphs | 9-11 | Seaborn; Types of Graphs | 9-13 | Dataframes; Grouped Data |
4 | 9-16 | Modifying Graphs; Exporting Data | 9-18 | Hadoop | 9-20 | Spark |
5 | 9-23 | Parquet | 9-25 | Handling Unstructured Data | 9-27 | Spark Dataframes |
6 | 9-30 | Exploring Spark Dataframes | 10-2 | Data Manipulation with Spark Dataframes | 10-4 | Graphs in Spark |
7 | 10-7 | Missing Values | 10-9 | Handling Missing Values with Spark Dataframes | 10-11 | Correlation |
8 | 10-14 | Review | 10-16 | Midterm Exam | 10-18 | Fall Break (No Class) |
9 | 10-21 | Defining Business Problems | 10-23 | Translating Business Problems | 10-25 | Standard Approach Data Science Life Cycle |
10 | 10-28 | Reproducibility in Jupyter Notebooks | 10-30 | Gathering Data Reproducibly | 11-1 | Code Practices and Standards |
11 | 11-4 | Avoiding Repetition | 11-6 | Reading Data in Spark Different Data Sources | 11-8 | SQL Operations on Spark Dataframes |
12 | 11-11 | Generating Statistical Relationships | 11-13 | In-class project work | 11-15 | In-class project work |
13 | 11-18 | In-class project work | 11-20 | In-class project work | 11-22 | In-class project work |
14 | 11-25 | In-class project work | 11-27 | Thanksgiving (No Class) | 11-29 | Thanksgiving (No Class) |
15 | 12-2 | Presentations | 12-4 | Presentations | 12-6 | Presentations |
Grading
Grade Range | Letter Grade |
---|---|
100%-94% | A |
93.9%-90% | A- |
89.9%-87% | B+ |
86.9%-84% | B |
83.9%-80% | B- |
79.9%-77% | C+ |
76.9%-74% | C |
73.9%-70% | C- |
69.9%-67% | D+ |
66.9%-64% | D |
63.9%-60% | D- |
<60% | F |
The assignments for the course are divided into a few categories:
Activities: worth 20%
- Activities will be introduced in class and some class time will be allocated to work on the activities. Each activity is due one week after its introduction and may be submitted anytime prior.
Exercises: worth 20%
- Exercises will be assigned for out-of-class work and will generally be due one week after they are assigned.
Final Project: worth 30%
The Final Project is your opportunity to apply what you have learned through the semester. You will prepare your report in a Notebook environment (Jupyter or R) so that you can move seamlessly between text, code, and results. The length should correspond to roughly a 5-8 page single-spaced paper. The paper should be directed to a general audience, present a research question (what data set will you analyze, and what question will you answer), describe the methods you will use (at least 4 different techniques drawn from the text must be used), discuss the results of the methods applied to your data, and finally draw a conclusion that synthesizes your results and ultimately answers the research question.
You must (informally) discuss the data, research objective, and methods with your instructor before engaging in this work.
Include at least 4 professional-quality images (generated through Python or R) to visually summarize your work.
Include citations, as appropriate.
Submit your raw data and well-commented code for review. The instructor must be able to “plug and play” to execute the code and generate your images and other results.
Each student may work alone or in a group of two.
At the end of the semester, each group will have some time to present their project to the class.
More information about the final project will be provided as we get underway.
In-class exams: 2, worth 30% combined
- There will be two in-class exams, which will be very similar to previously assigned activities/exercises under a time limit.
Assignments may occasionally include optional extensions that will be graded for extra credit. These will generally require additional discussion with the instructor. Come to office hours or schedule an individual time!
In general, no other extra credit work will be assigned.
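As a rough illustration of how the category weights and the letter-grade table above fit together, here is a minimal Python sketch. The names (`WEIGHTS`, `CUTOFFS`, `course_percentage`, `letter_grade`) and the example scores are hypothetical. It assumes that each range's lower bound acts as the cutoff for its letter and that extra credit is ignored; the instructor's gradebook remains the authority on actual grades.

```python
# Category weights taken from the Grading section (they sum to 100%).
WEIGHTS = {
    "activities": 0.20,
    "exercises": 0.20,
    "final_project": 0.30,
    "exams": 0.30,
}

# Lower bound of each range in the grade table, highest first.
# Assumption: a percentage at or above a lower bound earns that letter.
CUTOFFS = [
    (94, "A"), (90, "A-"), (87, "B+"), (84, "B"), (80, "B-"),
    (77, "C+"), (74, "C"), (70, "C-"), (67, "D+"), (64, "D"), (60, "D-"),
]


def course_percentage(scores):
    """Weighted average of category percentages (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(percentage):
    """Map a course percentage to a letter using the lower-bound cutoffs."""
    for cutoff, letter in CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"


# Hypothetical category averages, for illustration only.
example = {"activities": 95, "exercises": 88, "final_project": 90, "exams": 82}
total = course_percentage(example)
print(f"{total:.1f}% -> {letter_grade(total)}")  # prints "88.2% -> B+"
```

In this example the weighted total is 0.20(95) + 0.20(88) + 0.30(90) + 0.30(82) = 88.2%, which falls in the 89.9%-87% range.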
Collaboration Policy
The field of science is almost entirely collaborative. If students wish to collaborate on solving exercises or activities outside of class, this is allowed and encouraged under the condition that you explicitly note with whom you collaborated. Each student must turn in their own copy of the work, each copy listing the collaborators. Please limit group work to two or three students. Each student is individually responsible for the course information. Collaboration is not allowed during exams.
Late Policy
For each 24 hours that an activity or exercise is past due, it is worth one fewer point. After 72 hours, the assignment is worth zero points. Extra credit work, if applicable, isn’t accepted past due. The exams and final project will generally not be accepted late. In the case of an emergency or extended complications that would prevent exam attendance or participation in the class, contact the instructor as early as possible so that other arrangements may be made.
Communication Policy
I will strive to be available and accessible to every student in each of my classes. To facilitate better understanding of when and how I will be available, I will lay out a few expectations. In general, I will not check or respond to emails before 8:00 A.M. or after 5:00 P.M. on weekdays. Otherwise, I will try to reply to any emails or Teams messages within 24 hours. Please don’t hesitate to send a follow-up message if I haven’t responded to you during this window; there’s always a chance your email was missed, however unlikely. In general, there will be no message regarding the class that is so important that it can’t wait until the morning. We’ll work it out.
Technology Requirements
College coursework requires students to be responsible with reading and assignments, checking email and D2L frequently, and staying in regular communication with instructors. Technology access will be important for success. To participate in learning activities and complete assignments, you will need:
- Access to a working computer that has a current operating system with updates installed
- Reliable Internet access and a Mount Union email account
- A current Internet browser that is compatible with D2L
Please contact the IT Help Desk at 330.829.8726 or helpdesk@mountunion.edu if you need assistance with obtaining or using a device, any necessary software, or internet access at any time during this semester.
Please bring your laptop to class; we will make extensive use of in-class time for lab work.
Artificial Intelligence Policy
Artificial intelligence is a rapidly evolving field, and there are now multiple programs (e.g., ChatGPT and Bard) that can interact with users via “natural” conversations and rapidly generate output including art, essays, and computer code. Programs such as these will continue to evolve and be utilized in professional settings, and you can and should become familiar with them in the course of your undergraduate studies. At the same time, in-demand employees are those who have the skills to work both with and without AI (to avoid using an AI when doing so would expose proprietary information to its creator, to debug AI-generated code when it isn’t quite right, to modify AI output for subtly different use cases, to perform tasks independently of AI when appropriate…), along with clarity of thought and the ability to communicate effectively and effortlessly. Your college education is a time to develop these abilities, and using AI as a crutch can hinder that process. So, in this course, you should not use AI in any of your work. Doing so without the express permission of the instructor will be considered a breach of academic honesty. Any submitted work may be subject to an oral examination.
Accessibility
The University of Mount Union values disability as an important aspect of diversity and is committed to providing equitable access to learning opportunities for all students. Student Accessibility Services (SAS) is the campus office that collaborates with students who have disabilities to provide and/or arrange reasonable accommodations based upon appropriate documentation, nature of the request, and feasibility. If you have, or think you have, a temporary or permanent disability and/or medical diagnosis in any area, such as physical or mental health, attention, learning, chronic health, or sensory, please contact SAS. The SAS office will confidentially discuss your needs, review your documentation, and determine your eligibility for reasonable accommodations. Accommodations are not retroactive, and the instructor is under no obligation to provide accommodations if a student does not request accommodation or provide documentation. Students should contact SAS to request accommodations and should discuss their accommodations with their instructor as early as possible in the semester. You may contact the SAS office by phone at (330) 823-7372 or via e-mail at studentaccessibility@mountunion.edu.
Additional University Policies
See https://www.mountunion.edu/syllabus for policies and information that are universally applicable to all courses at the University of Mount Union.