Michael Fudge Jr., Assistant Professor of Practice at the iSchool, recently chatted with our Research Computing group about one of his areas of interest: Big Data.
RC: Mike, how long have you been at SU?
MF: 15 years. This is my first semester as a Professor of Practice.
RC: Can you describe your area of research?
MF: Big Data Architectures and their impact. My goal is to help build out a Big Data architecture here on campus which can be used by a variety of researchers to perform data analytics and applied machine learning.
RC: Which of SU’s Research Computing technologies and resources do you currently use?
MF: I am running the HortonWorks Data Platform in the AVHE.
RC: OK, can we dig a little deeper? HDP sounds really cool but there’s a lot of marketing-speak on their web site. In fact, that site has me wondering what it doesn’t do :-) Can you give me a few examples of how it could be (or currently is) used here at Syracuse University?
MF: Hadoop is a platform for analyzing large data sets. It works by distributing the data over several physical nodes so that computation stays close to the data. It was built for scenarios where you need to process data within a specific window, but the stream of input makes it computationally impossible to do so on a single node. For example, imagine you want to analyze network traffic for potential exploits, but you have too much data to process with a traditional DBMS. Hadoop helps by spreading the logs transparently across several nodes; when you analyze the data, each node processes only its local portion, yielding the benefits of distributed processing.
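The pattern Mike describes can be sketched in miniature with plain Python: a map step runs independently on each node's local slice of the logs, and a reduce step merges the small per-node results. This is only an illustration of the idea, not Hadoop's actual API; the log lines and partitioning here are made up for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines, pre-split across three "nodes" the way HDFS
# would distribute blocks of a large log file. (Fabricated sample data.)
node_partitions = [
    ["10.0.0.1 GET /index", "10.0.0.2 GET /login"],
    ["10.0.0.1 GET /login", "10.0.0.3 POST /login"],
    ["10.0.0.2 GET /index", "10.0.0.1 GET /index"],
]

def map_phase(lines):
    """Runs locally on each node: count requests per source IP."""
    return Counter(line.split()[0] for line in lines)

def reduce_phase(counters):
    """Merge the per-node partial counts into a cluster-wide result."""
    return reduce(lambda a, b: a + b, counters, Counter())

# Each node processes only its own partition (the map phase)...
partial_counts = [map_phase(p) for p in node_partitions]
# ...then only the small partial results travel to the reduce phase.
totals = reduce_phase(partial_counts)
print(totals.most_common(1))  # the chattiest source IP
```

The key point the sketch shows: the raw log lines never leave their "node"; only the compact per-node counts are shipped and merged, which is why the approach scales where a single-machine DBMS cannot.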
Big data systems address more than just the volume of data, and Hadoop is designed to process a variety of structured and unstructured data, as well as handle real-time data streams. The whole ecosystem is maturing and includes machine learning, data analytics, and visualization tools which are now more approachable for non-programmers.
RC: Well this all sounds good. So what’s the catch? If I am a researcher or support researchers in my area, we can actually use this platform now? Can you map out what my role and your role would be in this collaboration? Do I need to pay for your time or the computing infrastructure?
MF: Right now, I have a small six-node cluster called the “sandbox”. This setup is small scale and is designed so that we can learn what it takes to operate and support a data environment. My goal is to build something that can leverage orange crush to scale on demand in a systematic way.
The end product will be approachable by all kinds of researchers, so good, task-oriented documentation is a must.
I have a few iSchool faculty and student researchers using the cluster; it has been a collaborative effort. Together we figure out the best ways to use Hadoop for their research. As the only technical contact, when something goes awry I’ve got to dig in and fix it. It’s a learning process. I’m always looking for more volunteers to help push the project forward. I need people to use the environment, load in their data sets, explore the features, ask questions, and make suggestions. I also need volunteers interested in administration (Linux skills are a must) and in writing documentation.