Sunday, June 22, 2014

Why Not Erlang? The Lack of Onramps

Garrett Smith gave a talk titled "Why the Cool Kids Don't Use Erlang" at the 2014 Erlang User Conference.  From free-form surveys, Garrett identified several key factors limiting Erlang adoption versus more popular alternatives such as Go, Clojure, and Scala:


  • Lack of developers
  • Considered hard to learn
  • Concern over current level of adoption (risk, cost, reliance on scarce resources)
  • Convincing managers is hard
  • Dissatisfaction with libraries, documentation, and tooling -- often cited as barriers to adoption


In light of Garrett's informative work, I wanted to reflect on my personal reasons for using languages other than the Erlang/Elixir stack.  To clarify, I love Erlang.  I think the programming model is brilliant.  Implementing objects as actors that run concurrently makes a lot of sense to me.  Thanks to the wonderful syntax and implementation, the overhead of using actors is very low -- little more complicated than calling a method in an object-oriented language.

Erlang's approach to structuring code offers a nice balance between the functional and object-oriented programming paradigms.  Like objects, actors provide instances which encapsulate their own state and local namespaces.  At the same time, the combination of dynamic typing and code organized into functions and modules makes it easy to reuse code in a manner similar to mixins.  Erlang completely avoids the class hierarchies of inheritance-based object-oriented languages and separates the concepts of identity and code organization.  As a result, Erlang sidesteps the brittle and complex design patterns required by object-oriented languages, while also avoiding the complexities and frustrations of purer functional languages.

So why do I rarely ever use the Erlang/Elixir ecosystem? Simply put, the lack of onramps.

Ruby became popular because of Rails.  Node.js used web developers' pre-existing comfort with JavaScript as an onramp.  Python is easy to get started with, scales, and has a huge ecosystem of specialized libraries, especially due to its ability to easily integrate with C.  Scala and Clojure are attractive to users already using Java and the JVM and can tap into the wide range of existing libraries.  And lastly, Go is backed by Google, providing great advertising and immediate exposure to a ton of internal developers who can battle test and evangelize the language.

Building onramps is key to language adoption.  Elixir/Erlang, D, Ceylon, and Rust have all struggled with this, resulting in slower adoption than their peers.

My work for the last few years has spanned a few different areas:


  • Modeling & Simulation -- development and implementation of mathematical models for studying the physics of molecules
  • Data Science -- analysis of large data sets in physics, chemistry, and bioinformatics
  • Data storage and processing systems for scientific data
  • RPG game


By and large, most of my work has been in the form of contributions to existing projects which don't use Erlang/Elixir.  Since changing languages for an existing project isn't an option (or even a smart choice in most cases), none of these projects could serve as onramps.

What about for new projects, from scratch?  All of my new projects have been data science projects.  Converting data from one file format to another.  Combining datasets.  Visualization of datasets.  Computing statistics.  None of these are strong areas for Erlang/Elixir, but they are for Python.

Python makes it really easy to start with small scripts or an interactive session (especially with the IPython Notebook) since there is no project setup required.  Python also scales to large applications, so I don't have to worry about changing languages down the road because my project has outgrown Python.  As a strong, general-purpose language with a large, high-quality ecosystem, Python is suitable for almost any task or project, no matter how specialized.  As a result, I rarely have a need to leave the Python ecosystem.

Erlang/Elixir projects require a bit more overhead to set up.  Erlang and Elixir are not ideal for projects requiring lots of data manipulation, math, graphics, or low-latency performance (games).  Erlang/Elixir certainly doesn't have the necessary breadth of libraries to support general-purpose tasks.  They are good at one thing and one thing only: distributed, fault-tolerant systems.

I'm starting to become interested in Scala.  Why? My new job at Red Hat involves working with and hacking on software built in Java and Scala -- existing projects are serving as onramps.

I'm also curious about Julia.  Why?  Julia has the IJulia Notebook, similar to the IPython Notebook, making it really easy to get started with.  Like Python, Julia also has a large set of libraries for data analysis and visualization, made possible through easy integration with C and Fortran, which makes it well-suited to many of the projects I work on.  The appropriateness for my work provides an onramp.

In my mind, if Erlang and Elixir want to grow, the community needs to identify ways to expand the scope of Erlang and Elixir to other potential use cases so that there are a larger number of natural onramps.

Friday, April 25, 2014

Big Data (Alone) Won't Save You


One of the pitfalls of the big data "revolution" came to haunt me in one of my current research projects.  I'm trying to develop new approaches for identifying insect chemosensory receptors.  Insects use chemosensory receptors to find food (e.g., mosquitoes searching for human hosts), mates, and even to avoid insecticides.  Insect chemosensory receptors tend to vary quite a bit from species to species, based on their lifestyles -- ants have about 350 olfactory receptors, while the body louse only has about 10.

As part of my work, I'm evaluating standard prediction approaches such as Hidden Markov Models (HMMs).  Without going into gory details, HMMs are built from an alignment of training sequences.  The resulting HMMs can be run on unclassified sequences to predict the probability that each query sequence matches the training sequences.  The quality of the HMM results is highly dependent on the quality and similarity of the training sequences, as I quickly discovered.

I compared the performance of four HMMs on olfactory receptors (ORs) and gustatory receptors (GRs) from 15 species (3 mosquitoes and 12 flies).  I had 251,890 total sequences, of which 1,149 are thought to be ORs and 921 are thought to be GRs.  The first HMM was downloaded from Pfam, a database of protein families, and had been trained on both ORs and GRs.  I trained a second HMM on 930 ORs given to me by a post-doc.  The Pfam HMM and this first OR HMM were then used to identify GRs and additional ORs in the dataset that we had missed the first time, resulting in about 200 additional ORs and 921 GRs.  I then trained two more HMMs, one on the expanded set of 1,149 ORs and one on all 921 GRs.  Afterwards, I ran all of the HMMs against the proteomes and compared sensitivity and accuracy.




Surprisingly, the HMM trained on the final list of ORs did WORSE than the original OR HMM!  The original OR HMM found 1,126 of the 1,149 ORs while filtering out the GRs and other sequences.  The final OR HMM identified 15 fewer ORs and many more GRs and other sequences, resulting in a higher false positive rate.  In this case, more data did not result in better performance -- the quality of the training data proved to be much more important than the quantity.
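To make the comparison concrete, here is a rough sketch that turns the counts above into sensitivities (the fraction of known ORs each HMM recovered).  The function and variable names are my own for illustration, and only the OR counts quoted in the text are used; the false positive counts are not given, so they are left out.

```python
def sensitivity(true_positives, total_positives):
    """Fraction of the known positives that the model recovered."""
    return true_positives / total_positives


TOTAL_ORS = 1149                  # ORs thought to be in the dataset
original_hits = 1126              # ORs found by the original OR HMM
final_hits = original_hits - 15   # the final OR HMM found 15 fewer

original_sens = sensitivity(original_hits, TOTAL_ORS)
final_sens = sensitivity(final_hits, TOTAL_ORS)

print(f"original OR HMM sensitivity: {original_sens:.3f}")
print(f"final OR HMM sensitivity:    {final_sens:.3f}")
```

This works out to roughly 0.980 for the original OR HMM and 0.967 for the final one.  The drop in sensitivity is small; the bigger problem was the extra GRs and other sequences that the final HMM let through.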




Now, I need to go back and find a way to distinguish between "good" and "bad" training sequences.  I have a few empirical approaches in mind, but my most valuable asset will be consultations with domain experts.  I'll need to repeat the process on another data set in the future, so it's more important to find out why certain sequences are "bad" than it is to identify the bad sequences in this dataset.

I learned a simple lesson: big data won't save you from the blind application of machine learning approaches.

Tuesday, January 21, 2014

Response to "Math is Not Necessary for Software Development"

Ross Hunter recently wrote a blog entry on Mutually Human arguing that math is not necessary for being a good software developer.  I agree with his thesis -- math isn't necessary.  However, Ross shouldn't then jump to the conclusion that math isn't useful for software development. Math may not be necessary but it can certainly be useful.

I'll start by addressing relevant points in his argument and try to clear up perceived misconceptions.

First Argument:

  • The skills that make a good mathematician are not the same as the skills that make for a good software developer.
  • Math is the process of breaking down complex problems into simpler problems, recognizing patterns, and applying known formulae.
Math is often taught in a way where students learn how to solve problems by identifying patterns.  Once the student identifies the pattern, they can solve the problem using the approach they memorized.  It's unfortunate that math is taught this way because people like Ross come away with a very incomplete and distorted picture of math.  I would call this "computation" rather than math.

The reason we have "known formulae" is precisely because of the practice of actual mathematics.  In my mind, mathematics is the process of analyzing a formal system with logic.  A mathematician starts by defining the basis of a formal system, specifying an initial set of rules by way of axioms: statements which are held to be true without proof.  Next, the mathematician recursively applies logic to determine what the implications of the axioms are and whether any additional rules can then be defined.  As more and more rules are proven, the system becomes more powerful.

Oftentimes, mathematicians will be looking to see if a specific rule can be derived from the initial set of axioms.  If they find that this is not the case, they may apply creative thinking to look for more specific cases where the rule does hold true, or they may change the axioms.  A good example is the complex number system.  When faced with the square roots of negative values, mathematicians had to define a new mathematical object (the imaginary number) to be able to reason about such results.  This process can actually be quite creative.

Speaking from personal experience and comments made by others, a good math education can be a significant advantage.  Reasoning through complex arguments and formal systems has made me much more detail oriented than I was before.  Math has improved my problem solving skills.  It's also enabled me to reason formally about software, which can be very important when developing distributed systems, for example.

It's unfortunate that the way math is often taught fails our students.  Students are often taught the results found over thousands of years, but not the methodology for discovering the results.  Classes like algebra, calculus, and introductory statistics are examples which focus on results rather than methodology. Unfortunately, these are also the most popular math classes since they are required in most high schools and college science majors!

Ross points out that he loved his discrete math class.  Discrete math, along with courses such as geometry, graph theory, and combinatorics, is much better suited to teaching students the methods rather than the results.  All of the subject material can be derived from a few simple definitions and axioms, giving students the opportunity to learn the mathematical process.  Imagine the benefit for students if they were taught Real Analysis or Modern Algebra instead of calculus!  As Ross rightly argues, in many cases, a solid foundation in logical thinking can be more broadly applicable than calculus.

(I would also like to correct Ross's description of discrete math.  Ross implies that discrete math only consists of logic and boolean algebra.  This is, of course, wrong -- discrete math covers a range of topics such as set theory, combinatorics, and graph theory as well.)

Second Argument:
  • In Math, there is only one right answer, but in software development, there is rarely a singular right answer.
Ross assumes that if a student is trained in mathematics, they will not be able to deal with grey situations.  Maybe Ross assumes that people are only studying math?  Or maybe he assumes that people are not capable of learning new ways of thinking or analyzing situations critically? Either way, this argument doesn't hold water.

Like any skill or way of thinking we have developed, learning where and when to apply it is an important part of gaining experience.  Ideally, a student would also be exposed to the humanities or cutting edge problems in the sciences where there are not clear answers. (Science education faces a similar problem -- a focus on results, not methods.)  Even if the student only studies math, it would be safe to assume that people can learn and adapt as they gain experience.  That is fundamentally part of being human.

There are also cases where math does not have a single correct answer.  A mathematician may have multiple ways of defining the initial axioms, each with its own trade-offs.  For example, there are variations on Euclidean geometry that change the initial axioms and end up with very different properties.


Math IS Useful:
Although math education is not necessary for software development, it is useful.  I've already described how math teaches good problem solving skills and critical thinking.  With the shift towards "internet scale" systems and big data, math is even more important than before.

Consider the case of evaluating and tuning a complex software system to squeeze out every last bit of performance.  A well-controlled experiment and appropriate use of statistics are necessary to accurately assess the response of the system under various conditions.  A software developer doesn't want to waste time performance tuning the areas of the system contributing least to the run-time -- they want to know what's eating up all the time so they can use their time efficiently.
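As a small illustration of what I mean, here is a minimal sketch that times two stages of a program several times each and reports the mean and standard deviation, rather than trusting a single measurement.  The two stage functions are made-up stand-ins, not code from any real system, and a real experiment would also need to control for things like caching and machine load.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Run func several times; return (mean, standard deviation) in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical stages standing in for parts of a larger system.
def parse_stage():
    return [int(text) for text in map(str, range(20000))]


def sort_stage():
    return sorted(range(20000, 0, -1))


for stage in (parse_stage, sort_stage):
    mean, spread = time_repeated(stage)
    print(f"{stage.__name__}: {mean * 1000:.2f} ms +/- {spread * 1000:.2f} ms")
```

If the difference between two stages is smaller than the spread of the measurements, there is no evidence yet that one of them is the bottleneck.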

A better example would be the rise of machine learning and data mining.  Users leak data left and right, which is collected by nearly every internet company.  The data is then processed to predict what the user might like to target ads or improve the user experience.  Machine learning is also used in the banking apps on our cell phones to read hand-written checks.  The popularity of machine learning is exploding as more and more uses are found.  I predict that many software developers will need to be proficient in machine learning techniques in the future.  Since machine learning is based on math and statistics, there may be a time when most software developers will need to know some linear algebra, statistics, and calculus.

All Knowledge is Useful:
Every subject offers an opportunity to apply our skills in a new way and train our brains to be even better.  One of the benefits of a liberal arts education is that students are expected to take a number of courses outside their major.  A programmer who decides to study math, literature, history, or art may find that they have developed a number of skills and tools that traditional Computer Scientists lack.


So is Experience:
Experience is a great teacher, especially in software development.  Spending hours debugging code is a great way to learn how a project works and to remember what caused the bug in the first place.  The next time you see a similar bug, you won't have to spend nearly as much time hunting its source down.

Experience also offers the benefit of knowing what works best in practice.  Ross points out that sometimes clever people will write code that is TOO clever.  They have sacrificed readability and effort for laziness or intellectual satisfaction.  This is not a problem of mathematicians, though.  This is a problem that comes from a lack of experience.

In the end, I agree that math is not necessary for software development (yet).  I also think that we could and should change the required math courses for computer science majors to favor courses that focus on logic and reasoning.  But we shouldn't be attacking math or implying that it has no value.  For some of us, math education has been a valuable part of our training.


Exploring OpenStack Savanna, Parts II and III: Elastic Data Processing

In addition to provisioning Hadoop clusters, OpenStack Savanna can also be used to directly run Hadoop jobs using a feature called Elastic Data Processing (EDP).  One of the key advantages of this approach is that users do not need to deal with multiple environments -- everything can be run directly from the Savanna UI.  In this sense, Savanna is competing with Amazon's Elastic MapReduce service to offer "Analytics as a Service" (AaaS).

Part II: Exploring the Savanna Job Interface
There are three relevant tabs in the Savanna Dashboard plugin: "Job Binaries," "Jobs," and "Job Execution."  The "Job Binaries" tab allows you to upload the code you want to execute.  Options include Pig and Hive scripts as well as MapReduce Java jars.


After uploading the binaries, you can create a job using the Jobs tab:


At this stage, you need to select your job type: Pig, Hive, or a MapReduce Java jar.  If your code spans multiple files, you can use the "Libs" tab to select files to be included in the library path.

The final stage is executing the job.  The user is asked to choose the job, the cluster, the number of mappers and reducers, and any arguments and parameters.  If the chosen cluster is not already running, Savanna will start the cluster before the job is run and shut down the cluster when the job is finished.

Part III: Elastic Data Processing (EDP) Behind the Scenes
After the user provides the job descriptions and binaries, Savanna stores the data in a local database using the SQLAlchemy object-relational mapper.

When a job is executed, Savanna converts the data model into an XML workflow for Oozie, a high-level workflow manager for running Pig, Hive, and MapReduce jobs on Hadoop.  By default, a new HDFS directory is created for every job:

/user/$username/$jobname/$uuid

where $username is the name of the user, $jobname is the name given to the job, and $uuid is a randomly-generated identifier.  The workflow itself is stored in a file named, appropriately enough, "workflow.xml". The main executable is placed in the job directory while the libraries are placed in

/user/$username/$jobname/$uuid/libs
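The directory layout above can be sketched in a few lines of Python.  This is only an illustration of the naming scheme, not Savanna's actual code: the function names are hypothetical, and placing "workflow.xml" directly inside the job directory is my assumption.

```python
import posixpath
import uuid


def job_dir(username, jobname, job_uuid=None):
    """Build /user/$username/$jobname/$uuid, generating a random UUID if needed."""
    if job_uuid is None:
        job_uuid = str(uuid.uuid4())
    return posixpath.join("/user", username, jobname, job_uuid)


def libs_dir(job_directory):
    """Libraries go in a 'libs' subdirectory of the job directory."""
    return posixpath.join(job_directory, "libs")


def workflow_path(job_directory):
    """Assumed location of workflow.xml: directly inside the job directory."""
    return posixpath.join(job_directory, "workflow.xml")


example = job_dir("alice", "wordcount", "1234-abcd")
print(example)
print(libs_dir(example))
print(workflow_path(example))
```

Because every execution gets a fresh identifier, two runs of the same job by the same user never share a directory.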

Savanna currently provides two options for handling data sources.  If the user has provided data through Swift (the object store), the XML Workflow for Oozie points to the appropriate Swift locations.  Otherwise, it is assumed the user has provided paths to the data in the command-line arguments given when executing the job.  For example, if you intend to use HDFS, you would be responsible for manually uploading the data to the Hadoop cluster.
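To give a feel for what an Oozie workflow looks like, here is a minimal sketch that builds a workflow with a single Pig action using Python's standard library.  This is not the XML that Savanna generates; the node names, the failure message, and the script name are placeholders I made up, and the "${jobTracker}" and "${nameNode}" values are Oozie parameters that would be supplied when the job is submitted.

```python
import xml.etree.ElementTree as ET


def build_pig_workflow(job_name, script_name):
    """Return a minimal Oozie workflow (as a string) with one Pig action."""
    root = ET.Element(
        "workflow-app",
        {"xmlns": "uri:oozie:workflow:0.2", "name": job_name},
    )
    ET.SubElement(root, "start", {"to": "pig-node"})

    action = ET.SubElement(root, "action", {"name": "pig-node"})
    pig = ET.SubElement(action, "pig")
    ET.SubElement(pig, "job-tracker").text = "${jobTracker}"
    ET.SubElement(pig, "name-node").text = "${nameNode}"
    ET.SubElement(pig, "script").text = script_name
    ET.SubElement(action, "ok", {"to": "end"})      # on success, finish
    ET.SubElement(action, "error", {"to": "fail"})  # on failure, kill the job

    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Pig job failed"
    ET.SubElement(root, "end", {"name": "end"})

    return ET.tostring(root, encoding="unicode")


xml_text = build_pig_workflow("example-job", "script.pig")
print(xml_text)
```

In a real workflow, the locations of the input and output data (Swift or HDFS) would be passed to the script as parameters or arguments.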

Monday, January 20, 2014

Exploring OpenStack Savanna, Part I: Launching a Hadoop Cluster

I recently started playing with OpenStack Savanna, a service for provisioning Hadoop clusters on top of OpenStack.  Savanna makes it easy to create clusters, but users may be confused by the initial process.  Here, I go through the steps of creating a Hadoop cluster using Savanna based on the Quick Start guide provided by the Savanna developers.

Part I: Launching a Hadoop Cluster
To begin with, I wanted to look at the process for provisioning a cluster.  The first step is to register one of the images available in your OpenStack installation for use with Savanna.  By clicking on the "Image Registry" tab on the left-hand side, we get a list of all images (none in my case) registered with Savanna:

 
We can register an image by clicking the "Register Image" button on the right.  A dialog comes up like so:


Select the image you want to use and give it a name.  Click the "Done" button to save your changes.  You should now see the newly added image in the list:


After adding an image, we need to create templates for the master and worker nodes.  Node templates control which processes run on the nodes (e.g., job tracker, task tracker, name node, data node, etc.).  Start by clicking on the "Node Group Templates" tab on the left-hand side.


We'll create two templates: one for a master and one for a worker.  To begin, click on "Create Template."  The following dialog should appear:


Go with the defaults here and click "Create."  (I will ignore this dialog for the rest of the entry since you can always go with the defaults for this tutorial.) A second dialog will appear:



This dialog allows you to select options for the template.  As we're creating the master template, give the template the name "master," select the small flavor, and check "namenode" and "jobtracker."  Click "Create" to finish.

To create the worker template, follow the same procedure but use a different name and select "datanode" and "tasktracker":


You should now see your templates in your template list:


You now need to create a cluster template, which defines how many nodes a cluster has and their types.  Click the "Cluster Templates" tab on the left-hand side:


To create a cluster template, click "Create Template."  The following dialog will appear:


On the first tab, you will want to provide a name such as "test-cluster-template."  Next, switch to the "Node Groups" tab.


Add one master and two workers to the cluster.  The rest of the parameters can be ignored, so simply press "Create."  Your cluster template should appear in the list:


Now, onto the fun part -- starting our cluster!  Click on the "Clusters" tab and press "Create Cluster":


Add a hostname, select a template, select the base image, and select a key pair.  When you press "Create," the cluster will be created and spawned.