Big Data

Collector's Edition

Big Data is Driven by Technologies

Keeping track of more than 200 technologies is complex,
even if you only want to pick a capable technology for your next big data challenge.

So we started to build a big picture,
with the goal of providing an overview of the passionate open source community around big data technologies.

Data Collection

We collected data about:

  • 269 technologies
  • 21,202 contributors
  • 1,354,013 commits
    • Based on the following Git command, which produced one log file per repository:

      git log --pretty=format:"\"%H\";\"%an\";\"%ae\";\"%cn\";\"%ce\";\"%s\";\"%cI\"" > ../technology.csv
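The log files written by this command can be turned back into commit and contributor counts with a few lines of Python. The sketch below is only an illustration under stated assumptions, not the analysis that was actually run: `parse_log_line` and `summarize` are hypothetical helper names, and counting contributors by lower-cased author e-mail is an assumption. Because commit subjects may themselves contain quotes or semicolons, the line is split on the `";"` separator and the subject is rebuilt from whatever lies between the first five fields and the last one.

```python
from collections import Counter

# Separator between two quoted fields, as produced by the format string above.
FIELD_SEP = '";"'


def parse_log_line(line):
    """Split one log line into its seven fields (hypothetical helper)."""
    body = line.rstrip("\n")
    if not (body.startswith('"') and body.endswith('"')):
        raise ValueError("line is not wrapped in double quotes")
    parts = body[1:-1].split(FIELD_SEP)
    if len(parts) < 7:
        raise ValueError("expected at least seven fields")
    # The subject (%s) is the only free-text field, so it absorbs any extra
    # separators that happen to occur inside a commit message.
    return {
        "hash": parts[0],
        "author_name": parts[1],
        "author_email": parts[2],
        "committer_name": parts[3],
        "committer_email": parts[4],
        "subject": FIELD_SEP.join(parts[5:-1]),
        "committed_at": parts[-1],
    }


def summarize(lines):
    """Return (number of commits, Counter of commits per author e-mail)."""
    commits = 0
    authors = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_log_line(line)
        commits += 1
        # Assumption: one e-mail address corresponds to one contributor.
        authors[record["author_email"].lower()] += 1
    return commits, authors
```

Passing an open file handle for one repository's log to `summarize` yields the commit total and, via `len(authors)`, the number of distinct author addresses for that technology.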

Technologies

Let's take a closer look at the collected data and find out which technology has made the most progress, measured by its contributions.

Technology Insights

MySQL is by far the technology with the highest number of commits and contributors.

But technologies like Spark, Elasticsearch, or machine learning with scikit-learn are attracting an increasing number of contributors. We think that technologies with a high number of contributors will continue to progress, with many developments in the coming months and years.

The question is: How can we keep up with these developments given a rising variety of upcoming technologies and, more importantly, how do we learn about all these technologies?

Awesome Community

How about initiating a community that collects and structures all these technologies to solve this overview issue? Oh yes, this would be awesome ...

Good point: there already is an awesome community that curates big data technologies and discusses related domains in order to group similar technologies.

But why is it necessary to put big data technologies into domains?

Domains in More Detail

Can we really handle the variety of big data technologies through abstract domains? Yes, but each domain is different ...

Catching Variety

We can see that certain domains include a high variety of specialized technologies, like "Distributed Programming" with a number of related technologies such as flink, storm, and tez. On the other hand, there are domains with more prominent technologies. Those domains cover a wider range of applications. One example is the "Search engine and framework" domain with lucene-solr or elasticsearch.

But some technologies, like MySQL, have a much longer history than others, like Spark. Is it really useful to collect data on all these technologies over a period of time?

Yes, because we can see the ups and downs.

Community-Driven

The timeline shows technologies with a significant number of commits, like MySQL (server), Kiji, or Impala, and the ups and downs of their implemented features. In general, the community is organic and people can quickly join their favourite project. The number of contributions increases year by year, especially from 2014 onwards.
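A timeline like this can be derived from the committer dates in the collected log files. The sketch below is a minimal illustration, not the original analysis: `commits_per_year` is a hypothetical helper, and it assumes every date is the strict ISO 8601 string written by `%cI`, so the first four characters are the year in the committer's own time zone.

```python
from collections import Counter


def commits_per_year(iso_dates):
    """Count commits per calendar year from ISO 8601 date strings."""
    years = Counter()
    for value in iso_dates:
        year = value[:4]
        if len(year) != 4 or not year.isdigit():
            raise ValueError(f"not an ISO 8601 date: {value!r}")
        years[int(year)] += 1
    # Return the years in chronological order so the result reads as a timeline.
    return dict(sorted(years.items()))
```

Running this once per technology gives one series per repository, which is enough to see the ups and downs of each project side by side.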

These insights raise further questions: Who are the contributors, and do they contribute to one or more projects?
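Whether contributors work on one or several projects can be checked by grouping commits by author across repositories. The following sketch is a hedged example only: the function names are hypothetical, the input is assumed to be `(technology, author e-mail)` pairs taken from the log files, and treating one lower-cased e-mail address as one person is an assumption that will miss contributors who use several addresses.

```python
from collections import defaultdict


def projects_per_contributor(commits):
    """Map each author e-mail to the set of technologies they committed to.

    `commits` is an iterable of (technology, author_email) pairs.
    """
    projects = defaultdict(set)
    for technology, email in commits:
        # Assumption: the normalized e-mail address identifies a contributor.
        projects[email.strip().lower()].add(technology)
    return projects


def multi_project_contributors(commits):
    """Return only contributors who appear in more than one technology."""
    return {
        email: sorted(technologies)
        for email, technologies in projects_per_contributor(commits).items()
        if len(technologies) > 1
    }
```

The share of contributors returned by `multi_project_contributors` gives a first impression of how much the individual project communities overlap.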

Passionate Contributors

Every big data technology requires passionate contributors within an embedded core team to sustain itself. Tom Lane and Bruce Momjian for PostgreSQL, Mike Bostock for D3.js, and Jonathan Ellis for Cassandra, to name just a few, have a really high number of contributions.

Our last question in this discovery is: Do we really have an organic community that joins, adopts, or turns to specific big data technologies?

Conclusion

Big data technologies are awesome, and the insights into this community show why:
  1. Variety of Technologies: The list of big data technologies is rich and drives the digital transformation.
  2. Domains: Thinking in domains helps to select technologies with similar applications. Big data engineers in particular can focus on specific domains to gain in-depth knowledge of related technologies.
  3. Technology Scope: Some domains contain a high number of specialized technologies, while others contain more general technologies that cover a wide range of applications.
  4. Contributions: Contributions are continuously increasing over time. The trend is upwards.
  5. Open Ecosystem: Independent developers, companies, and public organizations join the community to develop big data technologies. This sustainable ecosystem supports society, science, and the economy through technological progress.
  6. Community-Driven: The development and analysis of big data technologies thrives on an organic discussion that generates broad consensus in the open source community. This consensus depends on the adoption of new technologies as well as the support of base technologies.

Deloitte refers to one or more of Deloitte Touche Tohmatsu Limited, a UK private company limited by guarantee (“DTTL”), its network of member firms, and their related entities. DTTL and each of its member firms are legally separate and independent entities. DTTL (also referred to as “Deloitte Global”) does not provide services to clients. Please see www.deloitte.com/de/UeberUns for a more detailed description of DTTL and its member firms. Deloitte provides audit, tax, financial advisory and consulting services to public and private clients spanning multiple industries; legal advisory services in Germany are provided by Deloitte Legal. With a globally connected network of member firms in more than 150 countries, Deloitte brings world-class capabilities and high-quality service to clients, delivering the insights they need to address their most complex business challenges. Deloitte’s more than 225,000 professionals are committed to making an impact that matters. This communication is for internal distribution and use only among personnel of Deloitte Touche Tohmatsu Limited, its member firms and their related entities (collectively, the “Deloitte network”). None of the Deloitte network shall be responsible for any loss whatsoever sustained by any person who relies on this communication

Contact Us

If you want to know more about us, just contact us:

Oliver Bieh-Zimmert