Big Data in Stata
2015
With more and more data being stored by organizations across industries – from academia to health care to banking – along with plummeting storage and RAM costs, there is a growing need for tools to analyze “big data”. The world is moving from analyzing megabytes of data to analyzing many gigabytes. While Stata is very user-friendly, many of its most basic commands – summarize, sample, collapse, encode, and so on – are not optimized for speed. As of Stata 14, these commands all rely on sorting, making them tens or, in the case of sample, even hundreds of times slower than what is possible with better algorithms. In this presentation I illustrate alternative algorithms, along with coded examples in Stata, Mata, and C++ plugins, that can be used to analyze big data more quickly. fastsample and fastcollapse are available from the SSC.