In my capacity as a Quality Engineer at a company building data analysis software, I often find myself looking for quality data sets that I can use in my testing. Sometimes, I take the time to find some real data that fits my needs, but oftentimes it’s impossible (or takes far too long) to locate any such data set. In these circumstances, I find myself either writing a simple script to generate data or just creating some tiny amount of data that meets my needs.
Unfortunately, this takes too much time, and doesn’t generally yield the quality of data that I’d like to see. It’d be nice to have something to generate better quality data on-demand.
Current Issues
Though a number of data generation tools exist, I find them lacking at times, especially in generating non-tabular data. Most of them are capable of creating some decent data, but this doesn’t extend to things like documents, comments, or links between separate entities or distinct types of entities.
Some of these tools, however, are super useful. A couple that I’ve used (and liked, with the shortfalls listed above) include Generate Data and Mockaroo. In terms of document generators, I’ve never actually found one. The only document generator I’ve ever used was one I created, but it was written for one specific purpose, and with only one format.
A Better Way?
I think that in order to have something really valuable, it needs to build upon previous generators. It needs to be flexible enough to generate any sort of data given a pattern to follow, whether it’s numeric, string-based, or an entire document.
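To make the idea of "a pattern to follow" a bit more concrete, here's a minimal sketch in Python. The pattern syntax is purely my own assumption for illustration (`#` for a digit, `?` for an uppercase letter, anything else copied literally), and `generate_from_pattern` is a hypothetical name, not part of any existing tool.

```python
import random
import string


def generate_from_pattern(pattern, rng):
    """Expand a simple pattern into a random string.

    Assumed (hypothetical) syntax:
      '#' -> a random digit
      '?' -> a random uppercase ASCII letter
      anything else is copied through unchanged
    """
    out = []
    for ch in pattern:
        if ch == "#":
            out.append(rng.choice(string.digits))
        elif ch == "?":
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random()
    # e.g. an order ID such as two letters, a dash, and four digits
    print(generate_from_pattern("??-####", rng))
```

A real generator would need a much richer pattern language (numeric ranges, dates, nested documents), but the same idea of walking a template and filling in the variable parts should carry over.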
Realistic Data
It needs to generate realistic output from those patterns. It needs to be able to choose values from a set that’s widely varied, but do so in a way that reflects realistic distributions on the data.
For example, given a set of names, it doesn’t make sense to choose names uniformly at random. Names like ‘Jacob’ occur much more often than names like ‘Deantoine’. Numbers for amounts, like financial transactions, generally follow Benford’s Law. And ages aren’t just random. A randomly chosen individual is far more likely to be 22 years old than 102 years old.
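As a rough sketch of what that could look like, the Python below picks names according to weights and draws amounts whose leading digits approximately follow Benford’s Law. The name weights are made-up illustrative numbers, not real frequency data, and the function names are hypothetical. For the amounts, I'm assuming a log-uniform draw (10 raised to a uniformly random exponent), which is one simple way to get Benford-like leading digits when the exponent range spans whole powers of ten.

```python
import random

# Illustrative weights only (assumed, not real census figures).
NAME_WEIGHTS = {"Jacob": 500, "Emily": 450, "Deantoine": 1}


def pick_name(rng):
    """Choose a name with probability proportional to its weight."""
    names = list(NAME_WEIGHTS)
    weights = list(NAME_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


def benford_amount(rng, min_exp=0, max_exp=5):
    """Draw an amount between 10**min_exp and 10**max_exp.

    Sampling the exponent uniformly makes the value log-uniform,
    so leading digits come out close to Benford's Law.
    """
    return round(10 ** rng.uniform(min_exp, max_exp), 2)


def leading_digit(value):
    """Return the first significant digit of a positive number."""
    return int(f"{value:e}"[0])


if __name__ == "__main__":
    rng = random.Random()
    print([pick_name(rng) for _ in range(5)])
    print([benford_amount(rng) for _ in range(5)])
```

Ages could be handled the same way as names, with weights taken from a population table instead of treating every age as equally likely.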
Accessible Data
The generator should be widely accessible via an API, so that developers can directly access data that meets their needs. This would allow access on the fly, and could allow periodic calls to simulate things like user sign-ups, message traffic, etc.
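Since no such API exists yet, the following Python is only a sketch of how periodic calls might be used to simulate sign-ups. `fetch_fake_user` is a hypothetical stand-in for a call to the generator (it just builds a record locally), and the field names are made up. The gaps between events are drawn from an exponential distribution, which is my own assumption for getting irregular, realistic-looking arrival times.

```python
import random
from datetime import datetime, timedelta


def fetch_fake_user(rng):
    """Hypothetical stand-in for a request to the data generator API."""
    return {
        "user_id": rng.randrange(1_000_000),
        "plan": rng.choice(["free", "pro"]),
    }


def simulate_signups(start, count, mean_gap_seconds, rng):
    """Yield (timestamp, record) pairs with randomly spaced arrival times."""
    now = start
    for _ in range(count):
        # expovariate takes a rate, so 1 / mean gives the desired average gap.
        now += timedelta(seconds=rng.expovariate(1.0 / mean_gap_seconds))
        yield now, fetch_fake_user(rng)


if __name__ == "__main__":
    rng = random.Random()
    for ts, user in simulate_signups(datetime.now(), 5, 60.0, rng):
        print(ts.isoformat(timespec="seconds"), user)
```

In a live test, the loop would sleep until each timestamp and call the real service instead of the local stub.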
Open Source
Finally, I think it should be open source. Open source applications allow anyone to contribute to, build upon, and improve existing applications. With a utility that’s widely usable, I think this is the only way to go.
Development
On that note, I’d like to say that though I know it’ll take a lot, I’m going to begin the development of such a system. I’ll be putting the code on GitHub, as you might expect from an open source project. If you’ve got any thoughts, feel free to drop them in the comments below!