As organizations grow and evolve, so do their technology stacks and the quantity of important data generated by those tools. Unfortunately, it’s all too common for firms to be unable to leverage the efficiencies that integrating all that data would provide, because it is locked away in separate silos. That’s where ETL (short for extract, transform, load) practices and tools come into play. Today, we’re going to take a look at the first part of ETL: data extraction.
Extract: This is a pretty simple concept, but it can be fiendishly difficult to implement in practice without the right tooling and expertise. The basic idea is that you have a pile of potentially valuable data locked up in some system or application, and you need to make it available to other systems, so you need a mechanism to either get it out or otherwise expose it. There are a few common approaches to this, each with a short code sketch after the list:
- API Access: Many modern tools support public programmatic access to all or a subset of their data. There are considerations to weigh before going down the API route, such as how much of the data the service actually exposes and rate limiting, but when it is available this is usually the go-to method for extracting data (see the first sketch below).
- Database Querying: It’s not uncommon for applications to be backed by a database, usually something like Microsoft SQL Server or MySQL. In cases where you have access to that database, such as when a legacy application is hosted onsite, querying it directly via SQL, or via a tool that uses SQL under the hood, can be an efficient way to extract data (see the second sketch below).
- Data Streaming: Data streaming involves real-time extraction from data streams such as log files, IoT devices, or message queues (e.g., Kafka, AWS Kinesis). This is a useful practice when you have an existing data source that will continue to run and you want to read from it on an ongoing, near-real-time basis (see the third sketch below).
- File Extraction: This method extracts data from files like CSV, Excel, or even legacy programs such as Microsoft Access. It’s often used for data stored in flat files, reports, or external data feeds, which makes it particularly useful for older systems that keep their data in local flat files, or for the many legacy applications built atop older data management tools such as Access (see the final sketch below).
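To make the API route concrete, here’s a minimal sketch of a paginated pull in Python. The endpoint, API key, `page` parameter, and `data` response key are all placeholders for whatever the real service documents; the important parts are the paging loop and backing off when the service rate-limits you.

```python
import time
import requests

# Hypothetical endpoint and credentials -- substitute your service's real values.
BASE_URL = "https://api.example.com/v1/invoices"
API_KEY = "YOUR_API_KEY"

def extract_records():
    """Page through the API, respecting basic rate limits, and yield raw records."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    page = 1
    while True:
        resp = requests.get(BASE_URL, headers=headers, params={"page": page})
        if resp.status_code == 429:            # rate limited -- back off and retry
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        records = resp.json().get("data", [])
        if not records:                         # an empty page means we're done
            break
        yield from records
        page += 1

if __name__ == "__main__":
    for record in extract_records():
        print(record)
```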
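Database querying tends to be the most direct option once you have credentials. The sketch below uses Python’s standard-library sqlite3 module so it runs anywhere; the database path, table, and column names are made up for illustration, and a SQL Server or MySQL driver (pyodbc, mysql-connector-python) follows the same DB-API pattern.

```python
import sqlite3

def extract_customers(db_path="legacy_app.db"):
    """Pull the rows we care about directly from the application's backing database."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT customer_id, name, email, created_at "
            "FROM customers WHERE created_at >= ?",
            ("2024-01-01",),
        )
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for row in extract_customers():
        print(row)
```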
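For streaming, the extractor runs continuously rather than as a one-off batch job. Here’s a rough sketch against Kafka using the kafka-python package; the topic name and broker address are placeholders for your environment.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Topic and broker are placeholders -- point these at your own cluster.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# The consumer blocks and yields messages as they arrive, so extraction
# happens on an ongoing, near-real-time basis.
for message in consumer:
    print(message.topic, message.offset, message.value)
```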
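Finally, a simple file-extraction sketch that walks a folder of CSV exports; the folder path is an assumption. Excel and Access sources work the same way in spirit but need additional libraries or drivers (for example, openpyxl for Excel workbooks, or the Access ODBC driver queried via pyodbc).

```python
import csv
from pathlib import Path

# Folder of exported flat files -- the path is illustrative.
EXPORT_DIR = Path("exports")

def extract_rows():
    """Read every CSV export in the folder and yield rows as dictionaries."""
    for csv_file in sorted(EXPORT_DIR.glob("*.csv")):
        with csv_file.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                yield row

if __name__ == "__main__":
    for row in extract_rows():
        print(row)
```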
Next week, we’ll take a look at the next step in ETL: transforming data. For now, please let me know if you have any questions about anything here, and if you have a need for ETL, have a look at Alice and drop me a line. I’d be happy to help!