Increasing Driver Performance: Case Study Google BigQuery

  • Oracle Solutions
  • Data-Driven Enterprise

In today’s world, we have an infinite amount of data and information at our fingertips. But infinite data requires the technology and infrastructure to store and manage all this data. People expect the software applications they use to deliver on their promised functionality and work within larger architectures and data ecosystems seamlessly – meaning – the want the ability to consume and utilize vast amounts of data within a myriad of applications. So, what does it take for software manufacturers – behind the scenes - to achieve and provide this level of confluence? It requires working in partnership between customers and vendors. At Magnitude Simba we work together with leading data providers and application developers to optimize and construct the desired end user experiences end-to-end.

When top performing technologies align – progress happens

Google BigQuery is a fully-managed data service that lets users run queries against data stored on Google Cloud Storage. Today, Magnitude Simba is the official developer for Google Cloud BigQuery ODBC and JDBC drivers. Working together, Google engineers and our Magnitude Simba team worked to understand the end user experience, holistically, by considering the application, the driver and the data source together. In this way, we united the goals of three organizations - Google, Magnitude Simba and most importantly, the End User - by testing both functionality and performance. We recognize that faster reliable data access for the end user results in terms of markedly increased productivity. For example, what may seem like shaving seconds, or even milliseconds, off an individual line item, can be significant when considering the frequency and number of individuals performing the tasks across the whole organization.

Both Google and Magnitude Simba preference the importance of considering and optimizing the end user experience.

Understanding the current technical limitations

At the request of Google, Simba’s BigQuery ODBC team frequently reviews the performance of our driver with the goal of building the most optimally performant driver for the BigQuery data source. One of the first results of reviewing performance led to the refactoring of our single-threaded data reader of BigQuery ODBC 2.0 to the affectionately-named Parallel Fetcher of BigQuery ODBC 2.1

Read times for result-generating queries were reduced by nearly 20% - a result set that might take an hour and a half to read was reduced to slightly over an hour. The larger the table, the larger the performance gain. The Parallel Fetcher model was a useful tool for fetching this data, but it had never been scaled out to apply to our catalog functions. For years, functions like SQLTables and SQLColumns were effectively single threaded - fetch a page of projects, fetch a page of datasets, then a page of tables, etc. This had a noticeable impact on BI tools like Power BI, which would use these catalog functions to pre-populate information about the structure of the user's BigQuery projects and tables. Though comparatively few API calls were required when compared to fetching query results, these operations could take upwards of thirty seconds to get this information through the driver. For instance, consider the performance of the SQLTables operation below:

Operation

Configuration

Time to complete (seconds)

SQL Tables

CATALOG_ONLY (5 catalogs)

0.5

ALL_TABLES (default Simba catalog)

6.5

ALL_TABLES (bigquery-public-data)

29.3

Well... that simply wouldn't do.

Driving performance results: Single-threading is now obsolete

Requiring thirty seconds to fetch information through a driver is a significant bottleneck when one considers that it must be done for each ODBC connection to the data source. The driver has historically needed to do this because of the variety of API calls required - not only did we need to list all the projects, but also all the datasets for each of those projects, and all the tables for each of those datasets, and so on and so forth. This meant that at minimum, a SQLTables call would require access to three API calls, each of which is dependent on the previous (for instance, to get datasets for a project one must first know the project, and therefore must fetch the list of those projects). Parallelizing such an operation is no easy task, but with careful management of threads and the interdependence of API calls, it can be done.

Operation

Configuration

Time to complete (seconds)

Single-Thread Fetch

Parallel Fetch

SQL Tables

CATALOG_ONLY (5 catalogs)

0.5

0.5

ALL_TABLES (default Simba catalog)

6.5

2.1

ALL_TABLES (bigquery-public-data)

29.3

4.5

Result: It now takes less than five seconds to complete tasks which previously took 30 seconds

At the end of the day, people want to keep up the momentum in their workflows. In another context 30 seconds might not seem long. To the data analyst who is required to access data multiple times a day, the time to wait can be the difference between frustration and positive fulfillment of their daily work rituals and information needs. For a customer service agent helping a customer, the time to wait while a query is performed can lead to a frustrated customer experience, or worse. In the end every interaction the user has with the software plays a part into how the end user views the product.

About Simba Managed Services

Data connectors are a vital feature of data-driven applications, while building, testing, and maintaining those drivers often lie outside a team’s core competencies. Engaging with Simba’s Managed Services team provides software vendors with a flexible, high-touch option in solving the burden and complexity of data connectivity. We’re dedicated to providing state-of-the-art and enterprise grade data connectivity solutions – which is our business - so that software manufacturers can focus on theirs.

Learn more: Talk to a Simba specialist for more information on:

  • Integration & Testing Services
  • SDK-Licensed Development & Testing
  • Driver Customization and Engineering Services

From data connectors, to SDKs and services – find out more about what we offer Software Providers at Simba.