﻿<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with MathML3 v1.2 20190208//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article
    xmlns:mml="http://www.w3.org/1998/Math/MathML"
    xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="review-article">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">JAIBD</journal-id>
      <journal-title-group>
        <journal-title>Journal of Artificial Intelligence and Big Data</journal-title>
      </journal-title-group>
      <issn pub-type="epub">2771-2389</issn>
      <publisher>
        <publisher-name>Science Publications</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.31586/jaibd.2025.6049</article-id>
      <article-id pub-id-type="publisher-id">JAIBD-6049</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Review Article</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>
          Enhancing Scalability and Performance in Analytics Data Acquisition through Spark Parallelism
        </article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
<name>
<surname>Salim</surname>
<given-names>Hanza Parayil</given-names>
</name>
<xref rid="af1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Rajindran</surname>
<given-names>Yanas</given-names>
</name>
<xref rid="af2" ref-type="aff">2</xref>
</contrib>
      </contrib-group>
<aff id="af1"><label>1</label> Staff Engineer, Neiman Marcus, Texas, USA</aff>
<aff id="af2"><label>2</label> Lead Engineer, AT&#x00026;T, Texas, USA</aff>
      <pub-date pub-type="epub">
        <day>22</day>
        <month>03</month>
        <year>2025</year>
      </pub-date>
      <volume>5</volume>
      <issue>1</issue>
      <history>
        <date date-type="received">
          <day>02</day>
          <month>02</month>
          <year>2025</year>
        </date>
        <date date-type="rev-recd">
          <day>08</day>
          <month>03</month>
          <year>2025</year>
        </date>
        <date date-type="accepted">
          <day>19</day>
          <month>03</month>
          <year>2025</year>
        </date>
        <date date-type="pub">
          <day>22</day>
          <month>03</month>
          <year>2025</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>&#xa9; Copyright 2025 by authors and Trend Research Publishing Inc. </copyright-statement>
        <copyright-year>2025</copyright-year>
        <license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p>
        </license>
      </permissions>
      <abstract>
        Data acquisition serves as a critical component of modern data architecture, with REST API integration emerging as one of the most common approaches for sourcing external data. This study evaluates the efficiency of various methodologies for collecting data via REST APIs and benchmarks their performance. It explores how leveraging the Spark distributed computing platform can optimize large-scale REST API calls, enabling enhanced scalability and improved processing speeds to meet the demands of high-volume data workflows.
      </abstract>
      <kwd-group>
        <kwd>Distributed computing</kwd>
        <kwd>Parallel processing</kwd>
        <kwd>Data Acquisition</kwd>
        <kwd>Apache Spark</kwd>
        <kwd>RESTful Web Services</kwd>
        <kwd>REST API</kwd>
        <kwd>Data Analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec1">
<title>Introduction</title><p>REST APIs are commonly used for data acquisition because of their flexibility, scalability, and standardization, and enterprises rely on them widely for sourcing data from external systems. In many cases, large volumes of API calls must be made at once, which introduces challenges such as latency, rate limits, and error handling. This paper specifically examines the latency challenges of traditional sequential API calls and explores how they can be addressed using Spark's parallel processing architecture. It discusses how Apache Spark [
<xref ref-type="bibr" rid="R6">6</xref>] DataFrames, RDDs, and UDFs (user-defined functions) can be used to parallelize REST API calls and improve overall performance.</p>
</sec><sec id="sec2">
<title>REST API and Apache Spark</title><p>A REST API (also called a RESTful web API) is an application programming interface (API) that follows the design principles of the representational state transfer (REST) architectural style. REST APIs provide a lightweight, flexible way to integrate applications and are known for their scalability, flexibility, portability, and independence, since the client and server are cleanly separated. They offer a simple and efficient method for accessing data from various sources, enabling developers to integrate systems easily and retrieve information through basic HTTP requests. This makes the data acquisition process streamlined and effective across different platforms and programming languages.</p>
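<p>At its core, this style of data acquisition boils down to composing an HTTP GET request and parsing the JSON body that comes back. A minimal plain-Python sketch of the pattern (the endpoint, parameter names, and response shape here are hypothetical, chosen only for illustration):</p>

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
base_url = "https://api.example.com/quote"

def build_quote_url(symbol, region="US"):
    """Compose the GET request URL for a single stock quote."""
    return base_url + "?" + urlencode({"region": region, "symbol": symbol})

def parse_quote(payload):
    """Extract the fields of interest from a JSON response body."""
    data = json.loads(payload)
    return data["symbol"], data["price"]

# A canned response body standing in for what a server would return.
sample_body = '{"symbol": "IBM", "price": 173.2}'
print(build_quote_url("IBM"))   # https://api.example.com/quote?region=US&symbol=IBM
print(parse_quote(sample_body))
```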
<p>In various real-life scenarios, we need to make parallel REST API [
<xref ref-type="bibr" rid="R2">2</xref>] calls. In applications with high traffic, such as e-commerce platforms or real-time dashboards, making parallel API requests keeps the application responsive by reducing the load time for fetching external data. Parallel API calls also help meet real-time requirements in scenarios like real-time analytics or monitoring systems, where data needs to be collected from multiple sources without delay, stored in high-performance scalable storage like Delta Lake [
<xref ref-type="bibr" rid="R4">4</xref>], and further used for analytics and machine learning [
<xref ref-type="bibr" rid="R11">11</xref>] applications.</p>
<p>Apache Spark is a scalable, distributed data processing framework that allows users to process, analyze, and manipulate large data sets efficiently. Processing is performed in memory on clusters (sets of nodes that work together to handle large volumes of data in an efficient, scalable manner). RDDs (Resilient Distributed Datasets) and DataFrames (a higher-level abstraction over RDDs) are two fundamental data structures in Apache Spark [
<xref ref-type="bibr" rid="R6">6</xref>,<xref ref-type="bibr" rid="R7">7</xref>] that allow developers to work with large volumes of data in a distributed, parallel manner across these clusters.</p>
</sec><sec id="sec3">
<title>Data Acquisition through REST API Calls</title><p>The traditional method of making REST API [
<xref ref-type="bibr" rid="R2">2</xref>] calls is sequential, where input data is passed for each call. In this case, we are using the Yahoo Finance API to retrieve stock quotes for 2,000 companies listed on the NYSE. The platform in use is Databricks, running on Spark with a single driver and four worker nodes, and the language chosen is Python. The input for the API call is a list of 2,000 stock codes, and we need to make an API call for each stock code, collect the responses, and process them.</p>
<title>3.1. Sequential API Call</title><p>In a sequential call, one stock code is processed per iteration of the loop, each call executing after the previous one completes. The issue with this approach is that it is sequential and written purely in Python, which means it runs only on the driver node and lacks scalability, as represented in Figure <xref ref-type="fig" rid="fig1"> 1</xref> below. This approach took a total of 16 minutes to complete the entire list of 2,000 stock codes.</p>
<p>import time</p>
<p>import requests</p>
<p></p>
<p>session = requests.Session()</p>
<p>BaseURL = "https://yahoo-finance166.p.rapidapi.com/api/stock/get-price?region=US&#x00026;symbol="</p>
<p>StartTime = time.time()</p>
<p>for StockCode in StockList:</p>
<p>   response = session.get(BaseURL + StockCode)</p>
<p>print('Elapsed Time ', time.time() - StartTime)</p>
<p></p>
<fig id="fig1">
<label>Figure 1</label>
<caption>
<p>The API call is executed on the driver node for a standard sequential process.</p>
</caption>
<graphic xlink:href="6049.fig.001" />
</fig><title>3.2. API Calls using Multithreading</title><p>Multithreading is a powerful tool for achieving concurrency. The ThreadPoolExecutor class, part of Python's standard concurrent.futures module, provides a simple interface for creating a pool of worker threads and executing tasks in parallel. Its easy-to-use API for managing concurrent execution makes it particularly well suited to high-latency, I/O-bound tasks such as REST API calls.</p>
<p>The advantage of using ThreadPoolExecutor is that it triggers REST API [
<xref ref-type="bibr" rid="R8">8</xref>,<xref ref-type="bibr" rid="R9">9</xref>] calls in parallel. Multithreading primarily introduces concurrency, meaning the threads take turns executing on the same resources. Although we run this on a Spark cluster [
<xref ref-type="bibr" rid="R1">1</xref>] with four worker nodes, all the processing occurs on the driver node, even though the APIs are triggered in parallel, as clearly shown in Figure <xref ref-type="fig" rid="fig2"> 2</xref>. Fetching the responses for all 2,000 stock codes took 3.3 minutes, since the API calls were submitted in parallel.</p>
<p>from concurrent.futures import ThreadPoolExecutor</p>
<p>import os</p>
<p></p>
<p>def ThreadPoolCall(StockCode):</p>
<p>   response = session.get(BaseURL + StockCode)</p>
<p>   return response.text</p>
<p></p>
<p>with ThreadPoolExecutor(max_workers=min(32, os.cpu_count() + 4)) as executor:</p>
<p>  results = executor.map(ThreadPoolCall, StockList)</p>
<fig id="fig2">
<label>Figure 2</label>
<caption>
<p>The API calls are executed in parallel on the driver node for a multithreaded process.</p>
</caption>
<graphic xlink:href="6049.fig.002" />
</fig></sec><sec id="sec4">
<title>REST API Calls using Apache Spark</title><p>In this approach, we utilize Spark partitions in two ways: through DataFrames and RDDs [
<xref ref-type="bibr" rid="R3">3</xref>]. By using UDFs (User Defined Functions), we leverage DataFrames in Spark, and by using map or mapPartitions functions, we follow the RDDs approach. In both methods, Spark partitions data and processes tasks concurrently. This means dividing the list of API endpoints or parameters into partitions and collecting and processing API responses in parallel. Both approaches can parallelize tasks similarly in the backend.</p>
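<p>Conceptually, the partitioning Spark performs can be pictured as splitting the input list into independent buckets, each of which a worker can process on its own. A plain-Python sketch of that idea (not Spark code; the round-robin scheme is one simple way to distribute items):</p>

```python
def split_into_partitions(items, num_partitions):
    """Distribute items round-robin across num_partitions buckets,
    mimicking how Spark spreads rows over partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for i, item in enumerate(items):
        buckets[i % num_partitions].append(item)
    return buckets

stock_list = ["S%04d" % i for i in range(2000)]   # stand-in for 2,000 stock codes
partitions = split_into_partitions(stock_list, 60)

print(len(partitions))                   # 60 partitions
print(max(len(p) for p in partitions))   # at most ceil(2000/60) = 34 codes each
```

<p>Each bucket is then processed independently; with 60 partitions and 2,000 codes, no single worker ever handles more than about 34 API calls.</p>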
<title>4.1. API Calls using Spark UDFs</title><p>By using PySpark UDFs for REST API calls, we can harness the power of distributed computing to interact with external services. To leverage the parallelism offered by Apache Spark [
<xref ref-type="bibr" rid="R10">10</xref>], each REST API call is encapsulated within a UDF and bound to a DataFrame. Each row in the DataFrame represents a single call to the REST API service. When an action is executed on the DataFrame, the result from each individual REST API call is appended to each row as a structured data type. This approach involves coupling the UDF with a withColumn statement, where the UDF returns a structured column representing the REST API response. This response can then be further processed using functions like explode and other built in DataFrame functions.</p>
<p>By converting our stock code list into a DataFrame and encapsulating our logic within a UDF, we create an object that can be partitioned across multiple executors. This enables concurrent operations across the cluster, making the process more efficient, as shown in Figure <xref ref-type="fig" rid="fig3"> 3</xref>. Based on the volume of input parameter sets to be processed and the throughput supported by the target REST API server, we can specify the number of partitions to use, which lets us adjust the level of parallelization as needed. This is done with the repartition function used below. Here we parallelize the API call by distributing the data across 60 Spark partitions, and the execution time is reduced to 0.5 minutes to get the responses for all 2,000 stock codes.</p>
<p>from pyspark.sql.functions import udf</p>
<p></p>
<p>StockDataFrame = spark.createDataFrame(StockList, Columns)</p>
<p></p>
<p>@udf("string")</p>
<p>def GetResponse(StockCode):</p>
<p>  response = session.get(BaseURL + StockCode)</p>
<p>  return response.text</p>
<p></p>
<p>StockDataFrameDF = StockDataFrame.repartition(60)</p>
<p>ResponseDF = StockDataFrameDF.withColumn("response",</p>
<p>                              GetResponse("StockCode"))</p>
<p>print(ResponseDF.count())</p>
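<p>Each row of ResponseDF now carries the raw response text, which still has to be parsed into usable fields. The extraction itself is ordinary JSON handling; a plain-Python sketch (the payload shape shown here is hypothetical, as the actual Yahoo Finance response structure may differ):</p>

```python
import json

def extract_price(response_text):
    """Pull the symbol and price out of one API response string."""
    result = json.loads(response_text)["quoteSummary"]["result"][0]
    return result["symbol"], result["regularMarketPrice"]

# Canned response body standing in for one row's response column.
sample_response = '{"quoteSummary": {"result": [{"symbol": "IBM", "regularMarketPrice": 173.2}]}}'
print(extract_price(sample_response))
```

<p>In the DataFrame pipeline, the same extraction is typically expressed with from_json and a schema, so that the parsing also runs in parallel on the workers.</p>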
<fig id="fig3">
<label>Figure 3</label>
<caption>
<p>The API calls are executed in parallel on the worker nodes when calling using Spark UDF.</p>
</caption>
<graphic xlink:href="6049.fig.003" />
</fig><title>4.2. API Calls using Spark RDDs</title><p>In this approach, we can use Spark's map and mapPartitions functions to distribute API calls across workers, but these functions operate at the RDD level. The RDD approach is generally not recommended today because higher-level abstractions, such as DataFrames, are available. As discussed in the previous section, the RDD approach requires converting the response back to a DataFrame for processing. Since DataFrames (via UDFs) offer an easier method for making REST API calls, we do not use the RDD approach in the performance comparison. Additionally, some newer compute configurations on Spark platforms such as Databricks [
<xref ref-type="bibr" rid="R5">5</xref>] no longer support the RDD API.</p>
<p>If we have 2,000 elements in a specific RDD [
<xref ref-type="bibr" rid="R3">3</xref>] partition, the map transformation will trigger the function 2,000 times, once for each element. Using mapPartitions, on the other hand, will call the function only once, passing all 2,000 records together and receiving all responses from a single function call. The mapPartitions transformation is therefore faster than map, because it invokes the function once per partition rather than once per element.</p>
<p></p>
<title>4.2.1. Using map function</title><p>def CallFunc(StockCode):</p>
<p>    response = requests.get(BaseURL + StockCode)</p>
<p>    return response.text</p>
<p></p>
<p>RDD0 = sc.parallelize(StockList)</p>
<p>RDD1 = RDD0.map(CallFunc)</p>
<p>&lt;RDD1 to be converted to a DataFrame for further processing></p>
<p></p>
<title>4.2.2. Using mapPartitions function</title><p>def CallFunc(StockCodes):</p>
<p>    Resultlist = []</p>
<p>    for StockCode in StockCodes:</p>
<p>        response = requests.get(BaseURL + StockCode)</p>
<p>        Resultlist.append(response.text)</p>
<p>    return Resultlist</p>
<p></p>
<p>RDD0 = sc.parallelize(StockList)</p>
<p>RddCallResult = RDD0.mapPartitions(CallFunc)</p>
</sec><sec id="sec5">
<title>Performance Benchmarking and Analysis</title><p>The traditional sequential method of making REST API calls took 16 minutes; Multithreading reduced this time by a factor of 4.8. As shown in Figure <xref ref-type="fig" rid="fig4"> 4</xref> below, the Spark UDF approach, which uses true parallel processing on the worker nodes, is 6.5 times more efficient than the Multithreading method. Although Multithreading is slower than Spark's parallelization techniques, it runs entirely on the driver node of the Spark cluster [
<xref ref-type="bibr" rid="R1">1</xref>,<xref ref-type="bibr" rid="R10">10</xref>] without utilizing the worker nodes' compute [
<xref ref-type="bibr" rid="R12">12</xref>] resources, while still achieving an execution time of 3.3 minutes. Therefore, while multithreading may not be the fastest option, it remains a strong contender due to its lower compute cost.</p>
<fig id="fig4">
<label>Figure 4</label>
<caption>
<p>Execution times for different REST API calls.</p>
</caption>
<graphic xlink:href="6049.fig.004" />
</fig></sec><sec id="sec6">
<title>Advantages and Limitations of using Spark for API Calls</title><title>6.1. Advantages of using Spark for API Calls</title><p>By utilizing Spark's parallel computing capabilities, we can address the problem by delegating the API calls to Spark's parallel workers. Additionally, the payload received in the response can be assembled using Spark's set-level abstractions, allowing it to be processed across multiple nodes instead of a single one. Spark UDFs not only transform sequential execution into parallel execution with minimal coding effort but also simplify the analysis and transformation of the returned results through an easier data abstraction model. Furthermore, we can configure the REST data source for varying levels of parallelization by adjusting the number of worker nodes or data partitions.</p>
<title>6.2. Limitations of using Spark for API Calls</title><p>Two major limitations of using Spark for API calls are the latency and rate limiting of the REST API itself. Latency becomes a concern as response times grow for large-scale API requests, and rate limiting requires handling API throttling and quotas. These parameters need to be considered when designing the level of parallelism in Spark. Additionally, Spark's initialization and resource allocation overhead may not be suitable for small-scale tasks, and cluster setup and maintenance costs should also be taken into account when using Spark for REST API calls.</p>
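<p>One common mitigation for rate limiting is to wrap each call in retry logic with exponential backoff, so that a throttled partition slows down rather than fails. A hedged sketch of the pattern (the make_request callable and the simulated 429 responses are illustrative; a production version would typically also honor the server's Retry-After header):</p>

```python
import time

def call_with_backoff(make_request, max_retries=5, base_delay=1.0):
    """Retry a request when the server signals throttling (HTTP 429),
    doubling the wait between attempts (exponential backoff)."""
    delay = base_delay
    for attempt in range(max_retries):
        status, body = make_request()
        if status != 429:       # not throttled: return the response
            return status, body
        time.sleep(delay)       # throttled: back off before retrying
        delay *= 2
    return status, body         # give up after max_retries attempts

# Simulated endpoint that throttles the first two calls, for illustration.
calls = {"n": 0}
def fake_request():
    calls["n"] += 1
    return (429, "") if calls["n"] <= 2 else (200, "ok")

print(call_with_backoff(fake_request, base_delay=0.01))  # (200, 'ok')
```

<p>Bounding the retries and delays this way keeps a parallel Spark job within the API provider's quota without serializing the whole workload.</p>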
</sec><sec id="sec7">
<title>Conclusion</title><p>As we have observed, Spark effectively parallelized the REST API calls and results processing across multiple cores and executors. This allowed us to parallelize a non-distributed task, achieving results 32 times faster than the serial approach. To further increase speed, we could simply add more nodes or increase the number of partitions. For tasks like API calls that cannot natively leverage Spark's distributed data processing, Spark UDFs provide an easy path to acceleration. If there are constraints on compute resources or cost, multithreading offers a reasonable alternative. However, Spark remains the preferred choice for scalable performance in a distributed environment.</p>
</sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      
<ref id="R1">
<label>[1]</label>
<mixed-citation publication-type="other">Apache Spark cluster-overview. [Online] Available: https://spark.apache.org/docs/latest/cluster-overview.html
</mixed-citation>
</ref>
<ref id="R2">
<label>[2]</label>
<mixed-citation publication-type="other">Neumann, Andy &#x00026; Laranjeiro, Nuno &#x00026; Bernardino, Jorge. (2021). An Analysis of Public REST Web Service APIs. IEEE Transactions on Services Computing. 14. 957-970. 10.1109/TSC.2018.2847344.
</mixed-citation>
</ref>
<ref id="R3">
<label>[3]</label>
<mixed-citation publication-type="other">Sahni, Ashima. (2024). A Comparative Analysis of Apache Spark Dataframes over Resilient Distributed Datasets (RDDs). INTERANTIONAL JOURNAL OF SCIENTIFIC RE-SEARCH IN ENGINEERING AND MANAGEMENT. 08. 1-9. 10.55041/IJSREM36566.
</mixed-citation>
</ref>
<ref id="R4">
<label>[4]</label>
<mixed-citation publication-type="other">Salim, H. P. (2025) "A Comparative Study of Delta Lake as a Preferred ETL and Analytics Database," International Journal of Computer Trends and Technology, 73(1), pp. 65-71. doi: 10.14445/22312803/IJCTT-V73I1P108.
</mixed-citation>
</ref>
<ref id="R5">
<label>[5]</label>
<mixed-citation publication-type="other">Databricks runtime. [Online] Available: https://docs.databricks.com/en/release-notes/runtime/15.3.html
</mixed-citation>
</ref>
<ref id="R6">
<label>[6]</label>
<mixed-citation publication-type="other">Tran, Quy &#x00026; Nguyen, Duc-Binh &#x00026; Nguyen, Linh &#x00026; Nguyen, Oanh. (2023). BIG DATA PROCESSING WITH APACHE SPARK. TRA VINH UNIVERSITY JOURNAL OF SCI-ENCE; ISSN: 2815-6072; E-ISSN: 2815-6099. 10.35382/tvujs.13.6.2023.2099.
</mixed-citation>
</ref>
<ref id="R7">
<label>[7]</label>
<mixed-citation publication-type="other">Spark Compute configuration. [Online] Available: https://docs.databricks.com/en/compute/configure.html
</mixed-citation>
</ref>
<ref id="R8">
<label>[8]</label>
<mixed-citation publication-type="other">Williams, Brad &#x00026; Tadlock, Justin &#x00026; Jacoby, John. (2020). REST API. 10.1002/9781119666981.ch12.
</mixed-citation>
</ref>
<ref id="R9">
<label>[9]</label>
<mixed-citation publication-type="other">Rest API. [Online] Available: https://blog.postman.com/rest-api-examples/
</mixed-citation>
</ref>
<ref id="R10">
<label>[10]</label>
<mixed-citation publication-type="other">Elliott, Ed. (2021). Understanding Apache Spark. 10.1007/978-1-4842-6992-3_1.
</mixed-citation>
</ref>
<ref id="R11">
<label>[11]</label>
<mixed-citation publication-type="other">Salim, H. P. (2025) "A Deep Learning Framework for High-Dimensional Data Analytics," International Journal of Innovative Research in Science, Engineering and Technology, 14(2). doi: 10.15680/IJIRSET.2025.1402010.
</mixed-citation>
</ref>
<ref id="R12">
<label>[12]</label>
<mixed-citation publication-type="other">Dessokey, Maha &#x00026; Saif, Sherif &#x00026; Salem, Sameh &#x00026; Saad, Elsayed &#x00026; Eldeeb, Entesar. (2020). Memory Management Approaches in Apache Spark: A Review. 10.1007/978-3-030-58669-0_36.
</mixed-citation>
</ref>
    </ref-list>
  </back>
</article>