How to Split a Large File into Small Files in Parallel in Python

To split a large file into small files in parallel in Python, use the joblib.Parallel() function. The joblib module has a Parallel() function that can be used for parallelizing operations. A serial approach reads a small file quickly enough, but when the file is large, reading and splitting it in a single process takes a long time. Handling that large file with a parallel approach can finish the job faster than the serial one.

Python program for splitting a large file into small files in parallel

The joblib module is not a built-in Python module. To work with the joblib module in Python, install it using pip.

Type the following command to install the joblib module.

python3 -m pip install joblib
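
To confirm the installation succeeded, you can import the module and print its version (the exact version string will depend on what pip installed):

import joblib

print(joblib.__version__)  # e.g. 1.3.2 — the version will vary by installation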

Now, you can use the Parallel, delayed, and cpu_count functions of the joblib module.

from joblib import Parallel, delayed, cpu_count


def parallel_file():
    # Read all the lines of the source file into memory.
    with open("sample.txt", "r") as ip:
        data = ip.readlines()

    chunk_size = 3000  # maximum number of lines per output file
    file_no = 1
    for i in range(0, len(data), chunk_size):
        # readlines() keeps each line's trailing newline, so the
        # slice can be written out directly.
        with open("output" + str(file_no) + ".txt", "w") as op:
            op.writelines(data[i:i + chunk_size])
        file_no += 1


number_of_cpu = cpu_count()
# delayed(parallel_file)() records the call so Parallel can schedule it.
delayed_funcs = [delayed(parallel_file)()]
parallel_pool = Parallel(n_jobs=number_of_cpu)
parallel_pool(delayed_funcs)

In this program, we imported the joblib package. This package has two main functions called Parallel() and delayed(). The Parallel() function runs the work in parallel, and the delayed() function wraps a function call so that Parallel() can schedule it instead of executing it immediately.
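
To see how these two functions fit together, independent of the file-splitting program, here is a minimal sketch that squares numbers in parallel (the square function is just an illustration):

from joblib import Parallel, delayed


def square(x):
    return x * x


# delayed(square)(i) records the call; Parallel() then executes
# all the recorded calls across worker processes.
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]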

To find the CPU count in Python, use the cpu_count() function. The cpu_count() function returns the number of CPU cores on the machine, which is how many jobs can actually run at the same time. Then we called the delayed() function on the parallel_file function.
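
You can check the core count on your own machine with a one-liner (the printed number will of course depend on your hardware):

from joblib import cpu_count

print(cpu_count())  # e.g. 8 on a machine with eight cores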

The parallel_file() is a user-defined Python function. Inside the parallel_file() function, we opened a file called sample.txt in read mode. After that, we read all the lines of the file with readlines() and stored them in the data variable.

Then we assigned the chunk size as 3000. This is the maximum number of lines each small output file will contain, not the size of the file being split. Next, we open an output file and write lines into it. Once the chunk size is reached, that file is closed, the file number is incremented, and the next chunk of lines is written to a new file. For example, a 7000-line sample.txt produces output1.txt and output2.txt with 3000 lines each, and output3.txt with the remaining 1000 lines.

This process repeats until all the content in the data variable has been written out to the small files.
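
Note that as written, the entire split runs inside a single delayed call, so only one worker actually does the work. Since each chunk is independent, you could instead dispatch one delayed job per chunk. Here is a minimal sketch of that idea; the write_chunk helper and its signature are assumptions for illustration, not part of the original program:

from joblib import Parallel, delayed, cpu_count


def write_chunk(file_no, lines):
    # Each job writes one independent chunk to its own output file.
    with open("output" + str(file_no) + ".txt", "w") as op:
        op.writelines(lines)


with open("sample.txt", "r") as ip:
    data = ip.readlines()

chunk_size = 3000
jobs = [
    delayed(write_chunk)(n + 1, data[i:i + chunk_size])
    for n, i in enumerate(range(0, len(data), chunk_size))
]
Parallel(n_jobs=cpu_count())(jobs)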

The Parallel() function has a parameter called n_jobs. We can assign a value to this n_jobs parameter; it is the maximum number of worker processes that run concurrently to complete the task.

We have assigned the CPU count to the n_jobs parameter in this program, so joblib starts as many worker processes as the machine has CPU cores.
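
As an aside, joblib also accepts n_jobs=-1 as a shorthand for "use all available cores", which makes the explicit cpu_count() call optional:

parallel_pool = Parallel(n_jobs=-1)  # -1 means use all available CPU cores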

That’s it for this tutorial.
