Puzzles & Games with esProc: WordCount
Writing a word count program is the most common exercise for getting familiar with a distributed system, such as Hadoop. Here we’ll illustrate how we perform word count with esProc.
Let’s begin with the single threading.
D:\files\novel directory contains novels in text format. Task: Find the most frequently used words in the novels.
esProc handles the task in a long query:
A |
|
1 |
=directory@p(“D:/files/novel”).(file().read().words().groups(lower():Word;count(~):Count)).merge(Word).groups@o(Word;sum(Count):Count).sort@z(Count) |
It’s simple, isn’t it? Here’s A1’s result:
To make it easier to understand, we split the long query into multiple steps to execute:
A |
B |
C |
||
1 |
=directory@p(“D:/files/novel”) |
[] |
=now() |
|
2 |
for A1 |
=file(A2).read().words() |
||
3 |
=B2.groups(lower():Word:count():Count) |
|||
4 |
>B1=B1 |
[B3] |
||
5 |
=B1.merge(Word) |
=A5.groups@o(Word;sum(Count):Count).sort@z(Count) |
=interval@ms(C1,now()) |
A1 gets text files in the specified directory:
Line 2 ~ line 4: Count each word in each file. B2 reads in each text file and split it into individual words:
B3 count the number of each word in the current file. It converts words to lowercase to perform a uniform count and sorts the result set in alphabetical order:
After the count over all files is finished, B4 writes the results to B1. Below is B1’s result:
A5 performs a merge grouping over B1’s sequences by words:
5 performs sum over A5’s Count by words and sorts records by Count in descending order. The final result is the same as that of the single query:
C1 and C5 records the start time and the end time of the execution to estimate the time it takes to perform the count:
It’s fast!
Then let’s try the multithreading.
esProc script for performing parallel processing:
A |
B |
C |
|
1 |
=directory@p(“D:/files/novel”) |
=now() |
|
2 |
fork A1 |
=file(A2).read().words() |
|
3 |
=B2.groups(lower(~):Word;count(~):Count) |
||
4 |
=A2.merge(Word) |
=A4.groups@o(Word;sum(Count):Count).sort@z(Count) |
=interval@ms(C1,now()) |
The most obvious difference is that A2’s for statement is replaced by fork statement. C4 estimates the time it takes to perform the count:
It’s much faster (My computer is dual-core and the performance increase result is satisfactory).
The fork statement automatically switches to the iterative multithreaded processing, saving you the trouble of managing threads.
Let’s move on to solve the problem with cluster computing.
Here I simulate the server cluster using processes on a computer. In esProc\bin path under esProc’s installation directory, we can run esprocs.exe to start or configure servers:
Before starting a server by clicking on Start button, we click Config button to configure information for it, such as the servers’ IP addresse and port on the Unit tab, as shown below:
After configuration is done, we start the server from Unit Server window. Run esprocs.ext again to start another server until all servers are in place. The three servers correspond to the configured IP addresses and ports in order. Now a cluster is ready.
Since we simulate a cluster on a single machine, the parallel servers share same path. The paths are different if they are remote servers.
A | B | C | |
1 |
[192.168.10.229:4001,192.168.10.229:4004, 192.168.10.229:4007] |
[D:/files/novel1,D:/files/novel2, D:/files/novel3,D:/files/novel4] |
|
2 |
fork B1;A1 |
=directory@p(A2) |
|
3 |
fork B2 |
=file(B3).read().words() |
|
4 |
=C3.groups(lower():Word;count():Count) |
||
5 |
return B3.merge(Word) |
||
6 |
=A2.merge(Word) |
=A6.groups@o(Word;sum(Count):Count).sort@z(Count) |
The program uses four file paths as parameters to execute four subtasks that count the frequency of each word in a file in a certain path. Add the server addresses after fork and let esProc assign the subtasks to servers aggregate the results. Coders don’t need to think about these details. At last, B6 gets the final result:
We can view the task distribution and execution on server windows:
It’s simple, easy, efficient and cool. We have gotten this done when somebody else is busy setting up a Hadoop cluster.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL