" Writing a word count program is the most common exercise for getting familiar with a distribut .."

pjf RaqForum 22 No.
1 Reply • 648 View • 5 Years ago

Puzzles & Games with esProc: WordCount

Writing a word count program is the most common exercise for getting familiar with a distributed system, such as Hadoop. Here we’ll illustrate how we perform word count with esProc.

Let’s begin with the single threading.

D:\files\novel directory contains novels in text format. Task: Find the most frequently used words in the novels.

undefined

esProc handles the task in a long query:

	A
1	=directory@p(“D:/files/novel”).(file().read().words().groups(lower():Word;count(~):Count)).merge(Word).groups@o(Word;sum(Count):Count).sort@z(Count)

It’s simple, isn’t it? Here’s A1’s result:

undefined

To make it easier to understand, we split the long query into multiple steps to execute:

	A	B	C
1	=directory@p(“D:/files/novel”)	[]	=now()
2	for A1	=file(A2).read().words()
3		=B2.groups(lower():Word:count():Count)
4		>B1=B1	[B3]
5	=B1.merge(Word)	=A5.groups@o(Word;sum(Count):Count).sort@z(Count)	=interval@ms(C1,now())

A1 gets text files in the specified directory:

undefined

Line 2 ~ line 4: Count each word in each file. B2 reads in each text file and split it into individual words:

undefined

B3 count the number of each word in the current file. It converts words to lowercase to perform a uniform count and sorts the result set in alphabetical order:

undefined

After the count over all files is finished, B4 writes the results to B1. Below is B1’s result:

undefined

A5 performs a merge grouping over B1’s sequences by words:

undefined

5 performs sum over A5’s Count by words and sorts records by Count in descending order. The final result is the same as that of the single query:

undefined

C1 and C5 records the start time and the end time of the execution to estimate the time it takes to perform the count:

It’s fast!

Then let’s try the multithreading.

esProc script for performing parallel processing:

	A	B	C
1	=directory@p(“D:/files/novel”)		=now()
2	fork A1	=file(A2).read().words()
3		=B2.groups(lower(~):Word;count(~):Count)
4	=A2.merge(Word)	=A4.groups@o(Word;sum(Count):Count).sort@z(Count)	=interval@ms(C1,now())

The most obvious difference is that A2’s for statement is replaced by fork statement. C4 estimates the time it takes to perform the count:

It’s much faster (My computer is dual-core and the performance increase result is satisfactory).

The fork statement automatically switches to the iterative multithreaded processing, saving you the trouble of managing threads.

Let’s move on to solve the problem with cluster computing.

Here I simulate the server cluster using processes on a computer. In esProc\bin path under esProc’s installation directory, we can run esprocs.exe to start or configure servers:

undefined

Before starting a server by clicking on Start button, we click Config button to configure information for it, such as the servers’ IP addresse and port on the Unit tab, as shown below:

undefined

After configuration is done, we start the server from Unit Server window. Run esprocs.ext again to start another server until all servers are in place. The three servers correspond to the configured IP addresses and ports in order. Now a cluster is ready.

Since we simulate a cluster on a single machine, the parallel servers share same path. The paths are different if they are remote servers.

	A	B	C
1	[192.168.10.229:4001,192.168.10.229:4004, 192.168.10.229:4007]	[D:/files/novel1,D:/files/novel2, D:/files/novel3,D:/files/novel4]
2	fork B1;A1	=directory@p(A2)
3		fork B2	=file(B3).read().words()
4			=C3.groups(lower():Word;count():Count)
5		return B3.merge(Word)
6	=A2.merge(Word)	=A6.groups@o(Word;sum(Count):Count).sort@z(Count)

The program uses four file paths as parameters to execute four subtasks that count the frequency of each word in a file in a certain path. Add the server addresses after fork and let esProc assign the subtasks to servers aggregate the results. Coders don’t need to think about these details. At last, B6 gets the final result:

undefined