Big Data with the MapReduce Algorithm
Yahoo's first product was its Search Engine (SE), the stepping stone of Yahoo's success. Alas! Life without competitors is a boring life, and the world might still be crowded with unicellular organisms. Yes, the competition for food, for light and for reproduction drove, and still drives, natural evolution. We humans are probably the last stage of this fierce and furious evolution... or maybe not? Who knows.
Yahoo's fiercest and most furious competitor is Google. Google's Search Engine was, and still is, so fast and so detailed that Yahoo's star quickly faded and keeps fading away. Sad, but that's life. There ain't no such thing as a free lunch (see the No Free Lunch theorem). It's the survival of the fittest.
How did Google make its Search Engine so fast? Well, Google came late to the SE market. It quickly learned how to avoid the mistakes of Yahoo's SE and how to improve the search process. It's obvious that nobody loves to wait for anything or anyone as the seconds tick by... Google's SE could only surpass Yahoo's SE and become the SE of choice if it proved itself better and faster. And it did. To this day Google's SE dominates the SE market. What's the secret of Google's success? An "age-old" algorithm that Google modernized: the MapReduce Algorithm (MRA).
Why age-old? And what's MRA precisely? Before I go into details let me tell you a legend. It is said that Caesar ran his army on an MRA-like strategy (compacted into three words: Veni, vidi, vici, or Divide et Impera) and was very victorious: a vast empire of 4.4 million km². The Mediterranean was literally Rome's inland lake. That is exactly what MRA is: Divide et Impera. Or Divide and Conquer in plain English.
The Web's data are Big Data (BD). Big Data are usually unpredictable, absolutely wishy-washy. They must be divided and conquered if one wants to work with them. From a bad or misspelled formulation a Search Engine has to extract the maximum and most relevant results from a hodge-podge of gibberish words, images, sound and whatever else. To do that, Big Data must be divided and then gradually conquered.
How? Talking is easy. Doing is harder. Yes, it's hard to implement MRA in an OOPL. Java? C++? Or the so-called "rapid-development language" Python? Whatever OOPL you intend to use, you have to understand the MRA structure and its functional flow.
The MRA structure is based on four stages which define the functional flow: Sorting, Mapping, Reducing and Joining. The input for an MRA search is, as mentioned above, a formulation in words that could be grammatically bad or misspelled (e.g. see instead of sea). Problems arise when the misspelling is ambiguous (e.g. sick or seek?).
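The four stages can be sketched on a single machine with a simple term count. Real MapReduce frameworks distribute the Mapping and Reducing stages over many nodes; this is only an illustration of the flow, and the class and method names are my own, not part of any framework:

```java
import java.util.*;
import java.util.stream.*;

// A minimal, single-machine sketch of the four MRA stages, illustrated
// with a term count over a tiny corpus. Real frameworks (e.g. Hadoop)
// run the map and reduce stages distributed over many nodes.
public class MraSketch {

    // Mapping: each document emits (term, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String document) {
        return Arrays.stream(document.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty())
                .map(t -> Map.entry(t, 1))
                .collect(Collectors.toList());
    }

    // Sorting/shuffling: group all emitted pairs by term, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Reducing: collapse each term's list of counts into one total.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((term, ones) ->
                counts.put(term, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // Joining: merge the emitted pairs of all documents, then shuffle and reduce.
    static Map<String, Integer> run(List<String> documents) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        documents.forEach(d -> all.addAll(map(d)));
        return reduce(shuffle(all));
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Mediterranean sea of Rome",
                                    "Rome and the Mediterranean");
        // prints {and=1, mediterranean=2, of=1, rome=2, sea=1, the=1}
        System.out.println(run(docs));
    }
}
```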
The toughest job is the parsing (getting rid of the superfluous) and the connecting-the-dots (fixing the misspellings). Example:
Mediteranean see of Roma's ampire
The formulation is bad and misspelled. But Google corrected it and suggested:
Showing results for Mediterranean sea of Rome's empire
Search instead for Mediteranean see of Roma's ampire
The superfluous parts are usually the human-language "appendices" like the possessive 's for "of". The correction is the connecting-the-dots, which usually relies on a linguistic repository and some basic linguistic rules of the language in use (English, Vietnamese, French, German, etc.):
- Mediteranean ----> Mediter[r]anean << the repository supplies the missing r
- see ----> the verb to see. Connecting-the-dots: Mediterranean is either an adjective or a geographic name. Neither an adjective nor a geographic name can see (basic linguistic rules). So the connecting-the-dots says: sea instead of see. Similarly Rome (English) instead of Roma (Italian).
The result is now perfect:
Mediterranean sea of Rome's empire
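The repository lookup behind the connecting-the-dots can be sketched with a plain edit-distance check: the repository word closest to the misspelled input wins. The tiny repository and the correct() helper below are hypothetical illustrations, not Google's actual spelling corrector:

```java
// A sketch of a repository-based spelling correction: pick the repository
// word with the smallest Levenshtein (edit) distance to the input word.
// The repository contents here are a made-up illustration.
public class SpellSketch {

    // Classic dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Return the closest repository word (the input itself if none is closer).
    static String correct(String word, String[] repository) {
        String best = word;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : repository) {
            int dist = distance(word, candidate);
            if (dist < bestDistance) {
                bestDistance = dist;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] repository = {"mediterranean", "sea", "rome", "empire"};
        System.out.println(correct("mediteranean", repository)); // mediterranean
        System.out.println(correct("ampire", repository));       // empire
    }
}
```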
Assume that the first hurdle is overcome. The next step is to parse the formulation into some meaningful keys: Mediterranean, sea, Rome, empire. To facilitate the search, the keys are usually converted to lower case: mediterranean, sea, rome, empire. Each of the four key(word)s is called, in MRA terminology, an MRA Term or T.
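The parsing into lower-case terms can be sketched like this; the stop-word list is a made-up minimal example, not any search engine's actual list:

```java
import java.util.*;

// A tiny sketch of turning the corrected formulation into lower-case MRA
// Terms: split on non-letters, lowercase, drop superfluous stop words.
// The stop-word set is a hypothetical minimal example.
public class TermParser {
    static final Set<String> STOP_WORDS = Set.of("of", "the", "a", "an", "s");

    static List<String> parse(String formulation) {
        List<String> terms = new ArrayList<>();
        for (String token : formulation.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [mediterranean, sea, rome, empire]
        System.out.println(parse("Mediterranean sea of Rome's empire"));
    }
}
```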
The next step is to map the terms from BD, which could be some webpages, some blogs, some PDF documents, etc. The mapping key is the title or name of each webpage, blog, document, or "whatever-readable thing". The webpages, blogs, etc. are simplified as Documents. All the terms that occur repeatedly within each map are counted, and each count is dubbed the TermFrequency (TF). For example, the TF of the term "google" in this Kipalog document is 9 (9 occurrences from the top down to this line). With the TF one can then count in how many documents of BD each T occurs. This count is called the DocumentFrequency or DF. Example: assume that our BD consists of 5 webpages (or documents) and only this Kipalog document contains the term "google" (with TF = 9): then DF = 1. If a term exists in no document at all, then TF = 0 and DF = 0.
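The TF and DF bookkeeping can be sketched like this (a minimal single-machine illustration; the corpus and terms are made up):

```java
import java.util.*;

// A sketch of TermFrequency (TF) and DocumentFrequency (DF) following the
// definitions above: TF counts a term's occurrences inside one document,
// DF counts how many documents of BD contain the term at all.
public class Frequencies {

    // TF of every term in one document (given as a list of its terms).
    static Map<String, Integer> termFrequency(List<String> documentTerms) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : documentTerms) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    // DF of one term across the whole corpus.
    static int documentFrequency(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) df++;
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("rome", "empire", "rome"),   // document 1
            List.of("sea", "empire"),            // document 2
            List.of("mediterranean"));           // document 3

        System.out.println(termFrequency(corpus.get(0)).get("rome"));   // 2
        System.out.println(documentFrequency("empire", corpus));        // 2
        System.out.println(documentFrequency("google", corpus));        // 0
    }
}
```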
With TF and DF we can reduce the weight of terms that occur in many documents using the natural logarithm (ln). The result is called the Inverse Document Frequency or IDF. The next step is to compute the "DocumentByTerm", the product of TF and IDF, and from that the CosineSimilarity.
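The reduce step can be sketched as follows. The article only says "natural logarithm", so I assume the common formula IDF = ln(N / DF), with N the total number of documents, and DocumentByTerm = TF x IDF, the classic TF-IDF weight:

```java
// A sketch of IDF and DocumentByTerm, assuming IDF = ln(N / DF) and
// DocumentByTerm = TF * IDF (the classic TF-IDF weight). The exact
// formula used by any real search engine is not given in the article.
public class TfIdf {

    static double idf(int totalDocuments, int documentFrequency) {
        // A term occurring in no document would divide by zero; give it weight 0.
        if (documentFrequency == 0) return 0.0;
        return Math.log((double) totalDocuments / documentFrequency);
    }

    static double documentByTerm(int tf, int totalDocuments, int df) {
        return tf * idf(totalDocuments, df);
    }

    public static void main(String[] args) {
        // Numbers from the example further down: 4 documents,
        // term_1 occurs in 2 of them and 5 times in document_1.
        System.out.printf("IDF   = %.4f%n", idf(4, 2));               // 0.6931
        System.out.printf("TFIDF = %.4f%n", documentByTerm(5, 4, 2)); // 3.4657
    }
}
```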
What's CosineSimilarity? It's this: two vectors point in the same (similar) direction if the angle between them is zero (cosine(0) = 1.0). An example with four documents and five terms:
- document_1 contains terms: 5x term_1, 2x term_3, 1x term_4, 4x term_5
- document_2 contains terms: 4x term_2, 1x term_3
- document_3 contains terms: 4x term_1, 6x term_2, 3x term_5
- document_4 does NOT contain any term (all zeros)
The TermFrequency[document_1][term_1] = 5, [document_1][term_2] = 0,
[document_1][term_3] = 2, [document_1][term_4] = 1, [document_1][term_5] = 4
The DocumentFrequency counts, for each term, in how many of the 4 documents it occurs:
DocumentFrequency[term_1] = 2, [term_2] = 2, [term_3] = 2, [term_4] = 1, [term_5] = 2
InverseDocumentFrequency[term_1] = ln(4 / 2) = 0.6931, ...
DocumentByTerm[document_1][term_1] = TF x IDF = 5 x 0.6931 = 3.4657, ...
Maximum number of documents in which any single term appears: DFmax = 2
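Under the same assumptions (IDF = ln(N / DF), DocumentByTerm = TF x IDF), the CosineSimilarity of the example documents can be sketched as follows; document_4 stays the zero vector and gets similarity 0 by convention:

```java
// A sketch of the CosineSimilarity step applied to the example above.
// Each document becomes a vector of its five TF-IDF weights, assuming
// IDF = ln(N / DF) and DocumentByTerm = TF * IDF.
public class CosineExample {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0; // zero vector: no similarity
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // TF-IDF weight vector of one document.
    static double[] weights(int[] tf, int[] df, int totalDocs) {
        double[] w = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            w[i] = df[i] == 0 ? 0.0 : tf[i] * Math.log((double) totalDocs / df[i]);
        }
        return w;
    }

    public static void main(String[] args) {
        int[] df = {2, 2, 2, 1, 2};            // DF of term_1 .. term_5
        int[][] tf = {
            {5, 0, 2, 1, 4},                   // document_1
            {0, 4, 1, 0, 0},                   // document_2
            {4, 6, 0, 0, 3},                   // document_3
            {0, 0, 0, 0, 0}                    // document_4
        };
        int n = tf.length;
        double[] d1 = weights(tf[0], df, n);
        double[] d3 = weights(tf[2], df, n);
        System.out.printf("cos(d1, d3) = %.4f%n", cosine(d1, d3));               // 0.5853
        System.out.printf("cos(d1, d4) = %.4f%n", cosine(d1, weights(tf[3], df, n))); // 0.0000
    }
}
```

Documents 1 and 3 share term_1 and term_5, so their similarity is well above zero, while the empty document_4 is similar to nothing.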
If you download MRA.java and TestMRA.java, compile them, and create 4 (or more) files (or documents) that contain exactly the given numbers of "Term_x", then you'll get this result: