# Big Data and Data Scientist (Big Data - Data Mining - Data Science 2)

Joe, written on 04/05/2018

Hi

Vietnamese forum participants are somehow hot on the topic of "Machine Learning", and some of them even talk about ML with Big Data (example HERE on the DNH forum) as if they were supreme ML-BD experts. That makes the youngsters avid for it. For example, this DNH youngster posted a question about a career as a Data Scientist (HERE). Lots of experts gave him advice...without giving him the true vision of ML-BD. Or take this very amusing blog "Engineer và Scientist" written by Thinh Tran. Why do the youngsters love such gaudy, buzzword-pregnant titles? I don't know. The reason? Maybe it's because there are lots of unanswered questions.

Yes, unanswered, because none of the questioners or advisors has really dealt with Big Data or truly worked with Machine Learning. Theory is easy to palaver about. But doing is another, dirty topic. Dirty because no one likes to "hack", to brood over unidentified patterns, or to identify himself as a "blue-collar" worker. Andrew Ng, a shrewd Americanized Singaporean, makes tons of money out of this trendy run for ML. Asians are somehow addicted to his "remote" ML teaching (probably because Ng "is Asian"). And that is good for him, bad for a lot of wannabe ML theorists. They become theoretical remote-ML scientists and can start to "talk" about "supervised and unsupervised ML" without knowing how to implement it. Well, they are "remote-ML" theorists, not blue-collar workers. Too bad. No one dares to contradict them. It's high-tech. Their killer argument is always the same:

Do you even know a damn thing about AI, that you keep pontificating like a saint all day long?

As said, BD stands for Big Data, ML for Machine Learning and DM for Data Mining. They are the most-discussed awesome buzzwords of all. Unsharp like the electron in Werner Heisenberg's Quantum Physics. So-or-so, or not-so-but-so: both are correct. The reason is simple. Too few people really work with BD and DM, and yet fewer really work with ML. Those who attended Andrew Ng's courses become the one-eyed men in the land of the BD-DM-ML blind. They are the one-eyed "kings".

Now what? So: what is Big Data? Before I start discussing BD with you, I would like to ask you two things:

1. What is big?
2. Why is it big?

Laozi, the creator of the Tao Te Ching, asked rhetorically: "What is big and what is small? What is beautiful and what is ugly?" So, how could one define (or draw) the boundary of Big Data, i.e. decide when and where data are considered "big"? Because, as said, BD is the most unregulated, rawest information for mere mortals like us. The information we absorb daily is Big Data. Unregulated, hodge-podge and abundant: images (with your eyes), sounds (with your ears), abstractness (feeling with your skin or your emotions), information (voluntary and involuntary, from the press, TV, videos, etc.) and exceptions (caused by external sources).

So: Big Data is something relative, but nearly impossible to categorize or to organize. A Data Scientist (which sounds much nobler and more distinguished than "Data Miner") is the one who does the dirty work by burying himself in the mountain of hodge-podge data. The dirty work always starts with "reading" (or collecting) the data, then patternization, grouping and eliminating, and finally the work ends with an acceptable presentation. Up to this line, BD is for some readers probably still very vague and unclear. Well, then let me tell you about a piece of work where I helped a young Vietnamese civil engineer who is very intelligent and very autodidactic. He taught himself JavaFX, the latest and most complex desktop framework from SUN (now ORACLE). He tried to develop a tool to display the works of his company, which are all SVG files, and really very big (up to two-digit MB sizes).

SVG - Scalable Vector Graphics - sounds very harmless and quite innocent. But it's about Big Data. It's about creating images without using a "camera" to take a picture. It isn't an algorithm to reduce the size of an image. In short: it's the technique of describing an image using plain text and numbers, with XML syntax and rules. MORE: click HERE. An SVG file can be just a few lines, like this:

<svg xmlns="http://www.w3.org/2000/svg">
<circle cx="40" cy="40" r="24" style="stroke:#006600; fill:#00cc00"/>
</svg>


And it displays a green circle (HERE).

Or it could reach a size of several MB. This Simpson SVG file (HERE) has, for example, a size of 22 KB. Or this Tux.svg (Linux) with 304 KB.

And this Arctic_big.svg has a size of 1.5 MB (HERE).

Working with such a mass of diverse sizes is a nightmare for the impatient. Big Data is a synonym for patience and industriousness. The formats are so freely defined and so versatile that it is almost impossible to categorize or compact the data into any casting mold. The W3-SVG school just gives its audience an idea and the rules of how SVG is created and built. The creativeness is yours. So, let's start to toil with the SVG Big Data. The file circle.svg above just gives you the idea of how a circle is drawn and how it gets its color:

• Keyword circle
• Center coordinates cx/cy and radius r
• Style (the stroke color and the filling color)

If you read the SVG doc carefully, you'll soon be aware that there's no limit on the number of shapes (circle, rectangle, polyline, etc.), nor on the combinations of animations (gliding, rotation, etc.) or clips. Hence, SVG Big Data is really a heap of unregulated, unformatted and unorganized data.
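On top of that, shapes and groups can nest: a `<g>` group may contain other `<g>` groups, arbitrarily deep. A hypothetical fragment (reusing the circle from above, with made-up transforms) could look like this:

```xml
<svg xmlns="http://www.w3.org/2000/svg">
  <g transform="translate(10,10)">
    <g transform="rotate(45)">
      <circle cx="40" cy="40" r="24" style="stroke:#006600; fill:#00cc00"/>
    </g>
  </g>
</svg>
```

Real-world files nest such groups hundreds or thousands of levels deep, and we will meet that problem again below.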

And yet, if you try to develop a tool that gathers the SVG data and displays it as it was designed (without using any existing browser), you'll run into a maze of Big Data. Of course, the young civil engineer ran into the BD maze. He first tried an existing SVG tool (Apache Batik, on GitHub) and the result was discouraging: it took far too much time to display a structural design of a building (~10 MB). Unacceptable for the boss, who is usually impatient and never has time. He posted a "generic" request on the DNH forum and that drew my interest. I asked him if he could start an SVG project and I'd assist him. He agreed. Some weeks later he proudly announced the completion of his SVGLoader project and coquetted that his SVGLoader could compete against Google's Chrome. Far better than Microsoft Internet Explorer.

Big Data is the mass, Data Mining is the dirty work. To get a few grams of gold nuggets (Data Mining) one has to shuffle tons of soil (Big Data). The gained nuggets only become more valuable when they are refined (Data Science). The SVG files are the Big Data, the nuggets are the patterns which can be grouped or identified (Data Mining), and the presentation (in a browser or with JavaFX) is the work of a Data Scientist - or developer. A combination of three hard-to-comprehend fields: Big Data, Data Mining and Data Science.

The patterns are, for example, the keywords (circle, path, polygon, etc.) and the groups are usually the repetitive parts (X/Y coordinates, radius, colors, stroke, width, height, etc.). Data Mining, the dirty work, starts with the reading (shuffling the soil). A file (stream) is read byte-wise in C/C++ and JAVA. For a file of some KB that works fine. For a big file it could take some seconds to fill the buffer. It's like using a spade to move a ton of soil. And that is intolerable. So a (Data) Miner has to find the most efficient tool (i.e. to use a caterpillar instead of a spade). Java's NIO FileChannel is the caterpillar.

Then comes the patient work: identifying the common parts and developing their corresponding methods. If applied Machine Learning (with Fuzzy Logic) is well implemented, one can achieve the best results in pattern recognition and identification. An example: when a "normal" programmer validates a string, he usually uses a switch-case to evaluate the value, or an if-else-if cascade:

if (string.equals("<circle")) {
    ...
} else if (string.equals("<polygon")) {
    ...
} else ...
    ...


Fuzzy Logic is vague and imprecise, but used wisely it can achieve small wonders. We know that SVG has some distinct keywords, and keywords are unique. A full string comparison takes more time than the verification of 2 or 3 distinguishing tokens:

if (string.charAt(0) == '<' && string.charAt(1) == 'c') { // it must be a circle
    ...
} else if (string.charAt(0) == '<' && string.charAt(1) == 'p' && string.charAt(5) == 'g') { // it must be a <polygon, NOT <polyline or <path
    ...
} else ...
    ...


Note: Big Data is already "big", hence one should avoid creating even more garbage (i.e. Strings; a String is in JAVA a final object, meaning immutable). Instead of Strings one can work directly with the content, which is usually one big array of bytes ("bytes[x]" can be used directly instead of "string.charAt(n)").
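As a sketch (the class and method names here are my own, not from the SVGLoader project), the "caterpillar" read and the byte-wise token tests could look like this:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: FileChannel pulls the whole SVG file into one
// byte array, and the token checks then work directly on the bytes,
// so no throwaway String objects are created.
public class ByteMiner {
    // read the complete file content into a byte array in one go
    static byte[] slurp(Path svg) throws Exception {
        try (FileChannel ch = FileChannel.open(svg, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0) ;
            return buf.array();
        }
    }
    // true if the tag starting at off looks like "<circle"
    static boolean isCircle(byte[] b, int off) {
        return b[off] == '<' && b[off + 1] == 'c';
    }
    // true if the tag starting at off is "<polygon":
    // b[off + 5] == 'g' rules out <polyline and <path
    static boolean isPolygon(byte[] b, int off) {
        return b[off] == '<' && b[off + 1] == 'p' && b[off + 5] == 'g';
    }
}
```

The fuzzy trick is the same as above, only cheaper: two or three byte comparisons against the raw buffer instead of a String allocation plus equals().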

Using ML means working with a repository to "memorize" the things it has learned. For example, the big Arctic_big.svg file consists of numerous "<polyline " entries. It's a waste of time and work if these entries were "forgotten" after use and had to be recalculated anew. With applied ML, Data Mining becomes more efficient and manageable by learning and memorizing the "past". Of course, one could use any DB for the repository job. However, the question is: does "that DB" suit all my Big-Data/Data-Mining requirements?
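A minimal sketch of such a repository, assuming a plain in-memory HashMap instead of a DB (all names here are hypothetical, not from the actual project):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "memory": once a pattern (e.g. a
// points="..." attribute of a <polyline>) has been parsed, the result
// is kept in a repository so it never has to be recalculated.
public class PatternMemory {
    private final Map<String, double[]> repo = new HashMap<>();
    private int misses = 0; // counts real parses, for illustration

    // return the parsed coordinates, computing them only on first sight
    double[] coordsOf(String points) {
        return repo.computeIfAbsent(points, p -> { misses++; return parse(p); });
    }

    int parseCount() { return misses; }

    // split "x1,y1 x2,y2 ..." into a flat array of doubles
    private double[] parse(String points) {
        String[] t = points.trim().split("[,\\s]+");
        double[] d = new double[t.length];
        for (int i = 0; i < t.length; i++) d[i] = Double.parseDouble(t[i]);
        return d;
    }
}
```

Every repeated "<polyline " entry then costs one hash lookup instead of a full re-parse, which is exactly the "learning and memorizing the past" effect described above.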

Data Science is then applied to create a presentable, acceptable frame for the users. The frame has to work efficiently with all the little parts assembled during the Data-Mining phase. It is not as simple as some easy-going youngsters tend to think: that becoming a Data Scientist is only a question of knowing some algorithms and theories by heart. It's real thinking and hard work. As said previously, SVG data is big, unregulated, hodge-podged. Thousands of groups can be easily nested (or embedded). For a Data Science theorist the "recursive" technique could be the solution. Yes, in theory, and in reality it works, too. But at what cost, if the nesting levels go down unlimitedly deep? Any computer would bog down and finally succumb. A real Data Scientist usually looks for some other solution to bypass such an unforeseeable stack bottleneck. Yes, recursiveness is the key. But recursion needs a "stack" to save the return environment. How could one replace the "software recursiveness" with some "virtual recursiveness"? Example:

With Software Recursive (SR)

    ...
    private int createSVG(Pane root, int idx) throws Exception {
        ArrayList<Future<Node>> fLst = new ArrayList<Future<Node>>();
        ...
        while (true) {
            // next SVGObject
            I = nextSVGObj(idx);
            if (I == null) break;
            ...
            // <g .... </g> or <svg .... </svg>
            if (b1 == 'g' || (b1 == 's' && b2 == 'v')) {
                ...
                idx = createSVG(view, idx); // Recursive call here
                ...
            }
        }
        ...
        return idx;
    }
...


It works superbly...as long as the recursion depth stays below a few hundred levels. Beyond thousands of recursion levels it starts to bog down.
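You can see the limit for yourself. This hypothetical mini-demo (my own class, not SVGLoader code) recurses until the JVM call stack gives up, which is exactly what happens when an SVG nests thousands of group levels:

```java
// Hypothetical demo: plain recursion costs one JVM stack frame per
// nesting level, and the stack is finite.
public class DeepDive {
    static int dive(int depth) {
        if (depth == 0) return 0;
        return 1 + dive(depth - 1); // one stack frame per nesting level
    }
    // returns the reached depth, or -1 if the stack overflowed
    static int survives(int depth) {
        try { return dive(depth); } catch (StackOverflowError e) { return -1; }
    }
}
```

With the default JVM stack size, a few hundred levels are harmless, while millions of levels reliably end in a StackOverflowError.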

Developing a Virtual Recursiveness (VR) is a question of using programming technique to achieve the same recursive effect without descending into the depths of real recursion. Example: every JAVA developer knows what ArrayList<?> is and how it works. If one knows how to exploit its features, one can easily "simulate" the SR without plunging into the unknown deep. The most useful features are insert and remove. Both can be used as a Stack (push & pop): add(0, value) = push, remove(0) = pop.

    ...
    private int createSVG(Pane root, int idx) throws Exception {
        ArrayList<Future<Node>> fLst = new ArrayList<Future<Node>>();
        // our own "stack"
        ArrayList<Pane> stack = new ArrayList<Pane>();
        ...
        while (true) {
            // next SVGObject
            I = nextSVGObj(idx);
            if (I == null) break;
            ...
            // <g .... </g> or <svg .... </svg>
            if (b1 == 'g' || (b1 == 's' && b2 == 'v')) {
                ...
                // Virtual Recursion
                stack.add(0, view); // save the old Pane: LIFO (Last In First Out)
                view = new Pane();  // create a new Pane
                // create a link between parent and child
                ... // do the work with the child
            } else if (b1 == '/' && (b2 == 's' || b2 == 'g')) { // </svg> or </g>
                ...
                if (--cnt < 0) break;
                view = stack.remove(0); // back to the upper level
                ...
            }
        }
        ...
        return idx;
    }
...


The result is amazing: the virtual recursion depth is now limited only by the Java ArrayList itself. And it works absolutely faster than SR.

You may wonder why I don't talk about using "Stack" but ArrayList, right? Well, the answer is performance. ArrayList is a direct implementation of AbstractList, while Stack is a grandchild of it (Stack extends Vector, whose methods are synchronized) - a layer deeper than ArrayList, plus a lock on every access. The outcome is clear: ArrayList with add/remove is much faster than Stack with push/pop. A Data Scientist is not only a theorist, but also a practitioner who dares to do the dirty work by shuffling the dirty dirt.
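For illustration, the add(0, e)/remove(0) pairing can be wrapped in a tiny helper (a sketch of mine, not SVGLoader code) to show that it preserves LIFO order just like a Stack:

```java
import java.util.ArrayList;

// Sketch: ArrayList used as a LIFO stack via add(0, e) and remove(0),
// exactly the pairing used in the Virtual Recursion code above.
public class ListStack<T> {
    private final ArrayList<T> list = new ArrayList<>();
    void push(T e) { list.add(0, e); } // newest element sits at index 0
    T pop() { return list.remove(0); } // Last In, First Out
    boolean isEmpty() { return list.isEmpty(); }
}
```

Note that add(0, e) shifts the whole backing array on every push; pushing at the end instead (add(e) and remove(list.size() - 1)), or using java.util.ArrayDeque, gives the same LIFO behavior without the shifting, if raw speed matters.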

The young Vietnamese civil engineer was so proud of his work that he claimed victory over Microsoft Internet Explorer: to display a 10 MB SVG file IE needs roughly some seconds, while his "SVGLoader" crunches the same file in 1.4 seconds. A world between heaven and hell. And he was damned right.

"Big Data - Data Mining - Data Science" and "Machine Learning" exist only in practice, not in some weird theory or in any obfuscated algorithm.

Joe 03-05-2018
