Big Data and Data Scientist

Hi

Participants of Vietnamese forums are somehow hot on the topic "Machine Learning", and some of them even talk about ML with Big Data (example HERE on the DNH forum) as if they were the supreme ML-BD experts. That makes the youngsters avid for it. For example, this DNH youngster posted a question about a career as a Data Scientist (HERE). Lots of experts gave him advice...without giving him the true picture of ML-BD. Or take this very amusing blog "Engineer và Scientist" written by Thinh Tran. Why do the youngsters love such gaudy, buzzword-laden titles? I don't know. The reason? Maybe it's because there are lots of unanswered questions.

UNANSWERED?

Yes, unanswered, because none of the questioners or advisors has really dealt with Big Data or truly worked with Machine Learning. Theory is easy to palaver about. But doing is another, dirtier matter. Dirty because no one likes to "hack" and to brood over unidentified patterns, or to identify himself as a "blue-collar" worker. Andrew Ng, a shrewd American who grew up partly in Singapore, makes tons of money out of this trendy run on ML. Asians are somehow addicted to his "remote" ML teaching (probably because Ng "is Asian"). And that is good for him, but bad for a lot of wanna-be ML theorists. They become theoretical Remote-ML scientists and can start to "talk" about "supervised and unsupervised ML" without knowing how to implement it. Well, they are "remote-ML" theorists, not blue-collar workers. Too bad. No one dares to contradict them. It's high-tech. Their killer argument is always the same:

"Do you even know a damn thing about AI, that you keep pontificating like a saint all day long?"

To be clear: BD stands for Big Data, ML for Machine Learning and DM for Data Mining. They are the most discussed buzzwords of all, and as fuzzy as the position of the electron in Werner Heisenberg's quantum physics. So-or-so or not-so-but-so: both are correct. The reason is simple. Too few people really work with BD and DM, and fewer still really work with ML. Those who attended Andrew Ng's course become the one-eyed men in the land of the BD-DM-ML blind. They are the one-eyed "kings".

Now what? So: what is Big Data? Before I start to discuss BD with you, I would like to ask you two things:

  1. What is big?
  2. Why is it big?

Laozi, the author of the Tao Te Ching, asked rhetorically "What is big and what is small? What is beautiful and what is ugly?". So how could one define (or draw) the boundary of Big Data, when and where data are considered "big"? Because, as said, BD is the most unregulated, rawest information for mere mortals like us. The information we absorb daily is Big Data. Unregulated, hodge-podge and abundant: images (with your eyes), sounds (with your ears), abstract impressions (felt through your skin or your emotions), information (taken in voluntarily or involuntarily from the press, TV, videos, etc.) and exceptions (caused by external sources).

So: Big Data are something relative, and nearly impossible to categorize or to organize. A Data Scientist (sounds much nobler and more distinguished than "Data Miner") is the one who does the dirty work by burying himself in the mountain of hodge-podge data. The dirty work always starts with "reading" (or collecting) the data, goes on with finding patterns, grouping and eliminating, and finally ends with an acceptable presentation. Up to this line, BD may still be very vague and unclear for some readers. Well, then let me tell you about a project where I helped a young Vietnamese civil engineer who is very intelligent and very much an autodidact. He taught himself JavaFX, the latest and most complex desktop UI toolkit from Sun (now Oracle). He tried to develop a tool to display the works of his company, which are all stored as SVG files and are really very big (tens of MB).

SVG - Scalable Vector Graphics - sounds very harmless and quite innocent. But it's about Big Data. It's about creating images without using a "camera" to take a picture. It isn't an algorithm to reduce the size of an image. In short: it's a technique to describe an image using plain text and numbers with XML syntax and rules. MORE: click HERE. An SVG file can be just a few lines like this

<svg xmlns="http://www.w3.org/2000/svg"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <circle cx="40" cy="40" r="24" style="stroke:#006600; fill:#00cc00"/>
</svg>

And it displays a Green Circle (HERE)

Or it could reach a size of several MB. This Simpson SVG file (HERE), for example, has a size of 22 KB, this Tux.svg (Linux) has 304 KB, and this Arctic_big.svg has a size of 1.5 MB (HERE).

Working with such a mass of diverse sizes is a nightmare for those who are impatient. Big Data is a synonym for patience and industriousness, because the formats are so freely defined and so versatile that it is almost impossible to categorize the data or to press them into any casting mold. W3-SVG school just gives its audience an idea of the rules by which SVG is created and built. The creativity is yours. So, let us start to toil with the SVG Big Data. The file circle.svg just gives you an idea of how a circle is drawn and how it gets its color:

  • Keyword Circle
  • Circle coordinate cx/cy
  • Radius r
  • Style (stroke color and the filling color)

If you carefully read the SVG doc you'll soon be aware that there is no limit on the number of shapes (circle, rectangle, polyline, etc.) or on the combinations of animations (gliding, rotation, etc.) and clips. Hence, SVG Big Data are really a heap of unregulated, unformatted and unorganized data.
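To make the four ingredients of the circle element listed above more tangible, here is a minimal sketch of how they could be pulled out of one tag. The class name SvgAttributes and its parse method are my own hypothetical helpers, not part of SVGLoader, and the sketch assumes a single well-formed tag whose values are in double quotes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from SVGLoader): collects name="value" pairs
// from ONE well-formed start tag such as <circle .../>.
// Assumption: attribute values are enclosed in double quotes.
final class SvgAttributes {

    static Map<String, String> parse(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int i = tag.indexOf(' ');                  // skip the keyword, e.g. "<circle"
        while (i >= 0 && i < tag.length()) {
            int eq = tag.indexOf('=', i);
            if (eq < 0) break;                     // no further attribute
            String name = tag.substring(i, eq).trim();
            int open = tag.indexOf('"', eq);
            int close = open < 0 ? -1 : tag.indexOf('"', open + 1);
            if (close < 0) break;                  // malformed tag: stop quietly
            attrs.put(name, tag.substring(open + 1, close));
            i = close + 1;                         // continue behind the closing quote
        }
        return attrs;
    }

    public static void main(String[] args) {
        String circle = "<circle cx=\"40\" cy=\"40\" r=\"24\" "
                      + "style=\"stroke:#006600; fill:#00cc00\"/>";
        Map<String, String> a = parse(circle);
        double cx = Double.parseDouble(a.get("cx"));   // circle coordinate
        double cy = Double.parseDouble(a.get("cy"));
        double r  = Double.parseDouble(a.get("r"));    // radius
        System.out.println("center=(" + cx + "," + cy + ") r=" + r
                         + " style=" + a.get("style"));
    }
}
```

It works on Strings for readability only; for really big files one would rather stay on the raw bytes.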

And yet, if you try to develop a tool to gather the SVG data and display them as they were designed (without using any existing browser), you'll run into a maze of Big Data. Of course, the young civil engineer ran into the BD maze. He first tried an existing SVG toolkit (Batik, found on GitHub) and the result was discouraging: it took too much time to display the structural design of a building (~10 MB). Unacceptable for the boss, who is usually impatient and never has time. He posted a "generic" request on the DNH forum and that drew my interest. I asked him if he could start an SVG project, and offered to assist him. He agreed. Some weeks later he proudly announced "Showing off: the SVGLoader project is done" and boasted that his SVGLoader could compete with Google's Chrome. Far better than Microsoft Internet Explorer.

Big Data is the mass, Data Mining is the dirty work. To get some grams of gold nuggets (Data Mining) one has to shovel tons of soil (Big Data). The nuggets gained only become more valuable when they are refined (Data Science). SVG files are Big Data, the nuggets are the patterns which can be grouped or identified (Data Mining), and the presentation (in a browser or with JavaFX) is the work of a Data Scientist, or developer. A combination of 3 hard-to-comprehend fields: Big Data, Data Mining and Data Science.

The patterns are, for example, the keywords (circle, path, polygon, etc.) and the groups are usually the repetitive items (X/Y coordinates, radius, colors, stroke, width, height, etc.). Data Mining, the dirty work, starts with the reading (shoveling the soil). A file (stream) is read byte-wise in C/C++ and Java. For a file of some KB that works fine. For a big file it could take some seconds to fill the buffer. It's like using a spade to move a ton of soil. And that is intolerable. So a (Data) Miner has to find the most efficient tool (e.g. an excavator). Java NIO FileChannel is the excavator. Then comes the patient work: identifying the common parts and developing their corresponding methods. When applied Machine Learning (with Fuzzy Logic) is well implemented, one can achieve the best results in pattern recognition and identification. An example: to validate a string, a "normal" programmer usually uses switch-case or an if-else-if chain:

if (string.equals("<circle")) {
   ...
} else if (string.equals("<polygon")) {
   ...
} else ...
  ...

Fuzzy Logic is vague and imprecise, but if it is used wisely it can work wonders. We know that SVG has some distinct keywords and that keywords are unique. A full string compare takes more time than a check of 2 or 3 distinguishing characters.

if (string.charAt(0) == '<' && string.charAt(1) == 'c') { // taken as <circle (assumes no other element starting with 'c', e.g. <clipPath, occurs)
   ...
} else if (string.length() > 5 && string.charAt(0) == '<' && string.charAt(1) == 'p' && string.charAt(5) == 'g') { // it must be a <polygon and NOT <polyline or <path
   ...
} else ...
   ....

Note: Big Data are already "big", hence one should avoid creating extra garbage (i.e. Strings: a String in Java is immutable, so every modification creates a new object). Instead of a String one can work directly with the content, which is usually a big array of bytes (instead of "string.charAt(n)", "bytes[n]" can be used directly).
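The reading step and the few-characters check described above can be put together in a small sketch. It is only an illustration under my own assumptions: the names SvgBytes, readAll, classify and countKinds are hypothetical, the file is assumed to fit into one Java array (below 2 GB), and the input is assumed to contain no other element that starts with "<c".

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: read an SVG file through a FileChannel and
// recognize element keywords on the raw bytes, without creating Strings.
final class SvgBytes {

    enum Kind { CIRCLE, POLYGON, POLYLINE, PATH, OTHER }

    // Reads the whole file into one byte array (assumes size < 2 GB).
    static byte[] readAll(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0) {
                // keep reading until the buffer is full or EOF is reached
            }
            return buf.array();
        }
    }

    // Decides the element kind from 2 or 3 distinguishing bytes at offset i.
    static Kind classify(byte[] d, int i) {
        if (i < 0 || i + 5 >= d.length || d[i] != '<') return Kind.OTHER;
        if (d[i + 1] == 'c') return Kind.CIRCLE;          // assumption: no <clipPath>
        if (d[i + 1] == 'p') {
            if (d[i + 5] == 'g') return Kind.POLYGON;     // <polyGon
            if (d[i + 5] == 'l') return Kind.POLYLINE;    // <polyLine
            if (d[i + 2] == 'a' && d[i + 4] == 'h') return Kind.PATH;
        }
        return Kind.OTHER;
    }

    // Counts every tag start by kind; index = Kind.ordinal().
    static int[] countKinds(byte[] d) {
        int[] counts = new int[Kind.values().length];
        for (int i = 0; i < d.length; i++) {
            if (d[i] == '<') counts[classify(d, i).ordinal()]++;
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java SvgBytes <file.svg>");
            return;
        }
        int[] c = countKinds(readAll(Paths.get(args[0])));
        for (Kind k : Kind.values()) System.out.println(k + ": " + c[k.ordinal()]);
    }
}
```

For files that do not fit into memory, a memory-mapped buffer (FileChannel.map) or chunk-wise reading would be the next thing to try; the sketch keeps to the simplest case.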

Using ML means working with a repository to "memorize" the things that have been learned. For example, the big Arctic_big.svg file consists of numerous "<polyline " entries. It's a waste of time and work if these entries are "forgotten" after use and then must be calculated anew. With applied ML, Data Mining becomes more efficient and manageable by learning and memorizing the "past". Of course, one could use any DB to do the repository job. However, the question is: does "that DB" suit all my Big-Data/Data-Mining requirements?
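The simplest form of such a "memory" is an in-memory cache. The sketch below is only my illustration of the memorizing idea, not the repository used in SVGLoader: the class PointsCache is hypothetical, it keys on the raw "points" attribute text, and it uses a String key for brevity although a byte-based key would create less garbage.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a "repository": a polyline points list that has
// been parsed once is remembered and never parsed again.
final class PointsCache {
    private final Map<String, double[]> cache = new HashMap<>();
    private int misses = 0;                // how often real parsing was needed

    // Returns the coordinates of a points attribute such as "0,0 10,5 20,0".
    // The returned array is shared between callers and must not be modified.
    double[] points(String raw) {
        double[] known = cache.get(raw);
        if (known != null) return known;   // remembered from the "past"

        misses++;
        String trimmed = raw.trim();
        double[] parsed;
        if (trimmed.isEmpty()) {
            parsed = new double[0];
        } else {
            String[] parts = trimmed.split("[\\s,]+");
            parsed = new double[parts.length];
            for (int k = 0; k < parts.length; k++) {
                parsed[k] = Double.parseDouble(parts[k]);
            }
        }
        cache.put(raw, parsed);
        return parsed;
    }

    int misses() { return misses; }

    public static void main(String[] args) {
        PointsCache repo = new PointsCache();
        repo.points("0,0 10,5 20,0");
        repo.points("0,0 10,5 20,0");      // second request is served from the cache
        System.out.println("parsed " + repo.misses() + " time(s)");
    }
}
```

Whether a plain HashMap, a file or a real DB is the right repository depends on how much has to be remembered and for how long.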

Data Science is then applied to create a presentable, acceptable frame for the users. The frame has to work efficiently with all the little parts assembled during the Data-Mining phase. It is not as simple as some easy-going youngsters tend to think: that becoming a Data Scientist is only a question of knowing some algorithms and theories by heart. It takes real thinking and hard work. As said previously, SVG data are big, unregulated and hodge-podge. Thousands of groups can easily be nested (or embedded). For a Data Science theorist, the "recursive" technique could be the solution. Yes, in theory, and in reality it works, too. But at what cost, if the nesting goes unlimitedly deep? Every computer would bog down and finally succumb. A real Data Scientist usually looks for some other solution to bypass such an unforeseeable stack bottleneck. Yes, recursion is the key. But recursion needs the "stack" to save the return environment. How could one replace a "software recursion" with some "virtual recursion"? Example:

With Software Recursion (SR)

    ...
    private int createSVG(Pane root, int idx) throws Exception {
      ArrayList<Future<Node>> fLst = new ArrayList<Future<Node>>();   
      ...
      while (true) {
        // next SVGObject
        I = nextSVGObj(idx);
        if (I == null) break;
        ...
        // <g .... </g> or <svg .... </svg>
        if (b1 == 'g' || (b1 == 's' && b2 == 'v')) {
           ...
           idx = createSVG(view, idx); //  Recursive here.
           ...
        }
      }
      ...
      return idx;
    }
    ...

It works superbly, but only as long as the recursion depth stays below a few hundred levels. It starts to bog down at thousands of levels, and if the nesting gets deep enough Java finally gives up with a StackOverflowError.

Developing a Virtual Recursion (VR) is a question of using programming technique to achieve the same recursive effect without descending into the depth of real recursion. Example: every Java developer knows what ArrayList< ? > is and how it works. If one knows how to exploit its features, one can easily "simulate" the SR without plunging into the depth of the unknown. The most useful features are insert and remove. Both together can be used as a stack (push & pop): add(0, value) = push, remove(0) = pop.

    ...
    private int createSVG(Pane root, int idx) throws Exception {
      ArrayList<Future<Node>> fLst = new ArrayList<Future<Node>>();
      // our own "stack" 
      ArrayList<Pane> stack = new ArrayList<Pane>();
      ...
      while (true) {
        // next SVGObject
        I = nextSVGObj(idx);
        if (I == null) break;
        ...
        // <g .... </g> or <svg .... </svg>
        if (b1 == 'g' || (b1 == 's' && b2 == 'v')) {
           ...
           // Virtual Recursive
           stack.add(0, view); // save the old Pane: LIFO (Last In First Out)
           view = new Pane();  // create a new Pane
           // create a link between parents and child
           stack.get(0).getChildren( ).add(view);
           ... // do the work with the child
        } else // </svg> or </g>
          if (b1 == '/' && (b2 == 's' || b2 == 'g')) {
            ...
            if (--cnt < 0) break;
            view = stack.remove(0); // back to the upper level
            ...
          }

      }
      ...
      return idx;
    }
    ...

The result is amazing: the virtual recursion depth is now limited only by what an ArrayList can hold (in practice, by the available heap memory) and no longer by the thread stack. And in this project it ran clearly faster than SR.
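To show the effect in isolation, here is a small self-contained sketch that walks through deeply nested groups with its own stack and never calls itself. It is not the SVGLoader code: the class NestingDepth and its methods are hypothetical, the input is an artificial chain of plain <g> groups, and, as a variation of the add(0, value)/remove(0) pair above, it pushes and pops at the end of the list so that no elements have to be shifted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "virtual recursion": nested <g> groups are
// followed with an explicit list-based stack instead of method calls.
final class NestingDepth {

    // Returns the deepest nesting level of <g> ... </g> groups.
    static int maxDepth(byte[] d) {
        List<Integer> stack = new ArrayList<>();   // offsets of the currently open <g>
        int max = 0;
        for (int i = 0; i + 2 < d.length; i++) {
            if (d[i] != '<') continue;
            if (d[i + 1] == 'g' && (d[i + 2] == '>' || d[i + 2] == ' ')) {
                stack.add(i);                      // push: go one level down
                max = Math.max(max, stack.size());
            } else if (d[i + 1] == '/' && d[i + 2] == 'g'
                    && i + 3 < d.length && d[i + 3] == '>'
                    && !stack.isEmpty()) {
                stack.remove(stack.size() - 1);    // pop: back to the upper level
            }
        }
        return max;
    }

    // Builds an artificial document with the given number of nested groups.
    static byte[] nested(int levels) {
        StringBuilder sb = new StringBuilder(levels * 7);
        for (int k = 0; k < levels; k++) sb.append("<g>");
        for (int k = 0; k < levels; k++) sb.append("</g>");
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // A depth at which a naive recursive method would be at risk
        // of a StackOverflowError with default JVM settings.
        System.out.println(maxDepth(nested(200_000)));
    }
}
```

In a real loader the stack would hold the parent Pane of each level instead of an offset, exactly as in the listing above.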

You may wonder why I talk about ArrayList and not about "Stack", right? Well, the answer is performance. Stack extends Vector, and Vector's methods are synchronized, so every push and pop pays for a lock that a single-threaded loader does not need. ArrayList has no such overhead, so ArrayList with add/remove is faster than Stack with push/pop. One caveat: add(0, value) and remove(0) shift all stored elements, so for very deep nesting it is cheaper to push with add(value) and pop with remove(size() - 1); the Java documentation itself recommends a Deque such as ArrayDeque in preference to Stack. A Data Scientist is not only a theorist, but also a practitioner who dares to do the dirty work of shoveling the dirt.

The young Vietnamese civil engineer was so proud of his work that he claimed a victory over Microsoft Internet Explorer: it needs several seconds to display a 10 MB SVG file, while his "SVGLoader" crunches the same file in 1.4 seconds. A difference like heaven and hell. And he was damned right.

"Big Data - Data Mining - Data Science" and "Machine Learning" exist only in practice, not in some weird theory or in any obfuscated algorithm.


Joe 03-05-2018
