{"id":1150,"date":"2021-08-01T15:14:00","date_gmt":"2021-08-01T13:14:00","guid":{"rendered":"https:\/\/unordnung.net\/misc\/?p=1150"},"modified":"2021-08-05T15:26:45","modified_gmt":"2021-08-05T13:26:45","slug":"spaced-out-linguism-python-text-analysis","status":"publish","type":"post","link":"https:\/\/unordnung.net\/misc\/2021\/08\/spaced-out-linguism-python-text-analysis\/","title":{"rendered":"Spaced Out Linguism, Python Text Analysis"},"content":{"rendered":"\n<p>I have always been interested in linguistics, from many perspectives. For example, I always loved that Tolkien created a fully functional language for his books. Recently I started to play around with ML stuff, pandas, etc. While doing so, I had the idea of creating a note organizer based on ML techniques. I am not very good at organizing all the data I gather, and pen testing in particular produces a lot of data, which is even more valuable when it is easily accessible and organized. So one day I started <a href=\"https:\/\/github.com\/bunthut\/notebin\" data-type=\"URL\" data-id=\"https:\/\/github.com\/bunthut\/notebin\">Note Bin<\/a>.<\/p>\n\n\n\n<p>It was surprisingly easy to use cool features such as text analysis with <a href=\"https:\/\/github.com\/bunthut\/notebin\" data-type=\"URL\" data-id=\"https:\/\/github.com\/bunthut\/notebin\">spaCy<\/a>, an NLP library for text processing in Python. You can basically just throw in any text and it gives you different kinds of information about the contents: it tokenizes the text and returns positional data, classifies entities such as city names, and performs lemmatization, which I just learned means reducing a word to its base form so that all of its inflected forms can be grouped together. 
This is of course important for categorizing text and saving time during analysis.<\/p>\n\n\n\n<pre class=\"wp-block-code\" style=\"font-size:8px\"><code lang=\"python\" class=\"language-python line-numbers\">'''Liebe und Gastronomie in Barcelona - eine bitters\u00fc\u00dfe Romanze, die einem Spaziergang der Sinne gleicht.''' # input text from file<\/code><\/pre>\n\n\n\n<pre id=\"block-b78f11a5-81fd-4d3d-854b-d7c04d9c396e\" class=\"wp-block-code\" style=\"font-size:8px\"><code lang=\"python\" class=\"language-python line-numbers\">entities=[(i, i.label_, i.label) for i in doc.ents]<br>print(entities)<br><br>[(Liebe und Gastronomie, 'MISC', 7654241940133152407), (Barcelona, 'LOC', 385)]<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\" style=\"font-size:8px\"><code lang=\"python\" class=\"language-python line-numbers\">tokens=[(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop) for token in doc]\nprint(tokens)<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\" style=\"font-size:8px\"><code lang=\"python\" class=\"language-python line-numbers\">[('Liebe', 'lieben', 'NOUN', 'NN', 'ROOT', 'Xxxxx', True, False), ('und', 'und', 'CCONJ', 'KON', 'cd', 'xxx', True,\nTrue), ('Gastronomie', 'Gastronomie', 'NOUN', 'NN', 'cj', 'Xxxxx', True, False), ('in', 'in', 'ADP', 'APPR', 'mnr',\n'xx', True, True), ('Barcelona', 'Barcelona', 'PROPN', 'NE', 'nk', 'Xxxxx', True, False), ('-', '-', 'PUNCT', '$(',\n'punct', '-', False, False), ('eine', 'einen', 'DET', 'ART', 'nk', 'xxxx', True, True), ('bitters\u00fc\u00dfe', 'bitters\u00fc\u00df',\n'ADJ', 'ADJA', 'nk', 'xxxx', True, False), ('Romanze', 'Romanze', 'NOUN', 'NN', 'app', 'Xxxxx', True, False), (',',\n',', 'PUNCT', '$,', 'punct', ',', False, False), ('die', 'der', 'PRON', 'PRELS', 'sb', 'xxx', True, True), ('einem',\n'einer', 'DET', 'ART', 'nk', 'xxxx', True, True), ('Spaziergang', 'Spaziergang', 'NOUN', 'NN', 'da', 'Xxxxx', True,\nFalse), ('der', 'der', 'DET', 'ART', 'nk', 
'xxx', True, True), ('Sinne', 'Sinn', 'NOUN', 'NN', 'ag', 'Xxxxx', True,\nFalse), ('gleicht', 'gleichen', 'VERB', 'VVFIN', 'rc', 'xxxx', True, False), ('.', '.', 'PUNCT', '$.', 'punct', '.'\n, False, False), ('\\n', '\\n', 'SPACE', '_SP', 'ROOT', '\\n', False, False)]\n<\/code><\/pre>\n\n\n\n<p>Looking at those code blocks, I need to implement syntax highlighting and a cleaner look for them.<\/p>\n\n\n\n<p>I started some months ago with <a href=\"http:\/\/kaggle.com\/\" data-type=\"URL\" data-id=\"kaggle.com\/\">Kaggle<\/a>, which is a very nice way to get into ML with Python. There you get online Jupyter notebooks with learning content, and the community keeps adding new ones. There are also all sorts of data files to use for learning. If you want to jump into ML, try it&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have always been interested in linguistics, from many perspectives. For example, I always loved that Tolkien created a fully functional language for his books. Recently I started to play around with ML stuff, pandas, etc. 
While doing so, I had the idea of creating a note organizer based on ML &#8230; <a title=\"Spaced Out Linguism, Python Text Analysis\" class=\"read-more\" href=\"https:\/\/unordnung.net\/misc\/2021\/08\/spaced-out-linguism-python-text-analysis\/\">Read more<span class=\"screen-reader-text\">Spaced Out Linguism, Python Text Analysis<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[128,36,127,129,59],"class_list":["post-1150","post","type-post","status-publish","format-standard","hentry","category-fachinformatiker","tag-coding","tag-informatik","tag-ml","tag-projects","tag-python"],"_links":{"self":[{"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/posts\/1150","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/comments?post=1150"}],"version-history":[{"count":4,"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/posts\/1150\/revisions"}],"predecessor-version":[{"id":1156,"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/posts\/1150\/revisions\/1156"}],"wp:attachment":[{"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/media?parent=1150"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/categories?post=1150"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/unordnung.net\/misc\/wp-json\/wp\/v2\/tags?post=1150"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}