Spaced Out Linguism, Python Text Analysis

I have always been interested in linguistics, from a lot of perspectives. For example, I always loved that Tolkien created a fully functional language for his books. Recently I started to play around with ML stuff, pandas etc. While doing so I had the idea of creating a note organizer based on ML techniques. I am not very good at organizing all the data I gather, and pen testing in particular produces a lot of data, which is even more valuable when it is easily accessible and organized. So one day I started Note Bin.

It was surprisingly easy to use cool features such as text analysis with spaCy, an NLP library for Python. You can basically throw in any text and it tells you about the contents: it tokenizes the text and gives back positional data, classifies entities such as city names, and does lemmatization, which I just learned means reducing a word to its base form so that all forms of it are grouped together. That is of course important for categorizing text, and it saves time during analysis.

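import spacy
nlp = spacy.load("de_core_news_sm")  # assumption: the post does not say which German model was used
text = '''Liebe und Gastronomie in Barcelona - eine bittersüße Romanze, die einem Spaziergang der Sinne gleicht.\n'''  # stands in for the input text read from a file
doc = nlp(text)
entities = [(ent, ent.label_, ent.label) for ent in doc.ents]  # span, label name, label id
print(entities)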

[(Liebe und Gastronomie, 'MISC', 7654241940133152407), (Barcelona, 'LOC', 385)]
# one tuple per token: text, lemma, POS, tag, dependency, shape, is_alpha, is_stop
tokens = [(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop) for token in doc]
print(tokens)
[('Liebe', 'lieben', 'NOUN', 'NN', 'ROOT', 'Xxxxx', True, False), ('und', 'und', 'CCONJ', 'KON', 'cd', 'xxx', True,
True), ('Gastronomie', 'Gastronomie', 'NOUN', 'NN', 'cj', 'Xxxxx', True, False), ('in', 'in', 'ADP', 'APPR', 'mnr',
'xx', True, True), ('Barcelona', 'Barcelona', 'PROPN', 'NE', 'nk', 'Xxxxx', True, False), ('-', '-', 'PUNCT', '$(',
'punct', '-', False, False), ('eine', 'einen', 'DET', 'ART', 'nk', 'xxxx', True, True), ('bittersüße', 'bittersüß',
'ADJ', 'ADJA', 'nk', 'xxxx', True, False), ('Romanze', 'Romanze', 'NOUN', 'NN', 'app', 'Xxxxx', True, False), (',',
',', 'PUNCT', '$,', 'punct', ',', False, False), ('die', 'der', 'PRON', 'PRELS', 'sb', 'xxx', True, True), ('einem',
'einer', 'DET', 'ART', 'nk', 'xxxx', True, True), ('Spaziergang', 'Spaziergang', 'NOUN', 'NN', 'da', 'Xxxxx', True,
False), ('der', 'der', 'DET', 'ART', 'nk', 'xxx', True, True), ('Sinne', 'Sinn', 'NOUN', 'NN', 'ag', 'Xxxxx', True,
False), ('gleicht', 'gleichen', 'VERB', 'VVFIN', 'rc', 'xxxx', True, False), ('.', '.', 'PUNCT', '$.', 'punct', '.'
, False, False), ('\n', '\n', 'SPACE', '_SP', 'ROOT', '\n', False, False)]
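
To make the "group all forms of a word" idea from above a bit more concrete, here is a minimal sketch of how the lemmas could be used for that. It does not call spaCy at all. It just takes a few rows copied by hand from the token output above (text, lemma, is_alpha, is_stop) and groups the surface forms under their lemma. The helper name `group_by_lemma` and its `keep_stop_words` parameter are made up for this example, and this is only one way such grouping might be done in a note organizer, not how Note Bin actually does it.

```python
from collections import defaultdict

# A few (text, lemma, is_alpha, is_stop) rows copied from the token output above.
rows = [
    ("Liebe", "lieben", True, False),
    ("und", "und", True, True),
    ("Gastronomie", "Gastronomie", True, False),
    ("die", "der", True, True),
    ("der", "der", True, True),
    ("Sinne", "Sinn", True, False),
    ("gleicht", "gleichen", True, False),
    (".", ".", False, False),
]


def group_by_lemma(rows, keep_stop_words=False):
    """Map each lemma to the list of surface forms that reduce to it.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for text, lemma, is_alpha, is_stop in rows:
        if not is_alpha:
            continue  # skip punctuation and whitespace tokens
        if is_stop and not keep_stop_words:
            continue  # skip stop words unless explicitly requested
        groups[lemma].append(text)
    return dict(groups)


content_groups = group_by_lemma(rows)
all_groups = group_by_lemma(rows, keep_stop_words=True)

print(content_groups)
print(all_groups["der"])
```

With stop words kept, 'die' and 'der' both end up under the lemma 'der', which is exactly the kind of grouping lemmatization is good for. Without them, only the content words remain, which is probably closer to what one would want for categorizing notes.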

Looking at those code blocks, I need to implement code highlights and a cleaner look for such blocks.

I started with Kaggle some months ago, and it is a very nice way to get into ML with Python. There you get online Jupyter notebooks with learning content, and the community keeps adding new ones. There are also all sorts of data files to use for learning and so on. If you want to jump into ML, try it…
