
Notes on regex sed, bash DOM html, meta tag

Using bash tools (sed, awk, grep, vim, ...) to edit multiple files. Case study: HTML tags (<!DOCTYPE>, <html>, <meta>, charset, ...).

https://unix.stackexchange.com/questions/26284/how-can-i-use-sed-to-replace-a-multi-line-string
Sed cheatsheet
https://gist.github.com/asenchi/2291903

HTML meta tags (recommended way?)
https://www.quackit.com/html_5/tags/html_meta_tag.cfm


Example
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja">
<head>
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
<meta http-equiv="Content-Script-Type" content="text/javascript" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="description" content="content here" />
<title>hoge</title>
</head>
<body>
</body>
</html>

Expected:

<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="EUC-JP">
<title>hoge</title>
<meta name="description" content="content here">
</head>
<body>
</body>
</html>

1. <!DOCTYPE html ...>: the idea is to delete everything on the line after the match "DOCTYPE html" and append the closing ">".
$  sed -i 's/DOCTYPE html.*/DOCTYPE html>/' filename

For filename, I use a simple way to get the list of HTML files that contain DOCTYPE: find . -name "*.html" | xargs grep -li "DOCTYPE"
Then I use column editing (Vim or Sublime Text) to add the prefix sed ... to each file in the list.
A one-line solution (pipes, command chaining, etc.) should be possible, but it seemed to take more time to work out.
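The one-line approach mentioned above could look roughly like the sketch below. It is only an illustration, not the procedure actually used for these files: the function name fix_doctype_in_tree is made up, and it assumes GNU sed (for -i without a backup suffix).

```shell
# Hypothetical helper: rewrite the DOCTYPE line in every *.html file under a directory.
# Assumes GNU sed (sed -i with no backup suffix).
fix_doctype_in_tree() {
  dir=${1:-.}
  # The first -exec works as a filter: grep -q succeeds only for files that mention DOCTYPE.
  # The second -exec runs sed in place on the files that passed the filter.
  find "$dir" -type f -name '*.html' \
    -exec grep -qi 'DOCTYPE' {} \; \
    -exec sed -i 's/<!DOCTYPE html.*/<!DOCTYPE html>/' {} +
}
```

Usage: fix_doctype_in_tree /path/to/site (the path is a placeholder). Try it on a copy of the files first, since -i overwrites them.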

Special case:

-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

=> +<!DOCTYPE html>
Because the tag spans two lines, the sed regex above does not handle it: sed works line by line, so the second line is left behind. Fortunately there are only a few such files, so I fixed them manually. As with many other automation tasks, there are always exceptions, so be careful to double-check the input and output.
sed, awk and regex get used again and again. I also mentioned them in some previous posts (should be linked here).
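For reference, the two-line DOCTYPE could probably be handled by sed as well, using the N command to pull the next line into the pattern space. This is a sketch only: collapse_doctype is a hypothetical name, it reads stdin and writes stdout instead of editing in place, and it assumes the DOCTYPE spans at most two lines and that [^>] also matches the embedded newline (as it does in GNU sed).

```shell
# Sketch: collapse a DOCTYPE that sits on one line or is split over two lines.
collapse_doctype() {
  # If a DOCTYPE opens but has no closing '>' on the same line, N appends the next line.
  # The substitution then replaces everything from '<!DOCTYPE html' up to the first '>'.
  sed -e '/<!DOCTYPE html[^>]*$/N' \
      -e 's/<!DOCTYPE html[^>]*>/<!DOCTYPE html>/'
}
```

Usage: collapse_doctype < old.html > new.html (file names are placeholders).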


2. <html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja"> =>
<html lang="ja">

In this case, as with the previous line, I use sed to remove the part in front of the match (everything before lang=...). There seemed to be no good working answer on the Internet, so I worked around it:
- Append the keyword 'delete-me' (should be unique) to the first part of the line, before the match. We don't need this part, and it can be removed later.
- Prepend <html to the second part (starting at the match) on a new line, keeping the tail of the line.

$ sed -i 's/ lang=/delete-me\n<html&/g'  filename.html

& here stands for the whole matched text, i.e. the keyword ' lang=', so it is put back right after <html.
Then run sed again to remove the temporary dummy lines:
$ sed -i '/delete-me/d'  filename.html
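To see what the two steps do, they can be chained with a pipe on a single sample line. The sketch below also shows a possible one-step alternative using a capture group. Both function names are made up for illustration; the first assumes GNU sed (for \n in the replacement), and the second assumes the lang value is double-quoted.

```shell
# The two-step workaround, chained with a pipe instead of -i.
html_lang_two_step() {
  # Step 1: split the line in front of ' lang=' and mark the first half with 'delete-me'.
  # Step 2: drop the marked half.
  sed 's/ lang=/delete-me\n<html&/g' | sed '/delete-me/d'
}

# Alternative sketch: one substitution that keeps only the lang attribute.
html_lang_one_step() {
  sed 's/<html[^>]* lang="\([^"]*\)"[^>]*>/<html lang="\1">/'
}
```

Usage: html_lang_one_step < filename.html prints the converted file. xml:lang is not matched by either version, because the pattern requires a space before lang.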

3. Removing the CSS and JS meta tags is straightforward
$ sed -i '/meta http-equiv="Content-Style/d' filename

Replace Content-Style with Content-Script to remove the JS meta tag.
Note the keyword 'meta http-equiv': with a shorter keyword such as "Content-Style" alone, the pattern could also match, and delete, other lines such as a <script> tag.
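Both meta tags could also be removed in one pass with an extended-regex alternation. A minimal sketch, assuming a sed that supports -E (GNU and BSD sed both do); strip_type_meta is a hypothetical name, and it filters stdin to stdout rather than editing in place.

```shell
# Sketch: delete the Content-Style-Type and Content-Script-Type meta lines in one pass.
strip_type_meta() {
  # The '<meta http-equiv=' prefix keeps the pattern from touching <script> or <link> lines.
  sed -E '/<meta http-equiv="Content-(Style|Script)-Type"/d'
}
```

Usage: strip_type_meta < filename > filename.new (file names are placeholders).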

4. <meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
or
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />  (most common case).

=> <meta charset="EUC-JP">

This case is similar to the <html lang="ja"> conversion above.

$  sed -i 's/; charset=UTF/delete-me&\n<meta charset=\"UTF/g' filename

Because the keyword "charset=" can also appear in <script> or other tags:
<script ... charset="...">, we have to include '; ' in the keyword.

The most frequent case is content="...; charset=...".
I think I made a small mistake here: UTF is not needed in the keyword. When I first copied the command from another tutorial, the '&' was placed last in the replacement ('&/g'). '&' stands for the matched keyword itself, so its position decides whether the new text lands before or after the keyword.
So I will try to run it without UTF later. (But I think I am too lazy to try again :)

Anyway, it works.

We need another sed run.
$ sed -i 's/; charset=/delete-me&\n<meta charset="/g'  file2edit.html

This time it applies to every "; charset=" match, and the replacement supplies the opening quote of the new attribute value. Then run sed again to remove all dummy lines containing 'delete-me'.


At first I thought a single sed would be enough for this case, applied to the keyword '; charset=' or simply ' charset=' if we ignore <script> tags, since those appear only a few times and can be edited manually.
But the real reason a further sed run is required is this form:
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />

Here there is no content="...; charset=..."; note the missing '; '. charset is a separate attribute, so a pattern containing '; ' does not match this line, and we still have to remove the whole front part of the line before the keyword. This form needs its own run keyed on ' charset=' (with the leading space).
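Both forms of the Content-Type meta tag could probably be converted with a single substitution that captures the charset value. This is a sketch under assumptions, not the commands used above: meta_charset is a hypothetical name, it needs a sed with -E, the whole tag must sit on one line, and the charset name may only contain letters, digits, '_' and '-'.

```shell
# Sketch: turn either Content-Type form into <meta charset="...">.
meta_charset() {
  # "? makes the quotes around the value optional, so both
  #   content="text/html" charset="EUC-JP"   and   content="text/html; charset=UTF-8"
  # are matched. Anchoring on '<meta http-equiv="Content-Type"' leaves <script charset=...> alone.
  sed -E 's/<meta http-equiv="Content-Type"[^>]*charset="?([A-Za-z0-9_-]+)"?[^>]*>/<meta charset="\1">/'
}
```

Usage: meta_charset < filename > filename.new (file names are placeholders).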



Sed change multiple files
https://stackoverflow.com/questions/10445934/change-multiple-files

Diff two text files and get only the lines that differ. The idea is to sort both files first and then diff them.
https://stackoverflow.com/questions/10708300/compare-two-files-ignoring-order
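The sort-then-compare idea can be sketched as follows. Note that this version uses comm -3 rather than diff to print only the lines that differ, which is my choice for the illustration; diff_unordered is a made-up name.

```shell
# Sketch: print the lines that appear in only one of two files, ignoring line order.
diff_unordered() {
  a_sorted=$(mktemp) && b_sorted=$(mktemp) || return 1
  sort "$1" > "$a_sorted"
  sort "$2" > "$b_sorted"
  # comm -3 hides lines common to both files.
  # Lines found only in the second file are indented with a tab.
  comm -3 "$a_sorted" "$b_sorted"
  rm -f "$a_sorted" "$b_sorted"
}
```

Usage: diff_unordered list-before.txt list-after.txt (file names are placeholders).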
